Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Adaptive set-based tests for pathway analysis
(USC Thesis Other)
Adaptive set-based tests for pathway analysis
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
i
ADAPTIVE SET-BASED TESTS FOR PATHWAY ANALYSIS
by
(Ray) Yu-Chen Su
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(STATISTICAL GENETICS)
May 2015
Copyright 2014 Yu-Chen Su
ii
Contents
List of Tables iv
List of Figures v
Preface vi
Chapter 1. Introduction 1
1.1 Introduction 1
Chapter 2. Existing Methods and Implementation 7
2.1 Likelihood Ratio Test and Permutations 7
2.2 Joint Modeling Approaches 9
2.2.1 LASSO Regression 9
2.2.2 Global Test of Random Effects (GMRE) 10
2.2.3 Hotelling’s T
2
12
2.2.4 Principal Component Analysis (PCA) 14
2.2.5 Partial Least Squares (PLS) 15
2.2.6 Random Forest 16
2.3 Marginal P-values Approaches 18
2.3.1 Minimum P-value (MinP) 19
2.3.2 Higher-Criticism (HC) 19
2.3.3 Flat Threshold (FT) 20
2.3.4 Adaptive Rank Truncation Product (ARTP) 20
Chapter 3. Novel Self-Contained Set-Based Methods 23
3.1 Introduction 23
3.1.1 Stepwise Regression on Principal Components (Step PC) 23
3.1.2 MinP on Principal Components (MinP PC) 23
3.1.3 Adaptive Rank Truncation Product on Principal Components (ARTP
PC) 24
3.1.4 Adaptive Rank Truncation Product on Partial Least Squares
Components (ARTP PLS) 24
3.1.5 Cluster Regression (CR) 24
iii
3.1.6 Cluster Regression – 1 degrees of freedom test (CR-1df) 25
Chapter 4. Simulation Study 26
Chapter 5. Results 30
Chapter 6. Super Learner 39
6.1 Results 41
Chapter 7. Fast Adaptive Set-based Test 42
7.1 Theory and Derivations 43
7.1.1 Asymptotic covariance of Maximum Likelihood Estimates (MLEs) of
separately fitted marginal models) 43
7.1.2 Linear regression models 47
7.1.3 Logistic regression models 49
7.1.4 Deriving
in
50
7.2 Performance and comparison with permutations 51
7.3 “fastARTP” R Package 54
Chapter 8. Pathway Analysis of BMI Growth during Puberty 56
8.1 Children Health Study (CHS) 57
8.2 Genotypes and Ancestry Estimates 59
8.3 Genes and Pathways Annotations 59
8.4 Methods 59
8.5 Results 61
Chapter 9. Concluding Remarks and Future Work 65
References 69
Appendix A: R code for fastARTP 78
Appendix B: Additional BMI growth Pathway Analysis Results 95
iv
List of Tables
Table 1.1 Table form of the Fisher’s test under the competitive method 3
Table 1.2 Table form of the Fisher’s test under the self-contained method 4
Table 4.1 Simulation scenarios 29
Table 5.1 P-values of all pairwise t-tests of the three power performance groups in
Figure 5.3
35
Table 5.2 P-values of all pairwise t-tests of the two power performance groups in
Figure 5.4
35
Table 5.3 Power of methods when only one interaction effect is present from two
imputed SNPs with no marginal effects
36
Table 7.2 Computational time by different methods for five pathways of different
size
53
Table 7.3 P-value accuracy among 3 different methods with a proportions test on
sampling methods
54
Table 8.5 Most significant pathways associated with 7-year BMI growth in CHS.
None of the pathways remain significant after Bonferroni correction
61
v
List of Figures
Figure 2.1 An example distribution of LRT statistic from permutations 8
Figure 5.1 Combining too few or too many p-values using the RTP method results
in a sizable loss of power. By contrast, ARTP is robust to the unknown
number of associated SNPs
30
Figure 5.2 Power of set-based methods as a function of the number of causal
SNPs in scenarios with independent SNPs
32
Figure 5.3 Power of set-based methods as a function of the number of causal
SNPs in scenarios with LD structure from the PTGER3 gene
33
Figure 5.4 Power curves of all methods under the second simulation scenario 34
Figure 7.2 Accuracy of sampling framework for ARTP in two scenarios 53
Figure 8.5 a) One-level ARTP results for 1,258 pathways ordered from smallest
to largest on the x-axis. All SNPs are summarized at once to obtain a pathway
statistic
63
Figure 8.5 b) Null distribution of p-values 63
Figure 8.5 c) Two-level ARTP results for 1,258 pathways ordered from smallest
to largest on the x-axis. All SNPs are first summarized within each gene.
Then, all of the gene statistics are used to obtain a pathway statistic
64
vi
Preface
With a typical sample size of a few thousand subjects, a single genomewide association
study (GWAS) using traditional one-SNP-at-a-time methods can only detect genetic variants
conferring a modest effect on disease risk. Set-based methods, which analyze sets of SNPs
jointly, can detect variants with smaller effects acting within a gene, a pathway, or other
biologically relevant sets. While self-contained set-based methods (those that test sets of variants
without regard to variants not in the set) are generally more powerful than competitive set-based
approaches (those that rely on comparison of variants in the set of interest with variants not in
the set), there is no consensus as to which self-contained methods are best. In particular, several
competitive set tests have been proposed to directly or indirectly ‘adapt’ to the a priori unknown
proportion and distribution of effects of the truly associated SNPs in the set, which is a major
determinant of their power.
A popular adaptive set-based test is the adaptive rank truncated product (ARTP), which
seeks the set of SNPs that yields the best-combined evidence of association. We compared the
standard ARTP, several ARTP variations we introduced, and other adaptive methods in a
comprehensive simulation study to evaluate their performance. We used permutations to assess
significance for all the methods and thus provide a level playing field for comparison. We found
the standard ARTP test to have the highest power across our simulations followed closely by the
global model of random effects (GMRE) and a LASSO based test.
However, ARTP requires permuted data and p-values that may be computationally
intensive. This motivates a fast alternative approach to replace permutations. We proposed a Fast
Adaptive Set-based Test (FAST), which samples permutation equivalents from the asymptotic
distribution of all marginal parameter estimates of interest. An R package “fastARTP” was
vii
written to apply the FAST framework to draw permutation equivalents then applied on to the
ARTP method. We consequently applied “fastARTP” package on a genomewide pathway
analysis using the Children Health Study (CHS).
Acknowledgments
I would like to express my sincere gratitude to my advisor Juan Pablo Lewinger for his
patience, guidance and motivation. I would like to thank my committee members: James
Gauderman, David Conti, Fred Schumacher, and Fengzhu Sun for providing me with support,
encouragement, and insight through this arduous process. I would also like to thank the entire
faculty of the Preventive Medicine department which has been instrumental in my development
as a scientist and investigator. In addition, Kiros Berhane has been vital in my development as a
Ph.D. student and supported me generously. Last but not least, I am sincerely thankful for the
support bestowed upon me by friends and family, who encouraged me through the many tough
periods, and celebrated my growth along the way. Thank you.
1
Chapter 1. Introduction
1.1 Introduction
The first genome-wide association study (GWAS) published in 2005 discovered a
common genetic variant for age-related macular degeneration that caused blindness in the elderly
[Klein, Zeiss et al. 2005]. Since then, the number of GWAS publications hoping to find disease
associations with single-nucleotide polymorphisms (SNPs) has increased dramatically (11912 as
of 2013 [Welter, MacArthur et al. 2014]). Although identifying genetic variants for Mendelian
diseases has proven successful, complex diseases still pose many challenges. For instance,
although there have been some successes in identifying common genetic variants associated with
type II diabetes such as TCF7L2, KCNQ1 and KCNJ11 genes [Sladek, Rocheleau et al. 2007;
Steinthorsdottir, Thorleifsson et al. 2007; Ali 2013], approximately 70 well identified loci
combined only explain ~10% of the heritability of type II diabetes [Sun, Yu et al. 2014]. Much
of the unexplained heritability may due to rare variants, epistatic and/or epigenetic effects
[Eichler, Flint et al. 2010], or ‘missed SNPs’ that may have not been detected by genome-wide
scans. The traditional GWAS approach tests the marginal effect of each of the large number of
SNPs in a genotyping panel (from hundreds of thousands to a few millions), and requires a
threshold of 5x10
-8
[Wang, Li et al. 2010] to achieve genome-wide significance. In order to rank
as top signals and pass this stringent genome-wide threshold with a traditional GWAS, a SNP
variant needs to confer a considerable effect on risk. Small sample sizes may lead to
underpowered scenarios in which even SNPs with sizable effects may not be detected. Typical
GWAS sample sizes are in the few thousands of individuals and many consortia efforts
underway are pooling resources across many studies to increase sample size.
2
A complementary approach to the standard one-SNP-at-a-time analysis of GWAS is to
group SNPs into sets that may contribute jointly to a phenotype. Natural sets are SNPs within a
gene region and SNPs in multiple genes belonging to a common genetic pathway. By shifting the
focus from individual SNPs to SNP sets of potential biological relevance, the hope is to borrow
strength from multiple weak associations to gain power and make the set detectable as a unit.
This is the basis for set-based analysis. Set-based analyses are typically performed as a
secondary analysis after a SNP by SNP marginal scan has been performed. The goal is to detect
any associated pathway where the set of genes comprising the pathway were missed by the
marginal test in the original GWAS.
Set-based methods can be classified as competitive or self-contained, according to how
the null hypothesis is formulated. Both types of null hypothesis test whether SNPs in a set are
associated with the disease of interest but with different references. Competitive methods test
whether the joint distribution of individual SNP association p-values in the pathway is the same
as the joint distribution of individual SNP association p-values outside the pathway, while self-
contained methods test directly whether the SNPs in the pathway are jointly associated with the
disease trait. Thus, to use competitive methods, usually the whole genome-wide data is required
with individual-values computed for each SNP. By contrast, only the SNPs within the set of
interest are required when using self-contained methods. The simplest version of a competitive
method uses Fisher’s exact test on the 2x2 table in which the columns indicate significant vs.
non-significant p-values, and the rows indicate SNPs in the set of interest vs. SNPs not of the set
interest [Fridley and Biernacka 2011] (Table 1.1);
3
Table 1.1 Table form of the Fisher’s test under the competitive method
Significant Non-significant
Genes/SNPs in the pathway a
1
b
1
Genes/SNPs not in the pathway
c
1
d
1
1. Counts of pathways.
Other tests for the 2x2 table also include the
2
test and the binomial z-test for proportions and
have been built into readily available software such as DAVID [Huang, Sherman et al. 2009],
GATHER [Chang and Nevins 2006], PANTHER [Thomas, Campbell et al. 2003], and Ingenuity
Pathway Analysis (IPA) [Kramer, Green et al. 2014]. Another popular method is the gene-set
enrichment analysis (GSEA) originally inspired by microarray expression analysis
[Subramanian, Tamayo et al. 2005; Wang, Li et al. 2007]. GSEA uses a weighted Kolmogorov-
Smirnov (KS) running-sum statistic. Each gene is first represented by one statistic value
generated at the SNP level, whether by taking the minimum p-value of the SNPs or by other
methods, and ranked by its significance. Then, the KS statistic (enrichment score) reflects the
overrepresentation of genes in the pathway at the top of the entire ranked list of genes; in other
words, one investigates how ‘enriched’ the pathway is with top ranking significant genes.
Variations of GSEA include GSEA-SNP and i-GSEA4GWAS. GSEA-SNP calculates the
enrichment score based on all SNPs in a pathway rather than at the gene level [Holden, Deng et
al. 2008]. I-GSEA4GWAS improves the sensitivity of the original GSEA method by adding
weights to those pathways with higher proportion of significant genes [Zhang, Cui et al. 2010].
By contrast, a simple self-contained method will test whether there is an excess of
significant SNP p-values relative to what is expected by chance (Table 1.2) but not in relation to
SNPs not in the set. If p-values were independent across SNPs this could be accomplished with a
simple test of binomial proportions or the equivalent Pearson chi-square test.
4
Table 1.2 Table form of the Fisher’s test under the self-contained method
Significant Non-significant
Genes/SNPs proportion expected by chance 5% 95%
The null hypothesis here is that the number of significant SNPs observed in the set of interest is
no greater than expected by chance; the alternative hypothesis is thus that the set is associated
with the phenotype. Often, the set of interest has a hierarchical structure. For example, a set of
SNPs in a genetic pathway can be divided into subsets of SNP within gene regions. To perform a
pathway-based analysis, one can first derive a statistic representative of each gene by combining
information at the SNP-level; then, at the gene-level, one can combine all gene statistics
(possibly using the same method for combining SNPs) to generate a final pathway statistic.
Alternatively, one can directly combine all the SNP statistics into the final pathway statistic
without regard to genes. Many self-contained methods have been established in order to generate
combined test statistics – minimum p-value, higher-criticism (HC) or flat threshold [De la Cruz,
Wen et al. 2010], adaptive rank truncation product (ARTP) [Yu, Li et al. 2009], Hotelling’s T
2
test [Goeman, van de Geer et al. 2004], GMRE, least absolute shrinkage and selection operator
(LASSO) [Tibshirani 1996], partial least squares (PLS) variations [Wang, Ho et al. 2009],
principal component (PC) variations [Gauderman, Murcray et al. 2007], random forest [Breiman
2001], and variance component tests [Huang and Lin 2013] – and few additional publications
highlighted power comparison of limited number of these methods [Tsai and Chen 2009;
Fridley, Jenkins et al. 2010; Wang, Li et al. 2010; Maciejewski 2014]. I review these methods in
detail in Chapter 2.
Competitive methods are sometimes favored because one can test the pathway of interest
against its complement, those pathways/genes in the rest of the background data, solely based on
5
p-values from the original genome-wide scan without having to re-compute any new statistic
based on the raw genotype data. Self-contained methods, on the other hand, typically require the
raw data because permutations may be needed to assess significance. Nevertheless, due to the
characteristic of the self-contained null hypothesis, self-contained methods tend to have higher
power than competitive methods [Fridley and Biernacka 2011]. To understand why this is the
case, let us consider the simplest method for both self-contained and competitive methods
described above. Both rely on the proportion of significant SNPs using a predefined significance
cutoff point. In the self-contained framework the proportion of significant SNPs in the set would
be compared to the proportion expected under the null. By contrast, a competitive test would
compare the proportion of significant SNPs in the set relative to the proportion of significant
SNPs in the reference set (e.g. all other GWAS SNPS). Clearly, detecting whether a proportion
differs from a fixed value (the proportion expected under the null) is easier than detecting
whether two proportions differ. In the latter case the variance of the test statistic involves both
the variance of the within proportion and the variance of the without proportion resulting in a
more variable test statistic and a lower powered test. The superior power of self-contained
methods has also been shown empirically in many studies. For example, the KS running-sum
statistic has been shown to have less power when compared with other simpler statistics such as
the one-sided z-test and chi-square test [Chasman 2008; Irizarry, Wang et al. 2011]. Its test
statistic gets ‘penalized’ (deduction in KS statistic score) for every significant gene outside the
pathway of interest [Fu 2003]. Fridley et al also indicated that KS running-sum statistic have
lowest power among other methods such as Fisher’s method, Stouffer’s method (SM), global
model with random effect (GMRE), global model with fixed effect (GMFE), and principal
component analysis (PCA) [Fridley, Jenkins et al. 2010]. And, Tsai et al also shown that GSEA
6
performed the worst when compared with multivariate analysis of variance (MANOVA),
analysis of variance (ANOVA), and significance of microarrays-gene set (SAM-GS) methods
[Tsai and Chen 2009]. Additionally, in the case of candidate gene studies or one gene set on a
chip, competitive methods are not applicable [Chasman 2008]. The researcher must decide based
on the data available when choosing between competitive or self-contained methods for pathway
analysis.
The goals of this dissertation are,
(i) To comprehensively evaluate the performance of existing and novel adaptive set-
based tests.
(ii) To improve the computational efficiency of adaptive-set test which typically rely
on permutations.
(iii) To investigate the association of genetic pathways with BMI growth during
puberty.
7
Chapter 2. Existing Self-Contained Set-Based Methods
2.1 Likelihood Ratio Test and Permutations
We assume a set of SNPs of interest based on biological relevance or prior belief from
prior information on gene function such as the Gene Ontology (GO) database [Ashburner, Ball et
al. 2000]. For simplicity of exposition here we assume SNPs within a biological pathway and a
binary trait --typically a case-control indicator. In the following, bold letters represent matrices,
and non-bold letters represent scalars. We let, let Y = Yi i=1,…,n, where n is the total number of
subjects be a
vector of 1 and 0 representing case-control status, X =
is a n x
(L+1) matrix in which the first column is a
vector of ones and each column S j (j=1,…,L,
where L is the total number of SNPs) is the vector of genotypes for SNP j, and S =
. A
standard model typically applied to this kind of data is a logistic regression of the form:
!! "# (1)
Here, # =
$
%
&
' (k=1,…,L) is a (L+1) x 1 vector of parameters where
$
corresponds to
the intercept and
(
corresponds to the log-odds for SNP k.
Based on model (1) one can construct a global test of association based for example on
the Likelihood Ratio Test (LRT) statistic, by comparing the full likelihood against the null
likelihood of :
)*+,-./ 012
345678964
356778964
:
Asymptotically, the likelihood ratio statistic has a chi-square distribution with L degrees
of freedom. However, for many alternative tests reviewed below, the asymptotic null distribution
may be unknown or hard to compute. In such cases, one can use a permutation scheme to assess
8
significance: the phenotypes Yi, i=1,…,n are shuffled while keeping unchanged , and the test
statistic is recomputed for the shuffled data. By performing many permutations, we obtain a null
distribution of the test statistics relative to which we can determine whether the original test
statistic is extreme or not (Figure 2.1).
Figure 2.1 An example distribution of LRT statistic from permutations
Significance of the original data test statistic is assessed by comparing with the
permutation distribution. In our simulation study in Chapter 4 we compare all the methods
described below, relying exclusively on 1,000 permutations per data replicate and 1,000 data
replicates to calculate power. Power is calculated by number of data replicates having a p value
0.05. Using permutations establishes a fair common basis for comparison because it
guarantees that the type I error will be maintained at the nominal level for all methods.
9
Self-contained set-based methods can in turn be classified into joint and marginal
approaches depending on whether they model the effect of individual SNPs or multiple SNPs
jointly. Joint approaches rely on multivariate regression to jointly model and test the effect of all
the SNPs in a set. By contrast, marginal approaches combine individual SNP-based p-values into
a summary statistic for the entire set using some p-value combination method. We will present
pathway-based methods starting with joint modeling approaches below.
2.2 Joint Modeling Approaches
2.2.1 LASSO Regression
SNPs within a small (and sometimes large) region of the genome are in Linkage
Disequilibrium (LD), which can create a multicollinearity problem in model (1). LASSO
regression aims to overcome the multicollinearity by introducing an L1 penalty on the
coefficients. The effect of the L1 penalty is to shrink the regression coefficients s toward zero
making some of the exactly zero[Tibshirani 1996]. Thus the lasso is simultaneously an
estimation and model selection method. In addition, the lasso can handle models with more
covariates than observations ( ) [Goeman and Buhlmann 2007], which are typical of
GWAS data and which un-penalized regression cannot handle. The LASSO minimizes the Sum
of Squared Errors (SSE) like regular linear least square regression while imposing a constraint on
the beta coefficients of the following form:
,,;
<
! =>?
@
0A0=
B
C
@B
DE%
BF%
G
H
I
@F%
JK= L
B
L
D
BF%
The tuning parameter λ is usually selected using cross-validation. In the case of logistic
regression, the constraint is expressed as adding a Lagrangian penalty to the joint likelihood of
10
the parameter estimates [Conroy and Sajda 2012]. To perform a global test of the hypothesis that
all beta coefficients are equal to zero based on the LASSO, the final model, which only has a few
non-zero coefficient estimates, can be compared with the null model without covariates using a
Likelihood Ratio statistic. In our simulations we used the implementation of the LASSO in the
‘penalized’ R-package from Goeman et al. (2010) [Goeman 2010]. This implementation allows
for a combination of L1 LASSO and L2 ridge penalty tuning parameters, which is also called an
elastic net [Zou and Hastie 2005]. Using cross-validations to choose the L1 tuning parameter is
computationally intensive. The tuning parameter for the L2 penalty was chosen to be a small
positive value so that the regression is closed to the LASSO regression, but yet large enough to
speed up convergence.
2.2.2 Global Test of Random Effects (GMRE)
A single pathway may contain hundreds of SNPs if not thousands. This poses a problem
to model (1) when we want to test all as fixed effects with M
$
N# O vs. M
P
N#QO. Under
the null # O, R but if L > n, there are many solutions to R in which Q. From
model (1), Y depends on through for the regression logit(Y) ="#; if there are many
solutions to R in which Q under the null, they will all grant Y the same distribution
(logit(Y) ="#) as does the true null hypothesis and thereby decreasing power to detect the
alternative hypothesis. An attractive alternative is to treat # as random effects. We can assume #
are a sample from some chosen distribution; the expectation of # is then 0 with a covariance
matrix structure S
T
U. The maximum likelihood estimate of # under the random effects model is
the same as the estimate obtained under a penalized regression method if the prior distribution of
# is chosen correctly. For example, independent and identically distributed normal components
of #
results in ridge regression [Hoerl and Kennard 1970], while independently and identically
11
double exponential results in LASSO regression [Tibshirani 1996; Efron, Hastie et al. 2004]. If
V#W! is the likelihood of # given Y and X
YLS
TZ! is the expectation of # given [
H
, then the
marginal density of Y has the form [Goeman, van de Geer et al. 2006]:
\
]S
T
W! X
YLS
T^V#W!_
The null hypothesis in GMRE becomes M
$
N`
H
, which implies M
$
N# O since they both
have same marginal distribution of Y shown above; the alternative hypothesis is then M
P
N`
H
a,
which implies M
$
N#QO. The GMRE is implemented by Goeman et al. (2004) in R-package
‘globaltest’ available on bioconductor [Goeman, van de Geer et al. 2004]. The package is based
on the score test described by Le Cessie and Van Houwelingen (1995) [le Cessie and van
Houwelingen 1995] and Houwing-Duistermaat et al. (1995) [Houwing-Duistermaat, Derkx et al.
1995]. To test `
H
, we have the random effect model of
;b
@
L
@
! 5
E%
AJ
@
)
where 5
E%
link function (logit in our case), A intercept,
@
c C
@d
d d
(i=1,…,n and j=1,…,L)
with ;e! and fge! `
H
h
. The score statistic is given by the derivative of the log
likelihood of the model above with respect to `
H
at `
H
, divided by its standard deviation to
adjust for variances of covariate SNPs:
+
0i!
h
j0i!0i
H
-/4j!
Ti
H
H
-/4j
T
!Jk
l
0mi
H
H
!c j
nn
T
n
T
o0;o!
,;o!
where i i
H
i
p
are the first, second and fourth moments of Y and j
q)!
h
with the same
variable notations from model (1); note that j here is a matrix. However, the R-package
12
uses an equivalent simpler test statistic given by the quadratic form under the null when E(W) =
0:
r
0i!
h
j0i!
i
H
with expectation
;r! -/4j!
and variance
s-r! 1-/4j
T
!J>
i
p
i
H
H
0tG=j
nn
T
n
Z
Under the null, T and Q are asymptotically normally distributed in large sample sizes, but better
approximated by scaled chi-square distribution in small sample sizes. Q is also easy to compute
for high number of SNPs since the j matrix is limited by the sample size but not the SNP size
L.
2.2.3 Hotelling’s T
2
MANOVA is a commonly used method in the microarray gene expression data analysis;
investigators tend to compare thousands of gene expression profiles among various groups of
experimental conditions [Szabo, Boucher et al. 2003; Kim, I et al. 2005]. In the case of only two
experimental conditions, MANOVA test reduces to a Hotelling’s T
2
test [Lu, Liu et al. 2005].
Borrowing this notion, we can use a multivariate test to compare all SNPs in a pathway between
the case and control groups. Let A = uvw indicate the total number of SNPs among cases with m
x L SNP matrix where m is the total number of cases. Let A = uvx indicate the total number of
SNPs among controls with p x L SNP matrix where p is the total number of controls. Note that A
and B here have the same SNP names in the pathway but with different values based on case or
13
control group. Let U
y
= variance-covariance matrix of A and U
z
= variance-covariance matrix of
B. The null hypothesis for the test is M
$
Ny
{
|
{
or equivalently to M
$
Ny
{
z
{
y
{
T
z
{
T
y
{
}
z
{
}
. And, the mean is defined by
y
{
~
{
%
~
{
D
-6~
{
d
9
c ~
d
F%
(j=1,…, L)
|
{
{
%
{
D
-6
{
d
c
d
F%
(j=1,…, L)
where y
{
and |
{
are vectors. When U
y
U
|
, the Hotelling T
2
statistic uses the pooled variance
o
0
!U
y
J0
!U
z
J01
+
H
J
y
{
0|
{
!
h
o
E
y
{
0|
{
!
When U
y
QU
|
, the Hotelling T
2
statistic uses the unpooled variance
o
U
y
J
U
|
+
H
y
{
0|
{
!'o
E%
y
{
0|
{
!
Under the null, +
H
is asymptotically chi-squared distributed with L degrees of freedom.
However, with the R-code made available by Tsai et al. (2009) [Tsai and Chen 2009], we apply
Hotelling’s T
2
statistic in our pathway simulation followed by permutations to obtain the
empirical distribution of the T
2
statistic for even comparing grounds with other methods. Here,
we assume the unpooled variance since distribution of SNPs in cases could be different than that
of controls.
14
2.2.4 Principal Component Analysis (PCA)
Gauderman et al. (2007) have applied principal components (PCs) to test for association
between a group SNPs and case-control status. A principal component is a linear combination of
optimally-weighted observed variables. The first PC explains the most variance in S and
corresponds to the largest eigenvector value. The second PC explains the second most variance
in S, and the proportion of variance explained decreases for the next PC in order. From the
matrix of genotypes S one can obtain principal components [Gauderman, Murcray et al. 2007]:
f
%
J
T
T
JJ
f
H
T
T
J
TT
T
JJ
T
Z
Z
Z
f
D
J
T
T
JJ
where
are the eigenvectors of the covariance matrix cov(S) corresponding to each
PC. The variance of all PC is given by sequentially summing from the largest to smallest value
of
-9-C
F
g-! =s-
D
F%
,
D
! =s-
D
F%
f
D
! K
%
JK
H
JJK
D
Z
One can regress Y on all the PCs that explain a given proportion (say 80%) of the explained
variance in either linear or logistic regression:
!#Jeexe linear regression
!! !# logistic regression
15
where
f
%
f
D
. This technique allows picking only a few PCs to explain a high
proportion of variance and achieve dimension reduction [Gower and Mardia 1974]. Again, a
LRT statistic can be derived using this full model, and significance assessed by permutations.
However, this method has been shown to have low power for testing pathway-based associations
[Fridley, Jenkins et al. 2010]. Therefore we utilize this method only in an improved version as
described in the later section.
2.2.4 Partial Least Squares (PLS)
One possible reason why using PCs in the logistic regression setting has lower power is
due to the fact that the correlations between phenotypes and SNPs are not taken into account
when choosing how many PCs to include in the regression. Partial least squares is a similar
dimension reduction technique that does take the correlations between phenotypes and SNPs into
account. Let
T
! and v be vector of coefficients, one can define the covariance of first
PLS component [Wang, Ho et al. 2009] by
- 9-C
F vF
/g
H
v!
where
v!
%
. Subsequent PLS components can be found in a similar manner; for
example,
T
v
m
!!
H
. One can then regress the PLS components against Y in the
logistic regression model similar to PC based method. The R-package ‘pls’ implements PLS
algorithm [Mevik and Wehrens 2007], and we further investigate some PLS variations in
Chapter 3.
16
2.2.5 Random Forest
Random forest is a machine learning algorithm aimed to improve upon classification and
regression trees to make outcome predictions and assess variable importance (VI) [Breiman
2001]. The algorithm is as follows
(1) Obtain a training data set and let N be its sample size.
(2) Select a bootstrap sample size N with replacement as training data. Those
observations in the bootstrap are ‘in-bag’, and those observations outside of the
bootstrap are ‘out-of-bag’ (OOB). On average 2/3 of the data will be ‘in-bag’ and 1/3
of the data will be ‘out-of-bag’.
(3) Let L = the total number SNPs available. Let m = the number of total SNPs in a
random subset of L and m << L. A node is defined as the split point where the best
SNP out of m is used as a classifier to make the best decision split. The best SNP is
chosen by the maximum Gini Index value where the Gini Index is given by
0c !
H %
F$
Jc
H
F$
c L
!
H %
F$
.
The first term in the equation measures the impurity/entropy of the phenotype values
before the split while the second term measures the weighted impurity/entropy of
phenotype values after the split. Large values of
indicate purer splits and
better homogeneity within after-split groups.
(4) An ‘in-bag’ sample is taken and then splits at the first node. The resulting two parts of
the data are further split by another node independently by resampling m as the new
classifier.
17
(5) We repeat this process until the data is split by as many nodes as possible and
branches out to form a ‘tree’. The tree is fully grown without pruning, or removing
nodes to simplify a tree. Repeat the steps (2), (3), and (4) to grow as many trees as
requested to create a forest of trees (thus the term ‘random forest’).
(6) Run the OOB sample down each tree to obtain a prediction. For each OOB
observation, take the majority vote of final predictions of all the trees as its class
prediction. The error rate is then the proportion of times a particular observation is
not equal to its class prediction in the forest of trees. The average of the all error rates
in the OOB sample is then the OOB error estimate.
(7) To assess variable importance, first record the number of correct OOB predictions in
the OOB sample. Then, permute the values of m and record the number of correct
OOB predictions in the OOB sample again based on this permutation. The raw VI
score is then
f
0f
where f
correct count in original data, f
correct count in permuted data, and t
= total number of trees.
In the framework of pathway analysis, OOB error estimate is not sufficient to use as a test
statistic because it is possible to have the correct predictions by chance; for instance, if the
prediction probability of cases is 50%, then predicting outcome to be the most frequent
observation (majority vote) gives no information on the relationship between Y and X. To adjust
for lucky guesses of the correct predictions, Hardin and Hilbe (2007) [Lindsey 1999] proposed
18
using a pseudo R-squared statistic that is sensitive to correct predictions but penalizes blind
guesses
*
H
0
0
where
number of correct predictions,
most frequent outcomes (
1
if
c b
@
!
@F%
c b
@
!
@F%
, and n = total number of cases and controls. However, we used
the R-package ‘randomForest’ and calculate the likelihood of the alternative based on each
case’s prediction probability [Breiman 1999; Breiman and Cutler 2002; Svetnik, Liaw et al.
2003; Horvath, Shi et al. 2007; R, CD et al. 2007]
¡¢£¡¤¥¥¦ =
L§!!J
0
!
0
L§!!
¨
F%
Z
Since the likelihood under the null is a constant value, we can use a LRT statistic described
above at the beginning of Chapter 2. In our simulation below we used permutations to compute
p-values for the random-forest based LRT statistic.
2.3 Marginal P-values Approaches
In all the approaches described above, the trait is jointly modeled as a function of all
SNPs. A simpler approach is to combine marginal p-values to derive a single summary statistic.
There are many ways of combining marginal p-values, but a most widely used one is based on
Fisher’s method
01=
@!
D
@F%
©
H&!
H
19
which has a chi-square distribution with 2L degrees of freedom under the null. However, since
SNPs are in LD and not independent, Fisher’s method may no longer be a chi-square distribution
under the null. And, similar problem exists for other p-value combining methods that rely on the
independence marginal p-value assumption. Therefore, in our simulation, permutations are used
to get the null distribution of all different marginal p-value combinational approaches. We
introduce all the marginal p-values methods based on the SNP-level setup; however, one can use
the same idea on the pathway-level to obtain one final pathway test statistic.
At the gene-level, one assumes each SNP already has its own p-value derived from any
model such as the logistic regression model
!!
d
(2)
where phenotype as described before,
ª
(j=1,…,L for L total number of SNPs. This
is the marginal test of each SNP.
2.3.1 Minimum P-value (MinP)
Rank all the SNP p-values from smallest to largest then take the minimum p-value, the
best signal, to represent the gene statistic.
2.3.2 Higher-Criticism (HC)
The idea of MinP only takes the smallest p-value to represent each gene, however one
might question why not take best top signals instead to include more small p-values. HC below is
a way to select the top best p-values to derive a summary statistic. First rank the SNP p-values
values from smallest to largest within each gene:
%!
«
H!
«
D!
where L is the total
number of SNPs in the gene. Then, the higher-criticism threshold is given by the statistic that
maximizes
20
¬
) ® 0
@!
¯
@!
0
@!
!
where i =1,...,L is the index position, and
@!
is the SNP p-value at position i in the ranked list
[De la Cruz, Wen et al. 2010]. Once
¬
is found, one combines all the SNP p-values up to this
index by Fisher’s method
0=
@!
°
@F%
as the summary statistic of the gene, where z = the index
¬
occurs [Elston 1991]. For example,
if the maximum
¬
occurs at position 28, then z = 28.
2.3.3 Flat Threshold (FT)
Another way to choose how many top SNP p-values to include is the FT. From the
permutations of the marginal SNP test, by shufflying Y, we choose the 0.1 quantile of the
smallest p-value as the cutoff threshold (Flat Threshold). By ranking all the SNP p-values, we
include those p-values less than the cutoff threshold at each permutation of phenotype and again
use Fisher’s method for the summary gene statistic.
2.3.4 Adaptive Rank Truncation Product (ARTP)
MinP, HC, and FT are methods of choosing one truncation point only; once the
truncation point is determined, one can determine how many top SNPs to include in
summarizing a gene statistic. A natural extension of these methods is then to include more than
one truncation point; then, we can determine which truncation point is the best. Ideally, one can
choose as many truncation points as needed. But, the computational burden will increase
exponentially. Hence, in our simulation, we choose only 5 truncation points as follows. If there is
21
one gene with many SNPs, Yu et al. (2009) [Yu, Li et al. 2009] describes the following
algorithm as an adaptive rank truncation product method:
(1) Let ±
d
²9-C
) 1 ® !, j=1,…,J where J is the number of truncation points picked;
in this case, J = 5 since we picked 5 truncation points. Each truncation point
represents how many SNPs are selected. For instance, if there are a total of 100 SNPs
in the gene, then the truncation point ±
%
³ SNPs, ±
H
SNPs,…, ±
´
1³
SNPs.
(2) Rank
%
$!
µ
H
$!
µ
D
$!
where
@
p-value of ,
(i=1,…,L) and 0 represents
original dataset.
(3) Assume we have B permutations with
%
¶!
µ
H
¶!
µ
D
¶!
, µ·µ, then denote
¸
d
¶!
c
@
¶!
!
¹
º
@F%
µ·µ
µµ)
Thus, there are 5 W statistics for each permutation.
(4) And by using Ge’s algorithm [Ge, Dudoit et al. 2003], we define
,
d
¶!
c ¸
d
¶
»
!
µ¸
d
¶!
!
¼
¶
»
F$
J
Let each column be b = 0,…,B and there are 5 rows total corresponding to ¸
%
¸
´
.
Then, looking at one row at a time, the S statistic compares the current column W
statistic with the rest of the permuted W for the current row. We again have 5 rows of
,
d
for each column of b.
(5) At each column, we record the minimum ,
d
by ½
¶!
9
%¾d¾¿
¶!
. This
22
reduces the 5 rows per column in to just one row per column. Finally, we arrive at the
ARTP statistic by Ge’s algorithm again
c ½
¶
»
!
µ½
¶!
!
¼
¶
»
F$
J
The use of Ge’s algorithm here eliminates the need of multi-level permutations and uses the data
itself as a common reference.
Instead of just choosing 5 truncation points as shown in Yu et al. (2009) [Yu, Li et al.
2009], we extend the notion to choose all possible truncation points. The number of truncation
points will be equivalent to number of SNPs in the set.
23
Chapter 3. Novel Self-Contained Set-Based Methods
3.1 Introduction
In this chapter, we propose extensions and modifications to some of the existing
pathway-based methods introduced in Chapter 2. There are many pathway analysis methods, and
we have chosen to review the most widely used methods in the previous Chapter. However, our
research interest focuses on improving existing methods and proposing new methods thereby
giving a general guideline to which method to use in future pathway-based analyses.
3.1.1 Stepwise Regression on Principal Components (Step PC)
First, use all the SNPs in the pathway for principal component analysis. Then, use as
many principal components required to explain a large proportion of the variance (say 80%) of
the SNPs, in a stepwise logistic regression model. The stepwise logistic regression model selects
the important PCs and tries to minimize the chance of including too many PCs that may
introduce noise into the model. We then use LRT from the model and permute as described in
Chapter 2.
3.1.2 MinP on Pincipal Components (MinP PC)
Unlike Step PC that selects a few important PCs into the model, MinP PC aims to include
only the top PC as the summary statistic much like MinP in Chapter 2. First, we apply PC
analysis on all the SNPs and include all the number of PCs explaining 95% variance; 95% is
chosen so that most PCs are included but simultaneously achieve data dimension reduction.
Perform a marginal effect test on each PC. Choose the minimum p-value out of all the marginal
effect PCs as the summary statistic.
24
3.1.3 Adaptive Rank Truncation Product on Principal Components (ARTP PC)
Similar to the notion described in Chapter 2, instead of selecting one truncation point to
choose top PCs, one can have multiple truncation points adaptively and determine which
truncation point is the best. At the SNP-level, select a grid of values for the proportion of the
total variance explained by PCs
±
d
À7f.4C-ÁÂg--/4
where j = 1,…,5 and z = 75, 80, 85, 90, 95 corresponding to the j. 80% variance has been a
cutoff mark in most PCA publications, and here we include grid points slightly above and below
80%. For each truncation point, we use all the PCs in a logistic multiple regression and derive a
LRT statistic. We then take the LRT statistic and apply the ARTP procedure.
3.1.4 Adaptive Rank Truncation Product on Partial Least Squares (ARTP PLS)
Similarly to ARTP PC procedure, now use the PLS components instead at the SNP-level
for multiple covariates regression. Apply the ARTP method to derive ARTP PLS summary
statistic. PLS can be programed with the R-package ‘pls’.
3.1.5 Cluster Regression (CR)
Because SNPs are in LD, if one SNP is associated with the disease of interest, we expect
those SNPs in LD with the associated SNP to show similar effect. Therefore, one can first use
hierarchical clustering and cluster SNPs that are in high LD; then, use cluster-based approach to
increase power in detecting associations. In our simulation, we take all the SNPs in the pathway
and perform hierarchical cluster by using
0/4--9,.!
H
25
as the distances. Subsequently choose a 0.7 cutoff in the clustering tree diagram so that SNPs
within each cluster have à 0.7 correlations among them. The cutoff here is chosen so that within
each cluster SNPs are in relatively high LD. In a sense, we treat each cluster as a ‘gene’, and use
a logistic multiple regression of all the SNPs in the cluster. Consequently, each cluster has a
representative test statistic where we further apply the ARTP algorithm at the cluster-level,
similar to the gene-level, to obtain a pathway statistic.
3.1.6 Cluster Regression – 1 degrees of freedom test (CR-1df)
Same as described above, we first cluster the SNPs in the pathway. For each cluster, let
T
denote SNP 1, SNP 2,…,SNP j in the cluster. Originally, we use a logistic multiple
regression per cluster in the form of
!!
R
%
J
T
R
H
JJ
R
d
However, this likelihood ratio statistic has multiple degrees of freedom. Multiple degrees of
freedom means there are many possible alternative hypotheses in the total alternative hypothesis
space, and we weigh each one equally. But, we propose summing the SNPs first and then regress
!!
J
T
JJ
R
ÄÅÆ
and assume all the SNPs have the same beta coefficient, or average of beta, in the same
direction; SNPs in the same cluster (high LD with each other) should have the same effect. This
likelihood ratio statistic then only has 1 degree of freedom and puts more weight on a specific
alternative hypothesis instead and resulting to power gain. Prior to applying the average of beta
regression, all the SNPs in the cluster must have the same direction of correlation; flipping of
SNP allelic coding may be required so that all SNPs have the same sign correlations.
26
Chapter 4. Simulation Study
To evaluate the performance of the pathway methods described in Chapter 2 and 3, we
conducted a simulation study to evaluate the performance of the set-based tests described above
in a series of scenarios based on sets with independent SNPs or sets with a realistic linkage
disequilibrium structure among the SNPs. For the latter, we simulated genotypes based on real
genotypic data corresponding to the ~254kb long prostaglandin E receptor 3 (PTGER3) gene
from 1904 non-Hispanic white subjects from the Children Health Study (CHS) [Gauderman,
Avol et al. 2004] GWAS (Respiratory outcomes are the main focus of the CHS and the
prostaglandin E receptor 3 (PTGER3) gene is part of the prostanoid receptor family known to be
associated with aspirin-intolerant asthma [Kim, Kim et al. 2007]). Within PTGER3, the CHS
data had 162 typed SNPs, and we also considered 1093 additional untyped SNPs with a minor
allele frequency (maf) > 5% for a total of 1255 SNPs. The LD structure of this region is fairly
typical with 15 LD blocks. We imputed the untyped 1093 SNPs using the CEU HapMap
[International HapMap 2005] as a reference and haplotyped all 1904 individuals using the
software BEAGLE [Browning and Browning 2007]. This gave us a ‘pool’ of 3808 haplotypes
with a realistic LD structure among the SNPs. We assigned the original 162 typed SNPs to
represent typed SNPs in the simulated data and the originally untyped 1093 SNPs to represent
untyped SNPs in the simulated data. We designated sets of causal SNPs chosen from either the
typed SNPs or among the untyped SNPs in the region. For sets of independent SNPs and for sets
of SNPs in LD, we considered scenarios with a small number of causal SNPs ranging from 0 to
10 causal SNPs, and scenarios with a relatively high number of causal SNPs ranging from 20 to
140 causal SNPs (in increments of 10). Across simulation replicates the set of causal SNPs
27
remained fixed. To simulate an individual’s genotype we sampled a random pair of haplotypes
from the pool of 3,808 haplotypes.
We generated each individual’s case-control phenotype based on a logistic regression
model for the disease probability of the form:
L!
4C#!
J4C#!
t!
where S is the matrix of the genotypes (coded 0,1,2 according to the number of minor alleles) of
the designated causal SNPs, augmented with a vector of ones for the intercept. We generated
1,000 cases and 1,000 controls for each scenario.
If we assign the overall power to detect K causal SNPs by marginal test is 50%, one can
derive the effect size assignment of each causal SNP thru the non-centrality parameter (ncp)
from
ÅÇÄ
0
0
¼
!
¹
ÈÉÊËÉÌ
where
ÅÇÄ
= overall power,
¼
= power by Bonferroni correction, and ±
ÄÍÎÄ
=
number of causal SNPs.
Here, the overall power means that if we conduct a marginal test for each SNP, we expect a
power of 50% to detect significance assuming all SNPs are independent. Once the above
equation is rearranged to obtain a solution for
¼
, we can derive the ncp; after, use the ncp to
calculate the power that should be assigned to each causal SNP. If there are 10 causal SNPs, each
causal SNP should have roughly 56% power to detect significant by marginal test, by a
combination of minor allele frequency and odds ratio assigned, to achieve an overall power of
50% by a Bonferroni correction for multiple testing.
28
The number of causal SNPs, K, was the primary parameter of interest in the simulations.
We set the effect size (log odds-ratio) of each of the causal SNPs so that the power to detect at
least one causal SNP was not too high or too low. Having power in the midrange allows for
performance differences between methods to become apparent, as power differences are
“squeezed” together when overall power is either too close to one or too close to zero. To ensure
simulated scenarios yielding power in the desired midrange, we set the log-odds ratios for the
causal SNPs,
%
Ï
so that, based on their corresponding allele frequencies, each causal SNP
would have the same power to be detected if individually tested with a simple univariate
association test. For the scenarios with low numbers of causal SNPs we assigned 20% marginal
power to detect each causal SNP at the 5% level. For the scenarios with large numbers of causal
SNP we assigned 10% marginal power to each causal SNP. The intercept of the logistic
regression model for the case-control status in (3) was set to yield a population disease
prevalence of approximately 1%. In addition to the simulation scenarios described above, in
which each SNP has the same power to be detected if tested marginally, we also investigated the
performance of set-based tests in a set of simulations where the marginal power to detect each
individual SNP is not constant across the causal SNPs, i.e. there are easier and harder to detect
causal SNPs. In the independent SNP scenarios, similarly to the scenario with LD, we varied the
number of causal SNPs from 0 to 10 in increments of 1 and from 20 to 140. We also focused on
a separate scenario where a single interaction effect is present but no marginal effects. The two
interacting untyped SNPs selected have Pearson correlation coefficient of 0.067, minor allele
frequencies of 0.128 and 0.430, and assigned beta coefficients 0.7039 and 0.9062, respectively.
Their interaction effect has an assigned beta coefficient of 1.46, which gives ~80% marginal
power to detect the interaction effect in a logistic regression model with 3 terms (2 main effects,
29
and one interaction effect). The complete range of simulation sets scenarios is summarized in
Table 4.1.
Table 4.1 Simulation scenarios
Number of Causal SNPs Marginal Power
No LD
Low (0-10) 50%
High (20-140) 10%
2 80% (Interaction effect)
LD
Low (0-10) 20%
4 Varying power across causal SNPs
For each scenario we computed an upper bound to the achievable power to get a sense of
how close the power of each method was to this optimal level. This upper bound was obtained by
testing the SNP set using a likelihood ratio test based on a logistic regression model containing
only the known causal SNPs.
For each scenario examined we computed power based on 1,000 replicates and within
each replicate p-values were computed using 1,000 permutations. Each scenario took
approximately 15 hours of computation using parallelization with a high-performance computing
cluster.
30
Chapter 5. Results
In this chapter, we review the power performance of the established methods and
proposed methods from Chapter 2 and 3 under different simulation scenarios. Figure 5.1
highlights the importance of adaptation to the unknown proportion of associated SNPs.
Figure 5.1 Combining too few or too many p-values using the RTP method results in a
sizable loss of power. By contrast, ARTP is robust to the unknown number of associated
SNPs
In the two scenarios depicted (8 and 50 simulated causal SNPs out of a total of 162 SNPs
respectively), the power attained by the non-adaptive RTP varies widely with the number of top
most significant p-values combined to construct the statistic for the set (i.e. the truncation point).
(RTP attains maximum power when combining a number of p-values far in excess to the actual
number of simulated causal SNPs because of the many additional associations induced by LD).
31
Because the optimal truncation point is not known in advance, using RTP with a fixed truncation
point results in a sizable loss of power relative to the maximum attainable. This inefficiency is
particularly pronounced in scenarios with a higher proportion of associated SNPs, whereby
combining only a few top p-values results in a drastic power loss. By contrast, by adaptively
choosing the number of top SNPs to combine, ARTP achieves power that is very close to the
maximum power achieved by RTP but without any prior knowledge of the optimal number.
These results underscore the need for adapting set-based tests so that they become robust to the
unknown number of associated SNPs.
As expected based on the theoretical guarantees offered by the permutation-based
assessment of significance used in our study, all methods preserved the type I error at the target
level for the scenarios with no simulated causal SNPs (Figure 5.2, Figure 5.3, and Figure 5.4). In
agreement with the observation in Figure 5.1, the adaptive methods ARTP, ARTP-PC, LASSO
and GMRE evaluated in the simulation showed the expected robustness of power, performing
well (relative to the maximum achievable power) for the entire range of simulated causal SNPs,
LD structure (LD vs. independent), distribution of effect sizes (uniform vs. non-uniform power;
not show), and causal SNPs (typed, Figure 5.3 vs. un-typed, Figure 5.4). The two non-adaptive
methods, MinP and Hotelling’s T
2
test also performed in accordance with the observation that
the non-adaptive RTP lacks power robustness: MinP had good power (again relative to the
maximum achievable power) in scenarios with few causal SNPs and poor power in scenarios
with many causal SNPs, while the reverse was
32
Figure 5.2 Power of set-based methods as a function of the number of causal SNPs in
scenarios with independent SNPs. Panel a) low proportion of causal SNPs. Panel b) High
proportion of causal SNPs. CR-1df, PCA and PLS based methods were not included in this
simulation scenarios since without LD among SNPs dimension reduction is not possible
a
b
33
true for Hotelling’s T
2
, which had good power in scenarios with many causal SNPs and poor
power in scenarios with few causal SNPs.
Power grows with increasing number of causal SNPs in Figure 5.3 and 5.4. At the lower
and upper ends in the number of causal SNPs examined, the difference in power among the
methods is small due to power being close to the nominal 5% or 100%, respectively. In Figure
5.3, the power performance of all the methods can be categorized into three groups: high,
medium, and low. ARTP, GMRE, LASSO, CR-1df, MinP, and ARTP PC have relatively high
power and are in Group 1. Step PC and MinP PC have medium power and are in Group 2, while
ARTP PLS and Hotelling have low power and are in Group 3.
Figure 5.3 Power of set-based methods as a function of the number of causal SNPs in
scenarios with LD structure from the PTGER3 gene
34
Figure 5.4 Power curves of all methods under the second simulation scenario
In Figure 5.4, the power performance grouping was more distinctly reduced to two groups.
ARTP, GMRE, LASSO, CR-1df, MinP, ARTP PC, and Step PC now have relatively high power
and belong to Group 1. MinP, ARTP PLS, and Hotelling now have low power and are in Group
2.
The one-way Analysis of Variance (ANOVA) on the average power of each of the three
groups in Figure 5.3 is significant with p-value of 0.03, suggesting there is a difference among
the average power performance of the three groups. Similarly the ANOVA test for the two
groups in Figure 5.4 is also significant with a p-value of 0.02. To further test which pairwise
35
groups are different, we conduct all possible pairwise t-tests as sub-tests of the ANOVAs as
shown in Table 5.1 and 5.2.
Table 5.1 P-values of all pairwise t-tests of the three power performance groups in Figure
5.3
Group 1 Group 2 Group 3
Group 1 -- 0.063 0.01
Group 2 0.103 -- 0.29
Group 3 0.01 0.29 --
Table 5.2 P-values of all pairwise t-tests of the two power performance groups in Figure 5.4
Group 1 Group 2
Group 1 -- 0.021
Group 2 0.021 --
For the typed causal SNPs scenario in Figure 5.3, there is a difference between the high power
Group 1 and low power Group 3. The poor performance of the non-adaptive Hotelling’s T
2
from
Group 3 at the low number range of causal SNPs agree with the results in Figure 5.2. However,
the adaptive version of the partial least squares, ARTP PLS, from Group 3 surprisingly
performed poorly. Step PC improved its power performance in the untyped causal SNPs scenario
in Figure 5.4. But Hotelling’s T
2
and ARTP PLS are still the least performing set-based methods.
In order to examine interaction effect on the methods, we selected two untyped SNPs that
generated one interaction term with ~80% marginal power, which translates to ~50% overall
power to be detected by logistic regression with two main terms and one interaction terms but
with no marginal effects. The result is shown in Table 5.3.
36
Table 5.3 Power of methods when only one interaction effect is present from two imputed
SNPs with no marginal effects
ARTP GMRE LASSO CR-1df MinP
0.051144 0.05249 0.047106 0.043069 0.055182
ARTP PC Step PC MinP PC ARTP PLS Hotelling
0.057873 0.057873 0.05249 0.051144 0.053836
Since all methods are based on joint modeling without interaction terms or marginal p-values
only, we expect none of the methods will detect interaction effect. And Table 5.3 revealed that
all p-values obtained are close to the type I error threshold.
MinP in general performs better than expected. To gain insight into why MinP performs
so well, let us consider the simpler problem of testing the null hypothesis k O, where X is a
single multivariate observation, X~N(μ, Ι), where I is the identity matrix. Asymptotically, the
problem of testing the equality of a set of parameters to zero, such as the association of L SNPs,
is similar to this simpler problem. When the alternative hypothesis is that at least one i
@
Q,
where i = 1…L, the likelihood ratio statistic is given by LRT = "
Ð
" c Ñ
@
H
nF
which has a chi-
square distribution. However, when the alternative hypothesis is that exactly for one i
@
Q (i =
1…L) but the rest are nonzero, the corresponding likelihood ratio statistic is 9-CÑ
@
H
!. These two
test statistics are equivalent to 0c
@!
D
@F%
and ½
@!
respectively, where
@!
is the p-
value based on Ñ
@
H
for testing i . Thus, these two tests are optimal when all i’s are non-zero
and only one i is nonzero respectively. If the alternative hypothesis is that exactly two i
@
Q (i
= 1…L), then the corresponding likelihood ratio statistic is maximum of all possible pairs of
Ñ
@
H
JÑ
d
H
! (i = 1…L, j = 1…L, Q²); similar idea generalizes to when the alternative hypothesis
is exactly 3, 4, 5…etc. i
@
Q. Therefore, it is expected that when only a few causal SNPs are
37
present, MinP should have higher power than the global LRT and reverses as number of causal
SNPs represent a large proportion of the total. ARTP is a method that tries to choose between
these two extreme scenarios when the number of alternative hypothesis i
@
Q falls between one
and all.
The main finding of our simulation study is that in all scenarios ARTP outperformed the
other set-based methods, including the adaptive ones (Figure 5.1 and Figure 5.2). This was a
somewhat surprising result as we expected that the set tests based on the LASSO and GMRE,
which are not only adaptive but model all SNPs in the set jointly, would generally have better
power than ARTP, which while adaptive, models each SNP in the set marginally. However,
although GMRE and LASSO performed well (generally tracking each other closely, GMRE with
slightly better power), they consistently had lower power than ARTP. It is also important to note
that when choosing the truncation point non-adaptively as in the case of Higher-Criticism
threshold, HC (not shown), the performance may fall to the low power end of the spectrum.
As a group, the methods that incorporate some form of dimensionality reduction (ARTP-
PC, minP-PC, CR-1df, ARTP-PLS) --with or without additional adaptation-- did not offer a
power advantage over the methods that did not incorporate dimensionality reduction. For
example, although ARTP-PC performed well, it had consistently lower power than the original
ARTP, and MinP-PC only performed better than its non-dimension reduction counterpart MinP
in scenarios with a large proportion of associated SNPs. CR-1df, which achieves dimensionality
reduction by initial clustering of SNPs based on LD, performed well but has lower power than
ARTP or the other adaptive methods. ARTP-PLS was a particularly poor performer among the
methods incorporating dimensionality reduction. However, this may have more to do with the
particular way PLS reduces dimensionality, i.e. choosing components that maximize the
38
covariance between genotypes and outcome, which may not translate into components of
maximal association.
ARTP performs better since it forms a model selection that is geared towards hypothesis
testing (by maximizing the statistical evidence) while LASSO performs model selection through
tuning the penalty to maximize deviance. Alternative ways of choosing the LASSO tuning
parameter to improve the performance of the LASSO as a set-test its worth investigating. In
addition to the standard ARTP, we proposed two new variants that incorporate either principal
components or partial least squares. The goal was to gain additional power by combining
dimensionality reduction with data adaptation. However, although ARTP PC performed well,
neither ARTP PC nor ARTP PLS performed as well as the standard ARTP. We also evaluated
the use of step model selection on principal components but it performed quite poorly. This
shows that the adaptation step in ARTP is not equivalent to other forms of model selection.
Although ARTP is computationally intensive this is a disadvantage shared with other
adaptive methods that require permutations (GMRE being an exception as there is an asymptotic
approximation that can be used to assess its significance). Also, as originally proposed, ARTP
can only be used for testing genetic main effects, as permutations are not suitable for testing
interactions [Buzkova, Lumley et al. 2011]. ARTP can be applied with many other types of
genomic features such as gene expression, methylation or rare variants.
However, we will also introduce a faster algorithms to perform ARTP without the need
of permutations in the later Chapter but the extension of ARTP for testing GxE interactions
remain to be done in future work.
39
Chapter 6. Super Learner
Chaffee et al. (2010) proposed an algorithm, which he named the Super Learner, which
searches for the best data-adaptive regression (DAR) model fit among a collection of user-
selected DARs [Chaffee, Hubbard et al. 2010]. DAR here is a general term referring to
regression methods based on cross-validation (CV) procedure because each CV fold selects a
slightly different model based on the data; but, DAR could also represent any of the methods
described in Chapter 2 and 3. As aforementioned, there are many pathway-based approaches and
logically one would question whether there is a way to pick the best approach; the Super Learner
has the oracle property, in that asymptotically it is guaranteed to identify the best method and
further improve upon it. The algorithm is based on CV to first search for the DAR with smallest
prediction error and then fit the output of CV from each DAR simultaneously into one final
model. The detailed procedure is as follows:
(1) Partition the subjects into V equally sized and disjoint groups or folds.
(2) Designate the first fold as the validation set and the rest of the data as the training set.
Run each DAR separately on the training set.
(3) Use the model built using the training set to predict the outcomes on the validation
set; this gives probability of disease predictions. Repeat (2) and (3) V times by
designating a different block as validation set until each block has been assigned as
the validation fold once. Append all predictions to form a n x m matrix,
Ò
, in which n
= number of subjects in the data and m = number of DARs included.
(4) Regress the predictions on the true outcomes, !!
Ò
#, as the final
model where we can then perform a LRT against the null of no prediction effects. We
40
take this likelihood ratio statistic (LRS) as the summary gene statistic. Since each
method is likely to specify different models for different sets of data in CV, the LRS
does not follow a chi-square distribution. Therefore, we need to permute the
phenotype and repeat the whole procedure to obtain the distribution of the LRS from
the Super Learner [Polley 2010].
The R package ‘SuperLearner’ implements the algorithm above [Polley 2010].
To apply the methods described in Chapter 2 and 3 as part of the SuperLearner requires
recasting them as regressions. We advise the following in mimicking the ARTP algorithm within
the SuperLearner environment: treat each of the 5 truncation points in ARTP as a separate DAR
to be entered into the SuperLearner package. For example, if the first truncation point has top 8
SNPs from ranking of all marginal test SNP p-values, each of the 8 SNPs would have its own
training model and consequent case prediction probabilities in CV. Then, we would average case
predication probabilities across all the 8 SNPs. Same notion extends to the other truncation
points. In addition, if we have training models built from PCs or cluster regression instead of
SNPs, then similar procedures apply to methods such as ARTP PC and CR-based methods,
respectively.
We then compare the Super Learner output with the previous sets of simulations. Based
on the results from the simulations in Chapter 5, we selected the DARs that showed to have most
power into the Super Learner library: MinP, MinP PC, ARTP, ARTP PC, LASSO, and CR-1df.
After, we compare the Super Learner combined output of all the DARs with each DAR
separately.
41
6.1 Results
The likelihood ratio test statistic was obtained from the Super Learner, and the power
performance was assessed by 1,000 replicates in which each replicate p-value was computed
using 1,000 permutations. The Super Learner with Super Learner library composed of MinP,
MinP PC, ARTP, ARTP PC, LASSO, and CR-1df methods has less power than methods such as
ARTP (not shown). This could be explained by the nature of the Super Learner algorithm. The
Super Learner aims to select the best DAR method out of the Super Learner library without prior
knowledge of which DAR has the best power for a given dataset. However, due to the fact that
many DARs methods are use in the Super Learner library, the Super Learner takes a penalty in
power performance as the number of DARs in the library increase. Therefore, it is expected that
the Super Learner would not have better power than the best method in the library if we know
which method would perform the best beforehand. In this simulation, we know that ARTP is
robust and has the best power, and the Super Learner did not outperform ARTP as expected.
42
Chapter 7. Fast Adaptive Set-based Test
In Chapter 5, we showed that ARTP is a powerful self-contained method that
outperforms its competitors in a wide range of scenarios. However, applying ARTP in large-
scale studies poses a computational challenge because it relies on permutations to assess
significance. For example, to evaluate power for a single scenario in Chapter 5 we used 1,000
simulation replicates and 1,000 permutations within each replicate to compute power for each
simulation replicate. In HPCC using 250 nodes, each with a quad-core cpu, a single simulation
scenario for one replicate took about 15 hours when computing 1,000 simulation replicates in
parallel. Equivalently, this is approximately the amount of time it would take to analyze 100
pathways with about 175 SNPs each. Larger pathways, which are not uncommon, would be even
more computationally demanding. It is clear that to conduct a pathway-based genomewide scan
of association a faster alternative would be highly desirable.
We propose an alternative method for evaluating significance of ARTP tests that does not
require permutations. The idea is to rely on the asymptotic joint distribution of the test statistics
rather than the permutation distribution. As a motivation for the general case, let us consider first
the simpler case in which we fit a single regression with all the SNPs in the model as in (1). We
then consider the case where a separate model is fitted for every SNP, which is the approach
taken in practice. Under the null of no association between the SNPs and trait, the joint
asymptotic distribution of the regression coefficients is multivariate normal, #
Ò
O U! [Efron
and Hinkley 1978] where U is the inverse of the Fisher information matrix. We let
+
@
Ó
Ò
,;
Ô
Ó
Ò
!
43
represent the Wald’s statistic for testing the association with SNP i, i = 1…L. Letting Õ
+
%
+
H
+
D
!, we have that T=D
-1
, here D is the diagonal matrix of standard errors,
Ö ×
,;
Ô
Ó
Ò
!
ØÙ Ø
,;
Ô
D
Ô
!
Ú
By Slutsky’s lemma [Casella and Berger 2002], T has an asymptotic multivariate normal
distribution with ÕO Ö
E
U
Ò
Ö
E
!. The permutation version of ARTP requires recomputing
the ARTP test statistic for each permuted version of the data. This is time consuming because it
requires in turn refitting the regression model for each permutation. Now, instead of permuting
the phenotypes and re-computing the ARTP test statistic for each permutation, we can simply
sample #
Ò
from the distribution of #
Ò
O U
Ò
or equivalently from ÕO Ö
E
U
Ò
Ö
E
!.
For methods relying on marginal p-values from separate models fitted for each SNP, we
show that the joint distribution of the parameter estimates (or Wald test statistics) is still
asymptotically multivariate normal and we derive the asymptotic variance-covariance matrix, U.
7.1 Theory and Derivations
7.1.1 Asymptotic covariance of Maximum Likelihood Estimates (MLEs) of separately fitted
marginal models
Here we derive the asymptotic variance-covariance matrix of parameter estimates
obtained from separately fitted models. We first derive a general form of the covariance matrix
applicable to standard regression models fitted by maximum likelihood. Later we specialize the
derivations for linear and logistic models, the two most commonly used regression models for
large-scale association studies.
44
We consider two separate regression models fitted on the same data set. We let X denote
an n ×p matrix of covariates and Y an n ×1 vector of outcomes corresponding to measurements
on n subjects. We assume that two separate regression models M1 and M2 for regressing Y on X
are fitted, where the set of regression parameters for model Mi, is the vector Û
Ü
, i=1,2 (typically a
vector of estimates of the regression coefficients, variances, etc.). The parameter vectors Û
and
Û
T
do not need to be of the same dimension and the set of covariates used in models M1 and M2
can be any subsets of the full set of covariates in X (with or without overlap between the two
models). For example, the data outcomes Y can be case-control indicators for each subject, and
X the matrix containing genotypes G 1 and G 2 at SNPs 1 and 2 respectively, and additional
adjustment covariates such as age, gender and variables capturing population stratification.
Model 1 could be the regression of Y on G1 and adjustment covariates, and model 2 the
regression of Y on G 2 and adjustment covariates. We assume that models M1 and M2 have
likelihood functions Ý
%
Û
! Ý
%
" WÛ
! and Ý
H
Û
T
! Ý
H
" WÛ
T
! respectively, and that
maximum likelihood is used to obtain estimates Û
Ò
Û
Ò
T
of Û
and Û
T
respectively. Because they
are based on the same or overlapping portions of the data (X,Y) the regression estimates Û
Ò
Û
Ò
T
will be generally correlated. Our goal is to obtain an estimate of the variance-covariance
cov(Û
Ò
Û
Ò
T
!of the two sets of parameter estimates.
Û
Ò
and Û
Ò
T
are based on two different models for X with log-likelihoods Þ
Û
!
c Þ
ß
Û
Là
Zß
á
%(
!
¨
(F%
and Þ
T
Û
T
! c Þ
Tß
Û
T
Là
TZß
á
H(
!
¨
(F%
where à
Zß
â
%%(
â
%(
and
à
TZß
â
H%(
â
H(
are all covariate vectors, and á
%(
and á
H(
are the outcomes for the k
th
individual in the models 1 and 2, respectively. Let ã
Ü
äÞ
Ü
äÛ
Ü
c
äÞ
Üß
äÛ
Ü
¨
(F%
c ã
Üß
¨
(F%
be the score
functions, å
Ü
c å
Üß
¨
(F%
c 0æç
ä
T
å
Üß
äÛ
Ü
äÛ
Ü
è
é å
Üß
¨
(F%
the fisher information matrices where the
45
individual contribution to the score is ã
Üß
with fisher information matrices å
Üß
0æç
ä
T
å
Üß
äÛ
Ü
äÛ
Ü
è
é. In
the independent and identically distributed (iid) case, the ã
Üß
’s are mean zero,æã
Üß
O, iid
random vectors.
For the k
th
individual, let the covariance matrix between the scores of the two models be
represented by
U
Tß
ê¥ã
ß
ã
Tß
! æã
ß
ã
Tß
ë
0æã
ß
æã
Tß
ë
æã
ß
ã
Tß
ë
as æã
ß
æã
Tß
O. And the covariance matrix of all individuals between the scores of the
two models is given by
U
T
æã
ã
Tß
ë
;c ã
ß
¨
ìF%
!c ã
Tß
¨
(F%
!
ë
c æã
ß
ã
Tß
ë
¨
(F%
U
Tß
And by the law of large numbers [Casella and Berger 2002]:
= ã
ß
ã
Tß
ë
¨
(F%
íæã
ß
ã
Tß
ë
U
T
Since we are specifically interested in the ê¥Û
Ò
Û
Ò
T
matrix, where Û
Ò
î ï
%
ð ï
%
H
ñ
ò
%%
ñ
ò
%ó
and Û
Ò
T
î ï
H
ð ï
H
H
ñ
ò
H%
ñ
ò
Hó
for model 1 and 2, respectively, let ñ
ò
%%
and
ñ
ò
H%
represents the beta estimates associated with two different SNPs from two marginal models.
Then, we are only interested in the ê¥ñ
ò
%%
ñ
ò
H%
element out of the 1J!·?1J!
ê¥Û
Ò
Û
Ò
T
matrix as the goal is to fill the variance-covariance matrix of # with the all pairwise
SNP estimates.
The ê¥Û
Ò
Û
Ò
T
is given by
ê¥Û
Ò
Û
Ò
T
æçÛ
Ò
0Û
Û
Ò
T
0Û
T
ë
é0æôÛ
Ò
0Û
õæôÛ
Ò
T
0Û
T
õ
ë
æçÛ
Ò
0Û
Û
Ò
T
0Û
T
ë
é
46
where æôÛ
Ò
0Û
õ æôÛ
Ò
T
0Û
T
õ O.
The term Û
Ò
Ü
0Û
Ü
! for marginal model i can be solved by setting the score function, ã
Ü
,
to zero. Let ã
Ü
Û
Ü
! be the score function with respect to Û
Ü
. Under the following common
regularity conditions of
1. ã
Ü
Û
Ü
! and ã
Ü
h
Û
Ü
! exist
2. æã
Ü
Û
Ü
! O
3. å
0æã
Ü
hh
Û
Ü
!ÃO and is continuous
4. MLE of Û
Ü
is consistent
The Taylor series function ã
Ü
Û
Ò
Ü
evaluated at the vector Û
Ü
has the power series form
ã
Ü
Û
Ü
!J
ã
Ü
h
Û
Ü
!
ö
Û
Ò
Ü
0Û
Ü
J
ã
Ü
h
Û
Ü
!
1ö
Û
Ò
Ü
0Û
Ü
T
J
ã
Ü
h
Û
Ü
!
tö
Û
Ò
Ü
0Û
Ü
m
J
However, only the first order Taylor expansion is significant and the MLE for Û
Ò
Ü
is
O֋
Ü
Û
Ü
!Jã
Ü
h
Û
Ü
!Û
Ò
Ü
0Û
Ü
Û
Ò
Ü
0Û
Ü
÷0ã
Ü
h
Û
Ü
!
E%
ã
Ü
Û
Ü
! 0ø
Ü
E%
ã
Ü
Û
Ü
!
where ø
Ü
ã
Ü
h
Û
Ü
! is the Hessian matrix. And by the law of large numbers:
0ø
Ü
0
ù
H
Þ
ùÛ
Ü
ùÛ
Ü
ë
Û
Ò
Ü
0
= ùú
Û
Ü
ÞÛ
Ü
Là
ÜZß
á
(
!
¨
(F%
í0æû
ù
H
Þ
ùÛ
Ü
ùÛ
Ü
ë
ü å
Ü
where the gradient ú
Û
Ü
is the vector of partial derivatives of ¡Û
Ü
Là
ÜZß
á
(
!, with respect to the
Û
Ü
’s. Thus, Û
Ò
Ü
0Û
Ü
֌
Ü
E%
ã
Ü
Û
Ü
! asymptotically.
It then follows that
ê¥Û
Ò
Û
Ò
T
æçÛ
Ò
0Û
Û
Ò
T
0Û
T
ë
é÷æôå
E%
ã
ã
T
ë
å
T
E%
õ å
E%
æôã
ã
T
ë
õå
T
E%
å
E%
U
T
å
T
E%
47
Therefore, asymptotically >
Û
Ò
Û
Ò
T
Gý2
Û
Û
T
: û
å
E%
å
E%
U
T
å
T
E%
å
E%
U
T
å
T
E%
å
T
E%
üþ. In practice we estimate
this variance-covariance matrix by its ‘observed’ counterpart
û
E%
E%
T
E%
E%
T
E%
ü,
where
Ü
ä
Þ
äÛ
Ü
äÛ
Ü
Û
Ò
Ü
! is the observed information matrix and § c ã
ß
ã
Tß
ë ¨
(F%
is the
empirical covariance matrix between ã
and ã
T
. The asymptotic variance-covariance matrix of
sub-vectors of interest ñ
ò
%%
ñ
ò
H%
where
%
ñ
ò
%%
,
H
ñ
ò
H%
is the sub-matrix obtained from
extracting the corresponding rows and columns of V.
7.1.2 Linear regression models
We now specialize the results above to linear regression models of the form "#J
where á
%
á
H
á
(
á
¨
ë
is a vector of outcomes from all individuals, " is a n x j matrix of
parameters (including first column of 1s) from all individuals, and # is j x 1 vector containing
intercept and all the beta coefficients corresponding to parameters in ". The likelihood and log-
likelihood of the linear regression model is then
} á
(
L"#
H
!
¨
(F%
1
ð
H
!
E%qH
£
0
1ð
H
0àY!
ë
0àY!
¨
(F%
} = ¡1
ð
H
!
E
%
H
£
0
1ð
H
0àY!
ë
0àY!
¨
(F%
0
1
¡1
!0
1
¡ð
H
!0
0
1ð
H
0àY!
ë
0àY!
The score function with respect to # and with respect to ð
H
are thus
ãY!
ù¡}
ùY
ð
H
0àY!
ë
à
48
ð
H
!
ù¡Ý
ùð
H
0
1ð
H
J
1ð
p
0àY!
ë
0àY!
which is equivalent to the sum of individual score functions
ãY! = ã
ß
Y!
¨
(F%
=
ù¡}
ß
ùY
¨
(F%
=
ð
H
á
(
0á ï
(
!à
ß
¨
(F%
ð
H
! =
ù¡Ý
(
ùð
H
(F%
0
1ð
H
J
1ð
p
á
(
0á ï
(
!
H
where á
(
is a scalar outcome and à
ß
is a vector of parameter values for an individual k.
In the current context of testing SNP association with a continuous outcome and adjusting for
two covariates, à
ß
is equivalent to à
ß
(
(
%(
H(
ë
, where
(
is the SNP value for
individual k,
%(
and
H(
are two covariates for individual k, respectively. The individual score
functions with respect to Y and ð
H
becomes
ã
ß
Y!
á
(
0á ï
(
!
ð
H
(
(
%(
H(
ë
ð
H
á
(
0á ï
(
!
(
ð
H
á
(
0á ï
(
!
%(
ð
H
á
(
0á ï
(
!
H(
ð
H
á
(
0á ï
(
!
ë
(
ð
H
! 0
1ð
H
J
1ð
p
á
(
0á ï
(
!
H
When dealing with two separate models that are similar to above, recall that §
c ã
ß
ã
Tß
ë ¨
(F%
, and an individual’s contribution to the S matrix is then
ã
ß
ã
Tß
ë
ð ï
H
á
%(
0á ï
%(
!
%(
ð ï
H
á
%(
0á ï
%(
!
%%(
ð ï
H
á
%(
0á ï
%(
!
H%(
ð ï
H
á
%(
0á ï
%(
!0
1ð ï
H
J
1ð ï
p
á
%(
0á ï
%(
!
H
ë
49
ð ï
H
á
H(
0á ï
H(
!
H(
ð ï
H
á
H(
0á ï
H(
!
%H(
ð ï
H
á
H(
0á ï
H(
!
HH(
ð ï
H
á
H(
0á ï
H(
!0
1ð ï
H
J
1ð ï
p
á
H(
0á ï
H(
!
H
which is a 5 x 5 matrix where
%(
and
H(
represents the two covariates from model i for an
individual k.
7.1.3 Logistic regression models
We now specialize the results above to logistic regression models of the form
¡¥
!! ¡¥2
0
: àY
£àY!
J£àY!
where à is a n x j matrix of parameters (including first column of 1s) from all individuals, Y is a
j x 1 vector containing intercept and all the beta coefficients corresponding to parameters in à,
and = n x 1 vector of probability of being a case for all individuals. The likelihood and log-
likelihood are of the form
}
(
0
(
!
%E
¨
(F%
£à
ß
Y!
J£à
ß
Y!
E%
¨
(F%
¡} = á
(
à
ß
Y0¡¥
J£à
ß
Y!!
(F%
àY!0¡¥
J£àY!!
The score function with respect to Y is then
ãY!
ù¡}
ùY
ë
à0
à£àY!
J£àY!
0!
ë
à
which is equivalent to the sum of individual score functions
50
ãY! = ã
ß
Y!
¨
(F%
=
ù¡}
ß
ùY
¨
(F%
= á
(
0(
!à
ß
¨
(F%
where á is a scalar outcome of 0 or 1,
(
is the probability of being a case, and à
ß
is a vector of
parameter values for an individual k.
Similarly to the linear regression case, an individual’s contribution to the S matrix is then
ã
ß
ã
Tß
ë
á
%(
0%(
!
%(
á
%(
0%(
!
%%(
á
%(
0%(
!
H%(
á
%(
0%(
!
ë
á
H(
0H(
!
H(
á
H(
0H(
!
%H(
á
H(
0H(
!
HH(
á
H(
0H(
!
which is a 4 x 4 matrix where
%(
and
H(
represents the two covariates from model i for an
individual k.
7.1.4 DerivingU
Ò
in #
Ò
O U
Ò
Of the two sets of parameter estimates, Û
Ò
andÛ
Ò
T
, from two separately fitted models, we
are interested in one parameter estimate within each set and therefore only the covariance
between the parameter estimates of interest. In the simpler case of only two models, #
Ò
ñ
ò
%
ñ
ò
H
and the /gñ
ò
%
ñ
ò
H
is equal to the sub-matrix element of
E%
T
E%
. The U
Ò
is then
û
ñ
ò
%
/gñ
ò
%
ñ
ò
H
/gñ
ò
%
ñ
ò
H
ñ
ò
%
ü. Back to the example of testing SNP association above in the logistic
regression models, #
Ò
%
H
! and the sub-matrix element of the S matrix corresponding to
%
and
H
is then
%
á
%
0
%
!
H
á
H
0
H
!. The /g
%
H
! is thus the
E%
T
E%
sub-matrix element
computed with the inverse of the observed fisher information matrix of
%
and
H
and
%
á
%
0
%
!
H
á
H
0
H
!.
51
When there are more than two marginal models, #
Ò
and U
Ò
generalize to ñ
ò
%
ñ
ò
H
ñ
ò
and
×
ñ
ò
%
!ê¥ñ
ò
%
ñ
ò
!
ØÙ Ø
ê¥ñ
ò
ñ
ò
%
! ñ
ò
!
Ú, respectively, where i and j are row and column indexes
representing 1…L models. Once we compute the U
Ò
from the marginal models, we can sample
from the multivariate normal distribution of #
Ò
O U
Ò
to obtain permutation equivalents of #
Ò
and the corresponding Wald’s statistics. Then, ARTP is applied using these Wald’s statistics
based on the permutation equivalents.
Larger pathways impose limitations on constructing U
Ò
(eg. constructing a 10,000 by
10,000 matrix). For pathways larger than 1000 SNPs, we assume each gene is independent and
construct a variance-covariance matrix of marginal beta estimates within a gene. We then sample
permutation equivalents within each gene then aggregate the permutation equivalents before
applying ARTP.
7.2 Performance and Comparison with Permutations
To evaluate the accuracy of the sampling framework for ARTP, we simulated two typed
causal SNP scenarios as shown in Figure 7.2. 400 replicate datasets were simulated with each
dataset containing 2000 permutation equivalents derived from logistic regression in model (2). In
the sampling framework for ARTP, 2000 permutation equivalents were sampled from variance-
covariance matrix of marginal beta estimates, #O U!, and compared with 2000 permutations
from a traditional re-shuffling phenotype approach. The 0 typed causal SNP scenario shows that
the sampling framework p-values agree with the permutation p-values, and there is minimal
difference as the p-values lay on the diagonal of the plot. The 4 typed causal SNP scenario also
shows negligible difference between sampling framework and permutation p-values.
52
We further investigate computational time of permutations, sampling framework, and
sampling framework within gene on several pathways as shown in Table 7.2. Five pathways
were selected of different sizes ranging from 55 to 886 SNPs.
Within a pathway, linear regression in model (1) was used on a continuous outcome and
5000 permutation equivalents were sampled or permuted. The permutation time increases greatly
as number of SNPs in a pathway increase. The sampling framework dramatically decreases
computational burden because the bottleneck lies in the amount of time required to construct the
variance-covariance matrix of marginal beta estimates. When we construct variance-covariance
matrix first within gene, the computational time is even less because there are less total number
of pairwise covariance elements to be constructed due to independent genes assumption. It is
natural
53
Figure 7.2 Accuracy of sampling framework for ARTP in two scenarios
Table 7.2 Computational time by different methods for five pathways of different size
Pathway
Number
of SNPs
Number of
Genes
Permutation
Time (s)
Sampling
Time (s)
Sampling by
Gene Time (s)
Apotosis Inducing Factor 55 3 7120 4.76 7.31
Taurine and HypoTaurine
Metabolism
111 10 17060 16.18 13.31
Myelin and Lymphocyte
Protein
213 19 30945 47.41 25.49
Transport of Mature
mRNA
415 40 60440 134.75 31
Toll-like Receptor Proteins 886 32 132530 636.04 99
54
Table 7.3 P-value accuracy among 3 different methods with a proportions test on sampling
methods
Pathway
Permute
d P-value
Sampling
P-value
Sampling by
Gene P-value
Proportions Test P-value
for Sampling
Apotosis Inducing Factor 0.180 0.180 0.182 0.815
Taurine and HypoTaurine
Metabolism
0.148 0.148 0.161 0.076
Myelin and Lymphocyte
Protein
0.867 0.867 0.872 0.476
Transport of Mature mRNA 0.695 0.697 0.712 0.111
Toll-like Receptor Proteins 0.701 0.701 0.720 0.042
then to look at how closely the sampling method p-values are to the permutation p-values shown
in Table 7.3 below. We see that the permutation and sampling p-values are close to each other
and again agreeing with the results of Figure 5.3. However, when we compare sampling and
sampling by gene p-values, they tend to differ more as the number of SNPs in a pathway reach
closer to 1000 SNPs. This could be attributed to the fact that in larger pathways genes are not
independent and gene-gene interactions are present, thereby making the sampling by gene
method more conservative than permutation equivalent methods.
7.3 “fastARTP” R Package
We propose a new R package called “fastARTP” that allows sampling permutation
equivalents from a constructed variance-covariance matrix of beta estimates of interest from
marginal models before applying the ARTP method. The package allows user to specify whether
to use the sampling framework for all the SNPs, sampling by gene framework, or traditional
permutations in a given pathway. Once the permutation equivalents are obtained, the user can
further specify to apply ARTP on a one-level approach to summarize all SNPs at once or a two-
55
level approach to first summarize within genes then among genes to obtain a pathway statistic.
Details of the package can later be found on the r-projct.org website.
Below call function shows an example of the input fields:
samp.pvals=permute.pval(
pheno.data=y, snp.data=chs.snps,
cov.name=cov.name, cov.data=cov.data, y.type=1,
reps=5000, method="sampling",levels=3,
path.info=test.path, covar.by.gene=F
)
where phenol.data = outcome vector, snp.data = n x L SNP data matrix, cov.name = character
input of covariate names (e.g. c(“age”, “sex”)), cov.data = n x K Covariate data matrix, y.type =
nature of outcome vector (1=continuous, 2=binary), reps = # of permutation equivalents to
sample, method = either “permutation” or “sampling”, levels = type of ARTP used (1=one-level,
2=two-level, 3=both one-level and two-level), path.info = pathway annotation file (SNPs to
genes, genes to pathways), covar.by.gene = sampling within gene or not.
56
Chapter 8. Pathway Analysis of BMI Growth during Puberty
As technology advances to gain ease of access to food resources, childhood obesity
becomes an epidemic worldwide that needs public health awareness [Kosti and Panagiotakos 2006;
Wang and Lobstein 2006]. Obesity has been known to have a mixture of environmental, social,
and genetic underlying factors with age and ethnicity influences. Identifying specific genetic risk
factors could give insights to functional pathways that may be useful in the development of
prevention programs.
One measurement that represents obesity is the body mass index (BMI). Several cross-
sectional studies suggest that the heritability estimates of BMI increase from early childhood
through adolescence and has greater genetic influence during puberty [Jacobson and Rowe 1998;
Pietilainen, Kaprio et al. 1999; Koeppen-Schomerus, Wardle et al. 2001; Haworth, Carnell et al.
2008; Wardle, Carnell et al. 2008]. Longitudinal studies further reported greater heritability
estimates of BMI and indicated greater genetic effect than cross-sectional studies [Chen, Li et al.
2004; Cornes, Zhu et al. 2007; Goode, Cherny et al. 2007; Haworth, Carnell et al. 2008].
Some BMI loci have been identified by cross-sectional GWAS but on European adult
populations [Speliotes, Willer et al. ; Zhao, Bradfield et al. ; Frayling, Timpson et al. 2007;
Thorleifsson, Walters et al. 2009; Willer, Speliotes et al. 2009]. And other index SNP in the fat-
mass and obesity associated (FTO) gene has been reported in a non-European population of
Hispanic subjects. But the FTO index SNP identified in populations of European ancestry failed
to replicate across other ethnic groups, possibly contributed by differences in linkage
disequilibrium (LD) patterns [Ohashi, Naka et al. 2007; Scuteri, Sanna et al. 2007]. In the pediatric
population of European ancestry, Dina et al. investigated cross-sectional studies where BMI was
associated with FTO variation [Dina, Meyre et al. 2007], and Zhao et al. replicated other
57
previously established adult GWAS findings [Zhao, Bradfield et al. ; Zhao, Bradfield et al. 2009].
Other longitudinal studies have also claimed significant associations between genetic variation in
FTO and MC4R with longitudinal BMI change in pediatric populations [Elks, Loos et al. ; Hallman,
Friedel et al. ; Hardy, Wills et al. ; Haworth, Carnell et al. 2008]. These studies agreed on an
increasing genetic effects from childhood through adolescence, suggesting further studies of BMI
growth through puberty may warrant interesting genetic risk factors.
However, localizing specific genetic SNPs associated with BMI growth during puberty
may be difficult as the effect size of these SNPs must be large enough to pass the multiple testing
threshold. Therefore, we propose taking a broader view approach by looking at genetic pathways
that may summarize SNPs of smaller effects. Finding a pathway association is then indicative of
which groups of genes/SNPs within the pathway to further investigate in hopes of better localizing
genetic influences.
In our study, we take a genomewide approach on testing 1,258 pathways with annotations
gathered across four major online databases: BIOCARTA, KEGG, AMBION, and REACTOME.
We utilize the data from Children's Health Study (CHS), a 10-year prospective cohort study of
Southern California children originally proposed to investigate the chronic respiratory effects of
air pollutants. Our goal is to search for possible genetic pathway associations with BMI growth of
children from the age of 9 to 16. In order to test 1,258 with efficiency, we use the “fastARTP”
package proposed in Chapter 7.
8.1 Children Health Study (CHS)
The Children Health Study (CHS) is a 10-year prospective cohort study of Southern
California children that studies the chronic respiratory effects of air pollutants [Peters, Avol et al.
1999]. The children were recruited from schools across 16 Southern California communities
58
between 1993 and 2002. At baseline and subsequent follow-up years, parents/guardians of
children completed questionnaires assessing health and environmental exposures. In addition,
buccal cell samples from all subjects who are available are collected for genetic analysis. Detail
information regarding study design, subject recruitment and health assessment are reported
elsewhere [Navidi, Thomas et al. 1994].
This longitudinal study of childhood asthma and other respiratory outcomes also included
other physical measurements such as Body Mass Index (BMI). BMI is computed as weight (kg)
divided by height
2
(cm
2
), and it is a measure of relative weight based on an individual’s mass and
height. BMI is important because it is often used as a measurement rubric for obesity. An
individual with a BMI > 30 is typically considered obese.
The CHS has also genotyped cohorts B, C, D, and E (out of 5 cohorts total from A-E)
using the Illumina HumanHap550 platform that interrogates over 550,000 SNPs spanning the
genome. Previous findings suggest that there are six top SNPs from 5 regions that are associated
with respiratory outcomes [McConnell, Berhane et al. 2006]. However, there was no study yet to
investigate genetic associations with BMI in the CHS. We used weight and height measurements
taken at baseline at age 9 and at each annual follow-up time point until age 16 to compute BMI
(kg of weight/height in m
2
). A total of 1,376 children (720 female / 656 male) with at least two
time-point observations of BMI, and who had genome-wide data available were used in this
analysis. There were 455 (256 female / 199 male) self-reported Hispanics, and 921 (464 female /
457 male) self-reported non-Hispanic whites.
59
8.2 Genotypes and Ancestry Estimates
Buccal cells were collected from study subjects and genotyping was performed at the USC
Epigenome Center using the Illumina HumanHap550, HumanHap550-Duo and Human610-Quad
BeadChip microarrays. Genetic ancestry estimates were computed using the program
STRUCTURE [Pritchard, Stephens et al. 2000; Falush, Stephens et al. 2003; Falush, Stephens et
al. 2007] on 557 unlinked ancestry informative markers designed to distinguish between
populations of European, African, Native American, and East Asian ancestry. For each individual
in the study the ancestry proportion in each of these ancestral categories was estimated and then
used as covariates to adjust for potential population stratification.
8.3 Genes and Pathways Annotations
Four online databases were used to collect an extensive list of SNP-to-gene and gene-to-
pathway annotations: KEGG [Kanehisa and Goto 2000; Kanehisa, Goto et al. 2014], AMBION
[Prestvik, Hjerto et al. 2007], REACTOME [Milacic, Haw et al. 2012; Croft, Mundo et al. 2014],
and BIOCARTA [Nishimura 2004]. We identified a total of 1,258 pathways with genes and SNP
annotations. Overlap is present as each gene may be mapped to more than one pathways. These
pathways were then cross-referenced and matched with the typed SNPs available from the CHS
study.
8.4 Methods
Longitudinal BMI measurements were converted to normalized BMI-Z scores based on the
CDC growth charts by gender [Kuczmarski, Ogden et al. 2002]. To account for the non-linear BMI
growth pattern during puberty, a linear spline growth curve model was constructed, consisting of
60
three straight lines
over the age intervals of 9-11 years, 11-13 years, and 13-16 years constrained
to be connected
at the two knot points [Gauderman, Vora et al. 2007]:
J
>
ô
0õ
GJ¦
%
`
%
J¦
H
`
H
J£
!
where i indexes subject, j observation, Y is the outcome (e.g. BMI-Z score), T is the time variable
(e.g. age), `
%
and `
H
are constructed based on number of knots and age intervals. The
quantities
represents the longitudinal BMI-Z growth from age 9 to 16, and ¦
%
and ¦
H
parameterize the difference between the overall slope
and the spline segments per age interval.
Each subject will have its own
, representing BMI trajectory from age 9 to 16, and
is
then used as outcome in linear regression to regress on SNP and adjustment covariates. For
example, if we have longitudinal BMI-Z data on children from ages 9 to 16, we might want to
examine the effect of SNP on attained BMI-Z at age 16 (parameterized by
with C=16) and 7-
yr growth in BMI-Z from age 9 to 16 (parameterized by
with from L =9, U =16, and R=7).
However, using the same data from ages 11 to 14, we might want to examine the effect of SNP
on attained BMI-Z at age 13 (parameterized by
with C=13) and 3-yr growth in BMI-Z from
age 11 to 14 (parameterized by
with L=11, U=14, and R=3).
The BMI growth beta estimate bi for each child was the main outcome of interest in our
analysis. We regressed the bi on each SNP in the genomewide panel, coded additively, and
adjusted for ancestry informative markers as covariates in a marginal model using a linear
regression model of the form:
ñ
$
Jñ
§
!
J
"
!
J#
61
where
%
¨
!,
ª
(j=1,…,L for L total number of SNPs), "
!
ancestry
informative marker, and $
d
error term for ª
. The quantities ñ
$
intercept for j
th
marginal
model, ñ
parameter estimate for SNPj, and
parameter estimate for Zj.
From each regression we extracted the scores and fisher information matrices to compute
the variance-covariance matrix of the estimates of SNP effects within each pathway. For each
pathway we computed the ARTP test statistic and p-value using 20,000 samples from asymptotic
multivariate normal distribution of the SNP effect estimates using the derived variance-
covariance matrix. Pathways containing more than 1,000 SNPs are switched to the sampling by
gene framework due to the fact that a large number of SNPs require a large variance-covariance
matrix to be computed, which becomes computationally intensive. For each pathway we
computed both the one-level (summarizing all the SNPs at once in a pathway) and two-level
(summarize first within genes then summarize among genes) ARTP.
8.5 Results
The most significant pathways associated with the 7-year BMI growth in the CHS is
shown in Table 8.5. However, none of the p-values are significant after Bonferroni correction of
multiple testing by the 1,258 pathway tests performed; the Bonferroni corrected type I error
threshold is 3.97x10
-5
.
Table 8.5 Most significant pathways associated with 7-year BMI growth in CHS. None of
the pathways remain significant after Bonferroni correction
Pathway ARTP
One-Level
ARTP
Two-Level
Number
of Genes
Number
of SNPs
Circadian Rhythm Mammals 0.026 0.00018 13 259
Circadian clock in Mammals 0.048 0.00021 12 547
Complement and Coagulation Cascades 0.059 0.0039 64 1250
62
Figure 8.5 shows the one-level and two-level ARTP results for the 1,258 pathways. The
pathways are ordered by their relative size as measured by the number of SNPs and from small
to large on the x-axis. In panel a) the one-level ARTP p-values are plotted and in panel c) the
two-level ARTP p-values are plotted. In panel a) as the pathway size gets large the distribution
of –log (p-values) is skewed toward zero as expected under the null (panel b), uniformly
distributed p-values). As the pathway size gets smaller, the distribution is less skewed toward
zero. This is consistent with a test that is more powerful for smaller pathways or with associated
pathways that tend to be predominantly smaller. By contrast, panel c) does not show a
differential pattern of p-value distribution with pathway size. This is consistent with a test with
power that does not depend on pathway size or with a more uniform distribution of effects across
pathways of all sizes.
The two-level ARTP summarizes first the evidence of association within genes to get a
gene statistic, and then the gene statistics are summarized into a pathway statistic. The two-level
ARTP yielded more significant p-values in terms of magnitude and abundance. The association
signals improved in two-level ARTP when we focus on using genes as base units to summarize
the final pathway statistic. In the one-level ARTP, there exist many null SNPs, which will reduce
the magnitude of association signals as we use a SNP as the base unit to summarize the final
pathway statistic. The results then suggests that significant SNPs are spread across genes rather
than being concentrated in one gene.
63
Figure 8.5 a) One-level ARTP results for 1,258 pathways ordered from smallest to largest
on the x-axis. All SNPs are summarized at once to obtain a pathway statistic
Figure 8.5 b) Null distribution of p-values
a
b
64
Figure 8.5 c) Two-level ARTP results for 1,258 pathways ordered from smallest to largest
on the x-axis. All SNPs are first summarized within each gene. Then, all of the gene
statistics are used to obtain a pathway statistic
c
65
Chapter 9. Concluding Remarks and Future Work.
We expected that modeling the effect of all SNPs jointly would offer superior power and
should be preferred over marginal methods that simply combine individual SNP p-values.
Instead, although LASSO and GMRE performed well, they were consistently less powerful than
ARTP, even in scenarios where the logistic regression model used by the LASSO and GMRE
matched the simulated model. ARTP performs better since it forms a model selection that is
geared towards hypothesis testing (by maximizing the statistical evidence) while LASSO
performs model selection through tuning the penalty to maximize deviance. Alternative ways of
choosing the LASSO tuning parameter to improve the performance of the LASSO as a set-test its
worth investigating.
Nevertheless, ARTP could not be efficiently applied to a large number of tests and
therefore not feasible on a genomewide scale. This is due to the setback in computational time
caused by permutations. Instead, we proposed to derive the scores and fisher information matrix
of each marginal models then to compute a variance-covariance matrix among all marginal
parameter estimates of interest. The variance-covariance matrix has asymptotic properties of a
multivariate normal distribution in which we can sample permutation equivalents. The
permutation equivalents are then used in ARTP. Sampling from asymptotic variance-covariance
matrix is much faster and we created an R packaged called “fastARTP” that applies this method.
“fastARTP” was used in a genomewide pathway analysis application on Children’s
Health Study data. We scanned through 1,258 pathways with annotations collected from four
major online databases including KEGG, BIOCARTA, AMBION, and REACTOME. Although
none of the pathway showed a significant one-level ARTP (summarizing all SNPs in a pathway
at once) p-value, there were five marginally significant two-level ARTP (first summarize within
66
gene then summarize among genes in a pathway) p-values reported. Whether these five pathways
have relevant biological functions that may influence BMI growth remain to be investigated.
In Chapter 5, the results of the interaction effect in simulation scenario are as expected;
we confirmed that none of the methods would perform well in an interaction effect scenario. This
is because the joint modeling approaches and marginal p-values approaches do not include any
interaction term in their models. Alternatively one could specify an interaction model such as
!! ,
@
@
J,
d
d
J,
@
,
d
@d
(5)
where i = 1…L, j = 1…L, and Q².
@d
here is the coefficient of interaction effect between two
SNPs, and one could derive a Wald’s Test and the corresponding p-value for the interaction term
in the model. The interaction term p-values could be derived for all possible pairwise interactions
in a gene or in a pathway referencing model (5). If environmental variables are available, we
could have a similar model that has SNP x Environment interactions
!! ,
@
@
J;
B
d
J,
@
;
B
@B
(6)
where i = 1…L, k = 1…P (P total environmental variables), and
@B
is the SNP x Environment
interaction term. Again we could have all possible SNP by Environment interactions referencing
model (6). Aggregating all pairwise interaction effects in model (5) could be a test for within
gene, gene gene, or pathway pathway interaction. And, aggregating all pairwise interaction
effects in model (6) could be a test for gene environment or pathway environment
interaction.
Applying pathway-based methods such as FT or ARTP to combine interaction evidences
in model (5) or (6) pose two challenges. First, the number of pairwise interactions increases
exponentially as the number of total SNPs in a gene or the number of environmental variables
67
increase linearly; this imposes a computational burden on applying any of the pathway-based
methods. Second, the permutation scheme by shuffling phenotype values then repeat model (5)
or (6) regression is no longer valid. Anderson et al. (2001) has shown that permutation of
phenotype in model (5) or (6) leaves the interaction effect unchanged, and there is no exact
permutation method for testing the interaction term [Anderson 2001]. A solution to the
permutation scheme is to use parametric bootstrap instead proposed by Buzkova et al. (2011)
[Buzkova, Lumley et al. 2011]. The author shows that parametric bootstrap gives valid tests in
situations where permutations tests are not valid and with the similar computational costs. The
parametric bootstrap generates an empirical distribution of the interested test statistic with the
following steps:
1. Fit the original data in a model (6) regression without the interaction term (the null
model without interaction term) and obtain parameter estimates.
2. Sample phenotype responses with replacement from the model built in Step 1.
3. Use the sample responses to fit the original data in model (6) with the interaction term
now. Record the test statistic of the interaction term.
4. Repeat Steps 2 and 3 many times to obtain the distribution of the test statistic.
P-value can then be calculated by comparing the test statistic obtained from original data fitted in
model (6) to the distribution of test statistics generated from above steps. We could investigate
pathway environment interaction effects with this parametric bootstrap approach.
The bootstrap approach of testing for interactions should be included in the “fastARTP”
package. As a natural extension, a possible combination of marginal models with and without
68
interaction terms—based on user defined number of suspected interaction associations—for
ARTP remains to be investigated.
69
References
Ali, O. (2013). "Genetics of type 2 diabetes." World J Diabetes 4(4): 114-123.
Anderson, M. J. (2001). "Permutation tests for univariate or multivariate analysis of variance and
regression." Canadian Journal of Fisheries and Aquatic Sciences 58(3): 626-639.
Ashburner, M., C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K.
Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis,
S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin and G. Sherlock (2000).
"Gene ontology: tool for the unification of biology. The Gene Ontology Consortium." Nat
Genet 25(1): 25-29.
Breiman, L. (1999). "Random forests, random features." University of California,
Berkeley(Technical Report 567).
Breiman, L. (2001). "Random forests." Machine Learning 45(1): 5-32.
Breiman, L. and A. Cutler (2002). "Manual On Setting Up, Using, And Understanding Random
Forests V3.1." University of California, Berkeley.
Browning, S. R. and B. L. Browning (2007). "Rapid and accurate haplotype phasing and missing-
data inference for whole-genome association studies by use of localized haplotype
clustering." Am J Hum Genet 81(5): 1084-1097.
Buzkova, P., T. Lumley and K. Rice (2011). "Permutation and parametric bootstrap tests for gene-
gene and gene-environment interactions." Ann Hum Genet 75(1): 36-45.
Buzkova, P., T. Lumley and K. Rice (2011). "Permutation and Parametric Bootstrap Tests for
Gene-Gene and Gene-Environment Interactions." Annals of Human Genetics 75: 36-45.
Casella, G. and R. L. Berger (2002). Statistical Inference. USA, Duxbury.
Chaffee, P., A. Hubbard and M. van der Laan (2010). "Permutation-based Pathway Testing using
the Super Learner Algorithm." U.C. Berkeley Division of Biostatistics Working Paper
Series(263).
Chang, J. T. and J. R. Nevins (2006). "GATHER: a systems approach to interpreting genomic
signatures." Bioinformatics 22(23): 2926-2933.
Chasman, D. I. (2008). "On the utility of gene set methods in genomewide association studies of
quantitative traits." Genet Epidemiol 32(7): 658-668.
Chen, W., S. Li, N. R. Cook, B. A. Rosner, S. R. Srinivasan, E. Boerwinkle and G. S. Berenson
(2004). "An autosomal genome scan for loci influencing longitudinal burden of body mass
index from childhood to young adulthood in white sibships: The Bogalusa Heart Study."
Int J Obes Relat Metab Disord 28(4): 462-469.
Conroy, B. and P. Sajda (2012). "Fast, Exact Model Selection and Permutation Testing for l-
Regularized Logistic Regression." JMLR Workshop Conf Proc 22: 246-254.
Cornes, B. K., G. Zhu and N. G. Martin (2007). "Sex differences in genetic variation in weight: a
longitudinal study of body mass index in adolescent twins." Behav Genet 37(5): 648-660.
Croft, D., A. F. Mundo, R. Haw, M. Milacic, J. Weiser, G. Wu, M. Caudy, P. Garapati, M.
Gillespie, M. R. Kamdar, B. Jassal, S. Jupe, L. Matthews, B. May, S. Palatnik, K. Rothfels,
V. Shamovsky, H. Song, M. Williams, E. Birney, H. Hermjakob, L. Stein and P.
D'Eustachio (2014). "The Reactome pathway knowledgebase." Nucleic Acids Res
42(Database issue): D472-477.
De la Cruz, O., X. Q. Wen, B. G. Ke, M. S. Song and D. L. Nicolae (2010). "Gene, Region and
Pathway Level Analyses in Whole-Genome Studies." Genetic Epidemiology 34(3): 222-
231.
70
Dina, C., D. Meyre, S. Gallina, E. Durand, A. Korner, P. Jacobson, L. M. Carlsson, W. Kiess, V.
Vatin, C. Lecoeur, J. Delplanque, E. Vaillant, F. Pattou, J. Ruiz, J. Weill, C. Levy-Marchal,
F. Horber, N. Potoczna, S. Hercberg, C. Le Stunff, P. Bougneres, P. Kovacs, M. Marre, B.
Balkau, S. Cauchi, J. C. Chevre and P. Froguel (2007). "Variation in FTO contributes to
childhood obesity and severe adult obesity." Nat Genet 39(6): 724-726.
Efron, B., T. Hastie, I. Johnstone and R. Tibshirani (2004). "Least angle regression." Annals of
Statistics 32(2): 407-451.
Efron, B. and D. V. Hinkley (1978). "Assessing Accuracy of Maximum Likelihood Estimator -
Observed Versus Expected Fisher Information." Biometrika 65(3): 457-482.
Eichler, E. E., J. Flint, G. Gibson, A. Kong, S. M. Leal, J. H. Moore and J. H. Nadeau (2010).
"Missing heritability and strategies for finding the underlying causes of complex disease."
Nat Rev Genet 11(6): 446-450.
Elks, C. E., R. J. Loos, R. Hardy, A. K. Wills, A. Wong, N. J. Wareham, D. Kuh and K. K. Ong
"Adult obesity susceptibility variants are associated with greater childhood weight gain and
a faster tempo of growth: the 1946 British Birth Cohort Study." Am J Clin Nutr 95(5):
1150-1156.
Elston, R. C. (1991). "On Fisher Method of Combining P-Values." Biometrical Journal 33(3): 339-
345.
Falush, D., M. Stephens and J. K. Pritchard (2003). "Inference of population structure using
multilocus genotype data: linked loci and correlated allele frequencies." Genetics 164(4):
1567-1587.
Falush, D., M. Stephens and J. K. Pritchard (2007). "Inference of population structure using
multilocus genotype data: dominant markers and null alleles." Mol Ecol Notes 7(4): 574-
578.
Frayling, T. M., N. J. Timpson, M. N. Weedon, E. Zeggini, R. M. Freathy, C. M. Lindgren, J. R.
Perry, K. S. Elliott, H. Lango, N. W. Rayner, B. Shields, L. W. Harries, J. C. Barrett, S.
Ellard, C. J. Groves, B. Knight, A. M. Patch, A. R. Ness, S. Ebrahim, D. A. Lawlor, S. M.
Ring, Y. Ben-Shlomo, M. R. Jarvelin, U. Sovio, A. J. Bennett, D. Melzer, L. Ferrucci, R.
J. Loos, I. Barroso, N. J. Wareham, F. Karpe, K. R. Owen, L. R. Cardon, M. Walker, G. A.
Hitman, C. N. Palmer, A. S. Doney, A. D. Morris, G. D. Smith, A. T. Hattersley and M. I.
McCarthy (2007). "A common variant in the FTO gene is associated with body mass index
and predisposes to childhood and adult obesity." Science 316(5826): 889-894.
Fridley, B. L. and J. M. Biernacka (2011). "Gene set analysis of SNP data: benefits, challenges,
and future directions." Eur J Hum Genet 19(8): 837-843.
Fridley, B. L., G. D. Jenkins and J. M. Biernacka (2010). "Self-contained gene-set analysis of
expression data: an evaluation of existing and novel methods." PLoS One 5(9).
Fu, W. J. (2003). "Penalized estimating equations." Biometrics 59(1): 126-132.
Gauderman, W. J., E. Avol, F. Gilliland, H. Vora, D. Thomas, K. Berhane, R. McConnell, N.
Kuenzli, F. Lurmann, E. Rappaport, H. Margolis, D. Bates and J. Peters (2004). "The effect
of air pollution on lung development from 10 to 18 years of age." N Engl J Med 351(11):
1057-1067.
Gauderman, W. J., C. Murcray, F. Gilliland and D. V. Conti (2007). "Testing association between
disease and multiple SNPs in a candidate gene." Genetic Epidemiology 31(5): 383-395.
Gauderman, W. J., H. Vora, R. McConnell, K. Berhane, F. Gilliland, D. Thomas, F. Lurmann, E.
Avol, N. Kunzli, M. Jerrett and J. Peters (2007). "Effect of exposure to traffic on lung
development from 10 to 18 years of age: a cohort study." Lancet 369(9561): 571-577.
71
Ge, Y. C., S. Dudoit and T. P. Speed (2003). "Resampling-based multiple testing for microarray
data analysis." Test 12(1): 1-77.
Goeman, J. J. (2010). "L1 penalized estimation in the Cox proportional hazards model." Biom J
52(1): 70-84.
Goeman, J. J. and P. Buhlmann (2007). "Analyzing gene expression data in terms of gene sets:
methodological issues." Bioinformatics 23(8): 980-987.
Goeman, J. J., S. A. van de Geer, F. de Kort and H. C. van Houwelingen (2004). "A global test for
groups of genes: testing association with a clinical outcome." Bioinformatics 20(1): 93-99.
Goeman, J. J., S. A. van de Geer and H. C. van Houwelingen (2006). "Testing against a high
dimensional alternative." Journal of the Royal Statistical Society Series B-Statistical
Methodology 68: 477-493.
Goode, E. L., S. S. Cherny, J. C. Christian, G. P. Jarvik and M. de Andrade (2007). "Heritability
of longitudinal measures of body mass index and lipid and lipoprotein levels in aging
twins." Twin Res Hum Genet 10(5): 703-711.
Gower, J. C. and K. V. Mardia (1974). "Multivariate-Analysis and Its Applications - Report on
Hull Conference, 1973." Journal of the Royal Statistical Society Series C-Applied Statistics
23(1): 60-66.
Hallman, D. M., V. C. Friedel, M. A. Eissa, E. Boerwinkle, J. C. Huber, Jr., R. B. Harrist, S. R.
Srinivasan, W. Chen, S. Dai, D. R. Labarthe and G. S. Berenson "The association of
variants in the FTO gene with longitudinal body mass index profiles in non-Hispanic white
children and adolescents." Int J Obes (Lond) 36(1): 61-68.
Hardy, R., A. K. Wills, A. Wong, C. E. Elks, N. J. Wareham, R. J. Loos, D. Kuh and K. K. Ong
"Life course variations in the associations between FTO and MC4R gene variants and body
size." Hum Mol Genet 19(3): 545-552.
Haworth, C. M., S. Carnell, E. L. Meaburn, O. S. Davis, R. Plomin and J. Wardle (2008).
"Increasing heritability of BMI and stronger associations with the FTO gene over
childhood." Obesity (Silver Spring) 16(12): 2663-2668.
Hoerl, A. E. and R. W. Kennard (1970). "Ridge Regression - Biased Estimation for Nonorthogonal
Problems." Technometrics 12(1): 55-&.
Holden, M., S. Deng, L. Wojnowski and B. Kulle (2008). "GSEA-SNP: applying gene set
enrichment analysis to SNP data from genome-wide association studies." Bioinformatics
24(23): 2784-2785.
Horvath, S., T. Shi, R. M. Shai, C. Chen and S. Nelson (2007). "Statistical Methods Supplement
and R software tutorial: Gene Filtering with a Random Forest Predictor." UCLA tutorial.
Houwing-Duistermaat, J. J., B. H. Derkx, F. R. Rosendaal and H. C. van Houwelingen (1995).
"Testing familial aggregation." Biometrics 51(4): 1292-1301.
Huang, D. W., B. T. Sherman and R. A. Lempicki (2009). "Systematic and integrative analysis of
large gene lists using DAVID bioinformatics resources." Nature Protocols 4(1): 44-57.
Huang, Y. T. and X. H. Lin (2013). "Gene set analysis using variance component tests." Bmc
Bioinformatics 14.
International HapMap, C. (2005). "A haplotype map of the human genome." Nature 437(7063):
1299-1320.
Irizarry, R. A., C. Wang, Y. Zhou and T. P. Speed (2011). "Gene set enrichment analysis made
simple (vol 18, pg 565, 2009)." Statistical Methods in Medical Research 20(5): 571-571.
Jacobson, K. C. and D. C. Rowe (1998). "Genetic and shared environmental influences on
adolescent BMI: interactions with race and sex." Behav Genet 28(4): 265-278.
72
Kanehisa, M. and S. Goto (2000). "KEGG: Kyoto Encyclopedia of Genes and Genomes." Nucleic
Acids Research 28(1): 27-30.
Kanehisa, M., S. Goto, Y. Sato, M. Kawashima, M. Furumichi and M. Tanabe (2014). "Data,
information, knowledge and principle: back to metabolism in KEGG." Nucleic Acids
Research 42(D1): D199-D205.
Kim, B. S., K. I, S. Lee, S. Kim, S. Y. Rha and H. C. Chung (2005). "Statistical methods of
translating microarray data into clinically relevant diagnostic information in colorectal
cancer." Bioinformatics 21(4): 517-528.
Kim, S. H., Y. K. Kim, H. W. Park, Y. K. Jee, S. H. Kim, J. W. Bahn, Y. S. Chang, S. H. Kim, Y.
M. Ye, E. S. Shin, J. E. Lee, H. S. Park and K. U. Min (2007). "Association between
polymorphisms in prostanoid receptor genes and aspirin-intolerant asthma."
Pharmacogenet Genomics 17(4): 295-304.
Klein, R. J., C. Zeiss, E. Y. Chew, J. Y. Tsai, R. S. Sackler, C. Haynes, A. K. Henning, J. P.
SanGiovanni, S. M. Mane, S. T. Mayne, M. B. Bracken, F. L. Ferris, J. Ott, C. Barnstable
and J. Hoh (2005). "Complement factor H polymorphism in age-related macular
degeneration." Science 308(5720): 385-389.
Koeppen-Schomerus, G., J. Wardle and R. Plomin (2001). "A genetic analysis of weight and
overweight in 4-year-old twin pairs." Int J Obes Relat Metab Disord 25(6): 838-844.
Kosti, R. I. and D. B. Panagiotakos (2006). "The epidemic of obesity in children and adolescents
in the world." Cent Eur J Public Health 14(4): 151-159.
Kramer, A., J. Green, J. Pollard, Jr. and S. Tugendreich (2014). "Causal analysis approaches in
Ingenuity Pathway Analysis." Bioinformatics 30(4): 523-530.
Kuczmarski, R. J., C. L. Ogden, S. S. Guo, L. M. Grummer-Strawn, K. M. Flegal, Z. Mei, R. Wei,
L. R. Curtin, A. F. Roche and C. L. Johnson (2002). "2000 CDC Growth Charts for the
United States: methods and development." Vital Health Stat 11(246): 1-190.
le Cessie, S. and H. C. van Houwelingen (1995). "Testing the fit of a regression model via score
tests in random effects models." Biometrics 51(2): 600-614.
Lindsey, J. K. (1999). "A review of some extensions to generalized linear models." Statistics in
Medicine 18(17-18): 2223-2236.
Lu, Y., P. Y. Liu, P. Xiao and H. W. Deng (2005). "Hotelling's T2 multivariate profiling for
detecting differential expression in microarrays." Bioinformatics 21(14): 3105-3113.
Maciejewski, H. (2014). "Gene set analysis methods: statistical models and methodological
differences." Briefings in Bioinformatics 15(4): 504-518.
McConnell, R., K. Berhane, L. Yao, M. Jerrett, F. Lurmann, F. Gilliland, N. Kunzli, J. Gauderman,
E. Avol, D. Thomas and J. Peters (2006). "Traffic, susceptibility, and childhood asthma."
Environmental Health Perspectives 114(5): 766-772.
Mevik, B. H. and R. Wehrens (2007). "The pls package: Principal component and partial least
squares regression in R." Journal of Statistical Software 18(2): 1-23.
Milacic, M., R. Haw, K. Rothfels, G. Wu, D. Croft, H. Hermjakob, P. D'Eustachio and L. Stein
(2012). "Annotating cancer variants and anti-cancer therapeutics in reactome." Cancers
(Basel) 4(4): 1180-1211.
Navidi, W., D. Thomas, D. Stram and J. Peters (1994). "Design and analysis of multilevel analytic
studies with applications to a study of air pollution." Environ Health Perspect 102 Suppl
8: 25-32.
Nishimura, D. (2004). "A View From the Web BioCarta." Biotech Software & Internet Report
2(3): 117-120.
73
Ohashi, J., I. Naka, R. Kimura, K. Natsuhara, T. Yamauchi, T. Furusawa, M. Nakazawa, Y. Ataka,
J. Patarapotikul, P. Nuchnoi, K. Tokunaga, T. Ishida, T. Inaoka, Y. Matsumura and R.
Ohtsuka (2007). "FTO polymorphisms in oceanic populations." J Hum Genet 52(12):
1031-1035.
Peters, J. M., E. Avol, W. J. Gauderman, W. S. Linn, W. Navidi, S. J. London, H. Margolis, E.
Rappaport, H. Vora, H. Gong and D. C. Thomas (1999). "A study of twelve southern
California communities with differing levels and types of air pollution - II. Effects on
pulmonary function." American Journal of Respiratory and Critical Care Medicine 159(3):
768-775.
Pietilainen, K. H., J. Kaprio, A. Rissanen, T. Winter, A. Rimpela, R. J. Viken and R. J. Rose (1999).
"Distribution and heritability of BMI in Finnish adolescents aged 16y and 17y: a study of
4884 twins and 2509 singletons." Int J Obes Relat Metab Disord 23(2): 107-115.
Polley, E. (2010). Super Learner. Berkeley, CA,: 175 leaves.
Prestvik, W. S., A. L. Hjerto, T. S. Steigedal and L. Thommesen (2007). "RNA and protein clean-
up from the same specimen. Comparison between the Qiagen and Ambion protocols."
Scandinavian Journal of Clinical & Laboratory Investigation 67(8): 885-891.
Pritchard, J. K., M. Stephens and P. Donnelly (2000). "Inference of population structure using
multilocus genotype data." Genetics 155(2): 945-959.
R, M. S., C. CD, S. T, H. S, N. SF, R. JKV and S. CL (2007). "IGFBP2 is a Biomarker for PTEN
Status and PI3K/Akt Pathway Activation in Glioblastoma and Prostate Cancer." PNAS.
Scuteri, A., S. Sanna, W. M. Chen, M. Uda, G. Albai, J. Strait, S. Najjar, R. Nagaraja, M. Orru, G.
Usala, M. Dei, S. Lai, A. Maschio, F. Busonero, A. Mulas, G. B. Ehret, A. A. Fink, A. B.
Weder, R. S. Cooper, P. Galan, A. Chakravarti, D. Schlessinger, A. Cao, E. Lakatta and G.
R. Abecasis (2007). "Genome-wide association scan shows genetic variants in the FTO
gene are associated with obesity-related traits." PLoS Genet 3(7): e115.
Sladek, R., G. Rocheleau, J. Rung, C. Dina, L. Shen, D. Serre, P. Boutin, D. Vincent, A. Belisle,
S. Hadjadj, B. Balkau, B. Heude, G. Charpentier, T. J. Hudson, A. Montpetit, A. V.
Pshezhetsky, M. Prentki, B. I. Posner, D. J. Balding, D. Meyre, C. Polychronakos and P.
Froguel (2007). "A genome-wide association study identifies novel risk loci for type 2
diabetes." Nature 445(7130): 881-885.
Speliotes, E. K., C. J. Willer, S. I. Berndt, K. L. Monda, G. Thorleifsson, A. U. Jackson, H. L.
Allen, C. M. Lindgren, J. Luan, R. Magi, J. C. Randall, S. Vedantam, T. W. Winkler, L.
Qi, T. Workalemahu, I. M. Heid, V. Steinthorsdottir, H. M. Stringham, M. N. Weedon, E.
Wheeler, A. R. Wood, T. Ferreira, R. J. Weyant, A. V. Segre, K. Estrada, L. Liang, J.
Nemesh, J. H. Park, S. Gustafsson, T. O. Kilpelainen, J. Yang, N. Bouatia-Naji, T. Esko,
M. F. Feitosa, Z. Kutalik, M. Mangino, S. Raychaudhuri, A. Scherag, A. V. Smith, R.
Welch, J. H. Zhao, K. K. Aben, D. M. Absher, N. Amin, A. L. Dixon, E. Fisher, N. L.
Glazer, M. E. Goddard, N. L. Heard-Costa, V. Hoesel, J. J. Hottenga, A. Johansson, T.
Johnson, S. Ketkar, C. Lamina, S. Li, M. F. Moffatt, R. H. Myers, N. Narisu, J. R. Perry,
M. J. Peters, M. Preuss, S. Ripatti, F. Rivadeneira, C. Sandholt, L. J. Scott, N. J. Timpson,
J. P. Tyrer, S. van Wingerden, R. M. Watanabe, C. C. White, F. Wiklund, C. Barlassina,
D. I. Chasman, M. N. Cooper, J. O. Jansson, R. W. Lawrence, N. Pellikka, I. Prokopenko,
J. Shi, E. Thiering, H. Alavere, M. T. Alibrandi, P. Almgren, A. M. Arnold, T. Aspelund,
L. D. Atwood, B. Balkau, A. J. Balmforth, A. J. Bennett, Y. Ben-Shlomo, R. N. Bergman,
S. Bergmann, H. Biebermann, A. I. Blakemore, T. Boes, L. L. Bonnycastle, S. R. Bornstein,
M. J. Brown, T. A. Buchanan, F. Busonero, H. Campbell, F. P. Cappuccio, C. Cavalcanti-
74
Proenca, Y. D. Chen, C. M. Chen, P. S. Chines, R. Clarke, L. Coin, J. Connell, I. N. Day,
M. D. Heijer, J. Duan, S. Ebrahim, P. Elliott, R. Elosua, G. Eiriksdottir, M. R. Erdos, J. G.
Eriksson, M. F. Facheris, S. B. Felix, P. Fischer-Posovszky, A. R. Folsom, N. Friedrich, N.
B. Freimer, M. Fu, S. Gaget, P. V. Gejman, E. J. Geus, C. Gieger, A. P. Gjesing, A. Goel,
P. Goyette, H. Grallert, J. Grassler, D. M. Greenawalt, C. J. Groves, V. Gudnason, C.
Guiducci, A. L. Hartikainen, N. Hassanali, A. S. Hall, A. S. Havulinna, C. Hayward, A. C.
Heath, C. Hengstenberg, A. A. Hicks, A. Hinney, A. Hofman, G. Homuth, J. Hui, W. Igl,
C. Iribarren, B. Isomaa, K. B. Jacobs, I. Jarick, E. Jewell, U. John, T. Jorgensen, P.
Jousilahti, A. Jula, M. Kaakinen, E. Kajantie, L. M. Kaplan, S. Kathiresan, J. Kettunen, L.
Kinnunen, J. W. Knowles, I. Kolcic, I. R. Konig, S. Koskinen, P. Kovacs, J. Kuusisto, P.
Kraft, K. Kvaloy, J. Laitinen, O. Lantieri, C. Lanzani, L. J. Launer, C. Lecoeur, T.
Lehtimaki, G. Lettre, J. Liu, M. L. Lokki, M. Lorentzon, R. N. Luben, B. Ludwig, P.
Manunta, D. Marek, M. Marre, N. G. Martin, W. L. McArdle, A. McCarthy, B. McKnight,
T. Meitinger, O. Melander, D. Meyre, K. Midthjell, G. W. Montgomery, M. A. Morken,
A. P. Morris, R. Mulic, J. S. Ngwa, M. Nelis, M. J. Neville, D. R. Nyholt, C. J. O'Donnell,
S. O'Rahilly, K. K. Ong, B. Oostra, G. Pare, A. N. Parker, M. Perola, I. Pichler, K. H.
Pietilainen, C. G. Platou, O. Polasek, A. Pouta, S. Rafelt, O. Raitakari, N. W. Rayner, M.
Ridderstrale, W. Rief, A. Ruokonen, N. R. Robertson, P. Rzehak, V. Salomaa, A. R.
Sanders, M. S. Sandhu, S. Sanna, J. Saramies, M. J. Savolainen, S. Scherag, S. Schipf, S.
Schreiber, H. Schunkert, K. Silander, J. Sinisalo, D. S. Siscovick, J. H. Smit, N. Soranzo,
U. Sovio, J. Stephens, I. Surakka, A. J. Swift, M. L. Tammesoo, J. C. Tardif, M. Teder-
Laving, T. M. Teslovich, J. R. Thompson, B. Thomson, A. Tonjes, T. Tuomi, J. B. van
Meurs, G. J. van Ommen, V. Vatin, J. Viikari, S. Visvikis-Siest, V. Vitart, C. I. Vogel, B.
F. Voight, L. L. Waite, H. Wallaschofski, G. B. Walters, E. Widen, S. Wiegand, S. H. Wild,
G. Willemsen, D. R. Witte, J. C. Witteman, J. Xu, Q. Zhang, L. Zgaga, A. Ziegler, P.
Zitting, J. P. Beilby, I. S. Farooqi, J. Hebebrand, H. V. Huikuri, A. L. James, M. Kahonen,
D. F. Levinson, F. Macciardi, M. S. Nieminen, C. Ohlsson, L. J. Palmer, P. M. Ridker, M.
Stumvoll, J. S. Beckmann, H. Boeing, E. Boerwinkle, D. I. Boomsma, M. J. Caulfield, S.
J. Chanock, F. S. Collins, L. A. Cupples, G. D. Smith, J. Erdmann, P. Froguel, H. Gronberg,
U. Gyllensten, P. Hall, T. Hansen, T. B. Harris, A. T. Hattersley, R. B. Hayes, J. Heinrich,
F. B. Hu, K. Hveem, T. Illig, M. R. Jarvelin, J. Kaprio, F. Karpe, K. T. Khaw, L. A.
Kiemeney, H. Krude, M. Laakso, D. A. Lawlor, A. Metspalu, P. B. Munroe, W. H.
Ouwehand, O. Pedersen, B. W. Penninx, A. Peters, P. P. Pramstaller, T. Quertermous, T.
Reinehr, A. Rissanen, I. Rudan, N. J. Samani, P. E. Schwarz, A. R. Shuldiner, T. D. Spector,
J. Tuomilehto, M. Uda, A. Uitterlinden, T. T. Valle, M. Wabitsch, G. Waeber, N. J.
Wareham, H. Watkins, J. F. Wilson, A. F. Wright, M. C. Zillikens, N. Chatterjee, S. A.
McCarroll, S. Purcell, E. E. Schadt, P. M. Visscher, T. L. Assimes, I. B. Borecki, P.
Deloukas, C. S. Fox, L. C. Groop, T. Haritunians, D. J. Hunter, R. C. Kaplan, K. L. Mohlke,
J. R. O'Connell, L. Peltonen, D. Schlessinger, D. P. Strachan, C. M. van Duijn, H. E.
Wichmann, T. M. Frayling, U. Thorsteinsdottir, G. R. Abecasis, I. Barroso, M. Boehnke,
K. Stefansson, K. E. North, I. M. M, J. N. Hirschhorn, E. Ingelsson and R. J. Loos
"Association analyses of 249,796 individuals reveal 18 new loci associated with body mass
index." Nat Genet 42(11): 937-948.
Steinthorsdottir, V., G. Thorleifsson, I. Reynisdottir, R. Benediktsson, T. Jonsdottir, G. B. Walters,
U. Styrkarsdottir, S. Gretarsdottir, V. Emilsson, S. Ghosh, A. Baker, S. Snorradottir, H.
Bjarnason, M. C. Ng, T. Hansen, Y. Bagger, R. L. Wilensky, M. P. Reilly, A. Adeyemo,
75
Y. Chen, J. Zhou, V. Gudnason, G. Chen, H. Huang, K. Lashley, A. Doumatey, W. Y. So,
R. C. Ma, G. Andersen, K. Borch-Johnsen, T. Jorgensen, J. V. van Vliet-Ostaptchouk, M.
H. Hofker, C. Wijmenga, C. Christiansen, D. J. Rader, C. Rotimi, M. Gurney, J. C. Chan,
O. Pedersen, G. Sigurdsson, J. R. Gulcher, U. Thorsteinsdottir, A. Kong and K. Stefansson
(2007). "A variant in CDKAL1 influences insulin response and risk of type 2 diabetes."
Nat Genet 39(6): 770-775.
Subramanian, A., P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A.
Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander and J. P. Mesirov (2005). "Gene set
enrichment analysis: a knowledge-based approach for interpreting genome-wide
expression profiles." Proc Natl Acad Sci U S A 102(43): 15545-15550.
Sun, X., W. H. Yu and C. Hu (2014). "Genetics of Type 2 Diabetes: Insights into the Pathogenesis
and Its Clinical Application." Biomed Research International.
Svetnik, V., A. Liaw, C. Tong, J. C. Culberson, R. P. Sheridan and B. P. Feuston (2003). "Random
forest: A classification and regression tool for compound classification and QSAR
modeling." Journal of Chemical Information and Computer Sciences 43(6): 1947-1958.
Szabo, A., K. Boucher, D. Jones, A. D. Tsodikov, L. B. Klebanov and A. Y. Yakovlev (2003).
"Multivariate exploratory tools for microarray data analysis." Biostatistics 4(4): 555-567.
Thomas, P. D., M. J. Campbell, A. Kejariwal, H. Mi, B. Karlak, R. Daverman, K. Diemer, A.
Muruganujan and A. Narechania (2003). "PANTHER: a library of protein families and
subfamilies indexed by function." Genome Res 13(9): 2129-2141.
Thorleifsson, G., G. B. Walters, D. F. Gudbjartsson, V. Steinthorsdottir, P. Sulem, A. Helgadottir,
U. Styrkarsdottir, S. Gretarsdottir, S. Thorlacius, I. Jonsdottir, T. Jonsdottir, E. J.
Olafsdottir, G. H. Olafsdottir, T. Jonsson, F. Jonsson, K. Borch-Johnsen, T. Hansen, G.
Andersen, T. Jorgensen, T. Lauritzen, K. K. Aben, A. L. Verbeek, N. Roeleveld, E.
Kampman, L. R. Yanek, L. C. Becker, L. Tryggvadottir, T. Rafnar, D. M. Becker, J.
Gulcher, L. A. Kiemeney, O. Pedersen, A. Kong, U. Thorsteinsdottir and K. Stefansson
(2009). "Genome-wide association yields new sequence variants at seven loci that associate
with measures of obesity." Nat Genet 41(1): 18-24.
Tibshirani, R. (1996). "Regression shrinkage and selection via the Lasso." Journal of the Royal
Statistical Society Series B-Methodological 58(1): 267-288.
Tsai, C. A. and J. J. Chen (2009). "Multivariate analysis of variance test for gene set analysis."
Bioinformatics 25(7): 897-903.
Wang, K., M. Li and M. Bucan (2007). "Pathway-based approaches for analysis of genomewide
association studies." Am J Hum Genet 81(6): 1278-1283.
Wang, K., M. Li and H. Hakonarson (2010). "Analysing biological pathways in genome-wide
association studies." Nat Rev Genet 11(12): 843-854.
Wang, T., G. Ho, K. Ye, H. Strickler and R. C. Elston (2009). "A Partial Least-Square Approach
for Modeling Gene-Gene and Gene-Environment Interactions When Multiple Markers Are
Genotyped." Genetic Epidemiology 33(1): 6-15.
Wang, Y. and T. Lobstein (2006). "Worldwide trends in childhood overweight and obesity." Int J
Pediatr Obes 1(1): 11-25.
Wardle, J., S. Carnell, C. M. Haworth and R. Plomin (2008). "Evidence for a strong genetic
influence on childhood adiposity despite the force of the obesogenic environment." Am J
Clin Nutr 87(2): 398-404.
76
Welter, D., J. MacArthur, J. Morales, T. Burdett, P. Hall, H. Junkins, A. Klemm, P. Flicek, T.
Manolio, L. Hindorff and H. Parkinson (2014). "The NHGRI GWAS Catalog, a curated
resource of SNP-trait associations." Nucleic Acids Research 42(D1): D1001-D1006.
Willer, C. J., E. K. Speliotes, R. J. Loos, S. Li, C. M. Lindgren, I. M. Heid, S. I. Berndt, A. L.
Elliott, A. U. Jackson, C. Lamina, G. Lettre, N. Lim, H. N. Lyon, S. A. McCarroll, K.
Papadakis, L. Qi, J. C. Randall, R. M. Roccasecca, S. Sanna, P. Scheet, M. N. Weedon, E.
Wheeler, J. H. Zhao, L. C. Jacobs, I. Prokopenko, N. Soranzo, T. Tanaka, N. J. Timpson,
P. Almgren, A. Bennett, R. N. Bergman, S. A. Bingham, L. L. Bonnycastle, M. Brown, N.
P. Burtt, P. Chines, L. Coin, F. S. Collins, J. M. Connell, C. Cooper, G. D. Smith, E. M.
Dennison, P. Deodhar, P. Elliott, M. R. Erdos, K. Estrada, D. M. Evans, L. Gianniny, C.
Gieger, C. J. Gillson, C. Guiducci, R. Hackett, D. Hadley, A. S. Hall, A. S. Havulinna, J.
Hebebrand, A. Hofman, B. Isomaa, K. B. Jacobs, T. Johnson, P. Jousilahti, Z. Jovanovic,
K. T. Khaw, P. Kraft, M. Kuokkanen, J. Kuusisto, J. Laitinen, E. G. Lakatta, J. Luan, R. N.
Luben, M. Mangino, W. L. McArdle, T. Meitinger, A. Mulas, P. B. Munroe, N. Narisu, A.
R. Ness, K. Northstone, S. O'Rahilly, C. Purmann, M. G. Rees, M. Ridderstrale, S. M. Ring,
F. Rivadeneira, A. Ruokonen, M. S. Sandhu, J. Saramies, L. J. Scott, A. Scuteri, K. Silander,
M. A. Sims, K. Song, J. Stephens, S. Stevens, H. M. Stringham, Y. C. Tung, T. T. Valle,
C. M. Van Duijn, K. S. Vimaleswaran, P. Vollenweider, G. Waeber, C. Wallace, R. M.
Watanabe, D. M. Waterworth, N. Watkins, J. C. Witteman, E. Zeggini, G. Zhai, M. C.
Zillikens, D. Altshuler, M. J. Caulfield, S. J. Chanock, I. S. Farooqi, L. Ferrucci, J. M.
Guralnik, A. T. Hattersley, F. B. Hu, M. R. Jarvelin, M. Laakso, V. Mooser, K. K. Ong,
W. H. Ouwehand, V. Salomaa, N. J. Samani, T. D. Spector, T. Tuomi, J. Tuomilehto, M.
Uda, A. G. Uitterlinden, N. J. Wareham, P. Deloukas, T. M. Frayling, L. C. Groop, R. B.
Hayes, D. J. Hunter, K. L. Mohlke, L. Peltonen, D. Schlessinger, D. P. Strachan, H. E.
Wichmann, M. I. McCarthy, M. Boehnke, I. Barroso, G. R. Abecasis and J. N. Hirschhorn
(2009). "Six new loci associated with body mass index highlight a neuronal influence on
body weight regulation." Nat Genet 41(1): 25-34.
Yu, K., Q. Li, A. W. Bergen, R. M. Pfeiffer, P. S. Rosenberg, N. Caporaso, P. Kraft and N.
Chatterjee (2009). "Pathway analysis by adaptive combination of P-values." Genet
Epidemiol 33(8): 700-709.
Zhang, K., S. Cui, S. Chang, L. Zhang and J. Wang (2010). "i-GSEA4GWAS: a web server for
identification of pathways/gene sets associated with traits by applying an improved gene
set enrichment analysis to genome-wide association study." Nucleic Acids Res 38(Web
Server issue): W90-95.
Zhao, J., J. P. Bradfield, M. Li, K. Wang, H. Zhang, C. E. Kim, K. Annaiah, J. T. Glessner, K.
Thomas, M. Garris, E. C. Frackelton, F. G. Otieno, J. L. Shaner, R. M. Smith, R. M.
Chiavacci, R. I. Berkowitz, H. Hakonarson and S. F. Grant (2009). "The role of obesity-
associated loci identified in genome-wide association studies in the determination of
pediatric BMI." Obesity (Silver Spring) 17(12): 2254-2257.
Zhao, J., J. P. Bradfield, H. Zhang, P. M. Sleiman, C. E. Kim, J. T. Glessner, S. Deliard, K. A.
Thomas, E. C. Frackelton, M. Li, R. M. Chiavacci, R. I. Berkowitz, H. Hakonarson and S.
F. Grant "Role of BMI-associated loci identified in GWAS meta-analyses in the context of
common childhood obesity in European Americans." Obesity (Silver Spring) 19(12): 2436-
2439.
77
Zou, H. and T. Hastie (2005). "Regularization and variable selection via the elastic net (vol B 67,
pg 301, 2005)." Journal of the Royal Statistical Society Series B-Statistical Methodology
67: 768-768.
78
Appendix A: R Code for fastARTP Package
## Program to run ARTP on all pathways ##
rm(list=ls())
library(plyr)
library(MASS)
library(HardyWeinberg)
library(biglm)
library(ff)
################### ARTP RELATED FUNCTIONS ################
### Rank Truncation Product
CumSumLog <- function(p) cumsum(log(p))
RTP <- function(pvals)
{
pvals <- as.vector(pvals)
l <- length(pvals)
pvals <- sort(pvals)
#if(l <= 5){
truncPoints <- 1:l
#}
#if(l > 5){
# truncPoints <- (1:5)*max(1, floor(l/20))
#}
CumSumLog(pvals)[truncPoints]
79
}
### Ge's algorithm, current_pval <= rest of pval
Ge_algo = function(W)
{
rank(W, ties.method="min")/length(W)
}
### ARTP function
artp=function(x){
W=apply(x, 2, RTP)
S=t(apply(W, 1, Ge_algo))
S.min=t(data.matrix(apply(S,2,which.min)))
MinP=t(data.matrix(apply(S, 2, min)))
ARTP=t(apply(MinP, 1, Ge_algo))
list(ARTP=ARTP, S.min=S.min)
}
### Marginal model function ###
#x=chs.snps[,1]
testSNP = function(x, y, y.type, cov.name=NULL, cov.data=NULL, type=NULL){
y=as.numeric(y)
int.yx=which(!is.na(y) & !is.na(x))
if(!is.null(cov.name) & !is.null(cov.data)){
int.ind=intersect(int.yx,
which(complete.cases(cov.data[,cov.name])))
cov.data=cov.data[int.ind,]
} else { int.ind=int.yx }
80
y=y[int.ind]
x=x[int.ind]
if(!is.null(type)){
if(type=="permute"){
y=sample(y)
} else {y=y}
}
if(!is.null(cov.name) & !is.null(cov.data)){
glm.cov=paste(cov.name, sep="", collapse="+")
if(y.type==1){
glm.form=paste("bigglm(y~x+",glm.cov,",data=cov.data)", sep="",
collapse="")
margin.model=eval(parse(text=glm.form))
} else if(y.type==2){
glm.form=paste("bigglm(y~x+",glm.cov,",
family=binomial, data=cov.data)", sep="",
collapse="")
margin.model=eval(parse(text=glm.form))
}
} else if(is.null(cov.name) & is.null(cov.data)){
if(y.type==1){
margin.model=bigglm(y~x)
} else if(y.type==2){
margin.model=bigglm(y~x, family=binomial)
}
81
} else {
stop("cov.name or cov.data missing")
}
if(is.na(coef(margin.model)[2]){
coef=NA
fitted.val=NA
model.mat=NA
beta=NA
SEbeta=NA
v.cov=NA
p.val=NA
} else {
coef=summary(margin.model)$mat
fitted.val=predict(margin.model,newdata=data.frame(x,
cov.data))
model.mat=model.matrix(margin.model, data=cov.data)
rownames(model.mat)=names(x)
beta=coef[2,1]
SEbeta=coef[2,4]
v.cov=vcov(margin.model)
p.val=coef[2,5]
}
list(coefficients=coef, beta=beta, SEbeta=SEbeta, v.cov=v.cov,
p.value=p.val, model.mat=model.mat, fitted.val=fitted.val)
#list(model=margin.model)
}
82
############# MAIN PERMUTED or SAMPLED P-VALUE FUNCTION
#############################################
#pheno.data=y; snp.data=chs.snps; cov.name=c("q2", "q3", "q4"); y.type=1; reps=10;
method="permutation";
#cov.data=data.frame(q1=pheno$q1_233, q2=pheno$q2_233, q3=pheno$q3_233,
q4=pheno$q4_233); path.info=test.path;
permute.pval=function(pheno.data, snp.data, cov.name=NULL, cov.data=NULL,
y.type, reps, method, levels=NULL, path.info=NULL,
covar.by.gene=F)
{
x=as.matrix(snp.data)
y=as.numeric(pheno.data)
cov.data=cov.data
cov.name=cov.name
y.type=y.type
snp.names=colnames(snp.data)
N_snps=ncol(snp.data)
est.pval=matrix(0, nrow=N_snps, ncol=reps+1)
est.betas=matrix(0, nrow=N_snps, ncol=reps+1)
est.SEbetas=matrix(0, nrow=N_snps, ncol=reps+1)
## Marginal model
ori.model=apply(x, 2, testSNP, y=y, y.type=y.type, cov.name=cov.name,
cov.data=cov.data)
## Marginal beta, SEbeta, and variance
beta.snps=laply(ori.model, function(x){x$beta})
SEbeta=laply(ori.model, function(x){x$SEbeta})
V.COV=llply(ori.model, function(x){x$v.cov})
83
na.index=which(is.na(beta.snps))
## If missing SNPs, remove them and update all related variables
if(length(na.index)>0){
N_snps2=length(beta.snps)-length(na.index)
beta.snps2=beta.snps[-na.index]
SEbeta2=SEbeta[-na.index]
V.COV2=V.COV[-na.index]
a2=combn( (1:ncol(snp.data))[-na.index],2,function(x)x )
snp.names=colnames(snp.data)[-na.index]
x=x[,-na.index]
ori.model.perm=ori.model[-na.index]
est.pval=matrix(0, nrow=N_snps2, ncol=reps+1)
est.betas=matrix(0, nrow=N_snps2, ncol=reps+1)
est.SEbetas=matrix(0, nrow=N_snps2, ncol=reps+1)
} else {
V.COV2=V.COV
beta.snps2=beta.snps
N_snps2=N_snps
SEbeta2=SEbeta
snp.names=colnames(snp.data)
x=x
ori.model.perm=ori.model
}
### Check if any marginal models ###
if(length(ori.model.perm)==0){
list(ARTP.1.lvl=NA, ARTP.2.lvl=NA, num.gene=NA, num.snps=NA)
84
} else {
## Pathway and gene info
if(!is.null(path.info)){
info.ori=path.info[which(as.character(path.info$snp)%in%names(ori
.model.perm)),]
info.ori=info.ori[order(info.ori$gene, info.ori$snp),]
}
#j=1
if(method=="permutation")
{
for (j in 1:(reps+1)){
if(j==1){
est.pval[,j]=laply(ori.model.perm,
function(x){x$p.value})
est.betas[,j]=laply(ori.model.perm,
function(x){x$beta})
est.SEbetas[,j]=laply(ori.model.perm,
function(x){x$SEbeta})
} else {
#z <- as.numeric(sample(y))
model=apply(x, 2, testSNP, y=y, cov.name=cov.name,
cov.data=cov.data,y.type=y.type, type="permute")
est.pval[,j]=as.matrix(laply(model,
function(x){x$p.value}))
est.betas[,j]=as.matrix(laply(model,
function(x){x$beta}))
est.SEbetas[,j]=as.matrix(laply(model,
function(x){x$SEbeta}))
85
}
}
rownames(est.pval)=snp.names
rownames(est.betas)=snp.names
rownames(est.SEbetas)=snp.names
#list(pvals=est.pval, betas=est.betas, SEbetas=est.SEbetas)
} else if(method=="sampling")
{
#fit=ori.model[[1]]
covInfo = function(fit){
X <- fit$model.mat
y.ind=intersect(rownames(X), rownames(fit$fitted.val))
y.int.ind=which(y.ind%in%rownames(fit$fitted.val))
resid.y=y[y.int.ind]-fit$fitted.val[y.int.ind]
var.y=sum((resid.y-mean(resid.y))^2) / (length(resid.y)-1)
#(length(resid.y)-1-(dim(fit$coefficients)[1]-1)
U <- ( (y[y.int.ind]-fit$fitted.val[y.int.ind])/var.y )*X[y.ind,]
iV <- fit$v.cov
tcrossprod(iV, U)
}
V.beta=laply(V.COV2, function(x){x[2,2]})
if( as.logical(covar.by.gene) & !is.null(info.ori)){
#g=info.ori[info.ori$gene=="ACTA1", ]
sigma.star=dlply(info.ori, .(gene), function(g){
86
ori.gene=ori.model[which(names(ori.model)%in%g$snp)]
auxInfo <- lapply(ori.gene, covInfo)
nSNPs=length(ori.gene)
covBetas = matrix(0, nrow=nSNPs, ncol=nSNPs)
for(k in 1:nSNPs){
for(l in 1:nSNPs){
match.ind=intersect(colnames(auxInfo[[k]]),
colnames(auxInfo[[l]]))
covBetas[k,l]=tcrossprod(auxInfo[[k]][,match.ind],
auxInfo[[l]][,match.ind])[2,2]
} # end l loop
} # end k loop
## Transform sigma
eigen.sig=eigen(covBetas, symmetric = TRUE)
eS=eigen.sig$vectors
eV=eigen.sig$values
eV[eV< (1e-7)]=1e-7
eV.mat=matrix(0, nrow=nSNPs, ncol=nSNPs)
diag(eV.mat)=eV
sigma=eS%*%eV.mat%*%t(eS)
colnames(sigma)=names(ori.gene)
rownames(sigma)=names(ori.gene)
sigma
})
samp.count=ddply(info.ori, .(gene), function(x){
dim(x)[1]
87
})
#set.seed(123)
#z=samp.count[1,]
t5=ddply(samp.count, .(gene), function(z){
#set.seed(123)
test=mvrnorm(n=reps, mu=rep(0,z[[2]]),
Sigma=sigma.star[[as.character(z[[1]])]], tol=1e-6,
empirical=F)
t(test)
})
rownames(t5)=info.ori$snp
t5=t5[,-1]
samp.beta=t(t5[names(ori.model.perm),])
all.beta=rbind(beta.snps2,samp.beta)
} else if(as.logical(covar.by.gene) & is.null(info.ori)){
stop("covar.by.gene option is True but gene.info is
missing")
} else {
covBetas = matrix(0, nrow=N_snps2, ncol=N_snps2)
auxInfo <- lapply(ori.model, covInfo)
### Each SNP's model may have different number of residuals
#checking number of residuals for each SNP model
#laply(fits, function(x){ length(names(residuals(x,
type="response"))) })
88
#k=1;l=1;
for(k in 1:N_snps2){
for(l in 1:N_snps2){
match.ind=intersect(colnames(auxInfo[[k]]),
colnames(auxInfo[[l]]))
covBetas[k, l]=tcrossprod(auxInfo[[k]][,match.ind],
auxInfo[[l]][,match.ind])[2,2]
} # end l loop
} # end k loop
## Transform sigma
eigen.sig=eigen(covBetas, symmetric = TRUE)
eS=eigen.sig$vectors
eV=eigen.sig$values
eV[eV< (1e-7)]=1e-7
eV.mat=matrix(0, nrow=N_snps2, ncol=N_snps2)
diag(eV.mat)=eV
sigma.star=eS%*%eV.mat%*%t(eS)
rownames(sigma.star)=names(ori.model.perm)
#Sampling from Multivariate Normal with Beta Mu and sigma.star
#set.seed(123)
samp.beta=mvrnorm(n=reps, mu=rep(0,N_snps2), Sigma=sigma.star,
tol=1e-6, empirical=F)
#samp.beta=mvrnorm(n=reps, mu=rep(0,N_snps2), Sigma=covBetas,
tol=1e-6, empirical=F)
all.beta=rbind(beta.snps2,samp.beta)
89
}
#Standardize Betas by SEbeta
st.beta=matrix(0, nrow=reps+1, ncol=N_snps2)
for(i in 1:N_snps2){
st.beta[,i]=(all.beta[,i]/SEbeta2[i])^2
}
colnames(st.beta)=names(ori.model.perm)
st.beta=t(st.beta)
est.pval=apply(st.beta, 2, function(x){pchisq(x, df=1, ncp=0,
lower.tail=F)})
#rownames(est.pval)=snp.names
} # end "sampling"
if(levels==1){
allsnps.artp=artp(est.pval)
allsnps.path.pval=allsnps.artp$ARTP
allsnps.S.min.ind=allsnps.artp$S.min
list(ARTP.1.lvl=allsnps.path.pval,
ARTP.1.lvl.S.min=allsnps.S.min.ind,
num.gene=length(gene.S.min), num.snps=length(ori.model.perm))
} else if(levels==2){
if(is.null(info.ori)){
stop("No gene info dataset")
} else {
gene.info=info.ori
rownames(gene.info)=info.ori$snp
gene.info=gene.info[rownames(est.pval),]
90
gene.data=data.frame(est.pval,
Gene_name=as.character(gene.info$gene))
W.gene=dlply(gene.data, .(Gene_name),
function(x){apply(x[,1:(reps+1)], 2, RTP)})
S.gene=llply(W.gene, function(x){
if(is.vector(x)){
Ge_algo(x)
} else{
t(apply(x, 1, Ge_algo))
}
})
MinP.gene=llply(S.gene, function(x){
if(is.vector(x)){
x
} else{
(apply(x, 2, min))
}
})
gene.ARTP=ldply(MinP.gene, Ge_algo)
gene.ARTP=data.matrix(gene.ARTP[-1])
rownames(gene.ARTP)=names(MinP.gene)
gene.S.min=llply(S.gene, function(x){ if(is.vector(x)){ 1
} else{
(apply(x, 2, which.min))
}
})
if(dim(gene.ARTP)[1]==1){
91
gene.S.ind=NA
path.S.ind=NA
path.pval=NA
} else {
gene.S.ind=ldply(gene.S.min, function(x){x[1]})
rownames(gene.S.ind)=gene.S.ind[,1]
gene.S.ind=gene.S.ind[,-1]
path.artp=artp(gene.ARTP)
path.pval=path.artp$ARTP
path.S.ind=path.artp$S.min
}
list(ARTP.2.lvl=path.pval, ARTP.genes=gene.ARTP,
path.S.min=path.S.ind, gene.S.min=gene.S.ind,
num.gene=length(gene.S.min),
num.snps=length(ori.model.perm))
}
} else if (levels==3){
allsnps.artp=artp(est.pval)
allsnps.path.pval=allsnps.artp$ARTP
allsnps.S.min.ind=allsnps.artp$S.min
if(is.null(info.ori)){
stop("No gene info dataset")
} else {
gene.info=info.ori
rownames(gene.info)=info.ori$snp
gene.info=gene.info[rownames(est.pval),]
gene.data=data.frame(est.pval, Gene_name=as.character(gene.info$gene))
92
W.gene=dlply(gene.data, .(Gene_name),
function(x){apply(x[,1:(reps+1)], 2, RTP)})
S.gene=llply(W.gene, function(x){
if(is.vector(x)){
Ge_algo(x)
} else{
t(apply(x, 1, Ge_algo))
}
})
MinP.gene=llply(S.gene, function(x){
if(is.vector(x)){
x
} else{
(apply(x, 2, min))
}
})
gene.ARTP=ldply(MinP.gene, Ge_algo)
gene.ARTP=data.matrix(gene.ARTP[-1])
rownames(gene.ARTP)=names(MinP.gene)
gene.S.min=llply(S.gene, function(x){ if(is.vector(x)){ 1
} else{
(apply(x, 2, which.min))
}
})
if(dim(gene.ARTP)[1]==1){
gene.S.ind=NA
path.S.ind=NA
93
path.pval=NA
} else {
gene.S.ind=ldply(gene.S.min, function(x){x[1]})
rownames(gene.S.ind)=gene.S.ind[,1]
gene.S.ind=gene.S.ind[,-1]
path.artp=artp(gene.ARTP)
path.pval=path.artp$ARTP
path.S.ind=path.artp$S.min
}
list(ARTP.1.lvl=allsnps.path.pval,
ARTP.1.lvl.S.min=allsnps.S.min.ind,
ARTP.2.lvl=path.pval, ARTP.genes=gene.ARTP,
path.S.min=path.S.ind, gene.S.min=gene.S.ind,
num.gene=length(gene.S.min),
num.snps=length(ori.model.perm))
}
} #ends levels==3
} #ends ori.model.perm length check
} #ends main function
#####################################################################################
#######################
#####################################################################################
#######################
samp.pvals=permute.pval(
pheno.data=y, snp.data=chs.snps,
cov.name=cov.name, cov.data=cov.data, y.type=1,
reps=5000, method="sampling",levels=3,
path.info=test.path, covar.by.gene=F
)
94
# pheno.data is your outcome vector
# snp.data is the data frame of subject rows by snp columns
# cov.name is in the format of cov.name=c("age", "sex", "town") that
specifices the character vector of covariate names
# cov.data is the data frame of subject rows by covariate columns, this
should match cov.name,
# ie. cov.data=data.frame(age=all.data$age, sex=all.data$sex,
town=all.data$town)
# y.type specifies whether 1=continous (linear regression) or 2=binary
outcome (logistic regression)
# reps is the number of permutation equivalent samples you want to sample
from the variance-covariance matrix of the marginal beta coefficients,
# this does not include the original observation, so reps=5000 you
would actually have 5001 in total
# levels, 1= one-level all snps pathway summary, 2= two-level pathway
summary (method applied first within gene, then method applied again
among genes for pathway summary)
# 3=obtain results for both one-level and two-level summaries
# path.info is a mapping summary data frame that should contain at least
three column variables named "snp", "gene", "pathway"
#(maps what snp to what gene, what gene to what pathway)
# covar.by.gene is an option for building variance-covariance matrix of
marginal betas within a gene, then sample within a gene
# this option is used when the # of SNPs in the pathway is huge, and
building a giant variance-covariance matrix for all SNPs is impossible
95
Appendix B: Additional BMI Growth Pathway Analysis Results
Since two-level ARTP p-values are orders of magnitudes smaller than one-level ARTP,
we show the top 50 pathways based one most significant two-level ARTP p-values.
Pathway Name ARTP One-
Level
ARTP Two-
Level
# of genes # of SNPs
KEGG_CIRCADIAN_RHYTHM_MAMMAL 0.0260 0.00018 13 259
AMBION_CIRCADIAN_CLOCK_IN_MAMMALS 0.0480 0.00021 12 547
KEGG_PROXIMAL_TUBULE_BICARBONATE_RECLAM
ATION 0.0321 0.0015 21 355
AMBION_HEPATIC_ABC_TRANSPORTERS 0.7758 0.0035 60 2166
KEGG_COMPLEMENT_AND_COAGULATION_CASCAD
ES 0.0590 0.0039 64 1250
BIOCARTA_STEM_PATHWAY 0.1220 0.0052 15 225
REACTOME_AMINE_LIGAND_BINDING_RECEPTORS 0.1879 0.0114 39 2032
REACTOME_SEROTONIN_RECEPTORS 0.0472 0.0121 11 765
REACTOME_INTRINSIC_PATHWAY 0.0120 0.0160 13 172
KEGG_CARDIAC_MUSCLE_CONTRACTION 0.9789 0.0210 61 2380
BIOCARTA_NKT_PATHWAY 0.1642 0.0230 25 507
KEGG_ALDOSTERONE_REGULATED_SODIUM_REABS
ORPTION 0.7234 0.0263 36 1598
BIOCARTA_ASBCELL_PATHWAY 0.3668 0.0292 11 178
BIOCARTA_CDK5_PATHWAY 0.3354 0.0362 10 153
AMBION_MSP-RON_SIGNALING 0.8954 0.0430 57 1339
REACTOME_FORMATION_OF_FIBRIN_CLOT_CLOTTI
NG_CASCADE 0.0521 0.0454 28 500
BIOCARTA_DC_PATHWAY 0.1021 0.0469 20 438
BIOCARTA_NKCELLS_PATHWAY 0.2713 0.0498 19 472
AMBION_PATHOGENESIS_OF_RHEUMATOID_ARTHRI
TIS 0.3122 0.0513 49 1094
BIOCARTA_ERYTH_PATHWAY 0.6742 0.0514 15 289
BIOCARTA_INTRINSIC_PATHWAY 0.0498 0.0523 18 447
BIOCARTA_CTL_PATHWAY 0.0662 0.0542 13 217
KEGG_RIBOFLAVIN_METABOLISM 0.0624 0.0546 12 259
KEGG_PPAR_SIGNALING_PATHWAY 0.7487 0.0584 63 1013
BIOCARTA_TOB1_PATHWAY 0.0168 0.0619 19 555
96
KEGG_GLYCOSPHINGOLIPID_BIOSYNTHESIS_GANGL
IO_SERIES 0.0162 0.0620 14 695
BIOCARTA_NTHI_PATHWAY 0.0509 0.0642 24 695
REACTOME_CYTOSOLIC_SULFONATION_OF_SMALL_
MOLECULES 0.2024 0.0660 8 120
BIOCARTA_INFLAM_PATHWAY 0.4320 0.0696 28 550
BIOCARTA_TCAPOPTOSIS_PATHWAY 0.1610 0.0706 8 154
KEGG_TAURINE_AND_HYPOTAURINE_METABOLISM 0.1240 0.0731 10 111
BIOCARTA_TH1TH2_PATHWAY 0.2091 0.0732 18 376
AMBION_BLOOD_COAGULATION_CASCADE 0.9125 0.0832 71 2633
BIOCARTA_IL5_PATHWAY 0.5130 0.0844 10 154
AMBION_C._PNEUMONIAE_INFECTION_IN_ATHEROS
CLEROSIS 0.1145 0.0853 31 576
KEGG_GRAFT_VERSUS_HOST_DISEASE 0.3931 0.0856 33 769
AMBION_PHAGOCYTOSIS_OF_MICROBES 0.9114 0.0863 81 3032
REACTOME_DEPOLARIZATION_OF_THE_PRESYNAPT
IC_TERMINAL 1.0000 0.0897 11 1117
AMBION_ER-MEDIATED_PHAGOCYTOSIS 0.5198 0.0903 94 2115
AMBION_MECHANISM_OF_BOTULINUM_TOXIN 0.7826 0.0941 56 1390
REACTOME_
LICENSING_FACTORS_WITH_PREREPLICATIVE_COM
PLEX 0.1004 0.0961 7 68
KEGG_HISTIDINE_METABOLISM 0.0932 0.0986 27 491
REACTOME_RECYCLING_OF_BILE_ACIDS_AND_SAL
TS 0.2895 0.1010 10 514
AMBION_FC-GAMMAR-
MEDIATED_PHAGOCYTOSIS_IN_MACROPHAGES 0.7433 0.1026 104 3434
REACTOME_RHO_GTPASE_CYCLE 0.9842 0.1036 109 4410
KEGG_PHENYLALANINE_METABOLISM 0.2643 0.1043 14 271
BIOCARTA_TCRA_PATHWAY 0.1060 0.1060 11 214
REACTOME_CLASSICAL_ANTIBODY_MEDIATED_CO
MPLEMENT_ACTIVATION 0.2558 0.1078 5 17
REACTOME_NEF_OF_CELL_SURFACE_RECEPTORS 0.0235 0.1156 20 236
AMBION_INTRINSIC_PROTHROMBIN_ACTIVATION_P
ATHWAY 0.9733 0.1186 68 2521
Abstract (if available)
Abstract
With a typical sample size of a few thousand subjects, a single genomewide association study (GWAS) using traditional one‐SNP‐at‐a‐time methods can only detect genetic variants conferring a modest effect on disease risk. Set‐based methods, which analyze sets of SNPs jointly, can detect variants with smaller effects acting within a gene, a pathway, or other biologically relevant sets. While self‐contained set‐based methods (those that test sets of variants without regard to variants not in the set) are generally more powerful than competitive set‐based approaches (those that rely on comparison of variants in the set of interest with variants not in the set), there is no consensus as to which self‐contained methods are best. In particular, several competitive set tests have been proposed to directly or indirectly ‘adapt’ to the a priori unknown proportion and distribution of effects of the truly associated SNPs in the set, which is a major determinant of their power. ❧ A popular adaptive set‐based test is the adaptive rank truncated product (ARTP), which seeks the set of SNPs that yields the best‐combined evidence of association. We compared the standard ARTP, several ARTP variations we introduced, and other adaptive methods in a comprehensive simulation study to evaluate their performance. We used permutations to assess significance for all the methods and thus provide a level playing field for comparison. We found the standard ARTP test to have the highest power across our simulations followed closely by the global model of random effects (GMRE) and a LASSO based test. ❧ However, ARTP requires permuted data and p-values that may be computationally intensive. This motivates a fast alternative approach to replace permutations. We proposed a Fast Adaptive Set‐based Test (FAST), which samples permutation equivalents from the asymptotic distribution of all marginal parameter estimates of interest. An R package “fastARTP” was written to apply the FAST framework to draw permutation equivalents then applied on to the ARTP method. We consequently applied “fastARTP” package on a genomewide pathway analysis using the Children Health Study (CHS).
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Pathway-based analysis of multivariate traits using classical dimensionality reduction methods
PDF
Multivariate methods for extracting genetic associations from correlated data
PDF
Missing heritability may be explained by the common household environment and its interaction with genetic variation
PDF
Genetic variation in the base excision repair pathway, environmental risk factors and colorectal adenoma risk
PDF
Observed and underlying associations in nicotine dependence
PDF
Discovery of complex pathways from observational data
PDF
Two-step testing approaches for detecting quantitative trait gene-environment interactions in a genome-wide association study
PDF
Robust feature selection with penalized regression in imbalanced high dimensional data
PDF
Gene-set based analysis using external prior information
PDF
Hierarchical approaches for joint analysis of marginal summary statistics
PDF
Population substructure and its impact on genome-wide association studies with admixed populations
PDF
Identification and fine-mapping of genetic susceptibility loci for prostate cancer and statistical methodology for multiethnic fine-mapping
PDF
twas_sim, a Python-based tool for simulation and power analysis of transcriptome-wide association analysis
PDF
Genetic studies of inflammation and cardiovascular disease
PDF
Meat intake, polymorphisms in the NER and MMR pathways and colorectal cancer risk
PDF
Enhancing model performance of regularization methods by incorporating prior information
PDF
Pharmacogenetic association studies and the impact of population substructure in the women's interagency HIV study
PDF
Preprocessing and analysis of DNA methylation microarrays
PDF
Genetic epidemiological approaches in the study of risk factors for hematologic malignancies
PDF
Analysis of SNP differential expression and allele-specific expression in gestational trophoblastic disease using RNA-seq data
Asset Metadata
Creator
Su, (Ray) Yu-Chen
(author)
Core Title
Adaptive set-based tests for pathway analysis
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Statistical Genetics and Genetic Epidemiology
Publication Date
02/20/2015
Defense Date
10/03/2014
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
adaptive pathway analysis,genetic pathway analysis,genomewide pathway analysis,OAI-PMH Harvest,self‐contained pathway analysis,set‐based test
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Lewinger, Juan Pablo (
committee chair
), Conti, David V. (
committee member
), Gauderman, William James (
committee member
), Schumacher, Fredrick (
committee member
), Sun, Fengzhu Z. (
committee member
)
Creator Email
ego168@gmail.com,yuchensu@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-535696
Unique identifier
UC11297535
Identifier
etd-SuRayYuChe-3202.pdf (filename),usctheses-c3-535696 (legacy record id)
Legacy Identifier
etd-SuRayYuChe-3202.pdf
Dmrecord
535696
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Su, (Ray) Yu-Chen
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
adaptive pathway analysis
genetic pathway analysis
genomewide pathway analysis
self‐contained pathway analysis
set‐based test