Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Gene-set based analysis using external prior information
(USC Thesis Other)
Gene-set based analysis using external prior information
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
GENE SET-BASED ANALYSIS USING EXTERNAL PRIOR INFORMATION
by
Meng Liu
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Biostatistics)
August 2016
Copyright 2016 Meng Liu
i
Preface III
Chapter I: increasing the power of Gene-Set analysis by
incorporating Gene Ontology based gene similarity information
1
Part A. Introduction and background 1
Introduction of gene set-based analysis in genetics research 1
Defining gene sets using prior biological information
2
Methods for prior-knowledge-defined gene set
3
Incorporation of prior knowledge of genes using the GO-based gene similarity 4
Motivation for developing innovative approaches to incorporate gene-gene similarity
information into gene-set analysis
6
Part B. Methodology
8
Variance component test using generalized linear mixed model
8
Setting and Notation for data
10
Variance Component (VC) Score test
11
Prior information for analysis
15
Block equi-correlated prior and decomposition of score test
17
Robustified score test
21
Part C. Simulation studies
24
Gene expression and covariate data
24
Prior resource data
24
Gene set generation
27
Phenotype generation
28
Prior information matrix simulation
29
Part D. Results and discussion
30
Power Simulation results
30
Application to prostate cancer recurrence and peroxisome genes
31
ii
Discussion
34
Chapter II. Effect of incompleteness of Gene Ontology annotation 36
Part A. Introduction
36
Introduction and background
36
Part B. Materials and methods
39
Data sources
39
Quantifying the incompleteness of GO annotations
42
Simulations
42
Information Content (IC) and semantic similarities
44
GO-based Gene clustering using Clique Extracted Ontology algorithms
49
Measurement of the impact of incomplete annotations on gene clustering
49
Clustering genes from sample pathways and measuring the impact of incomplete data
on it
51
Part C. Analysis and Results
52
Robustness of similarities between genes given incomplete annotations 52
The heterogeneity of similarities in terms of gene clustering given full annotations 53
General impact of incomplete annotations on clustering 55
Impact of incomplete annotations on clustering of sample pathway genes 57
Part D. Discussion 63
Reference 68
Appendix A. Derivation of score statistic 74
Appendix B. Human peroxisomal genes 81
Appendix C. Supplemental Figures 86
iii
Preface
It has become increasingly important in today’s genomic research to detect the
joint association of numerous biomarkers (biomarker set) with complex diseases
and predict patients’ disease status or treatment response using those identified
biomarkers. Most existing biomarker-set approaches, however, relies on the
assumption of either independence or equicorrelation among biomarkers and the
power can decrease rapidly due to violation of this assumption. We proposed a
prior knowledge based adaptive approach which assumes a block equicorrelation
structure among biomarkers. This proposed approach divides a set of biomarkers
into different “blocks” using prior information from Gene Ontology, and then
decomposes a global test into different “within-block” local tests. It adaptively
uses the data to optimally determine the value of equicorrelation within blocks. It
maintains as much power as other approaches do when the correlation structure
is mis-specified by prior information, but tends to be more powerful when the
prior information is informative. We evaluate the performance of the proposed
approaches using complex simulation studies and illustrate its application using
public Gene Ontology annotation resources and the Microarray data of USC
Norris Center prostate cancer project. Given the fact that the prior information
can determine the power of gene-set analysis, we checked the quality of one of
the most commonly used biological resources, Gene Ontology. Particularly we
estimated the incompleteness of Gene Ontology annotations among human
genes. A novel simulation design was proposed to count the missing annotations
iv
among human genes compared with annotations among model organism genes.
The impact of incompleteness of GO annotations on hierarchical clustering of
genes was evaluated as well using extensive simulation studies and sample
pathway genes among yeast.
Acknowledgments
I would like to express my sincere gratitude to my advisor Juan Pablo
Lewinger and co-advisor Paul Thomas for their help, guidance and patience. I
would also like to thank my committee members: David Conti, Huaiyu Mi and
Joseph Hacia for their support and motivation. I would also like to thank the
entire biostatistics program in Department of Preventive Medicine which has
been helping me to develop as a scientist and investigator. In addition, I am
sincerely grateful to Duncan Thomas who has been supported me generously for
my dissertation research. Last but not least, I am sincerely thankful to my
schoolmates, friends and family who helped me through my whole Ph.D. career.
Thank you
1
Chapter I. increasing the power of Gene-Set Analysis by
incorporating Gene Similarity Information
Part A. Introduction and backgrounds
Introduction of gene set-based analysis in genetics research
Gene set-based analysis enables join testing of sets of biologically defined
genomic features (e.g. genes within a pathway) for association with a trait (e.g.
diseased vs. non-diseased) or experimental condition (e.g. treated vs. untreated).
Although originally devised to facilitate interpretation of microarray gene
expression studies, gene-set analysis has been extended and adapted for other
types of genomic data such as genomewide association studies (Weng, et al.,
2011) (in contexts other than gene expression studies these approaches are
often collectively referred to as set-based methods) and methylation studies
(Hass, et al., 2015) . More recently, a number of set-based methods for rare
variants have also been proposed to exploit their ability to enhance the statistical
power to detect joint effects that would be otherwise too weak to be detected
individually(Ionita-Laza, et al., 2013; Lee, et al., 2012; Wu, et al., 2011; Yan, et
al., 2015). One of the main reasons that set-based approaches have been very
commonly used in today’s research is that set-based analysis, which tests the
cumulative effects rather than single effect, tends to be more powerful than
univariate analysis. For example genome wide association studies (GWAS) have
made an enormous impact on the understanding of the association between
single nucleotide polymorphisms (SNPs) and traits. However, most complex
2
traits might be caused by numerous genes which work together and the effect for
single SNP might be too small to be detected(Lebrec, et al., 2009; Schaid, et al.,
2012). It is plausible that grouping multiple SNPs from same pathway and
performing global test tends to improve the power.
Defining gene sets using prior biological information
The most commonly used gene-set approaches rely on pathway and gene
function annotation from prior biological knowledge or defining appropriate gene
sets. Many resources have been widely used for obtaining prior biological
information, like the Gene Ontology (GO) and the Kyoto Encyclopedia of Genes
and Genomes (KEGG) and others. The GO is most commonly used because it
has very well-structured knowledge of properties of gene products in 3 domains:
molecular functions, biological process and cellular component(Harris, et al.,
2004). Within each domain, standardized GO terms were used to describe
corresponding properties of gene products. For example, the process of positive
regulation of mitosis is denote by GO term GO: 0045840 under the biological
process domain of the GO. The relationship between GO terms is represented
via Directed Acyclic Graph (DAG). The more specific terms (closer to the leaf of
DAG) are called child terms. Each child term can be linked to multiple more
general terms (closer to the roof of DAG) called parent terms. Genes or gene
products associated with a particular property are annotated with GO terms
corresponding to this property. Thus, each GO term can be considered as an
indicator of corresponding biological process or function for its annotated genes.
3
Methods for prior-knowledge-defined gene set
Established methods within this prior knowledge-defined gene set framework are
over representation analysis (ORA) and gene-set enrichment analysis (GSEA)
implemented in widely used online tools such as Panther(Mi, et al., 2016),
DAVID(Huang, et al., 2007), and GSEA(Subramanian, et al., 2005). Gene Set
Enrichment Analysis (GESA) for gene expression study. It orders a list of p-
values for single genes and then tests the trend of clustering small p-values
(significant genes) within gene sets indicated by the GO term (Subramanian, et
al., 2005). Similar approaches have been adapted to GWAS which test the
overrepresentation of gene sets on the list of significant SNP p-values(Chasman,
2008). However, these methods only test the difference of overrepresentation
between the target set and its complement. Insignificant difference only implies
the association of one set is not different than others, rather than the association
itself is insignificant (Schaid, et al., 2012). Holmans developed another approach
to test the overrepresentation of gene sets by simulating the null distribution of
the number of significant genes within each set(Holmans, et al., 2009).
Significance of overrepresentation implies potential disease causing. However, it
has limits to account for the correlation structure within genes (between SNPs)
and between genes. This might cause inaccurate type I error rate and potential
biased detection of association, especially when multiple SNPs were contained in
a gene(Cantor, et al., 2010). To overcome these limits, Schaid et al. developed
another approach for gene-set analysis: individual SNPs for their associations
were firstly scored using score statistics from regression model. Those scores
4
were combined into scores for genes and then for gene sets under the guidance
of GO structure. Rather than just overrepresentation, this method can directly
test the associations of gene sets with given traits, while account for correlation
both within and between genes. In addition, it can account for testing of multiple,
overlapped or nested gene sets (Schaid, et al., 2012). Nevertheless, like the
methods for overrepresentation, Schaid only used the GO as indicators of
properties of genes which cannot account for correlation within gene set. Other
approaches based on pre-defined gene sets include the global test of Goeman et
al. (Goeman, et al., 2004) and Principal Component test (PCA) (Kong, et al.,
2006; Tomfohr, et al., 2005). None of them can account for the correlation within
gene-set. Approaches using pre-defined gene sets were extensively reviewed
and evaluated by Wu and Lin(Wu and Lin, 2009).
Incorporation of prior knowledge of genes using the GO-based gene
similarity
Intuitively, genes from overlapping pathway are highly interconnected, share
common properties and tend to behave similarly. Their effects for given traits are
thereby, more or less, correlated, depending on how similar their properties are.
It is fairly to assume that accounting for the correlation structure of single genes’
effects correctly would improve the power of testing for the overall associations of
gene set. Theoretically, the correlation between effects is hard to be estimated
without any external biological information. Sources or prior data relevant for
gene-set analysis, however, contain much more finely grained information on
gene function, expression and/or co-regulation than simple categorization into
5
subsets. Commonly used prior data includes Kyoto Encyclopedia of Genes and
Genomes (KEGG), The Cancer Genome Atlas (TCGA) and Weighted Gene Co-
Expression Network (WGCN)(Cancer Genome Atlas Research, 2008; Kanehisa,
et al., 2016; Zhang and Horvath, 2005). This information can be encoded by
similarity matrices that capture both the network topology and the strength of the
connections between genes. Gene-gene similarities can be in turn predictive of
gene co-regulation and of phenotypic similarity (Hu and Zhao, 2016). The GO,
however, has a great advantage to take into account such a correlation because
it has a well-structured ontology of genes’ properties. Based on this ontology,
one can infer the relationships between ultimately genes, using GO-based
semantic similarity.
The GO-based semantic similarity is a measure of similarity between two sets of
GO terms from two different genes. It can capture how similar two genes are in
terms of their biological properties. For example, a high similarity of two genes
based on the GO of biological process domain implies that these two genes
involved in similar biological processes. In further, sharing similar properties
might result in a higher chance of having similar effects on disease. Thus, it is
rational to use GO-based semantic similarity as an estimate of correlation
between genes’ effects. Different GO-based similarity measures have been
developed and explored, among which the information-content (IC) based ones
are the most widely used (Mistry and Pavlidis, 2008; Ovaska, et al., 2008;
Pesquita, et al., 2008; Schlicker, et al., 2006). IC is a way to quantify the
specificity of the GO term, with more specified terms have larger IC while more
6
general terms have less. Similarity between two different terms can be calculated
based on the IC of their ancestor terms, and different approaches (e.g. average,
maximum and best-matched average) can be chosen to combined these IC
based term-term similarities as IC based gene-gene similarities(Schlicker, et al.,
2006). Although those IC-based measures have been widely used in many fields,
e,g. gene clustering, they don’t return a positive-definite matrix of similarities,
which is necessary for most statistical analysis of modeling. Thus, this article
considering another non-IC based semantic similarity measure, GO-based
cosine similarity. It treats each gene’s annotated terms as a vector, and
calculates the similarity between such two vectors that measures the cosine of
the angle between them. Because a matrix of cosine similarities is a positive-
definite matrix, it can be applied to statistical analysis of modeling for further
inference. Other commonly used prior data includes Kyoto Encyclopedia of
Genes and Genomes (KEGG), The Cancer Genome Atlas (TCGA) and Weighted
Gene Co-Expression Network (WGCN)(Cancer Genome Atlas Research, 2008;
Kanehisa, et al., 2016; Zhang and Horvath, 2005). This information can be
encoded by similarity matrices that capture both the network topology and the
strength of the connections between genes. Gene-gene similarities can be in turn
predictive of gene co-regulation and of phenotypic similarity (Hu and Zhao, 2016).
Motivation for developing innovative approaches to incorporate gene-gene
similarity information into gene-set analysis
Despite the availability of more detailed gene-gene similarity information, of the
numerous approaches for gene set analysis, only few can account for
7
correlations among genes using prior information. The Test for the Effect of a
Gene Set (TEGS) models the correlation among genes using a user-specified
working covariance matrix under a multivariate regression framework(Huang and
Lin, 2013). While power is improved when the working covariance is close to the
true covariance, it can lose power when the working covariance is misspecified.
Several methods have thus been proposed to robustify set-based tests with prior
covariance structures. Lin has proposed a global score test that can jointly test
the effect of multiple SNPs using different prior covariances (Lin, 2007). It
assumes a prior covariance that is a mixture of an independence covariance and
a GO-related covariance. However, this method has rarely been applied by
related studies.
In this paper, we introduce a simple and computationally efficient gene-set test
that can exploit prior information in the form of a positive definite similarity matrix
to gain statistical power over standard approaches. The test is based on the
score test framework for generalized linear mixed regression originally developed
by Li and others (Wu, et al., 2011) for testing sets of rare variants. A key element
of our proposal is that we expressly designed the test to be power robust, so that
if the prior information is completely misspecified, the resulting test is only
marginally less powerful than the analogous test with no prior information, but
such that even partially correct information can yield gains in power.
In the important case where the similarity matrix encoding the prior information is
block-equicorrelated, i.e. when it the prior information represents non-overlapping
gene clusters believed to be co-regulated, the proposed score test is a weighted
8
combination of a test of the average effect and a test of the variance of the
effects within each block. We show that large power gains are possible by
incorporating prior information and that the power gains increase with the
dimensionality of the gene set.
We conducted a simulation study to evaluate the performance of the proposed
test with more general prior information based on real similarity matrices from the
GO. The results show that incorporating prior information yields substantial
power gains over approaches that do not incorporate prior information, even in
scenarios where the prior is only partially informative. Moreover, the simulation
results also show that, as intended by design, the power loss due to complete
misspecification of the prior information is very small, allowing researcher to
confidently use the test even if uncertain about the adequacy of the prior
information.
We illustrate our approach by reanalyzing a recent gene expression study of long
term outcomes in prostate cancer after radical prostatectomy. Using prior
information from the GO we uncover previously unidentified evidence of the
involvement of the peroxisomal pathway in prostate cancer progression.
Part B. Methodology
Variance component test using generalized linear mixed model
Generalized linear mixed model has been widely used in today’s research for
multidimensional genetic analysis. It fits effects of environmental covariates as
fixed while those of genes as random. The generalized linear mixed model
9
framework has several advantages on pathway analysis: the assumption of
random effects allows different directions of genes’ effect, which would be more
realistic. Another important property of generalized linear mixed model is that it
has connection with the semiparametric regression model, which allows a more
flexible function of Hilber Space (H) for the complicate joint effects of multiple
genes than linear functions(Liu, et al., 2007; Schaid, 2010). Linear mixed model
can make appropriate inference on this flexible function via VC score test, if a
kernel function that can best capture H was used. The SKAT family test derived
by Wu et al. was a special case of generalized semiparametric regression/logistic
model(Wu, et al., 2011). Because it assumes genes are only linearly associated
with the outcome, a weighted linear kernel function for capturing genotype
similarity between subjects was applied to the VC score test. If a more
complicated relationship between genes and outcome was assumed, one can
still test it using VC score test but with different types of kernel function. For
example, Gaussian process, polynomial, and neural network kernel functions
have been widely used under certain conditions(Liu, et al., 2007). For further
discussion and reference about the choice of kernel functions, please see
Schaid’s review (Schaid, 2010).
Besides the flexibility for testing complicate pathway effects, the generalized
linear mixed model framework allows us to fully account for the relationships
between multiple genes by assuming a prior distribution for random regression
coefficients. One can pre-specify the relationship between genes by specifying a
corresponding covariance structure using the prior distribution. For example, Lee
10
et al. proposed an optimal test procedure, which assumed an exchangeable
correlation structure and estimate 𝜌 from the data to maximize the power(Lee, et
al., 2012). Tzeng assumed a conditional autoregressive (CAR) structure for the
prior and estimated the correlation directly by Haplotype similarity. Intuitively,
these assumptions could be weak if relationships between genes’ effects are
complicated. In addition, the haplotype similarity might reflect more relationships
between genes than those between genes’ effects. The GO similarity based prior,
as mentioned above, can be directly used to pre-specify the covariance and
easily to incorporate into the generalized mixed model driven approach for gene-
set analysis.
A variance component (VC) score test can be easily derived using generalized
linear mixed model to test the joint effects of multiple candidate genes. It is time
efficient because it only requires fitting the model under the null, and has been
proved to be the most local powerful test (Goeman and le Cessie, 2006). Due to
these advantages, the variance component score test has been applied to both
SNP-set and gene expression-set level tests for associations between genotypes
and phenotypes (Goeman, et al., 2004), and proved to be powerful.
Setting and Notation for data
We assume a set of p genes/probes for which gene expression values have been
measured on n subjects. For subject 𝑖 =1..𝑛 , we let 𝑦 𝑖 denote the outcome
variable, 𝐺 𝑖 𝑇 =(𝑔 𝑖1
,…,𝑔 𝑖𝑝
) the p measured expression levels across the p genes,
and 𝑋 𝑖 𝑇 =(𝑥 𝑖1
,…,𝑥 𝑖𝑞
) , q adjustment covariates such as age, gender, race, or
11
other potential confounders and/or effect modifiers that one may want to adjust
for, including a vector of ones for an intercept term. In compact form we can write
𝑦 =(𝑦 1
,…,𝑦 𝑛 )
𝑇 for the 𝑛 ×1 vector of phenotypes, 𝐺 =[𝐺 1
,..𝐺 𝑝 ] for the 𝑛 ×𝑝
matrix of gene expression measurements with columns 𝐺 𝑗 , 𝑗 =1…𝑝 , and 𝑋 =
[𝑋 1
,..𝑋 𝑞 ] for the 𝑛 ×𝑞 matrix of covariates with columns 𝑋 𝑘 ,𝑘 =1…𝑞 . We
assume both 𝐺 and 𝑋 are of full rank 𝑝 and 𝑞 respectively.
Variance Component (VC) Score Test
We model the outcome 𝑦 𝑖 as a generalized linear model, i.e. 𝑦 𝑖 follows an
exponential family distribution and has expectation 𝐸 [𝑦 𝑖 ]=𝜇 𝑖 , where 𝜂 𝑖 =ℎ(𝜇 𝑖 )=
𝑋 𝑖 𝑡 𝛾 +𝐺 𝑖 𝑡 𝛽 is a link function relating the mean to the linear effect
𝑋 𝑖 𝑡 𝛾 +𝐺 𝑖 𝑡 𝛽 𝑉𝑎𝑟 [𝑦 𝑖 ]=𝜙𝑣 (𝜇 𝑖 ) is a variance function, and 𝜙 a dispersion
parameter. The parameter vectors 𝛾 =(𝛾 1
,…,𝛾 𝑞 ) , and 𝛽 =(𝛽 1
,…,𝛽 𝑝 ) are the
regression coefficients for the adjustment covariates and gene expression
variables respectively. The question of interest is whether there is differential
expression, i.e. whether the distribution of the p gene expression variables
𝐺 1
,…,𝐺 𝑝 , differs by phenotype 𝑦 , after adjustment for the covariates 𝑋 1
,…,𝑋 𝑞 .
This is formally assessed by testing the null hypotheses 𝐻 0
: 𝛽 =(𝛽 1
,..,𝛽 𝑝 )=0
against the alternative 𝐻 𝐴 :𝛽 ≠0 that not all regression coefficients for 𝐺 1
,…,𝐺 𝑝
are zero. The standard score or Wald test for testing H0 vs. HA has p degrees of
freedom and is an omnibus test that ‘spreads’ its power ‘evenly’ across all
possible directions in p dimensional space along which 𝛽 can depart from the null
𝐻 𝐴 :𝛽 =0. As a consequence, a standard p-degree of freedom test does not have
12
great power to detect departures from the null along any one direction in
particular. When there is a prior belief that departures from the null are likely to
happen along a preferential direction, one could construct a single degree of
freedom test with better power to detect such departures. For example, if there
were reason to believe that all regression coefficients are similar in sign and
magnitude, once would test the null 𝐻 0
:𝛽 =0 against the alternative hypothesis
𝐻 𝐴 :𝟏 𝑝 𝑇 𝛽 ≠0, 𝟏 𝑝 =(1,…1
⏟
𝑝 ) , of deviation along the line in 𝑝 -dimensional space
spanned by 𝟏 𝑝 . However, the resulting one degree of freedom test will only have
good power to detect deviations from the null along directions that are not far
from the a priori postulated direction. Deviations occurring mostly along
orthogonal directions to the prior will have no or little power to be detected. Since
highly certain prior knowledge is not common in practice, using a test that does
not ‘cover’ all 𝑝 possible directions of departure risks missing a signal that would
be detectable using a different test. To avoid this limitation of tests based on
‘hard’ constraints on the regression parameters, an alternative is to specify prior
knowledge using ‘soft’ constraints. A natural way to encode more diffuse prior
knowledge is through a prior distribution on the parameters 𝛽 . This would in
principle lead to a Bayesian treatment, which would require full specification of
priors for all other model parameters as well as the use of more complex
inferential and computational tools. Instead, we adopt a mixed model framework
by modeling only the parameter 𝛽 as a random effect and all other parameters as
fixed.
13
Following Lin (Lin, 1997), we assume 𝛽 follows an absolutely continuous
distribution with mean of 0 and covariance matrix 𝑉𝑎𝑟 (𝛽 )=𝜏 2
Σ, where 𝜏 2
is the
variance component of the gene expression effects and Σ is a full rank p p
covariance matrix (i.e. symmetric and positive definite). Under these assumptions,
the null hypothesis of interest, 𝐻 0
:𝛽 =0, is equivalent to 𝐻 0
: 𝜏 2
=0 and a score
test of 𝜏 2
=0 can be derived based on the framework by Lin (Lin, 1997). Briefly,
the likelihood for the data (𝑦 ,𝑋 ,𝐺 ) is given by 𝐿 (𝛾 ,𝜏 2
)=∫𝐿 (𝛽 ;𝛾 ,𝜏 2
)𝑓 (𝛽 )𝑑𝛽
where 𝐿 (𝛽 ) is the conditional likelihood and 𝑓 (𝛽 ) is the density of the random
effect 𝛽 . Under the additional regularity condition that the third and higher-order
moments of 𝛽 are 𝑜 (𝜏 2
), the log-likelihood admits a Taylor expansion from which
the the score statistic, i.e. the derivative of 𝐿 (𝛾 ,𝜏 2
) with respect to 𝜏 2
evaluated at
𝐻 0
: 𝜏 2
=0 can be directly obtained. The score statistic is given by the quadratic
form in 𝑦 −𝜇 ̂ :
𝑄 =(𝑦 −𝜇 ̂)
𝑇 ∆𝑉 −1
𝐾 𝑉 −1
∆(𝑦 −𝜇 ̂) (1)
where 𝜇 ̂
𝑖 =ℎ
−1
(𝑋 𝑖 𝑡 𝛾̂) , ∆=𝑑𝑖𝑎𝑔 (ℎ
′
(𝜇 ̂
1
),…,ℎ
′
(𝜇 ̂
𝑛 )), 𝑉 =
𝜙̂
𝑑𝑖𝑎𝑔 (𝑣 (𝜇 ̂
1
) (ℎ
′
(𝜇 ̂
1
))
2
,…,𝑣 (𝜇 ̂
𝑛 ) (ℎ
′
(𝜇 ̂
𝑛 ))
2
), 𝐾 =𝐺 Σ𝐺 𝑇
(Lee, et al., 2012), and 𝛾̂
is the maximum likelihood estimate of the nuisance parameter 𝛾 under the null
model --which is a standard fixed effects generalized linear model. 𝐾 is a kernel
matrix whose entry 𝐾 𝑖𝑗
,1≤𝑖,𝑗 ≤𝑛 measures the similarity between the gene-
expression profiles of subjects i and j across all p genes, when the similarity t
between the individual gene-expression levels of genes k and l, 1≤𝑘 ,𝑙 ≤𝑝 is
14
measures by Σ
𝑘 ,𝑙 . For canonical link functions, which we hereafter assume, the
score statistic simplifies to:
𝑄 =(𝑦 −𝜇 ̂
0
)
𝑇 𝐾 (𝑦 −𝜇 ̂
0
)/𝜙̂
(2)
For logistic and Poisson regression 𝜙 is known and equal to 1, so it does not
need to be estimated. For linear regression, 𝜙 =𝜎 2
, where 𝜎 2
is the variance of
the residual error term, and 𝜙̂
=𝜎̂
2
is the standard linear regression estimate of
the residual variance. It is worth noting that the test statistic 𝑄 depends on the
prior only through the variance of the random effect, 𝑉𝑎𝑟 (𝛽 )=𝜏 2
Σ, and not the
full prior distribution. This is an additional advantage of the score test over a
likelihood ratio test or a fully Bayesian approach: the score test only requires
specification of the the prior variance.
The (asymptotic) distribution of the quadratic form in (1) can be derived by noting
that 𝑄 can be written as a linear combination of uncorrelated 1-df statistics, each
following an approximately 1-df chi-square distribution. This can be seen by
diagonalizing the positive semidefinite matrix 𝑉 −
1
2
𝐾 𝑉 −
1
2
=𝑃 Λ𝑃 𝑇 , with Λ=
diag(λ
1
,…,λ
𝑝 ,
) , λ
1
≥λ
2
≥⋯≥λ
𝑝 >0, the diagonal matrix of positive eigenvalues
of 𝑉 −
1
2
𝐾 𝑉 −
1
2
(the last 𝑛 −𝑝 eigenvalues are zero because 𝑉 −
1
2
𝐾 𝑉 −
1
2
=
𝑉 −
1
2
𝐺 Σ𝐺 𝑇 𝑉 −
1
2
is of full rank p by virtue of Σ, and 𝐺 , and 𝑉 −
1
2
being each of full rank),
and 𝑃 an n p matrix with orthogonal columns (i.e. 𝑃 𝑇 𝑃 =𝐼 𝑝 ), 𝑃 1
,…,𝑃 𝑝 , which are
eigenvectors of 𝑉 −
1
2
𝐾 𝑉 −
1
2
corresponding to the eigenvalues λ
1
,…,λ
𝑝 . The score
statistic is then given by:
15
𝑄 =
1
𝜙̂
(𝑦 −𝜇 ̂
0
)
𝑇 𝐾 (𝑦 −𝜇 ̂
0
)=
1
𝜙̂
(𝑦 −𝜇 ̂)
𝑇 𝑉 −
1
2
𝑉 1
2
𝐾 𝑉 −
1
2
𝑉 1
2(𝑦 −𝜇 ̂
0
)=
1
𝜙̂
(𝑦 −𝜇 ̂)
𝑇 𝑉 −
1
2
𝑃 Λ𝑃 𝑇 𝑉 −
1
2(𝑦 −𝜇 ̂)=(𝑃 𝑇 𝜙̂
−
1
2
𝑉 −
1
2
(𝑦 −𝜇 ̂
0
))
𝑇 Λ(𝑃 𝑇 𝜙̂
−
1
2
𝑉 −
1
2
(𝑦 −𝜇 ̂
0
))=
∑ λ
𝑘 (𝑃 𝑇 𝜙̂
−
1
2
𝑉 −
1
2
(𝑦 −𝜇 ̂
0
))
2
𝑝 𝑘 =1
=∑ λ
𝑘 𝑧 𝑘 2
𝑝 𝑘 =1
where 𝑧 =(𝑧 1
…,𝑧 𝑝 )=𝑃 𝑇 𝜙̂
−
1
2
𝑉 −
1
2
(𝑦 −𝜇 ̂
0
). Each 𝑧 𝑘 is the projection of the
centered and scaled outcome vector 𝜙̂
−
1
2
𝑉 −
1
2
(𝑦 −𝜇 ̂
0
) onto the direction spanned
by the eigenvector 𝑃 𝑘 . The uncorrelatedness of the 𝑧 𝑘 ’s follows from the
orthogonality of the 𝑃 𝑘 ‘s, and since 𝑧 follows and approximate multivariate normal
distribution with independent covariance 𝑁 (𝜙̂
−
1
2
𝑉 −
1
2
(𝜇 −𝜇 0
),𝐼 𝑝 ) , each 𝑧 𝑘 2
follows an
approximate 1-df chi-square distribution. Moreover, because for a multivariate
normal distribution uncorrelatedness is equivalent to independence, the 𝑧 𝑘 2
’s are
approximately independent. The distribution of 𝑄 is therefore well approximated
by a linear combination of independent 1-df chi-square statistics. There are
several approximate and exact methods to numerically evaluate the tail
probabilities of the distribution of linear combination of independent chi-squares
that can be used to compute p-values for the variance component score test. We
use Davies’ exact method based on inverting the characteristic function (Davies,
1980), which is implemented for example in the CompQuadForm R package
along with several other approximate methods.
Prior information for analysis
16
Gene sets of interest for set-based testing typically consist of genes annotated to
the same biological pathway or co-expression network. When such biologically
defined sets genes are expected to be co-regulated, it is plausible that the
greater the similarity between genes within a set, the greater the similarity
between their effect on a phenotypic trait of interest. Therefore, incorporating
appropriate prior information on gene similarity has the potential to increase the
statistical power of set-based tests. Our proposed approach differs from others in
that we can explicitly incorporate prior information that can be encoded as an
arbitrary positive definite similarity matrix. Specifically, rather than assuming an
independent (e.g. SKAT) or equicorrelated covariance structure (e.g. SKAT-O),
the covariance matrix of the generalized linear mixed model directly reflects the
prior information based on external information that we want to incorporate. For
example, co-regulated genes may have similar effects on a trait or outcome; the
covariance matrix can be constructed as a similarity matrix based on GO gene
function annotations. This prior information need not be perfectly correct; even
partially correct information can result in a test with improved power as our
simulation results below show. Different measures have been developed and
explored to calculate gene-gene similarity based on function annotations (Mistry
and Pavlidis, 2008; Pesquita, et al., 2008; Schlicker, et al., 2006), most of which
have been implemented in statistical packages in R (Ovaska, et al., 2008; Yu, et
al., 2010). In this study, we do not discuss which similarity measures are better
for incorporating prior information into gene-set analysis. However, given a
17
similarity measure chosen, we explore different scenarios under which the power
of gene-set analysis might be improved by incorporation of prior information.
Based on the premise that correlated SNPs may have similar effects on an
outcome of interest, the SKAT-O variance components score test by Lee et al
(Lee, et al., 2012)assumes a prior correlation structure consisting of a convex
combination of an independent and an equicorrelated correlation structure. The
convex combination weight is a tuning-parameter that can be optimized by
performing a simple grid search. Although by construction SKAT-O is robust if
the equicorrelated structure is not a good fit for the data, SKAT-O does not allow
incorporation of a general prior correlation structure.
Block equi-correlated prior and decomposition of score test
A particularly illuminating case arises when the prior similarity matrix is block
equicorrelated, i.e. Σ is the partitioned block diagonal matrix with r blocks of
sizes 𝑠 1
,…,𝑠 𝑟 , 𝑠 1
+⋯+𝑠 𝑟 =𝑝 , such that within each block 𝑘 =1,…,𝑟 all diagonal
entries are ones and all off diagonal entries are 𝜌 𝑘 , 0≤𝜌 𝑘 ≤1:
Σ=[
(1−𝜌 1
)𝐼 𝑠 1
+𝜌 1
𝐸 𝑠 1
⋯ 𝟎 ⋮ ⋱ ⋮
𝟎 ⋯ (1−𝜌 𝑟 )𝐼 𝑠 𝑟 +𝜌 𝑟 𝐼 𝑠 𝑟 ]
where 𝐸 𝑠
=𝟏 𝑠 𝑇 𝟏 𝑠 denotes the square matrix of ones of size s. Σ represents the
correlation structure of a random vector whose components have the same
18
pairwise correlation 𝜌 𝑘 within block k, but are uncorrelated between blocks. In
genetic terms means clusters of genes. Since many similarity matrices are well
approximated by a block equicorrelated matrix, the insights drawn from from this
particular case apply very generally.
If in addition 𝑦 follows a linear model with no adjustment covariates, 𝑦 =𝐺𝛽 + 𝜖 ,
𝜖 ~ 𝑁 (0,𝜎 2
𝐼 𝑛 ) and the design is orthogonal, i.e. 𝐺 𝑇 𝐺 =𝐼 𝑝 , (uncorrelated gene
expression levels), the score statistic in (1) can be written as a sum of block-
specific contributions as follows (see Appendix A for the derivation):
𝜎 2
𝑄 =∑ 𝑠 𝑘 {(1+(𝑠 𝑘 −1)𝜌 𝑘 )(𝛽̂(𝑘 )
)
2
+(1−𝜌 𝑘 ) 𝑉𝑎𝑟 (𝛽̂
(𝑘 )
)}
𝑟 𝑘 =1
(3)
where 𝛽̂
=(𝛽̂
(1)
,…,𝛽̂
(𝑠 𝑟 )
) represent the ordinary least squares estimates of the
regression coefficients for the gene expression variables, partitioned according to
the r blocks of Σ, the overbar denotes the sample mean and 𝑉𝑎𝑟 (∙) denotes the
sample variance. From expression (3) it is clear that each block specific
contribution to 𝑄 can be thought of as a weighted combination of two tests, one
based on the sample mean of the estimated regression coefficients 𝛽̂(𝑘 )
, which
tests the mean of the true population regression coefficients within the block, and
the other based on the sample variance of the regression coefficients, 𝑉𝑎𝑟 (𝛽̂
(𝑘 )
) ,
which tests the variability of the true population regression coefficients within the
block. For a fixed norm ‖𝛽̂
(𝑘 )
‖ within block 𝑘 , the term (𝛽̂(𝑘 )
)
2
is largest when
the coefficients of 𝛽̂
(𝑘 )
are all equal, and the corresponding test has the best
power when the true regression coefficients 𝛽 (𝑘 )
are similar in magnitude and
19
sign within the block. By contrast, the term 𝑉𝑎𝑟 (𝛽̂
(𝑘 )
)=
1
𝑠 𝑘 ‖𝛽̂
(𝑘 )
−𝛽̂(𝑘 )
‖
2
is
largest when the sample mean of the estimated regression coefficients 𝛽̂(𝑘 )
=0,
and the corresponding test has the best power when the true regression
coefficients have opposing signs and ‘cancel each other out’ within the block.
Since the two statistics are weighed proportionally to 𝜌 𝑘 and 1−𝜌 𝑘 respectively,
we can interpret 𝜌 𝑘 as encoding prior information about the similarity of the
regression coefficients within block 𝑘 . When 𝜌 𝑘 is close to 1 the prior specifies
that the coefficients are expected to be very similar to each other and so the test
based on the mean receives most of the weight. When 𝜌 𝑘 is close to zero the
prior specifies that the coefficients are expected to be highly variable and their
mean small, and so the test based on the variance receives most of the weight.
For an equicorrelated prior (i.e. a single block), 𝜌 =𝜌 1
=0 yields the standard 𝑝 -
degree of freedom test.
The contribution from each block to 𝑄 is weighted by the block size 𝑠 𝑘 , so larger
blocks have relatively more influence on the overall statistic. It is therefore more
important for the prior information corresponding to large blocks to be correctly
specified.
To illustrate the effect of the structure of the regression coefficients 𝛽 we plot the
power of the score test with a single equicorrelated block prior as a function of
the prior correlation 𝜌 , for a range of alternatives 𝐻 𝐴 : 𝛽 ≠0 (Figure 1) between
two extreme scenarios. The first extreme scenario consists of a regression
20
coefficient vector with identical components, 𝛽 =𝛽 1
=(𝑐 ,𝑐 ,𝑐 ,…,𝑐 )
⏟
𝑝 , 𝑐 >0. This
can be thought of as a a pure ‘means effect’ because there is no variability in the
regression coefficients. At the other end of the range we considered a pure
‘variance effect” scenario where the regression coefficients average to zero,
having the same magnitude but alternating signs, 𝛽 =𝛽 2
=(𝑐 ,−𝑐 ,𝑐 ,−𝑐 ,…,𝑐 ,−𝑐 )
⏟
𝑝 .
We examined the power for a range alternative hypotheses consisting of
regression vectors intermediate in structure between these two extremes β
1
and
β
2
, given by the convex combination of β
1
and β
2
, β=(1−𝑡 )β
1
+tβ
2
, 0≤𝑡 ≤1.
We keep the overall effect size ‖𝛽 ‖ =‖β
1
‖=‖β
2
‖=√𝑝 𝑐 constant because the
power of the standard p degrees of freedom test only depends on ‖𝛽 ‖ and not its
particular structure. Specifically, for each value p, we set the constant 𝑐 >0 so
that the p-degree of freedom test has exactly 50% power when testing at the 5%
significance level. This makes comparisons between the power achieved for
different values of 𝜌 and p easier to interpret.
For alternatives close to β
1
there are large power gains over the standard p-
degree of freedom test using the score test with the value of the prior correlation
𝜌 set appropriately high (Figure 1). As the gene set p increases, both the power
gains over the p-degree of freedom test, and the size of the set of alternatives
around β
1
for which the score test has high power, get larger. However, no single
value of 𝜌 yields a score test with good power across all possible alternatives.
For alternatives close to β
2
, the score test yields the maximum possible power for
values of the prior correlation 𝜌 close to zero (recall that for 𝜌 =0 the score test
21
becomes the standard p-degree of freedom test), but for larger values of 𝜌 the
power can be substantially lower than that of the power of the standard p-degree
of freedom test. The goal of our approach is to use prior information to set a
‘good value’ for 𝜌 , or more generally a good prior covariance matrix Σ that carries
information about the structure of the vector of regression coefficients β.
Figure 1. Power of the score test for different architectures of the regression coefficient vector 𝛽
Figure 1. Power of the score test for different 𝛽 vectors using an the equicorrelated prior
covariance matrix, Σ
𝜌 =(1−𝜌 )𝐼 𝑝 +𝜌𝐸
𝑝 , for a continuous outcome, 𝑦 =𝑋𝛾 + 𝐺𝛽 + 𝜖 ,
𝜖 ~ 𝑁 (0,𝜎 2
𝐼 𝑛 ) , and orthogonal design matrix, 𝐺 𝑇 𝐺 =𝐼 𝑝 . Each panel correspond to a gene-set size
of p = 4, 20 or 100 genes. Power is plotted as a function of the correlation parameter 𝜌 , and each
line corresponds to a different vector of regression coefficiednts β=(1−𝑡 )β
1
+𝑡 β
2
∈ ℝ
𝑝 ,0≤𝑡 ≤
1, consisting of a weighted combinations of a ‘pure mean’ effect β
1
=𝑐 (1,1,1,…,1) , and a ‘pure
variance’ effect, β
2
=𝑐 (1,−1,1,−1,…,1,−1) . The scale factor 𝑐 >0 is set so that the overall
effect size, ‖β‖, is fixed at a value that yields 50% power to reject the null hypothesis 𝐻 0
: 𝛽 =0
using the standard chi-square test with p-degrees of freedom (horizontal gray line).
Robustified score test
22
The power of score test is supposed to be improved as the incorporated GO-
based similarity close to the true relationship between genes. However, the GO
does not always guarantee a true biological prior knowledge of genes. When the
relationship among the genes’ effects is not correctly reflected by the prior
similarity matrix 𝚺 , i.e. the prior is misspecified, the performance of the score test
above will deteriorate. Since a misspecified prior is likely to be the norm rather
than the exception, it is important to develop a test that is robust to
misspecification, i.e. a test that ideally gains power provided the prior is at least
partially informative, but losses only minimal power when the prior is completely
misspecified. Lin has proposed a global score test that can jointly test effects with
different prior covariance [Lin 2007]. It can be roughly considered as a robust
score test by assuming that the genotype effect has a mixed prior of
independence covariance and GO-related covariance. However, this method has
rarely been applied by related studies. To robustify the performance of the score
test, we adopt the approach in the SKAT-o test for rare variants (Lee, et al.,
2012), by introducing a parameterized prior consisting of a weighted average
between an informative and a non-informative prior covariance. Specifically, we
assume as before that the random effect 𝛽 follows a continuous multivariate
distribution with expectation 0 and covariance 𝜏 2
Σ
𝑠 , but where Σ
s
=sΣ+(1 -s)𝐼 𝑝 is
a now a convex combination of an independent covariance matrix prior, 𝐼 𝑝 , and a
a fully informative covariance matrix prior, Σ. Key to the approach is that the
weight 𝑠 is a tuning parameter which is chosen based on the data. Specifically, if
𝑄 𝑠 is the score statistics when the prior covariance is 𝜏 2
Σ
𝑠 and 𝑝 𝑠 its
23
corresponding p-value, we select the value 0≤𝑠̂ ≤1 that yields the smallest p-
value, i.e. 𝑠̂ =𝑎𝑟𝑔𝑚𝑖𝑛 0 ≤ 𝑠 ≤ 1
𝑝 𝑠 . In practice, this is accomplished by performing a
search over a grid of values of 𝑡 , 0= 𝑠 1
<𝑠 2
<⋯<𝑠 𝑏 =1, computing the p-
value for each 𝑄 𝑠 𝑗 , 1 ≤ j ≤𝑏 , and choosing the smallest one. The minimum p-
value 𝑝 𝑡̂
becomes the test statistic for the gene set test. In the particular case of
an equicorrelated correlation structure, Σ=(1−ρ)𝐼 𝑝 + ρ𝟏 𝑝 𝟏 𝒑 𝑇 , the robustified
score is the SKAT-o statistic (Lee, et al., 2012) derived an approximation to the
null distribution for this particular case. For a general prior similarity matrix Σ
however, the null distribution of the smallest p-value, 𝑝 𝑠 ̂
, does not have a known
close form but it can be efficiently obtained by simulation using the following
steps:
1. For each value of 𝑠 𝑗 , j=1,..., b, in the grid of values of 𝑡 compute the
vector of mixing weights 𝛌 𝑗 =(λ
𝑗 ,1
,…,λ
𝑗 ,𝑝 ) (the sorted set of nonzero
eigenvalues of 𝑃 −
1
2
𝐾 𝑃 −
1
2
).
2. Draw M replicates (say M =10,000) of p independent 1-df central chi-
square variables 𝑋 1
(𝑘 )
,...,𝑋 𝑝 (𝑘 )
, k =1,..., M, and for each replicate compute
the mixture Q
𝑗 (𝑘 )
=∑ 𝜆 𝑘 (𝑗 ) 𝑝 𝑘 =1
𝑋 1
(𝑘 )
. Compute the p-value 𝑝 𝑗 (𝑘 )
for Q
𝑗 (𝑘 )
based,
for example, on Davies’ method.
3. For each k=1,...,M, compute the smallest p-value 𝑝 𝑘 ∗
= min
0≤𝑗 ≤𝑏 𝑝 𝑗 (𝑘 )
. The
empirical distribution of the 𝑝 ̂
𝑘 ’s approximates the null distribution the
minimum p-value
24
4. Compute the p-value for the set as the proportion of 𝑝 𝑘 ∗
’s that exceed the
opserved value 𝑝 𝑠 ̂
.
Part C. Simulation studies
Gene expression and covariate data
To evaluate the performance of the proposed test we carried out a simulation
study that explores a range of scenarios with gene sets of different sizes and with
different gene similarity structures. Of particular interest was to assess the
robustness of the test and the extent to which incorporating partially informative
prior covariances can improve the power of gene-set analysis. The design of the
simulation was guided by the lessons learned from the block equicorrelated prior
case above. In addition, to make the simulation data as realistic as possible, we
generated synthetic phenotypes and extracted gene sets, expression profiles,
and prior information from real data.
Gene expression profiles were obtained from a cohort of 187 patients from a
study of prostate cancer prognosis generated using the Illumina DASL assay,
and which included 29,336 expression probes for 19,751 genes. Later in the
application section we perform a reanalysis of a peroxisome gene set from the
same study. Further details of the study are provided therein.
Prior resource data
As prior information structures for our simulation we used similarity matrices
derived from the GO. Specifically, cosine similarity derived from cancer-related
biological process (BP) GO annotations. To determine cancer related BP terms,
25
we obtained a list of the top 312 cancer candidate genes (i.e. genes with ranking
score no less than 4) from the Gene Ranker of TCGA GBM 600, which we then
submitted to PANTHER to perform a Gene-set overrepresentation test. The top
significantly overrepresented BP terms with a fold change of no less than 2.0
were considered as candidate cancer-related terms. GO terms annotated with
more than 1000 human genes were excluded. The final terms included as prior
resource is showed in table 1. Terms to all those selected terms and their
descendant terms were obtained from the GuickGO, a web service of the
European Bioinformatics Institute.
GO term name Fold Enrichment P-value
regulation of cell cycle > 5 1.60E-06
DNA recombination > 5 1.23E-06
DNA repair > 5 6.51E-14
cell proliferation > 5 1.56E-07
DNA metabolic process > 5 8.17E-13
negative regulation of
apoptotic process
> 5 2.13E-02
hemopoiesis > 5 1.12E-02
protein phosphorylation 4.07 3.77E-10
response to stress 4.04 5.36E-11
chromatin organization 3.82 5.68E-03
phosphate-containing 3.37 3.07E-10
26
compound metabolic
process
reproduction 3 1.01E-02
apoptotic process 2.99 5.75E-04
cell death 2.91 8.77E-04
death 2.89 9.87E-04
biosynthetic process 2.86 3.09E-05
organelle organization 2.75 3.51E-03
mesoderm development 2.34 3.88E-02
Table 1. Significant GO BP terms overrepresented among top ranked cancer candidate human
genes, according to PANTHER Overrepresentation Test (release 20150430)
Gene set generation
We selected two gene sets to use in the simulations containing one and two
large gene clusters respectively (Figure 2). These were identified using complete
linkage hierarchical clustering with the cosine similarity based on cancer GO-
terms. A third gene set we used in the simulations is the set of peroxisomal
genes, which has three large gene clusters and we also use later in the
application section. In addition to the gene clusters, each of these gene sets also
contains ‘singleton’ genes with little similarity to the rest.
27
Figure 2. GO based gene clusters used in the simulation study
Figure 2a. Gene set 1. Figure 2b. Gene set 2
Figure 2c. Gene set 3 (peroxisomal genes)
Figure 2. Panels show the hierarchical cluster structure of each of the three gene sets used in
the simulation study. Hierarchical clustering was performed using cosine similarity between genes
based on the cancer GO terms annotated to each gene. The ‘Height’ on the y-axis refers to the
similarity between genes, with low values indicating more similar and high values indicating less
similar genes. Clusters are marked with a rounded red rectangle. For each gene set clusters are
numbered 1,2… from left to right.
For each scenario, a block equicorrelation matrix was used to specify the cluster
structure of gene set, with each block corresponding to one of clusters within set.
Equicorrelation within each block was specified to be equal to the averaged
similarity of corresponding cluster.
28
Phenotype generation
We generated a binary phenotype to mimic the recurring (1)/ non-recurring (0)
cancer status of the original prostate cancer study from which the expression
profiles were obtained, using a logistic regression of the form:
logit {Pr(𝑦 =1|𝑋 ,𝐺 )}=𝛾 0
+𝛾 1
𝑥 1
+𝛾 2
𝑥 2
+𝐺𝛽 (3)
where
i
y is the binary outcome indicator, 𝑥 1
and 𝑥 2
are two adjusting covariates
that mimic age and gender, generated independently of the the expression
profiles: 𝑥 1
= age, assumed uniformly distributed between 40 and 60, and 𝑥 1
=
gender, assumed Bernoulli distributed with probability equal to 0.5. The Intercept
𝛾 0
was set to -1, and effects of age 𝛾 1
and of gender 𝛾 2
were fixed at 0.5 and 1
respectively. These choices yield 0 and 1 outcome classes that are
approximately balanced. For each simulation scenario we generated 2,000
dataset replicates.
To simulate under the null hypothesis of no differential expression 𝛽 was set to
zero. To simulate under the alternative, a particular cluster or clusters of genes
within the subset was designated as differentially expressed and its regression
coefficients were set to 𝛽 𝑖 =𝜇 +𝜎 ,𝜎 ~𝑁 (0,𝜎 2
) , where 𝜇 nonzero values the For
each scenario, corresponding to non-clustered genes were set to be zero;
corresponding to clustered genes were set using 𝛽 =𝜇 +𝜎 ,𝜎 ~𝑁 (0,𝜎 2
) .
Parameter and were set as Table 2. .
29
Table 2. Structure of the vector of regression coefficients 𝛽 in the simulation study.
𝛽 with small variance 𝛽 with larger variance
Gene
Set
Cluster
(i)
Gene Set
Cluster
(i)
1 1 0.16 0.001 1 1 0.16 0.1
2
1 0.045 0.001
2
1 0.035 0.01
2 0 0 2 0 0
3
1 0.1 0.001
3
1 0.1 0.001
2 0.8 0.001 2 -0.1 0.001
3 0.6 0.001 3 -0.1 0.001
Table 2. Regression parameters for generating the binary outcome using the logistic model (3)
Prior information matrix simulation
For both Type I error and Power analysis, the adaptive score test without
incorporation of prior ( I ), with incorporation of global equicorrelated prior
( ' 11 ), with incorporation of misspecified prior (
mis
), with incorporation of
partially informative prior (
partial
) and with incorporation of true prior (
true
)
were included.
true
was block equicorrelation matrix constructed based on Figure
2, in which the entry values within each block were set as average of similarities
between genes within the corresponding cluster, while the entry values between
blocks were set as 0;
partial
was constructed based on the original similarity
matrix between genes which was used for clustering;
mis
was constructed based
on Figure 2, in which entry values within each block were set as 0, while entry
values corresponding to the block of singleton genes were set as the average of
similarities between singleton genes.
30
Part D. Results and discussion
Power simulation results
Figure 3. Y-axis represent the power rate of detecting a differentially expressed gene set,
and X axis represents proportion of genes that are differentially expressed out of all genes
that are clustered. Each column corresponds to a scenario where the candidate gene set
contains 1 cluster, 2 clusters or 3 clusters of genes. Each row corresponds to a scenario
where the variance of effects of differentially expressed genes is small or large. For each
of 6 panels, the power of adaptive score test with incorporation of 5 different prior
information was plotted given 0%, 25%, 50% and 100% genes within clusters are
differentially expressed.
When there is one cluster of differentially expressed genes within the set, the
power of adaptive score test does not differ much by different prior information if
the variance of effect within cluster is large; If the variance of effect within cluster
is small, using informative, partially informative prior or global equicorrelated prior
tends to be more powerful. When there are two clusters of genes within set and
31
only one cluster having differentially expressed genes, score test using
informative prior or partially informative tends to be mostly powerful. Using global
equicorrelated prior can be more powerful than using independent prior, but not
as powerful as using informative prior. This trend tends to be much stronger
when the variance of effect within cluster is getting smaller. When there is three
clusters of differently expressed genes within set, and when the global mean of
genes is large (effects for each cluster are in same direction), using informative
or partially informative prior tends to be more powerful than using global
equicorrelated prior. However, if the global mean of genes is small (effects for
each cluster are in different direction), using informative or partially informative
prior tends to be much more powerful, while using global equicorrelated prior can
be even less powerful than using independent prior.
The risk of incorporating misspecified prior into the test has been reduced by the
adaptive score test. For all scenarios, using misspecified prior is only little bit
worse than using independent prior.
Application to prostate cancer recurrence and peroxisome genes
Peroxisomes contain a variety of enzymes that involved in numerous metabolism
process and the link between peroxisomal dysfunction and a variety of diseases
has already been established(Titorenko and Rachubinski, 2004). However, the
relationship between human peroxisome and progressive prostate cancer has
not been studied before. In this study, we applied the adaptive score test to the
32
cohort of prostate cancer patients to evaluate the association between human
peroxisomes and progressive prostate cancer.
We define the peroxisomal genes according to both Watkins and PeroxisomeDB
website. Appendix B lists all gene members.
Given the cohort of prostate cancer patient from USC Norris Cancer center and
gene expression profile from Illumina Inc, we analyzed the association between
progressive cancer and expression of human peroxisomal genes using gene set
from Watkins 2010, PeroxisomeDB website and the union of them, separately.
Both manual experimental and other curator-based types of GO BP annotations
were used as prior knowledge of genes. GO-based cosine semantic similarity
between genes was calculated using R package csbl.go(Ovaska, et al., 2008).
Adaptive score test with incorporation of independence matrix, perfect similarity
matrix and GO-based were applied to the analysis, separately. Patients’
characteristics including operation of year, age, Gleason score and stage of
cancer were adjusted as covariates in the analysis. The results were showed in
Table 6.
Gene Set
Prior
Watkins 2010 PeroxisomeDB Watkins 2010 and
PeroxisomeDB
independence 0.379 0.111 0.127
Equicorrelated 0.0805 0.169 0.172
GO-based similarity 0.0514 0.043 0.0319
Table 3. p-value for association between progressive prostate cancer and peroxisomal gene set
using adaptive score test with incorporation of three different type of priors.
33
Given the results in Table 3, incorporating the prior information of GO into the
adaptive score test gave a more significant result. It shows the association
between peroxisomal genes and progressive prostate cancer is significant or
marginally significant, differed by definition of gene set. Such significant evidence
cannot be obtained without using prior information of GO, i.e. assuming
peroxisomal genes are independent with each other. Assuming genes are
perfectly related with each other can give a marginally significant evidence of the
association using Watkins’ definition. However, it reduces the significance of
evidence given other definitions. Therefore, this application indicates that
estimating gene-gene relationship using the GO information and incorporating it
into the analysis tends to increase the chance of identifying risk factors for
disease.
To better understand the association between peroxisomal genes with prostate
cancer recurrence, we performed T-test to evaluate univariate association for
each gene in the pathway. According to table 4, there are 5 genes significantly
associated with the recurrence of prostate cancer (p-value <0.05). However,
without using prior information, those significant genes didn’t drive the
association of whole peroxisomal gene-set to be significant. After incorporate the
prior gene-gene similarity information, the results of whole-gene set become
significant. This also confirmes our assumption for this study: if there are
significant genes among the gene set, and if similar genes have similar effects,
then incorporate the similarity between genes (effects) can improve the power of
gene-set test.
34
Table 4
Gene name T-test p-value
AMACR 0.0008
SOD1 0.0027
ABCD3 0.0093
SLC22A5 0.0238
IDI2 0.0358
Table 4. peroxisomal genes that are significantly related to prostate cancer recurrence.
Discussion
In this study we developed a novel statistical approach to incorporate gene-gene
similarity information into gene-set analysis. Particularly, our approach divides a
gene set into different sub-set by incorporating a block-equicorrelated matrix from
gene-gene similarity information. Each block corresponds to a particular sub-set
of genes. The advantage of this approach is that it can maintains statistical
power even if only partial sub-sets of genes are related to the outcome, while the
traditional gene-set approach which measures the overall effect of the whole
gene set are not powerful, such GESA, SKAT, SKATo and Goeman’s global test.
While our approach is robust to the mis-specification of gene-gene similarity, the
crucial part of maintaining power is to using informative prior information for
estimation of gene-gene similarity. In this study we proposed Gene Ontology as
prior information. However, while Gene Ontology is considered to be one of most
commonly used biological resource, there are still limitations including
35
incompleteness of the knowledge (annotations). This probably explains why the
peroxisomal pathway testing for its association with prostate cancer tends to be
powerful only all Gene Ontology annotations are used as prior. When only
cancer-related annotations (sparse annotations) are used our proposed
approach tends to be less powerful than traditional score test. Considering this
limitation of sparse annotations, testing using all available annotations (manual
and electronic) is recommended.
Besides the quality of prior information, the similarity metrics used to calculate
gene-gene similarity from prior information is also very important. The next
chapter introduces different similarity measures and evaluates their performance
given both complete and incomplete annotations. In addition, this proposed
approach incorporates a block-equicorrelated matrix into score test, among
which different block corresponds different gene sub-set. Hierarchical clustering
algorithms can be used to make such matrix. However, Gene-Ontology allows
genes belong to different clusters, which a hierarchical clustering cannot be fit for.
A recent algorithm CliXO (Kramer, et al., 2014) can allows multiple parent
clustering. However, how to make a positive definite lock matrix from such
multiple parent clustering and benefit the gene-set analysis deserves future
exploration. In addition, the completeness of annotations for the peroxisomal
genes could be evaluated by comparing with their orthologs among model-
organisms. Incompleteness of annotations could potentially affect the gene-gene
similarity information as we will talk about in the next chapter. Understanding how
36
incompletely the peroxisomal human genes annotated would help to judge the
precision of the results.
Chapter II: Effect of incompleteness of Gene Ontology
annotation
Part A. Introduction
Introduction and background
Gene Ontology (GO), a standardized vocabulary of biological function and
process terms, is one of the most frequently used resources for gene function
annotations(Ashburner, et al., 2000). It consists of 3 domains: molecular function,
biological process and cellular component. Within each domain, the ontology is
structured as a directed acyclic graph (DAG) and consists of GO terms that
represent different biological properties. Terms low in the DAG are more specific
and can have two types of defined relationships to one or more parent terms
(higher and more general): “is-a” indicating a children term is a sub-class of its
parent term while “part-of” meaning it is a component of its parent term. One can
see that the lower terms represent more specific and informative biological
properties than those at higher level. This implies that genes annotated by a
given term will also be annotated by its parent term. Because of that, the root
ontology term in each domain (the ancestor of all other terms) is associated with
all genes under consideration, while the leaf terms (the lowest ones) usually are
associated with a small proportion of genes which have the corresponding
functions. It is now common to use the GO in many applications (e.g. gene group
enrichment, gene network and pathway analysis)(Chang, et al., 2013; Feng, et
37
al., 2007; Subramanian, et al., 2005). However, the reliability and accuracy of
results of most large-scale GO analysis depends on, intuitively, how complete the
GO annotations of gene products are.
The GO annotations can be assigned either by expert curators from evidence in
the published literature (called “experimental annotations”) or by inference using
computational methods (called “electronic annotations”). Due to weaker evidence
such as sequence homology, electronic annotations are generally considered to
be least reliable (du Plessis, et al., 2011). While Skunca, N., et al. (2012)
reported that the quality of electronic annotations has significantly improved in
recent years, many users of the GO still rely mostly on curated annotations and
usually assign a low weight of electronic annotations or completely excluded
them from analysis(du Plessis, et al., 2011; Skunca, et al., 2012). However, due
to our limited biological knowledge, as well as the limited time and resources of
curators, the experimental annotations is far from complete and consist of only
less than 1% of current available GO annotations. Therefore, it is essential for
users of the GO to understand the impact of incompleteness and evaluate its
potential effect on their analyses.
This is particularly important for research in non-model organisms, e.g. human,
because the limitation of approaching genetics experiments in non-model
organisms. In other words, the incompleteness of annotations is biased towards
non-model organisms. Such bias can potentially mislead users of the GO, and
sometimes even give unexpected results that are contradictory to some well-
proved hypothesis. One good example is the “ortholog conjecture”, which
38
assumes orthologous genes are functionally more similar than paralogous genes.
More functional similarity among paralogs than orthologs has been reported
before (Nehrt, et al., 2011). While many factors have been considered for such
unexpected result (e.g. experimental bias and annotation errors), Thomas et al.
related this issue to the differently biased GO annotations among different
species: mouse annotations are more enriched in organism-level processes,
while human genes are more enriched in cellular biochemical level processes.
These differences result in observed functional differences between human and
mouse orthologs, but actually reflect biases in types of biological knowledge, or
incompleteness of GO annotations (Thomas, et al., 2012). Another proof of the
impact of incomplete annotations is from the findings of Chen and Zhang (2012):
GO similarity between orthologs has been increasing over the past several years,
presumably as GO annotations become more complete(Chen and Zhang, 2012).
Considering those reported impacts of incompleteness, it is important to evaluate
the robustness of different types of very commonly used similarity measures to
the incompleteness of annotations, especially considering their performance on
gene clustering (Frohlich, et al., 2007; Ovaska, et al., 2008; Yu, et al., 2010). As
a step in this direction, we compare the robustness of several commonly used
similarity measures to the completeness of annotation data. Although it is difficult
to definitively conclude which clusters best reflect the underlying biology, we
explore this question by choosing a few well-studied biological processes (DNA
repair and autophagy) and ask how well the genes in these processes cluster
together using different metrics with both complete and incomplete annotations.
39
A novel design was developed in this work to roughly quantify the
incompleteness of annotations of single-cellular and multicellular process among
human orthologs, and use this approximation to simulate incompletely annotated
genes from those originally complete ones.
Part B. Materials and methods
Data sources
Given the fact that model organisms are usually enriched in experimentally-
derived annotations, we assumed annotations of orthologs of human genes
among model organisms (e.g. yeast and mouse) are relatively complete and
hence as our “control”. More specifically, because mouse is much more enriched
for annotations of organism-level process while yeast is enriched for single-
cellular process, we set two different “controls” in this study: mouse orthologs
which we treated as fully annotated with multicellular process terms, and yeast
orthologs which we treated as fullly annotated with single-cellular process terms.
To avoid comparison between sparse annotation data, we further restricted the
“control” to orthologs which have already been well-studied by experiments using
mouse/yeast. In more details, we defined the concept “well-studied gene” by the
number of references a gene was associated with (e.g. number of Pubmed
publications). A criterion of 75 Pubmed associations was used to identify “well-
studied” orthologs among mouse/yeast. Annotation frequency of those othologs
was then compared between mouse/yeast and human. The difference,
considered as parameters for the incompleteness of knowledge among human,
40
can be then used to simulate incompletely annotated orthologs among
mouse/yeast, which referred to the “case” in this study. Since the only ‘variable’
here is the completeness of annotations, a comparison of GO analysis between
using the “control” and the “case” can give us a more precise insight into the
effect of incompleteness of annotations.
Files for association between PubMed IDs and yeast genes were obtained from
Saccharomyces Genome Database (SGD) (Cherry, et al., 1998). Files for
associations between PubMed IDs and mouse genes were obtained from the
Mouse Genome Informatics (MGI) Data and Statistical Reports (Bult, et al., 2004).
This identified a total of 1238 well-studied yeast genes and 1284 well-studied
mouse genes were identified initially. Among the well-annotated list, 866 yeast
markers and 850 mouse markers were annotated with GO experimental
annotations according to the GO database(Harris, et al., 2004). Therefore, those
well-studied genes with GO annotation were set as the complete annotation
dataset.
Human orthologs of well-studied yeast and well-studied mouse genes were
obtained from the Protein Analysis Through Evolutionary Relationships
(PANTHER) Classification System (version 8.1)(Mi, et al., 2013). Only the “least
diverged orthologs” (LDOs), defined as “the genes in two different organisms that
have diverged the least since their common ancestor and are most likely to retain
the greatest functional similarities”, were kept. Any pair of LDOs that were not
one-to-one matched was excluded. Out of 866/850 well-studied yeast/mouse
genes, 484/813 of them have human LDOs identified.
41
GO annotations of those well studied yeast/mouse genes and their human LDOs
were subset to only cellular/multicelluar process ones. For yeast/mouse genes
and their human LDOs, only biological process (BP) GO terms to descendants of
cellular/multicellular process (GO: 0009987/GO: 0032501) were included. In
addition, only “is_a”, “part_of” or “occurs_ in” relationships were used to also
associate each "direct" annotation with higher terms in the DAG, and only terms
with an experimental evidence code (EXP, IDA, IPI, IMP, IGI and IEP) indicated
were considered (except those with qualifier values of “NOT”, “contributes_to” or
“colocalizes_with”). A total of 2299/7813 cellular/multicellular process annotations
for 866/850 well studied yeast/mouse, and a total of 789/818 cellular/multicellular
process annotations for their 484/813 human LDOs were downloaded from the
GO database. Among these datasets, any redundant GO annotations due to
different evidence (e.g. different interfering RNA’s used to knock down the mRNA
levels) were excluded. In addition, for each gene, annotations that are not the
most specific one (those who have offspring terms annotated on the same gene)
were excluded. Finally, any GO term not identified by the Bioconductor package
GO.db 2.8.0 version (http://www.bioconductor.org/ packages/ release/
data/annotation/html/GO.db.html) was excluded. After filtering datasets as above,
a total of 2109/4913 cellular/multicellular process annotations of those well-
studied yeast/mouse genes were kept, respectively, while a total of 583/656
cellular/multicellular process annotations for their human LDOs were obtained.
Quantifying the incompleteness of GO annotations
42
Incompleteness of cellular/multicellular process annotations for human genes
was quantified by comparing the number of cellular/multicellular process
annotations between each pair of human-yeast/mouse LDOs (a total of 484/813
pairs). To avoid sparse comparison, only those pairs which have annotations for
both human and yeast/mouse LDOs were kept. Thus, 395/314 pairs of human-
yeast/mouse LDOs were kept. The difference in the number of annotations
between each yeast/mouse gene and its human LDOs is calculated. Human
genes with as many or more annotations than their yeast/mouse (mouse) LDOs
were assumed to be completely annotated, with the amount of missing
knowledge set as 0. This distribution was used to generate simulated incomplete
annotation sets by randomly removing terms from the complete yeast or mouse
gene sets.
Simulations
The basic idea behind this simulation is to create annotation sets that
approximate the degree of incompleteness of human gene sets. We consider
cellular processes and multicellular processes separately, and treat well-studied
yeast genes as being “completely” annotated with respect to cellular processes,
and well-studied mouse genes as being “completely” annotated with respect to
multicellular processes. We then randomly remove cellular/multicellular
annotations from the well-studied and completely annotated yeast/mouse genes
so that they become incomplete to the same degree as their human LDOs.
Based on the frequency distribution of missing annotations among human
orthologs of yeast and mouse (Plots 1 and 2 in Appendix, respectively), we
43
designed a simulation to generate 100 sets of 395 yeast genes with the same
amount of missing cellular process annotations as shown in plot 1, and another
100 sets of 314 mouse genes with the same amount of missing multicellular
process annotations as shown in plot 2. Those simulated genes, with the same
amount of missing knowledge as human genes, can be then used to compare
with original complete ones for inference of effect of incompleteness among
human genes. The procedure for incomplete annotation set simulation was as
follows:
Step 1: according to plot 1/plot2, among all the sample of human genes,
identify the largest number of missing cellular/multicelluar process annotations a
gene could have, denoted by n1.
Step 2: identify the proportion of yeast/mouse LDOs that have at least
n1cellular/multicelluar process annotations, denoted by p1.
Step 3: from all well-annotated yeast/mouse genes, randomly select a
proportion of p1 genes from those having at least n1 cellular/multicellular process
annotations. (If there were a less proportion of p1 genes having at least n1
annotations, select all of them).
Step 4: randomly remove n1 annotations from each of those selected
genes. Keep them as the first subset of incompletely annotated mouse/yeast
genes, denoted by S1. For the rest the well annotated yeast/mouse genes, repeat
step 1-4 to continue generating S2, S3, S4, …., Sd. (d denotes the number of
different possible missing annotation values an ortholog could have, which
44
equals to 11 for orthologs of yeast and 34 for orthologs of mouse). All simulated
subsets S1, S2, …., Sd are combined together as a whole simulated yeast (mouse)
gene set. We repeat same procedure above 100 times to generate 100 different
simulated yeast (mouse) gene sets.
Information content (IC) and Semantic Similarity
Among several available semantic similarity computation packages, csbl.go
(1.4.0 version) was used in this study because of its computational efficiency [10].
GO terms GO:0036302, GO:0036303, GO:0086094, GO:0097324, GO:1900164,
GO:0036205, GO:0036228, GO:0036297, GO:0097301 and GO:1900622
unavailable in csbl.go were excluded from our analysis. Four measures we
considered in this study require calculating the “information content” (IC) of GO
terms. The IC of a term ti is defined by:
i i
t p t IC log (1)
where p(ti) is the probability of the term ti occurring in the set of annotations for all
genes in the GO database:
root
i
i
t freq
t freq
t p (2)
where troot represents the root term of ontology,
root
t freq equals the total number
of gene annotations in that aspect of the ontology and
i
t freq equals the number
of genes annotated with the term ti or any of its children terms. As we mentioned
earlier, the more specific and informative a term is, the fewer genes are
annotated by that term. Intuitively, as we move from the highest root term to the
45
lowest leaf terms,
i
t freq decreases and IC(ti) increases (from 0 to infinity). This
makes IC a measure of how much biological information a GO term imparts,
which can be further applied to semantic similarity measures.
The most common IC-based semantic similarity measures using GO are Resnik's,
Lin's, Jiang and Conrath's and Schlicker’s measures (D, 1998; Jiang JJ; Resnik,
1995; Schlicker and Albrecht, 2008). This study includes all these four measures.
Resnik’s measure simply uses the IC value of the common ancestors of two
terms as their similarity, and is denoted by:
2 1 2 1 Re
, , t t LCA IC t t Sim
s
(3)
Where
2 1
,t t LCA represents the lowest (most specific) common ancestor term in
the ontology that is shared by terms t1 and t2. The other 3 measures, however,
take into consideration both the IC values of terms t1 and t2 and the IC value of
2 1
,t t LCA .
Lin's measure takes the similarity between terms t1 and t2 as the ratio between
the IC value of their LCA and the average IC value of those two single terms:
2 1
2 1
2 1
, 2
,
t IC t IC
t t LCA IC
t t Sim
Lin
(4)
Jiang and Conrath proposed a measure that is a function of the difference
between the average IC value of two terms and the IC value of the ICA they
share:
1 , 2
1
,
2 1 2 1
2 1
t t LCA IC t IC t IC
t t Sim
Jiang
(5)
46
Both
2 1
,t t Sim
Lin
and
2 1
,t t Sim
Jiang
are determined by how “similar” each term (t1
and t2) is to the LCA of those terms.
2 1 Re
,t t Sim
s
, on the other hand, depends on
the specificity of the LCA itself. Schlicker developed a new measure that
combines both of these attributes by weighting Lin’s measure according the
specificity of the LCA of t1 and t2, which he called the Relevance measure:
2 1
2 1
2 1 2 1 Re
,
, 1 2 ,
t IC t IC
t t LCA IC
t t LCA p t t Sem
l
(6)
In contrast to
2 1 Re
,t t Sim
s
which ranges between 0 and infinity,
2 1
,t t Sim
Lin
,
2 1
,t t Sim
Jiang
and
2 1 Re
,t t Sim
l
all range between 0 and 1.
IC-based scores between GO terms given by each measure above can be used
to calculate IC-based scores between gene products in three different ways
(which we call average, maximum and rcmax, as described below). Because
each gene product can have multiple GO annotations, we need to calculate a
total similarity score for gene
1
g (annotated with n terms) vs. gene
2
g (annotated
with m terms), from the m n GO term similarity matrix S . Each element
ij
s
represents the similarity between the
th
i GO term of
1
g and
th
j GO term of
2
g .
The similarity between
1
g and
2
g can be calculated in 3 ways as follows:
1) use the average score across all pairs of GO terms:
mn
s
g g Sim
m
j
n
i i
ij
1
2 1
, , (7)
2) Use the maximum score across all pairs of GO terms:
47
m j n i s g g Sim
ij
1 , 1 : max ,
2 1
(8)
3) Map each GO term of the gene product
1
g to its rcmax term (i.e. the
single, best matching annotation) of gene product
2
g , average scores
of all these rcmax mappings from
1
g to
2
g , and vice versa. The larger
average is used as the final score between
1
g and
2
g . The formula is
defined as below:
m
n i s
n Scorecolum
m
j
ij
1
1 : max
(9)
n
m j s
Scorerow
n
i
ij
1
1 : max
(10)
The similarity between
1
g and
2
g is then defined as:
Scorerow n Scorecolum g g Sim , m ax ,
2 1
(11)
These three methods are referred as the “average”, “maximum”, and “row-
column maximum (rcmax)” in this paper. Therefore, a total of 12 IC-based
similarities between genes are considered: Lin, JiangConrath, Resnik and
Relevance measure, each with average, maximum and rcmax method.
In addition to IC-based measures, we consider as well the Cosine and weighted
Jaccard measures (Bodenreider, et al., 2005; Pesquita, et al., 2008). The Cosine
measure is based on a vector model space: For a gene g , a vector
n
w w w v , , ,
2 1
is built, with
i
w
equal to
i
t IC if g is annotated with term
i
t and
0 if not. The Cosine Similarity between two genes gi and gj is then defined as:
48
j i
j i
v v
v v
g g Sim
2 1
, (12)
Where · represents the dot product and represent the vector norm. The
weighted Jaccard measure is based on the vector model space as well:
considering genes
1
g and
2
g annotated with term sets GO1 and GO2, the weighted
Jaccard measure similarity between
1
g and
2
g is defined as:
) (
) (
,
2 1
2 1
2 1
i
GO GO
t
i
GO GO
t
t IC
t IC
g g Sim
i
i
(13)
Under each simulation, each type of similarity between genes was calculated for
the complete and all simulated incomplete annotation sets. The change in
similarity due to incompleteness for each pair of genes was calculated separately
for each simulated incomplete annotation set, and these values were then
combined into a distribution of similarity score changes over all simulations.
Gene clustering based on GO annotations using Clique Extracted Ontology
algorithms
We consider using Clique Extracted Oncology (Clixo) algorithms developed by
Kramer(Kramer, et al., 2014). We assume Clixo can give a better performance in
terms of GO-based gene clustering because it allows each gene belongs to
multiple clusters, which fits the Gene ontology structure better than traditional
algorithms (e.g. hierarchical clustering). Let
2 1
, g g Sim denote similarity between
two genes
1
g and
2
g , the corresponding distance between
1
g and
2
g is then
defined by
2 1 2 1
, 1 , g g Sim g g d . For Resnik’s similarity which ranges between
49
0 and infinity,
2 1 Re
2 1 Re
, 1
1
,
g g Sim
g g d
s
s
according to (Ovaska, et al., 2008).
Yeast/Mouse genes with full annotations set and incomplete annotations set are
then clustered separately, based on the distance matrix. As hierarchical
clustering results in a hierarchical tree that can be cut at different similarity
thresholds to define a set of clusters, 10 different clustering thresholds were used,
respectively, that were increased evenly from the bottom to the top of hierarchical
tree.
Measurement of the impact of incomplete annotations on gene clustering
To understand how incomplete annotations affect clustering genes using different
metrics, clustering based on fully annotated genes and incompletely annotated
genes (using the same thresholds obtained from the full annotation set) were
compared. Specifically, we ask whether a gene, after losing annotations, would
be still clustered into the same group or similar groups, become ungrouped
singleton genes, or misclassified into different groups. To answer that, we firstly
match clusters obtained from fully annotations to those obtained from
incompletely annotations as described below:
A two way contingency table T was made, with the
th
i row defining the
th
i cluster of
fully annotated genes
i complete
C
,
and the
th
j column defining the
th
j cluster of
incompletely annotated genes
j incomplete
C
,
.
ij
n represents the number of genes
among
j incomplete i complete
C C
, ,
. For each j where ) ( 1 T ncol j ,
j incomplete
C
,
was
mapped to
'
,i complete
C if
j T nrow j j
j i
n n n n
, 2 1
, , , max
'
(randomly choose one if multiple
50
maximum exists). Hence, for each i where ) ( 1 T nrow i ,
i complete
C
,
can be matched
to m incomplete clusters
m incomplete incomplete incomplete
C C C
, 2 , 1 ,
, , , where 1 m . Among
those matched incomplete clusters,
'
, j incomplete
C was defined as the best matched
cluster of
i complete
C
,
if
im i i
ij
n n n n , , , max
2 1
'
.
For eachi , the proportion of genes among
i complete
C
,
that were clustered into the best
matched incomplete cluster of
i complete
C
,
, any other matched incomplete clusters of
i complete
C
,
, leave as singletons or clustered into other clusters rather than those matched
ones under incomplete annotations were calculated, respectively. Among singleton
genes g
) ( , 2 , 1 ,
, , ,
T nrow complete complete complete
C C C , the proportion that remained as
singleton genes using incomplete annotations was calculated as well. Those
proportions were compared across all similarity measures to evaluate their
performance under incomplete GO knowledge.
Clustering genes from sample pathways and measuring the impact of
incomplete data on it
To better understand how incomplete GO annotations affects capturing the
biological truth, we chose several pathways for yeast and investigated to what
extent genes from each pathway can be clustered together using incomplete
data. Specifically, DNA repair (GO: 0006281), Macroautophagy (GO: 0016236),
protein import into nucleus (GO: 0006606), mitotic sister chromatid segregation
(GO: 0000070), recombinational repair (GO: 0000725), post-Golgi vesicle-
mediated transport (GO:0006892) and cytokinesis (GO:0000910) were selected
51
as sample pathways. For each pathway, we calculate the proportion of genes
from that pathway that are successfully clustered in the same set given each
metric, i.e. the true positive rate (TP), and the proportion of other pathway genes
that mis-classified into that pathway, i.e. the false positive rate (FP). A ROC
curve can then be plotted based on TP and FP at multiple threshold values.
Finally, ROC curves given the full data set were compared with those given an
incomplete data set to evaluate the impact of incompleteness of data on gene
clustering.
In addition, because the DNA repair process is actually a general pathway that
subsumes four distinct DNA repair processes, we break the whole general
pathway into several more specific subclasses, i.e. base-excision repair (GO:
0006284), nucleotide-excision repair (GO: 0006289), mismatch repair (GO:
0006298), double-strand break repair via nonhomologous end joining (GO:
0006303) and double-strand break repair via homologous recombination (GO:
0000724). True positive rate and false discovery rate were calculated for
clustering of genes from each of those specific DNA repair genes.
Part C. Analysis and Results
Robustness of similarities between genes given incomplete
annotations
The similarity between genes can be either decreased or increased when
annotations became incomplete, depending on what type of similarity metrics
was used. As Fig 1 shows, IC avg similarities between yeast genes, especially
those originally high, tend to be even higher when cellular level annotations
52
become incomplete. IC rcmax similarities are more likely to be decreased than
increased by the incomplete cellular level annotations, while IC maximum
similarities are always decreased. The change of Cosine and overlapped
similarities due to missing cellular level annotations are more likely to be random
(Fig1). The incomplete organism-level annotations are found to have a similar
pattern of impact on the similarities between mouse genes (S1 Fig).
53
Figure 1. Plots of similarities of all pairwise cellular-level process genes given each metric
under complete vs. incomplete GO annotations. Each point represents a unique pairwise
gene with the value on X axis as their similarity given the complete annotations and the
value on Y axis as their similarity given a random simulated incomplete set of annotations.
Therefore, each pairwise gene is repeated 100 times in each plot, with each pair having the
same similarity under complete annotations but a different similarity under a different
simulation.
The heterogeneity of similarity metrics in terms of gene clustering
given full annotations
To better compare and understand the impact of incomplete annotations on gene
clustering using different similarity metrics, for each similarity metric, genes were
first clustered using complete annotations across different metrics. Fig 2 is a
heatmap of the overlap between clusters of yeast genes using different metrics.
As Fig 2 shows, clusters tend to be more similar between Lin and JiangConrath
measures, and between Cosine and WeightedJaccard measures. In addition,
genes clustered by Lin avg and JiangConrath avg measures are more likely to be
clustered as well by Lin rcmax or maximum, and JiangConrath rcmax or
maximum measure, respectively. In contrast, genes clustered by Lin rcmax or
maximum, and JiangConrath rcmax or maximum measures, are less likely to be
clustered by Lin average and JiangConrath average, respectively. Intuitively, this
pattern implies that IC-average metrics are more conservative than IC-rcmax and
54
IC-maximum metrics, and tend to cluster more genes. However, we do not
observe a similar pattern among clusters of mouse genes based on organism
level annotations.
Figure 2. the heatmap of overlap between clusters of fully annotated yeast genes across
different similarity metrics. Within each row, the proportion of genes clustered together
by a specific metric that are clustered again by other metrics are reported separately, by
columns. The clustering threshold is set as the 25
th
percentile height of cluster tree
generated by the row metric.
General impact of incomplete annotations on clustering
Given results from the previous part that rcmax and maximum combination
methods cluster more genes than the average methods, it is interesting to see if
this is still true given incomplete data. Intuitively, it depends on how robust those
55
metrics are. Again, the robustness differs mostly by average method and
rcmax/maximum method. Figs 3 showed that missing multi-cellular level
annotations tend to have less impact on clusters obtained by IC average
measures than those obtained by IC rcmax or maximum measures (a higher
proportion of genes remaining in the same clusters are observed when using IC
average measures), but more impact on singleton genes obtained by IC average
measures than those obtained by IC rcmax or maximum measures (a lower
proportion of genes remaining as singletons are observed when using IC
average measures). The impact of incompleteness of annotations on clustering
become smaller when the data is less incomplete (single-cellular genes), but the
difference in the impact between average method and rcmax/maximum method
still exists (S2 Fig). In addition, a higher proportion of mis-clustered genes (yellow
plots) are observed given incomplete organism level annotations (Fig 3) than
given incomplete cellular level annotations (S2 Fig). Considering that organism
level annotations are less complete than cellular-level annotations, it implies the
importance of data completeness on clustering accuracy.
56
Figure 3. Plots of the averaged proportion of mouse genes that remain in the best matched
cluster, in other matched clusters, broken into singletons from clusters, misclassified into
unmatched clusters or remain as singletons when their multi-cellular process GO
annotations become incomplete. Each type of the point and color corresponds to an
averaged proportion obtained from a unique measure. In addition, the X-axis represents
different clustering thresholds. A total of 10 different thresholds increase from the left to
the right evenly, representing the height of corresponding hierarchical tree from the root
to the top.
57
Impact of incomplete annotations on clustering of sample pathway
genes
Results for clustering genes from pathways of macroautophagy, base-excision
repair, and mismatch repair (Fig 4, 5 and 6) indicate that the true positives of
clustering using most metrics tend to decrease due to incomplete annotations.
Such impact is particularly strong for IC maximum measures. More specifically,
Lin and JiangConrath maximum measures which are more powerful for clustering,
i.e. with higher TP rates, are more likely to give less true postives due to missing
annotations (See also S3-S14 Fig). On the other hand, the false discoveries of
clustering using IC maximum measures tend to be increased due to incomplete
annotations (decreased FDR rate). The impact of missing annotations on
clustering using IC rcmax measures is similar to IC maximum measures. For
example, clustering of macroautophagy genes and base-excision repair genes
using IC rcmax measures yields fewer true positives based on incomplete data
(Fig 4 and 5), and clustering of mismatch repair genes and double-strand break
repair genes using IC rcmax measures can give less false discoveries based on
incomplete annotations (Fig 6 and 7). However, we find that this is not
necessarily true of all GO classes. In contrast, either true positives or false
discoveries of clustering using IC average measures can be increased based on
incomplete annotations, e.g. clustering of macroautophagy genes and mismatch
repair genes using Lin, Relevance and Resnik average metrics (Fig 4 and 6,
respectively), and clustering of double-strand break repair genes using all
average metrics (Fig 7). In addition, while true positives given by Cosine and
WeightedJaccard measures can be either increased or decreased due to
58
incomplete data (Fig 4 and 5), the false discoveries tend to be increased in most
cases (Fig 6 and 7). Overall, while the true positives of clustering using different
metrics tend to become similar under incomplete annotations, IC rcmax and IC
maximum measures tend to have a lower FDR. Moreover, among metrics using
rcmax and maximum methods, Lin and JiangConrath measures tend to be more
powerful sometimes under incomplete data than Relevance and Resnik
measures (Fig 4 and 5).
In addition, while incomplete annotations have a smaller impact on clustering of
large pathways such as DNA repair process (S4 Fig), Fig 6 and 7 shows that it
can have effects on clustering some sub-pathways of the DNA repair process,
which indicates that the more informative the target biological pathway is, the
more important the completeness of annotations is.
59
Figure 4. ROC plot of clustering macroautophagy genes under both complete and
incomplete GO annotations. The TP is defined as the proportion of macroautophagy genes
that successfully clustered into the macroautophagy gene set. The FP is defined as the
proportion of non-macroautophagy yeast genes mis-classified as macroautophagy genes.
Each type of the point and color corresponds to the true positive rate obtained from either
full or incomplete annotations. The X-axis represents different clustering thresholds (size
of cluster). A total of 20 different thresholds increase from the left to the right evenly,
representing the height of corresponding hierarchical tree from the root to the top.
60
Figure 5. ROC plot of clustering base-excision repair genes under both complete and
incomplete GO annotations. The TP is defined as the proportion of base-excision repair
genes that successfully clustered into the base-excision repair gene set. The FP is defined
as the proportion of non-base-excision repair genes being mis-classified as base-excision
repair genes set. Each type of the point and color corresponds to the true positive rate
obtained from either full or incomplete annotations. The X-axis represents different
61
clustering thresholds (size of cluster). A total of 10 different thresholds increase from the
left to the right evenly, representing the height of corresponding hierarchical tree from the
root to the top.
Figure 6. ROC plot of clustering 5 mismatch repair genes under both complete and
incomplete GO annotations. The TP is defined as the proportion of mismatch repair genes
62
that successfully clustered into the mismatch repair genes. The FP is defined as the
proportion of non-mismatch repair genes being mis-classified as the mismatch repair
genes set. Each type of the point and color corresponds to the true positive rate obtained
from either full or incomplete annotations. The X-axis represents different clustering
thresholds (size of cluster). A total of 10 different thresholds increase from the left to the
right evenly, representing the height of corresponding hierarchical tree from the root to
the top.
63
Figure 7. ROC plot of clustering 6 double-strand break repair via nonhomologous end
joining genes under both complete and incomplete GO annotations. The TP is defined as
the proportion double-strand break repair via nonhomologous end joining genes that
successfully clustered into double-strand break repair via nonhomologous end joining
genes set. The FP is defined as the proportion of non- double-strand break repair via
nonhomologous end joining genes being mis-classified as double-strand break repair via
nonhomologous end joining genes set. Each type of the point and color corresponds to
the true positive rate obtained from either full or incomplete annotations. The X-axis
represents different clustering thresholds (size of cluster). A total of 10 different
thresholds increase from the left to the right evenly, representing the height of
corresponding hierarchical tree from the root to the top.
Part D. Discussion
In this work we developed a well-controlled design to compare gene-gene
similarities and gene clustering between full and incomplete annotation sets.
While previous studies have observed the potential biases of similarity metrics
due to biased incomplete annotations (Thomas, et al., 2012), this study extends
the evaluation to a more general situation of random incompleteness, with more
metrics being included. We found that incomplete GO annotations have impact
on GO-based gene clustering with respect to both true positives and false
discoveries among clusters. One interesting finding is that the effect of
incompleteness of annotations on IC similarity metrics differs greatly depending
on the method used to combine the semantic similarity scores for pairs of GO
annotations, into an overall score for pairs of genes.
Pesquita C et.al (2008) has compared and discussed different term combination
approaches (Pesquita, et al., 2008): the average approach considers both
matched and unmatched annotation pairs between genes, which can result in
64
underestimations of similarities. In contrast, the maximum approach only uses
the most similar term pair between two genes, and tends to over-estimate the
similarity. The rcmax approach, however, compares each annotation from one
gene to its best-matched one from the other gene, and thus gives the best
performance by disregarding those unmatched pairs. This conclusion supports
most results in our study based on the full annotations: the average approaches,
which tends to underestimate similarity between similar genes, has a poor
behavior of clustering similar genes together. The maximum approach, especially
with Lin and JiangConrath measures, has the most power for clustering similar
genes together but runs a high risk of including false members as well. The
differences can be interpreted as well by results from Tables S1and S2: a higher
proportion of genes clustered by IC-average measures can be clustered as well
by IC-rcmax or -maximum measures, while a smaller proportion is grouped
together by IC average metrics.
However, comparisons between those approaches based on incomplete
annotations have rarely been evaluated. This study evaluates the behavior of
different similarity metrics and clustering methods for incomplete datasets: while
no metric generates clusters with a high true positive rate on incomplete data, IC
rcmax and maximum approaches are still superior to other metrics because they
generate clusters with smaller false discovery rates. We now discuss the
possible reasons for this observation.
First the average approach tends to underestimate similarities between genes
because it over-weights unmatched pairs of annotations. However, with fewer
65
pairs of annotations included, such underestimation is reduced (Fig 1). Intuitively,
this makes clustered genes more likely to stay in the same clusters and
singletons genes more likely to be clustered when annotations become less
(incomplete). It explains why IC average metrics actually tend to generate
clusters with a higher percentage of both true positives and false discoveries
based on incomplete annotations.
In contrast to the average approach, IC maximum and rcmax measures tend to
overestimate similarities given full dataset. However, such overestimations can
be reduced if the most functional similar annotations are missed.
Clustering based on incomplete annotations, on the other hand, tends to be
similar across different metrics: no metric generates clusters with a high true
positive rate. Nevertheless, Lin and JiangConrath measures with rcmax or
maximum approach might still be considered as more appropriate than others
given incomplete data due to a low percentage of false discoveries among
clusters.
It is also worthwhile to understand why Lin and JiangConrath metrics tend to give
better clustering than other IC metrics, on both full and incomplete data. One
explanation is that both Resnik and Relevance metrics weight the similarity
between two terms by the absolute specificity of their lowest common ancestor
(depth in the GO), while Lin and JiangConrath metrics evaluate the similarity by
calculating how much the specificity changed from the ancestor term to children
terms. This would cause different behavior in quantifying relationship between
66
terms, especially under some extreme conditions. For example, Lin and
JiangConrath metrics will give a perfect similarity between two identical terms,
while the Resnik and Relevance metrics will not, as their lowest common
ancestor (themselves) is not generally of maximum specificity. Such issues can
cause similarity between two gene products (term sets) being biased toward
most specific term pairs rather than perfectly matched pairs, especially when
rcmax or maximum approach is used. For example, consider, three simple
annotation sets
2 1 1
,t t T , ) , (
3 2 2
t t T and ) (
4 3
t T , where the most similar pair of
terms given by Resnik or Relevance metric is
4 3
,t t rather than
1 1
,t t if
4 3
,t t LCA is more informative than
2
t . Resnik and Relevance maximum metrics
will cluster
4
T and
2
T together instead of
3
T and
2
T even though they share the
same annotations. This explains why Resnik and Relevance metrics do not
generally cluster as many similar genes from specific pathways (e.g. macro-
autophagy) as Lin and JiangConrath metrics when the rcmax or maximum
approach is used. Such limitations still exist even the GO annotation become
less and incomplete.
Based on all discussions above, a major contribution of this work is to help the
GO users to choose the most biologically meaningful similarity metrics and
clustering methods. A number of previous studies have evaluated different
similarity metrics according to their correlations with standard measures such as
sequence /gene expression similarity. Several different studies identified Resnik
metrics as the best-performance one (Guo, et al., 2006; Lord, et al., 2003; Mistry
67
and Pavlidis, 2008; Sevilla, et al., 2005; Xu, et al., 2008). However, we evaluate
those metrics by a different standard: gene clustering under both complete and
incomplete annotations. When annotations are complete, Lin and JiangConrath
metrics with rcmax approaches are preferred to other IC metrics. However, given
today’s incomplete GO data, both maximum and rcmax approaches also perform
well for clustering when used with Lin or JiangConrath metrics. Finally, one of the
future directions for our work is to study how the information content of the GO
term can be affected by incomplete annotations. Here we assume information
content for each term is fixed, which is a very strong assumption and one of the
limitations for this study. Another future direction for our work is to evaluate if
electronic annotations can reduce the impact of incomplete experimental
annotations. We can compare human genes annotated with both experimental
and electronic annotations and their orthologs among mouse/yeast annotated
with only experimental annotations. The difference can be used to evaluate the
change of impact of incompleteness due to adding electronic annotations.
68
References
Ashburner, M., et al. Gene ontology: tool for the unification of biology. The Gene
Ontology Consortium. Nat Genet 2000;25(1):25-29.
Bodenreider, O., Aubry, M. and Burgun, A. Non-lexical approaches to identifying
associative relations in the gene ontology. Pacific Symposium on Biocomputing
2005 2005:91-102.
Bult, C.J., et al. The Mouse Genome Database (MGD): integrating biology with
the genome. Nucleic Acids Res 2004;32(Database issue):D476-481.
Cancer Genome Atlas Research, N. Comprehensive genomic characterization
defines human glioblastoma genes and core pathways. Nature
2008;455(7216):1061-1068.
Cantor, R.M., Lange, K. and Sinsheimer, J.S. Prioritizing GWAS results: A review
of statistical methods and recommendations for their application. Am J Hum
Genet 2010;86(1):6-22.
Chang, B., Kustra, R. and Tian, W. Functional-network-based gene set analysis
using gene-ontology. PLoS One 2013;8(2):e55635.
Chasman, D.I. On the utility of gene set methods in genomewide association
studies of quantitative traits. Genet Epidemiol 2008;32(7):658-668.
Chen, X. and Zhang, J. The ortholog conjecture is untestable by the current gene
ontology but is supported by RNA sequencing data. PLoS Comput Biol
2012;8(11):e1002784.
Cherry, J.M., et al. SGD: Saccharomyces Genome Database. Nucleic Acids Res
1998;26(1):73-79.
69
D, L. An information-theoretic definition of similarity. In, In 15th International Conf
on Machine Learning. Morgan Kaufmann, San Francisco, CA. 1998. p. 296-304.
du Plessis, L., Skunca, N. and Dessimoz, C. The what, where, how and why of
gene ontology--a primer for bioinformaticians. Brief Bioinform 2011;12(6):723-
735.
Feng, Z., et al. Pathway and gene ontology based analysis of gene expression in
a rat model of cerebral ischemic tolerance. Brain Res 2007;1177:103-123.
Frohlich, H., et al. GOSim--an R-package for computation of information theoretic
GO similarities between terms and gene products. BMC Bioinformatics
2007;8:166.
Goeman, J.J. and le Cessie, S. A goodness-of-fit test for multinomial logistic
regression. Biometrics 2006;62(4):980-985.
Goeman, J.J., et al. A global test for groups of genes: testing association with a
clinical outcome. Bioinformatics 2004;20(1):93-99.
Guo, X., et al. Assessing semantic similarity measures for the characterization of
human regulatory pathways. Bioinformatics 2006;22(8):967-973.
Harris, M.A., et al. The Gene Ontology (GO) database and informatics resource.
Nucleic Acids Res 2004;32(Database issue):D258-261.
Hass, J., et al. Associations between DNA methylation and schizophrenia-related
intermediate phenotypes - a gene set enrichment analysis. Prog
Neuropsychopharmacol Biol Psychiatry 2015;59:31-39.
Holmans, P., et al. Gene ontology analysis of GWA study data sets provides
insights into the biology of bipolar disorder. Am J Hum Genet 2009;85(1):13-24.
70
Hu, Y. and Zhao, H. CCor: A whole genome network-based similarity measure
between two genes. Biometrics 2016.
Huang, D.W., et al. DAVID Bioinformatics Resources: expanded annotation
database and novel algorithms to better extract biology from large gene lists.
Nucleic Acids Res 2007;35(Web Server issue):W169-175.
Huang, Y.T. and Lin, X. Gene set analysis using variance component tests. BMC
Bioinformatics 2013;14:210.
Ionita-Laza, I., et al. Sequence kernel association tests for the combined effect of
rare and common variants. Am J Hum Genet 2013;92(6):841-853.
Jiang JJ, C.D. Semantic Similarity Based on Corpus Statistics and Lexical
Taxonomy. In, ROCLING X: 1997; Taiwan 1997.
Kanehisa, M., et al. KEGG as a reference resource for gene and protein
annotation. Nucleic Acids Res 2016;44(D1):D457-462.
Kong, S.W., Pu, W.T. and Park, P.J. A multivariate approach for integrating
genome-wide expression data and biological knowledge. Bioinformatics
2006;22(19):2373-2380.
Kramer, M., et al. Inferring gene ontologies from pairwise similarity data.
Bioinformatics 2014;30(12):i34-42.
Lebrec, J.J., et al. Integration of gene ontology pathways with North American
Rheumatoid Arthritis Consortium genome-wide association data via linear
modeling. BMC Proc 2009;3 Suppl 7:S94.
71
Lee, S., et al. Optimal unified approach for rare-variant association testing with
application to small-sample case-control whole-exome sequencing studies. Am J
Hum Genet 2012;91(2):224-237.
Liu, D., Lin, X. and Ghosh, D. Semiparametric regression of multidimensional
genetic pathway data: least-squares kernel machines and linear mixed models.
Biometrics 2007;63(4):1079-1088.
Lord, P.W., et al. Semantic similarity measures as tools for exploring the gene
ontology. Pac Symp Biocomput 2003:601-612.
Mi, H., Muruganujan, A. and Thomas, P.D. PANTHER in 2013: modeling the
evolution of gene function, and other gene attributes, in the context of
phylogenetic trees. Nucleic Acids Res 2013;41(Database issue):D377-386.
Mi, H., et al. PANTHER version 10: expanded protein families and functions, and
analysis tools. Nucleic Acids Res 2016;44(D1):D336-342.
Mistry, M. and Pavlidis, P. Gene Ontology term overlap as a measure of gene
functional similarity. Bmc Bioinformatics 2008;9.
Mistry, M. and Pavlidis, P. Gene Ontology term overlap as a measure of gene
functional similarity. BMC Bioinformatics 2008;9:327.
Nehrt, N.L., et al. Testing the ortholog conjecture with comparative functional
genomic data from mammals. PLoS Comput Biol 2011;7(6):e1002073.
Ovaska, K., Laakso, M. and Hautaniemi, S. Fast gene ontology based clustering
for microarray experiments. BioData Min 2008;1(1):11.
Pesquita, C., et al. Metrics for GO based protein semantic similarity: a systematic
evaluation. BMC Bioinformatics 2008;9 Suppl 5:S4.
72
Pesquita, C., et al. Metrics for GO based protein semantic similarity: a systematic
evaluation. Bmc Bioinformatics 2008;9.
Resnik, P. Using information content to evaluate semantic similarity in a
taxonomy. Int Joint Conf Artif 1995:448-453.
Schaid, D.J. Genomic similarity and kernel methods I: advancements by building
on mathematical and statistical foundations. Hum Hered 2010;70(2):109-131.
Schaid, D.J. Genomic similarity and kernel methods II: methods for genomic
information. Hum Hered 2010;70(2):132-140.
Schaid, D.J., et al. Using the gene ontology to scan multilevel gene sets for
associations in genome wide association studies. Genet Epidemiol 2012;36(1):3-
16.
Schlicker, A. and Albrecht, M. FunSimMat: a comprehensive functional similarity
database. Nucleic Acids Research 2008;36:D434-D439.
Schlicker, A., et al. A new measure for functional similarity of gene products
based on Gene Ontology. BMC Bioinformatics 2006;7:302.
Sevilla, J.L., et al. Correlation between gene expression and GO semantic
similarity. Ieee Acm T Comput Bi 2005;2(4):330-338.
Skunca, N., Altenhoff, A. and Dessimoz, C. Quality of computationally inferred
gene ontology annotations. PLoS Comput Biol 2012;8(5):e1002533.
Subramanian, A., et al. Gene set enrichment analysis: a knowledge-based
approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U
S A 2005;102(43):15545-15550.
73
Thomas, P.D., et al. On the Use of Gene Ontology Annotations to Assess
Functional Similarity among Orthologs and Paralogs: A Short Report. PLoS
Comput Biol 2012;8(2):e1002386.
Titorenko, V.I. and Rachubinski, R.A. The peroxisome: orchestrating important
developmental decisions from inside the cell. J Cell Biol 2004;164(5):641-645.
Tomfohr, J., Lu, J. and Kepler, T.B. Pathway level analysis of gene expression
using singular value decomposition. BMC Bioinformatics 2005;6:225.
Weng, L., et al. SNP-based pathway enrichment analysis for genome-wide
association studies. BMC Bioinformatics 2011;12:99.
Wu, M.C., et al. Rare-variant association testing for sequencing data with the
sequence kernel association test. Am J Hum Genet 2011;89(1):82-93.
Wu, M.C. and Lin, X. Prior biological knowledge-based approaches for the
analysis of genome-wide expression profiles using gene sets and pathways. Stat
Methods Med Res 2009;18(6):577-593.
Xu, T., Du, L. and Zhou, Y. Evaluation of GO-based functional similarity
measures using S. cerevisiae protein interaction and expression profile data.
BMC Bioinformatics 2008;9:472.
Yan, Q., et al. A Sequence Kernel Association Test for Dichotomous Traits in
Family Samples under a Generalized Linear Mixed Model. Hum Hered
2015;79(2):60-68.
Yu, G., et al. GOSemSim: an R package for measuring semantic similarity
among GO terms and gene products. Bioinformatics 2010;26(7):976-978.
74
Zhang, B. and Horvath, S. A general framework for weighted gene co-expression
network analysis. Stat Appl Genet Mol Biol 2005;4:Article17.
Appendix A. Derivation of the score statistic
Derivation of the score statistic under a block-equicorrelated prior
Here we assume the linear model 𝑦 =𝑋𝛾 + 𝐺𝛽 + 𝜖 , 𝜖 ~ 𝑁 (0,𝜎 2
𝐼 𝑛 ), where 𝐺 is an
n p matrix of genomic features (n > p), and 𝐼 𝑛 is the identity matrix of size n. We
are interested in testing the hypothesis 𝐻 0
:𝛽 =0 vs. 𝐻 𝐴 :𝛽 ≠0. For simplicity, and
without loss of generality, we further assume 𝛾 =0 (which can always be
achieved by ‘regressing away’ the effect of the adjusting variables 𝑋 ) and that 𝜎 2
is known (if 𝜎 2
is unknown it can be replaced by a consistent estimate without
changing the asymptotic distribution of the statistics involved), and that 𝐺 is of full
rank p. We can write the model more compactly as 𝑦 ~ 𝑁 (𝐺𝛽 ,𝜎 2
𝐼 𝑛 ). The
standard score test statistic for testing 𝐻 0
:𝛽 =0 , is
1
𝜎 2
𝑦 𝑇 𝐺 𝐺 𝑇 𝑦 =
1
𝜎 2
‖𝐺 𝑇 𝑦 ‖
2
, which has a chi-square distribution (central under the null and non-
central under the alternative) with p degrees of freedom, yields an omnibus test
that ‘spreads’ its power evenly across all directions in p dimensional space along
which 𝛽 can depart from the null, and therefore does not have great power to
detect departures along any one direction in particular. When there is a prior
belief that departures from the null are likely to occur along a preferential
direction, one could construct a single degree of freedom test with good power to
detect such departures. For example, if there were reason to believe that all
75
regression coefficients may be similar in sign and magnitude, once would test the
null 𝐻 0
:𝛽 =0 against the alternative hypothesis 𝐻 𝐴 :𝟏 𝑝 𝑇 𝛽 ≠0, 𝟏 𝑝 =(1,…1 ⏟
𝑝 )
𝑇 , , of
‘equal deviation’ along all axes. This can be accomplished by projecting the
ordinary least squares estimate of 𝛽 , 𝛽̂
=(𝐺 𝑇 𝐺 )
−1
𝐺 𝑇 𝑦 , onto the line spanned by
the vector 𝟏 𝑝 𝑇 and scaling the projection by its standard error:
𝑧 𝟏 𝑝 =
𝟏 𝑝 𝑇 𝛽̂
𝑆𝐸 (𝟏 𝑝 𝑇 𝛽̂
)
=
𝟏 𝑝 𝑇 (𝐺 𝑇 𝐺 )
−1
𝐺 𝑇 𝑦 𝝈 √𝟏 𝑝 𝑇 (𝐺 𝑇 𝐺 )
−1
𝟏 𝑝 ~ 𝑁 (
𝟏 𝑝 𝑇 𝛽 𝝈 √𝟏 𝑝 𝑇 (𝐺 𝑇 𝐺 )
−1
𝟏 𝑝 ,1) . Because 𝐸 [𝑧 𝟏 𝑝 ]=
𝟏 𝑝 𝑇 𝛽 𝝈 √𝟏 𝑝 𝑇 (𝐺 𝑇 𝐺 )
−1
𝟏 𝑝 , it is clear that, for a vector of regression coefficients of fixed overall
size ‖𝛽 ‖, where ‖ .‖ denotes the standard Euclidean norm, the power of the test
based on 𝑧 𝟏 𝑝 is maximal when 𝛽 is a multiple of 1
𝑝 , i.e. when all the regression
coefficients are identical.
A drawback of a single degree of freedom tests to detect departures from the null
𝐻 0
:𝛽 =0 along a pre-specified direction is that the power will be generally low to
detect departures along any other direction. In particular, if 𝛽 departs from the
null 𝐻 0
:𝛽 =0 along a direction orthogonal to that of the postulated one, the
resulting test will have no power to detect it. Since there usually is no certainty
about the expected type of departure from the null it is too ‘risky’ to ‘bet’ in favor
of a particular direction. This lack of power robustness renders any one single df
test of little practical use. An alternative to ‘hard’ constraints on the parameters 𝛽 ,
is to impose ‘soft’ constraints in the form of a distribution of possible values of 𝛽 ,
i.e. a random effect model for 𝛽 . Let us assume 𝛽 ~ 𝐹 (𝛽 ), where 𝐹 is a
76
continuous multivariate distribution in ℝ
𝑝 , with 𝐸 [𝛽 ]=0 , and 𝑉𝑎𝑟 (𝛽 )= 𝜏 2
Σ,
where Σ is a p p positive definite matrix correlation matrix. The null hypothesis
𝐻 0
:𝛽 =0 becomes 𝐻 0
:𝜏 2
=0, and the the score test for 𝐻 0
vs. 𝐻 𝐴 :𝜏 2
>0 is
based on the quadratic form 𝑄 =
1
𝜎 2
𝑦 𝑇 𝐺 Σ𝐺 𝑇 𝑦 . As shown below, the test based on
𝑄 is equivalent to a linear combination of p 1-df tests along pre-specified
directions of departure from the null 𝐻 0
:𝛽 =0. To see this, let 𝐺 Σ𝐺 𝑇 =𝑃 Λ𝑃 𝑇 , with
Λ=diag(λ
1
,…,λ
𝑝 ,
), λ
1
≥λ
2
≥⋯≥λ
𝑝 >0, a diagonal matrix with the positive
eigenvalues of 𝐺 Σ𝐺 𝑇 (the last 𝑛 −𝑝 eigenvalues are zero because 𝐺 Σ𝐺 𝑇 is of
rank p), and 𝑃 an n p matrix with orthogonal columns (i.e. 𝑃 𝑇 𝑃 =𝐼 𝑝 ), 𝑃 1
,…,𝑃 𝑝 ,
which are eigenvectors of 𝐺 Σ𝐺 𝑇 corresponding to the eigenvalues λ
1
,…,λ
𝑝 . Notice
that since 𝐺 Σ𝐺 𝑇 𝑃 𝑘 =𝜆 𝑘 𝑃 𝑘 it follows that 𝑃 𝑘 belongs to the space spanned by the
columns of 𝐺 , which we will nedd later. We can write:
𝑄 =
1
𝜎 2
𝑦 𝑇 𝐺 Σ𝐺 𝑇 𝑦 =
1
𝜎 2
𝑦 𝑇 𝑃 Λ𝑃 𝑇 𝑦 =(
1
𝜎 𝑃 𝑇 𝑦 )
𝑇 Λ(
1
𝜎 𝑃 𝑇 𝑦 )=∑ λ
𝑘 (
1
𝜎 𝑃 𝑘 𝑇 𝑦 )
2
𝑝 𝑘 =1
=∑ λ
𝑘 𝑧 𝑘 2
𝑝 𝑘 =1
(1)
where 𝑧 =(𝑧 1
…,𝑧 𝑝 )=
1
𝜎 𝑃 𝑇 𝑦 ~ 𝑁 (
1
𝜎 𝑃 𝑇 𝐺𝛽 ,𝐼 𝑛 ) . Each 𝑧 𝑘 =
1
𝜎 𝑃 𝑘 𝑇 𝑦 ~ 𝑁 (
1
𝜎 𝑃 𝑘 𝑇 𝐺𝛽 ,1), is
the projection of the scaled outcome vector
1
𝜎 𝑦 onto the direction spanned by the
eigenvector 𝑃 𝑘 , and 𝑧 𝑘 2
is the 1-df chi-square statistic for testing departures from
the null along the line spanned by 𝑃 𝑘 . Lets write 𝑦̂ =𝐺 𝛽̂
, and 𝜖 ̂=𝑦 −𝑦̂ . Since
𝑦 =𝐺 𝛽̂
+𝜖 ̂ , with 𝜖 ̂ , orthogonal to the space spanned by the columns of 𝐺 it
follows that 𝑃 𝑘 𝑦 =𝑃 𝑘 𝐺 𝛽̂
and we can rewrite (1) as:
𝑄 =∑ λ
𝑘 (
1
𝜎 𝑃 𝑘 𝑇 𝑦 )
2
𝑝 𝑘 =1
=∑ λ
𝑘 (
1
𝜎 𝑃 𝑘 𝑇 𝐺 𝛽̂
)
2
𝑝 𝑘 =1
=∑ λ
𝑘 (𝑄 𝑘 1
𝜎 𝛽̂
)
2
𝑝 𝑘 =1
77
where 𝑄 𝑘 =𝐺 𝑇 𝑃 𝑘
Expressed in terms of the original regression coefficients 𝛽 , 𝑧 𝑘 2
is the 1-df chi-
square statistic for testing departures from the null 𝐻 0
:𝛽 =0 along the line
spanned by 𝛽 𝑘 =(𝐺 𝑇 𝐺 )
−1
𝐺 𝑇 𝑃 𝑘 . Thus, the quadratic form 𝑄 can be written as a
linear combination of independent 1-df chi-square statistics with linear
combination coefficients λ
1
,…,λ
𝑝 , where each chi-square statistic is sensitive to
deviations along one of the orthogonal directions 𝛽 1
=(𝐺 𝑇 𝐺 )
−1
𝐺 𝑇 𝑃 1
,…,𝛽 𝑝 =
(𝐺 𝑇 𝐺 )
−1
𝐺 𝑇 𝑃 𝑝 , (the independence of the chi-square follows from the orthogonality
of the 𝑃 𝑘 ’s). When the p positive eigenvalues of 𝐺 Σ𝐺 𝑇 are identical, all directions
are weighted equally and 𝑄 reduces to the standard p-degree of freedom score
test statistic.
Form 𝑧 𝑘 ~ 𝑁 (
1
𝜎 𝑃 𝑘 𝑇 𝐺𝛽 ,1) it follows that the noncentrality parameter of 𝑧 𝑘 2
~ 𝜒 1
2
(𝛿 𝑘 2
)
is 𝛿 𝑘 2
=
1
𝜎 2
(𝑃 𝑘 𝑇 𝐺𝛽 )
2
and ‖𝛿 ‖
2
=∑ 𝛿 𝑘 2
=
𝑝 𝑘 =1
1
𝜎 2
∑ (𝑃 𝑘 𝑇 𝐺𝛽 )
2
=
𝑝 𝑘 =1
1
𝜎 2
‖𝑃 𝑇 𝐺𝛽 ‖
2
=
1
𝜎 2
‖𝐺𝛽 ‖
2
, where the last equality follows from the fact that 𝐺𝛽 is spanned by the
columns of 𝑃 and that the latter are orthonormal. It is clear then from inspection
of (2), that for a fixed ‖𝛿 ‖ the power of the test based on 𝑄 is maximal when the
chi-square component corresponding to the largest eigenvalue λ
1
has the highest
possible non-centrality parameter, i.e. when 𝛿 1
=‖𝛿 ‖, 𝛿 2
=⋯=𝛿 𝑝 =0 . In
general, the test will have good power if the chi-square components with large
eigenvalues also have large noncentrality parameters.
78
A particularly illuminating example amenable to explicit analytical treatment is the
case where the columns of 𝐺 are orthogonal,
1
√𝑛 𝐺 𝑇 1
√𝑛 𝐺 =𝐼 𝑝 , (e.g. uncorrelated
gene expression levels) and the prior similarity matrix Σ is block equicorrelated,
i.e. Σ is the partitioned block diagonal matrix with r blocks of sizes 𝑠 1
,…,𝑠 𝑟 , 𝑠 1
+
⋯+𝑠 𝑟 =𝑝 , such that within each block 𝑘 =1,…,𝑟 all diagonal entries are ones
and all off diagonal entries are 𝜌 𝑘 , 0≤𝜌 𝑘 ≤1 given by
Σ=[
(1−𝜌 1
)𝐼 𝑠 1
+𝜌 1
𝐸 𝑠 1
⋯ 𝟎 ⋮ ⋱ ⋮
𝟎 ⋯ (1−𝜌 𝑟 )𝐼 𝑠 𝑟 +𝜌 𝑟 𝐼 𝑠 𝑟 ]
(2)
where 𝐸 𝑠
=𝟏 𝑠 𝑇 𝟏 𝑠 denotes the square matrix of ones of size s. Σ represents the
correlation structure of a random vector whose variable components have the
same pairwise correlation 𝜌 𝑘 within block k, and whose component variables are
uncorrelated between blocks. The orthogonality of 𝐺 , 𝐺 𝑇 𝐺 =𝑛𝐼
𝑝 , simplifies the
analysis because it implies the p nonzero eigenvalues of 𝐺 Σ𝐺 𝑇 are identical to the
p eigenvalues of Σ and thus do not depend on the particular form of 𝐺 . It is also
immediate to verify that if 𝑈 𝑗 ,𝑗 =1,..𝑝 , are eigenvectors of Σ, then 𝑃 𝑗 =𝐺 𝑈 𝑗 ,𝑗 =
1,..𝑝 , are eigenvectors of 𝐺 Σ𝐺 𝑇 , and, because ‖𝑃 𝑗 ‖
2
=‖𝐺 𝑈 𝑗 ‖
2
=𝑈 𝑗 𝑇 𝐺 𝑇 𝐺 𝑈 𝑗 =
𝑈 𝑗 𝑇 𝑈 𝑗 =‖𝑈 𝑗 ‖
2
, 𝑃 𝑗 is normalized provided 𝑈 𝑗 is.
Let us examine first the case r=1, which reduces to a simple equicorrelated
structure (also known as exchangeable or compound symmetry):
79
Σ=(1−𝜌 )𝐼 𝑝 +𝜌𝐸
𝑝 =[
1 ⋯ 𝜌 ⋮ ⋱ ⋮
𝜌 ⋯ 1
]
The largest eigenvalue is 1+(𝑝 −1)𝜌 with multiplicity one and (normalized)
eigenvector 𝑈 1
=
1
√
𝑝 𝟏 𝒑 . The next 𝑝 −1 eigenvalues are all identical to 1−𝜌 . The
invariant space associated with the 1−𝜌 eigenvalue has dimension 𝑝 −1 and is
the orthogonal complement of 𝟏 𝑝 . Any orthonormal basis 𝑈 2
,…,𝑈 𝑝 for the
invariant space constitutes a normalized set of eigenvectors corresponding to the
eigenvalue 1−𝜌 . We write 𝑈 =[𝑈 1
,…,𝑈 𝑝 ] for the orthonormal matrix with
columns 𝑈 1
,…,𝑈 𝑝 . As pointed above, the normalized eigenvector of
𝐺 Σ𝐺 𝑇 corresponding to the eigenvalue 1+(𝑝 −1)𝜌 is 𝑃 1
=𝐺 𝑈 1
=
1
√
𝑝 𝐺 𝟏 𝒑 . Noticing
that 𝛽̂
=(𝐺 𝑇 𝐺 )
−1
𝐺 𝑇 𝑦 =𝐺 𝑇 𝑦 , we get 𝑃 1
𝑇 𝑦 = 𝑈 1
𝑇 𝐺 𝑇 𝑦 =
1
√
𝑝 𝟏 𝒑 𝑻 𝛽̂
and the quadratic
form in (1) becomes:
𝜎 2
𝑄 =[1+(𝑝 −1)𝜌 ](𝑃 1
𝑇 𝑦 )
2
+(1−𝜌 )(∑ (𝑃 𝑗 𝑇 𝑦 )
2
𝑝 𝑗 =2
)=
=[1+(𝑝 −1)𝜌 ](
1
√𝑝 𝟏 𝒑 𝑻 𝐺 𝑇 𝑦 )
2
+(1−𝜌 )(∑ (𝑈 𝑗 𝑇 𝐺 𝑇 𝑦 )
2
𝑝 𝑗 =2
)=
={1+(𝑝 −1)𝜌 }(
1
√𝑝 𝟏 𝑝 𝑇 𝐺 𝑇 𝑦 )
2
+(1−𝜌 ){‖𝑈 𝑇 𝐺 𝑇 𝑦 ‖
2
−(𝑈 1
𝑇 𝐺 𝑇 𝑦 )
2
}=
={1+(𝑝 −1)𝜌 }(√𝑝 𝛽̂
)
2
+(1−𝜌 )‖𝛽̂
−𝛽̂
‖
2
(3)
The final expression is a (weighted) variance decomposition of the overall effect
size ‖𝛽̂
‖
2
into a variability around zero (𝛽̂
)
2
and the residual variability ‖𝛽̂
−𝛽̂
‖
2
.
80
The term (𝛽̂
)
2
is large when the coefficients of 𝛽̂
are similar in magnitude and of
the same sign; it captures deviations of the average regression coefficients from
zero. The term ‖𝛽̂
−𝛽̂
‖
2
is large when the regression coefficients are highly
variable; it captures the overall deviation of the regression coefficients from the
mean of the regression coefficients.
For the general block equicorrelated case, let 𝑉 𝑘 =(0,..0,𝟏 𝑠 𝑘 ,0,..0), an indicator
vector for block 𝑘 =1,..,𝑟 , i.e. the vector with 𝑠 𝑘 ones in the positions
corresponding to block 𝑘 , and zeros elsewhere. The largest r eigenvalues of the
block equicorrelated matrix in (2) are 1+(𝑠 1
−1)𝜌 1
,..,1+(𝑠 𝑟 −1)𝜌 𝑟 with
corresponding normalized eigenvectors 𝐺 𝑉 𝑘 ,𝑘 =1,…,𝑟 . The remaining
eigenvalues are 1−𝜌 𝑘 , 𝑘 =1,…,𝑟 , each with multiplicity 𝑠 𝑗 −1. It follows from
the block structure of Σ and the case r =1 that the quadratic form can be written
as,
𝜎 2
𝑄 =∑(1+(𝑠 𝑘 −1)𝜌 𝑘 )
𝑟 𝑘 =1
(√𝑠 𝑘 𝛽̂
𝑘 )
2
+(1−𝜌 𝑘 )‖𝛽̂
𝑘 −𝛽̂
𝑘 ‖
2
=
=∑𝑠 𝑘 {(1+(𝑠 𝑘 −1)𝜌 𝑘 )(𝛽̂
𝑘 )
2
+(1−𝜌 𝑘 )
1
𝑠 𝑘 ‖𝛽̂
(𝑘 )
−𝛽̂
(𝑘 )
‖
2
}
𝑟 𝑘 =1
=
=∑𝑠 𝑘 {(1+(𝑠 𝑘 −1)𝜌 𝑘 )(𝛽̂
𝑘 )
2
+(1−𝜌 𝑘 ) 𝑉𝑎𝑟 (𝛽̂
(𝑘 )
)}
𝑟 𝑘 =1
where 𝛽̂
𝑘 represent the sub-vector of 𝛽̂
of dimension 𝑠 𝑘 corresponding to the
entries in block 𝑘 .
81
The orthogonality of 𝐺 , 𝐺 𝑇 𝐺 =𝐼 𝑝 , simplifies the analysis because it implies the p
nonzero eigenvalues of 𝐺 Σ𝐺 𝑇 are identical to the p eigenvalues of Σ and thus do
not depend on the particular form of 𝐺 . It is also immediate to verify that if 𝑈 𝑗 ,𝑗 =
1,..𝑝 , are eigenvectors of Σ, then 𝑃 𝑗 =𝐺 𝑈 𝑗 ,𝑗 =1,..𝑝 , are eigenvectors of 𝐺 Σ𝐺 𝑇 ,
and, because ‖𝑃 𝑗 ‖
2
=‖𝐺 𝑈 𝑗 ‖
2
=𝑈 𝑗 𝑇 𝐺 𝑇 𝐺 𝑈 𝑗 =𝑈 𝑗 𝑇 𝑈 𝑗 =‖𝑈 𝑗 ‖
2
, 𝑃 𝑗 is normalized
provided 𝑈 𝑗 is.
Appendix B. Human peroxisomal genes
Human peroxisomal genes based on Watkins 2010 and the PeroxisomeDB
website.
Gene name Watkins 2010 PeroxisomeDB Both
ABCD1 YES YES YES
ABCD2 YES YES YES
ABCD3 NO YES NO
ABCD4 YES YES YES
ABP1 YES NO NO
ACAA1 YES YES YES
ACAD11 NO YES NO
ACBD5 NO YES NO
ACOT1 NO YES NO
ACOT2 NO YES NO
ACOT4 YES YES YES
ACOT8 YES YES YES
ACOX1 YES YES YES
82
ACOX2 YES YES YES
ACOX3 YES YES YES
ACSF3 NO YES NO
ACSL1 NO YES NO
ACSL3 NO YES NO
ACSL4 NO YES NO
ACSL5 NO YES NO
ACSL6 NO YES NO
AGPS YES YES YES
AGXT YES YES YES
ALDH3A2 YES YES YES
AMACR YES YES YES
ATAD1 YES NO NO
BAAT YES YES YES
CAT YES YES YES
CNOT1 YES NO NO
CRAT YES YES YES
CROT YES YES YES
DAO YES YES YES
DDO YES YES YES
DECR2 NO YES NO
DHRS4 YES NO NO
DNAJC10 NO YES NO
DNM1L YES YES YES
ECH1 YES YES YES
ECH1 YES YES YES
EHHADH YES YES YES
83
ENSG00000106406 NO YES NO
ENSG00000140876 NO YES NO
ENSG00000157326 NO YES NO
EPHX2 YES YES YES
FAR1 YES YES YES
FAR2 YES YES YES
FIS1 YES YES YES
FNDC5 NO YES NO
GNPAT YES YES YES
GSTK1 YES YES YES
HACL1 YES YES YES
HAO1 YES YES YES
HAO2 YES YES YES
HMGCL YES YES YES
HMGCR YES NO NO
HSD17B4 YES YES YES
HSDL2 YES NO NO
IDE YES YES YES
IDH1 YES YES YES
IDI1 YES YES YES
IDI2 YES YES YES
LKAP YES NO NO
LONP2 YES YES YES
MAP2K2 YES NO NO
MAVS YES NO NO
MFF YES NO NO
MLYCD YES YES YES
84
Mosc2 NO YES NO
MPV17 NO YES NO
MPV17L YES NO NO
MUL1 YES NO NO
MVK NO YES NO
NOS2 YES YES YES
NOS2 YES YES YES
NUDT12 YES YES YES
NUDT7 YES NO NO
PAOX NO YES NO
PECI YES YES YES
PECR YES YES YES
PEX1 YES YES YES
PEX10 YES YES YES
PEX11A YES YES YES
PEX11B YES YES YES
PEX11G YES YES YES
PEX12 YES YES YES
PEX13 YES YES YES
PEX14 YES YES YES
PEX16 YES YES YES
PEX19 YES YES YES
PEX26 YES YES YES
PEX3 YES YES YES
PEX5 YES YES YES
PEX5L NO YES NO
PEX6 YES YES YES
85
PEX7 YES YES YES
PHYH YES YES YES
PIPOX YES YES YES
PMVK YES YES YES
PNPLA8 YES NO NO
POMC NO YES NO
PRDX1 NO YES NO
PRDX5 YES YES YES
PXMP2 YES YES YES
PXMP3 YES YES YES
PXMP4 YES YES YES
RHOC NO YES NO
SCP2 YES YES YES
SCP2 YES YES YES
SERHL NO YES NO
SLC22A5 NO YES NO
SLC25A17 YES YES YES
SLC27A2 YES YES YES
SOD1 YES YES YES
SOD2 NO YES NO
TMEM135 NO YES NO
TRIM37 YES YES YES
TTC1 YES NO NO
TYSND1 YES NO NO
VIM YES NO NO
XDH NO YES NO
ZADH2 NO YES NO
86
Appendix C. Supplemental Figures
Figure S1. Plots of similarities of all pairwise organism-level process genes given each
metric under complete vs. incomplete GO annotations. Each point represents a unique
pairwise gene with the value on X axis as their similarity given the complete annotations
and the value on Y axis as their similarity given a random simulated incomplete set of
87
annotations. Therefore, each pairwise gene is repeated 100 times in each plot, with each
pair having the same similarity under complete annotations but a different similarity under
a different simulation.
Figure S2. Plots of the averaged proportion of yeast genes that remain in the best matched
cluster, in other matched clusters, broken into singletons from clusters, misclassified into
unmatched clusters or remain as singletons genes when their cellular process GO
88
annotations become incomplete. Each type of the point and color corresponds to an
averaged proportion obtained from a unique measure. In addition, the X-axis represents
different clustering thresholds. A total of 20 different thresholds increase from the left to
the right evenly, representing the height of corresponding hierarchical tree from the root
to the top.
Figure S3. ROC plot of clustering 21 macroautophagy genes under both complete and
incomplete GO annotations. The TP is defined as the proportion of macroautophagy genes
that successfully clustered into the macroautophagy gene set. The FP is defined as the
89
proportion of non-macroautophagy yeast genes being classified as the macroautophagy
gene set. Each type of the point and color corresponds to the true positive rate obtained
from either full or incomplete annotations. The X-axis represents different clustering
thresholds (size of cluster). A total of 20 different thresholds increase from the left to the
right evenly, representing the height of corresponding hierarchical tree from the root to
the top.
90
Figure S4. Plots of the true positive rate of clustering 58 DNA repair genes under both
complete and incomplete GO annotations. The TP is defined as the proportion of DNA
repair genes that successfully clustered into the DNA repair set. The FP is defined as the
proportion of non-DNA repair genes being mis-classified as the DNA repair gene set. Each
type of the point and color corresponds to the true positive rate obtained from either full
or incomplete annotations. The X-axis represents different clustering thresholds (size of
cluster). A total of 20 different thresholds increase from the left to the right evenly,
representing the height of corresponding hierarchical tree from the root to the top.
91
Figure S5. ROC plot of clustering 6 base-excision repair genes under both complete and
incomplete GO annotations. The TP is defined as the proportion of base-excision repair
genes that successfully clustered into the base-excision repair gene set. The FP is defined
as the proportion of non-base-excision repair genes among the base-excision repair gene
set. Each type of the point and color corresponds to the true positive rate obtained from
92
either full or incomplete annotations. The X-axis represents different clustering thresholds
(size of cluster). A total of 20 different thresholds increase from the left to the right evenly,
representing the height of corresponding hierarchical tree from the root to the top.
93
Figure S6. ROC plot of clustering 15 nucleotide-excision repair genes under both complete
and incomplete GO annotations. The TP is defined as the proportion of nucleotide-
excision repair genes that successfully clustered into the nucleotide-excision repair gene
set. The FP is defined as the proportion of non-nucleotide-excision repair genes being
classified in the nucleotide-excision repair gene set. Each type of the point and color
corresponds to the true positive rate obtained from either full or incomplete annotations.
The X-axis represents different clustering thresholds (size of cluster). A total of 20
different thresholds increase from the left to the right evenly, representing the height of
corresponding hierarchical tree from the root to the top.
94
Figure S7. Plots of the true positive rate of clustering 5 mismatch repair genes under both
complete and incomplete GO annotations. The TP is defined as the proportion of
mismatch repair genes that successfully clustered into the mismatch repair gene set. The
FP is defined as the proportion of other pathway genes being Mis-classified into this
cluster. Each type of the point and color corresponds to the true positive rate obtained
95
from either full or incomplete annotations. The X-axis represents different clustering
thresholds (size of cluster). A total of 20 different thresholds increase from the left to the
right evenly, representing the height of corresponding hierarchical tree from the root to
the top.
96
Figure S8. ROC plot of clustering 6 double-strand break repair via nonhomologous end
joining genes under both complete and incomplete GO annotations. The TP is defined as
the proportion of double-strand break repair via nonhomologous end joining genes that
were successfully clustered into double-strand break repair via nonhomologous end
joining gene set. The FP is defined as the proportion of non- double-strand break repair
via nonhomologous end joining genes being mis-classified double-strand break repair via
nonhomologous end joining gene set. Each type of the point and color corresponds to the
true positive rate obtained from either full or incomplete annotations. The X-axis
represents different clustering thresholds (size of cluster). A total of 20 different
thresholds increase from the left to the right evenly, representing the height of
corresponding hierarchical tree from the root to the top.
97
Figure S9. ROC plot of clustering 12 double-strand break repair via homologous
recombination genes under both complete and incomplete GO annotations. The TP is
defined as the proportion of double-strand break repair via homologous recombination
genes that successfully clustered into the double-strand break repair via homologous
recombination genes set. The FP is defined as the proportion of other pathway genes
98
being misclassified in the double-strand break repair via nonhomologous end joining gene
set. Each type of the point and color corresponds to the true positive rate obtained from
either full or incomplete annotations. The X-axis represents different clustering thresholds
(size of cluster). A total of 20 different thresholds increase from the left to the right evenly,
representing the height of corresponding hierarchical tree from the root to the top.
99
Figure S10. ROC plot of clustering protein import into nucleus genes under both complete
and incomplete GO annotations. The TP is defined as the proportion of protein import into
nucleus genes that successfully clustered together. The FP is defined as the proportion of
other pathway genes being misclassified as protein import into nucleus genes. Each type
of the point and color corresponds to the true positive rate obtained from either full or
incomplete annotations. The X-axis represents different clustering thresholds (size of
100
cluster). A total of 20 different thresholds increase from the left to the right evenly,
representing the height of corresponding hierarchical tree from the root to the top.
101
Figure S11. ROC plot of clustering mitotic sister chromatid segregation genes under both
complete and incomplete GO annotations. The TP is defined as the proportion of mitotic
sister chromatid segregation genes that successfully clustered together. The FP is defined
as the proportion of other pathway genes being misclassified as mitotic sister chromatid
segregation genes. Each type of the point and color corresponds to the true positive rate
obtained from either full or incomplete annotations. The X-axis represents different
102
clustering thresholds (size of cluster). A total of 20 different thresholds increase from the
left to the right evenly, representing the height of corresponding hierarchical tree from the
root to the top.
Figure S12. ROC plot of clustering post-Golgi vesicle-mediated transport genes under
both complete and incomplete GO annotations. The TP is defined as the proportion of
103
post-Golgi vesicle-mediated transport genes that successfully clustered together. The FP
is defined as the proportion of other pathway genes being misclassified as Post-Golgi
vesicle-mediated transport genes. Each type of the point and color corresponds to the true
positive rate obtained from either full or incomplete annotations. The X-axis represents
different clustering thresholds (size of cluster). A total of 20 different thresholds increase
from the left to the right evenly, representing the height of corresponding hierarchical tree
from the root to the top.
104
Figure S13. ROC plot of clustering recombinational repair genes under both complete and
incomplete GO annotations. The TP is defined as the proportion of recombinational repair
genes that successfully clustered together. The FP is defined as the proportion of other
pathway genes being misclassified as recombinational repair genes. Each type of the
point and color corresponds to the true positive rate obtained from either full or
incomplete annotations. The X-axis represents different clustering thresholds (size of
cluster). A total of 20 different thresholds increase from the left to the right evenly,
representing the height of corresponding hierarchical tree from the root to the top.
105
Figure S14. ROC plot of clustering cytokinesis genes under both complete and incomplete
GO annotations. The TP is defined as the proportion of cytokinesis genes that
successfully clustered together. The FP is defined as the proportion of other pathway
genes being misclassified as cytokinesis genes. Each type of the point and color
corresponds to the true positive rate obtained from either full or incomplete annotations.
The X-axis represents different clustering thresholds (size of cluster). A total of 20
106
different thresholds increase from the left to the right evenly, representing the height of
corresponding hierarchical tree from the root to the top.
Abstract (if available)
Abstract
It has become increasingly important in today’s genomic research to detect the joint association of numerous genes (gene set) with complex diseases. Most existing gene-set approaches, however, relies on the assumption of either independence or equicorrelation and the power can decrease rapidly due to violation of this assumption. Here we propose a prior knowledge based adaptive approach which assumes a block equicorrelation structure among genes. This proposed approach divides genes into different “blocks” using prior information, and then decomposes a global test derived from mixed models into different “within-block” local tests. It adaptively uses the data to optimally determine the value of equicorrelation within blocks, and is powerful particularly when within-block association is stronger than global association. We evaluated the proposed approach using complex simulation studies and illustrate its application using public Gene Ontology annotation resources and the Microarray data of USC Norris Center prostate cancer project. In addition, we relied on Gene Ontology annotation semantic similarity between genes as the prior information, which may result in biased results due to incomplete annotation sets. To learn what the best option of similarity metrics may be, we developed representations of both 'complete' and 'incomplete' GO annotations datasets based on experimentally-supported annotations from the GO database. We then assessed the clusters derived from these metrics using different clustering methods. We found that metrics which average over all pairwise annotation similarities always had poor performance. In contrast, metrics using the maximal pairwise annotation similarity or averaging over best-matched pairwise annotation similarities were generally among the best performers even the Gene Ontology annotations are incomplete.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Adaptive set-based tests for pathway analysis
PDF
Generalized linear discriminant analysis for high-dimensional genomic data with external information
PDF
Incorporating prior knowledge into regularized regression
PDF
Prediction and feature selection with regularized regression in integrative genomics
PDF
Leveraging functional datasets of stimulated cells to understand the relationship between environment and diseases
PDF
Statistical analysis of high-throughput genomic data
PDF
Enhancing model performance of regularization methods by incorporating prior information
PDF
Hierarchical approaches for joint analysis of marginal summary statistics
PDF
Improving the power of GWAS Z-score imputation by leveraging functional data
PDF
Detecting joint interactions between sets of variables in the context of studies with a dichotomous phenotype, with applications to asthma susceptibility involving epigenetics and epistasis
Asset Metadata
Creator
Liu, Meng
(author)
Core Title
Gene-set based analysis using external prior information
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Publication Date
07/25/2016
Defense Date
06/13/2016
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
adaptive methods,gene clustering,gene ontology,gene-set based analysis,incomplete gene ontology annotation,mixed model,OAI-PMH Harvest,prior information,score test,semantic similarity,simulation,variance component
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Lewinger, Juan Pablo (
committee chair
), Conti, David (
committee member
), Hacia, Joseph (
committee member
), Mi, Huaiyu (
committee member
), Thomas, Paul (
committee member
)
Creator Email
liumeng@usc.edu,liumeng85@hotmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-275814
Unique identifier
UC11279534
Identifier
etd-LiuMeng-4604.pdf (filename),usctheses-c40-275814 (legacy record id)
Legacy Identifier
etd-LiuMeng-4604.pdf
Dmrecord
275814
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Liu, Meng
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
adaptive methods
gene clustering
gene ontology
gene-set based analysis
incomplete gene ontology annotation
mixed model
prior information
score test
semantic similarity
simulation
variance component