Evaluating the effects of testing framework
and annotation updates on Gene Ontology
Enrichment Analysis.
By
Xinyu Guo¹
Mentor: Paul D Thomas¹
A Thesis Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfilment of the Requirements for the Degree
MASTER OF SCIENCE
(Biostatistics)
December 2017
1 Preventive Medicine Department, Keck School of Medicine, University of Southern
California, 90033 CA, USA
Acknowledgements
First, I want to express my sincere gratitude to my mentor, Dr. Paul D Thomas, for his full
support of my master's project. He is very kind, professional and wholeheartedly
considerate to students. At a very difficult moment during the analysis stage of my thesis,
I felt confused and desperate. He told me to focus on the key point of the question and
personally helped me work through the details of the problem step by step. His serious
attitude toward science and his innovative thinking inspired me and set the best example
for my future life.
I also want to thank my committee member, Professor Christianne Lane. She reviewed my
draft thesis carefully and gave me a lot of advice on improving the structure of my thesis
and broadening the angles from which the results are presented.
Table of contents
Acknowledgements
Abstract
Introduction
Methods
  Using PANTHER classification system to perform enrichment analysis
  Interpreting PANTHER classification system results
  R platform and R packages
  Improving GO enrichment analysis using FET + FDR method based on R platform
  Hierarchy and cluster analysis of significant GO terms result
  Part A analysis: Enrichment analysis in scientific article: Stromal gene expression predicts clinical outcome in breast cancer.
  Part B: Enrichment analysis in scientific article: Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis.
  Common rate and obsolete rate
  Wilcoxon Signed-rank test for comparing BIN + BON and FET + FDR method GO term results based on R platform
Result
  Part A
  Enrichment Analysis of genes that Correlate with good outcomes (Yellow cluster)
  GO Biological Process (BP)
  Result analysis of FET + FDR method
  Comparing the FET + FDR method result with the author’s findings
  Comparing FET + FDR method result with BIN + BON method result
  GO Cellular Component (CC)
  Result analysis of FET + FDR method
  Comparing FET + FDR method result with the author’s findings
  Comparing FET + FDR method result with BIN + BON method result
  GO Molecular Function (MF)
  Result analysis of FET + FDR method
  Comparing FET + FDR method result with the author’s findings
  Comparing FET + FDR method result with BIN + BON method result
  Enrichment Analysis of genes that Correlate with bad outcomes (Cyan cluster)
  GO Biological Process (BP)
  Result analysis of FET + FDR method
  Comparing FET + FDR method result with the author’s findings
  Comparing FET + FDR method result with BIN + BON method result
  GO Cellular Component (CC)
  Result analysis of FET + FDR method
  Comparing FET + FDR method result with the author’s findings
  Comparing FET + FDR method result with BIN + BON method result
  GO Molecular Function (MF)
  Result analysis of FET + FDR method
  Comparing FET + FDR method result with the author’s findings
  Comparing FET + FDR method result with BIN + BON method result
  Enrichment Analysis of genes that Correlate with mixed outcomes (Purple cluster)
  GO Biological Process (BP)
  Result analysis of FET + FDR method
  Comparing FET + FDR method result with the author’s findings
  Comparing FET + FDR method result with BIN + BON method result
  GO Cellular Component (CC)
  Result analysis of FET + FDR method
  Comparing FET + FDR method result with the author’s findings
  Comparing FET + FDR method result with BIN + BON method result
  GO Molecular Function (MF)
  Result analysis of FET + FDR method
  Comparing FET + FDR method result with the author’s findings
  Comparing FET + FDR method result with BIN + BON method result
  Common rate and obsolete rate analysis for part A
  After finishing all the enrichment analysis in part A, we could generally analyze the common rate and obsolete rate.
  Average common rate for part A analysis could be calculated by:
  Average obsolete rate for part A analysis could be calculated by:
  Part B
  Comparing FET + FDR method result with BIN + BON method result
  Result analysis of common FET + FDR method results for List 1 and List 2
  Comparing FET + FDR method result with the author’s findings
Discussion
  FET + FDR method vs BIN + BON method
  FET + FDR method vs author’s finding
Works Cited
Supplements
Abstract
Gene Ontology (GO) uses a structured, controlled and dynamic vocabulary to describe
and annotate genes and gene products in terms of their biological attributes and
functions [1]. Enrichment analysis uses a statistical test to compare a researcher's gene
list with a reference gene list in order to find the GO terms (biological functions) that
are over- or under-represented. The PANTHER (Protein ANalysis THrough Evolutionary
Relationships) classification system is an online platform to which large gene lists can
be submitted for gene list analysis, ontology and pathway analysis, queries, etc. [2].
The research in this thesis consists of two parts. The first is to evaluate whether the
Fisher Exact Test with the FDR procedure (FET + FDR) improves the GO term results of
enrichment analysis compared to the Binomial Test with the traditional Bonferroni
procedure (BIN + BON). The second is to assess the effect of GO annotation updates on
the GO enrichment analyses of earlier scientific papers. Two previously published
scientific articles, Stromal gene expression predicts clinical outcome in breast cancer [3]
and Genetic risk and a primary role for cell-mediated immune mechanisms in multiple
sclerosis [4], were re-analyzed in this thesis in order to assess these two important
aspects of gene list analysis. Finally, the Wilcoxon Signed-Rank Test is performed to
compare the FET + FDR and BIN + BON results.
The gene lists provided by the articles were uploaded into the PANTHER classification
system, and the observed GO term statistics were extracted from the system output. The
statistical methods were run on the R platform to generate the results. Common rates and
obsolete rates are calculated to measure, respectively, the level of agreement between
the FET + FDR results and the authors' findings and the influence of GO annotation
updates on the authors' findings.
From the first part of the results, we found that, compared to the BIN + BON method, the
FET + FDR method yields significantly more GO terms according to the Wilcoxon
Signed-Rank Test, and the GO terms are more specific. The second part of the results
shows that, in most cases, the FET + FDR results differ from, or contain more terms than,
the authors' original findings.
Thus, we can conclude that the FET + FDR method is less conservative and more
informative than the BIN + BON method, and that GO annotation updates influence the
GO enrichment analysis results reported in the original publications.
Introduction
The goal of the Gene Ontology (GO) project is to create a structured, controlled and
dynamic vocabulary to describe and annotate genes and gene products according to
their biological properties and functions. GO consists of two main parts: the ontology
itself and the GO annotations. The ontology defines the biological properties, or terms,
and their relationships with each other within the same domain or across domains, which
results in a directed acyclic graph structure in the database [16].
There are three independent domains in the Gene Ontology: (1) cellular component (CC),
which specifies the locations, at the level of subcellular structures or macromolecular
complexes, where a gene product performs its biological process or function; (2) molecular
function (MF), which describes the molecular-level activities of a gene product; and (3)
biological process (BP), which describes pathways and larger biological programs carried
out by multiple gene products working together [1].
GO annotations, on the other hand, state the associations between gene products and
ontology terms, which are supported by direct experimental evidence or by phylogenetic
principles.
As biological knowledge accumulates, both the ontology and the GO annotations are
constantly being revised and expanded. Currently, more than 34 biological databases and
groups, and many scientists worldwide, contribute to the ontology and GO annotation
updates. Thus, one of my aims in this thesis is to analyze the influence of GO annotation
updates on the GO analysis results reported in earlier scientific papers [16].
One of the main uses of GO is performing enrichment analysis on gene sets. In recent
years, pathway enrichment analysis has become one of the best ways to analyze and extract
meaning from large "omics" datasets of high-throughput differentially expressed genes and
proteins. It reduces complexity by grouping tens of thousands of genes, proteins and other
biological molecules into hundreds of smaller sets according to the pathways they are
involved in. A pathway is a simplified model, defined by biologists, of a biological
process in a cell or tissue. This method has also been applied to GO term analysis [14].
Over-representation analysis (ORA), or enrichment analysis, is a method that uses a
statistical test to evaluate the proportion of genes in a pathway or GO term that are
differentially expressed. If that proportion differs significantly from what is expected,
the pathway or GO term is considered relevant. ORA enables researchers to interpret gene
expression data in terms of the biology that underlies it. Statistical significance is
assessed by using a statistical test to compare a researcher's gene list with a reference
gene list and find the pathways or GO terms that are over- or under-represented in the
researcher's list [5].
ORA can be performed on several websites, one of which is the PANTHER classification
system. The PANTHER (Protein ANalysis THrough Evolutionary Relationships)
classification system (http://www.pantherdb.org/) is a comprehensive online platform. It
contains multiple analysis tools, including statistical analysis, gene ontology and
pathway analysis. Researchers can use the tools in the PANTHER system to manipulate and
analyze large gene lists [2].
The overrepresentation test in GO enrichment analysis tests whether two binomial
proportions in a contingency table are significantly different. The statistical method
used by the PANTHER classification system is the Binomial Test with the Bonferroni
procedure (BIN + BON). The Binomial Test is an exact test that compares the observed
distribution with the expected distribution when there are only two categories. Thus, it
tests hypotheses about binomially distributed data [8].
One common use of the Binomial Test is the case where the null hypothesis is that two
categories are equally likely to occur. In the PANTHER system, the null hypothesis for
each GO term is that genes are equally likely to be mapped to the term in the
researcher's gene list and in the reference gene list. The following formula is the
Binomial exact test equation:
\[ P(X) = \frac{n!}{(n-X)!\,X!} \left(\frac{k}{K}\right)^{X} \left(1-\frac{k}{K}\right)^{n-X} \]
[9]
In the equation, k is the number of genes mapped to the GO term in the reference gene
list, K is the total number of genes in the reference gene list, X is the number of genes
mapped to the GO term in the researcher's gene list, and n is the total number of genes
in the researcher's gene list.
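The binomial formula above can be illustrated with a short Python sketch that uses only the standard library. It is not the code used in the thesis (which relies on R), the function names are made up for this illustration, and the counts are hypothetical. The upper-tail sum is one common way to turn the point probability into an over-representation P-value.

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x mapped genes among n, each mapped with probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binom_upper_tail(x, n, p):
    """Probability of x or more mapped genes among n (over-representation tail)."""
    return sum(binom_pmf(i, n, p) for i in range(x, n + 1))

# Hypothetical counts, not taken from the thesis.
k, K = 500, 20000   # reference genes mapped to the GO term / total reference genes
X, n = 10, 100      # researcher's genes mapped to the GO term / total researcher's genes

point_prob = binom_pmf(X, n, k / K)        # P(X) from the equation above
tail_prob = binom_upper_tail(X, n, k / K)  # probability of a count at least this large
print(point_prob, tail_prob)
```

With these counts the reference frequency is 2.5%, so about 2.5 genes are expected among 100 and observing 10 gives a very small tail probability.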
Because thousands of GO terms may be tested in an overrepresentation analysis, the
p-values of some tests might be smaller than 0.05 purely by chance. We need to apply a
correction to control this problem. The method used by the PANTHER classification
system is the Bonferroni correction (BON), which controls the familywise error rate. The
critical value for an individual test is found by dividing the familywise error rate by
the number of tests. For example, if there are 1000 tests and the familywise error rate
is 0.05, the critical value for an individual test is 0.05/1000 = 0.00005. Thus, a test is
significant when its P-value is smaller than 0.00005. However, this method has a
shortcoming. The BON correction is appropriate when the tests are independent. If the
tests overlap, as in the case of related GO terms, the correction is too conservative and
the false negative rate will be very high. Only GO terms with a very large difference
reach significance, and a lot of information is lost through this correction procedure.
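The Bonferroni arithmetic described above can be written as a small Python sketch. The function names are hypothetical and the P-values are invented for illustration. Multiplying each P-value by the number of tests and capping at 1 is equivalent to comparing the raw P-value with the per-test critical value.

```python
def bonferroni_threshold(alpha, m):
    """Per-test critical value for familywise error rate alpha over m tests."""
    return alpha / m

def bonferroni_adjust(pvalues):
    """Bonferroni-adjusted P-values: each P-value times the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

# The worked example from the text: 1000 tests at a familywise error rate of 0.05.
critical = bonferroni_threshold(0.05, 1000)

# Hypothetical raw P-values for three GO terms.
raw = [0.01, 0.5, 0.00001]
adjusted = bonferroni_adjust(raw)
print(critical, adjusted)
```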
Thus, another aim of this thesis is to investigate whether the Fisher Exact Test with FDR
correction (FET + FDR) approach can obtain significantly more or better GO term
results compared to the BIN + BON method.
Fisher's exact test is one of the statistical significance tests used in the analysis of
contingency tables. It is called exact because the significance of the deviation from the
null hypothesis can be calculated exactly, rather than relying on an approximation whose
accuracy depends on the sample size [10]. In this thesis, the test is performed as shown
in Table 1 and the following formula; the four cells a, b, c and d form the 2 x 2 table
analyzed in the Fisher Exact Test:
Each GO term           | Number of genes mapped to GO term | Number of genes not mapped to GO term | Total
Researcher's gene list | a                                 | b                                     | a+b
Reference gene list    | c                                 | d                                     | c+d
Column total           | a+c                               | b+d                                   | a+b+c+d (= n)
Table 1: 2 x 2 contingency table for Fisher Exact Test
\[ P = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!} \]
[11]
In this thesis, I decided to use the Fisher Exact Test instead of the Binomial exact test
or the Chi-squared test. Being based on the binomial distribution, the Binomial test
assumes that one population is compared to another that is infinite in size. The Fisher
Exact Test, in contrast, is based on the hypergeometric distribution and takes the sizes
of both lists being compared into account. For small, sparse or unbalanced data, the
exact and asymptotic p-values can be quite different and may lead to opposite conclusions
about the hypothesis tested with the Chi-squared test [12]. The Fisher Exact Test,
however, is valid regardless of the sample characteristics as long as the row and column
totals are held fixed. Thus, the Fisher Exact Test is preferable to the Chi-squared test
and the Binomial test here.
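As a worked illustration of the hypergeometric formula for Table 1, the following Python sketch computes the probability of a single 2 x 2 table and a one-sided (over-representation) P-value. It is a minimal sketch rather than the thesis code: the function names and the counts are hypothetical, and R's fisher.test reports a two-sided P-value by default, so its numbers can differ from this one-sided tail.

```python
from math import comb

def hypergeom_prob(a, b, c, d):
    """Probability of one 2 x 2 table with cells a, b, c, d under fixed margins."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

def fisher_upper_tail(a, b, c, d):
    """One-sided P-value: probability of a or more in the top-left cell, margins fixed."""
    row1, row2, col1 = a + b, c + d, a + c
    total = 0.0
    for i in range(a, min(row1, col1) + 1):
        c_i = col1 - i  # bottom-left cell once the top-left cell is i
        total += hypergeom_prob(i, row1 - i, c_i, row2 - c_i)
    return total

# Hypothetical small table laid out as in Table 1: a=3, b=1, c=1, d=3.
point = hypergeom_prob(3, 1, 1, 3)     # 16/70
tail = fisher_upper_tail(3, 1, 1, 3)   # (16 + 1)/70
print(point, tail)
```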
The FDR correction is a method to control the false discovery rate. In this correction, I
set the accepted false discovery rate to a value Q. All the P-values of the tests are then
ranked from smallest to largest, so that the smallest P-value has rank i = 1, the next
smallest has rank i = 2, and so on. We then compare each P-value to (i / m) * Q, where m
is the total number of tests. The largest P-value that is smaller than its (i / m) * Q,
together with all P-values ranked below it, is considered significant [13]. In this
thesis, the overrepresentation analysis usually contains thousands of GO terms, and the
results obtained with Q = 0.1 as the criterion are reported.
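The ranking procedure described above can be sketched in Python as follows. The function names are hypothetical and the P-values are invented. The sketch works with adjusted P-values (each P-value times m / i, made monotone from the largest rank downwards), which marks the same tests as significant as the rule of finding the largest P-value that meets (i / m) * Q and accepting everything ranked at or below it.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda idx: pvalues[idx])  # indices by ascending P-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest rank down so each adjusted value is the minimum
    # of p * m / i over its own rank and all larger ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def bh_significant(pvalues, q):
    """True for every test called significant at false discovery rate q."""
    return [adj <= q for adj in bh_adjust(pvalues)]

# Hypothetical P-values for four GO terms.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = bh_adjust(raw)
calls = bh_significant(raw, 0.03)
print(adjusted, calls)
```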
To compare the resulting GO terms between the BIN + BON and FET + FDR methods, the
statistical method used in this thesis is the Wilcoxon Signed-Rank Test. It is a
non-parametric test that evaluates the difference in population mean ranks between two
related, matched samples. The null hypothesis of the test is that the differences between
the two matched groups follow a symmetric distribution around zero. The test procedure
is as follows.
There are two paired lists x1 and x2 with j = 1, 2, 3, …, N terms in each list. First,
calculate for each j the absolute difference |x2j - x1j| and keep the sign of x2j - x1j
as Sj. If |x2j - x1j| equals 0, exclude the pair. Denote the reduced sample size as Na.
Then order the Na pairs from smallest to largest by the absolute difference |x2j - x1j|
and rank them as Rj from 1, 2, 3, .... Tied absolute differences receive the same rank,
the average of the ranks they span. |W| is given by the following equation:
\[ |W| = \left| \sum_{j=1}^{N_a} S_j \times R_j \right| \]
As Na increases, the distribution of W converges to a normal distribution. Thus, when
Na >= 10, \( z = W / \sigma_W \) with \( \mathrm{Var}(W) = \sigma_W^2 = N_a (N_a + 1)(2 N_a + 1) / 6 \).
The two-sided test rejects H0 when \( |z| > z_{critical} \) [18].
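The signed-rank procedure above can be followed step by step in a short Python sketch. The function name and the paired values are hypothetical, and the normal approximation is only meant for Na >= 10, so the z value for the tiny example below is shown for the arithmetic only.

```python
from math import sqrt

def signed_rank(x1, x2):
    """Return (W, Na, z) for paired lists x1 and x2, following the steps in the text."""
    # Keep only non-zero differences; Na is the reduced sample size.
    diffs = [b - a for a, b in zip(x1, x2) if b != a]
    n_a = len(diffs)
    order = sorted(range(n_a), key=lambda j: abs(diffs[j]))
    ranks = [0.0] * n_a
    pos = 0
    while pos < n_a:
        # Find the run of tied absolute differences and give each the average rank.
        end = pos
        while end + 1 < n_a and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        avg_rank = (pos + 1 + end + 1) / 2
        for t in range(pos, end + 1):
            ranks[order[t]] = avg_rank
        pos = end + 1
    w = sum((1 if d > 0 else -1) * r for d, r in zip(diffs, ranks))
    sigma = sqrt(n_a * (n_a + 1) * (2 * n_a + 1) / 6)
    z = w / sigma if n_a > 0 else 0.0
    return w, n_a, z

# Hypothetical paired counts of significant GO terms from two methods.
w, n_a, z = signed_rank([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(w, n_a, z)
```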
Methods
Using PANTHER classification system to perform enrichment analysis
To use the PANTHER classification system, the names or IDs of the genes to be analyzed
are first pasted into the webpage or uploaded as a .txt file. Then, the organism is
selected from the species pull-down menu to serve as the reference list. In this thesis,
I used Homo sapiens as the reference for all analyses. In the third step, the type of
analysis is selected; here I selected the statistical overrepresentation test. Finally,
after clicking the "submit" button, the overrepresentation test is run and the result
tables are displayed on the next webpage.
Interpreting PANTHER classification system results
The result webpage has two parts: the analysis summary and the results. The analysis
summary provides the analysis type, annotation version, release date, analyzed list (the
uploaded list), reference list (Homo sapiens in this thesis), annotation data set and the
check box for the Bonferroni correction. The annotation type (CC, BP or MF in this
thesis) can be selected in the annotation data set pull-down menu.
In the results part, we first reviewed the unmapped and multiple-mapping gene IDs in the
first two columns. An unmapped gene ID means that the gene name or ID provided in the
list could not be found in the annotation database. Multiple mapping means that the gene
ID provided in the list is not specific enough, as two or more genes in the database are
related to this name or ID. Thus, before interpreting the result tables, it is better to
eliminate or modify the unmapped or multiple-mapping gene names or IDs in the sample
list; otherwise they may lead to wrong results or loss of information. The second table
lists the statistics for each GO term: the reference number, the sample number, the
expected number, fold enrichment, over- or underrepresentation and P-value.
The reference number gives the number of genes in the entire human genome annotated to a
given GO term, and the sample number gives the number of genes in the sample list
annotated to that GO term. The expected number column gives the number of genes that
would be expected to appear in the sample list based on the reference list. In the +/-
column, "+" indicates overrepresentation and "-" indicates underrepresentation.
Overrepresentation means that the sample frequency is larger than the reference
frequency, that is, more genes than expected from the reference list appear in the sample
list for a given GO term. Underrepresentation means that the sample frequency is smaller
than the reference frequency, that is, fewer genes appear than expected [2]. The P-value
column gives, for each GO term, the P-value of the Binomial Test with Bonferroni
correction for multiple testing. It is the probability of seeing x genes among the y
genes in total in the sample list, given the reference frequency. The closer the P-value
is to zero, the more significant the association between the sample gene list and the GO
term, the less likely the result is to be random, and the more it is worth investigating
in detail.
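The relationship between the reference number, sample number, expected number, fold enrichment and the +/- column can be shown with a small Python sketch. The function name and counts are hypothetical. The sketch assumes that the expected number is the sample list size times the reference frequency and that fold enrichment is the observed number divided by the expected number; marking an exact tie as "-" is an arbitrary choice made here.

```python
def enrichment_summary(sample_num, sample_total, ref_num, ref_total):
    """Return (expected number, fold enrichment, '+' or '-') for one GO term."""
    # Expected count in the sample list if it followed the reference frequency.
    expected = sample_total * ref_num / ref_total
    fold = sample_num / expected if expected > 0 else float("inf")
    # '+' when more genes than expected are observed; ties are marked '-' here.
    sign = "+" if sample_num > expected else "-"
    return expected, fold, sign

# Hypothetical counts: 10 of 100 sample genes and 500 of 20000 reference genes.
expected, fold, sign = enrichment_summary(10, 100, 500, 20000)
print(expected, fold, sign)
```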
R platform and R packages
The programming software used in this thesis is R, version 3.3.3 (released on
2017-03-06), running on macOS Sierra version 10.12.4.
R is a language and environment for statistical computing and graphics. It provides a
wide variety of statistical techniques (for example, linear and nonlinear modelling,
classical statistical tests, classification and clustering) and graphical techniques, and
is highly extensible. R can be extended by installing packages with different functions.
Besides the base packages shipped with R, users can download more powerful and extensive
packages from the internet. The Comprehensive R Archive Network (CRAN) contains a wide
range of modern statistics packages that provide various statistical methods for R users.
People can write, share and download the packages they need through the CRAN family of
internet sites [6].
Rather than using the R platform alone, I used RStudio as my programming tool in this
research. RStudio is an integrated development environment (IDE) for R, comparable to
Eclipse for Java or NetBeans for C++, which helps researchers work with R code easily and
effectively. Some of the results and tables in the thesis are displayed using R Markdown
(RMD) files. RMD uses a productive notebook interface to combine narrative text and code
and produce elegantly formatted output.
Two additional R packages are used in this thesis: the xlsx package [17] and the RamiGO
package [7]. The xlsx package, written and maintained by Adrian A. Dragulescu, provides R
functions to read, write and format Excel files. The RamiGO package provides functions to
interact with the AmiGO visualize web server and to label, color and highlight particular
GO terms or export a summary GO tree [7]. First, I used the xlsx package to read in the
two xlsx files provided by the papers being analyzed. Then I used the RamiGO package to
analyze the hierarchy of the significant GO term results. The getAmigoTree() function in
the RamiGO package generates the AmiGO tree plot for a group of GO terms as a .png file
[7]. The plots show the hierarchy and relationships among the GO terms. In addition, the
pcolors option of the getAmigoTree() function colors the GO terms in the tree plot so
that related GO terms can be labeled (the colors set in this thesis are "white" and
"tomato").
Improving GO enrichment analysis using FET + FDR method based on R
platform
One of my main purposes in this thesis is to compare the results of the Fisher Exact Test
with FDR correction (FET + FDR) with the results of the Binomial Test with Bonferroni
correction (BIN + BON) used by the PANTHER classification system.
I first exported the .txt report of the BIN + BON method result from the PANTHER
classification system. Then, I imported the .txt file into R and extracted, for every GO
term, the number of sample genes mapped to the term (labeled "samplenum"), the number of
reference genes mapped to the term (labeled "refnum"), the total number of sample genes
(labeled "sampletotal") and the total number of reference genes (labeled "homoptotal")
into four columns grouped in a data frame. The counts for each GO term make up a 2 x 2
contingency table, and I performed the BIN method for each GO term in the data frame
using the R command binom.test(x, n, (y/z)). In the analysis, x is "samplenum", n is
"sampletotal", y is "refnum" and z is "homoptotal", so that y/z is the reference
frequency, consistent with the Binomial equation in the Introduction.
The BON correction is performed using the R command p.adjust(pvalue, method =
"bonferroni") with alpha set to 0.05, where pvalue is the list of P-values generated by
the Binomial Test in the previous step. For FET, the four counts in the 2 x 2 table are
different from those used in the Binomial Test, so I generated two new values:
"samplenoobs" = "sampletotal" - "samplenum", and "homonoobs" = "homoptotal" - "refnum".
The 2 x 2 table then consists of "samplenum", "samplenoobs", "refnum" and "homonoobs",
arranged as in Table 1, and is tested using the R command
fisher.test(matrix(x, nr = 2))$p.value. The FDR correction is performed by
p.adjust(pvalue, method = "fdr") with the criterion set to 0.1, where pvalue is the list
of P-values generated by fisher.test. The work flows of the two methods are compared in
Figure 1.
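The construction of the 2 x 2 table from the four extracted counts can be sketched in Python as below. This is an illustration only, not the R code used in the thesis; the function name and the counts are hypothetical, while the variable names follow the labels given in the text.

```python
def contingency_table(samplenum, sampletotal, refnum, homoptotal):
    """Build the 2 x 2 table of Table 1 from the four counts extracted per GO term."""
    samplenoobs = sampletotal - samplenum   # sample genes not mapped to the GO term
    homonoobs = homoptotal - refnum         # reference genes not mapped to the GO term
    # Rows: researcher's list, reference list. Columns: mapped, not mapped.
    return [[samplenum, samplenoobs],
            [refnum, homonoobs]]

# Hypothetical counts for one GO term.
table = contingency_table(10, 100, 500, 20000)
row_totals = [sum(row) for row in table]
print(table, row_totals)
```

The row totals recover "sampletotal" and "homoptotal", which is a quick check that the derived cells were computed from the right columns.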
Figure 1: Work flow of comparisons between FET + FDR and BIN + BON method based on
R platform and PANTHER classification system.
Hierarchy and cluster analysis of significant GO terms result
The result analysis in this thesis consists of two parts. First, I generated the lists of
significant GO terms using the Fisher Exact Test with FDR correction (FET + FDR) method
for the molecular function, biological process and cellular component domains separately.
Then, I compared the FET + FDR result with the results reported in the published paper by
comparing the resulting GO term lists, and I compared the FET + FDR result with the
BIN + BON result in the same way.
The GO terms found using each method were further analyzed by plotting the Amigo
Tree, which clusters related GO terms and simplifies the analysis.
Two R commands are used in presenting the results. The knitr::kable() command is used in
R Markdown files to display the resulting GO lists as neatly formatted tables. The
getAmigoTree() function of the RamiGO R package is used to plot the GO term clusters and
hierarchies as a directed acyclic graph. It helps in studying the common properties
shared by the GO terms and the relationships among them.
Part A analysis: Enrichment analysis in scientific article: Stromal gene
expression predicts clinical outcome in breast cancer.
The scientific paper analyzed in Part A is Stromal gene expression predicts clinical
outcome in breast cancer, a highly cited paper by Finak et al. [3] that used GO to
analyze differential expression data. In that paper, gene expression differences were
used as the criterion to identify breast tumors and classify clinical outcome, so that
gene expression could serve as a prognostic factor for breast cancer metastasis. However,
the gene expression previously used as a prognostic factor was extracted from whole
tissue together with the surrounding stroma, which is not very accurate. The researchers
found that the stroma plays a very important role in tumor initiation and growth. They
therefore focused on gene expression in isolated stroma alone as a new prognostic factor,
the stroma-derived prognostic predictor (SDPP), which is used to predict and classify
breast cancer clinical outcome more accurately.
The researchers studied 53 stroma samples and classified them into three clusters by
outcome (good, bad and mixed), using recurrence rate and relapse-free survival time as
the criteria. Stroma samples in the good cluster have a significantly reduced recurrence
rate and longer relapse-free survival time. The bad cluster has an increased recurrence
rate and shorter relapse-free survival time. The mixed cluster is a mixture of good and
bad outcomes. The three clusters are labeled Yellow (Table 3), Cyan (Table 4) and Purple
(Table 2) for the good, bad and mixed outcomes, respectively. There are 163 genes that
showed large differences in expression between cluster comparisons. The paper provided
lists of the genes that are expressed predominantly in each cluster. Gene Ontology
enrichment analysis was performed to find the GO terms that are enriched in each of the
three clusters. The significant GO terms from each cluster's enrichment analysis give the
characteristic biological functions of that cluster.
Thus, in the result part for this paper, I redid this GO enrichment analysis using the method described above.
Purple Cluster
ITGBL1
OGN
SORCS2
ADRA2A
CXCL14
FRZB
RAI2
HOXA10
PRND
FGF18
ESR1
p10275
BCAN
TLN2
PSCD3
SLC40A1
GREB1
WISP2
p12273
TFF1
TFF3
PSMD11
RPL10
SCGB2A2
ACAA2
PDCD7
ZHX2
TCEA3
Table 2: Purple cluster gene list
The table contains the genes expressed predominantly in samples of the mixed-outcome (Purple)
cluster [4].
Yellow Cluster
CD48
PLEK
SOAT1
LAP3
PLA2G7
MS4A4A
GIMAP5
RUNX3
HLA-A
HLA-F
IL10RA
NCF2
COTL1
COTL1
GZMA
CD8A
CD52
TRBV5-4
CD3D
CD247
CD2
XCL1
GZMB
CYBB
CCL13
MEI1
HCST
Table 3: Yellow cluster gene list
The table contains the genes expressed predominantly in samples of the good-outcome (Yellow)
cluster [4].
Cyan Cluster
IL4I1
O43315
S100A8
S100P
CLEC4E
Q13938
MMP12
LCN2
SYTL1
CALB2
MMP7
GRB14
HRASLS
SCEL
ADGRF1
P10451
IQGAP3
S100A7
S100A9
HIST1H1C
SPNS2
CXCL1
MMP1
STK38L
KRT23
UGCGL1
ACTG2
KCNK5
SCRG1
VGLL1
ROPN1
SHC4
UBE2C
KIF18B
FAM83D
NCAPG
ASPM
CENPF
GBP5
C6orf173
ECT2
GPR56
RDH10
MYBL1
Q8N3C7
FAM54A
SGOL1
CHEK1
LGALSL
ZNF165
LCP1
CDCA7
KYNU
NDC80
RIOK3
O60911
SLAIN1
SQLE
GJD4
BXDC1
AZIN1
ATG5
GTF3C6
SRPK1
AMD1
GK
CRY1
TACSTD1
TFEC
LACTB2
ITGB8
SLC30A5
LRRCC1
ORMDL1
MZT1
CHML
KLF8
IL8
ADM
STK24
C6orf168
SNTG2
HTATIP2
C6orf203
C6orf117
B3GNT5
RCAN1
OXR1
EDN1
RIPK4
PERP
GALNT3
Table 4: Cyan cluster gene list
The table contains the genes expressed predominantly in samples of the bad-outcome (Cyan) cluster
[4].
Part B: Enrichment analysis in scientific article: Genetic risk and a
primary role for cell-mediated immune mechanisms in multiple sclerosis.
For Part B, the scientific paper analyzed is Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis. In this paper, from two major international consortia, the International Multiple Sclerosis Genetics Consortium (IMSGC) and the Wellcome Trust Case Control Consortium 2 (WTCCC2), a Gene Ontology enrichment analysis is performed to understand the biological underpinnings of multiple sclerosis, as inferred from the genetic loci conferring increased risk for the disease [4]. Previous studies had shown that variation in genes located in the major histocompatibility complex (MHC) contributes substantially to the increased disease risk in affected individuals. Genome-wide association studies (GWAS) had previously identified more than 20 additional risk loci with an important role in disease susceptibility. The authors included 9772 European cases and used GWAS to find more than 29 new susceptibility loci.
For the GO enrichment analysis in this paper, the authors chose the genes mapping close to the identified loci and performed an overrepresentation test. The genes, obtained from UniProtKB and mapped from single nucleotide polymorphisms (SNPs), were grouped into three categories: i. SNPs in Table S1; ii. SNPs in the top tier of Table S2 or with P < 1x10^-4.5 in discovery and the same direction of effect in replication; iii. SNPs in either of the above categories.
The genes in Table S1 and the top tier of Table S2 make up List 1 (Table 5). The genes in the bottom tier of Table S2 and the ones with p-value < 1E-4.5 in discovery with the same direction of effect in replication, extracted from the supplementary information, make up List 2 (Table 6).
I replicated the GO enrichment analysis by applying the FET + FDR method to both lists for biological process. I first compared the FET + FDR result with the BIN + BON result for each list. Then I took the intersection (terms found in both analyses) of the significant GO terms for List 1 and List 2 and compared it to the result reported in the published paper.
List 1
MMEL1
EVI5
CD58
RGS1
KIF21B
CBLB
TMEM39A
IL12A
IL7R
PTGER4
OLIG3
IL7
IL2RA
ZMIZ1
CD6
TNFRSF1A
CYP27B1
MPHOSPH9
CLEC16A
IRF8
STAT3
TYK2
CD40
VCAM1
PLEK
MERTK
SP140
EOMES
CD86
IL12B
BACH2
THEMIS
MYB
IL22RA2
TAGAP
ZNF746
MYC
PVT1
HHEX
CLECL1
ZFP36L1
BATF
GALC
MALT1
TNFSF14
MPV17L2
DKKL1
CYP24A1
MAPK1
SCO2
Table 5: Gene List 1
The table contains the genes included in the Table S1 and top tier of Table S2 from the
supplementary information of the paper [5].
List 2
AGAP2
AHI1
ALPK2
ARHGEF3
BACH2
BATF
C16orf75
C1orf106
C2orf69
C3orf1
CARD11
CD5
CD58
CD86
CDC37
CLEC16A
CLECL1
CXCR5
CYP24A1
DDAH1
DKKL1
DLEU1
ELMO1
EOMES
EVI5
EXTL2
FCRL3
GPR65
HHEX
ICAM3
IL12A
IL12B
IL22RA2
IL2RA
IL6
IL7R
IRF8
KCNMA1
MAF
MALT1
MANBA
MAP3K14
MAPK1
MERTK
MMEL1
MPV17L2
MYB
MYC
MYNN
NA
NCF4
NCOA5
NDFIP1
NDUFA4
NFKB1
NFKBIZ
ODF3B
P43405 SYK
PKIA
PLCL2
PLEK
PRDX5
PTGER4
PTPRK
PVR
PVT1
RGS1
RGS14
RNF213 KIAA1618
RPS6KB1
RRAGD
RREB1
SAE1
SLC15A2
SLC30A7
SORBS2
SOX8
SP140
STAT3
TAGAP
TCF7
TNFRSF1A
TNFRSF6B
TNFSF14
TNKS
UBASH3B
WNT9B
ZBTB46
ZFP36L1
ZMIZ1
ZNF438
ZNF767
Table 6: Gene List 2
The table contains the genes included in the bottom tier of Table S2, and those with p-value < 1E-4.5 in discovery with the same direction of effect in replication, from the supplementary information of the paper [5]. A gene name at the right side of a cell indicates the original name of the gene in the paper; the names differ because of gene annotation updates.
Common rate and obsolete rate
In order to compare the results generated by the FET + FDR method with the authors' findings and to evaluate the influence of GO annotation updates on each enrichment analysis, we introduce two measurements.
The common rate is calculated for each enrichment analysis by dividing the number of GO terms shared by the FET + FDR result and the authors' findings by the total number of GO terms in the FET + FDR result. It shows the share of the FET + FDR result that is in common with the authors' findings. A high common rate represents high similarity between the FET + FDR result and the authors' findings.
The obsolete rate is also calculated for each enrichment analysis, by dividing the number of obsolete GO terms in the authors' findings by the total number of GO terms in the authors' findings. A high obsolete rate represents a strong influence of GO annotation updates on the authors' GO enrichment analyses.
The average obsolete rate and average common rate of all 10 enrichment analyses will be calculated at the end for the conclusion.
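The two measurements can be illustrated with a minimal Python sketch. The function names and the placeholder term identifiers are hypothetical and chosen only for illustration; returning 0 when a list is empty follows the convention used later in the text for analyses where the FET + FDR method finds no significant term, and is an assumption for the obsolete rate.

```python
def common_rate(fet_fdr_terms, author_terms):
    """Share of FET + FDR terms that also appear in the authors' findings."""
    fet_fdr_terms = set(fet_fdr_terms)
    if not fet_fdr_terms:
        return 0.0  # no FET + FDR terms: the rate is reported as 0
    return len(fet_fdr_terms & set(author_terms)) / len(fet_fdr_terms)


def obsolete_rate(author_terms, obsolete_terms):
    """Share of the authors' terms that are now obsolete in the GO."""
    author_terms = set(author_terms)
    if not author_terms:
        return 0.0  # assumption: an empty list is given a rate of 0
    return len(author_terms & set(obsolete_terms)) / len(author_terms)


# Placeholder identifiers, not real GO terms or results.
fet_fdr = {"term_1", "term_2", "term_3", "term_4"}
authors = {"term_2", "term_4", "term_9", "term_10", "term_11"}
obsolete = {"term_9"}

cr = common_rate(fet_fdr, authors)      # 2 shared terms out of 4
obs = obsolete_rate(authors, obsolete)  # 1 obsolete term out of 5
```

Using sets means that a term reported twice in a list is counted once, which is a design choice of this sketch rather than something stated in the text.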
Wilcoxon Signed-rank test for comparing BIN + BON and FET + FDR
method GO term results based on R platform
After finishing all the enrichment analyses, the numbers of significant GO terms from the BIN + BON and FET + FDR results for all 11 comparisons performed in the thesis were counted and entered into two columns called “BIN” and “FET” in R. The null hypothesis is that there is no difference in the number of significant GO terms between the BIN + BON and FET + FDR methods. The R command wilcox.test() is used to perform the Wilcoxon Signed-Rank Test on the two paired columns. The significance level alpha for the test is 0.05.
Result
Part A
Enrichment Analysis of genes that Correlate with good outcomes
(Yellow cluster)
GO Biological Process (BP)
Result analysis of FET + FDR method
Using the FET + FDR method on the set of genes in the "good outcome" cluster, there are 59 significant GO terms in the list (Figure 2). The list is sorted from the most significant GO term to the least significant, down to the 0.1 FDR criterion. The AmiGO tree plot of the GO term result (Supplement Figure 2) shows that there are six clusters of biological processes: Leukocyte activation, T cell proliferation (both T helper cell and cytotoxic T cell activations are predominant in this cluster), interferon-gamma, antigen process (including peptide antigen and exogenous antigen presentation) and signaling pathway. This indicates that immune response and defense response processes are predominant in the Yellow cluster gene list.
Comparing the FET + FDR method result with the author’s findings
There are 63 significant GO terms in the authors' findings (Figure 2). Although only seven GO terms are shared between the FET + FDR result and the authors' findings, the AmiGO tree plot (Supplement Figure 1) shows that our results generally agree at higher levels of the GO hierarchy. In particular, the general "immune response" term appears among the most significant terms, with more specific T-cell related terms also highly significant. This indicates that in the good outcome cluster there is an upregulation of the immune response, which was noted by the authors at the time as a major factor underlying the improved outcome. We do note, however, that our analysis identifies additional, more specific immune response GO terms that do not appear in the authors' findings, such as those related to cytokine induction, antigen processing, and cytolytic processes like superoxide generation. The common GO term rate is 7/59 = 0.119: only 11.9% of the GO terms in the FET + FDR result are in common with the authors' findings.
In addition, in the AmiGO tree plot (Supplement Figure 1) of the authors' findings, nine GO terms are listed at the top, which means that these GO terms no longer exist in the database because of GO annotation updates since the paper was released. The obsolete GO term rate in this analysis is 9/63 = 0.143: 14.3% of the GO terms have become obsolete due to GO annotation updates.
Figure 2: Significant result comparisons between FET + FDR method and author’s finding
in Yellow cluster biological process.
There are two lists in the figure, the left list is the significant GO term (FDR<=0.1) results of FET
+ FDR method which are ordered by FDR score from small to large in the last column. The six
columns in the list are “GO” which is GO term annotation and ID, “samplenum”, “refnum”,
“samplenoobs”, “homonoobs” and “FDR”. The right list is significant GO term results of author’s
finding. The five columns are “GO”, “Term”, “Number of genes in category”, “Number of genes
hit” and “Pvalue”. The list is ordered by Pvalue from small to large. The arrows indicate the same
GO term in both result lists.
Comparing FET + FDR method result with BIN + BON method result
For the BIN + BON method, there are only seven significant GO terms in the list (Table 7), and all of them are included in the FET + FDR results. The FET + FDR method thus obtained 52 more GO terms than the BIN + BON method. The GO terms in the BIN + BON result, for example immune system process, defense response and positive regulation of cell activation, are mostly general and carry less information. In contrast, the AmiGO tree plot (Supplement Figure 2) shows that the FET + FDR results contain more biologically specific and detailed GO terms.
Thus, the major difference between the two methods is that the FET + FDR method is more sensitive and returns a larger number of terms, most of which are more specific and therefore more informative.
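To make the contrast between the two testing frameworks concrete, the following Python sketch compares them on made-up counts. It assumes, as the abbreviations suggest, that FET is a one-sided Fisher's exact (hypergeometric) over-representation test, BIN a one-sided binomial test, BON the Bonferroni correction and FDR the Benjamini-Hochberg procedure; it is not the implementation used in this thesis. The function names, the term labels and all counts are hypothetical; only the thresholds (FDR <= 0.1 and P-value <= 0.05) are taken from the figure and table captions.

```python
from math import comb


def fisher_over(k, n, K, N):
    """One-sided hypergeometric p-value P(X >= k): n genes drawn from a
    reference of N genes, K of which are annotated to the term."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total


def binomial_over(k, n, K, N):
    """One-sided binomial p-value P(X >= k) with success probability K / N."""
    p = K / N
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def bonferroni(pvals):
    """Bonferroni-adjusted p-values (multiply by the number of tests, cap at 1)."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical input: term -> (genes in list with term, genes in reference with term)
terms = {"term_a": (6, 300), "term_b": (4, 150), "term_c": (2, 400), "term_d": (1, 900)}
n_list, n_ref = 27, 20000  # made-up list and reference sizes

fet_p = [fisher_over(k, n_list, K, n_ref) for k, K in terms.values()]
bin_p = [binomial_over(k, n_list, K, n_ref) for k, K in terms.values()]

fet_fdr_hits = [t for t, q in zip(terms, benjamini_hochberg(fet_p)) if q <= 0.1]
bin_bon_hits = [t for t, q in zip(terms, bonferroni(bin_p)) if q <= 0.05]
```

For the same set of p-values, the Benjamini-Hochberg adjusted value of a term is never larger than its Bonferroni adjusted value, which is one reason an FDR-based analysis tends to report more terms.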
Table 7: Significant GO term result of BIN + BON method in Yellow cluster biological
process.
The list is the significant GO term (P-value<=0.05) results of BIN + BON method which is
ordered by P-value from small to large. The six columns in the list are “GO” which is GO term
annotation and ID, “samplenum”, “refnum”, “samplenoobs”, “homonoobs” and “Pvalue”.
GO Cellular Component (CC)
Result analysis of FET + FDR method
There are 17 significant GO terms in the FET + FDR result (Figure 3). The AmiGO tree plot (Supplement Figure 4) shows that the genes in the Yellow cluster generally perform functions in the membrane part, macromolecular complex and MHC protein complex. Phagocytic vesicle membrane and endoplasma membrane are the main locations in the membrane part. Within macromolecular complex, the T cell receptor complex is predominant.
Comparing FET + FDR method result with the author’s findings
There are 12 significant GO terms in the authors' findings. Five significant GO terms are in both lists. The AmiGO tree plot (Supplement Figure 3) of the authors' findings for CC shows that the genes generally function in the MHC protein complex, T cell receptor complex and plasma-related membrane. Thus the general clusters in the two lists are similar; however, the FET + FDR results contain a few more terms, some of which are highly informative, such as the phagocytic vesicle. The common GO term rate is 5/17 = 0.294: 29.4% of the significant GO terms in the FET + FDR result are in common with the authors' findings. All the significant GO terms in the authors' findings are still in the GO database.
Figure 3: Significant result comparisons between FET + FDR method and author’s finding
in Yellow cluster cellular component.
There are two lists in the figure, the left list is the significant GO term (FDR<=0.1) results of FET
+ FDR method which are ordered by FDR score from small to large in the last column. The six
columns in the list are “GO” which is GO term annotation and ID, “samplenum”, “refnum”,
“samplenoobs”, “homonoobs” and “FDR”. The right list is significant GO term results of author’s
finding. The five columns are “GO”, “Term”, “Number of genes in category”, “Number of genes
hit” and “Pvalue”. The list is ordered by P-value from small to large. The arrows indicate the
same GO term in both result lists.
Comparing FET + FDR method result with BIN + BON method result
As with biological process, the BIN + BON result contains only four terms (Table 8), all of which are included in the FET + FDR result. The FET + FDR result includes more specific, and therefore generally more informative, GO terms.
Table 8: Significant GO term result of BIN + BON method in Yellow cluster cellular
component.
The list is the significant GO term (P-value<=0.05) results of BIN + BON method which is
ordered by P-value from small to large. The six columns in the list are “GO” which is GO term
annotation and ID, “samplenum”, “refnum”, “samplenoobs”, “homonoobs” and “Pvalue”.
GO Molecular Function (MF)
Result analysis of FET + FDR method
There is no significant GO term in FET + FDR method Yellow cluster molecular
function result.
Comparing FET + FDR method result with the author’s findings
The authors' findings contain 29 significant GO terms (Table 9), whereas the FET + FDR result contains none, so the two results differ. Many of the terms reported by the authors cover only one or two genes and should not have reached significance. These terms are "noise" that was not found by the FET + FDR method, and we note that despite reporting them in their supplemental files, the authors did not consider these results worth reporting in the actual written paper.
In this analysis, the common rate is zero since there is no common result between the FET + FDR method and the authors' findings. From the AmiGO tree plot (Supplement Figure 5), the obsolete GO term rate is 5/29 = 0.172: 17.2% of the GO terms no longer exist in the GO database.
Table 9: Significant GO terms results of author’s finding in Yellow cluster molecular
function.
The list contains five columns which are “GO”, “Term”, “Number of genes in category”,
“Number of genes hit” and “Pvalue”. The list is ordered by P-value from small to large.
Comparing FET + FDR method result with BIN + BON method result
There is no significant GO term found in BIN + BON method result.
Enrichment Analysis of genes that Correlate with bad outcomes (Cyan
cluster)
GO Biological Process (BP)
Result analysis of FET + FDR method
For the analysis of the Cyan cluster (poor outcomes) relative to biological processes, there are 21 significant GO terms in the FET + FDR results (Figure 4). The AmiGO tree plot (Supplement Figure 7) shows three general clusters in the results: extracellular matrix disassembly, mitotic cell cycle processes, and anti-microbial response / granulocyte chemotaxis activation. Thus, unlike the good outcome cluster, these processes are indicative of metastatic and cell proliferative processes and, interestingly, of a response to bacterial infection rather than a T-cell-mediated immune response.
Comparing FET + FDR method result with the author’s findings
There are 107 significant GO terms in the authors' results (Figure 4). The AmiGO tree plot (Supplement Figure 6) shows that these are spread throughout many parts of the ontology, making them difficult to interpret. Indeed, the authors did not cite the enrichment results in their Discussion, but instead focused on a few genes of interest. Interestingly, their results did include cell cycle processes and some chemotaxis-related terms, but amid all the other significant terms these could not be interpreted unambiguously. We note again that many of their enriched terms are due to a questionable signal from only one or two genes. The comparisons are shown in Figure 4. The common rate is 4/21 = 0.190: 19.0% of the significant GO terms are in common with the authors' findings. We also notice that eight GO terms are listed at the top of the AmiGO tree plot for the authors' findings (Supplement Figure 6). Thus the obsolete GO term rate is 8/107 = 0.075: 7.5% of the GO terms in the authors' findings are obsolete.
Figure 4: Significant result comparisons between FET + FDR method and author’s finding
in Cyan cluster biological process.
There are two lists in the figure, the left list is the significant GO term (FDR<=0.1) results of FET
+ FDR method which are ordered by FDR score from small to large in the last column. The six
columns in the list are “GO” which is GO term annotation and ID, “samplenum”, “refnum”,
“samplenoobs”, “homonoobs” and “FDR”. The right list is significant GO term results of author’s
finding. The five columns are “GO”, “Term”, “Number of genes in category”, “Number of genes
hit” and “Pvalue”. The list is ordered by P-value from small to large. The arrows indicate the
same GO term in both result lists.
Comparing FET + FDR method result with BIN + BON method result
There are only three significant GO terms in the BIN + BON result (Table 10), and all are included in the FET + FDR result. These GO terms are very high in the GO hierarchy and very general. Thus, compared to the FET + FDR method, the BIN + BON method provides less information for the Cyan cluster BP.
Table 10: Significant GO term result of BIN + BON method in Cyan cluster biological
process.
The list is the significant GO term (P-value<=0.05) results of BIN + BON method which is
ordered by P-value from small to large. The six columns in the list are “GO” which is GO term
annotation and ID, “samplenum”, “refnum”, “samplenoobs”, “homonoobs” and “Pvalue”.
GO Cellular Component (CC)
Result analysis of FET + FDR method
There are 10 significant GO terms in the Cyan cluster CC enrichment analysis (Figure 5). The AmiGO tree plot (Supplement Figure 9) shows that the functions are generally performed in the organelle, cytoplasm and spindle. Within organelle, the condensed chromosome region is predominant.
Comparing FET + FDR method result with the author’s findings
There are 12 significant GO terms in the authors' findings for the Cyan cluster CC (Figure 5). Two significant GO terms are common to both results. The AmiGO tree plot (Supplement Figure 8) shows that the common GO terms are generally clustered in condensed chromosome. The comparisons are shown in Figure 5. The common rate is 2/10 = 0.20: 20.0% of the significant GO terms generated by the FET + FDR method are in common with the authors' findings. Since no GO terms in the authors' findings are obsolete, the obsolete rate is zero.
Figure 5: Significant result comparisons between FET + FDR method and author’s finding
in Cyan cluster cellular component.
There are two lists in the figure, the left list is the significant GO term (FDR<=0.1) results of FET
+ FDR method which are ordered by FDR score from small to large in the last column. The six
columns in the list are “GO” which is GO term annotation and ID, “samplenum”, “refnum”,
“samplenoobs”, “homonoobs” and “FDR”. The right list is significant GO term results of author’s
finding. The five columns are “GO”, “Term”, “Number of genes in category”, “Number of genes
hit” and “Pvalue”. The list is ordered by P-value from small to large. The arrows indicate the
same GO term in both result lists.
Comparing FET + FDR method result with BIN + BON method result
There is no significant GO term in the BIN + BON result, so it provides no information compared to the FET + FDR method.
GO Molecular Function (MF)
Result analysis of FET + FDR method
There is only one significant GO term in FET + FDR method result. It is GO:0050786,
“RAGE receptor binding”. Thus, the FET + FDR method provides very little information
in Cyan cluster MF.
Comparing FET + FDR method result with the author’s findings
There are 43 significant GO terms in the authors' findings (Table 11). The AmiGO tree plot (Supplement Figure 10) shows that the main clusters for molecular function are nucleobase transmembrane transporter activity, photoreceptor activity, kinase activity, chemokine receptor binding and protein glucosyltransferase activity. The two results have no terms in common. As with the Yellow cluster, the authors make no mention of significant molecular functions in the actual paper text. The terms are spread throughout the ontology and do not yield a consistent interpretation. The common rate is 0/1 = 0. Eight GO terms in the authors' findings are obsolete, so the obsolete rate is 8/43 = 0.186: 18.6% of the significant GO terms in the authors' results no longer exist.
Table 11: Significant result of author’s finding in Cyan cluster molecular function.
The list in the figure is significant GO term results of author’s finding. The five columns are
“GO”, “Term”, “Number of genes in category”, “Number of genes hit” and “Pvalue”. The list is
ordered by P-value from small to large.
Comparing FET + FDR method result with BIN + BON method result
The FET + FDR method gives the same result as the BIN + BON method: GO:0050786, “RAGE receptor binding”.
Enrichment Analysis of genes that Correlate with mixed outcomes
(Purple cluster)
GO Biological Process (BP)
Result analysis of FET + FDR method
For the Purple cluster BP, there is no significant GO term in the FET + FDR result, so no information is provided in this part.
Comparing FET + FDR method result with the author’s findings
There are 26 significant GO terms in the authors' findings. The GO terms in the AmiGO tree (Supplement Figure 11) are very diverse; there are generally three clusters: gland development, cell proliferation and iron ion transport. Since there is no significant GO term in the FET + FDR result, the two results have nothing in common and the common rate is 0. The authors' findings are shown in Table 12. We note again that apparently the authors did not find these results to be worth reporting in the paper, consistent with the FET + FDR lack of findings. Regarding GO annotation updates, one GO term in the authors' findings is obsolete, so the obsolete rate is 1/26 = 0.038: 3.8% of the GO terms in the authors' findings are obsolete.
Table 12: Significant GO term results of author’s finding in Purple cluster biological
process.
The list in the figure is significant GO term results of author’s finding. The five columns are
“GO”, “Term”, “Number of genes in category”, “Number of genes hit” and “Pvalue”. The list is
ordered by P-value from small to large.
Comparing FET + FDR method result with BIN + BON method result
Both the FET + FDR method and the BIN + BON method generate no significant GO terms in the result.
GO Cellular Component (CC)
Result analysis of FET + FDR method
For the Purple cluster CC, there is no significant GO term in the FET + FDR result, so no information is provided in this part.
Comparing FET + FDR method result with the author’s findings
There is only one significant GO term in the authors' findings, which provides little information. It is shown in Table 13. The obsolete rate is 0 and the common rate is 0.
Table 13: Significant GO term results of author’s finding in Purple cluster cellular
component.
There is only one term in both resulting list, the above list is the significant GO term result of
FET + FDR method, the below list is the significant GO term result of author’s findings.
Comparing FET + FDR method result with BIN + BON method result
BIN + BON method generates no significant GO term in the result.
GO Molecular Function (MF)
Result analysis of FET + FDR method
For the Purple cluster MF, there is no significant GO term in the FET + FDR result, so no information is provided in this part.
Comparing FET + FDR method result with the author’s findings
The authors' findings contain many binding activities, forming clusters such as protein binding, hormone binding and peptide binding. These results were not reported in the paper text. Since there is no significant GO term in the FET + FDR result for the Purple cluster MF, the two lists have nothing in common and the common rate is 0. The authors' findings are shown in Table 14. The AmiGO tree plot (Supplement Figure 12) shows that five GO terms in the authors' findings no longer exist, so the obsolete rate is 5/36 = 0.139: 13.9% of the GO terms in the authors' findings are obsolete in the GO database.
Comparing FET + FDR method result with BIN + BON method result
Neither the FET + FDR method nor the BIN + BON method yields any significant GO term.
Table 14: Significant GO term results of author’s finding in Purple cluster molecular
function.
The list in the figure is significant GO term results of author’s finding. The five columns are
“GO”, “Term”, “Number of genes in category”, “Number of genes hit” and “Pvalue”. The list is
ordered by P-value from small to large.
Common rate and obsolete rate analysis for part A
After finishing all the enrichment analyses in Part A, we can summarize the common rate and obsolete rate.
The average common rate for the Part A analysis is:
$$\mathrm{Avg\ CR} = \frac{0.119 + 0.294 + 0 + 0.190 + 0.2 + 0 + 0 + 0 + 0}{9} = 0.089$$
The average obsolete rate for the Part A analysis is:
$$\mathrm{Avg\ OR} = \frac{0.143 + 0 + 0.172 + 0.075 + 0 + 0.186 + 0.038 + 0 + 0.139}{9} = 0.084$$
In conclusion, on average only 8.9% of the GO terms in the FET + FDR results are in common with the authors' findings, and on average 8.4% of the GO terms in the authors' findings are obsolete in the GO database.
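The two averages can be re-derived from the counts reported in the nine Part A analyses above. The short Python sketch below is only a check of that arithmetic; the variable names are hypothetical, and a rate is taken to be 0 when its denominator is 0, matching how the text treats analyses with no significant FET + FDR terms.

```python
# (shared terms, FET + FDR terms) per analysis, in the order
# Yellow BP, CC, MF; Cyan BP, CC, MF; Purple BP, CC, MF.
common_counts = [(7, 59), (5, 17), (0, 0), (4, 21), (2, 10), (0, 1),
                 (0, 0), (0, 0), (0, 0)]

# (obsolete terms, terms in the authors' findings), same order.
obsolete_counts = [(9, 63), (0, 12), (5, 29), (8, 107), (0, 12), (8, 43),
                   (1, 26), (0, 1), (5, 36)]


def rate(numerator, denominator):
    """Ratio that is defined as 0 when the denominator is 0."""
    return numerator / denominator if denominator else 0.0


def average_rate(counts):
    """Unweighted mean of the per-analysis rates."""
    return sum(rate(a, b) for a, b in counts) / len(counts)


avg_cr = average_rate(common_counts)
avg_or = average_rate(obsolete_counts)
print(f"average common rate:   {avg_cr:.3f}")
print(f"average obsolete rate: {avg_or:.3f}")
```

Each analysis contributes equally to these averages regardless of how many GO terms it contains; pooling the counts instead would weight the larger analyses more heavily and give different numbers.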
Part B
Comparing FET + FDR method result with BIN + BON method result
For List 1, there are 133 significant GO terms in the BIN + BON result (Table 16) and 390 in the FET + FDR result (Table 15). For List 2, there are 169 significant GO terms in the BIN + BON result (Table 18) and 533 in the FET + FDR result (Table 17).
All the significant GO terms in the BIN + BON results are included in the FET + FDR results. Compared to the BIN + BON method, the FET + FDR method is more sensitive and returns more specific, more informative terms.
Result analysis of common FET + FDR method results for List 1 and List 2
We found all significant terms common to the analyses of the two lists, in order to compare with the authors' published analysis. There are 336 significant GO terms common to the FET + FDR results for List 1 and List 2 relative to biological process. The AmiGO tree plot of the intersection of List 1 and List 2 (Supplement Figure 13) shows that the main clusters relative to biological process are regulation of T cell mediated cytotoxicity, T cell differentiation, T cell proliferation, inflammatory response and NF-kappa-B signaling pathway, T cell lineage commitment, defense response to protozoan and bacterium, response to lipopolysaccharide, NK cell differentiation, apoptotic process, regulation of transcription, JAK-STAT cascade and positive regulation of tyrosine phosphorylation of STAT protein.
Comparing FET + FDR method result with the author’s findings
The results published in the paper (Table 19) show that the GO terms are generally clustered around T cell activation and proliferation. After comparing the FET + FDR results to the results published in the paper, I found that all the published results are included in the FET + FDR results except “GO:0023033 signaling pathway”, a GO term that no longer exists. Instead, the FET + FDR results include the more informative GO:0007165, signal transduction. Interestingly, we now find in the GO analysis a biological process that the authors reported was found only using Ingenuity Pathway Analysis [15]: T helper cell differentiation (GO:0042093). We also find enrichment of genes in the inflammatory response and regulation of apoptotic process, which were not found in the originally published analysis. The obsolete rate in the Part B analysis is 1/21 = 0.05 and the common rate is 20/336 = 0.06.
Table 15: Significant GO term results of FET + FDR method of List 1
The list in the figure is significant GO term (FDR<=0.1) results of FET + FDR method of List 1
which are ordered by FDR score from small to large in the last column. The six columns in the list
are “GO”, “samplenum”, “refnum”, “samplenoobs”, “homonoobs” and “FDR”.
Table 16: Significant GO term results of BIN + BON method of List 1
The list in the table is significant GO term (P-value<=0.05) results of BIN + BON method of List
1 which are ordered by P-value from small to large in the last column. The six columns in the list
are “GO”, “samplenum”, “refnum”, “samplenoobs”, “homonoobs” and “Pvalue”.
Table 17: Significant GO term results of FET + FDR method of List 2
The list in the table is significant GO term (FDR<=0.1) results of FET + FDR method of List 2
which are ordered by FDR score from small to large in the last column. The six columns in the list
are “GO”, “samplenum”, “refnum”, “samplenoobs”, “homonoobs” and “FDR”.
Table 18: Significant GO term results of BIN + BON method of List 2
The list in the table is significant GO term (P-value<=0.05) results of BIN + BON method of List
2 which are ordered by P-value from small to large in the last column. The six columns in the list
are “GO”, “samplenum”, “refnum”, “samplenoobs”, “homonoobs” and “Pvalue”.
Table 19: Significant GO term results of author’s finding.
The table is the significant GO term results of author’s finding for both table 1 and top tier of
table 2 cluster and bottom tier of table 2 and p-value < 1E-4.5 in discovery with same direction of
effect in replication.
Wilcoxon Signed-Rank Test for GO terms results difference between FET
+ FDR method and BIN + BON method based on R
The table below (Table 20) shows the input data for the Wilcoxon Signed-Rank Test. The
number in each cell is the number of significant GO terms found by each enrichment
analysis in part A and part B. For the purple cluster, neither BIN + BON nor FET + FDR
found any significant GO term. A possible reason is that the purple cluster gene list
mixes good and bad outcomes. However, this does not affect the Wilcoxon Signed-Rank
Test, because rows with no difference between the two groups are automatically
excluded from the calculation.
Figure 6 shows the result of the Wilcoxon Signed-Rank Test. The p-value of 0.03603 is
significant, which means that the FET + FDR method found significantly more GO terms
than the BIN + BON method in both the part A and part B enrichment analyses.
BIN + BON    FET + FDR
7            59
4            17
0            0
3            21
0            10
1            1
0            0
0            0
0            0
133          390
169          533
Table 20: Number of significant GO terms for each enrichment analysis with the BIN + BON
and FET + FDR methods.
The table gives the number of significant GO terms found by the BIN + BON and FET + FDR methods
in each enrichment analysis. The first column holds the BIN + BON counts and the second the
FET + FDR counts. Each row corresponds to one enrichment analysis from the previous sections.
Figure 6: Wilcoxon signed rank test result.
The figure shows the Wilcoxon signed rank test result for the two paired groups, BIN + BON
result numbers and FET + FDR result numbers. The p-value is 0.03603 which is significant.
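The signed-rank calculation behind Figure 6 can be retraced from the counts in Table 20. The sketch below is a plain-Python illustration, not the R code used in the thesis. It assumes a two-sided test and, because zero differences are present, a p-value from the normal approximation with continuity correction, which is what R's `wilcox.test` falls back to in that situation. The function name `signed_rank_normal` is hypothetical.

```python
from math import erfc, sqrt

# Counts of significant GO terms per enrichment analysis, copied from Table 20.
bin_bon = [7, 4, 0, 3, 0, 1, 0, 0, 0, 133, 169]
fet_fdr = [59, 17, 0, 21, 10, 1, 0, 0, 0, 390, 533]


def signed_rank_normal(x, y):
    """Two-sided Wilcoxon signed-rank test of y against x.

    Assumption: normal approximation with continuity correction.
    Returns (pairs used, sum of positive ranks, p-value).
    """
    # Pairs with no difference carry no information and are dropped.
    diffs = [b - a for a, b in zip(x, y) if b != a]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        size = j - i + 1
        tie_term += size**3 - size
        i = j + 1

    v = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    shift = v - mean
    correction = 0.5 if shift > 0 else (-0.5 if shift < 0 else 0.0)
    z = (shift - correction) / sqrt(var)
    return n, v, erfc(abs(z) / sqrt(2))


n_used, v_stat, p_value = signed_rank_normal(bin_bon, fet_fdr)
print(n_used, v_stat, round(p_value, 5))
```

Under these assumptions the five rows with no difference are dropped, the six remaining pairs all favor FET + FDR, and the p-value agrees with the 0.03603 reported in Figure 6 to the precision shown.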
Discussion
FET + FDR method vs BIN + BON method
Across all comparisons of the two methods, every significant GO term found by the BIN +
BON method is also found by the FET + FDR method. The Wilcoxon Signed-Rank Test shows
that the FET + FDR method yields significantly more GO terms. The FET + FDR results are
therefore more sensitive and more specific, and they provide more information than the
BIN + BON results. The significant GO terms in the FET + FDR results also make
biological sense for the diseases studied, breast cancer and multiple sclerosis. For
these reasons, we conclude that the FET + FDR method with an FDR criterion of 0.1
performs better than the BIN + BON method at the 0.05 significance level.
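To make the two pipelines concrete, the following is a minimal sketch of both on made-up counts. It assumes that the four count columns of the result tables mean: sample genes with the term (samplenum), reference genes with the term (refnum), and sample and reference genes without it (samplenoobs, homonoobs). It also assumes a one-sided test for over-representation only. PANTHER's own implementation may differ in these details, and all function names and numbers here are hypothetical.

```python
from math import comb


def fisher_greater(a, b, c, d):
    """One-sided Fisher exact p-value, P(X >= a), for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k)
               for k in range(a, min(row1, col1) + 1))
    return tail / comb(total, row1)


def binom_greater(k, n, p):
    """Upper-tail binomial p-value, P(X >= k), for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (FDR), returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each value is capped by its successors.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up counts per term: (samplenum, refnum, samplenoobs, homonoobs).
counts = {
    "term_a": (12, 300, 188, 19700),
    "term_b": (5, 150, 195, 19850),
    "term_c": (2, 400, 198, 19600),
}
terms = list(counts)

# FET + FDR: Fisher exact test per term, then Benjamini-Hochberg at 0.1.
fet_p = [fisher_greater(s, s0, r, r0) for s, r, s0, r0 in counts.values()]
fet_fdr_hits = {t for t, q in zip(terms, benjamini_hochberg(fet_p)) if q <= 0.1}

# BIN + BON: binomial test against the reference proportion, then Bonferroni at 0.05.
bin_p = [binom_greater(s, s + s0, r / (r + r0)) for s, r, s0, r0 in counts.values()]
bin_bon_hits = {t for t, p in zip(terms, bonferroni(bin_p)) if p <= 0.05}

print(sorted(fet_fdr_hits), sorted(bin_bon_hits))
```

The two pipelines differ in both steps: the test statistic (exact hypergeometric tail versus binomial tail) and the multiple-testing correction (FDR control at 0.1 versus family-wise control at 0.05). This is why the comparison in this section is between the combined methods rather than between the tests alone.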
FET + FDR method vs author’s finding
For the breast cancer paper [3], on average only 8.9% of the significant GO terms in the
FET + FDR results are shared with the authors’ findings. The highest common rate is
29.4%, and three of the nine analyses share no terms at all between the FET + FDR
results and the authors’ findings. From these common rates and the preceding
explanation of each analysis, we conclude that most of our FET + FDR results differ
from, and are more accurate than, the authors’ findings in GO enrichment analysis.
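As an illustration of the common rate used here, the sketch below assumes it is the share of the FET + FDR significant terms that also appear in the authors’ list, which is how the percentages above are worded. The function name and the term identifiers are placeholders, not values from either paper.

```python
def common_rate(our_terms, author_terms):
    """Fraction of our significant GO terms that also appear in the authors' list."""
    ours, theirs = set(our_terms), set(author_terms)
    if not ours:
        return 0.0
    return len(ours & theirs) / len(ours)


# Placeholder identifiers, not real results.
fet_fdr_terms = ["GO:term_a", "GO:term_b", "GO:term_c", "GO:term_d"]
author_terms = ["GO:term_a", "GO:term_x"]

rate = common_rate(fet_fdr_terms, author_terms)
print(f"common rate: {rate:.1%}")
```

With this denominator, a long FET + FDR list produces a small common rate even when every term reported by the authors is recovered.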
The first reason might be updates to the GO annotations in the Gene Ontology database
since the papers were originally published. In the results, we found several differences
in annotation between the FET + FDR results and the authors’ results. For example,
GO:0045321 is annotated as “leukocyte activation” in the current database but was
previously annotated as “immune cell activation”. In addition, in the AmiGO tree plots
of the authors’ findings, many GO terms are plotted outside the tree, which means that
these terms are no longer in the GO database.
We calculated the obsolete rate for each enrichment analysis to measure the extent of GO
annotation updates. The average obsolete rate in the breast cancer paper analysis is
high, 8.4%, meaning that on average 8.4% of the GO terms in the authors’ findings have
been made obsolete or deleted from the GO database since the paper was released. Thus,
we conclude that GO annotation updates affect the results reported in earlier
scientific papers.
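The obsolete rate can be sketched in the same way. The example assumes it is the share of the authors’ reported terms that are absent from the current GO release. A real check would use the obsolete flags in the GO release itself; here the current release is stood in for by a hypothetical set of identifiers.

```python
def obsolete_rate(author_terms, current_terms):
    """Fraction of the authors' GO terms that are absent from the current GO release."""
    reported = set(author_terms)
    if not reported:
        return 0.0
    missing = reported - set(current_terms)
    return len(missing) / len(reported)


# Placeholder identifiers: one reported term is no longer in the current release.
reported_terms = ["GO:term_a", "GO:term_b", "GO:term_c", "GO:term_old"]
current_release = {"GO:term_a", "GO:term_b", "GO:term_c", "GO:term_new"}

print(f"obsolete rate: {obsolete_rate(reported_terms, current_release):.1%}")
```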
In addition, some of the significant GO terms in the authors’ results are questionable.
In the column “Number of genes hit” of the published results, many of the reported
terms have very few genes, which makes them very sensitive to changes in the GO
annotations. It is not surprising that the addition of new annotations to the database
over the past nine years would change the result.
It should be noted that although the authors of the breast cancer paper provided the
gene lists for the three clusters and the gene ontology results, the article does not
describe the statistical method used in the enrichment analyses or the reference list.
Thus, we cannot determine exactly whether the large difference between the authors’
findings and our results is due to any of the reasons mentioned above.
Nevertheless, authors should state explicitly, and choose with care, the statistical
methods they use for GO enrichment analysis. We suggest that authors check for terms
with a low “Number of genes hit” in their enrichment analysis results and pay
attention to GO annotation updates, which might influence the GO results.
For the part B analysis of the multiple sclerosis susceptibility loci paper [4], we found
that all of the authors’ findings are included in the FET + FDR results, and that the
FET + FDR results contain additional enriched biological processes that may help to
interpret the causes of disease susceptibility. The common rate is again very small,
only 6%, and the obsolete rate is around 5%. As mentioned before, both the updated GO
database and the advantages of the FET + FDR method contribute to the larger number of
significant GO terms.
Thus, based on the conclusions above, we suggest that authors who use gene ontology
analyses in their scientific articles pay attention to the influence of GO annotation
updates on their results over time. Furthermore, the FET + FDR statistical method is
shown to be an efficient and accurate tool for GO enrichment analysis.
Works Cited
[1] Gene Ontology Consortium (2004) The Gene Ontology (GO) database and informatics
resource. Nucleic Acids Research, 32.
[2] Mi, H., Muruganujan, A., Casagrande, J.T. and Thomas, P.D. (2013) Large-scale gene
function analysis with the PANTHER classification system. Nature Protocols, 8, 1551–
1566.
[3] Finak, G., Bertos, N., Pepin, F., Sadekova, S., Souleimanova, M., Zhao, H., Chen, H.,
Omeroglu, G., Meterissian, S., Omeroglu, A., Hallett, M. and Park, M. (2008) Stromal
gene expression predicts clinical outcome in breast cancer. Nature Medicine, 14,
518–527.
[4] The International Multiple Sclerosis Genetics Consortium and The Wellcome Trust
Case Control Consortium 2 (2011) Genetic risk and a primary role for cell-mediated
immune mechanisms in multiple sclerosis. Nature, 476, 214–219.
[5] Subramanian, A., Tamayo, P., Mootha, V.K., et al. (2005) Gene set enrichment
analysis: a knowledge-based approach for interpreting genome-wide expression profiles.
Proceedings of the National Academy of Sciences of the United States of America,
102(43), 15545–15550.
[6] R Core Team (2000) R language definition. R Foundation for Statistical Computing,
Vienna, Austria.
[7] Schröder, M.S., Gusenleitner, D., Quackenbush, J., Culhane, A.C. and Haibe-Kains,
B. (2013) RamiGO: an R/Bioconductor package providing an AmiGO Visualize
interface. Bioinformatics, 29, 666–668.
[8] Howell, D.C. (2013) Statistical methods for psychology. Wadsworth Cengage
Learning, Belmont, CA.
[9] Deviant, S. (2010) The practically cheating statistics handbook. StatisticsHowTo.com,
Place of publication not identified.
[10] Kreiner, S. (1992) [A Survey of Exact Inference for Contingency Tables]: Comment:
Exact Inference in Multidimensional Tables. Statistical Science, 7, 163–165.
[11] McDonald, J.H. (2009). Handbook of Biological Statistics (2nd ed.), 70-75. Sparky
House Publishing, Baltimore, Maryland.
[12] Mehta, C.R., Patel, N.R. and Tsiatis, A.A. (1984) Exact Significance Testing to
Establish Treatment Equivalence with Ordered Categorical Data. Biometrics, 40, 819.
[13] McDonald, J.H. (2014). Handbook of Biological Statistics (3rd ed.), 254-256.
Sparky House Publishing, Baltimore, Maryland.
[14] Khatri, P., Sirota, M. and Butte, A.J. (2012) Ten years of pathway analysis:
current approaches and outstanding challenges. PLoS Computational Biology, 8(2),
e1002375.
[15] Krämer, A., Green, J., Pollard, J. and Tugendreich, S. (2013) Causal analysis
approaches in Ingenuity Pathway Analysis. Bioinformatics, 30, 523–530.
[16] The Gene Ontology Consortium. (2017). Expansion of the Gene Ontology
knowledgebase and resources. Nucleic Acids Research, 45(Database issue), D331–D338.
http://doi.org/10.1093/nar/gkw1108.
[17] Dragulescu, A.A. (2014) xlsx: Read, write, format Excel 2007 and Excel
97/2000/XP/2003 files. R package version 0.5.7.
https://CRAN.R-project.org/package=xlsx.
[18] Wilcoxon, F. Individual comparisons by ranking methods. Bobbs-Merrill, College
Division, Indianapolis, IN.
Supplements
Supplement Figure 1: AmiGO Tree of Yellow Cluster Biological Process of Author’s finding.
Supplement Figure 2: AmiGO Tree of Yellow Cluster Biological Process of FET + FDR method.
Supplement Figure 3: AmiGO Tree of Yellow Cluster Cellular Component of Author’s finding.
Supplement Figure 4: AmiGO Tree of Yellow Cluster Cellular Component of FET + FDR finding.
Supplement Figure 5: AmiGO Tree of Yellow Cluster Molecular Function of Author’s finding.
Supplement Figure 6: AmiGO Tree of Cyan Cluster Biological Process of author’s finding.
Supplement Figure 7: AmiGO Tree of Cyan Cluster Biological Process of FET + FDR finding.
Supplement Figure 8: AmiGO Tree of Cyan Cluster Cellular Component of Author’s finding.
Supplement Figure 9: AmiGO Tree of Cyan Cluster Cellular Component of FET + FDR finding.
Supplement Figure 10: AmiGO Tree of Cyan Cluster Molecular Function of author’s finding.
Supplement Figure 11: AmiGO Tree of Purple Cluster Biological Process of author’s finding.
Supplement Figure 12: AmiGO Tree of Purple Cluster Molecular function of author’s finding.
Supplement Figure 13: AmiGO Tree of common significant GO term results for List1 and List2 for FET + FDR.
Abstract
Gene Ontology (GO) uses a structured, controlled and dynamic terminology to describe and annotate the biological attributes and functions of genes and gene products. Enrichment analysis uses a statistical test to compare a researcher’s gene list with a reference gene list in order to find the GO terms (biological functions) that are over- or under-represented. The PANTHER (Protein ANalysis THrough Evolutionary Relationships) classification system is an online platform to which large gene lists can be submitted for gene list analysis, ontology and pathway analysis, queries and so on.
The research in this thesis has two aims. The first is to evaluate whether the Fisher Exact Test with the FDR procedure (FET + FDR) improves the GO term results of enrichment analysis compared with the Binomial Test with the traditional Bonferroni procedure (BIN + BON). The second is to assess the effect of GO annotation updates on the GO enrichment analyses of previously published scientific papers. Two published articles, “Stromal gene expression predicts clinical outcome in breast cancer” and “Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis”, were re-analyzed in this thesis in order to assess these two aspects of gene list analysis. A Wilcoxon Signed-Rank Test was performed at the end to compare the FET + FDR and BIN + BON results.
The gene lists provided by the articles were uploaded to the PANTHER classification system, and the GO term observation statistics were extracted from the system output. The statistical methods were run in R to generate the results. Common rates were calculated to measure the overlap between the FET + FDR results and the authors’ findings, and obsolete rates to evaluate the influence of GO annotation updates on the authors’ findings.
From the first part of the results, we found that, compared with the BIN + BON method, the FET + FDR method yields significantly more GO terms according to the Wilcoxon Signed-Rank Test, and that the GO terms are more specific. The second part shows that in most cases the FET + FDR results differ from, or contain more terms than, the authors’ original findings.
Thus, we conclude that the FET + FDR method is less conservative and more informative than the BIN + BON method, and that GO annotation updates influence the GO enrichment analysis results reported in the original publications.
Asset Metadata
Creator: Guo, Xinyu (author)
Core Title: Evaluating the effects of testing framework and annotation updates on gene ontology enrichment analysis
School: Keck School of Medicine
Degree: Master of Science
Degree Program: Biostatistics
Publication Date: 10/19/2017
Defense Date: 10/19/2017
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: BIN BON method, enrichment analysis, FET FDR method, gene ontology, OAI-PMH Harvest, PANTHER classification system, Wilcoxon signed-rank test
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Thomas, Paul D. (committee chair), Lane, Christianne (committee member), Mi, Huaiyu (committee member)
Creator Email: georgeguoxy@gmail.com, xinyuguo@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-445225
Unique identifier: UC11264000
Identifier: etd-GuoXinyu-5847.pdf (filename), usctheses-c40-445225 (legacy record id)
Legacy Identifier: etd-GuoXinyu-5847.pdf
Dmrecord: 445225
Document Type: Thesis
Rights: Guo, Xinyu
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA