Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Statistical modeling of sequence and gene expression data to infer gene regulatory networks
(USC Thesis Other)
Statistical modeling of sequence and gene expression data to infer gene regulatory networks
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
STATISTICAL MODELING OF SEQUENCE AND GENE EXPRESSION
DATA TO INFER GENE REGULATORY NETWORKS
by
Xiting Yan
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY)
August 2009
Copyright 2009 Xiting Yan
Dedication
To my parents, Zhong Yan and Xiongying Wu, whose love and support have always
been with me and led me to success.
ii
Acknowledgements
I would like to thank Dr. Fengzhu Sun for supporting and advising me through my
PhD research. I would also like to thank Dr. Ting Chen for introducing me into several
projects and giving me good advises on them.
Prof. Minping Qian who was my advisor in Peking University, has introduced me
into computational biology, a fascinating research field. I appreciate her introduction
and advises. She inspired my interest in computational biology.
iii
Table of Contents
Dedication ii
Acknowledgements iii
List Of Tables vii
List Of Figures viii
Abstract xii
Chapter 1: Introduction 1
1.1 Biological Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Molecular Experimental Techniques . . . . . . . . . . . . . . . . . . . . . 9
1.2.1 Microarray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.2 Second Generation Sequencing . . . . . . . . . . . . . . . . . . . . 12
1.3 Our Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3.1 Synopsis of chapter 2. . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3.2 Synopsis of chapter 3. . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3.3 Synopsis of chapter 4. . . . . . . . . . . . . . . . . . . . . . . . . . 17
Chapter 2: Identifying Differentially Expressed Alternatively Spliced Transcripts
using mRNA-seq Data 18
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Model and Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 Read Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.3 Likelihood Function . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.4 EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.5 Confidence Interval of the Estimation . . . . . . . . . . . . . . . . 28
2.2.6 Likelihood Ratio Test . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3 Simulation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Applications to Real Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.2 Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.3 The Maximum Likelihood Approach . . . . . . . . . . . . . . . . . 37
iv
2.4.4 Consistency of MLE results across lanes . . . . . . . . . . . . . . . 37
2.4.5 Genes with multiple expressed spliced variants . . . . . . . . . . . 38
2.4.6 Effects of gene annotations . . . . . . . . . . . . . . . . . . . . . . 43
2.4.7 Differentially expressed variants between the kidney and the liver . 44
2.4.8 Differentially spliced genes between the kidney and the liver . . . . 45
2.4.9 cDNA supports for the differentially expressed variants and differ-
entially spliced genes . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.4.10 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Chapter 3: Testing Gene Set Enrichment for Subset of Genes 60
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2.1 Association strength measures for individual genes . . . . . . . . . 63
3.2.2 Local association statistic for each strict subset and the global
statistic for the gene set . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.3 Permutation test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.4 Multiple testing correction for multiple gene sets . . . . . . . . . . 66
3.3 Simulation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.3.1 Simulation 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.3.2 Simulation 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.4 Applications to Real Data Sets . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4.1 Male Vs. Female Lymphoblastoid Cells . . . . . . . . . . . . . . . 76
3.4.2 P53 Status in Cancer Cell Lines . . . . . . . . . . . . . . . . . . . 78
3.5 ROC curves for Sub-GSE . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.6 Conclusions and Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Chapter 4: Gene Transcriptional Network Reconstruction 88
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2 Model and Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.2.1 Probability Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.2.2 EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.3 Preliminary Results: A Simulation Study . . . . . . . . . . . . . . . . . . 93
4.3.1 Simulated Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.3.2 Preliminary Results on Simulated Data . . . . . . . . . . . . . . . 96
4.4 Summary and Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Chapter 5: Future Work 99
5.1 Identifying Differentially Expressed Transcripts from RNA-seq Data . . . 99
5.1.1 Addressing the Coverage Bias . . . . . . . . . . . . . . . . . . . . . 100
5.1.2 Considering the Ambiguous Reads . . . . . . . . . . . . . . . . . . 101
5.1.3 Novel Transcripts Discovery . . . . . . . . . . . . . . . . . . . . . . 101
5.2 Testing Gene Set Enrichment for Subset of Genes . . . . . . . . . . . . . . 102
5.3 Gene Transcriptional Network Reconstruction . . . . . . . . . . . . . . . . 102
5.3.1 Estimating regulatory relationship through penalized likelihood . . 103
5.3.2 Estimating regulatory relationship through maximum parsimony . 104
5.3.3 Considering nucleotide polymorphisms in the genotypes . . . . . . 106
v
5.3.4 Distinguishing the Direct and Indirect Connections in the Gene
Transcriptional Regulatory Network . . . . . . . . . . . . . . . . . 107
References 108
vi
List Of Tables
2.1 Mapping results of 10 lanes (5 for the human kidney and 5 for the human
liver) of Solexa mRNA-seq data. For each lane, the table lists the number
of original reads, the number of reads mapped to mRNA transcripts anno-
tated in AceView by RMAP (a read may be mapped to multiple genes),
and the number of reads mapped to unique genes (a read may be mapped
to multiple spliced variants of the same gene). . . . . . . . . . . . . . . . 36
2.2 The average Pearson correlation coefficients between the estimated rela-
tive expression levels of spliced variants from the five different lanes that
sequenced the same kidney and liver sample at 3pM concentration (first
ten rows) and 1.5pM (the last row). Each row corresponds to one pair
of replicate lanes under the same concentration level. The last row is the
replicate lane pair comparison from 1.5pM. . . . . . . . . . . . . . . . . . 38
2.3 The list of the randomly selected 10 genes as well as comparison between
the cDNA supports and the TranSEQ estimation for each of their anno-
tatedsplicedvariantsinAceView. Amongthelastfourcolumns,thecounts
andthefractionsofcDNAs fromAceView (A)thatsupportthepresenceof
each variant in kidney (K) or liver (L) are compared with their estimated
expression levels by TranSEQ (T). . . . . . . . . . . . . . . . . . . . . . . 43
3.1 Thetop 7genesets bySub-GSE,GSEAandsigPath andtheir correspond-
ingp-values,FDRandmaxq-values. ThecytogeneticbandschrYp11 Xp22
wasnamedchrYp22intheoriginalgenesetdatawhichinclude8genesthat
are on both chrYp11 and chrXp22. . . . . . . . . . . . . . . . . . . . . . 77
3.2 The top 3 functional gene sets and their corresponding q-values and FDR
rates by Sub-GSE and GSEA. . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.3 Significant functional gene sets (q-value≤ 0.03) detected in 50 of the NCI-
60 cell lines. The q-values are calculated by Sub-GSE. . . . . . . . . . . . 80
vii
List Of Figures
1.1 Illustration of nucleotides and a DNA molecule. (picture from http://en.
wikipedia.org/wiki/DNA) . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 The central dogma of molecular biology: the process that the information
flowsfromDNAtoRNAiscalledtranscription,andtheprocessthatthein-
formation flows from RNA to protein is called translation. Unusual flow of
informationishighlightedingreen. (picturefromhttp://en.wikipedia.org/
wiki/The Central Dogma) . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Diagrams of transcription initiation, transcription elongation and tran-
scription termination. RNAP stands for RNA polymerase. (picture from
http://en.wikipedia.org/wiki/Transcription (genetics)) . . . . . . . . . . . 6
1.4 IllustrationofatypicalDNAmicroarrayexperiment. (picturefromhttp://
www.microarray.lu/en/MICROARRAY Overview.shtml) . . . . . . . . . . 10
1.5 Illustration of the (a) Sanger sequencing and (b) second generation se-
quencing. (picture from [69]) . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 The RNA-map of gene APOA1 from the ASPIC database. . . . . . . . . . 31
2.2 The accuracy of estimated relative expression levels for the first simulated
dataset. The average absolute error (AE) and its standard deviation (SD)
of the estimated relative expression levels are shown in the left panel. The
average absolute error (AE) rate and its standard deviation (SD) of the
estimated relative expression levels are shown in the right panel. . . . . . 33
2.3 The accuracy of the estimated relative expression levels in the second sim-
ulated dataset. The (a) mean and (b) standard deviation (SD) of the
absolute error rate of the estimated expression ratios of Tr1 over Tr2. The
(c) mean and (d) standard deviation of the AE for estimated expression
levels for the 6 unexpressed variants. . . . . . . . . . . . . . . . . . . . . . 51
viii
2.4 The accuracy of the estimated expression levels for the third simulated
dataset. The (a) mean and (b) standard deviation (SD) of the absolute
error of the estimated expression levels for different values of the ratio
between the expression levels of the 3’ end exons to the 5’ end exons. . . . 52
2.5 Distribution of the number of expressed spliced variants. (a) The number
of genes that have 0-20 and >20 expressed spliced variants. (b) Boxplots
of the number of expressed spliced variants across genes in kidney and in
liver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.6 The gene structures, base-level coverage by mapped reads and estimated
gene expression levels for gene NINJ1 in (a) the kidney and (b) the liver.
In both (a) and (b), the top half of the figure shows the base-level cov-
erage of reads mapped to exons of human gene NINJ1, and the bottom
half shows the gene structures of 5 alternatively spliced variants, NINJ1.a
through NINJ1.e, the estimated relative expression levels, and the number
of reads mapped to across two adjacent exons. A are the variants having
cDNA sequences to support their expression in the corresponding tissue,
as annotated in AceView. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.7 Differentially expressed spliced variants and differentially spliced genes.
(a) Venn diagram of the spliced variants expressed in each tissue among
the differentially expressed spliced variants. (b) Histogram of the p values
assessing the differential splicing for genes. . . . . . . . . . . . . . . . . . 55
2.8 Venn diagram of the numberof theexpressed spliced variants byTranSEQ
and cDNA sequence supports from AceView in kidney and liver. . . . . . 55
2.9 Comparison between expression levels of cDNA-supported spliced variants
and those without cDNA supports. Histograms for the expression levels of
the cDNA-supportedspliced variants and thosewithoutcDNA supportsin
(a) kidney and (b) liver. Boxplots for the expression levels of the cDNA-
supportedspliced variants and those withoutcDNA supportsin (c) kidney
and (d) liver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.10 The distribution of the estimated expression levels for all expressed spliced
variants in kidney (grey) and liver (white). . . . . . . . . . . . . . . . . . 57
2.11 Effect of the number of mapped reads on the predictions of expressed
spliced variants. Each bin corresponds to one given number of randomly
chosen reads for all the highly-expressed genes which have exactly 4 anno-
tated splicedvariants andaremappedtoby800-1,000 reads. Thefractions
of genes that have 0,1,2,3 and 4 expressed spliced variants determined
by (ab) controlling FDR<0.0001 and by (cd) the estimated expression
level>5% in kidney and in liver are shown. . . . . . . . . . . . . . . . . . 58
ix
2.12 Effectofthenumberofmappedreadsonthepredictednumberofexpressed
spliced variants. Each bin corresponds to one given number of randomly
chosen reads for all the highly-expressed genes which have exactly 6 an-
notated spliced variants and are mapped to by 800-1000 reads. For each
bin,thefractionsof genesthathave 0,1,2,··· ,6expressedsplicedvariants
determined by controlling FDR less than 0.0001 in (a) kidneyand (b) liver
areshown. In(c)and(d), wealsorestricttheestimated relative expression
levels to be at least 5%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.1 The average p-values of the 100 gene sets across the 100 different data sets
are plotted against their corresponding set sizes. . . . . . . . . . . . . . . 67
3.2 The histogram of the p-values under the null hypothesis of no association
between the gene sets and the phenotype. . . . . . . . . . . . . . . . . . . 68
3.3 The standard deviation of the p-values across the 100 runs of Sub-GSE on
simulation 1 is plotted against the average p-values for all the gene sets. . 70
3.4 The average ranks of the gene set related to the phenotype for Sub-GSE,
GSEA, GSA, and SigPath for different percentages of correlated genes
(PCG) and correlation coefficients within the chosen genes. Theleft panel
compares theaverage ranksin2-D plots inwhich each subplotcorresponds
to one value of PCG. For a given PCG, the average ranks from the four
methods are plotted against the correlation coefficient between the corre-
lated genes. The right panel shows the average rank versus the percentage
of correlated genes and correlation coefficient in a 3-D plot. . . . . . . . . 71
3.5 The standard deviation of the p-values across the 100 runs of Sub-GSE on
simulation 2 is plotted against the average p-values for all the gene sets. . 74
3.6 The average ranks of the target gene sets by Sub-GSE, GSEA, GSA and
SigPath for different percentages of correlated genes and correlation co-
efficients in Simulation II. The left panel compares the average ranks in
2-D plots in which each subplot corresponds to one value of PCG. For
a given PCG, the average ranks of the target sets from the four methods
areplotted against thecorrelation coefficient between thecorrelated genes.
The right panel shows the average ranks of the target sets versus thePCG
and correlation coefficient in a 3-D plot. The four cubes correspond to
Sub-GSE, GSEA, GSA and SigPath. . . . . . . . . . . . . . . . . . . . . . 75
3.7 Relationship between significant gene sets and p53. Solid arrows describe
the interactions between p53 and different biological processes or genes.
Dashed arrows show the definition of those gene sets in the corresponding
dashed rectangle. The solid arrows are constructed based on previous
works in [2,19,28,49,85,89]. . . . . . . . . . . . . . . . . . . . . . . . . . . 79
x
3.8 The ROC curves for Sub-GSE. . . . . . . . . . . . . . . . . . . . . . . . . 83
4.1 Evaluation of the EM algorithm for estimating gene regulation relation-
ships. The receiver operating characteristic curves (ROC) show the re-
lationship between the false positive rate and sensitivity using the EM
algorithm for different numbers of genotypes, the number of genotypes
with given gene pairs (k =1,5) . . . . . . . . . . . . . . . . . . . . . . . . . 97
xi
Abstract
Understanding the gene regulatory network has always been one of the important and
challengingtaskstounderstandingthemechanismsbehinddifferentbiologicalprocessesor
behaviors of organisms. Development in biological experimental techniques has produced
a deluge of data to help accomplish this task, including the microarray data, ChIP-chip
data, sequencing data, etc. Based on these techniques, biologists have designed different
experiments to investigate the regulatory networks from different aspects. However, due
to the random effects during the production of these data, statistical and probabilistic
models are needed to extract reliable regulatory relationship from the data. Moreover,
integration of different sources of data is also critical for obtaining complete and accurate
information regarding the biological mechanisms.
Inthisdissertation,Itrytoinferthegenetranscriptionregulatorynetworksbyfirstde-
tecting the downstream or regulated genes in the network, which tend to be differentially
expressed under different conditions such as different tissues or treatment versus control
experiment. Two different methods are developed to detect the differentially expressed
transcripts or genes from the sequencing data and the gene expression microarray data.
The first method estimates the expression levels of all the annotated mRNAs in avail-
able database usingthe mRNA sequencing data and identifies the differentially expressed
xii
mRNAs based on the estimations. In this way, however, the mRNAs are considered in-
dependently which contradicts the fact that interactions between genes are commonly
observed. Therefore, the second method was developed which detects sets of genes that
are enriched in differentially expressed or more generally speaking, phenotype associated
genes. The sets of genes are predefined so that genes in each set interact with each other
in certain way. Due to the limited knowledge of the interactions between mRNAs, cur-
rently this method can only be applied to the gene expression data. Applications of both
methods to simulated data sets and real data sets show robust and accurate predictions.
After the downstream genes are detected, the genes that regulates the expression of these
downstream genes arecritical toinfer theregulatory networks. Thus, weproposeanother
analysis method which utilizes an EM algorithm to predict the gene regulatory network
fromthegeneexpressiondata, sequencedataandallele-specific expressiondataincertain
number of genotypes. Preliminary simulation studies suggest the minimum number of
genotypes that are needed to achieve satisfactory detection accuracy. Due to the lack of
real data set, so far as, this method has only been applied to simulated data sets.
xiii
Chapter 1
Introduction
Thedifferentbiologicalprocessesofalivingcellarecarriedoutthroughaconcertedsystem
of genes, proteins, RNAs and other molecules. The system is coordinated through inter-
actions between molecules which form a diverse molecule interaction networks. Under-
standingthesemoleculeinteraction networks hasalways beenthemajorgoal ofmolecular
cell biology. This includes understanding the interactions in the networks, the conditions
underwhichthenetworkoperatesandthemechanismbehindtheresponsesofthenetwork
to different perturbations. Among these molecule interaction networks, the gene tran-
scriptional regulatory network is essential because it determines the mRNA expression in
the cell from which proteins are translated.
Theadventofdifferentexperimentaltechniques, includingthewholegenomesequenc-
ing,themeasurementofmRNAlevelsundervariousconditionsandgenomewidedetection
of protein-protein and protein-DNA interaction, provides us abundant and various types
of information for the gene transcriptional regulatory network. However, due to ran-
dom effects during the production of these data, statistical and probabilistic models are
needed to extract reliable regulation relationship from the data. Moreover, integration of
1
different sources of data is also critical for obtaining complete and accurate information
regarding the biological mechanisms.
1.1 Biological Background
Webeginwithabriefoverview ofthebasicconceptsinmolecularbiologythatarerelevant
to this dissertation. Readers interested in details are referred to molecular biology books
(e.g. [48]).
Gene was firstdefined by Mendel as a “particular factor” that passes unchanged from
parents to progeny more than one century ago. Since then, scientists have discovered
that genes actually lie on chromosomes as a linear array where chromosomes consist
of DNA (deoxyribonucleic acid) and proteins. The DNA, instead of the proteins, is the
actualgeneticmaterialthatcarriesthegeneticinformationfromgenerationtogeneration.
DNA is made of two strands of nucleotides that wrap around each other. The nucleotides
are further made of three parts: a phosphate group, a sugar group and one of four types
of nitrogen bases as illustrated in Figure 1.1. The four types of nitrogen bases found
in nucleotides are: adenine (A), thymine (T), guanine (G) and cytosine (C) which lead
to four types of nucleotides as well. To form one strand of DNA, nucleotides are linked
into a chain so that the phosphate group and the sugar group of one nucleotide are
connected to the sugar group of the nucleotide at the 5’ end and the phosphate group of
the nucleotide at the 3’ end, respectively. The alternating phosphate and sugar groups
form the backbone of the DNA strand and the order, or sequence, of the nitrogen bases
attached to the backbone determines what biological instructions are contained in the
2
Phosphate-
deoxyribose
backbone
Adenine
Cytosine
Guanine
Thymine
O
O
O
O
O
O
O
O O
O
O
O
O
O
O
O
O
O
O
O
O
O
O
O
O
O
O
O
O
O
O
O O
O
O
O
O
O
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
O _
O _
O _
O _
O _
_ O
_ O
_ O
_ O
_ O
P
P
P
P
P
P
P
P
NH 2
OH
OH
NH
H 2 N
HN
NH 2
H 2 N
HN
H 2 N
NH
NH 2
3' end
5' end
3' end
5' end
Figure 1.1: Illustration of nucleotides and a DNA molecule. (picture from http://en.
wikipedia.org/wiki/DNA)
3
DNA. For example, the sequence ATCGTT might instruct for blue eyes, while ATCGCT
might instruct for black.
The two strands of the DNA molecule bind to each other through weak hydrogen
bonds between the nitrogen bases (Figure 1.1) on the two strands according to a specific
pairing rule. Guanine (G) can hydrogen bond specifically only with cytosine (C), while
adenine (A) can bond specifically only with thymine (T). The pairs of bases are called
⁀
BasePairs(bp). Thebasepairingruleiscalledbase complementarity. Inthisway,thetwo
strandsare complement to each other and thesequence of bases is completely determined
by only one of the strands.
TheDNAmoleculecarriesinstructionsforsynthesisandregulationsforproteinswhich
is a class of large molecules composed of one or multiple chains of amino acids. Proteins
participate in diverse biological processes in cells and determine the shape and the struc-
ture of the cell. The instructions for synthesizing a specific protein are contained in a
segment oftheDNAsequenceknownasgene. Theprocessesthattheproteins aresynthe-
sized under the instructions in DNA are summarized as the central dogma of molecular
biology shown in Figure 1.2. Essentially, there are two major processes named transcrip-
tion and translation.
• Transcription: a piece of messenger RNA (mRNA) is synthesized using the DNA
sequence as template. The process is implemented mainly by RNA polymerase and
transcription factors. The process by which genes are transcribed into mRNAs and
function in the cell is called gene expression.
4
Figure 1.2: The central dogma of molecular biology: the process that the information
flowsfromDNAtoRNAiscalledtranscription,andtheprocessthattheinformationflows
from RNA to protein is called translation. Unusual flow of information is highlighted in
green. (picture from http://en.wikipedia.org/ wiki/The Central Dogma)
• Translation: themRNAsaretranslatedintoproteinswitheachtripletofnucleotides
in it corresponding to an amino acid. The process is facilitated mainly by the
ribosome and the transfer RNAs.
Since this dissertation is focused on gene transcriptional regulatory networks, we pro-
vide more details about transcription here. The DNA molecules are wrapped around
nucleosomes to form chromosomes. For a gene to be transcribed into mRNAs, the DNA
region corresponding to the gene has to become accessible first. This is usually achieved
by the chromatin structure modifications mediated by various post-translational modifi-
cations on the histone proteins in the nucleosomes. Once the DNA region corresponding
to the gene is accessible, the RNA polymerase will bind to the gene’s promoter region on
the DNA. In eukaryotes, however, the binding of the RNA polymerase is mediated by a
collection of proteins called transcription factors(Figure 1.3). The transcription factors
will bind to the DNA first and only after certain transcription factors are attached to the
promoter region does the RNA polymerase bind to it.
5
RNAP
5'
3' 5'
3'
Template
Strand
Coding
Strand
Gene
Transcription Initiation
RNAP 5'
3' 5'
3'
Template
Strand
Coding
Strand
5'
Transcription Elongation
RNAP
5'
3' 5'
3'
Template
Strand
Coding
Strand
5'
3'
Transcription Termination
Figure 1.3: Diagrams of transcription initiation, transcription elongation and
transcription termination. RNAP stands for RNA polymerase. (picture from
http://en.wikipedia.org/wiki/Transcription (genetics))
After the RNA polymerase binds to the promoter region, one strand of the DNA is
used as the template for RNA synthesis. This is known as the transcription elongation
shown in Figure 1.3. The RNA molecule also consists a chain of the same four different
types of nucleotides except that thymine (T) is replaced by uracil (U) and the deoxyri-
bose in DNA becomes a ribose (5-carbon) sugar. As the transcription proceeds, RNA
polymerase traverses the template strand from 3’ to 5’ and uses the base pairing comple-
mentarity with the DNA template to create an RNA copy that is complementary with
the template strand. Since the RNA is complementary with the template strand, it has
the same sequence as the other strand (coding strand) and is generated from 5’ to 3’.
6
In bacteria, the transcription is terminated in two ways. In the first way which is
named ”Rho-independent transcription termination”, an RNA structure can be formed
followed by a run of U’s in the 3’ end, which detach the RNA from the DNA template.
Alternatively, a protein called ”Rho” destabilizes the interaction between the template
DNA strand and the RNA which releases the RNA from the elongation complex. In
eukaryotes, the transcription termination is less well understood. The limited knowledge
suggests that the transcription termination in eukaryotes should involve cleavage of the
RNA transcript, which is followed by adding A’s at the 3’ end of the RNA, namely
polyadenylation.
After the transcription termination, a new primary RNA transcript is generated. In
eukaryotes,beforetheRNAtranscriptistranslatedintoprotein,certainpost-transcriptional
modificationsareappliedtoconverttheprimarytranscriptRNAsintomatureRNAs. For
the precursor messenger RNA (pre-mRNA) to become mature messenger RNA (mRNA),
the post-transcription modifications include 5’ capping, 3’ polyadenylation and RNA
splicing, all of which occur prior to protein synthesis. Among them, the RNA splicing
is vital for the correct translation in eukaryotes as the human pre-mRNAs contain both
exons and introns. The exons are coding sections of the primary RNA transcript and the
introns are the non-coding sections of the primary RNA transcript. The RNA splicing
process cuts out the introns and connects the remained exons. Interestingly, different
combinations of introns may be cut out in different pre-mRNAs transcribed from the
same gene. The different linear combinations of exons then undergo the process of trans-
lation where specific and unique sequences of amino acids are synthesized, resulting in
different protein isoforms. This RNA splicing variation mechanism is called alternative
7
splicing which facilitates the synthesis of a greater variety of proteins from the relatively
small number of genes.
Although every cell in one organism has the same DNA molecules in the nucleus,
different cells express different genes at different levels under different conditions. And
more importantly, the phenotypic differences that distinguish the various types of cells in
ahighereukaryotearelargelyduetothedifferencesintheexpressionofgenesthatcodefor
proteins, that is, those RNAs transcribed by RNA polymerase. In theory, the expression
of genes can be regulated at any stage of transcription. However, for most genes, the
regulation on the transcription initiation by the interaction of RNA polymerase with its
promoter is the major control point for gene expression.
By examining the groups of genes under common control, scientists found that each
group of genes share promoter (or enhancer) elements that can be recognized by regu-
latory transcription factors. Under conditions when the gene needs to be expressed, the
corresponding transcription factors become available. Through the binding of the tran-
scription factors to the promoter elements, transcription factors can enhance or repress
the expression of the nearby genes. The response element in the promoter regions for
transcription factors are typically short, 5-25 base pairs in length. An organism typically
has hundredsof different transcription factors, each of which is capable of recognizing dif-
ferentresponseelements inthepromoterregions of certain groupsof genes. Transcription
factors usually work together to ensure that the correct amount of RNAs is transcribed.
Although transcription factors are critical for gene expression regulation, they are
definitely not the only factors that control the gene expression. Before the transcription
is initiated, another class of proteins regulate the structure of the DNA molecule so that
8
the correct regions in the DNA are accessible to transcription factors and transcription
initiation complex. After the primary RNA transcript is synthesized, as we have men-
tioned, the post-transcription modifications will control whether or not the RNA will
be translated into proteins. Also the mature mRNAs can be degraded and even after
the mRNA is translated into proteins, post-translational modifications can regulate the
structure of protein and thus determine its activity.
1.2 Molecular Experimental Techniques
To understand the gene transcriptional regulatory network, biologists usually exert vari-
ous perturbations (mutations, heat shock, etc.) to the organisms or obtain samples from
different tissues. Changes or differences in the gene expression levels, the sequences, and
the protein-protein or protein-DNA binding will be very important. With the develop-
ment of molecule experimental techniques in recent years, biologists have been able to
measure these changes or differences at a large scale. Here, we give a review on the
experimental technologies that are relevant to the analysis in this dissertation.
1.2.1 Microarray
With the availability of genome sequence and the development in spotting hybridization
probes,atechnologycalledDNAmicroarrayhasbeendevelopedtomeasuretheexpression
levelsofthousandsofmRNAtargetssimultaneously,whichtakesasnapshotofthemRNA
expression in the cell.
The microarray technology is based on DNA hybridization: the process in which a
single stranded DNA binds to its complement strand of DNA molecule. Figure 1.4 gives
9
Figure 1.4: Illustration of a typical DNA microarray experiment. (picture from http://
www.microarray.lu/en/MICROARRAY Overview.shtml)
a diagram describing a typical cDNA microarray experiment. For each target of interest
suchasgeneormRNA,asetofshortprobesaredesignedtospecificallyhybridizewiththe
target molecules. The probes can be oligonucleotides or complementary DNA (cDNA)
which will be immobilized and organized on a solid surface in certain way. Meanwhile,
to prepare the samples, the nucleotide acid of interest is purified first. This can be all
the RNAs, DNAs from different species, or RNA/DNA bound with particular proteins.
If RNAs are purified, they will be reverse transcribed into cDNAs. The extracted DNA
molecules will be amplified and labeled using either fluorescent or radiolabels. The la-
beled samples are then mixed with the microarray so that the probes on the array can
hybridize with their target DNA molecules in the sample. Since the samples are labeled,
after the hybridization, a laser excites the dye and a detector measures its dye intensity.
10
The intensities of the fluorescent or radio for the probes corresponding to each gene are
considered as the expression level.
However, since different probes have different binding affinities and we do not know
the exact amount of each probe, the intensities can not be directly associated with the
real amount of target molecules. Instead, the cDNA microarray experiment usually mea-
sures the gene expression in both the target sample and the reference sample that are
dyed with different colors on the same array (Figure 1.4). Typically green is used to label
the reference sample while the target sample is labeled red. The mixture of the reference
and target samples are hybridized with the same microarray and the difference between
the intensities of green and red fluorescent stands for the expression level of the target
molecule. We note that there are also other microarray technologies (e.g., the oligonu-
cleotide array of Affymetrix). Themajor difference between them lies in what is attached
to the array and the protocols for using them.
The genome wide measurement of the expression levels is called expression profile
which actually describes the changes in the gene expression of each gene to perform the
coordinated tasks in adaption to a changing environment such as mutations, different
tissues, etc. We also notice that although DNA microarray can measure the changes in
the mRNA expression levels which is a direct consequence of transcriptional regulation,
there are many more important cellular properties that are not observed by microarrays
but are very important for understanding the transcriptional regulation. For example,
since the probes on the microarray are designed based on the sequence of the target
molecules, it can not be used to discover novel transcripts in the sample. In terms of
mRNAs,aswedescribedinsection 1.1, alternative splicingisprevalentinhigheukaryotes
11
like human. Thus, DNA microarray can be designed to measure the expression of the
connectionregionsbetweenexons. Althoughthiscanbeusedtodetectgenesthatundergo
alternativesplicing,itdoesnotprovidethesequencesofalltheexpressedmRNAisoforms.
The recently emerged second generation sequencing technology has provided us potential
to accomplish this.
1.2.2 Second Generation Sequencing
Early in 1990s, the Sanger biochemistry (Figure 1.5a) was used to do high-throughput
DNA sequencing such as the human genome sequencing. The DNA molecule to be se-
quenced is prepared in two ways depending on whether we have any information for the
DNA sequence. For shotgun de novo sequencing, which means we do not have or we do
not trust the sequence information of the target DNA molecule, the DNA is randomly
fragmented and cloned into plasmid which will be further transformed into Escherichia
Coli. For targeted re-sequencing, which means we do have some sequence information
of the target DNA molecule, PCR amplification is carried out with primers designed ac-
cording to the sequences flanking the target region. The sequencing is performed in a
“cycle sequencing” reaction [69] where the cycle of template denature, primer annealing
and primer extensions are performed. Each round of primer extension is stochastically
terminated by the dideoxynucleotides (ddNTPs) which are labeled with fluorescent. The
resultant single stranded and end fluorescent labeled products are separated in capillary-
based polymer gel. As the fragments of different lengths exit the capillary, a laser excite
thefluorescentlabelintheendofthemwhichisdetected andprovidesthesequenceofthe
template sequence. The Sanger biochemistry can be applied to achieve read-lengths of
12
Figure1.5: Illustrationofthe(a)Sangersequencingand(b)secondgenerationsequencing.
(picture from [69])
up to about 1,000 bps with an accuracy as high as 99.999%. The cost of high throughput
shotgun genomic sequencing using Sanger biochemistry is about $0.50 per kilobase [69].
There are, however, many factors that limit the applications of the Sanger sequencing
to comprehensive analysis of genomes, transcriptomes, and interactomes. For example,
the Sanger biochemistry needs the E. Coli transformation and colony picking which is
a bottleneck for the parallelism of the sequencing. Furthermore, the cost of the Sanger
biochemistry is relative high given the huge size of the genomes, transcriptomes and the
interactomes. Theselimitationsleadtothedevelopmentofthesecondgenerationsequenc-
ing technologies, including the 454 sequencing, Solexa technology, the SOLiD platform,
the Polonator and the HeliScope Single Molecule Sequencer technology. Although dif-
ferent technologies have very different protocols, they have a common work flow which
is described in Figure 1.5b. First, a library of DNA molecules is prepared by random
13
fragmentation of DNA, followed by in vitro ligation of common adaptor sequences to the
fragments. The adaptor ligated fragments are then amplified through emulsive PCR, in
situ polonies or bridge PCR. Although different technologies may use different amplifica-
tion methods, the derived PCR amplicons all end up spatially clustered. Based on the
PCRamplicons,thesequencingprocessproceedswithalternatingcyclesofenzyme-driven
biochemistry and imaging-based data acquisition. In each cycle, the fluorescently labeled
nucleotides are incorporated according to each template sequence in the PCR amplicons
by either polymerase or ligase. The following imaging of the full array then help iden-
tify the incorporated nucleotides. The alternating cycles of this process can eventually
provides the sequences of the target DNA molecules. To sequence mRNAs using these
technologies, reverse transcription is needed to convert mRNAs into cDNA first. Then
the technologies can be applied directly to the cDNAs to obtain the sequences of them.
The second generation technologies, however, have certain disadvantages compared
to the Sanger biochemistry. The most prominent ones are the relatively short read-
length and the lower sequencing accuracy. Although these limitations create algorithmic
challenges for data analysis, we note that these technologies will be improved in the
immediate future.
14
1.3 Our Approaches
1.3.1 Synopsis of chapter 2
In chapter 2, we present a method that estimates the alternatively spliced mRNA ex-
pression levels and identifies the differentially expressed mRNAs from the mRNA se-
quencing data. The second generation sequencing technology has been utilized to inves-
tigate the presence and prevalence of gene transcripts in multiple organisms. Although
it is evident that alternative splicing is prevalent in higher eukaryotes, previous studies
have not addressed the following critical biological questions: 1) the relative expres-
sion levels of alternatively spliced variants (mRNA isoforms) of given genes and 2) the
differentially expressed spliced variants and differentially spliced genes between two sam-
ples. Based on the annotated alternatively spliced variants in the AceView database
(http://www.ncbi.nlm.nih.giv/IEB/Research/Acembly/), we have developed a pow-
erful statistical method, termed TranSEQ [95], based on maximum likelihood estimation
to estimate the relative expression levels of those annotated alternatively spliced variants
in AceView. Moreover, based on these estimations, we develop methods which employ
the likelihood ratio test to identify differentially expressed spliced variants and differen-
tially spliced genes between two samples. Thus, by applying TranSEQ to two mRNA-seq
data sets obtained from human liver and kidney samples, we discover that (1) around
95% of the highly-expressed human genes (mapped to by at least 100 reads out of the
65 million Solexa reads) are alternatively spliced (10,819 in human kidney and 9,208 in
human liver), among which more than 70% of genes express at least two spliced vari-
ants at an expression level higher than 15%, (2) the average number of the expressed
15
spliced variants per gene is 5.9 for kidney and 5.5 for liver, (3) only about 3.6% of all the
annotated spliced variants are differentially expressed (FDR < 0.0001) between the two
tissues, and (4) 2,361 genes which we term differentially spliced genes, show significantly
different expression patterns for their spliced variants between the two tissues. These
findings indicate that a large number of the alternatively spliced variants are expressed
in each single tissue, while the major differences in the transcriptomes between tissues
are attributed to the difference in the relative expression levels of these variants.
1.3.2 Synopsis of chapter 3
In chapter 3, we present a method that combines gene expression data with functionally
related gene sets to detect gene sets that are enriched in phenotype associated genes.
When the given phenotype is binary, the detected sets of genes are enriched of genes that
are differentially expressed when the cell has the two different phenotypes. Before this
method, many methods have been developed to test the gene set enrichment of genes
associated with certain phenotypes or cell states. These approaches usually combine
gene expression data with functionally related gene sets as defined in databases such as
GeneOntology (GO), KEGG, or BioCarta. The results based on gene set analysis are
generally more biologically interpretable, accurate and robust than the results based on
individual gene analysis. However, while most available methods for gene set enrichment
analysistesttheenrichmentoftheentiregeneset,itismorelikelythatonlyasubsetofthe
genesinthegenesetmayberelatedtothephenotypesofinterest. Therefore,wedevelopa
novel method, termed Sub-GSE[94], which measures the enrichment of a predefinedgene
set, or pathway, by testing its subsets. The application of Sub-GSEto two simulated and
16
two real data sets shows that Sub-GSE is more sensitive than previous methods, such as
GSEA, GSA, and SigPath, in detecting gene sets associated with a phenotypeof interest.
This is particularly true for cases in which only a fraction of the genes in the gene set are
associated with the phenotypes. Furthermore, the application of Sub-GSE to two real
data sets demonstrates that it can detect more biologically meaningful gene sets than
GSEA.
1.3.3 Synopsis of chapter 4
In chapter 4, we describe the analysis on reconstructing the underlying gene transcrip-
tional regulatory network by combining the gene expression data, the sequence data and
the allelic-specific expression data. To reconstruct the underlying gene regulatory net-
work, gene expression changes were commonly measured in single gene knockouts and
differentially expressed genes are suspected to be regulated by the knocked out gene. It
is costly and labor intensive to generate knockouts for every single gene. One potential
wayto circumvent theselimitations isto examineandanalyzegene expressionfrommany
individual genotypes together with the sequence alteration information. We construct a
probabilistic model that integrates the gene expression data with the sequence alteration
datatopredicttheregulatoryrelationshipsamonggenes. Themodelisabletorecover the
regulatory relationships accurately as revealed by our simulation studies. The minimum
number of samples needed to obtain accurate predictions by the method is also given.
17
Chapter 2
Identifying Differentially Expressed Alternatively Spliced
Transcripts using mRNA-seq Data
2.1 Introduction
Alternative splicing is known to be prevalent in most eukaryotes, especially in human.
Analysis of EST-cDNA sequences and microarray profiling data has indicated that more
than half of human genes are alternatively spliced [39,40,45,55]. In particular, the use of
alternative exons varies across different developmental stages and is tissue- or pathology-
specific [5,33,62,66]. The importance of alternative splicing is further illustrated by
the discovery of a large number of human diseases that are attributed to mis-splicing
events [27,72]. Therefore, measuring the expression levels of the different spliced variants
of genes is critical to our ability to understand the function of genes, the development
and the disease etiologies. Computational methods have been developed for annotating
alternatively spliced transcripts by aligning EST-cDNA sequences to genome sequences,
18
and the corresponding databases have been built for genome-wide annotation of alterna-
tive splicing in human and other organisms, for example, ASD [73], ASPicDB [15], and
AceView [80].
High-throughput biotechnologies have also been developed to investigate alternative
splicing events. In the past, exon and splicing arrays have been employed to measure the
expression levels of spliced variants [15,39]. However, these array-based approaches have
been shown to be ineffective to some short exons, and they also suffer from high noise
levels [11]. The second generation sequencing technology has been utilized to measure
mRNA expression levels and to discover novel transcripts and alternative splicing events
in multiple organisms [52,57,59,78,79,84,88]. This technique has proved to be highly
replicable and reliable [52]. However, estimating the expression levels of alternatively
spliced variants (mRNA isoforms) of genes is a question that remains unresolved, as
is the identification of differentially expressed spliced variants and differentially spliced
genes.
Therefore,wetakeadvantageofcurrentannotationsofalternativelysplicedvariantsof
humangenesintheAceViewdatabase(http://www.ncbi.nlm.nih.gov/IEB/Research/
Acembly/),andwedevelopapowerfulstatistical methodnamedTranSEQ.TranSEQuses
maximum likelihood estimation (MLE) to estimate the relative expression levels of each
given gene’s spliced variants, i.e., the ratio of the number of expressed transcripts for one
variant over that for the whole gene. Our approach consists of two main steps. As a first
step,weuseaprogramcalledRMAP[70]tomapallreadstotheannotatedhumanmRNA
transcripts obtained from AceView [80]. Next, TranSEQconstructs a probabilistic model
for the relative expression levels of all the annotated spliced variants for each gene and
19
employs an expectation-maximization (EM) algorithm to estimate the expression levels
sothatthelikelihood functionintheprobabilisticmodelismaximized. Inthemean time,
the likelihood ratio statistic is used to test if a variant is truly present in the sample or
not. Besides estimating the expression levels, when data sets from two samples are given,
we use the likelihood ratio test to identify differentially expressed spliced variants and
differentially spliced genes.
We applyTranSEQtotwo mRNAsequencingdatasets generated from humankidney
and liver samples in a previous study [52] and estimate the relative expression levels of
the spliced variants in both tissues. We compare the results between the two tissues and
identify differentially expressed variants and differentially spliced genes. The comparison
showsthatthereisextensive variation between thetwotissuesintheexpressionof spliced
variants.
2.2 Model and Algorithm
2.2.1 Read Mapping
Before theprobabilisticmodelof theobserved readsis constructed, for each read, wemap
thereadsbacktotheannotatedmRNAsequencesfromAceViewwithupto2mismatches.
TheAceViewdatabaseprovidesstrictlycDNA-supportedhumantranscriptomeandgenes
by co-aligning all quality-filtered human cDNA data from GeneBank, dbEST and the
RefSeq to the human genome. We download all the annotated human mRNA sequences
on autosomal chromosomes (includingUTRregions) fromtheAceView Humandatabase,
Apr07 release, excluding the clouds genes that do not have strong evidence of existence.
20
Reads aremappedonto thesemRNAsequences byaprogramcalled RMAP[70], allowing
up to 2 mismatches and no insertion or deletion. That is to say, RMAP searches in
the downloaded mRNA sequences for subsequences that have up to 2 mismatches when
compared with the read sequence. For a given read, there will be four different mapping
results. The read can be mapped to 1) a unique mRNA; 2) multiple mRNAs that are
transcribed from the same gene; 3) multiple mRNAs transcribed from multiple genes;
and 4) none of the mRNAs. Reads that have the first two mapping results are termed
as uniquely mapped reads. Those reads mapped to mRNAs from multiple genes are
called ambiguous reads. Both the uniquely mapped reads and the ambiguous reads can
be included in our analysis but the algorithms will be different. In this section, I will
describe the algorithm that only consider the uniquely mapped reads. The model and
algorithm that include the ambiguous reads can be found in the section of extensions on
the method. When we only consider the uniquely mapped reads, the algorithm is simpler
than that when both uniquely mapped reads and the ambiguous reads are included. If
theproportionoftheambiguousreadsissmall, wesuggestthatonlytheuniquelymapped
reads are included in the analysis.
2.2.2 Notations
To make the formulation of our model clear, we first introduce the notations. Let G =
(g
1
,g
2
,··· ,g
N
) betheset of genes. For any given gene g
i
∈G, let S
i
= (s
i1
,s
i2
,··· ,s
im
i
)
be the set of spliced variants (mRNA isoforms) annotated in AceView for gene g
i
and
let S =
S
N
i=1
S
i
be the set of all annotated spliced variants of all genes from AceView.
For any isoform s
ij
∈ F, let L
ij
be its length and θ
ij
be the fraction of s
ij
among all
21
the mRNA molecules. For gene g
i
, let θ
i
= (θ
i1
,θ
i2
,··· ,θ
im
i
) be the set of fractions for
all its mRNA isoforms and Θ ={θ
1
,θ
2
,··· ,θ
N
} be the set of fractions of all mRNA
isoforms. The goal is to estimate all the fractions in Θ and detect those isoforms that are
differentially expressed under different conditions.
Let R ={r
1
,r
2
,··· ,r
n
} be the set of all n observed uniquely mapped reads. Since
these reads are mapped to isoforms of unique genes, R can be split into N parts denoted
by{R
1
,R
2
,··· ,R
N
}, where R
i
is the set of reads that are mapped to s
i1
,s
i2
,··· ,s
im
i
,
i.e.,isoforms ofgeneg
i
. For anygiven readr, wefindthecandidateisoformsthatgenerate
it by the mapping program RMAP. If readr can be mapped to isoform s
ij
, let δ(r→s
ij
)
be 1 and 0 if otherwise.
2.2.3 Likelihood Function
We assume that the reads are independently and randomly sampled from the mRNAs.
Then based on the above notations, we know that the probability of a random read r
being generated by s
ij
is
P(r←s
ij
|Θ,S) =
θ
ij
(L
ij
−|r|+1)
P
N
k=1
P
m
k
m=1
θ
km
(L
km
−|r|+1)
,
where|r| is the length of the read. To simplify the formula, let π
i
=P(r←S
i
) which is
the probability of a random read r being generated by isoforms in S
i
and α
ij
= P(r←
22
s
ij
|r←S
i
) which is the probability of the read r actually being generated by the isoform
s
ij
given that the read is generated by isoforms in S
i
. We have that
P(r←s
ij
|Θ,S) =P(r←s
ij
|r←S
i
)P(r←S
i
)=π
i
α
ij
.
Given that the read is generated by s
ij
, the probability of observing the sequence of the
read r is
P(r|r←s
ij
,Θ,S) =
δ(r→s
ij
)
L
ij
−|r|+1
. (2.1)
Therefore, thelikelihood of observing a readr can bewritten to bethe following function
of π and α:
L(π,α|r) =P(r|π,α,S) =
N
X
i=1
m
i
X
j=1
P(r,r←s
ij
|Θ,S) =
N
X
i=1
m
i
X
j=1
π
i
α
ij
δ(r→s
ij
)
L
ij
−|r|+1
Since the reads are independent, the likelihood function becomes
L(π,α|R) =
Y
r∈R
P(r|π,α,S) =
Y
r∈R
N
X
i=1
m
i
X
j=1
π
i
α
ij
δ(r→s
ij
)
L
ij
−|r|+1
. (2.2)
By replacing the Θ with π and α, the likelihood function has a clean formula and it is
sufficient to estimate all the π
i
’s and α
ij
’s to obtain the estimation for all theθ
ij
. In fact,
since we have that π
i
α
ij
=
θ
ij
(L
ij
−|r|+1)
P
N
k=1
P
m
k
m=1
θ
km
(L
km
−|r|+1)
and
P
N
i=1
P
m
i
j=1
θ
ij
= 1, if
we know the values of π
i
’s and α
i
’s, we have that
θ
ij
=
π
i
α
ij
/(L
ij
−|r|+1)
P
N
k=1
P
m
i
m=1
π
k
α
km
/(L
km
−|r|+1)
(2.3)
23
Themaximumlikelihoodestimationforπandαareobtainedbymaximizingthelikelihood
function in equation 2.2.
2.2.4 EM Algorithm
We develop an expectation-maximization (EM) algorithm to maximize the likelihood
function. The basic idea of the EM algorithm is that the observed data is supplemented
by some variables. The supplement makes the likelihood function of the complete data
(observed data and the supplemented data) have a relatively simpler formula than the
that of the observed data only. Then the estimation of the parameters is iteratively
updated and each iteration increases the likelihood function of the observed data. The
algorithm terminates when the value of the likelihood function converges.
Therearemanywaystochoosethesupplementedvariables. Hereweselectthenumber
of reads generated by each isoform. Let z
ij
be the actual number of reads generated by
theisoforms
ij
andZ ={z
ij
:i = 1,2,··· ,N;j = 1,2,··· ,m
i
}. Becausetheprobabilities
of an isoform generating different reads that are mapped to it are the same, the model
becomes a multinomial model after the number of reads generated by each isoform is
added. Therefore, thelikelihood functionof thecomplete dataD = (R,Z)can bewritten
as
L(π,α|R,Z) =
N
Y
i=1
m
i
Y
j=1
π
i
·α
ij
·
1
L
ij
−|r|+1
z
ij
. (2.4)
The log likelihood function of the complete data, therefore, can be written as
l(π,α|R,Z) =
N
X
i=1
m
i
X
j=1
z
ij
logπ
i
+
N
X
i=1
m
i
X
j=1
z
ij
logα
ij
+C, (2.5)
24
where C is a constant free of Θ.
The EM algorithm runs in an iterative way and each iteration consists of E-step and
M-step. In the E-step, based on the current estimation for π and α which are denoted as
π
(I)
and α
(I)
, the expectation E(z
ij
|R,π
(I)
,α
(I)
,S) is calculated. To do that, we define
another variable ξ
rij
which is 1 if read r is actually generated by s
ij
. Therefore, we have
that z
ij
=
P
r∈R
ξ
rij
and E(z
ij
|R,π,α,S) =
P
r∈R
P(ξ
rij
= 1) which can be calculated
in the following way:
P(ξ
rij
= 1|R,π,α,S) =P(r←s
ij
|r,π
(I)
,α
(I)
,S)
=
P(r←s
ij
,r|π
(I)
,α
(I)
,S)
P
N
k=1
P
m
k
m=1
P(r,r←s
km
|π
(I)
,α
(I)
,S)
=
P(r|r←s
ij
,π
(I)
,α
(I)
,S)P(r←s
ij
|π
(I)
,α
(I)
,S)
P
N
k=1
P
m
k
m=1
P(r|r←s
km
,π
(I)
,α
(I)
,S)P(r←s
km
|π
(I)
,α
(I)
,S)
=
π
(I)
i
α
(I)
ij
δ(r→s
ij
)/(L
ij
−|r|+1)
P
N
k=1
P
m
k
m=1
π
(I)
k
α
(I)
km
δ(r→s
km
)/(L
km
−|r|+1)
.
(2.6)
Therefore, we have that
E(z
ij
|R,π
(I)
,α
(I)
,S) =
X
r∈R
π
(I)
i
α
(I)
ij
δ(r→s
ij
)/(L
ij
−|r|+1)
P
N
k=1
P
m
k
m=1
π
(I)
k
α
(I)
km
δ(r→s
km
)/(L
km
−|r|+1)
. (2.7)
Let ˆ z
(I)
ij
= E(z
ij
|R,π
(I)
,α
(I)
,S) and then in the M-step, the estimation for π and α are
updated by the values that maximize the log likelihood function l(π,α|R,
ˆ
Z
(I)
) defined
in equation (2.5). This maximization should be accomplished under the constraint that
25
P
N
i=1
π
i
= 1 and for all i= 1,2,··· ,N, we have that
P
m
i
j=1
α
ij
= 1. Using the Lagarange
multiplier, we have the following close form for the updated π
(I+1)
and α
(I+1)
:
π
(I+1)
i
=
P
m
i
j=1
ˆ z
(I)
ij
P
N
k=1
P
m
k
m=1
ˆ z
(I)
km
(2.8)
α
(I+1)
ij
=
ˆ z
(I)
ij
P
m
i
m=1
ˆ z
(I)
im
. (2.9)
In summary, the EM algorithm runs in the following way:
1. Initialization: initialize every π
i
by π
(0)
i
which is the fraction of reads mapped to
isoforms of gene g
i
. To initialize the value for α, first set every α
(0)
ij
to 0. Then for
every read r∈R
i
, add 1/n
r
to the α
(0)
’s of the isoform it is mapped to where n
r
is
the total number of isoforms the read r is mapped to.
2. E-step: calculate the expectations ˆ z
(I)
ij
based on current estimation, π
(I)
and α
(I)
according to equation (2.7).
3. M-step: updatetheestimation byπ
(I+1)
andα
(I+1)
according toequation (2.8) and
(2.9).
4. Convergence: calculate the log of the likelihood function defined in equation (2.2)
based on the updated estimated parameters and compare it with that based on
the current estimated parameters. The difference between them is calculated as
ǫ= logL(π
(I+1)
,α
(I+1)
|R)−logL(π
(I)
,α
(I)
|R). Ifǫissmallthanacertainthreshold,
the algorithm is terminated. Otherwise, go to E-step again.
26
5. Transformation: with the estimated values for π and α, calculate θ
ij
’s according to
equation (2.3).
TheEM algorithm describedabove can consider both the uniquelymappedreads and
the ambiguous reads. However, when we only consider the uniquely mapped reads,
the EM algorithm can be much simpler. In this case, the set of reads R can be divided
into N parts, R
1
,R
2
,··· ,R
N
where R
i
includes the reads uniquely mapped to spliced
variants of gene g
i
. Under this simplification, the likelihood function in equation (2.2)
can be rewritten as
L(π,α|R) =
N
Y
i=1
Y
r∈R
i
π
i
·
m
i
X
j=1
α
ij
L
ij
−|r|+1
δ(r→s
ij
). (2.10)
The log likelihood function becomes
logL(π,α|R) =
N
X
i=1
C
i
logπ
i
+
N
X
i=1
X
r∈R
i
log
m
i
X
j=1
α
ij
L
ij
−|r|+1
·δ(r→s
ij
)
, (2.11)
where C
i
is the number of reads mapped to spliced variants of gene g
i
. Therefore, to
maximize the log likelihood function in equation (2.11), we have that ˆ π
i
=
C
i
|R|
, where
|R| is total number of reads in read set R. To obtain the maximum likelihood estimation
for α, we can utilize the EM algorithm again which maximizes the following function for
each gene g
i
:
L(α|R
i
) =
X
r∈R
i
log
m
i
X
j=1
α
ij
L
ij
−|r|+1
·δ(r→s
ij
)
. (2.12)
27
Again, here we add the missing variable z
ij
which is the number of reads in R
i
that are
generated by spliced variant s
ij
. Let z
i
= (z
i1
,z
i2
,··· ,z
im
i
), then the likelihood function
of the complete data (R,z) becomes
L(α|R
i
,z
i
) =
m
i
Y
j=1
α
ij
L
ij
−|r|+1
z
ij
.
The log likelihood function of the complete data can be written as
logL(α|R
i
,z
i
)=
m
i
X
j=1
z
ij
·log
α
ij
L
ij
−|r|+1
.
Thus, in the EM algorithm, the E-step calculates the expectation of the missingvariables
as the following
ˆ z
(I)
ij
=
X
r∈R
i
α
(I)
ij
·δ(r→s
ij
)/(L
ij
−|r|+1)
P
m
i
k=1
α
(I)
ik
·δ(r→s
ik
)/(L
ik
−|r|+1)
.
Then in the M-step, using the Lagarange multiplier method, the estimation of the pa-
rameters is updated as
ˆ α
(I+1)
ij
=
ˆ z
(I)
ij
|R
i
|
.
2.2.5 Confidence Interval of the Estimation
Other than the point estimation for the parameters, the confidence interval for the es-
timation is also necessary. The asymptotic covariance matrix of the MLE for θ is equal
28
to the inverse of the expected information matrix, which can be approximated by the
observed information matrix defined as
I(Θ;R) =−
∂
2
log(L(Θ|R))
∂Θ∂Θ
T
|
Θ=
ˆ
Θ
,
where
ˆ
Θ stands for the MLE from the EM algorithm [53]. The standard deviation of the
estimated expression level will then be calculated as the diagonal of the matrix.
2.2.6 Likelihood Ratio Test
Based on the estimated fraction of every isoform among all the molecules, it is important
to examine the following hypothesis:
1. Is a given isoform expressed or not?
2. Given two samples, does a given gene express its alternatively spliced variants with
different patterns?
3. Given two samples, is a given isoform differentially expressed?
Wetest thesehypothesisusingthelikelihood ratio testwhichcomparethemaximumlike-
lihood under the null hypothesis with that under the alternative hypothesis to determine
whether the null hypothesis can be rejected with certain confidence.
For a given isoform s
ij
, the statistical hypothesis of the first hypothesis testing is
H
0
:θ
ij
= 0↔H
1
:θ
ij
> 0. (2.13)
29
The likelihood ratio statistic is defined as
λ
1
ij
=−2log
L(Θ
0
|R)
L(Θ
1
|R)
, (2.14)
where Θ
0
= argmax
Θ:θ
ij
=0
L(Θ|R) and Θ
1
= argmax
Θ
L(Θ|R). Θ
0
again can be calculated
using the EM algorithm with the constraint that θ
ij
= 0. For every isoform, λ
1
ij
can
be calculated and compared with the chi-square distribution with 1 degree of freedom,
χ
2
(1).
If two sets of observed reads, R
s1
and R
s2
, from two samples are given, for a given
gene g
i
, the null hypothesis of the second hypothesis testing is
H
0
:∀j = 1,2,··· ,m
i
,θ
s1
ij
=θ
s2
ij
, (2.15)
where θ
s1
ij
and θ
s2
ij
are the fractions of mRNA s
ij
molecules in sample 1 and 2. Again, the
likelihood ratio statistic is defined as
λ
2
i
=−2log
L(Θ
0
|R)
L(Θ
s1
1
|R
s1
)L(Θ
s2
1
|R
s2
)
, (2.16)
where Θ
0
= argmax
Θ:∀j=1,2,···,m
i
,θ
s1
ij
=θ
s2
ij
L(Θ|R
s1
∪ R
s2
), Θ
s1
1
= argmax
Θ
L(Θ|R
s1
) and Θ
s2
1
=
argmax
Θ
L(Θ|R
s2
). For this testing, the likelihood ratio statistic is compared with the
chi-square distribution with m
i
−1 degrees of freedom.
For a given isoform s
ij
, the null hypothesis of the last hypothesis is
H
0
:θ
s1
ij
=θ
s2
ij
, (2.17)
30
Figure 2.1: The RNA-map of gene APOA1 from the ASPIC database.
where θ
s1
ij
and θ
s2
ij
are the fractions of mRNA s
ij
molecules in sample 1 and 2. The
likelihood ratio statistic is calculated as
λ
3
i
=−2log
L(Θ
0
|R)
L(Θ
s1
1
|R
s1
)L(Θ
s2
1
|R
s2
)
, (2.18)
whereΘ
0
= argmax
Θ:θ
s1
ij
=θ
s2
ij
L(Θ|R
s1
∪R
s2
),Θ
s1
1
= argmax
Θ
L(Θ|R
s1
)andΘ
s2
1
= argmax
Θ
L(Θ|R
s2
).
For this testing, the likelihood ratio statistic should be compared with the chi-square dis-
tribution with 1 degree of freedom.
2.3 Simulation Studies
To testify and demonstrate the robustness of our method, three simulation studies are
implemented. In the first simulation, we downloaded the sequences of the 8 spliced
variants for the gene named “APOA1” from ASPIC [16]. The structure of each spliced
variant is shown in Figure 2.1. The true relative expression levels are generated by first
obtaining 8independentuniformlydistributed(uniform[0, 1]) randomnumbersandthen
normalizing them so that the normalized relative expression levels add up to 1. For each
31
given total number of reads (100, 500, 1000, 3000, 5000, 7000, 9000 and 11000), 100 sets
of reads are sampled from the sequences of the 8 spliced variants. For each read, the
spliced variant from which it is generated is randomly chosen from the 8 spliced variants
accordingtotheirtruerelativeexpressionlevelsmultipliedbytheirlengths. Thentheread
is uniformlysampled fromthesequenceof thechosen splicedvariant. We applyTranSEQ
to allthesimulated sets of reads independently. For each given total numberof reads, the
average and standard deviation of the absolute difference between the estimated relative
expression levels and the true values (absolute error) are calculated (Figure 2.2). The
ratio between the absolute error and the true expression levels is defined as the absolute
error rate. The average and standard deviation of it is calculated as well (Figure 2.2).
Both the absolute error and the absolute error rate drop dramatically when more reads
are sampled. When the number of reads increases to 1000, the slope of them becomes
small and close to 0 which means that they do not change significantly when the number
of reads changes. The standard deviations of both the absolute error and the absolute
errorrateareveryclosetothemeans. Theresultsalsobecomestablewhenthenumberof
reads increases to 1000. In real data, however, not every spliced variant is expressed, and
also the number of spliced variants may be different. Therefore, we may need a smaller
number of reads than what is shown in Figure 2.2 for the error to be small.
In the second simulation, we use the same spliced variants as those in the first sim-
ulation. However, only two of them are expressed with certain expression ratio between
them. The two expressed spliced variants are shown as ”Tr1”’ and ”Tr2” in Figure 2.1.
The log (base 2) expression ratio between them (log
2
(Tr1/Tr2) ) is set to 0,2,4 and 6.
Given the expression level for each spliced variant, 100 sets of reads are sampled from
32
0 2000 6000
0.01 0.03 0.05 0.07
Average AE
Number of Reads
(SD of) Average Absolute Error
AE
SD of AE
0 2000 6000
0.5 1.0 1.5 2.0 2.5
Average AE Rate
Number of Reads
(SD of) Average Absolute Error Rate
AE rate
SD of AE rate
Figure 2.2: The accuracy of estimated relative expression levels for the first simulated
dataset. Theaverageabsoluteerror(AE)anditsstandarddeviation(SD)oftheestimated
relative expression levels are shown in the left panel. The average absolute error (AE)
rate and its standard deviation (SD) of the estimated relative expression levels are shown
in the right panel.
the variants in the same way as the first simulation. After we apply TranSEQ to each
simulated set of reads, the mean and standard deviation of the absolute error rate for
the expression ratios of Tr1 over Tr2 are calculated. In addition, we also calculate the
absolute error for the estimated expression levels of the 6 unexpressed spliced variants.
For the unexpressed spliced variants (Figure 2.3cd), both the mean and standard devia-
tion of the absolute error drops close to 0 when there are 500 reads suggesting that the
expression levels of the unexpressed spliced variants can be estimated very accurately.
Meanwhile for the expression ratios of Tr1 and Tr2 (Figure 2.3ab), both the mean and
the standard deviation of the absolute error rates for the expression ratio increase when
the expression ratio of Tr1 over Tr2 increases.
The motivation for the third simulation is that during the reverse transcription of
the mRNA to cDNA, the mRNAs are degraded which causes a decreasing number of
33
reads on the 5’ end. To investigate the effect of the mRNA degradation on TranSEQ, we
simulatethesequencesof5exonsdenotedas A,B,C,DandEthatareeach 250bpslong.
Taking them as the 5 exons of a gene from 5’ end to 3’ end, we generate three spliced
variants that consist of exons ADE, BDE and CDE. Only the variant ADE is expressed,
and the exons D and E generate more reads than exon A does to simulate the mRNA
degradation. The ratio between the number of reads sampled from DE and A is set to
be 1,2,4,6,8 and 10. Reads are uniformly sampled from the exons. Again, for each given
total numberof reads, we calculate the absolute error of the estimated relative expression
levels for the two unexpressed spliced variants BDE and CDE when different proportions
ofreads aresampledfromexonsDandE.Thedistributionof theerror(Figure2.4) shows
that the increasing mRNA degradation does cause the error to increase. However, even
when there are only 100 reads in total and exons D and E generate 10 times more reads
than exon A, the error in the prediction is smaller than 5e-10. As expected, this effect
decreases when the total number of reads increases. Fewer outliers are observed as well.
2.4 Applications to Real Data
2.4.1 Data
The mRNA-seq data from a previous study [52] is reanalyzed in this paper. The ex-
periment is briefly summarized as follows. Total RNA from the liver and kidney tissue
samples from one human male was extracted. Poly(A) mRNA was purified and sheared
prior to cDNA synthesis. The resulting cDNAs were processed into libraries of template
34
molecules and sequenced on the Illumina Genome Analyzer. Illumina’s sequencing tech-
nology simultaneously sequenced millions of short reads (32 bps in this experiment) of
the template molecules. Samples from both the liver and the kidney were sequenced at
two different concentrations (3pM and 1.5pM). For details of the experiment, see [52].
2.4.2 Mapping
TheAceViewdatabaseprovidesstrictlycDNA-supportedhumantranscriptomeandgenes
byco-aligningallquality-filtered humancDNAdatafromGeneBank, dbESTandtheRef-
Seq to the human genome. We download all the human mRNA sequences on autosomal
chromosomes (including UTR regions) from the AceView Human database, Apr07 re-
lease, excluding the clouds genes that do not have strong evidence of existence. Reads
are mapped onto these mRNA sequences by a program called RMAP [70], allowing up to
two mismatches and no insertion or deletion (Table 2.1). Among the 66 million reads ob-
tained from thekidneysampleat 3pMconcentration, about42% can bemappedonto the
mRNA sequences, while only 27% are mapped to unique genes (a read may be mapped
to multiple spliced variants but they belong to one gene). For the liver sample, the total
number of reads at 3pM concentration is 69 million, and the percentages are 40% for the
mapped reads and 21% for the reads mapped to unique genes. Combining both datasets,
23.9% of the reads are mapped to unique genes. Note that the previous study [52] ap-
plied a different mapping strategy to the same datasets by first mapping reads onto the
human genome with up to two mismatches and then extracting reads mapped to unique
locations intheexons. Thefraction oftheuniquelymappedreadswas 22.5%, slightly less
than the 23.9% we mapped in this paper. We carefully examine the difference between
35
Kidney Sample
R1L1 R1L3 R1L7 R2L2 R2L6
Number of
Original Reads
13017169 13401343 12848201 13687929 13449864
Number of
Mapped Reads
5399656 5584626 5249243 5756229 5911408
Number of Uniquely
Mapped Reads
3482151 3591935 3375113 3694220 3784909
Liver Sample
R1L2 R1L4 R1L6 R1L8 R2L3
Number of
Original Reads
14003322 14230879 13525355 13096715 14761931
Number of
Mapped Reads
5558586 5576879 5388618 5172572 5808997
Number of Uniquely
Mapped Reads
3447772 3457236 3338951 3210133 3579122
Table2.1: Mappingresultsof10lanes (5forthehumankidneyand5forthehumanliver)
of Solexa mRNA-seq data. For each lane, the table lists the number of original reads, the
number of reads mapped to mRNA transcripts annotated in AceView by RMAP (a read
may be mapped to multiple genes), and the number of reads mapped to unique genes (a
read may be mapped to multiple spliced variants of the same gene).
the two mapping strategies. All reads mapped to unique exons in [52] are included in
our mapped read set. In addition, the mapped reads in this study include the following
two sets: 1) reads that are mapped across two known adjacent exons by our mapping
strategyand2)readsthataremappedtomultiplelocations inthepreviousstudybecause
of the inclusion of the intronic and intergenic regions in the genome-mapping strategy.
Among all the reads in these two sets, the average percentage of reads from the latter set
across thereplicate lanes is81.1%/80.5% and86.4%/85.7% for kidneyandliver underthe
3pM/1.5pM concentration, respectively. This analysis shows that the mappingprocedure
used in this paper is accurate.
36
2.4.3 The Maximum Likelihood Approach
The fraction of each mRNA isoform molecules is estimated using the EM algorithm de-
scribed in section 2.2. The variances of the estimated expression levels of the spliced
variants are calculated using the observed Fisher information matrix [24]. For each iso-
form, we determine whether it is expressed or not by calculating the p-value for the null
hypothesisof noexpressionusingthelikelihood ratiostatistic describedinsection 2.2. To
determine whether the mRNA isoform is differentially expressed between the two tissues,
we use the likelihood ratio statistic (section 2.2) again to test the null hypothesis that
the variant is expressed at the same level between the two tissues. False discovery rate
(FDR) is estimated as in [9] to adjust for multiple testing errors.
2.4.4 Consistency of MLE results across lanes
Both the kidney and the liver samples were sequenced in five replicate lanes at 3pM
concentration and two replicate lanes at 1.5pM concentration. For each lane, we only
consider genes uniquely mapped to by at least 100 reads (highly-expressed genes) and
apply TranSEQto estimate the relative expression levels of the spliced variants. For each
tissue sample, we extract a common set of highly-expressed genes across all the replicate
lanes at the same concentration level and calculate the average Pearson correlation coef-
ficient (PCC) of the estimated relative expression levels between any two replicate lanes
across genes. The results show high consistency between replicate lanes (Table 2.2). At
the 3pM concentration, the average PCC across all pairs of lanes is 0.845 for the kidney
sample and 0.899 for the liver sample. At the 1.5pM concentration, the average PCC is
0.832 for the kidney sample and 0.891 for the liver sample.
37
Kidney Liver
R1L1:R1L3 0.844282535926633 R1L2:R1L4 0.895548908616798
R1L1:R1L7 0.843982017691902 R1L2:R1L6 0.90152906396178
R1L1:R2L2 0.845619991122307 R1L2:R1L8 0.902354425441016
R1L1:R2L6 0.845363769235202 R1L2:R2L3 0.899360749619289
R1L3:R1L7 0.8435497360318 R1L4:R1L6 0.89562651089979
R1L3:R2L2 0.843214687306855 R1L4:R1L8 0.899370211516938
R1L3:R2L6 0.84572026950366 R1L4:R2L3 0.898641769377957
R1L7:R2L2 0.842783399176404 R1L6:R1L8 0.902939547012783
R1L7:R2L6 0.845649589014566 R1L6:R2L3 0.898455794215281
R2L2:R2L6 0.846731741859942 R1L8:R2L3 0.89890288387888
R2L4:R2L8 0.832297868518725 R2L1:R2L7 0.891348472558054
Table 2.2: The average Pearson correlation coefficients between the estimated relative
expression levels of spliced variants from the five different lanes that sequenced the same
kidney and liver sample at 3pM concentration (first ten rows) and 1.5pM (the last row).
Each row corresponds to one pair of replicate lanes under the same concentration level.
The last row is the replicate lane pair comparison from 1.5pM.
We further examine the MLE results between the two concentration levels, 3pM and
1.5pM. We combine reads from the replicate lanes at the same concentration level into
one set and repeat the above procedure. The average PCC is 0.885 in the kidney sample
and 0.913 in the liver sample. The results show that both the mapping procedure and
the MLE method are robust across different lanes, and the read sets from all lanes are
of similar high quality. Therefore, in the following study, we combine reads from the
five replicate lanes at the 3pM concentration into one read set and focus on the highly-
expressed genes.
2.4.5 Genes with multiple expressed spliced variants
We extract 12,391 and 11,040 highly-expressed genes in the kidney and liver samples,
respectively. We filter out those genes with only one annotated transcript, leaving 11,329
38
and 9,734 genes with multiple spliced variants for the kidney and the liver samples, re-
spectively. For each of these genes, we applyTranSEQ to estimate the relative expression
levels of its annotated spliced variants and assess the significance of the presence of each
variant in the tissue. With false discovery rate FDR<0.0001, we identify 10,822 (95.5%)
and 9,221 (94.6%) genes with at least two variants present (significantly expressed) for
thekidneyandfortheliver(Fig. 1a), respectively, suggestingthatthealternative splicing
events are extensive. These numbers are higher than the 74% estimated by a previous
study [39], even though they used 52 tissues and cell lines.
Reports from two recent studies showed that 95% of the genes with multiple exons
are alternatively spliced in 6 different human tissues [63], and 92%-94% of these genes
are alternatively spliced in 15 diverse human tissues and cell lines [63,87]. Our results by
TranSEQ on the kidney and the liver data are consistent with these reports: 94.4% or
10,805outof11,451highly-expressedgeneswithmultipleexonsinkidneyarealternatively
spliced, and 93.7% or 9,200 out of 9,823 highly-expressed genes with multiple exons in
the liver are alternatively spliced.
Among the 10,822 and 9,221 alternatively spliced genes in kidney and liver, 73.4%
(7,939/10,822) and 72.3% (6,666/9,221) express at least two spliced variants with expres-
sion levels of 15% or higher, respectively. The average number of significantly expressed
spliced variants per gene is 5.9 and 5.5 for the kidney and the liver sample, respectively
(Figure 2.5).
The estimated expression levels of the spliced variants of the human gene NINJ1 are
shown in Figure 2.6. The human gene NINJ1 encodes the nerve injury-induced protein 1
which is a homophilic cell adhesion molecule that promotes axonal growth. It may play
39
a role in nerve regeneration and in the formation and function of other tissues. Figure
2.6a and Figure 2.6b show the mapping results of NINJ1 in the kidney and the liver,
respectively. The reads mapped to the exonic regions unique to the first two variants
support their expression in both tissues. The gene structures for NINJ1.a and NINJ1.b
are similar except that (1) the middle exon of NINJ1.b partially overlaps the two middle
exonsinNINJ1.aand(2)theexonatthe5’endofNINJ1.aislongerthanthatofNINJ1.b.
Thecoverage of reads mappedto the uniqueexonic region of NINJ1.b is much lower than
that of the overlapped exonic regions shared by NINJ1.a and NINJ1.b. Therefore, the
expression level of NINJ1.b should be significantly lower than that of NINJ1.a. The
results of TranSEQ are consistent with this observation and show that, in both tissues,
NINJ1.a has an expression level of 94% or higher (Figure 2.6). These results are also
consistent with the number of cDNA sequences (15 in kidney and 20 in liver) aligned
for NINJ1 in AceView. However, in liver, NINJ1.b has a slightly higher expression level
compared with the kidney sample because its unique exonic region is mapped to by a
larger number of reads. The presence of NINJ1.b in liver is also supported by one cDNA
sequence shown in AceView (Table 2.3). This example confirms that our MLE method
is capable of accurately determining the relative expression levels of spliced variants.
Gene Names Variant Names AK TK AL TL
HEMK1
HEMK1.aApr07 0 (0.0%) 5.8e-87 0 (0.0%) 0
HEMK1.bApr07 7 (50.0%) 0.021 2 (40%) 0
HEMK1.cApr07 4 (28.6%) 0.14 3 (60%) 0.13
HEMK1.dApr07 0 (0.0%) 2.0e-11 0 (0.0%) 0
HEMK1.eApr07 0 (0.0%) 0.009 0 (0.0%) 0.051
HEMK1.fApr07 1 (7.1%) 0.086 0 (0.0%) 0.01
HEMK1.gApr07 0 (0.0%) 0.027 0 (0.0%) 1.2e-8
HEMK1.hApr07 1 (7.1%) 0.029 0 (0.0%) 0.013
HEMK1.iApr07 0 (0.0%) 0.12 0 (0.0%) 0.16
HEMK1.jApr07 0 (0.0%) 0.2 0 (0.0%) 0.25
40
HEMK1.kApr07 0 (0.0%) 7.0e-181 0 (0.0%) 0
HEMK1.lApr07-unspliced 1 (7.1%) 0.19 0 (0.0%) 0.13
HEMK1.mApr07 0 (0.0%) 0.17 0 (0.0%) 0.22
HEMK1.nApr07-unspliced 0 (0.0%) 0.0063 0 (0.0%) 0.038
SH3BP5L
SH3BP5L.aApr07 7 (77.8%) 0.39 10 (90.9%) 0.38
SH3BP5L.bApr07 0 (0.0%) 0.14 0 (0.0%) 0.28
SH3BP5L.cApr07 2 (22.2%) 0.21 0 (0.0%) 0.16
SH3BP5L.dApr07 0 (0.0%) 1.9e-132 0 (0.0%) 0
SH3BP5L.eApr07 0 (0.0%) 1.2e-157 0 (0.0%) 0
SH3BP5L.fApr07 0 (0.0%) 0.18 0 (0.0%) 1.9e-110
SH3BP5L.gApr07 0 (0.0%) 0.081 1 (9.1%) 0.18
SH3BP5L.hApr07 0 (0.0%) 9.0e-13 0 (0.0%) 9.0e-95
TUBA1C
TUBA1C.aApr07 0 (0.0%) 3.8e-10 0 (0.0%) 1.2e-11
TUBA1C.bApr07 3 (75.0%) 0.91 7 (70.0%) 0.91
TUBA1C.cApr07 0 (0.0%) 0.011 0 (0.0%) 6.9e-10
TUBA1C.dApr07 0 (0.0%) 8.2e-58 0 (0.0%) 7.6e-288
TUBA1C.eApr07 0 (0.0%) 1.8e-58 0 (0.0%) 5.9e-259
TUBA1C.fApr07 0 (0.0%) 0.039 0 (0.0%) 0.011
TUBA1C.gApr07-unspliced 0 (0.0%) 0.014 0 (0.0%) 0.031
TUBA1C.hApr07-unspliced 1 (25.0%) 0.028 3 (30.0%) 0.028
TUBA1C.iApr07-unspliced 0 (0.0%) 6.2e-88 0 (0.0%) 0.017
TUBA1C.jApr07 0 (0.0%) 1.7e-73 0 (0.0%) 0
GNL2
GNL2.aApr07 2 (40.0%) 0.64 1 (33.3%) 0.4
GNL2.bApr07 0 (0.0%) 0.0068 0 (0.0%) 0.014
GNL2.cApr07 0 (0.0%) 0.049 1 (33.3%) 0.08
GNL2.dApr07 0 (0.0%) 0.099 0 (0.0%) 0.11
GNL2.eApr07 1 (20.0%) 1.4e-67 0 (0.0%) 4.8e-19
GNL2.fApr07 0 (0.0%) 0.048 0 (0.0%) 0.31
GNL2.gApr07 0 (0.0%) 0.084 0 (0.0%) 0.082
GNL2.hApr07 0 (0.0%) 0.031 1 (33.3%) 5.0e-89
GNL2.iApr07 0 (0.0%) 0.041 0 (0.0%) 6.9e-27
GNL2.jApr07 2 (40.0%) 1.4e-17 0 (0.0%) 4.5e-23
NINJ1
NINJ1.aApr07 15 (100%) 0.97 19 (95.0%) 0.94
NINJ1.bApr07 0 (0.0%) 0.03 1 (5.0%) 0.062
NINJ1.cApr07 0 (0.0%) 9.7e-11 0 (0.0%) 1.0e-14
NINJ1.dApr07 0 (0.0%) 2.2e-10 0 (0.0%) 2.5e-10
NINJ1.eApr07-unspliced 0 (0.0%) 5.2e-78 0 (0.0%) 2.5e-80
ADCY6
ADCY6.aApr07 1 (14.3%) 0.23 1 (10.0%) 0.20
ADCY6.bApr07 1 (14.3%) 0.26 1 (10.0%) 9.0e-10
ADCY6.cApr07 5 (71.4%) 0.33 8 (80.0%) 0.70
ADCY6.dApr07 0 (0.0%) 1.2e-15 0 (0.0%) 4.2e-108
ADCY6.eApr07 0 (0.0%) 0.11 0 (0.0%) 0.085
ADCY6.fApr07 0 (0.0%) 9.9e-32 0 (0.0%) 5.2e-109
ADCY6.gApr07-unspliced 0 (0.0%) 0.032 0 (0.0%) 0.014
41
ADCY6.hApr07-unspliced 0 (0.0%) 0.035 0 (0.0%) 7.7e-44
ZDHHC2
ZDHHC2.aApr07 15 (100%) 0.19 9 (100%) 0.22
ZDHHC2.bApr07 0 (0.0%) 2.9e-28 0 (0.0%) 4.2e-25
ZDHHC2.cApr07 0 (0.0%) 0.16 0 (0.0%) 1.2e-9
ZDHHC2.dApr07 0 (0.0%) 0.028 0 (0.0%) 3.9e-16
ZDHHC2.eApr07 0 (0.0%) 0.59 0 (0.0%) 0.5
ZDHHC2.fApr07-unspliced 0 (0.0%) 0.031 0 (0.0%) 0.28
NEBL
NEBL.aApr07 1 (33.3%) 0.16 1 (12.5%) 0.18
NEBL.bApr07 0 (0.0%) 0.013 1 (12.5%) 0.033
NEBL.cApr07 0 (0.0%) 0.0086 0 (0.0%) 2.4e-51
NEBL.dApr07 0 (0.0%) 3.7e-61 0 (0.0%) 1.8e-13
NEBL.eApr07 0 (0.0%) 3.2e-30 0 (0.0%) 8.9e-26
NEBL.fApr07 2 (66.6%) 0.65 6 (75.0%) 0.2
NEBL.gApr07 0 (0.0%) 8.6e-16 0 (0.0%) 4.6e-38
NEBL.hApr07 0 (0.0%) 2.8e-9 0 (0.0%) 1.3e-9
NEBL.iApr07 0 (0.0%) 0.025 0 (0.0%) 0
NEBL.jApr07 0 (0.0%) 3.5e-13 0 (0.0%) 1.5e-34
NEBL.kApr07 0 (0.0%) 3.2e-22 0 (0.0%) 0.24
NEBL.lApr07 0 (0.0%) 5.3e-20 0 (0.0%) 0
NEBL.mApr07 0 (0.0%) 9.4e-33 0 (0.0%) 2.6e-28
NEBL.nApr07 0 (0.0%) 0.016 0 (0.0%) 0
NEBL.oApr07 0 (0.0%) 0.013 0 (0.0%) 0
NEBL.pApr07 0 (0.0%) 0.018 0 (0.0%) 0
NEBL.qApr07-unspliced 0 (0.0%) 0.024 0 (0.0%) 0.067
NEBL.rApr07 0 (0.0%) 0.02 0 (0.0%) 3.6e-61
NEBL.sApr07-unspliced 0 (0.0%) 0.032 0 (0.0%) 0.27
NEBL.tApr07 0 (0.0%) 2.7e-23 0 (0.0%) 0
NEBL.uApr07-unspliced 0 (0.0%) 0.022 0 (0.0%) 0
FMR1
FMR1.aApr07 1 (4.8%) 0.054 1 (33.3%) 0.27
FMR1.bApr07 0 (0.0%) 0.33 0 (0.0%) 0.22
FMR1.cApr07 20 (95.2%) 0.0018 1 (33.3%) 0.052
FMR1.dApr07 0 (0.0%) 0.32 0 (0.0%) 0.11
FMR1.eApr07 0 (0.0%) 0.033 1 (33.3%) 0.15
FMR1.fApr07 0 (0.0%) 0.048 0 (0.0%) 0.084
FMR1.gApr07 0 (0.0%) 0.13 0 (0.0%) 9.4e-20
FMR1.hApr07 0 (0.0%) 1.8e-162 0 (0.0%) 9.3e-212
FMR1.iApr07 0 (0.0%) 1.3e-8 0 (0.0%) 4.5e-131
FMR1.jApr07 0 (0.0%) 0.06 0 (0.0%) 0.067
FMR1.kApr07 0 (0.0%) 4.4e-196 0 (0.0%) 5.9e-273
FMR1.lApr07 0 (0.0%) 0.003 0 (0.0%) 0.03
FMR1.mApr07 0 (0.0%) 0.02 0 (0.0%) 0.016
SMG6.aApr07 5 (100%) 0.22 5 (41.7%) 0.14
SMG6.bApr07 0 (0.0%) 0.002 0 (0.0%) 0.0046
SMG6.cApr07 0 (0.0%) 0.0024 0 (0.0%) 0.0028
SMG6.dApr07 0 (0.0%) 1.6e-23 0 (0.0%) 3.2e-60
42
SMG6.eApr07 0 (0.0%) 0.12 0 (0.0%) 0.0069
SMG6.fApr07 0 (0.0%) 0.16 0 (0.0%) 0.03
SMG6.gApr07 0 (0.0%) 4.5e-11 0 (0.0%) 5.1e-85
SMG6.hApr07 0 (0.0%) 5.6e-51 0 (0.0%) 2.3e-198
SMG6.iApr07 0 (0.0%) 3.3e-52 0 (0.0%) 3.4e-205
SMG6.jApr07-unspliced 0 (0.0%) 0.11 0 (0.0%) 0.22
SMG6.kApr07 0 (0.0%) 0.037 0 (0.0%) 2.2e-72
SMG6.lApr07 0 (0.0%) 0.13 2 (16.7%) 0.14
SMG6.mApr07 0 (0.0%) 7.4e-10 0 (0.0%) 2.7e-101
SMG6.nApr07 0 (0.0%) 6.3e-22 0 (0.0%) 5.9e-147
SMG6.oApr07 0 (0.0%) 0.05 0 (0.0%) 2.1e-86
SMG6.pApr07 0 (0.0%) 8.5e-45 0 (0.0%) 1.2e-147
SMG6.qApr07 0 (0.0%) 1.1e-13 0 (0.0%) 1.9e-120
SMG6.rApr07 0 (0.0%) 5.7e-15 0 (0.0%) 0.07
SMG6.sApr07-unspliced 0 (0.0%) 0.034 3 (25.0%) 0.11
SMG6.tApr07-unspliced 0 (0.0%) 0.042 2 (16.7%) 0.072
SMG6.uApr07 0 (0.0%) 6.3e-14 0 (0.0%) 6.1e-206
SMG6.vaApr07-unspliced 0 (0.0%) 0 0 (0.0%) 0
SMG6.vbApr07-unspliced 0 (0.0%) 0.033 0 (0.0%) 0.18
SMG6.vcApr07-unspliced 0 (0.0%) 0.068 0 (0.0%) 0.023
SMG6.vdApr07 0 (0.0%) 9.9e-69 0 (0.0%) 0
Table 2.3: The list of the randomly selected 10 genes as well
ascomparisonbetweenthecDNAsupportsandtheTranSEQ
estimation foreach oftheirannotatedsplicedvariantsinAce-
View. Among the last four columns, the counts and the frac-
tions of cDNAs from AceView (A) that supportthe presence
of each variant in kidney (K) or liver (L) are compared with
their estimated expression levels by TranSEQ (T).
2.4.6 Effects of gene annotations
When compared with the 1,250 alternatively spliced genes in mouse liver reported by a
previous study [57], the results by TranSEQ show far more extensive alternative splicing
43
events in human liver. The difference arises from the fact that this previous report
used the UCSC Mouse Genome annotation (July, 2007) (mm9), which collects a smaller
number of alternatively spliced variants than AceView. Such difference can further be
illustrated by the human gene NINJ1 for which the UCSC genome browser annotates
only two spliced variants, both almost identical to NINJ1.a (Figure 2.6). However, the
existence of NINJ1.b is clearly supported by the reads mapped to its unique exonic
region. The annotations in AceView are based on mRNAs from GenBank or RefSeq,
withsingle-pass cDNA sequences fromdbEST,andTrace-filtered withastringent quality
control. TheaccuracyofAceViewhasbeencomparedwiththoseofothergeneannotation
databases andprogramsusingabenchmarkdataset, specifically, the44ENCODEregions
that were carefully annotated by the Sanger Institute [80]. The results showed that
AceView reported three to four times more transcripts (sensitivity 84%) than UCSC
Known Genes (sensitivity 28%), RefSeq (sensitivity 21%) or Ensembl (sensitivity 19%).
Therefore, we believe that AceView’s gene annotation is of high quality and that, as a
result, the predictions of TranSEQ based on AceView are more accurate.
2.4.7 Differentiallyexpressedvariantsbetweenthekidneyandtheliver
Todetectthedifferentiallyexpressedsplicedvariantsbetweenkidneyandliver, weextract
9,111 genes that are highly expressed in both tissues and have multiple annotated spliced
variants. A total of 128,643 annotated spliced variants are examined for differential
expression between the kidney and the liver using the likelihood ratio test. For each
spliced variant, the likelihood ratio test calculates a p-value to assess the significance
of the hypothesis that the spliced variant is differentially expressed between kidney and
44
liver. Small p values suggest differential expression of the corresponding variant between
the two tissues. We also calculate the absolute difference between the two estimated
expression levels for each variant in the two tissues. The Pearson correlation coefficient
betweentheabsolutedifferenceandthepvaluesacrossallthesplicedvariantsis-0.54. By
controlling FDR<0.0001, we obtain 4,628 (3.6%) differentially expressed spliced variants
belonging to 2,811 genes (30.9% of the 9,111 genes).
For all the 4,628 differentially expressed variants, we also examine their p values,
assessing whether they are present in kidney and liver. The number of spliced variants
expressed in each tissue is counted (Figure 2.7a). The results show that 3,709 variants
are expressed in both tissues, while only 533 and 366 variants are exclusively expressed
in kidney and liver, respectively.
2.4.8 Differentially spliced genes between the kidney and the liver
In addition to the differentially expressed spliced variants, we also detect genes that are
differentially spliced between kidney and liver. There are 8,535 genes that expressed at
least 2 spliced variants in both kidney and liver. For each of these genes, we employ
the likelihood ratio test again to assess the significance of the difference in the relative
expression levels of their spliced variants between the two tissues. The null hypothesis is
that the relative expression levels of all the spliced variants for a given gene are the same
in the two tissues. The alternative hypothesis is that the relative expression levels of all
thespliced variantsforthegiven genearenotnecessarilythesameinthetwo tissues. The
likelihood ratio test calculates the ratio of the maximum likelihood of the observed reads
in the two tissues under the null hypothesis over that under the alternative hypothesis.
45
As a result, 4,312 (50.5%) genes have a p-value smaller than 0.001 (Figure 2.7b). Even
after the Bonferroni correction, which has been proven to be very conservative, there are
still 2,361 (27.7%) genes which have adjusted p-values smaller than 0.001. The result
is consistent with the conclusion from the above analysis on the differentially expressed
splicedvariants. Thelargenumberofgenes havingsignificantly differentexpressionlevels
of their spliced variants in the two tissues by the likelihood ratio test indicates that the
differencesinthetranscriptomesofthetwotissuesaremainlyattributedtothedifferential
expression of the transcripts which are present in both tissues.
2.4.9 cDNAsupportsforthedifferentiallyexpressedvariantsanddifferentially
spliced genes
Since AceView lists all the cDNA sequences that support the presence of each spliced
variant in various human tissues, we randomly select 10 genes (Table 2.3) that are highly
expressed in both the kidney and the liver and have multiple annotated spliced variants.
For each tissue, the expressed variants predicted by TranSEQ for these 10 genes are
compared with those having cDNA support derived from the corresponding tissue from
AceView. Theresults areshownin Figure2.8. Among the121 annotated spliced variants
for the 10 selected genes, 62 and 50 are predicted to be expressed in kidney and in liver
by TranSEQ, respectively. In contrast, AceView shows cDNA supports for 21 variants in
kidney and 25 variants in liver. The overlaps, 18 (85.7% of 21) in kidney and 20 (80%
of 25) in liver, are predicted to be expressed by TranSEQ with an average expression
level of 29.7% and 26.2%, respectively. For the rest of the 44 variants in kidney and the
30 variants in liver that are predicted to be expressed, the average estimated expression
46
levels aremerely8.1% and9.8%, respectively. ThecDNA-supportedsplicedvariants have
significantly higher (p-value<0.005 using the Wilcoxon rank-sum test) estimated relative
expression levels than those without cDNA supports in both tissues. The one-sided p-
valuebyWilcoxonrank-sumtestis1.2e-4inkidneyand3.2e-3inliver. Thisindicatesthat
the lack of cDNA support for some of the predicted expressed spliced variants in kidney
and in liver most likely results from low expression in the cell. The distribution of the
expression levels for the two groups of spliced variants shows that the relative expression
levels of most of the spliced variants are very low (Figure 2.9). The distribution of the
predicted expression levels of all the spliced variants by TranSEQ (Figure 2.10) supports
this conclusion.
Furthermore, for each spliced variant, we calculate its relative expression level by
calculating the fraction of its cDNA supports in the tissue among all the cDNAs for the
gene in the tissue. The calculated cDNA-based expression levels are compared with the
estimations by TranSEQ (Table 2.3). For each gene, we calculate the Pearson correlation
between the estimated expression levels by TranSEQ and the cDNA-based expression
levels. The average Pearson correlation is 0.52 for the kidney and 0.55 for the liver. This
shows that the estimated expression levels by TranSEQ are accurate.
Finally, we examine the p-values for all the 121 spliced variants of these 10 genes to
assess whether they are differentially expressed between the kidney and the liver. By
controlling FDR<0.0001, we found 6 spliced variants (ADCY6.bApr07, ADCY6.cApr07,
FMR1.eApr07, NEBL.fApr07, SMG6.sApr07-unspliced and ZDHHC2.fApr07-unspliced)
that aresignificantly differentially expressedin thetwo tissues according to thelikelihood
47
ratio test. For 4 of them (ADCY6.cApr07, FMR1.eApr07, NEBL.fApr07, SMG6.sApr07-
unspliced), the number of cDNA sequences for them in kidney is different from that in
liver.
2.4.10 Discussion
We have developed a statistical method based on the maximum likelihood estimation,
TranSEQ,bywhichwecanestimate therelative expressionlevels oftheannotated spliced
variants for any given gene using the mRNA-seq data. Applications to both simulated
datasets and experimental data show that TranSEQ can accurately estimate the relative
expression levels for spliced variants.
There are, however, several factors that may affect the accuracy of the results. The
number of mapped reads is one of them. To examine the effect of the number of the
mapped reads, we extract a set of genes that have exactly 4 annotated spliced variants
in AceView and are mapped to by 800-1,000 reads in each tissue. There are 24 and 12
such genes in kidney and liver, respectively. In each tissue, we construct 7 sets of reads
for each gene by randomly sampling 200, 300, 400, 500, 600, 700 and 800 reads from the
mapped reads. We apply TranSEQ to each set to predict the expressed spliced variants
with FDR¡0.0001. This process is repeated 100 times to calculate the average fractions
of genes that have 0, 1, 2, 3 and 4 expressed spliced variants (Figure 2.11). The results
showthatthenumberofexpressedsplicedvariantsdoesincreasewiththenumberofreads
(Figure 2.11ab). As the number of reads for a given gene increases, the probability of
rare variants being detected will also increase, causing the number of expressed variants
identifiedbyFDRtoincrease. Ontheotherhand,thenumberofhighlyexpressedvariants
48
does not strongly depend on the number of reads, as the number of reads above a certain
threshold will consistently detect these variants. To test this hypothesis, we calculate
the fractions of genes that have 0,1,2,3 and 4 expressed variants with estimated relative
expression levels higher than or equal to 0.05. The results are given in Figure 2.11cd.
The number of expressed spliced variants is much more stable when the number of reads
increases in both tissues. The same analysis is applied to genes with a different number
of annotated spliced variants and mapped reads. We observe a similar pattern (Figure
2.12). The threshold of the number of mapped reads required to detect spliced variants
with low expression depends upon many factors, such as the gene structure, the number
of annotated spliced variants, or the gene expression level. Investigating the effects of
these factors on the threshold of the number of mapped reads for each gene is a topic for
the further studies.
Although we have shown that TranSEQ can accurately estimate the expression levels
of theannotated spliced variants in AceView, there are still many ways in which it can be
improved. First, the current model assumes that the reads are uniformly sampled from
the expressed variants. However, in the experimental data we observed, the base-level
coverage of exons is uneven. Such bias may be derived from multiple sources, including
the GC contents, the distance to the 5’ end or 3’ end of the transcripts, sequencing
errors, and incomplete and inaccurate annotations in the AceView database. Second,
a large number of reads that are mapped to multiple genes are ignored in this study.
Although considering these reads may introduce ambiguity caused by sequencing errors,
single nucleotide polymorphisms (SNPs) and repetitive regions in the genome, the large
number of such reads makes the consideration appealing. In fact, TranSEQ has the
49
potential to include the ambiguous reads in the model. Third, the annotations of mRNA
sequencesintheAceView databasemaybeincomplete. Sincethismethodisrobusttothe
widerangeoftheexpressionlevel ratios between splicedvariants, thiscanbeimprovedby
enhancingthedatabasewithpredictedmRNAsusingpossibleexonjunctionsorassembled
template sequences from mRNA-seq data with longer reads. Finally, the quality scores
of the reads, which may help improve the prediction accuracy, are not considered in our
model.
50
0 2000 6000 10000
0.0 0.5 1.0 1.5 2.0 2.5
AE rate for the
Expression Ratio Tr2/Tr1
Number of Reads
AE Rate of Tr2/Tr1
1/1
1/4
1/16
1/64
0 2000 6000 10000
0.0 0.5 1.0 1.5 2.0 2.5
SD of AE rate for the
Expression Ratio of Tr2/Tr1
Number of Reads
SD of AE Rate of Tr2/Tr1
1/1
1/4
1/16
1/64
0 2000 6000 10000
0.000 0.005 0.010 0.015
AE for the
Unexpressed Variants
Number of Reads
Mean of AE
1/1
1/4
1/16
1/64
0 2000 6000 10000
0.00 0.02 0.04
SD of the AE for the
Unexpressed Variants
Number of Reads
SD of AE
1/1
1/4
1/16
1/64
a b
c d
Figure 2.3: The accuracy of the estimated relative expression levels in the second simu-
lated dataset. The(a) mean and (b)standard deviation (SD) of theabsolute error rate of
theestimated expression ratios of Tr1over Tr2. The(c) mean and(d) standarddeviation
of the AE for estimated expression levels for the 6 unexpressed variants.
51
0 2000 6000 10000
0.0 0.5 1.0 1.5 2.0 2.5
Expression Ratio Tr2/Tr1
Reads Number
AE Rate of Tr2/Tr1
1/1
1/4
1/16
1/64
0 2000 6000 10000
0.0 0.5 1.0 1.5 2.0 2.5
Expression Ratio Tr2/Tr1
Reads Number
SD of AE Rate of Tr2/Tr1
1/1
1/4
1/16
1/64
0 2000 6000 10000
0.000 0.005 0.010 0.015
Unexpressed Variants
Reads Number
Mean of AE
1/1
1/4
1/16
1/64
0 2000 6000 10000
0.00 0.02 0.04
Unexpressed Variants
Reads Number
SD of AE
1/1
1/4
1/16
1/64
a b
c d
Figure 2.4: The accuracy of the estimated expression levels for the third simulated
dataset. The (a) mean and (b) standard deviation (SD) of the absolute error of the
estimated expression levels for different values of the ratio between the expression levels
of the 3’ end exons to the 5’ end exons.
52
a
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
the number of expressed variants
the number of genes
0 200 600 1000 1400
kidney
liver
b
kidney liver
0 20 40 60 80
Figure 2.5: Distribution of the number of expressed spliced variants. (a) The number of
genes that have 0-20 and >20 expressed spliced variants. (b) Boxplots of the number of
expressed spliced variants across genes in kidney and in liver.
53
a
b
0 10 30 50
NINJ1.eApr07−unspliced (5.2e-78)
NINJ1.dApr07 (2.2e-10)
NINJ1.cApr07 (9.7e-11)
NINJ1.bApr07 (3%)
NINJ1.aApr07 (97%)
13 41 3
13 3
13 41 1
13 41 0 1
A
0 5 15 25
NINJ1.eApr07−unspliced (2.5e-80)
NINJ1.dApr07 (2.5e-10)
NINJ1.cApr07 (1e-14)
NINJ1.bApr07 (6.2%)
NINJ1.aApr07 (93.7%)
A
6 16 0
6 0
6 16 0
6 16 0 0
A
5’ 3’
5’ 3’
Figure 2.6: The gene structures, base-level coverage by mapped reads and estimated
gene expression levels for gene NINJ1 in (a) the kidney and (b) the liver. In both (a)
and (b), the top half of the figure shows the base-level coverage of reads mapped to exons
of human gene NINJ1, and the bottom half shows the gene structures of 5 alternatively
spliced variants, NINJ1.a through NINJ1.e, the estimated relative expression levels, and
the number of reads mapped to across two adjacent exons. A are the variants having
cDNA sequences to support their expression in the corresponding tissue, as annotated in
AceView.
54
kidney liver
0
366 553 3709
a b
p value
# of genes
0.0 0.2 0.4 0.6 0.8 1.0
0 1000 3000 5000
Figure 2.7: Differentially expressed spliced variants and differentially spliced genes. (a)
Venn diagram of the spliced variants expressed in each tissue among the differentially
expressedspliced variants. (b)Histogram of thepvaluesassessingthedifferential splicing
for genes.
Kidney
TranSEQ AceView
121
3 44 18
Liver
TranSEQ AceView
121
5 30 20
Figure 2.8: Venn diagram of the number of the expressed spliced variants by TranSEQ
and cDNA sequence supports from AceView in kidney and liver.
55
0.05 0.25 0.45 0.65 0.85
Kidney
Expression Levels
# of variants
0 5 10 20 30
supported
unsupported
supported unsupported
0.0 0.2 0.4 0.6 0.8 1.0
Kidney
Expression Levels
0.05 0.25 0.45 0.65 0.85
liver
Expression Levels
# of variants
0 5 10 15
supported
unsupported
supported unsupported
0.0 0.2 0.4 0.6 0.8
liver
Expression Levels
a
b
c
d
Figure 2.9: Comparison between expression levels of cDNA-supported spliced variants
and those without cDNA supports. Histograms for the expression levels of the cDNA-
supported spliced variants and those without cDNA supports in (a) kidney and (b) liver.
Boxplots for the expression levels of the cDNA-supported spliced variants and those
without cDNA supports in (c) kidney and (d) liver.
56
0.025 0.17 0.32 0.47 0.62 0.78 0.93
Expression Levels
# of variants
0 5 10 15 20 25 30
kidney
liver
Figure 2.10: The distribution of the estimated expression levels for all expressed spliced
variants in kidney (grey) and liver (white).
57
Kidney (FDR)
# of chosen reads per gene
fraction of genes
0.0 0.2 0.4 0.6 0.8 1.0
200
300
400
500
600
700
800
0
1
2
3
4
Liver (FDR)
# of chosen reads per gene
fraction of genes
0.0 0.2 0.4 0.6 0.8 1.0
200
300
400
500
600
700
800
0
1
2
3
4
Kidney (Frequency)
# of chosen reads per gene
fraction of genes
0.0 0.2 0.4 0.6 0.8 1.0
200
300
400
500
600
700
800
0
1
2
3
4
Liver (Frequency)
# of chosen reads per gene
fraction of genes
0.0 0.2 0.4 0.6 0.8 1.0
200
300
400
500
600
700
800
0
1
2
3
4
a b
c d
Figure 2.11: Effect of the numberof mappedreads on thepredictions of expressed spliced
variants. Each bin corresponds to one given number of randomly chosen reads for all the
highly-expressed genes which have exactly 4 annotated spliced variants and are mapped
to by 800-1,000 reads. The fractions of genes that have 0,1,2,3 and 4 expressed spliced
variantsdeterminedby(ab)controllingFDR<0.0001andby(cd)theestimatedexpression
level>5% in kidney and in liver are shown.
58
Kidney (FDR)
# of chosen reads per gene
fraction of genes
0.0 0.2 0.4 0.6 0.8 1.0
200
300
400
500
600
700
800
0
1
2
3
4
5
6
Liver (FDR)
# of chosen reads per gene
fraction of genes
0.0 0.2 0.4 0.6 0.8 1.0
200
300
400
500
600
700
800
0
1
2
3
4
5
6
Kidney (Frequency)
# of chosen reads per gene
fraction of genes
0.0 0.2 0.4 0.6 0.8 1.0
200
300
400
500
600
700
800
0
1
2
3
4
5
6
Liver (Frequency)
# of chosen reads per gene
fraction of genes
0.0 0.2 0.4 0.6 0.8 1.0
200
300
400
500
600
700
800
0
1
2
3
4
5
6
a b
c d
Figure 2.12: Effect of the number of mapped reads on the predicted number of expressed
spliced variants. Each bin corresponds to one given number of randomly chosen reads for
all the highly-expressed genes which have exactly 6 annotated spliced variants and are
mapped to by 800-1000 reads. For each bin, the fractions of genes that have 0,1,2,··· ,6
expressed spliced variants determined by controlling FDR less than 0.0001 in (a) kidney
and (b) liver are shown. In (c) and (d), we also restrict the estimated relative expression
levels to be at least 5%.
59
Chapter 3
Testing Gene Set Enrichment for Subset of Genes
3.1 Background
Genome-wide gene expression profiling using microarray technologies has been ubiqui-
tously used in biological research. An important problem is to identify gene sets that are
significantly changed under a certain treatment (for example, two different cell lines or
tissues or the same cell line under different conditions). A gene set is basically a group of
genes with related functions, e.g., genes in a biological process or in the same complex.
There are a variety of ways by which genes, and, ultimately, gene sets may be defined.
For example, gene sets can be defined according to the information provided by several
databases,suchasGeneOntology[3],KEGG[61],Biocarta(http://www.biocarta.com),
and Pfam [71]. Gene sets may also be defined by cytogenetic bands, by region of genomic
sequence or by establishing the functional relationships among them. Importantly, by
using a gene set-based approach, a high power can potentially be achieved for detecting
differentially expressed gene sets by integrating expression changes of genes inside the
60
same gene set, even when the expression changes of individual genes are modest. More-
over, because the gene sets have already been annotated by their common functions in
thedatabases, thebiological interpretation for agiven list of significant genesets willalso
be clear. At least one study [43] showed that using such gene set-based approaches did
increase the congruence of the identified gene sets between different data sets addressing
thesamebiological problem. To detect differentially expressedgene sets, several methods
have been proposed, which can be roughly categorized into three groups.
The first group identifies a list of significant differentially expressed genes (DEGs)
using individual gene analysis methods, and then examines the enrichment of gene sets
within this gene list using different statistical tests, such as the binomial test, Fisher’s
exact test, or the hypergeometric test [1,8,10,21,22,37,42]. Khatri and Draghici [41]
compared fourteen different methods within this group. Each of these methods is easy to
implement,butflawedby1)sensitivitytothecutoffvaluefordefiningthelistofsignificant
DEGs, 2) non-consideration of the relative position of genes inside the significant DEG
list, and3)assumptionofindependencebetween thegenes, whichmay maketheresulting
p-value misleading.
The second group of methods does not depend on the predefined DEG list. Instead,
these methods calculate a gene-specific statistic, known as the “local” statistic, which
measures the strength of association between the gene expression and the phenotype
for each gene. A “global” statistic for a gene set is then constructed as a function of
the local statistic for each gene in it. The significance of the global statistic is assessed
by permutation test, and different methods arrive at this assessment in different ways
[6,25,29,38,56,60,65,77,81]. In contrast to calculating a gene-specific statistic, the third
61
group of methods directly combines the expression levels of all genes in the gene sets
and they are represented as gene set-specific features. These features are then compared
between the treatment and the control groups to identify significantly affected gene sets
[30,31,47,83,91,96]. Some methods also integrated the interaction information between
genes in the gene sets [50,58,67,90].
The available methods generally tested the association of all the genes in a gene set
with the phenotype. In reality, however, it is more likely that only genes in a subset of
the gene set of interest are associated with the phenotype. Three possible factors may
explain this. First, since the function annotation defined in the available databases, such
as GO, KEGG, and Biocarta, are incomplete, or even erroneous, some of the genes in a
gene set of interest may not truly belong to the set. Second, the gene sets are sometimes
defined according to the genomic regions of the genes. Thus, the expression of the genes
in the same set may not coordinate with each other. For example, although a group of
gene sets has been defined by the cytogenetic bands on human chromosomes [77], the
expression of genes on the same cytogenetic bands do not necessarily correlate with each
other. The result is that only a subset of genes in a given cytogenetic band will then be
correlated with a phenotype. Third, even if all the genes in the gene set have the same
function,orbelongtothesamecomplex,itispossiblethatonlygenesinonebranchofthe
pathwayareassociated withthephenotype. Thecumulative effect oftheseconsiderations
strongly suggests that the currently available methods for gene set enrichment analysis
maynotbepowerfulenoughtodetecttheassociationofagivengenesetwithaphenotype,
particularlyin thecase whereonlya subsetof thegenes is associated withthephenotype.
62
Inthispaper,therefore,weextendaset-association strategyforgeneticpolymorphism
association studies developed by [36] to a set-enrichment analysis. In so doing, we want
to test the nullhypothesis that no subsets of genes in the gene set are associated with the
phenotype. We refer to the resulting method as Gene Set Enrichment by testing Subset
association, or Sub-GSE. Using two simulated data sets, we first show that Sub-GSE
has higher sensitivity in identifying gene sets associated with a phenotype compared to
GSEA, SigPath, and GSA, when only a fraction of the genes are associated with the
phenotype. Next, we apply Sub-GSE to two real data sets. One involves gene expression
datarelatedtogenderandtheotheridentifiedfunctionalgenesetsrelatedtop53mutation
status. For the first dataset, Sub-GSE identified cytogenetic bands Xp22 as significantly
associated with gender, while GSEA failed to detect this association. For the second
dataset, Sub-GSE identified several novel functional gene sets, including DNA damage
genes, cell cycle checkpoints genes and programmed cell death genes that are associated
with p53mutation status andthat werealso notdetected byGSEA.Overall, this method
providesacomplementaryapproachforidentifyinggenesetsassociatedwithaphenotype,
especially when only a subset of genes in a gene set is associated with the phenotype.
3.2 Methods
3.2.1 Association strength measures for individual genes
For a given gene, suppose the gene expression levels measured in the experiment are
(e
1
,e
2
,··· ,e
m
), where m is the number of samples. The corresponding phenotypic data
for the m samples are denoted as C = (c
1
,c
2
,··· ,c
m
). Depending on the measurement
63
levels of C, we measure the association strength between the gene expression and pheno-
type by the absolute value of t-statistic, Kruskal-Wallis statistic or Pearson correlation
coefficient for binary, discrete or continuous phenotypic data, respectively.
3.2.2 Local association statistic for each strict subset and the global
statistic for the gene set
Suppose the given gene set has a total of s genes and the sorted association strength
measures are a
1
≥ a
2
≥···≥ a
s
. According to the definition of the strict subsets, the
i-th strict subset include genes that correspond to (a
1
,a
2
,··· ,a
i
). The local association
statistic for the strict subset i is defined as
T
i
=
P
i
j=1
a
j
i
.
We use the permutation test described below to define a p-value, p
i
for the i-th strict set.
The statistic for the given gene set is the minimum p-value over all the strict sets, i.e.
P
min
=
s
min
i=c
p
i
.
3.2.3 Permutation test
We employ the algorithm in [7] to assess the significance of the given gene set. The
algorithm permutes the phenotypic data C = (c
1
,c
2
,··· ,c
m
) for N times and keep the
gene expression data intact. After each permutation, the association strength measures
64
are re-calculated using the permuted phenotypic data and the “strict subsets” are re-
defined. Suppose that the statistics T
(n)
i
for the n-th permutation and the i-th strict
subset are organized as
T
(0)
c
T
(0)
c+1
··· T
(0)
s
T
(1)
c
T
(1)
c+1
··· T
(1)
s
··· ··· ··· ···
T
(N)
c
T
(N)
c+1
··· T
(N)
s
,
where c is the minimum number of genes in the strict sets. Here we take the observed
data as the permutation 0.
Based on the data matrix, we calculate the raw p-value for the i-th strict set of the
observed data as
p
i
=p
(0)
i
=
#{n> 0 :T
(n)
i
≥T
(0)
i
}
N
.
The testing statistic for the gene set is P
min
=P
(0)
min
= min
s
i=c
p
(0)
i
.
ToestimatethedistributionofP
min
underthenullhypothesis,wereplacetheobserved
“strictsubset”statistics withthepermutedones. Thepermutedrawp-valuesforthen-th
permutation and i-th strict subset are calculated by
p
(n)
i
=
#{l≥ 0,l6=n :T
(l)
i
≥T
(n)
i
}
N
.
The minimum p-value P
(n)
min
= min
s
i=c
p
(n)
i
is taken as one of the random sample from the
distribution of P
min
under the null hypothesis. Finally the significance of the gene set is
calculated as
p−value =
#{n≥ 1 :P
(n)
min
≤P
(0)
min
}
N
65
3.2.4 Multiple testing correction for multiple gene sets
In a typical situation, there will be multiple gene sets to be analyzed. After we assess the
significance level for each of them according to the procedure described above, q-values
of all the given gene sets are calculated using the QVALUE R package [74] for multiple
testing correction.
3.3 Simulation Studies
We first elucidate that the p-values do not depend on the size of the gene set and the
p-value has a uniform distribution in [0,1] under the null hypothesis. To achieve these
objectives, we simulate a data set where all the gene sets have different set sizes and no
gene sets are related to the phenotype. The simulation is implemented in the following
steps:
1. Generate 100 gene sets whose sizes are 5,6,7,8,...,104. The total number of genes is
5450;
2. The gene expression levels in 100 samples for each gene are generated from a stan-
dard normal distribution. Different genes are independent of each other. Different
samples are also independent of each other;
3. Generate the phenotypic data from another independent standard normal distribu-
tion in 100 samples;
4. Repeat steps 1-3 for 100 times;
66
20 40 60 80 100
0.44 0.46 0.48 0.50 0.52 0.54 0.56 0.58
Set Size
Average p−value
Figure 3.1: The average p-values of the 100 gene sets across the 100 different data sets
are plotted against their corresponding set sizes.
In total, the simulation generates 100 data sets that have gene expression data and a
corresponding phenotypic data. We apply Sub-GSE to the 100 data sets separately.
First, since the gene sets have different sizes, we plot the average p-values of all the
gene sets across the 100 different data sets against their set sizes to see whether the gene
set size affects the significance level. Figure 3.1 shows that the set size does not affect
the p-values.
Second, the phenotypic data is independent of the expression levels of all the genes.
Therefore, Sub-GSE should not detect any significant gene sets. In Figure 3.2, the his-
togram of all the p-values of the 100 gene sets from the 100 data sets is shown. The
histogram illustrates that the p-values from the Sub-GSE have a uniform distribution for
67
Histogram of p−values
p−value
Frequency
0.0 0.2 0.4 0.6 0.8 1.0
0 20 40 60 80 100 120
Figure 3.2: The histogram of the p-values under the null hypothesis of no association
between the gene sets and the phenotype.
gene sets that are not related to the phenotype, which is consistent with the theoretical
uniform distribution under the null hypothesis.
3.3.1 Simulation 1
We first evaluate the performance of Sub-GSE using simulated data in which gene ex-
pression profiles with different correlations within the gene set are generated. The ex-
pression profiles for 1000 genes in 100 samples are simulated. The genes are divided
into 50 non-overlapping gene sets with 20 genes in each. The gene expression profiles
for the 100 samples represent 100 independent vectors of random variables generated
from a multivariate normal distribution. The multivariate normal distribution has 1000
68
dimensions corresponding to the 1000 genes. The mean is a vector of 1000 zeroes,
and the variance of the expression levels of each gene is 1. To simulate the depen-
dence between genes, we randomly select a certain percentage of correlated genes (PCG)
= (0%,10%,··· ,90%) in each gene set and let the correlation coefficient between any
two of them be ρ = 0,0.1,0.2,··· ,0.9. The remaining genes are independent of each
other and those that are chosen. We use this simulation strategy based on the following
considerations. Thechosen genes inthegenesetcorrespondtothosein thesamecomplex
or pathway; thus, their expression profiles are correlated. Also, since the remaining genes
represent those not belonging to the group, they are more likely to be independentof the
chosen genes and each other.
If a given gene is among those that are chosen, we use its expression levels as the
phenotype. Therationale of thisstep is to determineif ourSub-GSEmethodcan identify
the gene set to which this particular gene belongs. We repeat this process for all the
chosen genes. Thus, we have a total of 1000×PCG different phenotypic data. To avoid
the problem where a gene has exactly the same expression profile as the phenotype,
we eliminate the gene’s expression profile from the expression data if it is used as the
phenotype.
We use the following approach to study the robustness of Sub-GSE. For each given
correlation coefficient and PCG, we randomly choose one of the simulated phenotypic
data and the corresponding gene expression data. Sub-GSE is applied to the chosen data
set for 100 times. The standard deviations of the p-values across the 100 different runs
are plotted against the average p-values for all the gene sets in Figure 3.3. The figure
shows that the standard deviation of the p-value for the same gene set is smaller than
69
0.0 0.2 0.4 0.6 0.8 1.0
0.000 0.002 0.004 0.006
10%
Mean p−value
SD of p−value
0.0 0.2 0.4 0.6 0.8 1.0
0.000 0.002 0.004 0.006
20%
Mean p−value
SD of p−value
0.0 0.2 0.4 0.6 0.8 1.0
0.000 0.002 0.004 0.006
30%
Mean p−value
SD of p−value
0.0 0.2 0.4 0.6 0.8 1.0
0.000 0.002 0.004 0.006
40%
Mean p−value
SD of p−value
0.0 0.2 0.4 0.6 0.8 1.0
0.000 0.002 0.004 0.006
50%
Mean p−value
SD of p−value
0.0 0.2 0.4 0.6 0.8 1.0
0.000 0.002 0.004 0.006
60%
Mean p−value
SD of p−value
0.0 0.2 0.4 0.6 0.8 1.0
0.000 0.002 0.004 0.006
70%
Mean p−value
SD of p−value
0.0 0.2 0.4 0.6 0.8 1.0
0.000 0.002 0.004 0.006
80%
Mean p−value
SD of p−value
0.0 0.2 0.4 0.6 0.8 1.0
0.000 0.002 0.004 0.006
90%
Mean p−value
SD of p−value
Figure 3.3: The standard deviation of the p-values across the 100 runs of Sub-GSE on
simulation 1 is plotted against the average p-values for all the gene sets.
70
Average Rank
Correlation
0.2 0.6
0 5 10 20
10%
Sub−GSE
GSEA
GSA
SigPath
0.2 0.6
5 10 20
20%
Sub−GSE
GSEA
GSA
SigPath
0.2 0.6
5 10 15 20
30%
Sub−GSE
GSEA
GSA
SigPath
0.2 0.6
5 10 15
40%
Sub−GSE
GSEA
GSA
SigPath
0.2 0.6
2 6 10 14
50%
Sub−GSE
GSEA
GSA
SigPath
0.2 0.6
2 6 10 14
60%
Sub−GSE
GSEA
GSA
SigPath
0.2 0.6
2 4 6 8 12
70%
Sub−GSE
GSEA
GSA
SigPath
0.2 0.6
2 4 6 8 10
80%
Sub−GSE
GSEA
GSA
SigPath
0.2 0.6
2 4 6 8 10
90%
Sub−GSE
GSEA
GSA
SigPath
correlation
0.2
0.4
0.6
0.8
percentage
0.2
0.4
0.6
0.8
Average Rank
5
10
15
20
25
GSEA
correlation
0.2
0.4
0.6
0.8
percentage
0.2
0.4
0.6
0.8
Average Rank
5
10
15
20
25
SigPath
correlation
0.2
0.4
0.6
0.8
percentage
0.2
0.4
0.6
0.8
Average Rank
5
10
15
20
25
GSA
correlation
0.2
0.4
0.6
0.8
percentage
0.2
0.4
0.6
0.8
Average Rank
5
10
15
20
Sub−GSE
Figure 3.4: The average ranks of the gene set related to the phenotype for Sub-GSE,
GSEA, GSA, and SigPath for different percentages of correlated genes (PCG) and corre-
lation coefficients within the chosen genes. The left panel compares the average ranks in
2-D plots in which each subplot corresponds to one value of PCG. For a given PCG, the
average ranks from the four methods are plotted against the correlation coefficient be-
tween the correlated genes. Theright panelshows theaverage rankversus the percentage
of correlated genes and correlation coefficient in a 3-D plot.
0.006 and even smaller when the p-value is close to either 1 or 0. The closer the p-value
is to 0 or 1, the smaller the standard deviation is. The maximum standard deviation is
achieved when the average p-value is around 0.5.
Foreachgivenpairofpercentageofcorrelatedgenes(PCG)andcorrelationcoefficient,
we apply all four methods, Sub-GSE, GSEA, GSA, and SigPath, to the corresponding
data. All the gene sets are ranked in an increasing order of their q-values so more
significant gene sets have smaller rank. The rank of the gene set, some of whose member
genes arecorrelated withthephenotypicdata, isextracted toevaluate theperformanceof
the methods. Figure 3.4 shows the average rank of the gene set related to the phenotype
for different combinations of PCG and correlation coefficient.
First, as seen in the left panel in Figure 3.4, for small PCG = 10%,20% and 30%,
the average rank of the gene set related to the phenotype based on Sub-GSE is always
71
the lowest, irrespective of the coefficient value. On the other hand, for large PCG, the
performance of Sub-GSE is similar or slightly worse than GSEA and GSA for small
correlation. The right panel in Figure 3.4 confirms this because the average rank of the
gene set related to the phenotype based on Sub-GSE decreases much faster than those
for the other methods when PCG is small. The results of Figure 3.4 can be explained as
follows. When PCG is low, only a small fraction of the genes in the target gene set are
correlated with the phenotype. GSEA, GSA, and SigPath cannot distinguish the target
gene set from the other gene sets since these methods consider all the genes in the gene
set of interest in their statistics. In contrast, Sub-GSE incrementally tests each strict set
and chooses the smallest p-value across all the strict sets as a test statistic, thus making
the test more powerful.
Second, across differentcombinations ofPCGand correlation coefficient, we findthat
GSA and GSEA achieve similar results. Both GSA and GSEA use t-statistics to obtain
therankinglistofgenes. For applications inthisarticle, theonlydifferencebetween them
is that GSA restandardizes the statistics before the permutation to reduce the effect of
correlation between genes. However, in both panels of Figure 3.4, the average ranks of
the target gene set by GSEA and GSA are quite similar, especially when the PCG is
high. Consequently, restandardization in GSA does not seem to be very efficient in this
simulation study, especially when there are many correlated genes.
Third,toshowthesensitivityandspecificityofSub-GSE,weneedagroupofgenesets
that are related with the phenotype. Therefore, we do another set of simulations similar
assimulation 1. Thedetailed descriptionsofthesimulation andtheresultingROCcurves
can befound in the supplementary materials[see Additional file 1]. The results show that
72
thehigher thePCGand thecorrelation coefficient are, thehighertheAUC scoreis. Once
the correlation coefficient is higher than 0.4, the AUC score is higher than 0.85 no matter
what the PCG is. When PCG is higher than 0.5, the AUC score can be higher than 0.75
regardless of the correlation coefficient.
3.3.2 Simulation 2
In reality, most phenotypes are the joint effect of multiple genes, probably from multiple
pathways. Therefore, we also simulate a more realistic case where the phenotypes are
assumedtobeacomplexfunctionofexpressionlevelsfromtwogenesets. Asinsimulation
1, we again consider 1000 genes divided equally into 50 non-overlapping gene sets of 20
genes in each. For fixed PCG and ρ,
1. Simulate the expression profiles of the 1000 genes as in the first simulation for 100
individuals;
2. Choose two gene sets K
1
and K
2
from the 50 gene sets. Let SK
1
and SK
2
be
the correlated genes in K
1
and K
2
, respectively. Define the phenotype for the j-th
individual as
y
j
= 1+
X
i∈SK
1
e
2
ij
+
X
i∈SK
2
e
2
ij
+ǫ
j
,
where ǫ
j
has a normal distribution with mean 0 and variance 0.25.
3. Analyze the data using GSEA, SigPath, GSA, and Sub-GSE to rank the gene sets.
Rank all the gene sets in increasing order of their q-values.
73
0.0 0.2 0.4 0.6 0.8 1.0
0.001 0.003 0.005
10%
Mean p−value
SD of p−value
0.0 0.2 0.4 0.6 0.8 1.0
0.001 0.004
20%
Mean p−value
SD of p−value
0.0 0.2 0.4 0.6 0.8 1.0
0.001 0.004
30%
Mean p−value
SD of p−value
0.0 0.2 0.4 0.6 0.8 1.0
0.001 0.004
40%
Mean p−value
SD of p−value
0.0 0.2 0.4 0.6 0.8 1.0
0.000 0.003 0.006
50%
Mean p−value
SD of p−value
0.0 0.2 0.4 0.6 0.8 1.0
0.001 0.003 0.005
60%
Mean p−value
SD of p−value
0.0 0.2 0.4 0.6 0.8 1.0
0.000 0.003 0.006
70%
Mean p−value
SD of p−value
0.0 0.2 0.4 0.6 0.8 1.0
0.000 0.003 0.006
80%
Mean p−value
SD of p−value
0.0 0.2 0.4 0.6 0.8 1.0
0.000 0.003 0.006
90%
Mean p−value
SD of p−value
Figure 3.5: The standard deviation of the p-values across the 100 runs of Sub-GSE on
simulation 2 is plotted against the average p-values for all the gene sets.
4. Repeat steps 1-3 100 times to assess the performanceof the different analytic meth-
ods by the effects of the different gene expression data.
We study the robustness of Sub-GSE as follows. Similar to the process in simulation
1, for each given correlation coefficient and PCG, we randomly choose a phenotypic data
and the corresponding gene expression data. Sub-GSE is applied to the chosen data set
for 100 times. The standard deviation of the p-values across the 100 runs for all the gene
sets is plotted against the average p-values in Figure 3.5. Again, the standard deviation
74
Average Rank
Correlation
0.2 0.6
20 22 24 26
10%
Sub−GSE
GSEA
GSA
SigPath
0.2 0.6
18 20 22 24 26
20%
Sub−GSE
GSEA
GSA
SigPath
0.2 0.6
18 20 22 24
30%
Sub−GSE
GSEA
GSA
SigPath
0.2 0.6
18 20 22 24
40%
Sub−GSE
GSEA
GSA
SigPath
0.2 0.6
18 22 26
50%
Sub−GSE
GSEA
GSA
SigPath
0.2 0.6
16 18 20 22 24
60%
Sub−GSE
GSEA
GSA
SigPath
0.2 0.6
16 20 24
70%
Sub−GSE
GSEA
GSA
SigPath
0.2 0.6
16 18 20 22 24
80%
Sub−GSE
GSEA
GSA
SigPath
0.2 0.6
16 18 20 22
90%
Sub−GSE
GSEA
GSA
SigPath
correlation
0.2
0.4
0.6
0.8
percentage
0.2
0.4
0.6
0.8
Average Rank
16
18
20
22
24
GSEA
correlation
0.2
0.4
0.6
0.8
percentage
0.2
0.4
0.6
0.8
Average Rank
20
22
24
26
SigPath
correlation
0.2
0.4
0.6
0.8
percentage
0.2
0.4
0.6
0.8
Average Rank
16
18
20
22
GSA
correlation
0.2
0.4
0.6
0.8
percentage
0.2
0.4
0.6
0.8
Average Rank
16
18
20
22
Sub−GSE
Figure 3.6: The average ranks of the target gene sets by Sub-GSE, GSEA, GSA and
SigPath for different percentages of correlated genes and correlation coefficients in Sim-
ulation II. The left panel compares the average ranks in 2-D plots in which each subplot
corresponds to one value of PCG. For a given PCG, the average ranks of the target
sets from the four methods are plotted against the correlation coefficient between the
correlated genes. The right panel shows the average ranks of the target sets versus the
PCG and correlation coefficient in a 3-D plot. The four cubes correspond to Sub-GSE,
GSEA, GSA and SigPath.
of thep-values across different runsfor each gene set is smaller than 0.006. Thecloser the
average p-value is to either 0 or 1, the smaller the standard deviation is. The maximum
standard deviation is achieved when the average p-value is around 0.5.
For this simulation study, we again apply the four different methods to prioritize the
gene sets as in the first simulation study and calculate the average rank of the two target
gene sets. The results can be found in Figure 3.6.
As shown in Figure 3.6, the average ranks of the target gene sets based on all the
methods are relatively high. This could result from the involvement of two different gene
sets when simulating the phenotypic data and the fact that the phenotypic data are the
sum of the squared expression levels of correlated genes. Another potential complicating
factor is that the phenotype includes a noise in addition to the function of the gene
75
expression levels of the component genes. All these facts can weaken the correlation
between the phenotypic data and the gene expression profile of individual genes inside
the true gene sets. Despite these problems, Sub-GSE performs relatively well compared
to the other three methods, especially when PCG is low. When PCG is high, the
performances of Sub-GSE are close to those of GSEA and GSA since both GSEA and
GSA consider all the genes inside the gene sets. Again, the performances of GSEA and
GSA are similar.
3.4 Applications to Real Data Sets
3.4.1 Male Vs. Female Lymphoblastoid Cells
We also apply Sub-GSE to two real data sets from [77]. The first data set measured the
mRNAexpressionprofilesfromlymphoblastoidcellsderivedfrom15malesand17females
usingAffymetrixU133A chip. Thegenderof theindividualsrepresentsthecorresponding
phenotypic data. Thegene sets are chosen as thecytogenetic sets (C1, 319 gene sets) and
the functional gene sets (C2, 522 gene sets) defined in [77]. The cytogenetic sets contain
24 gene sets, onefor each of the 24 humanchromosomes, and 295 gene sets corresponding
to cytogenetic bands along the chromosomes. The functional sets include 472 gene sets
containing genes whose products are involved in specific metabolic and signaling path-
ways, as reported in eight publicly available, manually curated databases, and 50 gene
sets containing genes co-expressed in response to genetic and chemical perturbations, as
reported in various experimental studies (see supporting text in [77] for details). We ap-
ply Sub-GSE,GSEA, and SigPath to these two types of gene sets independentlywith the
76
Sub-GSE GSEA sigPath
set names q-value set names FDR set names max q-value
chrY 0 chrY 0 chrY 0
chrYq11 0 chrYq11 0.000801147 chrYq11 0
chrYp11 0 chrYp11 0.000811309 chrYp11 0
chrXp22 0.147 chrYp11 Xp22 0.25508726 chrYp11 Xp22 0.315700257
chrX 0.294 chr11p12 0.31999215 chr1q22 0.990782566
chrXp11 0.686 chr6q24 0.6108403 chr20p11 0.991916264
chr15q11 0.924 chr5p14 0.6131663 chr15q21 0.991916264
Table 3.1: The top 7 gene sets by Sub-GSE, GSEA and sigPath and their corresponding
p-values, FDR and max q-values. The cytogenetic bands chrYp11 Xp22 was named
chrYp22 in the original gene set data which include 8 genes that are on both chrYp11
and chrXp22.
objective of identifying the cytogenetic regions that are differentially expressed between
males and females and the functional gene sets related to sex distinction, respectively.
First, we apply Sub-GSE to investigate the enrichment of cytogenetic gene sets (C1).
Asexpected, thethreemostsignificantcytogenetic bandsarechrY,chrYp11andchrYq11
whichallhaveaq-valueof0andaretheonlythreecytogenetic bandsfromchromosomeY
in genesets C1. Theyarealso theonlythreesignificant genesets in C1byGSEA(FDR<
0.2) and SigPath (max q-value< 0.2). Besides these expected bands on chromosome Y,
other bandsthat are ranked as the top 7 among all the gene sets by Sub-GSE,GSEA and
SigPath are listed in Table 3.1. As seen from the lists, Sub-GSE is sensitive enough to
identifycytogenetic bandsonbothchromosomesXandYattheq-valuethresholdof0.20.
On the contrary, neither GSEA nor SigPath is able to detect any bands on chromosome
X at the FDR threshold of 0.20. Again, this result shows the sensitivity of Sub-GSE.
Second, we applySub-GSEto investigate the enrichment of functional gene sets (C2).
Both Sub-GSE and GSEA detect three significant gene sets whose significance levels are
listed in Table 3.2: testis-related genes, genes that escape X inactivation, and female
77
Set Name Sub-GSE (q-value) GSEA (FDR)
Testis related genes <0.001 0.012
Genes that escape X inactivation 0.165 <0.001
Female reproductive tissue expressed genes 0.165 0.045
Table3.2: Thetop3functionalgenesets andtheircorrespondingq-values andFDRrates
by Sub-GSE and GSEA.
reproductive tissue-expressed genes. The q-values of all the other gene sets are larger
than 0.9 by Sub-GSE. Hence, in this dataset, the results by Sub-GSE are roughly the
same as those achieved by GSEA.
3.4.2 P53 Status in Cancer Cell Lines
The second real data set corresponds to the gene expression data and phenotypic data
related to p53 mutation status from [77]. The objective of this study is to identify novel
targets of the transcription factor p53. The p53 mutation status gene expression data
examined the gene expression patterns from the NCI-60 collection of cancer cell lines.
The expression profiles were measured using Affymetrix U95Av2 chips. The mutational
status of the p53 gene had been reported for 50 of the NCI-60 cell lines, with 17 being
classifiedasnormaland33ascarryingmutationsinthegene. Wetakethegeneexpression
profiles of these 50 cell lines as the expression data and the vector of binary indicators of
the mutational patterns (normal or mutated) as the corresponding phenotypic data. For
the gene set data, we only use the functional sets (C2, 522 gene sets) in [77], which was
already described in the application noted above.
78
P53
DNA Damage
Hypoxia
Radiation
Heat Shock
Chemical Carcinogens
DNA Damage Signaling Pathway
Hypoxia and p53 in Cardiovascular System
Radiation Sensitivity Genes
ATM Signaling Pathway
Stress Induction of HSP Regulation
Drug Resistance and Metabolism Genes
p53 signaling pathway
p53 signaling pathway genes
p53 upregulated genes
Cell Cycle Arrest Apoptosis
G1 and S Phase of Cell Cycle
Cell Cycle:G2/M Checkpoint
G2 and M Phases of Cell Cycle
Cell Cycle Regulator Genes
Programmed Cell Death
TrkA
Pitx2
TrkA Signaling Pathway
Multi-step Regulation of Transcription by Pitx2
Ceramide
Fas
Ceramide Signaling Pathway
Fas Signaling Pathway
Figure 3.7: Relationship between significant gene sets and p53. Solid arrows describe the
interactions between p53 and different biological processes or genes. Dashed arrows show
the definition of those gene sets in the corresponding dashed rectangle. The solid arrows
are constructed based on previous works in [2,19,28,49,85,89].
Functional sets that have a q-value smaller than or equal to 0.03 by Sub-GSE are
extracted and listed in Table 3 together with their q-values. Comparing the list of sig-
nificant gene sets by Sub-GSE and GSEA [77], we can see that Sub-GSE obtains more
significant gene sets than GSEAdoes, which again shows thesensitivity of Sub-GSE.The
relationship between the identified gene sets by Sub-GSE (Table 3) and p53 is illustrated
in Figure 3.7. Basically, all these gene sets are significantly enriched in genes that are
differentially expressed in p53 mutants versus those without p53 mutations. Therefore,
the identified gene sets by Sub-GSE are potentially those regulated by p53. According
to the definitions of these gene sets, as shown in Figure 3.7, we can roughly divide them
into three groups.
79
Set Name q-value
Hypoxia and p53 in the Cardiovascular system <0.001
G1 and S Phases of the Cell Cycle <0.001
p53 Signaling Pathway(p53Pathway) 0.022
TrkA Signaling Pathway 0.022
P53 Upregulated Genes 0.022
p53 Signaling Pathway Genes(p53 signalling) 0.029
Cell Cycle: G2/M Checkpoint 0.029
G2 and M Phases of the Cell Cycle 0.029
Programmed Cell Death 0.029
DNA Damage Signaling Pathway 0.029
Radiation Sensitivity Genes 0.029
Cell Cycle Regulator Genes 0.029
ATM Signaling Pathway 0.029
Ceramide Signaling Pathway 0.029
Drug Resistance and Metabolism Genes 0.029
Stress Induction of HSP Regulation 0.029
Multi-step Regulation of Transcription by Pitx2 0.029
Fas Signaling Pathway 0.029
Table 3.3: Significant functional gene sets (q-value≤ 0.03) detected in 50 of the NCI-60
cell lines. The q-values are calculated by Sub-GSE.
The first group includes gene sets that are directly regulated or affected by p53,
including the “p53 signaling pathway”, “p53 signaling pathway genes”, and “p53 upreg-
ulated genes”. This group of gene sets was detected by both Sub-GSE and GSEA [77]
(FDR< 0.015).
The second group contains gene sets that are “downstream” of p53. These gene sets
caneitherbeinducedorinhibitedbyp53[19,28,49,85]. Forexample,itiswellknownthat
p53 induces cell cycle arrest duringthe G1/S phase and the G2/M phase checkpoint [28].
By itself, p53 can activate an important death receptor, Fas, which triggers the “Fas
Signaling Pathway” and thus leads to apoptosis [49]. It is also well known that p53
functions “upstream” of ceramide in response to genotoxic stress [19].
80
The third group includes gene sets related to the “upstream” biological processes or
genes for p53. These “upstream” biological processes, such as DNA damage caused by
radiation or chemical carcinogens, for example, pass theDNA damage signaldown to p53
andfurtherinducesome of the“downstream” pathways. Twogenes, TrkAandPitx2, are
known to affect apoptosis through regulation of p53 [2,89]. Thegene sets related to these
“upstream” biological processes actually include genes related to those “downstream”
biological processes in the second group.
In this dataset, Sub-GSE not only detects the gene sets identified by GSEA [77], but
also detects more novel gene sets related to p53. Previous studies from the literature
supportthefindingsin that all the significant gene sets identified bySub-GSEare related
to p53, as shown in Figure 3.7.
3.5 ROC curves for Sub-GSE
The number of relevant gene sets is one and two in simulation study 1 and 2 respectively
which is too small to get a smooth ROC curve. Therefore, we did another simulation
which is very similar to simulation 1 to show the sensitivity and specificity of Sub-GSE.
The simulation is implemented according to the following steps:
1. Simulate 2genesets, each of whichhas20genes inside. Therearenocommongenes
in the two gene sets;
2. Thegene expression levels of the 40 genes are generated from a multivariate normal
distribution. The mean and the variance for each gene is 0 and 1. In each gene set,
81
certain percentage of genes (PCG=10%, 20%,..., 90%) are correlated. The corre-
lation coefficient between them varies from 0.1 to 0.9. The correlation coefficient
between any other gene pair is 0.
3. For any given PCG and the correlation coefficient, simulate 1000 gene expression
data sets for the 40 genes. For each simulation, randomly select one gene that
is correlated with some other genes in the corresponding gene set. Take the gene
expression profile of this gene as the phenotypic data and remove them from the
gene expression data.
The simulation generates 1000 gene expression data sets as well as the corresponding
phenotypic data for each given PCG (10%, 20%, ..., 90%) and correlation coefficient
(0.1,0.2,...,0.9). We apply Sub-GSE to each data set and calculate the p-values for the
two gene set. The result of the 1000 data sets for given PCG and correlation coefficient
are pooled to draw the ROC curves. Different thresholds of the p-values are set and the
corresponding sensitivity and specificity rates are calculated by comparing the gene sets
that have p-values smaller than the threshold and those from which the corresponding
phenotypic data is from. The ROC curves are shown in Figure 3.8:
According to Figure 3.8, once the PCG is larger than 0.3, the AUC score will be as
high as 0.7 even for the smallest correlation coefficient. When PCG is smaller than 0.3,
the AUC score can still be larger than 0.5.
82
10%
False positive rate
True positive rate
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.1:0.536
0.2:0.6656
0.3:0.7678
0.4:0.8717
0.5:0.9599
0.6:0.9889
0.7:0.9977
0.8:0.9998
0.9:0.9991
20%
False positive rate
True positive rate
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.1:0.6177
0.2:0.8442
0.3:0.9709
0.4:0.9982
0.5:0.9999
0.6:1
0.7:1
0.8:0.9999
0.9:1
30%
False positive rate
True positive rate
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.1:0.6856
0.2:0.9293
0.3:0.9967
0.4:0.9999
0.5:1
0.6:1
0.7:1
0.8:1
0.9:0.9995
40%
False positive rate
True positive rate
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.1:0.7401
0.2:0.9589
0.3:0.9991
0.4:0.9999
0.5:1
0.6:1
0.7:1
0.8:1
0.9:1
50%
False positive rate
True positive rate
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.1:0.7836
0.2:0.9785
0.3:0.9992
0.4:0.9999
0.5:1
0.6:1
0.7:1
0.8:1
0.9:1
60%
False positive rate
True positive rate
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.1:0.7834
0.2:0.9854
0.3:0.9995
0.4:0.9994
0.5:1
0.6:1
0.7:1
0.8:1
0.9:1
70%
False positive rate
True positive rate
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.1:0.8205
0.2:0.9898
0.3:0.9997
0.4:0.9999
0.5:1
0.6:1
0.7:1
0.8:1
0.9:1
80%
False positive rate
True positive rate
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.1:0.8466
0.2:0.9931
0.3:0.9999
0.4:0.9999
0.5:1
0.6:1
0.7:1
0.8:1
0.9:1
90%
False positive rate
True positive rate
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
0.1:0.8711
0.2:0.9939
0.3:0.9998
0.4:0.9999
0.5:1
0.6:1
0.7:1
0.8:1
0.9:1
Figure 3.8: The ROC curves for Sub-GSE.
3.6 Conclusions and Discussions
To summarize, we have developed a method, called Sub-GSE, to identify gene sets that
are associated with a phenotype by testing the association between the strict subsets of
genes and the phenotype. In many applications, it is very likely that only a subset of
genes in a gene set of interest is associated with the phenotype. However, since currently
available methods for gene set enrichment analysis usually test the association of all the
genes in a gene set with the phenotype, the power of these methods is correspondingly
reduced. In contrast, Sub-GSE is based on the idea of set-association approach first
proposed by [36] and it incrementally tests the association of “strict subsets” with the
phenotype. The strict subsets contain the genes having the top association strength of
83
individual genes with the phenotype. We first study the performance of Sub-GSE and
compare it with three widely used methods for gene set enrichment analysis: GSEA,
GSA, and SigPath. Our simulations show that Sub-GSE outperforms GSEA, GSA, and
SigPath in prioritizing gene sets associated with a phenotype when the fraction of genes
associated with the phenotype is relatively small. On the other hand, these four methods
all achieve similar results when the fraction of associated genes is large. When applied
to two real data sets, Sub-GSE is shown to detect more biologically meaningful gene sets
than GSEA. For example, Sub-GSE identified cytogenetic band Xp22 as significantly
associated with gender (q-value < 0.20), whileneither GSEA nor SigPath identified them
as significant at a FDR< 0.20. Similarly, Sub-GSE identified many gene sets including,
for instance, DNA damage genes, cell cycle checkpoints genes and programmed cell death
genes, as significantly associated with p53 mutation status. These were not identified
by GSEA. This evidence supports the high sensitivity of Sub-GSE. Most of the detected
gene sets have supports from previous studies for the association between them and the
p53 mutations.
Usually a large number of sets will be detected as significant for most tests of gene
enrichment analysis. Since Sub-GSE is more sensitive in detecting significant gene sets
than other tests for gene set enrichment analysis, we expect that many more gene sets
will be identified as significant. This may reflect biological reality instead of statistical
artifacts. For example, cancer can affect a large number of genes and gene categories. By
studyingtheGOrelationshipamongthesignificantgenesets,themorespecificsignificant
GO categories may represent the real underlying affected function categories.
84
Theadvantages of Sub-GSEover other approaches for testing gene set enrichment are
most evident when only a fraction of the genes in the gene set of interest are associated
with the phenotype. If we believe that most genes in a gene set of interest are associated
with thephenotype, other approaches, includingGSEA,GSA,andSigPath, mayperform
better than Sub-GSE. Under this scenario, the use of the minimum p-value across all the
strictsubsetsasateststatistic,whichisdoneinSub-GSE,wouldresultintheintroduction
of more noise. It is possible that the minimal p-value may be achieved for some subsets
of the gene set of interest, making Sub-GSE less powerful. The results of our simulations
are consistent with this observation. On the other hand, our simulations also showed
that the performance of Sub-GSE is only marginally worse than the other approaches
under the conditions noted above. We do not claim that Sub-GSE is always better than
GSEA, GSA, or SigPath. Instead, Sub-GSE complements other approaches for gene set
enrichment analysis when the fraction of associated genes is relatively small.
The speed of Sub-GSE is determined by the number of gene sets and the number of
genes inside each gene set. To give an example of the running time, we download the
gene expression data with accession number GSE5081 from NCBI (http://www.ncbi.
nlm.nih.gov/) which hybridized total RNAs from gastric biopsy specimens of patients
with Helicobacter pylori positive (HP+) and Helicobacter pylori negative (HP-) antrum
erosions (ER+), and the corresponding, adjacent normal mucosae (ER-). The gene ex-
pression data includes 54675 probes and 32 samples. HP+ and HP- are treated as the
phenotype. Mappings between the probes and GO categories are from the R package
(http://www.r-project.org/) named ”hgu133plus2”. All the probes are mapped to
8310 GO categories in total. We run Sub-GSE on this data set using a computer with
85
Pentium 4 CPU 3.60GHz/3.59GHz, 1.00GB of RAM. It took 12.7 hours when 1000 per-
mutations are required and the strict set size threshold is 1.
Usually we need to simultaneously test the association of a large number of gene sets
with the phenotype. For each gene set, we can use Sub-GSE to test the association of
the gene set with the phenotype to obtain a p-value. We have shown in the “Results”
section that the p-value is uniformly distributed under the null hypothesis that no subset
is associated with the phenotype. When we test for a large number of gene sets, the issue
of multiple testing is of concern. To solve this problem, conventional methods such as
Bonferroni correction can be used. However, Bonferroni correction is too conservative in
mostsituations. Anothercurrentlywidelyusedmethoddealingwithmultipletestingisto
control false discovery rate(FDR) as implemented inthesoftware package QVALUE[74].
For the QVALUE package to work well, the p-values for all the gene sets need to be
weakly dependent. When the sizes of the gene sets are relatively small compared to the
totalnumberofgenes,weexpectthatthep-valuestobeweaklydependentsincethegenes
usuallyformmodulesandgenesfromdifferentmodulesaremorelikelytobeindependent.
When these assumptions are in doubt, we can use the p-values obtained from Sub-GSE
to indicate the statistical significance of the gene sets.
TherearetwooptionsinSub-GSE:theminimalsizeforthestrictsetsandthestatistic
to measure the association strength between gene expression profiles and the phenotype.
We set the tuning parameter c to be the minimal size of the strict subsets on which to
test. ParameterccancontrolthesensitivityandthespecificityofSub-GSE,thushavinga
significant effect on its performance. Generally, the sensitivity of Sub-GSEdecreases and
the specificity increases as c increases. Therefore, the choice of c should depend on the
86
balance between sensitivity and specificity. Although we set c = 5 in this paper, which
restricts theminimalsizefor thestrict sets, we can, instead, requirethat theminimalsize
of the strict sets depend on the size of the gene set of interest. For example, one could
considerthesubsetsofgenesthatcoveratleast10%ofthegivengenesinsideeachgeneset.
Sincethe minimalset size of thesubsetmay bedifferent for different gene sets, the effects
of this type of restriction need to be further studied. The other Sub-GSE option involves
thestatisticusedtomeasuretheassociationstrengthbetweengeneexpressionprofilesand
the phenotype. In this paper, we use t-statistics, Kruskal-Wallis statistics, and Pearson’s
correlation to evaluate the association strength between the gene expression profiles and
discrete, categorical, and quantitative phenotypes, respectively. Other statistics can also
be applied. The power of Sub-GSE to detect enriched gene sets for different types of
statistics also needs to be further studied.
It is well known that genes in the same pathway or complex tend to be correlated. A
natural question is whether it is better first to do principal component analysis (PCA)
and then apply Sub-GSE to the principal components. We implemented this idea and
foundtheapproachlesspowerfulthanthemethodimplementedinthispaper. Apotential
explanation is that the expression profiles of the genes in the gene sets among the cases
and controls do not satisfy the normality assumption making the PCA approach less
powerful. More studies are needed to see under what conditions the combination of PCA
and Sub-GSE is more powerful than Sub-GSE alone.
87
Chapter 4
Gene Transcriptional Network Reconstruction
4.1 Background
To reconstruct the underlying gene regulatory network, gene expression changes were
commonly measured in single gene knockouts and differentially expressed genes are sus-
pected to be regulated by the knocked out gene. Although this type of experiments
provide us important information about the regulatory relationships, they require too
much cost and labor to include the knockouts for all genes.
One potential way to circumvent these limitations is to examine and analyze gene
expressionfrommanyindividualgenotypes. Allthegenotypescontainmultiplemutations
andinparalleldeterminethenucleotidevariationineach genotype(forrecentapplications
and review, see [26,34,35,46]). For each of the genotypes, expression changes as well
as the alterations in the coding sequence or regulatory regions are measured. Denote
the genes that have alterations in their coding regions or regulatory regions as altered
genes. By examining the co-perturbation among different genotypes between genes and
one altered gene, we can identify the potential genes that are regulated by the altered
88
genes. In each genotype, the altered genes are identified according to either the sequence
data (for coding alterations)( [14,44,92]) or allele-specific expression data(for regulatory
alterations)( [12,18,51,64,93]). Differentially expressed genes that are not detected as
altered genes are treated as the perturbed genes. With these data, we propose to predict
theregulatorynetworksbasedonaprobabilisticmodel. AnEMalgorithmisimplemented
to estimate the probability of the regulatory relationships between altered genes and the
perturbed genes. Preliminary simulation study gives the ROC curve that describe the
accuracy of the model.
4.2 Model and Algorithm
4.2.1 Probability Model
If we have I genotypes in total and their corresponding F1 offspring. By comparing the
allele specificexpression in the F1 offspringwith that in thecorrespondingparental geno-
types, we can define a set of altered genes S
i
alt
. Other differentially expressed genes that
are not defined as the altered genes are referred to as perturbed genes, denoted as S
(i)
pert
.
However, genes in S
(i)
pert
may not include all genes that are potentially regulated by genes
in S
i
alt
. First, the experiments may not be sensitive enough to detect all the differentially
expressed genes. Second, the regulated genes by the same gene may be different in dif-
ferent samples or under different conditions. Namely, the regulation mechanism may be
stochastic which may result in non-differential expression of some regulated genes. Due
to these two reasons, we define S
=
alt
S
S
(i)
alt
and S
pert
=
S
S
(i)
pert
and try to infer the gene
89
regulation network from genes in S
alt
to those in S
pert
. To achieve this goal, we consider
the following probabilistic model.
Before describingthe model, wedenote the probabilitythat thej-th perturbedgene is
differentially expressed conditional on the l-th altered gene being differentially expressed
by λ
lj
. We also make the following assumptions:
• Each altered gene in a given genotype affects the perturbed genes in the same
genotype independently,
• Onegivengeneisdifferentiallyexpressedifatleastoneofthedifferentiallyexpressed
altered genes regulates it,
• The perturbed genes are independent in terms of differential expression.
With the above notation and assumptions, the probability that the j-th perturbed
gene in the i-th genotype is differentially expressed with probability,
P(X
ij
= 1) = 1−
Y
l∈S
(i)
alt
(1−λ
lj
) (4.1)
and
P(X
ij
= 0) = 1−P(X
ij
= 1|S
(i)
alt
). (4.2)
In i-th genotype, the probability of the observed expression differentiation data for
the perturbed genes is
P
i
=
Y
j∈Spert
P(X
ij
=x
ij
),
90
where X
ij
is a random variable which is 1 when the j-th perturbed gene in the i-th
genotype is differentially expressed and 0 otherwise. Due to the independence between
different genotypes, multiplying across all the genotypes gives the total probability of the
observed data which is
Q
I
i=1
P
i
.
4.2.2 EM Algorithm
We next develop a preliminary algorithm for inferring the gene regulatory network based
on the above probabilistic model. If there is only one altered gene, e.g. A, the parameter,
λ
Aj
, for the perturbed gene j can be estimated by the fraction of genotypes in which
gene j is differentially expressed. To assess the significance of λ
Aj
, a statistical test, such
as the chi-square test can be used. However, the real data may have multiple altered
genes, especially when there are more than one genotype included. In this situation, we
treat the problem as a missing value problem with the regulation relationship missing.
An expectation maximization (EM) algorithm is employed to estimate the parameters in
the probabilistic model. Notations for the algorithm is as follows. The observed data is
the expression changes of all the perturbed genes in S
(i)
pert
, which is denoted as X
ij
with
i standing for genotype and j standing for the perturbed gene. The missing data is the
data for all Y
ilj
(i: genotype, l: altered gene, j: perturbedgene) whereY
ilj
= 1 if the l-th
91
altered gene affects the j-th perturbed gene in the i-th genotype, and Y
ilj
= 0 otherwise.
With these notations and the assumptions made above, we have that
P(Y
ilj
= 1|X
ij
=x
ij
) =
P(Y
ilj
= 1,X
ij
=x
ij
)
P(X
ij
=x
ij
)
=
P(Y
ilj
= 1)P(X
ij
=x
ij
|Y
ilj
= 1)
P(X
ij
=x
ij
)
=
λ
lj
P(X
ij
=x
ij
|Y
ilj
= 1)
P(X
ij
=x
ij
)
.
(4.3)
Note that P(X
ij
= 1|Y
ilj
= 1) = 1 and P(X
ij
= 0|Y
ilj
= 1) = 0 based on our
assumptions. The denominator can be calculated according to (4.1) and (4.2). Let
V
l
be the set of genotypes where the l-th altered gene is differentially expressed, i.e.
V
l
=
n
genotype i,l∈S
(i)
alt
o
. The EM algorithm can be described as follows.
1. Initialize λ
lj
=
1
N
alt
, where N
alt
is the number of altered genes.
2. The E-step: at the k-th step, calculate the conditional probabilities in Equation
(4.3) based on the current values of λ
lj
=λ
(k)
lj
.
3. The M-step: at the (k+1)-st step, update λ
lj
as follows,
λ
(k+1)
lj
=
P
i∈V
l
P(Y
ilj
= 1|X
ij
=x
ij
)
|V
l
|
.
4. Repeat Steps 2 and 3 until all the λ
lj
converge.
92
Based on the estimated values of all the λ
lj
from the EM algorithm, we can rank all
the predicted regulatory relationships. Another potential useful criterion is to use the
likelihood ratio statistic
LR
i
0
j
0
=−2
max
λi
0
j
0
=0
P(Data)
max
λi
0
j
0
>0
P(Data)
.
4.3 Preliminary Results: A Simulation Study
4.3.1 Simulated Data
To examine the power of the EM algorithm, we simulate a data set and apply the above
EM algorithm to it. To simulate the data, we make some assumptions according to
previous studies as follows. It was estimated that each genotype has roughly M = 500
to 1000 altered genes and about N = 2000 to 3000 perturbed gene. There are T = 15000
genes in the Drosophila genome in total. Therefore, in a given genotype, the probability
that a gene is an altered gene is pc = M/T ≈1/30–1/15. All the altered genes are
classified into three different groups based on our biological knowledge.
• Hub genes: Genes like transcription factors(TFs), may regulate a large number
of genes. Such genes are called hub genes. We expect that there are about 500
TFs in the Drosophila genome and thus a given gene is a TF with probability
α
1
= 500/15000 = 1/30. In previous gene knockout experiments, such as sex-
lethal (sxl), tra, or fru, the expression levels of about 150 – 300 genes are changed.
Therefore, we assume that a TF regulates a gene with probability pr1 =150 –
300/15000≈ 1/100 – 1/50.
93
• Orphan genes: Opposite to transcription factors, some other genes, such as en-
zymes, may regulate a relatively small number of genes, say about 15-30 genes. We
refer to such genes as orphan genes. An orphan gene regulates another gene with
probability pr2 =15 – 30/15000≈ 1/1000 – 1/500. A rough estimate of the num-
ber of enzymes is 5000 resulting in a gene being an orphan gene with probability
α
2
= 5000/15000 = 1/3.
• Other genes: Many network studies based on yeast have found a power-law prop-
erty for molecule networks. A gene regulates k genes with probability ck
γ
, where
c and γ are constants. Studies on molecular networks showed that γ is generally
between 1 and 2 ( [?], [?]). In our simulations, we assume that γ = 1.5 and c can
be chosen so that c
P
400
k=0
k
−1.5
= 1 since we will study 400 genes in total. For each
altered gene, once we have the number, k, of genes it regulates, the k genes are
randomly chosen from the 400 genes.
The following simulations will be implemented to evaluate the approach as well as to
see what we can conclude with I = 50 to 200 genotypes.
Step 1: Simulating the network according to the description above. Thesimulated
regulatoryrelationshipbetweenthe400genesarerepresentedbyanadjacentmatrix
with a
lj
= 1 if gene l regulates gene j by the simulation.
Step 2: Simulating altered genes and perturbed gene expression changes. Once
a network is obtained from step 1, we simulate the gene expression changes in
I≈ 50 to 200 genotypes as follows. Each gene is assigned as an altered gene with
probabilitypc=N/T (1/15). Ifgenelisselectedasanalteredgene, thegenesgenes
94
regulated by gene l according to the network simulated in step 1, i.e.{j : a
lj
= 1}
will have a probability α (say 0.1, 0.3, or 0.5) of being affected and thus being
differentially expressed. If a gene is regulated by multiple altered genes, the gene
is differentially expressed if at least one of the altered genes make it differentially
expressed in the genotype. The resulting genes are the perturbed genes in this
genotype. Repeat the second step for I times to obtain I genotypes.
Step3: EMalgorithmandlikelihoodratioscorecalculation. UsetheEMalgorithm
described above to estimate the probability of gene l regulating gene j, i.e. λ
lj
.
Step 4: Evaluations. We study the performance of our approach with varying
numbers of genotypes (I =50, 100, 125, 150, 200), the numbers of genotypes con-
taining a gene pair (k =1, 5), and the probability that an altered gene makes a
perturbed gene differentially expressed (α = 0.1, 0.3, 0.5). The receiver operating
characteristic (ROC) curve and the area under the ROC curve (AUC) are used to
evaluate the approach. For a given threshold, t, we predict that gene l regulates
gene j if λ
lj
>t and no relationships between them if otherwise. By comparing the
predicted regulatory relationship with the true underlying regulation relationship,
we have the following table.
Predicted Positive Predicted Negative
real positive True positive, TP False negative, FN R1
real negative False positive, FP True negative, TN R2
C1 C2 T
95
The standard performance measures for the classification problem based on these
four values are sensitivity (SN, also referred to as recall), precision, false positive
rate (FPR) and specificity defined as follows,
precision =
TP
TP +FP
, recall =sensitivity =
TP
TP +FN
,
FPR =
FP
FP +TN
, specificity = 1−FDR.
The ROC curve gives the relationship between sensitivity and false positive rate for
different threshold values of t. In the statistical learning community, an AUC score
from 0.75 to 0.92 is considered to be good, 0.92 to 0.97 to be very good, and 0.97
to 1.00 to be excellent.
4.3.2 Preliminary Results on Simulated Data
We implement the simulations above and the results are shown in Figure 4.1. According
to the figure, first, the AUC score increases as the number of genotypes, the number of
genotypes containing a particular gene pair, and the probability of activation increase.
These observations are consistent with our intuitions. Second, the values of α affect the
results dramatically. When α≥ 0.5, the AUC score is always above 0.96 with even only
100genotypes. Whenα = 0.3, theAUCscoreisabout0.95with150genotypes. However,
when α = 0.1 even with 200 genotypes and the AUC score is about 0.82, which means
we will not have enough power to identify the regulatory network in this situation.
96
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.4 0.8
a=0.1 k=1
1−spec: unnamed marker
sens: unnamed diagnosis
50:0.605773065492573
100:0.731021926319093
125:0.754406532368158
150:0.787728077033453
200:0.82022747022269
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.4 0.8
a=0.1 k=5
1−spec: unnamed marker
sens: unnamed diagnosis
50:0.663266071086691
100:0.75258113921857
125:0.753930935342996
150:0.790693874136745
200:0.820731223172858
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.4 0.8
a=0.3 k=1
1−spec: unnamed marker
sens: unnamed diagnosis
50:0.776757416159808
100:0.890531044977306
125:0.926666111261603
150:0.950542210561323
200:0.986723754931595
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.4 0.8
a=0.3 k=5
1−spec: unnamed marker
sens: unnamed diagnosis
50:0.861447997287968
100:0.907683028027109
125:0.938144708886532
150:0.956774908403343
200:0.986723754931595
0.0 0.2 0.4 0.6 0.8 1.0
0.2 0.6 1.0
a=0.5 k=1
1−spec: unnamed marker
sens: unnamed diagnosis
50:0.878583726258349
100:0.95870902853955
125:0.985429266650753
150:0.990937257778577
200:0.998594066396496
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.4 0.8
a=0.5 k=5
1−spec: unnamed marker
sens: unnamed diagnosis
50:0.953113716401414
100:0.977589378158706
125:0.991051925423732
150:0.991996878115075
200:0.998592983232187
Figure 4.1: Evaluation of the EM algorithm for estimating gene regulation relationships.
The receiver operating characteristic curves (ROC) show the relationship between the
false positive rate and sensitivity using the EM algorithm for different numbers of geno-
types, the number of genotypes with given gene pairs (k =1,5)
and the probability of detection (α =0.1,0.3,0.5).
97
4.4 Summary and Discussions
Inourpreliminarysimulation study,aprobabilisticmodelfortheregulatory relationships
between altered genes and perturbed genes is developed. An EM algorithm is employed
to estimate the probabilities of the regulatory relationships. In the future, we will further
develop and study the model and other computational approaches for inferring the regu-
latory relationships. In particular we will consider both false positives and false negatives
in the differential gene expression of the perturbed genes. We will also simulate more re-
alistic gene regulation relationship such as combinatorial controls in the underlying gene
regulation network. Our computational algorithms will be carefully evaluated to see if
we can recover such regulatory relationships even if the assumptions in our model are
violated.
98
Chapter 5
Future Work
In chapter 2,3 and 4, we have described our approaches to infer the gene transcriptional
regulatory networks using the DNA microarray expression data, mRNA sequencing data,
genome sequence data and allelic-specific expression data. Although these methods have
been shown to be accurate and powerful, there are still rooms for improvements. In this
chapter, we describe several directions for future work.
5.1 Identifying Differentially Expressed Transcripts from
RNA-seq Data
In chapter 2, we have described a powerful statistical method to estimate the expression
levels of the alternatively spliced mRNAs and identify differentially expressed mRNAs
from the RNA-seq data. Although the method has been shown to accurately estimate
the expression levels of the annotated spliced variants in AceView, there are many future
directions in which it can be improved.
99
5.1.1 Addressing the Coverage Bias
First, as we have mentioned, themodel in chapter 2 assumes that the reads are uniformly
sampled from the expressed variants which contradicts with the observations of uneven
coverage rate fromthedata. Possible reasons forthis bias includethefragmentation bias,
sequencing error and incomplete and inaccurate annotations of mRNAs in AceView.
For ourmodeltoaddressthecoverage bias, thepattern ofthebiasneedstobestudied
and experiments specifically designed can help achieve this goal. There are two ways to
study the bias. The first one is to summarize the observed bias and the second way is
to study the effect of different factors, such as the fragmentation process, the sequencing
errors and so on. By either way, the ultimate goal of the studies is to provide the
probability of generating a read across the different positions on the mRNAs. Then in
ourmodel, these probabilities can replace theuniformdistribution inourmodeltoobtain
moreaccurate predictions. Supposefromthestudiesonthecoverage bias, theprobability
of a random read generated by s
ij
with the position k on s
ij
as the first base of the read
is P
ijk
, then the probability in equation 5.1.1 can be rewritten as
P(r|r←s
ij
,Θ,S) =δ(r→s
ij
)·P
ijk
,
wherek is the position of the firstbase of the subsequence in s
ij
that match the sequence
of the read r. In this way, the pattern of the bias is directly involved in the model.
However, a model that address the effect of different factors on the bias may be more
accurate and easy to be integrated into our model.
100
5.1.2 Considering the Ambiguous Reads
The model in chapter 2 only considers reads that are mapped to mRNAs transcribed
from the same gene. In this way, there are a large number of reads being ignored which
are mapped to multiple genes. The reasons for this mapping ambiguity can be the se-
quencingerrors, the shortlength of thereads or the repetitive regions or singlenucleotide
polymorphisms (SNPs) in the human genome. With the development of the sequencing
technique, the sequencing errors are expected to decrease very fast and the length of the
read will increase as well. However, the structure of the human genome will not change.
Thus, although the proportion of the ambiguous reads is expected to decrease in the near
future, the ambiguous reads will exist for sure and the proportion of them is higher in
higher eukaryotes like human. Actually the model in chapter 2 can be directly applied to
consider the ambiguous reads. The likelihood function in equation (2.2) will not change
and the EM algorithm does not have to be changed either.
5.1.3 Novel Transcripts Discovery
As our method relies on the annotations in AceView, the annotation accuracy of the
database is important for the results. The simulation studies in chapter 2 have shown
that the results are accurate even when many unexpressed isoforms are included in the
analysis. Therefore, the incomplete and inaccurate annotations of AceView can be com-
pensate by enhancing the annotations with predicted mRNAs using the exon junctions
that have reads mapped to them or even the assembled template sequences from mRNA
sequencingdata with longer or pair-endedreads. However, thepossiblemRNAtemplates
can increase exponentially withthe numberof exon junctions. Onepotential way to solve
101
this is to construct a graph with nodes being the exons and edges being the exon junc-
tions. Then the possible template sequences can be grown from one of the end of the
graph by adding exons one by one. For each exon to be added, our method is applied to
estimate theexistence ofthecurrentlysets of transcripts. Thosetranscriptswithinsignif-
icant expression levels should be removed. In this way, the number of possible template
sequences can be significantly reduced.
5.2 Testing Gene Set Enrichment for Subset of Genes
In chapter 3, we developed a method to test the gene set enrichment for subsets of genes
and have shown its power to detect the set of genes associated with given phenotype.
Themethod,however, doesnotconsidertheinteractions between genesinthepre-defined
gene sets. For some of the pre-defined gene sets, genes inside are extracted from certain
type of molecule interaction networks, such as the signaling pathway, gene regulatory
network, etc. The information on the interactions between the genes may help improve
the predictions. Actually there have been several methods based on this idea. However,
they again consider the whole network instead of the sub-network inside. Under the
frameworkofourmethod,wecantrytoconsidertheinteraction structuresofthenetwork
and detect the associated gene sets by looking at all the sub-networks inside.
5.3 Gene Transcriptional Network Reconstruction
In chapter 4, an EM algorithm was designed to infer the gene transcriptional regulatory
networks from the gene expression data, allelic-specific expression data and sequence
102
alteration data. The preliminary simulation study have suggested the minimum sample
size necessary for the analysis. This study is still in a very beginning stage and there are
many directions that we can work on in the future.
5.3.1 Estimating regulatory relationship through penalized likelihood
There are several limitations to the approach utilized in our simulation study. One
is that the number of parameters is huge. For 400 genes, each genotype contains on
average about 400×1/15 = 27 altered genes. With I = 150 genotypes, there are roughly
27×I = 4050 altered genes. Thus, on average each gene will show up as an altered gene
in about 10 genotypes. With 200 genotypes, each gene shows as an altered gene in about
20 genotypes and almost all the genes can show up as an altered gene. Therefore, there
are about 400×400 = 16×10
4
parameters. In each genotype, we can observe whether a
perturbedgene is differentially expressed or not. Thus, we have roughly 400 observations
corresponding to roughly every gene as a perturbed gene. Therefore, the number of total
observations is about 400×I = 6×10
4
observations. The number of parameters is much
higher than the number of observations. It should be noted though that the parameter
matrix is very sparse. For each gene, the number of genes regulated by the gene is very
small, say less than 5% of the gene set, corresponding to 20 genes among the 400 genes.
There are only roughly 20×400 = 8000 non-zero parameters. Given this observation,
we will explore the use of Penalized likelihood approach such as LASSO ( [23,82,99]).
We will also extend the methods on estimating sparse correlation matrixes for Gausian
randomfield andgraphmodels toour situations ([4,54,97]). Thenumberof observations
(60,000) isabout7.5foldsofthenon-zerotermsintheparameterset(8,000). TheLASSO
103
approach and its extensions such as elastic net ( [99]) should be able to give reasonable
estimates of the non-zero parameters (Personal communications with Dr. Wenjiang Fu
at Michigan State University).
5.3.2 Estimating regulatory relationship through maximum parsimony
Themaximumlikelihoodapproachandthepenalizedlikelihoodapproachforinferringthe
regulatory relationship given above depend on several important assumptions including
the independent contributions of altered genes to the expression of perturbed genes and
independency of expression levels of the perturbed genes. These assumptions may be
violated in the real biological system. However, these assumptions are needed in order to
derive the EM algorithm and the penalized likelihood approach to infer the gene regula-
tory networks. We make these assumptions as a convenience for deriving the algorithms.
The final test will be based on whether we can recover the real networks.
To avoid making these assumptions, we propose a model-free approach for inferring
theregulatoryrelationship amongthegenes. Thebasicideaistousethesmallestnumber
of gene regulatory relationships to explain the observed interactions. Let x
lj
indicate the
potential for gene l regulating gene j. A differentially expressed perturbed gene j in the
i-th genotype is referred to as explained by these potentials if
X
l∈S
(i)
alt
x
lj
≥ 1.
104
We want to find the potential that 1) can explain all the differentially expressed per-
turbed genes across all the genotypes, and 2) that minimize the sum of all the potentials.
Mathematically, we want to solve the constrained minimization problem
min
X
lj
x
lj
with the constraints
X
l∈S
(i)
alt
x
lj
≥ 1, i = 1,2,··· ,I; j∈S
(i)
pert
,
where S
(i)
pert
indicates all the differentially expressed perturbed genes. This optimiza-
tion problem can potentially be solved using an optimization package in Matlab (The
MathWorks Inc.)
In the above formulation of the problem, we do not specifically consider false pos-
itives and false negatives in the detection of differentially expressed genes. In reality,
both false positively and false negatively differentially expressed genes are present in the
experiments. To overcome this problem, we use the idea that the potentials should 1)
explain the highest number of differentially expressed perturbed genes, and 2) should
make
P
l,j
x
lj
as small as possible. Obviously, these two objectives cannot be achieved
simultaneously. Instead, we introduce a parameter α and we try to find the potentials
x
lj
to minimize
X
lj
x
lj
+α×
I
X
i=1
X
j∈S
(i)
pert
I
X
l∈S
(i)
alt
x
lj
−1
,
where I(z) = 1if z ≥ 0 and I(z) = 0 otherwise. We will study the optimal way to
determine α.
105
The parsimony approach does not make specific assumptions regarding how the al-
tered genes regulate the perturbed genes and thus, supposedly, is robust to various gene
regulation mechanisms. However, the advantages and disadvantages of thelikelihood and
the parsimony approaches need to becarefully studied undervarious plausible regulatory
models. This will be carried out during the tenure of this application.
5.3.3 Considering nucleotide polymorphisms in the genotypes
Note that in the previous description we have collapsed coding and regulatory alterations
together, while in fact we will analyze them jointly and separately. In addition to the
information on expression on allele-specific expression in F
1
offspring, for the subset of
strains fully re-sequenced by others, we will also know all polymorphisms in regulatory
regions that could potentially result in the alteration of this gene expression. As in sev-
eral expression quantitative trait loci (eQTL) studies using both family and population
samples ( [13,17,20,32,68,75,76]), we can studythe relationship between gene expression
changes and genetic polymorphisms using standard statistical tools such as linear regres-
sion. While population structure is not of a great concern in generally panmictic D..
melanogaster population, extreme care will be taken to avoid false positive discoveries
between expression changes and genetic polymorphisms by taking population structure
into consideration (our close by colleagues have extensive experiences in dealing with
population structures in association studies; for instance [98]). We hope that associa-
tion studies between gene expression levels and genetic polymophisms in this study may
partially help us understand the cis and trans relationship in our sample.
106
5.3.4 Distinguishing the Direct and Indirect Connections in the Gene
Transcriptional Regulatory Network
For a given gene, the methods developed in chapter 4 can identify its down-stream genes.
Theinferred regulatory networks may contain cycles, e.g. A↔B (A and B regulate each
other) and contain direct as well as indirect regulations. For example, if A→ B→ C,
our algorithm should infer A→ B, A→ C, and B→ C. In this situation, A→ C is
indirect regulation. We will studymethods to identify the direct targets and to construct
the gene regulatory network. The following algorithm can be used to distinguish direct
from indirect regulations [86]. First, we construct a network containing all the genes of
interest and the inferred regulation relationships among these genes. Second, we identify
strongly connected components (SCC) in which any pairs of genes in each SCC can be
reached from each other. The genes in each SCC are combined to form one super-gene
to form a new network. Third, recursively delete direct edges between super-genes that
are also connected by a longer path [86]. The resulting network contains the smallest
number of edges satisfying the subset relationship inferred from our methods developed
above and should only include indirect targets.
107
References
[1] F Al-Shahrour, R D´ ıaz-Uriarte, and J Dopazo. Fatigo: a web tool for finding sig-
nificant associations of gene ontology terms with groups of genes. Bioinformatics,
20:578–580, 2004.
[2] R S Aloyz, S X Bamji, C D Pozniak, J G Toma, J Atwal, D R Kaplan, and F D
Miller. P53 is essential for developmental neuron death as regulated by the trka and
p75 neurotrophin receptors. The Journal of Cell Biology, 143:1691–1703, 1998.
[3] M Ashburner, C A Ball, J A Blake, D Botstein, H Butler, J M Cherry, A P Davis,
KDolinski,SSDwight,JTEppig,MAHarris,DPHill,LIssel-Tarver,AKasarskis,
S Lewis, J C Matese, J E Richardson, M Ringwald, G M Rubin, and G Sherlock.
Gene ontology: tool for the unification of biology. Nature Genetics, 25:25–29, 2000.
[4] O. Banerjee and L. El Ghaoui. Model selection through sparse maximum likeli-
hood estimation for multivariate Gaussian or binary data. The Journal of Machine
Learning Research, 9:485–516, 2008.
[5] S.Barberan-SolerandA.M.Zahler. Alternative splicingregulation duringC.elegans
development: splicingfactors asregulatedtargets. PLoS Genetics,4:e1000001, 2008.
[6] WTBarry,ABNobel,andFAWright. Significanceanalysisoffunctionalcategories
in gene expression studies: a structured permutation approach. Bioinformatics,
21:1943–1949, 2005.
[7] T Becker and M Knapp. A powerful strategy to account for multiple testing in the
context of haplotype analysis. American Journal of Human Genetics, 75:561–570,
2004.
[8] T Beißbarth and T P Speed. Gostat: find statistically overrepresented gene ontolo-
gies within a group of genes. Bioinformatics, 20:1464–1465, 2004.
[9] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical
and powerful approach to multiple testing. Journal of the Royal Statistical Society.
Series B (Methodological), pages 289–300, 1995.
[10] G F Berriz, O D King, B Bryant, C Sander, and Roth F P. Characterizing gene sets
with funcassociate. Bioinformatics, 19:2502–2504, 2003.
[11] B.J. Blencowe. Alternative splicing: new insights from global analyses. Cell,
126(1):37–47, 2006.
108
[12] N.J. Bray, P.R. Buckland, M.J. Owen, and M.C. O’Donovan. Cis-acting variation
in the expression of a high proportion of genes in human brain. Human genetics,
113(2):149–153, 2003.
[13] R.B. Brem and L. Kruglyak. The landscape of genetic complexity across 5,700
gene expression traits in yeast. Proceedings of the National Academy of Sciences,
102(5):1572–1577, 2005.
[14] C.D. Bustamante, A. Fledel-Alon, S. Williamson, R. Nielsen, M.T. Hubisz,
S. Glanowski, D.M. Tanenbaum, T.J. White, J.J. Sninsky, R.D. Hernandez, et al.
Natural selection on protein-coding genes in the human genome. Nature, 437:1153–
1157, 2005.
[15] J.C.Castle,C.Zhang,J.K.Shah,A.V.Kulkarni,A.Kalsotra,T.A.Cooper,andJ.M.
Johnson. ASD: a bioinformatics resource on alternative splicing. Nature Genetics,
40(12):1416–1425, 2008.
[16] T. Castrignano, R. Rizzi, I.G. Talamo, P.D.O. De Meo, A. Anselmo, P. Boniz-
zoni, and G. Pesole. ASPIC: a web resource for alternative splicing prediction
and transcript isoforms characterization. Nucleic Acids Research, 34(Web Server
issue):W440, 2006.
[17] V.G. Cheung, R.S. Spielman, K.G. Ewens, T.M. Weber, M. Morley, and J.T. Bur-
dick. Mapping determinants of human gene expression by regional and genome-wide
association. Nature, 437(7063):1365–1369, 2005.
[18] C.R.Cowles,J.N.Hirschhorn,D.Altshuler,andE.S.Lander. Detectionofregulatory
variation in mouse genes. Nature genetics, 32:432–437, 2002.
[19] G S Dbaibo, M Y Pushkareva, R A Rachid, N Alter, M J Smyth, L M Obeid, and
Y A Hannun. p53-dependent ceramide response to genotoxic stress. The Journal of
Clinical Investigation, 102:329–339, 1998.
[20] A.L. Dixon, L. Liang, M.F. Moffatt, W. Chen, S. Heath, K.C.C. Wong, J. Taylor,
E.Burnett, I.Gut, M. Farrall, et al. Agenome-wide association studyof global gene
expression. Nature genetics, 39(10):1202–1207, 2007.
[21] S W Doniger, N Salomonis, K D Dahlquist, K Vranizan, S C Lawlor, and B R
Conklin. Mappfinder: using gene ontology and genmapp to create a global gene-
expression profile from microarray data. Genome Biology, 4:R7, 2003.
[22] S Drˇ aghici, P Khatri, R P Martins, G C Ostermeier, and S A Krawetz. Global
functional profiling of gene expression. Genomics, 81:98–104, 2002.
[23] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals
of statistics, pages 407–451, 2004.
[24] B. Efron and D.V. Hinkley. Assessing the accuracy of the maximum likelihood
estimator: Observed versus expected Fisher information. Biometrika, 65(3):457–
483, 1978.
109
[25] B Efron and R Tibshirani. On testing the significance of sets of genes. The Annals
of Applied Statistics, 1:107–129, 2007.
[26] A. Friedman and N. Perrimon. Genetic screening for signal transduction in the era
of network biology. Cell, 128(2):225–231, 2007.
[27] M.A. Garcia-Blanco, A.P. Baraniak, and E.L. Lasda. Alternative splicing in disease
and therapy. Nature Biotechnology, 22:535–546, 2004.
[28] L E Giono and J J Manfredi. The p53 tumor suppressor participates in multiple cell
cycle checkpoints. Journal of Cellular Physiology, 209:13–20, 2006.
[29] J J Goeman and P B¨ uhlmann. Analyzing gene expression data in terms of gene sets:
methodological issues. Bioinformatics, 23:980–987, 2007.
[30] J J Goeman, J Oosting, A M Cleton-Jansen, J K Anninga, and van Houwelingen
H C. Testing association of a pathway with survival using gene expression data.
Bioinformatics, 21:1950–1957, 2005.
[31] J J Goeman, S A van de Geer, F D Kort, and H C van Houwelingen. A global
test for groups of genes: testing association with a clinical outcome. Bioinformatics,
20:93–99, 2004.
[32] H.H.H. G¨ oring, J.E. Curran, M.P. Johnson, T.D. Dyer, J. Charlesworth, S.A. Cole,
J.B.M. Jowett, L.J. Abraham, D.L. Rainwater, A.G. Comuzzie, et al. Discovery of
expression QTLs using large-scale transcriptional profiling in human lymphocytes.
Nature genetics, 39(10):1208–1216, 2007.
[33] P.J. Grabowski and D.L. Black. Alternative RNA splicing in the nervous system.
Progress in Neurobiology, 65(3):289–308, 2001.
[34] D. Hansen and T. Schedl. The regulatory network controlling the proliferation-
meiotic entry decision in the Caenorhabditis elegans germ line. Current Topics in
Developmental Biology, 76:6–6, 2006.
[35] B.M. Hersh, C.E. Nelson, S.J. Stoll, J.E. Norton, T.J. Albert, and S.B. Carroll. The
UBX-regulated network in the haltere imaginal disc of D. melanogaster. Develop-
mental biology, 302(2):717–727, 2007.
[36] J Hoh, A Wille, and J Ott. Trimming, weighting, and grouping snps in human
case-control association studies. Genome Research, 11:2115–2119, 2001.
[37] D A Hosack, G Jr Dennis, B T Sherman, H C Lane, and R A Lempicki. Identifying
biological themes within lists of genes with ease. Genome Biology, 4:R70, 2003.
[38] Z Jiang and R Gentleman. Extensions to gene set enrichment. Bioinformatics,
23:306–313, 2007.
110
[39] J.M. Johnson, J. Castle, P. Garrett-Engele, Z. Kan, P.M. Loerch, C.D. Armour,
R. Santos, E.E. Schadt, R. Stoughton, and D.D. Shoemaker. Genome-wide survey
of human alternative pre-mRNA splicing with exon junction microarrays. Science,
302(5653):2141–2144, 2003.
[40] D. Kampa, J. Cheng, P. Kapranov, M. Yamanaka, S. Brubaker, S. Cawley,
J. Drenkow, A. Piccolboni, S. Bekiranov, G. Helt, et al. Novel RNAs identified
from an in-depth analysis of the transcriptome of human chromosomes 21 and 22.
Genome Research, 14(3):331–342, 2004.
[41] P Khatri and S Drˇ aghici. Ontologcal analysis of gene expression data: current tools,
limitations and open problems. Bioinformatics, 21:3587–3595, 2005.
[42] C C Kim and S Falkow. Significance analysis of lexical bias in microarray data.
Genome Biology, 4:12, 2003.
[43] S Kim and D J Volsky. Page: Parametric analysis of gene set enrichment. BMC
Bioinformatics, 6:144, 2005.
[44] J.O. Korbel, A.E. Urban, J.P. Affourtit, B. Godwin, F. Grubert, J.F. Simons, P.M.
Kim, D. Palejev, N.J. Carriero, L. Du, et al. Paired-end mapping reveals extensive
structural variation in the human genome. Science, 318(5849):420, 2007.
[45] E.S. Lander et al. Initial sequencing and analysis of the human genome. Nature,
409:860–921, 2001.
[46] C.R. Landry, C.I. Castillo-Davis, A. Ogura, J.S. Liu, and D.L. Hartl. Systems-level
analysis and evolution of the phototransduction network in Drosophila. Proceedings
of the National Academy of Sciences, 104(9):3283, 2007.
[47] D M Levine, D R Haynor, J C Castle, S B Stepaniants, M Pellegrini, M Mao, and
J M Johnson. Page: Parametric analysis of gene set enrichment. Genome Biology,
7:R93, 2006.
[48] B. Lewin. Gene VIII. Pearson Prentice Halll Upper Saddle River, NJ, 2004.
[49] YLi,AJRaffo, LDrew, YMao, ATran,DPPetrylak, andRLFine. Fas-mediated
apoptosis is dependent on wild-type p53 status in human cancer cells expressing
a temperature-sensitive p53 mutant alanine-143. Cancer Research, 63:1527–1533,
2003.
[50] M Liu, A Liberzon, S W Kong, W R Lai, P J Park, I S Kohane, and S Kasif.
Network-based analysis of affected biological processes in type 2 diabetes models.
PLoS Genetics, 3:e96, 2007.
[51] H.S. Lo, Z. Wang, Y. Hu, H.H. Yang, S. Gere, K.H. Buetow, and M.P. Lee. Allelic
variation in gene expression is common in the human genome. Genome Research,
13(8):1855–1862, 2003.
111
[52] J.C. Marioni, C.E. Mason, S.M. Mane, M. Stephens, and Y. Gilad. RNA-seq: An
assessment of technical reproducibility and comparison with gene expression arrays.
Genome research, 18(9):1509, 2008.
[53] G. McLachlan and D. Peel. Finite mixture models. Wiley-Interscience, 2004.
[54] N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection
with the Lasso. Annals of statistics, 34(3):1436, 2006.
[55] B. Modrek and C. Lee. A genomic view of alternative splicing. Nature genetics,
30:13–19, 2002.
[56] V K Mootha, C M Lindgren, K F Eriksson, A Subramanian, S Sihag, J Lehar,
PPuigserver, ECarlsson,MRidderstr˚ ale, ELaurila, NHoustis, MJDaly, NPatter-
son,JPMesirov, TRGolub,PTamayo, BSpiegelman,ESLander,JNHirschhorn,
D Altshuler, and L C Groop. Pgc-1α-responsive genes involved in oxidative phos-
phorylaton are coordinately downregulated in human diabetes. Nature Genetics,
34:267–273, 2003.
[57] A. Mortazavi, B.A. Williams, K. McCue, L. Schaeffer, and B. Wold. Mapping and
quantifying mammalian transcriptomes by RNA-Seq. Nature methods, 5:621–628,
2008.
[58] S ¸ Nacu,RCritchley-Thorne,PLee, andSHolmes. Geneexpressionnetworkanalysis
and applications to immunology. Bioinformatics, 23:850–858, 2007.
[59] U. Nagalakshmi, Z. Wang, K. Waern, C. Shou, D. Raha, M. Gerstein, and M. Sny-
der. The transcriptional landscape of the yeast genome defined by RNA sequencing.
Science, 320(5881):1344, 2008.
[60] M A Newton, F A Quintana, J A den Boon, S Sengupta, and P Ahlquist. Random-
set methods identify distinct aspects of the enrichment signal in gene-set analysis.
The Annals of Applied Statistics, 1:85–106, 2007.
[61] H Ogata, S Goto, K Sato, W Fujibuchi, H Bono, and M Kanehisa. Kegg: Kyoto
encyclopedia of genes and genomes. Nucleic Acids Research, 27:29–34, 1999.
[62] S.G. Palusa, G.S. Ali, and A.S.N. Reddy. Alternative splicing of pre-mRNAs of
Arabidopsis serine/arginine-rich proteins: regulation by hormones and stresses. The
Plant Journal, 49(6):1091–1107, 2007.
[63] Q. Pan, O. Shai, L.J. Lee, B.J. Frey, and B.J. Blencowe. Deep surveying of alterna-
tive splicingcomplexity inthehumantranscriptomebyhigh-throughputsequencing.
Nature genetics, 2008.
[64] T. Pastinen and T.J. Hudson. Cis-acting regulatory variation in the human genome.
Science, 306(5696):647–650, 2004.
[65] P Pavlidis, J Qin, V Arango, J J Mann, and E Sibille. Using the gene ontology for
microarray data mining: A comparison of methods and application to age effects in
human prefrontal cortex. Neurochemical Research, 29:1213–1222, 2004.
112
[66] M. Pick, C. Flores-Flores, and H. Soreq. From brain to blood: alternative splicing
evidence for the cholinergic basis of Mammalian stress responses. Annals of the
New York Academy of Sciences, 1018(1 Stress: CurrentNeuroendocrineandGenetic
Approaches):85–98, 2004.
[67] JRahnenf¨ uhrer,FSDomingues, JMaydt, andTLengauer. Calculating thestatisti-
cal significance of changes in pathway activity from gene expression data. Statistical
Applications in Genetics and Molecular Biology, 3:16, 2004.
[68] E.E. Schadt, S.A. Monks, T.A. Drake, A.J. Lusis, N. Che, V. Colinayo, T.G. Ruff,
S.B. Milligan, J.R. Lamb, G. Cavet, et al. Genetics of gene expression surveyed in
maize, mouse and man. 2003.
[69] J. Shendure and H. Ji. Next-generation DNA sequencing. nature biotechnology,
26(10):1135–1145, 2008.
[70] A.D. Smith, Z. Xuan, and M.Q. Zhang. Using quality scores and longer reads im-
proves accuracy of Solexa read mapping. BMC bioinformatics, 9(1):128, 2008.
[71] E L Sonnhammer, S R Eddy, and Durbin R. Pfam: a comprehensive database of
protein domain families based on seed alignments. Proteins, 28:405–420, 1997.
[72] S.Stamm. Signalsandtheirtransductionpathways regulatingalternative splicing: a
newdimensionofthehumangenome. HumanMolecular Genetics,11(20):2409–2416,
2002.
[73] S.Stamm,J.J.Riethoven, V.LeTexier,C.Gopalakrishnan,V.Kumanduri,Y.Tang,
N.L. Barbosa-Morais, and T.A. Thanaraj. ASD: a bioinformatics resource on alter-
native splicing. Nucleic Acids Research, 34(Database Issue):D46, 2006.
[74] JDStorey. Adirectapproachtofalsediscoveryrates. JournaloftheRoyal Statistical
Society, Series B, 64:479–498, 2002.
[75] B.E. Stranger, M.S. Forrest, A.G. Clark, M.J. Minichiello, S. Deutsch, R. Lyle,
S. Hunt, B. Kahl, S.E. Antonarakis, S. Tavare, et al. Genome-wide associations of
gene expression variation in humans. PLoS Genet, 1(6):e78, 2005.
[76] B.E. Stranger, A.C. Nica, M.S. Forrest, A. Dimas, C.P. Bird, C. Beazley, C.E. In-
gle, M. Dunning, P. Flicek, D. Koller, et al. Population genomics of human gene
expression. Nature Genetics, 39(10):1217–1224, 2007.
[77] A Subramanian, P Tamayo, V K Mootha, S Mukherjee, B L Ebert, M A Gillette,
A Paulovich, S L Pomeroy, T R Golub, E S Lander, and J P Mesirov. Gene set
enrichment analysis: A knowledge-based approach for interpreting genome-wide ex-
pression profiles. Proceedings of the National Academy of Sciences of the United
States of America, 102:15545–15550, 2005.
[78] D.J. Sugarbaker, W.G. Richards, G.J. Gordon, L. Dong, A. De Rienzo, G. Maulik,
J.N. Glickman, L.R. Chirieac, M.L. Hartman, B.E. Taillon, et al. Transcriptome
113
sequencing of malignant pleural mesothelioma tumors. Proceedings of the National
Academy of Sciences, 105(9):3521, 2008.
[79] M.Sultan,M.H.Schulz,H.Richard,A.Magen,A.Klingenhoff,M.Scherf,M.Seifert,
T. Borodina, A. Soldatov, D. Parkhomchuk, et al. A global view of gene activity
and alternative splicing by deep sequencing of the human transcriptome. Science,
321(5891):956–959, 2008.
[80] D. Thierry-Mieg and J. Thierry-Mieg. AceView: a comprehensive cDNA-supported
gene and transcripts annotation. Genome Biology, 7(Suppl 1):S12, 2006.
[81] L Tian, S A Greenberg, S W Kong, J Altschuler, I S Kohane, and P J Park. Discov-
ering statistically significant pathways in expression profiling studies. Proceedings of
the National Academy of Sciences of the United States of America, 102:13544–13549,
2005.
[82] R. Tibshirani. Regression shrinkageand selection via the lasso. Journal of the Royal
Statistical Society. Series B (Methodological), pages 267–288, 1996.
[83] J Tomfohr, J Lu, and T B Kepler. Pathway level analysis of gene expression using
singular value decomposition. BMC Bioinformatics, 6:225, 2005.
[84] T.T.Torres,M.Metta,B.Ottenw¨ alder,andC.Schl¨ otterer. Geneexpressionprofiling
by massively parallel sequencing. Genome research, 18(1):172, 2008.
[85] B Vogelstein, D Lane, and A J Levine. Surfing the p53 network. Nature, 408:307–
310, 2002.
[86] A. Wagner. How to reconstruct a large genetic network from n gene perturbations
in fewer than n 2 easy steps. Bioinformatics, 17(12):1183–1197, 2001.
[87] E.T. Wang, R. Sandberg, S. Luo, I. Khrebtukova, L. Zhang, C. Mayr, S.F.
Kingsmore, G.P. Schroth, and C.B. Burge. Alternative isoform regulation in hu-
man tissue transcriptomes. Nature, 456(7221):470–476, 2008.
[88] A.P.M.Weber,K.L.Weber,K.Carr,C.Wilkerson, andJ.B.Ohlrogge. Samplingthe
Arabidopsistranscriptomewith massively parallelpyrosequencing. Plant physiology,
144(1):32, 2007.
[89] Q Wei. Pitx2a binds to human papillomavirus type 18 e6 protein and inhibits
e6-mediated p53 degradation in hela cells. The Journal of Biological Chemistry,
280:37790–37797, 2005.
[90] ZWeiandHLi. Amarkovrandomfieldmodelfornetwork-basedanalysisofgenomic
data. Bioinformatics, 23:1537–1544, 2007.
[91] Z Wei and H Li. Nonparametric pathway-based regression models for analysis of
genomic data. Biostatistics, 8:265–284, 2007.
114
[92] S.H. Williamson, M.J. Hubisz, A.G. Clark, B.A. Payseur, C.D. Bustamante, and
R.Nielsen. Localizing recent adaptive evolution in the humangenome. PLoS Genet,
3(6):e90, 2007.
[93] H.Yan,W.Yuan,V.E.Velculescu, B.Vogelstein, andK.W.Kinzler. Allelicvariation
in human gene expression. Science, 297(5584):1143–1143, 2002.
[94] X.Yan andF.Sun. Testinggenesetenrichmentforsubsetof genes: Sub-GSE. BMC
bioinformatics, 9(1):362, 2008.
[95] X. Yan, F. Sun, and T. Chen. TranSEQ: Identifying Differentially Expressed Al-
ternatively Spliced Transcripts using mRNA-seq Data . Genome Research, under
revision, 2009.
[96] C Ye and E Eskin. Discovering tightly regulated and differeitially expressed gene
sets in whole genome expression data. Bioinformaitcs, 23:e84–e90, 2006.
[97] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical
model. Biometrika, 94(1):19–36, 2007.
[98] K. Zhao, M.J. Aranzana, S. Kim, C. Lister, C. Shindo, C. Tang, C. Toomajian,
H. Zheng, C. Dean, P. Marjoram, et al. An Arabidopsis example of association
mapping in structured samples. PLoS Genet, 3(1):e4, 2007.
[99] H. Zou and T. Hastie. Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society Series B, 67(2):301–320, 2005.
115
Abstract (if available)
Abstract
Understanding the gene regulatory network has always been one of the important and challenging tasks to understanding the mechanisms behind different biological processes or behaviors of organisms. Development in biological experimental techniques has produced a deluge of data to help accomplish this task, including the microarray data, ChIP-chip data, sequencing data, etc. Based on these techniques, biologists have designed different experiments to investigate the regulatory networks from different aspects. However, due to the random effects during the production of these data, statistical and probabilistic models are needed to extract reliable regulatory relationship from the data. Moreover, integration of different sources of data is also critical for obtaining complete and accurate information regarding the biological mechanisms.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Investigating the evolution of gene networks through simulated populations
PDF
Scalable latent factor models for inferring genetic regulatory networks
PDF
The evolution of gene regulatory networks
PDF
Innovative sequencing techniques elucidate gene regulatory evolution in Drosophila
PDF
Integrative approach of biological systems analysis on regulatory pathways, modules, protein essentialities, and disease gene classification
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Investigating hematopoietic stem cells via multiscale modeling, single-cell genomics, and gene regulatory network inference
PDF
Statistical analysis of microarray data and functional genomics of yeast ageing
PDF
Integrative analysis of gene expression and phenotype data
PDF
Geometric interpretation of biological data: algorithmic solutions for next generation sequencing analysis at massive scale
PDF
Genome-wide association study of factors influencing gene expression variation and pleiotropy
PDF
Detecting and understanding differentiation of microarray expression data
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Alzheimer’s disease: dysregulated genes, ethno-racial disparities, and environmental pollution
PDF
Decoding the embryo: on spatial and genomic tools to characterize gene regulatory networks in development
PDF
Computational and experimental approaches for the identification of genes and gene networks in the Drosophila sex-determination hierarchy
PDF
Whole genome bisulfite sequencing: analytical methods and biological insights
PDF
The use of alignment-free statistics for the evolutionary study of study of 5' cis-regulatory sequences
PDF
Genome-wide studies reveal the isoform selection of genes
PDF
Genome sequencing and transcriptome analysis of the phenotypically plastic spadefoot toads
Asset Metadata
Creator
Yan, Xiting
(author)
Core Title
Statistical modeling of sequence and gene expression data to infer gene regulatory networks
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology
Degree Conferral Date
2009-08
Publication Date
08/07/2009
Defense Date
05/13/2009
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
data analysis,gene expression,inference,OAI-PMH Harvest,regulatory networks,sequence,statistical modeling
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Sun, Fengzhu (
committee chair
), Chen, Ting (
committee member
), Nuzhdin, Sergey (
committee member
), Schumitzky, Alan (
committee member
), Waterman, Michael S. (
committee member
)
Creator Email
xitingya@usc.edu,yanxiting@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m2542
Unique identifier
UC1493285
Identifier
etd-Yan-2928 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-175848 (legacy record id),usctheses-m2542 (legacy record id)
Legacy Identifier
etd-Yan-2928.pdf
Dmrecord
175848
Document Type
Dissertation
Rights
Yan, Xiting
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu
Tags
data analysis
gene expression
inference
regulatory networks
sequence
statistical modeling