Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Isoform quantification and splicing regulation analysis in RNA-seq studies
(USC Thesis Other)
Isoform quantification and splicing regulation analysis in RNA-seq studies
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Isoform Quantification and Splicing Regulation Analysis in RNA-Seq Studies
by
Jing Zhang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2013
Copyright 2013 Jing Zhang
Dedication
This thesis is dedicated to my family.
ii
Acknowledgments
First and foremost, I would like to thank my two advisors, Dr. C.-C. Jay Kuo and Dr.
LiangChen,fortheirsupportandguidanceduringmygraduatestudyinUSC.Duringthe
past five years, Dr. Liang Chen and Dr. Jay Kuo aroused my interest in computational
biology and provided detailed directions for each project. Their encouragement and
support, both scientifically and emotionally, led me throughout the most difficult times
of my life. It would be impossible for me to complete this thesis without them.
I would like to express my appreciation to Professor Michael Waterman and Professor
Fengzhu Sun for being in either my Ph.D. candidacy or dissertation committees. I want
to thank Dr. Jasmine Zhou, Dr. Andrew Smith, and Dr. Ting Chen for their help
in my Ph.D studies. I also appreciate the help and suggestions from colleagues, Dr.
Sudeep Srivastava, Dr. Shihua Zhang, Dr. Lin Wan, Cheyu Lee, Maoqi Xu, Dr. Min
Xu, Dr. Chao Dai, Dr. Xiting Yan, Dr. Sungje Cho, Dr. Fang Fang, Quan Chen,
Wangshu Zhang, and every other member in USC computational biology program and
media communication lab in EE department.
Finally, Iwouldespeciallythankmyhusband, mybrother, myfather, andmymother.
They gave me tremendous love, encouragement, and support. This dissertation is dedi-
cated to them.
iii
Table of Contents
Dedication ii
Acknowledgments iii
List of Tables vii
List of Figures viii
Abstract xi
Chapter 1: Introduction 1
1.1 Alternative Splicing and Gene Expression . . . . . . . . . . . . . . . . . . 1
1.1.1 Types of alternative splicing . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Splicing Regulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 RNA-seq technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Summary of my work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.1 Quantification of gene expression at the isoform level . . . . . . . . 9
1.3.2 Context based splicing regulations . . . . . . . . . . . . . . . . . . 9
1.3.3 Structure based splicing regulations . . . . . . . . . . . . . . . . . 10
Chapter 2: Gene Expression Quantification with Isoform Resolution from RNA-seq
Data 12
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Bias estimation in WemIQ . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Gene and isoform expression quantification in WemIQ . . . . . . . 15
2.2.3 Positionalandsequence-specificbiascorrectioninsingle-isoformgenes 17
2.2.4 Read mapping and expression estimation . . . . . . . . . . . . . . 18
2.3 Simulation data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Read generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 WemIQ improved isoform expression with perfect gene annotation 21
2.3.3 WemIQ provides robust isoform quantification despite incomplete
gene annotations, complicated gene structures, or low coverage . . 23
2.3.4 WemIQ provides accurate and robust exon inclusion rate estimation 25
2.4 Real data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
iv
2.4.1 WemIQ effectively removes the bias shown in single-isoform genes 26
2.4.2 Comparison with qRT-PCR . . . . . . . . . . . . . . . . . . . . . . 29
2.4.3 Cross platform comparison using ENCODE data . . . . . . . . . . 30
2.4.4 Comparison of single-cell and population-cell RNA-seq data . . . . 36
2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Chapter 3: Splicing Elements Discovery through Varying Effect Regression Model 44
3.1 Introduction to splicing regulatory elements discovery . . . . . . . . . . . 44
3.2 Existing methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 A varying effect model on splicing regulatory elements discovery . . . . . 46
3.3.1 Introduction on varying coefficient regression model . . . . . . . . 47
3.3.2 Estimation methods . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4 Application on real data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.2 Data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4.3 Exons considered in VERSE . . . . . . . . . . . . . . . . . . . . . 54
3.4.4 Fitting the VERSE model . . . . . . . . . . . . . . . . . . . . . . . 54
3.5 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.5.1 VERSE identifies a list of novel motifs from 16 human tissue RNA-
seq data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.5.2 Comparison with linear regression demonstrates the advantage of
VERSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.5.3 General and unique SREs display significant GC disparity . . . . . 62
3.5.4 Position and conservation preferences of predicted SREs . . . . . . 64
3.6 Software availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.6.1 Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.6.2 Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.7.1 Integration of conservation scores improves SRE prediction accuracy 67
3.7.2 Tissue,conservation,andpositionpreferencesofSREsdemonstrate
the complexity of splicing regulations . . . . . . . . . . . . . . . . 68
3.7.3 Brains demonstrate the highest regulation complexity through SREs 70
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Chapter 4: Structure Based Splicing Regulation 72
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2.1 Selection of splice sites . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3.1 Difference of pre-mRNA secondary structures between alternative,
constitutive, and skipped splice sites . . . . . . . . . . . . . . . . . 75
4.3.2 Difference of GC content between alternative, constitutive, and
skipped splice sites and its association with the stability of pre-
mRNA secondary structure . . . . . . . . . . . . . . . . . . . . . . 77
v
4.3.3 Splice sites of the first exons or long exons are GC enriched and
hence more stable in humans and mice . . . . . . . . . . . . . . . . 81
4.3.4 Tissue-specific alternative splice sites are GC enriched . . . . . . . 82
4.3.5 Selection on GC content around splice sites . . . . . . . . . . . . . 83
4.3.6 GC effect is more dominant than the nucleotide-order effect . . . . 85
4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Chapter 5: Conclusion and Future Work 91
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Bibliography 94
vi
List of Tables
2.1 Cross platform gene expression correlation of WemIQ, Cufflinks, and RSEM 32
2.2 Cross platform gene expression fold change of WemIQ, Cufflinks, and RSEM 39
3.1 Comparison of VERSE with linear regression in 16 human tissues/cell lines 63
vii
List of Figures
1.1 Alternative Splicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Types of alternative splicing . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Example of splicing regulation [18] . . . . . . . . . . . . . . . . . . . . . . 3
1.4 RNA-seq experiment [81] . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1 Flowchart of WemIQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Gene structures used for simulation . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Performance comparison of WemIQ, Cufflinks, and RSEM on simulated data 22
2.4 Exon inclusion rate estimation from the two isoforms gene case . . . . . . 25
2.5 Bias removal by WemIQ, Cufflinks, and RSEM . . . . . . . . . . . . . . . 27
2.6 Characteristics of gene expression from WemIQ, Cufflinks, and RSEM . . 30
2.7 Estimated λ from the ENCODE data sets. Small but significant λ
difference were observed in the two data sets. . . . . . . . . . . . . . . . . 33
2.8 Scatter plot of gene expressions of the ENCODE data. Red points
are the genes with more than two fold gene expression change. . . . . . . 34
2.9 Fold change of gene expressions from ENCODE project. A-B:
expression fold change of all expressed genes; C-D: fold change of top
genes; E-F: fold change of top isoforms. . . . . . . . . . . . . . . . . . . . 35
2.10 Bias parameters in single cells and population cells RNA-seq data . . . . 37
2.11 Comparison of estimation performance on single cell RNA-seq data . . . . 40
viii
3.1 Flowchart of VERSE methods to identify SREs . . . . . . . . . . . . . . . 52
3.2 Ordered p-values in brains and examples of the coeffcient func-
tions. (A, B) Ordered p-values from VERSE for the upstream and down-
stream intronic regions in brains. The y-axis is the absolute log transform
of regressionp-values, and the x-axis is the ranks of the hexamers. The top
10 ranked motifs in both regions are listed. (C, D) Coeffcient functions of
TGCATG for the downstream intronic regions in brains and muscles. The
x-axis is the baseline preference from the phyloP conservation score. The
larger values near 1 indicate the conserved TGCATG sites, while smaller
values near -1 are the newly evolved ones. . . . . . . . . . . . . . . . . . . 58
3.3 Heat maps of regulation similarity among 16 human tissues. Sim-
ilarity between each tissue pair is calculated as the percentage of shared
motifs. Hierarchical clustering is used to group tissues with higher level of
co-regulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.4 General and specific SREs by VERSE . . . . . . . . . . . . . . . . . . . . 64
4.1 Comparisonofstabilitydistributionofalternativesplicesitesand
constitutive or skipped splice sites in humans. A and B are for the
5sscomparison. CandDareforthe3sscomparison. altmeansalternative
splice sites, cons means constitutive splice sites, and skip means skipped
splice sites. Alternative splices sites tended to have more stable structures. 76
4.2 Distributions of the GC content of different splice sites in hu-
mans. Compared with constitutive and skipped splice sites, alternative
splice sites were significantly enriched with GC (the p-values based on the
Wilcoxon tests were all <2.2×10
−16
). . . . . . . . . . . . . . . . . . . . . 77
4.3 GC content negatively correlates with energy in humans. A-C
are for alternative, constitutive, and skipped 5ss. D-F are for alternative,
constitutive, and skipped 3ss. The GC content in all splice sites in humans
was negatively correlated with the predicted structural energy. The fitted
regression lines and the R2 values are listed to show the goodness of fit. . 79
4.4 GC content difference explains structural stability difference.
Boxplots of the energy are plotted (dots represent outliers). Alternative,
constitutive, and skipped splice sites in humans had similar stability given
the similar GC number (Wilcoxon tests for splice site energy comparison
within each GC group, P >0.5). . . . . . . . . . . . . . . . . . . . . . . . 80
4.5 Free energy and GC content comparison between tissue-specific and non-
tissue-specific alternative donor and acceptor sites in humans with boxplots. 83
ix
4.6 Comparison of GC percentage around real (red lines) and decoy
(black) splice sites in humans. In general, the sequences around the
real splice sites in humans were more GC enriched compared to the decoy
ones (0.50 vs. 0.48 at the donor sites, 0.47 vs. 0.46 at the acceptor sites,
Wilcoxon tests, 2.2×10
−16
). In addition, the exonic regions near the real
splice sites were more enriched in GC compared to regions far away from
the junctions, while no such enrichment has been observed near the decoy
splice sites. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.7 Stability of native and permuted sequences in humans with box-
plots A and B are for alternative splice sites. C and D are for constitutive
splice sites. E and F are for skipped splice sites.. . . . . . . . . . . . . . . 87
x
Abstract
The rapid advances in high-throughput sequencing technologies provide us an opportu-
nity to dissecttranscriptomes with unprecedented resolution. Based onRNA-seq studies,
alternative splicing has become more and more appreciated as a key mechanism in higher
eukaryotes to expand transcriptomes by generating multiple isoforms from a single gene.
Therefore, accurate quantification of transcript isoforms and discovery of splicing regula-
tion rules are important to understand biological processes such as tissue differentiation
and cell development, where alternative splicing plays an essential role. However, RNA-
seq experiments, which randomly sample the short fragments from the transcriptome,
exhibit distinct data characteristics from previous techniques and thus await the devel-
opment of powerful computational methods.
First and foremost, we present a weighted-log-likelihood expectation maximization
method on isoform quantification (WemIQ). WemIQ integrates fragment length informa-
tion and distributes reads among isoforms through an expectation maximization (EM)
algorithm. More importantly, the weighted log-likelihood is used to handle both inter-
and intra- gene bias. Simulation and real data analyses show that, compared with other
xi
methods, WemIQ significantly improves the quantification of isoform expression, exon in-
clusion ratios, and gene expression under a variety of gene structures and provides robust
mRNA measurements across different datasets for direct comparisons.
Then, we tried to decipher the splicing code from the context aspect through identifi-
cation of splicing regulatory elements (SREs). The fact that a variety of other biological
signals cooperatively govern the splicing pattern indicates the necessity of developing
novel tools to incorporate information from multiple sources to improve splicing factor
binding sites prediction. Under this context, we proposed a Varying Effect Regression
for Splicing Elements (VERSE) to discover intronic SREs in the proximity of exon junc-
tions by integrating other biological features. It may serve as a powerful tool to not only
discover the splicing elements by incorporating additional informative signals but also
precisely quantify their varying contribution under different biological context.
At last, we investigated the structure aspect of splicing regulation by performing a
genomic study of RNA secondary structures around splice sites in humans, mice, fruit
flies, andnematodes. WeobservethatGCcontentaroundsplicesitesiscloselyassociated
with the splice site usage in multiple species. RNA secondary structure is the possible
explanation, because the structural stability difference among alternative splice sites,
constitutive splice sites, and skipped splice sites can be explained by the GC content
difference. Alternative splice sites tend to be GC-enriched and exhibit more stable RNA
secondary structures in all of the considered species. In humans and mice, splice sites of
firstexonsandlongexonstendtobeGC-enrichedandhenceformmorestablestructures,
indicating the special role of RNA secondary structures in promoter proximal splicing
eventsandthesplicingoflongexons. Inaddition,GC-enrichedexon-intronjunctionstend
xii
to be overrepresented in tissue-specific alternative splice sites, indicating the functional
consequence of the GC effect.
xiii
Chapter 1
Introduction
1.1 Alternative Splicing and Gene Expression
Pre-mRNA splicing in eukaryotes removes introns and joins exons together [26], [50].
However, during this biological process, multiple mature mRNA isoforms may be gener-
atedfromthesamegenebydifferentcombinationsofexonsorthroughdifferentselections
of splice sites to further increase the final protein diversity [6], [53], [61].
Figure 1.1: Alternative Splicing
Eversinceitsfirstdiscoveryin1970s,alternativesplicinghasbeentheresearchhotspot
in the past forty years. First of all, alternative splicing is the major source of transcrip-
tome diversity in humans [6], [53], [78]. Recent next generation sequencing technique
1
shows that more than 95 percent of the protein coding genes experienced differential
splicing to generate multiple transcript isoforms. Further knowledge of splicing event
may help to understand a variety of biology events, such as gene expression and tissue
specificity [4]. Besides, alternative splicing is a common phenomenon from bacteria to
humans, although the prevalence of this event varies significantly across species, and thus
studies in these areas should be beneficial to comprehend the evolutionarily conserved
regulation mechanisms as well as the species difference [37]. In addition, there is increas-
ing evidence that aberrant recognition of the exon-intron junctions is closely relevant to
numerous fatal diseases [11], [74], [79]. For example, mutations that destroy the regula-
tors or change splicing patterns may explain cystic fibrosis, and related therapies have
been proposed accordingly [24].
1.1.1 Types of alternative splicing
According to the different types of splice site selection, five different alternative splicing
categories are generally recognized as in Fig. 1.2.
(1) Skipped exons: also called cassette exons, which may be spliced out completely in
some transcripts. It is the most popular type of alternative splicing event.
(2) Mutually exclusive exons: the two exons that are spliced out in some isoforms,
but not at the same time.
(3) Alternative donor/acceptor site: one junction of the exon is fixed in all isoforms
while the other end is varying across transcripts.
(4) Intron retain: part of the pre-mRNA sequence that would be spliced out in the
splicing process, but it is flanked by introns.
2
Figure 1.2: Types of alternative splicing
1.1.2 Splicing Regulation
Splicing event is catalyzed by the splicesome that is a large ribonucleoprotein complex
with several hundred proteins and five small nuclear RNAs [30], [34]. The recognition
of splice sites requires multiple mRNA regulatory proteins to bind to various splicing
signals in nascent pre-mRNA sequences. Splicing code, which signifies a combinatory
role of splicing regulators composed by numerous genomic features, has been more and
more widely mentioned to predict tissue-dependent changes in alternative splicing for
thousands of exons [80].
Figure 1.3: Example of splicing regulation [18]
Among these numerous features, scientists first realized that the sequences at the two
exon-intron junctions are far from random, but instead conserved and distinct for up to 9
bases at the donor site and 15 bases at the acceptor site [65]. These consensus sequences
3
at the splice sites would help to promote the recognition of several protein factors that
are essential for splicing events. For instance, U2AF would bind to the polypyrimidine
sequences of the 3’ site [8]. However, such consensus sequences alone provide inadequate
information for precise splice site recognition. One example is the existence of various
decoy splice sites, which also resemble the consensus sequence at the real splice sites, but
never experience splicing during their entire cell cycles. It provides convincing evidence
that consensus sequence alone, although important, is far from enough to guarantee
accurate splicing events [70].
Besides, the branch point is also playing a key role during the splicing process [27].
Up to 18 to 40 nt upstream of the 3’ site, the highly degenerated consensus sequence
YNYCRAYoftenexistsinhumanfortheadenosinetoattackthe5’splicesiteandpromote
the exon recognition. But the existence of branch point is a weak signal, not only in the
sense of its varying distance and nucleotide composition, but also because many 3’ splice
sites lacking in this sequence are still spliced out correctly. This has also been proved
by many delicately designed experiments that the inclusion of the predicted consensus
sequence at the branch point does not improve the recognition of nearby exons [70].
Later on, several proteins has been claimed to frequently promote or prohibit the
splicingprocessthroughbindingtosomeshort,degeneratesequencesnearthesplicesites.
Twowell-knownexamplesaretheSRproteinfamily[25],[41],[66],[72]andheterogeneous
nuclearribonuclearproteins(hnRNPs)[1]. Theseexperimentalresultsremindusthatthe
genetic information may also exist in the flanking exonic and intronic sequences near the
junction sites. The first example came in 1993 when Shimura and colleagues discovered
the regulatory elements in the exonic region [83], and later on numerous experiments
4
reported different individual examples to assist the exon recognition [11], [16]. Based on
the location and function of these short sequences near the junction sites, these splicing
regulatoryelements(SRE)couldbedividedintofourdistinctcategories: exonicenhancers
(ESE)andsilencers(ESS),intronicenhancers(ISE)andsilencers(ISS).Pointmutational
analysis shows that the mutations that destroy the key SRE signals would result in
erroneous splicing pattern and mal-functional protein products, confirming that they are
important parts of the splicing code [11].
Additionally, it has been reported that local RNA secondary structures affect splice
site selection through experimental observations from individual genes [9], [15], [19], [31],
[67]. With the growing amount of genomic data and tools available, more genome-wide
studies were carried out to support the hypothesis that pre-mRNA secondary structures
areinvolvedinthesplicingprocess. Forexample,Patterson et al. reportedthatthesplice
site prediction can be improved by adding the localized pre-mRNA secondary structure
information to the conventional sequence-based approaches [54]. Hiller et al. found
that some experimentally verified splicing enhancers and silencers near splice sites are
significantly enriched in the single-stranded regions of the local secondary structures
[31]. Conserved secondary structures in Drosophila genomes were identified and they
may modulate splicing regulation through long distance interactions. Shepard et al.
discovered that stable and conserved pre-mRNA secondary structures around splice sites
may promote alternative splicing to a large extent [67]. All of these results indicate that
the secondary structure of pre-mRNA is part of the mRNA splicing code [80].
In summary, a variety of genomic signals, including consensus sequences, mRNA
binding proteins, and secondary structures, contribute to accurate splice site recognition
5
in a highly coordinated way, and the understanding of such complicated rules awaits
further experimental and computational efforts.
1.2 RNA-seq technologies
Dissection and quantification of the transcriptome are important to interpret a variety
of biological problems, such as functional elements identification, cell development and
differentiation, as well as various human diseases, and several techniques have been used
to explore the mRNA products from the cell.
Microarray technologies were previously widely used for transcriptome dissection and
quantification[2], [62]. Varioustypesofmicroarrays, suchtheexontilingarrays, junction
arrays, and body arrays, are designed dedicatedly to prioritize the genes and measure
mRNA products conveniently. However, several key disadvantages limit its application.
First,whilethefunctionofthevastmajorityofhumangenomeregionsstillremainelusive,
microarray technology is highly dependent on gene annotations, so investigation of such
regions is extremely difficult. Second, the affinity of probes is usually highly sensitive to
SNPs and thus introduces huge bias to the final quantification. Additionally, the overall
signal to noise level is low, leaving the following downstream analysis very challenging.
Sanger sequencing techniques are also the one of the most widely used methods for
transcriptome analysis for more than twenty years. Instead of relying on hybridization
of the probes as in microarrays, it can directly capture the cDNA sequence with high
accuracy. However, the lower throughput and relatively higher cost severely limit its
application in genome scale studies, and gene or isoform quantification is difficult as
6
compared with other techniques. But it is still widely used for small scale projects and
especially effective on obtaining long reads (more than 500bp).
The emerging RNA sequencing technology (RNA-seq) are recently developed and
widely adopted in scientific field to either dissect or quantify the transcriptome [81].
The procedure of a RNA-seq experiment is shown in Fig. 1.4. First, the RNAs, either
from a single cell or from population cells, are converted to cDNA libraries, and then
sequenced from one (single-end reads) or both ends (pair-end reads). Millions or even
billions of reads can be generated in parallel for downstream analysis. Usually these
reads are mapped to the transcriptome/genome, or even used for de novo transcriptome
assembly, and detailed expression level with gene or isoform resolution can be provided
for investigators.
RNA-seqtechnologieshaveclearadvantageothermethods. First, itshighthroughput
and low cost makes it very attractive for genome scale transcriptome studies. It provides
us the opportunity to understand the transcriptome with unprecedented resolution. Be-
sides, RNA-seq is independent of gene annotations, which is convenient for organisms
whose functional genomic regions are mainly unknown. The transcripts structures could
beconstructedby de novoassemblydirectlywithsinglenucleotideresolution,andexpres-
sions can be estimated directly from the mapped read counts with higher accuracy [81].
GreatsuccesshasbeengainedbyRNA-seqtechnologiesonvariousorganismsforaccurate
transcriptome dissection and quantification.
7
Figure 1.4: RNA-seq experiment [81]
1.3 Summary of my work
The goal of this thesis is to understand how the genes are expressed at the mRNA
level, and what are the possible post-transcriptional expression regulation mechanisms.
Specifically, we first developed weighted-log-likelihood expectation maximization method
forisoformquantification(WemIQ)fromRNA-seqdatatobettermeasurethealternative
splicing events quantitatively. Then we studied the splicing regulation rules from two
aspects, the context and structure based regulations.
8
1.3.1 Quantification of gene expression at the isoform level
Expression of multiple mRNA isoforms from individual genes significantly contributes to
transcriptome diversity in higher eukaryotes, and is usually controlled precisely during
development or in response to extracellular stimuli. Quantification of each isoform is
fundamental to understanding gene regulation. However, the deconvolution of isoform
expression from RNA-seq remains challenging because of non-uniform read sampling and
subtledifferencesamongisoforms. Here,wepresentaweighted-log-likelihoodexpectation
maximization method on isoform quantification (WemIQ). WemIQ integrates fragment
length information and distributes reads among isoforms through an expectation maxi-
mization(EM)algorithm. Moreimportantly,theweightedlog-likelihoodisusedtohandle
both inter- and intra- gene bias. Distinct from previous methods in assuming certain bias
sources and formats, WemIQ removes the overall local bias estimated in a date-driven
manner without any presumption. Simulation and real data analysis show that, com-
pared with other methods, WemIQ significantly improves the quantification of isoform
expression, exon inclusion ratios, and gene expression under a variety of gene structures.
WemIQ can be downloaded from http://www-rcf.usc.edu/ liangche/softtemp.
1.3.2 Context based splicing regulations
Identification of splicing regulatory elements (SREs) deserves special attention because
these cis-acting short sequences are vital parts of splicing code. The fact that a variety of
other biological signals cooperatively govern the splicing pattern indicates the necessity
of developing novel tools to incorporate information from multiple sources to improve
splicing factor binding sites prediction. Under this context, we proposed a Varying Effect
9
Regression for Splicing Elements (VERSE) to discover intronic SREs in the proximity
of exon junctions by integrating other biological features [90]. As a result, 1562 intronic
SREs were identified in 16 human tissues, many of which overlapped with experimentally
verified binding motifs for several well-known splicing factors, including FOX-1, PTB,
hnRNP A/B, hnRNP F/H, and so on. The discovered tissue, region, and conservation
preferences of the putative motifs demonstrate that splice site selection is a complicated
process that needs subtle and delicate regulation. VERSE may serve as a powerful tool
to not only discover the splicing elements by incorporating additional informative signals
but also precisely quantify their varying contribution under different biological context.
1.3.3 Structure based splicing regulations
Alternative splicing increases protein diversity by generating multiple transcript isoforms
from a single gene through different combinations of exons or through different selections
of splice sites. It has been reported that RNA secondary structures are involved in
alternative splicing. Here we perform a genomic study of RNA secondary structures
around splice sites in humans, mice, fruit flies, and nematodes to further investigate this
phenomenon [89].
WeobservethatGCcontentaroundsplicesitesiscloselyassociatedwiththesplicesite
usage in multiple species. RNA secondary structure is the possible explanation, because
the structural stability difference among alternative splice sites, constitutive splice sites,
and skipped splice sites can be explained by the GC content difference. Alternative splice
sites tend to be GC-enriched and exhibit more stable RNA secondary structures in all
of the considered species. In humans and mice, splice sites of first exons and long exons
10
tend to be GC-enriched and hence form more stable structures, indicating the special
role of RNA secondary structures in promoter proximal splicing events and the splicing
of long exons. In addition, GC-enriched exon-intron junctions tend to be overrepresented
in tissue-specific alternative splice sites, indicating the functional consequence of the GC
effect. Compared with regions far from splice sites and decoy splice sites, real splice sites
are GC-enriched. We also found that the GC-content effect is much stronger than the
nucleotide-order effect to form stable secondary structures. All of these results indicate
that GC content is related to splice site usage and it may mediate the splicing process
through RNA secondary structures.
11
Chapter 2
Gene Expression Quantification with Isoform Resolution
from RNA-seq Data
2.1 Introduction
The rapid advances in high-throughput sequencing technologies provide us an opportu-
nity to dissect transcriptomes with unprecedented resolution [14], [49], [51], [81]. Based
on RNA-seq studies, alternative splicing has become more and more appreciated as a
key mechanism in higher eukaryotes to expand transcriptomes by generating multiple
isoforms from a single gene [6], [26], [52]. For example, it has been reported that up to
95 percent of human multiexon genes undergo alternative splicing [53]. Therefore, an
accurate quantification of transcript isoforms is important to understand gene regulation
through alternative splicing. It helps to pinpoint distinct transcript isoforms important
to biological processes such as tissue differentiation and cell development, where alter-
native splicing plays an essential role. However, the accurate estimation of transcript
isoform expression from RNA-seq data remains a challenge. Many state-of-the-art ap-
proaches initially assumed that short sequence reads in RNA-seq were uniformly sampled
12
Figure 2.1: Flowchart of WemIQ
from each transcript. However, after careful deliberation, lines of literature have shown
that the position-level read count often demonstrates larger variation than it would be
under the uniform assumption [44], [68]. As a result, read counts need to be adjusted
for gene abundance quantification. Add-on functions have been introduced to handle the
overdispersion in the methods of isoform expression estimation [43], [58]. These methods
13
assumed a constant bias factor for each relative position of genes or simply corrected the
sequence-specific bias caused by random hexamer priming. However, the overall bias is
complicatedandcausedbymultiplefactorsincludingmanyunknownfactors,andthebias
pattern can vary significantly across different genes and different protocols [28], [44], [58].
In light of these facts, the quantification of transcript isoforms awaits a more appropriate
handling of the bias in RNA-seq.
Inchapter2,weproposearobustisoform-expressionquantificationmethod: Weighted-
log-likelihood expectation maximization Isoform Quantification (WemIQ). Given the tar-
get gene annotation, WemIQ can accurately estimate mRNA products at both the gene
and the transcript isoform level from RNA-seq data. WemIQ uses the expectation-
maximization(EM)strategytodistributereadsamongdifferentisoformsandincorporates
the fragment length information of paired-end reads. More importantly, bias in RNA-seq
iscorrectedbyassigningdifferentweightstoreadsfromdifferentgeneregionswhencalcu-
lating the weighted-log-likelihood. Notably, we estimate the bias parameters (or weights)
without any presumption about the source and format of bias.
Todemonstratetheeffectivenessofourmethod, weappliedWemIQtobothsimulated
and real data sets. Simulation studies show that, in both transcript-isoform-centric and
exon-centric quantification, our approach significantly outperforms the state-of-the-art
software, such as Cufflinks [77], RSEM [43], and SpliceTrap [86]. Cufflinks and RSEM
quantify isoform expression, while SpliceTrap quantifies exon inclusion rates in an exon-
centric way. Unlike other simulation studies which generated reads based on the uniform
assumption[43],[51]orbasedonapipelineinconsiderationofonlyafewbiasfactors[77],
our simulations generated reads with various bias patterns to mimic the real situation.
14
Analysis on real data also shows that our bias correction is more effective, our expression
quantification is more accurate, and the expression estimates are more robust across
different labs and different platforms.
2.2 Methods
2.2.1 Bias estimation in WemIQ
First, we separated the gene into S non-redundant virtual exons. An overlapped exon
would be split into multiple virtual exons (see Fig. 2.1). Let Xs represent the posi-
tion level read counts for the virtual exon s, and it follows a generalized Poisson (GP)
distribution with
P (X
s
=x)=
θ
s
(θ
s
+xλ
s
)
x−1
e
−θs−xλs
/x!,x=0,1,2,···
0,x>qifλ
s
<0,
(2.1)
where θ
s
> 0, max(−1, −θ
s
/q) ≤ λ
s
≤ 1, and q(≥4) is the largest positive integer for
which θ
s
+qλ
s
> 0 when λ
s
< 0. Here, λ
s
represents the overdispersion and θ
s
is the
expression after bias removal.
2.2.2 Gene and isoform expression quantification in WemIQ
Let g denote the gene of interest, which contains m transcripts. L
j
represents the length
of the j
th
transcript isoform, j = 1···m. Then the effective length (i.e. the number of
potential fragment starting positions) of each transcript can be approximated by L
′
j
=
L
j
−E(l), where E(l) is the average fragment length (either inferred from reads mapped
15
to single-isoform genes or specified by users). The set of read pairs mapped to this gene
is R ={r
1
,···r
i
,···r
n
}, with b
i
and e
i
as the beginning and ending mapping position of
the read pair r
i
. The unknown parameters τ = {τ
1
,···τ
i
,···τ
n
} represent the fraction
of reads coming from each isoform. Assuming latent variables [τ = {τ
1
,···τ
i
,···τ
n
} to
indicate the reads’ origins of transcript isoforms, π
i
=j if r
i
comes from isoform j. Then
the complete log likelihood of read r
i
under the parameter set τ.
log(P{B
i
=b
i
,E
i
=e
i
,π
i
=j|τ})
=log(P{B
i
=b
i
|π
i
=j,τ}P{E
i
=e
i
|B
i
=b
i
,π
i
=j,τ}P{π
i
=j|τ})
=log
1
L
′
j
×P{l
i,j
}×τ
I(π
i
=j)
j
,
(2.2)
where l
i,j
denotes the fragment length of r
i
if the read comes from isoform j and its
distribution is known (inferred based on single-isoform genes or specified by users). To
correct the bias, we assigned smaller weights for reads from regions with large bias:
w
i
= 1−λ
S
b
i
, where λ
S
b
i
is the estimated overdispersion parameter for the virtual exon
that the read mapped to. So the total weighted complete log-likelihood is:
weighted log(P{R,π|τ})=
n
X
i=1
m
X
j=1
w
i
I(π
i
=j)log
1
L
′
j
×P{l
i,j
}×τ
j
, (2.3)
Then the expectation and the maximization steps are iterated to estimate the unknown
parameter tau. The expression level of isoform j is then estimated as:
θ
j
=
n
X
i=1
w
i
P{π
i
=j|ˆ τ
j
,b
i
,e
i
}=
n
X
i=1
w
i
1
L
′
j
×P{l
i,j
}×τ
j
m
P
j
′
=1
1
L
′
j
′
×P{l
i,j
′}×τ
j
′
(2.4)
16
The gene expression is estimated as the sum of the isoform expression: θ =
m
P
j=1
θ
j
.
The relative expression of isoform j is
P
j
|δ
j
−
ˆ
δ
j|
δ
j
. The relative estimation error can
be expressed as
ˆ
δ
j
=
θ
j
/
θ
, where δ
j
is the true isoform percentage in simulations. For
direct comparison across different data sets, we defined the amount of mRNA transcripts
per million reads as
θ
′
=
θ
k,j
×
˜
L
k,j
P
K
k=1
P
m
k
j=1
θ
k,j
×
˜
L
k,j
×1000000 (2.5)
2.2.3 Positional and sequence-specific bias correction in single-isoform
genes
For the positional bias removal, we divided positions of each single-isoform gene into 20
binsfrom5
′
to3
′
ofthegene. Theaveragenumberofreadcountsx
k
forthek
th
binacross
all the single-isoform genes was calculated. Then the positional bias w
p,k
for positions in
the x
k
bin was
w
p,k
=
1
20
P
20
k=1
x
k
x
k
. (2.6)
For the sequence-specific bias removal, we treated the preceding 6 nucleotides (nt) of
the first read end and the following 6 nt of the second read end as the primers. Let
c(p
i
),i=1,··· ,n
p
, denote the occurrence of primer p
i
in the reads mapped to the single
isoform genes. Then the sequence based bias for read i was
w
s,i
=
1
np
P
np
i
′
=1
c(p
i
′)
c(p
i
)
. (2.7)
Hence, read count x
i
was adjusted according to the weights: x
′
i
= x
i
×w
p,bin(i)
×w
s,i
where bin(i) denotes the relative position bin that the position i belongs to. For our
17
model, we estimated the overdispersion parameter for the virtual exon s as λ
s
from the
GP model, and the corrected read count was x
′′
i
= x
i
×(1−λ
s
i
), where s
i
denotes the
virtual exon that the position i belongs to.
2.2.4 Read mapping and expression estimation
We aligned sequence reads to the reference genome using TopHat (version 2.04) [76].
The resultant SAM files were taken into WemIQ and Cufflinks (version 2.02) for isoform
quantification. RSEM(version1.18)andSpliceTrap(version0.90.5)tooktherawreadsas
inputandperformedthealignmentinternallybycallingBowtie[42]. Theimplementation
of WemIQ is efficient. Starting from SAM files, it only takes WemIQ 8 minutes and less
than 2Gb memory to estimate gene and isoform expression based on a total of 5.6 million
read pairs.
2.3 Simulation data
2.3.1 Read generation
Due to alternative splicing, different exons of a single gene can be expressed differently.
A robust isoform quantification method should distinguish the expression heterogeneity
across exons from the bias inherent to RNA-seq protocols. Our approach combines the
bias removal and the EM algorithm to provide accurate expression estimations at the
isoform resolution, as shown in the flow chart in Fig. 2.1. We performed simulations
to test whether WemIQ has advantages over the other methods. Instead of simulating
uniformly distributed reads, we simulated read counts with overdispersion to mimic the
18
Figure 2.2: Gene structures used for simulation
real situation. GP distribution was recently proven to effectively model the over dis-
persed RNA-seq data [68]. We first assigned different bias parameters along the gene and
19
simulated reads by GP distribution. It was demonstrated from these simulation results
that WemIQ can significantly improve the isoform percentage estimation compared with
Cufflinks and RSEM (results not shown). However, for a fair comparison, the negative
binomial instead of the GP distribution was used to generate the reads in the following
simulation,eventhoughthelatterhasbeenshowntobetterfittheread-countdistribution
in real RNA-seq data [68].
To evaluate the performance of WemIQ, we simulated RNA-seq datasets and com-
pared WemIQ with three popular software packages: Cufflinks, RSEM and SpliceTrap.
We first simulated a two-isoform gene model, where a longer transcript with five 250-bp
exons was generated and the second exon was skipped as a cassette exon to form the
shorter isoform. For the three-isoform gene models, two isoforms with cassette exon 2
and 4 were generated, respectively. To show the robustness of model, we also generated
gene structures with transcripts missing from the annotation, or gene structures compli-
cated with both alternative splice sites and exon skipping events. These gene structures
were shown in Fig. 2.2.
We used the negative binomial distribution NB(r,p) to simulate the reads. The
negative binomial distribution NB(r,p) can be treated as a gamma-Poisson mixture: a
Poisson(θ)distributionwhoseθitselfisarandomvariableanddistributedasGamma
r,
p
(1−p)
.
The mean of NB(r,p) is
rp
(1−p)
and the variance is
rp
(1−p)
2
. Therefore, we can vary p to
simulate different degree of overdispersion. For exons shared between isoforms, the ratio
oftheirisoform-specificexpressioncanberepresentedbytheratioofrsbecausepremains
the same for the two isoforms. The negative binomial distribution has been used to fit
RNA-seq data [59].
20
When we simulated the reads, p
i
was sampled uniformly from 0.75 to 0.95 for exon
i and kept the same for both isoforms if the exon was shared between the two isoforms.
Thenwesetr =1fortheminorisoform,andsampledatranscriptexpressionratiof with
equal probabilities from 2,3,4 with replacement. Therefore, in the major isoform, reads
for exon i follows the distribution NB(f,p
i
). To simulate the low-coverage cases, the r
value was set to be 0.05. A total of 1,000 cases were simulated in two-isoform gene model
withboththehighandthelowcoverage,aswellasthemodelwithincompleteannotation.
For other gene models, 500 cases were simulated. For the high coverage scenarios (r =1
for minor isoforms), the average position-level read count was 19.35 to 22.28, equivalent
to the 0.993 percentile in RNA-seq data from ENCODE (see Section 2.4.3). For the low
coveragescenarios(r =0.05forminorisoforms), theaverageposition-levelreadcountwas
0.96 to 1.11, equivalent to the 0.904 percentile in RNA-seq data from ENCODE.
After the reads were generated for each exonic position, the fragment length was
assigned for each 50-bp paired-end read using the Gaussian distribution with the mean
of 236.49 and the standard deviation of 23.55. The parameters were inferred from the
single-isoform genes in the human tissue RNA-seq datasets. If the ending position was
beyondthecorrespondingtranscript, thisfragmentlengthwasdefinedasinvalid. Upto5
trials were performed to find a valid fragment length. Otherwise, the read was discarded.
2.3.2 WemIQimprovedisoformexpressionwithperfectgeneannotation
Westartedfromasimplegenemodelwithtwotranscriptisoforms: thelongeronecontains
five 250-nt exons; while the shorter isoform lacks the second exon (Fig. 2.2). In total
21
Figure 2.3: Performance comparison of WemIQ, Cufflinks, and RSEM on simulated data
27,847,470 50bp pair-end reads were generated randomly from the whole genome. The
boxplots of the relative errors show that WemIQ significantly outperforms Cufflinks and
RSEM no matter when the longer or the shorter isoform is the minor isoform (Fig. 2.2
A). We also turned on the bias correction for Cufflinks and RSEM. The bias correction
in RSEM improved its performance while Cufflinks reported similar results regardless of
its bias removal. Hereinafter, we used the default bias correction options in RSEM and
Cufflinks in comparison with WemIQ. More specifically, when the shorter isoform was
the minor one, the mean relative error for WemIQ was 0.198 as compared with 0.382 for
Cufflinks and 0.355 for RSEM (P <2.2×10
−16
, paired t tests)
22
We extended the two-isoform gene model to a three-isoform model by adding a third
isoform in which the fourth exon was skipped (Fig. 2.2 B). Similarly, we first selected
one isoform as the minor one and set its r equal to one, and then randomly selected two
expression ratios f
1
and f
2
for the remaining two isoforms. The relative errors of isoform
expression percentage estimates were summed up to evaluate the overall performance.
Consistent with the two-isoform case, WemIQ greatly improves the estimation of relative
expression of isoforms (Fig. 2.3 B). The mean relative errors were around 0.384-0.578
for WemIQ while they were as high as 0.741-1.102 for the other two methods. When
the two shorter isoforms were expressed at least 2 fold more (f ≥ 2) than the longer
isoform (the three leftmost boxplots in Fig. 2.3B), the isoform quantification became
more challenging. This is possibly due to the limited number of the read fragments that
span the two cassette exons and uniquely belong to the longer isoform. The chance for
Cufflinks and RSEM to falsely declare the minor isoform was as high as 0.180 and 0.266,
while the chance was only 0.042 for WemIQ.
2.3.3 WemIQprovidesrobustisoformquantificationdespiteincomplete
gene annotations, complicated gene structures, or low coverage
Although RNA-seq provides an opportunity to reconstruct the transcriptome in a rapid
and cost-effective way, it has been shown that in a typical RNA-seq experiment with 23.4
million raw reads, a quarter of the transcripts are not covered at all, and over half of the
remaining ones are only partially covered [59]. As a result, it is extremely challenging to
perform ab initio discovery of a full and accurate transcript annotation from a RNA-seq
dataset. On the other hand, even for the Ensembl annotation, which may be by far
23
the most complete annotation with 56,299 genes and 190,053 transcripts, it still likely
misses out some isoforms due to their weaker expression. To understand how incomplete
annotation would affect isoform quantification, we tested the performance of our model
and others in simulation studies where the annotation missed some existing isoforms.
There are many possibilities how a missing isoform may affect quantification of the
known isoforms. The scenario that makes the deconvolution more challenging is proba-
bly when the missing isoform is much similar to one of the known isoforms. To simulate
such a case, we built on top of the two-isoform gene model, and assumed a third isoform
missing in the annotation but similar to the longer (or the shorter) transcript which use
an alternative splice site at the fourth exon. Like cassette exon, alternative splice site se-
lection is another type of prevalent alternative splicing events. We first simulated to have
the two alternative splice sites separated by 100 nucleotides. We then estimated relative
expressionlevelsofthetwoannotatedisoformswithWemIQ,CufflinksandRSEM.When
the missing isoform was truncated from the shorter one, the mean relative error for the
two known isoforms was 0.393 for WemIQ, but increased to 0.438 and 0.449 for Cufflinks
and RSEM, respectively. When the missing isoform was truncated from the shorter one,
WemIQ still significantly outperformed RSEM and Cufflinks (Fig. 2.3C).
When the distance between two alternative splice sites are smaller, it would become
even more challenging to infer the isoform expression. To test whether WemIQ can
handle subtle difference between isoforms, we decreased the distance between alternative
splice sites to only 15 nt difference (Fig. 2.3D). We designated these isoforms as the
long isoform, the isoform with alternative splice sites, and the short isoform. And we
simulated three different scenarios where each of them was the minor isoform. Compared
24
to the other methods, WemIQ demonstrates an improved performance. For example, the
mean relative estimation error of WemIQ was 0.434 when the long isoform was the minor
one. The error increased to 0.797 and 0.657 in Cufflinks and RSEM respectively.
Despite the advances of sequencing technologies, some regions or genes still have
limited sequence fragments, either because of low expression levels or lower mappability
(e.g. repetitive regions). Therefore, we simulated another group of low-coverage data
with the average position-level read count of 1.10 based the two-isoform gene model.
WemIQ still significantly outperforms other methods (Fig. 2.3E, P <2.2×10
−16
, paired
t tests).
2.3.4 WemIQprovidesaccurateandrobustexoninclusionrateestimation
Figure 2.4: Exon inclusion rate estimation from the two isoforms gene case
Exon-centric studies (e.g. of splicing regulation) requires accurately inferring the
inclusion rates of cassette exons. We compared the performance of WemIQ, Cufflinks,
25
RSEM,andSpliceTraponthistaskinthetwo-isoformmodel(Fig. 2.2A).Specifically, we
considered six scenarios where the simulated true inclusion rates were 1/5, 1/4, 1/3, 2/3,
3/4, and 4/5, respectively (the red lines in Fig. 2.4). In all cases, WemIQ outperformed
Cufflinks, RSEM, and SpliceTrap by providing more accurate estimation of inclusion rate
withsmallervariation(Fig. 2.4). Thiswasmostobviouswhentheexoninclusionlevelwas
relatively low (i.e. 1/5-1/3), possibly due to the challenge caused by the lower sequence
coverage on cassette exons. In general, Cufflinks and RSEM tended to overestimate the
inclusion rates and SpliceTrap underestimate the inclusion rates. In contrast, WemIQ
consistently provided more accurate estimations under different circumstances (Fig. 2.4).
In studies of splicing regulation, a difference of 0.1 in the inclusion rate is usually of high
interest. We found that 18.85%, 21.60%, and 40.85% of the exons analyzed by Cufflinks,
RSEM and SpliceTrap respectively, exhibited an inclusion rate deviating from the true
rate by more than 0.1. These deviations were further compounded by a large variation
of the estimates (Fig. 2.4). Conversely, only 1.35% of cassette exons in WemIQ had such
error and the variation was much smaller.
2.4 Real data analysis
2.4.1 WemIQeffectivelyremovesthebiasshowninsingle-isoformgenes
Theheterogeneityofreadcountsalongthetranscriptomecomesfromthreemainsources:
(1) expression levels vary among genes; (2) within a single gene, regions shared by multi-
ple isoforms are expected to contain more reads than regions specific to a single isoform;
26
(3) the inherent experimental bias of the RNA-seq protocol results in differential sam-
pling of reads from different regions. Our goal is to remove the bias to more precisely
estimatethegeneandthetranscriptisoformexpression. Todemonstratethebiasandthe
Figure 2.5: Bias removal by WemIQ, Cufflinks, and RSEM
inappropriateness of current bias removal methods, we studied genes that have only one
isoform by both Ensembl and Refseq annotations. Heterogeneity of read counts within
individual single-isoform gene should reflect only RNA-seq bias. Paired-end reads from
fourhumantissuesamples(IlluminaBodyMap2transcriptome)weremappedtoboththe
human genome and transcriptome by TopHat [76] and the read count for each position
was calculated. If read fragments are sampled uniformly along the single-isoform genes,
27
the position-level read count is expected to follow a Poisson distribution. We performed
the Kolmogorov-Smirnov (KS) test to compare the observed read-count distribution with
the expected Poisson distribution. A KS statistic near zero indicates good fitting, while a
lager KS statistic indicates larger deviation from the Poisson distribution, or more severe
bias. The solid lines (no bias correction) in Figure 2.5 clearly show severe deviations
from the uniform sampling as a majority of genes display a large KS statistic. In the
kidney sample, the distributions of read counts for 68.70% of the single-isoform genes
were significantly different from the Poisson distribution (P <0.05, KS tests). This per-
centage was consistently high in all other samples (58.41%, 59.82%, and 58.39% for the
brain, muscle, and liver samples, respectively). We then corrected the sequence-specific
bias from random hexmer priming and the bias from relative positions as those in [28],
[87] in order to understand whether these biases could account for most, if not all, of
the deviations (details in Methods). To our surprise, the deviation from the underlying
uniform assumption was even worse (the dash-dot lines in Fig. 2.5). Notably, Cufflinks
has recently changed to use only the sequence-specific bias correction in light of the re-
duced accuracy caused by positional bias correction. We therefore tested correcting the
sequence-specific bias alone. In that case, the performance was still worse than without
any correction and similar to when both types of bias were removed (the dotted lines in
Fig. 2.5 and they overlapped the dash-dot lines with slight differences).
Differentfromtheotherapproaches, WemIQusesthegeneralizedPoisson(GP)-based
GPseq model [68] which captures the local bias directly from the read-count distribution
without specifying the bias source. Specifically, we estimated the overdispersion param-
eter for each exon and corrected the observed position-level read counts of that exon by
28
a weight of 1−λ. As shown in Figure 2.5, our correction (dashed lines) more effectively
removesthebiasinallconsideredtissues. Forexample,inthemuscletissue,upto90.04%
ofthesingle-isoformgeneshadaKSstatisticlessthan0.1afterbiascorrectionbyGPseq.
This percentage decreased to 52.44% with the positional and sequence-specific bias cor-
rections. As expected, the degree of overdispersion was different for different exons of the
same transcript, showing a large range of λs. The largest λ difference for each gene had
a median of 0.204-0.318 in the considered tissues. Additionally, different genes exhibited
different bias. For instance, the median for a gene could be as high as 0.947 or as low
as 0.208 in the kidney tissue. These results show that the read starting site deviates
severely from a uniform distribution, and the deviation varies from regions to regions
even within the same transcript. The bias needs to be considered in estimating gene and
isoform expression. Toward this goal, our GPseq-based bias correction performs much
better than the positional and sequence-specific bias correction.
2.4.2 Comparison with qRT-PCR
In addition to the advantages in the estimation of relative isoform expression and exon
inclusion rates, WemIQ also provides more accurate overall gene expression estimation
(i.e. the sum of isoform expressions) and the subsequent absolute isoform expression
estimation. Due to the lack of benchmark dataset, it is difficult to directly compare
the isoform expression measurements. Instead, we used the TaqMan qRT-PCR results
on approximately 1,000 genes in the Microarray Quality Control (MAQC) Project as a
benchmark for gene expression measurements (20). Then we applied WemIQ, Cufflinks,
and RSEM on a set of 50bp paired-end reads from the human brain sample used in
29
the qRT-PCR experiments and compared the estimates from both platforms (21). We
required that for each gene, at least 75% of the qRT-PCR replicates had a detectable
expression and WemIQ, Cufflinks, and WemIQ all provided a reliable estimate. Finally,
526geneswerecomparedacrossmethodsandplatforms. ThecorrelationoftheqRT-RCR
data and WemIQ was 0.739, higher than those of Cufflinks (0.681) and RSEM (0.700),
indicating an improved overall gene expression estimation over other methods.
2.4.3 Cross platform comparison using ENCODE data
Figure 2.6: Characteristics of gene expression from WemIQ, Cufflinks, and RSEM
30
We downloaded four RNA-seq datasets on the GM12878 cells from the ENCODE
projecttocomparetheconsistencyofestimatesacrosstechnicalreplicatesandlabs(GEO
Accessioncode: GSM958728andGSM758559)[56]. Twoofthedatasets(technicalrepli-
cates)fromtheCaltechlab(labA)were75bppair-endreadsandtheothertwo(technical
replicates) from the Cold Spring Harbor lab (lab B) were 76bp pair-end reads preserving
the strand information. The gene/isoform expressions were estimated according to the
Refseq annotation by WemIQ, Cufflinks, RSEM, and then normalized by the total sum
of estimations. Transcripts less than 150nt were removed from the Refseq annotation.
Highly expressed genes were defined as the top 4 percent genes according to the gene
length normalized read counts from all four data sets. However, on the isoform level,
such read counts may not reflect the expression for each single transcript. Hence, for
each method we selected the top 10 percent highly expressed isoforms as the candidates
for each data and then got the union of these candidates (1213 transripts) to calculate
the isoform expression fold changes.
Before comparison among multiple methods, we found that several top ranked genes
take up to 40% of the total expression in all four datasets from Cufflinks, but are barely
expressed according to RSEM and WemIQ estimations . After careful deliberation, these
genes are short microRNAs or small nuclear RNAs that are usually less than 100bp in
length,withonlyoneofthepair-endreadsmappedtotheseveralbeginningpositionsfrom
the whole transcript. But in WemIQ only concordant pair-end reads were considered, so
these genes were recognized as lowly expressed. Hence, in the following analysis, we
removed such genes in the annotation files for a fair comparison. After removing these
short microRNAs and small nuclear RNAs, a small group of genes in Cufflinks estimation
31
still contribute a large fraction to the total expression (Fig. 2.6). For example, the top 30
genes take up to 20% to 30%, while they only contribute to 4.7% to 6.2% from WemIQ.
One of the possible reason for such discrepancy is that estimates from WemIQ are not
sensitive to the extreme mRNA read stacks mapped to a certain region in some genes,
because it shrunk the average read count by a factor of 1 − λ to remove biases from
various sources and reflect the true expression value.
Table 2.1: Cross platform gene expression correlation of WemIQ, Cufflinks, and RSEM
group Method Cal.1 vs. Csh.1 Cal.1 vs. Csh.2 Cal.2 vs. Csh.1 Cal.2 vs. Csh.2
All genes
WemIQ 0.857 0.901 0.845 0.891
Cufflinks 0.851 0.878 0.825 0.860
RSEM 0.832 0.884 0.809 0.867
top genes
WemIQ 0.625 0.751 0.606 0.727
Cufflinks 0.659 0.729 0.602 0.675
RSEM 0.567 0.695 0.499 0.625
We observed a small but significant λ distribution in the four data sets, indicating
different bias levels across these protocols (Fig. 2.7). For example, the median λs from
the undirected reads from Caltech are only 0.175 and 0.230, but they increases to 0.270
and 0.256 for the directed reads (P < 2.2×10
−16
paired Wilcoxcon tests). Such bias
heterogeneity reflects the dynamic nature of RNA-seq experiments and appropriate bias
removal is necessary to minimize its effect on the mRNA quantification step.
To evaluate estimation consistency across replicates and labs, we used linear regres-
sions to correlate gene/isoform expression from different labs. Specifically, we ranked
the genes by the gene length normalized read counts ¯ n
g
and compared the regression R-
squared valuesin Table 2.1. Using all the genes with mapped reads, the R-squared values
from WemIQ range from 0.845 to 0.901, consistently higher than those from Cufflinks
or RSEM (0.809 to 0.884). Then we selected the 626 genes with the top four percent
32
Figure 2.7: Estimated λ from the ENCODE data sets. Small but significant λ
difference were observed in the two data sets.
¯ n
g
as the highly expressed gene group. WemIQ is comparable in one data set but out-
performs in the remaining three cross protocol comparisons (Table 2.1). Scatter plots
of gene measurement comparison of the highly expressed gene were given in Figure 2.8.
WemIQ provides a noticeable less number of genes with over two fold gene expression
change (red dots in Fig. 2.8) than Cufflinks and RSEM, and therefore overcomes the bias
heterogeneity in a robust manner.
InadditiontotheR-squaredvalues,theabsoluteexpressiondifferenceisalsoexpected
to be smaller in the direct comparison of RNA-seq results, and hence we provided the
crossplatformfoldchangevaluesinFig. 2.9. Ideally,genesexpressionmeasurementsfrom
technical replicates should be the same, while some difference is expected due to random
sampling or bias effect. Therefore, a good mRNA quantification method should remove
such bias and provide as consistent as possible measurements for effective comparison.
33
Figure 2.8: Scatter plot of gene expressions of the ENCODE data. Red points
are the genes with more than two fold gene expression change.
Among all the expressed genes, the median log gene expression fold changes are 0.634-
0.968 for WemIQ, consistently smaller than 0.759-1.041 in Cufflinks and 0.860-1.239 for
RSEM ( Fig. 2.9 B, P < 1.43×10
−5
for paired Wilcoxon test). Setting ξ as the gene
expression fold change threshold, the percentage of genes with larger fold change ρ
ξ
were
also calculated. Considering all the expressed genes, WemIQ demonstrated comparable
ρ
ξ
values as Cufflinks and RSEM. It is well known that a significant part of the total
expressions are amounted to very limited number of top expressed genes (Fig. 2.6), so
the lowly expressed genes are easily prone to sampling effect in RNA-seq experiments
34
Figure2.9: FoldchangeofgeneexpressionsfromENCODEproject. A-B:expres-
sion fold change of all expressed genes; C-D: fold change of top genes; E-F: fold change
of top isoforms.
or normalization bias in the downstream analysis, and accordingly demonstrates larger
fold changes. Hence, we focused on the 626 highly expressed genes as discussed above.
As expected, we observed a decreases fold change in all methods, among which WemIQ
provided the least fold change values (Fig 2.9, P < 2.2×10
−16
paired Wilcoxon tests).
Moreover,wefoundthattheρ
ξ
valuesfromWemIQareconsistentlysmallerthanCufflinks
and RSEM with respect to all thresholds (Fig. 2.9 C). For instance, only 12.9%-27.6% of
35
highly expressed genes shows more than two fold gene expression change, compared with
28.9%-41.7% for Cufflinks and 34.0%-51.0% for RSEM.
We also compared isoform expression estimations from different methods. First, 5817
multiple isoform genes with non-zero expressions in all four data sets were selected, and
themajorisoformwereidentifiedastheonewithlargestexpressionwithinthegenes. The
consistentmajorisoformgenesaredefinedasthosewiththesamemajorisoformfromtwo
data sets. In total, WemIQ provides comparable percentage of consistent major isoform
genes with other methods (0.774-0.810 vs. 0.763-0.811 in Cufflinks and 0.764-0.809 in
RSEM). We also selected a group of highly expressed isoforms (details in methods) and
compared the fold changes at the isoform level. Similar to the gene expression results,
WemIQ exhibits remarkable reduction in fold change compared to RSEM and Cufflinks.
For example, the median log fold changes in WemIQ are only 0.345-0.524, but increased
to 0.487-0.662 in Cufflinks and 0.611-0.864 in RSEM (Fig 2.9 F, P <2.2×10
−16
, paired
Wilcoxontests). Furthermore,4.45%-18.2%isoformsshowlargerthantwofoldexpression
changesinWemIQ,comparedwith15.9%-29.9%and22.2%-42.7%inCufflinksandRSEM
(Fig 2.9 E), indicating improved robustness isoform level measurements for direct cross
platform comparison.
2.4.4 Comparison of single-cell and population-cell RNA-seq data
Recently, there is rapidly emerging need for researchers to dissect the transcriptome from
a tiny quantity of mRNA, for example the single cell sequencing [57], [73]. Starting from
very limited amount of genetic material, much more rounds of amplification is neces-
sary during the library construction steps, possibly resulting in different bias patterns
36
Figure 2.10: Bias parameters in single cells and population cells RNA-seq data
and additional computational challenges than those from the traditional RNA-seq data
analysis [29]. Hence, a robust mRNA quantification method should handle these data
heterogeneity and provide consistent measurement. Here we applied our method on 21
RNA-seq data sets, including 18 single cell and three population cells RNA-seq data.
Twenty-one mouse RNA-seq datasets were downloaded from SRA (SRP015959), with
libraries constructed from 18 single cells, and 3 populations of 10,000 cells. Considering
the fact that a significant part of the second read in the pair-end reads start from the
primer sequences, we first trimmed the reads in to 50bp pair-end reads. NCBI annota-
tion build37.1 for mouse was used for the following expression analysis in all methods.
We removed genes that are less than 250bp in length. The reads were mapped only to
the transcriptome for the know genes using TopHat (version 2.04) with default parame-
ters. Resultant SAM files were given to WemIQ and Cufflinks (version 2.02) for further
analysis. RSEM (version 1.1.8) call Bowtie internally for the mapping part and then
expressions are calculated using its default parameters. Pair-wise gene expression corre-
lations were calculated from the gene with non-zero estimates in both data sets. Besides,
37
we averaged the gene expressions from the 18 single cell estimates and compared it with
those from population cells.
To check the bias characteristics of different data sets, the λ values from the GP
estimation were analyzed. In general, we observed a larger variance in λ values in the
single cells data than those in the population cells data (Fig. 2.10). It is possibly due to
the additional layer of bias source during library construction from tiny amount of input
in single cell RNA-seq experiments. Besides, we calculated the λ correlation pair-wisely
and provided the heat-map in Fig. 2.10. We found the largest correlation coefficients
among the population cells RNA-seq data sets (0.699-0.729), and the correlation between
single cells and single cells (or population cells) are much lower (0.335-0.623). Therefore,
such λ heterogeneity needs appropriate correction in the downstream analysis.
At the gene level expression estimates, pair-wise gene expression correlation were cal-
culated for the expressed genes and heatmaps were given in Figure 2.11. Consistent with
the λ comparisons, the gene expressions from population cells demonstrate the high-
est correlation coefficients (0.994-0.997,Fig. 2.11). We observed an obviously decreased
correlationbetweenthesinglecellsandsinglescells(orpopulationcells)datasetsandap-
parently larger variation among these correlations (0.228- 0.790 for WemIQ). Both scale
and pattern of correlation are similar in WemIQ, Cufflinks, and RSEM (Fig. 2.11). Con-
sidering the fact that the possible variations of gene expression may exists among cells,
it is not necessary that larger correlation scores indicate better mRNA measurements.
However, at least the correlation pattern between the same single cell RNA-seq data
would be consistent with different population cells data from technical replicates. In this
aspect, WemIQ and RSEM both provide stable correlations across all three population
38
cells (similar colors in the last three columns in Fig 2.11). Nevertheless, population cell
datasettwoinCufflinksshowsverydistinctcorrelationpatternthanothertwodatasets.
Besides, we averaged the gene expression from 18 single cells and compared its estimates
from those in the population cells. Focusing on 10,590 non-zero expressed genes by all
methods, we found that the median gene expression fold changes (log scale) from WemIQ
are only 1.289, 1.285, and 1.257, significantly lower than those from Cufflinks and RSEM
(2.327-2.588, Table 2, P < 2.2×10
−16
paired Wilcoxon tests). In addition, consistent
withtheENCODERNA-seqresults,weobservedapronouncedreductionintheρ
ξ
values
in WemIQ, demonstrating the advantage of WemIQ in the estimation robustness.
Table 2.2: Cross platform gene expression fold change of WemIQ, Cufflinks, and RSEM
statistics methods S vs. P1 S vs. P2 S vs. P3
median
WemIQ 1.289 1.285 1.257
Cufflinks 2.565 2.588 2.562
RSEM 2.341 2.366 2.327
variance
WemIQ 5.338 5.445 5.567
Cufflinks 33.410 33.187 33.546
RSEM 6.627 6.677 6.794
Additionally, we also analyzed the isoform expressions estimation performance by
comparing the transcript percentage within the gene of the multiple isoform genes pair-
wisely, and the heatmaps of correlations were given (Fig. 2.11). Compared with the
gene expressions, the isoform percentage correlations are smaller in all methods, proba-
bly due to the additional complexity in the deconvolution process. However, we found
higher correlation coefficients in the majority of pair-wise comparisons from WemIQ and
thus demonstrated the improved isoform level measurements (416 and 420 out of all 420
comparisons).
39
Figure 2.11: Comparison of estimation performance on single cell RNA-seq data
40
2.5 Discussion
Sequencing bias along gene positions hinders the deconvolution of transcript isoform ex-
pression, because the expression heterogeneity caused by isoform-structure is mixed with
that caused by different bias for different regions. We previously developed a hierarchical
Bayesian model (BASIS) to identify differentially isoforms without quantifying the abso-
lute isoform expression for each condition [92]. BASIS directly focused on read coverage
differences between two conditions and the bias specific to each position was partially
canceled out. Heteroscedasticity of the read coverage that cannot be canceled out by
taking differences was further handled by grouping. However, the estimation of the ab-
solute expression levels of transcript isoforms requires a more delicate handling of bias in
RNA-seq.
Most existing methods of isoform quantification overlooked the bias and assumed a
uniform sampling of reads, resulting in problematic measurements at both the gene and
theisoformlevel. Forexample,alargerbiasinacertainregionthanotherregionscanlead
toincorrectreadallocationandinaccurateestimationofisoformexpressionsubsequently.
The sequence-specific bias correction and/or the positional bias correction such as those
in Cufflinks and RSEM are not sufficient to remove the bias while allocating reads to
transcript isoforms. Even more, these two types of bias correction could be problematic
as shown in our analysis of single-isoform genes. After the sequence-specific and/or
the positional bias correction, the deviation from the uniform sampling was even worse.
Because the source of bias in RNA-seq is complicated with many unknown factors, the
simplified bias correction based on only two factors could be inappropriate.
41
WemIQ targets the challenging of accurate isoform quantification with bias removal
by a weighted log-likelihood based EM algorithm. The bias parameter for each exon was
estimated in a data-driven manner without any assumption about the bias source, and
then we assigned different weights to reads when calculating the log-likelihood function.
Simulation studies demonstrated that WemIQ significantly improved the expression esti-
mation from both the isoform-centric and the exon-centric perspectives. The estimates
from WemIQ were robust to different degree of bias. Conversely, some estimates from
Cufflinks, RSEM,orSpliceTrapcouldbegreatlydifferentfromtheunderlyingtruth. The
improvement over other methods was still significant when the coverage was low or the
gene structure was complicated with isoforms with subtle differences. WemIQ can handle
theadditionaloverdispersioncausedbymissingtranscriptsandthusprovidesmorerobust
estimations over Cufflinks and RSEM when gene annotation is incomplete. Besides, we
compared the overall gene expression levels with other platforms such as the qRT-PCR
resultsandWemIQshowedanimprovedestimation. Itisworthmentioningthatalthough
qRT-PCR may not perfectly measure gene expression, it represents another independent
platform with potentially different bias sources. In addition, it is known that bias occurs
dynamically and may vary from experiment from experiment. Hence, an important issue
the mRNA measurement method should concern is appropriate handling of bias hetero-
geneityofdifferentdatasetstoprovideconsistentandrobustcomparisons. Weappliedour
method on a variety of RNA-seq data, including high coverage ENCODE data and single
cell sequencing data with very low input genetic materials. Results demonstrate that
WemIQ does not only provide improved or at least comparable expression correlations
at both gene and isoform level, but also significantly reduced the expression fold change.
42
Therefore, WemIQ may potentially serve as a powerful tool to make direct measurement
comparisons across different experiments.
2.6 Conclusion
In summary, we propose WemIQ to quantify gene expression from the RNA-seq data
with the transcript isoform resolution. The advantage of WemIQ is that it can capture
boththeintra-andinter-genebiasvariationfromtheposition-levelreadcounts, andthen
incorporatesthisinformationtoallocatereadsamongtranscriptsthroughaweightedEM
method. The fragment length information is further utilized in the EM algorithm. Both
simulation and real data analysis show that WemIQ can correct the bias and provide
more accurate relative and absolute isoform and gene expression estimation.
43
Chapter 3
Splicing Elements Discovery through Varying Effect
Regression Model
3.1 Introduction to splicing regulatory elements discovery
Alternative splicing is a key biological process in higher eukaryotes to generate multiple
transcript isoforms from a single gene, and thus promotes protein diversity with func-
tional or structural differences [6]. With the boost of genomic data due to the recent
development in deep sequencing techniques, new splicing events and novel transcript
isoforms greatly extended our appreciation of its popularity. For instance, it has been
reported that in humans up to 95% of the protein coding genes experienced alternative
splicing [53]. However, its regulation mechanism, the so called splicing code, still remains
elusive and deserves additional effort.
In eukaryotes, despite core signal landmarks such as the junction consensus sequences
and branch points, other auxiliary splicing regulatory elements (SREs) also play a vital
role through the recruitment of splicing factors and thus facilitate accurate splicing site
recognition. The identification of such SREs would greatly enhance our capability to
44
understand the regulation rules, and even to predict splicing patterns under specific
biological environment.
3.2 Existing methods
Numerous experimental methods were proposed to identify these cis-acting elements,
which have been excellently summarized in a review paper [13]. Experimental efforts
mainlyincludeSystematicEvolutionofLigansbyExponentialEnrichment(SELEX)[39],
[46], [47], [71], UV crosslinking and immunoprecipitation (CLIP) [33], and splicing re-
porter systems. Due to experimental constraints, SREs were reported for only a few
splicingfactors, but the vastremainingare stillunknown, especiallythose for the uniden-
tified splicing factors.
Meanwhile, several computational efforts were also made to explore the existing ge-
nomic data to detect putative splicing factor binding motifs. Word count related enrich-
ment analysis is the most widely used approach, which discovers the statistically over-
or under-represented short oligonucleotides through the comparison of foreground and
background sequences. For example, Fairbrother et al. compared exons with strong and
weak splice sites and reported a list of 238 exonic splicing enhancers (ESE) with a high
validation rate, known as RESCUE ESE [20]. Zhang and Chasin compared non-coding
exons with negative controls, such as the pseudo genes or the 5’ untranslated regions
(UTR) of intronless genes, and discovered 2069 8-mer SREs [91]. Brudno ⁀et al. iden-
tified 25 brain-specific alternative exons and compared the exonic and flanking intronic
sequences therein with the background sequences, for instance, those in the proximity of
45
other cassette exons, and provided several putative mRNA protein binding sites specific
to the brain tissue [7]. More recently, Wen et al. adopted a similar approach to uncover
the tissue specific SREs from mouse RNA-seq data in brain, liver, and muscle tissues
by comparing the regions near the included and excluded cassette exons [84]. However,
this type of analysis heavily relies on the accurate selection of background sequences. A
second type of approach is the linear regression between motif counts and the expression
or splicing levels of genes or exons. This methodology was first introduced by Busse-
maker et al. [10] in 2001, and its several adapted versions have received great success
in the prediction of transcript factor binding motifs. In 2007, Das et al. first applied
this method on a small group of muscle specific cassette exons and discovered several
potential SREs [17]. However, the linear assumption of these models might oversimplify
the complex relationship between motif counts and exon recognition.
3.3 A varying effect model on splicing regulatory elements
discovery
Three key questions are to be answered on the splicing regulatory elements discovery
problem: whether there are any SREs, where are they, and how do they work? Hence
we are not only supposed to discover the SRES and distinguish them from the random
occurrences, but also to quantify its changing effect to splice site recognition according
to various biological environments. Hence, we utilized the varying coefficient regression
model for the SRE discovery [21], [90].
46
3.3.1 Introduction on varying coefficient regression model
Parametric statistical inference models are usually used to draw conclusion with underly-
ingdatastructureassumptions. Despiteitspopularityandsuccessinvariousfields,itmay
oversimplify the complex nature of real problems and accordingly results in unsatisfac-
tory consequences. With the fast development of computation capability, non-parametric
methods, which requires no assumption of the predictors and makes estimations solely
by large input data, has gained increasing attention. Part of its success may be due to
the fact that it may provide insights of model structures yet to be determined. Varying
coefficient regression is one of such methods that can explore the dynamic relationships
betweenthepredictorsandresponsesinmanyscientificareas, suchaseconomics, finance,
politics, epidemiology, medicalscience, ecologyandsoon[21]. Inthischapter, weapplied
this method to the SRE discovery problem.
In 1991, Cleveland, Grosse and Shyu have proposed local linear regression methods
in multidimensional spaces. Let X = (x
1
,x
2
,···x
p
)
T
denote the predictor vector in the
regression problem, and a scalar variable as U. The varying coefficient regression models
may be expressed in the following form.
M(u,X)=X
T
a(U) (3.1)
Where the a(U) = (a
1
(U),···a
p
(U))
T
is the functional coefficient to be estimated and
m(U,X) = E(Y|U,X) represents the regression function. One of the advantage of the
model in 3.1 is that it allows the estimated coefficient between X and Y to change with
U, andthusallowsnon-linearinteractionsbetweenX andU toguaranteemoreflexibility.
47
However,sometimesnotallthepredictorparametersvarywiththebaselineparameter
U, and in this case the semi-parametric varying coefficient regression model is often
considered as a generalization of the model in equation 3.1, which can be expressed in
the following structure.
M(u,X,Z)=X
T
a(U)+β
T
Z+ε (3.2)
Where β is the q dimensional vector of unknown coefficients. Due to the curse of
dimensionality, the baseline parameter U is usually a scalar instead of vector.
3.3.2 Estimation methods
Ingeneral, differentmethodscanbeusedtoestimatethevaryingcoefficientfunction: the
kernel-local polynomial smoothing, polynomial spline, and the smoothing spline. In this
chapter, we selected the kernel smoothing models and adopt a least square approach to
estimate the coefficient function a(U) [21].
Suppose one of the samples from
U,X
T
,y
is
U
i
,X
T
i
,y
i
,i=1···n, and we have
y =X
T
a(U)+ε (3.3)
With E[ε] = 0 The coefficient var(ε) = σ
2
(U) is estimated by the input data in the
neighborhood of a(u) by the solution of the minimum of the error estimation function
L(a,b)=
n
X
i=1
y
i
−X
T
i
a−X
T
i
b(U
i
−u)
2
K
h
(U
i
−u) (3.4)
48
Inequation3.4,K
h
(t)=K(t/h)/histhekernelfunction. Throughoutthisresearch,
the Epanechnikov kernel
K(t)=0.75
1−t
2
+
is used and h is the bandwidth to choose the neighborhood. Let
X=(X
1
,···X
n
)
T
,U
u
=diag{U
1
−u,···U
n
−u}
Γ
u
=(X,U
u
X),Y =(y
1
,···y
n
)
T
W
u
=diag(K
h
(U
1
−u),···K
h
(U
u
−u))
(3.5)
Equation 3.4 may be changed to the smoothing version of least square estimation of
y
1
.
.
.
y
n
=
X
1
X
1
(U
1
−u)
.
.
.
.
.
.
X
n
X
1
(U
1
−u)
a
b
(3.6)
Then we may have the estimated varying coefficient function in equation 3.7
ˆ a(u)=(I
p
,0
P
)
Γ
T
u
W
u
Γ
u
−1
Γ
T
u
W
u
Y (3.7)
Informula3.7,I
p
istheidentitymatrixwithdimensionp,and0
p
isap×pdimensional
matrix with all elements equal to 0. ˆ a(u) is the estimated coefficient function and has a
asymptotically normal distribution.
For the non-parametric model in equation 3.2, for any given β, we have
y
∗
k
=
p
X
i=1
a
i
(U
k
)X
k,i
+ε
k
,k =1···n (3.8)
49
Where y
∗
k
= y
k
−
p
P
j=1
β
i
(U
k
)Z
k,i
,k = 1···n. Then it transforms the semi-parametric
model in 3.2 into the non-parametric varying coefficient regression model in 3.1, and
formula 3.8 may be written as
Y−Zβ =α
T
X+ε=M+ε (3.9)
The solution to 3.9 can be given by
ˆ a(u)=(I
p
,0
P
)
Γ
T
u
W
u
Γ
u
−1
Γ
T
u
W
u
(Y−Zβ) (3.10)
Then the estimator of M can be expressed as
M=
X
T
1
0
Γ
T
u
1
W
u
1
Γ
u
1
−1
Γ
T
u
1
W
u
1
.
.
.
X
T
1
0
Γ
T
u
1
W
u
1
Γ
u
1
−1
Γ
T
u
1
W
u
1
(Y−Zβ)=S(Y−Zβ) (3.11)
In equation 3.11, S is just an averaging matrix which is uniquely determined by the
observations. Plug in 3.9, we can get
ˆ
β =
Z
T
(I−S)
T
(I−S)Z
T
Z
T
(I−S)
T
(I−S)Y
M=S
Y−Z
ˆ
β
(3.12)
50
3.4 Application on real data
In this section, we developed a Varying Effect Regression model on Splicing Elements
(VERSE) to predict the genome wide intronic SREs from RNA-seq data (Fig. 3.1).
VERSE is based on a semi-parametric varying coefficient model. Instead of simple motif
identification, VERSE is able to incorporate additional genomic information to discover
the exact locations of effective SREs. Additionally, it can quantify the contribution
of individual motif existence to the splice site recognition under different local context
and different biological conditions. In our previous study, we have demonstrated the
uniqueness and improved statistical power of VERSE by applying it to 16 human cell
lines to identify a list of 1562 cell specific SREs [90].
3.4.1 Methods
The scheme of VERSE is given in Fig. 3.1. First, a list of interested exons are selected
and their inclusion rates in a certain tissue or cell line are expressed as y
1
,y
2
,···y
n
. For
each motif candidate (e.g., a specific oligonucleotide or a position weight matrix for a set
of oligonucleotides), x
i,j
and u
i,j
represent the sequence and non-sequence based feature
effects at occurrence j of exon i. For example, x
i,j
can be one for an oligonucleotide
occurrence and it can be a matching score for a position weight matrix. And x
i,j
can
be the average sequence conservation score for this location. Then the following varying
coefficient regression model can be set up as
y
i
=β+
m
i
X
j=1
a(u
i,j
)x
i,j
(3.13)
51
Figure 3.1: Flowchart of VERSE methods to identify SREs
where m
i
is the motif count for exon i and β is the baseline exon inclusion rate of all
the exons. Different from linear regression where a(u
i,j
) is a constant, we assume that it
is changing with the non-sequence features and estimate this function directly in a data
52
driven manner. Note that for different biological conditions, we have different sets of y
i
s
and a(u
i,j
)s. Only motif occurrences with significantly non-zero a(u
i,j
)s are declared as
functional SREs for the considered biological condition.
Least square estimators are used for estimating the coefficient function a(u
i,j
). Cross
validation is used to automatically select the optimal bandwidth parameter for the semi-
parametric varying coefficient regression model. The corresponding model complexity
has been adjusted in the hypothesis testing to evaluate the association significance.
3.4.2 Data collection
Non-motif based biological features such as the phyloP conservation scores are treated as
thebaselinebindingpreferenceofsplicingfactors,andthenourVERSEmodelutilizesthis
baseline preference to predict the contribution from each hexamer to the exon inclusion
rate accordingly. Besides the capability to integrate additional information, another
advantage of VERSE is that the estimated coefficient function is allowed to change with
the non-motif based features nonlinearly. We applied VERSE to 16 human tissues and
predicted the intronic SREs, among which many overlapped with experimentally verified
SREs.
RNA-seq data set (SRA010153 for the MAQC data, SRP000727 for the human tissue
data) for 16 tissues or cell lines were downloaded from the NCBI sequence read archive
(http://www.ncbi.nlm.nih.gov/sra/). Wefirstalignedthereadstothehumangenomeus-
ingBowtieversion0.12.1withthedefaultsettings,andthenfurtheralignedtheunmapped
ones against the human refseq RNA sequences (downloaded from the NCBI website, ver-
sion 36) to discover the junction reads. Only the uniquely mappable reads were used and
53
we only considered genes on autosomes or X chromosomes. The position-level read count
was the number of body or junction reads starting from an exonic position of a gene (or
an exon) without considering the strand information.
3.4.3 Exons considered in VERSE
A non-redundant exon list was assembled from the Refseq gene annotations. Overlapped
regions were further split into disjointed regions. Then, in each tissue, the exon/gene
expression levels were estimated by the average of total reads mapped in this region,
and the exon inclusion rate was calculated as the ratio of exon expression level to its
corresponding gene expression level. In order to discover the tissue specific alternative
splicing events, exon inclusion rates were normalized across all the 16 tissues or cell lines,
andexonswithanormalizedinclusionrategreaterthan5orlessthan0.3wereconsidered
inVERSE.Tomakeacontrollist,wealsoselectedconstitutiveexonswiththenormalized
exoninclusionratebetween0.95and1.05inVERSE.Notethattheselectionofalternative
or constitutive exons is only a rough selection and VERSE can further distinguish them
according to the estimated coefficients.
3.4.4 Fitting the VERSE model
In order to differentiate the positional preference of the SREs, the upstream and down-
stream intronic regions that flanked the exons were discussed separately, and workflow
was shown in Fig. 3.1 . We considered exons with flanking introns with at least 400bp in
length. Specifically, the upstream and downstream 200bp intronic regions immediately
54
flankingtheexonjunctionswereconsideredasthetargetregionstosearchforsplicingfac-
tor motif sites. Besides, to exclude the bias due to the consensus sequences near the two
junction sites, the 10bp regions directly adjunct to the exon junctions were excluded in
ouranalysis,resultinga190bptargetsequenceinbothdirections. Foreachhexamerw,let
n
w
i,t
represent its occurrences in a target region around exon i in tissue t. y
i,t
denotes the
absolutelogvalueofthenormalizedexoninclusionrateinthistissue. Foreachexoni,the
baseline preference of the j
th
occurrence is described by u
i,j
, which is averaged across the
allthenucleotideswithineachhexamer. ThephyloPconservationscoresfrom44mammal
speciesweredownloadedfromtheUCSCgenomebrowsertorepresentthebaselineprefer-
ences(http://hgdownload.cse.ucsc.edu/goldenPath/hg18/phastCons44way/placentalMammals/).
All phyloP scores greater than 3 or less than 3 were transformed to 1 or 1, and the rest
were normalized by 3. Hence the support of u
i,j
is [−1,1]. The following semi-parametric
varying coefficient regression is fitted with bandwidth h.
y
i,t
=
n
w
i,t
X
j=1
α(u
i,j
)+β+ε
i
(3.14)
α(•)isthevaryingcoefficientfunctiontomodelthechangingcontributionofmotifhitsto
the exon inclusion rate as a function of the baseline preference, and β is the parametric
part of our model to indicate the baseline inclusion rate for each exon. ε
i
is an i.i.d
Gaussian distributed random variable with zero mean. A kernel function k(•) is used
55
to smooth the errors and conduct local linear regressions. Define k
h
(•) =
k
•
/
h
h
, Y =
[y
1,t
,··· ,y
n,t
]
T
, X=
n
w
1,t
,··· ,n
w
n,t
T
, and
W
u
=diag
n
1
n
w
1,t
Pn
w
1,t
j=1
k
h
(u
1,j
−u),··· ,
1
n
w
n,t
Pn
w
n,t
j=1
k
h
(u
n,j
−u)
o
Z=[1,··· ,1]
T
D
u
=
n
w
1,t
n
w
1,t
P
j=1
(u
1,j
−u)
.
.
.
.
.
.
n
w
n,t
n
w
n,t
P
j=1
(u
n,j
−u)
(3.15)
y
i,t
=
n
w
i,t
X
j=1
α(u
i,j
)+β+ε
i
(3.16)
Many approaches have been proposed to estimate the coefficient β and the unknown
coefficient function α(•), and here we adopted the profile least square estimation as the
following:
ˆ
β =
n
Z
T
(I−S)
T
(I−S)Z
o
−1
Z
T
(I−S)
T
(I−S)Y
α(u)=[1,0]
D
T
u
W
u
D
u
−1
D
T
u
W
u
Y−Z
ˆ
β
S=
1 0
Pn
w
1,t
j=1
n
D
T
u
1,j
W
u
1,j
D
T
u
1,j
o
−1
D
T
u
1,j
W
u
1,j
···
1 0
Pn
w
n,t
j=1
n
D
T
u
n,j
W
u
n,j
D
T
u
n,j
o
−1
D
T
u
n,j
W
u
n,j
(3.17)
I is the identity function with dimension n. The bandwidth parameter h determines
theradiusoftheneighborhoodinthelocallinearregressionandthusreflectsthedegreeof
56
freedom of the model. A larger bandwidth would benefit from the variance side, but lose
on the precision side. We selected 1.0, which is half of the whole support of the baseline
preference u, to perform the least square estimation for the varying coefficient model. To
evaluate the significance of estimated varying coefficient function α(•), a chi-square test
was carried out.
To make a comparison with VERSE, a linear regression model with the constant
effect, was fitted without the consideration of the baseline preference andF test was used
to evaluate the performance of each regression.
3.5 Results
3.5.1 VERSE identifies a list of novel motifs from 16 human tissue
RNA-seq data
In the search of SREs, we focused on the upstream and downstream 200bp regions flank-
ing the tissue specific alternative exons, which were repeatedly reported as enriched with
splicingfactorbindingsites. The10bpregionsinthedirectvicinityofexon-intronbound-
aries were removed to avoid the effect of consensus sequences. Regression P values were
calculated from our VERSE model against the no effect model, and accordingly to dis-
tinguish random occurrences from functional SREs.
Considering the 6 − mer SREs, as an example, ordered P values for all the 4096
hexamers in the brain tissue were plotted and the top ten ranked hexamers were listed in
Figures 3.2A and B respectively. A total of 87 and 79 hexamers have a P value less than
0.0005 for up- and down-stream regions respectively (FDR < 0.03). To demonstrate
57
Figure3.2: Orderedp-valuesinbrainsandexamplesofthecoeffcientfunctions.
(A, B) Ordered p-values from VERSE for the upstream and downstream intronic regions
in brains. The y-axis is the absolute log transform of regressionp-values, and the x-axis
is the ranks of the hexamers. The top 10 ranked motifs in both regions are listed. (C,
D) Coeffcient functions of TGCATG for the downstream intronic regions in brains and
muscles. The x-axis is the baseline preference from the phyloP conservation score. The
larger values near 1 indicate the conserved TGCATG sites, while smaller values near -1
are the newly evolved ones.
58
the varying contribution of the predicted SREs, we also plotted the estimated coefficient
function of the well-known FOX-1 family binding motif TGCATG in brain and skeleton
muscle tissues in Figures 3.2 C and D. Our VERSE model not only reported that TG-
CATG can explain the variation of the exon inclusion rates well (P = 2.4×10
−7
and
1.1×10
−8
respectively), but also demonstrated the varying contribution with respect to
thebaselinepreference. Asindicated, theconservedTGCATGmotifswouldconsiderably
promote brain or muscle specific alternative splicing, consistent with the conclusion in
lines of references that FOX-1 family is enriched in the intronic regions following up-
regulated alternative exons. Similar conclusions can be drawn from other tissues and the
5-mer SREs prediction results.
Setting the P value cutoff at 0.0005, we have predicted a total of 1562 6−mer SREs
across all 16 human tissues or cell lines, many of which were verified through experiments
in other literatures. For example, the FOX family protein binding sites (T)GCATG was
ranked second and third in brains and muscles, respectively. Interestingly, this motif was
proved to be phylogenetically conserved since ancient times, and our estimated varying
coefficient function also indicates that conservation should be an important factor in the
functionality of the motif (Figs. 3.2 C and D). Specifically, the conserved TGCATG sites
make a much stronger contribution to promote alternative splicing as compared to the
newlyevolvedones. Similarly,thepolypyrimidinelikemotifCTCTCTandTCTCTCfor
PTBwereranked4
th
and47
th
inupstreamintronflanksinbrain,consistentwithprevious
conclusions. The (GGG)n runs, which has been frequently mentioned as a SRE, was also
found to profoundly affect splice site selection. The discovery of these well-known SREs
provided convincing evidence of the statistical power of our proposed VERSE model.
59
Additionally, gene ontology (GO) analysis was also performed to validate the SRE
functions. To use TGCATG and the coefficient function predicted in brains as an exam-
ple,theforegroundgenes(i.e. realfunctionalSREs)wereselectedasthosewithTGCATG
and coefficients greater than 0.4, while the background genes were selected as those with
coefficients between 0.05 and 0.05 to represent the random occurrences. Using the GO
analysistoolGORILLA(http://cbl-gorilla.cs.technion.ac.il/),themostsignificantenrich-
ment term in the foreground genes is the neuron specific term synapse (P =2.5×10
−12
).
Besides, a list of other neuron specific terms, such as synaptic membrane, presynaptic
membrane,axon,andsynapsepart(P =3.6×10
−8
,2.2×10
−7
,2.3×10
−7
,and1.7×10
−6
),
are also ranked 2
nd
, 3
rd
, 4
th
, and 8
th
. This does not only verify the role of FOX-1 family
in regulating brain related alternative splicing events, but also offers convincing evidence
of VERSE to make detailed inference of splicing regulation rules.
With the predicted SRE list from VERSE, it is also worthwhile to group these tissues
according to their SRE similarities. Tissues similar to each other would probably under
co-regulation to a larger extent. Hence we quantitatively measured the regulation simi-
larity by dividing the number of shared SREs in each tissue/cell line pair to the number
of total claimed ones. Heat maps of these matrices were provided in Figure 3.3. In the
upstream intron flanks, the colon tissue demonstrates the lowest average regulation simi-
larity with other 15 tissues/cell lines while the testes tissue displays the highest similarity
(1.62% and 12.72% respectively). Two clusters can be divided by hierarchical clustering
(Fig. 3.3 A): one is the more similar groups represented by muscle, heart, adipose, testes,
breast and breast cancer cell lines; and the other is composed of brain, liver, colon and
60
MAQ-HC(AmbionshumanbrainreferenceRNA)withdistinctmotifsfromothertissues.
We made similar conclusions in the downstream regions.
Figure3.3: Heatmapsofregulationsimilarityamong16humantissues. Similar-
ity between each tissue pair is calculated as the percentage of shared motifs. Hierarchical
clustering is used to group tissues with higher level of co-regulation.
3.5.2 Comparisonwithlinearregressiondemonstratestheadvantageof
VERSE
As a comparison with existing methods, we performed SRE predictions using simple
linear regressions. Because VERSE can detect both motifs with constant coefficients
and varying coefficients, many motifs were uniquely discovered by VERSE but missed
by the linear regression, some of which were well documented in previous literature.
For example, without considering the non-motif based feature, the FOX-1 family motif
61
TGCATG failed to explain the inclusion rate variation in neither brains nor muscles
by the linear regression (P = 0.42 and 0.11 for downstream regions). Similarly, in the
upstream region, the well-known motif CTCTCT also showed no significance in brains
(P = 0.49). However, they are all top ranked ones reported by VERSE, demonstrating
the enhanced statistical power of VERSE.
Inaddition,wealsoobservedthattheadvantagebroughtbyVERSEisdifferentacross
differenthumantissues. Intissueslikeadipose, breast, testesandseveralcancercelllines,
more than half of the identified motifs by VERSE were also claimed by linear regressions
and such high overlapping rates were observed in both upstream and downstream regions
(high intersect/union ratio in Table 3.1), indicating the consistency of both algorithms.
However, in other tissues, such as brain, liver, and colon, very limited number of SRES
were found to be shared by the two methods. These indicate the different roles of the
baseline preferences in different tissues.
3.5.3 General and unique SREs display significant GC disparity
Two types of the predicted SREs were defined: the specific ones were defined as those
declared by VERSE in only one tissue, while the general ones were identified in at least
half of the 16 human tissues/cell lines. Because VERSE considers tissue-specific alterna-
tive splicing events, the majority of the predicted motifs were identified in a single tissue.
For example, in the upstream region, 632 out of 1135 SREs (55.68%) were found to be
significant only in one tissue, while only 37 SREs (3.26%) were declared as the general
SREs (dark bars in Fig. 3.4). Similar results were also observed in downstream regions:
585 specific SREs (56.80%) and 33 general SREs (3.20%). P values for the chi-square
62
Table 3.1: Comparison of VERSE with linear regression in 16 human tissues/cell lines
upstream downstream
tissue VERSE LR union intersect VERSE LR union intersect
BT474 86 57 91 52 59 29 63 25
lymph 285 191 301 175 206 140 220 126
testes 350 224 363 211 274 207 287 194
adipose 182 132 206 108 195 134 212 117
colon 28 2 28 2 26 6 27 5
muscle 146 48 149 45 199 102 209 92
heart 65 31 66 30 69 26 71 24
liver 47 15 50 12 32 5 32 5
maquhr 96 52 98 50 63 44 73 34
maqhc 40 3 40 3 43 5 43 5
T47D 122 73 130 65 110 75 120 65
MB435 165 102 176 91 125 87 136 76
MCF7 124 87 137 74 128 71 139 60
breast 405 307 427 285 325 262 344 243
HME 189 142 197 134 192 121 208 105
brain 87 6 87 6 79 11 81 9
tests were both less than 2.2×10
−16
, indicating the significant tissue preference of our
predicted SREs in alternative splicing regulation. Interestingly, we discovered that the
general motifs are mostly AT enriched, while the specific ones demonstrate substantial
GC enrichment. Specifically, the average GC content for the specific SREs was 0.484
and 0.471 for upstream and downstream regions respectively, but it dropped to 0.099 and
0.081 for general SREs (P values for the unpaired Wilcoxon tests were 2.2×10
−16
and
4.1×10
−16
, respectively).
Besides, tissue specificity was defined as the ratio of the number of specific motifs to
the general ones identified by VERSE. The brain tissue showed the highest specificity,
demonstrating the distinct splicing regulation rules. It is worthwhile to mention that the
mixedtissueMAQ-UHR(StratagenesuniversalhumanreferenceRNA,whichiscomposed
63
Figure 3.4: General and specific SREs by VERSE
of total RNA from 10 different human cell lines) demonstrated the lowest specificity at
0.414 and 0.850 in both flanking intron regions, as we expected.
3.5.4 Position and conservation preferences of predicted SREs
LinesofreferencesmentionedthatthefunctionofSREsdependsontheirrelativepositions
to splicing junctions, and some may even perform opposite roles when located in different
regions. For example, the poly pyrimidine like motif CTCTCT or TCTCTC would help
therecruitmentofPTBbeforesplicesitesmostoftime,while(GGG)
n
runsmightregulate
exon recognition both before and after exon junctions. Such position dependency reveals
the complexity of the splicing regulation process, so we analyzed this preference of our
predicted SREs in each tissue separately.
In general, only 603 out of 1562 SREs in 16 human tissues/cell lines (around 38.60%)
are shared in both upstream and downstream regions, indicating the significant position
64
preference. We also observed substantial tissue variations in the percentage of shared
SREs between upstream and downstream intronic flanks. For instance, in breast and
testes tissues, the shared percentages were up to 23.31% and 22.83%, respectively, while
inbrainandlivertissues,thepercentagedroppedto3.75%and1.28%. Suchhighdiscrep-
ancyacrossmultipletissuesdoesnotonlyindicatethefactthatthepositionaldependency
would affect the regulation process to different extents, but also corresponds very well
with our previous discovery that brains showed high complexity in the splicing regulation
process.
3.6 Software availability
We have successfully developed VERSE, the first software to our knowledge, to predict
the locations of function SREs in a genome scale directly from RNA-seq data. The
advantages of VERSE are I) it can integrate non-sequence based genomic features and
capture the complicated interactions between the features and splicing factor binding;
II) it is able to precisely quantify the contribution of individual motif occurrence to the
splice site recognition under a specific biological condition; III) it can distinguish the
effective SREs from random motif occurrences. We anticipate that VERSE may serve
as a powerful tool for researchers to integrate both sequence and non-sequence based
features to precisely identify the splicing factor binding sites.
3.6.1 Input
Two fundamental questions VERSE trying to answer are whether the existence of short
motifs would contribute to splicing and how such contribution varies under different local
65
context and biological conditions. Hence at least two formatted input files are needed as
input.
1) A list of double type values to quantify splicing events for a considered biological
condition. For example, we used the normalized exon inclusion rate to quantify splicing
events [68], [90]. Such values can be conveniently estimated through RNA-seq expression
analysis software [44], [77].
2) Both sequence and non-sequence based genomic features. For the sequence based
features, it can be the occurrence count for the candidate oligonucleotide motif, or a
matching score for a candidate position weight matrix. The non-sequence based features
canbeanybiologicalsignaturethatisinvolvedinthesplicingfactorbindingprocess,such
as mRNA secondary structures [89], the GC content [89], sequence conservation, and so
on [3].
3) In VERSE, we made no assumption about the coefficient function a(u
i,j
) except
that the contribution of motifs are changing smoothly against u
i,j
. To capture the most
suitable degree of freedom in the model, by default we performed 5-fold cross validation
to select the optimal bandwidth during the fitting process. However, the user may decide
the bandwidth selection method or directly assign appropriate values by command line
inputs.
3.6.2 Output
The principle output of VERSE is a list of detailed regression statistics, including the
P values for our varying coefficient model against non-effect and constant effect models.
Besides, VERSE also provides the estimated coefficient function a(u
i,j
) and performance
66
statistics during bandwidth selection for each candidate motif by users request. In ad-
dition, with all the above information, VERSE can provide a list of predicted splicing
factor binding sites by setting a threshold on the quantified contribution of each motif
occurrence.
3.7 Discussion
In this chapter, we developed VERSE through a varying coefficient regression model to
associate the inclusion rate of exons with hexamers occurrences in intronic regions. The
advantage of this model is that non-motif based biological features such as the conserva-
tionscorecanbeincorporatedintothemodelasthebaselinepreferenceandthusprovides
more precise splicing contribution in a quantitative way. Using VERSE, we successfully
identified 1562 6-mer intronic SREs in 16 human tissues. The discovery of well-known
motifs such as (T)GCATG for FOX-1 family, CT enriched poly pyrimidine sequences
for PTB, (GGG)n runs for hnRNP A/B and hnRNP F/H proved the reliability of our
predictions. In addition to the overall association significance, our VERSE model also
provided detailed contribution functions of these SREs according to their conservation
scores, further expanding our knowledge to splicing regulation rules.
3.7.1 IntegrationofconservationscoresimprovesSREpredictionaccuracy
Inspired by the report that the assembly of multiple features would greatly improve the
prediction precision of splicing pattern, we adopted the conservation score as a prior in-
formation to predict SREs. By permitting changes of the motif occurrence coefficients,
VERSE successfully accommodates non-linear relationship between SRE occurrence and
67
splicing ratios. Phylogenetic conservation information has been reported in many litera-
turestobeanimportantmRNAfeaturetoinferbiologicalfunctionality. Itisworthnoting
that the phyloP score is able to represent both conservation and evolution, and thus we
could discover SREs that are under negative or positive selection. With the assistance of
this prior information, VERSE not only identified almost all the motifs reported by the
linear regression, but also discovered many novel ones with the conservation dependency.
Some of these new ones were frequently mentioned as important splicing factor binding
sites in numerous literature. For example, the phylogenetically conserved SRE TGCATG
was not reported as significant in neither brains nor muscles by the simple linear regres-
sion, but additional conservation information makes them top ranked by our VERSE
model. This result indicates that if utilized in a rationale way, the baseline preference is
very informative in SRE identification and deserves further attention.
3.7.2 Tissue,conservation,andpositionpreferencesofSREsdemonstrate
the complexity of splicing regulations
First, tissue preference of the SREs may affect splice site usage and promote regulation
complexity in a much larger extent than ever expected. Among the 1562 putative motifs
discovered in the intron regions, the majority were only declared to be functional in one
tissue (as shown in the first bars in Fig. 3.4), while only around 3 percent were found
in the majority of tissues (as shown in the last bars Fig. 3.4). Additionally, our tissue
SREsimilaritystudyfurthersupportstheconclusionthatregulatoryelementsmighthelp
splicing factor recruitment only under particular biological environment. These discov-
eries not only suggest that SREs contribute greatly to the tissue specification process
68
through alternative splicing, but also imply that splice site recognition process is so com-
plicated that splicing regulation rules and SRE identification deserve more consideration
in a tissue specific manner. Interestingly, a sharp GC difference was observed between
the specific and general SREs. The general motifs show strong AT bias, while the specific
ones are mostly GC enriched. It has been computationally predicted that many intronic
SREs for constitutive splicing are AT enriched. Our discovery opens another possibil-
ity that the AT enriched ones would also be functional for general alternative splicing,
instead of promoting tissue specific alternative splicing events. In addition, one of our
pervious papers indicates that GC content is indeed involved in the splicing regulation
process probably through maintaining stable mRNA secondary structures, and such in-
volvement might exist since ancient times. Our VERSE predicted results also indicate
that GC content plays a role in the binding of splicing factors.
In addition, VERSE enables us to make the first genome wide prediction of conser-
vation dependent SREs from RNA-seq data. The high percentage of hexamers that were
uniquely discovered by VERSE through adding conservation scores as the prior infor-
mation confirmed the existence of conservation dependent SREs. More importantly, we
found that these conservation dependent SREs would preferentially function in a tissue
specific manner rather than work as universal ones. Our estimated coefficient function
analysis shows that up to 90 percent of the conserved SREs tend to promote alterna-
tive splicing, either by enhancing or repressing the expression of cassette exons, while
the newly evolved counterparts usually perform differently from the conserved ones even
withinthesametissue. Itstronglyimpliesthatthesecis-actingSREsareunderstrictneg-
ative selection to maintain they contribution to enhance transcript diversity from ancient
69
times. Evenwithinthesametissueandundersimilarconservationconstraint, motifsmay
act oppositely due to the location difference. For instance, only 38.60% percent of motifs
are shared by both upstream and downstream regions. All together, these preferences
of tissues, conservation, and positions, and the coupling of these factors, indicate that
splicingregulationisacomplicatedbiologicalprocess,andtheconsiderationofsuchinfor-
mation into its regulation mechanism exploration would significantly extend our current
understanding.
3.7.3 Brains demonstrate the highest regulation complexity through
SREs
It has been reported in lines of references that brains exhibited a large number of tissue
specific alternative spliced exons. Consistent with such results, we also observed the
highest level of regulation complexity in terms of SRE characteristics in brains. First of
all, we observed an extraordinary large number of specific SREs but very limited general
ones in brain tissues. This phenomenon was further supported by the tissue similarity
analysis since brains shared very small percentage of motifs with other tissues. Moreover,
even within the brain tissue, only 3.75% of the total identified SREs were found to
be shared in the upstream and downstream intronic regions, compared with more than
10.00% in some other tissues, indicating a strong positional preference. Furthermore, a
large number of SREs were found to regulate splicing in brain tissues with conservation
constraints and most of these conservation-dependent SREs only work in a tissue specific
manner. All of these results provide convincing evidence that brains might have the most
complicated splicing regulation mechanism that needs tissue, conservation, and regional
70
specificity to ensure a reliable and efficient regulation mechanism for the greatest level of
tissue specific alternative splicing.
3.8 Conclusion
In conclusion, our VERSE model is powerful to detect SREs based on RNA-seq data.
Besides the conservation scores, other types of non-motif features can be incorporated in
the model in a similar manner. With the continuing improvement of high throughput
sequencingtechniques, morerefinedanalysisofSREswouldbefeasibleinthenearfuture.
71
Chapter 4
Structure Based Splicing Regulation
Alternative splicing increases protein diversity by generating multiple transcript isoforms
from a single gene through different combination of exons or through different selection
of splice sites. It has been reported that RNA secondary structures are involved in
alternative splicing. In this chapter, we describe a genomic study of alternative splice
sites within the human genome. We observe that GC content around splice sites is
closelyassociatedwiththestabilityofRNAsecondarystructures. Thestructuralstability
difference among alternative splice sites, constitutive splice sites, and skipped splice sites
can be explained by the GC content difference. GC nucleotides in both stem and loop
regions contribute to the structural stability. In addition, we show that there is natural
selection on the nucleotide order of the sequence to make the corresponding secondary
structure more stable. However, the GC effect is more dominant than the nucleotide-
order-selection effect. All of these results indicate that GC content plays an important
role in alternative splicing by stabilizing RNA secondary structures.
72
4.1 Introduction
Pre-mRNA splicing in eukaryotes removes introns and joins exons together. It is cat-
alyzed by the splicesome that is a large ribonucleoprotein complex with several hundred
proteins and five small nuclear RNAs [35], [93]. The recognition of splice sites requires
multiple RNA binding proteins to bind to various splicing signals in pre-mRNAs. Genes
can choose different sets of splice sites to produce multiple transcript isoforms, which
further increases the complexity of splicing regulation. In eukaryotes, besides some short
consensus sequence elements around the 5 splice site (5ss), the 3 splice site (3ss), the
branch point, and the polypyrimidine tract, the splicing process needs other SREs, such
as splicing enhancers or silencers [11], [22], [69], [80]. In addition, pre-mRNA secondary
structures also play an important role in splicing regulation [69] by either assisting reg-
ulatory protein binding or modulating splicing regulation through long distance interac-
tions [54], [56]. Hence, in this chapter, we try to explore the role of mRNA secondary
structures on splicing regulation.
4.2 Methods
4.2.1 Selection of splice sites
ThesplicesitepositionsofdifferentexonswereobtainedfromtheUCSCGenomeBrowser
(alternative splicing event track, version hg18). In this chapter, we had a total of 5,601
alternative 5 ss, 8,378 alternative 3 ss, and 43,126 skipped splice sites (21,553 5 ss and
21,573 3 ss) for cassette exons.
73
We generated our own list of internal constitutive exons using the UCSC transcript
informationwithtworequirements: 1)aconstitutiveexonshouldappearinallisoformsof
the gene without overlapping with any other exon and preserve exactly the same starting
and ending positions; 2) the gene should contain at least four different isoforms. Thus,
62,118constitutivesplicingsites(31,4335ssand30,6853ss)forinternalconstitutiveexons
wereselectedintotal. Notethatforexonswithalternativesplicesites, themultiplesplice
sites were studied separately except the study within the same gene (see below). Each
splice site corresponded to a 141-nt sequence that was then examined by the RNAfold
program to achieve the minimum free energy. The default settings of RNAfold were used
for the energy prediction. The constitutive splice sites that are less than 141 nt apart
were removed to avoid the possible overlapping sequence window.
For the comparison of different splice sites of the same gene to demonstrate the local
GC variation, if there were multiple alternative (or constitutive, skipped) splice sites,
the average GC content was calculated to represent the GC content for the alternative
(or constitutive, skipped) splice sites of this gene. Then, the paired Wilcoxon tests were
performed to compare the GC difference between alternative and constitutive or skipped
splice sites of the same gene.
Randompermutationswereusedtogeneratethecontroldatatoevaluatetheselection
on the nucleotide order. For each sequence with 141 nucleotides in length, the whole
sequence was randomly permuted and the pre-mRNA secondary structure was predicted
for this control sequence. We did the permutation ten times. The results were all similar
(data not shown).
74
4.3 Results
4.3.1 Differenceofpre-mRNAsecondarystructuresbetweenalternative,
constitutive, and skipped splice sites
IthasbeenreportedthatstableRNAsecondarystructuresareassociatedwithalternative
splicing events [67]. We first examined the stability of RNA secondary structures near
exon-intron junctions in humans. Specifically, we assembled alternative splice sites from
internalexonswithmultiplesplicesites,constitutivesplicesitesfrominternalconstitutive
exons, and skipped splice sites from cassette exons (see methods). Since pre-mRNA
sequences favor local structures rather than global ones in vivo [63], 70 nucleotides were
added up- and down-stream of each splice site to predict the secondary structure by the
free energy minimization program RNAfold [5], [32]. The structural stability distribution
in humans is plotted in Figure 4.1. At the donor sites (5ss), the average minimum free
energy of alternative splice sites was −41.28 kcals/mol, significantly lower than those of
constitutiveandskippedsplicesites: −38.43and−37.20kcals/molrespectively(Wilcoxon
tests, P < 2.2×10
−16
, Figs. 4.1 A and B). Similarly, at the acceptor sites (3ss), the
average free energy around alternative splice sites was −40.03 kcals/mol, compared with
−36.18 and −35.28 kcals/mol for constitutive and skipped splice sites (Wilcoxon tests,
P < 2.2×10
−16
, Figs. 4.1 C and D). The comparison demonstrates that for internal
exons, alternative splice sites favor more stable secondary structures than constitutive
and skipped splice sites, which is consistent with the results in [67]. In order to test
whetherthisthermodynamicadvantageofalternativesplicesitesmighthaveexistedfrom
ancient times, we generalized our structure stability comparison in several other species
75
as well. For all of the species that we have tested (nematodes, fruit flies, and mice), a
significant enrichment of stable structures was observed in alternative splice sites except
thatthedifferencebetweenmousealternativedonorsitesandconstitutivedonorsiteswas
small. Due to the incompleteness of transcript annotations, it was difficult to distinguish
alternative splice sites from the constitutive and skipped ones with high confidence in
many other species.
Figure 4.1: Comparison of stability distribution of alternative splice sites and
constitutiveorskippedsplicesitesinhumans. AandBareforthe5sscomparison.
C and D are for the 3 ss comparison. alt means alternative splice sites, cons means
constitutive splice sites, and skip means skipped splice sites. Alternative splices sites
tended to have more stable structures.
76
4.3.2 Difference of GC content between alternative, constitutive, and
skipped splice sites and its association with the stability of pre-
mRNA secondary structure
In the exploration of other differences between different splice sites, we found that al-
ternative splice sites tend to be GC enriched. For example, at the donor sites of the
human genome, the average GC percentage was 0.52 for alternative 5ss, which was sig-
nificantly higher than 0.48 and 0.47 for constitutive and skipped splice sites (Wilcoxon
tests, P < 2.2× 10
−16
, Figs. 4.2 A and B). At the acceptor sites, the GC content of
alternative 3ss was also significantly higher than that of constitutive or skipped splice
sites: 0.52, 0.47, and 0.46 respectively (Wilcoxon tests, P < 2.2× 10
−16
, Figs. 4.2 C
and D). Results were similar when a 61-nt instead of the 141-nt sequence window was
used for each splice site. We then studied the relationship between the structural differ-
Figure 4.2: Distributions of the GC content of different splice sites in humans.
Compared with constitutive and skipped splice sites, alternative splice sites were signifi-
cantlyenrichedwithGC(thep-valuesbasedontheWilcoxontestswereall<2.2×10
−16
).
77
ence and the GC difference between alternative splice sites and constitutive or skipped
splice sites. Linear regression was performed to reveal that the GC content difference
can explain the stability variations among these splice sites remarkably. Specifically, the
GC content was negatively correlated with the predicted minimum free energy of the
sequence (Pearsons correlations in humans: −0.90 −0.83, P < 2.2×10
−16
). In other
words, GC content was positively correlated with stability. Such significant correlations
wereobservedforeachsplicesitecategory,andforboththedonorandacceptorsites(Fig.
4.3). We further found that the overall GC content, no matter if it is intronic or exonic,
contributes to the correlation with the energy significantly. The fitted regression lines
between the GC content and the energy were similar in different splice site categories. It
indicates that alternative, constitutive, and skipped splices sites have similar structural
stability given the same GC content. Similar results were observed in nematodes, fruit
flies, and mice. The regression lines of nematodes and fruit flies were slightly different
from those of humans and mice, possibly due to the biological differences between these
organisms. In the RNAfold program, we set different temperature parameters to reflect
theirdifferentbodytemperatures. Werepeatedthefreeenergyanalysisinnematodesand
fruit flies by setting the same temperature parameter as that in humans and mice, and
still observed regression lines different from those in humans and mice, indicating that
such differences were not simply due to the different temperature settings. To further
demonstrate that the GC content difference can explain the stability variations between
alternative and constitutive or skipped splice sites, we compared alternative splice sites
with constitutive and skipped splice sites with similar GC content in humans (Fig. 4.4).
No significant energy difference was observed among these splice sites given the same GC
78
Figure 4.3: GC content negatively correlates with energy in humans. A-C are
for alternative, constitutive, and skipped 5ss. D-F are for alternative, constitutive, and
skipped 3ss. The GC content in all splice sites in humans was negatively correlated with
the predicted structural energy. The fitted regression lines and the R2 values are listed
to show the goodness of fit.
content (Wilcoxon tests, P > 0.05), indicating that the GC percentage of the junction
sequence is the major factor to explain the distinct potentials to form stable structures
among alternative, constitutive, and skipped splice sites. Since structural stability was
significantly associated with the GC percentage in junction sequences, it is possible that
long-rangeGCenrichmentpatterns, insteadofthelocalGCvariation, wouldresultinthe
genome-wide thermodynamic advantages in alternatively spliced sites. For example, if
79
Figure 4.4: GC content difference explains structural stability difference. Box-
plots of the energy are plotted (dots represent outliers). Alternative, constitutive,
and skipped splice sites in humans had similar stability given the similar GC number
(Wilcoxon tests for splice site energy comparison within each GC group, P >0.5).
alternativesplicesitesweremorefrequentlyselectedinhighGCisochors, thejunctionse-
quences therein would be significantly biased to more stable base pairings compared with
theconstitutiveandskippedsplicesites. Toclarifythesetwofactors, pairwisefreeenergy
comparisonswereperformedwithstrictdistancecontroltoensuresimilarGCbackground
but allowing local GC variations. Across the human genome, we only selected the alter-
native splice sites with at least one constitutive splice site within a distance of 3000 bp.
The alternative splice sites and the nearby constitutive splice sites shared similar GC
background, but had local variations. Pairwise energy comparison of these two groups
still suggested slight yet significant enrichment of more stable structures in alternatively
spliced sites (−45.54 vs. −44.71 kcals/mol, paired Wilcoxon test P =0.001 for the donor
sites, −43.95 vs. −43.46 kcals/mol, P =0.03 for the acceptor sites). Similar results were
also observed in the comparison between alternative and skipped splice sites (data not
80
shown), further confirming the contribution of local GC variations to stabilize mRNA
secondary structures.
4.3.3 Splice sites of the first exons or long exons are GC enriched and
hence more stable in humans and mice
It is well known that human promoter regions are enriched in GC [36]. We suspect that
splice sites near transcript start sites may form more stable structures compared with
those in internal regions. To test our hypothesis, a list of constitutively and alternatively
spliced donor sites of the first exons was generated and structures were predicted by the
RNAfold program (see methods). Note that for studies in other places of this paper, only
splice sites of internal exons were considered. As expected, both alternative (−52.87 vs.
−41.28 kcals/mol) and constitutive splice sites (−55.56 vs. −38.43 kcals/mol) near the
transcriptionstartsitespreferentiallyformedmorestablestructuresthanthemiddleones
(Wilcoxontests,P <2.2×10
−16
). However,contrastingwiththesplicesitesofthemiddle
exons, for the first exons, alternative splice sites displayed less stable structures than
constitutive splice sites (−52.87 vs. −55.56 kcals/mol, Wilcoxon test, P = 4.9×10
−4
).
As expected, the GC content of these alternative splices was lower than that of the
constitutive splice sites (0.62 vs. 0.64, Wilcoxon test, P = 7.0 × 10
−6
). The results
indicate that the intervention of RNA secondary structures in splicing may vary upon
regions and it depends on GC content.
Shepard et al. found that long exons tended to have more stable structures around
splice sites [67]. Similarly, we observed a significant bias in long exons (length > 200
bp) toward more stable structures in humans (Wilcoxon tests, P < 2.2×10
−16
for the
81
donor sites and P = 5.3 × 10
−8
for acceptor sites). More importantly, this difference
can also be explained by the GC difference. Specifically, the GC content of the splice
sites around long exons was higher compared with that around short exons (length 200
bp) (Wilcoxon tests, P < 2.2× 10
−16
for both the donor and acceptor sites). Hence,
these results suggest that pre-mRNA secondary structures may play different roles in the
splicing of exons with different lengths, depending on the GC content.
Similar results for the first exons and long exons were obtained for mice. However, in
nematodesorfruitflies, sincepromoterregionswerenotGC-enriched[55], thesplicesites
near the promoter regions did not exhibit more stable structures. Besides, no obvious
bias toward more stable structures has been observed in long exons of nematodes or flies.
4.3.4 Tissue-specific alternative splice sites are GC enriched
To investigate the consequence of the GC effect, we focused on tissue-specific alternative
splicing events that are more likely to be functional. Similar to the criteria in [12],
splicing events with a proportionality change of at least 10 percent and a corresponding
P-value less than 0.3 in any of the 48 human tissues were considered as tissue-specific
events. A total of 1,640 alternative donor sites and 1,342 alternative acceptor sites were
claimed as tissue-specific alternative splice sites (see methods). We observed a significant
bias of these functional splice sites to be GC-enriched, and thus to form more stable
structures than other splice sites (Fig. 4.5). For example, at the donor sites, the average
energy for the tissue-specific alternative splice sites was −46.25 kcals/mol. It increased
to −41.20 kcals/mol for the non-tissue-specific alternative splice sites (Wilcoxon test,
P < 2.2 × 10
−16
). This stability difference can also be explained by the GC content
82
difference (0.56 vs. 0.52, Wilcoxon test, P < 2.2 × 10
−16
). Similar results were also
observed at the acceptor site.
Figure 4.5: Free energy and GC content comparison between tissue-specific and non-
tissue-specific alternative donor and acceptor sites in humans with boxplots.
4.3.5 Selection on GC content around splice sites
WehaveshownthatGCcontentwasstronglycorrelatedwithstructuralstability. Wenext
investigated whether GC is specifically enriched around splice sites so that the formed
stable structures can be involved in the splicing process. All the nucleotide sequences
werealignedaccordingtotheexon-intronjunction,andtheaverageGCpercentageacross
multiple sequences was calculated for each relative position. At the donor sites, in both
exonic and intronic regions, the GC content around the splice sites was higher than that
far away from the splice sites (red lines in Fig. 4.6). At the acceptor sites, the exonic
regions around the junctions were also enriched with GC. However, no significant GC
83
Figure 4.6: Comparison of GC percentage around real (red lines) and decoy
(black) splice sites in humans. In general, the sequences around the real splice sites
inhumansweremoreGCenrichedcomparedtothedecoyones(0.50vs. 0.48atthedonor
sites, 0.47 vs. 0.46 at the acceptor sites, Wilcoxon tests, 2.2×10
−16
). In addition, the
exonic regions near the real splice sites were more enriched in GC compared to regions
far away from the junctions, while no such enrichment has been observed near the decoy
splice sites.
enrichment was observed for the flanking introns around the acceptor sites, possibly due
to the polypyrimidine track around these regions. In addition, we used the decoy splice
sites as controls (black lines in Fig. 4.6) [88]. For these decoy splice sites, the GC
content was almost uniformly distributed and there was no enrichment around the decoy
splice sites. For each real splice site, we chose the closest decoy splice site as a control.
Therefore, therealsplicesitesandthedecoysplicesitessharedthesameGCbackground,
but had local GC variation. In general, the real splice sites had higher GC content than
the nearby decoy splices sites (paired Wilcoxon tests, P < 2.2×10
−16
), hence the real
splice sites formed more stable structures (paired Wilcoxon tests, P = 1.38×10
−15
at
donorsites, andP =1.52×10
−8
attheacceptorsites). Thedifferencebetweenrealsplice
sites and decoy splice sites could also be related to the higher GC content in the whole
84
real exon region besides the splice site region. Then we calculated the GC content for
the 50-bp exonic regions near the junction sites, and then normalized it by the average
GC percentage of the exon (i.e. the region from the junction site to 100 bp in the exonic
direction). TheGCcontentaroundthedecoysplicesiteswasalsonormalizedtheirnearby
100-bpregions. Afterthenormalization,thehigherGCcontentduetoexonswasremoved
and we still observed GC enrichment in real splice sites compared with decoy splice sites,
(paired Wilcoxon tests, P = 2.25 × 10
−4
for the acceptor sites, and P < 2.2 × 10
−16
for the donor sites). Thus, GC participates in both the exon formation and the splicing
process. In addition, we found that the stability difference between real splice sites and
decoy sites was larger for alternative splice sites than that for constitutive or skipped
splice sites (1.83 vs. 1.38 or 0.68 kcals/mol at the donor sites, 1.00 vs. 0.71 or 0.44
kcals/mol at the acceptor sites). All these results indicate that real splice sites tends to
be GC enriched, especially around alternative splice sites.
4.3.6 GC effect is more dominant than the nucleotide-order effect
It has been reported that native mRNA sequences usually demonstrate lower minimum
free energies as compared to permuted control sequences [64], [85]. In this work, we
focused on regions near the splice sites and compared the GC effect with the nucleotide-
order effect. It is also known that the dinucleotide frequency affects the predicted free
energy significantly due to the algorithm used in the RNAfold program, and the differ-
ence between native sequences and permuted sequences diminished if the dinucleotide
frequency was fixed [85]. Therefore, in our analysis, both permutations keeping the first
85
order nucleotide frequencies and permutations keeping the second order nucleotide fre-
quencies were used. As shown in Figure 4.7, the GC percentage was correlated with
the structure stability for all native and permuted sequences. Compared with the first
order permuted sequences (
′
p1
′
, white boxes), the native sequences (
′
orig
′
, black boxes)
showed more stable structures (paired Wilcoxon tests, P < 2.2×10
−16
). Nevertheless,
when the dinucluetide frequencies were fixed, the energies of the permuted controls were
similar to those of the native sequences, suggesting insignificant nucleotide-order effect
with fixed dinucleotides frequencies. In addition, the difference between native and the
first order permuted sequences increased with the GC content. For example, the mean
energy change was -0.89 for the GC number around 50 and it was -2.19 for the GC
number around 90 at the constitutive donor sites. Furthermore, the GC effect was more
dominant than the first-order-nucleotide effect. In Figure 4.7, there are four GC number
groups and the average GC number difference between adjacent groups is 10. The native
sequences always showed less stable structures than the first order permuted sequences
in adjacent groups that contain larger GC content (Wilcoxon tests, P < 2.2×10
−16
).
For example, the native sequences with the GC number 49-51 were less stable than the
first order permuted sequences with the GC number 59-61. Thus, the difference caused
by the nucleotide-order effect (from the comparison of the
′
orig
′
and
′
p1
′
) was smaller
than that caused by the GC difference (i.e. the GC number difference of 10 or the GC
percentage difference of 3.5%).
86
Figure4.7: Stabilityofnativeandpermutedsequencesinhumanswithboxplots
A and B are for alternative splice sites. C and D are for constitutive splice sites. E and
F are for skipped splice sites.
87
4.4 Discussion
In this chapter, we studied GC content around splice sites in terms of its effect on the
splice site usage and the stability of pre-mRNA secondary structures in multiple species.
For middle exons, alternative splice sites were more enriched with GC than constitutive
and skipped splice sites, and hence exhibited stronger potential to form stable secondary
structures (as shown in Figs. 4.1 and 4.3). More importantly, we showed that the GC
contentwasthemajorfactortoaccountforthestructuralstabilitydiscrepancy. Giventhe
same GC content, the predicted free energy is similar no matter whether it is alternative
splice sites or other splice sites (Fig. 4.6). We also notice that although the structural
topology predicted from RNAfold has already been proved to demonstrate significantly
high correlations with the experimental results in yeast [38], it is still possible that the
estimated secondary structure may be different from that in vivo for the species we
tested. In addition, it is still unclear whether RNA tertiary structure is also a common
mechanism to regulate splicing [82]. Advanced experimental technologies are in need to
better predict RNA structures.
Several possible regulation mechanisms can be proposed for our genome-wide struc-
tural stability observations. First, mRNA secondary structures may mediate the splicing
process via affecting the motif recognition rate to facilitate or prevent the binding of
splicing regulators. It has been reported that splicing regulators have different RNA
structural topology preferences [45]. The stable stem regions (tend to be GC-enriched) in
alternative splice sites might mask some key motifs and thus repress the site recognition.
Second, long exons exhibit pronounced GC enrichment and hence more stable structures,
88
indicating that RNA secondary structures may be actively involved in the splicing event
via long distance mediations to bring the distal signals together [60], [75]. Third, the
GC-enriched sequences around the first donor sites indicate that the role of stable RNA
secondary structures in splicing may vary across regions. It has been reported that splic-
ingnearthepromoterregionenhancestranscription[23]. Ourdiscoveredstablesecondary
structures around the promoter regions may contribute to the interaction between splic-
ing and transcription via serving as stable binding platforms of the transcription/splicing
regulators. Furthermore, increasing evidence shows that in the beginning of the tran-
scription process, the polymerase II may fall into some paused status, or even go forward
and backward from time to time. Theoretical computations demonstrate that the stable
hairpinsinthenascentRNAisbeneficialforthebacktrackingofpolymeraseII[40],which
explain our discovery about the more stable structures of the first donor sites.
It is well known that the structure prediction software, such as RNAfold and Mfold,
executesthepre-mRNAsecondarystructurecalculationundersimplifiedconditions,prob-
ably resulting in inappropriate minimum free energy predictions. Besides, the folding of
nascentmRNAsequencesmaychangefrequentlyindifferentbiologicalenvironments[48].
We therefore performed studies on functional alternative splice sites (i.e., tissue-specific
alternative splice sites). These sites also tended to be GC-enriched, further suggesting
the functional consequence of GC content in splicing.
SequencesaroundsplicesitesweremoreGCenrichedcomparedtoeitherthepositions
farawayfromthesplicesitesornearbydecoysplicesites,indicatingtheselectionpressure
on splice site regions to form stable structures. We also investigated whether additional
factors exist to affect the structural stability. Permutation analysis reveals only limited
89
nucleotide-order effect in the native sequence to keep a favorable context with larger
thermodynamic advantages (see Fig. 4.7). Thus, the stability variation introduced by
GC was more dominant that that caused by the nucleotide order.
In order to check whether the regulation role of GC content in splicing is a universal
phenomenon in multiple species, we extended our work on humans to nematodes, fruit
files, and mice. In spite of the lower quality of the splicing event lists due to the incom-
pleteness of the gene annotation compared to humans, we also observed the enrichment
ofstablestructuresinalternativelysplicedsitesaswellasslightlydifferentyetstillstatis-
tically significant correlations between the GC percentage and the free energy in all these
species. Thus, the involvement of GC content in the splicing regulation process might
have been in existence from ancient times.
4.5 Conclusion
All together, our results show that GC content around splice sites may play an important
role in splicing regulation by forming stable secondary structures. Through the selection
of GC enriched sequences, exons with alternative splice sites can maintain stable pre-
mRNA structures to promote alternative splicing. This GC effect is more dominant than
the nucleotide-order effect. On the other hand, constitutive exons and cassette exons are
not enriched with stable structures. It indicates that that the pre-mRNA structure is
part of, but not the whole of, the splicing code. We expect to investigate the biological
significance in details when related experimental data become available in the future.
90
Chapter 5
Conclusion and Future Work
5.1 Conclusion
It is well known that alternative splicing is a major gene expression regulation mecha-
nismthatgeneratesmultipletranscriptisoformsfromasinglegene, andthusexpandsthe
transcriptomediversity. Inthisdissertation,weinvestigatedtwoessentialproblemstoun-
derstand alternative splicing in higher eukaryotes through RNA-seq data: transcriptome
quantification and splicing regulation.
In light of both inter- and intra- bias heterogeneity in RNA-seq experiments, we
proposed an accurate and robust mRNA quantification method named WemIQ to de-
convolutetranscriptexpressions. ItadoptedGPdistributiontocapturebiasleveldirectly
fromrawreadswithoutanybiassourceassumption,andutilizedaweightedEMalgorithm
to allocate reads iteratively to different isoforms. Both simulation and real data studies
demonstrate that WemIQ does not only provide improved isoform-centric or exon-centric
transcript measurements, but also allow robust and immediate expression comparisons
from various datasets.
91
Ineukaryotes, numerousgenomicsignalscooperateinahighlycoordinatedmannerto
assistaccuratesplicesiterecognition,andinthisthesisweexploredthisbiologicalprocess
through two aspects: the context and structure based regulations. First, we proposed
a varying effect model (VERSE) in search of intronic splicing regulatory elements. It
considers the combinatory nature of splicing regulation by incorporating other biological
features,liketheconservationscore,touncoverthesignificantlyassociatedhexamerswith
splicesiteusage, andeventuallylocatestheSREsfromrandomoccurrencesbyevaluating
their specific contribution to splicing in a certain biological environment. We applied
VERSE on 16 human tissues and discovered a list of tissue specific SREs. We found that
SREs have distinct preferences of tissues, conservation, and positions, and the coupling
of these factors, indicate that splicing regulation is a complicated biological process, and
the consideration of such information into its regulation mechanism exploration would
significantly extend our current understanding. In addition, we performed a genomic
investigation on the role mRNA secondary structures, and found that it may play a role
in splicing regulation in humans, mice, fruit flies, and nematodes through GC content.
5.2 Future work
In the future, we would like to polish our current WemIQ code to make it available
online with user friendly interface. Detailed documentation would be provided for users
to quantify mRNA products from RNA-seq data conveniently.
Besides, there was very limited data published on the same tissue to collect a variety
of genomic features for SRE identification when we performed the study of VERSE.
92
Hence, onlyconservationscoreswereusedatthattime. Recentlyduetothecollaborative
efforts of the whole computational biology community, especially from researchers in the
ENCODE project, supplementary features other than the conservation score may be
extracted and added into VERSE. We believe that such assisting information would
result in the discovery of novel SREs with more detailed splicing contribution functions.
93
Bibliography
[1] N.Abdul-MananandK.R.Williams,“hnrnpa1bindspromiscuouslytooligoribonu-
cleotides: utilization of random and homo-oligonucleotides to discriminate sequence
frombase-specificbinding,”Nucleicacidsresearch,vol.24,no.20,pp.4063–70,1996.
[2] C.A.Ball,G.Sherlock,H.Parkinson,P.Rocca-Sera,C.Brooksbank,H.C.Causton,
D. Cavalieri, T. Gaasterland, P. Hingamp, F. Holstege, M. Ringwald, P. Spellman,
J.Stoeckert, C.J., J.E.Stewart, R.Taylor, A.Brazma, andJ.Quackenbush, “Stan-
dards for microarray data,” Science, vol. 298, no. 5593, p. 539, 2002.
[3] Y. Barash, J. A. Calarco, W. Gao, Q. Pan, X. Wang, O. Shai, B. J. Blencowe, and
B. J. Frey, “Deciphering the splicing code,” Nature, vol. 465, no. 7294, pp. 53–9,
2010.
[4] J. Benz, E. J. and S. C. Huang, “Role of tissue specific alternative pre-mrna splicing
in the differentiation of the erythrocyte membrane,” Transactions of the American
Clinical and Climatological Association, vol. 108, pp. 78–95, 1997.
[5] A. F. Bompfunewerer, R. Backofen, S. H. Bernhart, J. Hertel, I. L. Hofacker, P. F.
Stadler, and S. Will, “Variations on rna folding and alignment: lessons from be-
nasque,” Journal of mathematical biology, vol. 56, no. 1-2, pp. 129–44, 2008.
[6] D. Brett, H. Pospisil, J. Valcarcel, J. Reich, and P. Bork, “Alternative splicing and
genome complexity,” Nature genetics, vol. 30, no. 1, pp. 29–30, 2002.
[7] M. Brudno, M. S. Gelfand, S. Spengler, M. Zorn, I. Dubchak, and J. G. Conboy,
“Computational analysis of candidate intron regulatory elements for tissue-specific
alternative pre-mrna splicing,” Nucleic acids research, vol. 29, no. 11, pp. 2338–48,
2001.
[8] R. J. Buckanovich and R. B. Darnell, “The neuronal rna binding protein nova-1
recognizes specific rna targets in vitro and in vivo,” Molecular and cellular biology,
vol. 17, no. 6, pp. 3194–201, 1997.
[9] E. Buratti and F. E. Baralle, “Influence of rna secondary structure on the pre-mrna
splicingprocess,” Molecularandcellularbiology,vol.24, no.24, pp.10505–14, 2004.
[10] H. J. Bussemaker, H. Li, and E. D. Siggia, “Regulatory element detection using
correlation with expression,” Nature genetics, vol. 27, no. 2, pp. 167–71, 2001.
94
[11] L. Cartegni, S. L. Chew, and A. R. Krainer, “Listening to silence and understanding
nonsense: exonic mutations that affect splicing,” Nature reviews. Genetics, vol. 3,
no. 4, pp. 285–98, 2002.
[12] J. C. Castle, C. Zhang, J. K. Shah, A. V. Kulkarni, A. Kalsotra, T. A. Cooper,
and J. M. Johnson, “Expression of 24,426 human alternative splicing events and
predicted cis regulation in 48 tissues and cell lines,” Nature genetics, vol. 40, no. 12,
pp. 1416–25, 2008.
[13] L. A. Chasin, “Searching for splicing motifs,” Advances in experimental medicine
and biology, vol. 623, pp. 85–106, 2007.
[14] N. Cloonan, A. R. Forrest, G. Kolle, B. B. Gardiner, G. J. Faulkner, M. K. Brown,
D.F.Taylor, A.L.Steptoe, S.Wani, G.Bethel, A.J.Robertson, A.C.Perkins, S.J.
Bruce, C. C. Lee, S. S. Ranade, H. E. Peckham, J. M. Manning, K. J. McKernan,
and S. M. Grimmond, “Stem cell transcriptome profiling via massive-scale mrna
sequencing,” Nature methods, vol. 5, no. 7, pp. 613–9, 2008.
[15] B. Clouet d’Orval, Y. d’Aubenton Carafa, P. Sirand-Pugnet, M. Gallego, E. Brody,
and J. Marie, “Rna secondary structure repression of a muscle-specific exon in hela
cell nuclear extracts,” Science, vol. 252, no. 5014, pp. 1823–8, 1991.
[16] T.A.CooperandC.P.Ordahl,“Nucleotidesubstitutionswithinthecardiactroponin
t alternative exon disrupt pre-mrna alternative splicing,” Nucleic acids research,
vol. 17, no. 19, pp. 7905–21, 1989.
[17] D. Das, T. A. Clark, A. Schweitzer, M. Yamamoto, H. Marr, J. Arribere, S. Mi-
novitsky, A. Poliakov, I. Dubchak, J. E. Blume, and J. G. Conboy, “A correlation
with exon expression approach to identify cis-regulatory elements for tissue-specific
alternative splicing,” Nucleic acids research, vol. 35, no. 14, pp. 4845–57, 2007.
[18] I. D’Souza and G. D. Schellenberg, “Regulation of tau isoform expression and de-
mentia,” Biochimica et biophysica acta, vol. 1739, no. 2-3, pp. 104–15, 2005.
[19] L. P. Eperon, I. R. Graham, A. D. Griffiths, and I. C. Eperon, “Effects of rna
secondarystructureonalternativesplicingofpre-mrna: isfoldinglimitedtoaregion
behind the transcribing rna polymerase?” Cell, vol. 54, no. 3, pp. 393–401, 1988.
[20] W.G.Fairbrother,R.F.Yeh,P.A.Sharp,andC.B.Burge,“Predictiveidentification
of exonic splicing enhancers in human genes,” Science, vol. 297, no. 5583, pp. 1007–
13, 2002.
[21] J. Q. Fan and W. Y. Zhang, “Statistical estimation in varying coefficient models,”
Annals of Statistics, vol. 27, no. 5, pp. 1491–1518, 1999.
[22] N. A. Faustino and T. A. Cooper, “Pre-mrna splicing and human disease,” Genes
and development, vol. 17, no. 4, pp. 419–37, 2003.
95
[23] A. Furger, J. M. O’Sullivan, A. Binnie, B. A. Lee, and N. J. Proudfoot, “Promoter
proximalsplicesitesenhance transcription,” Genesanddevelopment, vol.16, no.21,
pp. 2792–9, 2002.
[24] M. A. Garcia-Blanco, A. P. Baraniak, and E. L. Lasda, “Alternative splicing in
disease and therapy,” Nature biotechnology, vol. 22, no. 5, pp. 535–46, 2004.
[25] B. R. Graveley, “Sorting out the complexity of sr protein functions,” RNA, vol. 6,
no. 9, pp. 1197–211, 2000.
[26] ——, “Alternative splicing: increasing diversity in the proteomic world,” Trends in
genetics : TIG, vol. 17, no. 2, pp. 100–7, 2001.
[27] M.R.Green,“Biochemicalmechanismsofconstitutiveandregulatedpre-mrnasplic-
ing,” Annual review of cell biology, vol. 7, pp. 559–99, 1991.
[28] K. D. Hansen, S. E. Brenner, and S. Dudoit, “Biases in illumina transcriptome
sequencing caused by random hexamer priming,” Nucleic acids research, vol. 38,
no. 12, p. e131, 2010.
[29] T. Hashimshony, F. Wagner, N. Sher, and I. Yanai, “Cel-seq: single-cell rna-seq by
multiplexed linear amplification,” Cell reports, vol. 2, no. 3, pp. 666–73, 2012.
[30] A. Hegele, A. Kamburov, A. Grossmann, C. Sourlis, S. Wowro, M. Weimann, C. L.
Will, V. Pena, R. Luhrmann, and U. Stelzl, “Dynamic protein-protein interaction
wiring of the human spliceosome,” Molecular cell, vol. 45, no. 4, pp. 567–80, 2012.
[31] M. Hiller, Z. Zhang, R. Backofen, and S. Stamm, “Pre-mrna secondary structures
influence exon recognition,” PLoS genetics, vol. 3, no. 11, p. e204, 2007.
[32] I. L. Hofacker and P. F. Stadler, “Memory efficient folding algorithms for circular
rna secondary structures,” Bioinformatics, vol. 22, no. 10, pp. 1172–6, 2006.
[33] K. B. Jensen and R. B. Darnell, “Clip: crosslinking and immunoprecipitation of in
vivo rna targets of rna-binding proteins,” Methods in molecular biology, vol. 488,
pp. 85–98, 2008.
[34] M. S. Jurica and M. J. Moore, “Pre-mrna splicing: awash in a sea of proteins,”
Molecular cell, vol. 12, no. 1, pp. 5–14, 2003.
[35] ——, “Pre-mrna splicing: awash in a sea of proteins,” Molecular cell, vol. 12, no. 1,
pp. 5–14, 2003.
[36] K. R. Kalari, M. Casavant, T. B. Bair, H. L. Keen, J. M. Comeron, T. L. Casavant,
andT.E.Scheetz,“Firstexonsandintrons–asurveyofgccontentandgenestructure
in the human genome,” In silico biology, vol. 6, no. 3, pp. 237–42, 2006.
[37] H. Keren, G. Lev-Maor, and G. Ast, “Alternative splicing and evolution: diversifi-
cation, exon definition and function,” Nature reviews. Genetics, vol. 11, no. 5, pp.
345–55, 2010.
96
[38] M. Kertesz, Y. Wan, E. Mazor, J. L. Rinn, R. C. Nutter, H. Y. Chang, and E. Segal,
“Genome-wide measurement of rna secondary structure in yeast,” Nature, vol. 467,
no. 7311, pp. 103–7, 2010.
[39] S. Kim, H. Shi, D. K. Lee, and J. T. Lis, “Specific sr protein-dependent splicing
substrates identified through genomic selex,” Nucleic acids research, vol. 31, no. 7,
pp. 1955–61, 2003.
[40] A. V. Klopper, J. S. Bois, and S. W. Grill, “Influence of secondary structure on
recovery from pauses during early stages of rna transcription,” Physical review. E,
Statistical, nonlinear, and soft matter physics, vol. 81, no. 3 Pt 1, p. 030904, 2010.
[41] J. D. Kohtz, S. F. Jamison, C. L. Will, P. Zuo, R. Luhrmann, M. A. Garcia-Blanco,
andJ.L.Manley,“Protein-proteininteractionsand5’-splice-siterecognitioninmam-
malian mrna precursors,” Nature, vol. 368, no. 6467, pp. 119–24, 1994.
[42] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-
efficient alignment of short dna sequences to the human genome,” Genome biology,
vol. 10, no. 3, p. R25, 2009.
[43] B. Li and C. N. Dewey, “Rsem: accurate transcript quantification from rna-seq data
with or without a reference genome,” BMC bioinformatics, vol. 12, p. 323, 2011.
[44] J. Li, H. Jiang, and W. H. Wong, “Modeling non-uniformity in short-read rates in
rna-seq data,” Genome biology, vol. 11, no. 5, p. R50, 2010.
[45] X. Li, G. Quon, H. D. Lipshitz, and Q. Morris, “Predicting in vivo binding sites
of rna-binding proteins using mrna secondary structure,” RNA, vol. 16, no. 6, pp.
1096–107, 2010.
[46] H.X.Liu,S.L.Chew,L.Cartegni,M.Q.Zhang,andA.R.Krainer,“Exonicsplicing
enhancer motif recognized by human sc35 under splicing conditions,” Molecular and
cellular biology, vol. 20, no. 3, pp. 1063–71, 2000.
[47] H. X. Liu, M. Zhang, and A. R. Krainer, “Identification of functional exonic splic-
ing enhancer motifs recognized by individual sr proteins,” Genes and development,
vol. 12, no. 13, pp. 1998–2012, 1998.
[48] E. M. Mahen, P. Y. Watson, J. W. Cottrell, and M. J. Fedor, “mrna secondary
structures fold sequentially but exchange rapidly in vivo,” PLoS biology, vol. 8,
no. 2, p. e1000307, 2010.
[49] S. Marguerat and J. Bahler, “Rna-seq: from technology to biology,” Cellular and
molecular life sciences : CMLS, vol. 67, no. 4, pp. 569–79, 2010.
[50] B. Modrek and C. Lee, “A genomic view of alternative splicing,” Nature genetics,
vol. 30, no. 1, pp. 13–9, 2002.
[51] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold, “Mapping and
quantifying mammalian transcriptomes by rna-seq,” Nature methods, vol. 5, no. 7,
pp. 621–8, 2008.
97
[52] T. W. Nilsen and B. R. Graveley, “Expansion of the eukaryotic proteome by alter-
native splicing,” Nature, vol. 463, no. 7280, pp. 457–63, 2010.
[53] Q. Pan, O. Shai, L. J. Lee, B. J. Frey, and B. J. Blencowe, “Deep surveying of
alternative splicing complexity in the human transcriptome by high-throughput se-
quencing,” Nature genetics, vol. 40, no. 12, pp. 1413–5, 2008.
[54] D. J. Patterson, K. Yasuhara, and W. L. Ruzzo, “Pre-mrna secondary structure
prediction aids splice site prediction,” Pacific Symposium on Biocomputing. Pacific
Symposium on Biocomputing, pp. 223–34, 2002.
[55] E.A.Rach, H.Y.Yuan, W.H.Majoros, P.Tomancak, andU.Ohler, “Motifcompo-
sition, conservation and condition-specificity of single and alternative transcription
start sites in the drosophila genome,” Genome biology, vol. 10, no. 7, p. R73, 2009.
[56] V. A. Raker, A. A. Mironov, M. S. Gelfand, and D. D. Pervouchine, “Modulation
of alternative splicing by long-range rna structures in drosophila,” Nucleic acids
research, vol. 37, no. 14, pp. 4533–44, 2009.
[57] D. Ramskold, S. Luo, Y. C. Wang, R. Li, Q. Deng, O. R. Faridani, G. A. Daniels,
I. Khrebtukova, J. F. Loring, L. C. Laurent, G. P. Schroth, and R. Sandberg, “Full-
length mrna-seq from single-cell levels of rna and individual circulating tumor cells,”
Nature biotechnology, vol. 30, no. 8, pp. 777–82, 2012.
[58] A. Roberts, C. Trapnell, J. Donaghey, J. L. Rinn, and L. Pachter, “Improving rna-
seq expression estimates by correcting for fragment bias,” Genome biology, vol. 12,
no. 3, p. R22, 2011.
[59] M. D. Robinson and G. K. Smyth, “Moderated statistical tests for assessing differ-
ences in tag abundance,” Bioinformatics, vol. 23, no. 21, pp. 2881–7, 2007.
[60] S.Rogic,B.Montpetit,H.H.Hoos,A.K.Mackworth,B.F.Ouellette,andP.Hieter,
“Correlation between the secondary structure of pre-mrna introns and the efficiency
of splicing in saccharomyces cerevisiae,” BMC genomics, vol. 9, p. 355, 2008.
[61] L. Rowen, J. Young, B. Birditt, A. Kaur, A. Madan, D. L. Philipps, S. Qin, P. Minx,
R. K. Wilson, L. Hood, and B. R. Graveley, “Analysis of the human neurexin genes:
alternativesplicingandthegenerationofproteindiversity,” Genomics, vol.79, no.4,
pp. 587–97, 2002.
[62] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, “Quantitative monitoring of
gene expression patterns with a complementary dna microarray,” Science, vol. 270,
no. 5235, pp. 467–70, 1995.
[63] R. Schroeder, R. Grossberger, A. Pichler, and C. Waldsich, “Rna folding in vivo,”
Current opinion in structural biology, vol. 12, no. 3, pp. 296–300, 2002.
[64] W. Seffens and D. Digby, “mrnas have greater negative folding free energies than
shuffledorcodonchoicerandomizedsequences,”Nucleicacidsresearch,vol.27,no.7,
pp. 1578–84, 1999.
98
[65] P. Senapathy, M. B. Shapiro, and N. L. Harris, “Splice junctions, branch point sites,
and exons: sequence statistics, identification, and applications to genome project,”
Methods in enzymology, vol. 183, pp. 252–78, 1990.
[66] H.ShenandM.R.Green, “Rsdomainscontactsplicingsignalsandpromotesplicing
byacommonmechanisminyeastthroughhumans,”Genesanddevelopment,vol.20,
no. 13, pp. 1755–65, 2006.
[67] P. J. Shepard and K. J. Hertel, “Conserved rna secondary structures promote alter-
native splicing,” RNA, vol. 14, no. 8, pp. 1463–9, 2008.
[68] S. Srivastava and L. Chen, “A two-parameter generalized poisson model to improve
the analysis of rna-seq data,” Nucleic acids research, vol. 38, no. 17, p. e170, 2010.
[69] M. B. Stadler, N. Shomron, G. W. Yeo, A. Schneider, X. Xiao, and C. B. Burge,
“Inferenceofsplicingregulatoryactivitiesbysequenceneighborhoodanalysis,”PLoS
genetics, vol. 2, no. 11, p. e191, 2006.
[70] H. Sun and L. A. Chasin, “Multiple splicing defects in an intronic false exon,”
Molecular and cellular biology, vol. 20, no. 17, pp. 6414–25, 2000.
[71] R.TackeandJ.L.Manley, “Thehuman splicingfactorsasf/sf2andsc35possessdis-
tinct, functionally significant rna binding specificities,” The EMBO journal, vol. 14,
no. 14, pp. 3540–51, 1995.
[72] ——,“Determinantsofsrproteinspecificity,”Currentopinionincellbiology,vol.11,
no. 3, pp. 358–62, 1999.
[73] F. Tang, C. Barbacioru, Y. Wang, E. Nordman, C. Lee, N. Xu, X. Wang, J. Bodeau,
B. B. Tuch, A. Siddiqui, K. Lao, and M. A. Surani, “mrna-seq whole-transcriptome
analysis of a single cell,” Nature methods, vol. 6, no. 5, pp. 377–82, 2009.
[74] J. Tazi, N. Bakkour, and S. Stamm, “Alternative splicing and disease,” Biochimica
et biophysica acta, vol. 1792, no. 1, pp. 14–26, 2009.
[75] S. Thompson-Jager and H. Domdey, “Yeast pre-mrna splicing requires a minimum
distance between the 5’ splice site and the internal branch acceptor site,” Molecular
and cellular biology, vol. 7, no. 11, pp. 4010–6, 1987.
[76] C. Trapnell, L. Pachter, and S. L. Salzberg, “Tophat: discovering splice junctions
with rna-seq,” Bioinformatics, vol. 25, no. 9, pp. 1105–11, 2009.
[77] C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren,
S. L. Salzberg, B. J. Wold, and L. Pachter, “Transcript assembly and quantifica-
tion by rna-seq reveals unannotated transcripts and isoform switching during cell
differentiation,” Nature biotechnology, vol. 28, no. 5, pp. 511–5, 2010.
[78] E. T. Wang, R. Sandberg, S. Luo, I. Khrebtukova, L. Zhang, C. Mayr, S. F.
Kingsmore, G. P. Schroth, and C. B. Burge, “Alternative isoform regulation in hu-
man tissue transcriptomes,” Nature, vol. 456, no. 7221, pp. 470–6, 2008.
99
[79] G. S. Wang and T. A. Cooper, “Splicing in disease: disruption of the splicing code
and the decoding machinery,” Nature reviews. Genetics, vol. 8, no. 10, pp. 749–61,
2007.
[80] Z. Wang and C. B. Burge, “Splicing regulation: from a parts list of regulatory
elements to an integrated splicing code,” RNA, vol. 14, no. 5, pp. 802–13, 2008.
[81] Z. Wang, M. Gerstein, and M. Snyder, “Rna-seq: a revolutionary tool for transcrip-
tomics,” Nature reviews. Genetics, vol. 10, no. 1, pp. 57–63, 2009.
[82] M. B. Warf and J. A. Berglund, “Role of rna structure in regulating pre-mrna splic-
ing,” Trends in biochemical sciences, vol. 35, no. 3, pp. 169–78, 2010.
[83] A. Watakabe, K. Tanaka, and Y. Shimura, “The role of exon sequences in splice site
selection,” Genes and development, vol. 7, no. 3, pp. 407–18, 1993.
[84] J. Wen, A. Chiba, and X. Cai, “Computational identification of tissue-specific al-
ternative splicing elements in mouse genes from rna-seq,” Nucleic acids research,
vol. 38, no. 22, pp. 7895–907, 2010.
[85] C. Workman and A. Krogh, “No evidence that mrnas have lower folding free ener-
gies than random sequences with the same dinucleotide distribution,” Nucleic acids
research, vol. 27, no. 24, pp. 4816–22, 1999.
[86] J. Wu, M. Akerman, S. Sun, W. R. McCombie, A. R. Krainer, and M. Q. Zhang,
“Splicetrap: a method to quantify alternative splicing under single cellular condi-
tions,” Bioinformatics, vol. 27, no. 21, pp. 3010–6, 2011.
[87] Z. Wu, X. Wang, and X. Zhang, “Using non-uniform read distribution models to
improve isoform expression inference in rna-seq,” Bioinformatics, vol. 27, no. 4, pp.
502–8, 2011.
[88] G. Yeo and C. B. Burge, “Maximum entropy modeling of short sequence motifs with
applications to rna splicing signals,” Journal of computational biology : a journal of
computational molecular cell biology, vol. 11, no. 2-3, pp. 377–94, 2004.
[89] J. Zhang, C. C. Kuo, and L. Chen, “Gc content around splice sites affects splicing
through pre-mrna secondary structures,” BMC genomics, vol. 12, p. 90, 2011.
[90] ——, “Verse: a varying effect regression for splicing elements discovery,” Journal of
computational biology : a journal of computational molecular cell biology, vol. 19,
no. 6, pp. 855–65, 2012.
[91] X.H.ZhangandL.A.Chasin, “Computationaldefinitionofsequencemotifsgovern-
ing constitutive exon splicing,” Genes and development, vol. 18, no. 11, pp. 1241–50,
2004.
[92] S. Zheng and L. Chen, “A hierarchical bayesian model for comparing transcriptomes
at the individual transcript isoform level,” Nucleic acids research, vol. 37, no. 10, p.
e75, 2009.
100
[93] Z.Zhou,L.J.Licklider,S.P.Gygi,andR.Reed,“Comprehensiveproteomicanalysis
of the human spliceosome,” Nature, vol. 419, no. 6903, pp. 182–5, 2002.
101
Abstract (if available)
Abstract
The rapid advances in high-throughput sequencing technologies provide us an opportunity to dissect transcriptomes with unprecedented resolution. Based on RNA-seq studies, alternative splicing has become more and more appreciated as a key mechanism in higher eukaryotes to expand transcriptomes by generating multiple isoforms from a single gene. Therefore, accurate quantification of transcript isoforms and discovery of splicing regulation rules are important to understand biological processes such as tissue differentiation and cell development, where alternative splicing plays an essential role. However, RNA-seq experiments, which randomly sample the short fragments from the transcriptome, exhibit distinct data characteristics from previous techniques and thus await the development of powerful computational methods. ❧ First and foremost, we present a weighted-log-likelihood expectation maximization method on isoform quantification (WemIQ). WemIQ integrates fragment length information and distributes reads among isoforms through an expectation maximization (EM) algorithm. More importantly, the weighted log-likelihood is used to handle both inter- and intra- gene bias. Simulation and real data analyses show that, compared with other methods, WemIQ significantly improves the quantification of isoform expression, exon inclusion ratios, and gene expression under a variety of gene structures and provides robust mRNA measurements across different datasets for direct comparisons. ❧ Then, we tried to decipher the splicing code from the context aspect through identification of splicing regulatory elements (SREs). The fact that a variety of other biological signals cooperatively govern the splicing pattern indicates the necessity of developing novel tools to incorporate information from multiple sources to improve splicing factor binding sites prediction. Under this context, we proposed a Varying Effect Regression for Splicing Elements (VERSE) to discover intronic SREs in the proximity of exon junctions by integrating other biological features. It may serve as a powerful tool to not only discover the splicing elements by incorporating additional informative signals but also precisely quantify their varying contribution under different biological context. ❧ At last, we investigated the structure aspect of splicing regulation by performing a genomic study of RNA secondary structures around splice sites in humans, mice, fruit flies, and nematodes. We observe that GC content around splice sites is closely associated with the splice site usage in multiple species. RNA secondary structure is the possible explanation, because the structural stability difference among alternative splice sites, constitutive splice sites, and skipped splice sites can be explained by the GC content difference. Alternative splice sites tend to be GC-enriched and exhibit more stable RNA secondary structures in all of the considered species. In humans and mice, splice sites of first exons and long exons tend to be GC-enriched and hence form more stable structures, indicating the special role of RNA secondary structures in promoter proximal splicing events and the splicing of long exons. In addition, GC-enriched exon-intron junctions tend to be overrepresented in tissue-specific alternative splice sites, indicating the functional consequence of the GC effect.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Genome-wide studies reveal the isoform selection of genes
PDF
Comparative transcriptomics: connecting the genome to evolution
PDF
Probing the genetic basis of gene expression variation through Bayesian analysis of allelic imbalance and transcriptome studies of oil palm interspecies hybrids
PDF
Predicting functional consequences of SNPs: insights from translation elongation, molecular phenotypes, and pathways
Asset Metadata
Creator
Zhang, Jing
(author)
Core Title
Isoform quantification and splicing regulation analysis in RNA-seq studies
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
11/04/2015
Defense Date
10/08/2013
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
alternative splicing,gene expression regulation,OAI-PMH Harvest,RNA-seq
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Chen, Liang (
committee chair
), Kuo, C.-C. Jay (
committee chair
), Haldar, Justin P. (
committee member
), Sun, Fengzhu Z. (
committee member
), Waterman, Michael S. (
committee member
)
Creator Email
jingzhang.wti.bupt@gmail.com,zhang28@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-343497
Unique identifier
UC11296609
Identifier
etd-ZhangJing-2128.pdf (filename),usctheses-c3-343497 (legacy record id)
Legacy Identifier
etd-ZhangJing-2128.pdf
Dmrecord
343497
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Zhang, Jing
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
alternative splicing
gene expression regulation
RNA-seq