Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Genome-wide studies reveal the isoform selection of genes
(USC Thesis Other)
Genome-wide studies reveal the isoform selection of genes
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Genome-wide Studies Reveal the
Isoform Selection of Genes
Ph.D. Dissertation
Submitted by
Che-yu Lee
Computational Biology & Bioinformatics Ph.D. Program, Department of Biological
Sciences
University of Southern California
August 2015
Ph.D. Dissertation Committee
Dr. Liang Chen (Chair)
Dr. Fengzhu Sun
Dr. Qi-Long Ying
ii
© 2015
Che-yu Lee
All Rights Reserved
iii
Acknowledgement
I would like to thank my advisor Dr. Liang Chen for her mentoring and constant
support. In the past 6 years, Dr. Chen has taught me how to be an active learner, a
cautious scientist, and a caring person.
I would like to express my gratitude to Dr, Fengzhu Sun, Dr. Michael S.
Waterman, Dr. Ting Chen and Dr. Qilong. Thank you for participating my research
project and guiding my academic growth.
All my friends in Dr. Chen's lab have been helping me so much in my research.
Sudeep Srivastava taught me the very first concept of the distribution of RNA-seq reads.
Jing Zhang helped me in both coding and experimental designing. Maoqi Xu inspired
me many interesting algorithm and solutions.
Last, but by no means, I thank my girl friend, Xu Han. You are the big support for
me.
iv
Table of Contents
Acknowledgement ........................................................................................................... iii
Table of Contents ............................................................................................................iv
List of Tables ...................................................................................................................vi
List of Figures ................................................................................................................. vii
Abstract..........................................................................................................................viii
Chaper 1. Alternative polyadenylation sites reveal distinct chromatin
accessibility and histone modification in human cell lines
1.1 Introduction ....................................................................................................... 2
1.2 Methods. ........................................................................................................... 4
1.3 Result ................................................................................................................ 6
1.3.1 Alternative polyadenylation sites reveal lower chromatin accessibility ....... 6
1.3.2 Nucleosome deletion around poly (A) sites .............................................. 10
1.3.3 Histone modefication around poly (A) sites .............................................. 11
1.4 Discussion....................................................................................................... 14
Chaper 2. Identify Alternative Splicing regulatory Network with Weighted Least
Square Regression
2.1 Introduction ..................................................................................................... 19
2.2 Method ............................................................................................................ 23
2.3 Simulation ....................................................................................................... 26
2.3.1 Estimation of correlation coefficient ......................................................... 26
2.3.2 The power of correlation tests in different types of relationship ............... 27
2.4 Result .............................................................................................................. 36
2.5 Discussion....................................................................................................... 43
Chaper 3. Quantifying protein synthesis rate with an algorithmic translation
model
v
3.1 introduction ..................................................................................................... 45
3.2 Method ............................................................................................................ 47
3.3 Result .............................................................................................................. 50
3.4 Future work ..................................................................................................... 54
Chaper 4. Summary
Bibliography ................................................................................................................ 59
vi
List of Tables
Table 1. Comparison of the H3K36me3 signals among different poly (A) sites. ...................... 12
Table 2. Comparison of the H3K36me3 signals around the APAs of genes using three poly (A)
sites and those of genes using only one poly (A) site. .................................................... 13
Table 3. Parameters setting of four different types of correlation models. ........................... 31
Table 4. The p-value of seven correlation statistics under the hypothesis that CELF2 and the
inclusion rate of 6th exon of LEF1 is independent. ........................................................ 39
vii
List of Figures
Figure 1. Chromatin accessibility differences among poly (A) sites. ......................................... 7
Figure 2. Wilcoxon rank test results for DNase-seq signals around different sites in the six cell
lines. .............................................................................................................................. 8
Figure 3. Nucleosome distribution around poly (A) sites in the K562 cell line. ....................... 11
Figure 4. An example of H3K36me3 signals in the K562 cell line. .......................................... 14
Figure 5. Read coverage in junctions and the accuracy of inclusion rate estimation .............. 24
Figure 6. The median of estimation errors among different test. .......................................... 28
Figure 7. The power of statistics for various correlation coefficients rejecting at 5% significant
level. ............................................................................................................................ 29
Figure 8. The power of various correlation tests rejecting at 5% significant level as noise level
increasing on linear model. .......................................................................................... 32
Figure 9. The power of various correlation tests rejecting at 5% significant level as noise level
increasing on quadratic model. .................................................................................... 33
Figure 10. The power of various correlation tests rejecting at 5% significant level as noise level
increasing on cross model. ........................................................................................... 34
Figure 11. The power of various correlation tests rejecting at 5% significant level as noise level
increasing on time series model. .................................................................................. 35
Figure 12. The inclusion rate between ESCs and other tissues among three species. ............. 40
Figure 13. The scatter plot of CELF2 expression level and inclusion rate of the LEF1 6th exon.
.................................................................................................................................... 41
Figure 14. The distribution of ribosomes on mRNAs ............................................................. 47
Figure 15. The difference between initiation rate and termination rate of 500 genes in
budding yeast. ............................................................................................................. 51
Figure 16. The correlation between the estimated termination rate and ribosome density
method. ....................................................................................................................... 52
Figure 17. The correlation between the estimated protein synthesis rate and mass
spectrometry data of yeast. ......................................................................................... 53
viii
Abstract
The next-generation sequencing (NGS) technology has altered the study of
mRNA from microarray to a new era. Comparing to microarray, RNA-seq has the ability
to quantify large range of expression level, detect novel transcripts without reference
genome, offer higher specificity and sensitivity, and detect the rare or low-abundance
transcripts. Technologies such as ChiP-seq, DNase-seq, MNase-seq and ribosome
profiling combine next generation sequencing with new biochemistry experiment and
provide more information to study mRNA interaction, regulation and expression.
Alternative polyadenylation has also been identified as a critical regulatory
mechanism in human gene expression even though there are still no sufficient study to
clarify the mechanism. The National Human Genome Research Institute launched a
public research project, ENCODE, to identify all functional elements in human genome
sequence with variety NGS experiments. We combine various sequencing data of
human cells to recognize the critical functional elements around alternative
polyadenylation sites statistically. The discovered functional elements signals have
improved the polyadenylation or alternative polyadenylation prediction in machine
learning methods.
RNA-seq is a powerful tool to study alternative splicing or isoform selection under
multiple conditions. The first study using RNA-seq to quantify splicing event calculated
the ratio between inclusive junction reads and exclusive junction reads as inclusion level.
After the original approach, various tools have been designed to better estimate the
isoform ratio. Here we propose a weighted least square (WLS) method to solve the
ix
problems of low-coverage in junctions and small sample size in splicing event study. We
prove the WLS is robust to noise, better in correlation coefficient estimation, and
discovering 10% ~ 20% more true positive than other traditional correlation statistics
with simulations.
Both microarray and RNA-seq technique measure mRNA abundance as genome
expression level. However, mRNA is not a perfect index of protein synthesis since they
are under extensive post-transcriptional regulations. In additional, programmed
ribosome pausing is observed during translation to facilitate the cotranslation folding
and secretion, resulting in different translation rate of genes. A ribosome profiling
technique, which is based on sequencing the ribosome-protected mRNA, provides
insight into the real translation process. Sequencing data display the significant different
speed of ribosomes on each region of mRNA template and difficulty to calculate protein
synthesis rate with simple distribution models. Here we applied a translation speed
model, dividing the translation into three steps, to estimate the real protein synthesis
rate.
In summary, my study focus on gathering sequencing data to discover important
features of genes, proposing math models to reduce the bias and applying models to
real data for future experiments.
1
Chaper 1. Alternative polyadenylation sites reveal
distinct chromatin accessibility and histone modification
in human cell lines
2
In addition to alternative splicing, alternative polyadenylation has also been
identified as a critical and prevalent regulatory mechanism in human gene expression.
However, the mechanism of alternative polyadenylation selection and the involved
factors are still largely unknown.
In this study, we use the ENCODE data to scan DNA functional elements,
including chromatin accessibility and histone modification, around transcript cleavage
sites. Our results demonstrate that polyadenylation sites tend to be less sensitive to
DNase I. However, these polyadenylation sites have preference in nucleosome-
depleted regions, indicating the involvement of chromatin higher-order structure rather
than nucleosomes in the resultant lower chromatin accessibility. More interestingly, for
genes using two alternative polyadenylation sites, the distal sites show even lower
chromatin accessibility comparing to the proximal sites or the unique sites of genes
using only one polyadenylation site. We also observe that the histone modification mark,
histone H3 lysine 36 tri-methylation (H3K36Me3), exhibits different patterns around the
cleavage sites of genes using multiple polyadenylation sites from those of genes using
a single polyadenylation site. Surprisingly, the H3K36Me3 levels are comparable among
the alternative polyadenylation sites themselves. In summary, polyadenylation and
alternative polyadenylation are closely related to functional elements on the DNA level
1.1 Introduction
Post-transcriptional RNA processing, including 5' capping, intron splicing, RNA
editing, and polyadenylation (poly (A)), has been recognized as important regulation
3
steps of gene expression. Alternative polyadenylation(Dunham, et al., 2012) is a
widespread mechanism in eukaryotic cells to control gene expression by producing
transcript isoforms with different poly (A) sites. It has been reported that around 30% of
human genes can select APA sites in their 3' untranslated regions (UTRs) (Lin, et al.,
2012). Even though most of the genes with APA produce identical proteins, APA can
also alter gene expression by changing microRNA binding regions (Mayr and Bartel,
2009), transportation and protein production rates (Lutz, 2008). Recent research
provides evidence that genes have preference for isoforms with shorter 3' UTRs in
embryonic or cancer cells because these isoforms have higher translational efficiency
than longer isoforms (Ji, et al., 2009; Lin, et al., 2012; Shepard, et al., 2011). The study
of polyadenylation in five mammals demonstrated that the usage of poly (A) sites is
more conserved in the same tissue across different species than within a species (Derti,
et al., 2012). All of these reflect the functional and evolutional importance of APA
regulation.
In general, poly (A) sites contain two core elements, the con-served AAUAAA hexamer
in the upstream and the GU-/U- rich region in the downstream. The AAUAAA, referred
as polyadenylation signal (PAS), is located ~10-35 nucleotides upstream of the
cleavage site and acts as the binding motif for the cleavage and polyadenylation
specificity factor (CPSF). The GU- or U-rich region, located 20-40 nucleotides
downstream of poly (A) sites, is recognized by the cleavage stimulatory factor (CstF).
CstF is recruited by CPSF and assembled into a protein complex to process
polyadenylation (Edmonds, 2002). The high evolutional conservation of these two motifs
indicates their critical roles in polyadenylation.
4
To investigate other factors related to polyadenylation and the alternative selection of
poly (A) sites, we combined different types of data from the ENCODE project (Dunham,
et al., 2012) including RNA-PET data, DNase-seq data, MNase-seq data, and ChIP-seq
data. We observed the low chromatin accessibility around poly (A) sites, especially
around the distal poly (A) sites of genes using two cleavage sites. This low chromatin
accessibil-ity is not due to the high density of nucleosomes. Conversely, poly (A) sites
prefer nucleosome-depleted regions. In addition, the higher level of tri-methylated
histone H3 at lysine 36 (H3K36me3) was observed around APA sites comparing to
unique poly (A) sites, suggesting the role of histone modification in simultaneously
deploying multiple poly (A) sites for a single gene. However, no significant difference
was detected among the multiple poly (A) sites themselves.
1.2 Methods.
The poly (A) sites were identified from the RNA-PET data in the nucleus of cells
from the ENCODE project (Djebali, et al., 2012). Note that RNA-PET has been used to
map the 5' cap and the 3' poly (A) signatures of individual transcripts (Fullwood, et al.,
2009). Here, we specifically targeted on the nuclear compartment because the cleavage
of transcripts from DNA occurs in the nucleus. Ensembl gene annotations
(http://genome.ucsc.edu/) were applied to obtain the largest potential genomic region of
each gene. Poly (A) sites within 20 base pairs (bp) were considered as the same one.
To capture novel poly (A) locations beyond the known gene regions, we extended each
gene region to at most 3500 bp without overlapping other genes. After inferring the poly
(A) sites from the RNA-PET data, the genes were classified into several groups based
5
on the number of poly (A) sites they used for the considered cell line. In this paper, we
mainly focused on the comparison between genes using a single poly (A) site and
genes using two poly (A) sites. HeLa-S3 cells, for example, had 1,560 genes using two
poly (A) sites and 3,607 genes using only one poly (A) site. According to our procedures,
if a gene could choose an alternative poly (A) site in a different cell line but only
exhibited a single poly (A) site in the considered cell line, we counted it as a gene using
only one poly (A) site. Similarly, we required that both poly (A) sites were used in the
considered cell line to claim a gene using two poly (A) sites. Randomly chosen positions
from the ending sites of middle exons and body regions of exons were treated as
control groups. Additionally, we considered random intergenic positions or terminus of
pseudogenes. We applied the same examinations to six different human cell lines,
HeLa-S3, HepG2, HUVEC, GM12878, NHEK and K562.
The chromatin accessibility was estimated by the DNase I hypersensitive experiments
performed at Duke University in the ENCODE project. Because of the low coverage of
DNase-seq data, we constructed a window of 80 bp and calculated the average DNase
signal. We focused on the 160 bp regions consisting of two 80bp windows for both the
upstream and the downstream of poly (A) sites. Because of the possible relationship
between nucleosome occupancy and chromatin accessibility, we investigated the
nucleosome positions around poly (A) sites with the MNase-seq data which were only
available in the GM12878 and K562 cell lines from the ENCODE project.
The Chip-seq data of H3K36Me3 marks from University of Washington in the ENCODE
project were combined with the RNA-PET data to investigate the histone modification
marks around poly (A) sites with the 80bp-window strategy mentioned above. We also
6
examined other modification data from the ENCODE project such as H3K27me3 and
H3K4me3 marks. However, no evidence of possible correlation between these markers
and APA sites was observed. Genes using three APA sites were also examined to
confirm our conclusions. All the statistical tests were performed with the SAS
programming.
1.3 Result
1.3.1 Alternative polyadenylation sites reveal lower chromatin accessibility
Based on the Ensembl gene annotations, there were 57,892 non-overlapping
genes. For each of the six considered human cell lines, we inferred the utilized poly (A)
sites of each gene using the RNA-PET reads and classified the genes into different
groups. Then the DNase-seq data reflecting the chromatin accessibility were considered
on the poly (A) sites. Using the K562 cell line as an example, the DNase signals
demonstrated lower chromatin accessibility around poly (A) sites than those around
randomly chosen ending positions of middle exons or those around random positions
inside an exon body (P < 0.0001, Wilcoxon tests; average DNase signals were 0.01-
0.015 vs. 0.0165-0.023) (Figure 1). Intriguingly, the distal cleavage sites of genes using
two poly (A) sites displayed significantly lower signals than other poly (A) sites (P =
0.0001-0.0078, Wilcoxon signed-rank tests or Wilcoxon tests; average DNase signals
were 0.01-0.012 vs. 0.013-0.015). However, the difference between the proximal poly
(A) sites and the unique poly (A) sites was insignificant (P = 0.56-0.88, Wilcoxon tests),
although the latter but not the former was the final terminus of gene transcription. As
7
shown in Figure 2, similar patterns were observed in other cell lines. The signals in the
HUVEC and the NHEK cell lines were weaker and this could be due to the different
qualities of the RNA-PET and the DNase-seq data.
Figure 1. Chromatin accessibility differences among poly (A) sites.
DNase-seq signals around different positions in the K562 cell line. The lines show the average
DNase-seq density around proximal poly (A) sites (red), distal poly (A) sites (blue), unique poly
(A) sites (black), random exonic sites (yellow), random exon ending sites (light blue), random
intergenic positions or terminus of pseudogenes (Dunham, et al.) . Position 0 represents the
considered sites and the 100bp upstream and downstream regions are shown. The vertical lines
represent the standard errors for each position.
8
Intergenic regions and pseudogenes are tightly associated with lower levels of DNase
hypersensitivity. The lower chromatin accessibility around the distal cleavage sites
could be due to the higher level of contamination with sites in intergenic regions or
pseudogenes. Our randomly chosen intergenic sites and the poly (A) sites of
pseudogenes confirms the lower hypersensitivity in these regions than other groups (P
<0.037, Wilcoxon tests; Figure 1). We then investigated the original RNA-PET signals
Figure 2. Wilcoxon rank test results for DNase-seq signals around different sites in the six
cell lines.
Each rectangular block displays the results for the six cell lines shown in the top left block. Thus,
each block is divided into six smaller rectangular regions. For each cell line, four different regions,
including the −160~−80 bp upstream, −80~0 bp upstr eam, 0~80 bp downstream and 80~160 bp
downstream regions of the considered sites, were tested separately and the results were shown
from the left to the right in the smaller rectangles. Thus, each smaller rectangle is further divided
into four segments. The colors were coded to demonstrate the significance level of the difference
between the two groups. For the comparison between distal poly (A) sites and proximal poly (A)
sites, Wilcoxon signed-rank tests were performed instead.
9
for our poly (A) sites. If the distal cleavage sites include more false sites from intergenic
regions or pseudogenes, weaker RNA-PET signals are expected for this group.
However, we found no significant signal strength difference between the distal and the
proximal poly (A) sites (P > 0.246, Wilcoxon signed-rank tests for all six cell lines). Distal
poly (A) sites had even higher RNA-PET signals than unique poly (A) sites in three cell
lines (HepG2, GM12878 and NHEK; p < 0.012, Wilcoxon tests). And no difference was
observed in the other three cell lines. All these indicate that the quality of poly(A) site
discovery is comparable among different groups and the lower levels of chromatin
accessibility around distal poly(A) sites are not due to the higher contamination of false
sites in intergenic regions or pseudogenes. Additionally, the verification test of RNA-
PET data showed over 99% of the PETs represented full-length tran-scripts, and the
majority fell within 10 bp of the known 5' and 3' boundaries of these transcripts. The
false-positive poly (A) sites are considered as really few proportion in this study.
To confirm the association between the lower chromatin accessibility and the transcript
cleavage, we designed another analysis. The genes with two potential poly (A) sites
across all the cell lines were selected. For each site, the DNase signals from the six cell
lines were divided into two groups according to whether the cell line utilized the site for
polyadenylation or not. The average signals were calculated for each group of the site.
Then the Wilcoxon signed-rank test was performed on the average signals of all the
proximal (or distal) poly (A) sites. The significantly lower accessibility was observed for
cell lines utilizing the sites for polyadenylation comparing to cell lines without utilizing
the sites for polyadenylation (P = 2.96×10−7 or 2.66×10−6 for proximal or distal poly (A)
sites).
10
1.3.2 Nucleosome deletion around poly (A) sites
Because nucleosome occupancy is closely related to chromatin accessibility, we
further investigated whether the lower chromatin accessibility around poly (A) sites was
due to the densely occupied nucleosomes or due to other higher-order chromatin
structures. Some studies showed that in S. Cerevisiae the high % AT content correlates
with low nucleosome occupancy especially in homopolymeric runs of poly (A) and poly
(T) occurring in the 5' or 3' nucleosome-free region (NFR) (Jansen and Verstrepen,
2011). The nucleosomes depletion around poly (A) sites in human T cells was also
observed (Spies, et al., 2009). To check the nucleosome occupancy around poly (A)
sites in our considered cell lines, we obtained the MNase-seq signals. Randomly
chosen exonic or exon ending sites were processed as control groups as mentioned
previously. Using the K562 cell line as an example, Figure 3 clearly shows that poly (A)
sites, no matter whether they are unique or APA sites, have lower nucleosome
occupancy than random exonic or exon ending sites (P = 0.0001-0.0055, Wilcoxon tests;
average 0.85-1.05 vs. 1.05-1.2). The MNase-seq data was also available for the
GM12878 cell line. Similar results were observed (P < 0.0001, Wilcoxon tests). The
results indicate that the lower chromatin accessibility around poly (A) sites is not due to
the high density of nucleosomes and it could be due to other higher-order chromatin
structures. Additionally, nucleosomes were well-positioned in exons comparing to
introns (the “Exon ending” panel in Fig ure 3). Previous research has discovered that
nucleosomes are preferentially positioned in exons and the organization is evolutionary
conserved (Schwartz, et al., 2009).
11
1.3.3 Histone modefication around poly (A) sites
H3K36me3 has been reported as a determinant of pre-mRNA alternative splicing
(Soojin Kim, 2011). To further study the possible relationship between APA and
Figure 3. Nucleosome distribution around poly (A) sites in the K562 cell line.
Three lines represent the third quartile, the median and the first quartile of the MNase-seq values
at each position. For all the three different poly (A) groups, the cleavage sites display fewer
nucleosomes than randomly chosen exonic or exon ending positions.
12
Proximal poly (A) vs. unique poly (A)
Proximal
poly(A) vs.
unique poly(A)
HeLa-S3 HepG2 HUVEC GM12878 NHEK K562 GM12878_adj K562_adj
−160~−80 bp
upstream
0.0029 <.0001 <.0001 <.0001 0.0007 <.0001 <.0001 <.0001
−80~0 bp
upstream
0.0178 <.0001 <.0001 0.0004 <.0001 <.0001 <.0001 <.0001
0~80 bp
downstream
0.0003 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001
80~160 bp
downstream
0.0023 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001 <.0001
Distal poly (A) vs. unique poly (A)
Proximal
poly(A) vs.
unique poly(A)
HeLa-S3 HepG2 HUVEC GM12878 NHEK K562 GM12878_adj K562_adj
−160~−80 bp
upstream
0.0051 <.0001 <.0001 <.0001 <.0001 <.000
1
<.0001 <.0001
−80~0 bp
upstream
0.0017 <.0001 <.0001 <.0001 <.0001 <.000
1
<.0001 <.0001
0~80 bp
downstream
<.0001 <.0001 <.0001 <.0001 <.0001 <.000
1
<.0001 0.0018
80~160 bp
downstream
0.0002 <.0001 <.0001 <.0001 <.0001 <.000
1
<.0001 <.0001
Proximal poly (A) vs. distal poly (A)
Proximal
poly(A) vs.
unique poly(A)
HeLa-S3 HepG2 HUVEC GM12878 NHEK K562 GM12878_adj K562_adj
−160~−80 bp
upstream
0.8289 0.1793 0.9892 0.1594 0.3440 0.348
8
0.1303 0.1978
−80~0 bp
upstream
0.9260 0.0523 0.2037 0.9671 0.5690 0.179
5
0.9850 0.1265
0~80 bp
downstream
0.8666 0.2720 0.0927 0.9904 0.1496 0.715
0
0.9278 0.5571
80~160 bp
downstream
0.2610 0.9449 0.5902 0.6049 0.6582 0.446
6
0.5103 0.5424
Table 1. Comparison of the H3K36me3 signals among different poly (A) sites.
The poly (A) sites of genes using two cleavage sites were compared with the unique poly (A)
sites of genes using only one cleavage site. This table lists the p-values of the Wilcoxon rank
sum tests and the last two columns show the p-values after removing sites with zero MNase-seq
signal. For the comparison between distal poly (A) sites and proximal poly (A) sites, Wilcoxon
signed-rank tests were performed instead.
H3K36me3 modification, the H3K36me signals around poly (A) sites were collected
from the CHIP-seq data of the ENCODE project and the average read coverage within a
80bp window was calculated.
As shown in Table 1, significantly differences were observed in the H3K36me3 signals
between unique poly (A) sites and APA sites after applying Wilcoxon rank tests. More
specifically, the poly (A) sites of genes using multiple cleavage sites in a cell line
demonstrated higher H3K36me3 levels. The average density signals are 3.5 for multi-
poly (A) genes and 3.1 for unique poly (A) genes. Since H3K36me3 modification may
13
Proximal poly (A) vs. unique poly (A)
Proximal poly(A)
vs. unique
poly(A)
HeLa-S3 HepG2 HUVEC GM12878 NHEK K562
−160~−80 bp
upstream
<.0001 <.0001 <.0001 <.0001 <.0001 <.0001
−80~0 bp
upstream
<.0001 <.0001 <.0001 <.0001 <.0001 <.0001
0~80 bp
downstream
<.0001 <.0001 <.0001 <.0001 <.0001 <.0001
80~160 bp
downstream
<.0001 <.0001 <.0001 <.0001 <.0001 <.0001
Middle poly (A) vs. unique poly (A)
Proximal poly(A)
vs. unique
poly(A)
HeLa-S3 HepG2 HUVEC GM12878 NHEK K562
−160~−80 bp
upstream
<.0001 <.0001 <.0001 <.0001 <.0001 <.0001
−80~0 bp
upstream
<.0001 <.0001 <.0001 <.0001 <.0001 <.0001
0~80 bp
downstream
<.0001 <.0001 <.0001 <.0001 <.0001 <.0001
80~160 bp
downstream
<.0001 <.0001 <.0001 <.0001 <.0001 <.0001
Distal poly (A) vs. unique poly (A)
Proximal poly(A)
vs. unique
poly(A)
HeLa-S3 HepG2 HUVEC GM12878 NHEK K562
−160~−80 bp
upstream
0.0011 <.0001 <.0001 <.0001 <.0001 <.0001
−80~0 bp
upstream
0.0014 <.0001 <.0001 <.0001 <.0001 <.0001
0~80 bp
downstream
0.0001 <.0001 <.0001 <.0001 <.0001 <.0001
80~160 bp
downstream
0.0030 <.0001 <.0001 <.0001 <.0001 <.0001
Table 2. Comparison of the H3K36me3 signals around the APAs of genes using three
poly (A) sites and those of genes using only one poly (A) site.
The poly (A) sites of genes using three cleavage sites were compared with the unique poly
(A) sites of genes using only one cleavage site. This table lists the p-values of the Wilcoxon
rank sum tests.
depend on nucleosomes, we removed the regions with weak MNase-seq signals (i.e.
the average density signal equal to zero), the con-clusion is the same (the last two
columns of Table 1). Similar analysis was performed for gene using three APA sites to
con-firm our conclusion that the higher H3K36me3 level was observed in APA sites
(Table 2). Interestingly, the multiple cleavage sites of the same gene demonstrated
similar H3K36me3 signals among themselves (Table 1). An example of H3K36me3
around poly (A) sites is shown in Figure 4. Gene DTNBP1 used two poly (A) sites in the
K562 cell line, but gene HCCS only used one poly (A) site in this cell line. The figure
clearly shows the higher H3K36me3 levels of the APA sites than those of the unique
poly (A) site.
14
Finally, we examined the possible associations between poly (A) site signals and
DNase hypersensitivity, nucleosome occupancy , or H3K36me3 marks. The range of
the significant spearman coefficient is -0.019~0.142 for DNase, -0.078~0.166 for
H3K36me3 and -0.063~0.065 for nucleosome occupancy. However, the pattern is no
consistent among different cell lines and the correlation is not strong. It may because
the RNA-PET signal is confounded by gene expression.
1.4 Discussion
We discovered the low chromatin accessibility around poly (A) sites across six
human cell lines, especially for the distal poly (A) sites of genes using multiple poly (A)
sites.
Figure 4. An example of H3K36me3 signals in the K562 cell line.
The red arrows indicate the APA sites of DTNBP1 and the uniquely deployed poly (A) site of
HCCS in the considered cell line. The APA sites tend to display higher H3K36me3 modification
than the unique poly (A) site.
15
The phenomena of less number of nucleosomes located around poly (A) sites is
consistent with the previous report and the correlation between CG percentage and
nucleosome enrichment. Specifically, regions around poly (A) sites contained higher AT %
than those random exonic or exon ending sites (e.g., P < 2.2×10-16, Wilcoxon tests,
NHEK cell line). Therefore, we excluded the possibility that the lower chromatin
accessibility around poly (A) sites was resulted from higher nucleosome occupancy.
Interestingly, the proximal instead of the distal poly (A) sites exhibited similar chromatin
accessibility with the unique poly (A) sites, although both the distal and the unique poly
(A) sites were the final termini of gene transcription.
H3K36me3 modification was found to peak within actively transcribed genes especially
in exonic regions and therefore has been considered as an indicator for transcriptional
regions of genes (Bannister, 2004). Accordingly, the unique poly (A) sites of genes with
only one cleavage site, as the transcription termini, were expected to have lower
H3K36me3 comparing to the proximal poly (A) sites of APA genes. Surprisingly, there
was no significant difference between the distal and the proximal poly (A) sites of the
same gene as shown in Table 1. These APA sites all tended to have higher H3K36me3
than the unique poly (A) sites (Tables 1 and Fig. 4), indicating the distinct role of
H3K36me3 in genes using APA. Besides, we also examined other modification data
from the ENCODE project such as H3K27me3 and H3K4me3. However, no evidence of
possible correlation between these markers and APA was discovered.
Previous study reported that H3K36me3 was more enriched at poly (A) sites of short
isoforms than those of long isoforms (Lin, et al., 2012). This discrepancy may be due to
the distinct criteria for the selection of proximal and distal poly (A) sites. We required the
16
usage of both sites in the considered cell line. On the contrary, they pooled the poly (A)
sites from multiple breast tissues to claim a short or long isoform, and did not require
the usage of both poly (A) sites when they compared histone modification signals in a
specific cell line. Thus, some of the poly (A) sites of their short or long isoforms can be
the unique sites according to our definitions.
Instead of using the windows, they focused on the distributions of the distance between
the poly (A) site and the nearest H3K36me3 mark. When we applied the Kolmogorov-
Smirnov test to our selected sites in our considered cell lines, there was still no
significant difference for the distance distributions among the APA sites of the same
genes (P > 0.84 for all the six cell lines).
For the H3k36me3 study, to confirm there were no additional poly (A) sites in the
downstream, we searched for the longest genomic ranges of genes ab initio instead of
relying on the known gene annotation as we showed in the Results section. Thus, after
identifying the initial and terminal sites of transcripts from the RNA-PET reads,
overlapped transcripts were clustered and considered as the same genes. Similar
results were observed.
The human poly (A) sites identification had been implemented by machine-learning
methods using sequence and structural patterns around poly (A) sites (Chang, et al.,
2011). The accuracy rate for their model or polya_svm is around 60%~70% when
applying CDS as the negative set. Here we applied the Support Vector Machine method
libsvm (Chang and Lin, 2011) and input DNase sensitivity, nucleosome occupancy
signals, and histone marks as features. The accuracy is 65%~68% for poly (A) sites
prediction, 56%~57% for APA prediction and 54%~56% for proximal/distal prediction. If
17
we combine the genomic sequence with above DNA elements, the accuracy is
improved to 79%~80% for poly (A) sites prediction, 60%~66% for APA prediction and
59%~64% for proximal/distal prediction. The results suggest that our discovered DNA
functional elements provide comparable information with the sequence and structural
patterns to predict poly (A) sites. In the future, we can develop a model to utilize all
these features to further improve the prediction of poly (A) sites. In conclusion, we
observed distinct patterns of DNase I sensitivity and H3K36me3 around the multiple
cleavage sites simultaneously used by a gene. The detailed mechanism needs more
specific biological examinations.
18
Chaper 2. Identify Alternative Splicing regulatory
Network with Weighted Least Square Regression
19
Correlation-based methods have been applied to reconstruct gene co-expression
networks (Aoki, et al., 2007; Langfelder and Horvath, 2008; Reverter and Chan, 2008;
Song, et al., 2012). They have also been used to infer alternative splicing network links
through high-throughput microarray or RNA-seq data (Aoki, et al., 2007; Langfelder and
Horvath, 2008). However, the insufficient sequencing depth greatly affects the accuracy
of splicing ratio quantification. Consequently, it hinders the inference of alternative
splicing regulation through the correlation analysis between splicing regulators and
downstream alternative exons.
Here we propose a weighted least square model to better estimate the
correlation between splicing regulator expression and alternative exon inclusion rates
from RNA-seq data. Different weights are assigned to each sample according to its
sequencing depth when calculating the correlation coefficient. Our method is not
restricted to skipping exon events but is capable to other alternative splicing events
such as alternative 5'/3' splice sites. Simulations proved that our method is more
accurate in estimating the correlation coefficient than other traditional methods and is
robust to noise among various types of correlation models.
2.1 Introduction
Alternative splicing (AS) is the process on precursor mRNA to produce different
transcripts variants, by skipping, extending, shorting exon or retaining intron sequencing.
This mechanism generates large biodiversity of proteome complexity and acts as a key
20
role of gene regulation in higher eukaryotes species. In addition, many diseases have
been related to mutations in regulatory sequence or are simply caused by the changes
in the isoform ratio of a single gene, highlighting the relevance of AS to medical
therapy(Tazi, et al., 2009).
The rapid development of high-throughput sequencing technique of short cDNA
fragment has facilitated the study to characterizing the transcript isoform, and therefore
proved ~90% of human genes are under the regulation of alternative splicing (AS)(Pan,
et al., 2008; Wang, et al., 2008). RNA-seq provides in-depth information on both known
or novel splicing event using the reads mapping at exon-exon junctions. However, the
research of AS by junction reads is essentially dependent on the sequencing depth and
junction reads overage. To construct alternative splicing regulatory network, accurate
estimation of the isoform ratio and detection of regulated isoforms by splicing factors
remain challenging.
Inclusion level is the first wildly applied method to evaluate alternative exon
skipping events basing on inclusion/exclusion reads ratio(Wang, et al., 2008). An
alternative measure "percentage spliced in" (PSI or ψ ) estimates the percentage of
isoforms that include the cassette exons(Katz, et al., 2010). Another estimateψ sj
considers body reads on the cassette exon as well as junction reads and calculates the
density ratio between inclusive region and exclusive region. MISO further includes
reads on flanking exon region and calculates the posterior probability of annotated
isoforms using Bayesian statistics(Katz, et al., 2010). SpliceTrap also considers the
same read supporting region and utilizes Bayesian inference method to measure
21
inclusion ratio on paired-end RNA-seq data(Wu, et al., 2011). Instead of treating all the
junction reads the same, MMES calculates the score based on read length on both
sides of flanking exon and number of mismatches on each reads (Wang, et al., 2010).
MMES gives each junction read different weight and employs statistical model to
quantify exon inclusion level. These tools mainly focus on reads around cassette exons
and assume the uniform distribution for positions of mapped reads.
The other quantification algorithms of alternative splicing focus on the whole
isoforms region instead of exon and junction regions as above tools. These methods
distribute the mapping reads on constitutive exons into different isoforms to estimate the
transcripts abundance. rSeq assumes that the reads follows Poisson distribution and
the mean of specific region is the summation of expression level of isoforms(Jiang and
Wong, 2009). The isoform expression level could be obtained by solving the joint log-
likelihood function with maximum likelihood estimation. Expection-maximization (EM)
algorithm is widely applied to assign multi-reads to genome reference or distribute
reads among isoforms. RSEM improves the accuracy after involving multi-reads into
gene expression level estimation(Li and Dewey, 2011). It is useful when the RNA-seq
data contains high proportion of multi-reads, including sequencing with short read length
or organism with highly repetitive genome sequence. Both Solas and WemIQ apply EM
algorithm to distribute reads among isoforms while Solas uses Poisson process to
simulate the number of reads covering a gene (Richard, et al., 2010). WemIQ considers
the sequencing bias from different region and estimates the oversampleing or
undersampling with generalized Poisson model (Zhang, et al., 2014). Different weights
are added to log likelihood equations according to bias from different virtual exons.
22
These methods reply on known and accurate gene annotation. Cufflinks, on the other
hand, aims to estimate individual transcript isoform abundance including novel
unannotated transcripts (Trapnell, et al., 2010). TopHat is first called to map the reads
and assemble the mapped reads to construct minimal numbers of transcripts. Then the
likelihood function is solved with maximum a posterior (Mohanty, et al.) estimate for the
isoform abundance through Bayesian inference procedure.
The basic models of alternative splicing events include exon skipping, mutually
exclusive exons, alternative 5'/3' splicing sites and retained intron. However, most of the
previous tools mainly focused on exon skipping event only. Isoform quantification
models could apply to all types of AS but they are heavily replied on accurate isoform
annotation. The estimation could also affected by sequencing bias in different regions
of isoforms. Here we propose a weighted least square model to better estimate the
correlation between splicing events and regulators expression. According to the number
of junction reads of each sample, different weights are assigned to measure the
correlation coefficient and identify significant correlation. Our method is not restricted to
skipping exon and is capable to all AS events such as alternative 5'/3'. Simulations
prove that our method is more accurate in estimating the correlation coefficient than
other traditional methods and is robust to noise among various type of correlation
models.
23
2.2 Method
The shortness of RNA-seq read length and low-coverage of sequencing were
usually considered as the main cause of junction reads insufficiency. However,
considering RNA-seq data of 50~60 million reads with 75 bp length in human tissues,
more than 49% of junctions have read coverage less than 10 (Figure 5A). The
traditional inclusion rate equation might result in large error in these situations. On the
other hand, if we filter out the samples with low junction reads, the sample size would
be largely reduced and the true signal would be obscure. To better understand the
estimation error caused by insufficient junction reads, we perform a simulation to
represent the correlation between junction reads coverage and the accuracy of inclusion
rate. The inclusion rates of a splicing event among n samples are
U[0,1],
we generate T junction reads between [10,100] with
.
Then the estimated
and it is considered as an
accurate estimation if
. The accuracy and
24
A
B
Figure 5. Read coverage in junctions and the accuracy of inclusion rate estimation
(A) For human RNA-seq with 400-500M reads (coverage 22~26X), more than 45% junctions
have read coverage less than 10 (B) The positive correlation between junction reads number
and accuracy of inclusion rate estimation. The estimated inclusion rate is considered as
correct when its absolute difference with real inclusion rate is less than 0.1.
25
are linearly correlated and saturated after the size of junction reads reach 100 (Figure
5B). A similar result has also been observed for alternative 5'/3' cases with
and
. Since the abundance of
junction reads is an index of accuracy, we propose a weighted least square (WLS)
method to estimate the correlation between the expression level of splicing factors and
the inclusion rate of an alternative splicing. For a considered gene X and a splicing
event k with inclusion rate Y among N samples, we assume a linear regression model
where is a residual as the difference between the real value and the
predicted value by our model. For a sample i, let X is a matrix where
is the
expression level of the gene, Y is a matrix where
is the inclusion rate of
the splicing event and W is a diagonal matrix where
. X and Y are first normalized
according to the weight
.
where the weighted mean of X is
and the weighted variance of X
The estimated
is calculated by minimize the weighted square of residual .
26
The standard error of
,
, where the mean square error
The statistic
, therefore we can calculate p-value according to t
distribution.
2.3 Simulation
2.3.1 Estimation of correlation coefficient
We generated two independent standard normal random variables
,
, and let
. Therefore Y has a correlation
coefficient of r with X and follows Gaussian distribution N(0,1). To simulate the inclusion
rate of a splicing event, we transformed X into uniform distribution by using Inverse
transformation method. The number of junction reads of each splicing event is randomly
sampled according to the empirical distribution of junction reads count in human. The
correlation coefficient r is randomly assigned between [0,1] and the error of estimations
is defined as for all the methods over 1000 iterations. Figure 6 displayed that the
estimation errors of correlation coefficient using weighted least square method are
smaller than other traditional correlation estimation.
27
2.3.2 The power of correlation tests in different types of relationship
Correctly identifying the correlated genes of a splicing event is critical to study
alternative splicing mechanism. Here we compare the power of several correlation
models, incuding Pearson's correlaion, Spearman's rank-order correlation, Kendall tau
rank correlation, Hoeffding's D test, dCov test, Renyi correlation, MI, MIC and our
weighted least square model (Albanese, et al., 2013; Sales and Romualdi, 2011) using
R packages (Hmisc, energy, acepack, parmigene and minerva) with default parameter
setting at 5% significant level. For each type of relationship, we run 200 iterations
among various sample sizes to ensure the obtained performance is reliable. Renyi
correlation is approximated by using standard ACE estimate. Since dCov, ACE and MI
follow unknown asymptotic distribution, their p-values were calculated using
permutation test with 500 iterations.
28
Some pre-calculated p-value tables were downloaded from
http://www.exploredata.net/Downloads/P-Value-Tables while others were calculated
with permutation procedure. Some gene regulations did not show strong correlation
Figure 6. The median of estimation errors among different test.
The estimation error between calculated correlation coefficients and real values using
different methods. WLS better estimates correlation coefficients than other mothods,
especially when the sample size is small
29
either due to experimental noise or cofactor effect. A reliable statistic test is expected to
detect weak signal in order to fit the real dataset.
Figure 7. The power of statistics for various correlation coefficients rejecting at 5%
significant level.
In the most situations, weighted correlation test has better power than others statistics,
especially when the sample size or the correlation coefficient is small.
30
The same setting as the previous correlation coefficient simulation is applied again with
to compute the power curves of each method in different sample sizes, 25, 50,
75 and 100. WLS has the best power especially when the sample size is not enough
(Figure 7). As a result, WLS is useful when researchers could not obtain enough RNA-
seq samples.
The other four types of dependence relationship models are listed in table 3, including
linear, quadratic, cross and coupled time serious with noise. For these simulations, we
generate a random Gaussian variable
and an independent noise variable
. The gene expression is set
and inclusion rate
is
obtained from inverse transform sampling of
. The number of junction reads of
is
randomly sampled according to the empirical distribution of junction reads count in
human. The power of the tools to identify the correlation between
and
is calculated
after 200 iterations under different sample sizes.
31
We already proved that WLS could better identify weak correlation relationship in linear
model, and this current result also revealed that it is robust to noise (Figure 8). For non-
linear models, quadratic relationship, rank correlation such as Spearmen and Kendall
outperform other tools but WLS is still one of the best methods (Figure 9). Since the
negative values will be positive after squaring, the mean of quadratic is increased to 2
instead of 0 to avoid this situation. The regulation of splicing factors for a target gene
might be opposite in different tissues or different development stages and therefore it
could be difficult to detect. A cross model, which gene could either up-regulate or down-
regulate their target with equal probability, is simulated. The simulation represents that
WLS has the best power to identify this special type of relationship (Figure 10). Time
serious model is widely applied in cell development stages and drug treatment
experiments (Figure 11). For example, the previous research of coexpression in cell
Linear
Quadratic
Cross
Time
series
Table 3. Parameters setting of four different types of correlation models.
32
cycle data successfully identified the two pairs of coexpressed transcription factors
(BAS1 vs. GCN4; MSN2 vs YAP1)(Wang, et al., 2014). For time serious evaluation, our
WLS is still ranked as one of the best tool among different sample sizes and noise
conditions.
Figure 8. The power of various correlation tests rejecting at 5% significant level as
noise level increasing on linear model.
33
Figure 9. The power of various correlation tests rejecting at 5% significant level as
noise level increasing on quadratic model.
34
Figure 10. The power of various correlation tests rejecting at 5% significant level as
noise level increasing on cross model.
35
In summary, WLS is one the strongest tools to identify regulator expression and
correlated splicing event among different relationship models. Even though rank
correlation tests perform better in non-linear situation, WLS could achieve the similar
Figure 11. The power of various correlation tests rejecting at 5% significant level as
noise level increasing on time series model.
36
power after obtaining sufficient sample size. WLS could be even more beneficial in the
situation of low correlation coefficient or insufficient sample size.
2.4 Result
Embryonic stem cell has been considered as a potential key tool for regenerative
medicine and tissue engineering because of their capacity of self-renewal. Transcription
factors, including OCT4, NANOG, TCF3 and LEF1(Boyer, et al., 2006), are vital genes
for regulating pluripotency and reprogramming. The isoform selection of transcription
factors such as FOXP1 could stimulate the expression of these pluripotency-required
genes and therefore promote the maintenance of embryonic stem cells (ESC)
pluripotency (Gabut, et al., 2011). However, most of the alternative splicing regulations
on transcription factors and their upstream splicing factors remain largely unknown. To
better understand the role of alternative splicing in stem cell pluripotency and the
regulatory network, we gathered 32 RNA-seq data from various cell lines and tissues
among human, mouse and rat as our experimental target. The mouse and rat ESCs are
collected from zeroth, third, seventh and fourteenth day of cell differentiation toward
neurons. After sequencing the exacted mRNA with HiSeq, we obtained around 40
million paired-end reads for each sample. The other RNA-seq data of among human,
mouse and rat, including brain, cerebellum, heart, liver, kidney are downloaded from
GEO database under the access numbers GSE25842, GSE30352, GSE46669 and
GSM530343. In rat dataset, only four time points of ESCs, liver and kidney are applied
due to public data availability. All the sequencing reads are mapped to the
37
corresponding reference assemble, hg19, mm10 and rn4 with TopHat and the RPKM of
splicing factors is calculated with Cufflinks.
To investigate the potential ESC-specific alternative splicing regulators and their
target exons, 240 human, 230 mouse and 217 rat genes based on previous research
and literature mining of “splicing” or “spliceosome” in Refseq annotation were selected
for testing. We focused on the cassette exons, alternative 5’/3’ of 43 key
regulators(Boyer, et al., 2006) of pluripotency mechanism including OCT4, TCF3 and
NANOG. According to human Gencode 19, mouse Gencode M2 and rat ensemble
annotation for rn4, all the cassette exons and alternative 5’/3’ sites of these regulators
were chosen as candidates. The corresponding genomic locations among three species
are computed by liftOver, a genome coordinate conversion, to identify conserved
cassette exons or cleavage sites during evolution. We identify the significant correlation
between the splicing factors expression level and inclusion rate by using WLS to
calculate the p-value. In addition, the local false positive rate (locfdr) of the pairs were
also estimated and a threshold of 0.2 locfdr is set. We identified 66 potential regulations
between splicing factors and cassette exon and 8 between splicing factors and
alternative 5’/3’ (supplementary 1 ).
Transcription factor LEF1 participates in Wnt pathway, binding to stabilized -
catenin and helping maintain the undifferentiated status of ESCs through modulation of
Oct4 and NANOG(Huang and Qin, 2010). LEF1 has two cassette exons, 6th exon and
11st exon, and then we discovered 6th exon is under specific regulation in ESCS.
Among three species, stem cells have significant higher preference on exon-skipped
isoform comparing to other cell lines or tissues(Figure 12, P = 0.00016,Wilcoxon tests)
38
and the similar phenomenon, the inclusion of LEF1 exon 6 is increased during thymic
development, is observed in human cell lines (Mallory, et al., 2011). Mouse brain also
showed a dramatic regulatory switch during development. Exon 6 was almost skipped in
embryonic thalamus and adult thalamus exhibited nearly no skipping (Nagalski, et al.,
2013). Among the LEF1 correlated splicing factors we identified, SART3 is also involved
in hematopoietic stem cell proliferation with WNT1 function.
The inclusion of exon 6 is dependent on the expression and binding of splicing
factor CELF2 to two intronic flanking region of regulated exon, and knockdown of
CELF2 results in loss of exon 6 inclusion. In this study, even splicing factor CELF2 is
not selected as a candidate because of Its lack of annotation in rat, we test the above
statistics on CELF2 and the inclusion rate of LEF1 exon 6 in human and mouse.
Because of the estimation error from low-coverage samples (Figure 13a), none of the
traditional methods can successfully identify the correlation except our WLS (Figure 13b,
Table 4). In samples with insufficient junction reads coverage, the large estimation bias
of inclusion rate disturbed the correlation statistics. After filtering out the samples with
junction reads less than 10, the significance could be also discovered in Pearson’s
correlation and dCov test. The boundary between insufficient and sufficient junction
reads in a sample is dynamic and the different standard may apply when the
sequencing depth and sample size are changed. WLS performs the best in evaluating
connection without considering threshold selection in this real case. Since isoform
quantification is an alternative method to estimate the inclusion rate of splicing event,
Cufflinks was tested in this case using the FPKM ratio between isoforms. Most of the
statistic tools still fail to identify the correlation between LEF1 exon skipping and CELF2
39
expression level except our method and MIC. Since MIC is designed for linear and
nonlinear relationship and has power around 0.18 even for two independent variables,
the significant correlation might due to the detection of linear and parabola relation type
P-value P-value (#read≥10) Cufflinks
WLS 0.025 0.006
Pearson 0.545 0.014 0.553
Spearman 0.499 0.626 0.888
Kendall 0.379 0.726 0.793
Hoeffding’s D 0.116 0.623 0.114
dCov 0.275 0.025 0.615
Renyi 0.054 0.060 0.108
MI 0.254 0.166 0.574
MIC 0.090 0.850 0.014
Table 4. The p-value of seven correlation statistics under the hypothesis that CELF2
and the inclusion rate of 6th exon of LEF1 is independent.
Previous study already proved the regulation of CELF2 on the exon-skipping event of LEF1.
WLS model is the only method to detect the correlation without pre-processing data.
Pearson's correlation and dCov could also identify the significant correlation after filtering out
biased samples. The isoform quantification method with Cufflinks is an alternative tool to
estimate the inclusion rate but most of statistical tests still fail to detect the correlation.
40
Figure 12. The inclusion rate between ESCs and other tissues among three
species.
Column 1,2 shows the inclusion rate of cassette exon of LEF1 (6th) in ESCs is
significant lower than other tissues (P = 0.00016). Since TCF3 has mutually exclusive
exons, the inclusion rate is defined as the percentage of RNA with first exon retained.
The inclusion rate of TCF3 is significantly higher in ESCs (column 3 and 4,
P=0.001474) and this phenomenon is conspicuous in human cells and tissues (column
5 and 6, P = 0.001554).
41
A
B
Figure 13. The scatter plot of CELF2 expression level and inclusion rate of the
LEF1 6th exon.
(a) The original graph (b) The size of points represents the log10(#junction reads).
The correlation of CELF2 and the inclusion rate of exon 6 in LEF1 is clearly observed
after adjusting the samples with weights.
42
TCF3 is also a -catenin-responsive transcription factor, acting as repressor of
OCT3 and NANOG (Pereira, et al., 2006; Tam, et al., 2008), and also the target of these
factors(Cole, et al., 2008; Loh, et al., 2006). TCF3/LEF complexes have a key role in
controlling cell fate lineages in stem cells (Merrill, et al., 2001). The isoform selection of
TCF3 is critical for pluripotency and knockdown of specific isoform will inhibit distinct
differentiation pathways (Salomonis, et al., 2010). According to latest Gencode
annotations of human, mouse and rat, Tcf3 has two mutually exclusive exons instead of
a cassette exon predicted in previous study and we observed isoform selection again in
our dataset. The preference of the first exon is significantly higher in stem cell lines than
other tissues (Figure 12, P = 0.001474, Wilcoxon tests) especially in human (Figure 12,
P=0.001554, Wilcoxon tests). The fact that several splicing factors which are highly
correlated with TCF3 has been reported or predicted as AS regulators in stem cells. For
example, previous study shows the expression of MNBL2 will result in repressing of
ESC-specific alternative splicing(Han, et al., 2013) and we observe the low expression
of MNBL2 in ESCs. The expression of MNBL2 and inclusion rate of TCF3 exist high
negative correlation (r=-0.89, WLS) which is consistent with previous research. Another
identified regulator, RBMX, is a X-linked RNA binding motif protien, playing several role
in tissue-specific alternative splicing and male-specific function in spermatogenesis. In
human protein-protein prediction database(McDowall, et al., 2009), RBMX has the
highest probability to be the interactor between TCF3 and MSH2, a DNA mismatch
repair protein. These information supports the reliability of our discovery.
43
2.5 Discussion
In this paper, we propose a weighted least-square method capable of estimating
correlation coefficient and detecting correlation between regulator expression and
alternative splicing event. Simulation proves that WLS performs as one of the best test
tools under different sample sizes, noise levels and relationship models. WLS is also
the only test that successfully identifies the confirmed correlation between the
expression level CELF2 and the inclusion rate of 6th exon of LEF1 without additional
pre-processing. By applying our WLS to read RNA-seq data, we discover the significant
isoform preference of TCF3 and LEF1 in ESCs. In the supplementary files, part of highly
correlated splicing regulators have been proved their involvement in pluripotency AS
control or protein interaction. In practice, experiments could further confirm our
discovery and clarify the mechanism of AS in pluripotency process.
44
Chaper 3. Quantifying protein synthesis rate with an
algorithmic translation model
45
The abundance of mRNA has been considered as a measurement of protein
expression level because of the rapid development of RNA-seq technique and the sharp
reduction in per-base cost. However, mRNA is under extensive post-transcriptional
regulations and therefore protein synthesis prediction by transcripts density is not
reliable. In addition, genes might be under different translation rates since the
programmed ribosome pausing, which facilitate the cotranslation folding and secretion,
is observed during translation. Ribosome profiling is a new sequencing technique that
targets on the ribosome-protected mRNA, providing an alternative approach to measure
protein synthesis. Sequencing data display the different speed patterns of ribosomes
during initiation, elongation and termination and it is difficult to calculate protein
synthesis rate with simple distribution models. In the following sections, a translation
speed model which divides translation into three steps is applied to simulate the real
protein synthesis rate.
3.1 introduction
The mRNA abundance is widely utilized as a measurement of gene expression
because of the low cost and high performance of RNA-seq. Even though the expression
level of proteins and their mRNA density are highly correlated, the estimation of protein
synthesis rate with RNA-seq may contain some biases. First, mRNA is under the
extensive regulation at the level of translation, including RNA silencing, imprinting, and
transposon silencing. These biological processes inhibit mRNA translation and alter the
expression level of genes. In addition, the stack of ribosomes and pause of translation
46
are observed in cells. Variation of translation rates, which might result from local mRNA
structure or usage of rare translated codon, could have drastic effects on folding
efficiency of newly synthesis proteins (Tsai, et al., 2008).
Two dimensional polyacrylamide gel electrophoresis (2D-PAGE) is a gel
electrophoresis method to separate and identify proteins by two different physics
properties. In the first dimension, the proteins are separated according to their net
charge. Then the second dimension further classifies the proteins by the size. After the
dye procedure, imagine analysis can quantify protein spot. This method could analysis
thousands of proteins in a single run. The proteins in each spot could be collected and
further analyzed by mass spectrometry. However, this technique requires a large
amount of sample handling and is hard to be designed for high throughput analysis. The
measurement has small range of expression level, low sensitivity, high background
noise and limited reproducibility. Recently, mass spectrometry plays an important role in
structural biology because of its high sensitivity and selectivity. Since mass
spectrometry requires pure compound, it is also hard to be automated for high
throughput analysis. As a result, even though above techniques could direct monitor the
protein abundance, they are not practical in large scale study because of the lack of
high throughput design.
Ribosome profiling is a mRNA sequencing targeting on ribosome-protected
region. This high throughput sequencing could monitor the ribosomes density and their
movement along mRNAs. Protein synthesis rates are estimated according to the
47
ribosome densities and distributions on transcripts. The movement of ribosome is under
position-specific control (Figure 14) and estimation of velocity with a simple distribution
is difficult. For example, initiation step is the slowest stage so we observe the stack up
of ribosome in the beginning of gene. Here we apply a translation model to estimate the
initiation, elongation and termination rates of gene. Our method has high reproducibility
and is highly correlated with ribosome density. When we select the mass spectrometry
data as a standard, our estimation represents higher correlation than other models.
3.2 Method
We construct the translation model based on previous algorithmic
approach(Mehra and Hatzimanikatis, 2006). The translation process are classified into
three steps, including initiation, elongation and termination. To fit the model consistent
Ingolia N.T. et al. (2009) Science
Figure 14. The distribution of ribosomes on mRNAs
48
into real situation, we assume they have different constants and velocity for each step.
For a specific gene g, the translation process involves initiation velocity (
), elongation
velocity (
, and termination velocity
. The initiation rate is described
by the equation
where
is a constant for initiation complex formation and
is dissociation rate
constant.
is the total number of free ribosomes.
is the probability that the initiation
site of mRNA is not occupied.
where
is the probability of codon i in gene g being occupied by the front of a
ribosome. L stands for the number of codon a ribosome occupied while binding to
mRNA. Since the footprint of ribosome-protected mRNA are about 26-28 nt long, we
choose L=9 in this study.
is the abundance of mRNA of gene g. Let r be the ratio of
free ribosomes among all ribosomes,
/
, where
is the total amount of
ribosomes. The ratio between abundance of mRNA of gene g and total ribosomes are
defined as
,
. As a result, the initiation rate is rewritten as
The velocity of ribosome during elongation are described as
49
where
is the elongation constant for codon j in mRNA of gene g.
is the
conditional probability that codon j+1 is free given codon j is occupied.
The termination rate is considered as the rate of protein synthesis in steady state.
And the steady state assumption is described as
and
are obtained from RNA-seq data and ribosome profiling data. Those
unknown parameters we have to estimate include initiation constant, dissociation
constant, elongation constants, termination constant, fraction of free ribosome and
mRNA/ribosome ratio.
The first step, given G number of genes, we estimated the elongation constant by
minimizing the cost function
50
After the elongation constant
is obtained from above equation, then we can
estimate fraction of free ribosome (r), mRNA ratio (
, initiation constants (
),
dissociation constant (
) and termination constant (
).
In the end, we can approach the termination rate of each gene with RNA-seq and
ribosome profiling data.
3.3 Result
The ribosome profiling data and RNA-seq data of budding yeast and mouse are
obtained from previous studies(Ingolia, et al., 2009; Ingolia, et al., 2011). The reads
from RNA-seq are mapped to reference assemble sacCer2 and mm10 using tophat.
The mRNA expression level of genes are calculated with GPseq. Read mapping files
are also downloaded from above study to calculate the distribution of reads on each
gene (
). The above parameter approach is applied to calculate the optimal
termination rates of genes with different iterations in order to capture the global
minimum of the cost function.
51
Firstly we inspect whether our parameter estimation method could converge and
reach reasonable solutions. If the estimation is converged, the initiation rate, elongation
rates and termination rate should be similar due to the steady state assumption. More
than 500 genes of budding yeast are displayed, proving the low estimation error
between initiation rate and termination rate (Figure 15).
Figure 15. The difference between initiation rate and termination rate of 500
genes in budding yeast.
The differences between initiation rates and termination rates of translation are small
in most of the genes, proving the converge of our translation model.
52
The traditional and intuitable method for the protein synthesis rate estimation
calculates the density of ribosomes on mRNA of each gene without considering the
difference of translation velocities among various positions. We expect our method
could improve this estimation but not drastically alter the result. When we compare our
method with ribosome density calculation, the high correlation between these two
methods was observed, proving the reliability of our approach (Figure 16).
Figure 16. The correlation between the estimated termination rate and ribosome
density method.
The translation model is highly correlated with ribosome density on mRNA.
53
A
B
Figure 17. The correlation between the estimated protein synthesis rate and
mass spectrometry data of yeast.
(A) The correlation between traditional ribosome density estimation and mass
spectrometry data. (B) The correlation between estimated termination rate of
translation model and mass spectrometry data. The translation model represent higher
correlation coefficient with the reference experiment result, indicating the improvement
of this model from ribosome density calculation.
54
Evaluation of protein synthesis is challenging now due to the insufficient of large scale
protein quantification experiment. Traditional gene expression level verification methods,
such as RT-PCR, focus on mRNA level instead of protein level. The mass spectrometry
data which provides more reliable reference is applied as reference (Picotti, et al., 2013).
Compared to ribosome density method, our estimated termination rate represents
slightly higher correlation with mass spectrometry data. Since the programmed
ribosome pause and cofolding are key mechanisms in higher eukaryotic mammals, we
believe the distinct difference could be observed in mouse. However, large scale of
mass spectrometry analysis of mouse is required to further prove this translation model.
3.4 Future work
The development of ribosome profiling technique has provided an critical
approach in protein quantification study. Compared to traditional method, ribosome
profiling has the advantages of high throughput, low cost and easy sample handling.
Besides, the distribution of ribosome is under control of translation-related factors,
mRNA position and codon usage and it is impractical to apply a simple distribution as
RNA-seq analysis. In this study, we proposed a more comprehensive model than a
simple density calculation. When comparing with mass spectrometry data, our
estimation shows higher correlation. However, more mass spectrometry data of high
eukaryotic mammals are required to prove and further improve our model.
55
Chaper 4. Summary
56
The next-generation sequencing technologies is enormously improved in terms
of speed, read length, throughput and cost during this decade. With diverse
biochemistry design, profiling experiments could reveal different aspect of functional
elements: ChIP-seq reveals the binding sites of DNA-associated proteins; DNase-seq
reveals the chromatin accessibility of different regions; MNase-seq reveals the positions
of nucleosomes complexes; and ribosome profiling identifies the ribosome distribution
on mRNA. Integrated analysis of these rich data resource, which mighty depends on
computational tools and statistical measurements, will improve the disease test and
clinical treatment. The progress and application of the genomic big data analysis
remains challenging but promising.
The alternative ployadenylation study, we successfully integrated functional
elements around poly (A) sites, including H3K36me3, chromatin accessibility,
nucleosome occupancy, to implement a SVM prediction model. The low chromatin
accessibility around poly (A) sites across six human cell lines was observed, especially
for the distal poly (A) sites of genes using multiple poly (A) sites. Since the phenomena
of nucleosomes-depletion around poly (A) sites were revealed after processing MNase-
seq data, we excluded the possibility that the lower chromatin accessibility around poly
(A) sites was resulted from higher nucleosome occupancy. As an indicator for
transcriptional regions of genes, H3K36me3 modification was found to peak within
actively transcribed genes especially in exonic regions. These APA sites all tended to
have higher H3K36me3 than the unique poly (A) sites, indicating the distinct role of
H3K36me3 in genes using APA.
57
In RNA-seq analysis, the estimation bias from insufficiency of junction reads was
a challenging problem for isoform quantification. Scientist have implemented different
tools to improve the accuracy of isoform quantification either by including body reads on
supporting region into consideration, giving weights to each junction read, distributing
multi-reads to genome reference or assigning reads among isoforms with EM algorithm.
However, most of these methods mainly focused on exon skipping event. A weighted
least square regression model was proposed to reduce the estimation errors from low
coverage samples by assigning weights according to junction reads coverage. WLS
was not restricted to skipping exon and capable to all AS events such as alternative 5'/3'.
Simulations proved that our method was more accurate in estimating the correlation
coefficient than other traditional methods and was robust to noise among four types of
correlation models. Splicing factor CELF2 binded to two intronic flanking regions of the
6th exon of LEF1 and therefore regulated the alternative splicing event. WLS was the
only tool successfully identifying the correlation without pre-processing among several
statistical tests
. Ribosome profiling was a new sequencing technique that targeted at the
ribosome-protected mRNA, providing an alternative approach to measure protein
synthesis. Since the different velocities of ribosomes were observed during initiation,
elongation and termination, comprehensive translation model was essential instead of a
simple distribution. To simulate the true protein synthesis rate, a translation model
considering different initiation, elongation and termination rates of genes was applied to
ribosome profiling data. This model had high reproducibility, high consistent with
ribosome density and better similarity with the mass spectrometry data in budding
58
yeast. Since translation was under more complicated control in high eukaryotic
mammals, this translation model was expected to better approach the real protein
synthesis mechanism. In the future, we would verify and improve this translation model
when large scale mouse protein quantification data are available.
59
Bibliography
60
Albanese, D., et al. (2013) Minerva and minepy: a C engine for the MINE suite and its R, Python
and MATLAB wrappers, Bioinformatics, 29, 407-408.
Aoki, K., Ogata, Y. and Shibata, D. (2007) Approaches for extracting practical information from
gene co-expression networks in plant biology, Plant & cell physiology, 48, 381-390.
Bannister, A.J. (2004) Spatial Distribution of Di- and Tri-methyl Lysine 36 of Histone H3 at
Active Genes, Journal of Biological Chemistry, 280, 17732-17736.
Boyer, L.A., Mathur, D. and Jaenisch, R. (2006) Molecular control of pluripotency, Current
opinion in genetics & development, 16, 455-462.
Chang, C.C. and Lin, C.J. (2011) LIBSVM: A Library for Support Vector Machines, Acm T Intel
Syst Tec, 2.
Chang, T.-H., et al. (2011) Characterization and prediction of mRNA polyadenylation sites in
human genes, Medical & Biological Engineering & Computing, 49, 463-472.
Cole, M.F., et al. (2008) Tcf3 is an integral component of the core regulatory circuitry of
embryonic stem cells, Genes & development, 22, 746-755.
Derti, A., et al. (2012) A quantitative atlas of polyadenylation in five mammals, Genome
research, 22, 1173-1183.
Djebali, S., et al. (2012) Landscape of transcription in human cells, Nature, 489, 101-108.
Dunham, I., et al. (2012) An integrated encyclopedia of DNA elements in the human genome,
Nature, 489, 57-74.
Edmonds, M. (2002) A history of poly A sequences: from formation to factors to function,
Progress in nucleic acid research and molecular biology, 71, 285-389.
Fullwood, M.J., et al. (2009) Next-generation DNA sequencing of paired-end tags (PET) for
transcriptome and genome analyses, Genome research, 19, 521-532.
Gabut, M., et al. (2011) An alternative splicing switch regulates embryonic stem cell pluripotency
and reprogramming, Cell, 147, 132-146.
61
Han, H., et al. (2013) MBNL proteins repress ES-cell-specific alternative splicing and
reprogramming, Nature, 498, 241-245.
Huang, C. and Qin, D. (2010) Role of Lef1 in sustaining self-renewal in mouse embryonic stem
cells, Journal of genetics and genomics = Yi chuan xue bao, 37, 441-449.
Ingolia, N.T., et al. (2009) Genome-wide analysis in vivo of translation with nucleotide resolution
using ribosome profiling, Science, 324, 218-223.
Ingolia, N.T., Lareau, L.F. and Weissman, J.S. (2011) Ribosome profiling of mouse embryonic
stem cells reveals the complexity and dynamics of mammalian proteomes, Cell, 147, 789-802.
Ji, Z., et al. (2009) Progressive lengthening of 3' untranslated regions of mRNAs by alternative
polyadenylation during mouse embryonic development, Proceedings of the National Academy
of Sciences of the United States of America, 106, 7028-7033.
Jiang, H. and Wong, W.H. (2009) Statistical inferences for isoform expression in RNA-Seq,
Bioinformatics, 25, 1026-1032.
Katz, Y., et al. (2010) Analysis and design of RNA sequencing experiments for identifying
isoform regulation, Nature methods, 7, 1009-1015.
Langfelder, P. and Horvath, S. (2008) WGCNA: an R package for weighted correlation network
analysis, BMC Bioinformatics, 9, 559.
Li, B. and Dewey, C.N. (2011) RSEM: accurate transcript quantification from RNA-Seq data with
or without a reference genome, BMC Bioinformatics, 12, 323.
Lin, Y., et al. (2012) An in-depth map of polyadenylation sites in cancer, Nucleic acids research,
40, 8460-8471.
Loh, Y.H., et al. (2006) The Oct4 and Nanog transcription network regulates pluripotency in
mouse embryonic stem cells, Nature genetics, 38, 431-440.
Lutz, C.S. (2008) Alternative polyadenylation: a twist on mRNA 3' end formation, ACS Chem
Biol, 3, 609-617.
62
Mallory, M.J., et al. (2011) Signal- and development-dependent alternative splicing of LEF1 in T
cells is controlled by CELF2, Molecular and cellular biology, 31, 2184-2195.
Mayr, C. and Bartel, D.P. (2009) Widespread Shortening of 3 ′UTRs by Alternative Cleavage
and Polyadenylation Activates Oncogenes in Cancer Cells, Cell, 138, 673-684.
McDowall, M.D., Scott, M.S. and Barton, G.J. (2009) PIPs: human protein-protein interaction
prediction database, Nucleic acids research, 37, D651-656.
Mehra, A. and Hatzimanikatis, V. (2006) An algorithmic framework for genome-wide modeling
and analysis of translation networks, Biophysical journal, 90, 1136-1146.
Merrill, B.J., et al. (2001) Tcf3 and Lef1 regulate lineage differentiation of multipotent stem cells
in skin, Genes & development, 15, 1688-1705.
Mohanty, B.K., Maples, V.F. and Kushner, S.R. (2012) Polyadenylation helps regulate functional
tRNA levels in Escherichia coli, Nucleic acids research, 40, 4589-4603.
Nagalski, A., et al. (2013) Postnatal isoform switch and protein localization of LEF1 and TCF7L2
transcription factors in cortical, thalamic, and mesencephalic regions of the adult mouse brain,
Brain structure & function, 218, 1531-1549.
Pan, Q., et al. (2008) Deep surveying of alternative splicing complexity in the human
transcriptome by high-throughput sequencing, Nature genetics, 40, 1413-1415.
Pereira, L., Yi, F. and Merrill, B.J. (2006) Repression of Nanog gene transcription by Tcf3 limits
embryonic stem cell self-renewal, Molecular and cellular biology, 26, 7479-7491.
Picotti, P., et al. (2013) A complete mass-spectrometric map of the yeast proteome applied to
quantitative trait analysis, Nature, 494, 266-270.
Reverter, A. and Chan, E.K. (2008) Combining partial correlation and an information theory
approach to the reversed engineering of gene co-expression networks, Bioinformatics, 24,
2491-2497.
63
Richard, H., et al. (2010) Prediction of alternative isoforms from exon expression levels in RNA-
Seq experiments, Nucleic acids research, 38, e112.
Sales, G. and Romualdi, C. (2011) parmigene--a parallel R package for mutual information
estimation and gene network reconstruction, Bioinformatics, 27, 1876-1877.
Salomonis, N., et al. (2010) Alternative splicing regulates mouse embryonic stem cell
pluripotency and differentiation, Proceedings of the National Academy of Sciences of the United
States of America, 107, 10514-10519.
Schwartz, S., Meshorer, E. and Ast, G. (2009) Chromatin organization marks exon-intron
structure, Nature Structural & Molecular Biology, 16, 990-995.
Shepard, P.J., et al. (2011) Complex and dynamic landscape of RNA polyadenylation revealed
by PAS-Seq, RNA, 17, 761-772.
Song, L., Langfelder, P. and Horvath, S. (2012) Comparison of co-expression measures: mutual
information, correlation, and model based indices, BMC Bioinformatics, 13, 328.
Soojin Kim, H.K., Nova Fong, Benjamin Erickson, and David L. Bentley (2011) Pre-mRNA
splicing is a determinant of histone H3K36 methylation, PNAS, 108, 13564-13569.
Spies, N., et al. (2009) Biased chromatin signatures around polyadenylation sites and exons,
Molecular cell, 36, 245-254.
Tam, W.L., et al. (2008) T-cell factor 3 regulates embryonic stem cell pluripotency and self-
renewal by the transcriptional control of multiple lineage pathways, Stem cells, 26, 2019-2031.
Tazi, J., Bakkour, N. and Stamm, S. (2009) Alternative splicing and disease, Biochimica et
biophysica acta, 1792, 14-26.
Trapnell, C., et al. (2010) Transcript assembly and quantification by RNA-Seq reveals
unannotated transcripts and isoform switching during cell differentiation, Nature biotechnology,
28, 511-515.
64
Tsai, C.J., et al. (2008) Synonymous mutations and ribosome stalling can lead to altered folding
pathways and distinct minima, Journal of molecular biology, 383, 281-291.
Wang, E.T., et al. (2008) Alternative isoform regulation in human tissue transcriptomes, Nature,
456, 470-476.
Wang, L., et al. (2010) A statistical method for the detection of alternative splicing using RNA-
seq, PloS one, 5, e8529.
Wang, Y.X., Waterman, M.S. and Huang, H. (2014) Gene coexpression measures in large
heterogeneous samples using count statistics, Proceedings of the National Academy of
Sciences of the United States of America, 111, 16371-16376.
Wu, J., et al. (2011) SpliceTrap: a method to quantify alternative splicing under single cellular
conditions, Bioinformatics, 27, 3010-3016.
Zhang, J., Kuo, C.C. and Chen, L. (2014) WemIQ: an accurate and robust isoform quantification
method for RNA-seq data, Bioinformatics.
Abstract (if available)
Abstract
The next-generation sequencing (NGS) technology has altered the study of mRNA from microarray to a new era. Comparing to microarray, RNA-seq has the ability to quantify large range of expression level, detect novel transcripts without reference genome, offer higher specificity and sensitivity, and detect the rare or low-abundance transcripts. Technologies such as ChiP-seq, DNase-seq, MNase-seq and ribosome profiling combine next generation sequencing with new biochemistry experiment and provide more information to study mRNA interaction, regulation and expression. ❧ Alternative polyadenylation has also been identified as a critical regulatory mechanism in human gene expression even though there are still no sufficient study to clarify the mechanism. The National Human Genome Research Institute launched a public research project, ENCODE, to identify all functional elements in human genome sequence with variety NGS experiments. We combine various sequencing data of human cells to recognize the critical functional elements around alternative polyadenylation sites statistically. The discovered functional elements signals have improved the polyadenylation or alternative polyadenylation prediction in machine learning methods. ❧ RNA-seq is a powerful tool to study alternative splicing or isoform selection under multiple conditions. The first study using RNA-seq to quantify splicing event calculated the ratio between inclusive junction reads and exclusive junction reads as inclusion level. After the original approach, various tools have been designed to better estimate the isoform ratio. Here we propose a weighted least square (WLS) method to solve the problems of low-coverage in junctions and small sample size in splicing event study. We prove the WLS is robust to noise, better in correlation coefficient estimation, and discovering 10% ~ 20% more true positive than other traditional correlation statistics with simulations. ❧ Both microarray and RNA-seq technique measure mRNA abundance as genome expression level. However, mRNA is not a perfect index of protein synthesis since they are under extensive post-transcriptional regulations. In additional, programmed ribosome pausing is observed during translation to facilitate the cotranslation folding and secretion, resulting in different translation rate of genes. A ribosome profiling technique, which is based on sequencing the ribosome-protected mRNA, provides insight into the real translation process. Sequencing data display the significant different speed of ribosomes on each region of mRNA template and difficulty to calculate protein synthesis rate with simple distribution models. Here we applied a translation speed model, dividing the translation into three steps, to estimate the real protein synthesis rate. ❧ In summary, my study focus on gathering sequencing data to discover important features of genes, proposing math models to reduce the bias and applying models to real data for future experiments.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Isoform quantification and splicing regulation analysis in RNA-seq studies
PDF
Genome-wide studies reveal the function and evolution of DNA shape
PDF
Probing the genetic basis of gene expression variation through Bayesian analysis of allelic imbalance and transcriptome studies of oil palm interspecies hybrids
PDF
Comparative genomics of translational regulation
PDF
Comparative transcriptomics: connecting the genome to evolution
PDF
Discovery of mature microRNA sequences within the protein- coding regions of global HIV-1 genomes: Predictions of novel mechanisms for viral infection and pathogenicity
PDF
Computational methods for translation regulation analysis from Ribo-seq data
PDF
Using genomics to understand the gene selectivity of steroid hormone receptors
PDF
Neural activity alterations with learning in MBNL2 KO mice
PDF
Mapping genetic variants for nonsense-mediated mRNA decay regulation across human tissues
PDF
Innovative sequencing techniques elucidate gene regulatory evolution in Drosophila
PDF
The splicing error of FOXP1 in type I myotonic dystrophy
PDF
Genetics and the environment: evaluating the role of noncoding RNA in autism spectrum disorder
PDF
Genome-wide studies of protein–DNA binding: beyond sequence towards biophysical and physicochemical models
PDF
Placenta growth factor-miRNAs-lncRNAs axis in the regulation of ET-1 gene involved in pulmonary hypertension in sickle cell disease
PDF
Using novel small molecule modulators as a tool to elucidate the role of the Myocyte Enhancer Factor 2 (MEF2) family of transcription factors in leukemia
PDF
Evaluating the robustness and reproducibility of RNA-Seq quantification tools using computational replicates
PDF
Characterization of three novel variants of the MAVS adaptor
PDF
Mechanistic studies of novel small molecule anti-cancer agents using next generation sequencing
Asset Metadata
Creator
Lee, Che-yu
(author)
Core Title
Genome-wide studies reveal the isoform selection of genes
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
07/31/2015
Defense Date
04/30/2015
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
alternative polyadenylation,alternative splicing,histone modification,OAI-PMH Harvest,protein synthesis,regulatory network,RNA-seq,weighted least square regression
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Chen, Liang (
committee chair
), Sun, Fengzhu Z. (
committee member
), Ying, Qi-Long (
committee member
)
Creator Email
cheyulee@usc.edu,cheyulee74@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-621236
Unique identifier
UC11303819
Identifier
etd-LeeCheyu-3774.pdf (filename),usctheses-c3-621236 (legacy record id)
Legacy Identifier
etd-LeeCheyu-3774-1.pdf
Dmrecord
621236
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Lee, Che-yu
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
alternative polyadenylation
alternative splicing
histone modification
protein synthesis
regulatory network
RNA-seq
weighted least square regression