Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
(USC Thesis Other)
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
MODEL SELECTION METHODS FOR GENOME WIDE ASSOCIATION
STUDIES AND STATISTICAL ANALYSIS OF RNA SEQ DATA
by
Sudeep Srivastava
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
August 2012
Copyright 2012 Sudeep Srivastava
Dedication
to Family and Friends
ii
Acknowledgments
I would first and foremost like to thank Professor Liang Chen for her support,patience
and advice throughout my PhD, especially in tough times. I would also like to thank
my committee members Prof. Fengzhu Sun, Prof Sergey Nuzhdin and Prof Peter Rad-
chenko for their guidance and advice. I would also like to thank Prof. Ting Chen for
being on my Qualifying exam committee and providing helpful advice and guidance.
I would like to thank Kjong, Iris and Jenny for reading over this dissertation and pro-
viding valuable feedback. None of this would have been possible without the great
friends I have made over these 6 years at MCB, USC and Los Angeles. Finally, I would
like to thank my family for the unshakeable trust and support they placed on me. This
dissertation is dedicated to all of them.
iii
Table of Contents
Dedication ii
Acknowledgments iii
List of Tables vii
List of Figures ix
Abstract xi
Chapter 1: Genome Wide Association Studies 1
1.1 Linkage Disequilibrium . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Single Marker Association Studies and Multiple Testing . . . . . . . 3
1.3 Common and Rare Variants . . . . . . . . . . . . . . . . . . . . . 6
1.4 Population Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Chapter 2: Model Selection Methods for Genome Wide Association Stud-
ies 10
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Pre-filtering . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 Sure Independence Screening . . . . . . . . . . . . . . . . . 14
2.2.3 Stochastic Search Variable Selection . . . . . . . . . . . . . 16
2.2.4 LASSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.5 Elastic Net . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.6 Area Under the Curve . . . . . . . . . . . . . . . . . . . . . 23
2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Simulation Studies . . . . . . . . . . . . . . . . . . . . . . 24
2.3.2 Daunorubicin-Induced Cytotoxicity Data . . . . . . . . . . 29
2.3.3 Rheumatoid arthritis data . . . . . . . . . . . . . . . . . . . 34
2.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
iv
Chapter 3: RNA Sequencing 39
3.1 High Throughput Sequencing and RNA Seq . . . . . . . . . . . . . 39
3.2 RNA Seq Analysis Pipeline . . . . . . . . . . . . . . . . . . . . . . 40
3.2.1 Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.2 Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.3 Modelling and Quantification . . . . . . . . . . . . . . . . . 42
3.2.4 Biases in RNA Seq data . . . . . . . . . . . . . . . . . . . . 44
3.2.5 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.6 Differential Expression . . . . . . . . . . . . . . . . . . . . 48
3.2.7 Splicing rates for exons and Alternative transcript quantifi-
cation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Chapter 4: A two-parameter generalized Poisson model to improve the
analysis of RNA-Seq data 52
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Two-parameter generalized Poisson model for RNA-seq data . . . . 54
4.2.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.2 Intuition for the generalized Poisson distribution . . . . . . 55
4.2.3 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.4 Method to identify differentially expressed genes . . . . . . 58
4.2.5 Methods to identify differentially spliced exons . . . . . . . 61
4.2.6 Simulation Strategy to calculate P-values . . . . . . . . . . 64
4.2.7 Real data sets . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.8 Computation efficiency and software packages . . . . . . . 67
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.1 Gene and gene expression estimation . . . . . . . . . . . . . 68
4.3.2 Factors associated with bias . . . . . . . . . . . . . . . . 75
4.3.3 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.4 Identification of differentially expressed genes . . . . . . . . 80
4.3.5 Comparison with other models . . . . . . . . . . . . . . . . 90
4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Chapter 5: Quantifying transcript isoform expression with positional
bias from RNA-seq 100
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.2.1 Isoform Estimation . . . . . . . . . . . . . . . . . . . . . . 102
5.2.2 Simulation Studies . . . . . . . . . . . . . . . . . . . . . . 105
5.2.3 Application to Real Data . . . . . . . . . . . . . . . . . . . 108
5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.3.1 Results from Simulation Studies . . . . . . . . . . . . . . . 109
5.3.2 Results from Real Data Analysis . . . . . . . . . . . . . . . 113
v
5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Chapter 6: Conclusions and Future Work 119
6.1 Model Selection Algorithms for Genome Wide Association Stud-
ies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.2 RNA-Seq Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Bibliography 123
vi
List of Tables
2.1 Simulation results of QTL mapping comparing SSVS,LASSO and
Elastic Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2 Number of markers selected by the SSVS for different posterior
probability cutoffs. . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1 Goodness of fit for genes and exons using the GP and the Poisson
model for RNA Seq reads . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Comparison of the GP estimates with estimates from the Quantigene
assay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3 Top 10 oligonucleotides with the largest absolute correlations with
^
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.4 The number of differentially expressed genes declared from the com-
parison of technical or biological replicates . . . . . . . . . . . . . 82
4.5 The number of differentially expressed genes declared from the com-
parison of technical or biological replicates . . . . . . . . . . . . . 83
4.6 Goodness-of-fit of the GP model and the negative binomial (NB)
model. The percentage of genes and exons with a chi-square test
p-value> 0.05 are listed. . . . . . . . . . . . . . . . . . . . . . . . 92
4.7 Comparison between the GP model and the mseq model . . . . . . . 93
4.8 Error rate of the GP model in the case of dependence . . . . . . . . 95
5.1 Comparisons of isoform expression estimates by simulations. Bias
parameters are varied in the simulation model . . . . . . . . . . . . 110
5.2 Comparisons of isoform expression estimates by simulations. The
isoform expression levels are varied . . . . . . . . . . . . . . . . . 111
vii
5.3 Comparisons of isoform expression estimates by simulations. The
bias parameter is sampled from real data . . . . . . . . . . . . . . . 111
5.4 Comparison of absolute isoform expression estimation using the GP,
Poisson and N-URD models. Simulations based on bias curves . . . 112
5.5 Comparison of relative isoform estimation using the GP, Poisson
and N-URD models. Simulations based on bias curves . . . . . . . 112
viii
List of Figures
1.1 Single Nucleotide Polymorphism . . . . . . . . . . . . . . . . . . . 2
1.2 Linkage Disequilibrium pattern on a stretch of chromosome 12 . . . 4
2.1 LASSO applied to a diabetes dataset . . . . . . . . . . . . . . . . . 20
2.2 ROC Curves for the LASSO for different marker densities . . . . . 26
2.3 ROC Curves for the SSVS for different marker densities . . . . . . . 27
2.4 Positions of markers selected by the SSVS, the LASSO and the
Elastic Net for the Daunorubicin-induced cytotoxicity data. . . . . . 31
2.5 Comparison of the outputs of the three methods for the Daunorubicin-
induced cytotoxicity data. . . . . . . . . . . . . . . . . . . . . . . 32
2.6 Comparison of the outputs of SSVS and LASSo for the two IgM
and anti-CCP phenotype in the RA data. . . . . . . . . . . . . . . . 36
3.1 RNA Sequencing pipeline . . . . . . . . . . . . . . . . . . . . . . . 41
4.1 Histogram of the p-values for the goodness of fit chi-square tests . . 72
4.2 Fraction of total mRNA amount derived from highly expressed genes
in human tissue samples . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 The observed and the expected frequencies of read counts for the
top four highly expressed genes in liver tissue . . . . . . . . . . . . 76
4.4 ROC curves for the GP, the Poisson and the GLM(Poisson, negative
binomial and quasi-Poisson links) in the identification of differen-
tially expressed genes. . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.5 ROC curves for the GP using the permutation test, the Poisson, and
the GLM (Poisson, negative binomial, and quasi-Poisson links) in
the identification of differentially expressed genes . . . . . . . . . . 84
ix
4.6 Number of tissue-specific genes identified from differentially expres-
sion studies using the GP model . . . . . . . . . . . . . . . . . . . 86
4.7 Number of tissue-specific genes identified from differentially expres-
sion studies using the simulation strategy to calculate the p-value for
the GP model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.8 Identification of differentially spliced exons . . . . . . . . . . . . . 88
4.9 Identification of differentially spliced exons using the simulation
strategy for the GP model . . . . . . . . . . . . . . . . . . . . . . . 90
5.1 Transcript structures used in the basic simulation model . . . . . . . 106
5.2 Distribution of values. (A) All exons with estimates for the
Brain RNA-seq HISeq dataset were used. (B) Highly expressed
exons were used. . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.3 The annotation and coverage of 3 different genes from Human RNA-
seq experiments using the Illumina HiSeq machine. . . . . . . . . . 115
x
Abstract
Genome-wide association studies are important tools to reconstruct the genotype phe-
notype map to understand the underlying genetic architecture of complex traits. This
enables us to better understand the genetic architecture of these phenotypes. With the
advances in genotyping and high throughput sequencing technologies, millions of mark-
ers can be identified for individual populations in very short durations of time. Due to
the multiple loci control nature of complex phenotypes, there is great interest to test
markers simultaneously instead of one by one. In chapter 2, we compare three model
selection methods for genome wide association studies using simulations: the Stochastic
Search Variable Selection (SSVS), the Least Absolute Shrinkage and Selection Opera-
tor (LASSO) and the Elastic Net. We apply the three methods to identify genetic vari-
ants that are associated with daunorubicin-induced cytotoxicity. We also compare the
LASSO and the SSVS to a dataset of two quantitative phenotypes related to Rheumatoid
Arthritis.
In the second part of the dissertation, a two parameter generalized Poisson(GP)
model to analyze RNA Seq is proposed. Deep sequencing of RNAs (RNA-seq) has
been a useful tool to characterize and quantify transcriptomes. However, there are sig-
nificant challenges in the analysis of RNA-seq data, such as how to separate signals
from sequencing bias and how to perform reasonable normalization. In chapter 4 ,we
used the generalized Poisson model to separate out the “true” expression level from the
xi
bias. We show that the GP model fits the data much better than the traditional Poisson
model. Based on the GP model, we can improve the estimates of gene or exon expres-
sion, perform a more reasonable normalization across different samples, and improve
the identification of differentially expressed genes and the identification of differentially
spliced exons. We also use a likelihood based approach to estimate the expression levels
of transcripts using the GP model discussed in chapter 5.
RNA Sequencing and genome wide associations studies have led to a rapid growth
in understanding of complex genetic phenotypes and diseases. These two methods are
crucial tools in the genomic age in the fields of molecular biology, genomics, population
and quantitative genetics etc. Using these tools effectively, with the help of statistical
and algorithmic methods, would lead to a rapid growth of knowledge in these fields and
in the overall field of biology.
xii
Chapter 1
Genome Wide Association Studies
Genome Wide Association studies aim at finding associations between genotypes or the
genetic content of the cell and phenotypes at a genome wide scale. Phenotypes are
observable characteristics or traits of an organism, tissue or cell. They can be quan-
titative or qualitative in nature. Quantitative phenotypes include height, weight, gene
expression etc. which can be measured as a continuous value. Qualitative phenotypes
have categorical values, like whether a person has a disease or not. The aim is to find
loci in the genome which cause variation in a given phenotype and hence have an effect
on the phenotype. Even though association does not imply causation, it gives us insight
into the genetic structure of the phenotype. The most common genotypes used in asso-
ciation studies are SNPs, copy number variation, Insertions/Deletions, microsatellites
etc.
Single Nucleotide Polymorphisms or SNPs are the simplest type of genetic varia-
tion where a single DNA base pair varies across individuals of a population. A SNP is
illustrated in Fig 1.1 in a population of 7 people. The base at the 9
th
position is a SNP
which can either be adenine(A) or tyrosine(T). Most SNPs are biallelic, i.e. they have
two alleles. In the example, the two alleles of the SNP are T which is the major allele
as it occurs at a higher frequency and A which is the minor allele. Triallelic SNPs are
present in the population but are very infrequent. Single Nucleotide Polymorphisms can
be identified using microarrays like the Affymetrix Chip Array or using high through-
put sequencing machines.Affymetrix Chip arrays have probes which correspond to the
reference allele of the SNP. If the sample has the reference allele at the SNP position,
1
it will bind. On the other hand, if the alternative allele is present, the probe will not
bind. Therefore the SNP alleles can be identified for a given sample. However this
method cannot be used to identify SNPs at new positions. High throughput sequencers
like Illumina, SOLID can sequence the genome at high coverage levels and SNPs can be
identified by identifying variations in the sequence compared to the reference genome.
Algorithms and software have been developed which identify SNPs from high through-
put genomic data like COMB [76], GATK [56] etc. These algorithms use reads mapping
to a reference genome to identify whether particular positions can be classified as SNPs
using Bayesian methods [76] or simple counting based approaches. Novel SNPs can be
identified using this method as the whole genome is resequenced for each sample.
Figure 1.1: Single Nucleotide Polymorphism : This figure describes a SNP in a pop-
ulation of 7 people which is either A or T. A is the minor allele with a minor allele
frequency of 3=7 and T is the major allele with a frequency of 4=7.
SNP
A/T
2
1.1 Linkage Disequilibrium
The basic principle behind using SNPs and other biomarkers instead of the whole
genomic sequence is based upon the principle of linkage disequilibrium(LD). Linkage
disequilibrium is defined as the non-random association of alleles between multiple loci.
SNPs which are close to each other are usually inherited together and hence show a
strong correlation with each other. Taking advantage of this fact, we can use SNPs to do
association studies and look for gene loci which are close to them to make inferences
of genetic loci associated with a particular phenotype. The most common measure to
estimate LD between two allelesA
1
=A
2
andB
1
=B
2
is given by
r =
D
p
p
1
p
2
q
1
q
2
(1.1)
where p
1
;p
2
;q
1
;q
2
are the frequencies of alleles A
1
;A
2
;B
1
;B
2
respectively and D =
x
11
p
1
q
1
wherex
11
is the frequency of allelesA
1
;B
1
occurring together. Therefore LD
estimates the difference between the haplotype frequency under the assumption of inde-
pendence (p
1
q
1
) and the actual haplotype frequency (x
11
) after suitable normalization.
An example of the LD pattern of chromosome 12 of the human genome are shown in
Figure 1.2. This figure was taken from [64]. We can see three blocks which are in high
LD with each other and they correspond to the positions of three genes on chromosome
12. Therefore several markers would be in high LD with these three genes.
1.2 Single Marker Association Studies and Multiple
Testing
The most common method to perform association studies on a genome wide scale is
testing each marker for a significant association with the phenotype. For quantitative
3
Figure 1.2: LD pattern on a stretch of chromosome 12. A red square represents high
LD and a blue square represents low LD. We can see several LD blocks which repre-
sent large stretches of DNA which are in high LD. This enables high power association
studies to be carried out. This figure was taken from [64]
phenotypes, ifY represents the phenotype data andX
j
represents the values of SNPj
(The SNP can be encoded as the number of major alleles it has. Therefore if the SNP
4
is A/A, with A being the major allele, it is encoded as 2, A/a as 1 and a/a as 0), then a
simple linear regression can be carried out as follows
Y =
j
X
j
+ (1.2)
where
j
represents the association between SNPj and the phenotype. Therefore testing
forH
0
:
j
= 0 would give us a p-value to test for the significance of association. For
qualitative phenotypes or case control studies, the odds ratio can be calculated based on
the allele frequencies in the case and control groups. The p-value for significance can
be calculated using a Chi-squared test.
However marker by marker association studies have the following problems
1. Multiple Testing : Since there arep tests being conducted simultaneously wherep
is the number of markers, andp is always in the order of millions of markers, the
significance cutoff has to be adjusted for multiple testing. This is because if the
significance level is kept at 0:01 and 100 independent tests are carried out, then
the probability of an error occurring in at least one of those tests is 0:01100 = 1.
Therefore, the p-value should be corrected to take into account for multiple correc-
tion. If Bonferroni correction is carried out, then for a million markers the p-value
cutoff changes from 0:05 to 5 10
8
which can be too stringent and might miss
several associated markers. Therefore effective multiple testing corrections need
to be investigated. However, the markers which are significant at such thresholds
have been successfully replicated in multiple studies [55].
2. Interactions : Interactions between markers cannot be modeled when each marker
is tested one by one. Therefore, epistatic and compensatory effects might be
missed by single marker analyses. However since p is in the order of millions,
modeling two marker interactions would increase the number of predictors to
5
p(p 1)=2 which will make the multiple testing problem worse. Therefore an
effective way to analyze interactions need to be developed. Zhang and Liu [96] use
a Bayesian MCMC method to infer epistatic interactions in case-control studies.
Even though this method can handle around 500,000 markers, it is not able to han-
dle the millions of markers generated in genome wide experiments nowadays.[19]
1.3 Common and Rare Variants
There are two factors which contribute to the statistical power of a method to detect
associations. These are the frequency of the alleles in the population which are associ-
ated with the phenotype and the size of the effect. The effect size of the allele is how
much of the phenotype is explained by the allele i.e. how much variation in the pheno-
type can be explained by the variation in the SNP. Many alleles could have small effects
on the phenotype or fewer alleles may have a large effect on the phenotype.
The Common Disease Common Variant (CDCV) hypothesis states that common
diseases are a result of common variants [91]. Disease susceptibility is a result of the
action of several common variants i.e. they each explain a small percentage of the
disease risk in a population. This hypothesis dominated the field of GWAS but has been
refuted because of missing heritability in most GWAS studies [30].
Moving away from the CDCV model, there have been three other models which
have been proposed for the architecture of complex traits.
1. A large number of common variants with a small-effect explain the different
genetic traits. This is also called the Infinitesimal model where hundreds of thou-
sands of loci contribute to the genetic architecture of a complex trait. There have
been studies where common variants have captured majority of the genetic vari-
ance of the trait [91]. Common traits have also been found to be associated in
6
studies of model organisms [7]. However very few of these variants identified
in disease association studies have been functionally validated and there is still
the problem of the missing heritability i.e. most of the variation in the pheno-
type cannot be explained by the identified variants. Majority of the GWA studies
have not been able to explain the ratio of genotypic to phenotypic variance in a
population[92]
2. The rare allele model proposes that there are many low frequency alleles with a
large effect. However there are multiple such rare alleles which act independently
in different groups of the population and therefore their effects cannot explain
the total variance in the complete population. Evolutionary theory predicts that
disease alleles should occur infrequently as they should be removed from the pop-
ulation by selection. [6]. Population studies have shown that many rare disorders
are due to rare alleles of large effect e.g. the BRCA1 and BRCA2 breast cancer
susceptibility mutations. [26]. However the analysis of GWAS data is not con-
sistent with this model as studies have shown that there are common variants are
responsible for explaining associations and these results have been replicated in
independent studies.
3. Broad sense heritability models point to interactions including genotype by geno-
type and genotype by environment to model quantitative traits. These models
incorporate epigenetic effects like DNA methylation patterns. Since GWA stud-
ies don’t have sufficient power to incorporate genotype by genotype interactions,
there is very little real data for this model. However, several genetic interaction
screens in yeast [20] have demonstrated that this model is biologically relevant.
Similar examples have been shown in Drosophila melanogaster and C. elegans
[52, 44]
7
Though there are many arguments for and against each of these individual models
[30], a combination of all the three models needs to be investigated. The integration of
rare,common variants and environmental factors would provide a better understanding
of the genetic architecture of traits.
1.4 Population Structure
Population structure arises when there are systematic differences in the alleles of sub-
groups of a given population. Therefore the population itself can act as a confound-
ing variable when the traits correlate with the population and therefore cause spurious
associations between these alleles and the trait. Population structure arises due to non-
random mating between groups of populations because of factors like distance. The
most popular methods to correct for population structure are using PCA and mixed
models
1. Principal Component Analysis : PCA has been used to correct for population
stratification by removing the top principal components if they show any varia-
tion with respect to the population[60]. These top components can also be used as
covariates in different models . Software like EIGENSTRAT can be used to cor-
rect for stratification in GWA studies (using PCA). However PCA cannot model
family structure which need to be modeled separately.
2. Mixture Models : In linear mixed models, the phenotypeY is a function of fixed
effectsX plus random effectsu. Therefore
Y =X +u + (1.3)
8
with Var(u) =
2
g
K,whereK represents a kinship matrix which describes the com-
ponent of random variation that has been inherited ( represents random variation
independent of any other effects). K is defined as the pairwise similarity of indi-
viduals, so it can explain the population structure. Software like EMMAX and
TASSEL have made these methods applicable to genome wide association stud-
ies [39, 38, 97]. However in some situations modeling population structure as a
random effect might lead to incorrect results since it is a fixed effect for all sam-
ples. For fixed effect modeling, PCA is more applicable by introducing the top
principal components as fixed effect covariates.
Fixing for population structure is crucial to get true associations between the markers
and the phenotype and hence perform accurate association studies.
9
Chapter 2
Model Selection Methods for Genome
Wide Association Studies
Due to the multiple loci control nature of complex phenotypes, there is great interest to
test markers simultaneously instead of one by one. In this chapter, we compare three
model selection methods for genome wide association studies using simulations: the
Stochastic Search Variable Selection (SSVS), the Least Absolute Shrinkage and Selec-
tion Operator (LASSO) and the Elastic Net. We apply the three methods to identify
genetic variants that are associated with daunorubicin-induced cytotoxicity. We also
compare the LASSO and SSVS on a Rheumatoid Arthritis dataset. The simulation stud-
ies were performed by using the genotype data of 60 unrelated individuals from the CEU
population in the Hapmap project. This research has been published in [77] and [79].
2.1 Introduction
With the advances in genotyping technology, it has become feasible to perform large-
scale, high-density genome wide association (GWA) studies to search for common
genetic variants underlying complex phenotypes ( [55, 42] ). However, due to lack of
computing power, single-marker tests remain the primary tools in the analysis of GWA
data. Most quantitative phenotypes are complex in nature and are caused by multiple
genetic variants, each of them having varying degree of effects. The possible interac-
tions among genetic variants and the interactions between genes and the environment
10
present additional challenges for Quantitative Trait Loci (QTL) mapping. Due to the
multiple loci control nature, testing markers simultaneously instead of one by one may
increase statistical power. In order to identify the correct set of genetic variants from mil-
lions of markers, efficient and reasonable model selection algorithms are in urgent need.
Three popular model selection methods have been proposed : the Stochastic Search
Variable Selection (SSVS) [29], the Least Absolute Shrinkage and Selection Operator
(LASSO) [82], and the Elastic Net [99].
In the SSVS, a latent variable
is introduced to perform the variable selection for
the regression mode.
i
= 1 implies that thei
th
variable is included in the model and
i
= 0 implies that thei
th
variable is excluded from the model. A homogenous ergodic
Markov Chain can be generated by the Gibbs Sampler. The empirical distribution of
based on the Markov chain will converge to the actual posterior distribution of
[11].
The LASSO proposed by Tibshirani is a shrinkage based selection method for linear
regression. It minimizes the residual sum of squares subject to the constraint on the sum
of absolute value of coefficients. ThisL
1
-Norm constraint produces shrunk coefficients
with some of them exactly equal to zero, which leads to interpretable models. In 2004,
Efron et al. proposed the Least Angle Regression(LARS) [27] which is a computation-
ally efficient model selection algorithm. There is a close connection between the LARS
and the LASSO. A simple modification of the LARS algorithm can yield all the LASSO
solutions. Due to their popularity and usefulness, the LASSO and the LARS have drawn
intensive research interest in the statistical field.
The Elastic Net proposed by Zou and Hastie [99] uses a novel regularization penalty.
The naive Elastic Net uses a combination of theL
1
andL
2
regularization penalty. How-
ever, the Elastic Net uses a scaled version of the naive Elastic Net estimate to reduce
the overshrinking of parameters. It has been shown that the Elastic Net outperforms
the LASSO [99]. In addition, it has a grouping effect in which correlated predictors
11
group together. Thus, they are included together or excluded together from the model.
This is advantageous in association studies as many markers are highly correlated via
high linkage disequilibrium (LD). Another modification to the LARS algorithm gives
all the solutions to the Elastic Net for a given value of the parameter. This enables a fast
implementation of the Elastic Net algorithm. Due to efficient implementation and the
grouping effect, the Elastic Net is very commonly used.
The LASSO and the Elastic Net are two of the most popular model selection methods
which involve the minimization of the mean square error with respect to regularization
constraints. SSVS on the other hand is based on the Gibbs Sampler which belongs to
the broader class of Markov Chain Monte Carlo methods. Hence, a comparison of the
three methods would be of great interest.
We apply these methods to two biological datasets. The first phenotype dataset we
use is the cytotoxicity induced by a chemotherapeutic agent Daunorubicin. Daunoru-
bicin is an anthracycline chemotherapeutic agent, which is used in the treatment of var-
ious cancers including leukemia, lymphoma, and advanced HIV-associated Kaposi’s
sarcoma [21, 71]. Daunorubicin has also been shown to be toxic and is associated with
myelosuppression and cardiac toxicity [50, 72, 93]. It has been reported that 29 %
of variation in susceptibility to daunorubicin-induced toxicity is due to genetics [25].
Therefore, it is important to conduct GWA studies to identify genetic variants which are
responsible for increased susceptibility to daunorubicin-induced toxicity. We used the
phenotype data provided in [34]. The authors used a cell growth inhibition assay to mea-
sure variations in the cytotoxicity of daunorubicin. We applied the SSVS, the LASSO
and the Elastic Net to the 3,967,790 SNPs to identify genetic variants associated with
daunorubicin-induced cytotoxicity.
12
We also focus on two quantitative phenotypes related to rheumatoid arthritis(RA).
The data was provided by the Genetic Analysis Workshop(GAW16). RA is an autoim-
mune disease which causes chronic inflammation in joints. It is important to study this
disease as it can lead to long-term joint damage resulting in loss of function and disabil-
ity. The two quantitative phenotypes measure the amount of antibodies that co-occur
with RA. Rheumatoid factor IgM is an antibody which attaches to immunoglobulin
G forming an immune complex. The immune complex can trigger different types of
inflammation related processes in the body. However, up to 20% of RA patients remain
negative for IgM, and many patients with a positive IgM may not have the disease. A
newly identified autoantibody, anti-cyclic citrullinated peptide (anti-CCP) is more pre-
dictive of erosive outcomes [35]. If present in a patient at a moderate to high level,
it confirms the diagnosis of RA. Anti-CCP may also be detected in healthy individ-
uals years before the onset of clinical RA. Therefore an analysis of both phenotypes
might give insightful results into the disease. Specifically, we applied the SSVS and the
LASSO to the two quantitative phenotypes to identify genetic variants associated with
RA. We compared the results based on 545,080 SNP’s.
2.2 Methodology
2.2.1 Pre-filtering
The genotype was coded as 0,1 or 2 for homozygous rare alleles, heterozygous alle-
les, and homozygous common alleles respectively (i.e., assuming additive effect). The
missing alleles were imputed according to the genotype frequency calculated from the
available data. As a prescreening step, markers with a minor allele frequency less than
0.01 were discarded. For the simulation studies, the phenotype data was simulated with
13
5 causal markers and each of the causal markers had an equal effect on the pheno-
type. The marker effect and the linkage disequilibrium among markers were varied.
Simulation Studies details are described in Section 2.3.1.For the Daunorubicin data set,
the phenotype data was transformed using an inverse normalization of percentile ranks.
This was done to make the phenotype normally distributed, an assumption required by
all the three methods. A log transformation was performed on the Rheumatoid arthritis
phenotypes to achieve normally distributed data.
2.2.2 Sure Independence Screening
We assume that the high dimensional data is sparse. That means most of the markers
are not associated with the output phenotype. This is a reasonable assumption since
there are huge number of markers included in the study. One of the biggest challenges
with a large number of predictors is that there might be spurious correlations among
different predictors which might lead to confounding correlations with the output. This
problem is accentuated in association studies when markers are highly correlated among
themselves due to linkage disequilibrium. The Dantzig Selector which is the solution to
aL
1
regularization problem [9] has been shown to have the ideal risk up to a logarithmic
factorlog(p) wherep is the number of predictors. Howeverlog(p) can also grow very
fast when the number of markersp is large and this bound is no longer adequate with the
large number of markers in current . Hence, we need to apply an effective dimensionality
reduction method in the first stage before we apply a model selection method in the
second stage as the efficiency of most model selection algorithms decreases dramatically
with the increase in the number of predictors. This dimensionality reduction also helps
in reducing computational time required for the model selection methods.
To decrease the dimensionality of the marker data for the real data set, a method
called Sure Independence Screening (SIS) was performed [28]. SIS has been shown to
14
have the Sure Screening property which is that all the important variables survive in the
model with a probability close to 1 under some conditions which are stated in [28]. The
SIS method assumes a linear model.
Y =X +;
whereX denotes the genotype data which are columnwise standardized,Y denotes the
phenotype data, is the regression coefficient, and are i.i.dN(0; 1) random variables
independent of the rest of the parameters in the model. The method uses correlation
learning i.e. calculating the correlations between the predictors and the dependent vari-
able to detect predictors in the true model.w is defined as the vector obtained by
w =X
T
Y:
SIS selects the largest component-wise magnitudes of the vectorw. For a given,M
denotes the marker set output by SIS as
M
=f1ip ::jw
i
j is amongst the [n] largest of allg;
where [n] denotes the integer part of n. The authors in [28] suggest that should
be chosen such that [n] < n. However, since we want to demonstrate that SSVS and
LASSO can handle more markers than the number of samples, we select 1000 markers
with the largest correlations for the Rheumatoid Arthritis dataset and 200 markers for
the Daunorubicin cytotoxicity dataset. Therefore, for our data, we used = 10=3 and
= 1:15; 1:34 for the Daunorubicin and RA datasets respectively. SIS reduces the
number of markers too(n) number of markers.
15
2.2.3 Stochastic Search Variable Selection
SSVS was proposed by George and McCulloch [29]. SSVS uses a hierarchical Bayes
model to identify the associated variables. Here, we assumed that the phenotype follows
a multiple regression model of a subset of the markers. The canonical regression setup
is given by
Y =X +; (2.1)
where the phenotype dataY isn 1, genotype dataX = [X
1
;:::;X
p
] isnp,X
i
is
the genotype data for markeri, = (
1
;:::;
p
)
0
, andN(0;
2
) where
2
is scalar.
The number of samples in the population is given byn and the number of markers byp.
A latent variable
i
is defined as the indicator whether marker i is selected in the
model or not. The
0
i
s follow a mixture model of the form:
i
j
i
(1
i
)N(0;
2
) +
i
N(0;
2
): (2.2)
2
is set close to 0 and
2
>> 0. Therefore, if the marker is included in the model,
its effect is distributed as a normal distribution centered around 0 with a large variance,
and when the marker is not included in the model, its effect is distributed by a normal
distribution close to a function modeled with a very small
2
. These kind of models
are called spike and slab models [57].
Any prior information about the
i
’s can be incorporated by setting a prior on the
i
’s. Letf(
) denote the prior. In our model, we assumed that the
i
’s are independent
with marginal distributions as below:
P (
i
= 1) = 1P (
i
= 0) = 1=p: (2.3)
16
To obtain 2.2 as the prior for
i
j
i
, a multivariate normal prior is used as follows:
j
N(0;D
RD
): (2.4)
R is the prior correlation matrix andD
is defined as
D
=diag[
1
;
2
;:::;
p
]; (2.5)
where
i
= if
i
= 0 and
i
= if
i
= 1. We used the prior correlation matrixR
as the identity matrix, but correlations between the markers can be incorporated in this
prior correlation matrix. The prior correlation matrix can also store other information
about the SNPs, like conservation score, quality score etc.
The prior on the residual variance is given by
2
j
InverseGamma(n=2;kYXk
2
=2): (2.6)
A markov chain is generated as follows
0
;
0
;
0
;
1
;
1
;
1
;:::;
j
;
j
;
j
;::: (2.7)
The transition probabilities are given by
j
jY;
j1
;
j1
N
p
(A
j1
X
0
X
^
LS
(
j1
)
2
;A
j1) (2.8)
where A
j1 =
X
0
X
(
j1
)
2
+D
1
j1
R
1
D
1
j1
1
and D
1
j
=
diag[(
1
)
1
; (
2
)
1
;:::; (
p
)
1
].
17
The
2
follows an inverse gamma distribution given by
j
jY;
j
;
(j1)
IG
n +
j1
2
;
jYX
j
j
2
+
j1
j1
2
(2.9)
and the transition probability for
i
is given by
j
i
= 1j
j
;
j
;
j
(i)
f(
j
j
j
(i)
;
j
i
= 1)p
i
f(
j
j
j
(i)
;
j
i
= 1)p
i
+f(
j
j
j
(i)
;
j
i
= 0)(1p
i
)
(2.10)
j
(i)
represents all the
’s except thej
th
marker. These formulae have been taken from
[29].The posterior probabilities of
can be estimated from the Markov chain gener-
ated by the Gibbs Sampler. We run the Gibbs Sampler for 2000 iterations to achieve
stationarity and then run it for an additional 8000 iterations to estimate the posterior
probabilities. We use a cutoff of 0.9 for the real dataset and a varying cutoff to calculate
the area under the curve for simulation studies.
2.2.4 LASSO
The LASSO, or the “least absolute shrinkage and selection operator”, tries to shrink
some coefficients and set most of the coefficients exactly equal to 0 using anL
1
regular-
ization parameter to achieve a model with a small number of variables and a small mean
square error.
The LASSO assumes a linear model
Y =X +;
18
where Y;X;; are the same as in SIS. The LASSO tries to minimizekYX
^
k
2
subject to theL
1
norm
P
j
j
j
j t. Heret 0 is the tuning or shrinkage parameter.
This statement can be rephrased as
^
=argmin
(kYXk
2
+
X
j
j
j
j);kambda> 0
By using the L
1
norm, the LASSO ensures that a subset of the predictors are exactly
0. Some studies have been done on the consistency of the LASSO. Zhao and Yu [98]
proved that when a condition known as the Strong Irrepresentable Condition is satis-
fied and when the error terms have some finite moments, the LASSO is strongly sign
consistent, i.e.9 =f(n) which is independent ofY orX such that
lim
n!1
P (sign(
^
()) =sign())! 1:
for large p and q where q is the number of markers which are not associated with the
phenotype. However, we only have 56 samples, hence we cannot rely on this condition
for the LASSO to pick up the correct markers.
The LASSO is a quadratic programming problem but can be solved by a simple
modification to the Least Angle Regression algorithm by Efron et al.[27]. The Least
Angle regression is closely related to the Forward Selection procedure. It requires
p steps to construct the whole model. The LARS algorithm starts with zero predic-
tors in the model, and selects the predictor which is most correlated with the response
and increases the coefficient in the direction of the predictor. As soon as correlation
between the residualYX and another predictor becomes equal to that of the first
predictor, LARS then increases the coefficient in the direction of the angle bisector
between the two predictors. Therefore, the correlation betweenYX and the active
set ( the set of predictors whose values are non-zero at that particular iteration) are
19
always the same at each iteration of the algorithm. A parsimonious model can be con-
structed by using cross validation or a C
p
score to choose an iteration cutoff for the
algorithm. LARS is modified in such a way that predictors can not only enter the model,
but also be removed from the model when the correlation with the residuals change sign.
This modification of the LARS algorithm can generate all the LASSO solutions (with
iteration cutoffs corresponding to different values of) An example of the LASSO and
LARS procedures are illustrated in figure 2.1
Figure 2.1: LASSO applied to a diabetes dataset [27] provided by the R package lars.
The figure illustrates the LARS algorithm used to estimate the LASSO’s. The X-axis
represents the normalizedL
1
norm and the Y-axis represents the values of the effects of
each of the predictors. The iteration numbers are displayed on top and different colors
represent different predictors and the predictor numbers are displayed on the right side.
In iteration 10, predictor number 7 is removed from the model and it is reintroduced
again in iteration 11.
* * * * * * * * * * * * *
0.0 0.2 0.4 0.6 0.8 1.0
-500 0 500
|beta|/max|beta|
Standardized Coefficients
* * * * *
*
*
*
* * * * *
*
*
*
*
* * * * * * * * *
* * *
*
*
*
*
*
* * * * *
* * * * * * *
*
*
*
*
*
*
* * * * * * * * *
*
*
*
*
* * * *
*
*
*
*
*
*
* *
*
* * * * * * * *
* *
* *
*
* *
*
*
* * *
*
*
*
*
*
*
* * * * * *
*
* * * * * *
LASSO
5 2 1 7 8 4 6 9
0 2 3 4 5 6 7 8 9 10 12
20
2.2.5 Elastic Net
The Elastic Net uses the linear model and tries to minimize the least square error using
a novel regularization penalty. It uses two regularization parameters (
1
;
2
). The linear
model is given as:
Y =X +;
whereY,X, and have their usual meanings.The Elastic Net estimator
^
(elastic net)
is derived from the naive Elastic Net estimator
^
. For any fixed non-negative
1
and
2
, the naive Elastic Net criterion is defined as
L(
1
;
2
;) =jYXj
2
+
2
p
X
i=1
2
i
+
1
p
X
i=1
j
i
j:
The naive Elastic Net estimator
^
is the minimizer of the above equation:
^
= arg min
fL(
1
;
2
;)g:
An artificial data set (Y
;X
) is defined as follows:
X
= (1 +
2
)
1=2
0
@
X
p
2
I
p
1
A
;
whereI
p
is thepp identity matrix and p is the number of markers. And
Y
=
0
@
Y
0
1
A
;
21
where 0 is thep 1 0-vector.
It can be shown that the naive Elastic Net solves a lasso-type problem given by
^
= arg min
^
jY
X
j
2
+
1
p
1 +
2
j
j
1
:
The Elastic Net estimates are given as
^
(elastic net) =
p
1 +
2
^
:
This kind of scaling undoes the overshrinking effect when we combine theL
1
andL
2
penalties. To choose the coefficients, a two dimensional cross validation, i.e. a cross
validation for each pair of the parameters
1
and
2
, is performed. The LARS extension
which implements the Elastic Net algorithm, called the LARS-EN, outputs a sequence
of variables corresponding to a given
2
.
1
has a one to one correspondence with
the number of iterations that the LARS-EN algorithm was run for. Therefore selecting
the active model at a given iteration for a particular
2
would give us the Elastic Net
solution corresponding to particular value of (
1
;
2
). To estimate the model parameters,
different values of
2
are chosen (i.e. (0; 0:01; 0:1; 1; 10; 100)) and the other tuning
parameter is chosen using 10 fold cross validation. The chosen
2
is the one giving the
minimum cross-validation error. The Elastic Net has been shown to have a grouping
effect, i.e. variables highly correlated with each other are included and excluded in the
model together. This is extremely useful in association mapping as many markers are in
high LD with each other. However, since the LASSO uses the LARS algorithm, once
a marker is introduced in the model, the markers which are in high correlation with it
have a very little chance of entering the model. This is because the LARS algorithm
uses the correlations of the markers not in the model with the residuals.A new marker
is introduced in the model when its correlation with the residual becomes equal to those
22
with the markers in the active set. Having already selected a highly correlated marker is
already in the active set, the correlations of the other markers with the residual will also
decrease as the correlation of the active set with the residuals decreases and therefore
those markers might never enter the model. This can lead to inaccurate models, since
the true markers may not be identified as part of the model.
2.2.6 Area Under the Curve
For simulation studies, the Area Under the Curve (AUC) statistic was used to assess
the power of the method. The Receiver Operating Characteristic (ROC) curve is the
two-dimensional plot of TP(c) vs. FP(c) ( or sensitivity vs. (1-specificity) ) for1 <
c <1. The overall performance of a classifier can be measured by the area under
the ROC curve. This quantity is called the AUC. An AUC of 0.5 represents a complete
random guess. The True Positive Rates(TP) and the False Positive Rates(FP) are defined
as follows:
TP (c) =
Number of Markers correctly classified as causal markers
Number of Causal Markers
;
FP (c) =
Number of Markers wrongly classified as causal markers
Number of Non-Causal Markers
;
wherec is the cutoff used in the method. For the SSVSc is the cutoff for the posterior
probability for the
j
’s. For the LASSO,c is the number of iterations that the LASSO is
run for.
For the SSVS, the AUC is calculated using the following formula modified from
[53]
AUC =
1
n
C
n
C
c
X
i2C;j2C
c
If
i
>
j
g;
23
whereC is the set of indices of the causal markers andC
c
is the set of indices of the non-
causal markers. n
C
andn
C
c are the number of causal and non-causal markers respec-
tively. Here the posterior probabilities for the
i
’s are used. This formula calculates the
area under the curve since it takes into account for different cutoffs implicitly how many
causal markers and how many non-causal markers are included in the model. For the
LASSO, the following formula is used to calculate the AUC
AUC =
1
n
C
n
C
c
X
i2C;j2C
c
If
i
<
j
g;
where
i
represents the first iteration at which the i
th
marker enters the model. This
corresponds to a cutoff of iterations used to select the model for the LASSO.
The AUC for the Elastic Net is calculated using the same formula as the LASSO, using
the
2
from the setf0; 0:01; 0:1; 1; 10; 100g as given in [99].
2.3 Results
2.3.1 Simulation Studies
For simulation studies, we considered the phenotype data for 60 individuals. The geno-
type data was simulated as follows: we selected a set of markers from chromosome 1 of
the Hapmap CEU population data at different marker densities. The density of markers
was varied according to the average number of markers selected from every 1,000 mark-
ers of the Hapmap Phase I data. We used 60 markers per sample in the simulations. 60
markers were chosen for the ease of AUC analysis for the LASSO and the elastic Net.
The AUC formula as explained previously requires each variable to enter the model.
However, since the LARS algorithm will exit after 60 iterations, a model consisting
of all the markers will never be achieved if we use more than 60 markers. We need the
24
number of markers less than the number of samples only for the ease of AUC calculation
in simulation studies. The LASSO and the elastic Net have the ability to handle more
markers than the sample size demonstrated in the real data examples. Among these 60
markers, five markers were selected to be associated with the phenotype. These markers
were labeled as the causal markers. The phenotype was simulated from a linear model
with different coefficients according to the following equation:
Y
j
=X
1j
+X
2j
+::: +X
5j
+
j
;
whereY
j
is the simulated phenotype for thej
th
individual, X
1j
;X
2j
;:::;X
5j
denotes
the genotypes of the five causal markers for thej
th
individual, denotes the coefficient
of the causal markers and
0
j
s are simulated as i.i.d N(0; 1). The AUC statistic was
calculated for the three methods and are summarized in Table 2.1. AUC values are
shown for the SSVS with = 0:05 and different values of, the LASSO and the Elastic
Net. The AUC values for a single marker F-test are also shown in Table 2.1. The SSVS
consistently has a higher AUC value than the other two methods. The Elastic Net has
a smaller AUC than the LASSO when the effect size is small and LD is high, but is
consistently higher than the LASSO in the other parameter values. The single marker
F-test has a higher AUC than the LASSO and the Elastic Net when the marker effect
is small. However, the LASSO and the Elastic Net are better in cases when the marker
effect is larger as shown in Table 2.1.
Figures 2.2 and 2.3 show the ROC curves for the LASSO and the SSVS. Figure
2.2 shows the ROC curves for the LASSO with the marker data set at different marker
densities for = 1. These marker densities are varied by selecting 100,10 and 1 marker
on an average per 1000 markers in the Hapmap data. We can see that the area under
the ROC curve increases as the marker density decreases. The reason for this is that
25
Figure 2.2: ROC Curves for the LASSO for different marker densities ( = 1).
0.2 0.4 0.6 0.8
0.6 0.7 0.8 0.9 1.0
1 - Specificity
Sensitivity
1 marker per 1000 Hapmap markers
10 markers per 1000 Hapmap markers
100 markers per 1000 Hapmap markers
when the marker density decreases, the independence between the markers “increases”
and hence the method can detect the causal markers more effectively. The same is
seen in Figure 2.3 for the SSVS ( = 1; = 0:05 and = 1). However, since the
values are very close to each other, the curves nearly overlap with one another. Also,
the ROC Curves intersect with each other, making it difficult to judge which one has
the higher AUC. More detailed values are listed in Table 2.1. It clearly shows that the
AUC increases when marker density decreases. The AUC values also change as the
parameter changes for the SSVS. When coefficient = 1, = 1 gives the maximum
AUC score. However, the differences are small.A ROC curve by definition is monotone
increasing. The ROC has a dip at the end for the lasso in Figure 2.2. The LASSO is
implemented as a modified LARS algorithm which can remove predictors after they are
26
Figure 2.3: ROC Curves for the SSVS for different marker densities ( = 1; =
0:05; = 1).
0.2 0.4 0.6 0.8 1.0
0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00
1 - Specificity
Sensitivity
1 marker per 1000 Hapmap markers
10 markers per 1000 Hapmap markers
100 markers per 1000 Hapmap markers
27
Table 2.1: Simulation results of QTL mapping. Simulation results are shown for different coefficients, marker densities (mea-
sured in the average number of markers selected from every 1,000 markers of the Hapmap Phase I data) and different values of
for SSVS. = 0:05 for the SSVS. The AUC values are the average values across 1,000 simulations.
Coefficient() Marker Density SSVS( = 1) SSVS( = 2) SSVS( = 3) LASSO Elastic Net F-test
0.5 200 0.898 0.891 0.884 0.684 0.679 0.729
0.5 100 0.923 0.918 0.913 0.7 0.718 0.753
0.5 10 0.949 0.944 0.942 0.783 0.824 0.800
0.5 1 0.950 0.947 0.942 0.798 0.830 0.814
1 200 0.969 0.955 0.947 0.873 0.894 0.814
1 100 0.987 0.982 0.978 0.886 0.912 0.854
1 10 0.997 0.996 0.995 0.965 0.977 0.922
1 1 0.999 0.998 0.997 0.979 0.982 0.932
1.5 200 0.989 0.985 0.976 0.926 0.940 0.838
1.5 100 0.998 0.998 0.996 0.939 0.961 0.887
1.5 10 1 1 0.999 0.991 0.993 0.951
1.5 1 1 1 1 0.995 0.996 0.951
2 200 0.991 0.988 0.986 0.952 0.962 0.823
2 100 0.999 0.997 0.997 0.962 0.973 0.874
2 10 1 1 0.999 0.995 0.997 0.963
2 1 1 1 1 0.998 0.999 0.966
28
added into the model. Since the algorithm with different iteration cutoffs corresponds to
a LASSO solution for a particular, we used the iteration number as the cutoff to make
the ROC curve. However, as we increase the number of iterations in the cutoff, more
false positives might be included and the true positive rate might also decrease as true
causal markers are removed from the model. Fortunately, these mainly happen when the
false positive rate is high (e.g.,> 0:7). And we are interested in the performance of the
method when the false positive rate is reasonably low. If we assume that a variable is
never removed from the model and calculate the sensitivity and specificity, we will get
a ROC curve which is monotone increasing.
2.3.2 Daunorubicin-Induced Cytotoxicity Data
We considered 3,967,790 SNPs in the Daunorubicin cytotoxicity data analysis. Markers
with a minor allele frequency less than 0.01 were removed to screen out noisy SNPs.
After this step, 2,598,208 markers remained. A total of 56 unrelated CEU individuals
have phenotype data available. We performed an inverse normal transform to convert the
phenotype data into a normally distributed phenotype. Using SIS, the correlation was
calculated between the phenotype and the genotype. The top 200 markers were chosen
for the SSVS, the LASSO, and the Elastic Net. This number was arbitrarily chosen but
deliberately kept larger than the number of individuals in the sample since all the three
methods can handle the p > n scenario. The Sure Independent Screening procedure
suggests the screening procedure to screen for n 1 markers where n is the number
of individuals. However, this relies on the linear assumption and therefore we relax the
suggestion to choose more markers. Ideally, we will want to choose as many markers
as the downstream method can handle computationally as analyzing markers together is
more powerful than one by one.
29
Table 2.2: Number of markers selected by the SSVS for different posterior probability cutoffs.
Cutoff = 0.5 Cutoff = 0.6 Cutoff = 0.7 Cutoff = 0.8 Cutoff = 0.9 AdjustedR
2
for cutoff = 0.9
0.05 1 0 0 0 0 0 -
0.05 2 0 0 0 0 0 -
0.05 3 0 0 0 0 0 -
0.01 0.2 4 2 2 1 0 -
0.01 0.4 5 4 2 1 1 0.254
0.01 1.2 4 3 3 3 2 0.4527
0.001 0.02 0 0 0 0 0 -
0.001 0.04 0 0 0 0 0 -
0.001 0.12 12 12 7 5 4 0.6598
0.0001 0.002 0 0 0 0 0 -
0.0001 0.004 0 0 0 0 0 -
0.0001 0.012 33 29 29 29 29 0.9641
30
Figure 2.4: Positions of markers selected by the SSVS, the LASSO and the Elastic Net for the Daunorubicin-induced cytotoxi-
city data.
● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● 0 50 100 150 200
5 10 15 20
Markers Selected by the SSVS, the LASSO and the elastic net
Marker Position ( Mbps )
Chromosome Number
● Markers Selected by the SSVS
Markers Selected by the LASSO
Markers Selected by the elastic net
Markers Selected by all three methods
31
Figure 2.5: Comparison of the outputs of the three methods for the Daunorubicin-
induced cytotoxicity data.
SSVS LASSO
Elastic Net
6 markers 1 markers
13
markers
10
markers
45
markers
0
92 markers
32
For the SSVS, the results can vary dramatically by changing the initial parameters
of and. We use = 0:05; 0:01; 0:001; 0:0001 and corresponding values of based
on the values of
2
2
= 400; 1600; 3600. Table 2.2 shows the number of variables selected
for the models with different parameters and posterior probability cutoffs. The results
are robust to the posterior probability cutoffs, which also indicates the convergence of
the chain. Using the model with the largest adjusted R
2
, we select = 0:0001 and
= 0:012 giving an adjusted R
2
of 0.9641. Figure 2.4 shows the positions (green
triangles) of the markers which were selected by the SSVS with model parameters =
0:0001; = 0:012.
The LASSO algorithm was run till the residual was below a certain level . The opti-
mum number of iterations was selected using a 10 fold cross validation. The minimum
mean square error was achieved at the 115
th
iteration. The LASSO selected 56 markers
in the final model. The positions of the markers are also shown in Figure 2.4 (blue dia-
monds). The Venn diagram in Figure 2.5 shows that the SSVS and the LASSO identify
10 common markers.
The Elastic Net algorithm was also used to select significant markers. The
2
param-
eter was selected from possible values of 0,0.01,0.1,1,10 and 100 using a 10 fold cross
validation. A
2
= 0:1 was used in the final model as it gave the minimum cross valida-
tion score. The number of iterations were further chosen using a 10 fold cross validation
and a model of size 160 was chosen (red circles in Figure 2.4). Among these selected
markers, 23 were also identified by the SSVS and 55 were identified by the LASSO. The
10 markers common between the SSVS and the LASSO were also selected by the Elas-
tic Net. From Figure 2.4, we can see that the models chosen by the different methods
are quite consistent, except for the size of the model.
33
2.3.3 Rheumatoid arthritis data
The markers with a minor allele frequency less than 0.01 were screened out. After this
step, 425,555 markers for the IgM phenotype and 425,586 markers for the anti-CCP phe-
notype remained. After removing the samples with missing phenotypes, 746 samples
for the IgM phenotype and 867 samples for the anti-CCP phenotype were left. The cor-
relation between the phenotype and the genotype was calculated. The top 1,000 markers
were chosen for the two phenotypes respectively. Then the SSVS and the LASSO were
run on both phenotypes.
For the SSVS, we used a cutoff of 0.9 for the posterior probabilities of markers being
included in the model. The parameter values in the simulation studies, = 0:05 and
= 1; 2; 3, were used such that
2
2
= 400, 1600 and 3600 respectively. For the IgM
phenotype, the SSVS selected 20, 11, and 6 markers for the three
2
2
ratios respectively.
For the anti-CCP antibody phenotype, the SSVS selected 18, 12 and 6 markers for the
three
2
2
ratios respectively. The positions of the markers selected by the SSVS for
= 0:05 and = 1 are shown in Figure 2.4. No common markers are identified for the
two quantitative phenotypes.
The LASSO algorithm was run for 500 iterations unless it terminated earlier auto-
matically because there were no more significant markers. The optimum cutoff for the
number of iterations was selected using a 10 fold cross-validation. The minimum cross-
validation error was achieved at 31 iterations for the IgM phenotype and at 33 iterations
for the anti-CCP phenotype. It therefore resulted in 31 and 33 significant markers for
the IgM and the anti-CCP phenotypes respectively. The marker positions are also shown
in Figure 2.6. No common markers were identified for the two phenotypes. However
two markers, rs230662 for the IgM Phenotype and marker rs483731 up for the anti-
CCP antibody phenotype are within 100 Kb of each other and both lie on the same gene
34
KTCD21 at chromosome location 11q14.1. This marker has been enclosed in a diamond
in Figure 2.6.
The SSVS and the LASSO identified a few common markers for the IgM phenotype
and the anti-CCP phenotype. For the IgM phenotype, the SSVS selected rs1938032
and the LASSO selected rs12032393, and both markers lie in the gene LRRC8D on
chromosome 1 (enclosed by a star in Fig. 2.6). On chromosome 7, the SSVS selected
rs2392467 and the LASSO selected rs13234380 for the IgM phenotype. Both markers
are within 100 Kb of each other and are shown in Figure 2.6 (enclosed by a star). For the
anti-CCP phenotype , the SSVS selected rs10183908 and the LASSO selected rs972485
in the gene LRP1B on chromosome 2 (enclosed by a triangle). On chromosome 4, the
SSVS selected rs192068 and the LASSO selected rs1540052 while on chromosome 10,
the SSVS selected rs807013 and LASSO selected rs17113682 which are shown in Fig.
2.6. Both pairs of markers lie within 100 Kb of each other, but neither of them lie in
annotated genes. The SSVS and the LASSO selected two common markers, rs2186830
and rs334438 on chromosome 18. The marker rs2186830 lies on the gene COLEC12
(enclosed by a triangle). None of these genes have been reported to be associated with
RA in the literature.
2.4 Conclusions
In this study, we show that the SSVS outperforms the LASSO and the Elastic Net in sim-
ulation studies. All the three methods have similar trends in power as we change marker
density and coefficients of the markers. For the daunorubicin data set, the SSVS, the
LASSO and the Elastic Net select a significantly common model, however the size of
the SSVS model is small. The Elastic Net selects a very large model. None of the com-
mon markers selected have been previously reported to be associated with daunorubicin
35
Figure 2.6: Comparison of the outputs of SSVS and LASSO for the two IgM and anti-CCP phenotype in the RA data. The
squares show the markers selected by SSVS and the circles show the markers selected by LASSO. The stars enclose the loci
identified by both SSVS and LASSO for the IgM phenotype. The triangles enclose loci identified by both SSVS and LASSO
for the anti-CCP phenotype. The diamond encloses the locus identified by LASSO for the two phenotypes
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 50 100 150 200
5 10 15 20
Markers picked by the SSVS and the LASSO for the two phenotypes
Position on Chromosome (Mbp's)
Chromosome Number
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● SSVS Markers : Rheumatoid Factor IgM Antibody
LASSO Markers : Rheumatoid Factor IgM Antibody
SSVS Markers: Anti−CCP Antibody
LASSO Markers: Anti−CCP Antibody
36
induced cytotoxicity. However, in the prescreening, the SIS only selects two markers
which have been previously reported to be associated with daunorubicin induced cyto-
toxicity, rs220200 and rs10142144. For the Rheumatoid Arthritis study, the LASSO
and SSVS identified two common markers for the anti-CCP antibody phenotype and
two and three markers which are located close to each other for the Rheumatoid fac-
tor IgM phenotype and the anti-CCP antibody phenotype respectively. Some of these
markers were found to be in annotated genes such as LRRC8D,LRP1B, and COLEC12.
However these genes have not been reported to be associated with RA.
The methods have their advantages and disadvantages in their application to model
selection. The SSVS is computationally intensive as it requires computation of a matrix
inverse at each iteration. The LARS requires the inverse computation of a matrix with
the size of the active set. Therefore as the number of iterations increases, the time
required for an iteration would increase. Both the LASSO and the Elastic Net use the
LARS algorithm and hence have the same problem. However, since the SSVS requires
the Markov chain achieve stationarity, it requires much more time than the LASSO and
the Elastic Net. Therefore, we need to carry out a marker screening step (e.g., SIS proce-
dure) before running any of the three methods. The Gibbs Sampler in the SSVS requires
prior parameter selection. The parameters for the SSVS have been selected according
to [29] for different values of. However, it is impossible to explore the whole space.
The final results may be very sensitive to parameter estimation. Biological information
can be incorporated using the prior parameters. For the LASSO on the other hand, the
cutoff needs to be selected. A cross-validation scheme as used in this paper could be
biased by the data set. The Elastic Net involves a two fold cross validation where one
of the parameters are again chosen from a specific set. As the size of this set increases,
the cross validation procedure can become very computationally intensive. In real data,
it would be beneficial to use multiple methods and weight the results accordingly.
37
Prescreening methods have become of utmost importance with the advances in tech-
nology. We used the SIS to reduce the number of markers from 2598208 to 200. The
authors in the SIS paper [28] suggested to usep = n 1 orp =
n
logn
which are much
smaller than 200 in our case. We wanted to demonstrate the ability of these methods
to use more markers than samples. So we chose 200 (much larger thann 1 and
n
logn
)
instead. The dimensionality reduction would ideally be method specific. We need to
develop a method-specific algorithm to choose the number of retained markers after
dimensionality reduction. If the causal markers are discarded in the prescreening step,
then it is impossible for any method to identify them. We also demonstrated the ability
of the SSVS, the LASSO and the Elastic Net to handle more markers than the sample
size of 56. However, increasing the number of markers can dramatically reduce the
power. It would be of great interest to address the relationship between the number of
markers chosen in the prescreening step and the statistical power.
The SSVS and the LASSO both rely on the assumption that the predictors are inde-
pendent. However, the markers are dependent on each other due to linkage disequilib-
rium, which would need to be considered to make an accurate statistical inference. The
Elastic Net with its grouping property is a better choice when there are highly correlated
variables.
We have not discussed the effect of these methods on the problem of Population
Structure. Either we could correct for population structure before applying these meth-
ods or we can incorporate EMMA or PCA with these methods. These would be a simple
extension of the models as both methods for population structure correction assume the
linear model. Also, this analysis assumes the CDCV model and therefore might fail if
the genetic architecture of the phenotype does not follow this model.
38
Chapter 3
RNA Sequencing
3.1 High Throughput Sequencing and RNA Seq
High throughput sequencing or next generation sequencing (NGS) is a technology
through which several million short reads can be sequenced from a given sample of
DNA which can be sequenced in parallel on the same machine at the same time. NGS
enables the production of enormous volume of data cheaply and in very short durations
of time compared to Sanger sequencing. These new technologies have transformed the
landscape of genomic analysis, as whole genomes and transcriptomes can be sequenced
for individual experiments. The most popular sequencing technologies are 454, Illu-
mina and SOLiD sequencing. The latter ones produce shorter reads and are more often
used for resequencing experiments rather than de-novo sequencing because they pro-
duce shorter length reads. Recently, another generation of technologies like Helioscope,
Ion semiconductor sequencing, single molecule sequencers and nanopore have been
developed ,increasing quality and decreasing the cost of sequencing.
RNA sequencing is a technology where high throughput sequencing is used to
sequence cDNA which is obtained by reverse transcribing the mRNA in a sample. The
RNA is poly-A selected and then converted to cDNA before or after fragmentation. [58]
show that fragmentation before converting to cDNA produces a more uniform cover-
age of the RNA sequence reads. These reads can then be mapped back to a reference
sequence. If the reads are mapped back to a genome, then there are reads which map to
contiguous regions and non-contiguous regions of the genome like exon-exon junctions
39
which could point to a splicing event. The reads which map contiguously are known
as body reads and the reads which map across junctions are called junction reads. The
relative expression levels of the region of interest can be quantified from these read
counts, using models which will be discussed in the following sections. The RNA Seq
procedure is outlined in Figure 3.1 taken from [87]. This technology has been used for
analyzing the transcriptome for many organisms and has replaced microarrays as the
next generation of transcriptomics analysis.
3.2 RNA Seq Analysis Pipeline
3.2.1 Mapping
After the reads have been generated from the sequencing machine, they are mapped
onto a reference sequence which could be either the genome, transcriptome or a
gene of interest. Mapping tools like Eland,Bowtie, PerM, BWA, SOAP, Maq, RMap
[43, 13, 46, 45, 49, 75] etc. are used to map short single end or paired end reads against
the reference allowing a fixed number of mismatches between the read and the refer-
ence sequence. Bowtie, BWA and SOAP use the Burrows Wheeler Transform to map
short reads. PerM uses periodic spaced seeds which are used to preprocess the reference
sequence before the matching procedure. RMap incorporates base-call quality scores to
weight mismatches and improve the accuracy of matching. Mismatches are incorporated
to take into account polymorphisms as well as mapping errors. An important question
is the number of mismatches to be used to map the reads to the reference sequence.
A mismatch rate needs to be incorporated to take into account the average rate of poly-
morphisms present in the particular species being analyzed, the error rate of the machine
and the length of the read. Increasing the number of mismatches results in an increase
in the number of reads which are mapped to multiple locations in the genome and are
40
Figure 3.1: RNA Sequencing pipeline. mRNA is either fragmented to form smaller
mRNA and then converted to cDNA or converted to cDNA and then fragmented. The
adaptors are ligated to the cDNA fragments which are then sequenced via a sequencing
machine to produce short reads. These short reads are then mapped to a region of interest
and a base-resolution profile is obtained. This figure is taken from [87]
41
usually discarded. Heuristics are employed to solve this problem, and hence the number
of mismatches should be carefully chosen to avoid inaccuracies in further downstream
analysis.
3.2.2 Assembly
In the case where the reference sequence is not available, de-novo assembly of the tran-
scriptome can be used to recreate the transcriptome profile of the sample. De-novo
assembly has been extensively used for sequencing DNA . Several softwares have been
developed to do de-novo sequence assembly like Velvet,ABySS3, etc [95, 74]. Almost
all sequence assemblers are based on either Overlap graphs or deBruijn graphs [22]The
giant panda genome was sequenced and assembled using high throughput sequencing
[48]. The Illumina Genome Analyzer was used and the genome which is approximately
2.3 Gb long was assembled using the SOAPdeNovo assembler . However such stud-
ies require extensive sequencing depth of the genome, for example a coverage depth of
73X was required to sequence the Panda genome. De-novo transcript assembly is a more
complicated problem due to the variability in coverage of different regions of the tran-
scriptome because of alternative splicing. This can therefore cause several transcripts
not to be detected as most assemblers will not be able to distinguish between compli-
cated alternative isoforms. Methods to perform assembly and quantification together
are still in early stages of development, and have to be approached cautiously since
quantification can be heavily biased by using the same data to do the assembly.
3.2.3 Modelling and Quantification
RNA Seq estimates the level of expression of a gene, exon or transcript use discrete
counts of reads mapping to various positions , compared to microarrays where the
expression levels were estimated by a continuous signal of fluorescence emitted by the
42
hybridization of reads to given probes.Unlike microarray, RNA Seq does not have the
problem of saturation biases and cross hybridization . However, RNA Seq has its own set
of biases and therefore effective modeling is required to estimate accurately the actual
signal and separate it from noise.
The position level read counts in an RNA Seq Experiment are usually modeled as
a Poisson distribution. The Poisson distribution is to model the total number of events
occurring in a given interval of time or space that occur at a known average rate and
independently of previous events. We can model the read counts as a Poisson distribu-
tion if we assume that a given genomic region like a gene produces reads with a known
average rate and that a read generated from one position is independent from all other
reads. In an ideal situation this would be the case and the expression rate of a geneg
would be proportional to
g
/k
g
l
g
(3.1)
where k is a normalization constant,
g
is the concentration of RNA molecules from
the gene in the library and l
g
is the effective length of the exon. [54].
g
describes a
fraction of RNA molecules expressed by exong in a library. If the counts of an exong
are denoted asX
(g)
,then
X
(g)
Poisson(
g
) (3.2)
The most common estimate used to quantify the expression levels of a gene or exon
is known as RPKM or Reads per Kilobase per Million mapped reads. [58]. RPKM for
geneg can be written as
RPKM
g
=
g
10
9
Nl
g
(3.3)
whereN is the total number of reads in the given library andl
g
is the length of the exon
in base pairs.
43
3.2.4 Biases in RNA Seq data
The read count distribution in RNA Seq has a lot of biases which cause the read count
distribution to be quite different from the Poisson distribution. There is extreme hetero-
geneity in the data with the variance being larger than the mean in most of the reads
in RNA Seq. Therefore, RPKM might not represent an accurate measurement of the
relative expression levels of regions of interest. Some of the biases in RNA Seq data are
listed as follows:
1. Transcript Length Bias : Longer transcripts produce longer fragments which can
be sequenced and therefore produce more number of reads. Hence, the gene
expression estimate is normalized by the length of the gene. However, Bullard
et al. [5] and Oshlack et al. [61] show that there is a statistically significant bias
introduced by the length of the gene. The longer genes not only have an increase
in the per gene read count but also the per base read count. Oshlack et al. [61]show
that the ability to detect differentially expressed genes is significantly higher for
longer transcripts than shorter ones. This causes a strong bias when comparing
transcripts of different lengths, for example in gene set analysis.
2. GC Bias : Dohm et al. [24] show that regions with high GC content have a high
representation of reads. They attribute this bias to the Illumina machine, and also
to the difference in melting behavior between AT and CG rich DNA which could
interfere with the PCR amplification process and therefore result in a non-uniform
distribution of reads across the AT and CG rich regions of DNA.
3. Random Hexamer priming Bias : Hansen et al. [32] found a strong distinctive pat-
tern in the nucleotide frequencies of the first 13 positions of the 5’ end of mapped
RNA Seq reads. This was consistently observed across different experiments.
Since this bias was not observed in either DNA Seq, CHIP Seq experiments and
44
also in RNA Seq experiments with Oligo dT priming, they attributed this bias
to random hexamer priming. Therefore, cDNA fragments beginning with certain
13 mers are systematically over or under represented which is not attributable to
the expression level of these cDNA fragments [32]. They suggest a reweighting
scheme to adjust for this bias and therefore move towards a more uniform distri-
bution of reads.
4. PCR Amplification Bias: Aird et al. [1] identify PCR as a major source of bias in
Illumina sequencing data. PCR amplification can non-uniformly amplify certain
regions of the genome. This can cause a large number of reads to be map to these
regions which are only artifacts of the amplification process. They use real time
qRT-PCR to study the effect of PCR amplification biases in Illumina libraries.
This bias is present in RNA Seq reads as well. This PCR artifacts can heavily bias
the expression estimates.
3.2.5 Normalization
Normalization is essential to detect true differences between samples since different
samples may have different levels of coverage, potential technical bias like lane effects,
flow cell effects etc. In this section I will describe some of the popular normalizing
methods used to normalize RNA Seq samples.
Global-Scaling Normalization
This class of methods use one value per sample is used to normalize the whole sample.
The most common normalization method is the RPKM normalization where the total
number of reads mapping to a given region of interest is normalized by the length of
the region of interest and the total number of reads being mapped in the whole library.
45
Other methods for global scale normalization are to use the upper quartile of the gene
expression levels in each library to normalize each gene in that library or to use the reads
mapping to one specific gene.Spike ins can also be used to normalize the samples.
Quantile Normalization
Bullard et al. [5] proposed a quantile normalization method inspired from microar-
rays to normalize across RNA Seq. In this method, the statistical properties of the
distributions across all samples are made identical. As an example, if there aren sam-
ples or libraries (X
1
;X
2
;:::;X
n
) being compared with g genes in each sample, i.e.
X
i
=fX
i;1
;X
i;2
;:::;X
i;g
g, the following steps are carried out :
1. Each sample is ordered as
(X
(1)
i
;X
(2)
i
;:::;X
(g)
i
) = (X
i;f
1
i
(1)
;X
i;f
1
i
(2)
;:::;X
i;f
1
i
(g)
) (3.4)
wheref
1
i
(k) is the gene with thek
th
smallest gene expression
2.8i = 1;:::;n andj = 1;:::;g, setX
i;f
1
i
(j)
= median
k=1;:::;n
fX
k;f
1
k
(j)
g
Bullard et al. showed that Quantile normalization is superior to all global-scaling nor-
malizing procedures.
Trimmed Mean of Ms(TMM) normalization
In RPKM normalization, it is assumed that the total RNA is the same for all libraries.
However this assumption might not hold true, especially when comparing samples
which are not replicates. The TMM normalization method assumes that the expres-
sion levels for a core set of genes G is similar between samples . This method was
proposed by Robinson and Oshlack [68]
46
IfX
k;g
is the number of reads mapping on geneg in libraryk,
k;g
is the true expres-
sion level of the gene in this library, l
g
is the length of the gene and N
k
is the total
number of reads for libraryk, the expected value of the number of reads can be written
as
E[X
k;g
] =
k;g
l
g
S
k
N
k
(3.5)
HereS
k
represents the total RNA in the sample. The normalizing factor between two
libraries k and k
0
can be written as f
k;k
0 =
S
k
S
k
0
. The TMM approach estimates this
ratio using a subset of genes which are not differentially expressed in either of the two
samples and therefore do not bias the normalizing factor. To select the subset G of genes,
Robinson and Oshlack [68] propose calculating the log-fold-changes and the absolute
expression levels for each geneg between librariesk andk
0
.
M
(k:k
0
)
g
=log
2
X
k;g
=N
k
X
k
0
;g
=N
k
0
(3.6)
A
(k;k
0
)
g
=
1
2
log
2
X
k;g
N
k
+log
2
X
k
0
;g
N
0
k
(3.7)
These quantities are calculated whenX
k;g
;X
k
0
;g
6= 0 The subset G of genes are selected
after trimming the upper and lowerx% of the genes infA
(k;k
0
)
g andfM
(k;k
0
)
g. There-
fore the genes which have an extreme log fold change or are very highly expressed are
removed from the set.
Several other methods have been proposed for differential expression and use differ-
ent normalization procedures. Anders and Huber [4] in their algorithm DESeq to detect
differential expression of genes, use the following estimate as a normalizing factor for
each gene in a particular libraryk
S
(k)
=median
g
X
i;g
Q
n
j=1
X
j;g
1=n
(3.8)
47
In the next chapter, we use a generalized Poisson model to improve the normalization
for RNA Seq.
3.2.6 Differential Expression
Effective normalization enables accurate comparisons of expression levels across dif-
ferent samples or libraries. Statistical significant evidence of differential expression can
lead to a better understanding of several biological processes including disease. If
1
g
is
the true expression level of a gene in condition 1 and
2
g
is the true expression level of a
gene in condition 2, we want to perform the test where
H
0
:
1
g
=
2
g
vs.H
1
:
1
g
6=
2
g
(3.9)
The true expression represents the number of RNA molecules in each of the samples.
However, usually we only have estimates of these quantities in terms of either microar-
ray data or RNA Seq read counts. To answer this question using RNA Seq, several
methods have been proposed which are used to identify differential expression.
GLM based methods
Bullard et al. [5] propose a GLM method which combines lane and technical effects
to estimate differential expression. The read counts X
i;j
for the j
th
gene in the i
th
lane/sample is modeled as
log(E[X
i;j
jd
i
) =log(d
i
) +
a(i);j
+
i;j
(3.10)
wherelog(d
i
) is an offset for thei
th
lane which can be used as a global normalization
factor for each lane. The gene expression level is
a(i);j
where a(i) is the biological
48
sample or condition. Therefore, several replicates can be used to estimate the expression
of a particular sample or condition.
i;j
represents other technical effects like flow cell
effect, library preparation effects etc.
To test for differential expression of gene g between two samples 1 and 2, the GLM
is used to test whether
1;g
=
2;g
These statistics are found to be more robust and
accurate compared to t-test statistics and F-test statistics.
Negative Binomial based methods
The negative binomial distribution has been used to model count data and has the prop-
erty of modeling over-dispersion in replicates. This distribution can be considered as a
gamma mixture of Poisson distributions. The density function of a random variable X
distributed as negative binomial can be written as
P (X =k) =
k +r 1
k
(1p)
r
p
k
fork = 0; 1; 2;::: (3.11)
where 0p 1. The mean and variance of this distribution is given by
=
pr
1p
(3.12)
2
=
pr
(1p)
2
= +
2
r
(3.13)
Two popular algorithms using the negative binomial distribution to detect differential
expression are edgeR [67] and DESeq [4]. edgeR models the read count data as
X
i;g
NB(M
i
p
g;a(i)
;
g
) (3.14)
for geneg in libraryi. M
i
is the total number of reads in libraryi,
g
is the dispersion
of the reads in geneg andp
g;a(i)
models the relative expression or abundance of geneg
49
in the biological sample or conditiona(i) to which libraryi belongs.
i;g
= M
i
p
g;a(i)
and variance is given by
i;g
(1 +
i;g
g
). So by using a maximum likelihood procedure,
edgeR tests for differential expression of geneg between conditions 1 and 2 by testing
the hypothesis H
0
: p
g1
= p
g2
. The normalization used is the Trimmed Mean of Ms
normalization.
Anders and Huber developed a differential expression estimation algorithm DESeq
[4], which uses the same negative binomial model, but models the second parameter as
s
ga(i)
and therefore is dependent on the biological sample.
X
i;g
NB(M
i
p
g;a(i)
;s
g;a(i)
) (3.15)
DESeq assumes that the dispersion is a smooth function of the mean and therefore
s
g;a(i)
=s(p
g;a(i)
)[4]. This assumption is made due to the small number of replicates in
real data. The same hypothesis is used as edgeR to test for differential expression.
Several other methods have been proposed for Differential Expression. Hardcastle
and Kelly [33] developed baySeq which uses an empirical Bayesian model to find genes
which are differentially expressed. Wang et al. [86] have developed a package DEGSeq,
which uses methods based on MA-plots to visualize differential expression. Tarazona
et al. [81] developed a method NOISeq which is a nonparametric method to detect
differences in expression which are robust to low sequencing depths. We propose a like-
lihood ratio test based on the generalized Poisson model, discussed in the next chapter,
to identify differentially expressed genes and we show that it performs than Poisson and
Negative Binomial based methods to identify differential expression.
50
3.2.7 Splicing rates for exons and Alternative transcript quantifica-
tion
There is considerable interest in looking at alternative usage of exons and expression
levels of transcripts compared to gene expression estimates. Since multiple transcripts
contribute to the number of reads mapping to a gene, extracting alternative splicing
information is crucial for understanding the transcriptome. Junction reads or reads that
span exon-exon junctions are integral to the analysis of alternative splicing in exons. For
an exone, letN
excl
be the number of junction reads which exclude the exon andN
incl 5’
and N
incl 3’
be the number of reads which include the exon from the 5’ and 3’ ends
respectively. Then the most basic statistic which can characterize the exon inclusion
ratio for a skipped exon can be written as
e
=
N
excl
N
excl
+
N
incl 5’
+N
incl 3’
2
(3.16)
This was used by Wang et al. [85] in their analysis.Differential exon usage between
samples 1 and 2 can be identified by using the statisticj
(1)
e
(2)
e
j. SpliceTrap is a
method developed by Wu et al. [88] which quantifies exon inclusion levels using paired-
end RNA-Seq data. It uses a Bayesian model to estimate the inclusion ratios
There are several methods for alternative isoform transcript quantification. Jiang and
Wong [37] used a Poisson model to deconvolute the isoform expressions using known
annotation. CUFFLINKS [83] also uses the Poisson assumption to identify alternatively
spliced transcripts and calculate their expression levels. MISO or mixture of isoforms
was developed by Katz et al. [40] incorporating paired-end reads to estimate the expres-
sion of alternatively spliced transcripts by using an improved estimate of by correcting
for transcript length bias and extending it to isoform expression.
51
Chapter 4
A two-parameter generalized Poisson
model to improve the analysis of
RNA-Seq data
Deep sequencing of RNAs (RNA-seq) has been a useful tool to characterize and quan-
tify transcriptomes. However, there are significant challenges in the analysis of RNA-
seq data, such as how to separate signals from sequencing bias and how to perform
reasonable normalization. In this chapter, we focus on a fundamental question in RNA-
seq analysis: the distribution of the position-level read counts. This research has been
published in [78]
4.1 Introduction
With the advance of high-throughput sequencing technologies, transcriptomes can be
characterized and quantified at an unprecedented resolution. Deep sequencing of RNAs
(RNA-seq) has been successfully applied to many organisms [5, 14, 58, 59, 85]. How-
ever, there are still many challenges in analyzing RNA-seq data. In this chapter, we
focus on a basic question in RNA-seq analysis: the distribution of the position-level
read count (i.e. the number of sequence reads starting from each position of a gene or
an exon). It is usually assumed that the position-level read count follows a Poisson dis-
tribution with rate. The gene length-normalized read count, which is a popular gene
52
expression estimate, is then the maximum likelihood estimator (MLE) of. . However,
a Poisson distribution with rate cannot explain the non-uniform distribution of the
reads across the same gene or the same exon. A different distribution or a better model
is in need to better characterize the randomness of the sequence reads.
We proposed using a two-parameter generalized Poisson (GP) model for the gene
and exon expression estimation. Specifically, we fit a GP model with parameters and
to the position-level read counts across all of the positions of a gene (or an exon).
The estimated parameter reflects the transcript amount for the gene (or exon) and
represents the average bias during the sample preparation and sequencing process. Or
the estimated can be treated as a shrunk value of the mean with the shrinkage factor.
We found that the GP model can better estimate gene (or exon) expression by separating
true signals from sequencing bias.
It has been shown that normalization continues to be a critical component of RNA-
seq analysis [68]. A few highly expressed genes specific to one sample can make
the library-size based normalization approach inappropriate. Our proposed GP model
showed that for some highly expressed genes, many of the reads might be caused by
sequencing bias. By removing the sequencing bias captured by, we can better perform
the normalization across different samples. It has also been observed that in differen-
tially expressed gene studies, although the gene-level read counts across replicate lanes
can be fitted by a Poisson distribution, the fit is inappropriate when applied to biological
replicates or different biological samples [5]. In addition, the parameter estimation for
the biological effect may be unreliable given the small number of replicate lanes. On
the other hand, based on our GP model, we can better identify differentially expressed
genes and differentially spliced exons through log-likelihood ratio approaches. These
conclusions were demonstrated by applying our model to multiple RNA-seq data sets in
multiple organisms.
53
4.2 Two-parameter generalized Poisson model for
RNA-seq data
4.2.1 Model
For each gene, let X represent the number of mapped reads starting from an exonic
position of the gene. The observed counts arefx
1
;:::;x
n
g where n is the total number
of non-redundant exonic positions (or gene length). The sum ofx
i
’s is equal to the total
number of reads mapped to this gene. We assume thatX follows a GP distribution with
parameters and:
Pr(X =x) =
8
>
>
<
>
>
:
( +x)
x1
e
x
x!
x = 0; 1; 2;:::
0 forx>q if< 0
(4.1)
where> 0, max(1;
q
) 1, andq( 4) is defined as
q =minfq 4j +q> 0;< 0g (4.2)
When = 0, the GP model reduces to the Poisson model. From our results on the
real datasets, more than 99% of the estimates were larger than 0. The mean of X is:
=
1
, and the variance of X is:
2
=
(1)
3
. The GP distribution was defined
and studied by Consul and Jain [17, 18]. In RNA-seq experiments, the parameter
can be treated as the transcript amount for the gene and represents the bias during
the sample preparation and sequencing process. The underlying mechanisms for the
sequencing bias remain unknown and need further investigation. The MLE’s of
^
and
54
^
can be solved by obtaining the solution of the following equation using the Newton-
Raphson method or any other optimization procedure
n
X
i=1
x
i
(1x
i
)
x + (x
i
x)
n x = 0 (4.3)
After calculating the estimate for, we can solve for
^
from
^
= x(1
^
).
Thus,
^
is a shrunk value of the sample mean if
^
> 0. This relationship can also be
inferred by the equation that = (1). Note that there is no MLE if all of thex’s
are equal to 0 or 1. The same model can be specified for an exon. The observed read
counts at the positions arefz
1
;:::;z
m
g andm is the exon length.
4.2.2 Intuition for the generalized Poisson distribution
An intuition for using the distribution can be explained as follows. The Generalized
Poisson distribution is a Poisson type probability distribution which is an approximation
of the Generalized Negative Binomial distribution [18]. The density function for the
Generalized Negative Binomial distribution is given by:
b
(x;n;) =
n(n +x)
x!(n +xx + 1)
x
(1)
n+xx
; 0<< 1;jj< 1 (4.4)
For =
1
2
, the GNB distribution reduces to the Poisson distribution, while for = 1,
the GNB distribution becomes the Negative Binomial. The parameter can be treated
as the probability of a success or a read mapping to a particular position. on the
other hand represents the level of dispersion in the data. A high represents a larger
dispersion while a smaller represents a smaller dispersion of the data. This can be
seen intuitively in the following case :
55
In a queueing process starting with n customers, if the customers arrive in batches
of 1, the probability that exactly ( 1)x +n customers will be served before the
queue first vanishes is given by the Generalized Negative Binomial.
Hence the parameter can confound the rate of arrivals of customers, or in our sit-
uation the average number of reads mapping to a genomic sequence because a higher
would lead to accumulation of reads at certain locations and no reads in others which
can be attributable to sequencing preference and other biases. Therefore using the aver-
age read count per base pair to estimate gene expression might lead to overestimates
when is large since the mean of the distribution is given by
m =
n
1
(4.5)
However, the parameter we are interested in is onlyn.
Asn and grow larger and is small, such thatn = and =, the General-
ized Negative Binomial can be approximated as the Generalized Poisson with parame-
ters and. For the reason explained above using onlyn to estimate gene expression,
is used as the estimate for gene expression. The mean of the Generalized Poisson is
given by
m =
1
(4.6)
Therefore 1 represents how much m needs to shrink to approach the true gene
expression rate. For a higher , the shrinkage is much higher. Therefore we use
as the parameter representing gene expression and as the shrinkage parameter.
Another interpretation for the GP distribution is through the quasi-binomial distribu-
tion. Considering a random variable X following a quasi-binomial distribution I (QBD-
I):
56
Pr(X =x) =
m
x
p(p +x)
x1
(1px)
mx
: (4.7)
X represents the number of successes in m trials. The probability of success in any
one trial isp and in all the other trials isp +x. The probability of success increases
or decreases depending on the positive or negative value. The change of probabil-
ity is proportional to the number of successes. When m!1;p! 0, and ! 0
such thatmp = andm = , the QBD-I approaches the GP model with parameter
(;) [15]. For the RNA-Seq data, the actual transcript amount can be approximated
by a Poisson distribution with parameter. The bias intruduced in the sample prepara-
tion and sequencing process is represented by. Therefore, the observed position-level
read count followed the GP distribution with parameters (;). Many other models can
lead to the GP distribution. For example, under certain limits, the generalized negative
binomial distribution and the generalized Markov-P
`
’olya distribution approaches the GP
distribution [16]. The GP model has been applied to many biological problems such as
the frequency of chemically induced chromosome aberrations in human leukocytes [36]
4.2.3 Normalization
To identify differentially expressed genes, we need to perform normalization across
different samples. The total amount of sequenced RNAs in sample 1 can be estimated
bys
1
=
P
G
g=1
^
1;g
l
g
, where
^
1;g
is the MLE of in the GP model for geneg in sample
1,l
g
is the gene length, andG is the total number of genes. Similarly, the total amount
of sequenced RNAs in sample 2 can be estimated by s
2
=
P
G
g=1
^
2;g
l
g
, where
^
2;g
is the MLE of for gene g in sample 2. To perform normalization, we assume that
the total amount of RNAs in sample 1 is equal to the total amount of RNAs in sample
57
2. Therefore, the scaling factor for the comparison between the two samples can be
estimated as:
w =
s
2
s
1
(4.8)
When
g
= 0,
^
g
= x
g
that is the MLE for the Poisson model. When
g
= 0 for all
g’s, the scaling factor is the same as that in the normalization procedure using the total
number of mapped reads (here only reads mapped to the transcriptome were considered).
When
g
> 0;
^
g
< x
g
we remove extra reads that are due to the biased sequencing for
this gene. When
g
< 0,
^
g
> x
g
we compensate read counts for this gene because
some of reads cannot be sequenced successfully. In our applications, we found that
the majority of genes had a positive
g
. The ratio of the gene expression between two
samples can be estimated asw
^
1;g
^
2;g
4.2.4 Method to identify differentially expressed genes
We used the likelihood ratio test to identify differentially expressed genes. LetX repre-
sents the position-level read count in sample 1. Similarly,Y is the random variable for
the gene in sample 2. To estimate the unrestricted MLEs, we have:
Pr(X =x) =
1
(
1
+x
1
)
x1
e
1
x
1
x!
(4.9)
Pr(Y =y) =
2
(
2
+y
2
)
y1
e
2
y
2
y!
(4.10)
where (
1
;
2
) and (
2
;
2
) can be estimated by the maximum likelihood approach men-
tioned above independently. When
1
< 0 or
2
< 0 and somex’s ory’s are larger than
the corresponding q values, the likelihood is zero and the parameter estimation fails.
58
Under the null hypothesis (restricted parameter space), thus the gene is not differen-
tially expressed, we have
Pr(X =x) =( +x
1
)
x1
e
x
1
x!
(4.11)
Pr(Y =y) =w(w +y
2
)
y1
e
wy
2
y!
(4.12)
wherew is a normalization constant associated with the different sequencing depths for
the two samples. We can choose w =
s
2
s
1
, and s
1
and s
2
were calculated based on
the unrestricted maximum likelihood model. Through the parameter specification, we
preserved the original counts.
1
and
2
were taken as the same values of the MLEs
from the unrestricted maximum likelihood model. Thus, we assumed that the MLE of
from the unrestricted maximum likelihood model was close to the true value. Then
the restricted profile MLE
^
can be obtained by solving the equation using the Newton-
Raphson method:
2n
(n +nw) +
n
X
i=1
(x
i
1)
+x
i
^
1
+
n
X
i=1
(y
i
1)w
w +y
i
^
2
= 0 (4.13)
The log-likelihood ratio test statistic can be calculated as:
T =2ln
L(
^
;
^
1
;
^
2
jx;y)
L(
^
1
;
^
2
;
^
1
;
^
2
jx;y)
!
(4.14)
If the null model is true, T is approximately chi-square distributed with one degree
of freedom.
59
To perform the comparison, we also used the Poisson model and the log-likelihood
ratio approach to identify differentially expressed genes. For the unrestricted Poisson
model:
Pr(X =x) =
x
1
e
1
x!
(4.15)
Pr(Y =y) =
y
2
e
2
y!
(4.16)
The MLEs are
^
1
= x and
^
2
= y For the restricted null model:
Pr(X =x) =
x
e
x!
(4.17)
Pr(Y =y) =
(w)
y
e
(w
y!
(4.18)
where w can be chosen as
P
g
i=1
y
g
l
g
P
g
i=1
x
g
l
g
. The profile MLE under the null is
^
=
P
n
i=1
x
i
+
P
n
i=1
y
i
n +nw
(4.19)
The log-likelihood ratio test statistic can be calculated as:
T =2ln
L(
^
jx;y)
L(
^
1
;
^
2
jx;y)
!
(4.20)
and it follows a chi-square distribution with one degree of freedom if the null model is
true.
We also used the generalized linear model (GLM) proposed in [5] for the ROC curve
study. The log/Poisson link function was used for the linear model:
log(E(R
i;j
jd
i
)) =logd
i
+
a(i);j
+
i;j
(4.21)
60
whereR
i;j
is the number of reads mapped to genej in lanei,
a(i);j
is the biological effect
for genej in the biological groupa(i),
i;j
is other technical effect (i.e. flow cell effect),
d
i
is the total number uniquely mapped reads for lane i and the regression coefficient
of log(d
i
) is set to 1. For the MAQC-2 data, there are two biological groups: UHR
and brain samples distributed among 14 lanes. Therefore i ranges from 1 to 14. a(i) =
1(UHR) for the first 7 lanes, anda(i) = 0 (brain) for the last 7 lanes. These 14 lanes
were distributed among two flow cells so that = (1; 0; 1; 0; 1; 0; 1; 0; 1; 0; 1; 0; 1; 0).
The log-likelihood ratio statistic was calculated based on the full model and the null
model in which the biological effect is zero. Besides the log/Poisson link function, we
also applied the negative binomial and the quasi-Poisson link functions in the GLM.
4.2.5 Methods to identify differentially spliced exons
In this study we focused on ‘skipped exon’ events without considering other alternative
splicing events such as alternative 5’ splice sites or alternative 3’ splice sites. In the first
sample, the read count for the considered middle exon is denoted by Z, and the read
count for the corresponding gene is denoted byX. In the second sample, we denote the
read counts for the exons and genes byV andY respectively.
Under the unrestricted parameter space, we can estimate the parameters forZ;X;V and
Y using the method described earlier. We model the read counts for the exon and the
gene in the two conditions as a generalized Poisson with the following parameters :
61
ZGP (
Z
;
Z
) (4.22)
XGP (
X
;
X
) (4.23)
VGP (
V
;
V
) (4.24)
YGP (
Y
;
Y
) (4.25)
The resultant estimators for the GP parameters are denoted as
^
Z
;
^
Z
;
^
X
;
^
X
;
^
V
;
^
V
;
^
Y
;
^
Y
. Under the restricted null model, in which the
exon is not differentially skipped or the splicing ratio between the exon expression and
the gene expression is the same between the two samples. Therefore we define the
splicing ratiob under the null model as
b =
Z
X
=
V
Y
(4.26)
Since the splicing ratio is the same in the restricted model, we can write
X
as
1
and
Z
asb
1
. Similarly we write
Y
as
2
and
V
asb
2
. If the lambda’s in the unrestricted
model are denoted as
1
;
2
;
3
;
4
, we have
Pr(Z =z) =
b
1
(b
1
+z
1
)
z1
e
b
1
z
1
z!
(4.27)
Pr(X =x) =
1
(
1
+x
2
)
x1
e
1
x
2
x!
(4.28)
Pr(V =v) =
b
2
(b
2
+v
3
)
v1
e
b
2
v
3
v!
(4.29)
Pr(Y =y) =
2
(
2
+y
4
)
y1
e
2
y
4
y!
(4.30)
62
whereb > 0. The
1
;
2
;
3
;4 were taken as
^
Z
;
^
X
;
^
V
and
^
Y
respectively i.e.
the estimators under the restricted null model. The profile MLEs of
1
,
2
andb can be
obtained by solving ;
m +n
1
+
m
X
i=1
(z
i
1)
b
b
1
+z
i
^
Z
+
n
X
i=1
(x
i
1)
1
1
+x
i
^
X
(mb +n) = 0 (4.31)
m +n
2
+
m
X
i=1
(v
i
1)
b
b
2
+v
i
^
V
+
n
X
i=1
(y
i
1)
1
2
+y
i
^
Y
(mb +n) = 0 (4.32)
m
X
i=1
(z
i
1)
1
b
1
+z
i
^
Z
+
m
X
i=1
(v
i
1)
2
b
2
+v
i
^
V
m(
1
+
2
) +
2m
b
= 0 (4.33)
The log-likelihood ratio test-statistic can be calculated as:
T =2ln
0
@
L
^
b;
^
1
;
^
2
;
^
Z
;
^
X
;
^
V
;
^
Y
jz;x;v;y
L
^
Z
;
^
Z
;
^
X
;
^
X
;
^
V
;
^
V
;
^
Y
;
^
Y
jz;x;v;y
1
A
(4.34)
As this likelihood statistic corresponds to a test forH
0
:b = 1 vsH
1
:b6= 1, it follows
a chi-square distribution with 1 degree of freedom under the null.
For the Poisson model, the MLEs of the Poisson parameters are the respective sample
means :
^
Z
= z;
^
X
= x;
^
Y
= y;
^
V
= v. Under the null model,
Pr(Z =z) =
(b
1
)
z
e
b
1
z!
(4.35)
Pr(X =x) =
(
1
)
x
e
1
x!
(4.36)
Pr(V =v) =
(b
2
)
v
e
b
2
v!
(4.37)
Pr(Y =y) =
(
2
)
y
e
2
y!
(4.38)
63
The restricted profile MLEs are
^
1
=
P
m
i=1
z
i
+
P
n
i=1
x
i
n +mb
(4.39)
^
2
=
P
m
i=1
v
i
+
P
n
i=1
y
i
n +mb
(4.40)
^
b =
(n(
P
m
i=1
z
i
+
P
n
i=1
x
i
)
m(
P
m
i=1
v
i
+
P
n
i=1
y
i
)
(4.41)
The log-likelihood ratio test statistic T given as
T =2ln
L(
^
b;
^
1
;
^
2
jz;x;v;y)
L(
^
Z
;
^
X
;
^
V
;
^
Y
jz;x;v;y)
!
(4.42)
follows a chi-square distribution with one degree of freedom if the null model is true.
4.2.6 Simulation Strategy to calculate P-values
In the likelihood ratio tests to identify differentially expressed genes or differentially
spliced exons, we treated the estimates from the unrestricted model as true values and
put them into the profile likelihood calculation for simplicity. The derived test statistics
T may not exactly follow a chi-square distribution with one degree of freedom under
the null. To better estimate the p-values, we adopted a simulation strategy using a per-
mutation test. For each nucleotide position, we randomly assigned the mapped reads
to the two considered conditions. The simulated sample pair should have no differen-
tially expressed genes or differentially spliced exons. We repeated the simulations 1000
times. For each simulated sample pair i, we calculated the test statistics T
gi
for each
gene g. The distribution of T
gi
’s (i = 1;:::; 1000) is the null distribution of the test
statistic for geneg. To obtain an accurate p-value for the observed test statistic, we fitted
the distribution of T
gi
’s by a Gamma distribution and calculated the p-value based on
64
the Gamma distribution. In our considered datasets, the null distribution can be well
fitted by a Gamma distribution for more than 84% of genes (the p-value based on the
Kolmogorov-Smirnov test> 0.05). Based on the p-values from the simulation strategy,
we performed similar studies to identify differentially expressed genes or differentially
spliced exons. Although they identified slightly more differentially expressed genes and
differentially spliced exons, the results and the conclusions were similar with the study
based on the p-values from the chi-square distribution.
4.2.7 Real data sets
The real data sets we used included: (I) The MAQC data including the MAQC-2 and
the MAQC-3 data from [5]. The MAQC-2 data were for two biological samples (human
UHR and brain samples) and each sample had 7 lanes distributed across two flow-cells.
The MAQC-3 data were for the UHR sample with four different library preparations. (II)
The human data were for 9 human tissues and 5 breast cancer cell lines [85]. (III) The
mouse data were for three mouse tissues, each with two replicates [58]. The controlled
hydrolysis of RNA samples was performed before cDNA synthesis. (IV) The yeast
data were for oligo(dT)-primed and random-hexamer-primed cDNA samples, each with
original, biological and technical replicates [59]. All of these data were obtained from
the Illumina sequencing platforms. These samples had multiple lanes distributed in one
flow-cell or across multiple flow-cells.
For the mouse data, the uniquely mapped reads including body reads and junction
reads were downloaded from http://woldlab.caltech.edu/rnaseq/. Other data were down-
loaded from the Sequence Read Archive http://www.ncbi.nlm.nih.gov/sra/ as the .fastq
files: SRA010153 for the MAQC data, SRP000727 for the human data (the two low-
coverage MAQC samples were excluded), SRX000559-SRX000564 for the yeast data.
65
The sequence reads were mapped to the human genome (NCBI 37.1 or hg19, down-
loaded from the NCBI website) or the yeast genome (downloaded from the Saccha-
romyces Genome Database http://www.yeastgenome.org/). The unmapped reads were
further mapped against the human refseq RNA sequences (downloaded from the NCBI
website) or the yeast ORF sequences (downloaded from the Saccharomyces Genome
Database). The resultant reads were treated as junction reads. Note that we only used
the uniquely mapped reads and removed the multi-reads. The mapping was performed
by using Bowtie, version 0.12.1 and the default settings [43]. The Refseq gene annota-
tions for human and mouse were downloaded from the NCBI website and the ORF gene
annotations for yeast were downloaded from the Saccharomyces Genome Database. We
only considered genes on autosomes or X chromosomes. The position-level read count
was the number of body or junction reads starting from an exonic position of a gene
(or an exon) without considering the strand information because the sample preparation
did not consider strands. A non-redundant exon list was assembled from the Refseq
gene annotations or the ORF gene annotations. If two exons were overlapped, they
were broken into three “exons”: the region specific to exon 1, the shared region, and the
region specific to exon 2. Note that both alternative exons and constitutive exons were
included.
To demonstrate that
^
of the GP model can better represent gene expression levels,
we utilized the QuantiGene data for the MAQC-2 samples [8]. The QuantiGene sys-
tem detects RNA directly without reverse transcription and PCR amplification. The
background-corrected signal was averaged across the replicates with detectable expres-
sion. If more than two thirds of the replicates had detectable expression, the gene was
included in the regression analysis (see Table 4.2). For each regression, we further
required that the gene had an available expression estimate from the considered statis-
tical model. The sample size for these regression models was around 170. To validate
66
our differentially expressed genes, we also utilized the qRT-PCR data for the MAQC-2
samples [73]. The DCt values with respect to POLR2A (endogenous control gene) were
averaged across the replicates with detectable expression. If 75% of the replicates had
detectable expression for both samples, the gene was claimed to have a reliable log-ratio
value in the qRT-PCR data set. To validate our identified differentially spliced exons, we
used an enlarged junction list downloaded from http://genes.mit.edu/burgelab/mRNA-
Seq/ which included a large number of known and predicted junctions. The liftOver tool
from the UCSC genome browser was used to convert the coordinates of the junctions
between version hg18 and hg19. We remapped the reads to the enlarged junction lists
using BLAST [3] and requiring at most 2 mismatches and at least 4 matched bases on
each side of the junction. The mapped junction reads can better represent the inclusion
rate of a middle exon. The inclusive junction reads were those starting at the 3’ end of
the exon (5’ splice site) and those ending at the 5’ end of the exon (3’ splice site). The
exclusive junction reads were those skipping the considered middle exon.
4.2.8 Computation efficiency and software packages
The code for the GP model was written in C and the program was computationally effi-
cient. For example, for the MAQC data with a total of six samples, starting from the
position-level read counts, it took 1 minute to estimate the gene and exon expression,
2 minutes to identify differentially expressed genes, and 5 minutes to identify differen-
tially spliced eons. The memory used was 800 Mb on a single CPU. It took more time if
we used the simulation strategy to calculate the p-values for the identification of differ-
entially expressed genes and differentially spliced eons. We also prepared an R-package
’GPseq’ to implement the methods proposed. Scripts have also been developed to gen-
erate the input files required for the software from mapped SAM files and annotation
files. These can be downloaded at : http://www-rcf.usc.edu/ liangche/software.htm .
67
4.3 Results
4.3.1 Gene and gene expression estimation
We fitted a GP model to the position-level counts of a gene or an exon. The parameters
and were estimated using the maximum likelihood approach. We first examined the
lane and flow-cell effect on the bias parameter using the MAQC-2 data from [5]. In
the MAQC-2 data, each of the two biological samples (UHR and brain) was sequenced
in seven lanes distributed across two flow-cells. We estimated g(g = 1;:::;G; G is
the total number of genes) for each lane separately (the goodness-of-fit will be fur-
ther discussed in Table 4.1). The correlation of the values between the seven lanes
for the same sample was around 0.96-0.98. Because the seven lanes were distributed
across two flow-cells, the high correlations also indicated that the flow-cell effect on
was negligible. However, the correlation of between the lanes for different biologi-
cal samples was only around 0.61-0.63, which indicated that the sequencing bias was
biological sample dependent. Note that if two independent variables X and Y follow
the GP distribution with parameters (
1
;) and (
2
;), the sum of X and Y is also a
GP variable with parameters (
1
+
2
;) [16]. This justifies the pooling of read counts
across lanes that are believed to have similar sequencing bias. We therefore pooled the
read counts across lanes for the same sample in the following studies. We further exam-
ined the library preparation effect on the sequencing bias. In the MAQC-3 data [5],
four different library preparations were used for the UHR sample. The correlation of
the values between the library preparations was around 0.97-0.98, indicating that the
library preparation effect was also negligible. In the yeast data, cDNAs were prepared
by either random hexamers or oligo(dT) primers. The correlation of the’s between the
two library preparations was 0.92.
68
Table 4.1: If the P-value for the chi-square test> 0.05, we declared that the model fit
the data well. The MAQC data contained both the MAQC-2 and the MAQC-3 data. The
MAQC-2 data was for two biological samples(human UHR and brain samples). The
MAQC-3 data was for the UHR sample with four different library preparations. The
human data was for nine human tissues and five breast cancer cell lines. The mouse
data was for three mouse tissues, each with two replicates. The yeast data was for oligo
(dT)-primed and random hexamer-primed cDNA samples, each with original, biological
and technical replicates. For each of these datasets, reads from multiple replicate lanes
were pooled together. However, for the MAQC-2 sep data, reads from the 7 replicate
lanes for each of the MAQC-2 samples were fitted seprarately.
Gene Level Exon level
GP (%) Poisson (%) GP (%) Poisson(%)
MAQC data 85.72 1.57 89.62 19.71
Human data 77.28 3.22 88.78 28.35
Mouse data 88.57 7.88 91.73 39.67
Yeast data 93.24 20.49 93.21 23.73
MAQC-2 sep 92.93 10.18 92.90 41.51
The goodness-of-fit of the GP model was examined by the chi-square test. If the
p-value for the chi-square test-statistic was larger than 0.05, we declared that the model
fitted the data well. Table 4.1 lists the percentages of genes (or exons) with the position-
level count data fitted well by the GP model or by the Poisson model. The position-level
counts can be fitted well by the GP model for the majority of genes or exons (77-93%).
However, the Poisson model only fitted well for 2-42% of the genes or exons. The p-
values of the genes that were declared to be well fitted by the GP model were uniformly
distributed between 0.05 and 1. This can be seen in Figure 4.1. The uniformity was
slightly worse for exons, but still much better than that of the Poisson model . For the
genes or exons that cannot be fitted by the GP model, 93-99% of them cannot be fitted
by the Poisson model either, and the fit of the Poisson model was even worse than the fit
of the GP model. The expression levels of the un-fitted genes and exons were generally
higher than those fitted well by the GP model. It indicated that there was additional bias
not captured by for the highly-expressed genes.
69
Based on the definition of the GP distribution, the expectation is: =
1
and the
variance is:
2
=
(1)
3
. The parameter is a measure of the departure from Pois-
sonicity. The variance of the GP distribution is greater than, equal to, or less than the
expectation according to whether > 0; = 0; or < 0, respectively. Consul [16]
stated that the parameter is the average rate for the natural Poisson process. The param-
eter is the average rate of the effort that the subjects are making. A positive value of
indicates that the subjects are making an effort to accelerate the natural process and a
negative value of denotes an effort to retard the process.
Furthermore, we used the QuantiGene arrays as gold standards to show that can
better represent the ”true signal” for gene expression. The QuantiGene system detects
RNA directly without reverse transcription and PCR amplification. The bias introduced
in the reverse transcription and the PCR amplification steps can be avoided. Therefore,
the QuantiGene system provides an accurate gene expression measurement. The
QuantiGene data were obtained from [8]. A linear regression model was fitted between
the log of the QuantiGene signal and the log of the estimated value. The slope was
0.94 and 0.90 for the two MAQC-2 samples (Table 4.2). However the slope was 0.72
and 0.71 if the Poisson estimate was used. The R-square values were similar for the
regression models. A slope different from 1.0 indicates the compression or expansion
effects between the estimated expression levels and the gold standard signals. If a slope
is equal to 1.0, the estimated fold change between two different genes of the same
sample is equal to the gold standard fold change. As we can see, the slope is closer
to 1.0 for the GP estimates than the Poisson estimates. It indicates that can better
represent the ”true signal” and it makes the direct comparison between two different
genes reliable.
70
It has been observed that a few highly expressed genes contributed a large fraction of
sequence reads in certain tissue samples [65]. Figure 4.2 shows the fraction of mRNA
sample contributed by the highly expressed genes in the human tissues. The mRNA
amount derived from a gene was equal to the product of the estimated gene expression
multiplied by the gene length: for the Poisson model, and for the GP model. According
to the Poisson model, the top 10 genes contributed to about 15-31% of the total mRNAs
in the human heart, muscle and liver tissues (Fig. 4.2A). If we used the GP model, the
fraction contributed by highly expressed genes was smaller (Fig. 4.2B). We examined
the top 4 genes with the largest contributions to the total RNAs in the liver tissue based
on the Poisson model. The 4 genes were also the top genes based on the GP model
(ranks 8, 1, 12 and 3). Figure 4.3 shows the observed frequency of the read count equal
to k (k = 0; 1; 2;:::) (blue bars) and the expected frequencies based on the GP model
(magenta bars) or the Poisson model (brown bars). We only plotted the ”k”s with at
least one frequency among the three types of frequencies larger than or equal to 5 for
visual convenience. The GP model fitted much better than the Poisson model. Note that
for these genes, there were about 1.4-3.8% of positions with thousands of reads starting
exactly from these positions (i.e. k 1000). For each of these ”k”s, the frequency of
positions with exactly k number of reads was small (< 5) so that they were not plotted
in Figure 4.3. However, these outliers greatly affected the MLE of the Poisson model,
which resulted in a peak at a high ”k” value for the expected frequencies (peak of the
brown bars), but the observed frequencies (blue bars) were low in this region. Although
the effect of outliers on the GP model was relatively smaller, they still caused the p-
values of the chi-square tests less than 0.05. It was unrealistic to assume that these
outlier reads reflected the gene expression correctly because the majority of the gene
positions did not exhibit such high read counts. It was more likely that these reads
contained sequencing bias. Therefore, if we counted all the reads mapped to the gene
71
Figure 4.1: Histogram of the p-values for the chi-square tests. The goodness-of-fit of the GP model (black bars) and the Poisson
model (grey bars) was examined by the chi-square test. A is for the gene expression estimation. B is for the exon expression
estimation.
A B
P-value of the Chi-square test (gene)
0.0 0.2 0.4 0.6 0.8
0.0 0.2 0.4 0.6 0.8 1.0
Percentage
0.0 0.1 0.2 0.3 0.4 0.5 0.6
0.0 0.2 0.4 0.6 0.8 1.0
P-value of the Chi-square test (exon)
Percentage
72
Table 4.2: The QuantiGene system detects RNA directly without reverse transcription and PCR amplification. The bias intro-
duced in the reverse transcription and the PCR amplification steps is therefore avoided. Y represents the averaged detectable
signal form the replicates of the QuantiGene data.X represents the estimate from the GP model of the means estimate from the
Poisson model. A slope of 1:0 for the regression line indicates that the estimated gene expression is accurate and the estimated
fold change between two different genes is the same as that from the QuantiGene data. Thuslog(y
1
=y
2
) = 1:0log(x
1
=x
2
).
The slopes for the estimates from the GP model were closer to 1.0 compared with those for the mean estimates from the
Poisson model.
estimate from the GP model Mean estimate from the Poisson Model
Regression line R
2
Regression line R
2
UHR Sample log(y) = 4:20 + 0:94log(x) 0.62 log(y) = 3:38 + 0:72log(x) 0.68
Brain Sample log(y) = 3:92 + 0:90log(x) 0.54 log(y) = 3:23 + 0:71log(x) 0.57
73
Figure 4.2: The fraction of total mRNA amount derived from highly expressed genes
for the human tissue samples. Genes were ranked based on the product of its esti-
mated expression level and the gene length x
g
l
g
for the Poisson model (A) or
^
g
l
g
for
the GP model (B). Then the percentage of mRNA amount contributed from the top
1; 10;:::; 10000 genes was calculated and plotted
0.0 0.2 0.4 0.6 0.8 1.0
Number of genes
Fraction of mRNA
1 10 100 1000 10000
Poisson model GP model
A B
0.0 0.2 0.4 0.6 0.8 1.0
Number of genes
Fraction of mRNA
1 10 100 1000 10000
testes
lymph
brain
breast
adipose
colon
heart
muscle
liver
74
to estimate the gene expression, we counted extra reads due to sequencing bias. On
the other hand, the GP model can better separate the bias from the true signal. Note
that all the frequencies were less than 5 and the GP model still fitted better than the
Poisson model. These highly expressed genes would affect the normalization across
different samples if we used the Poisson estimates, as argued in [68]. This will be
further discussed in a forthcoming section on normalization.
4.3.2 Factors associated with bias
As we mentioned above, the bias parameter was independent of lanes, flow-cells, and
library preparations, but dependent on biological samples. We further explored possible
factors related to. We first considered the average number of reads mapped to a gene
( x
g
). Although the MLE of does not have a closed-form expression, the method of
moments estimator of is:
^
g
= 1
s
x
g
s
2
g
(4.43)
where s
2
g
is the sample variance. Therefore we expect a negative correlation between
^
g
and
s
x
g
s
2
g
. In our real datasets, we found a negative correlation between
^
g
and
s
x
g
s
2
g
, the correlation was around -0.61 to -0.87. Thus, the increase of the variance
was faster than the increase of the mean. We therefore expect a positive correlation
between
^
g
and
p
x
g
. Specifically, the Pearson correlation between
^
g
and
p
x
g
was
around 0.66-0.84 for the human tissue and cell line data, 0.83-0.85 for the yeast data,
and 0.54-0.75 for the mouse tissue data. To remove the possible effect of different
robustness of
^
’s for different genes, we focused on genes with converged
^
g
from the
MAQC-2 data. We pooled the reads from the seven lanes gradually by adding one lane
75
Figure 4.3: The observed and the expected frequencies of the read count equal to k.(A-D) are for the top four highy expressed
genes in the liver tissue. Blue bars are for the observed frequencies, brown bars are for the expected frequencies based on
the Poisson model, and magenta bars are for the expected frequencies based on the GP model. We only plotted the ’k’s with
at least one frequency among the three types of frequencies 5. The total number of reads mapped to the genes ( x
g
l
g
)
were about 677476; 329818; 277529 and 272551. The total number of estimated reads from the GP model (
^
g
l
g
) were about
6340:2; 11297:0; 4589:7 and 9138:9 for these four genes.
0 6 13 22 31 40 49 58 69 89 127 159 170 181 192 203 214
0 100 200 300 400 500 600 700
0 5 11 18 25 32 39 46 53 60 67 74 81 88 95 103 114 135 209
0 50 100 150 200 250 300
0 6 13 21 29 37 45 54 66 85 97 118 128 138 148 158
0 50 100 150 200 250 300 350
0 6 13 21 29 37 45 53 61 69 77 89 116 126 136 146 156 166
0 10 20 30 40 50 60
A B
C D
Position-level read count k
Frequency
76
each time. Once we added one lane,
^
g
was calculated again. If from the first six lanes
was within 5% of the final from the total seven lanes, the gene was said to have a
converged
^
g
. The correlation between
^
g
and
p
x
g
for these genes was 0.78 for the
UHR sample and 0.81 for the brain sample. Because the increase of the sample variance
was faster than the increase of the sample mean, there was also a negative correlation
between
s
x
g
s
2
g
ands
g
(the correlation was around -0.19 to -0.71 for the real datasets).
This indicates a positive correlation between s
g
and
^
g
. Specifically, we found that
the Pearson correlation between
^
g
ands
g
was around 0.22-0.71 for all the considered
datasets. The results about x
g
ands
g
also explained why was dependent on biological
samples because the expression levels were different.
We next studied the nucleotide composition of each gene. We counted the occur-
rence of every possible oligonucleotide with length 6. Because the library prepa-
rations did not consider the strand information, we pooled the oligonucleotides with
their reverse complementary counterparts. Table 4.3 lists the top 10 oligonucleotides
with the largest absolute correlations with
^
g
. The mean value of
^
g
was used if there
were multiple biological samples ( for example the human and mouse datasets). For the
human tissue and yeast data (oligo(dT) primed or random-hexamer primed), if a gene
was A/T enriched, the bias parameter generally tended to be smaller. However, the
pattern was different for the mouse data in which the hydrolysis of RNA was used before
cDNA priming. It has been shown that the controlled RNA hydrolysis step significantly
improves the uniformity of sequence coverage [58]. The nature of the bias could be
different for the mouse data. In summary, the bias parameter is a combined effect of
the expression level, nucleotide composition, and experimental procedures such as RNA
hydrolysis before cDNA reverse transcription.
77
Table 4.3: Top 10 oligonucleotides with the largest absolute correlations with
^
. Every possible oligonucleotide with length 6
was considered. The occurrences of the oligonucleotides and their reverse complementary counterparts were pooled together
for each gene. The correlation between the log of the oligonucleotide frequency and
^
was calculated and ranked based on the
absolute value. For the human tissue data, the average
^
across the nine tissues was used. For the yeast data, the samples were
prepared using random hexamer or oligo(dT) primers. For the mouse data, the average
^
across three tissues were used. The
hydrolysis of RNA was performed before cDNA priming for the mouse data.
Human tissue Yeast random hexamer Yeast oligo(dT) Mouse tissue
Motif Corr Motif Corr Motif Cor Motif Corr
ACT or AGT 0:312 ATA or TAT 0:550 ATA or TAT 0:543 AG or CT 0:350
A or T 0:311 ATAA or TTAT 0:507 ATAA or TTAT 0:495 AGAG or CTCT 0:349
TA 0:310 AATA or TATT 0:495 AATA or TATT 0:488 CAG or CTG 0:347
TTTC or GAAA 0:308 TAAA or TTTA 0:495 TAAA or TTTA 0:484 CTC or GAG 0:346
ATTC or GAAT 0:308 AT 0:486 AT 0:474 TC or GA 0:342
TCA or TGA 0:308 AAAA or TTTT 0:481 AAAA or TTTT 0:469 TCTG or CAGA 0:342
TAG or CTA 0:308 ATG or CAT 0:478 AAAT or ATTT 0:462 AGC or GCT 0:341
AT 0:307 AAAT or ATTT 0:473 ATG or CAT 0:462 C or G 0:340
TTCA or TGAA 0:306 ATTA or TAAT 0:471 ATTA or TAAT 0:458 TCC or GGA 0:339
ACA or TGT 0:306 TA 0:468 TA 0:458 AGA or TCT 0:339
78
4.3.3 Normalization
To perform the normalization across different samples, the total RNA amount was
always assumed to be equal. Denote the total RNA amount for the k-th sample as
s
k
=
P
G
g=1
^
k;g
l
g
, ,where ^
k;g
is the estimated expression level of gene g in sample
k, andl
g
is the gene length. To compare sample 1 with sample 2, the expression level
in sample 2 can be adjusted by multiplying
s
1
s
2
. If we assume that the position-level
count across different positions of the gene follows a Poisson distribution, the MLE is
the average number of reads mapped to gene ( x
k;g
), and thens
k
is the total number of
reads mapped to the transcriptome. Thus, this is the standard normalization to the total
number of reads. If we assume that the position-level count follows a two-parameter GP
distribution, ^
k;g
=
^
k;g
. We remove the reads due to sequencing bias when we calcu-
late the total RNA amount. Robinson and Oshlack argued that some highly expressed
genes specific to one sample will lead to more falsely declared differentially expressed
genes if the standard normalization is applied [68]. They proposed a weighted trimmed
mean approach to remove extreme values when calculate the normalization factor. Our
MLE
^
for the GP model is equal to x(1
^
. It can be treated as a shrunk value of
x , which trims the extreme RNA amount contributed by a single gene in nature. We
should note that more than 99% of the
^
’s in our applications were larger than zero. In
addition, we have =(1). Therefore 1 represents how much needs to shrink
to approach the true gene expression rate. As shown in Figure 4.2, the GP expression
estimates of highly expressed genes were shrunk toward smaller values. Therefore the
effect of a few outlier genes on the normalization factor was relatively small.
79
4.3.4 Identification of differentially expressed genes
Based on the GP model, the log-likelihood ratio approaches were proposed to identify
differentially expressed genes. The scaling normalization factor
s
2
s
1
was used directly
in the parameter estimation, while the data itself were not modified and the sampling
properties were preserved. To evaluate our model, we used the technical or biological
replicates. There should be no differentially expressed genes between technical repli-
cates if the sequencing depth is similar. If the sequencing depth is unbalanced, certain
genes are detected in one sample but they are not detected or poorly detected in the
other sample, which will still lead to ”differentially expressed genes”. Table 4.4 lists the
number of differentially expressed genes declared from the comparison of these repli-
cates. The false discovery rate was controlled at 0.0001. In general, the GP model
declared less differentially expressed genes. For the MAQC-3 data, they were technical
replicates of the UHR sample with different library preparations, and the sequencing
depth was comparable between the replicates. The GP model identified no differentially
expressed genes. However, the Poisson model still identified as most as 120 differen-
tially expressed genes, which can be treated as false positives. For the mouse data, the
sequencing depth was much deeper for the second replicates. Although we performed
normalization, both models identified many false positives (left panel). If we randomly
selected a subset of the uniquely mapped reads from the original larger sample to make
the sequencing depth equal, the GP model identified much less false positives (right
panel). However, the Poisson model still identified as many as 1007 false positives. The
results indicated that we should design the same number lanes for the two-sample com-
parison to keep the sequencing depth similar. The results also indicated the need for a
more complicated normalization method for two samples with very different sequenc-
ing depth (e.g., the ratio of sequencing depth = 0.55). For the technical replicates of the
yeast data, although the sequencing depth was very different, the GP model identified
80
no false positives while the Poisson model identified a few false positives. The dif-
ference between biological replicates was larger than the difference between technical
replicates, even after we made the sequencing depth the same. In the above analysis, we
obtained the p-values for the likelihood ratio tests based on the chi-square distribution
with one degree of freedom. We also calculated the p-values based on the simulation
strategy mentioned in ”materials and methods”. The results were listed in Table 4.5.
The GP model still identified less number of differentially expressed genes between
technical replicates with similar sequencing depth.
Figure 4.4: ROC curves for the GP, the Poisson and the GLM(Poisson, negative binomial
and quasi-Poisson links) in the identification of differentially expressed genes. Genes
with the estimates in the six models and a reliable log-ratio value in the qRT-PCR exper-
iments were considered. We further limited our studies on the standard positives(qRT-
PCR absolute log-ratio> 2:0;n = 218) and the standard negatives (qRT-PCR absolute
log-ratio< 0:2;n = 74). A true positive was required to be differentially expressed in
the same direction according to both RNA-seq and qRT-PCR.
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
FPR
TPR
GP
Poisson
GLM
GLM_NB
GLM_quasi
81
Table 4.4: The number of differentially expressed genes declared from the comparison of technical or biological repli-
cates The numbers in the brackets were the sequencing depth ratio between the two samples. The sequencing depth was defined
as the total number of reads that can be uniquely mapped to the genome or junctions. For the mouse and yeast data, besides
the studies on the original data sets, we also randomly selected a subset of the uniquely mapped reads from the original larger
sample so that the sequencing depth was the same for these two samples. The DR was controlled at 0.0001. Only genes with
existing MLEs of the parameters for both the GP and the Poisson models in the two samples were considered.
Original data
GP Poisson GP Poisson
MAQC-3:library1 vs library 2 (1.11) 0 47 MAQC-3:library1 vs library 3 (0.98) 0 6
MAQC-3:library1 vs library 4 (1.02) 0 120 MAQC-3:library2 vs library 3 (0.89) 0 43
MAQC-3:library2 vs library 4 (0.92) 0 7 MAQC-3:library3 vs library 4 (1.04) 0 118
Original data Random subset from the original data
GP Poisson GP Poisson
Mouse brain: original vs tech(0.55) 422 390 Mouse brain: original vs tech(1.0) 2 265
Mouse liver: original vs tech(0.75) 67 823 Mouse liver: original vs tech(1.0) 27 720
Mouse muscle: original vs tech(0.86) 171 1085 Mouse muscle: original vs tech(1.0) 119 1007
Yeast random hexamers : original vs tech ( 0.66) 0 8 Yeast random hexamers : original vs tech ( 1.0) 0 8
Yeast oligo dT : original vs tech( 0.64) 0 2 Yeast oligo dT : original vs tech ( 1.0) 0 1
Yeast random hexamers : original vs bio( 0.96) 280 590 Yeast random hexamers : original vs bio ( 1.0) 275 579
Yeast oligo dT : original vs bio( 1.38) 203 571 Yeast oligo dT : original vs bio ( 1.0) 163 487
82
Table 4.5: The number of differentially expressed genes declared from the comparison of technical or biological repli-
cates The numbers in the brackets were the sequencing depth ratio between the two samples. The sequencing depth was defined
as the total number of reads that can be uniquely mapped to the genome or junctions. For the mouse and yeast data, besides
the studies on the original data sets, we also randomly selected a subset of the uniquely mapped reads from the original larger
sample so that the sequencing depth was the same for these two samples. The FDR was controlled at 0.0001. The p-values
were calculated by a permutation test and fitting the test statistic to a gamma distribution. Only genes with existing MLEs of
the parameters for both the GP and the Poisson models in the two samples were considered.
Original data
GP Poisson GP Poisson
MAQC-3:library1 vs library 2 (1.11) 4 47 MAQC-3:library1 vs library 3 (0.98) 0 6
MAQC-3:library1 vs library 4 (1.02) 9 120 MAQC-3:library2 vs library 3 (0.89) 2 43
MAQC-3:library2 vs library 4 (0.92) 4 7 MAQC-3:library3 vs library 4 (1.04) 8 118
Original data Random subset from the original data
GP Poisson GP Poisson
Mouse brain: original vs tech(0.55) 1019 390 Mouse brain: original vs tech(1.0) 20 265
Mouse liver: original vs tech(0.75) 468 823 Mouse liver: original vs tech(1.0) 182 720
Mouse muscle: original vs tech(0.86) 516 1085 Mouse muscle: original vs tech(1.0) 348 1007
Yeast random hexamers : original vs tech ( 0.66) 14 8 Yeast random hexamers : original vs tech ( 1.0) 0 8
Yeast oligo dT : original vs tech ( 0.64) 24 2 Yeast oligo dT : original vs tech ( 1.0) 0 1
Yeast random hexamers : original vs bio ( 0.96) 384 590 Yeast random hexamers : original vs bio ( 1.0) 378 579
Yeast oligo dT : original vs bio ( 1.38) 283 571 Yeast oligo dT : original vs bio ( 1.0) 251 487
83
Figure 4.5: ROC curves for the GP, the Poisson, and the GLM (Poisson, negative bino-
mial, and quasi-Poisson links) in the identification of differentially expressed genes.
Genes with the estimates in the six models and a reliable log-ratio value in the Quanti-
Gene experiments were considered. We further limited our studies on the standard pos-
itives (QuantiGene absolute log-ratio larger than 1.5, n=60) and the standard negatives
(QuantiGene absolute log-ratio less than 0.5, n=29). A true positive was required to be
differentially expressed in the same direction according to both RNA-seq and Quanti-
Gene.
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
FPR
TPR
GP
Poisson
GLM
GLM_NB
GLM_quasi
To further validate our methods, we utilized the real-time PCR (qRT-PCR) data for
the MAQC UHR and brain samples [73]. Genes with qRT-PCR absolute log-ratio less
84
than 0.2 were treated as negatives and genes with qRT-PCR absolute log-ratio larger
than 2.0 were treated as positives. The same criteria were used in [5]. Then the log-
likelihood ratio tests based on the GP model and the Poisson model were performed for
these gold-standard genes respectively. In addition, we also considered the generalized
linear model (GLM) proposed in [5] with the log/Poisson link function, the negative
binomial link function, and the quasi-Poisson link function. In the GLM, the replicate
lanes were considered separately. The true positive rate and the false positive rate were
calculated for each given threshold on the test statistics. Figure 4.4 shows the ROC
curves. It clearly shows that the GP model performs better than the Poisson and the
generalized linear models. The ROC curves ignored the genes with absolute log-ratio
between 0.2 and 2.0 (“no-call” genes). Ideally, these ”no-call” genes should have the p-
values between the p-values for the positive standards and the p-values for the negative
standards. For the GP model, 10 “no-call” genes had a p-value the median p-value of
the positive standards (i.e. false positives), and 95 “no-call” genes had a p-value the
median p-value of the negative standards (i.e. false negatives). However, if we used the
Poisson model, the number increased to 56 for the false positives and 117 for the false
negatives. The number further increased to 58 and 125 for the GLM model. Note that
for the MAQC-2 data, there were no extremely high gene expression levels even based
on the Poisson model. However, we still observed a significant improvement of the GP
model over other methods. Thus, the advantages of the GP model are not limited to the
normalization issue. This trend was also seen when we used the Quantigene assays as
validation. This is illustrated in figure 4.5
We applied the GP model to the human tissue data and identified the tissue-specific
genes. Specifically, for each tissue, we counted the number of genes differentially
expressed between this tissue and all the other eight tissues. The differentially expressed
genes were determined with FDR threshold 0.0001. Figure 4.6 shows that the GP model
85
Figure 4.6: Number of tissue-specific genes. For each human tissue, we counted the
number of genes differentially expressed between this tissue and all the other eight tis-
sues. Grey bars are for the GP model and the white bars are for the Poisson model. The
shaded regions represent the shared genes between the GP and Poisson models. Only
genes with existing MLEs of the parameters for both the GP and the Poisson models in
the two compared samples were considered.
Number of tissue-specific genes
brain liver muscle heart testes lymph adipose breast colon
0 500 1000 1500 2000
identified much fewer tissue-specific genes than the Poisson model. The brain tissue had
the largest number of tissue-specific genes followed by liver and muscle (grey bars).
However, if we used the Poisson model, liver had much more tissue-specific genes than
brain (2301 vs. 1871). Based on the GP model, we identified 119 housekeeping genes
with constant expression levels across the nine different tissues (thus, not differentially
expressed between any two tissues). Enriched gene annotations included “ribonucleo-
protein complex”, “ribosome biogenesis”, “rRNA processing”, “RNA processing”, and
so on. These were analyzed by the David gene annotation tool [23]. The number of
86
housekeeping genes dropped to 19 if the Poisson model was specified. The pattern was
similar if we used the simulation strategy to calculate the p-values shown in Figure 4.7
Figure 4.7: Number of tissue-specific genes. For each human tissue, we counted the
number of genes differentially expressed between this tissue and all the other eight tis-
sues. Grey bars are for the GP model and the white bars are for the Poisson model.
The p-value for the GP model was obtained by the simulation strategy. The shaded
regions represent the shared genes between the GP and Poisson models. Only genes
with existing MLEs of the parameters for both the GP and the Poisson models in the
two compared samples were considered.
Number of tissue-specific genes
brain liver muscle heart testes lymph adipose breast colon
0 500 1000 1500 2000
87
Figure 4.8: Identification of differentially spliced exons. A-C shows the number of differen-
tially spliced exons for each pair of human tissues. Only middle exons with existing MLEs of
the parameters for both the GP and the Poisson models in each pair of tissues were considered.
A is for the GP model with FDR 0.01 for each two-tissue comparison, B is for the Poisson model
with FDR 0.01, C is for the Poisson model with FDR 1:0 10
14
. The size of each black box
indicates the number of differentially spliced exons. However the size of the boxes between
different panels (A-C) is not comparable. The total discoveries for A-C are: 5099, 94849, and
5719. Ly: lymph node, Te: testes, Ad: adipose, Co: colon, Mu: muscle, He: heart, Li: liver,
Bre: breast, Bra: brain. D shows the difference of the inclusion rates calculated from the junction
reads for the differentially spliced exons identified by the GP model (GP) or the Poisson model
(P). The differentially spliced exons specific to the GP (GP specific) or the Poisson (P specific)
models are also shown. The FDR was controlled at 0.01.
Ly Te Ad Co Mu He Li Bre Bra
Ly Te Ad Co Mu He Li Bre Bra
Ly
Te
Ad
Co
Mu
He
Li
Bre
Bra
Ly Te Ad Co Mu He Li Bre Bra
Ly
Te
Ad
Co
Mu
He
Li
Bre
Bra
Ly
Te
Ad
Co
Mu
He
Li
Bre
Bra
GP model: FDR 0.01 Poisson model: FDR 0.01
Poisson model: FDR 1.0×10
−14
A B
C D
Identification of differentially spliced exons
Based on the GP model, we can also identify differentially spliced exons using the log-
likelihood ratio tests. We focused on the middle exons and compared the splicing ratio
88
of
e
g
between two samples. Figure 4.8A plots the number of differentially spliced exons
for each pair of human tissues. It clearly shows that brain was the tissue that had the
greatest number of the differentially spliced exons, which was consistent with previous
reports using microarrays [12] or EST-based approaches [90]. However, if we assumed
a Poisson model, many other tissues had a larger number of differentially spliced exons
(Fig. 4.8B). We used a more stringent FDR criterion for the Poisson model to make
the total number of discoveries approximately equal to that of the GP model. Still,
the pattern was different from that of the GP model (Fig. 4.8C). Note that Pan et al.
used the similar way to present their results by plotting the boxes [62]. In our data
preprocessing, we mapped the reads that cannot be mapped to the genome to the refseq
RNA sequences, and treated them as junction reads. Both junction reads and body reads
were used for the GP model. Here, we remapped the reads to an enlarged junction list
from [85] to get a more complete junction read counts. Then we calculated the inclusion
rate of a middle exon as: the number of inclusive junction reads/(the number of inclusive
junction reads + the number of exclusive junction reads). The absolute differences of
the inclusion rates between each pair of tissues were calculated and compared between
the differentially spliced exons identified by the GP model and those identified by the
Poisson model (Fig. 4.8D). Apparently, the differentially spliced exons identified by
the GP model had a larger inclusion rate difference between the two compared tissues.
For the exons identified only by the GP model, they also had a larger inclusion rate
difference than those identified only by the Poisson model. The results indicated that
the GP-based approach can better identify differentially spliced exons. The pattern was
similar if we used the simulation strategy to calculate the p-values (Figure 4.9)
89
Figure 4.9: Number of differentially spliced exons for each pair of human tissues based
on the GP model. The p-value for the GP model was obtained by the simulation strategy.
Ly Te Ad Co Mu He Li Bre Bra
Ly
Te
Ad
Co
Mu
He
Li
Bre
Bra
GP model: FDR 0.01
4.3.5 Comparison with other models
The negative binomial distribution has been used to model the additional variation from
biological replicates for the digital gene expression (DGE) data [69]. In the review paper
[63], the authors also mentioned that the background tag distribution for CHIP-seq data
can be modeled with a negative binomial distribution. In our study, the ROC in Figure
4.4 shows that the negative binomial performs even worse than the Poisson model. We
find that the variance of the gene-level read count was not always larger than the mean
across replicates. This is usually the case for technical replicates. For example, for the
90
MAQC-2 UHR sample, considering the gene-level read counts across seven replicate
lanes, 37% of the genes had a smaller sample variance than the sample mean and 60%
of the genes had a larger sample variance than the sample mean. For the MAQC-2 brain
sample, 32% of the genes had a smaller sample variance than the sample mean and 64%
of the genes had a larger variance. These indicate that for the gene-level read counts,
there is both overdispersion and underdispersion. The negative binomial which can only
handle overdispersion is inappropriate to handle the gene-level read counts. If we fitted
the negative binomial distribution to the position-level read counts, the goodness-of-
fit was much better than that for the Poisson model. But it was still worse than the
GP model. The results are shown in Table 4.6 The estimated mean from the negative
binomial model was treated as the gene expression estimate. The QuantiGene data were
used as gold standards to test the performance again. The slope for the linear regression
between the “true” signal and the estimate on the log scale was 0.72 (R
2
= 0.62) and
0.63 (R
2
= 0.49) for the two MAQC-2 samples. They were different from 1.0 and worse
than the GP model. In terms of computational efficiency, although the calculation of
MLEs for both the GP model and the negative binomial model utilized a numerical
optimization method, the latter was slower because of the involvement of the Gamma
function. For example, it took the GP model about 1 minute to estimate the expression
levels for the MAQC data, but it took the negative binomial model about 12 minutes.
All the results indicate that the GP model was better than the negative binomial model.
Recently, Li et al. [47] used the properties of local sequences to model the varying
Poisson rates for different positions and provided an R package “mseq”. The Poisson
rate of each position was modeled as dependent on the gene expression level and the
nucleotide sequence surrounding this nucleotide. To compare the GP model with the
mseq model, we simulated a bootstrap sample for the position-level read counts based
on the fitted GP distribution or the fitted Poisson rates from the mseq model. Then
91
Table 4.6: Goodness-of-fit of the GP model and the negative binomial (NB) model. The
percentage of genes and exons with a chi-square test p-value> 0.05 are listed.
Gene Level Exon level
GP (%) Negative Binomial (%) GP (%) Negativa Binomial(%)
MAQC data 85.72% 81.53% 89.62% 91.18%
Human data 77.28% 61.52% 88.78% 84.70%
Mouse data 88.57% 83.41% 91.73% 91.48%
Yeast data 93.24% 89.87% 93.21% 90.00%
MAQC-2 sep 92.93% 89.41% 92.90% 92.79%
the Kolgomorov Smirnov test was performed to compare the bootstrap sample from the
fitted model and the observed sample. Ideally the two samples should have similar dis-
tributions. For the GP model, about 50-98% genes with a p-value larger than 0.05, which
indicates that the bootstrap sample approximated to the observed sample. However, the
percentage was only about 0-2.04% for the mseq model. The results are summarized
in Table 4.7. The QuantiGene data were used as gold standards to test the expression
estimates of the mseq model. The slope for the regression lines was 0.61 (R
2
=0.62) and
0.58 (R
2
=0.54). It indicates that the expression estimates from the mseq model were
worse than those from the GP model. We should also note that the mseq model provides
a read count estimate for each position, but the GP model only provides the overall gene
expression estimate without specifying the expression level of each position.
Dependence of position level counts
In the GP model, we made an implicit assumption that the observed position level counts
are independent for different positions of a gene. If the dependence of different positions
is high, the error rate for the expression estimate will be high (about 10-fold increase for
correlation around 0.5 compared with the independent case). We simulated the depen-
dent GP through the multivariate model as follows:
92
Table 4.7: The comparison between the GP model and the mseq model. For each gene,
we simulated a bootstrap sample for the position-level read counts based on the fitted
GP distribution or the fitted Poisson rates from the mseq model. Then the Kolgomorov
Smirnov test was performed to compare the bootstrap sample from the fitted model and
the observed sample. Percentages of genes with a Kolgomorov Smirnov test p-value>
0.05 are listed. W1: Top 100 expressed genes in the mouse brain tissue ( Mortazavi
et. al. from the Wold Lab[58]). W2: Top 100 expressed genes in the mouse liver
tissue. W3: Top 100 expressed genes in the mouse skeletal muscle tissue. B1: Top 100
expressed genes in the human adipose, brain and breast tissues (Wang et. al from Chris
Burge’s Lab[85]). B2: Top 100 expressed genes in the human colon, heart and liver
tissues. B3: Top 100 expressed genes in the human lymph node, skeletal muscle and
testes tissues. G1: Top 100 expressed genes in the mouse embryoid bodies (Cloonan
et. al from Grimmond Lab[14]). G2 : Top 100 expressed genes in the undifferentiated
mouse embryonic stem cells. These datasets were used in the original paper about the
mseq model [47]. Goodness-of-fit of the GP model and the negative binomial (NB)
model. The percentage of genes and exons with a chi-square test p-value 0.05 are
listed.
Dataset Mseq GP
W1 0 98%
W2 0 82%
W3 0 86%
B1 0 88%
B2 0 91%
B3 0 94%
G1 2.04% 50%
G2 1% 81%
1. ForN bases, we simulated the countsY
i
;i = 1;;N as:Y
i
=X
i
+X
N
+ 1, where
X
i
GP (
1
;) and X
N
+ 1 GP (
2
;). Therefore Y
i
GP (
1
+
2
;) and
X
N
+ 1 represents the dependence between different bases [84] .
2. We then estimated and fromY
i
s based on the GP model.
3. We varied
1
;
1
2
, and to consider different scenarios.
4. For each scenario, we calculated the correlation between positions based on 1000
simulations. The simulation study is summarized in Table 4.8. However, in the
93
real situation, the positions close to each other may be dependent on each other,
but the positions far away should be independent. Thus, there is a block-wise cor-
relation structure. The error rate should be smaller than the case where all of the
positions are dependent on each other with a correlation 0.5. More sophisticated
model can be developed to better address the dependence issue in the future.
4.4 Conclusions
In the chapter, we focused on the position-level read count (i.e. the number of reads start-
ing from each position). We found that a two-parameter GP model can fit the position-
level read counts more appropriately than a traditional Poisson model. The parameter
reflects the transcript amount for a gene (or an exon) and the parameter represents
the average bias during the sample preparation and sequencing process for this gene
(or exon). The goodness-of-fit studies showed that the GP model fits the position-level
read count much better than the Poisson model. Through the GP model, we can better
estimate gene and exon expression levels and perform the normalization across different
samples. The GP model improves the identification of differentially expressed genes
The estimated gene expression
^
can be treated as a shrunk value of the sample mean,
because
^
= x(1
^
) and
^
is a positive value (but< 1) for more than 99% of the exons
and genes in our applications. In this sense,
^
can be treated as a shrinkage factor for
the gene expression estimation. This relationship can also be inferred by the equation
that = (1). Because
^
had a positive correlation with
p
^ x, we shrunk more for
the highly expressed genes or we removed more reads due to sequencing bias. In con-
trast to microarray data in which the signal is saturated at a certain threshold, RNA-seq
data have some extremely large values. However, when we examined the details of the
highly expressed genes, we found some suspicious large number of reads starting from
94
Table 4.8: The error rate of the GP model. The error rate was calculated based on 1000 simulations. The error rates are listed
in the table and correlations between positions are listed in the brackets. The error rates for independent case (
2
=0) are also
listed.
= 0:2 = 0:2 = 0:2 = 0:45 = 0:4
1
= 0:5
1
= 1
1
= 2
1
= 0:5
1
= 1
1
=
2
=3 0.569 (0.19) 0.529 (0.184) 0.45 (0.208) 0.639 (0.198) 0.757 (0.206)
1
=
2
= 2 0.795 (0.315) 0.7 (0.311) 0.66 (0.327) 0.923 (0.321) 1.677 (0.32)
1
=
2
=1 1.005 (0.516) 0.863 (0.501) 0.69 (0.5) 1.265 (0.53) 1.01 (0.52)
2
= 0 0.124 (-0.018) 0.096 (-0.014) 0.08 (-0.002) 0.125 (-0.02) 0.096 (-0.02)
= 0:42 = 0:8 = 0:8 = 0:8
1
= 2
1
= 0:5
1
= 1
1
= 2
1
=
2
=3 0.39 (0.5) 1.013 (0.18) 0.902 (0.19) 0.71 (0.19)
1
=
2
= 2 1.13 (0.603) 1.521 (0.323) 1.276 (0.323) 0.98 (0.329)
1
=
2
=1 0.612 (0.51) 2.34 (0.551) 1.82 (0.538) 1.362 (0.53)
2
= 0 0.08 (-0.001) 0.124 (-0.033) 0.096 (-0.033) 0.08 (-0.031)
95
several positions of the genes. These outliers substantially affected the sample mean
estimation (see Fig. 4.3). On the contrary, our GP model can remove these suspicious
reads and obtain a more reasonable expression estimate. From our real data analysis,
we also found a positive correlation between
^
and the sample standard deviations. In
addition ,
^
also had high correlations with certain A/T enriched oligonucleotides for
the commonly used library preparation procedures such as random hexamer priming
or oligo(dT) priming in human and yeast. However, if the RNA hydrolysis was used
before cDNA priming such as in the mouse data, the associated oligonucleotides were
changed. More RNA-seq data are needed to draw a conclusion here. Further investiga-
tion is needed to study the underlying mechanisms of. Hansen et al. found that the
random hexamer priming induces bias in the nucleotide composition at the beginning of
the sequence reads and they proposed a reweighting scheme for the read count [32]. We
can easily incorporate it into our data preprocessing to further remove such bias.
The shrinkage property of the GP model has an immediate impact on the normal-
ization across different samples, because a few highly expressed genes specific to one
sample makes the library-size based normalization problematic. For example, in the
human liver tissue, the top 10 genes contributed about 31% of the total RNA amount if
the was used to estimate the gene expression. If these 10 genes are not expressed in the
second tissue and we assume that the total RNA amounts are equal for the normaliza-
tion, the RNA amounts of the remaining genes are actually very different between the
two tissues. Lets insert hypothetical values for this example. The RNA amount from the
10 genes in the liver tissue is 31 and the RNA amount from the remaining thousands of
genes is 69. If the 10 genes are not expressed in the second tissue (i.e. 0 RNA amount),
the RNA amount for the remaining genes will be scaled to 100 to make the total RNA
amount equal. However, the identification of differentially expressed genes among the
remaining genes will be problematic because their total RNA amount is 69 vs. 100. On
96
the contrary, in our GP model, the contribution from the top 10 genes was shrunk to
1-4% for all of the considered tissues. It therefore improves the identification of differ-
entially expressed genes and the identification of differentially spliced exons (see Table
4.4, Figs 4.6,4.8). In the future, we can improve the normalization by further removing
extreme values from the GP model and adding the weights about the robustness of the
estimates, similar to the method proposed in [68].
To evaluate the performance of the GP model on the identification of differentially
expressed genes, we used technical or biological replicates as negative controls. For the
technical replicates, ideally, there should be no differentially expressed genes. We found
that if the sequencing depth was comparable, the GP model identified no false positives
in general. But the Poisson model identified many false positives. As we expected, the
difference between biological replicates was larger than the difference between technical
replicates. For the very unbalanced design (i.e. the sequencing depth was very different
between the two samples), the GP model still performed better than the Poisson model.
If the sequencing depth was the same, the performance of the GP model was much
better than that of the Poisson model. The results also demonstrate that the advantages
of the GP model over the Poisson model are not limited to the normalization issue. The
ROC curve studies also concluded that the GP model can better identify differentially
expressed genes (Fig. 4.4).
If the sequencing depth is very different for the two samples, besides the difficulty
of normalization, the different robustness of the parameter estimation is another issue.
As reported in [58], the RPKM estimation was sensitive to the total number of map-
ping reads for genes with low expression levels. We also showed that the parameter
estimation for the GP model was sensitive to the sequencing depth for genes with low
expression levels. Therefore, we suggest designing a comparable sequencing depth to
perform sample comparison. In addition, a deeper sequencing depth will provide us
97
more stable expression estimates. The deeper sequencing yields more sequence reads.
A longer read can help to map to the genome more accurately. However it does not
contain any addition information about the sampling property of the RNA-seq data.
We used the maximum likelihood approach to estimate the parameters in the GP
model. If a gene expression level is too low and there is no position with more than
one read, the MLE does not exist. For the identification of differentially expressed
genes and differentially spliced exons, we used the log-likelihood ratio tests. Under
the null hypothesis, we calculated the profile likelihood. Our parameter of interest is
=
1
2
(or =
Z
X
V
Y
=
b
1
b
2
for differentially spliced exons study) and under the null
= 1. There are nuisance parameters, b, and’s. For the first two parameters, we
estimated them in a usual way. However, we found that ’s are very dependent on
and they cannot be treated as orthogonal parameters (results not shown). For simplicity,
we treated the estimates from the unrestricted model as true values and put them
into the profile likelihood calculation. A more sophisticated method can be specified to
better calculate the profile likelihood, but it will increase the computational complexity.
Here we also provided a simulation strategy to estimate the null distribution of the test
statistics instead of using the chi-square distribution with one degree of freedom.
The current studies of alternative splicing using RNA-seq data focus on the junction
reads and calculate the “inclusion rate”. The “inclusion rate” was represented by the
ratio between inclusive junction reads and the sum of inclusive and exclusive junction
reads. In exon array studies, the differentially spliced exons were examined in terms
of the comparison between the “splicing ratios”, and the “splicing ratio” was repre-
sented by the ratio between exon expression and gene expression. The “inclusion rate”
is always 1, but the “splicing ratio” can be larger than 1 because the “gene expression”
is an overall estimate of the expression levels of multiple isoforms. Therefore, the value
98
of the “splicing ratio” in a single sample alone cannot give us much information, but the
comparison of the two “splicing ratios” is meaningful to identify differentially spliced
exons. On the contrary, the “inclusion rate” in a single sample contains the informa-
tion about whether this exon is alternatively spliced. However, the sequencing depth
is usually too low to obtain enough junction reads and the positional bias is strong. In
this work, we showed that the GP model can better identify differentially spliced exons
using the “splicing ratio” approach. More importantly, the fold change of two different
genes of the same sample can be accurately estimated by the GP model. Therefore,
in the future, we will compare an exon with a constitutive exon in the same sample to
identify alternatively spliced exons based on the GP model.
99
Chapter 5
Quantifying transcript isoform
expression with positional bias from
RNA-seq
High throughput sequencing technologies have enabled us to study transcriptomes at an
unprecedented resolution. However, the deconvolution of transcript isoform expression
levels is still challenging. In this chapter,we propose a linear model with the general-
ized Poisson assumption to quantify transcript isoform expression. We demonstrate the
effectiveness of the model using simulations. We show that the generalized Poisson-
based model is able to better estimate the relative expression levels of different isoforms
of a given gene compared to two existing methods. We also apply the model to a human
RNA-seq dataset generated from the Illumina HiSeq machine. Using the generalized
Poisson-based model, we demonstrate that the positional bias in sequencing needs to be
effectively modeled to better quantify isoform levels and improve our understanding of
transcriptome complexity.
5.1 Introduction
High-throughput sequencing technologies have enabled transcriptome quantification by
deep sequencing the cDNA reverse transcribed from RNA in cells. These technologies,
originally used to sequence DNA and genomes, have now been applied to transcriptomes
100
of many organisms [14, 58, 59]. Sequence reads generated from these high throughput
experiments are mapped back to the reference genomes to estimate the expression levels
of genes and exons. Different models have been employed to accurately estimate the
expression levels from these reads. Some of these include using the Poisson model[58]
and the Negative Binomial model [4]. Recently, we used the generalized Poisson model
to effectively model read counts to accurately estimate expression levels of genes and
exons [78].
Besides the expression quantification at the gene and exon level, quantification of
transcript isoforms is another important perspective of transcriptomics. Alternative
splicing via which one gene gives rise to multiple transcript isoforms by selecting differ-
ent combinations of exons is prominent in higher organisms [85, 62]. It greatly increases
the complexity of an organism’s transcriptome and proteome. Therefore, to effectively
characterize and understand the transcriptome, we need to quantify gene expression at
the transcript isoform level. Many diseases such as cancer have been related to the mis-
splicing of transcripts [10, 94] and therefore the accurate understanding and quantifying
of transcript isoforms will also help in drug design and disease therapy.
The Poisson assumption has been used to infer the expression levels of different
transcript isoforms [37]. However the Poisson model cannot effectively model the bias
in RNA-seq experiments. Wu et al. [89] introduced a non-uniform read distribution
model to improve isoform expression estimation. In this chapter, we propose a decon-
volution model based on the generalized Poisson assumption to infer transcript isoform
expression. The generalized Poisson assumption can effectively separate the bias and
the actual signals in RNA-seq experiments, as we discussed in [78]. Similarly, the bias
removal leads to more accurate estimates of isoform expression in the deconvolution
model.
101
The remainder of the chapter is organized as follows. We introduce the deconvolu-
tion model based on the generalized Poisson assumption first. Then we use simulation
studies to demonstrate how the model performs better than other algorithms. We also
apply the method to a human RNA-seq dataset with 16 tissues and summarize the results.
Finally, we conclude the chapter with a brief discussion.
5.2 Methods
5.2.1 Isoform Estimation
We propose to use the generalized Poisson assumption in the isoform expression decon-
volution. The generalized Poisson distribution is given by :
P (Y =y) =
( +y)
y1
e
(+y)
y!
; (5.1)
whereY is a random variable following a generalized Poisson distribution with param-
eters and [17, 16] and we limited to the case when is nonnegative. As described in
([78]) ,Y represents the position level read count for a given gene or exon andy is the
number of sequenced reads starting from that position.
If we write =f
1
;
2
;:::;
n
g, where
i
represents the parameter in the gen-
eralized Poisson model for the i
th
exon of the gene with a total of n exons, and
=f
1
;
2
;:::;
m
g, where
j
is the expression level for thej
th
isoform of the gene
with a total ofm isoforms. Then and are related as follows =X, whereX is the
mn design matrix defined as
x
ij
=
8
>
>
<
>
>
:
1 isoformj contains exoni
0 isoformj skips exoni
(5.2)
102
Specifically, for thei
th
, we have
i
=
P
m
j=1
x
ij
j
. In a coordinate hillwise approach,
when we estimate
j
, we already know the
k
values8k6=j. Therefore we can rewrite
i
as
i
= c
ij
+x
ij
j
. Thus, all
i
’s are a function of
j
. The likelihoodL() and log
likelihoodl() can be written as
L() =
n
Y
i=1
i
(
i
+
i
y
i
)
y
i
1
e
(
i
+
i
y
i
)
y
i
!
;
l() =
n
X
i=1
log(
i
) + (y
i
1)log(
i
+
i
y
i
) (
i
+
i
y
i
)log(y
i
!);
where the sum and product are taken over all the exons of the gene. Taking the derivative
of the log likelihood function with respect to
j
, we can calculate the score function
U(
j
) which can be written as
U(
j
) =
@l()
@
j
=
n
X
i=1
1
i
+
y
i
1
i
+
i
y
i
1
X
ij
To use the Newton Raphson Method, we take the second derivatives of the log likelihood
function which can be written as
@
2
l()
@
j
k
=
n
X
i=1
1
2
i
(y
i
1)
(
i
+
i
y
i
)
2
X
ik
X
ij
Therefore, we can use the Newton Raphson Method as follows till convergence to cal-
culate the values
k+1
=
k
H
1
U (5.3)
103
whereH =
@
2
l()
@
j
k
mm
andU =fU(
1
);U(
2
);:::;U(
m
)g.
To ensure concavity, we look at the Hessian matrixH() defined as ann matrix
where the (j;k)
th
element is
H
jk
() =
@
2
l()
@
j
@
k
=
n
X
i=1
1
2
i
(y
i
1)
(
i
+
i
y
i
)
2
X
ik
X
ij
=
n
X
i=1
d
i
X
ik
X
ij
=
n
X
i=1
H
(i)
jk
()
whered
i
=
1
2
i
+
(y
i
1)
(
i
+
i
y
i
)
2
. Since the sum of concave functions is also concave, if
we prove that H
(i)
is negative semi-definite for an arbitraryi , then we can show that
the likelihood is concave.
PROPOSITION:H
(i)
is negative semi-definite.
Proof :
Case I :y
i
1)
(y
i
1)
(
i
+
i
y
i
)
2
0)
1
2
i
+
(y
i
1)
(
i
+
i
y
i
)
2
0
Case II:y
i
= 0)
(y
i
1)
(
i
+
i
y
i
)
2
=
1
2
i
)
1
2
i
+
(y
i
1)
(
i
+
i
y
i
)
2
= 0
Therefore8i;d
i
0. For any vectorz = (z
1
;z
2
;:::;z
m
)
0
, we have
z
0
H
(i)
z =z
0
(dX
i
X
0
i
)z
=d(X
i
z
0
)
0
(X
0
i
z)
=dkX
i
z
0
)k
2
0
104
where X
i
= (X
i1
;X
i2
;:::;X
im
)
0
This implies that H
(i)
is negative semi-definite .
SinceH
(i)
is negative semi-definite,H is also negative semi-definite and therefore the
log likelihood function is concave.
Because the joint log-likelihood function is concave we can use a simple optimiza-
tion method to compute the MLE
^
. Jiang and Wong [37] used a coordinate-wise hill
climbing algorithm (however they used the Poisson assumption in their model). We can
also use the coordinate wise hill climbing algorithm by writing l() in terms of
j
as
follows
l() =
n
X
i=1
log(c
ij
+x
ij
j
)+(y
i
1)log(c
ij
+x
ij
j
+
i
y
i
)(c
ij
+x
ij
j
+
i
y
i
)log(y
i
)
(5.4)
wherec
ij
is given by
c
ij
=
X
k=1:::m;k6=j
x
ik
k
(5.5)
For a givenj andi ,c
ij
is constant for the coordinate-wise hill climbing. This method
will be used when the Newton Raphson method cannot converge because of problems
in matrix inversion. At each iteration we will ensure that
j
0 for allj = 1;:::;m.
5.2.2 Simulation Studies
We used two types of simulations to demonstrate the usefulness of our model and com-
pared it with another two existing models. For our basic simulation model, we simulated
a gene with three transcripts and four exons. The transcript structure is detailed in Fig-
ure 5.1. Therefore we have two exons which may undergo alternative splicing and two
constitutive exons. Each exon was 100 base pairs long and the reads were 25 base
pairs long. We simulated the reads for each of the isoforms using a generalized Poisson
105
Figure 5.1: Transcript structures used in the basic simulation model
model, because it has been demonstrated that the real data can be well fit by a gener-
alized distribution [78]. The expression level of a transcript was represented as the
parameter in the generalized Poisson model and the values were fixed for individual
exons. The intuition is that the parameter represents the true expression of each iso-
form and the parameter represents the bias which can be position specific and hence
varies for different exons. Therefore the read counts for exoni in isoformj were sam-
pled from a generalized Poisson distribution with parameters
j
;
i
.
1
;
2
;
3
represent
the expression levels of Isoforms 1,2 and 3 respectively and (
i
;
i
) are the generalized
Poisson parameters for individual exons.
We compared our model with two methods: the Poisson model as proposed by
Jiang and Wong [37] and the Non Uniform Read distribution (N-URD) model proposed
in [89]. The Poisson model models the position level counts of reads as a Poisson distri-
bution and the N-URD model further incorporates bias curves which were step functions
and constant at each exon. We used the following error term to assess the model perfor-
mance.
=
3
X
i=1
i
P
3
j=1
j
^
i
P
3
j=1
^
j
!
2
(5.6)
where the
i
’s are the true expression estimates and the
^
i
’s are the estimated expression
levels. Three different scenarios were considered:(1) The isoform expression levels were
fixed, but we changed the values. (2) The values were fixed, but we changed the
106
isoform expression: two expression levels of two isoforms were kept constant and the
expression level of the third isoform was varied at a ratio of 0.1, 0.2, 0.5, 1 , 2 ,5 and 10
to the other isoforms. (3) The values were sampled from real data. For (1) and (2), the
i
values increased from 5’ to 3’ end of the gene. This was to mimic the phenomenon
that there is more bias at the 3’ end compared to the 5’ end in some RNA-seq data. We
also randomly selected one exon and changed the number of mapped reads by 20% for
(1) and (2). This modification was to introduce extra noise and mimic the real data.
For the second type of simulations, we used the isoform structures from the human
RefSeq annotations. The isoform expression were sampled from a normal distribution
N(; 2) where the mean was sampled from another normal distributionN(; 1). The
values of were varied from 1 to 6. The normal distributions were truncated at 0
to remove any isoforms with negative expressions. We simulated the bias curve by
sampling real bias from the HiSeq RNA-seq dataset. The bias curve was generated by
using single-isoform genes with more than 100 mapped reads. We divided each gene
into 100 regions and normalized the number of reads mapped to each region by the total
number of reads mapped to that gene (denoted asb
j
). We simulated a bias curveb
0
using
the following algorithm:
1. Calculate the median of theb
j
’s:m.
2. Find the set of outliersA =fijb
i
= 2 (m 1:5IQR;m + 1:5IQR)g where IQR is
the interquartile range.
3. We calculate the bias as follows
b
0
i
=
8
>
>
<
>
>
:
1 i = 2A
b
i
m
i2A
(5.7)
107
We did not use the originalb
j
as the bias because the sum ofb
j
’s was always 1. But in
real data, the sum of the bias for a gene has no such constraint. If the expression level
of an isoform j is given by
j
, then we simulate the number of reads mapping to the
k
th
position of isoform j as a Poisson with mean
j
b
0
i
ifb
100k
l
c = i, where l is the
total number non-redundant exonic positons for this gene. Therefore, ifb
0
i
= 18i, then
we simulate the number of reads mapping to each position of the gene as Poisson with
mean
j
. After estimating the isoform expression levels by the three methods, we cor-
related the results with the true simulation parameters. The isoforms whose expression
estimates did not converge were removed from this analysis. This included isoforms
whose base level expression was estimated as greater than 100 and those isoforms with
negative expression estimates.
5.2.3 Application to Real Data
We also applied our model to a human Illumina HiSeq dataset with 75 base pair reads
and a total of 16 tissues. The transcript annotations were obtained from the RefSeq
database in NCBI. Overlapping transcripts were combined into gene clusters and over-
lapping exons were split into disjoint exons. Reads were mapped using PerM [13] with
at most 8 mismatches per read allowed. Isoform expression levels were then estimated
using both the generalized Poisson model and the Poisson model.
108
5.3 Results
5.3.1 Results from Simulation Studies
Basic Simulations based on the Generalized Poisson Model
In the basic simulation model, we used the isoform structure described in the Methods
section. Different values of
0
s were specified for different isofoms and different
values were specified for different exons. The results are outlined in Tables 5.1-5.3. In
Table 5.1, we kept the expression levels of the isoforms constant and varied the values.
We varied a parameter and set the parameters for the 4 exons in the simulation model
as
1
= ;
2
= + 0:1;
3
= + 0:2;
4
= + 0:3 where
i
is the parameter for
exoni. The generalized Poisson estimate had a much lower error than both the Poisson
and the N-URD model. All three methods were implemented using the Newton Raphson
optimization algorithm. Our model based on the generalized Poisson assumption always
performed better than the Poisson model and the N-URD model. When the positional
bias was large (i.e. the values were large), the Poisson model and the N-URD model
resulted in a much higher error rate. However, if the value was very negative (e.g.,
<0:3), our model performed worse (data not shown) due to the errors in parameter
estimation for the truncated distribution function. But in real data, as is discussed in
the real data analysis section, almost all of the values are greater than 0 and hence
our method would perform much better. In Table 5.2, we kept the values constant
by setting = 0:5 (
1
= 0:5;
2
= 0:6;
3
= 0:7;
4
= 0:8) and varied the isoform
expression levels. Again, the GP model gave a much smaller error as compared to the
Poisson and N-URD model. In Table 5.3, we sample real values from genes with four
exons in the human dataset. It can be seen that the GP model still did much better than
the other two models. In addition, we found that the N-URD model performs worse
than the Poisson model in terms of our defined error term. It indicates that the step
109
function specified in the N-URD model cannot capture the bias well. Even though the
simulations in this study are biased towards the generalized Poisson model, it shows the
inability of the Poisson model and N-URD model to handle such biases. We introduce
extra noise to show the robustness of the GP model.
Table 5.1: Comparison of isoform estimation using the generalized Poisson (GP) model,
the Poisson model, and the N-URD model with different’s. The bias parameters for
the four exons were
1
= ;
2
= + 0:1;
3
= + 0:2;
4
= + 0:3. The errors,
as described in the Methods section, were represented by. In this table, we kept the
constant (
1
=
2
=
3
= 1) and varied the values. Extra noise was introduced to a
randomly selected exon.
GP
Poisson
N-URD
0 0.0017 0.0085 0.1
0.1 0.0021 0.011 0.119
0.2 0.0017 0.016 0.145
0.3 0.0017 0.021 0.166
0.4 0.0016 0.037 0.654
0.5 0.0016 0.067 1.266
0.6 0.0018 0.138 0.654
Simulations Based on the Sampled Real Bias
We also performed simulations using bias curves generated from the HiSeq dataset as
described in the Methods Section. After simulating the reads, we used the generalized
Poisson, the Poisson and the N-URD models to estimate the isoform expression lev-
els. We calculated the Pearson correlation coefficient (PCC) between the true isoform
expression and the estimated isoform expression. We also calculated theR
2
value from
fitting a regression line between the true isoform expression and the estimated expres-
sion. As shown in Table 5.4, the generalized Poisson performed much better than the
other two methods according to both the PCC and R
2
. The generalized Poisson can
effectively shrink the bias and hence provides an accurate expression level estimation.
110
Table 5.2: Comparison of isoform estimation using the generalized Poisson (GP) model,
the Poisson model, and the N-URD model with different’s. In this table, we kept the
constant (
1
= 0:5;
2
= 0:6;
3
= 0:7;
4
= 0:8) and varied the values. Extra noise
was introduced to a randomly selected exon.
1
2
3
GP
Poisson
N-URD
1 1 0.2 0.00266 0.125 0.389
1 1 0.5 0.0024 0.092 1.04
1 1 1 0.0016 0.067 1.26
1 1 2 0.0011 0.039 1.34
1 1 5 0.0006 0.013 1.52
1 1 10 0.0002 0.004 1.32
0.2 1 1 0.0019 0.05 1.19
0.5 1 1 0.002 0.055 0.48
2 1 1 0.0015 0.084 0.259
5 1 1 0.0019 0.073 0.366
10 1 1 0.006 0.075 0.45
1 0.2 1 0.002 0.03 0.41
1 0.5 1 0.0016 0.054 0.732
1 2 1 0.0015 0.088 0.49
1 5 1 0.002 0.117 1.36
1 10 1 0.005 0.131 1.71
Table 5.3: Comparison of isoform estimation using the generalized Poisson (GP) model,
the Poisson model, and the N-URD model with’s sampled from the real data. In this
table, we kept the values constant (
1
=
2
=
3
= 1) and sampled the values for
each exon (
1
;
2
;
3
;
4
) from real data sets.
1
2
3
4
GP
Poisson
N-URD
0.55 0 0.6633 0.706 0.002 0.243 0.137
0.625 0.645 0 0.652 0.0013 0.096 0.106
0.871 0.789 0.789 0.768 0.0015 0.027 0.041
0.634 0.613 0.414 0.777 0.0013 0.067 1.333
0.437 0.325 0.235 0.454 0.0016 0.024 0.023
0.482 0.5 0.575 0.707 0.0014 0.0465 0.184
This can also be seen in the slope values of the regression lines for the generalized Pois-
son, Poisson, and N-URD models. A slope close to 1 implies that the model is able
to accurately estimate the expression level. In Table 5.5, we used the relative isoform
expression instead, thus we normalized the each isoform expression by the sum of all
111
Table 5.4: Comparison of absolute isoform expression estimation using the GP, Poisson
and N-URD models. The second type of simulations described in the Methods sec-
tion was used. The parameter related to isoform expression was varied from 1 to 6.
The bias was sampled from the real data. The Pearson correlation coefficient (PCC)
was calculated between the absolute isoform expression estimates and the true isoform
expression. A regression line was also fit between the true isoform expression and the
expression estimates with theR
2
and slop reported.
1 2 3 4 5 6
PCC for GP 0.92 0.75 0.88 0.80 0.69 0.65
PCC for Poisson 0.47 0.39 0.60 0.43 0.57 0.35
PCC for N-URD 0.07 0.22 0.099 0.16 0.10 0.09
R
2
for GP 0.85 0.56 0.77 0.64 0.48 0.42
R
2
for Poisson 0.22 0.15 0.36 0.19 0.33 0.12
R
2
for N-URD 0.005 0.049 0.009 0.025 0.01 0.007
Slope for GP 1.08 0.78 1.10 0.89 0.71 0.63
Slope for Poisson 0.14 0.10 0.28 0.09 0.22 0.08
Slope for N-URD 0.005 0.007 0.009 0.008 0.011 0.002
Table 5.5: Comparison of relative isoform estimation using the GP, Poisson and N-URD
models. The second type of simulations described in the Methods section was used. The
estimated isoform expression was normalized by the sum of the isoform expression for
the same gene.
1 2 3 4 5 6
PCC for GP 0.96 0.94 0.95 0.92 0.91 0.88
PCC for Poisson 0.91 0.85 0.88 0.89 0.89 0.87
PCC for N-URD 0.78 0.60 0.74 0.78 0.83 0.66
R
2
for GP 0.91 0.89 0.91 0.86 0.82 0.77
R
2
for Poisson 0.82 0.73 0.78 0.80 0.78 0.75
R
2
for N-URD 0.60 0.36 0.54 0.61 0.69 0.43
isoform expression levels for the same gene. Through the comparison on the PCC and
the R
2
values, we can see that the generalized Poisson again did much better in pre-
dicting the relative isoform expression levels compared to the Poisson and the N-URD.
We note that all of the three models performed better in predicting the relative expres-
sion than predicting the absolute expression. The possible reason could be that the bias
is partially canceled out when we take ratios between isoform expression levels. We
also tried Cufflinks [83] with the “bias” option [66] to estimate the relative isoform
112
expression. The read counts were simulated using the strategy we mentioned before
and the gene sequence composition was simulated using a uniform model for each of
the nucleotide bases. The gene annotation was provided to cufflinks apriori. Cufflinks
performed much worse than the three methods with a correlation between 0.36 to 0.503
for the relative isoform expression estimation.
5.3.2 Results from Real Data Analysis
Distribution of Values
We applied the generalized Poisson-based isoform estimation method to a 16-tissue
RNA-seq dataset obtained from the Illumina HISeq machine. We were able to obtain
isoform expression estimates between 25.5% to 46.62% of the Refseq genes. The
remaining genes either had very low expression levels or had no converged isoform
estimates. Figure 5.2A shows the distribution of for the different exons in the brain
tissue. It can be seen that most of the values are greater than 0. We also took a subset
of the exons which had more than 100 mapped reads. The distribution is shown in
Figure 5.2B. All the values for these exons were greater than 0 and most of them
were greater than 0.5. Hence in this scenario, it is expected that we will get a much
more accurate estimate using the generalized Poisson estimation algorithm. Similar
conclusions can be made for the other 15 tissues (data not shown).
113
Figure 5.2: Distribution of values. (A) All exons with estimates for the Brain RNA-
seq HISeq dataset were used. (B) Highly expressed exons were used.
114
Figure 5.3: The annotation and coverage of 3 different genes from Human RNA-seq experiments using the Illumina HiSeq
machine. The green dotted lines represent the boundaries of disjoint exonic regions and the black bars represent the exons
which are included in a given transcript. The transcript structures are shown at the top of the figure while the coverage is plotted
at the bottom of the figure. The relative abundances as calculated by the GP and Poisson models are provided along with the
transcript names.
0 1000 2000 3000 4000
0 100 200 300
(a) Annotation and Coverage of GJA1 in Adrenal tissue
Position on the gene
Coverage (number of reads)
uc011ebo.1(GP:0.5% P:42.5%)
uc011ebp.1(GP:1.5% P:0.5%)
uc003pyr.2(GP:98% P:57%)
0 2000 4000 6000 8000 10000
0 20 40 60
(b) Annotation and Coverage of CREBZF in Liver tissue
Position on the gene
Coverage (number of reads)
uc010rtc.1(GP:65% P:23.8%)
uc010rtd.1(GP:32.5% P:75% )
uc001pas.2(GP:2.5% P:8.75%)
0 1000 2000 3000 4000
0 50 150
(c) Annotation and Coverage of ACP1 ( Acid Phosphatase1 ) in Liver tissue
Position on the gene
Coverage (number of reads)
uc002qwd.2(GP:0.19% P:8.13%)
uc002qwe.3(GP:46.3% P:27.1%)
uc002qwg.2(GP:31.9% P:15 %)
uc002qwh.2(GP:0.6 % P:39 %)
uc002qwf.2(GP:21.4% P:10.5%)
115
the mRNA from this gene, while the generalized Poisson estimates the expression to be
0.5%.
Figure 5.3 (b) shows the coverage of the CREBZF gene in the human liver tissue.
CREBZF interacts with a Herpes Simplex Virus-1 related Host Cell Factor 1 [51]. Over-
expression of CREBZF inhibits viral replication of HSV-1 thereby implicating the role
of the protein during viral infection [2]. The read coverage for this gene is relatively
low compared to the other two examples and the bias seems to be fairly uniform. The
poisson distribution estimates that 75% of the gene transcript is from the uc010rtd.1 iso-
form, while our generalized Poisson model estimates that uc010rtc.1 with 65% of the
gene expression is the major isoform.
Figure 5.3 (c) shows the coverage of the acid phosphatase 1 (ACP1) gene in the
human liver tissue. The product of this gene is part of the phosphotyrosine protein
phosophatase family of proteins. ACP1 has been shown to be associated with obesity
[31]. This gene has 5 isoforms and a total of 11 disjoint exonic regions. The poisson
model identifies isoform uc002qwh.2 as the major isoform with 39% of the total expres-
sion. However the generalized Poisson model estimates that this isoform is the least
expressed and isoform uc002qwe.3 is the major isoform. This is because the poisson
distribution uses the information of the last exon to assign a higher expression level to
uc002qwh.2. However there is a heavy 3’ bias in this gene. The expression level of the
last exon is estimated to be 2.4 from the Poisson distribution. The generalized Poisson
model estimates the for this exon as 0.8 and estimates the expression level of the exon
as 0.5 which is 20% of the initial value.This bias can be seen as coming from a 3’ bias
which is not corrected when using the Poisson distribution. However the GP model
shrinks the expression level of this exon to a more reasonable estimate.
Using these three distinct examples, we have illustrated how the generalized Poisson
model can help to overcome the bias problem in isoform expression estimation. The
116
Poisson model, unable to handle the position level bias, may mis-specify the major
isoforms. However a better validation data set is required to validate and compare our
methods and to develop robust methods for isoform expression estimation.
5.4 Discussion
In this chapter, we focused on the deconvolution of transcript isoform expression. We
applied the generalized Poisson model for position level read counts to the RNA-seq
dataset and used a likelihood based approach to estimate the expression levels of known
isoforms. We estimated the parameters for the individual exons and then used these
parameters to calculate the expression level for transcript isoforms. We performed
in depth simulation studies to assess the performance of the method compared to ones
using a Poisson model and a Non Uniform Read distribution (N-URD) model.
Simulation studies examined the effect of varying biases and the effect of varying
expression ratios among transcript isoforms. Our model consistently outperformed the
other two methods. We also introduced extra errors by modifying exon read counts in
the simulations. Therefore our model was robust to additional noise. Besides the basic
simulation model, we sampled bias curves from real datasets and used the bias curves to
simulate reads for exons from real annotations. Our generalized Poisson-based model
outperformed the Poisson-based model, the N-URD model and the Cufflinks model.
However, better simulation strategies need to be used to compare the methods. These
simulation strategies need to be developed independent of method development so that
there is no bias introduced in the simulation strategy favoring one of the methods.
We also applied our generalized Poisson-based model to a human 16 tissue RNA-seq
dataset. We showed that the values of almost all the exons were greater than 0, indicat-
ing that our model can better estimate isoform expression levels with the consideration
117
of biases. We showed three examples of genes whose isoform levels were estimated
using the generalized Poisson and the Poisson assumptions. We show that incorporating
biases can lead to different isoforms being identified as the major isoform compared to
the when the biases are not incorporated. However, an experiment needs to be designed
to validate the expression levels of isoforms biologically to compare different methods
for isoform expression estimation.
As seen in the simulation results, when the mean parameter for isoform expression
is larger, we obtain a better estimate of isoform expression. Therefore, in a RNA-seq
experiment, a high read coverage is expected to accurately estimate isoform expression.
However, a higher coverage also leads to a higher bias level as shown in the histogram
of values. Therefore removing the bias is very important for efficient estimation of
isoform levels. We showed three examples of genes whose exon bias levels were greater
than 0.5. The isoform expression estimates were very different between the generalized
Poisson and the Poisson model, further indicating that the biases can affect the isoform
expression estimation significantly.
This method can be further expanded to estimate differential expression of isoforms
when we have RNA-Seq experiments involving two or more samples. Using a similar
likelihood based approach as in [78], we can test the hypothesis of an isoform being dif-
ferentially expressed. However, the p-value computation for this process is very inten-
sive and simplifications will be need to achieve computational tractability.
118
Chapter 6
Conclusions and Future Work
6.1 Model Selection Algorithms for Genome Wide Asso-
ciation Studies
With the advent of high throughput sequencing technologies, sequencers are being
incorporated into a variety of protocols like RNA Seq, CHIP Seq, Bisulfite sequencing
etc. to generate huge amounts of datasets which provide a large amount of phenotypic
data. It is of great interest to find the genotypes for these phenotypes like gene expres-
sion, protein binding, histone methylation and acetylation sites. Therefore, methods in
Genome Wide Association studies have to become more sophisticated and efficient to
meet these demands.
LASSO , Elastic Net and SSVS are a set of model selection algorithms which have
been used extensively. Specific modifications to these algorithms taking into account
for the data structure of genome wide association studies would significantly improve
the performance of GWAS. Population structure as described before is big challenge
in genome wide association studies. Therefore incorporating population structure into
these methods would improve the accuracy of identifying SNPs which are truly associ-
ated. Even though the methods are capable of handling more number of markers than
samples, there is still a computational limitation to the number of markers that can be
incorporated into the model. Therefore, these methods need to be modified to increase
computational efficiency. This will be useful since epistatic effects of more than one
markers need to be identified to do effective genome wide association studies.
119
Another area to explore are the cis-cis and cis-trans compensatory mutations on a
genome wide scale. These methods can be modified to identify mutations which on
their own have a significant effect but compensate for each other when both are present.
When both mutations are present upstream of a gene, i.e. in the promoter region of a
gene, it would be cis-cis compensation while if the compensatory mutation is not located
in the 5’ regulatory region, then it is a cis-trans compensatory mutation. These studies
have been done for drosophila [80]It is of great interest to find such mutations as they
give us an insight into population genetics.
In our studies, we used the SIS to screen for markers before the model selection
methods were applied. We do not take into account any prior knowledge of these SNPs
or any biological knowledge to screen the SNPs in the first step of the analysis. All
the methods need to be modified to take into account prior biological knowledge. For
example, the conservation scores can be used as priors in the SSVS to make better
predictions of associated SNPs. Incorporating population structure and LD would also
be useful in these association studies.
6.2 RNA-Seq Analysis
RNA-Seq has enabled the generation of phenotypes for multiple genotypes in very small
amounts of time to be used for Genome Wide Association Studies. The Body reads
and junction reads provide detailed information about the expression levels of genomic
regions. These expression levels can then be used for association studies with SNP’s
in either specific regions in the genome or the whole genome. Calling SNPs has also
moved from arrays to high throughput sequencing assays which has reduced the cost
and enabled a more thorough annotation of SNPs for each individual sample.
120
A thorough exploration of the bias parameter in RNA Seq would be of great interest
to make the quantitative analysis more accurate and more useful. The parameter in
the generalized Poisson model explains the deviation of the read counts from a Poisson
process. We introduced as a parameter which explains the overall bias of RNA Seq.
A method to analyze these biases individually using the parameter would be of great
interest, for example how much of the bias explained by is due to PCR bias, primer
bias, 5’ bias etc. Collaborative experiments using wet lab and computational approaches
to address these issues will greatly advance the understanding of this technology. A
more in-depth analysis of the bias in RNA Seq experiments can be further used to sim-
ulate RNA Seq reads to develop methods to analyze RNA Seq data. Validation of new
methods for RNA Seq analysis like gene expression and transcript expression estima-
tion are hindered by the fact that accurate simulation studies cannot be carried out for
RNA Seq reads. There is only one software [70] available for this purpose and several
authors use an ad-hoc way of simulating the reads which might be either biased towards
their method or completely non-representative of the distribution of RNA Seq reads and
hence do not demonstrate the usefulness of the method accurately. Therefore an in-
depth study of the bias in RNA Seq reads would be extremely helpful in all aspects of
downstream analysis and extraction of scientific conclusions.
Normalization in RNA Seq reads, which is fundamental in the detection of differen-
tially expressed genes and exons across is still an open problem. Several normalization
techniques have been described in the literature. However, accurate normalization relies
heavily on the accurate characterization of bias. Therefore, further investigation into
normalization techniques would be very helpful. The research in this area would also be
exponentially speeded up if a good simulation strategy for RNA Seq reads is developed.
121
We used a simple normalization technique using the GP estimates in our model. Exten-
sion of other normalization techniques using the GP model would be very interesting to
analyze their effectiveness for real data.
A problem with any high throughput sequencing experiment is that short reads map-
ping to repeat regions of the genome are thrown out of the analysis. This might lead to a
biases while quantifying gene expression for different genes as well as lack of coverage.
since . Mortazavi et. al [58] resolve the problem by classifying each multi-read to the
gene according to the proportion of the number of uniquely mapped reads. However,
this does not solve the bias problem and therefore there needs to be an effective way
to classify multi reads. Using the GP model, the classification of multi reads could be
improved by incorporating the bias in the parameter. This problem could be tackled
using an E-M algorithm.
6.3 Conclusions
RNA Seq in conjunction with genome wide association studies have enabled an
increased understanding of the genotype phenotype map which is of great interest to
many scientists in the field of biology. They both offer tools for therapeutic applica-
tions to enable drug targeting and ultimately leading to personalized medicine. RNA
Seq can be used to sequence the transcriptome of an individual to find abnormalities in
transcription rates for a patient and hence be used in diagnosis. It can be used to identify
drug targets and hence identify which drugs would work best for a particular individual
with a disease like cancer. DNA Sequencing can be used to identify genotypes of a
patient and hence identify which medicines would work best for the specific individual.
The genotypes could also give a clue as to which individual is more susceptible to a
particular disease and hence a prevention plan could be tailored to their needs.
122
Bibliography
[1] D. Aird, M. G. Ross, W. S. Chen, M. Danielsson, T. Fennell, C. Russ, D. B. Jaffe,
C. Nusbaum, and A. Gnirke. Analyzing and minimizing PCR amplification bias
in Illumina sequencing libraries. Genome Biol., 12:R18, 2011.
[2] O. Akhova, M. Bainbridge, and V . Misra. The neuronal host cell factor-binding
protein Zhangfei inhibits herpes simplex virus replication. J. Virol., 79:14708–
14718, Dec 2005.
[3] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local
alignment search tool. J. Mol. Biol., 215:403–410, Oct 1990.
[4] S. Anders and W. Huber. Differential expression analysis for sequence count data.
Genome Biol., 11:R106, 2010.
[5] J. H. Bullard, E. Purdom, K. D. Hansen, and S. Dudoit. Evaluation of statistical
methods for normalization and differential expression in mRNA-Seq experiments.
BMC Bioinformatics, 11:94, 2010.
[6] M. G. Bulmer. The effect of selection on genetic variability: a simulation study.
Genet. Res., 28:101–117, Oct 1976.
[7] M. K. Burke, J. P. Dunham, P. Shahrestani, K. R. Thornton, M. R. Rose, and
A. D. Long. Genome-wide analysis of a long-term evolution experiment with
Drosophila. Nature, 467:587–590, Sep 2010.
[8] R. D. Canales, Y . Luo, J. C. Willey, B. Austermiller, C. C. Barbacioru, C. Boy-
sen, K. Hunkapiller, R. V . Jensen, C. R. Knight, K. Y . Lee, Y . Ma, B. Maqsodi,
A. Papallo, E. H. Peters, K. Poulter, P. L. Ruppel, R. R. Samaha, L. Shi, W. Yang,
L. Zhang, and F. M. Goodsaid. Evaluation of DNA microarray results with quan-
titative gene expression platforms. Nat. Biotechnol., 24:1115–1122, Sep 2006.
[9] E Candes and T Tao. The dantzig selector : Statistical estimation whenp is much
larger thann. Annals of Statistics, 35:2313–2351, 2007.
[10] C. Carriere, S. Mirocha, S. Deharvengt, J. R. Gunn, and M. Korc. Aberrant expres-
sions of AP-2 splice variants in pancreatic cancer. Pancreas, 40:695–700, Jul 2011.
123
[11] G Casella and EI George. Explaining the gibbs sampler. The American Statistician,
46:167–174, 1992.
[12] J. C. Castle, C. Zhang, J. K. Shah, A. V . Kulkarni, A. Kalsotra, T. A. Cooper,
and J. M. Johnson. Expression of 24,426 human alternative splicing events and
predicted cis regulation in 48 tissues and cell lines. Nat. Genet., 40:1416–1425,
Dec 2008.
[13] Y . Chen, T. Souaiaia, and T. Chen. PerM: efficient mapping of short sequencing
reads with periodic full sensitive spaced seeds. Bioinformatics, 25:2514–2521, Oct
2009.
[14] N. Cloonan, A. R. Forrest, G. Kolle, B. B. Gardiner, G. J. Faulkner, M. K. Brown,
D. F. Taylor, A. L. Steptoe, S. Wani, G. Bethel, A. J. Robertson, A. C. Perkins, S. J.
Bruce, C. C. Lee, S. S. Ranade, H. E. Peckham, J. M. Manning, K. J. McKernan,
and S. M. Grimmond. Stem cell transcriptome profiling via massive-scale mRNA
sequencing. Nat. Methods, 5:613–619, Jul 2008.
[15] P.C. Consul. A simple urn model dependent upon predetermined strategy. Sankya.
Indian J. Stat. Ser. B, 36:391–399, 1974.
[16] P.C. Consul. Generalized Poisson Distributions: Properties and Applications.
Marcel Dekker Inc New York, 1989.
[17] P.C. Consul and G.C. Jain. Generalization of Poisson distribution. Technometrics,
15:791–793, 1973.
[18] P.C. Consul and G.C. Jain. Some interesting properties of generalized Poisson
distribution. Biometr. Z, 15:495–500, 1973.
[19] H. J. Cordell. Detecting gene-gene interactions that underlie human diseases. Nat.
Rev. Genet., 10:392–404, Jun 2009.
[20] M. Costanzo, A. Baryshnikova, J. Bellay, Y . Kim, E. D. Spear, C. S. Sevier,
H. Ding, J. L. Koh, K. Toufighi, S. Mostafavi, J. Prinz, R. P. St Onge, B. Van-
derSluis, T. Makhnevych, F. J. Vizeacoumar, S. Alizadeh, S. Bahr, R. L. Brost,
Y . Chen, M. Cokol, R. Deshpande, Z. Li, Z. Y . Lin, W. Liang, M. Marback, J. Paw,
B. J. San Luis, E. Shuteriqi, A. H. Tong, N. van Dyk, I. M. Wallace, J. A. Whit-
ney, M. T. Weirauch, G. Zhong, H. Zhu, W. A. Houry, M. Brudno, S. Ragibizadeh,
B. Papp, C. Pal, F. P. Roth, G. Giaever, C. Nislow, O. G. Troyanskaya, H. Bussey,
G. D. Bader, A. C. Gingras, Q. D. Morris, P. M. Kim, C. A. Kaiser, C. L. Myers,
B. J. Andrews, and C. Boone. The genetic landscape of a cell. Science, 327:425–
431, Jan 2010.
124
[21] H Davis and T Davis. Daunorubicin and adriamycin in cancer treatment: an anal-
ysis of their roles and limitations. Cancer Treat Rep, 63:809–815, 1979.
[22] N.G. de Bruijn. A Combinatorial Problem. Koninklijke Nederlandse Akademie v.
Wetenschappen, 49:758–764, 1946.
[23] G. Dennis, B. T. Sherman, D. A. Hosack, J. Yang, W. Gao, H. C. Lane, and R. A.
Lempicki. DAVID: Database for Annotation, Visualization, and Integrated Dis-
covery. Genome Biol., 4:P3, 2003.
[24] J. C. Dohm, C. Lottaz, T. Borodina, and H. Himmelbauer. Substantial biases in
ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids
Res., 36:e105, Sep 2008.
[25] S Duan, WK Bleibel, RS Huang, SJ Shukla, X Wu, JA Badner, and ME Dolan.
Mapping genes that contribute to daunorubicin-induced cytotoxicity. Cancer Res,
67:5425–33, 2007.
[26] D. F. Easton, A. M. Deffenbaugh, D. Pruss, C. Frye, R. J. Wenstrup, K. Allen-
Brady, S. V . Tavtigian, A. N. Monteiro, E. S. Iversen, F. J. Couch, and D. E.
Goldgar. A systematic genetic assessment of 1,433 sequence variants of unknown
clinical significance in the BRCA1 and BRCA2 breast cancer-predisposition
genes. Am. J. Hum. Genet., 81:873–883, Nov 2007.
[27] B Efron, T Hastie, I Johnstone, and R Tibshirani. Least angle regression. Annals
of Statistics, 32:407–499, 2004.
[28] J Fan and J Lv. Sure independence screening for ultrahigh dimensional feature
space. Journal of the Royal Statistical Society Series B, 70:849–911, 2008.
[29] EI George and RE McCulloch. Variable selection via gibbs sampling. Journal of
the American Statistical Association, 88:881–889, 1993.
[30] G. Gibson. Rare and common variants: twenty arguments. Nat. Rev. Genet.,
13:135–145, Feb 2011.
[31] F. Gloria-Bottini, A. Magrini, L. Di Renzo, A. De Lorenzo, A. Bergamaschi, and
E. Bottini. Body mass index and acid phosphatase locus 1 in diabetic disorders.
Acta Diabetol, 47:139–143, Dec 2010.
[32] K. D. Hansen, S. E. Brenner, and S. Dudoit. Biases in Illumina transcriptome
sequencing caused by random hexamer priming. Nucleic Acids Res., 38:e131, Jul
2010.
125
[33] T. J. Hardcastle and K. A. Kelly. baySeq: empirical Bayesian methods for identi-
fying differential expression in sequence count data. BMC Bioinformatics, 11:422,
2010.
[34] RS Huang, S Duan, EO Kistner, WK Bleibel, SM Delaney, DL Fackenthal, S Das,
and ME Dolan. Genetic variants contributing to daunorubicin-induced cytotoxic-
ity. Cancer Res, 68:3161–3168, 2008.
[35] TW Huizinga and et al. Refining the complex rheumatoid arthritis phenotype based
on specificity of the hla-drb1 shared epitope for antibodies to citrullinated proteins.
Arthritis Rheum, 52:3433–3438, 2005.
[36] K.G. Janardan and D.J. Schaeffer. Models for the analysis of chromosomal aber-
rations in human leukocytes. Biometrical J., 19:599–612, 1977.
[37] H. Jiang and W. H. Wong. Statistical inferences for isoform expression in RNA-
Seq. Bioinformatics, 25:1026–1032, Apr 2009.
[38] H. M. Kang, J. H. Sul, S. K. Service, N. A. Zaitlen, S. Y . Kong, N. B. Freimer,
C. Sabatti, and E. Eskin. Variance component model to account for sample struc-
ture in genome-wide association studies. Nat. Genet., 42:348–354, Apr 2010.
[39] H. M. Kang, N. A. Zaitlen, C. M. Wade, A. Kirby, D. Heckerman, M. J. Daly, and
E. Eskin. Efficient control of population structure in model organism association
mapping. Genetics, 178:1709–1723, Mar 2008.
[40] Y . Katz, E. T. Wang, E. M. Airoldi, and C. B. Burge. Analysis and design of
RNA sequencing experiments for identifying isoform regulation. Nat. Methods,
7:1009–1015, Dec 2010.
[41] N. M. Kerr, C. S. Johnson, C. R. Green, and H. V . Danesh-Meyer. Gap junc-
tion protein connexin43 (GJA1) in the human glaucomatous optic nerve head and
retina. J Clin Neurosci, 18:102–108, Jan 2011.
[42] L Kruglyak. The road to genome-wide association studies. Nature Rev. Genet.,
9:314–318, 2008.
[43] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast and memory-
efficient alignment of short DNA sequences to the human genome. Genome Biol.,
10:R25, 2009.
[44] B. Lehner, C. Crombie, J. Tischler, A. Fortunato, and A. G. Fraser. Systematic
mapping of genetic interactions in Caenorhabditis elegans identifies common mod-
ifiers of diverse signaling pathways. Nat. Genet., 38:896–903, Aug 2006.
126
[45] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics, 25:1754–1760, Jul 2009.
[46] H. Li and R. Durbin. Fast and accurate long-read alignment with Burrows-Wheeler
transform. Bioinformatics, 26:589–595, Mar 2010.
[47] J. Li, H. Jiang, and W. H. Wong. Modeling non-uniformity in short-read rates in
RNA-Seq data. Genome Biol., 11:R50, 2010.
[48] R. Li, W. Fan, G. Tian, H. Zhu, L. He, J. Cai, Q. Huang, Q. Cai, B. Li, Y . Bai,
Z. Zhang, Y . Zhang, W. Wang, J. Li, F. Wei, H. Li, M. Jian, J. Li, Z. Zhang,
R. Nielsen, D. Li, W. Gu, Z. Yang, Z. Xuan, O. A. Ryder, F. C. Leung, Y . Zhou,
J. Cao, X. Sun, Y . Fu, X. Fang, X. Guo, B. Wang, R. Hou, F. Shen, B. Mu, P. Ni,
R. Lin, W. Qian, G. Wang, C. Yu, W. Nie, J. Wang, Z. Wu, H. Liang, J. Min, Q. Wu,
S. Cheng, J. Ruan, M. Wang, Z. Shi, M. Wen, B. Liu, X. Ren, H. Zheng, D. Dong,
K. Cook, G. Shan, H. Zhang, C. Kosiol, X. Xie, Z. Lu, H. Zheng, Y . Li, C. C.
Steiner, T. T. Lam, S. Lin, Q. Zhang, G. Li, J. Tian, T. Gong, H. Liu, D. Zhang,
L. Fang, C. Ye, J. Zhang, W. Hu, A. Xu, Y . Ren, G. Zhang, M. W. Bruford, Q. Li,
L. Ma, Y . Guo, N. An, Y . Hu, Y . Zheng, Y . Shi, Z. Li, Q. Liu, Y . Chen, J. Zhao,
N. Qu, S. Zhao, F. Tian, X. Wang, H. Wang, L. Xu, X. Liu, T. Vinar, Y . Wang,
T. W. Lam, S. M. Yiu, S. Liu, H. Zhang, D. Li, Y . Huang, X. Wang, G. Yang,
Z. Jiang, J. Wang, N. Qin, L. Li, J. Li, L. Bolund, K. Kristiansen, G. K. Wong,
M. Olson, X. Zhang, S. Li, H. Yang, J. Wang, and J. Wang. The sequence and de
novo assembly of the giant panda genome. Nature, 463:311–317, Jan 2010.
[49] R. Li, Y . Li, K. Kristiansen, and J. Wang. SOAP: short oligonucleotide alignment
program. Bioinformatics, 24:713–714, Mar 2008.
[50] S Lipschultz. Exposure to anthracyclines during childhood causes cardiac injury.
Semin Oncol, 33:S8–14, 2006.
[51] R. Lu and V . Misra. Zhangfei: a second cellular protein interacts with herpes sim-
plex virus accessory factor HCF in a manner similar to Luman and VP16. Nucleic
Acids Res., 28:2446–2454, Jun 2000.
[52] J. C. Lucchesi. Synthetic lethality and semi-lethality among functionally related
mutants of Drosophila melanfgaster. Genetics, 59:37–44, May 1968.
[53] S Ma and J Huang. Combining multiple markers for classification using roc. Bio-
metrics, 63:751–757, 2007.
[54] J. C. Marioni, C. E. Mason, S. M. Mane, M. Stephens, and Y . Gilad. RNA-seq:
an assessment of technical reproducibility and comparison with gene expression
arrays. Genome Res., 18:1509–1517, Sep 2008.
127
[55] MI McCarthy and et al. Genome-wide association studies for complex traits: con-
sensus, uncertainty and challenges. Nature Rev. Genet., 9:356–369, 2008.
[56] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky,
K. Garimella, D. Altshuler, S. Gabriel, M. Daly, and M. A. DePristo. The Genome
Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA
sequencing data. Genome Res., 20:1297–1303, Sep 2010.
[57] T. J. Mitchell and J. J. Beauchamp. Bayesian Variable Selection in Linear Regres-
sion. Journal of the American Statistical Association, 83(404):1023–1032, 1988.
[58] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold. Mapping and
quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods, 5:621–628,
Jul 2008.
[59] U. Nagalakshmi, Z. Wang, K. Waern, C. Shou, D. Raha, M. Gerstein, and M. Sny-
der. The transcriptional landscape of the yeast genome defined by RNA sequenc-
ing. Science, 320:1344–1349, Jun 2008.
[60] J. Novembre and M. Stephens. Interpreting principal component analyses of spatial
population genetic variation. Nat. Genet., 40:646–649, May 2008.
[61] A. Oshlack, M. D. Robinson, and M. D. Young. From RNA-seq reads to differen-
tial expression results. Genome Biol., 11:220, 2010.
[62] Q. Pan, O. Shai, L. J. Lee, B. J. Frey, and B. J. Blencowe. Deep surveying of
alternative splicing complexity in the human transcriptome by high-throughput
sequencing. Nat. Genet., 40:1413–1415, Dec 2008.
[63] S. Pepke, B. Wold, and A. Mortazavi. Computation for ChIP-seq and RNA-seq
studies. Nat. Methods, 6:22–32, Nov 2009.
[64] H. Q. Qu and C. Polychronakos. Reassessment of the type I diabetes association
of the OAS1 locus. Genes Immun., 10 Suppl 1:69–73, Dec 2009.
[65] D. Ramskold, E. T. Wang, C. B. Burge, and R. Sandberg. An abundance of ubiq-
uitously expressed genes revealed by tissue transcriptome sequence data. PLoS
Comput. Biol., 5:e1000598, Dec 2009.
[66] A. Roberts, C. Trapnell, J. Donaghey, J. L. Rinn, and L. Pachter. Improving RNA-
Seq expression estimates by correcting for fragment bias. Genome Biol., 12:R22,
2011.
[67] M. D. Robinson, D. J. McCarthy, and G. K. Smyth. edgeR: a Bioconductor pack-
age for differential expression analysis of digital gene expression data. Bioinfor-
matics, 26:139–140, Jan 2010.
128
[68] M. D. Robinson and A. Oshlack. A scaling normalization method for differential
expression analysis of RNA-seq data. Genome Biol., 11:R25, 2010.
[69] M. D. Robinson and G. K. Smyth. Moderated statistical tests for assessing differ-
ences in tag abundance. Bioinformatics, 23:2881–2887, Nov 2007.
[70] M. Sammeth, V . Lacroix, P. Ribeca, and R. Guigo. The flux simulator.
http://flux.sammeth.net, 2010.
[71] D Schurmann, A Dormann, T Grunewald, and B Ruf. Successful treatment of aids-
related pulmonary kaposi’s sarcoma with liposomal daunorubicin. Eur Respir J,
7:824–825, 1994.
[72] K Seiter. Toxicity of the topoisomerase ii inhibitors. Expert Opin Drug Saf, 4:219–
34, 2005.
[73] L. Shi, L. H. Reid, W. D. Jones, R. Shippy, J. A. Warrington, S. C. Baker, P. J.
Collins, F. de Longueville, E. S. Kawasaki, and K. Y . et al. Lee. The MicroArray
Quality Control (MAQC) project shows inter- and intraplatform reproducibility of
gene expression measurements. Nat. Biotechnol., 24:1151–1161, Sep 2006.
[74] J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. Jones, and I. Birol.
ABySS: a parallel assembler for short read sequence data. Genome Res., 19:1117–
1123, Jun 2009.
[75] A. D. Smith, Z. Xuan, and M. Q. Zhang. Using quality scores and longer reads
improves accuracy of Solexa read mapping. BMC Bioinformatics, 9:128, 2008.
[76] T. Souaiaia, Z. Frazier, and T. Chen. ComB: SNP calling and mapping analysis for
color and nucleotide space platforms. J. Comput. Biol., 18:795–807, Jun 2011.
[77] S. Srivastava and L. Chen. Comparison between the stochastic search variable
selection and the least absolute shrinkage and selection operator for genome-wide
association studies of rheumatoid arthritis. BMC Proc, 3 Suppl 7:S21, 2009.
[78] S. Srivastava and L. Chen. A two-parameter generalized Poisson model to improve
the analysis of RNA-seq data. Nucleic Acids Res., 38:e170, Sep 2010.
[79] S. Srivastava and L. Chen. Model selection methods for Genome Wide Association
Studies. Communications in Informations and Systems, 10 No. 1:39–52, 2010.
[80] K. R. Takahasi, T. Matsuo, and T. Takano-Shimizu-Kouno. Two types of cis-trans
compensation in the evolution of transcriptional regulation. Proc. Natl. Acad. Sci.
U.S.A., 108(37):15276–15281, Sep 2011.
129
[81] S. Tarazona, F. Garcia-Alcalde, J. Dopazo, A. Ferrer, and A. Conesa. Differential
expression in RNA-seq: a matter of depth. Genome Res., 21:2213–2223, Dec
2011.
[82] R Tibshirani. Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society, 58:267–288, 1996.
[83] C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren,
S. L. Salzberg, B. J. Wold, and L. Pachter. Transcript assembly and quantification
by RNA-Seq reveals unannotated transcripts and isoform switching during cell
differentiation. Nat. Biotechnol., 28:511–515, May 2010.
[84] R. Vernic. A multivariate generalization of the generalized Poisson distribution.
Astin Bullettin, 30(1):57–67, 2000.
[85] E. T. Wang, R. Sandberg, S. Luo, I. Khrebtukova, L. Zhang, C. Mayr, S. F.
Kingsmore, G. P. Schroth, and C. B. Burge. Alternative isoform regulation in
human tissue transcriptomes. Nature, 456:470–476, Nov 2008.
[86] L. Wang, Z. Feng, X. Wang, X. Wang, and X. Zhang. DEGseq: an R package
for identifying differentially expressed genes from RNA-seq data. Bioinformatics,
26:136–138, Jan 2010.
[87] Z. Wang, M. Gerstein, and M. Snyder. RNA-Seq: a revolutionary tool for tran-
scriptomics. Nat. Rev. Genet., 10:57–63, Jan 2009.
[88] J. Wu, M. Akerman, S. Sun, W. R. McCombie, A. R. Krainer, and M. Q. Zhang.
SpliceTrap: a method to quantify alternative splicing under single cellular condi-
tions. Bioinformatics, 27:3010–3016, Nov 2011.
[89] Z. Wu, X. Wang, and X. Zhang. Using non-uniform read distribution models to
improve isoform expression inference in RNA-Seq. Bioinformatics, 27:502–508,
Feb 2011.
[90] Q. Xu, B. Modrek, and C. Lee. Genome-wide detection of tissue-specific alterna-
tive splicing in the human transcriptome. Nucleic Acids Res., 30:3754–3766, Sep
2002.
[91] J. Yang, B. Benyamin, B. P. McEvoy, S. Gordon, A. K. Henders, D. R. Nyholt,
P. A. Madden, A. C. Heath, N. G. Martin, G. W. Montgomery, M. E. Goddard, and
P. M. Visscher. Common SNPs explain a large proportion of the heritability for
human height. Nat. Genet., 42:565–569, Jul 2010.
130
[92] J. Yang, T. A. Manolio, L. R. Pasquale, E. Boerwinkle, N. Caporaso, J. M. Cun-
ningham, M. de Andrade, B. Feenstra, E. Feingold, M. G. Hayes, W. G. Hill, M. T.
Landi, A. Alonso, G. Lettre, P. Lin, H. Ling, W. Lowe, R. A. Mathias, M. Mel-
bye, E. Pugh, M. C. Cornelis, B. S. Weir, M. E. Goddard, and P. M. Visscher.
Genome partitioning of genetic variation for complex traits using common SNPs.
Nat. Genet., 43:519–525, Jun 2011.
[93] R Young, R Ozols, and C Myers. The anthracycline antineoplastic drugs. N Engl
J Med, 305:139–153, 1981.
[94] Z. Yu, B. Zhang, B. Cui, Y . Wang, P. Han, and X. Wang. Identification of spliced
variants of the proto-oncogene hdm2 in colorectal cancer. Cancer, Jul 2011.
[95] D. R. Zerbino and E. Birney. Velvet: algorithms for de novo short read assembly
using de Bruijn graphs. Genome Res., 18:821–829, May 2008.
[96] Y . Zhang and J. S. Liu. Bayesian inference of epistatic interactions in case-control
studies. Nat. Genet., 39:1167–1173, Sep 2007.
[97] Z. Zhang, E. Ersoz, C. Q. Lai, R. J. Todhunter, H. K. Tiwari, M. A. Gore, P. J. Brad-
bury, J. Yu, D. K. Arnett, J. M. Ordovas, and E. S. Buckler. Mixed linear model
approach adapted for genome-wide association studies. Nat. Genet., 42:355–360,
Apr 2010.
[98] P Zhao and B Yu. On model selection consistency of lasso. Journal of Machine
Learning Research, 7:2541–2563, 2006.
[99] H Zou and T Hastie. Regularization and variable selection via the elastic net. JRST,
67:301–320, 2005.
131
Abstract (if available)
Abstract
Genome-wide association studies are important tools to reconstruct the genotype phenotype map to understand the underlying genetic architecture of complex traits. This enables us to better understand the genetic architecture of these phenotypes. With the advances in genotyping and high throughput sequencing technologies, millions of markers can be identified for individual populations in very short durations of time. Due to the multiple loci control nature of complex phenotypes, there is great interest to test markers simultaneously instead of one by one. In chapter 2, we compare three model selection methods for genome wide association studies using simulations: the Stochastic Search Variable Selection (SSVS), the Least Absolute Shrinkage and Selection Operator (LASSO) and the Elastic Net. We apply the three methods to identify genetic variants that are associated with daunorubicin-induced cytotoxicity. We also compare the LASSO and the SSVS to a dataset of two quantitative phenotypes related to Rheumatoid Arthritis. ❧ In the second part of the dissertation, a two parameter generalized Poisson(GP) model to analyze RNA Seq is proposed. Deep sequencing of RNAs (RNA-seq) has been a useful tool to characterize and quantify transcriptomes. However, there are significant challenges in the analysis of RNA-seq data, such as how to separate signals from sequencing bias and how to perform reasonable normalization. In chapter 4 ,we used the generalized Poisson model to separate out the "true" expression level from the bias. We show that the GP model fits the data much better than the traditional Poisson model. Based on the GP model, we can improve the estimates of gene or exon expression, perform a more reasonable normalization across different samples, and improve the identification of differentially expressed genes and the identification of differentially spliced exons. We also use a likelihood based approach to estimate the expression levels of transcripts using the GP model discussed in chapter 5. ❧ RNA Sequencing and genome wide associations studies have led to a rapid growth in understanding of complex genetic phenotypes and diseases. These two methods are crucial tools in the genomic age in the fields of molecular biology, genomics, population and quantitative genetics etc. Using these tools effectively, with the help of statistical and algorithmic methods, would lead to a rapid growth of knowledge in these fields and in the overall field of biology.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Genome-wide studies reveal the isoform selection of genes
PDF
Isoform quantification and splicing regulation analysis in RNA-seq studies
PDF
Probing the genetic basis of gene expression variation through Bayesian analysis of allelic imbalance and transcriptome studies of oil palm interspecies hybrids
PDF
Probabilistic modeling and data integration to examine RNA-protein interactions
PDF
Application of machine learning methods in genomic data analysis
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Bayesian hierarchical models in genetic association studies
PDF
Plant genome wide association studies and improvement of the linear mixed model by applying the weighted relationship matrix
PDF
Bayesian analysis of transcriptomic and genomic data
PDF
Benchmarking of computational tools for ancestry prediction using RNA-seq data
PDF
Genome-wide association study of factors influencing gene expression variation and pleiotropy
PDF
Mapping genetic variants for nonsense-mediated mRNA decay regulation across human tissues
PDF
Best practice development for RNA-Seq analysis of complex disorders, with applications in schizophrenia
PDF
Genomic mapping: a statistical and algorithmic analysis of the optical mapping system
PDF
Developing statistical and algorithmic methods for shotgun metagenomics and time series analysis
PDF
Computational methods for translation regulation analysis from Ribo-seq data
PDF
Sharpening the edge of tools for microbial diversity analysis
PDF
Statistical modeling of sequence and gene expression data to infer gene regulatory networks
PDF
Genome-wide studies reveal the function and evolution of DNA shape
PDF
Genome sequencing and transcriptome analysis of the phenotypically plastic spadefoot toads
Asset Metadata
Creator
Srivastava, Sudeep
(author)
Core Title
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
06/26/2012
Defense Date
04/05/2012
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
alternative splicing,differential expression,expression,genome wide association studies,model selection,OAI-PMH Harvest,RNA seq
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Chen, Liang (
committee chair
), Nuzhdin, Sergey V. (
committee member
), Radchenko, Peter (
committee member
), Sun, Fengzhu Z. (
committee member
)
Creator Email
sudeep.mentor@gmail.com,sudeepsr@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-49465
Unique identifier
UC11289365
Identifier
usctheses-c3-49465 (legacy record id)
Legacy Identifier
etd-Srivastava-905.pdf
Dmrecord
49465
Document Type
Dissertation
Rights
Srivastava, Sudeep
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
alternative splicing
differential expression
expression
genome wide association studies
model selection
RNA seq