University of Southern California Dissertations and Theses

Best practice development for RNA-Seq analysis of complex disorders, with applications in schizophrenia

(USC Thesis Other)

Best practice development for RNA-Seq analysis of complex disorders, with applications in schizophrenia

PDF

Download a page range

Download transcript

Copy asset link

Request this asset

Transcript (if available)

Content

BEST PRACTICE DEVELOPMENT FOR RNA-SEQ ANALYSIS OF
COMPLEX DISORDERS, WITH APPLICATIONS IN SCHIZOPHRENIA

by

Christopher John Armoskus

__________________________________________________________________

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOSTATISTICS)

May 2020
ii

Table of Contents

Abstract iv
Chapter 1 Background 1
1.1 RNA-Seq Differential Expression 1
1.2 Schizophrenia 1
1.3 Tissue and Cellular Models 2
1.4 RNA Sequencing Applications 2
1.5 Description of, and example for, Bayes theorem 3
1.6 Example of BMA for inclusion of a covariate 6
1.7 Summary 10
1.8 Figures 11
Chapter 2 Analysis of RNA-Seq data in schizophrenia 13
2.1 Abstract 13
2.2 Introduction 14
2.3 Methods 16
2.4 Results 25
2.5 Discussion 38
2.6 Figures 41
2.7 Tables 55
Chapter 3 Comparison of tools for RNA-Seq differential gene
expression 62
3.1 Abstract 62
3.2 Introduction 63
3.3 Methods 74
3.4 Results 85
3.5 Discussion 90
3.6 Figures 92
Chapter 4 Model averaging to improve performance in analysis of
gene expression 103
4.1 Abstract 103
4.2 Introduction 103
4.3 Methods 107
4.4 Results 112
4.5 Discussion 117
4.6 Figures 120
Chapter 5 Conclusions and Future Directions 135
5.1 Performance of existing programs 135
iii

5.2 Advances with model averaging algorithm 136
References 137

iv

Abstract
Transcriptome analysis by RNA sequencing is a powerful technique that is becoming
cheaper and easier, allowing for larger experiments than ever to be run. An increased sample size
is particularly important when examining small effects, such as those seen on average when
disease states come from heterogeneous causes, such as in complex psychiatric disorders. In this
thesis I will present a transcriptome+genome analysis into the effects of schizophrenia in
cultured neural progenitor cells derived from olfactory neuroepithelium (CNON), including
analysis of differentially expressed genes and expression quantitative trait loci (eQTLs).
Simulations based on actual RNA-Seq data from here and from the GTEx Project were used to
test existing analysis methods under known-truth conditions and examine statistical properties,
especially the false positive rate, false discovery rate, true positive rate, and area under the
receiver operating characteristic (ROC) curve. Results show that newer normalization techniques
designed with RNA-Seq in mind (e.g., TMM and RLE) perform better than older methods, as
well as that there are some issues when data does not match the distribution assumed in the
algorithm that hinder performance. To address this latter issue, an algorithm based on principles
from Bayes model averaging is used to average over different models weighted by how likely it
is that the data comes from one model or the other. This new model averaging algorithm shows
improved performance over existing solutions.
Chapter 1 - Background
1.1 - RNA-Seq Differential Expression
Transcriptome analysis via short read massively parallel sequencing, also known as Next
Generation Sequencing (NRS), (RNA-Seq) presents great opportunities for understanding the
biology of many diseases (Costa et al., 2013). However, the vast majority of evaluations of
differential expression tools are done examining situations with large effect sizes and consistent
effects in small sample sizes (Łabaj and Kreil, 2016; Law et al., 2014; Li et al., 2015; Lin et al.,
2016; Quinn et al., 2018a; Rapaport et al., 2013); the results of such evaluations may not be
applicable to examining differential expression in other settings. Complex diseases which feature
small, heterogeneous differences in gene expression that require large sample sizes to identify,
such as many psychiatric disorders, present a substantially different situation that must be
examined to identify optimal methodologies for this application.
1.2 - Schizophrenia
Schizophrenia (SCZ) is a highly heritable disease (~80%) with a global prevalence of
0.5-1% and a large cost (Goldner et al., 2002; Sullivan et al., 2003; Wu et al., 2005). Current
genome wide association studies (GWAS) have identified over 250 associated loci, however
identified variants together account for less than 10% of the heritability in terms of the liability
scale (Ripke et al., 2014). In addition, all known SNPs are projected to account for only 27% of
variability on this scale (Loh et al., 2015), suggesting mechanisms of action that we aren't able to
account for yet, such as epistasis (MacKay 2014). Most of these variants lie outside of the coding
regions of the genome, suggesting that they act by altering gene expression and splicing to
change transcript levels (Gusev et al. 2014). Therefore, expression profiling in appropriate
1
biological models is going to be essential to understanding the biology of genetic variation in
risk for developing SCZ, and understanding how genetic factors influence disease can lead to a
better understanding and, eventually, better treatments for disease.
1.3 - Tissue and Cellular Models
In order to perform RNA-Seq you need cells/tissue to extract RNA from; ideally the
source of the RNA will be relevant to the phenotype of the disorder being studied, however this
is problematic for disorders present in tissue that cannot be easily sampled, such as the brain. In
order to deal with this problem there are two more common approaches: the use of post-mortem
tissue or the use of cell cultures derived from a non-harmful biopsy.
1.4 - RNA Sequencing Applications
There are many applications for sequencing RNA, the most prominent of which is
looking for differential expression between samples or groups (Stark et al., 2019). Differential
expression can be analyzed at the gene, transcript, splice junction, or exon level(Conesa et al.,
2016; Oshlack et al., 2010; Seyednasrollah et al., 2013). Analysis of different types of features
can require different methods of analysis, as the nature of the data is different (Bray et al., 2016;
Pimentel et al., 2016).
One major categorization that affects subsequent analysis is whether there is a need to
examine non-overlapping features, such as genes or splice junctions, where a given read will
only fall into one bin allowing for a simple counting process, or overlapping features, such as
transcripts, where a given read may have come from multiple different transcripts and we must
estimate transcript estimation based on the data as a whole. Reads may, in some cases, have
potentially come from two or more genes, however such reads are generally removed at the
2
counting stage as while they may map to a unique position in the genome, they do not map
uniquely in the transcriptome; the loss of reads in cases like this for genes is far smaller than the
loss of reads we would have to deal with if reads mapping to multiple transcripts needed to be
thrown out.
The arrival of less expensive transcriptome analysis via short read sequencing (RNA-
Seq) presents unprecedented opportunities for understanding the biology of disease, especially
complex psychological disorders such as schizophrenia (SCZ). Complex psychological disorders
generally feature much more subtle differences in gene expression than are seen between, for
example, cancer and matching healthy tissue or comparisons between different tissues. This
leads to not only a need for larger sample sizes but also a need to optimize differential expression
analysis for different targets.
1.5 - Description of, and example for, Bayes theorem
Along with the frequentist approach, there is also the Bayesian approach that uses Bayes
theorem to combine information from a prior probability with information derived from the data:

𝑃𝑃 ( 𝜃𝜃 | 𝐷𝐷 ) =
𝑃𝑃 ( 𝐷𝐷 | 𝜃𝜃 ) ∗ 𝑃𝑃 ( 𝜃𝜃 )
𝑃𝑃 ( 𝐷𝐷 )

In this basic version of the theorem, θ is some parameter that we are interested in estimate, such
as a log2foldChange or dispersion in a differential expression comparison. P(θ) is a prior
probability that can be specified ahead of time or derived from the data (e.g., empirical Bayes
method).
3

In combination with the frequentist hypothesis testing approach, we could also let θ be the
probability that a given null hypothesis is true when considering a set of null hypotheses to be
tested: this application gives us the Bayesian False Discovery Rate (FDR). Specifically we might
have a set of n genes that have been tested for differential expression between two groups. We
have some prior expectation, P(θ), as to how likely it is that a given gene has a true change in
expression between groups. We also have p-values derived by frequentist methods, like a two
sample t-test, for how likely the data for a given gene is given the null hypothesis is true, P(D|θ).

To calculate the third term of the right hand side of Bayes theorem, we expand the term P(D) to
account for two different mutually exclusive situation: θ and not θ, or H_0 and H_A.

𝑃𝑃 ( 𝐷𝐷 ) = 𝑃𝑃 ( 𝐷𝐷 ∩ θ ) + P(D ∩ ¬ θ) = 𝑃𝑃 ( 𝐷𝐷 | θ) ∗ P( θ) + P(D|¬ θ) ∗ P(¬ θ)
P(¬ θ) = 1 − P( θ)
𝑃𝑃 ( 𝐷𝐷 |¬ θ) = 1 − β

Where β is the type II error rate, and 𝑃𝑃 ( 𝐷𝐷 |¬ θ), the probability of data showing a significant test
statistic the alternative hypothesis is true and there really is an effect. For a limiting case, we can
consider a situation in which we have 100% power and so 𝑃𝑃 ( 𝐷𝐷 |¬ θ) = 1, then we can rewrite the
overall statement as:
4

𝐹𝐹 𝐷𝐷 𝐹𝐹 = 𝑃𝑃 ( 𝐻𝐻 0
| 𝐷𝐷 ) ≥
𝛼𝛼 ∗ 𝑃𝑃 ( 𝐻𝐻 0
)
𝛼𝛼 ∗ 𝑃𝑃 ( 𝐻𝐻 0
) + �1 − 𝑃𝑃 ( 𝐻𝐻 0
) �

When we combine this equation with another frequentist method, the Benjamini-Hochberg FDR,
we can reverse Bayes theorem to look at what our analysis is saying about the proportion of
genes that are truly altered:

𝑃𝑃 ( 𝐻𝐻 0
) ≤
1
𝛼𝛼 𝐹𝐹 𝐷𝐷 𝐹𝐹 + 1 − 𝛼𝛼

Adding numbers to an example, if we test n=23,920 genes for differential expression at a BH
FDR of 5% and find s=36 to be significantly DEX then:

𝛼𝛼 = 𝑠𝑠 ∗
𝐹𝐹 𝐷𝐷 𝐹𝐹 𝑛𝑛 = 36 ∗
0.05
23920
= 7.53 𝑒𝑒 − 05
𝑃𝑃 ( 𝐻𝐻 0
) ≤
1
𝑠𝑠 𝑛𝑛 + 1 −
𝑠𝑠 𝑛𝑛 ∗ 𝐹𝐹 𝐷𝐷 𝐹𝐹 =
1
1.0014
= 0.9986
𝑃𝑃 ( 𝐻𝐻 𝐴𝐴 ) = 1 − 𝑃𝑃 ( 𝐻𝐻 0
) ≥ 0.14%

5
Classifying the 23,920 genes by result, we find expected numbers of true and false positives and
negatives:

𝑇𝑇 𝑃𝑃 = (1 − 𝐹𝐹 𝐷𝐷 𝐹𝐹 ) ∗ 𝑠𝑠 = 0.95 ∗ 36 = 34.2 = (1 − 𝛽𝛽 ) ∗ 𝑃𝑃 ( 𝐻𝐻 𝐴𝐴 ) ∗ 𝑛𝑛
𝐹𝐹 𝑃𝑃 = 𝐹𝐹 𝐷𝐷 𝐹𝐹 ∗ 𝑠𝑠 = 𝛼𝛼 ∗ 𝑛𝑛 = 1.8

Thus a given BH FDR provides a lower bound on the number/proportion of truly DEX genes
depending on the average power. If two different studies have the same BH FDR, they might still
have very different Bayes FDRs if we assume similar power and the proportion of truly DEX
genes is estimated from the other study. Given restrictions on likely differences in power
between studies, using a prior estimated based on one set of data to approximate the Bayes FDR
in another set of data can give very different results from the BH FDR.
1.6 - Example of BMA for inclusion of a covariate to better estimate another
term
Here we use covariate inclusion in normal linear models as an example of the
effectiveness of Bayes Model Averaging in adjusting to different models based on observations
from the data.
1.6.1 - Normal linear models
Consider two normal linear models attempting to estimate the effect of a main term, β,
possibly accounting for the effect of a secondary term, λ:

6
𝑀𝑀 1
: 𝑌𝑌 𝑖𝑖 = 𝛼𝛼 �
1
+ 𝛽𝛽 ̂
1
∗ 𝑋𝑋 𝑖𝑖 + 𝜀𝜀 𝑖𝑖
𝜀𝜀 𝑖𝑖 ~ 𝑁𝑁 (0, 𝜎𝜎 1
2
)

𝑀𝑀 2
: 𝑌𝑌 𝑖𝑖 = 𝑎𝑎 �
2
+ 𝛽𝛽 ̂
2
∗ 𝑋𝑋 𝑖𝑖 + λ
�
∗ W
i
+ 𝛿𝛿 𝑖𝑖
𝛿𝛿 𝑖𝑖 ~ 𝑁𝑁 (0, 𝜎𝜎 2
2
)

The X
i
values are a binary indicator variable for whether the sample belongs to group A or group
B. The W
i
values are continuous values representing a covariate that may or may not include
useful information about variance in the continuous outcome values.
In the case where the W
i
values explain variance in Y
i
it is better that the term λ be
included in the model. However in the case where the W
i
values are independent of the Y
i
values
it is better (although just slightly) for the term λ to not be included in the model. This can be
shown using simulation analysis (code in snippet XX). If Y_i and W_i variables are independent,
then model 1 has an estimate of β that is closer to the actual value than the estimate from model 2
50.71% of the time in a run of 100,000 simulation; the overall estimates including their variance
are both very similar however (Figure 1A). However if the two sets of values are correlated,
which in the simulation means that the value of W_i has a linear relationship with the expected
value of Y_i after accounting for the effect of β, then model 2 has an estimate of β that is closer
to the actual β 83.88% of the time, and also shows greatly reduced variance (Figure 1B).
Inclusion of a covariate that explains variance in the outcome value offers improved performance
in both accuracy (we get an estimate that is closer to the truth more often) and precision (our
7
estimate shows less variance); the difference in precision is larger in this situation and
particularly important when we are using null hypothesis significance testing to determine
whether the effect of β is statistically significant.
Standard methods generally involve looking at the data to determine whether we should
use model 1 or model 2 based on factors such as AIC. Bayesian model averaging instead
includes both models and combines their estimates of β, weighting the contributions of 𝛽𝛽 1
and 𝛽𝛽 2

based on the conditional likelihood given the model is true.
1.6.2 - Bayes model averaging
Let β be the coefficient for the main term
Pr( 𝛽𝛽 | 𝐷𝐷 ) = � Pr( 𝛽𝛽 𝑘𝑘 | 𝐷𝐷 , 𝑀𝑀 𝑘𝑘 ) ∗ Pr( 𝑀𝑀 𝑘𝑘 | 𝐷𝐷 )
2
𝑘𝑘 = 1

𝐾𝐾 =
Pr( 𝐷𝐷 | 𝑀𝑀 2
)
Pr( 𝐷𝐷 | 𝑀𝑀 1)
=
∫ Pr( 𝛽𝛽 2
| 𝑀𝑀 2
) ∗ Pr( 𝐷𝐷 | 𝛽𝛽 2
, 𝑀𝑀 2
) 𝑑𝑑 𝛽𝛽 2
∫ Pr( 𝛽𝛽 1
| 𝑀𝑀 1
) ∗ Pr( 𝐷𝐷 | 𝛽𝛽 1
, 𝑀𝑀 1
) 𝑑𝑑 𝛽𝛽 1
≈
Pr � 𝐷𝐷 � 𝛽𝛽 ̂
2
, 𝑀𝑀 2
�
Pr � 𝐷𝐷 � 𝛽𝛽 ̂
1
, 𝑀𝑀 1
�

"K" is the ratio of the likelihoods of the data under either model. Because these likelihoods are
extremely small, they should be calculated on the log scale to maintain precision
log( 𝐾𝐾 ) = log( 𝐿𝐿 1
) − log( 𝐿𝐿 2
)
𝐾𝐾 = exp (log( 𝐿𝐿 1
) − log( 𝐿𝐿 2
)
If we plug this into the previous equation and assume that the two models have equal priors, then
8
Pr( 𝑀𝑀 1
| 𝐷𝐷 ) =
Pr( 𝐷𝐷 | 𝑀𝑀 1
) ∗ Pr ( 𝑀𝑀 1
)
Pr( 𝐷𝐷 | 𝑀𝑀 1
) ∗ Pr( 𝑀𝑀 1
) + Pr( 𝐷𝐷 | 𝑀𝑀 2
) ∗ Pr ( 𝑀𝑀 2
)
=
Pr( 𝐷𝐷 | 𝑀𝑀 1
)
Pr( 𝐷𝐷 | 𝑀𝑀 1
) + Pr( 𝐷𝐷 | 𝑀𝑀 2
)
=
1
1 + Pr( 𝐷𝐷 | 𝑀𝑀 2
) /Pr ( 𝐷𝐷 | 𝑀𝑀 1
)
=
1
1 + 𝐾𝐾
Pr( 𝑀𝑀 2
| 𝐷𝐷 ) =
𝐾𝐾 1 + 𝐾𝐾
When adding the BMA of the previous two models as a third model, it is important to not
exclusively use either of the previous simulation types. The point of BMA is to account for
uncertainty of whether the values in W_i contain useful information in examining; when
simulations are run where the ground truth is exactly equal to one of the two previous models,
that model is expected to perform better.

To compare the three models, a given dataset is now simulated with a random (equal
probabilities) choice of if the W_i values are independent of the Y_i values. Evaluating the three
estimates over a set of 100,000 simulated datasets with a larger λ we find that the BMA model
performs much better than model 1, which doesn't take λ into account: BMA has an estimate
closer to the real value of β 56.22% of the time and also has less variance (lm1 SD of β estimates
= 0.331; bma SD of β estimates = 28.31. Comparison to model 2 is more ambiguous, as model 2
produces an estimate that is closer to the truth than BMA 74.12% of the time; yet despite this the
estimates from BMA across the set of simulations show very slightly less variance (SD = 0.2832
for lm2; SD = 0.2831) as well as very slightly less root mean square error from the truth (number
same as before, given number of significant digits) (Figure 2). While lm2 estimates are closer to
the truth more often, the mean squared deviation of lm2 is higher, suggestion (very slightly with
these simulation settings) less accuracy.
9
1.7 - Summary
As RNA-Seq enters new areas of research via decreasing costs and increasing sample
sizes, old algorithms will have to be evaluated in new situations to make sure they can deal with
new problems. Here we propose to establish a RNA-Seq read count simulation program to
enable the testing of differential expression programs, test those programs against a number of
factors, and create a novel algorithm to deal with certain problems seen in existing programs
while testing.
10
1.8 - Figures

Figure 1-1: Boxplots of estimates for beta from both model 1 and model 2 in
100,000 simulations. Red horizontal line indicates the actual value of beta used to
simulate data. A) Estimates in the case where W_i is independent of Y_i. B)
Estimates in the case where the mean of the simulated Y_i values varies linearly
with W_i.

11

Figure 1-2: Boxplots of estimates for beta from all three models in 100,000
simulations. Red horizontal line indicates the actual value of beta used to simulate
data. The mean of the simulated Y_i values vary linearly with the W_i values 50%
of the time.

12
Chapter 2 - Analysis of RNA-Seq data with applications in
schizophrenia
2.1 - Abstract
BACKGROUND: GWAS of schizophrenia demonstrated that variations in the non-
coding regions are responsible for most of common variation heritability of the disease. It is
hypothesized that these risk variants alter gene expression. Thus, studying alterations in gene
expression in schizophrenia may provide a direct approach to understanding the etiology of the
disease. In this study we use Cultured Neural progenitor cells derived from Olfactory
Neuroepithelium (CNON) as a genetically unaltered cellular model to elucidate the
neurodevelopmental aspects of schizophrenia.
METHODS: We performed a gene expression study using RNA-Seq of CNON from 111
controls and 144 individuals with schizophrenia. Differentially expressed (DEX) genes were
identified with DESeq2, using covariates to correct for sex, age, library batches and one
surrogate variable component. Genotyping of 201 of the 255 total participants allowed for
examination of eQTLs. Small RNA sequencing of 170 of the 255 total participants allowed for
examination of differential expression of miRNAs.
RESULTS: 80 genes were DEX (FDR of 10%), showing enrichment in cell migration,
cell adhesion, developmental process, synapse assembly, cell proliferation and related gene
ontology categories. Cadherin and Wnt signaling pathways were positive in overrepresentation
test, and, in addition, many genes are specifically involved in Wnt5A signaling. The DEX genes
were modestly, but significantly, enriched in the genes overlapping SNPs with genome-wide
significant association from the PGC GWAS of schizophrenia (PGC SCZ2). We also found
13
substantial overlap with genes associated with other psychiatric disorders or brain development,
enrichment in the same GO categories as genes with mutations de novo in schizophrenia, and
studies of iPSC-derived neural progenitor cells. 218,346 eQTLs were found, mapping to 5,636
distinct eGenes at an FDR of 5%. 83 miRNA were found to be DEX at an FDR of 5%.
CONCLUSIONS: CNON cells are a good model of the neurodevelopmental aspects of
schizophrenia and can be used to elucidate the etiology of the disorder. An increase in Wnt
signaling seems to be a hallmark of a subset of SCZ cases.
2.2 - Introduction
2.2.1 - Schizophrenia and neurodevelopment
Schizophrenia (SCZ) is a devastating psychiatric disorder with an estimated lifetime
prevalence of 0.5-1% worldwide (Goldner et al., 2002; Li et al., 2017; Somers et al., 2006; Stilo
and Murray, 2010). With heritability of about 81% (Sullivan et al., 2003), genetics plays a
critical role in the disease’s etiology, however the mechanism by which genetic variation
contributes to the disease is unknown (Ripke et al., 2014; Visscher et al., 2017). Genome Wide
Association Studies (GWAS), such as the PGC2 SCZ study, have identified over a hundred loci
influencing risk of developing SCZ, however it appears most of these common variants do not
influence risk by altering the coding of proteins (Cross-Disorder Group of the Psychiatric
Genomics Consortium et al., 2013; Ripke et al., 2014; Tansey et al., 2016).
One major mechanism by which genetic variation can lead to phenotype without
affecting protein coding is by altering gene expression levels (Majewski and Pastinen, 2011;
Walker et al., 2019). Studies of such relationships, called expression Quantitative Trait Loci
(eQTLs) where a variant, the eQTL, has a significant effect on some gene, the eGene, most
14
commonly in proximity to the variant (cis-eQTLs) rather than being far away and possibly
located on a different chromosome (trans-eQTLs), are becoming common (Bainbridge et al.,
2013; Hauberg et al., 2017; Shabalin, 2012). This allows for estimation of genetically regulated
expression (GREx), as opposed to variation from other sources, such as the environment.
Additionally, many direct and indirect pieces of evidence indicate that aberrations in brain
development are major contributors (Lewis and Levitt, 2002; Raedler et al., 1998; Weinberger,
1988). Post-mortem studies of the brains of adults with SCZ provide important information about
functional changes associated with the disease, but it is highly unlikely that the gene expression
profiles of differentiated cells can provide a full picture of changes in neurodevelopment. The
same issue applies to eQTLs as these have frequently been shown to be tissue and cell type
specific (Gilad et al., 2008; McKenzie et al., 2014; Yang et al., 2006).
Even though the genetic mechanisms regulating neurodevelopment are poorly known, it
seems reasonable to suggest that regulation of cell migration and the balance between cell
division and differentiation of neuronal cells are critical to proper brain development. Most
neuronal cells in the adult brain are not dividing, migrating or differentiating, limiting the utility
of adult tissue samples to find alterations in gene expression that affect neurodevelopment.
2.2.2 - CNON cell culture model
Different biological models, such as hiPSC-derived neural progenitor cells (Brennand et
al., 2015, 2011) and olfactory epithelium derived cells/tissue (Cascella et al., 2007; Lavoie et al.,
2017) have been suggested to study neurodevelopmental aberrations in schizophrenia. For this
study, we have used Cultured Neural progenitor cells derived from Olfactory Neuroepithelium
(CNON) of individuals with, and without, schizophrenia (Evgrafov et al., 2011). These cells are
not genetically modified neural progenitors, actively divide, and migrate in 2D and 3D cultures.
15
2.2.3 - eQTLs
Combining information from both RNA-Seq (or potentially gene expression microarrays)
and genotyping, we are able to study the correlation between gene expression levels and specific
genetic variants, commonly referred to as expression Quantitative Trait Loci (eQTL) analysis.
This can be considered either a differential expression problem where we are comparing between
different genotypes (AA vs. AB vs. BB, most likely in an additive/linear model), or a GWAS
problem where the trait being analyzed is the expression of a given gene.
2.2.4 - Summary
Here we present a study of transcriptome expression profiles using RNA-Seq (strand-
specific, rRNA-depleted total RNA) in CNON lines derived from 144 SCZ and 111 control
(CTL) individuals. These are paired with genotyping data in 201 cases to allow for eQTL
analysis.
2.3 - Methods
2.3.1 - Sample characteristics
The study protocol used was approved by the University of Southern California
Institutional Review Board. Diagnosis for schizophrenia was made based on criteria from the
DSM-IV, with people excluded from being controls if they have a family or personal history of
schizophrenia, or if they answered questions indicating a history of any mental illness. Most
patients and control subjects were recruited from participants of the Genomic Psychiatry Cohort
(GPC) study (1R01MH085548) (Pato et al., 2013), and a few patients were recruited through Los
Angeles county / University of Southern California outpatient psychiatric clinic. Biopsies were
taken and cell cultures developed based on methods described previously(Evgrafov et al., 2011;
16
Wrobel et al., 2013). Given the common variation overlap of SCZ and Major Depressive
Disorder (MDD) (Lee et al., 2013), we excluded controls that endorsed either of the MDD probe
questions on the GPC screener.
2.3.2 - RNA-Seq sample and model
The individuals' characteristics, including sex, age, and race/ethnicity, were tested for
balance between schizophrenia and control groups by Chi-Square test or two-sample t-test. Tests
showed the sample set to be unbalanced in diagnosis with respect to age, sex, and race/ethnicity
(Table 1), therefore these terms were considered as covariates.
We applied a categorization of samples into a minimal number of racial/ethnic categories.
Based on examining PCA of genotypes for the subset of samples with genotyping information,
we found 3 clusters considering the first two PCs (Figure 1). The first PC appears to separate
non-Hispanic Caucasians from African-Americans, based on correlating sample value with
race/ethnicity information collected for participants. The second PC appears to separate Non-
Hispanic Caucasians from Hispanics. These 3 clusters, with some samples distributed between
the clusters and seeming to correspond to mixed race/ethnicity individuals, accounted for 89.4%
of individuals sorted into these 3 major categories. 85.1% of samples in these 3 groups could be
considered matched for case vs. control status, however this still results in a significant lack of
independence between Diagnosis and Race/Ethnicity (Chi-Square = 9.6, df = 2, p = 0.008)
largely caused by a preponderance of black schizophrenics in the study (35 unmatched with
controls). The remaining samples came from a variety of racial/ethnic groups including East
Asian, Indian, Native American, and individuals of mixed race/ethnicity. Including this fourth
racial/ethnic category (Other) still showed a significant lack of independence between Diagnosis
17
and Race/Ethnicity (Chi-Square = 9.8, df = 3, p = 0.02) and also, if we consider individuals in
the Other category to be unmatched, brought the match percentage down to 74.5%.
While effects of race on gene expression appeared to not be expansive enough to require
correction in the statistical model (Figure 1), age was included as a covariate largely on the basis
of the much higher significance of the effect of diagnosis (t-test p-value for difference in age
between SCZ and control = 6.4e-09; compare to p-values of 0.025 for sex and 0.02 for
race/ethnic category). Sex was included as a result of the larger amount of variation explained by
this term (2%, as opposed to 0.5% or 0.37% for age and diagnosis, respectively). Further data
supporting the decision not to correct by race/ethnic category in the differential expression
analysis is included as part of the results.
2.3.3 - RNA-Seq
RNA was purified from ~400 000 cells grown on 6 cm Petri dishes (~90% of confluence)
coated with Matrigel diluted 1:2 in Coon’s medium, typically on the third passage, using the
Direct-Zol RNA MiniPrep kit (Zymo Research), according to manufacturer’s protocol. For some
cell lines, we purified two or more RNA samples from different plates of the same passage, or
different passages, in order to assess variability of gene expression profiles in technical replicates
at different levels of the sample preparation procedure. Quality of purified RNA was consistently
high, with a RNA integrity Number (RIN) typically exceeding 9. RNA libraries were prepared in
batches of 24-48 samples with approximately equal numbers of cases and controls using TruSeq
Stranded Total RNA LT Library preparation kits with Ribo-Zero Gold (Product Numbers RS-
122-2301, RS-122-2302, Illumina) according to manufacturer’s protocol, using a Hamilton
STARlet liquid handling robot to increase library preparation consistency (script available on
request). For many RNA samples, we made two or more libraries. Equimolar pools of at least 4
18
libraries, containing both cases and controls, were constructed after quantification using the
KAPA Library Quantification Kit (Kapa Biosystems), and sequenced together using HiSeq2000
DNA Sequencers (Illumina) as 100 bp single-end reads. On average, each sample was run in
3.93 channels (lanes) across 3.91 flow cells, to reduce potential channel and flow cell bias.
Every set of sequencing reads with unique index in every channel of flow cell was quality
controlled for overall complexity (total number of reads, average entropy, percentage of reads
with low entropy, frequency of most common K-mers and monomers). Prior to mapping, we
removed reads containing more than 50% of adapter sequences, monomers or other low entropy
reads (metric entropy below 1%). The rest of reads from each individual channel were trimmed
(if adapters constitute less than 50% of the read) and sequentially aligned to rRNA, mtDNA, the
rest of human transcriptome (GenCode v22 gene models) and genome (GRCh38) using our
custom RNA-Seq alignment pipeline, GT-FAR (https://genomics.isi.edu/gtfar). Mapping quality
was monitored by examining the distribution of reads with different number of substitutions to
match the reference. Reads mapped to rRNA and mtDNA were excluded from following
analysis. Reads generated from the same library, but ran in different channels or different flow
cells were assessed for quality separately, and those that passed QC were united as a set of reads
specific to the individual. We required every library ran in one channel to have at least 1 million
reads after QC, with overall mapping rate at least 75% and reads aligned to the correct strand at
>90%.
We analyzed female-specific (XIST) and male-specific (DDX3Y, USP9Y, KDM5D)
ubiquitously-expressed genes to confirm samples properly clustered with annotated sex. We also
required that RNA-defined genotypes corresponded to DNA-defined genotypes, which were
previously determined by microarray or whole genome sequencing. Additionally, we identified
19
and removed outliers with regard to the alignment percentage to mtDNA, rRNA and gene
models. We vigorously tested correlation of gene expression between libraries, RNA samples,
individuals, as well as channels and flow cells. We assumed that correlation between sequencing
data should generally decrease in the following order: the same library in the same flow cell, the
same library in different flow cells, different libraries from the same RNA sample, libraries from
different RNA samples, libraries from RNA samples purified from different biopsies of the same
individual, and lastly libraries from different individuals. Most sequencing data fit the expected
pattern, and outliers were excluded after thorough investigation of potential reasons for the
deviation. Finally, we analyzed the relationship between number of detected genes and number
of aligned reads; a few outliers from a best-fit relationship were excluded. Reads that aligned to
the sense strand of a single gene model with minimal number of mismatches (not more than six)
were assigned to the gene. The number of reads assigned to a gene was used as a proxy of gene
expression in DEX gene analysis.
2.3.4 - RT-qPCR
Reverse transcription quantitative Polymerase Chain Reaction (RT-qPCR) was performed
in duplicates using the Biomark HD (Fluidigm) on a Flex Six Gene Expression IFC (Integrated
Fluidic Circuit), according to manufacturer’s protocols and using the recommended reagents.
Normalized relative expression (to ACTB) was calculated using the ΔΔCt method (Karlen et al.,
2007)(13), and log-transformed expression values were tested by ANOVA for an effect of SCZ.
Additionally, expression values for individual samples measured by both methods and average
expression values per gene were also compared to RNA-Seq data.
20
2.3.5 - Differential gene expression analysis
We performed main differential gene expression (DEX) analysis between SCZ and
controls using DESeq2 v1.16.1 (Love et al., 2014)(14) in R v3.4.1, an algorithm assessing
difference between mean gene expression in groups using a generalized linear model and
assuming a negative binomial distribution of RNA-Seq reads. The analysis used covariates of
sex, age, 4 library batches and 1 surrogate variable (SVA v3.24.4). Procedures used to arrive at
this covariate set are further described in Supplemental Materials.
DESeq2's built-in normalization process was used. The analysis was done on 23,920
expressed genes, defined as having on average at least 3.5 reads per sample, based on the density
plot of log-transformed baseMean for all genes (Figure 2). Resulting p-values were adjusted for
multiple comparisons based on the Benjamini-Hochberg False Discovery Rate (FDR) using the
p.adjust function in R. Transcripts Per Million transcripts (TPM) values were calculated by
dividing the mean number of unnormalized reads uniquely mapped to a gene by the median
transcript length (Li, 2018)(15) in Gencode 22 for that gene and normalizing the resulting values
so they sum to one million (Wagner et al., 2012)(16). Hierarchical clustering analysis was done
in R using the 'hclust' function where the distance is calculated as one minus the Pearson
correlation coefficient and using average linkage. DEX genes were defined as those showing
significance at an FDR of 10%.
2.3.6 - Permutation analysis of differential expression
To test that our findings are not due to random variation we performed two forms of
permutation analysis. First, 1,000 analyses were run under the same conditions as the main
analysis, except with labels for diagnosis randomly permuted for all samples. Second, 250
21
analyses each of random subsets in 4 different configurations were analyzed: random half of
cases vs. other half of cases, and random half of controls vs. other half of controls (null
comparisons); and, due to the difference in number of cases and controls, random half of cases
vs. random same number of controls, and random half of controls vs. random same number of
cases (case/control comparisons). The numbers of differentially expressed (DEX) genes in the
permuted analyses were compared to the number seen in the original, unpermuted analysis by
Wilcoxon test. The number of DEX genes in the null comparisons was compared to the number
DEX in the case/control comparisons by Mann-Whitney test.
2.3.7 - DEX enrichment analyses
The significant DEX gene list was analyzed for gene ontology (GO) enrichment using the
g:Profiler web tool based on ENSEMBL Gene IDs and using the "Ordered Query" setting which
considers more significant genes as more relevant than less significant genes. The DEX gene list
was also analyzed for pathway enrichment using the PANTHER DB "Gene List Analysis" tool's
"statistical overrepresentation test".
Finally, the DEX gene list was analyzed for significant overlap with genome-wide
significant SCZ GWAS variants. We identified GWAS variants’ p-values within a given gene
based on the PGC SCZ2 dataset (https://www.med.unc.edu/pgc/results-and-downloads) and
Gencode 22 annotation back-ported to human reference genome GRCh37 (hg19) to match
coordinates used in GWAS. Fisher’s Exact Test was used to test for enrichment of genes co-
localized with genome-wide significant (p<5x10-8) GWAS peaks and calculate estimated odds-
ratio. Due to a very broad peak in the HLA region (25 - 34 Mb of chromosome 6 in GRCh37
coordinates), genes from this region were removed from the analysis, as were genes from the Y
chromosome.
22
2.3.8 - WGCNA
Weighted Gene Correlation Network Analysis (WGCNA) (Langfelder and Horvath,
2008)(17) was performed for all expressed genes on residuals after correction for effects of three
explicit batches as well as one surrogate variable, with a soft-threshold power of 5 to achieve
approximate scale-free topology (SFT R2>0.85; truncated R2>0.95), and a minimum module
size of 50 genes. Modules of co-expressed genes produced by the algorithm were tested for
enrichment of DEX genes by Fisher's exact test and for correlation of the eigengene with
diagnosis (SCZ vs. control), after controlling for sex and age (linear model, no interactions); the
p-value cutoff was adjusted for multiple comparisons. Gene set enrichment analysis was applied
to modules that had a significant p-value on the aforementioned statistical tests.
2.3.9 - Comparison of CNON to BrainSpan data
We used the GT-FAR pipeline to process raw data of 585 BrainSpan samples across 41
individuals, 26 brain regions and 10 developmental stages (from "Early Fetal" to "Middle
Adulthood") (www.brainspan.org). CNON raw data was processed through the same pipeline
using the same set of genes (matched by Ensembl ID) as used for processing BrainSpan data. To
mitigate the differences in RNA-Seq methodology (e.g, differences in read length and library
construction) we assessed relative similarity of gene expression profiles of CNON and brain
samples from different brain regions and at different age in the same PC space derived from
BrainSpan data only.
PC1 of the BrainSpan data associated with developmental age, and separates the pre- and
post-natal samples. CNON samples cluster with the prenatal samples and appear to be more
similar to the second trimester BrainSpan samples, than the first trimester (Figure 3).
23
2.3.10 - Genotyping data, phasing, and imputation
201 samples from the 255 in the CNON DEX comparison had associated genotype data,
largely as part of the GPC project (Pato et al., 2013). The type of data varied with 60 samples run
on the Illumina Omni2.5 microarray, 23 run on the Illumina OmniExpress microarray, 33 run on
the PsychChip microarray, 53 samples with Whole Genome Sequencing done at USC, and 32
samples with Whole Genome Sequencing performed at the Broad. Some samples had multiple
types of data, in which case only the most comprehensive form of genotyping was used and
mentioned in the preceding.
Both microarray and sequencing data were converted to Variant Call Format (VCF) and
computationally phased using the SHAPEIT program against the 1000 Genomes Phase 3
haplotype reference panel (Delaneau et al., 2012, 2014). Next all data were imputed by
IMPUTE2, also against the 1000 Genomes Phase 3 reference panel (Howie et al., 2009).
Imputation served to make the genotype data from different sources similar enough that samples
did not separate by genotyping method on the first 4 principal components. Imputed data were
trimmed based on a number of metrics: at least 90% confidence in imputation, SNP minor allele
frequency > 1%, lack of Hardy-Weinberg deviation (p > 10e-4), SNP missingness < 5%.
2.3.11 - eQTL calculation
Genotype covariates were calculated by the Multi-Dimensional Scaling (MDS) function
of PLINK and once again showed separation that accorded with self-reported race/ethnicity data
(Purcell et al., 2007). Gene expression data was normalized by DESeq2 (Love et al., 2014). Gene
expression data, genotyping data, and 4 principal components of the genotyping data were fed
into Probabilistic estimation of expression residuals (PEER) (Stegle et al., 2012) to generate
24
covariates to correct for large correlated effects within the gene expression data. Genotype data,
gene expression data, and both sets of calculated covariates along with known covariates (age,
sex, diagnosis) were fed to MatrixEQTL to calculate eQTLs (Shabalin, 2012). Potential eQTLs
were separated into two separate analysis: cis-eQTLs were defined as comparisons between a
gene and a variant that was within 1 Mb; other comparisons were defined as trans-eQTLs, and
the two sets were adjusted for multiple corrections separately. A Benjamini-Hochberg FDR
correction was applied to the resulting p-values, and eQTLs with a BH FDR of less than 5%
were accepted.
2.3.12 - miRNA Analysis
Analysis of short RNA sequencing data was processed separately. Following adapter
trimming and related QC, short RNA reads were aligned directly against the mature miRNA
reference from miRBase by BWA MEM (Kozomara and Griffiths-Jones, 2011; Li and Durbin,
2009). Counts were then retrieved and processed by DESeq2 for analysis of differential
expression.
2.4 - Results
2.4.1 - Characterization of CNON cells
The RNA-Seq data in this study are consistent with our previous observations using
Affymetrix Human Exon 1.0 ST arrays (9) that CNON lines are neural progenitors (Table 2). In
order to determine the period of human brain development the CNON lines most resemble, we
compared RNA-Seq data from CNON to 647 poly-A RNA-Seq 76 bp datasets from post-mortem
human brain samples across 41 individuals, 26 brain regions and 10 developmental stages (from
"Early Fetal" to "Middle Adulthood") (www.brainspan.org, (Li et al., 2018)(18)). To mitigate the
25
differences in RNA-Seq methodology, we transformed the gene expression values of each
CNON sample using the coordinates described by the principal components of the BrainSpan
data. The first principle component of the BrainSpan data roughly corresponds to developmental
age and separates the pre- and post-natal samples (Supplemental Figure 3). The CNON samples
form a tight cluster within the prenatal samples, particularly those from the mid fetal period
(weeks 13-24, second trimester).
2.4.2 - Model selection for differential expression analysis
A critical component of differential expression (DEX) analysis in complex genetically
heterogeneous diseases such as schizophrenia is adequate correction for technical elements that
increase noise and add confounding factors which often have a stronger effect than the disease
itself. Adding covariates, which are used for correction of specific or unknown confounding
factors, may increase power if sufficient variation is taken into account, or reduce it, if the
covariate does not explain a substantial amount of variation, and there is a risk of overfitting that
may overwhelm the benefits of adjustment if too many are added.
Without adjustment for any other variables, diagnosis accounts for 0.8% of variation in
the total data. Known covariates analyzed for inclusion in the analysis are sex, age,
race/ethnicity, library batch, and sequencing batch (flowcell). Sex and age account for 2.4% and
1.7% of total variance in a model including only themselves, respectively. Given the fact that
both sex and age are significantly imbalanced with respect to diagnosis (Table 1), both were
included in the model. The combined model of diagnosis, sex, and age explained 3.4% of
variation.
26
Based on the equation for the Bayesian information criterion (BIC) (Schwarz, 1978)(1), a
single added parameter must explain at least an additional 2.1% of total variance to reduce the
BIC, indicating a better fitting model. This was used as a guideline to determine inclusion of
additional parameters to adjust for known batches, such as samples that were part of the same
library creation batch or samples that were run on the same flowcell. Three library batches were
found to explain more than 2.1% of variation (each explains from 2.7% to 11.6%) and were
added to the model. Four flowcells were found to explain more than 2.1% of variation in the
data, however, library batches and flowcells are highly correlated, and library batch explained
variance better in fewer covariates. Samples from the same library batch that were run on
different flowcells appeared the same in PCA, while samples from different library batches that
were run on the same flowcell appeared different. To deal with technical effects we added 3
additional covariates for the library batches; the updated model explained 20.1% of variation in
the data.
Control and case groups were also not balanced by racial/ethnic category (Table 1; Chi-
Square test p = 0.02). However, analysis of the first 10 principal components for the log-
transformed normalized expression values showed no substantial general effect of racial/ethnic
category on gene expression (test for difference in means between groups by one-way ANOVA,
all FDR > 10%), a drastic difference with the effect in genotyping data for a subset of samples
(Figure 1). The same was true when testing on the top 10 principal components after correction
for library batch effects.
Differential expression analysis with racial/ethnic category included as a covariate (4
groups: Non-Hispanic Caucasian, Hispanic, African-American and Other) identified genes
showing expression differences correlated with race/ethnicity, however the additional residual
27
variance explained by the added covariates (1.2% of variance explained by 3 indicator variables)
was insufficient to offset the penalty for increased parameterization in the model. All DEX genes
identified using racial/ethnic category as a covariate showed differences in the same direction as
DEX genes identified without correction for ethnicity (Pearson correlation of test statistics, r =
0.990) but generally of less statistical significance: 53 of 80 DEX genes remain at FDR < 0.1, 79
of 80 at FDR < 0.2 (Figure 4). 24 additional genes became significant at FDR < 0.1 with the
addition of racial/ethnic category covariates, also all showing trends of differential expression in
the same direction as in the main analysis (76 of 77 at FDR < 0.2). Thus, overall influence of
race/ethnicity on expression profile appears insignificant and does not justify correction for that
covariate by racial ethnic category, which groups rare racial/ethnic backgrounds in a single
"Other" category and does not account for the continuous distribution of racial/ethnic
background such as that seen between Hispanic and non-Hispanic Caucasians (Figure 1).
Furthermore, adding covariates for race/ethnicity results in only small differences in significance
for all but 2 genes (Figure 4), and test statistics are highly correlated both for all expressed genes
(Pearson r = 0.972) and for DEX genes (Pearson r = 0.990). SLPI appears to have dropped
greatly in significance due to the fact that, despite showing a decrease in SCZ in 3 of 4
racial/ethnic groups, it shows a slight increase in SCZ among Hispanics. PADI2 appears to have
greatly increased in significance because expression in Non-Hispanic Caucasians is significantly
higher than in the other 3 groups.
To adjust for possible unknown confounders we used Surrogate Variable Analysis (Leek,
2014; Leek et al., 2012)(2). We calculated how many SVA covariates we should use based on
the method proposed by Leek (Leek, 2011)(3) and implemented in the "num.sv" function of the
SVA package while using the previous covariates as the preexisting model. The addition of one
28
surrogate variable to previously included explicit covariates was recommended. Adding this
calculated covariate to the model explained 31.4% of the total variation.
2.4.3 - Differential expression results
We performed differential gene expression (DEX) analysis between SCZ and controls
using DESeq2, an algorithm assessing differences in mean gene expression between groups
using a negative binomial model for RNA-Seq reads. We used the normalization built into
DESeq2, which determines size factors based on median ratio of gene expression between
samples. Normalized counts were calculated as raw counts divided by the size factors. Finally,
implementation of DESeq2 as a generalized linear model (GLM) allows for taking covariates
into account, correcting for confounding factors and batch effects. The covariates taken into
account for the DEX analysis included sex, age at time of biopsy, the first 2 principal
components, and 1 library batch. This combination of covariates was arrived at through a process
of trial and error, but it is hoped the program developed in Aim 2 will be able to follow a similar
process reliably and objectively.
Analysis was done on 23,920 expressed genes, defined as having on average at least 3.5
normalized counts (baseMean), which is equivalent to 0.17 counts per million (CPM). We
defined this cut-off number based on the density plot of log-transformed baseMean for all genes
to include those in the distribution of expressed genes, not in the larger unexpressed genes peak,
with a relatively liberal threshold ensuring all expressed genes, including low expressed ones,
were included (Figure 2).
Using our analysis model, 80 genes were DEX between SCZ and CTL at a false
discovery rate (FDR) of 10%, corresponding to a maximum p-value of 3.3x10-4 (Table 3;
29
complete gene list in Supplemental Table S1). The average fold change was 1.8 (range 1.08 to
9.09) (Figure 1) at expression levels of 0.07 to 552 TPM.
To evaluate the accuracy of RNA-Seq gene quantification and DEX analysis, we
compared results with RT-qPCR. For that purpose, we used a subset of 146 samples and
performed DEX analysis on RNA-Seq data on only these samples. From this list of DEX genes
we selected 5 genes (CCL8, HTR2B, PLAT, PPARGC1A, and VAV3) to include genes with
fold change differences in both directions and spanning a range of gene expression levels.
Expression of these genes and ACTB (used for normalization) was assessed by RT-qPCR on the
same set of 146 samples. Expression data from RT-qPCR and RNA-Seq were highly correlated
within each gene (mean r=0.75, range 0.63 - 0.90, all p-values < 2e-15) (Figure 5A-E), and mean
expression of all six genes had correlation r=0.943 (p=0.0047) (Figure 5F). DEX of 4 of 5 genes
was replicated, while one gene (PLAT) did not reach significance (p=0.15) but shows a trend in
the same direction as that seen in RNA-Seq (Figure 5GH).
A recent paper from Fromer et al. (2016) represents the largest SCZ DE study to date,
although it is of post-mortem brain tissue rather than cultured neural progenitor cells. While the
full QQ plot and lambda cannot be calculated from the data publicly available, the 4.2% of genes
they report DE at FDR < 0.05 is enough to reconstruct a portion and compare that to the CNON
study (Figure 6). While lambda cannot be calculated, since we don't have data at the 50th
percentile, we can calculate the equivalent inflation factor at the 95.8th percentile: 2.29 for
Fromer et al. compared to 1.32 in this study. These inflation factors represent the extent of
strongly DE genes we are assuming to be true for these results to be consistent, with Fromer et
al. assuming a much larger percentage of expressed genes are DE compared to the CNON study.
30
Such a QQ plot from a large study published in a highly reputable journal suggests that basic
statistical assumptions and QC methods may not be being given appropriate heed.
2.4.4 - DEX enrichment analyses
It is thought that causal variants in SCZ GWAS loci regulate expression of genes
involved in the etiology of SCZ (Gallagher and Chen-Plotkin, 2018; Tak and Farnham,
2015)(20,21). Previous studies have shown that regulatory variants are likely to be localized near
to or within the genes they directly regulate (Morris et al., 2017)(22), we tested the hypothesis
that DEX genes are more likely to be co-localized with genome-wide significant (p<5x10-8)
variants from the PGC SCZ2 GWAS (Ripke et al., 2014)(19), excluding the HLA region of
chromosome 6. Three DEX genes (ESAM, FOXO3, and SRPK2) were found to overlap
independent genome-wide significant variants (Fisher's exact test, odds-ratio=3.8, p=0.049)
(Figure 7). Two other genes just missed being overlapped with genome-wide significant peaks.
DEX gene IL1RAPL1 is almost genome-wide significant (p=5.3x10-8), and AC007618.3 sits
within an intron of CACNA1C, in which another intron contains one of the most significant
GWAS signals (Ripke et al., 2014) (19). For comparison, genes that were DEX (FDR<10%)
based on either sex or age were not found to be enriched for overlap with SCZ GWAS variants
(p>0.05).
Gene set enrichment analysis (GSEA) shows significant (corrected p<0.05) enrichment of
the DEX genes in 47 GO terms from 10 categories (related groups of terms), including cell
migration, cell adhesion, developmental process, cell proliferation, synapse assembly and
PANTHER pathway analysis (Mi et al., 2016)(24) shows over-representation of DEX genes
involved in the cadherin and Wnt signaling pathways (FDR=1.39% and 1.91% respectively).
Hierarchical clustering of DEX genes further supports the finding of involvement of Wnt
31
signaling. The most prominent cluster of correlated genes includes WNT5A, the most expressed
Wnt ligand gene in CNON, and several genes involved in, or regulating, Wnt signaling.
2.4.5 - WGCNA analysis
WGCNA identified 23 gene expression modules containing 78.0% of expressed genes
(18,636 genes). Module size ranged from 3,934 to 64 genes. Only module #3, containing 32
DEX genes (out of 2,675 genes in the module), showed significant enrichment of DEX genes
(Fisher’s exact test; OR=3.6; raw p=4.0x10-11). This enrichment was driven by genes that were
in the main cluster of DEX genes seen in the hierarchical clustering (Figure 8), including
WNT5A, GAS1, and FOXO3. Based on GSEA, this module shows extremely significant
enrichment for GO terms "cell cycle process", "chromosome segregation", and "DNA
replication" (corrected p<5x10-18 for all), among others. Additionally, the module eigengene
showed a significant correlation with SCZ status (p=0.004).
Hierarchical clustering of samples based on the expression of genes in WGCNA module
#3 shows a clear separation into two groups, an effect further supported by examination of the
heatmap (Figure 9A). The same separation is apparent when SCZ and CTL samples are
examined separately (Figure 9BC). The smaller subgroup shows an enrichment of SCZ samples
as compared to controls (Fisher's exact test, p=3.6x10-05, OR=4.76).
2.4.6 - Permutation analysis of differential expression
To assess the probability that our DEX findings could be due to random statistical
variation, we performed two forms of permutation analysis: a standard permutation analysis in
which some case/control imbalance between groups is expected by chance (and hence some
effect of SCZ), and a second with comparisons where we can expect the null hypothesis to hold
32
(case vs. case; control vs. control). For the first, we randomly permuted the diagnosis labels but
held all other factors constant and found a median of 11 DEX genes at FDR < 10% (mean =
28.92; 25th percentile = 5; 75th percentile = 22). We used the Wilcoxon signed rank test to
assess significance of the mean signed rank between permuted and experimental data, and the
number of DEX genes found in permutation analysis is significantly lower (p < 2x10-16) than in
the original analysis. This shows that the difference in gene expression between the actual CTL
and SCZ groups is greater than between random groupings.
For the second permutation analysis, we compared groupings where the null hypothesis
should hold (random half of SCZ vs another half SCZ and similar CTL-CTL; null comparisons)
to those where we expect the null hypothesis to be violated (case/control comparisons). The null
comparisons resulted in significantly fewer DEX genes at FDR < 10% (Figure 10; SCZ-SCZ,
median=2.0, mean=2.7; CTL-CTL median=2.0, mean=3.7) as compared to case/control
comparisons (medians = 2.2 and 4.4 DEX genes, means = 5.3 and 8.3, for two sample sizes
corresponding to either half of controls or half of SCZ, respectively; p < 3.6e-15 by Wilcoxon
test).
DEX genes from the main analysis were found to be differentially expressed (FDR <
10%) in case/control subsets permutations much more often than in null comparisons (median =
13.7% for case/control compared to median = 0.2% for null comparisons; Mann-Whitney test p
< 2.2e-16). 33 DEX genes were not found to be significant in any null comparisons, while 2
(CHI3L1 and LINC01013) were found to be significant in at least 10% of null comparisons.
Comparatively, only 1 DEX gene (FAM110C) was not found to be significant in any
case/control comparisons, while 62 were found to be significant in at least 10% of case/control
comparisons.
33
These permutation analyses strongly suggest that the majority of genes found to be DEX
are not likely to be false positives and, instead, are reproducible results. However, the fact that
the probability for DEX genes to be identified in case-control permutations is not high indicates
that power is relatively low at that sample size, suggesting many more DEX genes exist.
2.4.7 - Comparison to other SCZ DEX gene studies
Although SCZ GWAS is considered the most general and direct way to identify causative
common genetic variants, consideration of other related phenotypes, and other types of genetic
studies, such as de novo mutations, mutations segregating with psychiatric disorders in
multigenerational families, and gene expression studies in relevant models also provide
information about genes likely involved in SCZ. We found that many genes identified in these
studies are also DEX in our study. FOXO3, SRPK2, BACE2, GABRA4, RC3H1, NRXN3,
TANC2 and LYL1 are genome-wide significant in DEPICT-based association in GWAS meta-
analysis of intelligence and associated traits (Lee et al., 2018)(33). SNP within PRKCA is
significant in GWAS of neuroticisms (Kichaev et al., 2019)(34). Four DEX genes have been
reported to have de novo non-silent mutations in individuals with SCZ (ESAM, PI15, SSBP3,
TANC2) (Katoh and Katoh, 2007)(25), seven have been reported to have de novo missense
mutation(s) in patients with autism spectrum disorder (ARHGAP26, JAK1, IL1RAPL1, MCAM,
FLG, PRKCA, PCDHB16) (Lin et al., 2017)(26), copy number variations of IL1RAPL1 have
been identified in multiple patients with SCZ (Mi et al., 2016)(24) and point mutations in the
same gene have been observed in individuals with mental retardation (Luykx et al., 2010)(27).
An exonic deletion in NRXN3 was found to segregate with neurodevelopmental and
neuropsychiatric conditions in a three-generation Chinese family (Yuan et al., 2018)(35).
Moreover, among 19 genes most correlated with WNT5A (R>0.6) nine have de novo mutations
34
in either SCZ, autism spectrum disorder or developmental disorder, demonstrating striking
convergence between gene expression data and de novo mutation studies of SCZ on a pathway
level. In total, 18 of the 80 genes overlap with previously published results from genome-wide
studies of psychiatric disorders, and some of them have supported evidence from several
independent studies (ESAM, FOXO3, SRPK2, IL1RAPL1, TANC2, PRKCA, NRXN3).
The largest transcriptome study of SCZ was performed by the CommonMind Consortium
(CMC) using post-mortem adult dorsolateral prefrontal cortex (258 SCZ vs. 279 CTL) (Fromer
et al., 2016)(36), which has a very different pattern of gene expression than CNON or the fetal
brain (Li et al., 2018). Only two genes showed significant differences after correcting for
multiple comparisons in both studies (FAM110C and CLEC3B) and the differences were in
opposite directions in both cases. However, correlation of test statistics for genes expressed in
both studies was highly significant (r=0.174; p<2.2x10-16, n=14,924 genes); this correlation
increased to r=0.42 (p<2.2x10-16, n=453 genes) on the set of genes that were nominally
significant in both studies.
We also compared our results with data from studies of iPSC-derived neural progenitor
cells, which are most similar to the first trimester samples in BrainSpan (Brennand et al.,
2015)(37). We found three DEX genes, all changed in the same direction, in common with a
study based on microarray expression profiling of 4 SCZ and 4 CTL individuals (ARHGAP26,
NRXN3, LRRC61) (Brennand et al., 2015) (37), and there was a significant positive correlation
in z statistics testing for differential expression between SCZ and CTL groups for genes with
publicly available data (Pearson's r=0.117, p=0.025, n=367 genes). This work was extended
using RNA-Seq and two additional controls (4SCZ/6CTL) (38) and shared 5 genes with our
DEX list (GAS1, WNT5A, BACE2, FRMD6, PRKCA); all genes except PRKCA, had the same
35
direction of effect. As before, the overall comparison of test statistics with our study was
significant (r=0.15, p=4.26x10-5, n=724 genes). This latter study implicated Wnt signaling in
schizophrenia, which agrees with our findings of involvement of WNT5A in etiology of the
disease.
However, the largest SCZ study using iPSC-derived NPCs (10SCZ/9CTL) from the same
group is significantly negative correlated in test statistics with our data for all genes (r=-0.08,
p<2.2x10-16, n=15,862). The results of this latest study are also significantly negatively
correlated with two previous studies from the same group (r=-0.33, p=7.87x10-12, n=412 with
microarray study and r=-0.19, p=1.13x10-6, n=654 with RNA-seq study).
2.4.8 - Power calculations based on observations
Power analysis was run based on a negative-binomial model using the R Bioconductor
package "RNASeqPower". This models the expected power of a RNA-Seq differential
expression experiment given a certain type I error rate, depth, fold change, and biological
coefficient of variation (BCV). The biological coefficient of variation was estimated in this cell
type after subtracting effects of all terms in the model used for DE analysis according to the
equations presented by Hart (Hart et al., 2013).
Power in an RNA-Seq experiment is the result of a number of variables: fold-change and
depth for a given gene, the expected biological coefficient of variation (BCV) = sqrt(ψ), and the
significance threshold/alpha. The larger the fold-change or greater depth for a gene generally
implies higher power, while a higher BCV results in less power. Given the nature of an FDR, the
more genes that are found significant, the higher the effective alpha is and hence the higher the
power is. The median fold change for genes found DEX at FDR < 0.1 in our experiment is 1.29
36
and individual instances range from 1.16 to 1.62. The BCV for expressed genes (depth >= 3.5) in
CNON samples has a median of 0.526, which is in line with other reports for variation among
humans. The median depth of DEX genes was 30.6.

Given a BCV of 0.526, if we look at power based on a threshold of FDR < 0.1 with 50
significant DEX genes, power at the median fold change observed (1.29) is 14% at a depth of
3.5, increasing to 41% at a depth of 30 (Figure 11 above). The expected power from minimum
fold change to maximum ranges from 1.5% to 89.8% at the minimum depth, or from 4.6% to
99.8% at the median depth.
However, the biological coefficient of variation is significantly larger in DEX genes than
it is in other expressed genes (Mann-Whitney test p-value < 2e-16), with a median of 1.17. If this
BCV is more characteristic of genes affected by SCZ, then power estimates for detecting such
genes are lower than previous estimates (Figure 11 below). If we look at power based on a
threshold of FDR < 0.1 with 50 significant DEX genes, power at the median fold change
observed is 1.6% at a depth of 3.5, increasing to 2.2% at a depth of 30. The power at a depth of
30 ranges from 0.26% at the minimum fold change observed to 23% at the maximum fold
change observed.
2.4.9 - eQTL results
218,346 cis-eQTLs were identified that regulated 5,636 distinct genes (eGenes). A
boxplot of log2 normalized gene expression for a highly significant example eQTL can be seen
in Figure 13.
37
2.4.10 - miRNA results
Adapter trimming of short RNA sequencing in many samples revealed a strong peak at
around 22-23 bp that corresponds to the expected length of mature miRNA (Figure 14). Analysis
of miRNA found 83 of them DEX between SCZ and control groups at an FDR of 0.05. 6 results
confirmed previous reports of altered expression of specific miRNAs: miR-9-5p, miR-25-3p, and
miR-132-3p were seen to be downregulated in SCZ; miR-30a-3p, miR-195-3p, and miR-485-5p
were seen to be upregulated in SCZ. Two other results were near matches to previous reports:
miR-26a-1-3p was seen to be DEX in SCZ where an association between SCZ and miR-26b has
been reported before; miR-30c-5p was seen to be DEX where association between SCZ and
miR-30a,b,d,e have been reported before. Ingenuity Pathway Analysis shows that biological
functions of “cell proliferation” and “cell cycle” (among others) are predicted to be affected by
differential expression of miRNA in SCZ.
2.5 - Discussion
2.5.1 - SCZ CNON cells
Schizophrenia is a complex genetic disorder that originates during fetal development,
typically manifests symptoms in adolescence and early adulthood, and persists throughout adult
life. While post-mortem brain transcriptome studies assess changes in gene expression of
differentiated neuronal and glial cells in the adult brain, we developed a genetically unmodified
cell-based system, CNON, to study the neurodevelopmental component of the disorder. These
cells, developed from olfactory neuroepithelium, are neural progenitors, and express a
transcriptome most similar to the mid-fetal period of the brain (Figure 3), a time of increased risk
for the development of schizophrenia (Boksa, 2008; Khandaker et al., 2013)(39,40).
38
2.5.2 - CNON DEX conclusions
We identified 80 DEX genes at FDR<10% and found an overrepresentation of genes
annotated with gene ontology terms related to processes of cell proliferation, migration and
differentiation, all fundamental aspects of neurodevelopment. We looked for convergence of our
transcriptome data analysis with results of other types of genomic or transcriptomic studies of
SCZ or related psychiatric diseases that were performed on a comprehensive, genome-wide or
transcriptome-wide basis. GWAS is considered the most general way to identify causative
genomic loci associated with common variants. The main mechanistic explanation for
involvement of GWAS loci in disease etiology is through altering gene expression (Gallagher
and Chen-Plotkin, 2018; Tak and Farnham, 2015)(20,21). However, the specific causal SNPs are
generally not known, and neither are the genes which they regulate, or the developmental stage
or type of cells where the regulation is important for the development of the disease. Despite
these complexities, our study shows a significant agreement between our DEX gene list and
genome-wide significant loci in the PGC SCZ2 GWAS (odds-ratio=3.8, p=0.049). Other
genomic approaches, such as identification of de novo non-synonymous mutations in psychiatric
disorders, also provide independent evidence for involvement of some DEX genes with SCZ or
neurodevelopment in general, suggesting that alterations in expression of some genes and
changes in gene function could result in similar phenotype. Finally, a transcriptome study (Topol
et al., 2015)(38) of iPSC-derived NPCs, a similar cellular model of SCZ, agrees with our
conclusion of involvement of Wnt signaling in SCZ etiology.
Analysis of the DEX genes provides insight into the neurodevelopmental processes
altered in SCZ. These genes appear to function in a number of biological pathways and
functions, with the largest group being involved in Wnt signaling in general, and WNT5A
39
signaling in particular. Most DEX genes in this group (co-expressed with DEX gene WNT5A),
as observed in both the hierarchical clustering of DEX genes (Figure 8) and the WGCNA
analysis (Figure 9), have known functions in Wnt signaling. The Wnt signaling pathway was
also found to be over-represented in the PANTHER pathway analysis.
Analysis of individuals by hierarchical clustering and heatmap analysis for WGCNA
module #3, which showed a significant enrichment of DEX genes, identifies a group of
individuals with abnormal expression of genes involved either in WNT5A regulation or
downstream Wnt signaling. This subset is significantly enriched for individuals with SCZ and
may represent a molecular subtype.
In summary, our results show that DEX analysis of CNON cells produces biologically
meaningful results, demonstrates convergence with other genome- and transcriptome-wide
studies of SCZ and related traits, and provides insight into specific mechanisms of
developmental aspects of the disease. We also show that CNON are a good cellular model to
study developmental aspects of brain disorders. Further studies using this model will improve the
mechanistic view of SCZ etiology with finer detail. CNON cells are derived from living
individuals, providing numerous opportunities for personalized medicine at a substantially lower
cost than the development of iPSCs. Cell lines provide additional opportunities to test hypotheses
using molecular tools such as CRISPR, siRNA and miRNA knock-down. These technologies can
be combined with our ongoing epigenetic studies (Rhie et al., 2018)(53) and functional tests for
proliferation, migration, cell adhesion, and to evaluate cellular phenotypes in 2D and 3D
cultures.

40
2.6 - Figures

Figure 2-1: Principal components analysis of genotyping (top) and gene
expression (bottom) data. Genotyping PCA is strongly separated in accordance
with self-reported race/ethnic category, however no such pattern is seen in PCA of
gene expression data.
41

Figure 2-2: Density plot of mean gene expression per gene. Red line indicates
the gene expression cutoff of 3.5 counts per sample on average (baseMean), which
is equivalent to 0.17 counts per million (CPM). Cutoff was chosen liberally to
include even low-expressed genes while still removing the large peak of
unexpressed genes.
42

Figure 2-3: Projection of CNON gene expression profiles (black) onto the first
two principal components of BrainSpan data (colored). Color represents
developmental stages. Weeks p.c. is weeks post conception. Figure shows CNON
samples cluster with mid-fetal brain samples (orange).

43

Figure 2-4: Scatterplot of test statistics (z-statistics) comparing results in
analyses with and without including covariates for racial/ethnic category.
Black dots indicate genes that were significant (FDR < 10%) whether or not such
covariates were used (53 genes). Red dots indicate genes that dropped out of
significance with the addition of the covariates (27 genes). Blue dots indicate genes
that entered into significance with the addition of the covariates (24 genes).

44

Figure 2-5: RNA-Seq and RT-qPCR data of CCL8, HTR2B, PLAT,
PPARGC1A and VAV3 in the same 146 RNA samples normalized to
expression level of ACTB. (A-E) scatterplots showing correlation between
expression measurements in RNA-Seq and RT-qPCR on samples. (F) scatterplot
showing correlation of log-transformed average expression in RNA-Seq to deltaCt
per gene. "r=" in header indicates Pearson correlation coefficient. (G-H) boxplots
showing expression in samples divided into control (CTL) and schizophrenia
(SCZ) groups. "x" indicates mean expression. * designates statistically significant
difference in mean gene expression in RNA-Seq data (FDR < 10%), and #
designates significance of mean gene expression measured by RT-qPCR (p <
0.05).

45

Figure 2-6: QQ plots for expressed genes in this study and Fromer et al.
(2016).

46

Figure 2-7: DEX genes that overlap SCZ GWAS peaks as seen in the PGC2
study.

47

Figure 2-8: Heatmap and hierarchical clustering of DEX genes. Clustering was
performed using average linkage and a distance of one minus the absolute value of
the Pearson correlation coefficient. Lighter color indicates higher correlation.
Genes were assigned to a group (colors in bar above heatmap) based on a cutoff at
clustering distance of 0.6.

48

Figure 2-9: Heatmaps of sample-sample correlation based on genes from
WGCNA module #3. (A) Heatmap for all samples. Color bar at top indicates SCZ
(red) or CTL (blue) status. (B) Heatmap for only CTL samples. (C) Heatmap for
only SCZ samples. Color bar at top indicates membership in a SCZ-enriched
subset of samples (purple) or the main group of samples (yellow).

49

Figure 2-10: Boxplot of the number of genes found to be DEX in subsets
permutation analysis. There were significantly more genes found DEX in
comparisons between subsets of SCZ samples and subsets of CTL samples
(case/control comparisons, right; SCZ-SCZ, median=2.0, mean=2.7; CTL-CTL
median=2.0, mean=3.7) vs. comparisons between subsets of SCZ samples or
subsets of CTL samples (null comparisons, left; medians = 2.2 and 4.4 DEX genes,
means = 5.3 and 8.3, for two sample sizes corresponding to either half of controls
or half of SCZ) (Mann-Whitney test, p < 3.6x10-15). Data points more than 1.5
inter-quartile ranges from the median are not shown.

50

Figure 2-11: Power analysis curves showing expected power to detect genes at
median depth and various fold changes across a range of sample sizes using
BCV of all expressed genes (above) or only DEX genes (below)

51

Figure 2-12: Volcano plot for SCZ vs. CTL DEX comparison. Genes with raw
p < 10
-5
are labeled. Red line indicates FDR < 10%.

52

Figure 2-13: Boxplot of log2 normalized gene expression for a highly
significant autosomal example eQTL separated by minor allele count.

53

Figure 2-14: Sequence length distribution plot for small RNA sequencing after
adapter trimming shows a peak of reads that are about 22-23 bp long, the
right size for miRNA.

54
2.7 - Tables
Table 2-1. Descriptive statistics for individuals in the study.
Balance between SCZ and control groups was tested by Chi-Square test (sex and
race/ethnicity) or by Welch's t-test (age).
Controls SCZ Balance Test P-value
Total 111 144
Sex 0.025
Male 67 (60.4%) 106 (73.6%)
Female 44 (39.6%) 38 (26.4%)
Age 49.9 (S.D. = 12.7) 40.5 (S.D. = 12.1) 6.42E-09
Race/Ethnicity 0.02
Non-Hispanic Caucasian 44 41
African-American 27 62
Hispanic 27 27
Other 13 14

55
Table 2-2. Mean gene expression of marker genes in CNON cells. TPM:
transcripts per million.

Gene Symbol Marker name TPM
glial markers
S100B S100 beta 0.09
GFAP GFAP 0.16
OLIG2 Olig2 0.17
epithelial markers
KRT5 cytokeratin-5 0.09
CDH1 CDH1 0.21
Neuronal markers
UCHL1 PGP9.5 173.55
MAP1A MAP-1a 26.22
MAP1B MAP-1b, MAP5 125.5
TUBB3 β-tubulin 3 2.35
CDH2 N Cadherin 56.48
Markers of differentiated neurons
OMP OMP 0.5
GNAL Golf alpha 0.88
RBFOX3 NeuN 0.31
TH
Tyrosine
hydroxylase 0.28
Neural progenitor markers
NES Nestin 61.21
VIM vimentin 3777.85
REST REST 44.86
NEPRO NEPRO 26.67
Notch signaling
NOTCH2 Notch2 70.21
PSEN1 Presenilin 1 25.53
PSEN2 Presenilin 2 14.23
ADAM10 ADAM10 44.66
ADAM17 ADAM17 39.94
JAG1 Jagged1 11.91
Cell proliferation markers
MKI67 Ki-67 75.98
CCND1 Cyclin D1 537.46
CCNB1 Cyclin B1 288.61

56

Neural Precursor Cell Expressed, Developmentally
Down-Regulated (NEDD) ubiquitin protein ligases
NEDD4 NEDD4 56.05
NEDD9 HEF1 13.85
NEDD8 NEDD8 116.96
NEDD1 NEDD1 59.77
NEDD4L NEDD4-2 13.65
Stemness and proneural markers
POU5F1 Oct-4 2.19
SOX2 SOX2 0.38
ASCL1 Mash1 0.21
NANOG Nanog 0.35
ATOH1 MATH1 0.18
NEUROD6 MATH2 0.19
NEUROD4 Neuro D4 0.15
ATOH7 MATH5 0.31
NEUROG1 Neurogenin 1 0.24
NEUROG2 Neurogenin 2 0.17

57
Table 2-3. Genes significantly DEX between SCZ and Control at FDR < 0.1.
To facilitate the comparison of expression of each gene, normalized read counts
were transformed to transcripts per million transcripts (TPM) using the median
transcript length in Gencode release 22 for each gene as gene length. Gene
symbols and gene types are taken from Gencode release 27 where available. Notes:
1) Gene lies under genome-wide significant PGC SCZ2 GWAS peak; 2) De novo
non-silent mutations in the gene has been identified in patients with SCZ; 3) De
novo missense mutation(s) in the gene has been identified in patients with autism
spectrum disorder; 4) CNVs were identified in multiple patients with SCZ, 5) gene
found significantly associated with psychiatric disorder or related traits (other than
PGC SCZ2 GWAS).
Ensembl ID Gene Symbol Gene Type TPM
Log2
Fold-
Change
FDR
Not
es
ENSG000002780
99 U1 snRNA 4.84 3.184 9.77x10
-10

ENSG000001706
27 GTSF1 protein_coding 0.362 2.162 5.50x10
-07

ENSG000001495
64 ESAM protein_coding 0.347 -1.423 3.79x10
-06

1,
2, 5
ENSG000001154
61 IGFBP5 protein_coding 552 1.045 5.54x10
-04

ENSG000001247
49 COL21A1 protein_coding 0.926 1.184 2.31x10
-03

ENSG000001804
47 GAS1 protein_coding 24.3 0.842 2.47x10
-03

ENSG000001577
66 ACAN protein_coding 0.217 -1.232 5.69x10
-03

ENSG000001645
32 TBX20 protein_coding 0.115 -1.509 5.77x10
-03

ENSG000001359
14 HTR2B protein_coding 2.74 1.169 5.77x10
-03

ENSG000000085
17 IL32 protein_coding 2.33 1.044 5.77x10
-03

ENSG000001651
97 VEGFD protein_coding 0.614 1.125 6.66x10
-03

ENSG000001414
69 SLC14A1 protein_coding 7.48 -1.679 6.66x10
-03

ENSG000002606
04 AL590004.4 lincRNA 0.602 0.813 9.83x10
-03

ENSG000001142
51 WNT5A protein_coding 186 0.521 9.83x10
-03

58
ENSG000001847
31 FAM110C protein_coding 0.591 -1.003 2.71x10
-02
5
ENSG000001624
34 JAK1 protein_coding 81.7 0.180 2.98x10
-02
3
ENSG000001436
31 FLG protein_coding 0.127 -1.248 3.41x10
-02
3
ENSG000001330
48 CHI3L1 protein_coding 1.036 1.513 3.49x10
-02

ENSG000001186
89 FOXO3 protein_coding 9.45 0.207 3.49x10
-02
1, 5
ENSG000002262
37 GAS1RR lincRNA 0.719 0.503 3.49x10
-02

ENSG000001721
56 CCL11 protein_coding 1.45 1.426 3.67x10
-02

ENSG000002305
52 AC092162.2 lincRNA 0.460 0.707 4.12x10
-02

ENSG000002036
48
AC007618.1
processed_
pseudogene
0.662 0.522 4.53x10
-02

ENSG000000767
06 MCAM protein_coding 15.52 -1.012 4.53x10
-02
5, 3
ENSG000001098
19 PPARGC1A protein_coding 2.09 0.715 4.53x10
-02

ENSG000001882
27 ZNF793 protein_coding 8.96 0.244 4.61x10
-02

ENSG000001458
19 ARHGAP26 protein_coding 16.85 0.324 4.61x10
-02
3
ENSG000001203
24 PCDHB10 protein_coding 0.096 0.914 4.64x10
-02

ENSG000001352
50 SRPK2 protein_coding 63.78 0.107 4.91x10
-02
1, 5
ENSG000002334
76
EEF1A1P6
processed_
pseudogene
0.550 0.382 4.91x10
-02

ENSG000001128
52 PCDHB2 protein_coding 1.22 0.647 4.91x10
-02

ENSG000001890
58 APOD protein_coding 3.91 0.885 4.91x10
-02

ENSG000001203
37 TNFSF18 protein_coding 3.05 1.216 4.91x10
-02

ENSG000001971
81 PIWIL2 protein_coding 0.520 -0.718 4.91x10
-02

ENSG000002791
18 AC093535.2 TEC 7.38 0.454 4.91x10
-02

ENSG000002772
32 GTSE1-AS1 lincRNA 0.858 -0.423 4.91x10
-02

ENSG000002355
31 MSC-AS1
antisense_RN
A 36.55 0.343 5.50x10
-02

59
ENSG000001709
21 TANC2 protein_coding 19.16 0.213 5.71x10
-02
2, 5
ENSG000001547
21 JAM2 protein_coding 7.324 0.712 5.92x10
-02

ENSG000001375
58 PI15 protein_coding 0.147 1.091 5.92x10
-02
2
ENSG000001241
07 SLPI protein_coding 1.241 -2.265 5.92x10
-02

ENSG000001091
58 GABRA4 protein_coding 0.362 0.483 5.92x10
-02

ENSG000001542
29 PRKCA protein_coding 89.5 0.191 5.92x10
-02
3
ENSG000001132
05 PCDHB3 protein_coding 0.136 0.846 5.92x10
-02
5
ENSG000002555
83
AC084357.3
transcribed_
unprocessed_
pseudogene
0.819 -0.824 5.92x10
-02

ENSG000001836
90 EFHC2 protein_coding 0.162 0.908 5.92x10
-02

ENSG000001768
96 TCEANC protein_coding 1.374 0.226 5.92x10
-02

ENSG000000216
45 NRXN3 protein_coding 0.615 -0.797 6.20x10
-02

ENSG000002580
71
ARL2BPP2
processed_
pseudogene
2.53 0.300 6.42x10
-02

ENSG000001049
03 LYL1 protein_coding 0.485 -0.502 6.93x10
-02
5
ENSG000001691
22 FAM110B protein_coding 17.59 0.387 7.28x10
-02

ENSG000002456
80 ZNF585B protein_coding 30.62 0.194 7.28x10
-02

ENSG000001693
06 IL1RAPL1 protein_coding 0.074 -0.899 7.28x10
-02
3, 4
ENSG000000683
05 MEF2A protein_coding 24.27 0.212 7.81x10
-02

ENSG000002341
55 LINC02535 lincRNA 0.797 -0.376 7.93x10
-02

ENSG000002726
74 PCDHB16 protein_coding 0.648 0.610 7.93x10
-02
3
ENSG000001113
71 SLC38A1 protein_coding 137 0.273 7.93x10
-02

ENSG000002365
35 RC3H1-IT1 sense_intronic 2.202 0.334 7.93x10
-02

ENSG000001642
65 SCGB3A2 protein_coding 5.455 0.395 7.93x10
-02

ENSG000001638 CLEC3B protein_coding 6.303 0.691 7.93x10
-02

60
15
ENSG000002233
61
FTH1P10
transcribed_
processed_
pseudogene
0.732 -0.906 8.33x10
-02

ENSG000001395
08 SLC46A3 protein_coding 4.206 0.302 8.33x10
-02

ENSG000001739
47 PIFO protein_coding 0.142 0.517 8.52x10
-02

ENSG000001464
53 PNLDC1 protein_coding 0.181 1.274 8.52x10
-02

ENSG000001822
40 BACE2 protein_coding 32.35 0.304 8.58x10
-02

ENSG000001132
69 RNF130 protein_coding 24.13 0.173 8.78x10
-02

ENSG000001399
26 FRMD6 protein_coding 284 0.263 8.81x10
-02

ENSG000002605
22 AC106785.1
processed_
pseudogene 2.283 0.385 8.83x10
-02

ENSG000002792
50 AC022919.1 TEC 1.262 0.419 9.29x10
-02

ENSG000002262
61 AC064836.1
processed_
pseudogene 3.389 -0.519 9.88x10
-02

ENSG000001890
67 LITAF protein_coding 452 0.250 9.88x10
-02

ENSG000001642
44 PRRC1 protein_coding 34.65 0.131 9.88x10
-02

ENSG000002727
69 AC097532.2 lincRNA 0.237 0.476 9.88x10
-02

ENSG000001746
97 LEP protein_coding 1.893 1.342 9.88x10
-02

ENSG000001572
16 SSBP3 protein_coding 4.987 0.264 9.88x10
-02
2
ENSG000002618
24 LINC00662 lincRNA 21.58 0.205 9.88x10
-02

ENSG000001064
59 NRF1 protein_coding 8.061 -0.142 9.88x10
-02

ENSG000001623
83 SLC1A7 protein_coding 0.142 0.828 9.88x10
-02

ENSG000002284
95 LINC01013 lincRNA 1.26 0.921 9.90x10
-02

ENSG000001979
28 ZNF677 protein_coding 20.20 0.171 9.92x10
-02

61
Chapter 3 - Comparison of tools for RNA-Seq differential gene
expression analysis of small effects requiring large sample sizes
3.1 - Abstract
Transcriptome analysis via short read massively parallel sequencing (RNA-Seq) presents
great opportunities for understanding the biology of many diseases. However, the vast majority
of evaluations of differential expression tools are done examining situations with large effect
sizes and consistent effects in small sample sizes; the results of such evaluations may not be
applicable to examining differential expression in other settings. Common complex diseases
which feature small, heterogeneous differences in gene expression that require large sample sizes
to identify, such as many psychiatric disorders, present a substantially different situation that
must be examined to identify optimal methodologies for this application.
To examine relevant gene expression differences for this type of condition, we simulated
data based on studies such as the CommonMind and CNON projects, which use hundreds of
samples to examine schizophrenia. Simulated data was based on a number of statistical
distributions, such as the negative binomial, log-normal, and multinomial, all parameterized to
match characteristics observed in real data. A simulation approach allows the creation of data
matching a wide variety of specifications as well as knowledge of truth; this offers a clearer
examination of correct results compared to consistency based approaches that compare results
between different tools or subsets of a dataset. A variety of analysis programs (e.g., DESeq2,
edgeR, and voom-limma) with different settings were examined, such as the use of different
normalization techniques. Batch effects and other sources of technical variance and methods for
removing them (e.g., using covariates such as batch and SVA) were tested and model selection
62
techniques were employed. Methodologies were evaluated primarily by ROC and precision-
recall curves, and secondly by power and control of the FPR/FDR.
Results show the superiority of newer normalization techniques such as edgeR's TMM
and DESeq2'sRLE methods compared to quantile normalization and scaling to the total number
of reads. Following this the greatest variance between programs is based on what distribution
they expect, with voom-limma performing better on log-normal data and edgeR/DESeq2
performing significantly better on negative binomial data. This highlights the need for a greater
examination of the true/apparent distribution of RNA-Seq data both in general and in a specific
dataset.
3.2 - Introduction
Differential expression (DEX) analysis using short read RNA-Seq is an increasingly
important technique for understanding genetic changes related to disease (Conesa et al., 2016;
Dunning et al., 2008; Mortazavi et al., 2008; Stark et al., 2019). As the cost per read of
sequencing decreases, larger experiments become more feasible allow for the large sample sizes
needed to study disorders with small genetic effect (Costa et al., 2013; Karczewski and Snyder,
2018; Visscher et al., 2017). As most experiments previously featured a small number of
biological replicates, performance testing has focused on such experimental designs, posing the
question of whether the best performance in such scenarios represents the best performance
under other conditions.
Another issue when judging DEX analyses is the background amount of variation
between samples. Evaluations based primarily on variation between technical replicates like with
the SEQC standard (Quinn et al., 2018a; Su et al., 2014) should show only technical variation
63
with the exception of the genes that are differentially expressed. Such data would show a
different distribution of dispersions and relationship between mean and variance than data
between biological replicates with the same genotype (e.g., cell cultures, inbred mouse models),
which would in turn likely show less biological variance than comparisons between genetically-
unrelated humans, due to effects of genotype on gene expression. Many studies, such as that in
the paper introducing voom, admit to featuring less biological variance than would be seen
between independent human beings (Law et al., 2014).
It is important to differentiate between gene-level and transcript-level DEX, as the latter
generally requires probabilistic estimation of which transcript a given read aligns with. With
gene-level analysis, we can look only at reads that uniquely align to one gene; this approach also
applies to analysis of individual exons, splice junctions, ChIP-Seq peaks and other situations.
A number of previous studies have looked at the performance of DEX software.
However, this study features a number of differences, e.g.: 1) The proportion of simulated DEX
genes and the size of their effect is smaller than most other studies; 2) The number of samples is
greatly increased compared to most evaluations; 3) Multiple distributions, most notably negative
binomial and log-normal, are used here, whereas I have been able to find any other studies
departing from Poisson/negative binomial; 4) Performance is evaluated at a range of thresholds,
so that we can observe how they order genes in terms of significance beyond a single cutoff.
Several evaluations (Łabaj and Kreil, 2016; Su et al., 2014) use a threshold where the fold-
change must be greater than 2, which would ignore essentially all differences examined here.
Previous studies have shown that the most impactful step in RNA-Seq DEX analysis is
the normalization/transformation of data and calling of DEX genes, as opposed to read alignment
64
or quantification (Lin et al., 2016; Williams et al., 2017). Some studies specify
normalization/transformation as the main point (Bullard et al., 2010; McGee et al., 2019).
3.2.1 - Different approaches to evaluating DEX algorithms
In attempting to judge differential expression methods, there are two main approaches:
known truth situations based on simulated data or spike-ins, and unknown truth situations where
we judge based on reproducibility (Łabaj and Kreil, 2016; Quinn et al., 2018a). Reproducibility
in Łabaj and Kreil (2016) is defined as calling the same gene differentially expressed based on
technical replicates produced at different sites recreating the same biological model, which may
not be the same as whether genes seen in one biological cohort will be seen again in another
biological cohort examining the same phenotype; the latter is the reproducibility of interest when
studying gene expression in situations such as disease vs. control.
Spike-in studies, such as the SEQC standard, generally feature large and consistent
changes in expression; different samples do not have different levels of the background sample
or of the genes spiked, meaning the data can be expected to follow a Poisson distribution because
there is no variation between samples in the same group. This complete removal of expected
biological variation between independent replicates creates a scenario that greatly departs from
the type of RNA-Seq experiment considered here and cannot be assumed to give accurate results
in terms of the relative performance of programs for such types of experiment.
Simulation studies offer the ability to examine a known-truth situation where samples
show intrinsic biological differences in expression of many genes, not just DEX genes, similar to
that which is found when comparing samples from independent, genetically-unrelated human
beings. The main argument against simulation studies is that we can't be certain simulated data is
65
sufficiently similar to real data to make the performance evaluation meaningful; in particular,
effects such as complicated transcriptome-wide co-expression networks of genes are not
simulated, both because of increased simulation complexity and because it is harder to create
rules for simulating such gene-gene interactions that mimic real world scenarios. Despite
potential differences from real data, however, we can program simulations based on the same
assumptions that are made by differential expression analysis programs, which almost all if not
all similarly assume independence between genes. While no approach is perfect, simulation
analysis was chosen here as most likely to provide useful information about the performance of
various programs in simulations based on different models and parameters.
3.2.2 - Biases for differential expression
Various biases have been reported for tests of differential expression in RNA-Seq, such
as count bias. Count bias is an effect that has been reported where genes are more likely to be
called differentially expressed if there are more reads mapping to them overall. Count bias is also
connected to gene/transcript length bias, as longer transcripts can be expected to have more reads
mapping to them at a given number of copies of the transcript being considered. More in-depth
analysis of this effect has shown that it is largely limited to cases where a negative binomial
model is used on data that has very little over-dispersion, such as technical replicates and, to a
lesser extent, genetically identical cell cultures. In studies comparing gene expression between
genetically distinct individuals or cell cultures, the dispersion parameter of the negative binomial
model is likely to be much larger, as it is related to the actual biological variance between
replicates.
66
3.2.3 - Analysis of existing RNA-Seq datasets
The first step in developing a realistic simulation is an analysis of the basic characteristics
of multiple RNA-Seq datasets. These datasets will be: 1) the ultradeep transcriptome profiling of
human reference RNA performed by the SEQC consortium(Xu et al., 2014, 2016); 2) the
Brainspan project, which includes RNA-Seq from several types of brain tissue across
developmental timepoints starting in early infancy and going into middle adulthood; 3) the Gene-
Tissue Expression project (GTEx), which includes RNA-Seq from many different types of
tissue(Melé et al., 2015); and 4) Cultured Neural progenitor cells derived from Olfactory
Neuroepithelium (CNON).
CNON cells will offer the most important model as they provide the best representation
of the type of differential expression (between people with schizophrenia and controls) that
requires further methodological improvement: a comparison looking at the effects of a complex,
heterogeneous psychological disorder. GTEx data is next most important, as analyses within a
given tissue / type of cell culture will show what we expect are background levels of sample to
sample variation between genetically distinct individuals. Differential expression can also be
seen between different tissues and between the two types of human reference RNA examined by
SEQC; however the extent of differential expression in these cases is qualitatively larger than
that seen between the same tissue or type of derived cell culture in individuals only differing
based on a psychiatric illness. This difference is mostly seen in the distribution of fold changes,
both that there tend to be fewer changes of smaller extent and that these changes are not as
consistent from person to person.
Many parameters will be recorded from analysis, including baseMean, baseVar,
dispersion, moderated dispersion (if applicable), variance explained by known effects (e.g. tissue
67
type, age, sex, library batch, etc.), covariance structure, number of DEX genes tied to known
effects and their distribution of fold changes, and inflation measurements at various quantiles
(0.5, 0.75, 0.9, 0.95, 0.99). In addition to these individual measures, relationships between
measures, most importantly that between mean and variance and that between mean and
dispersion, will be analyzed. Permutation analysis of RNA-Seq data within a given
tissue/condition can show what background levels of differential expression from either natural
biological variation, batch effects, or other factors looks like.
3.2.4 - Normalizations in RNA-Seq data
There are different forms of normalization of RNA-Seq count data. The most important
normalization methods with regard to this study are those comparing the same gene in different
samples of the same study; however this aspect of normalization is mixed in with simpler forms,
such as CPM, RPKM, and TPM normalization. CPM refers to Counts Per Million counts and
refers to dividing the number of counts assigned to a given gene in a given sample to the total
number of reads counted for that sample times one million; it should be noted that it is generally
the total number of counted reads that is used, meaning the number of reads that uniquely map
to genes rather than the total number of reads sequenced for a sample. CPM values are thus
subject to biases when comparing between experiments due to possible differences in reference
genome, reference transcriptome, mapping algorithm, and counting method. CPM values are also
biased in favor of larger genes/transcripts at a given level of expression. If there are two copies
of a 10kb transcript of gene A and 10 copies of a 1kb transcript of gene B, we might expect
roughly two times as many reads for the gene A transcript even though it has only 1/5th the
expression in terms of copies of the mRNA.
68
The Reads Per Kilobase of transcript per Million counted reads measure is used to adjust
CPM values to take into account transcript length by dividing CPM by the length of the
transcript in kilobase; when using RPKM with gene level analysis, the median transcript length
is often used. The Transcripts Per Million transcripts (TPM) value is growing in popularity to be
used in place of RPKM values based on some studies indicating they are more comparable
between experiments. To get TPM values, RPKM values are taken and adjusted so the sum of all
TPM equal one million; the sum of all RPKM values will vary between different experiments.
These measures adjust for transcript length, but still leave other biases in place and may increase
bias between studies with different reference transcriptomes as more information from the
reference is used.
For purposes of comparing a gene's expression in one sample to its expression in another
sample, CPM, RPKM, and TPM all normalize in the same way: by dividing by the total number
of counted reads, also called the library size. However this normalization does not take into
account the compositional nature of the data
In differential expression testing, generally there will be a minimum expression cutoff to
cut down on the number of genes being tested based on an assumption that genes expressed
below a certain point are essentially background noise and/or are not important for cellular
function and/or there is not enough data for a useful analysis. This cutoff can be made on raw
counts or normalized counts, and can be made on either the actual number of counts or on CPM
or TPM values. An important difference between cutoffs based on the number of counts for a
gene and cutoffs based on CPM/TPM values is that increased sequencing depth can increase the
number of counts to an arbitrary level, assuming there are any reads uniquely mapping to the
69
gene at all; this is not true of CPM/TPM values which use the sequencing depth as part of their
calculation.
3.2.5 - Possible statistical models for simulating RNA-Seq datasets
There are 4 statistical distributions to be investigated for suitability at simulating RNA-
Seq data: the normal, the log-normal, the negative binomial, and the multinomial. The normal is
the simplest and most common, however it is likely to be a bad fit to RNA-Seq data except at
very low means with little difference between groups (Gierliński et al., 2015); the main problem
with the normal is that it assumes a constant variance, while RNA-Seq data generally shows an
increased variance as the mean increases. The log-normal is usually implemented by applying a
log transform to the counts or normalized counts; this provides a better fit to the data because it
homogenizes the variances (Figure 1).
Strong theoretical arguments can be made against both the normal and the log-normal,
however. Firstly, RNA-Seq data is count data, meaning it is discrete and hence may be better
modeled by a discrete distribution such as Poisson or negative binomial, rather than a continuous
distribution such as normal or log-normal; to deal with this the simulated counts must be rounded
to integers and hence do not exactly follow a log-normal distribution, although they generally
become non-integer following normalization. Second, the process of sequencing a transcriptome
can be thought to be randomly and independently sampling reads from a pool with replacement,
meaning constant probability; the resulting count is the number of reads mapping to a given gene
together with the given amount of sequencing, which can be considered an offset in the
regression. This should give us data described by a Poisson distribution but only applies to the
case of a constant mean.
70
The data in Figure 1-left shows the mean-variance relationship in the untransformed,
which is similar to that seen with a Poisson distribution, where the variance is the same as the
mean. Where the Poisson distribution would be is at the red line plotting y=x. We find that we
have more variance than is expected based on a Poisson distribution, which is an expected
situation because we do not expect the mean to be constant from person to person. A good model
for modelling an over-dispersed Poisson distribution is the negative binomial distribution, which
has a mean parameter similar to the mean of a Poisson distribution, but also a dispersion
parameter that accounts for the increased variance. A more strict understanding of the negative
binomial distribution can be found in a hierarchical model called the Gamma-Poisson mixture
distribution, for if:
𝜆𝜆 ~ 𝐺𝐺 𝑎𝑎 𝐺𝐺𝐺𝐺 𝑎𝑎 � 𝑟𝑟 ,
1 − 𝑝𝑝 𝑝𝑝 �
𝑋𝑋 | 𝜆𝜆 ~ 𝑃𝑃𝑃𝑃𝑃𝑃 𝑠𝑠 𝑠𝑠 𝑃𝑃 𝑛𝑛 ( 𝜆𝜆 )
Then:
𝑋𝑋 ~ 𝑁𝑁 𝑁𝑁 𝑃𝑃 𝑛𝑛𝑃𝑃 𝐺𝐺 ( 𝑟𝑟 , 𝑝𝑝 )
That is a negative binomial distribution is a Poisson distribution conditional on a given value of
λ, but λ is not a constant but rather a random variable with a Gamma distribution. This Gamma
distribution is then given a biological interpretation as representative as the biological variation
from person to person; if the mean (λ) was the same for each person, then we would have a
Poisson distribution, but there is variation from person to person that gets added on to the
technical variation that is a result of the sampling procedure.
71
The negative binomial distribution is a very practical model that is used by many DE
analysis programs, such as DESeq2, edgeR, and more recent versions of Cufflinks(Love et al.,
2014; Robinson et al., 2009; Trapnell et al., 2012). However, for the purpose of simulating RNA-
Seq data there are some discrepancies between the negative binomial model and the RNA-Seq
process. Specifically, simulating the number of reads going to each gene by a negative binomial
distribution does not take into account the anti-correlation between reads (if a read belongs to
one gene, it can't belong to another). Similarly, if we sequence 1 million reads, we can only have
a maximum sum of counts for that sample of 1 million, and probably fewer due to some reads
not passing quality control or mapping to a known gene.
This process of pulling out reads and then putting them into exactly one gene-bucket is
more closely mirrored by the more complicated multivariate multinomial distribution. Whether
the more complicated multinomial distribution actually offers improvements over the more
commonly used negative binomial is not known due to a lack of previous studies on this topic;
but answering this question is one of the aims of this study.
Simulating a single sample based on the standard multinomial data can be done, however
when the distribution of expression for a given gene across samples is examined, the result will
essentially follow a Poisson distribution, lacking the increased variance seen between
biologically independent samples. In order to use the multinomial model while still simulating
natural variation in gene expression a hierarchical model can be used, similar to the gamma-
Poisson model. A multinomial model is made where each genes expected expression in the
sample is simulated based on a gamma distribution, and then this set of gamma random variables
is divided by their sum to transform them into a set of probabilities that add up to one, giving us
a gamma-multinomial distribution.
72
3.2.6 - Error rates in all-null comparisons
In initial simulations we will leave out the actually DE genes and look at the false
positive rate at a number of cutoffs. We will control for this using standard raw/uncorrected p-
value cutoffs and compare alpha used to the actual type 1 error rate.
3.2.7 - False discovery rates and power in non-all-null comparisons
Further simulations will include break simulated samples into two arbitrary groups of
equal size. Fold-changes in the mean for some subset of genes will be introduced with the log2-
fold-change drawn from a uniform distribution and given a random sign (equal probability of up
or down in group 'b' compared to 'a'). Group 'a' will always give genes their general gene mean,
while group 'b' will always have this mean multiplied by 2^log2FC, increasing or decreasing it.
We will concentrate on fold-changes in the range from 0.5 to 2.
When dealing with a complex disorder, such as schizophrenia, that has a large amount of
genetic heterogeneity, it is important to understand that the fold-change between the groups will
likely only be a fraction of the fold-change in the subset who have alterations of a specific gene.
If a gene has its expression doubled, but only in 10% of people with schizophrenia, then we will
only observe a 10% increase between the groups.
Expected power based on both normal t-test power calculations and the "RNASeqPower"
package in R (assumes negative binomial distribution)(Hart et al., 2013) will be calculated based
on mean expression, dispersion/BCV, and log2FC. Genes sharing similar expected power will be
binned to determine what the actual power is for a given bin / expected power. Further
investigation will reveal what differences from the theoretical model are responsible for
differences between expected and actual power.
73
Finally, area under the curve (AUC) will be calculated for receiver operating
characteristic (ROC) curves and Precision-Recall curves (PRAUC) as part of assessing the
sensitivity and specificity of various RNA-Seq DE analysis programs under various conditions.
3.3 - Methods
3.3.1 - Real RNA-Seq datasets
RNA-Seq data for 47 tissue/cell types were examined from the GTEx project V7 release
(all types with over 100 samples, excluding blood which had over 2400 samples as no other
types had such a large sample size) (Lonsdale et al., 2013). Data was taken at the stage of raw
count files, with read alignment and gene quantification done by the GTEx project. Data were
broken up by tissue type for analysis. Data was heterogeneous as it was sequenced over a long
period in many batches with different kits, read lengths, etc. Datasets separated by tissue type
were not independent as the same individual often had multiple tissue types sequenced.
RNA-Seq data from Cultured Neural progenitor cells derived from Olfactory
Neuroepithelium (CNON) were also used as a reference, in a study comparing expression
between cultures derived from humans with schizophrenia vs. cultures derived from human
controls (Evgrafov et al., 2017).
These datasets were analyzed both on the basis of raw counts and using edgeR with no
model / comparisons. EdgeR was used to normalize data and estimate dispersions as it is
significantly faster than DESeq2. The distribution of average gene expression, the distribution of
gene-level dispersion estimates, and the mean-variance and mean-dispersion relationships were
used in simulation of data. Additionally, the goodness of fit of these datasets to either the
negative binomial or log-normal distributions was assessed.
74
3.3.2 - Data simulation
RNA-Seq uniquely mapped read counts were simulated by a custom-built set of R scripts
based on 3 different distributions: the negative binomial distributions, the log-normal
distribution, and the gamma-multinomial distribution. A range of means was simulated based on
parameters estimated from different real data sets, and roughly paired variances were simulated
to maintain the mean-variance relationship seen in real RNA-Seq data (Figure 2).
Differential expression (DEX) effects were simulated by randomly generating log fold-
changes based from a uniform distribution, with the direction of effect randomly assigned.
Variances for DEX genes were based on the altered mean in altered samples, rather than the
original mean.
3.3.3 - General gene expression parameters
First, for regular (non-dex, non-zero) genes, whose amount is specified as n.genes, each
gene has a p.ex probability of being considered expressed, with log2 mean expression taken from
a normal distribution with parameters mean.ex, sd.ex; otherwise it is considered unexpressed
with log2 mean expression taken from a normal distribution with parameters mean.uex, sd.uex.
Parameters were initially estimated from real data with modifications made to explore specific
aspects of differential expression, such as the effect of including unexpressed genes in the
comparison. DEX genes, whose amount is specified as n.dex, are always considered expressed
for the purpose of simulating their average expression level.
Second, dispersion parameters (ψ) are simulated from a shifted gamma distribution with
parameters psi.offset, psi.shape, and psi.rate. The latter two are the shape and rate parameters for
the gamma distribution, and psi.offset is added to the simulated values. A gamma distribution is
75
used to roughly match the distribution of ψ seen in the expressed genes of real data, taking into
account that ψ is always non-negative; an offset is used based on a biological expectation that
there is always at least some biological or technological variation in the actual samples added to
the technical sampling variation added in sequencing.
Third, DEXness was simulated with effect sizes (log2 fold-changes) taken from a
uniform distribution between min.fc and max.fc, with changes given an equal probability of
increased or decreased expression from the previously simulated overall log2 mean.
Library sizes are also simulated from a uniform distribution between size.min and
size.max, where a size of 1 is equivalent to no change in mean expression based on simulated
sequencing depth.
The final mean for sample i and gene g therefore follows the equation:
𝜇𝜇 𝑖𝑖 , 𝑔𝑔 = 2
𝑏𝑏 𝑔𝑔 + 𝑓𝑓 𝑖𝑖 , 𝑔𝑔 ∗ 𝑠𝑠 𝑖𝑖
Where 𝑏𝑏 𝑔𝑔 is the base log2 mean, 𝑓𝑓 𝑖𝑖 , 𝑔𝑔 is the base log2 fold-change due to simulated DEX
effect, and 𝑠𝑠 𝑖𝑖 is the relative library size.
3.3.4 - Negative binomial model (nb2)
In the negative binomial model, counts are then simulated as
𝐶𝐶 𝑖𝑖 , 𝑔𝑔 ~ 𝑁𝑁𝑁𝑁 ( 𝜇𝜇 𝑖𝑖 , 𝑔𝑔 , ψ
𝑔𝑔 2
)
Based on the NB2 parameterization of the negative binomial distribution (Gierliński et
al., 2015). In this parameterization,
𝐸𝐸 � 𝐶𝐶 𝑖𝑖 , 𝑔𝑔 � = 𝜇𝜇 𝑖𝑖 , 𝑔𝑔
76
And
𝑉𝑉 𝑎𝑎𝑟𝑟 � 𝐶𝐶 𝑖𝑖 , 𝑔𝑔 � = 𝜇𝜇 𝑖𝑖 , 𝑔𝑔 + 𝜇𝜇 𝑖𝑖 , 𝑔𝑔 2
∗ ψ
𝑔𝑔 2

All random number generation from given distributions is performed by function in the
base R library.
3.3.5 - Negative binomial as gamma-Poisson mixture model
The negative binomial model can equivalently be simulated as a mixture model where λ
parameters for Poisson distributions are derived from a gamma distribution:
λ
i, g
~ 𝐺𝐺 𝑎𝑎 𝐺𝐺𝐺𝐺 𝑎𝑎 (
1
ψ
𝑔𝑔 2
,
1
𝜇𝜇 𝑖𝑖 , 𝑔𝑔 ∗ ψ
𝑔𝑔 2
)
𝐶𝐶 𝑖𝑖 , 𝑔𝑔 | λ
i, g
~ 𝑃𝑃𝑃𝑃𝑃𝑃 𝑠𝑠 𝑠𝑠 𝑃𝑃 𝑛𝑛 ( λ
i, g
)
The negative binomial model written out this way partitions variance into two portions:
first, the variance of the gamma distribution represents actual variance between libraries, which
is a mixture of biological variance and technical variance from steps such as library preparation;
second, the variance of the Poisson distribution represents the sampling variance that is
determined by the sequencing process.
Using the delta method to take the expected variance of log-transformed counts, we find
that:
𝐸𝐸 � 𝐶𝐶 𝑖𝑖 , 𝑔𝑔 | λ
i, g
� = λ
i, g

Var �log � 𝐶𝐶 𝑖𝑖 , 𝑔𝑔 � � ≈
1
𝜇𝜇 𝑖𝑖 , 𝑔𝑔 + ψ
𝑔𝑔 2

77
And as we sequence to greater depths we can decrease the sampling variance from
sequencing to arbitrary levels (getting more and more precise estimates of λ
i, g
), but cannot
change the variance that is actually present between one library and another. Properties of λ
i, g

are:
𝐸𝐸 � λ
i, g
� = 𝜇𝜇 𝑖𝑖 , 𝑔𝑔
𝑉𝑉 𝑎𝑎𝑟𝑟 � λ
i, g
� = 𝜇𝜇 𝑖𝑖 , 𝑔𝑔 2
∗ ψ
𝑔𝑔 2

𝐶𝐶 𝑃𝑃 𝑒𝑒 𝑓𝑓𝑓𝑓𝑃𝑃 𝐶𝐶 𝑃𝑃 𝑒𝑒 𝑛𝑛 𝐶𝐶 𝑃𝑃𝑓𝑓 𝑉𝑉 𝑎𝑎𝑟𝑟 𝑃𝑃𝑎𝑎 𝐶𝐶 𝑃𝑃𝑃𝑃 𝑛𝑛 � λ
i, g
� = ψ
g

Which has lead the ψ parameter to sometimes be described as the Biological Coefficient
of Variation (BCV) (McCarthy et al., 2012); however it is important to remember that it
generally includes some technical variance as well from steps such as library preparation.
3.3.6 - Log normal model (ln)
Counts are simulated in a log-normal model such that the mean and variance match that
seen in the negative binomial model. To do this using standard functions in R, first the mean and
variance of the count data must be calculated in terms of the mean and variance of the log
counts, as rlnorm uses that parameterization:
𝐶𝐶 𝑖𝑖 , 𝑔𝑔 ~ 𝐿𝐿 𝑃𝑃 𝐿𝐿 𝑁𝑁𝑃𝑃 𝑟𝑟 𝐺𝐺 𝑎𝑎 𝐿𝐿 ( 𝜃𝜃 𝑖𝑖 , 𝑔𝑔 , 𝜎𝜎 𝑖𝑖 , 𝑔𝑔 2
)
𝐸𝐸 � 𝐶𝐶 𝑖𝑖 , 𝑔𝑔 � = exp � 𝜃𝜃 𝑖𝑖 , 𝑔𝑔 +
𝜎𝜎 𝑖𝑖 , 𝑔𝑔 2
2
� = 𝜇𝜇 𝑖𝑖 , 𝑔𝑔
𝑉𝑉 𝑎𝑎𝑟𝑟 � 𝐶𝐶 𝑖𝑖 , 𝑔𝑔 � = �exp � 𝜎𝜎 𝑖𝑖 , 𝑔𝑔 2
� − 1 � ∗ exp �2 ∗ 𝜃𝜃 𝑖𝑖 , 𝑔𝑔 + 𝜎𝜎 𝑖𝑖 , 𝑔𝑔 2
� = 𝜇𝜇 𝑖𝑖 , 𝑔𝑔 + 𝜇𝜇 𝑖𝑖 , 𝑔𝑔 2
∗ ψ
𝑔𝑔 2

78
𝜃𝜃 𝑖𝑖 , 𝑔𝑔 = log � 𝜇𝜇 𝑖𝑖 , 𝑔𝑔 � −
1
2
∗ log �
1
𝜇𝜇 𝑖𝑖 , 𝑔𝑔 + 𝜓𝜓 𝑔𝑔 2
+ 1 �
𝜎𝜎 𝑖𝑖 , 𝑔𝑔 2
= log (
1
𝜇𝜇 𝑖𝑖 , 𝑔𝑔 + 𝜓𝜓 𝑔𝑔 2
+ 1)
Comparing this to earlier equations, we can see that while the mean and variance of the
data created by this model are the same, the mean and variance of the log distribution are
different. Additionally higher moments of the two datasets are different even on the level of
counts.
As the log-normal distribution does not produce count data, the actual counts passed to
all programs are rounded to the nearest integer. The resulting counts do not include negative
values naturally.
3.3.7 - Gamma-Multinomial mixture model (m1/m2)
Previously discussed models ignore an important facet of RNA-Seq data: the data are
compositional, read only as proportions of the total number of reads sequenced. A gene cannot
have more reads mapped to it than there are reads sequenced, and a read that uniquely maps to
one gene cannot also uniquely map to another. Furthermore, a large increase in the expression of
one gene may "crowd out" other reads, resulting in there being fewer / a smaller proportion of
reads even if those other genes haven't changed their absolute expression level. Previous models
do not take these factors into account as they assume each gene is independent.
A multinomial model recreates these characteristics; however it does not, by itself,
display characteristics seen in real RNA-Seq data such as over-dispersion. A multinomial
distribution by itself will tend to look like a set of Poisson distributions if viewed on a univariate
79
level and the data meet certain assumptions, such as that any gene makes up a small portion of
the overall total.
𝑿𝑿 ~ 𝑀𝑀𝑀𝑀 𝐿𝐿 𝐶𝐶 𝑃𝑃 𝑛𝑛𝑃𝑃 𝐺𝐺𝑃𝑃 𝑎𝑎 𝐿𝐿 ( 𝑛𝑛 , 𝒑𝒑 )
𝐸𝐸 [ 𝑋𝑋 𝑖𝑖 ] = 𝑛𝑛 ∗ 𝑝𝑝 𝑖𝑖
𝑉𝑉 𝑎𝑎𝑟𝑟 [ 𝑋𝑋 𝑖𝑖 ] = 𝑛𝑛 ∗ 𝑝𝑝 𝑖𝑖 ∗ (1 − 𝑝𝑝 𝑖𝑖 )
𝐶𝐶 𝑃𝑃 𝐶𝐶 � 𝑋𝑋 𝑖𝑖 , 𝑋𝑋 𝑗𝑗 � = − 𝑛𝑛 ∗ 𝑝𝑝 𝑖𝑖 ∗ 𝑝𝑝 𝑗𝑗 ( 𝑃𝑃 ≠ 𝑗𝑗 )
If we make the following conversions and assumptions:
𝜆𝜆 𝑖𝑖 = 𝑛𝑛 ∗ 𝑝𝑝 𝑖𝑖
(1 − 𝑝𝑝 𝑖𝑖 ) ≈ 1
𝑝𝑝 𝑖𝑖 ∗ 𝑝𝑝 𝑗𝑗 ≈ 0
Then
𝐸𝐸 [ 𝑋𝑋 𝑖𝑖 ] = 𝜆𝜆 𝑖𝑖
𝑉𝑉 𝑎𝑎𝑟𝑟 [ 𝑋𝑋 𝑖𝑖 ] = 𝜆𝜆 𝑖𝑖
𝐶𝐶 𝑃𝑃 𝐶𝐶 [ 𝑋𝑋 𝑖𝑖 ] = 0
And the random variable representing all data from a given sample is approximately a set of
independent Poisson distributed random variables. This can also be viewed as considering the
multinomial distribution to be a multivariate set of binomial distributions, and then using the law
of rare events to relate a given binomial distribution to a given Poisson distribution (n -> ∞; np -
> λ). This relationship between the two distributions is also taken advantage of by the
80
multinomial-Poisson transformation, which can ease calculations on multinomial distributions
(Baker, 1994; Guimar, 2004).
This distribution, however, does not represent the biological and technical variation
between different samples: this is why the variance does not show the over-dispersion seen in
real biological replicates. To recreate real looking data, more variance can be added using a
hierarchical model where the probability of the outcome being in a given category is itself a
random variable. If we again use a gamma random variable to simulate the mean value in a given
sample/library, take the expected proportions from those gamma variates by normalizing them to
add up to 1, and use those proportions as the probability parameters for a multinomial
distribution simulation (named m1):
λ
i, g
~ 𝐺𝐺 𝑎𝑎 𝐺𝐺𝐺𝐺 𝑎𝑎 (
1
ψ
𝑔𝑔 2
,
1
𝜇𝜇 𝑖𝑖 , 𝑔𝑔 ∗ ψ
𝑔𝑔 2
)
𝑝𝑝 𝑖𝑖 , 𝑔𝑔 =
λ
i, g
∑ λ
i, g 𝑔𝑔
𝐿𝐿 𝑖𝑖 = 𝐿𝐿 ∗ 𝑠𝑠 𝑖𝑖
𝐶𝐶 𝑖𝑖 | 𝑝𝑝 𝑖𝑖 ~ 𝑀𝑀𝑀𝑀 𝐿𝐿 𝐶𝐶 𝑃𝑃 𝑛𝑛𝑃𝑃 𝐺𝐺𝑃𝑃 𝑎𝑎 𝐿𝐿 ( 𝑝𝑝 𝑖𝑖 , 𝐿𝐿 𝑖𝑖 )
We can expand this model to be closer to real RNA-Seq by having buckets in the
multinomial distribution that do not represent reads uniquely mapped to a gene, but instead reads
that don't pass QC, reads that map to multiple locations, and reads that don't map at all. Having
created such non-gene buckets, we can then possibly provide counts falling in them to
downstream analyses; such counts would be meaningless in negative binomial or log-normal
81
simulations as they would be independent of other counts. We call such a distribution with one
additional bucket that is 40% of total counts, m2:
𝑝𝑝 𝑖𝑖 , 𝑔𝑔 + 1
= 0.4 ∗ � 𝑝𝑝 𝑖𝑖 , 𝑔𝑔 𝑔𝑔 𝑗𝑗 = 1

In analyses of m2, the count for quasi-gene g+1 is not passed to the analysis to better
match standard RNA-Seq DEX analysis workflows.
3.3.8 - Simulation parameters
Initial simulations were modeled on parameters derived from the CNON experiment
annotated against Gencode. Sixty thousand genes were simulated, of which 30% were expected
to be expressed. The log2 mean for expressed genes was 6.5 (SD = 2.5), while the log2 mean for
unexpressed genes was -0.7 (SD = 3.7). One thousand genes that are always zero were also
included. The dispersion parameters were sampled from a gamma distribution with shape = 2
and rate = 4 (low dispersion scenario) or rate = 2 (high dispersion scenario) that was offset by
0.21 to avoid very small dispersion parameters. There were 2 groups, with 100 samples in each
group, except where indicated otherwise. The size factors (number of reads relative to the
baseline) were between 2 and 6.
In the non-null scenarios that include actual differentially expressed genes, 200 DEX
genes were additionally simulated with log2 fold-changes uniformly distributed between 0.05
and 1, with an equal chance for increased or decreased expression.
3.3.9 - DEX analysis workflows
49 different workflows were assessed from the possible combinations of X different tests
(DESeq2, edgeR, voom-limma, ALDEx2, and Welch's t-test) with X different methods of
82
normalization/transformation to control for library size (none, scaling by number of reads,
scaling by upper-quartile, quantile normalization, edgeR's TMM, DESeq2's median of ratios,
EdgeR RLE implementation, centered log ratio, and interquartile log ratio). For methods
assuming the normal distribution a log-transformation was also used, with negative binomial
methods similarly using a log link function in a GLM framework. Other workflow options also
tested include: DESeq2's outlier trimming, edgeR's mean-dispersion trend estimation, and the
number of Monte-Carlo instances in ALDEx2 analyses.
Normalization factors from edgeR's "calcNormFactors" function were adjusted to size
factors by multiplying them by the number of reads in the given library/sample to get an
effective library size, and then dividing effective library sizes by their arithmetic mean so that
they average 1. These size factors were used to scale the raw counts into normalized counts for
software that didn't support using an offset as part of the regression and for CPM and related
normalizations.
3.3.10 - DEX performance assessment
A number of metrics were used in assessing performance of DEX analysis workflows:
True Positive Rate (TPR)/Recall/Sensitivity/Power, False Positive Rate (FPR)/Specificity,
empirical False Discovery Rate/Precision. Measurements were made both at fixed cutoffs (e.g.
Benjamini-Hochberg FDR < 0.05) and at numerous cutoffs between 0 and 1 to create a Receiver
Operating Characteristic (ROC) curve or Precision-Recall curve; for these, the area under the
curve (AUC) was used as a general metric for performance of a workflow. PRAUC was used as
the primary measure for comparison between the effectiveness of different programs; ROC AUC
was also examined and almost always showed the same relative performance, however PRAUC
83
is generally considered to be a better metric when looking for something that is rare, as we
hypothesize here DEX genes are.
Metrics were assessed based on a simpler criteria of whether a DEX gene was called as
DEX (recall = n.dex.exp.sig / n.dex.exp), as well as a more complex criteria that adds a
requirement for the DEX effect to be in the correct direction (recall = n.dex.exp.sig.signed /
n.dex.exp). If a gene is called as DEX but has its fold-change in the wrong direction then recall
will always be less than 1, even when the FDR/FPR reaches 1; this provides a penalty for calling
genes in the wrong direction, however the only analysis that calculates the fold-change in a
substantially different way is ALDEx2 with the Wilcoxon test, which uses the difference in the
median rather than the mean. Given this, almost all methods will call the gene changed in the
wrong direction if one does, however the penalty in AUC decreases as the FDR/FPR at which
the bad call is made increases. Evaluation was also done both on all genes in a given experiment
and only on genes considered expressed based on a cutoff (3.5) for the mean of the unnormalized
counts (unnormalized counts are used to establish a universal cutoff despite different
normalization methods).
Performance is compared between different workflows analyzing the same data, not
between different distributions used to simulate data.
3.3.11 - R library versions and source code
Analyses were run in R 3.5.3, with the following versions of libraries used:
DescTools_0.99.28, ALDEx2_1.14.1, edgeR_3.24.3, limma_3.38.3, DESeq2_1.22.2,
SummarizedExperiment_1.12.0, DelayedArray_0.8.0, BiocParallel_1.16.6, matrixStats_0.54.0,
84
Biobase_2.42.0, GenomicRanges_1.34.0, GenomeInfoDb_1.18.2, IRanges_2.16.0,
S4Vectors_0.20.1, BiocGenerics_0.28.0.
The source code for all simulations and analyses of simulated data can be found in a Git
repository at https://github.com/carmoskus/rea_sim
3.4 - Results
3.4.1 - Analysis of real data
Analysis of real data is necessary to match attributes with simulated data, such as the
critical importance of the mean-variance relationship (Law et al., 2014). Normalized counts from
CNON as well as GTEx tissues display variances following the line of equality with the mean
(Poisson distribution behavior) at low means and then increasing so that there is significantly
more variance than would be seen in a Poisson distribution / no biological variance situation. The
dispersion parameters, which represent the parameter controlling the increased variance, are
derived from data either with no model (GTEx data) or with a model that controls for variation
from known factors (CNON data) (Figure 2).
3.4.2 - Examination of data under null conditions
The first group of simulation settings is designed to test for false positives in the absence
of any real positives
While various distributions were used to randomly generate counts, a combination of two
normal distributions was used to represent mean expression levels of genes on the log2 scale
(Figure 3). In addition, a number of genes (default =1000) were also set to have a mean
85
expression of zero to better match observed data; while these will be cut out with any minimum
expression cutoff, it is important to make sure .
Based on the analysis of the estimates of dispersion seen from the CNON data (Figure 4),
we are able to identify that we can produce dispersion values from a transformed fitted gamma
distribution and largely mimic that seen in the data. The main difference is that the plot in Figure
4 is only of the dispersions for expressed genes - the estimates of ψ, which as mentioned is a
coefficient of variation and so scales greatly with very small means; this is not a problem as long
as even the barest minimum expression cutoff is used.
Based on the parameters and distributions stated here, we are able to produce simulated
data that looks very much like the data from our wet-lab work (Figures 2,5).
3.4.3 - Comparison of output from different programs
𝑋𝑋 𝑖𝑖 = 𝐶𝐶𝑃𝑃 𝑀𝑀𝑛𝑛 𝐶𝐶 𝑠𝑠 𝑓𝑓 𝑃𝑃𝑟𝑟 𝐿𝐿 𝑃𝑃 𝐶𝐶𝑒𝑒 𝑛𝑛 𝐿𝐿 𝑒𝑒 𝑛𝑛𝑒𝑒 𝑃𝑃𝑛𝑛 𝑠𝑠 𝑎𝑎 𝐺𝐺𝑝𝑝𝐿𝐿 𝑒𝑒 ′ 𝑃𝑃 ′
𝐴𝐴 =
1
𝑛𝑛 � log( 𝑋𝑋 𝑖𝑖 )
𝑛𝑛 𝑖𝑖 = 1

𝑁𝑁 = log �
1
𝑛𝑛 � 𝑋𝑋 𝑖𝑖 𝑛𝑛 𝑖𝑖 = 1
�
𝐴𝐴 ≠ 𝑁𝑁
Different programs output subtly different statistics for measures like mean expression
and log fold-change. In particular when averaging across a number of samples, we need to pay
attention to whether we are looking at the mean difference in log expression (more common) or
the log of the mean ratio of expression (less common, but in some ways more obvious as a
86
measure. For instance, edgeR adds an "AveLogCPM"/"logCPM" field to output that is the log2
of the mean CPM value for a given gene, however the most common output used from voom-
limma is taking the mean of the voom-transformed (includes log2 transform) counts, either by
the $E field of the voom object or the $Amean field of the linear model fit. The difference
between these statistics can be seen in Figure 6. Either of these slightly different statistics will
commonly be reported as just "logCPM" without mention when averaging across different values
occurred.
There is a similar and likely even more important issue when examining fold-changes;
this is more important because log2 fold-changes are the statistic used in the null hypothesis
significance test that forms the basis of deciding whether or not a gene is DEX. Voom-limma is
consistent in that it always tends to look at the mean of the transformed/normalized counts; this
is because the 'voom' transformation is effectively an add on to the already existing limma
package to adapt it for analysis of RNA-Seq data - the follow up analysis, like calculating the
coefficient for a term can only see the voom transformed data. EdgeR acts differently, as it works
on the counts directly using a log-link, and so it takes the mean fold-change seen in the
normalized counts and then log-transforms it. This and other small differences can result in
substantial disagreements, such as voom-limma stating that a gene is upregulated but edgeR
stating that it is downregulated, though this is not likely to be statistically significant in either
case (Figure 7).
3.4.4 - Effects of normalization
The first change in analysis workflows tested was different normalizations. Primary
results for this section come from configuration "6r4nz0" across all 4 simulated distributions.
Normalizations were tested in a two-way ANOVA with the first factor being different anlaysis
87
programs (edgeR, DESeq2, voom-limma, t-test) and the second factor being the normalization
method; within factor comparisons were analyzed by Tukey's HSD. Figure 8 shows the effect of
normalization on ROC AUC and eFDR: TMM-style normalizations (including TMM, DESeq2's
RLE, and the edgeR RLE implementation) perform the best across all normalizations. The
ranking of normalizations, which was consistent across all tests, is that not adjusting for library
size performs the worst, followed to scaling by the total number of reads (improvement of 0.013
AUC from no normalization), then scaling by upper-quartile (improvement of 0.028 AUC from
no normalization), then TMM-style normalizations (improvement of 0.047 AUC from no
normalization). The different TMM-style normalizations showed no difference with each other
(all adj. p > 0.99). Looking at the other factor, we find that both voom-limma and the t-test
outperform edgeR and DESeq2 on the log-normal simulated data (mean difference of 0.006
AUC), however edgeR and DESeq2 outperform the other two on negative binomial simulated
data as well as on the gamma-multinomial simulated data (mean difference of 0.024). In no case
do we see significant differences between performance of edgeR and DESeq2, or between voom-
limma and a t-test (all adj. p in these comparisons > 0.7). The exact extent of the AUC
differences varies based on simulation parameters, however the ordering of how well different
normalizations fair remains consistent. For example, while voom-limma does not perform
noticeably better in these large sample size simulations, it becomes more effective if a smaller
sample size is used or genes with lower mean counts are tested.
A smaller comparison was also examined with tests limited to voom-limma and the t-test,
in order that quantile normalization could be used. As quantile normalization can produce non-
integer values it cannot be used with edgeR or DESeq2, unless one rounds the resulting values to
pseudocounts. In this analysis quantile out-performs no normalization and scaling to the total
88
number of reads (mean AUC improvement = 0.039 and 0.030, respectively), but not upper-
quartile normalization or TMM-style normalizations (mean AUC difference = -0.0028 and -
0.007).
In negative binomial simulated data using TMM normalization, all tests control the
empirical FDR reasonably well at a Benjamini-Hochberg FDR cutoff of 5%: DESeq2 = 5.4%
eFDR, edgeR = 4.1% eFDR, voom-limma = 4.5% eFDR, t-test = 4.3% eFDR. Results are similar
in gamma-multinomial simulated data. However, test methods assuming a negative binomial
distribution perform substantially worse based on eFDR when run on log-normal simulated data:
DESeq2 = 13.5% eFDR, edgeR = 12.5% eFDR, voom-limma = 4.8% eFDR, t-test = 4.7% eFDR.
3.4.5 - Effects of changing number of DEX genes
More DEX genes effectively reduces max AUC when taking direction of effect into
account because it allows more chances for the effect to be in an incorrect direction, restricting
the TPR below 1 no matter the threshold. This is seen in 6r4nz0 vs. 6r4nz0ng4. This is also
affected by the fold-changes of the DEX genes, with small fold-changes producing more of this
effect than large fold-changes.
3.4.6 - Effects of distribution assumption in different distributions
Results for edgeR indicate a substantially increased FPR compared to the p-value cutoff /
alpha when run on log-normal data; this separation is particularly dramatic when viewed on a -
log10 scale (Figure 9, bottom). While edgeR also shows a much smaller increase in FPR over
alpha at very small alphas (Figure 7, top-right), this was not found to be substantial enough to
affect final results or the eFDR given the p-values return to the expected distribution at
approximately alpha = 5e-08, below probable alpha values used in actual differential gene
89
expression experiments. Results for voom-limma showed no such deviations from the expected
distribution (Figure 10). Analysis of how effective programs are at different sample sizes
shows that increasing the sample-size leads to more distribution dependent effects, with the
performance difference between edgeR/DESeq2 and voom-limma/t-test increasing as sample
size increases (Figure 11).
3.5 - Discussion
3.5.1 - Heterogeneity of effects and small fold-change
Thresholds for the fold-change seen in (Łabaj and Kreil, 2016). Such a threshold
eliminates the possibility of identifying smaller changes, and invalidates evaluations of statistical
performance for tests that use them in situations where smaller changes are expected.
3.5.2 - Issues with replicability as a criterion
Using DEX to look for biomarkers. Replicability may not provide best classification.
Why do measures of replicability not correlate with AUC performance? Replicability measures
are tied in large part to number of genes detected as DEX at given cutoff; AUC avoids single
cutoff, can let programs calibrated differently compete; McGee et al. (2019) have to not use the
tests when judging ALDEx2. Some of these issues can be dealt with by taking the top N genes
instead of genes beyond a certain cutoff. More effects that reproduce aren't always better when
they come with more effects that don't reproduce.
When comparing different analyses we must use a similar prior, especially in the case
where one analysis predicts more DEX genes than the others; if the true number of DEX genes is
the same, and the analyses don't show much overlap, the genes in the analysis predicting fewer
DEX genes should be per-gene more likely to be true? All this depends on similar power, but
90
even with that ... Remember that power changes between different simulations with the same
parameters because the power changes with the random sizes of the fold-changes of DEX genes.
Look at correlation between more true positives and more false positives across the same
analysis on different datasets from same parameters, as a lower alpha/calibration can be expected
to change both.

91
3.6 - Figures

Figure 3-1: Plots of mean-variance relationship in untransformed (left) and
log2-transformed (right) normalized RNA-Seq count data from CNON
project. The untransformed data show a clear increase in variance as the mean
increases; this increase in variance is slightly larger than the directly proportional
scaling we would expect from a Poisson distribution, showing us an overdispersed
Poisson distribution, which can be modeled using a negative binomial distribution.
The log2-transformed data shows that variances are relatively constant across a
wide range of means - an important feature for the log2-transformed data to fit a
normal distribution. An offset of +1 was used with the log2-transformation to
avoid the issue of zeros.

92

Figure 3-2: Relationship between mean and variance in real and simulated
RNA-Seq datasets.

93

Figure 3-3: Histogram of log2-transformed mean expression level for genes,
with overlaid matched simulation distribution (blue line). Analysis of log2-
transformed mean expression levels in CNON data (ignoring zeros) shows a
clearly bimodal distribution that can be well matched by a sum of two normal
distribution. 75% of the genes are considered to fall into the lower "unexpressed"
normal distribution (green line, mean = -0.7, SD = 3.7). A remaining 25% of genes
are considered expressed and distributed based on the second normal distribution
(red line, mean = 9.5, SD = 2.2). The sum of the two normal distributions
approximates the total distribution of non-zero-mean genes.

94

Figure 3-4: Density plot of square root of psi estimates from DESeq2 analysis
of CNON data, with overlaid gamma distribution fitted. Analysis of square
roots of psi estimate (as defined in C1.3) finds they provide a strong match to a
fitted gamma distribution. The gamma distribution also makes theoretical sense as
the variable is relatively normally distributed but bound to always be greater than
zero.

95
Figure 3-5: Plots of log2 mean expression density (top) and mean-variance
relationship (bottom). The top panel shows the good fit between the CNON
density (black) and the simulations expression density (blue). The bottom panel
shows the same mean-variance relationship seen in the original CNON data.

96

Figure 3-6: Scatterplots showing difference in average expression measured
by mean of log CPM vs. log of mean CPM. Top shows comparison between
most common average expression measure from edgeR (log of mean CPM) and
voom-limma (mean of log CPM) for individual expressed genes for negative
binomial (left) and log normal (right) distributed data. Bottom compares log of
mean CPM estimates from both programs; results are not identical due to very
slight differences in transformation. Left top: Pearson's r = 0.977. Left bottom: r =
0.99996. Right top: r = 0.996. Right bottom: r = 0.99997.
97

Figure 3-7: Scatterplots showing difference in log2 fold-change estimates
between edgeR and voom-limma. In negative binomial distributed data (top),
Pearson's r = 0.784. In log normal distributed data (bottom), r = 0.837.

98

Figure 3-8: DEX performance of different analyses using different
normalizations in non-null simulations.

99

Figure 3-9: Mean FPR-plots of null analyses over 1000 simulations for edgeR
analysis of negative binomial (top) and log normal (bottom) data

100

Figure 3-10: Mean FPR-plots of null analyses over 1000 simulations for voom-
limma analysis of negative binomial (top) and log normal (bottom) data
101
A
B

Figure 3-11: Area under the precision-recall curve for different analysis at
different sample sizes in non-null simulations

102
Chapter 4 - Model averaging to improve performance in analysis of
gene expression
4.1 - Abstract
While current methods of transcriptome differential expression analysis are powerful for
understanding the biology of many diseases, there is still room for progress in terms of building
more advanced algorithms that are better able to deal with data created by RNA-Sequencing. In
this chapter, we examine ways in which one type of existing algorithm (assuming a negative
binomial distribution of normalized read counts) compares with another type of existing
algorithm (assuming a log normal distribution of normalized read counts), particularly situations
where one existing algorithm out-performs the other. We then use techniques based on Bayesian
Model Averaging to combine the strengths of both approaches by recognizing when data better
matches the assumptions of one over the other. This new algorithm based on model averaging
outperforms standard algorithms (edgeR and voom-limma) when data are drawn from different
potential distributions, in particular correcting a lack of control of FPR/FDR in negative binomial
based algorithms run on log normal distributed data.
4.2 - Introduction
4.2.1 - Current approaches to differential expression
The field of differential expression analysis of RNA-Seq data is rapidly evolving, but
most current differential expression analyses can be sorted into those based on the negative
binomial distribution (e.g., edgeR) or those effectively based on a log-normal distribution, as
they use a normal linear model following log transformation (e.g., voom-limma) (Law et al.,
2014; Robinson et al., 2009). Previous differential expression performance testing (see chapter 3)
103
based on simulated data showed that analysis programs performed best when data was simulated
according to their assumed model, with the most problematic effects being an increased False
Positive Rate (FPR) / type I error rate at small α cutoffs with analysis software that assumes a
negative binomial distribution (e.g., edgeR, DESeq2) on data that was simulated based on a log-
normal distribution; this resulted in a substantially increased empirical False Discovery Rate
(eFDR) when using the Benjamini-Hochberg False Discovery Rate (FDR) correction to attempt
to control the FDR (Benjamini and Hochberg, 1995). With eFDRs twice the FDR cutoff used or
higher, there were a large number of false positives being discovered by certain commonly used
software when data did not fit a negative binomial distribution. Due to the importance of
properly controlling the FDR, some fix would appear to be necessary; ideally one that would
keep the increased power of negative binomial distribution assumptions without the FPR/FDR
inflation.
4.2.2 - Bayesian model averaging
Bayesian model averaging (BMA) is a set of techniques based on averaging results across
a number of possible models rather than trying to identify a single "best" model and using that
exclusively (Fragoso et al., 2018; Hoeting et al., 1999). Generally BMA is performed by
comparing the likelihoods conditional on the different models; however there are issues with
applying this approach to the problems that are attempting to be solved here. Rather than
averaging across different models meaning different sets of covariates that may or may not be
included in calculations, but are based on the same statistical distribution, we are averaging
across different models meaning different statistical distributions with different properties.
For instance, the negative binomial distribution is discrete while the log normal
distribution is continuous, complicating the problem of comparison. Complicating things further
104
is the fact that the distribution we are interested in is that of the normalized data, which is no
longer exactly discrete. While the negative binomial distribution takes only integer values, the
values of the normalized data are not integers, nor are they regular divisions of integers (1, 1.2,
1.4, etc.). Each value is an integer generally adjusted by a sample-specific, and possibly gene
specific, size factor; quantile normalization would be a case where this type of scaling method is
used, however analyses such as edgeR and DESeq2 cannot use quantile normalization..
The log normal distribution has similar issues, as we do not actually believe data follows
a log normal distribution and do not simulate it exactly as such - simulated counts are randomly
pulled from a log normal distribution as continuous values and then rounded to an integer to
match the output of RNA-Seq analysis. After being converted to integers, the data are
normalized to non-integer values during analysis by the same scaling process as is mentioned,
but this is not meant as a way of reversing the previous rounding.
Because of these issues complicating the comparison of the likelihoods, we chose to use a
different approach by estimating the probability of the data conditional on a given statistical
model by using a goodness of fit test.
4.2.3 - Model averaging by goodness of fit
To put the two statistical distributions on an even footing, p-values from the one sample
Kolmogorov-Smirnov (KS) goodness of fit test will be used to weight the results of the two input
analyses (Birnbaum and Tingey, 1951; Marsaglia et al., 2003). The p-value in null hypothesis
significance testing (NHST) is defined as the probability of getting a test statistic as extreme or
more extreme given the null hypothesis is true (Marden, 2000). In the KS test, the test statistic is
a measure of the deviations between the observed quantiles and the quantiles from the theoretical
105
distribution being compared to. In our case, the null hypothesis is that the data are drawn from
the given distribution, so the p-value represents the probability of seeing such large deviations
from the given distribution by chance. This approach is different from, but mirrors, use of the
conditional likelihoods.
Perhaps the most common use of likelihoods is for deriving maximum likelihood
estimators. By finding the value of the parameter that maximizes the likelihood, we are finding
the value that produces the most probable fit to the data that we observe, whether that is a
general mean or the effect estimate for a covariate. With the KS GOF test, we are trying to figure
out whether one distribution or the other is the more probable fit to the data. A p-value of 1
indicates that the data exactly fit the tested distribution, while a low p-value indicates it is less
likely that the data are drawn from the tested distribution.
4.2.4 - Development of custom RNA-Seq analysis program
Following principles laid down by, for example, Law et al. (2014) that the most
important aspect in analyzing for differential expression is properly estimating the mean,
variance, and mean-variance relationship, we focus on these first two moments when combining
data. This also appears to be sufficient based on the fact that both the negative binomial and log-
normal distributions are two parameter distributions, and so estimating two parameters should be
sufficient to capture the critical information.
Programming framework
Like many popular packages for genomics such as DESeq2, RNA-Seq Expression Analysis
(REA) will be written in R. Code is being tracked using git and hosted on GitHub at
https://github.com/carmoskus/rea_sim.
106
Reuse of proven algorithms with inclusion of new features
Rather than starting from scratch and ignoring the progress that has been made so far, we plan to
extend existing programs. Using model averaging we will pool data from two different programs
(edgeR and voom-limma) to calculate an improved estimate that adjusts itself to be closer to one
program or the other based on the observed distribution of data.
4.3 - Methods
4.3.1 - Simulation of data
Data was simulated under the two most common distributions RNA-Seq count data are
assumed to follow: the negative binomial distribution and the log-normal distribution. Details are
included as part of Chapter 3.
4.3.2 - Testing for differential expression
Two programs were selected to represent analyses based on the negative binomial (NB)
and log normal (LN) distributions: edgeR and voom-limma. In previous analyses described in
Chapter 2, edgeR and DESeq2 (both based on NB distribution) performed similarly, with edgeR
being selected because of greater execution speed. Voom-limma was selected as the most
prominent tool based on the log normal distribution, which in practice means that it assumes the
log-transformed data follow a normal distribution. In both cases edgeR's TMM normalization
was used as it performed the best (alongside DESeq2's RLE) in the Chapter 2 analyses and is the
default recommendation for both edgeR and voom-limma.
4.3.3 - Estimates of fold change
Consider a dataset, X, consisting either of normalized counts
107
We consider estimates for fold-change differences (multiplicative differences in mean
expression) between groups to be estimated as follows for the case where we are considering the
data to fit a log-normal distribution:
𝐴𝐴 𝑔𝑔 =
1
𝑛𝑛 � log( 𝑋𝑋 𝑖𝑖 )
𝑖𝑖 ∈ 𝐺𝐺 𝑔𝑔 = 𝐴𝐴 𝑟𝑟 𝑃𝑃 𝐶𝐶 ℎ 𝑀𝑀 𝑒𝑒 𝑎𝑎 𝑛𝑛 𝑖𝑖 ∈ 𝐺𝐺 𝑔𝑔 [log( 𝑋𝑋 𝑖𝑖 )] = log �𝐺𝐺 𝑒𝑒𝑃𝑃 𝑀𝑀 𝑒𝑒𝑎𝑎 𝑛𝑛 𝑖𝑖 ∈ 𝐺𝐺 𝑔𝑔 𝑋𝑋 𝑖𝑖 �
𝐿𝐿 𝑃𝑃𝐿𝐿 2 𝐹𝐹 𝐶𝐶 𝐴𝐴 = 𝐴𝐴 2
− 𝐴𝐴 1
= log �
𝐺𝐺 𝑒𝑒𝑃𝑃 𝑀𝑀 𝑒𝑒𝑎𝑎 𝑛𝑛 𝑖𝑖 ∈ 𝐺𝐺 2
𝑋𝑋 𝑖𝑖 𝐺𝐺 𝑒𝑒𝑃𝑃 𝑀𝑀𝑒𝑒𝑎𝑎 𝑛𝑛 𝑖𝑖 ∈ 𝐺𝐺 1
𝑋𝑋 𝑖𝑖 �
In the case where we are considering the data to fit a negative binomial distribution, we are
instead representing the fold-change as follows:
𝑁𝑁 𝑔𝑔 = log �
1
𝑛𝑛 � 𝑋𝑋 𝑖𝑖 𝑛𝑛 𝑖𝑖 = 1
� = log �𝐴𝐴 𝑟𝑟 𝑃𝑃 𝐶𝐶 ℎ 𝑀𝑀 𝑒𝑒 𝑎𝑎 𝑛𝑛 𝑖𝑖 ∈ 𝐺𝐺 𝑔𝑔 𝑋𝑋 𝑖𝑖 �
𝐿𝐿 𝑃𝑃𝐿𝐿 2 𝐹𝐹 𝐶𝐶 𝐵𝐵 = 𝑁𝑁 2
− 𝑁𝑁 1
= log �
𝐴𝐴 𝑟𝑟 𝑃𝑃 𝐶𝐶 ℎ 𝑀𝑀 𝑒𝑒 𝑎𝑎 𝑛𝑛 𝑖𝑖 ∈ 𝐺𝐺 2
𝑋𝑋 𝑖𝑖 𝐴𝐴 𝑟𝑟 𝑃𝑃 𝐶𝐶 ℎ 𝑀𝑀 𝑒𝑒 𝑎𝑎 𝑛𝑛 𝑖𝑖 ∈ 𝐺𝐺 1
𝑋𝑋 𝑖𝑖 �
These two measures are not equivalent
4.3.4 - Goodness of fit testing
Goodness of fit testing to either the negative binomial or log-normal distribtions was
performed using a one-sample Kolmogrov-Smirnov test on log2-transformed, TMM-normalized
counts with an offset of 1. Mean and size parameters to be passed to the cumulative distribution
function (CDF) for the negative binomial distribution were calculated as follows:
𝑋𝑋 𝑖𝑖 ~ 𝑁𝑁𝑁𝑁 ( 𝜇𝜇 , 𝑟𝑟 )
𝜇𝜇 =
1
𝑛𝑛 ∗ � 𝑋𝑋 𝑖𝑖
108
𝜎𝜎 2
=
1
𝑛𝑛 − 1
∗ � ( 𝑋𝑋 𝑖𝑖 − 𝜇𝜇 )
2

𝑟𝑟 =
𝜇𝜇 2
max ( 𝜎𝜎 2
− 𝜇𝜇 , 0.1)

The size parameter must be greater than zero, so a minimum value of 0.1 is included to correct
for the case where random sample variation results in the sample variance being less than the
sample mean. A continuity/discreteness correction was added to the negative binomial CDF
using linear interpolation such that NB.CDF(1.5) was halfway between NB.CDF(1) and
NB.CDF(2).
Log-mean and log-SD parameters to be passed to the cumulative distribution function for
the log-normal distribution were calculated as follows:
𝑌𝑌 𝑖𝑖 ~ 𝐿𝐿 𝑃𝑃 𝐿𝐿 𝑁𝑁𝑃𝑃 𝑟𝑟 𝐺𝐺 𝑎𝑎 𝐿𝐿 ( 𝐺𝐺 , 𝑠𝑠 )
𝐺𝐺 = log( 𝜇𝜇 ) −
1
2
∗ log �
𝜎𝜎 2
𝜇𝜇 2
+ 1 �
𝑠𝑠 = � log �
𝜎𝜎 2
𝜇𝜇 2
+ 1 �
4.3.5 - Model averaging
In Bayesian model averaging, the posterior distribution of a quantity β, in this case the
estimate of the log2foldchange between two groups being compared in a differential expression
analysis, given data D and two possible models is:
Pr( 𝛽𝛽 | 𝐷𝐷 ) = � Pr( 𝛽𝛽 | 𝑀𝑀 𝑘𝑘 , 𝐷𝐷 ) ∗ Pr( 𝑀𝑀 𝑘𝑘 | 𝐷𝐷 )
2
𝑘𝑘 = 1

109
Where:
𝑀𝑀 1
= 𝐿𝐿 𝑃𝑃𝐿𝐿 𝑁𝑁 𝑃𝑃𝑟𝑟 𝐺𝐺 𝑎𝑎 𝐿𝐿
𝑀𝑀 2
= 𝑁𝑁 𝑒𝑒 𝐿𝐿 𝑎𝑎 𝐶𝐶 𝑃𝑃 𝐶𝐶𝑒𝑒 𝑁𝑁 𝑃𝑃 𝑛𝑛𝑃𝑃 𝐺𝐺 𝑃𝑃 𝑎𝑎 𝐿𝐿
Pr( 𝑀𝑀 𝑘𝑘 | 𝐷𝐷 ) =
Pr( 𝐷𝐷 | 𝑀𝑀 𝑘𝑘 ) ∗ Pr ( 𝑀𝑀 𝑘𝑘 )
∑ Pr( 𝐷𝐷 | 𝑀𝑀 𝑙𝑙 ) ∗ Pr ( 𝑀𝑀 𝑙𝑙 )
2
𝑙𝑙 = 1

Pr( 𝑀𝑀 1
| 𝐷𝐷 ) =
Pr( 𝐷𝐷 | 𝑀𝑀 1
) ∗ Pr ( 𝑀𝑀 1
)
Pr( 𝐷𝐷 | 𝑀𝑀 1
) ∗ Pr( 𝑀𝑀 1
) + Pr( 𝐷𝐷 | 𝑀𝑀 2
) ∗ Pr ( 𝑀𝑀 2
)
=
1
1 +
Pr( 𝐷𝐷 | 𝑀𝑀 2
) ∗ Pr ( 𝑀𝑀 2
)
Pr( 𝐷𝐷 | 𝑀𝑀 1
) ∗ Pr ( 𝑀𝑀 1
)
=
1
1 + 𝐾𝐾
Pr( 𝑀𝑀 2
| 𝐷𝐷 ) = 1 − Pr( 𝑀𝑀 1
| 𝐷𝐷 ) =
𝐾𝐾 1 + 𝐾𝐾
When:
𝐾𝐾 =
Pr( 𝐷𝐷 | 𝑀𝑀 2
) ∗ Pr ( 𝑀𝑀 2
)
Pr( 𝐷𝐷 | 𝑀𝑀 1
) ∗ Pr ( 𝑀𝑀 1
)

In this case, we assume equal priors for the two distributions so that term cancels out and we get:
𝐾𝐾 =
Pr( 𝐷𝐷 | 𝑀𝑀 2
)
Pr( 𝐷𝐷 | 𝑀𝑀 1
)
≈
𝑝𝑝 . 𝑛𝑛 𝑏𝑏 𝑝𝑝 . 𝐿𝐿𝑛𝑛

Where p.nb is the p-value resulting from the Kolmogorov-Smirnov Goodness of Fit test to the
negative binomial distribution and p.ln is the corresponding p-value from the test fitting the data
to a log-normal distribution.
We use edgeR to derive an estimate of β along with the variance of the estimate assuming
a negative binomial distribution, and use voom-limma to derive the same assuming a log-normal
distribution. We used TMM normalization in both cases, though this is not required. We can then
110
test for significant differential expression by performing a Wald test or likelihood-ratio test. In
such a Wald test:
𝐻𝐻 0
: 𝛽𝛽 = 0
𝐻𝐻 1
: 𝛽𝛽 ≠ 0
𝑧𝑧 =
𝐸𝐸 [ 𝛽𝛽 | 𝐷𝐷 ]
� 𝑉𝑉 𝑎𝑎𝑟𝑟 [ 𝛽𝛽 | 𝐷𝐷 ]

With z being the test statistic that is compared to a standard normal distribution to get a p-value.
The resulting p-value is then most likely corrected for multiple comparisons in a standard way,
such as the Benjamini-Hochberg FDR correction.
The equations used to calculate the quantities that go into z are:
𝛽𝛽 ̂
𝑘𝑘 = 𝐸𝐸 [ 𝛽𝛽 | 𝐷𝐷 , 𝑀𝑀 𝑘𝑘 ]
𝐸𝐸 [ 𝛽𝛽 | 𝐷𝐷 ] = � 𝛽𝛽 ̂
𝑘𝑘 2
𝑘𝑘 = 1
∗ Pr( 𝑀𝑀 𝑘𝑘 | 𝐷𝐷 )
𝑉𝑉 𝑎𝑎𝑟𝑟 [ 𝛽𝛽 | 𝐷𝐷 ] = � � 𝑉𝑉 𝑎𝑎𝑟𝑟 [ 𝛽𝛽 | 𝐷𝐷 , 𝑀𝑀 𝑘𝑘 ] + 𝛽𝛽 ̂
𝑘𝑘 2
� ∗ Pr( 𝑀𝑀 𝑘𝑘 | 𝐷𝐷 ) − 𝐸𝐸 [ 𝛽𝛽 | 𝐷𝐷 ]
2
2
𝑘𝑘 = 1

4.3.6 - Evaluating performance
Performance was evaluated via a number of metrics. First, the actual False Positive Rate
(FPR) was compared to the alpha cutoff used at a number of data points between 0 and 1 to
confirm that the model averaging analysis corrected the inflated type I error rate seen when
edgeR and DESeq2 examine log normally distributed data. Secondly, in simulations that
included actual DEX genes, the area under the precision-recall curve was compared to ensure
111
that overall discriminatory performance was not harmed by use of model averaging. Finally, the
empirical False Discovery Rate (eFDR) was checked at the standard Benjamini-Hochberg False
Discover Rate (BH FDR) cutoff of 5% to make sure that issues previously seen with an inflated
eFDR were corrected. Further details are included as part of Chapter 3.
4.3.7 - R library versions and source code
Analyses were run in R 3.5.3, with the following versions of libraries used:
DescTools_0.99.28, ALDEx2_1.14.1, edgeR_3.24.3, limma_3.38.3, DESeq2_1.22.2,
SummarizedExperiment_1.12.0, DelayedArray_0.8.0, BiocParallel_1.16.6, matrixStats_0.54.0,
Biobase_2.42.0, GenomicRanges_1.34.0, GenomeInfoDb_1.18.2, IRanges_2.16.0,
S4Vectors_0.20.1, BiocGenerics_0.28.0.
The source code for all simulations and analyses of simulated data can be found in a Git
repository at https://github.com/carmoskus/rea_sim
4.4 - Results
4.4.1 - GOF test statistics on simulated data
An initial question is whether the Kolmogorov-Smirnov goodness of fit (GOF) test to the
negative binomial (NB) and log normal (LN) distributions should be performed on normalized or
unnormalized data. Figure 1 (top) shows that the median GOF p-value for fit of NB data to the
NB distribution increases from 0.66 to 0.75 (Wilcoxon p < 2.2e-16), while the fit of NB data to
the LN distribution only increase from 0.14 to 0.17. Figure 1 (bottom) similarly shows that the
median GOF p-value for fit of LN data to LN distribution increases from 0.59 to 0.74 (Wilcoxon
p < 2.2e-16), while the fit of LN data to the NB distribution only increases from 0.18 to 0.22
(Wilcoxon p = 1.66e-10). Based on these results, which are mirrored in other simulations using
112
different parameters, the hypothesis that we get a better fit between the data and the distribution
that simulated when using normalized data is confirmed, and normalized counts were used in all
following analyses.
GOF test p-values show a strong relationship between mean expression of the gene and
the GOF p-value to either distribution (Figures 2 and 3, left side). As the gene's mean expression
lowers, there is a change from generally matching the distribution to almost always showing
significant violations of GOF. This appears to be caused by interaction with the zero-lower
bound for normalized count values. When looking only at expressed genes, this effect is not
prominent (Figures 2 and 3, right side).
Re-examining data from Figure 1, we can see that data generally show a much better fit
to their own distribution than to the other distribution. On expressed genes, XX. Similarly the
median difference in GOF p-value of NB simulated genes to NB vs. to LN is 0.4 (Figure 1,
bottom). This effect can also be seen in scatterplots of GOF test p-values (Figure 4, left), with p-
values mostly in the upper-left side for NB data (fit better to NB than LN) and mostly in the
bottom-right side for LN data (fit better to LN than NB). Plotting the same p-values on a -log10
scale shows the difference is particularly prominent when looking at highly significant p-values
(Figure 4, right).
4.4.2 - Model averaging probabilities
Examining K values based on the scale defined in the methods is analogous to examining
Bayes Factors in the case where prior probabilities are equal in the same way that the ratio of the
p-values is analogous to the ratio of the conditional likelihoods. Plots of K values on the log10
scale clearly show strong trends for genes to be weighted more from the distribution they are
113
actually drawn from (Figure 5). Figure 6 shows the actual weights given to the two models as
inferred from the GOF test statistics of the data and confirms the same trend. Of the genes
considered expressed in the NB dataset, 89.6% were weighted toward the NB model by at least
50% and 23% were weighted toward the NB model by at least 99%.
4.4.3 - Improvement in FPR
Initial tests of the model averaging algorithm were run on the low dispersion, null
parameters to test control of the false positive rate (FPR) by a p-value cutoff. Results for edgeR
indicate a substantially increased FPR compared to the p-value cutoff / alpha when run on log-
normal data; this separation is particularly dramatic when viewed on a -log10 scale (Figure 7,
bottom). While edgeR also shows a much smaller increase in FPR over alpha at very small
alphas (Figure 7, top-right), this was not found to be substantial enough to affect final results or
the eFDR given the p-values return to the expected distribution at approximately alpha = 5e-08,
below probable alpha values used in actual differential gene expression experiments. Results for
voom-limma showed no such deviations from the expected distribution (Figure 8).
Tests of the model averaging (MA) algorithm showed that control of the FPR was
substantially improved from that seen in the edgeR algorithm (Figure 9). While the MA left
behind a slight deviation at very small p-value cutoffs like that seen when edgeR was run on
negative binomial data, the much larger deviation seen when edgeR was run on log normal data
was removed. Over p-value cutoffs from approximately 2e-07 to 0.7 the observed FPR was
actually very slightly below the p-value cutoff. From these observations we can conclude that the
loss of control of the FPR by p-value cutoff when edgeR was run on log normal data has been
corrected. This does not yet address the question of whether the MA algorithm performs better
than voom-limma, as voom-limma also properly controls the FPR via p-value cutoff.
114
4.4.4 - Improvement in eFDR
The most important improvement from model averaging comes from the change in the
empirical False Discovery Rate (eFDR) at a Benjamini-Hochberg FDR cutoff of 5%. When
running edgeR on log-normal distributed data, a BH FDR cutoff of 5% results in an eFDR of
37.5% under high dispersion settings and 10.2% under low dispersion settings; use of the MA
algorithm reduces this to 4.4% and 4.7%, respectively (Figure 10, "v5ln"). This correction of the
eFDR by the MA algorithm is also present when analyzing datasets that are a mix of log normal
and negative binomial distributed genes (Figure 10, "mix1").
4.4.5 - Improvement in Power
While controlling the FPR and the eFDR are critical tests that any analysis method should
pass, given these tests are satisfied we need to consider the power of the analysis method to
differentiate between better or worse methods. In Figure 11, examination of the percentage of
actual log fold changes called significant in the correct direction at an FDR of 5% shows that
edgeR has the highest power across all simulations; however only the high power under the
negative binomial (NB) model can be considered meaningful, as the others inflate the type I error
rate which decreases the type II error rate. Voom-limma does perform the best on log normal
(LN) distributed data, however it performs worse than the MA algorithm on both the NB
distributed data and the mixed distribution data. The MA algorithm outperforms voom-limma in
terms of power in both the low dispersion (median = 51.5% vs. 49.7%) and high dispersion
(median = 29.5% vs. 27.5%) settings. While the power difference is modest, it comes while still
properly controlling the type I error rate and the eFDR; it is also highly robust to variation in
parameters.
115
4.4.6 - Improvement in general performance characteristics
Previous tests are based on comparison at a specific p-value or FDR cutoff. In addition to
this, we can consider the overall performance of an analysis method across a range of cutoffs
from 0 to 1 by looking at the receiver operating characteristic (ROC) curve, the Area Under the
ROC Curve (AUC), and the Area Under the Precision-Recall Curve (PRAUC). This is important
because it is possible for one analysis method to show superior power at one cutoff, but for a
different analysis method to show superior power at a different cutoff. The PRAUC is taken to
be the most important metric here as we generally control differential gene expression
experiments via FDR rather than p-value cutoff.
In analysis of simulated data that contains equal numbers of log normal and negative
binomial distributed genes (Figure 12, "Both"), the MA algorithm outperforms both edgeR
(median increase of 0.030 and 0.098 in low and high dispersion parameters, respectively) and
voom-limma (median increase of 0.025 and 0.040 in low and high dispersion parameters,
respectively).
Accurate estimation of test statistics was also examined, with similar results. As seen in
Figure 13, while the two existing tools were both (non-significantly) better at estimating log2
fold change for different genes when data were generated according to their expectations (edgeR
and voom-limma in NB and LN data, respectively, performed 0.01 and 0.02 better than each
other), the model averaging (MA) algorithm was able to adapted and match performance with
whichever algorithm was better depending on the situation (Tukey's HSD p-value for difference
between edgeR and MA in NB = 0.3; p-value for difference between voom-limma and MA in
LN = 0.12) and outperformed both of them when the testing data included data showing both
distributions (mean R for MA = 0.0046 better than edgeR and 0.0095 better than voom-limma).
116
The same relative performance can be seen in calculations of the root mean square deviation
between a log2FC estimate by the given program and its actual value (Figure 14). The bulk of
error appears to be due to sampling variation, as opposed to the effect of the different algorithms,
as estimates from edgeR and voom-limma agreed with each other more strongly than either of
them agreed with the actual log2 fold change used in simulating the data.
The different performance in estimation of log2 fold change depending on distribution is
predicted by theoretical means. The maximum likelihood estimator (MLE) for the mean of a log-
normal is the geometric mean, rather than the arithmetic mean, since the arithmetic mean is the
MLE for the mean of the normal distribution that the log transformed variates and the MLE is
invariant under transformation (possibly needing to assume a fixed 𝜎𝜎 2
):
𝑋𝑋 𝑖𝑖 ~ 𝐿𝐿 𝑃𝑃 𝐿𝐿 𝑁𝑁𝑃𝑃 𝑟𝑟 𝐺𝐺 𝑎𝑎 𝐿𝐿 ( 𝜇𝜇 , 𝜎𝜎 2
)
log( 𝑋𝑋 𝑖𝑖 ) ~ 𝑁𝑁 𝑃𝑃𝑟𝑟 𝐺𝐺 𝑎𝑎 𝐿𝐿 ( 𝜇𝜇 , 𝜎𝜎 2
)
𝑀𝑀 𝐿𝐿𝐸𝐸 ( 𝜇𝜇 ) =
1
𝑛𝑛 � log( 𝑋𝑋 𝑖𝑖 )
𝑖𝑖 = log �
�
� 𝑋𝑋 𝑖𝑖 𝑖𝑖 𝑛𝑛 �
The different performance in estimation of log2 fold changes was robust even to less
common performance quantification, such as a sign test for whether the estimated log2FC was at
least in the same direction as the actual log2FC, regardless of significance (Figure 15).
4.5 - Discussion
The model averaging framework presented here shows numerous advantages over using a
single-model framework regardless of what the data look like, in particular by properly
117
controlling the actual/effective false discovery rate (eFDR) when using a Benjamini-Hochberg
FDR correction with edgeR/DESeq2 on data that includes log-normal distributed genes.
The framework described here is ready to be expanded to include other possible models,
such as the zero-inflated Poisson model sometimes used in single-cell data or the Dirichlet-
Multinomial model used to analyze read counts as compositional data (Fernandes et al., 2014;
Kharchenko et al., 2014; Ning Li et al., 2011; Quinn et al., 2018b).
4.5.1 - Extensions of goodness of fit testing
Currently goodness of fit (GOF) testing is run on the normalized counts without
adjustment for any variates or covariates; this could lead to issues when a (co)variate
significantly effects the distribution of counts. In one extreme case, a variate could describe a
large enough separation between the means of the two groups being compared that the data
becomes bimodal, which fits none of the examined models.
Correction for this can be attempted by performing GOF testing on residualized data after
the effects of (co)variates are subtracted out. In the simple two-group model used here this can
be considered less necessary as GOF is simply being performed under the null hypothesis that
there is no effect of the primary variate. Additionally effects of (co)variates should impact fit to
the log-normal and negative binomial distributions similarly.
4.5.2 - Other possible applications of model averaging in DEX analysis
Bayesian Model Averaging (BMA) is more commonly applied to average across models
using different sets of covariates, however this is not yet common in differential expression
analysis of RNA-Seq data. Covariates used in differential expression analysis include both
known covariates, such as sex or library batch, and unknown covariates, which are estimated by
118
tools such as SVA and PEER based on the data. The same question as that considered here for
choosing between different model distributions will need to be asked for such covariates: should
model averaging happen on a dataset-wide basis or on a gene-by-gene basis.
Model averaging offers an alternative to consensus techniques that aim to identify DEX
genes based on whether multiple programs agree. This is sometimes seen as a solution to the
problem of there being no single algorithm that appears to work best in all cases, however this
method does not use information from the dataset being analyzed to guide its choices; if such
information can be found to be informative in real datasets, algorithms that such data, as in
model averaging, would most likely be found to have improved performance compared to
algorithms that don't.

119
4.6 - Figures

Figure 4-1: Comparison of GOF p-values on unnormalized vs. normalized
data. Comparison is on only expressed genes in the low dispersion, null analysis.

120

Figure 4-2: Relationship between mean expression and Kolmogorov-Smirnov
Goodness of Fit test p-values on negative binomial distribution simulated
data. GOF p-values for fit to both the negative binomial and log-normal
distributions show a trend of increasingly severe deviations as mean expression
decreases. In order to make all datapoints visible on a log-scale, p-values of 0 are
replaced with the smallest non-zero p-value time 1e-3.

121

Figure 4-3: Relationship between mean expression and Kolmogorov-Smirnov
Goodness of Fit test p-values on log-normal distribution simulated data. GOF
p-values for fit to both the negative binomial and log-normal distributions show a
trend of increasingly severe deviations as mean expression decreases. In order to
make all datapoints visible on a log-scale, p-values of 0 are replaced with the
smallest non-zero p-value time 1e-3.

122

Figure 4-4: Scatterplots of p-values for GOF tests to the log normal and
negative binomial distributions

123

Figure 4-5: Histograms of log10(K) values for a negative binomial (NB)
dataset (top) and a log normal (LN) dataset (bottom). Both datasets are from
the null condition with the low dispersion parameters.

124

Figure 4-6: Histograms of inferred probability of gene being distributed
negative binomial (NB) given the data (D).

125

Figure 4-7: Mean FPR-plots of null analyses over 1000 simulations for edgeR
analysis of negative binomial (top) and log normal (bottom) data. In all
pictures, the blue vertical line represents the 50th percentile and the green vertical
line the 10th percentile in each scale.

126

Figure 4-8: Mean FPR-plots of null analyses over 1000 simulations for voom-
limma analysis of negative binomial (top) and log normal (bottom) data. In all
pictures, the blue vertical line represents the 50th percentile and the green vertical
line the 10th percentile in each scale.

127

Figure 4-9: Mean FPR-plots of null analyses over 1000 simulations for gene-
level model averaging analysis of negative binomial (top) and log normal
(bottom) data. In all pictures, the blue vertical line represents the 50th percentile
and the green vertical line the 10th percentile in each scale.

128

Figure 4-10: FDR boxplots under low and high dispersion simulation
parameters

129

Figure 4-11: Power boxplots under low (top) and high (bottom) dispersion
parameters. Analyses are grouped by distribution (LN = log normal, NB =
negative binomial) and by analysis method. Power is calculated as the percentage
of true log fold changes called DEX at an FDR of 5% in the correct direction.

130

Figure 4-12: Area Under the Precision-Recall Curve (PRAUC) boxplots under
low (top) and high (bottom) dispersion simulation parameters across 1000
simulations. Analyses are grouped by distribution (LN = log normal, NB =
negative binomial) and by analysis method.

131

Figure 13: Correlation of log2FC estimates with actual values used in
simulation. EdgeR performs significantly better than voom-limma in negative
binomial simulated data and vice-versa in log normal simulated data (both p <
2.2e-16). Tukey's HSD p-value for difference between edgeR and MA in NB = 0.3;
p-value for difference between voom-limma and MA in LN = 0.12. MA performs
significantly better than either on data that includes both distributions equally (p <
2.2e-16).
132

Figure 14: RMSE in estimates of log2FC by different methods. Analysis of
accuracy of estimates of log2 fold change (log2FC) by root mean square error
(RMSE) demonstrates same ranking as by Pearson correlation (Figure 13) in terms
of performance. Model Averaging (MA/aim2_v6) algorithm shows no significant
difference from better performing algorithm when data is generated according to
that algorithms assumptions, but does significantly better on data overall due to
adaptation to different distributions.

133

Figure 15: Hit percentage for matching the sign of an actual simulated
differential expression effect. Boxplots are over 1000 simulations and represent
the proportion of estimates of log2 fold change (log2FC) for genes that had a
differential expression effect were in the correct direction (upregulated vs.
downregulated).

134
Chapter 5 - Conclusions and Future Directions
5.1 - Performance of existing programs
The performance of existing programs was analyzed and the different algorithms
compared. Two major conclusions emerged from this analysis. First, normalization appears to be
the most impactful step, with older normalizations such as scaling by the number of reads
performing very poorly in "breakdown" situations where there are large changes in the number
of reads coming from a small number of genes. This can occur because of differentially
expressed genes, or it can occur simply because of highly expressed genes showing large
amounts of variation from sample to sample without regard for groups being compared. In either
case, performance based on either PRAUC or on controlling the FDR properly suffers
substantially. The best algorithms for normalization appear to be newer algorithms invented with
these issues that are part of the compositional nature of RNA-Seq data in mind: edgeR's TMM
and DESeq2's RLE. There were no noticeable differences seen between these two algorithms, or
between the implementation of RLE by DESeq2 and its implementation in the edgeR package.
These algorithms can and should be used even when not using edgeR/DESeq2, such as when
using voom-limma, a simple t-test, or other tests for differential expression. One exception to
this is for the ALDEX package, which directly models the data as compositional using a
multinomial-Dirichlet Bayes conjugate system that appears to correctly adjust for the biases seen
in older normalization algorithms.
Second, whether the statistical model assumed by the analysis package is actually the
distribution under which gene expression was simulated also has a substantial effect, though not
as large as normalization. In reviewing the literature and other tools used to simulate RNA-Seq
data, all papers/tools found used the negative binomial distribution or Poisson distribution
135
(which can be considered a form of the negative binomial) to simulate the data. This is despite
the popularity of voom-limma, which assumes that the errors will be normally distributed after a
log transformation. Thus, the evaluation of differential expression programs on log-normal
simulated data was novel and revealed an issue that does not appear to be previously reported:
namely that the actual FDR is not well controlled by edgeR and DESeq2, or presumably by other
packages assuming a negative binomial distribution, when the data actually follows a log-normal
distribution. EdgeR and DESeq2 both feature this problem to a similar extent. The performance
of methods assuming log normality, such as voom-limma or the t-test after log transformation,
control the FDR well under all tested statistical models, though they do see an increase in
performance as judged by PRAUC when working on log-normal data relative to negative
binomial data in comparison to the performance of edgeR/DESeq2 on those same datasets.
5.2 - Advances with model averaging algorithm
The model averaging algorithm has shown that it can effectively adjust the statistical
model being used to look for differential expression effects based on the observed distribution of
the data. This corrects inflated false positive rates / false discovery rates, while maintaining the
power that the individual analyses have on data that matches their expectations.
Increasing the statistical power and accuracy of differential expression algorithms can
lead to a better understanding of diseases examined by differential expression, eventually leading
to improved treatments.

136
References
Bainbridge, M.N., Warren, R.L., Hirst, M., Romanuik, T., Zeng, T., Go, A., Delaney, A.,
Griffith, M., Hickenbotham, M., Magrini, V., et al. (2013). RNA-Seq optimization with eQTL
gold standards. BMC Genomics 7, 892.
Baker, S.G. (1994). The Multinomial-Poisson Transformation. J. R. Stat. Soc. 36, 317–329.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and
Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B 57, 289–300.
Birnbaum, Z.W., and Tingey, F.H. (1951). One-Sided Confidence Contours for Probability
Distribution Functions. Ann. Math. Stat. 22, 592–596.
Boksa, P. (2008). Maternal infection during pregnancy and schizophrenia. J. Psychiatry
Neurosci. 33, 183–185.
Bray, N.L., Pimentel, H., Melsted, P., and Pachter, L. (2016). Near-optimal probabilistic RNA-
seq quantification. Nat. Biotechnol. 34, 525–527.
Brennand, K., Savas, J.N., Kim, Y., Tran, N., Simone, A., Hashimoto-Torii, K., Beaumont, K.G.,
Kim, H.J., Topol, A., Ladran, I., et al. (2015). Phenotypic differences in hiPSC NPCs derived
from patients with schizophrenia. Mol. Psychiatry 20, 361–368.
Brennand, K.J., Simone, A., Jou, J., Gelboin-burkhart, C., Tran, N., Sangar, S., Li, Y., Mu, Y.,
Chen, G., Yu, D., et al. (2011). Modelling schizophrenia using human induced pluripotent stem
cells. Nature 473, 221–225.
Bullard, J.H., Purdom, E., Hansen, K.D., and Dudoit, S. (2010). Evaluation of statistical methods
for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics.
137
Cascella, N.G., Takaki, M., Lin, S., and Sawa, A. (2007). Neurodevelopmental involvement in
schizophrenia: The olfactory epithelium as an alternative model for research. J. Neurochem. 102,
587–594.
Conesa, A., Madrigal, P., Tarazona, S., Gomez-Cabrero, D., Cervera, A., McPherson, A.,
Szcześniak, M.W., Gaffney, D.J., Elo, L.L., Zhang, X., et al. (2016). A survey of best practices
for RNA-seq data analysis. Genome Biol. 17, 1–19.
Costa, V., Aprile, M., Esposito, R., and Ciccodicola, A. (2013). RNA-Seq and human complex
diseases: Recent accomplishments and future perspectives. Eur. J. Hum. Genet. 21, 134–142.
Cross-Disorder Group of the Psychiatric Genomics Consortium, Lee, S.H., Ripke, S., Neale,
B.M., Faraone, S. V, Purcell, S.M., Perlis, R.H., Mowry, B.J., Thapar, A., Goddard, M.E., et al.
(2013). Genetic relationship between five psychiatric disorders estimated from genome-wide
SNPs. Nat. Genet. 45, 984–994.
Delaneau, O., Marchini, J., and Zagury, J.F. (2012). A linear complexity phasing method for
thousands of genomes. Nat. Methods 9, 179–181.
Delaneau, O., Marchini, J., McVeanh, G.A., Donnelly, P., Lunter, G., Marchini, J.L., Myers, S.,
Gupta-Hinch, A., Iqbal, Z., Mathieson, I., et al. (2014). Integrating sequence and array data to
create an improved 1000 Genomes Project haplotype reference panel. Nat. Commun. 5.
Dunning, M.J., Barbosa-Morais, N.L., Lynch, A.G., Tavaré, S., and Ritchie, M.E. (2008).
Statistical issues in the analysis of Illumina data. BMC Bioinformatics 9, 85.
Evgrafov, O. V., Wrobel, B.B., Kang, X., Simpson, G., Malaspina, D., and Knowles, J.A.
(2011). Olfactory neuroepithelium-derived neural progenitor cells as a model system for
138
investigating the molecular mechanisms of neuropsychiatric disorders. Psychiatr. Genet. 21,
217–228.
Evgrafov, O. V, Armoskus, C., Wrobel, B.B., Spitsyna, V.N., Souaiaia, T., Herstein, J.S.,
Walker, C.P., Nguyen, J.D., Camarena, A., Weitz, R., et al. (2017). Gene expression in patient-
derived neural progenitors provide insights into neurodevelopmental aspects of schizophrenia.
Fernandes, A.D., Reid, J.N.S., Macklaim, J.M., Mcmurrough, T.A., Edgell, D.R., Gloor, G.B.,
Fernandes, D., A., Reid, J., et al. (2014). Unifying the analysis of high-throughput sequencing
datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments
by compositional data analysis. Microbiome 2, 1–13.
Fragoso, T.M., Bertoli, W., and Louzada, F. (2018). Bayesian Model Averaging: A Systematic
Review and Conceptual Classification. Int. Stat. Rev. 86, 1–28.
Fromer, M., Roussos, P., Sieberts, S.K., Johnson, J.S., Kavanagh, D.H., Perumal, T.M.,
Ruderfer, D.M., Oh, E.C., Topol, A., Shah, H.R., et al. (2016). Gene expression elucidates
functional impact of polygenic risk for schizophrenia. Nat. Neurosci. 19, 1442–1453.
Gallagher, M.D., and Chen-Plotkin, A.S. (2018). The Post-GWAS Era: From Association to
Function. Am. J. Hum. Genet. 102, 717–730.
Gierliński, M., Cole, C., Schofield, P., Schurch, N.J., Sherstnev, A., and Barton, G.J. (2015).
Statistical models for RNA-seq data derived from a two-condi- tion 48-replicate experiment.
Arxiv 1–15.
Gilad, Y., Rifkin, S.A., and Pritchard, J.K. (2008). Revealing the architecture of gene regulation:
the promise of eQTL studies. Trends Genet. 24, 408–415.
139
Goldner, E.M., Hsu, L., Waraich, P., and Somers, J.M. (2002). Prevalence and incidence studies
of schizophrenic disorders: a systematic review of the literature. Can. J. Psychiatry. 47, 833–843.
Guimar, P. (2004). Understanding the multinomial-Poisson transformation. 265–273.
Hart, S.N., Therneau, T.M., Zhang, Y., Poland, G.A., and Kocher, J.-P. (2013). Calculating
Sample Size Estimates for RNA Sequencing Data. J. Comput. Biol. 20, 970–978.
Hauberg, M.E., Zhang, W., Giambartolomei, C., Franzén, O., Morris, D.L., Vyse, T.J.,
Ruusalepp, A., Fromer, M., Sieberts, S.K., Johnson, J.S., et al. (2017). Large-Scale Identification
of Common Trait and Disease Variants Affecting Gene Expression. Am. J. Hum. Genet. 100,
885–894.
Hoeting, J.A., Madigan, D., Raftery, A.E., and Volinsky, C.T. (1999). Bayesian Model
Averaging : A Tutorial. 14, 382–417.
Howie, B.N., Donnelly, P., and Marchini, J. (2009). A Flexible and Accurate Genotype
Imputation Method for the Next Generation of Genome-Wide Association Studies. PLoS Genet.
5, e1000529.
Karczewski, K.J., and Snyder, M.P. (2018). Integrative omics for health and disease. Nat. Rev.
Genet. 19, 299–310.
Karlen, Y., McNair, A., Perseguers, S., Mazza, C., and Mermod, N. (2007). Statistical
significance of quantitative PCR. BMC Bioinformatics 8, 1–16.
Katoh, M., and Katoh, M. (2007). STAT3-induced WNT5A signaling loop in embryonic stem
cells, adult normal tissues, chronic persistent inflammation, rheumatoid arthritis and cancer
(Review). Int. J. Mol. Med. 19, 273–278.
140
Khandaker, G.M., Zimbron, J., Lewis, G., and Jones, P.B. (2013). Prenatal maternal infection,
neurodevelopment and adult schizophrenia: A systematic review of population-based studies.
Psychol. Med. 43, 239–257.
Kharchenko, P. V., Silberstein, L., and Scadden, D.T. (2014). Bayesian approach to single-cell
differential expression analysis. Nat. Methods 11, 740–742.
Kichaev, G., Bhatia, G., Loh, P.R., Gazal, S., Burch, K., Freund, M.K., Schoech, A., Pasaniuc,
B., and Price, A.L. (2019). Leveraging Polygenic Functional Enrichment to Improve GWAS
Power. Am. J. Hum. Genet. 104, 65–75.
Kozomara, A., and Griffiths-Jones, S. (2011). miRBase: integrating microRNA annotation and
deep-sequencing data. Nucleic Acids Res. 39, D152–D157.
Łabaj, P.P., and Kreil, D.P. (2016). Sensitivity, specificity, and reproducibility of RNA-Seq
differential expression calls. Biol. Direct 11, 66.
Langfelder, P., and Horvath, S. (2008). WGCNA: An R package for weighted correlation
network analysis. BMC Bioinformatics 9.
Lavoie, J., Sawa, A., and Ishizuka, K. (2017). Application of olfactory tissue and its neural
progenitors to schizophrenia and psychiatric research. Curr. Opin. Psychiatry 30, 176–183.
Law, C.W., Chen, Y., Shi, W., and Smyth, G.K. (2014). voom: precision weights unlock linear
model analysis tools for RNA-seq read counts. Genome Biol. 15, R29.
Lee, J.J., Wedow, R., Okbay, A., Kong, E., Maghzian, O., Zacher, M., Nguyen-Viet, T.A.,
Bowers, P., Sidorenko, J., Karlsson Linnér, R., et al. (2018). Gene discovery and polygenic
prediction from a genome-wide association study of educational attainment in 1.1 million
141
individuals. Nat. Genet. 50, 1112–1121.
Lee, S.H., Ripke, S., Neale, B.M., Faraone, S. V., Purcell, S.M., Perlis, R.H., Mowry, B.J.,
Thapar, A., Goddard, M.E., Witte, J.S., et al. (2013). Genetic relationship between five
psychiatric disorders estimated from genome-wide SNPs. Nat. Genet. 45, 984–994.
Leek, J.T. (2011). Asymptotic Conditional Singular Value Decomposition for High-Dimensional
Genomic Data. Biometrics 67, 344–352.
Leek, J.T. (2014). Svaseq: Removing batch effects and other unwanted noise from sequencing
data. Nucleic Acids Res. 42, e161.
Leek, J.T., Johnson, W.E., Parker, H.S., Jaffe, A.E., and Storey, J.D. (2012). The SVA package
for removing batch effects and other unwanted variation in high-throughput experiments.
Bioinformatics 28, 882–883.
Lewis, D.A., and Levitt, P. (2002). Schizophrenia as a Disorder of Neurodevelopment. Annu.
Rev. Neurosci. 25, 409–432.
Li, H.-D. (2018). GTFtools: a Python package for analyzing various modes of gene models.
BioRxiv.
Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler
transform. Bioinformatics 25, 1754–1760.
Li, M., Santpere, G., Kawasawa, Y.I., Evgrafov, O. V., Gulden, F.O., Pochareddy, S., Sunkin,
S.M., Li, Z., Shin, Y., Zhu, Y., et al. (2018). Integrative functional genomic analysis of human
brain development and neuropsychiatric risks. Science (80-. ). 362.
Li, P., Piao, Y., Shon, H.S., and Ryu, K.H. (2015). Comparing the normalization methods for the
142
differential analysis of Illumina high-throughput RNA-Seq data. BMC Bioinformatics 16, 347.
Li, Z., Chen, J., Yu, H., He, L., Xu, Y., Zhang, D., Yi, Q., Li, C., Li, X., Shen, J., et al. (2017).
Articles Genome-wide association analysis identifies 30 new susceptibility loci for
schizophrenia. 49.
Lin, X., Yang, L., Wang, G., Zi, F., Yan, H., Guo, X., Chen, J., Chen, Q., Huang, X., Li, Y., et
al. (2017). Interleukin-32α promotes the proliferation of multiple myeloma cells by inducing
production of IL-6 in bone marrow stromal cells. Oncotarget 8, 92841–92854.
Lin, Y., Golovnina, K., Chen, Z., Lee, H.N., Negron, Y.L.S., Sultana, H., Oliver, B., Harbison,
S.T., Golovnina, K., Negron, Y.L.S., et al. (2016). Comparison of normalization and differential
expression analyses using RNA-Seq data from 726 individual Drosophila melanogaster. BMC
Genomics 17, 1–20.
Loh, P., Bhatia, G., Gusev, A., Finucane, H.K., Bulik-sullivan, B.K., Pollack, S.J., Working, S.,
Consortium, G., Candia, T.R. De, Lee, S.H., et al. (2015). a n a ly s i s Contrasting genetic
architectures of schizophrenia and other complex diseases using fast variance-components
analysis. Nat. Publ. Gr. 47, 1385–1392.
Lonsdale, J., Thomas, J., Salvatore, M., Phillips, R., Lo, E., Shad, S., Hasz, R., Walters, G.,
Garcia, F., Young, N., et al. (2013). The Genotype-Tissue Expression (GTEx) project. Nat.
Genet. 45, 580–585.
Love, M.I., Huber, W., and Anders, S. (2014). Moderated estimation of fold change and
dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550.
Luykx, J.J., Boks, M.P.M., Terwindt, A.P.R., Bakker, S., Kahn, R.S., and Ophoff, R.A. (2010).
143
The involvement of GSK3β in bipolar disorder: Integrating evidence from multiple types of
genetic studies. Eur. Neuropsychopharmacol. 20, 357–368.
Majewski, J., and Pastinen, T. (2011). The study of eQTL variations by RNA-seq: From SNPs to
phenotypes. Trends Genet. 27, 72–79.
Marden, J.I. (2000). Hypothesis Testing: From p Values to Bayes Factors. J. Am. Stat. Assoc.
95, 1316.
Marsaglia, G., Tsang, W.W., and Wang, J. (2003). Evaluating Kolmogorov’s Distribution. J.
Stat. Softw. 8.
McCarthy, D.J., Chen, Y., and Smyth, G.K. (2012). Differential expression analysis of
multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 40,
4288–4297.
McGee, W.A., Pimentel, H., Pachter, L., and Wu, J.Y. (2019). Compositional Data Analysis is
necessary for simulating and analyzing RNA-Seq data. BioRxiv 564955.
McKenzie, M., Henders, A.K., Caracella, A., Wray, N.R., and Powell, J.E. (2014). Overlap of
expression Quantitative Trait Loci (eQTL) in human brain and blood. BMC Med. Genomics 7,
31.
Melé, M., Ferreira, P.G., Reverter, F., Deluca, D.S., Monlong, J., Sammeth, M., Young, T.R.,
Goldmann, J.M., Pervouchine, D.D., Sullivan, T.J., et al. (2015). The human transcriptome
across tissues and individuals. Science (80-. ). 348, 660–665.
Mi, H., Poudel, S., Muruganujan, A., Casagrande, J.T., and Thomas, P.D. (2016). PANTHER
version 10: Expanded protein families and functions, and analysis tools. Nucleic Acids Res. 44,
144
D336–D342.
Morris, D.L., Hauberg, M.E., Zhang, W., Giambartolomei, C., Franze, O., Vyse, T.J., Ruusalepp,
A., Consortium, C., Sklar, P., Schadt, E.E., et al. (2017). Large-Scale Identification of Common
Trait and Disease Variants Affecting Gene Expression. 885–894.
Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and
quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621–628.
Ning Li, Elashoff, D.A., Robbins, W.A., and Lin Xun (2011). A hierarchical zero-inflated log-
normal model for skewed responses. Stat. Methods Med. Res. 20, 175–189.
Oshlack, A., Robinson, M.D., and Young, M.D. (2010). From RNA-seq reads to differential
expression results. Genome Biol. 11, 1–10.
Pato, M.T., Sobell, J.L., Medeiros, H., Abbott, C., Sklar, B.M., Buckley, P.F., Bromet, E.J.,
Escamilla, M.A., Fanous, A.H., Lehrer, D.S., et al. (2013). The genomic psychiatry cohort:
Partners in discovery. Am. J. Med. Genet. Part B Neuropsychiatr. Genet. 162, 306–312.
Pimentel, H.J., Bray, N., Puente, S., Melsted, P., and Pachter, L. (2016). Differential analysis of
RNA-Seq incorporating quantification uncertainty. Nat. Publ. Gr.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A.R., Bender, D., Maller, J.,
Sklar, P., De Bakker, P.I.W., Daly, M.J., et al. (2007). PLINK: A tool set for whole-genome
association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575.
Quinn, T.P., Crowley, T.M., and Richardson, M.F. (2018a). Benchmarking differential
expression analysis tools for RNA-Seq: Normalization-based vs. log-ratio transformation-based
methods. BMC Bioinformatics 19, 1–15.
145
Quinn, T.P., Erb, I., Richardson, M.F., and Crowley, T.M. (2018b). Understanding sequencing
data as compositions: An outlook and review. Bioinformatics 34, 2870–2878.
Raedler, T.J., Knable, M.B., and Weinberger, D.R. (1998). Schizophrenia as a developmental
disorder of the cerebral cortex. Curr. Opin. Neurobiol. 8, 157–161.
Rapaport, F., Khanin, R., Liang, Y., Pirun, M., Krek, A., Zumbo, P., Mason, C.E., Socci, N.D.,
and Betel, D. (2013). Comprehensive evaluation of differential gene expression analysis methods
for RNA-seq data. Genome Biol. 14, R95.
Rhie, S.K., Schreiner, S., Witt, H., Armoskus, C., Lay, F.D., Camarena, A., Spitsyna, V.N., Guo,
Y., Berman, B.P., Evgrafov, O. V., et al. (2018). Using 3D epigenomic maps of primary
olfactory neuronal cells from living individuals to understand gene regulation. Sci. Adv. 4.
Ripke, S., Neale, B.M., Corvin, A., Walters, J.T.R., Farh, K.-H., Holmans, P.A., Lee, P., Bulik-
Sullivan, B., Collier, D.A., Huang, H., et al. (2014). Biological insights from 108 schizophrenia-
associated genetic loci. Nature 511, 421–427.
Robinson, M.D., McCarthy, D.J., and Smyth, G.K. (2009). edgeR: A Bioconductor package for
differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Stat. 6, 461–464.
Seyednasrollah, F., Laiho, A., and Elo, L.L. (2013). Comparison of software packages for
detecting differential expression in RNA-seq studies. Brief. Bioinform. 16, 59–70.
Shabalin, A.A. (2012). Matrix eQTL: ultra fast eQTL analysis via large matrix operations.
Bioinformatics 28, 1353–1358.
Somers, J.M., Goldner, E.M., Waraich, P., and Hsu, L. (2006). Prevalence and incidence studies
146
of anxiety disorders: A systematic review of the literature. Can. J. Psychiatry 51, 100–113.
Stark, R., Grzelak, M., and Hadfield, J. (2019). RNA sequencing: the teenage years. Nat. Rev.
Genet. 20, 631–656.
Stegle, O., Parts, L., Piipari, M., Winn, J., and Durbin, R. (2012). Using probabilistic estimation
of expression residuals (PEER) to obtain increased power and interpretability of gene expression
analyses. Nat. Protoc. 7, 500–507.
Stilo, S.A., and Murray, R.M. (2010). The epidemiology of schizophrenia: replacing dogma with
knowledge. Dialogues Clin. Neurosci. 12, 305–315.
Su, Z., Łabaj, P.P., Li, S., Thierry-Mieg, J., Thierry-Mieg, D., Shi, W., Wang, C., Schroth, G.P.,
Setterquist, R.A., Thompson, J.F., et al. (2014). A comprehensive assessment of RNA-seq
accuracy, reproducibility and information content by the Sequencing Quality Control
Consortium. Nat. Biotechnol. 32, 903–914.
Sullivan, P.F., Kendler, K.S., and Neale, M.C. (2003). Schizophrenia as a Complex Trait:
Evidence from a Meta-analysis of Twin Studies. Arch. Gen. Psychiatry 60, 1187–1192.
Tak, Y.G., and Farnham, P.J. (2015). Making sense of GWAS: Using epigenomics and genome
engineering to understand the functional relevance of SNPs in non-coding regions of the human
genome. Epigenetics and Chromatin 8, 1–18.
Tansey, K.E., Rees, E., Linden, D.E., Ripke, S., Chambert, K.D., Moran, J.L., McCarroll, S.A.,
Holmans, P., Kirov, G., Walters, J., et al. (2016). Common alleles contribute to schizophrenia in
CNV carriers. Mol. Psychiatry 21, 1085–1089.
Topol, A., Zhu, S., Tran, N., Simone, A., Fang, G., and Brennand, K.J. (2015). Altered WNT
147
Signaling in Human Induced Pluripotent Stem Cell Neural Progenitor Cells Derived from Four
Schizophrenia Patients. Biol. Psychiatry 78, e29–e34.
Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D.R., Pimentel, H., Salzberg,
S.L., Rinn, J.L., and Pachter, L. (2012). Differential gene and transcript expression analysis of
RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578.
Visscher, P.M., Wray, N.R., Zhang, Q., Sklar, P., McCarthy, M.I., Brown, M.A., and Yang, J.
(2017). 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet.
101, 5–22.
Wagner, G.P., Kin, K., and Lynch, V.J. (2012). Measurement of mRNA abundance using RNA-
seq data: RPKM measure is inconsistent among samples. Theory Biosci. 131, 281–285.
Walker, R.L., Ramaswami, G., Hartl, C., Mancuso, N., Gandal, M.J., de la Torre-Ubieta, L.,
Pasaniuc, B., Stein, J.L., and Geschwind, D.H. (2019). Genetic Control of Expression and
Splicing in Developing Human Brain Informs Disease Mechanisms. Cell 179, 750-771.e22.
Weinberger, D.R. (1988). Implications of Normal Brain Development for the Pathogenesis of
Schizophrenia. Arch. Gen. Psychiatry 45, 1055.
Williams, C.R., Baccarella, A., Parrish, J.Z., and Kim, C.C. (2017). Empirical assessment of
analysis workflows for differential expression analysis of human samples using RNA-Seq. BMC
Bioinformatics 18, 1–12.
Wrobel, B.B., Mazza, J.M., Evgrafov, O. V., and Knowles, J.A. (2013). Assessing the efficacy of
endoscopic office olfactory biopsy sites to produce neural progenitor cell cultures for the study
of neuropsychiatric disorders. Int. Forum Allergy Rhinol. 3, 133–138.
148
Wu, E.Q., Birnbaum, H.G., Shi, L., Ball, D.E., Kessler, R.C., Moulis, M., and Aggarwal, J.
(2005). The economic burden of schizophrenia in the United States in 2002. J. Clin. Psychiatry
66, 1122–1129.
Xu, J., Su, Z., Hong, H., Thierry-Mieg, J., Thierry-Mieg, D., Kreil, D.P., Mason, C.E., Tong, W.,
and Shi, L. (2014). Cross-platform ultradeep transcriptomic profiling of human reference RNA
samples by RNA-Seq. Sci. Data 1, 140020.
Xu, J., Gong, B., Wu, L., Thakkar, S., Hong, H., and Tong, W. (2016). Comprehensive
assessments of RNA-seq by the SEQC consortium: FDA-led efforts advance precision medicine.
Pharmaceutics 8.
Yang, X., Schadt, E.E., Wang, S., Wang, H., Arnold, A.P., Ingram-Drake, L., Drake, T.A., and
Lusis, A.J. (2006). Tissue-specific expression and regulation of sexually dimorphic genes in
mice. Genome Res. 16, 995–1004.
Yuan, H., Wang, Q., Liu, Y., Yang, W., He, Y., Gusella, J.F., Song, J., and Shen, Y. (2018). A
rare exonic NRXN3 deletion segregating with neurodevelopmental and neuropsychiatric
conditions in a three-generation Chinese family. Am. J. Med. Genet. Part B Neuropsychiatr.
Genet. 177, 589–595.

149

Abstract (if available)

Linked assets

University of Southern California Dissertations and Theses

Conceptually similar

PDF

Generalized linear discriminant analysis for high-dimensional genomic data with external information

PDF

Model selection methods for genome wide association studies and statistical analysis of RNA seq data

PDF

Incorporating prior knowledge into regularized regression

PDF

Prediction and feature selection with regularized regression in integrative genomics

PDF

Analysis of SNP differential expression and allele-specific expression in gestational trophoblastic disease using RNA-seq data

PDF

Latent unknown clustering with integrated data (LUCID)

PDF

Statistical analysis of high-throughput genomic data

PDF

Hierarchical regularized regression for incorporation of external data in high-dimensional models

PDF

Cultured neuronal cells derived from the olfactory neuroepithelium growing in three dimensions as a model system for schizophrenia

PDF

Integrative analysis of multi-view data with applications in epidemiology

PDF

Prediction modeling with meta data and comparison with lasso regression

PDF

Hierarchical approaches for joint analysis of marginal summary statistics

PDF

Finding signals in Infinium DNA methylation data

PDF

Benchmarking of computational tools for ancestry prediction using RNA-seq data

PDF

Enhancing model performance of regularization methods by incorporating prior information

PDF

Computational methods for translation regulation analysis from Ribo-seq data

PDF

Polygenic analyses of complex traits in complex populations

PDF

Characterization and discovery of genetic associations: multiethnic fine-mapping and incorporation of functional information

PDF

ROC surface in the presence of verification bias

PDF

Modeling mutational signatures in cancer

Asset Metadata

Creator Armoskus, Christopher John (author)

Core Title Best practice development for RNA-Seq analysis of complex disorders, with applications in schizophrenia

School Keck School of Medicine

Degree Doctor of Philosophy

Degree Program Biostatistics

Publication Date 08/04/2020

Defense Date 12/21/2019

Publisher University of Southern California (original), University of Southern California. Libraries (digital)

Tag bioinformatics,Biostatistics,differential expression,eQTL,OAI-PMH Harvest,RNA-seq,schizophrenia,statistics

Language English

Contributor Electronically uploaded by the author (provenance)

Advisor Conti, David (committee chair), Knowles, James (committee member), Siegmund, Kimberly (committee member), Lewinger, Juan Pablo (committee member)

Creator Email armoskus@usc.edu,carmoskus@gmail.com

Permanent Link (DOI) https://doi.org/10.25549/usctheses-c89-264753

Unique identifier UC11673216

Identifier etd-ArmoskusCh-8141.pdf (filename),usctheses-c89-264753 (legacy record id)

Legacy Identifier etd-ArmoskusCh-8141.pdf

Dmrecord 264753

Document Type Dissertation

Rights Armoskus, Christopher John

Type texts

Source University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)

Access Conditions The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...

Repository Name University of Southern California Digital Library

Repository Location USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA