Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Mapping genetic variants for nonsense-mediated mRNA decay regulation across human tissues
(USC Thesis Other)
Mapping genetic variants for nonsense-mediated mRNA decay regulation across human tissues
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Mapping Genetic Variants for Nonsense-mediated mRNA Decay Regulation
Across Human Tissues
by
Bo Sun
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
May 2022
Copyright 2022 Bo Sun
Acknowledgements
First, it is a genuine pleasure to express my thanks and gratitude to my advisor, mentor, and guide
Dr. Liang Chen. Her dedication and keen interest above all her attitude to help her students
had been largely responsible for completing my work. Her timely advice, meticulous scrutiny
and scientific approach had helped me to a very large extent to accomplish my research.
It is my privilege to thankAndrewP.McMahon,PhD,FRS, andPietroCippà,MD,PhD
for their collaboration in 2018. It was my first collaboration work and an interdisciplinary one.
Thanks to their professionalism, as a student I learned valuable things including collaborative
works, efficient communications and scrutiny.
I thank profusely all other faculty members, specifically Dr.FengzhuSun,Dr.MarkChais-
son,Dr. AndrewSmith andDr. RemoRohs, for their help and guidance along the way. I also
thank allSTAFFS of our QCB program for their kind support and help.
I also thank my peerZifanZhu,XinBai andQifanYang. We studied, tackled exams, and
dealt with all aspects of problems outside of the program together. It was another treasure of my
phd study.
Finally, I would like to thank myparents and my girlfriendShuangLi for their unconditioned
love and undoubted trust. Without their support and encouragement, it would be impossible for
me to complete this journey.
ii
TableofContents
Acknowledgements ii
ListofTables v
ListofFigures vi
Abstract x
Chapter1: Background 1
1.1 Nonsense-mediated mRNA decay (NMD) as a
post-transcriptional regulation mechanism . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Premature termination codons (PTCs) and the
NMD substrates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 NMD substrates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Recognition of PTC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2.1 Faux 3’UTR model . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2.2 EJC model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2.3 Unified model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.3 The degradation of targeted mRNAs . . . . . . . . . . . . . . . . . . . . . 9
1.3 Expression quantitative trait loci (eQTL) . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 GTEx generated eQTL data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Introduction and motivation for our study . . . . . . . . . . . . . . . . . . . . . . 12
Chapter2: NMD-QTL:Genome-wideLandscapeofGeneticVariantsAssociatedwith
NMDRegulationacrossHumanTissues 15
2.1 Modelling of NMD-QTLs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Identification of NMD-QTLs across different human
tissues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Tissue specificity of NMD-QTLs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Colocalization of NMD-QTLs and disease SNPs . . . . . . . . . . . . . . . . . . . 20
2.5 Disease NMD-QTLs in brain tissues . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 NMD-QTLs are more likely to be located within gene bodies and exons compared
to eQTLs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7 Ordinal positions of NMD-QTLs along exons and introns . . . . . . . . . . . . . . 25
iii
2.8 Location preference of NMD-QTLs at the base-pair
resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.9 The relationship between NMD-QTLs and miRNAs or RNA-binding proteins
(RBPs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.10 Colocalization of splicing-QTLs and NMD-QTLs . . . . . . . . . . . . . . . . . . . 29
Chapter3: Method 44
3.1 Data acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Modeling of NMD-QTLs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Identification of NMD-QTLs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4 Effect size confirmation with quantile regression . . . . . . . . . . . . . . . . . . . 46
3.4.1 Zero-inflation in RNA-seq data . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.2 Quantile Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.2.1 Definition of quantile . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.2.2 The minimization problem . . . . . . . . . . . . . . . . . . . . . 50
3.4.2.3 Extension to regression . . . . . . . . . . . . . . . . . . . . . . . 51
3.5 Simulation studies to test our model . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.6 Genome-wide NMD-QTL mapping for GTEx tissues . . . . . . . . . . . . . . . . . 53
3.7 NMD-QTL signature similarity between a tissue pair . . . . . . . . . . . . . . . . 54
3.8 Colocalization of NMD-QTLs with disease-associated SNPs . . . . . . . . . . . . . 54
3.9 Colocalization enrichment score for MeSH . . . . . . . . . . . . . . . . . . . . . . 54
3.10 Enrichment of NMD-QTLs in the targets of miRNAs and RBPs . . . . . . . . . . . 55
3.11 Splicing-QTL mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Chapter4: DiscussionandBroaderImpact 59
4.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Broader Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Chapter5: Conclusions 63
AppendixA 65
Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.1 Customize expression normalization . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.2 Python code snippet for phenotype normalization . . . . . . . . . . . . . . . . . . 66
A.3 Implementation details of the simulations . . . . . . . . . . . . . . . . . . . . . . . 67
A.4 Statistical tests used in disease SNP colocalization study for NMD-QTLs . . . . . . 68
A.4.1 Hyper-geometric test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
A.4.2 Proportion test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
A.5 Github repository for large tables . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Bibliography 70
iv
ListofTables
1.1 The substrates of NMD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 The 43 diseases with significant NMD-QTLs enrichment. p-values for both the
proportion test and the hypergeometric test <0.05. . . . . . . . . . . . . . . . . . . 23
3.1 The data sources and links used in this study. . . . . . . . . . . . . . . . . . . . . . 45
A.1 Links to supplementary tables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
v
ListofFigures
1.1 Canonical NMD pathway in humans. (Image credit: Mariuswalter, CC BY-SA 4.0) 2
1.2 Current NMD models, source [8] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Modeling of NMD-QTLs. (a) Identification of NMD-QTLs across human tissues
based on the GTEx data. (b) Parameterization of the profiled expression of NMD
genes. t represents the total amount of transcribed transcripts,α represents the
percentage of transcribed transcripts targeted by NMD, and θ represents the
percentage of NMD-targeted transcripts that remained after degradation. The
parameters can be different for different alleles of a genetic variant. (c) Allelic
effect sizes of a pNMD-QTL, dNMD-QTL, or eQTL for NMD-targeted transcripts
and non-NMD transcripts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2 Identification of pNMD-QTLs and dNMD-QTLs. (a) Examples of a pNMD-QTL
and a dNMD-QTL. A variant is identified as a pNMD-QTL if the allelic effect
size is in opposite directions for NMD-targeted transcripts and non-NMD
transcripts. A variant is identified as a dNMD-QTL if the association is observed
in NMD-targeted transcripts only. (b) Precision-recall curves for identifying
pNMD-QTLs and dNMD-QTLs. The effect size was simulated between 0.1 and
0.15. (c) The number of pNMD-QTLs and dNMD-QTLs detected for each of the
48 tissues. The shaded areas represent the number of variants also identified as
eQTLs in GTEx. The color scheme used for tissues is the same as the one used in
GTEx. (d) The fractions of positive effect sizes of pNMD-QTLs and dNMD-QTLs
for NMD-targeted transcripts in each tissue. . . . . . . . . . . . . . . . . . . . . . 33
2.3 NMD-QTL identification results for simulations with various effect sizes. The top
panels show the Precision-Recall curve for small effect size (0.05-0.1), medium
effect size (0.1-0.15), and large effect size (0.15-0.2). The bottom panels show
the p-value distributions for the allelic effect for NMD-targeted transcripts and
non-NMD transcripts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
vi
2.4 Tissue specificity of NMD-QTLs. (a) The similarity of NMD-QTL signatures
between tissue pairs. The similarity score s* reflects the enrichment of shared
NMD-QTLs among different tissues. The detected NMD-QTLs from brain tissues
are very distinctive from other tissue. (b) Clustering of similarity scores within
brain sub-regions. (c) Tissue interaction map. No edge is drawn if the similarity
score s* between the two tissues <10. Otherwise, black edges are drawn if
10≤ s
∗ < 20; orange edges are drawn if20≤ s
∗ < 30 and red edges are drawn
ifs
∗ >30. (d) The number of tissues an NMD-QTL variant is detected from. . . . 35
2.5 Variant-disease association analysis for NMD-QTLs. (a) Venn diagram among
NMD-QTLs, eQTLs and disease SNPs. About 33.61% (the gray sector) disease
SNPs identified as eQTLs are also discovered as NMD-QTLs. And 37.63% (light
blue sector) disease SNPs identified as NMD-QTLs are not discovered as eQTLs.
(b) Top 10 diseases associated with enriched NMD-QTLs. They are ranked by
the p values of the proportion tests. (c) Colocalization with disease SNPs under
each MeSH term. The colocalization scores are compared among pNMD-QTLs,
dNMD-QTLs, eQTLs, and randomly sampled SNPs (the number is the same as
that of NMD-QTLs). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6 Disease NMD-QTLs in brain tissues. (a) Tissue pair-wise similarity of non-
disease and disease QTL signatures. Disease NMD-QTLs but not disease eQTLs
are more similar in brain tissues. (b) UpSet plot showing the top 25 most
frequent brain sub-region memberships displayed by genes with NMD-QTLs.
Compared to genes whose NMD-QTL associations are observable in almost all
brain sub-regions, genes appeared in only a single brain sub-region are more
likely to contain disease NMD-QTLs. . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.7 Genomic locations of NMD-QTLs. (a) The coefficient estimations for ˆ w
1
.
(b) The coefficient estimations for ˆ w
2
. ˆ w
1
and ˆ w
2
are for the model N
i,g
=
w
0
+w
1
N
o,g
+w
2
log
2
(l
g
), whereN
i,g
is the number of NMD-QTLs (or eQTLs)
inside a gene,N
o,g
is the number of NMD-QTLs (or eQTLs) outside of any gene,
andl
g
is the gene length. (c) The ˆ w
1
estimation for each tissue. Error bars: 95%
confidence intervals. The ˆ w
1
estimations for NMD-QTLs are consistently larger
than those for eQTLs, meaning that NMD-QTLs prefer locating inside a gene
body. (d) The proportion of genic variants that are located on exons. The exonic
proportions are significantly higher for NMD-QTLs, compared to eQTL controls
which are eQTLs discovered for the same pNMD- or dNMD-genes. (e) The
scatter plots of the exonic proportions. The size of the symbol is proportional to
the sample size for a specific tissue. . . . . . . . . . . . . . . . . . . . . . . . . . . 38
vii
2.8 Ordinal positioning of NMD-QTLs along exon and intron intervals. The exon
and intron intervals are deduced from the collapsed gene model and ranked
from 0 for both 5’ and 3’ ends. (a) Distribution of the count of NMD-QTLs on
each exon ordinal position normalized by the exon count per ordinal position.
(b) Distribution of the count of NMD-QTLs on each intron ordinal position
normalized by the intron count per ordinal position. (c) Distribution of the count
of NMD-QTLs on each exon ordinal position normalized by both the exon count
and exon length per ordinal position.(d) Distribution of the count of NMD-QTLs
on each intron ordinal position normalized by both the intron count and intron
length per ordinal position. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.9 Ordinal positions of exon and intron intervals in NMD genes. (a) The exon or
intron interval count per ordinal position. (b) The median exon or intron length
per ordinal position. (c) Distribution of the raw pNMD-QTL or dNMD-QTL
count per exon ordinal position. (d) Distribution of the raw pNMD-QTL or
dNMD-QTL count per intron ordinal position. . . . . . . . . . . . . . . . . . . . . 40
2.10 Distances of NMD-QTLs to boundaries of exons, introns, stop codons, and
transcription start sties. (a) Distances to exon boundaries for variants located
on an exon. The comparison is made between pNMD-QTLs and eQTLs. (b)
Distances to intron boundaries for variants located on an intron. The comparison
is made between pNMD-QTLs and eQTLs. (c) Exonic distances of pNMD-
QTLs and dNMD-QTLs to stop codons, and the corresponding eQTL controls
comparisons. (d) Distances to transcription start sites. Note that all genetic
variants within 1 million base-pair up- and down-stream TSS are considered in
our NMD-QTL mapping. (e) Genomic distances between NMD-QTLs and the
stop codons. In (e), different from (c), both exonic and intronic base-pair were
counted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.11 Relationship between NMD-QTLs and miRNAs or RBPs. (a) Top 20 miRNAs
whose target regions are enriched with dNMD-QTLs or pNMD-QTLs. (b)
Percentages of NMD-QTLs in miRNA targets compared to that of eQTLs. (c) Top
20 RBPs whose binding regions profiled in K562 are enriched with dNMD-QTLs
or pNMD-QTLs. (d) Percentages of NMD-QTLs in binding regions of RBPs in
K562 compared to that of eQTLs. (e) Top 20 RBPs whose binding regions profiled
in HepG2 are enriched with dNMD-QTLs or pNMD-QTLs. (f) Percentages of
NMD-QTLs in binding regions of RBPs in HepG2 compared to that of eQTLs. . . 42
viii
2.12 Relationship between NMD-QTLs and miRNAs or RBPs. (a) Top 20 miRNAs
whose target regions are enriched with dNMD-QTLs or pNMD-QTLs. (b)
Percentages of NMD-QTLs in miRNA targets compared to that of eQTLs. (c) Top
20 RBPs whose binding regions profiled in K562 are enriched with dNMD-QTLs
or pNMD-QTLs. (d) Percentages of NMD-QTLs in binding regions of RBPs in
K562 compared to that of eQTLs. (e) Top 20 RBPs whose binding regions profiled
in HepG2 are enriched with dNMD-QTLs or pNMD-QTLs. (f) Percentages of
NMD-QTLs in binding regions of RBPs in HepG2 compared to that of eQTLs. . . 43
3.1 Comparison of different linear models on simulated expression with varying
dropout zeros effects. (A) The underlying true effect size is 0.3. (B) Underlying
effect size is -0.3. The percentage of zero expression increases from 10%to 40%,
from left to right. Quantile regression always provides a robust model estimation
while the performance of both OLS and Huber regression becomes worse when
the percentage of zero increases. (C) Comparison between quantile regression
and INT-OLS when the dropout percentage is 40% or 45% (effect size is -0.3). . . . 48
3.2 Comparison of different linear models on effect size estimation. (A) Estimated
effect size of different linear models on a simulated data with underlying
effect size = 0 and an over-dispersion issue (error term simulated from a Laplace
distribution with s=1.5). (B)Comparison of quantile regression, OLS and INT-OLS
under different residual distributions (bottom: Student’s t distribution (DF=5);
top: inverse Gamma distribution (shape=4, scale=0.2)). The mean kurtosis among
the 10,000 simulations was increased from 14.16 to 19.69. The underlyingβ gt
is 0
or 0.097. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3 Over-dispersion and pervasive dropout zeros in GTEx RNA-seq data. (A)
Boxplots of gene expression kurtosis. For each gene, the excess kurtosis was
calculated over all samples from a given tissue. The boxplots are colored by the
sample size for a given tissue. We then chose the whole blood data for real data
analysis hereinafter since it has the most widely distributed kurtosis. (B) Rug
density plot of log2-transformed gene expression level. FOXC1 is an example of
large kurtosis genes. NAPGA is an example of small kurtosis genes. The vertical
line marks the mean, and the rugs on x-axis show individual data points. (C)
Percentage of genes with zero expression values. The left panel shows the results
for all genes. The right panel shows genes with TPM > 0.1 in at least 10 samples.
(D) Scatter plot of gene expression values from two randomly selected genes.
One was with high kurtosis and the other with large amounts of dropouts. . . . . 57
3.4 Detection power and sample size. (a) Positive correlation between the number of
NMD-QTLs identified for each tissue and the tissue sample size (i.e., the number
of donors). (b) The positive correlation observed for GTEx eQTLs as well. . . . . . 58
ix
Abstract
Nonsense-mediated mRNA decay (NMD) was originally conceived as an mRNA surveillance
mechanism that recognizes and degrades nonsense transcripts to prevent the production of po-
tentially deleterious truncated proteins. Recent research showed NMD is an important post-
transcriptional gene regulation mechanism targeting many non-aberrant mRNAs. Here we elu-
cidate the NMD regulation of individual genes across human tissues through genetical genomics.
Genetic variants corresponding to NMD regulation were identified based on the GTEx data through
our unique modeling of transcript expressions. Specifically, we identified genetic variants that
enhance or suppress the post-transcriptional production of NMD-targeted transcripts (pNMD-
QTLs), as well as genetic variants regulating the decay efficiency of NMD-targeted transcripts
(dNMD-QTLs). We explored the tissue specificity of identified NMD-QTLs, and profiled their
genomic positions and distances to exons, introns, transcription start sites and stop codons. We
further found that compared to expression quantitative trait loci (eQTLs), NMD-QTLs are more
likely to be co-localized with disease SNPs, splicing QTLs, and in the binding sites of miRNAs
and RNA binding proteins (RBPs). In conclusion, we revealed the genome-wide landscape of ge-
netic variants associated with NMD regulation (the NMD-QTLs) across human tissues. The tissue
specificity analysis of NMD-QTLs demonstrated the especially important role of NMD in brains.
The exploration of NMD-QTLs’ preferential genomic positions suggested additional attributes
x
for a transcript to be regulated by NMD. Furthermore, the co-localization analysis with other
trait-associated SNPs and post-transcriptional regulatory elements assisted the understanding of
NMD-QTLS’ functionality and their interactions with other post-transcriptional regulations.
xi
Chapter1
Background
1.1 Nonsense-mediated mRNA decay (NMD) as a
post-transcriptionalregulationmechanism
Nonsense-mediated mRNA decay (NMD) was originally conceived as an mRNA surveillance
mechanism that recognizes and degrades nonsense transcripts which contain premature termi-
nation codons (PTCs). Figure 1.1 gives an overview of the PTC mediated transcript decay process.
The truncated transcripts if translated, would produce gain-of-function or loss-of-function dele-
terious proteins. Currently, the scope of NMD pathway has been further broadened, since recent
research shows NMD is an important post-transcriptional gene regulation mechanism target-
ing many non-aberrant mRNAs. Scientists had uncovered numerous mechanisms of how NMD
affects gene regulations, studying of NMD had allowed scientists to determine the causes for
certain heritable diseases and the dosage compensation effect in mammals [1]. NMD adjusts
the transcriptome and proteome outputs dynamically under varied physiological environments.
1
However, how natural genetic variants affect NMD and modulate gene expressions remains elu-
sive.
Figure 1.1: Canonical NMD pathway in humans. (Image credit: Mariuswalter, CC BY-SA 4.0)
1.2 Premature termination codons (PTCs) and the
NMDsubstrates
1.2.1 NMDsubstrates
The initiator of NMD is premature termination codons (PTCs). Most NMD substrates in cells are
PTC-containing mRNAs, and PTCs are mainly derived from the mutations at the DNA or RNA
2
level (see Table 1.1). Remarkably, the majority of PTCs are produced from faulty- or alternative-
PTC Origin Feature
PTC+ PTC arising at the DNA level Nonsense mutations
Nucleotide insertions and deletions that shift
the reading frame
Mutations lead to splice signal alterations
DNA rearrangements of immunoglobulin
and T-cell receptor genes
PTC arising at the RNA level Transcription errors
Faulty or alternative splicing
Pre-mRNA that escapes nuclear retention
Programmed frameshifts
PTC- Physiological mRNA mRNAs with uORF
mRNAs with introns in 3’UTR
mRNAs with long 3’UTR
Selenoprotein
mRNAs of transposons, retroviruses and
pseudo-genes
Table 1.1: The substrates of NMD
splicing of pre-mRNAs. Jaillion et al. [2] in the study of double-nucleated paramecia (Puramecium
eetraurelia) found that a lot of short introns contain a stop codon. The retention of such introns
during splicing (either faulty or alternative) results in PTC bearing transcripts. The clearance
of such splicing products is mainly achieved through the NMD pathway. Up to 75%-90% of the
genes in the human genome are alternatively spliced, and about 1/3 of the alternatively spliced
products are PTC-containing mRNAs. Literature links alternative splicing (AS) to NMD-coupled
transcriptional regulation as AS-NMD [3]. In 2010, De Lima Morais et al. [4] studied the genes
regulated by AS-NMD in mammals and found that the regulated genes had high levels of homol-
ogy in species and suggested that AS-NMD may be a conserved regulating mechanism. Analysis
of the transcriptomes of yeasts (S. cescererisiae), fruit flies (Drosophila Melanogaster) and hu-
mans (Homo sapiens) found that NMD deficiency resulted in significant changes in the mRNA
3
abundance of 3% to 10% of the genes [5, 6], indicating that some endogenous normal physiological
genes are affected by NMD. The normal mRNA products without PTC can also be recognized as
a substrate by NMD. The structure of normal mRNA regulated by NMD usually has the following
characteristics: 3’-UTR contains introns, upstream open reading frames (upstream ORF), pro-
grammed frameshift, or overly long 3’-UTR, and so on. It is speculated that these structures can
act as cis-elements recognized by NMD and activate the NMD pathway. In 2011, Huang [7] dis-
covered that the NMD pathway factors themselves, such as UPF (the central regulators of NMD)
and SNG (suppressor with morphological effect on genitalia), are the substrates of NMD as well.
These NMD factors response to environmental stress and developmental signals to lock/unlock
the NMD pathway to achieve stable regulation of intracellular gene expressions. It has been
reported that NMD plays an important role in silencing the genomic “transcriptional noise” by
inhibiting the expression of pseudogenes or transposons. However, the molecular mechanism of
such “PTC-free but regulated by NMD” mRNAs remains unclear.
1.2.2 RecognitionofPTC
As a way of mRNA quality monitoring, the core step of NMD is how to distinguish PTC-containing
mRNAs from normal mRNAs. It is generally believed that the key to identify PTC-containing
mRNA is the formation of NMD complex on the mRNA. Once an mRNA recruits the NMD com-
plex, it will be marked by NMD and subject to rapid degradation. The core factors involved in
the formation of the NMD complex are UPF proteins, including UPFI, UPF2 and UPF3, of which
UPF2 and UPF3 combine to promote the function of UPF1. In addition, the SMG protein family:
SMG1, SMG5, SMG6 and SMC7 are playing important roles in the NMD pathway of multicellular
4
organisms. At this stage, there are three recognized models explaining the formation process of
NMD complex, described in the following subsections.
1.2.2.1 Faux3’UTRmodel
The faux 3’-UTR model (Figure 1.2 (a)) believes that if the translation termination is abnormal, it
will lead to the formation of NMD complex and stimulate the rapid degradation of mRNA. Due
to the appearance of a PTC, the 3’-UTR of mRNA is too long, so it cannot interact with the ri-
bosome correctly as in the regular translation termination. This phenomenon will be sensed by
the NMD pathway and the mRNA will be recognized as a target for degradation. The poly(A)
binding protein (PABP) is a 3’-UTR factor involved in translation termination. The interactions
between PABP and ribosome-releasing factors (eRFs) play an important role in normal interro-
gation and translation termination. The ribosome that terminates translation at the PTC cannot
interact with PABP due to its distance from the poly(A) tail, other UPF proteins are recruited to
the ribosome to begin the formation NMD complex. In 2004 Amrani et al. [9] performed ribosome
footprinting analysis of mRNAs translation termination in yeasts and showed that the ribosome
footprint at PTC (which triggers the recruitment of UPF1) was significantly different from that of
a normal one, while no abnormal ribosome location was observed in cells lacking UPF1. If when
PABP1p(PABP homologous protein in yeasts) is artificially recruited to PTC, the PTC-bearing
mRNA becomes stable and cannot be degraded. Thus, this proved that abnormality of translation
termination and NMD is indeed triggered by the PTC which recruits UPFs instead of binding to
PABP. Although the precise molecular events of translation termination remain to be determined,
in yeasts an environment suitable for normal translation termination—namely the elF4E-eIF4G
(tranlation initiation factor, elF) and the Pabp1p-SUP35 (eRF3 Homologous protein) interactions,
5
Figure 1.2: Current NMD models, source [8]
6
iare required. Otherwise NMD factors such as UPF1 will bind to the final translating ribosome
and cause mRNA to be degraded.
1.2.2.2 EJCmodel
Recent studies of NMD have completely overturned the understanding of some intracellular
events, such as the notion that there is no correlation between pre-mRNA splicing and mRNA
degradation in the cytoplasm. When introns are spliced, an exon-junction complex (EJC) will
form 20-25nt upstream of the exon junction. The core of mammalian EJC are eIF4AIll, Y14, MA-
COH (homologous protein of the product of Drosophila geneMagonashi), and MLN51 (metastatic
lymph node 51). At the periphery of EJC, there are other proteins which are in dynamic states
and can be constantly changing. During translation, EJCs formed on mRNA will be removed one
by one as the ribosome slides on mRNA [10].
The main procedures of the EJC model are related to the deportation of mRNAs from nucleus.
First, mRNA binds to multiple EJCs after maturation in the nucleus, and then EJCs are accompa-
nied by binding to peripheral proteins during the process of entering the cytoplasm. This includes
the binding of UPF3 in the nucleus and the binding of UPF2 outside the nuclear membrane. During
translation, the ribosome removes EJCs one by one. And if the ribosome terminates at the PTC,
the EJCs downstream of the PTC will remain. Then the UPF3 bound by the EJC component Y14
will rapidly participate in the formation of the NMD complex [8, 10, 11]. It can be seen that EJC
promotes UPF1 in the recruitment on the ribosome and formation of the NND complex, which
promotes and accelerates mRNA degradation (Figure 1.2 (b)). The pioneer round of translation
(first round of translation) is a translation process characterized by the mRNA cap-binding pro-
tein complex (CBC). When the mRNA-guided translation enters a stable state, the cap-binding
7
protein is eIF4E. The CBC protein CBP80 can co-immunoprecipitate with many NMD factors.
Knockout of this gene leads to inactivation of the NMD pathway, while eIF4E does not have this
feature [12, 10, 13]. The EJC model better explains the close relationship between the pioneer
round of translation and NMD decay when mRNAs are deported from nucleus.
1.2.2.3 Unifiedmodel
A large body of evidence indicates that EJC is not required for NMD in non-mammals, instead
changes in the 3’-UTR length determining whether abnormal translation termination happened
is more common. In 2008, Stalder et al. proposed an unified model [14]: in order to effectively rec-
ognize the large amount of PTC-containing mRNAs produced by the huge genome, EJC gradually
evolved into an auxiliary and enhanced signal for NMD recognition, so as to realize the monitor-
ing of mRNAs and the regulation of expressions at the post-transcriptional level of normal genes.
The NMD in plants can be characterized by both the faux 3-UTR mode and the EJC model, which
provides an important basis for the unified model [15]. When translation is terminated, eRF1
recognizes the stop codon and stops the ribosome. eRF3 combines with eRF1 and interacts with
PABP, PABPC1 (cytoplasmic Poly(A)-bingding protein) and other proteins. The secondary struc-
ture of the 3’-UTR further shortens the distance between the ribosome and PABP in space, thereby
ensuring normal and efficient translation termination. In comparison, the ribosome on the NMD
targeted mRNA stopped too early. This weakens the interaction between eRF3 and PABP, and
also, UPF1 competes with PABP for binding to eRF3, ultimately leading to mRNA degradation.
The unified model suggests that this process is highly conserved across all eukaryotes and is
fundamental for the PTC recognition. The EJC located downstream of a PTC binds to UPF2 and
UPF3, which significantly increases the binding efficiency of UPF1 to the translation-terminating
8
ribosome mediated by UPF2 and UPF3. Turning the competition between UPF1 and PABP for
ribosome binding towards favoring UPF1 [16], which triggers NMD as mentioned in previous
sections. In summary, whether NMD complexes can be formed in different organisms depends
on the competition between antagonists and facilitators of NMD for binding to ribosomes [17,
18].
1.2.3 ThedegradationoftargetedmRNAs
Studies on various model organisms have shown that mRNAs bound to UPF1 can be degraded
by two different pathways, depending on whether the phosphorylated UPF1 is bound by the
SMG5/SMG7 heterodimer or by the endonuclease SMG6. The former promotes mRNA degreda-
tion through its CCR4 (carbon catabolite repression 4)/CAF1 (CCR4-associated factor 1)-mediated
deadenylation and DCP1/DCP2 (dipeptidyl carboxypeptidase)-mediated decapping, which ex-
pose the 5-terminal to exonuclease XRN1 [19]. The latter results in cleavage by SMG6 around the
PTC followed by exonuclease degradation from the cleavage point. In 2011, Okada-Katsuhata et
al. [20] found that when UPF1 is phosphorylated by SMG1, at least two phosphorylated binding
sites can bind to SMG5/SMG7 and SMG6 complexes, and the binding of SMG5/SNG7 will lead
to the dissociation of ribosomes. The current experiments show that the degradation pathway
is SMG5/SMG7-dependent in yeasts, SMG6-dependent in fruit flies, and SMG6-dependent and
SMG5/SMG7-dependent in mammals. More experiments are needed to determine which mode
of degradation dominates in different species.
9
1.3 Expressionquantitativetraitloci(eQTL)
Before stating our proposed NMD-QTL concept, let’s first take a quick review on the traditional
eQTL analysis.
Genetic variation in a population is commonly studied through the analysis of single nu-
cleotide polymorphisms (SNPs), the most common type of genetic variation among humans.
Expression quantitative trait loci (eQTL) analysis seeks to identify genetic variants that affect
the expression of one or more genes. For example, scanning for a gene-SNP pair for which the
expression of the gene is associated with the allelic configuration of the SNP. Identification of
eQTLs has proven to be a powerful tool in the study and understanding of the human diseases.
With high-throughput sequencing and modern genotyping methods, a typical eQTL analysis can
involve tens of millions of SNPs and tens of thousands of genes. Whole genome eQTL analysis
has separated eQTLs into two distinct types of manifestation. The first is a cis-eQTL, in which
the position of the eQTLs near the physical position of the gene. Because of proximity, cis-eQTL
effects tend to be much stronger, and thus can be more easily detected by Genome-wide associ-
ation studies (GWAS) and eQTL studies. Often, these cis-eQTLs function as promoters of certain
polymorphisms, affect methylation and chromatin conformation (thus increasing or decreasing
access to transcription), and can manifest as insertions and deletions to the genome. Cis-eQTLs
are generally classified as variants that lie within 1 million base pairs of the gene of interest.
Even the local cis-eQTL analyses restrict attention to nearby SNPs, the analysis can still involve
hundreds of millions of gene-SNP pairs. The second distinct type of eQTL is a trans-eQTL. A
trans-eQTL does not map near the physical position of the gene it regulates. Their functions and
effects on the gene expression are generally more indirect. Not directly boosting or inhibiting
10
transcription but rather, affecting kinetics, signaling pathways, and so on. Since such effects are
harder to determine explicitly, they are harder to find in eQTL analysis. In addition, such trans-
regulation networks can be extremely complex. However, eQTL analysis has led to the discovery
of trans hotspots which refer to loci that have widespread transcriptional effects [21].
To date, most eQTL studies have considered the effects of genetic variation on expression
within a single tissue (typically blood). An important next step is the simultaneous analysis of
eQTLs in multiple tissues. Multi-tissue analysis has the potential to improve the findings of single
tissue eQTL studies by borrowing strength across tissues, and to address fundamental biological
questions about the nature and source of the differences between tissues. An important feature
of multiple tissue studies is that a SNP may be associated with the expression of a gene in some
tissues, but not in others.
In an earlier work, we developed a method utilizing quantile regression to ensure accurate
effect size estimation for large-scale eQTL mappings [22]. In this work, we are working with the
NIH Genotype-Tissue Expression (GTEx) Consortium, and aiming at providing a genome-wide
landscape of genetic variants associated with NMD regulations across human tissues. Our new
work will require and benefit from the earlier capability to obtain accurate effect size estimation
in genome-wide eQTL scanning.
1.4 GTExgeneratedeQTLdata
The Genotype-Tissue Expression (GTEx) project is an ongoing effort to build a comprehensive
public resource to study tissue-specific gene expression and regulation. Samples were collected
from 54 non-diseased tissue sites across nearly 1000 individuals, primarily for molecular assays
11
including whole-geome sequencing (WGS), whole exome sequencing (WES), and RNA sequenc-
ing (RNA-Seq). The GTEx Portal provides open access to data including gene expression, QTLs,
and histology images. The GTEx portal can be accessed here: https://gtexportal.org/home/.
We analyzed the GTEx v7 release, which contains genotyping and tissue-specific gene ex-
pression data from 48 human tissues. We proposed and performed a novel type of Expression
quantitative trait loci (eQTL) study named NMD-QTLs in this work, utilizing the unprecedented
data available from GTEx.
1.5 Introductionandmotivationforourstudy
Nonsense-mediated mRNA decay (NMD) is a fundamental cellular mechanism utilized by eu-
karyotic organisms from yeasts to humans. It was originally conceived as an mRNA surveillance
mechanism that recognizes and degrades transcripts harboring a premature termination codon
(PTC) to prevent the production of potentially deleterious truncated proteins [17, 23]. These
mRNAs with PTCs would originate from various sources such as nonsense mutations, aberrant
alternative splicing, and DNA rearrangements [24]. Nowadays, broader impact of NMD has also
been revealed in genetic diseases, gene editing and cancer immunotherapy [25]. Recent research
showed NMD is an important post-transcriptional gene regulation mechanism targeting many
non-aberrant mRNAs as well [5, 26, 27, 28]. Transcriptome-wide studies showed that NMD ac-
counted for autoregulation of 3%-10% normal transcripts in different cell types [29, 30, 31]. Our
own analysis of the latest GENCODE annotation showed that about 12% natural transcripts are
presumably targeted by NMD regulation since they follow the 50-nt rule (i.e., the presence of a
12
terminal codon >50 nt upstream of the last exon-exon junction). Accordingly, NMD has been im-
plicated in cellular homeostasis, stress response, cell cycle, differentiation, development, neural
activity, immunity, and spermatogenesis [32, 33, 34, 35, 36, 37, 38, 39]. NMD typically downreg-
ulates its substrate 2-20 folds [40, 41], providing significant functional and clinical implications.
Although the overall degradation process is believed to be similar for both erroneous and nor-
mal transcripts, factors involved in NMD recognition and regulation of degradation efficiency
are poorly understood. Variations in NMD efficiency have been observed across tissues and in-
dividuals [42, 7]. Nonsense mutations lead to the appearance of PTCs and account for 20.3%
of all disease-associated single-base pair mutations [43]. NMD can both aggravate or alleviate
the effects of PTCs causing genetic diseases, which varies across both individuals and diseases.
Overall NMD more frequently aggravates the effects of detrimental, disease-causing mutations.
For example, NMD can inactivate tumor-supressor genes silences the expression of neoanti-
gens in cancer [44]. Variations in NMD efficiency can interfere with the outcome of therapeu-
tic strategies aiming at suppressing nonsense mutations to restore full-length proteins through
PTC readthrough, since the efficacy of such nonsense suppression therapies largely depends on
NMD efficiency (reviewed in [45]). Thus, a thorough examination of natural genetic variants
affecting NMD efficiency is essential to understanding NMD regulation, its modulatory role in
nonsense-mutation triggered diseases, their phenotypic variability and different degrees of ex-
pressivity. Genetic variants associated with changes in gene expression (expression quantitative
trait loci, eQTLs) are being catalogued on an unprecedented scale for human tissues, for example,
by the Genotype-Tissue Expression (GTEx) project [46, 47]. Genetical genomics has been focused
on identifying genetic loci linked to variation in mRNA expression levels. Recent studies have
extended the analysis to identify genetic variants associated with other molecular phenotypes
13
such as ribosome occupancy, protein abundance, allele-specific expression, and alternative splic-
ing [48, 49, 50]. eQTLs mapped to NMD factors (e.g., SMG7, UPF3B, MAGOH, and so on) provide
valuable information about how genetic variants may affect the NMD pathway in trans through
their cis-acting regulation on NMD factors. A couple of studies examined how loss-of-function
variants and protein-truncating variants affect their own gene expression, which implicated the
involvement of surveillance NMD [51, 52]. However, a systematic survey of cis genetic variants
on individual NMD substrates and their impact on NMD-regulation is still lacking. Here we uti-
lize the GTEx resources and build a genome-wide landscape of tissue-specific genetic variants
involved in NMD regulations. We focus on natural mRNAs targeted by NMD instead of aberrant
mRNAs caused by DNA mutations or RNA processing errors. We identify genetic variants con-
trolling the percentage of NMD-targeted transcripts, and variants regulating the decay efficiency
of NMD-targeted transcripts.
14
Chapter2
NMD-QTL:Genome-wideLandscapeofGeneticVariants
AssociatedwithNMDRegulationacrossHumanTissues
2.1 ModellingofNMD-QTLs
We propose a statistical model to identify and characterize genetic variants associated with NMD
and apply it to human tissue data (Figure 2.1 (a)). The mRNA transcripts of a gene were divided
into two categories: NMD-targeted and non-NMD transcripts, based on whether they were an-
notated by the “nonsense_mediated_decay” tag in the GENCODE annotation. Genes with both
NMD-targeted and non-NMD transcripts were referred as NMD-genes. We usedα andθ to pa-
rameterize NMD regulation:α represents the percent of transcribed transcripts targeted by NMD,
andθ represents the percent of NMD-targeted transcripts remained after degradation. As illus-
trated in (Figure 2.1 (b)), gene haplotypes bearing different alleles (‘A’ or ‘a’) may generate differ-
ent amounts of transcribed mRNAs (t
A
ort
a
), different percentages of transcribed mRNAs ( α A
or
α a
) may be targeted by NMD regulation, and these NMD-targeted transcripts may be degraded at
different degradation efficiency ( 1− θ A
or1− θ a
). Note that the degradation efficiency is relative
15
to the regular degradation of non-NMD transcripts. Generally, genetic variants identified from
a regular eQTL analysis are those with allele-specific transcription levels (i.e., t
A
̸= t
a
) without
taking into consideration of the NMD-transcripts (α ) and its degradation (1− θ ). Here, we aim to
capture genetic variants resulting in different α s orθ s. The former controls the different percent-
ages of NMD-targeted transcript isoforms (α A
̸= α a
), therefore was named pNMD-QTLs. The
latter regulates the degradation efficiency of NMD-targeted isoforms ( θ A
̸= θ a
) and was named
dNMD-QTLs. Herein they are jointly referred as NMD-QTLs.
As reasoning in Figure 2.1 (c), pNMD-QTLs and dNMD-QTLs can be identified and distin-
guished from regular eQTLs by examining their allelic effect sizes for NMD-targeted transcripts
and non-NMD transcripts separately. For a pNMD-QTL, the allelic effect sizes are in the opposite
direction for NMD-targeted and non-NMD transcripts, and the observed regulatory strength is
smaller for NMD-targeted transcripts due to less transcripts detected after degradation. For a
dNMD-QTL, the significant association is only observable in NMD-targeted transcripts but not
in non-NMD transcripts. Examples of a pNMD-QTL and a dNMD-QTL are shown in Figure 2.2
(a). A positive effect size is observed for the alternative allele of the pNMD-QTL when consider-
ing NMD-targeted transcripts (i.e., higher transcript amount when an individual contains more
copies of the alternative allele). However, the allelic effect size is negative for non-NMD tran-
scripts. For the dNMD-QTL, the copy number of the alternative allele only affects the amount of
NMD-targeted transcripts but not non-NMD transcripts.
To evaluate the performance of our models, we estimated the precision and recall for the
identification of pNMD-QTLs and dNMD-QTLs from simulations. A total of 10,000 genetic vari-
ants were simulated with 5% being pNMD-QTLs (α A
̸= α a
, θ A
= θ a
, t
A
= t
a
), 5% being
dNMD-QTLs (α A
= α a
, θ A
̸= θ a
, t
A
= t
a
), and the rest being null cases for NMD-QTLs
16
(α A
= α a
, θ A
= θ a
, t
A
= t
a
). The null cases were either regular eQTLs, or variants not as-
sociated with expression depending on the randomly assigned differences between t
A
and t
a
.
When the allelic effect size of NMD-QTLs was between 10% and 15% (i.e., |α A
− α a
|∈[0.1,0.15]
or|θ A
− θ a
| ∈ [0.1,0.15]), at a recall rate of 0.8, the precision was 0.93 and 0.84 for the iden-
tification of pNMD-QTLs and dNMD-QTLs, respectively (Figure 2.2 (b)). More detailed results
and results for other simulations with smaller (5%-10%) or larger (15%-20%) allelic effect sizes
can be found in Figure 2.3. More detailed results and results for other simulations with smaller
(5%-10%) or larger (15%-20%) allelic effect sizes can be found in Supplementary Figure 1 in the
github repository (A.5). The simulations verified that we can distinguish NMD-QTLs from the
regular eQTLs by jointly considering the significance and directions of the allelic effect sizes for
NMD-targeted and non-NMD transcripts. To ensure the accuracy of the effect size estimation, we
applied an additional quantile regression model (see in Methods 3.4) as the quantile regression has
the desired robustness to the outliers and dropouts in transcriptomics data, and it significantly
improves eQTL mapping with an accurate effect size estimation [22].
2.2 Identification of NMD-QTLs across different human
tissues
We identified pNMD-QTLs and dNMD-QTLs in cis for each of the 48 tissues considered in GTEx
v7. On average, 6033 pNMD-QTLs and 34049 dNMD-QTLs were detected for a tissue (Figure
2.2 (c)). Usually more dNMD-QTLs were discovered than pNMD-QTLs (4-11 folds in different
tissues). The thyroid and the tibial nerves had the most numbers of NMD-QTLs (99053 and 95803
respectively), while the minor salivary gland and vagina tissues had the least (7044 and 12148
17
respectively). It is worthwhile to mention that 7.3%-93.0% of pNMD-QTLs and 24.4%-76.0% of
dNMD-QTLs were not identified as eQTLs from GTEx (Figure 2.2 (c), non-shaded areas). They
are missed in a standard eQTL identification procedure which models the overall expression as
phenotypes without dissecting the NMD regulation. To distinguish NMD-QTLs promoting or
inhibiting NMD, we examined the sign of the effect size for NMD-targeted transcripts (Figure 2.2
(d)). Since the genotype was coded as the number of alternate non-reference allele, a positive sign
for the effect size suggests that α alt
>α ref
(pNMD-QTLs) or1− θ alt
<1− θ ref
(dNMD-QTLs). An
average of 59% pNMD-QTLs across tissues had positive signs. Thus, their alternate non-reference
alleles promote the generation of NMD-targeted transcripts. Usually less than 50% of dNMD-
QTLs had positive signs (average: 47%). Thus, the non-reference alleles of these dNMD-QTLs
inhibit the decay of NMD-targeted transcripts. Equivalently, an average of 53% dNMD-QTLs had
negative signs and their non-reference alleles promote the decay of NMD-targeted transcripts.
2.3 TissuespecificityofNMD-QTLs
To examine the tissue similarity of NMD regulation, we considered a similarity score reflecting
the enrichment of shared NMD-QTLs between two tissues (details in Methods 3.7). As shown
in Figure 2.4 (a), brain tissues had unique lists of NMD-QTLs: the NMD-QTL similarities were
generally high among different brain tissues (average of 24.7), but the average similarity between
brain and any other tissues was as low as 6.6. This suggests that NMD-QTLs were shared within
different brain sub-regions but were substantially different from those identified in other tissues.
Within the brain tissues, the cerebellum and cerebellum hemisphere were relatively distin-
guished from others: the similarity score was 16.38 between these two, but they were 6-8 with all
18
other brain tissues (Figure 2.4 (a), (b)). The amygdala had substantially high similarity with other
brain sub-regions especially the substantia nigra (score=50.7), hippocampus (47.7), and the ante-
rior cingulate cortex (47.0). It has been found that the nigra–amygdala connections function in
surprise-induced attention enforcement [53], and the cortex-to-amygdala pathway is crucial for
pathogenesis of psychiatric diseases including depression and anxiety disorders [54]. The shared
NMD regulations may suggest common regulation between connected pathways.
The tissue interaction map based on the similarities of NMD-QTLs (Figure 2.4 (c)) again clearly
shows the distinction of brain tissues as well as the connections among major viscera (small
intestine, spleen, stomach) and reproductive organs (uterus, vagina, ovary). The strong tissue
specificity of NMD-QTLs was also supported by the observation that 54.3% NMD-QTLs were
only discovered in a single tissue (Figure 2.4 (d), brain sub-regions were combined as one tissue).
Taken together, NMD-QTLs exist ubiquitously across the human body as the NMD regulation
itself, and many of them demonstrate strong tissue specificity.
A handful of NMD-QTLs were universally discovered in all tissues (Figure 2.4 (d)). They are
located in important genes such as TP53I13 (a tumor suppressor gene), ST7L (homologous to tu-
mor suppressor gene ST7), NDUFAF7 (an enzyme involved in the assembly and stabilization of
Complex I, an important complex for mitochondrial respiratory chain), IRF5 (belongs to a tran-
scription factor family with diverse roles including virus-mediated activation of interferon, cell
growth, differentiation, apoptosis and immune system activity), MMAB (an enzyme that catalyzes
the final step in the conversion of vitamin B12 into adenosylcobalamin), and three ribosomal pro-
teins (S19, L13, L27a). It is known NMD has high inter-individual variability and closely affect the
clinical outcome for genetic disease [55, 56]. Therefore, these natural variations of NMD-QTLs
impact tumor suppressor and important metabolic genes across all tissues and individuals, may
19
be exploited as a novel rule to enhance personalized cancer immunotherapy and to treat a wider
range of genetic diseases.
2.4 ColocalizationofNMD-QTLsanddiseaseSNPs
Functional variants were well-know to play an important role in diseases etiology. Utilizing the
collection of disease-related SNPs in DisGeNET [57], we explored the colocalization of NMD-
QTLs and disease-related SNPs. Specifically, out of the 54,643 non-redundant pNMD-QTLs and
the 319,608 non-redundant dNMD-QTLs, 1151 (2.10%) and 3952 (1.24%) of them were found to
be disease SNPs reported in DisGeNET, respectively. Compared to regular eQTLs discovered in
GTEx for NMD-genes (0.82% were found to be disease SNPs), NMD-QTLs were much more likely
to be colocalizing with disease SNPs: the odds ratio was 2.45 (p-value of the Fisher’s exact test:
8.5× 10
− 154
) for pNMD-QTLs vs. eQTLs; and it was 1.49 (4.2× 10
− 94
) for dNMD-QTLs vs. eQTLs.
Albeit the total number of non-redundant NMD-QTLs was only about 11.53% of the total
number of non-redundant eQTLs (331,233 vs. 2,872,430 for NMD-genes), 33.61% disease SNPs
colocalized with eQTLs were also discovered by our NMD-QTLs (the gray area of Figure 2.5 (a)).
More importantly, 37.63% disease SNPs colocalizing with NMD-QTLs were not detected as eQTLs
(the light blue area of Figure 2.5 (a)). A total of 737 diseases were associated with NMD-QTLs,
compared to 735 diseases associated with eQTLs in NMD-genes. Such higher chances of disease
association observed in NMD-QTLs suggest that NMD regulation is an essential component to
maintain the normal function of cells. They may provide a tangible path to dissect the underlying
molecular mechanisms of disease.
20
Among the total of 8073 diseases considered in DisGeNET, we identified 43 diseases that
demonstrated prominent NMD-QTLs enrichment (p-values < 0.05 for both the proportion test
and the hypergeometric test, details in Methods A.4). Remarkably, 9 out of the 43 (20.9%) dis-
eases were mental or behavior dysfunctions which were categorized as “brain and mental disor-
der (F03)” under the “psychiatry and psychology (F)” category of the Medical Subject Headings
(MeSH) system. The most significant one was the “unipolar depression”, followed by the “bipolar
disorder”. The top 10 diseases are shown in Figure 2.5 (b) and the complete list of 43 diseases
is in Table 2.1. Interestingly, mutations in core NMD pathway genes are implicated in multiple
mental disorders [58, 59, 60, 61]. Together with our aforementioned findings that NMD-QTLs
show special distinctiveness and tissue-specificity in brain tissues, the brain disease association
further suggests the critical role of NMD regulation for brain function and malfunction.
The diseases in DisGeNet were categorized into 29 MeSH terms. We then studied the detailed
colocalization status of disease SNPs and NMD-QTLs for each specific MeSH term. The calcu-
lation of the colocalization enrichment score was described in Methods 3.9. As shown in Figure
2.5 (c), dNMD-QTLs were observed to be enriched in nervous system diseases (C10) and mental
disorders (F03). The fold-change of the enrichment for dNMD-QTLs vs. eQTLs was 2.2 for C10
and 2.3 for F03. The fold change was as high as 9.0 for C10 and 8.0 for F03 when comparing
dNMD-QTLs with randomly sampled SNPs. On the contrary, pNMD-QTLs were less enriched
than eQTLs for these two terms. But the fold change was still as high as 2.8 and 2.3 respectively
when compared to randomly selected SNPs. This suggests that regulating the decay efficiency
of NMD-targeted mRNA substrates is a key component of NMD regulation in the brain and neu-
ral systems. Additionally, dNMD-QTLs but not pNMD-QTLs were more enriched than eQTLs in
21
terms such as neoplasm (C04) and pathological conditions (C23). Both pNMD-QTLs and dNMD-
QTLs were more enriched than eQTLs for terms such as musculoskeletal diseases (C05), digestive
system diseases (C06), respiratory tract disease (C08), skin and connective tissue diseases (C17)
and immune system diseases (C20).
Some of NMD-QTLs were related to many diseases. For example, the pNMD-QTL SNPrs5443
was associated with 48 out of the total 8073 diseases. It acted as a pNMD-QTL to regulate
the fraction of NMD-targeted transcripts of the DNA repair gene XRCC3 in 18 tissues includ-
ing 6 brain tissues. This implies that NMD regulation interplays closely with DNA repairing
and is essential in ensuring genome integrity and preventing genetic disease. The dNMD-QTL
SNP rs1800629, regulating the degradation efficiency of NMD-transcripts of genes HLA-C and
ATP6V1G2-DDX39B (read-through transcription between neighboring genes ATP6V1G2 and
DDX39B), was related to 58 diseases and 18 of them were related to the neoplastic processes.
2.5 DiseaseNMD-QTLsinbraintissues
Compared to non-disease NMD-QTLs, disease-associated NMD-QTLs showed an even stronger
tissue specificity in brain sub-regions (Figure 2.6 (a)). By contrast, brain compartmentalization
was weaker in disease eQTLs than in non-disease eQTLs. This reaffirms that NMD-QTLs play a
critical role in maintaining the neural system functionalities and can be valuable for examining
brain and psychiatric pathogenesis.
We examined the sub-region membership of genes with at least one discovered NMD-QTL
in brain sub-regions. (i.e., the specific set of brain sub-regions from which NMD-QTLs were
discovered for a given gene). The 25 most frequent brain sub-region memberships were displayed
22
diseaseName diseaseClass diseaseSemanticType p_value_prop p_value_hyper num_of_dis-
ease_snp
num_of_nmd_-
qtl
AIDS related complex C02;C20 Disease or Syndrome 0.034199 0.000025 10 5
Age related macular degenera-
tion
C11 Disease or Syndrome 0.002465 0.0 551 83
Alcoholic Intoxication, Chronic C25;F03 Mental or Behavioral Dysfunc-
tion
0.034258 0.000006 323 32
Anemia, Pernicious C15;C18 Disease or Syndrome 0.000157 0.00007 3 3
Antisocial Personality Disorder F03 Mental or Behavioral Dysfunc-
tion
0.003251 0.04123 1 1
Asthma C08;C20 Disease or Syndrome 0.0 0.0 728 215
Astrocytoma C04 Neoplastic Process 0.000218 0.008072 43 6
Behcet Syndrome C07;C11;C14;C16;C17 Disease or Syndrome 0.022672 0.0 225 100
Biliary cirrhosis C06 Disease or Syndrome 0.000846 0 560 433
Bipolar Disorder F03 Mental or Behavioral Dysfunc-
tion
0.0 0.0 664 216
Burkitt Lymphoma C02;C04;C15;C20 Neoplastic Process 0.003836 0.000177 8 4
Carcinoma in situ of uterine
cervix
C04 Neoplastic Process 0.015444 0.00496 3 2
Cerebral Infarction C10;C14 Disease or Syndrome 0.039577 0.041427 32 4
Cocaine Abuse C25;F03 Mental or Behavioral Dysfunc-
tion
0.001626 0.0017 2 2
Degenerative polyarthritis C05 Disease or Syndrome 0.004518 0.0 92 19
Dementia, Vascular C10;C14;F03 Disease or Syndrome 0.011434 0.021976 15 3
Depressive Disorder, Treatment-
Resistant
F03 Mental or Behavioral Dysfunc-
tion
0.008755 0.022823 6 2
Diabetes Mellitus, Insulin-
Dependent
C18;C19;C20 Disease or Syndrome 0.0 0.0 774 261
Diabetic Nephropathy C12;C13;C19 Disease or Syndrome 0.011847 0.038483 110 9
Eczema Herpeticum C02;C17 Disease or Syndrome 0.000059 0.009648 4 2
HIV-1 infection NaN Disease or Syndrome 0.038628 0.0 417 188
Heart Failure, Diastolic C14 Disease or Syndrome 0.015444 0.009648 4 2
Hip Dislocation, Congenital C05;C16 Congenital Abnormality 0.000114 0.0 8 7
Huntington Disease C10;C16;F03 Disease or Syndrome 0.002377 0.0 68 17
Hyperoxaluria C12;C13 Disease or Syndrome 0.048685 0 1 2
Irritable Bowel Syndrome C06 Disease or Syndrome 0.049556 0.003294 36 6
Leprosy C01 Disease or Syndrome 0.010597 0.0 147 36
Lymphoma, Follicular C04;C15;C20 Neoplastic Process 0.037153 0.000002 74 14
Major depression, single episode NaN Mental or Behavioral Dysfunc-
tion
0.042739 0.009012 11 3
Malignant neoplasm of breast C04;C17 Neoplastic Process 0.000679 0.0 1361 120
Migraine with Aura C10 Disease or Syndrome 0.001556 0.000827 38 7
Multiple Sclerosis C10;C20 Disease or Syndrome 0.0 0.0 745 192
Myositis C05;C10 Disease or Syndrome 0.008826 0 11 25
Night pain NaN Sign or Symptom 0.003251 0.04123 1 1
Optic Atrophy, Hereditary,
Leber
C10;C11;C16;C18 Disease or Syndrome 0.024391 0.009648 4 2
Oral Cavity Carcinoma NaN Neoplastic Process 0.047755 0.000003 17 7
Steroid-sensitive nephrotic syn-
drome
NaN Disease or Syndrome 0.033681 0 3 4
Subclinical hypothyroidism NaN Disease or Syndrome 0.011434 0.000658 5 3
Toxic Epidermal Necrolysis C07;C17;C20;C25 Disease or Syndrome 0.011124 0.0 17 8
Toxoplasmosis, Congenital C03;C10;C16 Disease or Syndrome 0.048685 0.022823 6 2
Unipolar Depression F03 Mental or Behavioral Dysfunc-
tion
0.0 0.0 378 166
Vascular occlusion NaN Disease or Syndrome 0.018721 0.04123 1 1
disc disorder NaN Disease or Syndrome 0.018721 0.04123 1 1
Table 2.1: The 43 diseases with significant NMD-QTLs enrichment. p-values for both the propor-
tion test and the hypergeometric test <0.05.
23
via an UpSet plot (Figure 2.6 (b). On this finer sub-tissue level, the most frequent membership
pattern appeared to be genes with NMD-QTLs identified in a single or a few sub-regions (in blue
rectangle in Figure 2.6 (b), as well as genes with NMD-QTLs discovered in almost all brain sub-
regions (in orange rectangle in Figure 2.6 (b). Interestingly, the former group of genes was more
likely to contain disease NMD-QTLs compared to the latter group of genes (Figure 2.6 (b).
2.6 NMD-QTLs are more likely to be located within gene
bodiesandexonscomparedtoeQTLs
The 50-nt rule was applied to tag NMD transcripts in GENCODE because the rule has the strongest
predictive value for NMD susceptibility [62]. Our studied NMD-targeted transcripts all follow
this rule. However, the degree of NMD regulation can still vary. For example, the percentage of
transcripts undergoing NMD can vary and their decay efficiency can vary. Our discovered NMD-
QTLs suggest locations of cis-regulatory elements for NMD. The preferred genomic positions of
NMD-QTLs therefore inform us additional determinants of NMD regulation.
First, we examined whether NMD-QTLs had any preference for residing inside a gene body or
in the intergenic regions. For each gene, we counted the numbers of NMD-QTLs located within
(N
i,g
) or outside (N
o,g
) of the gene body. Then a linear model was fit between N
i,g
and N
o,g
taking the gene length (l
g
) into account: N
i,g
= w
0
+w
1
N
o,g
+w
2
log
2
(l
g
). We performed the
modeling for each tissue separately and compared the results for NMD-QTLs and eQTLs. As
shown in Figure 2.7 (a), the estimations of ˆ w
1
for NMD-QTLs were consistently larger than those
for eQTLs, especially for brain tissues (Figure 2.7 (c)). This suggests that compared to eQTLs,
NMD-QTLs are more likely to function within a gene body instead of in the intergenic regions.
24
However, the estimations of ˆ w
2
for NMD-QTLs and eQTLs were similar and the differences were
not significant (p-value=0.37, Wilcoxon test, Figure 2.7 (b)). This suggests that the gene length
factor was controlled at a similar level for NMD-QTLs and eQTLs.
For those NMD-QTLs residing within a gene body, we further examined whether they prefer
exon regions or intron regions. Compared to eQTLs, NMD-QTLs were more likely to be in exonic
positions (Figure 2.7 (d), p-values from Wilcox tests <1×10
−6
). The tissue-wise exonic proportions
of NMD-QTLs and eQTLs are further compared in Figure 2.7 (e). Regulatory elements for mRNA
degeneration are usually on exons after mRNA splicing. In addition, exon-junction complex (EJC)
has been reported to be involved in the initiation of NMD [10, 8]. It is possible that NMD-QTLs
in exonic regions are more effective to interact with the deposition of EJC to tag NMD regulation
for natural transcripts.
2.7 OrdinalpositionsofNMD-QTLsalongexonsandintrons
Since the location of EJC is essential for NMD regulation, we next studied the relative positions
of NMD-QTLs within a gene body by defining an ordinal rank for exons and introns. For each
gene, the combined exon intervals from the collapsed gene model (i.e., combining all isoforms of a
single gene into a single transcript) used in GTEx were considered. The intron intervals are then
deduced from those exon intervals. We then ranked these exon and intron intervals separately
from both 5’-end and 3’-end, starting from 0. For all genes with NMD-QTLs, the frequency count
and the median length of the exon (or intron) intervals on each ordinal position were shown in
Figure 2.9 (a), (b). For all genic NMD-QTLs, we recorded their exon (or intron) ordinal positions,
the raw count distributions are shown in Figure 2.9 (c), (d). To remove the counting bias, we
25
normalized the raw count by the number of exon or intron intervals at each ordinal position
(Figure 2.8 (a), (b)), and further by the median length of exon or intron intervals at each ordinal
position (Figure 2.8 (c), (d)). Both pNMD-QTLs and dNMD-QTLs were more likely to be located
in the last exon interval (rank 0 at 3’, Figure 2.8 (a)). However, this was partly due to the fact
that the last exon interval was usually much longer than other exons (median length of 831 vs.
187). Taking the exon interval length into account, the chance of an NMD-QTL locating in an
exonic position of the last exon interval (rank 0 at 3’) was actually much lower (Figure 2.8 (c)). The
highest frequency is within the penultimate exon from the 3’ end (i.e., rank 1), followed by the two
upstreaming exons (i.e., ranks 1-2). these 3’ exon ordinal positions many influence the formation
of the last EJC or affect the interaction between PTC and downstream EJC. Considering the intron
ordinal positions for NMD-QTLs, both pNMD-QTLs and dNMD-QTLs preferred the first intron
interval (rank 0 at 5’, Fig 7b, d). Specifically, pNMD-QTLs tended to avoid the positions of introns
with the rank 8 at 3’, whereas dNMD-QTLs were more uniformly positioned along introns other
than the first intron interval at 5’.
2.8 Location preference of NMD-QTLs at the base-pair
resolution
In addition to the ordinal positions along exons or introns, we examined the base-pair distance
of NMD-QTLs to the exon or intron boundary. Both NMD-QTLs and eQTLs favored the 0-50bp
26
regions close to exon boundaries (Figure 2.10 (a) for pNMD-QTLs)
∗
. On the other hand, NMD-
QTLs were more likely to be near intron boundaries than eQTLs (Figure 2.10 (b) for pNMD-
QTLs
†
).
The location of PTC is tightly related to NMD regulation. We therefore considered the dis-
tances between NMD-QTLs and PTCs of NMD-targeted transcripts (or regular stop codons of
non-NMD transcripts). For exonic NMD-QTLs, we calculated the distances on the transcripts
without considering intronic positions (Figure 2.10 (c)). Compared to eQTLs, pNMD-QTLs were
more concentrated in the (75 bp, +150 bp) regions around PTCs (orange line vs. gray line in the
upper panel of Figure 2.10 (c)), and more concentrated around the (25 bp, +50 bp) regions around
regular stop codons (red line vs. light blue line). We also observed another peak of pNMD-QTLs
(red line) around 400 bp upstream of the regular stop codons. Since the median distance be-
tween the upstream PTC and the downstream regular stop codon of the same gene is 822 bp
(calculated from annotation), the peak on 400 bp of regular stop codons usually corresponds to
positions downstream of PTCs. The distribution of exonic dNMD-QTLs was more similar to that
of exonic eQTLs (lower panel of Figure 2.10 (c)). However, dNMD-QTLs were still more concen-
trated around PTCs (orange line vs. gray line in the lower panel of Figure 2.10 (c)). Additionally,
we consider the genomic distances (exonic positions or intronic positions) between genic NMD-
QTLs and PTCs (or regular stop codons). As shown in Figure 2.10 (d), NMD-QTLs can be located
upstream or downstream of PTCs but rarely downstream of regular stop codons.
When we investigated the distance of NMD-QTLs relative to transcription start sites (TSS), we
noticed that majority of NMD-QTLs were close to or within genes (Figure 2.10 (d)). Interestingly,
∗
the results for dNMD-QTLs are the same, therefore the plot is omitted here
†
plot is similar for dNMD-QTLs hence omitted.
27
there was a noticeable peak around 172k base-pair upstream of the TSS especially for pNMD-
QTLs. This indicates some NMD-QTLs can affect the downstream NMD regulation through long-
range interactions.
2.9 The relationship between NMD-QTLs and miRNAs or
RNA-bindingproteins(RBPs)
We wonder how these cis-acting NMD-QTLs interact with trans-acting factors to impact the
NMD regulation. We focused on miRNAs and RBPs. miRNAs base-pair with complementary
mRNA molecules and function in RNA silence and gene regulation [63, 64]. RBPs are critical for
post-transcriptional RNA processing and modification, such as splicing, mRNA stabilization and
mRNA localization [65, 66]. We hypothesize some NMD-QTLs could function by affecting the
miRNA or RBP binding sites. Here we examined how many NMD-QTLs located at miRNA target
regions predicted by TargetScan [67] as well as how many NMD-QTLs located at RBP binding
regions surveyed in ENCODE [68].
When we considered the targets of conservative miRNAs, 7.6‰ nonredundant pNMD-QTLs
and 6.3‰ nonredundant dNMD-QTLs were found to be located in these miRNA target regions.
Both were higher than the 4.2‰ level observed for regular eQTLs (for pNMD-QTLs vs. eQTLs,
odds ratio=1.79, Fisher’s exact test p-value=6.3×10
−29
, for dNMD-QTLs vs. eQTLs, odds ratio=1.49,
Fisher’s exact test p-value=1.5×10
−62
). Top 20 miRNAs whose targets were enriched with pNMD-
QTLs or dNMD-QTLs are shown in Figure 2.11 (a) and seven of them were shared between pNMD-
QTLs and dNMD-QTLs (orange colored). The full list of miRNAs can be found in Supplementary
Table 2 in the github repository (A.5). Those miRNAs may be involved in the NMD regulation
28
pathways. When the colocalization percentages were calculated for individual tissues, NMD-
QTLs still had higher such percentages than eQTLs (Figure 2.11 (b)).
A total of 12.7% nonredundant dNMD-QTLs and 14.9% pNMD-QTLs were found to be in the
binding sites of RBPs in the K562 cell line. Similarly, 14.9% nonredundant dNMD-QTLs and 16.4%
nonredundant pNMD-QTLs were found to be in the binding sites of RBPs in the HepG2 cell line.
Such percentages were higher than the percentage of 10.3% and 12.3% for regular eQTLs in K562
and HepG2 respectively. For pNMD-QTLs vs. eQTLs, the odds ratio was 2.84 for K562 and 3.13 for
HepG2 (Fisher’s exact test p-values <1×10
−16
). For dNMD-QTLs vs. eQTLs, the odds ratio was 2.42
for K562 and 2.84 for HepG2 (Fisher’s exact test p-values <1×10
−16
). Among the top 20 RBPs whose
targets in the K562 cell line were enriched with pNMD-QTLs or dNMD-QTLs, 9 were shared
(Figure 2.11 (c), orange colored). The top 20 RBPs from HepG2 were more similar for pNMD-
QTLs and dNMD-QTLs (17 out of 20 were shared and orange colored, Figure 2.11 (d)), which
includes UPF1, the key effector of the NMD pathway. Other RBPs could function as co-factors or
modulators of the NMD regulation pathways. Full lists of RBPs were provided in Supplementary
Table 3 in the github repository (A.5). When we considered NMD-QTLs discovered from each
individual tissue, we still found that NMD-QTLs, especially pNMD-QTLs, were more likely to be
in the binding sites of RBPs compared to eQTLs (Figure 2.11 (d), (f)).
2.10 Colocalizationofsplicing-QTLsandNMD-QTLs
Normal stop codons are usually located in the last exon downstream of any EJC complex. A
common NMD-inducing feature is a PTC at least 50 nt upstream of an exon-exon junction [11, 12],
29
which was adopted in GENCODE. In mammalian cells, alternative splicing coupled to nonsense-
mediated decay (AS-NMD) is a conserved mechanism of post-transcriptional gene regulation [69].
Alternative splicing can alter the reading frame to produce a PTC-containing transcript which is
subject to NMD [3, 70, 71, 72].
We investigated the AS-NMD coupling through analyzing the colocalization of NMD-QTLs
and splicing-QTLs. Here, we identified splicing-QTLs by mapping variants associated with the
inclusion ratios of annotated cassette exons (details in Methods 3.11). As expected, NMD-QTLs
were much more likely to be colocalized with splicing-QTLs than with eQTLs (average odds ra-
tios: 15.9 for pNMD-QTLs vs. eQTLs and 6.1 for dNMD-QTLs vs. eQTLs respectively). Compared
to dNMD-QTLs, pNMD-QTLs were more likely to act as splicing-QTLs simultaneously since al-
ternative splicing is coupled with NMD through “producing” transcripts subject to NMD. Such
co-localization of pNMD-QTLs and splicing-QTLs was more prominent in brain tissues (average
odds ratio in brain tissues for pNMD-QTLs vs. eQTLs: 30.8, Figure 2.12 (a)). The odds ratio was
as high as 60.8 in the brain amygdala which defines and regulates human emotions. The results
suggest that AS-NMD can be a prevalent post-transcriptional regulation mechanism for brain
tissues especially for the amygdala.
To validate our discovery, we further focused on NMD-genes with direct annotation-based
evidence of AS-NMD coupling: given a cassette exon, all NMD-targeted transcripts include the
cassette exon and all non-NMD transcripts skip that cassette exon, or conversely, all NMD tran-
scripts skip the cassette exon and all non-NMD transcripts include it. As shown in Figure 2.12
(b), out of all 6503 NMD-genes, 803 (12.3%) of them had direct annotation evidence for AS-NMD
coupling (the green arc). Out of the 2614 NMD-genes with splicing-QTLs, 706 genes (27.0%)
had direct evidence of AS-NMD coupling. For genes regulated by colocalized pNMD-QTLs and
30
splicing-QTLs, the percentage of genes with direct AS-NMD coupling was even higher (64 out
of 197, 32.5%). However, this percentage decreased to 23.1% (226 out of 979) for genes with colo-
calized dNMD-QTLs and splicing-QTLs. Thus, NMD-genes with direct AS-NMD coupling were
more likely to have splicing-QTLs and colocalized pNMD-QTLs (Fisher’s exact tests p-values
<1×10
−16
). This corroborates that alternative splicing couples with NMD through ‘producing’
NMD-transcripts, and suggests that alternative splicing itself does not influence decay efficiency
of NMD transcripts.
31
Cap
Cap
AUG
AUG
Intron Intron
AAAAA
AAAAA
STOP
STOP
RNA splicing
mRNA decay
EJC EJC
Exon Junction Complex
Cap
AUG
AAAAA
STOP
Normal Transcript
Cap AAAAA
Cap AAAAA
Cap AAAAA
non-NMD transcripts NMD-targeted transcripts (not decayed yet)
Premature Termination Codon (PTC)
Ribosome
translation
Ribosome displace EJC
Last EJC retained
Cap
AUG
AAAAA
STOP PTC
PTC
early stop
Initiate NMD NMD machinery
pre-mRNA
Cap AAAAA
Whole-genome NMD-QTL calling
48 human tissues
+/+ +/-
-/-
genotype
expression level
+/+ +/-
-/-
genotype
expression level
pNMD-QTL
dNMD-QTL NMD-targeted transcripts
non-NMD transcripts
pNMD-QTL
α ≠ α , θ = θ = θ , t =t =t
dNMD-QTL
θ ≠ θ , α = α = α, t =t =t
Regular QTL
t ≠ t , α = α =α , θ = θ = θ
Transcript type
non-NMD
non-NMD
non-NMD
NMD-targeted
NMD-targeted
NMD-targeted
Total detected transcripts
AA Aa aa
Strength of a
(Aa-AA or aa-Aa)
2tα θ t(α +α )θ 2tα θ
2t(1-α ) 2t(1-α ) t(2-α -α )
2tαθ tα(θ +θ ) 2tαθ
2t(1-α) 2t(1-α) 2t(1-α)
2t αθ 2t α θ (t +t )αθ
2t (1-α) 2t (1-α) (t +t )(1-α)
t(α -α )θ
-t(α -α )
0
tα(θ -θ )
(t -t )αθ
(t -t )(1-α)
* According to our modeling, tα(1-θ) fraction transcripts are NMD-targeted, and will not be detected due to decaying
A
A
A A
A
a a a
a A a A a
a
A
A a A a
a
a
a
a
a
a
A
A
A
A
A
A
A a
A
A
a a
a
A
A
A
A
A
a
a
a
a
a
(a)
(c)
(b)
decayed transcripts
will not be detected *
functional
protein
poly-A
selection
poly-A
selection
Genotype
Haplotype
mRNA transcription
A
A
a
a
NND-QTL
AAAAA
AAAAA
AAAAA
AAAAA
non-NMD transcripts
non-NMD transcripts
non-NMD transcripts
non-NMD transcripts
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
AAAAA
NMD-targeted transcripts
NMD-targeted transcripts
NMD-targeted transcripts
NMD-targeted transcripts
AAAAA
AAAAA
AAAAA
α
A
1-α
A
α
a
1-α
a
α : percent of NMD-targeted transcripts
: detecable NMD-targeted transcripts after NMD
θ
A
1-θ
A
θ
θ
a
1-θ
a
t
A
t
a
total transcribed total transcribed
Figure 2.1: Modeling of NMD-QTLs. (a) Identification of NMD-QTLs across human tissues based
on the GTEx data. (b) Parameterization of the profiled expression of NMD genes. t represents
the total amount of transcribed transcripts,α represents the percentage of transcribed transcripts
targeted by NMD, and θ represents the percentage of NMD-targeted transcripts that remained
after degradation. The parameters can be different for different alleles of a genetic variant. (c)
Allelic effect sizes of a pNMD-QTL, dNMD-QTL, or eQTL for NMD-targeted transcripts and non-
NMD transcripts.
32
Figure 2.2: Identification of pNMD-QTLs and dNMD-QTLs. (a) Examples of a pNMD-QTL and
a dNMD-QTL. A variant is identified as a pNMD-QTL if the allelic effect size is in opposite di-
rections for NMD-targeted transcripts and non-NMD transcripts. A variant is identified as a
dNMD-QTL if the association is observed in NMD-targeted transcripts only. (b) Precision-recall
curves for identifying pNMD-QTLs and dNMD-QTLs. The effect size was simulated between 0.1
and 0.15. (c) The number of pNMD-QTLs and dNMD-QTLs detected for each of the 48 tissues.
The shaded areas represent the number of variants also identified as eQTLs in GTEx. The color
scheme used for tissues is the same as the one used in GTEx. (d) The fractions of positive effect
sizes of pNMD-QTLs and dNMD-QTLs for NMD-targeted transcripts in each tissue.
33
Precision
Recall
Precision
Precision
p-values p-values p-values p-values p-values p-values
NMD-targeted non-NMD
Small regulatory effect size (0.05-0.1) Medium regulatory effect size (0.1-0.15)
Recall Recall
Large regulatory effect size (0.15-0.2)
NMD-targeted non-NMD NMD-targeted non-NMD
Count Count
Count Count
Count Count
Figure 2.3: NMD-QTL identification results for simulations with various effect sizes. The top
panels show the Precision-Recall curve for small effect size (0.05-0.1), medium effect size (0.1-
0.15), and large effect size (0.15-0.2). The bottom panels show the p-value distributions for the
allelic effect for NMD-targeted transcripts and non-NMD transcripts.
34
Figure 2.4: Tissue specificity of NMD-QTLs. (a) The similarity of NMD-QTL signatures between
tissue pairs. The similarity score s* reflects the enrichment of shared NMD-QTLs among different
tissues. The detected NMD-QTLs from brain tissues are very distinctive from other tissue. (b)
Clustering of similarity scores within brain sub-regions. (c) Tissue interaction map. No edge is
drawn if the similarity score s* between the two tissues <10. Otherwise, black edges are drawn
if10≤ s
∗ <20; orange edges are drawn if20≤ s
∗ <30 and red edges are drawn ifs
∗ >30. (d)
The number of tissues an NMD-QTL variant is detected from.
35
x
y
z
x / (x + y) = 37.63%
y / (y + z) = 33.61%
(a) (b)
(c)
0 2 4 6 8
Enrichment of NMD-QTLs
Colocalization score with disease SNPs
Figure 2.5: Variant-disease association analysis for NMD-QTLs. (a) Venn diagram among NMD-
QTLs, eQTLs and disease SNPs. About 33.61% (the gray sector) disease SNPs identified as eQTLs
are also discovered as NMD-QTLs. And 37.63% (light blue sector) disease SNPs identified as NMD-
QTLs are not discovered as eQTLs.(b) Top 10 diseases associated with enriched NMD-QTLs. They
are ranked by the p values of the proportion tests. (c) Colocalization with disease SNPs under
each MeSH term. The colocalization scores are compared among pNMD-QTLs, dNMD-QTLs,
eQTLs, and randomly sampled SNPs (the number is the same as that of NMD-QTLs).
36
non-disease NMD-QTLs
non-disease eQTLs
Disease NMD-QTLs
Disease eQTLs
# genes
(a)
(b)
with disease NMD-QTLs
with non-disease NMD-QTLs only
Brain
Tissues
Brain
Tissues
Figure 2.6: Disease NMD-QTLs in brain tissues. (a) Tissue pair-wise similarity of non-disease
and disease QTL signatures. Disease NMD-QTLs but not disease eQTLs are more similar in brain
tissues.(b) UpSet plot showing the top 25 most frequent brain sub-region memberships displayed
by genes with NMD-QTLs. Compared to genes whose NMD-QTL associations are observable in
almost all brain sub-regions, genes appeared in only a single brain sub-region are more likely to
contain disease NMD-QTLs.
37
Figure 2.7: Genomic locations of NMD-QTLs. (a) The coefficient estimations for ˆ w
1
. (b) The
coefficient estimations for ˆ w
2
. ˆ w
1
and ˆ w
2
are for the model N
i,g
= w
0
+w
1
N
o,g
+w
2
log
2
(l
g
),
where N
i,g
is the number of NMD-QTLs (or eQTLs) inside a gene, N
o,g
is the number of NMD-
QTLs (or eQTLs) outside of any gene, and l
g
is the gene length. (c) The ˆ w
1
estimation for each
tissue. Error bars: 95% confidence intervals. The ˆ w
1
estimations for NMD-QTLs are consistently
larger than those for eQTLs, meaning that NMD-QTLs prefer locating inside a gene body.(d) The
proportion of genic variants that are located on exons. The exonic proportions are significantly
higher for NMD-QTLs, compared to eQTL controls which are eQTLs discovered for the same
pNMD- or dNMD-genes. (e) The scatter plots of the exonic proportions. The size of the symbol
is proportional to the sample size for a specific tissue.
38
Normalized by exon count per ordinal position Normalized by intron count per ordinal position
Normalized by both exon count and exon length Normalized by both intron count and intron length
Frequency Frequency
Frequency Frequency
5’ 3’ 5’ 3’ 5’ 3’
5’ 3’ 5’ 3’
5’ 3’
5’ 3’ 5’ 3’
(a) (b)
(c) (d)
Figure 2.8: Ordinal positioning of NMD-QTLs along exon and intron intervals. The exon and
intron intervals are deduced from the collapsed gene model and ranked from 0 for both 5’ and 3’
ends. (a) Distribution of the count of NMD-QTLs on each exon ordinal position normalized by
the exon count per ordinal position. (b) Distribution of the count of NMD-QTLs on each intron
ordinal position normalized by the intron count per ordinal position.(c) Distribution of the count
of NMD-QTLs on each exon ordinal position normalized by both the exon count and exon length
per ordinal position. (d) Distribution of the count of NMD-QTLs on each intron ordinal position
normalized by both the intron count and intron length per ordinal position.
39
(a) exon/intron count by ordinal position
(b) Median exon/intron length by ordinal position
(c)
Raw NMD-QTL count by exon ordinal position
(d)
Raw NMD-QTL count by intron ordinal position
Frequency Frequency
Frequency Frequency
Figure 2.9: Ordinal positions of exon and intron intervals in NMD genes. (a) The exon or intron
interval count per ordinal position. (b) The median exon or intron length per ordinal position.
(c) Distribution of the raw pNMD-QTL or dNMD-QTL count per exon ordinal position. (d) Dis-
tribution of the raw pNMD-QTL or dNMD-QTL count per intron ordinal position.
40
Figure 2.10: Distances of NMD-QTLs to boundaries of exons, introns, stop codons, and transcrip-
tion start sties.(a) Distances to exon boundaries for variants located on an exon. The comparison
is made between pNMD-QTLs and eQTLs.(b) Distances to intron boundaries for variants located
on an intron. The comparison is made between pNMD-QTLs and eQTLs. (c) Exonic distances
of pNMD-QTLs and dNMD-QTLs to stop codons, and the corresponding eQTL controls compar-
isons. (d) Distances to transcription start sites. Note that all genetic variants within 1 million
base-pair up- and down-stream TSS are considered in our NMD-QTL mapping. (e) Genomic
distances between NMD-QTLs and the stop codons. In (e), different from (c), both exonic and
intronic base-pair were counted.
41
Figure 2.11: Relationship between NMD-QTLs and miRNAs or RBPs. (a) Top 20 miRNAs whose
target regions are enriched with dNMD-QTLs or pNMD-QTLs. (b) Percentages of NMD-QTLs in
miRNA targets compared to that of eQTLs. (c) Top 20 RBPs whose binding regions profiled in
K562 are enriched with dNMD-QTLs or pNMD-QTLs. (d) Percentages of NMD-QTLs in binding
regions of RBPs in K562 compared to that of eQTLs. (e) Top 20 RBPs whose binding regions
profiled in HepG2 are enriched with dNMD-QTLs or pNMD-QTLs. (f) Percentages of NMD-
QTLs in binding regions of RBPs in HepG2 compared to that of eQTLs.
42
Brain tissues
(a)
(b)
0(6503)
803 (12.3%)
NMD-genes (n=6503)
NMD-genes with splicing-QTL
5700
1908
226
753
706
64
133
+
+
pNMD-genes
=
+
+
23.0%
dNMD-genes
=
direct annotation-based
AS-NMD coupling
pNMD-genes with colocalized splicing-QTLS
dNMD-genes with colocalized splicing-QTLS
+
clean non-clean
+
clean non-clean
% of colocalizatoin with splicing-QTLs
Figure 2.12: Relationship between NMD-QTLs and miRNAs or RBPs. (a) Top 20 miRNAs whose
target regions are enriched with dNMD-QTLs or pNMD-QTLs. (b) Percentages of NMD-QTLs in
miRNA targets compared to that of eQTLs. (c) Top 20 RBPs whose binding regions profiled in
K562 are enriched with dNMD-QTLs or pNMD-QTLs. (d) Percentages of NMD-QTLs in binding
regions of RBPs in K562 compared to that of eQTLs. (e) Top 20 RBPs whose binding regions
profiled in HepG2 are enriched with dNMD-QTLs or pNMD-QTLs. (f) Percentages of NMD-
QTLs in binding regions of RBPs in HepG2 compared to that of eQTLs.
43
Chapter3
Method
3.1 Dataacquisition
For NMD-QTL mapping, we obtained expression data at the transcript level and genotype data
from the Genotype-Tissue Expression (GTEx) project (dbGaP accession: phs000424.v7.p2). We
grouped transcripts tagged by “nonsense_mediated_decay” in the GENCODE annotation (re-
lease 19, GRCh37.p13) as NMD-targeted transcripts for a gene. For disease SNP analysis, we
downloaded the curated variant-disease associations from the DisGeNET database (GRCh38) [57].
Genome coordinates from different assembly versions were converted to hg19 through segment_-
liftover [73]. For miRNA targets, we considered the genome coordinates of targets (conserved or
not) for conservative and broad conservative miRNAs obtained in TargetScan [67] which included
a total of 257 miRNAs. For RNA binding protein (RBP) binding sites, eCLIP experiments for 120
RBPs from the K562 cell line and 103 RBPs from the HepG2 cell line from ENCODE were used.
The data resources we used and URL links for them are summarized in Table A.1.
44
Data Resource URL
GTEx Analysis V7 https://gtexportal.org/home/datasets
Gene annotation file https://www.gencodegenes.org/human/release_19.html
Variant-disease association https://www.disgenet.org/downloads
Genome assembly conversion tool https://github.com/baudisgroup/segment-liftover
miRNA targets http://www.targetscan.org/cgi-bin/targetscan/data_-
download.vert72.cgi
Table 3.1: The data sources and links used in this study.
3.2 ModelingofNMD-QTLs
Assuming the amount of mRNAs being transcribed is t, the percentage of transcribed mRNAs that
are subject to NMD isα , and the NMD degradation efficiency relative to regular mRNA degrada-
tion is (1− θ ), the amount of NMD-targeted transcripts observed from the RNA-seq experiment
is the result of t· α · θ . To apply an additive genetic model, the total level for NMD-targeted
transcripts can be written asT =t
A
· α A
· θ A
+t
a
· α a
· θ a
, where the subscript A and a denote the
two alleles for a bi-allelic SNP. Two scenarios were our major interests: (i) pNMD-QTL where the
genetic variant affects the percentage of transcripts subject to NMD ( α A
̸= α a
), (ii) dNMD-QTL
where the genetic variant regulates the NMD efficacy ( θ A
̸= θ a
). These two cases can be dis-
tinguished from regular eQTLs by examining their effect sizes obtained in our NMD-aware QTL
mapping procedure as mentioned before in Figure 2.1.
3.3 IdentificationofNMD-QTLs
The mRNA transcripts of a gene were divided into two categories: NMD-targeted and non-NMD
transcripts, based on whether they were annotated by a “nonsense_mediated_decay” tag in the
GENCODE annotation. We summarized transcripts level by adding up the transcripts per million
45
(TPM) of all isoforms from the NMD-targeted group and the non-NMD group respectively. We
excluded genes with all or none NMD-tagged transcripts. The summarized expressions were nor-
malized within samples by a normalizing factor calculated based on the median of the geometric
mean of all genes considered and were then normalized across samples by a rank-based inverse
normal transformation. Detailed formulas for normalization can be found in the Appendix A.1
and A.2.
QTL mappings were performed with FastQTL [74] on both the NMD-targeted and non-NMD
groups. According to our modeling, pNMD-QTLs and dNMD-QTLs can be identified by com-
paring their regulatory effect sizes (Figure 2.1). For pNMD-QTLs, the allelic regulatory effect for
NMD-targeted transcripts and non-NMD transcripts are in the opposite direction. For dNMD-
QTLs, the regulatory effect is only observable for NMD-targeted transcripts but not non-NMD
transcripts. In our QTL mapping (NMD-QTLs and splicing-QTLs), similar to the approach in
GTEx, variants in the up- or down-stream 1Mb cis window around the TSS were considered. The
same covariates used in GTEx eQTL analysis were used. They included genotyping principal
components, sequencing platform, sequencing protocol, sex, and a set of covariates identified
using the Probabilistic Estimation of Expression Residuals (PEER) method [75].
3.4 Effectsizeconfirmationwithquantileregression
RNA-seq quantifications tend to be heavy-tailed and with inflated number of zeros. The effect
size estimation in the ordinary linear square fit in FastQTL is sensitive to such non-normality
and outliers, even with the applied inverse normal transformation. To reduce false positives and
increase the model robustness, for pNMD-QTLs and dNMD-QTLs discovered from FastQTL, we
46
further fit quantile regression models on TPM values without the inverse normal transformation
as we did previously [22]. Only pNMD-QTLs and dNMD-QTLs still exhibiting desired effect size
properties in quantile regression models were in our final discovery list.
We noticed the excess counts of zeros lead to power issues and difficulty in estimating the
effect size in the association observed in eQTL analysis. In one of my previous publications, we
proposed to use quantile regression instead of ordinary least square fit to model the gene-variants
association. For the reason that quantile regression have the desired robustness to outliers and
inflated number of zeros. Figure 3.1 and Figure 3.2 show that quantile regression performed
the best among all popular robust regression algorithms and gives the most accurate effect size
estimation.
3.4.1 Zero-inflationinRNA-seqdata
Transcriptomics has become one of the standard tools in modern biology for unraveling the
molecular basis of biological processes and diseases. Although a large amount of zeros are antici-
pated for single-cell RNA-seq data (scRNA-seq), due to the fact that in scRNA-seq individual cells
are captured and the amplification therefore start from the minute amount of starting material.
There is substantial controversy over the origins and proper treatment of zeros, and in practice,
we observed that in bulk RNA-seq (for example, the RNA-seq data we used from GTEx v7) the
zero-inflation can be very much severe as well. Figure 3.3 (c) shows the fractions of zeros in the
Transcriptome Per Million (TPM) from GTEx v7 expression data. And Figure 3.3 (d)
47
Figure 3.1: Comparison of different linear models on simulated expression with varying dropout
zeros effects. (A) The underlying true effect size is 0.3. (B) Underlying effect size is -0.3. The
percentage of zero expression increases from 10%to 40%, from left to right. Quantile regression
always provides a robust model estimation while the performance of both OLS and Huber regres-
sion becomes worse when the percentage of zero increases. (C) Comparison between quantile
regression and INT-OLS when the dropout percentage is 40% or 45% (effect size is -0.3).
48
Figure 3.2: Comparison of different linear models on effect size estimation. (A) Estimated effect
size of different linear models on a simulated data with underlying effect size = 0 and an over-
dispersion issue (error term simulated from a Laplace distribution with s=1.5). (B)Comparison of
quantile regression, OLS and INT-OLS under different residual distributions (bottom: Student’s
t distribution (DF=5); top: inverse Gamma distribution (shape=4, scale=0.2)). The mean kurtosis
among the 10,000 simulations was increased from 14.16 to 19.69. The underlyingβ gt
is 0 or 0.097.
49
3.4.2 QuantileRegression
3.4.2.1 Definitionofquantile
Before discussing quantile regression, let’s recall on the definition of quantiles. Given some ob-
servationsy
1
,y
2
,...,y
n
∈R. How do we define the 50th quantile, better known as the median?
The median is the element right in the middle ifn is odd, or the sum of the two middle elements
if n is even. Quantiles can be defined in a similar manner: for τ ∈ (0,1), the τ -th quantile is a
value such that at leastτ fraction of the observations are less than or equal to it. This definition
is straightforward to understand, but it is a bit hard for this notion to be easily generalized.
3.4.2.2 Theminimizationproblem
Instead, we can define the τ -th quantile as the solution to a minimization problem. And the algo-
rithm to find τ -th quantile can be viewed as finding the solution to such minimization problem.
Here is the minimization problem way of defining quantile: if we define ρ τ (y)=y(τ − I{y <0}),
then for a set of observationsy
1
,y
2
,...,y
n
∈R, we define its τ -th quantile as
ˆ q
τ =argmin
q∈R
n
X
i=1
ρ τ (y
i
− q)
=argmin
q∈R
n
X
i=1
(y
i
− q)(τ − I{y
i
=q
(y
i
− q). If we increaseq by some littleϵ so that the numebr
of terms in each summation doesn’t change, then the overall expression increases by
np(1− τ )ϵ − n(1− p)τϵ =nϵ (p− τ ) (3.3)
If p < τ , then we decrease the objective function value in Equation 3.1 by making q larger.
Ifp > τ , then we can decrease the objective function value by makingq smaller. The only place
where we can’t have any improvement is whenp=τ , i.e.,q chose such tahtτ of the observations
are less than it.
3.4.2.3 Extensiontoregression
Koenker and Bassett Jr (1978) extended this minimization problem idea to quantile regression.
For each response value y
i
, we have a corresponding estimate ˆ y
i
. We can interpret the setting
above as trying to minimize
51
n
X
i=1
ρ τ (y
i
− ˆ
(y
i
)) (3.4)
where ˆ y
i
= β 0
+β 1
x
i
1
+...+β p
x
ip
andx
i
1
,...,x
ip
are the values for thei-th observation
for thep feature on hand. Thus we can rewrite the minimization problem as
argmin
β 0
,...,β p
n
X
i=1
ρ τ (y
i
− β 0
− β 1
x
i
1
−···− β p
x
ip
) (3.5)
or in vector form (and incorporating the intercept as a feature)
argmin
⃗
β n
X
i=1
ρ τ (y
i
− x
T
i
⃗
β ) (3.6)
This gives us the formulation of quantile regression. Note compared to linear regression,
we applyρ τ instead off(x)=x
2
to eachy
i
− x
T
i
⃗
β 3.5 Simulationstudiestotestourmodel
To validate our NMD-QTL models, we estimated the precision and recall for identifying pNMD-
QTLs and dNMD-QTLs via massive amount of simulations. We simulated the expression levels
(total detected transcripts) for NMD-targeted transcripts and non-NMD transcripts by formulas
in Figure 1c for different scenarios. Then we performed QTL calling for both NMD-targeted
and non-NMD transcripts through regression models. We declared pNMD-QTL or dNMD-QTL
findings according to the desired effect size and directions.
Specifically, we performed three rounds of simulations, each round with different regulatory
strength and 10,000 simulated genetic variants. For each genetic variant, n genotypes, X =
52
[x
1
,x
2
,...,x
n
], were simulated with a minor allele frequency ofm, wheren=500 andm=0.05.
The total detected transcripts, Y
c
= [y
c
1
,y
c
2
,...,y
c
n
], were simulated as a function of genotype
and transcripts type plus a Gaussian random noise: y
c
i
=f
c
(x
i
;Θ i
(t,α,θ ))+σ i
.
For the simulated expression and genotype data, we performed a linear regression Y
c
=
β c
X+ϵ for each SNP and inspected the coefficient β c
. If β ′
c
were significant (p-value less than
a particular threshold) for both NMD-targeted and non-NMD transcripts and were in opposite
directions, the SNP was identified as a pNMD-QTL. If β c
was significant for NMD-targeted tran-
scripts but not for non-NMD transcripts, the SNP was identified as a dNMD-QTL. The detailed
parameter settings for simulations with different regulatory strengths can be found in Appendix
A.3
3.6 Genome-wideNMD-QTLmappingforGTExtissues
The analysis power for NMD-QTL detection was linearly related to the sample size (Figure 3.4
(a), the regression coefficient of the sample size was 28.8 for pNMD-QTLs and 117.0 for dNMD-
QTLs), but less sensitive than that for eQTL detection (Figure 3.4 (b), regression coefficient: 2938).
The testis and whole blood tissues were drastically deviated from this linear trend (Figure 3.4 (a)),
suggesting that the testis had many more NMD-QTLs while the whole blood had many less NMD-
QTLs than expected. Such deviations were also observed in the eQTLs mapping (Figure 3.4 (b))
53
3.7 NMD-QTLsignaturesimilaritybetweenatissuepair
To measure the similarity of identified NMD-QTLs between tissues, we calculated s
ij
=
n
ij
/(n
i
− n
ij
)
(n
j
/(n− n
j
))
where n
ij
is the number of NMD-QTLs shared between tissue i and j; n
i
and n
j
are the num-
bers of NMD-QTLs discovered in respective tissues; and n is the total number of nonredundant
NMD-QTLs discovered across all tissues. Thus, s
ij
compares the odds of observing tissue j’s
NMD-QTLs in tissuei with the odds of observing tissuej’s NMD-QTLs in all NMD-QTLs. Note
thats
ij
is not necessarily equal tos
ji
, we calculateds
∗ ij
=
s
ij
+s
ji
2
as the final similarity score for
NMD-QTL signatures between two tissues. If two tissues share many NMD-QTLs,s
∗ ij
is large.
3.8 ColocalizationofNMD-QTLswithdisease-associatedSNPs
Considering the disease associated variants reported in DisGeNET, we compared how likely the
regular eQTLs reported by GTEx and our NMD-QTLs to be colocalized with disease SNPs. We
devised a strategy with the combination of a proportion test and a hyper-geometric test to exam
the significance of the colocalization between disease SNPs and NMD-QTLs. Here, we pooled
NMD-QTLs identified across different tissues. Thus, one NMD-QTL could be counted multiple
times if it was identified in multiple tissues. Detail of such two tests can be found in Appendix
A.4.
3.9 ColocalizationenrichmentscoreforMeSH
For a disease category represented by a MeSH term (C01-C26, F01-F03), the colocalization en-
richment score was calculated as the number of pNMD-QTLs or dNMD-QTLs which were also
54
associated with the disease SNP in the DisGeNET database for the MeSH term, further normal-
ized by the total number of pNMD-QTLs or dNMD-QTLs and the total number disease SNPs
annotated by the MeSH term in DisGeNET. The same calculation was performed for eQTL and
randomly sampled SNPs.
3.10 Enrichment of NMD-QTLs in the targets of miRNAs
andRBPs
miRNA and RBPs were ranked by the number of nonredundant pNMD-QTLs or dNMD-QTLs
across all tissues which were located in their target regions and normalized by the total base-pair
length of the target regions.
3.11 Splicing-QTLmapping
We identified all cassette exons from the GENCODE annotation. A cassette exon is defined as a
non-boundary exon between two other exons and can be either included or skipped to generate
two distinct transcripts in alternative splicing. To simplify, we considered transcripts either use
or skip the cassette exon entirely. In other words, transcripts partially overlap with the cassette
exons were excluded. To perform splicing-QTL mapping, for a given cassette exon, we used
the inclusion ratio of that cassette exon as the quantitative trait. The inclusion ratio of a given
cassette exon is calculated as follows. All transcripts from a gene{t
(i)
}
M+N
i=1
can be divided into
55
two groups{t
(i)
}
M+N
i=1
={t
(j)
inc
}
M
j=1
∪{t
(k)
exc
}
N
k=1
,t
inc
are transcripts that include the cassette exon,
t
exc
are transcripts that exclude the cassette exon. The inclusion rate is:
P
M
j=1
TPM
t
(j)
inc
P
M
j=1
TPM
t
(j)
inc
+
P
N
k=1
TPM
t
(k)
exc
(3.7)
, whereTPM(·) is the detected transcripts per million for a transcript. The list of cassette exons
and their corresponding transcripts are too long to be presented here, and can be found in the
Supplementary Table 5 in github repository Appendix A.5.
56
Figure 3.3: Over-dispersion and pervasive dropout zeros in GTEx RNA-seq data. (A) Boxplots of
gene expression kurtosis. For each gene, the excess kurtosis was calculated over all samples from
a given tissue. The boxplots are colored by the sample size for a given tissue. We then chose
the whole blood data for real data analysis hereinafter since it has the most widely distributed
kurtosis. (B) Rug density plot of log2-transformed gene expression level. FOXC1 is an example
of large kurtosis genes. NAPGA is an example of small kurtosis genes. The vertical line marks
the mean, and the rugs on x-axis show individual data points. (C) Percentage of genes with zero
expression values. The left panel shows the results for all genes. The right panel shows genes with
TPM > 0.1 in at least 10 samples. (D) Scatter plot of gene expression values from two randomly
selected genes. One was with high kurtosis and the other with large amounts of dropouts.
57
(a) (b)
Testis
Whole Blood
Testis
Whole Blood
Figure 3.4: Detection power and sample size. (a) Positive correlation between the number of
NMD-QTLs identified for each tissue and the tissue sample size (i.e., the number of donors). (b)
The positive correlation observed for GTEx eQTLs as well.
58
Chapter4
DiscussionandBroaderImpact
4.1 Discussion
To uncover the etiological path from SNPs to resultant phenotypic traits, it is essential to under-
stand the genetic mechanisms underlying gene expression variation. The current gene expression
regulatory landscape derived from eQTL studies is far from complete. Traditional eQTL studies
have been focusing on the identification of genetic loci linked to variations in the overall mRNA
expression [76, 77, 78], and more recently to variations in mRNA splicing [79, 80, 81]. A special
interest of the current human eQTL studies is tissue-specific genetic impacts such as the efforts in
GTEx [76, 82, 83], because they can provide valuable insights into disease phenotypes manifested
on particular tissues. However, transcriptional and post-transcriptional regulations jointly deter-
mine tissue-specific gene expression, which needs a more careful integrated examination. In this
study, we analyzed the post-transcriptional processing of RNAs and proposed a more fine-grind
regulatory QTL model, the NMD-QTL model, which pinpoints genetic variants that function via
regulating the production of NMD-transcripts (pNMD-QTLs) and the decay efficiency of targeted
RNAs (dNMD-QTLs). Through joint consideration of allelic effect sizes for both NMD-transcripts
59
and normal transcripts, we distinguish pNMD-QTLs and dNMD-QTLs from regular eQTLs. The
modelling has been validated by extensive simulations mimicking real data sets. Taking advan-
tage of the unprecedented resource in GTEx, we reveal the genome-wide landscape of genetic
variants affecting NMD regulation which can be missed in regular GTEx eQTLs mapping, and
uncover its tissue specificity and its variant-disease associations.
Little is known about the factors underlying variable NMD efficiency. The enrichment of
NMD-QTLs within gene bodies and exons especially the penultimate exons from the 3’ end may
help to pinpoint new features for NMD regulation in addition to the 50-nt rule. Regions around
PTC are important regulatory regions since NMD-QTLs tend to be located around PTCs. The
enrichment of pNMD-QTLs in near proximity of regular stop codons and their upstream 400 bp
positions implies additional determinants of producing NMD-targeted transcripts.
The location preferences of NMD-QTLs in the targets of miRNAs and RBPs provide new clues
that NMD regulation is not fixated but can be modulated by miRNAs and RBPs. The mam-
malian core NMD machinery includes a PIKK complex (SMG1, SMG8, SMG9) and other SMG
proteins (SMG5, SMG6, SMG7), UPF proteins (UPF1, UPF2, UPF3A, UPF3B), eukaryotic release
factors (eRF1, eRF3), and exon junction complex (EJC) members (eIF4A3, RBM8A, MAGOH, and
MLN51) [77, 78, 33]. Enrichments of NMD-QTLs colocalizing with RBP binding sites and miRNA
targets suggest that these cis-acting regulatory elements interact with RBPs and miRNAs to mod-
ulate the generation and decay efficiency of NMD transcripts. The colocalization of NMD-QTLs
especially pNMD-QTLs with splicing-QTLs suggests that RBPs may interact with NMD regula-
tions through the coupling of alternative splicing and NMD.
60
4.2 BroaderImpact
A large majority of genetic variants reported in the genome-wide association studies (GWAS)
of common human diseases lies in introns or intergenic regions [84], suggesting their roles in
gene expression regulation instead of protein coding. Our results suggest NMD-QTLs are more
valuable to dissect the variant-disease associations, and characterizing disease susceptibility, es-
pecially for brain and mental disorders. The higher colocalization rate of NMD-QTLs with disease
SNPs makes them especially valuable for facilitating the narrow-down and the interpretation of
variants reported in GWAS. The enrichment of NMD-QTLs was observed for a variety of diseases,
including musculoskeletal diseases, digestive system diseases, respiratory tract disease, skin and
connective tissue diseases, and immune system diseases. The link between NMD regulation and
various disease susceptibility stresses the fact that NMD functions beyond the scope of RNA
surveillance and as an intricate gene regulation mechanism across human tissues. In particu-
lar, brain tissues exhibit unique NMD-QTL signatures, and these NMD-QTLs uncover prominent
variant-disease associations.
Approximately 3 million individuals in the US alone are afflicted with genetic diseases caused
by nonsense mutations (PTC diseases) [85]. Additionally, deficiencies in NMD factors have been
linked to neurological disorders, immune diseases and cancers [86, 87, 32]. An NMD-based ther-
apeutic strategy will not be possible without a comprehensive understanding of NMD regulation.
For example, NMD efficiency influences patients’ response to nonsense-suppression drugs [88],
but how NMD efficiency is modulated is unclear. NMD inhibitors have been proposed to treat
61
PTC diseases, but potential side effects are awaiting to be addressed [85, 89, 45]. Our discov-
ered NMD-QTLs may assist the design of gene-specific NMD inhibition to treat PTC diseases,
mitigating the global side effects caused by manipulations of core NMD factors [90].
62
Chapter5
Conclusions
We reveal the genome-wide landscape of genetic variants associated with NMD regulation across
human tissues. We name this novel type of functional genetic variants theNMD-QTL. Genetic
variants affecting the percentage of NMD-targeted transcripts are referred as pNMD-QTLs. Ge-
netic variants regulating the decay efficiency of NMD-targeted transcripts are referred as dNMD-
QTLs. The tissue specificity analysis of NMD-QTLs demonstrates NMD regulatory elements play
an especially important role in brain tissues. We empirically showed the genomic positioning of
NMD-QTLs with regard to important genomic features including exon, intron, transcription start
site, premature and regular termination codons. Such exploration of NMD-QTLs’ preferential
genomic positions suggests additional attributes for a transcript to be regulated by NMD. Fur-
thermore, the colocalization analysis with other trait-associated SNPs and post-transcriptional
regulatory elements assists the understanding of their functionality and their interactions with
other post-transcriptional regulations.
63
These NMD-QTL mappings across different tissues, their disease associations, and their posi-
tion characteristics are valuable for exploring the post-transcriptional RNA biogenesis, establish-
ing the etiological roadmap from SNPs to resultant phenotypic traits, and facilitating the research
of NMD-based therapeutic strategies for genetic diseases.
64
AppendixA
Appendix
A.1 Customizeexpressionnormalization
We performed NMD-QTL mapping for 48 tissues from GTEx with sample size≥ 70. For a given
tissue, the NMD expression and non-NMD expression were obtained by the following procedure:
1. lowly expressed genes were filtered out, only genes with TPM > 0.1 in at least 10 samples
were selected for analysis
2. summarized the filtered genes expression into NMD expression and non-NMD expression.
3. The numbers start at 1 with each use of theenumerate environment.
4. NMD expression could be sparse, only genes with number of zeros less than 40% in NMD
expression were kept.
5. phenotype traits (NMD expression and non-NMD expression) were normalized within
samples based on a normalization factor, suppose the total number of genes is G and
the total number of individuals isK, normalization factor for each sample k was calculated
based on all genes as follows:
65
• First, a geometric mean for expressed genes is calculated based for each individualk:
µ (k)
=
Q
g∈G:TPM
(k)
g
̸=0
TPM
(k)
g
1
P
g∈G
I(TPM
(k)
g
̸=0)
, where I(·) is the indicator func-
tion.
• Then, the normalization factor for each individual k is obtained as the median fold-
change of expressed genes compared to this geometric mean:
δ (k)
=median(
TPM
(k)
g
µ (k)
, forg∈G)
• Finally the NMD expression and non-NMD expression were normalized by this nor-
malization factor. More formally, normalized TPM for every gene g and sample k is
then given by: TPM
′
(k)
g
=
TPM
(k)
g
δ (k)
6. NMD expression and non-NMD expression were normalized across samples by rank-
based inverse normal transform to satisfy the statistical assumption in FastQTL for QTL
mapping.
The Python code snippet for the above described normalization and inverse normal transform is
shown in next appendix.
A.2 Pythoncodesnippetforphenotypenormalization
1 import pandas as pd
2 import scipy.stats as ss
3 from scipy.stats.mstats import gmean
66
4 def rank_INT(x, c=3./8):
5 """Rank-based Inverse Normal Transform.
6 Ties share the same value after transformation."""
7 n = len(x)
8 r = ss.rankdata(x, method='average')
9 return ss.norm.ppf((r - c) / (n - 2 * c + 1))
10
11 # normalization factors
12 gene_tpm = pd.read_csv(os.path.join(base_path,
'GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz'),
skiprows=2, sep='\t', usecols=tissue_cols[2:])
,→
,→
13 gene_tpm = gene_tpm[(gene_tpm > .1).sum(axis=1) >= 10] # filter out lowly
expressed genes ,→
14 genes_gmean = gene_tpm.apply(lambda row: gmean(row[row != 0]), axis=1) #
calculate geometric mean ,→
15 norm_factors = gene_tpm.divide(genes_gmean, axis=0).median(axis=0) #
calculate normalization factor ,→
A.3 Implementationdetailsofthesimulations
Without any prior knowledge on how the NMD effect should be, parameters θ (t,α,θ ) were gen-
erated by first drawing α A
andθ A
from a uniform distributionU(0.2,0.8), and then obtainingα a
,
andθ a
by ensuring the effect size |α A
− α a
| and|θ A
− θ a
| belong to the desired range. For cases
67
assuming t
A
= t
a
= t we empirically chose t = 4 based on the real data in hand, and for cases
assumingt
A
̸=t
a
, we drew botht
A
∼ U(3,5) andt
a
∼ U(3,5), and then applied a Gaussian noise
term. We drew the Gaussian noise term δ ∼ N (0,1.5) and truncated the simulated expression
to be non-negative for the physical reason–the expressions can not be negative.
A.4 StatisticaltestsusedindiseaseSNPcolocalizationstudy
forNMD-QTLs
A.4.1 Hyper-geometrictest
For a given diseased
i
, letN be the total number of NMD-QTLs discovered, M be the universal
set of SNPs (here we assume it is the number of all SNPs tested in GTEx), letn be the number of
markers reported in DisGeNET ford
i
, and x be the number of NMD-QTLs that were indeed the
disease marker for d
i
. Then the p-value for the test is given by p = Pr(X≥ x− 1) where X
follows a hypergeometric distribution, whose probability mass function is defined as
p(x,M,n,N)=
n
x
M− n
N− x
M
N
(A.1)
A.4.2 Proportiontest
For a given disease d
i
, four statistics were gathered as follows. The alternative hypothesis was
thenH
A
:
x
1
y
1
>
x
2
y
2
.
1. x
1
: number of NMD-QTLs reported to be associated with thed
i
68
2. y
1
: total number of NMD-QTLs reported to be associated diseases
3. x
2
: number of eQTLs reported to be associated with thed
i
4. y
2
: total number of eQTLs reported to be associated diseases
A.5 Githubrepositoryforlargetables
Following supplementary tables can be found inhttps://github.com/bsun0802/nmd_qtl_publication.
Table Description
Supplementary Table 1 The 43 diseases with significant NMD-QTLs
enrichment. p-values for both the proportion
test and the hypergeometric test <0.05
Supplementary Table 2 The list of miRNAs and the number of
dNMD-QTLs or pNMD-QTLs in their target
regions
Supplementary Table 3 The list of RBPs and the number of dNMD-
QTLs or pNMD-QTLs in their binding sites
in the K562 or the HepG2 cell lines
Supplementary Table 4 The data source and links
Supplementary Table 5 The list of cassette exons considered in
splicing-QTL mapping.
Table A.1: Links to supplementary tables.
69
Bibliography
[1] Neil Brockdorff and Bryan M Turner. “Dosage compensation in mammals”. In: Cold
Spring Harbor perspectives in biology 7.3 (2015), a019406.
[2] L Jaillon, Chi-Sun Poon, and Yat Hung Chiang. “Quantifying the waste reduction
potential of using prefabrication in building construction in Hong Kong”. In: Waste
management 29.1 (2009), pp. 309–320.
[3] Sika Zheng and Douglas L Black. “Alternative pre-mRNA splicing in neurons: growing up
and extending its reach”. In: Trends in Genetics 29.8 (2013), pp. 442–448.
[4] David A de Lima Morais and Paul M Harrison. “Large-scale evidence for conservation of
NMD candidature across mammals”. In: PloS one 5.7 (2010), e11695.
[5] Joshua T Mendell, Neda A Sharifi, Jennifer L Meyers, Francisco Martinez-Murillo, and
Harry C Dietz. “Nonsense surveillance regulates expression of diverse classes of
mammalian transcripts and mutes genomic noise”. In: Nature genetics 36.10 (2004),
pp. 1073–1078.
[6] Feng He, Xiangrui Li, Phyllis Spatrick, Ryan Casillo, Shuyun Dong, and Allan Jacobson.
“Genome-wide analysis of mRNAs regulated by the nonsense-mediated and 5’ to 3’
mRNA decay pathways in yeast”. In: Molecular cell 12.6 (2003), pp. 1439–1452.
[7] Lulu Huang, Chih-Hong Lou, Waikin Chan, Eleen Y Shum, Ada Shao, Erica Stone,
Rachid Karam, Hye-Won Song, and Miles F Wilkinson. “RNA homeostasis governed by
cell type-specific and branched feedback loops acting on NMD”. In: Molecular cell 43.6
(2011), pp. 950–961.
[8] Saverio Brogna and Jikai Wen. “Nonsense-mediated mRNA decay (NMD) mechanisms”.
In: Nature structural & molecular biology 16.2 (2009), pp. 107–113.
[9] Nadia Amrani, Robin Ganesan, Stephanie Kervestin, David A Mangus, Shubhendu Ghosh,
and Allan Jacobson. “A faux 3-UTR promotes aberrant termination and triggers
nonsense-mediated mRNA decay”. In: Nature 432.7013 (2004), pp. 112–118.
70
[10] Hervé Le Hir, David Gatfield, Elisa Izaurralde, and Melissa J Moore. “The exon–exon
junction complex provides a binding platform for factors involved in mRNA export and
nonsense-mediated mRNA decay”. In: The EMBO journal 20.17 (2001), pp. 4987–4997.
[11] Yao-Fu Chang, J Saadi Imam, and Miles F Wilkinson. “The nonsense-mediated decay RNA
surveillance pathway”. In: Annu. Rev. Biochem. 76 (2007), pp. 51–74.
[12] Indrani Rebbapragada and Jens Lykke-Andersen. “Execution of nonsense-mediated
mRNA decay: what defines a substrate?” In: Current opinion in cell biology 21.3 (2009),
pp. 394–402.
[13] Jungwook Hwang and Lynne E Maquat. “Nonsense-mediated mRNA decay (NMD) in
animal embryogenesis: to die or not to die, that is the question”. In: Current opinion in
genetics & development 21.4 (2011), pp. 422–430.
[14] Lukas Stalder and Oliver Mühlemann. “The meaning of nonsense”. In: Trends in cell
biology 18.7 (2008), pp. 315–321.
[15] Zoltán Kerényi, Zsuzsanna Mérai, László Hiripi, Anna Benkovics, Péter Gyula,
Christophe Lacomme, Endre Barta, Ferenc Nagy, and Dániel Silhavy. “Inter-kingdom
conservation of mechanism of nonsense-mediated mRNA decay”. In: The EMBO journal
27.11 (2008), pp. 1585–1595.
[16] Andrea B Eberle, Lukas Stalder, Hansruedi Mathys, Rodolfo Zamudio Orozco, and
Oliver Mühlemann. “Posttranscriptional gene regulation by spatial rearrangement of the
3’untranslated region”. In: PLoS biology 6.4 (2008), e92.
[17] Isabelle Behm-Ansmant, Isao Kashima, Jan Rehwinkel, Jérôme Saulière, Nadine Wittkopp,
and Elisa Izaurralde. “mRNA quality control: an ancient machinery recognizes and
degrades mRNAs with nonsense codons”. In: FEBS letters 581.15 (2007), pp. 2845–2853.
[18] Guramrit Singh, Indrani Rebbapragada, and Jens Lykke-Andersen. “A competition
between stimulators and antagonists of Upf complex recruitment governs human
nonsense-mediated mRNA decay”. In: PLoS biology 6.4 (2008), e111.
[19] Andrea B Eberle, Søren Lykke-Andersen, Oliver Mühlemann, and Torben Heick Jensen.
“SMG6 promotes endonucleolytic cleavage of nonsense mRNA in human cells”. In:
Nature structural & molecular biology 16.1 (2009), pp. 49–55.
[20] Yukiko Okada-Katsuhata, Akio Yamashita, Kei Kutsuzawa, Natsuko Izumi,
Fumiki Hirahara, and Shigeo Ohno. “N-and C-terminal Upf1 phosphorylations create
binding platforms for SMG-6 and SMG-5: SMG-7 during NMD”. In: Nucleic acids research
40.3 (2012), pp. 1251–1266.
71
[21] Boel Brynedal, JinMyung Choi, Towfique Raj, Robert Bjornson, Barbara E Stranger,
Benjamin M Neale, Benjamin F Voight, and Chris Cotsapas. “Large-scale trans-eQTLs
affect hundreds of transcripts and mediate patterns of transcriptional co-regulation”. In:
The American Journal of Human Genetics 100.4 (2017), pp. 581–591.
[22] Bo Sun and Liang Chen. “Quantile regression for challenging cases of eQTL mapping”. In:
Briefings in bioinformatics 21.5 (2020), pp. 1756–1765.
[23] Madhuri Bhuvanagiri, Anna M Schlitter, Matthias W Hentze, and Andreas E Kulozik.
“NMD: RNA biology meets human genetic medicine”. In: Biochemical Journal 430.3
(2010), pp. 365–377.
[24] Lulu Huang and Miles F Wilkinson. “Regulation of nonsense-mediated mRNA decay”. In:
Wiley Interdisciplinary Reviews: RNA 3.6 (2012), pp. 807–828.
[25] Rik GH Lindeboom, Michiel Vermeulen, Ben Lehner, and Fran Supek. “The impact of
nonsense-mediated mRNA decay on genetic disease, gene editing and cancer
immunotherapy”. In: Nature genetics 51.11 (2019), pp. 1645–1651.
[26] Jan Rehwinkel, Ivica Letunic, Jeroen Raes, Peer Bork, and Elisa Izaurralde.
“Nonsense-mediated mRNA decay factors act in concert to regulate common mRNA
targets”. In: Rna 11.10 (2005), pp. 1530–1544.
[27] Hidenori Tani, Naoto Imamachi, Kazi Abdus Salam, Rena Mizutani, Kenichi Ijiri,
Takuma Irie, Tetsushi Yada, Yutaka Suzuki, and Nobuyoshi Akimitsu. “Identification of
hundreds of novel UPF1 target transcripts by direct determination of whole
transcriptome stability”. In: RNA biology 9.11 (2012), pp. 1370–1379.
[28] Joachim Weischenfeldt, Inge Damgaard, David Bryder, Kim Theilgaard-Monch,
Lina A Thoren, Finn Cilius Nielsen, Sten Eirik W Jacobsen, Claus Nerlov, and
Bo Torben Porse. “NMD is essential for hematopoietic stem and progenitor cells and for
eliminating by-products of programmed DNA rearrangements”. In: Genes & development
22.10 (2008), pp. 1381–1396.
[29] Michael J Lelivelt and Michael R Culbertson. “Yeast Upf proteins required for RNA
surveillance affect global expression of the yeast transcriptome”. In: Molecular and
Cellular Biology 19.10 (1999), pp. 6710–6719.
[30] Quinn M Mitrovich and Philip Anderson. “mRNA surveillance of expressed pseudogenes
in C. elegans”. In: Current biology 15.10 (2005), pp. 963–967.
[31] Tatsuaki Kurosaki and Lynne E Maquat. “Nonsense-mediated mRNA decay in humans at
a glance”. In: Journal of cell science 129.3 (2016), pp. 461–467.
72
[32] Soren Lykke-Andersen and Torben Heick Jensen. “Nonsense-mediated mRNA decay: an
intricate machinery that shapes transcriptomes”. In: Nature reviews Molecular cell biology
16.11 (2015), pp. 665–677.
[33] Franziska Ottens and Niels H Gehring. “Physiological and pathophysiological role of
nonsense-mediated mRNA decay”. In: Pflugers Archiv-European Journal of Physiology
468.6 (2016), pp. 1013–1028.
[34] Etienne Raimondeau, Joshua C Bufton, and Christiane Schaffitzel. “New insights into the
interplay between the translation machinery and nonsense-mediated mRNA decay
factors”. In: Biochemical Society Transactions 46.3 (2018), pp. 503–512.
[35] Xin Han, Yanling Wei, Hua Wang, Feilong Wang, Zhenyu Ju, and Tangliang Li.
“Nonsense-mediated mRNA decay: a ‘nonsense’pathway makes sense in stem cell
biology”. In: Nucleic acids research 46.3 (2018), pp. 1038–1051.
[36] Eleen Y Shum, Samantha H Jones, Ada Shao, Jennifer Dumdie, Matthew D Krause,
Wai-Kin Chan, Chih-Hong Lou, Josh L Espinoza, Hye-Won Song, Mimi H Phan, et al.
“The antagonistic gene paralogs Upf3a and Upf3b govern nonsense-mediated RNA
decay”. In: Cell 165.2 (2016), pp. 382–395.
[37] Jianqiang Bao, Chong Tang, Shuiqiao Yuan, Bo T Porse, and Wei Yan. “UPF2, a
nonsense-mediated mRNA decay factor, is required for prepubertal Sertoli cell
development and male fertility by ensuring fidelity of the transcriptome”. In:
Development 142.2 (2015), pp. 352–362.
[38] Jianqiang Bao, Kristoffer Vitting-Seerup, Johannes Waage, Chong Tang, Ying Ge,
Bo T Porse, and Wei Yan. “UPF2-dependent nonsense-mediated mRNA decay pathway is
essential for spermatogenesis by selectively eliminating longer 3’UTR transcripts”. In:
PLoS genetics 12.5 (2016), e1005863.
[39] Tatsuaki Kurosaki, Hitomi Sakano, Christoph Proschel, Jason Wheeler,
Alexander Hewko, and Lynne E Maquat. “NMD abnormalities during brain development
in the Fmr1-knockout mouse model of fragile X syndrome”. In: Genome biology 22.1
(2021), pp. 1–13.
[40] Jayanthi P Gudikote, J Saadi Imam, Ramon F Garcia, and Miles F Wilkinson. “RNA
splicing promotes translation and RNA surveillance”. In: Nature structural & molecular
biology 12.9 (2005), pp. 801–809.
[41] Fusako Usuki, Akio Yamashita, Itsuro Higuchi, Tetsuo Ohnishi, Tadafumi Shiraishi,
Mitsuhiro Osame, and Shigeo Ohno. “Inhibition of nonsense-mediated mRNA decay
rescues the phenotype in Ullrich’s disease”. In: Annals of Neurology: Official Journal of the
American Neurological Association and the Child Neurology Society 55.5 (2004),
pp. 740–744.
73
[42] Almoutassem B Zetoune, Sandra Fontaniere, Delphine Magnin, Olga Anczukow,
Monique Buisson, Chang X Zhang, and Sylvie Mazoyer. “Comparison of
nonsense-mediated mRNA decay efficiency in various murine tissues”. In: BMC genetics
9.1 (2008), pp. 1–11.
[43] Matthew Mort, Dobril Ivanov, David N Cooper, and Nadia A Chuzhanova. “A
meta-analysis of nonsense mutations causing human genetic disease”. In: Human
mutation 29.8 (2008), pp. 1037–1047.
[44] Fran Supek, Ben Lehner, and Rik GH Lindeboom. “To NMD or Not To NMD:
nonsense-mediated mRNA decay in cancer and other genetic diseases”. In: Trends in
Genetics 37.7 (2021), pp. 657–668.
[45] Jake N Miller and David A Pearce. “Nonsense-mediated decay in genetic disease: friend or
foe?” In: Mutation Research/Reviews in Mutation Research 762 (2014), pp. 52–64.
[46] GTEx Consortium, Kristin G Ardlie, David S Deluca, Ayellet V Segrè, Timothy J Sullivan,
Taylor R Young, Ellen T Gelfand, Casandra A Trowbridge, Julian B Maller,
Taru Tukiainen, et al. “The Genotype-Tissue Expression (GTEx) pilot analysis:
Multitissue gene regulation in humans”. In: Science 348.6235 (2015), pp. 648–660.
[47] A Battle, CD Brown, BE Engelhardt, and SB Montgomery. “GTEx Consortium;
Laboratory, Data Analysis &Coordinating Center (LDACC)—Analysis Working Group;
Statistical Methods groups—Analysis Working Group; Enhancing GTEx (eGTEx) groups;
NIH Common Fund; NIH/NCI; NIH/NHGRI; NIH/NIMH; NIH/NIDA; Biospecimen
Collection Source Site—NDRI; Biospecimen Collection Source Site—RPCI; Biospecimen
Core Resource—VARI; Brain Bank Repository—University of Miami Brain Endowment
Bank; Leidos Biomedical—Project Management; ELSI Study; Genome Browser Data
Integration &Visualization—EBI; Genome Browser Data Integration
&Visualization—UCSC Genomics Institute, University of California Santa Cruz; Lead
analysts; Laboratory, Data Analysis &Coordinating Center (LDACC); NIH program
management; Biospecimen collection; Pathology; eQTL manuscript working group,
Genetic effects on gene expression across human tissues”. In: Nature 550 (2017),
pp. 204–213.
[48] Linfeng Wu, Sophie I Candille, Yoonha Choi, Dan Xie, Lihua Jiang,
Jennifer Li-Pook-Than, Hua Tang, and Michael Snyder. “Variation and genetic control of
protein abundance in humans”. In: Nature 499.7456 (2013), pp. 79–82.
[49] Yun-Hua Esther Hsiao, Jae Hoon Bahn, Xianzhi Lin, Tak-Ming Chan, Rena Wang, and
Xinshu Xiao. “Alternative splicing modulated by genetic variants demonstrates
accelerated evolution regulated by highly conserved proteins”. In: Genome research 26.4
(2016), pp. 440–450.
74
[50] Hong Zhang, Shengqian Dou, Feng He, Junjie Luo, Liping Wei, and Jian Lu.
“Genome-wide maps of ribosomal occupancy provide insights into adaptive evolution
and regulatory roles of uORFs during Drosophila development”. In: PLoS biology 16.7
(2018), e2003903.
[51] Daniel G MacArthur and Chris Tyler-Smith. “Loss-of-function variants in the genomes of
healthy humans”. In: Human molecular genetics 19.R2 (2010), R125–R130.
[52] MA Rivas et al. “Impact of predicted protein-truncating genetic variants on the human
transcriptome”. In: Science (), pp. 666–669.
[53] Hongjoo J Lee, Jina M Youn, Michela Gallagher, Peter C Holland, et al. “Role of substantia
nigra–amygdala connections in surprise-induced enhancement of attention”. In: Journal
of Neuroscience 26.22 (2006), pp. 6077–6081.
[54] Wei-Zhu Liu, Wen-Hua Zhang, Zhi-Heng Zheng, Jia-Xin Zou, Xiao-Xuan Liu,
Shou-He Huang, Wen-Jie You, Ye He, Jun-Yu Zhang, Xiao-Dong Wang, et al.
“Identification of a prefrontal cortex-to-amygdala pathway for chronic stress-induced
anxiety”. In: Nature communications 11.1 (2020), pp. 1–15.
[55] Lam Son Nguyen, Miles F Wilkinson, and Jozef Gecz. “Nonsense-mediated mRNA decay:
inter-individual variability and human disease”. In: Neuroscience & Biobehavioral Reviews
46 (2014), pp. 175–186.
[56] Mehrdad Khajavi, Ken Inoue, and James R Lupski. “Nonsense-mediated mRNA decay
modulates clinical outcome of genetic disease”. In: European journal of human genetics
14.10 (2006), pp. 1074–1081.
[57] Janet Piñero, Juan Manuel Ramirez-Anguita, Josep Sauch-Pitarch, Francesco Ronzano,
Emilio Centeno, Ferran Sanz, and Laura I Furlong. “The DisGeNET knowledge platform
for disease genomics: 2019 update”. In:Nucleicacidsresearch 48.D1 (2020), pp. D845–D855.
[58] Lam S Nguyen, Hyung-Goo Kim, Jill A Rosenfeld, Yiping Shen, James F Gusella,
Yves Lacassie, Lawrence C Layman, Lisa G Shaffer, and Jozef Gécz. “Contribution of copy
number variants involving nonsense-mediated mRNA decay pathway genes to
neuro-developmental disorders”. In: Human molecular genetics 22.9 (2013), pp. 1816–1825.
[59] Patrick S Tarpey, F Lucy Raymond, Lam S Nguyen, Jayson Rodriguez, Anna Hackett,
Lucianne Vandeleur, Raffaella Smith, Cheryl Shoubridge, Sarah Edkins, Claire Stevens,
et al. “Mutations in UPF3B, a member of the nonsense-mediated mRNA decay complex,
cause syndromic and nonsyndromic mental retardation”. In: Nature genetics 39.9 (2007),
pp. 1127–1133.
75
[60] AM Addington, J Gauthier, A Piton, FF Hamdan, A Raymond, N Gogtay, R Miller,
J Tossell, J Bakalar, G Germain, et al. “A novel frameshift mutation in UPF3B identified in
brothers affected with childhood onset schizophrenia and autism spectrum disorders”. In:
Molecular psychiatry 16.3 (2011), pp. 238–239.
[61] F Laumonnier, C Shoubridge, C Antar, LS Nguyen, Hilde Van Esch, T Kleefstra, S Briault,
JP Fryns, B Hamel, J Chelly, et al. “Mutations of the UPF3B gene, which encodes a protein
widely expressed in neurons, are associated with nonspecific mental retardation with or
without autism”. In: Molecular psychiatry 15.7 (2010), pp. 767–776.
[62] Martino Colombo, Evangelos D Karousis, Joel Bourquin, Rémy Bruggmann, and
Oliver Muhlemann. “Transcriptome-wide identification of NMD-targeted human mRNAs
reveals extensive redundancy between SMG6-and SMG7-mediated degradation
pathways”. In: Rna 23.2 (2017), pp. 189–201.
[63] David P Bartel. “MicroRNAs: target recognition and regulatory functions”. In: cell 136.2
(2009), pp. 215–233.
[64] David P Bartel. “Metazoan micrornas”. In: Cell 173.1 (2018), pp. 20–51.
[65] Daniel J Hogan, Daniel P Riordan, André P Gerber, Daniel Herschlag, Patrick O Brown,
and Sean R Eddy. “Diverse RNA-binding proteins interact with functionally related sets
of RNAs, suggesting an extensive regulatory system”. In: PLoS biology 6.10 (2008), e255.
[66] Bradley M Lunde, Claire Moore, and Gabriele Varani. “RNA-binding proteins: modular
design for efficient function”. In: Nature reviews Molecular cell biology 8.6 (2007),
pp. 479–490.
[67] Vikram Agarwal, George W Bell, Jin-Wu Nam, and David P Bartel. “Predicting effective
microRNA target sites in mammalian mRNAs”. In: elife 4 (2015), e05005.
[68] ENCODE Project Consortium et al. “An integrated encyclopedia of DNA elements in the
human genome”. In: Nature 489.7414 (2012), p. 57.
[69] Syed Shamsh Tabrez, Ravi Datta Sharma, Vaibhav Jain, Atif Ahmed Siddiqui, and
Arnab Mukhopadhyay. “Differential alternative splicing coupled to nonsense-mediated
decay of mRNA ensures dietary restriction-induced longevity”. In: Nature
communications 8.1 (2017), pp. 1–13.
[70] Sika Zheng. “Alternative splicing and nonsense-mediated mRNA decay enforce neural
specific gene expression”. In: International Journal of Developmental Neuroscience 55
(2016), pp. 102–108.
[71] Corinna Giorgi, Gene W Yeo, Martha E Stone, Donald B Katz, Christopher Burge,
Gina Turrigiano, and Melissa J Moore. “The EJC factor eIF4AIII modulates synaptic
strength and neuronal protein expression”. In: Cell 130.1 (2007), pp. 179–191.
76
[72] Karen Yap and Eugene V Makeyev. “Regulation of gene expression in mammalian
nervous system through alternative pre-mRNA splicing coupled with RNA quality
control mechanisms”. In: Molecular and Cellular Neuroscience 56 (2013), pp. 420–428.
[73] Bo Gao, Qingyao Huang, and Michael Baudis. “segment_liftover: a Python tool to convert
segments between genome assemblies”. In: F1000Research 7 (2018).
[74] Halit Ongen, Alfonso Buil, Andrew Anand Brown, Emmanouil T Dermitzakis, and
Olivier Delaneau. “Fast and efficient QTL mapper for thousands of molecular
phenotypes”. In: Bioinformatics 32.10 (2016), pp. 1479–1485.
[75] Oliver Stegle, Leopold Parts, Richard Durbin, and John Winn. “A Bayesian framework to
account for complex non-genetic factors in gene expression levels greatly increases
power in eQTL studies”. In: PLoS computational biology 6.5 (2010), e1000770.
[76] Alexis Battle, Christopher D Brown, Barbara E Engelhardt, and Stephen B Montgomery.
“Genetic effects on gene expression across human tissues.” In: Nature 550.7675 (2017),
pp. 204–213.
[77] Michael Morley, Cliona M Molony, Teresa M Weber, James L Devlin, Kathryn G Ewens,
Richard S Spielman, and Vivian G Cheung. “Genetic analysis of genome-wide variation in
human gene expression”. In: Nature 430.7001 (2004), pp. 743–747.
[78] Barbara E Stranger, Matthew S Forrest, Mark Dunning, Catherine E Ingle,
Claude Beazley, Natalie Thorne, Richard Redon, Christine P Bird, Anna De Grassi,
Charles Lee, et al. “Relative impact of nucleotide and copy number variation on gene
expression phenotypes”. In: Science 315.5813 (2007), pp. 848–853.
[79] Stephen B Montgomery, Micha Sammeth, Maria Gutierrez-Arcelus, Radoslaw P Lach,
Catherine Ingle, James Nisbett, Roderic Guigo, and Emmanouil T Dermitzakis.
“Transcriptome genetics using second generation sequencing in a Caucasian population”.
In: Nature 464.7289 (2010), pp. 773–777.
[80] Joseph K Pickrell, John C Marioni, Athma A Pai, Jacob F Degner, Barbara E Engelhardt,
Everlyne Nkadori, Jean-Baptiste Veyrieras, Matthew Stephens, Yoav Gilad, and
Jonathan K Pritchard. “Understanding mechanisms underlying human gene expression
variation with RNA sequencing”. In: Nature 464.7289 (2010), pp. 768–772.
[81] Diego Garrido-Martın, Beatrice Borsari, Miquel Calvo, Ferran Reverter, and
Roderic Guigó. “Identification and analysis of splicing quantitative trait loci across
multiple tissues in the human genome”. In: Nature communications 12.1 (2021), pp. 1–16.
[82] M Melé, PG Ferreira, F Reverter, DS DeLuca, J Monlong, M Sammeth, TR Young,
JM Goldmann, DD Pervouchine, TJ Sullivan, et al. “Human genomics. The human
transcriptome across tissues and individuals.. 2015;(6235): 660-5|”. In: Science 348 ().
77
[83] John Lonsdale, Jeffrey Thomas, Mike Salvatore, Rebecca Phillips, Edmund Lo,
Saboor Shad, Richard Hasz, Gary Walters, Fernando Garcia, Nancy Young, et al. “The
genotype-tissue expression (GTEx) project”. In: Nature genetics 45.6 (2013), pp. 580–585.
[84] Lucia A Hindorff, Praveen Sethupathy, Heather A Junkins, Erin M Ramos,
Jayashri P Mehta, Francis S Collins, and Teri A Manolio. “Potential etiologic and
functional implications of genome-wide association loci for human diseases and traits”.
In: Proceedings of the National Academy of Sciences 106.23 (2009), pp. 9362–9367.
[85] Kim M Keeling and David M Bedwell. “Suppression of nonsense mutations as a
therapeutic approach to treat genetic diseases”. In: Wiley Interdisciplinary Reviews: RNA
2.6 (2011), pp. 837–852.
[86] Francesca Sartor, Jihan Anderson, Colin McCaig, Zosia Miedzybrodzka, and
Berndt Muller. “Mutation of genes controlling mRNA metabolism and protein synthesis
predisposes to neurodevelopmental disorders”. In: Biochemical Society Transactions 43.6
(2015), pp. 1259–1265.
[87] Giuseppe Balistreri, Claudia Bognanni, and Oliver Muhlemann. “Virus escape and
manipulation of cellular nonsense-mediated mRNA decay”. In: Viruses 9.1 (2017), p. 24.
[88] Liat Linde, Stephanie Boelz, Malka Nissim-Rafinia, Yifat S Oren, Michael Wilschanski,
Yasmin Yaacov, Dov Virgilis, Gabriele Neu-Yilik, Andreas E Kulozik, Eitan Kerem, et al.
“Nonsense-mediated mRNA decay affects nonsense transcript levels and governs
response of cystic fibrosis patients to gentamicin”. In: The Journal of clinical investigation
117.3 (2007), pp. 683–692.
[89] Kim M Keeling. “Nonsense suppression as an approach to treat lysosomal storage
diseases”. In: Diseases 4.4 (2016), p. 32.
[90] Tomoki T Nomakuchi, Frank Rigo, Isabel Aznarez, and Adrian R Krainer. “Antisense
oligonucleotide–directed inhibition of nonsense-mediated mRNA decay”. In: Nature
biotechnology 34.2 (2016), pp. 164–166.
78
Abstract (if available)
Abstract
Nonsense-mediated mRNA decay (NMD) was originally conceived as an mRNA surveillance mechanism that recognizes and degrades nonsense transcripts to prevent the production of potentially deleterious truncated proteins. Recent research showed NMD is an important post-transcriptional gene regulation mechanism targeting many non-aberrant mRNAs. Here we elucidate the NMD regulation of individual genes across human tissues through genetical genomics. Genetic variants corresponding to NMD regulation were identified based on the GTEx data through our unique modeling of transcript expressions. Specifically, we identified genetic variants that enhance or suppress the post-transcriptional production of NMD-targeted transcripts (pNMD-QTLs), as well as genetic variants regulating the decay efficiency of NMD-targeted transcripts (dNMD-QTLs). We explored the tissue specificity of identified NMD-QTLs, and profiled their genomic positions and distances to exons, introns, transcription start sites and stop codons. We further found that compared to expression quantitative trait loci (eQTLs), NMD-QTLs are more likely to be co-localized with disease SNPs, splicing QTLs, and in the binding sites of miRNAs and RNA binding proteins (RBPs). In conclusion, we revealed the genome-wide landscape of genetic variants associated with NMD regulation (the NMD-QTLs) across human tissues. The tissue specificity analysis of NMD-QTLs demonstrated the especially important role of NMD in brains. The exploration of NMD-QTLs’ preferential genomic positions suggested additional attributes for a transcript to be regulated by NMD. Furthermore, the co-localization analysis with other trait-associated SNPs and post-transcriptional regulatory elements assisted the understanding of NMD-QTLS’ functionality and their interactions with other post-transcriptional regulations.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
PDF
Computational algorithms for studying human genetic variations -- structural variations and variable number tandem repeats
PDF
Comparative transcriptomics: connecting the genome to evolution
PDF
danbing-tk: a computational genetics framework for studying variable number tandem repeats
PDF
Genome-wide studies reveal the isoform selection of genes
PDF
Validating structural variations: from traditional algorithms to deep learning approaches
PDF
Computational algorithms and statistical modelings in human microbiome analyses
PDF
Too many needles in this haystack: algorithms for the analysis of next generation sequence data
PDF
Exploring the genetic basis of complex traits
PDF
Improved methods for the quantification of transcription factor binding using SELEX-seq
PDF
Probabilistic modeling and data integration to examine RNA-protein interactions
PDF
Understanding the characteristic of single nucleotide variants
PDF
Constructing metagenome-assembled genomes and mobile genetic element host interactions using metagenomic Hi-C
PDF
Predicting functional consequences of SNPs: insights from translation elongation, molecular phenotypes, and pathways
PDF
Probing the genetic basis of gene expression variation through Bayesian analysis of allelic imbalance and transcriptome studies of oil palm interspecies hybrids
PDF
Sharpening the edge of tools for microbial diversity analysis
PDF
Exploration of human microbiome through metagenomic analysis and computational algorithms
PDF
Statistical and computational approaches for analyzing metagenomic sequences with reproducibility and reliability
PDF
DNA shape at transcription factor binding sites: from purifying selection to a new alphabet
PDF
Characterizing synonymous variants by leveraging gene expression and GWAS datasets
Asset Metadata
Creator
Sun, Bo
(author)
Core Title
Mapping genetic variants for nonsense-mediated mRNA decay regulation across human tissues
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Degree Conferral Date
2022-05
Publication Date
03/06/2022
Defense Date
03/01/2022
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
disease-variant association,eQTL,Nonsense-mediated decay,OAI-PMH Harvest
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Chen, Liang (
committee chair
), Chaisson, Mark (
committee member
), McMahon, Andrew (
committee member
), Sun, Fengzhu (
committee member
)
Creator Email
bos@usc.edu,bsundevice@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC110768085
Unique identifier
UC110768085
Legacy Identifier
etd-SunBo-10426
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Sun, Bo
Type
texts
Source
20220308-usctheses-batch-915
(batch),
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu
Tags
disease-variant association
eQTL
Nonsense-mediated decay