Predicting Functional Consequences of SNPs:
Insights from Translation Elongation, Molecular Phenotypes,
and Pathways
By
Zheyu Li
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(CHEMICAL ENGINEERING)
May 2024
Copyright 2024 Zheyu Li
Acknowledgements
What a journey this has been! From dreaming of becoming a veterinarian, to diving into chemical
engineering, and finally landing in quantitative computational biology—my academic path has truly been
a rollercoaster. I'm incredibly fortunate to have had Dr. Liang Chen as my mentor during this bumpy ride.
I am deeply grateful that she saw my potential and welcomed me into her group despite my
unconventional background. Dr. Chen has not only guided me with patience and clarity through the
complexities of computational biology but has also genuinely cared about my well-being. Her efforts to
create a nurturing and supportive environment have been instrumental in both my personal and
professional development.
I am also thankful to the members of my dissertation and qualifying exam committee, Drs. Stacey
Finley, Jerry Lee, Fengzhu Sun, and Pin Wang, for their valuable input, constructive criticism, and
encouragement. Their expertise and diverse perspectives have enriched this work immensely.
Special thanks to my parents for their steadfast support and immense pride in my achievements.
Your encouragement has been a cornerstone of my growth, and I am deeply grateful for it.
To my incredible wife and my best friend, thank you for always being by my side and supporting
me in every way imaginable. You've made me so much happier than I was 13 years ago when we first
met. Now that I’m finally out of school, let’s start our next, even more exciting and adventurous chapter.
To my fantastic son, thank you for coming into my life. You’ve excelled at being a baby since the
moment you were born, while I’m still learning how to be a better father. There will be many years ahead
before you might read this dissertation. When that time comes, I hope you are happy, brave, and free. Just
relax and always be yourself.
Lastly, to my father. I miss you. I will try to live a happy life, one that I want to remember, just as
you always wished for me.
Table of Contents
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1 Introduction
  Predicting functional consequence of SNPs on the translational level
  Conducting GWAS with intermediate molecular phenotypes
  Exploring SNPs among pathways
Chapter 2 Predict Functional Consequences of SNPs on mRNA Translation
  Introduction
  Methods
    Ribosome footprint data download and preprocess
    Machine learning model on ribosome collision
    Predict ribosome occupancy changes caused by different alleles of a SNP
  Results
    Disease-related SNPs tend to cause more considerable ribosome occupancy changes
    Gene ontology and disease term analysis for top SNPs with large D’s
    Allele configuration of RibOc-SNPs
    RibOc-SNPs are enriched in the 5’ CDS regions
    Some RibOc-SNPs shift FODs bi-directionally for different transcript isoforms
  Discussion
Chapter 3 Dissect genetic architecture of circadian rhythms from multi-tissue gene expression in the natural population
  Introduction
  Methods
    Assembly of human circadian genes
    Pre-processing of Genotype-Tissue Expression data
    Molecular quantification of circadian disruption
    Mapping of genetic variants associated with circadian deviation scores
    Investigation of potential confounding from tissue sampling time
  Results
    Circadian disruptions in human population
    Genome-wide mapping of SNPs associated with circadian disruptions across multiple tissues
    Distribution of Circ-SNPs on the genome
    Exploration of potential confounding arising from tissue sampling time
    Circ-SNPs are closely linked to known circadian-related traits
    Genes containing Circ-SNPs encode druggable targets
  Discussion
Chapter 4 Conclusions and Future work
  Forecasting ribosome collision around SNPs
  Associate SNPs to Circadian rhythm with molecular phenotype
  Extract knowledge from SNPs identified from pathways
References
Appendices
  Supplementary Figures
List of Tables
Table 3.1 Sleep-related GWAS traits reported by GWAS Catalog
Table 3.2 Number of drugs that medicate different condition groups
List of Figures
Figure 2.1 Workflow of building a random forests model base on profiled ribosome footprints
Figure 2.2 Disease-related SNPs lead to larger changes in ribosome occupancy
Figure 2.3 Enriched gene ontology and disease terms for RibOc-SNPs
Figure 2.4 Nucleotide conversions enriched or depleted in RibOc-SNPs
Figure 2.5 Amino acid changes enriched or depleted in RibOc-SNPs
Figure 2.6 Distributions of Ds for amino acid conversions involving stop codons
Figure 2.7 Density plots of Ds for each conversion involving stop codons
Figure 2.8 Location comparison between RibOc-SNPs and general SNPs
Figure 2.9 Analysis of RibOc-SNPs that can shift FOD bi-directionally
Figure 3.1 Tissue-specific statistics for 16 tissues containing Circ-SNPs
Figure 3.2 Impact of minor alleles on deviations scores
Figure 3.3 Manhattan plots of Circ-SNPs from multiple tissues
Figure 3.4 Master circadian genes containing Circ-SNPs are linked to “circadian” traits
Figure 3.5 Compositions of condition categories for drugs targeting on our master circadian genes
Figure S1 Robustness of our analysis pipeline
Figure S2 Changes in FODs (D) caused by SNPs
Figure S3 Distributions of amino acid conversions that are enriched or depleted in RibOc-SNPs and appear in pairs
Figure S4 Distributions of Dif_opposite and Dif_same
Figure S5 Distributions of deviation scores derived from different tissues
Figure S6 Composition of tissues in Circ-SNP clusters
Figure S7 Genomic location of identified Circ-SNPs
Abstract
Genome-wide association studies (GWAS) aim to uncover genetic variants linked to diseases and
traits by comparing individuals' phenotypes with their genotypes. Despite GWAS revealing numerous
associations, understanding the functional implications of identified genetic variants remains a challenge.
To enhance the interpretability of single nucleotide polymorphisms (SNPs) identified through GWAS, we
developed an innovative pipeline that annotates SNPs at the translational level. Additionally, we
introduced an intermediate molecular phenotype to improve GWAS workflows and extended this strategy
to pathways, providing a richer biological context for deciphering SNP functions.
The functional impact of SNPs on translation has yet to be considered when prioritizing disease-causing SNPs from GWAS. Here
we apply machine learning models to genome-wide ribosome profiling data to predict SNP function by
forecasting ribosome collisions during mRNA translation. SNPs causing remarkable ribosome occupancy
changes are named RibOc-SNPs (Ribosome-Occupancy-SNPs). We found that disease-related SNPs tend
to cause notable changes in ribosome occupancy, suggesting translational regulation as an essential
pathogenesis step. Nucleotide conversions, such as “G → T”, “T → G”, and “C → A”, are enriched in
RibOc-SNPs with the most significant impact on ribosome occupancy, while “A → G” (or “A→ I” RNA
editing) and “G → A” are less deterministic. Among amino acid conversions, “Glu → stop (codon)”
shows the most significant enrichment in RibOc-SNPs. Interestingly, there is selection pressure on stop
codons with lower collision likelihood. RibOc-SNPs are enriched at the 5’ coding sequence regions,
implying hot spots of translation initiation regulation. Strikingly, about 22.1% of the RibOc-SNPs lead to
opposite changes in ribosome occupancy on alternative transcript isoforms, suggesting that SNPs can
amplify the differences between splicing isoforms by oppositely regulating their translation efficiency.
Circadian rhythms are 24-hour cycles that are observed in many facets of physiological and
behavioral processes, such as regulation of sleep, metabolism, and immune response. Numerous studies
have suggested a genetic influence on circadian rhythms. Several GWAS
have been conducted to identify genetic variants associated with circadian rhythms. Usually, these GWAS rely
on sleep-related traits, such as onset time, duration, and quality of sleep, to reflect the daily light-dark
cycles of circadian rhythms. In this study, a circadian deviation score was proposed as an intermediate
phenotype to represent the degree of circadian disruption on the molecular level. Such deviation scores
were derived from gene expression levels from a general population, which spares the need for continuous sampling
within 24-hour cycles. Additionally, we broadened the potential factors that affect circadian rhythms by
incorporating thousands of genes into the calculation of deviation scores in a tissue-specific manner. With
these deviation scores, we discovered 654 SNPs associated with the expression-level circadian disruption.
In addition to SNPs reported to be related to circadian traits such as “insomnia”, “chronotype
measurement”, and “circadian rhythm”, our discovered list contains many novel ones. Our findings will
significantly advance the understanding of genetic components underlying sleep disorders and our
approach will benefit the design of future GWAS strategies.
We delved into the impact of pathway depth within the hierarchical structure on associated SNPs,
leveraging our expanded intermediate molecular phenotype approach across more than 1,000 pathways.
In this investigation, higher-level pathways consistently displayed a greater number of associated SNPs,
mirroring their intricate regulatory mechanisms and involvement of multiple genes, which heightens the
potential for genetic variations, including SNPs, to influence various pathway components. Conversely,
lower-level pathways exhibited higher SNP to gene ratios, indicative of a more pronounced polygenic
influence from individual SNPs within these more specific pathways. Notably, disease-related SNPs
demonstrated a larger polygenic contribution compared to those in non-disease-related pathways. Among
the significant SNPs identified (over 200,000 in total), many overlapped with known cis-eQTLs and cis-sQTLs within corresponding genes, affirming the reliability of our approach. Yet, some identified SNPs
extended beyond the scanning range of cis-QTLs, potentially signifying trans-QTLs and introducing a
novel mapping avenue. Furthermore, our analysis of SNPs overlapping with those in the GWAS Catalog
revealed traits closely linked to the pathways where significant SNPs were discovered, indicating
functional relevance. Noteworthy was the identification of some SNPs in multiple pathways, especially in
pathways interconnected by functions, even spanning across different branches within the pathway
hierarchy.
Chapter 1 Introduction
Since the first GWAS was published in 2005 (1), numerous associations between genetic variants
and diseases or traits have been established. To date, nearly 7,000 GWAS have been conducted, revealing
over 300,000 loci associated with more than 10,000 diseases and traits (2). It is crucial to gain a
comprehensive understanding of the functional and clinical implications of genetic variants to fully
leverage the potential of GWAS in disease diagnostics and drug discovery. However, the translation of
GWAS findings into clinical interventions faces challenges due to our limited understanding of the
polygenic nature of common diseases, co-segregation of genetic variants resulting from linkage
disequilibrium, and the context-dependent physiological functions of genetic variants. We successfully
contributed to the understanding of the functional consequences of SNPs by introducing an innovative
SNP annotating method based on the translational perspective. Additionally, we devised a GWAS
workflow with alternative phenotype selections. Furthermore, we incorporated broadened biological
meanings into the interpretation and organization of SNPs identified from GWAS focusing on different
traits.
Predicting functional consequence of SNPs on the translational level
Given the vast number of SNPs that may contribute to each associated disease or trait, prioritizing
SNPs for further investigation in experimental follow-ups is crucial (3). Several bioinformatics tools have
been developed to prioritize candidates that could cause diseases. These tools primarily rely on existing
functional annotations and conservation scores to predict the functional consequences of these candidates
(4). Many tools focus on missense mutations and their potential impact on protein sequences, activities,
and structures (5-9). Others investigate the effects of GWAS SNPs on mRNA transcription by assessing
their co-localization with functional regulatory loci, such as open chromatin annotations, histone
modification annotations, transcription factor binding annotations, expression quantitative trait loci
(eQTL), and 3D chromosome interaction annotations (10-18). Some studies have also explored the
potential impact of SNPs on mRNA splicing (19-21). However, there remains a significant gap in the
functional prediction of SNPs that affect translation outputs, highlighting the need for further research in
this area. Translation elongation plays a critical role in protein synthesis, yet it is often overlooked in the
functional annotation of SNPs. While many current tools rely on existing functional annotations and
conservation scores to predict functional consequences, our method takes a novel approach by focusing
on the translational level, which represents a conceptual innovation in this field.
Conducting GWAS with intermediate molecular phenotypes
Many traits examined in existing GWAS are not effectively captured by suitable phenotypes, and
the circadian rhythm is among these traits. Circadian rhythms, which are 24-hour cycles governing
various physiological and behavioral processes such as sleep regulation, metabolism, and
immune system function, have been shown to be genetically influenced. Despite several genome-wide
association studies (GWAS) aimed at identifying genetic variants associated with circadian rhythms, the
lack of appropriate traits capable of accurately characterizing these rhythms remains a challenge for such
studies. To address this gap and better understand the molecular mechanisms underlying circadian
rhythms while discovering novel master regulators, we proposed an intermediate phenotype to quantify
molecular-level circadian disruptions in individuals. This novel intermediate molecular phenotype, termed
circadian deviation scores, allowed us to successfully identify 654 significant loci in a tissue-specific
manner closely related to circadian rhythm traits.
Exploring SNPs among pathways
Most GWAS typically analyze one SNP at a time without considering its biological context, yet a
single SNP seldom contributes significantly to complex diseases or traits, which usually involve
cumulative effects from multiple loci (22). In response, pathway analysis (PA) was developed to
contextualize the list of significant SNPs identified by GWAS within biological processes (23). Our
proposed GWAS investigation aims to elucidate relationships among pathways by examining their
hierarchical structure and the corresponding SNPs, thereby filling gaps in the pathway networks.
Conventional PA involves mapping significant SNPs onto genes to create a gene list, which is then
compared to preset gene lists from potential pathways to generate a series of putative pathways ranked by
relevance. While this approach offers biological insights into SNPs, its reliance on genes as intermediates
is limited, as the vast majority of GWAS-identified SNPs (~90%) do not map to genes (24), particularly
those landing in intergenic regions. Moreover, existing PA typically focuses on individual pathways,
often overlooking interactions among pathways. Our revised PA integrates pathway information into the
scanning step, ensuring that none of the discovered SNPs are overlooked, including those in intergenic
regions. Furthermore, we utilize the hierarchical structure of pathways as the framework for our GWAS,
enabling the discovery of interactions between pathways and overcoming the limitations of traditional PA
methods.
Chapter 2 Predict Functional Consequences of SNPs on mRNA
Translation
Introduction
Genome-wide association studies (GWAS) over the last decade have accumulated a large number
of genetic variants associated with common diseases. We need to fully understand the functional and
clinical consequences of genetic variants to harness the power of GWAS in disease diagnostics and drug
discovery. Translating GWAS findings to clinical interventions is hindered by our limited knowledge of
the polygenic architecture of common diseases, co-segregated genetic variants due to linkage
disequilibrium, and the physiological-context-dependent functions of genetic variants. It remains a
challenge to pinpoint the causal variants of common diseases.
A large majority of genetic variants reported in GWAS do not change the amino acid sequence of
a protein. Only about 5% of variants reported in the NHGRI-EBI GWAS Catalog (v1.0.2, downloaded in
September 2021) are missense mutations. Mechanisms other than protein sequence changes are essential
to uncover the etiological path from single nucleotide polymorphisms (SNPs) to resultant phenotypic
traits. Hypothetically, genetic variants can affect mRNA transcription, mRNA splicing, mRNA decay,
and mRNA translation.
A variety of bioinformatics tools were developed to prioritize disease-causing SNP candidates.
Most of these tools utilize existing functional annotations and conservation scores to predict the
functional consequence of these candidates (4). Such prioritization reduces the number of SNPs worth
further investigation to a feasible amount for experimental follow-ups (3). Many tools focused on
missense mutations and their possible impact on protein sequences, activities, and structures (5-9). Others
studied GWAS SNPs’ effects on mRNA transcription by inferring their colocalization with functional
regulatory loci, e.g., open chromatin annotations, histone modification annotations, transcription factor
binding annotations, expression quantitative trait loci (eQTL), and 3D chromosome interaction
annotations (10-18). The potential impact of SNPs on mRNA splicing has also been explored (19-21).
However, the functional prediction of SNPs affecting translation outputs is still urgently lacking.
Translation elongation is vital in protein synthesis but rarely mentioned in the functional annotation of
SNPs. Genetic variants affecting translation elongation can alter protein folding and stability. For
example, a synonymous SNP in the Multidrug Resistance 1 (MDR1) gene introduces a rare codon to
affect the translation elongation rate and the subsequent co-translational protein folding, which ultimately
alters its drug and inhibitor interactions (25). Another synonymous SNP in the cystic fibrosis
transmembrane conductance regulator (CFTR) changes the translation velocity, damaging protein
stability and function (26). During translation elongation, ribosomes move at varying rates. They can be
decelerated or stalled by factors including mRNA secondary structures, tRNA availabilities, and the
presence of nascent peptide chains within the ribosome exit tunnel (27). Upstream ribosomes may collide
with the slow or paused ribosomes and form disomes, trisomes, or higher-order complexes (28),
indicating low translation output.
The precise positions of ribosome complexes on transcripts, including ribosome stalling and
collisions, can now be obtained from ribosome profiling which sequences the ribosome-protected mRNA
fragments (27, 29-32). The status of translation elongation can be evaluated by probing the observed
monosomes and disomes. Under normal conditions, some occurrences of disomes are accompanied by
regulation tasks, such as facilitating protein localization on membranes, serving as a mechanism for start
codon selection, adjusting the productive length of protein biosynthesis, and so on. In other cases,
disomes trigger ribosome-associated protein quality control to avoid synthesizing abnormal proteins and
prevent ribosome shortage caused by defective mRNAs (33). SNPs may induce or prevent ribosome
collisions by altering mRNA secondary structures or changing cognate tRNAs with different abundances.
Therefore, it is essential to annotate a SNP based on its translational impact.
To understand the functional consequences of SNPs on mRNA translation, we trained a machine
learning model to predict ribosome collision at each specific locus. We predicted the differences in
ribosome occupancy caused by different alleles (the reference allele or alternative alleles) of the same
SNP. For a total of 97,944 considered SNPs, we discovered 1,900 SNPs exhibiting allele-dependent
differences in ribosome occupancy and named them RibOc-SNPs (Ribosome-Occupancy-SNPs). Many
RibOc-SNPs are associated with disease, providing a valuable resource to improve our understanding of
genetic variants’ influences on disease through translational regulation.
Methods
Ribosome footprint data download and preprocess
Ribosome footprints, including both monosome profiling and disome profiling of HEK293 cells
(27), were downloaded from the NCBI GEO database (GSE133393). To distinguish mRNA from other
non-coding RNA contaminants, similar to the previous study (31), STAR (34) was run
repeatedly with different mapping indexes. Three mapping indexes were built for
lncRNA, tRNA, and the mRNA transcriptome, based on the reference genome (GRCh38.p13) and GENCODE annotations
(release 43). The reads were mapped sequentially against these indexes, so that the unmapped
reads from one index were passed on to the next. As a result, non-coding RNA reads were
filtered out by the lncRNA and tRNA indexes. Then, the remaining unmapped reads were mapped to the
mRNA transcriptome index. Finally, reads whose lengths matched those of ribosome footprints were
selected. Specifically, those with lengths of 22 and 29 bp were deemed monosomes; those with lengths of
52, 54, and 61 bp were considered disomes, as suggested by the original paper (27). In addition, the A-site
offsets, which represent the distance between the 5’ end of the ribosome complex and the A-site in
monosomes or the A-site of the leading ribosome in disomes, were reported to be 15 bp for monosomes
and 46, 48, and 46 bp for disomes of lengths 52, 54, and 61 bp, respectively (27). These A-site offsets were
used to calculate the exact location of the ribosome complexes on the transcript.
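The length filter and A-site assignment described above can be illustrated with a minimal Python sketch. It reads the footprint lengths literally as listed in the text (if ranges of lengths were intended, the lookup tables would need to be extended), and it assumes that a read is described by the transcript coordinate of its 5’ end and its length. The function name `classify_read` and the coordinate convention are illustrative assumptions, not part of the original pipeline.

```python
from typing import Optional, Tuple

# Footprint length -> A-site offset from the 5' end, as quoted in the text.
# Assumption: the listed lengths are taken literally, not as ranges.
MONOSOME_OFFSETS = {22: 15, 29: 15}
DISOME_OFFSETS = {52: 46, 54: 48, 61: 46}


def classify_read(start: int, length: int) -> Optional[Tuple[str, int]]:
    """Classify one mapped read and locate its A-site.

    `start` is the transcript coordinate of the read's 5' end
    (hypothetical convention for this sketch). Returns a tuple of
    (complex type, A-site position), or None when the read length
    does not match a listed footprint length.
    """
    if length in MONOSOME_OFFSETS:
        return ("monosome", start + MONOSOME_OFFSETS[length])
    if length in DISOME_OFFSETS:
        # For disomes the offset points to the A-site of the leading ribosome.
        return ("disome", start + DISOME_OFFSETS[length])
    return None
```

For example, under these assumptions a 54 bp read whose 5’ end maps to position 100 would be treated as a disome with the leading ribosome's A-site at position 148, while a 40 bp read would be discarded.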
To demonstrate the robustness of our method, we applied our pipeline to two other ribosome
profiling datasets. The first dataset (GEO samples: GSM4203634, GSM4203635) was generated from
HeLa cells (31), and the second dataset (GEO samples: GSM2563629, GSM2563630) was obtained from
HEK293T cells (35).
Machine learning model on ribosome collision
We applied the random forests regression model (36) to predict the competitive disome and
monosome ribosome occupancy on a locus through genomic features around that site. Such competitive
ribosome occupancy was quantified as the fraction of disomes (FOD) on each A-site position of a
transcript:
$$\mathrm{FOD} = \frac{\text{normalized disome count}}{\text{normalized monosome count} + \text{normalized disome count}} \quad \text{(equation 1.1)}$$
where the raw disome and monosome read counts were normalized by the total number of disome and
monosome counts, respectively. FOD represents the strength or likelihood of ribosome collision on that
specific genomic site. To make the calculation reliable, we required at least 10 combined
ribosome-protected fragments (RPFs; monosome + disome reads) at each position.
The predictive features in the machine learning model include
the nucleotide, codon, and amino acid at the A-site; the same three types of information for the
surrounding 50 base pair (bp) positions upstream and downstream; and the region into which each
surrounding position falls (coding sequence (CDS) or untranslated region (UTR)). In addition, motifs of
the nucleotides, codons, and amino acids for the observed disome and monosome signals were identified
by STREME (37) with default parameters, except that the width of nucleotide motifs was set to
15-27 bp, which corresponds to a width of 5 to 9 codons or amino acids for the other two types of
motifs. Then each A-site was scanned by FIMO (38) to find matches to these discovered
motifs. The motif score, MS, for each motif occurrence is defined as
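$$\mathrm{MS} = -\sum_{i=1}^{n} \log_{10}\left(p_{\mathrm{STREME},i} \times p_{\mathrm{FIMO},i}\right) \quad \text{(equation 1.2)}$$
where, for the $i$-th of $n$ motif matches, $p_{\mathrm{STREME},i}$ is the p-value for the motif enrichment obtained from STREME and $p_{\mathrm{FIMO},i}$ is the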
matching probability obtained from FIMO. Since one A-site could be matched to multiple motifs found
by STREME, its final motif score is the sum of individual motif scores of all potential matches. A higher
MS score reflects a stronger resemblance to the motifs enriched by the disome or monosome signals,
while zero was assigned to A-sites with no motif occurrence. Six motif scores that were derived from
nucleotides, codons, and amino acids with respect to disomes and monosomes, were given to each A-site
as additional predictive features in the random forests model. We trained our machine learning model
using 90% of the data as a training set and the remaining 10% as a validation set.
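As a rough illustration of the motif score described above, the sketch below sums the negative log10 of the product of the STREME p-value and the FIMO matching probability over every motif matched at one A-site. The function name `motif_score`, the input layout (a list of probability pairs), and the example values are assumptions made for illustration; they are not taken from the original pipeline.

```python
import math


def motif_score(matches):
    """Sum -log10(p_streme * p_fimo) over all motif matches at one A-site.

    `matches` is a list of (p_streme, p_fimo) pairs (hypothetical layout).
    An empty list returns 0.0, mirroring the zero score assigned to
    A-sites with no motif occurrence.
    """
    score = 0.0
    for p_streme, p_fimo in matches:
        score += -math.log10(p_streme * p_fimo)
    return score


# Hypothetical A-site matched by two motifs (made-up probabilities).
example_matches = [(1e-8, 1e-4), (1e-5, 1e-3)]
print(motif_score(example_matches))  # roughly 12 + 8 = 20
print(motif_score([]))               # A-site without any motif match
```

Smaller probabilities give larger scores, so an A-site that closely resembles a strongly enriched motif contributes more to the feature than a weak or marginal match.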
Predict ribosome occupancy changes caused by different alleles of a SNP
With the above random forests model, FODs around SNPs can be predicted to examine the
changes in the occupancy of different ribosome complexes caused by nucleotide alterations. A total of
93,103 SNPs located in CDS regions were examined, including disease-related SNPs (56,333) from the
DisGeNET database (39) and non-disease-related ones (36,770) from the dbSNP database (40). The
ancestral allele information for these SNPs was downloaded from the Ensembl database
(https://useast.ensembl.org/index.html) via biomaRt (41). Ensembl determined ancestral alleles based on
phylogenetic modelling of multi-species alignments. Since a SNP could affect ribosome collision at
nearby A-sites, in addition to the scenario in which the SNP position itself is a ribosome A-site, we also
considered scenarios in which an A-site lies within the 50 bp upstream or downstream of the SNP. For
each alternative allele $a$ of a SNP at location $s$, we predicted the FOD, $\mathrm{FOD}_{a,j}$, with our trained random
forests model for the scenario in which the A-site is at position $j$, where $j \in (s-50, s+50)$.
Similarly, we obtained the predicted FOD for the reference allele as $\mathrm{FOD}_{\mathrm{ref},j}$. The maximum absolute
difference between $\mathrm{FOD}_{a,j}$ and $\mathrm{FOD}_{\mathrm{ref},j}$ among all the scenarios was calculated as $d_a$:
$$d_a = \max_{j \in (s-50,\, s+50)} \left|\mathrm{FOD}_{a,j} - \mathrm{FOD}_{\mathrm{ref},j}\right| \qquad \text{(equation 1.3)}$$
Note that the predictive features may differ for the same SNP across different transcript
isoforms of a single gene, and a SNP may have multiple alternative alleles. The largest of the $d_a$'s across
all transcripts and all alternative alleles was selected as the greatest potential change in translation, D, for
this particular SNP. Additionally, we restored the sign of the difference to D and denoted the signed
version as Ds. We ranked SNPs by their D values and selected the top 1,900 (~2%) SNPs as the ones with
remarkable ribosome occupancy changes and named them RibOc-SNPs (Ribosome-Occupancy-SNPs).
The remaining SNPs are herein called general SNPs.
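The selection procedure described above can be sketched in a few lines. This is a minimal illustration, assuming the predicted FODs for the alternative and reference alleles are already available as equal-length lists (one entry per candidate A-site position in the window around the SNP); the function names and the toy numbers are hypothetical and are not the original implementation.

```python
def largest_signed_change(fod_alt, fod_ref):
    """Signed difference (alt - ref) with the largest magnitude across
    the candidate A-site positions around one SNP."""
    diffs = [a - r for a, r in zip(fod_alt, fod_ref)]
    return max(diffs, key=abs)


def snp_scores(predictions):
    """Return (D, Ds) for one SNP.

    `predictions` holds one (fod_alt, fod_ref) pair per combination of
    transcript isoform and alternative allele (hypothetical layout).
    D is the largest absolute change; Ds keeps its sign.
    """
    ds = max((largest_signed_change(alt, ref) for alt, ref in predictions),
             key=abs)
    return abs(ds), ds


def top_snps(d_by_snp, k):
    """Identifiers of the k SNPs with the largest D values."""
    return sorted(d_by_snp, key=d_by_snp.get, reverse=True)[:k]


# Toy SNP seen on two isoforms (made-up predicted FODs).
preds = [([0.20, 0.50], [0.25, 0.10]),   # largest change here: +0.40
         ([0.10], [0.60])]               # largest change here: -0.50
D, Ds = snp_scores(preds)
print(D, Ds)
print(top_snps({"snpA": 0.10, "snpB": 0.50, "snpC": 0.30}, k=2))
```

Keeping the signed value alongside the absolute one means the ranking can use D while later analyses of direction reuse Ds without recomputing predictions.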
Results
Disease-related SNPs tend to cause more considerable ribosome occupancy changes
Using public ribosome profiling data, we built a random forests model to predict the fraction of
disomes (FOD) on each CDS position of the human transcriptome. Both monosomes and disomes were
profiled in HEK293 cells using modified ribosome profiling protocols (27). The normalized read counts
were used to calculate the observed FOD (details in Methods). Genomic features around the A-sites of
ribosome binding were considered in the machine learning model to predict FOD. The workflow is shown
in Figure 2.1. Specifically, genomic features include nucleotide and amino acid information for the
considered position and its surrounding ± 50 base pair (bp) positions, and whether the surrounding
positions are in CDS regions or UTR regions. More importantly, we further included disome- and
monosome- motif features at the nucleotide level, the codon level, and the amino acid level (details in
Methods). In our analysis, the resulting predictive model had R² values of 98.73% and 91.82% for the
training and validation sets, respectively, suggesting sufficient power to predict ribosome collision from
these genomic features.
Figure 2.1 Workflow of building a random forests model based on profiled ribosome footprints. Initially, the sequence alignments
belonging to the same transcript isoform were grouped together. Then, the ribosome A-site position relative to the corresponding
transcript was inferred by the read width. Predictive features such as nucleotides, codons, amino acids, and motif scores of the
A-site and its surrounding regions were extracted. Simultaneously, the occurrences of monosomes and disomes were recorded
during sequencing data processing. Finally, the information obtained from individual transcript isoforms was compiled as the
training set for the random forests model.
We applied the learned machine learning model to SNPs in CDS regions and examined the
absolute changes (D) in FOD caused by each genetic variant. Interestingly, compared to the benign SNPs,
the disease-related ones are more likely to alter ribosome disome occupancy fraction (Fig. 2.2a). The
median of D is 0.084 for disease-related SNPs compared to 0.068 for benign SNPs (p-value < 2.2×10^-16,
t-test). SNPs with large D’s are more likely to be disease-related. Among the top 2% SNPs with large D’s,
71.5% of them are disease-related, while only 55.2% of all considered SNPs are disease-related (Fig.
2.2b). Notably, the original FOD distributions are similar for the reference alleles of disease SNPs and
non-disease SNPs (Fig. 2.2c). Therefore, the locations of disease-related SNPs do not necessarily exhibit
a higher fraction of disomes when reference alleles are present. Instead, the FOD change between the
reference and alternative alleles indicates the functional consequence of the SNP. We then considered the
sign of the difference D (i.e., Ds). Figure 2.2d shows that alternative alleles may increase or decrease
FOD, and disease SNPs tend to have a larger effect than benign ones in both directions. For benign SNPs,
their alternative alleles are more likely to decrease the FODs (Fig. 2.2d).
Figure 2.2 Disease-related SNPs lead to larger changes in ribosome occupancy. (a) Box plots illustrating the predicted D
distribution for disease-related SNPs and benign SNPs. (b) Relative frequency of disease-related SNPs in the top-ranking SNPs
with large D’s. SNPs were ordered based on their D values. (c) Distributions of FOD in both disease-related and benign SNPs.
(d) Distribution of Ds (signed version of D) for disease-related SNPs and benign SNPs.
To demonstrate the robustness of our method, we applied our pipeline to two other ribosome profiling
datasets (31, 35). These two additional datasets were generated from different cell lines and different
treatment conditions than that of the original dataset. They provided distinctive FOD distributions (Fig.
S1a). Despite the disparities in cell conditions and their distinctive FOD distributions, the random forest
models trained from the new datasets showed a visible resemblance in feature selection to the original
model (Fig. S1b), suggesting that these models predict the FOD values based on similar features.
Furthermore, among the top 1900 SNPs sorted by their D values from the two new datasets, 17.4% and
14.4% of them overlapped with the original top 1900 SNPs, respectively. In contrast, a randomly selected
SNP set only achieved a 3.8% overlap with the original SNPs (Fig. S1c). Among the top 5000 SNPs, the
overlapping percentage increased to 29.4% and 26.1%, respectively, compared to 9.5% for random SNPs.
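The overlap comparison above can be illustrated with a small sketch. It assumes each dataset yields a ranked list of SNP identifiers and that the random baseline is estimated by repeatedly drawing SNP sets of the same size; how the baseline was actually computed is not stated here, so `random_overlap`, its parameters, and the toy data are assumptions.

```python
import random


def overlap_fraction(top_a, top_b):
    """Fraction of SNPs in top_a that also appear in top_b."""
    return len(set(top_a) & set(top_b)) / len(top_a)


def random_overlap(all_snps, reference_top, k, n_draws=200, seed=0):
    """Average overlap between the reference top set and k SNPs drawn
    at random (without replacement) from all considered SNPs."""
    rng = random.Random(seed)
    reference = set(reference_top)
    total = 0.0
    for _ in range(n_draws):
        sample = rng.sample(all_snps, k)
        total += len(reference & set(sample)) / k
    return total / n_draws


# Toy example: 1000 SNPs, with the first 100 as the original top set.
all_snps = list(range(1000))
original_top = list(range(100))
new_top = list(range(50, 150))          # hypothetical top set from a new dataset
print(overlap_fraction(new_top, original_top))
print(random_overlap(all_snps, original_top, k=100))
```

An observed overlap well above the random average suggests the rankings share signal across datasets, even when the absolute overlap is modest.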
Gene ontology and disease term analysis for top SNPs with large D’s
According to the plot of ordered D for the 97,944 considered SNPs (Fig. S2), we selected the top
1900 (2%) SNPs as the ones with remarkable ribosome occupancy changes and named them RibOc-SNPs
(Ribosome-Occupancy-SNPs). The remaining SNPs are herein called general SNPs. We conducted a gene
ontology (GO) enrichment analysis for the 1225 genes that are harboring the RibOc-SNPs. Using genes
of all investigated SNPs as the background and the default setting of DAVID (42), 26 GO terms are
enriched significantly (Fig. 2.3a, Bonferroni corrected p-values ≤ 0.05 and fold enrichments ≥ 2). To
eliminate the effect of genes that are susceptible to mutations other than RibOc-SNPs, we performed
another round of GO analysis for genes harboring the largest number of investigated SNPs (top 10%:
1090 genes). With the identical setup in DAVID (42), 231 GO terms are enriched (Table S2, Bonferroni
corrected p-values ≤ 0.05 and fold enrichments ≥ 2), including 14 terms shown in the RibOc-SNP
analysis. Apart from these shared terms, the remaining 12 GO terms are exclusively enriched by RibOc-SNPs
(red terms in Fig. 2.3a). Among them, 3 terms pertaining to ribosome are mainly enriched by 17
ribosomal protein genes. It is interesting that these ribosomal genes are harboring SNPs that cause great
discrepancies in ribosome occupancy during their own translations. The carbon metabolism pathway
along with a series of its biological processes, including steroid metabolism, sterol metabolism, and
cholesterol metabolism, are enriched by 56 genes. Our results show that SNPs of these essential genes
result in large differences in the ribosome occupancies on mRNA. These findings will help establish the
etiological path from SNPs to resultant phenotypic disease traits.
To determine which diseases are more likely to be associated with SNPs with large D’s, we
compared the percentages of RibOc-SNPs and general SNPs associated with each disease term. A total of
214 disease terms from DisGeNET (39) with at least 50 associated SNPs are under our consideration. As
illustrated in Figure 2.3b, 7 diseases, such as Alport syndrome, Hemophilia, and Colorectal Carcinoma, are
enriched in SNPs resulting in large changes in ribosome occupancy (p-values < 1×10^-5, hypergeometric
tests). Alport syndrome is a rare genetic disorder characterized by progressive kidney disease, hearing
loss, and eye abnormalities. According to DisGeNET, genes related to Alport syndrome are associated with
183 of our RibOc-SNPs. These RibOc-SNPs may affect the efficiency with which these genes are translated into
components of type IV collagen, an important protein for kidneys, inner ears, and eyes.
Figure 2.3 Enriched gene ontology and disease terms for RibOc-SNPs. (a) Summary of gene ontology terms enriched by genes
containing RibOc-SNPs. The fold changes are plotted as bars along with the corresponding uncorrected P-values listed. Terms
with Bonferroni corrected p-values ≤ 0.05 and fold enrichments ≥ 2 were selected. Terms not included by GO analysis on genes
harboring the largest number of investigated SNPs were printed in red. (b) Disease terms from the DisGeNET database
that are enriched in RibOc-SNPs. A disease term was considered significantly enriched in RibOc-SNPs if the 95% confidence
intervals of its percentage frequencies in RibOc-SNPs and general SNPs do not overlap and the p-value from the
hypergeometric test is < 1×10^-5.
Allele configuration of RibOc-SNPs
To gain more knowledge about SNPs with large D’s, we investigated their allele configuration.
Among the 12 possible nucleotide conversions (reference allele → alternative allele), 7 of them exhibit
significant differences in occurring frequencies between RibOc-SNPs and the remaining general SNPs
(p-values < 0.05, hypergeometric tests). As shown in Figure 2.4a, allele configurations such as “G → T”, “T
→ G”, and “C → A” are enriched in RibOc-SNPs. Notably “G → T” increases its presence in the
RibOc-SNPs the most, compared with general SNPs (19.6% vs. 8.3%). On the other hand, “C → T”, “T → C”,
“C → G”, and “A → T” are depleted in RibOc-SNPs compared to general SNPs. These 7 conversions
were summarized by a nucleotide squad, where the colors of the arrows indicate whether a conversion is
enriched or depleted in RibOc-SNPs (Fig. 2.4a).
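For readers less familiar with the hypergeometric test used for these enrichment comparisons, the following sketch computes an upper-tail p-value from first principles. The mapping of quantities (population = all considered SNPs, successes = SNPs with a given conversion, draws = RibOc-SNPs) is an assumed reading of the text, and the counts in the example are toy values rather than numbers from this study.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` items are taken without replacement from a
    population of size `population` containing `successes` items of interest."""
    upper = min(successes, draws)
    denom = comb(population, draws)
    numer = sum(comb(successes, i) * comb(population - successes, draws - i)
                for i in range(k, upper + 1))
    return numer / denom


# Toy enrichment question: 10 SNPs in total, 3 carry the conversion of
# interest, and 4 SNPs are selected. What is the chance that at least one
# selected SNP carries the conversion?
print(hypergeom_upper_tail(1, population=10, successes=3, draws=4))
```

Depletion can be assessed in the same way with the lower tail, that is, by summing the same terms from 0 up to the observed count.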
We found that the “A → G” conversion in DNA is not enriched in either SNP group. The adenine
to inosine (A → I) conversion is a widespread primary type of RNA editing in human transcriptomes
(43). Since inosine is interpreted as guanosine upon translation, A-to-I editing leads to post-transcriptional
A-to-G transitions in RNA. Our result indicates that the functional consequence of A-to-I editing on
ribosome collision may be marginal.
We then examined the direction of FOD changes by each nucleotide conversion. Conversions and their
counter conversions usually change FOD in opposite directions. For example, while “T → G” and “G →
C” largely increase the FODs, their counterparts “G → T” and “C → G” tend to decrease the FODs (Fig.
2.4b). Interestingly, conversion to C tends to increase FOD regardless of the reference alleles, which
suggests that C somehow promotes disome formation. As a result, a large majority of “T → C” increase
FODs (Fig. 2.4b, vertical red arrow marked between the two peaks).
We also examined the magnitude of FOD changes by each nucleotide conversion. The magnitude of FOD
changes varied among different conversions. “G → T” causes the largest difference in FODs (the horizontal
blue arrow marks the magnitudes in Fig. 2.4b). This matches our observation that “G → T” is enriched in
RibOc-SNPs. In contrast, “C → T” and “T → C” show the smallest magnitudes of FOD
changes in either direction, although they exhibit significantly asymmetrical FOD changes (mainly decreasing
or mainly increasing FODs) (Fig. 2.4b).
Next, we examined the amino acid changes caused by alternative alleles of RibOc-SNPs vs.
general SNPs. A total of 71 amino acid conversions (out of 200 conversion types among our considered
SNPs) exhibited significant distribution differences between RibOc-SNPs and general SNPs. As shown in
Figure 2.5, 30 conversions are enriched in RibOc-SNPs, and 41 are depleted. “Glu → stop (codon)” is the
conversion type showing the most significant distribution difference between RibOc-SNPs and general
SNPs. It is also the most represented in RibOc-SNPs (9.3%). Among the 30 amino acid conversions
enriched in RibOc-SNPs, 18 conversions appear in pairs (e.g., “Ala → Thr”/“Thr → Ala”, “Ala →
Ser”/“Ser → Ala”, and so on), while 12 other conversions appear singly.
For those amino acid conversion types depleted in RibOc-SNPs, 61.0% (25 out of 41) are
non-synonymous, and the remaining 39.0% are synonymous. Thus, many non-synonymous SNPs have a
negligible impact on ribosome occupancy, although they change amino acid sequences. For the 32 out of
34 enriched or depleted conversion types that appeared in pairs, each conversion and its counter
conversion display FOD changes in opposite directions (Fig. S3). Other conversions with no significant
distribution differences between RibOc-SNPs and general SNPs were summarized in Table S1.
Figure 2.4 Nucleotide conversions enriched or depleted in RibOc-SNPs. (a) Percentage frequencies of nucleotide conversions
calculated from RibOc-SNPs or general SNPs. Nucleotide conversions significantly enriched or depleted in RibOc-SNPs
(hypergeometric tests) were marked. *****: p ≤ 10^-5; ****: 10^-5 < p ≤ 10^-4; ***: 10^-4 < p ≤ 10^-3; and **: 10^-3 < p ≤
10^-2. (b) Distributions of Ds from all possible nucleotide conversions.
We further investigated amino acid conversions involving stop codons. A total of 11 amino acid
types were observed to convert from or to a stop codon. Changes from amino acids to stop codons usually
increase FODs, while changes from stop codons to amino acids typically result in lower FODs (Fig. 2.6a;
average FOD changes: 0.09 vs. −0.09; P < 2.2×10^-16, t test). This outcome is expected and validates our
machine learning model, because premature termination codons (PTC) introduced by SNPs cause
translation interruption, which can induce disome formation as incoming ribosomes collide with the
stalled one on the stop codon. Most changes from amino acids to stop codons correspond to
disease-related SNPs (91.2%). In the opposite scenario, when the stop codon changes to an ordinary amino acid,
ribosomes would keep traveling so that accumulation of ribosomes around the stop codon is relieved.
About 26.5% of such conversions are disease-related. PTCs usually lead to a decrease in
ribosome-protected fragments (RPFs). It is important to note that the fraction of disomes (FOD) measures the ratio
of disomes to (monosomes+disomes), which can be indicative of ribosome collisions and queueing, and
not necessarily the number of RPFs. Therefore, an increase in FOD can still reflect the presence of PTCs
and the downstream consequences they have on translation. We specifically examined the FODs
associated with stop codons and found that the UGA stop codon exhibited the lowest FODs among the
three stop codons. We analyzed genes linked to these UGA codons and found no significant enrichment
of Gene Ontology (GO) terms.
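The distinction between FOD and the number of RPFs can be made concrete with a short sketch. It assumes FOD at a position is the normalized disome count divided by the sum of normalized disome and monosome counts, with each raw count normalized by its library total and a minimum of 10 combined reads, as described in Methods; the function name, argument names, and numbers are hypothetical.

```python
def fraction_of_disomes(disome, monosome, total_disome, total_monosome,
                        min_reads=10):
    """Return the FOD at one position, or None if coverage is too low.

    Assumed form: d / (d + m), where d and m are the raw counts divided
    by the total disome and monosome counts, respectively.
    """
    if disome + monosome < min_reads or disome + monosome == 0:
        return None
    d = disome / total_disome
    m = monosome / total_monosome
    return d / (d + m)


# Toy position: fewer reads overall does not have to change the FOD.
print(fraction_of_disomes(30, 10, 1000, 1000))   # higher coverage
print(fraction_of_disomes(15, 5, 1000, 1000))    # half the reads, same ratio
print(fraction_of_disomes(3, 4, 1000, 1000))     # below the read threshold
```

Because the measure is a ratio, a drop in total footprints downstream of a PTC is compatible with an unchanged or even increased FOD at the stalled position.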
Figure 2.5 Amino acid changes enriched or depleted in RibOc-SNPs. The top 30 amino acid conversions were enriched in
RibOc-SNPs. The bottom 41 conversions were depleted from RibOc-SNPs. P-values were based on hypergeometric tests. *****:
p ≤ 10^-5; ****: 10^-5 < p ≤ 10^-4; ***: 10^-4 < p ≤ 10^-3; **: 10^-3 < p ≤ 10^-2; and *: 10^-2 < p ≤ 5×10^-2.
Figure 2.6 Distributions of Ds for amino acid conversions involving stop codons. (a) Boxplots of the Ds values for SNPs leading
to changes associated with stop codons. The red dots represent disease-related SNPs, and the grey dots represent benign SNPs.
(b) The Ds distribution for synonymous codon conversions based on whether such conversions started from ancestral alleles. The
left panel shows synonymous stop codon conversions; the right panel shows synonymous amino acid codon conversions.
Interestingly, the conversions from one stop codon to another stop codon often result in an
increased FOD (mean change: 0.04; P < 2.2×10^-16, t test) although they are usually not associated with
disease. Upon further checking, the reference allele is the ancestral allele for the majority (94.7%) of the
SNPs that caused these stop-to-stop codon conversions. This indicates evolutionary pressure on genes to
choose stop codons with lower FODs among the three possible stop codons (UAA, UAG, or UGA).
Figure 2.6b confirms that the stop-to-stop codon conversions that started from ancestral alleles usually lead to
positive Ds values. On the contrary, whether the reference allele is ancestral does not have a notable impact
on the Ds values in synonymous amino acid conversions. This may suggest a unique selection pressure on stop
codons only.
Figure 2.7 shows the signed Ds for individual conversions involving stop codons to dissect the
codon-specific effects. Conversions leading to more positive Ds values are colored purple, and those leading to
more negative Ds values are colored green. The conversions from an amino acid to stop codons always
resulted in increased FODs, and vice versa. These observations provide additional validation for the
accuracy and reliability of our machine learning model at the amino acid conversion level.
Figure 2.7 Density plots of Ds for each conversion involving stop codons. The curves corresponding to conversions with more
positive Ds are colored purple, and those with more negative Ds are colored green. The conversions enriched in RibOc-SNPs
are also highlighted in red boxes, while depleted ones are highlighted in blue. The frequency of each amino acid conversion is
listed in each sub panel.
RibOc-SNPs are enriched in the 5’ CDS regions
We examined the relationship between the genomic locations of SNPs and their potential to
change ribosome occupancy. Figure 2.8a shows the distribution of RibOc-SNPs along CDS regions vs.
that of general ones. Interestingly, RibOc-SNPs are highly enriched in the 5’ CDS regions. About 16.9%
(306 out of 1900) of RibOc-SNPs are located in the first 200 bp CDS regions, compared to the 13.3% of
general SNPs located in this particular CDS region (p-value < 2.2×10^-16, proportion test). In addition,
RibOc-SNPs are closer to both ends of CDS regions compared to general SNPs. On the 5’ end, the
median distance between the SNPs and the CDS boundary is 326 bp for RibOc-SNPs compared to 380 bp for
general SNPs; on the 3’ end, the corresponding distances are 327 bp and 378 bp, respectively.
Since SNPs may affect FOD levels of their nearby positions (see D value calculation in Methods),
the affected FOD position does not necessarily colocalize with the SNP itself. Figure 2.8b shows the
distribution of affected FOD positions relative to either RibOc-SNPs or general ones, where negative or
positive positions on the x-axis represent up or downstream distances to the SNP, respectively. For both
types of SNPs, the affected FOD positions peak at ~47 bp downstream of the SNP loci. However, the
concentration pattern around +47 bp is more prominent for RibOc-SNPs than general SNPs. Since a
translating ribosome protects about 30 nucleotides of an mRNA from nuclease activity (30, 44), our
results imply that RibOc-SNPs tend to affect positions roughly 1.5 ribosome footprints downstream. For
both types of SNPs, we also observed secondary peaks at the +1 bp position, indicating the
potential influence of these SNPs on their immediate surroundings.
Based on the locations of SNPs and their corresponding affected FOD positions, we inferred the
A-site position of the leading ribosome in the would-be disomes. As shown in Figure 2.8c, the A-site
location of the leading ribosome in potential disomes is piling up at both ends of the CDS. In the 5’ CDS
region, A-sites affected by RibOc-SNPs and general SNPs peak at 190 bp and 142 bp downstream of the
start of CDS, respectively. The 48 bp discrepancy suggests that positions further downstream of 5’ CDS
are more likely to exhibit large FOD changes caused by nearby SNPs. Additionally, A-sites related to
both types of SNPs accumulate toward the 3’ end of the CDS.
Some RibOc-SNPs shift FODs bi-directionally for different transcript isoforms
For SNPs located on multiple transcript isoforms of the same gene, their alternative alleles may
increase or decrease FODs depending on the reading frame of alternative transcript isoforms. For
example, the D value of rs143205514 is 0.340, determined by its largest positive FOD difference in
transcript ENST00000573283 of ACTG1. Meanwhile, this SNP also has a negative FOD difference with
a magnitude of 0.310 in an alternative ACTG1 transcript ENST00000575842. Among the 1900
RibOc-SNPs, 420 (22.1%) have notable FOD changes in both directions depending on splicing isoforms.
To better characterize this phenomenon, for multi-isoform genes, the FOD change which
determined the D value (i.e., the maximum FOD change among all isoforms) was designated as the
dominant change (D(1), equal to D), while the largest FOD change in the opposite direction from a
different isoform was labeled as the subordinate opposite change (D(2)_opposite) if available. As shown in
Figure 2.9a, many RibOc-SNPs have positive dominant FOD changes and negative subordinate FOD
changes simultaneously (red dots) or vice versa (blue dots). For some of these RibOc-SNPs, the
magnitude of the subordinate FOD changes is comparable to (although slightly less than) that of the
dominant changes (e.g., dots close to the diagonal line).
Figure 2.8 Location comparison between RibOc-SNPs and general SNPs. (a) SNP locations relative to ends of CDS regions.
SNPs with distances farther than 1500 bp to either end of the CDS were excluded (11.97%). (b) Distribution of affected FOD
positions with respect to SNP loci, where negative and positive numbers on the x-axis represent up or downstream of the SNP. (c)
Location distribution of A-site of the leading ribosome on the affected FOD position.
To further investigate the magnitude of FOD changes, we calculated the difference between the
absolute dominant FOD change and the absolute subordinate change (Dif_opposite=|D(1)|− |D(2)_opposite|).
Similarly, we identified the second largest FOD change in the same direction (D(2)_same) corresponding to
another transcript isoform, and calculated its differences from the dominant FOD change
(Dif_same=|D(1)|− |D(2)_same|). In many cases, Dif_same is close to 0, which reflects similar magnitudes of
D(1) and D(2)_same obtained from almost identical transcript isoforms (Fig. S4). After excluding values less
than 0.05, both Dif_opposite and Dif_same peak around 0.1, as shown in Figure 2.9b. The above results
imply that the magnitude of the opposite FOD changes from different transcript isoforms is not trivial.
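The dominant and subordinate changes defined above can be derived from per-isoform signed FOD changes as in the sketch below. It assumes one signed change per transcript isoform is available for a SNP; the function name is hypothetical, the first two example values echo the rs143205514 case described earlier, and the third value is made up to show a same-direction isoform.

```python
def isoform_differences(signed_changes):
    """Return (D1, Dif_opposite, Dif_same) for one SNP.

    `signed_changes` lists the signed FOD change for each transcript
    isoform (hypothetical layout). Dif_opposite or Dif_same is None when
    no other isoform changes FOD in that direction.
    """
    dominant = max(signed_changes, key=abs)
    rest = list(signed_changes)
    rest.remove(dominant)  # drop one copy of the dominant change
    opposite = [abs(c) for c in rest if c * dominant < 0]
    same = [abs(c) for c in rest if c * dominant > 0]
    dif_opposite = abs(dominant) - max(opposite) if opposite else None
    dif_same = abs(dominant) - max(same) if same else None
    return dominant, dif_opposite, dif_same


# +0.340 and -0.310 follow the example in the text; 0.335 is invented.
print(isoform_differences([0.340, -0.310, 0.335]))
# A SNP on a single isoform has no subordinate change.
print(isoform_differences([0.2]))
```

A small Dif_opposite, as in this example, corresponds to a point near the diagonal in a plot of dominant versus subordinate opposite changes.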
The ability to change FODs bi-directionally allows one SNP to further amplify the differences
between different transcript isoforms of the same gene at the translation level. We performed another
gene ontology enrichment analysis for genes harboring these bi-directional RibOc-SNPs with DAVID
(42). GO terms were tested with the background consisting of all genes of investigated SNPs and criteria
including Bonferroni corrected p-values ≤ 0.05 and fold enrichments ≥ 2. As summarized in Figure 2.9c,
21 GO terms were enriched. Compared with the terms enriched in the GO analysis for genes harboring the
largest number of investigated SNPs, 8 terms, such as ion channel activity, gastric cancer, and scaffold protein
binding, were exclusively enriched with genes containing bi-directional RibOc-SNPs (red terms in
Fig. 2.9c).
Figure 2.9 Analysis of RibOc-SNPs that can shift FOD bi-directionally. (a) Some RibOc-SNPs exhibit bi-directional FOD
changes among different transcript isoforms of the same gene. The magnitudes of dominant (D(1), equal to D) and subordinate
opposite (D(2)_opposite) changes were expressed by the horizontal and vertical axes, respectively. SNPs with D values
determined by positive FOD changes were plotted in red, while those determined by negative FOD changes were shown in blue.
(b) Distributions of Dif_opposite and Dif_same. Dif_opposite is the difference between the absolute dominant change and the
absolute subordinate change with the opposite sign (Dif_opposite=|D(1)|− |D(2)_opposite|). Dif_same is the difference between
the absolute dominant change and the absolute subordinate change with the same sign (Dif_same=|D(1)|− |D(2)_same|). Values
less than 0.05 were excluded. (c) Gene ontology analysis on genes where bi-directional RibOc-SNPs are located. The
uncorrected p-value of each term is listed beside the bar that represents the fold enrichment. Terms with Bonferroni corrected
p-values ≤ 0.05 and fold enrichments ≥ 2 were selected. Terms not included by GO analysis on genes harboring the largest number
of investigated SNPs were printed in red.
Discussion
In this study, our random forests model successfully predicted changes in FODs caused by nearby
SNPs. We distinguished RibOc-SNPs (about 2% of all SNPs) causing significant changes in ribosome
occupancy from the remaining general SNPs. Disease-related SNPs are more likely associated with more
substantial changes in disome occupancy, suggesting translational regulation can be an essential step in
pathological processes. However, a locus with a high ribosome disome fraction is not necessarily
disease-related (Fig. 2.2c). Gene ontology analysis and disease term analysis on RibOc-SNPs revealed
associations linking SNPs to diseases.
The application of our pipeline to additional RiboSeq data sets demonstrated the robustness of our
model (Fig. S1). It is important to note that we cannot directly apply the random forest model trained on
one cell line to predict FODs in another cell line due to significant distribution differences. However, we
observed that the selected features in the models and the subsequent RibOc-SNP selection are similar
across cell lines (Fig. S1).
We found that specific nucleotide changes and their corresponding amino acid changes are more
likely to increase the disome occupancy fraction. At the same time, some allele configurations tend to
reduce the formation of disomes. The lack of distribution differences between RibOc-SNPs and general
ones for the allele configurations of “A → G” and “G → A” may be associated with selection
pressure on the A-to-I (A-to-G) RNA editing. Since ribosomes stall at stop codons, as expected, we
observed that conversions from amino acids to stop codons always increase FODs. Interestingly, we
found that stop codons containing ancestral alleles of SNPs tend to exhibit lower FODs compared to other
synonymous stop codons containing alternative alleles, suggesting selection pressure among
synonymous stop codons. Such selection pressure does not show up for synonymous amino acids.
RibOc-SNPs are more enriched around both ends of the CDS region compared to general SNPs. Because the
affected FOD positions concentrate at 47 bp downstream of the SNPs, SNPs appear more
likely to influence ribosome occupancy at their downstream positions. Based on the locations of the
potential A-sites of affected FOD positions, they peaked around 190 or 142 bp (impacted by RibOc-SNPs
or general SNPs) downstream of CDS start sites. Thus, they leave a disome-free zone with a width of at
least 12 bp, which resonates with the original ribosome footprint profiling paper (27). On the 3’ end of the
CDS region, potential A-sites affected by both types of SNPs aggregate as they approach the stop codon,
which can be explained by the stalling of unrecycled ribosomes on the stop codon (33).
We identified hundreds of RibOc-SNPs (22.1%) simultaneously increasing and decreasing
disome occupancy on different splicing isoforms of the same gene. This suggests the coupling of
alternative splicing and translational control at the isoform level. The mechanism is unclear but likely
involves the alternative exon in cis and/or the trans RNA binding proteins interacting with the alternative
exon. Such a mechanism may be harnessed to selectively produce certain protein isoforms but not others.
Our machine learning model was built upon the monosome and disome profiling of HEK293
cells. In the future, as more data become available, our model can be further trained and improved. When
ribosome profiling data from different tissues are available, we will incorporate tissue specificity to
predict tissue-specific translational control consequences of SNPs, which will significantly advance
understanding of disease progressions in specific tissue contexts.
Chapter 3 Dissect genetic architecture of circadian rhythms from multi-tissue gene expression in the natural population
Introduction
Circadian rhythms, intrinsic 24-hour tempos, underlie a wide range of daily cycles in
physiological and behavioral processes, including activity levels, sleep-wake cycles, and eating-fasting
cycles. These rhythmic patterns are orchestrated by gene networks controlled by clock genes such as
CLOCK, PER, CRY, among others (45). These core clock genes collectively influence nearly half of all
mammalian genes (46), exhibiting tissue-specific coordination through feedback loops (47). This
orchestration results in the oscillation of gene expression levels in sync with the circadian rhythms,
impacting various physiological and behavioral processes, such as regulation of sleep, metabolism, and
immune response (45).
Given the pivotal role of circadian rhythms in the regulation of daily bodily functions, several
genome-wide association studies (GWAS) have sought to identify genetic variants associated with
circadian rhythms (48-52). Since the endogenous circadian rhythms align with external cues like light
patterns and food intake (53), many of these GWAS studies rely on sleep-related traits reflecting daily
light-dark cycles, such as sleep duration, sleep quality, and chronotype measurement. However, the
scarcity of suitable traits capable of accurately characterizing circadian rhythms poses a challenge to these
studies. A more direct approach would involve analyzing molecular-level traits obtained from gene
expression data, particularly time-series gene expression levels exhibiting ~24-hour rhythms.
Nonetheless, obtaining rhythmic expression data from humans is challenging, and the variability in the
daily rhythms of core clock genes across tissues adds complexity (54).
To better understand the molecular machinery behind circadian rhythms and discover novel
master regulators, we proposed a circadian deviation score to quantify molecular-level circadian
disruptions in individuals. These scores, derived from circadian gene expression levels in the
Genotype-Tissue Expression (GTEx) Project (55), incorporate thousands of genes with rhythmic expression levels
into tissue-specific phenotypes. The identification of trans-acting genetic variants linked to gene
expression, such as trans-acting expression quantitative trait loci (eQTLs), is often constrained by limited
statistical power. To identify trans-acting master circadian regulators, our proposed expression deviation
scores are summarized from the collective behavior of thousands of circadian genes. This strategy not
only circumvents the limitations for identifying trans-acting eQTLs associated with individual circadian
genes, but also provides a more comprehensive and robust understanding of the genetic architecture
governing gene regulation in the context of circadian rhythms.
With our novel intermediate molecular phenotype, namely circadian deviation scores, we
successfully identified 654 significant loci in a tissue-specific manner. Among these SNPs, a substantial
portion is in close proximity to established variants associated with circadian traits, including “insomnia”,
“chronotype measurement”, and “circadian rhythm,” thereby enriching existing knowledge in this
domain. Moreover, our analysis revealed circadian master SNPs associated with mental health and
cardiovascular diseases - conditions frequently intertwined with circadian disruptions. This
comprehensive exploration of master circadian regulators contributes significantly to advancing our
understanding of the molecular components underlying circadian disruptions and paves the way for
more targeted and effective investigations into the genetics of circadian regulation.
Methods
Assembly of human circadian genes
Circadian rhythms extend beyond the core clock components and their immediate targets,
encompassing a vast majority of genes with rhythmic transcription in mammals (45, 54). Nonetheless, a
comprehensive repertoire of human circadian genes is still lacking. In a study by Mure et al., numerous
circadian genes in the olive baboon were identified by analyzing the periodicity and rhythmicity in a set of
time-series gene expression data (54). Based on the one-to-one homology provided by Ensembl (Ensembl
Genes 106) (56), we took the human orthologues of these baboon circadian genes as human
circadian genes. Mure et al. examined a total of 64 distinct tissues, each contributing an independent
human circadian gene list.
Pre-processing of Genotype-Tissue Expression data
To explore the disruption of circadian gene expression in the natural human population, we
utilized the GTEx data (54, 55). Tissues containing circadian gene lists and comprising more than 70
samples in GTEx were considered in our analysis (23 tissues). We selected genes with expression greater
than 0.1 transcripts per million (TPM) in at least 20% of the samples in the specific tissue of interest.
Subsequently, we normalized gene expression values across samples using Trimmed Mean of M values
(TMM) as implemented in the R package edgeR (57). Finally, expression values were further
transformed into Z values across samples using an inverse normal transform (58).
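The filtering and transformation steps above can be sketched in Python as follows. This is a minimal illustration, not the pipeline used in this work: the TMM step performed with edgeR is omitted, and the function names, the average-rank handling of ties, and the (rank - 0.5)/n quantile convention are assumptions made for the example.

```python
from statistics import NormalDist


def passes_expression_filter(tpm_values, min_tpm=0.1, min_fraction=0.2):
    """Keep a gene if its TPM exceeds min_tpm in at least min_fraction of samples."""
    expressed = sum(1 for v in tpm_values if v > min_tpm)
    return expressed / len(tpm_values) >= min_fraction


def inverse_normal_transform(values, offset=0.5):
    """Rank-based inverse normal transform of one gene across samples.

    Tied values receive their average rank. The mapping (rank - offset) / n
    is one common convention and is an assumption of this sketch.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # 1-based average rank of the tie block
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - offset) / n) for r in ranks]


# Hypothetical expression values for one gene in three samples.
z_values = inverse_normal_transform([5.0, 1.0, 3.0])
```

The transform depends only on the ranks of the samples, so the resulting Z values are symmetric around zero regardless of the original expression scale.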
Molecular quantification of circadian disruption
We propose a circadian deviation score, $D_i$, to quantify the extent of circadian gene expression
disruption in each individual, as follows:
$$D_i = \sum_{j=1}^{n} \left| x_{i,j} - \tilde{x}_j \right| \cdot w_j$$
(equation 2.1)
where $i$ is the index of individuals, $j$ is the index of circadian genes, and $n$ is the total number of
circadian genes within the tissue of interest. Thus, $x_{i,j}$ is the $i$th individual’s expression level in the $j$th
circadian gene (transformed Z values); $\tilde{x}_j$ is the median of the expression of the $j$th circadian gene, and
$w_j$ is the weight of the $j$th gene. The weight $w_j$ is calculated as:
$$w_j = -\log_{10} p_j$$
(equation 2.2)
in which $p_j$ is the p-value of the $j$th gene reported by Mure et al., indicating the extent to which the $j$th
gene exhibits circadian characteristics based on its periodicity and rhythmicity in expression (54). The
circadian deviation score collectively captures an individual’s expression deviation in multiple circadian
genes. As such, it serves to characterize the overall degree of circadian disruption at the molecular level
for that individual.
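A minimal Python sketch of this score is given below: for each individual, it sums the absolute deviations of transformed expression from the gene-wise median, each weighted by the negative base-10 logarithm of the gene's rhythmicity p-value. The function name and data layout are hypothetical, and the sketch assumes that the median is taken across the individuals of one tissue and that every p-value lies in (0, 1].

```python
import math
from statistics import median


def circadian_deviation_scores(z, p_values):
    """Compute one deviation score per individual.

    z[i][j]     -- transformed expression of circadian gene j in individual i
    p_values[j] -- rhythmicity p-value of gene j (assumed 0 < p <= 1)
    """
    n_genes = len(p_values)
    weights = [-math.log10(p) for p in p_values]
    # Median of each gene across individuals (an assumption of this sketch).
    medians = [median(row[j] for row in z) for j in range(n_genes)]
    return [
        sum(abs(row[j] - medians[j]) * weights[j] for j in range(n_genes))
        for row in z
    ]


# Hypothetical toy data: three individuals, two circadian genes.
toy_z = [[0.0, 1.0], [1.0, 1.0], [2.0, 4.0]]
toy_p = [0.1, 0.01]
toy_scores = circadian_deviation_scores(toy_z, toy_p)
```

Genes with stronger evidence of rhythmicity (smaller p-values) receive larger weights, so deviations in those genes contribute more to the score.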
Mapping of genetic variants associated with circadian deviation scores
We scanned over 10 million SNPs, whose genotype calls were provided by GTEx (55), to test for
associations with circadian deviation scores. We removed rare variants with a minor allele frequency ≤
0.01. SNPs with more than 5% missing genotypes were also discarded. For the remaining SNPs, missing
genotypes were imputed based on reference and alternative allele frequencies. For SNPs on autosomes,
their genotypes were coded as the number of alternative alleles. Additionally, SNPs located on sex
chromosomes were handled differently for females and males. Females underwent the same coding
procedure as autosomal SNPs, while males, possessing only one X chromosome, were coded with either 0
(reference allele) or 2 (alternative allele). Finally, we retained SNPs with ≥ 5 individuals in at least two
genotype groups (a total of 10,826,972 SNPs over each of 23 considered tissues).
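The filters described above can be sketched in Python as follows. The function names and the input layout (one coded genotype per individual, `None` for a missing call) are hypothetical. The imputation of missing genotypes is not shown because the text does not specify its exact form, and computing the allele frequency from the coded values on the X chromosome (males coded 0 or 2) is a simplifying assumption.

```python
from collections import Counter


def code_male_x(carries_alternative):
    """Males carry a single X copy: code reference as 0 and alternative as 2."""
    return 2 if carries_alternative else 0


def passes_snp_filters(genotypes, maf_threshold=0.01, max_missing=0.05,
                       min_group_size=5, min_groups=2):
    """Apply the missingness, minor allele frequency, and group-size filters.

    genotypes -- coded values (0, 1, or 2 alternative alleles), None if missing
    """
    called = [g for g in genotypes if g is not None]
    if not called or (len(genotypes) - len(called)) / len(genotypes) > max_missing:
        return False
    alt_freq = sum(called) / (2 * len(called))
    if min(alt_freq, 1 - alt_freq) <= maf_threshold:
        return False
    group_sizes = Counter(called)
    large_groups = sum(1 for size in group_sizes.values() if size >= min_group_size)
    return large_groups >= min_groups


# Hypothetical SNP: 90 reference homozygotes, 8 heterozygotes, 2 alternative homozygotes.
example_kept = passes_snp_filters([0] * 90 + [1] * 8 + [2] * 2)
```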
We searched for associations between SNPs and circadian deviations for different tissues
separately. A multiple linear regression model was fit to examine the relationship between circadian
deviation scores in a specific tissue and the genotypes of a tested SNP, along with additional covariates
obtained from GTEx, including the top 5 genotyping principal components, a set of Probabilistic
Estimation of Expression Residuals (PEER) factors, sequencing platform (Illumina HiSeq 2000 or HiSeq
X) and protocol (PCR-based or PCR-free), and sex. SNPs with a p-value ≤ $5\times10^{-8}$ were considered
significant and designated as Circ-SNPs (Circadian-related SNPs).
Investigation of potential confounding from tissue sampling time
Since circadian genes exhibit rhythmicity in expression, if samples with a specific genotype were
collected from the donors around the same time during the day (i.e., more synchronized), such a genotype
group may exhibit higher or lower expression compared to other genotype groups. This could potentially
lead to false positives in our discoveries. To ensure that the different degree of synchronization across
different genotype groups is not a confounding factor for our Circ-SNP discovery, we calculated the
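correlation of circadian gene expression levels between donors, denoted as $\rho_{\text{circadian}}$, as follows:
$$\rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$
(equation 2.3)
where $X$ and $Y$ are the circadian gene expression levels of two donors, and $\mu$ and $\sigma$ denote the
corresponding means and standard deviations.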
Results
Circadian disruptions in human population
We introduced a circadian deviation score for each individual, reflecting the extent of circadian
disruption at the molecular level (details in Methods). Leveraging this deviation score, we analyzed
circadian gene expression patterns across various tissues in the natural human population, utilizing
the GTEx data. Out of the 54 tissues in GTEx, 23 had a sufficient number of samples with derived
circadian genes, enabling the deviation score calculation. The distributions of these scores in each tissue,
illustrated in Figure S5, are right-skewed, indicating the presence of individuals with remarkably
disrupted circadian rhythms at the molecular level. This aligns with the observation that circadian
disruption is widespread within human populations. The exceptional circadian profiles in these
individuals provide opportunities for genome-wide mapping of genetic variants associated with circadian
disruption at the molecular level.
Genome-wide mapping of SNPs associated with circadian disruptions across multiple tissues
We examined over 10 million SNPs for associations with the circadian deviation score in 23
tissues, identifying a total of 654 significant SNPs (named Circ-SNPs) across 16 tissues. As shown in
Figure 3.1a, small intestine – terminal ileum and adrenal gland harbor the majority of these SNPs, with
other tissues showing moderate quantities. Notably, the abundance of associations in specific tissues
is driven neither by larger sample sizes (Fig. 3.1b) nor by larger numbers of circadian genes (Fig. 3.1c).
Analyzing the summary statistics of the discovered Circ-SNPs, we consistently observed a
positive coefficient for the genotype variable when the minor allele is the alternative allele, or a negative
coefficient when the minor allele is the reference allele. Because we coded the genotype as the number of
alternative alleles, this suggests that the more minor alleles an individual harbors, the higher the risk of
circadian disruption. This matches the notion that minor alleles are more likely to be risk alleles (59). This
pattern contrasts with randomly selected non-significant SNPs (Fig. 3.2), indicating that individuals with
minor alleles experience greater circadian disruptions.
Figure 3.1 Tissue-specific statistics for 16 tissues containing Circ-SNPs. (a) Number of Circ-SNPs identified
in each tissue. (b) Number of donors available for calculating deviation scores in the tissue. (c) Number of
circadian genes used to calculate deviation scores in the tissue.
We further investigated whether the major allele is the ancestral allele in the identified Circ-SNPs. The
ancestral alleles were retrieved from the Ensembl database (https://useast.ensembl.org/index.html)
using biomaRt (41), which utilizes phylogenetic modeling of multispecies alignments to ascertain
ancestral alleles. Among the 654 Circ-SNPs, 624 had available ancestral allele information.
Out of these 624 Circ-SNPs, 548 had the ancestral allele as their major allele. To
contextualize this percentage, we sought ancestral allele information for 200,000 randomly selected
non-significant SNPs. Of these, 142,302 out of 191,658 SNPs had the ancestral allele as the reference allele.
Consequently, Circ-SNPs with the ancestral allele as their major allele are enriched compared to random
SNPs (548/624=0.872 vs 142302/191658=0.742, $P=6.8\times10^{-3}$, proportion test). Additionally, we
conducted a formal heritability analysis, but the results were inconclusive due to the limited sample size
available in GTEx. Nevertheless, the enrichment of the ancestral allele in Circ-SNPs suggests their
potential for higher heritability compared to randomly selected SNPs.
Figure 3.2 Impact of minor alleles on deviation scores. Based on the sign of the genotype coefficient in our
regression models, all Circ-SNPs show a positive relationship between the number of minor alleles and the
deviation score. In contrast, randomly selected SNPs may display either a positive or a negative
non-significant relationship between minor alleles and deviation scores.
Distribution of Circ-SNPs on the genome
The discovered Circ-SNPs are distributed throughout the genome, as depicted in Figure 3.3a, a
Manhattan plot illustrating their genomic locations. Notably, genes containing highly significant
Circ-SNPs ($-\log_{10}(p) \geq 10$) are highlighted. Several of them, including LCT, PRKN, and DPP6, have
been reported by the GWAS Catalog (2) or GeneHancer (60) to be associated with circadian-related
phenotypes such as chronotype measurement, insomnia, and circadian-associated mental health
conditions. Interestingly, Circ-SNPs tend to cluster, forming SNP clusters that span 1 Mbp and contain
at least 2 Circ-SNPs. Clusters encompassing only Circ-SNPs found in the same tissue are
marked by blue bands, while clusters containing Circ-SNPs from different tissues are denoted by red bands
(Fig. 3.3a).
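The text does not spell out how cluster boundaries were drawn, so the following Python sketch shows only one possible reading: SNPs on the same chromosome are sorted by position and grouped greedily so that no cluster spans more than 1 Mbp, groups with fewer than two SNPs are dropped, and each cluster is flagged as multi-tissue when its members come from more than one tissue. The function names, the greedy rule, and the input layout are all assumptions made for illustration.

```python
from collections import defaultdict


def _summarize(chromosome, members):
    """Describe one cluster given its (position, tissue) members."""
    tissues = {tissue for _, tissue in members}
    return {
        "chromosome": chromosome,
        "start": members[0][0],
        "end": members[-1][0],
        "n_snps": len(members),
        "multi_tissue": len(tissues) > 1,  # red band if True, blue band if False
    }


def cluster_snps(snps, max_span=1_000_000, min_size=2):
    """Greedily group SNPs given as (chromosome, position, tissue) tuples."""
    by_chromosome = defaultdict(list)
    for chromosome, position, tissue in snps:
        by_chromosome[chromosome].append((position, tissue))

    clusters = []
    for chromosome, items in by_chromosome.items():
        items.sort()
        current = []
        for position, tissue in items:
            # Close the current group once the next SNP would exceed the span.
            if current and position - current[0][0] > max_span:
                if len(current) >= min_size:
                    clusters.append(_summarize(chromosome, current))
                current = []
            current.append((position, tissue))
        if len(current) >= min_size:
            clusters.append(_summarize(chromosome, current))
    return clusters


# Hypothetical coordinates, for illustration only.
example_clusters = cluster_snps([
    ("chr1", 100, "liver"),
    ("chr1", 500_000, "liver"),
    ("chr1", 3_000_000, "liver"),
    ("chr2", 10, "liver"),
    ("chr2", 20, "adrenal gland"),
])
```

Other clustering rules, such as merging SNPs whose neighboring distance is below a threshold, would give different cluster counts, so the greedy window here should not be read as the procedure behind the reported numbers.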
In total, 102 Circ-SNP clusters were observed, with 18 comprising Circ-SNPs from multiple
tissues. The upset plot in Figure S6 illustrates the distribution of these SNP clusters among these tissues.
SNP clusters shared by multiple tissues may function synchronously to regulate circadian rhythm through
collaboration across different tissues. For example, SNPs from clusters involving adrenal gland and small
intestine may coordinate motion activities and the digestive system, while clusters involving liver and
small intestine may collectively support digestion and overall metabolism in human bodies. Out of the
654 discovered Circ-SNPs, 127 reside on the X chromosome. Figure 3.3b displays the genes that contain
these Circ-SNPs, with several, including DMD, SMARCA1, SUPT20HL2, and IL1RAPL1, associated
with mental or behavioral disorders like depression, schizophrenia, and anxiety according to GWAS
Catalog (2). IL1RAPL1 was also linked to chronotype measurement and circadian rhythm. The
significant proportion of Circ-SNPs on the X chromosome suggests that sex can be a factor in circadian
rhythm disruptions. Circadian rhythms may vary between males and females due to hormonal differences
and other biological factors (61-63). Sex differences in the prevalence of certain circadian-related
disorders, such as insomnia, have been observed (64, 65). Our reported Circ-SNPs on the X chromosome
may provide valuable insights into understanding these sex differences.
We explored the types of genomic locations of the discovered Circ-SNPs. As illustrated in
Figure S7a, the majority of the Circ-SNPs were located in non-coding regions such as intergenic regions
(IGRs) and introns, suggesting regulatory roles. Upon further examination, many of these Circ-SNPs
overlap with the cis-quantitative trait loci (cis-QTLs) reported by GTEx. Specifically, out of 654 Circ-SNPs,
164 are expression QTLs (eQTLs), of which 26 are tissue specific; 48 are alternative splicing QTLs
(sQTLs), of which 1 is tissue specific; and 11 are editing QTLs (edQTLs). Moreover, certain Circ-SNPs
demonstrate multiple regulatory functions, with 36 falling into two types of QTL and 3 encompassing all
three types of QTL. We also compared the identified Circ-SNPs with the trans-QTLs
reported by GTEx. However, no overlap was identified, primarily owing to the restricted number of
trans-QTLs available for analysis. These findings collectively indicate that Circ-SNPs likely fulfill their
regulatory functions through transcriptional mechanisms, post-transcriptional mechanisms, or a combination of both.
Figure 3.3 Manhattan plots of Circ-SNPs from multiple tissues. (a) Circ-SNPs identified in the entire genome, color-coded by
tissue, with genes harboring highly significant Circ-SNPs ($-\log_{10}(p) \geq 10$) highlighted. SNP clusters spanning 1 Mbp are
shown as colored bands, where blue indicates clusters whose members were identified in the same tissue and red indicates
clusters whose members were discovered in multiple tissues. (b) Circ-SNPs identified on the X chromosome, color-coded by
tissue, with genes containing Circ-SNPs. (c) Circ-SNPs colocalized with sleep duration SNPs, color-coded by tissue, with
genes housing these SNPs.
Exploration of potential confounding arising from tissue sampling time
As circadian genes display rhythmicity in expression, if samples within a specific genotype were
collected from donors around the same time during the day (i.e., synchronized), this genotype group may
demonstrate higher or lower expression compared to other genotype groups synchronized at a different
time point or not synchronized at all. Such synchronization within a genotype group could lead to false
positives in our identification of Circ-SNPs. Nonetheless, tissue sampling time is not available in GTEx. To
address this, for each identified Circ-SNP, we examined whether samples within a genotype group are
more synchronized than samples across different genotype groups. The evaluation involved calculating
the Spearman correlation coefficient of circadian gene expression for every sample pair, comparing
correlations within the same genotype group to those across different genotype groups. For the majority
(95.1%) of our Circ-SNPs, the within-genotype group correlations were either less than or equal to the
cross-genotype correlations (P-values ≤ 0.05, Wilcoxon test). This trend persisted when examining
Circ-SNPs belonging to multi-tissue SNP clusters, with the majority (98.5%) exhibiting lower or similar
within-genotype group correlations compared to those across genotype groups (P-values ≤ 0.05,
Wilcoxon test). Thus, the sampling time does not emerge as a significant confounding factor in the search
for Circ-SNPs.
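The pairwise comparison described above can be sketched in Python as follows. For one SNP, the sketch computes the Spearman correlation of circadian gene expression for every pair of samples and separates pairs with the same genotype from pairs with different genotypes. The function names and dictionary layout are hypothetical, and the subsequent Wilcoxon test on the two lists is not included.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            ranks[order[k]] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, i.e. the Pearson correlation of the rank vectors.

    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def split_pair_correlations(expression, genotype):
    """Split pairwise correlations into within- and cross-genotype lists.

    expression -- sample id -> circadian gene expression values (same gene order)
    genotype   -- sample id -> coded genotype at the tested SNP
    """
    within, across = [], []
    for a, b in combinations(sorted(expression), 2):
        rho = spearman(expression[a], expression[b])
        (within if genotype[a] == genotype[b] else across).append(rho)
    return within, across
```

If samples sharing a genotype had been collected at similar times of day, the within-genotype correlations would be expected to exceed the cross-genotype ones, which is the pattern the comparison is designed to detect.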
Circ-SNPs are closely linked to known circadian-related traits
We compared our discovered Circ-SNPs to genetic variants documented in the GWAS Catalog
(2). The majority of these Circ-SNPs are novel, with only 5 of them having been reported in previous
studies. Consequently, we assigned the existing SNPs in the GWAS Catalog within 1 Mb of our Circ-SNPs
as proxies for further comparisons. In total, our Circ-SNPs were associated with 34,262 proxy SNPs in
the GWAS Catalog. Out of the 10,039 traits reported in the GWAS Catalog, we selected 34 traits,
including “circadian rhythm”, “chronotype measurement”, “sleep time”, and “insomnia”, identified through
a search with the keyword “sleep” (https://www.ebi.ac.uk/gwas/search?query=sleep). We observed that 24 of
the 34 “circadian” traits were encompassed by the 3,551 traits associated with the proxy SNPs, suggesting
an enrichment of circadian traits among the proxy SNPs (Table 3.1; 24/34=0.71 vs. 3551/10039=0.35,
$P=3.0\times10^{-5}$, hypergeometric test). This indicates a higher likelihood of our Circ-SNPs being
associated with circadian traits as well.
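As a worked illustration of the hypergeometric test, the Python sketch below computes the upper-tail probability of observing at least 24 of the 34 “circadian” traits among the proxy-associated traits. The mapping of the reported counts onto the test (10,039 traits as the population, 3,551 proxy-associated traits as successes, and 34 “circadian” traits as draws) is an assumption about how the test was set up, and the function name is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """Return P(X >= k) for a hypergeometric random variable X."""
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total


# Counts taken from the text; their roles in the test are assumed.
p_traits = hypergeom_upper_tail(k=24, population=10039, successes=3551, draws=34)
```

Exact integer arithmetic is used for the binomial coefficients, so the only rounding occurs in the final division.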
Table 3.1 Sleep-related GWAS traits reported by the GWAS Catalog. Out of 34 traits, 24 are associated with proxy SNPs
tagged by Circ-SNPs.
GWAS Catalog trait Contain proxy SNPs
sleep latency TRUE
sleep quality TRUE
sleep measurement TRUE
sleep time TRUE
sleep duration TRUE
REM sleep behavior disorder TRUE
sleep apnea measurement TRUE
nighttime rest measurement TRUE
daytime rest measurement TRUE
insomnia measurement TRUE
Somnambulism TRUE
snoring measurement TRUE
narcolepsy-cataplexy syndrome TRUE
excessive daytime sleepiness measurement TRUE
insomnia TRUE
circadian rhythm TRUE
chronotype measurement TRUE
irritability measurement TRUE
hypertension TRUE
low density lipoprotein cholesterol measurement TRUE
depressive symptom measurement TRUE
triglyceride measurement TRUE
high density lipoprotein cholesterol measurement TRUE
coronary artery disease TRUE
sleep depth FALSE
sleep apnea FALSE
sleep apnea measurement during REM sleep FALSE
sleep apnea measurement during non-REM sleep FALSE
short sleep FALSE
obstructive sleep apnea FALSE
bruxism FALSE
periodic limb movement disorder FALSE
hypersomnia FALSE
emotional symptom measurement FALSE
We further investigated the genes harboring our Circ-SNPs, referred to as master circadian genes,
to uncover their connections to reported “circadian” traits. Altogether, the GWAS Catalog (2) annotated
4,236 genes as harboring SNPs associated with “circadian” traits. Meanwhile, our Circ-SNPs were
mapped to 199 master circadian genes, with 48 of them overlapping with the 4,236 GWAS “circadian”
genes. Therefore, GWAS “circadian” genes were significantly enriched in our master circadian genes
(48/199=0.241 vs. 4236/69222=0.061, $P=1.6\times10^{-16}$, hypergeometric test). Alternatively, we examined the
percentage distribution of our master circadian genes associated with the “circadian” traits, and compared
it to that of all GWAS genes. Figure 3.4 reveals a significant enrichment of our master circadian genes,
containing Circ-SNPs, in 11 out of the 31 “circadian” traits.
We also conducted colocalization analysis on the discovered Circ-SNPs using ezQTL (66), a
web-based tool for integrative QTL colocalization with GWAS data. For this analysis, we selected eQTLs
reported in GTEx (55) as eQTL data. We chose a study (67) utilizing sleep duration as the phenotype. We
obtained LD information from the European population in the 1000 Genomes Project (68) and set the cis-QTL
distance to 100 Kb. With the above settings, eCAVIAR (69), a colocalization analysis method integrated
into ezQTL, identified 25 Circ-SNPs colocalizing with 27 significant SNPs in both GTEx and the sleep
duration GWAS study. eCAVIAR’s calculations suggested that these 27 SNPs are potentially causal,
regulating 22 genes (Fig. 3.3c). Among these genes, UVSSA, PLOD2, RPL4, and others were annotated by the
GWAS Catalog (2) and GeneHancer (60) as associated with phenotypes like insomnia, sleep duration,
and circadian rhythm. Therefore, our discovered Circ-SNPs can aid in pinpointing causal SNPs in the sleep
duration GWAS study. The results also further support the association between our Circ-SNPs and
circadian rhythms. As most of our Circ-SNPs are newly discovered, we did not observe any intersections
with known circadian-related SNPs. The challenge was further compounded by the limited accessibility
of full summary statistics from circadian GWAS, leading to the inability to identify established SNPs
directly colocalizing with our Circ-SNPs.
Figure 3.4 Master circadian genes containing Circ-SNPs are linked to “circadian” traits. The percentage of our master
circadian genes associated with the “circadian” traits is compared to that of all GWAS genes. A trait was considered to be
significantly enriched by master circadian genes if the 95% confidence intervals of percentage frequencies from our circadian
genes and GWAS genes do not overlap and the hypergeometric test’s p-value<0.05.
Genes containing Circ-SNPs encode druggable targets
To assess the druggability of proteins encoded by genes harboring Circ-SNPs (referred to as master
circadian genes), we utilized the DrugBank database (70). Out of the 122 protein-coding master circadian
genes, 18 code for proteins targeted by 163 drugs. Notably, six of these drugs (Doxepin, Desipramine,
Amitriptyline, Doxylamine, Trimipramine, and Propiomazine) directly address sleep disorders and
disturbances, including insomnia. According to DrugBank, these drugs are associated with nearly 500
health conditions. We categorized these health conditions into 18 groups based on their symptoms and
corresponding anatomical systems. As shown in Figure 3.5a, mental health conditions predominate
among these categories, accounting for 19.55% of all involved health conditions. We further collected
drugs associated with the aforementioned conditions from DrugBank (70) as the background (Table 3.2,
Table S8). Compared to the background percentage for all drugs (8.61%, Fig. 3.5b), mental health conditions
are significantly enriched among drugs associated with our Circ-SNPs. This enrichment aligns with the
common occurrence of circadian disruptions in individuals with various psychiatric disorders, including
major depressive disorder, bipolar disorder, anxiety, and schizophrenia (71). This suggests that drugs
targeting proteins encoded by our master circadian genes treat related symptoms.
Figure 3.5 Compositions of condition categories for drugs targeting our master circadian genes. (a) Percentage
frequencies of condition categories associated with proteins encoded by master circadian genes harboring Circ-SNPs. Less
prominent categories (less than 5%) were grouped into Other Conditions. (b) Percentage frequencies of all drugs in DrugBank.
Table 3.2 Number of drugs that treat different condition groups. Frequencies and percentages are listed for drugs connected to
Circ-SNPs (hit) and for all drugs that treat the aforementioned conditions (total). The p-values were obtained from proportion
tests, in which the alternative hypothesis is that the percentage of hit drugs is greater than that of all drugs for each
condition category.
Condition group                     hit frequency  hit percentage  total frequency  total percentage  p value
Mental Health Conditions            140            19.55%          185              8.61%             1.63E-25
Gastrointestinal Conditions         70             9.78%           131              6.10%             2.69E-05
Respiratory Conditions              67             9.36%           259              12.05%            9.85E-01
Neurological Conditions             65             9.08%           136              6.33%             1.61E-03
Cardiovascular Conditions           57             7.96%           167              7.77%             4.52E-01
Other Conditions                    54             7.54%           209              9.73%             9.72E-01
Pain and Discomfort                 44             6.15%           201              9.35%             9.98E-01
Genitourinary Conditions            37             5.17%           54               2.51%             4.95E-06
Allergic and Immune Conditions      35             4.89%           116              5.40%             6.99E-01
Skin and Dermatological Conditions  34             4.75%           180              8.38%             1.00E+00
Treatment-Based Conditions          25             3.49%           117              5.44%             9.87E-01
Oral and Throat Conditions          23             3.21%           67               3.12%             4.85E-01
Blood-Related Conditions            21             2.93%           101              4.70%             9.84E-01
Eye Conditions                      15             2.09%           72               3.35%             9.61E-01
Nutritional Deficiencies            11             1.54%           46               2.14%             8.38E-01
Endocrine and Metabolic Conditions  8              1.12%           67               3.12%             9.99E-01
Poisoning and Toxicity              6              0.84%           4                0.19%             1.51E-04
Reproductive Health Conditions      4              0.56%           37               1.72%             9.88E-01
Additionally, neurological conditions are also among the enriched categories (65/716=0.091 vs.
136/2149=0.063, $P=8.0\times10^{-3}$, proportion test), and these conditions are closely related to
altered circadian rhythms. Specifically, the circadian timing system and homoeostatic sleep–wake history
might affect the susceptibility to seizures (72). The enrichment of drugs targeting our master circadian
genes for the treatment of sleep disorders, mental health conditions, and neurological conditions
underscores the potential therapeutic relevance of our Circ-SNPs.
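The one-sided proportion tests in Table 3.2 can be illustrated with the Python sketch below. The text states only that proportion tests were used, so the specific form shown here is an assumption: a one-sample test of the hit percentage against the background percentage, using the normal approximation with a continuity correction. The function name is hypothetical.

```python
from math import erfc, sqrt


def proportion_test_greater(hits, total, background_fraction, correct=True):
    """One-sided, one-sample proportion test (alternative: hit rate is greater).

    Normal approximation with an optional continuity correction; this form is
    an assumption and is not stated in the text.
    """
    expected = total * background_fraction
    deviation = abs(hits - expected)
    yates = min(0.5, deviation) if correct else 0.0
    z = (deviation - yates) / sqrt(
        total * background_fraction * (1 - background_fraction)
    )
    if hits < expected:
        z = -z
    return 0.5 * erfc(z / sqrt(2))  # upper-tail probability of a standard normal


# Counts from Table 3.2: hit frequency out of 716, background frequency out of 2149.
p_mental = proportion_test_greater(140, 716, 185 / 2149)
p_respiratory = proportion_test_greater(67, 716, 259 / 2149)
```

A category whose hit percentage lies below the background percentage, such as respiratory conditions, yields a p-value above 0.5 under this one-sided alternative.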
Discussion
In this study, we proposed a deviation score based on circadian genes and their expression levels,
providing an effective way to capture circadian disruptions manifested at the molecular level in
individuals from the natural human population in GTEx. Utilizing this deviation score as an intermediate
phenotype, we discovered 654 significant SNPs. Despite the limitation of unavailable sampling times in
GTEx, we demonstrated that this missing variable is not a serious confounding factor.
A noteworthy observation is that the vast majority (523 out of 654) of our discovered Circ-SNPs
were found in small intestine – terminal ileum and adrenal gland, despite the limited number of samples
in these tissues. This prolific discovery rate suggests promising outcomes when applying our deviation
scores to datasets with a larger sample size. Furthermore, the robustness of our method is supported by
the consistent pattern in which individuals harboring minor alleles exhibit higher deviation scores for
each of our detected Circ-SNPs. In addition, the enrichment of SNPs where the major allele is the
ancestral allele in Circ-SNPs suggests that they may exhibit higher heritability compared to randomly
selected SNPs. Finally, the significant presence of Circ-SNPs on the sex chromosome raises
intriguing possibilities regarding the implications of sex for circadian rhythms.
Most of our discovered Circ-SNPs are novel, not reported by the GWAS Catalog (2). However,
they exhibit strong colocalization with established GWAS SNPs responsible for circadian-related traits.
Furthermore, our Circ-SNPs often reside in genes with functional consequences related to circadian
traits such as “insomnia”, “chronotype measurement”, “circadian rhythm”, etc. Notably, many proteins
encoded by genes harboring Circ-SNPs serve as targets for drugs treating diseases and conditions closely
linked to circadian rhythms. All these connections also validate the reliability of our proposed deviation
score method in detecting meaningful SNPs associated with circadian disruptions.
In summary, our study not only unveils novel insights into circadian disruptions at the molecular
level but also establishes the robustness and applicability of our deviation score method. The
identification of Circ-SNPs, their colocalization with reported GWAS variants, and their association with
drug-targeted proteins underscore the relevance of our findings to circadian-related traits and potential
therapeutic avenues. Our study will help to better understand the genetic landscape underlying circadian
rhythms.
Chapter 4 Conclusions and Future work
Forecasting ribosome collision around SNPs
We predicted changes in Fraction of Occupied Data (FODs) caused by nearby Single Nucleotide
Polymorphisms (SNPs), distinguishing those that significantly alter ribosome occupancy from general
SNPs and highlighting the role of translational regulation in various biological processes, particularly in
disease mechanisms. While the model's robustness was demonstrated across datasets, applying it directly
to different cell lines requires consideration of distribution differences. The analysis reveals specific
nucleotide and amino acid changes influencing ribosome occupancy, with implications for RNA editing
and stop codon effects on FODs, as well as the enrichment of RibOc-SNPs near coding sequences,
particularly impacting downstream positions. Additionally, this study uncovers a connection between
alternative splicing and translational control, suggesting potential mechanisms yet to be fully elucidated.
Future directions include incorporating tissue specificity into the model to enhance understanding of
disease progression across diverse biological contexts, providing insights into targeted translational
control strategies for disease research.
Associate SNPs to Circadian rhythm with molecular phenotype
We introduced a deviation score based on circadian genes and their expression levels, offering an
effective means to capture circadian disruptions at the molecular level in individuals from the natural
human population in GTEx. Utilizing this deviation score as an intermediate phenotype, we identified 654
significant SNPs. The robustness of our approach is supported by consistent patterns where individuals
with minor alleles consistently exhibit higher deviation scores for each Circ-SNP. Moreover, the
enrichment of ancestral alleles in Circ-SNPs suggests higher heritability compared to randomly selected
SNPs, with significant presence on the sex chromosome raising intriguing possibilities regarding sex's
influence on circadian rhythms. Many of these Circ-SNPs are novel, yet they exhibit strong connections
with established GWAS SNPs related to circadian traits and often reside in genes with functional
consequences linked to circadian processes like "insomnia" and "chronotype measurement." Notably,
these genes encode proteins targeted by drugs treating conditions associated with circadian rhythms,
affirming the reliability of our deviation score method in identifying meaningful SNPs associated with
circadian disruptions. Overall, our study unveils new insights into circadian disruptions at the molecular
level, establishing the reliability and applicability of our deviation score method and contributing to a
better understanding of the genetic basis of circadian rhythms.
Extract knowledge from SNPs identified from pathways
We expanded our intermediate molecular phenotype approach and conducted association studies on over
1,000 pathways. Within the hierarchical structure of pathway networks, higher-level pathways typically
exhibit a greater number of associated SNPs, reflecting their complexity with multiple genes and intricate
regulatory mechanisms. This complexity increases the likelihood of genetic variations, including SNPs,
impacting various pathway components. Conversely, lower-level pathways often have higher SNP to gene
ratios, suggesting a more significant polygenic impact from individual SNPs within the more specific
pathways. Notably, we found that disease-related SNPs generally make a larger polygenic contribution
compared to those in non-disease-related pathways. Among the more than 200,000 significant SNPs
identified, many overlap with known cis-eQTLs and cis-sQTLs within corresponding genes, validating
our approach's reliability. However, several identified SNPs fall outside the scanning range of cis-QTLs,
potentially representing trans-QTLs, providing a novel approach for mapping them. We also analyzed
identified SNPs overlapping with those in the GWAS Catalog. These SNPs' associated traits often closely
relate to the pathways where significant SNPs were found, indicating functional relevance. Some SNPs
were identified in multiple pathways, particularly in pathways strongly connected by functions, even
across different branches within the pathway hierarchy. We aim to further explore these SNPs to uncover
their functions and explore inter-pathway connections.
References
1. R. J. Klein et al., Complement Factor H Polymorphism in Age-Related Macular Degeneration.
Science (American Association for the Advancement of Science) 308, 385-389 (2005).
2. A. Buniello et al., The NHGRI-EBI GWAS Catalog of published genome-wide association studies,
targeted arrays and summary statistics 2019. Nucleic acids research 47, D1005-D1012 (2019).
3. K. Frousios, C. S. Iliopoulos, T. Schlitt, M. A. Simpson, Predicting the functional consequences of
non-synonymous DNA sequence variants — evaluation of bioinformatics tools and development
of a consensus strategy. Genomics (San Diego, Calif.) 102, 223-228 (2013).
4. S. S. Nishizaki, A. P. Boyle, Mining the Unknown: Assigning Function to Noncoding Single
Nucleotide Polymorphisms. Trends in genetics 33, 34-45 (2016).
5. I. A. Adzhubei et al., A method and server for predicting damaging missense mutations. Nature
methods 7, 248-249 (2010).
6. C. Ferrer-Costa et al., PMUT: a web-based tool for the annotation of pathological mutations on
proteins. Bioinformatics 21, 3176-3178 (2005).
7. P. C. Ng, S. Henikoff, SIFT: predicting amino acid changes that affect protein function. Nucleic
acids research 31, 3812-3814 (2003).
8. V. Ramensky, P. Bork, S. Sunyaev, Human non‐synonymous SNPs: server and survey. Nucleic
acids research 30, 3894-3900 (2002).
9. D. Wang et al., SNP2Structure: A Public and Versatile Resource for Mapping and Three-Dimensional
Modeling of Missense SNPs on Human Protein Structures. Computational and structural biotechnology
journal 13, 514-519 (2015).
10. A. P. Boyle et al., Annotation of functional variation in personal genomes using RegulomeDB.
Genome research 22, 1790-1797 (2012).
11. S. G. Coetzee, S. K. Rhie, B. P. Berman, G. A. Coetzee, H. Noushmehr, FunciSNP: An
R/bioconductor tool integrating functional non-coding data sets with genetic association studies
to identify candidate regulatory SNPs. Nucleic acids research 40, e139-e139 (2012).
12. D. Lee et al., A method to predict the impact of regulatory variants from DNA sequence. Nature
genetics 47, 955-961 (2015).
13. M. J. Li, L. Y. Wang, Z. Xia, P. C. Sham, J. Wang, GWAS3D: Detecting human regulatory variants
by integrative analysis of genome-wide associations, chromosome interactions and histone
modifications. Nucleic acids research 41, W150-158 (2013).
14. W. McLaren et al., Deriving the consequences of genomic variants with the Ensembl API and
SNP Effect Predictor. Bioinformatics 26, 2069-2070 (2010).
15. X.-H. Meng, H.-M. Xiao, H.-W. Deng, Combining artificial intelligence: deep learning with Hi-C
data to predict the functional effects of non-coding variants. Bioinformatics 37, 1339-1344
(2021).
16. K. Wang, M. Li, H. Hakonarson, ANNOVAR: functional annotation of genetic variants from
high-throughput sequencing data. Nucleic acids research 38, e164-e164 (2010).
17. L. D. Ward, M. Kellis, HaploReg: a resource for exploring chromatin states, conservation, and
regulatory motif alterations within sets of genetically linked variants. Nucleic acids research 40,
D930-D934 (2012).
18. J. Zhou, O. G. Troyanskaya, Predicting effects of noncoding variants with deep learning-based
sequence model. Nature methods 12, 931-934 (2015).
19. H.-L. Chiang, J.-Y. Wu, Y.-T. Chen, Identification of functional single nucleotide polymorphisms in
the branchpoint site. Human genomics 11, 27-27 (2017).
20. K. Faber, K.-H. Glatting, P. J. Mueller, A. Risch, A. Hotz-Wagenblatt, Genome-wide prediction of
splice-modifying SNPs in human genes using a new analysis pipeline called AASsites. BMC
bioinformatics 12, S2-S2 (2011).
21. Y. Z. Kurmangaliyev, R. A. Sutormin, S. A. Naumenko, G. A. Bazykin, M. S. Gelfand, Functional
implications of splicing polymorphisms in the human genome. Human molecular genetics 22,
3449-3459 (2013).
22. L. H. Hartwell, J. J. Hopfield, S. Leibler, A. W. Murray, From molecular to modular cell biology.
Nature (London) 402, C47-C52 (1999).
23. E. Cirillo et al., From SNPs to pathways: Biological interpretation of type 2 diabetes (T2DM)
genome wide association study (GWAS) results. PloS one 13, e0193515-e0193515 (2018).
24. R. J. F. Loos, 15 years of genome-wide association studies and no signs of slowing down. Nature
communications 11, 5900-5900 (2020).
25. C. Kimchi-Sarfaty et al., A "Silent" Polymorphism in the MDR1 Gene Changes Substrate
Specificity. Science (American Association for the Advancement of Science) 315, 525-528 (2007).
26. S. Kirchner et al., Alteration of protein function by a silent polymorphism linked to tRNA
abundance. PLoS Biology 15, e2000779-e2000779 (2017).
27. P. Han et al., Genome-wide Survey of Ribosome Collision. Cell reports (Cambridge) 31, 107610-
107610 (2020).
28. T. Zhao et al. (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, 2020).
29. A. B. Arpat et al., Transcriptome-wide sites of collided ribosomes reveal principles of
translational pausing. Genome research 30, 985-999 (2020).
30. N. T. Ingolia, S. Ghaemmaghami, J. R. S. Newman, J. S. Weissman, Genome-Wide Analysis in Vivo
of Translation with Nucleotide Resolution Using Ribosome Profiling. Science (American
Association for the Advancement of Science) 324, 218-223 (2009).
31. C. C.-C. Wu, A. Peterson, B. Zinshteyn, S. Regot, R. Green, Ribosome Collisions Trigger General
Stress Responses to Regulate Cell Fate. Cell 182, 404-416.e414 (2020).
32. T. Zhao et al., Disome-seq reveals widespread ribosome collisions that promote cotranslational
protein folding. Genome Biology 22, 16-16 (2021).
33. S. Meydan, N. R. Guydosh, Disome and Trisome Profiling Reveal Genome-wide Targets of
Ribosome Quality Control. Molecular cell 79, 588-602.e586 (2020).
34. A. Dobin et al., STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15-21 (2013).
35. Y. Park, A. Reyna-Neyra, L. Philippe, C. C. Thoreen, mTORC1 Balances Cellular Amino Acid Supply
with Demand for Protein Synthesis through Post-transcriptional Control of ATF4. Cell reports
(Cambridge) 19, 1083-1090 (2017).
36. L. Breiman, Random forests. Machine learning 45, 5-32 (2001).
37. T. L. Bailey, STREME: Accurate and versatile sequence motif discovery. Bioinformatics 37, 2834-
2840 (2021).
38. C. E. Grant, T. L. Bailey, W. S. Noble, FIMO: Scanning for occurrences of a given motif.
Bioinformatics 27, 1017-1018 (2011).
39. J. Piñero et al., The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic
acids research 48, D845-D855 (2020).
40. S. T. Sherry et al., dbSNP: the NCBI database of genetic variation. Nucleic acids research 29, 308-
311 (2001).
41. S. Durinck, P. T. Spellman, E. Birney, W. Huber, Mapping identifiers for the integration of
genomic datasets with the R/Bioconductor package biomaRt. Nature protocols 4, 1184-1191
(2009).
42. D. W. Huang, R. A. Lempicki, B. T. Sherman, Systematic and integrative analysis of large gene
lists using DAVID bioinformatics resources. Nature protocols 4, 44-57 (2008).
43. L. Chen, Characterization and comparison of human nuclear and cytosolic editomes. Proceedings
of the National Academy of Sciences - PNAS 110, E2741-E2747 (2013).
44. S. L. Wolin, P. Walter, Ribosome pausing and stacking during translation of a eukaryotic mRNA.
The EMBO journal 7, 3559-3569 (1988).
45. F. Rijo-Ferreira, J. S. Takahashi, Genomics of circadian rhythms in health and disease. Genome
medicine 11, 82-82 (2019).
46. R. Zhang, N. F. Lahens, H. I. Ballance, M. E. Hughes, J. B. Hogenesch, A circadian gene expression
atlas in mammals: Implications for biology and medicine. Proceedings of the National Academy
of Sciences - PNAS 111, 16219-16224 (2014).
47. M. L. Gumz. (Springer New York, United States, 2016), pp. 1-55.
48. D. J. Gottlieb, G. T. O'Connor, J. B. Wilk, Genome-wide association of sleep and circadian
phenotypes. BMC medical genetics 8, S9-S9 (2007).
49. Q. S. Li, C. Tian, G. R. Seabrook, W. C. Drevets, V. A. Narayan, Analysis of 23andMe
antidepressant efficacy survey data: implication of circadian rhythm and neuroplasticity in
bupropion response. Translational psychiatry 6, e889-e889 (2016).
50. Genetics; Automated Feature Extraction from Population Wearable Device Data Identified Novel
Loci Associated with Sleep and Circadian Rhythms. Genomics & Genetics Weekly, 413 (2020).
51. S. E. Jones et al., Genome-wide association analyses of chronotype in 697,828 individuals
provides insights into circadian rhythms. Nature communications 10, 343-343 (2019).
52. A. Ferguson et al., Genome-Wide Association Study of Circadian Rhythmicity in 71,500 UK
Biobank Participants and Polygenic Association with Mood Instability. EBioMedicine 35, 279-287
(2018).
53. F. Fagiani et al., Molecular regulations of circadian rhythm and implications for physiology and
diseases. Signal transduction and targeted therapy 7, 41-41 (2022).
54. L. S. Mure et al., Diurnal transcriptome atlas of a primate across major neural and peripheral
tissues. Science (American Association for the Advancement of Science) 359, 1232 (2018).
55. L. J. Carithers et al., A Novel Approach to High-Quality Postmortem Tissue Procurement: The
GTEx Project. Biopreservation and biobanking 13, 311-319 (2015).
56. F. Cunningham et al., Ensembl 2022. Nucleic acids research 50, D988-D995 (2022).
57. M. D. Robinson, A. Oshlack, A scaling normalization method for differential expression analysis
of RNA-seq data. Genome Biology 11, R25-R25 (2010).
58. Z. R. McCaw, J. M. Lane, R. Saxena, S. Redline, X. Lin, Operating characteristics of the
rank-based inverse normal transformation for quantitative trait analysis in genome-wide
association studies. Biometrics 76, 1262-1272 (2020).
59. T. Kido et al., Are minor alleles more likely to be risk alleles? BMC medical genomics 11, 3-3
(2018).
60. S. Fishilevich et al., GeneHancer: genome-wide integration of enhancers and target genes in
GeneCards. Database (Oxford) 2017, (2017).
61. J. F. Duffy et al., Sex difference in the near-24-hour intrinsic period of the human circadian
timing system. Proceedings of the National Academy of Sciences - PNAS 108, 15602-15608
(2011).
62. J. Ma et al., EEG power spectra response to a 4-h phase advance and gaboxadol treatment in
822 men and women. Journal of clinical sleep medicine 7, 493-501 (2011).
63. N. Santhi et al., Sex differences in the circadian regulation of sleep and waking cognition in
humans. Proceedings of the National Academy of Sciences - PNAS 113, E2730-E2739 (2016).
64. H. A. Beydoun et al., Sex Differences in Patterns of Sleep Disorders Among Hospitalizations With
Parkinson's Disease: 2004-2014 Nationwide Inpatient Sample. Psychosomatic medicine 83, 477-
484 (2021).
65. J. J. Madrid-Valero, J. M. Martínez-Selva, B. Ribeiro do Couto, J. F. Sánchez-Romera, J. R.
Ordoñana, Age and gender effects on the prevalence of poor sleep quality in the adult
population. Gaceta sanitaria 31, 18-22 (2017).
66. T. Zhang, A. Klein, J. Sang, J. Choi, K. M. Brown, ezQTL: A Web Platform for Interactive
Visualization and Colocalization of QTLs and GWAS Loci. Genomics Proteomics Bioinformatics 20,
541-548 (2022).
67. M. Marinelli et al., Heritability and Genome-Wide Association Analyses of Sleep Duration in
Children: The EAGLE Consortium. Sleep 39, 1859-1869 (2016).
68. A. G. Clark et al., A global reference for human genetic variation. Nature (London) 526, 68-74
(2015).
69. F. Hormozdiari et al., Colocalization of GWAS and eQTL Signals Detects Target Genes. American
journal of human genetics 99, 1245-1260 (2016).
70. D. S. Wishart et al., DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic
acids research 46, D1074-D1082 (2018).
71. W. H. Walker, J. C. Walton, A. C. DeVries, R. J. Nelson, Circadian rhythm disruption and mental
health. Translational psychiatry 10, 28-28 (2020).
72. S. Khan et al., Circadian rhythm and epilepsy. Lancet neurology 17, 1098-1108 (2018).
Appendices
Supplementary Figures
Figure S1 Robustness of our analysis pipeline. (a) Distribution of FODs derived from the different datasets. (b) Important
features are similar across the random forest models derived from the different datasets. The x-axis shows the importance ranks
of features. The y-axis consists of five variable categories (Codon usage, Nucleotide usage, Region, Amino acid usage &
Distance to CDS boundary). (c) Percentages of top SNPs selected by different datasets that overlap with the top ones selected by
the original dataset. The random set of SNPs was selected without considering their D values.
Figure S2 Changes in FODs (D) caused by SNPs. The SNPs are sorted by their D values. The top 1900 SNPs are selected as
RibOc-SNPs since they have relatively large changes in FODs.
Figure S3 Distributions of amino acid conversions that are enriched or depleted in RibOc-SNPs and appear in pairs. The
curves corresponding to conversions with more positive Ds are colored purple, and those with more negative Ds are colored
green. The ones enriched in the RibOc-SNPs are also highlighted in red boxes, while depleted ones are highlighted in blue. The
frequency of each amino acid conversion is listed in each sub panel.
Figure S4 Distributions of Dif_opposite and Dif_same. Dif_opposite is the difference between the absolute dominant change and
the absolute subordinate change with the opposite sign (Dif_opposite=|D(1)|− |D(2)_opposite|). Dif_same is the difference between
the absolute dominant change and the absolute subordinate change with the same sign (Dif_same=|D(1)|− |D(2)_same|).
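As a rough illustration of the two quantities defined in the caption above, the following Python sketch computes Dif_opposite and Dif_same for a single SNP from its per-isoform D values. It assumes, which the caption does not state, that the dominant change D(1) is the isoform-level D with the largest absolute value, and that each subordinate change is the largest-magnitude D among the remaining isoforms with the opposite or the same sign. The function name dif_scores and the input values are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def dif_scores(d_values: Sequence[float]) -> Tuple[Optional[float], Optional[float]]:
    """Return (Dif_opposite, Dif_same) for one SNP.

    d_values holds the change in FOD (D) on each transcript isoform.
    Assumption: D(1) is the D with the largest absolute value, and each
    subordinate change is the largest-magnitude remaining D with the
    opposite or the same sign. None is returned when no such isoform exists.
    """
    if not d_values:
        raise ValueError("need at least one D value")
    # Index of the dominant change D(1).
    top = max(range(len(d_values)), key=lambda i: abs(d_values[i]))
    d1 = d_values[top]
    rest = [d for i, d in enumerate(d_values) if i != top]
    # Absolute subordinate changes, split by sign relative to D(1).
    opposite = [abs(d) for d in rest if d * d1 < 0]
    same = [abs(d) for d in rest if d * d1 > 0]
    dif_opposite = abs(d1) - max(opposite) if opposite else None
    dif_same = abs(d1) - max(same) if same else None
    return dif_opposite, dif_same


# Hypothetical D values for one SNP on three isoforms.
dif_opposite, dif_same = dif_scores([0.5, -0.25, 0.125])
print(dif_opposite, dif_same)
```

Under these assumptions, a small Dif_opposite means that the SNP shifts ribosome occupancy in opposite directions, and by similar amounts, on two isoforms.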
Figure S5 Distributions of deviation scores derived from different tissues.
Figure S6 Composition of tissues in Circ-SNP clusters. Dots in the lower right panel represent the tissues involved in a cluster,
where connected dots signify that multiple tissues were implicated. The Clusters per tissue panel lists the number of clusters that
contain the tissue of interest, and the Tissue intersections panel reports the number of clusters that are exclusive to each tissue
combination.
Figure S7 Genomic location of identified Circ-SNPs. (a-b) Compositions of genomic regions that harbor identified Circ-SNPs/all
investigated SNPs.
Abstract
Genome-wide association studies (GWAS) aim to uncover genetic variants linked to diseases and traits by comparing individuals' phenotypes with their genotypes. Despite GWAS revealing numerous associations, understanding the functional implications of identified genetic variants remains a challenge. To enhance the interpretability of single nucleotide polymorphisms (SNPs) identified through GWAS, we developed an innovative pipeline that annotates SNPs at the translational level. Additionally, we introduced an intermediate molecular phenotype to improve GWAS workflows and extended this strategy to pathways, providing a richer biological context for deciphering SNP functions.
The functional impact of SNPs on translation has yet to be considered when prioritizing disease-causing SNPs from GWAS. Here we apply machine learning models to genome-wide ribosome profiling data to predict SNP function by forecasting ribosome collisions during mRNA translation. SNPs causing remarkable ribosome occupancy changes are named RibOc-SNPs (Ribosome-Occupancy-SNPs). We found that disease-related SNPs tend to cause notable changes in ribosome occupancy, suggesting translational regulation as an essential pathogenesis step. Nucleotide conversions such as “G → T”, “T → G”, and “C → A” are enriched in RibOc-SNPs with the most significant impact on ribosome occupancy, while “A → G” (or “A → I” RNA editing) and “G → A” are less deterministic. Among amino acid conversions, “Glu → stop (codon)” shows the most significant enrichment in RibOc-SNPs. Interestingly, there is selection pressure on stop codons with lower collision likelihood. RibOc-SNPs are enriched at the 5’ coding sequence regions, implying hot spots of translation initiation regulation. Strikingly, about 22.1% of the RibOc-SNPs lead to opposite changes in ribosome occupancy on alternative transcript isoforms, suggesting that SNPs can amplify the differences between splicing isoforms by oppositely regulating their translation efficiency.
Circadian rhythms are 24-hour cycles that are observed in many facets of physiological and behavioral processes, such as the regulation of sleep, metabolism, and immune response. Numerous studies have suggested a genetic influence on circadian rhythms, and several GWAS have been conducted to identify genetic variants associated with them. Usually, these GWAS rely on sleep-related traits, such as the onset time, duration, and quality of sleep, to reflect the daily light-dark cycles of circadian rhythms. In this study, a circadian deviation score is proposed as an intermediate phenotype to represent the degree of circadian disruption at the molecular level. Such deviation scores are derived from gene expression levels in a general population, which spares the need for continuous sampling within 24-hour cycles. Additionally, we broadened the set of potential factors that affect circadian rhythms by incorporating thousands of genes into the calculation of deviation scores in a tissue-specific manner. With these deviation scores, we discovered 654 SNPs associated with expression-level circadian disruption. In addition to SNPs reported to be related to circadian traits such as “insomnia”, “chronotype measurement”, and “circadian rhythm”, our list contains many novel ones. Our findings will advance the understanding of the genetic components underlying sleep disorders, and our approach will benefit the design of future GWAS strategies.
We examined how pathway depth within the hierarchical structure affects the associated SNPs, applying our expanded intermediate molecular phenotype approach to more than 1,000 pathways. Higher-level pathways consistently displayed a greater number of associated SNPs, mirroring their intricate regulatory mechanisms and involvement of multiple genes, which heightens the potential for genetic variations, including SNPs, to influence various pathway components. Conversely, lower-level pathways exhibited higher SNP-to-gene ratios, indicative of a more pronounced polygenic influence from individual SNPs within these more specific pathways. Notably, disease-related SNPs demonstrated a larger polygenic contribution compared to those in non-disease-related pathways. Among the more than 200,000 significant SNPs identified, many overlapped with known cis-eQTLs and cis-sQTLs within the corresponding genes, supporting the reliability of our approach. Yet some identified SNPs fell outside the scanning range of cis-QTLs, potentially representing trans-QTLs and suggesting a novel way to map them. Furthermore, our analysis of SNPs overlapping with those in the GWAS Catalog revealed traits closely linked to the pathways where the significant SNPs were discovered, indicating functional relevance. Some SNPs were identified in multiple pathways, especially in functionally connected pathways, even across different branches of the pathway hierarchy.
Asset Metadata
Creator: Li, Zheyu (author)
Core Title: Predicting functional consequences of SNPs: insights from translation elongation, molecular phenotypes, and pathways
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Chemical Engineering
Degree Conferral Date: 2024-05
Publication Date: 05/24/2024
Defense Date: 04/19/2024
Publisher: Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag: circadian rhythm, GWAS, machine learning, OAI-PMH Harvest, pathway analysis, ribosome profiling, SNPs
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Wang, Pin (committee chair), Chen, Liang (committee member), Finley, Stacey (committee member)
Creator Email: lee-zheyu@hotmail.com, zheyuli@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC113967516
Unique identifier: UC113967516
Identifier: etd-LiZheyu-13025.pdf (filename)
Legacy Identifier: etd-LiZheyu-13025
Document Type: Dissertation
Rights: Li, Zheyu
Internet Media Type: application/pdf
Type: texts
Source: 20240528-usctheses-batch-1162 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu