Leveraging functional datasets of stimulated cells
to understand the relationship between environment and diseases
By
Ruowen Wang
A Thesis Presented to the
FACULTY OF THE USC KECK SCHOOL OF MEDICINE
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(BIOSTATISTICS)
May 2022
Copyright 2022 Ruowen Wang
Acknowledgments
I would like to express my gratitude towards all who have offered warm support for my
thesis.
First and foremost, my deepest gratitude goes to the Gazal lab at USC and my mentor
Professor Steven Gazal, for the lab’s support and Prof. Gazal’s constant encouragement and
guidance. Prof. Gazal provided me with invaluable advice and instruction with patience for my
writing of the thesis. It has been the best memory of mine during my master course to enjoy the
time of every meeting and share the results of my project with the group.
Secondly, my sincere thank you goes to my parents Mrs. Ying Cui and Mr. Jiemin Wang,
as well as my whole extended family, for their continued support and love.
I would also like to express my thanks to my boyfriend Mr. Zhuangboyu Zhou, as well as
all my friends who are always by my side to support me, especially during this pandemic. Their
accompaniment was an important source of my strength during the period of my master study.
Thank you all for always making me happy.
Thank you all for being with me and witnessing the birth, the growth, and the perfection
of my thesis.
TABLE OF CONTENTS
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Method
2-1 Datasets of genes differentially expressed by environments
2-2 Stratified LD score regression
2-3 Creating gene-set annotations
Chapter 3: Result
3-1 Analyses of drug-disease matched pairs
3-1-a/ Inflammatory drugs and inflammatory diseases
3-1-b/ Antihistamine and respiratory and allergic diseases
3-1-c/ Insulin drugs and diabetes
3-1-d/ Statin and cardiovascular traits
3-1-e/ Conclusion
3-2 Choice of the methods
3-3 Detecting new candidate environment-disease relationships
Chapter 4: Discussion
References
List of Tables
Table 1: Number of differentially expressed genes after stimulation by different environments
reported by Moyerbrailean et al. (using the FDR10 strategy).
Table 2: Number of differentially expressed genes after stimulation by different environments
reported by Findley et al. (using the FDR10 strategy).
Table 3: Number of differentially expressed genes after stimulation by different environments
reported by Balliu et al. (using the FDR05 strategy).
Table 4: S-LDSC results for inflammatory drugs in inflammatory diseases.
Table 5: S-LDSC results for Antihistamine and respiratory and allergic diseases.
Table 6: S-LDSC results for insulin drugs and diabetes.
Table 7: S-LDSC results for statin and cardiovascular traits.
Table 8: Top 10 environment-trait pairs highlighted by S-LDSC.
List of Figures
Figure 1: S-LDSC results for control environments.
Figure 2: S-LDSC results for a gold standard list of 35 environment-disease pairs.
Abstract
Identifying gene-environment relationships can inform the biology of complex diseases. Recent
studies have shown that stimulating cells with an environment (e.g. a pathogen or drug exposure) in
vitro leads to hundreds to thousands of differentially expressed genes. While it is often not
possible to test if the exposure to an environment increases or decreases the risk of human
diseases in vivo, developing methods that will optimally link these differentially expressed genes
to genome-wide association studies (GWAS) results might help to investigate the impact of the
environment on human diseases. However, it is still unclear if this strategy would provide
meaningful results, and what would be the best approach to connect GWAS results to genes
responding to an environment, notably due to the challenge of linking GWAS causal variants to
their target genes. Our project is to develop and apply methods leveraging gene expression of
stimulated cells (providing environment-genome relationships) and summary statistics from
GWAS (providing genome-disease relationships), to detect new environment-disease
relationships. First, we curated differential expression datasets from three studies, to build a
catalog of genes differentially expressed after stimulation by more than 50 environments,
including several drugs. Second, we developed different strategies to detect differentially
expressed genes and to create SNP-annotations derived from those genes. Third, we applied
stratified LD score regression with SNP-annotations built using genes differentially expressed by
drugs to GWAS data from the drug targeted traits; we leveraged these analyses to validate our
approach and to define what is the best strategy to create gene-set annotations in order to apply
stratified LD score regression. Finally, we applied our optimal strategy of building
SNP-annotations for > 50 environments across > 100 human diseases and complex traits. Our
analyses highlight new environment-disease relationships, and open future research directions.
Chapter 1: Introduction
Human diseases and complex traits are impacted by both genetic and environmental factors.
Genome-wide association studies (GWAS) have allowed testing for associations between all
common genetic variants and hundreds of traits, yielding rich insights into the genetic
architectures of these traits. They notably highlighted that these genetic architectures are highly
polygenic, i.e. dominated by thousands of common variants with weak effects. However,
investigating the impact of the environment on diseases has been more challenging, since
many environments interact with each other, and since not all environments can be measured
or detected.
Recent studies have shown that stimulating cells with an environment (such as a pathogen
or a drug) in vitro leads to thousands of differentially expressed genes [1–8]. Intersecting the
genes involved in human complex traits with the genes responding to an environment could be a
simple and powerful way to connect environmental exposure to the risk of many diseases,
without doing any experiments in vivo. However, it is still unclear if this strategy provides
meaningful results, and what is the best approach to connect GWAS results to genes responding to
an environment, notably due to the challenge of linking GWAS causal variants to their target
genes.
Here we will leverage datasets from 3 papers: Moyerbrailean et al. [1], Findley et al. [2],
and Balliu et al. [7]. We will investigate what is the best strategy to create gene-set annotations in
order to apply stratified LD score regression (S-LDSC), a powerful method to investigate the
polygenic signal of human diseases and complex traits from GWAS summary statistics [9,10]. In
particular, we will investigate a default approach of selecting SNPs in +/-100kb of the gene [11],
and a new approach selecting only the functional SNPs of the gene [12]. Finally, we will apply
the selected approach for more than 50 environments across 100 common diseases and complex
traits.
Chapter 2: Method
2-1 Datasets of genes differentially expressed by environments
We leveraged transcriptional response to environmental treatments from three datasets [1,2,7].
We restricted analyses to 19,995 genes analyzed previously [12].
The Moyerbrailean dataset [1] records the transcriptional response to 50 environmental
treatments on 5 blood and immune cell types: human umbilical vein endothelial cells (HUVECs),
lymphoblastoid cell lines (LCLs), peripheral blood mononuclear cells (PBMCs), human smooth
muscle cells (SMCs), and melanocytes (Mel). The 50 environments were divided into 6
categories: metal ions, dietary components, peptide hormones and neurotransmitters, steroid
hormones, common drugs, and environmental contaminants and common chemicals (Table 1).
For each cell-type and each environment (i.e. 250 datasets), genes differentially expressed before
and after the stimulation were determined as genes with at least one transcript having a
controlled false discovery rate (FDR) of 10% and an absolute log2 (fold-change) > 0.25, as in
[1]. We used the most differentially expressed genes of these analyses to control for type I error
in further analyses.
The Findley dataset [2] records the transcriptional response to 28 environmental
treatments on three cell types: lymphoblastoid cell lines (LCLs), induced pluripotent stem cells
(IPSCs), and cardiomyocytes (CMs). The 28 environments were divided into the same 6
categories as in [1] (Table 2; 28/28 were also used in the Moyerbrailean dataset). For each
cell-type and each environment (i.e. 84 datasets), genes differentially expressed before and after
the stimulation were determined as genes with at least one transcript having a controlled FDR of
10% and an absolute log2 (fold-change) > 0.25, as in [2].
The Balliu dataset [7] records the transcriptional response to 21 environmental treatments
on three cell types: fat, liver, and skeletal muscle cell lines. The 21 environments were divided
into 7 categories: glucose and insulin metabolism, kinase inhibitors, inflammation, adipokine,
drug, fatty-acid metabolism, and other (Table 3). For each cell-type and each environment (i.e.
63 datasets), genes differentially expressed before and after the stimulation were determined as
genes with at least one transcript having a controlled FDR of 5%, as in [7] (note that these
choices are not consistent with the ones of the previous datasets, but that these inconsistencies
were not explored here).
We note here that two different criteria were used to define genes differentially expressed
by an environment: the first two studies used a controlled FDR of 10% and an absolute log2
(fold-change) > 0.25 (strategy that we will label FDR10), and the last study used a controlled
FDR of 5% (strategy that we will label FDR05). For consistent analyses, we applied these two
different criteria to each dataset. We also restricted analyses to environments with at least 200
differentially expressed genes (~1% of the genes) to guarantee enough SNPs in S-LDSC
polygenic analyses. For environments leading to more than 2,000 differentially expressed genes
(~10% of the genes), we kept the 10% of genes with the smallest P values. With the FDR10
strategy, we selected 71, 35, and 45 environment/cell type pairs for the Moyerbrailean, Findley,
and Balliu datasets, respectively. With the FDR05 strategy, we selected 64, 30, and 43
environment/cell type pairs for the Moyerbrailean, Findley, and Balliu datasets, respectively.
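As an illustration, the selection rules described above can be sketched in a few lines of Python. This is a minimal sketch and not the code used in the thesis: the `GeneResult` record, the `select_genes` function, the use of `<=` for the FDR cut-offs, and the assumption that each gene has already been summarized by one transcript-level result are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

@dataclass
class GeneResult:
    gene: str
    p_value: float   # P value of the gene's reported transcript (assumed summary)
    fdr: float       # controlled FDR (q-value) for that transcript
    log2_fc: float   # log2 fold-change for that transcript

MIN_GENES = 200      # ~1% of the 19,995 genes: smaller sets are dropped
MAX_GENES = 2_000    # ~10% of the genes: larger sets are truncated

def select_genes(results, strategy):
    """Return selected gene names, or None when the environment is dropped."""
    if strategy == "FDR10":
        hits = [r for r in results if r.fdr <= 0.10 and abs(r.log2_fc) > 0.25]
    elif strategy == "FDR05":
        hits = [r for r in results if r.fdr <= 0.05]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    if len(hits) < MIN_GENES:
        return None  # too few genes for a polygenic S-LDSC analysis
    if len(hits) > MAX_GENES:
        # keep the genes with the smallest P values
        hits = sorted(hits, key=lambda r: r.p_value)[:MAX_GENES]
    return [r.gene for r in hits]
```

Applying `select_genes` to each environment/cell type pair with both strategy labels would yield the two families of gene sets described in the text.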
Environment | Description | Category | #DEG HUVEC | #DEG LCL | #DEG PBMC | #DEG SMC | #DEG Mel
Acetaminophen | Paracetamol | Drugs | - | - | 1272 | 474 | 1551
Acetylcholine | Neurotransmitters | Neurotransmitters | - | - | 1694 | - | -
Acrylamide | Acrylic amide | Common chemicals | 624 | 1236 | - | 176 | -
Aldosterone | Steroid hormone | Steroid hormones | - | - | - | 2144 | -
Aspirin | Acetylsalicylic acid | Drugs | 194 | 51 | 1112 | 46 | 10
BHA | Butylated hydroxyanisole | Common chemicals | 312 | - | 838 | - | 1012
BP-3 | Oxybenzone | Common chemicals | 685 | - | - | 166 | -
BPA | Bisphenol A | Common chemicals | - | 721 | - | 77 | -
Cadmium | Chemical element | Metal ions | - | 1280 | 1607 | - | -
Caffeine | Consumed psychoactive drug | Dietary components | 2287 | 5319 | 8910 | 3987 | 2288
Cetirizine | Medication for allergies | Drugs | - | 1390 | - | - | -
Copper | Chemical element | Metal ions | - | 6359 | 154 | - | -
Dexamethasone | Glucocorticoid medication | Steroid hormones | 1912 | 2422 | 6899 | 4863 | 816
Ibuprofen | Medication | Drugs | - | - | - | - | 1168
Insulin | Anabolic hormone | Peptide hormones | 218 | - | - | 162 | 2659
Iron | Chemical element | Metal ions | - | - | 11112 | - | -
Loratadine | Medication for allergies | Drugs | - | 1519 | - | - | 5765
Molybdenum | Chemical element | Metal ions | 688 | - | 4101 | 1064 | -
Nicotine | Chiral alkaloid | Common chemicals | - | - | 2638 | - | -
PFOA | Perfluorooctanoic acid | Common chemicals | - | 1261 | - | 230 | -
Phthalate | Phthalate esters | Common chemicals | 1099 | 687 | 1729 | - | 1491
Selenium | Chemical element | Metal ions | 6198 | 10019 | - | 2729 | 2076
Triclosan | Antibacterial and antifungal | Common chemicals | 1032 | - | - | 110 | 705
Tunicamycin | Antibiotics | Drugs | - | - | - | - | 8970
Vitamin A | Vitamin A | Dietary components | 2031 | 5966 | 1984 | 1156 | 1943
Vitamin B3 | B vitamins | Dietary components | - | - | 42 | - | -
Vitamin B5 | Pantothenic acid | Dietary components | - | - | 20 | 555 | -
Vitamin B6 | B vitamins | Dietary components | - | - | 15 | 617 | -
Vitamin D | Secosteroids | Dietary components | 1149 | - | 15314 | 1370 | 2957
Vitamin E | Vitamin E | Dietary components | - | - | 2718 | - | 3821
Vitamin H | B vitamins | Dietary components | - | - | 18 | - | -
Zinc | Chemical element | Metal ions | - | - | 1205 | - | -
CO2 | Control environment | Control | 0 | 0 | 0 | 0 | 0
CO3 | Control environment | Control | - | - | - | - | 0
Table 1: Number of differentially expressed genes after stimulation by different environments reported by
Moyerbrailean et al. (using the FDR10 strategy).
Environment | Description | Category | #DEG LCL | #DEG IPSC | #DEG CM
Acetaminophen | Paracetamol | Drugs | 1729 | 219 | 356
Acetylcholine | Neurotransmitters | Peptide hormones and neurotransmitters | 272 | 0 | 0
Acrylamide | Acrylic amide | Common chemicals | 380 | 8 | 0
Aldosterone | Steroid hormone | Steroid hormones | 444 | 2 | 276
Aspirin | Acetylsalicylic acid | Drugs | 37 | 12 | 2
BHA | Butylated hydroxyanisole | Common chemicals | 0 | 17 | 4
BP-3 | Oxybenzone | Common chemicals | 149 | 10 | 10
BPA | Bisphenol A | Common chemicals | 0 | 0 | 0
Cadmium | Chemical element | Metal ions | 113 | 0 | 13
Caffeine | Consumed psychoactive drug | Dietary components | 4496 | 4086 | 3730
Cetirizine | Medication used to treat allergies | Drugs | 3 | 0 | 0
Copper | Chemical element | Metal ions | 10679 | 2826 | 1005
Dexamethasone | Glucocorticoid medication | Steroid hormones | 5039 | 28 | 1729
Ibuprofen | Medication | Drugs | 14 | 22 | 15
Insulin | Anabolic hormone | Peptide hormones and neurotransmitters | 662 | 0 | 2490
Loratadine | Medication used to treat allergies | Drugs | 10 | 14 | 0
Nicotine | Chiral alkaloid | Common chemicals | 4 | 59 | 745
PFOA | Perfluorooctanoic acid | Common chemicals | 454 | 0 | 12
Phthalate | Phthalate esters | Common chemicals | 4 | 0 | 0
Selenium | Chemical element | Metal ions | 7505 | 5865 | 14
Triclosan | Antibacterial and antifungal | Common chemicals | 491 | 32 | 333
Vasopressin | Hormone | Common chemicals | 644 | 6 | 5
VitaminA | Vitamin A | Dietary components | 3100 | 654 | 1656
VitaminB5 | Pantothenic acid | Dietary components | 0 | 14 | 1
VitaminB6 | B vitamins | Dietary components | 6 | 4 | 3
VitaminD | Secosteroids | Dietary components | 536 | 18 | 25
VitaminE | Vitamin E | Dietary components | 112 | 9 | 12
Zinc | Chemical element | Metal ions | 4059 | 2463 | 35
Table 2: Number of differentially expressed genes after stimulation by different environments reported by
Findley et al. (using the FDR10 strategy).
Environment | Category | #DEG Fat | #DEG Liver | #DEG Muscle
Adiponectin | Adipokine | 462 | 1014 | 174
Atorvastatin | Drug (LDL-lowering drug) | 336 | 453 | 67
Decanoyl-l-carnitine | Fatty-acid metabolism | 31 | 121 | 1907
Dexamethasone | Inflammation | 3421 | 1007 | 1114
Glucose | Glucose and insulin metabolism | 724 | 2578 | 1590
IBMX | Other (cAMP phosphodiesterase inhibitor) | 1149 | 1202 | 1660
IGF1 | Glucose and insulin metabolism | 1468 | 738 | 927
IL-6 | Inflammation | 2911 | 123 | 114
Insulin | Glucose and insulin metabolism | 1868 | 1450 | 2112
Isoprenaline | Other (β adrenoreceptor agonist) | 499 | 29 | 1771
Lauroyl-l-carnitine | Fatty-acid metabolism | 206 | 746 | 172
Leptin | Adipokine | 1177 | 24 | 214
Metformin | Drug (Anti-diabetic drug) | 376 | 357 | 183
Retinoic acid | Other (Vitamin A metabolite) | 968 | 494 | 1312
Rosiglitazone | Drug (Anti-diabetic drug) | 1924 | 405 | 250
SB203580 | Kinase inhibitors (p38 inhibitor) | 684 | 1591 | 1847
SP600125 | Kinase inhibitors (JNK inhibitor) | 744 | 928 | 1233
TGF-B1 | Inflammation | 864 | 735 | 293
TNF-a | Inflammation | 2009 | 10 | 381
U0126 | Kinase inhibitors (MEK1/MEK2 inhibitor) | 986 | 1060 | 944
Wortmannin | Kinase inhibitors (PI3K inhibitor) | 672 | 1549 | 898
Table 3: Number of differentially expressed genes after stimulation by different environments reported by
Balliu et al. (using the FDR05 strategy).
2-2 Stratified LD score regression
Stratified LD score regression (S-LDSC) is a method to partition the polygenic signal of human
heritable traits across functional annotations using GWAS summary statistics [9,10]. It uses
information from all the tested SNPs, and not only the significant ones.
S-LDSC considers a model where the phenotypes $y = (y_1, ..., y_N)$ of N individuals have mean
0 and variance 1 and can be written as:
$$y = X\beta + \varepsilon$$
where $\beta = (\beta_1, ..., \beta_M)$ is a vector of effect sizes for M genetic variants, X is an
$N \times M$ matrix of standardized genotypes, and $\varepsilon = (\varepsilon_1, ..., \varepsilon_N)$ is a vector of
residuals with mean 0 and variance $\sigma_e^2$. S-LDSC also assumes that $\beta$ is a mean-0 vector, whose
variance (also called per-SNP heritability) depends on C continuous-valued annotations:
$$\mathrm{var}(\beta_j) = \sum_c a_c(j)\,\tau_c$$
where $a_c(j)$ is the annotation of SNP j for annotation c, and $\tau_c$ is a coefficient denoting the
conditional contribution of annotation $a_c$ to expected per-SNP heritability. Under this model,
$$E[\chi_j^2] = N \sum_c \tau_c\, l(j,c) + 1$$
where $l(j,c) = \sum_k a_c(k)\, r_{jk}^2$ is the LD score of SNP j with respect to annotation c and $r_{jk}$ is the
correlation between SNPs j and k [9,10]. Given a vector of $\chi_j^2$ statistics and LD scores computed
from a reference sample (here Europeans from the 1000 Genomes project [13]), this equation
allows us to obtain estimates of $\tau_c$, and then the contribution of a functional annotation
(conditioned to the other ones) to a trait. S-LDSC estimates $\tau_c$'s standard error using a block
jackknife and its corresponding P-value from a z score.
Here, we ran S-LDSC using default settings [10,11]. We considered summary statistics
from 134 GWAS.
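The regression step can be illustrated with a toy example. The sketch below is not the LDSC software and is not the analysis run in the thesis; it only shows how coefficients could be recovered from the relation between expected chi-square statistics and LD scores. It makes several simplifying assumptions that are not in the text: two annotations only, an intercept fixed at 1, unweighted least squares, simulated Gaussian noise, equal-sized contiguous jackknife blocks, and a two-sided P-value. All function names are hypothetical.

```python
import math
import random

def fit_tau(chi2, ld, n):
    """Least-squares fit of (chi2_j - 1) / N = tau_1 * l(j,1) + tau_2 * l(j,2)."""
    y = [(c - 1.0) / n for c in chi2]
    s11 = sum(a * a for a, _ in ld)
    s12 = sum(a * b for a, b in ld)
    s22 = sum(b * b for _, b in ld)
    t1 = sum(a * v for (a, _), v in zip(ld, y))
    t2 = sum(b * v for (_, b), v in zip(ld, y))
    det = s11 * s22 - s12 * s12
    # Solve the 2x2 normal equations with Cramer's rule.
    return ((t1 * s22 - t2 * s12) / det, (t2 * s11 - t1 * s12) / det)

def block_jackknife_se(chi2, ld, n, n_blocks=20):
    """Delete-one-block jackknife standard errors for the two coefficients."""
    m = len(chi2)
    bounds = [round(i * m / n_blocks) for i in range(n_blocks + 1)]
    estimates = []
    for b in range(n_blocks):
        lo, hi = bounds[b], bounds[b + 1]
        estimates.append(fit_tau(chi2[:lo] + chi2[hi:], ld[:lo] + ld[hi:], n))
    ses = []
    for c in range(2):
        vals = [e[c] for e in estimates]
        mean = sum(vals) / n_blocks
        var = (n_blocks - 1) / n_blocks * sum((v - mean) ** 2 for v in vals)
        ses.append(math.sqrt(var))
    return tuple(ses)

def two_sided_p(z):
    """Two-sided P-value of a z score under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Toy data: 5,000 SNPs, simulated LD scores, and known coefficients.
random.seed(0)
N = 50_000
TRUE_TAU = (2e-6, 5e-6)
ld = [(random.uniform(1, 200), random.uniform(0, 50)) for _ in range(5000)]
chi2 = [N * (TRUE_TAU[0] * a + TRUE_TAU[1] * b) + 1 + random.gauss(0, 1)
        for a, b in ld]

tau_hat = fit_tau(chi2, ld, N)
se_hat = block_jackknife_se(chi2, ld, N)
p_values = tuple(two_sided_p(t / s) for t, s in zip(tau_hat, se_hat))
```

In this toy setting the estimates in `tau_hat` should fall close to `TRUE_TAU`. Real analyses additionally use regression weights, an estimated intercept, and many more annotations.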
2-3 Creating gene-set annotations
We investigated the contribution of a “stimulated gene set” to the heritability of a trait, by
creating a SNP-annotation derived from genes differentially expressed, and analyzed this
SNP-annotation using S-LDSC. As the optimal way of constructing such a SNP-annotation is
unknown, we investigated a total of four different strategies to create SNP-annotations from a
given gene set.
First, we considered the differentially expressed genes detected using the FDR05 and
FDR10 strategies.
Second, to overcome the challenge of linking SNPs to genes, we investigated two
approaches to annotate SNPs for a given gene. First, we assigned every SNP in the gene-body and
in a 100kb surrounding region to include regulatory elements, as usually performed in S-LDSC
analyses [11]. We labeled this approach 100kb. Second, we assigned every functional SNP using
a combined SNP-to-gene strategy [12]. Briefly, this strategy assigns to each gene the SNPs that
are in its exons, promoters, fine-mapped eQTLs and putative enhancers. We labeled this approach
cS2G.
We thus ended up with four strategies, labeled FDR05.100kb, FDR05.cS2G,
FDR10.100kb, and FDR10.cS2G.
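The 100kb approach can be sketched as a simple window test. The following is an illustrative sketch only, with hypothetical `Gene`, `Snp`, and `annotate_100kb` names and an assumed inclusive window of 100kb on each side of the gene body; it uses a naive loop rather than an interval index. The cS2G approach would instead require a precomputed SNP-to-gene lookup table, which is not shown here.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # gene-body start (bp)
    end: int    # gene-body end (bp)

class Snp(NamedTuple):
    rsid: str
    chrom: str
    pos: int    # position (bp)

WINDOW = 100_000  # assumed: 100kb on each side of the gene body

def annotate_100kb(snps, genes, gene_set, window=WINDOW):
    """Return {rsid: 1 or 0}: 1 if the SNP lies in the gene body +/- window
    of at least one gene of gene_set, 0 otherwise."""
    selected = [g for g in genes if g.name in gene_set]
    annotation = {}
    for s in snps:
        annotation[s.rsid] = int(any(
            g.chrom == s.chrom and g.start - window <= s.pos <= g.end + window
            for g in selected
        ))
    return annotation
```

The resulting binary vector plays the role of the annotation of each SNP for the stimulated gene set in the S-LDSC model.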
Each gene set annotation was analyzed jointly with a “baseline” model of annotations
(i.e. a set of 52 functional annotations capturing known important regions of the genome), and an
annotation for all the genes (linked to SNPs using the 100kb or the cS2G approach). We tested
the significance of the association between the gene set annotation and a trait by testing if its
corresponding τ coefficient was significantly different from 0 (see above for more details).
Chapter 3: Result
3-1 Analyses of drug-disease matched pairs
We first validated that SNP-annotations derived from functional datasets of stimulated cells can
be leveraged by S-LDSC to detect relationships between environment and diseases. Specifically,
we tested if SNPs linked to genes differentially expressed after stimulation by a drug were
enriched in heritability using GWAS of the targeted traits. Here, we paired inflammatory drugs
with inflammatory diseases, antihistamine drugs with respiratory and allergic diseases, insulin
drugs with diabetes, and a statin with cardiovascular diseases.
3-1-a/ Inflammatory drugs and inflammatory diseases
We considered 6 inflammatory drugs and environments (Acetaminophen, Aspirin,
Dexamethasone, Ibuprofen, IL-6, TNFa) and 15 inflammatory diseases (Asthma, Asthma child
onset, Celiac disease, Crohn's disease, Eczema, Hypothyroidism, Lupus, Multiple sclerosis,
Primary biliary cirrhosis, Psoriasis, Respiratory diseases, Rheumatoid arthritis, Thyroid diseases,
Type 1 Diabetes, Ulcerative colitis) (90 unique pairs across 12 different cell-type experiments).
We found 27 pairs significant at FDR 5% (Table 4). The most significant association was
between Aspirin and Hypothyroidism (P = 8.32 x 10^-5). Interestingly, we observed that Aspirin
and Dexamethasone were significantly associated with 8 and 10 diseases, respectively. We found
significant relationships for 4 out of 6 environments (all but Ibuprofen and IL-6) and 12 out of 15
traits (all but Asthma child onset, Lupus, and Primary biliary cirrhosis).
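Significance "at FDR 5%" requires converting the S-LDSC P-values into FDR values. The text does not state which FDR procedure was used; the sketch below assumes the Benjamini-Hochberg procedure, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q
```

Under this assumption, a pair would be reported as significant at FDR 5% when its q-value is at most 0.05.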
Environment | Cell-type | Disease | Method | P-value (FDR) | FDR05.cS2G P-value
Acetaminophen | LCL | Crohn's disease | FDR05.cS2G | 1.42E-03 (0.04) | 1.42E-03
Acetaminophen | PBMC | Ulcerative colitis | FDR05.cS2G | 1.62E-03 (0.04) | 1.62E-03
Acetaminophen | PBMC | Hypothyroidism | FDR05.100kb | 7.23E-04 (0.03) | 0.01
Acetaminophen | PBMC | Thyroid diseases | FDR05.100kb | 1.04E-03 (0.03) | 0.02
Aspirin | PBMC | Crohn's disease | FDR10.100kb | 1.88E-03 (0.04) | 0.07
Aspirin | PBMC | Celiac | FDR10.100kb | 9.33E-05 (0.03) | 4.28E-03
Aspirin | PBMC | Multiple sclerosis | FDR05.cS2G | 2.05E-03 (0.04) | 2.05E-03
Aspirin | PBMC | Eczema | FDR10.100kb | 2.57E-04 (0.03) | 2.65E-03
Aspirin | PBMC | Asthma | FDR10.100kb | 1.76E-03 (0.04) | 0.03
Aspirin | PBMC | Hypothyroidism | FDR05.100kb | 8.32E-05 (0.03) | 3.87E-04
Aspirin | PBMC | Respiratory diseases | FDR10.100kb | 1.50E-03 (0.04) | 0.03
Aspirin | PBMC | Thyroid diseases | FDR05.100kb | 2.62E-04 (0.03) | 1.78E-03
Dexamethasone | Liver | Ulcerative colitis | FDR10.cS2G | 1.81E-03 (0.04) | 0.01
Dexamethasone | LCL | Crohn's disease | FDR10.cS2G | 1.56E-03 (0.04) | 1.72E-03
Dexamethasone | LCL | Celiac | FDR10.cS2G | 1.36E-04 (0.03) | 2.27E-04
Dexamethasone | LCL | Multiple sclerosis | FDR10.cS2G | 1.00E-04 (0.03) | 1.13E-04
Dexamethasone | LCL | Type 1 Diabetes | FDR10.cS2G | 8.36E-04 (0.03) | 9.18E-04
Dexamethasone | LCL | Eczema | FDR05.100kb | 1.77E-03 (0.04) | 5.27E-03
Dexamethasone | LCL | Hypothyroidism | FDR05.100kb | 6.05E-04 (0.03) | 7.75E-04
Dexamethasone | LCL | Thyroid diseases | FDR05.100kb | 3.73E-04 (0.03) | 6.08E-04
Dexamethasone | PBMC | Crohn's disease | FDR05.cS2G | 1.72E-03 (0.04) | 1.72E-03
Dexamethasone | PBMC | Multiple sclerosis | FDR05.100kb | 1.52E-03 (0.04) | 2.84E-03
Dexamethasone | PBMC | Rheumatoid Arthritis | FDR10.cS2G | 7.13E-04 (0.03) | 7.61E-04
Dexamethasone | PBMC | Hypothyroidism | FDR05.cS2G | 1.01E-03 (0.03) | 1.01E-03
Dexamethasone | PBMC | Psoriasis | FDR05.cS2G | 4.07E-04 (0.03) | 4.07E-04
Dexamethasone | PBMC | Thyroid diseases | FDR05.cS2G | 5.59E-04 (0.03) | 5.59E-04
TNFa | Fat | Psoriasis | FDR05.cS2G | 4.04E-04 (0.03) | 4.04E-04
Table 4: S-LDSC results for inflammatory drugs in inflammatory diseases. We report significant S-LDSC results (τ
P-value significant at the FDR 5% level) for SNP-annotations based on genes differentially expressed after stimulation to
inflammatory drugs in inflammatory diseases.
3-1-b/ Antihistamine and respiratory and allergic diseases
We next considered 2 antihistaminic drugs (Cetirizine and Loratadine) and 4 respiratory and
allergic diseases (Asthma, Asthma child onset, Eczema, and Respiratory diseases) (8 unique
pairs). Notably, all 8 pairs were significant at FDR 5% (Table 5). The most
significant association was between Loratadine and Respiratory diseases (P = 1.88 x 10^-4).
Environment | Cell-type | Disease | Method | P-value (FDR) | FDR05.cS2G P-value
Cetirizine | LCL | Asthma | FDR10.100kb | 0.01 (0.04) | 0.02
Cetirizine | LCL | Asthma child onset | FDR05.100kb | 3.39E-03 (0.03) | 3.90E-03
Cetirizine | LCL | Eczema | FDR10.100kb | 0.02 (0.05) | 0.12
Cetirizine | LCL | Respiratory diseases | FDR05.cS2G | 4.32E-03 (0.03) | 4.32E-03
Loratadine | LCL | Asthma | FDR10.cS2G | 1.48E-03 (0.02) | 0.04
Loratadine | LCL | Asthma child onset | FDR10.100kb | 0.01 (0.04) | 0.18
Loratadine | LCL | Eczema | FDR10.cS2G | 3.05E-03 (0.03) | 0.11
Loratadine | LCL | Respiratory diseases | FDR10.100kb | 1.88E-04 (6.04E-03) | 0.01
Table 5: S-LDSC results for Antihistamine and respiratory and allergic diseases. We report significant S-LDSC
results (τ P-value significant at the FDR 5% level) for SNP-annotations based on genes differentially expressed after stimulation
to antihistamine in respiratory and allergic diseases.
3-1-c/ Insulin drugs and diabetes
We next considered 5 insulin related drugs (Glucose, IGF1, Insulin, Metformin, Rosiglitazone)
and 3 diabetes related diseases from 4 GWAS datasets (Fasting Glucose, HbA1C, Type 2
diabetes, Type 2 diabetes (UKBB)) (20 unique pairs). We found 10 pairs to be significant at P <
0.05 (Table 6), but no pairs significant at FDR 5%. The most significant association was between
IGF1 and Type 2 diabetes (P = 0.006).
Environment | Cell-type | Disease | Method | P-value (FDR) | FDR05.cS2G P-value
IGF1 | Fat | Type 2 diabetes | FDR10.cS2G | 0.04 (0.75) | 0.06
IGF1 | Liver | Type 2 diabetes (UKBB) | FDR10.100kb | 0.006 (0.75) | 0.04
Insulin | Liver | Fasting Glucose | FDR05.cS2G | 0.009 (0.75) | 0.009
Insulin | Liver | Type 2 diabetes (UKBB) | FDR05.cS2G | 0.02 (0.75) | 0.02
Insulin | CM | Type 2 diabetes (UKBB) | FDR10.cS2G | 0.05 (0.75) | 0.06
Insulin | LCL | HbA1C | FDR05.100kb | 0.04 (0.75) | 0.09
Insulin | Mel | Type 2 diabetes | FDR05.cS2G | 0.03 (0.75) | 0.03
Metformin | Fat | HbA1C | FDR10.cS2G | 0.02 (0.75) | NA
Rosiglitazone | Liver | HbA1C | FDR10.cS2G | 0.04 (0.75) | NA
Rosiglitazone | Liver | Type 2 diabetes (UKBB) | FDR10.cS2G | 0.05 (0.75) | NA
CM: cardiomyocytes; Mel: melanocytes.
Table 6: S-LDSC results for insulin drugs and diabetes. We report significant S-LDSC results (τ P-value significant at
P < 0.05; none of them significant at the FDR 5% level) for annotations based on genes differentially expressed after stimulation
to insulin drugs in diabetes. NA indicates that annotations were only created when using the FDR 10% threshold.
3-1-d/ Statin and cardiovascular traits
Finally, we considered a statin (Atorvastatin) and 11 cardiovascular traits from 12 GWAS datasets
(Atherosclerosis, Cardioembolic stroke, Cardiovascular disease, Coronary artery disease, HDL,
HDL cholesterol, Ischemic stroke, Large artery stroke, LDL direct, Small vessel stroke, Stroke)
(24 unique pairs). We found one pair to be significant at P < 0.05 (Atorvastatin and Large artery
stroke, P = 0.02; Table 7), but it was not significant at FDR 5%.
Environment | Cell-type | Disease | Method | P-value (FDR) | FDR05.cS2G P-value
Atorvastatin | Liver | Large artery stroke | FDR10.cS2G | 0.02 | 0.09
Table 7: S-LDSC results for statin and cardiovascular traits. We report significant S-LDSC results (τ P-value
significant at P < 0.05; none of them significant at the FDR 5% level) for annotations based on genes differentially expressed
after stimulation to statin in cardiovascular traits.
3-1-e/ Conclusion
Overall, we found convincing results for inflammatory drugs with inflammatory diseases, and
very convincing results for antihistamine drugs with respiratory and allergic diseases, while
results for Insulin drugs with diabetes, and for Statin with cardiovascular traits were
inconclusive. We hypothesize that the convincing results from the first two experiments are due to
the use of stimulation in disease-relevant cell-types (i.e. immune cell-types), while the negative
results for the last two experiments are due to the use of stimulation in non-disease-relevant
cell-types. Performing insulin and statin stimulations on pancreas and heart cell-types would help
to confirm this hypothesis. To conclude, our approach seems relevant when stimulation is
performed on disease-relevant cell-types.
3-2 Choice of the methods
We next investigated which of the 4 strategies of creating SNP-annotations had the smallest false
positive rate and the smallest P-values for a gold standard list of environment-disease pairs.
First, we checked the false positive rate of the 100kb and cS2G linking strategies of the
top 5% and top 10% of genes differentially expressed using control environments (CO2 and
CO3) on 134 GWAS datasets. We observed that the cS2G linking strategy tends to have smaller
P-values than the 100kb linking strategy (4% and 5% of the P-values were below 5% when
considering annotations built from the top 5% and top 10% of genes differentially expressed,
respectively) (Figure 1).
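The calibration check described here amounts to computing the fraction of control P-values falling below the nominal threshold. A minimal sketch, with a hypothetical function name and a strict inequality chosen as an assumption:

```python
def fraction_below(p_values, alpha=0.05):
    """Fraction of P-values strictly below alpha."""
    if not p_values:
        raise ValueError("p_values must not be empty")
    return sum(p < alpha for p in p_values) / len(p_values)
```

For a well-calibrated test applied to control environments, this fraction is expected to be close to `alpha` (here 5%).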
Figure 1: S-LDSC results for control environments. We report boxplots of S-LDSC results (τ P-value on the -log10
scale) for annotations based on genes differentially expressed after stimulation to control environments on 134 GWAS datasets
for 4 different strategies of creating annotations. Dashed red line represents P = 0.05.
Second, we constructed from our previous analyses a gold standard list of
environment-disease pairs, which includes the 27 FDR 5% significant inflammatory drug and
inflammatory disease pairs, and the 8 FDR 5% significant antihistamine and respiratory and
allergic disease pairs (35 pairs total). We observed that selecting differentially expressed genes
using an FDR 5% approach and linking SNPs to these genes using the cS2G map led to the
smallest P-values on average (Figure 2).
Figure 2: S-LDSC results for a gold standard list of 35 environment-disease pairs. We report boxplots of S-LDSC
results (τ P-value on the -log10 scale) for a gold standard list of 35 environment-disease pairs.
Based on results from Figures 1 and 2, we concluded that creating differentially expressed
gene sets using the FDR 5% strategy and linking SNPs to these genes using the cS2G linking
strategy provided optimal results (see individual results for FDR05.cS2G in all the highlighted
pairs in Tables 4-7).
3-3 Detecting new candidate environment-disease relationships
We then applied our selected strategy on 149 SNP-annotations and 134 GWAS datasets (19,966
total tests). We report the top 10 most significant results in Table 8, including only 2 pairs
significant at FDR 5%. The most significant pair reports an unknown relationship between
Decanoyl-l-carnitine (an acylcarnitine compound that is used to transport fatty acids) and
height (P = 5.98 x 10^-7). The second most significant result linked zinc to multiple sclerosis (P =
3.92 x 10^-6). Interestingly, zinc has been widely studied for its role in multiple sclerosis [14].
Similarly, we found a strong association between zinc and Crohn's disease, which is also
supported by the literature [15].
Environment | Cell-type | Trait | P-value
Decanoyl-l-carnitine | Muscle | Height | 5.98E-07
Zinc | LCL | Multiple sclerosis | 3.92E-06
Ibuprofen | Mel | Schizophrenia vs. Bipolar disorder | 8.03E-06
Acrylamide | LCL | Blood lymphocyte | 1.71E-05
Cadmium | LCL | Multiple sclerosis | 2.93E-05
Aldosterone | LCL | Multiple sclerosis | 3.16E-05
Acetaminophen | Mel | BMI | 3.54E-05
Zinc | LCL | Crohn's disease | 3.81E-05
BPA | LCL | Respiratory disease | 4.08E-05
Vitamin A | CM | Atrial fibrillation | 4.74E-05
CM: cardiomyocytes; Mel: melanocytes.
Table 8: Top 10 environment-trait pairs highlighted by S-LDSC.
Out of the other significant results, we highlight the link between respiratory diseases and
BPA (Bisphenol A), a chemical compound hypothesized to be an endocrine disruptor, and which
has been reported to impact lung inflammation [16]. We also highlight the link between
Ibuprofen and psychiatric traits, as anti-inflammatory drugs have a controversial impact on the
central nervous system [17,18].
Chapter 4: Discussion
Here, we developed an approach to identify environment-disease relationships by integrating
genes differentially expressed in vitro after exposure to a stimulant (providing
environment-genome relationships) with GWAS summary statistics (providing genome-disease
relationships). Our approach combined S-LDSC with SNP-annotations constructed from genes
differentially expressed at FDR 5% and linked to SNPs using the cS2G strategy. We validated this
approach using results for inflammatory drugs with inflammatory diseases and for antihistamine
drugs with respiratory and allergic diseases. When applied to 149 SNP-annotations and 134
GWAS datasets, it highlighted both new and known relationships.
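To make the annotation step concrete, here is a minimal sketch of how a binary SNP-annotation could be derived from a differentially expressed gene set. The gene names, SNP identifiers, q-values, and the one-gene-per-SNP link table are all hypothetical stand-ins; the actual cS2G links and the annotation format passed to S-LDSC may differ.

```python
FDR_THRESHOLD = 0.05  # genes differentially expressed at FDR 5%

# Hypothetical differential-expression q-values per gene (made-up numbers).
de_qvalues = {"GENE_A": 0.001, "GENE_B": 0.20, "GENE_C": 0.04}

# Hypothetical SNP-to-gene links standing in for a linking strategy such as
# cS2G; None marks a SNP that is not linked to any gene.
snp_to_gene = {
    "snp_1": "GENE_A",
    "snp_2": "GENE_B",
    "snp_3": "GENE_C",
    "snp_4": None,
}


def build_annotation(qvalues, links, threshold):
    """Return {snp: 1 if its linked gene passes the FDR threshold, else 0}."""
    de_genes = {gene for gene, q in qvalues.items() if q <= threshold}
    return {snp: int(gene in de_genes) for snp, gene in links.items()}


annotation = build_annotation(de_qvalues, snp_to_gene, FDR_THRESHOLD)
print(annotation)
```

A binary indicator is the simplest choice for illustration; a linking strategy that assigns scores to SNP-gene pairs could instead yield a continuous-valued annotation.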
However, we note some limitations and routes for future analyses. First, we observed that
it is critical to perform stimulation in cell-types relevant to the environment and/or the disease.
Indeed, we were not able to observe significant associations between insulin drugs and diabetes,
or between statins and cardiovascular traits. However, the relevant cell-type is sometimes
unknown and sometimes hard to stimulate (for example, brain cell-types). Thus, our approach
might be limited to cell-types that are easy to stimulate. Second, our approach provides unsigned
associations. For example, an association between a drug and a disease could mean either that the
drug protects against the disease, or that the drug has a side effect that increases disease risk.
Developing new signed methods is necessary to answer such questions. A recent study proposed
correlating signed transcriptome-wide association study (TWAS) results with signed differential
expression [19]; a natural next step would be to assess this approach using our framework.
Finally, genes responding to an environment in vitro are not necessarily the ones that respond to
this environment in vivo. More fundamental research is needed to characterize those differences.
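The signed idea mentioned above can be illustrated with a small sketch: for genes present in both datasets, correlate signed TWAS z-scores with signed differential-expression effects. All gene names and values below are invented for illustration, and the use of a plain Pearson correlation is an assumption; the cited study may use a different statistic.

```python
import math

# Hypothetical signed TWAS z-scores (positive: higher expression, higher risk).
twas_z = {"GENE_A": 3.1, "GENE_B": -2.0, "GENE_C": 0.5, "GENE_D": -1.2}

# Hypothetical signed log2 fold changes after stimulation (made-up numbers).
de_log2fc = {"GENE_A": -1.5, "GENE_B": 0.9, "GENE_C": -0.1, "GENE_E": 2.0}


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Restrict to genes measured in both datasets, in a fixed order.
shared = sorted(set(twas_z) & set(de_log2fc))
r = pearson([twas_z[g] for g in shared], [de_log2fc[g] for g in shared])
print(shared, r)
```

In this toy example the correlation is negative, meaning the stimulant shifts expression in the direction opposite to the one associated with higher risk. Any protective or harmful interpretation would still require a proper significance test and a relevant cell-type.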
Despite these limitations, our results suggest the potential advantages of integrating genes
differentially expressed in vitro by a stimulant with GWAS summary statistics by using S-LDSC
to detect new relationships between an environment and a disease.
References
1. Moyerbrailean GA, Richards AL, Kurtz D, Kalita CA, Davis GO, Harvey CT, et al.
High-throughput allele-specific expression across 250 environmental conditions. Genome
Res. 2016;26: 1627–1638.
2. Findley AS, Monziani A, Richards AL, Rhodes K, Ward MC, Kalita CA, et al. Functional
dynamic genetic effects on gene regulation are specific to particular cell types and
environmental conditions. Elife. 2021;10. doi:10.7554/eLife.67077
3. Barreiro LB, Tailleux L, Pai AA, Gicquel B, Marioni JC, Gilad Y. Deciphering the genetic
architecture of variation in the immune response to Mycobacterium tuberculosis infection.
Proceedings of the National Academy of Sciences. 2012. pp. 1204–1209.
doi:10.1073/pnas.1115761109
4. Fairfax BP, Humburg P, Makino S, Naranbhai V, Wong D, Lau E, et al. Innate immune
activity conditions the effect of regulatory variants upon monocyte gene expression.
Science. 2014;343: 1246949.
5. Nédélec Y, Sanz J, Baharian G, Szpiech ZA, Pacis A, Dumaine A, et al. Genetic Ancestry
and Natural Selection Drive Population Differences in Immune Responses to Pathogens.
Cell. 2016;167: 657–669.e21.
6. Quach H, Rotival M, Pothlichet J, Loh Y-HE, Dannemann M, Zidane N, et al. Genetic
Adaptation and Neandertal Admixture Shaped the Immune System of Human Populations.
Cell. 2016;167: 643–656.e17.
7. Balliu B, Carcamo-Orive I, Gloudemans MJ, Nachun DC, Durrant MG, Gazal S, et al. An
integrated approach to identify environmental modulators of genetic risk factors for
complex traits. bioRxiv. 2021. p. 2021.02.23.432608. doi:10.1101/2021.02.23.432608
8. Lee MN, Ye C, Villani A-C, Raj T, Li W, Eisenhaure TM, et al. Common Genetic Variants
Modulate Pathogen-Sensing Responses in Human Dendritic Cells. Science. 2014. pp.
1246980–1246980. doi:10.1126/science.1246980
9. Finucane HK, Bulik-Sullivan B, Gusev A, Trynka G, Reshef Y, Loh P-R, et al. Partitioning
heritability by functional annotation using genome-wide association summary statistics. Nat
Genet. 2015;47: 1228–1235.
10. Gazal S, Finucane HK, Furlotte NA, Loh P-R, Palamara PF, Liu X, et al. Linkage
disequilibrium-dependent architecture of human complex traits shows action of negative
selection. Nat Genet. 2017;49: 1421–1427.
11. Finucane HK, Reshef YA, Anttila V, Slowikowski K, Gusev A, Byrnes A, et al. Heritability
enrichment of specifically expressed genes identifies disease-relevant tissues and cell types.
Nat Genet. 2018;50: 621–629.
12. Gazal S, Weissbrod O, Hormozdiari F, Dey K, Nasser J, Jagadeesh K, et al. Combining
SNP-to-gene linking strategies to pinpoint disease genes and assess disease omnigenicity.
medRxiv. 2021; 2021.08.02.21261488.
13. 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang
HM, et al. A global reference for human genetic variation. Nature. 2015;526: 68–74.
14. Bredholt M, Frederiksen JL. Zinc in Multiple Sclerosis: A Systematic Review and Meta-Analysis.
ASN Neuro. 2016;8. doi:10.1177/1759091416651511
15. Sturniolo GC, Di Leo V, Ferronato A, D’Odorico A, D’Incà R. Zinc supplementation
tightens “leaky gut” in Crohn’s disease. Inflamm Bowel Dis. 2001;7: 94–98.
16. Van Winkle LS, Murphy SR, Boetticher MV, VandeVoort CA. Fetal Exposure of Rhesus
Macaques to Bisphenol A Alters Cellular Development of the Conducting Airway by
Changing Epithelial Secretory Product Expression. Environ Health Perspect. 2013 [cited 17
Jan 2022]. Available: https://ehp.niehs.nih.gov/doi/abs/10.1289/ehp.1206064
17. Hoppmann RA, Peden JG, Ober SK. Central Nervous System Side Effects of Nonsteroidal
Anti-inflammatory Drugs: Aseptic Meningitis, Psychosis, and Cognitive Dysfunction. Arch
Intern Med. 1991;151: 1309–1313.
18. Sommer IE, van Westrhenen R, Begemann MJH, de Witte LD, Leucht S, Kahn RS. Efficacy
of Anti-inflammatory Agents to Improve Symptoms in Patients With Schizophrenia: An
Update. Schizophr Bull. 2013;40: 181–191.
19. Namba S, Konuma T, Wu K-H, Zhou W, Okada Y, Global Biobank Meta-analysis Initiative.
A practical guideline of genomics-driven drug discovery in the era of global biobank
meta-analysis. bioRxiv. 2021. doi:10.1101/2021.12.03.21267280
Abstract
Identifying gene-environment relationships can inform the biology of complex diseases. Recent studies have shown that stimulating cells by an environment (by a pathogen or drug exposure) in vitro leads to hundreds to thousands of differentially expressed genes. While it is often not possible to test if the exposure to an environment increases or decreases the risk of human diseases in vivo, developing methods that optimally link these differentially expressed genes to genome-wide association study (GWAS) results might help to investigate the impact of the environment on human diseases. However, it is still unclear whether this strategy would provide meaningful results, and what would be the best approach to connect GWAS results to genes responding to an environment, notably due to the challenge of linking GWAS causal variants to their target genes. Our project is to develop and apply methods leveraging gene expression of stimulated cells (providing environment-genome relationships) and summary statistics from GWAS (providing genome-disease relationships) to detect new environment-disease relationships. First, we curated differential expression datasets from three studies to build a catalog of genes differentially expressed after stimulation by more than 50 environments, including several drugs. Second, we developed different strategies to detect differentially expressed genes and to create SNP-annotations derived from those genes. Third, we applied stratified LD score regression with SNP-annotations built using genes differentially expressed by drugs to GWAS data from the drug-targeted traits; we leveraged these analyses to validate our approach and to define the best strategy to create gene-set annotations for stratified LD score regression. Finally, we applied our optimal strategy of building SNP-annotations for > 50 environments across > 100 human diseases and complex traits.
Our analyses highlight new environment-disease relationships, and open future research directions.
Asset Metadata
Creator: Wang, Ruowen (author)
Core Title: Leveraging functional datasets of stimulated cells to understand the relationship between environment and diseases
School: Keck School of Medicine
Degree: Master of Science
Degree Program: Biostatistics
Degree Conferral Date: 2022-05
Publication Date: 04/18/2022
Defense Date: 04/17/2022
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: annotations, Diseases, Environment, GWAS, OAI-PMH Harvest, S-LDSC, SNP, stimulated cells, traits
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Gazal, Steven (committee chair), Gauderman, William (committee member), Lewinger, Juan Pablo (committee member)
Creator Email: 596750892@qq.com, ruowenwa@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC111004386
Unique identifier: UC111004386
Document Type: Thesis
Rights: Wang, Ruowen
Type: texts
Source: 20220418-usctheses-batch-928 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu