PERINATAL EPIGENETIC AND GENETIC ANALYSES IN CHILDHOOD CANCERS
by
Shaobo Li
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(CANCER BIOLOGY AND GENOMICS)
May 2022
Copyright 2022 Shaobo Li
Table of Contents
List of Figures
List of Tables
Abstract
Chapter 1: Introduction
Chapter 2: Leveraging SNP array data to understand childhood cancer risk
2.1 Mitochondrial 1555 G>A variant as a potential risk factor for childhood glioblastoma
2.2 Localized variation in ancestral admixture identifies pilocytic astrocytoma risk loci among Latino children
Chapter 3: Investigating childhood carcinogenesis using epigenetic analysis
3.1 Epigenome-Wide Association Study of Acute Lymphoblastic Leukemia in Children with Down Syndrome
Chapter 4: Importance of accounting for interactions between genetics and epigenetics in EWAS models
4.1 Incorporation of DNA methylation quantitative trait loci (mQTLs) in epigenome-wide association analysis: application to birthweight
Chapter 5: Conclusion
References
List of Figures
2.1 Association analysis results for mitochondrial SNPs in (A) European subjects (B) Hispanic subjects (C) meta-analysis of both populations
2.2 Flowchart for data processing and analysis
2.3 Estimated ancestry proportions for query Latino subjects and reference samples
2.4 Distribution of ancestry proportions in Latino pilocytic and non-pilocytic astrocytoma cases and controls
2.5 Association plots of local European ancestry copies and risk of pilocytic astrocytoma in Latino subjects
2.6 Case-control association analyses between SNPs and pilocytic astrocytoma risk of CCRLP Latino, European subjects and meta-analysis of both results in regions of admixture mapping peaks
3.1 Heatmap showing the top 2000 most variable CpGs in DS-ALL cases and DS controls in the Discovery Study
3.2 Boxplots showing deconvoluted cell proportions in DS-ALL cases and DS controls in the Discovery and Replication datasets
3.3 Bi-directional Manhattan plot (A) and quantile-quantile (QQ) plot (B) for results of the EWAS of DS-ALL in the Discovery Study, including autosomal CpG probes only
3.4 Location of the top EWAS hit (cg27347265) in the DS-ALL Discovery Study
4.1 Distributions of scanned mQTLs in different datasets
4.2 Number and overlap of scanned mQTLs in all four datasets
4.3 Distributions of CpGs with matched mQTLs
4.4 Comparison of birthweight epigenome-wide association analysis results with or without controlling for mQTL (450K array)
4.5 Comparison of birthweight epigenome-wide association analysis results with or without controlling for mQTL (EPIC array)
4.6 Comparison of birthweight-associated DNA methylation sites with or without controlling for validated birthweight GWAS SNPs
4.7 Comparison of results from original and sensitivity EWAS models adding maternal weight gain as additional covariate
List of Tables
2.1 Clinical variables of glioma cases and controls
2.2 Clinical variables of glioma cases by subtype
2.3 Distributions of m1555 genotypes and odds ratios in glioma subtypes
2.4 Common macrogroup distribution in Europeans
2.5 Common macrogroup distribution in Hispanics
2.6 Haplogroups of subjects with m1555 A>G variant in controls and glioblastoma cases
2.7 Significant SNPs in nucleus mitochondrial association analysis
2.8 Demographic data of pediatric astrocytoma subjects
2.9 Description of Latino non-pilocytic astrocytoma cases and controls
2.10 Top SNPs from fine mapping analyses of admixture mapping peaks
2.11 Conditional analysis in admixture mapping peaks
3.1 Demographic and birth characteristics of DS-ALL cases and DS controls
3.2 Deconvoluted blood cell proportions in DS-ALL cases versus DS controls
3.3 Deconvoluted blood cell proportions in DS-ALL cases versus DS controls stratified by self-reported race/ethnicity in the DS-ALL Discovery Study
3.4 Deconvoluted B-cell proportions in DS-ALL cases versus DS controls adjusted by GWAS SNPs in the DS-ALL Discovery Study, overall and stratified by self-reported race/ethnicity
3.5 Significant differentially methylated probes associated with DS-ALL in the Discovery Study and results in the Replication Study
3.6 Significant GO pathways enriched for DS-ALL-associated DMPs
3.7 Differentially methylated regions associated with DS-ALL in the Discovery Study
4.1 CpGs sharing identical SNP or SNPs in LD as matched mQTL across different datasets
Abstract
Cancer is one of the leading causes of death among children in the USA (1); however, the
etiology of most childhood cancers remains unknown at the time of writing this thesis.
Given that somatic changes are less frequently associated with cancers in children than with
adult cancers, developmental abnormalities reflected by neonatal epigenetics and genetics are
likely to offer key insights into the mechanisms of childhood cancer etiology.
Consequently, they may also help to predict cancer risk, which will be valuable for targeted
follow-up, early treatment or even prevention.
In this thesis, I illustrate the use of high-throughput epigenetic and genetic data, as well as
information inferred from them, to identify cancer-related changes in children at birth, with the
help of a wide variety of bioinformatics tools. I focus on typical types of childhood cancer,
including childhood leukemia and childhood glioma, both in the general population and in
children with Down Syndrome.
Chapter 1: Introduction
The etiology of childhood cancers differs greatly from that of adult cancers, and their
early diagnosis and treatment are important for disease prevention and for improving disease
outcomes. As an example, the most prevalent pediatric cancer worldwide, acute lymphoblastic
leukemia (2), has a much higher survival rate when treated early (3). However, survivors are
often left with long-term health problems (4), including deficits in attention (5), working
memory (6) and verbal fluency (7), making disease prevention a priority. Perinatal examination
of epigenetic features and germline genetic factors can help to elucidate mechanisms of
pediatric carcinogenesis; such features are therefore valuable tools for investigating cancer
etiology and for identifying precancerous changes, potentially leading to preventative measures.
These features can also enable us to predict the risk of developing cancer at birth, through
signature genetic mutations or epigenetic alterations, and to provide high-risk children with
better follow-up. By studying the interactions between environmental factors (maternal smoking (8),
maternal pesticide exposure (9), maternal folate intake (10), birth weight (11), etc.) and
(epi)genetic changes, we can also better understand the pathways from demonstrated
pregnancy-related risk factors to disease.
DNA sequences are very similar (over 99.9%) between different human beings (12), and
most human heterogeneity, including disease risk, can be captured by single nucleotide
polymorphisms (SNPs). Genome-wide association analysis (GWAS) is the classical agnostic
method for identifying risk alleles associated with disease (ALL (13), for example) and other
traits (birth weight (14), for example). Polymorphic sites in the genome, however, have a much
wider use in biomedical research. For example, through comparison with reference
populations, we can partition ancestry into different portions (for example, the ancestry of
Latinos can be partitioned into European, Amerindian and African components (15)). Ancestry
composition can also help to address many biological questions, such as linkage of genomic
features that influence disease risk variance by ancestry. In this thesis, I present how GWAS-style
analysis in a multiethnic population can help to identify SNPs, and subsequently genes, that
contribute to childhood glioma risk. Additionally, using inferred genetic ancestry proportions, I
demonstrate that Latino children diagnosed with pilocytic astrocytoma tend to have a higher
proportion of European ancestry, consistent with epidemiological evidence.
DNA methylation, a reliable and robust measure of epigenetic variation
(16), can be used to understand carcinogenic processes. A common way to use DNA
methylation is in epigenome-wide association analysis (EWAS) models, which identify correlations
between CpGs and traits of interest. However, DNA methylation can also serve as a gateway to
understanding deeper and wider biological characteristics. For example, nucleated cell
proportions can be inferred from DNA methylation array data through cell deconvolution (17).
Cell proportions at birth can then be investigated for their correlations with disease risk and
other pathoclinical traits. Using DNA methylation data, I show in this thesis that an
elevated B cell proportion is associated with ALL risk in children with Down Syndrome.
Using EWAS models, I also show that DNA methylation changes at certain groups of CpGs are
associated with ALL risk in subjects with Down Syndrome.
Lastly, it is important to recognize that significant interactions exist between DNA
methylation and SNPs. Methylation quantitative trait loci (mQTLs) are SNPs
whose genotypes are associated with CpG methylation levels (18) (mainly in cis). I present an
EWAS model for birthweight with mQTLs as additional variables. In this model, key CpGs related
to birthweight changed substantially in both effect size and P value. This
shows the importance of recognizing the interactions between epigenetics and genetics when
fitting EWAS models.
In summary, leveraging the following datasets available to me during my PhD research,
this thesis shows that genetic and epigenetic data together are valuable tools for understanding
disease mechanisms, either through classical high-throughput regression models or through
characteristics inferred from these data.
Descriptions of major datasets
The California Childhood Leukemia Study (CCLS) is a population-based case-control
study seeking to better understand genetic and environmental risk factors for childhood
leukemia. The study was started in 1995 at the University of California, Berkeley (UCB) and involved
recruitment at multiple facilities. Case subjects are children diagnosed with ALL at 16
participating hospitals in California. The study also included control subjects identified from the
California birth registry as a comparison group. Several in utero chemical risk factors
have been identified and published by CCLS, including polychlorinated biphenyls,
organochlorine pesticides (19), and smoking (20), for which we explored epigenetic
mechanisms. Other pathoclinical variables were also available for a majority of CCLS subjects,
including gestational age, birthweight and age at dried blood spot (DBS) collection, which are also
associated with epigenetic features and must be accounted for statistically.
The California Cancer Records Linkage Project (CCRLP) is a data linkage and sample bank
resource. The California Department of Public Health (CDPH), Genetic Diseases Screening Branch
acquired blood spots from all newborns within California for both genetic screening and
research purposes, as described in a previous publication from our lab (21). Briefly, California
birth records maintained by CDPH were linked to cancer diagnosis data maintained by the
California Cancer Registry (CCR). For this thesis, case eligibility criteria for glioma included: [i]
histologic diagnosis of glioma (ICD-O-3 9380 to 9451) reported to the California Cancer Registry
between 1988 and 2011; [ii] under 20 years of age at diagnosis; and [iii] no previous diagnosis
of any other cancer.
The International Study of Down Syndrome Acute Leukemia (IS-DSAL) includes children with
Down Syndrome (DS) with and without acute leukemia (both acute lymphoblastic
leukemia (ALL) and acute myeloid leukemia (AML)). In this thesis, 145 DS-ALL cases and 198 DS
controls with DBS available were included, of whom 325 were born in California and 18 in
Washington State. California-born DS-ALL cases were identified from both CCRLP (N=109) and
CCLS (N=18) (22) (23). Washington DS-ALL cases (N=18) were identified in the Washington State
Childhood Cancer study using population-based linked birth-hospital discharge-cancer registry
records (24). DS controls without an ALL diagnosis by age 15 were obtained from the California
Biobank Program through linkage between the CDPH Genetic Disease Screening Program and
the CCR as described above.
Chapter 2: Leveraging SNP array data to understand childhood cancer risk
At the time of writing this thesis, the human race is learning to coexist with the
coronavirus disease 2019 (COVID-19). Multiple waves of cases were caused by different variants
of COVID-19. Humans have adapted to living in a pandemic, and a public health rule suggested
by the CDC is to stay 6 feet apart from each other to curb viral transmission. Interestingly, 6 feet is
also the length of the DNA packed into the 5 μm nucleus of each of our cells (25). This translates to
about 3.2 billion base pairs (26). Most of these base pairs are identical across
humanity, as they lay the genetic foundation of Homo sapiens; however, the 1000 Genomes
Project (27) estimated that 20 million base pairs (0.1% of the whole genome) are
heterogeneous throughout the population. Some of these variants affect susceptibility to
diseases; others are related to ancestral differences and to how we respond differently to
environmental perturbations. More than 99.9% of variants are single nucleotide polymorphisms
(SNPs) and short indels (27). Using commercially available DNA arrays, with the help of
up-to-date imputation technology (the TOPMed imputation server (28), for example), we can acquire
the DNA genotype information that is most distinctive in defining individual heterogeneity.
While the most classical way to leverage SNP array data is to conduct GWAS analysis,
other rich information can also be inferred from them. In this
chapter, I first present a classical GWAS-style analysis of mitochondrial variants in childhood
glioma and report potential SNPs that are associated with childhood glioma risk in 2.1. I then present
the manuscript “Localized variation in ancestral admixture identifies pilocytic astrocytoma risk
loci among Latino children” in 2.2, in which I describe using SNP data to infer the proportions of
different ancestries in Latino pilocytic astrocytoma (PA) cases and controls, and how European
ancestry is associated with an elevated PA risk.
2.1 Mitochondrial 1555 G>A variant as a potential risk factor for childhood glioblastoma¹
Shaobo Li², Xiaowu Gai³, Swe Swe Myint⁶, Katti Arroyo⁶, Libby Morimoto⁴, Catherine Metayer⁸,
Adam de Smith⁶, Kyle M. Walsh⁵, Joseph Wiemels⁶
¹ This manuscript has been submitted to Neuro-Oncology Advances. Published version could be modified per editors’ and
reviewers’ comments.
² Center for Genetic Epidemiology, Department of Population and Public Health Sciences, University of Southern California, Los
Angeles, California, USA
³ Center for Personalized Medicine, Children’s Hospital of Los Angeles, Los Angeles, California, USA
⁴ School of Public Health, University of California Berkeley, Berkeley, California, USA
⁵ Division of Neuro-epidemiology, Department of Neurosurgery, Duke University, Durham, North Carolina, USA
Abstract
Childhood glioblastoma multiforme (GBM) is a highly aggressive disease with low
survival, and its etiology, especially concerning germline genetic risk, is poorly understood.
Mitochondria play a key role in putative tumorigenic processes relating to cellular oxidative
metabolism, and mitochondrial DNA variants were not previously assessed for association with
pediatric brain tumor risk. We conducted an analysis of 675 mitochondrial DNA variants in 90
childhood GBM cases and 2,789 controls and identified an enrichment of m1555 A>G in GBM
patients (adjusted OR 29.30, P value 9.5 × 10⁻⁴). Haplotype analysis further supported the
independent risk contributed by m1555 G>A, instead of a haplogroup joint effect. Further
analysis of nuclear-encoded mitochondrial gene variants identified significant associations in
European (rs62036057 in WWOX, effect size = 1.10, P value = 3.42 × 10⁻⁶) and Hispanic
(rs111709726 in EFHD1, effect size = 1.82, P value = 1.41 × 10⁻⁶) populations in
ethnicity-stratified analyses. We reported for the first time a potential role played by a functional
mitochondrial ribosomal RNA variant in childhood GBM risk, and a potential role for both
mitochondrial and nuclear-mitochondrial DNA polymorphisms in GBM tumorigenesis. These
data implicate cellular oxidative metabolic capacity as a contributor to the etiology of pediatric
glioblastoma.
Introduction
Childhood glioblastoma multiforme (GBM) is a high-grade glioma, likely originating from
glial cells or neural precursor cells. Although it is a rare disease compared to adult GBM, only
accounting for 3%-15% of primary central nervous system (CNS) tumors in children (29), the
disease is very aggressive with poor prognosis. Surgical resection and chemo-radiotherapy
remain the standard treatment for childhood glioblastoma, and three-year survival is less than
20% (30); therefore, prevention is a worthwhile goal. Prevention hinges on an understanding of
the etiology of childhood GBM, which remains largely unknown. Etiologic research on pediatric
brain tumors has investigated environmental risk factors such as radiation, air pollution,
pesticides, and diet (31), but genetic risk factors that impact cellular metabolism and hence the
metabolism of xenobiotics have received little attention.
One of the primary organelles that could be involved in glioblastoma tumorigenesis is
the mitochondrion (mt), via the formation of damaging reactive oxygen species produced during
oxidative phosphorylation and ATP synthesis in biological processes such as inflammation,
immune response, and mitosis (32). The majority of proteins involved in mitochondrial
functions are encoded by nuclear genes (n-mt genes), of which there are approximately 1,300
in the mitochondrial pathway (33). However, mitochondria also maintain their own DNA, which is
crucial to cellular respiratory functions. The 16 kb human mitochondrial genome (mtDNA) is
double-stranded, circular, and maternally inherited, and encodes 37 genes including subunits of
respiratory complexes (I, III, IV and V), tRNAs, and rRNAs for mtRNA translation. Variation in mtDNA has
been causally linked to multiple diseases including MERRF syndrome, an encephalomyopathy
with a mix of neurological and myopathic symptoms (34). As expected, mtDNA diseases typically
affect organs that have a high demand for respiratory function, namely the brain and muscles.
Recent sequencing studies demonstrated that somatic alterations of mtDNA may be important
drivers of multiple childhood cancers including CNS tumors (35), with the highest rates of
variants found in high grade gliomas (36). These findings inspired the current investigation.
We explored whether germline mitochondrial DNA variants play a role in pediatric
glioma risk, using single nucleotide polymorphism (SNP) array data from a California registry-
based case-control study. We concentrated this analysis on the glioblastoma subtype given its
demonstrated high frequency of somatic mitochondrial variants in tumor cells (36).
Methods
Population: Subjects were derived from the California Childhood Cancer Records Linkage
Project (CCRLP), a case-control study described previously (37). Briefly, glioma cases were
diagnosed before 20 years of age between 1988 and 2011, and were born in California
between 1982 and 2009. Histologic subtypes of glioma included pilocytic astrocytoma (POL, ICD
HISTO-T3 code 9421), non-pilocytic astrocytoma (AST, ICD HISTO-T3 code > 9400 and < 9424,
excluding 9421), glioblastoma (GBM, ICD HISTO-T3 code = 9440) and oligodendroglioma (OLG,
ICD HISTO-T3 code = 9450 or 9451). California-born controls were cancer free and
individually matched 1:1 to cases on birthdate, gender, and maternal race/ethnicity through
the California Vital Statistics records (Tables 2.1 and 2.2). The final sample included 2780
glioma cases and 2789 controls. This study was approved by the State of California Committee
for the Protection of Human Subjects and by the University of Southern California and University
of California, Berkeley review boards.
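The histology-code rules above can be written as a small lookup. The following is a minimal sketch, not the pipeline used in the study: the function name `classify_glioma_subtype` and the choice to return `None` for codes that match no rule are assumptions made for illustration.

```python
from typing import Optional


def classify_glioma_subtype(histology_code: int) -> Optional[str]:
    """Map a histology code to the subtype labels used in the text.

    Hypothetical helper: the thresholds come from the Methods text,
    the name and return type are illustrative only.
    """
    if histology_code == 9421:
        return "POL"  # pilocytic astrocytoma
    if 9400 < histology_code < 9424:
        return "AST"  # non-pilocytic astrocytoma (9421 already handled above)
    if histology_code == 9440:
        return "GBM"  # glioblastoma
    if histology_code in (9450, 9451):
        return "OLG"  # oligodendroglioma
    return None  # no subtype assigned by these rules


for code in (9421, 9401, 9440, 9451, 9380):
    print(code, classify_glioma_subtype(code))
```

Checking the pilocytic code first keeps the "excluding 9421" condition out of the range test, which makes the rule order explicit.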
Table 2.1. Clinical variables of glioma cases and controls
Variable                              Controls (N=2789)     Cases (N=2780)        P-value
Sex
  Male                                1502                  1492                  0.89
  Female                              1287                  1288
Subtype
  Pilocytic astrocytoma (POL)¹        NA                    787                   NA
  Non-pilocytic astrocytoma (AST)²    NA                    592
  Glioblastoma (GBM)³                 NA                    90
  Oligodendroglioma (OLG)⁴            NA                    75
  Missing                             NA                    471
Self-reported race/ethnicity
  European                            1431                  1358                  0.05
  Hispanic                            1501                  1279
Gestational age (days)
  Mean (SD)                           278.6 (30.79)         279.0 (27.76)         0.73
  Median (range)                      278.0 (158.0-796.0)   279.0 (172.0-828.0)
  Missing                             845                   843
Birthweight (grams)
  Mean (SD)                           3416 (556.05)         3454 (563.66)         0.03
  Median (range)                      3459 (580-4960)       3487 (638-5840)
  Missing                             767                   765
¹ ICD HISTO-T3 code 9421
² ICD HISTO-T3 code > 9400, and <9424 and is not 9421
³ ICD HISTO-T3 code = 9440
⁴ ICD HISTO-T3 code = 9450 or 9451
Table 2.2. Clinical variables of glioma cases by subtype
Variable                                Pilocytic             Non-pilocytic         Glioblastoma          Oligodendroglioma
                                        astrocytoma (POL)     astrocytoma (AST)     (GBM)                 (OLG)
Sex
  Male                                  386                   323                   58                    44
  Female                                401                   269                   32                    31
Race/Ethnicity
  European                              455                   329                   46                    47
  Hispanic                              332                   263                   44                    28
Gestational age (days)
  Mean (SD)                             280.1 (29.46)         279.1 (29.39)         276.2 (19.64)         280.8 (20.62)
  Median (range)                        279.0 (175.0-828.0)   279.0 (172.0-815.0)   278.0 (172.0-322.0)   282.0 (187.0-350.0)
  Missing                               27                    24                    6                     1
Birthweight (g)
  Mean (SD)                             3465 (515.96)         3453 (583.90)         3498 (438.96)         3532 (648.50)
  Median (range)                        3465 (1247-5415)      3482 (737-5840)       3508 (2380-4593)      3487 (1389-4649)
Tumor site
  Retina                                0                     1                     0                     0
  Cerebral meninges                     0                     1                     0                     0
  Spinal meninges                       0                     1                     0                     0
  Cerebrum                              76                    85                    14                    9
  Frontal lobe                          12                    48                    15                    18
  Temporal lobe                         21                    92                    3                     20
  Parietal lobe                         10                    35                    11                    8
  Occipital lobe                        5                     10                    4                     2
  Ventricle NOS                         38                    26                    5                     2
  Cerebellum, NOS                       324                   50                    4                     3
  Brain stem                            99                    94                    11                    2
  Overlapping lesion of brain           35                    45                    15                    10
  Brain NOS                             89                    38                    4                     0
  Spinal cord                           31                    42                    4                     1
  Optic nerve                           40                    14                    0                     0
  Cranial nerve, NOS                    3                     7                     0                     0
  Overlapping lesion of brain and CNS   1                     0                     0                     0
  Pineal gland                          3                     3                     0                     0
Dried blood spot (DBS) and DNA genotyping: Archived DBS were obtained from the
California Biobank Program. DNA extraction, processing and genotyping were performed as
previously described (38). Briefly, DNA was extracted from one-third of a DBS using GenFind v3.0
(Beckman) reagents on an Eppendorf robot, and a minimum of 300 ng DNA (assessed with NanoDrop for
purity and PicoGreen for quantity) was genotyped on the Affymetrix Axiom
Precision Medicine Diversity Array (PMDA). Genotype calls were then extracted with Affymetrix
Power Tools. Polymorphisms were numbered according to the mitochondrial reference genome
GenBank: NC_012920.1.
mtDNA association study: Probes for a total of 675 mt variants were included on the
PMDA array, and downstream quality-control steps retained all variants. More specifically, we
checked whether any SNPs had more than 5% missing data or deviated significantly from
Hardy–Weinberg equilibrium (HWE P value < 10⁻⁴). PLINK 2 (39) was used to conduct case-control
association analyses. In the event that a logistic regression model failed to converge, a Firth
regression (40) was conducted. We controlled for 10 genetic principal components (calculated
with PLINK 2) to account for possible population stratification, which is especially relevant for the
admixed Hispanic population. Association analyses were conducted separately in self-reported
Hispanics (43 cases and 1501 controls) and Europeans (non-Hispanic whites; 44 cases and 1431
controls), and results from both ethnic groups were meta-analyzed using METAL (41). Results were
adjusted for multiple testing using Bonferroni correction based on the number of effective
comparisons after linkage disequilibrium (LD) pruning (Europeans: 267 variants, P < 1.88 × 10⁻⁴;
Latinos: 286 variants, P < 1.75 × 10⁻⁴; meta-analysis: 277 variants, P < 1.8 × 10⁻⁴).
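The Bonferroni thresholds quoted above follow from dividing a family-wise error rate by the number of LD-pruned variants. The sketch below assumes a family-wise alpha of 0.05, which the text does not state explicitly but which gives values close to the reported thresholds; the function name is illustrative.

```python
def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold under Bonferroni correction.

    alpha = 0.05 is an assumed family-wise error rate.
    """
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Numbers of variants remaining after LD pruning, as reported in the text.
pruned_counts = {"Europeans": 267, "Latinos": 286, "Meta-analysis": 277}

for label, n in pruned_counts.items():
    print(f"{label}: {n} tests -> P < {bonferroni_threshold(n):.2e}")
```

Counting only LD-pruned variants treats correlated SNPs as a single comparison, which is less conservative than dividing by all 675 genotyped variants.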
Nuclear-encoded mitochondrial gene (N-mt gene) set association analysis: Since most
mitochondrial proteins are encoded by genes located in the nucleus, we also conducted
candidate-variant association analyses in glioma cases and controls for SNPs located in regions
of N-mt genes. We retrieved the list of genes and their locations from the NCBI Gene database (42),
as specified by Gene Ontology ID GO:0005739 (33) (43) (44). SNPs located within the genomic
coordinates of genes in this GO category were included in N-mt SNP scanning. We identified 1,318
genes and a total of 19,119 SNPs in European subjects and 19,126 SNPs in Hispanic subjects.
We conducted the association analysis with PLINK 2, controlling for the same 10 ancestry-informative
PCs as in the mitochondrial SNP association analysis. Multiple-testing correction was again based
on the Bonferroni method, as in the mitochondrial analysis (Europeans: 7960 variants,
P < 6.28 × 10⁻⁶; Latinos: 9799 variants, P < 5.10 × 10⁻⁶; meta-analysis: 8880 variants,
P < 5.63 × 10⁻⁶).
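Selecting SNPs that fall within gene coordinates, as described above, amounts to an interval lookup. This is a minimal sketch under stated assumptions: the `Snp` record, the gene names and all coordinates are made up for illustration, and the actual analysis may have used a different tool or coordinate convention.

```python
from typing import Dict, List, NamedTuple, Tuple


class Snp(NamedTuple):
    snp_id: str
    chrom: str
    pos: int


# Hypothetical gene intervals: gene -> (chromosome, start, end), inclusive.
GeneRegions = Dict[str, Tuple[str, int, int]]


def snps_in_genes(snps: List[Snp], genes: GeneRegions) -> Dict[str, List[str]]:
    """Return, for each gene, the IDs of SNPs located inside its interval."""
    hits: Dict[str, List[str]] = {gene: [] for gene in genes}
    for snp in snps:
        for gene, (chrom, start, end) in genes.items():
            if snp.chrom == chrom and start <= snp.pos <= end:
                hits[gene].append(snp.snp_id)
    return hits


# Made-up coordinates for illustration only.
genes = {"GENE_A": ("1", 1_000, 2_000), "GENE_B": ("2", 500, 900)}
snps = [Snp("snp1", "1", 1_500), Snp("snp2", "1", 2_500), Snp("snp3", "2", 500)]
print(snps_in_genes(snps, genes))
```

The nested loop is adequate for a sketch; with roughly 1,300 genes and a genome-wide array, a sorted-interval or interval-tree lookup would be the more practical choice.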
Haplogroup-based analysis: Since mitochondrial DNA is a “haploid genome”, we also
conducted haplotype-based analysis in addition to SNP analysis. To classify the haplogroup of
each subject, we used HaploGrep (45), a bioinformatics tool for mtDNA haplogroup
classification. Haplogroups were assigned at the leaf level and then traversed up to the most
common macrogroups (46) for further analysis. A 2-sided exact binomial test was used to
assess whether the distribution of cases was significantly different from chance (coin flip)
within each macrogroup.
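A two-sided exact binomial test of this kind can be sketched with the standard library alone. The null proportion is an assumption here: the sketch uses the overall fraction of cases among European subjects (46 cases and 1431 controls, the column totals of Table 2.4), because the text does not state the null value. Under that assumption the results for macrogroups Z and A come out close to the tabulated 0.031 and 0.199.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_binomial_two_sided(k: int, n: int, p: float) -> float:
    """Two-sided exact binomial P value.

    Sums the probabilities of all outcomes that are no more likely than
    the observed count k (the convention used by R's binom.test).
    """
    observed = binom_pmf(k, n, p)
    cutoff = observed * (1 + 1e-7)  # guard against floating-point ties
    total = sum(
        binom_pmf(i, n, p) for i in range(n + 1) if binom_pmf(i, n, p) <= cutoff
    )
    return min(1.0, total)


# Assumed null: overall case fraction among Europeans (Table 2.4 totals).
null_case_fraction = 46 / (46 + 1431)

# Macrogroup Z: 1 case, 0 controls. Macrogroup A: 1 case, 6 controls.
p_z = exact_binomial_two_sided(k=1, n=1, p=null_case_fraction)
p_a = exact_binomial_two_sided(k=1, n=7, p=null_case_fraction)
print(round(p_z, 3), round(p_a, 3))
```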
Results
mtDNA variant m1555A>G was associated with elevated glioblastoma risk separately in
European (adjusted odds ratio (OR) = 98.33, 95% CI 8.06-1200.01, raw P value = 3.25 × 10⁻⁴) and
Hispanic children (adjusted OR = 9.93, raw P value = 0.056). In the meta-analysis combining both
groups, m1555A>G was significantly associated with glioblastoma risk after multiple-testing
correction (P value = 1.0 × 10⁻⁴, Table 2.3, Figure 2.1). We next examined the distribution of
m1555 genotypes in glioblastoma cases and controls (Table 2.3). Patients had significantly
higher odds of carrying the “G” allele by Fisher’s exact test in European subjects (raw OR = 63.85,
P value = 3.0 × 10⁻³) and in the total population (raw OR = 23.85, P value = 9.5 × 10⁻⁴). The OR was
also > 1.0 in the Hispanic population, although it did not reach statistical significance (OR = 10.43,
P value = 0.12). Overall, subjects with the m1555A>G variant had significantly higher odds of
developing glioblastoma. We also examined m1555 genotypes in glioma cases with other,
lower-grade subtypes, and the “G” allele was not detected (Table 2.3).
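The raw odds ratios and Fisher's exact P values above come from 2x2 tables of carrier status by case status. The following is a minimal sketch of that calculation using only the standard library. The counts are illustrative and are not the values of Table 2.3; the function names are likewise hypothetical.

```python
from math import comb


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Sample odds ratio for the 2x2 table [[a, b], [c, d]].

    Undefined (division by zero) when b or c is 0.
    """
    return (a * d) / (b * c)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]].

    Rows: cases / controls. Columns: carriers / non-carriers.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def pmf(x: int) -> float:
        # Hypergeometric probability of x carriers among the cases.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(col1, row1)
    cutoff = pmf(a) * (1 + 1e-7)  # guard against floating-point ties
    return min(1.0, sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= cutoff))


# Illustrative counts only: 2 of 46 cases and 1 of 1431 controls carry "G".
a, b, c, d = 2, 44, 1, 1430
print(odds_ratio(a, b, c, d), fisher_exact_two_sided(a, b, c, d))
```

With so few carriers, the exact test is preferable to a chi-square approximation, and the confidence interval around the odds ratio is necessarily wide.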
Figure 2.1 Association analysis results for mitochondrial SNPs in (A) European subjects (B)
Hispanic subjects (C) meta-analysis of both populations
The haplotype analysis confirmed the impact of the m1555 variant in glioblastoma
cases. We were able to obtain haplogroups at the leaf level, as well as the most common
macrogroups (Tables 2.4, 2.5) (by traversing up the mitochondrial haplogroup tree), for all
subjects and, as expected, subjects were scattered among many haplogroups. At the leaf level,
there were 314 haplogroups present in Europeans (34 of which contained glioblastoma cases) and
231 in Hispanics (21 of which contained glioblastoma cases). Furthermore, the binomial test
demonstrated no significant difference in case/control distribution in any macrogroup (Tables
2.4, 2.5).
Table 2.4 Common macrogroup distribution in Europeans
Macrogroup / Controls / Cases / Case proportion / Binomial test P value (propTestP)
Z 0 1 1 0.031
A 6 1 0.143 0.199
H 591 24 0.039 0.245
J 172 7 0.039 0.514
K 119 4 0.033 0.795
T 157 4 0.025 0.822
U 204 5 0.024 0.692
B 6 0 0 1
C 13 0 0 1
D 6 0 0 1
F 4 0 0 1
I 41 0 0 0.64
L1 2 0 0 1
L2 6 0 0 1
L3 3 0 0 1
M 6 0 0 1
N 19 0 0 1
P 1 0 0 1
R 4 0 0 1
V 34 0 0 0.627
W 28 0 0 1
X 9 0 0 1
Table 2.5 Common macrogroup distribution in Hispanics
Macrogroup / Controls / Cases / Case proportion / Binomial test P value (propTestP)
I 8 1 0.111 0.249
M 9 1 0.1 0.273
D 81 6 0.069 0.056
L2 20 1 0.048 0.488
B 187 7 0.036 0.677
H 619 20 0.031 1
T 32 1 0.03 1
J 33 1 0.029 1
C 189 5 0.026 0.837
A 50 1 0.02 1
E 4 0 0 1
F 6 0 0 1
G 1 0 0 1
K 36 0 0 0.63
L0 5 0 0 1
L1 9 0 0 1
L3 17 0 0 1
N 4 0 0 1
R 1 0 0 1
U 37 0 0 0.632
V 6 0 0 1
W 3 0 0 1
Z 1 0 0 1
Additionally, there was no significant difference in haplogroup distribution for
m1555A>G subjects, further supporting that the m1555 variant is unlikely to impact GBM risk
through the effects of a larger haplogroup (Table 2.6). We also tested the case-control distribution
in each macrogroup for other subtypes of glioma, as well as for all glioma cases combined, and
identified race/ethnicity-specific associations with several macrogroups. For example, in
European children, macrogroup B was enriched among pilocytic astrocytoma cases (P=0.046)
and all glioma cases combined (P=0.034). In Hispanic subjects, macrogroup A was highly
enriched for oligodendroglioma (P=0.0051), pilocytic astrocytoma (P=0.0082), as well as all
glioma cases combined (P=0.00064).
Table 2.6 Haplogroups of subjects with m1555 A>G variant in controls and glioblastoma cases
Ethnicity   Case status   Haplogroup   Common mitochondrial   rs62036057         rs111709726        rs115257641
                                       macrogroup             16:78990391:T:C    2:232616979:T:G    2:46156384:T:C
                                                              (WWOX)             (EFHD1)            (PRKCE)
European    Control       H1a1         H                      C/C                G/G                C/C
European    Case          J1c3c        J                      C/C                G/G                C/C
European    Case          H1at         H                      T/T                G/G                T/T
Hispanic    Control       H2a2a1       H                      C/C                G/G                C/C
Hispanic    Control       D1d2         D                      C/T                G/G                C/T
Hispanic    Control       D4b1a1       D                      C/C                G/G                C/C
Hispanic    Case          D1f3         D                      C/C                G/T                C/C
We broadened the analysis of mitochondrial germline SNPs via an analysis of N-mt
genes in glioblastoma cases and controls. Candidate mitochondrial genes, defined by gene
ontology lists, were assessed in relation to glioblastoma risk. In European children, rs62036057
was significant after Bonferroni correction (effect size = 1.10, P value = 3.42X10
-6
) (Table 2.7).
This SNP is in the intronic region of WWOX, which encodes a member of the short-chain
dehydrogenases/reductases (SDR) family. However, the association with this SNP was not
statistically significant in the analysis limited to Hispanic children nor in the meta-analysis,
although effect sizes were in the same direction as Europeans. In Hispanic subjects,
rs111709726, which is in the intronic region of EFHD1, was significantly associated with
glioblastoma case-control status (effect size = 1.82, P value = 1.41x10^-6). In Europeans, the
effect was in the same direction but was not statistically significant (odds ratio = 1.43, 95% CI:
0.55-3.71, P value = 0.47) (Table 2.7).
Meta-analyses of the N-mt SNPs did not reach statistical significance after Bonferroni
correction. The SNP with the smallest P value is rs115257641 (effect size = 1.65, P value =
8.83x10^-6), in the intronic region of PRKCE, which belongs to the Protein Kinase C (PKC) family
(Table 2.7).
Table 2.7: Significant SNPs in nucleus mitochondrial association analysis
rs ID        Gene   Odds ratio (95% CI)  P value      Odds ratio (95% CI)  P value      Odds ratio (95% CI)  P value
             name   (Europeans)          (Europeans)  (Hispanics)          (Hispanics)  (meta-analysis)      (meta-analysis)
rs62036057   WWOX   2.99 (1.88-4.75)     3.42x10^-6*  1.3 (0.73-2.29)      0.37         2.14 (1.5-3.07)      3.07x10^-5
rs111709726  EFHD1  1.43 (0.55-3.71)     0.47         6.19 (2.95-12.97)    1.41x10^-6*  3.57 (1.99-6.40)     2.06x10^-5
rs115257641  PRKCE  2.72 (0.93-7.97)     0.07         8.93 (3.33-23.92)    1.33x10^-5   5.19 (2.51-10.72)    8.83x10^-6
* Significant after Bonferroni correction
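The effect sizes quoted in the text appear to be natural-log odds ratios of the values in Table 2.7 (for example, ln 2.99 ≈ 1.10 for rs62036057 in Europeans). The following minimal sketch illustrates that relationship and recovers an approximate standard error and two-sided P value from a reported odds ratio and its 95% confidence interval. It assumes a Wald-type interval (estimate ± 1.96 standard errors on the log scale), which is an assumption and not a statement of how the original models were fitted; the function name is illustrative.

```python
import math

def wald_summary(odds_ratio, ci_low, ci_high):
    """Return (log odds ratio, SE, two-sided P) from an OR and its 95% CI.

    Assumes the CI is a Wald interval: log(OR) +/- 1.96 * SE.
    """
    beta = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
    z = beta / se
    # Two-sided P value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return beta, se, p

# rs62036057 (WWOX) in Europeans, values taken from Table 2.7.
beta, se, p = wald_summary(2.99, 1.88, 4.75)
print(f"log(OR) = {beta:.2f}, SE = {se:.3f}, P = {p:.2e}")
```

Under these assumptions the recovered log odds ratio rounds to the effect size of 1.10 given in the text, and the recovered P value is of the same order as the reported 3.42x10^-6; small differences are expected from rounding of the published interval.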
Discussion
Leveraging a large California population-based case-control study, we report for the first
time an association between the m1555A>G variant and pediatric glioblastoma risk. Further
haplotype analysis conducted to elucidate whether this association is a result of haplogroup
representation supported an independent effect of the m1555A>G variant. Notably, the
association plot showed a lack of other SNPs in LD with m1555, implying that the m1555 variant
is not tagging a larger haplogroup compatible with its known structure.
m1555 is in a highly conserved region of the mitochondrial 12S rRNA gene (47), and the
corresponding position in E. coli has been demonstrated to be an essential ribosome decoding site (48). The
A>G variant has been previously reported to cause hearing loss (49)(50)(51)(52). Since hair cells
in the inner ear have a high demand for energy and are rich in mitochondria, hearing loss is a
common symptom across many mitochondrial conditions such as m1555A>G. m1555A>G has
also been reported to be associated with reduced height (53), likely associated with a decrease
in efficiency of mitochondrial energy synthesis. This suggests a similar vulnerability in the
developing brain to energy demands or the impact of mitochondrial insufficiency in the
pathophysiology of pediatric glioblastoma formation.
In our disease of interest, m1555 A>G alone is unlikely to be sufficient to result in
glioblastoma development, given the frequency of m1555 being 0.11% (in gnomAD, a large
sequenced reference database (54)) and the frequency of pediatric high-grade gliomas being
less than 1 per 100,000 persons per year, with GBM incidence even lower. Indeed, only 3
subjects in our case group carry this variant. However, this association may lead to a deeper
understanding of the etiology of glioblastoma. One possible mechanism is an effect on the
mitochondrial oxidative phosphorylation system. We hypothesize that m1555A>G could strain
energy supply to key biological processes in brain including DNA repair, expressing tumor
suppressing genes, recruiting immunological agents to eliminate tumor cells, and initiating
apoptosis, because of the high level of activity and energy consumption in the brain. As a result,
the m1555 A>G variant may predispose individuals to glioblastoma, possibly in
concert with environmental risk factors. The presence of additional somatic MT variants in
pediatric GBM tumors at a higher rate than in other glioma subtypes (35)(36) suggests a particular
sensitivity to, or dependence on, mitochondrial aberrations in pediatric GBM etiology or pathophysiology.
Interestingly, somatic mutation of m1555 was also observed in other cancers such as
Ewing sarcoma (35). To date, there has been no report of genetic polymorphisms significantly
associated with pediatric glioblastoma risk, likely due to a lack of statistical power in the small
datasets assembled so far. However, candidate n-mt association analysis of our dataset
discovered ancestry-specific nuclear SNPs with significant case-control differences. In European
children, the association between rs62036057 in WWOX and glioma risk was statistically
significant. WWOX protein localizes to the mitochondria (55) and is highly expressed in the
brain. Its alteration has been associated with multiple cancers (for example, esophageal cancer
(56)); however, it was not clear whether this was due to carcinogenesis or to secondary effects of cancer
therapies (57). Our findings at the germline level lend support to the former hypothesis. In
Hispanic children, rs111709726 in EFHD1 was significantly associated with glioma risk. EFHD1 is
associated with the mitochondrial inner membrane and acts as a calcium sensor for mitochondrial
flash activation (58). It is also overexpressed in the brain, with high confidence of spatial
localization within mitochondria. Similarly, its effect sizes were in the same direction in European
subjects, but the association was not statistically significant in Europeans or in the entire study
population. The PRKCE gene that was associated with pediatric glioma risk in our study is an intriguing
candidate, as it has also been reported to be associated with both neuron growth and cancer
cell invasion, for example in prostate cancer (59).
Our n-mt SNP association analyses suggest a potential interplay between mtDNA and n-
mt DNA polymorphism in contributing to glioblastoma risk. Indeed, some m1555 G allele
carriers were also observed to carry identified n-mt risk alleles, mostly glioblastoma cases. We
identify glioblastoma risk that is associated with both mitochondrial rRNA (encoded by mtDNA)
and genes that are dehydrogenases/reductases or calcium sensors in mitochondria (encoded by
n-mt DNA).
One major drawback of our study is the lack of a replication dataset. Unlike nuclear
DNA variants, for which replication datasets are relatively easy to obtain, a comprehensive mt
variant panel is not included on most SNP array panels and therefore an independent external
pediatric glioblastoma dataset to replicate m1555 is currently unavailable. Our findings warrant
future replication in studies of pediatric glioblastoma that include analysis of SNP arrays with
mt variant probes, or via targeted mt variant genotyping.
In summary, while genetic risk for pediatric glioblastoma is still largely undefined, the
role played by proteins encoded by mtDNA is worth investigating further along with other
metabolic variants.
2.2 Localized variation in ancestral admixture identifies pilocytic astrocytoma risk loci
among Latino children^6
Shaobo Li^7, Charleston W.K. Chiang^2, Swe Swe Myint^2, Katti Arroyo^2, Libby Morimoto^8,
Catherine Metayer^3, Adam J. de Smith^2, Kyle M. Walsh^9, Joseph L. Wiemels^2
^6 This manuscript is in the process of submission. Published version could be modified per coauthors’, editors’ and reviewers’
comments.
^7 Center for Genetic Epidemiology, Department of Population and Public Health Sciences, University of Southern California, Los
Angeles, California, USA
^8 School of Public Health, University of California Berkeley, Berkeley, California, USA
^9 Division of Neuro-epidemiology, Department of Neurosurgery, Duke University, Durham, North Carolina, USA
Abstract
Pilocytic astrocytoma (PA) is the most common pediatric brain tumor. PA has at least a
50% higher incidence in populations of European ancestry compared to other ancestral groups,
which may be due in part to genetic differences. We estimated the proportions of European,
African, and Amerindian ancestry in 301 PA cases and 1185 controls of self-identified Latino
ethnicity from the California Childhood Cancer Record Linkage Project (CCRLP) and found that PA
cases had a significantly higher proportion of overall European ancestry than controls (case median
= 0.55, control median = 0.51, P value = 3.5x10^-3). Admixture mapping identified 13 SNPs in the
6q14.3 region (SNX14) contributing to risk, as well as three other peaks approaching significance on
chromosomes 7, 10 and 13. Downstream fine mapping in these regions revealed several SNPs
potentially contributing to childhood PA risk. In summary, we report for the first time a
difference in genomic ancestry associated with Latino PA risk and several genomic loci
potentially mediating this risk.
Introduction
Pilocytic astrocytoma (PA) is a slow-growing, benign primary central nervous system
tumor that most commonly arises in the cerebellum and chiasmatic/hypothalamic region. It has
a high survival rate, and most cases can be cured with resection. However, PAs are the most
common pediatric brain tumor and their sensitive intracranial location – including the optic
pathway – can lead to significant and lifelong morbidity. Additionally, some PAs show molecular
similarities to malignant gliomas and require aggressive treatment (60).
Little is known about the molecular etiology of childhood PA. While hallmark somatic
mutations have been reported to underlie PA tumorigenesis, including NF1 (61), KRAS (62),
PTEN (63), and BRAF (64), heritable genetic contributions impacting risk of PA remain largely
unidentified outside the context of Neurofibromatosis Type I.
PA incidence is significantly higher in populations of European ancestry than other
ancestries. According to a report from The Central Brain Tumor Registry of the United States
(CBTRUS) (65), the average annual age-adjusted incidence rate of pilocytic astrocytoma was
0.38 (95% CI: 0.37-0.39) per 100,000 per year in non-Hispanic whites, much higher than among
U.S. Latinos, 0.24 (0.23-0.26), African-Americans, 0.26 (0.24-0.29), American Indian/Alaskan
Natives, 0.14 (0.10-0.19), and Asian/Pacific Islander, 0.13 (0.11-0.16). These incidence
differences implicate differences in the distribution of underlying risk factors, including
ancestry-associated genetic risk alleles and race/ethnicity-related environmental factors. To
date, there has not been a rigorous exploration of these racial/ethnic differences in terms of
genetic predisposition, either on a genome-wide background level or at specific loci. However,
prior genomic analyses in admixed populations have observed increases in the risk of both
childhood ependymoma and adult glioma in association with genome-wide differences
in ancestry (38)(66). Furthermore, these studies have implicated both novel and well-validated
glioma-associated genes in contributing to racial/ethnic differences in tumor risk. Using a multi-
ethnic population of California children with PA and matched controls, we therefore sought to
investigate both global differences in genomic ancestry and locus-specific differences to identify
genetic factors associated with development of childhood PA.
Materials and Methods
Study participants: An overview of the subjects involved in this study is displayed in
Figure 2.2 and Table 2.8. Latino cases and controls were derived from the California Childhood
Cancer Record Linkage Project (CCRLP), a data linkage and sample bank resource described previously (21).
Case eligibility criteria included: [i] histologic diagnosis of glioma (ICD-O-3 9380 to 9451)
reported to the California Cancer Registry between 1988 and 2011; [ii] under 20 years of age at
diagnosis; and [iii] no previous diagnosis of any other cancer. ICD-O-3 code 9421 (Pilocytic
astrocytoma, WHO Grade I) constituted about 1/3 of all identified glioma cases and forms the
basis of the current report. Demographic data for all 2788 pediatric glioma cases meeting the
eligibility criteria, with an archived newborn bloodspot (ANBS) available, and successfully
genotyped are shown in Table 2.8. Control eligibility criteria were similar to those for cases,
based on the linkage between the California Cancer Registry (for absence of cancer) and
California birth records. Controls were matched to cases (individually, based on month and year
of birth, parental race and ethnicity, and sex) and randomly selected from the statewide birth
records.
Table 2.8: Demographic data of pediatric astrocytoma subjects
Variable Median (interquartile range) or N (%)
Age at diagnosis (years) 6.0 (7.0)
Sex, male 691 (51.34%)
Self-reported race
Non-Hispanic White 799 (59.36%)
Hispanic 547 (40.64%)
Histology
Pilocytic Astrocytoma 772 (57.36%)
Non-pilocytic astrocytoma 574 (42.64%)
Primary site
Cerebellum, Not otherwise specified (NOS) 369 (27.41%)
Brain stem 187 (13.89%)
Cerebrum 156 (11.59%)
Brain NOS 120 (8.92%)
Other 514 (38.19%)
Birth weight (g) 3487.0 (659.0)
Gestational age (days) 279.0 (16.0)
Tumor Grade
I 167 (5.32%)
II 222 (7.07%)
III 14 (0.45%)
IV 179 (5.70%)
NOS 764 (24.33%)
Figure 2.2: Flowchart for data processing and analysis
Genotyping: For each subject, a single 1.4 cm diameter ANBS was excised by the
Biobank Program at the California Department of Public Health, labeled with study identifiers,
and individually bagged. Batches of ANBS were shipped on ice packs to Dr. Wiemels’ Childhood
Cancer Research Laboratory at the University of Southern California, where a 1/3 portion of each
card was cut and processed. DNA was isolated with Agencourt chemistry on an Eppendorf robot and
quantified with PicoGreen. 500 ng of genomic DNA was genotyped with the Precision Medicine
Diversity Array, a Thermo Fisher Affymetrix product that assays >900,000 SNPs genome-wide.
Genotypes were called with Affymetrix Power Tools, and the resulting genotypes were subjected to
quality control procedures, including call-rate filtering (samples and SNPs with more than 5%
missing data were excluded), sex checks, cryptic relatedness filtering (IBD<0.25), and SNP
filtering based on Hardy-Weinberg equilibrium (SNPs with P<10^-4 among controls were
removed).
CCRLP Study-generated case-control datasets: Genetic ancestry was assigned
as follows. The first ten principal components were calculated from SNP array data using PLINK2
(39). Each subject was assigned a genetic ancestry by comparison with 1000 Genomes Phase 3
subjects (27). Assignment was made with the “class” R package (67) using the K-nearest neighbor
(KNN) method (k=10). As a result, Latino pilocytic astrocytoma case/control subjects (query panel)
constituted 301 cases and 1185 controls. The non-Latino white dataset included 471 pilocytic
astrocytoma cases and 1569 controls.
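The ancestry assignment step can be illustrated with a small K-nearest neighbor classifier on principal component coordinates. This is a hedged sketch in Python, not the R “class” package used in the study: the function name, the two-dimensional toy coordinates and the population labels are hypothetical, and ties are resolved by first occurrence here, whereas the R implementation may break ties differently.

```python
import math
from collections import Counter

def knn_assign(query_pcs, ref_pcs, ref_labels, k=10):
    """Assign the majority label among the k reference samples that are
    closest (Euclidean distance) to the query in principal component space."""
    neighbors = sorted(
        zip(ref_pcs, ref_labels),
        key=lambda pair: math.dist(query_pcs, pair[0]),
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Hypothetical reference samples described by two PCs (the study used ten).
ref_pcs = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2),
           (5.0, 5.1), (5.2, 4.9), (4.8, 5.0), (5.1, 5.2)]
ref_labels = ["EUR"] * 4 + ["AMR"] * 4

label_a = knn_assign((0.3, 0.1), ref_pcs, ref_labels, k=3)
label_b = knn_assign((4.9, 5.3), ref_pcs, ref_labels, k=3)
print(label_a, label_b)
```

In practice the query and reference samples would need to be projected into the same principal component space before distances are compared.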
Reference subjects of European, Amerindian and African ancestries: A total of 3942
subjects with high-quality SNP data passing gnomAD QC filters in the gnomAD (54)
v3.1 release were used as reference samples for each estimated ancestral component. Among
them, a total of 716 African reference subjects were selected based on self-reported ancestry,
excluding African Caribbean in Barbados and African Ancestry in Southwest US. Reference
subjects for European ancestry were also selected based on self-reported ancestry, excluding
Finnish in Finland, and a total of 671 subjects were included. To select reference subjects of
Amerindian ancestry, proportions of different ancestries were estimated with ADMIXTURE (68)
(mean result of 10 runs), using number of ancestries (K=5) determined by cross validation. A
total of 94 subjects with >85% estimated Amerindian ancestry were selected to be the
Amerindian reference population, of which 7 were Colombian, 12 were Karitianan, 14 were
Mayan, 4 were of Mexican ancestry in Los Angeles, 37 were Peruvian in Lima, Peru, 12 were
Pima and 8 were Suruí.
Estimation of Ancestry Proportions: To estimate the proportions of European, African
and Amerindian ancestries in Latino case and control subjects, we used ADMIXTURE (68) with
the number of ancestries set to K=3. The program was run 10 times and the average across runs was
taken as the final estimate.
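Averaging ancestry proportions over repeated ADMIXTURE runs, and selecting reference subjects above an ancestry threshold (such as the >85% Amerindian cut-off used for the reference panel), can be sketched as follows. The data layout is an assumption: each run is represented as a list of per-subject proportion rows, and the ancestry components are assumed to appear in the same column order in every run, which would need to be checked before averaging real output. All names and numbers are illustrative.

```python
def average_runs(runs):
    """Average per-subject ancestry proportions across repeated runs.

    runs[r][i][j] is the proportion of component j for subject i in run r.
    """
    n_runs = len(runs)
    return [
        [sum(values) / n_runs for values in zip(*subject_rows)]
        for subject_rows in zip(*runs)
    ]

def select_reference(avg, component, threshold=0.85):
    """Indices of subjects whose averaged proportion exceeds the threshold."""
    return [i for i, row in enumerate(avg) if row[component] > threshold]

# Two hypothetical runs, two subjects, three ancestry components.
runs = [
    [[0.50, 0.40, 0.10], [0.05, 0.90, 0.05]],
    [[0.60, 0.30, 0.10], [0.07, 0.88, 0.05]],
]
avg = average_runs(runs)
reference_idx = select_reference(avg, component=1)
print(avg, reference_idx)
```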
Inference of Local Ancestry and Genome-wide Association Analysis: RFMix (69) was used
with default settings to estimate the local ancestry of Latino PA case and control subjects, using the
reference panel described above. Genetic data of the reference panel and query panel were phased
and imputed with the 1000 Genomes Project as reference. Genome-wide association analysis was
then performed, regressing case-control status on the number of European copies at each variant,
controlling for potential confounding variables (sex, global European ancestry proportion,
genetic principal components). The genome-wide significance threshold for admixture mapping
was calculated with the test statistic simulation method (70), using the “STEAM” package (70) in R.
Statistical Analysis: The association between the number of European copies and risk of pilocytic
astrocytoma at each SNP in Latino subjects was tested using logistic regression models
adjusting for estimated global European ancestry proportion, sex and the first 10 genetic
principal components. Genotyped SNP array data were first imputed and phased using BEAGLE5
(71)(72). Association analyses for SNPs around admixture mapping signals were conducted
using logistic regression models for Latinos and non-Latino whites separately. Meta-analysis of
these fine mapping results was performed using the METAL software package (41). The number of
independent SNPs was determined after pruning each region using PLINK2, in Europeans and
Latinos separately, and the average of the two counts was used for the meta-analysis results.
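The meta-analysis step can be illustrated with a fixed-effect, inverse-variance weighted combination of per-cohort log odds ratios. This is a sketch rather than the METAL implementation itself, and the choice of inverse-variance weighting is an assumption about the scheme used; it does, however, reproduce the rs959431 meta-analysis row of Table 2.10 to within rounding when the Latino and European estimates from that table are used as input.

```python
import math

def inverse_variance_meta(studies):
    """Fixed-effect meta-analysis of (odds ratio, SE of log odds ratio) pairs.

    Returns the pooled odds ratio, the SE of the pooled log odds ratio,
    and a two-sided P value from the normal distribution.
    """
    weights = [1.0 / se ** 2 for _, se in studies]
    betas = [math.log(odds_ratio) for odds_ratio, _ in studies]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled_beta / pooled_se
    p = math.erfc(abs(z) / math.sqrt(2))
    return math.exp(pooled_beta), pooled_se, p

# rs959431 (LYZL1): CCRLP Latino and European estimates from Table 2.10.
meta_or, meta_se, meta_p = inverse_variance_meta([(1.547, 0.139), (1.214, 0.101)])
print(f"OR = {meta_or:.3f}, SE = {meta_se:.4f}, P = {meta_p:.2e}")
```

The pooled values agree with the tabulated meta-analysis result (OR 1.320, SE 0.0818, P 6.764x10^-4) up to rounding of the inputs.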
Results
European genomic ancestry is elevated in PA cases among Californian Latinos. Global
ancestries of both the Latino query panel (Latino pilocytic astrocytoma case and control subjects)
and the reference panel (reference subjects of European, African, and Amerindian ancestries) were
partitioned into three components (European, African, Amerindian) using ADMIXTURE. As seen in Figure
2.3, Latino subjects possessed similar proportions of European and Amerindian ancestry
and a small contribution from African ancestry. Subjects in the reference panels
were also confirmed to derive predominantly from the single ancestral group to which they
were originally assigned.
Figure 2.3 Estimated ancestry proportions for query Latino subjects and reference samples
Latino PA cases had a significantly higher proportion of European genomic ancestry compared
to controls (Figure 2.4A) (case median = 55%, control median = 51%, Wilcoxon rank sum test P =
3.38x10^-3). Correspondingly, cases had a lower proportion of Amerindian ancestry (Figure 2.4B)
(case median = 40%, control median = 43%, Wilcoxon rank sum test P = 1.36x10^-3). No
significant difference was observed for African ancestry (case median = 3.76%, control median =
3.93%, Wilcoxon rank sum test P = 0.221).
Figure 2.4 Distribution of ancestry proportions in Latino pilocytic and non-pilocytic astrocytoma
cases and controls
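The rank sum comparisons reported above can be sketched with a small implementation of the Wilcoxon rank sum (Mann-Whitney) test. This version uses the normal approximation with a tie correction and no continuity correction, so its P values may differ slightly from those of standard statistical packages; the ancestry proportions in the example are hypothetical and are not study data.

```python
import math

def rank_sum_test(x, y):
    """Wilcoxon rank sum test of x versus y (normal approximation).

    Returns the Mann-Whitney U statistic for x and a two-sided P value.
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = len(combined)
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the block of values tied with position i.
        while j + 1 < n and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for idx in range(i, j + 1):
            ranks[idx] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical European ancestry proportions for a few cases and controls.
cases = [0.61, 0.58, 0.55, 0.70, 0.52, 0.66]
controls = [0.50, 0.47, 0.53, 0.41, 0.49, 0.56]
u_stat, p_value = rank_sum_test(cases, controls)
print(u_stat, p_value)
```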
CBTRUS data revealed that unlike pilocytic astrocytoma, other subtypes of pediatric
astrocytoma do not display a disproportionately higher incidence rate in populations of
European descent compared to other ancestral groups. To assess whether global ancestry
comparisons support these registry-based assessments, we compared the proportions of
European ancestry in Latino non-PA cases and controls (n=1076, Table 2.9), observing no
significant differences in ancestry (Wilcoxon rank sum test P = 0.219) (Figure 2.4C).
Table 2.9: Description of Latino non-pilocytic astrocytoma cases and controls
Variable Median (interquartile range) or N (%)
Sex, male 539 (50.09%)
Birth weight (g) 3405.0 (678.0)
Gestational age (days) 277.0 (17.0)
Case status
Cases 246 (22.86%)
Controls 830 (77.14%)
Age at diagnosis among cases (years) 6.0 (8.0)
Histologies among cases (ICD-O codes)
Diffuse astrocytoma (9400) 125 (50.81%)
Anaplastic astrocytoma (9401) 65 (26.42%)
Dysembryoplastic neuroepithelial tumor (9413) 24 (9.76%)
Pleomorphic xanthoastrocytoma (9424) 17 (6.91%)
Fibrillary astrocytoma (9420) 11 (4.47%)
Gemistocytic astrocytoma (9411) 2 (0.81%)
Desmoplastic infantile astrocytoma (9412) 1 (0.41%)
Protoplasmic astrocytoma (9410) 1 (0.41%)
Primary site among cases
Brain stem 44 (17.89%)
Cerebrum 40 (16.26%)
Temporal lobe 39 (15.85%)
Overlapping lesion of brain 21 (8.54%)
Cerebellum, NOS 17 (6.91%)
Frontal lobe 16 (6.50%)
Brain, NOS 15 (6.10%)
Spinal cord 13 (5.28%)
Parietal lobe 13 (5.28%)
Others 28 (11.38%)
Tumor Grade among cases
I 13 (5.28%)
II 42 (17.07%)
III 5 (2.03%)
IV 78 (31.71%)
NOS 108 (43.90%)
Admixture Mapping in Latino PA cases and controls: Admixture mapping was performed
in Latino PA cases and controls using RFMix, followed by a genome-wide association analysis
between number of European copies for each SNP and PA status, controlling for estimated
global European ancestry, sex and first 10 genetic principal components (Figure 2.5A) (39). One
region of 13 linked SNPs in SNX14 on 6q14.3 surpassed the threshold for genome-wide
statistical significance (Figure 2.5B). Additional peaks approaching, but not reaching, genome-
wide significance were identified on chromosomes 7 (Figure 2.5C), 10 (Figure 2.5D) and 13
(Figure 2.5E).
Figure 2.5 Association plots of local European ancestry copies and risk of pilocytic astrocytoma in
Latino subjects
Fine mapping of the regional admixture mapping peak: Based on the widths of
admixture peaks, we performed association analysis in the regions of admixture mapping
signals to identify individual SNPs potentially associated with PA risk. Association analyses in the
3 Mb region surrounding the chr6 peak (chr6: 84009612-87009612) were performed in CCRLP
Latinos (Figure 2.6A1) and non-Latino White cohorts (Figure 2.6A2) separately using logistic
regression models adjusting for sex and PCs, then meta-analyzed (Figure 2.6A3). Bonferroni
correction was performed based on the number of independent SNPs (n=2,352 in Latinos,
n=2,394 in Europeans, n=2,373 for meta-analysis). No SNPs reached significance after multiple-test
correction; however, several lead SNPs from the meta-analysis were identified, including
rs191186144 (P value = 1.64x10^-3, intronic region of MRAP2), rs74559531 (P value = 2.22x10^-3,
intronic region of HTR1E), and rs4707205 (P value = 4.05x10^-3, upstream region of NT5E). All are
located in brain-expressed genes that play biological roles in brain development/function or
cancer development (Table 2.10).
Similarly, we also investigated 2Mb regions around the admixture mapping peaks on
chromosomes 7, 10 and 13 that approached genome-wide significance (Figure 2.6B for
chromosome 7, Figure 2.6C for chromosome 10, and Figure 2.6D for chromosome 13). No SNPs
reached significance after multiple-test correction, and we report the 3 lead SNPs from each
analysis in Table 2.10.
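The Bonferroni thresholds implied by the independent SNP counts reported for the chromosome 6 region can be worked out directly. The sketch below assumes a family-wise alpha of 0.05, which is not stated in the text, and takes the meta-analysis count as the average of the Latino and European counts, as described in the Methods.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

# Independent SNP counts reported for the chromosome 6 fine-mapping region.
independent_snps = {"Latinos": 2352, "Europeans": 2394}
independent_snps["meta-analysis"] = (
    independent_snps["Latinos"] + independent_snps["Europeans"]
) // 2

thresholds = {
    name: bonferroni_threshold(count) for name, count in independent_snps.items()
}
for name, value in thresholds.items():
    print(f"{name}: {independent_snps[name]} tests, threshold = {value:.2e}")

# Smallest chromosome 6 meta-analysis P value (rs191186144).
lead_p = 1.637e-3
passes = lead_p < thresholds["meta-analysis"]
```

Under this assumption the per-test thresholds are on the order of 2x10^-5, roughly two orders of magnitude below the smallest chromosome 6 meta-analysis P value, which is consistent with no SNP reaching significance after correction.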
Figure 2.6 Case-control association analyses between SNPs and pilocytic astrocytoma risk of
CCRLP Latino, European subjects and meta-analysis of both results in regions of admixture
mapping peaks
Table 2.10 Top SNPs from fine mapping analyses of admixture mapping peaks
Locus    Nearest   SNP           Position    Risk    Dataset        OR     SE      P-value
         Gene(s)                 (bp, hg38)  Allele
6q14.2   MRAP2     rs191186144   84,058,087  A       CCRLP Lat      1.533  0.346   0.217
                                                     CCRLP Eur      1.780  0.197   3.478x10^-3
                                                     Meta-analysis  1.716  0.171   1.637x10^-3
6q14.3   HTR1E     rs74559531    86,961,711  A       CCRLP Lat      0.573  0.547   0.309
                                                     CCRLP Eur      2.204  0.214   2.285x10^-4
                                                     Meta-analysis  1.842  0.200   2.221x10^-3
6q14.3   NT5E      rs4707205     85,439,850  C       CCRLP Lat      1.644  0.230   3.044x10^-2
                                                     CCRLP Eur      1.387  0.165   4.714x10^-2
                                                     Meta-analysis  1.469  0.134   4.049x10^-3
7q14.3   NEUROD6   rs113651799   31,316,499  G       CCRLP Lat      1.613  0.256   0.061
                                                     CCRLP Eur      1.641  0.204   0.016
                                                     Meta-analysis  1.630  0.160   2.215x10^-3
7q14.3   NEUROD6   rs17473169    31,324,183  A       CCRLP Lat      1.476  0.182   0.0321
                                                     CCRLP Eur      1.324  0.131   0.0322
                                                     Meta-analysis  1.374  0.106   2.788x10^-3
7q14.3   MTURN     rs34393279    30,138,553  G       CCRLP Lat      1.770  0.234   0.0146
                                                     CCRLP Eur      1.405  0.180   0.0588
                                                     Meta-analysis  1.531  0.143   2.824x10^-3
10p12.1  LYZL1     rs959431      29,099,450  C       CCRLP Lat      1.547  0.139   1.701x10^-3
                                                     CCRLP Eur      1.214  0.101   0.0547
                                                     Meta-analysis  1.320  0.0818  6.764x10^-4
10p12.1  LYZL1     rs555108      29,090,828  T       CCRLP Lat      1.552  0.158   5.403x10^-3
                                                     CCRLP Eur      1.272  0.110   0.0289
                                                     Meta-analysis  1.358  0.0904  7.148x10^-4
10p12.1  LYZL1     rs550240      29,094,615  A       CCRLP Lat      1.548  0.158   5.67x10^-3
                                                     CCRLP Eur      1.271  0.110   0.0294
                                                     Meta-analysis  1.356  0.0904  7.531x10^-4
13q31.3  GPC6      rs9584173     93,952,876  G       CCRLP Lat      0.681  0.121   1.534x10^-3
                                                     CCRLP Eur      0.819  0.110   0.0693
                                                     Meta-analysis  0.754  0.0814  5.112x10^-4
13q32.1  ABCC4     rs146402029   95,250,755  T       CCRLP Lat      0.882  0.487   0.797
                                                     CCRLP Eur      2.620  0.244   8.113x10^-5
                                                     Meta-analysis  2.105  0.219   6.540x10^-4
13q31.3  GPC6      rs1264672115  93,930,275  G       CCRLP Lat      0.708  0.117   3.251x10^-3
                                                     CCRLP Eur      0.815  0.106   0.0537
                                                     Meta-analysis  0.765  0.0787  6.608x10^-4
Conditional analysis on regional admixture mapping signatures: We conducted
conditional analyses in regions of admixture mapping peaks to identify potential SNPs that
could account for the signals. For each region, we added the top 3 SNPs from the meta-analysis into
the admixture mapping regression model, one by one, and observed whether the association signal
was eliminated. There were slight decreases in signal significance for peaks on chromosomes 6
and 13, and the adjusted effect sizes were also closer to null (Table 2.11). However, the
changes were all marginal, suggesting that locus-specific admixture signals contribute to these
associations but are not well explained by case-control differences in allele frequencies at the
modeled SNPs.
Table 2.11: Conditional analysis in admixture mapping peaks
Regional admixture signal  Conditioned on  MAF Lat  MAF Eur  Estimate  Pr(>|z|)
chr6:85504599              none                              0.465     4.70x10^-6
chr6:85504599              chr6:86961711   0.010    0.025    0.446     2.09x10^-5
chr13:94441872             none                              -0.408    2.25x10^-5
chr13:94441872             chr13:93952876  0.26     0.16     -0.389    1.65x10^-4
Discussion
Pilocytic astrocytoma is the most common pediatric brain tumor. Although histologically
benign and typically curable, it can rarely progress to more malignant variants and often causes
other deficits due to its sensitive intracranial location. Epidemiological evidence has shown that
PA occurs significantly more frequently in populations of predominantly European ancestry.
Accordingly, we observe a strong association between elevated European genomic ancestry
and PA risk in our Latino study subjects, with each 5% increase in European ancestry proportion
associated with a 1.051-fold increase in odds of PA among Latinos (95% CI: 1.014-1.091).
Because cases were identified from a registry-based data linkage study with careful
matching of population-based controls, these results indicate that genomic ancestry
contributes to PA risk, likely due to differing frequencies of underlying risk alleles across
racial/ethnic groups. Additional etiologic factors unable to be assessed in our study, such as
potential environmental risk factors, also merit assessment in future research.
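The odds ratio reported per 5% increase in European ancestry can be rescaled to other increments, because a logistic model is linear on the log-odds scale. The sketch below assumes that this log-linear relationship holds across the range of ancestry proportions considered; the function name and the alternative increments are illustrative only.

```python
import math

def rescale_odds_ratio(odds_ratio, from_step, to_step):
    """Convert an odds ratio per `from_step` of exposure to one per `to_step`,
    assuming the log-odds change linearly with the exposure."""
    return math.exp(math.log(odds_ratio) * to_step / from_step)

# Reported estimate: OR 1.051 (95% CI: 1.014-1.091) per 5% European ancestry.
or_per_10pct = rescale_odds_ratio(1.051, 0.05, 0.10)
ci_per_10pct = (
    rescale_odds_ratio(1.014, 0.05, 0.10),
    rescale_odds_ratio(1.091, 0.05, 0.10),
)
print(f"OR per 10% = {or_per_10pct:.3f}, "
      f"95% CI: {ci_per_10pct[0]:.3f}-{ci_per_10pct[1]:.3f}")
```

Doubling the increment squares the odds ratio and its confidence limits, so the relative precision of the estimate is unchanged by the choice of scale.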
Additional glioma subtypes have also been reported to occur more frequently in non-
Latino whites than other racial/ethnic groups, including childhood ependymoma(73), adult
glioblastoma and oligodendroglioma(74). Global ancestry analysis has previously revealed that
childhood ependymoma risk is associated with higher European ancestry in U.S. Latinos (38),
but we did not observe ancestral differences among any other subtypes of astrocytoma in this
study aside from PA. Therefore, both pilocytic astrocytoma and non-pilocytic
astrocytoma showed agreement between epidemiologic incidence and global ancestry
distribution, consistent with the hypothesis that genetic risk captures a proportion of the
incidence disparity for pediatric pilocytic astrocytoma.
The observation that the European ancestral proportion was associated with elevated
PA risk in our study implicates a higher frequency of PA risk alleles on European haplotypes and
led us to perform local admixture mapping analyses. Admixture mapping and subsequent fine-
mapping using traditional allelic association testing in a logistic regression framework identified
an admixture peak in the 6q14.3 region (a 34,268 bp region, chr6:85,502,415-85,536,682) where 1
additional copy of the European ancestral haplotype was associated with 1.59-fold increased
odds of PA (smallest P-value from fine-mapping at chr6:85504599; P = 4.70x10^-6). This region
contains the SNX14 gene, which encodes a protein in the sorting nexin family involved in the
sorting of endosomes. SNX14 maintains microtubule organization and axonal transport in
neurons and glia (75), is thereby critical to maintenance of Purkinje cells (76), and has been
shown to regulate neuronal intrinsic excitability and synaptic transmission in mice (75). Its loss
is associated with Spinocerebellar Ataxia (SCAR20) and Vici Syndrome, rare childhood-onset
neurodevelopmental diseases (77)(78). One possible mechanism by which the risk allele in SNX14
could increase PA risk is the promotion of a tumorigenic microenvironment. Synaptic activity has
been reported to be involved in the shedding of neuroligin 3 (NLGN3), which is required in the
process of PA gliomagenesis (79).
We also carried out genotypic association analyses in this identified region in both
European and Latino PA subjects. Although no SNP reached significance after Bonferroni
correction, we identified potential alleles that could contribute to PA risk in these regions. For
example, NT5E is associated with HIF-1-α transcription factor network, and many genes induced
by HIF-1-α are highly expressed in cancer, including angiogenic growth factors (VEGF for
example) and glucose metabolism enzymes (80). It was also the most significant SNP in
chromosome 6 fine-mapping results.
One limitation of our study is lack of environmental covariates that could contribute to
differences in PA risks in different racial/ethnic groups. While this would not affect our global
ancestry comparisons due to the registry-based approach to case-identification and control
selection, lack of environment covariates precludes examination of potentially important gene-
environment interactions. Another limitation was that we had a comparatively smaller number
of subjects in the Amerindian reference panel. This could potentially affect regional admixture
accuracy and bias results toward the null.
In conclusion, we observed that a higher proportion of European ancestry was
associated with increased risk of childhood PA, with admixture mapping and subsequent
association analysis identifying a region of 6q14.3 potentially contributing to this risk.
Chapter 3: Investigating childhood carcinogenesis using epigenetic analysis
Although GWAS analyses have helped us understand the etiology of some diseases, such as
adult glioma (81) and childhood ALL (13), they cannot fully account for disease risk.
Additionally, for some diseases in which genetics should play a vital role, such as childhood
glioma, GWAS had not revealed any significant findings at the time of this thesis. Interestingly,
although monozygotic twins should have the same DNA sequence, there are some disparities
in their cancer occurrences (82). Together, these observations suggest that other congenital
factors, for example epigenetic changes, also affect childhood cancer risk.
Commonly studied epigenetic changes include DNA methylation, histone modifications,
and non-coding RNAs. DNA methylation refers to the attachment of a methyl group onto the
C5 position of cytosine to form 5-methylcytosine (83). It is stable, heritable, and easy to
measure. Therefore, I will focus on DNA methylation in this chapter and investigate how DNA
methylation can help us understand disease risks on an epigenetic level. Using DNA methylation
array data, the most straightforward approach is to conduct association analysis between
methylation of each CpG and the trait of interest. This is referred to as an epigenome-wide
association study (EWAS). However, similar to SNP array data, there is other information that
can be inferred from DNA methylation data. For example, tools are available to deconvolute
nucleated cell proportions using DNA methylation profiles, taking advantage of the fact that DNA
methylation profiles are distinctive between different nucleated cell types.
Here I present my analysis, in which I used DNA methylation data to understand
mechanisms of ALL in children with Down Syndrome. First, deconvolution of nucleated
cell proportions showed that at birth, children with Down Syndrome who would later develop ALL
already had higher proportions of B cells compared with cancer-free children with Down
Syndrome. Then, using an EWAS model, differentially methylated genes were identified,
indicating that leukemogenesis was likely initiated neonatally in Down Syndrome
children, and that it is a complicated process involving widespread methylation changes across
the genome.
3.1 Epigenome-Wide Association Study of Acute Lymphoblastic Leukemia in Children with
Down Syndrome¹
Shaobo Li², Pagna Sok³, Keren Xu², Ivo S. Muskens¹, Natalina Elliott⁴, Swe Swe Myint², Priyatama Pandey², Helen M. Hansen⁵, Libby M. Morimoto⁶, Alice Y. Kang⁶, Catherine Metayer⁶, Xiaomei Ma⁷, Beth A. Mueller⁸, Anindita Roy⁴, Irene Roberts⁴, Karen R. Rabin³, Austin L. Brown³, Philip J. Lupo³, Joseph L. Wiemels², Adam J. de Smith²
¹ This manuscript has been submitted to Blood Advances. The published version could be modified per editors’ and reviewers’ comments.
² Center for Genetic Epidemiology, Department of Preventive Medicine, Keck School of Medicine of the University of Southern California, Los Angeles, CA, USA
³ Department of Pediatrics, Section of Hematology-Oncology, Baylor College of Medicine, Houston, TX, USA
⁴ Department of Paediatrics and MRC Molecular Haematology Unit, Weatherall Institute of Molecular Medicine, Oxford University and BRC Blood Theme, NIHR Oxford Biomedical Centre, Oxford, UK
⁵ Department of Neurological Surgery, University of California San Francisco, San Francisco, CA, USA
⁶ School of Public Health, University of California, Berkeley, Berkeley, CA, USA
⁷ Department of Chronic Disease Epidemiology, Yale School of Public Health, New Haven, CT, USA
⁸ Public Health Sciences Division, Fred Hutchinson Cancer Research Center, and Department of Epidemiology, University of Washington, Seattle, WA, USA
Abstract
Down syndrome (DS) is associated with an increased risk of B-cell acute lymphoblastic
leukemia (ALL). We performed an epigenome-wide association study (EWAS) to identify
epigenetic differences in newborns with DS associated with subsequent DS-ALL risk.
Newborn dried bloodspots were available for 145 DS-ALL cases and 198 DS controls in
the discovery set, and 24 DS-ALL cases and 24 DS controls in the replication set. DNA was
assayed using Illumina Infinium MethylationEPIC Beadchips. Reference-based deconvolution of
DNA methylation data was performed to estimate proportions of blood cell-types and assess
potential differences between cases and controls. Multi-ancestry EWASs were performed to
identify CpGs and differentially methylated regions (DMRs) associated with DS-ALL.
We found significantly higher B-cell proportions at birth in DS-ALL cases than controls in
both discovery (P=6.4 x 10⁻⁴) and replication studies (P=0.03). The B-cell difference was
consistent between Latinos and non-Latino whites (P het=0.49) in the discovery study, and
remained significant with similar effect sizes after removal of GATA1 mutation-positive DS
controls and after adjustment for DS-ALL risk SNPs at ARID5B, IKZF1, GATA3, and CDKN2A. In
the discovery study EWAS, we identified 26 significant CpGs, including the top hit in a putative
regulatory region of EBF1, and 79 DMRs associated with DS-ALL, although these loci did not
replicate in the replication study.
Increased B-cell proportions in a subset of newborns with DS may be a risk factor for
DS-ALL development in childhood. This finding requires confirmation using conventional cell
count measures, and should be explored as a novel risk factor for non-DS ALL.
Introduction
Down syndrome (DS) is caused by constitutional trisomy of chromosome 21 and is the
most common chromosomal disorder, occurring in approximately 1 in 700 live births in the US
each year (84). Notably, DS is associated with an up to 30-fold increased risk of acute
lymphoblastic leukemia (ALL) (85) (86). Further, children with DS and ALL (DS-ALL) have inferior
overall survival than childhood ALL patients without DS, and face an increased risk of severe
chronic health conditions due to treatment-related toxicities (87) (88). Identifying risk factors
for DS-ALL, therefore, may reveal novel treatment targets and further our understanding of
leukemia etiology for children with and without DS.
Several lines of evidence suggest that disrupted DNA methylation and blood cell
development could partially explain the increased risk of ALL in children with DS. For example,
DS has been associated with changes in DNA methylation and other epigenetic modifications in
blood cells as well as in the expression levels of genes across the genome, although the
leukemogenic effects of trisomy 21 are not yet fully understood (89)(90)(91)(92). Additionally, a
recent genome-wide association study (GWAS) found common genetic variants associated with
DS-ALL risk, with significant single nucleotide polymorphisms (SNPs) identified in genes
previously associated with ALL in the non-DS population, including ARID5B, IKZF1, CDKN2A, and
GATA3 (22). These genes have been implicated in hematopoiesis, and SNPs at these loci have
also been associated with variation in blood cell traits including lymphocyte counts and ratios
(93)(94). These findings support the premise that genetic variation at genes involved in blood
cell development contributes to ALL risk in children with DS, in conjunction with the effects of
constitutional trisomy 21. In spite of this, there have been few attempts to characterize the
roles of the epigenome as a whole or CpG loci associated with blood cell type proportions
(which can be derived from DNA methylation data) in DS-ALL etiology.
The epigenome is sensitive to genetic and environmental influences during fetal
development, and measures of neonatal DNA methylation have been used to investigate the
impacts of prenatal exposures and to study associations with later health outcomes (95).
Building from this, we sought to evaluate the role of the neonatal methylome in children with
DS on subsequent ALL risk.
Methods
Study population
This study was approved by Institutional Review Boards at the California Health and
Human Services Agency, University of Southern California, University of California Berkeley,
Yale University, Washington State, and Baylor College of Medicine, and by the Committee for
the Protection of Human Subjects of the Health and Human Services Agency of the State of
Michigan, and did not involve the use of any personal identifiers.
The DS-ALL Discovery study included 145 DS-ALL cases and 198 DS controls from the
International Study of Down Syndrome Acute Leukemia (IS-DSAL) (22,23), limited to subjects
with available neonatal dried bloodspot samples (DBS), born in California (N=325) and
Washington state (N=18). California-born DS-ALL cases were identified in the California Cancer
Records Linkage Project ALL GWAS (N=109) or in the California Childhood Leukemia Study
(N=18) (22)(23)(21). Washington DS-ALL cases (N=18) were identified in the Washington State
Childhood Cancer study using population-based linked birth-hospital discharge-cancer registry
records (24). DS controls, without a leukemia diagnosis by 15 years of age, were identified from
the California Biobank Program through linkage between the California Department of Public
Health Genetic Disease Screening Program and the California Cancer Registry. DNA was
extracted from one-third portions of neonatal DBS using the QIAamp DNA Investigator Guthrie
Card protocol (Qiagen, Germantown, MD). Demographic and birth-related variables including
sex, self-reported race/ethnicity, blood collection age (in days), gestational age (in weeks), and
birthweight (in grams) were obtained from birth records where available. For 184 of 198 DS
controls, somatic GATA1 mutation data were available from targeted sequencing, as previously
described (92).
The DS-ALL Replication dataset included 24 DS-ALL cases and 24 DS non-leukemia
controls from the Michigan-based DS-ALL (MiDSALL) study (22). Neonatal dried bloodspots
(DBS) were obtained from the Michigan Neonatal Biobank, identified through linkage between
the Michigan Department of Health and Human Services birth records and the Michigan Cancer
Surveillance Program (22). DNA was extracted from neonatal DBS using the GenSolve DNA
extraction kit (GenTegra, Pleasanton, CA). Birth related variables for the Replication dataset
were obtained from the Michigan Department of Health and Human Services.
Genome-wide DNA methylation arrays and data processing
In both the DS-ALL Discovery and Replication studies, isolated DNA samples from DS-ALL
cases and DS controls were bisulfite converted using Zymo DNA Methylation kits (Zymo
Research, Irvine, CA), and assayed using Illumina Infinium MethylationEPIC Beadchip genome-wide
DNA methylation arrays (Illumina, San Diego, CA) following block randomization to ensure
equivalent distributions of sex, race-ethnicity, and ALL case-control status on all plates.
Raw IDAT files were imported into R, and mean detection P values were calculated with
the “detectionP” function in the “minfi” package. Normalization was then carried out using
the “SeSAMe” package, which includes background correction using “noob” and masking based on
out-of-band (OOB) array hybridization to remove false-positive probes in deleted and
hyperpolymorphic regions (96). We used the default detection P-value cut-off of 0.05. A total of
174,077 CpGs were excluded following SeSAMe preprocessing in the Discovery study. In
subsequent analyses, normalized β values were used in the regression models. CpGs on sex
chromosomes (N=12,565) and at SNP sites with >5% minor allele frequency (N=57,764) were
removed from the association analyses. Following this, a total of 619,915 CpGs were included in
the EWAS of DS-ALL in the Discovery study.
The “IlluminaHumanMethylationEPICanno.ilm10b4.hg19” package was used to annotate
CpGs on the EPIC array, including their locations, overlapping or closest genes, and their overlap
with regulatory regions.
Single nucleotide polymorphism (SNP) genotyping
Genome-wide SNP array genotyping data were available from Affymetrix Axiom World
LAT arrays for a subset of 261 DS subjects in the Discovery study, including 134 DS-ALL cases
and 130 DS controls, as previously described (22). Analyses in this study were limited to four
SNPs that were previously reported to be associated with DS-ALL at genome-wide significant
levels (22): rs7089424 in ARID5B, rs11978267 in IKZF1, rs3731249 in CDKN2A, and rs3824662 in
GATA3.
Deconvolution of blood cell proportions
To estimate the proportions of nucleated cells in neonatal blood samples from DS-ALL
cases and DS controls, we performed reference-based deconvolution using the Identifying
Optimal Libraries (IDOL) algorithm (17). Proportions of nucleated red blood cells (nRBC),
monocytes, granulocytes, B lymphocytes, CD4+ T lymphocytes, CD8+ T lymphocytes, and
natural killer (NK) cells were estimated using the “estimateCellCounts2” function in the
“FlowSorted.Blood.EPIC” R package, with the “FlowSorted.CordBloodCombined.450k” package
used to provide curated DNA methylation data from four different umbilical cord blood cell
reference samples. Deconvoluted cell proportions in DS-ALL cases and DS controls were
compared in the Discovery dataset using a linear regression model, controlling for sex, batch
variable, and the first 6 ancestry-related EPISTRUCTURE PCs (computed in GLINT (97)). In the
Replication dataset, 3 EPISTRUCTURE PCs were included due to the smaller sample size. This
comparison was performed in both Discovery and Replication datasets.
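The idea behind reference-based deconvolution can be illustrated with a small sketch. It treats a bulk methylation profile as a non-negative mixture of cell-type reference profiles and recovers the mixing proportions by projected gradient descent. This is a toy stand-in under stated assumptions, not the IDOL or “estimateCellCounts2” implementation: the reference matrix, the three cell types, the true mixture, and the function name `estimate_proportions` are all hypothetical.

```python
# Toy reference-based deconvolution: bulk beta values are modeled as a
# non-negative mixture of cell-type reference profiles (all values invented).

def estimate_proportions(reference, bulk, n_iter=5000):
    """Non-negative least squares via projected gradient descent.

    reference: one row per CpG, each row holding one beta value per cell type.
    bulk: bulk-sample beta values, one per CpG.
    """
    n_types = len(reference[0])
    # 1 / trace(X^T X) never exceeds 1 / (largest eigenvalue), so this step is safe.
    step = 1.0 / sum(v * v for row in reference for v in row)
    weights = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residuals = [
            sum(r * w for r, w in zip(row, weights)) - b
            for row, b in zip(reference, bulk)
        ]
        gradient = [
            sum(row[j] * res for row, res in zip(reference, residuals))
            for j in range(n_types)
        ]
        # Gradient step, then projection onto the non-negative orthant.
        weights = [max(0.0, w - step * g) for w, g in zip(weights, gradient)]
    return weights


# Hypothetical reference profiles: six CpGs (rows) by three cell types (columns).
CELL_TYPES = ["B cell", "T cell", "Granulocyte"]
REFERENCE = [
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.9],
    [0.8, 0.2, 0.1],
    [0.2, 0.8, 0.2],
    [0.1, 0.2, 0.8],
]
# Simulate a noise-free bulk sample from a known mixture so recovery can be checked.
TRUE_MIX = [0.05, 0.25, 0.70]
BULK = [sum(r * p for r, p in zip(row, TRUE_MIX)) for row in REFERENCE]

ESTIMATED = estimate_proportions(REFERENCE, BULK)
for name, value in zip(CELL_TYPES, ESTIMATED):
    print(f"{name}: {value:.3f}")
```

Real deconvolution tools select informative CpGs and use dedicated constrained solvers; the sketch only shows why distinctive cell-type profiles make the proportions identifiable.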
Epigenome-wide association analyses of differentially methylated probes (DMPs) and regions
(DMRs)
To identify DMPs associated with DS-ALL risk, linear regression was conducted with each
CpG β value as the dependent variable and ALL status as the independent variable. Other
independent variables (covariates) included in the regression model were genetic
ancestry-related EPISTRUCTURE PCs (10 PCs in the Discovery dataset, and 3 PCs in
the Replication dataset), 6 deconvoluted cell proportions (granulocytes were not included to
avoid collinearity), sex, and batch effect (controlling for plating during methylation assays). This
EWAS was conducted in both Discovery and Replication datasets, and results of the two
datasets were meta-analyzed using METAL. Gene pathway analyses were performed with
the “methylGSA” package, using the top 100 most significant CpGs from the Discovery study EWAS
as inputs, and including both GO and KEGG databases. GO and KEGG pathways with FDR-corrected
P value<0.01 were considered significant. To determine CpGs previously associated
with gene expression, we used databases of expression quantitative trait methylation sites
reported by three different studies: an association study of whole blood gene expression in
the Framingham Heart Study (98), the UK Household Longitudinal Study (99), and the FUSION
Skeletal Muscle Study (100).
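The per-CpG regression described above can be sketched for a single CpG as follows. This is a minimal illustration with invented data, assuming ordinary least squares fitted through the normal equations; the helper names `solve_linear_system` and `fit_ols`, the covariate set, and the coefficient values are hypothetical, and a real EWAS would repeat the fit for every CpG and also report standard errors and P values.

```python
# Toy per-CpG regression: beta value ~ case status + covariates (invented data).

def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot_row = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def fit_ols(design, outcome):
    """Least-squares coefficients from the normal equations X^T X b = X^T y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, outcome)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Columns: intercept, DS-ALL status (1 = case), sex (1 = male), B-cell proportion (%).
DESIGN = [
    [1, 0, 0, 1.0],
    [1, 0, 1, 0.8],
    [1, 0, 0, 1.2],
    [1, 0, 1, 0.6],
    [1, 1, 0, 1.5],
    [1, 1, 1, 1.1],
    [1, 1, 0, 2.0],
    [1, 1, 1, 0.9],
]
# Noise-free beta values built from known coefficients so the fit can be checked.
TRUE_COEFS = [0.50, 0.02, -0.01, 0.03]
BETA_VALUES = [sum(c * x for c, x in zip(TRUE_COEFS, row)) for row in DESIGN]

COEFS = fit_ols(DESIGN, BETA_VALUES)
print(f"Estimated case-control difference in beta value: {COEFS[1]:.4f}")
```

The coefficient on case status is the quantity of interest; the remaining columns play the role of the covariates (ancestry PCs, cell proportions, sex, batch) that the model adjusts for.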
To identify DMRs associated with DS-ALL, the DMRcate (101) and comb-P (102)
approaches were implemented. DMRcate was run adjusting for EPISTRUCTURE PCs,
deconvoluted blood cell proportions, sex, and batch as described above. Additionally, comb-P
was run using P values from the EWAS analysis. Only DMRs that spanned a minimum of 2 CpGs
and were called by both algorithms (FDR-adjusted P<0.01 for both algorithms) were included in
our results.
Results
The demographic and birth-related variable data for DS-ALL cases and DS controls in the
Discovery and Replication studies are summarized in Table 3.1. Unsupervised hierarchical
clustering did not differentiate DS-ALL cases from DS controls, but did demonstrate variation in
deconvoluted blood cell proportions and identified a subset of DS newborns with high nRBC
proportions, as previously shown (92) (Figure 3.1).
Table 3.1: Demographic and birth characteristics of DS-ALL cases and DS controls
Characteristic | Discovery: DS controls (N=198) | Discovery: DS-ALL (N=145) | Discovery: P-value | Replication: DS controls (N=24) | Replication: DS-ALL (N=24) | Replication: P-value
Sex, N (%) | | | 0.00066ᵃ | | | 0.771ᵃ
Male | 91 (46.0) | 94 (64.8) | | 14 (58.3) | 13 (54.2) |
Female | 107 (54.0) | 51 (35.2) | | 10 (41.7) | 11 (45.8) |
Race/ethnicity, N (%) | | | 0.00015ᵃ | | | 0.287ᵃ
Asian | 10 (5.1) | 2 (1.4) | | 1 (4.2) | 1 (4.2) |
Latino | 96 (48.5) | 89 (61.4) | | 3 (12.5) | 2 (8.3) |
Non-Latino White | 54 (27.3) | 44 (30.3) | | 15 (62.5) | 20 (83.3) |
Non-Latino Black | 10 (5.1) | 2 (1.4) | | 5 (20.8) | 1 (4.2) |
Other | 28 (14.1) | 4 (2.8) | | 0 | 0 |
Missing | 0 | 4 (2.8) | | 0 | 0 |
Blood collection age (days) | | | | | |
Mean (SD) | 2.49 (2.04) | 2.01 (2.13) | 0.048ᵇ | N/A | N/A |
Median (range) | 1.75 (0.17 – 15.25) | 1.46 (0 – 18.96) | | N/A | N/A |
Missing, N (%) | 5 (2.5) | 26 (17.9) | | 24 (100.0) | 24 (100.0) |
Gestational age (weeks) | | | | | |
Mean (SD) | 38.15 (2.22) | 37.89 (2.86) | 0.37ᵇ | N/A | N/A |
Median (range) | 38.29 (26.42 – 44.71) | 38.00 (25.00 – 44.43) | | N/A | N/A |
Preterm (<37), N (%) | 39 (19.7) | 29 (20.0) | 0.78ᵃ | N/A | N/A |
Missing, N (%) | 21 (10.6) | 24 (16.6) | | 24 (100.0) | 24 (100.0) |
Birthweight (kg) | | | | | |
Mean (SD) | 3.01 (0.73) | 3.09 (0.60) | 0.33ᵇ | N/A | N/A |
Median (range) | 3.01 (0.96 – 8.65) | 3.13 (0.94 – 4.58) | | N/A | N/A |
Missing, N (%) | 8 (4.0) | 19 (13.1) | | 24 (100.0) | 24 (100.0) |
ᵃ P-values calculated using a 2-sided Fisher exact test.
ᵇ P-values calculated using a 2-sided t test.
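As an illustration of the categorical comparisons in Table 3.1, the sketch below computes a two-sided Fisher exact test for the sex distribution in the Discovery study from the hypergeometric distribution. It is a minimal sketch assuming the common "sum of all tables no more probable than the observed one" definition of the two-sided P value; the function name `fisher_exact_two_sided` is hypothetical, and the statistical software actually used for the table is not stated in the text.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def table_probability(x):
        # Hypergeometric probability of x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = table_probability(a)
    low = max(0, col1 - row2)
    high = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        p
        for p in (table_probability(x) for x in range(low, high + 1))
        if p <= observed * (1 + 1e-9)
    )


# Sex by case status in the Discovery study (Table 3.1):
# rows = male, female; columns = DS controls, DS-ALL cases.
P_SEX = fisher_exact_two_sided(91, 94, 107, 51)
print(f"Fisher exact P for sex: {P_SEX:.5f}")
```

If the table was tested this way, the result should fall close to the reported P-value for sex (0.00066).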
Figure 3.1 Heatmap showing the 2,000 most variable CpGs in DS-ALL cases and DS controls
in the Discovery Study
B cell proportions are increased in newborns with DS who later develop ALL.
Deconvolution of blood cell proportions was performed to investigate potential
differences between DS-ALL cases and DS controls. In the Discovery study, we found a
significant increase in B cell proportions at birth in DS-ALL cases (mean=0.013, standard
deviation [sd]=0.015) compared with DS controls (mean=0.00826, sd=0.0115; beta=0.0052,
P=6.36 x 10⁻⁴) (Figure 3.2, Table 3.2). We replicated this finding in the smaller DS-ALL
Replication study (beta=0.0152, P=0.03), and meta-analysis across the two studies provided an
overall effect size of 0.0056 (P meta=1.25 x 10⁻⁴) (Table 3.2). Among all cell-types, B cells showed
the greatest proportional difference between cases and controls in both the Discovery study
(54.22% increase in DS-ALL) and Replication study (22.14% increase) (Table 3.1). In the
meta-analyses for other blood cell types, CD8+ T-cells were significantly increased in DS-ALL cases
compared with DS controls (beta=0.0076, P meta=5.29 x 10⁻³) (Table 3.2).
Table 3.2: Deconvoluted blood cell proportions in DS-ALL cases versus DS controls
Cell Type | Discovery Study (145 cases, 198 controls): Effect estimateᵃ | Discovery Study: P-valueᵃ | Replication Study (24 cases, 24 controls): Effect estimateᵃ | Replication Study: P-valueᵃ | Meta-analysis: Effect estimateᵇ | Meta-analysis: P metaᵇ | Direction (Replication/Discovery)
CD4 T cell | 0.0038 | 0.475 | -0.0147 | 0.29 | 0.0014 | 0.782 | -/+
CD8 T cell | 0.0069 | 0.0153* | 0.0168 | 0.11 | 0.0076 | 0.00529* | +/+
B cell | 0.0052 | 6.36 x 10⁻⁴* | 0.0152 | 0.03* | 0.0056 | 1.25 x 10⁻⁴* | +/+
NK cells | 0.0029 | 0.2315 | 0.00475 | 0.55 | 0.003 | 0.186 | +/+
Gran | 0.0106 | 0.5314 | -0.0482 | 0.21 | 0.0007 | 0.965 | -/+
Monocyte | 0.0003 | 0.9456 | -0.000246 | 0.98 | 0.0002 | 0.955 | -/+
nRBC | -0.0322 | 0.138 | 0.0163 | 0.65 | -0.0191 | 0.302 | +/-
ᵃ P-values and coefficients calculated using linear regression, testing each blood cell type separately as the dependent variable, with DS-ALL status as the independent variable, and including sex, batch, and ancestry-related principal components from EPISTRUCTURE (n=6 for Discovery study, n=3 for Replication study) as covariates. P-values were not adjusted for multiple comparisons. P-values <0.05 are marked with an asterisk.
ᵇ Meta-analysis performed using METAL.
Abbreviations: NK, natural killer; nRBC, nucleated red blood cells; Gran, granulocytes.
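The meta-analysis row for B cells can be approximately reproduced with a fixed-effect, inverse-variance-weighted calculation. The sketch below is an illustration rather than the METAL run itself: it assumes METAL was used in its standard-error-based mode, and because Table 3.2 does not list standard errors, it back-calculates them from each effect estimate and two-sided P value under a normal approximation. The function names are hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist


def standard_error_from_p(effect, p_value):
    """Back-calculate a standard error from an effect and its two-sided P value."""
    z = NormalDist().inv_cdf(1 - p_value / 2)
    return abs(effect) / z


def inverse_variance_meta(effects, standard_errors):
    """Fixed-effect inverse-variance-weighted meta-analysis."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    z = pooled / pooled_se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided normal P value
    return pooled, pooled_se, p_value


# B-cell row of Table 3.2: effect estimates and P values, Discovery then Replication.
EFFECTS = [0.0052, 0.0152]
P_VALUES = [6.36e-4, 0.03]
SES = [standard_error_from_p(e, p) for e, p in zip(EFFECTS, P_VALUES)]

POOLED, POOLED_SE, POOLED_P = inverse_variance_meta(EFFECTS, SES)
print(f"pooled effect = {POOLED:.4f}, P = {POOLED_P:.2e}")
```

Because the Discovery study has a much smaller standard error, it dominates the pooled estimate, which is why the meta-analysis effect (0.0056) sits close to the Discovery estimate. Small differences from the reported P meta are expected given the normal approximation and rounded inputs.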
Figure 3.2 Boxplots showing deconvoluted cell proportions in DS-ALL cases and DS controls in
the Discovery and Replication datasets
We also stratified results of the Discovery study by self-reported status as non-Latino
white (N controls = 54, N cases = 44) or Latino (N controls = 96, N cases = 89) (Table 3.3).
Increased neonatal B-cell proportions showed a stronger effect in Latinos (beta=0.0062,
P=3.1 x 10⁻³) than in non-Latino whites (beta=0.004, P=0.10), although this difference was not
statistically significant in a test for heterogeneity (P het=0.49). Increased CD8+ T-cell proportions
were only observed in Latinos (P=0.035).
Table 3.3: Deconvoluted blood cell proportions in DS-ALL cases versus DS controls stratified by
self-reported race/ethnicity in the DS-ALL Discovery Study.
Cell Type | Latinos (89 cases, 96 controls): Effect estimateᵃ | Latinos: P valueᵃ | Non-Latino whites (44 cases, 54 controls): Effect estimateᵃ | Non-Latino whites: P valueᵃ | Meta-analysis: Effect estimate | Meta-analysis: P meta | Meta-analysis: P hetᵇ
CD4 T cell | 0.0009 | 0.902 | -0.0065 | 0.355 | -0.0032 | 0.539 | 0.478
CD8 T cell | 0.0087 | 0.035 | 4.07 x 10⁻⁷ | 0.999 | 0.0047 | 0.117 | 0.150
B cell | 0.0062 | 0.0031 | 0.0040 | 0.104 | 0.0053 | 8.13 x 10⁻⁴ | 0.490
NK cells | 0.0029 | 0.380 | 0.0053 | 0.260 | 0.0037 | 0.170 | 0.673
Granulocyte | 0.0166 | 0.522 | -0.0075 | 0.681 | 0.0005 | 0.973 | 0.446
Monocyte | 0.0038 | 0.488 | -0.0055 | 0.379 | -0.0003 | 0.946 | 0.262
nRBC | -0.0400 | 0.227 | 0.0079 | 0.735 | -0.0080 | 0.672 | 0.235
ᵃ P-values and coefficients calculated using linear regression, testing each blood cell type separately as the dependent variable, with DS-ALL status as the independent variable, and including sex, batch, and 6 ancestry-related principal components from EPISTRUCTURE as covariates. P-values were not adjusted for multiple comparisons.
ᵇ P-values for heterogeneity calculated using METAL.
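The heterogeneity test between the two strata can be sketched with Cochran's Q. This is a hedged illustration, not the METAL output: standard errors are again back-calculated from the effect estimates and two-sided P values in the B-cell row of Table 3.3 under a normal approximation, and the function names are hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist


def standard_error_from_p(effect, p_value):
    """Back-calculate a standard error from an effect and its two-sided P value."""
    z = NormalDist().inv_cdf(1 - p_value / 2)
    return abs(effect) / z


def cochran_q_two_groups(effects, standard_errors):
    """Cochran's Q for two estimates and its P value (chi-square, 1 df)."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    # With 1 degree of freedom, P(chi-square > q) equals erfc(sqrt(q / 2)).
    return q, erfc(sqrt(q / 2))


# B-cell row of Table 3.3: Latinos, then non-Latino whites.
EFFECTS = [0.0062, 0.0040]
P_VALUES = [0.0031, 0.104]
SES = [standard_error_from_p(e, p) for e, p in zip(EFFECTS, P_VALUES)]

Q_STAT, P_HET = cochran_q_two_groups(EFFECTS, SES)
print(f"Q = {Q_STAT:.3f}, P het = {P_HET:.3f}")
```

A large P het, as here, indicates that the difference between the two stratum-specific estimates is well within what their sampling error would allow.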
We performed several sensitivity analyses in the Discovery study to assess potential
confounders of the increased B-cell proportions in DS-ALL cases, and found that the difference
in B-cell proportions remained statistically significant in each case. First, in a subset of subjects
with available birth-variable data, we adjusted the regression model for gestational age,
birthweight, and blood spot collection age. The difference in B-cell proportions between DS-ALL
cases (n=117) and DS controls (n=173) became even more significant (beta=0.0062,
P=1.9 x 10⁻⁴) when adjusted for these birth variables (data not shown).
Next, in a subset of subjects in the Discovery study with available genotype data (N case
= 134, N control = 130), we assessed whether SNPs associated with DS-ALL risk in ARID5B
(rs7089424), IKZF1 (rs11978267), CDKN2A (rs3731249), or GATA3 (rs3824662) (22) may be
associated with B-cell proportions and confound the association with DS-ALL, as these same
genetic loci have previously been associated with variation in white blood cell traits (93). We
added the genotypes of these four SNPs in the B-cell regression model one at a time and also
tested them jointly for combined effects, and the significantly increased B-cell proportions in
DS-ALL cases remained, with similar results found when stratifying subjects into Latinos and
non-Latino whites (Table 3.4).
Table 3.4: Deconvoluted B-cell proportions in DS-ALL cases versus DS controls adjusted by GWAS
SNPs in the DS-ALL Discovery Study, overall and stratified by self-reported race/ethnicity
GWAS SNP adjustment | Gene | Overall (130 cases, 134 controls): B-cell effect estimateᵃ | Overall: P valueᵃ | Latinos (86 cases, 88 controls): B-cell effect estimateᵃ | Latinos: P valueᵃ | Non-Latino whites (44 cases, 43 controls): B-cell effect estimateᵃ | Non-Latino whites: P valueᵃ
rs7089424 | ARID5B | 0.00535 | 0.0018 | 0.00631 | 0.0053 | 0.00345 | 0.204
rs11978267 | IKZF1 | 0.00580 | 0.00064 | 0.00668 | 0.0024 | 0.00387 | 0.152
rs3731249 | CDKN2A | 0.00568 | 0.00073 | 0.00652 | 0.0031 | 0.00376 | 0.160
rs3824662 | GATA3 | 0.00556 | 0.0014 | 0.00636 | 0.0053 | 0.00417 | 0.133
All 4 SNPs | All | 0.00561 | 0.0019 | 0.00660 | 0.0057 | 0.00414 | 0.152
ᵃ P-values and coefficients calculated using linear regression, testing each blood cell type separately as the dependent variable, with DS-ALL status as the independent variable, and including sex, batch, and one of four SNPs (at ARID5B, IKZF1, CDKN2A, or GATA3) or including all four SNPs together, and 6 ancestry-related principal components from EPISTRUCTURE as covariates. P-values were not adjusted for multiple comparisons.
Finally, we removed GATA1 mutation-positive controls (N=30 out of 184 tested), and
found that the difference in B-cell proportions remained significant with little difference in the
effect estimate (β=0.0045, P=5.8 x 10⁻³) (data not shown).
Epigenome-wide significant CpG probes associated with DS-ALL.
We conducted an epigenome-wide association study (EWAS) of DS-ALL separately in the
Discovery study (145 cases, 198 controls) and Replication study (24 cases, 24 controls). In the
Discovery study, there were 247 significant DMPs after FDR correction and 26 epigenome-wide
significant DMPs after Bonferroni correction (P<8.066 x 10⁻⁸) (Figure 3.3, Table 3.5).
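The two significance criteria used above can be made concrete with a short sketch. The Bonferroni threshold follows directly from the 619,915 CpGs tested at a family-wise alpha of 0.05. The FDR step is shown with a Benjamini-Hochberg adjustment on a handful of invented P values; the text does not name the FDR procedure or the alpha level, so both are assumptions here, as are the function names.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold controlling the family-wise error rate."""
    return alpha / n_tests


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Number of CpGs included in the Discovery EWAS (from the Methods).
THRESHOLD = bonferroni_threshold(0.05, 619_915)
print(f"Bonferroni threshold: {THRESHOLD:.3e}")

# Invented P values, only to show how the FDR adjustment behaves.
TOY_P = [0.001, 0.008, 0.039, 0.041, 0.6]
TOY_ADJUSTED = benjamini_hochberg(TOY_P)
print([round(p, 5) for p in TOY_ADJUSTED])
```

Bonferroni correction is the stricter of the two, which is consistent with far fewer DMPs passing it (26) than passing FDR correction (247).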
Figure 3.3 Bi-directional Manhattan plot (A) and quantile-quantile (QQ) plot (B) for results of the
EWAS of DS-ALL in the Discovery Study, including autosomal CpG probes only
Pathway enrichment analysis of the FDR significant DMPs revealed significant
enrichment of 24 GO pathways including ones involved in NF-kappaB signaling, positive
regulation of DNA-binding transcription factor activity, autophagy, viral infection, and ubiquitin
activity (Table 3.6). The top DS-ALL-associated CpG (cg27347265, P=5.062 x 10⁻¹⁴) was located in
the first intron of the B-cell transcription factor gene EBF1, in a putative regulatory region
(Figure 3.4, Table 3.5).
Figure 3.4: Location of the top EWAS hit (cg27347265) in the DS-ALL Discovery Study
Table 3.6: Significant GO pathways enriched for DS-ALL-associated DMPs
GO ID Description Size pvalue padj
GO:0007249 I-kappaB kinase/NF-kappaB signaling 391 1.33E-61 4.86E-60
GO:0010332 response to gamma radiation 61 1.33E-61 4.86E-60
GO:0010506 regulation of autophagy 433 1.33E-61 4.86E-60
GO:0010508 positive regulation of autophagy 166 1.33E-61 4.86E-60
GO:0016239 positive regulation of macroautophagy 80 1.33E-61 4.86E-60
GO:0016241 regulation of macroautophagy 207 1.33E-61 4.86E-60
GO:0019058 viral life cycle 427 1.33E-61 4.86E-60
GO:0030433 ubiquitin-dependent ERAD pathway 125 1.33E-61 4.86E-60
GO:0031331 positive regulation of cellular catabolic process 492 1.33E-61 4.86E-60
GO:0036503 ERAD pathway 169 1.33E-61 4.86E-60
GO:0043122 regulation of I-kappaB kinase/NF-kappaB signaling 323 1.33E-61 4.86E-60
GO:0043123 positive regulation of I-kappaB kinase/NF-kappaB signaling 240 1.33E-61 4.86E-60
GO:0043903 regulation of symbiotic process 287 1.33E-61 4.86E-60
GO:0046782 regulation of viral transcription 72 1.33E-61 4.86E-60
GO:0048525 negative regulation of viral process 132 1.33E-61 4.86E-60
GO:0050792 regulation of viral process 270 1.33E-61 4.86E-60
GO:0051091 positive regulation of DNA-binding transcription factor activity 347 1.33E-61 4.86E-60
GO:0051092 positive regulation of NF-kappaB transcription factor activity 212 1.33E-61 4.86E-60
GO:0051865 protein autoubiquitination 82 1.33E-61 4.86E-60
GO:0052126 movement in host environment 196 1.33E-61 4.86E-60
GO:0061630 ubiquitin protein ligase activity 384 1.33E-61 4.86E-60
GO:0061659 ubiquitin-like protein ligase activity 398 1.33E-61 4.86E-60
GO:1903900 regulation of viral life cycle 196 1.33E-61 4.86E-60
GO:1903901 negative regulation of viral life cycle 107 1.33E-61 4.86E-60
Additional significant CpGs were identified in the first exon of TRIM13, a gene frequently
deleted in B-cell chronic lymphocytic leukemia, and in the promoter region of BCL11A, in which
chromosomal abnormalities have been implicated in lymphoid malignancies (103). One CpG
was previously found to be an expression quantitative trait methylation site in whole blood
(98): cg14179381 in the promoter region of TRMT10B was associated with expression of the
DCAF10 gene. For all 26 Bonferroni-significant DMPs, the beta value difference between
DS-ALL cases and DS controls was lower than 0.05. Of these DMPs, 5 had P<0.05 in the
Replication study, although none had the same direction of effect and none were
significant following correction for multiple testing (P>0.05/26).
Differentially methylated regions (DMRs) associated with DS-ALL
In the Discovery study, we identified 79 significant DMRs associated with DS-ALL,
although the beta differences between cases and controls for these regions were also all < 0.05
(Table 3.7). The genes overlapped by the DMRs are not known to play a role in ALL
development. Analysis of DMRs in the Replication study did not identify any of the same
regions found in the Discovery study, although for 13/79 DMRs there were CpGs within the
regions that showed the same direction of methylation changes between DS-ALL cases and DS
controls and with P<0.05 in the Replication study (Table 3.7).
Discussion
In the first EWAS of DS-ALL, we found significantly increased B-cell proportions in
newborns with DS who later developed ALL compared with DS newborns without any leukemia
diagnosis during childhood, a finding that persisted after adjustment for several potential
confounding factors and, importantly, was consistent between two independent DS-ALL case-
control datasets from California/Washington and Michigan. Although we also identified several
significant differentially methylated probes and regions, results between the Discovery and
Replication DS-ALL studies did not corroborate each other, possibly due to the “large p, small n”
nature of the Replication dataset.
The increase in B-cell proportions was unexpected, as trisomy 21 is known to be associated with reduced B-cell
production (104)(105) and reduced numbers of B-cells (104)(106) in fetal life, and children with
DS also have reduced numbers of B-cells (107) (108). Consistent with these findings, we
previously observed lower B-cell proportions in newborns with DS than in newborns without DS
using reference-based cell-type deconvolution analysis (92). Results from the current study
support that, in the context of DS, children with greater B-cell proportions at birth have an
increased risk of developing DS-ALL. Importantly, the significant shift towards higher B-cell
proportions was consistent between the Discovery and Replication DS-ALL studies and set these
neonates apart from the DS newborns who did not develop ALL, as overall the DS-ALL cases in
the Discovery study still had reduced median B-cell proportions (0.0077) compared with the
newborns without DS (0.038) included in our EWAS of DS (92). In addition, the significant
association between increased B-cell proportions and ALL risk in children with DS is consistent
with the recent observation in the non-DS population that a genetic predisposition to
overproducing lymphocytes, overall and in relation to other blood cell types, is associated with
an increased risk of childhood ALL (93). Further studies are required to understand the
mechanisms underlying the association between increased B-cells and ALL development in
children with and without DS, but these may involve effects on the proliferation of preleukemic
clones and generation of leukemia-forming mutations, as well as potential impacts on immune
function and response to infections (93)(109) given the marked increase in susceptibility to
infections and autoimmunity in children with DS (110) (111) (112). Given that an increased rate
of somatic mutations has been reported in trisomy 21 fetal hematopoietic stem and progenitor
cells compared to non-DS fetal cells (113), it is plausible that this may act in concert with
lymphocyte overproduction to drive B-ALL development in children with DS.
Changes in DNA methylation at birth have been proposed as a potential mediating
mechanism of the effects of early life exposures on the risk of developing ALL (114). Given the
greatly increased risk of ALL in children with DS, any prenatal exposures associated with ALL
etiology may demonstrate greater effects in the context of trisomy 21, as we have previously
reported for genetic risk factors (22). However, we did not find strong evidence for differences
in DNA methylation that might predict subsequent ALL risk. The epigenome-wide significant
probes and regions associated with DS-ALL in our Discovery study did not replicate in the
independent Michigan Biobank DS-ALL dataset, although sample size in the latter study was
small and likely underpowered to detect significance for the generally small beta value
differences (beta<0.05) found between DS-ALL cases and controls in the Discovery study. These
small beta differences at significant loci suggest that these epigenetic risk loci may have modest
effects or include spurious findings; however, it is also possible that larger differences in DNA
methylation between cases and controls may exist at these loci in a small subset of cells. The
significant CpG in a regulatory region of EBF1, which encodes an important B-cell transcription
factor, is intriguing given that this gene is frequently deleted in childhood ALL tumors (115).
Further investigation of DNA methylation differences in sorted cell populations will be required
to determine cell-specific epigenetic changes associated with DS-ALL risk.
An important strength of our study was the use of newborn dried bloodspots, which
were collected prior to the onset of disease and, therefore, any differences between DS-ALL
cases and DS controls should not be confounded by the presence of leukemia cells. Our study
does have some limitations, in addition to the aforementioned small sample size of the
Replication study. Blood cell proportions were estimated using DNA methylation array data,
and with cell-type deconvolution methods that were developed in euploid individuals (17),
although we recently used the same approach in newborns with and without trisomy 21 and
confirmed known differences in blood cell proportions associated with DS (92). Nonetheless,
the increased B-cell proportions in DS-ALL cases require confirmation using blood cell count
measures in newborns. Another limitation is the lack of information on tumor subtypes for DS-
ALL cases. Future studies should investigate the association between increased neonatal B-cell
proportions and frequent somatic alterations in DS-ALL including CRLF2 rearrangements.
Finally, GATA1 sequencing data were only available for DS controls in the Discovery study.
Somatic GATA1 mutations in newborns with DS are associated with a transient
myeloproliferative disorder (TMD) (116); however, removal of GATA1 mutation-positive
controls had minimal effect on the B-cell association with DS-ALL, and these mutations are
unlikely to confound the association given that TMD is associated with increased risk of acute
myeloid leukemia in DS and not DS-ALL.
Our study provides evidence that increased levels of B-cells at birth are associated with
ALL risk in children with DS. Future studies are needed to understand the role of blood cell trait
variation in DS-ALL etiology, and to examine increased neonatal B-cells as a potential risk factor
for ALL in the non-DS population.
Chapter 4: Importance of accounting for interactions between genetics and epigenetics in
EWAS models
It has been reported that genotypes can affect DNA methylation at certain CpGs, both in
cis and in trans. Such SNPs are called methylation quantitative trait loci (mQTLs). This
suggests that results of EWAS models could be confounded by genetic effects, since genetics
could account for the observed correlation between CpG methylation and the trait of interest.
Unfortunately, it is still very rare in the EWAS community to take genetic effects into
consideration. In this chapter, using CCLS and CCRLP datasets, I investigated the percentage of
CpGs with significant mQTLs. This could also help to estimate to what extent previous
EWAS results were confounded by genetics.
By looking at the genomic distribution of mQTL-matched CpGs, I showed that CpGs
matched to mQTLs were enriched in CpG island shore regions. Next, using this mQTL-CpG
database, I ran EWAS models investigating the associations between CpG methylation and
birthweight. I was able to demonstrate that CpGs in pathways closely related to birthweight
(metabolism, biosynthesis and so on) were the most likely to be confounded by genetics in EWAS
models. Therefore, it is important to adjust for genetic effects when doing EWAS analysis,
especially for traits that could be genetically affected.
4.1 Incorporation of DNA methylation quantitative trait loci (mQTLs) in epigenome wide
association analysis: application to birthweight¹
Shaobo Li², Nicholas Mancuso², Catherine Metayer³, Xiaomei Ma⁴, Adam J. de Smith², Joseph L.
Wiemels²
Abstract
Epigenome wide association analyses (EWAS) have helped to understand the
correlations between DNA methylation and many clinicopathologic traits. Since DNA
methylation is affected by genotypes at certain loci, EWAS hits could potentially be confounded
by genetics. However, we still have a poor understanding of how deeply EWAS results are
affected by genetic variation. Using single nucleotide polymorphism (SNP) and methylation
array data of four multiethnic datasets (total n = 1,160), we generated methylation quantitative
trait loci (mQTL) datasets for both Illumina 450K and 850K methylation arrays. We then
fit EWAS models to investigate associations between neonatal DNA methylation and
birthweight, and reported how EWAS results were improved by controlling for mQTLs. For CpGs
on the 450K array, an average of 15.38% of CpGs were matched to mQTLs, while on the EPIC
array, 22.78% of CpGs were matched. The CpGs paired with SNPs were enriched in the CpG island
shore regions. Correcting for mQTLs in the EWAS model for birthweight helped to increase
significance levels for top hits. For CpGs overlapping genes that were associated with
birthweight-related pathways (nutrient metabolism and biosynthesis, for example), accounting for
mQTLs changed their regression coefficients more dramatically than for other CpGs. Methylation of
around 20% of CpGs in the genome was affected by genotypes. It is important to take these
genetic effects into consideration when doing EWAS analysis. Genetic effects were stronger on
CpGs overlapping key pathway genes associated with the outcome.
¹ This manuscript is in the process of submission. The published version may be modified per
coauthors’, editors’ and reviewers’ comments.
² Center for Genetic Epidemiology, Department of Population and Public Health Sciences,
University of Southern California, Los Angeles, California, USA
³ School of Public Health, University of California Berkeley, Berkeley, California, USA
⁴ Yale School of Public Health, Department of Chronic Disease Epidemiology, New Haven,
Connecticut, USA
Introduction
Epigenome wide association studies typically regress DNA methylation at individual CpG
sites throughout the genome against demographic, environmental, or disease
characteristics. DNA methylation is quantified as the fraction of DNA molecules methylated at a
specific CpG site within a tissue, usually blood, and typically using Illumina array methodology.
While DNA methylation is a continuous phenotype, its status may be partially or completely
controlled by neighboring genetic factors or polymorphisms. When these polymorphisms exist
within the probes used for measurement, the DNA methylation assay may fail, leading to
aberrant data; lists of such CpG sites have been published, and these are usually filtered out
before the analysis phase (117) (96). Genetic effects are not limited to such probes, as genetic
polymorphisms can have profound cis impacts on DNA methylation even at a distance. An
example would be several of the CpG sites that are known to be strongly impacted by tobacco
exposure. Interestingly, these are also impacted by SNPs proximal to their sites, which act as cis
methylation quantitative trait loci (118). SNPs may either confound the environmental association
(e.g., GFI1 and tobacco) or simply introduce noise without confounding the environmental
association (e.g., AHRR and tobacco) (118). In either case, failing to account for the genetic
effect leads to measurement error in associations between environmental and disease
characteristics and DNA methylation, which should be accounted for in EWAS analysis. Ultimately,
both genetic and environmental effects need to be incorporated along with interaction terms to
understand epigenetic variation caused by environmental, dietary, and other factors.
Birthweight is related to several factors, including gestational age, parity, fetal sex,
maternal height, age, and ethnicity (119) (120). Studies have also shown recurrent DNA
methylation variations as either a cause or consequence of birthweight corrected for
gestational age. Indeed, EWAS have established a large set of 914 CpG sites
validated in a large meta-analysis of Illumina HM450K data (11). Birthweight is also known to
have a strong polygenic etiology, with over 430 SNPs identified as significant (121) (122). Here, we
consider DNA methylation variation in relation to birthweight variation in a California
multiethnic population, while accounting for DNA methylation quantitative trait loci (mQTL).
We also consider variation in DNA methylation in relation to the confirmed birthweight
GWAS alleles.
Methods
Study subjects
Four separate datasets were used in this analysis, referred to as Sets 1, 2, 3, and 4.
Three datasets (Set 1, 168 Europeans and 174 Latinos; Set 2, 105 Europeans and
220 Latinos; Set 4, 137 Europeans and 356 Latinos) are from the California Childhood Leukemia
Study (CCLS), which involved active recruitment of children with leukemia and healthy
controls throughout California between 1995 and 2015 (123). A separate sample set (Set 3, 160
Europeans and 318 Latinos) was derived independently from a separate leukemia case-control
sample set with subjects born within a five-county Southern California region, as described (37).
Genome-wide DNA methylation arrays
DNA was extracted from 1/3 of each newborn DBS (~1.4 cm diameter) using the Qiagen
DNA Investigator blood card protocol, and bisulfite conversion performed using Zymo EZ DNA
Methylation kits. Bisulfite-converted DNA samples were randomized.
Sets 1 and 2 were assayed on Illumina Infinium Methylation450K Beadchip genome-wide
DNA methylation arrays (hereafter the 450K array); Sets 3 and 4 were assayed on
Illumina Infinium Methylation850K Beadchip genome-wide EPIC methylation arrays (hereafter
the EPIC array).
Genome-wide Genotype SNP array
Additional DNA was isolated from DBS for genotyping. Sets 1, 2, and 4 were genotyped on the
Illumina Omni Express array, while Set 3 was genotyped on the Affymetrix PMDA array.
Genotype data preprocessing
Pre-imputation QC was performed separately for Sets 1, 2, and 4 and for Set 3. Genotyped
variants with missing call rates exceeding 5% were excluded. We then imputed SNP data on the
TOPMed imputation server (28) (124) (125). After imputation, again separately for Sets 1, 2, and 4
and for Set 3, we retained imputed SNPs with an R² score higher than 0.6 and excluded SNPs with a
minor allele frequency (MAF) < 0.01. Lastly, Set 3 and Set 4 genetic data were harmonized and
combined by taking only the SNPs shared between the two sets. This resulted in 9,483,494 SNPs for
Sets 1 and 2 and 8,954,132 SNPs for Sets 3 and 4.
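The post-imputation filters and the harmonization step can be illustrated with a minimal Python
sketch. The record layout (SNP ID, imputation R², MAF), the function names, and the toy values are
assumptions made for illustration only; they are not the pipeline used in this study.

```python
# Hypothetical record layout: (snp_id, imputation_r2, minor_allele_frequency).
def filter_imputed(snps, min_r2=0.6, min_maf=0.01):
    """Keep SNPs with imputation R^2 above min_r2 and MAF of at least min_maf."""
    return {snp_id for snp_id, r2, maf in snps if r2 > min_r2 and maf >= min_maf}


def shared_snps(*snp_sets):
    """Return the SNP IDs present in every set (used to harmonize datasets)."""
    return set.intersection(*snp_sets)


# Toy data: rs2 fails the R^2 filter, rs3 fails the MAF filter.
toy_set3 = [("rs1", 0.95, 0.20), ("rs2", 0.55, 0.30), ("rs3", 0.80, 0.005), ("rs4", 0.70, 0.02)]
toy_set4 = [("rs1", 0.90, 0.18), ("rs4", 0.65, 0.03), ("rs5", 0.99, 0.40)]

kept3 = filter_imputed(toy_set3)
kept4 = filter_imputed(toy_set4)
combined = shared_snps(kept3, kept4)  # SNPs that pass QC in both sets
print(sorted(combined))
```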
DNA methylation array data preprocessing and annotation
DNA methylation data were preprocessed using the R package “minfi”, and “noob”
normalization was performed for each dataset. CpG probes and subjects with more than 5%
missingness were removed from the normalized data. The remaining missing values were imputed
using the “impute” package, as described in our previous publication (92). Probes located on sex
chromosomes, as well as probes at SNP sites with MAF > 5%, were excluded.
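As an illustration of the missingness filter, the sketch below drops probes and subjects with more
than 5% missing values. It assumes that both rates are computed on the same unfiltered matrix (the
order of filtering is not specified above), and the data structure and function names are
hypothetical.

```python
def missing_fraction(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def filter_missingness(beta, max_missing=0.05):
    """beta: dict probe_id -> list of beta values (None = missing), one per subject.

    Returns (kept_probe_ids, kept_subject_indices). Both missingness rates are
    computed on the unfiltered matrix, which is an assumption of this sketch.
    """
    n_subjects = len(next(iter(beta.values())))
    kept_probes = [p for p, row in beta.items() if missing_fraction(row) <= max_missing]
    kept_subjects = [
        j for j in range(n_subjects)
        if missing_fraction([row[j] for row in beta.values()]) <= max_missing
    ]
    return kept_probes, kept_subjects


# Toy matrix: probe cg2 and subject 1 each carry a missing value.
toy_beta = {
    "cg1": [0.10, 0.20, 0.30, 0.40],
    "cg2": [0.50, None, 0.60, 0.70],
    "cg3": [0.20, 0.20, 0.20, 0.20],
}
kept_probes, kept_subjects = filter_missingness(toy_beta)
print(kept_probes, kept_subjects)
```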
Scanning of genome wide mQTLs for each CpG site
Scanning of mQTLs in cis was done using QTLtools (126) in each dataset by genetic
ancestry (Europeans and Latinos) using the parameters described below. This created eight
datasets for mQTL scanning: Set 1 Europeans, Set 1 Latinos, Set 2 Europeans, Set 2 Latinos, Set
3 Europeans, Set 3 Latinos, Set 4 Europeans, and Set 4 Latinos.
We performed a permutation test (n = 1,000) for each CpG against SNPs located within a 2 million
base pair window flanking that CpG. This analysis output the top cis SNP associated with each
CpG for each dataset by ancestry. Covariates controlled for in the scanning process included sex,
batch at the time of DNA methylation measurement, and the first 5 genetic PCs to account
for remaining heterogeneity within ancestry groups. CpG-SNP pairs from each dataset with an
adjusted association P value < 0.05 were included in downstream analysis.
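The logic of a permutation-adjusted cis scan can be sketched in Python as follows. This is a
simplified illustration and not the QTLtools implementation: it omits covariates, uses a plain
empirical P value from the permutations, and assumes the 2 Mb window is centered on the CpG (1 Mb
on each side). All names and toy values are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def cis_snps(cpg_pos, snp_positions, half_window=1_000_000):
    """IDs of SNPs within half_window bp on either side of the CpG (assumed window)."""
    return [s for s, pos in snp_positions.items() if abs(pos - cpg_pos) <= half_window]


def top_cis_snp(meth, genotypes, n_perm=1000, seed=0):
    """Return (best_snp, adjusted_p) for one CpG.

    The adjusted P value is the share of permutations in which the strongest
    |correlation| over all cis SNPs is at least as large as the observed one.
    """
    rng = random.Random(seed)
    stats = {s: abs(pearson(g, meth)) for s, g in genotypes.items()}
    best = max(stats, key=stats.get)
    observed = stats[best]
    shuffled = list(meth)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null_max = max(abs(pearson(g, shuffled)) for g in genotypes.values())
        if null_max >= observed:
            exceed += 1
    return best, (exceed + 1) / (n_perm + 1)


# Toy example: methylation is a linear function of rsA dosage.
dosage_a = [0, 1, 2] * 6 + [0, 1]
dosage_b = [2, 2, 0, 1, 0, 1, 1, 0, 2, 0, 1, 2, 0, 0, 1, 2, 1, 0, 2, 1]
toy_meth = [0.30 + 0.10 * g for g in dosage_a]
positions = {"rsA": 1_500_000, "rsB": 3_500_000}

in_window = cis_snps(2_000_000, positions)
best, p_adj = top_cis_snp(toy_meth, {"rsA": dosage_a, "rsB": dosage_b})
print(in_window, best, p_adj)
```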
To obtain a harmonized mQTL dataset for each methylation array platform (450K
and EPIC), we meta-analyzed outputs from datasets by platform, and for each CpG, we selected
the SNP with the most significant P value as the mQTL for that CpG. More specifically, 450K mQTLs
were based on results of meta-analysis of Set 1 Europeans, Set 1 Latinos, Set 2 Europeans and
Set 2 Latinos, while EPIC mQTLs were based on results of meta-analysis of Set 3 Europeans, Set
3 Latinos, Set 4 Europeans and Set 4 Latinos.
Assessment and adjustment of cell-type heterogeneity
Blood cell proportions were estimated by reference-based deconvolution using the
Identifying Optimal Libraries (IDOL) algorithm from the R package “FlowSorted.Blood.EPIC”.
Reference cord blood data were derived from the R package
“FlowSorted.CordBloodCombined.450k” (127). We were able to deconvolute proportions of
monocytes, granulocytes, natural killer cells, B lymphocytes, CD4 T lymphocytes, CD8 T
lymphocytes and nucleated red blood cells, which were later used to correct for cell-type
heterogeneity in the regression analysis.
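The idea behind reference-based deconvolution can be sketched with non-negative least squares:
a sample's methylation profile is modeled as a non-negative mixture of cell-type reference
profiles. The sketch below is a simplified stand-in, not the IDOL or FlowSorted implementation;
the solver (projected gradient descent), the three-cell-type reference matrix, and all names are
assumptions for illustration.

```python
def estimate_proportions(reference, sample, n_iter=2000):
    """Non-negative least squares by projected gradient descent.

    reference: list of rows (one per probe), each holding the mean beta value
               of that probe in every reference cell type.
    sample:    observed beta values at the same probes.
    Returns one non-negative weight per cell type.
    """
    n_types = len(reference[0])
    # 1 / trace(R^T R) is a conservative step size that guarantees convergence.
    step = 1.0 / sum(v * v for row in reference for v in row)
    w = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [
            sum(r * x for r, x in zip(row, w)) - y
            for row, y in zip(reference, sample)
        ]
        grad = [
            sum(row[k] * res for row, res in zip(reference, residual))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, x - step * g) for x, g in zip(w, grad)]
    return w


# Toy reference: 6 probes x 3 hypothetical cell types.
toy_reference = [
    [0.9, 0.1, 0.1],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.1],
    [0.2, 0.8, 0.2],
    [0.1, 0.1, 0.9],
    [0.1, 0.2, 0.8],
]
true_mix = [0.5, 0.3, 0.2]
toy_sample = [sum(r * w for r, w in zip(row, true_mix)) for row in toy_reference]

estimated = estimate_proportions(toy_reference, toy_sample)
print([round(e, 3) for e in estimated])
```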
Epigenome-wide association analyses
For each CpG, a linear regression model was fit using the methylation β value as the
dependent variable. Independent variables included birthweight (the major variable of interest
in this study), batch, sex, a selection variable (case-control status of childhood ALL),
gestational age, 10 genetic PCs for ancestry heterogeneity, and deconvoluted cell proportions
(excluding granulocytes to avoid collinearity) for cell-type heterogeneity. CpGs were excluded if
they overlapped SNPs with more than 5% MAF, were on chromosomes X or Y, or had 50% missing
values. This model was fit in all four datasets (Set 1, Set 2, Set 3 and Set 4) and the
results were meta-analyzed by platform (450K array and EPIC array). Subjects with gestational
age below 30 weeks (n = 9) were excluded from this analysis.
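A per-CpG regression of this form can be sketched in plain Python. The example fits ordinary least
squares for a single CpG on simulated, noise-free data with only two covariates, and then adds the
dosage of a hypothetical mQTL SNP as an extra column to show how an unadjusted birthweight
coefficient can be distorted by genotype. The data, helper names, and the reduced covariate set
are assumptions for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(y, columns):
    """Least-squares fit of y on an intercept plus the given covariate columns.

    Returns [intercept, coef_1, coef_2, ...] in the order of `columns`.
    """
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    return solve(xtx, xty)


# Simulated data for one CpG (8 subjects); the true birthweight effect is 0.02 per kg.
birthweight = [2.8, 3.1, 3.4, 3.6, 3.9, 4.2, 3.0, 3.3]
sex = [0, 1, 0, 1, 0, 1, 1, 0]
mqtl_dosage = [0, 1, 2, 0, 1, 2, 1, 0]
beta_values = [
    0.40 + 0.02 * bw + 0.01 * s + 0.05 * g
    for bw, s, g in zip(birthweight, sex, mqtl_dosage)
]

unadjusted = ols(beta_values, [birthweight, sex])
adjusted = ols(beta_values, [birthweight, sex, mqtl_dosage])
print("birthweight coefficient without mQTL:", unadjusted[1])
print("birthweight coefficient with mQTL:   ", adjusted[1])
```

In this toy example the genotype is correlated with birthweight, so the model without the mQTL
covariate overstates the birthweight effect, while the adjusted model recovers the simulated value.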
Sensitivity analyses were run by including maternal weight gain during pregnancy as an
additional covariate.
Results
Overview of scanned mQTLs in different datasets: For datasets assessed on the 450K
array (Set 1 Europeans, Set 1 Latinos, Set 2 Europeans and Set 2 Latinos), an average of 15.38%
of CpGs had a corresponding mQTL at the adjusted significance level (P < 0.05), while for datasets
assessed on the EPIC array (Set 3 Europeans, Set 3 Latinos, Set 4 Europeans and Set 4 Latinos),
an average of 22.78% of CpGs were matched to SNPs with adjusted P values < 0.05 as their mQTLs
(Figure 4.1A). Interestingly, proportions of mQTL-matched CpGs differed by ethnicity. In all four
datasets, Latinos consistently had a higher proportion of mQTL-matched CpGs compared with
Europeans (Figure 4.1A), suggesting a potential difference in SNP-CpG interactions between
ethnic groups.
Figure 4.1 Distributions of scanned mQTLs in different datasets
We also examined whether a CpG tends to be paired with the same SNP, or with SNPs in
linkage disequilibrium (LD) (R² > 0.5), across different datasets. We made 8 pairwise
comparisons between datasets assessed on the same DNA methylation platform (Table 4.1). For the
450K array, SNPs matched to the same CpG across datasets were more likely to be in LD with
each other than to be identical. However, for the EPIC array, SNPs tended to be identical rather
than in LD (Figure 4.1B). The numbers of shared CpG-SNP pairs in the 450K and EPIC array datasets
are shown in Figure 4.2.
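The comparison of mQTLs across datasets can be illustrated with a short Python sketch that
classifies the two SNPs matched to the same CpG as identical, in LD (R² > 0.5), or different. Here
R² is approximated as the squared Pearson correlation of genotype dosages; this estimator, the
function names, and the toy dosages are assumptions, since the LD calculation is not detailed
above.

```python
def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    sxy = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    sxx = sum((a - m1) ** 2 for a in g1)
    syy = sum((b - m2) ** 2 for b in g2)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNPs carry no LD information
    return sxy * sxy / (sxx * syy)


def classify_pair(snp_a, snp_b, dosages, threshold=0.5):
    """Label the two mQTL SNPs matched to one CpG in two datasets."""
    if snp_a == snp_b:
        return "identical"
    if r_squared(dosages[snp_a], dosages[snp_b]) > threshold:
        return "LD"
    return "different"


# Toy dosages for three hypothetical SNPs in 8 subjects.
toy_dosages = {
    "rs1": [0, 1, 2, 0, 1, 2, 0, 1],
    "rs2": [0, 1, 2, 0, 1, 2, 1, 1],
    "rs3": [2, 0, 1, 1, 2, 0, 0, 1],
}
labels = [
    classify_pair("rs1", "rs1", toy_dosages),
    classify_pair("rs1", "rs2", toy_dosages),
    classify_pair("rs1", "rs3", toy_dosages),
]
print(labels)
```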
Table 4.1: CpGs sharing identical SNP or SNPs in LD as matched mQTL across different datasets

Sets        Total shared    CpGs with    CpGs with   CpGs with       CpGs with    CpGs with   CpGs with
compared    CpGs with       identical    SNPs in     identical or    identical    SNPs in     identical or
            mQTL (n)        SNP (n)      LD (n)      LD SNP(s) (n)   SNP (%)      LD (%)      LD SNP(s) (%)
eur1.eur2   38096           12423        16267       28690           32.61%       42.70%      75.31%
lat1.lat2   49129           15574        22751       38325           31.70%       46.31%      78.01%
eur1.lat1   42554           12853        19284       32137           30.20%       45.32%      75.52%
eur2.lat2   41854           12688        18343       31031           30.31%       43.83%      74.14%
eur3.eur4   103200          34658        7338        41996           33.58%       7.11%       40.69%
lat3.lat4   165873          54920        9481        64401           33.11%       5.72%       38.83%
eur3.lat3   137902          42650        8856        51506           30.93%       6.42%       37.35%
eur4.lat4   111518          35099        7958        43057           31.47%       7.14%       38.61%
Figure 4.2 Number and overlap of scanned mQTLs in all four datasets
Generation of CpG-mQTL pair databases by DNA methylation array: We then combined
results from multiple datasets to create separate mQTL databases for the 450K and EPIC arrays,
which include the CpG-SNP pairs with the strongest associations. These mQTL databases were used
subsequently to account for genetic confounding effects in epigenome-wide association
analysis. More specifically, we combined mQTL scans from the Set 1 European and Latino and Set 2
European and Latino datasets to create a database for the 450K array, while the Set 3 European
and Latino and Set 4 European and Latino datasets were combined to create a database for the
EPIC array.
The following combining scheme was adopted for each platform: for each CpG, all SNPs
identified as mQTLs in each dataset were recorded and sorted by P value, which was computed by
meta-analyzing nominal association P values (across both ethnic groups and both
datasets) weighted by the inverse of the standard errors. As one CpG could be matched to different
SNPs in different datasets, there could be several CpG-SNP pairs for a particular CpG. As a
result, on the 450K array, 243,719 such CpG-SNP pairs were identified for a total of 150,572
CpGs. On the EPIC array, 625,670 CpG-SNP pairs were in the database for a total of 354,939
CpGs.
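One way to combine P values with inverse-standard-error weights is a weighted Z-score (Stouffer)
meta-analysis, sketched below. This is an assumption about the general form of the scheme
described above, not a reproduction of the software or exact formula used; the function name and
the input layout (beta, SE, P per dataset) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_norm = NormalDist()


def weighted_z_meta(results):
    """Weighted Z-score meta-analysis for one CpG-SNP pair.

    results: list of (beta, se, p) tuples, one per dataset, where p is the
             two-sided nominal P value and the sign of beta gives the direction.
    Weights are 1 / se (an assumption of this sketch).
    Returns (z_meta, two_sided_p).
    """
    num = 0.0
    den = 0.0
    for beta, se, p in results:
        # Convert the two-sided P value to a signed Z score.
        z = -_norm.inv_cdf(p / 2) * (1 if beta >= 0 else -1)
        w = 1 / se
        num += w * z
        den += w * w
    z_meta = num / sqrt(den)
    p_meta = 2 * _norm.cdf(-abs(z_meta))
    return z_meta, p_meta


# Toy inputs: two concordant datasets, then two discordant ones.
z_same, p_same = weighted_z_meta([(0.03, 0.01, 0.05), (0.02, 0.01, 0.05)])
z_opp, p_opp = weighted_z_meta([(0.03, 0.01, 0.05), (-0.02, 0.01, 0.05)])
print(p_same, p_opp)
```

Concordant effects reinforce each other and yield a smaller combined P value, whereas effects of
opposite sign and equal weight cancel out.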
Genome wide distribution of mQTLs: We next investigated the locations of CpGs with
significant CpG-SNP correlations on each array, and whether they have a higher chance of
localizing within regions that play a key biological role.
On the 450K array, CpGs with matched mQTLs were more likely to be located in
N_Shore, S_Shore and OpenSea regions, compared with the genome-wide CpG distribution. On the
EPIC array, mQTL-matched CpGs were less likely to be in the OpenSea region, but enriched in all
other regions (Figures 4.3A, 4.3B).
Figure 4.3 Distributions of CpGs with matched mQTLs
Interestingly, on both arrays, CpGs with matched mQTLs were significantly enriched in
CpG island shore regions (either the N_shore or S_shore) compared with whole-genome
distributions (Chi-squared test P values < 2.2×10⁻¹⁶ on both arrays).
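An enrichment test of this kind can be sketched as a Pearson chi-squared test on a 2×2 table
(shore versus non-shore, mQTL-matched versus unmatched CpGs). The exact table layout used in the
analysis is not stated above, so the layout, the absence of a continuity correction, and the toy
counts below are assumptions.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for the table

        [[a, b],    e.g. mQTL-matched CpGs:  in shore, not in shore
         [c, d]]         unmatched CpGs:     in shore, not in shore

    Returns (chi2, p) with 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper tail equals erfc(sqrt(chi2 / 2)).
    p = erfc(sqrt(chi2 / 2))
    return chi2, p


# Hypothetical counts: 30% of matched CpGs are in shores versus 10% of unmatched CpGs.
chi2_stat, p_value = chi2_2x2(30, 70, 10, 90)
print(chi2_stat, p_value)
```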
EWAS reveals significant CpGs associated with birthweight: Birthweight has been
reported to have significant associations with neonatal DNA methylation (128) (129) (11).
However, previous reports did not take into consideration possible confounding effects from
genetics. To address this, we conducted an EWAS analysis to investigate the correlation
between neonatal DNA methylation and birthweight, while accounting for platform specific
SNPs correlated with the DNA methylation. This analysis was done in all four datasets
separately, and meta-analysis was conducted for each array (450K and EPIC separately). We
also conducted the same regression models without accounting for SNPs for comparison.
On the 450K array
We discovered a total of 31 CpGs significantly associated with birthweight after
Bonferroni correction, 13 (41.94%) of which had a significant mQTL included as a covariate. Three
of these CpGs are in transcription start regions. Some of these significant hits are consistent
with previous reports (11), for example those in the ARID family. In our results, 1 hit is
located in ARID3A and 1 in ARID5B.
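For reference, Bonferroni correction declares a CpG significant when its P value falls below the
nominal level divided by the number of tests. A minimal sketch follows; the nominal level of 0.05,
the probe IDs, and the P values are hypothetical.

```python
def bonferroni_hits(p_values, alpha=0.05):
    """Return the CpGs whose P value is below alpha / number_of_tests."""
    threshold = alpha / len(p_values)
    return {cpg for cpg, p in p_values.items() if p < threshold}


# Four hypothetical tests give a threshold of 0.05 / 4 = 0.0125.
toy_p = {"cgA": 1e-9, "cgB": 0.012, "cgC": 0.013, "cgD": 0.60}
hits = bonferroni_hits(toy_p)
print(sorted(hits))
```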
Eighteen of these hits would not have been identified if we had not accounted for mQTL
effects, suggesting the importance of taking genetic effects into consideration in
epigenome-wide association analysis. They are located in important genes, including growth-related
genes (TGFB2, IGF2BP1), ARID5B, an acetylglucosaminyltransferase gene (GCNT2), and ATPase family
genes (ATP8B1, ATP6V0A1).
We investigated how controlling for mQTL’s genetic effects changed results of this
EWAS model. While regression coefficients from EWAS models with or without controlling for
mQTLs were in general similar, inconsistencies were also seen for many CpGs. For these CpGs,
controlling for mQTLs affected EWAS effect sizes to a large extent (Figure 4.4A). However, the
regression coefficients for the 31 significant CpGs seemed to be similar with or without
controlling for mQTLs (Figure 4.4A).
In terms of P values, controlling for genetic effects also affected significance for many
CpGs, to a much greater extent than regression coefficients (Figure 4.4B). This included the 31
significant CpGs from the model, the majority of which became more significant after adjusting
for mQTL effects.
Figure 4.4 Comparison of birthweight epigenome-wide association analysis results with or
without controlling for mQTL (450K array)
The 450K results suggest that, in general, controlling for genetic effects can help to
identify additional loci that are potentially a cause or consequence of variance in birthweight.
We also conducted an enrichment test on the top 20 genes identified in the association
analysis using Gene Set Enrichment Analysis (GSEA) (130) (131). The 'Early-TGFB1 signature' gene
set was identified as strongly enriched (FDR q value = 1.69×10⁻²), among other sets.
On the EPIC array
There has not been a report of a large-scale multi-cohort birthweight EWAS on the EPIC
array, as was performed previously for the 450K array [Küpers et al (11)]. We identified 3,281
CpGs significantly associated with birthweight after Bonferroni correction, many of them not
available on the 450K platform (2,125 out of 3,281, 64.77%). For example, cg09797037 in
EXOSC10 (P = 1.87×10⁻¹⁷, direction: ++) is significantly associated with birthweight in our EPIC
array data; however, Küpers et al did not identify this gene in their multi-cohort meta-analysis.
The EPIC array results also identified other genes, in addition to the 450K results, as
significantly related to birthweight. Notable genes include IL21R (Interleukin 21 Receptor, which
transduces the growth-promoting signal of IL21 (132)), IPO9 (associated with waist
circumference (133), fat-free mass or lean body mass (134) and body mass index (135) in
previous GWAS), and ST6GALNAC4 (which catalyzes the transfer of sialic acid from CMP-sialic acid
to galactose-containing substrates (136)).
In our EWAS results, 2,001 (60.99%) of the significant hits were adjusted for mQTL
genetic effects, at a CpG-SNP mQTL cutoff of adjusted P value < 0.05. Of the significant CpGs,
405 (12.34%) are annotated as transcription start sites (TSS) in Illumina's EPIC methylation
array annotation (137).
We repeated the same EWAS model in both Sets 3 and 4 without controlling for mQTL
effects. Running the same pipeline (meta-analysis and multiple-testing correction using
Bonferroni), 2,343 of the significant hits from models including mQTLs as covariates would not
have been identified if mQTL effects were not adjusted for, accounting for 71.41% of all the
significant hits. Similar to the 450K array, this illustrates that adjusting for mQTLs in this
EWAS model substantially changed the overall landscape of results.
As with the 450K array, adding mQTLs as an additional variable did not alter regression effect
sizes appreciably for the 3,281 significant CpGs (Figure 4.5A). P values, however, changed
substantially after controlling for mQTLs, with increased significance for
most significant hits, further suggesting the importance of controlling for genetic effects when
conducting EWAS analysis (Figure 4.5B).
Figure 4.5 Comparison of birthweight epigenome-wide association analysis results with or
without controlling for mQTL (EPIC array)
GSEA was also performed using the top 50 genes from the EWAS results, similar to the
450K analysis. Enriched gene sets include abnormal inflammatory response (FDR q-value = 1.82×10⁻³)
and abnormality of blood and blood-forming tissues (FDR q-value = 1.82×10⁻³).
Investigating CpGs whose regression coefficients were significantly affected by mQTLs
On the 450K array
We found that after correcting for mQTLs, regression coefficients of some CpGs changed
more than others, suggesting that the associations between these CpGs and
birthweight were more heavily confounded by mQTLs. Taking genetic effects into consideration
is of vital importance for this group of CpGs.
Among all the significant CpGs in the association analysis after FDR correction, there were
in total 91 CpGs whose effect sizes changed by more than 20% after correcting for mQTL effects.
Pathway analyses suggest these CpGs are enriched in cell proliferation and energy-related
pathways. For example, KEGG analysis of these CpGs showed that top pathways include
glycerophospholipid metabolism (KEGG ID 00564, adjusted P value = 8.93×10⁻⁷⁷), ether lipid
metabolism (KEGG ID 00565, adjusted P value = 8.93×10⁻⁷⁷), GnRH signaling pathway (KEGG ID 04912,
adjusted P value = 8.93×10⁻⁷⁷), and hematopoietic cell lineage (KEGG ID 04640, adjusted P value =
1.00×10⁻⁷⁵). GO analysis identified phosphatidic acid biosynthetic process (GO ID 0006654,
adjusted P value = 5.36×10⁻⁷⁶), glycerolipid biosynthetic process (GO ID 0045017, adjusted P
value = 5.36×10⁻⁷⁶), phosphatidic acid metabolic process (GO ID 0046473, adjusted P value =
5.36×10⁻⁷⁶), and glycerophospholipid biosynthetic process (GO ID 0046474, adjusted P value =
5.36×10⁻⁷⁶) among top pathways.
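The 20% filter on effect-size change can be expressed as a short Python sketch. It assumes the
change is measured relative to the coefficient from the model without mQTL adjustment (the
denominator is not specified above); the names and toy coefficients are hypothetical.

```python
def relative_change(beta_unadjusted, beta_adjusted):
    """Absolute change in effect size relative to the unadjusted estimate (assumed)."""
    return abs(beta_adjusted - beta_unadjusted) / abs(beta_unadjusted)


def strongly_shifted(results, cutoff=0.20):
    """results: dict cpg -> (beta_unadjusted, beta_adjusted).

    Returns the CpGs whose effect size changed by more than `cutoff`.
    """
    return sorted(
        cpg for cpg, (b0, b1) in results.items()
        if relative_change(b0, b1) > cutoff
    )


# Toy coefficients: cgA shifts by 50%, cgB by 5%, cgC flips sign (200%).
toy_results = {
    "cgA": (0.010, 0.015),
    "cgB": (0.020, 0.021),
    "cgC": (-0.010, 0.010),
}
shifted = strongly_shifted(toy_results)
print(shifted)
```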
On the EPIC array
The EPIC array analysis identified 770 CpGs with substantially changed effect sizes after
correcting for mQTLs, using the same filtering as for the 450K array. Pathway analyses also
revealed that these CpGs are enriched in energy and metabolism functions, as well as leukemia.
More specifically, KEGG identified acute myeloid leukemia (KEGG ID 05221, adjusted P value =
8.86×10⁻⁷) and proteasome (KEGG ID 03050, adjusted P value = 0.036) as significant pathways.
In GO pathway analysis, ATPase regulator activity (GO ID 0060590, adjusted P value = 3.79×10⁻⁶),
regulation of cellular amine metabolic process (GO ID 0033238, adjusted P value = 1.37×10⁻⁵),
and positive regulation of fat cell differentiation (GO ID 0045600, adjusted P value =
6.76×10⁻⁵) are among significantly enriched pathways.
These results suggest that, for CpGs overlapping genes involved in biosynthesis, metabolism, and
leukemia (for which high birthweight is a risk factor), the association with birthweight was
heavily confounded by genetic effects.
Validated Birthweight SNPs as mQTLs did not significantly confound EWAS results
Multiple SNPs have been identified as associated with birthweight in previous GWAS
analyses. We obtained a total of 443 SNPs from the GWAS Catalog (122), and investigated CpGs
with these SNPs as mQTLs, and how the EWAS results were affected by these SNPs. Of these SNPs, 139
(31.38%) were identified as mQTLs for CpGs in our dataset. Interestingly, CpGs
with these SNPs as mQTLs mostly overlapped genes associated with protein
synthesis, growth factors, and metabolism, for example growth factor receptor bound protein
10 (GRB10), Mesoderm Induction Early Response 1 Transcriptional Regulator (MIER1), and
SMAD Family Member 3 (SMAD3).
Gene set enrichment analysis of these genes revealed several key sets including “any
process that activates or increases the frequency, rate or extent of transcription from an RNA
polymerase II promoter” (GSEA adjusted P value = 3.25×10⁻⁴), “any process that stops, prevents,
or reduces the frequency, rate or extent of osteoblast differentiation” (GSEA adjusted P value =
3.89×10⁻⁴), and “the progression of muscle tissue over time, from its initial formation to its
mature state” (GSEA adjusted P value = 6.16×10⁻⁴).
Compared with the rest of the mQTLs, these SNPs did not confound the EWAS analysis more
strongly. After controlling for these mQTLs, both effect sizes and P values were in general
consistent with models that did not correct for mQTLs (450K array, Figure 4.6A, 4.6B; EPIC
array, Figure 4.6C, 4.6D), with only a small group of deviating CpGs. Since most of these GWAS
were conducted in European populations, we re-ran our models in white subjects only and
meta-analyzed results from the four datasets, and results were similar.
Figure 4.6 Comparison of birthweight-associated DNA methylation sites with or without
controlling for validated birthweight GWAS SNPs
Correcting for maternal weight gain as an additional covariate did not significantly change
EWAS results
It has been reported that excessive maternal weight gain is associated with birthweight
(138) (139) (140). Therefore, maternal weight gain during pregnancy could also confound
birthweight EWAS models. As a sensitivity analysis, we controlled for maternal weight gain as
an additional covariate. Neither P values nor regression coefficients were
substantially altered (Set 4 results are shown as an example in Figure 4.7).
Figure 4.7 Comparison of results from original and sensitivity EWAS models adding maternal
weight gain as additional covariate
Discussion
While epigenome-wide association studies (EWAS) have helped identify correlations between
DNA methylation and key clinical or biological traits, the effects of genetic factors are generally
ignored in such studies. Using four multi-ethnic datasets, we generated SNP-CpG datasets for both
the 450K and EPIC arrays that could help us understand the role played by mQTLs in EWAS, using a
well-studied covariate that is related to DNA methylation at birth (birthweight).
On both arrays, CpGs matched to mQTLs were more likely to be in CpG island shore
regions (N_shore or S_shore). CpG island shores are 2kb regions flanking a CpG island (141),
and DNA methylation in the shore regions has been reported to be associated with
disease traits such as Alzheimer’s disease (142) and chronic lymphocytic leukemia (143), as well
as negatively associated with gene expression levels (141) (144). We found that methylation
of CpGs in the shore regions was also more likely to be affected by genotype, suggesting that
previously reported associations between shore methylation and gene expression were likely, in
fact, expression quantitative trait loci (eQTL) effects and key markers of population trait
variability.
Incorporating this mQTL database, we performed EWAS analysis investigating the
correlation between birthweight and neonatal DNA methylation. While this relationship has
been reported by multi-cohort meta-analysis (11), genetic effects were not taken into
consideration. We identified some of the same hits as before, for example, ARID3A and ARID5B. The
AT-rich interacting domain (ARID) family proteins bind to DNA (145) and play roles in
transcriptional regulation during cell proliferation, differentiation and development (146).
Other top genes overlapping significant CpGs include PLD2 (cancer development and
progression (147) (148)) and TGFB2 (regulation of angiogenesis and heart development (149)
(150)). We were also able to identify CpGs located in genes not previously reported in EWAS.
EXOSC10, for example, is involved in the ATP/ITP metabolism pathway. Gusev et al
reported that its expression level was associated with birthweight in a TWAS (151).
Interestingly, adjusting for mQTLs had a bigger effect on regression coefficients for one
group of CpGs than for others. To understand the characteristics of these CpGs, we
performed pathway analyses and found that they were highly enriched for energy metabolism
and biosynthesis pathways, as expected. One of the less expected
pathways is acute myeloid leukemia (AML) (KEGG ID 05221, adjusted P value = 8.86×10⁻⁷). It
has been reported that extreme birthweights (both high and low, a U-shaped association) are
associated with higher risk of AML (152). This suggests that for CpGs overlapping genes
strongly connected to the outcome (birthweight in our study) in EWAS analyses, accounting for
genetic effects is especially important.
Our study has some limitations. We only had access to DNA methylation array
data, rather than whole-genome bisulfite sequencing data, which could limit our ability to detect
other key SNP-CpG loci. Moreover, although our datasets contain multi-ethnic subjects,
allowing us to examine race-specific methylation and SNP interactions, we only had access to
subjects of European and Latino descent. Once data are available, it will be of interest to
investigate how our findings might change in other ethnic groups, including African Americans
and Asians.
In summary, it is valuable to account for genetic effects when fitting EWAS models.
The matrix of SNP-CpG methylation effects presented in the current analysis can be used
directly for European and Latino populations with both types of Illumina DNA methylation array
data. This will allow accounting for genetic confounding, especially for CpGs overlapping key
genes, and could help to identify hits that cannot be detected without
adjusting for mQTL effects.
Chapter 5: Conclusion
This thesis investigated how to use genetic (SNP array data) and epigenetic (DNA
methylation array data) information, to understand the etiology of childhood cancers, both in
the context of general population, as well as children with Down Syndrome.
Genetic data can be analyzed globally or one SNP at a time. One example of global
analysis is admixture mapping. I presented a study that used admixture mapping to examine
the association between European ancestry and risk of pilocytic astrocytoma (PA) in an
admixed Latino population. Every 5% increase in European ancestry was associated with a
1.051-fold increase in the odds of PA (95% CI: 1.014-1.091). Further admixture mapping
pointed to several genomic regions where this risk might arise, especially the 6q14.3 region
in the SNX14 gene, where each additional copy of the European ancestral haplotype was
associated with a 1.59-fold increase in PA odds (smallest P-value from fine-mapping at
chr6:85504599; P = 4.70 × 10^-6).
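The per-5% odds ratio can be rescaled to other ancestry differences if one assumes the
log-odds are linear in ancestry proportion, as in a standard logistic model. The Python sketch
below shows that arithmetic using the estimates quoted in the text; the helper names are
hypothetical, and the standard error is back-calculated from the reported interval assuming a
symmetric 95% Wald interval on the log scale.

```python
import math


def scale_odds_ratio(or_per_step, step, target):
    """Rescale an odds ratio reported per `step` units to `target` units,
    assuming log-odds are linear in the exposure."""
    return math.exp(math.log(or_per_step) * target / step)


def se_from_ci(lower, upper, z=1.96):
    """Standard error of the log odds ratio implied by a Wald 95% CI."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


# Estimates quoted in the text: OR 1.051 per 5% European ancestry,
# 95% CI 1.014-1.091.
or_20 = scale_odds_ratio(1.051, 5, 20)  # implied OR for a 20% difference
se = se_from_ci(1.014, 1.091)           # SE of log(OR) per 5% step
z_score = math.log(1.051) / se          # implied Wald statistic
```

Under these assumptions a 20% difference in European ancestry corresponds to an odds ratio of
about 1.22, that is, 1.051 raised to the fourth power.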
Analyzing each SNP individually is also a powerful way to relate germline DNA variants
to disease risk. This type of analysis can be done at a genome-wide level or at a regional
level (a single chromosome or a smaller region). I presented an association analysis of
mitochondrial SNPs and risk of childhood glioblastoma, which identified m1555A>G as a
significant hit. This variant lies in the highly conserved 12S rRNA gene, suggesting that
energy synthesis could play a role in glioblastoma etiology. Targeted association analyses of
nuclear-encoded mitochondrial (n-mt) SNPs identified hits in dehydrogenase/reductase or
calcium sensor genes. Together, these two sets of analyses suggest possible interplay between
mtDNA and n-mt DNA.
Gene expression is regulated by sophisticated systems. Mechanisms that do not alter the
DNA sequence but instead act “on top of” genetics are referred to as epigenetics. DNA
methylation is one of the most widely studied epigenetic modifications, partly because of its
stability and accessibility. Like SNP data, DNA methylation data can be analyzed holistically
or one CpG at a time. I presented a study in which I first analyzed DNA methylation data as a
whole to deconvolute nucleated cell proportions in children with Down Syndrome and tested
whether these proportions were associated with leukemia risk. I then examined CpGs one by
one in an epigenome-wide association study (EWAS) to identify genes that were differentially
methylated in the neonatal period in children with Down Syndrome who later developed
leukemia. B-cell proportions were significantly higher in children with Down Syndrome who
later developed acute lymphoblastic leukemia (ALL). This result was robust in multiple
sensitivity analyses, including stratification by race and adjustment for white blood
cell-associated SNPs. The EWAS identified 247 significant CpGs in the discovery study, which
were associated with a wide variety of tumor-related functions, including B-cell maturation
and NF-kappaB signaling.
Lastly, it is important to understand that DNA sequence and DNA methylation do not
function independently. Methylation at some CpGs is affected by genotype at mQTLs. In my
last study, I scanned for mQTLs using 4 multi-ethnic data sets. Around 20% of CpGs were
matched to SNPs that could affect their methylation levels. Interestingly, these CpGs were
enriched in CpG island shore regions, where DNA methylation has been reported to be
associated with both gene expression and disease traits. Using this mQTL database to adjust
for genetic effects in an EWAS of birthweight, I was able to demonstrate that CpGs that
were the most relevant to birthweight-associated functions (biosynthesis, metabolism, etc.)
were the most likely to be confounded by genotypes.
Understanding the etiology of childhood cancers requires comprehensive analysis of both
genetic and epigenetic data, globally and at the level of individual loci, while taking
their interactions into account. Building on the findings of this thesis, future functional
analyses can examine the effects of perinatal genetic and epigenetic changes on gene
expression, moving toward a clear and complete picture of childhood cancer etiology.
References
1. Steliarova-Foucher E, Colombet M, Ries LAG, Moreno F, Dolya A, Bray F, et al. International
incidence of childhood cancer, 2001-10: a population-based registry study. Lancet Oncol.
2017 Jun;18(6):719–31.
2. Kadan-Lottick NS. Survival Variability by Race and Ethnicity in Childhood Acute
Lymphoblastic Leukemia. JAMA. 2003 Oct 15;290(15):2008.
3. Hunger SP, Lu X, Devidas M, Camitta BM, Gaynon PS, Winick NJ, et al. Improved Survival for
Children and Adolescents With Acute Lymphoblastic Leukemia Between 1990 and 2005: A
Report From the Children’s Oncology Group. J Clin Oncol. 2012 May 10;30(14):1663–9.
4. Wen Y, Jin R, Chen H. Interactions Between Gut Microbiota and Acute Childhood Leukemia.
Front Microbiol. 2019 Jun 19;10:1300.
5. Rodgers J. Attentional ability among survivors of leukaemia treated without cranial
irradiation. Arch Dis Child. 2003 Feb 1;88(2):147–50.
6. Daams M, Schuitema I, van Dijk BW, van Dulmen-den Broeder E, Veerman AJ, van den Bos
C, et al. Long-term effects of cranial irradiation and intrathecal chemotherapy in treatment
of childhood leukemia: a MEG study of power spectrum and correlated cognitive
dysfunction. BMC Neurol. 2012 Dec;12(1):84.
7. Carey ME, Haut MW, Reminger SL, Hutter JJ, Theilmann R, Kaemingk KL. Reduced Frontal
White Matter Volume in Long-Term Childhood Leukemia Survivors: A Voxel-Based
Morphometry Study. Am J Neuroradiol. 2008 Apr;29(4):792–7.
8. Markunas CA, Xu Z, Harlid S, Wade PA, Lie RT, Taylor JA, et al. Identification of DNA
methylation changes in newborns related to maternal smoking during pregnancy. Environ
Health Perspect. 2014 Oct;122(10):1147–53.
9. Onyije FM, Olsson A, Baaken D, Erdmann F, Stanulla M, Wollschläger D, et al. Environmental
Risk Factors for Childhood Acute Lymphoblastic Leukemia: An Umbrella Review. Cancers.
2022 Jan 13;14(2):382.
10. Wan Ismail WR, Abdul Rahman R, Rahman NAA, Atil A, Nawi AM. The Protective Effect of
Maternal Folic Acid Supplementation on Childhood Cancer: A Systematic Review and Meta-
analysis of Case-control Studies. J Prev Med Pub Health. 2019 Jul 31;52(4):205–13.
11. Küpers LK, Monnereau C, Sharp GC, Yousefi P, Salas LA, Ghantous A, et al. Meta-analysis of
epigenome-wide association studies in neonates reveals widespread differential DNA
methylation associated with birthweight. Nat Commun. 2019 Dec;10(1):1893.
12. Singh DP, Bagam P, Sahoo MK, Batra S. Immune-related gene polymorphisms in pulmonary
diseases. Toxicology. 2017 May 15;383:24–39.
13. Jeon S, de Smith AJ, Li S, Chen M, Chan TF, Muskens IS, et al. Genome-wide trans-ethnic
meta-analysis identifies novel susceptibility loci for childhood acute lymphoblastic
leukemia. Leukemia [Internet]. 2021 Nov 8 [cited 2021 Nov 8]; Available from:
https://www.nature.com/articles/s41375-021-01465-1
14. Beck JJ, Pool R, van de Weijer M, Chen X, Krapohl E, Gordon SD, et al. Genetic meta-analysis
of twin birth weight shows high genetic correlation with singleton birth weight. Hum Mol
Genet. 2021 Sep 15;30(19):1894–905.
15. Sofer T, Baier LJ, Browning SR, Thornton TA, Talavera GA, Wassertheil-Smoller S, et al.
Admixture mapping in the Hispanic Community Health Study/Study of Latinos reveals
regions of genetic associations with blood pressure traits. PloS One. 2017;12(11):e0188400.
16. Ferreyra Vega S, Olsson Bontell T, Corell A, Smits A, Jakola AS, Carén H. DNA methylation
profiling for molecular classification of adult diffuse lower-grade gliomas. Clin Epigenetics.
2021 May 3;13(1):102.
17. Koestler DC, Jones MJ, Usset J, Christensen BC, Butler RA, Kobor MS, et al. Improving cell
mixture deconvolution by identifying optimal DNA methylation libraries (IDOL). BMC
Bioinformatics. 2016 Mar 8;17:120.
18. Villicaña S, Bell JT. Genetic impacts on DNA methylation: research findings and future
perspectives. Genome Biol. 2021 Apr 30;22(1):127.
19. Ward MH, Colt JS, Metayer C, Gunier RB, Lubin J, Crouse V, et al. Residential exposure to
polychlorinated biphenyls and organochlorine pesticides and risk of childhood leukemia.
Environ Health Perspect. 2009 Jun;117(6):1007–13.
20. de Smith AJ, Kaur M, Gonseth S, Endicott A, Selvin S, Zhang L, et al. Correlates of Prenatal
and Early-Life Tobacco Smoke Exposure and Frequency of Common Gene Deletions in
Childhood Acute Lymphoblastic Leukemia. Cancer Res. 2017 Apr 1;77(7):1674–83.
21. Wiemels JL, Walsh KM, de Smith AJ, Metayer C, Gonseth S, Hansen HM, et al. GWAS in
childhood acute lymphoblastic leukemia reveals novel genetic associations at chromosomes
17q12 and 8q24.21. Nat Commun. 2018 Dec;9(1):286.
22. Brown AL, de Smith AJ, Gant VU, Yang W, Scheurer ME, Walsh KM, et al. Inherited genetic
susceptibility to acute lymphoblastic leukemia in Down syndrome. Blood. 2019 Oct
10;134(15):1227–37.
23. de Smith AJ, Walsh KM, Morimoto LM, Francis SS, Hansen HM, Jeon S, et al. Heritable
variation at the chromosome 21 gene ERG is associated with acute lymphoblastic leukemia
risk in children with and without Down syndrome. Leukemia. 2019 Nov;33(11):2746–51.
24. Mueller BA, Doody DR, Weiss NS, Chow EJ. Hospitalization and mortality among pediatric
cancer survivors: a population-based study. Cancer Causes Control CCC. 2018
Nov;29(11):1047–57.
25. Schwartz BE, Ahmad K. Chromatin Assembly with H3 Histones: Full Throttle Down Multiple
Pathways. In: Current Topics in Developmental Biology [Internet]. Elsevier; 2006 [cited 2022
Jan 22]. p. 31–55. Available from:
https://linkinghub.elsevier.com/retrieve/pii/S0070215306740029
26. Morton NE. Parameters of the human genome. Proc Natl Acad Sci U S A. 1991 Sep
1;88(17):7474–6.
27. The 1000 Genomes Project Consortium, Auton A, Abecasis GR, Altshuler DM, et al. A global
reference for human genetic variation. Nature. 2015 Oct 1;526(7571):68–74.
28. Taliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, et al. Sequencing of 53,831
diverse genomes from the NHLBI TOPMed Program [Internet]. Genomics; 2019 Mar [cited
2021 Dec 10]. Available from: http://biorxiv.org/lookup/doi/10.1101/563866
29. Kanti Das K, Kumar R. Pediatric Glioblastoma. In: De Vleeschouwer S, editor. Glioblastoma
[Internet]. Codon Publications; 2017 [cited 2021 Sep 27]. p. 297–312. Available from:
https://exonpublications.com/index.php/exon/article/view/137
30. Faury D, Nantel A, Dunn SE, Guiot M-C, Haque T, Hauser P, et al. Molecular profiling
identifies prognostic subgroups of pediatric glioblastoma and shows increased YB-1
expression in tumors. J Clin Oncol Off J Am Soc Clin Oncol. 2007 Apr 1;25(10):1196–208.
31. Ostrom QT, Adel Fahmideh M, Cote DJ, Muskens IS, Schraw JM, Scheurer ME, et al. Risk
factors for childhood and adult primary brain tumors. Neuro-Oncol. 2019 Nov
4;21(11):1357–75.
32. Dhapola R, Sarma P, Medhi B, Prakash A, Reddy DH. Recent Advances in Molecular
Pathways and Therapeutic Implications Targeting Mitochondrial Dysfunction for
Alzheimer’s Disease. Mol Neurobiol [Internet]. 2021 Nov 2 [cited 2021 Nov 3]; Available
from: https://link.springer.com/10.1007/s12035-021-02612-6
33. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology: tool
for the unification of biology. Nat Genet. 2000 May;25(1):25–9.
34. Chinnery PF. Primary Mitochondrial Disorders Overview. In: Adam MP, Ardinger HH, Pagon
RA, Wallace SE, Bean LJ, Mirzaa G, et al., editors. GeneReviews® [Internet]. Seattle (WA):
University of Washington, Seattle; 1993 [cited 2021 Oct 6]. Available from:
http://www.ncbi.nlm.nih.gov/books/NBK1224/
35. Triska P, Kaneva K, Merkurjev D, Sohail N, Falk MJ, Triche TJ, et al. Landscape of Germline
and Somatic Mitochondrial DNA Mutations in Pediatric Malignancies. Cancer Res. 2019 Apr
1;79(7):1318–30.
36. Kaneva K, O’Halloran K, Triska P, Liu X, Merkurjev D, Bootwalla M, et al. The spectrum of
mitochondrial DNA (mtDNA) mutations in pediatric CNS tumors. Neuro-Oncol Adv. 2021
Dec;3(1):vdab074.
37. Zhang C, Ostrom QT, Semmes EC, Ramaswamy V, Hansen HM, Morimoto L, et al. Genetic
predisposition to longer telomere length and risk of childhood, adolescent and adult-onset
ependymoma. Acta Neuropathol Commun. 2020 Oct 28;8(1):173.
38. Zhang C, Ostrom QT, Hansen HM, Gonzalez-Maya J, Hu D, Ziv E, et al. European genetic
ancestry associated with risk of childhood ependymoma. Neuro-Oncol. 2020 Nov
26;22(11):1637–46.
39. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK:
rising to the challenge of larger and richer datasets. GigaScience. 2015 Dec;4(1):7.
40. Wang X. Firth logistic regression for rare variant association tests. Front Genet [Internet].
2014 Jun 19 [cited 2021 Sep 27];5. Available from:
http://journal.frontiersin.org/article/10.3389/fgene.2014.00187/abstract
41. Willer CJ, Li Y, Abecasis GR. METAL: fast and efficient meta-analysis of genomewide
association scans. Bioinformatics. 2010 Sep 1;26(17):2190–1.
42. NCBI Resource Coordinators, Agarwala R, Barrett T, Beck J, Benson DA, Bollin C, et al.
Database resources of the National Center for Biotechnology Information. Nucleic Acids
Res. 2018 Jan 4;46(D1):D8–13.
43. The Gene Ontology Consortium, Carbon S, Douglass E, Good BM, Unni DR, Harris NL, et al.
The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Res. 2021 Jan
8;49(D1):D325–34.
44. Eslamieh M, Williford A, Betrán E. Few Nuclear-Encoded Mitochondrial Gene Duplicates
Contribute to Male Germline-Specific Functions in Humans. Genome Biol Evol. 2017 Oct
1;9(10):2782–90.
45. Kloss-Brandstätter A, Pacher D, Schönherr S, Weissensteiner H, Binna R, Specht G, et al.
HaploGrep: a fast and reliable algorithm for automatic classification of mitochondrial DNA
haplogroups. Hum Mutat. 2011 Jan;32(1):25–32.
46. Behar DM, van Oven M, Rosset S, Metspalu M, Loogväli E-L, Silva NM, et al. A “Copernican”
Reassessment of the Human Mitochondrial DNA Tree from its Root. Am J Hum Genet. 2012
Apr;90(4):675–84.
47. Neefs J-M, Van de Peer Y, De Rijk P, Goris A, De Wachter R. Compilation of small ribosomal
subunit RNA sequences. Nucleic Acids Res. 1991 Apr 25;19(suppl):1987–2015.
48. Zwieb C, Jemiolo DK, Jacob WF, Wagner R, Dahlberg AE. Characterization of a collection of
deletion mutants at the 3’-end of 16S ribosomal RNA of Escherichia coli. Mol Gen Genet
MGG. 1986 May;203(2):256–64.
49. Prezant TR, Agapian JV, Bohlman MC, Bu X, Oztas S, Qiu WQ, et al. Mitochondrial ribosomal
RNA mutation associated with both antibiotic-induced and non-syndromic deafness. Nat
Genet. 1993 Jul;4(3):289–94.
50. Usami S, Abe S, Akita J, Namba A, Shinkawa H, Ishii M, et al. Prevalence of mitochondrial
gene mutations among hearing impaired patients. J Med Genet. 2000 Jan;37(1):38–40.
51. Maeda Y, Sasaki A, Kasai S, Goto S, Nishio S, Sawada K, et al. Prevalence of the
mitochondrial 1555 A>G and 1494 C>T mutations in a community-dwelling population in
Japan. Hum Genome Var. 2020 Dec;7(1):27.
52. Zou Y, Dai Q, Tao W, Wen X, Feng D, Deng H, et al. Suspension array-based deafness genetic
screening in 53,033 Chinese newborns identifies high prevalence of 109 G>A in GJB2. Int J
Pediatr Otorhinolaryngol. 2019 Nov;126:109630.
53. Yonova-Doing E, Calabrese C, Gomez-Duran A, Schon K, Wei W, Karthikeyan S, et al. An atlas
of mitochondrial DNA genotype–phenotype associations in the UK Biobank. Nat Genet.
2021 Jul;53(7):982–93.
54. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, et al. The mutational
constraint spectrum quantified from variation in 141,456 humans. Nature. 2020 May
28;581(7809):434–43.
55. Watanabe A, Hippo Y, Taniguchi H, Iwanari H, Yashiro M, Hirakawa K, et al. An opposing
view on WWOX protein function as a tumor suppressor. Cancer Res. 2003 Dec
15;63(24):8629–33.
56. Kuroki T, Trapasso F, Shiraishi T, Alder H, Mimori K, Mori M, et al. Genetic alterations of the
tumor suppressor gene WWOX in esophageal squamous cell carcinoma. Cancer Res. 2002
Apr 15;62(8):2258–60.
57. Paige AJ, Taylor KJ, Taylor C, Hillier SG, Farrington S, Scott D, et al. WWOX: a candidate
tumor suppressor gene involved in multiple tumor types. Proc Natl Acad Sci U S A. 2001 Sep
25;98(20):11417–22.
58. Hou T, Jian C, Xu J, Huang AY, Xi J, Hu K, et al. Identification of EFHD1 as a novel Ca2+ sensor
for mitoflash activation. Cell Calcium. 2016 May;59(5):262–70.
59. Aziz MH, Manoharan HT, Church DR, Dreckschmidt NE, Zhong W, Oberley TD, et al. Protein
Kinase Cε Interacts with Signal Transducers and Activators of Transcription 3 (Stat3),
Phosphorylates Stat3Ser727, and Regulates Its Constitutive Activation in Prostate Cancer.
Cancer Res. 2007 Sep 15;67(18):8828–38.
60. Collins VP, Jones DTW, Giannini C. Pilocytic astrocytoma: pathology, molecular mechanisms
and markers. Acta Neuropathol (Berl). 2015 Jun;129(6):775–88.
61. Gutmann DH, McLellan MD, Hussain I, Wallis JW, Fulton LL, Fulton RS, et al. Somatic
neurofibromatosis type 1 (NF1) inactivation characterizes NF1-associated pilocytic
astrocytoma. Genome Res. 2013 Mar;23(3):431–9.
62. Janzarik W, Kratz C, Loges N, Olbrich H, Klein C, Schäfer T, et al. Further Evidence for a
Somatic KRAS Mutation in a Pilocytic Astrocytoma. Neuropediatrics. 2007 Apr;38(2):61–3.
63. Duerr E-M, Rollbrocker B, Hayashi Y, Peters N, Meyer-Puttlitz B, Louis DN, et al. PTEN
mutations in gliomas and glioneuronal tumors. Oncogene. 1998 Apr;16(17):2259–64.
64. Andrews LJ, Thornton ZA, Saincher SS, Yao IY, Dawson S, McGuinness LA, et al. Prevalence
of BRAFV600 in glioma and use of BRAF Inhibitors in patients with BRAFV600 mutation-
positive glioma: systematic review. Neuro-Oncol. 2021 Oct 28;noab247.
65. Ostrom QT, Patil N, Cioffi G, Waite K, Kruchko C, Barnholtz-Sloan JS. CBTRUS Statistical
Report: Primary Brain and Other Central Nervous System Tumors Diagnosed in the United
States in 2013–2017. Neuro-Oncol. 2020 Oct 30;22(Supplement_1):iv1–96.
66. Ostrom QT, Egan KM, Nabors LB, Gerke T, Thompson RC, Olson JJ, et al. Glioma risk
associated with extent of estimated European genetic ancestry in African Americans and
Hispanics. Int J Cancer. 2020 Feb 1;146(3):739–48.
67. Venables WN, Ripley BD. Modern Applied Statistics with S [Internet]. 4th ed. New York:
Springer; 2002. Available from: https://www.stats.ox.ac.uk/pub/MASS4/
68. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated
individuals. Genome Res. 2009 Sep 1;19(9):1655–64.
69. Maples BK, Gravel S, Kenny EE, Bustamante CD. RFMix: a discriminative modeling approach
for rapid and robust local-ancestry inference. Am J Hum Genet. 2013 Aug 8;93(2):278–88.
70. Grinde KE, Brown LA, Reiner AP, Thornton TA, Browning SR. Genome-wide Significance
Thresholds for Admixture Mapping Studies. Am J Hum Genet. 2019 Mar;104(3):454–65.
71. Browning BL, Zhou Y, Browning SR. A One-Penny Imputed Genome from Next-Generation
Reference Panels. Am J Hum Genet. 2018 Sep;103(3):338–48.
72. Browning BL, Tian X, Zhou Y, Browning SR. Fast two-stage phasing of large-scale sequence
data. Am J Hum Genet. 2021 Oct;108(10):1880–90.
73. Leece R, Xu J, Ostrom QT, Chen Y, Kruchko C, Barnholtz-Sloan JS. Global incidence of
malignant brain and other central nervous system tumors by histology, 2003-2007. Neuro-
Oncol. 2017 Oct 19;19(11):1553–64.
74. Ostrom QT, Cote DJ, Ascha M, Kruchko C, Barnholtz-Sloan JS. Adult Glioma Incidence and
Survival by Race or Ethnicity in the United States From 2000 to 2014. JAMA Oncol. 2018 Sep
1;4(9):1254.
75. Zhang H, Hong Y, Yang W, Wang R, Yao T, Wang J, et al. SNX14 deficiency-induced defective
axonal mitochondrial transport in Purkinje cells underlies cerebellar ataxia and can be
reversed by valproate. Natl Sci Rev. 2021 Jul;8(7):nwab024.
76. Akizu N, Cantagrel V, Zaki MS, Al-Gazali L, Wang X, Rosti RO, et al. Biallelic mutations in
SNX14 cause a syndromic form of cerebellar atrophy and lysosome-autophagosome
dysfunction. Nat Genet. 2015 May;47(5):528–34.
77. Byrne S, Dionisi-Vici C, Smith L, Gautel M, Jungbluth H. Vici syndrome: a review. Orphanet J
Rare Dis. 2016 Dec;11(1):21.
78. Bird TD. Hereditary Ataxia Overview. In: Adam MP, Ardinger HH, Pagon RA, Wallace SE,
Bean LJ, Mirzaa G, et al., editors. GeneReviews® [Internet]. Seattle (WA): University of
Washington, Seattle; 1993 [cited 2021 Aug 20]. Available from:
http://www.ncbi.nlm.nih.gov/books/NBK1138/
79. Pan Y, Hysinger JD, Barron T, Schindler NF, Cobb O, Guo X, et al. NF1 mutation drives
neuronal activity-dependent initiation of optic glioma. Nature. 2021 Jun;594(7862):277–82.
80. Weidemann A, Johnson RS. Biology of HIF-1α. Cell Death Differ. 2008 Apr;15(4):621–7.
81. Eckel-Passow JE, Drucker KL, Kollmeyer TM, Kosel ML, Decker PA, Molinaro AM, et al. Adult
diffuse glioma GWAS by molecular subtype identifies variants in D2HGDH and FAM20C.
Neuro-Oncol. 2020 Nov 26;22(11):1602–13.
82. Hladíková A, Plevová P, Macháčková E. [Breast cancer in monozygotic twins]. Klin Onkol Cas
Ceske Slov Onkol Spolecnosti. 2013;26(3):213–7.
83. Moore LD, Le T, Fan G. DNA Methylation and Its Basic Function. Neuropsychopharmacology.
2013 Jan;38(1):23–38.
84. Mai CT, Isenburg JL, Canfield MA, Meyer RE, Correa A, Alverson CJ, et al. National
population-based estimates for major birth defects, 2010-2014. Birth Defects Res. 2019 Nov
1;111(18):1420–35.
85. Marlow EC, Ducore J, Kwan ML, Cheng SY, Bowles EJA, Greenlee RT, et al. Leukemia Risk in a
Cohort of 3.9 Million Children with and without Down Syndrome. J Pediatr. 2021
Jul;234:172-180.e3.
86. Lupo PJ, Schraw JM, Desrosiers TA, Nembhard WN, Langlois PH, Canfield MA, et al.
Association Between Birth Defects and Cancer Risk Among Children and Adolescents in a
Population-Based Assessment of 10 Million Live Births. JAMA Oncol. 2019 Aug 1;5(8):1150–
8.
87. Buitenkamp TD, Izraeli S, Zimmermann M, Forestier E, Heerema NA, van den Heuvel-Eibrink
MM, et al. Acute lymphoblastic leukemia in children with Down syndrome: a retrospective
analysis from the Ponte di Legno study group. Blood. 2014 Jan 2;123(1):70–7.
88. Goldsby RE, Stratton KL, Raber S, Ablin A, Strong LC, Oeffinger K, et al. Long-term sequelae
in survivors of childhood leukemia with Down syndrome: A childhood cancer survivor study
report. Cancer. 2018 Feb 1;124(3):617–25.
89. Liu B, Filippi S, Roy A, Roberts I. Stem and progenitor cell dysfunction in human trisomies.
EMBO Rep. 2015 Jan;16(1):44–62.
90. Letourneau A, Santoni FA, Bonilla X, Sailani MR, Gonzalez D, Kind J, et al. Domains of
genome-wide gene expression dysregulation in Down’s syndrome. Nature. 2014 Apr
17;508(7496):345–50.
91. Lane AA, Chapuy B, Lin CY, Tivey T, Li H, Townsend EC, et al. Triplication of a 21q22 region
contributes to B cell transformation through HMGN1 overexpression and loss of histone H3
Lys27 trimethylation. Nat Genet. 2014 Jun;46(6):618–23.
92. Muskens IS, Li S, Jackson T, Elliot N, Hansen HM, Myint SS, et al. The genome-wide impact
of trisomy 21 on DNA methylation and its implications for hematopoiesis. Nat Commun.
2021 Dec;12(1):821.
93. Kachuri L, Jeon S, DeWan AT, Metayer C, Ma X, Witte JS, et al. Genetic determinants of
blood-cell traits influence susceptibility to childhood acute lymphoblastic leukemia. Am J
Hum Genet. 2021 Oct 7;108(10):1823–35.
94. Vuckovic D, Bao EL, Akbari P, Lareau CA, Mousas A, Jiang T, et al. The Polygenic and
Monogenic Basis of Blood Traits and Diseases. Cell. 2020 Sep 3;182(5):1214-1231.e11.
95. Felix JF, Cecil CAM. Population DNA methylation studies in the Developmental Origins of
Health and Disease (DOHaD) framework. J Dev Orig Health Dis. 2019 Jun;10(3):306–13.
96. Zhou W, Triche TJ, Laird PW, Shen H. SeSAMe: reducing artifactual detection of DNA
methylation by Infinium BeadChips in genomic deletions. Nucleic Acids Res [Internet]. 2018
Jul 31 [cited 2021 Oct 6]; Available from: https://academic.oup.com/nar/advance-
article/doi/10.1093/nar/gky691/5061974
97. Rahmani E, Yedidim R, Shenhav L, Schweiger R, Weissbrod O, Zaitlen N, et al. GLINT: a user-
friendly toolset for the analysis of high-throughput DNA-methylation array data. Bioinforma
Oxf Engl. 2017 Jun 15;33(12):1870–2.
98. Yao C, Joehanes R, Wilson R, Tanaka T, Ferrucci L, Kretschmer A, et al. Epigenome-wide
association study of whole blood gene expression in Framingham Heart Study participants
provides molecular insight into the potential role of CHRNA5 in cigarette smoking-related
lung diseases. Clin Epigenetics. 2021 Dec;13(1):60.
99. Hannon E, Gorrie-Stone TJ, Smart MC, Burrage J, Hughes A, Bao Y, et al. Leveraging DNA-
Methylation Quantitative-Trait Loci to Characterize the Relationship between Methylomic
Variation, Gene Expression, and Complex Traits. Am J Hum Genet. 2018 Nov 1;103(5):654–
65.
100. Taylor DL, Jackson AU, Narisu N, Hemani G, Erdos MR, Chines PS, et al. Integrative
analysis of gene expression, DNA methylation, physiological traits, and genetic variation in
human skeletal muscle. Proc Natl Acad Sci U S A. 2019 May 28;116(22):10883–8.
101. Peters TJ, Buckley MJ, Statham AL, Pidsley R, Samaras K, V Lord R, et al. De novo
identification of differentially methylated regions in the human genome. Epigenetics
Chromatin. 2015 Dec;8(1):6.
102. Pedersen BS, Schwartz DA, Yang IV, Kechris KJ. Comb-p: software for combining,
analyzing, grouping and correcting spatially correlated P-values. Bioinformatics. 2012 Nov
15;28(22):2986–8.
103. Satterwhite E, Sonoki T, Willis TG, Harder L, Nowak R, Arriola EL, et al. The BCL11 gene
family: involvement of BCL11A in lymphoid malignancies. Blood. 2001 Dec 1;98(12):3413–
20.
104. Jardine L, Webb S, Goh I, Quiroga Londoño M, Reynolds G, Mather M, et al. Blood and
immune development in human fetal bone marrow and Down syndrome. Nature. 2021
Oct;598(7880):327–31.
105. Roy A, Cowan G, Mead AJ, Filippi S, Bohn G, Chaidos A, et al. Perturbation of fetal liver
hematopoietic stem and progenitor cell development by trisomy 21. Proc Natl Acad Sci U S
A. 2012 Oct 23;109(43):17579–84.
106. Thilaganathan B, Tsakonas D, Nicolaides K. Abnormal fetal immunological development
in Down’s syndrome. Br J Obstet Gynaecol. 1993 Jan;100(1):60–2.
107. de Hingh YCM, van der Vossen PW, Gemen EFA, Mulder AB, Hop WCJ, Brus F, et al.
Intrinsic abnormalities of lymphocyte counts in children with down syndrome. J Pediatr.
2005 Dec;147(6):744–7.
108. Verstegen RHJ, Kusters MAA, Gemen EFA, de Vries E. Down syndrome B-lymphocyte
subpopulations, intrinsic defect or decreased T-lymphocyte help. Pediatr Res. 2010
May;67(5):563–9.
109. Greaves M. A causal mechanism for childhood acute lymphoblastic leukaemia. Nat Rev
Cancer. 2018 Aug;18(8):471–84.
110. Garrison MM, Jeffries H, Christakis DA. Risk of death for children with down syndrome
and sepsis. J Pediatr. 2005 Dec;147(6):748–52.
111. Verstegen RHJ, Chang KJJ, Kusters MAA. Clinical implications of immune-mediated
diseases in children with Down syndrome. Pediatr Allergy Immunol Off Publ Eur Soc Pediatr
Allergy Immunol. 2020 Feb;31(2):117–23.
112. Santoro SL, Chicoine B, Jasien JM, Kim JL, Stephens M, Bulova P, et al. Pneumonia and
respiratory infections in Down syndrome: A scoping review of the literature. Am J Med
Genet A. 2021 Jan;185(1):286–99.
113. Hasaart KAL, Manders F, van der Hoorn M-L, Verheul M, Poplonski T, Kuijk E, et al.
Mutation accumulation and developmental lineages in normal and Down syndrome human
fetal haematopoiesis. Sci Rep. 2020 Jul 31;10(1):12991.
114. Timms JA, Relton CL, Sharp GC, Rankin J, Strathdee G, McKay JA. Exploring a potential
mechanistic role of DNA methylation in the relationship between in utero and post-natal
environmental exposures and risk of childhood acute lymphoblastic leukaemia. Int J Cancer.
2019 Dec 1;145(11):2933–43.
115. Mullighan CG, Goorha S, Radtke I, Miller CB, Coustan-Smith E, Dalton JD, et al. Genome-
wide analysis of genetic alterations in acute lymphoblastic leukaemia. Nature. 2007 Apr
12;446(7137):758–64.
116. Roberts I, Alford K, Hall G, Juban G, Richmond H, Norton A, et al. GATA1-mutant clones
are frequent and often unsuspected in babies with Down syndrome: identification of a
population at risk of leukemia. Blood. 2013 Dec 5;122(24):3908–17.
117. Chen Y, Lemire M, Choufani S, Butcher DT, Grafodatskaya D, Zanke BW, et al. Discovery
of cross-reactive probes and polymorphic CpGs in the Illumina Infinium
HumanMethylation450 microarray. Epigenetics. 2013 Feb;8(2):203–9.
118. Gonseth S, de Smith AJ, Roy R, Zhou M, Lee S-T, Shao X, et al. Genetic contribution to
variation in DNA methylation at maternal smoking-sensitive loci in exposed neonates.
Epigenetics. 2016 Sep;11(9):664–73.
119. Blair EM, Liu Y, de Klerk NH, Lawrence DM. Optimal fetal growth for the Caucasian
singleton and assessment of appropriateness of fetal growth: an analysis of a total
population perinatal database. BMC Pediatr. 2005 May 24;5(1):13.
120. Buck Louis GM, Grewal J, Albert PS, Sciscione A, Wing DA, Grobman WA, et al.
Racial/ethnic standards for fetal growth: the NICHD Fetal Growth Studies. Am J Obstet
Gynecol. 2015 Oct;213(4):449.e1-449.e41.
121. Horikoshi M, Beaumont RN, Day FR, Warrington NM, Kooijman MN, Fernandez-Tajes J,
et al. Genome-wide associations for birth weight and correlations with adult disease.
Nature. 2016 Oct 13;538(7624):248–52.
122. Buniello A, MacArthur JAL, Cerezo M, Harris LW, Hayhurst J, Malangone C, et al. The
NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays
and summary statistics 2019. Nucleic Acids Res. 2019 Jan 8;47(D1):D1005–12.
123. Petrick LM, Schiffman C, Edmands WMB, Yano Y, Perttula K, Whitehead T, et al.
Metabolomics of neonatal blood spots reveal distinct phenotypes of pediatric acute
lymphoblastic leukemia and potential effects of early-life nutrition. Cancer Lett. 2019
Jun;452:71–8.
124. Das S, Forer L, Schönherr S, Sidore C, Locke AE, Kwong A, et al. Next-generation
genotype imputation service and methods. Nat Genet. 2016 Oct;48(10):1284–7.
125. Fuchsberger C, Abecasis GR, Hinds DA. minimac2: faster genotype imputation.
Bioinformatics. 2015 Mar 1;31(5):782–4.
126. Delaneau O, Ongen H, Brown AA, Fort A, Panousis NI, Dermitzakis ET. A complete tool
set for molecular QTL discovery and analysis. Nat Commun. 2017 Aug;8(1):15452.
127. Lucas A., Kristina Gervin, Meaghan C., Kelly M., Devin C., John K., Karl T., Robert Lyle,
Brock C., Janine Felix. FlowSorted.CordBloodCombined.450k [Internet]. Bioconductor; [cited
2021 Dec 10]. Available from:
https://bioconductor.org/packages/FlowSorted.CordBloodCombined.450k
128. Moccia C, Popovic M, Isaevska E, Fiano V, Trevisan M, Rusconi F, et al. Birthweight DNA
methylation signatures in infant saliva. Clin Epigenetics. 2021 Mar 19;13(1):57.
129. Tekola-Ayele F, Zeng X, Ouidir M, Workalemahu T, Zhang C, Delahaye F, et al. DNA
methylation loci in placenta associated with birthweight and expression of genes relevant
for early development and adult diseases. Clin Epigenetics. 2020 Jun 3;12(1):78.
130. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene
set enrichment analysis: A knowledge-based approach for interpreting genome-wide
expression profiles. Proc Natl Acad Sci. 2005 Oct 25;102(43):15545–50.
131. Mootha VK, Lindgren CM, Eriksson K-F, Subramanian A, Sihag S, Lehar J, et al. PGC-1α-
responsive genes involved in oxidative phosphorylation are coordinately downregulated in
human diabetes. Nat Genet. 2003 Jul;34(3):267–73.
132. Ren HM, Lukacher AE, Rahman ZSM, Olsen NJ. New developments implicating IL-21 in
autoimmune disease. J Autoimmun. 2021 Aug;122:102689.
133. Lind L. Genetic Determinants of Clustering of Cardiometabolic Risk Factors in U.K.
Biobank. Metab Syndr Relat Disord. 2020 Apr;18(3):121–7.
134. Hübel C, Gaspar HA, Coleman JRI, Finucane H, Purves KL, Hanscombe KB, et al. Genomics
of body fat percentage may contribute to sex bias in anorexia nervosa. Am J Med Genet
Part B Neuropsychiatr Genet Off Publ Int Soc Psychiatr Genet. 2019 Sep;180(6):428–38.
135. Pulit SL, Stoneman C, Morris AP, Wood AR, Glastonbury CA, Tyrrell J, et al. Meta-analysis
of genome-wide association studies for body fat distribution in 694 649 individuals of
European ancestry. Hum Mol Genet. 2019 Jan 1;28(1):166–74.
136. Reticker-Flynn NE, Bhatia SN. Aberrant glycosylation promotes lung cancer metastasis
through adhesion to galectins in the metastatic niche. Cancer Discov. 2015 Feb;5(2):168–
81.
137. Hansen KD. IlluminaHumanMethylationEPICanno.ilm10b2.hg19: Annotation for Illumina’s EPIC
methylation arrays [Internet]. Available from:
https://bitbucket.com/kasperdanielhansen/Illumina_EPIC
138. Shapiro C, Sutija VG, Bush J. Effect of maternal weight gain on infant birth weight. J
Perinat Med. 2000;28(6):428–31.
139. Jan Mohamed HJ, Lim PY, Loy SL, Chang KH, Abdullah AFL. Temporal association of
maternal weight gain with early-term and preterm birth and low birth weight babies. J Chin
Med Assoc JCMA. 2021 Jul 1;84(7):722–7.
140. Wierzejska R, Wojda B. Pre-pregnancy nutritional status versus maternal weight gain
and neonatal size. Rocz Panstw Zakl Hig. 2019;70(4):377–84.
141. Irizarry RA, Ladd-Acosta C, Wen B, Wu Z, Montano C, Onyango P, et al. The human colon
cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific
CpG island shores. Nat Genet. 2009 Feb;41(2):178–86.
142. Mitsumori R, Sakaguchi K, Shigemizu D, Mori T, Akiyama S, Ozaki K, et al. Lower DNA
methylation levels in CpG island shores of CR1, CLU, and PICALM in the blood of Japanese
Alzheimer’s disease patients. PloS One. 2020;15(9):e0239196.
143. Wernig-Zorc S, Yadav MP, Kopparapu PK, Bemark M, Kristjansdottir HL, Andersson P-O,
et al. Global distribution of DNA hydroxymethylation and DNA methylation in chronic
lymphocytic leukemia. Epigenetics Chromatin. 2019 Jan 7;12(1):4.
144. Zhang Y, Wu X, Kai Y, Lee C-H, Cheng F, Li Y, et al. Secretome profiling identifies neuron-
derived neurotrophic factor as a tumor-suppressive factor in lung cancer. JCI Insight. 2019
Dec 19;4(24):129344.
145. Parlayan C, Sahin Y, Altan Z, Arman K, Ikeda M-A, Saadat KASM. ARID3A regulates
autophagy related gene BECN1 expression and inhibits proliferation of osteosarcoma cells.
Biochem Biophys Res Commun. 2021 Dec 31;585:89–95.
146. Saadat KASM, Lestari W, Pratama E, Ma T, Iseki S, Tatsumi M, et al. Distinct and
overlapping roles of ARID3A and ARID3B in regulating E2F-dependent transcription via
direct binding to E2F target genes. Int J Oncol. 2021 Apr;58(4):12.
147. Yao Y, Wang X, Li H, Fan J, Qian X, Li H, et al. Phospholipase D as a key modulator of
cancer progression. Biol Rev Camb Philos Soc. 2020 Aug;95(4):911–35.
148. Brown HA, Thomas PG, Lindsley CW. Targeting phospholipase D in cancer, infection and
neurodegenerative disorders. Nat Rev Drug Discov. 2017 May;16(5):351–67.
149. Boileau C, Guo D-C, Hanna N, Regalado ES, Detaint D, Gong L, et al. TGFB2 mutations
cause familial thoracic aortic aneurysms and dissections associated with mild systemic
features of Marfan syndrome. Nat Genet. 2012 Jul 8;44(8):916–21.
150. Lindsay ME, Schepers D, Bolar NA, Doyle JJ, Gallo E, Fert-Bober J, et al. Loss-of-function
mutations in TGFB2 cause a syndromic presentation of thoracic aortic aneurysm. Nat
Genet. 2012 Jul 8;44(8):922–7.
151. Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx BWJH, et al. Integrative approaches
for large-scale transcriptome-wide association studies. Nat Genet. 2016 Mar;48(3):245–52.
152. Caughey RW, Michels KB. Birth weight and childhood leukemia: a meta-analysis and
review of the current evidence. Int J Cancer. 2009 Jun 1;124(11):2658–70.
Abstract
Cancer is one of the leading causes of death in children in the USA (1); however, the etiology of most childhood cancers remains unknown at the time of writing of this thesis. Given that somatic changes are less frequently associated with cancers in children than in adults, developmental abnormalities reflected by neonatal epigenetics and genetics are likely to offer key insights into the mechanisms of childhood cancer etiology. Consequently, they may also help to predict cancer risk, which will be valuable for targeted follow-up, early treatment, or even prevention.
In this thesis, I illustrate the use of high-throughput epigenetic and genetic data, as well as information inferred from them, to identify cancer-related changes in children at birth, with the help of a wide variety of bioinformatics tools. I focus on typical types of childhood cancers, including childhood leukemia and childhood glioma, both in the general population and in children with Down syndrome.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Understanding acute lymphoblastic leukemia in different ethnic groups in the United States
Ancestral/Ethnic variation in the epidemiology and genetic predisposition of early-onset hematologic cancers
Application of tracing enhancer networks using epigenetic traits (TENET) to identify epigenetic deregulation in cancer
The influence of DNA repair genes and prenatal tobacco exposure on childhood acute lymphoblastic leukemia risk: a gene-environment interaction study
Development of a colorectal cancer-on-chip to investigate the tumor microenvironment's role in cancer progression
Genetic epidemiological approaches in the study of risk factors for hematologic malignancies
The role of social support in the relationship between adverse childhood experiences and addictive behaviors across adolescence and young adulthood
Prostate cancer: genetic susceptibility and lifestyle risk factors
DNA methylation review and application
Understanding prostate cancer genetic susceptibility and chromatin regulation
Molecular role of EZH2 overexpression in Colorectal Cancer progression
Context-dependent role of androgen receptor (AR) in estrogen receptor-positive (ER+) breast cancer
Applying multi-omics in cancer liquid biopsy for improved patient monitoring and biomarker discovery
Developing a robust single cell whole genome bisulfite sequencing protocol to analyse circulating tumor cells
Genetic and environmental risk factors for childhood cancer
Integrative genomic and epigenomic analysis of human cancer
Investigating the complexity of the tumor microenvironment's role in drug response
Pathogenic variants in cancer predisposition genes and risk of non-breast multiple primary cancers in breast cancer patients
Transfusional iron, anthracyclines and cardiac outcomes in childhood cancer survivors
Cisplatin activates mitochondrial oxphos leading to acute treatment resistance in bladder cancer
Asset Metadata
Creator
Li, Shaobo (author)
Core Title
Perinatal epigenetic and genetic analyses in childhood cancers
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Cancer Biology and Genomics
Degree Conferral Date
2022-05
Publication Date
03/04/2024
Defense Date
02/28/2022
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
bioinformatic analyses, childhood cancers, epigenetic analyses, GWAS, OAI-PMH Harvest
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
de Smith, Adam (committee chair), Chiang, Charleston (committee member), Wiemels, Joseph (committee member)
Creator Email
lishaobo@usc.edu,sebastian.li@outlook.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC110768200
Unique identifier
UC110768200
Legacy Identifier
etd-LiShaobo-10420
Document Type
Dissertation
Rights
Li, Shaobo
Type
texts
Source
20220308-usctheses-batch-915 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu