Genetic epidemiological approaches in the study of risk factors for hematologic malignancies
By
Keren Xu
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(EPIDEMIOLOGY)
May 2022
Copyright 2022 Keren Xu
Acknowledgements
This dissertation became a reality with the help of many people.
First, I owe my committee chair Dr. Adam de Smith my permanent gratitude for
teaching me how to be a scientist, for inspiring me to engage with meaningful problems in
rigorous, calm, and novel ways, and for spending his precious time every week discussing my
work with remarkable insights and painstakingly wordsmithing my manuscripts. You have
inspired the best in me. Your diligence has kept me motivated, which helps me to tackle
problems with positive approaches and a cool state of mind. I have no words to explain how
unbelievably lucky I feel to have you as my PhD mentor.
I also thank my committee members Dr. Joseph Wiemels, Dr. Kimberly Siegmund, Dr.
Myles Cockburn, and Dr. Oliver Bell for offering their valuable time to help me complete this
PhD and move on to the next phase of my life.
Dr. Joseph Wiemels has created a lovely lab environment and has offered his thoughtful
advice on multiple research projects. Dr. Kimberly Siegmund has provided valuable guidance on
administrative requirements. It was a great experience working alongside Dr. Myles Cockburn,
Dr. Melissa Wilson, and Dr. Trevor Pickering as their teaching assistant during the second year
of my PhD training. Dr. Wendy Mack’s classes on statistical methods were indispensable for my
PhD screening exam preparation. Dr. Wendy Cozen has offered tremendous help on the
sequencing project in chapter 4. Dr. Victoria Cortessis and Dr. Tracy Levin treated me with
stunning kindness and offered incredible encouragement. I owe them my sincere thanks.
Derek Strong and Cesar Sul have always been dependable in troubleshooting
HPC-related problems, and I simply could not have completed many of my analyses without
their help.
I thank all of my lab mates for their help with my research projects. Shaobo (Sebastian)
Li and Qianxi Feng have been invaluable companions both in academia and in life. I thank
Sebastian especially for helping me get started on HPC and GATK from the very beginning.
I also thank all of my friends and colleagues at the University of Southern California for
offering their precious time and support during my PhD training.
I also thank all the human beings who have made open-source contributions. I get to
stand on the shoulders of many giants. I thank Geraldine Van der Auwera and Brian O'Connor,
who published the fabulous Genomics in the Cloud book around the time I started working on
the chapter 4 sequencing project.
Finally, I thank my parents, who have fostered my curiosity that drives me to pursue this
PhD and who have never wavered in their support and love for me. Also, grandma, nothing
makes me happier than to make you proud.
Table of Contents
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Disparities in Acute Lymphoblastic Leukemia Risk and Survival across the Lifespan
  Abstract
  Introduction
  Disparities in ALL incidence across the lifespan
    Incidence of ALL is highest in Hispanics/Latinos
    Incidence of ALL is rising fastest in Hispanics/Latinos
  Disparities in survival of ALL patients
    Children
    AYAs
    Older adults
  Factors associated with disparities in ALL risk and outcomes
    Differences in ALL tumor biology
    Genetic variation
    Environmental exposures and ALL risk
    Socioeconomic status and ALL risk and survival
    Access to care
    Recruitment to clinical trials
  Conclusions and future directions
Chapter 2: Epigenetic Biomarkers of Prenatal Tobacco Smoke Exposure Are Associated with Gene Deletions in Childhood Acute Lymphoblastic Leukemia
  Abstract
  Introduction
  Materials and Methods
    Ethics statement
    Study population
    Somatic copy-number data
    Genome-wide DNA methylation arrays
    AHRR DNA methylation quantitative trait locus (mQTL) genotype data
    Self-reported smoking exposures
    Statistical analyses
  Results
    Association between self-reported smoking variables and DNA methylation-based biomarkers of maternal smoking in B-ALL cases
    DNA methylation-based biomarkers of tobacco smoke exposure are associated with gene deletion burden in childhood ALL
  Discussion
Chapter 3: Investigating DNA Methylation as a Mediator of Genetic Risk in Childhood Acute Lymphoblastic Leukemia
  Abstract
  Introduction
  Materials and Methods
    Study participants
    DNA methylation arrays
    Identification of differentially methylated positions (DMPs)
    Genotyping
    Identification of methylation quantitative trait loci (mQTL)
    Mediation analysis
    Locus-specific analyses
    DNA methylation and gene expression analysis
  Results
    Differentially methylated positions
    Methylation quantitative trait loci
    Mediation analysis
    Race/ethnicity stratified analyses
    ARID5B CpG cg13344587 is a proxy for ALL risk SNP rs7090445
    IKZF1 gene-specific analysis
    IKZF1 SNP rs78396808 is independent of another secondary association signal at SNP rs6421315
    Increased DNA methylation at IKZF1 CpG cg01139861 correlated with decreased IKZF1 expression
  Discussion
Chapter 4: Whole-Exome Sequencing in Multiplex Families to Identify Novel AYA Classical Hodgkin Lymphoma Predisposition Genes
  Abstract
  Introduction
  Methods
    Study subjects
    Exome sequencing and variant calling
    Identification of P/LP germline variants in cancer-predisposition genes
    Analysis of putative P/LP germline variants by incorporating family structure
  Results
    Putative P/LP variants in predisposition genes in cHL patients
    Incorporating family structure identifies putative causal germline variants in cHL
    P/LP variants in predisposition genes in families lacking sibling controls
    Loss-of-function variants in predisposition genes in multiplex families with cHL patients
  Discussion
Conclusions
References
List of Tables
Table 1.1. Genetic variants associated with ALL risk in genome-wide association studies.
Table 2.1. Characteristics of childhood B-ALL cases (n = 482) in the California Childhood Leukemia Study.
Table 2.2. The distribution of CpGs that make up the polyepigenetic smoking scores*.
Table 2.3. Maternal smoking variables adjusted for paternal smoking in linear regression models for association analysis between parental smoking and DNA methylation, and vice versa.
Table 2.4. Joint exposures of self-reported tobacco smoking.
Table 2.5. The distribution of gene deletions of 482 B-ALL cases.
Table 2.6. Summary statistics of self-reported tobacco smoke exposures.
Table 2.7. Multivariable Poisson regression testing the association between epigenetic biomarkers of prenatal tobacco smoke exposure and gene deletion frequency in B-ALL cases.
Table 2.8. Multivariable Poisson regression including the interaction term between DNA methylation and B-ALL cytogenetic subtypes.
Table 3.1. Characteristics of study participants included in the EWAS stratified by study set and ALL case/control status (n = 1,727).
Table 3.2. Characteristics of study participants included in the mQTL analysis stratified by study set and ALL case/control status (n = 1,487).
Table 3.3. Probe filtering steps to remove cross-reactive probes, Probe SNPs, CpG SNPs, and SBE SNPs.
Table 3.4. 23 ALL risk SNPs included in the mQTL analysis.
Table 3.5. EWAS results for DMPs found to be associated with ALL risk SNPs in the subsequent mQTL analysis.
Table 3.6. Significant SNP-DMP pairs identified from mQTL meta-analyses.
Table 3.7. Estimated total effects, ACME and ADE from the causal mediation analysis (quasi-Bayesian Monte Carlo simulation, n = 1000).
Table 3.8. Estimated total effects, ACME and ADE from the causal mediation analysis (nonparametric bootstrap, n = 1000 simulations).
Table 3.9. Significant SNP-DMP pairs identified from the mQTL overall meta-analyses and the mQTL EPIC array meta-analyses in Latinos, non-Latino whites and non-Latinos.
Table 3.10. Estimated total effects, ACME and ADE from the causal mediation analyses in Latinos, non-Latino whites and non-Latinos (quasi-Bayesian Monte Carlo, 1000 simulations).
Table 3.11. Results from three logistic regression models investigating the potential confounding effects.
Table 3.12. Results from the IKZF1 gene-specific EWAS analysis.
Table 3.13. Significant SNP-DMP pairs identified from the mQTL overall meta-analysis and the mQTL EPIC array meta-analysis in Latinos (IKZF1 gene-specific analysis).
Table 3.14. Estimated total effects, ACME and ADE from the mediation analysis in Latinos (IKZF1 gene-specific analysis) (quasi-Bayesian Monte Carlo, 1000 simulations).
Table 3.15. Meta-analyzed results for the association between rs78396808 and ALL risk (683 cases and 804 controls).
List of Figures
Figure 1.1. Disparities in acute lymphoblastic leukemia (ALL) incidence across the lifespan.
Figure 1.2. Genetic variants associated with ALL risk and outcomes across the genome.
Figure 1.3. Risk allele frequency and effect size of selected single nucleotide polymorphisms (SNPs) associated with acute lymphoblastic leukemia (ALL) risk.
Figure 2.1. Sample flowchart.
Figure 2.2. Chord diagrams showing the proportions of cases with at least one deletion and cases with zero deletions, in the 450K array (N=198) and EPIC array (N=284) datasets.
Figure 2.3. Density plots of the polyepigenetic smoking score generated among 194 B-ALL cases.
Figure 2.4. Independent two sample t-tests comparing DNA methylation beta values for CpGs that are shared by 450K arrays and EPIC arrays and were used to calculate the polyepigenetic smoking score.
Figure 2.5. Linear regression results for DNA methylation at the AHRR CpG cg05575921, the polyepigenetic smoking score and child’s birth year.
Figure 2.6. Correlation matrix for parental reported tobacco smoke exposure variables, DNA methylation at the AHRR CpG cg05575921 and the polyepigenetic smoking score in the (A) 450K array (N=198) and (B) EPIC array (N=284) datasets.
Figure 2.7. Linear regression results for the associations of parental self-reported tobacco smoke exposures with DNA methylation at the AHRR CpG cg05575921 and with the polyepigenetic smoking score.
Figure 2.8. Forest plots showing meta-analysis results of the association between epigenetic biomarkers of prenatal tobacco smoke exposure and gene deletion frequency in B-ALL cases.
Figure 2.9. Forest plots showing meta-analysis results of the association between epigenetic biomarkers of prenatal tobacco smoke exposure and gene deletion frequency in B-ALL cases among self-reported (A) Latinos (n = 115 in 450K; n = 113 in EPIC) and (B) non-Latino Whites (n = 54 in 450K; n = 48 in EPIC).
Figure 2.10. Forest plots showing meta-analysis results of the association between epigenetic biomarkers of prenatal tobacco smoke exposure and gene deletion frequency in B-ALL cases by diagnosis age: (A) diagnosed no earlier than 2 years of age (n = 178 in 450K; n = 268 in EPIC), (B) diagnosed 0 to 5 years of age (n = 110 in 450K; n = 172 in EPIC), and (C) diagnosed >5 years of age (n = 88 in 450K; n = 112 in EPIC).
Figure 3.1. Study design of the EWAS, mQTL, and mediation analyses.
Figure 3.2. Meta-analysis of the epigenome-wide association analysis.
Figure 3.3. Path diagrams showing the results of the causal mediation and confounding analyses.
Figure 3.4. Characteristics of the significant DMP-SNP pair at IKZF1.
Figure 3.5. Scatter plot showing the relationship between cg01139861 DNA methylation and IKZF1 expression levels in ALL tumor samples.
Figure 3.6. Scatter plot showing the relationship between DNA methylation beta values at CpG cg01139861 and CpG cg10551353 at gene IKZF1 in Latinos.
Figure 4.1. Flowchart showing the workflow of rare germline variant discovery.
Figure 4.2. The depth of coverage matrices of the 48 samples from 14 multiplex families with AYA classical Hodgkin lymphoma patients.
Figure 4.3. The distribution of the 250 putative P/LP germline variants that were analyzed using the PeCanPIE MedalCeremony pipeline across the predisposition genes.
Figure 4.4. The distribution of the 172 putative P/LP germline variants classified as gold, silver, or unknown by the PeCanPIE MedalCeremony pipeline across the 14 multiplex families with AYA classical Hodgkin lymphoma patients.
Figure 4.5. The distribution of the 22 putative causal germline variants and the 7 putative P/LP loss-of-function variants across the 10 multiplex families with AYA classical Hodgkin lymphoma patients.
Figure 4.6. The distribution of the 20 putative P/LP germline variants in predisposition genes found in the classical Hodgkin lymphoma patients of the 3 families lacking sibling controls with sequencing data.
Abstract
Acute lymphoblastic leukemia (ALL) is the most common childhood malignancy in the
United States, with approximately 2,700 incident cases diagnosed under age 15 each year.
Although current treatment protocols result in an overall survival rate that exceeds 90% in
childhood ALL patients in the US, long-term survivors experience significant adverse effects
from therapy. Therefore, prevention remains a top priority, and understanding the causes of
childhood ALL remains essential.
In chapter 1, we reviewed the racial/ethnic disparities in ALL incidence, discussed how
these vary across the age spectrum, and examined the potential causes of these disparities.
Genome-wide association studies have identified a growing number of single nucleotide
polymorphisms (SNPs) associated with childhood ALL. Several ALL-risk SNPs are associated with
genetic ancestry, and demonstrate different risk allele frequencies and/or effect sizes across
populations. Moreover, non-genetic factors including socioeconomic status, access to care, and
environmental exposures all likely influence the disparities in ALL risk and survival.
Parental smoking is implicated in the etiology of childhood ALL, but the causal
mechanisms remain largely unclear. In chapter 2, we assessed the association between early-
life tobacco smoke exposures and somatic gene deletions using two epigenetic biomarkers for
maternal smoking during pregnancy – DNA methylation at AHRR CpG cg05575921 and a
recently established polyepigenetic smoking score – in 482 B-ALL cases in the California
Childhood Leukemia Study with available Illumina 450K or MethylationEPIC array data. We
found an association between DNA methylation at AHRR CpG cg05575921 and deletion number
(meta-analysis summary ratio of means [sRM]=1.32, 95% CI: 1.10-1.57), and the polyepigenetic
smoking score was positively associated with gene deletion frequency among all 482 B-ALL cases
(sRM=1.31 for each 4-unit increase in score; 95% CI: 1.09-1.57). We provide further evidence
that prenatal tobacco-smoke exposure may influence the generation of somatic copy-number
deletions in childhood B-ALL. Analyses of deletion breakpoint sequences are required to further
understand the mutagenic effects of tobacco smoke in childhood ALL.
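
The reported ratio of means can be moved to and from the log scale on which a Poisson regression coefficient is estimated. The following is a minimal sketch of that arithmetic only, not the analysis code used in this work. It assumes the 95% CI is a Wald interval symmetric on the log scale, and the function names `log_scale_summary` and `rescale_ratio` are hypothetical helpers introduced for illustration.

```python
import math


def log_scale_summary(ratio, ci_low, ci_high, z=1.96):
    """Recover the log-scale estimate and standard error from a ratio and its CI.

    Assumes a Wald interval: exp(beta - z*se) to exp(beta + z*se).
    """
    beta = math.log(ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return beta, se


def rescale_ratio(ratio, units_reported, units_wanted):
    """Re-express a ratio reported per `units_reported` as a ratio per `units_wanted`."""
    return math.exp(math.log(ratio) * units_wanted / units_reported)


# Values quoted in the abstract: sRM = 1.31 per 4-unit increase, 95% CI 1.09-1.57.
beta, se = log_scale_summary(1.31, 1.09, 1.57)
per_unit = rescale_ratio(1.31, 4, 1)
print(f"log-scale estimate per 4 units: {beta:.3f} (SE {se:.3f})")
print(f"implied ratio of means per 1-unit increase: {per_unit:.3f}")
```

Under these assumptions, a ratio of 1.31 per 4-unit increase corresponds to a ratio of roughly 1.07 per single unit of the score.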
Genome-wide association studies have identified a growing number of SNPs associated
with childhood ALL, yet the functional roles of most SNPs are unclear. Evidence suggests
epigenetic mechanisms may mediate the impact of heritable genetic variation on phenotypes.
In chapter 3, we investigated whether DNA methylation mediates the effect of genetic risk loci
for childhood ALL. We performed an epigenome-wide association study (EWAS) including 808
childhood ALL cases and 919 controls from California-based studies using neonatal blood DNA.
For differentially methylated CpG positions (DMPs), we next conducted association analysis
with 23 known ALL risk SNPs followed by causal mediation analyses addressing the significant
SNP-DMP pairs. DNA methylation at CpG cg01139861, in the promoter region of IKZF1,
mediated the effects of the intronic IKZF1 risk SNP rs78396808, with the average causal
mediation effect (ACME) explaining ~30% of the total effect (ACME P=0.0031). In analyses
stratified by self-reported race/ethnicity, the mediation effect was only significant in Latinos,
explaining ~41% of the total effect of rs78396808 on ALL risk (ACME P=0.0037). We also
demonstrated that the most significant DMP in the EWAS, CpG cg13344587 at gene ARID5B
(P=8.61×10⁻¹⁰), was entirely confounded by the ARID5B ALL risk SNP rs7090445. Our findings
provide new insights into the functional pathways of ALL risk SNPs and the DNA methylation
differences associated with risk of childhood ALL.
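
The quantities named above (total effect, ACME, and the proportion of the total effect that is mediated) can be illustrated with a simplified linear product-of-coefficients sketch. This is not the quasi-Bayesian or bootstrap causal mediation procedure used in chapter 3, which handles a binary case/control outcome. The data here are simulated, the effect sizes are arbitrary, and `mediation_linear` is a hypothetical helper.

```python
import random


def cov(x, y):
    """Sample covariance of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def mediation_linear(g, m, y):
    """Product-of-coefficients mediation for linear models M ~ G and Y ~ G + M."""
    vg, vm, cgm = cov(g, g), cov(m, m), cov(g, m)
    cgy, cmy = cov(g, y), cov(m, y)
    a = cgm / vg                             # effect of genotype on the mediator
    denom = vg * vm - cgm ** 2
    b = (cmy * vg - cgy * cgm) / denom       # mediator effect on outcome, adjusted for genotype
    direct = (cgy * vm - cmy * cgm) / denom  # genotype effect on outcome, adjusted for mediator
    acme = a * b                             # indirect (mediated) effect
    total = acme + direct
    return {"acme": acme, "ade": direct, "total": total,
            "prop_mediated": acme / total}


# Simulated data with arbitrary effects: a = 0.5, b = 0.6, direct = 0.7,
# so the true proportion mediated is 0.3 / (0.3 + 0.7) = 0.3.
rng = random.Random(0)
n = 20000
g = [(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n)]  # 0/1/2 allele count
m = [0.5 * gi + rng.gauss(0, 1) for gi in g]                         # methylation-like mediator
y = [0.7 * gi + 0.6 * mi + rng.gauss(0, 1) for gi, mi in zip(g, m)]  # continuous outcome
result = mediation_linear(g, m, y)
print({k: round(v, 3) for k, v in result.items()})
```

In this linear setting the indirect and direct effects sum exactly to the total effect, which is why the proportion mediated can be read as the share of the SNP's effect that passes through DNA methylation.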
Hodgkin lymphoma (HL) is a B-cell malignancy that affects ~2-3 per 100,000 individuals
per year in the United States. Sequencing of familial and sporadic HL patients previously
identified rare pathogenic germline variants in genes with varying functions. The mechanisms
by which these low frequency variants with high penetrance contribute to HL risk have yet to
be determined. In chapter 4, we performed germline whole-exome sequencing (WES) for 48
individuals, including 20 classical HL (cHL) cases in 14 multiplex families with primarily AYA cHL
patients (age at diagnosis: 17-50 years) to identify novel cHL predisposition genes. Rare
germline short variant calling, annotation, and filtering were performed. In 11 families with
unaffected sibling controls, we found 22 putative pathogenic germline variants in 22 genes only
among the cHL patients. One variant in the cancer-related gene PGK1 was reported as
pathogenic in VarSome and pathogenic/likely pathogenic (P/LP) in ClinVar in patients with
phosphoglycerate kinase 1 deficiency associated with hemolytic anemia. Two variants in
cancer-related genes PTPRT and RANBP17 were classified by VarSome with evidence of
pathogenicity. In the 3 families with cHL patients but lacking sequencing data from sibling
controls, we found 3 P/LP variants in genes associated with blood disorders (SLC4A1 and
SEC23B) or immunodeficiency (PRKDC). In the analysis agnostic to family structure, we found 5
LP loss-of-function variants in genes associated with immunodeficiency (DCLRE1C and DNMT3B)
or cancers including lymphoma and leukemia (PTPRD, DDX10, and HMBS) shared by the
patients and their unaffected controls in 4 families. We have identified several putative novel
cHL predisposition genes, and assessment of these genes in sequencing studies of independent
HL families is required to validate their roles in HL predisposition.
Chapter 1: Disparities in Acute Lymphoblastic Leukemia Risk and Survival across the Lifespan
Abstract
Acute lymphoblastic leukemia (ALL) is the most common childhood cancer but is less
frequent in adolescents and young adults (AYAs) and is rare among older adults. The 5-year
survival of ALL is above 90% in children, but drops significantly in AYAs, and over half of ALL-
related deaths occur in older adults. In addition to diagnosis age, the race/ethnicity of patients
consistently shows association with ALL incidence and outcomes. Here, we review the
racial/ethnic disparities in ALL incidence and outcomes, discuss how these vary across the age
spectrum, and examine the potential causes of these disparities. In the United States, the
incidence of ALL is highest in Hispanics/Latinos and lowest in Black individuals across all age
groups. ALL incidence is rising fastest in Hispanics/Latinos, especially in AYAs. In addition,
survival is worse in Hispanic/Latino or Black ALL patients compared to those who are non-
Hispanic White. Different molecular subtypes of ALL show heterogeneities in incidence rates
and survival outcomes across age groups and race/ethnicity. Several ALL risk variants are
associated with genetic ancestry, and demonstrate different risk allele frequencies and/or
effect sizes across populations. Moreover, non-genetic factors including socioeconomic status,
access to care, and environmental exposures all likely influence the disparities in ALL risk and
survival. Further studies are needed to investigate the potential joint effects and interactions of
genetic and environmental risk factors. Improving survival in Hispanic/Latino and Black patients
with ALL requires advances in precision medicine approaches, improved access to care, and
inclusion of more diverse populations in future clinical trials.
Introduction
Acute lymphoblastic leukemia (ALL) is a hematologic malignancy characterized by
impaired differentiation, proliferation, and accumulation of B- and T-lineage lymphoid
precursor cells in the bone marrow, peripheral blood, and other organs
1,2
. In the United States
(US), the age-adjusted incidence rate (AAIR) for ALL was estimated to be 1.64 per 100,000
people
3
, with approximately 5,700 new cases and 1,600 deaths projected to occur in 2021
4
.
The incidence rate (IR) of ALL demonstrates a bimodal age pattern, in which the initial peak
occurs at age 1-4 years, followed by a decline at age 20-59 years and a modest rise at ages
above 60 years
5
. Indeed, ALL is the most common childhood malignancy, with approximately
2,700 incident ALL cases diagnosed under age 15 each year in the US
6
.
The causes of ALL are multifactorial, and likely vary based on the molecular subtype and
patient age at diagnosis
7
. Only a small proportion (<10%) of ALL cases are attributable to
known risk factors with large effects
8
, namely ionizing radiation and congenital syndromes
9–13
,
although both common and rare genetic variants are now known to contribute to childhood
ALL risk
14
. Genome-wide association studies (GWAS) of childhood ALL have identified multiple
genomic regions harboring common risk alleles for ALL, including at: 7p12.2 (IKZF1), 8q24.21,
9p21.3 (CDKN2A/B), 10p12.2 (PIP4K2A), 10p12.31 (BMI1), 10p14 (GATA3), 10q21.2 (ARID5B),
10q26.13 (LHPP), 12q23.1 (ELK3), 14q11.2 (CEBPE), 16p13.3 (USP7), 17q12, and 21q22.2 (ERG)
15–32
(Table 1.1). In addition, sequencing studies of familial and sporadic ALL have discovered
rare germline variants in PAX5
33,34
, ETV6
35–39
, IKZF1
40–42
, and TP53
43,44
that are associated
with disease risk. Non-genetic factors also contribute to ALL risk; for example, there is strong
epidemiological evidence supporting a role for early life infections and modulation of the
developing immune system in childhood ALL etiology, which has been reviewed in detail
elsewhere
45
. Studies have also reported modest associations for childhood ALL risk with several
environmental exposures
46
, including tobacco smoke
47–49
, pesticides
50,51
, paint
52,53
, and air
pollution
54–56
. The vast majority of epidemiologic studies for ALL have been conducted in
children, and very little is known regarding potential differences in ALL etiology across age
groups.
One factor that consistently shows association with ALL incidence is race/ethnicity. We
acknowledge that race and ethnicity are dynamic and multifactorial concepts
57
, and in this
review we use the term race/ethnicity to refer to heterogeneous groups of people defined by
the US Office of Management and Budget as African Americans/Blacks (hereafter, Blacks);
Hispanics/Latinos; American Indians and Alaska Natives (AI/ANs); and Asians and Native
Hawaiians/other Pacific Islanders (APIs)
58
. Race/ethnicity reflects genetic ancestry, and
additionally conveys important epidemiologic information as to how social determinants such
as racism and discrimination, socioeconomic position, and environmental exposures can
influence disease incidence and mortality
57
. In the US, the incidence of ALL is highest in
Hispanics/Latinos and lowest in Blacks, and this is consistent across age groups
5,59–63
.
Race/ethnicity is also associated with ALL patient outcomes. Overall, survival of ALL
patients has improved dramatically in recent decades
3
, primarily in children
2,64
, which can
largely be attributed to improvements in combination chemotherapy protocols
2
, as well as
advances in the understanding of cytogenetics and genetics of the disease and, more recently,
the development of immunotherapy and targeted therapies
65–67
. Although in children the
overall 5-year survival rate of ALL has risen above 90%
68,69
, it remains inferior in later age
groups, at 60-85% in adolescents and young adults (AYAs; the age range varies by study,
within 15-39 years)
70–73
, and under 30% in older adults
74–79
. In addition, ALL patients who
are Hispanic/Latino or Black show worse outcomes compared to those who are non-Hispanic
White (NHW)
80–86
.
Here, we review the racial/ethnic disparities in ALL incidence and outcomes, and discuss
how these vary across the different age groups of patients: children, AYAs, and older adults. We
also examine the potential causes of these disparities, including genetic and non-genetic risk
factors, and how epidemiologic studies across populations are essential to our understanding
of the causes of ALL.
Disparities in ALL incidence across the lifespan
ALL incidence initially peaks in the first decade of childhood, with the reported peak age ranging from 1-4 years to
<9 years across studies
5,61–63
, declines at age 20-59 years, and rises again modestly among
older adults aged 60 or above, with the highest second peak among Hispanic/Latino adults
87
.
The initial peak of ALL incidence occurs earlier for B-cell ALL (B-ALL), at 1-4 years, than for T-cell
ALL (T-ALL), at 5-14 years, and the peak is less prominent in the latter
5
. ALL develops more
often in males than females with an incidence rate ratio (IRR) of 1.29 overall
3
, and 2.20 and
1.20 for T-ALL and for B-ALL, respectively
5
.
Incidence of ALL is highest in Hispanics/Latinos
Hispanic/Latino children are more likely to be diagnosed with ALL than NHW,
Black, or Asian children, in both sexes and across all age groups
5
. The reported
Hispanic/Latino-to-NHW IRR of childhood ALL ranges from 1.25 to 1.65 for all subtypes
combined
5,59,61–63
, and appears to be more prominent for B-ALL (IRR=1.64), but close to unity
for T-ALL (IRR=0.94)
5
. Moreover, the disparity in ALL IRs between Hispanic/Latino and NHW
children widens with increasing age from <1 year to 19 years
61
. This disparity in ALL risk
corresponds to what has been observed geographically. For instance, Latin American countries,
including Mexico and Costa Rica, have some of the highest incidences of childhood ALL in the
world
88,89
. Meanwhile, the highest incidence of childhood ALL in the US is found in the West US
Census Region, where a high proportion of residents are Hispanics/Latinos
62
. In contrast,
compared to NHWs, Black children have lower IRs of nearly all ALL subtypes in all age groups
5,59,61–63
, and API children also have lower IRs
90
. Among API regional groups, East Asians have a
significantly higher IR of childhood ALL compared to Southeast Asians (IRR=1.59), and
Oceanians have the highest IR
90
.
Among AYAs aged 15-39 years (age range defined by the National Cancer Institute), the
overall ALL AAIR was 0.98 (95% CI, 0.96, 1.01) per 100,000 during 2000-2016, with the highest
incidence being observed in Hispanics/Latinos (AAIR=1.63 [95% CI, 1.56, 1.70]), followed by
AI/ANs (AAIR=1.16 [95% CI, 0.86, 1.52]), NHWs (AAIR=0.79 [95% CI, 0.76, 0.83]), APIs (AAIR=0.78
[95% CI, 0.70, 0.86]), and Blacks (AAIR=0.53 [95% CI, 0.47, 0.59])
63
(Figure 1.1). A similar trend
has been found in B-ALL specifically, with the highest incidence seen in Hispanics/Latinos and
the lowest in Blacks
91
.
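As a rough illustration of how the AYA rates quoted above compare, the sketch below divides each group's AAIR by the NHW AAIR. These ratios of point estimates are our own arithmetic and are not reported in the cited study; they carry no confidence intervals, and the function name `rate_ratios` is purely illustrative.

```python
# AAIRs per 100,000 for AYAs (15-39 years), 2000-2016, as quoted in the text above.
AYA_AAIR = {
    "Hispanic/Latino": 1.63,
    "AI/AN": 1.16,
    "NHW": 0.79,
    "API": 0.78,
    "Black": 0.53,
}


def rate_ratios(rates, reference="NHW"):
    """Return each group's rate divided by the reference group's rate."""
    ref = rates[reference]
    return {group: rate / ref for group, rate in rates.items()}


ratios = rate_ratios(AYA_AAIR)
# Print groups from the highest to the lowest ratio relative to NHWs.
for group, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{group:16s} {ratio:.2f}")
```

With these inputs, the Hispanic/Latino-to-NHW ratio is about 2.06 and the Black-to-NHW ratio about 0.67, which matches the ordering described in the text.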
Among older adults, ALL incidence again predominates among Hispanics/Latinos
5
. For
those aged 40 or older, Hispanics/Latinos had the highest AAIR (AAIR=1.76 [95% CI, 1.67, 1.86]),
followed by AI/ANs (AAIR=1.17 [95% CI, 0.87, 1.54]), NHWs (AAIR=0.97 [95% CI, 0.94, 1.00]),
APIs (AAIR=0.85 [95% CI, 0.78, 0.93]), and Blacks (AAIR=0.77 [95% CI, 0.70, 0.84])
63
(Figure 1.1).
Furthermore, Philadelphia chromosome-like (Ph-like) ALL (patients with a similar gene
expression pattern as those with t(9;22), BCR-ABL1 translocations, i.e., Ph+), a subtype of B-ALL
associated with poor outcomes
92
, is more common in AYAs (19-27%) and older adults (20%)
than in children (10%)
93–96
. In addition, patients with Ph-like ALL or with its subtype carrying
CRLF2 rearrangement (also associated with poor outcomes)
96
are more likely to be
Hispanics/Latinos compared to other races/ethnicities (68% in Ph-like ALL and 85% in Ph-like
ALL with CRLF2 rearrangement)
96
.
Intriguingly, a higher county-level percentage of foreign-born residents
was associated with a higher incidence of ALL among both NHWs and Blacks, but
with a lower incidence of ALL among Hispanics/Latinos
63
. For US-
based API children, ALL IRs were similar to rates seen in originating countries
90
. The inverse
association between percent foreign-born and the incidence of ALL in Hispanics/Latinos
represents an example of the “Hispanic paradox”
97,98
, which refers to the observation that
foreign-born Hispanics/Latinos have better health outcomes when compared to US-born
Hispanics/Latinos.
Incidence of ALL is rising fastest in Hispanics/Latinos
During 1992-2013, the incidence of ALL increased significantly by approximately 2% per
year for Hispanic/Latino children diagnosed at age 10-14 years (annual percent change [APC]=2.09), and by approximately 3% per year for
those 15-19 years of age (APC=2.67), while no significant increases were observed in NHW,
Black, or Asian children in the same age groups
61
. In the US Cancer Statistics database, the IR of
ALL in children aged below 20 years increased significantly
during 2001-2008 both overall and among Hispanics/Latinos, with the largest increase observed in Hispanic/Latino children
(APC=2.5), and then remained stable during 2008-2014
62
.
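The trends above are summarized as annual percent change (APC) values. A minimal sketch of how an APC is conventionally derived follows, assuming the standard log-linear model ln(rate) = a + b × year with APC = 100 × (e^b − 1). The yearly rates in the example are hypothetical, generated to grow by exactly 2% per year; they are not data from the cited studies, whose software and model choices (for example, joinpoint segments) are not described here.

```python
import math


def annual_percent_change(years, rates):
    """Estimate APC from an ordinary least-squares fit of ln(rate) on year."""
    logs = [math.log(r) for r in rates]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    slope = sxy / sxx  # b in ln(rate) = a + b * year
    return 100.0 * (math.exp(slope) - 1.0)


# Hypothetical series: a rate of 1.00 per 100,000 growing exactly 2% per year.
years = list(range(2000, 2017))
rates = [1.00 * 1.02 ** (y - 2000) for y in years]
apc = annual_percent_change(years, rates)
print(f"APC = {apc:.2f}% per year")
```

Because the slope is estimated on the log scale, an APC of about 2 means the rate is multiplied by roughly 1.02 each year, so the change compounds over the study period.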
Despite the relatively low AAIR of ALL in AYAs compared to other age groups, AYAs had the
greatest increase in ALL AAIR during 2000-2016 (overall APC=1.56 [95% CI, 1.03, 2.09])
63
(Figure 1.1). Hispanics/Latinos had a significant increase in AAIR across all age groups (APC=1.18
[95% CI, 0.76, 1.60]), with the greatest increase found in AYAs (APC=2.02 [95% CI, 1.17, 2.88])
63
. Across all age groups, AYAs were the only group in which AI/ANs had a significant increase in
AAIR (APC=9.79 [95% CI, 5.65, 14.09])
63
. Given the small population size of AI/ANs, the
substantial interregional differences in incidence rates, and the misclassification of AI/ANs in central
registries that were observed in SEER data
99
, rates and trends for the AI/AN population should be interpreted with caution. The AAIR of ALL also increased significantly among
Asian and Hawaiian/Pacific Islander AYAs (APC=1.95 [95% CI, 0.15, 3.79])
63
. Among older
adults, the incidence of ALL increased significantly only among Hispanics/Latinos during 2000-
2016
63
. The trend of AAIR remained stable among NHWs and Blacks across all age groups over
time
63
.
Disparities in survival of ALL patients
In this section, we summarize disparities in the overall survival rates of ALL patients,
though we do not review potential disparities in long-term outcomes, such as treatment-
related morbidities, which have been described elsewhere
100–103
.
Children
Among children, the survival of ALL is lowest in infants (<1 year), highest in those aged
1-9 years, and thereafter decreases with increasing age
5,80
. Girls have better survival than boys
overall
80
. Hispanic/Latino children have inferior outcomes compared to NHWs
80–84
, with a 5-
15% difference in overall survival rate being persistently seen in SEER data
81,82,84
. Furthermore,
in Hispanics/Latinos, childhood ALL mortality has been shown to differ by genetic ancestry
104
.
For instance, Hispanic/Latino children overall have 2.27 times the mortality of
NHW children (mortality rate ratio [MRR]=2.27 [95% CI, 1.68-3.06]), with an MRR of 2.56 (95%
CI, 1.93-3.40) in continental Hispanic/Latino children (Mexicans, Central Americans, and South
Americans) but an MRR of only 1.23 (95% CI, 0.74-2.03) in Caribbean Hispanic/Latino
children (Puerto Ricans, Cubans, and Dominicans)
104
, suggesting that higher Indigenous
American ancestry is associated with poorer overall survival.
In Black childhood ALL patients, improvement in 5-year survival has lagged behind
that in other races/ethnicities
105
. The largest improvements in survival in Blacks occurred at
much later diagnosis periods (1995-2001 and 2002-2008) compared to those in NHWs and
AI/ANs (1988-1994 and 1995-2001)
81
. Promisingly, SEER data have revealed a decreased
inequality in ALL survival between Black and NHW children
80,82,106
. From 1992-1995 to 2003-
2007, 5-year relative survival rate improved faster in Black children (APC=3.01) than in NHW
children (APC=1.37)
82
. In another study, from 1975-1983 to 2000-2010, the difference in 5-year
cumulative mortality of ALL between Black and NHW children reduced from 15% to 3%;
compared with NHWs, the adjusted hazard ratio (HR) for Blacks dropped from 1.46 (95% CI,
1.09-1.94) to a non-significant 1.21 (95% CI, 0.74-1.96)
80
.
APIs and AI/ANs also have significantly worse childhood ALL survival compared to
NHWs
80,81
, with the 5-year cumulative ALL mortality being 10% in APIs and 19% in AI/ANs
versus 8% in NHWs in 2000-2010
80
. Compared with NHW counterparts, APIs diagnosed
at 1-9 years, and AI/ANs diagnosed at 10-19 years had about twice the ALL mortality HR
80
.
Further, in a stratified analysis for Asian subgroups, when comparing to NHWs, East Asians in
general (i.e., Chinese, Filipino, Korean, Japanese, Vietnamese and other Southeast Asians
combined) had significantly inferior outcomes, with particularly worse survival for Vietnamese
(relative risk [RR]=2.44 [95% CI, 1.50-3.97]) and Filipino (RR=1.64 [95% CI, 1.13-2.38]) patients,
whereas the inferior outcomes for Koreans, Japanese and other Southeast Asians were non-
significant
81
.
AYAs
A “survival cliff” has been observed for ALL in AYA patients at age 17 to 20 years, where
the survival rate drops considerably over just this 3-year age span; this drop accounts for
nearly half of the total survival decrease from childhood to older adulthood
107
. This substantial
drop of survival rate partly results from the high frequency of the high-risk Ph-like B-ALL
subtype among AYAs
92–96
. Based on data obtained from the Texas Cancer Registry, among AYA
ALL patients, the overall 5-year survival rate was better in females than in males, and it has
improved over time across all races/ethnicities in both sex groups
85
. However, improvement in
the survival rate of Black AYA patients lags behind other racial/ethnic groups, similar to the
pattern seen in Black children. Among AYA patients, survival in Black males diagnosed in 2004-
2012 (66.9% [95% CI, 64.0%, 69.6%]) was significantly worse than in NHW (78.2% [95% CI,
77.2%, 79.1%]) and in Hispanic/Latino males (71.8% [95% CI, 70.3%, 73.3%]) diagnosed back in
1995-2003
85
; Black females diagnosed in 2004-2012 (76.9% [95% CI, 75.2%, 78.4%]) had a
worse survival rate compared to NHW females diagnosed in 1995-2003 (83.9% [95% CI, 83.2%,
84.2%])
85
.
Older adults
Older adult ALL patients have the worst survival across all age groups
74–79
. While
approximately 22.5% of patients are diagnosed after the age of 55 years, 54.6% of ALL-related
deaths occur in patients in this age stratum
108
. This is likely related to the elevated prevalence
of multiple high-risk subtypes of ALL in older adults. First, both Ph+ ALL and Ph-like ALL are very
common subtypes of B-ALL among older adults aged 60 years or above (Ph+ ALL: approximately
50%; Ph-like ALL: 24%-26%)
93–96,109,110
. In addition, older adults with Ph-negative B-ALL tend to
present with high-risk cytogenetics and complex karyotypes
111,112
associated with increased
risks of treatment failure and treatment complications
78
.
Promisingly, the 1973-2008 SEER data revealed a significant improvement in survival for patients
aged over 45 years in the overall population, in NHWs, and in particular in APIs
(19.8%), and a large but marginally significant improvement for Blacks (11.3%)
86
. However,
these improvements were not seen in Hispanic/Latino patients. For instance, in 2003 to 2008,
the 5-year survival rate of older adult Hispanic/Latino ALL patients was only 13.9% compared
with 23.6% in NHWs and 17.1% in Blacks
86
, perhaps due to the high frequency of Ph-like ALL in
Hispanic/Latino ALL patients
96
. Similarly, in the 1980-2011 SEER data, there was a modest
improvement in median overall survival among adults with ALL aged 60 years or above
113
,
partly attributable to advances in novel therapies for Ph+ ALL
114
.
Factors associated with disparities in ALL risk and outcomes
Differences in ALL tumor biology
Immunophenotype
The World Health Organization (WHO) classifies ALL based first on immunophenotype,
and categorizes patients into either B-ALL or T-ALL
115
, with both comprising multiple subtypes
defined by structural chromosomal alterations
116
. B-ALL prevalence is higher than T-ALL,
accounting for approximately 80% of ALL cases in children and 75% in adults in the US
117
. In
childhood ALL, the B-cell immunophenotype confers more favorable survival than T-ALL,
whereas in adults survival is substantially higher for T-ALL than B-ALL
5,111,118,119
, likely due to
differences in molecular subtypes across age groups. In both children and adults, B-ALL appears
to have a higher incidence in Hispanics/Latinos compared to other races/ethnicities
5
. On the
other hand, T-ALL occurs more frequently in Black children, in whom a T-ALL-related genetic
variant in USP7 is overrepresented
30
. Thus, the contribution of immunophenotype to
disparities in the survival of ALL patients may vary across population groups.
Cytogenetic subtypes
The most common chromosomal alterations in childhood B-ALL are high hyperdiploidy
(chromosomal number 51-67) and t(12;21)(p13;q22) translocation encoding the ETV6-RUNX1
fusion gene
112,120
. Each presents in 25-30% of children with ALL
116
, and is associated with a
favorable prognosis
112,121
. However, both subtypes are less common in adolescent ALL patients
and very rare in adult ALL patients
116
. Among ALL cases in the California Childhood Leukemia
Study, the prevalence of high hyperdiploidy was similar in Hispanics/Latinos and NHWs, at
28.3% and 27.6%, respectively
122
, whereas there was a significantly lower frequency of ETV6-
RUNX1 translocation in Hispanics/Latinos (13%) than in NHWs (24%)
123
. To our knowledge, the
frequencies of these two subtypes have not been compared across race/ethnicity in AYAs or
older adults, perhaps due to small numbers.
The Ph chromosome translocation (t(9;22), i.e., Ph+), which results in the BCR-ABL1
fusion gene
124
, is infrequent among childhood B-ALL patients (<5%) but presents in up to half
of adult B-ALL cases and becomes more prevalent with increased age (22% in patients <40 years
of age, 41% in patients ≥40 years and nearly 50% in patients aged 60 years or older)
94,109,110
. A
higher percentage of Ph+ B-ALL has been reported in Black AYA/adult patients than in
NHW and Hispanic/Latino patients
96,125
. A similar pattern has been identified in children –
compared with Ph-negative ALL patients, Ph+ ALL patients were more likely to be Black
126
.
Although the Ph chromosome has been historically recognized as an adverse prognostic factor for
ALL, Ph+ ALL now has noninferior or even superior outcomes compared to Ph-negative ALL in
older adult ALL patients
110,127,128
, due to recent advances in novel therapies such as CAR-T cell
therapy and tyrosine kinase inhibitor therapy
114
.
The WHO 2017 revision introduced Ph-like ALL as an additional subgroup for B-ALL
115
.
Ph-like B-ALL shares a similar gene-expression profile with Ph+ B-ALL, but does not harbor the
BCR-ABL1 fusion protein expressed from the t(9;22)
115
. Unlike Ph+ ALL, which becomes more
frequent with increasing age, Ph-like ALL has the highest incidence in AYAs (19-28%), a lower
frequency in childhood (10%), and is relatively common among adults aged 40 or above (20%)
93–96
. Ph-like ALL partly contributes to the AYA “survival cliff”
107
, and the continuing poor
outcomes in older age groups
74–79
. Patients with Ph-like ALL had a significantly inferior event-
free and disease-free survival, a lower complete remission rate, and an elevated level of
minimal residual disease at the end of the induction therapy compared to non-Ph-like patients
92
. Furthermore, Ph-like ALL likely plays a role in both the high incidence and the inferior
survival of ALL in Hispanics/Latinos. Ph-like ALL occurs more frequently in Hispanics/Latinos,
particularly among AYAs/adults
96
. Indeed, Hispanics/Latinos have been shown to account for up to
two-thirds of Ph-like ALL in AYA/adult patients
96
. Notably, nearly half of the patients with Ph-
like ALL had CRLF2 rearrangements
93
. Both Ph-like ALL and its subtype with CRLF2
rearrangements have significantly worse outcomes compared to other subtypes
93,95,96,129–131
,
and are more prevalent among Hispanics/Latinos compared to other racial/ethnic groups
129
. In
sum, the Ph-like subtype contributes significantly to the poor survival of Hispanic/Latino AYA
ALL patients.
Genetic variation
Genetic variants contribute to racial/ethnic disparities in ALL incidence
Several ALL risk loci identified by GWAS have been associated with genetic ancestry, and
have demonstrated differences in risk allele frequency and/or differences in effect size across
population groups
15–32,132,133
(Figure 1.2). For example, an increased number of risk alleles at 5
ALL risk single nucleotide polymorphisms (SNPs) rs3731217 (CDKN2A), rs7088318 (PIP4K2A),
rs2239633 (CEBPE), rs7089424 (ARID5B), and rs3824662 (GATA3) was correlated with increased
genome-wide Indigenous American ancestry in Hispanic/Latino children
134,135
. ARID5B SNP risk
allele frequency has also been associated with increased local Indigenous American ancestry in
Hispanics/Latinos
136
. At the GATA3 risk locus, SNP rs3824662 has a markedly higher risk allele
frequency in Hispanic/Latino than in European ancestry populations, with 39% compared with
only 17% frequency in the Genome Aggregation Database (gnomAD) v2.1.1. (Table 1.1 and
Figure 1.3)
137
. Further, the GATA3 SNP rs3824662 risk allele has been shown to confer a
remarkably high risk for Ph-like ALL in both children and AYAs, with an almost 4-fold risk of this
subtype
21,132
, suggesting that this risk locus likely contributes substantially to the increased
prevalence of Ph-like ALL in Hispanic/Latino ALL patients.
In two recent GWAS of ALL conducted in Hispanic/Latino-only discovery studies, a novel
ALL risk locus was identified at the chromosome 21 gene ERG
29,31
. The effect of this locus on
ALL risk was larger in Hispanics/Latinos than in NHWs and, in addition, this locus was associated
with an increased risk of ALL in Hispanic/Latino individuals both with higher genome-wide and
higher local Indigenous American ancestry
29,31
. Together, risk loci in ARID5B, GATA3, PIP4K2A,
CEBPE, and ERG likely account for some of the observed differences in ALL incidence between
Hispanics/Latinos and non-Hispanic/Latino races/ethnicities, which may be partly explained by
the Indigenous American ancestry in Hispanics/Latinos. Indeed, it has been suggested that the
CEBPE, ARID5B, and GATA3 risk SNPs may account for approximately 3%, 11%, and 11%
increased risk of B-ALL in Hispanics/Latinos versus non-Latino whites, respectively
134,135
.
Intriguingly, a recent study found that Indigenous American ancestry increased by ~20% on
average in Mexican Americans in the US during the 1940s-1990s, partly attributable to
assortative mating, shifts in migration pattern and changes in population size
138
. Given the
association between ALL risk alleles and Indigenous American ancestry, this perhaps suggests
that this shift in genetic ancestry may contribute to the rising ALL incidence among
Hispanics/Latinos, although this warrants further investigation. Further research is needed to
determine whether the ancestry-dependent effects from these SNPs are confounded by other
genetic or environmental factors, and to discover additional ancestry-associated risk loci via
admixture mapping and larger GWAS of ALL with a more diverse population across all age
groups.
Apart from the risk loci described above that are associated with Indigenous American
ancestry, a novel risk locus for T-ALL was recently identified at the USP7 gene and was found to
be overrepresented in children of African ancestry. This locus may, therefore, contribute to the
higher incidence of T-ALL in Black children compared to their counterparts of other
races/ethnicities
30
.
Finally, we summarized established GWAS-identified SNPs for ALL
15–32
in Table 1.1, and
we observed disparities in risk allele frequency and in effect size of these SNPs. In gnomAD
(v2.1.1.)
137
, the risk allele frequency of the Ph-like ALL-related SNP rs3824662 (GATA3) is 130%
higher in Latinos/Admixed Americans than in Europeans; further, SNPs in ARID5B have
a 20-50% higher risk allele frequency in Latinos/Admixed Americans compared to Europeans
(Figure 1.3A). Many of the established ALL GWAS SNPs have a higher absolute risk allele
frequency in Latinos/Admixed Americans than in Europeans (Figure 1.3B). In Blacks, risk allele
frequency of the T-ALL-related SNP at rs74010351 (USP7) is strikingly high, nearly 200% higher
than in Europeans, but the absolute difference is only ~10% because of the low frequency of
this risk allele across all populations (Table 1.1); other GWAS SNPs did not show consistent
differences in risk allele frequency between African and European populations (Figure 1.3). The
strongest risk effect is seen for the GATA3 SNP rs3824662 association with Ph-like ALL, with an
effect size of nearly 4.0 (Figure 1.3C).
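The paragraph above distinguishes relative from absolute differences in risk allele frequency. As a small worked sketch, the code below reproduces the arithmetic for the GATA3 SNP rs3824662, using only the two gnomAD v2.1.1 frequencies quoted earlier in this chapter (39% in Latinos/Admixed Americans and 17% in Europeans). The function name `frequency_differences` is illustrative, and the rounded inputs mean the result only approximates the reported figure.

```python
def frequency_differences(freq_a, freq_b):
    """Compare two allele frequencies given as proportions.

    Returns the absolute difference in percentage points and the
    relative difference of freq_a versus freq_b in percent.
    """
    absolute_pp = (freq_a - freq_b) * 100.0
    relative_pct = (freq_a - freq_b) / freq_b * 100.0
    return absolute_pp, relative_pct


# rs3824662 (GATA3) risk allele frequencies as quoted in the text:
# 0.39 in Latinos/Admixed Americans, 0.17 in Europeans.
abs_pp, rel_pct = frequency_differences(0.39, 0.17)
print(f"absolute: {abs_pp:.0f} percentage points; relative: {rel_pct:.0f}% higher")
```

With the rounded frequencies, the relative difference comes to roughly 129%, in line with the approximately 130% stated above, while the absolute difference is about 22 percentage points. The same relative difference would correspond to a much smaller absolute gap for a rare allele, which is the contrast drawn for the USP7 SNP.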
Genetic variants are associated with racial/ethnic disparities in ALL outcomes
Genetic variation contributes to racial/ethnic disparities not only in ALL susceptibility
but also in treatment outcomes
136,139
(Figure 1.2). Indigenous American ancestry has been
associated with an increased risk of relapse in Hispanic/Latino ALL patients, which may result
from the effects of ancestry-related genetic variants on therapy response
140
. For example, in a
study conducted in children treated on Children’s Oncology Group (COG) clinical trials, ARID5B
genetic risk alleles that have a higher frequency in Hispanic/Latino populations and are
associated with increased Indigenous American ancestry were associated with both ALL
susceptibility and relapse risk
136
. In another example, the GATA3 risk SNP rs3824662,
associated with Indigenous American ancestry
135
and Ph-like ALL, has been found additionally
to contribute to the increased risk of relapse in both childhood
21,141
and adult ALL patients
142
.
Two variants in TPMT (rs1142345) and NUDT15 (rs116855232) have been discovered by
GWAS to be strongly associated with thiopurine intolerance during therapy resulting in
excessive toxicity in children with ALL
143
. The TPMT variant is most prevalent in Blacks and least
common in East Asians
143
. The NUDT15 variant is most prevalent in East Asians, followed by
Hispanics/Latinos, and extremely rare in NHWs and Blacks
143
. In a recent sequencing study, 4
additional germline loss-of-function variants were identified in NUDT15 that confer a major risk
for thiopurine intolerance, and appear to be highly prevalent in East Asians, South Asians and
Indigenous American populations
144
.
Moreover, a study of children with high-risk B-ALL enrolled in COG clinical trials revealed
19 genetic loci associated with increased relapse risk, of which 12 were specific to an ancestry
group, including 7 SNPs specific to Hispanics/Latinos and 3 SNPs specific to Black patients
139
.
These loci are associated with pharmacokinetic and pharmacodynamic phenotypes (e.g.,
resistance or rapid clearance of chemotherapy)
139
. Further, including ancestry-specific SNPs in
multivariate models of relapse risk significantly attenuated the increased risk of relapse in
Hispanic/Latino and Black patients compared to white patients
139
.
Environmental exposures and ALL risk
Genetic variation undoubtedly contributes to the racial/ethnic disparities in ALL risk and
outcomes, but non-genetic factors also play an important role. In terms of the natural history of
the disease, it has been proposed that childhood ALL, in particular B-ALL, follows a “two-hit”
model of leukemogenesis
45,145
, with in utero development of a pre-leukemic clone
146,147
that
progresses to overt leukemia following postnatal acquisition of secondary genetic changes
148
. A
lack of microbial infectious exposure perinatally or in infancy impacts immune function
149–151
,
and this in combination with delayed exposure to infections may lead to abnormal immune
responses that result in secondary somatic events that drive leukemogenesis
45
. This is
supported by epidemiological evidence, including from studies that have assessed the impact of
early-life infectious exposure on ALL risk, using proxies such as day-care attendance
152–154
, birth
order
154–156
, and timing of birth
156
. Intriguingly, day-care attendance and higher birth order
have been found to have a protective effect on ALL risk among NHWs, supporting the “delayed
infection” hypothesis
45
, but not in Hispanic/Latino children
152–154,156
. On the other hand,
Caesarean section and in utero CMV infection, found to be risk factors for childhood ALL,
conferred a more prominent effect in Hispanics/Latinos compared to NHWs
157–159
. As
described above, several genetic variants and high-risk cytogenetic features are more prevalent
in Hispanics/Latinos and are correlated with Indigenous American ancestry. More studies are
needed to investigate the joint effects of both genetic and environmental risk factors and their
potential interactions, particularly in Hispanics/Latinos.
Socioeconomic status and ALL risk and survival
Socioeconomic status (SES) also correlates with the racial/ethnic disparities in ALL risk.
For example, in a recent study that adjusted for area-level percent foreign-born, neighborhood
SES was inversely associated with the AAIR of ALL among NHWs and Blacks, but was positively
associated with ALL AAIR in Hispanics/Latinos across all age groups
63
. This observed
racial/ethnic difference in the relationship between SES and the risk of ALL was reported to be
largely driven by data from California
63
, where there was an excess ALL risk in Los Angeles
County and a highly diverse population in which Hispanics/Latinos have elevated
Indigenous American ancestry. This contrasts with another study, conducted in children without
adjustment for percent foreign-born, which found a higher incidence of ALL among lower
SES populations for Hispanics/Latinos, but among higher SES populations for other
races/ethnicities
160
. One potential reason for this difference is that the former study
additionally controlled for percent foreign-born, which is a crucial indicator of the “Hispanic
paradox”
97,98
, and represents a variety of potential underlying risk factors that may differ by
individual and racial/ethnic group.
On the other hand, low SES is consistently associated with poor outcomes in ALL
patients. Living in high poverty areas has been associated with high rates of relapse in
childhood ALL patients
161
. Children with ALL in the US residing in neighborhoods with the
highest poverty rate have been found to have an almost two-fold increase in mortality
compared with those in neighborhoods with the lowest poverty rate (HR=1.8 [95% CI, 1.41-
2.30]), when adjusting for sex, age at diagnosis, race/ethnicity, and treatment era
162
.
Moreover, the difference in 5-year overall survival comparing NHW children with ALL residing in
the lowest poverty neighborhoods versus Black patients residing in the highest poverty
neighborhoods can be as high as 22%
162
. Furthermore, in SEER data, SES as measured at the
neighborhood level significantly mediated the association between race/ethnicity and
childhood ALL survival, leading to a 44% reduction from the total to the direct effect of the
Black-NHW survival disparity and 31% reduction of the Hispanic/Latino-White disparity in
survival
163
. The inferior outcomes in high poverty neighborhoods might be attributable to
multiple factors, including poor adherence to therapy (e.g., long-term oral administration of
antimetabolites)
164,165
, lack of insurance, and discontinuous insurance coverage
166–168
.
Access to care
Previous studies have shown that older age was associated with less treatment
adherence
77
, and that compliance with therapy was more problematic for AYAs than for other
age groups
169–171
; however, the heterogeneity by race/ethnicity has been investigated mostly
in childhood ALL patients. Lower exposure to mercaptopurine increases the risk of relapse in
ALL, and thus the increased risk of relapse in Hispanic/Latino children with ALL compared with
NHW children with ALL may in part result from lower compliance with oral mercaptopurine
therapy
172
. In a 6-month adherence monitoring program of 327 patients with ALL,
Hispanic/Latino children had a significantly lower level of adherence along with lower SES
compared to NHW children
165
. In another 5-month follow-up study among children with ALL
from COG, adherence rates for oral 6-mercaptopurine were significantly lower in Blacks (87%)
and Asian Americans (90%), as compared to NHWs (95%), after adjusting for SES
173
. These findings
suggest that compliance with therapy could be explained by factors other than SES. In addition,
the type of insurance payer is a significant predictor of adherence among ALL patients. It has
been found that ALL patients with commercial insurance payers had significantly higher levels
of adherence compared to those with Medicaid
174
. Compared to other age groups, the AYA
group is less likely to have insurance, with around 40% of individuals between 19 and 29 years
old being uninsured
175
. Hispanic/Latino and Black adult patients with cancer are more likely to
be uninsured or Medicaid-insured than NHW adult patients
168
. A pediatric cancer study has
also demonstrated that Hispanic/Latino patients were less likely to have insurance
176
. Notably,
although Black children with ALL were significantly more likely to have high-risk prognostic
profiles compared to NHW children, it has been found that with equal access to effective
antileukemic therapy, Blacks and NHWs had the same high rate of cure
177
.
Recruitment to clinical trials
In addition to the elevated incidence of Ph-like ALL in AYA ALL patients
93–96
, potential
factors that contribute to the AYA “survival cliff” also include the transition from pediatric to
adult treatment regimens
107
and the low recruitment rate of AYA patients into clinical trials
178
.
For instance, a drop-off in clinical trial accrual for ALL has been identified between ages 16 and 24,
where the estimated treatment trial accrual proportion decreased dramatically from 50% at
age 16 to below 10% at age 24 during 2000-2014
178
. This pattern strongly suggests that the
AYA survival cliff could in fact be largely due to an “accrual cliff”, as survival has been found to
strongly correlate with trial accrual
178
. Moreover, there was a lack of improvement in ALL
survival in patients aged 20-29 years since 1989 (APC = 0.33, p = 0.39), corresponding to the
negligible increase of trial accrual in AYAs during 2000-2015
178
. In addition to AYAs, elderly ALL
patients are rarely eligible for clinical trials and are underrepresented in trials of new cancer
therapy
179,180
, and the underrepresentation in clinical trials for cancer therapies has been
found to underlie the poor outcomes of elderly patients
180
. In addition to age disparities, Black
AYA cancer patients are less likely to be enrolled on a clinical trial compared to NHW AYAs
181
,
and NHWs continue to comprise the majority of participants in these trials
182
.
Conclusions and future directions
In this review, we described racial/ethnic disparities in ALL risk and survival; evaluated
how these vary across the age spectrum; and examined the potential causes of these
disparities, including genetic and non-genetic risk factors. Genetic risk factors certainly play a
significant role in contributing to these disparities, as several ALL risk loci are associated with
genetic ancestry, and have demonstrated different risk allele frequencies and/or effect sizes
across population groups. In particular, multiple studies have shown that Ph-like ALL is
associated with poor survival in both children and adults, and the risk of Ph-like ALL is
associated with specific GATA3 risk alleles that occur more frequently in Hispanics/Latinos with
elevated Indigenous American ancestry. A variety of genomic aberrations have been discovered
underlying Ph-like ALL and are likely to be drivers of leukemogenesis
183
, which offers a great
opportunity for precision medicine approaches to use molecule inhibitors targeted at these
lesions. Racial/ethnic categories in epidemiologic studies also capture, albeit imperfectly, the
influence of bias, racial discrimination, culture, socioeconomic status, access to care, and
environmental factors
57
. In this review, we recognize that these non-genetic factors are
associated with the disparities in ALL risk and survival. Improving survival in Hispanic/Latino and
Black patients with ALL will require both improved access to care and inclusion of more diverse
populations in future clinical trials and genetic studies.
Table 1.1. Genetic variants associated with ALL risk in genome-wide association studies.
Gene | SNP | ref | alt | risk | Trait(s) | PubMed ID | Year | First Author | AF_afr | AF_amr | AF_nfe | P | OR(CI)
ARID5B | rs10821936 | C | T | C | ALL | 19684603 | 2009 | Treviño LR | 0.227 | 0.463 | 0.302 | 1.40E-15 | 1.91(1.60-2.20)
ARID5B | rs10994982 | A | G | A | ALL | 19684603 | 2009 | Treviño LR | 0.565 | 0.560 | 0.467 | 5.70E-09 | 1.61(1.30-1.90)
ARID5B | rs7089424 | T | G | G | ALL | 19684604 | 2009 | Papaemmanuil E | 0.241 | 0.460 | 0.304 | 7.00E-19 | 1.65(1.54-1.76)
ARID5B | rs7089424 | T | G | G | B-ALL | 19684604 | 2009 | Papaemmanuil E | 0.241 | 0.460 | 0.304 | 1.41E-19 | 1.70(1.58-1.81)
BAK1 | rs210143 | T | C | C | B-ALL (High-hyperdiploidy) | 31767839 | 2019 | Vijayakrishnan J | 0.735 | 0.718 | 0.724 | 2.21E-08 | 1.30(1.19-1.43)
BMI1 | rs4748793 | A | G | A | ALL | 23512250 | 2013 | Xu H | 0.893 | 0.774 | 0.787 | 8.40E-09 | 1.40(1.26-1.57)
BMI1 | rs11591377 | G | A | G | ALL | 29923177 | 2018 | de Smith AJ | 0.905 | 0.767 | 0.793 | 2.07E-10 | 1.27(1.20-1.35)
C5orf56 | rs886285 | T | C | T | B-ALL (High-hyperdiploidy) | 31767839 | 2019 | Vijayakrishnan J | 0.647 | 0.304 | 0.343 | 1.56E-08 | 1.29(1.18-1.41)
CCDC26 | rs28665337 | C | A,T | A | B-ALL | 29632299 | 2018 | Vijayakrishnan J | 0.086 | 0.097 | 0.116 | 4.00E-09 | 1.34(1.21-1.47)
CCDC26 | rs4617118 | A | C,G | G | ALL | 29348612 | 2018 | Wiemels JL | 0.305 | 0.117 | 0.163 | 3.05E-09 | 1.27(1.17-1.38)
CDKN2A | rs3731217 | A | C,T | A | ALL | 20453839 | 2010 | Sherborne AL | 0.902 | 0.898 | 0.867 | 3.01E-11 | 1.41(1.28-1.56)
CDKN2A | rs3731249 | C | T | T | ALL | 26527286 | 2015 | Walsh K | 0.004 | 0.016 | 0.033 | 1.69E-13 | 2.97(2.22-3.96)
CDKN2B | rs77728904 | A | C,G | C | B-ALL | 26868379 | 2016 | Hungate EA | 0.094 | 0.059 | 0.080 | 3.32E-15 | 1.72(1.50-1.97)
CEBPE | rs2239633 | G | A | G | B-ALL (ETV6-RUNX1) | 22076464 | 2012 | Ellinghaus E | 0.787 | 0.599 | 0.519 | 4.00E-10 | 1.35(1.22-1.47)
CEBPE | rs4982731 | C | T | C | ALL | 23512250 | 2013 | Xu H | 0.396 | 0.368 | 0.278 | 9.00E-12 | 1.36(1.24-1.48)
CPSF2 | rs189434316 | A | T | T | B-ALL (Normal cytogenetic) | 29296818 | 2017 | Clay-Gilmour AI | 0.011 | 0.027 | 0.066 | 6.00E-09 | 3.70(2.50-6.20)
ELK3 | rs4762284 | A | T | T | B-ALL | 27694927 | 2017 | Vijayakrishnan J | 0.449 | 0.480 | 0.299 | 8.00E-09 | 1.19(1.12-1.26)
ERG | rs2836365 | A | G | G | B-ALL | 30510082 | 2019 | Qian M | 0.197 | 0.360 | 0.329 | 3.76E-08 | 1.56(1.33-1.83)
ERG | rs8131436 | G | C | C | ALL | 31296947 | 2019 | de Smith AJ | 0.196 | 0.363 | 0.331 | 8.76E-09 | 1.23(1.16-1.31)
GATA3 | rs3824662 | C | A,T | A | B-ALL | 23996088 | 2013 | Migliorini G | 0.094 | 0.395 | 0.172 | 8.62E-12 | 1.31(1.21-1.41)
GATA3 | rs3824662 | C | A,T | A | B-ALL (Ph-like) | 24141364 | 2013 | Perez-Andreu V | 0.094 | 0.395 | 0.172 | 2.17E-14 | 3.85(2.71-5.47)
IGF2BP1 | rs10853104 | C | G,T | T | B-ALL (ETV6-RUNX1) | 31767839 | 2019 | Vijayakrishnan J | 0.663 | 0.420 | 0.508 | 1.82E-08 | 1.33(1.21-1.47)
IKZF1 | rs11978267 | A | G | G | ALL | 19684603 | 2009 | Treviño LR | 0.193 | 0.241 | 0.277 | 8.80E-11 | 1.69(1.40-1.90)
IKZF1 | rs4132601 | T | G | G | ALL | 19684604 | 2009 | Papaemmanuil E | 0.193 | 0.242 | 0.275 | 1.00E-19 | 1.69(1.58-1.81)
IKZF1 | rs4132601 | T | G | G | B-ALL | 19684604 | 2009 | Papaemmanuil E | 0.193 | 0.242 | 0.275 | 9.31E-20 | 1.73(1.61-1.85)
IKZF3 | rs2290400 | T | C | T | ALL | 29348612 | 2018 | Wiemels JL | 0.518 | 0.619 | 0.510 | 2.05E-08 | 1.18(1.11-1.25)
LHPP | rs35837782 | A | G | G | B-ALL | 27694927 | 2017 | Vijayakrishnan J | 0.654 | 0.504 | 0.631 | 1.00E-11 | 1.21(1.15-1.28)
OR8U8 | rs1945213 | C | G,T | C | B-ALL (ETV6-RUNX1) | 22076464 | 2012 | Ellinghaus E | 0.218 | 0.184 | 0.285 | 3.89E-08 | 1.28(1.14-1.45)
PIP4K2A | rs10828317 | T | C | T | B-ALL | 23996088 | 2013 | Migliorini G | 0.908 | 0.835 | 0.698 | 2.30E-09 | 1.23(1.15-1.32)
PIP4K2A | rs7088318 | C | A | A | ALL | 23512250 | 2013 | Xu H | 0.400 | 0.751 | 0.616 | 1.13E-11 | 1.40(1.28-1.53)
PIP4K2A | rs4748812 | G | A | A | ALL | 29923177 | 2018 | de Smith AJ | 0.360 | 0.742 | 0.626 | 1.30E-15 | 1.31(1.25-1.38)
RPL6P5 | rs17481869 | C | A | A | B-ALL (ETV6-RUNX1) | 29632299 | 2018 | Vijayakrishnan J | 0.020 | 0.036 | 0.079 | 3.20E-08 | 2.14(1.64-2.80)
SP4 | rs2390536 | G | A | A | ALL | 29348612 | 2018 | Wiemels JL | 0.083 | 0.184 | 0.368 | 3.59E-08 | 1.20(1.13-1.29)
TLE1 | rs76925697 | A | T | A | B-ALL | 31767839 | 2019 | Vijayakrishnan J | 0.964 | 0.969 | 0.962 | 2.11E-08 | 1.52(1.31-1.76)
TP63 | rs17505102 | G | C | C | B-ALL (ETV6-RUNX1) | 22076464 | 2012 | Ellinghaus E | 0.053 | 0.064 | 0.128 | 8.94E-09 | 1.59(1.33-1.92)
USP7 | rs74010351 | A | C,G | G | T-ALL | 30938820 | 2019 | Qian M | 0.178 | 0.060 | 0.060 | 4.51E-08 | 1.44(1.27-1.65)
Figure 1.1. Disparities in acute lymphoblastic leukemia (ALL) incidence across the lifespan.
Data extracted from Tables 1 and 2 from Feng et al.
63
Age-adjusted incidence rate per 100,000
population was derived from the Surveillance, Epidemiology, and End Results Registry, version
18. Centers of points and horizontal bars indicate point estimates and 95% confidence intervals.
(A) Age-adjusted incidence rates of ALL by age group and race/ethnicity, United States, 2000-
2016. (B) Annual Percent Change in incidence rates of ALL by age group and race/ethnicity,
United States, 2000-2016. AI/AN, American Indian and Alaska Native; API, Asian and Pacific
Islander; Black, African American/Black; NHW, non-Hispanic White; AYA, adolescents and young
adults.
Figure 1.2. Genetic variants associated with ALL risk and outcomes across the genome.
PhenoGram plots
184
were constructed for genetic variants associated with (A) ALL
susceptibility, and/or (B) ALL patient outcomes (i.e., relapse and response to therapy). Genetic
variants included in the PhenoGrams were identified in the NHGRI-EBI catalog of human
genome-wide association studies (GWAS Catalog)
185
and included in published GWAS for acute
lymphoblastic leukemia (ALL)
15,16,18–20,24–27,32,132,133
or for outcomes of ALL
143,186–193
. We also
included some variants described in additional papers included in this review for ALL
susceptibility
17,21–23,28–31
and ALL patient outcomes
21,136
. For ALL susceptibility (A) we only
included variants that passed the genome-wide significance level of P<5×10⁻⁸. For patient
outcomes (B), we included variants that passed the genome-wide significance level of P<5×10⁻⁸
plus variants in GATA3 and ARID5B from gene-specific analyses. Lines are plotted on each
chromosome corresponding to the base-pair position of each single nucleotide polymorphism
(SNP). Variants are colored by related phenotypes that have been detected in GWAS (from the
“Reported trait” column in the GWAS Catalog). Shapes of variants correspond to the genetic
ancestry (if any) that has been associated with the SNP risk allele. N/A indicates that no related
ancestry has been reported so far.
Figure 1.3. Risk allele frequency and effect size of selected single nucleotide polymorphisms
(SNPs) associated with acute lymphoblastic leukemia (ALL) risk.
SNPs (n=33, Table 1.1) are grouped by nearest genes in each panel. (A) Percentage by which the
risk allele frequency in Africans/African-Americans and Latinos/Admixed Americans is higher or lower
than in Europeans (non-Finnish). Percentage change equation:
\[
\left(\frac{\text{Risk allele frequency of African and African American, or of Latinos and Admixed Americans}}{\text{Risk allele frequency of Europeans (non-Finnish)}} - 1\right) \times 100
\]
Horizontal bars are annotated by risk allele and colored by the direction of percentage
difference. (B) Difference of risk allele frequency in Africans/African-Americans and
Latinos/Admixed Americans as compared to in Europeans (non-Finnish). (C) Effect size of
selected GWAS-identified SNPs associated with ALL risk. Centers of points and horizontal bars
indicate point estimates and 95% confidence intervals. Points are shaped by study-reported
traits. Points and horizontal bars are colored by ancestry with the highest risk allele frequency.
The x-axis is on a log10 scale to better display the relatively small effect sizes.
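As a worked illustration of the percentage change used in panel (A), the short Python sketch below applies the formula to the GATA3 rs3824662 row of Table 1.1. It assumes that the AF_afr, AF_amr, and AF_nfe columns report the frequency of the risk allele; the function name is hypothetical and this is not code from the original analysis.

```python
def raf_percent_difference(raf_population: float, raf_european: float) -> float:
    """Percentage by which a population's risk allele frequency is higher
    (positive) or lower (negative) than the European (non-Finnish) frequency."""
    if raf_european <= 0:
        raise ValueError("European risk allele frequency must be positive")
    return (raf_population / raf_european - 1.0) * 100.0


# GATA3 rs3824662 (risk allele A), frequencies taken from Table 1.1
amr_vs_nfe = raf_percent_difference(0.395, 0.172)  # Latinos/Admixed Americans
afr_vs_nfe = raf_percent_difference(0.094, 0.172)  # Africans/African-Americans

print(f"AMR vs NFE: {amr_vs_nfe:+.1f}%")
print(f"AFR vs NFE: {afr_vs_nfe:+.1f}%")
```

A positive value means the risk allele is more common than in Europeans (non-Finnish), and a negative value means it is less common.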
Chapter 2: Epigenetic Biomarkers of Prenatal Tobacco Smoke Exposure Are Associated with
Gene Deletions in Childhood Acute Lymphoblastic Leukemia
Abstract
Background: Parental smoking is implicated in the etiology of acute lymphoblastic
leukemia (ALL), the most common childhood cancer. We recently reported an association
between an epigenetic biomarker of early-life tobacco smoke exposure at the AHRR gene and
increased frequency of somatic gene deletions among ALL cases.
Methods: Here, we further assess this association using two epigenetic biomarkers for
maternal smoking during pregnancy – DNA methylation at AHRR CpG cg05575921 and a
recently established polyepigenetic smoking score – in an expanded set of 482 B-cell ALL (B-
ALL) cases in the California Childhood Leukemia Study with available Illumina 450K or
MethylationEPIC array data. Multivariable Poisson regression models were used to test the
associations between the epigenetic biomarkers and gene deletion numbers.
Results: We found an association between DNA methylation at AHRR CpG cg05575921
and deletion number among 284 childhood B-ALL cases with MethylationEPIC array data, with a
ratio of means (RM) of 1.31 (95% CI:1.02-1.69) for each 0.1 beta-value reduction in DNA
methylation, an effect size similar to our previous report in an independent set of 198 B-ALL
cases with 450K array data (meta-analysis summary RM [sRM]=1.32, 95% CI:1.10-1.57). The
polyepigenetic smoking score was positively associated with gene deletion frequency among all
482 B-ALL cases (sRM=1.31 for each 4-unit increase in score; 95% CI:1.09-1.57).
Conclusions: We provide further evidence that prenatal tobacco-smoke exposure may
influence the generation of somatic copy-number deletions in childhood B-ALL.
Impact: Analyses of deletion breakpoint sequences are required to further understand the
mutagenic effects of tobacco smoke in childhood ALL.
Introduction
Acute lymphoblastic leukemia (ALL) is the most common childhood malignancy in the
United States, with approximately 2,700 incident cases diagnosed under age 15 each year
6
.
Although survival rates for ALL have improved dramatically in recent decades, with overall 5-
year survival now upwards of 90%
68
, ALL remains a leading cause of disease-related mortality
in children and current treatments still carry long-term health consequences
101,194,195
.
Therefore, prevention remains a top priority
46
. In addition to known ALL risk factors with large
effects, namely ionizing radiation and genetic predisposition syndromes
9,10,196
, several
environmental exposures have been associated with ALL etiology, including tobacco smoke,
pesticides, paint, and air pollution
46,48
; however, the causal mechanisms remain largely unclear
197
.
Childhood ALL, in particular B-cell ALL (B-ALL), is thought to follow a “two-hit” model of
leukemogenesis
45,145
, with in utero development of a pre-leukemic clone
146,147
that progresses
to overt leukemia following postnatal acquisition of secondary genetic changes
148
. Deletions of
genes involved in cell cycle control, and B-lymphocyte development and hematopoiesis
198–200
including, most commonly, CDKN2A, ETV6, PAX5, and IKZF1
199
, comprise a large proportion of
the secondary alterations in childhood B-ALL.
We recently reported a positive association between early-life tobacco smoke exposure
and somatic gene deletions in childhood ALL cases, suggesting a potential etiologic role for
parental and/or household smoking. In 559 childhood ALL cases in the California Childhood
Leukemia Study (CCLS), self-reported maternal and paternal smoking were associated with an
increased number of gene deletions
49
. In a subset of 198 B-ALL cases for whom genome-wide
DNA methylation data were available from Illumina® HumanMethylation450 BeadChip (450K)
arrays, we validated this association using an epigenetic biomarker for maternal smoking during
pregnancy at the AHRR gene
49,201,202
.
In the current study, we examine the association between DNA methylation at the AHRR
CpG cg05575921 and gene deletions in an expanded set of 482 B-ALL cases in the CCLS,
including an additional 284 B-ALL cases with DNA methylation data now available from
Illumina® Infinium MethylationEPIC BeadChip (EPIC) arrays. Further, we sought to expand our
analysis of the impact of prenatal tobacco smoke exposure on gene deletion burden in
childhood B-ALL by using a recently established polyepigenetic smoking score of in utero
tobacco smoke exposure
203
.
Materials and Methods
Ethics statement
This study was reviewed and approved by the Institutional Review Boards at the
University of Southern California, the University of California, Berkeley, the California
Department of Public Health, and all participating hospitals. Informed consent was obtained
from all study participants.
Study population
The CCLS is a population-based case-control study conducted from 1995-2015 to
examine the relationships between various environmental exposures, genetic factors, and
childhood leukemia
48
. Cases were identified within 72 hours after diagnosis at hospitals across
California. Eligibility criteria included (1) age under 15 years, (2) no prior cancer diagnosis, (3)
residence in California at the time of diagnosis, and (4) having an English or Spanish-speaking
biological parent available for interview. Controls were not included in the current case-only
analysis. Newborn dried bloodspots (DBS) were obtained from the California Biobank Program
Genetic Disease Screening Program. The current analysis included 482 B-ALL cases with
available genome-wide DNA methylation array data and gene deletion frequency data (Figure
2.1).
Somatic copy-number data
Copy-number at 8 commonly deleted gene regions (CDKN2A, ETV6, IKZF1, PAX5, BTG1,
EBF1, RB1, and genes within the pseudoautosomal region [PAR1] of the sex chromosomes
[CRLF2, CSF2RA, IL3RA]) was assayed in tumor DNA using multiplex ligation-dependent probe
amplification (MLPA), as previously described
49,204
. Of those with available race/ethnicity, 228
(58.9%) self-identified as Latino, 102 (26.4%) as non-Latino White, and 57 (14.7%) as other non-
Latino races/ethnicities (including African American, Native American, Asian, and mixed/other
groups) (Table 2.1).
Genome-wide DNA methylation arrays
For 198 B-ALL cases, genome-wide DNA methylation data were already available from
Illumina® 450K arrays
49,205
. For an additional 284 B-ALL cases, germline DNA was isolated from
newborn DBS and bisulfite-treated as previously described
205
, and subsequently assayed on
Illumina® EPIC arrays. EPIC arrays include >850,000 CpG probes, comprising >90% of CpGs on
450K arrays plus an additional 413,743 CpGs. CpG beta values were normalized to remove
batch effects according to the approach by Fortin et al.
206
. Functional normalization was
performed with noob background correction
207
by using the “preprocessFunnorm” function in
the minfi package
208
through the Bioconductor project
209,210
.
The AHRR CpG cg05575921 is included on both the 450K and EPIC arrays; we extracted
beta values for this CpG for all 284 B-ALL cases assayed on the EPIC array to test for association
with number of gene deletions, as previously performed for the 198 B-ALL cases in the 450K
dataset
49
.
We also calculated a new polyepigenetic smoking score for sustained maternal smoking
during pregnancy, in both the 450K and EPIC datasets
203
. In brief, for 450K data, we computed
a DNA methylation-based smoking score as the linear combination of 28 previously selected
maternal smoking-associated CpGs using their corresponding logistic LASSO regression
coefficients (Table 2.2)
203
. For EPIC data, we calculated a score using 26 of the 28 CpGs that are
included on the EPIC array, including the AHRR CpG cg05575921
203
. Cases with missing data for
any of these CpGs (due to detection P-values >0.01) were excluded from analyses involving
polyepigenetic smoking scores.
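The score calculation described above can be sketched as a weighted sum of CpG beta values. The Python example below is a minimal illustration under stated assumptions: the coefficients and the two extra CpG names are hypothetical placeholders (not the published LASSO weights), no intercept term is included, and a case with any missing CpG is returned as None to mirror its exclusion from score-based analyses.

```python
import math


def polyepigenetic_score(betas, coefficients):
    """Linear combination of CpG beta values and their coefficients.

    Returns None when any required CpG is missing, so that the case can be
    excluded from analyses involving the score.
    """
    total = 0.0
    for cpg, weight in coefficients.items():
        value = betas.get(cpg)
        if value is None or math.isnan(value):
            return None
        total += weight * value
    return total


# Hypothetical coefficients for illustration only; NOT the published weights.
example_coefficients = {
    "cg05575921": -10.0,
    "cg_hypothetical_1": 4.0,
    "cg_hypothetical_2": -2.5,
}

complete_case = {"cg05575921": 0.80, "cg_hypothetical_1": 0.50, "cg_hypothetical_2": 0.20}
incomplete_case = {"cg05575921": 0.80, "cg_hypothetical_1": None, "cg_hypothetical_2": 0.20}

score_complete = polyepigenetic_score(complete_case, example_coefficients)
score_incomplete = polyepigenetic_score(incomplete_case, example_coefficients)
print(score_complete, score_incomplete)
```

For the EPIC data, the same function would simply be called with a coefficient dictionary restricted to the CpGs present on that array.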
AHRR DNA methylation quantitative trait locus (mQTL) genotype data
To account for potential genetic effects on DNA methylation at the AHRR CpG
cg05575921, for 198 B-ALL cases (450K) we had available genotype data for SNP rs148405299,
which was identified as an mQTL for cg05575921, as previously described
205
. Additionally, for
the new set of 284 B-ALL cases (EPIC) we genotyped SNP rs77111113, which is in perfect linkage
disequilibrium with rs148405299 across all populations in LDlink (R² = 1.0)
211
, using a
predesigned TaqMan SNP genotyping assay (ThermoFisher Scientific, Assay ID:
C__25986435_10). Hereafter, we refer to either rs148405299 or rs77111113 as the AHRR-mQTL
SNP.
Self-reported smoking exposures
Among B-ALL cases with available parent interview data, we tested the association
between the DNA methylation-based biomarkers of maternal tobacco smoking in pregnancy
and self-reported tobacco exposures as assessed by parent interviews
48,49
. Dichotomous
smoking variables (“yes” or “no”) included maternal/paternal ever smoking, maternal/paternal
smoking 3 months before conception (preconception), maternal/paternal smoking at the time
of the interview, maternal smoking during pregnancy, maternal smoking during breastfeeding,
maternal prenatal smoking (during either preconception or pregnancy), maternal smoking
during the year after birth, and child postnatal passive smoking. Continuous measures (number
of cigarettes, pipes, or cigars per day) included maternal/paternal smoking preconception,
maternal smoking during pregnancy, maternal smoking during breastfeeding, and maternal
prenatal smoking (average of maternal smoking during preconception and during pregnancy).
We also used combinations of responses from parental interviews to infer: 1) which mothers
smoked throughout the entire duration of pregnancy, and 2) which mothers were never
exposed to tobacco smoke from any source during pregnancy, allowing us to compare the DNA
methylation-based biomarker levels of maternal smoking in pregnancy at these two extremes
of self-reported smoking.
Statistical analyses
All statistical analyses were performed in R v 4.0.0
212
. Two-sided p-values below 0.05
were considered statistically significant. All analyses were performed separately in the 450K and EPIC
datasets, including 198 and 284 B-ALL cases, respectively. Means and standard deviations were
summarized to describe the distribution of continuous characteristics, and frequencies and
proportions were computed for categorical characteristics.
We calculated Spearman rank correlation among self-reported tobacco smoking
exposures, DNA methylation at the AHRR CpG cg05575921 and the polyepigenetic smoking
scores. Linear regression models were additionally used to test for association between DNA
methylation at the AHRR CpG cg05575921 or the polyepigenetic smoking scores and self-
reported tobacco smoke exposures. To obtain independent effects of paternal smoking or
maternal smoking on DNA methylation, maternal smoking was adjusted for paternal smoking in
linear regression models, and vice versa (Table 2.3). In addition, associations between the joint
exposures of prenatal and postnatal tobacco smoking and DNA methylation were measured by
fitting linear regression models for composite variables that were newly derived from paternal
smoking preconception, maternal prenatal smoking and child postnatal passive smoking (Table
2.4). Linear regression models were adjusted for cell type heterogeneity using principal
components (PCs) derived from ReFACTor
213
, and genetic ancestry using PCs derived from
EPISTRUCTURE
214
.
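To make the Spearman rank correlation step concrete, the sketch below computes it in plain Python as the Pearson correlation of average ranks, which handles the ties that arise with dichotomous or count-valued smoking variables. The data are hypothetical and the helper names are ours; the analyses in this chapter were run in R, so this is only an illustration of the statistic.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: self-reported cigarettes per day and AHRR cg05575921 beta values
cigarettes_per_day = [0, 0, 2, 5, 10, 20]
ahrr_beta = [0.86, 0.84, 0.82, 0.80, 0.75, 0.70]
rho = spearman_rho(cigarettes_per_day, ahrr_beta)
print(round(rho, 3))
```

In this toy example heavier smoking goes with lower methylation, so rho is strongly negative. The function is undefined (division by zero) if either variable is constant.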
Linear regression models were used to assess whether the DNA methylation-based
biomarkers were significantly associated with child’s birth year; models of DNA methylation at
the AHRR CpG cg05575921 were additionally adjusted for the AHRR-mQTL SNP
205
.
Poisson regression models were used to test the association between DNA methylation at
the AHRR CpG cg05575921 and deletion numbers in 284 cases in the EPIC dataset, and between
the polyepigenetic smoking scores and deletion numbers in both the 450K and EPIC datasets.
Models were adjusted for ReFACTor and EPISTRUCTURE PCs and additionally adjusted for the
AHRR-mQTL SNP to control for potential confounding
205
. Ratios of means (RMs)
were calculated for every 0.1 beta-value decrease
49
in AHRR cg05575921 methylation, and for
every 4-unit increase in polyepigenetic scores. We also assessed the association between
polyepigenetic scores calculated without the AHRR CpG cg05575921 and deletion numbers.
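Because a Poisson model uses a log link, a coefficient estimated per one-unit change in a predictor can be rescaled to a ratio of means for any other increment by multiplying on the log scale and exponentiating. The sketch below illustrates this rescaling in Python. The coefficient and standard error are hypothetical values chosen only to give a ratio of plausible magnitude, and the function name is ours.

```python
import math


def ratio_of_means(coef, se, unit_change, z=1.959964):
    """Rescale a log-link (Poisson) coefficient to a ratio of means with a 95% CI.

    coef and se are on the log scale per one-unit increase in the predictor;
    unit_change is the increment of interest (negative for a decrease).
    """
    log_rm = coef * unit_change
    half_width = z * se * abs(unit_change)
    return math.exp(log_rm), math.exp(log_rm - half_width), math.exp(log_rm + half_width)


# Hypothetical coefficient per 1.0 beta-value increase in methylation (not from the study)
rm, lower, upper = ratio_of_means(coef=-2.7, se=1.3, unit_change=-0.1)
print(f"RM per 0.1 beta-value decrease: {rm:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

The same function gives the ratio of means for a 4-unit increase in a score by passing unit_change=4 with the score's own coefficient and standard error.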
Fixed effect meta-analysis models were used to test for heterogeneity between 450K
and EPIC datasets, and to generate summary effect estimates accounting for the variance of
each dataset, using R packages tidymeta and metafor
215,216
. Study heterogeneity was
characterized with I² statistics and their corresponding p-values
217
.
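The fixed-effect pooling described above can be illustrated with an inverse-variance calculation. The Python sketch below is not the tidymeta/metafor code used in this study; it back-calculates standard errors from rounded 95% confidence intervals, so results are approximate. As input it uses the dataset-specific ratios of means for the polyepigenetic score reported in the Results (450K: 1.36, 95% CI 1.05-1.76; EPIC: 1.26, 95% CI 0.97-1.64).

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def log_estimate_and_se(rm, lower, upper):
    """Recover the log ratio and its standard error from a reported 95% CI."""
    return math.log(rm), (math.log(upper) - math.log(lower)) / (2 * Z_95)


def fixed_effect_meta(studies):
    """Inverse-variance fixed-effect pooling of ratios.

    Returns the summary ratio, its 95% CI, Cochran's Q, and I^2 (in percent).
    """
    estimates = [log_estimate_and_se(*study) for study in studies]
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * y for w, (y, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    q = sum(w * (y - pooled) ** 2 for w, (y, _) in zip(weights, estimates))
    df = len(studies) - 1
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {
        "sRM": math.exp(pooled),
        "lower": math.exp(pooled - Z_95 * pooled_se),
        "upper": math.exp(pooled + Z_95 * pooled_se),
        "Q": q,
        "I2": i_squared,
    }


# (RM, lower, upper) for the 450K dataset, then the EPIC dataset
result = fixed_effect_meta([(1.36, 1.05, 1.76), (1.26, 0.97, 1.64)])
print({key: round(value, 2) for key, value in result.items()})
```

With these inputs the pooled estimate rounds to the summary RM of 1.31 (95% CI, 1.09-1.57) reported for the polyepigenetic score, and Q is smaller than its degrees of freedom, so I² is truncated at zero.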
Finally, we repeated the epigenetic biomarker and gene deletions analyses stratified by:
1) self-reported race/ethnicity, in the subset of B-ALL cases with available data (Table 2.1) and
limited to Latinos and non-Latino Whites due to sample size; and 2) age at diagnosis, limited to
≥2 years of age, 0 to 5 years of age, and >5 years of age (as the number of ALL cases diagnosed
<1 year of age in our study [n=8] was small).
Results
Demographic characteristics of the 482 B-ALL cases are summarized in Table 2.1, and
the study design is illustrated in Figure 2.1. The distributions of deletions among the 198 B-ALL cases
in the 450K dataset and the additional 284 B-ALL cases in the EPIC dataset were similar
(Figure 2.2 and Table 2.5). In the 450K dataset, 125/198 (63.1%) of cases harbored at least one
gene deletion compared with 162/284 (57.0%) of cases in the EPIC dataset (Figure 2.2).
Association between self-reported smoking variables and DNA methylation-based biomarkers
of maternal smoking in B-ALL cases
The median AHRR cg05575921 beta-value among B-ALL cases was 0.82 (interquartile
range [IQR]: 0.79-0.85) in the 450K dataset and 0.81 (IQR: 0.78-0.84) in the EPIC dataset. The
median polyepigenetic smoking score was -0.52 among 194 cases (4/198 cases excluded due to
missing data) in the 450K dataset (IQR: -1.83 to 0.95) and 0.83 (IQR: -0.34 to 2.11) among 284 cases
in the EPIC dataset (Figure 2.3). Mean methylation beta values of CpGs that were used to
generate the polyepigenetic scores are summarized in Table 2.2. Beta values of most of the
CpGs were significantly different between B-ALL cases in the 450K dataset and the EPIC dataset
(Figure 2.4), although a significant difference was not found for DNA methylation at the AHRR
CpG cg05575921.
Self-reported tobacco smoking exposure data were available for all 198 B-ALL cases in
the 450K dataset and 189 out of 284 cases in the EPIC dataset (Table 2.6). The distributions of
smoking variables were similar between cases in the 450K and EPIC datasets, although in
general more cases in the 450K dataset were reported to be exposed to tobacco smoke. We did
not find any evidence that DNA methylation at the AHRR CpG cg05575921 or the polyepigenetic
smoking scores were associated with child’s birth year (Figure 2.5).
Maternal smoking variables were strongly correlated with each other (rho range: 0.36-
1.00) and had relatively lower correlations with paternal smoking variables (rho range: 0.03-
0.48) (Figure 2.6). The two DNA methylation-based biomarkers were significantly correlated
(450K: rho = 0.54; EPIC: rho = 0.60). Decreased DNA methylation at the AHRR cg05575921 was
correlated with maternal prenatal smoking exposures in both the 450K and EPIC datasets; it
was additionally correlated with maternal smoking during breastfeeding and child passive
smoking (via parental smoking) in the EPIC dataset. Increased polyepigenetic scores were
significantly correlated with the majority of the self-reported parental smoking exposures in
both 450K and EPIC data.
In both the 450K and EPIC datasets, polyepigenetic smoking scores were associated with
nearly all of the self-reported smoking exposures in multivariable linear regression models
(Figure 2.7). Decreased AHRR cg05575921 beta-value was mainly associated with maternal
smoking exposures. Furthermore, joint exposures of maternal or paternal smoking and child
postnatal passive smoking were significantly associated with the two epigenetic biomarkers.
Independent maternal and paternal smoking effects on DNA methylation were obtained
from multivariable linear regression models (Figure 2.7). Maternal smoking exposures remained
associated with polyepigenetic smoking scores and AHRR cg05575921 methylation while
adjusting for paternal smoking preconception. In addition, paternal smoking during
preconception remained associated with the polyepigenetic smoking score when controlling for
maternal prenatal smoking. Notably, we found a -0.091 difference in the mean cg05575921
beta value and a ~4-unit difference for the polyepigenetic smoking score in the 450K dataset for
mothers who smoked throughout pregnancy compared with mothers who were never exposed
to tobacco smoke. The -0.091 difference is comparable to the previously reported -0.1
difference in AHRR cg05575921 beta-value of neonates of mothers with high cotinine levels
versus mothers with undetectable cotinine levels
218
. Therefore, we considered the
corresponding 4-unit coefficient estimate in the same model for the polyepigenetic score to be
biologically relevant, and subsequently computed RMs of deletion numbers for every 4-unit
increase of the polyepigenetic score in both 450K and EPIC datasets.
DNA methylation-based biomarkers of tobacco smoke exposure are associated with gene
deletion burden in childhood ALL
In the new EPIC dataset of 284 B-ALL cases, we found a 1.31-fold increase in the mean
number of deletions with every 0.1 beta-value decrease in cg05575921 (95% CI, 1.02-1.69;
Figure 2.8). After stratifying by sex, a stronger association was observed in males (RM, 1.41; 95% CI,
0.99-2.02) than in females (RM, 1.30; 95% CI, 0.87-1.95) (Table 2.7); however, this
difference was not significant in tests for heterogeneity (P het = 0.599). In a meta-analysis of
the 450K and EPIC datasets, the summary RM (sRM) was 1.32 (95% CI, 1.10-1.57; Figure 2.8).
We further extended our original analysis by constructing a DNA methylation-based
smoking score, including the AHRR CpG and over 20 additional CpGs. In the meta-analysis of the
450K and EPIC datasets, the polyepigenetic score was also significantly associated with an
increased number of deletions with a 1.31-fold increase in mean number of deletions for every
4-unit increase in the score (95% CI, 1.09-1.57; Figure 2.8). Similar effect sizes were seen for the
association between the polyepigenetic score and the number of deletions in the 450K dataset
(RM = 1.36; 95% CI, 1.05-1.76) and the EPIC dataset (RM = 1.26; 95% CI, 0.97-1.64), although
the latter did not reach statistical significance (Figure 2.8).
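The pooling of the two datasets can be sketched as an inverse-variance weighted average on the log scale. This is a minimal illustration that assumes a fixed-effect model and recovers each standard error from the rounded 95% CI reported above; the function names are hypothetical and the original analysis may have used a different implementation.

```python
import math

Z = 1.959964  # two-sided 95% normal quantile

def log_rm_and_se(rm, lo, hi):
    """Recover the log ratio of means and its SE from a reported 95% CI."""
    return math.log(rm), (math.log(hi) - math.log(lo)) / (2 * Z)

def fixed_effect_summary(estimates):
    """Inverse-variance weighted summary of (RM, lower, upper) tuples."""
    pairs = [log_rm_and_se(*e) for e in estimates]
    weights = [1 / se ** 2 for _, se in pairs]
    pooled = sum(w * b for w, (b, _) in zip(weights, pairs)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return (math.exp(pooled),
            math.exp(pooled - Z * pooled_se),
            math.exp(pooled + Z * pooled_se))

# Polyepigenetic score, RM per 4-unit increase (values reported above).
srm, lo, hi = fixed_effect_summary([(1.36, 1.05, 1.76),   # 450K
                                    (1.26, 0.97, 1.64)])  # EPIC
print(f"sRM = {srm:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

With these rounded inputs the sketch gives an sRM of about 1.31 (95% CI, 1.09-1.57), in line with the summary estimate reported for the polyepigenetic score.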
We next explored whether removal of the AHRR CpG cg05575921 would impact the
association between the polyepigenetic score and ALL patient gene deletion burden. A
significant association between the modified polyepigenetic score and the number of deletions
was still observed in the 450K dataset (RM = 1.44; 95% CI, 1.06-1.95), with a slightly attenuated
effect in the EPIC dataset (RM = 1.24; 95% CI, 0.91-1.69; Figure 2.8). In the meta-analysis, the
polyepigenetic smoking score excluding the AHRR CpG remained significantly positively
associated with number of gene deletions (sRM = 1.34; 95% CI, 1.08-1.66; Figure 2.8).
No significant interaction effects were detected between the DNA methylation-based
biomarkers and B-ALL cytogenetic subtypes (high-hyperdiploidy [HD-ALL] and ETV6-RUNX1
fusion) on deletion numbers (Table 2.8).
In analyses stratified by self-reported race/ethnicity, stronger associations between the
DNA methylation-based biomarkers and gene deletions presented in non-Latino White
compared to Latino B-ALL cases (Figure 2.9), although the differences were not significant in
tests for heterogeneity (P het >0.10). Finally, we assessed potential effects of patient age at
diagnosis on our results. After excluding cases diagnosed <2 years of age, the associations
between the epigenetic biomarkers and gene deletion burden were similar to those found in the
overall B-ALL cases (Figure 2.10); the association between the polyepigenetic smoking score
and gene deletions was slightly stronger among B-ALL cases diagnosed >5 years of age than
those diagnosed ≤5 years of age (P het >0.10).
Discussion
Somatic copy-number loss of lymphoid transcription factor and cell cycle control genes
is an important driver of leukemogenesis in childhood ALL. Aberrant recombination-activating
gene (RAG) activity, which normally drives antibody diversification as part of the adaptive
immune system, is thought to underlie the formation of gene deletions in some ALL patients
120,219
. However, few epidemiology studies have explored whether extrinsic factors influence
the generation of somatic copy-number alterations in developing lymphocytes. Here, we
provide further evidence that prenatal exposure to tobacco smoke may induce leukemia-
causing gene deletions in ALL patients
49,220
.
We recently reported that decreased DNA methylation at the AHRR CpG cg05575921, a
biomarker for maternal smoking during pregnancy
218,221
, was associated with an increased
frequency of somatic gene deletions among childhood B-ALL cases
49
. We have replicated this
association in a larger, independent set of childhood B-ALL cases, assayed on Illumina® EPIC
DNA methylation arrays, and found a remarkably similar effect size with a ratio of means of
1.31 in the current study compared with 1.32 in our previous report. Further, we found a
similar positive association between gene deletion frequency and increased in utero tobacco
smoke exposure in ALL cases as measured by a recently established polyepigenetic smoking
score
203
. This association remained after removal of the AHRR CpG from the smoking score
and, thus, we were able to confirm our findings using an independent epigenetic biomarker.
Previous case-control studies based on questionnaire data reported a significant
association between paternal preconception smoking and childhood ALL risk, but no association
between maternal smoking and childhood ALL risk
48,222,223
. The discrepancy between our
findings and the epidemiological literature on maternal smoking and childhood ALL risk could
be due to several reasons. First, case-control studies limited to the use of self-reported data
may be affected by recall bias, and may include potentially underreported smoking exposures
224
due to a perceived social stigma
225
, in particular for maternal smoking. In addition, the two
epigenetic biomarkers examined in this study can reflect particularly sustained maternal
smoking throughout pregnancy
48,203,226
, which is difficult to assess using single survey
questions.
Second, we cannot rule out that these epigenetic biomarkers may also be proxies for
non-maternal and/or postnatal tobacco smoke exposures that may impact the generation of
gene deletions in childhood ALL. In our study, AHRR cg05575921 methylation was strongly
associated with self-reported maternal prenatal smoking, consistent with previous findings that
decreased methylation at cg05575921 was definitively associated with in utero exposure to
maternal smoking
201,205
, and not overtly connected to paternal smoking or secondhand smoke
exposure
202
. However, in contrast, the polyepigenetic smoking scores were associated with
both self-reported maternal and paternal smoking exposures, suggesting that some CpGs
included in this score may be associated with multiple sources of tobacco exposure. Moreover,
both biomarkers were associated with the cumulative self-reported smoking exposures and the
joint exposures of parental prenatal smoking with child postnatal passive smoking. These
composite variables are indicative of smoking exposures in the household or residual prenatal
smoking exposures that were not captured by single survey questions.
Third, the potential leukemogenic effects of tobacco smoke exposure on gene deletions
in ALL patients may not translate to overall ALL risk in case-control studies, perhaps due to
varying effects in different molecular subtypes. This is supported by the finding that the
combination of paternal prenatal smoking with child postnatal passive smoking was
significantly associated with ETV6-RUNX1 fusion ALL, but not with HD-ALL
48
. HD-ALL is
associated with a lower frequency of somatic gene deletions relative to other ALL subtypes
49
and, in our previous study, self-reported tobacco exposures were no longer associated with
gene deletion frequency in ALL cases when restricted to HD-ALL
49
, though we did not formally
test for interaction. In the current study, we did not find significant interaction between ALL
subtype and the DNA methylation-related biomarkers, but this may be due to a lack of power
and warrants further investigation.
To our knowledge, this is the first study to (1) compute a polyepigenetic smoking score
using neonatal DNA methylation data from EPIC arrays, and (2) test whether self-reported
smoking exposures were associated with polyepigenetic scores derived from both 450K and
EPIC data. The smoking score was developed by Reese et al. using 450K array data from
newborn cord blood samples
203
and, in our study, we found largely consistent mean
methylation beta values for the CpGs used to generate the score (Table 2.2). Excluding the two
CpGs present on 450K but not EPIC arrays caused little loss of performance in the 450K data;
however, the predictive performance of the score using EPIC array data requires further
investigation. We found that the majority of CpGs in the consensus smoking score showed
significantly different average beta values between 450K and EPIC array data (21/26 CpGs in
overall newborns; 19/26 CpGs in newborns not exposed to tobacco smoke during pregnancy)
(Figure 2.4). Further, the consensus score was significantly lower in the 450K dataset (0.26 [IQR:
-1.03-1.80]) than the EPIC dataset (0.83 [IQR: -0.34-2.11]; Wilcoxon test p = 0.003), despite
more newborns in the 450K dataset being exposed to parental tobacco smoke according to
interview data (Table 2.6). The inter-array differences in the smoking score CpGs did not
correspond consistently with their association with maternal smoking during pregnancy
218
, nor
were they likely explained by changes in individuals’ smoking behaviors over time (i.e., cohort
effect) (Figure 2.5). They may instead be due to probe cross-reactivity or a shifted distribution
of methylation values caused by increased Type II probe measurements on EPIC arrays
227,228
.
The polyepigenetic smoking score was developed in a homogeneous population from
Norway
203
, a country with different smoking habits (e.g. more prevalent use of hand-rolled
cigarettes with higher nicotine and tar content) than the US
229
. This may hamper the
performance and generalizability of the score in our study, in which over 50% of cases were of
non-white race/ethnicity, with a particularly large number of Latinos. In analyses stratified by
self-reported race/ethnicity, both epigenetic biomarkers of tobacco smoke exposure showed a
stronger association with the frequency of ALL gene deletions in non-Latino Whites than in
Latinos. This might be attributable to a potentially superior performance of these biomarkers in
predicting prenatal tobacco smoke exposure in non-Latino Whites compared to Latinos, but this
warrants further evaluation as the number of non-Latino White cases in our study was
relatively small. Nonetheless, the transferability of epigenetic biomarkers developed in largely
European ancestry individuals across ancestrally diverse populations should be determined.
Our study does have several limitations that warrant consideration. Importantly, the
DNA methylation-based biomarkers were derived from newborn DBS, thus we were not able to
assess the potential effects of postnatal tobacco smoke exposure. Preleukemic clones may be
present at birth, but at very low clonal frequencies in whole blood
230,231
and, thus, are unlikely
to have influenced our DNA methylation results. This is supported by the minimal effects on our
results after excluding B-ALL cases diagnosed <2 years of age. An additional limitation was our
limited ability to study the effects of tobacco smoke exposure across different cytogenetic
subtypes of ALL, due to sample size and a lack of information on subtypes beyond HD-ALL and
ETV6-RUNX1 fusion. Further, our analyses were limited to the 8 commonly deleted genes
targeted by the MLPA assays. These assays do not provide information on deletion breakpoint
locations, hence we could not explore the molecular mechanisms underlying the formation of
deletions in our ALL cases. Given that aberrant RAG-mediated V(D)J recombination underlies a
large proportion of somatic gene deletions in ALL
120,219,232
, it is compelling that cord blood
lymphocytes in newborns of mothers exposed to tobacco smoke have been found to harbor a
significantly higher frequency of off-target RAG recombination-mediated deletions than those of
newborns of mothers who were not exposed to tobacco
220,233
; however, this remains to be
examined in the setting of childhood ALL.
In summary, we provide further evidence that prenatal tobacco smoke exposure may
influence the generation of somatic copy-number deletions in childhood B-ALL cases. Future
epidemiological studies that incorporate both information on early-life exposure to tobacco
smoke as well as whole-genome sequencing of ALL tumors and, in turn, analysis of mutational
signatures and deletion breakpoint sequences are required to investigate the potential
mutagenic effects of tobacco smoke in childhood ALL.
Table 2.1. Characteristics of childhood B-ALL cases (n = 482) in the California Childhood
Leukemia Study.
Variables                             Total (n = 482)  450K (n = 198)  EPIC (n = 284)  P
Gestational age (weeks), mean (SD)    39.31 (2.15)     39.51 (1.92)    39.06 (2.39)    0.047
Gestational age unknown (%)           124 (25.7)       -               124 (43.7)
Age at diagnosis (years), mean (SD)   5.47 (3.44)      5.36 (3.23)     5.55 (3.58)     0.550
Sex (%)
  Females                             223 (46.3)       91 (46.0)       132 (46.5)      0.984
  Males                               259 (53.7)       107 (54.0)      152 (53.5)
Race/ethnicity (%)
  Latino                              228 (58.9)       115 (58.1)      113 (59.8)      0.914
  Non-Latino White                    102 (26.4)       54 (27.3)       48 (25.4)
  Non-Latino Other                    57 (14.7)        29 (14.6)       28 (14.8)
Race/ethnicity unknown (%)            95 (19.7)        -               95 (33.5)
Deletion number (%)
  0                                   195 (40.5)       73 (36.9)       122 (43.0)      0.459
  1                                   151 (31.3)       64 (32.3)       87 (30.6)
  2                                   88 (18.3)        40 (20.2)       48 (16.9)
  3                                   35 (7.3)         13 (6.6)        22 (7.7)
  4                                   9 (1.9)          6 (3.0)         3 (1.1)
  5                                   4 (0.8)          2 (1.0)         2 (0.7)
P-values comparing the characteristics of B-ALL cases in the 450K and EPIC array datasets were
calculated using Student's t-tests for continuous variables (gestational age and age at diagnosis)
and Chi-squared tests for categorical variables.
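As an illustration of the categorical comparison, the sketch below applies a Pearson chi-squared test to the sex-by-array counts in the table. It assumes Yates' continuity correction for the 2x2 table, which the text does not state; the function name is hypothetical, and the P-value uses the closed-form survival function of a chi-squared distribution with 1 degree of freedom.

```python
import math

def chi2_2x2_yates(table):
    """Pearson chi-squared test for a 2x2 table with Yates' continuity
    correction; returns (statistic, p-value) on 1 degree of freedom."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            stat += max(abs(observed - expected) - 0.5, 0.0) ** 2 / expected
    # Survival function of chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

# Sex by array (rows: females, males; columns: 450K, EPIC), from Table 2.1.
stat, p = chi2_2x2_yates([[91, 132], [107, 152]])
print(f"chi-squared = {stat:.4f}, P = {p:.3f}")
```

Under this assumption the sketch returns P of about 0.984 for sex, matching the tabulated value; the race/ethnicity and deletion-number comparisons have more than two categories and would need the general chi-squared distribution.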
Table 2.2. The distribution of CpGs that make up the polyepigenetic smoking scores*.
                        Data from Reese et al.         450K                    EPIC
CpG           Gene      Coefficient  Mean beta (450K)  Mean  SD    Min   Max   Mean  SD    Min   Max
cg00709966**  -         -1.667       0.41              0.40  0.06  0.24  0.66  -     -     -     -
cg02256631    ITGAM     -0.191       0.10              0.10  0.04  0.04  0.38  0.09  0.04  0.03  0.33
cg02482603    RABGAP1L  2.706        0.43              0.41  0.06  0.22  0.57  0.37  0.05  0.25  0.50
cg04103532    HIVEP2    1.786        0.57              0.61  0.07  0.44  0.81  0.66  0.07  0.40  0.80
cg04180046    MYO1G     14.027       0.45              0.51  0.08  0.29  0.80  0.48  0.07  0.31  0.82
cg04506190    PLXND1    2.318        0.26              0.23  0.05  0.04  0.41  0.21  0.03  0.13  0.32
cg05549655    CYP1A1    6.210        0.20              0.19  0.06  0.07  0.36  0.19  0.05  0.02  0.35
cg05575921    AHRR      -10.909      0.86              0.82  0.05  0.62  0.93  0.81  0.05  0.57  0.91
cg08698721    MEG3      1.142        0.60              0.65  0.06  0.49  0.81  0.56  0.06  0.43  0.71
cg09743950    ITGAM     -6.330       0.68              0.78  0.06  0.53  0.94  0.71  0.05  0.56  0.88
cg10799846    SYNJ2     -4.963       0.23              0.23  0.03  0.16  0.36  0.06  0.01  0.04  0.10
cg11864574**  SPAG6     -0.370       0.46              0.46  0.08  0.18  0.64  -     -     -     -
cg12186702    PLVAP     3.847        0.50              0.50  0.04  0.20  0.61  0.50  0.03  0.40  0.57
cg13834112    -         1.514        0.57              0.67  0.06  0.48  0.91  0.67  0.05  0.49  0.80
cg13893782    PDE7B     -0.963       0.18              0.11  0.04  0.02  0.27  0.09  0.03  0.04  0.17
cg14179389    GFI1      -6.304       0.24              0.24  0.09  0.03  0.50  0.27  0.09  0.07  0.56
cg14351425    GRK5      6.361        0.22              0.15  0.06  0.03  0.42  0.14  0.05  0.05  0.44
cg14633298    TRIM27    5.050        0.86              0.91  0.03  0.74  0.97  0.89  0.03  0.56  0.95
cg14743346    DERL1     2.286        0.77              0.87  0.03  0.73  0.96  0.87  0.02  0.66  0.91
cg17397069    SGCD      -2.912       0.67              0.76  0.05  0.61  0.90  0.68  0.04  0.55  0.82
cg19381766    -         5.245        0.56              0.53  0.03  0.44  0.60  0.54  0.02  0.46  0.60
cg22154659    HOXA1     -0.773       0.42              0.40  0.08  0.14  0.79  0.36  0.07  0.13  0.59
cg22802102    NTF3      -0.254       0.73              0.81  0.06  0.60  0.97  0.71  0.07  0.37  0.94
cg23304605    CCDC88C   0.011        0.18              0.16  0.05  0.08  0.44  0.12  0.05  0.05  0.51
cg25189904    GNG12     -3.903       0.53              0.57  0.06  0.37  0.76  0.48  0.04  0.33  0.62
cg25949550    CNTNAP2   -46.991      0.11              0.08  0.02  0.04  0.17  0.09  0.02  0.06  0.14
cg26764244    GNG12     -0.246       0.27              0.28  0.05  0.11  0.46  0.26  0.04  0.11  0.44
cg27291468    -         0.836        0.67              0.67  0.11  0.39  0.86  0.78  0.11  0.36  0.93
Coefficient: coefficients used to construct polyepigenetic scores. Mean beta (450K): mean
methylation beta values of 450K data.
*Algorithm for calculating the polyepigenetic score in 450K data: polyepigenetic score =
-1.667*cg00709966 - 0.191*cg02256631 + 2.706*cg02482603 + 1.786*cg04103532 +
14.027*cg04180046 + 2.318*cg04506190 + 6.210*cg05549655 - 10.909*cg05575921 +
1.142*cg08698721 - 6.330*cg09743950 - 4.963*cg10799846 - 0.370*cg11864574 +
3.847*cg12186702 + 1.514*cg13834112 - 0.963*cg13893782 - 6.304*cg14179389 +
6.361*cg14351425 + 5.050*cg14633298 + 2.286*cg14743346 - 2.912*cg17397069 +
5.245*cg19381766 - 0.773*cg22154659 - 0.254*cg22802102 + 0.011*cg23304605 -
3.903*cg25189904 - 46.991*cg25949550 - 0.246*cg26764244 + 0.836*cg27291468; Algorithm
for the polyepigenetic score in EPIC data was calculated as above, but excluding CpGs
cg00709966 and cg11864574 that were not present on the EPIC arrays.
**CpG probes present on Illumina 450K arrays but not on EPIC arrays.
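The score algorithm in the footnote above is a weighted sum of beta values, which can be sketched as follows. The coefficients are copied from Table 2.2; the function name, the dictionary-based input, and the choice to raise an error on a missing CpG are illustrative assumptions rather than the original implementation.

```python
# Coefficients from Table 2.2 (Reese et al.), keyed by CpG probe ID.
COEFFICIENTS = {
    "cg00709966": -1.667, "cg02256631": -0.191, "cg02482603": 2.706,
    "cg04103532": 1.786, "cg04180046": 14.027, "cg04506190": 2.318,
    "cg05549655": 6.210, "cg05575921": -10.909, "cg08698721": 1.142,
    "cg09743950": -6.330, "cg10799846": -4.963, "cg11864574": -0.370,
    "cg12186702": 3.847, "cg13834112": 1.514, "cg13893782": -0.963,
    "cg14179389": -6.304, "cg14351425": 6.361, "cg14633298": 5.050,
    "cg14743346": 2.286, "cg17397069": -2.912, "cg19381766": 5.245,
    "cg22154659": -0.773, "cg22802102": -0.254, "cg23304605": 0.011,
    "cg25189904": -3.903, "cg25949550": -46.991, "cg26764244": -0.246,
    "cg27291468": 0.836,
}
# Probes present on 450K arrays but not on EPIC arrays.
NOT_ON_EPIC = {"cg00709966", "cg11864574"}

def polyepigenetic_score(betas, array="450K"):
    """Weighted sum of methylation beta values for one sample.

    `betas` maps CpG ID -> beta value; `array` is "450K" or "EPIC".
    """
    if array not in ("450K", "EPIC"):
        raise ValueError("array must be '450K' or 'EPIC'")
    cpgs = [c for c in COEFFICIENTS
            if array == "450K" or c not in NOT_ON_EPIC]
    missing = [c for c in cpgs if c not in betas]
    if missing:
        # Assumption: samples with any missing score CpG are not scored.
        raise KeyError(f"missing beta values for: {missing}")
    return sum(COEFFICIENTS[c] * betas[c] for c in cpgs)

# Toy sample with every CpG at a beta value of 0.5 (hypothetical data).
toy = {cpg: 0.5 for cpg in COEFFICIENTS}
print(polyepigenetic_score(toy), polyepigenetic_score(toy, array="EPIC"))
```

Raising on missing CpGs mirrors the exclusion of cases with incomplete score CpGs described for Figure 2.3, although imputation would be another reasonable design.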
Table 2.3. Maternal smoking variables adjusted for paternal smoking in linear regression
models for association analysis between parental smoking and DNA methylation, and vice
versa.
Binary tobacco smoking exposures Adjusted tobacco smoking exposures
Paternal, ever Maternal, ever
Maternal, ever Paternal, ever
Paternal, preconception Maternal, prenatal
Maternal, preconception Paternal, preconception
Maternal, pregnancy Paternal, preconception
Maternal, breastfeeding Paternal, preconception
Continuous exposures
Paternal, preconception (CPD) Maternal, prenatal (CPD)
Maternal, preconception (CPD) Paternal, preconception (CPD)
Maternal, pregnancy (CPD) Paternal, preconception (CPD)
Maternal, breastfeeding (CPD) Paternal, preconception (CPD)
Table 2.4. Joint exposures of self-reported tobacco smoking.
Paternal preconception   Maternal prenatal   Child's postnatal passive
Joint exposure 1: paternal preconception and maternal prenatal smoking
No                       No                  -
Yes                      No                  -
No                       Yes                 -
Yes                      Yes                 -
Joint exposure 2: paternal preconception and child's postnatal passive smoking
No                       -                   No
Yes                      -                   No
No                       -                   Yes
Yes                      -                   Yes
Joint exposure 3: maternal prenatal and child's postnatal passive smoking
-                        No                  No
-                        Yes                 No
-                        No                  Yes
-                        Yes                 Yes
Table 2.5. The distribution of gene deletions of 482 B-ALL cases.
Gene deletion 450K (n = 198) EPIC (n = 284)
CDKN2A (%)
No 143 (72.2) 226 (79.6)
Yes 55 (27.8) 58 (20.4)
IKZF1 (%)
No 167 (84.3) 241 (84.9)
Yes 31 (15.7) 43 (15.1)
PAX5 (%)
No 158 (79.8) 239 (84.2)
Yes 40 (20.2) 45 (15.8)
ETV6 (%)
No 152 (76.8) 221 (77.8)
Yes 46 (23.2) 63 (22.2)
BTG1 (%)
No 182 (91.9) 265 (93.3)
Yes 16 ( 8.1) 19 ( 6.7)
RB1 (%)
No 184 (92.9) 261 (91.9)
Yes 14 ( 7.1) 23 ( 8.1)
PAR1 (%)
No 186 (93.9) 267 (94.0)
Yes 12 ( 6.1) 17 ( 6.0)
EBF1 (%)
No 195 (98.5) 281 (98.9)
Yes 3 ( 1.5) 3 ( 1.1)
Table 2.6. Summary statistics of self-reported tobacco smoke exposures.
Variable 450K (n = 198) EPIC (n = 189/284)
Paternal, ever (%)
No 104 (52.5) 104 (65.4)
Yes 94 (47.5) 55 (34.6)
Paternal, now (%)
No 147 (75.0) 131 (83.4)
Yes 49 (25.0) 26 (16.6)
Paternal, preconception (%)
No 147 (74.2) 128 (81.0)
Yes 51 (25.8) 30 (19.0)
Paternal, preconception (CPD) (mean (SD)) 3.06 (7.44) 1.83 (5.68)
Maternal, ever (%)
No 147 (74.2) 147 (78.6)
Yes 51 (25.8) 40 (21.4)
Maternal, now (%)
No 171 (86.4) 174 (93.5)
Yes 27 (13.6) 12 (6.5)
Maternal, preconception (%)
No 165 (83.3) 168 (89.8)
Yes 33 (16.7) 19 (10.2)
Maternal, preconception (CPD) (mean (SD)) 1.56 (5.53) 1.15 (4.13)
Maternal, pregnancy (%)
No 176 (88.9) 175 (93.6)
Yes 22 (11.1) 12 (6.4)
Maternal, pregnancy (CPD) (mean (SD)) 0.51 (1.76) 0.53 (2.59)
Maternal, prenatal (%)
No 165 (83.3) 168 (89.8)
Yes 33 (16.7) 19 (10.2)
Maternal, prenatal (CPD) (mean (SD)) 1.03 (3.31) 0.84 (3.18)
Maternal, breastfeeding (%)
No 183 (96.3) 168 (96.6)
Yes 7 (3.7) 6 (3.4)
Maternal, breastfeeding (CPD) (mean (SD)) 0.20 (1.19) 0.18 (1.07)
Maternal, post (%)
No 165 (83.3) 167 (89.3)
Yes 33 (16.7) 20 (10.7)
Maternal, post (CPD) (mean (SD)) 0.97 (3.02) 0.52 (2.74)
Child, postnatal passive (other) (%)
No 170 (86.3) 171 (91.0)
Yes 27 (13.7) 17 (9.0)
Child, postnatal passive (parent) (%)
No 130 (66.3) 123 (75.9)
Yes 66 (33.7) 39 (24.1)
Child, postnatal passive (%)
No 122 (62.2) 122 (72.2)
Yes 74 (37.8) 47 (27.8)
Cumulative exposures (mean (SD)) 0.91 (1.28) 0.55 (1.07)
Cumulative exposures (categorical) (%)
0 115 (58.7) 112 (72.7)
1 23 (11.7) 17 (11.0)
2 32 (16.3) 14 (9.1)
3 12 (6.1) 4 (2.6)
4 14 (7.1) 7 (4.5)
Joint exposure (paternal, preconception & maternal, prenatal) (%)
No_No 133 (67.2) 122 (78.2)
No_Yes 14 (7.1) 5 (3.2)
Yes_No 32 (16.2) 20 (12.8)
Yes_Yes 19 (9.6) 9 (5.8)
Joint exposure (paternal, prenatal & child, postnatal passive) (%)
No_No 116 (59.2) 114 (73.5)
No_Yes 29 (14.8) 11 (7.1)
Yes_No 6 (3.1) 8 (5.2)
Yes_Yes 45 (23.0) 22 (14.2)
Joint exposure (maternal, prenatal & child, postnatal passive) (%)
No_No 120 (61.2) 120 (71.4)
No_Yes 44 (22.4) 29 (17.3)
Yes_No 2 (1.0) 2 (1.2)
Yes_Yes 30 (15.3) 17 (10.1)
Maternal, throughout pregnancy vs. never (%)
No 84 (90.3) 93 (93.0)
Yes 9 (9.7) 7 (7.0)
Study phase (%)
1 15 (7.6) 43 (22.8)
2 61 (30.8) 28 (14.8)
3 122 (61.6) 48 (25.4)
5 - 70 (37.0)
Table 2.7. Multivariable Poisson regression testing the association between epigenetic
biomarkers of prenatal tobacco smoke exposure and gene deletion frequency in B-ALL cases.
Term RM 95% CI P n set
Polyepigenetic score 1.36 (1.05-1.76) 0.022 194 450K
Males 1.54 (1.10-2.14) 0.011 105 450K
Females 1.63 (0.98-2.70) 0.058 89 450K
Modified polyepigenetic score without AHRR 1.44 (1.06-1.95) 0.021 194 450K
Males 1.64 (1.11-2.43) 0.013 105 450K
Females 1.73 (0.96-3.10) 0.067 89 450K
DNA methylation at the AHRR cg05575921 1.31 (1.02-1.69) 0.036 280 EPIC
Males 1.41 (0.99-2.02) 0.058 149 EPIC
Females 1.30 (0.87-1.95) 0.195 131 EPIC
Polyepigenetic score 1.26 (0.97-1.64) 0.078 284 EPIC
Males 1.26 (0.85-1.87) 0.243 152 EPIC
Females 1.32 (0.91-1.92) 0.148 132 EPIC
Modified polyepigenetic score without AHRR 1.24 (0.91-1.69) 0.165 284 EPIC
Males 1.18 (0.73-1.90) 0.499 152 EPIC
Females 1.33 (0.87-2.05) 0.193 132 EPIC
Ratios of means (RMs) were calculated for every 0.1 beta value decrease of AHRR CpG
cg05575921 and every 4-unit increase of polyepigenetic smoking score. All Poisson regression
models were adjusted for cell type heterogeneity and genetic ancestry. Models with exposure
variable DNA methylation at the AHRR CpG cg05575921 were additionally adjusted for methyl-
QTL SNP genotypes (rs148405299 in the 450K dataset and rs77111113 in the EPIC dataset).
Table 2.8. Multivariable Poisson regression including the interaction term between DNA
methylation and B-ALL cytogenetic subtypes.
Interaction term between DNA methylation and B-ALL cytogenetic subtypes RM 95% CI P n set
AHRR:hyperdiploid 1.02 (0.47-2.19) 0.962 187 450K
Polyepigenetic score:hyperdiploid 1.32 (0.62-2.82) 0.468 183 450K
AHRR:ETV6-RUNX1 0.81 (0.48-1.35) 0.412 187 450K
Polyepigenetic score:ETV6-RUNX1 0.71 (0.41-1.24) 0.232 183 450K
AHRR:hyperdiploid 0.66 (0.20-2.18) 0.493 232 EPIC
Polyepigenetic score:hyperdiploid 1.25 (0.41-3.81) 0.700 236 EPIC
AHRR:ETV6-RUNX1 1.20 (0.71-2.05) 0.493 250 EPIC
Polyepigenetic score:ETV6-RUNX1 0.92 (0.52-1.65) 0.789 254 EPIC
All the Poisson regression models were adjusted for cell type heterogeneity and genetic
ancestry. Models with exposure variable DNA methylation at the AHRR CpG cg05575921 were
additionally adjusted for methyl-QTL SNP genotypes (rs148405299 in the 450K dataset and
rs77111113 in the EPIC dataset).
Figure 2.1. Sample flowchart.
Left box: ALL cases included in the previous analysis for the association between early-life
tobacco smoke and gene deletion frequencies (n=559)
20
, in which 361 cases were analyzed
only with interview data and 198 B-ALL cases had Illumina 450K genome-wide DNA methylation
array data available and were thus included in the analyses of DNA methylation at the AHRR
CpG cg05575921. Right box: samples included in the current study, including 198 B-ALL cases
that were analyzed previously and 284 B-ALL cases now with available EPIC array DNA
methylation data and MLPA gene deletion frequency data, of which 178 overlapped with the
559 cases that were analyzed previously only with interview data. In total, 482 B-ALL cases are
included in our case-only analysis of prenatal tobacco smoke exposure and gene deletion
frequency. *11 out of 106 B-ALL cases have interview data now available.
Figure 2.2. Chord diagrams showing the proportions of cases with at least one deletion and
cases with zero deletions, in the 450K array (N=198) and EPIC array (N=284) datasets.
Width of the end of each link is proportional to the number of deletions of each gene. Width of
the link between two genes is proportional to the number of cases having both genes deleted.
Diagrams were created using R package circlize.
Figure 2.3. Density plots of the polyepigenetic smoking score generated among 194 B-ALL
cases.
Four out of 198 cases with missing data for at least one of the CpGs that make up the
polyepigenetic smoking score were excluded.
Figure 2.4. Independent two sample t-tests comparing DNA methylation beta values for CpGs
that are shared by 450K arrays and EPIC arrays and were used to calculate the polyepigenetic
smoking score.
The plus and minus signs in the panel labels represent whether DNA methylation beta value at
this CpG is positively or negatively associated with maternal tobacco smoke. P-values were
obtained from independent two sample t-tests comparing CpG beta values between the 450K
dataset (N=198) and the EPIC dataset (N=284) in (A) overall cases (N = 482) and (B) limited to
cases whose mothers reported not smoking during pregnancy (N = 351).
Figure 2.5. Linear regression results for DNA methylation at the AHRR CpG cg05575921, the
polyepigenetic smoking score and child’s birth year.
In linear regression models, the independent variable was child’s birth year and the dependent
variable was DNA methylation at the AHRR CpG cg05575921 or the polyepigenetic smoking
score. Models were additionally adjusted for the AHRR-mQTL SNP for DNA methylation at the
AHRR CpG cg05575921.
Figure 2.6. Correlation matrix for parental reported tobacco smoke exposure variables, DNA
methylation at the AHRR CpG cg05575921 and the polyepigenetic smoking score in the (A) 450K
array (N=198) and (B) EPIC array (N=284) datasets.
P-values and correlation coefficients were calculated using the Spearman’s rank correlation
coefficient test. Statistically significant correlations were marked with asterisks. AHRR CpG
cg05575921 beta value was multiplied by -10.
Figure 2.7. Linear regression results for the associations of parental self-reported tobacco
smoke exposures with DNA methylation at the AHRR CpG cg05575921 and with the
polyepigenetic smoking score.
Paternal and maternal ever smoking was defined as having smoked at least 100 cigarettes,
pipes, or cigars before the child's diagnosis. Additional dichotomous variables only account for
whether the mother or father smoked at all during the time period described. Child postnatal
passive smoking was measured by child secondhand smoking from either parent, or from other
persons aside from parents who smoked indoors, in order to show the presence of a regular
smoker (e.g. the mother, father, or other individual) in the household up to the child’s third
birthday or ALL diagnosis (whichever came first). Cumulative tobacco exposures were
calculated from four binary exposures: paternal smoking during preconception, maternal
smoking during preconception, maternal smoking during pregnancy, and child's postnatal
passive smoking. Parental continuous smoking exposures were measured by number of
cigarettes, pipes, or cigars per day (CPD) in 5-unit increments. All linear regression models were
adjusted for cell type heterogeneity and genetic ancestry. AHRR CpG cg05575921 beta value
was multiplied by -10. Linear regression models were fitted for each smoking exposure variable
in the 450K dataset and EPIC dataset. Panels show results from linear regression models for the
outcome variable DNA methylation at the AHRR CpG cg05575921 (left) or from models for the
outcome variable polyepigenetic smoking score (right). Centers of points and horizontal bars
indicate point estimates and 95% confidence intervals. (A) Results from linear regression
models adjusted for cell type heterogeneity and genetic ancestry only. (B) Independent effects
from linear regression models additionally mutually adjusted for paternal and maternal
smoking variables. (C) The joint exposures of prenatal and postnatal tobacco smoking from
linear regression models for newly derived variables of paternal smoke preconception plus
maternal prenatal smoking, maternal prenatal smoking plus child postnatal passive smoking,
and paternal smoke preconception plus child postnatal passive smoking. Reference group:
cases who were unexposed to both exposures that make up the joint effect.
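The exposure coding described in this legend can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes the cumulative exposure is the simple count of the four binary exposures (consistent with the 0-4 categories in Table 2.6) and does not address missing interview data.

```python
def cumulative_exposure(paternal_preconception, maternal_preconception,
                        maternal_pregnancy, child_postnatal_passive):
    """Count of the four binary exposures (0 to 4); summation is assumed."""
    return sum(bool(x) for x in (paternal_preconception,
                                 maternal_preconception,
                                 maternal_pregnancy,
                                 child_postnatal_passive))

def joint_exposure(first, second):
    """Four-level joint label such as 'Yes_No'; 'No_No' is the reference."""
    return f"{'Yes' if first else 'No'}_{'Yes' if second else 'No'}"

def cpd_in_5_unit_increments(cigarettes_per_day):
    """Rescale cigarettes, pipes, or cigars per day to 5-unit increments."""
    return cigarettes_per_day / 5

def flipped_ahrr(beta):
    """Multiply the AHRR cg05575921 beta value by -10, so that higher values
    correspond to lower methylation and one unit equals a 0.1 beta change."""
    return beta * -10

# Hypothetical case: father smoked preconception, child exposed postnatally.
print(cumulative_exposure(True, False, False, True),
      joint_exposure(True, True),
      cpd_in_5_unit_increments(10),
      flipped_ahrr(0.82))
```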
Figure 2.8. Forest plots showing meta-analysis results of the association between epigenetic
biomarkers of prenatal tobacco smoke exposure and gene deletion frequency in B-ALL cases.
The panels include Poisson regression results for the association between deletion numbers
and DNA methylation at the AHRR CpG cg05575921 (top), the polyepigenetic smoking score
(middle), and the polyepigenetic smoking score excluding AHRR CpG cg05575921 (bottom).
Ratios of means (RMs) were calculated for every 0.1 beta value decrease of cg05575921 and
every 4-unit increase of polyepigenetic smoking score. All Poisson regression models were
adjusted for cell type heterogeneity and genetic ancestry. Models with exposure variable DNA
methylation at the AHRR CpG cg05575921 were additionally adjusted for methyl-QTL SNP
genotypes (rs148405299 in the 450K dataset and rs77111113 in the EPIC dataset). Centers of
squares and horizontal bars through each indicate point estimates and 95% confidence
intervals (CI) of individual set RM. Areas of squares indicate relative weights of individual sets.
Vertical apices of diamonds and horizontal bars through each indicate summary RM and 95% CI.
Relative weights (%) (proportional to the reciprocal of the sampling variance of the individual
set) of two sets, RM, sRM, and 95% CI were summarized in the right panel.
Figure 2.9. Forest plots showing meta-analysis results of the association between epigenetic
biomarkers of prenatal tobacco smoke exposure and gene deletion frequency in B-ALL cases
among self-reported (A) Latinos (n = 115 in 450K; n = 113 in EPIC) and (B) non-Latino Whites (n
= 54 in 450K; n = 48 in EPIC).
The panels include Poisson regression results for the association between deletion numbers
and DNA methylation at the AHRR CpG cg05575921 (top), the polyepigenetic smoking score
(middle), and the polyepigenetic smoking score excluding AHRR CpG cg05575921 (bottom).
Ratios of means (RMs) were calculated for every 0.1 beta value decrease of cg05575921 and
every 4-unit increase of polyepigenetic smoking score. All Poisson regression models were
adjusted for cell type heterogeneity and genetic ancestry. Models with exposure variable DNA
methylation at the AHRR CpG cg05575921 were additionally adjusted for methyl-QTL SNP
genotypes (rs148405299 in the 450K dataset and rs77111113 in the EPIC dataset). Centers of
squares and horizontal bars through each indicate point estimates and 95% confidence
intervals (CI) of individual set RM. Areas of squares indicate relative weights of individual sets.
Vertical apices of diamonds and horizontal bars through each indicate summary RM and 95% CI.
Relative weights (%) of two sets, RM, sRM, and 95% CI were summarized in the right panel.
Figure 2.10. Forest plots showing meta-analysis results of the association between epigenetic
biomarkers of prenatal tobacco smoke exposure and gene deletion frequency in B-ALL cases by
age at diagnosis: (A) diagnosed no earlier than 2 years of age (n = 178 in 450K; n = 268 in EPIC), (B)
diagnosed 0 to 5 years of age (n = 110 in 450K; n = 172 in EPIC), and (C) diagnosed >5 years of
age (n = 88 in 450K; n = 112 in EPIC).
The panels include Poisson regression results for the association between deletion numbers
and DNA methylation at the AHRR CpG cg05575921 (top), the polyepigenetic smoking score
(middle), and the polyepigenetic smoking score excluding AHRR CpG cg05575921 (bottom).
Ratios of means (RMs) were calculated for every 0.1 beta value decrease of cg05575921 and
every 4-unit increase of polyepigenetic smoking score. All Poisson regression models were
adjusted for cell type heterogeneity and genetic ancestry. Models with exposure variable DNA
methylation at the AHRR CpG cg05575921 were additionally adjusted for methyl-QTL SNP
genotypes (rs148405299 in the 450K dataset and rs77111113 in the EPIC dataset). Centers of
squares and horizontal bars through each indicate point estimates and 95% confidence
intervals (CI) of individual set RM. Areas of squares indicate relative weights of individual sets.
Vertical apices of diamonds and horizontal bars through each indicate summary RM and 95% CI.
Relative weights (%) of two sets, RM, sRM, and 95% CI were summarized in the right panel.
Chapter 3: Investigating DNA Methylation as a Mediator of Genetic Risk in Childhood Acute
Lymphoblastic Leukemia
Abstract
Genome-wide association studies have identified a growing number of single nucleotide
polymorphisms (SNPs) associated with childhood acute lymphoblastic leukemia (ALL), yet the
functional roles of most SNPs are unclear. Multiple lines of evidence suggest epigenetic
mechanisms may mediate the impact of heritable genetic variation on phenotypes. Here, we
investigated whether DNA methylation mediates the effect of genetic risk loci for childhood
ALL. We performed an epigenome-wide association study (EWAS) including 808 childhood ALL
cases and 919 controls from California-based studies using neonatal blood DNA. For
differentially methylated CpG positions (DMPs), we next conducted association analysis with 23
known ALL risk SNPs followed by causal mediation analyses addressing the significant SNP-DMP
pairs. DNA methylation at CpG cg01139861, in the promoter region of IKZF1, mediated the
effects of the intronic IKZF1 risk SNP rs78396808, with the average causal mediation effect
(ACME) explaining ~30% of the total effect (ACME P=0.0031). In analyses stratified by self-
reported race/ethnicity, the mediation effect was only significant in Latinos, explaining ~41% of
the total effect of rs78396808 on ALL risk (ACME P=0.0037). Conditional analyses confirmed the
presence of at least three independent genetic risk loci for childhood ALL at IKZF1, with
rs78396808 unique to non-European populations. We also demonstrated that the most
significant DMP in the EWAS, CpG cg13344587 at gene ARID5B (P=8.61x10^-10), was entirely
confounded by the ARID5B ALL risk SNP rs7090445. Our findings provide new insights into the
functional pathways of ALL risk SNPs and the DNA methylation differences associated with risk
of childhood ALL.
Introduction
Acute lymphoblastic leukemia (ALL) is characterized by the uncontrolled proliferation of
immature lymphocytes in the bone marrow and is the most common childhood cancer [4].
Although current treatment protocols result in an overall survival rate that exceeds 90% in
childhood ALL patients in the US [68], long-term survivors experience significant adverse effects
from therapy, including subsequent neoplasms, chronic health conditions, and premature
mortality [101]. Understanding the causes of childhood ALL, therefore, remains essential.
Genome-wide association studies (GWAS) have identified single nucleotide polymorphisms
(SNPs) associated with ALL risk in genes involved in hematopoiesis and B-cell development,
including ARID5B, IKZF1, CEBPE, GATA3, IKZF3, ERG, and BMI1 [15,16,21,28,234,235]; however, most of
the top associated SNPs are located in non-coding regions of the genome, leaving the
mechanisms through which they contribute to ALL etiology unclear.
Epigenetic modifications are well-recognized drivers of oncogenesis [236]. As one of the
components of the epigenetic machinery, DNA methylation contributes to cancer etiology and
progression through various mechanisms [237]; for instance, DNA hypermethylation at gene
promoters can silence tumor suppressors and other cancer-related genes [238], whereas broad
regions of DNA hypomethylation are associated with genomic instability [239]. Furthermore, most
cancers harbor genetic abnormalities that modify DNA methylation, resulting in widespread
changes in gene expression [238,240–242]. Several studies have reported that DNA methylation
mediates the heritable genetic impact on complex diseases, such as rheumatoid arthritis,
chronic obstructive pulmonary disease, and prostate cancer [243–245]. However, although it has
been reported that aberrant epigenetic modifications serve pivotal roles in leukemogenesis in
childhood ALL [246,247], no study to our knowledge has been conducted to explore how DNA
methylation modifications may function downstream of genetic risk pathways of childhood ALL.
Epigenome-wide association studies (EWASs) testing DNA methylation differences
between childhood ALL cases and controls at birth may also pinpoint differentially methylated
CpG positions (DMPs) involved in the development of childhood ALL, yet studies conducted so
far have mainly focused on DNA methylation changes in diagnostic leukemia samples [248–251].
Here, we performed an EWAS of ALL including 808 childhood ALL cases and 919 controls from
two ancestrally diverse independent California-based studies, to identify ALL-associated DMPs,
and then tested the association of significant DMPs with known ALL risk SNPs and assessed
whether DNA methylation at these CpG probes may mediate the effects of the genetic risk loci.
Materials and Methods
Study participants
Study participants were included from two independent California-based case-control
studies of childhood leukemia, the California Childhood Leukemia Study (CCLS) and the
Childhood Cancer Records Linkage Project (CCRLP), the details of which have been described
previously [48,235]. Briefly, the CCLS is a population-based case-control study conducted from 1995
to 2015 in multiple counties across California. Cases were identified within 72 hours after
diagnosis at hospitals and were eligible for participation if they met all the following criteria: (1)
age younger than 15 years, (2) without previous cancer diagnosis, (3) residence in California at
the time of diagnosis, and (4) having an English or Spanish-speaking biological parent or
guardian. Controls were randomly selected with similar eligibility criteria using birth certificates.
One or two controls were matched to each case on the date of birth, sex, and race/ethnicity.
The CCRLP linked statewide birth certificates from the California Office of Vital Records (for
1978-2009) to statewide cancer diagnosis data from the California Cancer Registry (1988-2011).
Cases were children born in California and diagnosed with their first primary ALL at 0-15 years.
Potential controls were children born in California during the same period without prior reports
of childhood cancer. Up to four controls were randomly selected and matched to each case on
the date of birth, sex, and race/ethnicity. Dried bloodspot (DBS) samples were obtained from the California
Biobank Program for all participants. Cases (N=808) and controls (N=919) with available
genome-wide DNA methylation data were included in the EWAS. Analyses for identifying
methylation quantitative trait loci (mQTL) and mediation effects were limited to 683 cases and
804 controls with both genome-wide DNA methylation and SNP array data available (therefore,
matching between cases and controls was broken for all analyses).
DNA methylation arrays
DNA samples were isolated from newborn dried bloodspot (DBS) for 850 ALL cases and
931 cancer-free controls, bisulfite converted, and then assayed on either the Illumina®
HumanMethylation450 BeadChip (450K) DNA methylation arrays or Illumina® Infinium
MethylationEPIC BeadChip (EPIC) arrays, as previously described [205,252,253]. EPIC arrays
include >850,000 CpG probes, comprising >90% of CpGs on the 450K array plus an additional
413,743 CpGs. CpG beta values were normalized to remove batch effects according to the
approach by Fortin et al. [206]. Mean detection P values were calculated by using the “detectionP”
function in the minfi [208] package through the Bioconductor project [209,210]. Functional
normalization was performed with “noob” background correction [207] by using the
“preprocessFunnorm” function in the minfi package. The beta-mixture quantile normalization
method was additionally applied [254]. Samples with mean detection P value >0.01 were
considered poor quality and were removed from the analysis. CpG sites and samples that had
over 15% missing values were removed. The R package “conumee” [255] was used to generate
copy-number variation plots to detect constitutive trisomy of chromosome 21 (T21), as
previously described, and a total of 27 ALL cases and 1 control with T21 were excluded from
subsequent analyses given the profound effects of Down syndrome on DNA methylation [253].
Identification of differentially methylated positions (DMPs)
We first performed EWAS analyses separately in the CCLS 450K, CCLS EPIC, and CCRLP
EPIC datasets to identify ALL-associated DMPs on autosomal chromosomes. Probes with
common SNPs (minor allele frequency ≥0.05) in the full capture sequence or with SNPs in the
targeted CpG site or its single base extension were removed [256,257]. To minimize false-positive
findings, we additionally removed cross-reactive probes identified previously [227,258,259]. We
fitted a logistic regression model predicting ALL case/control status as a function of DNA
methylation at each remaining CpG, adjusting for sex, batch effect, cell type heterogeneity
using the first ten principal components derived from ReFACTor [213], and genetic ancestry using
the first ten principal components derived from EPISTRUCTURE [214]. Fixed-effect meta-analysis
models were used to generate summary effect estimates for the EWAS results of the CpGs
overlapping both 450K and EPIC arrays from three different study sets (CCLS 450K, CCLS EPIC,
and CCRLP EPIC datasets) using the R package “metafor” [216]. In addition, we performed a
second meta-analysis for CpGs on EPIC arrays limited to the CCLS and CCRLP EPIC datasets only,
which included nearly twice the number of CpGs tested in the three datasets. The associations
between CpGs and ALL were corrected for multiple testing using a stringent Bonferroni-adjusted
threshold of 0.05 divided by the number of CpGs. In addition, given that previous
studies have applied a more liberal threshold (P<0.001) to identify DMPs for downstream
mediation and interaction analyses [260,261], here we applied a lenient threshold of P<1 x 10^-4 to
ensure sufficient numbers of candidate CpGs for subsequent mQTL and mediation analyses.
Manhattan plots and QQ plots were generated using the R package “CMplot” [262].
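As an arithmetic check of the Bonferroni rule described above, the following minimal sketch divides 0.05 by the number of CpGs tested. The CpG counts are those reported in the Results for the two meta-analyses; the function name is hypothetical and is not part of any package used in this study.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Return the per-test significance threshold under Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# CpG counts reported in the Results for the two EWAS meta-analyses.
overall = bonferroni_threshold(0.05, 363_973)    # 450K + EPIC overlap
epic_only = bonferroni_threshold(0.05, 703_253)  # EPIC arrays only

print(f"{overall:.2e}")    # 1.37e-07
print(f"{epic_only:.2e}")  # 7.11e-08
```

Both values agree with the thresholds quoted in the Results.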
Genotyping
Genome-wide SNP array data were available from constitutive DNA samples isolated
from newborn DBS for a subset of 683 cases and 804 controls. CCLS and CCRLP samples were
genotyped using the Illumina Human OmniExpress V1 platform and the Affymetrix Axiom World
(Latino) Array [235], respectively. Genotype data for 23 SNPs previously associated with childhood
ALL were included from our recent multi-ancestry GWAS meta-analysis [263].
Identification of methylation quantitative trait loci (mQTL)
We carried out mQTL analyses to identify genotype-dependent DMPs associated with
childhood ALL risk using the R package “Matrix eQTL” [264]. DMPs with P<1 x 10^-4 from the overall
meta-analysis or the EPIC array meta-analysis were tested for association with genotypes of the
23 ALL risk SNPs. We fitted an additive linear regression model predicting methylation at each
CpG site as a function of SNP genotype (coded 0, 1, and 2), adjusting for the same covariates as
for the EWAS. The associations between SNP genotypes and DMP DNA methylation were
corrected for multiple testing using a stringent Bonferroni-adjusted threshold of 0.05/(number
of DMPs × number of SNPs), and a Benjamini-Hochberg false discovery rate (FDR) of 0.05. The
mQTL analysis was first conducted separately in the CCLS 450K, CCLS EPIC, and CCRLP EPIC
datasets, and the results were subsequently meta-analyzed across all three datasets, and across
the two EPIC datasets, in fixed-effect meta-analysis models using “metafor”.
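The fixed-effect meta-analysis step used throughout these analyses can be illustrated with a minimal inverse-variance sketch. This is not the “metafor” implementation, only the standard inverse-variance weighting formula written out by hand; the function name and the per-dataset estimates below are hypothetical placeholders rather than study results.

```python
import math


def fixed_effect_meta(estimates, std_errors):
    """Inverse-variance weighted fixed-effect summary of per-dataset estimates.

    Returns (pooled estimate, pooled standard error, two-sided P value
    from a normal approximation).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P value
    return pooled, pooled_se, p_value


# Hypothetical regression coefficients from three datasets (not study data).
betas = [0.20, 0.10, 0.15]
ses = [0.10, 0.05, 0.08]
pooled, pooled_se, p = fixed_effect_meta(betas, ses)
print(round(pooled, 4), round(pooled_se, 4))
```

Datasets with smaller standard errors receive larger weights, so the pooled estimate is pulled toward the most precisely estimated dataset.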
Mediation analysis
We next performed model-based causal mediation analyses for the significant mQTL-DMP
pairs, using the “mediation” R package [265]. First, we specified two statistical models, (1)
the mediator model for the distribution of the DMP methylation level, after conditioning on the
genotype of the mQTL and covariates including sex, ancestry, batch effect, and cell type
heterogeneity, and (2) the outcome model for the conditional distribution of ALL status, given
the mQTL genotype, DMP methylation level, and the same covariates. Models were fitted
separately, and then their fitted parameters were used as the main inputs to the mediate
function, which computes the estimated average causal mediation effect (ACME), the average
direct effect (ADE), and the total effect. Variances were estimated based on simulation. The
quasi-Bayesian Monte Carlo simulation based on normal approximation was conducted 1000
times. Alternatively, an approach based on nonparametric bootstrap was also applied to
estimate variance for validation. Models with the IKZF1 SNP rs78396808 were additionally
adjusted for the IKZF1 lead SNP rs10230978, as rs78396808 was previously reported as a
secondary ALL association signal in analysis conditioned on rs10230978 [263]. Results of the
mediation analysis performed separately for the CCLS 450K, CCLS EPIC, and CCRLP EPIC datasets
were summarized across all three datasets and across the two EPIC datasets using the fixed-effect
meta-analysis model.
We repeated the mQTL analyses and the causal mediation analyses stratified by self-
reported race/ethnicity (Latinos vs. non-Latino whites, and Latinos vs. non-Latinos [i.e., non-
Latino whites plus non-Latino others]), in study participants with available genome-wide DNA
methylation and SNP array data. We included race/ethnicity as a moderator variable in fixed-
effect meta-analysis models to test for heterogeneity.
Locus-specific analyses
To further disentangle the mQTL-DMP-ALL associations analyzed in the causal mediation
analysis, we investigated whether there was a confounding effect from the mQTL genotype on
the association between DNA methylation and ALL risk by fitting three unconditional logistic
regression models in subjects with both DNA methylation and genotype data available: model 1
was a logistic regression predicting ALL risk as a function of DMP DNA methylation, adjusting for
sex, batch effect, cell type heterogeneity, and genetic ancestry; model 2 was additionally
adjusted for the mQTL genotype; and model 3 was model 1 fitted in individuals without any
copies of the mQTL risk allele. The odds ratios for each 0.1 beta-value increase in DNA
methylation in models 1 and 2 were considered as the “crude effect” and the “adjusted effect”,
respectively. A difference of >10% between the “adjusted effect” and the “crude effect” was taken as
evidence of confounding [266]. In addition, a nonsignificant coefficient in model 3 was interpreted as indicating that
the association between DNA methylation and ALL risk is entirely confounded by the SNP
effect.
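The change-in-estimate criterion can be sketched as follows. The text does not state the scale on which the percent change was computed; this sketch assumes the odds ratio scale (a log odds ratio scale is another common choice and would give different percentages). The function names and the odds ratios below are hypothetical.

```python
def percent_change(crude_or: float, adjusted_or: float) -> float:
    """Relative change (%) of the adjusted odds ratio from the crude odds ratio.

    Assumption: the change is measured on the odds ratio scale.
    """
    return abs(adjusted_or - crude_or) / crude_or * 100.0


def suggests_confounding(crude_or: float, adjusted_or: float,
                         cutoff: float = 10.0) -> bool:
    """Apply the >10% change-in-estimate rule of thumb."""
    return percent_change(crude_or, adjusted_or) > cutoff


# Hypothetical odds ratios, not study results.
print(suggests_confounding(1.50, 1.20))  # True: a 20% change exceeds the cutoff
print(suggests_confounding(1.50, 1.45))  # False: about a 3% change
```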
We also performed gene-specific analysis to investigate additional CpGs in the
neighboring region of the mediator CpG that could also mediate the effect of mQTL on ALL
susceptibility. First, we conducted the overall and EPIC array fixed-effect meta-analyses for the
gene-specific DMP association testing, using a relaxed significance level of 0.05 to ensure
sufficient CpGs would be included in the following analysis. Next, significant SNP-CpG
associations were identified through the mQTL analysis, with a Bonferroni-adjusted significance
level of 0.05/number of CpGs. Finally, causal mediation analysis was performed to identify
mediators.
DNA methylation and gene expression analysis
DNA methylation and gene expression data were available from diagnostic leukemia
(tumor) samples of 71 ALL cases in the CCLS [251]. DNA methylation data from 450K arrays were
processed as described above. Genome-wide gene expression data were generated using the
GeneChip Human Gene 1.0 ST Array (Affymetrix, Santa Clara, CA), as previously described [251].
Copy-number at 8 commonly deleted gene regions in childhood ALL was assayed in 60 out of 71
ALL tumors using multiplex ligation-dependent probe amplification (MLPA), as previously
described [49,204]. For DMPs found to be mediators between SNPs and ALL risk, we analyzed the
associations between DNA methylation and the expression levels of nearby genes using
Spearman correlation coefficient tests, and we further limited the correlation tests to cases
without copy number deletion at the corresponding gene regions (if available) to address
potential confounding.
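For readers unfamiliar with the Spearman coefficient used here, it is the Pearson correlation computed on ranks. The sketch below is a plain-Python illustration and not the software used in the study; the function names and the five paired values are hypothetical, and no P value is computed.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical methylation beta values and expression levels for five samples.
methylation = [0.10, 0.20, 0.30, 0.40, 0.50]
expression = [9.1, 8.7, 8.9, 7.5, 7.0]
rho = spearman_rho(methylation, expression)
print(round(rho, 2))  # -0.9
```

A negative coefficient, as in this toy example, corresponds to higher methylation accompanying lower expression.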
Results
There were 850 ALL cases and 931 cancer-free controls from the CCLS and the CCRLP
that had DNA samples from DBS assayed on either the 450K DNA methylation arrays or EPIC
arrays. After excluding subjects with trisomy 21 (27 cases, 1 control), we included in the EWAS a
total of 808 childhood ALL cases and 919 cancer-free controls that passed DNA methylation
quality control. Demographic characteristics of these subjects are summarized in Table 3.1, and
the study design is illustrated in Figure 3.1. Over half of the study participants were males, and
overall 53.7% were self-reported Latino and 31.4% were non-Latino white, with approximately
equal distributions among cases and controls across the CCLS 450K, CCLS EPIC, and CCRLP EPIC
datasets. Demographic characteristics of the subset of 683 childhood ALL cases and 804
controls included in the mQTL and mediation analyses were similar to the overall dataset (Table
3.2).
Differentially methylated positions
A total of 363,973 and 703,253 CpGs were included in the overall EWAS fixed-effect
meta-analysis (CCLS 450K, CCLS EPIC, and CCRLP EPIC datasets) and the EPIC array meta-
analysis (CCLS EPIC and CCRLP EPIC datasets), respectively (Figure 3.1). The number of CpGs
meeting each probe filtering criterion is summarized in Table 3.3. CpG cg13344587, in an
intronic region of the ALL risk gene ARID5B, was significantly differentially methylated by ALL
case/control status (P=8.61x10^-10) after adjusting for multiple testing using a stringent
Bonferroni correction in the overall meta-analysis (P threshold: 0.05/363,973=1.37x10^-7)
(Figures 3.2A and 3.2B) and the EPIC array meta-analysis (0.05/703,253=7.11x10^-8) (Figures 3.2C
and 3.2D). Using a less stringent threshold of P<1x10^-4 for the purposes of identifying candidate
CpGs for downstream mediation analyses [260,261], we found 47 DMPs for ALL overlapping both
450K and EPIC arrays from the overall meta-analysis and 90 DMPs from the EPIC array meta-
analysis (Figures 3.2A and 3.2C), the latter of which included 51 DMPs on EPIC arrays only and
39 DMPs overlapping both 450K and EPIC arrays. Of the latter, 33/39 DMPs were not significant
in the overall meta-analysis and, thus, were likely to be false positives and were not included in
subsequent analyses, leaving 57 DMPs for the EPIC array meta-analysis (Figure 3.1).
Methylation quantitative trait loci
The 23 childhood ALL risk SNPs identified from our recent multi-ancestry GWAS meta-analysis
were included in the mQTL analysis [263] (Table 3.4). These SNPs overlap 19 genomic loci,
and include 4 secondary associations discovered in conditional analysis adjusting for the lead
SNP at IKZF1, CDKN2A, CEBPE, and IKZF3. They were analyzed for association with the 47 DMPs
separately in the CCLS 450K, CCLS EPIC, and CCRLP EPIC datasets, and with the 57 DMPs
separately in the two EPIC datasets (Figure 3.1). Two SNP-DMP pairs in cis passed the
Bonferroni corrected threshold in the meta-analysis of three datasets (P<4.63x10^-5
[0.05/(23×47)]): 1) SNP rs7090445 and DMP cg13344587 at gene ARID5B, and 2) SNP
rs78396808 and DMP cg01139861 at gene IKZF1 (Tables 3.5 and 3.6). The ARID5B SNP-DMP
pair was also identified in the EPIC array meta-analysis (P<3.81x10^-5 [0.05/(23×57)]). One
additional SNP-DMP association that was in trans between the intergenic SNP rs9376090 near
MYB/HBS1L on chromosome 6 and the CpG cg25722431 on chromosome 7 survived the FDR
correction (FDR<0.05) (Tables 3.5 and 3.6).
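The two multiple-testing rules applied to the SNP-DMP tests can be sketched as below. The Bonferroni counts (23 SNPs; 47 or 57 DMPs) are taken from the text; the Benjamini-Hochberg function is a generic step-up implementation written for illustration, and the P values passed to it are hypothetical, not study results. Function names are likewise hypothetical.

```python
def mqtl_bonferroni_threshold(alpha: float, n_dmps: int, n_snps: int) -> float:
    """Bonferroni threshold of alpha / (number of DMPs x number of SNPs)."""
    return alpha / (n_dmps * n_snps)


def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking tests that pass the BH step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * fdr.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    passed = [False] * m
    # Every test ranked at or below max_rank is declared significant.
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            passed[idx] = True
    return passed


overall_thr = mqtl_bonferroni_threshold(0.05, 47, 23)
epic_thr = mqtl_bonferroni_threshold(0.05, 57, 23)
print(f"{overall_thr:.2e}")  # 4.63e-05
print(f"{epic_thr:.2e}")     # 3.81e-05

# Hypothetical P values: the third passes its own cutoff, so the first two
# pass as well under the step-up rule.
print(benjamini_hochberg([0.02, 0.03, 0.035, 0.5]))  # [True, True, True, False]
```

The step-up behavior explains how an association can survive FDR correction without meeting the stricter Bonferroni threshold.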
Mediation analysis
Causal mediation analyses were performed for the significant ARID5B and IKZF1 mQTL-
DMP pairs with ALL risk separately in the CCLS 450K, CCLS EPIC, and CCRLP EPIC datasets, and
for the same ARID5B pair and the significant trans mQTL-DMP pair with ALL separately in the
two EPIC datasets (Figure 3.1). The total effect, ADE and ACME estimated for each mQTL-DMP
pair were summarized across all three datasets and across the two EPIC datasets.
SNP rs7090445 at gene ARID5B had a significant total effect (estimate=0.108, P=5.17x10^-15)
and direct effect (ADE estimate=0.093, P=1.06x10^-4) but a nonsignificant causal mediation
effect (ACME P=0.223) on ALL risk through altering DNA methylation at the ARID5B CpG
cg13344587 in the overall meta-analysis (Figure 3.3A), demonstrating that the effect of
rs7090445 on ALL risk was independent of cg13344587.
In contrast, we found that IKZF1 SNP rs78396808 had a significant total effect on ALL
risk (estimate=0.056, P=0.010), a significant mediation effect through increasing DNA
methylation at the IKZF1 CpG cg01139861 (ACME estimate=0.017, P=0.003) and a
nonsignificant direct effect (estimate=0.039, P=0.077), with the ACME explaining ~30%
(0.017/0.056) of the total effect (Table 3.7 and Figure 3.3B). After conditioning on the lead
IKZF1 SNP rs10230978, we observed a stronger total effect (estimate=0.089, P=1.24x10^-4) and
direct effect (estimate=0.073, P=0.002) of rs78396808 on ALL risk, and a similar mediation
effect through cg01139861 (estimate=0.016, P=0.006) (Table 3.7 and Figure 3.3B). These results
indicate that the IKZF1 SNP rs78396808 conferred risk for ALL both through cg01139861 and
independent of cg01139861. The associations between DNA methylation at cg01139861,
genotypes of rs78396808, and ALL risk in each study set are illustrated in Figure 3.4A. The IKZF1
SNP rs78396808 risk allele (A) increased DNA methylation at cg01139861 (Table 3.6 and Figure
3.4A), and the increased DNA methylation level at cg01139861 was associated with increased
ALL risk (Table 3.5 and Figure 3.4A).
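The proportion of the total effect explained by the ACME is a simple ratio, sketched below with the estimates reported above for rs78396808 and cg01139861 in the overall meta-analysis. The sketch also assumes the usual decomposition in which the total effect is the sum of the ACME and the ADE; the function name is hypothetical.

```python
def proportion_mediated(acme: float, total_effect: float) -> float:
    """Fraction of the total effect that is explained by the mediated path."""
    if total_effect == 0:
        raise ValueError("total effect must be nonzero")
    return acme / total_effect


# Estimates reported in the text (overall meta-analysis, not conditioned).
acme, total = 0.017, 0.056
prop = proportion_mediated(acme, total)
implied_direct = total - acme  # assumes total effect = ACME + ADE

print(f"{prop:.0%}")             # 30%
print(round(implied_direct, 3))  # 0.039, the reported direct effect
```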
In the EPIC array meta-analysis, although there was a significant protective mediation
effect of rs9376090 (MYB/HBS1L) on ALL risk through cg25722431 at gene SMURF1
(estimate=-0.008, P=0.026), the total and direct effects were both nonsignificant (Table 3.7), suggesting
that the effect of SNP rs9376090 on ALL risk was relatively small and a larger sample size is
required to make a robust estimate in the mediation analysis for SNP rs9376090.
Results from causal mediation analyses with variance estimation based on the
nonparametric bootstrap method (Table 3.8) were similar to those from the quasi-Bayesian
Monte Carlo simulation (Table 3.7).
Race/ethnicity stratified analyses
Significant SNP-DMP pairs in overall participants had similar effect estimates in self-
reported Latinos, non-Latino whites, and non-Latinos (non-Latino whites plus non-Latino
others) in stratified mQTL analysis, with no significant heterogeneity by race/ethnicity in tests
of moderators in fixed-effect meta-analyses (Table 3.9). However, SNP rs78396808 at gene
IKZF1 was associated with cg01139861 in overall participants, in Latinos, and in non-Latinos
(non-Latino whites plus non-Latino others) but was not significantly associated with
cg01139861 in non-Latino whites. SNP rs78396808 is almost monomorphic for the non-risk
allele G in European populations in the Genome Aggregation Database, whereas the risk allele
A frequency is ~20% in East Asian and Admixed American populations [267]. We found the GA
genotype among 11 individuals who were self-reported non-Latino white (Figure 3.4A), likely
due to admixture, which may result in the slightly higher effect estimate and standard error for
the association between rs78396808 and cg01139861 in non-Latino whites than in Latinos
(Table 3.9).
Next, we conducted causal mediation analyses separately in Latinos, non-Latino whites,
and non-Latinos, for those significant SNP-DMP pairs identified in overall participants. We
found a significant total effect and a significant mediation effect for rs78396808(IKZF1)-
cg01139861(IKZF1)-ALL in the overall meta-analysis in Latinos, with the proportion of the total effect explained by
the ACME being ~42% (0.022/0.053) and ~28% (0.022/0.078) before and after conditioning on
the top IKZF1 SNP rs10230978, respectively (Table 3.10), both of which were higher than those observed in
overall participants. We found no significant heterogeneity in the ACME for rs78396808(IKZF1)-
cg01139861(IKZF1)-ALL between Latinos and non-Latino whites (P_het=0.441) or between Latinos
and non-Latinos (P_het=0.236) (Table 3.10), likely due to lack of power given the low allele
frequency of rs78396808 in non-Latino whites.
ARID5B CpG cg13344587 is a proxy for ALL risk SNP rs7090445
Results for both rs7090445(ARID5B)-cg13344587(ARID5B)-ALL and rs78396808(IKZF1)-
cg01139861(IKZF1)-ALL are summarized in Table 3.11. We obtained the “crude effect” of every
0.1 beta-value increase in ARID5B cg13344587 methylation on ALL risk in model 1 (OR_meta=0.44,
P=2.87x10^-10) and the “adjusted effect” in model 2 (OR_meta=0.80, P=0.249), with an 83%
reduction after adjusting for the SNP effect (Figure 3.3A), well above the 10% difference
used to identify the presence of confounding [266]. In addition, there was no longer a significant
association between DNA methylation at cg13344587 and ALL risk, with a greatly reduced
effect estimate (OR_meta=0.99, P=0.970) in model 3 in individuals without any copies of the
rs7090445 risk allele. Therefore, the association between decreased DNA methylation at
cg13344587 and ALL risk is consistent with confounding by SNP rs7090445. In contrast, we
observed only an ~4% decrease in the effect on ALL risk from CpG cg01139861 at IKZF1 after
adjusting for rs78396808, and the effect remained significant in individuals without any copies
of the rs78396808 risk allele (OR_meta=1.45, P=0.004). These results showed no evidence of a
confounding effect from rs78396808 on the association between IKZF1 cg01139861 and ALL
risk.
IKZF1 gene-specific analysis
We investigated additional IKZF1 CpGs that might mediate the effects of SNP
rs78396808 on ALL risk. There were 33 and 73 IKZF1 CpGs included in the overall meta-analysis
and the EPIC array meta-analysis, respectively. CpGs were tested for their association with ALL,
with cg01139861 the only one passing gene-wide significance (P<0.05/33), and cg16499656 was
an additional CpG with P<0.05 in the overall meta-analysis (Table 3.12). CpG cg12431065
passed the gene-wide significance threshold (0.05/73), and six additional CpGs had P<0.05 in
the EPIC array meta-analysis. Three CpGs overlapping both 450K and EPIC arrays were found to
be significantly associated with ALL risk in the EPIC array meta-analysis, one of which was not
significantly associated with ALL in the overall meta-analysis and thus was not included in
subsequent analyses. The CpG cg01139861 analyzed before was also not included in
subsequent analyses.
Since rs78396808 is monomorphic in European populations, we further conducted the
IKZF1 gene-specific mQTL and causal mediation analyses in Latinos only. IKZF1 CpGs
significantly associated with ALL were tested for their associations with SNP rs78396808 with a
p-value <0.05 or <0.01 (0.05/5) considered to show statistical significance in the overall meta-
analysis and the EPIC array meta-analysis, respectively. We identified one significant SNP-DMP
pair from the mQTL overall meta-analysis, and three significant SNP-DMP pairs from the mQTL
EPIC array meta-analysis (Table 3.13). We found a significant total effect from rs78396808 on
ALL (estimate=0.067, P=0.030) and a significant mediation effect from rs78396808
(estimate=0.019, P=0.031) through increasing DNA methylation at cg10551353 when
conditioning on rs10230978 in Latinos, with ~29% of the total effect explained by the ACME (Table
3.14). We also found in Latinos that DNA methylation at cg10551353 was strongly correlated
with cg01139861, the original IKZF1 CpG found to mediate the effect of rs78396808 on ALL risk
(CCRLP EPIC: R^2=0.41, P=3.19x10^-14; CCLS EPIC: R^2=0.45, P=8.23x10^-16) (Figure 3.5). CpG
cg01139861 is located in a CpG island in the promoter region of IKZF1, and SNP rs78396808 is in
an intronic region ~116Kb downstream (Table 3.5 and Figure 3.4B). The additional IKZF1 CpG
cg10551353 identified here is in the 5ʹ UTR or the TSS1500 region of several alternative
transcripts, ~14Kb downstream of cg01139861 (Table 3.12 and Figure 3.4B).
IKZF1 SNP rs78396808 is independent of another secondary association signal at SNP rs6421315
In a previous GWAS of ALL in individuals of European ancestry, Vijayakrishnan et al. [32]
identified a secondary signal at the IKZF1 locus at SNP rs6421315 after conditioning on their
lead SNP rs17133805. SNP rs17133805 is in nearly perfect linkage disequilibrium (LD) (R^2=0.999)
with the lead IKZF1 SNP rs10230978 in our multi-ancestry GWAS meta-analysis of ALL [263]. Here,
we explored whether the genotypes of rs6421315 and rs78396808, the secondary association
signal at IKZF1 in our multi-ancestry ALL GWAS, were independently associated with ALL. The
two secondary hits rs78396808 and rs6421315 had weak LD in all populations (R^2=0.0188,
D’=0.4213) and in Admixed Americans (R^2=0.072, D’=0.5546) based on LDlink [211], and reside on
different sides of a recombination peak (Figure 3.6). Additionally, SNP rs78396808 remained
significantly associated with ALL in this study when conditioning on both rs10230978 and
rs6421315 (P_meta=0.007) (Table 3.15).
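The two LD measures quoted above can differ substantially when the allele frequencies of the two SNPs are unequal. The sketch below computes D, D’ and R^2 from standard population-genetic definitions; it is not LDlink code, the function name is hypothetical, and the haplotype and allele frequencies are hypothetical values chosen only for illustration.

```python
def ld_statistics(p_ab: float, p_a: float, p_b: float):
    """Return (D, |D'|, r^2) from the A-B haplotype frequency and allele frequencies."""
    d = p_ab - p_a * p_b
    # D' scales D by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Hypothetical frequencies: a rarer allele A (10%) and a more common allele B (30%).
d, d_prime, r2 = ld_statistics(p_ab=0.05, p_a=0.10, p_b=0.30)
print(round(d_prime, 3), round(r2, 3))  # 0.286 0.021
```

In this toy example D’ is moderate while R^2 is close to zero, the same qualitative pattern as the values reported for rs78396808 and rs6421315.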
Increased DNA methylation at IKZF1 CpG cg01139861 correlated with decreased IKZF1
expression
Finally, we tested the correlation for DNA methylation at cg01139861 (IKZF1) with gene
expression of nearby genes IKZF1, FIGNL1, and DDC using Spearman correlation coefficient
tests in 51 ALL tumor samples from CCLS. We excluded 11/71 samples without MLPA copy-
number data and 9/60 samples with one deleted copy of IKZF1. Increased DNA methylation at
the IKZF1 CpG cg01139861 significantly correlated with decreased gene expression of IKZF1
(R=-0.28, P=0.044; Figure 3.5).
Discussion
A role for genetic variation in the etiology of childhood ALL is well established, but little
is known regarding the association of epigenetic differences at birth and future development of
ALL. We report results from the largest neonatal DNA methylation EWAS of childhood ALL
performed to date, along with the first comprehensive mediation analysis investigating whether
DNA methylation mediates the effects of ALL genetic risk loci. We found that the IKZF1 SNP
rs78396808 risk allele conferred risk for ALL through increasing DNA methylation at the IKZF1
promoter CpG cg01139861. In addition, the ARID5B CpG cg13344587, the only ALL-associated
DMP to survive Bonferroni correction, appeared to be entirely confounded by the ARID5B ALL
risk SNP rs7090445.
A limited number of epigenetic studies have been conducted previously for childhood
ALL [248–251], with DNA methylation of cases profiled using bone marrow or peripheral blood
samples collected from ALL patients at diagnosis. Few studies have investigated differential
DNA methylation associated with subsequent development of ALL. We note a paucity of ALL-
associated CpGs in our EWAS, with the ARID5B CpG cg13344587 being the exception that
survived Bonferroni correction, which suggests that any prenatal environmental exposures with
moderate effects on ALL risk do not leave many signals in the DNA methylomes at birth. Using a
more lenient threshold of P<1 x 10^-4, we identified 47 and 57 DMPs mapped to 38 and 37 genes
from the overall meta-analysis and the EPIC array meta-analysis, respectively, with four genes
overlapping both results: ARID5B, HECA, NHLRC1, and XKR9. Altered DNA methylation at two of
these genes (NHLRC1 and XKR9) has been observed previously in ALL tumor samples [248,249,268],
and one of the remaining two genes, ARID5B, is an established ALL predisposition gene.
Intronic SNPs in ARID5B, which plays an important role in B-cell development, have
been associated with ALL risk in several GWAS [15,32,235]. Functional analysis has shown that the
ALL risk SNP rs7090445 disrupts binding of RUNX3 and leads to reduced ARID5B expression [269].
We found that the ARID5B SNP rs7090445-C risk allele was significantly associated with
decreased DNA methylation at the ARID5B CpG cg13344587, as previously reported in whole-blood
samples from cancer-free individuals [270,271]. Although the rs7090445-C risk allele showed
the strongest total effect and direct effect on ALL in the causal mediation analysis, the lack of
significant mediation effect through altering DNA methylation indicates that the impact of this
SNP on ALL risk is independent of cg13344587. Further, results from our analysis of
confounding effects supported that the association between decreased DNA methylation at
cg13344587 and ALL risk appears to be entirely driven by the ARID5B SNP rs7090445, and
hypomethylated cg13344587 may function merely as a strong proxy of the rs7090445-C risk
allele.
In contrast, we found that DNA methylation at CpG cg01139861, located in a CpG island
in the promoter region of IKZF1, mediates the effects of the IKZF1 SNP rs78396808, which we
recently reported as an independent ALL-association signal in analysis conditioned on the lead
IKZF1 SNP rs10230978 [263]. The causal mediation analysis in our overall dataset showed that the
rs78396808-A risk allele had a significant ACME on ALL risk through increasing DNA methylation
at cg01139861, explaining ~30% of the total effect on ALL. In analyses stratified by self-reported
race/ethnicity, the mediation effect for rs78396808 through increasing DNA methylation at
cg01139861 was only significant in Latinos, explaining ~42% of the total effect in this population.
The SNP rs78396808 is monomorphic in European populations, although the risk allele (A) is also
present in African and South Asian populations and has a ~20% frequency in East Asians [267];
however, we did not have sufficient samples to test for mediation effects in these population
groups. In addition, the IKZF1 SNP rs78396808 appears to be independent of another secondary
association signal at IKZF1, SNP rs6421315, previously identified in individuals of European
ancestry [13], supporting the existence of at least three independent common genetic risk signals
for ALL at IKZF1 across populations.
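The "proportion of the total effect explained" quoted above can be approximated as the ratio of the ACME to the total effect. The sketch below applies that ratio to the rounded estimates for rs78396808 and cg01139861 in Table 3.7 (overall) and Table 3.10 (Latinos). The function name is hypothetical, and the proportions reported in the text may have been obtained directly from the simulation draws rather than from these rounded values, so small discrepancies would be expected.

```python
def proportion_mediated(acme: float, total_effect: float) -> float:
    """Share of the total effect that passes through the mediator (ACME / total)."""
    if total_effect == 0:
        raise ValueError("total effect must be non-zero")
    return acme / total_effect


# Rounded estimates for rs78396808 -> cg01139861 -> ALL.
overall = proportion_mediated(acme=0.017, total_effect=0.056)  # Table 3.7
latinos = proportion_mediated(acme=0.022, total_effect=0.053)  # Table 3.10

print(f"Overall: {overall:.1%}")
print(f"Latinos: {latinos:.1%}")
```

The ratio is only interpretable when the ACME and the total effect have the same sign and the total effect is not close to zero, which is the case for both sets of estimates used here.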
Further, we found that increased DNA methylation at cg01139861 correlated with
decreased gene expression of IKZF1 in ALL tumor samples. IKZF1 encodes the lymphoid
transcription factor IKAROS and is essential for lymphocyte development and differentiation [272].
Somatic deletion of IKZF1 is a common driver event in ALL, particularly in BCR-ABL1-positive
ALL (95%) [273] and in high-risk B-cell ALL (30%) [274]. IKZF1 deletions are also enriched in
patients with relapsed childhood B-cell ALL (39.3%), in whom increased promoter methylation was
also found [247]. Taken together, the leukemogenic effects of the rs78396808-A risk allele may act
via downregulation of IKZF1 gene expression, partly through increased DNA methylation at the
IKZF1 CpG cg01139861. In our targeted analysis of CpGs across IKZF1, we identified one
additional CpG, cg10551353, in the promoter region of several transcripts, which showed
evidence of some mediation effect for rs78396808 on ALL risk. Increased DNA methylation at
cg10551353 was significantly correlated with increased DNA methylation at cg01139861;
however, cg10551353 is on the EPIC array only, so we could not assess its association with
IKZF1 gene expression using the tumor samples assayed on 450K arrays.
The current study has several strengths. First, we assayed pre-diagnostic DNA from
neonatal DBS on both genome-wide DNA methylation arrays and SNP arrays, which rules out
the possibility of reverse causality (i.e., effects of leukemia itself on DNA methylation). Second,
instead of using traditional mediation analysis approaches that rely on restrictive and
untested assumptions, we used a more general estimation framework that provides
distribution-free estimates for causal mediation effects and accommodates nonlinearities [265,275].
Moreover, we estimated the uncertainty of the causal mediation effects through
quasi-Bayesian Monte Carlo simulation, and we validated our results by using an alternative
simulation approach based on the nonparametric bootstrap. Last, over half of the participants
included in this study were self-reported Latinos, providing an opportunity for us to perform
analyses stratified by race/ethnicity, through which we detected a stronger mediation effect for
the IKZF1 SNP rs78396808 through cg01139861 in Latinos than in overall participants.
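To make the bootstrap idea mentioned above concrete, the following is a deliberately simplified sketch on simulated data. It is not the estimation framework used in this study: it assumes linear models for both the mediator and a continuous outcome, so the indirect effect reduces to a product of two regression slopes, and it ignores covariates. All variable names, effect sizes, and the number of resamples are hypothetical.

```python
import random
import statistics


def ols_slope(y, x):
    """Slope from a simple least-squares regression of y on x."""
    n = len(y)
    xb, yb = sum(x) / n, sum(y) / n
    sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    sxx = sum((a - xb) ** 2 for a in x)
    return sxy / sxx


def ols_two_predictors(y, x, m):
    """Slopes of y on x and m, solving the centered 2x2 normal equations."""
    n = len(y)
    yb, xb, mb = sum(y) / n, sum(x) / n, sum(m) / n
    sxx = sum((a - xb) ** 2 for a in x)
    smm = sum((a - mb) ** 2 for a in m)
    sxm = sum((a - xb) * (b - mb) for a, b in zip(x, m))
    sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    smy = sum((a - mb) * (b - yb) for a, b in zip(m, y))
    det = sxx * smm - sxm ** 2
    return (smm * sxy - sxm * smy) / det, (sxx * smy - sxm * sxy) / det


def mediation_effects(x, m, y):
    """Return (indirect effect, direct effect) under the linear-model assumption."""
    a = ols_slope(m, x)                      # exposure -> mediator
    direct, b = ols_two_predictors(y, x, m)  # exposure -> outcome, mediator -> outcome
    return a * b, direct


def bootstrap_se(x, m, y, n_boot=200, seed=1):
    """Standard error of the indirect effect from resampling subjects with replacement."""
    rng = random.Random(seed)
    n = len(x)
    draws = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        indirect, _ = mediation_effects(
            [x[i] for i in idx], [m[i] for i in idx], [y[i] for i in idx]
        )
        draws.append(indirect)
    return statistics.stdev(draws)


# Simulated toy data: x = risk-allele count, m = mediator, y = continuous outcome.
rng = random.Random(0)
n = 2000
x = [rng.choice([0, 1, 2]) for _ in range(n)]
m = [0.5 * xi + rng.gauss(0, 1) for xi in x]
y = [0.2 * xi + 0.3 * mi + rng.gauss(0, 1) for xi, mi in zip(x, m)]

acme_hat, ade_hat = mediation_effects(x, m, y)
acme_se = bootstrap_se(x, m, y)
print(f"indirect = {acme_hat:.3f} (bootstrap SE {acme_se:.3f}), direct = {ade_hat:.3f}")
```

In the simulation the true indirect effect is 0.5 × 0.3 = 0.15 and the true direct effect is 0.2. A binary outcome such as ALL status would require a nonlinear outcome model, which is where a potential-outcomes framework of the kind described above becomes necessary.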
Our study does have some limitations. One potential limitation was sample size, with
only the ARID5B CpG cg13344587 reaching epigenome-wide significance in our EWAS for ALL.
This necessitated the use of a more lenient P-value threshold (P < 1 × 10^-4) to identify DMPs for
subsequent mQTL and causal mediation analyses, which may have introduced false positive
results, especially in the EPIC array meta-analysis, which tested almost twice as many
probes as the overall meta-analysis. Thus, we excluded those DMPs
overlapping both 450K and EPIC arrays that were only identified in the EPIC array meta-analysis.
Another limitation is the number of CpGs available on the Illumina arrays.
Additional CpGs at IKZF1 that we could not assess with the array data may also mediate the
effect of ALL risk SNP rs78396808. Bisulfite sequencing targeting the IKZF1 region will be
required to fully capture the DNA methylation changes that mediate the effect of rs78396808,
especially at IKZF1 regulatory regions that correlate with gene expression patterns. Lastly, gene
expression data were measured from diagnostic leukemia samples and, although we accounted
for somatic copy-number loss of IKZF1, this may have affected the accuracy of the correlation
results between DNA methylation at cg01139861 and IKZF1 expression.
In conclusion, we provide evidence that increased DNA methylation at the IKZF1 CpG
cg01139861 mediates the effects on ALL risk from SNP rs78396808, which was recently
identified as a novel independent risk locus at IKZF1 that is specific to non-European
populations. Our findings enhance the understanding of the functional pathways of genetic risk
loci for childhood ALL and provide new insights into the DNA methylation differences
associated with childhood ALL.
Table 3.1. Characteristics of study participants included in the EWAS stratified by study set and
ALL case/control status (n = 1,727).
Study sets: CCLS 450K (n = 435); CCRLP EPIC (n = 566); CCLS EPIC (n = 726)
Variables | CCLS 450K controls (n = 225) | CCLS 450K cases (n = 210) | P | CCRLP EPIC controls (n = 436) | CCRLP EPIC cases (n = 130) | P | CCLS EPIC controls (n = 258) | CCLS EPIC cases (n = 468) | P
Gestational age (weeks), mean (SD) | 39.17 (2.49) | 39.33 (2.33) | 0.493 | 39.24 (2.01) | 39.16 (2.04) | 0.679 | 39.38 (1.84) | 38.99 (2.38) | 0.037
Gestational age unknown (%) | 6 (3.0%) | 7 (3.0%) | | 23 (5.0%) | 3 (2.3%) | | 4 (2.0%) | 192 (41.0%) |
Sex (%)
Males | 130 (57.8) | 121 (57.6) | 1.000 | 257 (58.9) | 74 (56.9) | 0.757 | 148 (57.4) | 263 (56.2) | 0.821
Females | 95 (42.2) | 89 (42.4) | | 179 (41.1) | 56 (43.1) | | 110 (42.6) | 205 (43.8) |
Ethnicity (%)
Latino | 107 (47.6) | 109 (51.9) | 0.661 | 251 (57.6) | 69 (53.1) | 0.618 | 136 (52.7) | 176 (54.8) | 0.357
Non-Latino other | 38 (16.9) | 33 (15.7) | | 62 (14.2) | 22 (16.9) | | 32 (12.4) | 49 (15.3) |
Non-Latino white | 80 (35.6) | 68 (32.4) | | 123 (28.2) | 39 (30.0) | | 90 (34.9) | 96 (29.9) |
Race/ethnicity unknown (%) | | | | | | | | 147 (31.4%) |
P-values comparing the characteristics of ALL cases and controls in the CCLS 450K, CCLS EPIC,
and CCRLP EPIC datasets were calculated using t-tests for the continuous variable (gestational
age) and Chi-squared tests for categorical variables.
Table 3.2. Characteristics of study participants included in the mQTL analysis stratified by study
set and ALL case/control status (n = 1,487).
Study sets: CCLS 450K (n = 372); CCRLP EPIC (n = 551); CCLS EPIC (n = 564)
Variables | CCLS 450K controls (n = 168) | CCLS 450K cases (n = 204) | P | CCRLP EPIC controls (n = 425) | CCRLP EPIC cases (n = 126) | P | CCLS EPIC controls (n = 211) | CCLS EPIC cases (n = 353) | P
Gestational age (weeks), mean (SD) | 39.25 (2.31) | 39.35 (2.33) | 0.704 | 39.28 (1.94) | 39.11 (2.04) | 0.388 | 39.28 (1.90) | 38.93 (2.42) | 0.093
Gestational age unknown (%) | 5 (3.0%) | 7 (3.0%) | | 22 (5.0%) | 3 (2.4%) | | 3 (1.0%) | 92 (26.0%) |
Sex (%)
Males | 104 (61.9) | 117 (57.4) | 0.433 | 249 (58.6) | 72 (57.1) | 0.852 | 127 (60.2) | 200 (56.7) | 0.463
Females | 64 (38.1) | 87 (42.6) | | 176 (41.4) | 54 (42.9) | | 84 (39.8) | 153 (43.3) |
Ethnicity (%)
Latino | 104 (61.9) | 109 (53.4) | 0.194 | 245 (57.6) | 65 (51.6) | 0.446 | 132 (62.6) | 162 (54.2) | 0.128
Non-Latino other | 24 (14.3) | 30 (14.7) | | 60 (14.1) | 22 (17.5) | | 22 (10.4) | 45 (15.1) |
Non-Latino white | 40 (23.8) | 65 (31.9) | | 120 (28.2) | 39 (31.0) | | 57 (27.0) | 92 (30.8) |
Race/ethnicity unknown (%) | | | | | | | | 54 (15.3) |
P-values comparing the characteristics of ALL cases and controls in the CCLS 450K, CCLS EPIC,
and CCRLP EPIC datasets were calculated using ANOVA tests for the continuous variable
(gestational age) and Chi-squared tests for categorical variables.
Table 3.3. Probe filtering steps to remove cross-reactive probes, Probe SNPs, CpG SNPs, and
SBE SNPs.
Filtering criteria | CCLS 450K, CCLS EPIC, and CCRLP EPIC datasets: number of CpGs in each criterion | Number of CpGs left | CCLS EPIC and CCRLP EPIC datasets: number of CpGs in each criterion | Number of CpGs left
Overlapping CpGs | | 441025 | | 844728
Probe SNPs as identified in Illumina manifest files (MAF >= 0.05) | 44683 | 400013 | 78668 | 766972
CpG SNPs as identified in Illumina manifest files | 16998 | 384657 | 29964 | 737943
SBE SNPs as identified in Illumina manifest files | 7876 | 384341 | 14723 | 737585
Cross-reactive probes as identified in Chen et al. | 29233 | 363973 | |
Cross-reactive probes as identified in Pidsley et al. | | | 43254 | 704420
Cross-reactive probes as identified in McCartney et al. | | | 44210 | 703253
CpGs included in EWAS meta-analysis | | 363973 | | 703253
CpG single nucleotide polymorphisms (SNPs) present in the actual CpG capture; SBE (single
base extension) SNPs present in the single base extension of the CpG site; Probe SNPs present
in the full capture sequence (the genomic sequence that encapsulated the CpG site, attaching
to the oligonucleotide bead on the array). Cross-reactive probes or cross-hybridizing probes
were identified to potentially hybridize to multiple genomic regions, thus generating off-target
signals.
Table 3.4. The 23 ALL risk SNPs included in the mQTL analysis.
Gene RS Id Chr Pos Ref Risk Condition
C5orf56 rs11741255 5 131811182 A G Known
intergenic_MYB/HBS1L rs9376090 6 135411228 C T Novel
BAK1 rs210142 6 33546837 T C Known
IKZF1 rs78396808 7 50459043 G A Conditional Known
IKZF1 rs10230978 7 50477144 G A Known
8q24 rs4617118 8 130156143 A G Known
CDKN2A rs3731249 9 21970916 C T Known(functional evidence of top SNP)
CDKN2A rs2811711 9 21993964 C T Conditional Known
TLE1 rs62579826 9 83728588 T C Known
LHPP rs35837782 10 126293309 A G Known
BMI1 rs11591377 10 22423302 A G Known(functional evidence of top SNP)
PIP4K2A rs7075634 10 22853102 T C Known
ARID5B rs7090445 10 63721176 T C Known
JMJD1C rs9415680 10 65020890 G A Novel
TET1 rs10998283 10 70329064 G A Novel
GATA3 rs3824662 10 8104208 C A Known
ELK3 rs78405390 12 96645605 A C Known
CEBPE rs2239630 14 23589349 G A Known
CEBPE rs60820638 14 23592617 C A Conditional Novel
IKZF3 rs17607816 17 37957235 T C Known
IKZF3 rs12944882 17 37983492 C T Conditional Novel
B4GALNT2 rs6504598 17 47217004 G C Known
ERG rs55681902 21 39784752 T C Known
Known SNPs are those variants with the lowest P-value at 1Mb around the reported SNP in
known loci in the transethnic meta-analysis (Jeon S. et al. 2021). Novel SNPs are those
putatively novel loci identified in the same study that were previously shown to be associated
with multiple blood cell traits and other hematopoietic cancers. Conditional known SNPs are
those secondary ALL association signals identified in known loci in analyses conditioning on the
top SNPs. Conditional novel SNPs are those secondary ALL association signals identified in novel
loci in analyses conditioning on the top SNPs. SNP rs3731249 at CDKN2A and SNP rs11591377
at BMI1 are two putative functional variants identified previously (Walsh K. et al. 2015 and de
Smith A. J. et al. 2018).
Table 3.5. EWAS results for DMPs found to be associated with ALL risk SNPs in the subsequent
mQTL analysis.
Probe | Chr | POS | Islands | Relation to Island | UCSC RefGene Name | UCSC RefGene Group | Regulatory Feature Group | CCLS 450K OR | CCLS 450K SE | CCRLP EPIC OR | CCRLP EPIC SE | CCLS EPIC OR | CCLS EPIC SE | Meta-analysis OR | Meta-analysis SE | P | P.het | i.squared
cg13344587* | chr10 | 63723919 | | OpenSea | ARID5B | Body | Enhancer | 0.546 | 2.137 | 0.391 | 2.417 | 0.493 | 1.791 | 0.481 | 1.193 | 8.61E-10 | 0.576 | 0.000
cg01139861* | chr7 | 50343298 | chr7:50342895-50343456 | Island | IKZF1 | TSS1500 | Promoter_Associated | 2.315 | 2.111 | 1.557 | 1.863 | 1.199 | 1.461 | 1.505 | 1.010 | 5.20E-05 | 0.036 | 69.798
cg13344587** | chr10 | 63723919 | | OpenSea | ARID5B | Body | Enhancer | | | 0.391 | 2.417 | 0.493 | 1.791 | 0.454 | 1.439 | 4.08E-08 | 0.442 | 0.000
cg25722431** | chr7 | 98630446 | chr7:98633091-98633304 | N_Shelf | SMURF1 | Body | | | | 0.278 | 4.751 | 0.288 | 4.261 | 0.283 | 3.172 | 7.02E-05 | 0.954 | 0.000
*DMPs found to be associated with ALL risk SNPs from the mQTL overall meta-analysis (808
cases and 919 controls).
**DMPs found to be associated with ALL risk SNPs from the mQTL EPIC array meta-analysis
(598 cases and 694 controls).
Odds ratios (OR) were calculated for every 0.1 CpG beta value increase.
All logistic regression models were adjusted for sex, batch effect, cell type heterogeneity using
the first ten principal components derived from ReFACTor, and genetic ancestry using the first
ten principal components derived from EPISTRUCTURE.
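The odds ratios above are expressed per 0.1 increase in CpG beta value. As a small worked example of that rescaling, the sketch below converts a logistic regression coefficient expressed per full unit of beta value (0 to 1) into an odds ratio per 0.1 step. Whether the models here were fit on the raw beta scale and rescaled afterwards is an assumption, and the coefficient used is purely hypothetical.

```python
import math


def odds_ratio_per_step(log_odds_per_unit: float, step: float = 0.1) -> float:
    """Odds ratio for a `step` increase in beta value, given a per-unit log-odds coefficient."""
    return math.exp(step * log_odds_per_unit)


# Hypothetical coefficient chosen so that a 0.1 increase in methylation doubles the odds.
coef_per_unit = 10 * math.log(2)
or_per_tenth = odds_ratio_per_step(coef_per_unit)
print(f"OR per 0.1 beta-value increase: {or_per_tenth:.3f}")
```

Reporting per 0.1 rather than per unit keeps the odds ratios on an interpretable scale, since a full-unit change would correspond to moving from completely unmethylated to completely methylated.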
Table 3.6. Significant SNP-DMP pairs identified from mQTL meta-analyses.
SNP | CpG | CCLS 450K Coef | CCLS 450K SE | CCRLP EPIC Coef | CCRLP EPIC SE | CCLS EPIC Coef | CCLS EPIC SE | Meta-analysis Coef | Meta-analysis SE | P | P.het | i.squared | FDR
rs7090445 (chr10:63721176 at ARID5B)* | cg13344587 (chr10:63723919 at ARID5B)* | -0.047 | 0.002 | -0.052 | 0.002 | -0.050 | 0.002 | -0.050 | 0.001 | <2.23E-308 | 0.218 | 34.404 | <2.23E-308
rs78396808 (chr7:50459043 at IKZF1)* | cg01139861 (chr7:50343298 at IKZF1)* | 0.018 | 0.005 | 0.025 | 0.005 | 0.026 | 0.005 | 0.024 | 0.003 | 3.25E-17 | 0.505 | 0.000 | 1.75E-14
rs7090445 (chr10:63721176 at ARID5B)** | cg13344587 (chr10:63723919 at ARID5B)** | | | -0.052 | 0.002 | -0.050 | 0.002 | -0.051 | 0.001 | <2.23E-308 | 0.456 | 0.000 | <2.23E-308
rs9376090 (chr6:135411228 at MYB/HBS1L)** | cg25722431 (chr7:98630446 at SMURF1)** | | | 0.004 | 0.002 | 0.005 | 0.002 | 0.005 | 0.001 | 6.37E-05 | 0.517 | 0.000 | 4.17E-02
*Significant SNP-DMP pairs identified from the mQTL overall meta-analysis (683 cases and 804
controls).
**Significant SNP-DMP pairs identified from the mQTL EPIC array meta-analysis (479 cases and
636 controls).
All linear regression models were adjusted for sex, batch effect, cell type heterogeneity using
the first ten principal components derived from ReFACTor, and genetic ancestry using the first
ten principal components derived from EPISTRUCTURE.
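The meta-analysis columns above combine the per-dataset coefficients into a pooled estimate with heterogeneity statistics. A minimal sketch of one common way to do this, a fixed-effect inverse-variance meta-analysis with Cochran's Q and I², is shown below using the rounded coefficients for rs78396808 and cg01139861 from Table 3.6. The pooling model is an assumption, since the meta-analysis method is not stated in this section, and the function name is hypothetical.

```python
import math


def fixed_effect_meta(estimates, std_errors):
    """Inverse-variance pooled estimate, its SE, Cochran's Q, and I-squared (%)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, estimates))
    df = len(estimates) - 1
    # I-squared is truncated at zero when Q does not exceed its degrees of freedom.
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, pooled_se, q, i_squared


# Rounded per-dataset coefficients and SEs (CCLS 450K, CCRLP EPIC, CCLS EPIC) from Table 3.6.
coefs = [0.018, 0.025, 0.026]
ses = [0.005, 0.005, 0.005]

pooled, pooled_se, q, i_squared = fixed_effect_meta(coefs, ses)
print(f"pooled = {pooled:.4f}, SE = {pooled_se:.4f}, Q = {q:.2f}, I2 = {i_squared:.1f}%")
```

With these rounded inputs the pooled coefficient is close to, but not identical to, the tabulated 0.024, which may reflect rounding of the inputs or a different pooling model; the I² of zero agrees with the table.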
Table 3.7. Estimated total effects, ACME and ADE from the causal mediation analysis
(quasi-Bayesian Monte Carlo simulation, n = 1000).
Pair Estimate SE Statistic P P.het i.squared Effect
Meta-analysis across the CCLS 450K, CCLS EPIC, and CCRLP EPIC datasets (683 cases and 804 controls)
rs7090445_cg13344587 0.108 0.014 7.823 5.17E-15 0.060 64.431 total
rs7090445_cg13344587 0.023 0.019 1.219 2.23E-01 0.455 0.000 ACME
rs7090445_cg13344587 0.093 0.024 3.876 1.06E-04 0.054 65.784 ADE
rs78396808_cg01139861 0.056 0.022 2.560 1.05E-02 0.918 0.000 total
rs78396808_cg01139861 0.017 0.006 2.958 3.09E-03 0.435 0.000 ACME
rs78396808_cg01139861 0.039 0.022 1.771 7.66E-02 0.997 0.000 ADE
rs78396808_cg01139861* 0.089 0.023 3.838 1.24E-04 0.704 0.000 total
rs78396808_cg01139861* 0.016 0.006 2.733 6.28E-03 0.435 0.000 ACME
rs78396808_cg01139861* 0.073 0.024 3.084 2.04E-03 0.871 0.000 ADE
Meta-analysis across the CCLS EPIC, and CCRLP EPIC datasets (479 cases and 636 controls)
rs7090445_cg13344587 0.109 0.015 7.085 1.39E-12 0.018 82.168 total
rs7090445_cg13344587 0.015 0.022 0.682 4.95E-01 0.305 5.110 ACME
rs7090445_cg13344587 0.104 0.028 3.763 1.68E-04 0.023 80.771 ADE
rs9376090_cg25722431 -0.009 0.020 -0.472 6.37E-01 0.117 59.402 total
rs9376090_cg25722431 -0.008 0.004 -2.220 2.64E-02 0.655 0.000 ACME
rs9376090_cg25722431 0.000 0.020 0.022 9.83E-01 0.141 53.959 ADE
*Models additionally adjusted for rs10230978.
The effect estimates correspond to the increased probabilities of developing ALL per 1 copy
increase of the SNP risk allele.
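The Statistic and P columns above appear consistent with a Wald-type z statistic and a two-sided normal P-value. The sketch below checks this for the ACME of rs78396808 and cg01139861 in the overall meta-analysis, using only the standard library. Treating the statistic as standard normal is an assumption about how the P-values were derived, and the function name is hypothetical.

```python
import math


def two_sided_normal_p(z: float) -> float:
    """Two-sided P-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


# ACME row for rs78396808_cg01139861 in Table 3.7: Statistic = 2.958, reported P = 3.09E-03.
p_acme = two_sided_normal_p(2.958)
print(f"P = {p_acme:.2e}")
```

Dividing the tabulated Estimate by the tabulated SE will not reproduce the Statistic column exactly, because both inputs are rounded to three decimals.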
Table 3.8. Estimated total effects, ACME and ADE from the causal mediation analysis
(nonparametric bootstrap, n = 1000 simulations).
Pair Estimate SE Statistic P P.het i.squared Effect
Meta-analysis of the CCLS 450K, CCLS EPIC, and CCRLP EPIC datasets (683 cases and 804 controls)
rs7090445_cg13344587 0.111 0.015 7.472 7.87E-14 0.063 63.784 total
rs7090445_cg13344587 0.024 0.019 1.252 2.11E-01 0.472 0.000 ACME
rs7090445_cg13344587 0.096 0.026 3.739 1.85E-04 0.056 65.260 ADE
rs78396808_cg01139861 0.058 0.024 2.451 1.42E-02 0.919 0.000 total
rs78396808_cg01139861 0.017 0.006 2.743 6.09E-03 0.442 0.000 ACME
rs78396808_cg01139861 0.042 0.024 1.717 8.59E-02 0.994 0.000 ADE
rs78396808_cg01139861* 0.091 0.024 3.768 1.65E-04 0.672 0.000 total
rs78396808_cg01139861* 0.017 0.006 2.666 7.68E-03 0.428 0.000 ACME
rs78396808_cg01139861* 0.075 0.025 2.974 2.94E-03 0.853 0.000 ADE
Meta-analysis of the CCLS EPIC, and CCRLP EPIC datasets (479 cases and 636 controls)
rs7090445_cg13344587 0.110 0.016 6.810 9.74E-12 0.019 81.887 total
rs7090445_cg13344587 0.017 0.022 0.795 4.26E-01 0.308 3.809 ACME
rs7090445_cg13344587 0.104 0.029 3.643 2.70E-04 0.021 81.307 ADE
rs9376090_cg25722431 -0.006 0.021 -0.299 7.65E-01 0.094 64.417 total
rs9376090_cg25722431 -0.008 0.004 -2.205 2.75E-02 0.627 0.000 ACME
rs9376090_cg25722431 0.004 0.021 0.206 8.37E-01 0.119 58.951 ADE
*Models additionally adjusted for rs10230978.
The effect estimates correspond to the increased probabilities of developing ALL per 1 copy
increase of the SNP risk allele.
Table 3.9. Significant SNP-DMP pairs identified from the mQTL overall meta-analyses and the
mQTL EPIC array meta-analyses in Latinos, non-Latino whites and non-Latinos.
Two significant SNP-DMP pairs identified from the mQTL overall meta-analysis (336 cases and 481 controls), and two
significant SNP-DMP pairs identified from the mQTL EPIC-array meta-analysis (227 cases and 377 controls) in Latinos.
SNP | CpG | CCLS 450K Coef | CCLS 450K SE | CCRLP EPIC Coef | CCRLP EPIC SE | CCLS EPIC Coef | CCLS EPIC SE | Meta-analysis Coef | Meta-analysis SE | P | P.het | i.squared
rs7090445 | cg13344587 | -0.046 | 0.003 | -0.049 | 0.002 | -0.048 | 0.003 | -0.048 | 0.002 | 5.42E-214 | 0.675 | 0.000
rs78396808 | cg01139861 | 0.019 | 0.006 | 0.025 | 0.005 | 0.027 | 0.005 | 0.024 | 0.003 | 2.27E-14 | 0.527 | 0.000
rs7090445 | cg13344587 | | | -0.049 | 0.002 | -0.048 | 0.003 | -0.049 | 0.002 | 3.16E-171 | 0.859 | 0.000
rs9376090 | cg25722431 | | | 0.003 | 0.003 | 0.004 | 0.003 | 0.004 | 0.002 | 5.36E-02 | 0.737 | 0.000
One significant SNP-DMP pair identified from the mQTL overall meta-analysis (196 cases and 217 controls), and two
significant SNP-DMP pairs identified from the mQTL EPIC-array meta-analysis (131 cases and 177 controls) in non-Latino
whites.
SNP | CpG | CCLS 450K Coef | CCLS 450K SE | CCRLP EPIC Coef | CCRLP EPIC SE | CCLS EPIC Coef | CCLS EPIC SE | Meta-analysis Coef | Meta-analysis SE | P | P.het | i.squared | P for moderator (Latinos vs. non-Latino Whites)*
rs7090445 | cg13344587 | -0.043 | 0.004 | -0.054 | 0.003 | -0.049 | 0.005 | -0.050 | 0.002 | 2.04E-101 | 0.124 | 52.182 | 0.532
rs78396808 | cg01139861 | 0.037 | 0.029 | 0.036 | 0.035 | 0.032 | 0.032 | 0.035 | 0.018 | 5.48E-02 | 0.995 | 0.000 | 0.548
rs7090445 | cg13344587 | | | -0.054 | 0.003 | -0.049 | 0.005 | -0.053 | 0.003 | 1.36E-79 | 0.434 | 0.000 | 0.233
rs9376090 | cg25722431 | | | 0.004 | 0.003 | 0.007 | 0.003 | 0.006 | 0.002 | 4.23E-03 | 0.466 | 0.000 | 0.439
Two significant SNP-DMP pairs identified from the mQTL overall meta-analysis (293 cases and 323 controls), and two
significant SNP-DMP pairs identified from the mQTL EPIC-array meta-analysis (198 cases and 259 controls) in non-Latinos.
SNP | CpG | CCLS 450K Coef | CCLS 450K SE | CCRLP EPIC Coef | CCRLP EPIC SE | CCLS EPIC Coef | CCLS EPIC SE | Meta-analysis Coef | Meta-analysis SE | P | P.het | i.squared | P for moderator (Latinos vs. non-Latinos)**
rs7090445 | cg13344587 | -0.048 | 0.003 | -0.054 | 0.002 | -0.051 | 0.004 | -0.052 | 0.002 | 2.42E-206 | 0.240 | 29.929 | 0.097
rs78396808 | cg01139861 | 0.005 | 0.014 | 0.017 | 0.012 | 0.035 | 0.019 | 0.016 | 0.008 | 4.59E-02 | 0.447 | 0.000 | 0.373
rs7090445 | cg13344587 | | | -0.054 | 0.002 | -0.051 | 0.004 | -0.053 | 0.002 | 6.81E-158 | 0.401 | 0.000 | 0.082
rs9376090 | cg25722431 | | | 0.004 | 0.003 | 0.006 | 0.002 | 0.005 | 0.002 | 2.86E-03 | 0.641 | 0.000 | 0.561
*P values from test of moderators with race/ethnicity (Latinos vs. non-Latino whites) being
considered as a potential moderator for the mQTL-DMP association.
**P values from test of moderators with race/ethnicity (Latinos vs. non-Latinos) being
considered as a potential moderator for the mQTL-DMP association.
All linear regression models were adjusted for sex, batch effect, cell type heterogeneity using
the first ten principal components derived from ReFACTor, and genetic ancestry using the first
ten principal components derived from EPISTRUCTURE.
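With only two subgroups, a test of moderators can be approximated by a z-test on the difference between the two subgroup meta-analysis estimates. The sketch below applies this to the rs78396808 and cg01139861 coefficients in Latinos and non-Latino whites from Table 3.9. It is an approximation under the assumption of independent subgroups and fixed-effect estimates, not necessarily the exact procedure used for the table, and the function name is hypothetical.

```python
import math


def subgroup_difference_test(b1: float, se1: float, b2: float, se2: float):
    """z statistic and two-sided P-value for the difference between two independent estimates."""
    z = (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# rs78396808 -> cg01139861 meta-analysis coefficients (Table 3.9).
# Latinos: 0.024 (SE 0.003); non-Latino whites: 0.035 (SE 0.018).
z_mod, p_mod = subgroup_difference_test(0.024, 0.003, 0.035, 0.018)
print(f"z = {z_mod:.3f}, P = {p_mod:.3f}")
```

For this pair the approximation gives a P-value close to the tabulated 0.548, indicating no evidence that the mQTL effect size differs between the two groups, even though the association itself is far less precisely estimated in non-Latino whites.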
Table 3.10. Estimated total effects, ACME and ADE from the causal mediation analyses in
Latinos, non-Latino whites and non-Latinos (quasi-Bayesian Monte Carlo simulation, n = 1000).
In Latinos
Pair Estimate SE Statistic P P.het i.squared Effect
Meta-analysis across the CCLS 450K, CCLS EPIC, and CCRLP EPIC datasets (336 cases and 481 controls)
rs7090445_cg13344587 0.115 0.016 6.999 2.57E-12 0.069 62.668 total
rs7090445_cg13344587 0.042 0.022 1.889 5.89E-02 0.474 0.000 ACME
rs7090445_cg13344587 0.086 0.029 2.995 2.74E-03 0.287 19.982 ADE
rs78396808_cg01139861 0.053 0.024 2.191 2.85E-02 0.729 0.000 total
rs78396808_cg01139861 0.022 0.008 2.906 3.66E-03 0.831 0.000 ACME
rs78396808_cg01139861 0.032 0.025 1.287 1.98E-01 0.758 0.000 ADE
rs78396808_cg01139861* 0.078 0.026 2.957 3.10E-03 0.662 0.000 total
rs78396808_cg01139861* 0.022 0.008 2.714 6.64E-03 0.853 0.000 ACME
rs78396808_cg01139861* 0.056 0.027 2.082 3.73E-02 0.711 0.000 ADE
Meta-analysis across the CCLS EPIC, and CCRLP EPIC datasets (227 cases and 377 controls)
rs7090445_cg13344587 0.111 0.018 6.172 6.74E-10 0.024 80.376 total
rs7090445_cg13344587 0.027 0.026 1.036 3.00E-01 0.914 0.000 ACME
rs7090445_cg13344587 0.098 0.033 3.008 2.63E-03 0.171 46.747 ADE
rs9376090_cg25722431 -0.064 0.036 -1.788 7.38E-02 0.873 0.000 total
rs9376090_cg25722431 -0.007 0.005 -1.234 2.17E-01 0.403 0.000 ACME
rs9376090_cg25722431 -0.054 0.036 -1.509 1.31E-01 0.763 0.000 ADE
In non-Latino whites
Pair Estimate SE Statistic P P.het i.squared Effect P for moderator (Latinos vs. non-Latino Whites)**
Meta-analysis across the CCLS 450K, CCLS EPIC, and CCRLP EPIC datasets (196 cases and 217 controls)
rs7090445_cg13344587 0.131 0.027 4.808 1.53E-06 0.988 0.000 total 0.622
rs7090445_cg13344587 0.029 0.034 0.857 3.91E-01 0.221 33.727 ACME 0.751
rs7090445_cg13344587 0.092 0.045 2.056 3.98E-02 0.498 0.000 ADE 0.910
rs78396808_cg01139861 -0.065 0.107 -0.610 5.42E-01 0.119 52.967 total 0.279
rs78396808_cg01139861 0.006 0.019 0.336 7.37E-01 0.562 0.000 ACME 0.441
rs78396808_cg01139861 -0.113 0.103 -1.104 2.69E-01 0.045 67.641 ADE 0.169
rs78396808_cg01139861* 0.009 0.105 0.084 9.33E-01 0.250 27.801 total 0.526
rs78396808_cg01139861* 0.002 0.019 0.129 8.97E-01 0.771 0.000 ACME 0.354
rs78396808_cg01139861* -0.002 0.104 -0.021 9.83E-01 0.184 40.938 ADE 0.591
Meta-analysis across the CCLS EPIC, and CCRLP EPIC datasets (131 cases and 177 controls)
rs7090445_cg13344587 0.130 0.031 4.145 3.40E-05 0.893 0.000 total 0.612
rs7090445_cg13344587 0.016 0.040 0.404 6.86E-01 0.105 61.943 ACME 0.826
rs7090445_cg13344587 0.100 0.053 1.899 5.76E-02 0.253 23.369 ADE 0.974
rs9376090_cg25722431 0.024 0.030 0.809 4.19E-01 0.177 45.090 total 0.059
rs9376090_cg25722431 -0.011 0.008 -1.386 1.66E-01 0.656 0.000 ACME 0.675
rs9376090_cg25722431 0.037 0.030 1.246 2.13E-01 0.230 30.713 ADE 0.050
In non-Latinos
Pair Estimate SE Statistic P P.het i.squared Effect P for moderator (Latinos vs. non-Latinos)***
Meta-analysis across the CCLS 450K, CCLS EPIC, and CCRLP EPIC datasets (293 cases and 323 controls)
rs7090445_cg13344587 0.104 0.023 4.631 3.64E-06 0.460 0.000 total 0.698
rs7090445_cg13344587 0.015 0.031 0.482 6.30E-01 0.420 0.000 ACME 0.484
rs7090445_cg13344587 0.090 0.039 2.282 2.25E-02 0.200 37.876 ADE 0.930
rs78396808_cg01139861 0.085 0.060 1.414 1.57E-01 0.685 0.000 total 0.625
rs78396808_cg01139861 0.007 0.010 0.752 4.52E-01 0.934 0.000 ACME 0.236
rs78396808_cg01139861 0.075 0.059 1.258 2.09E-01 0.639 0.000 ADE 0.506
rs78396808_cg01139861* 0.094 0.058 1.607 1.08E-01 0.810 0.000 total 0.803
rs78396808_cg01139861* 0.007 0.009 0.760 4.47E-01 0.982 0.000 ACME 0.237
rs78396808_cg01139861* 0.085 0.057 1.476 1.40E-01 0.785 0.000 ADE 0.646
Meta-analysis across the CCLS EPIC, and CCRLP EPIC datasets (198 cases and 259 controls)
rs7090445_cg13344587 0.112 0.026 4.260 2.05E-05 0.263 20.289 total 0.993
rs7090445_cg13344587 0.011 0.037 0.298 7.66E-01 0.194 40.704 ACME 0.726
rs7090445_cg13344587 0.103 0.046 2.214 2.68E-02 0.086 66.077 ADE 0.934
rs9376090_cg25722431 0.041 0.023 1.761 7.82E-02 0.133 55.676 total 0.014
rs9376090_cg25722431 -0.007 0.006 -1.140 2.54E-01 0.539 0.000 ACME 0.959
rs9376090_cg25722431 0.051 0.023 2.248 2.46E-02 0.101 62.928 ADE 0.013
*Models additionally adjusted for rs10230978.
**P values from test of moderators with race/ethnicity (Latinos vs. non-Latino whites) being
considered as a potential moderator for the mediation analysis.
***P values from test of moderators with race/ethnicity (Latinos vs. non-Latinos) being
considered as a potential moderator for the mediation analysis.
The effect estimates correspond to the increased probabilities of developing ALL per 1 copy
increase of the SNP risk allele.
Table 3.11. Results from three logistic regression models investigating the potential
confounding effects.
CCLS 450K CCRLP EPIC CCLS EPIC Meta-analysis
Variable OR SE OR SE OR SE OR SE P P.het i.squared
rs7090445(ARID5B)-cg13344587(ARID5B)-ALL
Model 1: logistic regression model predicting ALL status as a function of methylation at cg13344587, adjusting for sex, batch
effect, cell type heterogeneity, and genetic ancestry (808 cases and 919 controls)
cg13344587 0.465 2.388 0.377 2.444 0.464 2.046 0.437 1.311 2.87E-10 0.769 0.000
Model 2: model 1 additionally adjusted for SNP rs7090445 (808 cases and 919 controls)
cg13344587 0.638 3.545 0.592 4.189 1.054 2.818 0.798 1.952 2.49E-01 0.391 0.000
Model 3: model 1 in subjects without any copies of the risk allele of SNP rs7090445 (154 cases and 292 controls)
cg13344587 1.262 8.479 0.457 11.312 1.045 4.791 0.985 3.914 9.70E-01 0.755 0.000
|ORadj - ORcrude|/ORcrude = |0.798 - 0.437|/0.437 = 83%
rs78396808(IKZF1)-cg01139861(IKZF1)-ALL
Model 1: logistic regression model predicting ALL status as a function of methylation at cg01139861, adjusting for sex, batch
effect, cell type heterogeneity, and genetic ancestry (808 cases and 919 controls)
cg01139861 2.173 2.264 1.509 1.888 1.236 1.680 1.510 1.098 1.75E-04 0.135 50.135
Model 2: model 1 additionally adjusted for SNP rs78396808 (808 cases and 919 controls)
cg01139861 2.091 2.297 1.443 1.940 1.184 1.728 1.450 1.125 9.49E-04 0.141 48.884
Model 3: model 1 in subjects without any copies of the risk allele of SNP rs78396808 (490 cases and 592 controls)
cg01139861 2.257 2.715 1.579 2.244 1.078 1.983 1.454 1.303 4.09E-03 0.081 60.286
|ORadj - ORcrude|/ORcrude = |1.450 - 1.510|/1.510 = 4%
Odds ratios (OR) were calculated for every 0.1 CpG beta value increase.
ORcrude is the OR from the model 1 meta-analysis and ORadj is the OR from the model 2 meta-
analysis.
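The change-in-estimate calculation used in this table to gauge confounding can be written out as a short worked example. The sketch below computes the relative change between the crude (model 1) and SNP-adjusted (model 2) meta-analysis odds ratios for both CpGs, using the values from the table. The function name is hypothetical.

```python
def relative_change(or_crude: float, or_adj: float) -> float:
    """Signed relative change in the odds ratio after adjustment: (ORadj - ORcrude) / ORcrude."""
    return (or_adj - or_crude) / or_crude


# Meta-analysis ORs from models 1 (crude) and 2 (SNP-adjusted) in Table 3.11.
arid5b_change = relative_change(or_crude=0.437, or_adj=0.798)  # cg13344587, adjusted for rs7090445
ikzf1_change = relative_change(or_crude=1.510, or_adj=1.450)   # cg01139861, adjusted for rs78396808

print(f"cg13344587: {arid5b_change:+.1%}")
print(f"cg01139861: {ikzf1_change:+.1%}")
```

The large shift for cg13344587 (about 83% in magnitude, moving the OR toward 1) is what supports confounding by rs7090445, whereas the OR for cg01139861 changes by only about 4% after adjustment for rs78396808.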
Table 3.12. Results from the IKZF1 gene-specific EWAS analysis.
Probe | Chr | POS | Islands | Relation to Island | UCSC RefGene Name | UCSC RefGene Group | Regulatory Feature Group | CCLS 450K OR | CCLS 450K SE | CCRLP EPIC OR | CCRLP EPIC SE | CCLS EPIC OR | CCLS EPIC SE | Meta-analysis OR | Meta-analysis SE | P | P.het | i.squared
cg01139861* | chr7 | 50343298 | chr7:50342895-50343456 | Island | IKZF1 | TSS1500 | Promoter_Associated | 2.315 | 2.111 | 1.557 | 1.863 | 1.199 | 1.461 | 1.505 | 1.010 | 5.20E-05 | 0.036 | 69.798
cg16499656* | chr7 | 50344471 | chr7:50343757-50344519 | Island | IKZF1 | 1stExon;5'UTR | Promoter_Associated | 0.220 | 12.169 | 0.090 | 22.219 | 0.011 | 21.832 | 0.105 | 9.589 | 1.86E-02 | 0.489 | 0.000
cg12431065** | chr7 | 50466792 | chr7:50467566-50468400 | N_Shore | IKZF1 | Body | | | | 0.124 | 6.350 | 0.322 | 5.045 | 0.223 | 3.950 | 1.43E-04 | 0.240 | 27.439
cg01139861** | chr7 | 50343298 | chr7:50342895-50343456 | Island | IKZF1 | TSS1500 | Promoter_Associated | | | 1.557 | 1.863 | 1.199 | 1.461 | 1.324 | 1.149 | 1.46E-02 | 0.269 | 18.086
cg10551353** | chr7 | 50357825 | | OpenSea | IKZF1 | 5'UTR;TSS1500;5'UTR | | | | 2.816 | 3.581 | 1.219 | 2.498 | 1.603 | 2.049 | 2.12E-02 | 0.055 | 72.814
cg16499656** | chr7 | 50344471 | chr7:50343757-50344519 | Island | IKZF1 | 1stExon;5'UTR | Promoter_Associated | | | 0.090 | 22.219 | 0.011 | 21.832 | 0.031 | 15.572 | 2.58E-02 | 0.502 | 0.000
cg23720063** | chr7 | 50469042 | chr7:50467566-50468400 | S_Shore | IKZF1 | 3'UTR | | | | 0.181 | 16.822 | 0.065 | 14.009 | 0.099 | 10.765 | 3.18E-02 | 0.642 | 0.000
cg16232940** | chr7 | 50418280 | | OpenSea | IKZF1 | Body | | | | 0.513 | 2.893 | 0.830 | 1.872 | 0.720 | 1.572 | 3.66E-02 | 0.163 | 48.566
cg24067003** | chr7 | 50358194 | | OpenSea | IKZF1 | 5'UTR;TSS1500;5'UTR | | | | 1.212 | 3.308 | 1.579 | 2.255 | 1.452 | 1.863 | 4.55E-02 | 0.509 | 0.000
*CpGs found to be associated with ALL risk from the overall meta-analysis (808 cases and 919
controls).
**CpGs found to be associated with ALL risk from the EPIC array meta-analysis (598 cases and
694 controls).
Odds ratios (OR) were calculated for every 0.1 CpG beta value increase.
All logistic regression models were adjusted for sex, batch effect, cell type heterogeneity using
the first ten principal components derived from ReFACTor, and genetic ancestry using the first
ten principal components derived from EPISTRUCTURE.
Table 3.13. Significant SNP-DMP pairs identified from the mQTL overall meta-analysis and the
mQTL EPIC array meta-analysis in Latinos (IKZF1 gene-specific analysis).
SNP | CpG | CCLS 450K Coef | CCLS 450K SE | CCRLP EPIC Coef | CCRLP EPIC SE | CCLS EPIC Coef | CCLS EPIC SE | Meta-analysis Coef | Meta-analysis SE | P | P.het | i.squared
rs78396808* | cg16499656* | -0.001 | 0.001 | -0.001 | 0.000 | 0.000 | 0.000 | -0.001 | 0.000 | 1.20E-02 | 0.173 | 43.079
rs78396808** | cg10551353** | | | 0.016 | 0.003 | 0.013 | 0.003 | 0.015 | 0.002 | 4.06E-12 | 0.479 | 0.000
rs78396808** | cg12431065** | | | 0.006 | 0.001 | 0.004 | 0.002 | 0.005 | 0.001 | 1.36E-06 | 0.489 | 0.000
rs78396808** | cg16499656** | | | -0.001 | 0.000 | 0.000 | 0.000 | -0.001 | 0.000 | 2.33E-02 | 0.065 | 70.694
*Significant SNP-DMP pair identified from the mQTL overall meta-analysis (336 cases and 481
controls).
**Significant SNP-DMP pairs identified from the mQTL EPIC array meta-analysis (227 cases and
377 controls).
All linear regression models were adjusted for sex, batch effect, cell type heterogeneity using
the first ten principal components derived from ReFACTor, and genetic ancestry using the first
ten principal components derived from EPISTRUCTURE.
Table 3.14. Estimated total effects, ACME and ADE from the mediation analysis in Latinos
(IKZF1 gene-specific analysis) (quasi-Bayesian Monte Carlo simulation, n = 1000).
Pair Estimate SE Statistic P P.het i.squared Effect
Meta-analysis across the CCLS 450K, CCLS EPIC, and CCRLP EPIC datasets (336 cases and 481 controls)
rs78396808_cg16499656 0.053 0.024 2.183 0.029 0.740 0.000 total
rs78396808_cg16499656 0.003 0.003 1.029 0.304 0.592 0.000 ACME
rs78396808_cg16499656 0.046 0.024 1.905 0.057 0.691 0.000 ADE
Meta-analysis across the CCLS EPIC, and CCRLP EPIC datasets (227 cases and 377 controls)
rs78396808_cg10551353 0.044 0.028 1.571 0.116 0.917 0.000 total
rs78396808_cg10551353 0.020 0.009 2.246 0.025 0.405 0.000 ACME
rs78396808_cg10551353 0.021 0.028 0.750 0.453 0.878 0.000 ADE
rs78396808_cg16499656 0.043 0.027 1.552 0.121 0.925 0.000 total
rs78396808_cg16499656 0.004 0.004 0.946 0.344 0.311 2.772 ACME
rs78396808_cg16499656 0.035 0.028 1.266 0.206 0.936 0.000 ADE
rs78396808_cg12431065 0.041 0.028 1.478 0.139 0.972 0.000 total
rs78396808_cg12431065 -0.012 0.007 -1.797 0.072 0.928 0.000 ACME
rs78396808_cg12431065 0.053 0.028 1.877 0.060 0.956 0.000 ADE
Meta-analysis across the CCLS 450K, CCLS EPIC, and CCRLP EPIC datasets additionally adjusted for rs10230978 (336 cases and
481 controls)
rs78396808_cg16499656 0.078 0.027 2.915 0.004 0.685 0.000 total
rs78396808_cg16499656 0.004 0.003 1.194 0.233 0.602 0.000 ACME
rs78396808_cg16499656 0.071 0.027 2.645 0.008 0.635 0.000 ADE
Meta-analysis across the CCLS EPIC, and CCRLP EPIC datasets additionally adjusted for rs10230978 (227 cases and 377
controls)
rs78396808_cg10551353 0.067 0.031 2.168 0.030 0.867 0.000 total
rs78396808_cg10551353 0.019 0.009 2.162 0.031 0.411 0.000 ACME
rs78396808_cg10551353 0.045 0.031 1.429 0.153 0.947 0.000 ADE
rs78396808_cg16499656 0.065 0.031 2.140 0.032 0.880 0.000 total
rs78396808_cg16499656 0.005 0.004 1.225 0.221 0.363 0.000 ACME
rs78396808_cg16499656 0.057 0.031 1.855 0.064 0.992 0.000 ADE
rs78396808_cg12431065 0.065 0.031 2.117 0.034 0.904 0.000 total
rs78396808_cg12431065 0.000 0.002 0.022 0.982 0.849 0.000 ACME
rs78396808_cg12431065 0.065 0.031 2.122 0.034 0.894 0.000 ADE
The effect estimates correspond to the increased probabilities of developing ALL per 1 copy
increase of the SNP risk allele.
Table 3. 15. Meta-analyzed results for the association between rs78396808 and ALL risk (683
cases and 804 controls).
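| SNP | CCLS 450K OR | CCLS 450K SE | CCRLP EPIC OR | CCRLP EPIC SE | CCLS EPIC OR | CCLS EPIC SE | Meta-analysis OR | Meta-analysis SE | P | P.het | i.squared |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rs78396808 | 1.504 | 0.232 | 1.554 | 0.220 | 1.210 | 0.196 | 1.394 | 0.124 | 0.007 | 0.647 | 0.000 |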
Logistic regression was adjusted for sex, ancestry, batch effect, rs6421315 and rs10230978.
Odds ratios (OR) were calculated for every 1 copy increase of the SNP risk allele.
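
The pooled estimate in Table 3.15 can be approximated with a short sketch. It assumes a fixed-effect, inverse-variance model and that each reported SE is the standard error of the log odds ratio; neither assumption is stated in the table itself, and the function name is hypothetical.

```python
import math


def fixed_effect_meta(odds_ratios, log_or_ses):
    """Pool per-dataset odds ratios with inverse-variance (fixed-effect) weights.

    Assumption: each SE is the standard error of the log odds ratio.
    """
    betas = [math.log(o) for o in odds_ratios]
    weights = [1.0 / se ** 2 for se in log_or_ses]
    total_weight = sum(weights)

    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    # Two-sided p-value from the normal approximation.
    z = pooled_beta / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))

    # Cochran's Q and I^2 for between-dataset heterogeneity.
    q = sum(w * (b - pooled_beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    if df != 2:
        raise ValueError("heterogeneity p-value is coded for three datasets only")
    p_het = math.exp(-q / 2.0)  # chi-square survival function with 2 degrees of freedom
    i_squared = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    return {
        "or": math.exp(pooled_beta),
        "se": pooled_se,
        "p": p_value,
        "p_het": p_het,
        "i_squared": i_squared,
    }


# Per-dataset values copied from Table 3.15 (CCLS 450K, CCRLP EPIC, CCLS EPIC).
table_3_15 = fixed_effect_meta([1.504, 1.554, 1.210], [0.232, 0.220, 0.196])
```

Under these assumptions the result agrees with the meta-analysis columns of Table 3.15 to within rounding of the three-decimal inputs, which suggests, but does not confirm, that a model of this kind was used.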
Figure 3. 1. Study design of the EWAS, mQTL, and mediation analyses.
Flowcharts show the inclusion and exclusion criteria for CpGs included in the EWAS and SNP-
DMP pairs for the mQTL and mediation analyses, separately for the overall meta-analysis and
the EPIC array only meta-analysis.
Figure 3. 2. Meta-analysis of the epigenome-wide association analysis.
(A) Bidirectional Manhattan plot for the overall meta-analysis, (B) QQ plot for the overall
meta-analysis, (C) bidirectional Manhattan plot for the EPIC array meta-analysis, and (D) QQ plot
for the EPIC array meta-analysis. The two horizontal lines in the Manhattan plots mark a
Bonferroni-adjusted threshold of 0.05 divided by the number of CpGs (red) and a lenient
threshold of $1 \times 10^{-4}$ (blue). The y-axis of the bidirectional Manhattan plots represents
$-\log_{10} P_{\mathrm{hyper}}$ for hypermethylated CpGs (higher DNA methylation beta values in
cases vs. in controls) and $\log_{10} P_{\mathrm{hypo}}$ for hypomethylated CpGs (lower DNA
methylation beta values in cases vs. in controls), respectively. CpGs later included in the causal
mediation analysis are labeled with gene names.
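
The y-axis convention described in this caption can be sketched as follows. This is only an illustration of the signed transform and the Bonferroni threshold; the function names and the CpG count used in the usage line are hypothetical and are not taken from the analysis.

```python
import math


def bonferroni_threshold(n_tests, alpha=0.05):
    """Bonferroni-adjusted significance threshold: alpha divided by the number of tests."""
    return alpha / n_tests


def signed_log10_p(p_value, hypermethylated):
    """Plot height for one CpG in a bidirectional Manhattan plot.

    Hypermethylated CpGs are drawn above zero at -log10(P);
    hypomethylated CpGs are drawn below zero at log10(P).
    """
    magnitude = -math.log10(p_value)
    return magnitude if hypermethylated else -magnitude


# Hypothetical number of tested CpGs, for illustration only.
example_threshold = bonferroni_threshold(400_000)
```

With this convention the lenient threshold of 1 x 10^-4 appears as a pair of lines at +4 and -4.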
Figure 3. 3. Path Diagrams showing the results of the causal mediation and confounding
analyses.
(A) The left panel shows the causal mediation model of the ARID5B SNP rs7090445
(independent variable), the ARID5B CpG cg13344587 (mediator candidate), and ALL risk
(dependent variable). The right panel shows the confounding model of the ARID5B SNP
rs7090445 (confounder), the ARID5B CpG cg13344587 (independent variable), and ALL risk
(dependent variable). (B) The left and right panels show the causal mediation models of the
IKZF1 SNP rs78396808 (independent variable), the IKZF1 CpG cg01139861 (mediator), and ALL
risk (dependent variable), with or without conditioning on the lead IKZF1 SNP rs10230978
identified from our recent multi-ancestry GWAS meta-analysis, respectively. The plus and minus
signs indicate positive correlations and negative correlations, respectively. The solid and dashed
lines indicate pathways found to be statistically significant and nonsignificant, respectively. The
diagrams for the mediation models show the direct effect, the average causal mediation effect,
and the total effect estimated from the overall meta-analysis of the causal mediation analysis
results from quasi-Bayesian Monte Carlo simulation. The effect estimates correspond to the
increased probabilities of developing ALL per 1 copy increase of the SNP risk allele. The diagram
for the confounding model shows the crude effect from the logistic regression predicting ALL
risk as a function of DNA methylation at the ARID5B CpG cg13344587, and the adjusted effect
from the logistic regression additionally controlled for the ARID5B SNP rs7090445. The effect
estimates are the odds ratios for each 0.1 beta-value increase in DNA methylation from the
logistic regression models. All models were adjusted for sex, batch effect, cell type
heterogeneity, and genetic ancestry.
Figure 3. 4. Characteristics of the significant DMP-SNP pair at IKZF1.
(A) Left panels: relationship between DNA methylation level at cg01139861 and rs78396808
genotype. Red points represent median DNA methylation levels. A is the risk allele for
rs78396808. Middle panel: relationship between DNA methylation level at cg01139861 and ALL
risk. Red points represent median DNA methylation levels. Right panel: rs78396808 genotype
frequency in cases and controls overall, in Latinos, and in non-Latino whites. (B) Visualization of
the genomic location for DMP cg01139861, CpG cg10551353, and SNP rs78396808 at gene
IKZF1 incorporating annotation queries to the UCSC Genome Browser via Gviz [276]. The top track
shows the ideogram of chromosome 7 with the red bar indicating where gene IKZF1 is. The
second track shows the genome axis, from position 50342500 to position 50472798
(reference build hg19). The third track shows the IKZF1 gene transcript. The fourth track shows
three CpG islands located at gene IKZF1, at chr7:50342895-50343456 (46 CpGs),
chr7:50343757-50344519 (80 CpGs), and chr7:50467566-50468400 (79 CpGs). The last track
shows where DMP cg01139861, CpG cg10551353, and SNP rs78396808 are located. CpG
cg01139861 is located in the CpG island at chr7:50342895-50343456 in the promoter region of
IKZF1, and SNP rs78396808 is in an intronic region ~116 kb downstream. CpG cg10551353 is in
the 5ʹ UTR or the TSS1500 region of several transcripts, ~14 kb downstream of cg01139861.
Figure 3. 5. Scatter plot showing relationship between cg01139861 DNA methylation and IKZF1
expression levels in ALL tumor samples.
Scatter plot with linear regression line and its 95% confidence interval band showing a
significantly negative correlation between DNA methylation beta values at cg01139861 at gene
IKZF1 and gene expression log 2 fold changes of IKZF1 in 51 tumor samples from CCLS. The
Spearman correlation coefficient R and its P value are shown in the plot.
Figure 3. 6. Scatterplot showing the relationship between DNA methylation beta values at CpG
cg01139861 and CpG cg10551353 at gene IKZF1 in Latinos.
Left panel displays data from the CCRLP EPIC dataset and right panel displays data from the
CCLS EPIC dataset. Correlation coefficients and p-values calculated using the Spearman
correlation test.
Chapter 4: Whole-Exome Sequencing in Multiplex Families to Identify Novel AYA Classical
Hodgkin Lymphoma Predisposition Genes
Abstract
Hodgkin Lymphoma (HL) is a B-cell malignancy that mainly affects adolescents and
young adults (AYA). Sequencing of familial and sporadic HL patients previously identified rare
pathogenic germline variants in genes with varying functions. A greater understanding of the
mechanisms by which these genes are involved in HL predisposition is needed. We performed
germline whole-exome sequencing for 48 individuals, including 20 classical HL (cHL) cases in 14
multiplex families with primarily AYA cHL patients (age at diagnosis: 17-50 years) to identify
novel cHL predisposition genes. Rare germline short variant calling, annotation, and filtering
were performed. In 11 families with unaffected sibling controls, we found 22 putative
pathogenic germline variants in 22 genes only among the cHL patients. One variant in the
cancer-related gene PGK1 was reported as pathogenic in VarSome and pathogenic/likely
pathogenic (P/LP) in ClinVar in patients with phosphoglycerate kinase 1 deficiency associated
with hemolytic anemia. Two variants in cancer-related genes PTPRT and RANBP17 were
classified by VarSome with evidence of pathogenicity. In the 3 families with cHL patients but
lacking sequencing data from sibling controls, we found 3 P/LP variants in genes associated with
blood disorders (SLC4A1 and SEC23B) or immunodeficiency (PRKDC). In the analysis agnostic to
family structure, we found 5 LP loss-of-function variants in genes associated with
immunodeficiency (DCLRE1C and DNMT3B) or cancers (PTPRD, DDX10, and HMBS). We
identified several novel putative predisposition genes for cHL, and assessment of these genes in
sequencing studies of independent HL families is required to understand their roles in HL
predisposition.
Introduction
Hodgkin Lymphoma (HL) is a B-cell malignancy that affects ~2-3 per 100,000 individuals
per year in the United States [277]. HL is typically classified based on histopathological
appearance into classical HL (cHL), comprising >95% of cases, and nodular lymphocyte
predominant HL (NLPHL) [278]. cHL can be further divided into four subtypes: nodular sclerosis
HL (NSHL, the most frequent subtype), mixed cellularity HL (MCHL), lymphocyte-rich HL, and
lymphocyte-depleted HL; the remaining 5% of cHL cases cannot be classified [279]. HL shows a
unique trimodal age-specific incidence pattern that varies by time period, geography,
race/ethnicity, gender, and socioeconomic status (SES) [280–283], suggesting strong
environmental influences. Two small incidence peaks of HL occur in children (<15 years of
age) and elderly adults (>50 years of age), with most of the cases in these age groups diagnosed
as Epstein-Barr virus (EBV)-positive MCHL [281,282,284]. The major incidence peak of HL occurs
among adolescents and young adults (AYAs) (age range 15-35 years), mainly comprising cases
with EBV-negative NSHL [282]. The risk of HL in AYAs is higher for populations with higher SES
and smaller family sizes [284,285], suggesting that a lack of early childhood exposure to a
variety of infectious agents may increase susceptibility to HL by perturbing immune development.
In addition to environmental risk factors, there is evidence for a strong heritable
component of cHL, as supported by the elevated risk in first-degree relatives [286] and
monozygotic twins [287]. The estimated heritability of HL is ~30% in European populations [288].
Genome-wide association studies (GWAS) have identified several common risk loci for cHL,
primarily in immune-related genes, with the strongest associations in the HLA class II region for
EBV-negative cHL and in the HLA class I region for EBV-positive cHL [289–295]. The exact causal
mechanisms for these reported susceptibility alleles remain largely unclear [296]. In addition,
sequencing studies have identified rare germline variants that contribute to cHL predisposition,
in genes involved in varying biological processes including cytokinesis (KLHDC8B), cell cycle
regulation (NPAT), extracellular matrix (ACAN), vascular endothelial growth factor pathways
(KDR), microRNA biogenesis (DICER1), and telomere maintenance (POT1) [297–303]. The
mechanisms by which these low-frequency, high-penetrance variants contribute to HL risk
have yet to be determined, but multiple biological pathways appear to be involved. Here, we
performed germline whole-exome sequencing (WES) for 48 samples (20 cHL cases) among 14
multiplex families with AYA cHL cases to identify novel predisposition genes for HL.
Methods
Study subjects
This study was approved by the institutional review boards of the Keck School of
Medicine of the University of Southern California (USC) in accordance with the Declaration of
Helsinki. Signed informed consent was obtained from all participants in this study. Index cases
from 14 multiplex families along with unaffected first-degree relatives were ascertained from
the USC twin registries (7 index cases, 2 additional cases, and 7 unaffected relatives) or from
the USC Cancer Surveillance Program (7 index cases, 4 additional cases, and 21 unaffected
relatives). The USC twin registries include twins with HL who were registered in either the
International Twin Registry (ITS) or the California Twin Program (CTP) [304]. The ITS is a volunteer
registry of twins with cancer and chronic disease, ascertained by advertising in periodicals from
1980 to 1991; the CTP is a population-based California Twin Cohort, consisting of twin pairs
born in California between 1908 and 1982, ascertained by linking the State birth records to the
State Department of Motor Vehicles. Twin pairs were eligible if either twin was diagnosed with
HL before 50 years of age. Investigators attempted to collect samples for the entire family of both
twins if more than one case was reported in the family. Cases in the USC Cancer Surveillance
Program were ascertained within 3 years of diagnosis between 2000 and 2008 in a case-parent trio
design [293]. In this study, 95% of the cases were less than 45 years old at diagnosis. Siblings were
enrolled if both parents were not available or did not comply. Among the twin pairs, 5 were
concordant monozygotic (MZ) pairs, one was a concordant dizygotic pair, and one was an MZ
discordant pair with a second case occurring in the daughter of the unaffected MZ co-twin. The
other 7 families included 4 with two affected siblings, one with an affected child/parent pair, and
two with affected child/uncle(aunt) pairs; the latter 3 families had specimens available for the
index child case only. All index cases were diagnosed under 50 years old, with a mean age of 33 years.
Exome sequencing and variant calling
Germline WES was performed using the Nextera® Rapid Capture Exome kit [305]. The kit
includes >340,000 95-mer probes targeting a 62 Mb genomic footprint that spans
201,121 exons plus expanded content (e.g., UTRs and miRNA binding sites). Samples were
sequenced as 2 × 150 bp paired-end reads stored in two FASTQ files [306], one containing
the forward reads and the other the reverse reads.
Our workflow for rare pathogenic/likely pathogenic (P/LP) germline variant discovery is
summarized in Figure 4.1. Fastq-pair v0.3 [307] was used to identify and remove the unmatched
reads from the paired-end FASTQ files. FastQC v0.11 [308] and TrimGalore v0.6 [309] were used
for quality control (QC) and adapter trimming. MultiQC v1.8 [310] was used to summarize the QC
statistics across all the samples. We performed additional data pre-processing and germline
short variant discovery based on the Genome Analysis Toolkit (GATK) best practices guidelines
[311,312]. Briefly, BWA v0.7 [313] was first used to generate BAM files [314] that contain
sequencing reads mapped to the human reference genome GRCh38/hg38 [315]. Alfred software [316]
was used to calculate the on-target rate and the fraction of targets above 20× coverage for each
mapped BAM file. GATK v4.1 [317] duplicate marking and base quality score recalibration (BQSR)
were then used to produce analysis-ready BAM files. UCSC liftOver [318] was used to convert the
interval list of the Nextera® Rapid Capture Exome kit from the GRCh37 reference assembly to
GRCh38 to make it consistent with the reference genome used for mapping. The base quality scores
before and after BQSR were generated using the GATK AnalyzeCovariates command and were
visualized using the R package ggplot2 [319]. The GATK DepthOfCoverage command was used to
summarize coverage statistics (i.e., mean, median, and % of bases above 15× coverage) for each
sample. SNP and INDEL discovery was performed using the HaplotypeCaller command in
Genomic Variant Call Format (GVCF) mode. The resulting GVCFs from each sample were
combined into a single multi-sample GVCF using the GenomicsDBImport command, and joint
genotyping across all the samples was subsequently performed using the GenotypeGVCFs
command [320]. Raw SNPs and INDELs were stored in a variant call format (VCF) file [321].
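
The three GATK calling steps named above can be sketched as command lines assembled in Python. This is a minimal illustration, not the commands actually run in this study: the helper names and all file paths are hypothetical, and only standard GATK4 arguments (`-R`, `-I`, `-L`, `-O`, `-V`, `-ERC GVCF`, `--genomicsdb-workspace-path`, and the `gendb://` prefix) are used.

```python
def haplotypecaller_cmd(reference, bam, intervals, out_gvcf):
    """Per-sample variant calling in GVCF mode, restricted to the capture intervals."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-L", intervals,
        "-O", out_gvcf,
        "-ERC", "GVCF",
    ]


def genomicsdbimport_cmd(gvcfs, intervals, workspace):
    """Consolidate the per-sample GVCFs before joint genotyping."""
    cmd = [
        "gatk", "GenomicsDBImport",
        "--genomicsdb-workspace-path", workspace,
        "-L", intervals,
    ]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    return cmd


def genotypegvcfs_cmd(reference, workspace, out_vcf):
    """Joint genotyping across all samples from the consolidated workspace."""
    return [
        "gatk", "GenotypeGVCFs",
        "-R", reference,
        "-V", f"gendb://{workspace}",
        "-O", out_vcf,
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a system where GATK is installed; the sketch only builds the argument lists and does not execute anything.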
Data filtering methods were implemented between the GATK variant calling and Variant
Quality Score Recalibration (VQSR) steps to improve variant quality, as suggested by Carson et
al. [322]. Variants with read depth (DP) ≤8 at the genotype level, minimum genotype quality (GQ)
≤20, average GQ ≤35, or genotype call rate ≤90% were removed by using BCFtools v1.10 [314].
VQSR was then applied with a truth sensitivity level of 99.0% for both SNPs and indels, and
variants that did not pass VQSR were removed. Additional filtering steps were conducted by
using BCFtools and the vcfR R package [323]. At the variant level, variants were included if they
had a variant confidence score (QUAL) normalized by unfiltered allele depth (QD) >2. Given the high
depth of sequencing (mean depth = 45.8), relatively stringent thresholds were subsequently
applied at the genotype level to further exclude likely spurious variant calls, only including
variant genotype calls with an alternate allele read depth >5 and an alternate allele
read depth fraction >0.2. Finally, variants only found in those samples with extremely low
coverage (N=3) were likely to be spurious and were removed.
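
The thresholds above can be written as simple predicates. This is a sketch under stated assumptions: the text does not specify how the genotype-level and variant-level criteria were combined, so here low-quality genotypes (DP ≤8 or GQ ≤20) are treated as missing before the mean GQ and call rate are computed. The class and function names are hypothetical, and the study itself applied these filters with BCFtools and vcfR, not with this code.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class GenotypeCall:
    depth: int      # DP: total reads at the site for this sample
    quality: int    # GQ: genotype quality
    alt_depth: int  # reads supporting the alternate allele


def passes_pre_vqsr(calls: List[Optional[GenotypeCall]]) -> bool:
    """Variant-level check applied before VQSR; None marks a missing genotype.

    Assumption: genotypes with DP <= 8 or GQ <= 20 count as missing.
    """
    usable = [c for c in calls if c is not None and c.depth > 8 and c.quality > 20]
    if not usable:
        return False
    call_rate = len(usable) / len(calls)
    return mean(c.quality for c in usable) > 35 and call_rate > 0.90


def passes_qd(qd: float) -> bool:
    """Variant-level check applied after VQSR: QUAL normalized by depth must exceed 2."""
    return qd > 2


def carrier_call_is_credible(call: GenotypeCall) -> bool:
    """Genotype-level check: alternate allele depth > 5 and alternate fraction > 0.2."""
    return call.alt_depth > 5 and call.alt_depth / call.depth > 0.2
```

Keeping each threshold in its own predicate makes it easy to see which criterion removes a given call.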
Variants were annotated using ANNOVAR [324], incorporating information from
RefSeqGene [325], Exome Aggregation Consortium (ExAC) v0.3 [326], dbSNP build 150 [327],
dbNSFP v3.5a [328], Genome Aggregation Database (gnomAD) [137] exome and genome collection
(v2.1.1), and ClinVar (03-16-2020) [329]. BCFtools v1.10 was used to annotate variants with
Combined Annotation Dependent Depletion (CADD) Phred-scaled scores (v1.6) [330] and
Trans-Omics for Precision Medicine (TOPMed) program Freeze5 [331].
Identification of P/LP germline variants in cancer-predisposition genes
To identify putative P/LP variants, we filtered out variants with an allele frequency
(AF) >0.001 in population databases (gnomAD [137] or TOPMed [331]) and variants documented as
benign/likely benign (B/LB) in ClinVar [329]. We retained splicing/ncRNA variants and exonic
variants with loss-of-function/missense/unknown functional consequences, and with a CADD
Phred score >10. We further limited variants to those in 1,073 genes related to cancer,
immunodeficiency, and/or hematological traits as categorized by the PeCanPIE MedalCeremony
pipeline [332] and previous literature [333–335] or previously identified in sequencing studies or
GWAS for cHL [289,297–303] (Table S1). Variants predicted to be tolerated by the PeCanPIE
MedalCeremony pipeline [332] were removed, and the remaining variants were visually inspected
in Integrative Genomics Viewer (IGV) and were removed if found to be a false positive or an
artifact due to strand bias or orientation bias [336]. Genomic locations of the remaining putative
P/LP variants were visualized in ProteinPaint [337].
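
The automated part of this prioritization (everything before the PeCanPIE and IGV review) can be sketched as a single predicate. The record fields, label strings, and function name are hypothetical. Two choices are assumptions not stated in the text: a variant absent from a population database is treated as rare, and a variant with no CADD score is treated as failing the CADD criterion.

```python
from dataclasses import dataclass
from typing import Optional, Set

# Hypothetical labels for the retained functional classes and benign ClinVar assertions.
RETAINED_CONSEQUENCES = {"splicing", "ncRNA", "loss_of_function", "missense", "unknown"}
BENIGN_LABELS = {"benign", "likely_benign", "benign/likely_benign"}


@dataclass
class AnnotatedVariant:
    gene: str
    consequence: str
    gnomad_af: Optional[float]   # None if absent from gnomAD
    topmed_af: Optional[float]   # None if absent from TOPMed
    clinvar: Optional[str]       # None if not reported in ClinVar
    cadd_phred: Optional[float]  # None if no CADD score is available


def is_candidate(variant: AnnotatedVariant, candidate_genes: Set[str],
                 max_af: float = 0.001, min_cadd: float = 10.0) -> bool:
    """Apply the rarity, ClinVar, consequence, CADD, and gene-list criteria in turn."""
    for af in (variant.gnomad_af, variant.topmed_af):
        if af is not None and af > max_af:
            return False
    if variant.clinvar is not None and variant.clinvar.lower() in BENIGN_LABELS:
        return False
    if variant.consequence not in RETAINED_CONSEQUENCES:
        return False
    if variant.cadd_phred is None or variant.cadd_phred <= min_cadd:
        return False
    return variant.gene in candidate_genes
```

In this sketch the 1,073-gene list from Table S1 would be supplied as `candidate_genes`.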
Analysis of putative P/LP germline variants by incorporating family structure
Self-reported family pedigree information was confirmed with pairwise kinship
coefficients estimated from the WES data for each pair of individuals by SEEKIN v1.0 [338]. The
distribution of the putative P/LP variants within each family in our cohort was visualized by
using the R package ComplexHeatmap [339]. Variants were retained if found only in HL index
cases and not in their unaffected sibling controls, except for variants not found in the
unaffected MZ twin. These variants were considered to be potentially causal, and were further
reviewed for pathogenicity in VarSome [340] (02-08-2022) and for gene functions and related
pathways/phenotypes on OMIM [341] and GeneCards [342]. VarSome is a search engine for human
genomic variation that classifies the submitted variant into different pathogenicity categories
according to the ACMG guidelines by incorporating information from 30 external databases and
risk prediction scores from 20 in silico algorithms. Variants predicted to be B/LB in VarSome
were excluded from our further analysis.
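
The within-family segregation rule can be sketched with set operations. This reflects one reading of the rule, labeled here as an assumption: an unaffected MZ co-twin is not counted as a control, and a variant must also be carried by that co-twin to be retained. The function name and sample labels are hypothetical.

```python
def case_only_variants(carriers, cases, sibling_controls, unaffected_mz_twins=()):
    """Return variant IDs carried by at least one case and by no unaffected sibling control.

    carriers maps a variant ID to the set of family members carrying it.
    Assumption: unaffected MZ co-twins are excluded from the control set, and
    variants absent from such a co-twin are dropped.
    """
    cases = set(cases)
    mz_twins = set(unaffected_mz_twins)
    controls = set(sibling_controls) - mz_twins

    kept = []
    for variant_id, samples in carriers.items():
        in_case = bool(samples & cases)
        in_control = bool(samples & controls)
        shared_with_mz_twin = mz_twins <= samples  # trivially true when there is no MZ co-twin
        if in_case and not in_control and shared_with_mz_twin:
            kept.append(variant_id)
    return sorted(kept)
```

For a family with a discordant MZ pair, a variant carried by the index case and the unaffected co-twin but not by another unaffected sibling would be retained under this reading.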
In addition to incorporating family structure into the identification of putative causal
germline variants in cHL, we assessed loss-of-function variants (stopgain/frameshift/splicing)
among all subjects agnostic to family structure.
Results
A total of 48 individuals, including 20 patients with cHL from 14 multiplex families, were
included in the analysis. The mean age at diagnosis for cases was 33 years (standard deviation
[SD] = 9.39, range: 17-50), the mean age at blood draw was 50.36 years (SD = 12.99) overall and
46.05 years (SD = 10.78) in cases, and there were 30 (62.5%) females overall and 6
(35.3%) females among cases (Table 4.1).
Putative P/LP variants in predisposition genes in cHL patients
Following the removal of 430 likely spurious variants that were only identified in 3
unaffected samples with extremely low sequencing coverage (Figure 4.2), we identified 4,998
variants that were rare (AF <0.001) and putatively functional, of which 250 variants overlapped
the 1,073 candidate genes. Of these 250 variants, 6, 142, and 78 were assigned with gold, silver,
or bronze medals, respectively, by the PeCanPIE MedalCeremony pipeline (Figure 4.3). There
were 24 variants assigned to the “unknown” category, most of which were located on
chromosome X. The 250 variants included 7 stopgain variants, 1 frameshift deletion, 1
frameshift insertion, 3 splice-site variants, 235 nonsynonymous single nucleotide variants
(SNVs), and 3 variants of unknown mutation type. The 148 gold/silver medal variants, which
included most (10/12) of the loss-of-function/splice-site variants (including 5 stopgain, 1
frameshift deletion, 1 frameshift insertion, and 3 splice-site variants), along with the 24
unknown category variants (including 2 stopgain variants on chromosome X) were prioritized
for downstream analysis incorporating family structure.
Incorporating family structure identifies putative causal germline variants in cHL
Of the 172 gold/silver/unknown variants, the 20 patients harbored 151 variants (127
unique), and 28 controls carried 166 variants (104 unique) in total (Figure 4.4). Following the
removal of variants that appeared spurious in IGV (n = 5) and variants predicted to be B/LB by
VarSome (n = 5), among the 11 families with sibling controls there were 22 unique variants in
22 genes that were found only in cHL patients (Table 4.2 and Figure 4.5). None of the variants
were found in more than one family. In family #2, a nonsynonymous SNV in PGK1 identified in
the MZ twins discordant for cHL was classified as pathogenic by VarSome and was reported as
P/LP in ClinVar in patients with phosphoglycerate kinase 1 deficiency associated with hemolytic
anemia (ClinVar ID #9950) (Figure S5A). The same variant was found in the affected offspring of
the unaffected MZ twin (Figure 4.5), supporting the potential pathogenicity of this variant.
There were two additional nonsynonymous SNVs that were classified with some evidence of
pathogenicity by VarSome (Table 4.2), both in genes associated with cancer (PTPRT and
RANBP17). The remaining 19 variants of uncertain significance (VUS) were in genes
associated with cancer (APOB, ETV4, HDAC7, KIAA1549, MET, POLE, TERT, and TSC1), immune
disorders (C1QB, IL2RA, MALT1, NOD2, and PRKDC), and blood disorders (ALDOA, EPB42,
SPECC1, and SPTB), and in genes NPAT and ZGPAT that have previously been implicated in HL
predisposition [298,300].
Of the 22 variants, 10 were previously reported in ClinVar including the described
pathogenic variant in PGK1, and the variant in PTPRT that was reported to be LP in ClinVar in a
patient with congenital brain malformations. Eight variants were reported as VUS, of which 3
were previously reported in patients with immunodeficiency disorders (in genes IL2RA, MALT1,
and PRKDC) and 2 were reported in patients with dyskeratosis congenita (TERT) and other
anemia-associated disorders (EPB42).
P/LP variants in predisposition genes in families lacking sibling controls
In the 3 multiplex families with cHL patients but lacking sibling controls with sequencing
data (families #1, #4, and #12), there were 31 variants out of the list of 172 variants with
gold/silver/unknown medal categories. We excluded 6 variants that appeared to be spurious in
IGV and 5 variants found to be B/LB in VarSome, resulting in 20 remaining variants across 19
different genes (Table 4.3 and Figure 4.6). Of the 20 variants, 3 were classified as P/LP by
VarSome, in SLC4A1, PRKDC, and SEC23B. The SEC23B variant, identified in the family #12 index
case, was previously reported as pathogenic in ClinVar in patients with congenital
dyserythropoietic anemia. This gene has also been associated with Cowden syndrome.
However, the SEC23B variant was not found in the index case’s affected sibling, with whom the
index case shared 3 VUS in genes AMER1, FBXW7, and NSD1. Of the other two variants, both
found in the family #1 patient, the PRKDC variant was previously reported as VUS in ClinVar in
patients with immunodeficiency; the SLC4A1 variant was not reported in ClinVar before, but
this gene has been associated with spherocytosis, an inherited blood disorder.
The family #1 patient carried an additional 2 variants in LYST and FAT4 that were
reported with some evidence of pathogenicity by VarSome. The LYST variant was previously
reported as VUS in ClinVar, and the gene has been associated with immune disorders, including
Chediak-Higashi syndrome and familial hemophagocytic lymphohistiocytosis. The FAT4 variant
was not reported in ClinVar before, but this gene has been associated with Hennekam
lymphangiectasia-lymphedema syndrome and splenic marginal zone lymphoma. The remaining
15 VUS overlapped genes involved in cancer (AMER1, APOB, COL7A1, FBXW7, HOXB13, NSD1,
NTRK1, PTCH1, and PTPRT), immune disorders (DCLRE1C, ITK, LIG4, and LPIN2), and blood
disorders (SLC4A1 and SPTA1).
Thirteen variants were reported in ClinVar as VUS, including the described LP variant in
PRKDC and the variant with some evidence of pathogenicity in LYST. Of the remaining 11
variants, 3 were previously reported in patients with lymphoproliferative syndrome (in gene
ITK) and other immunodeficiency disorders (LIG4 and DCLRE1C), 2 were reported in patients
with anemia-associated disorders such as spherocytosis and elliptocytosis (SLC4A1 and SPTA1),
and 6 were reported in patients with other types of diseases, including familial
hypercholesterolemia (APOB), dystrophic epidermolysis bullosa (COL7A1), hereditary cancer-
predisposing syndrome (HOXB13), Majeed syndrome (LPIN2), familial medullary thyroid
carcinoma (NTRK1), and Gorlin syndrome (PTCH1).
Loss-of-function variants in predisposition genes in multiplex families with cHL patients
There were 12 prioritized loss-of-function variants among all subjects in the analysis
agnostic to family structure, of which 5 variants (1 frameshift deletion and 4 stopgain variants)
were found to be spurious in IGV and were removed. Of the remaining 7 variants (3 stopgain, 1
frameshift insertion, and 3 splice-site variants), 5 were classified as P/LP by VarSome (Table 4.4
and Figure 4.5). One variant was only found in the patient, 5 were found in the index case and
the unaffected siblings, and 1 was only found in the unaffected parent. In family #7, both the
index case and their unaffected sibling harbored a splice-site variant in the DNMT3B gene that was
classified as pathogenic by VarSome but reported as a VUS in ClinVar in patients with
immunodeficiency. In family #9, the index case and one of their unaffected siblings carried
another predicted pathogenic splice-site variant in gene HMBS, which was reported as VUS in
ClinVar in patients with acute intermittent porphyria. The same index case also carried a
frameshift insertion in gene DCLRE1C that was classified as LP by VarSome and was reported as
VUS in ClinVar in patients with immunodeficiency, although the same variant was found in 3
out of the 4 unaffected siblings. In family #10, a stopgain variant in C8B was classified as
pathogenic by VarSome and was also reported as pathogenic in ClinVar in patients with
complement component 8 deficiency associated with immunodeficiency. However, this variant
was only found in the unaffected parent and not the index case. In family #8, the index case
carried a splice-site variant in PTPRD that was classified as LP by VarSome but was not reported
in ClinVar before, although the same variant was found in 2 out of the 3 unaffected siblings.
Gene PTPRD has been associated with nodal marginal zone lymphoma.
In family #3, both the index case and their unaffected sibling harbored a stopgain
variant in DDX10 that was classified with some evidence of pathogenicity by VarSome. DDX10
has been associated with myeloid malignancies. In family #11, one of the two patients carried a
stopgain variant in MSN that was not reported in VarSome, but the MSN gene has been associated
with immunodeficiency.
Discussion
In this study, we performed germline WES in 20 primarily AYA cHL patients and 28
unaffected relatives among 14 multiplex families, to identify novel cHL predisposition genes.
There appeared to be a broad range of genes involved in HL etiology, especially those involved
in the immune system or anemia. We identified 5/20 (25%) index cHL patients harboring
potentially pathogenic germline variants, in genes PGK1, PTPRT, RANBP17, SLC4A1, PRKDC, and
SEC23B, which have not previously been implicated in cHL predisposition. In an additional 4
index patients, we found likely pathogenic variants in DDX10, DNMT3B, PTPRD, HMBS, and
DCLRE1C, but these were also found in their unaffected family controls.
Based on our analysis of variants that were only found in cases and not in their
unaffected siblings, we identified one P/LP variant in the PGK1 gene. The p.I253T missense
variant in PGK1 was discovered along with two VUS in family #2, whereby all 3 variants were
present in the index case, the unaffected MZ twin, and the unaffected twin's affected offspring
but not in the unaffected sibling. The p.I253T variant was predicted as pathogenic by all the in
silico algorithms available in VarSome, and it was reported as P/LP in ClinVar in a patient with
PGK1 deficiency [343,344], a metabolic disorder with the most common form characterized by
hemolytic anemia [345]. PGK1 encodes a glycolytic enzyme that catalyzes one of the two
ATP-producing reactions in the glycolysis pathway [345]. Although the biological relationship of
PGK1 to HL is unknown, PGK1 functions as an oncogene in most cancer types and plays important
roles in various oncogenic signaling pathways [346], such as HIF-1α-induced glycolysis
enhancement, MYC-induced metabolic reprogramming, AKT/mTOR pathway activation, and
CXCR4/ERK pathway activation [347], all of which can lead to tumorigenesis.
Two variants in PTPRT and RANBP17 were classified with some evidence of
pathogenicity by VarSome. The p.V69A variant in PTPRT was reported as LP in ClinVar in
patients with brain malformations [348]. PTPRT is a receptor-type protein tyrosine phosphatase
that functions as a tumor suppressor [349] and plays significant roles in cell adhesion and
intracellular signaling [350]. The p.V69A variant is located in its extracellular MAM domain,
which is critical for the cell adhesion function [351]. Intriguingly, all the tumor-derived
extracellular mutations of PTPRT have been found to impair cell-cell adhesion [351,352]. In
addition, PTPRT was also found to interact with the adherens junction proteins to regulate
cadherin-mediated cell adhesion [353]. Although PTPRT has not been implicated in HL
tumorigenesis before, two other PTP superfamily members, PTPRK and PTPN1, were reported as
tumor suppressors in HL [354,355].
The p.G719R variant in RANBP17 has the highest CADD-Phred score (32) among all the
missense putative pathogenic variants identified in this study. RANBP17 is one of the 6 human
nuclear export receptors, involved with key signal transduction pathways, oncogenes, and
tumor-suppressor genes
356
. The exportins are widely dysregulated in hematological
malignancies
356
. For instance, exportin-1 was previously reported to play a crucial role in cHL
pathogenesis
357
. However, although RANBP17 was found to function as a tumor suppressor in
glioblastoma
358
, its role in other cancers including HL is unknown.
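For readers less familiar with CADD, a PHRED-scaled score expresses a variant's rank among all possible single-nucleotide substitutions on a log scale: a score of 10 corresponds to the top 10%, 20 to the top 1%, and 30 to the top 0.1%. The short Python sketch below converts a score into that top fraction. The function name is hypothetical, and the calculation only illustrates the scale.

```python
def cadd_phred_to_top_fraction(phred: float) -> float:
    """Fraction of ranked substitutions scoring at least this PHRED value.

    Uses the PHRED definition: score = -10 * log10(fraction).
    """
    return 10 ** (-phred / 10)


# Illustrative scores, including the CADD-Phred of 32 reported for RANBP17 p.G719R.
for score in (20, 30, 32):
    print(f"CADD-Phred {score}: top {cadd_phred_to_top_fraction(score):.4%}")
```

On this scale, a score of 32 places a variant in roughly the top 0.06% of possible substitutions.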
In the 3 families with cHL patients but lacking unaffected siblings with sequencing data,
we found 2 patients harboring 3 P/LP missense variants in the genes SEC23B, SLC4A1, and PRKDC.
One of the two patients in family #12 carried the SEC23B p.E109K variant, which was reported as
pathogenic in ClinVar [359,360] and is the most frequent (>30%) SEC23B missense mutation
inherited in individuals with congenital dyserythropoietic anemia [359]. This variant reduces
SEC23B gene expression, resulting in erythrocyte-specific defects in COPII-mediated endoplasmic
reticulum export [359]. SEC23B is associated with the cancer-predisposing condition Cowden
syndrome (type 7), and abnormalities in this gene can disturb endoplasmic reticulum-to-Golgi
trafficking and interfere with sugar transporters and glycosyltransferases [360]. However, the
p.E109K variant was not found in the index case's affected sibling, suggesting potentially low
penetrance. The SLC4A1 and PRKDC variants were both found in the family #1 patient. SLC4A1 plays
a key role in electroneutral anion exchange across the plasma membrane, and mutations in this
gene have been associated with plasma membrane instability and changes in transport properties,
leading to red cell pathologies [361]. Intriguingly, the family #4 patient carried another
SLC4A1 variant, p.L153M, which was reported as VUS in ClinVar in patients with hereditary
spherocytosis type 4, a disorder that can lead to hemolytic anemia. Moreover, variants were
found in additional genes that can cause membrane protein deficiency and lead to hemolytic
anemia [362] (EPB42, SPTA1, and SPTB in families #3, #4, and #5, respectively), suggesting that
cHL may share disease pathways with hereditary spherocytosis and hemolytic anemia. The
PRKDC variant was reported as VUS in ClinVar in patients with immunodeficiency. PRKDC
encodes the DNA-dependent protein kinase catalytic subunit (DNA-PKcs), which is crucial for DNA
double-strand break repair and V(D)J recombination [363]. DNA-PKcs interacts with the
transcription factor autoimmune regulator (AIRE), playing a role in regulating autoimmune
responses and maintaining AIRE-dependent tolerance [363]. PRKDC mutations can disrupt AIRE
function, leading to autoimmunity [363]. The family #1 patient carried 2 additional variants
reported with some evidence of pathogenicity by VarSome, in the genes LYST and FAT4. LYST has
been associated with Chediak-Higashi syndrome, an immunodeficiency disorder [364], and
mutations in this gene can lead to impaired lymphocyte cytotoxicity, which is potentially linked
to the development of cHL [365]. FAT4 has been associated with Hennekam syndrome, an inherited
disorder resulting from a defective lymphatic system [366], and mutations in this gene have
been found in cHL tumor samples [367].
Of the 4 loss-of-function variants that were classified as P/LP by VarSome but found in
both affected and unaffected subjects, family #7 carried a splice-site variant in DNMT3B, which
was reported as VUS in ClinVar in patients with immunodeficiency-centromeric instability-facial
anomalies syndrome [368]. DNMT3B mutations can cause aberrant DNA methylation leading to
chromosomal instability and cytogenetic abnormalities [368]. Family #8 carried a splice-site
variant in PTPRD, a tumor suppressor frequently disrupted in nodal marginal zone lymphoma [369]
and cHL [367]. PTPRD defects deregulate phosphatase activity and cell growth and activate the
STAT3 oncoprotein [369]. Family #9 carried a frameshift insertion in DCLRE1C and a splice-site
variant in HMBS. Similar to the previously described PRKDC gene, DCLRE1C is involved in
V(D)J recombination and is associated with severe combined immunodeficiency [363]. The HMBS
variant was reported in ClinVar in patients with a rare metabolic disorder [370]. HMBS may
act as a tumor suppressor in liver cancer [371], but its functional relevance to HL is unknown.
In addition, family #3 carried a stopgain variant in DDX10, which was classified with some
evidence of pathogenicity by VarSome and has a CADD-Phred score of 38. DDX10 forms a
common leukemogenic fusion with NUP98, which is associated with acute myeloid leukemia
and myelodysplastic syndrome. This fusion can increase proliferation and self-renewal of CD34+
cells and disrupt their differentiation [372]. Finally, one of the family #11 patients carried a
stopgain variant in MSN, which was not found in VarSome. MSN has been associated with
immunodeficiency and plays an important role in regulating the proliferation, migration, and
adhesion of lymphoid cells [373]. Although these variants were discovered in both affected and
unaffected subjects, or in only one of the two affected siblings in our study, it is possible
that these variants and genes contribute to cHL etiology with variable penetrance, for
example requiring interaction with environmental or other genetic factors.
To our knowledge, none of the genes identified in our study other than NPAT and
ZGPAT have previously been implicated in predisposition to HL [298,300]. The NPAT variant
p.P1266L identified in our study was not given strong pathogenicity evidence in VarSome.
Family #14 carried the ZGPAT variant p.K527N, the same variant previously reported in a
multiplex HL family, in which it was found in all 3 HL patients and in none of the 7 unaffected
relatives [300]. This variant was predicted as pathogenic by 8 in silico algorithms in VarSome
but is relatively common in the Ashkenazi Jewish population (AF=0.0174) [137].
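To put the reported allele frequency in context, under Hardy-Weinberg equilibrium an allele with frequency q is carried by a fraction 1 - (1 - q)^2 of individuals. The sketch below applies this to the AF of 0.0174 given above. The Hardy-Weinberg assumption and the function name are illustrative and are not part of the original analysis.

```python
def carrier_fraction(allele_freq: float) -> float:
    """Expected fraction of individuals carrying at least one copy of an
    allele, assuming Hardy-Weinberg equilibrium (an illustrative assumption)."""
    return 1 - (1 - allele_freq) ** 2


af = 0.0174  # ZGPAT p.K527N allele frequency reported in the text
print(f"Expected carriers: {carrier_fraction(af):.2%}")
```

Under this assumption roughly 3.4% of the population, or about 1 in 29 individuals, would carry the variant, which is consistent with the caution expressed above about its pathogenicity.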
An important strength of our study is the ability to identify putative causal germline
variants by incorporating information on family structures. Nevertheless, our study does have
several limitations. First, we applied stringent filtering criteria, which could exclude some
susceptibility variants. Second, more than one gene or the interaction between multiple genes
in the same pathway could be involved in HL etiology, but we did not have a sufficient
number of families including both cHL cases and unaffected relatives to identify the enriched
biological pathways in HL. Finally, it is difficult to confirm the pathogenicity of these putative
causal germline variants for HL without replication studies.
In conclusion, this study provides further evidence that genetic predisposition to HL
appears to involve a wide range of genes that play a role in cancer development, immune
deficiency, and hematological traits. We have identified several putative novel cHL
predisposition genes, and assessment of these genes in sequencing studies of independent HL
families is required to validate their roles in HL predisposition. Future studies using
whole-genome sequencing from a large cohort of families are needed to provide a more
comprehensive picture of the genome and more detailed data to identify the susceptibility
variants for HL.
Table 4.1. Characteristics of 48 individuals from 14 multiplex families with AYA classical
Hodgkin lymphoma patients.
| Characteristics | Overall (n = 48) | Case (n = 20) | Control (n = 28) |
|---|---|---|---|
| Race (%) | | | |
| European | 27 (56.2) | 14 (70.0) | 13 (46.4) |
| Hispanic or Latino | 10 (20.8) | 2 (10.0) | 8 (28.6) |
| Middle Eastern (includes Iranian) | 7 (14.6) | 2 (10.0) | 5 (17.9) |
| Mixed (European + Hispanic) | 4 (8.3) | 2 (10.0) | 2 (7.1) |
| Source (%) | | | |
| HL Population - LLS | 11 (22.9) | 4 (20.0) | 7 (25.0) |
| HL Population - DOD | 21 (43.8) | 7 (35.0) | 14 (50.0) |
| HL Twin | 16 (33.3) | 9 (45.0) | 7 (25.0) |
| Twin (%) | | | |
| Yes | 9 (18.8) | 8 (40.0) | 1 (3.6) |
| No | 39 (81.2) | 12 (60.0) | 27 (96.4) |
| Age at diagnosis (case) (mean (SD)) | 33.00 (9.39) | 33.00 (9.39) | NA |
| Age at blood draw (mean (SD)) | 50.36 (12.99) | 46.05 (10.78) | 53.56 (13.73) |
| Missing/unknown | 1 (2.1) | | 1 (3.6) |
| Sex (%) | | | |
| Male | 18 (37.5) | 11 (55.0) | 7 (25.0) |
| Female | 30 (62.5) | 9 (45.0) | 21 (75.0) |
| Histology (%) | | | |
| Nodular sclerosis | 9 (52.9) | 9 (52.9) | |
| Mixed cellularity | 4 (23.5) | 4 (23.5) | |
| Missing/unknown | 4 (23.5) | 4 (23.5) | |
| EBV in tumor (%) | | | |
| Yes | 5 (29.4) | 5 (29.4) | |
| No | 1 (5.9) | 1 (5.9) | |
| Missing/unknown | 11 (64.7) | 11 (64.7) | |
Table 4.2. Putative causal germline variants (n = 22) found in the classical Hodgkin lymphoma
cases within 8 multiplex families.
| Family ID | Variant (chr.pos.ref.alt) | Gene | AA change | Mutation type | VarSome | VarSome score | PeCanPIE class | ClinVar | PHRED | TOPMed |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 8.47849499.A.G | PRKDC | L2337P | nonsynonymous SNV | Uncertain Significance | PM2, BP4 | Silver | Uncertain_significance | 13.8 | 8.76E-05 |
| 2 | 23.78123196.T.C | PGK1 | I253T | nonsynonymous SNV | Pathogenic | PVS1, PM2, PP2, PP3, PP5 | Unknown | Pathogenic/Likely_pathogenic | 25.9 | 1.59E-05 |
| 2 | 17.20205004.G.C | SPECC1 | V319L | nonsynonymous SNV | Uncertain Significance | PM2 | Silver | . | 22.9 | |
| 3 | 10.6018060.G.A | IL2RA | R263W | nonsynonymous SNV | Uncertain Significance | PM2, BP4 | Silver | Uncertain_significance | 14.04 | 0.000326516 |
| 3 | 12.132643860.T.C | POLE | I1423V | nonsynonymous SNV | Uncertain Significance | PM2, BP4 | Silver | Uncertain_significance | 17.6 | 4.78E-05 |
| 3 | 15.43208298.C.T | EPB42 | R366Q | nonsynonymous SNV | Uncertain Significance | PM2 | Silver | Uncertain_significance | 25.2 | 5.57E-05 |
| 3 | 18.58747562.T.C | MALT1 | M732T | nonsynonymous SNV | Uncertain Significance | PM2, BP4 | Silver | Uncertain_significance | 19.82 | 0.000246878 |
| 5 | 12.47795712.C.T | HDAC7 | G321E | nonsynonymous SNV | Uncertain Significance | PM2 | Silver | . | 22.2 | |
| 5 | 14.64793513.C.G | SPTB | R717P | nonsynonymous SNV | Uncertain Significance | PM2, PP3 | Silver | . | 24.9 | 2.39E-05 |
| 6 | 16.30068658.G.A | ALDOA | V113M | nonsynonymous SNV | Uncertain Significance | PM2, PP3 | Silver | . | 24.5 | 3.98E-05 |
| 6 | 17.43529643.C.T | ETV4 | R330Q | nonsynonymous SNV | Uncertain Significance | NA | Silver | . | 28.4 | 1.59E-05 |
| 6 | 20.42885815.A.G | PTPRT | V69A | nonsynonymous SNV | Uncertain Significance/P | PM2, PP5 | Gold | Likely_pathogenic | 24.4 | 1.59E-05 |
| 6 | 5.1279464.G.A | TERT | R653C | nonsynonymous SNV | Uncertain Significance | PM2, BP4 | Silver | Uncertain_significance | 14.8 | 3.98E-05 |
| 7 | 2.21009303.C.T | APOB | R2522Q | nonsynonymous SNV | Uncertain Significance | PM2 | Silver | Uncertain_significance | 23.7 | 3.98E-05 |
| 10 | 9.132902640.G.C | TSC1 | R786G | nonsynonymous SNV | Uncertain Significance | PM2, PP3 | Silver | . | 28.5 | |
| 10 | 5.171205536.G.A | RANBP17 | G719R | nonsynonymous SNV | Uncertain Significance/LP | PM2, PP3 | Silver | . | 32 | 4.78E-05 |
| 11 | 11.108161289.G.A | NPAT | P1266L | nonsynonymous SNV | Uncertain Significance/B | BP1 | Silver | . | 23.7 | 7.96E-06 |
| 11 | 7.138894443.C.A | KIAA1549 | V1311L | nonsynonymous SNV | Uncertain Significance | PM2 | Silver | . | 21.8 | 7.96E-06 |
| 11 | 1.22661144.G.T | C1QB | A174S | nonsynonymous SNV | Uncertain Significance | PM2 | Silver | . | 23.9 | |
| 11 | 16.50711391.A.C | NOD2 | K494Q | nonsynonymous SNV | Uncertain Significance | PM2, BP4 | Silver | . | 21.3 | |
| 14 | 7.116699194.T.C | MET | V37A | nonsynonymous SNV | Uncertain Significance | PM2 | Silver | Conflicting_interpretations_of_pathogenicity | 22.3 | 0.000111493 |
| 14 | 20.63735904.G.T | ZGPAT | K527N | nonsynonymous SNV | Uncertain Significance | PP3, BS1 | Silver | . | 25.8 | 0.000549503 |
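The variant identifiers in Table 4.2 follow a chr.pos.ref.alt pattern. As a minimal sketch, the Python snippet below splits such an identifier into typed fields. The class and function names are hypothetical, and the mapping of chromosome code 23 to X is an inference from the table (the X-linked gene PGK1 is listed on chromosome 23), not something the table states.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_variant_id(variant_id: str) -> Variant:
    """Split a 'chr.pos.ref.alt' identifier into typed fields.

    Chromosome code 23 is rendered as 'X'; this mapping is an assumption
    inferred from the table, where the X-linked gene PGK1 is listed on 23.
    """
    chrom, pos, ref, alt = variant_id.split(".")
    if chrom == "23":
        chrom = "X"
    return Variant(chrom, int(pos), ref, alt)


# PGK1 p.I253T identifier as listed in Table 4.2.
print(parse_variant_id("23.78123196.T.C"))
```

Keeping the position as an integer makes it straightforward to sort variants or compare them against other annotation sources.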
Table 4.3. Putative P/LP germline variants (n = 20) in predisposition genes found in the classical
Hodgkin lymphoma cases of the 3 families lacking sibling controls with sequencing data.
| Family ID | Variant (chr.pos.ref.alt) | Gene | AA change | Mutation type | VarSome | VarSome score | PeCanPIE class | ClinVar | PHRED | TOPMed |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.235677550.C.T | LYST | V3624I | nonsynonymous SNV | Uncertain Significance/LP | PM2 | Silver | Uncertain_significance | 25.7 | 4.78E-05 |
| 1 | 17.44258130.C.T | SLC4A1 | G380R | nonsynonymous SNV | Likely Pathogenic | PM2, PP2, PP3 | Silver | . | 25.5 | |
| 1 | 4.125477298.C.T | FAT4 | P4146L | nonsynonymous SNV | Uncertain Significance/LP | PM2 | Silver | . | 23.7 | 7.96E-06 |
| 1 | 8.47859612.C.T | PRKDC | R2069Q | nonsynonymous SNV | Likely Pathogenic | PVS1, PM2 | Silver | Uncertain_significance | 17.7 | 0.000310589 |
| 4 | 1.156874392.C.T | NTRK1 | S396L | nonsynonymous SNV | Uncertain Significance | PM2 | Unknown | Conflicting_interpretations_of_pathogenicity | 21.6 | 0.000119457 |
| 4 | 1.158657674.C.T | SPTA1 | V870M | nonsynonymous SNV | Uncertain Significance | PM2 | Silver | Uncertain_significance | 25.3 | 0.000254842 |
| 4 | 13.108209047.C.T | LIG4 | R741H | nonsynonymous SNV | Uncertain Significance | PM2, BP4 | Silver | Conflicting_interpretations_of_pathogenicity | 22 | 0.000406154 |
| 4 | 17.48728266.G.C | HOXB13 | P110A | nonsynonymous SNV | Uncertain Significance/Benign | BP4 | Silver | Uncertain_significance | 21.2 | 7.96E-06 |
| 4 | 17.44260432.G.T | SLC4A1 | L153M | nonsynonymous SNV | Uncertain Significance | PM2, PP2 | Silver | Uncertain_significance | 23.1 | 0.000398191 |
| 4 | 18.2951199.G.A | LPIN2 | P149L | nonsynonymous SNV | Uncertain Significance | PM2 | Silver | Conflicting_interpretations_of_pathogenicity | 22.3 | 0.000461901 |
| 4 | 3.48595101.C.G | COL7A1 | R20P | nonsynonymous SNV | Uncertain Significance | PM2, BP4 | Silver | Conflicting_interpretations_of_pathogenicity | 16.69 | 0.000159276 |
| 4 | 5.157248957.C.T | ITK | R581W | nonsynonymous SNV | Uncertain Significance | PM2, PP3, BP6 | Silver | Conflicting_interpretations_of_pathogenicity | 24 | 0.000788417 |
| 12 | 10.14923057.A.T | DCLRE1C | L329M | nonsynonymous SNV | Uncertain Significance | PM2, BP6 | Silver | Uncertain_significance | 24.3 | 0.000700815 |
| 12 | 2.21007956.T.G | APOB | N2971T | nonsynonymous SNV | Uncertain Significance | PM2, BP4 | Silver | Conflicting_interpretations_of_pathogenicity | 14.53 | 0.000278733 |
| 12 | 20.42104608.G.T | PTPRT | D1186E | nonsynonymous SNV | Uncertain Significance | NA | Silver | . | 23.6 | 3.98E-05 |
| 12 | 20.18515695.G.A | SEC23B | E109K | nonsynonymous SNV | Pathogenic | PM2, PP5, PS3, PP3 | Gold | Pathogenic | 28.4 | 0.000286697 |
| 12 | 4.152382252.T.G | FBXW7 | Q28H | nonsynonymous SNV | Uncertain Significance | PM2, PP2 | Unknown | . | 21.2 | |
| 12 | 5.177210071.C.T | NSD1 | L289F | nonsynonymous SNV | Uncertain Significance | PM2, PP2 | Silver | . | 20.5 | |
| 12 | 9.95449242.G.A | PTCH1 | P1211S | nonsynonymous SNV | Uncertain Significance | PM2, BP4 | Silver | Uncertain_significance | 23.9 | 3.98E-05 |
| 12 | 23.64190403.C.T | AMER1 | A962T | nonsynonymous SNV | Uncertain Significance/Benign | BP4 | | . | 19.78 | 4.78E-05 |
Table 4.4. Putative P/LP loss-of-function variants (n = 7) in predisposition genes found in 6
multiplex families with AYA classical Hodgkin lymphoma cases.
| Family ID | Variant (chr.pos.ref.alt) | Gene | AA change | Mutation type | VarSome | VarSome score | PeCanPIE class | ClinVar | PHRED | TOPMed |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 11.108675634.C.T | DDX10 | Q96X | stopgain | Uncertain Significance/P | PVS1, PP3 | Silver | . | 38 | 7.96E-05 |
| 7 | 20.32781353.G.A | DNMT3B | G48D | splice-site | Pathogenic | PVS1, PM2, PP3 | Gold | Uncertain_significance | 23.8 | 4.78E-05 |
| 8 | 9.8504405.G.C | PTPRD | Q560E | splice-site | Likely Pathogenic | PVS1, PM2 | Gold | . | 17.07 | 3.19E-05 |
| 9 | 10.14908583.C.CT | DCLRE1C | S635fs | frameshift insertion | Likely Pathogenic | PVS1, PM2 | Silver | Conflicting_interpretations_of_pathogenicity | 22.7 | 0.00018317 |
| 9 | 11.119088707.A.C | HMBS | I54L | splice-site | Pathogenic | PVS1, PM2, PP2 | Silver | Uncertain_significance | 19.45 | 3.98E-05 |
| 10 | 1.56949599.G.A | C8B | R274X | stopgain | Pathogenic | PVS1, PM2, PP3, PP5 | Silver | Pathogenic | 35 | 2.39E-05 |
| 11 | 23.65731147.G.GGAAGGGAAGCCCATTTAGCTTTCTAAGGGGCTGATGCAAAGAGGATGTAGAGTCAGCCAATTGAGGTCACACTGGA | MSN | E170fs | stopgain | NA | NA | | . | | |
Figure 4.1. Flowchart showing the workflow of rare germline variant discovery.
Figure 4.2. The depth of coverage matrices of the 48 samples from 14 multiplex families with
AYA classical Hodgkin lymphoma patients.
Panel (A) shows %bases above 15 reads, and panel (B) shows the median number of reads per
base with interquartile ranges (IQR) for 48 samples.
Figure 4.3. The distribution of the 250 putative P/LP germline variants that were analyzed using
the PeCanPIE MedalCeremony pipeline across the predisposition genes.
Variants that were classified as gold, silver, bronze, or unknown by PeCanPIE are summarized in
different panels. Different colors represent different mutation types.
Figure 4.4. The distribution of the 172 putative P/LP germline variants classified as gold, silver,
or unknown by the PeCanPIE MedalCeremony pipeline across the 14 multiplex families with
AYA classical Hodgkin lymphoma patients.
Figure 4.5. The distribution of the 22 putative causal germline variants and the 7 putative P/LP
loss-of-function variants across the 10 multiplex families with AYA classical Hodgkin lymphoma
patients.
Each heatmap represents a family, with individuals shown in the columns and variants in the
rows. Each block with color means that the variant shown in the row is present in the individual
shown in the column. Family structures are shown at the bottom of the heatmaps. The HL
case/control status is shown at the top of the heatmap. Different colors indicate whether the
variant is a missense putative predisposition variant or a loss-of-function variant.
Figure 4.6. The distribution of the 20 putative P/LP germline variants in predisposition genes
found in the classical Hodgkin lymphoma patients of the 3 families lacking sibling controls with
sequencing data.
Each heatmap represents a family, with individuals shown in the columns and variants in the
rows. Each block with color means that the variant shown in the row is present in the individual
shown in the column. Family structures are shown at the bottom of the heatmaps. The HL
case/control status is shown at the top of the heatmap.
Conclusions
We applied a variety of epidemiological approaches in Chapters 1-4 to study genetic risk
factors for hematologic malignancies, including ALL and HL.
First, we reviewed the literature on the topic of racial/ethnic disparities in ALL incidence
and outcomes across the age spectrum. We summarized the ALL-risk SNPs that were identified
in the NHGRI-EBI GWAS Catalog or described in additional papers. Genetic and non-genetic risk
factors both contribute to the disparities in ALL risk and survival. Improving ALL survival
requires advances in precision medicine approaches, improved access to care, and inclusion of
more diverse populations in future clinical trials. Further studies are needed to investigate the
potential joint effects and interactions of genetic and environmental risk factors.
Second, we examined the association of DNA methylation at the AHRR CpG cg05575921
and a polyepigenetic smoking score of in utero tobacco smoke exposure with gene deletion
burden in childhood B-ALL cases from the CCLS using Poisson regression models and
fixed-effect meta-analyses. We provided further evidence that prenatal tobacco smoke
exposure may influence the generation of leukemia-causing somatic copy-number deletions in
childhood B-ALL cases. Analyses of mutational signatures and deletion breakpoint sequences are
required to investigate the potential mutagenic effects of tobacco smoke in childhood ALL.
Third, we investigated whether DNA methylation mediates the effect of genetic risk loci
for childhood ALL by performing EWAS, mQTL, and causal mediation analyses that included
childhood ALL cases and controls from the CCLS and the CCRLP. We found that DNA
methylation at CpG cg01139861, located in a CpG island in the promoter region of IKZF1,
mediates the effects of the IKZF1 SNP rs78396808, whereas the most significant DMP in the EWAS,
CpG cg13344587 at ARID5B, may function merely as a strong proxy of the rs7090445-C risk
allele. Our findings enhance the understanding of the functional pathways of genetic risk loci
for childhood ALL and provide new insights into the DNA methylation differences associated
with childhood ALL.
Finally, we performed rare variant calling, annotation, and filtering using germline
whole-exome sequencing data of 48 individuals, including 20 cHL cases in 14 multiplex families
with primarily AYA cHL patients from the USC twin registries and the USC Cancer Surveillance
Program. We have identified several putative novel cHL predisposition genes, and assessment
of these genes in sequencing studies of independent HL families is required to validate their
roles in HL predisposition. Future studies using whole-genome sequencing from a large cohort
of families are needed to provide a more comprehensive picture of the genome and more
detailed data to identify the susceptibility variants for HL.
References
1. Paul, S., Kantarjian, H. & Jabbour, E. J. Adult Acute Lymphoblastic Leukemia. Mayo Clinic
Proceedings 91, 1645–1666 (2016).
2. Hunger, S. P. & Mullighan, C. G. Acute Lymphoblastic Leukemia in Children. N Engl J Med
373, 1541–1552 (2015).
3. Howlader N, Noone AM, Krapcho M, Miller D, Brest A, Yu M, Ruhl J, Tatalovich Z, Mariotto
A, Lewis DR, Chen HS, Feuer EJ, Cronin KA (eds). SEER Cancer Statistics Review, 1975-2017,
National Cancer Institute. Bethesda, MD, https://seer.cancer.gov/csr/1975_2017/, based
on November 2019 SEER data submission, posted to the SEER web site, April 2020.
4. Siegel, R. L., Miller, K. D., Fuchs, H. E. & Jemal, A. Cancer Statistics, 2021. CA: A Cancer
Journal for Clinicians 71, 7–33 (2021).
5. Dores, G. M., Devesa, S. S., Curtis, R. E., Linet, M. S. & Morton, L. M. Acute leukemia
incidence and patient survival among children and adults in the United States, 2001-2007.
Blood 119, 34–43 (2012).
6. Ward, E., DeSantis, C., Robbins, A., Kohler, B. & Jemal, A. Childhood and adolescent cancer
statistics, 2014. CA: A Cancer Journal for Clinicians 64, 83–103 (2014).
7. Williams, L. A., Yang, J. J., Hirsch, B. A., Marcotte, E. L. & Spector, L. G. Is There Etiologic
Heterogeneity between Subtypes of Childhood Acute Lymphoblastic Leukemia? A Review
of Variation in Risk by Subtype. Cancer Epidemiol Biomarkers Prev 28, 846–856 (2019).
8. Wiemels, J. Perspectives on the causes of childhood leukemia. Chemico-Biological
Interactions 196, 59–67 (2012).
9. Preston, D. L. et al. Cancer incidence in atomic bomb survivors. Part III. Leukemia,
lymphoma and multiple myeloma, 1950-1987. Radiat. Res. 137, S68-97 (1994).
10. Doll, R. & Wakeford, R. Risk of childhood cancer from fetal irradiation. BJR 70, 130–139
(1997).
11. Bartley, K., Metayer, C., Selvin, S., Ducore, J. & Buffler, P. Diagnostic X-rays and risk of
childhood leukaemia. Int J Epidemiol 39, 1628–1637 (2010).
12. Linet, M. S., Kim, K. pyo & Rajaraman, P. Children’s Exposure to Diagnostic Medical
Radiation and Cancer Risk: Epidemiologic and Dosimetric Considerations. Pediatr Radiol
39, S4 (2009).
13. Curtin, K. et al. Familial risk of childhood cancer and tumors in the Li-Fraumeni spectrum in
the Utah Population Database: Implications for genetic evaluation in pediatric practice. Int
J Cancer 133, 2444–2453 (2013).
14. Klco, J. M. & Mullighan, C. G. Advances in germline predisposition to acute leukaemias and
myeloid neoplasms. Nat Rev Cancer (2020) doi:10.1038/s41568-020-00315-z.
15. Treviño, L. R. et al. Germline genomic variants associated with childhood acute
lymphoblastic leukemia. Nat Genet 41, 1001–1005 (2009).
16. Papaemmanuil, E. et al. Loci on 7p12.2, 10q21.2 and 14q11.2 are associated with risk of
childhood acute lymphoblastic leukemia. Nat Genet 41, 1006–1010 (2009).
17. Sherborne, A. L. et al. Variation in CDKN2A at 9p21.3 influences childhood acute
lymphoblastic leukemia risk. Nat Genet 42, 492–494 (2010).
18. Ellinghaus, E. et al. Identification of germline susceptibility loci in ETV6-RUNX1-rearranged
childhood acute lymphoblastic leukemia. Leukemia 26, 902–909 (2012).
19. Xu, H. et al. Novel Susceptibility Variants at 10p12.31-12.2 for Childhood Acute
Lymphoblastic Leukemia in Ethnically Diverse Populations. J Natl Cancer Inst 105, 733–742
(2013).
20. Migliorini, G. et al. Variation at 10p12.2 and 10p14 influences risk of childhood B-cell acute
lymphoblastic leukemia and phenotype. Blood 122, 3298–3307 (2013).
21. Perez-Andreu, V. et al. Inherited GATA3 variants are associated with Ph-like childhood acute
lymphoblastic leukemia and risk of relapse. Nat Genet 45, 1494–1498 (2013).
22. Walsh, K. M. et al. A Heritable Missense Polymorphism in CDKN2A Confers Strong Risk of
Childhood Acute Lymphoblastic Leukemia and Is Preferentially Selected during Clonal
Evolution. Cancer Res 75, 4884–4894 (2015).
23. Hungate, E. A. et al. A variant at 9p21.3 functionally implicates CDKN2B in paediatric B-cell
precursor acute lymphoblastic leukaemia aetiology. Nat Commun 7, (2016).
24. Vijayakrishnan, J. et al. A genome-wide association study identifies risk loci for childhood
acute lymphoblastic leukemia at 10q26.13 and 12q23.1. Leukemia 31, 573–579 (2017).
25. Clay-Gilmour, A. I. et al. Genetic association with B-cell acute lymphoblastic leukemia in
allogeneic transplant patients differs by age and sex. Blood Adv 1, 1717–1728 (2017).
26. Wiemels, J. L. et al. GWAS in childhood acute lymphoblastic leukemia reveals novel genetic
associations at chromosomes 17q12 and 8q24.21. Nat Commun 9, (2018).
27. Vijayakrishnan, J. et al. Genome-wide association study identifies susceptibility loci for B-
cell childhood acute lymphoblastic leukemia. Nature Communications 9, 1340 (2018).
28. de Smith, A. J. et al. BMI1 enhancer polymorphism underlies chromosome 10p12.31
association with childhood acute lymphoblastic leukemia. International Journal of Cancer
143, 2647–2658 (2018).
29. Qian, M. et al. Novel susceptibility variants at the ERG locus for childhood acute
lymphoblastic leukemia in Hispanics. Blood 133, 724–729 (2019).
30. Qian, M. et al. Genome-Wide Association Study of Susceptibility Loci for T-Cell Acute
Lymphoblastic Leukemia in Children. J Natl Cancer Inst 111, 1350–1357 (2019).
31. de Smith, A. J. Heritable variation at the chromosome 21 gene ERG is associated with acute
lymphoblastic leukemia risk in children with and without Down syndrome. Leukemia 33,
2732–2766 (2019).
32. Vijayakrishnan, J. et al. Identification of four novel associations for B-cell acute
lymphoblastic leukaemia risk. Nat Commun 10, 5348 (2019).
33. Shah, S. et al. A recurrent germline PAX5 mutation confers susceptibility to pre-B cell acute
lymphoblastic leukemia. Nat Genet 45, 1226–1231 (2013).
34. Gu, Z. et al. PAX5-driven Subtypes of B-progenitor Acute Lymphoblastic Leukemia. Nat
Genet 51, 296–307 (2019).
35. Noetzli, L. et al. Germline mutations in ETV6 are associated with thrombocytopenia, red cell
macrocytosis and predisposition to lymphoblastic leukemia. Nat Genet 47, 535–538
(2015).
36. Zhang, M. Y. et al. Germline ETV6 mutations in familial thrombocytopenia and hematologic
malignancy. Nat Genet 47, 180–185 (2015).
37. Moriyama, T. et al. Germline genetic variation in ETV6 and risk of childhood acute
lymphoblastic leukaemia: a systematic genetic study. The Lancet Oncology 16, 1659–1666
(2015).
38. Melazzini, F. et al. Clinical and pathogenic features of ETV6-related thrombocytopenia with
predisposition to acute lymphoblastic leukemia. Haematologica 101, 1333–1342 (2016).
39. Nishii, R. et al. Molecular Basis of ETV6-Mediated Predisposition to Childhood Acute
Lymphoblastic Leukemia. Blood blood.2020006164 (2020) doi:10.1182/blood.2020006164.
40. Kuehn, H. S. et al. Loss of B Cells in Patients with Heterozygous Mutations in IKAROS. N Engl
J Med 374, 1032–1043 (2016).
41. Yoshida, N. et al. Germline IKAROS mutation associated with primary immunodeficiency
that progressed to T-cell acute lymphoblastic leukemia. Leukemia 31, 1221–1223 (2017).
42. Churchman, M. L. et al. Germline Genetic IKZF1 Variation and Predisposition to Childhood
Acute Lymphoblastic Leukemia. Cancer Cell 33, 937-948.e8 (2018).
43. Holmfeldt, L. et al. The genomic landscape of hypodiploid acute lymphoblastic leukemia.
Nat Genet 45, 242–252 (2013).
44. Qian, M. et al. TP53 Germline Variations Influence the Predisposition and Prognosis of B-
Cell Acute Lymphoblastic Leukemia in Children. J Clin Oncol 36, 591–599 (2018).
45. Greaves, M. A causal mechanism for childhood acute lymphoblastic leukaemia. Nat Rev
Cancer 18, 471–484 (2018).
46. Whitehead, T. P., Metayer, C., Wiemels, J. L., Singer, A. W. & Miller, M. D. Childhood
Leukemia and Primary Prevention. Current Problems in Pediatric and Adolescent Health
Care 46, 317–352 (2016).
47. Liu, R., Zhang, L., McHale, C. M. & Hammond, S. K. Paternal Smoking and Risk of Childhood
Acute Lymphoblastic Leukemia: Systematic Review and Meta-Analysis. J Oncol 2011,
(2011).
48. Metayer, C. et al. Tobacco Smoke Exposure and the Risk of Childhood Acute Lymphoblastic
and Myeloid Leukemias by Cytogenetic Subtype. Cancer Epidemiology Biomarkers &
Prevention 22, 1600–1611 (2013).
49. de Smith, A. J. et al. Correlates of Prenatal and Early-Life Tobacco Smoke Exposure and
Frequency of Common Gene Deletions in Childhood Acute Lymphoblastic Leukemia.
Cancer Res 77, 1674–1683 (2017).
50. Bailey, H. D. et al. Home pesticide exposures and risk of childhood leukemia: Findings from
the Childhood Leukemia International Consortium. Int J Cancer 137, 2644–2663 (2015).
51. Hyland, C. et al. Maternal residential pesticide use and risk of childhood leukemia in Costa
Rica. Int J Cancer 143, 1295–1304 (2018).
52. Colt, J. S. & Blair, A. Parental occupational exposures and risk of childhood cancer. Environ
Health Perspect 106 Suppl 3, 909–925 (1998).
53. Bailey, H. D. et al. Home paint exposures and risk of childhood acute lymphoblastic
leukemia: Findings from the Childhood Leukemia International Consortium. Cancer Causes
Control 26, 1257–1270 (2015).
54. Ward, M. H. et al. Residential Levels of Polybrominated Diphenyl Ethers and Risk of
Childhood Acute Lymphoblastic Leukemia in California. Environ Health Perspect 122,
1110–1116 (2014).
55. Boothe, V. L., Boehmer, T. K., Wendel, A. M. & Yip, F. Y. Residential Traffic Exposure and
Childhood Leukemia: A Systematic Review and Meta-analysis. American Journal of
Preventive Medicine 46, 413–422 (2014).
56. Whitehead, T. P. et al. Dust metal loadings and the risk of childhood acute lymphoblastic
leukemia. Journal of Exposure Science & Environmental Epidemiology 25, 593–598 (2015).
57. Borrell, L. N. et al. Race and Genetic Ancestry in Medicine — A Time for Reckoning with
Racism. N Engl J Med NEJMms2029562 (2021) doi:10.1056/NEJMms2029562.
58. Alvidrez, J., Castille, D., Laude-Sharp, M., Rosario, A. & Tabor, D. The National Institute on
Minority Health and Health Disparities Research Framework. Am J Public Health 109, S16–
S20 (2019).
59. Barrington-Trimis, J. L. et al. Rising rates of acute lymphoblastic leukemia in Hispanic
children: trends in incidence from 1992 to 2011. Blood 125, 3033–3034 (2015).
60. Giddings, B. M., Whitehead, T. P., Metayer, C. & Miller, M. D. Childhood leukemia incidence
in California: High and rising in the Hispanic population. Cancer 122, 2867–2875 (2016).
61. Barrington-Trimis, J. L. et al. Trends in Childhood Leukemia Incidence Over Two Decades
from 1992–2013. Int J Cancer 140, 1000–1008 (2017).
62. Siegel, D. A. Rates and Trends of Pediatric Acute Lymphoblastic Leukemia — United States,
2001–2014. MMWR Morb Mortal Wkly Rep 66, (2017).
63. Feng, Q. et al. Trends in Acute Lymphoblastic Leukemia Incidence in the US from 2000-
2016: an Increased Risk in Latinos Across All Age Groups. Am J Epidemiol (2020)
doi:10.1093/aje/kwaa215.
64. Pui, C.-H. & Evans, W. E. A 50-year journey to cure childhood acute lymphoblastic leukemia.
Semin Hematol 50, 185–196 (2013).
65. Alvarnas, J. C. et al. Acute Lymphoblastic Leukemia, Version 2.2015. Journal of the National
Comprehensive Cancer Network 13, 1240–1279 (2015).
66. Mohseni, M., Uludag, H. & Brandwein, J. M. Advances in biology of acute lymphoblastic
leukemia (ALL) and therapeutic implications. Am J Blood Res 8, 29–56 (2018).
67. Carobolante, F., Chiaretti, S., Skert, C. & Bassan, R. Practical guidance for the management
of acute lymphoblastic leukemia in the adolescent and young adult population. Ther Adv
Hematol 11, (2020).
68. Cancer Facts & Figures. American Cancer Society (2020).
69. Inaba, H. & Mullighan, C. G. Pediatric acute lymphoblastic leukemia. Haematologica 105,
(2020).
70. Rytting, M. E. et al. Augmented Berlin-Frankfurt-Münster therapy in adolescents and young
adults (AYAs) with acute lymphoblastic leukemia (ALL). Cancer 120, 3660–3668 (2014).
71. Stock, W. et al. Favorable Outcomes for Older Adolescents and Young Adults (AYA) with
Acute Lymphoblastic Leukemia (ALL): Early Results of U.S. Intergroup Trial C10403. Blood
124, 796–796 (2014).
72. DeAngelo, D. J. et al. Long-term outcome of a pediatric-inspired regimen used for adults
aged 18–50 years with newly diagnosed acute lymphoblastic leukemia. Leukemia 29, 526–
534 (2015).
73. Rytting, M. E. et al. Final results of a single institution experience with a pediatric-based
regimen, the augmented Berlin-Frankfurt-Münster, in adolescents and young adults with
acute lymphoblastic leukemia, and comparison to the hyper-CVAD regimen. Am J Hematol
91, 819–823 (2016).
74. Sive, J. I. et al. Outcomes in older adults with acute lymphoblastic leukaemia (ALL): results
from the international MRC UKALL XII/ECOG2993 trial. British Journal of Haematology 157,
463–471 (2012).
75. Pulte, D. et al. Survival of Adults with Acute Lymphoblastic Leukemia in Germany and the
United States. PLoS One 9, (2014).
76. Bailey, C. et al. Adult Leukemia Survival Trends in the United States by Subtype: A
Population-Based Registry Study of 370,994 Patients Diagnosed During 1995-2009. Cancer
124, 3856–3867 (2018).
77. Huguet, F. et al. Intensified Therapy of Acute Lymphoblastic Leukemia in Adults: Report of
the Randomized GRAALL-2005 Clinical Trial. J Clin Oncol 36, 2514–2523 (2018).
78. O’Dwyer, K. M. & Liesveld, J. L. Philadelphia chromosome negative B-cell acute
lymphoblastic leukemia in older adults: Current treatment and novel therapies. Best
Practice & Research Clinical Haematology 30, 184–192 (2017).
79. Luskin, M. R. & DeAngelo, D. J. Mini-Hyper-CVD Combinations for Older Adults: Results of
Recent Trials and a Glimpse into the Future. Clinical Lymphoma Myeloma and Leukemia
20, S44–S47 (2020).
80. Wang, L., Bhatia, S., Gomez, S. L. & Yasui, Y. Differential Inequality Trends Over Time in
Survival Among U.S. Children with Acute Lymphoblastic Leukemia by Race/Ethnicity, Age
at Diagnosis, and Sex. Cancer Epidemiol Biomarkers Prev 24, 1781–1788 (2015).
81. Goggins, W. B. & Lo, F. F. K. Racial and ethnic disparities in survival of US children with acute
lymphoblastic leukemia: evidence from the SEER database 1988-2008. Cancer Causes
Control 23, 737–743 (2012).
82. Kahn, J. M. et al. Racial disparities in the survival of American children, adolescents, and young
adults with acute lymphoblastic leukemia, acute myelogenous leukemia, and Hodgkin
lymphoma. Cancer 122, 2723–2730 (2016).
83. Kahn, J. M. et al. An investigation of toxicities and survival in Hispanic children and
adolescents with ALL: Results from the Dana-Farber Cancer Institute ALL Consortium
protocol 05-001. Pediatr Blood Cancer 65, (2018).
84. Abrahão, R. et al. Racial/ethnic and socioeconomic disparities in survival among children
with acute lymphoblastic leukemia in California, 1988-2011: A population-based
observational study. Pediatr Blood Cancer 62, 1819–1825 (2015).
85. Murphy, C. C., Lupo, P. J., Roth, M. E., Winick, N. J. & Pruitt, S. L. Disparities in cancer
survival among adolescents and young adults: a population-based study of 88,000
patients. J Natl Cancer Inst (2021) doi:10.1093/jnci/djab006.
86. Pulte, D., Redaniel, M. T., Jansen, L., Brenner, H. & Jeffreys, M. Recent trends in survival of
adult patients with acute leukemia: overall improvements, but persistent and partly
increasing disparity in survival of patients from minority groups. Haematologica 98, 222–
229 (2013).
87. Quiroz, E. et al. The emerging story of acute lymphoblastic leukemia among the Latin
American population – biological and clinical implications. Blood Reviews 33, 98–105
(2019).
88. Pérez-Saldivar, M. L. et al. Childhood acute leukemias are frequent in Mexico City:
descriptive epidemiology. BMC Cancer 11, 355 (2011).
89. Santamaría-Quesada, C. et al. Molecular and Epidemiologic Findings of Childhood Acute
Leukemia in Costa Rica. Journal of Pediatric Hematology/Oncology 31, 131–135 (2009).
90. Moore, K. J., Hubbard, A. K., Williams, L. A. & Spector, L. G. Childhood cancer incidence
among specific Asian and Pacific Islander populations in the United States. Int J Cancer
147, 3339–3348 (2020).
91. Shah, N. et al. Ethnic disparities in survival of adult B-cell acute lymphoblastic leukemia in
modern era - a SEER analysis. Leuk Lymphoma 61, 3503–3506 (2020).
92. Chiaretti, S. et al. Philadelphia-like acute lymphoblastic leukemia is associated with minimal
residual disease persistence and poor outcome. First report of the minimal residual
disease-oriented GIMEMA LAL1913. Haematologica (2020) doi:10.3324/haematol.2020.247973.
93. Roberts, K. G. et al. Targetable Kinase-Activating Lesions in Ph-like Acute Lymphoblastic
Leukemia. N Engl J Med 371, 1005–1015 (2014).
94. Herold, T., Baldus, C. D. & Gökbuget, N. Ph-like acute lymphoblastic leukemia in older
adults. N Engl J Med 371, 2235 (2014).
95. Roberts, K. G. et al. High Frequency and Poor Outcome of Philadelphia Chromosome–Like
Acute Lymphoblastic Leukemia in Adults. J Clin Oncol 35, 394–401 (2017).
96. Jain, N. et al. Ph-like acute lymphoblastic leukemia: a high-risk subtype in adults. Blood 129,
572–581 (2017).
97. Young, R. P. & Hopkins, R. J. A review of the Hispanic paradox: time to spill the beans? Eur
Respir Rev 23, 439–449 (2014).
98. Franzini, L., Ribble, J. C. & Keddie, A. M. Understanding the Hispanic paradox. Ethn Dis 11,
496–518 (2001).
99. Swan, J. & Edwards, B. K. Cancer rates among American Indians and Alaska Natives. Cancer
98, 1262–1272 (2003).
100. Nathan, P. C., Wasilewski-Masker, K. & Janzen, L. A. Long-term outcomes in survivors of
childhood acute lymphoblastic leukemia. Hematol Oncol Clin North Am 23, 1065–1082, vi–
vii (2009).
101. Essig, S. et al. Estimating the risk for late effects of therapy in children newly diagnosed
with standard risk acute lymphoblastic leukemia using an historical cohort: A report from
the Childhood Cancer Survivor Study. Lancet Oncol 15, 841–851 (2014).
102. Gofman, I. & Ducore, J. Risk Factors for the Development of Obesity in Children Surviving
ALL and NHL. Journal of Pediatric Hematology/Oncology 31, 101–107 (2009).
103. Sadighi, Z. S. et al. Headache types, related morbidity, and quality of life in survivors of
childhood acute lymphoblastic leukemia: a prospective cross sectional study. Eur J
Paediatr Neurol 18, 722–729 (2014).
104. Shoag, J. M., Barredo, J. C., Lossos, I. S. & Pinheiro, P. S. Acute lymphoblastic leukemia
mortality in Hispanic Americans. Leuk Lymphoma 61, 2674–2681 (2020).
105. Eche, I. J. & Aronowitz, T. A Literature Review of Racial Disparities in Overall Survival of
Black Children With Acute Lymphoblastic Leukemia Compared With White Children With
Acute Lymphoblastic Leukemia. J Pediatr Oncol Nurs 37, 180–194 (2020).
106. Bryant, C., Mayhew, M., Fleites, J., Lozano, J. & Saunders, J. M. Comparison of Five-Year
Survival Rate Between Black and White Children With Acute Lymphoblastic Leukemia.
Cureus 12, e11797 (2020).
107. Siegel, S. E. et al. Pediatric-Inspired Treatment Regimens for Adolescents and Young Adults
With Philadelphia Chromosome–Negative Acute Lymphoblastic Leukemia. JAMA Oncol 4,
725–734 (2018).
108. National Cancer Institute: Cancer Stat Facts: Leukemia: Acute lymphocytic leukemia (ALL).
https://seer.cancer.gov/statfacts/html/alyl.html.
109. Burmeister, T. et al. Patients’ age and BCR-ABL frequency in adult B-precursor ALL: a
retrospective analysis from the GMALL study group. Blood 112, 918–919 (2008).
110. Byun, J. M. et al. BCR-ABL translocation as a favorable prognostic factor in elderly patients
with acute lymphoblastic leukemia in the era of potent tyrosine kinase inhibitors.
Haematologica 102, e187–e190 (2017).
111. Moorman, A. V. et al. A population-based cytogenetic study of adults with acute
lymphoblastic leukemia. Blood 115, 206–214 (2010).
112. Paulsson, K. et al. The genomic landscape of high hyperdiploid childhood acute
lymphoblastic leukemia. Nature Genetics 47, 672–676 (2015).
113. Geyer, M. B. et al. Overall survival among older US adults with ALL remains low despite
modest improvement since 1980: SEER analysis. Blood 129, 1878–1881 (2017).
114. Bassan, R., Bourquin, J.-P., DeAngelo, D. J. & Chiaretti, S. New Approaches to the
Management of Adult Acute Lymphoblastic Leukemia. JCO 36, 3504–3519 (2018).
115. Wenzinger, C., Williams, E. & Gru, A. A. Updates in the Pathology of Precursor Lymphoid
Neoplasms in the Revised Fourth Edition of the WHO Classification of Tumors of
Hematopoietic and Lymphoid Tissues. Curr Hematol Malig Rep 13, 275–288 (2018).
116. Iacobucci, I. & Mullighan, C. G. Genetic Basis of Acute Lymphoblastic Leukemia. J Clin
Oncol 35, 975–983 (2017).
117. You, M. J., Medeiros, L. J. & Hsi, E. D. T-lymphoblastic leukemia/lymphoma. Am J Clin
Pathol 144, 411–422 (2015).
118. Rowe, J. M. et al. Induction therapy for adults with acute lymphoblastic leukemia: results
of more than 1500 patients from the international ALL trial: MRC UKALL XII/ECOG E2993.
Blood 106, 3760–3767 (2005).
119. Marks, D. I. et al. T-cell acute lymphoblastic leukemia in adults: clinical features,
immunophenotype, cytogenetics, and outcome from the large randomized prospective
trial (UKALL XII/ECOG 2993). Blood 114, 5136–5145 (2009).
120. Papaemmanuil, E. et al. RAG-mediated recombination is the predominant driver of
oncogenic rearrangement in ETV6-RUNX1 acute lymphoblastic leukemia. Nat Genet 46,
116–125 (2014).
121. Bhojwani, D. et al. ETV6-RUNX1-positive childhood acute lymphoblastic leukemia:
improved outcome with contemporary therapy. Leukemia 26, 265–270 (2012).
122. Chokkalingam, A. P. et al. Genetic variants in ARID5B and CEBPE are childhood ALL
susceptibility loci in Hispanics. Cancer Causes Control 24, 1789–1795 (2013).
123. Aldrich, M. C. et al. Cytogenetics of Hispanic and White Children with Acute Lymphoblastic
Leukemia in California. Cancer Epidemiol Biomarkers Prev 15, 578–581 (2006).
124. Kurzrock, R., Kantarjian, H. M., Druker, B. J. & Talpaz, M. Philadelphia chromosome-
positive leukemias: from basic mechanisms to molecular therapeutics. Ann Intern Med
138, 819–830 (2003).
125. Igwe, I. J. et al. The presence of Philadelphia chromosome does not confer poor prognosis
in adult pre-B acute lymphoblastic leukaemia in the tyrosine kinase inhibitor era – a
surveillance, epidemiology, and end results database analysis. Br J Haematol 179, 618–626
(2017).
126. Uckun, F. M. et al. Clinical significance of Philadelphia chromosome positive pediatric
acute lymphoblastic leukemia in the context of contemporary intensive therapies: a report
from the Children’s Cancer Group. Cancer 83, 2030–2039 (1998).
127. Sawalha, Y. & Advani, A. S. Management of older adults with acute lymphoblastic
leukemia: challenges & current approaches. Int J Hematol Oncol 7, (2018).
128. Kozlowski, P. et al. Age but not Philadelphia positivity impairs outcome in older/elderly
patients with acute lymphoblastic leukemia in Sweden. Eur J Haematol 99, 141–149
(2017).
129. Harvey, R. C. et al. Rearrangement of CRLF2 is associated with mutation of JAK kinases,
alteration of IKZF1, Hispanic/Latino ethnicity, and a poor outcome in pediatric B-
progenitor acute lymphoblastic leukemia. Blood 115, 5312–5321 (2010).
130. Yang, H. et al. Non-coding germline GATA3 variants alter chromatin topology and
contribute to pathogenesis of acute lymphoblastic leukemia.
http://biorxiv.org/lookup/doi/10.1101/2020.02.23.961672 (2020)
doi:10.1101/2020.02.23.961672.
131. Herold, T. et al. Adults with Philadelphia chromosome–like acute lymphoblastic leukemia
frequently have IGH-CRLF2 and JAK2 mutations, persistence of minimal residual disease
and poor prognosis. Haematologica 102, 130–138 (2017).
132. Perez-Andreu, V. et al. A genome-wide association study of susceptibility to acute
lymphoblastic leukemia in adolescents and young adults. Blood 125, 680–686 (2015).
133. Evans, T.-J. et al. Confirmation of Childhood Acute Lymphoblastic Leukemia Variants,
ARID5B and IKZF1, and Interaction with Parental Environmental Exposures. PLoS One 9,
(2014).
134. Walsh, K. M. et al. Associations between genome-wide Native American ancestry, known
risk alleles and B-cell ALL risk in Hispanic children. Leukemia 27, 2416–2419 (2013).
135. Walsh, K. M. et al. GATA3 risk alleles are associated with ancestral components in Hispanic
children with ALL. Blood 122, 3385–3387 (2013).
136. Xu, H. et al. ARID5B Genetic Polymorphisms Contribute to Racial Disparities in the
Incidence and Treatment Outcome of Childhood Acute Lymphoblastic Leukemia. J Clin
Oncol 30, 751–757 (2012).
137. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in
141,456 humans. Nature 581, 434–443 (2020).
138. Spear, M. L. et al. Recent shifts in the genomic ancestry of Mexican Americans may alter
the genetic architecture of biomedical traits. eLife 9, e56029 (2020).
139. Karol, S. E. et al. Genetics of ancestry-specific risk for relapse in acute lymphoblastic
leukemia. Leukemia 31, 1325–1332 (2017).
140. Yang, J. J. et al. Ancestry and Pharmacogenomics of Relapse in Acute Lymphoblastic
Leukemia. Nat Genet 43, 237–241 (2011).
141. Zhang, H. et al. Association of GATA3 polymorphisms with minimal residual disease and
relapse risk in childhood acute lymphoblastic leukemia. J Natl Cancer Inst (2020)
doi:10.1093/jnci/djaa138.
142. Jain, N. et al. GATA3 rs3824662A Allele Is Overrepresented in Adult Patients with Ph-like
ALL, Especially in Patients with CRLF2 Abnormalities. Blood 130, 1430–1430 (2017).
143. Yang, J. J. et al. Inherited NUDT15 Variant Is a Genetic Determinant of Mercaptopurine
Intolerance in Children With Acute Lymphoblastic Leukemia. J Clin Oncol 33, 1235–1242
(2015).
144. Moriyama, T. et al. NUDT15 Polymorphisms Alter Thiopurine Metabolism and
Hematopoietic Toxicity. Nat Genet 48, 367–373 (2016).
145. Greaves, M. Childhood leukaemia. BMJ 324, 283–287 (2002).
146. Wiemels, J. et al. Prenatal origin of acute lymphoblastic leukaemia in children. The Lancet
354, 1499–1503 (1999).
147. Greaves, M. F., Maia, A. T., Wiemels, J. L. & Ford, A. M. Leukemia in twins: lessons in
natural history. Blood 102, 2321–2333 (2003).
148. Bateman, C. M. et al. Acquisition of genome-wide copy number alterations in monozygotic
twins with acute lymphoblastic leukemia. Blood 115, 3553–3558 (2010).
149. Torow, N. & Hornef, M. W. The Neonatal Window of Opportunity: Setting the Stage for
Life-Long Host-Microbial Interaction and Immune Homeostasis. J Immunol 198, 557–563
(2017).
150. Olszak, T. et al. Microbial exposure during early life has persistent effects on natural killer
T cell function. Science 336, 489–493 (2012).
151. Biesbroek, G. et al. Early respiratory microbiota composition determines bacterial
succession patterns and respiratory health in children. Am J Respir Crit Care Med 190,
1283–1292 (2014).
152. Urayama, K. Y., Buffler, P. A., Gallagher, E. R., Ayoob, J. M. & Ma, X. A meta-analysis of the
association between day-care attendance and childhood acute lymphoblastic leukaemia.
International Journal of Epidemiology 39, 718–732 (2010).
153. Ma, X. et al. Ethnic difference in daycare attendance, early infections, and risk of childhood
acute lymphoblastic leukemia. Cancer Epidemiol Biomarkers Prev 14, 1928–1934 (2005).
154. Urayama, K. Y. et al. Early life exposure to infections and risk of childhood acute
lymphoblastic leukemia. Int J Cancer 128, 1632–1643 (2011).
155. Von Behren, J. et al. Birth order and Risk of Childhood Cancer: A Pooled Analysis from Five
U.S. States. Int J Cancer 128, 2709–2716 (2011).
156. Marcotte, E. L., Ritz, B., Cockburn, M., Yu, F. & Heck, J. E. Exposure to infections and Risk of
Leukemia in Young Children. Cancer Epidemiol Biomarkers Prev 23, 1195–1203 (2014).
157. Marcotte, E. L. et al. Caesarean delivery and risk of childhood leukaemia: a pooled analysis
from the Childhood Leukemia International Consortium (CLIC). Lancet Haematol 3, e176–
e185 (2016).
158. Francis, S. S. et al. In utero cytomegalovirus infection and development of childhood acute
lymphoblastic leukemia. Blood 129, 1680–1684 (2017).
159. Francis, S. S. et al. Mode of Delivery and Risk of Childhood Leukemia. Cancer Epidemiol
Biomarkers Prev 23, 876–881 (2014).
160. Wang, L., Gomez, S. L. & Yasui, Y. Racial and Ethnic Differences in Socioeconomic Position
and Risk of Childhood Acute Lymphoblastic Leukemia. American Journal of Epidemiology
185, 1263–1271 (2017).
161. Bona, K., Blonquist, T. M., Neuberg, D. S., Silverman, L. B. & Wolfe, J. Impact of
Socioeconomic Status on Timing of Relapse and Overall Survival for Children Treated on
Dana-Farber Cancer Institute ALL Consortium Protocols (2000–2010). Pediatric Blood &
Cancer 63, 1012–1018 (2016).
162. Acharya, S. et al. Effects of Race/Ethnicity and Socioeconomic Status on Outcome in
Childhood Acute Lymphoblastic Leukemia. Journal of Pediatric Hematology/Oncology 38,
350–354 (2016).
163. Kehm, R. D. et al. Does socioeconomic status account for racial and ethnic disparities in
childhood cancer survival? Cancer 124, 4090–4097 (2018).
164. Bhatia, S. et al. Racial and ethnic differences in survival of children with acute
lymphoblastic leukemia. Blood 100, 1957–1964 (2002).
165. Bhatia, S. et al. Nonadherence to Oral Mercaptopurine and Risk of Relapse in Hispanic and
Non-Hispanic White Children With Acute Lymphoblastic Leukemia: A Report From the
Children’s Oncology Group. J Clin Oncol 30, 2094–2101 (2012).
166. Rosenberg, A. R., Kroon, L., Chen, L., Li, C. I. & Jones, B. Insurance status and risk of cancer
mortality among adolescents and young adults. Cancer 121, 1279–1286 (2015).
167. Krakora, R. et al. Impact of Insurance Status on Survival Outcomes in Adults With Acute
Lymphoblastic Leukemia (ALL): A Single-center Experience. Clinical Lymphoma Myeloma
and Leukemia 20, e890–e896 (2020).
168. Halpern, M. T. et al. Association of insurance status and ethnicity with cancer stage at
diagnosis for 12 cancer sites: a retrospective analysis. Lancet Oncol 9, 222–231 (2008).
169. Butow, P. et al. Review of Adherence-Related Issues in Adolescents and Young Adults With
Cancer. JCO 28, 4800–4809 (2010).
170. Schmiegelow, K. et al. The degree of myelosuppression during maintenance therapy of
adolescents with B-lineage intermediate risk acute lymphoblastic leukemia predicts risk of
relapse. Leukemia 24, 715–720 (2010).
171. Landier, W. et al. “Doing Our Part” (Taking Responsibility): A Grounded Theory of the
Process of Adherence to Oral Chemotherapy in Children and Adolescents with Acute
Lymphoblastic Leukemia. J Pediatr Oncol Nurs 28, 203–223 (2011).
172. Koren, G. et al. Systemic exposure to mercaptopurine as a prognostic factor in acute
lymphocytic leukemia in children. N Engl J Med 323, 17–21 (1990).
173. Bhatia, S. et al. 6MP adherence in a multiracial cohort of children with acute lymphoblastic
leukemia: a Children’s Oncology Group study. Blood 124, 2345–2353 (2014).
174. Wu, Y. P. et al. Adherence to Oral Medications During Maintenance Therapy Among
Children and Adolescents With Acute Lymphoblastic Leukemia: A Medication Refill
Analysis. J Pediatr Oncol Nurs 35, 86–93 (2017).
175. Rytting, M. E., Jabbour, E. J., O’Brien, S. M. & Kantarjian, H. M. Acute lymphoblastic
leukemia in adolescents and young adults. Cancer 123, 2398–2403 (2017).
176. Sharib, J. et al. Comparison of Latino and Non-Latino Patients with Ewing Sarcoma. Pediatr
Blood Cancer 61, 233–237 (2014).
177. Pui, C.-H. et al. Results of therapy for acute lymphoblastic leukemia in black and white
children. JAMA 290, 2001–2007 (2003).
178. Bleyer, A., Tai, E. & Siegel, S. Role of clinical trials in survival progress of American
adolescents and young adults with cancer—and lack thereof. Pediatr Blood Cancer 65,
e27074 (2018).
179. Pulte, D., Gondos, A. & Brenner, H. Improvement in survival in younger patients with acute
lymphoblastic leukemia from the 1980s to the early 21st century. Blood 113, 1408–1411
(2009).
180. Talarico, L., Chen, G. & Pazdur, R. Enrollment of Elderly Patients in Clinical Trials for Cancer
Drug Registration: A 7-Year Experience by the US Food and Drug Administration. JCO 22,
4626–4631 (2004).
181. Parsons, H. M. et al. Increased Clinical Trial Enrollment among Adolescent and Young Adult
Cancer Patients between 2006 and 2012-2013 in the United States. Pediatr Blood Cancer
66, e27426 (2019).
182. Hamel, L. M. et al. Barriers to Clinical Trial Enrollment in Racial and Ethnic Minority
Patients With Cancer. Cancer Control 23, 327–337 (2016).
183. Tasian, S. K., Loh, M. L. & Hunger, S. P. Philadelphia chromosome–like acute lymphoblastic
leukemia. Blood 130, 2064–2072 (2017).
184. Wolfe, D., Dudek, S., Ritchie, M. D. & Pendergrass, S. A. Visualizing genomic information
across chromosomes with PhenoGram. BioData Mining 6, 18 (2013).
185. Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association
studies, targeted arrays and summary statistics 2019. Nucleic Acids Res 47, D1005–D1012
(2019).
186. Diouf, B. et al. Association of an inherited genetic variant with vincristine-related
peripheral neuropathy in children with acute lymphoblastic leukemia. JAMA 313, 815–823
(2015).
187. Fernandez, C. A. et al. Genome-wide analysis links NFATC2 with asparaginase
hypersensitivity. Blood 126, 69–75 (2015).
188. Liu, C. et al. Clinical and Genetic Risk Factors for Acute Pancreatitis in Patients With Acute
Lymphoblastic Leukemia. J Clin Oncol 34, 2133–2140 (2016).
189. Liu, Y. et al. Genome-Wide Study Links PNPLA3 Variant With Elevated Hepatic
Transaminase After Acute Lymphoblastic Leukemia Therapy. Clin Pharmacol Ther 102,
131–140 (2017).
190. Højfeldt, S. G. et al. Genetic predisposition to PEG-asparaginase hypersensitivity in
children treated according to NOPHO ALL2008. Br J Haematol 184, 405–417 (2019).
191. Liu, C. et al. Genomewide Approach Validates Thiopurine Methyltransferase Activity Is a
Monogenic Pharmacogenomic Trait. Clin Pharmacol Ther 101, 373–381 (2017).
192. Tulstrup, M. et al. NT5C2 germline variants alter thiopurine metabolism and are associated
with acquired NT5C2 relapse mutations in childhood acute lymphoblastic leukaemia.
Leukemia 32, 2527–2535 (2018).
193. Yang, J. J. et al. Genome-wide association study identifies germline polymorphisms
associated with relapse of childhood acute lymphoblastic leukemia. Blood 120, 4197–4204
(2012).
194. Winther, J. F. & Schmiegelow, K. How safe is a standard-risk child with ALL? The Lancet
Oncology 15, 782–783 (2014).
195. Mody, R. et al. Twenty-five–year follow-up among survivors of childhood acute
lymphoblastic leukemia: a report from the Childhood Cancer Survivor Study. Blood 111,
5515–5523 (2008).
196. Pui, C.-H., Nichols, K. E. & Yang, J. J. Somatic and germline genomics in paediatric acute
lymphoblastic leukaemia. Nat Rev Clin Oncol 16, 227–240 (2019).
197. Greaves, M. Infection, immune responses and the aetiology of childhood leukaemia. Nat
Rev Cancer 6, 193–203 (2006).
198. Mullighan, C. G. et al. Genomic Analysis of the Clonal Origins of Relapsed Acute
Lymphoblastic Leukemia. Science 322, 1377–1380 (2008).
199. Schwab, C. J. et al. Genes commonly deleted in childhood B-cell precursor acute
lymphoblastic leukemia: association with cytogenetics and clinical features.
Haematologica 98, 1081–1088 (2013).
200. Mullighan, C. G. et al. Genome-wide analysis of genetic alterations in acute lymphoblastic
leukaemia. Nature 446, 758–764 (2007).
201. Joubert, B. R. et al. DNA Methylation in Newborns and Maternal Smoking in Pregnancy:
Genome-wide Consortium Meta-analysis. Am J Hum Genet 98, 680–696 (2016).
202. Joubert, B. R. et al. Maternal Smoking and DNA Methylation in Newborns: In Utero Effect
or Epigenetic Inheritance? Cancer Epidemiol Biomarkers Prev 23, 1007–1017 (2014).
203. Reese, S. E. et al. DNA Methylation Score as a Biomarker in Newborns for Sustained
Maternal Smoking during Pregnancy. Environ Health Perspect 125, 760–766 (2017).
204. Walsh, K. M. et al. Genomic ancestry and somatic alterations correlate with age at
diagnosis in Hispanic children with B-cell ALL. Am J Hematol 89, 721–725 (2014).
205. Gonseth, S. et al. Genetic contribution to variation in DNA methylation at maternal
smoking-sensitive loci in exposed neonates. Epigenetics 11, 664–673 (2016).
206. Fortin, J.-P. et al. Functional normalization of 450k methylation array data improves
replication in large cancer studies. Genome Biol 15, (2014).
207. Triche, T. J., Weisenberger, D. J., Van Den Berg, D., Laird, P. W. & Siegmund, K. D. Low-level
processing of Illumina Infinium DNA Methylation BeadArrays. Nucleic Acids Res 41, e90
(2013).
208. Aryee, M. J. et al. Minfi: a flexible and comprehensive Bioconductor package for the
analysis of Infinium DNA methylation microarrays. Bioinformatics 30, 1363–1369 (2014).
209. Huber, W. et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat
Methods 12, 115–121 (2015).
210. Gentleman, R. C. et al. Bioconductor: open software development for computational
biology and bioinformatics. Genome Biology 5, R80 (2004).
211. Machiela, M. J. & Chanock, S. J. LDlink: a web-based application for exploring population-
specific haplotype structure and linking correlated alleles of possible functional variants.
Bioinformatics 31, 3555–3557 (2015).
212. R: A language and environment for statistical computing. R Foundation for Statistical
Computing, Vienna, Austria. https://www.R-project.org/ (2020).
213. Rahmani, E. et al. Sparse PCA corrects for cell type heterogeneity in epigenome-wide
association studies. Nat Methods 13, 443–445 (2016).
214. Rahmani, E. et al. Genome-wide methylation data mirror ancestry information. Epigenetics
& Chromatin 10, 1 (2017).
215. Barrett, M. tidymeta: Tidy and Plot Meta Analyses. R package version 0.1.0.9000. (2020).
216. Viechtbauer, W. Conducting meta-analyses in R with the metafor package. Journal of
Statistical Software 36, 1–48 (2010).
217. Higgins, J. P. T. & Thompson, S. G. Quantifying heterogeneity in a meta-analysis. Statistics
in Medicine 21, 1539–1558 (2002).
218. Joubert, B. R. et al. 450K Epigenome-Wide Scan Identifies Differential DNA
Methylation in Newborns Related to Maternal Smoking during Pregnancy. Environmental
Health Perspectives 120, 1425–1431 (2012).
219. Mendes, R. D. et al. PTEN microdeletions in T-cell acute lymphoblastic leukemia are caused
by illegitimate RAG-mediated recombination events. Blood 124, 567–578 (2014).
220. Finette, B. A., O’Neill, J. P., Vacek, P. M. & Albertini, R. J. Gene mutations with
characteristic deletions in cord blood T lymphocytes associated with passive maternal
exposure to tobacco smoke. Nature Medicine 4, 1144–1151 (1998).
221. Markunas, C. A. et al. Identification of DNA Methylation Changes in Newborns Related to
Maternal Smoking during Pregnancy. Environ Health Perspect 122, 1147–1153 (2014).
222. Orsi, L. et al. Parental smoking, maternal alcohol, coffee and tea consumption during
pregnancy, and childhood acute leukemia: the ESTELLE study. Cancer Causes Control 26,
1003–1017 (2015).
223. Milne, E. et al. Parental Prenatal Smoking and Risk of Childhood Acute Lymphoblastic
Leukemia. Am J Epidemiol 175, 43–53 (2012).
224. Rhomberg, L. R., Chandalia, J. K., Long, C. M. & Goodman, J. E. Measurement error in
environmental epidemiology and the shape of exposure-response curves. Critical Reviews
in Toxicology 41, 651–671 (2011).
225. Rebagliato, M. Validation of self reported smoking. Journal of Epidemiology & Community
Health 56, 163–164 (2002).
226. Klimentopoulou, A. et al. Maternal smoking during pregnancy and risk for childhood
leukemia: a nationwide case-control study in Greece and meta-analysis. Pediatr Blood
Cancer 58, 344–351 (2012).
227. Pidsley, R. et al. Critical evaluation of the Illumina MethylationEPIC BeadChip microarray
for whole-genome DNA methylation profiling. Genome Biology 17, 208 (2016).
228. Logue, M. W. et al. The correlation of methylation levels measured using Illumina 450K
and EPIC BeadChips in blood samples. Epigenomics 9, 1363–1371 (2017).
229. Rolke, H. B., Bakke, P. S. & Gallefoss, F. Relationships between hand-rolled cigarettes and
primary lung cancer: A Norwegian experience. The Clinical Respiratory Journal 3, 152–160
(2009).
230. Schäfer, D. et al. Five percent of healthy newborns have an ETV6-RUNX1 fusion as
revealed by DNA-based GIPFEL screening. Blood 131, 821–826 (2018).
231. Mori, H. et al. Chromosome translocations and covert leukemic clones are generated
during normal fetal development. Proc Natl Acad Sci U S A 99, 8242–8247 (2002).
232. Mullighan, C. G. et al. BCR–ABL1 lymphoblastic leukaemia is characterized by the deletion
of Ikaros. Nature 453, 110–114 (2008).
233. Grant, S. G. Qualitatively and quantitatively similar effects of active and passive maternal
tobacco smoke exposure on in utero mutagenesis at the HPRT locus. BMC Pediatr 5, 20
(2005).
234. de Smith, A. J. et al. Heritable variation at the chromosome 21 gene ERG is associated with
acute lymphoblastic leukemia risk in children with and without Down syndrome. Leukemia
33, 2746–2751 (2019).
235. Wiemels, J. L. et al. GWAS in childhood acute lymphoblastic leukemia reveals novel genetic
associations at chromosomes 17q12 and 8q24.21. Nat Commun 9, 1–8 (2018).
236. Feinberg, A. P. & Tycko, B. The history of cancer epigenetics. Nat Rev Cancer 4, 143–153
(2004).
237. Baylin, S. B. & Jones, P. A. Epigenetic Determinants of Cancer. Cold Spring Harb Perspect
Biol 8, a019505 (2016).
238. Jones, P. A. & Baylin, S. B. The Epigenomics of Cancer. Cell 128, 683–692 (2007).
239. Ehrlich, M. & Lacey, M. DNA Hypomethylation and Hemimethylation in Cancer. in
Epigenetic Alterations in Oncogenesis (ed. Karpf, A. R.) 31–56 (Springer, 2013).
doi:10.1007/978-1-4419-9967-2_2.
240. You, J. S. & Jones, P. A. Cancer Genetics and Epigenetics: Two Sides of the Same Coin?
Cancer Cell 22, 9–20 (2012).
241. Garraway, L. A. & Lander, E. S. Lessons from the Cancer Genome. Cell 153, 17–37 (2013).
242. Shen, H. & Laird, P. W. Interplay Between the Cancer Genome and Epigenome. Cell 153,
38–55 (2013).
243. Liu, Y. et al. Epigenome-wide association data implicate DNA methylation as an
intermediary of genetic risk in rheumatoid arthritis. Nat Biotechnol 31, 142–147 (2013).
244. Dai, J. Y. et al. DNA methylation and cis-regulation of gene expression by prostate cancer
risk SNPs. PLoS Genet 16, e1008667 (2020).
245. Nedeljkovic, I. et al. COPD GWAS variant at 19q13.2 in relation with DNA methylation and
gene expression. Human Molecular Genetics 27, 396–405 (2018).
246. Hale, V., Hale, G. A., Brown, P. A. & Amankwah, E. K. A Review of DNA Methylation and
microRNA Expression in Recurrent Pediatric Acute Leukemia. Oncology 92, 61–67 (2017).
247. Hogan, L. E. et al. Integrated genomic analysis of relapsed childhood acute lymphoblastic
leukemia reveals therapeutic strategies. Blood 118, 5218–5226 (2011).
248. Nordlund, J. et al. Genome-wide signatures of differential DNA methylation in pediatric
acute lymphoblastic leukemia. Genome Biol 14, r105 (2013).
249. Chatterton, Z. et al. Epigenetic deregulation in pediatric acute lymphoblastic leukemia.
Epigenetics 9, 459–467 (2014).
250. Nordlund, J. & Syvänen, A.-C. Epigenetics in pediatric acute lymphoblastic leukemia.
Seminars in Cancer Biology 51, 129–138 (2018).
251. Lee, S.-T. et al. Epigenetic remodeling in B-cell acute lymphoblastic leukemia occurs in two
tracks and employs embryonic stem cell-like signatures. Nucleic Acids Research 43, 2590–
2602 (2015).
252. Xu, K. et al. Epigenetic Biomarkers of Prenatal Tobacco Smoke Exposure Are Associated
with Gene Deletions in Childhood Acute Lymphoblastic Leukemia. Cancer Epidemiol
Biomarkers Prev (2021) doi:10.1158/1055-9965.EPI-21-0009.
253. Muskens, I. S. et al. The genome-wide impact of trisomy 21 on DNA methylation and its
implications for hematopoiesis. Nature Communications 12, 821 (2021).
254. Teschendorff, A. E. et al. A beta-mixture quantile normalization method for correcting
probe design bias in Illumina Infinium 450 k DNA methylation data. Bioinformatics 29,
189–196 (2013).
255. Mah, C. K., Mesirov, J. P. & Chavez, L. An accessible GenePattern notebook for the copy
number variation analysis of Illumina Infinium DNA methylation arrays. F1000Res 7, 1897
(2018).
256. Hansen, K. IlluminaHumanMethylation450kanno.ilmn12.hg19: annotation for Illumina’s
450k methylation arrays. R package version 0.6.0 (2016).
257. Hansen, K. IlluminaHumanMethylationEPICanno.ilm10b2.hg19: annotation for Illumina’s
EPIC methylation arrays. R package version 0.6.0 (2016).
258. Chen, Y. et al. Discovery of cross-reactive probes and polymorphic CpGs in the Illumina
Infinium HumanMethylation450 microarray. Epigenetics 8, 203–209 (2013).
259. McCartney, D. L. et al. Identification of polymorphic and off-target probe binding sites on
the Illumina Infinium MethylationEPIC BeadChip. Genomics Data 9, 22–24 (2016).
260. Shu, C. et al. DNA methylation mediates the effect of cocaine use on HIV severity. Clinical
Epigenetics 12, 140 (2020).
261. Declerck, K. et al. Interaction between prenatal pesticide exposure and a common
polymorphism in the PON1 gene on DNA methylation in genes associated with cardio-
metabolic disease risk—an exploratory study. Clinical Epigenetics 9, 35 (2017).
262. Yin, L. et al. rMVP: A Memory-efficient, Visualization-enhanced, and Parallel-accelerated
tool for Genome-Wide Association Study. Genomics, Proteomics & Bioinformatics (2021)
doi:10.1016/j.gpb.2020.10.007.
263. Jeon, S. et al. Genome-wide trans-ethnic meta-analysis identifies novel susceptibility loci
for childhood acute lymphoblastic leukemia. Leukemia 1–4 (2021) doi:10.1038/s41375-
021-01465-1.
264. Shabalin, A. A. Matrix eQTL: ultra fast eQTL analysis via large matrix operations.
Bioinformatics 28, 1353–1358 (2012).
265. Tingley, D., Yamamoto, T., Hirose, K., Keele, L. & Imai, K. mediation: R Package for Causal
Mediation Analysis. J. Stat. Soft. 59, (2014).
266. VanderWeele, T. J. Principles of confounder selection. Eur J Epidemiol 34, 211–219 (2019).
267. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in
141,456 humans. http://biorxiv.org/lookup/doi/10.1101/531210 (2019)
doi:10.1101/531210.
268. Timms, J. A. et al. Exploring a potential mechanistic role of DNA methylation in the
relationship between in utero and post-natal environmental exposures and risk of
childhood acute lymphoblastic leukaemia. International Journal of Cancer 145, 2933–2943
(2019).
269. Studd, J. B. et al. Genetic and regulatory mechanism of susceptibility to high-hyperdiploid
acute lymphoblastic leukaemia at 10q21.2. Nat Commun 8, (2017).
270. Hannon, E., Weedon, M., Bray, N., O’Donovan, M. & Mill, J. Pleiotropic Effects of Trait-
Associated Genetic Variation on DNA Methylation: Utility for Refining GWAS Loci. Am J
Hum Genet 100, 954–959 (2017).
271. Hannon, E. et al. An integrated genetic-epigenetic analysis of schizophrenia: evidence for
co-localization of genetic associations and differential DNA methylation. Genome Biology
17, 176 (2016).
272. Harker, N. et al. The CD8alpha gene locus is regulated by the Ikaros family of proteins. Mol
Cell 10, 1403–1415 (2002).
273. Mullighan, C. G. et al. BCR-ABL1 lymphoblastic leukaemia is characterized by the deletion
of Ikaros. Nature 453, 110–114 (2008).
274. Mullighan, C. G. et al. Deletion of IKZF1 and Prognosis in Acute Lymphoblastic Leukemia. N
Engl J Med 360, 470–480 (2009).
275. Imai, K., Keele, L. & Tingley, D. A general approach to causal mediation analysis.
Psychological Methods 15, 309–334 (2010).
276. Hahne, F. & Ivanek, R. Visualizing Genomic Data Using Gviz and Bioconductor. in Statistical
Genomics (eds. Mathé, E. & Davis, S.) vol. 1418 335–351 (Springer New York, 2016).
277. SEER. National Cancer Institute Surveillance Epidemiology and End Results Program.
Cancer stat facts: Hodgkin lymphoma. https://seer.cancer.gov/statfacts/html/hodg.html
(2021).
278. Swerdlow, S. H. et al. WHO classification of tumours of haematopoietic and lymphoid
tissues. vol. 2 (International agency for research on cancer Lyon, France, 2008).
279. Connors, J. M. et al. Hodgkin lymphoma. Nature Reviews Disease Primers 6, 1–25 (2020).
280. Smithers, D. W. HODGKIN’S DISEASE: ONE ENTITY OR TWO? The Lancet 296, 1285–1288
(1970).
281. Cozen, W., Katz, J. & Mack, T. Hodgkin’s disease varies by cell type in Los Angeles. Cancer
Epidemiol Biomarkers Prev 1, 261–268 (1992).
282. Glaser, S. L. et al. Racial/ethnic variation in EBV-positive classical Hodgkin lymphoma in
California populations. International journal of cancer 123, 1499–1507 (2008).
283. Correa, P. & O’Conor, G. Geographic pathology of lymphoreticular tumors: summary of
survey from the geographic pathology committee of the international union against
cancer. Journal of the National Cancer Institute 50, 1609–1617 (1973).
284. Correa, P. & O’Conor, G. T. Epidemiologic patterns of hodgkin’s disease. International
journal of cancer 8, 192–201 (1971).
285. Mack, T. M., Norman, J. E., Rappaport, E. & Cozen, W. Childhood determination of Hodgkin
lymphoma among US servicemen. Cancer Epidemiology and Prevention Biomarkers 24,
1707–1715 (2015).
286. Chang, E. T. et al. Childhood social environment and Hodgkin’s lymphoma: new findings
from a population-based case-control study. Cancer Epidemiol Biomarkers Prev 13, 1361–
1370 (2004).
287. Mack, T. M. et al. Concordance for Hodgkin’s disease in identical twins suggesting genetic
susceptibility to the young-adult form of the disease. N Engl J Med 332, 413–418 (1995).
288. Thomsen, H. et al. Heritability estimates on Hodgkin’s lymphoma: a genomic- versus
population-based approach. Eur J Hum Genet 23, 824–830 (2015).
289. Cozen, W. et al. A meta-analysis of Hodgkin lymphoma reveals 19p13.3 TCF3 as a novel
susceptibility locus. Nature Communications 5, 3856 (2014).
290. Sud, A. et al. Genome-wide association study implicates immune dysfunction in the
development of Hodgkin lymphoma. Blood 132, 2040–2052 (2018).
291. Enciso-Mora, V. et al. A genome-wide association study of Hodgkin Lymphoma identifies
new susceptibility loci at 2p16.1 (REL), 8q24.21, and 10p14 (GATA3). Nat Genet 42, 1126–
1130 (2010).
292. Urayama, K. Y. et al. Genome-Wide Association Study of Classical Hodgkin Lymphoma and
Epstein–Barr Virus Status–Defined Subgroups. J Natl Cancer Inst 104, 240–253 (2012).
293. Cozen, W. et al. A genome-wide meta-analysis of nodular sclerosing Hodgkin lymphoma
identifies risk loci at 6p21.32. Blood 119, 469–475 (2012).
294. Best, T. et al. Variants at 6q21 implicate PRDM1 in the etiology of therapy-induced second
malignancies after Hodgkin lymphoma. Nat Med 17, 941–943 (2011).
295. Delahaye-Sourdeix, M. et al. A Novel Risk Locus at 6p21.3 for Epstein–Barr Virus-Positive
Hodgkin Lymphoma. Cancer Epidemiol Biomarkers Prev 24, 1838–1843 (2015).
296. Kushekhar, K. et al. Genetic associations in classical hodgkin lymphoma: a systematic
review and insights into susceptibility mechanisms. Cancer Epidemiol Biomarkers Prev 23,
2737–2747 (2014).
297. Salipante, S. J. et al. Mutations in a gene encoding a midbody kelch protein in familial and
sporadic classical Hodgkin lymphoma lead to binucleated cells. Proceedings of the National
Academy of Sciences 106, 14920–14925 (2009).
298. Saarinen, S. et al. Exome sequencing reveals germline NPAT mutation as a candidate risk
factor for Hodgkin lymphoma. Blood 118, 493–498 (2011).
299. Ristolainen, H. et al. Identification of homozygous deletion in ACAN and other candidate
variants in familial classical Hodgkin lymphoma by exome sequencing. British Journal of
Haematology 170, 428–431 (2015).
300. Rotunno, M. et al. Whole exome sequencing in families at high risk for Hodgkin
lymphoma: identification of a predisposing mutation in the KDR gene. Haematologica 101,
853–860 (2016).
301. Bandapalli, O. R. et al. Whole genome sequencing reveals DICER1 as a candidate
predisposing gene in familial Hodgkin lymphoma. International Journal of Cancer 143,
2076–2078 (2018).
302. McMaster, M. L. et al. Germline Mutations in Protection of Telomeres 1 in Two Families
with Hodgkin Lymphoma. Br J Haematol 181, 372–377 (2018).
303. Srivastava, A. et al. Identification of Familial Hodgkin Lymphoma Predisposing Genes Using
Whole Genome Sequencing. Front Bioeng Biotechnol 8, (2020).
304. Cozen, W. et al. The USC Adult Twin Cohorts: International Twin Study and California Twin
Program. Twin Research and Human Genetics 16, 366–370 (2013).
305. Nextera Rapid Capture Exomes - A rapid workflow and comprehensive exome content,
with unparalleled flexibility. (2015).
306. Cock, P. J. A., Fields, C. J., Goto, N., Heuer, M. L. & Rice, P. M. The Sanger FASTQ file format
for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids
Res 38, 1767–1771 (2010).
307. Edwards, J. A. & Edwards, R. A. Fastq-pair: efficient synchronization of paired-end fastq
files. http://biorxiv.org/lookup/doi/10.1101/552885 (2019) doi:10.1101/552885.
308. Andrews, S. et al. FastQC: a quality control tool for high throughput sequence data.
http://www.bioinformatics.babraham.ac.uk/projects/fastqc (2012).
309. Krueger, F. Trim Galore: a wrapper tool around Cutadapt and FastQC to consistently apply
quality and adapter trimming to FastQ files, with some extra functionality for MspI-
digested RRBS-type (Reduced Representation Bisulfite-Seq) libraries.
http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/ (accessed 28 April 2016)
(2012).
310. Ewels, P., Magnusson, M., Lundin, S. & Käller, M. MultiQC: summarize analysis results for
multiple tools and samples in a single report. Bioinformatics 32, 3047–3048 (2016).
311. Van der Auwera, G. & O’Connor, B. Genomics in the Cloud: Using Docker, GATK, and WDL
in Terra (1st Edition). (O’Reilly Media, 2020).
312. DePristo, M. A. et al. A framework for variation discovery and genotyping using next-
generation DNA sequencing data. Nat Genet 43, 491–498 (2011).
313. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.
arXiv:1303.3997 [q-bio] (2013).
314. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–
2079 (2009).
315. Church, D. M. et al. Extending reference assembly models. Genome Biol 16, 13 (2015).
316. Rausch, T., Hsi-Yang Fritz, M., Korbel, J. O. & Benes, V. Alfred: interactive multi-sample
BAM alignment statistics, feature counting and feature annotation for long- and short-
read sequencing. Bioinformatics 35, 2489–2491 (2019).
317. McKenna, A. et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing
next-generation DNA sequencing data. Genome Res 20, 1297–1303 (2010).
318. UCSC Genome Browser website. Lift Genome Annotations. https://genome.ucsc.edu/cgi-
bin/hgLiftOver.
319. Wickham, H. ggplot2: Elegant graphics for data analysis. (Springer-Verlag New York,
2016).
320. Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples.
http://biorxiv.org/lookup/doi/10.1101/201178 (2017) doi:10.1101/201178.
321. The Variant Call Format (VCF) Version 4.2 Specification. https://github.com/samtools/hts-
specs (2021).
322. Carson, A. R. et al. Effective filtering strategies to improve data quality from population-
based whole exome sequencing studies. BMC Bioinformatics 15, 125 (2014).
323. Knaus, B. J. & Grünwald, N. J. vcfr: a package to manipulate and visualize variant call
format data in R. Molecular Ecology Resources 17, 44–53 (2017).
324. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants
from high-throughput sequencing data. Nucleic Acids Research 38, e164–e164 (2010).
325. O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status,
taxonomic expansion, and functional annotation. Nucleic Acids Research 44, D733–D745
(2016).
326. Karczewski, K. J. et al. The ExAC browser: displaying reference data information from over
60 000 exomes. Nucleic Acids Res 45, D840–D845 (2017).
327. Sherry, S. T. dbSNP: the NCBI database of genetic variation. Nucleic Acids Research 29,
308–311 (2001).
328. Liu, X., Wu, C., Li, C. & Boerwinkle, E. dbNSFP v3.0: A One-Stop Database of Functional
Predictions and Annotations for Human Non-synonymous and Splice Site SNVs. Hum
Mutat 37, 235–241 (2016).
329. Landrum, M. J. et al. ClinVar: public archive of relationships among sequence variation and
human phenotype. Nucleic Acids Res 42, D980–D985 (2014).
330. Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J. & Kircher, M. CADD: predicting the
deleteriousness of variants throughout the human genome. Nucleic Acids Res 47, D886–
D894 (2019).
331. The NHLBI Trans-Omics for Precision Medicine (TOPMed) Whole Genome Sequencing
Program. BRAVO variant browser: University of Michigan and NHLBI.
https://bravo.sph.umich.edu/freeze5/hg38/ (2018).
332. Edmonson, M. N. et al. Pediatric Cancer Variant Pathogenicity Information Exchange
(PeCanPIE): a cloud-based platform for curating and classifying germline variants. Genome
Res. 29, 1555–1565 (2019).
333. Gröbner, S. N. et al. The landscape of genomic alterations across childhood cancers.
Nature 555, 321–327 (2018).
334. Worst, B. C. et al. Next-generation personalised medicine for high-risk paediatric cancer
patients - The INFORM pilot study. Eur J Cancer 65, 91–101 (2016).
335. Zhang, J. et al. Germline Mutations in Predisposition Genes in Pediatric Cancer. N Engl J
Med 373, 2336–2346 (2015).
336. Thorvaldsdóttir, H., Robinson, J. T. & Mesirov, J. P. Integrative Genomics Viewer (IGV):
high-performance genomics data visualization and exploration. Brief Bioinform 14, 178–
192 (2013).
337. Zhou, X. et al. Correspondence to Nature Genetics. Nat Genet 48, 4–6 (2016).
338. Dou, J. et al. Estimation of kinship coefficient in structured and admixed populations using
sparse sequencing data. PLoS Genet 13, (2017).
339. Gu, Z., Eils, R. & Schlesner, M. Complex heatmaps reveal patterns and correlations in
multidimensional genomic data. Bioinformatics 32, 2847–2849 (2016).
340. Kopanos, C. et al. VarSome: the human genomic variant search engine. Bioinformatics 35,
1978–1980 (2019).
341. Amberger, J. S., Bocchini, C. A., Schiettecatte, F., Scott, A. F. & Hamosh, A. OMIM.org:
Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and
genetic disorders. Nucleic Acids Research 43, D789–D798 (2015).
342. Stelzer, G. et al. The GeneCards Suite: From Gene Data Mining to Disease Genome
Sequence Analyses. Current Protocols in Bioinformatics 54, 1.30.1-1.30.33 (2016).
343. Sugie, H., Sugie, Y., Ito, M. & Fukuda, T. A Novel Missense Mutation (837T→C) in the
Phosphoglycerate Kinase Gene of a Patient With a Myopathic Form of Phosphoglycerate
Kinase Deficiency. J Child Neurol 13, 95–97 (1998).
344. Bischof, J. M. et al. Genome-wide identification of pseudogenes capable of disease-causing
gene conversion. Human Mutation 27, 545–552 (2006).
345. Beutler, E. PGK deficiency. British Journal of Haematology 136, 3–11 (2007).
346. He, Y. et al. PGK1-mediated cancer progression and drug resistance. Am J Cancer Res 9,
2280–2302 (2019).
347. He, Y. et al. PGK1 contributes to tumorigenesis and sorafenib resistance of renal clear cell
carcinoma via activating CXCR4/ERK signaling pathway and accelerating glycolysis. Cell
Death Dis 13, 1–15 (2022).
348. Karaca, E. et al. Genes that affect brain structure and function identified by rare variant
analyses of Mendelian neurologic disease. Neuron 88, 499–513 (2015).
349. Julien, S. G., Dubé, N., Hardy, S. & Tremblay, M. L. Inside the human cancer tyrosine
phosphatome. Nat Rev Cancer 11, 35–49 (2011).
350. Scott, A. & Wang, Z. Tumour suppressor function of protein tyrosine phosphatase
receptor-T. Biosci Rep 31, 303–307 (2011).
351. Yu, J. et al. Tumor-derived extracellular mutations of PTPRT/PTPrho are defective in cell
adhesion. Mol Cancer Res 6, 1106–1113 (2008).
352. Zhang, P. et al. Cancer-derived mutations in the fibronectin III repeats of PTPRT/PTPrho
inhibit cell-cell aggregation. Cell Commun Adhes 16, 146–153 (2009).
353. Besco, J. A., Hooft van Huijsduijnen, R., Frostholm, A. & Rotter, A. Intracellular substrates
of brain-enriched receptor protein tyrosine phosphatase rho (RPTPrho/PTPRT). Brain Res
1116, 50–57 (2006).
354. Flavell, J. R. et al. Down-regulation of the TGF-beta target gene, PTPRK, by the Epstein-Barr
virus encoded EBNA1 contributes to the growth and survival of Hodgkin lymphoma cells.
Blood 111, 292–301 (2008).
355. Gunawardana, J. et al. Recurrent somatic mutations of PTPN1 in primary mediastinal B cell
lymphoma and Hodgkin lymphoma. Nat Genet 46, 329–335 (2014).
356. Nachmias, B. & Schimmer, A. D. Targeting nuclear import and export in hematological
malignancies. Leukemia 34, 2875–2886 (2020).
357. Camus, V. et al. Detection and prognostic value of recurrent exportin 1 mutations in tumor
and cell-free circulating DNA of patients with classical Hodgkin lymphoma. Haematologica
101, 1094–1101 (2016).
358. Wang, Y. et al. Abstract 2482: RanBP17 retards the epithelial-mesenchymal transition
(EMT) and malignant progression of glioblastoma cells by regulating the exportation of
beta-catenin from cell nucleus. Cancer Research 78, 2482 (2018).
359. Schwarz, K. et al. Mutations affecting the secretory COPII coat component SEC23B cause
congenital dyserythropoietic anemia type II. Nat Genet 41, 936–940 (2009).
360. Bianchi, P. et al. Congenital dyserythropoietic anemia type II (CDAII) is caused by
mutations in the SEC23B gene. Human Mutation 30, 1292–1298 (2009).
361. Bruce, L. J. & Tanner, M. J. Erythroid band 3 variants and disease. Baillieres Best Pract Res
Clin Haematol 12, 637–654 (1999).
362. Bogusławska, D. M., Heger, E. & Sikorski, A. F. Molecular mechanism of hereditary
spherocytosis. Pol Merkur Lekarski 20, 112–116 (2006).
363. Mathieu, A.-L. et al. PRKDC mutations associated with immunodeficiency, granuloma, and
autoimmune regulator–dependent autoimmunity. J Allergy Clin Immunol 135, 1578-
1588.e5 (2015).
364. Barbosa, M. D. F. S. et al. Identification of Mutations in Two Major mRNA Isoforms of the
Chediak-Higashi Syndrome Gene in Human and Mouse. Human Molecular Genetics 6,
1091–1098 (1997).
365. Machaczka, M. et al. Development of classical Hodgkin’s lymphoma in an adult with
biallelic STXBP2 mutations. Haematologica 98, 760–764 (2013).
366. Alders, M. et al. Hennekam syndrome can be caused by FAT4 mutations and be allelic to
Van Maldergem syndrome. Hum Genet 133, 1161–1167 (2014).
367. Reichel, J. et al. Flow sorting and exome sequencing reveal the oncogenome of primary
Hodgkin and Reed-Sternberg cells. Blood 125, 1061–1072 (2015).
368. Vukic, M. & Daxinger, L. DNA methylation in disease: Immunodeficiency, Centromeric
instability, Facial anomalies syndrome. Essays Biochem 63, 773–783 (2019).
369. Spina, V. et al. The genetics of nodal marginal zone lymphoma. Blood 128, 1362–1373
(2016).
370. Chen, B. et al. Acute Intermittent Porphyria: Predicted Pathogenicity of HMBS Variants
Indicates Extremely Low Penetrance of the Autosomal Dominant Disease. Hum Mutat 37,
1215–1222 (2016).
371. Schneider-Yin, X. et al. Biallelic inactivation of protoporphyrinogen oxidase and
hydroxymethylbilane synthase is associated with liver cancer in acute porphyrias. Journal
of Hepatology 62, 734–738 (2015).
372. Yassin, E. R., Abdul-Nabi, A. M., Takeda, A. & Yaseen, N. R. Effects of the NUP98-DDX10
oncogene on primary human CD34+ cells: Role of a conserved helicase motif. Leukemia 24,
1001–1011 (2010).
373. Lagresle-Peyrou, C. et al. X-linked primary immunodeficiency associated with hemizygous
mutations in the moesin (MSN) gene. Journal of Allergy and Clinical Immunology 138,
1681-1689.e8 (2016).
Abstract
Acute lymphoblastic leukemia (ALL) is the most common childhood malignancy in the United States, with approximately 2,700 incident cases diagnosed under age 15 each year. Although current treatment protocols result in an overall survival rate that exceeds 90% in childhood ALL patients in the US, long-term survivors experience significant adverse effects from therapy. Therefore, prevention remains a top priority, and understanding the causes of childhood ALL remains essential.
In chapter 1, we reviewed the racial/ethnic disparities in ALL incidence, discussed how these vary across the age spectrum, and examined the potential causes of these disparities. Genome-wide association studies have identified a growing number of single nucleotide polymorphisms (SNPs) associated with childhood ALL. Several ALL-risk SNPs are associated with genetic ancestry and show different risk allele frequencies and/or effect sizes across populations. Moreover, non-genetic factors including socioeconomic status, access to care, and environmental exposures all likely influence the disparities in ALL risk and survival.
Parental smoking is implicated in the etiology of childhood ALL, but the causal mechanisms remain largely unclear. In chapter 2, we assessed the association between early-life tobacco smoke exposures and somatic gene deletions using two epigenetic biomarkers for maternal smoking during pregnancy (DNA methylation at AHRR CpG cg05575921 and a recently established polyepigenetic smoking score) in 482 B-ALL cases in the California Childhood Leukemia Study with available Illumina 450K or MethylationEPIC array data. We found an association between DNA methylation at AHRR CpG cg05575921 and deletion number (meta-analysis summary ratio of means [sRM]=1.32; 95% CI: 1.10-1.57), and the polyepigenetic smoking score was positively associated with gene deletion frequency among all 482 B-ALL cases (sRM=1.31 for each 4-unit increase in score; 95% CI: 1.09-1.57). We provide further evidence that prenatal tobacco-smoke exposure may influence the generation of somatic copy-number deletions in childhood B-ALL. Analyses of deletion breakpoint sequences are required to further understand the mutagenic effects of tobacco smoke in childhood ALL.
Genome-wide association studies have identified a growing number of SNPs associated with childhood ALL, yet the functional roles of most SNPs are unclear. Evidence suggests epigenetic mechanisms may mediate the impact of heritable genetic variation on phenotypes. In chapter 3, we investigated whether DNA methylation mediates the effect of genetic risk loci for childhood ALL. We performed an epigenome-wide association study (EWAS) including 808 childhood ALL cases and 919 controls from California-based studies using neonatal blood DNA. For the differentially methylated CpG positions (DMPs), we then tested associations with 23 known ALL risk SNPs and ran causal mediation analyses on the significant SNP-DMP pairs. DNA methylation at CpG cg01139861, in the promoter region of IKZF1, mediated the effects of the intronic IKZF1 risk SNP rs78396808, with the average causal mediation effect (ACME) explaining ~30% of the total effect (ACME P=0.0031). In analyses stratified by self-reported race/ethnicity, the mediation effect was only significant in Latinos, explaining ~41% of the total effect of rs78396808 on ALL risk (ACME P=0.0037). We also demonstrated that the most significant DMP in the EWAS, CpG cg13344587 at gene ARID5B (P=8.61×10⁻¹⁰), was entirely confounded by the ARID5B ALL risk SNP rs7090445. Our findings provide new insights into the functional pathways of ALL risk SNPs and the DNA methylation differences associated with risk of childhood ALL.
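To make the mediation quantities above concrete, the following is a minimal sketch in Python of how an average causal mediation effect and the proportion of the total effect it explains can be computed in the simplest linear case. It is not the analysis performed in chapter 3, which involves a case-control outcome. The data are simulated, and the function names `ols` and `linear_mediation`, the effect sizes, and the sample size are hypothetical choices made only for illustration.

```python
import numpy as np


def ols(columns, y):
    """Least-squares coefficients for y ~ 1 + columns (intercept first)."""
    design = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def linear_mediation(x, m, y):
    """Product-of-coefficients mediation for a continuous outcome."""
    a = ols([x], m)[1]             # exposure -> mediator
    _, direct, b = ols([x, m], y)  # direct effect and mediator -> outcome
    acme = a * b                   # mediated (indirect) effect
    total = direct + acme          # total effect in the linear case
    return {"acme": acme, "direct": direct, "total": total,
            "prop": acme / total}


# Simulated, hypothetical data: a SNP dosage, a methylation level, an outcome.
rng = np.random.default_rng(0)
n = 5000
x = rng.binomial(2, 0.3, size=n).astype(float)
m = 0.5 * x + rng.normal(size=n)
y = 0.7 * x + 0.6 * m + rng.normal(size=n)

result = linear_mediation(x, m, y)
print(f"proportion mediated: {result['prop']:.2f}")
```

With these simulated effect sizes the true proportion mediated is 0.5 × 0.6 / (0.7 + 0.5 × 0.6) = 0.30, so the estimate should land near that value. The decomposition total = direct + ACME is exact for ordinary least squares but need not hold exactly for nonlinear models such as logistic regression.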
Hodgkin lymphoma (HL) is a B-cell malignancy that affects ~2-3 per 100,000 individuals per year in the United States. Sequencing of familial and sporadic HL patients previously identified rare pathogenic germline variants in genes with varying functions. The mechanisms by which these low-frequency, high-penetrance variants contribute to HL risk have yet to be determined. In chapter 4, we performed germline whole-exome sequencing (WES) for 48 individuals, including 20 classical HL (cHL) cases in 14 multiplex families with primarily adolescent and young adult (AYA) cHL patients (age at diagnosis: 17-50 years), to identify novel cHL predisposition genes. Rare germline short variant calling, annotation, and filtering were performed. In 11 families with unaffected sibling controls, we found 22 putative pathogenic germline variants in 22 genes only among the cHL patients. One variant in the cancer-related gene PGK1 was reported as pathogenic in VarSome and pathogenic/likely pathogenic (P/LP) in ClinVar in patients with phosphoglycerate kinase 1 deficiency associated with hemolytic anemia. Two variants in cancer-related genes PTPRT and RANBP17 were classified by VarSome with evidence of pathogenicity. In the 3 families with cHL patients but lacking sequencing data from sibling controls, we found 3 P/LP variants in genes associated with blood disorders (SLC4A1 and SEC23B) or immunodeficiency (PRKDC). In the analysis agnostic to family structure, we found 5 LP loss-of-function variants in genes associated with immunodeficiency (DCLRE1C and DNMT3B) or cancers including lymphoma and leukemia (PTPRD, DDX10, and HMBS) shared by the patients and their unaffected controls in 4 families. We have identified several putative novel cHL predisposition genes, and assessment of these genes in sequencing studies of independent HL families is required to validate their roles in HL predisposition.
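The within-family comparison described above (variants seen only among patients and not in their unaffected siblings) can be sketched as a set operation. This is an illustrative simplification and not the dissertation's pipeline: the function name `case_only_variants`, the sample labels, and the variant identifiers are hypothetical, and the rule "carried by at least one affected member and by no unaffected sibling" is an assumption about how such a filter might be defined.

```python
def case_only_variants(carriers, affected, unaffected):
    """Variants carried by any affected member and by no unaffected sibling.

    carriers maps a sample label to the set of variant IDs that sample carries.
    """
    in_cases = set().union(*(carriers[s] for s in affected))
    in_controls = set().union(*(carriers[s] for s in unaffected))
    return in_cases - in_controls


# Hypothetical family with two affected members and one unaffected sibling.
family_carriers = {
    "case_1": {"var_A", "var_B"},
    "case_2": {"var_B", "var_C"},
    "sibling_1": {"var_A", "var_C"},
}

candidates = case_only_variants(
    family_carriers,
    affected=["case_1", "case_2"],
    unaffected=["sibling_1"],
)
print(sorted(candidates))
```

A stricter variant of this rule would require every affected member to carry the variant, which amounts to replacing the union over cases with an intersection.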
Asset Metadata
Creator
Xu, Keren
(author)
Core Title
Genetic epidemiological approaches in the study of risk factors for hematologic malignancies
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Epidemiology
Degree Conferral Date
2022-05
Publication Date
10/05/2023
Defense Date
03/07/2022
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
acute lymphoblastic leukemia,bioinformatics,DNA methylation,genetic epidemiology,hodgkin's lymphoma,OAI-PMH Harvest
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
de Smith, Adam (committee chair), Bell, Oliver (committee member), Cockburn, Myles (committee member), Siegmund, Kimberly (committee member), Wiemels, Joseph (committee member)
Creator Email
xukeren418@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC110879211
Unique identifier
UC110879211
Document Type
Dissertation
Rights
Xu, Keren
Type
texts
Source
20220406-usctheses-batch-919 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu