Understanding Acute Lymphoblastic Leukemia in different ethnic groups in the United States
by
Soyoung Jeon
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(CANCER BIOLOGY AND GENOMICS)
May 2023
Copyright 2023 Soyoung Jeon
Dedication
This thesis is dedicated to my parents, Young-tea Jeon and Eun-sook Kim.
Mom and Dad, the past 5 years, or rather the past 30 years, were possible because you are my parents.
Thank you. I respect you. I love you.
Acknowledgements
My most profound gratitude goes to my advisors, Dr. Charleston Chiang and Dr. Joseph
Wiemels, not only for the research opportunities they provided, but also for their incredible
guidance, mentorship, and support. Their vast wisdom has never ceased to amaze and inspire me
throughout these five years. I want to thank them for having faith in me and providing me the space
and freedom to grow both as a scientist and as a person, and for always being so kind and generous
to me. I could not wish for a better advisor than them and it has been such a memorable and
enjoyable life experience working with them.
I would like to express my gratitude to Dr. Adam de Smith for his invaluable advice,
continuous support, and kind feedback throughout my studies. Working with him was very
enjoyable and I have learned so much from him. Many thanks to the rest of my defense committee,
Dr. Hooman Allayee and Dr. Amie Hwang for their time and insightful comments.
I would like to thank the members of Chiang lab, past and present, for their friendship and
insights. Most of all, I want to thank Dr. Minhui Chen for his help, generosity, and encouragement
during the first years of my PhD. His talent and perseverance have always inspired me and
motivated me.
In addition, I would like to thank my friends for their continued support during my studies.
I would specifically like to recognize Libere for the chocolate cold brew, mint mojito, morning
hikes and long chats that made my time here much more enjoyable and kept my sanity in check
during the hard times. I thank Dr. Byung San Moon for his guidance and support during our daily
lunches and chats - our conversation always helped to keep me on track. I also want to thank Dr.
Jiyun Kang for her kind help and support especially during the challenging times.
I would like to express the deepest gratitude to Dr. Guo Yu whose unwavering support was
crucial to complete my study. Thank you for those little things you've done - long drives, the best
chicken, shrimp, seaweed soups, late night walks and chats. Thank you for celebrating with me
whenever little things go right, and for your understanding and encouragement whenever I had
doubts.
And last, and certainly most of all, I want to thank my parents for their unconditional love
and support. I would not be who I am today without you. I would like to thank my one and only
brother, Chanho Jeon for always listening to me, giving me the opportunities to laugh, being the
shoulder to lean and cry on, and providing the reason to stay strong.
TABLE OF CONTENTS
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Genome-wide trans-ethnic meta-analysis identifies novel susceptibility loci for childhood acute lymphoblastic leukemia
  Introduction
  Materials and Methods
  Results
  Discussion
Chapter 3: Evaluating genomic Polygenic Risk Scores for childhood Acute Lymphoblastic Leukemia in Latino Americans
  Introduction
  Materials and Methods
  Results
  Discussion
Chapter 4: Risk allele associated with childhood acute lymphoblastic leukemia at the IKZF1 locus is associated with Indigenous American ancestry and absent in European-ancestry populations
  Introduction
  Methods
  Results
  Discussion
Chapter 5: Conclusion
References
List of Tables
Table 2.1: Summary statistics for the reported variants, the top variant in the loci from our meta-analysis, and the linkage disequilibrium between the two variants in NLW and LAT
Table 2.2: Summary of conditional analysis to identify secondary associations at known loci
Table 3.1: Summary of sample size for each strategy of constructing PRS
Table 3.2: Performance of the best model for NLW_NLW strategy across different testing datasets
Table 3.3: Summary of best performing model for each strategy tested on CCLS LAT
Table 3.4: Predictive performance of best performing LDPred2 model vs. LDPred2-Inf
Table 3.5: Predictive performance of best performing genomic PRS model vs. conventional model constructed with 23 known ALL risk SNPs
Table 3.6: Comparison of effect sizes and P-values of SNPs in known risk loci for ALL in CCRLP and CCLS NLW
Table 4.1: Marginal and conditional analysis results for three independent risk SNPs in IKZF1 in Latinos
Table 4.2: Correlation between IKZF1 SNP risk alleles and global Indigenous American ancestry proportion
Table 4.3: IKZF1 SNP associations in CCRLP Latinos stratified by local ancestry
List of Figures
Figure 2.1: Summary result of the trans-ethnic meta-analysis on ALL
Figure 2.2: Novel loci associated with childhood ALL in trans-ethnic meta-analysis
Figure 2.3: Secondary association signal (p < 5 × 10^-8) with ALL found in previously known loci through conditional analysis
Figure 2.4: Polygenic Risk Score (PRS) distribution based on GWAS loci for ALL
Figure 3.1: Summary of study design and analysis
Figure 4.1: Conditional association results for CCRLP Latinos
Figure 4.2: IKZF1 SNP rs76880433 is associated with global and local Indigenous American ancestry
Figure 4.3: Origins and selection of IKZF1 risk allele for childhood ALL
ABSTRACT
Despite advances in treatment, acute lymphoblastic leukemia (ALL) remains a leading
cause of childhood mortality in the U.S. [82,83]. The disease risk for ALL shows substantial
differences across racial/ethnic populations in the United States. For example, Latino children
have the highest risk of ALL, with an incidence rate ~30-40% higher than that in non-Latino
whites, who in turn have a ~50% higher incidence rate than non-Latino blacks [84,85].
Latino children also have an increased chance of relapse and poorer overall survival compared to
children from other ethnic populations [86,87]. Environmental and birth characteristics
could have contributed to these disparities, such as high birth weight, diet, maternal folate intake
and alcohol consumption, and in-utero pesticide exposure [1]. Genetic variation and genetic
ancestry also likely play an important role in this ethnic disparity [88-92]. In addition, ALL is a
cancer of the immune-forming cells, and epidemiological studies have shown that both prenatal
immune development and postnatal infectious disease histories can contribute to risk of ALL [93-95].
Infection and immunity are also prime drivers of natural selection in human history, through
which phenotypic differences can arise between populations [96].
The overarching goal of this dissertation is to improve our understanding of the genetic
architecture of ALL and to investigate the genetic mechanisms through which differences in
disease risk may arise between populations, notably between Latinos and non-Latino whites.
Specifically, we first performed the largest genome-wide association study for ALL across ethnic
populations representing multiple ancestries. Our effort identified three novel putative loci
associated with ALL, in addition to secondary independent variants in two previously known
loci. Second, leveraging the genome-wide summary statistics from our investigation, we
constructed genomic polygenic risk score (PRS) models and evaluated their efficacy in predicting
and stratifying individuals based on estimated genetic risks in Latino and non-Latino white
populations. We found that genomic PRS models based on multi-ethnic meta-analysis summary
statistics perform comparably in both Latino and non-Latino white populations. However,
possibly due to the unique oligogenic architecture of ALL, the genomic PRS models are currently
not necessarily more efficacious than a naïve PRS model based on only genome-wide
significant loci. Third, we investigated in detail the association signal at one ALL locus, IKZF1.
We identified a novel associated variant at this locus whose effect on ALL is specific to
the Latino population. The association signal for this variant was masked by other associations that
are shared across populations, and the variant overlaps a putative functional regulatory element
based on histone modification peaks. More importantly, the risk allele is prevalent in Latinos due to its
origin on an Indigenous American haplotype, and the allele exhibits a signature of positive
selection in analyses of local genealogical trees.
In his book Epidemics, Hippocrates is credited with recommending that physicians
“declare the past, diagnose the present, [and] foretell the future.” For childhood ALL, we have
investigated its past by studying the evolutionary mechanism contributing to elevated risk of
ALL among Latinos at the IKZF1 locus. We have delineated its present through a genome-wide
association study of populations with multiple ancestries to identify additional loci contributing
to ALL risk today. We have improved the prospects of foretelling its future by evaluating risk
prediction models based on polygenic risk scores for ALL. Together, we have undertaken a
multi-pronged approach through functional genomics, statistical genetics, and population genetics to
better understand the genetic etiology of ALL.
CHAPTER 1: INTRODUCTION
Since its genesis in the study of simple Mendelian inheritance of physiologic traits a
century ago, the field of human genetics has experienced substantial advances. Advances in
highly accurate, cost-effective SNP array and sequencing technology have allowed genome-wide
association studies (GWAS) to become routine for studying various complex traits
and diseases. By allowing an unbiased approach to systematically interrogate millions of variants
across the genome, GWAS have successfully identified thousands of disease-associated variants
and provided deeper insights into the contribution of genetic factors to various complex
diseases [2]. Moreover, these studies have revealed the polygenic nature of many human traits and
diseases. Polygenic risk scores (PRS) constructed using the loci identified in GWAS have also
been shown to provide clinically actionable insights and enable more personalized medical care. By
aggregating the genetic effects across the genome, these scores can measure the overall genetic
liability to a trait, and have shown promise in individualized disease risk and trajectory
predictions [3–7].
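As a minimal illustration of the aggregation described above, a PRS is commonly computed as a weighted sum of risk-allele dosages, with per-allele GWAS effect sizes as the weights. The sketch below uses hypothetical SNP identifiers, effect sizes, and genotypes that are not taken from this dissertation, and it is not the scoring pipeline used in later chapters.

```python
from typing import Dict


def polygenic_risk_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies per SNP).

    Weights are per-allele effect sizes (e.g. log odds ratios) from a GWAS.
    SNPs that are missing from the genotype data are skipped.
    """
    return sum(beta * dosages[snp] for snp, beta in weights.items() if snp in dosages)


# Hypothetical effect sizes (log odds ratios) for three made-up SNPs.
weights = {"snp_a": 0.40, "snp_b": 0.25, "snp_c": -0.10}
# Hypothetical risk-allele counts for one individual.
dosages = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

score = polygenic_risk_score(dosages, weights)
print(f"PRS = {score:.2f}")
```

In practice the raw score is usually standardized against a reference population before individuals are compared or stratified.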
However, despite their immense success, the vast majority of genome-wide association
studies have been carried out in European-ancestry populations [8–10]. The availability of large sample
sizes, funding, and biased genotyping technologies could have been the practical reasons that
GWAS started out being Euro-centric. But it is now more widely acknowledged that more
diverse population samples are needed to expand the effectiveness of genomic medicine. For
example, it has been noted that the predictive performance of PRS is greatly reduced across different
ancestry groups. The prediction accuracy of PRS for 17 anthropometric and blood panel traits
constructed using GWAS of European-ancestry individuals drops to about 25% and 65% of that
achieved in Europeans when tested on individuals of African or Latino ancestry, respectively [10].
The differences in linkage disequilibrium (LD) between causal variants and assayed SNPs, in allele
frequencies, and in causal variant effect sizes could explain most of the loss in prediction accuracy
across populations [11]. Worse, this relative drop in PRS performance may be further exacerbated
when confounded by environmental, demographic, or social risk factors [12–15]. In addition to
improving the predictive accuracy of individual-level scores, inclusion of diverse ethnic groups can
also improve the resolution of fine-mapping and the accuracy of effect size estimates, not to mention the
public health implications of better understanding the genetic contribution to disparities in disease
risk.
The most common childhood cancer, acute lymphoblastic leukemia (ALL), is no
exception to this Euro-centric bias in genetic studies. With its incidence consistently
increasing, ALL represents 20% of childhood cancers and remains a leading cause of
childhood mortality in the United States [16–19]. Among all ethnic groups, Latino Americans (LAT)
consistently show the highest incidence and worst treatment outcomes, with 1.3 times the risk of
developing the disease compared to non-Latino whites (NLW) and double the risk of African
Americans (AA) [20]. However, the reasons for the increased risk of ALL in Latino children remain
elusive. The first GWAS of ALL in a large, ethnically diverse population, and population-specific
studies focusing on LAT, have identified novel ALL-associated loci that were not
previously found in European-only studies [21–23], highlighting the need for studies of diverse
populations in order to understand the complete spectrum of genetic susceptibility to childhood
ALL and how it contributes to disparities in incidence. This dissertation aims to better
understand the genetic etiology of ALL in diverse populations, and the genetic contribution to
the apparent ethnic disparity. The long-term goal is to emphasize the need to incorporate
ethnically diverse subjects in studies of ALL and ultimately to translate the genomic insights to
improve healthcare in these populations. To aid in understanding and to motivate the work
described in later chapters, I will first briefly review the epidemiology and etiology of ALL. I
will then discuss the impact of evolutionary history on disease architecture, as it may be an
important factor in the observed population differences in ALL risk, and finally I will present an
outline of the three chapters that follow.
Acute Lymphoblastic Leukemia
Leukemia is a heterogeneous group of malignancies of hematopoietic tissue that can be
classified by the affected cell line and the stage at which hematopoiesis is disrupted. Acute
lymphoblastic leukemia (ALL) is characterized by uncontrolled multiplication of immature
blood cells of the lymphoid lineage, in either the B-cell or T-cell pathway, that compete with
normal cells in the marrow and interfere with normal hematopoiesis.
The causes of ALL are multifactorial and vary based on molecular subtype and age at
diagnosis [24]. Recent advances in genetic studies and molecular analyses have led to a better
understanding of disease etiology, suggesting an important role for both genetic and
environmental risk factors. Both rare germline mutations with high penetrance and common
variants with low penetrance have been reported, including in important genes encoding
hematopoietic transcription factors (IKZF1, CEBPE, ARID5B, GATA3, ELK3), cell-cycle
regulators (CDKN2A/CDKN2B, SP4), and chromatin remodeling enzymes (BMI1) [25].
Studies of the correlation between ALL and pre- and postnatal environmental exposures have also
suggested the possible influence of non-genetic factors such as high birth weight, diet, maternal
folate intake and alcohol consumption, and in-utero pesticide exposure [1]. Yet not only are the
precise mechanisms by which these identified risk factors influence leukemogenesis incompletely
understood, but the majority of ALL etiology also remains unexplained.
Epidemiologically, ALL is the most common type of cancer in children less than 15 years
of age, with approximately 3000 newly diagnosed cases each year [26,27]. Although combination
chemotherapy has dramatically improved the cure rates of childhood ALL, it remains a
leading cause of childhood mortality in the US, and survivors still struggle with significant
treatment-related morbidities and medical problems [28–31]. Moreover, despite the overall
decrease in cancer incidence in the general population, the incidence of childhood ALL has been
increasing significantly across all races and ethnicities over the past two decades [20].
ALL also exhibits substantial differences in incidence by race and ethnicity. Incidence rates are
highest in European, American (North, Central and South), and Oceanian (Australia and New
Zealand) countries, intermediate in Asian countries, and lowest
in African countries [32]. In the United States, Latino children have been consistently shown to
have the highest and fastest-increasing risk of ALL. They have been reported to have not only
higher incidence (43 per 1 million vs. 34 per 1 million) but also poorer survival (86.3% vs.
92.1%) than their NLW counterparts [20,33,34]. The underlying cause of this disparity is
unknown but is most likely an interplay of environmental and genetic risk factors [33,35–37].
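The incidence figures quoted above can be turned into a rate ratio with simple arithmetic. The short sketch below uses only the two rates given in the text (43 and 34 cases per 1 million); the variable names are illustrative.

```python
# Incidence rates quoted in the text, in cases per 1 million children.
latino_rate = 43.0
nlw_rate = 34.0

# Incidence rate ratio: how many times higher the Latino rate is than the NLW rate.
rate_ratio = latino_rate / nlw_rate
# The same comparison expressed as a percent excess over the NLW rate.
percent_excess = (rate_ratio - 1.0) * 100.0

print(f"rate ratio = {rate_ratio:.2f}, excess = {percent_excess:.0f}%")
```

This works out to a ratio of roughly 1.26, in line with the approximately 1.3-fold risk cited earlier in this chapter.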
Genetic variants can play an important role if their allele frequency or magnitude of
association differs by ethnicity, or when they are associated with disease in an ancestry-specific
manner. For example, an ARID5B SNP genotype was shown to be associated with ALL in AA, NLW,
and LAT alike, but its allele frequency differs between groups and is highest in LAT [38,39]. Studies
have also noted the association of ALL with Indigenous American (IA) ancestry in LAT. The risk
allele frequencies at ARID5B, GATA3, and ERG were associated with both local and global IA
ancestry in LAT, and the effect size of the ERG locus on ALL risk is correlated with increasing
IA ancestry [22,23]. However, more investigations are needed to fully unravel the genetic basis of
the ethnic disparity.
The Impact of Evolutionary History on Disease Architecture.
An often underappreciated fact is that the genetic risk of a disease for an individual or the
disparity of genetic risk between populations could be intimately tied to the evolutionary
histories of human populations. Broadly speaking, two major evolutionary forces shape the
pattern of genetic variation: the demographic history of a population, and the adaptive history
(i.e. natural selection) of the population. Mediated by its impact on the pattern of genetic
variation, these evolutionary events in turn influence the phenotypic consequences. Because
different human populations may have recently experienced different evolutionary histories,
these events could contribute to the portion of the disparity in disease risk that is attributable to
genetics.
The demographic history of a population includes historical events in which the population
size may have expanded or contracted, or in which migration and admixture occurred with other
populations. These events can alter the way alleles vary in frequencies and numbers in a
population. A unique demographic event such as a population bottleneck can leave lasting
consequences on genetic and phenotypic differences between populations. For instance,
functionally deleterious variants surviving through a bottleneck will more easily overcome the
impact of negative selection to reach higher frequencies [40]. Indeed, in simulations, a greater
proportion of the heritability of a trait is explained by relatively common alleles in a
bottlenecked population than in a non-bottlenecked, recently expanded population [41]. In practice,
the predicted impacts of demographic events have already aided the design of human
genetics studies. For example, the Finnish population, a well-known founder population, is
enriched for functionally deleterious alleles compared to non-Finnish Europeans, thereby
enabling the successful mapping of novel associations with quantitative cardiometabolic traits [42–44].
Similarly, as a consequence of these types of unique demography, unusually common alleles
with large effects on type 2 diabetes (T2D), obesity, height, lipids, and multiple sclerosis were
mapped in isolated populations like the Inuit, Sardinians, and Peruvians [45–51]. In all of these
cases, the discovered alleles are rare in large outbred non-Finnish European populations and thus
would be difficult to discover if we do not expand GWAS to diverse, non-European populations.
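To make the effect of a bottleneck concrete, the sketch below simulates neutral genetic drift under a simple Wright-Fisher model and compares how far allele frequencies wander in a small versus a large population. It is an illustrative toy model with assumed population sizes, generation count, and random seed; it ignores selection and is not the simulation framework of the studies cited above.

```python
import random
from statistics import pvariance


def drift(freq: float, pop_size: int, generations: int, rng: random.Random) -> float:
    """Return the allele frequency after neutral Wright-Fisher drift.

    Each generation, 2 * pop_size allele copies (diploid population) are
    drawn at random according to the previous generation's frequency.
    """
    copies = 2 * pop_size
    for _ in range(generations):
        carriers = sum(1 for _ in range(copies) if rng.random() < freq)
        freq = carriers / copies
    return freq


def spread(pop_size: int, replicates: int = 100, seed: int = 1) -> float:
    """Variance of the final allele frequency across independent replicates."""
    rng = random.Random(seed)
    finals = [drift(0.5, pop_size, generations=10, rng=rng) for _ in range(replicates)]
    return pvariance(finals)


# Assumed sizes: a bottlenecked population (N=25) and a larger one (N=500).
small_var = spread(pop_size=25)
large_var = spread(pop_size=500)
print(f"variance after drift: N=25 -> {small_var:.4f}, N=500 -> {large_var:.4f}")
```

Frequencies drift much further from their starting value in the small population, which is how a bottleneck can carry some alleles, including weakly deleterious ones, to unexpectedly high frequency.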
In the Americas, Indigenous Americans exhibit similar features in their demographic
history. Indigenous Americans are known to have experienced multiple bottlenecks.
Beginning about 25,000 years ago, the Last Glacial Maximum drastically reduced sea levels and
formed the Bering land bridge that joined Northeast Asia and North America. This allowed
migrants from various parts of Asia to cross the present-day Bering Strait and eventually
migrate southward to inhabit the Americas [52,53]. Because the Americas were the last major
continental landmass to be inhabited, the ancestors of Indigenous Americans experienced serial
founder events as they reached the Americas [54] and as they expanded within them [55].
Furthermore, when these isolated American populations encountered European explorers during
the colonial era, the encounter reduced their population by up to 90%, partly due to a lack of
immunity against pathogens brought by Europeans [56]. These unique historical events are expected
to facilitate enrichment for unusually differentiated alleles in individuals with Indigenous
American ancestry, some of which would be functionally deleterious and exhibit phenotypic effects.
Present-day Latinos, encompassing non-Spanish speaking populations in Latin America, are the
largest population that contains a significant component of Indigenous American ancestry [57,58].
It is therefore through large-scale incorporation of Latinos into GWAS that we will have the
opportunity to assay and assess the association of alleles enriched on Indigenous American
haplotypes. Indeed, studies have already found unusually common alleles with large effects on T2D
in Mexican populations due to Indigenous American ancestry [59–61].
A second major factor shaping the genetic variation of a population is natural selection.
As humans migrated and occupied the globe throughout history, they were subjected to a variety
of novel selective pressures due to climate, diet, UV exposure, and pathogens [62]. These
adaptations left a lasting genetic signature that we can now detect in data, for example through
allele frequency-based comparisons guided by GWAS loci [63–69]. More importantly, in modern
society, these adaptations could also lead to unwanted consequences that manifest as diseases
and contribute towards differences in risk between populations. For instance, because of local
selection, a missense variant in the CREBRF gene segregates at ~26% frequency in
Samoans but is not found outside of Polynesia, contributing to the excess risk of obesity
observed among these populations [70–76]. As another example, variants in the SH2B3 gene were
positively selected in Europe because they protect against bacterial infections, but they also
increase celiac disease risk [77]. Indigenous Americans have similarly experienced their own unique
adaptive history. As they came to inhabit previously uninhabited land in Beringia and the Americas,
there were likely novel interactions with the environment that could drive adaptation [78,79].
Furthermore, the aforementioned encounter with European explorers also likely posed a significant
pressure for adaptation. The severe population decline was driven by exposure to new pathogens; the
responsible infections remain unknown, although smallpox, measles, mumps, typhus, and
cholera are prime candidates [80]. Considering that immune response is a prime driver of natural
selection in human history, and that ALL is a cancer of the immune system in which several genes
involved in lymphocyte development and regulation are implicated in conferring disease
risk, it is possible that the adaptive history of Indigenous Americans contributed to the
elevated risk of ALL among Latinos today.
Taken together, a better understanding of our evolutionary past will enable better designs
and interpretation of human genetics studies, help address the disparity among diverse
populations today, and may even allow incorporation of evolutionary insights into our clinical
practices. With respect to ALL, an evolutionary perspective would argue for better incorporation
of diverse, non-European populations into GWAS to better leverage the different population
histories that may have enriched for different alleles across populations. Moreover, an emphasis
on populations with strong Indigenous American ancestry may also be able to leverage the
unique adaptive history of this population to identify alleles with unique and specific impact on a
population such as the Latinos today.
Dissertation Overview.
In the remaining chapters of this dissertation, I will describe our efforts to
address the shortcomings of previous genetic studies of ALL. In particular, previous
studies were predominantly conducted in populations of European ancestry, and the identified
risk loci do not fully explain the heritable risk in Latinos.
In Chapter 2 of this dissertation, we performed the largest available GWAS of ALL in
participants across four ethnic groups and multiple genetic ancestries. In addition to replicating
previously identified risk loci, we identified five novel associations at genome-wide
significance. Furthermore, we compared the efficacy and distribution of PRS constructed from
our multi-ethnic meta-analysis in Latinos and non-Latino whites, and assessed the genetic
correlation between the two populations. In Chapter 3, I will describe our effort to construct,
train, and assess genomic polygenic risk scores based on genome-wide summary statistics from
Chapter 2 using two different methods. I constructed PRS models based on either a
European-ancestry-specific GWAS or a multi-ethnic GWAS. I then explored different strategies for
constructing these PRS models to identify the model that most accurately estimates risk in
LAT. In Chapter 4, I will describe our effort to characterize a population-specific
secondary signal at the IKZF1 gene identified in our multi-ancestry GWAS in Chapter 2. In
addition to two independent associations identified in European-centric or multi-ancestry
GWAS, we identified a third independently associated allele conferring risk of ALL specific to
Latinos. We investigated the association of risk alleles with both global and local Indigenous
American ancestry and the signature of selection in 1000 Genomes populations. Finally, in the
last chapter I will summarize what has been accomplished in this dissertation and propose
possible future studies that will build upon this dissertation to further enhance our understanding
of the genetic architecture of ALL.
CHAPTER 2
Genome-wide trans-ethnic meta-analysis identifies novel susceptibility loci
for childhood acute lymphoblastic leukemia
Soyoung Jeon (1,2), Adam J. de Smith (1), Shaobo Li (1,2), Minhui Chen (1), Tsz Fung Chan (1),
Ivo S. Muskens (1), Libby M. Morimoto (3), Andrew T. DeWan (4,5), Nicholas Mancuso (1,6,7),
Catherine Metayer (3), Xiaomei Ma (5), Joseph L. Wiemels (1), Charleston W.K. Chiang (1,6)
Affiliations:
1. Center for Genetic Epidemiology, Department of Preventive Medicine, Keck School of
Medicine, University of Southern California, Los Angeles, CA
2. Cancer Biology and Genomics Graduate Program, Program in Biological and Biomedical
Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA
3. Division of Epidemiology & Biostatistics, School of Public Health, University of
California, Berkeley, CA
4. Center for Perinatal, Pediatric and Environmental Epidemiology, Yale School of Public
Health, New Haven, CT
5. Department of Chronic Disease Epidemiology, Yale School of Public Health, New Haven,
CT
6. Department of Quantitative and Computational Biology, University of Southern California,
Los Angeles, CA
7. Norris Comprehensive Cancer Center, Keck School of Medicine, University of Southern
California, Los Angeles, CA
Originally published as:
Jeon, S., de Smith, A.J., Li, S., Chen, M., Chan, T.F., Muskens, I.S., Morimoto, L.M., DeWan,
A.T., Mancuso, N., Metayer, C., et al. (2021). Genome-wide trans-ethnic meta-analysis identifies
novel susceptibility loci for childhood acute lymphoblastic leukemia. Leukemia.
10.1038/s41375-021-01465-1.
Introduction
Acute lymphoblastic leukemia (ALL) is the most common type of childhood cancer
worldwide, with substantial racial and ethnic differences in incidence and treatment
outcome (refs. 82,83). Previous genome-wide association studies (GWAS) have confirmed the
genetic basis of ALL susceptibility by identifying a number of risk loci for childhood ALL
(refs. 84–89) and estimating the heritability to be 21% (ref. 90). However, the known risk loci
together account for a relatively small portion of the total variance in genetic risk of ALL
(ref. 90), suggesting that additional susceptibility alleles may be discovered in larger studies.
Furthermore, these studies were generally performed in cohorts of predominantly European
ancestry. Latino children have the highest risk of ALL in the United States, with an incidence
rate ~15-40% higher than in non-Latino whites (refs. 17,91,92) and an increased chance of relapse
and poorer overall survival (refs. 36,93). Yet, we have a limited understanding of the genetic
architecture of ALL in non-European populations and the generalizability of findings from
existing GWAS to non-European populations (but see recent efforts for studying the genetic
etiology of childhood ALL in Latinos; refs. 8,22,23). While environmental or social factors likely
underlie some if not the majority of the differences in risk between ethnic groups, there may
also be a difference in the genetic risk architecture that modulates risk across ethnic groups
and would argue for the greater inclusion of other ethnic groups in genetic studies of ALL.
Given this context, we performed a trans-ethnic GWAS of childhood ALL in a discovery
panel consisting of 76 317 individuals from an assembled multi-ethnic cohort. We note the
complexity of discussing race, ethnicity and ancestry in a genetic study. As a convention, we
used the following terms and abbreviations to refer to each ethnic group in our study: African
American (AFR), East Asian (EAS), Latino American (LAT), and non-Latino white (NLW).
These population labels are largely based on self-reported ethnic identity and we confirmed that
they largely correlate with genetic ancestry as defined by the reference populations in 1000
Genomes (ref. 94; Methods). Our cohort consisted of 3 482 cases and 72 835 controls for an
effective sample size of 13 292, which is, to our knowledge, the largest trans-ethnic GWAS for
ALL to date. We identified three novel ALL risk loci and tested the novel findings from our
discovery panel in two additional independent cohorts. We further compared the efficacy of
polygenic risk scores (PRS) to stratify individuals based on their risk of ALL in the two largest
subgroups of our data, LAT and NLW. PRS models are known to transfer poorly to non-European
populations (ref. 10), but multi-ethnic designs may be more effective in identifying alleles with
shared effects across populations without explicit fine-mapping and produce more comparable PRS
models between populations (refs. 95,96). Finally, we leveraged our genome-wide summary
statistics to contrast the genetic architecture of ALL between LAT and NLW populations.
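The effective sample size quoted above can be checked with a short calculation. The text does not state the formula used; the sketch below assumes the common convention N_eff = 4 / (1/N_cases + 1/N_controls), which gives a value within one of the reported 13 292. The function name is illustrative.

```python
def effective_sample_size(n_cases: int, n_controls: int) -> float:
    """Effective sample size of a case-control study, using the assumed
    convention 4 / (1/n_cases + 1/n_controls)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


# Case and control counts reported for the discovery panel.
n_eff = effective_sample_size(3482, 72835)
print(f"{n_eff:.1f}")
```

Under this convention, a heavily unbalanced design such as this one is limited mainly by the number of cases: adding further controls raises the effective sample size only slowly.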
Materials and Methods
Study Cohorts
The California Childhood Cancer Record Linkage Project (CCRLP) includes all children
born in California during 1982-2009 and diagnosed with ALL at the age of 0-14 years per
California Cancer Registry records. Children who were born in California during the same period
and not reported to California Cancer Registry as having any childhood cancer were considered
potential controls. Detailed information on sample matching, preparation and genotyping has
been previously described (ref. 85). Because ALL is a rare childhood cancer, for the purpose of a
genetic study we followed previous practice (ref. 85) and incorporated additional controls using
adult individuals from the Kaiser Resource for Genetic Epidemiology Research on Aging Cohort
(GERA; dbGaP accession: phs000788.v1.p2). The GERA cohort was chosen because a very
similar genotyping platform had been used: Affymetrix Axiom World arrays. For replication we
included two independent ALL cohorts: (1) individuals of predominantly European ancestry
from the Children’s Oncology Group (COG; dbGaP accession: phs000638.v1.p1) as cases and
from the Wellcome Trust Case–Control Consortium (WTCCC; ref. 97) as controls; and (2)
individuals of European and Latino ancestry from the California Childhood Leukemia Study
(CCLS), a non-overlapping California case-control study (1995-2008; ref. 98). The quality control
and imputation for both the discovery and replication cohorts were conducted in ethnic strata
and generally followed previous pipelines of ALL GWAS, but with additional attention paid to
incorporating the entire GERA cohort and ensuring data quality post-imputation. This study was approved by
Institutional Review Boards at the California Health and Human Services Agency, University of
Southern California, Yale University, and the University of California San Francisco.
Association Testing
We used SNPTEST (ref. 99; v2.5.2) to test the association between imputed genotype dosage
and case-control status in logistic regression, after adjusting for the top 20 principal components
(PCs). Sex was not included as a covariate, and we found sex was not correlated with genotype
dosage of any of the putatively associated SNPs (data not shown). Results from the four
ethnic-stratified analyses were combined via fixed-effect meta-analysis with variance weighting
using METAL (ref. 100). Only variants passing QC in at least three of the four ethnic groups were
meta-analyzed. A genome-wide threshold of 5 x 10^-8 was used for significance in the discovery
stage. A Bonferroni-corrected significance threshold of 0.00312 (=0.05/16) was used for replication
of previously reported susceptibility variants (refs. 84–89,101–103). Cochran's Q-test for
heterogeneity was performed using METAL (ref. 100). To perform conditional analysis in identifying
secondary associations within a locus, the lead SNP was additionally included in the regression
model, again using 5 x 10^-8 as the threshold for significance.
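As an illustration of the combining step, the sketch below implements a generic inverse-variance-weighted fixed-effect meta-analysis together with Cochran's Q statistic. It is not METAL itself, and it assumes that "variance weighting" refers to inverse-variance weights. The per-group effect sizes (log odds ratios) and standard errors in the usage lines are made-up values, not results from this study.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis.

    betas: per-group log odds ratios; ses: their standard errors.
    Returns (pooled beta, pooled SE, two-sided P, Cochran's Q, Q degrees of freedom).
    """
    weights = [1.0 / se ** 2 for se in ses]
    w_sum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / w_sum
    se = math.sqrt(1.0 / w_sum)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P-value
    # Cochran's Q: weighted squared deviations from the pooled estimate,
    # to be compared against a chi-square with (k - 1) degrees of freedom.
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    return beta, se, p, q, len(betas) - 1


# Hypothetical per-group estimates for one SNP (illustrative values only).
beta, se, p, q, df = fixed_effect_meta(
    betas=[0.20, 0.15, 0.25, 0.18], ses=[0.10, 0.12, 0.05, 0.04]
)
print(beta, se, p, q, df)
```

Groups with smaller standard errors dominate the pooled estimate, which is why an association can be "driven by" the largest stratum even when the other groups point in the same direction.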
Polygenic Risk Score Analysis
Polygenic risk scores (PRS) for ALL were constructed using PLINK (v2.0) by summing
the genotype dosages of risk alleles, each weighted by its effect size from our discovery GWAS
meta-analysis. PRS were constructed based on: (1) lead SNPs in the 16 known loci (N = 18
SNPs, including variants from the two secondary signals in IKZF1 and CDKN2A/B that were
previously reported, for which we used the corresponding effect sizes from conditional analysis),
and (2) the same SNPs plus the novel hits (N = 23 SNPs, including the additional 3 novel
loci and 2 novel conditional associations). Associations between PRS and case-control status for
ALL were tested in each group adjusting for 20 PCs using R. To evaluate the predictive power of
PRS, the area under the receiver operating characteristic curve (AUC) was calculated using the
pROC package (ref. 104) in R.
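The scoring rule described above is a weighted sum. The following sketch shows it in plain Python rather than PLINK; the SNP identifiers are taken from this chapter, but the weights and dosages are placeholder values chosen for illustration, and the function name is hypothetical.

```python
def polygenic_risk_score(dosages, weights):
    """Sum of risk-allele dosages (0 to 2), each weighted by its effect size
    (for example, the log odds ratio from the discovery meta-analysis)."""
    return sum(weights[snp] * dosages[snp] for snp in weights)


# Placeholder weights (log odds ratios) and one individual's imputed dosages.
weights = {"rs9376090": 0.24, "rs10998283": 0.14}
dosages = {"rs9376090": 1.0, "rs10998283": 1.8}
score = polygenic_risk_score(dosages, weights)
print(round(score, 3))
```

Because imputed dosages are fractional, the score is continuous even when only a handful of SNPs are included.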
Genetic architecture of ALL within and between populations
To investigate the genetic architecture of ALL and contrast this architecture between
NLW and LAT populations, we estimated the percentage of familial relative risk (FRR)
explained by associated variants individually or in aggregate, the heritability ascribable to all
post-QC imputed SNPs with MAF ≥ 0.05, the genetic correlation between NLW and LAT, and
the genome-wide proportion of causal variants that are population-specific or population-shared.
Results
Trans-ethnic Genetic Associations with ALL
We performed a trans-ethnic meta-analysis GWAS for childhood ALL. After quality
control filtering, our dataset consisted of 3 482 cases and 72 835 controls (Methods) in total. In
contrast to the previous trans-ethnic analysis (ref. 85), we included additional controls for NLW
and added the EAS cohort. Furthermore, we tested the association at 7 628 894 imputed SNPs,
including low frequency (MAF between 1-5%) variants that were not previously systematically
tested. We aggregated the summary statistics across the four ethnic groups in a fixed-effect
meta-analysis. The genomic control inflation factor was 1.022 after excluding 16 previously known
ALL-associated loci (Table 2.1), suggesting our meta-analysis was reasonably robust to any
confounding due to population stratification (Figure 2.1). In total, twelve loci reached
genome-wide significance (i.e., P < 5.0 x 10^-8) in our analysis.
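For readers unfamiliar with the genomic control inflation factor, a minimal sketch of its usual definition follows: the median observed 1-df chi-square statistic divided by the median expected under the null (about 0.4549). The z-scores in the usage line are toy values, not study data, and the function name is illustrative.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(z_scores):
    """Genomic control inflation factor: median observed chi-square
    statistic (z squared) divided by its expected median under the null."""
    chi2 = [z * z for z in z_scores]
    return statistics.median(chi2) / CHI2_1DF_MEDIAN


# Toy z-scores, not study data.
print(genomic_inflation([-1.2, 0.3, 0.7, -0.5, 2.1, 0.1, -0.9]))
```

A value close to 1, such as the 1.022 reported here, indicates that the bulk of test statistics follow the null distribution.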
We found that the 16 previously published risk loci for ALL (refs. 23,84–90,102,103) were all
associated with ALL at the nominal level (P < 0.05) or had a SNP nearby with strong
association (Table 2.1). Nearly all of the published risk SNPs show consistent direction of effects
across ethnic groups (13/16 SNPs with heterogeneity P-value > 0.05; P = 0.0384, 0.006, 0,
0.000259 for AFR, EAS, LAT, NLW respectively for consistent direction of effect by the sign
test). In some cases, the published SNP is not the SNP with the most significant association in
our dataset, though usually our top SNP in the locus is in strong LD with the reported SNP
(Table 2.1). Given the larger sample size and trans-ethnic analysis, the best associated variants in
our analysis may reflect the more likely causal / shared association across populations. Two loci,
at C5orf56 and TLE1, are of note. At the C5orf56 locus on 5q31, the variant previously reported in
an independent European-ancestry cohort (rs886285) to be associated with a particular subtype
of ALL (HD-ALL; ref. 90) was not nominally associated with ALL overall (P = 0.63) in our dataset.
A weakly linked SNP (rs11741255; r2 = 0.35 in NLW, 0.19 in LAT) in the same locus
approximately 20kb away was significantly associated with ALL in our data (P = 1.69 x 10^-4) but
may reflect a chance association. At the TLE1 locus on 9q21, neither the published variant nor
our top variant in the locus would be considered significantly associated after Bonferroni
correction (minimum P = 1.06 x 10^-2 for rs62579826), possibly due to heterogeneity driven by
EAS in which both the published variant and our top variant are monomorphic (ref. 105).
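The sign test used above can be illustrated with an exact binomial calculation. The sketch below assumes a one-sided test against a 50% chance of a matching direction. The numbers of concordant SNPs per group are not given in the text, so the inputs in the usage lines (12 of 16 and 15 of 16) are assumptions, chosen because they reproduce two of the reported P-values.

```python
from math import comb


def sign_test_p(n_consistent: int, n_total: int) -> float:
    """One-sided exact binomial P-value for observing at least
    n_consistent matching directions out of n_total under chance (p = 0.5)."""
    tail = sum(comb(n_total, k) for k in range(n_consistent, n_total + 1))
    return tail / 2 ** n_total


# Assumed counts (not stated in the text).
print(f"{sign_test_p(12, 16):.4f}")    # 0.0384
print(f"{sign_test_p(15, 16):.6f}")    # 0.000259
```

With all 16 directions matching, the same calculation gives about 1.5 x 10^-5, which would round to the "0" reported for one group.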
More importantly, we discovered three putatively novel susceptibility loci: one at 6q23
and two at 10q21 (Figure 2.1). The strongest association signal in 6q23 is at rs9376090
(P = 8.23 x 10^-9, OR = 1.27) in the intergenic region between MYB and HBS1L (Figure 2.2A). This
association is mainly driven by NLW, presumably due to its large sample size. In 10q21, there
were two independent signals that showed genome-wide significance. One locus was identified
with the lead SNP rs9415680 (P = 7.27 x 10^-8, OR = 1.20), within a broad association peak, with
apparently long-range LD with SNPs covering NRBF2, JMJD1C, and parts of REEP3 (Figure
2.2B). The second locus in 10q21 was identified 5Mb away, with lead SNP rs10998283
(P = 3.92 x 10^-8, OR = 1.15) in an intronic region in TET1 (Figure 2.2C). The association signals
for both loci in 10q21 were largely driven by LAT. We used the convention of the nearest genes to
refer to these loci for the remainder of the manuscript, acknowledging that they may not be the
causal genes.
To replicate our findings in independent datasets, we tested the associations of the three
novel variants and their LD proxies (with P < 5 x 10^-7; n = 141) in independent samples from the
COG/WTCCC and CCLS cohorts (Methods). For the MYB/HBS1L locus, which was driven by
NLW in the discovery cohort, we replicated the signal in the COG/WTCCC cohort (rs9376090,
P_COG = 4.87 x 10^-3, P_COG+discovery = 1.23 x 10^-10), but did not replicate it in CCLS, likely
owing to the small sample size of NLW. For the TET1 locus, in which the original association was
driven by LAT in the discovery cohort, three of the four SNPs with P < 5 x 10^-7 in the discovery
cohort nominally replicated in CCLS. The lead SNP after meta-analyzing the discovery cohort and
the replication cohort of CCLS was rs79226025 (P_CCLS = 3.04 x 10^-2, P_CCLS+discovery =
6.81 x 10^-9). For the NRBF2 / JMJD1C locus, we did not observe an association in the replication
cohorts.
We also performed conditional analyses adjusting for the lead SNP at each locus and identified a
secondary signal in four out of the 16 previously known loci (Table 2.2, Figure 2.3). In all cases,
the LD between the secondary hit and the top hit in the locus is low (Table 2.2). The additional
secondary associations in the CDKN2A and IKZF1 loci were previously noted (ref. 90). In CEBPE
(rs60820638, P = 5.38 x 10^-8) and 17q12 (rs12944882, P = 7.71 x 10^-10), these secondary signals
represent novel associations. In particular, at the CEBPE locus, previous reports suggest multiple
correlated variants with functional evidence (refs. 106,107). Our analysis is consistent with the two
previous variants (rs2239635 and rs2239630) being or tagging the same underlying signal, while
the new association we identified (rs60820638) is an independent association.
Polygenic Risk Score
To assess the combined effect of all identified risk alleles for ALL, we constructed a PRS
model in our discovery cohort, using either the 18 SNPs from 16 previously known loci or the 23
known plus novel SNPs and their associated effect sizes from the trans-ethnic meta-analysis. We
then computed and tested the PRS for NLW and LAT individuals in the independent CCLS and
COG/WTCCC cohorts. The scores generated with the known risk loci were significantly
associated with case-control status in all groups (P_CCLS NLW = 2.22 x 10^-17, P_CCLS LAT =
4.78 x 10^-23, P_COG/WTCCC = 2.99 x 10^-62). Adding the three novel loci identified in this study
and the two novel secondary signals further strengthened the evidence of the association in
COG/WTCCC (P = 6.93 x 10^-63) and CCLS LAT (P = 5.75 x 10^-24), while the evidence of association
stayed about the same in CCLS NLW (P = 2.03 x 10^-17). The predictive accuracy as measured by AUC
is similar between NLW and LAT, at around 67-68%, consistent with the hypothesis that
trans-ethnic meta-analysis will enable PRS to be more transferable between populations.
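The AUC figures above can be read as the probability that a randomly chosen case has a higher PRS than a randomly chosen control. The sketch below computes that quantity directly by pairwise comparison; it stands in for the pROC call used in the study and ignores covariates. The scores are toy values, not study data.

```python
def auc(case_scores, control_scores):
    """AUC as the probability that a randomly chosen case scores higher
    than a randomly chosen control, counting ties as one half."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Toy PRS values (not study data).
print(auc([5.3, 4.9, 5.8], [4.4, 5.0, 4.1, 4.7]))
```

An AUC of 0.5 corresponds to no discrimination, so values of 67-68% indicate modest but real separation between cases and controls.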
We also examined the distribution of PRS in CCRLP individuals (Figure 2.4). We found
that while the shape of the PRS distribution is consistent with a normal distribution
(Kolmogorov-Smirnov P = 0.918 and 0.303 for LAT and NLW, respectively) and appears
similar between LAT and NLW (standard deviation of 0.728 and 0.735 respectively; F-Test P =
0.633), the scores in LAT are shifted to the right compared to the scores in NLW (mean of 5.101
and 4.641 respectively, Welch t-test P = 1.3 x 10^-122). The observed pattern was consistent when
the scores were stratified by case-control status (mean of 5.324 and 4.881 in LAT and NLW
cases, respectively, P = 3.956 x 10^-58; mean of 4.895 and 4.414 in LAT and NLW controls,
respectively, P = 1.493 x 10^-78). This observation was also replicated in CCLS with mean of
5.119 in LAT and 4.607 in NLW (P = 4.596 x 10^-51). Therefore, results from our PRS analyses are
consistent with the notion that differences in allele frequency of ALL risk loci between
populations may complement other non-genetic factors for ALL risk, and partly explain the
increased ALL risk in LAT relative to NLW children.
Genetic architecture of ALL in Latinos and non-Latino whites
We estimated the relative contributions of each variant to ALL risk by computing the
familial relative risk. In CCLS, where effect size estimates are expected to be less biased by
winner’s curse, the known risk variants accounted for 22.7% and 23.2% of familial relative risk
in LAT and NLW, and the addition of novel variants increased these estimates to 24.3% and
24.8%, respectively.
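The text does not give the formula behind these percentages. A widely used convention, assumed here, computes the familial relative risk attributable to a single variant with risk-allele frequency p and per-allele relative risk r as lambda = (p r^2 + q) / (p r + q)^2 with q = 1 - p, and expresses the total explained as sum(log lambda_i) / log(lambda_0), where lambda_0 is the overall familial relative risk of the disease. The value of lambda_0 and the variant inputs below are placeholders, not values from this study.

```python
import math


def variant_frr(p: float, r: float) -> float:
    """Familial relative risk attributable to one variant under a
    multiplicative (log-additive) model."""
    q = 1.0 - p
    return (p * r ** 2 + q) / (p * r + q) ** 2


def fraction_frr_explained(variants, lambda0: float) -> float:
    """Share of log familial relative risk explained by a set of
    (risk-allele frequency, per-allele relative risk) pairs."""
    return sum(math.log(variant_frr(p, r)) for p, r in variants) / math.log(lambda0)


# Placeholder inputs: two variants and an assumed overall familial risk of 3.0.
print(fraction_frr_explained([(0.3, 1.5), (0.1, 1.2)], lambda0=3.0))
```

Because each variant's contribution depends on its allele frequency as well as its effect size, the same set of variants can explain slightly different shares in LAT and NLW.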
The heritability of ALL attributable to all common SNPs (MAF ≥ 0.05) was estimated to
be 20.3 ± 3.2% in NLW and 4.1 ± 2.0% in LAT using the GCTA-LDMS framework (ref. 108), and
20.2 ± 4.7% in NLW and 11.1 ± 3.6% in LAT using the phenotype-correlation-genotype-correlation
(PCGC) regression framework. The heritability estimates in NLW are consistent between the two
approaches and with that previously reported (ref. 90). Because the imputation quality using the
HRC reference panel is expected to be high for variants with MAF between 1-5% in NLW, our dataset
also provides the opportunity to estimate the frequency-stratified contribution to the heritability
of ALL in NLW. The inclusion of low frequency variants increased the estimated heritability in
NLW to 29.8 ± 4.3% using REML (divided into ~16.2% due to common variants and 13.5% due to low
frequency variants). Taking advantage of the admixed nature of LAT, whereby ancestry
segments could capture effects beyond that directly attributable to assayed SNPs (such as the
estimate from GCTA-LDMS), we also adopted an approach described in Zaitlen et al. (ref. 109) to
estimate the total narrow-sense heritability for ALL in LAT to be 37.3 ± 6.9%. Taken together,
multiple lines of evidence suggest that increasing sample sizes will identify additional low
frequency associations to ALL in the future.
Furthermore, we estimated the genetic correlation of ALL between NLW and LAT to be
high (rG = 0.714 ± standard error 0.130) but significantly different from 1 (P = 0.014). This
indicates the genetic architectures of NLW and LAT may be similar as expected from correlated
effect sizes but not perfectly concordant. We complemented this analysis further by estimating
the number of population-specific and shared causal alleles using the program PESCA (ref. 110). The
PESCA framework defines the set of causal variants as all variants tested to have a non-zero
effect, even if the effect is indirect and only statistical rather than biological in nature. Using this
framework, we estimated that approximately 32.5% of SNPs inferred to be causal are shared
between NLW and LAT (1.71% of all common SNPs were inferred to have nonzero effects in
both NLW and LAT; 1.69% and 1.87% were inferred to have population-specific nonzero effects
in NLW and LAT, respectively). Together, these results suggest that there may be ethnic-specific
genetic risk profiles or differential interactions with the environment that contribute to
differences in disease risk between NLW and LAT. However, it should be noted that these
analyses adopted the REML framework or used the GCTA-LDMS estimates as hyperparameters,
which could be biased in the context of LAT population here (see Discussion).
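Two of the figures in this paragraph can be reproduced with simple arithmetic. The sketch below assumes that the test of rG against 1 is a one-sided test on an approximately normal estimate, which is not stated in the text but matches the reported P = 0.014, and it recomputes the shared fraction from the three rounded PESCA percentages.

```python
import math

# Reported genetic correlation between NLW and LAT and its standard error.
r_g, se = 0.714, 0.130

# One-sided test of rG < 1, assuming the estimate is approximately normal.
z = (1.0 - r_g) / se
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2.0))

# Reported PESCA percentages of common SNPs with nonzero effects.
shared, nlw_only, lat_only = 1.71, 1.69, 1.87
shared_fraction = shared / (shared + nlw_only + lat_only)

print(f"z = {z:.2f}, P = {p_one_sided:.3f}")
print(f"shared fraction = {shared_fraction:.3f}")
```

The recomputed shared fraction differs from the reported 32.5% only in the last digit, as expected when starting from rounded inputs.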
Discussion
By incorporating data across four ethnic groups, we have performed the largest trans-
ethnic meta-analysis GWAS of childhood ALL to date. We identified three putatively novel
susceptibility loci and two additional independent risk associations at previously reported loci.
Our analysis suggests that the known and novel ALL risk alleles together explained about 25%
of the familial relative risk in both NLW and LAT populations, and that the trans-ethnic PRS we
constructed, although relatively simple and utilizing only the genome-wide associated variants,
performed similarly in both NLW and LAT in predicting ALL (AUC ~ 67-68%).
In support of their potential role in ALL etiology, each of the three novel loci harbors genes
and/or variants with a role in hematopoiesis and leukemogenesis as annotated by HaploReg
(version 4.1; ref. 111) and the GTEx portal (ref. 112). The associated variants in 6q23 are located
between HBS1L and MYB, a myeloblastosis oncogene that encodes a critical regulator protein of
lymphocyte differentiation and hematopoiesis (ref. 113). This locus is already well known for
associations with multiple blood cell measurements, severity of major hemoglobin disorders, and
β-thalassemia (refs. 114,115). The associated SNPs in our study fall within the HBS1L-MYB
intergenic region known to harbor multiple variants that reduce transcription factor binding,
affect long-range interaction with MYB, and impact MYB expression (refs. 114,116). The lead SNP
rs9376090 is in a predicted enhancer region in K562 leukemia cells and GM12878 lymphoblastoid
cells, and is a known GWAS hit for platelet count (ref. 113) and hemoglobin concentration
(refs. 117,118). Also, it is an eQTL in lymphocytes and whole blood (ref. 112) for ALDH8A1, which
encodes aldehyde dehydrogenases, a cancer stem cell marker and a regulator of self-renewal,
expansion, and differentiation.
One of the associated loci in 10q21 has a distinct haplotype structure, with 130 highly
correlated SNPs (r2 > 0.8) associated with ALL (Figure 2.2B). This haplotype structure is
observed in LAT and EAS, and the associations are driven by alleles with higher frequency in
LAT and EAS than NLW or AFR. This 400kb region is rich with genetic variants associated
with blood cell traits such as platelet count, myeloid white cell count, and neutrophil percentage
of white cells (refs. 119,120). It is also associated with IL-10 levels (ref. 121), which were shown
to be in deficit in ALL cases (ref. 122). The signal region is contained within the intron of
JMJD1C, a histone demethylase that a recent study has found to regulate abnormal metabolic
processes in AML (ref. 123). Previous studies have found that it acts as a coactivator for key
transcription factors to ensure survival of AML cells (ref. 124) and self-renewal of mouse
embryonic stem cells (ref. 125).
The second locus in 10q21 contains intronic variants in the TET1 gene, which is well
known for its oncogenicity in several malignancies including AML (ref. 126). A recent study showed
the epigenetic regulator TET1 is highly expressed in T-cell ALL and is crucial for human T-ALL
cell growth in vivo (ref. 127). We found the associations at this locus to be slightly stronger for
T-ALL than for B-ALL in a small subset of individuals with ALL subtype information, though the
difference is not statistically significant. Of the four significant variants in this locus, SNP
rs58627364 lies in the promoter region of TET1 while the remaining three variants did not
appear to overlap functional elements. However, none of these SNPs were observed as eQTL for
TET1 in whole blood or lymphoblastoid cells (ref. 128); future studies may want to investigate
whether these SNPs affect TET1 expression in hematopoietic stem or progenitor cells.
In addition to identifying putative novel ALL risk loci, we capitalized on the large
numbers of Latinos and non-Latino whites included in our study to explore the genetic
architecture of ALL in these two populations. In the NLW population, we estimated the
heritability of ALL attributable to a combination of common and low-frequency (MAF
between 1-5%) imputed variants to be ~29% using both GCTA-LDMS and PCGC regression. This
estimate is higher than the previous estimate of ~20% (ref. 90), suggesting that there are
additional low frequency variants associated with ALL that may be discovered in larger-scale
studies. The picture is less clear among LAT, where the estimated heritability was perhaps
unrealistically low using GCTA-LDMS (4.1% in univariate analysis). This estimate contrasted
strongly against other lines of evidence that showed similar estimated effect sizes (r2 = 0.819)
and familial relative risks explained by GWAS loci between NLW and LAT. Previous studies have
noted the downward bias in REML heritability estimates in case-control studies, which is
exacerbated when the covariates in the model (i.e. PCs and ancestry) are correlated with the
disease status (ref. 129). We thus also followed previous suggestions and used the PCGC regression
to obtain variance component estimates (refs. 129,130), resulting in a higher heritability estimate
(11.1%). Because of the underrepresentation of Native American or other non-European haplotypes
in the HRC panel, a priori we did not estimate the heritability including low-frequency imputed
variants in LAT. When we did attempt to estimate heritability in this setting, we obtained a
strongly negative REML-based heritability estimate, suggesting potential model instability or
misspecification attributed to the admixed nature of LAT (ref. 131). Consequently, we also
recommend caution when interpreting the estimated genetic correlation between LAT and NLW.
Nevertheless, we used the bivariate version of the REML analysis to compute the genetic
correlation of ALL between populations, as had been done previously for prostate cancer with
individual level data (ref. 132). Our estimated genetic correlation (rG = 0.71) is significantly
less than 1, apparently suggesting significant population-specific components of the disease
architecture between LAT and NLW. This would be consistent with the findings of the ERG locus
(refs. 22,23), a Latino-specific association with ALL, and suggests that future ethnic-specific
GWAS across different ethnic groups for ALL will be insightful. This is also consistent with our
observation in the PESCA analysis, where we found that only 32.5% of the estimated causal alleles
are shared between LAT and NLW. These insights should still be treated with caution because the
sample size for ALL, a rare disease, is still relatively small compared to complex traits examined
using PESCA (ref. 110), and because the REML-based heritability estimates for LAT used as
hyperparameters by PESCA may be biased. Therefore, more focused efforts to investigate the genetic
architecture of ALL, particularly in admixed populations like the Latinos, are needed.
Future studies aimed to uncover the genetic risk factors for ALL could focus on multiple
avenues. First, there will be a need to further increase the sample size of the study cohort, which
would provide additional venues to replicate the putative novel findings here and identify more
associated alleles at lower frequency. Second, there should be a focus on ethnic-specific GWAS
for ALL, as ethnic-specific associations could be missed in a trans-ethnic GWAS. An example is
the ERG locus, which is not genome-wide significant in our meta-analysis. Finally, while not
explored extensively in this particular study, there should be a focus on disentangling the
different subtypes of ALL, and to study other aspects of the disease pathogenesis such as disease
progression or risk of relapse, though these data are less available and may require more focused
ascertainment and cohort creation.
Acknowledgement
This work was supported by research grants from the National Institutes of Health
(R01CA155461, R01CA175737, R01ES009137, P42ES004705, P01ES018172, P42ES0470518
and R24ES028524) and the Environmental Protection Agency (RD83451101), United States.
The content is solely the responsibility of the authors and does not necessarily represent the
official views of the National Institutes of Health and the EPA. The collection of cancer
incidence data used in this study was supported by the California Department of Public Health as
part of the statewide cancer reporting program mandated by California Health and Safety Code
Section 103885; the National Cancer Institute’s Surveillance, Epidemiology and End Results
Program under contract HHSN261201000140C awarded to the Cancer Prevention Institute of
California, contract HHSN261201000035C awarded to the University of Southern California,
and contract HHSN261201000034C awarded to the Public Health Institute; and the Centers for
Disease Control and Prevention’s National Program of Cancer Registries, under agreement
U58DP003862-01 awarded to the California Department of Public Health. The biospecimens
and/or data used in this study were obtained from the California Biobank Program, (SIS request
#26), Section 6555(b), 17 CCR. The California Department of Public Health is not responsible
for the results or conclusions drawn by the authors of this publication. We thank Hong Quach
and Diana Quach for DNA isolation support. We thank Martin Kharrazi, Robin Cooley, and
Steve Graham of the California Department of Public Health for advice and logistical support.
We thank Eunice Wan, Simon Wong, and Pui Yan Kwok at the UCSF Institute of Human
Genetics Core for genotyping support. This study makes use of data generated by the Wellcome
Trust Case–Control Consortium. A full list of the investigators who contributed to the generation
of the data is available from www.wtccc.org.uk. Funding for the project was provided by the
Wellcome Trust under award 076113 and 085475. Genotype data for COG ALL cases are
available for download from dbGaP (Study Accession: phs000638.v1.p1). Data came from a
grant, the Resource for Genetic Epidemiology Research in Adult Health and Aging (RC2
AG033067; Schaefer and Risch, PIs) awarded to the Kaiser Permanente Research Program on
Genes, Environment, and Health (RPGEH) and the UCSF Institute for Human Genetics. The
RPGEH was supported by grants from the Robert Wood Johnson Foundation, the Wayne and
Gladys Valley Foundation, the Ellison Medical Foundation, Kaiser Permanente Northern
California, and the Kaiser Permanente National and Northern California Community Benefit
Programs. The RPGEH and the Resource for Genetic Epidemiology Research in Adult Health
and Aging are described here:
https://divisionofresearch.kaiserpermanente.org/genetics/rpgeh/rpgehhome. For recruitment of
subjects enrolled in the CCLS replication set, the authors gratefully acknowledge the clinical
investigators at the following collaborating hospitals: University of California Davis Medical
Center (Dr. Jonathan Ducore), University of California San Francisco (Drs. Mignon Loh and
Katherine Matthay), Children’s Hospital of Central California (Dr. Vonda Crouse), Lucile
Packard Children’s Hospital (Dr. Gary Dahl), Children’s Hospital Oakland (Dr. James Feusner),
Kaiser Permanente Roseville (formerly Sacramento) (Drs. Kent Jolly and Vincent Kiley), Kaiser
Permanente Santa Clara (Drs. Carolyn Russo, Alan Wong, and Denah Taggart), Kaiser
Permanente San Francisco (Dr. Kenneth Leung), and Kaiser Permanente Oakland (Drs. Daniel
Kronish and Stacy Month). The authors additionally thank the families for their participation in
the California Childhood Leukemia Study (formerly known as the Northern California
Childhood Leukemia Study). Finally, the authors acknowledge the Center for Advanced
Research Computing (CARC; https://carc.usc.edu) at the University of Southern California for
providing computing resources that have contributed to the research results reported within this
publication.
Competing Interests: The authors declare no competing interests.
Correspondence: Joseph Leo Wiemels, Center for Genetic Epidemiology, 1450 Biggy St, Los
Angeles, California, email: wiemels@usc.edu; Charleston Chiang, 1450 Biggy St, Los Angeles,
California, email: charleston.chiang@med.usc.edu
Table 2.1: Summary statistics for the reported variants, the top variant in the loci from our meta-
analysis, and the linkage disequilibrium between the two variants in NLW and LAT.
Columns: Gene | Reported SNP: Chr | Pos | rsID (reference) | P-value | Top SNP in this study: Chr | Pos | rsID | P-value | r2 NLW | r2 LAT
C5orf56 | 5 | 131765206 | rs886285 (ref. 90) | 0.63 | 5 | 131811182 | rs11741255 | 1.69 x 10^-4 | 0.35 | 0.19
BAK1 | 6 | 33546930 | rs210143 (ref. 90) | 4.49 x 10^-8 | 6 | 33546837 | rs210142 | 4.27 x 10^-8 | 1 | 1
IKZF1 | 7 | 50470604 | rs4132601 (ref. 86) | 1.13 x 10^-33 | 7 | 50477144 | rs10230978 | 3.92 x 10^-34 | 0.98 | 0.97
8q24 | 8 | 130156143 | rs4617118 (ref. 85) | 1.04 x 10^-12 | Same | | | | |
CDKN2A | 9 | 21970916 | rs3731249 (ref. 102) | 1.29 x 10^-18 | 9 | 21975319 | rs36228834 | 1.90 x 10^-18 | 0.99 | 1
TLE1 | 9 | 83747371 | rs76925697 (ref. 90) | 5.37 x 10^-2 | 9 | 83728588 | rs62579826 | 1.06 x 10^-2 | 0.81 | 0.98
GATA3 | 10 | 8104208 | rs3824662 (ref. 87) | 4.24 x 10^-9 | Same | | | | |
PIP4K2A | 10 | 22852948 | rs7088318 (ref. 89) | 6.50 x 10^-19 | 10 | 22853102 | rs7075634 | 2.42 x 10^-19 | 0.96 | 0.97
BMI1 | 10 | 22423302 | rs11591377 (ref. 103) | 8.21 x 10^-10 | 10 | 22374489 | rs1926697 | 5.24 x 10^-10 | 0.84 | 0.88
ARID5B | 10 | 63723577 | rs10821936 (ref. 88) | 4.78 x 10^-67 | 10 | 63721176 | rs7090445 | 7.36 x 10^-70 | 0.98 | 0.99
LHPP | 10 | 126293309 | rs35837782 (ref. 84) | 6.90 x 10^-4 | Same | | | | |
ELK3 | 12 | 96612762 | rs4762284 (ref. 84) | 2.42 x 10^-3 | 12 | 96645605 | rs78405390 | 4.68 x 10^-5 | 0.13 | 0.22
CEBPE | 14 | 23589057 | rs2239633 (ref. 86) | 3.0 x 10^-14 | 14 | 23589349 | rs2239630 | 2.12 x 10^-21 | 0.74 | 0.78
IKZF3 | 17 | 38066240 | rs2290400 (ref. 85) | 2.09 x 10^-6 | 17 | 37957235 | rs17607816 | 1.42 x 10^-7 | 0.02 | 0.22
IGF2BP1 | 17 | 47092076 | rs10853104 (ref. 90) | 2.93 x 10^-2 | 17 | 47217004 | rs6504598 | 4.87 x 10^-4 | 0.02 | 0.02
ERG | 21 | 39789606 | rs8131436 (ref. 23) | 6.97 x 10^-5 | 21 | 39784752 | rs55681902 | 9.36 x 10^-6 | 0.62 | 0.65
We focused on the variants within 1 Mb of the previously reported susceptibility variants
[84–89,101–103] and reported the association results of the published lead SNP as well as the
top SNP at each locus from our meta-analysis. Note that out of the 16 loci, three (8q24.21,
IKZF3, and BMI1) were initially identified and five (IKZF1, PIP4K2A, ARID5B, CDKN2A, CEBPE)
were previously shown to be replicated using a smaller but largely overlapping subset of this
dataset [85,103]. For these loci, our findings here would not necessarily constitute an
independent replication. Gene names (Gene) are given based on the nearest gene unless the
variant is in a gene desert. Chromosome (Chr) and position (Pos) are given in hg19 coordinates.
r2 denotes the squared correlation between the reported SNP and our top SNP in the discovery
cohort, shown separately for NLW and LAT; NLW and LAT denote the non-Latino white and Latino
cohorts, respectively. "Same" indicates that the reported SNP is also the top SNP in this study.
Table 2.2: Summary of conditional analysis to identify secondary associations at known loci.
Gene | Chr | Pos | rsID | Risk allele | OR | Pconditional | Pdiscovery | r2
IKZF1 | 7 | 50459043 | rs78396808 | A | 1.632 | 3.46x10^-26 | 2.7x10^-16 | *0.06
CDKN2A/B | 9 | 21993964 | rs2811711 | T | 1.355 | 7.2x10^-10 | 1.85x10^-11 | 0.01
CEBPE | 14 | 23592617 | rs60820638 | A | 1.193 | 5.38x10^-8 | 0.102 | 0.16
IKZF3 | 17 | 37983492 | rs12944882 | T | 1.204 | 7.71x10^-10 | 2.81x10^-7 | 0.02
For each of the four significant associations after conditional analysis, we show the genomic
coordinates in hg19, effect size (OR), the P-values with or without conditioning on the lead SNP
from the discovery meta-analysis in the locus, and the r2 between the lead SNP and the secondary
association.
Chr: chromosome; Pos: position in hg19; OR: effect size (odds ratio); Pconditional: P-value from
the conditional analysis; Pdiscovery: P-value from meta-analysis without conditioning on any SNP;
r2: squared correlation of the conditioned SNP and the most significantly associated SNP from
conditional analysis.
*Calculated in the Latino population, as the variant was filtered out for low MAF in the NLW cohort.
Figure 2.1: Summary result of the trans-ethnic meta-analysis on ALL.
Results of the meta-analysis are represented by the Manhattan plot. The novel loci from this study
are marked with dotted lines and labeled with the nearest genes. The genome-wide significance
threshold (5x10^-8) is marked with a horizontal dashed grey line in the Manhattan plot. The
y-axis is truncated at -log10(1x10^-50) to improve readability. The inset shows the
Quantile-Quantile plot. Deviation from the expected P-value distribution is evident only in the
tail. There is little evidence of inflation of the test statistics in general, as the genomic
inflation factor is 1.024.
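As an illustration of the genomic inflation factor quoted in this caption, the following is a minimal sketch of the usual definition: the median observed 1-df chi-square statistic divided by its expected median under the null. The function name and the assumption that each P-value comes from a two-sided 1-df test are ours, not taken from the analysis pipeline.

```python
from statistics import NormalDist, median

# Expected median of a chi-square distribution with 1 degree of freedom (about 0.455).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_inflation(p_values):
    """Return lambda_GC for a list of P-values strictly between 0 and 1 (hypothetical helper)."""
    norm = NormalDist()
    # A two-sided P-value p corresponds to chi-square = z^2 with z = Phi^-1(1 - p/2).
    chi2 = [norm.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN
```

A value close to 1, such as the 1.024 reported here, indicates that the bulk of the test statistics follows the null distribution.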
Figure 2.2: Novel loci associated with childhood ALL in trans-ethnic meta-analysis.
LocusZoom plots show the 1 Mb region around the identified loci near (A) MYB/HBS1L on
chr6, (B) NRBF2/JMJD1C on chr10, and (C) TET1 on chr10. The diamond symbol indicates the
lead SNP in each locus. The color of the remaining SNPs is based on linkage disequilibrium (LD),
as measured by r2 with the lead SNP in non-Latino whites. All coordinates on the x-axis are in
hg19.
Figure 2.3: Secondary association signals (P < 5x10^-8) with ALL found in previously known loci
through conditional analysis.
LocusZoom plots display the 1 Mb regions found to harbor a second novel variant associated with ALL
through conditional analysis: (A) IKZF1, (B) CDKN2A, (C) CEBPE, and (D) IKZF3. For each locus, we
display the pattern of association before (left) and after (right) conditioning on the top associated
variant in the locus. In both cases, the diamond indicates the lead SNP in the conditional analysis.
The color of the remaining SNPs is based on linkage disequilibrium (LD) with the lead variant in the
conditional analysis in non-Latino whites. Genomic coordinates on the x-axis are in hg19.
Figure 2.4: Polygenic Risk Score (PRS) distribution based on GWAS loci for ALL.
We compared the PRS distribution between LAT and NLW cohorts in (A) CCRLP and (B) CCLS
cohorts. PRS were constructed by summing the imputed dosages weighted by effect size for each
Latino (red) and non-Latino white (green) individual. In (C), we further stratified the PRS in the
CCRLP cohort by case/control status. The population mean is indicated with a vertical dashed line,
with the mean score shown. P-values in the upper right corner of each graph are from one-sided
t-tests comparing the difference in PRS between LAT and NLW overall or within cases and controls.
CHAPTER 3
Evaluating genomic Polygenic Risk Scores for
childhood Acute Lymphoblastic Leukemia in Latino Americans
Soyoung Jeon (1), Ying Chu Lo (1), Libby M. Morimoto (2), Catherine Metayer (2), Xiaomei Ma (3),
Joseph L. Wiemels (1), Adam J. de Smith (1), Charleston W.K. Chiang (1)
Affiliations:
1. Center for Genetic Epidemiology, University of Southern California, Los Angeles, CA
2. School of Public Health, University of California Berkeley, Berkeley, CA
3. Yale School of Public Health, New Haven, CT
Introduction
Acute lymphoblastic leukemia (ALL) is the most common type of childhood cancer
worldwide, representing 20% of all cancers in children in the United States [27]. In addition to
environmental factors, recent advances in genomic analyses have explored and confirmed a
strong contribution of genetic variation to ALL risk. To date, 16 loci have been discovered and
replicated in previous Genome-Wide Association Studies (GWAS), primarily using European
ancestry individuals, suggesting the polygenic nature of susceptibility to ALL
[23,84–89,102,103,133,134]. Yet, how these variants collectively contribute to the disease risk
has not been fully characterized.
Polygenic Risk Scores (PRS) can identify individuals at significantly elevated risk for a
disease such as cancer, or enrich for individuals in strata of higher risk relative to the general
population, by providing a quantitative measure of an individual's inherited risk based on the
cumulative impact of common polymorphisms. Weights are assigned to each genetic variant
according to the strength of its association with the disease risk, and individuals are scored
based on the number of risk alleles they carry for each variant used to construct the polygenic
score. There has been growing evidence that the predictive power of PRS can be further
increased by aggregation of genotypic effects across all variants, even if they do not reach the
commonly acknowledged genome-wide significance threshold for association (P = 5e-8) [135,136].
Similarly, a genomic PRS approach may also enhance the PRS models for ALL given the small
number of known susceptibility loci.
However, one of the biggest limitations of PRS is the poor performance of European
ancestry-derived PRS when applied to populations of other ancestries [10]. Part of this loss in
transferability may be due to the over-representation of GWAS participants of European
ancestry [8,10], resulting in much more informative GWAS in European ancestry individuals
than in other ancestries. In addition, differences in patterns of linkage disequilibrium (LD),
allele frequencies, causal variants, and effect sizes further contribute to this poor
transferability [9,10,137]. This observation is particularly important for ALL, for which Latino
children have the highest and fastest-increasing risk and poorer survival than non-Latino
whites [1,17,36,92,93]. A previous study constructed a PRS using 11 SNPs with effect sizes
estimated from a European ancestry cohort, and reported an area under the curve (AUC) of 73%
in receiver-operator characteristic individual risk discrimination analysis [138]. This predictive
performance is likely to be overestimated, as it was only tested in European ancestry individuals,
and there may have been concerns of overfitting in the models.
In this study, we set out to evaluate the PRS models derived using European American
samples when they are ported to Latino individuals. We also aimed to answer whether effect
sizes estimated from ethnic-specific GWAS or multi-ancestry meta-analysis, and whether
training with a matched-ancestry LD reference panel, could improve the efficacy of the PRS. We
used two approaches, Pruning and Thresholding (P+T) and LDPred2, in parallel to construct
PRS based on European-ancestry-only GWAS or multi-ethnic GWAS, and found that the best
model trained in non-Latino whites (NLW) performed similarly in the NLW and Latino
American (LAT) testing datasets, and that the best PRS model in LAT could be further improved
by using multi-ethnic GWAS or a LAT training dataset.
Materials and Methods
Study Cohort
The California Childhood Cancer Record Linkage Project (CCRLP) includes all children
born in California during 1982-2009 and diagnosed with ALL at the age of 0-14 years per
California Cancer Registry records. Children who were born in California during the same period
and not reported to the California Cancer Registry as having any childhood cancer were considered
potential controls. Detailed information on sample matching, preparation and genotyping has
been previously described [85]. Because ALL is a rare childhood cancer, for the purpose of a
genetic study we followed previous practice [85] and incorporated additional controls using adult
individuals from the Kaiser Resource for Genetic Epidemiology Research on Aging Cohort
(GERA; dbGaP accession: phs000788.v1.p2). The GERA cohort was chosen because a very
similar genotyping platform had been used: Affymetrix Axiom World arrays. Genome-wide
single nucleotide polymorphism (SNP) genotyping was performed for all individuals in CCRLP
and GERA using the Affymetrix Axiom World LAT array [85]. In general, we included individuals
in each population group based on their self-reported race/ethnicity. The imputation and quality
control (QC) of SNP array data were carried out in each population group, as previously
described in our multi-ethnic meta-analysis GWAS of ALL [133]. In brief, for pre-imputation
QC, we excluded the sex chromosomes and filtered out SNPs based on call rates < 98%, minor
allele frequency (MAF) < 0.01, and Hardy-Weinberg equilibrium in controls (P < 10^-5), and
removed samples based on genome-wide relatedness (PI_HAT > 0.20), genome heterozygosity
rate (mean heterozygosity ± 6 SD), and call rates < 95%. Principal components analysis (PCA)
was performed for each population group along with reference data from the 1000 Genomes
Project [94] to identify extreme outlier individuals that clustered separately from other
individuals in their self-reported race/ethnicity groups. Among self-reported Asians, we
identified a small subset (29 cases and 51 controls from CCRLP, 31 individuals from GERA) that
clustered with South Asian reference individuals and were subsequently removed from our
analyses, with the remaining individuals classified as East Asians. To protect against
between-cohort batch effects, we performed a GWAS between population-matched CCRLP controls
and GERA individuals before and after imputation. Whole-genome imputation was performed for
each population group, separately for CCRLP and GERA datasets, using the Haplotype Reference
Consortium (HRC v r1.1 2016) as a reference in the Michigan Imputation Server [139]. After
imputation we removed variants in each population group based on imputation quality (R2 < 0.3),
MAF (< 0.01), and allele frequency difference between non-Finnish Europeans in the Genome
Aggregation Database (gnomAD) and CCRLP NLW controls (> 0.1). After quality control filtering,
the Latino GWAS included 1,878 cases and 8,441 controls, the non-Latino White GWAS included
1,162 cases and 57,341 controls, the African American GWAS included 124 cases and 2,067
controls, and the East Asian GWAS included 318 cases and 5,017 controls.
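The pre-imputation thresholds described above can be written down compactly. The following is a minimal sketch under our own assumptions: the function names are hypothetical, the inputs are per-SNP or per-sample summary values that would in practice come from a tool such as PLINK, and the relatedness and sample call-rate filters are omitted for brevity.

```python
from statistics import mean, stdev


def snp_passes_qc(call_rate, maf, hwe_p_controls):
    """Keep a SNP only if it clears the three SNP-level thresholds quoted in the text."""
    return call_rate >= 0.98 and maf >= 0.01 and hwe_p_controls >= 1e-5


def heterozygosity_outliers(het_rates, n_sd=6.0):
    """Return indices of samples whose heterozygosity lies beyond mean +/- n_sd SDs."""
    mu = mean(het_rates)
    sd = stdev(het_rates)
    return [i for i, h in enumerate(het_rates) if abs(h - mu) > n_sd * sd]
```

For example, a SNP with a 97% call rate would be dropped regardless of its MAF or Hardy-Weinberg P-value.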
Another GWAS was performed with individuals from the Children’s Oncology Group
(COG; dbGaP accession: phs000638.v1.p1) and the Wellcome Trust Case–Control Consortium
(WTCCC) [97]. We generally followed the same quality control pipeline, but because
self-identified ethnicity was not available to us, we performed global ancestry estimation using
ADMIXTURE with the 1000 Genomes populations as reference and removed individuals with < 90%
estimated European ancestry from the analysis, resulting in a total of 1504 NLW cases and 2931
NLW controls. This dataset was previously used as a replication cohort of European ancestry in
our study [133], but here we combined it with CCRLP NLW to increase the sample size of the
discovery GWAS (below).
California Childhood Leukemia Study (CCLS) [98], a non-overlapping California case-control
study with controls from California Birth Cohort (1995-2008), was used as our testing
dataset. The quality control procedures and imputation were performed in accordance with the
discovery/training dataset.
Overall Study Design
A PRS of an individual j is defined as a weighted sum of SNP allele counts:
$$\mathrm{PRS}_j = \sum_{i=1}^{m} \hat{\beta}_i \, g_{ij}$$
where $m$ is the number of SNPs to be included in the predictor, $\hat{\beta}_i$ is the per-allele
weight for SNP $i$, and $g_{ij}$ is the allele count (0, 1, 2) or dosage of the allele of SNP $i$
in individual $j$.
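The definition above translates directly into code. The following is a minimal sketch for a single individual; the function name is hypothetical, and in this study the scoring itself was done with PLINK rather than with custom code.

```python
def polygenic_risk_score(weights, dosages):
    """Weighted sum of allele dosages for one individual.

    weights: per-allele effect estimates (log odds ratios), one per SNP.
    dosages: allele counts (0, 1, 2) or imputed dosages in [0, 2], in the same SNP order.
    """
    if len(weights) != len(dosages):
        raise ValueError("weights and dosages must have the same length")
    return sum(beta * g for beta, g in zip(weights, dosages))
```

With weights (0.5, -0.25, 0.1) and dosages (2, 1, 0), the score is 0.5 × 2 - 0.25 × 1 + 0.1 × 0 = 0.75.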
For each step of score derivation, optimization, and evaluation, we used three non-overlapping
datasets to (1) perform discovery GWAS to estimate variant effect sizes, (2) train to optimize
parameters for the best predictive score, and (3) evaluate the predictive performance of the
resulting scores (Figure 3.1).
We randomly selected and held out 360 cases and 1500 controls from each of CCRLP+GERA
NLW (~3.2% of the sample size) and LAT (~18.1% of the sample size) as training datasets and
used the remaining samples in three different discovery GWAS: (1) NLW meta-analysis, (2)
Latino-only GWAS, and (3) multi-ethnic meta-analysis. For each GWAS, we constructed PRS in the
training dataset using two approaches, Pruning and Thresholding (P+T) and LDPred2, in either
the LAT or NLW training sample.
Throughout this chapter we will explore multiple strategies to develop PRS models. We
labelled the different strategies following the convention of “POPdiscovery_POPtraining”, where
POPdiscovery is the population in which the discovery GWAS was conducted, and POPtraining is the
population in which the optimization for the best model was performed. Once all scores for each
strategy were constructed, the best performing PRS model was evaluated in the testing cohort,
CCLS NLW or LAT.
Figure 3.1: Summary of study design and analysis
The flowchart details different cohorts used for each step of PRS derivation with different
discovery GWAS, optimization, and evaluation in either non-Latino white or Latino populations.
NLW: non-Latino White, LAT: Latino American
Table 3.1: Summary of sample size for each strategy of constructing PRS
Approach | Discovery GWAS | Training dataset
NLW_NLW | 2306 cases; 58772 controls | 360 cases; 1500 controls (NLW)
NLW_LAT | 2306 cases; 58772 controls | 360 cases; 1500 controls (LAT)
META_NLW | 4266 cases; 72766 controls | 360 cases; 1500 controls (NLW)
META_LAT | 4266 cases; 72766 controls | 360 cases; 1500 controls (LAT)
LAT_LAT | 1518 cases; 6910 controls | 360 cases; 1500 controls (LAT)
We labelled the different strategies following the convention of “POPdiscovery_POPtraining”, where
POPdiscovery is the population in which the discovery GWAS was conducted, and POPtraining is the
population in which the optimization for the best model was performed; NLW: Non-Latino
Whites, LAT: Latino Americans
Discovery GWAS
We used PLINK v2.3 alpha to test the association between imputed genotype dosage at
each SNP and case-control status in logistic regression, after adjusting for the top 20 PCs. For
the NLW and multi-ethnic GWAS meta-analyses, the results from each group were combined via
fixed-effect meta-analysis with inverse-variance weighting using METAL [100]. After excluding
360 cases and 1500 controls each for NLW and LAT, 802 cases and 55841 controls in NLW, 1518
cases and 6910 controls in LAT, 318 cases and 5017 controls in East Asians (EAS), and 124 cases
and 2067 controls in African Americans in CCRLP/GERA remained for discovery GWAS. For the NLW
meta-analysis, the CCRLP/GERA GWAS was meta-analyzed with a separate GWAS conducted with 1504
cases and 2931 controls from the COG/WTCCC cohort. The multi-ethnic meta-analysis was conducted
with CCRLP/GERA NLW, LAT, East Asian, African American and COG/WTCCC individuals,
totaling 4266 cases and 72766 controls. The total sample size for each discovery GWAS design
can be found in Table 3.1.
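To make the variance-weighted fixed-effect combination concrete, the following is a minimal sketch of the standard inverse-variance formula for a single SNP. It is not METAL's code; the function name is hypothetical, and the inputs are assumed to be per-cohort log odds ratios and their standard errors for the same effect allele.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis for one SNP.

    Returns the pooled effect, its standard error, and the Z statistic.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se
```

Cohorts with smaller standard errors, typically the larger ones, contribute more to the pooled estimate, which is why the NLW cohorts dominate the multi-ethnic analysis in terms of sample size.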
Polygenic Risk Score Derivation / Optimization
For each ancestry-specific or multi-ancestry GWAS, we constructed PRS using two
different methods: LDPred2 and Pruning and Thresholding (P+T). Both methods use the
GWAS summary statistics as the starting point, but each makes different choices about which
SNPs to include in the predictor and the weights to assign to each SNP.
Genomic PRS Method: Pruning and Thresholding (P + T).
The most commonly used approach, Pruning and Thresholding (P+T), uses a P-value
threshold and an LD-driven clumping procedure to construct scores. The scores using the P+T
approach were constructed using PLINK v1.9.
In brief, given user-defined thresholds for association P-value and clumping parameters, the
algorithm forms clumps around index SNPs, each containing all SNPs within the specified
distance (kb) of the index SNP that pass the P-value threshold and have pairwise LD (measured
by r2) with the index SNP greater than the specified threshold. The algorithm greedily and
iteratively cycles through all index SNPs, beginning with the index SNP with the smallest
P-value, only allowing each SNP to appear in one clump. The most significant SNP of each
LD-based clump across the genome is used to build the PRS, with the associated estimated beta
as its weight.
We constructed PRS using a range of P-value (1.0, 0.5, 0.05, 5 × 10^-4, 5 × 10^-6, and
5 × 10^-8), r2 (0.2, 0.4, 0.6, and 0.8), and kb (250, 500) thresholds for a total of 48 PRS models
for each strategy.
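The greedy clumping procedure and the parameter grid described in this section can be sketched as follows. This is an illustrative simplification, not PLINK's implementation: it uses a single P-value threshold for both index and clumped SNPs, and the function name, the SNP record layout, and the `ld_r2` lookup are hypothetical.

```python
from itertools import product

# The 6 x 4 x 2 = 48 parameter combinations listed in the text.
P_THRESHOLDS = [1.0, 0.5, 0.05, 5e-4, 5e-6, 5e-8]
R2_THRESHOLDS = [0.2, 0.4, 0.6, 0.8]
WINDOWS_KB = [250, 500]
PT_GRID = list(product(P_THRESHOLDS, R2_THRESHOLDS, WINDOWS_KB))


def clump_and_threshold(snps, ld_r2, p_threshold, r2_threshold, window_kb):
    """Greedy P+T selection.

    snps: list of dicts with keys 'id', 'chrom', 'pos' (bp), 'p', 'beta'.
    ld_r2: function (id_a, id_b) -> r2 between two SNPs (hypothetical lookup).
    Returns the index SNPs, most significant first.
    """
    candidates = sorted((s for s in snps if s["p"] <= p_threshold), key=lambda s: s["p"])
    claimed = set()
    index_snps = []
    for snp in candidates:
        if snp["id"] in claimed:
            continue  # already absorbed into the clump of a more significant SNP
        index_snps.append(snp)
        claimed.add(snp["id"])
        for other in candidates:
            if other["id"] in claimed or other["chrom"] != snp["chrom"]:
                continue
            close = abs(other["pos"] - snp["pos"]) <= window_kb * 1000
            if close and ld_r2(snp["id"], other["id"]) >= r2_threshold:
                claimed.add(other["id"])
    return index_snps
```

Each tuple in `PT_GRID` yields one candidate score; the betas of the returned index SNPs serve as the weights.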
Genomic PRS Method: LDPred2
LDPred2 uses a Bayesian approach to calculate the posterior mean effect size for each
variant based on a prior and subsequent shrinkage based on the extent to which the variant is
correlated with similarly associated variants [140,141]. The underlying Gaussian distribution
additionally considers the proportion of causal variants (ρ). LDPred2 uses a grid of values for
the hyper-parameters (tuning parameters) ρ, h2 (the SNP heritability), and sparsity (whether to
fit some variant effects to exactly zero) to construct PRS. We used a sequence of 17 values of ρ
from 10^-4 to 1 on a log scale, h2 within {0.7, 1, 1.4} × the estimated heritability, and the
sparsity option on and off. In addition, we also tested a model assuming infinitesimal causal
effects (each variant assumed to contribute to disease risk). In total, we evaluated 103 scores
using LDPred2.
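The count of 103 scores follows from the grid: 17 values of ρ × 3 values of h2 × 2 sparsity settings = 102, plus one infinitesimal model. The following is a minimal sketch that enumerates the grid; the function name is hypothetical, the ρ values are assumed to be evenly spaced on a log10 scale, and the code only lists parameter combinations and does not run LDPred2.

```python
from itertools import product


def ldpred2_grid(h2_est, n_rho=17):
    """Enumerate (rho, h2, sparse) combinations for a grid search."""
    # n_rho values of rho, evenly spaced on a log10 scale from 1e-4 to 1.
    rhos = [10 ** (-4 + 4 * k / (n_rho - 1)) for k in range(n_rho)]
    h2s = [factor * h2_est for factor in (0.7, 1.0, 1.4)]
    return [
        {"rho": rho, "h2": h2, "sparse": sparse}
        for rho, h2, sparse in product(rhos, h2s, (False, True))
    ]
```

Under this spacing the grid contains ρ = 10^-2.5 ≈ 0.0032 and ρ = 10^-3.75 ≈ 0.00018, which agree with the rounded ρ values reported for the best models in the Results.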
Once the variants and weights for each score model were estimated, the scores were generated in
the training sample (360 cases and 1500 controls in NLW or LAT) by multiplying the genotype
dosage of each risk allele for each variant by its respective weight and then summing across all
variants in the score using PLINK v2.3 alpha. For each strategy, the score with the best
predictive performance was determined based on the highest Nagelkerke’s pseudo-R2. The R2 was
calculated as the R2 of the full model, inclusive of the PRS and the covariates, minus the R2 of
the null model with covariates alone, thus yielding an estimate of the variance explained by the
PRS. Covariates in the model included the first 20 principal components and sex.
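The incremental pseudo-R2 described above can be sketched from model log-likelihoods. This is a minimal illustration under our own assumptions: the function names are hypothetical, and the log-likelihoods of the full (PRS + covariates) and covariates-only logistic models are assumed to have been obtained from a regression package beforehand.

```python
import math


def intercept_log_likelihood(n_cases, n_controls):
    """Log-likelihood of the intercept-only logistic model."""
    n = n_cases + n_controls
    p = n_cases / n
    return n_cases * math.log(p) + n_controls * math.log(1.0 - p)


def nagelkerke_r2(ll_model, ll_intercept, n):
    """Nagelkerke's pseudo-R2: Cox-Snell R2 rescaled by its maximum attainable value."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_intercept - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_intercept / n)
    return cox_snell / max_cox_snell


def incremental_r2(ll_full, ll_covariates, ll_intercept, n):
    """R2 of the full model minus R2 of the covariates-only model."""
    return nagelkerke_r2(ll_full, ll_intercept, n) - nagelkerke_r2(ll_covariates, ll_intercept, n)
```

Subtracting the covariates-only R2 keeps ancestry (via the principal components) and sex from being credited to the PRS.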
Ancestry Inference
Global ancestry inference was performed on CCRLP Latino cases and controls using
RFMix, with a reference panel consisting of 671 non-Finnish European individuals for
European ancestry, 716 African individuals for African ancestry, and 94 Admixed American
individuals (7 Colombian, 12 Karitiana, 14 Mayan, 4 Mexican in Los Angeles, 37 Peruvian in
Lima, Peru, 12 Pima, and 8 Surui) for Indigenous American (IA) ancestry from the gnomAD v3.1
release [105]. The 94 IA individuals were selected based on having more than 85% global ancestry
of Admixed American (AMR) individuals from the 1000 Genomes Project [94] in an unsupervised
analysis using ADMIXTURE [142]. We stratified Latino individuals into tertiles of global
EUR ancestry, and in each group evaluated the predictive performance of the best score model
for the NLW_NLW strategy.
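The tertile stratification can be illustrated with a short sketch. The function name is hypothetical, and ties and the exact handling of group boundaries are our assumptions, so group sizes may differ by one from those reported.

```python
def assign_tertiles(ancestry):
    """Label each individual 'low', 'mid', or 'high' by rank of global EUR ancestry."""
    order = sorted(range(len(ancestry)), key=lambda i: ancestry[i])
    labels = [None] * len(ancestry)
    n = len(order)
    for rank, i in enumerate(order):
        # rank * 3 // n maps ranks 0..n-1 onto tertile indices 0, 1, 2.
        labels[i] = ("low", "mid", "high")[min(rank * 3 // n, 2)]
    return labels
```

The best NLW_NLW score would then be evaluated separately within each label.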
PRS evaluation
After optimizing the PRS model in held-out training samples of 360 cases and 1500
controls, we computed the PRS in the CCLS NLW and LAT cohorts as our testing dataset
using PLINK v2.0 alpha. The CCLS testing dataset has 306 cases and 258 controls in the NLW
subcohort, and 592 cases and 509 controls in the LAT subcohort. The predictive performance of
PRS was quantified by Nagelkerke’s pseudo-R2 (the proportion of variance explained) and the area
under the receiver operator characteristic curve (AUC; probability that a case ranks higher than
a control). AUC was computed for the full model with covariates to account for population
stratification. The AUC for the null model (ALL ~ 10 PCs + sex) is 0.593 and 0.577 in CCLS LAT
and NLW, respectively. AUCs were calculated using the pROC package in R [104].
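The interpretation of AUC as the probability that a case ranks higher than a control can be computed directly by comparing all case-control pairs. The following is a minimal sketch with a hypothetical function name; it is not the pROC implementation, and the inputs are assumed to be fitted values from the full model (PRS plus covariates).

```python
def auc(case_scores, control_scores):
    """Probability that a random case scores higher than a random control (ties count 1/2)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))
```

An AUC of 0.5 corresponds to no discrimination, so the null-model values just above 0.5 reflect the modest contribution of the covariates alone.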
Results
Transferability of genomic PRS for ALL
We first evaluated the transferability of genomic PRS trained from a solely European-
ancestry cohort to the Latino cohort. If it is observed that there is an overt transferability issue of
ALL PRS models from European-ancestry populations to Latinos, it would imply that
researchers may want to shift resources towards investing in diverse ancestries when studying
the genetic architecture of ALL.
Given our previous multi-ethnic GWAS for ALL, we established the following designs to
first evaluate the transferability of genomic PRS for ALL. After removing individuals reserved
for training, we performed a GWAS in 2306 cases and 58772 controls from NLW. This is the
largest GWAS in NLW for ALL at our disposal. Based on the discovery GWAS summary statistics,
we then constructed genomic PRS models using two approaches: Pruning-and-Thresholding (P+T)
and LDPred2. Both approaches require optimization across a grid of parameters (see Methods).
We thus evaluated a total of 151 models in the held-out training sample of 1860 individuals,
and identified the most predictive model in NLW as measured by Nagelkerke’s pseudo-R2.
Following our naming convention for PRS strategies (see Methods), our first design is termed
NLW_NLW, as the discovery GWAS was performed in NLW and the model was also optimized in
held-out NLW samples (Figure 3.1). We then evaluated the performance of this model in
independent samples of NLW (306 cases and 258 controls) and LAT (592 cases and 509 controls)
individuals from CCLS.
The best model in the NLW_NLW approach was based on the P+T method, with p-value
threshold of 5e-6, pruned by in-sample LD with r2 threshold of 0.2 within 250kb of each index
variant. This model consisted of 90 SNPs across the genome and is significantly associated with
case/control status in both CCLS NLW and LAT cohorts (P = 4.2e-4 and 2.2e-6 for NLW and
LAT, respectively). The resulting PRS explained 3.0% of the variance in CCLS NLW cohort, as
measured by pseudo R2 (Table 3.2). The same PRS model explained 2.8% of the variance in the
CCLS LAT cohort (Table 3.2), suggesting minimal loss of transferability in efficacy. We also
note that the predictive power is fairly low in either cohort, suggesting that the discovery GWAS
in NLW alone may be underpowered and not sufficiently informative to guide genomic PRS
construction, despite the PRS being significantly associated with case/control status (see
Discussion). The AUCs in NLW and LAT are 0.598 and 0.604, respectively, in the full prediction
model, including PRS as well as sex and 10 principal components.
An alternative approach to evaluate the transferability of the genomic PRS model is to
assess whether the prediction efficacy differs by proportion of European ancestry in the Latino
individuals. Because the CCRLP LAT individuals have not been used in the NLW_NLW model,
we can evaluate the prediction accuracy in CCRLP LAT individuals (N = 3901; 1878 cases). In
tertiles of LAT individuals, each with approximately 1300 individuals (Table 3.2), we found
little evidence of differences in performance across strata of ancestry proportions (Pseudo R2 =
0.009, 0.014, 0.006 across the highest, middle, and lowest tertiles by European ancestry in LAT;
AUC = 0.549, 0.559, and 0.533, respectively; Table 3.2). Again, the predictive power is fairly low
across strata, suggesting that the discovery GWAS may be underpowered. Nevertheless, we
identified little evidence that there is a difference in transferability between NLW and LAT
populations or ancestries.
Table 3.2: Performance of the best model for NLW_NLW strategy across different testing
datasets
Testing dataset | Sample size | P-value | AUC | AUC_CI | PseudoR2 | SE
CCLS NLW | 564 | 4.22E-04 | 0.6251 | 0.5788-0.6713 | 0.0297 | 0.023
CCLS LAT | 1101 | 2.24E-06 | 0.6285 | 0.5956-0.6613 | 0.0276 | 0.017
CCRLP LAT highEUR | 1300 | 2.91E-03 | 0.5866 | 0.5557-0.6175 | 0.0090 | 0.0133
CCRLP LAT medEUR | 1301 | 2.60E-04 | 0.5798 | 0.5489-0.6108 | 0.0136 | 0.0114
CCRLP LAT lowEUR | 1300 | 1.08E-02 | 0.5773 | 0.5464-0.6083 | 0.0065 | 0.0110
Improving the prediction accuracy of genomic PRS for Latinos
We then evaluated strategies that could improve the prediction accuracy of genomic PRS
models for Latinos. We first evaluated a scenario where the LAT individuals were used as the
training cohort even though the discovery GWAS was still from NLW (NLW_LAT strategy; 2a in
Figure 3.1). We found that in this case, the best model was an LDPred2 model with parameters
ρ = 0.0032 and h2 = 0.1663. This model appears to be an improvement in performance over the
best NLW_NLW model (Pseudo R2 = 0.056 +/- 0.019, compared to 0.028 +/- 0.017 under the
NLW_NLW approach; Table 3.3), though we may still be suffering from inadequate power
in the NLW discovery GWAS.
We also evaluated a scenario where the LAT individuals were used in the discovery GWAS. In this
case, 1518 cases and 6910 controls of LAT individuals from CCRLP+GERA were used in the
discovery GWAS, and 199 models based on either P+T or LDPred2 were optimized in 360 cases
and 1500 controls from held-out LAT individuals (LAT_LAT strategy; 2b in Figure 3.1). The
best PRS model from this approach was an LDPred2 model with parameters ρ = 0.00018 and
h2 = 0.4536. In 592 cases and 509 controls from CCLS used to test this model, the performance was
significantly better (Pseudo R2 = 0.100 +/- 0.022, compared to 0.028 +/- 0.017 under the
NLW_NLW strategy; Table 3.3).
Table 3.3: Summary of best performing model for each strategy tested on CCLS LAT
Strategy | Approach | Parameters | SNP_CT | P-value | AUC | SE_AUC | PseudoR2 | SE
NLW_NLW | P + T | p=5e-6, r2=0.2 | 81 | 2.24E-06 | 0.629 | 0.033 | 0.028 | 0.017
NLW_LAT | LDPred2 | p=0.0032, h2=0.1663, nosparse | 1083465 | 1.47E-11 | 0.647 | 0.032 | 0.056 | 0.019
META_NLW | P + T | p=5e-6, r2=0.2 | 100 | 2.29E-19 | 0.694 | 0.031 | 0.104 | 0.024
META_LAT | LDPred2 | p=0.0018, h2=0.119, nosparse | 1083465 | 3.16E-24 | 0.706 | 0.030 | 0.134 | 0.025
LAT_LAT | LDPred2 | p=0.00018, h2=0.4536, sparse | 1083465 | 6.57E-19 | 0.687 | 0.031 | 0.100 | 0.023
Finally, as discovery GWAS based in NLW or LAT are both potentially underpowered, we
also evaluated the meta-analysis design that combines all four multi-ethnic cohorts from
CCRLP+Kaiser as well as the COG samples. In total, the GWAS contained 4266 cases (2306,
1518, 318, and 124 in NLW, LAT, EAS, and AFR, respectively) and 72766 controls (58772,
6910, 5017, and 2067 in NLW, LAT, EAS, and AFR, respectively). We then trained the best
genomic PRS model in either NLW (META_NLW strategy; 2c in Figure 3.1) or LAT
(META_LAT strategy; 2d in Figure 3.1), both using the held-out 360 cases and 1500 controls. The
best model under META_NLW is a P+T approach with p-value threshold = 5e-6 and LD
threshold r2 = 0.2 within 500kb. Under this model, the prediction accuracy in LAT is also better
than the naïve NLW_NLW strategy (Pseudo R2 = 0.104 +/- 0.024 vs. 0.028 +/- 0.017; Table
3.3), but only marginally higher than the LAT_LAT strategy (Pseudo R2 = 0.100 +/- 0.022). The
strategy that performed the best turned out to be META_LAT, where the prediction performance
was the highest (Pseudo R2 = 0.134 +/- 0.025; AUC for the full model including PRS, sex, and
PCs = 0.706; Table 3.3). Our results thus suggest that, given the currently available data,
combining the largest multi-ethnic sample for discovery GWAS, followed by training in an
ethnic-specific cohort, will lead to the best genomic PRS model in terms of prediction accuracy.
As the multi-ethnic meta-analysis GWAS is the most powerful discovery GWAS
currently available, we also evaluated the transferability of the PRS model from the
META_NLW strategy by comparing the PRS performance in CCLS NLW vs. LAT samples.
The Pseudo R2 remains comparable between the two cohorts (0.104 +/- 0.032 for NLW vs.
0.104 +/- 0.024 for LAT). This result is consistent with the attempt described above
(NLW_NLW strategy), suggesting that unlike other diseases, the transferability issue may be less
prominent for ALL, which could be driven by the multi-ethnic representation in the discovery
GWAS, the admixed nature of the LAT genomes, or the genetic architecture of ALL.
Genetic Architecture of ALL
LDPred2 as well as other statistical modelling approaches to build genomic PRS have
been shown to generally outperform a single P+T model [135,136]. LDPred2 has two different
modes of inference, LDPred2-grid and LDPred2-inf, where the former assumes that some
proportion of the variants is causal and that parameters need to be optimized in a grid, while
the latter assumes an infinitesimal model where every variant has an effect with a mean of 0 and
some small variance. In our META_NLW and META_LAT approaches, where we have the most
powerful discovery GWAS to guide PRS model construction, we noticed that LDPred2-grid models
consistently outperform the LDPred2-inf models (e.g. Pseudo R2 = 0.134 in LDPred2 vs. 0.015
in LDPred2-inf in CCLS LAT under the META_LAT strategy; Table 3.4). Our results are thus
consistent with a more oligogenic architecture of ALL, such that LDPred2-inf is a less
appropriate model for the pattern of summary statistics in the ALL GWAS.
Genomic PRS vs. PRS based on known loci
Generally speaking, genomic PRS, whether through P+T, LDPred2, or other similar approaches,
are expected to be more accurate in risk prediction or stratification than a simple PRS model
based solely on the set of known GWAS loci (i.e. those that have been shown to reach a P-value
less than 5e-8 in one or more GWAS for a particular trait). Indeed, in each of the strategies that
we have examined, the best genomic PRS models tend to be better than the P+T model with a
p-value threshold of 5e-8, a special case which is equivalent to building a PRS model with the
genome-wide significant loci. For instance, under the META_LAT strategy, the best PRS model
achieved a pseudo R2 of 0.134 +/- 0.0251, while the best P+T model with a P-value threshold of
5e-8 only attained a pseudo R2 of 0.086 +/- 0.021.
Table 3.4: Predictive performance of best performing LDPred2 model vs. LDPred2-Inf
Strategy | Approach | CCLS NLW PseudoR2 | CCLS NLW SE | CCLS NLW AUC | CCLS LAT PseudoR2 | CCLS LAT SE | CCLS LAT AUC
META_NLW | LDPred2 | 0.120 | 0.033 | 0.691 | 0.071 | 0.021 | 0.663
META_NLW | LDPred2-Inf | 0.016 | 0.021 | 0.588 | 0.016 | 0.016 | 0.610
META_LAT | LDPred2 | 0.135 | 0.025 | 0.706 | 0.134 | 0.025 | 0.706
META_LAT | LDPred2-Inf | 0.016 | 0.021 | 0.587 | 0.015 | 0.015 | 0.609
However, the genomic PRS requires a training dataset to optimize the parameters for
building the PRS. This necessitates holding out training data, which in turn reduces the sample
sizes available for GWAS. While this may not be a major problem for common diseases, it could
be a concern for a rare disease such as ALL. In order to evaluate the genomic PRS, we had to
reduce our case proportions by 8.7% (from 2666 cases to 2306 cases and from 1878 cases to
1518 cases after removing 360 cases each for NLW and LAT, respectively, from CCRLP+Kaiser
as training samples). Thus, an alternative approach could have been to simply take all known
loci in the literature to build a simple PRS model, and test this PRS model in independent cohorts.
We built a PRS model using 23 SNPs known to be associated with ALL, identified across 11
studies [23,84–89,102,103,133,134]. These 23 SNPs were derived from 19 loci, including 4
conditionally independent secondary associations at 4 loci. These associated SNPs were identified
in one or more independent cohorts in the literature, including the full CCRLP+GERA datasets that
were used for constructing and evaluating genomic PRS models above. Because there is no need to
train and optimize the PRS model in held-out samples, we directly tested this PRS in the
independent CCLS cohort, which was not used in the discovery of any of these 23 loci (although it
had been used as part of the replication cohort in previous studies). This strategy produced better
prediction accuracy than the best performing genomic PRS models in both LAT and NLW cohorts
(Pseudo R2 = 0.166 +/- 0.0254; AUC = 0.726 in CCLS LAT; Table 3.5).
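To make the idea of a simple PRS based on known loci concrete, a minimal sketch follows, assuming the usual construction in which each risk-allele dosage (0 to 2) is weighted by the natural log of its reported odds ratio and the products are summed. The SNP names, odds ratios, and the choice to treat a missing genotype as zero dosage are hypothetical and only serve the example; they do not reproduce the exact weights used for the 23-SNP model.

```python
import math


def simple_prs(dosages, odds_ratios):
    """Sum of risk-allele dosages weighted by log odds ratios.

    dosages: {snp_id: risk-allele dosage in [0, 2]} for one individual.
    odds_ratios: {snp_id: per-allele odds ratio from the literature}.
    SNPs missing from `dosages` contribute zero (an assumption of this sketch).
    """
    score = 0.0
    for snp, odds_ratio in odds_ratios.items():
        score += math.log(odds_ratio) * dosages.get(snp, 0.0)
    return score


# Hypothetical two-SNP example; names and values are placeholders.
toy_weights = {"snp_a": 1.61, "snp_b": 1.71}
toy_person = {"snp_a": 1.0, "snp_b": 2.0}
toy_score = simple_prs(toy_person, toy_weights)
```

Because the weights come directly from published effect sizes, no parameter has to be tuned in a held-out training sample.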
Table 3.5: Predictive performance of best performing genomic PRS model vs. conventional
model constructed with 23 known ALL risk SNPs
                  CCLS NLW                   CCLS LAT
                  PseudoR2  SE      AUC      PseudoR2  SE      AUC
Conventional PRS  0.151     0.0342  0.706    0.166     0.0254  0.726
Genomic PRS       0.104     0.032   0.675    0.134     0.025   0.706
The conventional PRS model was built using 23 SNPs known to be associated with ALL, identified
across 11 studies [23,84–89,102,103,133,134]. These 23 SNPs were derived from 19 loci, including 4
conditionally independent secondary associations at 4 loci, that were identified in one or more
independent cohorts in the literature. The best genomic PRS model for CCLS NLW and CCLS
LAT was used for comparison (Table 3.2).
Discussion
In the current study, we leveraged the largest available meta-analysis GWAS from multi-
ancestry groups for ALL to investigate strategies to build and evaluate polygenic risk score
models for ALL. We evaluated the extent of loss in efficacy for PRS models trained solely in
NLW populations but applied in LAT populations, explored approaches to improve PRS models
for LAT through different training strategies, and compared the genomic PRS models against a
simple model that used all known ALL-associated variants in the literature with no training or
optimization. We found little evidence of a loss in efficacy when transferring the PRS model
between populations, though our conclusion may be driven by the lack of power in an NLW-only
GWAS within the CCRLP+GERA cohort. We also found that while leveraging multi-ethnic
information to increase GWAS sample sizes and representation could lead to much more
effective genomic PRS models (pseudo R2 = 0.134+/-0.025, AUC = 0.706), this model currently
still has lower prediction accuracy compared to a simple model using only 23 known ALL-
associated SNPs (pseudo R2 = 0.166, AUC = 0.7256) that were derived from multiple
independent cohorts in the literature (including ones that we do not have access to and that were
not utilized in this study for building genomic PRS).
We took two analytical approaches to evaluate the transferability of a PRS model for
ALL. We constructed and trained the best model in NLW (NLW_NLW strategy; Figure 3.1) in
CCRLP+GERA samples. In the first analytical approach we tested the model in independent
NLW and LAT cohorts from CCLS. We observed little difference in performance in CCLS
(62.5% in NLW vs. 62.8% in LAT by AUC; Table 3.2). However, we note that the pseudo R2 is
low, likely reflecting the fact that our NLW GWAS was not sufficiently powered or informative
for a genomic PRS approach. This may also be the reason why a P+T model appears to be the
most predictive design, even though generally speaking LDPred2 should be more efficacious [140].
In the second analytical approach, we tested the best model in stratified CCRLP LAT individuals
that had not been used for PRS construction and optimization. In this analysis, we similarly did
not observe differences in pseudo R2 across strata of LAT individuals by estimated European
ancestry (Table 3.2). If there had been overt transferability issues, we would expect that strata
with the highest European ancestry would have higher prediction accuracy compared to strata
with lower European ancestry. While the lack of difference in prediction accuracy in LAT vs.
NLW could be explained by low efficacy of the PRS model trained in this manner, we also note
that we observed little difference in the best PRS model from the META_NLW strategy; the
pseudo R2 is 0.104 in both CCLS LAT and NLW, despite the model being trained in NLW.
Using the meta-analysis GWAS, the constructed PRS model would have sufficient power to
stratify individuals by case or control status, as the PRS is strongly associated with case/control
status in either cohort (P = 1.5e-10 and 2.3e-19 in CCLS NLW and LAT with similar OR of 1.88
and 1.93, respectively). In this case, even though NLW samples are the largest contributor of this
multi-ethnic meta-analysis GWAS by both case and control counts, because LAT samples (in
addition to EAS and AFR samples) had been included, it is not clear whether the similar
performance across populations is driven by representation in GWAS, European-ancestry
admixture in LAT samples, or sufficiently shared genetic architecture between populations for
ALL that is minimally impacted by LD differences. However, even if there are few transferability
issues with PRS models for ALL, there are many other reasons to be more inclusive in GWAS
representation, as diverse ancestries can help with better fine-mapping of causal loci, and can
identify population-specific risk alleles [10,137].
As one possible explanation for comparable PRS efficacy between LAT and NLW is
shared genetic architecture, we also found suggestions that ALL may follow a more oligogenic
architecture where a number of large effect causal loci exist on top of a polygenic background of
smaller effect causal loci. Our previous study [133] had demonstrated that the genetic correlation
between NLW and LAT is high (rG = 0.714 +/- 0.13), but significantly different from 1 (P =
0.014). Here, we have shown that an LDPred2 model for PRS assuming infinitesimal causal loci
drastically underperforms compared to one without the assumption (Table 3.4). Combining these
two observations, we speculate that the disease architecture for ALL may be driven by a few
large effect loci, such as CDKN2A or IKZF1, that are shared across ancestries. The lower genetic
correlation between populations may then be driven by significant differences in the polygenic
background, or by other population-specific alleles. But as these causal loci have smaller effects,
PRS efficacy, and hence transferability, is driven largely by the main effect loci, at least within
the resolution of the sample size of the current validation cohort (i.e. CCLS).
Another possible explanation for not observing poor transferability between NLW
and LAT may lie in our validation dataset, CCLS. We chose to use CCLS as the validation
dataset to test optimized PRS models because it had not been used in the discovery GWAS, is a
stand-alone cohort in which cases and controls were well matched, and was ascertained from a
similar geographical region (California) as our main discovery cohort, CCRLP+GERA.
However, the NLW subcohort for CCLS is smaller than the LAT subcohort (N = 564 NLW vs.
1101 LAT, 306 NLW cases vs. 592 LAT cases). The small sample size in NLW may cause
pseudo R2 estimates to be noisier, and thus disguise possible transferability issues. Consistent
with this hypothesis, we found that a number of known ALL loci did not replicate independently,
even at the nominal level of P < 0.05, in the CCLS NLW subcohort (Table 3.6).
Table 3.6: Comparison of effect sizes and P-values of SNPs in known risk loci for ALL in
CCRLP and CCLS NLW
                            CCRLP            CCLS
Gene     SNP                OR    P-value    OR    P-value
C5orf56  5:131711100:G:A    1.24  1.74E-03   1.15  0.543
BAK1     6:33571483:G:C     1.20  2.75E-04   0.77  0.093
IKZF1    7:50466304:A:G     1.61  2.39E-32   1.81  9.86E-06
8q24     8:130155860:C:T    1.38  1.49E-10   1.28  0.148
CDKN2A   9:21975319:T:A     2.09  2.22E-14   1.69  0.118
TLE1     9:21975319:T:A     1.50  6.20E-04   1.65  0.21
GATA3    10:8114922:A:G     1.23  5.63E-05   1.05  0.76
BMI1     10:22382802:C:T    1.32  1.73E-08   1.30  0.095
PIP4K2A  10:22963683:C:A    1.36  7.52E-12   1.06  0.654
ARID5B   10:63721176:C:T    1.71  4.37E-42   1.94  2.31E-07
LHPP     10:126292497:G:C   1.16  7.97E-04   1.03  0.841
ELK3     12:96599854:G:A    1.99  1.60E-03   1.83  0.497
CEBPE    14:23589349:A:G    1.31  1.20E-10   1.48  2.32E-03
IKZF3    17:37957235:T:C    1.66  8.86E-05   1.67  0.227
IGF2BP1  17:47339852:C:T    1.35  9.79E-04   1.19  0.55
ERG      21:39930483:T:C    1.21  6.35E-05   1.24  0.156
Taken together, our study provides guidance for future designs to propose and
evaluate PRS models for ALL. Firstly, continuing to increase the sample size and representation in
GWAS is imperative. We have shown that genomic PRS using LDPred2 outperforms that based
on just genome-wide significant loci (P+T with P-value threshold < 5e-8; Table 3.3) within the
same GWAS. But this model currently does not outperform one simply based on aggregating all
known loci from literature, effectively combining information across multiple independent
GWAS datasets. Therefore, an aggregation of available GWAS through a consortium effort
should provide the ideal dataset to train better genomic PRS models. Efforts like the Childhood
Leukemia International Consortium (CLIC) should provide the best resources in the foreseeable
future. CLIC includes around 10,000 cases and 100,000 multi-ethnic cases and controls. In this
setting, we could hold out CCRLP and split it into a training dataset and a validation dataset without
severely impacting the power of the discovery GWAS, which will always be a concern for
studies of rare diseases. In this design, we can ensure more homogeneity and greater sample size
for both the training and validation dataset and would have more cohorts with diverse ancestry to
iteratively assess the transferability of the PRS models. This is particularly important for Latino
populations, given the known heterogeneity in ancestry compositions and fine-scale structure of
Latinos across the United States. This design will provide a more robust evaluation of the
transferability of PRS models, as the NLW_NLW strategy would also have larger sample sizes,
and the resulting genomic PRS model would presumably be more strongly associated with the
outcome. Unfortunately, CLIC still does not have all available ALL data, but this will be a clear
step towards a more robust design.
Secondly, given the suspected oligogenic architecture of ALL, alternative PRS strategies
that incorporate information from the distribution of effect sizes may also further improve
performance from a methodological standpoint. While LDPred2 controls to some extent the
proportion of the genome underlying a trait through optimization of the ρ parameter, its prior is
ultimately a spike-and-slab prior. A more direct modelling of the distribution of effect sizes, on
top of a polygenic background, may prove to be a better model for ALL. Methods following these
types of models are emerging (e.g. see ref [143]), and will likely become more mature in the near
future. But even without a unified framework to model effect size distributions, a simpler
approach [144] that combines weighted PRS could also be more effective. In this case, one score
would be derived from sections of the genome known to be associated with ALL that may also
include multiple secondary but independent causal variants, and the other score could be derived
from LDPred2 or similar approaches from the rest of the genome. The weights between these
two scores can then be optimized in the training dataset as an additional parameter to derive a
score that may outperform any of the existing models evaluated here.
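The two-score idea described above can be sketched as follows, assuming a convex combination w * (known-loci score) + (1 - w) * (genome-wide score) and a simple grid search over w that maximizes AUC in the training data. The choice of AUC as the objective, the grid resolution, and all names and toy values are assumptions of this sketch rather than the method of ref [144].

```python
def auc(case_scores, control_scores):
    """Rank-based AUC: P(case score > control score), ties count 0.5."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            wins += 1.0 if case > control else 0.5 if case == control else 0.0
    return wins / (len(case_scores) * len(control_scores))


def best_weight(known, genomic, is_case, grid_steps=10):
    """Grid-search the mixing weight w in [0, 1] that maximizes training AUC."""
    best_w, best_auc = 0.0, -1.0
    for step in range(grid_steps + 1):
        w = step / grid_steps
        combined = [w * k + (1.0 - w) * g for k, g in zip(known, genomic)]
        cases = [s for s, y in zip(combined, is_case) if y]
        controls = [s for s, y in zip(combined, is_case) if not y]
        current = auc(cases, controls)
        if current > best_auc:  # strict '>' keeps the smallest best w
            best_w, best_auc = w, current
    return best_w, best_auc


# Toy training data (made up): neither score alone separates cases perfectly.
toy_known = [1.0, 0.2, 0.5, 0.0]
toy_genomic = [0.1, 0.9, 0.2, 0.3]
toy_is_case = [True, True, False, False]
toy_w, toy_auc = best_weight(toy_known, toy_genomic, toy_is_case)
```

In practice, both scores would typically be standardized first, and the chosen weight would then be evaluated in an independent validation cohort.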
CHAPTER 4
Risk allele associated with childhood acute lymphoblastic leukemia at the
IKZF1 locus is associated with Indigenous American ancestry and absent in
European-ancestry populations.
Soyoung Jeon¹, TszFung Chan¹, Liam Cato², Nathan Nakatsuka², Vijay G. Sankaran², Steven Gazal¹,
Nicholas Mancuso¹, Catherine Metayer³, Xiaomei Ma⁴, Joseph L. Wiemels¹, Charleston W.K. Chiang¹,
Adam J. de Smith¹
Affiliations:
1. Center for Genetic Epidemiology, Department of Preventive Medicine, Keck School of
Medicine, University of Southern California, Los Angeles, CA
2. Dana-Farber/Boston Children’s Cancer and Blood Disorders Center, Harvard Medical
School, Boston, MA
3. Division of Epidemiology & Biostatistics, School of Public Health, University of
California, Berkeley, CA
4. Department of Chronic Disease Epidemiology, Yale School of Public Health, New Haven,
CT
Introduction
Certain types of cancer disproportionately affect specific racial or ethnic groups.
Understanding the causes of this will be essential to alleviate health disparities and reveal novel
etiologic insights [145]. One notable example is acute lymphoblastic leukemia (ALL), the most
common cancer in children, for which self-reported Latino individuals have the greatest risk in the
United States. Latino children have an approximately 1.3-fold higher incidence of ALL than
non-Latino Whites, and this difference rises to >2-fold in adolescents and young adults [91].
Moreover, Latino ALL patients have increased risk of therapy-related neurotoxicity [146] and inferior
overall survival compared to their non-Latino white counterparts [92]. Intriguingly, the disparity in
ALL incidence appears to be driven entirely by the B-cell ALL (B-ALL) immunophenotype [147,148],
suggesting the possibility of specific underlying biological mechanisms.
Genome-wide association studies (GWAS) have identified several well-replicated single
nucleotide polymorphisms (SNPs) associated with childhood ALL risk, including in genes
involved in B-cell development and hematopoiesis such as IKZF1, ARID5B, CEBPE, and GATA3
[21,86,88,149]. Several ALL risk alleles have a higher frequency in Latino populations than in
individuals of predominantly European ancestry, in particular at ARID5B and GATA3 [39,87]; the risk
allele frequencies of these and other ALL risk SNPs have also been positively correlated with
Indigenous American ancestry proportions [23,39,150]. Moreover, recent GWAS conducted specifically
in Latino individuals revealed a novel ALL risk locus in another key regulator of hematopoiesis,
ERG, which showed only modest association with ALL in European ancestry individuals and an
increasing risk of disease in Latino individuals with increasing Indigenous American ancestry [22,23].
Outside of the three aforementioned SNPs, the effect sizes of known ALL GWAS loci do not differ
substantially between Latino individuals and non-Latino whites. Moreover, a polygenic risk score
(PRS) based on all known GWAS SNPs for ALL demonstrated that Latino individuals had only
an ~10% increase in their mean PRS relative to non-Latino whites [58]. Therefore, the majority of the
increased ALL risk in Latino individuals is not explained by the currently known association loci.
In a recent trans-ancestry GWAS of childhood ALL, we identified multiple independent
SNP associations at the IKZF1 locus, including one SNP that appeared unique to non-European
ancestry populations [58,151]. However, the causal variants underlying the IKZF1 association, the total
number of risk variants present at this locus, the relevant cell types in which they may operate, and
their underlying mechanisms have not been comprehensively examined. Here, we carried out fine-
mapping of the IKZF1 region in Latino, non-Latino white, and East Asian childhood ALL cases
and controls. Remarkably, we find that Latino populations are unique in harboring three
independent risk loci at IKZF1, one of which (rs76880433) derives from Indigenous American
haplotypes, demonstrates evidence of positive selection in Latino populations, and explains a
considerable portion of the increased ALL risk in Latino children. We also demonstrate that
putative causal variants at two of these loci – rs1451367 and rs17133807 – reside within an
enhancer for IKZF1 that is most active in developing B-cell progenitors, and the loss of which
leads to a reduction in IKZF1 expression accompanied by an accumulation of pro-B cells.
Intriguingly, we find that this regulatory element is occupied by IKZF1 itself and the Latino
population-enriched variant rs1451367 appears to alter a motif for this transcription factor,
suggesting that this variant may disrupt a positive feedback mechanism required for effective B
cell development. Our study demonstrates the molecular and evolutionary mechanisms through
which a single allele, not identified until Latino populations were studied, explains
a considerable portion of the disparity in ALL risk in Latino children.
Methods
Study Subjects
The study protocol was approved by the Institutional Review Boards at the California
Health and Human Services Agency, University of Southern California, University of California
San Francisco, and Yale University. This study includes childhood ALL cases and controls
identified in the California Cancer Records Linkage Project (CCRLP) [58]. The acquisition of
newborn dried bloodspot (DBS) samples has been described in detail previously [21]. Briefly, DBS
for cases and controls were obtained from the California Biobank Program, California Department
of Public Health (CDPH), Genetic Disease Screening Program. Childhood ALL cases (<14 years
of age) were identified through linkage between the CDPH statewide birth records (years 1982–
2009) and the California Cancer Registry (CCR, diagnosis years 1988–2011), and controls were
randomly selected and matched on year and month of birth, sex and race/ethnicity (non-Latino
White, non-Latino Black, Latino (any race), Asian/Pacific Islander, other) as reported in the birth
records. Additional controls were included from the Genetic Epidemiology Research on Aging
(GERA) study (dbGaP accession: phs000788.v1.p2).
We note the complexity of discussing race, ethnicity, and ancestry in a genetic study. In
this study, analyses were limited to Latinos, non-Latino Whites, and East Asians (identified as
described below), given the disparity of ALL incidence in self-reported Latinos in the U.S. and
given that the novel IKZF1 risk allele was common only in Latino and East Asian populations.
Data Processing, Quality Control, and Imputation
Genome-wide single nucleotide polymorphism (SNP) genotyping was performed for all
individuals in CCRLP and GERA using the Affymetrix Axiom World LAT array [21]. Quality control
(QC) of SNP array data and samples was carried out in each population group, as previously
described in our recent trans-ancestry GWAS meta-analysis of ALL [58]. In brief, for pre-imputation
QC, we excluded the sex chromosomes and filtered out SNPs based on call rates < 98%, minor
allele frequency (MAF) < 0.01, and Hardy-Weinberg equilibrium in controls (P < 10^-5), and
removed samples based on genome-wide relatedness (PI_HAT > 0.20), genome heterozygosity rate
(mean heterozygosity ± 6 standard deviations), and call rates < 95%. In general, we included
individuals in each of the three population groups based on their self-reported race/ethnicity and
did not reassign individuals based on estimated genetic ancestry [58]. Principal components
analysis (PCA) was performed for each population group along with reference data from the 1000
Genomes Project [152] to identify extreme outlier individuals that clustered separately from other
individuals in their self-reported race/ethnicity groups. Among self-reported Asians, we identified
a small subset (29 cases and 51 controls from CCRLP, 31 individuals from GERA) that clustered with
South Asian reference individuals and were subsequently removed from our analyses, with remaining
individuals classified as East Asians. No such outlier individuals were identified among Latinos
or non-Latino Whites. Pre-imputation GWAS was performed between population-matched
CCRLP controls and GERA individuals to protect against between-cohort batch effects [58].
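The variant-level pre-imputation thresholds listed above can be expressed as a small filter. This is only a sketch of the stated cutoffs (call rate >= 98%, MAF >= 0.01, Hardy-Weinberg P >= 10^-5 in controls); the dictionary field names and the toy variants are hypothetical, and the per-variant statistics are assumed to have been computed beforehand (e.g. by PLINK).

```python
def passes_variant_qc(variant, min_call_rate=0.98, min_maf=0.01, min_hwe_p=1e-5):
    """Return True if a variant survives the three pre-imputation filters.

    `variant` is a dict with hypothetical keys:
      call_rate       fraction of samples with a non-missing genotype
      maf             minor allele frequency
      hwe_p_controls  Hardy-Weinberg equilibrium P-value in controls
    """
    return (
        variant["call_rate"] >= min_call_rate
        and variant["maf"] >= min_maf
        and variant["hwe_p_controls"] >= min_hwe_p
    )


# Toy summary statistics (made up for illustration only).
toy_variants = [
    {"id": "v1", "call_rate": 0.995, "maf": 0.20, "hwe_p_controls": 0.40},
    {"id": "v2", "call_rate": 0.970, "maf": 0.20, "hwe_p_controls": 0.40},   # low call rate
    {"id": "v3", "call_rate": 0.995, "maf": 0.005, "hwe_p_controls": 0.40},  # rare
    {"id": "v4", "call_rate": 0.995, "maf": 0.20, "hwe_p_controls": 1e-7},   # HWE failure
]
kept_ids = [v["id"] for v in toy_variants if passes_variant_qc(v)]
```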
Whole-genome imputation was performed for each population group, separately for
CCRLP and GERA datasets, using the TOPMed Imputation Server [153], with additional QC filtering
performed post-imputation. We removed variants in each population group based on imputation
quality (R2 < 0.3), MAF (< 0.01), and allele frequency difference between non-Finnish Europeans
in the Genome Aggregation Database (gnomAD) and CCRLP NLW controls (> 0.1). We next
performed another GWAS between CCRLP controls and GERA individuals, and removed variants
with P < 1x10^-5. CCRLP and GERA datasets for Latinos, non-Latino Whites, and East Asians were
then merged to perform GWAS of childhood ALL. After quality control filtering, the Latino
GWAS included 1,878 cases and 8,441 controls, the non-Latino White GWAS included 1,162
cases and 57,341 controls, and the East Asian GWAS included 318 cases and 5,017 controls.
Chromosome 7p12 Association Testing and Conditional Analysis
SNP association testing was limited to a ~600 kb region centered +/-250 kb around the
IKZF1 gene at chromosome 7p12.2-p12.1 (chr7: 50,054,716 – 50,655,101, hg38). In each
population group, we used PLINK v2 [154] to test the association between imputed genotype dosage
and case-control status in logistic regression, after adjusting for the top 20 principal components
(PCs). We did not include sex as a covariate, and we found sex was not correlated with genotype
dosage of any of the putatively associated SNPs (data not shown). Next, we repeated the
association testing in each population group conditioning on the top SNP in that group, where we
included the SNP genotype in the logistic regression model (conditional test 1). Then, we repeated
the association testing conditioning on the top SNP in the main GWAS plus the top SNP in
conditional test 1 (conditional test 2). In Latinos, where 3 independent loci were identified, we
repeated the analysis conditioning on top SNPs from the main GWAS and conditional tests 1 and 2.
A genome-wide threshold of P < 5 x 10^-8 was used for significance in each test.
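The sequence of conditional tests described above follows a simple loop: take the top SNP, and if it passes the genome-wide threshold, add it to the set of conditioning SNPs and test again. The sketch below shows only this control flow. The `assoc` callable is a hypothetical stand-in for a regional logistic regression run (performed with PLINK in this study), and the toy lookup simply replays the Latino P-values reported in Table 4.1.

```python
GENOME_WIDE_P = 5e-8


def stepwise_conditional(assoc, max_rounds=10):
    """Repeatedly condition on the current top SNP until none is significant.

    assoc(conditioned) must return {snp_id: P-value} for the region, given a
    tuple of SNPs already included as covariates.
    """
    conditioned = []
    for _ in range(max_rounds):
        p_values = assoc(tuple(conditioned))
        if not p_values:
            break
        top_snp = min(p_values, key=p_values.get)
        if p_values[top_snp] >= GENOME_WIDE_P:
            break
        conditioned.append(top_snp)
    return conditioned


# Toy stand-in: P-values for the three lead SNPs as listed in Table 4.1.
TABLE_4_1 = {
    (): {"rs4917017": 2.05e-17, "rs10272724": 7.49e-10, "rs76880433": 1.62e-10},
    ("rs4917017",): {"rs10272724": 4.39e-9, "rs76880433": 8.43e-5},
    ("rs4917017", "rs10272724"): {"rs76880433": 4.67e-11},
}


def toy_assoc(conditioned):
    return TABLE_4_1.get(conditioned, {})


independent_snps = stepwise_conditional(toy_assoc)
```

With the Table 4.1 values, the loop recovers the three lead SNPs in the order in which they were identified.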
Statistical Fine-Mapping
We performed statistical fine-mapping using SuSiE [155] with GWAS summary statistics over
the region covering 1 Mb upstream of rs4917017 and 1 Mb downstream of rs10272724. We ran
the diagnostic test in SuSiE to ensure no outliers in the GWAS test statistics. We estimated the
in-sample LD matrix for each population after regressing out GWAS covariates from genotype
dosages. We set the maximum number of causal variants to be 10 and reported 90% credible sets.
Genetic ancestry analysis
Global ancestry inference was performed on CCRLP Latino cases and controls using
RFMix, using a reference panel consisting of 671 non-Finnish European individuals for European
ancestry, 716 African individuals for African ancestry, and 94 Admixed American individuals (7
Colombian, 12 Karitianan, 14 Mayan, 4 Mexican in Los Angeles, 37 Peruvian in Lima, Peru, 12
Pima, and 8 Surui) for Indigenous American (IA) ancestry from gnomAD v3.1 release [105]. The 94
IA individuals were selected based on having more than 85% global ancestry of Admixed
American (AMR) individuals from the 1000 Genomes Project [94] in an unsupervised analysis using
ADMIXTURE [142]. We stratified Latino individuals into tertiles of global EUR ancestry, and
in each group evaluated the predictive performance of the best score model for the NLW_NLW
strategy. After estimating local ancestry for each individual, we stratified Latino individuals by
whether they carried zero copies or at least 1 copy of the haplotype derived from IA ancestry, and
in each group estimated the OR for association of the three independent IKZF1 SNPs with ALL
risk in a logistic regression model for a series of conditional analyses on the top SNPs, adjusting
for age, sex, and 20 principal components using PLINK v2.0 [154].
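The tertile stratification by global ancestry can be illustrated with a short rank-based sketch. The sample IDs and ancestry proportions are made up, and the handling of ties and of group sizes that are not divisible by three is an assumption of this example rather than a description of the exact cut points used here.

```python
def assign_tertiles(ancestry):
    """Map each sample ID to tertile 1, 2 or 3 by rank of its ancestry proportion.

    ancestry: {sample_id: global ancestry proportion in [0, 1]}.
    Samples are ranked from lowest to highest proportion and split into three
    groups of (nearly) equal size.
    """
    ordered = sorted(ancestry, key=ancestry.get)
    n = len(ordered)
    return {sample: 3 * rank // n + 1 for rank, sample in enumerate(ordered)}


# Toy global EUR ancestry proportions (made up for illustration only).
toy_ancestry = {"s1": 0.10, "s2": 0.35, "s3": 0.50, "s4": 0.62, "s5": 0.80, "s6": 0.95}
toy_tertiles = assign_tertiles(toy_ancestry)
```

A score's predictive performance can then be evaluated separately within each of the three groups.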
Selection analysis
We retrieved pre-compiled estimates of the allelic ages and p-values for positive selection
for each of the IKZF1 risk alleles for IBS, CHS, PEL, and MXL populations from the 1000 Genomes
Project (1KG). The estimates of allelic ages and evidence of selection were based on an inferred
genealogical tree spanning the IKZF1 locus. The genealogical trees were inferred using “Relate” [142],
and pre-calculated trees for each 1KG population are available for download
at:
20
https://zenodo.org/record/3234689#.Y2VdouzML0p. The allelic age is estimated as the mid-
branch time in the number of generations for the branch of the genealogy where the derived allele
of each risk-associated SNP arose. The age in generations was converted to years assuming 28
years per generation. The evidence of positive selection is based on the over-representation of
lineages carrying the derived allele at the tip of the genealogy, given the distribution of carrier and
non-carrier haplotypes when the branch carrying the allele of interest first branched into two
sublineages.
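The allele age conversion described above amounts to taking the midpoint of the branch on which the derived allele arose and multiplying by the generation time. A minimal sketch follows, assuming the branch is summarized by its lower and upper ages in generations; the function name and the example branch ages are hypothetical and are not estimates from this study.

```python
YEARS_PER_GENERATION = 28  # generation time assumed in the text


def mid_branch_age_years(branch_lower_gen, branch_upper_gen,
                         years_per_generation=YEARS_PER_GENERATION):
    """Mid-branch allele age in years from the ages (in generations) of the
    lower and upper ends of the branch on which the derived allele arose."""
    if branch_lower_gen > branch_upper_gen:
        raise ValueError("lower end of the branch must not be older than the upper end")
    mid_generations = (branch_lower_gen + branch_upper_gen) / 2.0
    return mid_generations * years_per_generation


# Hypothetical branch spanning 100 to 300 generations ago.
toy_age_years = mid_branch_age_years(100, 300)
```

For this hypothetical branch, the midpoint is 200 generations, which corresponds to 5,600 years.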
eQTL analysis
To examine the effects of the ALL risk SNPs on the expression of IKZF1 and nearby genes,
we analyzed whole genome and RNA sequencing data from whole blood of 2,733 Latino (n=1976)
and African American (n=757) children from the Genes-environments and Admixture in Latino
Americans (GALA II) study and the Study of African Americans, Asthma, Genes, and
Environments (SAGE) [157]. Study subjects were children between 8-21 years of age with or without
physician-diagnosed asthma from GALA/SAGE and included 757 self-identified African
Americans (AA), 893 Puerto Ricans (PR), 784 Mexican Americans (MX), and 299 other Latinos
(LA) who did not self-identify as Mexican American or Puerto Rican. Subjects were grouped into
high (IAM_high, > 50%) or low (IAM_low, < 10%) global Indigenous American ancestry, calculated as
described [157].
Results
Fine-mapping of ALL risk variants at IKZF1 reveals three independent association loci in
Latino children
While robust signals for childhood ALL risk had been previously identified in the IKZF1
locus, the precise number and identity of potential causal variants in this region across different
racial or ethnic groups had not been previously defined. In Latino individuals, we identified 109
genome-wide significant SNPs at the IKZF1 gene locus at chromosome 7p12.2, with top SNP
rs4917017 (P=2.05x10^-17, OR=1.41) located ~9 kb upstream of the transcription start site (Figure
4.1A). There appeared to be two distinct association peaks separated by a recombination peak, one
at the promoter region and the other spanning the 3’ end of IKZF1. In conditional analyses
adjusting for rs4917017, only the 3’ end peak remained genome-wide significant with top SNP
rs10272724 (P=4.39x10^-9, OR=1.29), which is not in LD with rs4917017 (r2=0.002) in Admixed
Americans in the 1000 Genomes Project (LDLink) (Figure 4.1B). These two risk loci were
previously reported in ALL GWAS across multiple populations.
Interestingly, analysis adjusting for both rs4917017 and rs10272724 revealed a third,
independent genome-wide significant peak spanning the 3’ end of IKZF1 (Figure 4.1C). The top
SNP in this third locus, rs76880433 (P=4.67x10^-11, OR=1.44), was not in LD with either
rs4917017 (r2=0.09) or rs10272724 (r2=0.07) in Admixed Americans. In fact, the rs76880433
risk allele T appears to lie on the opposing haplotype from the rs10272724 risk allele C, which we
confirmed in genotype data in our study. Intriguingly, the evidence of association and the effect
size of rs76880433 were reduced in the first conditional analysis adjusting only for rs4917017, but
increased in the conditional analysis adjusting for both rs4917017 and rs10272724 (Table 4.1).
Figure 4.1 - Conditional association results for CCRLP Latinos
LocusZoom plots of associations near IKZF1 in (a) marginal analysis, (b) conditional analysis adjusting for
rs4917017, and (c) conditional analysis adjusting for rs4917017 and rs10272724. Diamond symbol indicates the
lead SNP in each cohort. Color of remaining SNPs is based on linkage disequilibrium (LD) as measured by
r2 with the lead SNP in the respective cohort. All coordinates in x-axis are in hg38.
Table 4.1 - Marginal and conditional analysis results for three independent risk SNPs in IKZF1 in Latinos
IKZF1 SNP   Chromosome 7 location  Risk            Marginal         Conditional 1    Conditional 2
rsID        (bp, hg38)             Allele  RAF     OR    P-value    OR    P-value    OR    P-value
rs4917017   50295636               A       0.437   1.41  2.05E-17
rs10272724  50409515               C       0.268   1.3   7.49E-10   1.29  4.39E-09
rs76880433  50384350               T       0.153   1.37  1.62E-10   1.22  8.43E-05   1.44  4.67E-11
Abbreviations: RAF: Risk Allele Frequency; OR: Odds Ratio
In non-Latino whites, there was a single genome-wide significant peak downstream of the
3’ end of IKZF1 with top SNP rs17133805, which is in near-perfect LD with rs10272724 (r2=0.995)
in European ancestry populations. Adjusting for rs17133805 revealed a second peak at the IKZF1
promoter region with top SNP rs9886239 (P=4.01x10^-5, OR=1.20), which is in LD with rs4917017
(the top Latino SNP at this locus) in Europeans (r2=0.88). Adjusting for both rs17133805 and
rs9886239 revealed no further significant loci. In East Asians, there were no genome-wide
significant SNPs due to small sample size, but there did appear to be two independent IKZF1 risk
loci at the 3’ end of the gene in conditional analyses. We further confirmed the observations from
conditional analysis with statistical fine-mapping of the IKZF1 region using SuSiE, which
identified three credible sets of causal variants in Latino individuals, two credible sets in non-
Latino whites, and only one credible set in East Asians.
Thus, Latino children are unique in having three independent IKZF1 risk loci for ALL, as
non-Latino whites lacked the association peak identified by rs76880433 in Latino individuals, which
itself confers an ~1.4-fold increased odds for ALL, nearly the extent of the increased ALL risk
observed in Latino children. SNP rs76880433 has the highest RAF in Latino/Admixed
Americans (17.6%) and East Asians (18.8%) in gnomAD, but is almost monomorphic in non-
Finnish Europeans (RAF=0.2%). The corresponding RAFs in CCRLP controls were 24.4% in
Latino individuals, 0.12% in non-Latino whites, and 19.8% in East Asians.
We calculated a genetic risk score (GRS) including the three lead independent IKZF1 SNPs
for both CCRLP Latinos and non-Latino whites. We found the resulting GRS to be significantly
associated with case/control status in CCRLP (OR = 2.33, P = 3.66 x 10^-29 in LAT; OR = 2.65, P
= 4.61 x 10^-33 in NLW). We found that the proportion of the variance in ALL risk explained on
the liability scale by an IKZF1 GRS was >3-fold higher in Latinos (pseudo-R2 = 0.194) than in
non-Latino whites (pseudo-R2 = 0.062). Therefore, through both the presence of a population-
enriched allele and the elevated risk allele frequency, the IKZF1 locus appears to contribute at least
partly to the differences in ALL risk between LAT and NLW.
rs76880433 risk allele frequency and effect size correlate with Indigenous American ancestry
The Latino populations are known to harbor varying extents of admixture from European,
African, and Indigenous American ancestries
158
. In the CCRLP cohort, the predominant
components of ancestries are European and Indigenous American (estimated proportion = 49%
and 39%), with relatively minor contributions from African ancestry component (estimated
proportion = 6%). Because rs76880433 is nearly monomorphic in European-ancestry populations,
we speculate that it is prevalent in Latino individuals due to the Indigenous American ancestry.
We thus analyzed the association between the three independent IKZF1 risk alleles and global and
local genetic ancestry in Latino individuals. Both rs4917017 and rs76880433 risk alleles were
significantly positively correlated with global Indigenous American ancestry proportions, whereas
rs10272724 showed no correlation (Table 4.2). SNP rs76880433 showed the greatest correlation,
with similar estimates in Latino cases (estimate = 0.296, P = 3.28 × 10⁻³⁹) and controls
(estimate = 0.283, P = 1.76 × 10⁻³⁸). Among cases and controls, the highest risk allele frequencies for
rs4917017 and rs76880433 were in individuals with at least 50% Indigenous American ancestry
(Figure 4.2).
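The kind of relationship tested above can be illustrated with a short Python sketch that correlates risk-allele dosage with global Indigenous American ancestry proportion. This assumes a plain Pearson correlation; the text reports "estimates" without naming the exact statistic or model, so this may differ from the analysis actually performed. All data values are toy numbers and the function name is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy data (not from CCRLP): risk-allele dosage at one SNP and the
# global Indigenous American ancestry proportion of the same individuals.
dosage = [0, 0, 1, 1, 2, 2]
ia_proportion = [0.10, 0.30, 0.35, 0.45, 0.50, 0.70]

r = pearson_r(dosage, ia_proportion)
print(round(r, 3))
```

A positive coefficient indicates that carriers of more risk alleles tend to have a higher Indigenous American ancestry proportion, which is the direction reported for rs4917017 and rs76880433.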
Analysis of local ancestry across IKZF1 revealed a significantly higher frequency of the
Indigenous American haplotype at rs76880433 in cases than in controls (P = 8.78 × 10⁻⁷) (Figure
4.2). In fact, the frequencies of Indigenous American haplotypes at this locus in cases (0.89) and
controls (0.76) are both significantly higher than the expectation across the genome (0.39).
Furthermore, we found an increasing effect size of rs76880433 with increasing numbers of the
Indigenous American haplotype at this locus, a pattern that was not seen for rs4917017 or
rs10272724 (Table 4.3). In analysis conditioned on SNPs rs4917017 and rs10272724, SNP
rs76880433 had an OR of 1.55 (P = 1.98 × 10⁻⁸) in Latino individuals with ≥1 copy of the
Indigenous American haplotype (n=2500), but an OR of only 1.18 (P=0.59) in Latino individuals
with zero copies (n=1402) (Table 4.3). The effect size for rs10272724 was reduced in subjects
with ≥1 Indigenous American haplotype.
Table 4.2 – Correlation between IKZF1 SNP risk alleles and global Indigenous American
ancestry proportion
                rs4917017              rs10272724             rs76880433
                estimate   p.value     estimate   p.value     estimate   p.value
all             0.218      3.48E-43    -0.017     0.299       0.291      6.77E-77
cases           0.213      1.18E-20    -0.040     0.080       0.296      3.28E-39
controls        0.217      4.48E-23    -0.001     0.980       0.283      1.76E-38
Figure 4.2 - IKZF1 SNP rs76880433 is associated with global and local Indigenous American
ancestry
(a) Bar graph showing the proportion of global ancestry for each genotype of rs76880433 in CCRLP
Latinos (b) Histogram of copies of the IA haplotype in CCRLP Latino cases and controls (c) Risk
allele frequency of rs76880433 in CCRLP Latino cases and controls stratified by Indigenous
ancestry proportions
Functional analyses to pinpoint causal variants
Having identified three distinct risk variants in Latino children, including two variants
located downstream (past the 3’ end) of IKZF1, we next sought to identify and study the history
and function of these variants. We first sought to define the likely causal variants underlying
each of the three signals. To do so, we assessed the overlap of SNPs in each of the SuSiE credible
sets with coding and regulatory regions. In credible set 1 in Latino individuals (the set of
variants in LD with rs76880433), only one of 10 SNPs, rs1451367, overlapped a putative functional
regulatory element. This variant is in high LD with rs76880433 (R² > 0.96 in Admixed Americans)
and is located downstream of IKZF1 in an apparent regulatory region based on histone
modification peaks. In credible set 2, 4/23 SNPs overlapped the same functional element,
including rs17133807, which is located only 26 bp away from rs1451367. As with tagging SNPs
rs76880433 and rs10272724, the rs1451367 and rs17133807 ALL risk alleles reside on opposing
haplotypes. In credible set 3 in Latino individuals, 8/11 SNPs mapped to intron 1 and 3 SNPs
mapped to within 10 kb of the promoter region of IKZF1. The credible set for this region in
non-Latino whites included all 11 aforementioned SNPs along with 12 additional SNPs, including
two (rs11765436 and rs11761922) that reside within the IKZF1 promoter region and are in strong
LD (D′ = 1, R² > 0.92) with the tagging SNP rs4917017. The rs11765436 risk allele introduces a
binding motif for CTCF (the non-risk allele disrupts the CTCF binding motif), disrupts a TCF3
binding motif, and lies within a strong CTCF ChIP-Seq peak.
Table 4.3 – IKZF1 SNP associations in CCRLP Latinos stratified by local ancestry
                 Zero NA haplotypes (n = 1,402)    ≥1 NA haplotype (n = 2,500)
                 OR      p.value                   OR      p.value
rs4917017        1.37    1.40 × 10⁻⁴               1.35    3.37 × 10⁻⁶
rs10272724       1.50    1.57 × 10⁻⁶               1.24    8.41 × 10⁻⁴
rs10272724 *     1.46    7.72 × 10⁻⁶               1.24    7.55 × 10⁻⁴
rs76880433       0.98    0.959                     1.31    4.50 × 10⁻⁵
rs76880433 *     1.05    0.880                     1.23    1.62 × 10⁻³
rs76880433 **    1.18    0.594                     1.55    1.98 × 10⁻⁸
*adjusted for rs4917017
**adjusted for rs4917017 and rs10272724
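The credible-set overlap step described in this section amounts to an interval lookup, which could be sketched in Python as follows. The coordinates, element names, and SNP labels are placeholders rather than real genomic positions or annotations, and the function name is hypothetical; the actual annotation sources and tools used are not specified here.

```python
# Hypothetical regulatory elements as (name, start, end) on one chromosome.
# Coordinates are placeholders, not real genomic positions.
REGULATORY_ELEMENTS = [("element_A", 1000, 1500), ("element_B", 4000, 4200)]

# Hypothetical credible-set SNPs mapped to placeholder positions.
# snp_1 and snp_2 sit 26 bp apart, like rs1451367 and rs17133807 in the text.
CREDIBLE_SET = {"snp_1": 1100, "snp_2": 1126, "snp_3": 2500, "snp_4": 4200}


def overlapping_snps(snps, elements):
    """Map each SNP to the first element whose closed interval contains it."""
    hits = {}
    for snp, pos in snps.items():
        for name, start, end in elements:
            if start <= pos <= end:
                hits[snp] = name
                break
    return hits


hits = overlapping_snps(CREDIBLE_SET, REGULATORY_ELEMENTS)
print(hits)
```

SNPs absent from the result (here `snp_3`) fall outside every annotated element and would be deprioritized as candidate causal variants under this simple rule.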
Latino-enriched locus demonstrates signals of positive selection in Latino but not in East Asian
populations
Having defined potential causal variants, we next sought to understand when the variant
rs1451367, which appeared to be largely restricted to Latino children, may have arisen in
human history. We examined data from shotgun sequencing of ancient DNA samples found in the
Americas to explore when the rs1451367 SNP arose. We found that the oldest previously
sequenced Indigenous American individual (Anzick; 12,700 years old) was heterozygous for the
rs1451367 risk allele, supporting the idea that the SNP was present in the first migrants who
entered the Americas ~13,000 years ago. Importantly, many ancient individuals in a variety of
locations in the Americas, including Patagonia, Brazil, California, and Nevada, appeared to carry
the risk allele (Figure 4.3B). In addition, we also examined the ancient genome browser
(https://bioinf.eva.mpg.de/jbrowse/) and found that the rs1451367 SNP was not present in
Neandertal or Denisova genomes, but was heterozygous in the Ust’-Ishim genome, suggesting that
the mutation arose at least 45,000 years ago¹⁵⁹.
We examined signatures of positive selection at rs76880433 based on reconstructed
genealogies in 1000 Genomes populations at this locus, and found evidence for selection only in
Latino populations (P = 2.9 × 10⁻³ for MXL, 5.3 × 10⁻³ for PEL), but not in European (IBS)
or East Asian (CHS) populations. The estimated age of this variant ranges from 28 ky to 52 ky
based on the genealogies. The putative functional SNP (rs1451367) in high LD with rs76880433 had
much older age estimates (130-260 ky) and more modest evidence of selection (P = 0.03 in PEL,
0.12 in MXL).
Figure 4.3 - Origins and selection of IKZF1 risk allele for childhood ALL
(A) Evolutionary model for the rs1451367/rs76880433 allele that led to its high prevalence in
Latino populations (B) Presence of risk allele for rs1451367 in shotgun sequencing of ancient
DNA samples found in the Americas
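The LD measures quoted in this chapter (D′ and R²) follow from haplotype and allele frequencies. The Python sketch below uses the standard definitions; the frequencies are toy values chosen only to show how D′ can equal 1 while R² stays below 1, and the function name is hypothetical.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D_prime, r_squared) for two biallelic loci.

    p_ab: frequency of the haplotype carrying both allele A and allele B
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    # D is scaled by its maximum attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r_squared = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r_squared


# Toy frequencies (illustrative only): allele A (20%) is always found on
# haplotypes that also carry allele B (21%).
d, d_prime, r2 = ld_stats(p_ab=0.20, p_a=0.20, p_b=0.21)
print(round(d_prime, 3), round(r2, 3))
```

When one allele occurs only on the background of the other, D′ reaches 1 even though the two allele frequencies differ slightly, which is why D′ = 1 can accompany an R² somewhat below 1.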
Discussion
In the current study, we characterized in detail the genetic association of the IKZF1 locus
with ALL in LAT and NLW populations. We found three independent risk alleles in the locus,
with one of the three, rs1451367 or its close proxy rs76880433, being strongly enriched in
Latino populations. We demonstrated that the risk conferred by this allele contributes to the
increased risk of ALL in Latino populations through their Indigenous American ancestry. We
provided results from in silico functional analyses for the three independently associated SNPs
in the IKZF1 locus, and explored the evolutionary mechanism through which the Latino-specific
allele may have come to be so frequent in Indigenous American genomes. Our results thus provide
evidence of a genetic contribution to the elevated risk of ALL in LAT compared to the NLW
population, in addition to environmental or sociodemographic differences between the two
populations.
The increased incidence of ALL in Latinos is driven by the B-cell immunophenotype¹⁴⁷.
Our results here support that genetic variation at IKZF1, a regulator of B-cell development,
contributes significantly to the disparity in B-ALL incidence. This is also observed in the
genetic risk score (GRS) analysis, consistent with our previous observation that the distribution
of GRS based on 16 known ALL-associated SNPs is shifted towards higher risk in the LAT
population compared to the NLW population, irrespective of case and control status. This reflects
the fact that frequencies at these shared ALL-associated SNPs are systematically different in
LAT compared to NLW. In the present analysis, we further demonstrated that the increased
incidence may also be contributed by the presence of population-specific alleles, such as
rs1451367/rs76880433 at the IKZF1 locus. These risk alleles are common in LAT populations due to
the higher proportion of Indigenous American ancestry (Table 4.3), but very rare or nearly
monomorphic in the European or NLW populations. Interestingly, rs76880433 is also prevalent in
East Asian populations. The East Asian cohort of our multiethnic dataset is too small to properly
assess the impact of this allele on ALL risk, but the allele could be a prime target for
validation in larger East Asian cohorts.
Currently, our functional fine-mapping is limited to in silico analysis. A direction of
immediate interest is to functionally validate the three putative causal variants in IKZF1 in
experimental systems, so as to define the precise mechanism through which this locus confers
risk of ALL. For example, it would be of interest to examine chromatin accessibility of the
regulatory regions in the IKZF1 locus across cells covering the entirety of human hematopoiesis
using scATAC-seq data¹⁶⁰. This may help delineate accessibility in B-cells and their precursors,
versus T-cells or myeloid cells, to better understand the developmental stages and cell types in
which IKZF1 exerts its function. It would also be illuminating if the regulatory element
harboring the rs1451367/rs17133807 risk variants could be deleted in primary human hematopoietic
stem and progenitor cells (HSPCs) from healthy donors, in ways similar to those used previously
for other regulatory elements¹⁶¹,¹⁶²; we could then assess the resulting changes in IKZF1
expression. Such direct experimentation would greatly complement our statistical and in silico
results, which already strongly implicate multiple causal variants at the IKZF1 locus.
Finally, we propose an evolutionary model for the rs1451367/rs76880433 allele that led to
its high prevalence in LAT populations (Figure 4.3 A). In this model, rs76880433 was introduced
after the split of Asian and European populations, hence the variant is essentially absent in
Europeans. The allele may have drifted neutrally to a frequency of ~20% in East Asians, explaining
the lack of selection signal in genealogies from CHS. In the lineage leading to Indigenous
Americans who crossed the Bering Strait and populated the Americas, the risk allele frequency
may or may not have been reduced by population bottlenecks, but it appears to have increased
dramatically over at least the last 13,000 years, as evidenced by the prevalent carrier
rate among ancient American samples (Figure 4.3 B). Today, the risk allele frequency is highest
among Latino individuals with > 50% estimated Indigenous American ancestry (RAF = 0.33), as
well as in Indigenous American samples from HGDP (RAF = 0.44 from gnomAD), as a result of
positive selection specific to the Americas. The observed frequency in Latino populations (RAF
~17-24%) is then a result of recent admixture from European and African ancestries. The putative
functional variant in high LD, rs1451367, showed a more modest signature of selection. This could
in part be due to its older age (as indicated by one recombinant haplotype between rs1451367 and
rs76880433 in the 1000G PEL population), which lowers the sensitivity of selection tests.
rs76880433 may also be associated with an as-yet-unknown functional element in the region. The
genealogy-based evidence for selection is also consistent with the fact that this locus has
drastically elevated Indigenous American ancestry in both cases (0.89) and controls (0.76),
compared to the genomic average (0.39). Combined with the prevalent carrier rate among ancient
American samples, our results suggest that selection may have persisted over the last 13,000
years. The precise events in human history that drove this selection remain unknown, although it
is interesting to note how many infectious diseases in human history have shaped selection on
alleles that alter human immune responses¹⁶³, considering the role IKZF1 plays in B-cell
development and differentiation.
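The admixture argument in the model above can be checked with simple arithmetic: the expected risk allele frequency in an admixed population is the ancestry-weighted average of the source-population frequencies. The Python sketch below uses the CCRLP ancestry proportions and the gnomAD/HGDP frequencies quoted in this chapter. The African-ancestry frequency is not reported in the text and is set to 0.0 purely as an assumption, and the proportions (which sum to 0.94) are used as given without renormalization.

```python
# Global ancestry proportions reported for CCRLP Latinos in the text.
ANCESTRY_PROPORTIONS = {
    "European": 0.49,
    "Indigenous American": 0.39,
    "African": 0.06,
}

# rs76880433 risk allele frequency in each source ancestry.
# European: gnomAD non-Finnish Europeans (0.2%); Indigenous American: HGDP
# samples in gnomAD (0.44). African: NOT reported in the text; 0.0 is an
# assumption made only for this illustration.
SOURCE_RAF = {
    "European": 0.002,
    "Indigenous American": 0.44,
    "African": 0.0,
}


def expected_admixed_raf(proportions, source_raf):
    """Ancestry-weighted average of source-population allele frequencies."""
    return sum(proportions[anc] * source_raf[anc] for anc in proportions)


expected = expected_admixed_raf(ANCESTRY_PROPORTIONS, SOURCE_RAF)
print(f"{expected:.3f}")
```

Under these assumptions the expectation is about 0.17, which sits at the lower end of the ~17-24% range observed in Latino populations and is consistent with Indigenous American ancestry being the main source of the allele.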
CHAPTER 5: CONCLUSION
This dissertation aimed to fill the gap in genetic research on childhood ALL by
focusing on diverse ethnic groups. Using the largest available multi-ethnic cohort, we
discovered novel risk loci at chr6q23.3 and chr10q21.3 and demonstrated that the PRS based on
the known risk loci shows similar predictive performance in Latinos and non-Latino whites, and
partly explains the increased risk in Latinos. We then constructed, trained, and evaluated the
genomic polygenic risk scores in Latinos to identify the best-performing model. Our results
suggest that multi-ancestry GWAS significantly improves the performance of PRS not only for
non-European individuals, but possibly for European-ancestry individuals as well, consistent
with a previous study¹⁶⁴. Also, our results suggest ALL has a more oligogenic architecture in
which a number of large-effect loci may be shared between populations. Lastly, we characterized
a Latino-specific secondary signal at the IKZF1 locus that is absent in European-ancestry
populations and showed its association with both global and local Indigenous American ancestry.
These results suggest a significant role of genetic risk factors in contributing to the
disparities in ALL risk across ethnic groups with multiple ancestries, most likely due to both
the presence of ancestry-associated risk loci and the increased allele frequencies or effect
sizes of the shared risk loci. Together, our study highlights the importance of ancestrally
diverse populations in genetic studies of ALL to fully unravel the disease etiology and ensure
that therapeutic applications can benefit a greater number of people, regardless of their race.
Our studies have provided some insights to explain the increased ALL risk in Latino
children but also suggest some important directions for future work to understand the ethnic
disparity more comprehensively. First, the discoveries of novel loci from a large multi-ethnic
or population-specific GWAS, and the improved performance of PRS from multi-ethnic GWAS,
suggest that even larger GWAS incorporating Latinos or specific to Latinos are highly likely to
uncover additional risk loci and allow better PRS predictive accuracy. An effort to
aggregate individual GWAS for multi-ancestry or Latino-specific meta-analysis is warranted.
The benefit of including diverse populations notwithstanding, minority populations are rarely
of sufficient size to provide high statistical power in GWAS. One approach to address this
challenge would be to leverage the unique evolutionary history of Latino Americans and perform
admixture mapping. To identify a disease-associated locus, admixture mapping tests for an
association between the trait of interest and variation in genetic ancestry at each locus, rather
than specific genotypes as in a GWAS¹⁶⁵. This approach does not suffer from as high a
multiple testing burden as SNP association testing and can be more powerful when the risk alleles
show drastic frequency differences between ancestral populations, are unobserved but captured
by ancestry blocks, or when the effects of multiple rare variants are aggregated within an
ancestry interval¹⁶⁶. It has been successfully used to identify novel associations with asthma,
blood cell traits, and breast and prostate cancer in African American and Latino populations¹⁶⁷.
Admixture mapping can be especially relevant for elucidating disease etiology that may differ
between ancestral populations, as in childhood ALL, but no effort has yet been made to infer
local ancestry and test its association with ALL risk in Latino Americans. Deconvolution of
local ancestry can also help interpret the results from GWAS and provide insight into the
different ALL incidence rates between ethnic groups.
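The core idea of admixture mapping can be sketched in a few lines of Python: at each locus, compare the local-ancestry dosage (0, 1, or 2 copies of a given ancestry) between cases and controls. This is a deliberately simplified illustration using a two-sample z-test on toy data; real analyses typically use regression models that adjust for global ancestry and other covariates. The function name, locus names, and dosages are all hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def ancestry_association(case_dosages, control_dosages):
    """Compare mean local-ancestry dosage between cases and controls.

    Returns (difference in means, z statistic, two-sided p-value) from a
    two-sample z-test with unpooled sample variances.
    """
    diff = mean(case_dosages) - mean(control_dosages)
    se = math.sqrt(
        variance(case_dosages) / len(case_dosages)
        + variance(control_dosages) / len(control_dosages)
    )
    z = diff / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, z, p


# Toy data: copies of one ancestry (0, 1, or 2) carried at each locus.
loci = {
    "locus_1": ([2, 2, 1, 2, 1, 2, 2, 1], [1, 0, 1, 2, 0, 1, 1, 0]),
    "locus_2": ([1, 0, 2, 1, 1, 0, 2, 1], [1, 1, 0, 2, 1, 0, 1, 2]),
}

results = {
    name: ancestry_association(cases, controls)
    for name, (cases, controls) in loci.items()
}
for name, (diff, z, p) in results.items():
    print(name, round(diff, 3), round(z, 2), round(p, 4))
```

Because ancestry blocks are long, far fewer independent tests are performed than in SNP-level association testing, which is the source of the lighter multiple testing burden noted above.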
In addition, a better characterization of the discovered loci is as important as the
discovery of novel loci. In general, “Latino” refers to people from Latin America, which
comprises more than twenty nations on the American continent and in the Caribbean Sea
where Spanish and Portuguese are the predominant languages¹⁶⁸. As our studies were limited
to Latino populations in California, self-reported as “Hispanics”, future investigations are
needed to characterize the discovered loci across the heterogeneous groups within the Latino
community. Similarly, childhood ALL is a heterogeneous disease comprising several
immunophenotypic and cytogenetic subtypes. Association testing stratified by subtype can identify
potential subtype-specific associations. Assessment of PRS model performance across different
subtypes, when available, will also provide greater insight into the genetic architecture
of the different subtypes of ALL.
Finally, another approach to better understand ALL risk in Latinos is to explore the impact
of genetic ancestry and signatures of selection. The number of loci and the distribution of
allele frequencies and effect sizes at the causal loci in a contemporary population could be
shaped by historical natural selection. A recent study has proposed that a massive decline in
Indigenous American populations after European contact imposed a strong selective pressure on
immune-related genes¹⁶⁹. As Indigenous Americans’ lack of herd immunity was maladaptive to the
new pathogens brought by Europeans, those who survived may have had a survival advantage in that
they may have been more appropriately reactive to the pathogens. ALL is characterized by an
overgrowth of immature hematopoietic cells, and the findings from GWAS also suggest that genes
involved in B-lymphocyte development and regulation, such as IKZF1, ARID5B, and CEBPE, play an
important role in ALL predisposition. They support the hypothesis that childhood ALL results
from accumulation of critical genetic lesions that disrupt normal lymphoid progenitor
differentiation²⁶. Therefore, it is possible that the adaptation in Indigenous Americans to the
new pathogens Europeans brought, although unknown, could have led to an unintended consequence
of higher leukemia risk or other differences in immune system regulation or hematopoiesis in
current Latino Americans through their inherited ancestry component from Indigenous
Americans. This hypothesis could be tested by examining natural selection among ALL risk loci
and genes that may be relevant to the biological basis of leukemia. Indeed, the analysis of
frequency distribution using ancient DNA and reconstructed genealogical history in Chapter 4
suggested the Latino-specific risk allele is under selection in the Americas. Although previous
studies have noted the association of risk allele frequency at a number of loci with IA
ancestry²²,²³,³⁹,¹⁵⁰, a more systematic characterization of all GWAS loci with respect to genetic
ancestry and signatures of selection will help us disentangle the genetic risk of ALL in the
Latino population.
REFERENCES
1. Barrington-Trimis, J.L., Cockburn, M., Metayer, C., Gauderman, W.J., Wiemels, J., and
McKean-Cowdin, R. (2015). Rising rates of acute lymphoblastic leukemia in Hispanic
children: trends in incidence from 1992 to 2011. Blood 125, 3033–3034.
10.1182/blood-2015-03-634006.
2. McCarthy, M.I., Abecasis, G.R., Cardon, L.R., Goldstein, D.B., Little, J., Ioannidis, J.P.A.,
and Hirschhorn, J.N. (2008). Genome-wide association studies for complex traits:
consensus, uncertainty and challenges. Nat Rev Genet 9, 356–369. 10.1038/nrg2344.
3. Khera, A.V., Chaffin, M., Aragam, K.G., Haas, M.E., Roselli, C., Choi, S.H., Natarajan, P.,
Lander, E.S., Lubitz, S.A., Ellinor, P.T., et al. (2018). Genome-wide polygenic scores for
common diseases identify individuals with risk equivalent to monogenic mutations. Nat
Genet 50, 1219–1224. 10.1038/s41588-018-0183-z.
4. Khera, A.V., Chaffin, M., Wade, K.H., Zahid, S., Brancale, J., Xia, R., Distefano, M., Senol-
Cosar, O., Haas, M.E., Bick, A., et al. (2019). Polygenic Prediction of Weight and Obesity
Trajectories from Birth to Adulthood. Cell 177, 587-596.e9. 10.1016/j.cell.2019.03.028.
5. Michailidou, K., Lindström, S., Dennis, J., Beesley, J., Hui, S., Kar, S., Lemaçon, A., Soucy,
P., Glubb, D., Rostamianfar, A., et al. (2017). Association analysis identifies 65 new breast
cancer risk loci. Nature 551, 92–94. 10.1038/nature24284.
6. Mahajan, A., Wessel, J., Willems, S.M., Zhao, W., Robertson, N.R., Chu, A.Y., Gan, W.,
Kitajima, H., Taliun, D., Rayner, N.W., et al. (2018). Refining the accuracy of validated
target identification through coding variant fine-mapping in type 2 diabetes. Nat Genet 50,
559–571. 10.1038/s41588-018-0084-1.
7. Torkamani, A., Wineinger, N.E., and Topol, E.J. (2018). The personal and clinical utility of
polygenic risk scores. Nat Rev Genet 19, 581–590. 10.1038/s41576-018-0018-x.
8. Popejoy, A.B., and Fullerton, S.M. (2016). Genomics is failing on diversity. Nature 538, 161–
164. 10.1038/538161a.
9. Martin, A.R., Gignoux, C.R., Walters, R.K., Wojcik, G.L., Neale, B.M., Gravel, S., Daly, M.J.,
Bustamante, C.D., and Kenny, E.E. (2017). Human Demographic History Impacts Genetic
Risk Prediction across Diverse Populations. The American Journal of Human Genetics
100, 635–649. 10.1016/j.ajhg.2017.03.004.
10. Martin, A.R., Kanai, M., Kamatani, Y., Okada, Y., Neale, B.M., and Daly, M.J. (2019).
Clinical use of current polygenic risk scores may exacerbate health disparities. Nat Genet
51, 584–591. 10.1038/s41588-019-0379-x.
11. Wang, Y., Guo, J., Ni, G., Yang, J., Visscher, P.M., and Yengo, L. (2020). Theoretical and
empirical quantification of the accuracy of polygenic scores in ancestry divergent
populations. Nat Commun 11, 3865. 10.1038/s41467-020-17719-y.
12. Rask-Andersen, M., Karlsson, T., Ek, W.E., and Johansson, Å. (2017). Gene-environment
interaction study for BMI reveals interactions between genetic factors and physical activity,
alcohol consumption and socioeconomic status. PLoS Genet 13, e1006977.
10.1371/journal.pgen.1006977.
13. Robinson, M.R., English, G., Moser, G., Lloyd-Jones, L.R., Triplett, M.A., Zhu, Z., Nolte,
I.M., van Vliet-Ostaptchouk, J.V., Snieder, H., LifeLines Cohort Study, et al. (2017).
Genotype-covariate interaction effects and the heritability of adult body mass index. Nat
Genet 49, 1174–1181. 10.1038/ng.3912.
14. Sulc, J., Mounier, N., Günther, F., Winkler, T., Wood, A.R., Frayling, T.M., Heid, I.M.,
Robinson, M.R., and Kutalik, Z. (2020). Quantification of the overall contribution of
gene-environment interaction for obesity-related traits. Nat Commun 11, 1385.
10.1038/s41467-020-15107-0.
15. Justice, A.E., Winkler, T.W., Feitosa, M.F., Graff, M., Fisher, V.A., Young, K., Barata, L.,
Deng, X., Czajkowski, J., Hadley, D., et al. (2017). Genome-wide meta-analysis of 241,258
adults accounting for smoking behaviour identifies novel loci for obesity traits. Nat
Commun 8, 14977. 10.1038/ncomms14977.
16. Edwards, B.K., Noone, A.-M., Mariotto, A.B., Simard, E.P., Boscoe, F.P., Henley, S.J.,
Jemal, A., Cho, H., Anderson, R.N., Kohler, B.A., et al. (2014). Annual Report to the Nation
on the status of cancer, 1975-2010, featuring prevalence of comorbidity and impact on
survival among persons with lung, colorectal, breast, or prostate cancer. Cancer 120,
1290–1314. 10.1002/cncr.28509.
17. Barrington-Trimis, J.L., Cockburn, M., Metayer, C., Gauderman, W.J., Wiemels, J., and
McKean-Cowdin, R. (2017). Trends in childhood leukemia incidence over two decades
from 1992 to 2013. Int J Cancer 140, 1000–1008. 10.1002/ijc.30487.
18. Linet, M.S., Ries, L.A., Smith, M.A., Tarone, R.E., and Devesa, S.S. (1999). Cancer
surveillance series: recent trends in childhood cancer incidence and mortality in the United
States. J Natl Cancer Inst 91, 1051–1058. 10.1093/jnci/91.12.1051.
19. Siegel, R., Ma, J., Zou, Z., and Jemal, A. (2014). Cancer statistics, 2014. CA: A Cancer
Journal for Clinicians 64, 9–29. 10.3322/caac.21208.
20. Acute Lymphocytic Leukemia - Cancer Stat Facts SEER.
https://seer.cancer.gov/statfacts/html/alyl.html.
21. Wiemels, J.L., Walsh, K.M., de Smith, A.J., Metayer, C., Gonseth, S., Hansen, H.M.,
Francis, S.S., Ojha, J., Smirnov, I., Barcellos, L., et al. (2018). GWAS in childhood acute
lymphoblastic leukemia reveals novel genetic associations at chromosomes 17q12 and
8q24.21. Nature Communications 9. 10.1038/s41467-017-02596-9.
22. Qian, M., Xu, H., Perez-Andreu, V., Roberts, K.G., Zhang, H., Yang, W., Zhang, S., Zhao,
X., Smith, C., Devidas, M., et al. (2019). Novel susceptibility variants at the ERG locus for
childhood acute lymphoblastic leukemia in Hispanics. Blood 133, 724–729.
10.1182/blood-2018-07-862946.
23. de Smith, A.J., Walsh, K.M., Morimoto, L.M., Francis, S.S., Hansen, H.M., Jeon, S.,
Gonseth, S., Chen, M., Sun, H., Luna-Fineman, S., et al. (2019). Heritable variation at the
chromosome 21 gene ERG is associated with acute lymphoblastic leukemia risk in children
with and without Down syndrome. Leukemia 33, 2746–2751. 10.1038/s41375-019-0514-9.
24. Williams, L.A., Yang, J.J., Hirsch, B.A., Marcotte, E.L., and Spector, L.G. (2019). Is There
Etiologic Heterogeneity between Subtypes of Childhood Acute Lymphoblastic Leukemia?
A Review of Variation in Risk by Subtype. Cancer Epidemiol Biomarkers Prev 28, 846–
856. 10.1158/1055-9965.EPI-18-0801.
25. Semmes, E.C., Vijayakrishnan, J., Zhang, C., Hurst, J.H., Houlston, R.S., and Walsh, K.M.
(2020). Leveraging Genome and Phenome-Wide Association Studies to Investigate
Genetic Risk of Acute Lymphoblastic Leukemia. Cancer Epidemiology, Biomarkers &
Prevention 29, 1606–1614. 10.1158/1055-9965.EPI-20-0113.
26. Pui, C.-H., Robison, L.L., and Look, A.T. (2008). Acute lymphoblastic leukaemia. Lancet
371, 1030–1043. 10.1016/S0140-6736(08)60457-2.
27. Siegel, D.A., Henley, S.J., Li, J., Pollack, L.A., Van Dyne, E.A., and White, A. (2017). Rates
and Trends of Pediatric Acute Lymphoblastic Leukemia — United States, 2001–2014.
MMWR. Morbidity and Mortality Weekly Report 66, 950–954. 10.15585/mmwr.mm6636a3.
28. Turcotte, L.M., Liu, Q., Yasui, Y., Arnold, M.A., Hammond, S., Howell, R.M., Smith, S.A.,
Weathers, R.E., Henderson, T.O., Gibson, T.M., et al. (2017). Temporal Trends in
Treatment and Subsequent Neoplasm Risk Among 5-Year Survivors of Childhood Cancer,
1970-2015. JAMA 317, 814–824. 10.1001/jama.2017.0693.
29. Mody, R., Li, S., Dover, D.C., Sallan, S., Leisenring, W., Oeffinger, K.C., Yasui, Y., Robison,
L.L., and Neglia, J.P. (2008). Twenty-five-year follow-up among survivors of childhood
acute lymphoblastic leukemia: a report from the Childhood Cancer Survivor Study. Blood
111, 5515–5523. 10.1182/blood-2007-10-117150.
30. Cheung, Y.T., Sabin, N.D., Reddick, W.E., Bhojwani, D., Liu, W., Brinkman, T.M., Glass,
J.O., Hwang, S.N., Srivastava, D., Pui, C.-H., et al. (2016). Leukoencephalopathy and
long-term neurobehavioural, neurocognitive, and brain imaging outcomes in survivors of
childhood acute lymphoblastic leukaemia treated with chemotherapy: a longitudinal
analysis. Lancet Haematol 3, e456–e466. 10.1016/S2352-3026(16)30110-7.
31. Mulrooney, D.A., Hyun, G., Ness, K.K., Bhakta, N., Pui, C.-H., Ehrhardt, M.J., Krull, K.R.,
Crom, D.B., Chemaitilly, W., Srivastava, D., et al. (2019). The Changing Burden of Late
Health Outcomes in Adult Survivors of Childhood Acute Lymphoblastic Leukemia: A
Report from the St. Jude Lifetime Cohort Study. Lancet Haematol 6, e306–e316.
10.1016/S2352-3026(19)30050-X.
32. Parkin, D.M., Stiller, C.A., Draper, G.J., and Bieber, C.A. (1988). The international incidence
of childhood cancer. Int J Cancer 42, 511–520. 10.1002/ijc.2910420408.
33. Bhatia, S. (2011). Disparities in cancer outcomes: Lessons learned from children with
cancer. Pediatric Blood & Cancer 56, 994–1002. 10.1002/pbc.23078.
34. Wang, L., Bhatia, S., Gomez, S.L., and Yasui, Y. (2015). Differential inequality trends over
time in survival among U.S. children with acute lymphoblastic leukemia by race/ethnicity,
age at diagnosis, and sex. Cancer Epidemiol Biomarkers Prev 24, 1781–1788.
10.1158/1055-9965.EPI-15-0639.
35. Goggins, W.B., and Lo, F.F.K. (2012). Racial and ethnic disparities in survival of US children
with acute lymphoblastic leukemia: evidence from the SEER database 1988-2008. Cancer
Causes Control 23, 737–743. 10.1007/s10552-012-9943-8.
36. Kadan-Lottick, N.S. (2003). Survival Variability by Race and Ethnicity in Childhood Acute
Lymphoblastic Leukemia. JAMA 290, 2008. 10.1001/jama.290.15.2008.
37. Kehm, R.D., Spector, L.G., Poynter, J.N., Vock, D.M., Altekruse, S.F., and Osypuk, T.L.
(2018). Does socioeconomic status account for racial and ethnic disparities in childhood
cancer survival? Cancer 124, 4090–4097. 10.1002/cncr.31560.
38. Yang, W., Treviño, L.R., Yang, J.J., Scheet, P., Pui, C.-H., Evans, W.E., and Relling, M.V.
(2010). ARID5B SNP rs10821936 is associated with risk of childhood acute lymphoblastic
leukemia in blacks and contributes to racial differences in leukemia incidence. Leukemia
24, 894–896. 10.1038/leu.2009.277.
39. Xu, H., Cheng, C., Devidas, M., Pei, D., Fan, Y., Yang, W., Neale, G., Scheet, P., Burchard,
E.G., Torgerson, D.G., et al. (2012). ARID5B genetic polymorphisms contribute to racial
disparities in the incidence and treatment outcome of childhood acute lymphoblastic
leukemia. J Clin Oncol 30, 751–757. 10.1200/JCO.2011.38.0345.
40. Ohta, T. (1973). Slightly deleterious mutant substitutions in evolution. Nature 246, 96–98.
41. Lohmueller, K.E. (2014). The impact of population demography and selection on the genetic
architecture of complex traits. PLoS Genet 10, e1004379. 10.1371/journal.pgen.1004379.
42. Lim, E.T., Wurtz, P., Havulinna, A.S., Palta, P., Tukiainen, T., Rehnstrom, K., Esko, T.,
Magi, R., Inouye, M., Lappalainen, T., et al. (2014). Distribution and medical impact of loss-
of-function variants in the Finnish founder population. PLoS Genet 10, e1004494.
10.1371/journal.pgen.1004494.
43. Wang, S.R., Agarwala, V., Flannick, J., Chiang, C.W., Altshuler, D., Go, T.D.C., and
Hirschhorn, J.N. (2014). Simulation of Finnish population history, guided by empirical
genetic data, to assess power of rare-variant tests in Finland. Am J Hum Genet 94, 710–
720. 10.1016/j.ajhg.2014.03.019.
44. Locke, A.E., Steinberg, K.M., Chiang, C.W.K., Service, S.K., Havulinna, A.S., Stell, L.,
Pirinen, M., Abel, H.J., Chiang, C.C., Fulton, R.S., et al. (2019). Exome sequencing of
Finnish isolates enhances rare-variant association power. Nature 572, 323–328.
10.1038/s41586-019-1457-z.
45. Moltke, I., Grarup, N., Jorgensen, M.E., Bjerregaard, P., Treebak, J.T., Fumagalli, M.,
Korneliussen, T.S., Andersen, M.A., Nielsen, T.S., Krarup, N.T., et al. (2014). A common
Greenlandic TBC1D4 variant confers muscle insulin resistance and type 2 diabetes.
Nature 512, 190–193. 10.1038/nature13425.
46. Grarup, N., Moltke, I., Andersen, M.K., Dalby, M., Vitting-Seerup, K., Kern, T., Mahendran,
Y., Jorsboe, E., Larsen, C.V.L., Dahl-Petersen, I.K., et al. (2018). Loss-of-function variants
in ADCY3 increase risk of obesity and type 2 diabetes. Nat Genet 50, 172–174.
10.1038/s41588-017-0022-7.
47. Grarup, N., Moltke, I., Andersen, M.K., Bjerregaard, P., Larsen, C.V.L., Dahl-Petersen, I.K.,
Jorsboe, E., Tiwari, H.K., Hopkins, S.E., Wiener, H.W., et al. (2018). Identification of novel
high-impact recessively inherited type 2 diabetes risk variants in the Greenlandic
population. Diabetologia. 10.1007/s00125-018-4659-2.
48. Sidore, C., Busonero, F., Maschio, A., Porcu, E., Naitza, S., Zoledziewska, M., Mulas, A.,
Pistis, G., Steri, M., Danjou, F., et al. (2015). Genome sequencing elucidates Sardinian
genetic architecture and augments association analyses for lipid and blood inflammatory
markers. Nat Genet 47, 1272–1281. 10.1038/ng.3368.
49. Steri, M., Orru, V., Idda, M.L., Pitzalis, M., Pala, M., Zara, I., Sidore, C., Faa, V., Floris, M.,
Deiana, M., et al. (2017). Overexpression of the Cytokine BAFF and Autoimmunity Risk. N
Engl J Med 376, 1615–1626. 10.1056/NEJMoa1610528.
50. Asgari, S., Luo, Y., Akbari, A., Belbin, G.M., Li, X., Harris, D.N., Selig, M., Bartell, E.,
Calderon, R., Slowikowski, K., et al. (2020). A positively selected FBN1 missense variant
reduces height in Peruvian individuals. Nature. 10.1038/s41586-020-2302-0.
51. Zoledziewska, M., Sidore, C., Chiang, C.W.K., Sanna, S., Mulas, A., Steri, M., Busonero, F.,
Marcus, J.H., Marongiu, M., Maschio, A., et al. (2015). Height-reducing variants and
selection for short stature in Sardinia. Nat Genet 47, 1352–1356. 10.1038/ng.3403.
52. Poznik, G.D., Xue, Y., Mendez, F.L., Willems, T.F., Massaia, A., Sayres, M.A.W., Ayub, Q.,
McCarthy, S.A., Narechania, A., Kashin, S., et al. (2016). Punctuated bursts in human
male demography inferred from 1,244 worldwide Y-chromosome sequences. Nat Genet
48, 593. 10.1038/ng.3559.
53. Bonatto, S.L., and Salzano, F.M. (1997). A single and early migration for the peopling of the
Americas supported by mitochondrial DNA sequence data. Proc Natl Acad Sci U S A 94,
1866–1871. 10.1073/pnas.94.5.1866.
54. Ramachandran, S., Deshpande, O., Roseman, C.C., Rosenberg, N.A., Feldman, M.W., and
Cavalli-Sforza, L.L. (2005). Support from the relationship of genetic and geographic
distance in human populations for a serial founder effect originating in Africa. Proceedings
of the National Academy of Sciences 102, 15942–15947. 10.1073/pnas.0507611102.
55. Scheib, C.L., Li, H., Desai, T., Link, V., Kendall, C., Dewar, G., Griffith, P.W., Mörseburg, A.,
Johnson, J.R., Potter, A., et al. (2018). Ancient human parallel lineages within North
America contributed to a coastal expansion. Science 360, 1024–1027.
10.1126/science.aar6851.
56. Acuna-Soto, R., Stahle, D.W., Cleaveland, M.K., and Therrell, M.D. (2002). Megadrought
and Megadeath in 16th Century Mexico. Emerg Infect Dis 8, 360–362.
10.3201/eid0804.010175.
57. Bryc, K., Durand, E.Y., Macpherson, J.M., Reich, D., and Mountain, J.L. (2015). The genetic
ancestry of African Americans, Latinos, and European Americans across the United
States. Am J Hum Genet 96, 37–53. 10.1016/j.ajhg.2014.11.010.
58. Jeon, S., de Smith, A.J., Li, S., Chen, M., Chan, T.F., Muskens, I.S., Morimoto, L.M.,
DeWan, A.T., Mancuso, N., Metayer, C., et al. (2021). Genome-wide trans-ethnic meta-
analysis identifies novel susceptibility loci for childhood acute lymphoblastic leukemia.
Leukemia. 10.1038/s41375-021-01465-1.
59. SIGMA Type 2 Diabetes Consortium, Williams, A.L., Jacobs, S.B.R., Moreno-Macías, H.,
Huerta-Chagoya, A., Churchhouse, C., Márquez-Luna, C., García-Ortíz, H., Gómez-
Vázquez, M.J., Burtt, N.P., et al. (2014). Sequence variants in SLC16A11 are a common
risk factor for type 2 diabetes in Mexico. Nature 506, 97–101. 10.1038/nature12828.
60. SIGMA Type 2 Diabetes Consortium, Estrada, K., Aukrust, I., Bjørkhaug, L., Burtt, N.P.,
Mercader, J.M., García-Ortiz, H., Huerta-Chagoya, A., Moreno-Macías, H., Walford, G., et
al. (2014). Association of a low-frequency variant in HNF1A with type 2 diabetes in a Latino
population. JAMA 311, 2305–2314. 10.1001/jama.2014.6511.
61. Mercader, J.M., Liao, R.G., Bell, A.D., Dymek, Z., Estrada, K., Tukiainen, T., Huerta-
Chagoya, A., Moreno-Macías, H., Jablonski, K.A., Hanson, R.L., et al. (2017). A Loss-of-
Function Splice Acceptor Variant in IGF2 Is Protective for Type 2 Diabetes. Diabetes 66,
2903–2914. 10.2337/db17-0187.
62. Scheinfeldt, L.B., and Tishkoff, S.A. (2013). Recent human adaptation: genomic
approaches, interpretation and insights. Nat Rev Genet 14, 692–702. 10.1038/nrg3604.
63. Berg, J.J., and Coop, G. (2014). A population genetic signal of polygenic adaptation. PLoS
Genet 10, e1004412. 10.1371/journal.pgen.1004412.
64. Guo, J., Wu, Y., Zhu, Z., Zheng, Z., Trzaskowski, M., Zeng, J., Robinson, M.R., Visscher,
P.M., and Yang, J. (2018). Global genetic differentiation of complex traits shaped by
natural selection in humans. Nat Commun 9, 1865. 10.1038/s41467-018-04191-y.
65. Robinson, M.R., Hemani, G., Medina-Gomez, C., Mezzavilla, M., Esko, T., Shakhbazov, K.,
Powell, J.E., Vinkhuyzen, A., Berndt, S.I., Gustafsson, S., et al. (2015). Population genetic
differentiation of height and body mass index across Europe. Nat Genet 47, 1357–1362.
10.1038/ng.3401.
66. Turchin, M.C., Chiang, C.W., Palmer, C.D., Sankararaman, S., Reich, D., Genetic
Investigation of ANthropometric Traits (GIANT) Consortium, and Hirschhorn, J.N. (2012). Evidence of widespread selection
on standing variation in Europe at height-associated SNPs. Nat Genet 44, 1015–1019.
10.1038/ng.2368.
67. Racimo, F., Berg, J.J., and Pickrell, J.K. (2018). Detecting Polygenic Adaptation in
Admixture Graphs. Genetics 208, 1565–1584. 10.1534/genetics.117.300489.
68. Refoyo-Martínez, A., da Fonseca, R.R., Halldórsdóttir, K., Árnason, E., Mailund, T., and
Racimo, F. (2019). Identifying loci under positive selection in complex population histories.
Genome Res. 29, 1506–1520. 10.1101/gr.246777.118.
69. Chen, M., Sidore, C., Akiyama, M., Ishigaki, K., Kamatani, Y., Schlessinger, D., Cucca, F.,
Okada, Y., and Chiang, C.W.K. (2019). Evidence of polygenic adaptation at height-
associated loci in mainland Europeans and Sardinians. bioRxiv, 776377. 10.1101/776377.
70. Lin, M., Caberto, C., Wan, P., Li, Y., Lum-Jones, A., Tiirikainen, M., Pooler, L., Nakamura,
B., Sheng, X., Porcel, J., et al. (2019). Population specific reference panels are crucial for
the genetic analyses of Native Hawai’ians: an example of the CREBRF locus. bioRxiv,
789073. 10.1101/789073.
71. Minster, R.L., Hawley, N.L., Su, C.T., Sun, G., Kershaw, E.E., Cheng, H., Buhule, O.D., Lin,
J., Reupena, M.S., Viali, S., et al. (2016). A thrifty variant in CREBRF strongly influences
body mass index in Samoans. Nat Genet 48, 1049–1054. 10.1038/ng.3620.
72. Hanson, R.L., Safabakhsh, S., Curtis, J.M., Hsueh, W.C., Jones, L.I., Aflague, T.F., Duenas
Sarmiento, J., Kumar, S., Blackburn, N.B., Curran, J.E., et al. (2019). Association of
CREBRF variants with obesity and diabetes in Pacific Islanders from Guam and Saipan.
Diabetologia 62, 1647–1652. 10.1007/s00125-019-4932-z.
73. Ohashi, J., Naka, I., Furusawa, T., Kimura, R., Natsuhara, K., Yamauchi, T., Nakazawa, M.,
Ishida, T., Inaoka, T., Matsumura, Y., et al. (2018). Association study of CREBRF
missense variant (rs373863828:G > A; p.Arg457Gln) with levels of serum lipid profile in the
Pacific populations. Ann Hum Biol 45, 215–219. 10.1080/03014460.2018.1461928.
74. Krishnan, M., Major, T.J., Topless, R.K., Dewes, O., Yu, L., Thompson, J.M.D., McCowan,
L., de Zoysa, J., Stamp, L.K., Dalbeth, N., et al. (2018). Discordant association of the
CREBRF rs373863828 A allele with increased BMI and protection from type 2 diabetes in
Maori and Pacific (Polynesian) people living in Aotearoa/New Zealand. Diabetologia 61,
1603–1613. 10.1007/s00125-018-4623-1.
75. Berry, S.D., Walker, C.G., Ly, K., Snell, R.G., Atatoa Carr, P.E., Bandara, D., Mohal, J.,
Castro, T.G., Marks, E.J., Morton, S.M.B., et al. (2018). Widespread prevalence of a
CREBRF variant amongst Maori and Pacific children is associated with weight and height
in early childhood. Int J Obes (Lond) 42, 603–607. 10.1038/ijo.2017.230.
76. Naka, I., Furusawa, T., Kimura, R., Natsuhara, K., Yamauchi, T., Nakazawa, M., Ataka, Y.,
Ishida, T., Inaoka, T., Matsumura, Y., et al. (2017). A missense variant, rs373863828-A
(p.Arg457Gln), of CREBRF and body mass index in Oceanic populations. J Hum Genet
62, 847–849. 10.1038/jhg.2017.44.
77. Zhernakova, A., Elbers, C.C., Ferwerda, B., Romanos, J., Trynka, G., Dubois, P.C., de
Kovel, C.G., Franke, L., Oosting, M., Barisani, D., et al. (2010). Evolutionary and functional
analysis of celiac risk loci reveals SH2B3 as a protective factor against bacterial infection.
Am J Hum Genet 86, 970–977. 10.1016/j.ajhg.2010.05.004.
78. Amorim, C.E.G., Nunes, K., Meyer, D., Comas, D., Bortolini, M.C., Salzano, F.M., and
Hünemeier, T. (2017). Genetic signature of natural selection in first Americans.
Proceedings of the National Academy of Sciences 114, 2195–2199.
10.1073/pnas.1620541114.
79. Lindo, J., Huerta-Sánchez, E., Nakagome, S., Rasmussen, M., Petzelt, B., Mitchell, J.,
Cybulski, J.S., Willerslev, E., DeGiorgio, M., and Malhi, R.S. (2016). A time transect of
exomes from a Native American population before and after European contact. Nature
Communications 7. 10.1038/ncomms13175.
80. Tilley, L., and Schrenk, A.A., eds. (2016). New Developments in the Bioarchaeology of Care: Further
Case Studies and Extended Theory. 10.1007/978-3-319-39901-0.
81. Stearns, S.C., Nesse, R.M., Govindaraju, D.R., and Ellison, P.T. (2010). Evolution in health
and medicine Sackler colloquium: Evolutionary perspectives on health and medicine. Proc
Natl Acad Sci U S A 107 Suppl 1, 1691–1695. 10.1073/pnas.0914475107.
82. Giddings, B.M., Whitehead, T.P., Metayer, C., and Miller, M.D. (2016). Childhood leukemia
incidence in California: High and rising in the Hispanic population: Hispanic Childhood
Leukemia Incidence. Cancer 122, 2867–2875. 10.1002/cncr.30129.
83. Lim, J.Y.-S., Bhatia, S., Robison, L.L., and Yang, J.J. (2014). Genomics of racial and ethnic
disparities in childhood acute lymphoblastic leukemia. Cancer 120, 955–962.
10.1002/cncr.28531.
84. Vijayakrishnan, J., Kumar, R., Henrion, M.Y.R., Moorman, A.V., Rachakonda, P.S., Hosen,
I., da Silva Filho, M.I., Holroyd, A., Dobbins, S.E., Koehler, R., et al. (2017). A genome-
wide association study identifies risk loci for childhood acute lymphoblastic leukemia at
10q26.13 and 12q23.1. Leukemia 31, 573–579. 10.1038/leu.2016.271.
85. Wiemels, J.L., Walsh, K.M., de Smith, A.J., Metayer, C., Gonseth, S., Hansen, H.M.,
Francis, S.S., Ojha, J., Smirnov, I., Barcellos, L., et al. (2018). GWAS in childhood acute
lymphoblastic leukemia reveals novel genetic associations at chromosomes 17q12 and
8q24.21. Nat Commun 9, 286. 10.1038/s41467-017-02596-9.
86. Papaemmanuil, E., Hosking, F.J., Vijayakrishnan, J., Price, A., Olver, B., Sheridan, E.,
Kinsey, S.E., Lightfoot, T., Roman, E., Irving, J.A.E., et al. (2009). Loci on 7p12.2, 10q21.2
and 14q11.2 are associated with risk of childhood acute lymphoblastic leukemia. Nat
Genet 41, 1006–1010. 10.1038/ng.430.
87. Perez-Andreu, V., Roberts, K.G., Harvey, R.C., Yang, W., Cheng, C., Pei, D., Xu, H.,
Gastier-Foster, J., E, S., Lim, J.Y.-S., et al. (2013). Inherited GATA3 variants are
associated with Ph-like childhood acute lymphoblastic leukemia and risk of relapse. Nat
Genet 45, 1494–1498. 10.1038/ng.2803.
88. Treviño, L.R., Yang, W., French, D., Hunger, S.P., Carroll, W.L., Devidas, M., Willman, C.,
Neale, G., Downing, J., Raimondi, S.C., et al. (2009). Germline genomic variants
associated with childhood acute lymphoblastic leukemia. Nat Genet 41, 1001–1005.
10.1038/ng.432.
89. Xu, H., Yang, W., Perez-Andreu, V., Devidas, M., Fan, Y., Cheng, C., Pei, D., Scheet, P.,
Burchard, E.G., Eng, C., et al. (2013). Novel susceptibility variants at 10p12.31-12.2 for
childhood acute lymphoblastic leukemia in ethnically diverse populations. J Natl Cancer
Inst 105, 733–742. 10.1093/jnci/djt042.
90. Vijayakrishnan, J., Qian, M., Studd, J.B., Yang, W., Kinnersley, B., Law, P.J., Broderick, P.,
Raetz, E.A., Allan, J., Pui, C.-H., et al. (2019). Identification of four novel associations for
B-cell acute lymphoblastic leukaemia risk. Nat Commun 10, 5348.
10.1038/s41467-019-13069-6.
91. Feng, Q., de Smith, A.J., Vergara-Lluri, M., Muskens, I.S., McKean-Cowdin, R., Kogan, S.,
Brynes, R., and Wiemels, J.L. (2020). Trends in Acute Lymphoblastic Leukemia Incidence
in the United States by Race/Ethnicity From 2000 to 2016. American Journal of
Epidemiology, kwaa215. 10.1093/aje/kwaa215.
92. Linabery, A.M., and Ross, J.A. (2008). Trends in childhood cancer incidence in the U.S.
(1992-2004). Cancer 112, 416–432. 10.1002/cncr.23169.
93. Bhatia, S., Sather, H.N., Heerema, N.A., Trigg, M.E., Gaynon, P.S., and Robison, L.L.
(2002). Racial and ethnic differences in survival of children with acute lymphoblastic
leukemia. Blood 100, 1957–1964. 10.1182/blood-2002-02-0395.
94. The 1000 Genomes Project Consortium (2015). A global reference for human genetic
variation. Nature 526, 68–74. 10.1038/nature15393.
95. Wojcik, G.L., Graff, M., Nishimura, K.K., Tao, R., Haessler, J., Gignoux, C.R., Highland,
H.M., Patel, Y.M., Sorokin, E.P., Avery, C.L., et al. (2019). Genetic analyses of diverse
populations improves discovery for complex traits. Nature 570, 514–518.
10.1038/s41586-019-1310-4.
96. Mahajan, A., Spracklen, C.N., Zhang, W., Ng, M.C., Petty, L.E., Kitajima, H., Yu, G.Z.,
Rueger, S., Speidel, L., Kim, Y.J., et al. (2020). Trans-ancestry genetic study of type 2
diabetes highlights the power of diverse populations for discovery and translation (Genetic
and Genomic Medicine) 10.1101/2020.09.22.20198937.
97. Wellcome Trust Case Control Consortium (2007). Genome-wide association study of 14,000
cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678.
10.1038/nature05911.
98. Metayer, C., Zhang, L., Wiemels, J.L., Bartley, K., Schiffman, J., Ma, X., Aldrich, M.C.,
Chang, J.S., Selvin, S., Fu, C.H., et al. (2013). Tobacco smoke exposure and the risk of
childhood acute lymphoblastic and myeloid leukemias by cytogenetic subtype. Cancer
Epidemiol Biomarkers Prev 22, 1600–1611. 10.1158/1055-9965.EPI-13-0350.
99. Marchini, J., Howie, B., Myers, S., McVean, G., and Donnelly, P. (2007). A new multipoint
method for genome-wide association studies by imputation of genotypes. Nat Genet 39,
906–913. 10.1038/ng2088.
100. Willer, C.J., Li, Y., and Abecasis, G.R. (2010). METAL: fast and efficient meta-analysis of
genomewide association scans. Bioinformatics 26, 2190–2191.
10.1093/bioinformatics/btq340.
101. Walsh, K.M., de Smith, A.J., Hansen, H.M., Smirnov, I.V., Gonseth, S., Endicott, A.A.,
Xiao, J., Rice, T., Fu, C.H., McCoy, L.S., et al. (2015). A Heritable Missense Polymorphism
in CDKN2A Confers Strong Risk of Childhood Acute Lymphoblastic Leukemia and Is
Preferentially Selected during Clonal Evolution. Cancer Res 75, 4884–4894.
10.1158/0008-5472.CAN-15-1105.
102. Vijayakrishnan, J., Henrion, M., Moorman, A.V., Fiege, B., Kumar, R., Inacio da Silva Filho,
M., Holroyd, A., Koehler, R., Thomsen, H., Irving, J.A., et al. (2015). The 9p21.3 risk of
childhood acute lymphoblastic leukaemia is explained by a rare high-impact variant in
CDKN2A. Sci Rep 5, 15065. 10.1038/srep15065.
103. de Smith, A.J., Walsh, K.M., Francis, S.S., Zhang, C., Hansen, H.M., Smirnov, I.,
Morimoto, L., Whitehead, T.P., Kang, A., Shao, X., et al. (2018). BMI1 enhancer
polymorphism underlies chromosome 10p12.31 association with childhood acute
lymphoblastic leukemia: BMI1 enhancer polymorphism in ALL. Int. J. Cancer 143, 2647–
2658. 10.1002/ijc.31622.
104. Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., and Müller, M.
(2011). pROC: an open-source package for R and S+ to analyze and compare ROC
curves. BMC Bioinformatics 12, 77. 10.1186/1471-2105-12-77.
105. Genome Aggregation Database Consortium, Karczewski, K.J., Francioli, L.C., Tiao, G.,
Cummings, B.B., Alföldi, J., Wang, Q., Collins, R.L., Laricchia, K.M., Ganna, A., et al.
(2020). The mutational constraint spectrum quantified from variation in 141,456 humans.
Nature 581, 434–443. 10.1038/s41586-020-2308-7.
106. Wiemels, J.L., de Smith, A.J., Xiao, J., Lee, S.-T., Muench, M.O., Fomin, M.E., Zhou, M.,
Hansen, H.M., Termuhlen, A., Metayer, C., et al. (2016). A functional polymorphism in the
CEBPE gene promoter influences acute lymphoblastic leukemia risk through interaction
with the hematopoietic transcription factor Ikaros. Leukemia 30, 1194–1197.
10.1038/leu.2015.251.
107. Studd, J.B., Yang, M., Li, Z., Vijayakrishnan, J., Lu, Y., Yeoh, A.E.-J., Paulsson, K., and
Houlston, R.S. (2019). Genetic predisposition to B-cell acute lymphoblastic leukemia at
14q11.2 is mediated by a CEBPE promoter polymorphism. Leukemia 33, 1–14.
10.1038/s41375-018-0184-z.
108. The LifeLines Cohort Study, Yang, J., Bakshi, A., Zhu, Z., Hemani, G., Vinkhuyzen, A.A.E.,
Lee, S.H., Robinson, M.R., Perry, J.R.B., Nolte, I.M., et al. (2015). Genetic variance
estimation with imputed variants finds negligible missing heritability for human height and
body mass index. Nat Genet 47, 1114–1120. 10.1038/ng.3390.
109. Zaitlen, N., Pasaniuc, B., Sankararaman, S., Bhatia, G., Zhang, J., Gusev, A., Young, T.,
Tandon, A., Pollack, S., Vilhjálmsson, B.J., et al. (2014). Leveraging population admixture
to characterize the heritability of complex traits. Nat Genet 46, 1356–1362.
10.1038/ng.3139.
110. Shi, H., Burch, K.S., Johnson, R., Freund, M.K., Kichaev, G., Mancuso, N., Manuel, A.M.,
Dong, N., and Pasaniuc, B. (2020). Localizing Components of Shared Transethnic Genetic
Architecture of Complex Traits from GWAS Summary Data. The American Journal of
Human Genetics 106, 805–817. 10.1016/j.ajhg.2020.04.012.
111. Ward, L.D., and Kellis, M. (2012). HaploReg: a resource for exploring chromatin states,
conservation, and regulatory motif alterations within sets of genetically linked variants.
Nucleic Acids Res 40, D930-934. 10.1093/nar/gkr917.
112. Lonsdale, J., Thomas, J., Salvatore, M., Phillips, R., Lo, E., Shad, S., Hasz, R., Walters,
G., Garcia, F., Young, N., et al. (2013). The Genotype-Tissue Expression (GTEx) project.
Nat Genet 45, 580–585. 10.1038/ng.2653.
113. Lin, B.D., Carnero-Montoro, E., Bell, J.T., Boomsma, D.I., de Geus, E.J., Jansen, R., Kluft,
C., Mangino, M., Penninx, B., Spector, T.D., et al. (2017). SNP heritability and effects of
genetic variants for neutrophil-to-lymphocyte and platelet-to-lymphocyte ratio. J Hum
Genet 62, 979–988. 10.1038/jhg.2017.76.
114. Stadhouders, R., Aktuna, S., Thongjuea, S., Aghajanirefah, A., Pourfarzad, F., van IJcken,
W., Lenhard, B., Rooks, H., Best, S., Menzel, S., et al. (2014). HBS1L-MYB intergenic
variants modulate fetal hemoglobin via long-range MYB enhancers. J. Clin. Invest. 124,
1699–1710. 10.1172/JCI71520.
115. Guo, M.H., Nandakumar, S.K., Ulirsch, J.C., Zekavat, S.M., Buenrostro, J.D., Natarajan,
P., Salem, R.M., Chiarle, R., Mitt, M., Kals, M., et al. (2017). Comprehensive population-
based genome sequencing provides insight into hematopoietic regulatory mechanisms.
Proc Natl Acad Sci USA 114, E327–E336. 10.1073/pnas.1619052114.
116. Li, M., Jiang, P., Cheng, K., Zhang, Z., Lan, S., Li, X., Zhao, L., Wang, Y., Wang, X., Chen,
J., et al. (2021). Regulation of MYB by distal enhancer elements in human myeloid
leukemia. Cell Death Dis 12, 223. 10.1038/s41419-021-03515-z.
117. Astle, W.J., Elding, H., Jiang, T., Allen, D., Ruklisa, D., Mann, A.L., Mead, D., Bouman, H.,
Riveros-Mckay, F., Kostadima, M.A., et al. (2016). The Allelic Landscape of Human Blood
Cell Trait Variation and Links to Common Complex Disease. Cell 167, 1415-1429.e19.
10.1016/j.cell.2016.10.042.
118. van Rooij, F.J.A., Qayyum, R., Smith, A.V., Zhou, Y., Trompet, S., Tanaka, T., Keller, M.F.,
Chang, L.-C., Schmidt, H., Yang, M.-L., et al. (2017). Genome-wide Trans-ethnic Meta-
analysis Identifies Seven Genetic Loci Influencing Erythrocyte Traits and a Role for
RBPMS in Erythropoiesis. The American Journal of Human Genetics 100, 51–63.
10.1016/j.ajhg.2016.11.016.
119. Tajuddin, S.M., Schick, U.M., Eicher, J.D., Chami, N., Giri, A., Brody, J.A., Hill, W.D.,
Kacprowski, T., Li, J., Lyytikäinen, L.-P., et al. (2016). Large-Scale Exome-wide
Association Analysis Identifies Loci for White Blood Cell Traits and Pleiotropy with
Immune-Mediated Diseases. The American Journal of Human Genetics 99, 22–39.
10.1016/j.ajhg.2016.05.003.
120. Buniello, A., MacArthur, J.A.L., Cerezo, M., Harris, L.W., Hayhurst, J., Malangone, C.,
McMahon, A., Morales, J., Mountjoy, E., Sollis, E., et al. (2019). The NHGRI-EBI GWAS
Catalog of published genome-wide association studies, targeted arrays and summary
statistics 2019. Nucleic Acids Research 47, D1005–D1012. 10.1093/nar/gky1120.
121. Ahola-Olli, A.V., Würtz, P., Havulinna, A.S., Aalto, K., Pitkänen, N., Lehtimäki, T.,
Kähönen, M., Lyytikäinen, L.-P., Raitoharju, E., Seppälä, I., et al. (2017). Genome-wide
Association Study Identifies 27 Loci Influencing Concentrations of Circulating Cytokines
and Growth Factors. The American Journal of Human Genetics 100, 40–50.
10.1016/j.ajhg.2016.11.007.
122. Chang, J.S., Zhou, M., Buffler, P.A., Chokkalingam, A.P., Metayer, C., and Wiemels, J.L.
(2011). Profound deficit of IL10 at birth in children who develop childhood acute
lymphoblastic leukemia. Cancer Epidemiol Biomarkers Prev 20, 1736–1740.
10.1158/1055-9965.EPI-11-0162.
123. Lynch, J.R., Salik, B., Connerty, P., Vick, B., Leung, H., Pijning, A., Jeremias, I.,
Spiekermann, K., Trahair, T., Liu, T., et al. (2019). JMJD1C-mediated metabolic
dysregulation contributes to HOXA9-dependent leukemogenesis. Leukemia 33, 1400–
1410. 10.1038/s41375-018-0354-z.
124. Chen, M., Zhu, N., Liu, X., Laurent, B., Tang, Z., Eng, R., Shi, Y., Armstrong, S.A., and
Roeder, R.G. (2015). JMJD1C is required for the survival of acute myeloid leukemia by
functioning as a coactivator for key transcription factors. Genes Dev 29, 2123–2139.
10.1101/gad.267278.115.
125. Xiao, F., Liao, B., Hu, J., Li, S., Zhao, H., Sun, M., Gu, J., and Jin, Y. (2017). JMJD1C
Ensures Mouse Embryonic Stem Cell Self-Renewal and Somatic Cell Reprogramming
through Controlling MicroRNA Expression. Stem Cell Reports 9, 927–942.
10.1016/j.stemcr.2017.07.013.
126. Cimmino, L., Dawlaty, M.M., Ndiaye-Lobry, D., Yap, Y.S., Bakogianni, S., Yu, Y.,
Bhattacharyya, S., Shaknovich, R., Geng, H., Lobry, C., et al. (2015). TET1 is a tumor
suppressor of hematopoietic malignancy. Nat Immunol 16, 653–662. 10.1038/ni.3148.
127. Bamezai, S., Demir, D., Pulikkottil, A.J., Ciccarone, F., Fischbein, E., Sinha, A., Borga, C.,
te Kronnie, G., Meyer, L.-H., Mohr, F., et al. (2020). TET1 promotes growth of T-cell acute
lymphoblastic leukemia and can be antagonized via PARP inhibition. Leukemia.
10.1038/s41375-020-0864-3.
128. Lappalainen, T., Sammeth, M., Friedländer, M.R., ‘t Hoen, P.A.C., Monlong, J., Rivas,
M.A., Gonzàlez-Porta, M., Kurbatova, N., Griebel, T., Ferreira, P.G., et al. (2013).
Transcriptome and genome sequencing uncovers functional variation in humans. Nature
501, 506–511. 10.1038/nature12531.
129. Weissbrod, O., Flint, J., and Rosset, S. (2018). Estimating SNP-Based Heritability and
Genetic Correlation in Case-Control Studies Directly and with Summary Statistics. The
American Journal of Human Genetics 103, 89–99. 10.1016/j.ajhg.2018.06.002.
130. Golan, D., Lander, E.S., and Rosset, S. (2014). Measuring missing heritability: Inferring the
contribution of common variants. Proc Natl Acad Sci USA 111, E5272–E5281.
10.1073/pnas.1419064111.
131. Steinsaltz, D., Dahl, A., and Wachter, K.W. (2020). On Negative Heritability and Negative
Estimates of Heritability. Genetics 215, 343–357. 10.1534/genetics.120.303161.
132. the PRACTICAL consortium, Mancuso, N., Rohland, N., Rand, K.A., Tandon, A., Allen, A.,
Quinque, D., Mallick, S., Li, H., Stram, A., et al. (2016). The contribution of rare variation to
prostate cancer heritability. Nat Genet 48, 30–35. 10.1038/ng.3446.
133. Jeon, S., de Smith, A.J., Li, S., Chen, M., Chan, T.F., Muskens, I.S., Morimoto, L.M.,
DeWan, A.T., Mancuso, N., Metayer, C., et al. (2022). Genome-wide trans-ethnic meta-
analysis identifies novel susceptibility loci for childhood acute lymphoblastic leukemia.
Leukemia 36, 865–868. 10.1038/s41375-021-01465-1.
134. Vijayakrishnan, J., Qian, M., Studd, J.B., Yang, W., Kinnersley, B., Law, P.J., Broderick, P.,
Raetz, E.A., Allan, J., Pui, C.-H., et al. (2019). Identification of four novel associations for
B-cell acute lymphoblastic leukaemia risk. Nat Commun 10, 5348.
10.1038/s41467-019-13069-6.
135. Purcell, S.M., Wray, N.R., Stone, J.L., Visscher, P.M., O’Donovan, M.C., Sullivan, P.F.,
Sklar, P., et al. (2009). Common
polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460,
748–752. 10.1038/nature08185.
136. Santoro, M.L., Ota, V., de Jong, S., Noto, C., Spindola, L.M., Talarico, F., Gouvea, E., Lee,
S.H., Moretti, P., Curtis, C., et al. (2018). Polygenic risk score analyses of symptoms and
treatment response in an antipsychotic-naive first episode of psychosis cohort. Transl
Psychiatry 8, 1–8. 10.1038/s41398-018-0230-7.
137. Fatumo, S., Chikowore, T., Choudhury, A., Ayub, M., Martin, A.R., and Kuchenbaecker, K.
(2022). A roadmap to increase diversity in genomic studies. Nat Med 28, 243–250.
10.1038/s41591-021-01672-4.
138. Vijayakrishnan, J., Studd, J., Broderick, P., Kinnersley, B., Holroyd, A., Law, P.J., Kumar,
R., Allan, J.M., Harrison, C.J., Moorman, A.V., et al. (2018). Genome-wide association
study identifies susceptibility loci for B-cell childhood acute lymphoblastic leukemia. Nat
Commun 9, 1340. 10.1038/s41467-018-03178-z.
139. the Haplotype Reference Consortium (2016). A reference panel of 64,976 haplotypes for
genotype imputation. Nat Genet 48, 1279–1283. 10.1038/ng.3643.
140. Privé, F., Arbel, J., and Vilhjálmsson, B.J. (2020). LDpred2: better, faster, stronger.
Bioinformatics 36, 5424–5431. 10.1093/bioinformatics/btaa1029.
141. Vilhjálmsson, B.J., Yang, J., Finucane, H.K., Gusev, A., Lindström, S., Ripke, S.,
Genovese, G., Loh, P.-R., Bhatia, G., Do, R., et al. (2015). Modeling Linkage
Disequilibrium Increases Accuracy of Polygenic Risk Scores. Am J Hum Genet 97, 576–
592. 10.1016/j.ajhg.2015.09.001.
142. Alexander, D.H., Novembre, J., and Lange, K. (2009). Fast model-based estimation of
ancestry in unrelated individuals. Genome Research 19, 1655–1664.
10.1101/gr.094052.109.
143. Spence, J.P., Sinnott-Armstrong, N., Assimes, T.L., and Pritchard, J.K. (2022). A flexible
modeling and inference framework for estimating variant effect sizes from GWAS summary
statistics (Genomics) 10.1101/2022.04.18.488696.
144. Márquez-Luna, C., Loh, P.-R., South Asian Type 2 Diabetes (SAT2D) Consortium, The
SIGMA Type 2 Diabetes Consortium, and Price, A.L. (2017). Multiethnic polygenic risk
scores improve risk prediction in diverse populations. Genet. Epidemiol. 41, 811–823.
10.1002/gepi.22083.
145. Zavala, V.A., Bracci, P.M., Carethers, J.M., Carvajal-Carmona, L., Coggins, N.B., Cruz-
Correa, M.R., Davis, M., de Smith, A.J., Dutil, J., Figueiredo, J.C., et al. (2021). Cancer
health disparities in racial/ethnic minorities in the United States. Br J Cancer 124, 315–332.
10.1038/s41416-020-01038-6.
146. Taylor, O.A., Brown, A.L., Brackett, J., Dreyer, Z.E., Moore, I.K., Mitby, P., Hooke, M.C.,
Hockenberry, M.J., Lupo, P.J., and Scheurer, M.E. (2018). Disparities in Neurotoxicity Risk
and Outcomes among Pediatric Acute Lymphoblastic Leukemia Patients. Clin Cancer Res
24, 5012–5017. 10.1158/1078-0432.CCR-18-0939.
147. Quiroz, E., Venkateswaran, A.R., Nelson, R., Aldoss, I., Pullarkat, V., Rego, E., Marcucci,
G., and Douer, D. (2022). Immunophenotype of acute lymphoblastic leukemia in minorities-
analysis from the SEER database. Hematol Oncol 40, 105–110. 10.1002/hon.2945.
148. Dores, G.M., Devesa, S.S., Curtis, R.E., Linet, M.S., and Morton, L.M. (2012). Acute
leukemia incidence and patient survival among children and adults in the United States,
2001-2007. Blood 119, 34–43. 10.1182/blood-2011-04-347872.
149. Migliorini, G., Fiege, B., Hosking, F.J., Ma, Y., Kumar, R., Sherborne, A.L., da Silva Filho,
M.I., Vijayakrishnan, J., Koehler, R., Thomsen, H., et al. (2013). Variation at 10p12.2 and
10p14 influences risk of childhood B-cell acute lymphoblastic leukemia and phenotype.
Blood 122, 3298–3307. 10.1182/blood-2013-03-491316.
150. Walsh, K.M., de Smith, A.J., Chokkalingam, A.P., Metayer, C., Roberts, W., Barcellos,
L.F., Wiemels, J.L., and Buffler, P.A. (2013). GATA3 risk alleles are associated with
ancestral components in Hispanic children with ALL. Blood 122, 3385–3387.
10.1182/blood-2013-08-524124.
151. Xu, K., Li, S., Pandey, P., Kang, A.Y., Morimoto, L.M., Mancuso, N., Ma, X., Metayer, C.,
Wiemels, J.L., and de Smith, A.J. (2022). Investigating DNA methylation as a mediator of
genetic risk in childhood acute lymphoblastic leukemia. Hum Mol Genet 31, 3741–3756.
10.1093/hmg/ddac137.
152. 1000 Genomes Project Consortium, Auton, A., Brooks, L.D., Durbin, R.M., Garrison, E.P.,
Kang, H.M., Korbel, J.O., Marchini, J.L., McCarthy, S., McVean, G.A., et al. (2015). A
global reference for human genetic variation. Nature 526, 68–74. 10.1038/nature15393.
153. Taliun, D., Harris, D.N., Kessler, M.D., Carlson, J., Szpiech, Z.A., Torres, R., Taliun,
S.A.G., Corvelo, A., Gogarten, S.M., Kang, H.M., et al. (2021). Sequencing of 53,831
diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299.
10.1038/s41586-021-03205-y.
154. Chang, C.C., Chow, C.C., Tellier, L.C., Vattikuti, S., Purcell, S.M., and Lee, J.J. (2015).
Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaSci 4,
7. 10.1186/s13742-015-0047-8.
155. Zou, Y., Carbonetto, P., Wang, G., and Stephens, M. (2022). Fine-mapping from summary
data with the “Sum of Single Effects” model. PLoS Genet 18, e1010299.
10.1371/journal.pgen.1010299.
156. Maples, B.K., Gravel, S., Kenny, E.E., and Bustamante, C.D. (2013). RFMix: a
discriminative modeling approach for rapid and robust local-ancestry inference. Am J Hum
Genet 93, 278–288. 10.1016/j.ajhg.2013.06.020.
157. Kachuri, L., Mak, A.C.Y., Hu, D., Eng, C., Huntsman, S., Elhawary, J.R., Gupta, N.,
Gabriel, S., Xiao, S., Keys, K.L., et al. (2021). Gene expression in African Americans and
Latinos reveals ancestry-specific patterns of genetic architecture (Genetics)
10.1101/2021.08.19.456901.
158. Price, A.L., Patterson, N., Yu, F., Cox, D.R., Waliszewska, A., McDonald, G.J., Tandon, A.,
Schirmer, C., Neubauer, J., Bedoya, G., et al. (2007). A genomewide admixture map for
Latino populations. Am J Hum Genet 80, 1024–1036. 10.1086/518313.
159. Fu, Q., Li, H., Moorjani, P., Jay, F., Slepchenko, S.M., Bondarev, A.A., Johnson, P.L.F.,
Aximu-Petri, A., Prüfer, K., de Filippo, C., et al. (2014). Genome sequence of a 45,000-
year-old modern human from western Siberia. Nature 514, 445–449.
10.1038/nature13810.
160. Yu, F., Cato, L.D., Weng, C., Liggett, L.A., Jeon, S., Xu, K., Chiang, C.W.K., Wiemels, J.L.,
Weissman, J.S., de Smith, A.J., et al. (2022). Variant to function mapping at single-cell
resolution through network propagation. Nat Biotechnol 40, 1644–1653.
10.1038/s41587-022-01341-y.
161. Bao, E.L., Nandakumar, S.K., Liao, X., Bick, A.G., Karjalainen, J., Tabaka, M., Gan, O.I.,
Havulinna, A.S., Kiiskinen, T.T.J., Lareau, C.A., et al. (2020). Inherited myeloproliferative
neoplasm risk affects haematopoietic stem cells. Nature 586, 769–775.
10.1038/s41586-020-2786-7.
162. Voit, R.A., Tao, L., Yu, F., Cato, L.D., Cohen, B., Liao, X., Fiorini, C., Nandakumar, S.K.,
Wahlster, L., Teichert, K., et al. (2022). A genetic disorder reveals a hematopoietic stem
cell regulatory network co-opted in leukemia. 2021.12.09.471942.
10.1101/2021.12.09.471942.
163. Quintana-Murci, L. (2019). Human Immunology through the Lens of Evolutionary Genetics.
Cell 177, 184–199. 10.1016/j.cell.2019.02.033.
164. Hui, D., Xiao, B., Dikilitas, O., Freimuth, R.R., Irvin, M.R., Jarvik, G.P., Kottyan, L., Kullo, I.,
Limdi, N.A., Liu, C., et al. (2022). Quantifying factors that affect polygenic risk score
performance across diverse ancestries and age groups for body mass index.
2022.05.27.22275647. 10.1101/2022.05.27.22275647.
165. Patterson, N., Hattangadi, N., Lane, B., Lohmueller, K.E., Hafler, D.A., Oksenberg, J.R.,
Hauser, S.L., Smith, M.W., O’Brien, S.J., Altshuler, D., et al. (2004). Methods for High-
Density Admixture Mapping of Disease Genes. The American Journal of Human Genetics
74, 979–1000. 10.1086/420871.
166. Hoggart, C.J., Shriver, M.D., Kittles, R.A., Clayton, D.G., and McKeigue, P.M. (2004).
Design and Analysis of Admixture Mapping Studies. The American Journal of Human
Genetics 74, 965–978. 10.1086/420855.
167. Winkler, C.A., Nelson, G.W., and Smith, M.W. (2010). Admixture mapping comes of age.
Annu Rev Genomics Hum Genet 11, 65–89. 10.1146/annurev-genom-082509-141523.
168. Quiroz, E., Aldoss, I., Pullarkat, V., Rego, E., Marcucci, G., and Douer, D. (2019). The
emerging story of acute lymphoblastic leukemia among the Latin American population -
biological and clinical implications. Blood Rev 33, 98–105. 10.1016/j.blre.2018.08.002.
169. Guo, J., Yang, J., and Visscher, P.M. (2018). Leveraging GWAS for complex traits to
detect signatures of natural selection in humans. Current Opinion in Genetics &
Development 53, 9–14. 10.1016/j.gde.2018.05.012.
Abstract
Despite advances in treatment, acute lymphoblastic leukemia (ALL) remains a leading cause of childhood mortality in the U.S. [82,83]. The risk of ALL differs substantially across racial/ethnic populations in the United States. For example, Latino children have the highest risk of ALL, with an incidence rate ~30-40% higher than that in non-Latino whites, who in turn have a ~50% higher incidence rate than non-Latino blacks [84,85]. Latino children also have an increased chance of relapse and poorer overall survival compared to children from other ethnic populations [86,87]. Environmental and birth characteristics could have contributed to these disparities, such as high birth weight, diet, maternal folate intake and alcohol consumption, and in utero pesticide exposure [1]. Genetic variation and genetic ancestry also likely play an important role in this ethnic disparity [88-92]. In addition, ALL is a cancer of the immune-forming cells, and epidemiological studies have shown that both prenatal immune development and postnatal infectious disease histories can contribute to risk of ALL [93-95]. Infection and immunity are also prime drivers of natural selection in human history, through which phenotypic differences can arise between populations [96].
The overarching goal of this dissertation is to improve our understanding of the genetic architecture of ALL and to investigate the genetic mechanisms through which differences in disease risk may arise between populations, notably between Latinos and non-Latino whites. Specifically, we first performed the largest genome-wide association study for ALL across ethnic populations representing multiple ancestries. Our effort identified three novel putative loci associated with ALL, in addition to secondary independent variants in two previously known loci. Second, leveraging the genome-wide summary statistics from our investigation, we constructed genomic polygenic risk score (PRS) models and evaluated their efficacy in predicting and stratifying individuals based on estimated genetic risks in Latino and non-Latino white populations. We found that genomic PRS models based on multi-ethnic meta-analysis summary statistics perform comparably in both Latino and non-Latino white populations. However, possibly due to the unique oligogenic architecture of ALL, the genomic PRS models are currently not necessarily more efficacious than a naïve PRS model based on only genome-wide significant loci. Third, we investigated in detail the association signal at one ALL locus, IKZF1. We identified a novel associated variant at this locus whose effect on ALL is specific to the Latino population. The association signal for this variant was masked by other associations that are shared across populations, and the variant overlaps a putative functional regulatory element based on histone modification peaks. More importantly, the risk allele is prevalent in Latinos due to its origin on an Indigenous American haplotype, and analysis of local genealogical trees shows that the allele exhibits a signature of positive selection.
Hippocrates, in his book Epidemics, is credited with recommending that physicians “declare the past, diagnose the present, [and] foretell the future.” For childhood ALL, we have investigated its past by studying the evolutionary mechanism contributing to elevated risk of ALL among Latinos at the IKZF1 locus. We have delineated its present through a genome-wide association study of populations with multiple ancestries to identify additional loci contributing to ALL risk today. We have improved the prospects for foretelling its future by evaluating risk prediction models based on polygenic risk scores for ALL. Together, we have undertaken a multi-pronged approach through functional genomics, statistical genetics, and population genetics to better understand the genetic etiology of ALL.
Asset Metadata
Creator
Jeon, Soyoung (author)
Core Title
Understanding acute lymphoblastic leukemia in different ethnic groups in the United States
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Cancer Biology and Genomics
Degree Conferral Date
2023-05
Publication Date
03/13/2023
Defense Date
01/13/2023
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
acute lymphoblastic leukemia, cancer genetics, childhood leukemia, OAI-PMH Harvest, population genetics, transethnic
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Allayee, Hooman (committee chair), Chiang, Charleston (committee member), de Smith, Adam (committee member), Hwang, Amie (committee member), Wiemels, Joseph (committee member)
Creator Email
jeonsoyo@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC112838579
Unique identifier
UC112838579
Identifier
etd-JeonSoyoun-11500.pdf (filename)
Legacy Identifier
etd-JeonSoyoun-11500
Document Type
Dissertation
Rights
Jeon, Soyoung
Internet Media Type
application/pdf
Type
texts
Source
20230320-usctheses-batch-1010 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu