A Global View of Disparity in Imputation Resources for Conducting Genetic Studies in Diverse
Populations
by
Camellia X.Y. Rui
A Thesis Presented to the
FACULTY OF THE USC KECK SCHOOL OF MEDICINE
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
MASTER OF SCIENCE
(BIOSTATISTICS)
May 2022
Copyright 2022 Xinyue Rui
Acknowledgments
Throughout the writing of this thesis, I have received a great deal of support and
assistance.
I would first like to thank my supervisor, Professor Charleston Chiang, whose expertise
was invaluable in formulating the ideas and methodology. Your insightful feedback sharpened
my thinking and brought my work to a higher level. Working with you motivated me in pursuing
graduate studies in the human genetics field.
I would like to acknowledge my colleagues Minhui Chen and Ying-Chu Lo for their
wonderful help in the project. Minhui and Ying-Chu, I want to thank you for your patient
support and for all of the ideas to further my research. I would also like to thank Echo Tang,
Jordan Cahoon, and Christopher Simons for your enthusiastic involvement in the project. I
would also like to acknowledge Tsz Fung Chan, Bryan Dinh, and Zeyun Lu, from whom I
learned a great deal as an aspiring researcher.
In addition, I would like to thank USC for consistently funding my project through the Provost’s
Undergraduate Research Fellowship.
TABLE OF CONTENTS
Acknowledgments
List of Tables
List of Figures
Introduction
Chapter 1: Method
Chapter 2: Results
Chapter 3: Discussion
References
List of Tables
Table 1: SNPs Statistics during the Process of Quality Control for Lipson et al. 2018 (Science)
Table 2: SNPs Statistics during the Process of Quality Control for Lipson et al. 2018 (Current Biology)
Table 3: SNPs Statistics during the Process of Quality Control for Skoglund et al. 2016
Table 4: SNPs Statistics during the Process of Quality Control for Xing et al. 2010
Table 5: SNPs Statistics during the Process of Quality Control for Wang et al. 2021
Table 6: SNPs Statistics in the Imputation Server for Lipson et al. 2018 (Science)
Table 7: SNPs Statistics in the Imputation Server for Lipson et al. 2018 (Current Biology)
Table 8: SNPs Statistics in the Imputation Server for Skoglund et al. 2016
Table 9: SNPs Statistics in the Imputation Server for Xing et al. 2010
Table 10: SNPs Statistics in the Imputation Server for Wang et al. 2021
List of Figures
Figure 1. Map of Imputed Populations
Figure 2. Paired Comparison of the Imputation Accuracy of Thailand Individuals vs UK10K (N=20, 321,136 SNPs on Affymetrix Human Origins Array)
Figure 3. Paired Comparison of the Imputation Accuracy of Vanuatu Individuals vs UK10K (N=203, 423,316 SNPs on Affymetrix Human Origins Array)
Figure 4. Paired Comparison of the Imputation Accuracy of Brunei Individuals vs UK10K (N=20, 367,644 SNPs on Affymetrix Human Origins Array)
Figure 5. Paired Comparison of the Imputation Accuracy of Papuan Individuals vs UK10K (N=269, 405,901 SNPs on Affymetrix Human Origins Array)
Figure 6. Paired Comparison of the Imputation Accuracy of Philippines Individuals vs UK10K (N=21, 374,161 SNPs on Affymetrix Human Origins Array)
Figure 7. Paired Comparison of the Imputation Accuracy of Northern China Individuals (N=41, 412,605 SNPs) vs UK10K (N=203, 423,316 SNPs on Affymetrix Human Origins Array)
Figure 8. Paired Comparison of the Imputation Accuracy of Han Chinese Individuals vs UK10K (N=64, 401,208 SNPs on Affymetrix Human Origins Array)
Figure 9. Paired Comparison of the Imputation Accuracy of Zhuang Chinese Individuals vs UK10K (N=40, 380,058 SNPs on Affymetrix Human Origins Array)
Figure 10. Paired Comparison of the Imputation Accuracy of Ethnic Minorities in Nepal Individuals vs UK10K (N=32, 420,075 SNPs on Affymetrix Human Origins Array)
Figure 11. Paired Comparison of the Imputation Accuracy of Central Asians Individuals vs UK10K (N=21, 412,605 SNPs on Affymetrix Human Origins Array)
Figure 12. Paired Comparison of the Imputation Accuracy of Tibetan and Nepal Individuals vs UK10K (N=115, 425,086 SNPs on Affymetrix Human Origins Array)
Figure 13. Paired Comparison of the Imputation Accuracy of Other Ethnic in Southern China Individuals vs UK10K (N=64, 398,877 SNPs on Affymetrix Human Origins Array)
Figure 14. Paired Comparison of the Imputation Accuracy of Bolivian/Totonac Individuals vs UK10K (N=47, 237,292 SNPs on Affymetrix Nspl 250K Array)
Figure 15. Paired Comparison of the Imputation Accuracy of Central Asian Siberian Individuals vs UK10K (N=147, 243,705 SNPs on Affymetrix Nspl 250K Array)
Figure 16. Paired Comparison of the Imputation Accuracy of Central Asian Mongolian Individuals vs UK10K (N=50, 216,289 SNPs on Affymetrix Nspl 250K Array)
Figure 17. Paired Comparison of the Imputation Accuracy of Ethnic Minorities from Caucasus vs UK10K (N=47, 233,850 SNPs on Affymetrix Nspl 250K Array)
Figure 18. Paired Comparison of the Imputation Accuracy of West African Individuals vs UK10K (N=109, 239,322 SNPs on Affymetrix Nspl 250K Array)
Figure 19. Paired Comparison of the Imputation Accuracy of Central African Individuals vs UK10K (N=25, 192,452 SNPs on Affymetrix Nspl 250K Array)
Figure 20. Paired Comparison of the Imputation Accuracy of Southeast Asian Individuals vs UK10K (N=61, 240,836 SNPs on Affymetrix Nspl 250K Array)
Figure 21. Paired Comparison of the Imputation Accuracy of East Asian in Nepal Individuals vs UK10K (N=113, 20,211 SNPs on Affymetrix Nspl 250K Array)
Figure 22. Paired Comparison of the Imputation Accuracy of Tongan/Samoan Individuals vs UK10K (N=26, 233,538 SNPs on Affymetrix Nspl 250K Array)
Figure 23. Imputation Accuracy of East Asians vs Europeans
Figure 24. Imputation Accuracy of Southeast Asians vs Europeans
Figure 25. Imputation Accuracy of Africans vs Europeans
Figure 26. Imputation Accuracy of Oceanians vs Europeans
Abstract
Genotype imputation is an essential tool for genome-wide association studies (GWAS),
especially when whole-genome sequencing of the sample is not readily available. By
comparing the study sample against an imputation reference panel consisting of individuals
whose genomes were previously sequenced, one can greatly increase the coverage of the genomic
dataset and thereby the statistical power of the association study. However, previous imputation
reference panels consist largely of individuals of European ancestry and are severely deficient in
sampling populations from other continents. In May 2020, the Trans-Omics for Precision Medicine
(TOPMed) imputation panel became available online with a larger sample size, and it was reported
to achieve better imputation quality for African and Hispanic/Latino American populations because
of the improved representation of those populations in its sequencing data. However, it remains
unclear how effectively the TOPMed data serve as an imputation reference for other global
populations such as East Asians, Southeast Asians, and Polynesians. We found that the
imputation quality of the TOPMed imputation panel for global populations outside of the United
States still lags behind that for European populations. Our assessment thus calls for more
sequencing studies to be conducted more equitably for populations around the globe, so as to
enrich the representation of current imputation reference panels and to enable powerful genetic
association studies in these global populations.
Introduction
Using genetic approaches to understand the similarities and differences in the genetic
architecture of complex traits within and between populations is an essential component of
medical genetics studies. Genome-wide association studies (GWAS) help elucidate the genetic
architecture of complex traits and reveal the relationship between genes and traits or
diseases (Tam et al. 2019). Genotype imputation, a statistical method for predicting genotypes
that are not directly assayed in a sample of individuals (Marchini and Howie, 2010), has become
an essential tool for GWAS. Imputation is particularly promising for researchers focusing on
understudied ethnic minority populations because there are generally fewer resources available
for sequencing the whole genomes of individuals from these populations. Imputation can be an
efficient means to fill in untyped genetic variation in a study population by comparing it to a
panel of reference individuals, so that these variants can then be subjected to association
testing. However, even this approach is potentially heavily biased towards Europeans because
current imputation reference panels are European-centric. Prior to 2020, the most extensive
imputation reference panel routinely used in GWAS was the Haplotype Reference Consortium (HRC)
panel (McCarthy, 2016), which consisted of around 30k individuals with whole-genome sequences,
nearly all of European ancestry.
Major determinants of the imputation quality for a population are the diversity and the sample
size of the reference panel. Factors associated with higher imputation quality are larger panel
size, higher sequencing quality, and genetic similarity between the sample and the reference
panel. Reference panels have increased in size over time, from 180 in HapMap2, to 1092 in 1000
Genomes Phase 1, 2504 in 1000 Genomes Phase 3, and then to 32470 in
HRC (https://imputationserver.readthedocs.io/en/latest/reference-panels/). Besides aggregating a
larger number of samples in reference panels, the field has also been working towards
improving the genetic similarity between sample and reference panel, that is, improving the
diversity and representation of different populations in the imputation reference. These efforts
include:
(i) Building population-specific imputation reference panels. With the rise in the number of
genomic studies of East Asian populations, there is also a need for high-quality, East Asian
ancestry-specific imputation panels. In 2015, 1KJPN was developed using WGS data from 1,070
Japanese individuals. Kawai et al. reported that the 1KJPN reference panel outperformed the
1000G panel for the genotype imputation of Japanese samples, despite a roughly equal or smaller
sample size (Kawai et al., 2015). Similarly, a study seeking an improved imputation panel for
Chinese populations was published in 2021. The ChinaMap reference panel was constructed from
samples of 10,155 unrelated Chinese individuals with 40.8X depth WGS data. Li et al.
demonstrated that the imputation quality of ChinaMap exceeded that of TOPMed, HRC, and 1000G
across all allele frequency ranges for 794 Chinese individuals (Li et al., 2021).
(ii) Modifying the current reference panel by incorporating diverse populations. Studies
have shown that expanding the current reference panel with a cohort from the target population
can greatly improve the imputation quality for that population. Jiménez-Kaufmann et al. found
that the addition of 134 Native American genomes to the 1000 Genomes Project panel improved the
imputation accuracy for rare variants with diploid Native ancestry in the Mexican population
(Jiménez-Kaufmann et al., 2022). This study highlighted the critical need for increasing the
representation of non-European populations in the reference panel.
Recognizing the need to diversify the imputation reference panel in order to improve
imputation accuracy, the field has since built a newer, more diverse reference panel. In May
2020, the Trans-Omics for Precision Medicine (TOPMed) reference panel (Taliun, 2019) came online,
with much higher representation from ethnic minorities such as Latinos and African Americans.
Kowalski et al. reported that by including ~26% African American and ~10% Hispanic/Latino
participants among the total 97,256 individuals contained in the imputation reference panel,
TOPMed substantially enhances imputation quality in under-represented admixed populations
(Kowalski et al., 2019). Beyond its better performance on underrepresented admixed
populations, TOPMed can even outperform some population-specific reference panels.
O’Connell et al. examined the imputation performance of 103 samples with African ancestry and
reported that TOPMed outperformed the African-specific reference panel across the spectrum of
allele frequencies (O'Connell et al. 2021).
Despite the significant improvement in the performance of the TOPMed imputation server, the
cohorts sequenced as part of TOPMed remain largely drawn from the major ethnic groups living in
the United States. It is still unclear how effective and equitable the TOPMed imputation
reference is for other ethnic minorities within and outside of the United States. For instance,
it has been noted that certain major continental populations such as East Asians are still
underrepresented and poorly imputed using the TOPMed imputation reference, a pattern that we
also observed in this study. In addition, the imputation efficacy for other underserved
populations (Polynesians, Southeast Asians, Central Asians, etc.) around the world has not been
systematically evaluated. Therefore, the question we ask here is: does TOPMed perform well for
those underserved populations? That is, we aim to identify and characterize the limitations in
current imputation reference panels that would inhibit diverse populations worldwide from being
included in GWAS. To address this question, we surveyed the literature to build a collection of
five publicly available genetic datasets representing many populations worldwide. Together,
these datasets contained over 1500 individuals from 23 populations. We curated these datasets
and performed uniform quality control measures, followed by the imputation of each population
using the TOPMed imputation server. After imputation, we evaluated the imputation quality per
population based on the distribution of imputation quality scores, R², as a function of allele
frequency, and the number of genetic variants successfully imputed. Our assessment suggested
that while the TOPMed reference panel is a significant advance over past imputation reference
panels, it remains insufficient for many populations around the world.
Chapter 1: Method
(i) Data collection and curation
We collected publicly available datasets of 1585 individuals in total from 5 recent
publications. These publications tend to focus on understanding the population structure and
history of diverse populations around the globe. As a result, the sample sizes are typically
small compared to genetic epidemiological studies, and there are usually no phenotypes
associated with the public release. However, unlike epidemiological studies, these datasets
typically contain more demographic information, such as the geographical location of
sampling, as well as ancestry component decompositions such as through the software
ADMIXTURE. We required a sample size of at least 20 individuals for a population to be
retained in the study (after merging similar ancestry populations; see below). Despite the small
sample sizes (N ranging from 20 to 269), we are able to evaluate the imputation efficacy of
common variants and can still observe the disparity in imputation efficacy due to the lack of
diversity in the imputation reference panel.
Here we briefly summarize the five publications used in this study. From Ancient
genomes document multiple waves of migration in Southeast Asian prehistory (Lipson et
al. 2018), we collected 20 present-day individuals from Thailand, with 593,051 SNPs
genotyped on the Affymetrix Human Origins array, representing a Southeast Asian source
population. From Population Turnover in Remote Oceania Shortly after Initial Settlement
(Lipson et al. 2018), we obtained 203 individuals from Vanuatu, with 593,051 SNPs genotyped on
the Affymetrix Human Origins array. From Genomic insights into the peopling of the Southwest
Pacific (Skoglund et al. 2016), we obtained 310 individuals from the Brunei, Papuan, and
Philippine populations, with 593,051 SNPs genotyped on the Affymetrix Human Origins array. For
East and Central Asian populations, we obtained 383 individuals from 40 populations from
China (N=337) and Nepal (N=46), genotyped on the Affymetrix Human Origins Array, from Genomic
insights into the formation of human populations in East Asia (Wang et al. 2021). From Toward a
more uniform sampling of human genetic diversity: a survey of worldwide populations by
high-density genotyping (Xing et al. 2010), we extracted 663 individuals from across the globe,
genotyped on the Affymetrix Nspl 250K Array.
(ii) Sample quality control
For each publication, we converted the publicly available genetic datasets into PLINK
format using either available software (e.g. EIGENSTRAT) or custom scripts. We then
applied uniform quality control, removing SNPs with greater than 2% missingness and
individuals with greater than 5% missingness using PLINK v.1.9. We also converted each genetic
dataset to genome build hg38 by liftover.
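The two missingness thresholds above can be illustrated with a minimal sketch. This is not the PLINK procedure used in the study: the function name `missingness_qc`, the in-memory genotype matrix, and the order of the two filters (SNPs first, then individuals on the retained SNPs) are assumptions made only for illustration.

```python
import numpy as np


def missingness_qc(genotypes, snp_max_missing=0.02, ind_max_missing=0.05):
    """Return boolean masks (keep_individuals, keep_snps).

    genotypes: 2-D array, rows = individuals, columns = SNPs, with allele
    counts 0/1/2 and np.nan marking a missing call (hypothetical layout).
    """
    g = np.asarray(genotypes, dtype=float)
    missing = np.isnan(g)
    # Step 1 (assumed order): keep SNPs whose missing-call rate is at most 2%.
    keep_snps = missing.mean(axis=0) <= snp_max_missing
    # Step 2: keep individuals whose missing-call rate, measured on the
    # retained SNPs only, is at most 5%.
    keep_inds = missing[:, keep_snps].mean(axis=1) <= ind_max_missing
    return keep_inds, keep_snps


# Toy data: 100 individuals x 40 SNPs, all homozygous reference.
toy = np.zeros((100, 40))
toy[0:5, 0] = np.nan   # SNP 0 is missing in 5% of individuals
toy[7, 1:4] = np.nan   # individual 7 is missing 3 of the 39 retained SNPs
keep_inds, keep_snps = missingness_qc(toy)
```

In the toy data, SNP 0 exceeds the 2% threshold and individual 7 exceeds the 5% threshold, so both are flagged for removal while everything else is retained.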
(iii) Defining population units of analysis
Depending on the publication, study authors could group the study participants in varying
levels of detail. At the finest level, study authors could report sampling locations that are
geographically proximate and whose sampled individuals are genetically similar, separated
only by arbitrary geopolitical boundaries. Because a larger sample size is desirable for
assessing the efficacy of imputing rare variants, we sought to combine populations that
have similar ancestry decomposition profiles in the original publication. Below we detail this
process for each publication:
Ancient genomes document multiple waves of migration in Southeast Asian prehistory
This study provided 20 present-day individuals from Thailand as a single group. This
grouping was unchanged.
Population Turnover in Remote Oceania Shortly after Initial Settlement
All 203 individuals in the study are from Vanuatu; this grouping was left unchanged.
Genomic insights into the peopling of the Southwest Pacific
This study provided 310 individuals from the Brunei (N=20), Papuan (N=269), and
Philippine (N=21) populations. Since the publication contains no admixture analysis of the
ancestry components of the samples, and the three populations are geographically distinct from
each other, this grouping was left unchanged.
Genomic insights into the formation of human populations in East Asia
This study publicly released 383 individuals from 44 populations across East Asia and
Central Asia. Based on the published ancestry analysis using ADMIXTURE (K=15; Figure SI2-2
from Supplementary Information in Wang et al. 2021) as well as the geographical sampling
location, we grouped the 383 individuals into eight populations: Han from China (N=64), Central
Asia (N=21), Northern China (N=41), Tibet and Nepal (N=115), Other Ethnic Groups in
Southern China (N=64), Qiang from China (N=20), Zhuang from China (N=40), and Ethnic
Minorities in Nepal (N=32).
Toward a more uniform sampling of human genetic diversity: a survey of
worldwide populations by high-density genotyping
This study publicly released 722 individuals from 39 populations across the globe. We
grouped 618 individuals into 11 populations based on the admixture plot (K=12, Figure 3 in
Xing et al. 2010). The group comprising Luhya, Alur, and Hema was dropped because the three
populations are not close enough to one another, either geographically or in ancestry profile,
to be aggregated as a group, and each was too small individually to be included in the analysis.
The remaining 10 groups are named Central African, South African, European, Ethnic Minority
from the Caucasus, Central Asian/Siberian, Central Asian/Mongolian, Southeast Asia, East Asia,
Tongan/Samoan, and Bolivian/Totonac.
(iv) Hardy-Weinberg Test and Filtering
After forming the merged population groups within each publication, we tested for
Hardy-Weinberg equilibrium and removed SNPs with a p-value less than 10⁻⁶ in each
population.
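As an illustration of this filter, the sketch below applies the textbook 1-degree-of-freedom chi-square test for Hardy-Weinberg equilibrium to genotype counts and keeps a SNP only if its p-value is at least 10⁻⁶. The function names are hypothetical, and the chi-square approximation is an assumption made for brevity: the thesis does not state which test statistic was used, and tools such as PLINK typically report an exact test instead.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """P-value of the 1-d.f. chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0  # no genotype calls, nothing to test
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic SNP, no departure can be detected
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Chi-square survival function with 1 d.f.: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_hwe(n_aa, n_ab, n_bb, threshold=1e-6):
    """True if the SNP is retained (p-value not below the threshold)."""
    return hwe_chisq_pvalue(n_aa, n_ab, n_bb) >= threshold
```

For example, genotype counts of 25/50/25 match the Hardy-Weinberg expectation exactly and are retained, whereas counts of 50/0/50 (no heterozygotes at an allele frequency of 0.5) give a chi-square statistic of 100 and are removed.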
The sample sizes and numbers of SNPs at each filtering step are reported in Tables 1-5.
Table 1: SNPs Statistics during the Process of Quality Control for Lipson et al. 2018 (Science)
Table 2: SNPs Statistics during the Process of Quality Control for Lipson et al. 2018 (Current Biology)
Table 3: SNPs Statistics during the Process of Quality Control for Skoglund et al. 2016
Table 4: SNPs Statistics during the Process of Quality Control for Xing et al. 2010
Table 5: SNPs Statistics during the Process of Quality Control for Wang et al. 2021
(v) Imputation
The post-QC, imputation-ready dataset for each population of interest was submitted to the
TOPMed Imputation Server (https://imputation.biodatacatalyst.nhlbi.nih.gov/#!). We selected the
reference panel as ‘TOPMed r2’, array build as ‘GRCh38/hg38’, rsq filter as ‘off’, phasing as
‘Eagle v2.4 (phased output)’, the population as ‘mixed’, and mode as ‘Quality Control &
Imputation’. Imputation results and quality control statistics were downloaded and stored on
USC’s CARC Discovery cluster for downstream analysis, and are summarized in Tables 6-10.
Table 6: SNPs Statistics in the Imputation Server for Lipson et al. 2018 (Science)
Table 7: SNPs Statistics in the Imputation Server for Lipson et al. 2018 (Current Biology)
Table 8: SNPs Statistics in the Imputation Server for Skoglund et al. 2016
Table 9: SNPs Statistics in the Imputation Server for Xing et al. 2010
Table 10: SNPs Statistics in the Imputation Server for Wang et al. 2021
(vi) Data Analysis
We used Python 3.1.2 to generate line plots of imputation quality vs MAF for East
Asian populations, Southeast Asian populations, and African populations. We compared the
imputation quality to that of a sample from the UK10K dataset as a reference (Walter et al. 2015).
UK10K consists of whole-genome sequencing data from 3,781 individuals. We extracted the
593,051 SNPs found on the Affymetrix Human Origins Array and the 246,554 SNPs found on the
Affymetrix Nspl 250K array from the 3,781 UK10K individuals to mimic a genotyping dataset of
European ancestry individuals for imputation. Note that the populations presented in these line
plots have different sample sizes. Given the sample size of a population, SNPs with minor
allele frequencies less than 1/(2*sample size) are dropped because these variants are not
expected to be observed even once in a sample of that size, and thus their imputation quality is
necessarily low. To avoid overstating the deficiency in imputation accuracy simply due to
sample size, we elected not to include SNPs below this threshold, and focused our attention on
the imputation accuracy of the more common SNPs expected to be found in the population
sample. We thus binned the SNPs into multiple minor allele frequency bins (0.001-0.005,
0.005-0.01, 0.01-0.05, 0.05-0.1, 0.1-0.5) and generated the line plots below. The Y-axis shows
R² as a measure of imputation quality and the X-axis shows the MAF bins.
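The threshold and binning described above can be sketched as follows. For the smallest population (N=20) the threshold is 1/(2*20) = 0.025, whereas for the largest (N=269) it is 1/(2*269), approximately 0.0019. The function names, the (MAF, R²) input tuples, the treatment of each bin as half-open (lower bound excluded, upper bound included), and the use of the mean as the per-bin summary are all assumptions for illustration rather than the thesis's actual script.

```python
from statistics import mean

# MAF bins quoted in the text.
MAF_BINS = [(0.001, 0.005), (0.005, 0.01), (0.01, 0.05), (0.05, 0.1), (0.1, 0.5)]


def min_maf(sample_size):
    """Smallest MAF at which one minor-allele copy is expected among 2N chromosomes."""
    return 1.0 / (2 * sample_size)


def mean_rsq_by_bin(variants, sample_size, bins=MAF_BINS):
    """Summarize imputation R^2 per MAF bin.

    variants: iterable of (maf, rsq) pairs (hypothetical input format).
    Returns {bin: mean R^2, or None if the bin is empty}.
    """
    floor = min_maf(sample_size)
    grouped = {b: [] for b in bins}
    for maf, rsq in variants:
        if maf < floor:
            continue  # too rare to be expected even once in this sample
        for lo, hi in bins:
            if lo < maf <= hi:  # assumed bin convention: (lo, hi]
                grouped[(lo, hi)].append(rsq)
                break
    return {b: (mean(v) if v else None) for b, v in grouped.items()}


# Toy example for a population of 20 individuals (threshold 0.025).
toy_variants = [(0.002, 0.1), (0.02, 0.3), (0.03, 0.5), (0.04, 0.7), (0.2, 0.9), (0.5, 1.0)]
summary = mean_rsq_by_bin(toy_variants, sample_size=20)
```

In the toy example the first two variants fall below the 0.025 threshold and are dropped, so the two rarest bins stay empty, which mirrors why small populations contribute no points to the rare-frequency end of the plots.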
We also conducted paired comparisons of the imputation quality of each global population with
that of the European population in UK10K. For each population, we extracted the same set of
SNPs and the same number of individuals from UK10K for comparison. Results are shown as
paired boxplots of the study population and the UK10K reference for each minor allele
frequency bin. Boxplots were generated with Python 3.1.2.
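A minimal sketch of the matching step is given below. The thesis does not describe how the UK10K individuals were chosen, so the random draw with a fixed seed, the function name `matched_reference_subset`, and the use of plain ID lists are assumptions made only to illustrate matching on both sample size and SNP set.

```python
import random


def matched_reference_subset(ref_ids, ref_snps, target_n, target_snps, seed=0):
    """Pick target_n reference individuals and the SNPs shared with the target.

    ref_ids / ref_snps: sample and SNP identifiers in the reference dataset.
    target_n / target_snps: sample size and post-QC SNP set of the study
    population. Random selection is an assumption, not the thesis's method.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    ids = rng.sample(sorted(ref_ids), target_n)
    snps = sorted(set(ref_snps) & set(target_snps))
    return ids, snps


# Toy example: match a 20-person study population against 100 reference samples.
toy_ref_ids = [f"REF{i:03d}" for i in range(100)]
toy_ids, toy_snps = matched_reference_subset(
    toy_ref_ids, ["rs1", "rs2", "rs3"], 20, ["rs2", "rs3", "rs4"]
)
```

Matching on both dimensions keeps the comparison from being driven by differences in sample size or marker density rather than by ancestry.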
Chapter 2: Results
The 1585 individuals, grouped into 23 populations, broadly cover East Asia, Southeast Asia,
Central Asia, Oceania/Polynesia, and Africa (Figure 1). These populations are generally
underserved in genomic research. The samples were genotyped on either the Affymetrix Human
Origins Array (593,051 SNPs) or the Affymetrix Nspl 250K Array (246,554 SNPs). After quality
control and the exclusion criteria applied by the imputation server, approximately 400,000 SNPs
per population (on the Affymetrix Human Origins Array) and approximately 200,000 SNPs per
population (on the Affymetrix Nspl 250K Array) were used for imputation through the TOPMed
Imputation Server. The sample size and the array type used to genotype each population are
shown in Figure 1.
To visualize the imputation quality of non-European populations relative to the European
reference, we present paired box plots for each non-European population against a European
reference matched on both sample size and SNPs. We further use line plots to compare the
disparity in imputation efficacy within each of the three continental/subcontinental groups.
Figure 1. Map of Imputed Populations
(i) Paired Comparison of Imputation Accuracy Between a European and a Matched
Global Population
We performed a series of paired comparisons of imputation accuracy between the 23
populations we collected and the matched number of individuals and SNPs from UK10K. The
patterns in these boxplots are qualitatively consistent: European individuals from UK10K were
imputed far better than the non-European populations we examined here. Note that the
imputation quality for the UK10K samples varies somewhat, particularly for SNPs in the rarer
frequency bins. This is due to the different sample sizes matched to each non-European study
population. Nevertheless, the disparity in imputation quality between Europeans and
non-Europeans is evident.
Ancient genomes document multiple waves of migration in Southeast Asian prehistory
Figure 2. Paired Comparison of the Imputation Accuracy of Thailand Individuals vs UK10K (N=20,
321,136 SNPs on Affymetrix Human Origins Array)
Population Turnover in Remote Oceania Shortly after Initial Settlement
Figure 3. Paired Comparison of the Imputation Accuracy of Vanuatu Individuals vs UK10K (N=203,
423,316 SNPs on Affymetrix Human Origins Array)
Genomic insights into the peopling of the Southwest Pacific
Figure 4. Paired Comparison of the Imputation Accuracy of Brunei Individuals vs UK10K (N=20,
367,644 SNPs on Affymetrix Human Origins Array)
Figure 5. Paired Comparison of the Imputation Accuracy of Papuan Individuals vs UK10K (N=269,
405,901 SNPs on Affymetrix Human Origins Array)
Figure 6. Paired Comparison of the Imputation Accuracy of Philippines Individuals vs UK10K
(N=21, 374,161 SNPs on Affymetrix Human Origins Array)
Genomic insights into the formation of human populations in East Asia(N=8)
Figure 7. Paired Comparison of the Imputation Accuracy of Northern China Individuals (N=41,
412,605 SNPs) vs UK10K (N=203, 423,316 SNPs on Affymetrix Human Origins Array)
Figure 8. Paired Comparison of the Imputation Accuracy of Han Chinese Individuals vs UK10K
(N=64, 401,208 SNPs on Affymetrix Human Origins Array)
Figure 9. Paired Comparison of the Imputation Accuracy of Zhuang Chinese Individuals vs
UK10K (N=40, 380,058 SNPs on Affymetrix Human Origins Array)
Figure 10. Paired Comparison of the Imputation Accuracy of Ethnic Minorities in Nepal Individuals
vs UK10K (N=32, 420,075 SNPs on Affymetrix Human Origins Array)
Figure 11. Paired Comparison of the Imputation Accuracy of Central Asians Individuals vs UK10K
(N=21, 412,605 SNPs on Affymetrix Human Origins Array)
Figure 12. Paired Comparison of the Imputation Accuracy of Tibetan and Nepal Individuals vs UK10K
(N=115, 425,086 SNPs on Affymetrix Human Origins Array)
Figure 13. Paired Comparison of the Imputation Accuracy of Other Ethnic in Southern China
Individuals vs UK10K (N= 64, 398,877 SNPs on Affymetrix Human Origins Array)
Toward a more uniform sampling of human genetic diversity: a survey
of worldwide populations by high-density genotyping
Figure 14. Paired Comparison of the Imputation Accuracy of Bolivian/Totonac Individuals vs UK10K
(N=47, 237,292 SNPs on Affymetrix Nspl 250K Array)
Figure 15. Paired Comparison of the Imputation Accuracy of Central Asian Siberian Individuals vs
UK10K (N=147, 243,705 SNPs on Affymetrix Nspl 250K Array)
30
Figure 16. Paired Comparison of the Imputation Accuracy of Central Asian Mongolian Individuals vs
UK10K (N=50, 216,289 SNPs on Affymetrix Nspl 250K Array)
31
Figure 17. Paired Comparison of the Imputation Accuracy of Ethnic Minorities from Caucasus vs
UK10K (N=47, 233,850 SNPs on Affymetrix Nspl 250K Array)
32
Figure 18. Paired Comparison of the Imputation Accuracy of West African Individuals vs UK10K
(N=109, 239,322 SNPs on Affymetrix Nspl 250K Array)
Figure 19. Paired Comparison of the Imputation Accuracy of Central African Individuals vs UK10K
(N=25, 192,452 SNPs on Affymetrix Nspl 250K Array)
33
Figure 20. Paired Comparison of the Imputation Accuracy of Southeast Asian Individuals vs UK10K
(N=61, 240,836 SNPs on Affymetrix Nspl 250K Array)
Figure 21. paired Comparison of the Imputation Accuracy of East Asian in Nepal Individuals vs
UK10K (N=113, 20,211 SNPs on Affymetrix Nspl 250K Array)
34
Figure 22. Paired Comparison of the Imputation Accuracy of Tongan/Samoan Individuals vs UK10K
(N=26, 233,538 SNPs on Affymetrix Nspl 250K Array)
35
(ii) Imputation Accuracy at the Continental and Sub-Continental Scale
We visualized the results across all populations stratified by major continental and
subcontinental regions. Specifically, we focused on populations coming from East Asia,
Southeast Asia, and Africa.
The array platform, and thus the number and density of markers (that is, the coverage of the
genome) provided for imputation, also affects imputation accuracy. Among the East Asian
populations, the East Asian sample (N=113) from Xing et al. 2010, consisting of Japanese and
Chinese individuals, was the largest group. However, despite having the largest sample size, it
was imputed poorly compared to the other East Asian populations (including Han Chinese from
Wang et al. 2020), presumably because of the lower SNP density on the genotyping array platform
used to generate the genetic data for this cohort (Affymetrix Nspl 250K array, with only
approximately 245K SNPs available for imputation, compared to approximately 590K SNPs on the
Affymetrix Human Origins Array). The other East Asian populations are Han from China (N=64),
Other Ethnic groups in Southern China (N=64), Northern China (N=41), Zhuang from China (N=40),
and Qiang from China (N=20). In general, across all frequency bins for which we have sufficient
sample sizes to evaluate, the East Asian populations were more poorly imputed than the UK10K
reference individuals as measured by Rsq.
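The comparison by frequency bin can be illustrated with a short sketch. This is a minimal Python illustration, not the analysis code used in this thesis. It assumes each imputed variant has been summarized as a (MAF, Rsq) pair; the bin edges, the function name, and the toy values are hypothetical and chosen only for demonstration.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical upper edges of the MAF bins (assumption: the bin boundaries
# actually used for the figures are not stated in this section).
MAF_BIN_EDGES = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5]


def mean_rsq_by_maf_bin(variants, edges=MAF_BIN_EDGES):
    """Average Rsq within each MAF bin.

    variants: iterable of (maf, rsq) pairs, one per imputed variant.
    Returns a dict mapping each non-empty bin's upper edge to its mean Rsq.
    """
    totals = defaultdict(lambda: [0.0, 0])  # upper edge -> [sum of Rsq, count]
    for maf, rsq in variants:
        if not 0.0 < maf <= edges[-1]:
            continue  # skip monomorphic sites and out-of-range MAF values
        upper = edges[bisect_left(edges, maf)]  # first edge that is >= maf
        totals[upper][0] += rsq
        totals[upper][1] += 1
    return {upper: s / n for upper, (s, n) in sorted(totals.items())}


# Toy values only; these are not results from the thesis.
toy_variants = [(0.002, 0.4), (0.004, 0.6), (0.2, 0.9)]
print(mean_rsq_by_maf_bin(toy_variants))
```

Running the same function on a target population and on the UK10K reference individuals, then comparing the two dictionaries bin by bin, gives the kind of per-bin gap discussed here.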
Figure 23. Imputation Accuracy of East Asians vs Europeans
For the comparison of Southeast Asians with the European reference, we included Thai (N=20)
from Lipson et al. 2018, Philippines (N=21) from Skoglund et al. 2016, Ethnic Minorities in
Nepal (N=32) from Wang et al. 2021, and Southeast Asians (N=61) from Xing et al. 2010.
Southeast Asians from Xing et al. performed worse than the other Southeast Asian
populations, presumably because of the same issue with array platform and marker density.
Consistent with the results from East Asian populations, Southeast Asian populations were also
poorly imputed compared to the UK10K reference individuals, as measured by Rsq across the full
spectrum of minor allele frequencies.
Figure 24. Imputation Accuracy of Southeast Asians vs Europeans
Notably, populations from the African continent were far more poorly imputed than both the
UK10K reference individuals and other continental populations. This may be surprising, given
that a large number of African American individuals were sequenced as part of the TOPMed
initiative. The disparity likely has two sources: the genotyping array and the (lack of) genetic
similarity between the African samples tested here and the populations in the TOPMed reference
panel. African-ancestry individuals in TOPMed are more or less restricted to African Americans
living in the United States. Because African Americans in the United States derive a large
proportion of their ancestry from a single population of origin in West Africa, the genetic
variation catalogued in the TOPMed reference panel may still be insufficient to better impute
populations from other parts of the African continent, or even other populations from West
Africa.
Figure 25. Imputation Accuracy of Africans vs Europeans
Figure 26. Imputation Accuracy of Oceanians vs Europeans
We also compared the imputation quality of the three Oceanian populations collected. The two
best-performing populations are Papuan and Tongan/Samoan, presumably because of their larger sample
sizes. We can clearly observe that rare variants are poorly imputed for all three Oceanian populations.
The imputation reference panel likely lacks individuals with a large Oceanian ancestry component.
Therefore, future studies of imputation accuracy in Oceanian populations will be needed.
Chapter 3: Discussion
A consistent theme in our systematic comparison of imputation accuracy between the
European reference and non-European populations is a clear gap in imputation efficacy between
the two groups when using the current state-of-the-art TOPMed imputation server. This brings
us to a major question currently facing the genetics community: how can we improve the
imputation accuracy for underrepresented populations? An obvious solution is to increase
whole genome sequencing efforts in multiple populations across the globe and combine them
into a single unified imputation reference panel. This approach, however, may be expensive both
in terms of experimental costs (ascertaining individuals donating biospecimens and generating
WGS data) and computational costs (transferring, storing, and computing on hundreds of
thousands of WGS datasets together). Alternatively, Xu et al. provided another possibility in their
recent paper - the development of population-specific genotyping arrays. In this paper, they
proposed using a combination of add-on tag SNPs on a pre-existing genotyping array and an
internal population-specific reference panel to improve imputation accuracy. Their results show
that the H3Africa array with add-on tag SNPs outperformed the H3Africa array alone across the
full spectrum of minor allele frequencies. The mean R² of H3Africa with add-on tag SNPs ranges
between 0.65 and 0.9, whereas that of H3Africa alone ranges between 0.3 and 0.65 (Xu et al.
2022). Another alternative is a ‘hybrid’ method. Herzig et al. suggested that one can
perform imputation once online with the Haplotype Reference Consortium (HRC) panel and then
once locally with a population-specific reference panel. For each SNP, one keeps the posterior
genotype probabilities from whichever imputation has the higher maximal probability. The
‘hybrid’ method remarkably outperformed the other methods, such as using only
population-specific imputation servers or HRC imputation servers (Herzig et al. 2022). Even
though only the older-generation HRC imputation server was tested in this demonstration, the
principle should be transferable to the TOPMed imputation reference.
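The selection rule of the ‘hybrid’ method can be sketched in a few lines. This is a minimal Python illustration of the rule as summarized above, not the implementation of Herzig et al. It assumes the rule is applied per SNP for a single individual and that each imputation run yields three posterior genotype probabilities per SNP; the function name, the "online"/"local" labels, the tie-breaking choice, and the toy values are all hypothetical.

```python
def hybrid_select(online, local):
    """Per SNP, keep the posterior genotype probabilities whose maximum is higher.

    online, local: dicts mapping a SNP ID to a tuple of posterior genotype
    probabilities (P(0/0), P(0/1), P(1/1)) for one individual, from the online
    (e.g. HRC) run and the local population-specific run respectively.
    Returns a dict mapping each SNP ID to (source label, probabilities).
    Ties are given to the online run (an arbitrary choice for this sketch).
    """
    merged = {}
    for snp in sorted(online.keys() | local.keys()):
        a = online.get(snp)
        b = local.get(snp)
        if b is None or (a is not None and max(a) >= max(b)):
            merged[snp] = ("online", a)
        else:
            merged[snp] = ("local", b)  # also covers SNPs imputed only locally
    return merged


# Toy values only; these are not results from any study.
online_run = {"rs1": (0.9, 0.1, 0.0), "rs2": (0.4, 0.35, 0.25)}
local_run = {"rs1": (0.6, 0.3, 0.1), "rs2": (0.05, 0.9, 0.05), "rs3": (0.0, 0.2, 0.8)}
for snp, (source, probs) in hybrid_select(online_run, local_run).items():
    print(snp, source, probs)
```

Because the rule only compares two sets of posterior probabilities, swapping the HRC run for a TOPMed run would leave the selection step unchanged.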
Overall, our results provide insights into the imputation performance of a variety of
populations across the globe using the TOPMed imputation server. This is but the first step, as
our current focus is restricted to only a handful of publications and limited mainly to the Asian
continent. A future focus is to broaden the scope of our analysis to systematically evaluate other
parts of the world. Nevertheless, given the clear disparity in imputation accuracy between
non-European and European populations on our most cutting-edge imputation server, future
efforts are needed to improve the server's performance for non-European populations. Only then
can we promote the inclusiveness and diversity of such populations in genomic research, and
ultimately benefit research on complex diseases in these populations.
References
Kawai, Yosuke, Takahiro Mimori, Kaname Kojima, Naoki Nariai, Inaho Danjoh, Rumiko Saito,
Jun Yasuda, Masayuki Yamamoto, and Masao Nagasaki. 2015. “Japonica Array: Improved
Genotype Imputation by Designing a Population-Specific SNP Array with 1070 Japanese
Individuals.” Journal of Human Genetics 60 (10): 581–87.
Li, Lin, Peide Huang, Xiaohui Sun, Siyu Wang, Min Xu, Sha Liu, Zhimin Feng, et al. 2021.
“The ChinaMAP Reference Panel for the Accurate Genotype Imputation in Chinese
Populations.” Cell Research 31 (12): 1308–10.
Peter, Benjamin M., Desislava Petkova, and John Novembre. 2020. “Genetic Landscapes Reveal
How Human Genetic Diversity Aligns with Geography.” Molecular Biology and Evolution 37
(4): 943–51.
Xing, Jinchuan, W. Scott Watkins, Adam Shlien, Erin Walker, Chad D. Huff, David J.
Witherspoon, Yuhua Zhang, et al. 2010. “Toward a More Uniform Sampling of Human Genetic
Diversity: A Survey of Worldwide Populations by High-Density Genotyping.” Genomics 96 (4):
199–210.
Kowalski, Madeline H., Huijun Qian, Ziyi Hou, Jonathan D. Rosen, Amanda L. Tapia, Yue
Shan, Deepti Jain, et al. 2019. “Use of >100,000 NHLBI Trans-Omics for Precision Medicine
(TOPMed) Consortium Whole Genome Sequences Improves Imputation Quality and Detection
of Rare Variant Associations in Admixed African and Hispanic/Latino Populations.” PLoS
Genetics 15 (12): e1008500.
Jiménez-Kaufmann, A., Chong, A. Y., Cortés, A., Quinto-Cortés, C. D., Fernandez-Valverde, S. L.,
Ferreyra-Reyes, L., Cruz-Hervert, L. P., Medina-Muñoz, S. G., Sohail, M., Palma-Martinez, M. J.,
Delgado-Sánchez, G., Mongua-Rodríguez, N., Mentzer, A. J., Hill, A. V. S., Moreno-Macías, H., Huerta-
Chagoya, A., Aguilar-Salinas, C. A., Torres, M., Kim, H. L., … Moreno-Estrada, A. (2022). Imputation
Performance in Latin American Populations: Improving Rare Variants Representation With the Inclusion
of Native American Genomes. Frontiers in Genetics, 12, 719791. https://doi.org/10.3389/fgene.2021.719791
Xu, Zhi Ming, Sina Rüeger, Michaela Zwyer, Daniela Brites, Hellen Hiza, Miriam Reinhard,
Liliana Rutaihwa, et al. 2022. “Using Population-Specific Add-on Polymorphisms to Improve
Genotype Imputation in Underrepresented Populations.” PLoS Computational Biology 18 (1):
e1009628.
Herzig, Anthony F., Lourdes Velo-Suarez, Richard Redon, Jean-Francois Deleuze, and Emmanuelle
Genin. 2021. “Can Imputation in a European Country Be Improved by Local Reference
Panels? The Example of France.” Human Heredity 85: 79. Basel: Karger.
Tam, Vivian, Nikunj Patel, Michelle Turcotte, Yohan Bossé, Guillaume Paré, and David Meyre.
2019. “Benefits and Limitations of Genome-Wide Association Studies.” Nature Reviews.
Genetics 20 (8): 467–84.
UK10K Consortium, Klaudia Walter, Josine L. Min, Jie Huang, Lucy Crooks, Yasin Memari,
Shane McCarthy, et al. 2015. “The UK10K Project Identifies Rare Variants in Health and
Disease.” Nature 526 (7571): 82–90.
Lipson, Mark, Olivia Cheronet, Swapan Mallick, Nadin Rohland, Marc Oxenham, Michael
Pietrusewsky, Thomas Oliver Pryce, et al. 2018. “Ancient Genomes Document Multiple Waves
of Migration in Southeast Asian Prehistory.” Science 361 (6397): 92–95.
Lipson, Mark, Pontus Skoglund, Matthew Spriggs, Frederique Valentin, Stuart Bedford, Richard
Shing, Hallie Buckley, et al. 2018. “Population Turnover in Remote Oceania Shortly after Initial
Settlement.” Current Biology: CB 28 (7): 1157–65.e7.
Marchini, J., & Howie, B. (2010). Genotype imputation for genome-wide association studies.
Nature Reviews Genetics, 11(7), 499–511. https://doi.org/10.1038/nrg2796
Abstract
Genotype imputation is an essential tool for genome-wide association studies (GWAS), especially when whole-genome sequencing of the sample is not readily available. By comparing against an imputation reference panel consisting of individuals whose genomes were previously sequenced, one can greatly increase the coverage of the genomic dataset and thereby the statistical power of the association study. However, previous imputation reference panels consist largely of individuals of European ancestry and are severely deficient in sampling populations from other continents. In May 2020, the Trans-Omics for Precision Medicine (TOPMed) imputation panel became available online with a larger sample size and was reported to achieve better imputation quality for African and Hispanic/Latino American populations because of the improved representation of those populations in its sequencing data. However, it remains unclear how effectively the TOPMed data serve as an imputation reference for other global populations such as East Asians, Southeast Asians, and Polynesians. We found that the imputation quality of the TOPMed imputation panel for global populations outside of the United States still lags behind that for European populations. Our assessment thus calls for more sequencing studies to be conducted more equitably for populations around the globe, so as to enrich the representation of our current imputation reference panel and to enable powerful genetic association studies in these global populations.
Asset Metadata
Creator
Rui, Xinyue "Camellia"
(author)
Core Title
A global view of disparity in imputation resources for conducting genetic studies in diverse populations
School
Keck School of Medicine
Degree
Master of Science
Degree Program
Biostatistics
Degree Conferral Date
2022-05
Publication Date
04/19/2022
Defense Date
04/19/2022
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
diversity,global,human genetics,imputation,OAI-PMH Harvest,population genetics
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Chiang, Charleston Wen-Kai (committee chair), Gazal, Steven (committee member), Lewinger, Juan Pablo (committee member)
Creator Email
crui@usc.edu,xrui0419@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC111023120
Unique identifier
UC111023120
Document Type
Thesis
Rights
Rui, Xinyue "Camellia"
Type
texts
Source
20220420-usctheses-batch-930
(batch),
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu