Improving the Power of GWAS Z-Score Imputation by
Leveraging Functional data
By
Jingyang Chen
A Thesis Presented to the
FACULTY OF THE USC KECK SCHOOL OF MEDICINE
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
BIOSTATISTICS
May 2020
Copyright 2020 Jingyang Chen
ACKNOWLEDGMENTS
I wish to express my sincere appreciation to my thesis mentor, Dr. Nicholas Mancuso
for being a great mentor: allowing me to join his lab, and being incredibly supportive,
generous and patient. He convincingly guided me throughout the project and
encouraged me to take on various challenges. In addition, I would like to thank my
committee members Dr. David Conti and Dr. Joshua Millstein for their expertise and
guidance on this project. Without their persistent help, the goal of this project would not
have been achieved.
Furthermore, I would like to thank Ph.D. candidate Tsz Fung Chan for his help and
advice when I was working at the lab. I would also like to thank Dr. Omer Weissbrod
from Harvard University for his instruction and clarification on using the software
PolyFun. I am thankful to Dr. Meredith Franklin and program coordinator Sherri Fagan
at the USC Biostatistics program for the professional advice and help that they
provided to me throughout the program.
Lastly, I am deeply grateful to my family for their enduring help and support throughout
my academic career. I am grateful that my parents have always been there for me and
guided me over the language barrier when I first immigrated to the United States.
My parents and brother gave quiet encouragement and positive beliefs on my academic
path, and I will be forever thankful for their love.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
INTRODUCTION
METHODS
RESULTS
DISCUSSION
REFERENCES
APPENDIX
List of Tables
Table 1. List of complex traits and disease analyzed.
Table 2. MAF bins average R² values and reported P-value in comparison with 3 methods.
Table 3. Average R² values and reported P-value in comparison with 3 methods
Table 4. 86 Annotations taus comparison with P_value difference.
Table 5. Taus comparison by LDSC and PolyFun per trait.
List of Figures
Figure 1. Coefficients comparison in Annotations for LDSC vs PolyFun.
Figure 2. Estimates of τ parameters using LDSC and PolyFun for Blood_Eosinophil_Count.
Figure 3. Estimates of τ parameters using LDSC and PolyFun for body_WHRadjBMIz.
Figure 4. Average R² value across 27 traits among 3 methods.
Figure 5. MAF [0 – 5%) Average R² value across 27 traits among 3 methods
Figure 6. MAF (5% - 25%] average R² value across 27 traits among 3 methods
Figure 7. MAF (25% - 50%] R² value across 27 traits among 3 methods
Abstract
Multiple recent works have demonstrated the benefit of imputing unmeasured genome-wide
association study (GWAS) summary statistics directly using only publicly available
genotype reference panels. Summary-statistic imputation is both computationally
efficient and highly accurate. However, imputation in the summary-statistic setting is
conducted under the assumption of the null model, which can lead to a loss of statistical
power at regions with significant associations. Independently, multiple lines of evidence
have shown that incorporating known functional information about genetic variants boosts
statistical power. In this project, we investigate the performance of a random-effects
model that imputes summary statistics using functional information in addition to LD
estimates. Our model, Functionally Informed Z-score Imputation (FIZI), imputes GWAS
summary statistics (Z-scores) at the unmeasured markers by leveraging LD and
functional annotations describing the prior variance of random single-nucleotide
polymorphism (SNP) effects. Imputation in the random-effects regime requires
knowing the variance parameters for each functional category, and it is unclear how best
to perform this estimation prior to imputation. Here, we investigate the performance of
FIZI on real data using two methods for variance parameter inference: LD score regression
(LDSC) and a penalized regression model, PolyFun. We compare our approach with the
fixed-effect model ImpG. We found no significant difference in prior variance estimates
for 24 out of the 27 traits using LDSC and PolyFun with a pairwise t-test. We found that
LDSC outperforms PolyFun, with an average R² = 0.60 compared with an average R² = 0.43.
For a more detailed view, we group the summary statistics by the underlying true Z²
statistics into 3 bins: bin 1 is [0, 10], bin 2 is [10, 30], and bin 3 is [30, inf).
For bin [30, inf), the distributions for the 3 methods are more concentrated near the 1.0
boundary than in the other two bins. This indicates that, at larger association
magnitudes, the imputed summary statistics from all 3 methods agree well with the
original summary statistics. In general, we find that ImpG outperforms the
functionally-informed approaches for summary statistics imputation in our study.
Introduction
Overview
The genome-wide association study (GWAS) [1,2,3] is an approach to identify genetic risk
factors for common diseases (e.g., prostate cancer) and quantitative traits (e.g., height).
It involves rapidly scanning markers across complete sets of DNA, or genomes, to find
genetic variants associated with a trait. GWAS is important because the genetic risk
factors it identifies can be used to predict who is at risk and to uncover the biological
underpinnings of disease, which in turn informs new prevention and treatment
strategies [1].
Procedurally, GWAS operates by first measuring the genotypes of a large number of
individuals using a genotyping chip, or SNP-chip [4]. This assay technology is limited by
the size of the chip and typically quantifies only known common genetic variants. When
there are unmeasured alleles in an association study, we use information on the LD
pattern relating them to the measured markers to impute the unmeasured markers [4]. GWAS
therefore relies on genotype imputation, in which the unmeasured markers are predicted
by referencing large panels of sequenced individuals [5]. As most of the heritability of
complex traits is attributable to common and low-frequency genetic variants, genotype
imputation is cost-effective. However, re-imputing the genetic data and repeating the
meta-analysis whenever the reference panels are updated is computationally demanding
and requires a substantial amount of time. A much more efficient alternative is
summary statistics imputation, which imputes the summary statistics directly [6].
Summary statistics imputation combines the summary statistics for a set of variants with
the fine-scale LD structure of the same genomic region to estimate the summary
statistics of new, unmeasured variants at the same locus [2]. The primary input used to
compute unmeasured GWAS summary statistics is linkage disequilibrium, which can be
obtained from various publicly available reference genome panels. Summary statistics
condense information such as central tendency and spread, and imputation based on
them has proved to be highly accurate and stable.
Genome annotation is the process of identifying the locations of genes and determining
how those genes function. A genome annotation also contains biologically significant
information that is derived from the raw DNA sequence. Functional annotation is the
process of relating biological functions to the genetic elements identified in the
structural annotation step. This component is crucial when it is taken into account in
GWAS imputation. After the genome annotation process, the resulting information is
stored and published in databases as functional data. Scientists have developed
software and tools to aid the imputation step in GWAS. Summary statistics imputation
has been proposed as a way to directly impute meta-analysis summary statistics, which
requires only summary statistics and the linkage disequilibrium (LD) information
estimated from a reference sequencing panel [7]. The advantage of summary statistics
imputation is that it shortens the time spent on imputation. We introduce
Functionally-informed Z-score Imputation (FIZI) [22]. This approach extends the
fixed-effect linear model based on LD-weighted statistics by including prior variance
estimates 𝝉 defined by functional annotations. We apply FIZI to summary data from the
UK Biobank [8,9,10], and evaluate multiple ways of estimating the prior variances and
imputing the summary statistics at unmeasured SNPs.
In our study, we investigate the performance of FIZI using two primary approaches to
estimate the prior variances: LDSC and PolyFun. We apply it to 27 real-world traits from
the UK Biobank (UKBB) data, and find that the prior variance estimates are not
statistically significantly different across traits (at a threshold of P < 0.05/27).
Linkage disequilibrium score regression (LDSC) is a technique that quantifies the
separate contributions of polygenic effects and various confounding factors to GWAS test
statistics [11,12,13]. This approach uses regression analysis to examine the relationship
between the LD score and the test statistic of each SNP from a GWAS. The other method is
Polygenic Functionally-informed fine-mapping (PolyFun). PolyFun estimates prior causal
probabilities for SNPs; we apply this method to estimate the penalized prior
variances [14]. We apply both LDSC and PolyFun to produce SNP-based heritability
estimates, to partition SNP heritability into separate categories, and to calculate
genetic correlations between separate phenotypes. Previous studies have reported that
summary statistics imputation is more effective than relying on the observed association
summary statistics alone when it comes to revealing genetic markers. In our study, we aim
to test the performance of three different summary statistics imputation approaches and
to determine which gives the best estimate of the summary statistics at the unmeasured
markers.
Methods
Overview of methods for summary statistic imputation
Our approach is to apply FIZI to form a linear model based on LD-weighted statistics
with annotations. Gaussian imputation of GWAS (ImpG) [15] is used to directly impute
GWAS summary statistics using the linkage disequilibrium (LD) information
estimated from a reference sequencing panel [16]. The ImpG imputation model is as
follows:
$$ \mathbf{Z}_{i|t} = \boldsymbol{\Sigma}_{i,t} \boldsymbol{\Sigma}_{t,t}^{-1} \mathbf{Z}_{t} $$
where $\boldsymbol{\Sigma}$ is the correlation matrix among all pairs of SNPs induced by LD,
$\boldsymbol{\Sigma}_{i,t} \boldsymbol{\Sigma}_{t,t}^{-1}$ are the weights precomputed from the
reference panel, and $\mathbf{Z}_{t}$ is the vector of observed z-scores at the typed SNPs,
without any information regarding the untyped SNPs. Here $t$ indexes the typed SNPs and
$i$ the untyped SNPs to be imputed.
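As an illustration of the ImpG conditional mean, the following minimal Python sketch imputes one untyped SNP from two typed neighbours. It is not the ImpG software: the function name `impg_impute`, the optional `ridge` argument, and the toy LD matrix are assumptions made for this example.

```python
import numpy as np

def impg_impute(sigma, typed, untyped, z_typed, ridge=0.0):
    """Impute Z-scores at untyped SNPs as Sigma_{i,t} Sigma_{t,t}^{-1} Z_t."""
    sigma = np.asarray(sigma, dtype=float)
    s_tt = sigma[np.ix_(typed, typed)]      # LD among typed SNPs
    s_it = sigma[np.ix_(untyped, typed)]    # LD between untyped and typed SNPs
    # Optional ridge term (an assumption of this sketch, off by default) to
    # stabilise the solve when the reference-panel LD is close to singular.
    s_tt = s_tt + ridge * np.eye(len(typed))
    # Solve the linear system instead of forming an explicit inverse.
    weights = np.linalg.solve(s_tt, s_it.T).T
    return weights @ np.asarray(z_typed, dtype=float)

# Hypothetical LD matrix for three SNPs; SNP 1 is untyped.
sigma = np.array([[1.00, 0.50, 0.25],
                  [0.50, 1.00, 0.50],
                  [0.25, 0.50, 1.00]])
z_imputed = impg_impute(sigma, typed=[0, 2], untyped=[1], z_typed=[2.0, 2.0])
print(z_imputed)
```

The weights depend only on the reference-panel LD, so they can be computed once per region and reused for every trait.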
To test for accuracy and precision, before running FIZI we also take a different approach
to obtain 𝝉: an L2-regularized S-LD score regression, run through Polygenic
Functionally-informed fine-mapping (PolyFun). PolyFun estimates prior causal
probabilities for SNPs that can be used by fine-mapping methods. PolyFun can also
aggregate polygenic data from across the entire genome and hundreds of functional
annotations. Our analysis is based on the assumption that the summary statistics
follow a linear model. The FIZI model performs imputation on the given GWAS
summary statistics $\mathbf{Z}_o$ (measured markers) using linkage disequilibrium and
variance estimates. The FIZI model describes the observed GWAS summary statistics at p
SNPs ($\mathbf{Z}_o$) as a function of linkage disequilibrium (𝚺), their functional
annotations 𝑨, and prior variance estimates 𝝉. Assuming the model is linear and SNP
effect sizes are drawn from a normal distribution with variance defined by functional
categories, we model the unobserved summary data $\mathbf{Z}_u$ with the conditional
distribution:
$$ \mathbf{Z}_u \mid \mathbf{Z}_o, \mathbf{D}, \boldsymbol{\Sigma} \sim N\left( \mathbf{V}_{u,o} \mathbf{V}_{o,o}^{-1} \mathbf{Z}_o,\; \mathbf{V}_{u,u} - \mathbf{V}_{u,o} \mathbf{V}_{o,o}^{-1} \mathbf{V}_{o,u} \right) $$
where $\mathbf{V} = \boldsymbol{\Sigma} + \boldsymbol{\Sigma} \mathbf{D} \boldsymbol{\Sigma}$, and
$\mathbf{V}_{u,o}$, $\mathbf{V}_{o,o}$ and $\mathbf{V}_{u,u}$ denote its blocks for the
unobserved (u) and observed (o) SNPs. These terms capture the uncertainty due to sample
size (𝚺) and the tagged effect-size uncertainty explained by functional annotations
(𝚺𝑫𝚺), where 𝚺 denotes the linkage disequilibrium and 𝑫 = diag(𝑨𝝉) denotes the variance
estimates. The FIZI model recovers the ImpG model as a degenerate case when the prior
variance parameters are zero. In order to impute summary data $\mathbf{Z}_u$ under the
model, we require the relevant functional categories and their corresponding variance
parameters 𝑫. In our study, we used three different approaches: ImpG (no annotations or
prior variance terms), FIZI+LDSC (FIZI model using LDSC-estimated 𝝉), and FIZI+PolyFun
(FIZI model using PolyFun-estimated 𝝉).
When applying FIZI with LDSC, the LD score regression is unpenalized, which could
potentially introduce noise into the estimates of the 𝝉 coefficients. To test for
differences, we use an additional procedure to evaluate the GWAS summary statistics:
PolyFun is used to estimate 𝝉 with penalized LD score regression. The framework is
otherwise the same as the FIZI linear model, and with these two approaches we hope to
increase precision compared to ImpG.
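The conditional model can be sketched in a few lines of Python. This is a simplified illustration, not the FIZI implementation: the function name `fizi_impute`, the clipping of negative prior variances at zero, the toy annotation matrix, and the absence of any sample-size scaling are all assumptions of this example.

```python
import numpy as np

def fizi_impute(sigma, annot, tau, obs, unobs, z_obs):
    """Conditional mean and covariance of Z_u given Z_o with V = Sigma + Sigma D Sigma."""
    sigma = np.asarray(sigma, dtype=float)
    # Per-SNP prior variance D = diag(A tau); clipping at zero is an assumption.
    d = np.clip(np.asarray(annot, dtype=float) @ np.asarray(tau, dtype=float), 0.0, None)
    v = sigma + sigma @ np.diag(d) @ sigma
    v_oo = v[np.ix_(obs, obs)]
    v_uo = v[np.ix_(unobs, obs)]
    v_uu = v[np.ix_(unobs, unobs)]
    w = np.linalg.solve(v_oo, v_uo.T).T          # V_uo V_oo^{-1}
    mean = w @ np.asarray(z_obs, dtype=float)
    cov = v_uu - w @ v_uo.T                      # V_uu - V_uo V_oo^{-1} V_ou
    return mean, cov

# Hypothetical inputs: 3 SNPs, 2 annotations (a baseline column plus one category).
sigma = np.array([[1.00, 0.50, 0.25],
                  [0.50, 1.00, 0.50],
                  [0.25, 0.50, 1.00]])
annot = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 0.0]])
z_obs = [2.0, 2.0]

mean_null, cov_null = fizi_impute(sigma, annot, [0.0, 0.0], [0, 2], [1], z_obs)
mean_func, cov_func = fizi_impute(sigma, annot, [0.5, 1.0], [0, 2], [1], z_obs)
print(mean_null, mean_func)
```

With all 𝝉 set to zero, V reduces to 𝚺 and the conditional mean is the ImpG estimate, which is the degenerate case described in the text.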
Inference of prior variance terms using LDSC and PolyFun
Under the same conditional normal framework, we obtain the prior variances by running
LDSC and PolyFun on the data. The goal of running PolyFun is to estimate the prior causal
probability for SNPs across the entire genome and hundreds of functional annotations.
The results indicate that the two approaches differ statistically.
LDSC regression is an approach that quantifies the contributions of polygenic effects and
confounding by examining the relationship between test statistics and linkage
disequilibrium (LD) [12,17,18]. The LD score regression intercept can be used to estimate
a powerful and accurate correction factor for confounding. Under a polygenic model, the
test statistic of a variant in association analysis is proportional to its LD with causal
variants, as measured by $R^2$. The expected chi-square statistic of variant $j$ is given by
$$ E[\chi^2 \mid \ell_j] = N \tau \ell_j + N a + 1, $$
where $N$ is the sample size; $M$ is the number of SNPs, and $\tau = h^2 / M$ is the average
heritability explained by each SNP; $a$ is the contribution of population structure and bias.
Lastly, $\ell_j = \sum_k R^2_{jk}$ is the LD score of variant $j$. In our study, we use LDSC
regression to estimate the prior variances on the UK Biobank dataset, and input the
unpenalized prior variances into the FIZI model for summary statistics imputation [24].
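To make the regression concrete, the sketch below recovers τ and a from the slope and intercept of the chi-square statistics regressed on LD scores. It is a simplification: ordinary unweighted least squares is used here, whereas the LDSC software fits a weighted regression, and the function name `ldsc_fit` and the noise-free toy data are assumptions of this example.

```python
import numpy as np

def ldsc_fit(chi2, ld_scores, n):
    """Fit E[chi2 | l_j] = N*tau*l_j + N*a + 1 by unweighted least squares."""
    ld_scores = np.asarray(ld_scores, dtype=float)
    design = np.column_stack([ld_scores, np.ones_like(ld_scores)])
    slope, intercept = np.linalg.lstsq(design, np.asarray(chi2, dtype=float), rcond=None)[0]
    tau = slope / n                 # slope is N * tau
    a = (intercept - 1.0) / n       # intercept is N * a + 1
    return tau, a

# Hypothetical noise-free data generated from the model itself.
n = 1000
true_tau, true_a = 2e-6, 1e-5
ld_scores = np.array([10.0, 50.0, 100.0, 200.0])
chi2 = n * true_tau * ld_scores + n * true_a + 1.0

tau_hat, a_hat = ldsc_fit(chi2, ld_scores, n)
print(tau_hat, a_hat)
```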
PolyFun is a computationally scalable framework to improve fine-mapping accuracy by
using genome-wide functional data for a broad set of coding, conserved, regulatory and
LD-related annotations [23]. For our work, PolyFun estimates the same $\tau$
parameters as LDSC, but uses a penalized estimator given by
$$ E[\chi^2 \mid \ell_j] = N \tau \ell_j + N a + 1 + \lambda \lVert \tau \rVert, $$
where $\lambda$ is a penalty term inferred by cross-validation. In our study, we use part of
PolyFun to estimate penalized prior variances, and fit them into the FIZI model for
summary statistics imputation [23].
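The effect of the penalty can be illustrated with a closed-form ridge fit over annotation-stratified LD scores. This sketch assumes the penalty is the L2 (ridge) penalty mentioned in the Methods overview and fixes the intercept at 1 (that is, a = 0); it does not reproduce PolyFun's own code, and `ridge_tau`, the toy LD scores, and the λ value are hypothetical.

```python
import numpy as np

def ridge_tau(chi2, strat_ld, n, lam):
    """Ridge estimate of per-annotation tau from stratified LD scores."""
    x = n * np.asarray(strat_ld, dtype=float)    # rows: SNPs, columns: annotations
    y = np.asarray(chi2, dtype=float) - 1.0      # intercept fixed at 1 (assumption)
    k = x.shape[1]
    # Closed-form ridge solution (X'X + lam*I)^{-1} X'y; lam = 0 gives least squares.
    return np.linalg.solve(x.T @ x + lam * np.eye(k), x.T @ y)

# Hypothetical stratified LD scores for 4 SNPs and 2 annotations.
strat_ld = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0],
                     [2.0, 1.0]])
tau_true = np.array([0.01, 0.02])
n = 100
chi2 = 1.0 + n * strat_ld @ tau_true

tau_ols = ridge_tau(chi2, strat_ld, n, lam=0.0)
tau_ridge = ridge_tau(chi2, strat_ld, n, lam=1e4)
print(tau_ols, tau_ridge)
```

In practice λ would be chosen by cross-validation, as the text states; a fixed value is used here only to show that the penalized estimate has a smaller norm than the unpenalized one.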
Data Analyzed
The dataset we used to demonstrate our idea is the UK Biobank. The UK Biobank
recruited 500,000 people aged 40 to 69 years in 2006-2010 from across the United Kingdom.
In our work, we attempt to establish a statistical model to impute GWAS summary
statistics by leveraging functional data. We use linkage disequilibrium to impute the
summary statistics at the unmeasured markers.
For real data analysis, we applied Functionally-informed Z-score Imputation (FIZI), a
software tool to impute summary statistics, to GWAS summary statistics gathered from
approximately 337k individuals of European ancestry in the UK Biobank for 27
complex traits.
Table 1. List of complex traits and disease analyzed.
Trait Name
blood_EOSINOPHIL_COUNT blood_MEAN_CORPUSCULAR_HEMOGLOBIN
blood_RBC_DISTRIB_WIDTH blood_RED_COUNT
blood_WHITE_COUNT bmd_HEEL_TSCOREz
body_BALDING1 body_BMIz
body_HEIGHTz body_WHRadjBMIz
bp_SYSTOLICadjMEDz cov_EDU_YEARS
cov_SMOKING_STATUS disease_AID_ALL
disease_ALLERGY_ECZEMA_DIAGNOSED disease_CARDIOVASCULAR
disease_HYPOTHYROIDISM_SELF_REP disease_RESPIRATORY_ENT
disease_T2D lung_FEV1FVCzSMOKE
lung_FVCzSMOKE mental_NEUROTICISM
other_MORNINGPERSON pigment_HAIR
pigment_SUNBURN repro_MENARCHE_AGE
repro_MENOPAUSE_AGE
Results
Prior variance estimates are different between LDSC and PolyFun
First, we characterize the prior variance estimates from LDSC and PolyFun using
GWAS results from 27 traits measured in the UK Biobank study (see Methods).
Among the 86 shared functional annotations, averaging across all 27 traits, we
found 3/86 to be significantly different (P < 0.05 / 86; see Figure 1). The estimates
from PolyFun are smaller on average than those from LDSC, which suggests that the
penalized model is shrinking estimates closer to 0. Next, looking across the 27 traits,
we find no traits to have significantly different estimates between LDSC and PolyFun
(P < 0.05 / 27; see Table 5).
Figure 1. Coefficients comparison in Annotations for LDSC vs PolyFun. There are two boxplots per annotation
indicating the performance of 𝝉 estimates for the two methods, LDSC and PolyFun. Results are across 27 traits
analyzed.
Here, to illustrate the inferential differences and similarities, we focus on two traits:
eosinophil count in whole blood (blood_EOSINOPHIL_COUNT) and waist-hip ratio adjusted
for BMI (body_WHRadjBMIz). For trait 1 (eosinophil count), the difference is not
statistically significant: the trend follows the reference line, and the majority of the
points fall within the gray 95% band. For this specific trait there is one potential
visual outlier. The reference line indicates a positive relationship between LDSC and
PolyFun, meaning that as the LDSC estimate increases, the PolyFun estimate also
increases.
Furthermore, for body_WHRadjBMIz, we find that a fair number of the points are scattered
outside the 95% band, with a couple of possible outliers. Among the 27 traits,
only 16 appear stable with respect to the reference line. With 95%
confidence, the 𝝉 estimates in the other 11 traits are statistically significantly
different between LDSC and PolyFun. Therefore, the results for this trait confirm the
existence of a statistical difference and imply that we can reject the null hypothesis
and conclude that the methods used to estimate 𝝉 differ significantly, which matters
when seeking to improve GWAS summary statistics imputation.
Figure 2. Estimates of 𝝉 parameters using LDSC and PolyFun for Blood_Eosinophil_Count. There are 86
points in the graph each representing the corresponding annotation. The x-axis measures the PolyFun 𝝉 estimate,
and the y-axis measures the LDSC 𝝉 estimates. The blue line corresponds to the best-fit linear regression, and the
gray shading corresponds to the 95% prediction confidence interval.
Figure 3. Estimates of 𝝉 parameters using LDSC and PolyFun for body_WHRadjBMIz. There are 86 points in
the graph each representing the corresponding annotation. The x-axis measures the PolyFun 𝝉 estimate, and the y-
axis measures the LDSC 𝝉 estimates. The blue line corresponds to the best-fit linear regression, and the gray
shading corresponds to the 95% prediction confidence interval.
To assess the association more precisely, we used boxplots to show the difference in
coefficients for the two methods. Figure 1 gathers the 86 shared annotations across 27
traits and compares the 𝝉 values for LDSC and PolyFun. Outliers are excluded to make
the plot clearer and show the actual difference. Among the 86 annotations, the
majority are very consistent between the two methods, but there is also some visible
noise indicating that the 𝝉 estimates are statistically significantly different. For
instance, for annotation Human_Promoter_Villar.flanking.500, the 𝝉 estimates differ
visibly in range between the two methods.
In addition, PolyFun produced one extreme negative value and two extreme positive
values, whereas LDSC produced only one extreme negative value. Since the
estimated coefficients differ significantly between LDSC and PolyFun, we expect that
fitting the 𝝉 values into the FIZI model will yield informative differences in the
performance of the imputed unmeasured markers. Because trait 1 and trait
10 show the two extremes of behavior in our data, we mainly
focus on trait 1 and trait 10 in the following analysis.
Imputed GWAS summary statistics by using FIZI with ImpG, LDSC, and PolyFun
To ascertain whether functional annotations improve GWAS summary statistics
imputation, we assess the performance of FIZI
using LDSC and PolyFun estimates of 𝝉 compared with the standard summary-statistic
imputation scheme (ImpG). Across 27 traits, we found that FIZI+LDSC was significantly
better than FIZI+PolyFun, with an average imputation accuracy of R² = 0.60 compared
with R² = 0.43 (P = 0.024), which suggests that penalized estimates do not improve
imputation quality. We also compared FIZI with ImpG, and found that ImpG had a mean
imputation accuracy of R² = 0.83. Together, these results suggest that functional
information does not improve the overall quality of summary statistics imputation.
Figure 4. Average R² value across 27 traits among 3 methods.
Imputation accuracy as a function of minor allele frequency
Having characterized the performance of FIZI+LDSC, FIZI+PolyFun, and ImpG, we next
investigated imputation accuracy as a function of minor allele frequency (MAF) [19]. To do
this, we partitioned imputation results at SNPs across 3 MAF bins ((0 – 5%], (5 – 25%],
(25% – 50%]). First, we compute the average R² per bin for each method, as well as
the standard error of the mean, and then compare the average R² values between methods
with a Z-test based on those standard errors. We test whether any pair of methods
differs significantly at a threshold of P = 0.05/3.
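The binning step can be sketched as follows. The thesis does not spell out how R² is computed, so this example assumes it is the squared Pearson correlation between true and imputed Z-scores; the function name `r2_by_maf_bin` and the toy data are hypothetical.

```python
import numpy as np

MAF_EDGES = [0.05, 0.25]  # boundaries between the bins (0, 5%], (5, 25%], (25, 50%]

def r2_by_maf_bin(maf, z_true, z_imputed):
    """Squared Pearson correlation of true vs imputed Z-scores within each MAF bin."""
    maf, z_true, z_imputed = (np.asarray(a, dtype=float) for a in (maf, z_true, z_imputed))
    # right=True assigns a SNP with MAF exactly 0.05 to the first bin, (0, 5%].
    bins = np.digitize(maf, MAF_EDGES, right=True)
    out = {}
    for b in range(3):
        mask = bins == b
        if mask.sum() < 3:                     # too few SNPs for a meaningful R²
            out[b] = float("nan")
            continue
        r = np.corrcoef(z_true[mask], z_imputed[mask])[0, 1]
        out[b] = float(r ** 2)
    return out

# Hypothetical data: 9 SNPs, 3 per bin.
maf = [0.01, 0.02, 0.05, 0.10, 0.20, 0.25, 0.30, 0.40, 0.50]
z_true = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
z_imp = [1.0, 2.0, 3.0, 3.0, 2.0, 1.0, 1.0, 3.0, 2.0]
r2 = r2_by_maf_bin(maf, z_true, z_imp)
print(r2)
```

Averaging these per-bin values across traits, together with their standard errors, gives the quantities compared in the Z-test.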
Table 2. MAF bins average R² values and reported P-value in comparison with 3 methods.
MAF bin | Mean R² PolyFun | Mean R² LDSC | Mean R² ImpG | P-value PolyFun vs LDSC | P-value PolyFun vs ImpG | P-value LDSC vs ImpG
[0-5%) | 0.27 | 0.36 | 0.55 | 0.514 | 0.0137 | 0.066
(5%-25%] | 0.5 | 0.73 | 0.92 | 0.177 | 0.00039 | 0.108
(25%-50%] | 0.55 | 0.81 | 0.95 | 0.12 | 0.0045 | 0.23
In general, average R² values tend to be larger in the higher MAF bins. After categorizing
by MAF bin, our conclusion stays the same: ImpG is the best method for imputation,
with the highest R² values across the 3 bins. Of the two methods for estimating prior
variances, LDSC is better, with higher R² values than PolyFun across the 3 bins. In
summary, we conclude that there is a statistically significant difference between PolyFun
and ImpG, with P-values less than 0.05/3 in all 3 MAF bins (P-value = 0.0137, 0.00039,
0.0045). ImpG and LDSC are the top two methods for GWAS summary statistics
imputation.
Figure 5. MAF [0 – 5%) Average R² value across 27 traits among 3 methods
Figure 6. MAF (5% - 25%] average R² value across 27 traits among 3 methods
Figure 7. MAF (25% - 50%] R² value across 27 traits among 3 methods
Imputation accuracy as a function of association magnitude
To evaluate imputation accuracy across the spectrum of true association signal, we
partitioned the results into groups by the true squared association statistics. We divide
them into three bins before computing R²: the first bin is (0, 15), the second is
(15, 30), and the third is (30, inf). This helps us test the agreement of the imputed
summary statistics with the observed summary statistics. We compute the average R² per
bin for each method, as well as the standard error of the mean. Using the average R² and
its standard error in a Z-test, we test whether any pair of methods differs significantly
at a threshold of P = 0.05/3.
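A minimal sketch of the comparison step is given below: a two-sided Z-test on the difference of two mean R² values using their standard errors, judged against the 0.05/3 threshold. The function name `two_sample_z_test` and the input numbers are illustrative assumptions and are not values from this study.

```python
import math

def two_sample_z_test(mean_a, se_a, mean_b, se_b):
    """Two-sided Z-test for a difference between two means with known standard errors."""
    z = (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Hypothetical mean R² values and standard errors for two methods in one bin.
z_stat, p_value = two_sample_z_test(0.60, 0.06, 0.40, 0.08)
threshold = 0.05 / 3          # correction for the three pairwise comparisons
significant = p_value < threshold
print(z_stat, p_value, significant)
```

Treating the two means as independent is a simplification, since all methods are evaluated on the same traits.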
Table 3. Average R² values and reported P-value in comparison with 3 methods
Bin | Mean R² PolyFun | Mean R² LDSC | Mean R² ImpG | P-value PolyFun vs LDSC | P-value PolyFun vs ImpG | P-value LDSC vs ImpG
(0,15) | 0.40 | 0.56 | 0.79 | 0.32 | 0.018 | 0.052
(15,30) | 0.78 | 0.95 | 0.98 | 0.22 | 0.15 | 0.77
(30,inf) | 0.85 | 0.95 | 0.96 | 0.49 | 0.41 | 0.91
We compare the mean R² values within each bin among the 3 methods and report 9
P-values. In summary, the mean R² obtained from LDSC and ImpG does not suggest a
significant difference in any of the three bins (P-values = 0.052, 0.77, 0.91). In
addition, the mean R² obtained from PolyFun and LDSC does not suggest a significant
difference in any of the three bins (P-values = 0.32, 0.22, 0.49). In bin 1, we found
that the mean R² is 0.40 for PolyFun and 0.79 for ImpG, which is marginally different
statistically (P-value = 0.018). Although LDSC appears to perform better than PolyFun,
with a higher mean R² across the three bins, we are unable to state that they are
statistically different (P-values = 0.32, 0.22, 0.49).
ImpG has the largest reported mean R² in each of the 3 bins: 0.79, 0.98, and 0.96,
respectively. Notice that in bin 3, LDSC and ImpG have similar mean R² values, 0.946 and
0.96 (P-value = 0.91). This suggests that, when we evaluate the mean R² by bin, LDSC is
comparable with ImpG as a function of association magnitude. Overall, our conclusion
stays the same: ImpG performs better than LDSC, which in turn performs better than
PolyFun. It is possible that when PolyFun computes the penalized tau estimates, the
procedure skips some of the information on the chromosome, which leads the imputed
statistics to deviate substantially from the original summary statistics, as reflected
in the reported R². In summary, we conclude that there is a statistically significant
difference between using the taus in FIZI and the other method for GWAS summary
statistics imputation, and that ImpG and LDSC are the top two methods for GWAS summary
statistics imputation.
Discussion
This study aims to find the most accurate approach to impute the unmeasured markers
in genome-wide summary statistics. Our hypothesis is that using the taus
from different methods will improve GWAS summary statistics
imputation. Note that the standard method to impute the unmeasured markers is
ImpG, where no prior variance is used. In our study, we pursue an alternative
approach that can impute summary statistics with accuracy close to or
better than ImpG. The two methods we introduce for estimating the taus are LDSC and
PolyFun, one with unpenalized coefficients and the other with penalized coefficients.
The dataset we used is the UK Biobank. In the analysis, we plot the association in
several different ways, and the results indicate that LDSC performs similarly
to ImpG. LDSC is therefore worth considering when seeking to improve GWAS summary
statistics imputation. However, PolyFun does not
perform well, as most of its R² values are relatively small, indicating that the
association between the imputed summary statistics and the original summary
statistics is weak. We derived the imputed results of the three methods by applying the
FIZI model.
From our analysis, one of the assumptions of the LDSC and ImpG is that the relational
statistics and the genotypes might be having similar correlational structures.
Unfortunately, this assumption may be true only for the cohorts with some homogenous
properties, but this might not be met when some form of relevant covariates exists
which in one way or the other might be considered as a confounder for the genotypes. A
good example is the occurrence of the ancestry principal components in studies that are
deemed unethical like those employing a mixed methodology design.
The results are quite surprising as PolyFun with the penalized 𝝉 are far off on the left
across 22 chromosomes, indicating that the imputed summary statistics have a fragile
association with the original summary statistics. When estimating the R squares, we are
hoping to see values that are relatively close to positive 1. Recall that when we were
contrasting the imputed taus between LDSC and PolyFun, trait1
Blood_Eosinophil_Count Taus is very stabilized, and the distribution follows the
regression reference line. The fact is, across the 22 chromosomes, PolyFun is
essentially performing better than LDSC, specifically for trait 1.
ImpG has the best performance among the three methods: it is stable, and each trait
has an R² value greater than or equal to 0.75 across all 22 chromosomes (Figure 4).
This could be because ImpG uses no prior variance in the imputation. Notably, ImpG can
reach an R² above 0.90, which is relatively high. Also, among the three approaches,
ImpG never has the smallest R² value for any trait on any chromosome; that is, the
summary statistics imputed by ImpG have either the highest or the second-highest R²
among our three methods. Therefore, from the R² results for the three methods, it is
prudent to conclude that the choice of taus and imputation method makes a
statistically significant difference to GWAS summary statistics imputation, and that
ImpG gives the most accurate imputation.
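The comparison above can be sketched in Python as follows. The sketch assumes that R² is the squared Pearson correlation between the original (masked) Z-scores and the imputed Z-scores on each chromosome; the method labels, the data layout, and the Z-scores are made up for illustration and are not thesis data.

```python
def r_squared(observed, imputed):
    """Squared Pearson correlation between two equal-length lists of Z-scores."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_i = sum(imputed) / n
    cov = sum((o - mean_o) * (i - mean_i) for o, i in zip(observed, imputed))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_i = sum((i - mean_i) ** 2 for i in imputed)
    if var_o == 0 or var_i == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov * cov / (var_o * var_i)


def r2_by_chromosome(z_observed, z_imputed_by_method):
    """Return {chromosome: {method: R^2}} for every method and chromosome."""
    table = {}
    for chrom, observed in z_observed.items():
        table[chrom] = {
            method: r_squared(observed, imputed[chrom])
            for method, imputed in z_imputed_by_method.items()
        }
    return table


def rank_methods(r2_for_chrom):
    """Sort method names from highest to lowest R^2 on one chromosome."""
    return sorted(r2_for_chrom, key=r2_for_chrom.get, reverse=True)


# Toy, made-up Z-scores for a single chromosome (hypothetical labels).
z_observed = {1: [2.1, -0.4, 3.3, 0.9, -1.7]}
z_imputed_by_method = {
    "ImpG": {1: [2.0, -0.5, 3.0, 1.0, -1.5]},
    "FIZI+LDSC": {1: [1.2, 0.3, 2.0, 1.5, -0.2]},
    "FIZI+PolyFun": {1: [0.5, 1.0, 0.8, -0.6, 0.4]},
}
r2 = r2_by_chromosome(z_observed, z_imputed_by_method)
print(rank_methods(r2[1]))
```

Ranking the methods per trait and per chromosome in this way is one simple way to check a statement such as "ImpG never has the smallest R²" directly from the R² table.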
More generally, compared with summary statistics-based imputation methods such as LDSC
and ImpG, genotype imputation methods offer researchers more flexibility.
Specifically, once haplotype phasing and genotype imputation have been performed for a
sample cohort, researchers can easily run different GWAS with various phenotypes and
covariates without repeating the imputation. Since most of the computationally
intensive work in summary statistics imputation is the estimation of the correlation
matrix, it is worth noting that a more developed version could soon be established
that simultaneously imputes summary statistics for several traits with minimal
additional effort.
Large reference panels such as the Haplotype Reference Consortium (HRC) [20,21] have
become increasingly available. With such large sample sizes, imputation procedures are
needed both for new studies and for previously published GWAS. Processing requirements
are expected to rise sharply as panel sizes increase.
With reference panels of this size, LDSC is expected to have much longer running
times than usual. Because most of the computationally intensive part of LDSC
imputation is the reference panel-based computation of the correlation matrix, it will
be necessary to develop a way to pre-compute the LD correlation matrices for different
genomic regions and each ethnic group and to store them in a database.
Hence, an LDSC implementation that relies on pre-computed LD correlation matrices is
expected to reduce the running time for the large reference panels that will be
produced in the future. This approach has a further advantage: the computational
burden of summary statistics imputation becomes practically independent of the sample
size of the reference panel. This invariance may be a very relevant feature,
especially as study sample sizes are expected to increase. Compared with the other
imputation methods we evaluated in this study, LDSC may produce somewhat conservative
results, owing to its lower imputation information and the small existing panels.
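The pre-computation idea can be illustrated with a minimal Python sketch. It assumes that the LD matrix of a region is the Pearson correlation between SNP dosages in the reference panel, and that matrices are keyed by population and genomic region; the class name, the key layout, and the toy genotypes are hypothetical, and an in-memory dictionary stands in for the database.

```python
import math


def ld_matrix(genotypes):
    """Pearson correlation between SNP columns of an individuals-by-SNPs
    dosage matrix (assumes every SNP varies in the panel)."""
    n = len(genotypes)
    p = len(genotypes[0])
    centered = []
    for j in range(p):
        col = [row[j] for row in genotypes]
        mean = sum(col) / n
        centered.append([g - mean for g in col])
    norms = [math.sqrt(sum(c * c for c in col)) for col in centered]
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(p)]
        for i in range(p)
    ]


class LDCache:
    """Keeps one LD matrix per (population, region) so that it is computed
    from the reference panel only once and reused afterwards."""

    def __init__(self, panels):
        self._panels = panels  # {(population, region): genotype matrix}
        self._store = {}       # stand-in for a database table
        self.computations = 0  # how many times LD was actually computed

    def get(self, population, region):
        key = (population, region)
        if key not in self._store:
            self._store[key] = ld_matrix(self._panels[key])
            self.computations += 1
        return self._store[key]


# Toy reference panel: 4 individuals x 2 SNPs (made-up dosages).
region = ("chr1", 0, 1_000_000)
panels = {("EUR", region): [[0, 1], [1, 1], [2, 0], [1, 2]]}
cache = LDCache(panels)
first = cache.get("EUR", region)
second = cache.get("EUR", region)  # served from the cache, no recomputation
print(cache.computations)
```

Once a matrix is stored, every later imputation run for that region and population only reads it back, so the cost of that step no longer depends on the number of individuals in the reference panel.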
References
1. Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR. Fast and
accurate genotype imputation in genome-wide association studies through pre-
phasing. Nat Genet. 2012. doi:10.1038/ng.2354.
2. Dahl A, Iotchkova V, Baud A, et al. A multiple-phenotype imputation method for
genetic studies. Nat Genet. 2016. doi:10.1038/ng.3513.
3. Cantor RM, Lange K, Sinsheimer JS. Prioritizing GWAS Results: A Review of
Statistical Methods and Recommendations for Their Application. Am J Hum
Genet. 2010. doi:10.1016/j.ajhg.2009.11.017.
4. Hormozdiari F, Kang EY, Bilow M, et al. Imputing Phenotypes for Genome-wide
Association Studies. Am J Hum Genet. 2016. doi:10.1016/j.ajhg.2016.04.013.
5. Nicolae DL. Testing untyped alleles (TUNA) - Applications to genome-wide
association studies. Genet Epidemiol. 2006. doi:10.1002/gepi.20182.
6. Gusev A, Ko A, Shi H, et al. Integrative approaches for large-scale
transcriptome-wide association studies. Nat Genet. 2016. doi:10.1038/ng.3506.
7. Tylee DS, Sun J, Hess JL, et al. Genetic correlations among psychiatric and
immune-related phenotypes based on genome-wide association data. Am J Med
Genet Part B Neuropsychiatr Genet. 2018. doi:10.1002/ajmg.b.32652.
8. UK Biobank. About UK Biobank. http://www.ukbiobank.ac.uk.
9. UK Biobank. Dent Abstr. 2013. doi:10.1016/j.denabs.2012.04.049.
10. Bycroft C, Freeman C, Petkova D, et al. The UK Biobank resource with deep
phenotyping and genomic data. Nature. 2018. doi:10.1038/s41586-018-0579-z.
11. Calabrese B. Linkage disequilibrium. In: Encyclopedia of Bioinformatics and
Computational Biology: ABC of Bioinformatics. 2018.
doi:10.1016/B978-0-12-809633-8.20234-3.
12. Gazal S, Marquez-Luna C, Finucane HK, Price AL. Reconciling S-LDSC and
LDAK functional enrichment estimates. Nat Genet. 2019. doi:10.1038/s41588-
019-0464-1.
13. Reich DE, Cargill M, Bolk S, et al. Linkage disequilibrium in the human genome.
Nature. 2001. doi:10.1038/35075590.
14. Weissbrod O, Hormozdiari F, Benner C, et al. Functionally-informed fine-mapping
and polygenic localization of complex trait heritability. bioRxiv. 2019.
doi:10.1101/807792.
15. Pasaniuc B, Zaitlen N, Shi H, et al. Fast and accurate imputation of summary
statistics enhances evidence of functional enrichment. Bioinformatics. 2014.
doi:10.1093/bioinformatics/btu416.
16. Lee D, Bigdeli TB, Riley BP, Fanous AH, Bacanu SA. DIST: Direct imputation of
summary statistics for unmeasured SNPs. Bioinformatics. 2013.
doi:10.1093/bioinformatics/btt500.
17. Ni G, Moser G, Ripke S, et al. Estimation of Genetic Correlation via Linkage
Disequilibrium Score Regression and Genomic Restricted Maximum Likelihood.
Am J Hum Genet. 2018. doi:10.1016/j.ajhg.2018.03.021.
18. de Vlaming R, Johannesson M, Magnusson PKE, Ikram MA, Visscher PM.
Equivalence of LD-Score Regression and Individual-Level-Data Methods. bioRxiv.
2017. doi:10.1101/211821.
19. Wen X, Stephens M. Using linear predictors to impute allele frequencies from
summary or pooled genotype data. Ann Appl Stat. 2010. doi:10.1214/10-
AOAS338.
20. Huang L, Jakobsson M, Pemberton TJ, et al. Haplotype variation and genotype
imputation in African populations. Genet Epidemiol. 2011.
doi:10.1002/gepi.20626.
21. Loh PR, Danecek P, Palamara PF, et al. Reference-based phasing using the
Haplotype Reference Consortium panel. Nat Genet. 2016. doi:10.1038/ng.3679.
22. Bogdanlab. “Bogdanlab/Fizi.” GitHub, github.com/bogdanlab/fizi.
23. Omerwe. “Omerwe/Polyfun.” GitHub, 10 Mar. 2020, github.com/omerwe/polyfun.
24. Bulik. “Bulik/Ldsc.” GitHub, 16 Feb. 2020, github.com/bulik/ldsc.
Appendix
Table 4. 86 Annotations taus comparison with P_value difference.
Annotation Avg estimate (LDSC) Avg estimate (PolyFun) p-value (no difference)
Ancient_Sequence_Age_Human_Enhancer 2.09E-07 1.95E-07 6.00E-01
Ancient_Sequence_Age_Human_Enhancer.flanking.500 -7.11E-08 -6.64E-08 7.24E-01
Ancient_Sequence_Age_Human_Promoter 3.62E-07 3.47E-07 5.91E-01
Ancient_Sequence_Age_Human_Promoter.flanking.500 1.80E-07 2.17E-07 2.96E-01
Backgrd_Selection_Stat 1.56E-08 1.75E-08 2.98E-01
BivFlnk 3.35E-08 4.88E-08 1.90E-01
BivFlnk.flanking.500 2.55E-08 -6.11E-09 1.98E-02
BLUEPRINT_DNA_methylation_MaxCPP 2.51E-08 4.43E-08 1.06E-01
BLUEPRINT_H3K27acQTL_MaxCPP 1.13E-07 1.30E-07 1.21E-01
BLUEPRINT_H3K4me1QTL_MaxCPP 5.85E-08 9.93E-08 4.14E-03
Coding_UCSC 2.90E-08 3.75E-08 6.65E-01
Coding_UCSC.flanking.500 -3.36E-08 -3.89E-08 4.25E-01
Conserved_LindbladToh 8.10E-08 4.33E-08 2.89E-01
Conserved_LindbladToh.flanking.500 -1.51E-08 -1.85E-08 2.31E-01
Conserved_Mammal_phastCons46way 1.01E-07 1.57E-07 2.98E-01
Conserved_Mammal_phastCons46way.flanking.500 4.25E-08 4.64E-08 5.62E-01
Conserved_Primate_phastCons46way 1.76E-07 1.10E-07 1.91E-02
Conserved_Primate_phastCons46way.flanking.500 3.21E-08 3.48E-08 6.04E-01
Conserved_Vertebrate_phastCons46way -3.35E-09 1.70E-08 5.80E-02
Conserved_Vertebrate_phastCons46way.flanking.500 -4.47E-08 -3.96E-08 4.01E-01
CpG_Content_50kb 1.12E-06 1.19E-06 6.20E-01
CTCF_Hoffman -3.45E-08 -5.36E-08 4.22E-02
CTCF_Hoffman.flanking.500 -2.66E-08 -2.48E-08 7.53E-01
DGF_ENCODE 5.46E-08 5.81E-08 5.36E-01
DGF_ENCODE.flanking.500 6.28E-09 1.01E-08 3.38E-01
DHS_peaks_Trynka 1.51E-08 -8.21E-09 1.40E-01
DHS_Trynka -4.00E-08 -3.31E-08 3.36E-01
DHS_Trynka.flanking.500 -3.66E-11 -5.37E-09 2.30E-01
Enhancer_Andersson -1.47E-08 -6.57E-08 9.73E-02
Enhancer_Andersson.flanking.500 4.09E-08 6.24E-08 8.89E-02
Enhancer_Hoffman 2.06E-08 1.91E-08 8.37E-01
Enhancer_Hoffman.flanking.500 1.86E-08 2.48E-08 3.37E-01
FetalDHS_Trynka 1.48E-08 5.56E-09 4.01E-01
FetalDHS_Trynka.flanking.500 -4.13E-09 -3.71E-09 9.11E-01
GERP.NS 5.12E-09 3.23E-09 5.23E-03
GERP.RSsup4 -2.72E-08 5.08E-08 8.14E-03
GTEx_eQTL_MaxCPP 1.24E-07 1.86E-07 8.44E-03
H3K27ac_Hnisz 5.78E-09 4.77E-09 5.39E-01
H3K27ac_Hnisz.flanking.500 1.07E-08 8.99E-09 8.39E-01
H3K27ac_PGC2 1.57E-08 1.44E-08 6.17E-01
H3K27ac_PGC2.flanking.500 -1.50E-08 -3.73E-09 3.83E-02
H3K4me1_peaks_Trynka -1.81E-08 2.40E-09 7.84E-03
H3K4me1_Trynka 1.66E-08 2.71E-09 4.24E-04
H3K4me1_Trynka.flanking.500 -1.32E-08 -1.52E-08 5.96E-01
H3K4me3_peaks_Trynka -8.25E-08 -5.54E-08 1.10E-01
H3K4me3_Trynka 3.97E-08 3.92E-08 9.11E-01
H3K4me3_Trynka.flanking.500 1.40E-08 6.02E-09 6.21E-02
H3K9ac_peaks_Trynka -2.88E-08 -3.21E-08 8.42E-01
H3K9ac_Trynka 2.90E-09 9.01E-09 1.79E-01
H3K9ac_Trynka.flanking.500 -2.00E-08 -1.69E-08 6.76E-01
Human_Enhancer_Villar -5.58E-08 -4.98E-08 6.01E-01
Human_Enhancer_Villar_Species_Enhancer_Count 1.71E-08 1.66E-08 8.14E-01
Human_Enhancer_Villar.flanking.500 1.01E-07 4.79E-08 1.15E-01
Human_Promoter_Villar -6.67E-08 -5.65E-08 5.73E-01
Human_Promoter_Villar_ExAC 2.20E-07 1.79E-07 1.82E-01
Human_Promoter_Villar_ExAC.flanking.500 1.77E-08 2.02E-07 1.23E-01
Human_Promoter_Villar.flanking.500 -1.55E-08 -9.36E-08 1.05E-01
Intron_UCSC -3.49E-09 -2.81E-09 1.29E-01
Intron_UCSC.flanking.500 6.68E-09 5.18E-08 7.88E-02
MAF_Adj_ASMC -5.30E-09 -7.87E-09 1.57E-03
MAF_Adj_LLD_AFR -3.57E-09 -5.76E-09 1.47E-01
MAF_Adj_Predicted_Allele_Age -5.18E-09 -4.24E-09 1.98E-01
non_synonymous 8.94E-08 1.30E-07 4.24E-01
Nucleotide_Diversity_10kb -1.75E-09 -1.21E-09 1.54E-01
Promoter_UCSC -2.53E-08 -4.37E-08 5.31E-02
Promoter_UCSC.flanking.500 -1.78E-09 2.12E-08 1.58E-01
PromoterFlanking_Hoffman -7.86E-08 -1.23E-07 5.03E-03
PromoterFlanking_Hoffman.flanking.500 -7.49E-08 -6.75E-08 4.36E-01
Recomb_Rate_10kb -8.32E-10 -2.73E-09 4.76E-07
Repressed_Hoffman 8.99E-09 6.35E-09 3.09E-02
Repressed_Hoffman.flanking.500 4.08E-09 3.91E-09 9.49E-01
SuperEnhancer_Hnisz -2.37E-10 4.39E-09 1.27E-04
SuperEnhancer_Hnisz.flanking.500 2.18E-08 -2.47E-08 2.57E-01
synonymous -2.96E-07 -3.73E-07 1.88E-01
TFBS_ENCODE 3.37E-08 3.23E-08 8.23E-01
TFBS_ENCODE.flanking.500 -1.29E-08 -8.59E-09 3.99E-01
Transcr_Hoffman 3.34E-09 4.86E-09 3.54E-01
Transcr_Hoffman.flanking.500 -9.15E-09 -7.45E-09 9.33E-02
TSS_Hoffman 1.30E-07 8.12E-08 2.41E-03
TSS_Hoffman.flanking.500 -1.74E-08 1.34E-08 3.69E-02
UTR_3_UCSC 2.96E-08 5.87E-09 1.76E-01
UTR_3_UCSC.flanking.500 1.26E-08 3.04E-09 4.17E-01
UTR_5_UCSC -3.04E-08 -7.19E-08 1.72E-02
UTR_5_UCSC.flanking.500 -9.01E-09 1.35E-08 1.23E-01
WeakEnhancer_Hoffman -1.62E-08 -2.03E-08 6.13E-01
WeakEnhancer_Hoffman.flanking.500 -2.56E-08 -3.42E-08 1.23E-01
Table 5. Taus comparison by LDSC and PolyFun per trait.
trait LDSC PolyFun Paired t-test p-value
blood_EOSINOPHIL_COUNT 2.39E-08 3.18E-08 5.34E-01
blood_MEAN_CORPUSCULAR_HEMOGLOBIN 6.27E-08 7.41E-08 7.44E-01
blood_RBC_DISTRIB_WIDTH 5.72E-08 5.59E-08 9.30E-01
blood_RED_COUNT 5.57E-08 6.70E-08 5.76E-01
blood_WHITE_COUNT 3.16E-08 3.18E-08 9.87E-01
bmd_HEEL_TSCOREz 3.05E-08 3.02E-08 9.81E-01
body_BALDING1 4.53E-08 3.93E-08 6.26E-01
body_BMIz 5.27E-08 4.50E-08 4.80E-01
body_HEIGHTz 1.03E-07 8.88E-08 7.54E-01
body_WHRadjBMIz 3.02E-08 2.25E-08 4.00E-01
bp_SYSTOLICadjMEDz 2.00E-08 1.77E-08 7.08E-01
cov_EDU_YEARS 2.58E-08 2.31E-08 4.86E-01
cov_SMOKING_STATUS 1.43E-08 1.36E-08 7.64E-01
disease_AID_ALL 1.09E-08 1.10E-08 9.78E-01
disease_ALLERGY_ECZEMA_DIAGNOSED 1.00E-08 1.01E-08 9.81E-01
disease_CARDIOVASCULAR 2.47E-08 1.50E-08 1.55E-01
disease_HYPOTHYROIDISM_SELF_REP 1.21E-08 1.34E-08 7.55E-01
disease_RESPIRATORY_ENT 6.75E-09 7.71E-09 6.64E-01
disease_T2D 7.66E-09 6.19E-09 6.69E-01
lung_FEV1FVCzSMOKE 1.57E-08 1.46E-08 9.01E-01
lung_FVCzSMOKE 2.43E-08 2.37E-08 9.38E-01
mental_NEUROTICISM 1.87E-08 1.75E-08 7.46E-01
other_MORNINGPERSON 1.37E-08 8.32E-09 2.08E-01
pigment_HAIR 1.20E-08 4.81E-08 1.39E-01
pigment_SUNBURN 5.55E-09 1.90E-08 1.80E-01
repro_MENARCHE_AGE 3.75E-08 4.37E-08 3.52E-01
repro_MENOPAUSE_AGE 2.02E-08 2.71E-08 4.41E-01
Abstract
Multiple recent works have demonstrated the benefit of imputing unmeasured genome-wide association study (GWAS) summary statistics directly using only publicly available genotype reference panels. Summary-statistic imputation is both computationally efficient and highly accurate. Imputation in the summary statistic scenario is conducted under the assumption of the null model, which can lead to a loss of statistical power at regions with significant associations. Independently, multiple lines of evidence have shown that incorporating known functional information about genetic variants boosts statistical power. In this project, we investigate the performance of a random-effects model to impute summary statistics with functional information in addition to LD estimates. Our model, Functionally Informed Z-score Imputation (FIZI), imputes GWAS summary statistics (Z-scores) at unmeasured markers by leveraging LD and functional annotations describing the prior variance of random single-nucleotide polymorphism (SNP) effects. Imputation in the random-effects regime requires knowing the variance parameters for each functional category, and it is unclear how best to perform estimation prior to imputation. Here, we investigate the performance of FIZI on real data using two methods for variance parameter inference: LD score regression (LDSC) and a penalized regression model, PolyFun. We compare our approach with the fixed-effect model ImpG. We found no significant difference in prior variance estimates between LDSC and PolyFun for 24 out of the 27 traits using a paired t-test. We found that LDSC outperforms PolyFun, with an average R² = 0.60 compared with an average R² = 0.43.
For a closer look, we grouped the summary statistics by the underlying true Z² statistic into three bins: bin 1 is [0, 10], bin 2 is [10, 30], and bin 3 is [30, inf). For bin [30, inf), the distributions for the three methods are more concentrated near the 1.0 boundary than in the other two bins. This indicates that, in the highest bin, the imputed summary statistics from all three methods agree well with the original summary statistics. In general, we find that ImpG outperforms functionally-informed approaches for summary statistics imputation in our study.
Asset Metadata
Creator
Chen, Jingyang (author)
Core Title
Improving the power of GWAS Z-score imputation by leveraging functional data
School
Keck School of Medicine
Degree
Master of Science
Degree Program
Biostatistics
Publication Date
05/07/2020
Defense Date
05/05/2020
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
annotations,FIZI,functional data,genetic,GWAS,ImpG,imputation,LDSC,OAI-PMH Harvest,PolyFun,single-nucleotide polymorphisms,summary statistics,UK Biobank,Z-score
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Mancuso, Nicholas (committee chair), Conti, David (committee member), Millstein, Joshua (committee member)
Creator Email
jchen429@usc.edu,peterchen3366@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-299226
Unique identifier
UC11665856
Identifier
etd-ChenJingya-8441.pdf (filename),usctheses-c89-299226 (legacy record id)
Legacy Identifier
etd-ChenJingya-8441.pdf
Dmrecord
299226
Document Type
Thesis
Rights
Chen, Jingyang
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA