Hierarchical approaches for joint analysis of marginal summary statistics
By
Lai Jiang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree of
DOCTOR OF PHILOSOPHY
(Biostatistics)
May 2021
Copyright 2021 Lai Jiang
Acknowledgements
The last three and a half years at USC have been a pleasant and fulfilling experience, and it
would not have been possible without the support and guidance that I received from many
people, who so generously contributed to my research presented in this dissertation.
First and foremost, I would like to express my deepest gratitude to my enthusiastic
advisor, Dr. David V. Conti, for his invaluable guidance, persistent support and tremendous
mentorship. His immense knowledge and constant feedback have always inspired me and guided
me through my Ph.D. research journey. On a personal level, David has been very supportive, and
I am deeply grateful for that: he encouraged me to explore the world and to pursue wonderful
opportunities. David is the most energetic advisor and one of the smartest people that I know. He
is the best advisor that one could ever ask for. Meanwhile, I am very thankful to my dissertation
committee members, Dr. Nicholas Mancuso, Dr. William Gauderman, Dr. Juan Pablo Lewinger
and Dr. Mark Chaisson for their precious time, valuable suggestions and contribution in the
development of this dissertation. Special thanks to Dr. Duncan Thomas, who has been supportive
and offered many insightful discussions and much advice in the early stage of my Ph.D.
The contents of Chapters 2 and 3 have been accepted by the American Journal of
Epidemiology, the contents of Chapter 4 are in preparation for submission, and the R package
in Chapter 5 has been published on CRAN. Thanks to my co-authors, Dr. David V. Conti, Dr.
Nicholas Mancuso, Dr. Shujing Xu, and Dr. Paul J. Newcombe, for their valuable feedback and
suggestions for the manuscripts. I would also like to thank Dr. Abigail Horn and Dr. David V.
Conti for their support and our collaboration on the USC COVID-19 project.
Many thanks to my friends in the biostatistics program for their help and support and all
the wonderful moments that we have spent together at Soto. I am also very grateful to my friends
outside of the program who have supported, encouraged and accompanied me. Their friendship
made me laugh and kept me happy.
Last but not least, I would like to sincerely thank my parents, Zheng Li and Wen Jiang,
for their unconditional love and unbelievable support. I have learned a lot from my parents: my
mom taught me how to be a smart learner and be a strong and confident person, and my dad
taught me how to be a thoughtful friend and to be always optimistic and enthusiastic about life.
Thanks to them for always being there to video chat with me whenever I needed a talk or just
some good words to push myself forward. To my beloved husband, Aozhou, thank you for
appearing in my life and accompanying me all the time with your unwavering support and
encouragement. Thank you for always being there to be my cheerleader whenever I felt anxious
or depressed. I would never have finished my Ph.D. without your tremendous support and love.
Table of Contents
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1. Introduction
1.1. Instrumental Variable Analysis
1.2. Mendelian Randomization
1.2.1. Existing Mendelian Randomization Approaches
1.2.2. Limitations of Mendelian Randomization
1.3. Transcriptome-Wide Association Study
1.4. A Glimpse of Our Contribution
Chapter 2. hJAM: A Hierarchical Joint Analysis of Marginal Summary Statistics
2.1. Motivation
2.2. Methods
2.2.1. Unify the Framework of Mendelian Randomization and TWAS
2.2.2. Hierarchical JAM (hJAM) Model
2.2.3. Required Data Sources for hJAM Implementation
2.2.4. Methods as Comparisons
2.3. Simulation Studies
2.3.1. Simulation Settings
2.3.2. Simulation Results
2.4. Data Applications
2.4.1. Causal Effect of BMI and T2D on Myocardial Infarction
2.4.2. Causal Effect of PM20D1 and NUCKS1 on Prostate Cancer Risk
2.5. Discussion
Appendix 2A. Supplementary Methods
2A.1. Joint Analysis of Marginal Summary Statistics (JAM)
2A.2. Theoretical justifications of hJAM and the competing approaches
Appendix 2B. Supplementary Tables and Figures
2B.1. Supplementary Tables
2B.2. Supplementary Figures
Chapter 3. hJAM Egger: A Hierarchical Joint Analysis of Marginal Summary Statistics with Egger Regression
3.1. Motivation
3.2. Methods
3.2.1. Egger Test and the Application in Mendelian Randomization
3.2.2. Hierarchical JAM with Egger Regression (hJAM Egger)
3.3. Simulation Studies
3.3.1. Simulation Settings
3.3.2. Simulation Results
3.4. Data Application
3.5. Discussion
Chapter 4. SHA-JAM: A Scalable Hierarchical Approach to Joint Analysis for Marginal Summary Omics Data
4.1. Motivation
4.2. Methods
4.2.1. Recap of hJAM
4.2.2. Sum of Single Effect model (SuSiE)
4.2.3. SHA-JAM: Scalable Hierarchical Approach for Joint Analysis of Marginal summary data
4.2.2. Regularized hJAM
4.2.3. Composing A matrix
4.3. Simulation Studies
4.3.1. Constructing A matrix for simulations
4.3.2. Simulation Settings
4.3.2. Simulation Results
4.4. Data Applications
4.4.1. Selecting Causal Metabolites for Prostate Cancer with Metabolomics Summary Data
4.4.2. Selecting Causal Genes for Prostate Cancer with Transcriptomics Summary Data
4.5. Discussion
Appendix 4A. Supplementary Methods
4A.1. Hierarchical Regularized Regression
4A.2. Priority Pruner
Appendix 4B. Supplementary Tables and Figures
4B.1. Supplementary Tables
4B.2. Supplementary Figures
Chapter 5. R Package for Implementation: hJAM
5.1. Overview
5.1.1. Motivation
5.1.2. Data Required and Structure Examples
5.1.3. Data Available
5.1.4. Functions Available
5.2. Package Usage
5.2.1. Get Started
5.2.2. Implementation of hJAM
5.2.3. Implementation of SHA-JAM
5.3. Conclusion
Chapter 6. Conclusions
6.1. Summary of Our Contributions
6.2. Future Directions
References
List of Tables
Table 1.1 Other limitations of MR approaches and solutions for the problems.
Table 2.1 The annotations of the three data sources for summary statistics employed in hJAM.
Table 2.2 Theoretical comparisons between hJAM, competing MR and TWAS approaches.
Table 2.3 Causal odds ratios (95% confidence interval) for myocardial infarction per unit in body mass index and having type 2 diabetes.
Supplementary Table 2B.1 The estimate and its standard error of simulation scenario A: independent X’s.
Supplementary Table 2B.2 The estimate and its standard error of simulation scenario B: correlated X’s.
Supplementary Table 2B.3 The estimate and its standard error of simulation scenario C: $X_1$ causes $X_2$.
Supplementary Table 2B.4 The estimate and its standard error of simulation scenario D: $X_1$ causes $X_2$ and $X_1$ and $X_2$ are correlated.
Supplementary Table 2B.5 Four correlated pairs of SNPs in the instrument sets of BMI and type 2 diabetes.
Table 3.1 Simulation results for hJAM Egger, multivariable MR Egger, and MR Egger for different scenarios with independent intermediates and independent SNPs.
Table 3.2 Simulation results for hJAM Egger, multivariable MR Egger, and MR Egger for different scenarios with correlated intermediates and independent SNPs.
Table 3.3 Estimates (95% confidence interval) of the intercepts on estimating the effects of body mass index and having type 2 diabetes on the risk of myocardial infarction.
Algorithm 1: Pseudo algorithm for fitting SHA-JAM with iterative Bayesian stepwise selection (IBSS).
Table 4.1 Average number of false positives identified by SHA-JAM, elastic net hJAM and MR-BMA across 600 replicates in the simulation scenario where no causal intermediates exist.
Table 4.2 Mean-squared errors (MSE) of the estimates from three selection algorithms with the best-performing $\boldsymbol{A}$ matrix across 600 simulation replicates for different linkage disequilibrium (LD) structures and scenarios with different maximum correlation coefficient between the intermediates.
Table 4.3 Average runtime (seconds) of different algorithms across 400 simulation replicates for SHA-JAM, elastic net hJAM and MR-BMA with 500 iterations in shotgun stochastic search.
Table 4.4 Selecting causal metabolites on the risk of prostate cancer with and without potential influential genetic variants ($K = 144$ and $K = 140$, respectively).
Table 4.5 Selected candidate genes with strong effect on the prostate cancer risk by SHA-JAM (credible sets that contain only one gene).
Supplementary Table 4B.1 Absolute bias of the estimates from three selection algorithms with the best-performing $\boldsymbol{A}$ matrix across 600 simulation replicates for different linkage disequilibrium (LD) structures and scenarios with different maximum correlation coefficient between the intermediates.
Supplementary Table 4B.2 Mean-squared errors of the estimates in different $\boldsymbol{A}$ matrix construction, averaged across 500 replicates.
Supplementary Table 4B.3 Description of the candidate metabolites in estimating the causal association between metabolites and the risk of prostate cancer.
Supplementary Table 4B.4 Top best models from MR-BMA in the influential-variants-excluded analysis ($K = 140$) in estimating the causal effects of metabolites on the risk of prostate cancer.
Supplementary Table 4B.5 The candidate $\boldsymbol{A}$ matrix for each chromosome and the credible sets identified by SuSiE hJAM in selecting the causal gene for the risk of prostate cancer.
Supplementary Table 4B.6 Credible sets that contain two genes, identified by SuSiE hJAM for selecting the causal genes of the risk of prostate cancer.
Table 5.1 Data structure example for $\boldsymbol{A} \in \mathbb{R}^{K \times M}$ matrix.
Table 5.2 Descriptions of the MI.Rdata in hJAM R package.
Table 5.3 Descriptions of the PrCa.lipids.Rdata in hJAM R package.
Table 5.4 Descriptions of the GTEx.PrCa.Rdata in hJAM R package.
List of Figures
Figure 1.1 Schematic of instrumental variable analysis.
Figure 1.2 Schematic of multivariable Mendelian randomization.
Figure 1.3 Schematic of the FUSION approach.
Figure 1.4 Schematic of the PrediXcan approach.
Figure 1.5 Schematic of summary-PrediXcan.
Figure 2.1 Simulation scenarios of different relationships between X’s.
Figure 2.2 Empirical Power of the correlated SNPs scenarios across 1000 replications.
Figure 2.3 Average estimates and 95% confidence intervals of the correlated SNPs scenarios across 1000 replications.
Figure 2.4 Directed acyclic graph (DAG) of the relationship between body mass index (BMI), type 2 diabetes (T2D) and myocardial infarction.
Figure 2.5 Causal odds ratios (95% confidence interval) for prostate cancer risk per unit increasing in gene expression reads.
Supplementary Figure 2B.1 Correlation coefficient matrix and data structure of $\boldsymbol{G}_X$.
Supplementary Figure 2B.2 Simulation results for evaluating the robustness of hJAM and MVIVW MR on different linkage disequilibrium (LD) structures of $\boldsymbol{G}_L$.
Supplementary Figure 2B.3 Empirical Power of the correlated SNPs scenarios across 1000 replications with $R^2_{G,X} = 0.05$ in the simulation setting.
Supplementary Figure 2B.4 Average estimates and 95% confidence intervals of the correlated SNPs scenarios across 1000 replications with $R^2_{G,X} = 0.05$ in the simulation setting.
Supplementary Figure 2B.5 Empirical Power of the correlated SNPs scenarios across 1000 replications with adding an unknown confounder between $X$ and $Y$ in the simulation setting.
Supplementary Figure 2B.6 Average estimates and 95% confidence intervals of the correlated SNPs scenarios across 1000 replications with adding an unknown confounder between $X$ and $Y$ in the simulation setting.
Supplementary Figure 2B.7 Scatter plots of the $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ of the significant SNPs that are selected from the genome-wide association studies for (A) body mass index and (B) type 2 diabetes, respectively.
Figure 3.1 Directed acyclic graph (DAG) of the pleiotropy effect.
Figure 3.2 Illustration of the MR Egger regression.
Figure 4.1 Directed acyclic graph (DAG) for the high-throughput intermediates data in an instrumental variable analysis framework.
Figure 4.2 Comparing three selection algorithms with the best-performing $\boldsymbol{A}$ matrix by area under curve (AUC) across 600 simulation replicates for three linkage disequilibrium (LD) structures with correlated or independent intermediates.
Supplementary Figure 4A.1 Illustration of the estimates from the LASSO (left) and ridge regression (right).
Supplementary Figure 4A.2 Geometry of the elastic net, LASSO, and Ridge regression.
Supplementary Figure 4B.1 Procedures in composing the $\boldsymbol{A}$ matrix.
Supplementary Figure 4B.2 Linkage disequilibrium (LD) structure of the individual genotype data set with $r_{\text{within block}} = 0.8$.
Supplementary Figure 4B.3 Correlation structure of the individual intermediates data with $\max(r_X) = 0.6$.
Supplementary Figure 4B.4 An illustrative example of the $\log(\lambda)$ vs. mean-squared error in an elastic net glmnet fit.
Supplementary Figure 4B.5 Comparing three selection algorithms with different $\boldsymbol{A}$ matrix by area under curve (AUC) of receiver operating characteristic (ROC) curve across 600 simulation replicates for three linkage disequilibrium (LD) structures with the maximum correlation coefficient between intermediates equal to 0.6.
Supplementary Figure 4B.6 Comparing the performance of SHA-JAM with $\boldsymbol{A}$ matrix from data of different sample sizes by area under curve (AUC) of receiver operating characteristic (ROC) curve across 600 simulation replicates for three linkage disequilibrium (LD) structures with independent intermediates.
Supplementary Figure 4B.7 Diagnosis plots of Cook’s distance and q-statistics.
Supplementary Figure 4B.8 Comparisons of the coefficients and inclusion probabilities with and without the influential genetic variants for SHA-hJAM and MR-BMA.
Figure 5.1 Sample output from the SNPs_heatmap function in R package hJAM.
Figure 6.1 Flowchart of the hJAM family approaches and the application examples.
Abstract
Over the last decade, genome-wide association studies (GWAS) have not only linked
thousands of genetic variants across the genome to complex traits but also stimulated many
new methodologies that leverage GWAS summary statistics to explore the biological mechanisms
underlying the genetic trait associations. For example, previous research has demonstrated the
usefulness of hierarchical modeling for incorporating a flexible array of prior information in
genetic association studies. When this prior information consists of effect estimates from
association analyses of genetic variants with a modifiable risk factor or with gene expression, the
hierarchical model is equivalent to a Mendelian randomization or a Transcriptome-Wide
Association Study (TWAS) analysis, respectively.
In this dissertation, we propose novel methods to incorporate the prior information from
summary data and test the causal effects of intermediates on the outcome trait in observational
studies. The intermediates can be modifiable risk factors, gene expression data, metabolite data,
and other omic data. Additionally, we discuss different ways to compose the prior information
with various data types. The composition of the prior information has been shown to be critical
for valid inference.
This dissertation is structured as follows. Chapter 1 gives a brief introduction to
instrumental variable analysis, Mendelian randomization, and TWAS. Chapter 2 first shows that
both Mendelian randomization and TWAS are forms of instrumental variable analysis with the
genetic variants as the instruments. I then propose a hierarchical joint analysis of marginal
summary data, hJAM, which is designed to incorporate prior information via a hierarchical
model and test the causal effect of the risk factors or genes on the outcome trait. The use of
appropriate effect estimates as prior information yields an analysis similar to Mendelian
randomization and TWAS. Chapter 3 describes a natural extension of hJAM that integrates
Egger regression, hJAM Egger, to account for the bias introduced by invalid
instruments. Chapter 4 designs a scalable hierarchical approach for joint analysis of marginal
data, SHA-JAM, for variable selection in an hJAM setting with high-throughput experiments, such
as omic data. Chapter 5 presents the R package, hJAM, that we developed to
implement the methods proposed in this dissertation. Finally, Chapter 6 summarizes this
dissertation.
Chapter 1.
Introduction
Over the last decade, the development of high-throughput genotyping and next-generation
sequencing platforms has accelerated the growth of genome-wide association studies
(GWAS). GWAS have successfully identified a large number of susceptibility loci or
genomic regions for many complex diseases (Buniello et al., 2019). As an additional benefit of
the rapid development of GWAS, many methodologies have evolved over the last decade,
including Mendelian randomization (Burgess et al., 2013) and Transcriptome-Wide Association
Studies (TWAS) (A. Gusev et al., 2016). Mendelian randomization employs genetic variants to
estimate the causal associations between risk factors and the trait. TWAS imputes
gene expression levels by using expression quantitative trait loci (eQTLs) and investigates the effects
of imputed gene expression on the trait. Despite the different research questions asked by
Mendelian randomization and TWAS, both are essentially instrumental variable analyses from a
statistical point of view.
1.1. Instrumental Variable Analysis
The instrumental variable (IV) technique has been widely used in causal inference and
outcomes research to address potential confounding and reverse causality in observational
studies. An IV can be seen as a pseudo-randomization factor in the study (Newhouse
& McClellan, 1998). This randomness requirement makes valid IVs hard to identify in
observational studies. A valid IV has to satisfy three additional assumptions (Figure 1.1)
(Newhouse & McClellan, 1998): (1) the IV ($G$) has no independent effect on the outcome of
interest ($Y$); (2) the IV ($G$) has a substantial effect on the risk factors ($X$); and (3) the IV ($G$) is
independent of other variables ($U$) that are potential confounders of the association
between the risk factors ($X$) and the phenotype ($Y$) (Martens et al., 2006). If the first assumption is
violated (i.e., a pleiotropic effect is present), biased estimates will be produced. If the second
assumption does not hold (i.e., a weak instrument is present), the random error of the weak
instrument may mask the effect of the risk factors ($X$) on the outcome ($Y$) (Newhouse &
McClellan, 1998) and lead to a large bias in the estimator even when the first assumption is only
slightly violated (Martens et al., 2006). If the third assumption is violated, the estimates may be
biased.
Figure 1.1 Schematic of instrumental variable analysis.
This diagram displays the schematic of the instrumental variable analysis. Here, G denotes the
instrumental variables, X denotes the risk factors or gene expression levels, Y denotes the outcome of
interest, and U denotes measured or unmeasured confounders. Solid arrows refer to associations,
and the dotted arrow refers to the association of interest.
Practically, the classic method of IV analysis is two-stage least squares (TSLS), which
requires individual-level data. The TSLS consists of two models
$$Y_i = \pi_0 + \pi_1 X_i + \theta \boldsymbol{U}_i + \epsilon_{1,i}$$
$$X_i = \alpha_0 + \alpha_1 G_i + \gamma \boldsymbol{W}_i + \epsilon_{2,i}$$
where $\boldsymbol{U}$ and $\boldsymbol{W}$ reflect other predictors of the outcome ($Y$) and risk factors ($X$), respectively.
Note that $\boldsymbol{U}$ and $\boldsymbol{W}$ are often the same or at least have a lot of overlap. Allowing additional
predictors or confounders (i.e., $\boldsymbol{U}$ and $\boldsymbol{W}$) in both models will increase the precision of the
estimation. For simplicity in illustration, I restrict the two models to no other predictors, i.e.
$$\boldsymbol{Y} = \pi_0 + \pi_1 \boldsymbol{X} + \boldsymbol{\epsilon}_1 \tag{1.1}$$
$$\boldsymbol{X} = \alpha_0 + \alpha_1 \boldsymbol{G} + \boldsymbol{\epsilon}_2. \tag{1.2}$$
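The two stages in (1.1) and (1.2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the implementation used in this dissertation: the data are simulated, and the allele frequency (0.3), the effect sizes, the confounder U, and the function names `ols_slope_intercept`, `tsls`, and `simulate` are all hypothetical choices made for the example.

```python
import random


def ols_slope_intercept(x, y):
    """Return (intercept, slope) from a simple least-squares fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def tsls(g, x, y):
    """Two-stage least squares with one instrument and no other predictors."""
    a0, a1 = ols_slope_intercept(g, x)       # stage 1: X on G, as in (1.2)
    x_hat = [a0 + a1 * gi for gi in g]       # fitted values of the risk factor
    _, pi1 = ols_slope_intercept(x_hat, y)   # stage 2: Y on fitted X, as in (1.1)
    return pi1


def simulate(n=20000, pi1=0.5, seed=1):
    """Hypothetical data: G is a 0/1/2 allele count, U an unmeasured confounder."""
    rng = random.Random(seed)
    g, x, y = [], [], []
    for _ in range(n):
        gi = (rng.random() < 0.3) + (rng.random() < 0.3)
        ui = rng.gauss(0, 1)
        xi = 0.5 * gi + ui + rng.gauss(0, 1)    # G and U both affect X
        yi = pi1 * xi + ui + rng.gauss(0, 1)    # U also affects Y directly
        g.append(gi)
        x.append(xi)
        y.append(yi)
    return g, x, y


g, x, y = simulate()
naive = ols_slope_intercept(x, y)[1]   # ordinary regression, confounded by U
iv = tsls(g, x, y)                     # uses only the variation in X explained by G
print(f"naive OLS slope: {naive:.3f}, TSLS estimate: {iv:.3f}, simulated pi1: 0.5")
```

In this simulated setting the naive regression of Y on X absorbs the confounder, whereas the TSLS estimate is driven only by the part of X that G explains.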
If only one instrumental variable is present and the assumptions are satisfied, we have
$$\hat{\pi}_1 = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(g_i - \bar{g})(y_i - \bar{y})}{\frac{1}{n-1}\sum_{i=1}^{n}(g_i - \bar{g})(x_i - \bar{x})} = \frac{\hat{\sigma}_{G,Y}}{\hat{\sigma}_{G,X}} = \frac{\hat{\sigma}_{G,Y}/\hat{\sigma}_G^2}{\hat{\sigma}_{G,X}/\hat{\sigma}_G^2} = \frac{\hat{\beta}_{OLS(G \to Y)}}{\hat{\alpha}_{OLS(G \to X)}},$$
which is an asymptotically unbiased estimator of $\pi_1$ (Martens et al., 2006). Here, OLS
denotes the solution from ordinary least squares (OLS) regression. If more than one instrument is
present, which is rarely seen in epidemiology studies, we have
$$\hat{\pi}_1 = \frac{\sum_{k=1}^{K} \hat{\beta}_k \hat{\alpha}_k}{\sum_{k=1}^{K} \hat{\alpha}_k^2},$$
where 𝐾 is the number of instruments (Sawa, 1969). A method which applies genetic variants as
instruments, referred to as Mendelian randomization, can allow for more than one instrument
(next section).
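Because both estimators above depend only on the marginal regression coefficients, they can be computed from summary statistics alone. The following Python sketch illustrates this under assumptions that are not part of the original text: the instruments are taken to be uncorrelated and on a common scale (so the multi-instrument estimate is a regression of the G-to-Y coefficients on the G-to-X coefficients through the origin, which reduces to the single-instrument ratio when K = 1), and the coefficient values and function names are hypothetical.

```python
def wald_ratio(beta_gy, alpha_gx):
    """Single-instrument estimate: (G -> Y slope) divided by (G -> X slope)."""
    return beta_gy / alpha_gx


def multi_instrument_estimate(betas, alphas):
    """Combine K instruments by regressing beta_k on alpha_k through the origin.

    Assumption for this sketch: instruments are uncorrelated and equally
    scaled, so no weighting or linkage disequilibrium adjustment is applied.
    """
    num = sum(b * a for b, a in zip(betas, alphas))
    den = sum(a * a for a in alphas)
    return num / den


# Hypothetical marginal estimates for K = 3 instruments.
alphas = [0.10, 0.20, 0.40]   # G_k -> X coefficients
betas = [0.05, 0.10, 0.20]    # G_k -> Y coefficients, built as 0.5 * alpha_k

print(wald_ratio(betas[0], alphas[0]))
print(multi_instrument_estimate(betas, alphas))
print(multi_instrument_estimate(betas[:1], alphas[:1]))  # K = 1 gives the ratio
```

The last call shows the design check used here: with a single instrument the combined estimator and the ratio estimator coincide.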
Limitations of instrumental variable analysis have been discussed. Firstly, it is hard to
obtain a valid instrument. To satisfy the assumption of random assignment,
the researcher has to control the IV assignment (Martens et al., 2006), or the assignment has to arise from a natural
randomization process, such as genetic variants in Mendelian randomization. Otherwise, the
researcher has to select an IV on “theoretical grounds” and justify the reasons for choosing it
(Martens et al., 2006). Secondly, violating either of two assumptions, (1) the IV is independent of
$Y$ conditional on $X$ and (2) $X$ is moderately or strongly associated with the IV, results in biased
estimates. For the first assumption, one has to justify that there is no direct effect of the
instrument on the outcome of interest except through the risk factors, which is difficult to
evaluate statistically (Martens et al., 2006). For the second assumption, one can use the
correlation between the instrument and risk factors to evaluate the validity of the instrumental
variable. However, there is no critical value, either in terms of variability explained (i.e., $R^2$) or in
terms of statistical significance, to define whether an instrument is weak or strong. When weak
instruments are present, a small sample size will lead to considerable bias; thus, having a larger
sample size may reduce the impact of weak-instrument bias. In order to satisfy the first
assumption, researchers may have to exclude some instruments when multiple instruments exist,
which can result in a violation of the second assumption. It is important to consider this tradeoff when
conducting an instrumental variable analysis.
1.2. Mendelian Randomization
Mendelian randomization is a form of instrumental variable analysis. It applies genetic
variants as instrumental variables to test the effects of risk factors on the outcome of interest in
observational studies. The Mendelian randomization method exploits Mendel’s second law, i.e.,
the law of independent assortment (Lawlor et al., 2008). Because alleles of germline
genotypes are randomly segregated from parents to offspring, the genetic variants are unlikely to
be associated with the confounders. The segregation occurs before conception, so the germline
genotypes are not subject to reverse causation (Lawlor et al., 2008). Thus, genetic variants are
good potential instruments. To yield a valid IV analysis, the genetic variants used in Mendelian
randomization analysis have to satisfy the assumptions of IV as illustrated in Figure 1.1:
1. The genetic variants (𝐺 ) are independent of the outcome of interest (𝑌 ) given the
risk factors (𝑋 ) and the confounding factors (𝑈 );
2. The genetic variants (𝐺 ) have to be associated with the risk factors (𝑋 );
3. The genetic variants (𝐺 ) are not associated with the confounding factors (𝑈 ).
1.2.1. Existing Mendelian Randomization Approaches
In the simplest case of Mendelian randomization, i.e., a continuous outcome variable 𝑦 ,
an intermediate 𝑋 , and one single-nucleotide polymorphism (SNP) 𝐺 , the estimate from
Mendelian randomization is the ratio of two regression coefficients, i.e.
$$\hat{\pi} = \frac{\hat{\beta}}{\hat{\alpha}} \tag{1.3}$$
where $\hat{\beta}$ denotes the regression coefficient of the association between $G$ and $y$ and
$\hat{\alpha}$ denotes the regression coefficient of the association between $G$ and $X$ (Lawlor et al., 2008).
In reality, we often have more than one SNP as a potential instrument. In such cases, when
individual level data is available, TSLS can be applied directly to the data. As the number of
GWAS consortia has increased, summary-based Mendelian randomization approaches have been
developed (Zheng et al., 2017). Each Mendelian randomization approach has its own purpose.
The basic summary-based Mendelian randomization approach is two-sample inverse-variance
weighted Mendelian randomization (IVW MR) (Burgess et al., 2013), which estimates the causal
effect of one risk factor ($X$) on the outcome ($Y$) per model. It employs a weighted least squares
estimator to model the two sets of regression coefficients described in Eq. 1.3 and applies the
inverse of the squared standard error of $\hat{\beta}$ as the weights (Eq. 1.4). The two-sample IVW MR
can be viewed as a fixed-effect meta-analysis.
$$\hat{\beta}_k = \pi\hat{\alpha}_k + \epsilon_k; \quad \epsilon_k \sim N\left(0, \mathrm{se}(\hat{\beta}_k)^{-2}\right) \tag{1.4}$$
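The weighted regression in Eq. 1.4 has a closed form. Below is a minimal sketch, assuming a zero-intercept weighted least squares fit with weights $1/\mathrm{se}(\hat{\beta}_k)^2$ as described in the text and the usual fixed-effect standard error; the function name `ivw_estimate` and the toy summary statistics are hypothetical.

```python
import math

def ivw_estimate(alpha_hat, beta_hat, se_beta):
    """Zero-intercept weighted least squares of beta_hat on alpha_hat.

    Weights are the inverse squared standard errors of beta_hat.
    Returns the point estimate and its fixed-effect standard error.
    """
    weights = [1.0 / s ** 2 for s in se_beta]
    num = sum(w * a * b for w, a, b in zip(weights, alpha_hat, beta_hat))
    den = sum(w * a * a for w, a in zip(weights, alpha_hat))
    return num / den, math.sqrt(1.0 / den)

# Toy summary statistics for three SNPs (made-up numbers for illustration).
alpha_hat = [0.10, 0.20, 0.30]   # SNP -> risk factor estimates
beta_hat = [0.05, 0.10, 0.15]    # SNP -> outcome estimates
se_beta = [0.01, 0.01, 0.01]     # standard errors of beta_hat

pi_hat, se_pi = ivw_estimate(alpha_hat, beta_hat, se_beta)
print(pi_hat, se_pi)
```

Because every toy SNP has the same ratio $\hat{\beta}_k/\hat{\alpha}_k$, the pooled estimate equals that common ratio; with real data the weights determine how much each SNP contributes.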
Pleiotropy is defined as the violation of the first assumption described in the section above,
i.e., the instruments are not independent of the outcome conditional on the risk factors and
confounders. The presence of pleiotropy results in a biased estimate. Several approaches have
been developed to address the influence of pleiotropy, including Mendelian randomization with
Egger regression (MR Egger regression) (Bowden et al., 2015), Mendelian randomization
pleiotropy residual sum and outlier (MR-PRESSO) (Verbanck et al., 2018), Joint Analysis of
Marginal (JAM) summary statistics for MR (JAM-MR) (Gkatzionis et al., 2019), and HEIDI
(heterogeneity in dependent instruments) in Summary data-based MR (SMR) (Zhu et al., 2016).
The core idea of these approaches is to identify invalid instruments that violate the assumption
and to address the bias caused by the invalid instruments through different strategies. The Egger
test is widely used in meta-analyses for systematic reviews to identify publication bias (Egger
et al., 1997). MR Egger regression adopts the idea from the Egger test by treating the effect
of pleiotropy as a bias term and extends IVW MR by adding an intercept ($\alpha_0$) to the
weighted linear regression (Bowden et al., 2015). MR-PRESSO states that valid variants are
expected to be close to the regression line while invalid instruments, the ones that are subject to
pleiotropy, are expected to deviate from the true slope of the regression line. It includes a global
test which evaluates the presence of pleiotropy, an outlier test which corrects the IV analysis by
removing the invalid IVs that are identified by the global test, and a distortion test which
compares the causal estimates before and after the outlier test (Verbanck et al., 2018). Both MR
Egger regression and MR-PRESSO relax the first assumption. Instead, they have to satisfy the
InSIDE (Instrument Strength Independent of Direct Effect) assumption that the distributions of
the parameters are independent (Bowden et al., 2015; Verbanck et al., 2018). JAM-MR
(Gkatzionis et al., 2019) applies a Bayesian framework and conducts variable selection based on
a pleiotropic loss function via a reversible-jump MCMC algorithm. It penalizes the invalid
instruments that exhibit pleiotropic effects. The HEIDI statistic, developed in SMR, exploits the
fact that if pleiotropy is present, the $\hat{\pi}$ calculated with any genetic variant in linkage
disequilibrium (LD) with the causal variant is identical (Zhu et al., 2016). Thus, whether an
instrument is valid can be tested by the difference between the $\hat{\pi}$ estimated using the top
genetic variant and that estimated using any other significant genetic variant in LD. SMR can
only deal with one single instrument; the generalized SMR (GSMR) extends SMR to more than one
instrument (Zhu et al., 2018).
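The difference between IVW MR and MR Egger regression described above is only the intercept term. A minimal sketch of the point estimates, assuming the same inverse-variance weights as IVW MR and that SNP alleles have already been oriented as the method requires (not shown here); the function name `mr_egger_fit` and the toy numbers are hypothetical.

```python
def mr_egger_fit(alpha_hat, beta_hat, se_beta):
    """Weighted simple linear regression of beta_hat on alpha_hat with an intercept.

    Weights are 1 / se_beta^2. Returns (intercept, slope); the intercept
    plays the role of the average pleiotropic bias term.
    """
    weights = [1.0 / s ** 2 for s in se_beta]
    total_w = sum(weights)
    a_bar = sum(w * a for w, a in zip(weights, alpha_hat)) / total_w
    b_bar = sum(w * b for w, b in zip(weights, beta_hat)) / total_w
    sxy = sum(w * (a - a_bar) * (b - b_bar)
              for w, a, b in zip(weights, alpha_hat, beta_hat))
    sxx = sum(w * (a - a_bar) ** 2 for w, a in zip(weights, alpha_hat))
    slope = sxy / sxx
    intercept = b_bar - slope * a_bar
    return intercept, slope

# Toy data: every SNP carries the same direct (pleiotropic) effect of 0.02
# on top of a causal effect of 0.5 through the risk factor.
alpha_hat = [0.1, 0.2, 0.3, 0.4]
beta_hat = [0.02 + 0.5 * a for a in alpha_hat]
se_beta = [0.01, 0.02, 0.015, 0.01]

egger_intercept, egger_slope = mr_egger_fit(alpha_hat, beta_hat, se_beta)
print(egger_intercept, egger_slope)
```

In this noise-free toy example the intercept absorbs the constant direct effect, so the slope recovers the causal effect that a zero-intercept fit would overstate.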
Figure 1.2 Schematic of multivariable Mendelian randomization.
This diagram displays the schematic of multivariable Mendelian randomization. Here, G
denotes the instrumental variables, X denotes the risk factors/gene expressions, Y denotes the
outcome of interest and U denotes measured or unmeasured confounders.
Another extension of summary-based Mendelian randomization is to incorporate more
than one risk factor ($X$) per model (Figure 1.2), such as inverse-variance weighted multivariable
Mendelian randomization (MVIVW MR) (Burgess et al., 2015; Burgess & Thompson, 2015;
Yavorska & Burgess, 2017) and multi-trait-based conditional and joint analysis (mtCOJO) (Zhu
et al., 2018). The MVIVW MR extends the IVW MR to a multivariable framework by modeling
multiple risk factors in one model as
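$$\hat{\beta}_k = \pi_1\hat{\alpha}_{k1} + \pi_2\hat{\alpha}_{k2} + \cdots + \pi_M\hat{\alpha}_{kM} + \epsilon_k; \quad \epsilon_k \sim N\left(0, \mathrm{se}(\hat{\beta}_k)^{-2}\right)$$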
where $\pi_i$ denotes the effect of the $i$th risk factor on $\boldsymbol{y}$, and $\hat{\alpha}_{ki}$ denotes the
effect of the $k$th instrument on the $i$th risk factor. Zuber et al. (2020) proposed a Bayesian model
averaging approach, MR-BMA, which further extends the MVIVW MR to variable selection in
high-throughput experimental data. mtCOJO is an extension of GSMR (Zhu et al.,
2018) which constructs a joint analysis by
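$$y = x_0\pi_0 + \boldsymbol{x}\boldsymbol{\pi}_{xy} + \epsilon,$$
where $\pi_0$ is the effect of $x_0$ on $y$ and $\boldsymbol{\pi}_{xy}$ is the effect of the other covariates
$\boldsymbol{x}$ on $y$. It constructs the test statistic to test the null hypothesis $\pi_0 = 0$ by
$$T_{\text{mtCOJO}} = \frac{(\hat{\beta} \mid \boldsymbol{\pi}_{xy})^2}{\mathrm{var}(\hat{\beta} \mid \boldsymbol{\pi}_{xy})} \sim \chi_1^2.$$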
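The MVIVW MR regression introduced earlier in this subsection can be sketched for the case of two risk factors, where the weighted normal equations are a 2 x 2 system. This is an illustrative sketch only, assuming uncorrelated SNPs and weights $1/\mathrm{se}(\hat{\beta}_k)^2$; the function name `mvivw_two_factors` and the toy values are hypothetical.

```python
def mvivw_two_factors(alpha1, alpha2, beta_hat, se_beta):
    """Zero-intercept weighted regression of beta_hat on two sets of
    SNP -> risk factor estimates, solved through the 2 x 2 normal equations."""
    weights = [1.0 / s ** 2 for s in se_beta]
    s11 = sum(w * a * a for w, a in zip(weights, alpha1))
    s22 = sum(w * a * a for w, a in zip(weights, alpha2))
    s12 = sum(w * a1 * a2 for w, a1, a2 in zip(weights, alpha1, alpha2))
    t1 = sum(w * a * b for w, a, b in zip(weights, alpha1, beta_hat))
    t2 = sum(w * a * b for w, a, b in zip(weights, alpha2, beta_hat))
    det = s11 * s22 - s12 ** 2          # must be non-zero (no collinearity)
    pi1 = (s22 * t1 - s12 * t2) / det
    pi2 = (s11 * t2 - s12 * t1) / det
    return pi1, pi2

# Toy data: four SNPs, two risk factors with true effects 0.3 and -0.2.
alpha1 = [0.1, 0.2, 0.3, 0.1]
alpha2 = [0.3, 0.1, 0.2, 0.4]
beta_hat = [0.3 * a1 - 0.2 * a2 for a1, a2 in zip(alpha1, alpha2)]
se_beta = [0.01, 0.02, 0.01, 0.02]

pi1_hat, pi2_hat = mvivw_two_factors(alpha1, alpha2, beta_hat, se_beta)
print(pi1_hat, pi2_hat)
```

With more than two risk factors the same normal equations would be solved with a general linear solver rather than the explicit 2 x 2 inverse used here.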
1.2.2. Limitations of Mendelian Randomization
Although Mendelian randomization approaches have been well established and have
successfully evaluated complex associations in observational studies (Zheng et al., 2017), they
are subject to several limitations.
Firstly, since Mendelian randomization is a form of IV analysis, it shares the caveats of
IV analysis. Mendelian randomization approaches often include only the top genome-wide
significant ($P < 5\times10^{-8}$) SNP effects from GWAS; thus, only a small proportion of the
variance of the modifiable risk factor (i.e., $X$) can be explained by the group of instrumental
variables. In such cases, the instrument may not be strong enough to detect the true effect, or
may introduce bias in the estimates. This is referred to as “weak instrument bias”. One solution is
to relax the threshold of the SNP inclusion criteria and add more SNPs to the model. However,
such a solution may violate the first assumption and introduce pleiotropy bias. Moreover, the
summary statistics that Mendelian randomization approaches use are from single-variant tests in
GWAS, in which a signal may be attributable to correlations among SNPs, i.e., linkage
disequilibrium (LD). A recent review suggested that even moderate LD may cause a null SNP to
show a significant signal (Zheng et al., 2017).
Secondly, multi-functional genes could introduce pleiotropic effects (Smith &
Ebrahim, 2004; Thomas & Conti, 2004). For example, using the methylenetetrahydrofolate
reductase (MTHFR) gene as an instrument to explore the role of folate and homocysteine in the
etiology of neural tube defects (NTD) via a single-risk-factor Mendelian randomization, such as
IVW MR, may not be appropriate (Thomas & Conti, 2004). The directed acyclic graph (DAG)
of the relationship is more likely the one shown in Figure 1.2, where $X_1$ represents folate and
$X_2$ represents homocysteine.
Population stratification is another potential pitfall in the interpretation of Mendelian
randomization approaches. We assume that the alleles of a genotype are randomly assigned, but
in reality this may not hold because of population stratification. The violation of the randomness
assumption may result in false positives. Some argue that using summary statistics only from GWAS
within homogeneous populations or from GWAS that have been sufficiently adjusted for
population stratification (Zheng et al., 2017) may solve the problem.
Table 1.1 describes other remaining limitations of Mendelian randomization that have
been discussed in the literature.
Table 1.1 Remaining limitations of MR approaches and solutions for the problems
| Limitation | Description | Solution |
| --- | --- | --- |
| Failure to establish reliable relationships (Smith & Ebrahim, 2004) / Complexity of biology (Zheng et al., 2017) | The relationships between genotype, risk factors and phenotypes are not reliable, as in the example in (Thomas & Conti, 2004). | Build the diagram of the relationships based on literature review and previous evidence. |
| Lack of reliable instruments for the risk factors (Smith & Ebrahim, 2004; Zheng et al., 2017) | Causal genetic instruments are not available in the summary statistics from the GWAS. | Use proxy for the unavailable risk factors. |
| Gene-environmental interactions (Thomas & Conti, 2004) | The genetic instrument and an environmental factor have an interactive effect on the risk factors. | Biology knowledge is needed to solve the problem. |
1.3. Transcriptome-Wide Association Study
The concept of transcriptome-wide association studies (TWAS) was introduced by Gusev and
colleagues (A. Gusev et al., 2016). It refers to studies that identify significant gene expression-trait
associations through gene expression levels which are imputed from genetic data. TWAS follows
the same DAG as in Figure 1.3 with $X$ as the gene expression levels. We will refer to the
approaches which are designed for such problems as transcriptome-wide association studies
(TWAS) and to the specific transcriptome-wide association study approach in (A. Gusev et al.,
2016) as FUSION in the following text to avoid confusion.
FUSION (A. Gusev et al., 2016) and PrediXcan (Gamazon et al., 2015) are two popular
TWAS approaches. Both methods have two steps: they first impute the gene expression levels
from genetic data and then infer the association between the imputed gene expression and traits.
Figure 1.3 (A. Gusev et al., 2016) and Figure 1.4 (Barbeira et al., 2018) describe the schematics
of FUSION and PrediXcan, respectively.
In step one, when individual level data is available, FUSION evaluates three imputation
schemes: (1) apply the single most significantly associated eQTL in the training set as the
predictor (cis-eQTL); (2) apply the Best Linear Unbiased Prediction (BLUP) (Robinson, 1991),
which fits a mixed model with all the SNPs in the locus where random effects are allowed and
estimates the joint effects with a single variance component; and (3) apply a Bayesian estimation
scheme (Zhou et al., 2013), called BSLMM by FUSION, which was developed to construct
polygenic risk scores. When only summary statistics exist, FUSION applies a summary-based
imputation through the ImpG-Summary algorithm (Pasaniuc et al., 2014) to train on the cis
genetic component of expression. In practice, FUSION selected all SNPs within 1 Mb of the
gene. One critical part of FUSION is the assumption that the gene expression levels are
standardized; thus, the variance of the genetic markers is the proportion of variation of each
gene expression explained by the eQTLs, which can be represented by the heritability of the
expressions. Different from FUSION, PrediXcan (Gamazon et al., 2015) constructs the imputed
gene expression with individual-level data from a publicly available data source, GTEx (Lonsdale
et al., 2013), in step one. Moreover, a common implementation evaluates a polygenic risk score
via a least absolute shrinkage and selection operator (LASSO) or elastic net. These results are
included in the public resource, PredictDB.org (http://predictdb.org), which stores the weights
estimated by elastic net and the covariance matrix of eQTLs.
Figure 1.3 Schematic of the FUSION approach.
This figure is from Figure 1 in (A. Gusev et al., 2016), describing the schematic of the FUSION
approach. The top panel displays the estimation of the effect sizes between the cis-eQTLs and gene
expression in the reference panel. Path A is the FUSION approach when individual level data is
available. It predicts the gene expression directly for genotyped samples by using the effect
sizes that are estimated in the reference panel (top) and uses the predicted expression levels to
test the association between the gene expression and the trait. Path B is the summary-based
FUSION approach when summary statistics are provided. It indirectly estimates the effect sizes
between the predicted gene expression and the trait by using a weighted linear combination of the
eQTL-trait standardized effect sizes and incorporating the LD information.
Figure 1.4 Schematic of the PrediXcan approach.
This figure is from Figure 2 in (Gamazon et al., 2015), describing the steps of PrediXcan.
In step two, both PrediXcan and FUSION use the imputed gene expression as a surrogate
to infer the effects of gene expression on traits. PrediXcan applies regression, such as linear
regression, logistic regression or a Cox model, or non-parametric approaches, such as Spearman
correlation, to test the gene expression-trait association when individual level data is available.
PrediXcan is similar to TSLS but it uses two different data sets to perform the two stages.
S-PrediXcan (Barbeira et al., 2018) is a summary-based version of PrediXcan which requires summary
statistics from GWAS and does not require any individual level data. Figure 1.5 illustrates the
process of S-PrediXcan and its relationship to PrediXcan (Barbeira et al., 2018). One limitation
of summary-PrediXcan is that it considers one gene expression per regression, which may
introduce pleiotropy bias in the estimate if some cis-eQTLs for the investigated gene are
also associated with another gene that is associated with the trait. FUSION claims that their
approach can be “conceptually viewed as a test for the correlation between the genetic
component of expression and the genetic component of a trait” (A. Gusev et al., 2016). Instead of
providing a point estimate of the gene expression-trait association, both FUSION and PrediXcan
provide a test statistic, the z score, for testing the statistical significance of the association.
In addition to PrediXcan and FUSION, there exist several approaches that target the same
research question. The summary-based MR (SMR) (Zhu et al., 2016) and generalized SMR
(GSMR) (Zhu et al., 2018), as mentioned in the Mendelian randomization section, are two
approaches designed to integrate the summary data from GWAS and eQTL studies to predict
complex traits. SMR and GSMR use the top eQTLs as instrumental variables to detect the gene
expression-trait associations.
Figure 1.5 Schematic of summary-PrediXcan.
This figure is from Figure 1 in (Barbeira et al., 2018), illustrating the schematic of summary-
PrediXcan.
1.4. A Glimpse of Our Contribution
Mendelian randomization and TWAS have been viewed as two different approaches. In
this dissertation, we unify the framework of Mendelian randomization and TWAS statistically
and show its flexibility. Currently, most Mendelian randomization approaches focus on a single
risk factor per model while most TWAS approaches test the genes univariately, which may lead
to pleiotropy bias or weak instrument bias and is not suitable for high-throughput omics data. In
the following chapters, we propose four methods, including hierarchical joint analysis of
marginal summary statistics (hJAM) (Chapter 2), hJAM with Egger regression (hJAM Egger)
(Chapter 3), and scalable hierarchical approach of joint analysis for marginal summary omics
data (SHA-JAM) (Chapter 4) to address the challenges and limitations. In Chapter 5, we provide
an R package, hJAM, for implementing the methods proposed in this dissertation. We conclude
with an overall summary of this dissertation in Chapter 6.
Chapter 2.
hJAM: A Hierarchical Joint Analysis of Marginal Summary
Statistics
This chapter is a modified version of the published work:
Lai Jiang, et al., A Hierarchical Approach Using Marginal Summary Statistics for Multiple
Intermediates in a Mendelian Randomization or Transcriptome Analysis. American Journal of
Epidemiology (in press).
2.1. Motivation
As discussed in Chapter 1, Mendelian randomization and transcriptome-wide
association studies (TWAS) are the two major approaches within the instrumental variable
analysis framework using genetic variants. Mendelian randomization approaches use a set of
genetic variants as an instrumental variable set to estimate the association between modifiable risk
factors and traits, while TWAS, including FUSION (A. Gusev et al., 2016) and PrediXcan
(Gamazon et al., 2015), use expression quantitative trait loci (eQTLs) to predict the gene
expression and estimate the association between the predicted gene expressions and traits. For
intermediates, Mendelian randomization focuses on modifiable risk factors and TWAS targets gene
expression via eQTLs, loci that explain genetic variation in gene expression levels (Nica &
Dermitzakis, 2013). Alternatively, TWAS approaches can be viewed as a weighted SNP
approach in which the eQTL information is used to construct the weights. One advantage of
using these tools is the ubiquity of publicly available GWAS data, such as UK Biobank (Sudlow et
al., 2015), which enables researchers to initiate investigations of complex traits and diseases
nearly immediately (Burgess & Davey Smith, 2019). The existing approaches differ in their
strategies to combine the summary statistics of SNPs from GWAS or those of eQTLs from RNA
sequencing data. The most widely used Mendelian randomization approach is inverse-variance
weighted Mendelian randomization (IVW MR) (Burgess et al., 2013), which is similar to a
fixed-effect meta-analysis, often constraining the intercept to be zero.
As forms of instrumental variable analysis, these approaches must satisfy the underlying
assumptions to yield valid estimates, and extensions exist to overcome some of these limitations
(discussed in Chapter 1). For example, the top genome-wide significant ($P < 5\times10^{-8}$) SNP
effects from a GWAS which are included by Mendelian randomization may only explain a small
proportion of the variance of the modifiable risk factor. In such cases, the instrument may not be
strong enough to detect the true effect and may result in bias. Of course, one solution is to
increase the number of SNPs in the instrument set (Zheng et al., 2017). However, such a solution
may introduce pleiotropy, which is defined as the effect of genetic variants on the outcome that is
not through the intermediates. Likewise, in TWAS analyses, pleiotropy may be present because of
the potential existence of multi-functional genes, or if some cis-SNPs for the investigated gene are
also associated with another gene that is associated with the trait (Smith & Ebrahim, 2004;
Thomas & Conti, 2004). Potential solutions to pleiotropy include MR-Egger (Bowden et al.,
2015), which allows the intercept to be estimated as non-zero in the presence of pleiotropy.
Alternatively, if multiple correlated intermediates are analyzed jointly, there is the potential for
each intermediate to explain all or a portion of the pleiotropic effect, which thus relaxes the first
assumption and allows the genetic variants to have effects on the outcome through multiple
pathways/intermediates (Burgess & Thompson, 2015).
In this chapter, we propose an approach that leverages the joint analysis of marginal
summary statistics (JAM) (Newcombe et al., 2016), a scalable algorithm designed to analyze
published marginal summary statistics from GWAS under a joint multi-SNP model to identify
causal genetic variants for fine mapping. The marginal summary statistics refer to the univariate
estimates of the SNPs from GWAS. Here, we extend JAM with a hierarchical model (hJAM) to
incorporate SNP-intermediate association estimates and unify the framework of MR and TWAS
approaches when multiple intermediates and/or correlated SNPs exist. The hJAM framework
provides a natural way to incorporate invalid instruments and adjust for the pleiotropy bias, as
we will discuss later in Chapter 3.
2.2. Methods
2.2.1. Unify the Framework of Mendelian Randomization and TWAS
Instrumental variable analysis with individual-level genotype data can be viewed as a
two-stage hierarchical model. Using linear regression, the first stage models the outcome as a
function of the genetic variants:
𝒀 =𝑮𝜷 +𝜹 . (2.1)
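Here, $\boldsymbol{Y}$ denotes an $n$-length vector of a continuous outcome, $\boldsymbol{G}$ denotes an $n \times K$
genotype matrix with $K$ SNPs and $n$ individuals, and $\boldsymbol{\delta}$ denotes the residuals. The second
stage models the conditional effect estimates $\boldsymbol{\beta}$ as a function of prior information (Conti &
Witte, 2003; Greenland, 2000; Lewinger et al., 2007; Thomas et al., 2009),
$\hat{\boldsymbol{A}} \in \mathbb{R}^{K \times M}$:
$$\boldsymbol{\beta} = \hat{\boldsymbol{A}}\boldsymbol{\pi} + \boldsymbol{\epsilon}. \tag{2.2}$$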
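where $\boldsymbol{\pi} \in \mathbb{R}^{M \times 1}$ denotes the parameter of interest, the vector of effects of the
intermediates $\boldsymbol{X}$ on the outcome $\boldsymbol{Y}$, and $M$ is the number of intermediates $\boldsymbol{X}$. We can
join these two-stage models into a single linear mixed model by substituting Eq. 2.2 into Eq. 2.1
(Witte et al., 2000):
$$\boldsymbol{Y} = \boldsymbol{G}\hat{\boldsymbol{A}}\boldsymbol{\pi} + \boldsymbol{G}\boldsymbol{\epsilon} + \boldsymbol{\delta} = \boldsymbol{G}\hat{\boldsymbol{A}}\boldsymbol{\pi} + \boldsymbol{\delta}, \tag{2.3}$$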
assuming there is no direct effect from the genetic variants to the outcome (i.e., $\boldsymbol{\epsilon} = \boldsymbol{0}$). The
estimate $\hat{\boldsymbol{\pi}}$ from Eq. 2.3 is equivalent to the result from two-stage least squares (2SLS)
regression, which is employed by PrediXcan (Gamazon et al., 2015) and others (A. Gusev et al.,
2016). The prior information $\hat{\boldsymbol{A}}$ contains the association estimates between the genetic variants
and the intermediates and can be applied to impute the intermediates with the genetic variants:
$$\hat{\boldsymbol{X}} = \boldsymbol{G}\hat{\boldsymbol{A}}. \tag{2.4}$$
Note that Eq. 2.4 is the stage-2 in the 2SLS regression and that MR approaches with
summary data are developed based on Eq. 2.2. One key aspect of the instrumental variable
analysis with genetic variants is that the $\hat{\boldsymbol{A}}$ matrix is computed from a separate data set,
$\hat{\boldsymbol{A}} = (\hat{\boldsymbol{\alpha}}_1, \hat{\boldsymbol{\alpha}}_2, \ldots, \hat{\boldsymbol{\alpha}}_M)^T$, where $\hat{\boldsymbol{\alpha}}_i$ denotes the vector of association
estimates between the genetic variants and the $i$th intermediate from external data. Two different
$\hat{\boldsymbol{\alpha}}$ vectors have been used by previous methods. Marginal estimates $\hat{\boldsymbol{a}}$ are widely employed
by Mendelian randomization, where marginal summary statistics from GWAS are used (Burgess &
Bowden, 2015; Burgess et al., 2013; Burgess & Thompson, 2015). Here marginal estimates refer to
the univariate estimates. Conditional estimates of $\hat{\boldsymbol{\alpha}}$, which adjust for the correlations
between genetic variants, can also be incorporated into the framework and are easily obtained from
joint regression models with individual-level data. If only marginal summary statistics are
available, one way to convert these estimates into conditional ones is to incorporate the linkage
disequilibrium (LD) block among the SNPs using the JAM approach (Appendix 2A.1) (Newcombe
et al., 2016). Alternatively, with individual-level data, a regularized estimate $\hat{\boldsymbol{\alpha}}$ can be
obtained by applying regularized regression, such as the values reported in the PredictDB
developed for PrediXcan (Gamazon et al., 2015).
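To model multiple intermediates, we construct an $\hat{\boldsymbol{A}}$ matrix by combining the vectors of effect
estimates of the SNPs on each intermediate, $\hat{\boldsymbol{\alpha}}_i$, into a matrix with the number of columns
equal to the number of intermediates (i.e., $M$):
$$\hat{\boldsymbol{A}}_{K \times M} = \begin{bmatrix} \hat{\alpha}_{11} & \dots & \hat{\alpha}_{1M} \\ \vdots & \ddots & \vdots \\ \hat{\alpha}_{K1} & \dots & \hat{\alpha}_{KM} \end{bmatrix}.$$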
2.2.2. Hierarchical JAM (hJAM) Model
Following the JAM framework (Appendix 2A.1) (Newcombe et al., 2016), we use the
marginal summary statistics, $\hat{\boldsymbol{b}}$, which are obtained from a GWAS, and the minor allele
frequencies (MAF) of the genetic variants, $\hat{\boldsymbol{p}}$, to construct a vector $\boldsymbol{z}$ with the $i$th element,
$z_i$, for each genetic variant:
$$z_i = 2N_Y\hat{p}_i(1-\hat{p}_i)\hat{b}_i,$$
assuming Hardy-Weinberg Equilibrium. Each element represents the total trait burden for all risk
alleles of SNP 𝑖 , present in the population. The MAF can be extracted from the same GWAS or
using external populations such as 1000 Genomes Project (Consortium, 2015) as reference data.
Using standard linear algebra, we can express the distribution of 𝑧 as
$$\boldsymbol{z} \sim MVN_K\left(\boldsymbol{G}_0'\boldsymbol{G}_0\boldsymbol{\beta},\ \sigma_Y^2\boldsymbol{G}_0'\boldsymbol{G}_0\right),$$
where $\boldsymbol{G}_0'\boldsymbol{G}_0$ denotes the $K \times K$ genotype variance-covariance matrix from a centered
reference genotype data set (e.g., 1000 Genomes Project (Consortium, 2015)), which is used to
obtain the conditional effects of SNPs on the outcome, $\hat{\boldsymbol{\beta}}$. The reference genotype data is
centered by the mean to avoid an intercept term. Details are described in Newcombe et al. (2016).
To simplify the likelihood, we perform a Cholesky decomposition $\boldsymbol{L}'\boldsymbol{L} = \boldsymbol{G}_0'\boldsymbol{G}_0$.
Then, we transform $\boldsymbol{z}$ into $\boldsymbol{z}_L$ with the inverse of $\boldsymbol{L}'$ as $\boldsymbol{z}_L = \boldsymbol{L}'^{-1}\boldsymbol{z}$. When
$\boldsymbol{G}_0'\boldsymbol{G}_0$ is only positive semi-definite, we add a ridge term, i.e., a small positive element, on
the diagonal to force it to be a positive definite matrix. The regularization term has a very small
effect on the estimates while guaranteeing the invertibility of the $\boldsymbol{L}$ matrix. Then, $\boldsymbol{z}_L$ is a
vector of independent statistics that can be expressed as
$$\boldsymbol{z}_L \sim MVN_K\left(\boldsymbol{L}\boldsymbol{\beta},\ \sigma_Y^2\boldsymbol{I}_K\right). \tag{2.5}$$
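The step from the distribution of $\boldsymbol{z}$ to Eq. 2.5 can be verified directly. The short derivation below is a sketch that assumes $\boldsymbol{L}$ is invertible (after the ridge adjustment, if needed) and uses only the identity $\boldsymbol{L}'\boldsymbol{L} = \boldsymbol{G}_0'\boldsymbol{G}_0$:

$$
\begin{aligned}
E[\boldsymbol{z}_L] &= (\boldsymbol{L}')^{-1}E[\boldsymbol{z}] = (\boldsymbol{L}')^{-1}\boldsymbol{L}'\boldsymbol{L}\boldsymbol{\beta} = \boldsymbol{L}\boldsymbol{\beta}, \\
\mathrm{Var}(\boldsymbol{z}_L) &= (\boldsymbol{L}')^{-1}\,\mathrm{Var}(\boldsymbol{z})\,\boldsymbol{L}^{-1} = (\boldsymbol{L}')^{-1}\left(\sigma_Y^2\boldsymbol{L}'\boldsymbol{L}\right)\boldsymbol{L}^{-1} = \sigma_Y^2\boldsymbol{I}_K .
\end{aligned}
$$

The transformation therefore removes the LD-induced correlation between the elements of $\boldsymbol{z}$, which is what allows an ordinary least squares style likelihood in the next step.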
Similar to above, we then fit a hierarchical model by incorporating the second-stage
model (Eq. 2.2) into Eq. 2.5 and construct the hJAM model as
$$\boldsymbol{z}_L \sim MVN_K\left(\boldsymbol{L}\hat{\boldsymbol{A}}\boldsymbol{\pi},\ \sigma_Y^2\boldsymbol{I}_K\right), \tag{2.6}$$
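assuming no direct effect from genetic variants to the outcome. Here, $\boldsymbol{\pi}$ denotes the association
parameter of interest between the intermediates and the outcome; it is estimated using maximum
likelihood, and the statistical significance is given by a Wald test. The estimate $\hat{\boldsymbol{\pi}}$ and its
corresponding variance are
$$\hat{\boldsymbol{\pi}} = \left((\boldsymbol{L}\hat{\boldsymbol{A}})'(\boldsymbol{L}\hat{\boldsymbol{A}})\right)^{-1}(\boldsymbol{L}\hat{\boldsymbol{A}})'\boldsymbol{z}_L$$
and
$$\mathrm{Var}(\hat{\boldsymbol{\pi}}) = \left((\boldsymbol{L}\hat{\boldsymbol{A}})'(\boldsymbol{L}\hat{\boldsymbol{A}})\right)^{-1}\sigma_Y^2.$$
Note that the number of genetic variants must be equal to or greater than the number of
intermediates.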
2.2.3. Required Data Sources for hJAM Implementation
To implement hJAM, we need three data sources, $G_X$, $G_Y$, and $G_L$ (Table 2.1). $G_L$ denotes
the reference data used to extract the underlying LD structure of the SNPs.
Individual level reference data is required. One choice could be the 1000 Genomes Project
(Consortium, 2015). $G_X$ denotes the data for the association estimates between the SNPs and the
intermediates $\boldsymbol{X}$. When $\boldsymbol{X}$ are modifiable risk factors, $G_X$ could be several independent GWAS
data sets in which the outcomes of the analyses are the intermediates. When $\boldsymbol{X}$ are gene
expression levels, $G_X$ could be a genomic data set, such as the Genotype-Tissue Expression (GTEx)
project (Lonsdale et al., 2013) or Genetic European Variation in Health and Disease (GEUVADIS;
460 lymphoblastoid cell lines, LCLs) (Lappalainen et al., 2013). $G_Y$ denotes the data for the
association estimates between the SNPs and the outcome of interest ($\boldsymbol{Y}$). The most popular
choice of $G_Y$ is a GWAS with $\boldsymbol{Y}$ as the outcome.
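Table 2.1 The annotations of the three data sources for summary statistics employed in
hJAM.
| Data Source | Summary statistics obtained | Dimension† | Examples |
| --- | --- | --- | --- |
| $G_X$ | $\hat{\boldsymbol{A}} = \begin{pmatrix} \hat{\alpha}_{11} & \cdots & \hat{\alpha}_{1M} \\ \vdots & \ddots & \vdots \\ \hat{\alpha}_{K1} & \cdots & \hat{\alpha}_{KM} \end{pmatrix}$ | $K \times M$ | GTEx, GWAS |
| $G_Y$ | $\hat{\boldsymbol{\beta}}$ | $K \times 1$ | GWAS |
| $G_L$ | $\boldsymbol{\Gamma}_g = \dfrac{\boldsymbol{G}_0'\boldsymbol{G}_0}{N} = \dfrac{(\boldsymbol{G}_L - \bar{\boldsymbol{G}}_L)'(\boldsymbol{G}_L - \bar{\boldsymbol{G}}_L)}{N}$ | $N \times K$ | 1000 Genomes |
† Dimension of the summary statistics for $G_X$ and $G_Y$, and dimension of the individual level data
of the reference panel $G_L$. Abbreviations: GTEx, Genotype-Tissue Expression; GWAS, Genome-
Wide Association Studies.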
2.2.4. Methods as Comparisons
We compared the performance of hJAM to inverse-variance weighted Mendelian
randomization (IVW MR) (Burgess et al., 2013), multivariable inverse-variance weighted
Mendelian randomization (MVIVW MR) (Burgess & Thompson, 2015), and S-PrediXcan
(Barbeira et al., 2018). The default versions of IVW MR (Burgess et al., 2013) and MVIVW MR
(Burgess et al., 2015; Burgess & Thompson, 2015) do not specify a correlation structure between
SNPs. However, the R package for implementation, MendelianRandomization (Yavorska &
Burgess, 2017), provides functionality to incorporate the LD structure in the analysis. Thus,
we implemented IVW MR and MVIVW MR in two ways: with and without specifying the
LD structure of the simulated SNPs. IVW MR (Burgess et al., 2013) is a classic MR method
which applies a fixed-effect meta-analysis approach to combine the ratio estimates obtained from
each variant using two GWAS. The point estimate from IVW MR is similar in spirit to the
second-stage model of the hierarchical model and is equivalent to a weighted linear regression,
without an intercept, of the SNP-outcome ($\boldsymbol{Y}$ and $\boldsymbol{G}$) associations on the SNP-intermediate
($X_i$ and $\boldsymbol{G}$) associations. MVIVW MR is an extension of IVW MR which allows multiple
intermediates and uses a multivariable weighted linear model to test the joint effects of multiple
intermediates (Yavorska & Burgess, 2017). S-PrediXcan is an extension of PrediXcan using
summary statistics as input (Barbeira et al., 2018) to estimate the univariate gene
expression-outcome associations. Table 2.2 summarizes the theoretical expressions for our approach
and the competing approaches, and Appendix 2A.2 shows the theoretical justifications in detail.
When there is one $X$ and all SNPs are independent, the estimates from hJAM, IVW MR, and
S-PrediXcan are theoretically equivalent. S-PrediXcan and hJAM produce the same test statistic
under such a scenario. The estimates from hJAM and MVIVW MR are equivalent when there are
multiple independent $\boldsymbol{X}$ but not when there are multiple correlated $\boldsymbol{X}$. FUSION shows a slightly
different test statistic from S-PrediXcan and hJAM.
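The stated equivalence for one $X$ and independent SNPs can be traced in a few lines. The derivation below is a sketch under two assumptions taken from the closed forms in Table 2.2: the reference matrix is diagonal, $\boldsymbol{G}_0'\boldsymbol{G}_0 = N_Y\,\mathrm{diag}(\hat{\sigma}_{g,k}^2)$, and $\mathrm{se}(\hat{\beta}_k)^{-2} = N_Y\hat{\sigma}_{g,k}^2/\hat{\sigma}_Y^2$. Under independence the marginal and conditional SNP-outcome estimates coincide, so $\boldsymbol{z} = \boldsymbol{G}_0'\boldsymbol{G}_0\hat{\boldsymbol{\beta}}$ and $\boldsymbol{L} = \sqrt{N_Y}\,\mathrm{diag}(\hat{\sigma}_{g,k})$:

$$
\begin{aligned}
\boldsymbol{z}_L &= (\boldsymbol{L}')^{-1}\boldsymbol{L}'\boldsymbol{L}\hat{\boldsymbol{\beta}} = \boldsymbol{L}\hat{\boldsymbol{\beta}}, \\
\hat{\pi}_{\text{hJAM}} &= \frac{(\boldsymbol{L}\hat{\boldsymbol{\alpha}})'(\boldsymbol{L}\hat{\boldsymbol{\beta}})}{(\boldsymbol{L}\hat{\boldsymbol{\alpha}})'(\boldsymbol{L}\hat{\boldsymbol{\alpha}})}
 = \frac{\sum_k N_Y\hat{\sigma}_{g,k}^2\hat{\alpha}_k\hat{\beta}_k}{\sum_k N_Y\hat{\sigma}_{g,k}^2\hat{\alpha}_k^2}
 = \frac{\sum_k \hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}^2}{\sum_k \hat{\alpha}_k^2\hat{\sigma}_{g,k}^2}, \\
\hat{\pi}_{\text{IVW}} &= \frac{\sum_k \hat{\alpha}_k\hat{\beta}_k\,\mathrm{se}(\hat{\beta}_k)^{-2}}{\sum_k \hat{\alpha}_k^2\,\mathrm{se}(\hat{\beta}_k)^{-2}}
 = \frac{\sum_k \hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}^2}{\sum_k \hat{\alpha}_k^2\hat{\sigma}_{g,k}^2}.
\end{aligned}
$$

The common factor $N_Y/\hat{\sigma}_Y^2$ cancels in the IVW ratio, which is why the two point estimates coincide in this special case; with correlated SNPs the off-diagonal terms of $\boldsymbol{G}_0'\boldsymbol{G}_0$ no longer vanish and the expressions differ.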
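Table 2.2 Theoretical comparisons between hJAM, competing MR and TWAS approaches.

hJAM
  Estimate: $\hat{\boldsymbol{\pi}} = \left((\mathbf{L}\hat{\mathbf{A}})'(\mathbf{L}\hat{\mathbf{A}})\right)^{-1}(\mathbf{L}\hat{\mathbf{A}})'\mathbf{z}_L = (\hat{\mathbf{A}}'\mathbf{W}_{\text{hJAM}}\hat{\mathbf{A}})^{-1}\hat{\mathbf{A}}'\mathbf{W}_{\text{hJAM}}\hat{\boldsymbol{\beta}}$
  Variance: $\text{Var}(\hat{\boldsymbol{\pi}}) = \text{diag}\left(\left((\mathbf{L}\hat{\mathbf{A}})'(\mathbf{L}\hat{\mathbf{A}})\right)^{-1}\sigma_Y^2\right) = \text{diag}\left((\hat{\mathbf{A}}'\mathbf{W}_{\text{hJAM}}\hat{\mathbf{A}})^{-1}\right)$
  where
  $$\mathbf{W}_{\text{hJAM}} = \frac{N_Y}{\hat{\sigma}_Y^2}\cdot\begin{pmatrix} \hat{\sigma}_{g,1}^2 & \cdots & \hat{\sigma}_{g,1}\cdot\hat{\sigma}_{g,p}\cdot\rho_{1p} \\ \vdots & \ddots & \vdots \\ \hat{\sigma}_{g,p}\cdot\hat{\sigma}_{g,1}\cdot\rho_{p1} & \cdots & \hat{\sigma}_{g,p}^2 \end{pmatrix}$$

hJAM (M=1)
  Estimate: $\hat{\pi} = \dfrac{\sum_k \hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}^2}{\sum_k \hat{\alpha}_k^2\hat{\sigma}_{g,k}^2}$
  Variance: $\text{Var}(\hat{\pi}) = \dfrac{\hat{\sigma}_Y^2}{N_Y\hat{\sigma}_g^2}$
  Test statistic: $\dfrac{\sum_k \hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}^2}{\hat{\sigma}_g}\cdot\sqrt{\dfrac{N_Y}{\hat{\sigma}_Y^2}}$

IVW MR
  Estimate: $\hat{\pi} = \dfrac{\sum_k \hat{\alpha}_k\hat{\beta}_k\cdot\text{se}^{-2}(\hat{\beta}_k)}{\sum_k \hat{\alpha}_k^2\cdot\text{se}^{-2}(\hat{\beta}_k)} = \dfrac{\sum_k \hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}^2}{\sum_k \hat{\alpha}_k^2\hat{\sigma}_{g,k}^2}$
  Variance: $\text{Var}(\hat{\pi}) = \dfrac{1}{\sum_k \hat{\alpha}_k^2\cdot\frac{1}{\text{se}^2(\hat{\beta}_k)}} = \dfrac{\hat{\sigma}_Y^2}{N_Y\hat{\sigma}_g^2}$
  Test statistic: $\dfrac{\sum_k \hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}^2}{\hat{\sigma}_g}\cdot\sqrt{\dfrac{N_Y}{\hat{\sigma}_Y^2}}$

MVIVW MR
  Estimate: $\hat{\boldsymbol{\pi}} = (\hat{\mathbf{A}}'\mathbf{W}_{\text{MVIVW}}\hat{\mathbf{A}})^{-1}\hat{\mathbf{A}}'\mathbf{W}_{\text{MVIVW}}\hat{\boldsymbol{\beta}}$
  Variance: $\text{Var}(\hat{\boldsymbol{\pi}}) = \text{diag}\left((\hat{\mathbf{A}}'\mathbf{W}_{\text{MVIVW}}\hat{\mathbf{A}})^{-1}\right)$
  where
  $$\mathbf{W}_{\text{MVIVW}} = \begin{pmatrix} \text{se}(\hat{\beta}_1)^2 & \cdots & \text{se}(\hat{\beta}_1)\cdot\text{se}(\hat{\beta}_p)\cdot\rho_{1p} \\ \vdots & \ddots & \vdots \\ \text{se}(\hat{\beta}_p)\cdot\text{se}(\hat{\beta}_1)\cdot\rho_{p1} & \cdots & \text{se}(\hat{\beta}_p)^2 \end{pmatrix}^{-1} = \frac{N_Y}{\hat{\sigma}_Y^2}\cdot\begin{pmatrix} \dfrac{1}{\hat{\sigma}_{g,1}^2} & \cdots & \dfrac{\rho_{1p}}{\hat{\sigma}_{g,1}\cdot\hat{\sigma}_{g,p}} \\ \vdots & \ddots & \vdots \\ \dfrac{\rho_{p1}}{\hat{\sigma}_{g,1}\cdot\hat{\sigma}_{g,p}} & \cdots & \dfrac{1}{\hat{\sigma}_{g,p}^2} \end{pmatrix}^{-1}$$

S-PrediXcan
  Estimate: $\hat{\pi} = \dfrac{\widehat{\text{Cov}}(X,Y)}{\hat{\sigma}_g^2} = \dfrac{\sum_k \hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}^2}{\sum_k \hat{\alpha}_k^2\hat{\sigma}_{g,k}^2}$
  Test statistic: $\dfrac{\sum_k \hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}^2}{\hat{\sigma}_g}\sqrt{\dfrac{N_Y}{\hat{\sigma}_Y^2}}$

FUSION
  Test statistic: $\dfrac{\sum_k \hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}}{\hat{\sigma}_g}\sqrt{\dfrac{N_Y}{\hat{\sigma}_Y^2}}$

Note: Since FUSION does not produce the point estimate and the standard error and
S-PrediXcan does not produce the standard error, these entries are not shown in the table.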
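The single-intermediate rows of Table 2.2 can be checked numerically. The following is a minimal sketch in Python, not the hJAM or MendelianRandomization implementation: all function names and input values are hypothetical, and it assumes independent SNPs with $\text{se}^2(\hat{\beta}_k) = \hat{\sigma}_Y^2/(N_Y\hat{\sigma}_{g,k}^2)$ as in Appendix 2A.2.

```python
import math

def hjam_single(alpha, beta, sigma2_g, sigma2_y, n_y):
    """hJAM (M=1) estimate and variance for independent SNPs (Table 2.2 form)."""
    numerator = sum(a * b * s for a, b, s in zip(alpha, beta, sigma2_g))
    sigma2_gene = sum(a * a * s for a, s in zip(alpha, sigma2_g))  # sigma_g^2
    return numerator / sigma2_gene, sigma2_y / (n_y * sigma2_gene)

def ivw_single(alpha, beta, se2_beta):
    """Fixed-effect IVW MR estimate and variance."""
    numerator = sum(a * b / s for a, b, s in zip(alpha, beta, se2_beta))
    denominator = sum(a * a / s for a, s in zip(alpha, se2_beta))
    return numerator / denominator, 1.0 / denominator

# Hypothetical summary statistics for three independent SNPs.
alpha = [0.30, 0.25, 0.40]      # SNP-intermediate estimates
beta = [0.030, 0.020, 0.045]    # marginal SNP-outcome estimates
sigma2_g = [0.32, 0.18, 0.42]   # SNP variances
sigma2_y, n_y = 1.0, 5000       # outcome variance and outcome GWAS sample size

# se^2(beta_k) = sigma_Y^2 / (N_Y * sigma_{g,k}^2)
se2_beta = [sigma2_y / (n_y * s) for s in sigma2_g]

est_hjam, var_hjam = hjam_single(alpha, beta, sigma2_g, sigma2_y, n_y)
est_ivw, var_ivw = ivw_single(alpha, beta, se2_beta)
z_hjam = est_hjam / math.sqrt(var_hjam)
print(est_hjam, est_ivw, var_hjam, var_ivw, z_hjam)
```

Under these assumptions the two estimates and the two variances agree up to floating-point error, which is the equivalence stated for one intermediate with independent SNPs.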
2.3. Simulation Studies
2.3.1. Simulation Settings
To assess the performance of hJAM, we performed an extensive set of simulation studies.
For each simulation, we simulated an intermediate matrix $\mathbf{X}$, an outcome vector $\mathbf{Y}$, and three
standardized individual-level genotype matrices $G_X$, $G_Y$, and $G_L$. $G_X$ is the SNP-intermediate data used
to obtain the $\hat{\mathbf{A}}$ matrix, $G_Y$ is the SNP-outcome data used to obtain the univariate $\hat{\mathbf{b}}$ vectors, and
$G_L$ is the reference data carrying the LD structure between the SNPs. Each of the three standardized
genotype matrices ($G_X$, $G_Y$, and $G_L$) is composed of two or three SNP blocks (i.e., $G_1$, $G_2$,
and $G_3$ in Figure 2.1). Supplementary Figure 2B.1 shows the relationship between $G_X$ and the
SNP blocks, which is the same as the relationship between $G_Y$/$G_L$ and the SNP blocks. Each SNP
block contains 10 SNPs, of which we set 3 SNPs to be causal for the intermediate, with $R^2_{G,X} = 0.1$
per SNP block. For each genotype matrix, we considered two inter-block relationships: no LD and
moderate LD ($r = 0.6$). The MAF was sampled from a uniform distribution on (0.05, 0.3). The sample
sizes of the genotype data sets were set to $N_{G_X} = 1000$, $N_{G_Y} = 5000$, and $N_{G_L} = 500$,
respectively.
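As an illustration of the genotype-generation step, the sketch below draws one block of independent, standardized SNPs with MAF sampled uniformly on (0.05, 0.3). It is not the simulation code used for the results: the function name is hypothetical, Hardy-Weinberg equilibrium is an added assumption, and the moderate-LD setting ($r = 0.6$) is not reproduced here.

```python
import random
import statistics

def simulate_snp_block(n_subjects, n_snps=10, maf_range=(0.05, 0.3), seed=1):
    """Return a list of standardized SNP columns (independent SNPs, no LD)."""
    rng = random.Random(seed)
    block = []
    for _ in range(n_snps):
        maf = rng.uniform(*maf_range)
        # Minor-allele count per subject: two independent allele draws
        # (assumes Hardy-Weinberg equilibrium, which the text does not state).
        counts = [(rng.random() < maf) + (rng.random() < maf)
                  for _ in range(n_subjects)]
        mean = statistics.fmean(counts)
        sd = statistics.pstdev(counts)
        if sd == 0:
            raise ValueError("monomorphic SNP drawn; increase n_subjects")
        # Standardize to mean 0 and variance 1.
        block.append([(c - mean) / sd for c in counts])
    return block

# One block of 10 SNPs for a data set the size of G_X (N = 1000).
g_block = simulate_snp_block(n_subjects=1000)
print(len(g_block), len(g_block[0]))
```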
Without loss of generality, we simulated two $X$'s under four scenarios representing
different causal models for the two intermediates that are likely to be encountered in
epidemiologic studies (Figure 2.1). For scenario A, $X_1$ and $X_2$ were independent. For scenarios B
and D, $X_1$ and $X_2$ were correlated through a shared SNP set $G_3$. The coefficient $\lambda$ in the causal
scenarios (Figure 2.1C and 2.1D) was set so that $R^2_{X_1,X_2} = 0.2$. These simulation scenarios are
similar to those described in Sanderson et al. (2019).
Figure 2.1 Simulation scenarios of different relationships between X's.
(A) $X_1$ and $X_2$ are independent. (B) $X_1$ and $X_2$ are correlated. (C) $X_1$ causes $X_2$. (D) $X_1$ causes
$X_2$ and the two are correlated. In the DAG, $G_1$, $G_2$, and $G_3$ represent three SNP blocks that contribute to
different intermediates. For example, in (B), $G_1$ and $G_3$ both contribute to $X_1$, and $G_2$ and $G_3$
contribute to $X_2$. Taken together, $G_1$, $G_2$, and $G_3$ compose each of $G_X$, $G_Y$, and $G_L$.
To evaluate the robustness of the different methods to the choice of reference set $G_L$, we
added a simulation in which the LD of $G_L$ differs from that of $G_X$ or $G_Y$ in scenario B (Figure 2.1B).
We also investigated the sensitivity of performance to (1) lower heritability for the
intermediates, with $R^2_{G,X} = 0.05$ per SNP block; and (2) an unknown confounder
simulated between the intermediate $X_1$ and the outcome $\mathbf{Y}$.
The primary objective was to estimate $\hat{\pi}$ with each true $\pi_i$ set to null ($\pi_i = 0$) or a
positive effect ($\pi_i = 0.1$). To mimic applied analyses and to ensure selection of at least two
SNPs, a forward selection on $\hat{\mathbf{A}}$ was performed in the analysis step to exclude noninformative variants,
using a threshold of $P < 0.2$. All simulation analyses were performed in R
version 3.4.0. Results were calculated from 1000 replications for each scenario. All tests were
two-sided with a type-I error of 0.05.
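The empirical type-I error and power summarized over replications amount to the share of replicates in which a two-sided test rejects at the 0.05 level. A minimal sketch follows, assuming a Wald test with a normal reference distribution; the function name and the four replicate values are hypothetical and only illustrate the calculation.

```python
from statistics import NormalDist

def rejection_rate(estimates, std_errors, alpha=0.05):
    """Share of replicates whose two-sided Wald test rejects at level alpha."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    rejected = sum(abs(est / se) > critical
                   for est, se in zip(estimates, std_errors))
    return rejected / len(estimates)

# Hypothetical output from four replicates (a real run would use 1000).
estimates = [0.000, 0.050, -0.050, 0.010]
std_errors = [0.020, 0.020, 0.020, 0.020]
rate = rejection_rate(estimates, std_errors)
print(rate)
```

When the true effect is null, this rate estimates the type-I error; when the true effect is non-null, it estimates power.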
2.3.2. Simulation Results
Simulation results from the base scenario A, where $X_1$ and $X_2$ were independent,
demonstrate that the estimates from most methods were unbiased. However, when IVW MR and
MVIVW MR did not incorporate the LD structure, there was a slightly inflated type-I error under
simulation scenarios with correlated SNPs (Figure 2.2). IVW MR with and without correlation
had a less precise estimate and lower power compared to the other methods in scenario A (Figure
2.3, Supplementary Table 2B.1). When a pleiotropic effect was simulated for each intermediate
(scenarios B to D), the estimates from hJAM and MVIVW MR with LD were unbiased and had a
correct type-I error for the corresponding intermediate (Figure 2.2). The estimates from MVIVW
MR without LD were unbiased but showed an inflated type-I error due to a smaller estimated
standard error in scenarios in which SNPs were correlated (Figure 2.2). IVW MR and
S-PrediXcan had a biased estimate and an inflated type-I error regardless of the correlation
structure of the SNPs in the presence of pleiotropy. The results for MVIVW MR and IVW MR
reflect specification of the LD structure for the instruments when using the
MendelianRandomization (Yavorska & Burgess, 2017) package. Results without the LD structure
showed poor performance, as indicated by increased type-I errors.
In addition, we found that the results from hJAM and MVIVW MR remained consistent even when
the LD in the reference $G_L$ data differed from that of $G_X$ or $G_Y$ (Supplementary Figure 2B.2). For
simulations with $R^2_{G,X} = 0.05$ per SNP block and with additional unknown confounders, the results were
consistent with the main simulation studies (Supplementary Figures 2B.3-4 and 2B.5-6).
Figure 2.2 Empirical power of the correlated SNPs scenarios across 1000 replications.
(A) $X_1$ and $X_2$ are independent. (B) $X_1$ and $X_2$ are correlated. (C) $X_1$ causes $X_2$. (D) $X_1$ causes
$X_2$ and the two are correlated. The black solid line refers to the default type-I error, $\alpha = 0.05$.
Figure 2.3 Average estimates and 95% confidence intervals of the correlated SNPs
scenarios across 1000 replications.
(A) $X_1$ and $X_2$ are independent. (B) $X_1$ and $X_2$ are correlated. (C) $X_1$ causes $X_2$. (D) $X_1$ causes
$X_2$ and the two are correlated. The black solid line refers to the default type-I error, $\alpha = 0.05$.
2.4. Data Applications
To demonstrate hJAM on real data, we applied the methods to two examples: 1)
estimating the effects of body mass index (BMI) and type 2 diabetes (T2D) on myocardial
infarction (MI); and 2) estimating the effects of two genes on prostate cancer risk. As the study
populations for both examples include individuals of European ancestry, we used the 503
European-ancestry subjects from the 1000 Genomes Project (Consortium, 2015) as our reference
data for the LD structure.
2.4.1. Causal Effect of BMI and T2D on Myocardial Infarction
Previous studies have shown that obesity (Lauer et al., 1991; Yusuf et al., 2005) and T2D
(Barrett-Connor et al., 1991; Manson et al., 1991) are two important risk factors for MI. In
addition, the association between obesity and T2D is well-established (Group, 2010; Kahn et al.,
2006). A directed acyclic graph (DAG) shows the relationships between the two risk factors and
MI (Figure 2.4).
Figure 2.4 Directed acyclic graph (DAG) of the relationship between body mass index (BMI),
type 2 diabetes (T2D) and myocardial infarction.
On the far left, three SNP sets denote the sets of SNPs used in the
analysis: SNPs that are only associated with obesity, SNPs that are only associated with
diabetes, and SNPs that are associated with both obesity and diabetes. The solid arrows denote
established effects, while the dashed arrows denote the associations of interest.
To examine the conditional effects of the two risk factors, we extracted the summary
statistics for MI, BMI, and T2D from the UK Biobank (n = 459,324) (Sudlow et al., 2015),
GIANT consortium (n = 339,224) (Locke et al., 2015), and DIAGRAM+GERA+UKB (n =
659,316) (Xue et al., 2018), respectively. In total, 75 SNPs and 136 SNPs were identified as
genome-wide significant for BMI and T2D, respectively (Supplementary Figure 2B.7). One SNP
appeared in the instrument sets for both BMI and T2D
(rs7903146, $\alpha_{\text{BMI}} = -0.016$, $P_{\text{BMI}} = 1.4\times10^{-14}$, $\alpha_{\text{T2D}} = 0.319$, and $P_{\text{T2D}} = 1.7\times10^{-204}$). This
SNP is a well-known T2D-associated SNP and has been identified as a BMI-associated hit in
GIANT. Additionally, four correlated pairs of SNPs exist between the two sets (Supplementary
Table 2B.5).
Results are shown in Table 2.3. All methods suggested a significantly increased risk of
MI with an increased BMI and with the presence of T2D. This agrees with previous studies (Manson
et al., 1991; Yusuf et al., 2005). The magnitudes of the estimates from hJAM and MVIVW MR were similar, while
IVW MR and S-PrediXcan showed larger estimated values for BMI. The odds ratios (OR) from hJAM for
the risk of MI were 1.38 (95% CI=1.22, 1.56) per one-unit
increase in BMI and 1.16 (95% CI=1.12, 1.20) for having T2D. MVIVW MR with LD gave similar estimates of
1.37 (95% CI=1.22, 1.54) and 1.15 (95% CI=1.11, 1.19) for BMI and having T2D, respectively.
The difference in estimates between the multivariate approaches and the univariate MR/TWAS
approaches may be attributed to potential pleiotropy not accounted for in the analyses that do not
model the intermediates jointly.
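Odds ratios and 95% confidence intervals of this kind are typically obtained by exponentiating an estimate and its Wald limits on the log-odds scale. The sketch below illustrates that conversion under a normal approximation; it is an assumption about the reporting step rather than a description of the analysis code, and the function name and input values are hypothetical.

```python
import math
from statistics import NormalDist

def odds_ratio_ci(log_or, se, level=0.95):
    """Exponentiate a log-odds estimate and its Wald confidence limits."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))

# Hypothetical log-odds estimate and standard error.
or_hat, lower, upper = odds_ratio_ci(log_or=0.30, se=0.05)
print(round(or_hat, 2), round(lower, 2), round(upper, 2))
```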
Table 2.3 Causal odds ratios (95% confidence interval) for myocardial infarction per unit increase
in body mass index and for having type 2 diabetes.

Methods               Odds ratio (95% CI)    P
Body Mass Index
  hJAM                1.38 (1.22, 1.56)      3.19E-07
  MVIVW MR            1.37 (1.22, 1.54)      1.94E-07
  MVIVW MR (w/o LD)   1.34 (1.20, 1.49)      1.65E-07
  IVW MR              1.54 (1.32, 1.79)      2.07E-08
  IVW MR (w/o LD)     1.53 (1.32, 1.77)      1.45E-08
  S-PrediXcan         1.66 (1.58, 1.74)      9.88E-96
Type 2 Diabetes
  hJAM                1.16 (1.12, 1.20)      4.12E-11
  MVIVW MR            1.15 (1.11, 1.19)      8.34E-12
  MVIVW MR (w/o LD)   1.16 (1.11, 1.20)      1.29E-11
  IVW MR              1.15 (1.11, 1.20)      1.77E-14
  IVW MR (w/o LD)     1.15 (1.11, 1.20)      1.98E-14
  S-PrediXcan         1.14 (1.11, 1.16)      9.43E-109

Abbreviation: w/o LD, without linkage disequilibrium adjustment; s.e., standard error; 95% CI,
95% confidence interval.
Note: * For MR-Egger (intercept), we showed log odds ratio and its 95% CI.
2.4.2. Causal Effect of PM20D1 and NUCKS1 on Prostate Cancer Risk
To further illustrate the benefit of hJAM, we next considered the gene-prostate cancer
risk association of two genes on chromosome 1q32.1, gene PM20D1 (Peptidase M20 Domain
Containing 1) and gene NUCKS1 (Nuclear Casein Kinase and Cyclin Dependent Kinase
Substrate 1). Both PM20D1 and NUCKS1 are protein coding genes and previous transcriptome
studies have found a significant effect of both PM20D1 and NUCKS1 on the risk of prostate
cancer among a European-ancestry population (Mancuso et al., 2018; Wu et al., 2019b). Due to
the close proximity of the two genes along the genome, there is a potential for a univariate
approach to result in biased estimates. To examine the effects jointly, we applied hJAM to this
research question.
We constructed the $\hat{\mathbf{A}}$ matrix with 114 eQTL estimates with false discovery rate (FDR) <
0.05 for the two genes from GTEx v7 (Lonsdale et al., 2013). Among the 114 eQTLs, one locus
has significant associations with both PM20D1 and NUCKS1. To limit the correlation between
the eQTLs, we used priority pruner (Edlund et al.) to prune the eQTLs by limiting the squared
pairwise correlation coefficient $r^2$ and using the magnitude of the eQTL association effect
estimates on each gene as the priority criterion. The genome-wide summary statistics for the risk
of prostate cancer were taken from a published GWAS with more than 140,000 European-ancestry
men (Schumacher et al., 2018).
Figure 2.5 displays forest plots of the OR (95% CI) and p-values of the gene effects on
prostate cancer risk by each $r^2$ cutoff ($r^2 = 0.3, 0.4, 0.5,$ and $0.6$) used for LD pruning of the SNPs
selected for the analysis. In general, all approaches show consistent estimates of effect across
the various sets of SNPs included for analysis (from $r^2 = 0.3$ to $r^2 = 0.6$), but the significance
level is sensitive to the choice and number of SNPs selected. We observe that hJAM and
MVIVW MR with LD yield significant results for NUCKS1 with the pruning cutoffs $r^2 = 0.4$
and $r^2 = 0.6$. Both of these approaches show no significant effect of PM20D1 on the risk of
prostate cancer regardless of the pruning cutoff used to select the SNPs. The univariate models,
including IVW MR and S-PrediXcan, result in a significant positive effect on prostate cancer
risk for PM20D1 and NUCKS1. We attribute the significance in the univariate models to
the correlation between the two genes and the LD between the eQTLs, both of which can be adjusted
for by hJAM and MVIVW MR with LD.
Figure 2.5 Causal odds ratios (95% confidence interval) for prostate cancer risk per unit
increase in gene expression reads.
The y-axis displays the cutoff of $r^2$ that was used to prune the eQTLs for the analysis. Next to
each OR and 95% CI, we show the corresponding p-value and the number of eQTLs (i.e. k) used in
the analysis. For example, for NUCKS1, hJAM with $r^2 = 0.3$, we used 4 eQTLs in the analysis
(i.e. k=4).
2.5. Discussion
In this chapter, we have proposed a two-stage hierarchical model which unifies the
framework of Mendelian randomization and transcriptome-wide association tools and can be
applied to correlated instruments and multiple intermediates. We have implemented the method
in an R package (hJAM) which is now available on CRAN.
When only one intermediate or multiple independent intermediates are present, hJAM
yields an estimate and standard error equivalent to those of alternative approaches (see Appendix 2A.2).
However, when intermediates are correlated, only MVIVW MR showed performance comparable
to hJAM under the independent SNPs scenarios. For correlated SNPs scenarios,
when the LD structure is specified, the estimates of hJAM are empirically equivalent to those of MVIVW
MR, although the two approaches use slightly different weight matrices: hJAM uses the
adjusted variance-covariance matrix of SNPs from a reference panel, while MVIVW MR uses an
inverse-variance matrix. Nevertheless, we believe that the hJAM formulation offers several
advantages through its flexibility in specifying the $\hat{\mathbf{A}}$ matrix. As in TWAS, this matrix can contain eQTL
estimates or, as in more classical MR approaches, SNP-intermediate associations.
Moreover, it can incorporate other types of prior information such as functional or genomic
annotation or information from metabolomic studies (Alexander Gusev et al., 2016). Inclusion of
this type of annotation information can offer potential advantages for characterization of SNP
effects, as demonstrated in the hierarchical modeling context (Chen & Witte, 2007; Conti &
Witte, 2003; Lewinger et al., 2007). In addition, optimal construction of the $\hat{\mathbf{A}}$ matrix for high
dimensional data is an area that needs further investigation.
Although hJAM provides an overall improvement over most existing MR methods, it is
also susceptible to the caveats of these types of approaches. First, even though hJAM can
handle measured pleiotropy, it may be subject to bias in estimation due to unknown
pleiotropy. One potential solution is to account for the unknown pleiotropic effect by employing
Egger regression, which treats the effect of pleiotropy as a bias term (Chapter 3).
Second, the effects of the SNPs on the intermediates, and/or the causal effect of the intermediates
on the outcome, may include interactions or be non-linear. One way to address the presence of
interactions is to restrict the analysis to summary data from stratified GWAS; however, this may reduce
power due to the smaller sample size of the subset GWAS. For the presence of non-linear
relationships, potential approaches include modifying the two-stage analysis by incorporating a
nonlinear function in the second stage, or more formally incorporating methods for investigating
the shape of the exposure-outcome relationship, such as fractional polynomials or piece-wise
linear approaches (Staley & Burgess, 2017).
In applications, population structure may introduce potential difficulties for
hJAM, as it does for all MR and TWAS approaches using summary statistics. First, hJAM
relies on the association statistics being free of bias from confounding by
population structure. This includes the summary data for the SNP-intermediate associations in the $\hat{\mathbf{A}}$
matrix, as well as the marginal SNP-outcome associations used within the hJAM model.
However, given that modern techniques to account for population structure are often sufficient
(Price et al., 2006; Runcie & Crawford, 2019), this is a fair assumption. Additionally, to account
for the correlation structure between SNPs, hJAM assumes that the LD structure estimated from
the reference data is the same as that of the study data used to generate the summary statistics. Since
hJAM and MVIVW MR incorporate the correlation structure of SNPs in slightly different
weight matrices, there is the potential for this to impact these methods differently. However, in a
limited set of simulations we found that both methods are fairly robust to scenarios in which the
reference data and the association data have modest differences in LD structure.
In contrast to most current methods that rely on independent SNPs or analyze
intermediates in isolation, we propose a two-stage hierarchical model to jointly model summary
statistics (hJAM) for correlated SNPs and multiple intermediates within Mendelian
Randomization and TWAS. As technology expands the potential use of these types of studies to
proteomic, methylation and metabolomic data, such flexible approaches are needed to account
for the potential increase in complexity in underlying relationships between factors (Chapter 4).
Appendix 2A. Supplementary Methods
2A.1. Joint Analysis of Marginal Summary Statistics (JAM)
Joint Analysis of Marginal summary statistics (JAM) (Newcombe et al., 2016) is a
scalable approach which is designed to re-analyze the published marginal summary statistics
from GWAS under a joint multi-SNPs model, rule out the noisy SNPs whose significance was
attributed to the LD structure, and ultimately target the true causal genetic variants for fine-mapping.
The original JAM is composed of two steps: (1) converting the summary data into a JAM
framework by using a multivariate normal distribution, and (2) performing variable selection via
a Bayesian framework and reversible-jump MCMC algorithm. JAM has been widely employed
as a fine-mapping algorithm (Conti et al., 2017).
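Assuming an additive effect of the effect allele on the phenotype, we can express the
phenotype as a linear function of the individual genotype data:
$$\mathbf{Y} \sim N(\mathbf{G}\boldsymbol{\beta}, \sigma^2\mathbf{I}_N),$$
and convert it into a multivariate normal distribution of a summary statistics-based vector, $\mathbf{z} := \mathbf{G}'\mathbf{y}$,
$$\mathbf{z} \sim MVN_p(\mathbf{G}'\mathbf{G}\boldsymbol{\beta}, \sigma^2\mathbf{G}'\mathbf{G}),$$
where $\boldsymbol{\beta}$ is the vector of conditional effects of the SNPs on the phenotype $\mathbf{Y}$ and $\mathbf{G}$ is the SNP genotype matrix. JAM then
performs a Cholesky decomposition to transform the multivariate likelihood into a vector of
independent statistics, $\mathbf{z}_L = \mathbf{L}'^{-1}\mathbf{z}$, as
$$\mathbf{z}_L \sim MVN_p(\mathbf{L}\boldsymbol{\beta}, \hat{\sigma}^2\mathbf{I}_p),$$
where $\mathbf{L}'^{-1}$ is the inverse of the transposed Cholesky factor obtained from $\mathbf{L}'\mathbf{L} = \mathbf{G}'\mathbf{G}$.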
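The Cholesky transformation can be illustrated on a small example. The sketch below is not the JAM implementation: the helper names and the 2x2 matrix are hypothetical. It computes a lower-triangular factor $\mathbf{C}$ with $\mathbf{C}\mathbf{C}' = \mathbf{G}'\mathbf{G}$, so that $\mathbf{L} = \mathbf{C}'$ satisfies $\mathbf{L}'\mathbf{L} = \mathbf{G}'\mathbf{G}$, and checks that applying $\mathbf{L}'^{-1}$ to the mean of $\mathbf{z}$ gives $\mathbf{L}\boldsymbol{\beta}$.

```python
import math

def cholesky_lower(a):
    """Lower-triangular C with C C' = a, for a symmetric positive-definite a."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                c[i][j] = math.sqrt(a[i][i] - s)
            else:
                c[i][j] = (a[i][j] - s) / c[j][j]
    return c

def forward_solve(c, b):
    """Solve c x = b for lower-triangular c (applies C^{-1} = L'^{-1})."""
    x = []
    for i, row in enumerate(c):
        x.append((b[i] - sum(row[k] * x[k] for k in range(i))) / row[i])
    return x

# Hypothetical G'G for two correlated SNPs and hypothetical conditional effects.
gtg = [[4.0, 2.0], [2.0, 3.0]]
beta = [0.5, -0.2]

c = cholesky_lower(gtg)                       # C, so L = C'
z_mean = [sum(gtg[i][j] * beta[j] for j in range(2)) for i in range(2)]  # G'G beta
z_l_mean = forward_solve(c, z_mean)           # L'^{-1} (G'G beta)
l_beta = [sum(c[j][i] * beta[j] for j in range(2)) for i in range(2)]    # L beta
print(z_l_mean, l_beta)
```

The two printed vectors coincide, which is the statement that $\mathbf{z}_L$ has mean $\mathbf{L}\boldsymbol{\beta}$.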
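In the second step, JAM selects the best model by using Bayesian variable selection.
It uses the set of independent Gaussian distributions derived in the first step as the conditional
likelihood:
$$p(\mathbf{z}_L \mid \gamma, \boldsymbol{\beta}_\gamma, \sigma^2) = MVN_p(\mathbf{L}_\gamma\boldsymbol{\beta}_\gamma, \sigma^2\mathbf{I}_p),$$
where $\gamma$ denotes the current model and $\mathbf{L}_\gamma$ denotes the sub-matrix of $\mathbf{L}$ which contains the SNPs
in model $\gamma$ only. Observing that the conditional likelihood has a simple linear form, the conjugate prior
of $(\boldsymbol{\beta}_\gamma, \sigma^2)$ follows a normal-inverse-gamma distribution, i.e.
$$p(\boldsymbol{\beta}_\gamma \mid \sigma^2, \gamma) = MVN_p(\mathbf{m}_\gamma, \sigma^2\boldsymbol{\Sigma}_\gamma)$$
$$p(\sigma^2 \mid \gamma) = p(\sigma^2) = InvGa(a_\sigma, b_\sigma).$$
JAM then expresses the posterior probability as
$$p(\gamma \mid \mathbf{z}_L) \propto \int p(\boldsymbol{\beta}_\gamma, \sigma^2, \gamma \mid \mathbf{z}_L)\, d\boldsymbol{\beta}_\gamma\, d\sigma^2 \propto \int p(\mathbf{z}_L \mid \boldsymbol{\beta}_\gamma, \sigma^2, \gamma)\, p(\boldsymbol{\beta}_\gamma \mid \sigma^2, \gamma)\, p(\sigma^2 \mid \gamma)\, d\boldsymbol{\beta}_\gamma\, d\sigma^2, \tag{2A.1}$$
which will be used to measure the importance of model $\gamma \in \Gamma$. It has been proved that the
integration over $\boldsymbol{\beta}$ and $\sigma^2$ may lead to a closed form expression (Brown et al., 1998). The prior is
assigned to be a binomial prior with a beta-binomial conjugate (Scott & Berger, 2010) over
models as
$$p(\gamma) = \int P(\gamma \mid \omega)P(\omega)\, d\omega = \frac{B(\gamma'\mathbf{I}_P + a_\omega,\; P - \gamma'\mathbf{I}_P + b_\omega)}{B(a_\omega, b_\omega)}, \tag{2A.2}$$
where $B(\cdot, \cdot)$ is the beta function, $P$ is the number of SNPs in the model, and $\omega$ is assigned
a beta hyper-prior:
$$\omega \sim Beta(a_\omega, b_\omega).$$
Here, the prior of the beta-binomial conjugate is set to $a_\omega = 1$ and $b_\omega = 9$, which represents a
weakly informative prior. Combining Eq. 2A.1 and 2A.2, the marginal posterior of model $\gamma$ can
be expressed as
$$p(\gamma \mid \mathbf{z}_L) \propto p(\gamma)\, p(\mathbf{z}_L \mid \gamma).$$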
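The beta-binomial model prior in Eq. 2A.2 depends on a model only through the number of selected SNPs, $\gamma'\mathbf{I}_P$. The sketch below evaluates it with $a_\omega = 1$ and $b_\omega = 9$ and checks that the prior sums to one over all models. It is an illustration rather than the JAM code: the function names are hypothetical, and $P$ is taken to be the total number of candidate SNPs (an assumption about the intended meaning).

```python
import math

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def model_prior(n_selected, n_snps, a_omega=1.0, b_omega=9.0):
    """Eq. 2A.2: prior probability of one specific model with n_selected SNPs."""
    return math.exp(log_beta(n_selected + a_omega, n_snps - n_selected + b_omega)
                    - log_beta(a_omega, b_omega))

n_snps = 10  # hypothetical number of candidate SNPs
# There are C(P, k) distinct models of size k, each with the same prior mass.
total = sum(math.comb(n_snps, k) * model_prior(k, n_snps)
            for k in range(n_snps + 1))
print(model_prior(0, n_snps), total)
```

With these hyperparameters the empty model receives the largest mass of any single model, so the prior favors sparse models.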
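The block independent decomposition was designed and implemented in this step in
order to accelerate the variable selection algorithm (Newcombe et al., 2016). Having selected the
model $\gamma$, we can infer the posterior distribution because of the conjugate normal-inverse-gamma
structure. The posterior inference of the conditional SNP effects can be expressed as
$$p(\sigma^2 \mid \mathbf{z}_L, \gamma) = InvGa\left(a_\sigma + \frac{P}{2},\; b_\sigma + \frac{s^2}{2} + \frac{\hat{\boldsymbol{\beta}}_\gamma'\mathbf{L}_\gamma'\mathbf{L}_\gamma\hat{\boldsymbol{\beta}}_\gamma}{2(\tau + 1)}\right),$$
$$p(\boldsymbol{\beta}_\gamma \mid \mathbf{z}_L, \sigma^2, \gamma) = MVN\left(\frac{\tau\hat{\boldsymbol{\beta}}_\gamma}{1 + \tau},\; \frac{\tau\sigma^2(\mathbf{L}_\gamma'\mathbf{L}_\gamma)^{-1}}{1 + \tau}\right),$$
where $\hat{\boldsymbol{\beta}}_\gamma = (\mathbf{L}_\gamma'\mathbf{L}_\gamma)^{-1}\mathbf{L}_\gamma'\mathbf{z}_L$ and $s^2 = (\mathbf{z}_L - \mathbf{L}_\gamma\hat{\boldsymbol{\beta}}_\gamma)'(\mathbf{z}_L - \mathbf{L}_\gamma\hat{\boldsymbol{\beta}}_\gamma)$. Here, $\tau$ is an unknown
parameter that controls the shrinkage of $\boldsymbol{\Sigma}_\gamma = \tau(\mathbf{L}_\gamma'\mathbf{L}_\gamma)^{-1}$ in the conjugate prior $p(\boldsymbol{\beta}_\gamma \mid \sigma^2, \gamma)$
(Fernandez et al., 2001; Liang et al., 2008).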
2A.2. Theoretical justifications of hJAM and the competing approaches
2A.2.1. Hierarchical Joint Analysis of Summary Statistics (hJAM)
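To theoretically compare with other approaches, we write the hJAM estimates in another
form under several assumptions. Using the distribution shown in Eq. 2.6 and assuming that the
SNPs are independent, we can simplify the estimate of hJAM as
$$\hat{\boldsymbol{\pi}}_{\text{hJAM}} = \left((\mathbf{L}\hat{\mathbf{A}})'(\mathbf{L}\hat{\mathbf{A}})\right)^{-1}(\mathbf{L}\hat{\mathbf{A}})'\mathbf{z}_L = (\hat{\mathbf{A}}'\mathbf{L}'\mathbf{L}\hat{\mathbf{A}})^{-1}\hat{\mathbf{A}}'\mathbf{L}'\mathbf{L}'^{-1}\mathbf{z} = (\hat{\mathbf{A}}'\mathbf{G}_0'\mathbf{G}_0\hat{\mathbf{A}})^{-1}\hat{\mathbf{A}}'\mathbf{G}_0'\mathbf{y}$$
$$= \frac{1}{N_Y}(\hat{\mathbf{A}}'\boldsymbol{\Gamma}_g\hat{\mathbf{A}})^{-1}\hat{\mathbf{A}}'\mathbf{G}_0'\mathbf{G}_0\hat{\boldsymbol{\beta}} = (\hat{\mathbf{A}}'\boldsymbol{\Gamma}_g\hat{\mathbf{A}})^{-1}\hat{\mathbf{A}}'\boldsymbol{\Gamma}_g\hat{\boldsymbol{\beta}},$$
with variance being estimated as
$$\text{Var}(\hat{\boldsymbol{\pi}}_{\text{hJAM}}) = \left((\mathbf{L}\hat{\mathbf{A}})'(\mathbf{L}\hat{\mathbf{A}})\right)^{-1}\sigma_Y^2 = \frac{1}{N_Y}(\hat{\mathbf{A}}'\boldsymbol{\Gamma}_g\hat{\mathbf{A}})^{-1}\sigma_Y^2.$$
Then, denoting $\hat{\sigma}_g^2 = \hat{\mathbf{A}}'\boldsymbol{\Gamma}_g\hat{\mathbf{A}}$, when $M = 1$ with all instruments being independent, we have
$$\hat{\pi}_{\text{hJAM}} = \frac{1}{N_Y\hat{\sigma}_g^2}\hat{\mathbf{A}}'\mathbf{G}_0'\mathbf{G}_0\hat{\boldsymbol{\beta}} = \frac{1}{N_Y\hat{\sigma}_g^2}\sum_k \hat{\alpha}_k\cdot\hat{\beta}_k\cdot g_{kk}^2 = \frac{\sum_k \hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}^2}{\sum_k \hat{\alpha}_k^2\hat{\sigma}_{g,k}^2},$$
where $g_{kk}^2$ denotes the $(k,k)^{th}$ element of $\mathbf{G}_0'\mathbf{G}_0$, $\hat{\sigma}_{g,k}^2 = g_{kk}^2/N_Y$ denotes the variance of SNP $k$,
and $\hat{\sigma}_g^2 = \hat{\mathbf{A}}'\boldsymbol{\Gamma}_g\hat{\mathbf{A}} = \sum_k \alpha_k\cdot\hat{\sigma}_{g,k}^2\cdot\alpha_k = \sum_k \alpha_k^2\hat{\sigma}_{g,k}^2$.
For the variance of the estimate, we have
$$\text{Var}(\hat{\pi}_{\text{hJAM}}) = \frac{\hat{\sigma}_Y^2}{N_Y}(\hat{\mathbf{A}}'\boldsymbol{\Gamma}_g\hat{\mathbf{A}})^{-1} = \frac{\hat{\sigma}_Y^2}{N_Y\hat{\sigma}_g^2}.$$
The Z test statistic is
$$Z_{\text{hJAM}} = \frac{\hat{\pi}_{\text{hJAM}}}{\sqrt{\text{Var}(\hat{\pi}_{\text{hJAM}})}} = \frac{\sum_k \hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}^2}{\hat{\sigma}_g^2}\cdot\sqrt{\frac{N_Y\hat{\sigma}_g^2}{\hat{\sigma}_Y^2}} = \frac{\sum_k \hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}^2}{\hat{\sigma}_g}\cdot\sqrt{\frac{N_Y}{\hat{\sigma}_Y^2}}.$$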
2A.2.2. Multivariate Inverse-variance Weighted Mendelian Randomization
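We followed the implementation of MVIVW MR in the R package MendelianRandomization
(Yavorska & Burgess, 2017) for the theoretical comparison. When no LD is specified, the
estimates and corresponding standard errors are the weighted least squares (maximum
likelihood) solutions of the regression
$$\hat{\boldsymbol{\beta}} \sim \hat{\mathbf{A}},$$
and the solutions are
$$\hat{\boldsymbol{\pi}}_{\text{MVIVW}} = (\hat{\mathbf{A}}'\mathbf{W}_{\text{MVIVW}}\hat{\mathbf{A}})^{-1}\hat{\mathbf{A}}'\mathbf{W}_{\text{MVIVW}}\hat{\boldsymbol{\beta}}$$
and
$$\text{Var}(\hat{\boldsymbol{\pi}}_{\text{MVIVW}}) = \text{diag}\left((\hat{\mathbf{A}}'\mathbf{W}_{\text{MVIVW}}\hat{\mathbf{A}})^{-1}\right),$$
where $\mathbf{W}_{\text{MVIVW, independent}}$ is a diagonal matrix of the inverse squared standard errors of $\hat{\beta}$, i.e.
$$\mathbf{W}_{\text{MVIVW, independent}} = \begin{pmatrix} \dfrac{1}{\text{se}(\hat{\beta}_1)^2} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \dfrac{1}{\text{se}(\hat{\beta}_p)^2} \end{pmatrix} = \frac{N_Y}{\hat{\sigma}_Y^2}\cdot\begin{pmatrix} \hat{\sigma}_{g,1}^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \hat{\sigma}_{g,p}^2 \end{pmatrix}.$$
Here, $N_Y$ denotes the sample size of $G_Y$. In hJAM, we have a similar form of the estimate as in
MVIVW MR. If we rewrite the variance of the estimates from hJAM as $\text{Var}(\hat{\boldsymbol{\pi}}_{\text{hJAM}}) = \text{diag}\left((\hat{\mathbf{A}}'\mathbf{W}_{\text{hJAM}}\hat{\mathbf{A}})^{-1}\right)$, we can express the weight matrix of hJAM as
$$\mathbf{W}_{\text{hJAM}} = \frac{N_Y}{\hat{\sigma}_Y^2}\cdot\boldsymbol{\Gamma}_g = \frac{N_Y}{\hat{\sigma}_Y^2}\cdot\begin{pmatrix} \hat{\sigma}_{g,1}^2 & \cdots & \hat{\sigma}_{g,1}\cdot\hat{\sigma}_{g,p}\cdot\rho_{1p} \\ \vdots & \ddots & \vdots \\ \hat{\sigma}_{g,p}\cdot\hat{\sigma}_{g,1}\cdot\rho_{p1} & \cdots & \hat{\sigma}_{g,p}^2 \end{pmatrix},$$
where $\rho_{ij}$ denotes the correlation coefficient between the $i^{th}$ SNP and the $j^{th}$ SNP, obtained from the
reference panel (i.e. $G_L$). Our estimate is equivalent to the estimate from MVIVW MR when the SNPs
are independent, i.e. $\rho_{ij} = 0$ for all $i \neq j$. Note that, if the sample size of the reference panel, $\mathbf{G}_L$,
and that of $\mathbf{G}_Y$ are different, we modify the $\mathbf{G}_0'\mathbf{G}_0$ matrix with the summary statistics from
$\mathbf{G}_Y$ and scale the variances and covariances accordingly (details in the supplementary materials
of Newcombe et al., 2016). When the correlation is specified, the weight matrix $\mathbf{W}$ of MVIVW MR
changes into
$$\mathbf{W}_{\text{MVIVW, correlated}} = \left((\mathbf{B}\otimes\mathbf{B})\cdot\boldsymbol{\Sigma}\right)^{-1} = \begin{pmatrix} \text{se}(\hat{\beta}_1)^2 & \cdots & \text{se}(\hat{\beta}_1)\cdot\text{se}(\hat{\beta}_p)\cdot\rho_{1p} \\ \vdots & \ddots & \vdots \\ \text{se}(\hat{\beta}_p)\cdot\text{se}(\hat{\beta}_1)\cdot\rho_{p1} & \cdots & \text{se}(\hat{\beta}_p)^2 \end{pmatrix}^{-1}$$
$$= \frac{N_Y}{\hat{\sigma}_Y^2}\cdot\begin{pmatrix} \dfrac{1}{\hat{\sigma}_{g,1}^2} & \cdots & \dfrac{\rho_{1p}}{\hat{\sigma}_{g,1}\cdot\hat{\sigma}_{g,p}} \\ \vdots & \ddots & \vdots \\ \dfrac{\rho_{p1}}{\hat{\sigma}_{g,1}\cdot\hat{\sigma}_{g,p}} & \cdots & \dfrac{1}{\hat{\sigma}_{g,p}^2} \end{pmatrix}^{-1},$$
where $\mathbf{B}$ denotes the vector of standard errors of the $\hat{\beta}$'s and $\boldsymbol{\Sigma}$ denotes the square correlation
matrix of the SNPs. When all SNPs are independent, we can observe that
$\mathbf{W}_{\text{MVIVW, correlated}} = \mathbf{W}_{\text{MVIVW, independent}}$ since
$$\begin{pmatrix} \text{se}(\hat{\beta}_1)^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \text{se}(\hat{\beta}_p)^2 \end{pmatrix}^{-1} = \begin{pmatrix} \dfrac{1}{\text{se}(\hat{\beta}_1)^2} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \dfrac{1}{\text{se}(\hat{\beta}_p)^2} \end{pmatrix}.$$
From the comparison, we can observe that the weight matrices differ between hJAM
and MVIVW MR.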
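The difference between the two weight matrices can be seen in a two-SNP example. The sketch below builds $\mathbf{W}_{\text{hJAM}}$ and $\mathbf{W}_{\text{MVIVW, correlated}}$ from the expressions in this section, using $\text{se}(\hat{\beta}_k)^2 = \hat{\sigma}_Y^2/(N_Y\hat{\sigma}_{g,k}^2)$. The function names and input values are hypothetical, and this is not the code of either package.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def w_hjam(sigma_g, rho, sigma2_y, n_y):
    """(N_Y / sigma_Y^2) * Gamma_g for two SNPs with correlation rho."""
    s1, s2 = sigma_g
    scale = n_y / sigma2_y
    return [[scale * s1 * s1, scale * s1 * s2 * rho],
            [scale * s2 * s1 * rho, scale * s2 * s2]]

def w_mvivw(sigma_g, rho, sigma2_y, n_y):
    """Inverse of the matrix with entries se_i * se_j * rho_ij."""
    se1, se2 = [(sigma2_y / (n_y * s * s)) ** 0.5 for s in sigma_g]
    return inverse_2x2([[se1 * se1, se1 * se2 * rho],
                        [se2 * se1 * rho, se2 * se2]])

def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

sigma_g = (0.6, 0.5)       # hypothetical SNP standard deviations
sigma2_y, n_y = 1.0, 5000  # hypothetical outcome variance and sample size

diff_independent = max_abs_diff(w_hjam(sigma_g, 0.0, sigma2_y, n_y),
                                w_mvivw(sigma_g, 0.0, sigma2_y, n_y))
diff_correlated = max_abs_diff(w_hjam(sigma_g, 0.6, sigma2_y, n_y),
                               w_mvivw(sigma_g, 0.6, sigma2_y, n_y))
print(diff_independent, diff_correlated)
```

With $\rho = 0$ the two matrices agree up to rounding, while with $\rho = 0.6$ they differ, consistent with the comparison above.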
2A.2.3. Inverse-variance Weighted Mendelian Randomization
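As described in Burgess et al. (2013), the point estimate from inverse-variance
weighted Mendelian randomization (IVW MR) can be expressed as
$$\hat{\pi}_{\text{IVW}} = \frac{\sum_k \hat{\alpha}_k\hat{\beta}_k\cdot\text{se}^{-2}(\hat{\beta}_k)}{\sum_k \hat{\alpha}_k^2\cdot\text{se}^{-2}(\hat{\beta}_k)}.$$
For SNP $k$, we have
$$\text{se}^2(\hat{\beta}_k) = (\mathbf{G}_{Yk}'\mathbf{G}_{Yk})^{-1}\hat{\sigma}_Y^2 = \frac{\hat{\sigma}_Y^2}{\sum_i (G_{i,Yk} - \bar{G}_{Yk})^2} = \frac{\hat{\sigma}_Y^2}{g_{kk}^2} = \frac{\hat{\sigma}_Y^2}{N_Y\hat{\sigma}_{g,k}^2}.$$
Then,
$$\hat{\pi}_{\text{IVW}} = \frac{\sum_k \hat{\alpha}_k\hat{\beta}_k\cdot\text{se}^{-2}(\hat{\beta}_k)}{\sum_k \hat{\alpha}_k^2\cdot\text{se}^{-2}(\hat{\beta}_k)} = \frac{\sum_k \hat{\alpha}_k\hat{\beta}_k\cdot\hat{\sigma}_Y^{-2}N_Y\hat{\sigma}_{g,k}^2}{\sum_k \hat{\alpha}_k^2\cdot\hat{\sigma}_Y^{-2}N_Y\hat{\sigma}_{g,k}^2} = \frac{\sum_k \hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}^2}{\sum_k \hat{\alpha}_k^2\hat{\sigma}_{g,k}^2} = \hat{\pi}_{\text{hJAM}}$$
and
$$\text{Var}(\hat{\pi}_{\text{IVW}}) = \frac{1}{\sum_k \hat{\alpha}_k^2\cdot\frac{1}{\text{se}^2(\hat{\beta}_k)}} = \frac{1}{\sum_k \hat{\alpha}_k^2\cdot\frac{N_Y\hat{\sigma}_{g,k}^2}{\hat{\sigma}_Y^2}} = \frac{\hat{\sigma}_Y^2}{N_Y\cdot\left(\sum_k \hat{\alpha}_k^2\cdot\hat{\sigma}_{g,k}^2\right)} = \frac{\hat{\sigma}_Y^2}{N_Y\hat{\sigma}_g^2}.$$
Thus, the point estimate and its standard error estimated from IVW MR and hJAM are
equivalent when $M = 1$ and all instruments are independent.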
2A.2.4. Summary-PrediXcan
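As described in Barbeira et al. (2018), the point estimate from summary-PrediXcan
(S-PrediXcan) can be expressed as
$$\hat{\pi}_{\text{Spred}} = \frac{\widehat{\text{Cov}}(X, Y)}{\hat{\sigma}_g^2} = \frac{\widehat{\text{Cov}}\left(\sum_k \alpha_k G_k, Y\right)}{\hat{\sigma}_g^2} = \sum_k \frac{\widehat{\text{Cov}}(\alpha_k G_k, Y)}{\hat{\sigma}_g^2} = \sum_k \frac{\hat{\alpha}_k\cdot\hat{\beta}_k\cdot\hat{\sigma}_{g,k}^2}{\hat{\sigma}_g^2} = \frac{\sum_k \hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}^2}{\sum_k \hat{\alpha}_k^2\hat{\sigma}_{g,k}^2} = \hat{\pi}_{\text{hJAM}}.$$
Note that summary-PrediXcan deals with one gene per model. Thus, our point estimate is the same as
that of summary-PrediXcan when $M = 1$. Since PrediXcan uses elastic net to construct its $\boldsymbol{\alpha}$ vector,
the association estimates in $\boldsymbol{\alpha}$ should be the joint SNP effects on the gene expression. From the
properties of linear regression, we know that
$$\text{Var}(\hat{\pi}_{\text{Spred}}) = \frac{\hat{\sigma}_\delta^2}{N_Y\hat{\sigma}_g^2} = \frac{\hat{\sigma}_Y^2}{N_Y\hat{\sigma}_g^2}(1 - R_g^2) = (1 - R_g^2)\,\text{Var}(\hat{\pi}_{\text{hJAM}}),$$
where $R_g^2$ denotes the heritability of gene expression $\mathbf{X}$. The Z test statistic for the significance of
the association between the gene expression $\mathbf{X}$ and the trait $\mathbf{Y}$ can be expressed as:
$$Z_{\text{Spred}} = \frac{\hat{\pi}_{\text{Spred}}}{\text{se}(\hat{\pi}_{\text{Spred}})} = \sum_k \frac{\hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}^2}{\hat{\sigma}_g^2}\cdot\sqrt{\frac{N_Y\hat{\sigma}_g^2}{\hat{\sigma}_Y^2(1 - R_g^2)}}$$
$$= \sum_k \frac{\hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}^2}{\hat{\sigma}_g}\cdot\sqrt{\frac{(1 - R_{g,k}^2)}{\text{se}(\hat{\beta}_k)^2\,\hat{\sigma}_{g,k}^2\,(1 - R_g^2)}}$$
$$= \sum_k \frac{\hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}}{\hat{\sigma}_g\cdot\text{se}(\hat{\beta}_k)}\cdot\sqrt{\frac{(1 - R_{g,k}^2)}{(1 - R_g^2)}}$$
$$\approx \frac{\sum_k \hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}\,\text{se}^{-1}(\hat{\beta}_k)}{\hat{\sigma}_g} = \frac{\sum_k \hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}^2}{\hat{\sigma}_g}\sqrt{\frac{N_Y}{\hat{\sigma}_Y^2}}.$$
S-PrediXcan ignores the $\sqrt{(1 - R_{g,k}^2)/(1 - R_g^2)}$ term in the fourth line since its authors claim that this term
does not affect the ability to detect the association, based on their real data application and
simulations. The $Z_{\text{Spred}}$ statistic is the same as the $Z_{\text{hJAM}}$ statistic in our approach in the univariate
case.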
2A.2.5. Summary-TWAS (FUSION)
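The standard FUSION does not estimate $\hat{\pi}$ as the other approaches do. It uses the $Z_{\text{FUSION}}$ score to test
the significance of the association between gene expression and the phenotypes. The weight
matrix $\mathbf{W}_{\text{FUSION}}$ in FUSION was compiled from the reference panel using summary statistics via
the ImpG-Summary algorithm (Pasaniuc et al., 2014). The $Z_{\text{FUSION}}$ statistic is computed as a linear
combination of the standardized effect sizes, $\mathbf{Z}^*$, of expression quantitative trait loci (eQTLs), i.e.
for eQTL $k$,
$$Z_k^* = \frac{\hat{\beta}_k}{\text{se}(\hat{\beta}_k)},$$
and the weight matrix $\mathbf{W}_{\text{FUSION}} = \boldsymbol{\Sigma}_{G,X}\boldsymbol{\Sigma}^{-1}$, where $\boldsymbol{\Sigma}_{G,X}$ denotes the covariance matrix between
the genotype and the gene expression. The $Z_{\text{FUSION}}$ can be expressed as
$$Z_{\text{FUSION}} = \frac{\mathbf{W}_{\text{FUSION}}\mathbf{Z}^*}{\sqrt{\mathbf{W}_{\text{FUSION}}\cdot\boldsymbol{\Sigma}\cdot\mathbf{W}_{\text{FUSION}}'}} = \frac{\boldsymbol{\Sigma}_{G,X}\boldsymbol{\Gamma}_g^{-1}\cdot\mathbf{Z}^*}{\sqrt{\boldsymbol{\Sigma}_{G,X}\cdot\boldsymbol{\Gamma}_g^{-1}\cdot\boldsymbol{\Sigma}_{G,X}'}}.$$
Here, for eQTL $k$, assuming all the SNPs are independent, we have
$$\Sigma_{G_k,X} = \text{Cov}(G_k, X) = \text{Cov}\left(G_k, \sum_k \alpha_k G_k\right) = \hat{\alpha}_k\hat{\sigma}_{g,k}^2.$$
Note that the $(k,k)^{th}$ diagonal element of $\boldsymbol{\Gamma}_g^{-1}$ is $1/\hat{\sigma}_{g,k}^2$ under the independence
assumption. Thus, we have
$$Z_{\text{FUSION, independent}} = \frac{\boldsymbol{\Sigma}_{G,X}\boldsymbol{\Gamma}_g^{-1}\cdot\mathbf{Z}^*}{\sqrt{\boldsymbol{\Sigma}_{G,X}\cdot\boldsymbol{\Gamma}_g^{-1}\cdot\boldsymbol{\Sigma}_{G,X}'}} = \frac{\sum_k \hat{\alpha}_k\hat{\sigma}_{g,k}^2\cdot\frac{1}{\hat{\sigma}_{g,k}^2}\cdot\frac{\hat{\beta}_k}{\text{se}(\hat{\beta}_k)}}{\sqrt{\sum_k \hat{\alpha}_k^2\hat{\sigma}_{g,k}^2}} = \frac{\sum_k \hat{\alpha}_k\hat{\beta}_k\,\text{se}^{-1}(\hat{\beta}_k)}{\hat{\sigma}_g} = \frac{\sum_k \hat{\alpha}_k\hat{\beta}_k\hat{\sigma}_{g,k}}{\hat{\sigma}_g}\sqrt{\frac{N_Y}{\hat{\sigma}_Y^2}},$$
which is slightly different from our estimate and S-PrediXcan: both hJAM and S-PrediXcan have
an extra $\hat{\sigma}_{g,k}$ in the numerator of the Z test score.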
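The extra $\hat{\sigma}_{g,k}$ factor can be made concrete with a short numerical sketch of the two independent-SNP test statistics derived in this appendix. The function names and input values are hypothetical. One consequence of the formulas is that the two statistics coincide when every SNP has unit variance, as would be the case for standardized genotypes.

```python
import math

def z_hjam(alpha, beta, sigma_g, sigma2_y, n_y):
    """hJAM / S-PrediXcan statistic: sum(a*b*s^2) / sigma_g * sqrt(N_Y / sigma_Y^2)."""
    sigma_gene = math.sqrt(sum(a * a * s * s for a, s in zip(alpha, sigma_g)))
    numerator = sum(a * b * s * s for a, b, s in zip(alpha, beta, sigma_g))
    return numerator / sigma_gene * math.sqrt(n_y / sigma2_y)

def z_fusion_independent(alpha, beta, sigma_g, sigma2_y, n_y):
    """FUSION statistic under independence: sum(a*b*s) / sigma_g * sqrt(N_Y / sigma_Y^2)."""
    sigma_gene = math.sqrt(sum(a * a * s * s for a, s in zip(alpha, sigma_g)))
    numerator = sum(a * b * s for a, b, s in zip(alpha, beta, sigma_g))
    return numerator / sigma_gene * math.sqrt(n_y / sigma2_y)

# Hypothetical summary statistics for three independent eQTLs.
alpha = [0.30, 0.25, 0.40]
beta = [0.030, 0.020, 0.045]
sigma2_y, n_y = 1.0, 5000

unequal = [0.6, 0.5, 0.65]   # SNP standard deviations on the allele-count scale
unit = [1.0, 1.0, 1.0]       # standardized genotypes

print(z_hjam(alpha, beta, unequal, sigma2_y, n_y),
      z_fusion_independent(alpha, beta, unequal, sigma2_y, n_y))
print(z_hjam(alpha, beta, unit, sigma2_y, n_y),
      z_fusion_independent(alpha, beta, unit, sigma2_y, n_y))
```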
Appendix 2B. Supplementary Tables and Figures
2B.1. Supplementary Tables
Supplementary Table 2B.1 The estimate and its standard error of simulation scenario A: independent $\mathbf{X}$'s.
Each cell shows the estimate with its s.e. in parentheses.

Scenario | True effect | S-PrediXcan | IVW MR (w/o LD) | IVW MR | MVIVW MR (w/o LD) | MVIVW MR | hJAM
Independent SNPs
1 | $\pi_1 = 0$ | 0.000 (0.018) | 0.000 (0.021) | 0.000 (0.021) | 0.000 (0.021) | 0.000 (0.021) | 0.000 (0.019)
1 | $\pi_2 = 0$ | 0.001 (0.019) | 0.001 (0.021) | 0.001 (0.021) | 0.001 (0.021) | 0.001 (0.021) | 0.001 (0.019)
2 | $\pi_1 = 0$ | 0.001 (0.019) | 0.002 (0.045) | 0.002 (0.045) | 0.002 (0.023) | 0.002 (0.022) | 0.002 (0.021)
2 | $\pi_2 = 0.1$ | 0.090 (0.020) | 0.095 (0.023) | 0.095 (0.023) | 0.096 (0.023) | 0.096 (0.023) | 0.096 (0.021)
3 | $\pi_1 = 0.1$ | 0.090 (0.018) | 0.096 (0.021) | 0.096 (0.021) | 0.096 (0.021) | 0.096 (0.021) | 0.096 (0.020)
3 | $\pi_2 = 0$ | 0.001 (0.019) | 0.000 (0.049) | 0.001 (0.049) | 0.000 (0.023) | 0.000 (0.023) | 0.000 (0.022)
4 | $\pi_1 = 0.1$ | 0.091 (0.019) | 0.096 (0.045) | 0.096 (0.045) | 0.097 (0.023) | 0.097 (0.023) | 0.097 (0.021)
4 | $\pi_2 = 0.1$ | 0.091 (0.019) | 0.096 (0.047) | 0.097 (0.047) | 0.097 (0.023) | 0.097 (0.023) | 0.097 (0.022)
Correlated SNPs
1 | $\pi_1 = 0$ | 0.000 (0.019) | 0.001 (0.020) | 0.000 (0.022) | 0.001 (0.020) | 0.000 (0.022) | 0.000 (0.020)
1 | $\pi_2 = 0$ | 0.001 (0.019) | 0.002 (0.020) | 0.001 (0.022) | 0.002 (0.020) | 0.001 (0.022) | 0.001 (0.020)
2 | $\pi_1 = 0$ | -0.001 (0.019) | -0.001 (0.043) | -0.001 (0.042) | 0.001 (0.021) | 0.001 (0.023) | 0.001 (0.022)
2 | $\pi_2 = 0.1$ | 0.089 (0.020) | 0.095 (0.021) | 0.095 (0.023) | 0.096 (0.021) | 0.096 (0.024) | 0.096 (0.022)
3 | $\pi_1 = 0.1$ | 0.089 (0.018) | 0.096 (0.019) | 0.096 (0.021) | 0.097 (0.019) | 0.096 (0.021) | 0.096 (0.020)
3 | $\pi_2 = 0$ | 0.000 (0.020) | 0.000 (0.046) | 0.000 (0.045) | -0.001 (0.021) | -0.001 (0.024) | -0.001 (0.022)
4 | $\pi_1 = 0.1$ | 0.090 (0.020) | 0.097 (0.043) | 0.096 (0.042) | 0.096 (0.021) | 0.096 (0.024) | 0.096 (0.022)
4 | $\pi_2 = 0.1$ | 0.089 (0.020) | 0.095 (0.044) | 0.095 (0.043) | 0.096 (0.021) | 0.095 (0.024) | 0.095 (0.023)

Abbreviations: s.e., standard error; IVW MR, inverse-variance weighted Mendelian randomization; MVIVW MR, multivariable
inverse-variance weighted Mendelian randomization; LD, linkage disequilibrium.
Supplementary Table 2B.2 The estimate and its standard error of simulation scenario B: correlated $\mathbf{X}$'s.
Each cell shows the estimate with its s.e. in parentheses.

Scenario | True effect | S-PrediXcan | IVW MR (w/o LD) | IVW MR | MVIVW MR (w/o LD) | MVIVW MR | hJAM
Independent SNPs
1 | $\pi_1 = 0$ | 0.000 (0.009) | 0.000 (0.011) | 0.000 (0.011) | 0.000 (0.012) | 0.000 (0.012) | 0.000 (0.011)
1 | $\pi_2 = 0$ | 0.000 (0.009) | 0.000 (0.011) | 0.000 (0.011) | 0.000 (0.012) | 0.000 (0.012) | 0.000 (0.011)
2 | $\pi_1 = 0$ | 0.017 (0.009) | 0.019 (0.028) | 0.019 (0.028) | 0.000 (0.013) | 0.000 (0.013) | 0.000 (0.012)
2 | $\pi_2 = 0.1$ | 0.088 (0.010) | 0.096 (0.011) | 0.096 (0.011) | 0.096 (0.013) | 0.096 (0.013) | 0.096 (0.012)
3 | $\pi_1 = 0.1$ | 0.088 (0.009) | 0.097 (0.011) | 0.097 (0.011) | 0.097 (0.012) | 0.097 (0.012) | 0.097 (0.012)
3 | $\pi_2 = 0$ | 0.019 (0.010) | 0.021 (0.029) | 0.021 (0.029) | 0.000 (0.013) | 0.000 (0.013) | 0.000 (0.012)
4 | $\pi_1 = 0.1$ | 0.107 (0.010) | 0.118 (0.029) | 0.118 (0.029) | 0.098 (0.014) | 0.098 (0.014) | 0.097 (0.014)
4 | $\pi_2 = 0.1$ | 0.107 (0.010) | 0.118 (0.029) | 0.118 (0.029) | 0.097 (0.014) | 0.097 (0.014) | 0.097 (0.014)
Correlated SNPs
1 | $\pi_1 = 0$ | 0.000 (0.009) | -0.001 (0.010) | 0.000 (0.011) | 0.000 (0.011) | 0.000 (0.012) | 0.000 (0.012)
1 | $\pi_2 = 0$ | 0.000 (0.011) | 0.000 (0.010) | 0.000 (0.011) | 0.000 (0.011) | 0.000 (0.012) | 0.000 (0.012)
2 | $\pi_1 = 0$ | 0.016 (0.011) | 0.024 (0.025) | 0.018 (0.026) | 0.001 (0.012) | 0.000 (0.013) | 0.000 (0.013)
2 | $\pi_2 = 0.1$ | 0.086 (0.010) | 0.097 (0.010) | 0.096 (0.012) | 0.097 (0.012) | 0.096 (0.013) | 0.096 (0.013)
3 | $\pi_1 = 0.1$ | 0.085 (0.010) | 0.097 (0.010) | 0.096 (0.012) | 0.097 (0.011) | 0.096 (0.013) | 0.096 (0.013)
3 | $\pi_2 = 0$ | 0.017 (0.010) | 0.024 (0.025) | 0.019 (0.026) | 0.001 (0.012) | 0.001 (0.013) | 0.001 (0.013)
4 | $\pi_1 = 0.1$ | 0.104 (0.010) | 0.123 (0.025) | 0.117 (0.026) | 0.098 (0.013) | 0.097 (0.014) | 0.097 (0.014)
4 | $\pi_2 = 0.1$ | 0.104 (0.010) | 0.123 (0.026) | 0.117 (0.026) | 0.097 (0.013) | 0.097 (0.014) | 0.096 (0.014)

Abbreviations: s.e., standard error; IVW MR, inverse-variance weighted Mendelian randomization; MVIVW MR, multivariable
inverse-variance weighted Mendelian randomization; LD, linkage disequilibrium.
Supplementary Table 2B.3 The estimate and its standard error of simulation scenario C: 𝑿 𝟏 causes 𝑿 𝟐 .
Scenario | Parameter | S-PrediXcan | IVW MR (w/o LD) | IVW MR | MVIVW MR (w/o LD) | MVIVW MR | hJAM
(each method column lists: Estimate s.e.)
Independent SNPs
1 | π1 = 0 | 0.000 0.018 | 0.000 0.021 | 0.001 0.021 | -0.001 0.032 | -0.001 0.031 | -0.001 0.028
1 | π2 = 0 | 0.000 0.013 | 0.000 0.015 | 0.000 0.015 | 0.000 0.022 | 0.000 0.022 | 0.000 0.020
2 | π1 = 0 | 0.027 0.019 | 0.028 0.047 | 0.028 0.047 | -0.003 0.035 | -0.003 0.035 | -0.003 0.034
2 | π2 = 0.1 | 0.091 0.014 | 0.096 0.016 | 0.096 0.016 | 0.097 0.024 | 0.096 0.024 | 0.097 0.023
3 | π1 = 0.1 | 0.090 0.019 | 0.096 0.021 | 0.095 0.021 | 0.095 0.032 | 0.095 0.032 | 0.095 0.029
3 | π2 = 0 | 0.009 0.017 | 0.010 0.026 | 0.010 0.026 | 0.000 0.023 | 0.000 0.023 | 0.000 0.021
4 | π1 = 0.1 | 0.121 0.019 | 0.127 0.046 | 0.127 0.046 | 0.093 0.042 | 0.093 0.042 | 0.093 0.041
4 | π2 = 0.1 | 0.099 0.013 | 0.104 0.026 | 0.104 0.026 | 0.096 0.028 | 0.096 0.028 | 0.096 0.027
Correlated SNPs
1 | π1 = 0 | 0.000 0.020 | 0.001 0.020 | 0.000 0.022 | 0.000 0.030 | -0.001 0.032 | -0.001 0.030
1 | π2 = 0 | 0.000 0.015 | 0.000 0.014 | 0.000 0.015 | 0.000 0.020 | 0.000 0.023 | 0.000 0.021
2 | π1 = 0 | 0.028 0.020 | 0.030 0.045 | 0.030 0.044 | 0.000 0.032 | -0.001 0.036 | -0.001 0.034
2 | π2 = 0.1 | 0.090 0.014 | 0.097 0.014 | 0.096 0.016 | 0.096 0.021 | 0.096 0.024 | 0.096 0.022
3 | π1 = 0.1 | 0.089 0.020 | 0.095 0.020 | 0.095 0.022 | 0.098 0.030 | 0.097 0.033 | 0.097 0.031
3 | π2 = 0 | 0.009 0.014 | 0.010 0.023 | 0.010 0.024 | -0.001 0.021 | -0.001 0.024 | -0.001 0.022
4 | π1 = 0.1 | 0.119 0.019 | 0.128 0.043 | 0.126 0.042 | 0.096 0.035 | 0.095 0.039 | 0.095 0.037
4 | π2 = 0.1 | 0.097 0.013 | 0.104 0.023 | 0.104 0.024 | 0.095 0.024 | 0.094 0.026 | 0.094 0.025
Abbreviations: s.e., standard error; IVW MR, inverse-variance weighted Mendelian randomization; MVIVW MR, multivariable
inverse-variance weighted Mendelian randomization; LD, linkage disequilibrium.
Supplementary Table 2B.4 The estimate and its standard error of simulation scenario D: 𝑿 𝟏 causes 𝑿 𝟐 and 𝑿 𝟏 and 𝑿 𝟐 are
correlated.
Scenario | Parameter | S-PrediXcan | IVW MR (w/o LD) | IVW MR | MVIVW MR (w/o LD) | MVIVW MR | hJAM
(each method column lists: Estimate s.e.)
Independent SNPs
1 | π1 = 0 | 0.000 0.010 | 0.000 0.011 | 0.000 0.011 | 0.000 0.017 | 0.000 0.017 | 0.000 0.015
1 | π2 = 0 | 0.000 0.008 | 0.000 0.008 | 0.000 0.008 | 0.000 0.012 | 0.000 0.012 | 0.000 0.011
2 | π1 = 0 | 0.040 0.009 | 0.043 0.029 | 0.043 0.029 | -0.001 0.021 | -0.001 0.021 | -0.001 0.020
2 | π2 = 0.1 | 0.089 0.007 | 0.097 0.009 | 0.097 0.009 | 0.097 0.014 | 0.097 0.014 | 0.097 0.014
3 | π1 = 0.1 | 0.089 0.009 | 0.097 0.011 | 0.097 0.011 | 0.097 0.018 | 0.097 0.018 | 0.097 0.017
3 | π2 = 0 | 0.014 0.007 | 0.015 0.018 | 0.015 0.018 | 0.000 0.013 | 0.000 0.013 | 0.000 0.012
4 | π1 = 0.1 | 0.125 0.010 | 0.138 0.030 | 0.138 0.030 | 0.095 0.023 | 0.096 0.023 | 0.095 0.023
4 | π2 = 0.1 | 0.098 0.007 | 0.108 0.020 | 0.108 0.019 | 0.098 0.016 | 0.098 0.016 | 0.097 0.016
Correlated SNPs
1 | π1 = 0 | 0.000 0.010 | -0.001 0.010 | -0.001 0.011 | 0.000 0.016 | 0.000 0.017 | 0.000 0.016
1 | π2 = 0 | 0.000 0.007 | -0.001 0.007 | 0.000 0.008 | 0.000 0.011 | 0.000 0.012 | 0.000 0.011
2 | π1 = 0 | 0.039 0.010 | 0.048 0.025 | 0.043 0.026 | -0.001 0.019 | -0.001 0.021 | -0.001 0.020
2 | π2 = 0.1 | 0.087 0.008 | 0.097 0.008 | 0.096 0.009 | 0.097 0.013 | 0.096 0.014 | 0.096 0.014
3 | π1 = 0.1 | 0.086 0.010 | 0.097 0.010 | 0.096 0.011 | 0.097 0.016 | 0.096 0.018 | 0.096 0.018
3 | π2 = 0 | 0.013 0.007 | 0.017 0.016 | 0.015 0.016 | 0.000 0.012 | 0.000 0.013 | 0.000 0.013
4 | π1 = 0.1 | 0.122 0.009 | 0.142 0.026 | 0.137 0.027 | 0.096 0.021 | 0.095 0.023 | 0.095 0.023
4 | π2 = 0.1 | 0.096 0.008 | 0.109 0.017 | 0.107 0.017 | 0.097 0.014 | 0.097 0.015 | 0.097 0.015
Abbreviations: s.e., standard error; IVW MR, inverse-variance weighted Mendelian randomization; MVIVW MR, multivariable
inverse-variance weighted Mendelian randomization; LD, linkage disequilibrium.
Supplementary Table 2B.5 Four correlated pairs of SNPs in the instrument sets of BMI
and type 2 diabetes.
CHR BMI T2D 𝑹 *
2 rs13021737 rs2867125 0.91
18 rs6567160 rs12970134 0.74
5 rs2112347 rs2307111 -0.72
16 rs1558902 rs7185735 0.65
Notes: * The correlation coefficient between the two SNPs was calculated from the 1000
Genomes data. Abbreviations: CHR, chromosome; BMI, body mass index; T2D, type 2 diabetes.
2B.2. Supplementary Figures
Supplementary Figure 2B.1 Correlation coefficient matrix and data structure of 𝑮 𝑿 .
(A) A sample of the correlation coefficient matrix of $G_X$. $G_X$ is composed of three SNP
blocks: $G_1$, $G_2$, and $G_3$. For the correlated simulation scenarios, SNPs are correlated
within each SNP block but not across SNP blocks. (B) Data structure of $G_X$ and its
relationship with $G_1$, $G_2$, and $G_3$. The columns are SNPs and the rows are individuals.
Supplementary Figure 2B.2 Simulation results for evaluating the robustness of hJAM and
MVIVW MR on different linkage disequilibrium (LD) structures of 𝑮 𝑳 .
The LD structures are the same for 𝐺 𝑋 and 𝐺 𝑌 , where the maximum correlation coefficient is
r(𝐺 𝑋 ) = r(𝐺 𝑌 )=0.3. For the sub-scenario with less correlated and more correlated LD, the
maximum correlation coefficient for 𝐺 𝐿 is r(𝐺 𝐿 ) =0.2 and r(𝐺 𝐿 )=0.6, respectively.
Supplementary Figure 2B.3 Empirical Power of the correlated SNPs scenarios across 1000
replications with 𝑹 𝑮 ,𝑿 𝟐 =𝟎 .𝟎𝟓 in the simulation setting.
(A) 𝑿 𝟏 and 𝑿 𝟐 are independent. (B) 𝑿 𝟏 and 𝑿 𝟐 are correlated. (C) 𝑿 𝟏 causes 𝑿 𝟐 . (D) 𝑿 𝟏 causes
𝑿 𝟐 and correlated. The black solid line refers to the default Type-I error, 𝜶 =𝟎 .𝟎𝟓 .
Supplementary Figure 2B.4 Average estimates and 95% confidence intervals of the
correlated SNPs scenarios across 1000 replications with 𝑹 𝑮 ,𝑿 𝟐 =𝟎 .𝟎𝟓 in the simulation
setting.
(A) $X_1$ and $X_2$ are independent. (B) $X_1$ and $X_2$ are correlated. (C) $X_1$ causes $X_2$. (D) $X_1$ causes
$X_2$ and correlated. The black solid line refers to the default Type-I error, $\alpha = 0.05$.
Supplementary Figure 2B.5 Empirical Power of the correlated SNPs scenarios across 1000
replications with adding an unknown confounder between 𝑿 and 𝒀 in the simulation
setting.
(A) 𝑿 𝟏 and 𝑿 𝟐 are independent. (B) 𝑿 𝟏 and 𝑿 𝟐 are correlated. (C) 𝑿 𝟏 causes 𝑿 𝟐 . (D) 𝑿 𝟏 causes
𝑿 𝟐 and correlated. The black solid line refers to the default Type-I error, 𝜶 =𝟎 .𝟎𝟓 .
Supplementary Figure 2B.6 Average estimates and 95% confidence intervals of the
correlated SNPs scenarios across 1000 replications with adding an unknown confounder
between 𝑿 and 𝒀 in the simulation setting.
(A) 𝑿 𝟏 and 𝑿 𝟐 are independent. (B) 𝑿 𝟏 and 𝑿 𝟐 are correlated. (C) 𝑿 𝟏 causes 𝑿 𝟐 . (D) 𝑿 𝟏 causes
𝑿 𝟐 and correlated. The black solid line refers to the default Type-I error, 𝜶 =𝟎 .𝟎𝟓 .
Supplementary Figure 2B.7 Scatter plots of the 𝜶̂ and 𝜷̂
of the significant SNPs that are
selected from the genome-wide association studies for (A) body mass index and (B) type 2
diabetes, respectively.
Chapter 3.
hJAM Egger: A Hierarchical Joint Analysis of Marginal
Summary Statistics with Egger Regression
This chapter is a modified version of the work:
Lai Jiang, et al., A Hierarchical Approach Using Marginal Summary Statistics for Multiple
Intermediates in a Mendelian Randomization or Transcriptome Analysis. American Journal of
Epidemiology (in press).
3.1. Motivation
Mendelian randomization employs a set of genetic variants as instruments to estimate
the causal effects of modifiable risk factors on an outcome in observational studies. As
discussed in Chapter 2, when the associations between the genetic variants and the risk factors
are weak, weak instrument bias is likely to be introduced into the model and to result in
inappropriate inference. Adding more genetic variants to the analysis is a potential
way to increase the power to detect causal effects. However, this solution increases the
possibility of introducing pleiotropy bias, defined as an effect of the genetic variants on
the outcome that does not act through the risk factors. hJAM, the approach that we proposed in
Chapter 2, accommodates more genetic variants and can handle measured pleiotropy, that is,
pleiotropy attributable to the risk factors included in the analysis. Nevertheless, it cannot
handle unmeasured pleiotropy when it is present.
Mendelian randomization in the form of a weighted linear regression of marginal summary
statistics, such as inverse-variance weighted Mendelian randomization (IVW MR), can be
viewed as a form of meta-analysis. The Egger test is widely used in meta-analysis to address
small-study bias (Egger et al., 1997), which is similar to directional pleiotropy in
Mendelian randomization (Bowden et al., 2015). Mendelian randomization Egger regression
(MR Egger), proposed by Bowden et al. (2015), adapts Egger regression to the IVW MR model
to detect violations of the “no pleiotropy” assumption and to provide an estimate that is
not subject to such violations. Rees et al. (2017) extended MR Egger to a multivariable
setting as multivariable Mendelian randomization Egger (MV MR Egger).
In this chapter, we extend hJAM to hJAM Egger by integrating Egger regression into
the hJAM framework, and we evaluate its performance through simulation studies in comparison
with MV MR Egger and MR Egger. We also apply the methods to the first data example in Chapter
2 to test whether pleiotropy exists in estimating the effects of body mass index and type 2
diabetes on the risk of myocardial infarction.
3.2. Methods
3.2.1. Egger Test and the Application in Mendelian Randomization
As described in Chapter 2, for a single modifiable risk factor, the estimate $\hat{\pi}$ from
inverse-variance weighted Mendelian randomization (IVW MR) (Burgess et al., 2013) can be
expressed as
$$\hat{\pi}_{\mathrm{IVW}} = \frac{\sum_k \hat{\alpha}_k \hat{\beta}_k \cdot \mathrm{se}^{-2}(\hat{\beta}_k)}{\sum_k \hat{\alpha}_k^2 \cdot \mathrm{se}^{-2}(\hat{\beta}_k)},$$
which is the same formula that is employed in a fixed-effect meta-analysis, where the association
estimate, $\hat{\beta}_k$, can be viewed as the estimate from the $k$th study and the inverse of its
squared standard error, $\mathrm{se}^{-2}(\hat{\beta}_k)$, can be viewed as the weight (Sutton et al., 2000).
A pleiotropic effect means that at least one genetic variant, $SNP_K$, is not a valid instrumental
variable and has an effect on the outcome that does not act through the modifiable risk factor
(Figure 3.1).
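As a minimal illustration of the IVW formula above, the following Python sketch computes the weighted ratio estimate from per-SNP summary statistics. The function name `ivw_estimate` and the three-SNP input values are hypothetical and chosen only so that the expected slope is easy to verify by hand; they are not taken from the analyses in this dissertation.

```python
import numpy as np

def ivw_estimate(alpha_hat, beta_hat, se_beta):
    """IVW estimate: sum(alpha*beta/se^2) / sum(alpha^2/se^2)."""
    alpha_hat = np.asarray(alpha_hat, dtype=float)
    beta_hat = np.asarray(beta_hat, dtype=float)
    # Inverse-variance weights, se^{-2}(beta_hat_k)
    w = 1.0 / np.asarray(se_beta, dtype=float) ** 2
    return np.sum(alpha_hat * beta_hat * w) / np.sum(alpha_hat ** 2 * w)

# Hypothetical summary statistics for three SNPs (illustrative values only).
alpha_hat = np.array([0.10, 0.20, 0.30])   # SNP-risk factor associations
beta_hat = 0.5 * alpha_hat                 # SNP-outcome associations, no pleiotropy
se_beta = np.array([0.01, 0.02, 0.03])     # standard errors of beta_hat

pi_ivw = ivw_estimate(alpha_hat, beta_hat, se_beta)
print(pi_ivw)
```

Because the outcome associations here are exactly proportional to the risk factor associations, the estimate equals the proportionality constant regardless of the weights.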
Figure 3.1 Directed Acyclic Graph (DAG) of the pleiotropy effect.
In the figure, $SNP_K$ has a pleiotropic effect on $Y$: it has an effect on $Y$ that does not
act through the intermediate $X$.
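Then the association estimate, $\hat{\beta}_K$, between this invalid instrument ($SNP_K$) and the
phenotype ($Y$) is not a product of $\hat{\alpha}_K$ (the association estimate between $SNP_K$ and $X$)
and $\pi$ (the association between $X$ and $Y$) but with an additional intercept $\hat{\beta}_{1K}$:
$$\hat{\beta}_K = \hat{\beta}_{1K} + \hat{\alpha}_K \pi,$$
where $\hat{\beta}_{1K}$ denotes the direct effect from $SNP_K$ to $Y$. Thus, we have
$$\pi = \frac{\hat{\beta}_K}{\hat{\alpha}_K} - \frac{\hat{\beta}_{1K}}{\hat{\alpha}_K} = \frac{\hat{\beta}_K}{\hat{\alpha}_K} + \mathrm{Bias}(\hat{\beta}_K, \hat{\alpha}_K),$$
where $\mathrm{Bias}(\hat{\beta}_K, \hat{\alpha}_K) := -\hat{\beta}_{1K} / \hat{\alpha}_K$. Asymptotically, the ratio estimate based on
the instruments can be expressed as
$$\hat{\pi}_{\mathrm{Correct}} = \frac{\sum_k \hat{\alpha}_k \hat{\beta}_k \cdot \mathrm{se}^{-2}(\hat{\beta}_k)}{\sum_k \hat{\alpha}_k^2 \cdot \mathrm{se}^{-2}(\hat{\beta}_k)} + \mathrm{Bias}(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\alpha}}),$$
which implies that the bias could be captured by the intercept term, $\hat{\beta}_{1k}$, and therefore produce
an unbiased slope, $\hat{\pi}$, as graphically illustrated in Figure 3.2. Mendelian randomization with
Egger regression (MR Egger) can be expressed as
$$\hat{\beta}_k = \hat{\beta}_{1k} + \hat{\alpha}_k \pi_E, \quad \text{weights} = 1/\mathrm{se}^2(\hat{\beta}_k).$$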
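A natural extension of MR Egger is multivariable Mendelian randomization with Egger
regression (MV MR Egger) (Rees et al., 2017):
$$\hat{\beta}_k = \hat{\beta}_{1k} + \hat{\alpha}_{1k}\pi_1 + \hat{\alpha}_{2k}\pi_2 + \cdots + \hat{\alpha}_{Mk}\pi_M, \quad \text{weights} = 1/\mathrm{se}^2(\hat{\beta}_k).$$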
Both MR Egger and MV MR Egger assume that the distribution of the instrument strengths, $\alpha$,
is independent of the distribution of the direct effects, $\beta_1$. This is referred to as the
InSIDE assumption (Instrument Strength Independent of Direct Effect) (Bowden et al., 2015; Rees et al., 2017).
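The two Egger regressions above are weighted least-squares fits with an intercept. The sketch below shows one way this could be computed with NumPy. The helper `egger_wls` and the six-SNP inputs are hypothetical; the sketch returns point estimates only and does not reproduce the standard errors or tests reported by any MR software.

```python
import numpy as np

def egger_wls(alpha_hat, beta_hat, se_beta):
    """Weighted least squares of beta_hat on [1, alpha_hat], weights 1/se^2.

    alpha_hat has shape (K,) for MR Egger or (K, M) for MV MR Egger.
    Returns (intercept, slopes).
    """
    A = np.asarray(alpha_hat, dtype=float)
    if A.ndim == 1:
        A = A[:, None]
    b = np.asarray(beta_hat, dtype=float)
    w = 1.0 / np.asarray(se_beta, dtype=float) ** 2
    X = np.column_stack([np.ones(len(b)), A])   # intercept column captures the bias
    sw = np.sqrt(w)
    # Scaling rows by sqrt(w) turns weighted least squares into ordinary least squares.
    coef, *_ = np.linalg.lstsq(X * sw[:, None], b * sw, rcond=None)
    return coef[0], coef[1:]

# Hypothetical noise-free inputs: K = 6 SNPs, M = 2 risk factors.
alpha_hat = np.array([[0.1, 0.3], [0.2, 0.1], [0.3, 0.2],
                      [0.4, 0.5], [0.5, 0.4], [0.6, 0.2]])
pi_true = np.array([0.1, 0.0])
beta_hat = 0.02 + alpha_hat @ pi_true           # directional pleiotropy of 0.02
se_beta = np.full(6, 0.01)

intercept, slopes = egger_wls(alpha_hat, beta_hat, se_beta)
print(intercept, slopes)
```

With noise-free inputs the fit recovers the intercept and both slopes, which mirrors the argument that the intercept absorbs a constant direct effect.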
Figure 3.2 Illustration of the MR Egger regression.
This plot graphically illustrates the framework of MR Egger regression. We changed the
notation to match our method: $\hat{\alpha}$ denotes the association between the genetic variants and the
modifiable risk factor, $\hat{\beta}$ denotes the association between the genetic variants and the phenotype,
and $\hat{\pi}$ denotes the association of interest, namely the association between the modifiable risk
factor and the phenotype. This figure is from Figure 1 of Bowden et al. (2015).
3.2.2. Hierarchical JAM with Egger Regression (hJAM Egger)
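When a pleiotropic effect is present, we allow an intercept term in Eq. 2.1 to denote the
bias:
$$\boldsymbol{y} = \boldsymbol{G}\boldsymbol{\beta} + \boldsymbol{G}\boldsymbol{\beta}_1 + \boldsymbol{\delta},$$
where $\boldsymbol{\beta}_1$ is the vector of biases for the genetic variants in the analysis. Following the hJAM
model, we have
$$\boldsymbol{z} := \boldsymbol{G}'\boldsymbol{y} \sim MVN_K(\boldsymbol{G}'\boldsymbol{G}\boldsymbol{\beta} + \boldsymbol{G}'\boldsymbol{G}\boldsymbol{\beta}_1, \sigma_Y^2 \boldsymbol{G}'\boldsymbol{G}). \quad (3.1)$$
Substituting the stage 2 model (Eq. 2.4) into Eq. 3.1, we have
$$\boldsymbol{z} \sim MVN_K(\boldsymbol{G}'\boldsymbol{G}\hat{\boldsymbol{A}}\boldsymbol{\pi} + \boldsymbol{G}'\boldsymbol{G}\boldsymbol{\beta}_1, \sigma_Y^2 \boldsymbol{G}'\boldsymbol{G}).$$
We then perform the Cholesky decomposition $\boldsymbol{L}'\boldsymbol{L} = \boldsymbol{G}'\boldsymbol{G}$ and multiply both sides by $\boldsymbol{L}'^{-1}$
to simplify the likelihood:
$$\boldsymbol{z}_L := \boldsymbol{L}'^{-1}\boldsymbol{z} \sim MVN_K(\boldsymbol{L}\hat{\boldsymbol{A}}\boldsymbol{\pi} + \boldsymbol{L}\boldsymbol{\beta}_1, \sigma_Y^2 \boldsymbol{I}_K). \quad (3.2)$$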
Thus, Egger regression within the hJAM framework amounts to allowing an
intercept in Eq. 2.6, which can be achieved by adding a column of ones to the $\hat{\boldsymbol{A}}$ matrix:
$$\hat{\boldsymbol{A}}_{K \times (M+1)} = \begin{pmatrix} 1 & \hat{\alpha}_{11} & \cdots & \hat{\alpha}_{1M} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \hat{\alpha}_{K1} & \cdots & \hat{\alpha}_{KM} \end{pmatrix}.$$
hJAM Egger is analogous to MR Egger (Bowden et al., 2015) when the risk factors are independent,
and to MV MR Egger (Rees et al., 2017). Like MR Egger and MV MR Egger, hJAM Egger requires
the InSIDE assumption to hold.
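The steps in this subsection (Cholesky transformation of $\boldsymbol{z}$, then a regression on $\boldsymbol{L}$ times the augmented $\hat{\boldsymbol{A}}$ matrix) can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the implementation used for the results in this chapter: the function name `hjam_egger_fit`, the simulated genotype matrix, and the noise-free $\boldsymbol{z}$ are all hypothetical, and only point estimates are returned.

```python
import numpy as np

def hjam_egger_fit(z, GtG, A_hat):
    """Regress z_L on L @ [1, A_hat], where L'L = G'G (Cholesky factor)."""
    C = np.linalg.cholesky(GtG)            # lower triangular, C @ C.T == GtG
    L = C.T                                # so that L' L = G'G
    z_L = np.linalg.solve(C, z)            # z_L = L'^{-1} z
    # Column of ones added to A_hat to carry the Egger intercept.
    A_aug = np.column_stack([np.ones(A_hat.shape[0]), A_hat])
    coef, *_ = np.linalg.lstsq(L @ A_aug, z_L, rcond=None)
    return coef[0], coef[1:]

# Hypothetical inputs: K = 6 SNPs, M = 2 intermediates.
rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(500, 6)).astype(float)
G -= G.mean(axis=0)                        # mean-centered genotypes
GtG = G.T @ G
A_hat = rng.normal(0.2, 0.05, size=(6, 2))

theta_true = np.array([0.02, 0.1, 0.0])    # (intercept, pi_1, pi_2)
A_aug_true = np.column_stack([np.ones(6), A_hat])
z = GtG @ (A_aug_true @ theta_true)        # noise-free mean of z

intercept, pi_hat = hjam_egger_fit(z, GtG, A_hat)
print(intercept, pi_hat)
```

In this noise-free setting the fit returns the intercept and causal effects used to build $\boldsymbol{z}$, which is a quick check that the transformation and the augmented design are consistent with Eq. 3.2.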
3.3. Simulation Studies
3.3.1. Simulation Settings
To evaluate the performance of hJAM Egger regression, we performed simulation studies
and compared the results with Mendelian randomization Egger regression (MR Egger) and
multivariable Mendelian randomization Egger regression (MV MR Egger) in settings with two
independent or two correlated intermediates (i.e., risk factors) (Figure 2.1A and Figure 2.1B,
respectively). We set the number of independent SNPs to 20 and 30 for the scenarios with
independent and correlated intermediates, respectively. Within each scenario, we tested two
sub-scenarios: balanced and unbalanced pleiotropy.
To generate the summary statistics, we created individual-level data with the
following models:
$$x_{i,m} = \sum_{k=1}^{K} \alpha_{km} G_{ik} + \epsilon_{ikm};$$
$$y_i = \sum_{m=1}^{2} \left( \sum_{k=1}^{K} \beta_{1k} G_{ik} + \pi_m x_{im} + \gamma_{ikm} \right),$$
where $k = 1, \ldots, K$ denotes the $k$th SNP, $m = 1, 2$ denotes the $m$th intermediate, and $i = 1, \ldots, n$
denotes the $i$th individual in the data. For the balanced pleiotropy scenario, we randomly drew the
vector of direct effects, $\boldsymbol{\beta}_1$, of $\boldsymbol{G}$ on $\boldsymbol{Y}$ from a normal distribution: $\beta_{1k} \sim N(0, 0.005)$. For the
unbalanced pleiotropy scenarios, we tested direct effects of $\beta_{1k} \sim N(0.01, 0.005)$,
$\beta_{1k} \sim N(0.05, 0.005)$, and $\beta_{1k} \sim N(0.1, 0.005)$. The simulation settings for the unbalanced pleiotropy
scenarios were inspired by Rees et al. (2017). The remaining settings of the simulations
are the same as described in Chapter 2, Section 2.3.1.
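A data-generating step of this kind could be sketched in Python as below. Several details are assumptions made only for illustration and are not stated in this section: the sample size, the minor allele frequency of 0.3, the distribution of the SNP-intermediate effects, unit-variance noise, treating the second parameter of $N(\cdot, 0.005)$ as a standard deviation, and adding the direct effects to the outcome once.

```python
import numpy as np

rng = np.random.default_rng(2021)
n, K, M = 5000, 20, 2                      # individuals, SNPs, intermediates (assumed n)

# Assumption: independent SNPs with minor allele frequency 0.3.
G = rng.binomial(2, 0.3, size=(n, K)).astype(float)
# Assumption: SNP-intermediate effects alpha_km drawn around 0.1.
alpha = rng.normal(0.1, 0.02, size=(K, M))
pi = np.array([0.1, 0.0])                  # causal effects of X1 and X2 on Y

def draw_direct_effects(mean, sd=0.005):
    """Direct (pleiotropic) effects beta_1k; sd = 0.005 is an assumed reading."""
    return rng.normal(mean, sd, size=K)

beta1_balanced = draw_direct_effects(0.0)      # balanced pleiotropy
beta1_unbalanced = draw_direct_effects(0.05)   # one unbalanced setting

# Intermediates and outcome with unit-variance noise (assumed).
X = G @ alpha + rng.normal(size=(n, M))
y = G @ beta1_unbalanced + X @ pi + rng.normal(size=n)
print(X.shape, y.shape, beta1_unbalanced.mean())
```

Marginal summary statistics would then be obtained by regressing each intermediate and the outcome on one SNP at a time, which is not shown here.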
For performance assessment, we used the power to detect the pleiotropic effect (i.e., a
significant intercept), the power to detect the causal effect when $\pi_m = 0.1$, and the type-I
error when $\pi_m = 0$. We also examined the mean estimates of $\hat{\boldsymbol{\pi}}$ and the corresponding
standard errors. All tests were two-sided with a type-I error rate of 0.05.
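Empirical power and type-I error are both rejection rates across replicates: the proportion of replicates whose two-sided test has a p-value below 0.05, computed under a non-null or a null parameter value, respectively. The sketch below assumes a Wald test with a normal reference distribution, which may differ from the test actually used; the helper names and the four replicate values are hypothetical.

```python
import numpy as np
from math import erfc, sqrt

def two_sided_p(estimate, se):
    """Two-sided p-value of a Wald z-statistic under a normal approximation."""
    return erfc(abs(estimate / se) / sqrt(2.0))

def rejection_rate(estimates, ses, alpha=0.05):
    """Share of replicates with p < alpha (power if non-null, type-I error if null)."""
    p = np.array([two_sided_p(e, s) for e, s in zip(estimates, ses)])
    return float(np.mean(p < alpha))

# Hypothetical estimates and standard errors from four replicates.
estimates = [0.10, 0.02, 0.15, 0.00]
ses = [0.05, 0.05, 0.05, 0.05]
rate = rejection_rate(estimates, ses)
print(rate)
```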
3.3.2. Simulation Results
Results from simulation studies with 1000 replicates for each scenario are shown in Table
3.1 and Table 3.2 for independent and correlated intermediates, respectively. Since the
performance for $\hat{\pi}_1$ and $\hat{\pi}_2$ is symmetric, we only show the performance of each method with
respect to $\hat{\pi}_1$ under different values of $\pi_2$.
3.3.2.1. Scenarios with independent intermediates $X_1$ and $X_2$
When intermediate $X_1$ has no effect on the outcome (i.e., $\pi_1 = 0$), we observed similar
performance across the three methods regardless of the pleiotropy effect size (Table 3.1). A slightly
inflated type-I error was observed in testing the intercept and the causal effect of $\pi_1$ in scenarios
with balanced pleiotropy. For scenarios with unbalanced pleiotropy, the power to detect the
bias increases as the magnitude of the bias increases from 0.01 to 0.1. hJAM Egger showed
higher power to detect the bias at the cost of a slightly larger type-I error in testing the null
effect of $\pi_1$.
When intermediate $X_1$ has a positive effect on the outcome (i.e., $\pi_1 = 0.1$), hJAM Egger
shows greater power to detect the causal effect and a lower type-I error than
multivariable MR Egger and MR Egger when balanced pleiotropy is present. When unbalanced
pleiotropy is present, the power to detect the causal effects decreases as the magnitude of the
bias increases from 0.01 to 0.1. The estimates from hJAM Egger and multivariable MR Egger
are consistently more precise than the estimates from MR Egger.
3.3.2.2. Scenarios with correlated intermediates $X_1$ and $X_2$
For correlated intermediates (Table 3.2), we observed performance similar to that in the
scenarios with independent intermediates. MR Egger shows better performance in detecting
pleiotropy when a small unbalanced pleiotropy effect ($\beta_{1k} \sim N(0.01, 0.005)$) is present.
However, this better performance arises because the correlation between the intermediates
introduces an additional source of pleiotropy into the MR Egger model, as evidenced by the
inflated type-I error of testing the bias in the balanced pleiotropy scenarios.
Table 3.1 Simulation results for hJAM Egger, multivariable MR Egger, and MR Egger for different scenarios with
independent intermediates and independent SNPs.
Columns for each method: Mean π̂1 (SE) | Power (Intercept) | Power (Causal)
π2 | hJAM Egger | Multivariable MR Egger | MR Egger
Null causal effect: π1 = 0
Balanced Pleiotropy
0 | 0.003 (0.049) | 0.104 | 0.083 | 0.003 (0.049) | 0.063 | 0.045 | 0.002 (0.047) | 0.054 | 0.043
0.1 | -0.003 (0.05) | 0.077 | 0.096 | -0.003 (0.05) | 0.051 | 0.044 | -0.015 (0.058) | 0.065 | 0.039
Unbalanced Pleiotropy with β1 ~ N(0.01, 0.005)
0 | 0.003 (0.049) | 0.220 | 0.076 | 0.003 (0.049) | 0.137 | 0.042 | 0.002 (0.047) | 0.171 | 0.051
0.1 | -0.003 (0.05) | 0.221 | 0.082 | -0.003 (0.05) | 0.149 | 0.046 | -0.015 (0.058) | 0.277 | 0.066
Unbalanced Pleiotropy with β1 ~ N(0.05, 0.005)
0 | 0.004 (0.051) | 0.976 | 0.083 | 0.003 (0.049) | 0.947 | 0.044 | 0.002 (0.047) | 0.985 | 0.043
0.1 | -0.002 (0.051) | 0.970 | 0.089 | -0.004 (0.05) | 0.938 | 0.047 | -0.015 (0.058) | 0.987 | 0.040
Unbalanced Pleiotropy with β1 ~ N(0.1, 0.005)
0 | 0.005 (0.055) | 0.998 | 0.086 | 0.003 (0.05) | 0.996 | 0.046 | 0.003 (0.048) | 0.999 | 0.054
0.1 | -0.001 (0.056) | 0.999 | 0.087 | -0.004 (0.051) | 0.998 | 0.044 | -0.015 (0.059) | 0.999 | 0.039
Positive causal effect: π1 = 0.1
Balanced Pleiotropy
0 | 0.086 (0.051) | 0.042 | 0.512 | 0.085 (0.051) | 0.041 | 0.404 | 0.085 (0.049) | 0.066 | 0.419
0.1 | 0.083 (0.052) | 0.039 | 0.475 | 0.083 (0.052) | 0.070 | 0.362 | 0.072 (0.06) | 0.079 | 0.240
Unbalanced Pleiotropy with β1 ~ N(0.01, 0.005)
0 | 0.086 (0.051) | 0.197 | 0.754 | 0.085 (0.051) | 0.131 | 0.694 | 0.084 (0.049) | 0.172 | 0.735
0.1 | 0.084 (0.052) | 0.215 | 0.744 | 0.083 (0.052) | 0.138 | 0.673 | 0.072 (0.06) | 0.275 | 0.597
Unbalanced Pleiotropy with β1 ~ N(0.05, 0.005)
0 | 0.087 (0.053) | 0.969 | 0.498 | 0.085 (0.051) | 0.942 | 0.397 | 0.084 (0.049) | 0.976 | 0.409
0.1 | 0.085 (0.053) | 0.965 | 0.465 | 0.084 (0.052) | 0.935 | 0.362 | 0.072 (0.061) | 0.981 | 0.243
Unbalanced Pleiotropy with β1 ~ N(0.1, 0.005)
0 | 0.088 (0.057) | 0.999 | 0.444 | 0.084 (0.052) | 0.998 | 0.383 | 0.084 (0.05) | 0.999 | 0.398
0.1 | 0.087 (0.058) | 0.999 | 0.432 | 0.084 (0.053) | 0.993 | 0.348 | 0.072 (0.061) | 0.999 | 0.233
Table 3.2 Simulation results for hJAM Egger, multivariable MR Egger, and MR Egger for different scenarios with correlated
intermediates and independent SNPs.
Columns for each method: Mean π̂1 (SE) | Power (Intercept) | Power (Causal)
π2 | hJAM Egger | Multivariable MR Egger | MR Egger
Null causal effect: π1 = 0
Balanced Pleiotropy
0 | 0 (0.031) | 0.065 | 0.067 | 0 (0.031) | 0.046 | 0.043 | 0 (0.03) | 0.049 | 0.05
0.1 | 0 (0.034) | 0.071 | 0.087 | 0.001 (0.034) | 0.053 | 0.045 | 0.004 (0.041) | 0.117 | 0.066
Unbalanced Pleiotropy with β1 ~ N(0.01, 0.005)
0 | 0.001 (0.031) | 0.258 | 0.076 | 0 (0.031) | 0.215 | 0.042 | 0 (0.03) | 0.255 | 0.051
0.1 | 0.001 (0.034) | 0.300 | 0.082 | 0.001 (0.034) | 0.241 | 0.046 | 0.004 (0.041) | 0.455 | 0.066
Unbalanced Pleiotropy with β1 ~ N(0.05, 0.005)
0 | 0.002 (0.033) | 0.998 | 0.083 | 0 (0.032) | 0.998 | 0.042 | 0 (0.03) | 0.999 | 0.051
0.1 | 0.003 (0.035) | 0.997 | 0.076 | 0.001 (0.034) | 0.993 | 0.047 | 0.004 (0.041) | 0.998 | 0.065
Unbalanced Pleiotropy with β1 ~ N(0.1, 0.005)
0 | 0.005 (0.038) | 0.999 | 0.083 | 0 (0.033) | 0.999 | 0.044 | 0 (0.031) | 0.999 | 0.052
0.1 | 0.005 (0.04) | 0.999 | 0.084 | 0.001 (0.035) | 0.999 | 0.053 | 0.004 (0.041) | 0.999 | 0.067
Positive causal effect: π1 = 0.1
Balanced Pleiotropy
0 | 0.088 (0.033) | 0.078 | 0.752 | 0.088 (0.033) | 0.05 | 0.699 | 0.088 (0.031) | 0.052 | 0.735
0.1 | 0.09 (0.034) | 0.092 | 0.74 | 0.09 (0.034) | 0.064 | 0.677 | 0.095 (0.041) | 0.132 | 0.598
Unbalanced Pleiotropy with β1 ~ N(0.01, 0.005)
0 | 0.088 (0.033) | 0.281 | 0.754 | 0.088 (0.033) | 0.225 | 0.694 | 0.088 (0.031) | 0.265 | 0.735
0.1 | 0.09 (0.034) | 0.308 | 0.744 | 0.09 (0.034) | 0.255 | 0.673 | 0.095 (0.041) | 0.479 | 0.597
Unbalanced Pleiotropy with β1 ~ N(0.05, 0.005)
0 | 0.09 (0.035) | 0.996 | 0.717 | 0.088 (0.033) | 0.998 | 0.69 | 0.088 (0.032) | 0.999 | 0.725
0.1 | 0.092 (0.036) | 0.992 | 0.723 | 0.09 (0.035) | 0.994 | 0.669 | 0.095 (0.041) | 0.998 | 0.585
Unbalanced Pleiotropy with β1 ~ N(0.1, 0.005)
0 | 0.092 (0.04) | 0.999 | 0.652 | 0.088 (0.034) | 0.999 | 0.666 | 0.088 (0.033) | 0.999 | 0.697
0.1 | 0.094 (0.041) | 0.999 | 0.647 | 0.089 (0.036) | 0.999 | 0.654 | 0.095 (0.042) | 0.999 | 0.576
3.4. Data Application
To demonstrate hJAM Egger regression in practice, we applied it to the first data example
in Chapter 2: estimating the effects of body mass index (BMI) and type 2 diabetes (T2D) on
myocardial infarction (MI). The data pre-processing has been described in Section 2.4.2. We
re-oriented the effects of all SNPs to be positive on the outcome, with one exception (the
effect of the overlapping SNP rs7903146 on BMI), and we used MR Egger (Bowden et al., 2015),
MV MR Egger (Rees et al., 2017), and hJAM Egger to detect a potential directional pleiotropy effect.
When modeled jointly, results from hJAM Egger and MV MR Egger both suggest that
no residual pleiotropy was detected when we incorporated both BMI- and T2D-associated
instruments in the analysis ($P_{\text{hJAM Egger}} = 0.57$ and $P_{\text{MV MR Egger}} = 0.51$, respectively) (Table
3.3). In contrast, the MR Egger approach applied univariately to T2D resulted in a significant test
for the intercept, suggesting the presence of pleiotropy, potentially due to the association of some of
the SNPs with the outcome via BMI.
Table 3.3 Estimates (95% confidence interval) of the intercepts on estimating the effects of
body mass index and having type 2 diabetes on the risk of myocardial infarction.
Methods | Intercepts | Log odds ratios (95% CI) | P value
hJAM Egger | Intercept | 0.453 (-1.090, 1.996) | 0.57
MV MR Egger | Intercept | 0.001 (-0.002, 0.004) | 0.51
MR Egger^ | BMI-intercept | 0.005 (-0.003, 0.013) | 0.20
MR Egger^ | T2D-intercept | 0.005 (0.001, 0.009) | 0.02
Abbreviations: 95% CI, 95% confidence interval; BMI, body mass index; T2D, type 2 diabetes;
MR Egger, Mendelian randomization with Egger regression; MV MR Egger, multivariable
Mendelian randomization with Egger regression. Note, ^ MR Egger can only incorporate one
risk factor per model; thus, we have two separate models for BMI and T2D which give us two
intercepts.
3.5. Discussion
In this chapter, we proposed hJAM Egger, a natural extension of hJAM that integrates
Egger regression into the hJAM framework to account for pleiotropic effects. hJAM Egger,
which is analogous to MR Egger (Bowden et al., 2015) and MV MR Egger (Rees et al.,
2017), performed similarly to both methods in simulations: it yields unbiased
estimates when the horizontal pleiotropy is balanced and an estimator that is conservative
towards the null when unbalanced pleiotropy is present (Bowden et al., 2015; Rees et al.,
2017). hJAM Egger can be applied as a sensitivity analysis for a multivariable
Mendelian randomization analysis (Rees et al., 2017). Like MR Egger, hJAM Egger has to
satisfy the instrument strength independent of direct effect (InSIDE) assumption. If InSIDE is
violated, both the estimated intercept and its variance will be affected (Rees et al.,
2017). A potential extension of the current hJAM approach is to include variable selection to
assess the pleiotropy assumption before incorporating the $\hat{\boldsymbol{A}}$ matrix into the model. Several
approaches of this kind have been proposed, such as JAM MR (Gkatzionis et al., 2019) and MR-PRESSO
(Verbanck et al., 2018).
Chapter 4.
SHA-JAM: A Scalable Hierarchical Approach to Joint
Analysis for Marginal Summary Omics Data
This chapter is a modified version of the work in preparation for submission to Nature
Communications:
Lai Jiang, et al., SHA-JAM: A Scalable Hierarchical Approach to Joint Analysis for Marginal
Summary Statistics with Omics Data.
4.1. Motivation
In Chapter 2, we presented a hierarchical joint analysis of marginal summary statistics
(hJAM) and showed that Mendelian randomization (MR) and transcriptome-wide association studies
(TWAS) can both be viewed statistically as forms of instrumental variable analysis using
genetic variants. The intermediates of interest are risk factors and gene expression levels for
the MR and TWAS approaches, respectively (Figure 1.1). To be a valid instrumental variable, a
genetic variant has to satisfy three assumptions: (1) it can only be associated with the outcome
through the intermediate; (2) it has to be at least moderately associated with the intermediate;
and (3) it is independent of the potential confounders between the intermediate and the outcome.
Pleiotropy bias arises from a violation of the first assumption, where the genetic
variants are associated with the outcome through an alternative pathway. For example, if a genetic
variant is associated with the outcome through two intermediates, including only one intermediate in
the model will result in pleiotropy bias. hJAM relaxes the first assumption by modeling the
intermediates jointly.
Figure 4.1 Directed acyclic graph (DAG) for the high-throughput intermediates data in an
instrumental variable analysis framework.
This DAG describes the causal diagram of the high-throughput intermediates data. Here, SNP
denotes the genetic variants included as the instrumental variables, X denotes the intermediates,
which can be genomics, transcriptomics, metabolomics, etc., and Y denotes the outcome of
interest. The solid and dotted lines denote a strong or moderate causal effect and an uncertain
association, respectively.
In recent years, the availability of omics data has grown rapidly. Extending the framework of
instrumental variable analysis into a more general form, the intermediates can also be omics data,
including genomics, metabolomics (Dettmer et al., 2007), transcriptomics, proteomics, or
epigenomics data (Sekula et al., 2016) from high-throughput experiments. The framework with
transcriptomics as the intermediates is essentially TWAS, as in FUSION (A. Gusev et al.,
2016) and PrediXcan (Gamazon et al., 2015). Figure 4.1 illustrates the causal diagram with
high-throughput data as the intermediates. To fit such data in the instrumental variable
framework, one needs a scalable approach to select the causal variants. For TWAS, a previous
approach ignores the correlation between gene expression levels and applies a univariate
intermediate model, which avoids selection among intermediates, and adopts a Bonferroni-corrected
P value to account for the inflated family-wise error rate due to multiple comparisons
(Gamazon et al., 2015). Under the MR setting, a recent Bayesian model averaging approach,
MR-BMA (Zuber et al., 2020), integrates a Bayesian framework for variable selection with the
multivariable MR model, which uses an inverse-variance weighted linear regression to formulate
the relationship between $\hat{\beta}_X$, the association estimates between the genetic variants and the
intermediates, and $\hat{\beta}_Y$, the association estimates between the genetic variants and the outcome.
Our previous work, hJAM, is a hierarchical model that uses summary statistics for a
joint analysis of multiple intermediates within an instrumental variable analysis framework. The
existing hJAM can only deal with a small number of moderately correlated intermediates since it
is essentially a linear regression of different sets of transformed summary statistics. This property
of linear regression means that hJAM does not scale to high-throughput intermediate data, which are
likely to be high-dimensional and highly correlated. To extend hJAM to data from high-throughput
experiments, we employ the hJAM framework and incorporate it with two selection
algorithms: elastic net (Zou & Hastie, 2005) and a recently proposed Bayesian variable selection in
regression approach, the Sum of Single Effects (SuSiE) model (Wang et al., 2020). SuSiE is a
scalable approach to variable selection in linear regression for highly correlated data that
contain sparse causal effects on the outcome. It employs an iterative Bayesian stepwise selection
(IBSS) procedure, which is analogous to stepwise selection and makes the algorithm fast and
computationally efficient. A common application of SuSiE is in genetic fine mapping (Zhang et
al., 2020).
4.2. Methods
4.2.1. Recap of hJAM
The hJAM model is a form of instrumental variable analysis using marginal summary
statistics for multiple intermediates. It is analogous to a Mendelian randomization or TWAS
analysis that estimates the effect of each intermediate (e.g., risk factor, gene expression) on
the outcome, conditioning on the other intermediates. Details are discussed in Chapter 2.
The instrumental variable analysis can be expressed as a two-stage model when
individual-level data are available. Stage 1 models the outcome as a linear function of the
genotypes:
$$\boldsymbol{y} = \boldsymbol{G}\boldsymbol{\beta} + \boldsymbol{e}, \quad (4.1)$$
and stage 2 models the effect estimates $\boldsymbol{\beta} \in \mathbb{R}^{K \times 1}$ from stage 1 as a function of a prior
information matrix $\hat{\boldsymbol{A}} \in \mathbb{R}^{K \times M}$:
$$\boldsymbol{\beta} = \hat{\boldsymbol{A}}\boldsymbol{\pi} + \boldsymbol{\epsilon}. \quad (4.2)$$
The vector $\boldsymbol{\pi} \in \mathbb{R}^{M \times 1}$ is the parameter of interest, which denotes the effects of the
intermediates on the outcome, and $M$ is the number of intermediates. Here, $\boldsymbol{y} \in \mathbb{R}^{n \times 1}$ and
$\boldsymbol{G} \in \mathbb{R}^{n \times K}$ denote the mean-centered outcome and genotype data for $n$ individuals, respectively,
where $K$ is the number of genetic variants. The prior information $\hat{\boldsymbol{A}}$ is a matrix of association
estimates between the genetic variants and the intermediates $X$'s and is computed using data different
from those used in Eq. 4.1. In MR and TWAS analyses, we require $\boldsymbol{\epsilon} = \boldsymbol{0}$ to satisfy the
no-pleiotropy assumption.
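Following Newcombe et al. (Newcombe et al., 2016), we convert Eq. 4.1 into a linear regression on the vector $\boldsymbol{z} := \boldsymbol{G}_0'\boldsymbol{y}$, which represents the total trait burden for the risk alleles of the single-nucleotide polymorphisms (SNPs). It is composed from three summary statistics: $\hat{\boldsymbol{b}}$, the effect estimates of $\boldsymbol{\beta}$; $\hat{\boldsymbol{p}}$, the minor allele frequencies (MAFs) of the genetic variants; and $N_y$, the sample size of the genome-wide association study (GWAS) with $\boldsymbol{y}$ as the outcome. For each genetic variant $i$, we have $z_i = 2 N_y \hat{p}_i (1 - \hat{p}_i) \hat{b}_i$, assuming Hardy-Weinberg equilibrium. With standard linear algebra, we have $\boldsymbol{z} \sim \mathrm{MVN}_K(\boldsymbol{G}_0'\boldsymbol{G}_0 \boldsymbol{\beta}, \sigma_y^2 \boldsymbol{G}_0'\boldsymbol{G}_0)$, where $\boldsymbol{G}_0'\boldsymbol{G}_0$ is a $K \times K$ genotype variance-covariance matrix computed from external mean-centered genotype data. Plugging the stage 2 model (Eq. 4.2) of the instrumental variable analysis into this expression, we have
$$\boldsymbol{z} \sim \mathrm{MVN}_K(\boldsymbol{G}_0'\boldsymbol{G}_0 \hat{\boldsymbol{A}}\boldsymbol{\pi}, \sigma_y^2 \boldsymbol{G}_0'\boldsymbol{G}_0).$$
To simplify the likelihood, we perform the Cholesky decomposition $\boldsymbol{L}'\boldsymbol{L} = \boldsymbol{G}_0'\boldsymbol{G}_0$ and transform $\boldsymbol{z}$ into $\boldsymbol{z}_L := (\boldsymbol{L}')^{-1}\boldsymbol{z}$:
$$\boldsymbol{z}_L \sim \mathrm{MVN}_K(\boldsymbol{L}\hat{\boldsymbol{A}}\boldsymbol{\pi}, \sigma_y^2 \boldsymbol{I}_K). \tag{4.3}$$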
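The construction of $\boldsymbol{z}$ and its Cholesky transformation can be sketched in a few lines of NumPy. This is a minimal illustration rather than the hJAM implementation: the function name and arguments are hypothetical, and it assumes that the reference matrix $\boldsymbol{G}_0'\boldsymbol{G}_0$ is positive definite and already on the scale of the outcome GWAS.

```python
import numpy as np

def whiten_summary_stats(b_hat, maf, n_y, G0):
    """Build z = G0'y from marginal estimates and whiten it as in Eq. 4.3.

    b_hat : (K,) marginal SNP-outcome effect estimates
    maf   : (K,) minor allele frequencies from the outcome GWAS
    n_y   : sample size of the outcome GWAS
    G0    : (n0, K) mean-centered reference genotypes (hypothetical input)
    """
    b_hat = np.asarray(b_hat, dtype=float)
    maf = np.asarray(maf, dtype=float)
    # z_i = 2 * N_y * p_i * (1 - p_i) * b_i under Hardy-Weinberg equilibrium
    z = 2.0 * n_y * maf * (1.0 - maf) * b_hat
    gtg = G0.T @ G0
    # NumPy returns a lower-triangular C with C @ C.T == G0'G0,
    # so L = C.T satisfies L'L = G0'G0 in the notation of the text.
    C = np.linalg.cholesky(gtg)
    L = C.T
    # z_L = (L')^{-1} z = C^{-1} z
    z_L = np.linalg.solve(C, z)
    return z, L, z_L
```

Regressing `z_L` on `L @ A_hat` then corresponds to the transformed model with independent errors.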
4.2.2. Sum of Single Effect model (SuSiE)
The single effect regression (SER) model assumes that, out of the $K$ explanatory variables in Eq. 4.1, exactly one has a non-zero regression coefficient:
$$\boldsymbol{\beta} = \beta\boldsymbol{\gamma}, \quad \boldsymbol{\gamma} \sim \mathrm{Mult}(1, \tau) \quad \text{and} \quad \beta \sim N_1(0, \sigma_0^2),$$
where $\boldsymbol{\gamma} \in \{0,1\}^K$ is a $K$-length vector of indicator variables and $\beta$ is the scalar "single effect". Here, $\sigma_y^2$ and $\sigma_0^2$ denote the residual variance of $\boldsymbol{y}$ and the prior variance of the non-zero effect, respectively. Taken together, SuSiE, the sum of SERs model, can be expressed as
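$$\boldsymbol{\beta} = \sum_{l=1}^{L} \boldsymbol{\beta}_l = \sum_{l=1}^{L} \beta_l \boldsymbol{\gamma}_l, \tag{4.4}$$
$$\boldsymbol{\gamma}_l \sim \mathrm{Mult}(1, \tau) \quad \text{and} \quad \beta_l \sim N_1(0, \sigma_{0l}^2), \tag{4.5}$$
where $l = 1, \dots, L$ and $L$ denotes the largest number of credible sets allowed in fitting. When $L = 1$, SuSiE reduces to a single SER model. For each credible set $l$, the posterior distributions of $\boldsymbol{\gamma}_l$ and $\beta_l$ are
$$\boldsymbol{\gamma}_l \mid \boldsymbol{G}, \boldsymbol{y}, \sigma^2, \sigma_{0l}^2 \sim \mathrm{Mult}(1, \boldsymbol{\nu}_l),$$
and
$$\beta_l \mid \boldsymbol{G}, \boldsymbol{y}, \sigma^2, \sigma_{0l}^2, \gamma_{lj} = 1 \sim N_1(\mu_{1lj}, \sigma_{1lj}^2),$$
respectively. Here $\boldsymbol{\nu}_l = (\nu_{l1}, \dots, \nu_{lK})$ denotes the posterior inclusion probabilities (PIPs) with $\nu_{lj} := P(\gamma_{lj} = 1 \mid \boldsymbol{G}, \boldsymbol{y}, \sigma^2, \sigma_{0l}^2)$, and $\mu_{1lj}$ and $\sigma_{1lj}^2$ denote the posterior mean and variance of $\beta_l$ given $\gamma_{lj} = 1$, which can be computed as in a Bayesian simple linear regression.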
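Iterative Bayesian stepwise selection (IBSS) was proposed by Wang et al. (Wang et al., 2020) to fit SuSiE. At each iteration, given the current estimates of $\boldsymbol{\beta}_{l'}$ for all $l' \neq l$, we update the estimate of $\boldsymbol{\beta}_l$ with a SER model. Details are described in Algorithms 1 and 4 of (Wang et al., 2020). Note that for each variable $j$, $\beta_{lj}$ is independent across $l$ because of a property of the variational approximation that SuSiE employs in computing the posterior distribution in IBSS (Wang et al., 2020). Therefore, under SuSiE, the posterior mean and PIP of the $j$th variable are
$$\beta^{(j)} := \sum_{l=1}^{L} \beta_{lj} = \sum_{l=1}^{L} \mu_{1lj}\nu_{lj},$$
and
$$\mathrm{PIP}_j := \Pr(\beta^{(j)} \neq 0 \mid \boldsymbol{G}, \boldsymbol{y}) \approx 1 - \prod_{l=1}^{L} (1 - \nu_{lj}),$$
respectively, where $\mu_{1lj}$ is the posterior mean of $\beta_l$ given that variable $j$ is selected in effect $l$ and $\nu_{lj}$ is the corresponding posterior inclusion probability.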
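The two aggregation formulas above are simple to evaluate once the per-effect quantities are available. The following sketch assumes they are stored as $L \times K$ arrays; the function name is hypothetical and not part of susieR.

```python
import numpy as np

def combine_single_effects(nu, mu):
    """Aggregate L single effects into per-variable summaries.

    nu : (L, K) per-effect posterior inclusion probabilities
    mu : (L, K) posterior means of each effect given that variable j is selected
    """
    nu = np.asarray(nu, dtype=float)
    mu = np.asarray(mu, dtype=float)
    post_mean = (nu * mu).sum(axis=0)        # sum over l of mu_lj * nu_lj
    pip = 1.0 - np.prod(1.0 - nu, axis=0)    # 1 - prod over l of (1 - nu_lj)
    return post_mean, pip
```

For example, with two effects and two variables, `nu = [[0.9, 0.1], [0.5, 0.5]]` gives PIPs of 0.95 and 0.55: a variable that is weakly supported by two effects can still accumulate a moderate overall PIP.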
4.2.3. SHA-JAM: Scalable Hierarchical Approach for Joint Analysis of Marginal summary
data
To extend hJAM to data from high-throughput experiments, we developed SHA-JAM, a Scalable Hierarchical Approach for Joint Analysis of Marginal summary statistics, by integrating the hJAM framework with a variable selection algorithm, the Sum of Single Effect model (SuSiE) (Wang et al., 2020). The hJAM framework can be viewed as a transformed two-stage model with summary data (see Methods), of which the second stage models $\boldsymbol{\beta} \in \mathbb{R}^{K \times 1}$, the vector that contains the effect estimates of the genetic variants on the outcome, as a function of a prior information matrix $\hat{\boldsymbol{A}} \in \mathbb{R}^{K \times M}$: $\boldsymbol{\beta} = \hat{\boldsymbol{A}}\boldsymbol{\pi} + \boldsymbol{\epsilon}$, where $\boldsymbol{\pi}$ is the vector of parameters of interest and $\boldsymbol{\epsilon} = \boldsymbol{0}$ when the "no unmeasured pleiotropy" assumption of instrumental variable analysis is satisfied. The $\hat{\boldsymbol{A}}$ matrix needs to be obtained from external data. SuSiE was recently proposed for variable selection in regression for highly correlated data with sparse detectable effects, such as fine mapping in GWAS (Wang et al., 2020). In the fitting procedure, SuSiE employs an iterative Bayesian stepwise selection (IBSS) method, which is analogous to forward selection and shares its computational simplicity and efficiency. At each step, IBSS fits a single effect regression (SER) model to compute a posterior distribution, and the effects vector is modeled as the sum of the single effects across steps. In the SHA-JAM implementation, we fit the stage 2 model of hJAM into the linear regression framework of SuSiE and designed a SuSiE-based selection algorithm (Algorithm 1) tailored to genetic summary statistics. Specifically, the model is
$$\boldsymbol{\beta} = \hat{\boldsymbol{A}}\boldsymbol{\pi} = \hat{\boldsymbol{A}}\sum_{l=1}^{L}\boldsymbol{\pi}_l = \hat{\boldsymbol{A}}\sum_{l=1}^{L}\pi_l\boldsymbol{\gamma}_l, \quad \boldsymbol{\gamma}_l \sim \mathrm{Mult}(1, \tau) \quad \text{and} \quad \pi_l \sim N_1(0, \sigma_{0l}^2). \tag{4.6}$$
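Given a fixed $\hat{\boldsymbol{A}}$ matrix computed from external data, Eq. 4.6 is equivalent to solving for $\boldsymbol{\pi} = \sum_{l=1}^{L} \pi_l\boldsymbol{\gamma}_l$ with $\boldsymbol{\gamma}_l \sim \mathrm{Mult}(1, \tau)$ and $\pi_l \sim N_1(0, \sigma_{0l}^2)$. When only summary data are available, we use $(\boldsymbol{G}\hat{\boldsymbol{A}})'(\boldsymbol{G}\hat{\boldsymbol{A}})$, $(\boldsymbol{G}\hat{\boldsymbol{A}})'\boldsymbol{y} = \hat{\boldsymbol{A}}'\boldsymbol{G}'\boldsymbol{y} = \hat{\boldsymbol{A}}'\boldsymbol{z}$, and $\boldsymbol{y}'\boldsymbol{y}$ to obtain the posterior mean and PIP of each intermediate using SuSiE with summary data. See Algorithm 1 for fitting the SuSiE hJAM model.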
We identify the causal intermediates by computing credible sets, which are constructed from the posterior inclusion probability (PIP) of each intermediate and the correlation structure between the intermediates. SHA-JAM reports the posterior means of the intermediates as the effect estimates. The novel structure of SuSiE and the flexible framework of hJAM enable SHA-JAM to select causal intermediates efficiently in high-dimensional and/or highly correlated data settings with good overall performance, as shown by our simulation studies.
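As an illustration of how a credible set can be formed from the PIPs of one single effect, the sketch below takes the smallest set of intermediates whose PIPs sum to at least the requested coverage and reports its minimum absolute correlation as a purity measure. This mirrors our understanding of the credible-set construction in SuSiE; the function names and the default coverage are assumptions made for the example.

```python
import numpy as np

def credible_set(nu_l, coverage=0.95):
    """Smallest set of variables whose PIPs for one effect sum to >= coverage."""
    nu_l = np.asarray(nu_l, dtype=float)
    order = np.argsort(nu_l)[::-1]            # variables by decreasing PIP
    cum = np.cumsum(nu_l[order])
    # first position at which the cumulative PIP reaches the coverage
    size = min(int(np.searchsorted(cum, coverage)) + 1, len(order))
    return order[:size]

def min_abs_corr(R, idx):
    """Minimum absolute pairwise correlation within a set (its 'purity')."""
    sub = np.abs(np.asarray(R)[np.ix_(idx, idx)])
    return float(sub.min())
```

A credible set would then be reported only if `min_abs_corr` exceeds a chosen threshold, so that sets spanning weakly correlated intermediates are discarded.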
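Algorithm 1: Pseudo-algorithm for fitting SHA-JAM with iterative Bayesian stepwise selection (IBSS)
Input data: $\hat{\boldsymbol{b}}$, $se(\hat{\boldsymbol{b}})$, $N_{G_y}$, $\mathbf{MAF}_{G_y}$, $\boldsymbol{G}_R$, $\hat{\boldsymbol{A}}$
Input arguments: the largest number of credible sets allowed, $L$, and the SuSiE hyperparameters $\sigma_0^2$ and $\sigma^2$
Functions required:
(1) SER_ss$(\boldsymbol{X}'\boldsymbol{X}, \boldsymbol{X}'\boldsymbol{y}; \sigma_0^2, \sigma^2) \rightarrow (\boldsymbol{\nu}, \boldsymbol{\mu}_1, \boldsymbol{\sigma}_1)$, which computes the posterior distribution of $\boldsymbol{\pi}_l$ under the SER model with summary data;
(2) hJAM$(\hat{\boldsymbol{b}}, se(\hat{\boldsymbol{b}}), N_{G_y}, \mathbf{MAF}_{G_y}, \boldsymbol{G}_R, \hat{\boldsymbol{A}}) \rightarrow ((\boldsymbol{G}\hat{\boldsymbol{A}})'(\boldsymbol{G}\hat{\boldsymbol{A}}), (\boldsymbol{G}\hat{\boldsymbol{A}})'\boldsymbol{y}, \boldsymbol{y}'\boldsymbol{y})$, which computes the hJAM variables for the function SER_ss.
1. Get hJAM variables: $((\boldsymbol{G}\hat{\boldsymbol{A}})'(\boldsymbol{G}\hat{\boldsymbol{A}}), (\boldsymbol{G}\hat{\boldsymbol{A}})'\boldsymbol{y}, \boldsymbol{y}'\boldsymbol{y}) \leftarrow$ hJAM$(\hat{\boldsymbol{b}}, se(\hat{\boldsymbol{b}}), N_{G_y}, \mathbf{MAF}_{G_y}, \boldsymbol{G}_R, \hat{\boldsymbol{A}})$
   For simplicity, we denote $\boldsymbol{X} := \boldsymbol{G}\hat{\boldsymbol{A}}$ in the following text.
2. Initialize posterior means $\bar{\boldsymbol{\pi}}_l = \boldsymbol{0}$, for $l = 1, \dots, L$
3. Repeat
4.    For $l$ in $1, \dots, L$ do
5.       $\boldsymbol{X}'\bar{\boldsymbol{r}}_l \leftarrow \boldsymbol{X}'\boldsymbol{y} - \boldsymbol{X}'\boldsymbol{X} \sum_{l' \neq l} \bar{\boldsymbol{\pi}}_{l'}$
6.       $(\boldsymbol{\nu}_l, \boldsymbol{\mu}_{1l}, \boldsymbol{\sigma}_{1l}) \leftarrow$ SER_ss$(\boldsymbol{X}'\boldsymbol{X}, \boldsymbol{X}'\bar{\boldsymbol{r}}_l; \sigma_0^2, \sigma^2)$
7.       $\bar{\boldsymbol{\pi}}_l \leftarrow \boldsymbol{\nu}_l \circ \boldsymbol{\mu}_{1l}$ ($\circ$ denotes elementwise multiplication.)
8. Until convergence criterion satisfied
9. For $m$ in $1, \dots, M$ do
10.    $(\pi^{(m)}, \mathrm{PIP}_m) \leftarrow (\nu_{1m}, \mu_{11m}, \sigma_{11m}, \dots, \nu_{Lm}, \mu_{1Lm}, \sigma_{1Lm})$
Return $\boldsymbol{\pi}$ and $\mathbf{PIP}$
Note: this algorithm is modified from Algorithm 1 in Wang et al. (Wang et al., 2020).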
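The loop in Algorithm 1 can be sketched as follows, starting from the sufficient statistics $\boldsymbol{X}'\boldsymbol{X}$ and $\boldsymbol{X}'\boldsymbol{y}$. This is a simplified illustration and not the susieR or SHA-JAM code: it assumes a uniform prior over intermediates, keeps $\sigma_0^2$ and $\sigma^2$ fixed instead of estimating them, and stops when the posterior means change by less than a tolerance rather than monitoring an evidence lower bound. The function names are chosen to match the pseudo-algorithm.

```python
import numpy as np

def ser_ss(XtX, Xtr, sigma0_sq, sigma_sq):
    """Single effect regression from sufficient statistics (uniform prior)."""
    d = np.diag(XtX)
    b = Xtr / d                       # univariate least-squares estimates
    s_sq = sigma_sq / d               # their sampling variances
    # log Bayes factor of each single-variable model against the null
    log_bf = 0.5 * np.log(s_sq / (s_sq + sigma0_sq)) \
        + 0.5 * (b ** 2 / s_sq) * sigma0_sq / (s_sq + sigma0_sq)
    w = np.exp(log_bf - log_bf.max())  # subtract the max for numerical stability
    nu = w / w.sum()                   # posterior inclusion probabilities
    var1 = 1.0 / (1.0 / s_sq + 1.0 / sigma0_sq)
    mu1 = var1 * b / s_sq              # posterior mean given inclusion
    return nu, mu1, var1

def ibss_ss(XtX, Xty, L=5, sigma0_sq=1.0, sigma_sq=1.0, max_iter=100, tol=1e-6):
    """Iterative Bayesian stepwise selection on summary statistics."""
    M = Xty.shape[0]
    pi_bar = np.zeros((L, M))          # step 2: posterior means start at zero
    nu = np.full((L, M), 1.0 / M)
    for _ in range(max_iter):          # step 3: repeat
        previous = pi_bar.copy()
        for l in range(L):             # step 4
            others = pi_bar.sum(axis=0) - pi_bar[l]
            Xtr = Xty - XtX @ others   # step 5: residualized X'y
            nu[l], mu1, _ = ser_ss(XtX, Xtr, sigma0_sq, sigma_sq)  # step 6
            pi_bar[l] = nu[l] * mu1    # step 7
        if np.max(np.abs(pi_bar - previous)) < tol:  # step 8 (simplified)
            break
    pip = 1.0 - np.prod(1.0 - nu, axis=0)            # steps 9-10
    return pi_bar.sum(axis=0), pip
```

Because each pass only requires matrix-vector products with $\boldsymbol{X}'\boldsymbol{X}$, the cost per iteration grows with $L M^2$ and does not depend on the sample size.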
4.2.4. Regularized hJAM
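Benefiting from the independent error structure of the transformed model in Eq. 4.3, we can apply regularization for variable selection in regression. The generalized objective function is
$$\hat{\boldsymbol{\pi}} = \underset{\boldsymbol{\pi}}{\operatorname{argmin}} \left( \|\boldsymbol{z}_L - \boldsymbol{L}\hat{\boldsymbol{A}}\boldsymbol{\pi}\|^2 + \lambda_1 \|\boldsymbol{\pi}\|_2^2 + \lambda_2 \|\boldsymbol{\pi}\|_1 \right),$$
where $\|\boldsymbol{\pi}\|_1 = \sum_{m=1}^{M} |\pi_m|$ and $\|\boldsymbol{\pi}\|_2^2 = \sum_{m=1}^{M} \pi_m^2$. Denoting $c := \frac{\lambda_2}{\lambda_1 + \lambda_2}$ and $\lambda = \lambda_1 + \lambda_2$, we can express the regularized hJAM problem as
$$\hat{\boldsymbol{\pi}} = \underset{\boldsymbol{\pi}}{\operatorname{argmin}} \left( \|\boldsymbol{z}_L - \boldsymbol{L}\hat{\boldsymbol{A}}\boldsymbol{\pi}\|^2 + \lambda(1 - c) \|\boldsymbol{\pi}\|_2^2 + \lambda c \|\boldsymbol{\pi}\|_1 \right).$$
When $c = 0$, we have a ridge hJAM, and when $c = 1$, we have a LASSO hJAM. We use the same optimization algorithms to solve the regularized hJAM problem as in the original elastic net paper (Appendix 4A.1) (Zou & Hastie, 2005).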
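The equivalence of the two parameterizations can be checked numerically. The sketch below only evaluates the two objective functions and does not solve the optimization problem; all function names are hypothetical.

```python
import numpy as np

def to_lambda_c(lam1, lam2):
    """Map (lambda_1, lambda_2) to (lambda, c) with c = lambda_2 / (lambda_1 + lambda_2)."""
    lam = lam1 + lam2
    return lam, lam2 / lam

def objective(pi, z_L, LA, lam1, lam2):
    """Residual sum of squares plus ridge (lam1) and LASSO (lam2) penalties."""
    resid = z_L - LA @ pi
    return resid @ resid + lam1 * np.sum(pi ** 2) + lam2 * np.sum(np.abs(pi))

def objective_lc(pi, z_L, LA, lam, c):
    """The same objective written with the overall penalty lam and mixing weight c."""
    resid = z_L - LA @ pi
    return resid @ resid + lam * (1.0 - c) * np.sum(pi ** 2) + lam * c * np.sum(np.abs(pi))
```

Setting `lam2 = 0` yields `c = 0` (ridge) and setting `lam1 = 0` yields `c = 1` (LASSO), in agreement with the two special cases in the text.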
4.2.5. Composing the $\hat{\boldsymbol{A}}$ matrix
The $\hat{\boldsymbol{A}} \in \mathbb{R}^{K \times M}$ matrix is computed from external data and can be expressed as
$$\hat{\boldsymbol{A}}_{K \times M} = \begin{bmatrix} \hat{\alpha}_{11} & \dots & \hat{\alpha}_{1M} \\ \vdots & \ddots & \vdots \\ \hat{\alpha}_{K1} & \dots & \hat{\alpha}_{KM} \end{bmatrix},$$
where $\hat{\alpha}_{km}$ denotes the association estimate between genetic variant $k$ and intermediate $m$. We discussed the importance of composing the correct $\hat{\boldsymbol{A}}$ matrix in Chapter 2, including the selection of valid genetic variants and the different types of coefficients that can be used in the $\hat{\boldsymbol{A}}$ matrix. To be valid instrumental variables, the genetic variants have to satisfy the three assumptions discussed in Section 2.1.
Computing the coefficients in the $\hat{\boldsymbol{A}}$ matrix can be considered a fine-mapping problem for each intermediate (Supplementary Figure 1). In Chapter 2, we introduced three types of $\hat{\boldsymbol{A}}$ matrix: marginal summary statistics, elastic net coefficients from individual-level genotype data, and conditional summary statistics obtained by converting the marginal ones using JAM (joint analysis of marginal summary statistics) (Newcombe et al., 2016). The elastic net $\hat{\boldsymbol{A}}$ is employed by PrediXcan (Gamazon et al., 2015) to predict gene expression in individual-level genotype data. The genetic variants included in the marginal summary statistics $\hat{\boldsymbol{A}}$ are first identified by a univariate linear or logistic regression between each genetic variant and the intermediates, using the Bonferroni correction to account for multiple comparisons. Additional pruning steps, either statistical or biological, are then applied to the set of univariately significant genetic variants to produce the final set of pruned genetic variants, for which the estimates in $\hat{\boldsymbol{A}}$ are the univariate regression coefficients. MR approaches use the inverse variance weighted (IVW) marginal summary statistics, i.e. $\hat{\alpha}_{km} / se(\hat{\beta}_k)$, and require the genetic variants to be independent (Zuber et al., 2020). hJAM uses the JAM conditional summary statistics $\hat{\boldsymbol{A}}$, which starts from the same set of genetic variants as the marginal summary statistics $\hat{\boldsymbol{A}}$ and applies the JAM framework (Newcombe et al., 2016) to obtain the estimate for each genetic variant, conditioning on the other genetic variants. However, the JAM framework does not perform this conversion well when only pre-pruned, highly correlated genetic variants are available.
To address this issue, we incorporate the SuSiE algorithm into the JAM framework, an approach we call SuSiE JAM, to compose the SuSiE JAM $\hat{\boldsymbol{A}}$ when only highly correlated summary data are available. The susieR package by Wang et al. (Wang et al., 2020) provides an implementation of SuSiE that takes $\boldsymbol{G}'\boldsymbol{G}$, $\boldsymbol{G}'\boldsymbol{x}$, and $\boldsymbol{x}'\boldsymbol{x}$ as input. To avoid an intercept term, as in hJAM, all genotype data are centered so that $g_{ik} = -2p_k$, $1 - 2p_k$, or $2 - 2p_k$, where $g_{ik}$ denotes the centered dosage of the $k$th SNP for individual $i$.
We define $\boldsymbol{z} := \boldsymbol{G}'\boldsymbol{x}$ as in JAM (Newcombe et al., 2016) and construct $\boldsymbol{z}$ from the genotype group means $\bar{g}_{ks}$ and the counts for each group $n_{ks}$, where $s = 0, \dots, 2$. Assuming Hardy-Weinberg equilibrium (HWE), we have $\hat{n}_{k0} = (1 - \hat{p}_k)^2 n$, $\hat{n}_{k1} = 2\hat{p}_k(1 - \hat{p}_k)n$, and $\hat{n}_{k2} = \hat{p}_k^2 n$, where $\hat{p}_k$ denotes the allele frequency of the reference allele and $n$ is the sample size of the summary data. Then the overall group mean for SNP $k$ can be expressed as
$$\bar{g}_{k\cdot} = \frac{\hat{n}_{k0}\bar{g}_{k0} + \hat{n}_{k1}\bar{g}_{k1} + \hat{n}_{k2}\bar{g}_{k2}}{\hat{n}_{k0} + \hat{n}_{k1} + \hat{n}_{k2}} = 0 \tag{4.7}$$
since $\boldsymbol{G}$ is mean centered. If the assumption of additive allelic effects holds, we have $\bar{g}_{k1} = \bar{g}_{k0} + \hat{\alpha}_{km}$ and $\bar{g}_{k2} = \bar{g}_{k0} + 2\hat{\alpha}_{km}$. Substituting these two equations into Eq. 4.7, we can approximate the genotype group means $\bar{g}_{k1}$ and $\bar{g}_{k2}$ and therefore compose $z_k = \hat{n}_{k1}\bar{g}_{k1} + 2\hat{n}_{k2}\bar{g}_{k2}$.
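The substitution described above has a closed form, which the following sketch makes explicit. Solving Eq. 4.7 under the additive assumption gives $\bar{g}_{k0} = -\hat{\alpha}_{km}(\hat{n}_{k1} + 2\hat{n}_{k2})/n$, and the resulting $z_k$ simplifies to $2n\hat{p}_k(1 - \hat{p}_k)\hat{\alpha}_{km}$ under HWE. The function name is hypothetical.

```python
import numpy as np

def z_from_marginal(alpha_hat, p_hat, n):
    """Compose z_k from marginal estimates via the genotype group means (Eq. 4.7)."""
    alpha_hat = np.asarray(alpha_hat, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    # expected genotype counts under Hardy-Weinberg equilibrium
    n0 = (1.0 - p_hat) ** 2 * n
    n1 = 2.0 * p_hat * (1.0 - p_hat) * n
    n2 = p_hat ** 2 * n
    # n0*g0 + n1*(g0 + a) + n2*(g0 + 2a) = 0  =>  g0 = -a*(n1 + 2*n2)/(n0 + n1 + n2)
    g0 = -alpha_hat * (n1 + 2.0 * n2) / (n0 + n1 + n2)
    g1 = g0 + alpha_hat
    g2 = g0 + 2.0 * alpha_hat
    return n1 * g1 + 2.0 * n2 * g2
```

For instance, with $n = 1000$, $\hat{p}_k = 0.1$ and $\hat{\alpha}_{km} = 0.2$, both routes give $z_k = 36$.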
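To construct the plug-in estimates of $\boldsymbol{G}'\boldsymbol{G}$ and $\boldsymbol{x}'\boldsymbol{x}$, we followed Yang et al. (Yang et al., 2012). The marginal and joint effect estimates of the SNPs on intermediate $m$ can be expressed as $\hat{\boldsymbol{\alpha}}_m = \boldsymbol{D}^{-1}\boldsymbol{G}'\boldsymbol{x}_m$ and $\hat{\boldsymbol{a}}_m = (\boldsymbol{G}'\boldsymbol{G})^{-1}\boldsymbol{G}'\boldsymbol{x}_m$, respectively, with variance $\mathrm{var}(\hat{\boldsymbol{a}}_m) = \sigma_e^2(\boldsymbol{G}'\boldsymbol{G})^{-1}$, where $\boldsymbol{D}$ is the diagonal matrix of $\boldsymbol{G}'\boldsymbol{G}$ with $D_k = \sum_{i=1}^{n} g_{ik}^2$ and $\sigma_e^2$ denotes the residual variance in the joint analysis. Since $g_{ik}^2$ is not available, we approximate $D_k$ by $2\hat{p}_k(1 - \hat{p}_k)n$, assuming HWE. For a single SNP $k$, we have
$$R_{k,e}^2 = \frac{\hat{a}_{km}'\boldsymbol{g}_k'\boldsymbol{x}}{\boldsymbol{x}'\boldsymbol{x}} = \frac{\hat{a}_{km}' D_k \hat{\alpha}_{km}}{\boldsymbol{x}'\boldsymbol{x}}.$$
Thus, we have
$$\hat{\sigma}_{k,e}^2 = \frac{(1 - R_{k,e}^2)\boldsymbol{x}'\boldsymbol{x}}{n - 1} = \frac{\boldsymbol{x}'\boldsymbol{x} - D_k\hat{\alpha}_{km}^2}{n - 1},$$
and the squared standard error of the estimate is $S_k^2 = \hat{\sigma}_{k,e}^2 / D_k$. It follows that $\boldsymbol{x}'\boldsymbol{x} = D_k S_k^2 (n - 1) + D_k\hat{\alpha}_{km}^2$. Following Yang et al. (Yang et al., 2012), we take as the plug-in estimate of $\boldsymbol{x}'\boldsymbol{x}$ the median of $D_k S_k^2 (n - 1) + D_k\hat{\alpha}_{km}^2$ across all SNPs. Denoting by $\boldsymbol{W}$ the reference genotype data with sample size $n_W$, we have $\boldsymbol{G}'\boldsymbol{G} = \boldsymbol{W}'\boldsymbol{W}$ when $n_W = n$. When the sample sizes differ, we need to adjust the variance-covariance matrix (i.e. $\boldsymbol{W}'\boldsymbol{W}$) accordingly. Defining $D_{W,k} = \sum_{i=1}^{n_W} w_{ik}^2$, where $w_{ik}$ denotes the dosage of the $k$th SNP for individual $i$ in $\boldsymbol{W}$, we can approximate $\boldsymbol{G}'\boldsymbol{G}$ by
$$(\boldsymbol{G}'\boldsymbol{G})_{k_1,k_2} = \sqrt{\frac{D_{k_1} D_{k_2}}{D_{W,k_1} D_{W,k_2}}} \sum_{i=1}^{n_W} w_{i,k_1} w_{i,k_2}.$$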
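Both plug-in estimates can be written compactly. The sketch below is an illustration under the HWE approximation $D_k = 2\hat{p}_k(1 - \hat{p}_k)n$; the function names are hypothetical, and `W` is assumed to be a mean-centered reference genotype matrix.

```python
import numpy as np

def plug_in_xtx(alpha_hat, se, p_hat, n):
    """Median across SNPs of D_k * S_k^2 * (n - 1) + D_k * alpha_km^2."""
    alpha_hat, se, p_hat = (np.asarray(v, dtype=float) for v in (alpha_hat, se, p_hat))
    D = 2.0 * p_hat * (1.0 - p_hat) * n
    return float(np.median(D * se ** 2 * (n - 1) + D * alpha_hat ** 2))

def rescale_gtg(W, p_hat, n):
    """Approximate G'G by rescaling W'W so that its diagonal equals D_k."""
    p_hat = np.asarray(p_hat, dtype=float)
    D = 2.0 * p_hat * (1.0 - p_hat) * n
    WtW = W.T @ W
    D_w = np.diag(WtW)
    scale = np.sqrt(D / D_w)
    # entry (k1, k2): sqrt(D_k1 * D_k2 / (D_W,k1 * D_W,k2)) * sum_i w_i,k1 * w_i,k2
    return WtW * np.outer(scale, scale)
```

The rescaling keeps the correlation structure of the reference panel and only changes the scale of each SNP, which is why the diagonal of the result equals $D_k$ exactly.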
4.3. Simulation Studies
4.3.1. Constructing the $\hat{\boldsymbol{A}}$ matrix for simulations
In the simulations, we constructed four types of $\hat{\boldsymbol{A}}$ matrix: elastic net and SuSiE with individual-level data, and IVW marginal and SuSiE JAM with summary data. A pre-pruning step retained only the SNPs that were univariately significant. Significance was defined as $P < 0.05/K$, which accounted for multiple comparisons by using the Bonferroni correction. To mimic realistic applications, we simulated a large data set ($N = 5000$) to generate the summary data, representing summary statistics from a meta-analysis of GWASs, and randomly drew a smaller subset ($N_{\text{subset}} = 500$) to construct the $\hat{\boldsymbol{A}}$ matrices that require individual-level data, representing gene expression data such as GTEx (Lonsdale et al., 2013). We pruned the IVW marginal $\hat{\boldsymbol{A}}$ at a level of $r^2 \leq 0.16$ using Priority Pruner (Appendix 4A.2) (Edlund et al.). As a sensitivity analysis, we generated a SuSiE $\hat{\boldsymbol{A}}$ with the larger individual-level data set ($N = 5000$). The performance of each type of $\hat{\boldsymbol{A}}$ matrix was assessed by the MSE of the predicted intermediates $\boldsymbol{X}$.
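The Bonferroni pre-pruning rule can be illustrated as below. The sketch assumes two-sided p-values from a normal approximation to the Wald statistic, which may differ slightly from the test used in the simulations; the function name is hypothetical.

```python
import numpy as np
from math import erfc, sqrt

def bonferroni_keep(alpha_hat, se, level=0.05):
    """Flag SNPs whose univariate p-value is below level / K."""
    z = np.abs(np.asarray(alpha_hat, dtype=float) / np.asarray(se, dtype=float))
    # two-sided p-value under a standard normal reference distribution
    p = np.array([erfc(v / sqrt(2.0)) for v in z])
    return p < level / len(p)
```

With $K = 300$ SNPs the threshold is $0.05/300 \approx 1.7 \times 10^{-4}$, so a SNP needs an absolute Wald statistic of roughly 3.8 or more to be retained.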
4.3.2. Simulation Settings
To evaluate the performance of SHA-JAM, we conducted an extensive set of simulation studies and compared its performance with elastic net hJAM and MR-BMA (Zuber et al., 2020). Elastic net hJAM applies the elastic net (Zou & Hastie, 2005) to the linear regression framework of the summary data constructed by hJAM.
For each simulation, we simulated three standardized individual-level genotype data sets, $\boldsymbol{G}_R$, $\boldsymbol{G}_X$, and $\boldsymbol{G}_y$ ($K = 300$), an intermediate matrix $\boldsymbol{X}$ ($M = 50$), and an outcome vector $\boldsymbol{y}$. The linkage disequilibrium (LD) structure and the number of independent blocks ($n_{\text{SNP blocks}} = 10$) were the same across the three $\boldsymbol{G}$'s. The MAFs of the SNPs were randomly drawn from a uniform distribution on (0.05, 0.3). The sample sizes of the three $\boldsymbol{G}$'s were $N_{G_R} = 500$, $N_{G_X} = 5000$, and $N_{G_y} = 5000$, respectively. We set three LD scenarios for the genotype data: independent SNPs with $r_{\text{within block}} = 0$, moderately correlated SNPs with $r_{\text{within block}} = 0.6$, and highly correlated SNPs with $r_{\text{within block}} = 0.8$ (Supplementary Figure 4B.1). For intermediate $m$, 5 out of the 300 SNPs were randomly picked as the causal SNPs, and the effect sizes (i.e. column $m$ of matrix $\boldsymbol{A}$) of the 5 causal SNPs were defined by the correlation between the SNPs and the intermediate, with total $r^2_{G, X_m} = 0.1$. We simulated two correlation structures for the intermediate data: correlated intermediates with $\max(r_X) = 0.6$ and independent intermediates with $\max(r_X) = 0$ (Supplementary Figure 4B.2). Among the 50 intermediates, we randomly set 0, 3, 7, or 10 intermediates to be causal, each with an effect size of $\pi = 0.3$.
The set of summary statistics required by the methods was then generated from the individual-level data, including
(1) a vector of effect estimates $\hat{\boldsymbol{b}}$ and the corresponding $se(\hat{\boldsymbol{b}})$ from $\boldsymbol{G}_y$ and $\boldsymbol{y}$;
(2) a vector of the MAFs of the SNPs from $\boldsymbol{G}_y$;
(3) four types of $\hat{\boldsymbol{A}}$ using the approaches described above from $\boldsymbol{G}_X$ and $\boldsymbol{X}$;
(4) an LD structure from $\boldsymbol{G}_R$.
We compared the performance of our approaches, SHA-JAM and elastic net hJAM, with MR-BMA (Zuber et al., 2020). Performance was assessed by the mean squared error (MSE) and the running time, averaged over 600 replicates. Additionally, for scenarios with causal intermediates, we assessed the area under the receiver operating characteristic (ROC) curve (AUC), while for scenarios with no causal intermediate, we assessed the number of false positives. The MSE decomposes into the variance and the squared bias of the estimate. A true positive is defined as a truly non-zero intermediate being selected by the model: included in a credible set for SHA-JAM, assigned a non-zero effect by elastic net hJAM, or assigned a marginal inclusion probability larger than 0.2 by MR-BMA.
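The decomposition of the MSE into squared bias and variance can be verified on the estimates of a single parameter across replicates. The helper below is a generic sketch with a hypothetical name and is not the evaluation code used for the simulations.

```python
import numpy as np

def mse_decomposition(estimates, truth):
    """Return (MSE, squared bias, variance) of replicate estimates of one parameter."""
    estimates = np.asarray(estimates, dtype=float)
    mse = float(np.mean((estimates - truth) ** 2))
    bias_sq = float((estimates.mean() - truth) ** 2)
    var = float(estimates.var())   # population variance (ddof=0), so MSE = bias^2 + var exactly
    return mse, bias_sq, var
```

For the estimates 0.2, 0.4, 0.3 and 0.5 of a true effect of 0.3, the MSE is 0.015, made up of a squared bias of 0.0025 and a variance of 0.0125.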
For SHA-JAM, we set the minimum absolute correlation between the variables in each credible set to 0.5 and the largest number of credible sets to 10. For elastic net hJAM, we used the glmnet package and 10-fold cross-validation to tune the two shrinkage parameters, i.e. $\lambda_1$ and $\lambda_2$, with lambda.1se, which gives the most regularized model such that the mean cross-validation error is within one standard error of the minimum (Supplementary Figure 4B.3). For the sake of running time, we set 500 iterations for the shotgun stochastic search with a prior probability of 0.1 for MR-BMA. In MR-BMA, the minimum and maximum numbers of risk factors per model were set to 1 and the true number of causal intermediates plus 5, respectively.
We constructed different $\hat{\boldsymbol{A}}$ matrices for SHA-JAM and elastic net hJAM and present in the main text the results from the best-performing one, which was the SuSiE JAM $\hat{\boldsymbol{A}}$. For MR-BMA, we used the inverse-variance weighted marginal $\hat{\boldsymbol{A}}$, as suggested by the original paper (Zuber et al., 2020). All simulation studies were performed in R version 3.6.0.
4.3.3. Simulation Results
The overall average AUC across all scenarios was 0.781, 0.645 and 0.840 for MR-BMA, elastic net hJAM, and SHA-JAM, respectively. In the correlated SNPs scenarios ($r_{\text{within block}} = 0.6$ and $r_{\text{within block}} = 0.8$), SHA-JAM consistently outperformed MR-BMA and elastic net hJAM, regardless of the correlation structure of the intermediates (Figure 4.2 and Supplementary Figure 4B.1). The better performance of SHA-JAM benefited from the flexibility of hJAM, which can handle correlated SNPs, whereas MR-BMA requires independent SNPs. Both elastic net hJAM and MR-BMA showed a relatively consistent average AUC across different numbers of causal intermediates, while SHA-JAM performed better in the scenarios with fewer causal intermediates because SuSiE was designed for data with sparse effects (Wang et al., 2020). The performance of each algorithm did not differ between the correlated and independent intermediates scenarios. When no causal intermediate existed, elastic net hJAM and SHA-JAM identified almost zero intermediates across all scenarios, while MR-BMA identified on average 0.747 (1.49%) and 2.866 (5.73%) intermediates in the scenarios with independent and correlated SNPs, respectively (Table 4.1). The higher false positive rate of MR-BMA may be because it requires at least one risk factor to be included in the model at each step. When no causal intermediates exist, this requirement may lead to false positives.
Figure 4.2 Comparing three selection algorithms with the best-performing $\hat{\boldsymbol{A}}$ matrix by area under the curve (AUC) across 600 simulation replicates for three linkage disequilibrium (LD) structures with correlated or independent intermediates.
Simulation results for three LD structures with different numbers of causal intermediates: 3, 7, and 10. The bar plots show the average AUC values. The algorithms are displayed as "Selection algorithm ($\hat{\boldsymbol{A}}$ matrix algorithm)". We used the pruned inverse-variance weighted marginal $\hat{\boldsymbol{A}}$ for MR-BMA and the SuSiE JAM $\hat{\boldsymbol{A}}$ for SHA-JAM and elastic net hJAM. For the correlated SNPs scenarios, we used a threshold of $r^2 \leq 0.16$ to prune the genetic variants for the inverse-variance weighted marginal $\hat{\boldsymbol{A}}$. MR-BMA was set to have 500 iterations in the shotgun stochastic search.
Table 4.1 Average number of false positives identified by SHA-JAM, elastic net hJAM and MR-BMA across 600 replicates in simulation scenarios with no causal intermediates.

| $r_{\text{within block}}$* | $\max r_X$^ | MR-BMA | Elastic net hJAM | SHA-JAM |
|---|---|---|---|---|
| 0 | 0 | 0.649 | 0 | 0.007 |
| 0 | 0.6 | 0.845 | 0 | 0 |
| 0.6 | 0 | 3.520 | 0 | 0.006 |
| 0.6 | 0.6 | 3.345 | 0 | 0.010 |
| 0.8 | 0 | 2.335 | 0 | 0.008 |
| 0.8 | 0.6 | 2.430 | 0 | 0.001 |

Note: *Correlation within block of $\boldsymbol{G}$'s; ^Maximum correlation coefficient between intermediates $\boldsymbol{X}$. This table shows the average number of false positives out of 50 non-causal intermediates. A positive was defined as a marginal inclusion probability > 0.2 for MR-BMA, a non-zero coefficient for elastic net hJAM, and inclusion in a credible set for SHA-JAM.
With respect to the MSE, when causal intermediates existed, SHA-JAM showed the lowest MSE, followed by MR-BMA, across almost all scenarios; the exception was that MR-BMA had the lowest MSE in the scenario with independent SNPs, independent intermediates, and ten causal intermediates (Table 4.2). The larger MSEs of MR-BMA in the correlated SNPs scenarios were likely driven by the bias of the estimates (Supplementary Table 4B.1). When no causal intermediate existed, elastic net hJAM showed the lowest MSE across all scenarios.
In terms of the $\hat{\boldsymbol{A}}$ matrix construction, we found comparable performance between SuSiE JAM and SuSiE with the large individual-level genotype data ($n = 5000$) (Supplementary Table 4B.2 and Supplementary Figure 4B.5). SuSiE JAM performed slightly better than SuSiE and elastic net with the smaller sample of individuals ($n_{\text{subset}} = 500$) (Supplementary Table 4B.2 and Supplementary Figure 4B.6).
Table 4.2 Mean-squared errors (MSE) of the estimates from three selection algorithms with the best-performing $\hat{\boldsymbol{A}}$ matrix across 600 simulation replicates for different linkage disequilibrium (LD) structures and scenarios with different maximum correlation coefficients between the intermediates.

| $r_{\text{within block}}$* | $\max r_X$^ | Selection algorithm | 0 causal | 3 causal | 7 causal | 10 causal |
|---|---|---|---|---|---|---|
| 0 | 0 | MR-BMA | 0.001 | 0.031 | 0.099 | **0.184** |
| | | Elastic net hJAM | **0** | 0.123 | 0.237 | 0.336 |
| | | SHA-JAM | 0.001 | **0.025** | **0.097** | 0.198 |
| 0 | 0.6 | MR-BMA | 0.003 | 0.031 | 0.128 | 0.232 |
| | | Elastic net hJAM | **0** | 0.112 | 0.259 | 0.328 |
| | | SHA-JAM | 0.001 | **0.023** | **0.115** | **0.226** |
| 0.6 | 0 | MR-BMA | 0.026 | 0.163 | 0.479 | 0.758 |
| | | Elastic net hJAM | **0** | 0.220 | 0.476 | 0.670 |
| | | SHA-JAM | 0.001 | **0.046** | **0.209** | **0.388** |
| 0.6 | 0.6 | MR-BMA | 0.027 | 0.163 | 0.518 | 0.832 |
| | | Elastic net hJAM | **0** | 0.230 | 0.496 | 0.708 |
| | | SHA-JAM | 0.002 | **0.050** | **0.218** | **0.435** |
| 0.8 | 0 | MR-BMA | 0.022 | 0.262 | 0.767 | 1.175 |
| | | Elastic net hJAM | **0** | 0.238 | 0.529 | 0.788 |
| | | SHA-JAM | 0.002 | **0.070** | **0.292** | **0.618** |
| 0.8 | 0.6 | MR-BMA | 0.018 | 0.251 | 0.765 | 1.145 |
| | | Elastic net hJAM | **0** | 0.237 | 0.541 | 0.777 |
| | | SHA-JAM | 0.001 | **0.067** | **0.432** | **0.601** |

Note: *Correlation within block of $\boldsymbol{G}$'s; ^Maximum correlation coefficient between intermediates $\boldsymbol{X}$. The columns "0 causal" to "10 causal" give the number of causal intermediates. We show the mean-squared error (MSE) of each selection algorithm with the best-performing $\hat{\boldsymbol{A}}$ matrix. The lowest MSE is bolded for each scenario. We used the pruned inverse-variance weighted marginal $\hat{\boldsymbol{A}}$ for MR-BMA and the SuSiE JAM $\hat{\boldsymbol{A}}$ for SHA-JAM and elastic net hJAM.
SHA-JAM and elastic net hJAM were about 300 and 150 times faster than MR-BMA, respectively (Table 4.3). The runtime increased with the number of causal intermediates and with the complexity of the correlation between the SNPs. The fast computation of SHA-JAM benefited from the linear regression format of hJAM and the speed of SuSiE (Wang et al., 2020), which is achieved by the computational simplicity of IBSS. Note that we set the maximum number of iterations for the shotgun stochastic search to 500 for MR-BMA in assessing the runtime. The computation time increases as the maximum number of iterations in the shotgun stochastic search increases. When the maximum number of iterations was set to 100,000, as in the data example of the original paper (Zuber et al., 2020), MR-BMA took around 2 hours to complete with 50 candidate intermediates.
Table 4.3 Average runtime (seconds) of different algorithms across 400 simulation replicates for SHA-JAM, elastic net hJAM and MR-BMA with 500 iterations in shotgun stochastic search.

| $r_{\text{within block}}$* | Selection algorithm | 3 causal | 7 causal | 10 causal |
|---|---|---|---|---|
| 0 | MR-BMA | 34.545 | 55.429 | 71.585 |
| | Elastic net hJAM | 0.263 | 0.246 | 0.252 |
| | SHA-JAM | 0.128 | 0.143 | 0.150 |
| 0.6 | MR-BMA | 86.178 | 112.371 | 124.852 |
| | Elastic net hJAM | 0.481 | 0.426 | 0.412 |
| | SHA-JAM | 0.226 | 0.289 | 0.321 |
| 0.8 | MR-BMA | 77.331 | 91.033 | 94.475 |
| | Elastic net hJAM | 0.323 | 0.317 | 0.316 |
| | SHA-JAM | 0.163 | 0.208 | 0.250 |

Note: *Correlation within block of $\boldsymbol{G}$'s. The columns "3 causal" to "10 causal" give the number of causal intermediates. We show the average runtime across different $\hat{\boldsymbol{A}}$ matrices and different correlation structures between the intermediates since the runtime did not differ within these groups.
4.4. Data Applications
4.4.1. Selecting Causal Metabolites for Prostate Cancer with Metabolomics Summary Data
Even though large-scale epidemiologic studies showed a reduction in prostate cancer risk among men with low cholesterol (Heir et al., 2016; Platz et al., 2009), low triglycerides (TG) (Arthur et al., 2016; Van Hemelrijck et al., 2011), and other metabolites (Newcomer et al., 2001), existing MR analyses used to draw causal conclusions showed inconsistent results (Bull et al., 2016; Khankari et al., 2016; Orho-Melander et al., 2018; Pierce et al., 2018). Additionally, current MR analyses included only a few metabolites and modeled them individually, which ignored the high correlation between the intermediates.
To gain a better understanding of the effects of metabolites on prostate cancer risk, we extended the range of candidate metabolites to $M = 30$ and applied the methods to select the metabolites that are likely to be causally associated with prostate cancer risk (Supplementary Table 4B.3). We constructed the $\hat{\boldsymbol{A}}$ matrix with JAM based on the summary data from Zuber et al. (Zuber et al., 2020) (https://github.com/verena-zuber/demo_AMD), who identified 150 significant metabolite-associated SNPs from a large-scale meta-analysis of the Global Lipids Genetics Consortium (GLGC) (Willer et al., 2013) and extracted the marginal summary statistics from the nuclear magnetic resonance (NMR) metabolite GWAS (Kettunen et al., 2016). The original summary data had 118 metabolites. We excluded 69 metabolites that were highly correlated with other metabolites ($|r_{\text{Metabolites}}| \geq 0.985$; this step had already been completed by Zuber et al. (Zuber et al., 2020)) and 19 metabolites that had no univariately significant associated genetic variants in the summary data. After these exclusions, 30 candidate metabolites remained (4 metabolites with no significant associated SNPs were further excluded in the analysis that removed the influential genetic variants) (Supplementary Table 4B.3). The $\hat{\boldsymbol{\beta}}$ and $se(\hat{\boldsymbol{\beta}})$ of the genetic variants included in the $\hat{\boldsymbol{A}}$ matrix were taken from a large GWAS with 79,194 prostate cancer cases and 61,112 controls (Schumacher et al., 2018). The study populations of the three GWASs are all of European ancestry. The European-ancestry genotype data from the 1000 Genomes Project (Consortium, 2015) were used as the reference panel for extracting the LD structure. We excluded four SNPs that were missing in the prostate cancer GWAS (rs2652834, rs9930333, rs1998013, and rs894210) and one SNP that had no variation in the 1000 Genomes data (rs2290547). We identified two highly correlated SNPs (rs11246602 and rs12226802, $|r_{\text{SNPs}}| = 0.97$) with the same effect on the metabolites and excluded one of them at random. We further excluded two SNPs (rs2710642 and rs261342) and two SNPs (rs205262 and rs267733) that were identified as influential points in MR-BMA by Cook's distance and q-statistics (q-statistics < 20), respectively (Zuber et al., 2020). After these exclusions, $K = 140$ genetic variants were included in the analysis.
For SHA-JAM, we set the minimum correlation of the variables within one credible set (level = 0.9) to 0.6 and the maximum number of credible sets to 5. As a comparison, we applied MR-BMA with the inverse variance weighted marginal $\hat{\boldsymbol{A}}$ and $\hat{\boldsymbol{\beta}}$ to select the causal metabolites for prostate cancer risk. We set a prior probability of 0.1, a minimum and maximum model size of 1 and 12 metabolites, respectively, and 100,000 iterations in the shotgun stochastic search, as suggested by the original paper (Zuber et al., 2020).
In the analysis with 144 SNPs, SHA-JAM identified one credible set (coverage = 0.96) associated with an increase in the risk of prostate cancer (Table 4.4). It contains four metabolites: total cholesterol in small LDL (S.LDL.C, posterior inclusion probability, PIP = 0.388), total cholesterol in IDL (IDL.C, PIP = 0.263), total cholesterol in LDL (LDL.C, PIP = 0.243), and serum total cholesterol (Serum.C, PIP = 0.066). The minimum and mean absolute correlations between the four metabolites are 0.902 and 0.948, respectively. In MR-BMA, no metabolite showed a marginal inclusion probability (MIP) larger than 0.25; the top-ranked risk factor was TG in medium VLDL (M.VLDL.TG, MIP = 0.093). We then checked the model fit of MR-BMA and identified potential outliers among the best models with posterior probability higher than 0.02, using Cook's distance and q-statistics. We first excluded one genetic variant, rs2710642 in the EHBP1 gene region, which showed large values of both Cook's distance and the q-statistic in most of the best models (Supplementary Figure 4B.7-A1 and 4B.7-A2). This gene was reported to be strongly associated with prostate cancer risk in European-ancestry subjects (Ao et al., 2015). We re-fit the models after excluding rs2710642 and identified three more influential points: rs261342 with a large Cook's distance, and rs205262 and rs267733 with large q-statistics (Supplementary Figure 4B.7-B1 and 4B.7-B2).
Table 4.4 Selecting causal metabolites on the risk of prostate cancer with and without potential influential genetic variants ($K = 144$ and $K = 140$, respectively).

| Selection algorithm | Selected metabolites* | PIP / MIP^ | Coefficient† | Runtime |
|---|---|---|---|---|
| With potential influential genetic variants ($K = 144$, $M = 30$) | | | | |
| SHA-JAM | S.LDL.C | 0.388 | 0.05 | 0.146 seconds |
| | IDL.C | 0.263 | 0.03 | |
| | LDL.C | 0.243 | 0.029 | |
| | Serum.C | 0.066 | 0.008 | |
| MR-BMA | M.VLDL.TG | 0.093 | -0.032 | 7.470 minutes |
| | S.VLDL.TG | 0.091 | 0.028 | |
| | IDL.C | 0.089 | 0.009 | |
| | Serum.C | 0.078 | 0.009 | |
| Without potential influential genetic variants ($K = 140$, $M = 26$) | | | | |
| SHA-JAM | S.LDL.C | 0.393 | 0.038 | 0.026 seconds |
| | Serum.C | 0.221 | 0.021 | |
| | LDL.C | 0.196 | 0.017 | |
| | IDL.C | 0.142 | 0.012 | |
| MR-BMA | Serum.C | 0.119 | 0.017 | 6.022 minutes |
| | S.LDL.C | 0.073 | 0.005 | |
| | IDL.C | 0.07 | 0.005 | |
| | M.VLDL.TG | 0.068 | -0.011 | |

Notes: *For the selected metabolites column, SHA-JAM shows the metabolites that were selected in the credible sets and MR-BMA shows the top four ranked metabolites.
^For the inclusion probability column, SHA-JAM and MR-BMA show the posterior inclusion probability and the marginal inclusion probability, respectively.
† For the coefficient column, SHA-JAM and MR-BMA display the posterior means and the model-averaged causal effects (MACE), respectively.
Abbreviations: S.LDL.C, total cholesterol in small LDL; Serum.C, serum total cholesterol; LDL.C, total cholesterol in LDL; IDL.C, total cholesterol in IDL; M.VLDL.TG, triglycerides in medium VLDL; S.VLDL.TG, triglycerides in small VLDL; PIP, posterior inclusion probability; MIP, marginal inclusion probability; CS, credible set.
After the exclusion, a total of 140 genetic variants remained in the analysis. We further excluded four metabolites (leaving $M = 26$) that had no genome-wide significant ($P < 5 \times 10^{-8}$) SNPs left in the analysis. SHA-JAM identified a credible set with the same four metabolites as above, showing a positive association with prostate cancer risk (Table 4.4). The minimum and mean absolute correlations between the four metabolites are 0.931 and 0.963, respectively. Even though the results from MR-BMA showed no intermediate with MIP > 0.25, the top three metabolites in MR-BMA matched the SHA-JAM results. IDL.C, which showed the lowest PIP in the credible set of SHA-JAM, was ranked seventh by MR-BMA (MIP = 0.056). We also noted that the metabolites selected by SHA-JAM matched exactly those selected in the top four best models of MR-BMA (Supplementary Table 4B.4). Supplementary Figure 4B.8 compares the PIP/MIP and effect estimates with and without the influential genetic variants. SHA-JAM was robust to the potentially invalid instrumental variables identified by the best models in MR-BMA.
The runtimes of SHA-JAM and MR-BMA were 0.026 seconds and 6.022 minutes, respectively, when 140 genetic variants and 26 candidate metabolites were included, suggesting that SHA-JAM was approximately 13,900 times faster than MR-BMA.
The findings by SHA-JAM were consistently with previous large epidemiological studies
(Kok et al., 2011; Orho-Melander et al., 2018). A review by Suburu and Chen (Suburu & Chen,
2012) illustrated the strong impact of de novo lipogenesis on prostate cancer. For example,
cholesterol that is derived from the mevalonic acid pathway could impact the development and
progression of prostate cancer. Statins, which lower the cholesterol levels by inhibiting HMG-
CoA reductase, the rate-limiting enzyme in cholesterol synthesis, showed a protective effect on
prostate cancer risk in a large cohort study (Bansal et al., 2012; Boudreau et al., 2008). This was
consistent with the effect of an HMGCR locus, rs7703051, which in our analysis was
significantly associated with increased cholesterol in small LDL, serum, LDL, and
IDL.
4.4.2. Selecting Causal Genes for Prostate Cancer with Transcriptomics Summary Data
Previously, we examined the association between two genes on chromosome 1q32.1
(PM20D1, Peptidase M20 Domain Containing 1, and NUCKS1, Nuclear Casein Kinase and
Cyclin Dependent Kinase Substrate 1) and prostate cancer risk with hJAM (section 2.4.2). To
extend this analysis and select the candidate causal genes for the risk of prostate cancer, we
applied the SuSiE hJAM on a whole genome-wide scan among the European-ancestry
population.
In the second data example, we identified the eQTLs and gene expressions by using the
prostate-tissue-specific significant eQTL-gene pairs, which were identified based on
permutations by the GTEx Project
(https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL.tar.gz)
(Lonsdale et al., 2013), and constructed one 𝑨̂ matrix for each chromosome. We
used prostate tissue only, as it has been shown that gene expression in prostate tissue is
the most relevant to the risk of prostate cancer (Mancuso et al., 2018). The coefficients in the
marginal 𝑨̂ matrix were obtained from GTEx analysis v7
(https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL_all_associations.tar.gz).
Priority pruner (Edlund et al.) was applied to prune the eQTLs by
limiting the pairwise correlation coefficients to |𝑟| ≤ 0.8. The sum of the absolute univariate
association estimates for each eQTL across all genes on the same chromosome was employed as
the pruning criterion. We then applied SuSiE JAM to the marginal 𝑨̂ to compute the posterior
mean and obtain the conditional 𝑨̂. Genes with all elements equal to zero in the conditional 𝑨̂
were removed. Supplementary Table 4B.5 shows the dimensions of the final 𝑨̂ matrix for each
chromosome. The association coefficients between the identified genetic variants and prostate
cancer, 𝜷̂ and 𝑠𝑒(𝜷̂), were obtained from the same prostate cancer GWAS as in the first data
example (Schumacher et al., 2018). We used the European-ancestry population in the 1000
Genomes Project (Consortium, 2015) as the reference genotype data. For SHA-JAM, we set the
minimum correlation of the variants within one credible set (level = 0.95) to 0.6 and the
maximum number of credible sets to 1/5 of the total number of genes included per chromosome.
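The credible-set settings above (coverage level 0.95 and a minimum within-set correlation of 0.6) can be illustrated with a small sketch. This is not the SuSiE or hJAM implementation; the function names, the probabilities in `alpha`, and the correlation matrix are hypothetical and chosen only to show how a set is formed from one effect's inclusion probabilities and then screened by its purity.

```python
def credible_set(alpha, coverage=0.95):
    """Smallest set of indices whose inclusion probabilities sum to >= coverage."""
    order = sorted(range(len(alpha)), key=lambda j: alpha[j], reverse=True)
    chosen, total = [], 0.0
    for j in order:
        chosen.append(j)
        total += alpha[j]
        if total >= coverage:
            break
    return sorted(chosen)


def purity(indices, corr):
    """Minimum absolute pairwise correlation in the set (1.0 for a single member)."""
    values = [abs(corr[i][j]) for i in indices for j in indices if i < j]
    return min(values) if values else 1.0


# Hypothetical inclusion probabilities for one effect over four genes,
# and a hypothetical gene-gene correlation matrix.
alpha = [0.80, 0.17, 0.02, 0.01]
corr = [
    [1.00, 0.92, 0.10, 0.05],
    [0.92, 1.00, 0.12, 0.03],
    [0.10, 0.12, 1.00, 0.20],
    [0.05, 0.03, 0.20, 1.00],
]

cs = credible_set(alpha, coverage=0.95)   # genes 0 and 1 are needed to reach 0.95
keep = purity(cs, corr) >= 0.6            # retain the set only if it is pure enough
```

In this toy case the set contains two highly correlated genes, which is the situation where reporting the whole set is more informative than reporting a single top-ranked gene.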
SHA-JAM yielded a total of 45 credible sets with a coverage of 0.95, containing 55
genes (Supplementary Table 4B.5). Among the 45 identified credible sets, 38 contained exactly
one gene, suggesting strong causal effects on the risk of prostate cancer (Table 4.5), 6 contained
two highly correlated genes (average purity 𝑟 = 0.932) (Supplementary Table 4B.6), and 1
contained five perfectly correlated genes (purity 𝑟 = 1) (Supplementary Table 4B.6). The
median size of all credible sets was 1 and the median purity of the credible sets was 0.995.
Table 4.5 Selected candidate genes with strong effect on the prostate cancer risk by SHA-
JAM (credible sets that contain only one gene).
Chromosome Gene Name Gene Type Posterior mean PIP
1 NUCKS1 protein coding 0.166 0.992
2 ALS2CR12 protein coding -0.173 1
2 GGCX protein coding -0.196 1
2 MLPH protein coding -0.312 1
2 AC007879.2 lincRNA 0.103 1
2 SEPT2 protein coding -0.139 0.998
3 CHMP2B protein coding 0.443 1
3 RP11-469J4.3 lincRNA -0.140 1
3 WDR52 protein coding 0.218 1
3 RP11-446H18.1 pseudogene 0.129 1
3 ACPP protein coding -0.221 0.999
5 CTD-3080P12.3 antisense -0.045 0.997
5 RP11-184E9.2 lincRNA 0.038 0.973
6 RGS17 protein coding 0.212 1
6 SESN1 protein coding 0.200 1
6 WTAP protein coding -0.143 1
6 L3MBTL3 protein coding -0.068 1
6 C6orf164 protein coding 0.097 0.966
7 HOTAIRM1 antisense 0.065 0.999
7 RP4-607J23.2 antisense -0.105 0.990
7 TMEM184A protein coding 0.153 0.968
9 CBWD6 protein coding -0.010 1
10 MSMB protein coding -0.762 1
10 AGAP4 protein coding -0.077 1
10 MGMT protein coding 0.046 0.995
10 RP11-18I14.10 processed transcript 0.581 0.993
11 RP11-554A11.9 antisense 0.101 1
11 MMP7 protein coding 0.188 1
11 RAD9A protein coding 0.197 1
12 RP1-228P16.1 pseudogene -0.091 1
12 TUBA1A protein coding -3.598 1
17 RP11-115K3.1 antisense 0.523 1
17 LRRC46 protein coding 0.105 0.995
17 TRPV3 protein coding -0.086 0.982
18 RAB27B protein coding -4.223 0.995
19 ZNF571-AS1 antisense 0.105 0.975
20 ABHD16B protein coding -6.385 0.995
22 TBX1 protein coding -0.072 1
Note: this table includes the genes from the credible sets that contained only one gene. There are
38 such credible sets in total.
Abbreviations: PIP, posterior inclusion probability.
We compared our results with two previous large-scale TWAS of prostate cancer
(Mancuso et al., 2018; Wu et al., 2019a) and found that 16 of the selected genes had been previously reported. Since we only
included genes with significantly associated eQTLs in the GTEx prostate tissue
summary data, some genes previously reported for prostate cancer risk were not included in our
analysis, such as HNF1B (Eeles et al., 2008). SHA-JAM replicated the previously reported
genes, including NUCKS1 (Mancuso et al., 2018; Wu et al., 2019a) (𝛽 =0.166, PIP = 0.992),
ALS2CR12 (Wu et al., 2019a) (𝛽 =−0.173, PIP = 1), GGCX (Mancuso et al., 2018; Wu et al.,
2019a) (𝛽 =−0.196, PIP = 1), MLPH (Mancuso et al., 2018; Wu et al., 2019a) (𝛽 =−0.312,
PIP = 1), CHMP2B (Mancuso et al., 2018; Wu et al., 2019a) (𝛽 =0.443, PIP = 1), WDR52
(Mancuso et al., 2018; Wu et al., 2019a) (𝛽 =0.218, PIP = 1), RGS17 (Mancuso et al., 2018;
Wu et al., 2019a) (𝛽 =0.212, PIP = 1), WTAP (Wu et al., 2019a) (𝛽 =−0.143, PIP = 1), RP4-
607J23.2 (Wu et al., 2019a) (𝛽 =−0.105, PIP = 0.990), MSMB (Wu et al., 2019a) (𝛽 =
−0.762, PIP = 1), RP11-554A11.9 (Wu et al., 2019a) (𝛽 =0.101 , PIP = 1), SESN1 (Mancuso et
al., 2018) (𝛽 =0.200, PIP = 1), MMP7 (Mancuso et al., 2018; Wu et al., 2019a) (𝛽 =0.188,
PIP = 1), RAD9A (Wu et al., 2019a) (𝛽 =0.197, PIP = 1), RP11-115K3.1 (Wu et al., 2019a)
(𝛽 =0.523, PIP = 1), and TBX1 (Mancuso et al., 2018; Wu et al., 2019a) (𝛽 =−0.072, PIP =
1) (Table 4.5).
Furthermore, several genes were reported in GWAS and gene expression studies. The
Leu84Phe CT+TT genotypes at MGMT (𝛽 = 0.046, PIP = 0.995) were
reported to be associated with prostate cancer (Ritchey et al., 2005). ACPP (Wayner et al., 2012)
(𝛽 = −0.221, PIP = 0.999) was overexpressed in prostate cancer cell lines. HOTAIRM1 (𝛽 =
0.065, PIP = 0.999) is antisense to the gene HOXA1, which was found to be highly expressed in
prostate cancer cells and to enhance cell proliferation, invasion, and metastasis (Wang et al.,
2015). TMEM184A (𝛽 = 0.153, PIP = 0.968) was differentially expressed in a
comparison of castration-induced regression nadir prostate cancer with castration-resistant growth
prostate cancer, which is a more advanced stage of prostate cancer (Sha et al., 2017). TUBA1A
showed increased expression in Gleason 4 stroma compared with Gleason 3 stroma
(Staunton et al., 2017). RAB27B was reported to be down-regulated in advanced prostate cancer
(Worst et al., 2017). SEPT2 (𝛽 = −0.139, PIP = 0.998) expression was modulated by
LINC00473, which inhibited cell proliferation by regulating SEPT2 in prostate cancer via the
JAK-STAT3 signaling pathway (Xing et al., 2020).
Several identified genes have not previously been reported in relation to prostate cancer but have
been observed in association with other cancers or with risk factors for prostate cancer. We noted
that L3MBTL3 (𝛽 = −0.068, PIP = 1) has been reported to be associated with height (Gudbjartsson
et al., 2008), type 2 diabetes (Vujkovic et al., 2020), and body mass index (Pulit et al., 2019),
which are well-established risk factors for prostate cancer. The variant rs6569648, which was the
top cis-eQTL for L3MBTL3, showed a strong association with the risk of estrogen-receptor-negative
breast cancer ($P = 4.3 \times 10^{-6}$) (Milne et al., 2017). TRPV3 (𝛽 = −0.086, PIP = 0.982) was
reported to promote proliferation of lung cancer cells (Li et al., 2016), and ZNF571-AS1
(𝛽 = 0.105, PIP = 0.975), which is antisense to the gene ZNF571, was overexpressed in breast
cancer (Smeets et al., 2011). LRRC46 (𝛽 = 0.105, PIP = 0.995) at 17q21.32 was
reported to be significantly associated with high-grade serous ovarian cancer (Gusev et al.,
2019). ABHD16B (𝛽 = −6.385, PIP = 0.995) was involved in lipid biosynthesis in testis (Shan
et al., 2020). CTD-3080P12.3 (𝛽 = −0.045, PIP = 0.997) and C6orf164 (𝛽 = 0.097, PIP =
0.966) were differentially expressed in hepatocellular carcinomas compared with non-tumor tissues
(Degli Esposti et al., 2016; Wang et al., 2019). In survival analysis, high expression of
AGAP4 (𝛽 = −0.077, PIP = 1) was associated with a poor breast cancer clinical outcome (Yang et
al., 2020), and up-regulation of the lncRNA AC007879.2 (𝛽 = 0.103, PIP = 1) led to poorer
prognoses of oral squamous cell carcinoma (Hu et al., 2018).
4.5. Discussion
In this chapter, we showed the flexibility of the hJAM framework and proposed SHA-JAM,
an extension of hJAM to variable selection for high-dimensional intermediate data, such
as omics data. SHA-JAM showed good overall performance across all simulation scenarios,
with an average AUC of 0.840, low MSE with less bias, and fast computation.
SHA-JAM has several advantages over existing approaches. First, the
flexibility of the hJAM framework enables SHA-JAM to incorporate different types of
intermediates and to extend Mendelian randomization analysis and TWAS to
high-dimensional data with a hierarchical joint model using summary data only. This property
broadens the range of problems to which SHA-JAM can be applied, including omics data
for which only summary statistics are available. Another critical benefit of SHA-JAM is that it shares the computational
efficiency of SuSiE (Wang et al., 2020) and runs 300 times faster than MR-BMA with 50
candidate intermediates in the model. The difference increases as the number of candidate
intermediates increases. SHA-JAM also shares the advantage of SuSiE that it provides credible
sets in variable selection, which not only summarize the uncertainty but are also better suited to
selecting highly correlated intermediates with small effects. In the data example of selecting
genes for the risk of prostate cancer, we identified several credible sets that contained perfectly
correlated genes with small effects and low posterior inclusion probabilities. These genes may not
be selected by approaches based on univariate estimates or marginal inclusion probabilities.
Additionally, compared with MR-BMA, SHA-JAM is more robust to the potentially invalid
instrumental variables, which were identified by using Cook’s distance and q-statistics from a
multivariable MR model (Zuber et al., 2020).
For SHA-JAM, as well as for MR and TWAS models, construction of the 𝑨̂ matrix is a
critical step. First, a better-performing 𝑨̂ matrix can improve the
performance of these approaches. We have provided different ways to construct the 𝑨̂ matrix
under different scenarios. When the summary data are well pruned and the genetic variants within
the set are not highly correlated, JAM can be applied to obtain the conditional 𝑨̂ matrix. If the
genetic variants are highly correlated, we proposed SuSiE JAM, which adapts SuSiE to the JAM
framework, and showed that the method performed well in simulations and in an application
constructing the 𝑨̂ matrix from highly correlated eQTLs and gene expressions. With
individual-level data, SuSiE would be a better way to obtain a well-performing 𝑨̂ matrix.
Second, the summary data from which the genetic variants and intermediates for the 𝑨̂ matrix are extracted
determine the candidate intermediates for selection. One may miss some previously reported
intermediates because of a lack of SNP-intermediate pairs when constructing the 𝑨̂ matrix. For
example, we did not include HNF1B (Eeles et al., 2008), a gene strongly implicated in
prostate cancer risk, in our gene-prostate cancer example because no significant eQTL-HNF1B
pairs exist in GTEx prostate tissue. This problem may be avoided by conducting a more
comprehensive search, such as identifying the eQTL-gene pairs in all tissues instead of a single
tissue. We are currently working on an approach to estimate the average intermediate effects
across multi-ethnic populations. Under the multi-ethnic hJAM framework, identifying
variant-intermediate pairs across different populations could enlarge the pool of candidate
intermediates and provide an average effect estimate.
Despite the benefits inherited from SuSiE and hJAM, SHA-JAM also shares their caveats.
SuSiE may not perform well and may not converge in less sparse data with many causal
effects (Wang et al., 2020). This leads to a decrease in the performance of SHA-JAM as the
number of causal intermediates increases in the simulation studies. Wang et al. (2020)
have proposed several possible solutions, such as using a better initialization for the
algorithm. Another limitation is that, even though hJAM is quite robust to differences in LD structure
between the summary data sets, SuSiE with summary data is very sensitive to the
accuracy of the reference genotype LD structure. Using mismatched reference genotype data may cause large
bias in computing 𝑿′𝑿 and eventually lead to inaccurate results. Thus, in applications, we
need to ensure that the reference genotype data come from a population that shares the
ancestry of the population in the summary data from which we obtained the effect estimates of the genetic
variants on the exposures or outcome of interest. We are currently working on an extension of
SHA-JAM that gathers information from different ethnic groups and computes a multi-ethnic
average effect when multiple ethnic groups are present. Additionally, as with existing MR
approaches, unknown pleiotropic effects could hurt the performance of SHA-JAM.
In addition to SHA-JAM, we proposed elastic net hJAM, in which we apply regularized
regression within the hJAM framework. One limitation of elastic net hJAM is that it does not
provide a measure of the strength of evidence, which makes it difficult to draw inference. Post-selection
inference has been an active research area (Lee et al., 2016; Taylor & Tibshirani, 2015;
Tibshirani et al., 2016). Tibshirani et al. (2016) proposed a sequence of
tests to draw post-selection inference and provide exact p-values. An R package,
selectiveInference, is available from their group and is now considered one of the most
popular post-selection inference tools. An extension of elastic net hJAM using post-selection
inference may provide further insight into the model.
Appendix 4A. Supplementary Methods
4A.1. Hierarchical Regularized Regression
With the rapid development of high-dimensional data, many approaches have been
proposed to perform dimensionality reduction and feature selection. Stepwise and best
subset selection are widely used in practice, but both are
sensitive to minor changes in the data (Tibshirani, 1996) and are computationally
expensive when the number of features is large. Hierarchical model selection, which is more
robust than stepwise and best subset selection, has become popular for high-dimensional data. Ridge
regression, which was proposed to solve nonorthogonal problems, imposes an L2-norm
penalty that shrinks the regression coefficients toward, but not exactly to, zero (Hoerl & Kennard, 1970). The
LASSO (least absolute shrinkage and selection operator) regression, proposed by
Tibshirani (1996), shrinks some regression coefficients and sets others to zero by imposing an L1-norm
penalty. Thus, it retains the good features of both best subset selection and ridge regression and
makes the coefficients easier to interpret. Supplementary Figure 4A.1 illustrates the different
shrinkage of regression coefficients in ridge and LASSO. One limitation of the LASSO is that it
does not have a grouping effect: if there is a group of variables with high
pairwise correlation, the LASSO tends to select only one variable from the group and does not
care which one (Zou & Hastie, 2005).
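The contrast between the two penalties can be made concrete in the special case of an orthonormal design (G'G = I), which is an assumption made here only for illustration and is not a setting used in the dissertation. In that case the penalized estimates are simple functions of the least squares estimate b: ridge with penalty λ‖β‖² rescales b by 1/(1+λ), while the LASSO with penalty λ‖β‖₁ (added to the unscaled residual sum of squares) soft-thresholds b at λ/2. The function names below are hypothetical.

```python
def ridge_shrink(b, lam):
    """Ridge estimate under an orthonormal design: proportional shrinkage, never exactly zero."""
    return b / (1.0 + lam)


def lasso_shrink(b, lam):
    """LASSO estimate under an orthonormal design: soft-thresholding at lam / 2."""
    magnitude = max(abs(b) - lam / 2.0, 0.0)
    return magnitude if b >= 0 else -magnitude


# Hypothetical least squares estimates for three coefficients.
ols = [2.0, 0.4, -1.0]
lam = 1.0

ridge = [ridge_shrink(b, lam) for b in ols]   # every coefficient is halved
lasso = [lasso_shrink(b, lam) for b in ols]   # the small coefficient is set to zero
```

The small coefficient survives under ridge but is removed by the LASSO, which is the selection behavior described above.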
Supplementary Figure 4A.1 Illustration of the estimates from the LASSO (left) and ridge
regression (right).
This plot shows the objective function and the least squares error when there are two coefficients.
The red ellipses show the contours of the least squares error function. The blue solid areas
show the constraint regions $|\beta_1| + |\beta_2| \le t$ (L1 norm) and $\beta_1^2 + \beta_2^2 \le t$ (L2 norm),
respectively. This figure is from Figure 3.11 of (Friedman et al., 2001).
The elastic net penalty, introduced by Zou and Hastie (2005), is a compromise
between ridge and LASSO: it conducts variable selection like the LASSO and, at the same time, shrinks the coefficients
of correlated variables like ridge. Thus, for SNP data, where most SNPs are
correlated with other variants in the data set, the elastic net may be a better approach. In
practice, PrediXcan used the elastic net to construct the PredictDB data set (Gamazon et al., 2015).
Here, we briefly describe the naïve elastic net problem (Zou & Hastie, 2005). We use
the same notation as in Chapter 2, section 2.2. Given a genotype data set with 𝑛
observations and 𝐾 SNPs, we have:
𝒚 =𝑮𝜷 +𝜹 .
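Then the objective function of the naïve elastic net penalized regression can be expressed as
$$\min_{\boldsymbol{\beta} \in \mathbb{R}^{K}} \; \|\boldsymbol{y} - \boldsymbol{G}\boldsymbol{\beta}\|^2 + \lambda_2 \|\boldsymbol{\beta}\|_2^2 + \lambda_1 \|\boldsymbol{\beta}\|_1,$$
where $\|\boldsymbol{\beta}\|_2^2 = \sum_{i=1}^{K} \beta_i^2$ and $\|\boldsymbol{\beta}\|_1 = \sum_{i=1}^{K} |\beta_i|$. The function $\lambda_2 \|\boldsymbol{\beta}\|_2^2 + \lambda_1 \|\boldsymbol{\beta}\|_1$ is called the
elastic net penalty. Let $\lambda = \lambda_1 + \lambda_2$ and $c = \lambda_2 / (\lambda_1 + \lambda_2)$ (note that the original paper used $\alpha$ (Zou & Hastie, 2005); we
use $c$ here to distinguish it from the notation in the main text). The elastic net problem then becomes
$$\min_{\boldsymbol{\beta} \in \mathbb{R}^{K}} \; \|\boldsymbol{y} - \boldsymbol{G}\boldsymbol{\beta}\|^2 + \lambda c \|\boldsymbol{\beta}\|_2^2 + \lambda (1 - c) \|\boldsymbol{\beta}\|_1.$$
When $c = 1$, the elastic net becomes ridge regression, and when $c = 0$, the elastic net becomes the
LASSO. Supplementary Figure 4A.2 shows the contour plots of the elastic net with $c = 0.5$
compared with the LASSO and ridge.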
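As a small numerical illustration of the (λ, c) form of the penalty, the sketch below evaluates λc‖β‖² + λ(1−c)‖β‖₁ for a hypothetical coefficient vector and confirms the two limiting cases stated above. The function name and the numbers are illustrative assumptions only.

```python
def elastic_net_penalty(beta, lam, c):
    """Penalty lam * c * ||beta||_2^2 + lam * (1 - c) * ||beta||_1."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * c * l2_squared + lam * (1.0 - c) * l1


beta = [1.0, -2.0, 0.0]   # hypothetical coefficients: ||beta||_1 = 3, ||beta||_2^2 = 5
lam = 2.0

ridge_case = elastic_net_penalty(beta, lam, c=1.0)   # pure L2 penalty
lasso_case = elastic_net_penalty(beta, lam, c=0.0)   # pure L1 penalty
mixed_case = elastic_net_penalty(beta, lam, c=0.5)   # equal mix of the two
```

With c = 0.5 the penalty is the average of the ridge and LASSO penalties at the same λ.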
Supplementary Figure 4A.2 Geometry of the elastic net, LASSO, and Ridge regression.
This plot shows the two-dimensional contour plots for elastic net, LASSO, and Ridge regression.
For elastic net, this figure shows the strength of convexity with 𝑐 =0.5. Note the original figure
used 𝛼 , we use 𝑐 here to distinguish from our notations in the main text. This figure is from
Figure 1 of (Zou & Hastie, 2005).
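To solve the elastic net problem, we define an artificial data set $(\boldsymbol{y}^*, \boldsymbol{G}^*)$ as
$$\boldsymbol{G}^*_{(n+K) \times K} = (1 + \lambda_2)^{-1/2} \begin{pmatrix} \boldsymbol{G} \\ \sqrt{\lambda_2}\, \boldsymbol{I} \end{pmatrix}, \qquad \boldsymbol{y}^*_{(n+K)} = \begin{pmatrix} \boldsymbol{y} \\ \boldsymbol{0} \end{pmatrix}.$$
Let $\gamma = \lambda_1 / \sqrt{1 + \lambda_2}$ and $\boldsymbol{\beta}^* = \sqrt{1 + \lambda_2}\, \boldsymbol{\beta}$; then the naïve elastic net can be expressed as a LASSO-type
problem,
$$L(\gamma, \boldsymbol{\beta}) = L(\gamma, \boldsymbol{\beta}^*) = \|\boldsymbol{y}^* - \boldsymbol{G}^* \boldsymbol{\beta}^*\|^2 + \gamma \|\boldsymbol{\beta}^*\|_1,$$
with the solution $\hat{\boldsymbol{\beta}}^* = \arg\min_{\boldsymbol{\beta}^*} L(\gamma, \boldsymbol{\beta}^*)$; then we have
$$\hat{\boldsymbol{\beta}}_{\text{naive elastic net}} = \frac{1}{\sqrt{1 + \lambda_2}}\, \hat{\boldsymbol{\beta}}^*.$$
The elastic net estimate is defined as $\hat{\boldsymbol{\beta}}_{\text{elastic net}} = \sqrt{1 + \lambda_2}\, \hat{\boldsymbol{\beta}}^*$, so the elastic net coefficient is
a rescaled naïve elastic net coefficient:
$$\hat{\boldsymbol{\beta}}_{\text{elastic net}} = (1 + \lambda_2)\, \hat{\boldsymbol{\beta}}_{\text{naive elastic net}}.$$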
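The augmented-data reformulation can be checked numerically. The sketch below follows the construction of Zou and Hastie (2005), taking λ2 as the weight on the squared L2 norm (as the √λ2 block in G* implies), λ1 as the weight on the L1 norm, and γ = λ1/√(1+λ2). It evaluates the naïve elastic net objective and the LASSO-type objective on the artificial data at the same β and compares them. The data, function names, and parameter values are hypothetical.

```python
import math


def naive_objective(y, G, beta, lam1, lam2):
    """||y - G beta||^2 + lam2 * ||beta||_2^2 + lam1 * ||beta||_1."""
    residuals = [yi - sum(g * b for g, b in zip(row, beta)) for yi, row in zip(y, G)]
    rss = sum(r * r for r in residuals)
    return rss + lam2 * sum(b * b for b in beta) + lam1 * sum(abs(b) for b in beta)


def augmented_objective(y, G, beta, lam1, lam2):
    """LASSO-type objective ||y* - G* beta*||^2 + gamma * ||beta*||_1 on the artificial data."""
    k = len(beta)
    scale = 1.0 / math.sqrt(1.0 + lam2)
    # Stack sqrt(lam2) * I under G, then rescale every row by (1 + lam2)^(-1/2).
    identity_block = [[math.sqrt(lam2) if i == j else 0.0 for j in range(k)] for i in range(k)]
    G_star = [[scale * g for g in row] for row in G + identity_block]
    y_star = list(y) + [0.0] * k
    beta_star = [math.sqrt(1.0 + lam2) * b for b in beta]
    gamma = lam1 / math.sqrt(1.0 + lam2)
    residuals = [yi - sum(g * b for g, b in zip(row, beta_star))
                 for yi, row in zip(y_star, G_star)]
    return sum(r * r for r in residuals) + gamma * sum(abs(b) for b in beta_star)


# Hypothetical toy data: n = 3 observations, K = 2 SNPs.
y = [1.0, -0.5, 2.0]
G = [[1.0, 0.5], [0.0, 1.0], [2.0, -1.0]]
beta = [0.3, -0.7]

naive_value = naive_objective(y, G, beta, lam1=0.4, lam2=1.5)
augmented_value = augmented_objective(y, G, beta, lam1=0.4, lam2=1.5)
```

The two values agree up to floating-point error for any β, which is why a LASSO solver applied to the artificial data also solves the naïve elastic net problem.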
Different choices of tuning parameters exist, including $(\lambda_1, \lambda_2)$ and $(\lambda, c)$. We use the
glmnet R package to tune $\lambda$ and set $c = 0.5$ for the elastic net. $k$-fold cross-validation is used
to select the tuning parameter $\lambda$.
4A.2. Priority Pruner
Priority pruner is a Java program designed to prune a list of SNPs that are in
high LD. It is available online (http://prioritypruner.sourceforge.net/index.html). The central idea
of the priority pruner algorithm is to prune the SNPs based on the LD structure together with additional
requests from the user. Priority pruner first picks the SNP, 𝑗, with the lowest P value and
calculates the squared correlation coefficient, $R^2$, between SNP 𝑗 and the other SNPs in the list
within a user-specified distance. Priority pruner allows additional surrogate SNPs, drawn from
the set of SNPs whose correlation coefficient $R^2$ with SNP 𝑗 is larger than the user-specified threshold.
Whether such a SNP is selected as a surrogate is determined by user-specified metrics.
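The core idea of priority-based LD pruning can be sketched as a greedy loop. This is a simplified illustration and not the Priority Pruner implementation: it ignores the distance window and surrogate SNPs, and it takes a generic priority score (for example, a transformed P value, or the sum of absolute association estimates used in section 4.4.2). The function name, scores, and correlation matrix are hypothetical.

```python
def greedy_prune(priority, corr, max_abs_r=0.8):
    """Visit SNPs from highest to lowest priority and keep a SNP only if its
    absolute correlation with every SNP already kept is at most max_abs_r."""
    order = sorted(range(len(priority)), key=lambda j: priority[j], reverse=True)
    kept = []
    for j in order:
        if all(abs(corr[j][i]) <= max_abs_r for i in kept):
            kept.append(j)
    return sorted(kept)


# Hypothetical priorities and pairwise correlations for four SNPs.
priority = [0.9, 0.5, 0.7, 0.2]
corr = [
    [1.00, 0.95, 0.10, 0.05],
    [0.95, 1.00, 0.08, 0.02],
    [0.10, 0.08, 1.00, 0.85],
    [0.05, 0.02, 0.85, 1.00],
]

kept_strict = greedy_prune(priority, corr, max_abs_r=0.8)   # one SNP per correlated pair
kept_loose = greedy_prune(priority, corr, max_abs_r=0.9)    # the 0.85 pair is now tolerated
```

Because higher-priority SNPs are visited first, the SNP retained from each correlated group is the one with the highest priority score.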
Appendix 4B. Supplementary Tables and Figures
4B.1. Supplementary Tables
Supplementary Table 4B.1 Absolute bias of the estimates from three selection algorithms
with the best performed 𝑨̂ matrix across 600 simulation replicates for different linkage
disequilibrium (LD) structures and scenarios with different maximum correlation
coefficient between the intermediates.
r within block*   max r_X^   Selection algorithm   Number of causal intermediates: 0   3   7   10
0     0     MR-BMA             0.001   0.049   0.104   0.141
0     0     Elastic net hJAM   0       0.011   0.024   0.034
0     0     SuSiE hJAM         0.002   0.003   0.011   0.019
0     0.6   MR-BMA             0.001   0.051   0.104   0.138
0     0.6   Elastic net hJAM   0       0.011   0.025   0.033
0     0.6   SuSiE hJAM         0.001   0.003   0.012   0.021
0.6   0     MR-BMA             0.007   0.112   0.160   0.186
0.6   0     Elastic net hJAM   0       0.016   0.036   0.050
0.6   0     SuSiE hJAM         0.001   0.005   0.018   0.031
0.6   0.6   MR-BMA             0.007   0.112   0.161   0.185
0.6   0.6   Elastic net hJAM   0       0.016   0.037   0.052
0.6   0.6   SuSiE hJAM         0.002   0.005   0.018   0.033
0.8   0     MR-BMA             0.007   0.095   0.134   0.152
0.8   0     Elastic net hJAM   0       0.017   0.038   0.055
0.8   0     SuSiE hJAM         0.002   0.006   0.023   0.040
0.8   0.6   MR-BMA             0.007   0.093   0.133   0.149
0.8   0.6   Elastic net hJAM   0       0.017   0.038   0.055
0.8   0.6   SuSiE hJAM         0.001   0.006   0.024   0.042
Note: *Correlation within block of 𝑮’s; ^Maximum correlation coefficient between intermediates
𝑿. We showed the absolute bias of each selection algorithm with best performed 𝑨̂ matrix with
the lowest being bolded for each scenario. We used pruned inverse-variance weighted marginal
𝑨̂ for MR-BMA, SuSiE JAM 𝑨̂ for SHA-JAM and elastic net hJAM.
Supplementary Table 4B.2 Mean-squared errors of the estimates in different 𝑨̂ matrix
construction, averaged across 500 replicates.
r within block*   SuSiE JAM   SuSiE (n=5000^)   SuSiE (n=500^)   Elastic net
0     1.0178   0.9999   1.0503   1.0914
0.6   1.0132   0.9998   1.0468   1.0900
0.8   1.0111   1.0011   1.0430   1.0871
Note: *Correlation within block of 𝑮’s. ^The sample size of the data used to obtain
the 𝑨̂ matrix.
Supplementary Table 4B.3 Description of the candidate metabolites in estimating the
causal association between metabolites and the risk of prostate cancer.
Abbreviation Name Sample size Heritability
ApoA1 ApoA1 20687 5.01%
ApoB ApoB 20690 8.61%
Est.C Esterified cholesterol 13497 7.58%
HDL.C Total cholesterol in HDL 21555 6.12%
HDL.D HDL diameter 19273 10.32%
IDL.C Total cholesterol in IDL 19273 10.89%
IDL.TG Triglycerides in IDL 19273 9.94%
L.HDL.C Total cholesterol in large HDL 21558 8.23%
L.VLDL.C Total cholesterol in large VLDL 21235 3.24%
L.VLDL.TG Triglycerides in large VLDL 21239 2.59%
LDL.C Total cholesterol in LDL 21559 10.72%
LDL.D LDL diameter 19273 3.12%
M.HDL.C Total cholesterol in medium HDL 21558 2.38%
M.VLDL.C Total cholesterol in medium VLDL 21551 5.12%
M.VLDL.TG Triglycerides in medium VLDL 21241 3.19%
S.HDL.TG Triglycerides in small HDL 21558 4.24%
S.LDL.C Total cholesterol in small LDL 21556 8.96%
S.VLDL.C Total cholesterol in small VLDL 21557 6.72%
S.VLDL.TG Triglycerides in small VLDL 21558 4.91%
Serum.C Serum total cholesterol 21491 8.67%
Serum.TG Serum total triglycerides 21545 4.33%
SM Sphingomyelins 13476 3.92%
Tot.FA Total fatty acids 13505 3.35%
TotPG Total phosphoglycerides 13519 4.24%
VLDL.D VLDL diameter 19273 4.65%
XL.HDL.C Total cholesterol in very large HDL 21540 5.05%
XL.HDL.TG Triglycerides in very large HDL 21536 10.76%
XL.VLDL.TG Triglycerides in very large VLDL 21548 2.41%
XS.VLDL.TG Triglycerides in very small VLDL 19273 7.13%
XXL.VLDL.TG Triglycerides in chylomicrons and extremely large VLDL 21540 1.71%
Note: This table was adapted from the “NMRA_dat” data set in the MRChallenge R package
(https://github.com/WSpiller/MRChallenge2019). Four metabolites highlighted in dark grey
were excluded in the analysis without the influential variants (K=140).
Supplementary Table 4B.4 Top ten models from MR-BMA in the influential-variants-excluded
analysis (𝑲 = 140) in estimating the causal effects of metabolites on the risk of
prostate cancer.
Rank   Metabolite combination   Posterior probability   Model-specific causal estimate
1 Serum.C 0.059 0.063
2 S.LDL.C 0.048 0.056
3 IDL.C 0.046 0.054
4 LDL.C 0.041 0.05
5 S.VLDL.C 0.035 0.044
6 XXL.VLDL.TG 0.031 0
7 ApoA1 0.03 0.035
8 XL.VLDL.TG 0.029 0.006
9 ApoB 0.029 0.036
10 L.VLDL.TG 0.028 0
Abbreviations: Serum.C, serum total cholesterol; S.LDL.C, total cholesterol in small LDL;
IDL.C, total cholesterol in IDL; LDL.C, total cholesterol in LDL; S.VLDL.C, total cholesterol in
small VLDL; XXL.VLDL.TG, triglycerides in chylomicrons and extremely large VLDL;
XL.VLDL.TG, triglycerides in very large VLDL; L.VLDL.TG, triglycerides in large VLDL.
Supplementary Table 4B.5 The candidate 𝑨̂ matrix for each chromosome and the credible
sets identified by SuSiE hJAM in selecting the causal gene for the risk of prostate cancer.
Chromosome   Candidate eQTLs (𝑲)   Candidate genes (𝑴)   Number of CSs   Minimum absolute correlation   Mean absolute correlation   Median absolute correlation   Selected genes
1 792 108 1 1 1 1 1
2 756 78 5 1 1 1 5
3 469 66 6 0.987 0.993 0.993 7
4 366 40 0 - - - 0
5 453 57 4 0.998 0.999 0.999 6
6 1162 53 5 1 1 1 5
7 695 65 3 1 1 1 3
8 351 35 1 0.777 0.888 0.888 2
9 372 47 2 1 1 1 6
10 447 52 4 1 1 1 4
11 470 44 3 1 1 1 3
12 354 47 2 1 1 1 2
13 99 9 1 0.976 0.988 0.988 2
14 246 22 1 0.924 0.962 0.962 2
15 349 29 0 - - - 0
16 585 48 0 - - - 0
17 512 66 3 1 1 1 3
18 144 17 1 1 1 1 1
19 659 65 1 1 1 1 1
20 189 26 1 1 1 1 1
21 121 12 0 - - - 0
22 302 42 1 1 1 1 1
Total 9893 1028 45 0.991 0.995 0.995 55
Supplementary Table 4B.6 Credible sets that contain two genes, identified by SuSiE hJAM
for selecting the causal genes of the risk of prostate cancer.
Chromosome Gene Name Gene Type Posterior mean PIP 𝒓 ^
Credible sets that contain two genes
3 AMT protein coding -0.109 0.798 0.920
NICN1 protein coding -0.026 0.171
5 CTD-2194D22.3 antisense -0.226 0.885 0.994
IRX4 protein coding -0.024 0.116
5 CTD-2280E9.1 antisense -0.030 0.574 0.997
CTC-210G5.1 antisense -0.037 0.409
8 FAM66A processed transcript 0.041 0.949 0.777
RP11-351I21.6 pseudogene 0.000 0.012
13 RP11-173B14.4 sense intronic 0.056 0.784 0.976
RP11-173B14.5 antisense 0.006 0.189
14 HAUS4 protein coding 0.026 0.501 0.924
RBM23 protein coding -0.020 0.484
Credible sets that contain more than two genes
9 RP11-203I2.1 pseudogene -0.261 0.201 1
RP11-282E4.1 pseudogene -0.094 0.201
MYO5BP2 pseudogene -0.020 0.201
RP11-262H14.3 lincRNA -0.021 0.201
RP11-262H14.7 pseudogene -0.021 0.201
Note: ^𝑟 : the correlation coefficient between the genes in the credible set.
Abbreviations: PIP, posterior inclusion probability.
4B.2. Supplementary Figures
Supplementary Figure 4B.1 Procedures in composing the 𝑨̂ matrix.
Supplementary Figure 4B.2 Linkage disequilibrium (LD) structure of the individual
genotype data set with $r_{\text{within block}} = 0.8$.
This figure displays the LD structure of the individual genotype data with 300 single-nucleotide
polymorphisms (SNPs) and 10 independent LD blocks. Within each block, the pairwise
correlation coefficient between SNPs is $r_{\text{within block}} = 0.8$. The plot on the left shows a pseudo
overall LD structure, while the plot on the right shows the true LD structure for one block in the
simulation.
Supplementary Figure 4B.3 Correlation structure of the individual intermediates data with
$\max(r_X) = 0.6$.
This figure displays the correlation structure of the individual intermediates data with 50
intermediates (i.e. 𝑿). The block size of intermediates is 10. For each intermediate 𝑥, we set 4-8
correlated intermediates with $\max(r_X) = 0.6$ and $\min(r_X) = 0.04$. The plot on the left shows a
pseudo overall correlation structure, while the plot on the right shows the true correlation structure
for one 𝑋 block in the simulation.
Supplementary Figure 4B.4 An illustrative example of the log(𝝀 ) vs. mean-squared error in
an elastic net glmnet fit.
This figure shows an illustrative example of the log(𝜆 ) vs. mean-squared error of an elastic net
fit from glmnet. For the 𝜆 which gives the minimum cross-validated error (i.e. lambda.min),
we have a model with 16 variables being selected (left dashed line). For the 𝜆 which gives the
most regularized model such that error is within one standard error of the minimum (i.e.
lambda.1se), we have a model with 2 variables being selected (right dashed line).
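The two selection rules described in this caption can be written down directly. The sketch below is not glmnet code; it only mirrors the rules as stated: the λ with the minimum cross-validated error, and the largest λ whose error is within one standard error of that minimum. The grid of λ values and the error summaries are hypothetical.

```python
def select_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation summaries."""
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[best] + cv_se[best]
    # Most regularized (largest) lambda whose mean error stays within the threshold.
    lambda_1se = max(lam for lam, err in zip(lambdas, cv_mean) if err <= threshold)
    return lambdas[best], lambda_1se


# Hypothetical cross-validation results on a grid of five lambda values.
lambdas = [0.01, 0.05, 0.1, 0.5, 1.0]
cv_mean = [1.30, 1.10, 1.05, 1.12, 1.40]
cv_se = [0.08, 0.08, 0.08, 0.08, 0.08]

lambda_min, lambda_1se = select_lambdas(lambdas, cv_mean, cv_se)
```

A larger λ means stronger regularization, so the one-standard-error choice generally selects fewer variables, consistent with the 16 versus 2 variables in the example above.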
Supplementary Figure 4B.5 Comparing three selection algorithms with different 𝑨̂ matrices
by area under the curve (AUC) of the receiver operating characteristic (ROC) curve across 600
simulation replicates for three linkage disequilibrium (LD) structures with the maximum
correlation coefficient between intermediates equal to 0.6.
Simulation results for three LD structures with different numbers of causal intermediates. The bar
plots show the average AUC values, and the error bars show the corresponding standard
errors. The algorithms are displayed as “Selection algorithm (𝑨̂ matrix algorithm)”. All 𝑨̂
matrices were composed from data with a sample size of 5,000. SuSiE JAM and IVW marginal
used the summary data and SuSiE used the individual data.
Supplementary Figure 4B.6 Comparing the performance of SHA-JAM with 𝑨̂ matrices from
data of different sample sizes by area under the curve (AUC) of the receiver operating
characteristic (ROC) curve across 600 simulation replicates for three linkage
disequilibrium (LD) structures with independent intermediates.
Simulation results for three LD structures with different numbers of causal intermediates. The bar
plots show the average AUC values, and the error bars show the corresponding standard
errors. The algorithms are displayed as “Selection algorithm (𝑨̂ matrix algorithm, n=sample size
of the data)”.
Supplementary Figure 4B.7 Diagnostic plots of Cook’s distance and q-statistics
Panels A1 and A2 show the Cook’s distance and q-statistics before any influential points were
excluded (K=144). The two panels show an obvious influential point, rs2710642, with a
large q-statistic of 106. Panels B1 and B2 show the Cook’s distance and q-statistics after
excluding rs2710642 (K=143). The two panels show an influential point, rs261342, with a
Cook’s distance larger than the medium Cook’s distance and two influential points, rs205262
and rs267733, with large q-statistics (q=21.50 and q=21.39, respectively).
[Figure panels not reproduced. Each panel plots predicted beta PrCa against observed beta PrCa. A1: model IDL.C, Cook’s D legend, rs2710642 labeled. A2: model IDL.C, Q legend, with rs267733, rs12145743, rs2710642, rs2240327, rs205262, rs4722551, rs799160, rs9693857, and rs2241210 labeled. B1: model LDL.D, Cook’s D legend, rs261342 labeled. B2: model Est.C + Serum.C, Q legend, with rs267733, rs205262, rs4722551, and rs2241210 labeled.]
Supplementary Figure 4B.8 Comparisons of the coefficients and inclusion probabilities
with and without the influential genetic variants for SHA-JAM and MR-BMA.
For SHA-JAM, the coefficients and inclusion probabilities were displayed as posterior means
and posterior inclusion probabilities (PIP), respectively. For MR-BMA, the coefficients and
inclusion probabilities were displayed as model-average causal effects (MACE) and marginal
inclusion probabilities (MIP), respectively. The labeled metabolites were identified by the
credible sets of SHA-JAM and the top four ranked metabolites of MR-BMA in the analysis
without the influential SNPs, respectively. The Pearson correlation coefficient was used to
show the correlation of the parameters of interest with and without the influential variants.
Chapter 5.
R Package for Implementation: hJAM
This chapter is a modified version of the vignette of R package hJAM under
https://cran.r-project.org/web/packages/hJAM/index.html.
5.1. Overview
5.1.1. Motivation
To implement the approaches discussed in this dissertation, we provide a publicly
accessible R package, hJAM, named for Hierarchical Joint Analysis of Marginal summary
statistics (https://github.com/lailylajiang/hJAM). In addition to the hJAM approaches, we include
implementations to compose the 𝑨̂ matrix, which can be viewed as an alternative implementation
of JAM, and functions for data exploration. The package is sufficient to conduct
hJAM analyses with different sources of data and intermediates.
5.1.2. Data Required and Structure Examples
As discussed in Chapter 2 (Table 2.1), three sets of summary data from different data
sources are needed to implement hJAM. The three sets of summary data have to come from
the same population to ensure similar LD patterns.
After obtaining the summary data, the user needs to check two critical issues. First, the
order of the SNPs/eQTLs must be the same across all summary data. Second, the effect allele of each
SNP/eQTL must be the same across all summary data. The functions in the hJAM package
check the dimensions of the summary data automatically.
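The two checks above can be sketched in base R. This is only an illustration under stated assumptions: the helper harmonize_summary and the column names rsid, effect_allele, other_allele, beta and eaf are hypothetical and are not part of the hJAM package, and strand-ambiguous SNPs are not handled.

```r
# Hypothetical helper (not part of hJAM): reorder a summary table to a
# reference SNP order and flip effects where the effect allele is swapped.
harmonize_summary <- function(ref, dat) {
  idx <- match(ref$rsid, dat$rsid)
  if (anyNA(idx)) stop("Some reference SNPs are missing from the summary data")
  out <- dat[idx, , drop = FALSE]
  same <- out$effect_allele == ref$effect_allele
  swapped <- out$effect_allele == ref$other_allele &
    out$other_allele == ref$effect_allele
  if (any(!(same | swapped))) stop("Alleles do not match the reference")
  # Flip the sign of the estimate and the allele frequency for swapped SNPs
  out$beta[swapped] <- -out$beta[swapped]
  out$eaf[swapped] <- 1 - out$eaf[swapped]
  tmp <- out$effect_allele[swapped]
  out$effect_allele[swapped] <- out$other_allele[swapped]
  out$other_allele[swapped] <- tmp
  rownames(out) <- NULL
  out
}

# Made-up example data
ref <- data.frame(rsid = c("rs1", "rs2", "rs3"),
                  effect_allele = c("A", "C", "G"),
                  other_allele = c("G", "T", "A"),
                  stringsAsFactors = FALSE)
gy <- data.frame(rsid = c("rs3", "rs1", "rs2"),
                 effect_allele = c("G", "G", "C"),
                 other_allele = c("A", "A", "T"),
                 beta = c(0.05, 0.10, -0.02),
                 eaf = c(0.30, 0.80, 0.45),
                 stringsAsFactors = FALSE)
gy_aligned <- harmonize_summary(ref, gy)
```

After this step the rows of gy_aligned follow the reference order and every estimate refers to the reference effect allele, so the same procedure can be repeated for each summary data set before the matrices are passed on.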
5.1.2.1. 𝐺_𝑋
From 𝐺_𝑋, we need to identify the genetic variants that will be included as the instruments
and extract the association estimates needed for composing the 𝑨̂ matrix. For high-dimensional 𝑿,
𝐺_𝑋 will also be used to identify the candidate intermediates for selection. Table 5.1 displays
an example of the data structure of the 𝑨̂ matrix.
Table 5.1 Data structure example for the 𝑨̂ ∈ ℝ^(𝐾×𝑀) matrix.
             𝑋_1    𝑋_2    …    𝑋_(𝑚−1)   𝑋_𝑚    …    𝑋_(𝑀−1)   𝑋_𝑀
𝑆𝑁𝑃_1        0.1    0      …    0         -0.8   …    0.2       0.1
𝑆𝑁𝑃_2        0      0.1    …    0         0      …    0.3       0
…            …      …      …    …         …      …    …         …
𝑆𝑁𝑃_(𝐾−1)    -0.1   0.1    …    0.3       0      …    0         -0.2
𝑆𝑁𝑃_𝐾        0.9    0.8    …    0         0      …    0         0
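A matrix with the layout of Table 5.1 can be assembled from long-format summary data. The sketch below assumes, as the zeros in Table 5.1 suggest, that a SNP not used as an instrument for an intermediate gets a zero entry; the data frame gx and its column names are hypothetical.

```r
# Hypothetical long-format G_X summary data: one row per SNP-intermediate pair
gx <- data.frame(
  rsid = c("rs1", "rs1", "rs2", "rs3", "rs3"),
  intermediate = c("X1", "X2", "X2", "X1", "X3"),
  beta = c(0.10, -0.80, 0.10, -0.10, 0.30),
  stringsAsFactors = FALSE
)
snps <- c("rs1", "rs2", "rs3")         # K instruments, in the agreed order
intermediates <- c("X1", "X2", "X3")   # M intermediates

# Start from a K x M matrix of zeros and fill in the available estimates
A <- matrix(0, nrow = length(snps), ncol = length(intermediates),
            dimnames = list(snps, intermediates))
A[cbind(match(gx$rsid, snps), match(gx$intermediate, intermediates))] <- gx$beta
A
```

Fixing the SNP order in snps once and reusing it for every input keeps the rows of the matrix aligned with the outcome vector and the reference genotype columns.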
5.1.2.2. 𝐺_𝑌
From 𝐺_𝑌, we extract three components: (1) 𝒃̂, the 𝐾-length vector of marginal association
estimates between the SNPs and the outcome, (2) EAF, the effect allele frequency, together with
the effect allele, for each SNP included, and (3) the sample size of 𝐺_𝑌. Additionally, for SHA-JAM,
we need se(𝒃̂), the standard error of 𝒃̂ from 𝐺_𝑌. 𝐺_𝑌 is most likely to be GWAS summary data.
5.1.2.3. 𝐺_𝐿
𝐺_𝐿 is the reference genotype data used to extract the LD structure, which is applied in
computing the plug-in of 𝐺_𝑌′𝐺_𝑌 in hJAM and of 𝐺_𝑋′𝐺_𝑋 when obtaining the conditional 𝑨̂
matrix with JAM or SuSiE JAM. In our applications, we used the European-ancestry subjects
in the 1000 Genomes Project (Consortium, 2015).
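One common way to form such a plug-in is to combine the LD correlation matrix from the reference panel with allele frequencies and the sample size of the summary data. The sketch below assumes Hardy-Weinberg genotype variances of 2p(1 - p) and centered genotypes; it illustrates the general idea on simulated data and is not necessarily the exact computation used inside the hJAM package.

```r
set.seed(1)
n_ref <- 500
eaf_true <- c(0.20, 0.35, 0.50, 0.10)
# Hypothetical reference genotype matrix G_L (subjects x SNPs), coded 0/1/2
Gl <- sapply(eaf_true, function(p) rbinom(n_ref, size = 2, prob = p))
colnames(Gl) <- paste0("rs", seq_along(eaf_true))

R <- cor(Gl)                        # LD (correlation) matrix from the reference panel
eaf <- colMeans(Gl) / 2             # allele frequencies; in practice taken from G_Y
N_Gy <- 10000                       # hypothetical sample size of G_Y
sd_g <- sqrt(2 * eaf * (1 - eaf))   # genotype SD under Hardy-Weinberg equilibrium

# Plug-in for G'G: sample size times the implied genotype covariance matrix
GtG <- N_Gy * (diag(sd_g) %*% R %*% diag(sd_g))
dimnames(GtG) <- dimnames(R)
```

The off-diagonal structure comes entirely from the reference panel, which is why the reference population needs to resemble the populations behind the summary data.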
5.1.3. Data Available
5.1.3.1. Causal effect of body mass index and type 2 diabetes on myocardial infarction
MI.Rdata. This data set was used in the first data example for hJAM (Chapter 2,
section 2.4.1) and in the data example for hJAM Egger (Chapter 3, section 3.4). MI.Rdata
includes five components.
Table 5.2 Descriptions of the MI.Rdata in hJAM R package.
Data set components    Descriptions
MI.marginal.Amatrix    The marginal 𝑨̂ matrix. Columns one and two are the marginal estimates of the SNPs on body mass index from the GIANT consortium (n = 339,224) (Locke et al., 2015) and on type 2 diabetes from DIAGRAM+GERA+UKB (n = 659,316) (Xue et al., 2018), respectively.
MI.Amatrix             The conditional 𝑨̂ matrix composed by JAM from the marginal 𝑨̂ matrix. Columns one and two are the conditional effect estimates of the SNPs on body mass index and type 2 diabetes, respectively.
MI.Geno                The reference genotype data from the European-ancestry population in the 1000 Genomes Project (Consortium, 2015).
MI.betas.gwas          The 𝒃̂ vector. The association estimates between the selected SNPs and the risk of myocardial infarction from UK Biobank (Sudlow et al., 2015).
MI.SNPs_info           The SNP information. Five columns are included: the RSID, reference allele, reference allele frequency, if BMI significant and if T2D significant. The last two columns are indicator variables for the SNPs that are genome-wide significantly associated with BMI/T2D.
5.1.3.2. Selecting causal metabolites for prostate cancer risk
PrCa.lipids.Rdata. This data set was used in the first data example for SHA-JAM
(Chapter 4, section 4.4.1). PrCa.lipids.Rdata includes eight components.
Table 5.3 Descriptions of the PrCa.lipids.Rdata in hJAM R package.
Data set components             Descriptions
PrCa.lipids.marginal.Amatrix    The marginal 𝑨̂ matrix with 118 metabolites and 144 SNPs. This data is directly adapted from https://github.com/verena-zuber/demo_AMD (Zuber et al., 2020).
PrCa.lipids.Amatrix             The conditional 𝑨̂ matrix with 118 metabolites and 144 SNPs, which was composed by SuSiE JAM from the marginal 𝑨̂ matrix.
PrCa.lipids.Geno                The reference genotype data for the 144 SNPs from the European-ancestry population in the 1000 Genomes Project (Consortium, 2015).
PrCa.lipids.betas.gwas          The 𝒃̂ vector. The association estimates between the selected SNPs and the risk of prostate cancer from (Schumacher et al., 2018).
PrCa.lipids.betas.se.gwas       The se(𝒃̂) vector from (Schumacher et al., 2018).
PrCa.lipids.pvalue.gwas         The vector of P values of the association estimates between the selected SNPs and the risk of prostate cancer from (Schumacher et al., 2018).
PrCa.lipids.maf.gwas            The vector of the effect allele frequency of the SNPs from (Schumacher et al., 2018).
PrCa.lipids.rsid                The RSID of the SNPs.
5.1.3.3. Selecting causal genes on chromosome 10 for prostate cancer risk
GTEx.PrCa.Rdata. This data set was used in the second data example for SHA-JAM
(Chapter 4, section 4.4.2). In this data example, we applied one set of inputs per chromosome.
For illustration purposes in the R package, we only provide the data on chromosome 10.
GTEx.PrCa.Rdata includes eight components.
Table 5.4 Descriptions of the GTEx.PrCa.Rdata in hJAM R package.
Data set components            Descriptions
GTEx.PrCa.marginal.Amatrix     The marginal 𝑨̂ matrix with 158 genes and 182 eQTLs. The raw data was downloaded from GTEx analysis v7 (https://gtexportal.org/home/datasets). Priority Pruner was used to select the independent eQTLs. We used this matrix for MR-BMA implementation.
GTEx.PrCa.Amatrix              The conditional 𝑨̂ matrix with 167 genes and 447 eQTLs, which was composed by SuSiE JAM from the raw data of the 𝑨̂ matrix.
GTEx.PrCa.Geno                 The reference genotype data for the 447 eQTLs from the European-ancestry population in the 1000 Genomes Project (Consortium, 2015).
GTEx.PrCa.betas.gwas           The 𝒃̂ vector. The association estimates between the eQTLs and the risk of prostate cancer from (Schumacher et al., 2018).
GTEx.PrCa.betas.se.gwas        The se(𝒃̂) vector from (Schumacher et al., 2018).
GTEx.PrCa.pvalue.gwas          The vector of P values of the association estimates between the eQTLs and prostate cancer risk from (Schumacher et al., 2018).
GTEx.PrCa.maf.gwas             The vector of the effect allele frequency of the eQTLs from (Schumacher et al., 2018).
GTEx.PrCa.marginal.selected    The RSID of the eQTLs in GTEx.PrCa.marginal.Amatrix.
5.1.4. Functions Available
5.1.4.1. hJAM and hJAM Egger regression
The arguments needed for hJAM and hJAM Egger regression are the same. Function
hJAM_lnreg is for hJAM (Chapter 2) implementation and hJAM_egger is for hJAM Egger
(Chapter 3) implementation. Sample implementations are:
# hJAM
hJAM_lnreg(betas.Gy, N.Gy, Gl, A, ridgeTerm = FALSE)
# hJAM Egger
hJAM_egger(betas.Gy, N.Gy, Gl, A, ridgeTerm = FALSE)
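To give a feel for what a call like hJAM_lnreg may compute, the sketch below works through a Cholesky-transformed regression on simulated data. It is an illustration and not the package source: it assumes the model in which the SNP-outcome effects equal 𝑨 times the intermediate effects, uses exact individual-level cross-products in place of the summary-data plug-ins, and omits the ridge term and standard errors. All object names are hypothetical.

```r
set.seed(2)
n <- 2000; K <- 6; M <- 2
G <- sapply(runif(K, 0.1, 0.5), function(p) rbinom(n, 2, p))
G <- scale(G, center = TRUE, scale = FALSE)   # centered genotypes
A <- matrix(rnorm(K * M, sd = 0.2), K, M)     # hypothetical SNP-intermediate effects
pi_true <- c(0.3, -0.1)
y <- as.vector(G %*% A %*% pi_true + rnorm(n))
y <- y - mean(y)

# Summary-level quantities (in practice built from marginal estimates and G_L)
GtG <- crossprod(G)
z <- as.vector(crossprod(G, y))

# Cholesky transformation: GtG = t(U) %*% U, so the transformed outcome
# has uncorrelated errors and can be regressed on U %*% A
U <- chol(GtG)
z_L <- as.vector(forwardsolve(t(U), z))
X_L <- U %*% A
pi_summary <- unname(coef(lm(z_L ~ 0 + X_L)))

# Reference: individual-level regression of y on the genetically predicted intermediates
XA <- G %*% A
pi_individual <- unname(coef(lm(y ~ 0 + XA)))
```

With exact cross-products the two sets of estimates coincide, which is the motivation for working with the transformed summary statistics when individual-level data are unavailable.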
5.1.4.2. SHA-JAM and Regularized hJAM
To implement SHA-JAM and regularized hJAM (Chapter 4), one needs to use
hJAM_selectX. The required arguments for SHA-JAM and regularized hJAM are different.
For both, betas.Gy, N.Gy, Gl, and A are required. Additionally, SHA-JAM
(selection_alg = 'susie') requires betas_se.Gy, L.susie, and
min_abs_corr. Sample implementations are:
# SHA-JAM
hJAM_selectX(betas.Gy, betas_se.Gy, N.Gy,
MAF_summary, Gl, A,
selection_alg = 'susie',
L.susie = 10, min_abs_corr = 0.6, coverage=0.95,
estimate_residual_variance = TRUE)
# Elastic net hJAM
hJAM_selectX(betas.Gy, N.Gy,
MAF_summary, Gl, A,
selection_alg = 'en', ridgeTerm = FALSE)
5.1.4.3. JAM and SuSiE JAM
To construct the conditional 𝑨̂ matrix, the user can use the JAM and susieJAM
functions for the JAM (Chapter 2) and SuSiE JAM (Chapter 4) implementations, respectively. The
MAF_summary vector should be the effect allele frequency of the SNPs from 𝐺_𝑋. Sample
implementations are:
# JAM
JAM(marginal_A, Gl, N.Gx, ridgeTerm = FALSE)
# SuSiE JAM
susieJAM(marginal_A, marginal_A_se, Gl, N.Gx, MAF_summary = NULL,
selection_alg = 'susie', L.susie= 10,
min_abs_corr = 0.6, coverage = 0.95)
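The idea behind converting marginal estimates into conditional ones can be illustrated with one column of the matrix. The sketch below uses simulated individual-level data so that the result can be checked against a joint regression; it is a simplified illustration, not the JAM function itself, and all object names are hypothetical.

```r
set.seed(3)
n <- 1500
# Hypothetical correlated genotypes: rs2 often copies rs1
g1 <- rbinom(n, 2, 0.3)
g2 <- ifelse(runif(n) < 0.7, g1, rbinom(n, 2, 0.3))
g3 <- rbinom(n, 2, 0.4)
G <- scale(cbind(rs1 = g1, rs2 = g2, rs3 = g3), center = TRUE, scale = FALSE)
x <- as.vector(G %*% c(0.2, 0, 0.1) + rnorm(n))
x <- x - mean(x)

# Marginal (one SNP at a time) estimates, as reported in summary data
marginal <- as.vector(crossprod(G, x)) / colSums(G^2)

# Rebuild z = G'x from the marginal estimates, then solve G'G a = z
GtG <- crossprod(G)
z <- marginal * diag(GtG)
conditional <- as.vector(solve(GtG, z))

# Reference: joint regression on all SNPs with individual-level data
joint <- unname(coef(lm(x ~ 0 + G)))
```

In an actual analysis the cross-product matrix is not observed and is replaced by the plug-in built from the reference genotype data.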
5.1.4.4. Correlation heatmap of SNPs
To plot the correlation structure of the SNPs, the user could use SNPs_heatmap. The
sample implementation and output (Figure 5.1) are shown below.
# SNPs heatmap
SNPs_heatmap(Gl)
Figure 5.1 Sample output from the SNPs_heatmap function in R package hJAM.
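For readers who want to inspect the correlation structure without the package, a rough base R equivalent is sketched here. The function plot_snp_cor and the simulated genotype matrix are hypothetical, and the appearance will differ from the output of SNPs_heatmap.

```r
set.seed(4)
# Hypothetical reference genotype matrix (rows: subjects, columns: SNPs)
Gl <- sapply(c(0.20, 0.30, 0.40, 0.25, 0.15), function(p) rbinom(300, 2, p))
colnames(Gl) <- paste0("rs", 1:5)

snp_cor <- cor(Gl)   # pairwise SNP correlations

# Draw the correlation matrix with the first SNP in the top-left corner
plot_snp_cor <- function(r) {
  k <- ncol(r)
  image(1:k, 1:k, r[, k:1], zlim = c(-1, 1), axes = FALSE,
        xlab = "", ylab = "",
        col = colorRampPalette(c("blue", "white", "red"))(21))
  axis(1, at = 1:k, labels = colnames(r), las = 2)
  axis(2, at = 1:k, labels = rev(colnames(r)), las = 2)
}
```

Calling plot_snp_cor(snp_cor) draws the heatmap on the active graphics device.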
5.2. Package Usage
5.2.1. Get Started
We have published the R package hJAM version 1 on CRAN
(https://CRAN.R-project.org/package=hJAM). The current CRAN version only includes the
implementations in Chapters 2 and 3. The methods in Chapter 4 are accessible through Github
(https://github.com/lailylajiang/hJAM or https://github.com/USCbiostats/hJAM). We
recommend using the development version (Github). To get started, the user can install the
package from CRAN or Github.
# For CRAN version
install.packages("hJAM")
# For Github version
install.packages("devtools")
devtools::install_github("lailylajiang/hJAM")
library(hJAM)
5.2.2. Implementation of hJAM
First, we take a quick look at the data in the example (MI.Rdata).
data(MI.Rdata)
MI.Amatrix[1:5, ]
## bmi t2d
## [1,] 0.019531085 0.072587211
## [2,] 0.025262061 0.013586392
## [3,] -0.005147363 0.089673178
## [4,] 0.046302578 0.041313103
## [5,] 0.016849395 -0.004564683
MI.betas.gwas[1:5]
## [1] 0.0197298348 0.0133151413 0.0008717583 0.0213550014 -0.0031514278
MI.SNPs_info[1:5, ]
## SNP Major_A ref_frq BMI.sig T2D.sig
## 1 rs2296173 G 0.12079449 0 1
## 2 rs657452 A 0.54742602 1 0
## 3 rs12088739 A 0.87859749 0 1
## 4 rs3101336 C 0.67815160 1 0
## 5 rs12566985 G 0.67936765 1 0
If the conditional 𝑨̂ matrix is not available, the user can use JAM to convert the marginal
matrix into the conditional one.
MI.cond_A = JAM(marginal_A = MI.marginal.Amatrix, Gl = MI.Geno,
N.Gx = 339224, ridgeTerm = TRUE)
MI.cond_A[1:5, ]
## bmi t2d
## [1,] 0.019531085 0.072587211
## [2,] 0.025262061 0.013586392
## [3,] -0.005147363 0.089673178
## [4,] 0.046302578 0.041313103
## [5,] 0.016849395 -0.004564683
MI.Amatrix[1:5, ]
## bmi t2d
## [1,] 0.019531085 0.072587211
## [2,] 0.025262061 0.013586392
## [3,] -0.005147363 0.089673178
## [4,] 0.046302578 0.041313103
## [5,] 0.016849395 -0.004564683
When the inputs are ready, we can run hJAM_lnreg for hJAM implementation.
# 459324 is the sample size of the UK Biobank GWAS of MI
hJAM_lnreg(betas.Gy = MI.betas.gwas, N.Gy = 459324, A = MI.Amatrix,
Gl = MI.Geno, ridgeTerm = TRUE)
## ------------------------------------------------------
## hJAM output
## ------------------------------------------------------
## Number of SNPs used in model: 210
##
## Estimate StdErr 95% CI Pvalue
## bmi 0.322 0.061 (0.202, 0.441) 1.268210e-07
## t2d 0.119 0.017 (0.086, 0.153) 3.176604e-12
## ------------------------------------------------------
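As a reading aid for the output above, the interval and P value columns can be approximately reproduced from the estimate and standard error. The sketch assumes a Wald-type normal approximation, which is an assumption on our part rather than a documented feature of the package; because the inputs are the rounded values printed above, the results agree with the printed columns only up to rounding.

```r
# Rounded values copied from the printed hJAM output
est <- c(bmi = 0.322, t2d = 0.119)
se <- c(bmi = 0.061, t2d = 0.017)

z_stat <- est / se
ci_low <- est - qnorm(0.975) * se    # lower bound of the 95% interval
ci_high <- est + qnorm(0.975) * se   # upper bound of the 95% interval
p_value <- 2 * pnorm(-abs(z_stat))   # two-sided P value

round(cbind(Estimate = est, StdErr = se, Lower = ci_low, Upper = ci_high), 3)
```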
5.2.3. Implementation of SHA-JAM
First, we take a quick look at the data in the example (PrCa.lipids.Rdata) and process the data
as described in Chapter 4, section 4.4.2. Note that this data example includes the potentially
influential SNPs identified by the MR-BMA models (section 4.4.2).
data(PrCa.lipids.Rdata)
library(MRChallenge)
library(dplyr) # provides %>% and filter() used below
names(PrCa.lipids.betas.gwas) = PrCa.lipids.rsid
names(PrCa.lipids.betas.se.gwas) = PrCa.lipids.rsid
names(PrCa.lipids.maf.gwas) = PrCa.lipids.rsid
names(PrCa.lipids.pvalue.gwas) = PrCa.lipids.rsid
dim(PrCa.lipids.Amatrix)
## [1] 144 118
# Filter the X based on collinearity
index = c(1, 2, 3, 4, 5, 6, 11, 12, 15, 21, 22, 23, 24, 26, 27, 28, 29, 34, 35, 36,
47, 53, 55, 56, 57, 58, 59, 70, 76, 79, 80, 83, 84, 87, 92, 93, 94, 95, 96,
97, 98, 99, 100, 101, 102, 108, 111, 115, 118)
PrCa.lipids.Amatrix.f = PrCa.lipids.Amatrix[, index]
# Filter the X based on the significance
pmat_in = Challenge_dat %>%
filter(rsid %in% rownames(PrCa.lipids.Amatrix.f))
pmat_in = pmat_in[, 150:267]
pmat_in = pmat_in[, index]
minp = apply(pmat_in, MARGIN=2, FUN=min)
PrCa.lipids.Amatrix.f = PrCa.lipids.Amatrix.f[, which(minp<5e-8)]
dim(PrCa.lipids.Amatrix.f)
## [1] 144 30
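The significance filter used above keeps an intermediate only if at least one SNP reaches genome-wide significance for it. A self-contained version of the same step, on made-up matrices so that it runs without MRChallenge, might look like this; A and pmat are hypothetical stand-ins for the conditional matrix and the matching P values.

```r
set.seed(5)
K <- 20; M <- 6
# Hypothetical conditional A matrix and matching matrix of SNP-intermediate P values
A <- matrix(rnorm(K * M, sd = 0.1), K, M,
            dimnames = list(paste0("rs", 1:K), paste0("X", 1:M)))
pmat <- matrix(runif(K * M, min = 1e-4, max = 1), K, M, dimnames = dimnames(A))
pmat[3, "X2"] <- 1e-12   # X2 and X5 each get one genome-wide significant SNP
pmat[7, "X5"] <- 3e-9

# Keep intermediates whose smallest P value passes the genome-wide threshold
minp <- apply(pmat, MARGIN = 2, FUN = min)
keep <- names(minp)[minp < 5e-8]
A_filtered <- A[, keep, drop = FALSE]
dim(A_filtered)
```

Selecting columns by name rather than by position guards against the P value matrix and the 𝑨̂ matrix having their columns in different orders.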
Now that the data are ready, we can run hJAM_selectX for SHA-JAM implementation.
shJAM.out.f =
hJAM_selectX(betas.Gy = PrCa.lipids.betas.gwas,
betas_se.Gy = PrCa.lipids.betas.se.gwas,
N.Gy = 140000,
Gl = PrCa.lipids.Geno,
MAF_summary = PrCa.lipids.maf.gwas,
A = PrCa.lipids.Amatrix.f,
selection_alg = 'susie',
L.susie = 5, min_abs_corr = 0.6, coverage=0.9,
max_iter = 500)
shJAM.out.f
## ------------------------------------------------------
## hJAM output
## ------------------------------------------------------
## Number of SNPs used in model: 144
## Number of intermediates used in model: 30
## Number of the credible sets of intermediates selected by susie: 1
##
## Credible Sets Variable Coefficients PIP
## 1 L1 IDL.C 0.030 0.263
## 2 L1 LDL.C 0.029 0.243
## 3 L1 S.LDL.C 0.050 0.388
## 4 L1 Serum.C 0.008 0.066
## ------------------------------------------------------
We can check the purity of the selected credible set(s):
print(shJAM.out.f$cs_purity)
## min.abs.corr mean.abs.corr median.abs.corr
## L1 0.9020388 0.947638 0.9436103
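The three purity columns summarize how strongly the members of a credible set are correlated with one another. A minimal sketch of such a summary is given below, assuming it is taken over each distinct pair of members; the helper cs_purity and the simulated matrix are hypothetical, and which matrix the package correlates internally is not stated here.

```r
# Hypothetical helper: minimum, mean and median absolute pairwise correlation
cs_purity <- function(X, members) {
  if (length(members) < 2) {
    return(c(min.abs.corr = 1, mean.abs.corr = 1, median.abs.corr = 1))
  }
  r <- abs(cor(X[, members, drop = FALSE]))
  pairs <- r[upper.tri(r)]   # each distinct pair once
  c(min.abs.corr = min(pairs),
    mean.abs.corr = mean(pairs),
    median.abs.corr = median(pairs))
}

set.seed(6)
n <- 400
base <- rnorm(n)
# Three near-copies of a shared signal plus one unrelated variable
X <- cbind(X1 = base + rnorm(n, sd = 0.3),
           X2 = base + rnorm(n, sd = 0.3),
           X3 = base + rnorm(n, sd = 0.3),
           X4 = rnorm(n))
purity <- cs_purity(X, c("X1", "X2", "X3"))
```

A minimum absolute correlation above the min_abs_corr argument (0.6 in the call above) indicates that the set members are close substitutes for one another.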
5.3. Conclusion
In this chapter, we described the R package hJAM and provided two usage
examples based on the data applications discussed in Chapter 2 and Chapter 4. Because of the
flexibility of hJAM, several extensions are possible, and their implementations could be
incorporated into this package.
Chapter 6.
Conclusions
Over the last decade, the rapid emergence of GWAS has revolutionized the field of
complex disease and quantitative genetics research. Mendelian randomization and TWAS,
two methodologies that have benefited from the growth of GWAS, have been increasingly
employed by scientists to investigate the causal effect of modifiable risk factors or genes on an
outcome. In this chapter, we discuss our contributions and future directions for the fields
of Mendelian randomization and TWAS as well as post-GWAS research (Figure 6.1).
Figure 6.1 Flowchart of the hJAM family approaches and the application examples.
This figure describes the relationships of the methods that we discussed in this dissertation (in
blue), one method proposed by our research team (in green), and two methods we are currently
working on (in yellow).
6.1. Summary of Our Contributions
Firstly, we unified the framework of Mendelian randomization and TWAS statistically.
Mendelian randomization and TWAS have been viewed as two different approaches because of
the different target questions. In Chapter 2, we unified the framework of the two methods by a
hierarchical model and demonstrated the theoretical connection between the hierarchical model
and the existing Mendelian randomization and TWAS approaches.
Secondly, we proposed a hierarchical joint analysis of marginal summary statistics
(hJAM) (Chapter 2). hJAM has several advantages over existing Mendelian randomization and
TWAS approaches. First of all, it can estimate the joint effects of multiple intermediates, which
can either be modifiable risk factors or gene expressions, and account for the correlation between
the intermediates. Additionally, it can incorporate the correlation of the SNPs used in the
analysis. We showed that hJAM yields an unbiased estimate, maintains correct type-I error and
has increased power across extensive simulations. Moreover, the flexibility of the hJAM model
provides a natural way to incorporate Egger regression in the model and to account for
pleiotropic effects (Chapter 3). We showed comparable performance between hJAM Egger and an
existing multivariable Mendelian randomization approach with Egger regression.
Thirdly, we proposed a scalable hierarchical approach to joint analysis of marginal
summary data (SHA-JAM) for variable selection in high-dimensional intermediates data.
SHA-JAM has shown several advantages over existing methods. SHA-JAM overcomes a
limitation of hJAM in that it can incorporate highly correlated SNPs and intermediates. This
property enables SHA-JAM to incorporate different types of intermediates and extends
SHA-JAM to broader fields of research such as omics data. Additionally, in the
simulation studies, SHA-JAM showed a higher AUC and lower MSE with less bias compared to
existing approaches across almost all simulation scenarios. We also highlighted the
computational efficiency and simplicity of SHA-JAM through the simulation studies and real data
examples.
Fourthly, we discussed and highlighted the composition of the 𝑨̂ matrix, including
identifying the genetic variants and composing the estimates in the matrix. We showed in our
data example (Chapter 2) that the results from Mendelian randomization and TWAS are sensitive
to the genetic variants included in the analysis. We provided different ways to obtain the
conditional 𝑨̂ matrix for different types of summary data. For example, for moderately or weakly
correlated SNPs, one can use JAM to get the conditional estimates; for highly correlated SNPs,
one can use SuSiE JAM to select SNPs and get the posterior means as the conditional estimates.
Last but not least, we provided an R package, hJAM, for the implementations of the
approaches proposed in this dissertation.
6.2. Future Directions
Leveraging the information across multiple ethnic groups can increase the power to
facilitate the identification of causal SNPs. Several GWAS in multi-ethnic populations have
detected novel loci for complex traits (Cook & Morris, 2016; Roselli et al., 2018). To the best of
our knowledge, there is no approach for multi-ethnic Mendelian randomization or TWAS.
We have developed a preliminary implementation of multi-ethnic hJAM (mhJAM), as
described in Figure 6.1 (in light yellow), to combine the information from multiple ethnic groups
and improve the power to detect the effects of the intermediates on the outcome. The core
idea of mhJAM is to adapt fixed-effect meta-analysis and stack the effects across
populations. We then follow the idea of meta-regression and estimate an average effect across
multiple ethnic groups. Simulation studies show good overall performance of the mhJAM
model, with an unbiased estimator, well-controlled type-I error and high power. To add variable
selection in a high-dimensional setting across different populations, we have preliminary work on a
method that integrates the SHA-JAM and mhJAM frameworks (Figure 6.1, in dark yellow).
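The fixed-effect meta-analysis idea that mhJAM adapts can be illustrated with a standard inverse-variance weighted average. The sketch below is a generic illustration of that idea only, not the mhJAM model; the population labels, estimates and standard errors are made up.

```r
# Hypothetical per-population estimates of one intermediate's effect
est <- c(EUR = 0.30, AFR = 0.22, EAS = 0.35)
se <- c(EUR = 0.05, AFR = 0.10, EAS = 0.08)

w <- 1 / se^2                          # inverse-variance weights
pooled_est <- sum(w * est) / sum(w)    # fixed-effect average across populations
pooled_se <- sqrt(1 / sum(w))
pooled_p <- 2 * pnorm(-abs(pooled_est / pooled_se))
```

The pooled standard error is smaller than any single-population standard error, which is the source of the power gain from combining ethnic groups.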
References
Ao, X., Liu, Y., Bai, X.-Y., Qu, X., Xu, Z., Hu, G., Chen, M., & Wu, H. (2015). Association
between EHBP1 rs721048 (A> G) polymorphism and prostate cancer susceptibility: a
meta-analysis of 17 studies involving 150,678 subjects. OncoTargets and therapy, 8,
1671.
Arthur, R., Møller, H., Garmo, H., Holmberg, L., Stattin, P., Malmstrom, H., Lambe, M.,
Hammar, N., Walldius, G., & Robinson, D. (2016). Association between baseline serum
glucose, triglycerides and total cholesterol, and prostate cancer risk categories. Cancer
medicine, 5(6), 1307-1318.
Bansal, D., Undela, K., D'Cruz, S., & Schifano, F. (2012). Statin use and risk of prostate cancer:
a meta-analysis of observational studies. PloS one, 7(10), e46691.
Barbeira, A. N., Dickinson, S. P., Bonazzola, R., Zheng, J., Wheeler, H. E., Torres, J. M.,
Torstenson, E. S., Shah, K. P., Garcia, T., Edwards, T. L., Stahl, E. A., Huckins, L. M.,
Nicolae, D. L., Cox, N. J., Im, H. K., & Consortium, G. (2018). Exploring the phenotypic
consequences of tissue specific gene expression variation inferred from GWAS summary
statistics. Nat Commun, 9(1), 1825. https://doi.org/10.1038/s41467-018-03621-1
Barrett-Connor, E. L., Cohn, B. A., Wingard, D. L., & Edelstein, S. L. (1991). Why is diabetes
mellitus a stronger risk factor for fatal ischemic heart disease in women than in men?: the
Rancho Bernardo Study. Jama, 265(5), 627-631.
Boudreau, D. M., Yu, O., Buist, D. S., & Miglioretti, D. L. (2008). Statin use and prostate cancer
risk in a large population-based setting. Cancer Causes & Control, 19(7), 767-774.
Bowden, J., Davey Smith, G., & Burgess, S. (2015). Mendelian randomization with invalid
instruments: effect estimation and bias detection through Egger regression. Int J
Epidemiol, 44(2), 512-525. https://doi.org/10.1093/ije/dyv080
Brown, P. J., Vannucci, M., & Fearn, T. (1998). Multivariate Bayesian variable selection and
prediction. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
60(3), 627-641.
Bull, C. J., Bonilla, C., Holly, J. M., Perks, C. M., Davies, N., Haycock, P., Yu, O. H. Y.,
Richards, J. B., Eeles, R., & Easton, D. (2016). Blood lipids and prostate cancer: a
Mendelian randomization analysis. Cancer medicine, 5(6), 1125-1136.
Buniello, A., MacArthur, J. A. L., Cerezo, M., Harris, L. W., Hayhurst, J., Malangone, C.,
McMahon, A., Morales, J., Mountjoy, E., & Sollis, E. (2019). The NHGRI-EBI GWAS
Catalog of published genome-wide association studies, targeted arrays and summary
statistics 2019. Nucleic acids research, 47(D1), D1005-D1012.
Burgess, S., & Bowden, J. (2015). Integrating summarized data from multiple genetic variants in
Mendelian randomization: bias and coverage properties of inverse-variance weighted
methods. arXiv preprint arXiv:1512.04486.
Burgess, S., Butterworth, A., & Thompson, S. G. (2013). Mendelian randomization analysis with
multiple genetic variants using summarized data. Genet Epidemiol, 37(7), 658-665.
https://doi.org/10.1002/gepi.21758
Burgess, S., Dudbridge, F., & Thompson, S. G. (2015). Re: "Multivariable Mendelian
randomization: the use of pleiotropic genetic variants to estimate causal effects". Am J
Epidemiol, 181(4), 290-291. https://doi.org/10.1093/aje/kwv017
Burgess, S., & Thompson, S. G. (2015). Multivariable Mendelian Randomization: The Use of
Pleiotropic Genetic Variants to Estimate Causal Effects. American Journal of
Epidemiology, 181(4), 251-260. https://doi.org/10.1093/aje/kwu283
Chen, G. K., & Witte, J. S. (2007). Enriching the analysis of genomewide association studies
with hierarchical modeling. The American Journal of Human Genetics, 81(2), 397-404.
Consortium, G. P. (2015). A global reference for human genetic variation. Nature, 526(7571),
68.
Conti, D. V., Wang, K., Sheng, X., Bensen, J. T., Hazelett, D. J., Cook, M. B., Ingles, S. A.,
Kittles, R. A., Strom, S. S., & Rybicki, B. A. (2017). Two novel susceptibility loci for
prostate cancer in men of African ancestry. JNCI: Journal of the National Cancer
Institute, 109(8), djx084.
Conti, D. V., & Witte, J. S. (2003). Hierarchical modeling of linkage disequilibrium: genetic
structure and spatial relations. The American Journal of Human Genetics, 72(2), 351-363.
Cook, J. P., & Morris, A. P. (2016). Multi-ethnic genome-wide association study identifies novel
locus for type 2 diabetes susceptibility. European Journal of Human Genetics, 24(8),
1175-1180.
Degli Esposti, D., Hernandez-Vargas, H., Voegele, C., Fernandez-Jimenez, N., Forey, N.,
Bancel, B., Le Calvez-Kelm, F., McKay, J., Merle, P., & Herceg, Z. (2016).
Identification of novel long non-coding RNAs deregulated in hepatocellular carcinoma
using RNA-sequencing. Oncotarget, 7(22), 31862.
Dettmer, K., Aronov, P. A., & Hammock, B. D. (2007). Mass spectrometry-based
metabolomics. Mass spectrometry reviews, 26(1), 51-78.
Edlund, C. K., Anker, M., Schumacher, F. R., Gauderman, W. J., & Conti, D. V. PriorityPruner.
http://prioritypruner.sourceforge.net/
Eeles, R. A., Kote-Jarai, Z., Giles, G. G., Al Olama, A. A., Guy, M., Jugurnauth, S. K.,
Mulholland, S., Leongamornlert, D. A., Edwards, S. M., & Morrison, J. (2008). Multiple
newly identified loci associated with prostate cancer susceptibility. Nature genetics,
40(3), 316.
Egger, M., Davey Smith, G., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected
by a simple, graphical test. BMJ, 315(7109), 629-634.
https://doi.org/10.1136/bmj.315.7109.629
Fernandez, C., Ley, E., & Steel, M. F. (2001). Benchmark priors for Bayesian model averaging.
Journal of Econometrics, 100(2), 381-427.
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning (Vol. 1).
Springer series in statistics New York.
Gamazon, E. R., Wheeler, H. E., Shah, K. P., Mozaffari, S. V., Aquino-Michaels, K., Carroll, R.
J., Eyler, A. E., Denny, J. C., Nicolae, D. L., Cox, N. J., Im, H. K., & Consortium, G.
(2015). A gene-based association method for mapping traits using reference
transcriptome data. Nat Genet, 47(9), 1091-1098. https://doi.org/10.1038/ng.3367
Gkatzionis, A., Burgess, S., Conti, D., & Newcombe, P. J. (2019). Bayesian variable selection
with a pleiotropic loss function in Mendelian randomization. BioRxiv, 593863.
Greenland, S. (2000). Principles of multilevel modelling. International journal of epidemiology,
29(1), 158-167.
Group, L. A. R. (2010). Long term effects of a lifestyle intervention on weight and
cardiovascular risk factors in individuals with type 2 diabetes: four year results of the
Look AHEAD trial. Archives of internal medicine, 170(17), 1566.
Gudbjartsson, D. F., Walters, G. B., Thorleifsson, G., Stefansson, H., Halldorsson, B. V.,
Zusmanovich, P., Sulem, P., Thorlacius, S., Gylfason, A., & Steinberg, S. (2008). Many
sequence variants affecting diversity of adult human height. Nature genetics, 40(5), 609-
615.
Gusev, A., Ko, A., Shi, H., Bhatia, G., Chung, W., Penninx, B. W., Jansen, R., de Geus, E. J.,
Boomsma, D. I., Wright, F. A., Sullivan, P. F., Nikkola, E., Alvarez, M., Civelek, M.,
Lusis, A. J., Lehtimäki, T., Raitoharju, E., Kähönen, M., Seppälä, I., Raitakari, O. T.,
Kuusisto, J., Laakso, M., Price, A. L., Pajukanta, P., & Pasaniuc, B. (2016). Integrative
approaches for large-scale transcriptome-wide association studies. Nat Genet, 48(3), 245-
252. https://doi.org/10.1038/ng.3506
Gusev, A., Lawrenson, K., Lin, X., Lyra, P. C., Kar, S., Vavra, K. C., Segato, F., Fonseca, M. A.,
Lee, J. M., & Pejovic, T. (2019). A transcriptome-wide association study of high-grade
serous epithelial ovarian cancer identifies new susceptibility genes and splice variants.
Nature genetics, 51(5), 815-823.
Gusev, A., Shi, H., Kichaev, G., Pomerantz, M., Li, F., Long, H. W., Ingles, S. A., Kittles, R. A.,
Strom, S. S., & Rybicki, B. A. (2016). Atlas of prostate cancer heritability in European
and African-American men pinpoints tissue-specific regulation. Nature communications,
7, 10979.
Heir, T., Falk, R. S., Robsahm, T. E., Sandvik, L., Erikssen, J., & Tretli, S. (2016). Cholesterol
and prostate cancer risk: a long-term prospective cohort study. BMC cancer, 16(1), 643.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal
problems. Technometrics, 12(1), 55-67.
Hu, X., Qiu, Z., Zeng, J., Xiao, T., Ke, Z., & Lyu, H. (2018). A novel long non-coding RNA,
AC012456.4, as a valuable and independent prognostic biomarker of survival in oral
squamous cell carcinoma. PeerJ, 6, e5307.
Kahn, S. E., Hull, R. L., & Utzschneider, K. M. (2006). Mechanisms linking obesity to insulin
resistance and type 2 diabetes. Nature, 444(7121), 840.
Kettunen, J., Demirkan, A., Würtz, P., Draisma, H. H., Haller, T., Rawal, R., Vaarhorst, A.,
Kangas, A. J., Lyytikäinen, L.-P., & Pirinen, M. (2016). Genome-wide study for
circulating metabolites identifies 62 loci and reveals novel systemic effects of LPA.
Nature communications, 7(1), 1-9.
Khankari, N. K., Murff, H. J., Zeng, C., Wen, W., Eeles, R. A., Easton, D. F., Kote-Jarai, Z., Al
Olama, A. A., Benlloch, S., & Muir, K. (2016). Polyunsaturated fatty acids and prostate
cancer risk: a Mendelian randomisation analysis from the PRACTICAL consortium.
British journal of cancer, 115(5), 624-631.
Kok, D. E., van Roermund, J. G., Aben, K. K., den Heijer, M., Swinkels, D. W., Kampman, E.,
& Kiemeney, L. A. (2011). Blood lipid levels and prostate cancer risk; a cohort study.
Prostate cancer and prostatic diseases, 14(4), 340-345.
Lappalainen, T., Sammeth, M., Friedländer, M. R., AC‘t Hoen, P., Monlong, J., Rivas, M. A.,
Gonzalez-Porta, M., Kurbatova, N., Griebel, T., & Ferreira, P. G. (2013). Transcriptome
and genome sequencing uncovers functional variation in humans. Nature, 501(7468),
506.
Lauer, M. S., Anderson, K. M., Kannel, W. B., & Levy, D. (1991). The impact of obesity on left
ventricular mass and geometry: the Framingham Heart Study. Jama, 266(2), 231-236.
Lawlor, D. A., Harbord, R. M., Sterne, J. A., Timpson, N., & Davey Smith, G. (2008).
Mendelian randomization: using genes as instruments for making causal inferences in
epidemiology. Stat Med, 27(8), 1133-1163. https://doi.org/10.1002/sim.3034
Lee, J. D., Sun, D. L., Sun, Y., & Taylor, J. E. (2016). Exact post-selection inference, with
application to the lasso. The Annals of Statistics, 44(3), 907-927.
Lewinger, J. P., Conti, D. V., Baurley, J. W., Triche, T. J., & Thomas, D. C. (2007). Hierarchical
Bayes prioritization of marker associations from a genome-wide association scan for
further investigation. Genetic Epidemiology: The Official Publication of the International
Genetic Epidemiology Society, 31(8), 871-882.
Li, X., Zhang, Q., Fan, K., Li, B., Li, H., Qi, H., Guo, J., Cao, Y., & Sun, H. (2016).
Overexpression of TRPV3 correlates with tumor progression in non-small cell lung
cancer. International journal of molecular sciences, 17(4), 437.
Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for
Bayesian variable selection. Journal of the American Statistical Association, 103(481),
410-423.
Locke, A. E., Kahali, B., Berndt, S. I., Justice, A. E., Pers, T. H., Day, F. R., Powell, C.,
Vedantam, S., Buchkovich, M. L., Yang, J., Croteau-Chonka, D. C., Esko, T., Fall, T.,
Ferreira, T., Gustafsson, S., Kutalik, Z., Luan, J., Mägi, R., Randall, J. C., Winkler, T.
W., Wood, A. R., Workalemahu, T., Faul, J. D., Smith, J. A., Zhao, J. H., Zhao, W.,
Chen, J., Fehrmann, R., Hedman, Å., Karjalainen, J., Schmidt, E. M., Absher, D., Amin,
N., Anderson, D., Beekman, M., Bolton, J. L., Bragg-Gresham, J. L., Buyske, S.,
Demirkan, A., Deng, G., Ehret, G. B., Feenstra, B., Feitosa, M. F., Fischer, K., Goel, A.,
Gong, J., Jackson, A. U., Kanoni, S., Kleber, M. E., Kristiansson, K., Lim, U., Lotay, V.,
Mangino, M., Leach, I. M., Medina-Gomez, C., Medland, S. E., Nalls, M. A., Palmer, C.
D., Pasko, D., Pechlivanis, S., Peters, M. J., Prokopenko, I., Shungin, D., Stančáková, A.,
Strawbridge, R. J., Sung, Y. J., Tanaka, T., Teumer, A., Trompet, S., van der Laan, S. W.,
van Setten, J., Van Vliet-Ostaptchouk, J. V., Wang, Z., Yengo, L., Zhang, W., Isaacs, A.,
Albrecht, E., Ärnlöv, J., Arscott, G. M., Attwood, A. P., Bandinelli, S., Barrett, A., Bas, I.
N., Bellis, C., Bennett, A. J., Berne, C., Blagieva, R., Blüher, M., Böhringer, S.,
Bonnycastle, L. L., Böttcher, Y., Boyd, H. A., Bruinenberg, M., Caspersen, I. H., Chen,
Y. I., Clarke, R., Daw, E. W., de Craen, A. J. M., Delgado, G., Dimitriou, M., Doney, A.
S. F., Eklund, N., Estrada, K., Eury, E., Folkersen, L., Fraser, R. M., Garcia, M. E.,
Geller, F., Giedraitis, V., Gigante, B., Go, A. S., Golay, A., Goodall, A. H., Gordon, S.
D., Gorski, M., Grabe, H. J., Grallert, H., Grammer, T. B., Gräßler, J., Grönberg, H.,
Groves, C. J., Gusto, G., Haessler, J., Hall, P., Haller, T., Hallmans, G., Hartman, C. A.,
Hassinen, M., Hayward, C., Heard-Costa, N. L., Helmer, Q., Hengstenberg, C., Holmen,
O., Hottenga, J. J., James, A. L., Jeff, J. M., Johansson, Å., Jolley, J., Juliusdottir, T.,
Kinnunen, L., Koenig, W., Koskenvuo, M., Kratzer, W., Laitinen, J., Lamina, C.,
Leander, K., Lee, N. R., Lichtner, P., Lind, L., Lindström, J., Lo, K. S., Lobbens, S.,
Lorbeer, R., Lu, Y., Mach, F., Magnusson, P. K. E., Mahajan, A., McArdle, W. L.,
McLachlan, S., Menni, C., Merger, S., Mihailov, E., Milani, L., Moayyeri, A., Monda, K.
L., Morken, M. A., Mulas, A., Müller, G., Müller-Nurasyid, M., Musk, A. W., Nagaraja,
R., Nöthen, M. M., Nolte, I. M., Pilz, S., Rayner, N. W., Renstrom, F., Rettig, R., Ried, J.
S., Ripke, S., Robertson, N. R., Rose, L. M., Sanna, S., Scharnagl, H., Scholtens, S.,
Schumacher, F. R., Scott, W. R., Seufferlein, T., Shi, J., Smith, A. V., Smolonska, J.,
Stanton, A. V., Steinthorsdottir, V., Stirrups, K., Stringham, H. M., Sundström, J.,
Swertz, M. A., Swift, A. J., Syvänen, A. C., Tan, S. T., Tayo, B. O., Thorand, B.,
Thorleifsson, G., Tyrer, J. P., Uh, H. W., Vandenput, L., Verhulst, F. C., Vermeulen, S.
H., Verweij, N., Vonk, J. M., Waite, L. L., Warren, H. R., Waterworth, D., Weedon, M.
N., Wilkens, L. R., Willenborg, C., Wilsgaard, T., Wojczynski, M. K., Wong, A., Wright,
A. F., Zhang, Q., Brennan, E. P., Choi, M., Dastani, Z., Drong, A. W., Eriksson, P.,
Franco-Cereceda, A., Gådin, J. R., Gharavi, A. G., Goddard, M. E., Handsaker, R. E.,
Huang, J., Karpe, F., Kathiresan, S., Keildson, S., Kiryluk, K., Kubo, M., Lee, J. Y.,
Liang, L., Lifton, R. P., Ma, B., McCarroll, S. A., McKnight, A. J., Min, J. L., Moffatt,
M. F., Montgomery, G. W., Murabito, J. M., Nicholson, G., Nyholt, D. R., Okada, Y.,
Perry, J. R. B., Dorajoo, R., Reinmaa, E., Salem, R. M., Sandholm, N., Scott, R. A.,
Stolk, L., Takahashi, A., van 't Hooft, F. M., Vinkhuyzen, A. A. E., Westra, H. J., Zheng,
W., Zondervan, K. T., Heath, A. C., Arveiler, D., Bakker, S. J. L., Beilby, J., Bergman,
R. N., Blangero, J., Bovet, P., Campbell, H., Caulfield, M. J., Cesana, G., Chakravarti,
A., Chasman, D. I., Chines, P. S., Collins, F. S., Crawford, D. C., Cupples, L. A., Cusi,
D., Danesh, J., de Faire, U., den Ruijter, H. M., Dominiczak, A. F., Erbel, R., Erdmann,
J., Eriksson, J. G., Farrall, M., Felix, S. B., Ferrannini, E., Ferrières, J., Ford, I., Forouhi,
N. G., Forrester, T., Franco, O. H., Gansevoort, R. T., Gejman, P. V., Gieger, C.,
Gottesman, O., Gudnason, V., Gyllensten, U., Hall, A. S., Harris, T. B., Hattersley, A. T.,
Hicks, A. A., Hindorff, L. A., Hingorani, A. D., Hofman, A., Homuth, G., Hovingh, G.
K., Humphries, S. E., Hunt, S. C., Hyppönen, E., Illig, T., Jacobs, K. B., Jarvelin, M. R.,
Jöckel, K. H., Johansen, B., Jousilahti, P., Jukema, J. W., Jula, A. M., Kaprio, J.,
Kastelein, J. J. P., Keinanen-Kiukaanniemi, S. M., Kiemeney, L. A., Knekt, P., Kooner, J.
S., Kooperberg, C., Kovacs, P., Kraja, A. T., Kumari, M., Kuusisto, J., Lakka, T. A.,
Langenberg, C., Marchand, L. L., Lehtimäki, T., Lyssenko, V., Männistö, S., Marette, A.,
Matise, T. C., McKenzie, C. A., McKnight, B., Moll, F. L., Morris, A. D., Morris, A. P.,
Murray, J. C., Nelis, M., Ohlsson, C., Oldehinkel, A. J., Ong, K. K., Madden, P. A. F.,
Pasterkamp, G., Peden, J. F., Peters, A., Postma, D. S., Pramstaller, P. P., Price, J. F., Qi,
L., Raitakari, O. T., Rankinen, T., Rao, D. C., Rice, T. K., Ridker, P. M., Rioux, J. D.,
Ritchie, M. D., Rudan, I., Salomaa, V., Samani, N. J., Saramies, J., Sarzynski, M. A.,
Schunkert, H., Schwarz, P. E. H., Sever, P., Shuldiner, A. R., Sinisalo, J., Stolk, R. P.,
Strauch, K., Tönjes, A., Trégouët, D. A., Tremblay, A., Tremoli, E., Virtamo, J., Vohl,
M. C., Völker, U., Waeber, G., Willemsen, G., Witteman, J. C., Zillikens, M. C., Adair,
L. S., Amouyel, P., Asselbergs, F. W., Assimes, T. L., Bochud, M., Boehm, B. O.,
Boerwinkle, E., Bornstein, S. R., Bottinger, E. P., Bouchard, C., Cauchi, S., Chambers, J.
C., Chanock, S. J., Cooper, R. S., de Bakker, P. I. W., Dedoussis, G., Ferrucci, L., Franks,
P. W., Froguel, P., Groop, L. C., Haiman, C. A., Hamsten, A., Hui, J., Hunter, D. J.,
Hveem, K., Kaplan, R. C., Kivimaki, M., Kuh, D., Laakso, M., Liu, Y., Martin, N. G.,
März, W., Melbye, M., Metspalu, A., Moebus, S., Munroe, P. B., Njølstad, I., Oostra, B.
A., Palmer, C. N. A., Pedersen, N. L., Perola, M., Pérusse, L., Peters, U., Power, C.,
Quertermous, T., Rauramaa, R., Rivadeneira, F., Saaristo, T. E., Saleheen, D., Sattar, N.,
Schadt, E. E., Schlessinger, D., Slagboom, P. E., Snieder, H., Spector, T. D.,
Thorsteinsdottir, U., Stumvoll, M., Tuomilehto, J., Uitterlinden, A. G., Uusitupa, M., van
der Harst, P., Walker, M., Wallaschofski, H., Wareham, N. J., Watkins, H., Weir, D. R.,
Wichmann, H. E., Wilson, J. F., Zanen, P., Borecki, I. B., Deloukas, P., Fox, C. S., Heid,
I. M., O'Connell, J. R., Strachan, D. P., Stefansson, K., van Duijn, C. M., Abecasis, G. R.,
Franke, L., Frayling, T. M., McCarthy, M. I., Visscher, P. M., Scherag, A., Willer, C. J.,
Boehnke, M., Mohlke, K. L., Lindgren, C. M., Beckmann, J. S., Barroso, I., North, K. E.,
Ingelsson, E., Hirschhorn, J. N., Loos, R. J. F., Speliotes, E. K., Study, L. C.,
Consortium, A., Group, A.-B. W., Consortium, C. D., Consortium, C., GLGC, ICBP,
Investigators, M., Consortium, M., Consortium, M., Consortium, P., Consortium, R.,
Consortium, G., & Consortium, I. E. (2015). Genetic studies of body mass index yield
new insights for obesity biology. Nature, 518(7538), 197-206.
https://doi.org/10.1038/nature14177
Lonsdale, J., Thomas, J., Salvatore, M., Phillips, R., Lo, E., Shad, S., Hasz, R., Walters, G.,
Garcia, F., & Young, N. (2013). The genotype-tissue expression (GTEx) project. Nature
genetics, 45(6), 580.
Mancuso, N., Gayther, S., Gusev, A., Zheng, W., Penney, K. L., Kote-Jarai, Z., Eeles, R.,
Freedman, M., Haiman, C., & Pasaniuc, B. (2018). Large-scale transcriptome-wide
association study identifies new prostate cancer risk regions. Nature communications,
9(1), 4079.
Manson, J. E., Colditz, G. A., Stampfer, M. J., Willett, W. C., Krolewski, A. S., Rosner, B.,
Arky, R. A., Speizer, F. E., & Hennekens, C. H. (1991). A prospective study of maturity-
onset diabetes mellitus and risk of coronary heart disease and stroke in women. Archives
of internal medicine, 151(6), 1141-1147.
Martens, E. P., Pestman, W. R., de Boer, A., Belitser, S. V., & Klungel, O. H. (2006).
Instrumental variables: application and limitations. Epidemiology, 17(3), 260-267.
https://doi.org/10.1097/01.ede.0000215160.88317.cb
Milne, R. L., Kuchenbaecker, K. B., Michailidou, K., Beesley, J., Kar, S., Lindström, S., Hui, S.,
Lemaçon, A., Soucy, P., & Dennis, J. (2017). Identification of ten variants associated
with risk of estrogen-receptor-negative breast cancer. Nature genetics, 49(12), 1767-
1778.
Newcombe, P. J., Conti, D. V., & Richardson, S. (2016). JAM: A Scalable Bayesian Framework
for Joint Analysis of Marginal SNP Effects. Genet Epidemiol, 40(3), 188-201.
https://doi.org/10.1002/gepi.21953
Newcomer, L. M., King, I. B., Wicklund, K. G., & Stanford, J. L. (2001). The association of
fatty acids with prostate cancer risk. The Prostate, 47(4), 262-268.
Newhouse, J. P., & McClellan, M. (1998). Econometrics in outcomes research: The use of
instrumental variables. Annual Review of Public Health, 19, 17-34.
https://doi.org/10.1146/annurev.publhealth.19.1.17
Nica, A. C., & Dermitzakis, E. T. (2013). Expression quantitative trait loci: present and future.
Philosophical Transactions of the Royal Society B: Biological Sciences, 368(1620),
20120362.
Orho-Melander, M., Hindy, G., Borgquist, S., Schulz, C.-A., Manjer, J., Melander, O., & Stocks,
T. (2018). Blood lipid genetic scores, the HMGCR gene and cancer risk: a Mendelian
randomization study. International journal of epidemiology, 47(2), 495-505.
Pasaniuc, B., Zaitlen, N., Shi, H., Bhatia, G., Gusev, A., Pickrell, J., Hirschhorn, J., Strachan, D.
P., Patterson, N., & Price, A. L. (2014). Fast and accurate imputation of summary
statistics enhances evidence of functional enrichment. Bioinformatics, 30(20), 2906-2914.
Pierce, B. L., Kraft, P., & Zhang, C. (2018). Mendelian randomization studies of cancer risk: a
literature review. Current epidemiology reports, 5(2), 184-196.
Platz, E. A., Till, C., Goodman, P. J., Parnes, H. L., Figg, W. D., Albanes, D., Neuhouser, M. L.,
Klein, E. A., Thompson, I. M., & Kristal, A. R. (2009). Men with low serum cholesterol
have a lower risk of high-grade prostate cancer in the placebo arm of the prostate cancer
prevention trial. Cancer Epidemiology and Prevention Biomarkers, 18(11), 2807-2813.
Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., & Reich, D.
(2006). Principal components analysis corrects for stratification in genome-wide
association studies. Nature genetics, 38(8), 904.
Pulit, S. L., Stoneman, C., Morris, A. P., Wood, A. R., Glastonbury, C. A., Tyrrell, J., Yengo, L.,
Ferreira, T., Marouli, E., & Ji, Y. (2019). Meta-analysis of genome-wide association
studies for body fat distribution in 694 649 individuals of European ancestry. Human
molecular genetics, 28(1), 166-174.
Rees, J. M., Wood, A. M., & Burgess, S. (2017). Extending the MR-Egger method for
multivariable Mendelian randomization to correct for both measured and unmeasured
pleiotropy. Statistics in medicine, 36(29), 4705-4718.
Ritchey, J. D., Huang, W.-Y., Chokkalingam, A. P., Gao, Y.-T., Deng, J., Levine, P., Stanczyk,
F. Z., & Hsing, A. W. (2005). Genetic variants of DNA repair genes and prostate cancer:
a population-based study. Cancer Epidemiology and Prevention Biomarkers, 14(7),
1703-1709.
Robinson, G. K. (1991). That BLUP is a good thing: the estimation of random effects. Statistical
science, 6(1), 15-32.
Roselli, C., Chaffin, M. D., Weng, L.-C., Aeschbacher, S., Ahlberg, G., Albert, C. M., Almgren,
P., Alonso, A., Anderson, C. D., & Aragam, K. G. (2018). Multi-ethnic genome-wide
association study for atrial fibrillation. Nature genetics, 50(9), 1225-1233.
Runcie, D. E., & Crawford, L. (2019). Fast and flexible linear mixed models for genome-wide
genetics. PLoS genetics, 15(2), e1007978.
Sanderson, E., Davey Smith, G., Windmeijer, F., & Bowden, J. (2019). An examination of
multivariable Mendelian randomization in the single-sample and two-sample summary
data settings. International journal of epidemiology, 48(3), 713-727.
Sawa, T. (1969). The exact sampling distribution of ordinary least squares and two-stage least
squares estimators. Journal of the American Statistical association, 64(327), 923-937.
Schumacher, F. R., Al Olama, A. A., Berndt, S. I., Benlloch, S., Ahmed, M., Saunders, E. J.,
Dadaev, T., Leongamornlert, D., Anokian, E., Cieza-Borrella, C., Goh, C., Brook, M. N.,
Sheng, X., Fachal, L., Dennis, J., Tyrer, J., Muir, K., Lophatananon, A., Stevens, V. L.,
Gapstur, S. M., Carter, B. D., Tangen, C. M., Goodman, P. J., Thompson, I. M., Batra, J.,
Chambers, S., Moya, L., Clements, J., Horvath, L., Tilley, W., Risbridger, G. P.,
Gronberg, H., Aly, M., Nordström, T., Pharoah, P., Pashayan, N., Schleutker, J.,
Tammela, T. L. J., Sipeky, C., Auvinen, A., Albanes, D., Weinstein, S., Wolk, A.,
Håkansson, N., West, C. M. L., Dunning, A. M., Burnet, N., Mucci, L. A., Giovannucci,
E., Andriole, G. L., Cussenot, O., Cancel-Tassin, G., Koutros, S., Beane Freeman, L. E.,
Sorensen, K. D., Orntoft, T. F., Borre, M., Maehle, L., Grindedal, E. M., Neal, D. E.,
Donovan, J. L., Hamdy, F. C., Martin, R. M., Travis, R. C., Key, T. J., Hamilton, R. J.,
Fleshner, N. E., Finelli, A., Ingles, S. A., Stern, M. C., Rosenstein, B. S., Kerns, S. L.,
Ostrer, H., Lu, Y. J., Zhang, H. W., Feng, N., Mao, X., Guo, X., Wang, G., Sun, Z.,
Giles, G. G., Southey, M. C., MacInnis, R. J., FitzGerald, L. M., Kibel, A. S., Drake, B.
F., Vega, A., Gómez-Caamaño, A., Szulkin, R., Eklund, M., Kogevinas, M., Llorca, J.,
Castaño-Vinyals, G., Penney, K. L., Stampfer, M., Park, J. Y., Sellers, T. A., Lin, H. Y.,
Stanford, J. L., Cybulski, C., Wokolorczyk, D., Lubinski, J., Ostrander, E. A., Geybels,
M. S., Nordestgaard, B. G., Nielsen, S. F., Weischer, M., Bisbjerg, R., Røder, M. A.,
Iversen, P., Brenner, H., Cuk, K., Holleczek, B., Maier, C., Luedeke, M., Schnoeller, T.,
Kim, J., Logothetis, C. J., John, E. M., Teixeira, M. R., Paulo, P., Cardoso, M.,
Neuhausen, S. L., Steele, L., Ding, Y. C., De Ruyck, K., De Meerleer, G., Ost, P.,
Razack, A., Lim, J., Teo, S. H., Lin, D. W., Newcomb, L. F., Lessel, D., Gamulin, M.,
Kulis, T., Kaneva, R., Usmani, N., Singhal, S., Slavov, C., Mitev, V., Parliament, M.,
Claessens, F., Joniau, S., Van den Broeck, T., Larkin, S., Townsend, P. A., Aukim-
Hastie, C., Gago-Dominguez, M., Castelao, J. E., Martinez, M. E., Roobol, M. J., Jenster,
G., van Schaik, R. H. N., Menegaux, F., Truong, T., Koudou, Y. A., Xu, J., Khaw, K. T.,
Cannon-Albright, L., Pandha, H., Michael, A., Thibodeau, S. N., McDonnell, S. K.,
Schaid, D. J., Lindstrom, S., Turman, C., Ma, J., Hunter, D. J., Riboli, E., Siddiq, A.,
Canzian, F., Kolonel, L. N., Le Marchand, L., Hoover, R. N., Machiela, M. J., Cui, Z.,
Kraft, P., Amos, C. I., Conti, D. V., Easton, D. F., Wiklund, F., Chanock, S. J.,
Henderson, B. E., Kote-Jarai, Z., Haiman, C. A., Eeles, R. A., Study, P., (APCB), A. P.
C. B., Study, I., Investigators, C. P., (BPC3), B. a. P. C. C. C., Consortium, P. P. C. A. G.
t. I. C.-A. A. i. t. G., (CAPS), C. o. t. P. i. S., (PEGASUS), P. C. G.-w. A. S. o. U. S. L.,
& Consortium, G. A. a. M. i. O. G.-O. E. L. I. i. P. C. S. E. (2018). Association analyses
of more than 140,000 men identify 63 new prostate cancer susceptibility loci. Nat Genet,
50(7), 928-936. https://doi.org/10.1038/s41588-018-0142-8
Scott, J. G., & Berger, J. O. (2010). Bayes and empirical-Bayes multiplicity adjustment in the
variable-selection problem. The Annals of Statistics, 38(5), 2587-2619.
Sekula, P., Fabiola Del Greco, M., Pattaro, C., & Köttgen, A. (2016). Mendelian randomization
as an approach to assess causality using observational data. Journal of the American
Society of Nephrology, 27(11), 3253-3265.
Sha, J., Xue, W., Dong, B., Pan, J., Wu, X., Li, D., Liu, D., & Huang, Y. (2017). PRKAR2B
plays an oncogenic role in the castration-resistant prostate cancer. Oncotarget, 8(4),
6114.
Shan, S., Xu, F., Bleyer, M., Becker, S., Melbaum, T., Wemheuer, W., Hirschfeld, M., Wacker,
C., Zhao, S., & Schütz, E. (2020). Association of α/β-Hydrolase D16B with Bovine
Conception Rate and Sperm Plasma Membrane Lipid Composition. International Journal
of Molecular Sciences, 21(2), 627.
Smeets, A., Daemen, A., Bempt, I. V., Gevaert, O., Claes, B., Wildiers, H., Drijkoningen, R.,
Van Hummelen, P., Lambrechts, D., & De Moor, B. (2011). Prediction of lymph node
involvement in breast cancer from primary tumor tissue using gene expression profiling
and miRNAs. Breast cancer research and treatment, 129(3), 767-776.
Smith, G. D., & Ebrahim, S. (2004). Mendelian randomization: prospects, potentials, and
limitations. International journal of epidemiology, 33(1), 30-42.
Staley, J. R., & Burgess, S. (2017). Semiparametric methods for estimation of a nonlinear
exposure-outcome relationship using instrumental variables with application to
Mendelian randomization. Genetic epidemiology, 41(4), 341-352.
Staunton, L., Tonry, C., Lis, R., Espina, V., Liotta, L., Inzitari, R., Bowden, M., Fabre, A.,
O'Leary, J., & Finn, S. P. (2017). Pathology-driven comprehensive proteomic profiling of
the prostate cancer tumor microenvironment. Molecular Cancer Research, 15(3), 281-
293.
Suburu, J., & Chen, Y. Q. (2012). Lipids and prostate cancer. Prostaglandins & other lipid
mediators, 98(1-2), 1-10.
Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P.,
Green, J., Landray, M., Liu, B., Matthews, P., Ong, G., Pell, J., Silman, A., Young, A.,
Sprosen, T., Peakman, T., & Collins, R. (2015). UK biobank: an open access resource for
identifying the causes of a wide range of complex diseases of middle and old age. PLoS
Med, 12(3), e1001779. https://doi.org/10.1371/journal.pmed.1001779
Sutton, A. J., Abrams, K. R., Jones, D. R., Sheldon, T. A., & Song, F. (2000). Methods for meta-
analysis in medical research (Vol. 348). Wiley Chichester.
Taylor, J., & Tibshirani, R. J. (2015). Statistical learning and selective inference. Proceedings of
the National Academy of Sciences, 112(25), 7629-7634.
Thomas, D. C., & Conti, D. V. (2004). Commentary: the concept of ‘Mendelian Randomization’.
International journal of epidemiology, 33(1), 21-25.
Thomas, D. C., Conti, D. V., Baurley, J., Nijhout, F., Reed, M., & Ulrich, C. M. (2009). Use of
pathway information in molecular epidemiology. Human genomics, 4(1), 21.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society: Series B (Methodological), 58(1), 267-288.
Tibshirani, R. J., Taylor, J., Lockhart, R., & Tibshirani, R. (2016). Exact post-selection inference
for sequential regression procedures. Journal of the American Statistical Association,
111(514), 600-620.
Van Hemelrijck, M., Garmo, H., Holmberg, L., Walldius, G., Jungner, I., Hammar, N., &
Lambe, M. (2011). Prostate cancer risk in the Swedish AMORIS study: the interplay
among triglycerides, total cholesterol, and glucose. Cancer, 117(10), 2086-2095.
Verbanck, M., Chen, C. Y., Neale, B., & Do, R. (2018). Detection of widespread horizontal
pleiotropy in causal relationships inferred from Mendelian randomization between
complex traits and diseases. Nature Genetics, 50(5), 693-698.
https://doi.org/10.1038/s41588-018-0099-7
Vujkovic, M., Keaton, J. M., Lynch, J. A., Miller, D. R., Zhou, J., Tcheandjieu, C., Huffman, J.
E., Assimes, T. L., Lorenz, K., & Zhu, X. (2020). Discovery of 318 new risk loci for type
2 diabetes and related vascular outcomes among 1.4 million participants in a multi-
ancestry meta-analysis. Nature genetics, 52(7), 680-691.
Wang, G., Sarkar, A., Carbonetto, P., & Stephens, M. (2020). A simple new approach to variable
selection in regression, with application to genetic fine mapping. Journal of the Royal
Statistical Society: Series B (Statistical Methodology).
https://doi.org/10.1111/rssb.12388
Wang, H., Liu, G., Shen, D., Ye, H., Huang, J., Jiao, L., & Sun, Y. (2015). HOXA1 enhances the
cell proliferation, invasion and metastasis of prostate cancer cells. Oncology reports,
34(3), 1203-1210.
Wang, Y., Yang, L., Chen, T., Liu, X., Guo, Y., Zhu, Q., Tong, X., Yang, W., Xu, Q., & Huang,
D. (2019). A novel lncRNA MCM3AP-AS1 promotes the growth of hepatocellular
carcinoma by targeting miR-194-5p/FOXA1 axis. Molecular cancer, 18(1), 28.
Wayner, E. A., Quek, S. I., Ahmad, R., Ho, M. E., Loprieno, M. A., Zhou, Y., Ellis, W. J., True,
L. D., & Liu, A. Y. (2012). Development of an ELISA to detect the secreted prostate
cancer biomarker AGR2 in voided urine. The Prostate, 72(9), 1023-1034.
Willer, C. J., Schmidt, E. M., Sengupta, S., Peloso, G. M., Gustafsson, S., Kanoni, S., Ganna, A.,
Chen, J., Buchkovich, M. L., & Mora, S. (2013). Discovery and refinement of loci
associated with lipid levels. Nature genetics, 45(11), 1274.
Witte, J. S., Greenland, S., Kim, L.-L., & Arab, L. (2000). Multilevel modeling in epidemiology
with GLIMMIX. Epidemiology, 11(6), 684-688.
Worst, T. S., Meyer, Y., Gottschalt, M., Weis, C.-A., Von Hardenberg, J., Frank, C., Steidler, A.,
Michel, M. S., & Erben, P. (2017). RAB27A, RAB27B and VPS36 are downregulated in
advanced prostate cancer and show functional relevance in prostate cancer cells.
International journal of oncology, 50(3), 920-932.
Wu, L., Wang, J., Cai, Q., Cavazos, T. B., Emami, N. C., Long, J., Shu, X.-O., Lu, Y., Guo, X.,
& Bauer, J. A. (2019a). Identification of novel susceptibility loci and genes for prostate
cancer risk: a transcriptome-wide association study in over 140,000 European
descendants. Cancer research, 79(13), 3192-3204.
Wu, L., Wang, J., Cai, Q., Cavazos, T. B., Emami, N. C., Long, J., Shu, X.-O., Lu, Y., Guo, X.,
& Bauer, J. A. (2019b). Identification of novel susceptibility loci and genes for prostate
cancer risk: A transcriptome-wide association study in over 140,000 European
descendants. Cancer research, canres. 3536.2018.
Xing, Z., Li, S., Liu, Z., Zhang, C., Meng, M., & Bai, Z. (2020). The long non-coding RNA
LINC00473 contributes to cell proliferation via JAK-STAT3 signaling pathway by
regulating miR-195-5p/SEPT2 axis in prostate cancer. Bioscience Reports.
Xue, A., Wu, Y., Zhu, Z., Zhang, F., Kemper, K. E., Zheng, Z., Yengo, L., Lloyd-Jones, L. R.,
Sidorenko, J., McRae, A. F., Visscher, P. M., Zeng, J., Yang, J., & Consortium, e. (2018).
Genome-wide association analyses identify 143 risk variants and putative regulatory
mechanisms for type 2 diabetes. Nat Commun, 9(1), 2941.
https://doi.org/10.1038/s41467-018-04951-w
Yang, J., Ferreira, T., Morris, A. P., Medland, S. E., Madden, P. A., Heath, A. C., Martin, N. G.,
Montgomery, G. W., Weedon, M. N., Loos, R. J., Frayling, T. M., McCarthy, M. I.,
Hirschhorn, J. N., Goddard, M. E., Visscher, P. M., Consortium, G. I. o. A. T. G., &
Consortium, D. G. R. A. M.-a. D. (2012). Conditional and joint multiple-SNP analysis of
GWAS summary statistics identifies additional variants influencing complex traits. Nat
Genet, 44(4), 369-375, S361-363. https://doi.org/10.1038/ng.2213
Yang, X., Amgad, M., Cooper, L. A., Du, Y., Fu, H., & Ivanov, A. A. (2020). High expression of
MKK3 is associated with worse clinical outcomes in African American breast cancer
patients. Journal of translational medicine, 18(1), 1-19.
Yavorska, O. O., & Burgess, S. (2017). MendelianRandomization: an R package for performing
Mendelian randomization analyses using summarized data. International Journal of
Epidemiology, 46(6), 1734-1739. https://doi.org/10.1093/ije/dyx034
Yusuf, S., Hawken, S., Ounpuu, S., Bautista, L., Franzosi, M. G., Commerford, P., Lang, C. C.,
Rumboldt, Z., Onen, C. L., & Lisheng, L. (2005). Obesity and the risk of myocardial
infarction in 27 000 participants from 52 countries: a case-control study. The Lancet,
366(9497), 1640-1649.
Zhang, Z., Luo, K., Zou, Z., Qiu, M., Tian, J., Sieh, L., Shi, H., Zou, Y., Wang, G., & Morrison,
J. (2020). Genetic analyses support the contribution of mRNA N6-methyladenosine (m6A)
modification to human disease heritability. Nature Genetics, 52(9), 939-949.
Zheng, J., Baird, D., Borges, M. C., Bowden, J., Hemani, G., Haycock, P., Evans, D. M., &
Smith, G. D. (2017). Recent Developments in Mendelian Randomization Studies. Curr
Epidemiol Rep, 4(4), 330-345. https://doi.org/10.1007/s40471-017-0128-6
Zhou, X., Carbonetto, P., & Stephens, M. (2013). Polygenic modeling with Bayesian sparse
linear mixed models. PLoS genetics, 9(2), e1003264.
Zhu, Z., Zhang, F., Hu, H., Bakshi, A., Robinson, M. R., Powell, J. E., Montgomery, G. W.,
Goddard, M. E., Wray, N. R., & Visscher, P. M. (2016). Integration of summary data
from GWAS and eQTL studies predicts complex trait gene targets. Nature genetics,
48(5), 481.
Zhu, Z., Zheng, Z., Zhang, F., Wu, Y., Trzaskowski, M., Maier, R., Robinson, M. R., McGrath,
J. J., Visscher, P. M., & Wray, N. R. (2018). Causal associations between risk factors and
common diseases inferred from GWAS summary data. Nature communications, 9(1),
224.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of
the royal statistical society: series B (statistical methodology), 67(2), 301-320.
Zuber, V., Colijn, J. M., Klaver, C., & Burgess, S. (2020). Selecting likely causal risk factors
from high-throughput experiments using multivariable Mendelian randomization. Nature
Communications, 11(1), 1-11.
Abstract
Over the last decade, genome-wide association studies (GWAS) have not only linked thousands of genetic variants across the genome to complex traits but also stimulated many new methodologies that leverage GWAS summary statistics to explore the biological mechanisms underlying these genetic associations. For example, previous research has demonstrated the usefulness of hierarchical modeling for incorporating a flexible array of prior information in genetic association studies. When this prior information consists of effect estimates from association analyses of genetic variants with a modifiable risk factor or with gene expression, the hierarchical model is equivalent to a Mendelian randomization analysis or a transcriptome-wide association study (TWAS), respectively.
In this dissertation, we propose novel methods to incorporate prior information from summary data and to test the causal effects of intermediates on the outcome trait in observational studies. The intermediates can be modifiable risk factors, gene expression data, metabolite data, or other omic data. Additionally, we discuss different ways to compose the prior information from various data types. The composition of the prior information has been shown to be critical for valid inference.
This dissertation is structured as follows. Chapter 1 gives a brief introduction to instrumental variable analysis, Mendelian randomization, and TWAS. Chapter 2 first shows that both Mendelian randomization and TWAS are forms of instrumental variable analysis with the genetic variants as the instruments. We then propose a hierarchical joint analysis of marginal summary data, hJAM, which is designed to incorporate prior information via a hierarchical model and to test the causal effect of the risk factors or genes on the outcome trait. The use of appropriate effect estimates as prior information yields an analysis similar to Mendelian randomization and TWAS.
Chapter 3 describes a natural extension of hJAM that integrates an Egger regression, hJAM Egger, to account for the bias introduced by invalid instruments. Chapter 4 designs a scalable hierarchical approach for joint analysis of marginal data, SHA-JAM, for variable selection in an hJAM setting with high-throughput experiments, such as omic data. Chapter 5 presents the R package, hJAM, that we developed to implement the methods proposed in this dissertation. Finally, Chapter 6 summarizes this dissertation.
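As a rough illustration of how variant-level effect estimates can be combined into a single causal estimate, the sketch below computes a standard fixed-effect inverse-variance weighted (IVW) Mendelian randomization estimate from summary statistics. It is not the hJAM method or the hJAM R package: it assumes a single exposure and uncorrelated variants, whereas the dissertation's approach is described as a joint analysis. The function name `ivw_estimate` and the toy numbers are hypothetical and chosen only for illustration.

```python
from math import sqrt


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate: weighted regression of the variant-outcome
    effects on the variant-exposure effects, forced through the origin.

    Assumes uncorrelated variants and a single exposure (illustrative only).
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")
    # Each variant is weighted by the inverse variance of its outcome effect.
    weights = [1.0 / (se ** 2) for se in se_outcome]
    numerator = sum(w * bx * by
                    for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx * bx for w, bx in zip(weights, beta_exposure))
    if denominator == 0.0:
        raise ValueError("exposure effects carry no information")
    estimate = numerator / denominator
    std_error = sqrt(1.0 / denominator)
    return estimate, std_error


# Toy summary statistics (hypothetical): outcome effects are exactly half
# of the exposure effects, so the estimated causal effect is 0.5.
toy_exposure = [0.1, 0.2, 0.3]
toy_outcome = [0.05, 0.10, 0.15]
toy_se = [0.01, 0.01, 0.01]
causal_effect, causal_se = ivw_estimate(toy_exposure, toy_outcome, toy_se)
print(causal_effect, causal_se)
```

With correlated variants or several intermediates, this simple weighted average is no longer appropriate, which is the setting the joint, hierarchical formulation in the abstract is meant to address.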
Asset Metadata
Creator
Jiang, Lai (author)
Core Title
Hierarchical approaches for joint analysis of marginal summary statistics
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Publication Date
02/26/2021
Defense Date
12/16/2020
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
causal inference,genome-wide association study (GWAS),hierarchical modeling,joint analysis of marginal summary statistics (JAM),Mendelian randomization,OAI-PMH Harvest,omics data,transcriptome-wide association study (TWAS)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Conti, David V. (committee chair), Chaisson, Mark (committee member), Gauderman, William (committee member), Lewinger, Juan Pablo (committee member), Mancuso, Nicholas (committee member)
Creator Email
jian848@usc.edu,lai.lyla.jiang@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-423072
Unique identifier
UC11668253
Identifier
etd-JiangLai-9294.pdf (filename),usctheses-c89-423072 (legacy record id)
Legacy Identifier
etd-JiangLai-9294.pdf
Dmrecord
423072
Document Type
Dissertation
Rights
Jiang, Lai
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA