ENHANCING MODEL PERFORMANCE OF REGULARIZATION METHODS BY
INCORPORATING PRIOR INFORMATION
by
JINGXUAN HE
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOSTATISTICS)
August 2023
Copyright 2023 JINGXUAN HE
Dedication

This thesis is dedicated to my beloved family: my grandparents, Ansheng Wang and Manli You; my parents, Hongwei He and Tong Wang; and my uncle, Hao Wang. Thank you for all the love and support along the way.
Acknowledgements

I would like to express my deepest gratitude and appreciation to all the individuals who have contributed to the completion of this Ph.D. thesis. Their guidance, support, and encouragement have been invaluable throughout this journey.

First and foremost, I am immensely grateful to my supervisor, Dr. David V. Conti, for his unwavering guidance, expertise, and continuous support. His profound knowledge, critical insights, and unwavering commitment to excellence have shaped this research work significantly. David is also considerate of students’ personal lives. I am indebted to him for the patience, encouragement, and valuable feedback that have enriched my academic and personal growth. I will always admire David as my role model for his energetic attitude, intelligence, and diligence.

I extend my heartfelt appreciation to the members of my thesis committee, Dr. Juan Pablo Lewinger and Dr. Jose-Luis Ambite, for their valuable time, insightful comments, and constructive suggestions. I am also grateful for the time and advice of my qualifying committee members, Dr. Vaia Lida Chatzi and Dr. William Gauderman. Their expertise and engagement in reviewing and evaluating this thesis have significantly enhanced its quality.

I would like to express my gratitude to the faculty and staff of USC, who provided a nurturing academic environment and resources essential for the successful completion of this research. I appreciate the collaborations with Dr. Vaia Lida Chatzi’s team, and I express my appreciation to my collaborator, Dr. Erika Garcia, for her efforts in contributing to the content of Chapter 2. Their dedication to education and their commitment to fostering intellectual curiosity have been instrumental in shaping my research and academic journey.

My sincere thanks go to my colleagues and friends who have provided support and companionship throughout this doctoral journey. Their camaraderie, intellectual discussions, and encouragement have been invaluable in overcoming the challenges and celebrating the achievements along the way.

I would like to express my gratitude to my parents, grandparents, and other family members for their unwavering love, support, and belief in my abilities. Their constant encouragement, sacrifices, and understanding have been the foundation of my academic pursuits.

Four years have passed in the blink of an eye. When I reflect on this journey, I am surprised by the many moments of both tears and laughter. Moreover, the pandemic has been woven into my life as a Ph.D. student, profoundly reshaping my thoughts on both personal and career pursuits. Standing at the final stage of this Ph.D. adventure, I am humbled and grateful to all those who have contributed to this thesis. This accomplishment would not have been possible without your support, guidance, and belief in me. Thank you for being a part of this journey and for helping me reach this academic milestone.
Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 An overall applied problem – analysis of highly correlated variables
    1.1.1 High-dimensional individual-level data
    1.1.2 Polygenic risk scores with genetic summary statistics
  1.2 Mixture approaches
    1.2.1 Weighted Quantile Sum Regression (WQS)
    1.2.2 Bayesian Kernel Machine Regression (BKMR)
    1.2.3 g-computation and quantile g-computation
  1.3 Regression models with priors
    1.3.1 Generalized linear models
    1.3.2 Bayesian hierarchical modeling
    1.3.3 g-prior
    1.3.4 Regularized regressions
    1.3.5 xtune: External information tuned regularized regression
  1.4 Genetic summary statistics
    1.4.1 Constructing the linear model with summary statistics
    1.4.2 Genetic interpretation
  1.5 Overview

Chapter 2: Bayesian hierarchical regression modeling with g-prior and g-computation
  2.1 Introduction
  2.2 Methods
    2.2.1 Model specification
    2.2.2 Parameter estimation using Markov chain Monte Carlo (MCMC)
    2.2.3 g-computation
  2.3 Results
    2.3.1 Simulation studies
    2.3.2 Real-world data example
  2.4 Discussion

Chapter 3: Multiclass regularized regression integrating prior information
  3.1 Introduction
  3.2 Methods
    3.2.1 Model specification
    3.2.2 Bayesian interpretation of the xtune multinomial Elastic Net model
    3.2.3 Empirical Bayes approach for hyperparameter estimation
    3.2.4 Optimizing the approximated marginal likelihood function using the $l_2$ iterative re-weighted method
  3.3 Results
    3.3.1 Simulation studies
      3.3.1.1 Simulation settings
      3.3.1.2 Simulation results
    3.3.2 Real data examples
      3.3.2.1 Breast cancer data example
      3.3.2.2 Prostate cancer data example
      3.3.2.3 Environmental exposures example
  3.4 Discussion

Chapter 4: Data-informed tuning of regularized regression with summary statistics from multi-population studies
  4.1 Introduction
  4.2 Methods
    4.2.1 Model specification and Bayesian interpretation
    4.2.2 Obtaining the approximated marginal likelihood using an empirical Bayes approach
    4.2.3 Optimizing the approximated marginal likelihood
  4.3 Data example
  4.4 Discussion

Chapter 5: Software implementation
  5.1 BHRM
    5.1.1 Overview
    5.1.2 R function
    5.1.3 Examples
  5.2 mxtune
    5.2.1 Overview
    5.2.2 R function
    5.2.3 Examples
  5.3 xtune_mJAM
    5.3.1 Overview
    5.3.2 R function
    5.3.3 Examples

Chapter 6: Conclusions
  6.1 Summary of our contributions
  6.2 Future directions

Bibliography
Appendices
  A Chapter 2
  B Chapter 3
  C Chapter 4
List of Tables

1.1 Example of main dataset (upper) and prior dataset (lower)
2.1 Simulation scenarios for exposures in low dimension, which were repeatedly used in settings reflecting high, medium, and no correlation among exposures (ρ = 0.9, 0.5, and 0, respectively), referred to as scenarios 1a–6a, 1b–6b, and 1c–6c
2.2 Simulation results for scenarios with highly correlated exposures (ρ = 0.9) implemented using univariate, WQS, quantile g-computation (LRMLE), and BHRM methods. Coefficient estimate (Monte Carlo standard error) presented
2.3 Simulation results for scenarios with high-dimensional data and highly correlated exposures (ρ = 0.9) implemented using univariate, WQS, quantile g-computation (LRMLE), and BHRM methods. Coefficient estimate (Monte Carlo standard error) presented
2.4 BMI z-score and PFAS chemical effect estimation profiles from WQS, quantile g-computation (LRMLE), and BHRM using HELIX challenge data
3.1 Model prediction accuracy (AUC), sparsity, and computational time (in minutes) comparing the standard EN, mxtune, and mxtune without using the prior information
3.2 Selection stability across 100 replicates comparing the standard EN, mxtune, and mxtune without using the prior information
5.1 Primary data X and Y
5.2 Prior information Z
A.1 Simulation scenarios for medium-correlated (ρ = 0.5) and uncorrelated (ρ = 0) exposures
A.2 Simulation results for medium-correlated (ρ = 0.5) exposures. Coefficient estimates and MCSE are compared among univariate, WQS, quantile g-computation, and BHRM methods
A.3 Simulation results for uncorrelated (ρ = 0) exposures. Coefficient estimates and MCSE are compared among univariate, WQS, quantile g-computation, and BHRM methods
B.4 Coefficient estimation bias of true predictors comparing the standard EN with mxtune and mxtune without using prior information in different scenarios
List of Figures

1.1 Human tumor DNA expression microarray data with 6830 genes (rows) and 64 samples (columns); only 100 random rows are shown. Gene expression levels in the heatmap range from bright green (negative, under-expressed) to bright red (positive, over-expressed). This figure is based on Figure 1.3 of [43]
1.2 Spearman correlation heatmap of POPs (p = 18) across participants in NHANES 2001–2002. This figure is based on Figure 1 of [39]
1.3 Lasso (left) and ridge regression (right) estimation plot (p = 2). The blue areas are the constraint regions for $|\beta_1| + |\beta_2| \le t$ and $\beta_1^2 + \beta_2^2 \le t$, respectively. The red ellipses show the contours of the square loss function. This figure is based on Figure 3.11 of [43]
2.1 Correlation pattern among the PFAS exposures
3.1 Prediction accuracy comparing the standard EN with mxtune and mxtune without using prior information in five scenarios: (a) varying p with ρ = 0.5, θ = 0.9, κ = 100%, K = 3; (b) varying ρ with p = 800, θ = 0.9, κ = 100%, K = 3; (c) varying θ with p = 800, ρ = 0.5, κ = 100%, K = 3; (d) varying κ with p = 800, ρ = 0.5, θ = 0.9, K = 3; (e)–(f) K = 2 and K = 6 with p = 800, ρ = 0.5, θ = 0.9, κ = 100%
3.2 Probability of selecting wrong predictors comparing the standard EN with mxtune and mxtune without using prior information in five scenarios: (a) varying p with ρ = 0.5, θ = 0.9, κ = 100%, K = 3; (b) varying ρ with p = 800, θ = 0.9, κ = 100%, K = 3; (c) varying θ with p = 800, ρ = 0.5, κ = 100%, K = 3 (the scales of the misclassification rate vary across the three conditions); (d) varying κ with p = 800, ρ = 0.5, θ = 0.9, K = 3; (e)–(f) K = 2 and K = 6 with p = 800, ρ = 0.5, θ = 0.9, κ = 100%
3.3 Model sparsity comparing the standard EN with mxtune and mxtune without using prior information in five scenarios: (a) varying p with ρ = 0.5, θ = 0.9, κ = 100%, K = 3; (b) varying ρ with p = 800, θ = 0.9, κ = 100%, K = 3; (c) varying θ with p = 800, ρ = 0.5, κ = 100%, K = 3; (d) varying κ with p = 800, ρ = 0.5, θ = 0.9, K = 3; (e)–(f) K = 2 and K = 6 with p = 800, ρ = 0.5, θ = 0.9, κ = 100%
3.4 Performance of predictive models for prostate cancer patients’ status: (a) AUC of standard EN, fwelnet, mxtune, and non-informative mxtune; (b) number of predictors selected by standard EN, fwelnet, mxtune, and non-informative mxtune
3.5 Performance of predictive models for categorical liver injury: (a) AUC of mxtune, non-informative mxtune, and standard EN; (b) number of predictors selected by mxtune, non-informative mxtune, and standard EN
4.1 Model performance with a single causal SNP: (a) coefficient of predicted PRS vs. true PRS comparing xtune-mJAM and non-informative xtune-mJAM using lasso, ridge, and EN penalties with mJAM-Forward; (b) mean squared error (MSE) of the PRS model comparing xtune-mJAM and non-informative xtune-mJAM using lasso, ridge, and EN penalties with mJAM-Forward
6.1 Summary of thesis contributions in the field of penalty parameter selection
B.2 Model comparisons among the standard EN, mxtune, and non-informative mxtune when κ = 50% with p = 800, ρ = 0.5, θ = 0.9, and K = 3 in terms of (a) prediction accuracy, (b) probability of identifying false-positive predictors, and (c) model sparsity
B.3 mxtune computational time (seconds) with different dimensions of N, p, K, and q
C.4 Model performance with a single causal SNP: (a) coefficient of predicted PRS vs. true PRS comparing xtune-mJAM and non-informative xtune-mJAM using lasso, ridge, and EN penalties with mJAM-Forward; (b) mean squared error (MSE) of the PRS model comparing xtune-mJAM and non-informative xtune-mJAM using lasso, ridge, and EN penalties with mJAM-Forward
Abstract

Researchers frequently encounter statistical challenges when analyzing complex datasets, such as high-dimensional data or mixture data from environmental epidemiology studies. To mitigate the risk of overfitting or the correlation issues caused by such complex data, shrinkage is commonly employed in model fitting. Traditionally, a single, non-data-informed penalty parameter is applied uniformly to all regression coefficients to regulate the amount of regularization in the model. However, there are instances where external information, such as gene functional annotations, can provide valuable insight into the correlated data. Consequently, incorporating prior, data-informed penalties has the potential to leverage the shrinkage effect and improve model performance. This thesis explores novel methods for both non-data-informed and data-informed tuning techniques, along with their applications across various types of data.

The first project establishes a flexible framework for a Bayesian Hierarchical Regression Model that effectively handles highly correlated exposures using the g-prior and g-computation. This framework allows for the estimation of: 1) independent exposure effects; 2) interactions between exposures; and 3) combined effects for mixture exposures.

The second project develops a data-informed penalized regression approach for predicting multicategorical phenotypes in high-dimensional data contexts. Specifically, the regression coefficients are regularized by feature-specific Elastic Net penalty parameters, which are modeled as a log-linear function of prior covariates. Instead of relying on cross-validation, the penalty parameters are estimated using an empirical Bayes method.

In the third project, we extend the framework from the second project, which primarily focused on individual-level data modeling, to the analysis of summary statistics from Genome-Wide Association Studies (GWAS) in multi-population settings.

Lastly, we provide an overview of the software implementation of all these methods.
Chapter 1
Introduction

1.1 An overall applied problem – analysis of highly correlated variables

1.1.1 High-dimensional individual-level data
The accelerated advancement of data-collection techniques and devices has made dramatically larger datasets available, and with more complicated datasets, correlated variables are much more likely to occur. Traditionally, regression models are used to assess and quantify the association between an outcome and a set of variables. However, the assumption of independent variables is violated with highly correlated predictors, which may lead to improper model inference and misleading interpretation of results. For instance, information on thousands of genes can easily be collected with the development of DNA microarray experiments. In the example from Figure 1.1 [43], the number of predictors (6830 genes) is far larger than the number of observations (64 samples), which constitutes a typical high-dimensional data structure. In high-dimensional settings, we need statistical tools that can address the potential problems of correlated or collinear variables and find the best subset of features from the mass of candidate predictors to predict the outcome accurately.
Figure 1.1: Human tumor DNA expression microarray data with 6830 genes (rows) and 64 samples (columns); only 100 random rows are shown. Gene expression levels in the heatmap range from bright green (negative, under-expressed) to bright red (positive, over-expressed). This figure is based on Figure 1.3 of [43].
Another example of correlated variables often occurs in mixtures analysis. Mixtures analysis is an active research topic in environmental epidemiology that focuses on studying the effects of chemical compounds on individual and population health [39]. Traditionally, epidemiologists have primarily concentrated on understanding the impact of a single environmental chemical on humans. However, it is essential to recognize that environmental exposures are not experienced in isolation; they are often correlated due to common sources or exposure pathways. Consequently, there has been increasing interest in studying the effects of compound mixtures [39]. In the example from [39], there are 18 exposures of persistent organic pollutants (POPs), most of which are correlated with one another. In this case, we need appropriate statistical methods to address challenges such as multicollinearity and mixture effects in the analysis.

Figure 1.2: Spearman correlation heatmap of POPs (p = 18) across participants in NHANES 2001–2002. This figure is based on Figure 1 of [39].
In the past decades, numerous studies have focused on how to better comprehend such complicated datasets. There are two common types of approaches for analyzing this kind of data: unsupervised methods and supervised methods [39]. Unsupervised methods, including clustering and principal component analysis (PCA), discover patterns in the data without considering the outcome. In contrast, supervised methods investigate associations between variables and the outcome. In the remainder of this chapter, we provide brief introductions to several regression models that are typical examples of supervised methods capable of analyzing correlated variables. These regression models are designed to handle the complexities arising from correlated exposures and can aid in understanding the relationships between these variables and their impact on the outcome of interest.
1.1.2 Polygenic risk scores with genetic summary statistics
In addition to complex individual-level data, an increasing number of studies focus on genetic summary statistics. In the past decade, genome-wide association studies (GWAS) have been instrumental in identifying thousands of genetic variants associated with a wide range of traits and diseases [95]. These variants are typically single nucleotide polymorphisms (SNPs) spread across the human genome [95, 113]. GWAS involve analyzing the genetic data of thousands to millions of individuals, comparing genotype frequencies between individuals with and without the phenotype of interest. The statistical associations between genetic variants and phenotypic traits are then summarized as genetic summary statistics (e.g., effect sizes, standard errors, and p-values for each genetic variant). Thus, genetic summary statistics play a crucial role in understanding the genetic basis of complex traits and diseases.

Polygenic risk scores (PRS) have emerged as a promising tool in the field of genetics and precision medicine [18, 81]. By summarizing an individual’s genetic risk based on multiple genetic variants, PRS offer insights into the likelihood of developing complex traits or diseases. One key advancement in this field is the utilization of genetic summary statistics, which has further enhanced the accuracy and applicability of PRS.

The integration of genetic summary statistics into PRS development has revolutionized the accuracy and predictive power of these scores. Previously, constructing PRS required individual-level genotype data, limiting their widespread application [18]. However, by leveraging summary statistics derived from large-scale GWAS, researchers can now estimate polygenic risk without the need for individual-level data. This breakthrough allows for the efficient assessment of disease risk across diverse populations, facilitating the translation of PRS into clinical practice [18, 81].
However, statistical challenges also arise when analyzing PRS using summary statistics. First, population stratification can introduce bias when analyzing diverse populations. Several studies have addressed this challenge by employing methods such as principal component analysis (PCA) to account for population structure. For instance, Gazal et al. [33] developed a method called “LDpred-funct” that incorporates functional annotations and adjusts for population stratification, improving the predictive performance of PRS [33]. Furthermore, accurate estimation of effect sizes is vital for calculating the weighted contribution of genetic variants in PRS. Several statistical approaches have been developed to address potential bias in effect-size estimation, such as the use of shrinkage estimators. For instance, the PRS-CS method, proposed by Ge et al. [34], incorporates genetic summary statistics and applies a shrinkage estimator to improve effect-size estimation, resulting in improved prediction accuracy [34]. Newcombe et al. [79] proposed a scalable Bayesian framework for the joint analysis of marginal summary statistics (JAM) to address correlation issues in the reference dataset [79]. Later in this chapter, we introduce in more detail how to construct linear models with summary statistics and how to interpret them genetically.
1.2 Mixture approaches

As discussed earlier in this chapter, some complicated data, such as mixture data, have inherent correlations among exposures. Over the past decades, numerous studies have focused on developing novel methods, including both supervised and unsupervised models, to address the statistical challenges in analyzing mixture data [39]. In this section, we introduce several common tools that target the analysis of mixture data.
1.2.1 Weighted Quantile Sum Regression (WQS)

WQS was proposed [16] to effectively handle the complex correlation pattern present in mixture data and to measure the combined impact of a mixture on the outcome. Suppose we have $c$ mixture components, and the values of each component are scored as quantile categories denoted by $q_i$. For example, $q_i = 0, 1, 2$, or $3$ may represent values in the 1st, 2nd, 3rd, or 4th quartile, respectively. The fundamental WQS model can be expressed as follows:

\[ g(\mu) = \beta_0 + \beta_1\,\mathrm{WQS} + z'\phi, \qquad \mathrm{WQS} = \sum_{i=1}^{c} w_i q_i, \tag{1.1} \]

where $\beta_0$ is the intercept, $\beta_1$ is the regression coefficient for the weighted quantile sum, $w_i$ is the weight for the $i$th mixture component subject to the constraint $\sum_{i=1}^{c} w_i = 1$ with $0 \le w_i \le 1$, $z$ is a vector of covariates, and $\phi$ is the vector of regression coefficients for those covariates.

WQS estimates the variable weights using a nonlinear numerical optimization algorithm on a bootstrapped training set, and then tests the weighted index in a test set [16]. The final index is calculated from the average of the weights across the bootstrap samples. WQS also constrains the direction of the mixture components’ effects to enhance the interpretability of the weighted index [16].

To interpret the results, we first test the significance of $\beta_1$ in the validation set. Subsequently, a pre-specified a priori cut-point can be used to determine the importance of each weight in the index.
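As a concrete illustration, the model in (1.1) can be fit with the gWQS R package. The sketch below is illustrative rather than definitive: the data frame dat, the outcome y, and the exposure names in exposures are hypothetical, and the argument and accessor names should be checked against the package documentation.

    library(gWQS)

    # Hypothetical data frame `dat` with outcome y and 18 exposure columns
    exposures <- paste0("pop", 1:18)

    # Fit the WQS model of (1.1): exposures are scored into quartiles (q = 4),
    # weights are estimated on bootstrapped training samples (b = 100), and the
    # index is tested on a 60% validation split; b1_pos = TRUE imposes the
    # directional (positive) constraint discussed above.
    fit <- gwqs(y ~ wqs, mix_name = exposures, data = dat,
                q = 4, b = 100, validation = 0.6,
                b1_pos = TRUE, family = "gaussian", seed = 1)

    summary(fit)       # significance test for the index coefficient beta_1
    fit$final_weights  # estimated component weights w_i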
1.2.2 Bayesian Kernel Machine Regression (BKMR)

BKMR is another statistical technique used to study the complexity of mixtures. It is a semi-parametric method capable of modeling the combined effects of mixtures while allowing for nonlinear effects and interactions among components [39]. With BKMR, researchers can assess (1) whether an exposure is associated with the outcome; (2) the exposure–response relationships between exposures and the outcome; and (3) whether there are interaction effects among mixture components [39].

Suppose there are $n$ subjects. The basic framework of BKMR can be specified as:

\[ g(\mu_i) = h(z_{i1}, \ldots, z_{iM}) + \beta x_i, \quad i = 1, \ldots, n, \tag{1.2} \]

where $h(\cdot)$ is an exposure–response function of the predictors $z_{i1}, \ldots, z_{iM}$, and $x_i$ is a vector of covariates assumed to have a linear relationship with the outcome [12]. In this setting, $h(\cdot)$ is a flexible function that allows nonlinear and/or interaction effects among the components. Using the dual form, $h(\cdot)$ can be represented as:

\[ h(z) = \sum_{i=1}^{n} K(z_i, z)\,\alpha_i, \tag{1.3} \]

for some set of coefficients $\{\alpha_i\}_{i=1}^{n}$, where $K(\cdot,\cdot)$ is a kernel function. There are a variety of kernel choices, and one of the most popular is the Gaussian kernel.

In this framework, two variable-selection algorithms are commonly employed: component-wise variable selection and hierarchical variable selection. Component-wise variable selection bears similarity to the Bayesian variable-selection methods used in multiple regression problems [12, 36]. However, in the presence of correlated exposures, the component-wise method faces limitations, as it cannot distinguish between correlated components, treating them as exchangeable [12]. Hierarchical variable selection overcomes this challenge by incorporating the structural information of the mixture into the model. Mixture components are partitioned into groups with specific probabilities, enhancing within-group correlation while reducing across-group correlation. Each component is assigned a prior probability of belonging to a particular group, making it difficult for highly correlated components to be simultaneously included in the model. A Markov chain Monte Carlo (MCMC) algorithm is used to update this process and obtain parameter estimates [12, 36].
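For reference, the model in (1.2) can be fit with the bkmr R package. This is a minimal sketch under assumed inputs: y (outcome), Z (an n × M exposure matrix), and X (a covariate matrix) are hypothetical, and the groups argument invokes the hierarchical variable selection described above.

    library(bkmr)

    set.seed(1)
    # Hypothetical inputs: y (outcome), Z (n x M exposures), X (covariates)
    fit <- kmbayes(y = y, Z = Z, X = X,
                   iter = 10000,              # MCMC iterations
                   varsel = TRUE,             # turn on variable selection
                   groups = c(1, 1, 2, 2, 3)) # groups for hierarchical selection

    ExtractPIPs(fit)  # posterior inclusion probabilities for the components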
1.2.3 g-computation and quantile g-computation

In the previous sections, we explored various regression methods used to analyze complex data with correlated variables, such as high-dimensional data or mixture data. However, many of these methods rely on strict underlying assumptions, such as normally distributed residuals, which might not always be fully met in real-world scenarios. Deviations from these assumptions can lead to biased estimation and reduced inferential power. To address this issue, Robins introduced the g-computation method in 1986 [92]. This method belongs to the g-method family, which also encompasses other powerful techniques such as the g-formula (g-computation), marginal structural models, and structural nested models [77]. Specifically, g-computation can estimate contrasts of outcomes (e.g., risk differences or ratios) under more relaxed assumptions than standard regression methods [77]. There are three important assumptions for the g-methods: (1) counterfactual consistency assumes that a person’s hypothetical outcome under a given condition is the same as the outcome that materializes under that condition; (2) exchangeability assumes that the potential outcome is independent of the exposures; and (3) positivity assumes that individuals have a positive probability of experiencing every level of exposure [77, 122]. Under these assumptions, g-computation can estimate the causal effects of exposures with more reliable inference and improved validity of causal conclusions in various research settings.

Quantile g-computation (qgcomp) is an algorithm that applies g-computation to estimate the effects of individual components and of the mixture as a whole [3]. Suppose there are $d$ exposures. The linear model of qgcomp can be expressed as:

\[ Y_i = \beta_0 + \sum_{j=1}^{d} \beta_j X^q_{ji} + \epsilon_i, \quad i = 1, \ldots, n, \qquad \psi = \sum_{j=1}^{d} \beta_j, \tag{1.4} \]

where $X^q_j$ is the quantized version of the original variable $X_j$. The sum of the effects, denoted $\psi$, can be interpreted as the expected change in the outcome resulting from a one-quantile increase in all exposures simultaneously. Compared with WQS, quantile g-computation can estimate both the summed mixture effect and the individual exposure effects without relying on the directional-homogeneity assumption [3]. Moreover, quantile g-computation allows for non-additive effects, such as interactions among exposures, as well as nonlinear exposure–mixture effects [3]. The quantile g-computation method is implemented in the qgcomp R package [52], illustrated briefly below; we give more details on incorporating g-computation into stochastic Bayesian variable selection in Chapter 2.
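As a brief illustration, the linear model in (1.4) can be fit with qgcomp; the data frame dat, the outcome y, and the exposures x1–x3 below are hypothetical.

    library(qgcomp)

    # Hypothetical data frame `dat` with outcome y and exposures x1, x2, x3
    fit <- qgcomp.noboot(y ~ x1 + x2 + x3,
                         expnms = c("x1", "x2", "x3"),
                         data = dat, q = 4, family = gaussian())

    fit  # prints the component weights and the overall mixture effect psi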
1.3 Regression models with priors

Regression models play a central role in statistical analysis, enabling researchers to explore the relationships between dependent and independent variables through concise mathematical forms [96]. Regression models serve two main functions in an analysis: 1) estimating the effects of exposures on the outcome and describing the relationship between them; and 2) identifying risk factors and establishing a predictive model for the outcome of interest. Within the family of regression models, numerous methods exist. In the subsequent sections, we elaborate on the specific methods that are relevant to this dissertation.

1.3.1 Generalized linear models

In statistics, a generalized linear model (GLM) serves as a versatile extension of ordinary linear regression. In this context, a “linear model” typically denotes the conventional linear regression applied to continuous outcomes. GLMs extend linear regression to encompass a broader class of models that cater to outcomes with distributions in the exponential family [66], which comprises the normal, binomial, multinomial, Poisson, and gamma distributions, among others. A GLM has three components:
• Random component – specifies the probability distribution for the outcome variable $Y$.

• Systematic component – specifies the explanatory variables $(X_1, X_2, \ldots, X_p)$ and their linear combination in the model. The variables may be continuous or categorical, and they can also be transformed, e.g., to logarithmic or polynomial form.

• Link function $\eta$ or $g(\mu)$ – specifies the link between the random component and the systematic component, describing the relationship between the expected outcome mean and the linear combination of explanatory variables.
For example, conventional linear regression takes the form:

\[ \mu_i = \beta_0 + \sum_{j=1}^{p} \beta_j X_j. \tag{1.5} \]

The random component $Y$ follows a normal distribution with mean $\mu$ and constant variance $\sigma^2$. The systematic component is the linear combination of variables $\beta_0 + \sum_{j=1}^{p} \beta_j X_j$. The link function employed here is the identity link, $\eta = g(E(Y)) = E(Y)$.

Another typical example of a GLM is logistic regression. It can be formulated as:

\[ \operatorname{logit}(\mu_i) = \log\frac{\pi_i}{1 - \pi_i} = \beta_0 + \sum_{j=1}^{p} \beta_j X_j. \tag{1.6} \]

Here, the random component $Y$ follows a binomial distribution with success probability $\pi$. The systematic component is still the linear combination of variables $\beta_0 + \sum_{j=1}^{p} \beta_j X_j$. The link function is the logit, or log-odds, link: $\eta = g(E(Y)) = \log\frac{E(Y)}{1 - E(Y)}$.

A GLM assumes the outcomes $Y_1, Y_2, \ldots, Y_n$ are independent and identically distributed. It does not assume a linear relationship between the outcome variable and the explanatory variables; instead, it assumes linearity between the explanatory variables and the expected outcome transformed through the link function. Furthermore, GLMs use maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the model coefficients. The development of GLMs greatly expanded the capacity of regression analysis and the flexibility of modeling different types of outcome data.
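For instance, the logistic GLM in (1.6) can be fit in base R with glm(); the data frame dat and the predictors x1 and x2 below are hypothetical.

    # Binomial random component with a logit link, as in (1.6)
    fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = dat)

    summary(fit)    # coefficients estimated by maximum likelihood (MLE)
    exp(coef(fit))  # odds ratios per one-unit change in each predictor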
1.3.2 Bayesian hierarchical modeling

Bayesian hierarchical modeling is a statistical approach that estimates the parameters of a posterior distribution specified in hierarchical form using Bayesian methods [4]. It is useful when observational units exist at different levels, and it has proven robust because the posterior distribution is relatively insensitive to the choice of flexible hierarchical priors [2].

There is a general framework for Bayesian hierarchical models. Suppose we have an outcome observation $y_i$, let $\theta_i$ be the parameter that describes $y_i$’s data-generating process, and let $\phi$ be a hyperparameter governing the distribution of $\theta_i$ [2]. Under this scenario, a Bayesian hierarchical model has the following three stages:

• Stage 1: $y_i \mid \theta_i, \phi \sim P(y_i \mid \theta_i, \phi)$

• Stage 2: $\theta_i \mid \phi \sim P(\theta_i \mid \phi)$

• Stage 3: $\phi \sim P(\phi)$

The likelihood $P(y_i \mid \theta_i, \phi)$ in Stage 1 has the prior distribution $P(\theta_i, \phi)$, which can be factorized using conditional probability:

\[ P(\theta_i, \phi) = P(\theta_i \mid \phi)\,P(\phi). \tag{1.7} \]

Here, the hyperparameter $\phi$ has the hyperprior distribution $P(\phi)$. Thus, by Bayes’ theorem, we can derive the posterior distribution:

\[ P(\theta_i, \phi \mid y_i) \propto P(y_i \mid \theta_i, \phi)\,P(\theta_i \mid \phi)\,P(\phi). \tag{1.8} \]

Bayesian hierarchical modeling gives a direct probabilistic link between theory and observations without computational constraints. Moreover, the model can quantify the uncertainty in the parameters. It allows researchers to investigate more complicated problems within a flexible framework.
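To make the three-stage structure concrete, the following base-R sketch fits a toy normal–normal hierarchical model by Gibbs sampling. All variances are assumed known for simplicity, the conjugate full conditionals follow from Stages 1–3 above, and the specific values are illustrative only.

    set.seed(1)
    # Toy data: one observation y_i per unit i
    n <- 30
    y <- rnorm(n, mean = 2, sd = 1)
    sigma2   <- 1    # Stage 1 variance (assumed known)
    tau2     <- 0.5  # Stage 2 variance (assumed known)
    phi_var0 <- 100  # Stage 3 hyperprior: phi ~ N(0, 100)

    iters <- 5000
    phi <- 0
    phi_draws <- numeric(iters)

    for (s in 1:iters) {
      # Full conditional of theta_i given y_i and phi (normal by conjugacy)
      v_theta <- 1 / (1 / sigma2 + 1 / tau2)
      theta <- rnorm(n, mean = v_theta * (y / sigma2 + phi / tau2),
                     sd = sqrt(v_theta))
      # Full conditional of phi given theta (normal by conjugacy)
      v_phi <- 1 / (n / tau2 + 1 / phi_var0)
      phi <- rnorm(1, mean = v_phi * sum(theta) / tau2, sd = sqrt(v_phi))
      phi_draws[s] <- phi
    }

    mean(phi_draws[-(1:1000)])  # posterior mean of phi after burn-in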
1.3.3 g-prior

Another tool that can aid in the variable-selection process is Zellner’s g-prior [125], which plays an important role in Bayesian analysis and Bayesian variable selection [61]. To accommodate the g-prior in a Bayesian framework, consider a multiple regression model as an example:

\[ y_i = x_i^T \beta + \epsilon_i, \quad i = 1, \ldots, n, \tag{1.9} \]

where $\epsilon_i$ is a random error. The prior distribution of $\beta$ under the g-prior is:

\[ \beta \mid \psi \sim N\!\left(\mu_\beta,\; g\,\psi^{-1} (X^T X)^{-1}\right), \tag{1.10} \]

where $\mu_\beta$ is the prior-mean hyperparameter, $g$ is a positive scalar parameter, and $\psi^{-1}$ is the variance of the random error. Based on these specifications, the posterior distribution of $\beta$ can be derived:

\[ \beta \mid \psi, x, y \sim N\!\left(q\,\hat\beta + (1 - q)\,\mu_\beta,\; \frac{q}{\psi}\,(X^T X)^{-1}\right), \tag{1.11} \]

where $q = g/(1+g)$ and $\hat\beta$ is the MLE of $\beta$. We can see that, as $g$ goes to infinity, the posterior mean of $\beta$ converges to $\hat\beta$. There are many choices of $g$, including the unit information prior [49], the risk inflation criterion [27], and empirical Bayes methods. In Chapter 2, we delve into a more comprehensive explanation of how to incorporate Zellner’s g-prior into the stochastic Bayesian variable-selection process, and we explore the considerations and methods involved in selecting an appropriate value of $g$.
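The shrinkage in (1.11) is easy to compute directly; the base-R sketch below uses simulated data and, as an illustrative choice only, sets g = n (a unit-information-style value).

    set.seed(1)
    n <- 100; p <- 3
    X <- matrix(rnorm(n * p), n, p)
    y <- X %*% c(1, -0.5, 0) + rnorm(n)

    g <- n                # illustrative unit-information-style choice of g
    mu_beta <- rep(0, p)  # prior mean of beta
    beta_hat <- solve(crossprod(X), crossprod(X, y))  # MLE of beta

    # Posterior mean under the g-prior (equation 1.11):
    # shrink the MLE toward the prior mean by the factor q = g / (1 + g)
    q <- g / (1 + g)
    beta_post <- q * beta_hat + (1 - q) * mu_beta
    beta_post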
1.3.4 Regularized regressions

Regularized regression is a class of regression models that place additional constraints on the magnitude of the coefficients to address potential problems of correlated variables or multicollinearity. Regularized regression has therefore gained popularity over the past decades, especially in high-dimensional data analysis. In regularized regression, coefficients are estimated by maximizing the log-likelihood function subject to a penalty term that shrinks the coefficient estimates. Regularized models take the general form:

\[ \min_{f} \sum_{i=1}^{n} V(f(x_i), y_i) + \lambda R(f). \tag{1.12} \]

Here $V$ represents an underlying loss function that describes the cost of predicting $f(x)$ when the outcome is $y$. There are a variety of loss functions, including the square loss, hinge loss, perceptron loss, and logistic loss, among others; the choice of loss function depends on the distribution of the response variable and the research question. $\lambda$ is a parameter that controls the weight of the penalty term and is usually selected by cross-validation [108]. $R(f)$ represents the penalty term, a function of the model coefficients. The complexity of the penalty term is determined by restrictions for smoothness and bounds on the vector-space norm [11]. Furthermore, the form of the penalty term determines the characteristics of the regularized model. Based on the type of penalty term, lasso [108], ridge [45], and Elastic Net (EN) [131] are the three most popular regularized regression models.
When the outcome is continuous, the loss function can be the residual sum of squares, which is equivalent to the square loss. The lasso shrinks the coefficients by placing an $L_1$ penalty on the sum of the absolute values of the coefficients:

\[ \min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2, \quad \text{subject to } \sum_{j=1}^{p} |\beta_j| \le t, \tag{1.13} \]

where $t$ controls the weight of the lasso penalty: with a tighter bound $t$, coefficients shrink more toward zero, and vice versa. We can equivalently express the lasso optimization problem in Lagrangian form:

\[ \min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}, \tag{1.14} \]

where $\lambda$ in (1.14) plays the same role as $t$ in (1.13): with increasing $\lambda$, more regression coefficients ultimately shrink toward zero.

Ridge regression is similar to the lasso, except that it uses the $L_2$ penalty rather than the $L_1$ penalty. The objective function of ridge regression is:

\[ \min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2, \quad \text{subject to } \sum_{j=1}^{p} \beta_j^2 \le t, \tag{1.15} \]

or equivalently, in Lagrangian form:

\[ \min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}. \tag{1.16} \]

What distinguishes the lasso from ridge regression is the shape of the constraint region of the objective function. As Figure 1.3 shows, the lasso has a diamond-shaped constraint region that is not differentiable at zero; lasso coefficients can be shrunk exactly to zero when the solution lies on a corner of the diamond constraint region [43]. Thus, the lasso can produce a sparse solution [75], while ridge regression cannot.

Figure 1.3: Lasso (left) and ridge regression (right) estimation plot (p = 2). The blue areas are the constraint regions for $|\beta_1| + |\beta_2| \le t$ and $\beta_1^2 + \beta_2^2 \le t$, respectively. The red ellipses show the contours of the square loss function. This figure is based on Figure 3.11 of [43].
However, when $p > N$, the lasso selects at most $N$ variables before it saturates [75]. To this end, the EN penalty trades off between the lasso and ridge penalties. The objective function of the EN takes the form:

\[ \min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} \left( \frac{1-c}{2}\,\beta_j^2 + c\,|\beta_j| \right) \right\}, \tag{1.17} \]

where $c$ is a pre-specified constant that controls the mix of the lasso ($c = 1$) and ridge ($c = 0$) penalties. The EN combines the advantages of the lasso and ridge, conducting variable selection and coefficient shrinkage at the same time. Moreover, it is more robust when correlated or grouped predictors exist [131].
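The three penalties can be compared directly with the glmnet R package, whose alpha argument plays the role of c in (1.17) (alpha = 1 for the lasso, 0 for ridge, intermediate values for the EN); the simulated data below are illustrative only.

    library(glmnet)

    set.seed(1)
    n <- 100; p <- 200                      # p > n: high-dimensional setting
    X <- matrix(rnorm(n * p), n, p)
    y <- X[, 1:5] %*% rep(1, 5) + rnorm(n)  # only the first 5 predictors matter

    fit_lasso <- cv.glmnet(X, y, alpha = 1)    # L1 penalty, as in (1.14)
    fit_ridge <- cv.glmnet(X, y, alpha = 0)    # L2 penalty, as in (1.16)
    fit_en    <- cv.glmnet(X, y, alpha = 0.5)  # EN mixture, as in (1.17)

    # Nonzero coefficients at the cross-validated penalty: the lasso and EN
    # yield sparse fits, while ridge keeps all coefficients nonzero
    sum(coef(fit_lasso, s = "lambda.min") != 0)
    sum(coef(fit_en,    s = "lambda.min") != 0)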
Most regularized models can be interpreted through hierarchical Bayesian modeling. The coefficient estimates obtained by maximizing the penalized log-likelihood often correspond to the maximum a posteriori (MAP) estimate, i.e., the posterior mode obtained by specifying a prior distribution for the regression coefficients in a hierarchical Bayesian model [8]. Continuing with the linear model example, the outcome is assumed to follow a normal distribution:

\[ Y \mid X, \beta \sim N(X\beta, \sigma^2 I_n). \tag{1.18} \]

The prior distributions of $\beta$ for the lasso, ridge, and EN estimates can be respectively characterized as:

\[
\begin{aligned}
f(\beta \mid \sigma^2) &\sim \text{Double Exponential}\!\left(0, \tfrac{\lambda}{2\sigma^2}\right) \\
f(\beta \mid \sigma^2) &\sim N\!\left(0, \tfrac{\sigma^2}{\lambda}\right) \\
f(\beta \mid \sigma^2) &\propto \exp\!\left\{ -\tfrac{\lambda}{2\sigma^2} \left[ (1-c)\,\|\beta\|_2^2/2 + c\,\|\beta\|_1 \right] \right\}.
\end{aligned} \tag{1.19}
\]

The resulting marginal posterior mode of $\beta \mid Y$ is equivalent to the lasso, ridge, and EN estimates. Taking the lasso as an example, the joint log-likelihood of $Y$ and $\beta$ can be derived from (1.18) and (1.19) with given $\sigma^2$:

\[
\begin{aligned}
l(Y, \beta) &= \log\!\left[ p(X, Y \mid \beta)\, f(\beta \mid \lambda, \sigma^2) \right] \\
&= -\frac{1}{2\sigma^2} (Y - X\beta)^T (Y - X\beta) - \frac{\lambda}{2\sigma^2} \sum_{j=1}^{p} |\beta_j|.
\end{aligned} \tag{1.20}
\]
1.3.5 xtune: External information tuned regularized regression

As mentioned in the previous section, regularized regression is a helpful tool for constructing predictive models in the context of high-dimensional data. Beyond model fitting, it is often worthwhile to interrogate the model using additional information, such as gene functional annotations or features identified by related studies, among other relevant sources. Such meta-feature knowledge is called external information or prior information. Table 1.1 exemplifies the data structure. The main outcome–predictor dataset contains $N$ observations and $p$ predictors. The prior information $Z$ is a $p \times q$ matrix with $q$ prior covariates describing each predictor $X_j$. A prior covariate can be any type of data: for instance, a binary variable indicating whether the feature was identified in previous studies, or a continuous variable describing a feature’s biological function. This kind of prior information can be thought of as “features of features,” which may be helpful for fine-tuning the model. Traditionally, prior information has been used to assess a model in a post-hoc manner. However, manual post-hoc analysis may lead to bias, inconsistency, or reduced power in feature selection [126]. Motivated by this, xtune [126] was proposed to incorporate prior information into the model-building process to improve prediction performance and enhance model interpretation.
Table 1.1: Example of main dataset (upper) and prior dataset (lower)

Observations   Y      X_1     X_2     ...   X_p
1              y_1    x_11    x_21    ...   x_p1
2              y_2    x_12    x_22    ...   x_p2
...            ...    ...     ...     ...   ...
N              y_N    x_1N    x_2N    ...   x_pN

Features   Z_1    ...   Z_q
X_1        0      ...   Z_q1
X_2        1      ...   Z_q2
...        ...    ...   ...
X_p        0      ...   Z_qp
Basically, xtune is an extension of regularized regression. It incorporates the prior information $Z$ into the model by modeling the penalty parameters as a function of $Z$. The objective function of the xtune lasso can be expressed as:

\[ \min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_2^2 + \sum_{j=1}^{p} \lambda_j |\beta_j| \right\}, \qquad \lambda = e^{Z\alpha}, \tag{1.21} \]

where $\lambda$ is a $1 \times p$ vector of tuning parameters, $Z$ is the $p \times q$ prior information matrix, and $\alpha$ is a $q \times 1$ vector of second-level coefficients that links the external information $Z$ to the individual penalties $\lambda_j$ [126]. When the value of $\alpha$ is positive, the coefficients of features with positive (negative) values of $Z$ undergo greater (lesser) shrinkage compared with the baseline level of shrinkage. If $\alpha$ equals zero, the prior information $Z$ does not influence the shrinkage of the regression coefficients. Notice that, by modeling the penalty parameters as a log-linear function of the prior information $Z$, we obtain feature-specific penalty parameters instead of a single common penalty.

Fitting the xtune lasso involves two steps: (1) choose the penalty vector $\lambda$, and (2) estimate the regression coefficients $\beta$ given $\lambda$. The second step can easily be implemented using standard optimization approaches and software, such as the glmnet function in R [31, 126]. The main challenge for xtune, however, lies in selecting an appropriate $\lambda$. Cross-validation is the typical method for choosing $\lambda$, but because $p$ can be large, it becomes computationally impractical. Instead, xtune employs a novel empirical Bayes approach to estimate the best penalty vector $\lambda$.

As previously discussed, the lasso coefficients can be represented by the MAP, or the posterior mode, obtained by specifying a double exponential prior distribution for the regression coefficients in a hierarchical Bayesian model [8]. Thus, the Bayesian interpretation of the xtune lasso can be expressed as:

\[
\begin{aligned}
Y \mid X, \beta &\sim N(X\beta, \sigma^2 I_n) \\
f(\beta_j \mid \sigma^2) &\sim \text{Double Exponential}\!\left(0, \tfrac{\lambda_j}{2\sigma^2}\right), \quad \forall j = 1, 2, \ldots, p \\
\lambda &= e^{Z\alpha},
\end{aligned} \tag{1.22}
\]

where the population variance $\sigma^2$ is set in advance [89].

The hyperparameter $\alpha$ can then be estimated with an empirical Bayes approach by maximizing the marginal likelihood (type-II maximum likelihood) [109], which is obtained by integrating $\beta$ out of the joint distribution of $Y$ and $\beta$:

\[ L(\alpha) = \int_{\mathbb{R}^p} L(Y \mid X, \beta)\, f(\beta \mid \lambda(\alpha))\, d\beta = \int_{\mathbb{R}^p} N(Y \mid X\beta, \sigma^2 I_n)\, \mathrm{DE}(\beta \mid 0, \alpha)\, d\beta. \tag{1.23} \]
The marginal likelihood does not have a closed-form solution when $X$ is not orthogonal. Thus, Zeng et al. [126] proposed to approximate the double exponential distribution by a normal distribution with the same prior variance:

\[ \beta \sim N(0, V^{-1}), \tag{1.24} \]

where $V = \operatorname{diag}(\eta_1, \ldots, \eta_p) = \operatorname{diag}\!\left( \left(\tfrac{\lambda}{2\sigma^2}\right)^2 / 2 \right)$. The resulting approximated negative log marginal likelihood, which has an analytical form, is:

\[ -l(\alpha) = \log|C_\alpha| + Y^T C_\alpha^{-1} Y, \tag{1.25} \]

where $C_\alpha = \sigma^2 I + X V^{-1} X^T$. This objective function can be minimized by a novel iterative re-weighted $L_2$ optimization algorithm [126], which is discussed in more detail in Chapter 3. Minimizing the approximated negative log marginal likelihood yields the estimated $\alpha$. Once $\alpha$ (and hence $\lambda$) is known, a standard method such as the glmnet function in R can be used to obtain the estimated regression coefficients.
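For intuition, the approximated objective in (1.25) is simple to evaluate directly. The sketch below is illustrative only: it assumes $\sigma^2$ is fixed in advance and hands the function to a generic optimizer, whereas xtune itself uses the iterative re-weighted $L_2$ algorithm mentioned above.

    # Approximated negative log marginal likelihood of equation (1.25);
    # X (n x p), Y (n x 1), and Z (p x q) follow the layout of Table 1.1
    neg_log_marglik <- function(alpha, X, Y, Z, sigma2) {
      lambda <- exp(drop(Z %*% alpha))         # feature-specific penalties (1.21)
      eta    <- (lambda / (2 * sigma2))^2 / 2  # precisions of the normal approximation (1.24)
      C      <- sigma2 * diag(nrow(X)) + X %*% diag(1 / eta) %*% t(X)
      as.numeric(determinant(C, logarithm = TRUE)$modulus + t(Y) %*% solve(C, Y))
    }

    # Illustrative estimation of alpha with a generic optimizer:
    # optim(rep(0, ncol(Z)), neg_log_marglik, X = X, Y = Y, Z = Z, sigma2 = 1)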
Currently, xtune supports lasso, ridge, and EN penalties for continuous and binary outcomes. Simulation studies and real-world data examples showed that it can improve prediction accuracy while maintaining a more parsimonious model by integrating prior information into the modeling process. Motivated by this, we extend the xtune algorithm to the EN framework for multi-classification problems, which is discussed in Chapter 3.
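For completeness, a minimal usage sketch of the xtune R package follows. The inputs X, Y, Z, and X_test are hypothetical, and the exact argument and element names (e.g., penalty.vector) should be checked against the package documentation.

    library(xtune)

    # X: N x p predictor matrix; Y: outcome; Z: p x q prior information matrix
    fit <- xtune(X = X, Y = Y, Z = Z, family = "linear")

    fit$penalty.vector                 # estimated feature-specific penalties
    predict_xtune(fit, newX = X_test)  # predictions from the tuned model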
1.4 Genetic summary statistics

As mentioned in an earlier section, a growing number of studies analyze genetic summary statistics to reveal associations between phenotype and genotype. In this section, we briefly introduce modeling with summary statistics and the interpretation of the resulting models.
1.4.1 Constructing the linear model with summary statistics

Linear models incorporating genetic summary statistics have gained prominence in the field of genomics, enabling the study of complex traits and diseases. Such models aim to estimate the relationship between genetic variants and phenotypic traits [14].

The simplest method is to perform ordinary least squares (OLS) regression, which typically takes the following form:

\[ Y = \beta_0 + \sum_{j} \beta_j X_j + \epsilon, \tag{1.26} \]

where $Y$ represents the phenotype being studied, $\beta_0$ is the intercept term, $\beta_j$ denotes the effect size of the $j$th genetic variant ($X_j$) on the phenotype, and $\epsilon$ represents the residual error term capturing unexplained variation. A genetic variant can be represented as a count variable (0, 1, or 2) indicating the number of risk alleles an individual carries, or as a weighted genetic score computed using PRS or other methods [24]. The effect sizes ($\beta_j$) quantify the magnitude and direction of the genetic influence on the phenotype. Multiple-testing correction is often necessary to account for the testing of multiple genetic variants; methods such as the Bonferroni correction, the false discovery rate (FDR), or permutation-based approaches can be employed to control the family-wise error rate or the false-positive rate.
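As a toy illustration of scoring individuals with published effect sizes, the base-R sketch below computes a PRS as a weighted sum of risk-allele counts; the genotype matrix and effect sizes are simulated stand-ins, not real summary statistics.

    set.seed(1)
    n <- 500; m <- 20
    geno <- matrix(rbinom(n * m, size = 2, prob = 0.3), n, m)  # allele counts 0/1/2
    beta_hat <- rnorm(m, 0, 0.05)  # stand-in for published per-allele effect sizes

    prs <- drop(geno %*% beta_hat)  # weighted sum of risk alleles per individual

    # The score can then be related to a phenotype as in (1.26), e.g.:
    # summary(lm(pheno ~ prs))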
However, there are statistical challenges in constructing OLS linear models with genetic summary statistics. For example, genetic associations may be confounded by population stratification, and the estimation of coefficients may be ill-posed with GWAS data [13]. As alternatives to OLS, other methods can be used to analyze genetic summary statistics. For instance, Bayesian methods provide a flexible framework for incorporating prior information and estimating parameters [129]. Bayesian linear regression with summary statistics allows for the estimation of the posterior distributions of the effect sizes and their uncertainties [129]. Moreover, it is also useful when dealing with high-dimensional datasets. By specifying different priors on the $\beta_j$, the model can be connected to penalized regression: when the prior follows a Gaussian distribution ($\beta_j \overset{iid}{\sim} N(0, \sigma^2)$), the model is equivalent to ridge regression; when the prior follows a double exponential distribution ($|\beta_j| \sim \exp(1/\sigma)$), the model is close to lasso regression [128].
1.4.2 Genetic interpretation

The interpretation of linear models with genetic summary statistics involves understanding the effect sizes, significance, and predictive power of the PRS [114]. The effect size ($\beta$) indicates the change in the phenotype associated with a one-unit increase in the PRS, providing insight into the strength and direction of the genetic contribution to the phenotype.

To assess the statistical significance of the PRS, p-values or confidence intervals can be derived from hypothesis testing. The p-value indicates the probability of observing the association between the PRS and the phenotype by chance alone. Typically, a threshold such as $p < 0.05$ is used to determine statistical significance.

The predictive power of the linear model can be evaluated using metrics such as R-squared ($R^2$) or the mean squared error (MSE). $R^2$ represents the proportion of variance in the phenotype explained by the PRS and other covariates. The MSE measures the average squared difference between the observed and predicted values, providing an assessment of the model’s accuracy.
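Both metrics are straightforward to compute in R; the snippet below continues the hypothetical prs from above with a hypothetical phenotype pheno.

    fit <- lm(pheno ~ prs)
    summary(fit)$r.squared         # proportion of phenotype variance explained
    mean((pheno - fitted(fit))^2)  # mean squared error of the fitted model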
1.5 Overview

In this dissertation, we introduce both non-data-informed and data-informed penalization methods and show how they address real-world applied problems. In Chapter 2, we present a hierarchical Bayesian framework that utilizes g-methods to estimate mixture effects in environmental epidemiology studies. Chapter 3 proposes a data-informed, feature-specific Elastic Net model integrating prior information for multi-classification problems. In Chapter 4, we extend the framework of the method in Chapter 3 to genetic summary statistics. Finally, in Chapter 5, we demonstrate the software implementing the above methods.
Chapter 2
Bayesian hierarchical regression modeling with g-prior and g-computation
2.1 Introduction
Over the past decade, there has been a significant growth in mixtures analysis within environmental epi-
demiology, with advancements both in new methodologies and their application [9, 100, 41, 38, 47]. This
shift towards mixtures analysis is crucial for our field because environmental exposures are often cor-
related, stemming from common sources or exposure pathways, and not experienced in isolation. As a
result, the conventional regression models typically used in epidemiology face limitations when dealing
with multiple highly correlated variables, providing only exposure-specific effect estimates without a com-
bined mixture effect estimate.
To address these limitations, a wide range of statistical methods for mixtures analyses has been de-
veloped. These methods aim to overcome the challenges posed by conventional models and offer more
comprehensive insights. However, while they represent significant progress, many of the commonly used
mixture analysis methods still have some drawbacks. For example, some methods provide only a com-
bined mixture effect estimate, such as WQS [16], while others, like BKMR [12], present results that are
difficult to interpret in the context of both single-exposure effects and combined effects. Moreover, several
23
implementations of these methods have limitations in their flexibility of application. Some do not allow
for interaction effects between component exposures, while others are restricted to certain types of out-
comes. Additionally, certain methods require assumptions about directionality (e.g., WQS) or functional
forms (e.g., common effect across quantiles of exposure) [12].
Motivated by these challenges, we propose Bayesian Hierarchical Regression Modeling (BHRM) for
mixtures analysis. BHRM offers a flexible framework, enabling robust estimation for exposures with high
correlation through prior specification. It allows for obtaining conditional exposure-specific estimates and
can be readily expanded to include interaction effects. Additionally, variable selection techniques can be
incorporated, enhancing its versatility. A significant advantage of BHRM is that, being based on general-
ized linear regression, the effect estimates remain interpretable. Moreover, combinations of exposures at
defined levels can be effectively summarized, providing posterior estimates of combined mixture effects.
For instance, it allows for estimating effects per increase of one standard deviation across all component
exposures or contrasting a specified exposure profile with the natural course of observed exposures. This
feature is crucial as it enables estimating effects based on the counterfactual framework, using methodolo-
gies like g-computation (also known as g-formula) [62, 91, 77, 99]. Furthermore, BHRM offers established
frameworks for application to various outcomes, encompassing continuous, binary, categorical, longitudi-
nal, and time-to-event data. This versatility allows researchers to apply the method effectively to a wide
range of epidemiological scenarios and outcome types.
2.2 Methods
This approach yields conditional individual exposure-specific estimates along with an estimate of the over-
all effect of the mixture. BHRM combines 1) a g-prior for the corresponding exposure effects to provide
robust estimation of highly correlated exposures [73, 69]; 2) a Bayesian selection procedure to estimate the
posterior inclusion probability of individual exposures [54]; and 3) Bayesian g-computation in a potential
outcome framework for estimating the overall mixture effect based on two exposure profiles [54].
2.2.1 Model specification
Firstly, we include the Bayesian model selection method into our algorithm. Suppose we have a linear model with p exposures and q covariates:

\[ Y_i = \alpha + \sum_{j=1}^{p} \gamma_j \beta_j X_j + \sum_{k=1}^{q} \delta_k U_k + \epsilon_i, \quad i = 1, \dots, n, \tag{2.1} \]

where X_j is an exposure with corresponding estimate β_j, γ_j is a binary variable indicating the inclusion of a specific exposure j in the mixture, and U_k is a covariate with corresponding effect estimate δ_k. Specifically, γ = (γ_1, ..., γ_p) is a vector with elements γ_j ∈ {0, 1} indicating whether variable X_j should be included in the model M_γ (γ_j = 0 is equivalent to β_j = 0).
In order to incorporate the g-prior into the model, we cast the linear regression in a Bayesian setting under model M_γ:

\[
\begin{aligned}
Y \mid \alpha, \beta, \phi, \gamma &\sim N(\mathbf{1}\alpha + X_\gamma \beta_\gamma, \, \phi^{-1} I) \\
\beta_\gamma &\sim N\left(0, \, g\phi^{-1}(X_\gamma' X_\gamma)^{-1}\right) \\
\gamma &\sim \text{Beta}(1, p) \\
p(\phi) &\propto 1/\phi \\
p(\alpha \mid \phi) &\propto 1
\end{aligned}
\tag{2.2}
\]

where the prior covariance of β is specified as a scaled version of the covariance matrix of the MLE estimator and a function of the variance of the outcome φ⁻¹. We chose the Beta-Binomial distribution, a function of p, as the prior for γ [117]. Here, a null-based approach is used to calculate Bayes factors and model probabilities by comparing each model M_γ with the null model (H_0 : β_γ = 0 versus H_1 : β_γ ≠ 0) [61]. Based on that, we can derive the posterior distributions:

\[
\begin{aligned}
\beta_\gamma \mid Y, \alpha, \phi, \gamma &\sim N\left(\tfrac{g}{1+g}\hat\beta_\gamma, \; \phi^{-1}\tfrac{g}{1+g}(X_\gamma' X_\gamma)^{-1}\right) \\
\alpha \mid Y, \gamma &\sim N\left(\hat\alpha_\gamma, \; \phi^{-1}(X_\gamma' X_\gamma)^{-1}\right),
\end{aligned}
\tag{2.3}
\]

where β̂_γ is the MLE estimator for M_γ. We can see that the scalar g regulates the degree of conditional posterior mean shrinkage from the MLE toward the prior mean, which is zero [61]; for example, with g = 99 the posterior mean equals 0.99 β̂_γ. Also, the dispersion of the posterior covariance is shrunk by the factor g/(1+g). Within this framework, the posterior inclusion probability (PIP) on the individual γ_j is the posterior probability that the coefficient is non-zero. In this case, the g-prior shrinkage factor is combined with the Bayesian selection algorithm to yield optimal prediction performance.
There has been considerable discussion of the choice of g. Liang et al. [61] and Li and Clyde [60] described several choices of g priors, including fixed g priors, empirical Bayes approaches, and mixtures of g priors. For our BHRM, we adopted a semi-Bayes fixed g prior that is pre-specified with a constant according to the level of desired shrinkage.
2.2.2 Parameter estimation using Markov chain Monte Carlo (MCMC)
To obtain coefficient estimates, we used the MCMC [40] simulation method for Bayesian hierarchical models through the Just Another Gibbs Sampler (JAGS) coding scheme. JAGS uses the same modeling language as BUGS and has a simple interface in R via the package rjags [1, 5, 83, 84]. It provides a flexible framework for Bayesian hierarchical modeling with any formula, while users only need to specify the model likelihood and statistical distributions.
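To illustrate the workflow, the following is a minimal R/rjags sketch of how the core BHRM likelihood and priors (Equations 2.1-2.2) might be written; the data names (Y, X, U, N, P, Q, g) are illustrative, the multivariate g-prior is simplified here to independent normals scaled by the residual precision and g, and the improper priors are replaced by vague proper stand-ins:

library(rjags)
bhrm_model <- "
model {
  for (i in 1:N) {
    mu[i] <- alpha + inprod(gb[], X[i, ]) + inprod(delta[], U[i, ])
    Y[i] ~ dnorm(mu[i], phi)
  }
  for (j in 1:P) {
    gamma[j] ~ dbern(omega)        # stochastic-search inclusion indicator
    beta[j] ~ dnorm(0, phi / g)    # simplified independent stand-in for the g-prior
    gb[j] <- gamma[j] * beta[j]    # gamma[j] = 0 drops exposure j from the mixture
  }
  omega ~ dbeta(1, P)              # Beta(1, p) prior on the inclusion probability
  for (k in 1:Q) { delta[k] ~ dnorm(0, 1.0E-3) }
  alpha ~ dnorm(0, 1.0E-3)
  phi ~ dgamma(1.0E-3, 1.0E-3)     # vague proper stand-in for p(phi) proportional to 1/phi
}"
# jm <- jags.model(textConnection(bhrm_model),
#                  data = list(Y = Y, X = X, U = U, N = nrow(X),
#                              P = ncol(X), Q = ncol(U), g = 99), n.chains = 2)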
2.2.3 g-computation
After obtaining coefficient estimates, we utilized g-computation to yield a single effect estimate (e.g., a mixture effect) that captures the impact of a one standard deviation increase in the levels of all exposures simultaneously [3, 54]. Specifically, we use posterior predictive distributions to estimate a single mixture risk difference (ψ_RD) based on two exposure profiles, such that:

\[ \psi_{RD} = \psi_{x^* = \text{high}} - \psi_{x^* = \text{low}}, \tag{2.4} \]

where ψ_{x*} = Σ_{j=1}^{p} γ_j β_j x*_j and x* is the counterfactual profile for the standardized exposures with all exposure values set from low to high (e.g., a one standard deviation change of X from -0.5 to 0.5).
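As a minimal illustration of this contrast, the following R sketch (added here; the matrix draws is a hypothetical iterations × p matrix of MCMC samples of γ_j β_j) computes the posterior distribution of ψ_RD for a one standard deviation change in all exposures:

p <- ncol(draws)
x_low  <- rep(-0.5, p)                          # counterfactual profile with all exposures low
x_high <- rep( 0.5, p)                          # counterfactual profile with all exposures high
psi_rd <- draws %*% x_high - draws %*% x_low    # per-draw mixture risk difference (Equation 2.4)
c(mean = mean(psi_rd), quantile(psi_rd, c(0.025, 0.975)))  # posterior mean and 95% interval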
2.3 Results
2.3.1 Simulation studies
We conducted a series of simulation studies to examine the performance of BHRM. For N = 200 individuals, we simulated a set of five continuous exposures from a multivariate normal distribution with a pre-specified correlation ρ between exposures. This correlation was varied across all six simulation scenarios as a group to reflect high, medium, and no correlation, using ρ = 0.9, 0.5, and 0, respectively. Each simulated exposure was then categorized into quartiles to ensure comparison to alternative approaches. For each level of correlation, six simulation scenarios were explored varying the causal mechanism and level of exposure-specific effects for a simulated normally distributed outcome with a standard deviation equal to one. The scenarios are named 1 through 6 with an additional letter indicating the correlation between exposures, with a, b, and c representing ρ = 0.9, 0.5, and 0, respectively (e.g., scenario 2a is scenario 2 conducted with a correlation of ρ = 0.9). Scenario 1 sets all effects to zero to establish empirical type I error rates. For the other scenarios, the first two exposures are causal with non-zero coefficients (β_1, β_2) while the other three exposures are simulated as non-causal with zero coefficients (β_3 = β_4 = β_5 = 0).
= 0). Scenarios 2 and 5 have
causal effects in the same direction. Scenarios 3 and 6 have causal effects in opposite directions. Scenarios
4, 5, and 6 add an additional interaction effect between the two causal exposures. These various scenarios
result in a variety of mixture effects ψ , as defined by the linear sum of the corresponding causal effects.
The six scenarios are summarized in Table 2.1. For each simulated replicate from a total of 100, we perform
four analyses for comparison: 1) a univariate linear regression treating each exposure as independent; 2)
a weighted quantile sum (WQS) regression as implemented in gWQS [16] (version 3.0.4); 3) a joint linear
regression using maximum likelihood (LRMLE) and g-computation for mixture effect estimation as imple-
mented in qgcomp [3] (version 2.8.0); and 4) our BHRM. For our BHRM, we used a semi-Bayes approach and a pre-specified g = 99. For all approaches, we report average coefficient estimates and the Monte Carlo standard error (MCSE) across all replicates.
Furthermore, we conducted a simulation study with higher dimensional data to demonstrate robust estimation in high-dimensional settings. We simulated a set of 100 continuous exposures from a multivariate normal distribution with ρ = 0.9 between exposures. Similar to the simulated data in the lower dimension, simulated exposures were categorized into quartiles and the outcome follows a normal distribution with a standard deviation equal to one. The first two exposures are simulated as causal with opposite coefficients (β_1 = 1, β_2 = -1) while the remaining exposures are non-causal with zero coefficients (β_3 = ... = β_100 = 0), without interaction effects between exposures. For each model, in addition to mean estimated coefficients and MCSE, we also evaluated the mean number of causal exposures that were identified as statistically significant by a post-hoc Wald test statistic (i.e., β̃/se(β̃)) for each of the 100 simulation iterations.
Table 2.2 shows the simulation results for the highly correlated data from the lower dimensional simulations, and Table A.2 and Table A.3 show the simulation results for medium-correlated and uncorrelated data, respectively. The univariate model does not estimate interaction effects nor a single mixture/combined effect. In scenario 1a, while all approaches do not have bias for each corresponding estimate, LRMLE
Table 2.1: Simulation scenarios for exposures in low dimension, which were repeatedly used in settings reflecting high, medium, and no correlation among exposures (ρ = 0.9, 0.5, and 0, respectively), which we refer to as scenarios 1a-6a, 1b-6b, and 1c-6c.

Scenario  Profiles  β1  β2  β3  β4  β5  β12  ψ
1         [0, 1]    0   0   0   0   0   0    0
2         [0, 1]    2   1   0   0   0   0    3
3         [0, 1]    1   -1  0   0   0   0    0
4         [0, 1]    0   0   0   0   0   1    3
5         [0, 1]    2   1   0   0   0   1    6
6         [0, 1]    1   -1  0   0   0   1    3
has increased MCSE compared to the other approaches. In addition, we also observe a higher mixture effect
MCSE for WQS and LRMLE in comparison to BHRM. In scenario 2a, the univariate model has extremely bi-
ased estimates for non-causal exposures reflecting their correlation to the causal exposures. Interestingly,
while WQS has slightly biased estimates to the null for the causal exposures, the overall mixture effect is
unbiased. In a direct comparison to LRMLE, BHRM has slightly biased estimates toward the null for the
causal exposures manifesting the influence of the prior specification centered at zero. In addition, BHRM
estimates have smaller MCSE—a result of the bias-variance tradeoff in Bayesian analysis—with this differ-
ence much more substantial for the null exposures. In scenario 3a, the highly correlated exposures with
different directions of effects lead to very poor estimation for the univariate model and WQS while LRMLE
and BHRM still yielded good estimates for both exposure-specific and mixture effects. In scenario 4a, the
underlying causal model of only an interaction effect results in non-zero estimates for all correlated expo-
sures for the univariate model as it does not model the interaction effect directly. Likewise, because the
WQS does not model the interaction specifically the exposure-specific effects are poorly estimated. How-
ever, the mixture estimate is unbiased. As seen in previous scenarios, the LRMLE and BHRM yield good
estimates with the BHRM having smaller MCSE. Previously described trends are observed for scenarios 5a and 6a, with BHRM yielding slightly biased exposure-specific estimates but substantially lower MCSE than LRMLE. Similar patterns can also be observed in Table A.2 and Table A.3 for medium-correlated and uncorrelated exposures.
For simulations with higher dimensional data (Table 2.3), BHRM and LRMLE yield estimates that are
close to the simulated values for both the exposure-specific and the mixture effect. However, BHRM has
lower MCSE for exposure-specific effect estimates with substantially lower MCSE for non-causal exposures
as a result of the model selection process. Exploring the selection for BHRM via a post-hoc Wald test, the
mean number of causal exposures selected by BHRM is 2.13, corresponding closely with the true number
of causal exposures simulated. However, LRMLE selected 6.76 exposures on average, which implies that LRMLE may lead to a higher false positive rate in a high-dimensional setting.
2.3.2 Real-world data example
For the real-world data application, we used data from the ISGlobal/ATHLETE “Exposome Data Challenge
Event” held April 2021 [115, 70], which we will refer to as the “challenge data.” The challenge data was de-
rived from the Human Early Life Exposome (HELIX) subcohort database and were partially simulated from
estimated correlation structure from the observed data [115, 70]. The HELIX study [115, 70] represents a
collaborative project across six established and ongoing longitudinal population-based birth cohort studies
in six European countries (France, Greece, Lithuania, Norway, Spain, and the United Kingdom). Based on
1,301 observations included in this dataset, we evaluated whether exposure to per- and polyfluoroalkyl substances (PFAS) was associated with body mass index (BMI) z-score (standardized on sex and age) at ages 6-11 years.
We included five PFAS exposures: perfluorohexane sulfonate (PFHXS), perfluorononanoate (PFNA), perflu-
orooctanoate (PFOA), perfluorooctane sulfonate (PFOS), and perfluoroundecanoate (PFUNDA). Covariates
included child sex (female; male), maternal age (years), maternal pre-pregnancy BMI, maternal education
(primary school; secondary school; university degree or higher), and cohort of inclusion. The challenge
data was not collected for the purpose of our study and none of the authors had access to any subject identifiers. According to the Office for Human Research Protections of the U.S. Department of Health and Human Services, these analyses do not constitute human subjects research requiring Institutional Review Board approval or consent. The type I error rate used in the analysis was 0.05.
Figure 2.1 shows the correlation pattern among the PFAS exposures, with the highest pairwise correlation of 0.56 and several pairwise correlations around 0.45. The BHRM model with interaction terms provided
little evidence for interactions between exposures. Therefore, we present results for WQS, LRMLE, and BHRM without interaction terms. Here, we used g = 99 and a burn-in period of 10,000 iterations followed by 50,000 iterations in the MCMC estimation. Convergence was confirmed via visual inspection of trace plots. Across the various alternative approaches, evidence for each individual exposure was evaluated by
examining coefficient estimates. In addition, posterior inclusion probabilities (PIP) for BHRM and weights for WQS were evaluated relative to a baseline of 0.20 for both. Although there is only a modest pairwise correlation between exposures, the overall aggregate correlation among exposures leads to several noteworthy exposures in the univariate analysis (Table 2.4). WQS indicates that the summed PFAS effect has a statistically significant association with BMI z-score, although it does not provide statistical inference on the individual effects. LRMLE shows that PFOA, PFOS, and the summed PFAS effect are statistically significantly associated with the BMI z-score. BHRM indicates that only the summed PFAS effect has a statistically significant relationship with the BMI z-score. Furthermore, WQS results in larger effect estimates for
PFHXS and PFNA and corresponding larger weights for PFHXS (weight = 0.358) and PFNA (weight = 0.27)
and smaller weights relative to the baseline for PFOS (weight = 0.099) and PFUNDA (weight = 0.073). In
contrast, LRMLE and BHRM tend to have larger effect estimates for PFOA and PFOS after conditioning on
all exposures. Reflective of these trends, BHRM has positive evidence for inclusion for PFOA (PIP = 0.863)
and PFOS (PIP = 0.466) and smaller PIPs for all other exposures.
Figure 2.1: Correlation pattern among the PFAS exposures
2.4 Discussion
The BHRM described in this study integrates several prominent ideas and approaches in the analysis of
correlated data, feature selection, and causal modeling. The main objective of this project is to explain the
motivation behind combining these ideas into a unified approach, emphasizing its potential advantages in
application and the reduction of implementation barriers.
Recent discussions on statistical methods for mixtures analysis have considered different research ques-
tions within this framework [41, 38, 47, 104]. These questions include estimating the overall effect of the
mixture, identifying independent effects of mixture components (e.g., to identify toxic agents), and inves-
tigating interactions (e.g., synergistic effects) between mixture components. It is acknowledged that the
choice of method should align with the specific research question(s) at hand, as no single method can ad-
dress all aspects relevant to mixtures analysis [41, 38, 47]. However, we advocate for the use of BHRM in
mixtures analysis in environmental epidemiology due to its flexibility. BHRM allows for the estimation
of multiple mixtures-related effects, including the overall effect, independent effects of individual compo-
nents, and straightforward extensions to account for potential interaction effects, non-linear effects, and
various types of outcomes. Other researchers have also recognized the appeal of Bayesian approaches for
mixtures analysis [41, 38].
In our example application of BHRM, we observed more robust results compared to current mixtures
methods, particularly in settings with high correlations between exposures. BHRM provided unbiased
effect estimations with low standard errors, making it an advantageous method for mixtures analysis. Its
robustness, stability, and flexibility to estimate both group and individual exposure effects further highlight
its utility. The estimation of exposure-specific effects in addition to overall effects is crucial for identifying
causal agents in mixtures of environmental contaminants.
In many cases, generalized linear regression approaches with maximum likelihood estimation may
suffice for analysis when the sample size is large enough to provide robust estimates, the exposures have low inter-correlation, and there are only a few exposures to consider. However, with advancements
in modern technologies enabling more comprehensive exposure assessment, it is becoming more common
to encounter highly correlated exposures due to measuring similar chemical exposures or classes. More-
over, the number of exposures is likely to increase with technological progress. Additionally, the costs
of exposure assessment can be prohibitive, leading to relatively modest sample sizes. In such situations,
conventional regression techniques for estimation may exhibit poor performance, as evidenced in our
simulation experiments. To overcome these challenges, we introduce the BHRM approach, incorporating
several components tailored to address specific limitations. By specifying a prior distribution, we achieve
more stable effect estimates as Bayesian approaches strike a balance between the maximum likelihood es-
timate obtained from the data and a pre-specified prior mean, often set as zero under the assumption of no
effect. This balance is influenced by the strength of evidence from the data, represented by the estimate of
the variance for the effect, and the estimated or pre-specified strength of the prior. We adopt a g-prior for
its advantages in incorporating the correlation structure within the prior and its status as a conjugate prior,
offering additional benefits in terms of interpretation and computation. In a linear model, the posterior
estimate of effect can be simply expressed as (g/(g+1)) β̂, where β̂ is the maximum likelihood or ordinary least squares estimate. Consequently, g acts as a shrinkage factor, pulling the Bayesian estimate towards zero, which aids in stabilizing effect estimation. While this shrinkage introduces a bias towards zero, the final posterior effect estimate often exhibits a smaller standard error, representing the bias-variance tradeoff [120]. While various approaches exist for pre-specifying or estimating g, it is helpful to view g as a parameter influencing the weight between the prior and the data. As demonstrated in our simulation study and applied example, even assigning a relatively large weight to the data (e.g., g = 99) can significantly assist in estimation, resulting in substantially smaller standard errors with minimal bias for true causal effects.
In scenarios where the number of exposures is high relative to the sample size, feature selection be-
comes essential. We implemented a specific version of Bayesian stochastic search using a "spike and slab"
prior. This approach effectively identifies specific exposures from a potentially large number of exposures
and results in a representative set of models [117, 19, 46], each weighted by the evidence from the data and
prior. This method formally incorporates uncertainty in model selection and provides valuable insights
for mixture effects when coupled with g-computation.
An important aspect of our approach is the utilization of g-computation [91, 77, 99] to estimate the
effects of contrasting two exposure profiles. Considering exposure contrasts within the counterfactual
framework [62] can be highly advantageous for mixtures analysis. The interpretation of overall effect
estimates in this field can be challenging. While we currently implement simple exposure contrasts, this
framework can be extended to compare more complex exposure profiles, such as data-driven exposure pro-
files or those associated with specific interventions. For instance, these exposure contrasts could involve
comparisons to the natural course of exposure, answering questions about what would have happened
if exposures were different from what they actually were. Alternatively, it could entail comparing "low"
and "high" exposure profiles commonly observed in the data. Additionally, we can target the reduction of
individual components with the highest independent or synergistic effects, provided such interventions
are realistic and supported by the available data. G-computation has been successfully employed in other
mixture analysis approaches, including the work of Keil and colleagues on quantile g-computation [3] and Bayesian g-computation [53]. These studies provide important discussions on the inferential framework
for mixture methods and further support the effectiveness of using g-computation to estimate exposure
contrasts in mixtures analysis.
BHRM also offers many potential extensions. Prior information can be incorporated to improve es-
timates of health outcomes. Investigators can influence priors in two main ways: (i) by specifying prior
covariates or characteristics for the exposures within the mean component of the g-prior [120, 59, 22, 107],
or (ii) by modifying the prior on the probability of inclusion using methods like probit [87, 86]. These
prior covariates may include information from toxicology studies, such as potency factors, or annotation
of multiple groups of exposures. For instance, a single model could encompass multiple classes of chemical
exposures, like polybrominated diphenyl ethers, polychlorinated biphenyls, and heavy metals, including
individual chemical components, enabling the assessment of their independent effects on the outcome.
By building upon a more conventional regression model, BHRM can potentially analyze more complex
data using existing and more interpretable models. Furthermore, extensions to BHRM could incorporate
non-linear exposure-response functions, time-varying exposures, or handle longitudinal data. These ex-
amples illustrate the versatility of the BHRM approach in addressing the challenges of mixtures analysis
in environmental epidemiology.
As part of this project, we provide access to R scripts and an accompanying vignette that demonstrate
the application of the BHRM approach for simple scenarios using JAGS and an MCMC implementation for
estimation. Although alternative options for BHRM estimation exist, such as STAN [15] or specific R pack-
ages like BMA [46], all implementations require some familiarity with extensions. As is the case with any
Bayesian analysis, careful attention is needed for valid inference and model fitting, as some approaches can
be computationally expensive or rely on experience, such as determining model convergence with MCMC
methods. In certain cases, Bayesian analysis might not be necessary, and conventional regression models
with maximum likelihood estimation could suffice. For such situations, we recommend using the R pack-
age qgcomp for mixtures analysis with g-computation. The choice of software often involves a trade-off
between using readily available and easy-to-implement tools, which may assume simplified model forms
and have common defaults, versus developing and implementing more complex models that better capture
the nuances of the investigators’ understanding of the underlying data-generating process. To improve
the estimation of mixture group and component-specific effects, models must not be grossly misspeci-
fied. When there is evidence of more complex exposure-response relationships, such as non-linearity or
complex interactions, these factors should be carefully considered when selecting the final main model or
conducting subsequent sensitivity analyses. Collaborating with methodologists and biostatisticians may
be necessary to appropriately incorporate these modeling parameters until easy-to-implement complex
modeling approaches become more accessible to the average environmental epidemiologist.
Mixture analysis methods vary in their data requirements, ease of implementation, and statistical ap-
proach, depending on the research question or effect estimation goal. In conclusion, we highlight BHRM
as a flexible methodology that allows for the estimation of both overall mixtures effects and independent
effects of individual components, including interactions. Its ability to incorporate prior information, han-
dle non-linear effects, and accommodate various outcome types makes it a valuable tool in environmental
epidemiology.
Table 2.2: Simulation results for scenarios with highly correlated exposures (ρ = 0.9) implemented using univariate, WQS, quantile g-computation (LRMLE), and BHRM methods. Coefficient estimate (Monte Carlo standard error) presented.

Scenario  Parameter   Truth  Univariate      WQS             LRMLE           BHRM
1a        β1          0      0.004 (0.058)   -0.002 (0.055)  -0.011 (0.142)  -0.001 (0.005)
          β2          0      -0.001 (0.061)  -0.001 (0.018)  -0.007 (0.142)  0.001 (0.018)
          β3,4,5 (a)  0      0.003 (0.058)   0.000 (0.005)   0.008 (0.144)   0.000 (0.010)
          ψ           0      —               -0.004 (0.079)  0.007 (0.074)   0.001 (0.027)
2a        β1          2      2.835 (0.070)   1.877 (0.190)   1.989 (0.137)   1.943 (0.114)
          β2          1      2.665 (0.098)   0.870 (0.212)   1.010 (0.143)   0.974 (0.119)
          β3,4,5 (a)  0      2.502 (0.112)   0.088 (0.068)   -0.002 (0.139)  -0.003 (0.028)
          ψ           3      —               3.012 (0.089)   2.992 (0.069)   2.908 (0.066)
3a        β1          1      0.167 (0.069)   0.132 (0.086)   0.998 (0.146)   0.987 (0.105)
          β2          -1     -0.164 (0.069)  0.008 (0.016)   -0.995 (0.126)  -0.989 (0.102)
          β3,4,5 (a)  0      -0.002 (0.062)  0.001 (0.002)   0.002 (0.139)   -0.002 (0.043)
          ψ           0      —               0.143 (0.092)   -0.002 (0.063)  -0.006 (0.057)
4a        β1          0      2.740 (0.103)   1.311 (0.318)   -0.060 (0.316)  0.016 (0.150)
          β2          0      2.741 (0.103)   1.332 (0.294)   0.001 (0.354)   0.004 (0.134)
          β3,4,5 (a)  0      2.509 (0.123)   0.121 (0.137)   0.018 (0.316)   0.000 (0.019)
          β12         1      —               —               1.011 (0.233)   0.994 (0.055)
          ψ           3      —               3.008 (0.116)   3.006 (0.065)   2.999 (0.073)
5a        β1          2      5.589 (0.089)   3.306 (0.277)   1.973 (0.289)   1.655 (0.151)
          β2          1      5.427 (0.145)   2.349 (0.327)   0.976 (0.329)   0.643 (0.160)
          β3,4,5 (a)  0      4.997 (0.184)   0.129 (0.145)   0.017 (0.315)   -0.014 (0.047)
          β12         1      —               —               1.010 (0.251)   1.207 (0.065)
          ψ           6      —               6.041 (0.110)   5.995 (0.076)   5.928 (0.059)
6a        β1          1      2.917 (0.095)   2.284 (0.267)   1.020 (0.291)   0.968 (0.126)
          β2          -1     2.589 (0.128)   0.376 (0.195)   -0.974 (0.304)  -0.965 (0.158)
          β3,4,5 (a)  0      2.514 (0.121)   0.122 (0.132)   0.020 (0.282)   0.004 (0.053)
          β12         1      —               —               0.977 (0.230)   0.998 (0.064)
          ψ           3      —               3.028 (0.127)   3.006 (0.074)   2.995 (0.062)

Note: —, no data; BHRM, Bayesian hierarchical regression model; LRMLE, linear regression using maximum likelihood; WQS, weighted quantile sum.
(a) Estimated coefficient and MCSE for β3,4,5 are the average of β3, β4, and β5, i.e., the non-causal parameters.
Table 2.3: Simulation results for scenarios with high-dimensional data and highly correlated exposures (ρ = 0.9) implemented using univariate, WQS, quantile g-computation (LRMLE), and BHRM methods. Coefficient estimate (Monte Carlo standard error) presented.

Parameter         Truth  Univariate      WQS            LRMLE           BHRM
β1                1      0.171 (0.070)   0.001 (0.005)  1.013 (0.217)   0.976 (0.202)
β2                -1     -0.157 (0.073)  0.000 (0.002)  -0.966 (0.231)  -0.988 (0.206)
β non-causal (a)  0      0.001 (0.074)   0.000 (0.008)  -0.001 (0.99)   0.001 (0.075)
ψ                 0      —               0.008 (0.721)  -0.007 (0.062)  0.001 (0.075)

Note: —, no data; BHRM, Bayesian hierarchical regression model; LRMLE, linear regression using maximum likelihood; WQS, weighted quantile sum.
(a) Estimated coefficient and MCSE for β non-causal are the average of β3 to β100, i.e., the non-causal parameters.
Table 2.4: BMI z-score and PFAS chemicals effect estimation profiles from WQS, quantile g-computation (LRMLE), and BHRM using HELIX challenge data.

               Effect estimates                                                       Weights/PIP
PFAS chemical  Univariate       WQS              LRMLE            BHRM             WQS    BHRM
PFHXS          -0.088* (0.042)  -0.0108          0.002 (0.047)    -0.001 (0.011)   0.325  0.045
PFNA           -0.129* (0.036)  -0.105           -0.026 (0.047)   -0.006 (0.026)   0.316  0.095
PFOA           -0.144* (0.032)  -0.072           -0.101* (0.041)  -0.110 (0.057)   0.216  0.862
PFOS           -0.147* (0.036)  -0.045           -0.093* (0.045)  -0.047 (0.067)   0.135  0.440
PFUNDA         -0.049 (0.033)   -0.003           0.010 (0.036)    0.000 (0.007)    0.008  0.035
ψ              —                -0.332* (0.048)  -0.207* (0.054)  -0.165* (0.047)  —      —

Note: —, no data; BHRM, Bayesian hierarchical regression model; LRMLE, linear regression using maximum likelihood; PFAS, per- and polyfluoroalkyl substances; PFHXS, perfluorohexane sulfonate; PFNA, perfluorononanoate; PFOA, perfluorooctanoate; PFOS, perfluorooctane sulfonate; PFUNDA, perfluoroundecanoate; PIP, posterior inclusion probability; SD, standard deviation; WQS, weighted quantile sum.
* Indicates statistical significance based on a Wald test.
Chapter 3
Multiclass regularized regression integrating prior information
3.1 Introduction
As data processing techniques advance, increasing numbers of researchers are focusing on using high-
dimensional data to predict categorical healthcare outcomes. For example, Ressom [90] described mul-
tiple algorithms and examples for the multiclass prediction of cancer phenotypes using genomics and
proteomics. Typically logistic or multinomial regression methods are used to model these categorical out-
comes. For feature selection, regularization is often used to select predictors since they can address the
potential correlation and collinearity issues that commonly occur in high-dimensional omic data. As an ex-
ample, LASSO, a type of regularized regression, introduces sparsity into the model to achieve variable selection by shrinking feature coefficients toward zero [108]. While this is advantageous for identifying
sparse models, for highly correlated data LASSO often selects a single predictor among a correlated set
of variables and removes all others. In contrast, ridge regression often performs slightly better with cor-
related data by distributing the effect across the set of highly correlated variables, but lacks the ability to
perform feature selection. Elastic-net (EN) [131], a combination of both approaches, is designed to balance
these extremes allowing for feature selection while also maintaining performance with highly correlated
variables.
With any feature selection approach, after model fitting, it is often helpful to use external information
to either fine-tune the model and/or improve the model’s interpretability. Examples of such prior knowl-
edge include gene functional annotations, pathway information, or, generally speaking, any information, continuous or categorical, that can characterize the variables [22, 56, 86, 106]. By analyzing selected fea-
tures using prior information, it is possible to find valuable biological insights to enhance predictive power,
variable selection, and interpretation [126, 26, 32].
Rather than enhancing the model performance or interpretation by using prior information in a post-
hoc manner, Zeng et al. [126] proposed a novel method, "external information informed tuning" (xtune),
that incorporates prior information (meta-features or "features of features") of any kind during the model-
ing process through prior-informed penalty parameters within a regularized regression framework. Com-
pared to standard LASSO with a single common penalty parameterλ for allp predictors, xtune specifies
a predictor-specific penalty parameter that is modeled as a log-linear function of prior covariates, which
directly connects the meta-features and the extent of shrinkage effect. Furthermore, instead of using cross-
validation to select the single common penalty parameter that controls the global shrinkage across all re-
gression coefficients[44], xtune uses an empirical Bayes approach to estimate the feature-specific penalty
parameters as a function of effect estimates corresponding to each prior covariate [126].
Previous work established the xtune framework to accommodate different types of penalties for continuous and binary outcomes. While xtune demonstrated improved prediction accuracy and variable se-
lection as compared to standard regularized regression, the current implementation has some limitations.
Firstly, xtune cannot conduct classification when the outcome has multiple categories. Secondly, the esti-
mation procedure may suffer from instability in certain cases.
Several related studies have investigated the concept of differential shrinkage in regularized regression.
Group LASSO [124] and adaptive LASSO [130] are examples of methods that allow for multiple tuning
parameters. However, group LASSO only permits different shrinkage for distinct groups of variables,
and neither group LASSO nor adaptive LASSO integrate meta-features for tuning parameter selection.
Other studies have incorporated prior knowledge to adaptively estimate tuning parameters. For instance,
Kawaguchi et al. [50] integrated meta-features into the hierarchical two-level ridge model to enable feature-
specific weighting for differential shrinkage. Van De Wiel et al. [111] utilized co-data to achieve group-
specific penalties. However, both methods were based on the ridge penalty, which lacks variable selection
capabilities. Additionally, Van De Wiel et al. [111] can only accommodate grouping information as the co-data, unlike xtune, which can accommodate various types of prior covariates. Velten and Huber [112] developed
the differentially penalized model with prior covariates using variational Bayes for convex penalties, but
it only applied to continuous and binary outcomes. In a work similar to xtune, Tay et al. [103] introduced
the method called "feature-weighted elastic net" (fwelnet) that incorporates prior covariates to establish
different penalties by scaling the tuning parameter according to different weights for the EN model. In
addition to a different estimation and inference procedure, fwelnet may still involve a cross-validation step for the tuning parameter selection, whereas xtune directly estimates feature-specific penalties.
Moreover, while fwelnet extended the model framework to the generalized linear model, it has not been
implemented for the multicategorical outcome data.
To this end, we propose the multicategorical xtune (mxtune) to address the above limitations of xtune
and related works. While building upon the xtune framework, mxtune integrates an EN penalty within a
multinomial logistic regression to: (i) model data with multiple outcome categories (K > 2) to conduct
multiclass classification; (ii) work with high-dimensional data with a more stable estimation algorithm.
Similar to xtune, rather than using cross-validation for selection of the penalty parameters, we implement
an empirical Bayes approach for stable and efficient estimation of the penalty parameter.
In this paper, we will specify the model and illustrate the estimation procedure. Through simulation
studies and real data examples, we demonstrate the improvement ofmxtune in prediction accuracy, effect
estimation, and variable selection.
3.2 Methods
3.2.1 Model Specification
Consider the baseline-category logit model for a multi-category outcome with K ≥ 2 levels:

\[ \log \frac{\Pr(Y = k)}{\Pr(Y = 1)} = X\beta_k, \quad k = 2, \dots, K \tag{3.1} \]

where Y is a response vector of length N, X is an N × p matrix of features, and β_k is a vector of length p that represents the corresponding feature coefficients for the kth category. In our approach, instead of using the traditional generalized multi-logit model with K - 1 logits, we used the more symmetric approach, the softmax function, to model the probability of being in the kth class out of K total categories:

\[ \sigma\{y_k(X; \beta_k)\} = P(Y = k \mid X) = \frac{e^{X\beta_k}}{\sum_K e^{X\beta_k}} \tag{3.2} \]

where the outcome y_k(X; β_k) has multiple outputs using 'one-of-K' encoding and each output level is associated with its own coefficients β_k.
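For reference, a direct R transcription of Equation (3.2) (added as a sketch; the row-max subtraction is a standard numerical-stability device and not part of the formula) is:

softmax_prob <- function(X, beta) {
  eta <- X %*% beta                 # N x K matrix of linear predictors X beta_k
  eta <- eta - apply(eta, 1, max)   # stabilize the exponentials row-wise
  exp(eta) / rowSums(exp(eta))      # N x K matrix of class probabilities
}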
The EN model imposes l_1 and l_2 penalties on the sum of the absolute values and the sum of the squared values of the coefficients, respectively [131]. Thus, the objective function of EN for the multi-category classification problem can be expressed as:

\[
\min_{\{\beta_{0k}, \beta_k\}_1^K} \left\{ -\frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{k=1}^{K} t_{ik}(\beta_{0k} + x_i \beta_k) - \log \sum_{k=1}^{K} e^{\beta_{0k} + x_i \beta_k} \right] + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} \left[ (1-c)\beta_{kj}^2/2 + c|\beta_{kj}| \right] \right\}.
\tag{3.3}
\]

Here, the negative log-likelihood is treated as the loss function. t denotes an N × K indicator matrix for the outcome responses. For regularization, λ is the penalty parameter that controls the intensity of shrinkage on the coefficients β, and cross-validation is often used to select the value of λ [44]. c is a constant that controls the weight of the l_1 and l_2 penalties and is often pre-specified. When c = 1 the model is equivalent to LASSO, and when c = 0 the model is equivalent to ridge regression.
Our proposed mxtune model extends the multi-classification regularized regression by first adding a feature-specific tuning parameter (λ_j) to the objective function in Equation (3.3) and then modeling these parameters with a corresponding log-linear function of prior covariates. It can be represented as:

\[
\min_{\{\beta_{0k}, \beta_k\}_1^K} \left\{ -\frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{k=1}^{K} t_{ik}(\beta_{0k} + x_i \beta_k) - \log \sum_{k=1}^{K} e^{\beta_{0k} + x_i \beta_k} \right] + \sum_{k=1}^{K} \sum_{j=1}^{p} \lambda_j \left[ (1-c)\beta_{kj}^2/2 + c|\beta_{kj}| \right] \right\},
\qquad \lambda_j = e^{Z_j \cdot \alpha}
\tag{3.4}
\]

where λ = (λ_1, ..., λ_p) and λ_j represents the penalty parameter for coefficient β_kj. α is a vector of length q that describes the effect of the corresponding prior covariates. In this case, instead of a single common tuning parameter for all features, we have prior-informed feature-specific penalty parameters obtained by integrating prior information into the modeling process. Furthermore, to control the number of tuning parameters, we constrained α to be the same across all outcome levels, which also keeps each tuning parameter the same across all levels.
We can estimate the β coefficients given λ by using the standard regularized regression method. For example, this procedure can be easily implemented with the glmnet package (v4.1.2) in the R environment [126, 31], as sketched below.
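As a minimal sketch of this step (assuming Z and a current estimate alpha_hat of the second-level coefficients are available), the feature-specific penalties λ_j can be passed to glmnet through its penalty.factor argument; because glmnet internally rescales the penalty factors to sum to the number of variables, they are paired here with a single global lambda equal to their mean so that each feature effectively receives λ_j:

library(glmnet)
lambda_j <- as.vector(exp(Z %*% alpha_hat))   # prior-informed penalties, Equation (3.4)
fit <- glmnet(X, Y, family = "multinomial",
              alpha = 0.5,                    # elastic-net mixing constant c
              penalty.factor = lambda_j,
              lambda = mean(lambda_j))        # single lambda value used for illustration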
3.2.2 Bayesian interpretation of the xtune multinomial Elastic-Net model
Since the estimation of β can follow a standard procedure and is easily implemented in application, we focus on the estimation of the tuning parameters λ. Since the mxtune model has a feature-specific penalty parameter vector of length p, tuning parameter selection methods such as cross-validation are not appropriate in consideration of computational efficiency. We extended the empirical Bayesian method proposed in [126] for an EN type of penalty to estimate the hyperparameters α.
[58] indicated that the EN method can be expressed in a Bayesian framework because the naive EN estimators β̂_EN are equivalently characterized by a marginal posterior mode of the regression coefficients. Specifically, the response variable Y follows a multinomial distribution, and the probability of being in a certain outcome category is measured by the softmax function described earlier (Equation (3.2)). The EN estimator β̂_EN is equivalent to the marginal conditional posterior mode of β | Y using the prior of β given by the combination of Gaussian and Laplacian priors [58]. The Bayesian formulation of the Elastic-Net model can be expressed as:

\[
\begin{aligned}
Y \mid X, \beta &\sim \text{Multinomial}(N, \sigma(X\beta)) \\
f(\beta_k) &\propto \exp\left\{ -\lambda \left[ (1-c)\beta_{kj}^2/2 + c|\beta_{kj}| \right] \right\}.
\end{aligned}
\tag{3.5}
\]
Since the mxtune model has the tuning parameter for each individual coefficient modeled as the log-linear function of the external covariates Z_j of X_j, the mxtune Bayesian model can be represented as:

\[
\begin{aligned}
Y \mid X, \beta &\sim \text{Multinomial}(N, \sigma(X\beta)) \\
f(\beta_k \mid \alpha) &\propto \exp\left\{ -\lambda_{kj} \left[ (1-c)\beta_{kj}^2/2 + c|\beta_{kj}| \right] \right\} \\
\log \lambda_j &= Z_j \alpha
\end{aligned}
\tag{3.6}
\]
3.2.3 Empirical Bayes approach for hyperparameter estimation
The sparse Bayesian learning framework described in [109] is utilized to estimate the hyperparameters α. The sparsity of the coefficients is introduced by using the Automatic Relevance Determination (ARD) prior [109] in the Bayesian expression in Equation (3.6), which concentrates the posterior probability of the associated coefficients at zero. Specifically:
1. For the current values of α, the 'most likely' β are obtained by finding the mode over β. Since

\[
\begin{aligned}
p(\{\beta_k\}_1^K \mid Y, \alpha) &\propto P(Y \mid \{\beta_k\}_1^K)\, p(\{\beta_k\}_1^K \mid \alpha) \\
P(Y \mid \{\beta_k\}_1^K) &= \prod_{n=1}^{N} \prod_{k=1}^{K} \sigma\{y_k(x_n; \beta_k)\}^{t_{nk}}
\end{aligned}
\tag{3.7}
\]

we can maximize the log likelihood of the posterior distribution:

\[
\begin{aligned}
l(\{\beta_k\}_1^K \mid Y, \alpha) &\propto \log\left[ P(Y \mid \{\beta_k\}_1^K) P(\{\beta_k\}_1^K \mid \alpha) \right] \\
&= \sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \log[\sigma\{y_k(x_n; \beta_k)\}] - \frac{1}{2}\beta^T V \beta
\end{aligned}
\tag{3.8}
\]

where V = diag(η) = diag( (2λ(1-c) + (cλ)²) / 2 ). This is a standard regularized multinomial logistic regression, and we can adopt a coordinate descent approach and perform partial Newton steps to find β. This procedure can be implemented with the glmnet package in R.
2. The empirical Bayes method estimates α by maximizing the marginal likelihood, which can be obtained by integrating β out of the joint distribution of Y and β. The marginal likelihood function can be expressed as:

\[
\begin{aligned}
L(Y \mid \alpha) &= \int P(Y \mid \{\beta_k\}_1^K) P(\{\beta_k\}_1^K \mid \alpha)\, d\beta \\
&= \int \exp\left\{ \log\left[ P(Y \mid \{\beta_k\}_1^K) P(\{\beta_k\}_1^K \mid \alpha) \right] \right\} d\beta.
\end{aligned}
\tag{3.9}
\]

Since the coefficients cannot be integrated out analytically, there is no direct closed-form expression for the marginal likelihood. However, to obtain a closed-form expression that approximates this marginal likelihood, we apply two procedures. First, we propose to use the L_2 norm to approximate the L_1 norm of the prior distribution of β in Equation (3.6) with the same prior variance [28, 126]. Thus,

\[
\begin{aligned}
f(\beta_k) &= \zeta_{\lambda_{kj}, c} \exp\left\{ -\lambda_{kj} \left[ (1-c)\beta_{kj}^2/2 + c|\beta_{kj}| \right] \right\} \\
&\approx \zeta_{\lambda_{kj}, c} \exp\left\{ -\left[ \lambda_{kj}(1-c)\beta_{kj}^2/2 + \frac{\beta_{kj}^2}{2\left( 2/(c\lambda_{kj})^2 \right)} \right] \right\} \\
&= N\left(0, \frac{2}{2\lambda_{kj}(1-c) + (c\lambda_{kj})^2}\right).
\end{aligned}
\tag{3.10}
\]

When c = 1, the approximation is equivalent to the Bayesian LASSO, and when c = 0, to the Bayesian ridge.
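A quick numerical check of the prior precision implied by Equation (3.10), η = (2λ(1-c) + (cλ)²)/2, confirms the two limits (a small added R sketch):

eta <- function(lambda, c) (2 * lambda * (1 - c) + (c * lambda)^2) / 2
eta(3, 1)   # 4.5 = lambda^2 / 2, the Bayesian LASSO limit (c = 1)
eta(3, 0)   # 3.0 = lambda, the Bayesian ridge limit (c = 0)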
Secondly, we implemented the Laplace approximation to the log-posterior around the mode [109, 67]. Then, by using the fact that the gradient vector of the log-posterior (Equation (3.8)) equals zero at the mode, we can obtain the mean and covariance of the Laplace approximation:

\[
\begin{aligned}
\beta_k^* &= \Sigma_k X^T B_k \hat{y}_k \\
\Sigma_k &= -H_k^{-1} = (X^T B_k X + V_k)^{-1} \\
\hat{y}_k &= X\beta_k^* + B_k^{-1}(t_k - \sigma(y_k))
\end{aligned}
\tag{3.11}
\]

where B_k is an N × N diagonal matrix for the kth category with elements b_i = σ(y_i)[1 - σ(y_i)]. Here, the mode of the posterior distribution is identical to the mean of the Laplace approximation, and the negative inverted Hessian corresponds to the covariance matrix of the Laplace approximation. However, the Hessian matrix is a (pK) × (pK) block matrix, which yields an additional K³ computational cost compared to binary classification [10]. To simplify, we treat each output level as independent [76]. By doing so, the resulting approximated marginal likelihood function can be expressed as:
\[
L(Y \mid \alpha) = \int P(Y \mid \{\beta_k\}_1^K) P(\{\beta_k\}_1^K \mid \alpha)\, d\beta \approx \prod_K P(Y \mid \beta_k^*) P(\beta_k^* \mid \alpha) (2\pi)^{p/2} |\Sigma_k|^{1/2}
\tag{3.12}
\]

And the approximated marginal log likelihood takes the form:

\[
l(Y \mid \alpha) = \sum_K -\frac{1}{2} \left[ N\log(2\pi) + \log|C_k| + \hat{y}_k^T C_k^{-1} \hat{y}_k \right]
\tag{3.13}
\]

where C_k = B_k^{-1} + X V^{-1} X^T and V^{-1} = diag( 2 / (2λ(1-c) + (cλ)²) ).
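As a small added sketch (the vectors b_k, eta_j, and yhat_k are hypothetical stand-ins for the diagonal of B_k, the prior precisions η, and the working response ŷ_k), one class-k term of Equation (3.13) can be evaluated in R as:

C_k  <- diag(1 / b_k) + X %*% diag(1 / eta_j) %*% t(X)   # C_k = B_k^{-1} + X V^{-1} X^T
ll_k <- -0.5 * (length(yhat_k) * log(2 * pi) +
                as.numeric(determinant(C_k, logarithm = TRUE)$modulus) +
                drop(t(yhat_k) %*% solve(C_k, yhat_k)))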
3.2.4 Optimizing the approximated marginal likelihood function using an l_2 iterative re-weighted method
The optimization process becomes complicated due to the non-convex nature of the objective function depicted in Equation 3.13. When the number of features is large, traditional updating algorithms such as the EM algorithm, the L-BFGS algorithm, or a gradient descent algorithm will have low computational efficiency since they need to calculate the inverse of a p × p matrix for each output level in every iteration. To address this, [126] suggested an iterative reweighted-L_2 algorithm for updating the hyperparameters. This algorithm was motivated by the similarly specified ARD model [68, 78, 109]. Then, a Majorization-Minimization (MM) procedure optimizes the reformulated ARD by converting this non-convex optimization into a series of easier convex optimizations by solving reweighted-L_1 problems [118, 119, 126].
To this end, we modified the iterative reweighted-L_2 algorithm described in [126] and extended it to the multi-classification EN model. Specifically, the approximated log likelihood function (Equation 3.13) consists of both convex and concave parts. For the concave part log|C_k|, the majorization function is:

\[
\theta = \sum_K \nabla_\alpha \log|C_k| = \sum_K \text{diag}[X^T C_{k,\alpha} X]
\tag{3.14}
\]

which is the first derivative of the concave function at the current α. For the convex part ŷ_k^T C_k^{-1} ŷ_k, [126] proposed an MM procedure instead of using a majorization function only to optimize α, again due
to computational considerations. By letting X_B = B^{1/2} X and Y_{B,k} = B^{1/2} Y_k, the convex term ŷ_k^T C_k^{-1} ŷ_k can be re-formulated as:

\[
\min_\delta \sum_K \left[ \|Y_{B,k} - X_B \delta_k\|^2 + \sum_p \eta_j \delta_{k,j}^2 \right].
\tag{3.15}
\]
Here, an auxiliary term δ is introduced with an upper-bound auxiliary function for l_θ(α):

\[
l_\theta(\alpha, \delta_k) \triangleq \theta^T \frac{1}{\eta} + \sum_K \left[ \|Y_{B,k} - X_B \delta_k\|^2 + \sum_p \eta_j \delta_{k,j}^2 \right] \geq l_\theta(\alpha).
\tag{3.16}
\]
By iteratively updating δ and α in Equation 3.16, we can obtain the estimates of α [126]. Thus, for any δ, α is estimated by optimizing:

\[
\min_\alpha \sum_K \left[ \theta^T \frac{1}{\eta} + \sum_p \eta_j \delta_{k,j}^2 \right],
\tag{3.17}
\]

and at a given α, δ is updated by optimizing:

\[
\min_\delta \sum_K \left[ \|Y_{B,k} - X_B \delta_k\|^2 + \sum_p \eta_j \delta_{k,j}^2 \right].
\tag{3.18}
\]
Equations 3.17 and 3.18 can be directly optimized using the optim and glmnet functions in R, respectively. Furthermore, we constrain α through the bounding of λ using a pathwise coordinate descent, which gives a warm start to the algorithm and stabilizes the estimation [31]. The maximum value of α is bounded by:

\[
\min_k \max_j \frac{\left| \sum_{i=1}^{N} [Y_{i,k} - \bar{Y}_k(1 - \bar{Y}_k)] X_{ij} \right|}{cN}.
\tag{3.19}
\]
This l_2 iterative re-weighted optimization algorithm is summarized in Algorithm 1. After obtaining α, we can calculate λ and estimate the β coefficients using standard multinomial likelihood optimization methods.
Algorithm 1: Iterative reweighted L_2 algorithm for mxtune
Step 1: Initialize α with α_0 (e.g., α_i = 0, ∀ i = 1, 2, ..., q);
while not converged do
    Step 2: Given α, update θ:
        θ_new ← Σ_K diag[X^T C_{k,α} X]
    Step 3: Given θ, update α by the following inner loop:
        Step 3.1: Initialize α with the estimates from the last iteration.
        Step 3.2: Update δ given α:
            δ ← argmin_δ Σ_K [ ||Y_{B,k} - X_B δ_k||² + Σ_p η_j δ²_{k,j} ]
        Step 3.3: Update α given δ:
            α ← argmin_α Σ_K [ θ^T (1/η) + Σ_p η_j δ²_{k,j} ]
        Step 3.4: Repeat Step 3.2 and Step 3.3 until convergence.
end
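The outer structure of Algorithm 1 can be sketched in R as follows (added for orientation; update_theta, update_delta, and update_alpha are hypothetical helpers standing in for Equations 3.14, 3.18, and 3.17):

mxtune_outer <- function(X, Y, Z, q, tol = 1e-4, max_iter = 100) {
  alpha <- rep(0, q)                           # Step 1: initialize alpha at zero
  for (iter in seq_len(max_iter)) {
    theta <- update_theta(X, Y, Z, alpha)      # Step 2: majorize the concave part
    alpha_new <- alpha                         # Step 3.1: warm start from last iteration
    repeat {                                   # Steps 3.2-3.4: inner delta/alpha loop
      delta      <- update_delta(X, Y, Z, alpha_new)
      alpha_next <- update_alpha(theta, delta, Z)
      if (max(abs(alpha_next - alpha_new)) < tol) { alpha_new <- alpha_next; break }
      alpha_new <- alpha_next
    }
    if (max(abs(alpha_new - alpha)) < tol) return(alpha_new)
    alpha <- alpha_new                         # continue the outer loop
  }
  alpha
}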
3.3 Results
To demonstrate its performance, we compared mxtune with the standard multinomial EN through simulation studies and real data examples. We also examined the performance of mxtune without using prior information (non-informative mxtune), which is a re-parameterization of the standard EN except that it estimates the penalty parameters using mxtune's empirical Bayes approach, to see how it performs when prior information does not exist. Non-informative mxtune is implemented by using a prior covariate matrix with a single column of ones, equivalent to an intercept-only model for the tuning parameters. All
results are based on outputs from R (v4.0.3; [88]). The glmnet [31] package was used to perform the standard EN model with the penalty parameter selected using the cross-validation method. For the analysis, the constant c that regulates the weight of the l_1 and l_2 penalties in the Elastic-Net model is set to 0.5.
3.3.1 Simulation studies
3.3.1.1 Simulation settings
We generated N = 600 observations for each simulation run. The X matrix follows a p-dimensional multivariate normal distribution with mean vector zero and covariance matrix Σ_{i,j} = ρ^{|i-j|}. The hyperparameters α were arbitrarily set to (-2, -1, 0, 1, 2). Thus, the prior information Z is a p × q matrix (q = 5), and the entries of Z equal zero or one with probability 0.9 or 0.1, respectively, indicating the grouping of each predictor. The true regression coefficients β were generated from a double exponential distribution with location parameter zero and scale parameter e^{Zα}. θ percent of the β were set to zero to achieve sparsity of the regression coefficients. The outcome Y was generated according to a K-level multinomial distribution using model-based probabilities. A sketch of this data-generating design is given below.
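The following R sketch illustrates the design for the base setting of scenario 1 (rmvnorm comes from the mvtnorm package, and the double exponential draws combine exponential magnitudes with random signs; the final multinomial step is only indicated in a comment):

library(mvtnorm)
set.seed(1)
N <- 600; p <- 800; q <- 5; rho <- 0.5; theta <- 0.9
Sigma <- rho^abs(outer(1:p, 1:p, "-"))           # covariance Sigma_ij = rho^|i-j|
X <- rmvnorm(N, sigma = Sigma)                   # predictors
Z <- matrix(rbinom(p * q, 1, 0.1), p, q)         # prior covariates: 1 with probability 0.1
alpha <- c(-2, -1, 0, 1, 2)
scale_j <- as.vector(exp(Z %*% alpha))           # prior-driven scale e^{Z alpha}
beta <- sample(c(-1, 1), p, replace = TRUE) * rexp(p, rate = 1 / scale_j)
beta[sample(p, floor(theta * p))] <- 0           # set theta percent of coefficients to zero
# Y is then drawn from a K-level multinomial using softmax probabilities of X beta.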
The simulated dataset was split into two independent sets of data: a training set (n = 400) and a testing
set (n = 200). The analysis models were trained using the training data set, and each model’s performance
was evaluated using the testing data set. Simulation studies were conducted under different scenarios
with performance evaluated using three metrics: prediction accuracy, coefficient estimation, and model
sparsity. The prediction performances of the mxtune, non-informative mxtune, and standard EN models
were assessed by the generalized Area under the ROC Curve (AUC) [42] which can assess the prediction
accuracy for the multicategorical classification. The coefficient estimation performance was evaluated by
the bias of true predictors and the probability of selecting non-predictors. The sparsity was compared
using the number of features selected by each method.
We assess model performance under five scenarios by varying five parameters of interest: (i) the number of features p, (ii) correlations among predictors ρ, (iii) the proportion of true non-effective predictors (zero βs) θ, (iv) the proportion of prior information identifying the true causal features κ, and (v) the number of outcome categories K. One thousand replicates were performed for each scenario. The five scenarios are summarized below:
1. The number of features p = 400, 800, 1,200. Other parameters are fixed at ρ = 0.5, θ = 0.9, κ = 100%, and K = 3 for each scenario.
2. Correlations among predictors ρ = 0.5, 0.7, and 0.9, with p = 800, θ = 0.9, κ = 100%, and K = 3.
3. Proportion of true non-effective predictors θ = 0.1, 0.5, 0.9, with other parameters fixed at p = 800, ρ = 0.5, κ = 100%, and K = 3.
4. Proportion of prior information identifying the true causal features κ = 70%, 80%, 90%, which represents the informativeness of the prior information. Other parameters are fixed at p = 800, ρ = 0.5, θ = 0.9, and K = 3. For each scenario, (1 - κ) percent of the rows of the Z matrix have "flipped" values, meaning entries of 0 were arbitrarily set to 1, and vice versa.
5. The number of outcome categories K = 2, 3, 6. For each scenario, other parameters are fixed at p = 800, ρ = 0.5, θ = 0.9, and κ = 100%.
3.3.1.2 Simulation results
Prediction, coefficient estimation, and sparsity of all methods under the five scenarios are summarized in
Figure 3.1, Figure 3.2, and Figure 3.3, respectively. For the prediction performance, the AUC of mxtune is
higher than that of standard EN in all scenarios. In addition, the AUC of mxtune without using external
information is similar to the AUC of standard EN. Specifically, with a larger number of predictors, all
methods have a decreasing predictive performance while the mxtune still outperforms the standard EN
(Figure 3.1 (a)). The AUC of all methods increases with a higher correlation among predictors, while the
improvement of mxtune prediction becomes smaller as the correlation increases (Figure 3.1 (b)). From
Figure 3.1 (d), there is decreased improvement of mxtune for prediction as the informativeness of the prior covariates lessens. When only 50% of the prior information is available to identify true predictors, meaning that the prior knowledge consists of random noise, the prediction accuracy of both mxtune and the non-informative mxtune is equivalent to that of the standard EN (Figure B.2(a)).
Regarding the coefficient estimation, while the estimation bias of true predictors is indistinguishable among all three models in every scenario (Table B.4), the mxtune methods exhibit a lower probability of se-
lecting incorrect predictors compared to the standard EN (Figure 3.2, Figure B.2 (b)). The false positive rate
of all methods rises with a sparser or more correlated model, and it decreases as the number of predictors
increases.
In terms of sparsity, mxtune and non-informative mxtune select fewer features than the standard EN does
in all scenarios (Figure 3.3, Figure B.2 (c)), reflecting the selection of a more parsimonious model using the
mxtune method as compared to using standard EN.
In summary, the simulation studies demonstrate that mxtune outperforms the standard EN model when prior information exists, improving prediction accuracy with a more parsimonious model and enhancing the overall estimation as well. Even when no prior information is present, mxtune surpasses the standard EN in most scenarios. Factors such as the correlation among predictors and the informativeness of the prior covariates influence the performance of mxtune. For example, with less correlated predictors and highly informative prior information, mxtune shows a greater improvement in prediction and feature selection.
3.3.2 Real data examples
We further illustrate the performance of mxtune using three applied examples: prediction of breast cancer survival based on gene expression data, prediction of prostate cancer status using gene expression data, and prediction of liver injury status using correlated environmental exposome data.
Figure 3.1: Prediction accuracy comparing the standard EN with mxtune and mxtune without using prior information in five scenarios. (a) Varying p with ρ = 0.5, θ = 0.9, κ = 100%, K = 3. (b) Varying ρ with p = 800, θ = 0.9, κ = 100%, K = 3. (c) Varying θ with p = 800, ρ = 0.5, κ = 100%, K = 3. (d) Varying κ with p = 800, ρ = 0.5, θ = 0.9, K = 3. (e)(f) K = 2 and K = 6 with p = 800, ρ = 0.5, θ = 0.9, κ = 100%.
Figure 3.2: Probability of selecting wrong predictors comparing the standard EN with mxtune and mxtune without using prior information in five scenarios. (a) Varying p with ρ = 0.5, θ = 0.9, κ = 100%, K = 3. (b) Varying ρ with p = 800, θ = 0.9, κ = 100%, K = 3. (c) Varying θ with p = 800, ρ = 0.5, κ = 100%, K = 3; the scales of the misclassification rate vary across the three conditions. (d) Varying κ with p = 800, ρ = 0.5, θ = 0.9, K = 3. (e)(f) K = 2 and K = 6 with p = 800, ρ = 0.5, θ = 0.9, κ = 100%.
Figure 3.3: Model sparsity comparing the standard EN with mxtune and mxtune without using prior information in five scenarios. (a) Varying p with ρ = 0.5, θ = 0.9, κ = 100%, K = 3. (b) Varying ρ with p = 800, θ = 0.9, κ = 100%, K = 3. (c) Varying θ with p = 800, ρ = 0.5, κ = 100%, K = 3. (d) Varying κ with p = 800, ρ = 0.5, θ = 0.9, K = 3. (e)(f) K = 2 and K = 6 with p = 800, ρ = 0.5, θ = 0.9, κ = 100%.
3.3.2.1 Breast cancer data example
In this example, we aim to predict breast cancer survival at five years (survived or dead) based on patients' gene expression profiles and clinical features. The data come from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) cohort [25]. Details about the study design and data acquisition can be found on their webpage https://ega-archive.org/dacs/EGAC00001000484. The METABRIC study contains 29,475 gene expression features and three clinical variables (age at diagnosis, lymph node status, and whether a patient has a progesterone receptor (PR) negative type of breast cancer). The data consist of a training set with 594 subjects and a testing set with 564 subjects. The prior covariates contain information about whether a gene was identified as prognostic for cancer in a previous study [17]. Specifically, Z is a 29,478 × 4 matrix where the entries are binary values indicating whether a gene is associated with mitotic chromosomal instability, mesenchymal transition, or lymphocyte-based immune recruitment, and whether a feature belongs to the set of clinical variables. This last covariate allows the clinical features to have a penalty parameter separate from the gene expression variables.
Table 3.1 summarizes the model performance of mxtune, mxtune without prior information (i.e., a Z matrix with a single column of 1s), the feature-weighted elastic net implemented in the fwelnet (v0.1) R package, and the standard EN implemented in the glmnet package. Compared to the standard EN and fwelnet, mxtune achieves better prediction performance (AUC_EN = 0.709 vs. AUC_fwelnet = 0.710 vs. AUC_mxtune = 0.753) with a more parsimonious model (P_EN = 156 vs. P_fwelnet = 156 vs. P_mxtune = 94). The second-level coefficients are estimated by mxtune as α = (-2.33, -0.34, -0.95, -0.41). These estimates translate to feature-specific penalties, λ = exp(Zα). Thus, the corresponding feature coefficient is shrunk less if a feature is a clinical variable or if a gene has previously been demonstrated to be prognostic for breast cancer (associated with mitotic chromosomal instability, mesenchymal transition, or lymphocyte-based immune recruitment). In addition, even without the prior information, the mxtune approach to estimating an overall shrinkage outperforms the standard EN both in prediction accuracy (AUC_NI-mxtune = 0.723) and in selecting fewer predictors (P_NI-mxtune = 109). In regard to computational efficiency, both mxtune and non-informative mxtune have a longer computational time than the standard EN using the glmnet package.
Table 3.1: Model prediction accuracy (AUC), sparsity, and computational time (in minutes) comparing the standard EN, fwelnet, mxtune, and mxtune without using the prior information.

                        AUC    # of selected features   Computation time
mxtune                  0.753  94                       15.77
non-informative mxtune  0.723  109                      2.23
fwelnet                 0.710  156                      2.82
standard EN             0.709  156                      0.67
3.3.2.2 Prostate cancer data example
In the second example, we compared the methods' performance using a prostate cancer example with gene expression data. Specifically, the outcome is a prostate cancer patient's disease status (no evidence of disease (NED) or clinical recurrence (CR)) after radical prostatectomy. The data come from [97] and consist of 187 observations and 29,336 gene expression features; 154 participants have NED after prostatectomy and the other 33 participants have CR after prostatectomy. The prior covariate data Z consist of three indicator variables that were constructed according to 24 previous studies that identified multiple genes associated with prostate cancer [102, 64, 30, 93, 51, 65, 101, 6, 63, 116, 74, 105, 48, 123, 80, 37, 121, 85, 127, 20, 110, 82, 57, 94]. If a gene was identified in any one of these studies, the prior covariate value is coded as 1 in Z_1; if a gene was identified by two of these studies, it is coded as 1 in Z_2; and if a gene was identified by three of these studies, it is coded as 1 in Z_3; otherwise, it is coded as 0. Thus, the prior data set Z is a 29,336 × 3 matrix. Data were randomly divided into a training set with 70% of observations and a testing set with 30% of observations. We repeated this process 100 times, and all results are based on these 100 replicates resampling the training and testing subsets of observations.
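The coding rule above can be made concrete with a short sketch. Everything here is hypothetical (a simulated count of identifying studies, and we assume the indicators are cumulative, i.e., "identified by at least k studies"); it is meant only to illustrate how such a Z could be constructed in R:

# Hypothetical sketch of the prior-matrix coding described above.
# n.studies[j] counts how many prior studies identified gene j (simulated).
set.seed(1)
n.studies <- sample(0:3, 29336, replace = TRUE, prob = c(0.90, 0.06, 0.03, 0.01))
Z <- cbind(Z1 = as.numeric(n.studies >= 1),   # identified by at least one study
           Z2 = as.numeric(n.studies >= 2),   # identified by at least two studies
           Z3 = as.numeric(n.studies >= 3))   # identified by at least three studies
dim(Z)  # 29336 x 3, matching the prior data dimensions described above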
Results are displayed in Figure 3.4 and Table 3.2. With the prior information, mxtune demonstrates a substantial improvement in AUC (AUC_mxtune = 0.766 vs. AUC_fwelnet = 0.705 vs. AUC_EN = 0.703).
Table 3.2: Selection stability across 100 replicates comparing the standard EN, fwelnet, mxtune, and mxtune without using the prior information.

Percentage                                    mxtune   non-informative mxtune   fwelnet   Standard EN
Unselected predictors                         99.983   95.473                   96.772    93.605
Predictors selected over half of replicates   40.000   0.527                    0.739     0.267
The average estimated α by mxtune across the 100 replicates is α = (0.403, 0.156, -9.271), which means Z_3 has more influence on the regularization since it carries more evidence from prior studies. Without prior information, mxtune has only a slight increase in prediction accuracy (AUC_NI-mxtune = 0.717 vs. AUC_fwelnet = 0.705 vs. AUC_EN = 0.703). For variable selection and sparsity, mxtune outperforms the standard EN and fwelnet by yielding a more parsimonious model and providing a more stable predictor selection. Using the prior information, mxtune achieves better prediction performance than the existing methods with an average of only 2.25 predictors, whereas the standard EN selected an average of 86.76 predictors and fwelnet selected an average of 44.01 predictors (Figure 3.4). When considering selection stability, mxtune exhibits the highest percentage (99.983%) of unselected predictors, surpassing the non-informative mxtune (95.473%), fwelnet (96.772%), and the standard EN (93.605%), which indicates that mxtune consistently narrows down the pool of predictors of interest across the 100 replicates. Moreover, when considering predictors that were selected by each method at least once, 40% of the predictors identified by mxtune were consistently selected in over 50 replicates. In contrast, less than 1% of the predictors identified by the non-informative mxtune (0.527%), fwelnet (0.739%), and the standard EN (0.267%) appeared in over 50 replicates. This highlights the superior consistency and stability of mxtune in selecting reliable predictors compared to the standard methods.
3.3.2.3 Environmental exposures example

To further illustrate the advantages of mxtune when the number of outcome categories is greater than two, we applied our algorithm to an environmental epidemiology problem. In this example, we want to
Figure 3.4: Performance of predictive models for prostate cancer patients’ status. (a) AUC of standard EN,
fwelnet, mxtune, and non-informative mxtune. (b) Number of predictors selected by standard EN, fwelnet,
mxtune, and non-informative mxtune.
build a predictive model for categorical liver injury status (low, medium, and high) using participants' demographic features and correlated exposome exposures. The data are derived from a data challenge event held by the Barcelona Institute for Global Health [71]. In our example, we used the subset of complete data, which consists of 1,006 observations and 41 predictors including demographic features, persistent organic pollutants (POPs), metals, phthalates, per- and polyfluoroalkyl substances (PFAS), and phenols. The liver biomarkers ALT, AST, and GGT are used to create categorical indicators for liver injury. If a person's value exceeds the 90th percentile in any of ALT, AST, or GGT, this person has high liver injury. If a person's value exceeds the 70th percentile but is below the 90th percentile in any of ALT, AST, or GGT, then this person has medium liver injury. Otherwise, the person has low liver injury. The prior covariates Z contain indicator variables grouping the exposures into the chemical classes of POPs, metals, phthalates, PFAS, or phenols. Therefore, the prior dataset Z is a 41 × 5 data matrix. Data were randomly divided into a training set with 70% of observations and a testing set with 30% of observations. All results are based on 100 replicates.
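The categorization rule above is easy to express in code. A minimal sketch (with simulated biomarker vectors, not the actual challenge data) is:

# Hypothetical sketch of the liver-injury categorization described above,
# using simulated ALT/AST/GGT values in place of the real biomarkers.
set.seed(1)
alt <- rnorm(1006); ast <- rnorm(1006); ggt <- rnorm(1006)
above <- function(x, q) x > quantile(x, q)  # exceeds the q-th percentile
high <- above(alt, 0.9) | above(ast, 0.9) | above(ggt, 0.9)
med  <- !high & (above(alt, 0.7) | above(ast, 0.7) | above(ggt, 0.7))
injury <- factor(ifelse(high, "high", ifelse(med, "medium", "low")),
                 levels = c("low", "medium", "high"))
table(injury)  # three-level categorical outcome used for the multiclass model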
Figure 3.5 summarizes the prediction performance and variable selection of all methods. While AUC remains similar for all three models, mxtune and non-informative mxtune select fewer predictors than the standard EN model does in all three outcome categories. The more parsimonious model selected by mxtune may be helpful for identifying important prognostic factors and for simplifying the interpretation of results. Furthermore, the averages of the estimated α are -1.222, -1.230, 1.139, -1.316, and 2.629, which suggest that POPs, metals, and PFAS might be the more important chemical groups since they receive less shrinkage than phthalates and phenols. The overall limited improvement in prediction might reflect the relatively low dimension of the data. As seen in the previous examples, with a larger p we might observe more separation in the improvement in prediction due to more variation in the selection of the number and informativeness of the predictors.
Figure 3.5: Performance of predictive models for categorical liver injury. (a) AUC of mxtune, non-informative mxtune, and standard EN. (b) Number of predictors selected by mxtune, non-informative mxtune, and standard EN.
3.4 Discussion
In this chapter, we present a novel method, mxtune, that enables feature-specific shrinkage by incorporating external information into the penalty parameters of the elastic-net model for multi-classification problems. Instead of selecting a single common penalty via cross-validation, mxtune uses an empirical Bayes approach to estimate feature-specific penalty parameters by maximizing the approximated marginal likelihood. This estimation method is also referred to as Type II maximum likelihood estimation, which is utilized in the relevance vector machine (RVM), a probabilistic Bayesian learning framework for support vector machines (SVM) [109]. Notably, when using the marginal likelihood to derive the objective function, we approximated the multinomial Hessian matrix by treating each outcome category as independent. While this approximation may not capture the true Hessian perfectly [76], it resolves the computational difficulty efficiently.
Through simulation studies and real data examples, we demonstrate that mxtune exhibits superior performance in terms of prediction accuracy and feature selection compared to existing methods for classification problems. Notably, in the real data examples, mxtune demonstrates a more stable and reliable selection ability than the standard EN method. The performance improvement of mxtune is further enhanced by incorporating prior knowledge, but even in the absence of prior information, mxtune is capable of producing a more parsimonious model without sacrificing prediction accuracy. This aspect is particularly beneficial for researchers, as it allows them to streamline the selection of important predictors, thereby facilitating the interpretation of biological conclusions in practical applications.
Prior information is critical for the use of mxtune. Prior covariates can be either continuous or indicator variables derived from related qualitative or quantitative results. Since the simulation studies show that the strength of the improvement in prediction accuracy relies on the informativeness of the prior data, it is important to optimize the prior data's quality. A thorough literature review of the study domain, a deep understanding of the research question, and expert insights in the field may help determine what should be included in the prior data. Furthermore, rather than the log-linear function we used here, future research could also consider modeling the relationship between penalty parameters and prior data with other functions, such as the link functions of generalized linear models.
Regarding computational efficiency, mxtune takes longer to run than a standard EN. The computation time is nearly linear in the number of observations N, the number of predictors p, the number of outcome categories K, and the number of columns in Z, with the latter having a substantial impact on the running time (Figure B.3). Although mxtune can theoretically handle any number of covariates in Z, in practice we want to include a reasonable number of prior covariates to keep the prior knowledge informative and the algorithm efficient. Currently, we recommend that the prior data include fewer than 10 prior covariates. With computational enhancements such as parallel computing, we can extend the capability of our algorithm to larger datasets.
The software implementation of mxtune can be found in the xtune R package, along with the xtune linear LASSO function, both available for download from CRAN. Currently, our method focuses on analyzing individual-level data. In the future, we also want to extend the xtune framework to predict health outcomes using genome-wide association study (GWAS) summary statistics so that our algorithm can be used to address more state-of-the-art questions in the field of genomics.
Chapter 4

Data-informed tuning of regularized regression with summary statistics from multi-population studies
4.1 Introduction
Recent genome-wide association studies (GWAS) have made significant contributions to our understanding of the genetic basis of various complex traits and diseases. For example, Khera et al. [55] conducted a GWAS polygenic risk score (PRS) study to develop polygenic scores for body mass index (BMI) and assess their association with obesity-related traits. The study utilized large-scale GWAS data to construct a PRS for BMI, and the researchers demonstrated that individuals with a high PRS for BMI had an increased risk of obesity and related cardiometabolic traits, highlighting the potential of PRS for identifying individuals at risk for common diseases. Similarly, Franke et al. [29] performed a GWAS PRS study to examine the genetic influences on schizophrenia and their relationship with subcortical brain volumes. The study involved a large-scale analysis of PRS derived from a schizophrenia GWAS meta-analysis and its association with brain imaging and cognitive data. The researchers demonstrated the utility of PRS in predicting schizophrenia risk and its potential links to brain structural variations and cognitive impairments.
Analysis of genetic summary statistics and PRS thus plays an important role in understanding GWAS results. As discussed in Chapter 1, penalized methods, such as lasso and ridge regression, are commonly employed when modeling with genetic summary statistics. By specifying different prior distributions in a Bayesian framework, corresponding penalties can be applied to shrink large values of single nucleotide polymorphism (SNP) coefficients. Sometimes, external knowledge about SNPs can inform the penalization. In Chapter 3, we introduced xtune [126], a method proposed to introduce feature-specific penalties by integrating prior information. However, the current version of xtune can only handle individual-level data and cannot be applied to summary-statistics-level data. To address this limitation, Newcombe et al. [79] proposed a scalable Bayesian framework called JAM, which can accommodate summary statistics by transforming the individual-data likelihood into a multivariate normal likelihood of the summary data. Building upon JAM, Shen et al. [98] developed a fine-mapping method called mJAM that utilizes marginal summary statistics from multi-population studies to address population stratification problems. To further advance this area of research, we propose xtune-mJAM, a novel method that incorporates the JAM and mJAM methods into the xtune framework. This integration allows for feature-specific penalties in regularized regression using summary statistics from multi-population studies. By leveraging the strengths of xtune, JAM, and mJAM, our proposed method aims to enhance the accuracy and efficiency of GWAS analyses with summary statistics from diverse populations.
Indeed, there are other related studies that focus on penalized methods for summary statistics in the context of GWAS. For instance, Mak et al. [72] extended penalized regression techniques such as lasso and elastic-net (EN) to handle summary-statistics-level data. This extension allows the use of summary statistics in regression models, which can be more computationally efficient and enables analyses of large-scale GWAS datasets. On the Bayesian side, Ge et al. [35] introduced a Bayesian regression framework with continuous shrinkage priors specifically designed for summary statistics analysis. Bayesian approaches offer advantages in incorporating prior information and handling uncertainty in parameter estimation.
However, the existing methods mentioned above do not fully utilize external information to guide the selection of tuning parameters. This is where our proposed method, xtune-mJAM, stands out. It distinguishes itself from the aforementioned approaches in two main ways: (1) our method allows for feature-specific penalties, which can be guided by external information; this feature-specific penalization can lead to more accurate and interpretable results by highlighting the importance of different features in the analysis; and (2) the integration of mJAM into the framework addresses challenges related to population stratification, which is crucial when dealing with data from diverse populations. This helps to ensure the robustness and generalizability of the results across different populations.

In summary, the xtune-mJAM method represents an innovative approach that not only leverages external information for tuning parameter selection but also addresses critical population stratification concerns. This comprehensive and data-driven approach has the potential to advance GWAS analysis with summary statistics and provide valuable insights into the genetic basis of complex traits and diseases across diverse populations.
4.2 Methods
4.2.1 Model specification and Bayesian interpretation
As discussed in Chapter 1, EN regression shrinks the regression coefficients by imposing a combination of $l_1$ and $l_2$ penalties on the regression coefficients. The objective function of EN is:

$$\min_{\beta \in \mathbb{R}^p} \; \|Y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} \left[ (1-c)\beta_j^2/2 + c|\beta_j| \right] \qquad (4.1)$$
When c = 1, the model becomes the lasso, and when c = 0, it is ridge regression.
By adapting the xtune framework, we model λ as a function of the prior information matrix Z. Thus, the EN model can be expressed as:

$$\min_{\beta \in \mathbb{R}^p} \; \|Y - X\beta\|_2^2 + \sum_{j=1}^{p} \lambda_j \left[ (1-c)\beta_j^2/2 + c|\beta_j| \right], \qquad \lambda_j = e^{Z_j \alpha} \qquad (4.2)$$
The EN linear regression model can be equivalently formulated as a Bayesian or random-effects model in which the coefficients have a prior distribution, conditioned on σ², that compromises between the Gaussian and the double exponential (also known as Laplace) distributions [58]. In this context, we make the specific assumption that the response variable Y, conditioned on the regression coefficients β and σ², follows a normal distribution. The coefficients estimated by the EN model, which solve Equation 4.1, can alternatively be described as the posterior mode or maximum a posteriori (MAP) estimates within a Bayesian model characterized by:

$$Y \mid X, \beta \sim N(X\beta, \, \sigma^2 I_n)$$
$$f(\beta_j) \propto \exp\left\{ -\lambda \left[ (1-c)\beta_j^2/2 + c|\beta_j| \right] \right\}. \qquad (4.3)$$

The prior f(β|σ²) is a compromise between the Gaussian prior of ridge regression and the double exponential prior of lasso regression. Given a fixed value of σ², the EN estimator β̂_EN corresponds to the marginal posterior mode of β|Y, σ². From the previous discussion, we can formulate the following hierarchical model to incorporate external information into EN regression:

$$Y \mid X, \beta \sim N(X\beta, \, \sigma^2 I_n)$$
$$f(\beta_j) \propto \exp\left\{ -\lambda_j \left[ (1-c)\beta_j^2/2 + c|\beta_j| \right] \right\}$$
$$\log \lambda_j = Z_j \alpha \qquad (4.4)$$
4.2.2 Obtaining the approximated marginal likelihood using an empirical Bayes approach

As xtune-mJAM involves a substantial number of hyperparameters (α, a 1 × q vector), employing cross-validation for tuning parameter selection, as traditionally done in regularized regression, becomes impractical. Therefore, we propose to estimate the hyperparameters using an empirical Bayes approach [109]. Given σ², the likelihood of α can be computed by integrating β out of the joint distribution of Y and β. The resulting marginal likelihood function can be expressed as:

$$L(Y \mid \alpha) = \int P(Y \mid \beta) P(\beta \mid \alpha) \, d\beta = \int \exp\{\log[P(Y \mid \beta) P(\beta \mid \alpha)]\} \, d\beta. \qquad (4.5)$$
As discussed in Chapter 3, the EN prior, similar to the lasso prior, lacks differentiability at 0. In situations where the matrix X is not orthogonal, the p-dimensional integral in Equation 4.5 does not have a practical closed-form solution [28]. To overcome this limitation and obtain an approximate closed-form log-likelihood, we introduce the $l_2$ norm as an approximation for the $l_1$ norm in the prior distribution of β in Equation 4.4. This approximation retains the same prior variance [126]:

$$f(\beta_j) = \zeta_{\lambda_j, c} \exp\left\{ -\lambda_j \left[ (1-c)\beta_j^2/2 + c|\beta_j| \right] \right\}$$
$$\approx \zeta_{\lambda_j, c} \exp\left\{ -\left[ \lambda_j (1-c)\beta_j^2/2 + \frac{\beta_j^2}{2 \cdot 2/(c\lambda_j)^2} \right] \right\}$$
$$= N\!\left(0, \; \frac{2}{2\lambda_j(1-c) + (c\lambda_j)^2}\right) \qquad (4.6)$$
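For clarity, the variance matching behind the middle step can be spelled out (a short derivation added here; it follows directly from the densities above): a double exponential density with rate $c\lambda_j$ has variance $2/(c\lambda_j)^2$, so replacing $c|\beta_j|$ with a Gaussian term of that same variance contributes precision $(c\lambda_j)^2/2$, which adds to the ridge precision $\lambda_j(1-c)$:

$$\lambda_j(1-c) + \frac{(c\lambda_j)^2}{2} = \frac{2\lambda_j(1-c) + (c\lambda_j)^2}{2} \;\Longrightarrow\; \operatorname{Var}(\beta_j) = \frac{2}{2\lambda_j(1-c) + (c\lambda_j)^2}.$$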
However, the marginal likelihood of the hyperparameters is still based on the distribution of individual-level data. To address this limitation, Newcombe et al. [79] proposed a Cholesky decomposition to transform the data distribution from the individual level to the summary-statistics level. First, a p-length vector z is defined as z = X′y, where X is the design matrix and y is the response vector. All p entries of z can be estimated from the summary association data. By assuming Hardy-Weinberg equilibrium and an additive genetic model, estimates of the genotype counts n_jg for g = 0, 1, 2 and of the trait mean within each genotype group, ȳ_jg, are constructed for each marker j. The jth entry of z, corresponding to SNP j, is then given by:
$$z_j = \sum_i y_i \times x_{j,i} = n_{j1}\bar{y}_{j1} + 2\, n_{j2}\bar{y}_{j2} \qquad (4.7)$$

where $x_{j,i}$ is the design matrix entry corresponding to individual i's genotype for SNP j. This transformation allows us to express the multivariate normal likelihood of the summary data, z, in terms of the multivariate adjusted SNP effects β:
$$z \sim \mathrm{MVN}_p\!\left(X'X\beta, \; \sigma^2 X'X\right) \qquad (4.8)$$
As the individual-level genotype matrix X is unobserved, we propose constructing a plug-in estimate for X′X using a reference panel, such as the 1000 Genomes Project [21]. With this plug-in estimate for X′X, we can then perform a Cholesky decomposition:

$$X'X = L'L \qquad (4.9)$$
where L is an upper triangular p × p matrix with positive diagonal elements. Using the Cholesky decomposition, we can obtain a distribution of independent observations:

$$L'^{-1}z \sim \mathrm{MVN}_p\!\left(L'^{-1}X'X\beta, \; \sigma^2 L'^{-1}X'XL^{-1}\right)$$
$$\sim \mathrm{MVN}_p\!\left(L'^{-1}X'X\beta, \; \sigma^2 L'^{-1}L'LL^{-1}\right)$$
$$\sim \mathrm{MVN}_p\!\left(L\beta, \; \sigma^2 I_p\right). \qquad (4.10)$$

Thus, we can approximate the distribution of Y|β in Equation 4.5 by the distribution of z_L = L′⁻¹z.
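A minimal R sketch of this transformation, with hypothetical inputs (X.ref standing in for a reference dosage panel and z for the summary-level vector X′y; neither comes from the studies discussed later), might look like:

# Sketch of Equations 4.8-4.10 with simulated placeholder data.
set.seed(1)
X.ref <- matrix(rbinom(200 * 10, 2, 0.3), nrow = 200)  # reference dosages (n x p)
z <- rnorm(10)                            # assumed summary-level vector z = X'y
XtX <- crossprod(X.ref)                   # plug-in estimate of X'X from the panel
L <- chol(XtX)                            # upper-triangular L with X'X = L'L
zL <- backsolve(L, z, transpose = TRUE)   # z_L = (L')^{-1} z, solved without inverting L
# zL is approximately MVN(L beta, sigma^2 I), so zL can be regressed on L
# with standard regularized regression to recover the joint SNP effects.

R's chol() returns the upper triangular factor with X′X = L′L, matching Equation 4.9, and backsolve() with transpose = TRUE applies (L′)⁻¹ by triangular solve rather than by forming an explicit inverse.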
4.2.3 Optimizing the approximated marginal likelihood

The approximated marginal likelihood can be expressed in log form as follows:

$$-l(\alpha) = \log|C_\alpha| + z_L^T C_\alpha^{-1} z_L, \qquad (4.11)$$

where

$$C_\alpha = \sigma^2 I + L V^{-1} L', \qquad V = \mathrm{diag}(\eta) = \mathrm{diag}\!\left(\frac{2\lambda(1-c) + (c\lambda)^2}{2}\right). \qquad (4.12)$$
The objective function can be optimized using the reweighted-$L_2$ procedure [126] described in Chapter 3. The algorithm can be summarized as follows:

Algorithm 2: Iterative reweighted $L_2$ algorithm for xtune-mJAM
Step 1: Initialize α with α⁰ (e.g., α_i = 0 for all i = 1, 2, ..., q);
while not converged do
    Step 2: Given α, update θ: θ_new ← diag[L^T C_α L]
    Step 3: Given θ, update α via the following inner loop:
        Step 3.1: Initialize α with the estimates from the last iteration.
        Step 3.2: Update δ given α: δ ← argmin_δ ||z_L − Lδ||²₂ + Σ_j η_j δ_j²
        Step 3.3: Update α given δ: α ← argmin_α θ^T (1/η) + Σ_j η_j δ_j²
        Step 3.4: Repeat Steps 3.2 and 3.3 until convergence.
end

By optimizing the approximated marginal likelihood using Algorithm 2, we obtain estimates of the prior covariate coefficients α and the feature-specific tuning parameters λ. Once we have λ, we can apply standard regularized regression to obtain estimates of the model coefficients β.
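Step 3.2 is a generalized ridge problem with a closed-form solution, δ = (L′L + diag(η))⁻¹L′z_L. A small R sketch with hypothetical quantities (L.mat, zL, and eta standing in for L, z_L, and η) would be:

# Sketch of the Step 3.2 update as a generalized ridge solve (placeholder data).
set.seed(1)
p <- 10
L.mat <- matrix(rnorm(p * p), p, p)  # stands in for the Cholesky factor L
zL <- rnorm(p)                       # stands in for the transformed summary vector
eta <- rep(1, p)                     # stands in for the current penalty weights
# Solve (L'L + diag(eta)) delta = L' zL, the normal equations of Step 3.2:
delta <- solve(crossprod(L.mat) + diag(eta), crossprod(L.mat, zL))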
4.3 Data example

To illustrate the practical application of xtune-mJAM on real data, we conducted a targeted analysis within specific genomic regions. Our investigation utilized summary statistics obtained from a recent cross-ancestry study focused on prostate cancer associations, which included four distinct ancestral groups [23]. These groups encompassed a total of 122,188 prostate cancer cases and 604,640 controls of European ancestry, 19,391 cases and 61,608 controls of African ancestry, 10,809 cases and 95,790 controls of East Asian ancestry, and 3,931 cases and 26,405 controls from Hispanic populations.
We selected a specific region containing 120 SNPs located on chromosome 22 for analysis. Within this region, we employed xtune-mJAM to select one or more index SNP(s) using population-specific summary statistics and reference dosage data for each ancestral group. The reference dosage data were gathered from diverse sources, including the Prostate Cancer Association Group to Investigate Cancer-Associated Alterations in the Genome and Collaborative Oncological Gene-Environment Study Consortium [PRACTICAL iCOGS], the Elucidating Loci Involved in Prostate Cancer Susceptibility OncoArray Consortium [ELLIPSE OncoArray], the African Ancestry Prostate Cancer Consortium [AAPC GWAS], a GWAS of prostate cancer in Latinos [LAPC GWAS], and a GWAS of prostate cancer in Japanese [JAPC GWAS] [98, 23].
Based on the dosage data, we simulated two scenarios: (1) a single causal SNP, where the joint effect of the first SNP represents the true causal variant; and (2) multiple causal SNPs, where the first five SNPs are causal. Subsequently, we conducted regressions of the simulated response against the dosage SNP data to generate summary statistics for each ancestral group. The prior information matrix was constructed using the following formulation:

$$Z_{jk} = \left[ p_{jk}(1 - p_{jk}) \right]^{0.75} \qquad (4.13)$$

where $p_{jk}$ represents the allele frequency of SNP j in the kth population [7]. Thus, Z is a 120 × 4 matrix. After obtaining the results from xtune-mJAM using different types of penalties (lasso, ridge, and EN), we compare them with the results from xtune-mJAM without prior information and with mJAM-Forward [98].
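As a small illustration of Equation 4.13 (with made-up allele frequencies, not the study values), the prior matrix could be built in R as:

# Sketch of Equation 4.13: p.mat is a hypothetical 120 x 4 matrix of
# allele frequencies (SNPs in rows, populations in columns).
set.seed(1)
p.mat <- matrix(runif(120 * 4, 0.05, 0.5), nrow = 120)
Z <- (p.mat * (1 - p.mat))^0.75  # 120 x 4 prior information matrix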
We then regressed the predicted PRS against the true PRS to assess each model's performance in summarizing the effects of all SNPs and predicting the PRS. A coefficient estimate closer to one indicates better PRS prediction ability. For the scenario with a single causal SNP, as shown in Figure 4.1 (a), both xtune-mJAM lasso and EN demonstrated superior PRS prediction ability compared to the other methods, regardless of the presence or absence of prior information. The performance of xtune-mJAM ridge was comparatively weaker due to its non-selection nature; however, with the integration of prior information, xtune-mJAM ridge improved upon xtune-mJAM ridge without prior information. In Figure 4.1 (b), all methods exhibit a low average mean square error (MSE) in the PRS model, indicating a good model fit. Notably, the xtune-mJAM methods demonstrate slightly superior performance compared to mJAM-Forward. In the case involving multiple causal SNPs, the mean coefficients of true PRS versus predicted PRS across all models were close to one (greater than 0.95 and less than 1), indicating strong PRS prediction capabilities. Concurrently, the xtune-mJAM methods exhibited lower average model MSE compared to mJAM-Forward, implying a better model fit for the xtune-mJAM methods.
4.4 Discussion

xtune-mJAM introduces the capability of feature-specific penalty parameters by modeling the tuning parameters as a log-linear function of external information for summary statistics from multi-population studies. By incorporating the JAM methods into the xtune method, we can utilize the empirical Bayes framework to model the distribution of the summary statistics. The optimization of the hyperparameters is achieved through the iterative reweighted-$l_2$ algorithm.

The data example demonstrated that the lasso and EN variants of the method outperform existing approaches such as mJAM-Forward in terms of PRS prediction. However, xtune-mJAM shares the limitations
Figure 4.1: Model performance with a single causal SNP. (a) Coefficient of predicted PRS vs. true PRS comparing xtune-mJAM and non-informative xtune-mJAM using lasso, ridge, and EN penalties with mJAM-Forward. (b) Mean square error (MSE) of the PRS model comparing xtune-mJAM and non-informative xtune-mJAM using lasso, ridge, and EN penalties with mJAM-Forward.
of the general xtune method. First, careful consideration is required when determining the prior information. In the data example, allele frequencies from each population were utilized as prior covariates; indicator variables, such as SNP grouping information, could also serve as suitable prior covariates. To ensure the quality of the prior data, a rigorous literature review, a deep understanding of the research question, and consultation with experts in the field are vital. Furthermore, the current implementation is limited to applying the method to one or multiple regions of the GWAS data; scaling up to handle data across an entire chromosome could be computationally intensive. Future advancements, including leveraging parallel computing, could significantly enhance the algorithm's capacity to handle larger datasets.
For those interested, the xtune-mJAM R package, a software implementation of the method, is available for download on GitHub at https://github.com/JingxuanH/xtune.mJAM. In the future, efforts will focus on further improving the computational efficiency of xtune-mJAM and applying the method to extensive datasets such as genetic data from the Pan-UK Biobank, showcasing its potential for real-world applications at a larger scale.
Chapter 5

Software implementation

In this chapter, we present the software implementation and examples of the methods discussed in Chapter 2 (BHRM), Chapter 3 (xtune), and Chapter 4 (xtune_mJAM).
5.1 BHRM
This subchapter is a modified version of the README file of the BHRM GitHub repository at https://github.com/USCbiostats/BHRM.
5.1.1 Overview

The BHRM approach yields conditional individual exposure-specific estimates along with an estimate of the overall effect of the mixture. While BHRM is a broad framework of statistical models, our implementation of BHRM combines: (1) a g-prior on the corresponding exposure effects to provide robust estimation of highly correlated exposures; (2) a Bayesian selection procedure to estimate the posterior inclusion probability of individual exposures; and (3) Bayesian g-computation in a potential-outcome framework for estimating the overall mixture effect based on two specified exposure profiles.
Suppose we have a model with p exposures and q covariates:

$$g(\mu_i) = \alpha + \sum_{j=1}^{p} \gamma_j \beta_j X_j + \sum_{k=1}^{q} \delta_k U_k + \epsilon_i, \quad i = 1, \dots, n, \qquad (5.1)$$

where $X_j$ is an exposure with corresponding estimate $\beta_j$, $\gamma_j$ is a binary variable indicating the inclusion of a specific exposure j in the mixture, and $U_k$ is a covariate with corresponding effect estimate $\delta_k$. Specifically, $\gamma = (\gamma_1, \dots, \gamma_p)$ is a vector with elements $\gamma_j \in \{0, 1\}$ indicating whether the variable $X_j$ should be included in the model $M_\gamma$ ($\gamma_j = 0$ is equivalent to $\beta_j = 0$). We chose a Beta-Binomial distribution, as a function of p, as the prior for γ.
To incorporate the g-prior into the model, we interpret the regression model in a Bayesian setting under model $M_\gamma$. Assume we have a continuous outcome:

$$Y \mid \alpha, \beta, \phi, \gamma \sim N\!\left(\mathbf{1}\alpha + X_\gamma \beta_\gamma, \; \phi^{-1} I\right)$$
$$\beta_\gamma \sim N\!\left(0, \; g\,\phi^{-1}\left(X_\gamma' X_\gamma\right)^{-1}\right) \qquad (5.2)$$

where the prior covariance of β is specified as a scaled version of the covariance matrix of the MLE estimator and a function of the outcome variance $\phi^{-1}$. We can derive the posterior distributions:

$$\beta_\gamma \mid Y, \alpha, \phi, \gamma \sim N\!\left(\frac{g}{1+g}\,\hat{\beta}_\gamma, \; \phi^{-1}\frac{g}{1+g}\left(X_\gamma' X_\gamma\right)^{-1}\right)$$
$$\alpha \mid Y, \gamma \sim N\!\left(\hat{\alpha}_\gamma, \; \phi^{-1}\left(X_\gamma' X_\gamma\right)^{-1}\right) \qquad (5.3)$$

where $\hat{\beta}_\gamma$ is the MLE estimator for $M_\gamma$. We can see that the scalar g controls how much the conditional posterior mean shrinks from the MLE estimator toward the prior mean of zero. Also, the dispersion of the posterior covariance is shrunk by the factor $g/(1+g)$. Within this framework, the posterior inclusion probability (PIP) for an individual $\gamma_j$ is the posterior probability that the coefficient is non-zero. In this case, the Bayesian selection algorithm is combined with the shrinkage factor of the g-prior to yield optimal prediction performance. For our BHRM, we adopted a semi-Bayes fixed g-prior that is pre-specified with a constant according to the desired level of shrinkage.
To obtain coefficient estimates, we used MCMC simulation for Bayesian hierarchical models through the Just Another Gibbs Sampler (JAGS) coding scheme. After obtaining the coefficient estimates, we utilized g-computation to yield a single effect estimate (e.g., a mixture effect) that captures the impact of a one-standard-deviation increase in the levels of all exposures simultaneously. Specifically, we use posterior predictive distributions to estimate a single mixture risk difference ($\psi_{RD}$) based on two exposure profiles, such that:

$$\psi_{RD} = \psi_{x^*=\text{high}} - \psi_{x^*=\text{low}}. \qquad (5.4)$$
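For a continuous outcome with the identity link, this contrast reduces to a linear contrast of the coefficient draws. A hypothetical sketch (simulated posterior draws, not output from the model above; beta.draws and the two profile vectors are made-up names) of how ψ_RD could be summarized from MCMC output is:

# Sketch of the Equation 5.4 contrast from S posterior draws (simulated here).
set.seed(1)
S <- 1000; p <- 5
beta.draws <- matrix(rnorm(S * p, 0, 0.1), nrow = S)  # assumed posterior draws of beta
profile.high <- rep(0.5, p); profile.low <- rep(-0.5, p)
psi.draws <- beta.draws %*% profile.high - beta.draws %*% profile.low
mean(psi.draws)                        # posterior mean of the overall mixture effect
quantile(psi.draws, c(0.025, 0.975))   # 95% credible interval for psi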
5.1.2 R function

We developed functions to implement the BHRM algorithm using the JAGS programming framework. Users can choose the function with or without an interaction effect.

To use the BHRM function

BHRM(X=NULL, Y=NULL, U=as.matrix(rep(0, dim(X)[1])), profiles=NULL,
     family = "gaussian", w=0.9, selection=FALSE, n.adapt=5000, n.burnin=5000, n.sample=5000),

input variables are defined as:
• X: An N×P matrix of exposures for mixture analysis (on the original scale; we assume there are no missing values)
• Y: An N-length vector for a continuous outcome
• U: An N×Q matrix of covariates (variables included in the regression model but not included in the g-estimation or model selection)
• profiles: A 2×P matrix of two counterfactual exposure profiles for which a potential-outcomes risk difference is calculated (as the exposures are assumed to be standardized, these profiles should be on the standard normal scale)
• family: A character string representing the type of outcome; choose between "gaussian" and "binomial". Default is "gaussian"
• w: Weight for the g-prior: as w -> 0, estimates shrink to the common mean; as w -> 1, they move toward the maximum likelihood estimate. Default is 0.9
• n.adapt: Number of iterations for the adaptation. Default is 5000
• n.burnin: Length of burn-in. Default is 5000
• n.sample: Number of random sample draws. Default is 5000
• selection: TRUE or FALSE indicating whether to conduct Bayesian stochastic variable selection for main effects. Default is FALSE.
Output variables are defined as:
• beta: Estimated coefficients for the exposure effects
• beta.int: Estimated coefficients for the interaction effects
• gamma: PIPs for the exposures
• gamma.int: PIPs for the interactions
• eta.low/eta.high: Contrasted profiles
• psi: Overall mixture effect
For more information about how to install JAGS in R, please refer to https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781119287995.app1.
5.1.3 Examples

We demonstrate the use of BHRM through two simulated data examples. In the first example, we build a model without the interaction term. The simulated dataset comprises 1,000 observations with a continuous outcome variable, five exposures, and two covariates. To initiate the analysis, we must first download the BHRM.g R script from the BHRM GitHub repository and execute it. Next, we specify the model's family based on the outcome distribution, define the contrasting profiles, and set the other analysis parameters (e.g., length of burn-in). The code for simulating the example data and configuring the model is provided below:
library(MASS)
### run the BHRM.g R script first
source("PATH/BHRM.g.R")

### generate example data:
numInd <- 1000 # number of observations
numE <- 5 # number of exposures
numCov <- 2 # number of covariates
family <- "gaussian" # Choose the model/outcome type from {"gaussian", "binomial"};
                     # family must be defined before generating the outcome below
X <- mvrnorm(n=numInd, mu=rep(0,numE), Sigma=diag(numE)) # generate exposures data
# generate outcome data
Y <- ifelse(rep(family,numInd)=="gaussian", rnorm(numInd, 0, 1), rbinom(numInd, 1, .5))
U <- mvrnorm(n=numInd, mu=rep(0,numCov), Sigma=diag(numCov)) # generate covariates data

### specify the analysis parameters in the model
w <- 0.9 # Specify the g-prior weight parameter. The overall shrinkage parameter G = w/(1-w)
# specify the contrasting profile (e.g., -0.5 vs. 0.5)
profiles <- rbind(rep(-0.5,numE), rep(0.5, numE))
# specify the number of iterations for the adaptation, length of burn-in,
# and the number of random samples
n.adapt=5000; n.burnin=5000; n.sample=5000
After setting up all model arguments, we can run the model:
### run BHRM analysis
results <- BHRM(X=X, Y=Y, U=U, profiles=profiles, family = family, w=w, selection=T,
                n.adapt=5000, n.burnin=5000, n.sample=5000)
> results
           Mean    SD  X2.5. X97.5. p.val
beta[1]  -0.001 0.007 -0.013  0.000 0.886
beta[2]  -0.001 0.008 -0.021  0.000 0.901
beta[3]   0.000 0.006  0.000  0.000 1.000
beta[4]  -0.002 0.011 -0.043  0.000 0.856
beta[5]  -0.001 0.009 -0.024  0.000 0.912
eta.high -0.003 0.010 -0.033  0.006 0.764
eta.low   0.003 0.010 -0.006  0.033 0.764
gamma[1]  0.049 0.215  0.000  1.000 0.820
gamma[2]  0.050 0.218  0.000  1.000 0.819
gamma[3]  0.038 0.192  0.000  1.000 0.843
gamma[4]  0.074 0.262  0.000  1.000 0.778
gamma[5]  0.053 0.224  0.000  1.000 0.813
psi      -0.006 0.019 -0.066  0.013 0.752
We can see that the output gives us the effect estimates, coefficient standard deviations, 95% credible intervals, and p-values for the individual exposures, the profiles, the PIPs, and the overall mixture effect.
In the second example, we demonstrate how to employ the BHRM function for modeling with an interaction effect. The analysis steps are akin to those in the first example. In this scenario, we generate an outcome variable that follows a binomial distribution. Using the BHRM.interaction() function, we prompt the model to estimate the pairwise interaction effects of the exposures. The code for the second example is presented below:
library(MASS)
### run the BHRM.g.interaction R script first
source("PATH/BHRM.g.interaction.R")

### generate example data:
numInd <- 1000 # number of observations
numE <- 5 # number of exposures
numCov <- 2 # number of covariates
family <- "binomial" # Choose the model/outcome type from {"gaussian", "binomial"};
                     # family must be defined before generating the outcome below
X <- mvrnorm(n=numInd, mu=rep(0,numE), Sigma=diag(numE)) # generate exposures data
# generate outcome data
Y <- ifelse(rep(family,numInd)=="gaussian", rnorm(numInd, 0, 1), rbinom(numInd, 1, .5))
U <- mvrnorm(n=numInd, mu=rep(0,numCov), Sigma=diag(numCov)) # generate covariates data

### specify the analysis parameters in the model
w <- 0.9 # Specify the g-prior weight parameter. The overall shrinkage parameter G = w/(1-w)
# specify the contrasting profile (e.g., 0 vs. 1)
profiles <- as.data.frame(c(0,1)*matrix(1, nrow=2, ncol=numE))
# specify the number of iterations for the adaptation, length of burn-in, and number of random samples
n.adapt=5000; n.burnin=5000; n.sample=5000

# run BHRM analysis with pairwise interaction terms
results <- BHRM.interaction(X=X, Y=Y, U=U, profiles=profiles,
                            family = family, w=w, selection=F,
                            n.adapt=5000, n.burnin=5000, n.sample=5000)
> results
                Mean    SD  X2.5. X97.5. p.val
beta[1]       -0.057 0.053 -0.166  0.049 0.282
beta[2]       -0.031 0.053 -0.135  0.078 0.559
beta[3]        0.034 0.054 -0.066  0.145 0.529
beta[4]       -0.025 0.054 -0.132  0.078 0.643
beta[5]       -0.082 0.052 -0.183  0.026 0.115
beta.int[1]    0.000 0.002  0.000  0.000 1.000
beta.int[2]    0.000 0.002  0.000  0.000 1.000
beta.int[3]    0.000 0.001  0.000  0.000 1.000
beta.int[4]    0.000 0.001  0.000  0.000 1.000
beta.int[5]    0.000 0.000  0.000  0.000 NaN
beta.int[6]    0.000 0.004  0.000  0.000 1.000
beta.int[7]    0.000 0.002  0.000  0.000 1.000
beta.int[8]    0.000 0.001  0.000  0.000 1.000
beta.int[9]    0.000 0.001  0.000  0.000 1.000
beta.int[10]   0.000 0.000  0.000  0.000 NaN
eta.high      -0.161 0.115 -0.389  0.058 0.162
eta.low        0.000 0.000  0.000  0.000 NaN
gamma[1]       1.000 0.000  1.000  1.000 0.000
gamma[2]       1.000 0.000  1.000  1.000 0.000
gamma[3]       1.000 0.000  1.000  1.000 0.000
gamma[4]       1.000 0.000  1.000  1.000 0.000
gamma[5]       1.000 0.000  1.000  1.000 0.000
gamma.int[1]   0.000 0.020  0.000  0.000 1.000
gamma.int[2]   0.000 0.014  0.000  0.000 1.000
gamma.int[3]   0.000 0.014  0.000  0.000 1.000
gamma.int[4]   0.001 0.028  0.000  0.000 0.972
gamma.int[5]   0.000 0.000  0.000  0.000 NaN
gamma.int[6]   0.002 0.042  0.000  0.000 0.962
gamma.int[7]   0.000 0.020  0.000  0.000 1.000
gamma.int[8]   0.001 0.024  0.000  0.000 0.967
gamma.int[9]   0.000 0.014  0.000  0.000 1.000
gamma.int[10]  0.000 0.000  0.000  0.000 NaN
psi           -0.161 0.115 -0.389  0.058 0.162
As shown in the output, in addition to the results observed in the first example, the model also furnishes estimates of the pairwise interaction effects and the posterior inclusion probability (PIP) values for the interaction terms. These additional outputs provide valuable information on the interaction effects between different exposures in the model.
5.2 mxtune
This subchapter is a modified version of the manual of the xtune R package on CRAN: https://cran.r-project.org/web/packages/xtune/index.html.
5.2.1 Overview
In standard regularized regression (lasso, ridge, and elastic-net), a single penalty parameter λ is applied equally to all regression coefficients to control the amount of regularization in the model. Better prediction accuracy may be achieved by allowing differing amounts of shrinkage: ideally, we want to give a small penalty to important features and a large penalty to unimportant features. Thus, xtune applies an individual shrinkage parameter λ_j to each regression coefficient β_j, and the vector of shrinkage parameters λ = (λ_1, ..., λ_p) is guided by external information Z. Specifically, λ is modeled as a log-linear function of Z.
Table 5.1 and Table 5.2 show the data needed for fitting mxtune. Table 5.1 is the standard data matrix with p predictors and a categorical outcome Y. Table 5.2 is the p × q prior data matrix: for each predictor X_j, there are q prior covariates available to describe its characteristics, often referred to as "features of features." A prior covariate could be an indicator variable, such as grouping information about whether X_j was previously identified by a related study, or it could be an extra quantitative measurement of X_j.
Table 5.1: Primary data X and Y

Observations   Y     X_1    X_2    ...   X_p
1              y_1   x_11   x_21   ...   x_p1
2              y_2   x_12   x_22   ...   x_p2
...            ...   ...    ...    ...   ...
N              y_N   x_1N   x_2N   ...   x_pN
Table 5.2: Prior information Z

Features   Z_1   ...   Z_q
X_1        0     ...   Z_q1
X_2        1     ...   Z_q2
...        ...   ...   ...
X_p        0     ...   Z_qp
Based on the data and the model, the objective function of feature-specific shrinkage integrating external information is:

$$\min_f \; \sum_{i=1}^{n} V(f(x_i), y_i) + \lambda R(f), \qquad \lambda = e^{Z \cdot \alpha} \qquad (5.5)$$
where V represents the loss function, λ is the penalty/tuning parameter, and R(f) is the regularization/penalty term. Specifically, mxtune extends the current version of xtune to multi-classification problems using the elastic-net model. Thus, the loss function and the penalty term are:
$$V(f(x_i), y_i) = \sum_{k=1}^{K} t_{ik}\left(\beta_{0k} + x_i \beta_k\right) - \log \sum_{k=1}^{K} e^{\beta_{0k} + x_i \beta_k}$$
$$R(f) = \sum_{k=1}^{K} (1-c)\|\beta_k\|_2^2 / 2 + c\|\beta_k\|_1 \qquad (5.6)$$
When c = 1, c = 0, or any value between 0 and 1, the model is equivalent to the lasso, ridge, or EN, respectively.
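The log-linear mapping λ = exp(Z·α) is simple to visualize in code. A short sketch with entirely hypothetical Z and α values (an intercept column is included, mirroring the alpha.est output described below) is:

# Sketch of the log-linear penalty mapping with made-up values.
set.seed(1)
Z <- cbind(1, matrix(rbinom(800 * 4, 1, 0.2), nrow = 800))  # 800 predictors, intercept + 4 annotations
alpha <- c(0.5, -2.0, -0.3, -1.0, -0.4)                      # assumed second-level coefficients
lambda <- exp(Z %*% alpha)  # one penalty per predictor; a negative alpha entry
                            # means less shrinkage for predictors with that annotation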
The mxtune model fitting consists of two steps: (1) selecting the tuning parameter(s), and (2) estimating the regression coefficients given the tuning parameter(s). Cross-validation is widely used to select a single common penalty parameter in Step (1), but it is computationally infeasible for selecting more than three penalty parameters. Thus, mxtune implements an empirical Bayes framework to estimate the multiple tuning parameters. Once the tuning parameters λ are estimated, Step (2) can be done using standard multinomial likelihood optimization methods (e.g., the glmnet function in R).
5.2.2 R function

The mxtune method is embedded in the xtune(X, Y, Z, ...) function in the xtune package. The input arguments are defined as:

• X: Numeric design matrix of explanatory variables (n observations in rows, p predictors in columns). xtune includes an intercept by default.
• Y: Outcome vector of dimension n.
• Z: Numeric information matrix about the predictors (p rows, each corresponding to a predictor in X; q columns of external information about the predictors, such as prior biological importance). If Z is a grouping of the predictors, it is best if the user codes it as a dummy variable (i.e., each column indicating whether predictors belong to a specific group).
• U: Covariates to be adjusted for in the model (matrix with n observations in rows, u predictors in columns). Covariates are not penalized in the model.
• family: The family of the model according to the type of outcome, including "linear", "binary", and "multiclass". For mxtune, the family should be set to "multiclass".
• c: The elastic-net mixing parameter, ranging from 0 to 1. When c = 1, the model corresponds to the lasso; when c = 0, it corresponds to ridge. For values between 0 and 1 (default 0.5), the model corresponds to the elastic net.
• epsilon: The parameter controlling the boundary of α. The maximum value that α can achieve equals epsilon times the maximal alpha calculated by pathwise coordinate descent. A larger value of epsilon indicates a stronger shrinkage effect (default 5).
• sigma.square: A user-supplied noise variance estimate. The user does not need to specify this argument when using mxtune.
• message: Generates diagnostic messages during model fitting. Default is TRUE.
• control: Specifies the model control object.

The major output elements are defined as:

• beta.est: The fitted vector of model coefficients. When modeling data with K outcome levels, there will be K coefficient vectors of length p.
• penalty.vector: The estimated penalty vector applied to each regression coefficient, similar to the penalty.factor argument in glmnet.
• lambda: The estimated λ value, which is calculated to reflect that penalty factors are internally rescaled to sum to nvars in glmnet.
• alpha.est: The estimated second-level coefficients for the prior covariates Z. The first value is the intercept of the prior coefficients.
5.2.3 Examples
We use a simulated dataset included in the xtune package to illustrate the usage of the mxtune method and the xtune function. To get started, we first install and load the package:
# install the package
install.packages("xtune")
library(xtune)
We then load the simulated example data included in the package and fit the model. The dataset contains 600 observations and 800 predictors with 10 covariates. The outcome variable is categorical with three levels. The prior data Z is an 800 × 5 matrix with 0/1 entries containing grouping information (e.g., whether a predictor was pre-identified by study 1 through study 5).
# load the data
data("example.multiclass")

# fit the model
fit.multiclass = xtune(X = example.multiclass$X, Y = example.multiclass$Y, Z = example.multiclass$Z,
                       U = example.multiclass$U, family = "multiclass", c = 0.5)
Users can trace the model-fitting process by specifying verbosity = T in the xtune.control object. After the model fitting, we can check the estimated tuning parameters. A total of 810 tuning parameters were estimated: the first 800 correspond to the shrinkage parameters of the predictors, and the last 10 correspond to the penalty parameters of the adjusted covariates. Since the covariates are forced into the model, their shrinkage effects are zero.
# check the tuning parameters
fit.multiclass$penalty.vector

[1] 0.019033238 0.020019923 0.019033238 0.019033238
...
[809] 0.000000000 0.000000000
After model fitting, we can make predictions and evaluate the model based on the trained model:
# predict the probability that the individual lies in each outcome category
pred.prob = predict_xtune(fit.multiclass, newX = cbind(example.multiclass$X, example.multiclass$U))
pred.prob
        cat 1        cat 2        cat 3
1   0.152714695  0.04651872   0.80076659
2   0.001784845  0.96370971   0.03450544
3   0.594903813  0.14663364   0.25846255
4   0.040101522  0.84879438   0.11110409
5   0.084372394  0.17279569   0.74283192
...
600 2.211814e-01 7.109743e-01 6.784435e-02

# predict the class of the outcome by specifying type = "class"
pred.class <- predict_xtune(fit.multiclass, newX = cbind(example.multiclass$X, example.multiclass$U),
                            type = "class")
pred.class
[1] "3" "2" "1" "2" "3" "2" "1" "3" "3" "2" "1" "3"
...
[589] "3" "1" "1" "1" "2" "2" "3" "3" "1" "1" "1" "2"

# misclassification rate
misclassification(pred.class, true = example.multiclass$Y)
[1] 0.006666667

# multiclass AUC
pROC::multiclass.roc(as.numeric(example.multiclass$Y), as.numeric(pred.class))
Multi-class area under the curve: 0.9945
5.3 xtune_mJAM
5.3.1 Overview
Regularized regression is a widely used technique for analyzing high-dimensional data, such as genetic data. In standard regularized regression, a single penalty parameter λ is applied uniformly to all regression coefficients to control the level of regularization in the model. However, a novel approach called xtune was proposed in a previous study, which models the penalty parameter as a log-linear function of the prior data Z, introducing feature-specific shrinkage parameters.
To build on this concept, we extend the xtune method from modeling individual-level data to summary statistics. We achieve this by incorporating the mJAM (multi-population Joint Analysis of Marginals) method, which utilizes a Cholesky decomposition to perform joint analysis of marginal summary statistics for multi-population GWAS studies. This extension allows the analysis of high-dimensional data represented by summary statistics, making it suitable for multi-population genetic studies.
5.3.2 R function

For the function xtune.mJAM(betas.Gx, N.Gx, Geno, Z, ...), the input variables are defined as:

• betas.Gx: A list of numeric vectors containing the summary statistics from each ethnic population.
• N.Gx: A list of numeric vectors containing the sample sizes of all original GWAS studies.
• Geno: A list of matrices with individual-level SNP dosage data in each study/population as reference data. Each column corresponds to a SNP. Note that the order of the studies/populations should be matched across all lists.
• eaf.Gy: The effect allele frequency of the SNPs in betas.Gx.
• Z: Prior information matrix about the predictors/SNPs (p rows, each corresponding to a SNP in X; q columns of external information about the predictors, such as prior biological importance). If Z is a grouping of the predictors, it is best if the user codes it as a dummy variable (i.e., each column indicating whether predictors belong to a specific group).

Other analysis input parameters (e.g., c, sigma.square) follow the definitions of the inputs in the xtune package. The output objects are defined the same as the output objects of the xtune() function.
5.3.3 Examples

We illustrate the usage of xtune_mJAM using a simulated data example. First, we install the package from GitHub:

# install.packages("devtools")
library(devtools)
devtools::install_github("JingxuanH/xtune.mJAM")
library("xtune.mJAM")
Then we load the example data embedded in the xtune.mJAM package and check what the data look like:
# load the example data
data(example)

# check the data
example$beta.gwas
[[1]]
[1] -9.500064e-02 -2.512558e-02 -1.731112e-02 -6.726648e-03 -5.519060e-03 -3.215950e-03
...
[97] 2.709292e-03 2.161107e-03 1.234885e-04 -2.674988e-03

[[2]]
[1] -9.757099e-02 -4.549155e-02 -2.547170e-02 -1.605575e-02 -1.122632e-02 -6.135954e-03
...
[97] 1.826946e-04 -9.174620e-05 -6.429620e-04 -3.386482e-04

[[3]]
[1] -9.639237e-02 -1.541633e-02 -7.187294e-03 -6.176262e-03 -9.030340e-04 -1.458887e-03
...
[97] -1.690840e-04 4.211061e-04 7.596348e-04 2.447554e-03

example$N.Gx
[[1]]
[1] 5000

[[2]]
[1] 4000

[[3]]
[1] 6000

example$Geno[[1]][1:5,1:5]
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    1    0
[2,]    0    0    0    0    1
[3,]    0    1    1    2    1
[4,]    0    1    1    1    0
[5,]    0    0    0    0    0

example$Z
           z.1        z.2       z.3
[1,] 0.1746261 0.14996633 0.1236590
[2,] 0.2686071 0.17083853 0.2525722
[3,] 0.2194457 0.19965372 0.2508865
The simulated dataset comprises genetic information from 100 SNPs across three populations. Each population has individual-level genotype data containing the 100 SNPs, with population-specific sample sizes of 5,000, 4,000, and 6,000 observations, respectively. The prior information matrix Z is constructed from the allele frequency of each population, following the formulation in Equation 4.13 (the minor allele frequency (MAF) times one minus the MAF, raised to the power 0.75).
To proceed with the analysis, we fit the model and examine the SNP-specific penalties and coefficients. We have the flexibility to modify the argument c, which specifies the type of penalty used in the model (lasso, ridge, or EN), as in the xtune method. This adaptability allows us to explore various regularization techniques and choose the most suitable one for our specific dataset and research question.
fit = xtune_mJAM(betas.Gx = example$beta.gwas, N.Gx = example$N.Gx, Geno = example$Geno,
                 Z = example$Z, c = 0.5)
fit$penalty.vector
  [1,] 4.473506e-05
  [2,] 4.473506e-05
  [3,] 4.473506e-05
  [4,] 4.473506e-05
  [5,] 4.473506e-05
...
[100,] 4.473506e-05

fit$beta.est
(Intercept) -7.232828e-02
V1          -9.555928e-02
V2          -2.026669e-04
V3           2.451330e-04
V4          -8.098853e-05
V5          -2.311183e-04
...
V100        -7.800279e-04
Chapter 6

Conclusions

With the advancement of data acquisition techniques, penalized methods have seen increasing use in recent decades to better address the statistical challenges arising from complicated datasets. Alongside penalty parameters, various types of priors are employed to aid in fine-tuning the model for improved performance. In this chapter, we discuss our contributions to the field of penalty parameter selection for penalized methods and explore potential future directions, addressing remaining interests and challenges in this area.
6.1 Summary of our contributions

Figure 6.1: Summary of thesis contributions in the field of penalty parameter selection.
Figure 6.1 summarizes the content of this thesis. In the process of selecting penalty parameters for regularized methods, the choice of priors plays a crucial role. Commonly used priors are non-data-informed, such as parameters selected by cross-validation in the standard lasso [108], ridge [45], or elastic-net [131]. Additionally, there are existing methods that utilize data-informed techniques to select priors for penalty parameters, such as fwelnet [103]. Our dissertation introduces new methods for both non-data-informed and data-informed priors for regularization methods, aiming to enhance model performance.

In Chapter 2, we introduced the BHRM method, which utilizes a non-data-informed type of prior to shrink the estimated coefficients in environmental exposure studies involving mixtures data. The key features of BHRM include: (1) Bayesian stochastic methods for variable selection; (2) application of Zellner's g-prior to shrink the independent exposure effects; and (3) implementation of g-computation methods to estimate the summed effect of the mixture exposures. Furthermore, BHRM has the capability to model potential interaction effects among the independent exposures, and its implementation in JAGS provides a flexible Bayesian hierarchical modeling framework. Through extensive simulation studies and a practical example using the HELIX dataset, we demonstrated that BHRM produces robust and stable estimates, even in the presence of highly correlated data. The method proves valuable for accurately modeling mixtures of environmental exposures and providing reliable estimates for complex datasets.
In Chapter 3, we introduced the mxtune method, which incorporates prior information into feature-specific regularized regression to address multi-class classification problems. mxtune uses a data-informed prior for the choice of penalty parameters and extends the xtune method to handle multi-class classification tasks with a more stable estimation algorithm. Unlike traditional approaches that use cross-validation to select a single common penalty parameter, mxtune adopts an empirical Bayes approach: it estimates feature-specific penalty parameters by modeling them as log-linear functions of prior data, leading to improved model performance and interpretability. These prior data can include gene functional annotations, pathway information, or any other relevant meta-features characterizing the variables. Through comprehensive simulation studies and the analysis of three real-world datasets, we demonstrated that mxtune is a promising approach for multi-class classification problems: it improves model performance while enhancing model parsimony, making it an effective tool for handling complex multi-class classification tasks.
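To make the log-linear prior concrete, here is a minimal sketch in R of the underlying idea rather than the mxtune estimation algorithm itself: each feature's penalty is computed as lambda_j = exp(Z_j' alpha) from its meta-features and then supplied to a penalized multinomial fit as a relative penalty weight. The meta-feature matrix Z, the hyperparameter vector alpha (which mxtune would estimate by empirical Bayes rather than fix by hand), and the use of glmnet's penalty.factor argument are all illustrative assumptions.

library(glmnet)

set.seed(2)
n <- 300; p <- 50
X <- matrix(rnorm(n * p), n, p)
Z <- cbind(1, rbinom(p, 1, 0.3))   # meta-features: intercept + pathway membership
y <- factor(sample(1:3, n, replace = TRUE))

alpha  <- c(0.5, -1.0)             # hypothetical prior-data coefficients
lambda <- exp(Z %*% alpha)         # feature-specific penalties, log-linear in Z

# Features in the annotated pathway receive exp(-0.5) < exp(0.5), i.e. less
# shrinkage. glmnet rescales penalty.factor internally, so these weights act
# as relative penalties across features.
fit <- glmnet(X, y, family = "multinomial",
              penalty.factor = as.numeric(lambda))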
In Chapter 4, we extended the xtune framework to model GWAS summary-statistics-level data. While preserving the xtune approach for estimating data-informed penalty parameters, our proposed method, called xtune-mJAM, incorporates the mJAM method into the xtune framework. Specifically, xtune-mJAM enables modeling developed for individual-level data to be carried over to GWAS-level summary data in multi-population studies. Through simulation studies using Chromosome 22 data, we demonstrated that xtune-mJAM significantly improves model performance, particularly in terms of PRS prediction. This extension of the xtune framework to GWAS summary-statistics-level data provides an effective tool for leveraging multi-population information and enhancing the accuracy of PRS prediction. The integration of mJAM ensures that valuable insights from GWAS-level data can be effectively utilized, leading to more powerful and informative analyses.
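To illustrate the kind of summary-statistics transformation that JAM-type methods exploit, here is a toy R sketch, not the mJAM implementation: with standardized genotypes, marginal per-SNP GWAS effects satisfy b_marginal ≈ R β_joint, where R is the LD correlation matrix, so joint multivariate effects can be recovered from summary statistics plus a reference LD panel. The simulated genotypes, the assumed LD structure, and all variable names are illustrative.

set.seed(3)
n <- 5000; p <- 4
R <- matrix(0.6, p, p); diag(R) <- 1                 # assumed LD structure
G <- scale(matrix(rnorm(n * p), n, p) %*% chol(R))   # standardized genotypes
beta_true <- c(0.3, 0, 0.1, 0)
y <- as.numeric(G %*% beta_true + rnorm(n))

# Marginal (one-SNP-at-a-time) effects, as reported in GWAS summary statistics:
b_marginal <- sapply(1:p, function(j) coef(lm(y ~ G[, j]))[2])
R_ref <- cor(G)                                      # stand-in for a reference LD panel
beta_joint <- solve(R_ref, b_marginal)               # approximately recovers beta_true
round(cbind(beta_true, b_marginal, beta_joint), 3)

xtune-mJAM builds on transformations of this kind so that the data-informed penalties of the xtune framework can be estimated from GWAS-level rather than individual-level data.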
We have implemented all three methods as software: the BHRM R script is available on GitHub, the xtune R package is accessible on CRAN, and the xtune-mJAM R package is downloadable from GitHub, providing easy access for future applications by the research community.
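A brief installation sketch follows; the CRAN call is standard, while the GitHub repository paths are not listed in this section, so the corresponding line uses a placeholder rather than a real address.

install.packages("xtune")   # xtune is distributed on CRAN
library(xtune)
# The BHRM script and the xtune-mJAM package are hosted on GitHub; substitute
# the actual repository path for the hypothetical <user>/<repo> placeholder:
# remotes::install_github("<user>/<repo>")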
6.2 Future directions
In future applications, xtune-mJAM will be applied to genome-wide data, enabling the exploration of genetic studies on a larger scale and revealing deeper insights from extensive datasets. Moreover, improving the computational efficiency of processing prior information for regularization methods would significantly facilitate future research in this field.
Bibliography
[1] Just another Gibbs sampler, Sep 2021. URL https://en.wikipedia.org/wiki/Just_another_Gibbs_sampler.
[2] Bayesian hierarchical modeling, Feb 2022. URL https://en.wikipedia.org/wiki/Bayesian_hierarchical_modeling#Hierarchical_models.
[3] A. Keil, J. Buckley, K. O'Brien, K. Ferguson, S. Zhao, and A. White. A quantile-based g-computation approach to addressing the effects of exposure mixtures. Environmental Epidemiology, 3:44, 2019. doi: 10.1097/01.ee9.0000606120.58494.9d.
[4] Greg M. Allenby, Peter E. Rossi, and Robert E. McCulloch. Hierarchical Bayes models: A practitioner's guide. SSRN Electronic Journal, 2005. doi: 10.2139/ssrn.655541.
[5] Jeromy Anglim. Getting started with jags, rjags, and bayesian modelling: R-bloggers, Apr 2013. URL
https://www.r-bloggers.com/2012/04/getting-started-with-jags-rjags-and-bayesian-modelling/.
[6] Alberto A. Antunes, Sabrina T. Reis, Katia R. Leite, Danilo M. Real, Juliana M. Sousa-Canavez, Luiz H.
Camara-Lopes, Marcos F. Dall’oglio, and Miguel Srougi. Pgc and psma in prostate cancer diagnosis:
Tissue analysis from biopsy samples. International braz j urol, 39(5):649–656, 2013. doi: 10.1590/
s1677-5538.ibju.2013.05.06.
[7] Takiy-Eddine Berrandou, David Balding, and Doug Speed. LDAK-GBAT: fast and powerful gene-based association testing using summary statistics. The American Journal of Human Genetics, 110(1):23–29, 2023.
[8] Anirban Bhattacharya, Debdeep Pati, Natesh S. Pillai, and David B. Dunson. Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association, 110(512):1479–1490, 2015. doi: 10.1080/01621459.2014.960967.
[9] Cécile Billionnet, Duane Sherrill, and Isabella Annesi-Maesano. Estimating the health effects of
exposure to multi-pollutant mixture. Annals of Epidemiology, 22(2):126–141, 2012. doi: 10.1016/j.
annepidem.2011.11.004.
[10] Christopher M. Bishop. Sparse kernel machines. In Pattern Recognition and Machine Learning,
chapter 7, pages 355–356. Springer New York, NY, 2006.
[11] C. M. Bishop. Pattern recognition and machine learning. Springer, 2007.
[12] Jennifer Bobb, Gregory A. Wellenius, Murray A. Mittleman, and Brent A. Coull. Bayesian kernel
machine regression for estimating the health effects of multi-pollutant mixtures. ISEE Conference
Abstracts, 2013(1):4800, 2013. doi: 10.1289/isee.2013.o-2-20-06.
[13] Brendan K Bulik-Sullivan, Po-Ru Loh, Hilary K Finucane, Stephan Ripke, Jian Yang, Schizophrenia
Working Group of the Psychiatric Genomics Consortium, Nick Patterson, Mark J Daly, Alkes L
Price, and Benjamin M Neale. Ld score regression distinguishes confounding from polygenicity in
genome-wide association studies. Nature genetics, 47(3):291–295, 2015.
[14] Petra Bůžková. Linear regression in genetic association studies. PLoS One, 8(2):e56976, 2013.
[15] Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betan-
court, Marcus A Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic program-
ming language. Journal of statistical software, 76, 2017.
[16] Caroline Carrico, Chris Gennings, David C. Wheeler, and Pam Factor-Litvak. Characterization
of weighted quantile sum regression for highly correlated data in a risk analysis setting. Jour-
nal of Agricultural, Biological, and Environmental Statistics, 20(1):100–120, 2014. doi: 10.1007/
s13253-014-0180-3.
[17] Wei-Yi Cheng, Tai-Hsien Ou Yang, and Dimitris Anastassiou. Development of a prognostic model for breast cancer survival in an open challenge environment. Science Translational Medicine, 5(181), 2013. doi: 10.1126/scitranslmed.3005974.
[18] Shing Wan Choi, Timothy Shin-Heng Mak, and Paul F O’Reilly. Tutorial: a guide to performing
polygenic risk score analyses. Nature protocols, 15(9):2759–2772, 2020.
[19] Merlise Clyde and Edward I George. Model uncertainty. Statistical Science, 19(1):81–94, 2004.
[20] Nuria Comes, Antonio Serrano-Albarras, Jesusa Capera, Clara Serrano-Novillo, Enric Condom, San-
tiago Ramón y Cajal, Joan Carles Ferreres, and Antonio Felipe. Involvement of potassium channels
in the progression of cancer to a more malignant phenotype. Biochimica et Biophysica Acta (BBA)-
Biomembranes, 1848(10):2477–2492, 2015.
[21] 1000 Genomes Project Consortium. A global reference for human genetic variation. http://www.internationalgenome.org, 2015. Accession Number: PRJEB6930.
[22] David V Conti and John S Witte. Hierarchical modeling of linkage disequilibrum: genetic structure
and spatial relations. The American Journal of Human Genetics, 72(2):351–363, 2003.
[23] David V Conti, Burcu F Darst, Lilit C Moss, Edward J Saunders, Xin Sheng, Alisha Chou, Fredrick R
Schumacher, Ali Amin Al Olama, Sara Benlloch, Tokhir Dadaev, et al. Trans-ancestry genome-wide
association meta-analysis of prostate cancer identifies new susceptibility loci and informs genetic
risk prediction. Nature genetics, 53(1):65–75, 2021.
[24] James P Cook, Anubha Mahajan, and Andrew P Morris. Guidance for the utility of linear models
in meta-analysis of genetic association studies of binary phenotypes. European Journal of Human
Genetics, 25(2):240–245, 2017.
[25] Christina Curtis, Sohrab P. Shah, Suet-Feung Chin, Gulisa Turashvili, Oscar M. Rueda, Mark J. Dun-
ning, Doug Speed, Andy G. Lynch, Shamith Samarajiwa, Yinyin Yuan, and et al. The genomic and
transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486(7403):
346–352, 2012. doi: 10.1038/nature10983.
[26] Hilary K Finucane, Brendan Bulik-Sullivan, Alexander Gusev, Gosia Trynka, Yakir Reshef, Po-Ru
Loh, Verneri Anttila, Han Xu, Chongzhi Zang, Kyle Farh, et al. Partitioning heritability by functional
annotation using genome-wide association summary statistics. Nature genetics, 47(11):1228–1235,
2015.
[27] Dean P. Foster and Edward I. George. The risk inflation criterion for multiple regression. The Annals of Statistics, 22(4), 1994. doi: 10.1214/aos/1176325766.
[28] Scott D. Foster, Arūnas P. Verbyla, and Wayne S. Pitchford. A random model approach for the lasso. Computational Statistics, 23(2):217–233, 2007. doi: 10.1007/s00180-007-0033-4.
[29] Barbara Franke, Jason L Stein, Stephan Ripke, Verneri Anttila, Derrek P Hibar, Kimm JE Van Hulzen,
Alejandro Arias-Vasquez, Jordan W Smoller, Thomas E Nichols, Michael C Neale, et al. Genetic
influences on schizophrenia and subcortical brain volumes: large-scale proof of concept. Nature
neuroscience, 19(3):420–431, 2016.
[30] S. P. Fraser, J. A. Grimes, J. K. Diss, D. Stewart, J. O. Dolly, and M. B. Djamgoz. Predominant expres-
sion of kv1.3 voltage-gated k+ channel subunit in rat prostate cancer cell lines: Electrophysiological,
pharmacological and molecular characterisation. Pflügers Archiv - European Journal of Physiology ,
446(5):559–571, 2003. doi: 10.1007/s00424-003-1077-0.
[31] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear
models via coordinate descent. JournalofStatisticalSoftware, 33(1), 2010. doi: 10.18637/jss.v033.i01.
[32] Steven Gazal, Hilary K Finucane, Nicholas A Furlotte, Po-Ru Loh, Pier Francesco Palamara, Xuanyao
Liu, Armin Schoech, Brendan Bulik-Sullivan, Benjamin M Neale, Alexander Gusev, et al. Linkage
disequilibrium–dependent architecture of human complex traits shows action of negative selection.
Nature genetics, 49(10):1421–1427, 2017.
[33] Steven Gazal, Po-Ru Loh, Hilary K Finucane, Andrea Ganna, Armin Schoech, Shamil Sunyaev, and
Alkes L Price. Functional architecture of low-frequency variants highlights strength of negative
selection across coding and non-coding annotations. Nature genetics, 50(11):1600–1607, 2018.
[34] Tian Ge, Chia-Yen Chen, Yang Ni, Yen-Chen Anne Feng, and Jordan W Smoller. Polygenic prediction
via bayesian regression and continuous shrinkage priors. Nature communications, 10(1):1776, 2019.
[35] Tian Ge, Chia-Yen Chen, Yang Ni, Yen-Chen Anne Feng, and Jordan W Smoller. Polygenic prediction
via bayesian regression and continuous shrinkage priors. Nature communications, 10(1):1776, 2019.
[36] Edward I. George and Robert E. McCulloch. Variable selection via gibbs sampling. Journal of the
American Statistical Association, 88(423):881–889, 1993. doi: 10.1080/01621459.1993.10476353.
[37] Constantin Georgescu, Joshua M Corbin, Sandra Thibivilliers, Zachary D Webb, Yan D Zhao, Jan
Koster, Kar-Ming Fung, Adam S Asch, Jonathan D Wren, and Maria J Ruiz-Echevarría. A tmeff2-
regulated cell cycle derived gene signature is prognostic of recurrence risk in prostate cancer. BMC
cancer, 19:1–13, 2019.
[38] Elizabeth A Gibson, Jeff Goldsmith, and Marianthi-Anna Kioumourtzoglou. Complex mixtures, com-
plex analyses: an emphasis on interpretable results. Current environmental health reports, 6:53–61,
2019.
[39] Elizabeth A. Gibson, Yanelli Nunez, Ahlam Abuawad, Ami R. Zota, Stefano Renzetti, Katrina L.
Devick, Chris Gennings, Jeff Goldsmith, Brent A. Coull, Marianthi-Anna Kioumourtzoglou, and et al.
An overview of methods to address distinct research questions on environmental mixtures: An
application to persistent organic pollutants and leukocyte telomere length. Environmental Health,
18(1), 2019. doi: 10.1186/s12940-019-0515-1.
[40] Walter R Gilks, Sylvia Richardson, and David Spiegelhalter. Markov chain Monte Carlo in practice.
CRC press, 1995.
[41] Ghassan B. Hamra and Jessie P. Buckley. Environmental exposure mixtures: Questions and methods to address them. Current Epidemiology Reports, 5(2):160–165, 2018. doi: 10.1007/s40471-018-0145-0.
[42] David J. Hand and Robert J. Till. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45(2):171–186, 2001. ISSN 0885-6125. doi: 10.1023/A:1010920819831.
[43] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference and prediction. Springer, 2 edition, 2009. URL http://www-stat.stanford.edu/~tibs/ElemStatLearn/.
[44] Trevor J. Hastie, Robert John Tibshirani, and Jerome H. Friedman. Chapter 7. Springer, 2001.
[45] Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal
problems. Technometrics, 12(1):55–67, 1970. doi: 10.1080/00401706.1970.10488634.
[46] Jennifer A Hoeting, David Madigan, Adrian E Raftery, and Chris T Volinsky. Bayesian model averaging: a tutorial (with comments by M. Clyde, David Draper and E. I. George, and a rejoinder by the authors). Statistical Science, 14(4):382–417, 1999.
[47] Bonnie R Joubert, Marianthi-Anna Kioumourtzoglou, Toccara Chamberlain, Hua Yun Chen, Chris
Gennings, Mary E Turyk, Marie Lynn Miranda, Thomas F Webster, Katherine B Ensor, David B
Dunson, et al. Powering research through innovative methods for mixtures in epidemiology (prime)
program: novel and expanded statistical methods. International Journal of Environmental Research
and Public Health, 19(3):1378, 2022.
[48] Lisa Kaminski, Stéphanie Torrino, Maeva Dufies, Zied Djabari, Romain Haider, François-René Rous-
tan, Emilie Jaune, Kathiane Laurent, Nicolas Nottet, Jean-François Michiels, et al. Pgc1α inhibits
polyamine synthesis to suppress prostate cancer aggressivenesspgc1α inhibits polyamine synthesis
and aggressiveness. Cancer research, 79(13):3268–3280, 2019.
[49] Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90(431):928–934, 1995. doi: 10.1080/01621459.1995.10476592.
[50] Eric S Kawaguchi, Sisi Li, Garrett M Weaver, and Juan Pablo Lewinger. Hierarchical ridge regression
for incorporating prior information in genomic studies. Journal of Data Science, 20(1), 2022.
[51] Rémi Kazma, Joel A. Mefford, Iona Cheng, Sarah J. Plummer, Albert M. Levin, Benjamin A. Rybicki,
Graham Casey, and John S. Witte. Association of the innate immunity and inflammation pathway
with advanced prostate cancer risk. PLoS ONE, 7(12), 2012. doi: 10.1371/journal.pone.0051680.
[52] Alexander Keil. qgcomp: Quantile G-Computation, 2021. URL https://CRAN.R-project.org/package=
qgcomp. R package version 2.8.0.
[53] Alexander P Keil, Jessie P Buckley, and Amy E Kalkbrenner. Bayesian g-computation for estimating
impacts of interventions on exposure mixtures: demonstration with metals from coal-fired power
plants and birth weight. American journal of epidemiology, 190(12):2647–2657, 2021.
[54] Alexander P Keil, Jessie P Buckley, and Amy E Kalkbrenner. Bayesian g-computation for estimating
impacts of interventions on exposure mixtures: Demonstration with metals from coal-fired power
plants and birth weight. American Journal of Epidemiology, 190(12):2647–2657, 2021. doi: 10.1093/
aje/kwab053.
[55] Amit V Khera, Mark Chaffin, Krishna G Aragam, Mary E Haas, Carolina Roselli, Seung Hoan Choi,
Pradeep Natarajan, Eric S Lander, Steven A Lubitz, Patrick T Ellinor, et al. Genome-wide polygenic
scores for common diseases identify individuals with risk equivalent to monogenic mutations. Na-
ture genetics, 50(9):1219–1224, 2018.
[56] Juan Pablo Lewinger, David V Conti, James W Baurley, Timothy J Triche, and Duncan C Thomas. Hierarchical Bayes prioritization of marker associations from a genome-wide association scan for further investigation. Genetic Epidemiology: The Official Publication of the International Genetic Epidemiology Society, 31(8):871–882, 2007.
[57] Mulin Jun Li, Panwen Wang, Xiaorong Liu, Ee Lyn Lim, Zhangyong Wang, Meredith Yeager, Maria P Wong, Pak Chung Sham, Stephen J Chanock, and Junwen Wang. GWASdb: a database for human genetic variants identified by genome-wide association studies. Nucleic Acids Research, 40(D1):D1047–D1054, 2012.
[58] Qing Li and Nan Lin. The bayesian elastic net. Bayesian Analysis, 5(1):151–170, March 2010. doi:
10.1214/10-BA506. URL https://doi.org/10.1214/10-BA506.
[59] Rui Li, David V Conti, David Diaz-Sanchez, Frank Gilliland, and Duncan C Thomas. Joint analysis for
integrating two related studies of different data types and different study designs using hierarchical
modeling approaches. Human heredity, 74(2):83–96, 2013.
[60] Yingbo Li and Merlise A. Clyde. Mixtures of g-priors in generalized linear models. Journal of the
American Statistical Association, 113(524):1828–1845, 2018. doi: 10.1080/01621459.2018.1469992.
[61] Feng Liang, Rui Paulo, German Molina, Merlise A Clyde, and Jim O Berger. Mixtures of g priors for
bayesian variable selection. Journal of the American Statistical Association, 103(481):410–423, 2008.
doi: 10.1198/016214507000001337.
[62] Roderick J Little and Donald B Rubin. Causal effects in clinical and epidemiological studies via
potential outcomes: concepts and analytical approaches. Annual review of public health, 21(1):121–
145, 2000.
[63] Stacy Loeb and Alan W Partin. Review of the literature: Pca3 for prostate cancer risk assessment
and prognostication. Reviews in urology, 13(4):e191, 2011.
[64] Qing-Zhi Long, Yue-Feng Du, Xiao-Ying Ding, Xiang Li, Wen-Bin Song, Yong Yang, Peng Zhang,
Jian-Ping Zhou, and Xiao-Gang Liu. Replication and fine mapping for association of the c2orf43,
foxp4, gprc6a and rfx6 genes with prostate cancer in the chinese population. PLoS ONE, 7(5), 2012.
doi: 10.1371/journal.pone.0037866.
[65] Yong Luo, Xin Gou, Peng Huang, and Chan Mou. The pca3 test for guiding repeat biopsy of prostate
cancer and its cut-off score: a systematic review and meta-analysis. Asian journal of andrology, 16
(3):487, 2014.
[66] Peter MacCullagh and J. A. Nelder. Generalized linear models. Chapman amp; Hall, 1989.
[67] David J. MacKay. The evidence framework applied to classification networks. Neural Computation,
4(5):720–736, 1992. doi: 10.1162/neco.1992.4.5.720.
[68] David J. C. MacKay. Bayesian Interpolation. NeuralComputation, 4(3):415–447, 05 1992. ISSN 0899-
7667. doi: 10.1162/neco.1992.4.3.415. URL https://doi.org/10.1162/neco.1992.4.3.415.
[69] Richard F MacLehose, David B Dunson, Amy H Herring, and Jane A Hoppin. Bayesian methods for
highly correlated exposure data. Epidemiology, pages 199–207, 2007.
[70] Léa Maitre, Jeroen de Bont, Maribel Casas, Oliver Robinson, Gunn Marit Aasvang, Lydiane Agier, Sandra Andrušaitytė, Ferran Ballester, Xavier Basagaña, Eva Borràs, and et al. BMJ Open, 8(9), 2018. doi: 10.1136/bmjopen-2017-021311.
[71] Léa Maitre, Jean-Baptiste Guimbaud, Charline Warembourg, Nuria Güil-Oumrait, Paula Marcela
Petrone, Marc Chadeau-Hyam, Martine Vrijheid, Xavier Basagaña, and Juan R. Gonzalez. State-
of-the-art methods for exposure-health studies: Results from the exposome data challenge event.
Environment International, 168:107422, 2022. doi: 10.1016/j.envint.2022.107422.
[72] Timothy Shin Heng Mak, Robert Milan Porsch, Shing Wan Choi, Xueya Zhou, and Pak Chung Sham.
Polygenic scores via penalized regression on summary statistics. Genetic epidemiology, 41(6):469–
480, 2017.
[73] Yuzo Maruyama and Edward I. George. Fully bayes factors with a generalized g-prior. The Annals
of Statistics, 39(5), 2011. doi: 10.1214/11-aos917.
[74] Roberta Merola, Luigi Tomao, Anna Antenucci, Isabella Sperduti, Steno Sentinelli, Serena Masi,
Chiara Mandoj, Giulia Orlandi, Rocco Papalia, Salvatore Guaglianone, et al. Pca3 in prostate cancer
and tumor aggressiveness detection on 407 high-risk patients: a national cancer institute experience.
Journal of Experimental & Clinical Cancer Research, 34:1–6, 2015.
[75] Kevin P. Murphy. Machine learning: A probabilistic perspective. MIT Press, 2013.
[76] I.T. Nabney. Efficient training of rbf networks for classification. 9th International Conference on
Artificial Neural Networks: ICANN ’99 , 1999. doi: 10.1049/cp:19991110.
[77] Ashley I Naimi, Stephen R Cole, and Edward H Kennedy. An introduction to g methods. International Journal of Epidemiology, 2016. doi: 10.1093/ije/dyw323.
[78] Radford M. Neal. Bayesian learning for neural networks. 1995.
[79] Paul J Newcombe, David V Conti, and Sylvia Richardson. Jam: a scalable bayesian framework for
joint analysis of marginal snp effects. Genetic epidemiology, 40(3):188–201, 2016.
[80] Christopher A Pettigrew, John S Clerkin, and Thomas G Cotter. Duox enzyme activity promotes akt
signalling in prostate cancer cells. Anticancer research, 32(12):5175–5181, 2012.
[81] Jobanjit S Phulka, Mishal Ashraf, Beenu K Bajwa, Guillaume Pare, and Zachary Laksman. Current
state and future of polygenic risk scores in cardiometabolic disease: A scoping review. Circulation:
Genomic and Precision Medicine, page e003834, 2023.
[82] Sune Pletscher-Frankild, Albert Pallejà, Kalliopi Tsafou, Janos X Binder, and Lars Juhl Jensen. Dis-
eases: Text mining and data integration of disease–gene associations. Methods, 74:83–89, 2015.
[83] Martyn Plummer, Alexey Stukalov, and Matt Denwood. Bayesian Graphical Models using MCMC.
Vienna, Austria, 2021. URL https://mcmc-jags.sourceforge.io.
[84] Martyn Plummer et al. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, volume 124, pages 1–10. Vienna, Austria, 2003.
[85] Elena A Pudova, Elena N Lukyanova, Kirill M Nyushko, Dmitry S Mikhaylenko, Andrew R Zaretsky,
Anastasiya V Snezhkina, Maria V Savvateeva, Anastasiya A Kobelyatskaya, Nataliya V Melnikova,
Nadezhda N Volchenko, et al. Differentially expressed genes associated with prognosis in locally
advanced lymph node-negative prostate cancer. Frontiers in genetics, 10:730, 2019.
[86] MA Quintana and DV Conti. Integrative variable selection via bayesian model uncertainty. Statistics
in medicine, 32(28):4938–4953, 2013.
[87] Melanie A Quintana, Fredrick R Schumacher, Graham Casey, Jonine L Bernstein, Li Li, and David V
Conti. Incorporating prior biologic information for high-dimensional rare variant association stud-
ies. Human heredity, 74(3-4):184–195, 2013.
[88] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020. URL https://www.R-project.org/.
[89] Stephen Reid, Robert Tibshirani, and Jerome Friedman. A study of error variance estimation in lasso
regression. Statistica Sinica, 2016. doi: 10.5705/ss.2014.042.
[90] Habtom W. Ressom. Classification algorithms for phenotype prediction in genomics and proteomics. Frontiers in Bioscience, 13(13):691, 2008. doi: 10.2741/2712.
[91] James Robins. A new approach to causal inference in mortality studies with a sustained exposure
period—application to control of the healthy worker survivor effect. Mathematical modelling, 7(9-
12):1393–1512, 1986.
[92] James Robins. A new approach to causal inference in mortality studies with a sustained exposure
period—application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9-
12):1393–1512, 1986. doi: 10.1016/0270-0255(86)90088-6.
[93] Giuseppe Roscilli, Manuela Cappelletti, Claudia De Vitis, Gennaro Ciliberto, Arianna Di Napoli,
Luigi Ruco, Rita Mancini, and Luigi Aurisicchio. Circulating mmp11 and specific antibody immune
response in breast and prostate cancer patients. Journal of Translational Medicine, 12(1), 2014. doi:
10.1186/1479-5876-12-54.
[94] Andrew D Rouillard, Gregory W Gundersen, Nicolas F Fernandez, Zichen Wang, Caroline D Mon-
teiro, Michael G McDermott, and Avi Ma’ayan. The harmonizome: a collection of processed datasets
gathered to serve and mine knowledge about genes and proteins. Database, 2016, 2016.
[95] Daniel J Schaid, Wenan Chen, and Nicholas B Larson. From genome-wide associations to candidate
causal variants by statistical fine-mapping. Nature Reviews Genetics, 19(8):491–504, 2018.
[96] Astrid Schneider, Gerhard Hommel, and Maria Blettner. Linear regression analysis. Deutsches
Ärzteblatt international, 2010. doi: 10.3238/arztebl.2010.0776.
[97] Ahva Shahabi, Juan Pablo Lewinger, Jie Ren, Craig April, Andy E. Sherrod, Joseph G. Hacia, Siamak
Daneshmand, Inderbir Gill, Jacek K. Pinski, Jian-Bing Fan, and et al. Novel gene expression signature
predictive of clinical recurrence after radical prostatectomy in early stage prostate cancer patients.
The Prostate, 76(14):1239–1256, 2016. doi: 10.1002/pros.23211.
[98] Jiayi Shen, Lai Jiang, Kan Wang, Anqi Wang, Fei Chen, Paul J Newcombe, Christopher A Haiman,
and David V Conti. Fine-mapping and credible set construction using a multipopulation joint anal-
ysis of marginal summary statistics from genome-wide association studies. bioRxiv, pages 2022–12,
2022.
[99] Jonathan M Snowden, Sherri Rose, and Kathleen M Mortimer. Implementation of g-computation on a
simulated data set: demonstration of a causal inference technique. Americanjournalofepidemiology,
173(7):731–738, 2011.
[100] Massimo Stafoggia, Susanne Breitner, Regina Hampel, and Xavier Basagaña. Statistical approaches
to address multi-pollutant mixtures and multiple exposures: The state of the science. Current Envi-
ronmental Health Reports, 4(4):481–490, 2017. doi: 10.1007/s40572-017-0162-z.
[101] Bing Su, Lingqiu Gao, Catherine Baranowski, Bryan Gillard, Jianmin Wang, Ryan Ransom, Hyun-
Kyung Ko, and Irwin H. Gelman. A genome-wide rnai screen identifies foxo4 as a metastasis-
suppressor through counteracting pi3k/akt signal pathway in prostate cancer. PLoSONE, 9(7), 2014.
doi: 10.1371/journal.pone.0101411.
[102] Ryo Takata, Shusuke Akamatsu, Michiaki Kubo, Atsushi Takahashi, Naoya Hosono, Takahisa
Kawaguchi, Tatsuhiko Tsunoda, Johji Inazawa, Naoyuki Kamatani, Osamu Ogawa, and et al.
Genome-wide association study identifies five new susceptibility loci for prostate cancer in the
japanese population. Nature Genetics, 42(9):751–754, 2010. doi: 10.1038/ng.635.
[103] J Kenneth Tay, Nima Aghaeepour, Trevor Hastie, and Robert Tibshirani. Feature-weighted elastic net: using "features of features" for better prediction. arXiv preprint arXiv:2006.01395, 2020.
[104] Kyla W Taylor, Bonnie R Joubert, Joe M Braun, Caroline Dilworth, Chris Gennings, Russ Hauser,
Jerry J Heindel, Cynthia V Rider, Thomas F Webster, and Danielle J Carlin. Statistical approaches
for assessing health effects of environmental chemical mixtures in epidemiology: lessons from an
innovative workshop. Environmental health perspectives, 124(12):A227–A229, 2016.
[105] Jayantha B Tennakoon, Yan Shi, Jenny J Han, Efrosini Tsouko, Mark A White, Alan R Burns, Aijun
Zhang, Xuefeng Xia, Olga R Ilkayeva, Li Xin, et al. Androgens regulate prostate cancer cell growth
via an ampk-pgc-1α -mediated metabolic switch. Oncogene, 33(45):5251–5261, 2014.
[106] Duncan C Thomas, David V Conti, James Baurley, Frederik Nijhout, Michael Reed, and Cornelia M
Ulrich. Use of pathway information in molecular epidemiology. Human genomics, 4(1):1–22, 2009.
[107] Duncan C Thomas, David V Conti, James Baurley, Frederik Nijhout, Michael Reed, and Cornelia M
Ulrich. Use of pathway information in molecular epidemiology. Human genomics, 4(1):1–22, 2009.
[108] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical
Society: Series B (Methodological), 58(1):267–288, 1996. doi: 10.1111/j.2517-6161.1996.tb02080.x.
[109] Michael E. Tipping. Sparse Bayesian Learning and the Relevance Vector Machine.JournalofMachine
Learning Research, 1:211–244, 2001. URL http://jmlr.csail.mit.edu/papers/v1/tipping01a.html.
[110] Markku H Vaarala, Katja Porvari, Atte Kyllönen, and Pirkko Vihko. Differentially expressed genes
in two lncap prostate cancer cell lines reflecting changes during prostate cancer progression. Labo-
ratory investigation, 80(8):1259–1268, 2000.
[111] Mark A Van De Wiel, Tonje G Lien, Wina Verlaat, Wessel N van Wieringen, and Saskia M Wilt-
ing. Better prediction by use of co-data: adaptive group-regularized ridge regression. Statistics in
medicine, 35(3):368–381, 2016.
[112] Britta Velten and Wolfgang Huber. Adaptive penalization in high-dimensional regression and clas-
sification with external covariates using variational bayes. Biostatistics, 22(2):348–364, 2021.
[113] Peter M Visscher, Matthew A Brown, Mark I McCarthy, and Jian Yang. Five years of gwas discovery.
The American Journal of Human Genetics, 90(1):7–24, 2012.
[114] Peter M Visscher, Naomi R Wray, Qian Zhang, Pamela Sklar, Mark I McCarthy, Matthew A Brown, and Jian Yang. 10 years of GWAS discovery: biology, function, and translation. The American Journal of Human Genetics, 101(1):5–22, 2017.
[115] Martine Vrijheid, Rémy Slama, Oliver Robinson, Leda Chatzi, Muireann Coen, Peter van den Hazel,
Cathrine Thomsen, John Wright, Toby J. Athersuch, Narcis Avellana, and et al. The human early-life
exposome (helix): Project rationale and design. Environmental Health Perspectives, 122(6):535–544,
2014. doi: 10.1289/ehp.1307204.
[116] Wei Wei, Jiangyong Leng, Hongxiang Shao, and Weidong Wang. High pca3 scores in urine cor-
relate with poor-prognosis factors in prostate cancer patients. International journal of clinical and
experimental medicine, 8(9):16606, 2015.
[117] Melanie A Wilson, Edwin S Iversen, Merlise A Clyde, Scott C Schmidler, and Joellen M Schildkraut.
Bayesian model search and multilevel inference for snp association studies. The annals of applied
statistics, 4(3):1342, 2010.
[118] David Wipf and Srikantan Nagarajan. A new view of automatic relevance determination. In Advances in Neural Information Processing Systems 20, MIT Press, July 2008. URL https://www.microsoft.com/en-us/research/publication/new-view-automatic-relevance-determination/.
[119] David Wipf and Srikantan Nagarajan. Iterative reweighted ℓ1 and ℓ2 methods for finding sparse solutions. IEEE Journal of Selected Topics in Signal Processing, 4(2):317–329, 2010. doi: 10.1109/JSTSP.2010.2042413.
[120] John S Witte, Sander Greenland, Robert W Haile, and Cristy L Bird. Hierarchical regression analysis
applied to a study of multiple dietary exposures and breast cancer. Epidemiology, 5(6):612–621, 1994.
[121] Lingjian Yang, Darren Roberts, Mandeep Takhar, Nicholas Erho, Becky AS Bibby, Niluja Thirutha-
neeswaran, Vinayak Bhandari, Wei-Chen Cheng, Syed Haider, Amy MB McCorry, et al. Develop-
ment and validation of a 28-gene hypoxia-related prognostic signature for localized prostate cancer.
EBioMedicine, 31:182–189, 2018.
[122] Yao Yang. G-computation in causal inference, Sep 2019. URL https://towardsdatascience.com/
g-computation-in-causal-inference-774099da3631.
[123] Jindan Yu, Qi Cao, Rohit Mehra, Bharathi Laxman, Jianjun Yu, Scott A Tomlins, Chad J Creighton,
Saravana M Dhanasekaran, Ronglai Shen, Guoan Chen, et al. Integrative genomics analysis reveals
silencing ofβ -adrenergic signaling by polycomb in prostate cancer. Cancercell, 12(5):419–431, 2007.
[124] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal
of the Royal Statistical Society Series B: Statistical Methodology, 68(1):49–67, 2005. doi: 10.1111/j.
1467-9868.2005.00532.x.
[125] Arnold Zellner, Bruno De Finetti, and Prem K. Goel. On Assessing Prior Distributions and Bayesian Regression Analysis with g Prior Distributions, volume 6, pages 233–243. North-Holland, 1986.
[126] Chubing Zeng, Duncan Campbell Thomas, and Juan Pablo Lewinger. Incorporating prior knowl-
edge into regularized regression. Bioinformatics, 37(4):514–521, 2020. doi: 10.1093/bioinformatics/
btaa776.
[127] Qiang Zhang, Xiujuan Yin, Zhiwei Pan, Yingying Cao, Shaojie Han, Guojun Gao, Zhiqin Gao, Zhi-
fang Pan, and Weiguo Feng. Identification of potential diagnostic and prognostic biomarkers for
prostate cancer. Oncology letters, 18(4):4237–4245, 2019.
[128] Huanhuan Zhu and Xiang Zhou. Statistical methods for snp heritability estimation and partition: a
review. Computational and Structural Biotechnology Journal, 18:1557–1568, 2020.
[129] Xiang Zhu and Matthew Stephens. Bayesian large-scale multiple regression with summary statistics
from genome-wide association studies. The annals of applied statistics, 11(3):1561, 2017.
[130] Hui Zou. The adaptive lasso and its oracle properties. JournaloftheAmericanStatisticalAssociation,
101(476):1418–1429, 2006. doi: 10.1198/016214506000000735.
[131] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005. doi: 10.1111/j.
1467-9868.2005.00503.x.
Appendices
A Chapter 2
Table A.1: Simulation scenarios for moderately correlated (ρ = 0.5) exposures and uncorrelated (ρ = 0) exposures.
Scenario  Correlation  Profiles  β1  β2  β3  β4  β5  β12  ψ
ψ 1 0.5 [0, 1] 0 0 0 0 0 0 0
2 0.5 [0, 1] 2 1 0 0 0 0 3
3 0.5 [0, 1] 1 -1 0 0 0 0 0
4 0.5 [0, 1] 0 0 0 0 0 1 3
5 0.5 [0, 1] 2 1 0 0 0 1 6
6 0.5 [0, 1] 1 -1 0 0 0 1 3
7 0 [0, 1] 0 0 0 0 0 0 0
8 0 [0, 1] 2 1 0 0 0 0 3
9 0 [0, 1] 1 -1 0 0 0 0 0
10 0 [0, 1] 0 0 0 0 0 1 3
11 0 [0, 1] 2 1 0 0 0 1 6
12 0 [0, 1] 1 -1 0 0 0 1 3
Table A.2: Simulation results for moderately correlated (ρ = 0.5) exposures. Coefficient estimates and MCSE are compared among the univariate, WQS, quantile g-computation, and BHRM methods.
Scenario  Parameter  Truth  Univariate (MCSE)  WQS (MCSE)       Quantile g-computation (MCSE)  BHRM (MCSE)
1         β1          0     0.012 (0.057)      0.002 (0.046)    0.002 (0.081)                  0.003 (0.021)
          β2          0     0.012 (0.056)      0.001 (0.025)    0.005 (0.079)                  0.000 (0.016)
          β3,4,5      0     0.008 (0.061)      0.000 (0.008)    0.004 (0.080)                  0.041 (0.010)
          ψ           0     -                  0.004 (0.090)    -0.006 (0.084)                 0.004 (0.031)
2         β1          2     2.444 (0.082)      1.942 (0.111)    1.985 (0.085)                  1.950 (0.065)
          β2          1     1.900 (0.134)      0.936 (0.117)    1.021 (0.089)                  0.947 (0.059)
          β3,4,5      0     1.330 (0.154)      0.054 (0.042)    -0.003 (0.083)                 0.073 (0.012)
          ψ           3     -                  3.039 (0.106)    2.999 (0.087)                  2.888 (0.078)
3         β1          1     0.554 (0.096)      0.520 (0.130)    1.003 (0.069)                  0.994 (0.071)
          β2         -1     -0.557 (0.091)     0.020 (0.027)    -0.999 (0.083)                 -0.991 (0.068)
          β3,4,5      0     0.008 (0.097)      0.002 (0.003)    -0.002 (0.075)                 0.001 (0.023)
          ψ           0     -                  0.545 (0.129)    -0.004 (0.075)                 0.005 (0.067)
4         β1          0     2.171 (0.134)      1.405 (0.182)    0.007 (0.190)                  0.011 (0.102)
          β2          0     2.170 (0.146)      1.431 (0.163)    0.037 (0.192)                  0.015 (0.099)
          β3,4,5      0     1.310 (0.181)      0.072 (0.076)    -0.012 (0.185)                 -0.002 (0.020)
          β12         1     -                  -                1.007 (0.082)                  0.992 (0.054)
          ψ           3     -                  3.053 (0.163)    3.002 (0.102)                  3.004 (0.070)
5         β1          2     4.591 (0.175)      3.380 (0.203)    1.970 (0.184)                  1.718 (0.107)
          β2          1     3.996 (0.214)      2.448 (0.215)    0.996 (0.182)                  0.724 (0.108)
          β3,4,5      0     2.628 (0.305)      0.082 (0.081)    0.017 (0.315)                  -0.011 (0.034)
          β12         1     -                  -                1.017 (0.080)                  1.151 (0.057)
          ψ           6     -                  6.072 (0.129)    6.005 (0.088)                  5.907 (0.069)
6         β1          1     2.733 (0.114)      2.427 (0.182)    1.008 (0.193)                  1.003 (0.109)
          β2         -1     1.581 (0.169)      0.405 (0.202)    -1.018 (0.195)                 -0.987 (0.107)
          β3,4,5      0     1.323 (0.196)      0.025 (0.083)    0.009 (0.180)                  0.001 (0.021)
          β12         1     -                  -                1.016 (0.088)                  0.994 (0.055)
          ψ           3     -                  3.067 (0.161)    3.008 (0.084)                  3.000 (0.083)
Table A.3: Simulation results for uncorrelated (ρ = 0) exposures. Coefficient estimates and MCSE are compared among the univariate, WQS, quantile g-computation, and BHRM methods.
Scenario  Parameter  Truth  Univariate (MCSE)  WQS (MCSE)       Quantile g-computation (MCSE)  BHRM (MCSE)
7         β1          0     -0.003 (0.062)     -0.008 (0.069)   0.005 (0.066)                  0.000 (0.007)
          β2          0     0.007 (0.062)      -0.003 (0.038)   0.005 (0.072)                  0.000 (0.005)
          β3,4,5      0     -0.003 (0.063)     -0.001 (0.015)   -0.005 (0.066)                 0.000 (0.010)
          ψ           0     -                  -0.015 (0.143)   -0.005 (0.136)                 0.002 (0.022)
8         β1          2     1.993 (0.093)      1.991 (0.085)    1.998 (0.068)                  1.929 (0.074)
          β2          1     0.988 (0.153)      0.997 (0.098)    1.011 (0.062)                  0.932 (0.064)
          β3,4,5      0     -0.012 (0.171)     0.057 (0.042)    0.001 (0.062)                  -0.011 (0.029)
          ψ           3     -                  3.157 (0.156)    3.011 (0.140)                  2.827 (0.107)
9         β1          1     0.998 (0.091)      0.947 (0.120)    0.998 (0.072)                  0.988 (0.067)
          β2         -1     -0.996 (0.086)     0.149 (0.082)    -0.998 (0.064)                 -0.991 (0.066)
          β3,4,5      0     -0.006 (0.122)     0.031 (0.025)    -0.001 (0.067)                 0.000 (0.013)
          ψ           0     -                  1.190 (0.154)    -0.002 (0.152)                 -0.005 (0.089)
10        β1          0     1.508 (0.145)      1.483 (0.138)    0.018 (0.200)                  0.005 (0.085)
          β2          0     1.489 (0.146)      1.448 (0.158)    -0.005 (0.183)                 -0.002 (0.088)
          β3,4,5      0     0.014 (0.173)      0.089 (0.085)    -0.002 (0.186)                 0.000 (0.011)
          β12         1     -                  -                1.002 (0.062)                  1.000 (0.044)
          ψ           3     -                  3.199 (0.234)    3.010 (0.154)                  2.999 (0.083)
11        β1          2     3.527 (0.211)      3.509 (0.141)    1.999 (0.196)                  1.698 (0.098)
          β2          1     2.540 (0.270)      2.473 (0.147)    1.001 (0.166)                  0.719 (0.086)
          β3,4,5      0     -0.015 (0.296)     0.014 (0.081)    -0.018 (0.191)                 -0.031 (0.064)
          β12         1     -                  -                1.003 (0.060)                  1.136 (0.054)
          ψ           6     -                  6.235 (0.233)    6.011 (0.141)                  5.865 (0.095)
12        β1          1     2.491 (0.089)      2.479 (0.135)    0.988 (0.198)                  0.992 (0.091)
          β2         -1     0.497 (0.192)      0.521 (0.177)    -0.986 (0.184)                 -0.985 (0.090)
          β3,4,5      0     0.012 (0.214)      0.083 (0.077)    0.007 (0.176)                  -0.001 (0.016)
          β12         1     -                  -                0.997 (0.059)                  0.998 (0.047)
          ψ           3     -                  3.067 (0.161)    2.995 (0.158)                  2.996 (0.092)
B Chapter 3
Table B.4: Coefficient estimation bias¹ of true predictors comparing the standard EN with mxtune and mxtune without using prior information in different scenarios².
Scenario        mxtune   Non-informative mxtune   Standard EN
(a) p = 400      0.765    0.760                    0.761
    p = 800     -0.154   -0.154                   -0.154
    p = 1200     1.313    1.313                    1.313
(b) ρ = 0.5     -1.018   -1.018                   -1.018
    ρ = 0.7     -1.018   -1.018                   -1.018
    ρ = 0.9     -1.016   -1.017                   -1.017
(c) θ = 10%     -0.132   -0.132                   -0.132
    θ = 50%     -0.239   -0.239                   -0.239
    θ = 90%     -1.018   -1.018                   -1.018
(d) κ = 90%     -1.008   -1.008                   -1.008
    κ = 80%     -1.008   -1.008                   -1.008
    κ = 70%     -1.008   -1.008                   -1.008
    κ = 50%      0.487    0.487                    0.487
(e) K = 2       -0.455   -0.456                   -0.455
    K = 6       -0.057   -0.057                   -0.057
¹ Biases are averaged across all outcome categories.
² Scenarios: (a) varying p with ρ = 0.5, θ = 0.9, κ = 100%, K = 3; (b) varying ρ with p = 800, θ = 0.9, κ = 100%, K = 3; (c) varying θ with p = 800, ρ = 0.5, κ = 100%, K = 3 (the scale of the misclassification rate differs across the three conditions); (d) varying κ with p = 800, ρ = 0.5, θ = 0.9, K = 3; (e) varying the number of outcome categories K with p = 800, ρ = 0.5, θ = 0.9, κ = 100%.
Figure B.2: Model comparisons among the standard EN, mxtune, and non-informative mxtune when κ = 50% with p = 800, ρ = 0.5, θ = 0.9, and K = 3, in terms of (a) prediction accuracy, (b) probability of identifying false-positive predictors, and (c) model sparsity.
Figure B.3: mxtune computational time (seconds) with different dimensions of N, p, K, and q.
C Chapter 4
Figure C.4: Model performance with a single causal SNP. (a) Coefficient of predicted PRS vs. true PRS comparing xtune-mJAM and non-informative xtune-mJAM using the lasso, ridge, and EN penalties with mJAM-Forward. (b) Mean squared error (MSE) of the PRS model comparing xtune-mJAM and non-informative xtune-mJAM using the lasso, ridge, and EN penalties with mJAM-Forward.
Abstract
Researchers frequently encounter statistical challenges when analyzing complex datasets, such as high-dimensional data or mixtures data from environmental epidemiology studies. To mitigate the risk of overfitting or correlation issues caused by such complex data, shrinkage is commonly employed in model fitting. Traditionally, a single non-data informed penalty parameter is uniformly applied to all regression coefficients to regulate the amount of regularization in the model. However, there are instances where external information, such as gene functional annotations, can provide valuable insights into understanding the correlated data. Consequently, incorporating prior data-informed penalties has the potential to leverage the shrinkage effect and improve model performance. This thesis explores novel methods for both non-data-informed and data-informed tuning techniques, along with their applications across various types of data.
The first project establishes a flexible framework for a Bayesian Hierarchical Regression Model that effectively handles highly correlated exposures using the g-prior and g-computation. This framework allows for the estimation of: 1) independent exposure effects; 2) interactions between exposures; and 3) combined effects for mixture exposures.
The second project develops a data-informed penalized regression approach for predicting multicategorical phenotypes in high-dimensional data contexts. Specifically, the regression coefficients are regularized by feature-specific Elastic-Net penalty parameters, which are modeled as a log-linear function of prior covariates. Instead of relying on cross-validation, penalty parameters are estimated using an empirical Bayes method.
In the third project, we extend the framework from the second project, which primarily focused on individual-level data modeling, to the analysis of summary statistics in Genome-Wide Association Studies (GWAS) results from multi-population studies.
Lastly, we provide an overview of software implementation of all these methods.