Acknowledgements
The last four years in the biostatistics program have been a truly fulfilling experience. I am very
fortunate to have Dr. Juan Pablo Lewinger as my advisor. He has been a tremendous mentor
for me and has given me great support in my research. I appreciate all his valuable advice, helpful
feedback and kind encouragement, which made my Ph.D. life productive, stimulating and enjoyable.
He introduced me to the area of penalized regression, which I am very interested in. It has been a great
pleasure to do my Ph.D. under his guidance and to learn from his research expertise. Under his
supervision, I learned how to formulate research problems and find solutions to them. On a personal
level, Dr. Lewinger is very friendly, patient and kind. He has always made himself available to
clarify my doubts and gives me the freedom to explore my interests and take the courses I like. I am
very grateful for his guidance and support.
I am also very thankful to Dr. Duncan Thomas, who came up with the idea and motivation
for this project in the first place and gave me tremendous help to start on this project. I benefited
a lot from our meetings to compare code results and equation derivations. The endless joy and
enthusiasm he has for academic research also motivated me to be hardworking and passionate
about my research. I would also like to thank the rest of my dissertation committee members Dr.
David Conti, Dr. Kimberly Siegmund and Dr. Morteza Dehghani for their contribution of time and
valuable advice.
Moreover, I would like to thank my friends in the biostatistics program for all the great times
we have shared. Their friendship, advice, and collaboration have contributed immensely to my
professional and personal time at USC.
Last but not least, I want to sincerely thank my parents for their unconditional love and endless
support. Every time I went back to China, they would say, “You lost weight (and hair) due to
research. You should treat yourself better!” With so much love and power gained from them, I can
make progress in my research and write this dissertation.
Contents
Acknowledgements ii
List of Tables vii
List of Figures viii
Abstract xi
1 Penalized regression for high dimensional data 1
1.1 Model specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Bayesian interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Choosing the penalty parameter(s) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Computation algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 A glimpse of our contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 An empirical Bayes approach to select penalty parameter 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Approximation of marginal likelihood . . . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Estimation of hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1 Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.2 Data example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3 Improved prediction with LASSO regression incorporating external covariates 37
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.1 LASSO formulation from a Bayesian perspective . . . . . . . . . . . . . . . . 41
3.2.2 Empirical Bayes parameter tuning . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.3 Optimizing the objective function . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.4 Extension to linear discriminant analysis for classification . . . . . . . . . . . 46
3.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.1 Simulation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.2 Data applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Appendix 3.A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.A.1 Additional simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.A.2 Supplementary tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4 Marginal likelihood maximization algorithms 67
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.1 Estimation of σ² . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.2 EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2.3 Direct update algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.4 Iterative reweighted-L2 algorithm . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5 Incorporation of external information into Elastic-Net regression and classifica-
tion 88
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2.3 Marginal likelihood maximization . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3.1 Simulation studies for xtune EN logistic regression . . . . . . . . . . . . . . . 96
5.3.2 Simulation studies for xtune EN linear regression . . . . . . . . . . . . . . . . 99
5.3.3 The prostate cancer example . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.4 The breast cancer example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Appendix 5.A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.A.1 Supplementary tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6 xtune: An R package for incorporating external data into penalized regression 115
6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.1.2 Data structure examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.1.3 Tuning multiple penalty parameters . . . . . . . . . . . . . . . . . . . . . . . 117
6.1.4 Get started . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.2 Usage examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.2.1 General example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.2.2 The dietary data example (binary outcome) . . . . . . . . . . . . . . . . . . . 122
6.2.3 The gene expression data example (linear outcome) . . . . . . . . . . . . . . 124
6.2.4 Two special cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Bibliography 127
List of Tables
2.1 Median MSE, R², number of features selected, and estimated λ across 100 replications
comparing cross-validation vs. EB tuning. . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1 Comparison of standard LASSO, adaptive LASSO and xtune LASSO in terms of
AUC, number of selected covariates and computation time (in minutes) in the breast
cancer data example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2 The LYM, MES and CIN attractor genes based on (Cheng et al., 2013a). . . . . . . 63
3.3 Genes selected by standard LASSO, adaptive LASSO and xtune LASSO in the breast
cancer data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.1 Comparison of hyperparameter α estimates across all optimization algorithms . . . . 85
5.1 Comparison of standard logistic LASSO, standard logistic EN, and standard logistic
ridge vs. xtune logistic LASSO, xtune logistic EN, and xtune logistic ridge for AUC
and number of selected covariates in application to the breast cancer data. . . . . . . 106
5.2 External information used for the prostate cancer example . . . . . . . . . . . . . . . 109
5.3 Genes selected by standard LASSO, adaptive LASSO and xtune LASSO in the breast
cancer data example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.1 Data structure example 1: structure of primary data X and Y. . . . . . . . . . . . . 117
6.2 Data structure example 1: structure of external information data Z. . . . . . . . . . 117
6.3 Data structure example 2: structure of primary data X and Y. . . . . . . . . . . . . 118
6.4 Data structure example 2: structure of external information data Z. . . . . . . . . . 118
List of Figures
1.1 Heatmap of gene expression data matrix with 6830 genes and 64 samples. . . . . . . 2
1.2 LASSO coefficient path with the change of penalty term. . . . . . . . . . . . . . . . 4
1.3 Contour plots of LASSO objective function and ridge objective function when p = 2 . 6
1.4 The 10-Fold cross-validation procedure applied to the breast cancer dataset described
in Section 3.5.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 Double exponential distribution and normal distribution with mean zero and variance
one. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Comparison of approximated marginal likelihood contour plot vs. true marginal
likelihood contour plot in the orthogonal case. . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Comparison of estimated parameters using the approximated marginal likelihood vs.
true marginal likelihood. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 EB tuning simulation study results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5 Comparison of prediction performance of cross-validation vs. EB tuning applied to
the bone density dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1 xtune LASSO simulation study results. . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2 Prediction test R² and number of selected gene expression features for the adaptive
LASSO, standard LASSO and the proposed method applied to the bone density data. 53
3.3 ROC curves for adaptive LASSO, standard LASSO and xtune LASSO applied to the
breast cancer dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.4 Illustration of correlation structure used in simulation studies. . . . . . . . . . . . . . 58
3.5 Additional xtune LASSO simulation studies Results . . . . . . . . . . . . . . . . . . 62
4.1 Idea of MM procedure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.2 Graphic representation of a concave function upperbounded by a linear function. . . 79
4.3 Non-convex marginal likelihood (5.9) as a sum of concave function and a convex
function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.4 Convergence plots and estimated coefficient values for EM, gradient descent and
iterative reweighted-L2 algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.5 Comparison of computation time for cross-validation LASSO, xtune LASSO using
iterative reweighted-L2, and xtune LASSO using LBFGS. . . . . . . . . . . . . . . . 85
5.1 Results for the five scenarios considered in the xtune EN logistic model simulation
studies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2 Results for the five scenarios considered in the xtune EN linear model simulation
studies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3 Application to prostate cancer example: results . . . . . . . . . . . . . . . . . . . . . 104
5.4 ROC curves for standard LASSO, xtune LASSO, standard EN, xtune EN, standard
ridge and xtune ridge applied on the breast cancer dataset. . . . . . . . . . . . . . . 105
Abstract
The rapid advancement of high-throughput sequencing technologies has produced unprecedented
amounts and types of omic data. Predicting clinical outcomes based on genomic features like gene
expression, methylation, and genotypes is becoming increasingly important for individualized risk
assessment and treatment. Associated with genomic features, there is also a rich set of “meta-
features” such as functional annotation, pathway information, and knowledge from previous studies,
that comprise valuable additional information. Traditionally, such “meta-feature” information is
used in a post-hoc manner to enhance model explainability. For example, after model fit, analysis
can be conducted to formally assess whether the selected gene features are enriched in particular
metabolic pathways or gene ontology annotations. This kind of post-hoc analysis can provide
biological insights and validation for a prediction model. In this dissertation, we propose novel
methods that exploit genomic meta-features a priori rather than post hoc, to better identify
important markers and improve prediction performance. We aim to address one central question:
how can we predict an outcome of interest and identify relevant features while taking additional
information on the features into account?
Since genomic data sets are typically high-dimensional, penalized regression methods are commonly
used to select relevant features and build predictive models. Standard penalized regression
applies one penalty parameter to all features, ignoring the structural differences or heterogeneity
of the features. Based on this, we integrate meta-features into penalized regression by adapting the
penalty parameters to be “meta-feature-driven”. The penalty parameters are modeled as a log-linear
function of the meta-features and are estimated from the data using an approximate empirical Bayes
approach.
This dissertation is structured as follows. Chapter 1 introduces how penalized regression
techniques can be used to solve high dimensional data problems. Chapter 2 describes an empirical
Bayes approach to select the penalty parameter(s) in penalized regression. Chapter 3 discusses our
method for incorporating meta-features into LASSO linear regression. Chapter 4 is devoted to the
optimization algorithms for marginal likelihood maximization. Chapter 5 extends the model to
Ridge and Elastic-Net linear and logistic regression. Finally, Chapter 6 presents the R package we
developed to implement our method.
Chapter 1
Penalized regression for high
dimensional data
Data with a large number of predictors relative to the sample size—“high-dimensional data”—are
becoming increasingly prevalent with the advance of high throughput measurement devices. For
example, we are now able to collect the expression values of several thousands of genes simultaneously
with the rapid development of DNA microarray experiments. In the particular example of Figure
1.1 from (Hastie et al., 2009), the gene expression data consist of 6830 genes and only 64 samples.
To build predictive models for such datasets, we need statistical methods and algorithms that can
sort through a large number of features and find a subset that can accurately predict the response.
Penalized regression methods, with the ability to accurately predict outcomes of interest and identify
related signal features, have gained popularity in research areas involving high dimensional data,
especially in genetic studies. In a penalized regression framework, the regression coefficients are
shrunk by maximizing the log-likelihood function subject to a penalty term. The form of the penalty
term determines the properties of penalized models. In practice, a variety of penalty terms have
been proposed, among which the most popular ones are the LASSO (Tibshirani, 1996), ridge (Hoerl
and Kennard, 1970) and Elastic-Net (EN) (Zou and Hastie, 2005). We focus on those three penalty
forms in this dissertation; generalization to other regularization forms may be considered for future
research.
Figure 1.1 Heatmap of gene expression data matrix with 6830 genes (columns) and 64 samples
(rows) for the human tumor data. Only a random sample of 100 rows is shown. Bright green
represents negative (under-expressed) genes and bright red represents positive (over-expressed) genes.
This figure is based on Figure 1.3 of (Hastie et al., 2009).
LASSO was first proposed by Tibshirani in 1996. It was not originally designed to handle
high dimensional problems. Instead, its original motivation was to improve the prediction
accuracy and interpretability of the ordinary least squares (OLS) estimates (Tibshirani, 1996). The
OLS estimates are often unbiased but have large variance and are not easy to interpret when there is
a large number of predictors. LASSO overcomes these two drawbacks by adding an L1 constraint to
the OLS objective function. Because of the nature of the L1 constraint, LASSO sets some
coefficients to be exactly 0, and hence gives a more interpretable model and improves prediction at
the sacrifice of a little bias.
LASSO received much more attention years later, when high dimensional regression became
a hot topic with the emergence of large data sets (p > n), especially in genomics. It enables variable
selection and prediction simultaneously in one stage, which is crucial for high dimensional problems
(Tibshirani, 2011). Variants of the LASSO related to our method are discussed in Chapter 2 and
Chapter 3.
Before the wide use of LASSO, subset selection methods, such as best-subset selection and
stepwise selection, were often used for feature selection. However, subset selection can be very
unstable. The inclusion and drop-off of variables is a discrete process, which may result in a very
different selected model with small changes in the data (Tibshirani, 1996). Also, subset selection
methods are generally not very computationally efficient or scalable for very large p or when
p >> n (Hastie et al., 2009).
Ridge regression (Hoerl and Kennard, 1970) is also a popular regularization method for high
dimensional data. It was proposed 26 years before LASSO was first introduced. Like LASSO, it
was not originally intended to handle high dimensional data but to improve the prediction accuracy
of OLS. Today, ridge regression is also a popular shrinkage method. Instead of using an L1 penalty
term, ridge uses an L2 penalty term. Because of the nature of the L2 penalty, ridge shrinks the size of
the coefficients but cannot shrink any coefficient to exactly 0. It always keeps all the predictors
in the model. LASSO does both parameter shrinkage and variable selection automatically and thus
is more suitable than ridge regression when feature selection is desired. However, ridge regression is
much easier to compute than the LASSO, and its estimation is more stable in the presence of
multicollinearity (Mallick and Yi, 2013).
In data with strong correlations, like genomic data sets where genes tend to operate in given
molecular pathways, there is often multicollinearity and grouping among predictors. LASSO often
selects one of the correlated variables essentially at random, while ridge, on the other hand,
tends to shrink the coefficient values of grouped variables toward each other (Hastie et al., 2009).
Zou and Hastie (2005) proposed the EN estimator, which is based on a mixture of the LASSO
and ridge regression penalties. It is more robust to correlations among predictors and is particularly
useful when the number of highly correlated predictors is larger than the sample size.
1.1 Model specification

Consider a linear model Y = Xβ + ε, where Y is the n-dimensional vector of observed outcomes,
X = (X_1, ..., X_p) is the n × p matrix of predictors, and ε is a vector of unobserved, independent
and identically distributed errors with expectation 0 and variance σ². LASSO regression shrinks
the regression coefficients by imposing an L1 penalty on the sum of the absolute values of the
regression coefficients.

Figure 1.2 Coefficient path with the change of the penalty term λ in (1.2) on a breast cancer data set.
A detailed description of the data is in Section 3.5.2. Each curve represents the coefficient of a feature
(gene) in the model. The x axis is the log of the penalty term λ; the y axis gives the value of the
coefficient. Only the top 100 features (genes) with the largest empirical variance were used. glmnet
in R was used to generate this figure.
The regression coefficients are obtained by minimizing a constrained residual sum of squares. The
objective of LASSO is to solve:

\min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2, \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le t    (1.1)

where t is a size constraint parameter that controls the amount of penalty LASSO puts on the
regression coefficients: the tighter the bound t, the greater the amount of shrinkage and the more
coefficients are shrunk towards zero, and vice versa.
Equivalently, we can write the constrained optimization problem in (1.1) in its Lagrange form:

\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_2^2 + \lambda \sum_{j=1}^p |\beta_j| \right\}    (1.2)

where the penalty parameter λ in (1.2) is the Lagrange multiplier corresponding to t in (1.1). There
is a one-to-one relationship between λ and t. Figure 1.2 shows the coefficient path as λ changes.
When the penalty parameter λ is very small (left), the LASSO coefficient estimates are close to the
OLS estimates. On the right, the penalty term is very large and every coefficient is shrunk to 0. In
between, we have a constrained solution, which is sparse because the effects of some features (genes)
are shrunk to 0.
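For concreteness, a coefficient path like the one in Figure 1.2 can be produced with glmnet. The snippet below is only a sketch: it uses simulated data in place of the breast cancer features, and the variable names are ours.

# Sketch of a LASSO coefficient path in the spirit of Figure 1.2, on simulated
# data (illustration only; not the breast cancer data used in the figure).
library(glmnet)

set.seed(1)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 5), rep(0, p - 5))        # a few true signal features
Y <- as.numeric(X %*% beta + rnorm(n))

fit <- glmnet(X, Y, alpha = 1)             # alpha = 1 gives the LASSO penalty
plot(fit, xvar = "lambda", label = TRUE)   # coefficient path against log(lambda)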
Ridge regression is like LASSO with a different type of constraint. Instead of an L1 penalty,
ridge regression has an L2 penalty. The objective of ridge regression is:

\min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|_2^2, \quad \text{subject to} \quad \sum_{j=1}^p \beta_j^2 \le t    (1.3)

We can also write the ridge problem in its equivalent Lagrange form:

\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_2^2 + \lambda \sum_{j=1}^p \beta_j^2 \right\}    (1.4)
Like LASSO, the penalty parameter in ridge regression controls the amount of shrinkage of the
coefficients. The difference between the two methods lies in the shape of the constraint region in
the objective functions. Figure 1.3 shows the shape of the objective function for LASSO and ridge
when p = 2. The coefficient estimates are found where the red elliptical contours hit the blue
constraint region. Since the LASSO constraint is not differentiable at 0 and has a diamond-shaped
constraint region, a coefficient is set to 0 when the contours hit a corner of the diamond constraint
region (Hastie et al., 2009). A theoretical proof of why L1 regularization yields sparse solutions can
be found in Section 13.3.2 of (Murphy, 2012).
The EN penalty is a compromise between the LASSO penalty (equation 1.2) and the ridge penalty
(equation 1.4), with the following objective function:

\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_2^2 + \lambda \sum_{j=1}^p \left( \frac{1-\alpha}{2} \beta_j^2 + \alpha |\beta_j| \right) \right\}    (1.5)

An extra weight parameter α controls the relative weights of the ridge (α = 0) and LASSO (α = 1)
penalties. Like LASSO, EN regression does both shrinkage and variable selection at the same time,
but it is more robust in situations where there is correlation and a grouping effect among predictors
(Zou and Hastie, 2005).
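In glmnet, the same interface covers the LASSO, ridge, and EN penalties through a mixing parameter alpha that plays the role of α in (1.5) (glmnet scales the loss slightly differently, but the mixing behaves the same way). A minimal sketch with simulated data:

# LASSO, ridge, and Elastic-Net differ only in the mixing parameter alpha.
library(glmnet)

set.seed(1)
X <- matrix(rnorm(100 * 50), 100, 50)
Y <- as.numeric(X[, 1] - 2 * X[, 2] + rnorm(100))

fit_lasso <- glmnet(X, Y, alpha = 1)     # pure L1 penalty
fit_ridge <- glmnet(X, Y, alpha = 0)     # pure L2 penalty
fit_en    <- glmnet(X, Y, alpha = 0.5)   # Elastic-Net compromise

sum(coef(fit_lasso, s = 0.1) != 0)       # sparse: only some coefficients are nonzero
sum(coef(fit_ridge, s = 0.1) != 0)       # ridge keeps every coefficient nonzero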
In general, the effect of regularization is to control the trade-off between the bias, i.e., the
flexibility of the model to capture the true underlying relationship between Y and X, and its
variance, i.e., the variability of the model fit across different training data sets. This control of the
bias-variance trade-off is achieved by the shrinkage toward zero of the estimates of the regression
coefficients induced by regularization. The amount of shrinkage is controlled by the tuning parameter
λ. A smaller value of λ gives more relative weight to the sum of squares term in the objective
function, forcing a closer fit to the data and inducing less shrinkage of the regression coefficients.
The resulting model has a higher variance and a lower bias. By contrast, a model with a larger value
of λ gives more relative weight to the regularization penalty term, inducing more shrinkage but
fitting the training data less closely. The resulting model has a lower variance and higher bias. Since
the generalizability of the model, i.e., its prediction performance on new data, requires minimizing
the mean squared error, which can be decomposed into a bias plus a variance term, proper tuning of
the penalty parameter is crucial for obtaining models with good performance.

Figure 1.3 Plot of the LASSO (left) and ridge regression (right) objective functions when there are
only two coefficients. Red elliptical contours represent the residual sum of squares and blue regions
(diamond for LASSO and circle for ridge) represent the constraint on the coefficients. This figure is
from Figure 3.11 of (Hastie et al., 2009).
1.2 Bayesian interpretation

Most penalized regression approaches have a Bayesian interpretation. Specifically, the coefficient
estimates obtained by maximizing the penalized log-likelihood often correspond to the posterior
mode, or maximum a posteriori (MAP) estimate, obtained by specifying a prior distribution for the
regression coefficients (Bhattacharya et al., 2014). The choice of prior distribution is implied by the
form of the penalty.

In linear models, we assume that the distribution of the response variable Y conditional on the
regression coefficients β and σ² follows a normal distribution:

Y \mid X, \beta \sim N(X\beta, \sigma^2 I_n)    (1.6)
The LASSO, ridge, and EN estimates can be equivalently characterized as MAP estimates in a
Bayesian model with a double exponential prior, a Gaussian prior, and a compromise of those two,
respectively. Specifically, if the prior distribution of β conditional on σ² is given by

f(\beta \mid \sigma^2) \sim \text{Double Exponential}\left(0, \frac{\lambda}{2\sigma^2}\right) \quad \text{or}
f(\beta \mid \sigma^2) \sim N\left(0, \frac{\sigma^2}{\lambda}\right) \quad \text{or}
f(\beta \mid \sigma^2) \propto \exp\left\{ -\frac{\lambda}{2\sigma^2}\left[ (1-c)\|\beta\|_2^2 / 2 + c\|\beta\|_1 \right] \right\}    (1.7)

the marginal posterior mode of β | Y is equivalent to the LASSO, ridge and EN estimates, respectively.
The double exponential distribution with location parameter 0 and scale parameter λ/(2σ²) has the
density function

\text{Double Exponential}\left(\beta \mid 0, \frac{\lambda}{2\sigma^2}\right) = \frac{\lambda}{4\sigma^2} \exp\left( -\frac{\lambda}{2\sigma^2} |\beta| \right)
A brief proof of the LASSO case is shown here. The Bayesian analogue of LASSO is:

Y \mid X, \beta \sim N(X\beta, \sigma^2 I_n)
\beta_j \mid \sigma^2 \overset{i.i.d.}{\sim} \text{Double Exponential}\left(\frac{\lambda}{2\sigma^2}\right), \quad j = 1, 2, \ldots, p    (1.8)

where Double Exponential(λ/(2σ²)) has density function \frac{\lambda}{4\sigma^2}\exp\left(-\frac{\lambda}{2\sigma^2}|\beta|\right). The joint likelihood of Y
and β is:

L(Y, \beta) = p(Y \mid X, \beta)\, f(\beta \mid \lambda, \sigma^2)
            = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(Y_i - X_i\beta)^2}{2\sigma^2}} \prod_{j=1}^p \frac{\lambda}{4\sigma^2} \exp\left(-\frac{\lambda}{2\sigma^2}|\beta_j|\right)    (1.9)
Taking the log and discarding constant terms, the log joint likelihood becomes:

\ell(Y, \beta) = -\frac{1}{2\sigma^2} (Y - X\beta)^T (Y - X\beta) - \frac{\lambda}{2\sigma^2} \sum_{j=1}^p |\beta_j|    (1.10)

which is equivalent to (1.2) for given σ². The connection between penalty forms and shrinkage
priors allows the penalty parameter(s) to be estimated by integrating the regression coefficients out
of the posterior distribution, instead of relying on standard cross-validation. Motivated by this idea,
we propose an empirical Bayes tuning approach to select the penalty parameter(s) λ, which will be
discussed in detail in Chapter 2 and Chapter 3.
1.3 Choosing the penalty parameter(s)

The fitting of a penalized regression consists of two steps: 1) choosing the penalty parameter(s) and
2) estimating the regression coefficients given the penalty parameter(s). The first step is critical to
achieving good performance when using regularized regression models. The penalty parameter λ in
LASSO controls the amount of shrinkage applied to the regression coefficients. Different values of λ
can have a significant influence on the selected variables and the prediction accuracy of LASSO. The
model and its interpretation can change quite dramatically depending on the choice of the constraint.
Therefore, it is essential to choose a good penalty parameter. The most popular approach used in
practice is cross-validation (CV), including Leave-One-Out Cross-Validation (LOOCV) and K-fold
cross-validation.

LOOCV is done by splitting the data into a training set and a testing set. Each time, a single
observation is used for testing, and the remaining observations are used for training. A grid of penalty
parameters λ is fit on the training set, and the prediction error is computed using the testing
set. This process is repeated n times. The selected λ is the one that gives the smallest aggregated
prediction error across all n repetitions. The advantage of LOOCV is that it generates the same
results every time regardless of how the data are split. However, LOOCV tends to select models that
overfit, and for that reason, K-fold cross-validation is more commonly used (Krstajic et al., 2014).
The procedure of K-fold cross-validation is similar to LOOCV. Instead of leaving out one observation
at a time, K-fold cross-validation splits the data into K folds. The rest of the procedure is the
same as for LOOCV. Usually, K is 5 or 10, and the prediction error is generally measured by the mean
squared error (MSE) for regression problems and the area under the ROC curve (AUC) for classification
problems. Figure 1.4 shows the result of 10-fold cross-validation applied to a breast cancer dataset.
From the figure, λ = exp(−5) gives the smallest prediction error (misclassification rate) and is
selected as the optimal penalty term.
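The cross-validation procedure of Figure 1.4 corresponds to the standard cv.glmnet workflow; the sketch below uses a simulated binary outcome rather than the breast cancer data.

# 10-fold cross-validation over a grid of lambda values (sketch on simulated data).
library(glmnet)

set.seed(1)
X <- matrix(rnorm(200 * 100), 200, 100)
y <- rbinom(200, 1, plogis(X[, 1] - X[, 2]))

cvfit <- cv.glmnet(X, y, family = "binomial", nfolds = 10,
                   type.measure = "class")   # misclassification error, as in Figure 1.4
plot(cvfit)                                  # CV error curve against log(lambda)
cvfit$lambda.min                             # lambda with the smallest average CV error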
Cross-validation is easy to implement when the number of penalty parameters is low. Moreover, it
directly optimizes the penalty parameter by minimizing the predictive error; thereby, the estimation
directly matches the primary goal of most prediction problems. However, despite its extensive use
and ease of implementation, cross-validation may lead to unstable models when used in high
dimensional data (Lim and Yu, 2016). The choice of λ, and hence the results of the model fit, can vary
due to different splits of the dataset. In addition, the computation time required by cross-validation
grows exponentially with the number of penalty parameters (Boulesteix et al., 2017). To tune M
penalty parameters, an M-dimensional grid search is needed for each split of the data. With this
computational burden, cross-validation is manageable for up to two penalty parameters in practice.
In Chapter 2, we describe in detail a novel empirical Bayes approach to select the penalty parameter
as an alternative to cross-validation.

Figure 1.4 The 10-fold cross-validation procedure applied to the breast cancer dataset described in
Section 3.5.2. Red dots are the aggregated prediction error (misclassification error) across 10 folds.
The dashed vertical line marks the selected λ giving the smallest average prediction error across the
10 folds. glmnet in R was used to generate this figure.
1.4 Computation algorithms

Once the tuning parameter(s) have been selected, the next step is to find the coefficients that minimize
(1.2) for LASSO, (1.4) for ridge, or (1.5) for EN. Since penalized regression is often used in
fields with very large datasets, such as genomics or neuroscience research, its computational efficiency
is crucial. For a given penalty parameter λ, the LASSO regression coefficients are the ones that
optimize the LASSO objective function (1.2). Unlike ridge regression, whose regression coefficients
have a closed-form expression (1.11), LASSO does not have a closed-form expression for its coefficient
estimates.

\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I_p)^{-1} X^T Y    (1.11)
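For reference, the closed-form ridge solution (1.11) can be computed directly with a few lines of linear algebra; the data below are simulated for illustration.

# Direct evaluation of the ridge closed form (1.11) on simulated data.
set.seed(1)
n <- 50; p <- 20; lambda <- 2
X <- matrix(rnorm(n * p), n, p)
Y <- as.numeric(X[, 1] + rnorm(n))

beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% Y)
head(beta_ridge)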
In the original paper (Tibshirani, 1996), Tibshirani proposed a quadratic programming solver,
since the LASSO has a quadratic loss. However, it did not scale well and was not statistically motivated
(Tibshirani, 2011). Later, Efron et al. (2004) pointed out that the LASSO can be computed by a basic
procedure called Least Angle Regression (LARS). The LARS algorithm by (Efron et al., 2004) connects
the LASSO to forward stagewise regression and computes the entire regularization path of the LASSO.
It requires the same order of computation as a single least-squares fit using the p predictors (Hastie
et al., 2009). The homotopy algorithm by (Osborne et al., 2000) also exploits the piecewise-linear
path for computing the LASSO. Both the LARS algorithm and the homotopy algorithm yield
efficient ways of fitting the LASSO.
Friedman et al. (2007) drew attention to the potential of a “one-at-a-time” coordinate-wise descent
algorithm in convex optimization problems and showed its application to large-scale LASSO problems.
Friedman et al. (2007) and Friedman et al. (2010a) used coordinate descent to fit LASSO problems
and demonstrated that their procedures outperform all competitors on large LASSO problems.
Since the LASSO objective function is hard to optimize over all variables simultaneously, the
idea of coordinate descent is to estimate the parameters successively by a coordinate-wise descent
procedure, one single parameter at a time. This approach is particularly appealing because
each single-parameter (one-dimensional) optimization problem can be easily and quickly solved
by applying a soft-thresholding operator (Friedman et al., 2007, 2010a). The glmnet package in R
implements the coordinate descent algorithm for LASSO (and many other penalization methods),
and it is probably by far the fastest among all competing techniques.
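To make the soft-thresholding idea concrete, the sketch below implements a bare-bones cyclic coordinate descent for the objective 0.5||Y − Xβ||² + λ||β||₁. It only illustrates the update rule; it is not the glmnet implementation and omits convergence checks and standardization details.

# Minimal cyclic coordinate descent for 0.5 * ||Y - X beta||^2 + lambda * ||beta||_1.
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

lasso_cd <- function(X, Y, lambda, n_iter = 100) {
  p <- ncol(X)
  beta <- rep(0, p)
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      r_j <- Y - X[, -j, drop = FALSE] %*% beta[-j]   # partial residual without feature j
      beta[j] <- soft_threshold(crossprod(X[, j], r_j), lambda) / crossprod(X[, j])
    }
  }
  beta
}

set.seed(1)
X <- scale(matrix(rnorm(100 * 10), 100, 10))
Y <- as.numeric(X[, 1] - X[, 2] + rnorm(100))
round(lasso_cd(X, Y, lambda = 20), 3)   # several coefficients are exactly zero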
Other optimization methods include proximal and gradient projection methods (Vandenberghe,
2011) and the EM algorithm (Figueiredo, 2003). Proximal and gradient projection methods are suitable
for very large scale problems and can easily be extended to regularization beyond the L1 constraint
(Murphy, 2012). The EM algorithm for LASSO was first proposed by (Figueiredo, 2003). It uses the
fact that the Laplace distribution can be expressed as a Gaussian scale mixture. The EM algorithm for
LASSO is slow in practice, but it suggests trying other priors for the variances besides the
gamma distribution (Murphy, 2012). A similar idea was used later in the Bayesian Lasso of (Park and
Casella, 2008).

Because the coordinate descent algorithm in glmnet is very fast in practice, we used the glmnet
package (Friedman et al., 2010a) in the R programming language to fit the LASSO, ridge, and EN
once the penalty term is estimated.
1.5 A glimpse of our contribution

The standard penalized regression methods discussed in this chapter apply a single global penalty
parameter λ to all regression coefficients, effectively treating all features or predictors equally
in the model building process. This can result in over-shrinking of important coefficients and
under-shrinking of unimportant ones, with a corresponding loss in prediction ability. In the following
chapters, we introduce the idea of using “data-driven” differential penalization of features to account
for the heterogeneity, grouping, or structural differences of the features in penalized regression.
Specifically, we use external information that correlates with the importance of the predictors to guide
the choice of penalty parameters. In genomics and other applications, there is often a great deal of
such external information. Examples include gene function annotations, omic types, or previous
study results. We developed methods to incorporate such information into LASSO, ridge, and
Elastic-Net regression, for both linear regression problems and binary classification problems. The
proposed methods are demonstrated through simulation studies and applications to real data.
Chapter 2
An empirical Bayes approach to select
penalty parameter
2.1 Introduction

As discussed in the previous chapter, cross-validation (CV) is easy to implement and directly
optimizes the penalty parameter by minimizing the predictive error. However, despite its extensive
use and ease of implementation, CV may lead to unstable models when used in high dimensional
data (Lim and Yu, 2016). The choice of the penalty parameter, and hence the results of the model fit,
can vary due to different splits of the dataset. In addition, the computation time required by
CV grows exponentially with the number of penalty parameters (Boulesteix et al., 2017). When
multiple tuning parameters are required, a multi-dimensional grid search is needed for each split
of the data, and CV can be computationally disadvantageous in such a case.

In this chapter, we propose a novel direct empirical Bayes (EB) approach to select the penalty
parameter for LASSO regression. LASSO regression has a Bayesian interpretation with a
double exponential prior distribution for the parameters. Utilizing this connection, we estimate
the hyperparameters in its Bayesian formulation by maximizing their marginal likelihood function,
obtained by integrating over the regression coefficients, which leads back to the choice of the penalty
parameter in LASSO. This EB procedure is also known as type II maximum likelihood (Berger,
2008), or evidence approximation in the machine learning literature (MacKay, 1992).
Here we adopt a direct EB approach, i.e., we maximize an analytical expression for the marginal
likelihood. When an analytic expression for the marginal likelihood cannot be obtained, other
forms of EB can also be used (Van de Wiel et al., 2017). Examples include EB using a Laplace
approximation, Markov chain Monte Carlo sampling EB, or variational Bayes EB (Barber et al.,
2016; Park and Casella, 2008; Li and Lin, 2010; Polson et al., 2012; Blei et al., 2016).

Several methods have been proposed as alternatives for choosing the penalty parameter. Lim and
Yu (2016) proposed an “estimation stability with cross-validation” (ESCV) method based on an
estimation stability metric and CV. ESCV finds a model that is smaller than the standard CV
model but has the property of estimation stability. Lei (2017) described a “cross-validation with
confidence” approach that takes into account the uncertainty in the testing sample and has the
property of variable selection consistency in a classical linear regression setting. These methods
are still based on the procedure of cross-validation. Another, different type of approach that does
not rely on CV adopts a fully Bayesian formulation. The fully Bayesian approach also utilizes the
connection between penalized regression methods and their Bayesian formulations, but assigns the
parameters hyper-priors and samples them from the corresponding posterior distribution rather than
maximizing their marginal likelihood. Park and Casella (2008) considered a hierarchical representation
of a fully Bayesian model that represents the double exponential distribution by a scale mixture of
normals, and the penalty parameter is assigned a diffuse hyperprior. The full conditional distribution
of λ² is specified, and a Gibbs sampler is used to estimate λ and block-update β. They also proposed an
EB estimation of λ using a Monte Carlo EM algorithm in combination with a Gibbs sampler. In
each iteration of the Monte Carlo EM algorithm, λ is treated as “missing” data and is updated by
its expected value. The expected value of λ does not have an analytic solution and is estimated
using a Gibbs sampler with the λ estimate from the last iteration. Though the Gibbs sampler is
relatively efficient, the algorithm's implementation can be computationally intensive and
time-consuming in practice.
In this proposal, we use a direct EB approach to choose the penalty parameter for three main
reasons: (1) the prior can be generalized to allow a different λ_j for each feature; this is the principal
practical advantage of EB over CV, as it would not be possible to use CV to tune p different
hyperparameters. (2) It is much more computationally efficient than a fully Bayesian approach.
(3) The EB framework provides a natural way of incorporating prior knowledge into penalized
regression, as we will discuss later in Chapter 3 and Chapter 5.
2.2 Methods

2.2.1 Approximation of marginal likelihood

As stated in Chapter 1, the LASSO objective function

\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_2^2 + \lambda \sum_{j=1}^p |\beta_j| \right\}    (2.1)

can be interpreted in the Bayesian form:

Y \mid X, \beta \sim N(X\beta, \sigma^2 I_n)
\beta_i \overset{i.i.d.}{\sim} \text{Double Exponential}(\tau), \quad i = 1, 2, \ldots, p    (2.2)

We have shown that the MAP estimate under (2.2) is equivalent to the LASSO coefficient estimate in
(2.1) when λ = 2σ²τ. Hence, we can first estimate σ² and τ and use this connection to obtain an
estimate of the penalty parameter λ. The maximum likelihood estimates (MLE) of σ² and τ are
obtained by maximizing their marginal likelihood, calculated by integrating the joint distribution of
Y and β over the random effect β.
L(\sigma^2, \tau) = \int_{\mathbb{R}^p} p(Y \mid X, \beta)\, f(\beta \mid \tau)\, d\beta
                  = \int_{\mathbb{R}^p} N(Y \mid X\beta, \sigma^2 I_n)\, \mathrm{DE}(\beta \mid 0, \tau)\, d\beta
                  = \int_{\mathbb{R}^p} \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(Y_i - X_i\beta)^2}{2\sigma^2}} \prod_{j=1}^p \frac{\tau}{2} e^{-\tau|\beta_j|}\, d\beta
                  = \frac{1}{(2\pi\sigma^2)^{n/2}} \left(\frac{\tau}{2}\right)^p \int_{\mathbb{R}^p} e^{\ell(\beta)}\, d\beta    (2.3)

where

\ell(\beta) = -\left( \frac{1}{2\sigma^2} (Y - X\beta)^T (Y - X\beta) + \tau \|\beta\|_1 \right)    (2.4)
Derivation of the hyperparameter marginal likelihood in the orthogonal case

In the special case where X is orthogonal (i.e., X^T X = I_p), the marginal likelihood (2.3) can be
computed analytically. Using

\ell(\beta) = -\left( \frac{1}{2\sigma^2} (Y - X\beta)^T (Y - X\beta) + \tau \|\beta\|_1 \right)
            = -\frac{1}{2\sigma^2} Y^T Y + \frac{1}{\sigma^2} \sum_{j=1}^p \left( \beta_j \hat{\beta}_j^{ls} - \frac{1}{2}\beta_j^2 - \tau\sigma^2 |\beta_j| \right)    (2.5)

we have

L(\sigma^2, \tau) = \frac{1}{(2\pi\sigma^2)^{n/2}} \left(\frac{\tau}{2}\right)^p \int_{\mathbb{R}^p} e^{\ell(\beta)}\, d\beta
                  = \frac{1}{(2\pi\sigma^2)^{n/2}} \left(\frac{\tau}{2}\right)^p \exp\left(-\frac{1}{2\sigma^2} Y^T Y\right) \int_{\mathbb{R}^p} \exp\left( \frac{1}{\sigma^2} \sum_{j=1}^p \left( \beta_j \hat{\beta}_j^{ls} - \frac{1}{2}\beta_j^2 - \tau\sigma^2 |\beta_j| \right) \right) d\beta
                  = \frac{1}{(2\pi\sigma^2)^{n/2}} \left(\frac{\tau}{2}\right)^p \exp\left(-\frac{1}{2\sigma^2} Y^T Y\right) \prod_{j=1}^p \int \exp\left( \frac{1}{\sigma^2} \left( \beta_j \hat{\beta}_j^{ls} - \frac{1}{2}\beta_j^2 - \tau\sigma^2 |\beta_j| \right) \right) d\beta_j    (2.6)

where

\hat{\beta}^{ls} = (X^T X)^{-1} X^T Y = X^T Y    (2.7)

The one-dimensional integrals

\int \exp\left( \frac{1}{\sigma^2} \left( \beta_j \hat{\beta}_j^{ls} - \frac{1}{2}\beta_j^2 - \tau\sigma^2 |\beta_j| \right) \right) d\beta_j    (2.8)

can be computed analytically.
The likelihood for the hyperparameters under an orthogonal design is:

L(\sigma^2, \tau) = \frac{1}{(2\pi\sigma^2)^{n/2}} \left(\frac{\tau}{2}\right)^p \exp\left(-\frac{1}{2\sigma^2} Y^T Y\right) \left(\frac{\pi\sigma^2}{2}\right)^{p/2} \times \prod_{j=1}^p e^{\frac{(\hat{\beta}_j^{ls} - \tau\sigma^2)^2}{2\sigma^2}} \left( 1 + \mathrm{Erf}\!\left( \frac{\hat{\beta}_j^{ls} - \tau\sigma^2}{\sqrt{2\sigma^2}} \right) + e^{2\tau \hat{\beta}_j^{ls}}\, \mathrm{Erfc}\!\left( \frac{\hat{\beta}_j^{ls} + \tau\sigma^2}{\sqrt{2\sigma^2}} \right) \right)    (2.9)

where Erf() is the error function, which is twice the integral of the Gaussian distribution with mean 0
and variance 1/2:

\mathrm{Erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt

and Erfc() is the complementary error function:

\mathrm{Erfc}(x) = 1 - \mathrm{Erf}(x).
Hyperparameters σ² and τ can be estimated by maximizing the log-likelihood in (2.9). The estimated
penalty term is λ̂ = 2σ̂²τ̂. Given λ̂, we have an explicit solution for β, known as the soft-thresholding
estimate:

\bar{\beta}_j =
\begin{cases}
\hat{\beta}_j^{ls} - \tau\sigma^2, & \text{if } \hat{\beta}_j^{ls} \ge \tau\sigma^2 \\
0, & \text{if } -\tau\sigma^2 < \hat{\beta}_j^{ls} < \tau\sigma^2 \\
\hat{\beta}_j^{ls} + \tau\sigma^2, & \text{if } \hat{\beta}_j^{ls} \le -\tau\sigma^2
\end{cases}

where τσ² = λ/2 and λ = 2σ²τ.

Approximation of the marginal likelihood in the non-orthogonal case
When X is not orthogonal, the p-dimensional integral is intractable. Hence, we cannot obtain an
analytic expression for L(σ², τ). Foster et al. (2008) proposed a Laplace approximation approach
to approximate the marginal likelihood in this case. Using this approximation, they turned a p-
dimensional “multivariate double exponential” integral into a multivariate normal integral,
which is much easier to compute. We briefly summarize the idea behind (Foster et al., 2008) here.

Suppose we have a function h(x). We can use a Taylor expansion to approximate h(x) by its
Taylor series (up to the second order):

h(x) \approx h(\hat{x}) + \frac{1}{2} h''(\hat{x}) (x - \hat{x})^2

where x̂ is the value of x such that h'(x̂) = 0.
It follows that

\int e^{-h(x)}\, dx \approx \int e^{-h(\hat{x}) - \frac{1}{2} h''(\hat{x})(x - \hat{x})^2}\, dx = e^{-h(\hat{x})} \int e^{-\frac{1}{2} h''(\hat{x})(x - \hat{x})^2}\, dx.    (2.10)

In our model,

\ell(\beta) = -\left( \frac{1}{2\sigma^2} (Y - X\beta)^T (Y - X\beta) + \tau \|\beta\|_1 \right).

Sub-differentiating ℓ(β) with respect to β, we get:

\partial_\beta \ell(\beta) = \frac{1}{\sigma^2} X^T (Y - X\beta) - \tau v    (2.11)

where v is a vector with entries in [−1, 1] such that v^T β = ||β||₁. The derivative of the sub-differential
(2.11) is:

\frac{\partial (\partial_\beta \ell(\beta))}{\partial \beta} = -\frac{1}{\sigma^2} X^T X    (2.12)
From (2.3),

L(\sigma^2, \tau) = \frac{1}{(2\pi\sigma^2)^{n/2}} \left(\frac{\tau}{2}\right)^p \int_{\mathbb{R}^p} e^{\ell(\beta)}\, d\beta
                  \approx \frac{1}{(2\pi\sigma^2)^{n/2}} \left(\frac{\tau}{2}\right)^p \exp\left(\ell(\hat{\beta})\right) \int_{\mathbb{R}^p} \exp\left( -\frac{1}{2\sigma^2} (\beta - \hat{\beta})^T X^T X (\beta - \hat{\beta}) \right) d\beta    (2.13)

where β̂ is the value of β at which the first-order (sub)derivative of ℓ(β) equals 0.

Note that a solution to (2.11) satisfies:

X^T Y = X^T X \hat{\beta} + \tau\sigma^2 v
Then:

\ell(\hat{\beta}) = -\frac{1}{2\sigma^2} Y^T Y + \frac{1}{\sigma^2} \hat{\beta}^T X^T Y - \frac{1}{2\sigma^2} \hat{\beta}^T X^T X \hat{\beta} - \tau \|\hat{\beta}\|_1
                  = -\frac{1}{2\sigma^2} Y^T Y + \frac{1}{\sigma^2} \hat{\beta}^T (X^T X \hat{\beta} + \tau\sigma^2 \hat{v}) - \frac{1}{2\sigma^2} \hat{\beta}^T X^T X \hat{\beta} - \tau \hat{v}^T \hat{\beta}
                  = -\frac{1}{2\sigma^2} Y^T Y + \frac{1}{2\sigma^2} \hat{\beta}^T X^T X \hat{\beta}    (2.14)

Therefore, we obtain a multivariate normal integral, which is much easier to compute. According to
(Foster et al., 2008), the approximated log-likelihood is:

\ell(\sigma^2, \tau) = -\frac{n - k}{2} \log \sigma^2 + k \log \tau - \frac{1}{2\sigma^2} \left( Y^T Y - \hat{\beta}^T X^T X \hat{\beta} \right)

where k = min(n, p). Optimization of ℓ(σ², τ) yields the estimates of τ and σ²:

\hat{\sigma}^2 = \frac{Y^T Y - \hat{\beta}^T X^T X \hat{\beta}}{n - k + 2\tau \|\hat{\beta}\|_1}    (2.15)

\hat{\tau} = \frac{k}{\|\hat{\beta}\|_1}    (2.16)
In summary, τ and σ² are estimated by the following iterative algorithm:

Algorithm 1: Estimation of hyperparameters τ and σ² by (Foster et al., 2008).
Step 1: Set initial values for β̂.
while not converged do
    Step 2: Compute σ² and τ by (2.15) and (2.16).
    Step 3: Given σ² and τ, update β̂ by minimizing \sum_{i=1}^n (Y_i - X_i\beta)^2 + \frac{2\sigma^2\tau}{n} \|\beta\|_1.
end
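A compact R sketch of Algorithm 1 is given below. The closed-form updates follow (2.15)-(2.16); for Step 3 we use a glmnet fit, and the mapping between glmnet's lambda and the penalty in Step 3 is our own scaling assumption, noted in the comments.

# Sketch of Algorithm 1 (Foster et al., 2008): alternate the closed-form updates
# (2.15)-(2.16) with an L1-penalized fit for beta.
library(glmnet)

eb_foster <- function(X, Y, n_iter = 20) {
  n <- nrow(X); p <- ncol(X); k <- min(n, p); XtX <- crossprod(X)
  # warm start for beta at an arbitrary small fixed penalty (our choice)
  beta <- as.numeric(coef(glmnet(X, Y, alpha = 1, lambda = 0.1,
                                 standardize = FALSE, intercept = FALSE)))[-1]
  for (it in seq_len(n_iter)) {
    l1     <- sum(abs(beta)) + 1e-8                          # guard against an all-zero beta
    tau    <- k / l1                                         # (2.16)
    sigma2 <- as.numeric(sum(Y^2) - t(beta) %*% XtX %*% beta) /
              (n - k + 2 * tau * l1)                         # (2.15)
    # Step 3: glmnet scales the squared-error loss by 1/(2n), so lambda = sigma2 * tau / n
    # corresponds to the MAP penalty 2 * sigma2 * tau on the unscaled residual sum of
    # squares (a scaling assumption on our part).
    fit  <- glmnet(X, Y, alpha = 1, lambda = sigma2 * tau / n,
                   standardize = FALSE, intercept = FALSE)
    beta <- as.numeric(coef(fit))[-1]
  }
  list(tau = tau, sigma2 = sigma2, lambda = 2 * sigma2 * tau, beta = beta)
}

A call such as eb_foster(X, Y) then returns the estimated τ and σ², together with the implied penalty λ = 2σ²τ.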
Normal approximation

Motivated by the Laplace approximation approach described in (Foster et al., 2008), we propose a
simpler new approximation, which also approximates the multivariate double exponential
distribution by a multivariate normal distribution. Instead of using a Laplace approximation, we
use a normal distribution with the same variance to approximate the double exponential distribution.
Essentially, for a given range of β, the area under the normal density is used to approximate
the area under the double exponential density. The advantage of this approximation is that we
obtain the integral of the joint likelihood in closed form.

Figure 2.1 shows the approximation in one dimension. The idea is to approximate the
purple area with the green area. We show that in the orthogonal case, where we can compute the
true marginal likelihood, the λ estimated using the normal approximation is very close to the λ
estimated from the true marginal likelihood (Figure 2.2, Figure 2.3).
Figure 2.1 Double exponential distribution and normal distribution with mean zero and variance one.
The purple line represents the density of the double exponential distribution with variance 1, and the
green line represents the density of the normal distribution with variance 1. Both distributions have
mean 0. The purple shade represents the area under the double exponential density curve for β
ranging from −2 to 2, and the green shade represents that of the normal distribution.
The variance of the double exponential distribution with density function \frac{\tau}{2} e^{-\tau|\beta|} is \frac{2}{\tau^2}; therefore,
we propose the following model to approximate the LASSO Bayesian framework:

Y \mid X, \beta \sim N(X\beta, \sigma^2 I_n)
\beta \sim N(0, V^{-1})    (2.17)

where

V = \mathrm{diag}(\eta, \ldots, \eta) = \mathrm{diag}(\tau^2/2, \ldots, \tau^2/2)

Since the approximating model has a Gaussian prior, the marginal likelihood is also Gaussian and
has an analytic expression.
To derive the approximated marginal likelihood, we use the results from Section 2.3 of (Bishop,
2006). Specifically, if the marginal Gaussian distribution for β and the conditional Gaussian distribution
for Y given β have the form

p(\beta) = N(\beta \mid \mu, A^{-1})
p(Y \mid \beta) = N(Y \mid X\beta, L^{-1})    (2.18)

then the marginal distribution of Y and the conditional distribution of β given Y are given by

p(Y) = N(Y \mid X\mu, L^{-1} + X A^{-1} X^T)
p(\beta \mid Y) = N\left(\beta \mid \Sigma (X^T L Y + A\mu), \Sigma\right)    (2.19)

where \Sigma = (A + X^T L X)^{-1}.
Using this result, we can compute the approximated likelihood:

\tilde{L}(\sigma^2, \tau) = \int_{\mathbb{R}^p} p(Y \mid \beta)\, p(\beta)\, d\beta
                          = \int_{\mathbb{R}^p} N(Y \mid X\beta, \sigma^2 I_n)\, N(\beta \mid 0, V^{-1})\, d\beta
                          = N(Y \mid 0, \sigma^2 I + X V^{-1} X^T)
                          = (2\pi)^{-n/2} |C_\tau|^{-\frac{1}{2}} \exp\left( -\frac{1}{2} Y^T C_\tau^{-1} Y \right)    (2.20)

where C_\tau = \sigma^2 I + X V^{-1} X^T = \sigma^2 I + \frac{2}{\tau^2} X X^T. In log form, the approximated negative log-likelihood is

-\tilde{\ell}(\sigma^2, \tau) = \log |C_\tau| + Y^T C_\tau^{-1} Y    (2.21)
Figure 2.2 Comparison of the approximated marginal likelihood contour plot vs. the true marginal
likelihood contour plot in the orthogonal case. The top plot shows the marginal likelihood for the
“true” model and the bottom plot shows the approximated log-likelihood. The plots are generated
with τ ranging from 0.001 to 5 and σ² ranging from 0.01 to 2. The true log-likelihood is maximized
at the point (τ = 1.84, σ² = 0.27) and the approximated log-likelihood is maximized at the point
(τ = 1.89, σ² = 0.28). The simulated data have 200 observations and 100 features. The predictor
matrix has an orthogonal design with mean 0. β is simulated from a double exponential distribution
with τ = 2. The noise term ε is simulated from a normal distribution with mean 0 and variance
σ² = 0.25. The outcome is Y = Xβ + ε.
Note that, essentially, we use a ridge regression to approximate the LASSO regression, but only
to estimate the hyperparameters. To compare the approximated likelihood with the true likelihood,
we simulated data with X orthogonal (in this scenario, we can compute the true likelihood).
Figure 2.2 shows the contour plots of the “true” marginal log-likelihood (2.9) and the normal-
approximated log-likelihood (2.20) in the orthogonal case. The two contour plots have similar shapes,
with some differences. Moreover, the maximum points of the contour plots are very close. Figure
2.3 shows the log-likelihood as a function of τ for fixed σ². In this case, the maximizers of the
true log-likelihood and the approximated log-likelihood are very close.
Figure 2.3 Comparison of estimated parameters using the approximated marginal likelihood vs.
the true marginal likelihood. Log-likelihood as a function of τ for fixed σ², with σ² fixed at 0.25. The
green line represents the true log-likelihood and the red line represents the normal-approximated
log-likelihood. The vertical lines mark where each likelihood function is maximized. τ_max equals 1.84 for
the normal-approximated log-likelihood and 1.81 for the true log-likelihood. The data are simulated
in the same way as described in Figure 2.2.
To summarize, we propose an empirical Bayes tuning (EB tuning) approach to select the penalty
parameter based on a “normal approximation” of the marginal likelihood. In the next section, we
describe in detail different algorithms for optimizing the likelihood (2.21).
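Before turning to the specialized algorithms, note that (2.21) can also be evaluated directly and handed to a generic optimizer. The sketch below parameterizes σ² and τ on the log scale and uses optim; both choices are ours, for illustration.

# Evaluate and optimize the normal-approximation objective (2.21):
# -l(sigma2, tau) = log|C_tau| + Y' C_tau^{-1} Y, with C_tau = sigma2*I + (2/tau^2) X X'.
neg_loglik <- function(par, X, Y) {
  sigma2 <- exp(par[1]); tau <- exp(par[2])
  C <- sigma2 * diag(nrow(X)) + (2 / tau^2) * tcrossprod(X)
  as.numeric(determinant(C, logarithm = TRUE)$modulus + t(Y) %*% solve(C, Y))
}

eb_tune <- function(X, Y) {
  opt <- optim(c(0, 0), neg_loglik, X = X, Y = Y, method = "L-BFGS-B")
  sigma2 <- exp(opt$par[1]); tau <- exp(opt$par[2])
  c(sigma2 = sigma2, tau = tau, lambda = 2 * sigma2 * tau)   # lambda = 2 * sigma2 * tau
}

set.seed(1)
X <- matrix(rnorm(100 * 50), 100, 50)
Y <- as.numeric(X[, 1] - X[, 2] + rnorm(100, sd = 0.5))
eb_tune(X, Y)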
2.2.2 Estimation of hyperparameters

We propose two algorithms, an EM algorithm and a fixed point update algorithm, to optimize (2.21)
and hence implement EB tuning. We refer to Section 3.5 of (Bishop, 2006) and Section 7.6 of
(Murphy, 2012) in this section.

EM algorithm

Since our model has the form τ → β_j → Y, the most straightforward approach is to use the EM
algorithm.

E step

In our model, let θ = {τ, σ²} be the parameters to optimize. Because the parameter vector β is
integrated out, we can regard it as a latent variable, and hence we can optimize this marginal
likelihood function using EM.

In the E step, we first compute the posterior distribution of β given the current estimate of θ and
then use it to find the expected complete-data log-likelihood Q(θ). In the M step, we maximize
Q(θ) with respect to θ to obtain updated hyperparameter estimates. We first need to compute the
posterior distribution p(β | X, Y, τ, σ²). Due to the conjugacy of the Gaussian prior, the
posterior is also Gaussian. The posterior of β given the estimated θ is

p(\beta \mid X, Y, \tau, \sigma^2) = N(\mu, \Sigma)

where

\Sigma = (\sigma^{-2} X^T X + V)^{-1}
\mu = \sigma^{-2} \Sigma X^T Y    (2.22)
M step

The expected complete-data log-likelihood is given by

Q(\theta) = E[\log N(Y, \beta \mid \theta)]
          = E[\log N(Y \mid X\beta, \sigma^2 I_n) + \log N(\beta \mid 0, V^{-1})]
          = \frac{1}{2} E\left[ N \log \sigma^{-2} - \sigma^{-2} \|Y - X\beta\|^2 + p \log \frac{\tau^2}{2} - \mathrm{Trace}(V \beta\beta^T) \right] + \text{const}
          = \frac{1}{2} N \log \sigma^{-2} - \frac{\sigma^{-2}}{2} \left( \|Y - X\mu\|^2 + \mathrm{Trace}(X^T X \Sigma) \right) + \frac{p}{2} \log \frac{\tau^2}{2} - \frac{1}{2} \mathrm{Trace}\left[ V (\mu\mu^T + \Sigma) \right] + \text{const}    (2.23)

where μ and Σ are computed using equation (2.22). To maximize the Q function, we set dQ/dτ = 0
and dQ/dσ² = 0. The M step is as follows:

\tau_{\text{new}} \leftarrow \sqrt{ \frac{2p}{\mu^T \mu + \mathrm{Trace}(\Sigma)} }

\sigma^2_{\text{new}} \leftarrow \frac{ \|Y - X\mu\|^2 + \sigma^2 \left( p - \frac{\tau^2}{2} \mathrm{Trace}(\Sigma) \right) }{N}    (2.24)
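A direct transcription of the E step (2.22) and the M step (2.24) into R is sketched below; convergence checks and numerical safeguards are omitted.

# EM updates for (tau, sigma2) under the normal approximation:
# the E step gives (mu, Sigma) from (2.22); the M step applies (2.24).
eb_em <- function(X, Y, n_iter = 100, tau = 1, sigma2 = 1) {
  n <- nrow(X); p <- ncol(X); XtX <- crossprod(X); XtY <- crossprod(X, Y)
  for (it in seq_len(n_iter)) {
    V     <- diag(tau^2 / 2, p)
    Sigma <- solve(XtX / sigma2 + V)                  # (2.22)
    mu    <- as.numeric(Sigma %*% XtY / sigma2)
    tau_new    <- sqrt(2 * p / (sum(mu^2) + sum(diag(Sigma))))           # (2.24)
    sigma2_new <- (sum((Y - X %*% mu)^2) +
                   sigma2 * (p - tau^2 / 2 * sum(diag(Sigma)))) / n      # (2.24)
    tau <- tau_new; sigma2 <- sigma2_new
  }
  c(tau = tau, sigma2 = sigma2, lambda = 2 * sigma2 * tau)
}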
Fixed point update algorithm

Another natural approach to obtaining the estimates that optimize (2.21) is to compute its derivatives
and set them to zero. Noting that

|V|\, |\sigma^2 I + X V^{-1} X^T| = |\sigma^2 I|\, |V + \sigma^{-2} X^T X|,

the first component of (2.21) is

\log |\sigma^2 I + X V^{-1} X^T| = -\log|\Sigma| + n \log \sigma^2 - \log|V|    (2.25)

Using the Woodbury inversion identity,

(A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1},
we have

(\sigma^2 I + X V^{-1} X^T)^{-1} = \sigma^{-2} I - \sigma^{-2} X (V + \sigma^{-2} X^T X)^{-1} X^T \sigma^{-2}.

Therefore, the second component of (2.21) is

Y^T (\sigma^2 I + X V^{-1} X^T)^{-1} Y = \sigma^{-2} Y^T Y - \sigma^{-2} Y^T X \Sigma X^T \sigma^{-2} Y
    = \sigma^{-2} Y^T (Y - X\mu)
    = \sigma^{-2} \left[ \|Y - X\mu\|^2 + Y^T X \mu - \mu^T X^T X \mu \right]
    = \sigma^{-2} \|Y - X\mu\|^2 + \sigma^{-2} Y^T X \mu - \sigma^{-2} \mu^T X^T X \mu
    = \sigma^{-2} \|Y - X\mu\|^2 + \mu^T \Sigma^{-1} \mu - \sigma^{-2} \mu^T X^T X \mu
    = \sigma^{-2} \|Y - X\mu\|^2 + \mu^T V \mu    (2.26)

Adding up (2.25) and (2.26), we have

-\ell(\sigma^2, \eta) = -\log|\Sigma| + n \log \sigma^2 - \log|V| + \frac{1}{\sigma^2} \|Y - X\mu\|^2 + \mu^T V \mu

Setting d\ell/d\tau = 0 and d\ell/d\sigma^2 = 0, we obtain the following iterative algorithm:

\tau_{\text{new}} \leftarrow \sqrt{ \frac{2\gamma}{\mu^T \mu} }

\sigma^2_{\text{new}} \leftarrow \frac{ \|Y - X\mu\|^2 }{ N - \gamma }    (2.27)

where

\gamma \triangleq p - \frac{\tau^2}{2} \mathrm{Trace}(\Sigma)
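The fixed-point iteration differs from the EM algorithm only in the updates (2.27); a bare-bones sketch, reusing the posterior moments (2.22), is given below.

# Fixed-point updates (2.27) for (tau, sigma2); the posterior moments (mu, Sigma)
# are recomputed from (2.22) at each iteration. No convergence check in this sketch.
eb_fixed_point <- function(X, Y, n_iter = 100, tau = 1, sigma2 = 1) {
  n <- nrow(X); p <- ncol(X); XtX <- crossprod(X); XtY <- crossprod(X, Y)
  for (it in seq_len(n_iter)) {
    Sigma <- solve(XtX / sigma2 + diag(tau^2 / 2, p))     # (2.22)
    mu    <- as.numeric(Sigma %*% XtY / sigma2)
    gam   <- p - tau^2 / 2 * sum(diag(Sigma))             # gamma = p - (tau^2/2) tr(Sigma)
    tau    <- sqrt(2 * gam / sum(mu^2))                   # (2.27)
    sigma2 <- sum((Y - X %*% mu)^2) / (n - gam)           # (2.27)
  }
  c(tau = tau, sigma2 = sigma2, lambda = 2 * sigma2 * tau)
}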
2.3 Results

2.3.1 Simulation studies

In this section, we present several simulation studies comparing the performance of cross-validation
and EB tuning under different scenarios. In these studies we controlled four aspects of the simulated
data:

1. the dimension of the predictors p
2. the signal-to-noise ratio (SNR) of the data, defined as Var(Xβ)/σ²
3. the sparsity of β, controlled by δ
4. the correlation magnitude ρ and the correlation structure between features.

The evaluation criteria for CV and EB tuning included (1) prediction accuracy, measured by the
prediction MSE in a testing dataset with 1000 samples, and (2) variable selection ability, measured by
sensitivity and specificity. Sensitivity was calculated as the number of truly nonzero features
selected divided by the total number of truly nonzero features. Specificity was calculated as the
number of features estimated to be zero that are truly zero divided by the total number of truly zero
features. The sample size used for each simulation was 100. The feature matrix X was simulated
from a normal distribution with mean zero and the covariance matrix described below. We used an
AR(1) correlation of ρ between the columns of X.
We simulated data in the following steps. (1) The number of nonzero βs was set to [n^δ], where
δ is the parameter that controls the sparsity of the features. A higher value of δ means that the
model is less sparse, and a smaller value of δ means that the model is more sparse. To control how
closely the nonzero βs are clustered together in the covariance matrix, κ[n^δ] nonzero βs were assigned
indices next to each other. The indices of the remaining nonzero βs were selected randomly. The values
of the nonzero βs were simulated randomly from a double exponential distribution with mean 0 and
scale parameter 3. (2) The value of X was simulated and kept fixed. (3) In each replication, we
simulated ε from N(0,1) and set Y = Xβ + ε. The number of replications was 100. Each simulated
dataset was split into a training set (size 100) and a testing set (size 1000). The prediction MSE for the
test set was computed based on the model fitted in the training set and was evaluated as the mean
of the squared residuals between the fitted Y and the test Y. Standard LASSO was performed with 10-fold CV.
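The data-generating mechanism of steps (1)-(3) can be sketched as follows. The helper below is ours and covers a single configuration; in particular, the exact way the target SNR is imposed is our assumption, since the thesis fixes ε ∼ N(0, 1) and controls the SNR separately.

# Sketch of one simulation configuration: AR(1)-correlated features, [n^delta]
# double-exponential signal coefficients, and Y = X beta + eps (illustration only).
library(MASS)   # for mvrnorm

simulate_once <- function(n = 100, p = 1000, snr = 3, delta = 0.5, rho = 0.2, kappa = 0.5) {
  Sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))      # AR(1) covariance
  X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
  n_signal <- floor(n^delta)
  n_block  <- floor(kappa * n_signal)                        # signals indexed next to each other
  idx  <- c(seq_len(n_block), sample((n_block + 1):p, n_signal - n_block))
  beta <- rep(0, p)
  beta[idx] <- rexp(n_signal, rate = 1/3) * sample(c(-1, 1), n_signal, replace = TRUE)
  # rescale beta so that Var(X beta) / 1 hits the target snr with eps ~ N(0, 1);
  # how the thesis fixes the SNR exactly is not spelled out, so this is an assumption.
  beta <- beta * sqrt(snr / var(as.numeric(X %*% beta)))
  Y <- as.numeric(X %*% beta) + rnorm(n)
  list(X = X, Y = Y, beta = beta)
}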
We considered the following values for the parameters described above; the simulation results for
the ten examples we considered are summarized in Figure 2.4.

1. p = 200, 600, 800, 1000
2. snr = 1, 3, 5, 7
3. δ = 0.2, 0.4, 0.6, 0.8
4. ρ = 0, 0.2, 0.4, 0.8
5. κ = 0.3, 0.5, 0.7, 0.9
Examples (a)-(c) represent cases varying the number of predictors, the feature informativeness level,
and the sparsity of the true model. Examples (d)-(h) represent cases with different correlation
structures. Specifically, in example (d), the signal (nonzero) features are not clustered together. In
example (e), half of the signal features are indexed right next to each other, while the remaining
signal features are indexed randomly. Examples (f) to (h) show different scenarios of how closely
the signal features are clustered together in the case of low, moderate, and high correlation. Examples (i)
and (j) are special cases where the magnitudes of the nonzero βs are large.

From Figure 2.4, it can be seen that for examples (a)-(h), the model returned by EB tuning
had prediction performance comparable to cross-validation. The performance pattern of the two
methods across the different settings was very similar. In the two particular examples (i) and (j), where
the nonzero βs have large effect sizes, EB tuning performed better than cross-validation. We
see that the performance of EB tuning can be affected by the SNR and the effect size of β: EB tuning
performs better with higher SNR. Bar plots of the variable selection results (sensitivity and specificity for
the nonzero and zero features) are also included in Figure 2.4. The pattern of the results for
the two methods is similar.
2.3.2 Data example
To illustrate the application of EB tuning to high-dimensional data, we applied EB tuning to predict bone mineral density (BMD) using a dataset publicly available from the European Bioinformatics Institute (EMBL-EBI) ArrayExpress repository, ID E-MEXP-1618. In this dataset, transiliac bone biopsies from 84 women were submitted to Affymetrix microarray expression analysis to study the relationship between BMD and gene expression. The microarray used to analyze the RNA from these women contains 54,676 probe sets. The data were normalized using the RMA method as described in (Tharmaratnam et al., 2016). Gene expression levels were analyzed on a log2 scale.
(a) Simulations varying the number of predictors p = 200, 600, 800, 1000. n = 100, snr = 3, ρ = 0, δ = 0.5.
(b) Simulations varying snr = 1, 3, 5, 7. n = 100, p = 1000, ρ = 0, δ = 0.5.
(c) Simulations varying δ = 0.2, 0.4, 0.6, 0.8. n = 100, p = 1000, snr = 3, ρ = 0.
(d) Simulations varying ρ = 0, 0.2, 0.4, 0.8 with random indexing of non-zero βs. n = 100, p = 1000, snr = 3, δ = 0.5.
(e) Simulations varying ρ = 0, 0.2, 0.4, 0.8 with 50% of non-zero βs indexed next to each other. n = 100, p = 1000, snr = 3, δ = 0.5.
(f) Simulations varying κ = 0.3, 0.5, 0.7, 0.9. n = 100, p = 1000, snr = 3, δ = 0.5, ρ = 0.2.
(g) Simulations varying κ = 0.3, 0.5, 0.7, 0.9. n = 100, p = 1000, snr = 3, δ = 0.5, ρ = 0.4.
(h) Simulations varying κ = 0.3, 0.5, 0.7, 0.9. n = 100, p = 1000, snr = 3, δ = 0.5, ρ = 0.8.
(i) Simulations varying the number of predictors p = 200, 500, 1000.
(j) Simulations varying snr = 0.5, 1, 2, 5, 10.
Figure 2.4 EB tuning simulation study results
Transcripts having RMA signal levels < 32 (log2 values < 5) across all samples were omitted, resulting in a reduction of probe sets from 54,676 to 20,721. We are interested in predicting the total hip T-score using the gene expression data. Bone density test results are reported using T-scores; the lower a person's T-score, the lower the bone density. We preselected the 2000 probes with the largest variances. We compared EB tuning with cross-validation in terms of prediction and estimation.
To account for over-fitting, we evaluated the performance of the EB LASSO and CV LASSO using repeated data splitting. The data were randomly split into a training dataset consisting of 80% of the observations and a testing dataset consisting of 20% of the observations. We fitted the EB LASSO and CV LASSO in the training dataset and evaluated their prediction MSE in the testing dataset. We repeated the random splitting of the dataset 100 times, considered different numbers of pre-selected probe sets, and reported the number of covariates selected with non-zero coefficients. MSE and R² were calculated from the predicted and observed Y in the testing dataset.
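The evaluation loop can be sketched in R as below, with cv.glmnet() providing the CV-tuned LASSO; eb_lambda() is a placeholder for the EB tuning step of this chapter, not an existing function, and the split fraction follows the description above.

```r
library(glmnet)

# Illustrative evaluation loop: repeated 80/20 splits comparing CV-tuned and
# EB-tuned LASSO. eb_lambda() is a placeholder for the EB tuning procedure.
evaluate_splits <- function(X, Y, eb_lambda, n_splits = 100) {
  res <- matrix(NA, n_splits, 2, dimnames = list(NULL, c("cv_mse", "eb_mse")))
  for (s in seq_len(n_splits)) {
    train <- sample(nrow(X), size = floor(0.8 * nrow(X)))
    cv_fit  <- cv.glmnet(X[train, ], Y[train], nfolds = 10)
    cv_pred <- predict(cv_fit, X[-train, ], s = "lambda.min")
    eb_fit  <- glmnet(X[train, ], Y[train], lambda = eb_lambda(X[train, ], Y[train]))
    eb_pred <- predict(eb_fit, X[-train, ])
    res[s, ] <- c(mean((Y[-train] - cv_pred)^2), mean((Y[-train] - eb_pred)^2))
  }
  res
}
```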
Figure 2.5 summarizes the prediction MSE, R², number of covariates with non-zero coefficients, and tuning parameter value λ by LASSO using cross-validation and empirical Bayes tuning. Table 2.1 shows the median values of the 100 replications corresponding to Figure 2.5.
Table 2.1 Median MSE, R², number of features selected, and estimated λ across 100 replications comparing cross-validation vs. EB tuning. This table corresponds to Figure 2.5.

                 Cross-validation                        EB tuning
  p      MSE     R²     nonzeros     λ         MSE     R²     nonzeros     λ
 200    1.811   0.292     11.0     0.201      1.770   0.309      12      0.195
 500    1.655   0.350     17.0     0.206      1.627   0.351      17      0.212
1000    1.853   0.283     22       0.221      1.742   0.310      18      0.272
2000    1.801   0.309     31.5     0.155      1.750   0.312      22      0.243
Overall, we see that choosing the penalty parameter for LASSO using CV or EB tuning gave similar results. The LASSO model selected by EB tuning is less variable than the LASSO model selected by cross-validation in terms of estimated λ and number of selected features. In this specific dataset, EB tuning tended to select a larger penalty parameter λ than cross-validation, especially when the p/n ratio was high. The model estimated by EB tuning was generally more parsimonious,
Figure 2.5 Comparison of prediction performance of cross-validation vs. EB tuning applied to the bone density dataset. The data is randomly split 100 times. In each replication, the data is randomly divided into a training dataset consisting of 80% of the observations and a testing dataset consisting of 20% of the observations. We fit EB LASSO and CV LASSO in the training dataset and evaluated their prediction accuracy in the testing dataset. The figure shows the results for all 100 replications. The X axis is the number of pre-selected genes. MSE and R² are the fitted MSE and R² in the testing set.
that is, fewer variables were selected by EB tuning. The prediction accuracy of those two methods
was similar. EB tuning had a slightly lower MSE than cross-validation.
2.4 Discussion
Choosing an appropriate tuning parameter is critical to all penalized regression methods. For
LASSO regression, model interpretation and prediction performance can change quite dramatically
depending on the choice of the penalty parameter. The most commonly used method to estimate
the penalty parameter λ is by cross-validation, where a grid of λ values is specified first and the λ that
provides the smallest prediction error is selected.
In this chapter, we proposed a compelling alternative to cross-validation for estimating the LASSO penalty parameter using an empirical Bayes approach, based on maximizing the marginal likelihood obtained by integrating over the regression coefficients under a Bayesian framework. Through simulation studies and a real data application, we demonstrated that our proposed approach is competitive with cross-validation in prediction and variable selection. Our approach becomes especially valuable when cross-validation is not appropriate.
So far we have limited this idea to the estimation of a single penalty parameter in LASSO regression; however, it can be further applied to various penalized models with different penalty forms or with multiple penalty parameters. One can extend the same idea described in this chapter to more complicated problems where cross-validation is not available. More generally, the principle of choosing the penalty parameter by maximizing its marginal likelihood under a Bayesian formulation could, in principle, be applicable to most penalized regression problems that have a Bayesian interpretation, such as the adaptive LASSO, group LASSO, ridge, and Elastic-Net. In the following chapters, we present generalizations of this empirical Bayes procedure that incorporate external information about the predictors into penalized regression methods, where the penalty parameters are "external-data-driven".
Chapter 3
Improved prediction with LASSO
regression incorporating external
covariates
This chapter is a modified version of a pre-print available at https://www.biorxiv.org/content/10.1101/2020.03.04.971408v1.full.pdf.
Associated with genomic features like gene expression, methylation, and genotypes, used in statistical modeling of health outcomes, there is a rich set of meta-features like functional annotations, pathway information, and knowledge from previous studies, that can be used post-hoc to facilitate the interpretation of a model. However, using this meta-feature information a-priori rather than post-hoc can yield improved prediction performance as well as enhanced model interpretation.
We propose a new penalized regression approach named xtune LASSO that allows a-priori integration of external meta-features. The method extends LASSO regression by incorporating individualized penalty parameters for each regression coefficient. The penalty parameters are, in turn, modeled as a log-linear function of the meta-features and are estimated from the data using an approximate empirical Bayes approach. Optimization of the marginal likelihood on which the empirical Bayes estimation is based is performed using a fast and stable majorization-minimization procedure. Through simulations, we show that the proposed regression with individualized penalties can greatly outperform the standard LASSO in terms of both parameter estimation and prediction performance when the external data is informative. We further demonstrate our approach with applications to gene expression studies of bone density and breast cancer.
The methods have been implemented in the R package xtune, freely available for download from CRAN.
3.1 Introduction
Predicting outcomes based on genomic biomarkers such as gene expression, methylation, and
genotypes is becoming increasingly important for individualized risk assessment and treatment
(Kamel and Al-Amodi, 2017). As an example, consider predicting mortality from breast cancer after
surgical treatment based on gene expression profiles (Nuyten et al., 2006). Since genomic studies
typically have more available features than subjects, a common approach to develop prediction
models based on genomic features is to use regularized regression methods, which can handle
high-dimensional data. In addition to regularization, sparsity inducing regression approaches such
as the LASSO can also perform feature selection. In the context of genomic studies, feature selection
is critical for yielding interpretable models that provide insight into potential biological mechanisms
and which can, in turn, facilitate adoption by practitioners. To enhance model interpretability, it is
common to examine features selected in a model in relation to available information about gene
function and previous studies. For example, analyses can be conducted to formally assess whether
the selected features are enriched in particular metabolic pathways or gene ontology annotations.
This kind of post-hoc analysis relating genomic features to existing knowledge about them, hereafter
referred to as genomic meta-features, can provide valuable biological insight and validation for a
prediction model. In this paper, we propose a new approach that exploits genomic meta-features
a-priori rather than post-hoc, to improve prediction performance and feature selection, and to
enhance the interpretability of models developed using penalized regression.
Most penalized regression methods apply a single global penalty parameter to all regression
coefficients, effectively treating all features or predictors equally in the model building process. This
can result in over-shrinking of relevant coefficients and under-shrinking of unimportant ones, with
a corresponding loss in prediction ability. In genomics and other applications, there is often a
great deal of prior knowledge about the features. Examples of prior knowledge include (1) gene
function annotation from databases like the Gene Ontology Project (Ashburner et al., 2000); (2)
gene-disease co-occurrence scores from text-mining biomedical abstracts (Pletscher-Frankild et al.,
2015; Rouillard et al., 2016b); (3) deleterious somatic mutations in the Catalogue of Somatic
Mutations in Cancer (Forbes et al., 2010). With the primary goal of improving prediction but also
to improve feature selection performance, parameter estimation, and model interpretability, we
extend the standard LASSO regression to allow for penalty terms that depend on such external
meta-features. Specifically, rather than using a single penalty parameter that controls the global
amount of shrinkage across all regression coefficients, our model allows each coefficient to have its
own individual penalty parameter, which is, in turn, modeled as a function of the meta-features.
We focus on the LASSO penalty because of its widespread use but address potential extensions to
other penalties in the discussion.
Previously, other variants of LASSO regression have been introduced to allow either coefficient-specific penalties or multiple tuning parameters. Yuan and Lin (2006) proposed the group LASSO, which extends the LASSO to grouped predictors. Zou (2006b) adjusted the penalty differently for each coefficient by using a vector of adaptive weights. The group LASSO applies only to grouped variables, and the adaptive LASSO uses pre-specified weights obtained from an initial estimate of the coefficients using the same data as the data used for regression. Neither of these approaches incorporates a general set of meta-features. Boulesteix et al. (2017) proposed the integrative LASSO with penalty factors method, which assigns different penalty factors to different data modalities such as gene expression, methylation, and copy number. They use cross-validation to choose the penalty parameters based on prediction performance. In practice, the number of different modalities they can allow is up to four, due to a computational bottleneck.
In addition to those above, several other methods have been proposed previously to make use of
prior knowledge of the features. Tharmaratnam et al. (2016) suggested using biological knowledge to derive a set of features that could replace the gene set chosen by the LASSO with minimal loss in predictive power. However, 'experts' are required to assess the importance of each gene. Tai and Pan (2007) partitioned the features into groups and shrank the features of different groups by different magnitudes. The shrinkage for each group is fixed and arbitrary rather than data-dependent, and the use of an external dataset only provides information on the grouping of predictors. van de Wiel et al. (2016) proposed an adaptive group-regularized ridge regression that accounts for group structure as the group LASSO does and allows group-specific penalties. However, the method requires partitioning of the features into different groups, and the external information is only used for guiding the partition. The approach proposed by (Bergersen et al., 2011) is also in the form of the adaptive LASSO. Instead of constructing weights from the same data X and Y, they use the Spearman correlation coefficients or the ridge regression coefficients between the features X and the external information as the penalty weights.
Our approach is distinguished from these methods in that (i) we adopt an empirical Bayes
approach to estimate the hyper-parameters instead of cross-validation, which allows us to estimate
feature-specific tuning parameters; (ii) the magnitudes of the penalty terms are modeled as a
log-linear function of the external information and are estimated from a ‘second-level’ regression;
(iii) our approach is not restricted to meta-features that define feature groupings but can handle
meta-features of any type including quantitative ones.
3.2 Methods
3.2.1 LASSO formulation from a Bayesian perspective
As discussed in Chapter 1, LASSO regression shrinks the regression coefficients by imposing an L1 penalty on the sum of the absolute values of the regression coefficients. The objective function of the LASSO is:
$$\min_{\beta \in \mathbb{R}^p} \left\{ ||Y - X\beta||_2^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \tag{3.1}$$
The LASSO linear regression model can be equivalently formulated as a Bayesian or random effects model where the coefficients have a double exponential (a.k.a. Laplace) prior distribution conditional on σ² (Tibshirani, 1996). Specifically, we assume that the distribution of the response variable Y conditional on the regression coefficients β and σ² follows a normal distribution. The LASSO coefficient estimates that solve (3.1) can be equivalently characterized as the posterior mode, or maximum a posteriori (MAP) estimate, in a Bayesian model with the following double exponential prior distribution for β conditional on σ²:
$$\begin{aligned}
Y|X,\beta &\sim N(X\beta,\, \sigma^2 I_n) \\
f(\beta|\sigma^2) &\sim \text{Double Exponential}\!\left(0,\, \frac{\lambda}{2\sigma^2}\right)
\end{aligned} \tag{3.2}$$
We extend the random effects formulation of the LASSO to incorporate meta-features deemed potentially informative about the relative importance of different types of genomic features in predicting the outcome of interest. The meta-features can be collected into a matrix Z of dimension p × q, where the j-th row of Z contains information about the j-th feature.
$$\min_{\beta \in \mathbb{R}^p} \left\{ ||Y - X\beta||_2^2 + \sum_{j=1}^{p} \lambda_j |\beta_j| \right\}, \qquad \log\lambda = Z\alpha \tag{3.3}$$
By using the Bayesian formulation of penalized regression, external covariates can be incorporated via the prior on the model coefficients. Here we allow differential penalization by assigning priors with different scale parameters to the regression coefficients. Specifically, the double exponential random effect formulation of the standard LASSO in (3.2) assumes a single common 'prior' variance for all β_j's. In our extension, we let each β_j have a double exponential random effect distribution but with its own individual hyperparameter λ_j, where the hyperparameter vector λ is in turn modeled as a log-linear function of the external information available for feature j, Z.
$$\begin{aligned}
Y|X,\beta &\sim N(X\beta,\, \sigma^2 I_n) \\
f(\beta_j|\sigma^2) &\sim \text{Double Exponential}\!\left(0,\, \frac{\lambda_j}{2\sigma^2}\right), \quad \forall\, j = 1, 2, \ldots, p \\
\log\lambda &= Z\alpha
\end{aligned} \tag{3.4}$$
The maximum a posteriori (MAP) estimator under the random effects model in (3.4), which is conditional on σ², is equivalent to the estimator minimizing the LASSO objective function with external-information-driven penalization in (3.3).
The population variance parameter σ² can be estimated from the data or given a point-mass prior (Li and Lin, 2010). In this paper, we estimate σ² using the method proposed by Reid et al. (2013) and assume it is "set in advance". We discuss the estimation of σ² in more detail in Chapter 4. Park and Casella (2008) and Li and Lin (2010) used a specification that assigns a non-informative prior for σ² and a Gibbs sampler based on full conditional distributions to obtain regression coefficient estimates. Our method is different in that, rather than extending the model to include Bayesian inference over the hyperparameters, we use an empirical Bayes approach that maximizes the log-likelihood of the hyperparameters. That is, the hyperparameters α are estimated from the data by first marginalizing over the coefficients β and then performing what is commonly referred to as empirical Bayes, evidence maximization or type-II maximum likelihood (Tipping, 2001).
3.2.2 Empirical Bayes parameter tuning
Unlike the standard LASSO, the proposed model has a potentially large number of hyperparameters ($\alpha_{1\times q}$), so tuning them by cross-validation is not feasible. Instead, we propose an empirical Bayes approach for directly estimating the hyperparameters from the training data. Specifically, the empirical Bayes estimate of α (and hence λ) is obtained by maximizing the marginal likelihood calculated by integrating out the random effects β from the joint distribution of Y and β. The marginal likelihood of α is given by:
$$\begin{aligned}
L(\alpha) &= \int_{\mathbb{R}^p} p(Y|X,\beta)\, f(\beta|\alpha)\, d\beta = \int_{\mathbb{R}^p} N(Y|X\beta,\, \sigma^2 I_n)\, DE(\beta|0,\alpha)\, d\beta \\
&= \int_{\mathbb{R}^p} \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(Y_i - X_i\beta)^2}{2\sigma^2}} \prod_{j=1}^{p} \frac{\exp(Z_j\alpha)}{4\sigma^2}\, e^{-\frac{\exp(Z_j\alpha)}{2\sigma^2}|\beta_j|}\, d\beta
\end{aligned} \tag{3.5}$$
As shown in Chapter 2, when X is not orthogonal the marginal likelihood resulting from the p-dimensional integral in (3.5) does not have a usable closed-form solution. Foster et al. (2008) proposed using a Laplace approximation to the integral, which has a simple closed-form solution. Motivated by their approach, we propose a simpler new approximation which uses a normal distribution with the same prior variance to approximate the double exponential distribution. That is, we use a normal distribution $\beta_j \sim N(0, 2/\tau_j^2)$ to approximate the double exponential distribution $\beta_j \sim \text{Double Exponential}(\tau_j)$, which yields a closed-form solution for the approximated marginal likelihood:
$$\begin{aligned}
Y|X,\beta &\sim N(X\beta,\, \sigma^2 I_n) \\
\beta &\sim N(0,\, V^{-1})
\end{aligned} \tag{3.6}$$
where
$$V = \mathrm{diag}(\eta_1, \ldots, \eta_p) = \mathrm{diag}\!\left( \frac{1}{2\sigma^2}\exp(2Z_{1\cdot}\alpha),\; \ldots,\; \frac{1}{2\sigma^2}\exp(2Z_{p\cdot}\alpha) \right)$$
Details on the proposed normal approximation method are discussed in Chapter 2. The approximate log marginal likelihood of α, integrating out β, is:
$$-\ell(\alpha) = \log|C_\alpha| + Y^T C_\alpha^{-1} Y \tag{3.7}$$
where $C_\alpha = \sigma^2 I + XV^{-1}X^T$. The approximated log likelihood (5.9) is then maximized to obtain the α estimates. Once α is known, and hence the penalty parameter vector λ is known, the glmnet package (Friedman et al., 2010a) in R is used to implement the LASSO with a penalty vector.
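As an illustration of this last step, the sketch below constructs the penalty vector from an estimated α and passes it to glmnet. Because glmnet internally rescales penalty.factor to sum to the number of variables, supplying the λ_j as penalty factors together with their mean as the global lambda reproduces the feature-specific penalties (up to glmnet's internal scaling of the least-squares loss). This is one possible way to hand off the estimated penalties, not a required interface.

```r
library(glmnet)

# Once alpha is estimated, the individualized penalties are lambda_j = exp(Z_j alpha).
# glmnet rescales penalty.factor to sum to the number of variables, so passing
# lambda_j as penalty factors together with their mean as the global lambda
# yields feature-specific penalties proportional to lambda_j.
fit_xtune_lasso <- function(X, Y, Z, alpha_hat) {
  lambda_j <- as.vector(exp(Z %*% alpha_hat))     # log lambda = Z alpha
  glmnet(X, Y,
         lambda = mean(lambda_j),                 # global scale
         penalty.factor = lambda_j)               # relative, feature-specific weights
}
```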
3.2.3 Optimizing the objective function
Marginal likelihood maximization of (5.9) is a non-convex optimization problem that is intrinsically
challenging. Here we propose a Majorization Minimization (MM) algorithm as a particular strategy
for the maximization of the marginal likelihood (5.9). Some key steps are summarized here, and
greater detail on the procedure as well as other alternative optimization algorithms will be discussed
later in Chapter 4.
We note that the approximated model (3.6) is closely related to the model specification of Automatic Relevance Determination (ARD) (MacKay, 1992; Neal, 1995; Tipping, 2001), widely used in the field of signal processing. Wipf and Nagarajan (2008, 2014) described a Majorization-Minimization (MM) procedure that uses a reformulation of ARD to optimize the non-convex objective function by solving a series of easier re-weighted L1 problems. Motivated by their idea, we propose an iterative re-weighted L2 optimization algorithm described in detail below. Note that this non-convex optimization problem is a particular case of the difference of convex functions (DC) problem (Le Thi and Pham Dinh, 2018).
Since the log-determinant term $\log|C_\alpha|$ is a concave function in α (Boyd and Vandenberghe, 2004), a majorization function of $\log|C_\alpha|$ is given by its tangent at the current value of α, with slope
$$\theta = \nabla_\alpha \log|C_\alpha| = \mathrm{diag}\!\left[ X^T C_\alpha^{-1} X \right] \tag{3.8}$$
The $Y^T C_\alpha^{-1} Y$ term in (5.9) is convex. Therefore, at the current value of α, the majorization function for $-\ell(\alpha)$ is given by
$$\ell_\theta(\alpha) \triangleq \theta^T \frac{1}{\eta} + Y^T C_\alpha^{-1} Y \tag{3.9}$$
where $\frac{1}{\eta}$ denotes the element-wise reciprocal of η. Hence, given θ, α is updated by:
$$\alpha \leftarrow \arg\min_\alpha\, \ell_\theta(\alpha) \triangleq \theta^T \frac{1}{\eta} + Y^T C_\alpha^{-1} Y \tag{3.10}$$
Although the objective function (3.9) is convex, it is slow to optimize in practice. We use one more MM procedure for optimizing (3.9). Specifically, using the fact that the data-dependent term $Y^T C_\alpha^{-1} Y$ can be re-expressed as:
$$Y^T C_\alpha^{-1} Y = \min_\delta\, \frac{1}{\sigma^2}||Y - X\delta||^2 + \sum_j \eta_j \delta_j^2 \tag{3.11}$$
an upper-bounding auxiliary function for $\ell_\theta(\alpha)$ is given by the following, with one more auxiliary term δ:
$$\ell_\theta(\alpha, \delta) \triangleq \theta^T \frac{1}{\eta} + \sum_j \eta_j \delta_j^2 + \frac{1}{\sigma^2}||Y - X\delta||^2 \;\geq\; \ell_\theta(\alpha). \tag{3.12}$$
Therefore, the α value that minimizes (3.9) can be estimated using another MM procedure with (3.12) as the majorization function. For any δ, α is estimated by minimizing
$$\theta^T \frac{1}{\eta} + \sum_j \eta_j \delta_j^2. \tag{3.13}$$
Given α, δ is updated by:
$$\delta \leftarrow \arg\min_\delta\, ||Y - X\delta||^2 + \sum_j \eta_j \delta_j^2. \tag{3.14}$$
Equation (3.14) has a weighted convex L2-regularized cost function and can be optimized efficiently using glmnet. The iterative reweighted L2 algorithm is summarized in Algorithm 2 below. Once the hyperparameters α are estimated, and therefore the penalties are known, the LASSO regression coefficients can be obtained using standard LASSO software (e.g., glmnet).
Algorithm 2: Optimization algorithm
Step 1: Initialize α with α^0 (e.g., α_i = 0, ∀ i = 1, 2, ..., q)
while not converged do
    Step 2: Given α, update θ by:
        θ_new ← diag[X^T C_α^{-1} X]
    Step 3: Given θ, update α by the following inner loop:
        • Step 3.1: initialize α with the estimate from the last iteration
        • Step 3.2: given α, update δ by:
            δ ← argmin_δ ||Y − Xδ||^2 + Σ_j η_j δ_j^2
        • Step 3.3: given δ, update α by:
            α ← argmin_α θ^T (1/η) + Σ_j η_j δ_j^2
        • Step 3.4: iterate Step 3.2 and Step 3.3 until convergence
end
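A compact R sketch of Algorithm 2 is given below, assuming σ² is fixed in advance. For simplicity, the δ-update of Step 3.2 is solved in closed form as a generalized ridge regression rather than with glmnet, the α-update of Step 3.3 uses optim() on the convex objective, and fixed iteration counts stand in for convergence checks; none of these choices is prescribed by the algorithm itself.

```r
# Minimal sketch of Algorithm 2 (iterative reweighted L2), sigma2 fixed in advance.
xtune_mm <- function(X, Y, Z, sigma2, n_outer = 50, n_inner = 20) {
  n <- nrow(X); p <- ncol(X); q <- ncol(Z)
  alpha <- rep(0, q)
  eta_of <- function(a) exp(2 * as.vector(Z %*% a)) / (2 * sigma2)   # prior precisions

  for (outer in seq_len(n_outer)) {
    eta  <- eta_of(alpha)
    Cinv <- solve(sigma2 * diag(n) + X %*% (t(X) / eta))      # C_alpha^{-1}
    theta <- colSums(X * (Cinv %*% X))                        # Step 2: diag(X' C^{-1} X)

    for (inner in seq_len(n_inner)) {                         # Step 3: inner MM loop
      eta   <- eta_of(alpha)
      delta <- solve(crossprod(X) + diag(eta), crossprod(X, Y))   # Step 3.2 (ridge form)
      obj <- function(a) {                                    # Step 3.3 objective
        e <- eta_of(a)
        sum(theta / e) + sum(e * delta^2)
      }
      alpha <- optim(alpha, obj, method = "BFGS")$par
    }
  }
  list(alpha = alpha, lambda = exp(as.vector(Z %*% alpha)))
}
```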
3.2.4 Extension to linear discriminant analysis for classification
So far, we have focused on the LASSO linear model, where the response variable is continuous. Here, following the scheme proposed by (Mai et al., 2012), which builds high-dimensional linear discriminant analysis (LDA) upon sparse penalized linear regression, we extend the xtune LASSO model to the framework of LDA with a binary response variable Y ∈ {1, 2}.
The LDA model assumes that X is normally distributed within each class, i.e.,
$$\begin{aligned}
X|(Y = k) &\sim N(\mu_k, \Sigma) \\
\Pr(Y = k) &= \pi_k, \quad k = 1, 2
\end{aligned} \tag{3.15}$$
where μ_k is the mean of X within class k and Σ is the common within-class covariance matrix. We adapt the following procedure proposed by (Mai et al., 2012) to predict the class of Y based on the xtune LASSO penalized linear regression:
Step 1. Let $y_i = -\frac{n}{n_1}$ if $Y_i = 1$, and $y_i = \frac{n}{n_2}$ if $Y_i = 2$.
Step 2. Compute the solution to a penalized least squares problem:
$$(\hat\beta^{LDA}, \hat\beta_0^{LDA}) = \arg\min_{\beta, \beta_0}\, n^{-1}\sum_{i=1}^{n} (y_i - \beta_0 - x_i^T\beta)^2 + \sum_{j=1}^{p} P_\lambda(\beta_j) \tag{3.16}$$
Step 3. Estimate the LDA model on the reduced data $\{Y_i, X_i^T\hat\beta^{LDA}\}_{i=1}^{n}$. Assign observation x to class 2 if
$$\left\{ x - (\hat\mu_1 + \hat\mu_2)/2 \right\}^T \hat\beta^{LDA} + (\hat\beta^{LDA})^T \hat\Sigma\, \hat\beta^{LDA} \left\{ (\hat\mu_1 - \hat\mu_2)^T \hat\beta^{LDA} \right\}^{-1} \log(n_2/n_1) > 0 \tag{3.17}$$
where $\hat\mu_1, n_1, \hat\mu_2, n_2$ are the sample mean vectors and sample sizes within class 1 and class 2, respectively, and $\hat\Sigma$ is the sample covariance matrix. $P_\lambda(\cdot)$ is a generic sparsity-inducing regularization term, such as the single-penalty L1 norm, the Elastic-Net, or the adaptive L1 norm as in our case. We first solve (3.16) using the xtune LASSO, then predict the response variable class by (3.17).
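A sketch of Steps 1–3 in R is given below; fit_beta() is a placeholder for the penalized least squares fit of (3.16) (the xtune LASSO in our case), and the sample covariance matrix is used for Σ̂ as stated above. For example, fit_beta could wrap a glmnet fit using the penalty vector estimated as in Section 3.2.2.

```r
# Sketch of the LDA-via-penalized-regression classifier (Steps 1-3).
# fit_beta(X, y) is a placeholder returning a length-p coefficient vector.
lda_via_penalized <- function(X, Y, fit_beta) {       # Y coded as 1/2
  n1 <- sum(Y == 1); n2 <- sum(Y == 2); n <- n1 + n2
  y <- ifelse(Y == 1, -n / n1, n / n2)                # Step 1: recoded response
  beta <- fit_beta(X, y)                              # Step 2: penalized LS estimate
  mu1 <- colMeans(X[Y == 1, , drop = FALSE])
  mu2 <- colMeans(X[Y == 2, , drop = FALSE])
  Sigma_hat <- cov(X)                                 # sample covariance matrix
  function(x_new) {                                   # Step 3: classification rule (3.17)
    centered <- sweep(as.matrix(x_new), 2, (mu1 + mu2) / 2)
    score <- as.vector(centered %*% beta) +
      as.numeric(t(beta) %*% Sigma_hat %*% beta) /
      as.numeric((mu1 - mu2) %*% beta) * log(n2 / n1)
    ifelse(score > 0, 2, 1)
  }
}
```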
In Chapter 5, we extend the penalized model with data-driven penalties to the framework
of logistic regression for classification problems by assuming that the response variables follow a
Bernoulli likelihood with a logit link function.
3.3 Results
3.3.1 Simulation studies
Simulation setting
We performed a simulation study to evaluate the performance of the xtune LASSO under a range of
scenarios obtained by varying the following key simulation parameters:
1. The true ability of the features to predict the outcome, as measured by the signal-to-noise ratio (SNR), defined as Var(Xβ)/σ².
2. The informativeness of the external metadata, which is controlled by the number of non-zero hyperparameters α.
3. The number of predictor features p.
4. The proportion of predictive features, captured by the number of non-zero true regression coefficients β.
5. The overall magnitude of the non-zero entries of the true regression coefficient vector β, which is controlled by α_0.
6. The overall degree of correlation between features.
The simulation data was generated according to the following steps: 1) set the parameter α controlling the informativeness of the external metadata; 2) generate the external metadata matrix Z; 3) generate the vector of regression coefficients β based on α and Z; 4) generate the feature data X; 5) generate the outcome Y based on β, X and an independently drawn random error.
The external metadata matrix Z of dimension p × q was generated to have 0-1 entries indicating a grouping of features. Specifically, entry Z_jk indicates whether the j-th feature belongs to group k (1) or not (0). The entries were independently generated to be zero or one with probability 0.2 and 0.8, respectively. A consequence of how Z was generated is that features can belong to more than one group, i.e., the groups can overlap. A column of 1s was appended to Z corresponding to the intercept α_0, which controls the overall amount of shrinkage; the higher α_0, the smaller the regression coefficients.
48
Thetrueregressioncoecientsweregeneratedbysamplingfromadoubleexponentialdistribution
with local parameter µ=0 and scale parameter b = exp(Z
j
– ). Threshold as described in (Reid
et al., 2013) was then applied to achieve sparsity. Specifically, the [n
” ] largest — s in magnitude were
retained, while all smaller entries were set to 0. The parameter ” controls the sparsity of the final
— vector, with a lower value of ” inducing a sparser vector. The feature matrix X was simulated
by sampling from a multivariate normal distribution with mean vector zero and covariance matrix
i,j
= fl |i≠ j|
.
Finally, for each simulation replicate, an independent error term ε was sampled from a normal distribution with mean 0 and a variance corresponding to the pre-set SNR. The outcome was generated as Y = Xβ + ε. Each simulated dataset was split into a training set (n = 200) and a test set (n = 1000). The performance of the xtune LASSO was assessed by the prediction R² computed on the test set using the model fitted in the training set. The large test set guarantees that the generalization/test error was accurately estimated. Penalty parameter tuning for the standard LASSO, which was fitted as a comparison to the proposed xtune LASSO, was performed by 10-fold cross-validation. One hundred replicates were generated for each scenario.
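For illustration, the following R sketch generates Z and the thresholded coefficient vector β as described above (steps 1–3); the function name, the use of a difference of exponentials for the double exponential draws, and the placement of the intercept as the first column of Z are illustrative assumptions.

```r
# Illustrative generation of beta from the meta-feature model: Z holds 0/1 group
# indicators plus an intercept column, beta_j is drawn from a double exponential
# with scale exp(Z_j alpha), and only the [n^delta] largest entries are retained.
simulate_xtune_beta <- function(p = 1000, q = 10, alpha, alpha0 = 3,
                                n = 200, delta = 0.5, prob_one = 0.8) {
  Z <- cbind(1, matrix(rbinom(p * q, 1, prob_one), p, q))   # intercept + group indicators
  scale_j <- exp(as.vector(Z %*% c(alpha0, alpha)))          # scale b_j = exp(Z_j alpha)
  beta <- scale_j * (rexp(p) - rexp(p))                      # Double Exponential(0, b_j)
  keep <- order(abs(beta), decreasing = TRUE)[seq_len(floor(n^delta))]
  beta[-keep] <- 0                                           # keep the [n^delta] largest
  list(Z = Z, beta = beta)
}
```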
We consider nine scenarios varying each of the main simulation parameters described above:
1. SNR = 1, 3, 5
2. p = 500, 1000, 3000
3. External data informativeness: low, medium, and high. We fixed the sub-vector α = (α_0, −1, −0.78, −0.56, −0.33, −0.11, 0.11, 0.33, 0.56, 0.78, 1). The low external information setting is simulated from α_low = (α, 0, ..., 0) with 40 zeros appended, the medium external information setting is simulated from α_medium = (α, 0, ..., 0) with 20 zeros appended, and the high external information setting is simulated from α_high = α. The idea is that non-informative external information has many noise variables.
4. α_0 = 1, 3, 5
5. δ = 0.3, 0.5, 0.7
6. Different correlation magnitudes ρ = 0, 0.3, 0.6, 0.9
The simulation results are summarized in Figure 3.1.
Figure 3.1 xtune LASSO simulation study results: subplot (a): SNR = 1, 3, 5, with n = 200, p = 1000, q = 10, α_0 = 3, δ = 0.5, ρ = 0.2; subplot (b): p = 500, 1000, 3000, with n = 200, SNR = 3, q = 10, α_0 = 3, δ = 0.5, ρ = 0.2; subplot (c): q = 10, 30, 50, with n = 200, p = 1000, SNR = 3, α_0 = 3, δ = 0.5, ρ = 0.2; subplot (d): regression coefficient sparsity δ = 0.3, 0.5, 0.7, with n = 200, p = 1000, SNR = 3, q = 10, α_0 = 3, ρ = 0.2; subplot (e): overall penalty magnitude α_0 = 1, 3, 5, with n = 200, p = 1000, SNR = 3, q = 10, δ = 0.5, ρ = 0.2; subplot (f): ρ = 0, 0.3, 0.6, 0.9, with n = 200, p = 1000, SNR = 3, q = 10, α_0 = 3, δ = 0.5.
Simulation results
Figure 3.1 (a) shows box plots (across simulation replicates) of the prediction R² for the standard and xtune LASSO as a function of the SNR. As the SNR increases, the prediction performance increases for both methods, but the xtune LASSO has better prediction across all levels of SNR considered. When the p/n ratio gets higher, both methods have decreased performance; however, the xtune LASSO has a slower rate of decrease. When the p/n ratio is very high, the performance of the standard LASSO decreases dramatically, and the xtune LASSO has significantly better prediction performance than the standard LASSO. The R² for the standard LASSO remains the same across different levels of external data informativeness. The performance of the xtune LASSO decreases when the external information has low informativeness. In the case when the external information is entirely useless, the xtune LASSO performs slightly worse than the standard LASSO (not shown).
Figure 3.1 (d) shows the effect of decreased β sparsity. The higher the value of δ, the less sparse the model and the more non-zero regression coefficients. The value of the xtune LASSO becomes more apparent when the model is less sparse. With δ = 0.5, 14 out of 1000 features are non-zero. Both methods have decreased performance when the model has more signal features, but the xtune LASSO has a slower rate of decrease. The parameter α_0 controls the overall amount of shrinkage. From Figure 3.1 (e), the overall amount of shrinkage has little effect on the performance of both methods. Notice that both methods have higher prediction ability when the signal (non-zero) features are correlated with each other. This is especially true for the standard LASSO. The improvement in prediction of the xtune LASSO over the standard LASSO becomes smaller as the correlation between features increases. The LASSO is known to have decreased ability in variable selection when the features are highly correlated: it tends to select one variable from each highly correlated group and ignore the remaining ones. Hebiri and Lederer (2013) studied the influence of correlation on LASSO prediction, and they suggested that correlation in features is not problematic for LASSO prediction. They also found that the prediction errors are mostly smaller in the correlated settings in their experimental studies.
In summary, our simulation results show that the LASSO with individual penalties informed by
meta-features can outperform the standard LASSO in terms of prediction when 1) the meta-features
are informative for the regression effect sizes, 2) the true model is less sparse, and 3) the SNR is
relatively high.
3.3.2 Data applications
We exemplify the method’s performance on real data by considering two applications to clinical
outcomes prediction using gene expression data.
Bone density data
In the first example, bone biopsies from 84 women were profiled using an expression microarray to
study the relationship between bone mass density (BMD) and gene expression. The goal is to predict
the total bone density based on the gene expression profiles. The bone density is measured by the
hip T-score derived from the women’s biopsies, with a higher score indicating higher bone density.
The microarray data contains over 22,815 probe sets. The data were normalized using the RMA
method as described in (Tharmaratnam et al., 2016). Gene expression levels were analyzed on a log2 scale. The bone density dataset is publicly available from the European Bioinformatics Institute (EMBL-EBI) ArrayExpress repository, ID E-MEXP-1618 (see Section 2.3.2 for a more detailed description).
Insights from previous study results were used as external covariates. (Reppe et al., 2010)
identified eight genes that are highly associated with bone density variation. We created a two-
column external information matrix Z. The first column is a column of all 1s, representing the overall
amount of shrinkage. The second column is a binary variable indicating whether each gene is one of
the eight genes identified.
To illustrate the advantage of incorporating external information, we compared our proposed
method to the standard LASSO as well as the adaptive LASSO (Zou, 2006a). The adaptive LASSO
also adjusts the penalty individually for each coefficient, but the adaptive weight is obtained from
the initial estimate of the coefficients using the same data rather than external information. Our
implementation of adaptive LASSO utilizes the adalasso() function in the parcor (Kraemer et al.,
2009) R package.
The data were randomly split into a training set consisting of 80% of the observations and a test set consisting of 20% of the observations. We fitted the adaptive LASSO, standard LASSO, and our proposed method in the training data and evaluated their prediction performance in the testing data. We repeated 100 random splits of the full data into training and test sets. Figure 3.2 shows the MSE, R² and the number of selected (non-zero) expression features across the 100 splits.
Figure 3.2 Prediction test R² and number of selected gene expression features for the adaptive LASSO, standard LASSO and the proposed method applied to the bone density data. The median R² is 0.27 for the adaptive LASSO, 0.38 for the standard LASSO, and 0.43 for the xtune LASSO. The median number of selected covariates is 10 for the adaptive LASSO, 43 for the standard LASSO and 12 for the xtune LASSO.
Overall, we see that the externally tuned LASSO has better prediction performance than the
standard LASSO while selecting a more parsimonious model. The adaptive LASSO does not
perform well in this data example. To gain further insight into the prediction performance results,
we examined the penalties applied to the regression coefficients by each of the methods when fitted on the full data. The tuning parameter chosen by the standard LASSO using cross-validation is 0.16, while for the xtune LASSO, the estimated tuning parameter is 0.26 for the gene expression features without external information and 0.016 for the expression features with external information, resulting in larger effect estimates for the latter group.
Breast cancer data
In the second example, we apply xtune LASSO to a breast cancer dataset. The data is from the
Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) cohort (Curtis et al.,
2012) (https://ega-archive.org/dacs/EGAC00001000484). A total of 29,476 gene expression profiles and three clinical variables (age at diagnosis, progesterone receptor status, and lymph node status) were used for the prediction of five-year survival (survived or died). Patients followed up for less than five years, with no record of mortality, were excluded from the analysis. We used a subset of the METABRIC data with patients that are estrogen receptor (ER) positive and human epidermal growth factor
receptor 2 (HER2) status negative. The data contains a discovery set of 594 observations and an
additional validation set of 564 observations. The models were trained in the discovery set and
tested on the validation set.
The external information used for the xtune LASSO model is based on the results of (Cheng et al., 2013a), where groups of genes referred to as "metagenes" that are prognostic in all cancers, including breast cancer, were identified. Specifically, (Cheng et al., 2013a) analyzed six gene expression datasets from different cancer types and presented three multi-cancer attractors with strong phenotypic associations: a lymphocyte-specific attractor (LYM), a mesenchymal transition attractor strongly associated with tumor stage (MES), and a mitotic chromosomal instability attractor strongly associated with tumor grade (CIN). Details on the metagene attractors can be found in Appendix 3.A.2. The LYM, MES, and CIN metagenes consist of 169, 134, and 108 genes, respectively. Therefore, the meta-feature matrix Z consists of four external covariates: the first covariate is a binary variable indicating whether a predictor in X is a clinical feature (1) or a gene expression feature (0), and the next three columns indicate whether a gene belongs to LYM, MES, and CIN, respectively.
We compared the xtune LASSO incorporating the meta-features described above, with the
standard and the adaptive LASSO. As in the first example, the standard LASSO was tuned by repeated 10-fold cross-validation and the adaptive LASSO was implemented using the adalasso() function in
the parcor R package.
Table 3.1 compares the AUC, number of selected features, and computation time for the standard, adaptive, and xtune LASSO. Figure 3.3 shows the receiver operating characteristic (ROC) curves for the three methods. The xtune LASSO has the best prediction performance among all three methods. The three clinical variables are given a minimal penalty in the xtune LASSO and are all estimated to be non-zero. The penalty chosen by repeated cross-validation for the standard LASSO is 0.020. For the xtune LASSO, the penalty applied to clinical variables, "non-attractor" genes, LYM metagenes, MES metagenes and CIN metagenes is 0.001, 0.114, 0.030, 0.038 and 0.028, respectively. This illustrates how the xtune LASSO can induce differential shrinkage among the features according to their practical importance. In this example, the xtune LASSO shrinks the coefficients corresponding to expression features that do not belong to any metagene (the vast majority) toward zero much more aggressively than the standard LASSO, while it shrinks the clinical variables and the expression
Figure 3.3 ROC curves for adaptive LASSO, standard LASSO and xtune LASSO applied to the breast cancer dataset.
features in metagenes LYM and MES much less than the standard LASSO. The standard LASSO shrank the coefficients for the two clinical variables to 0, effectively selecting them out of the model. The adaptive LASSO selected out one of the clinical variables (lymph node status). In agreement with
our simulation results, the xtune LASSO also yielded a much more parsimonious model with only
ten selected features, while the standard LASSO selected 207 features. However, the xtune LASSO
is more computationally demanding than the standard LASSO tuned by cross-validation using
glmnet.
Table 3.1 Comparison of standard LASSO, adaptive LASSO and xtune LASSO in terms of AUC, number of selected covariates and computation time (in minutes) in the breast cancer data example.

                                  Standard LASSO   Adaptive LASSO   xtune LASSO
AUC                                    0.675            0.718          0.746
Number of selected covariates            207                5             10
Computation time (minutes)              1.58            30.81          12.15
3.4 Discussion
We introduced xtune LASSO, a new approach implemented in the namesake R package for integrating external meta-features to inform the tuning of penalty parameters in LASSO regression. We showed through simulation and real data examples that xtune could yield better prediction performance than the standard LASSO. These findings are consistent with the work of (van de Wiel et al., 2016), who proposed a method for estimating group-specific penalties for ridge regression and showed that the use of prior knowledge could improve prediction performance.
xtune LASSO differs from related methods (Pan et al., 2010; Boulesteix et al., 2017; Liu et al., 2018; Yuan and Lin, 2006; Zou, 2006b) in our use of a direct empirical Bayes approach to estimate the penalty parameters, rather than relying on cross-validation. In the particular case of no external meta-features (i.e., Z is a vector of ones), xtune performs empirical Bayes tuning of the single LASSO penalty parameter, providing an alternative to standard tuning by cross-validation. Cross-validation becomes impractical with more than a handful of penalty parameters, while empirical Bayes tuning allows xtune to handle a much larger number of individualized penalties.
In both our real data application examples, the meta-features are categorical variables that group features into subsets. A categorical meta-feature also arises when the features originate from different data types (e.g., gene expression, methylation, somatic mutations). Liu et al. (2018), Boulesteix et al. (2017), and van de Wiel et al. (2016) showed that having separate penalty parameters for each data type can yield better prediction performance. However, xtune is not limited to categorical meta-features and can equally handle quantitative ones, such as effect estimates or p-values for the association of the features and the outcome of interest or related outcomes, which can be exploited to integrate information from previous studies.
Although prediction performance has been our main focus of interest, our results also show that for the range of simulation scenarios we considered and in the two real data applications, xtune tends to yield sparser and, therefore, more interpretable models than standard LASSO regression. Moreover, the estimates of the meta-feature coefficients α can yield additional biological insights as they capture the importance of the meta-features for predicting the outcome.
However, we did note that when the number of meta-features q relative to the sample size n is large, the α estimates may not be stable. A related limitation is that in its current implementation, xtune does not scale to ultra-high-dimensional datasets. Typical datasets that the xtune LASSO can currently handle have sample sizes of up to n ≈ 5000, with p ≈ 50,000 features and q ≈ 100 meta-features. However, we believe that future algorithmic improvements, along with parallelization, can extend the applicability of the xtune LASSO to larger datasets and larger numbers of meta-features. In Chapter 5, we will describe extensions of xtune to binary logistic regression, as well as the incorporation of external information in ridge and Elastic-Net penalties in addition to the LASSO.
Appendix 3.A
In Section 3.A.1, we provide additional simulation results for Section 3.3.1. Supplementary information for the breast cancer application is shown in Section 3.A.2.
3.A.1 Additional simulation results
Section 3.3.1 shows the simulation results comparing the xtune LASSO vs. the standard LASSO in terms of R² in test sets. Here we provide more details on the models' performance in terms of prediction MSE and relative MSE reduction (Figure 3.5). In addition, to investigate the effect of correlation of features on model performance, four more scenarios varying the correlation magnitude and structure between features are investigated. The covariance matrix for the feature matrix X has the correlation structure shown in Figure 3.4. We consider four different settings of correlation magnitude and structure.
Figure 3.4 Illustration of the correlation structure used in simulation studies. The red block represents the correlation between signal features (features with non-zero regression coefficients) and the blue block represents the correlation between noise features (features with zero regression coefficients). Both signal features and noise features have an AR(1) correlation structure. The correlation for signal features is ρ_signal and the correlation for noise features is ρ_noise.
1. The indices of the non-zero (signal) regression coefficients are selected randomly. Signal features and noise features are correlated with each other with correlation ρ.
2. Both non-zero (signal) features and zero (noise) features have an AR(1) correlation structure with correlation ρ_all (ρ_noise = ρ_signal = ρ_all), and there is no correlation between signal features and noise features.
3. Signal features are weakly correlated with an AR(1) correlation structure with ρ_signal = 0.3. Noise features are correlated with an AR(1) correlation structure with ρ_noise, and there is no correlation between signal features and noise features.
4. Noise features are weakly correlated with an AR(1) correlation structure with ρ_noise = 0.3. Signal features are correlated with an AR(1) correlation structure with ρ_signal, and there is no correlation between signal features and noise features.
(a) Compare standard LASSO vs. xtune LASSO varying first-level data snr = 1, 3, 5. Sample size n = 200, number of features p = 1000, q = 10, α_0 = 3, δ = 0.5, ρ = 0. Evaluation measurements are MSE, R² and relative MSE, moving left to right.
(b) Compare standard LASSO vs. xtune LASSO varying the number of predictors p = 500, 1000, 3000. Sample size n = 200, snr = 3, q = 10, α_0 = 3, δ = 0.5, ρ = 0. Evaluation measurements are MSE, R² and relative MSE, moving left to right.
(c) Compare standard LASSO vs. xtune LASSO varying external data informativeness from low to high. Sample size n = 200, p = 1000, snr = 3, α_0 = 3, δ = 0.5, ρ = 0. Evaluation measurements are MSE, R² and relative MSE, moving left to right.
(d) Compare standard LASSO vs. xtune LASSO varying regression coefficient sparsity δ = 0.3, 0.5, 0.7. Sample size n = 200, p = 1000, snr = 3, q = 10, α_0 = 3, ρ = 0. Evaluation measurements are MSE, R² and relative MSE, moving left to right.
(e) Compare standard LASSO vs. xtune LASSO varying overall penalty magnitude α_0 = 1, 3, 5. Sample size n = 200, p = 1000, snr = 3, q = 10, δ = 0.5, ρ = 0. Evaluation measurements are MSE, R² and relative MSE, moving left to right.
(f) Compare standard LASSO vs. xtune LASSO varying correlation ρ = 0, 0.3, 0.6, 0.9. The indices of the non-zero (signal) regression coefficients are selected randomly. Sample size n = 200, p = 1000, snr = 3, q = 10, α_0 = 3, δ = 0.5. Evaluation measurements are MSE, R² and relative MSE, moving left to right.
(g) Compare standard LASSO vs. xtune LASSO varying correlation ρ_all = 0, 0.3, 0.6, 0.9. Both non-zero (signal) features and zero (noise) features have an AR(1) correlation structure with correlation ρ. No correlation between signal features and noise features. Sample size n = 200, p = 1000, snr = 3, q = 10, α_0 = 3, δ = 0.5. Evaluation measurements are MSE, R² and relative MSE, moving left to right.
(h) Compare standard LASSO vs. xtune LASSO varying correlation ρ_noise = 0, 0.3, 0.6, 0.9 for noise features. Signal features are weakly correlated with an AR(1) correlation structure with ρ_signal = 0.3. Noise features are correlated with an AR(1) correlation structure with ρ_noise. No correlation between signal features and noise features. Sample size n = 200, p = 1000, snr = 3, q = 10, α_0 = 3, δ = 0.5. Evaluation measurements are MSE, R² and relative MSE, moving left to right.
(i) Compare standard LASSO vs. xtune LASSO varying correlation ρ_signal = 0, 0.3, 0.6, 0.9 for signal features. Noise features are weakly correlated with an AR(1) correlation structure with ρ_noise = 0.3. Signal features are correlated with an AR(1) correlation structure with ρ_signal. No correlation between signal features and noise features. Sample size n = 200, p = 1000, snr = 3, q = 10, α_0 = 3, δ = 0.5. Evaluation measurements are MSE, R² and relative MSE, moving left to right.
Figure 3.5 Additional xtune LASSO simulation study results.
3.A.2 Supplementary tables
Table 3.2 Genes of the LYM, MES and CIN attractor based on (Cheng et al., 2013a)
Lymphocypte-Specific Attractor (LYM)
PTPRC CD3D CD48 DOCK2 CD53 ARHGAP25
LCP2 CORO1A IRF8 IL10RA LCK GZMK
CYTIP EVI2B GMFG IL2RG GZMA CD52
PTPRCAP SASH3 PRKCB SELL CD2 IL7R
LPXN PVRIG ARHGAP15 FGL2 PLAC8 HCLS1
PPP1R16B HLA-DMB GPR171 SRGN CCR7 CCR5
NCF4 GPR18 RAC2 MAP4K1 GIMAP4 CD27
LTB CST7 PRF1 CD8A APOBEC3G CCR2
TNFRSF1B SEMA4D CD5 SH2D1A SIT1 TBC1D10C
CD247 TRAT1 ITK SAMD3 S1PR4 RGL4
CD6 UBASH3A PTPN7 CD3G CD96 EOMES
SLAMF1 SLA2 PDCD1 CXCR3 BTLA KLHL6
SIRPG SPOCK2 ZNF831 PSTPIP1 LAT ICOS
ARHGAP30 SLAMF7 DENND1C TYROBP LAPTM5 FCER1G
LY96 CD163 ITGB2 C1QC CD86 MS4A6A
C1QB TNFSF13B FPR3 SLAMF8 GPNMB CLEC2B
CLEC7A TFEC MARCH1 C1orf162 EVI2A MS4A4A
MNDA CYBB C3AR1 MS4A7 HCK MAFB
ITGAM NCF2 WIPF1 MRC1 SLA PLXNC1
TLR8 ST8SIA4 C1QA CD14 SLC31A2 HAVCR2
CLEC4A CCR1 FCGR2B SAMSN1 CSF1R MPEG1
TM6SF1 C1orf38 LAIR1 STX11 LRRC25 FAM49A
CD300LF NLRC4 DOK2 NCF1 HCST PLEK
TNFAIP8L2 CD300A AIF1 MSR1 SIGLEC7 TAGAP
SUCNR1 CD37 GPR65 SIGLEC9 CD84 TLR1
LILRB1 FYB GIMAP6 CSF2RB RCSD1 CTSS
RSE6 MYO1F FERMT3 FCGR2A IKZF1 GIMAP1
DOCK8 FCGR1A LY86 VSIG4 LST1 ALOX5AP
NPL
Mesenchymal Transition Attractor (MES)
COL5A2 COL1A2 COL3A1 SPARC FSTL1 CTSK
COL6A3 FBN1 VCAN COL1A1 CDH11 COL5A1
MMP2 CRISPLD2 AEBP1 FAP DCN HTRA1
THBS2 GLT8D2 ANGPTL2 POSTN LAMB1 MXRA5
LOX ADAM12 SERPINF1 PCOLCE EFEMP2 THY1
TCF4 BGN SPON1 LUM ASPN OLFML2B
LRRC15 DACT1 COL8A2 PDGFRB PALLD ISLR
SRPX2 NNMT MXRA8 NID2 PDGFRA ZEB1
MYL9 LOXL1 BNC2 COL6A2 SFRP2 WISP1
SGCD NOX4 PRRX1 FIBIN P4HA3 NDN
MAGEL2 PPAPDC3 HMCN1 MSRB3 TSHZ3 FAM26E
ECM2 RARRES2 TIMP2 RAB31 ANTXR1 RAB34
FRMD6 FNDC1 VIM SDC2 ZNF521 CMTM3
RCN3 IGFBP7 GPC6 CTHRC1 MITF HEG1
SI2 SULF1 SERPING1 ENTPD1 PDGFC OLFML1
GLI3 VEGFC CCDC80 SPOCK1 GAS1 FAM110B
DDR2 C1S COL12A1 DJB5 COL11A1 CNRIP1
SYNDIG1 TSPYL5 INHBA FN1 COL10A1 TNFAIP6
COPZ2 PMP22 ALPK2 TMEM158 PRKG1 ACTA2
VGLL3 PLAU EPYC GREM1 GJB2 MMP11
ADAMTS2 NUAK1 C3orf80 ECM1 COLEC12 FILIP1L
THBS1 NTM TIMP3 HEPH SFRP4 VCAM1
EDNRA ITGBL1
Mitotic CIN Attractor (CIN)
CENPA CEP55 ASPM KIF4A DLGAP5 CDC20
UBE2C PRC1 KIF2C CCNB2 FOXM1 BUB1B
AURKA MELK CCNB1 CC2 NCAPG KIF20A
HJURP KIF14 NUSAP1 BUB1 STIL NDC80
TTK KIF11 RACGAP1 RAD51AP1 TPX2 KIF15
ZWINT NEK2 EZH2 NCAPH MAD2L1 TRIP13
GINS1 RRM2 CDKN3 CDK1 CENPF CENPE
DSCC1 CDC45 MKI67 CENPN ESPL1 PBK
CKS2 HMMR CDCA8 MCM10 EXO1 AURKB
CENPI CDCA3 CDCA5 C15orf42 FAM83D DEPDC1B
PTTG3P MYBL2 ANLN KIF4B RAD51 DEPDC1
PTTG1 ARHGAP11A FAM54A GTSE1 FAM64A ORC1
KIF23 PLK1 BIRC5 FEN1 E2F8 UBE2T
PARPBP RSEH2A CHEK1 TK1 FANCI DTL
DBF4 NUF2 SPC25 KIF18A CCDC99 SHCBP1
POLE2 FBXO5 OIP5 RFC4 SMC4 PTTG2
CDC25C HELLS SGOL2 MASTL PLK4 TROAP
TOP2A POLQ KIFC1 ERCC6L KP2 ECT2
Table 3.3 Genes selected by standard LASSO, adaptive LASSO and xtune LASSO in the breast
cancer data example
Genes selected by standard LASSO
RNF215 SNCG HIST1H1B WWC2 ZGLP1 ABO
LOC221442 P53TG1 RSPH9 MX1 PTP4A3 ADCK2
C20orf12 TRIML2 ZNF283 SERPINE3 P2RX2 AIM1L
CLIC4 F12 CYP3A43 NBPF1 UBE2R2 ZNF460
MRPL11 INSIG2 SIRPG PRR7 RIPK4 MTERFD3
OR2L2 NT5C2 NDUFAF2 OR6Y1 BAX OR6K2
IL13RA2 MRTO4 CSH2 ZNF432 C21orf84 GPR50
RPP25 KIF20A AMPD2 NLRX1 LOC729264 CLASP1
AK125594 CD244 CYP26C1 PSKH2 ZNF454 NNAT
ARHGEF5 PLEKHM1 ABHD8 RGS12 KBTBD12 AGRP
NGF C10orf71 PBRM1 ZNF93 DCUN1D5 KRAS
SCAMP1 PA2G4 CHRNA4 BTF3L4 PPP6R3 OR5H1
THSD1 RPS2 MATN4 ADCY1 ACSS2 TSPAN16
SLC6A1 GOLGA6L6 CPB2 BECN1 TBX4 RIMS4
PAEP LRRFIP2 OSTBETA ACO1 DHRS3 MED19
SLC25A18 HK1 SLC2A2 NKX6-1 C19orf56 ODF3L1
TRAFD1 GABPB1 ELAVL1 PAOX BCAT1 PRDX2
ZNF655 KRT34 KLF5 COQ5 FGD3 SOX11
ANGPT2 BPIFA1 SYCP2L ZNF707 EPHA3 SERPINB4
MGC15885 GLP1R KIAA2015 KCNE3 TAC1 WDR4
AK125462 ZSCAN5D KRTAP10-9 MGAT4B DEPDC5 VPS13A
PPFIBP1 SLC39A2 TMEM179 SPRR3 CCDC27 GDA
AW084149 CR748429 BX091227 LOC254099 BX090143 CR745891
BG941175 DB341941 LOC100133461 BF675965 AI184572 NUP210L
BQ011290 EPG5 AW594636 BX096685 AI681512 BM674836
AI205778 BU620815 CA441597 AW593145 CD691743 AA349515
LOC729175 BX109079 W36271 AA400223 CA398130 AI928453
C6orf112 AA431757 BU075540 AI769920 AW297905 AA393387
BX104313 BE148564 C7orf40 HGC6.3 BG204289 DB298273
BX089019 BU737861 AA903082 BM980906 POU5F1B SLC18A2
DEFB112 VIT PTTG1IP BLNK PI4K2B HCRT
CYR61 MPND C17orf61 HHIP CCDC127 WDR16
SLC5A12 PEG10 TAGLN3 NQO1 CPSF4 PIGA
TEX11 AGL LIAS CABP2 ARID1B
Genes selected by xtune LASSO
SIRPG THBS1 KIF20A VGLL3 FAM83D RAD51
CENPA
Genes selected by adaptive LASSO
ZNF358 CETN2 AW136659 KCNG4
Chapter 4
Marginal likelihood maximization
algorithms
4.1 Introduction
In the previous chapter, we discussed an empirical Bayes approach (type-II maximum likelihood estimation) whereby we integrate out the regression coefficients β and maximize the marginal likelihood with respect to the hyperparameters α. The estimation of the data-driven penalty parameters lies in optimizing the marginal likelihood (5.9):
$$-\ell(\alpha) = \log|C_\alpha| + Y^T C_\alpha^{-1} Y$$
where $C_\alpha = \sigma^2 I + XV^{-1}X^T$ and
$$V = \mathrm{diag}(\eta_1, \ldots, \eta_p) = \mathrm{diag}\!\left( \frac{1}{2\sigma^2}\exp(2Z_{1\cdot}\alpha),\; \ldots,\; \frac{1}{2\sigma^2}\exp(2Z_{p\cdot}\alpha) \right).$$
Having estimated α, we plug it into the log-linear function to compute the individual penalty parameters.
The Nelder-Mead algorithm, which is the default method of the optim() function in R, was first used to maximize the marginal likelihood, but its implementation is slow in practice. To improve computational efficiency, we implemented several optimization procedures, including an EM algorithm and direct update algorithms using gradient-based optimizers. The EM algorithm is straightforward to implement, but its convergence can be very slow in practice. The direct update algorithm converges faster than the EM algorithm. However, optimizing the marginal likelihood function, whether by the EM algorithm or the direct update algorithm, is slow in practice. Motivated by (Wipf and Nagarajan, 2008), we propose an accelerated reweighted L2 algorithm, which exploits two majorization functions of the marginal likelihood to enable maximization via two "nested" MM procedures.
In this chapter, we present the above-mentioned training algorithms for implementing the xtune LASSO and evaluate their performance. For all algorithms, we assume the noise variance σ² is estimated by $\hat\sigma^2_{L,\hat\lambda}$ using the R package selectiveInference (see Section 4.2.1).
4.2 Methods
4.2.1 Estimation of σ²
As described in Section 3.2.1, we use a double exponential prior for β, which is conditional on σ². The population variance parameter σ² can be estimated from the data or given a point-mass prior. Here we view σ² as a nuisance parameter and estimate it from the data. Variance estimation in high-dimensional linear models is a difficult problem. Several error variance estimators have been proposed with proofs of asymptotic consistency and normality under a variety of assumptions. Reid et al. (2013) proposed a residual-sum-of-squares-based estimator using the LASSO coefficients with a penalty parameter selected by cross-validation.
$$\hat\sigma^2_{L,\hat\lambda} = \frac{1}{n - \hat s_{L,\hat\lambda}} \sum_{i=1}^{n} (Y_i - X_i\hat\beta)^2 \tag{4.1}$$
where $\hat\lambda$ is selected by 10-fold cross-validation, $\hat\beta$ is the vector of estimated LASSO regression coefficients, and $\hat s_{L,\hat\lambda}$ is the number of nonzero covariates at regularization parameter $\hat\lambda$. In their paper, the performance of ten noise variance estimators (including $\hat\sigma^2_{L,\hat\lambda}$) was investigated in high-dimensional settings. Simulation studies and theoretical analysis showed the cross-validation-based estimator $\hat\sigma^2_{L,\hat\lambda}$ to be robust and to perform admirably under different sparsity and signal strength settings.
Therefore, in our current implementation, we use ˆ ‡ 2
L,
ˆ
⁄ to estimate the noise variance because (1)
it works well in practice and gives a reasonably good estimate of ‡ 2
(2) the estimation is fast and
helps making the estimation of – easier. ˆ ‡ 2
L,
ˆ
⁄ is estimated using the estimateSigma() function in
the R package selectiveInference.
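As an illustration, the sketch below shows how this estimate might be obtained in R; it assumes a numeric design matrix X and response vector Y are already in memory.

```r
# Sketch: noise variance estimation via the Reid et al. (2013) cross-validated
# LASSO estimator, using estimateSigma() from the selectiveInference package.
library(selectiveInference)

sigma_fit  <- estimateSigma(X, Y, intercept = TRUE, standardize = TRUE)
sigma2_hat <- sigma_fit$sigmahat^2   # estimateSigma() returns the estimated noise SD
```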
4.2.2 EM algorithm
Since our model has the form α → β → Y, a straightforward approach is the EM algorithm. In our model, α is the parameter vector to optimize. Because the regression coefficients β are integrated out, we can regard them as latent variables, and hence we can optimize the marginal likelihood using EM. In the E step, we first compute the posterior distribution of β given the current estimate of α and then use it to form the expected complete-data log-likelihood Q(α). In the M step, we maximize Q(α) with respect to α to obtain the estimated individual weight precisions.
E step
In the E step, we compute the posterior distribution p(β | X, Y, α). Due to the conjugacy of the Gaussian prior, the posterior is also Gaussian. The posterior of β given the current estimate of α is:
\[
p(\beta \mid X, Y, \alpha) = N(\mu, \Sigma)
\]
where
\[
\Sigma = \big(\sigma^{-2}X^TX + V\big)^{-1}, \qquad
\mu = \sigma^{-2}\,\Sigma X^T Y \tag{4.2}
\]
M step
The expected complete-data log-likelihood is:
\[
\begin{aligned}
Q(\alpha) &= \mathrm{E}\big[\log N(Y,\beta \mid \alpha)\big] \\
&= \mathrm{E}\big[\log N(Y \mid X\beta, \sigma^2 I_n) + \log N(\beta \mid 0, V^{-1})\big] \\
&= \tfrac{1}{2}\,\mathrm{E}\Big[N\log\sigma^{-2} - \sigma^{-2}\|Y - X\beta\|^2 + \log(\det(V)) - \mathrm{Tr}(V\beta\beta^T)\Big] + \text{const} \\
&= \tfrac{1}{2}N\log\sigma^{-2} - \tfrac{\sigma^{-2}}{2}\big(\|Y - X\mu\|^2 + \mathrm{Tr}(X^TX\Sigma)\big)
+ \tfrac{1}{2}\log(\det(V)) - \tfrac{1}{2}\mathrm{Tr}\big[V(\mu\mu^T + \Sigma)\big] + \text{const}
\end{aligned} \tag{4.3}
\]
Since Σ and μ are computed in the E step, they can be regarded as constants. The only term that contains V is:
\[
Q' = \tfrac{1}{2}\log(\det(V)) - \tfrac{1}{2}\mathrm{Tr}\big[V(\mu\mu^T + \Sigma)\big]
\]
\[
\begin{aligned}
\log(\det(V)) &= \log\Big(\tfrac{1}{2\sigma^2}\exp(2Z_{1\cdot}\alpha)\cdot\tfrac{1}{2\sigma^2}\exp(2Z_{2\cdot}\alpha)\cdots\tfrac{1}{2\sigma^2}\exp(2Z_{p\cdot}\alpha)\Big) \\
&= \log\big(\exp\big(2(Z_{1\cdot}+\cdots+Z_{p\cdot})\alpha\big)\big) + \text{const} \\
&= 2(Z_{1\cdot}+\cdots+Z_{p\cdot})\alpha + \text{const} \\
&= 2Z^{\text{sum}}\alpha + \text{const} \\
&= 2\sum_{i=1}^{q} Z^{\text{sum}}_i\alpha_i + \text{const}
\end{aligned} \tag{4.4}
\]
where
\[
Z^{\text{sum}} = \sum_{j=1}^{p} Z_{j\cdot}
\;\;\Rightarrow\;\;
\frac{\partial \log(\det(V))}{\partial \alpha_i} = 2Z^{\text{sum}}_i \tag{4.5}
\]
\[
\mathrm{Tr}\big[V(\mu\mu^T + \Sigma)\big] = \sum_{j=1}^{p} V_j\,(\mu_j^2 + \Sigma_{jj})
\]
Therefore,
\[
\begin{aligned}
\frac{\partial\,\mathrm{Tr}\big[V(\mu\mu^T + \Sigma)\big]}{\partial \alpha_i}
&= \sum_{j=1}^{p}\frac{\partial\big(V_j(\mu_j^2 + \Sigma_{jj})\big)}{\partial \alpha_i}
= \sum_{j=1}^{p}(\mu_j^2 + \Sigma_{jj})\frac{\partial V_j}{\partial \alpha_i}
= \sum_{j=1}^{p} c_j\frac{\partial V_j}{\partial \alpha_i} \\
&= \sum_{j=1}^{p} c_j Z_{ji}\exp(2Z_{j\cdot}\alpha)
= \sum_{j=1}^{p} c_j Z_{ji}\exp(2Z_{j\cdot,-i}\alpha_{-i})\exp(2Z_{ji}\alpha_i)
\end{aligned} \tag{4.6}
\]
where c_j = (μ_j² + Σ_jj). We also have
\[
\frac{dQ}{d\alpha_i} = 2Z^{\text{sum}}_i - \sum_{j=1}^{p} c_j Z_{ji}\exp(2Z_{j\cdot}\alpha) \tag{4.7}
\]
Because the exp(2Z_{ji}α_i) term is not linear in α_i, exp(α_i) cannot be separated out, so there is no closed-form update rule for α. Since we cannot directly set dQ/dα_i to zero, we have to maximize the Q function numerically in the M step. In summary, we propose the following EM algorithm.
Algorithm 3: EM algorithm to implement xtune LASSO
Step 1: Initialize α with α⁰
while not converged do
Step 2 (E step): Using the current value of α, compute:
\[
\Sigma = \big(\sigma^{-2}X^TX + V\big)^{-1}, \qquad
\mu = \sigma^{-2}\,\Sigma X^T Y \tag{4.8}
\]
Step 3 (M step): Update α by maximizing the Q function using its gradient:
\[
Q'(\alpha) = \log(\det(V)) - \mathrm{Tr}\big[V(\mu\mu^T + \Sigma)\big], \qquad
\frac{\partial Q'(\alpha)}{\partial \alpha_i} = 2Z^{\text{sum}}_i - \sum_{j=1}^{p} c_j Z_{ji}\exp(2Z_{j\cdot}\alpha) \tag{4.9}
\]
end
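For concreteness, the sketch below implements one EM iteration along these lines in R. It assumes X, Y, Z, the noise variance estimate sigma2, and a current value of α are in memory; constants absorbed into the objective may differ slightly from (4.9), so this is a sketch rather than the exact implementation.

```r
# Sketch of one EM iteration for xtune LASSO: closed-form E step, numerical M step.
em_iteration <- function(alpha, X, Y, Z, sigma2) {
  p   <- ncol(X)
  eta <- as.vector(exp(2 * Z %*% alpha)) / (2 * sigma2)   # prior precisions, V = diag(eta)

  ## E step (4.8): Gaussian posterior of beta given the current alpha
  Sigma <- solve(crossprod(X) / sigma2 + diag(eta, p))
  mu    <- Sigma %*% crossprod(X, Y) / sigma2
  c_j   <- as.vector(mu)^2 + diag(Sigma)

  ## M step: maximize Q'(alpha) = sum(log eta) - sum(eta * c_j) numerically
  negQ <- function(a) {
    eta_a <- as.vector(exp(2 * Z %*% a)) / (2 * sigma2)
    -(sum(log(eta_a)) - sum(eta_a * c_j))
  }
  optim(alpha, negQ, method = "BFGS")$par
}
```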
4.2.3 Direct update algorithms
Another natural approach is to directly maximize (5.9) using its gradient. We show below that the
gradient function cannot be set to zero to obtain a closed-form update rule. Two gradient-based
optimization algorithms are used.
Noting that
\[
|V|\,\big|\sigma^2 I + XV^{-1}X^T\big| = \big|\sigma^2 I\big|\,\big|V + \sigma^{-2}X^TX\big|,
\]
the log-determinant component of (5.9) is:
\[
\log\big|\sigma^2 I + XV^{-1}X^T\big| = -\log|\Sigma| + n\log\sigma^2 - \log|V| \tag{4.10}
\]
Also, by the Woodbury inversion identity
\[
(A + UCV)^{-1} = A^{-1} - A^{-1}U\big(C^{-1} + VA^{-1}U\big)^{-1}VA^{-1},
\]
we have:
\[
\big(\sigma^2 I + XV^{-1}X^T\big)^{-1} = \sigma^{-2}I - \sigma^{-2}X\big(V + \sigma^{-2}X^TX\big)^{-1}X^T\sigma^{-2}
\]
Therefore, the quadratic component of (5.9) is:
\[
\begin{aligned}
Y^T\big(\sigma^2 I + XV^{-1}X^T\big)^{-1}Y
&= \sigma^{-2}Y^TY - \sigma^{-2}Y^TX\,\Sigma\,X^T\sigma^{-2}Y \\
&= \sigma^{-2}Y^T(Y - X\mu) \\
&= \sigma^{-2}\big[\|Y - X\mu\|^2 + Y^TX\mu - \mu^TX^TX\mu\big] \\
&= \sigma^{-2}\|Y - X\mu\|^2 + \sigma^{-2}Y^TX\mu - \sigma^{-2}\mu^TX^TX\mu \\
&= \sigma^{-2}\|Y - X\mu\|^2 + \mu^T\Sigma^{-1}\mu - \sigma^{-2}\mu^TX^TX\mu \\
&= \sigma^{-2}\|Y - X\mu\|^2 + \mu^TV\mu
\end{aligned} \tag{4.11}
\]
Adding up (4.10) and (4.11), we have:
\[
-\ell(\alpha) = -\log|\Sigma| + n\log\sigma^2 - \log|V| + \frac{1}{\sigma^2}\|Y - X\mu\|^2 + \mu^TV\mu
\]
Define the following eigenvector equation:
\[
\big(\sigma^{-2}X^TX\big)\gamma_j = \delta_j\gamma_j \tag{4.12}
\]
Therefore, Σ⁻¹ = σ⁻²XᵀX + V has eigenvalues δ_j + η_j. Now consider the derivative of −log|Σ| with respect to η_j:
\[
\frac{d\,(-\log|\Sigma|)}{d\eta_j}
= \frac{d}{d\eta_j}\log\prod_{j=1}^{p}(\delta_j + \eta_j)
= \frac{d}{d\eta_j}\sum_{j=1}^{p}\log(\delta_j + \eta_j)
= \frac{1}{\delta_j + \eta_j}
= \Sigma_{jj}
\]
\[
\frac{d\ell}{d\eta_j} = -\Sigma_{jj} + \frac{1}{\eta_j} - \mu_j^2 \tag{4.13}
\]
For xtune LASSO:
\[
\frac{d\ell}{d\alpha_i} = \sum_{j=1}^{p}\frac{d\ell}{d\eta_j}\times\frac{d\eta_j}{d\alpha_i} \tag{4.14}
\]
Since η_j = (1/(2σ²)) exp(2Z_{j·}α), we have dη_j/dα_i = Z_{ji} exp(2Z_{j·}α) and
\[
\frac{d\ell}{d\alpha_i} = \sum_{j=1}^{p}\Big(-\Sigma_{jj} + \frac{1}{\eta_j} - \mu_j^2\Big)Z_{ji}\exp(2Z_{j\cdot}\alpha) \tag{4.15}
\]
Setting (4.15) to zero, we have
\[
2\sum_{j=1}^{p}Z_{ji} - \sum_{j=1}^{p}c_j Z_{ji}\exp(2Z_{j\cdot}\alpha) = 0
\]
which has the same form as (4.7) from the EM algorithm when the gradient of the Q function is set to zero. Therefore, the EM algorithm and the direct update algorithm have the same update rule, given that σ² is estimated externally. As with the EM algorithm, the direct update approach does not yield a closed-form iterative update: since we cannot directly set the derivative of ℓ(α) with respect to α to zero, we have to use numerical methods to estimate α. In summary, (5.9) is the function we want to optimize, and its gradient has components
\[
\frac{\partial \ell(\alpha)}{\partial \alpha_i} = 2\sum_{j=1}^{p}Z_{ji} - \sum_{j=1}^{p}c_j Z_{ji}\exp(2Z_{j\cdot}\alpha) \tag{4.16}
\]
Since both the objective function and its gradient can be computed analytically, we can use gradient-based optimization algorithms to optimize (5.9). We implemented two such algorithms: the L-BFGS algorithm and the gradient descent with momentum algorithm, described below.
L-BFGS algorithm
The Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm belongs to the quasi-
Newton optimization class. It solves the problem of minimizing an objective, given its gradient,
by iteratively computing approximations of the inverse Hessian matrix of the objective function
(Coppola et al., 2014).
We use the lbfgs package in R to implement the L-BFGS algorithm. The package interfaces with a C++ implementation and is fast and memory-efficient, which makes it particularly well suited for large-scale problems. It is more computationally efficient than other optimization packages in R, such as optim or BB. A detailed description of the algorithm and the package can be found in Coppola et al. (2014).
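The sketch below illustrates the kind of call involved; for simplicity it uses optim()'s built-in L-BFGS-B method rather than the lbfgs package, and the objective is a direct transcription of (5.9). The starting value alpha0 and the inputs X, Y, Z, sigma2 are assumed to be in memory; the analytic gradient (4.16) could be supplied as well but is omitted here.

```r
# Sketch: gradient-based maximization of the approximate marginal likelihood (5.9).
neg_loglik <- function(alpha, X, Y, Z, sigma2) {
  eta <- as.vector(exp(2 * Z %*% alpha)) / (2 * sigma2)    # prior precisions
  C   <- sigma2 * diag(nrow(X)) + X %*% (t(X) / eta)       # C_alpha = sigma^2 I + X V^{-1} X^T
  as.numeric(determinant(C, logarithm = TRUE)$modulus + t(Y) %*% solve(C, Y))
}

fit <- optim(par = alpha0, fn = neg_loglik, X = X, Y = Y, Z = Z, sigma2 = sigma2,
             method = "L-BFGS-B", control = list(maxit = 200))
alpha_hat <- fit$par
```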
Gradient Descent with Momentum Algorithm
A straightforward algorithm to implement gradient-based optimization is gradient descent. Instead
of using standard gradient descent, we apply the gradient descent with momentum algorithm, as it
almost always works faster than the standard gradient descent algorithm. The proposed algorithm
is as follows.
Algorithm 4: Gradient descent with momentum algorithm to implement xtune LASSO
Step 1: Set the momentum strength hyperparameter m = 0.9 (m can be any value between 0 and 1) and the step size s = 0.1.
Step 2: Initialize α to α⁰, initialize the velocity vector v (say, to 0); set t = 0.
while not converged do
Step 3: Compute the gradient g using (4.16)
Step 4: Update the velocity: v ← mv − sg
Step 5: Update the parameters: α_{t+1} = α_t + v
end
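A minimal R sketch of Algorithm 4 is given below; grad_fn is a placeholder for a function returning the gradient in (4.16), and the default m, s, and stopping rule mirror the values stated above.

```r
# Sketch of gradient descent with momentum for the alpha update.
momentum_descent <- function(alpha0, grad_fn, m = 0.9, s = 0.1, tol = 1e-4, maxit = 1000) {
  alpha <- alpha0
  v     <- rep(0, length(alpha0))            # velocity vector
  for (t in seq_len(maxit)) {
    g         <- grad_fn(alpha)              # gradient of the objective at current alpha
    v         <- m * v - s * g               # Step 4: velocity update
    alpha_new <- alpha + v                   # Step 5: parameter update
    if (max(abs(alpha_new - alpha)) < tol) return(alpha_new)
    alpha <- alpha_new
  }
  alpha
}
```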
4.2.4 Iterative reweighted-L2 algorithm
Relationship to Automatic Relevance Determination
For xtune LASSO, we integrate out the regression coefficients β and optimize the hyperparameters α, not vice versa. This type of empirical Bayes (type-II ML) estimation is also used in a method called automatic relevance determination (ARD) in the context of neural networks and signal processing (MacKay, 1992; Neal, 1995). The model specification of ARD is closely related to our approximated hierarchical model. Here we review the problem statement and recent work on ARD.
Like our model, the ARD model associates a single precision hyperparameter with each regression coefficient. It is a hierarchical Bayesian approach in which hyperparameters explicitly represent the relevance of different input features. Each regression coefficient is given an independent Gaussian prior, and the hyperparameter vector γ controls how far away from zero each regression coefficient is allowed to go. The ARD model is related to the Bayesian LASSO (Park and Casella, 2008) discussed in Chapter 1: in ARD, the regression coefficient precisions are treated as constants, while in the Bayesian LASSO they are treated as random variables and assigned a conjugate prior distribution.
ARD assumes the following model:
\[
\begin{aligned}
Y \mid X, \beta &\sim N(X\beta, I_n\sigma^2) \\
\beta &\sim N(0, V^{-1})
\end{aligned} \tag{4.17}
\]
where V = diag(γ₁, …, γ_p).
MacKay (1992) and Neal (1995) first proposed ARD for neural network input selection. In their papers, the authors discussed the use of an evidence framework to obtain point estimates for the hyperparameters: the point estimates for η and σ² are obtained by maximizing log p(D | η, σ²). They also discussed the use of the evidence framework with neural network models, which is more complex (Murphy, 2012).
Motivated by this work, Tipping (2001) proposed a general Bayesian framework for obtaining sparse solutions to regression and classification tasks using sparse Bayesian learning (SBL) and the relevance vector machine (RVM). In all three of the above papers, the model has a form similar to (4.17). Two algorithms, an EM algorithm and a direct update algorithm, were proposed to implement ARD. The EM algorithm is straightforward to implement, but its convergence can be very slow in practice. The direct update algorithm converges faster than the EM algorithm. However, optimizing the marginal likelihood function with either algorithm is slow in practice. Moreover, neither EM nor direct update is guaranteed to converge to a local minimum or even a saddle point (Wipf and Nagarajan, 2008).
Tipping and Faul (2003) further explored the properties of the marginal likelihood function and proposed a greedy and highly accelerated algorithm that optimally updates a single hyperparameter at a time. While the algorithm proposed by Tipping and Faul (2003) is very efficient and very fast, it can sometimes be trapped in a local minimum for high-dimensional data, i.e., when p > n (Wipf and Nagarajan, 2008). To generalize the methods mentioned above to problems such as classification without additional approximation, and to explore the relationship between the ARD model and maximum a posteriori (MAP) estimation, Wipf and Nagarajan (2008) and Wipf and Nagarajan (2014) proposed an alternative view of the ARD model. They described an alternative formulation of the ARD cost function using an auxiliary function. The update rule is derived by decoupling the likelihood using upper-bounding functions that are convex and can be optimized more efficiently. The proposed reformulation of ARD can be conveniently optimized by solving a series of reweighted L1 problems.
ARD, SBL, and RVM all assume the model described in (4.17); essentially, they are the same method with different names. In the context of neural networks, the method is called ARD, and applying ARD to neural networks gives an efficient means of variable selection in non-linear models (MacKay, 1992; Neal, 1995). In the context of linear models, the method is called SBL (Tipping, 2001). Combining ARD and SBL with the use of a kernel function in a linear model gives the relevance vector machine (Murphy, 2012).
Derivation of the reweighted-L2 algorithm
As described in Sections 4.2.2 and 4.2.3, both the EM algorithm and the two direct update algorithms involve computing the inverse of a p×p matrix in each iteration. When the number of features p is large, computing the matrix inverse is time-consuming. Also, the EM algorithm usually takes many iterations to converge. Therefore, all three algorithms discussed above are slow in practice when p is large.
Wipf and Nagarajan (2008), however, pointed out that the log-likelihood is the sum of a concave function and a convex function and introduced a new view of ARD. They convert the non-convex optimization problem in ARD into a series of reweighted minimum-L1 procedures. Motivated by their idea, we modify their algorithm and apply it to our xtune LASSO model. We therefore propose the iterative reweighted-L2 optimization algorithm described in detail below.
The main idea of the reweighted-L2 algorithm is to use two nested loops of the majorization-minimization (MM) procedure. Figure 4.1 illustrates the idea of the MM procedure. For minimizing f over x ∈ X, the MM procedure consists of two steps: the majorization step constructs a majorizing function g that is an upper bound on f and coincides with f at x_t; then, in the minimization step, we update x by x_{t+1} = argmin_{x∈X} g(x | x_t) (Sun et al., 2017).
Figure 4.1 Idea of the MM procedure. The MM procedure consists of two steps: (1) find a surrogate function that is an upper bound on the objective function at the current point; (2) minimize the surrogate function to obtain the updated value. This figure is from (Sun et al., 2017).
The marginal likelihood (5.9) has two components. From Boyd and Vandenberghe (2004), log|C_α| is concave in α, and Yᵀ C_α⁻¹ Y is convex. A majorizing function for a concave function f(θ) is its tangent (first-order) approximation g(θ | θ_n), as it satisfies:
\[
\begin{aligned}
f(\theta_n) &= g(\theta_n \mid \theta_n) \\
f(\theta) &\le g(\theta \mid \theta_n), \quad \theta \ne \theta_n.
\end{aligned} \tag{4.18}
\]
That is, g(θ | θ_n) lies above f(θ) and is tangent to it at the point θ = θ_n (Boyd and Vandenberghe, 2010). Figure 4.2 gives a graphical representation of a concave function and its gradient, and Figure 4.3 illustrates the non-convex marginal likelihood as a sum of a concave function and a convex function.
The slope of log|C_α| at the current value of α is:
\[
\theta = \nabla_\alpha \log|C_\alpha| = \mathrm{diag}\big[X^T C_\alpha^{-1} X\big] \tag{4.19}
\]
Figure 4.2 Graphical representation of a concave function and its gradient. A concave function can be upper bounded by a linear function. This figure is from (Sun et al., 2017).
Figure 4.3 Non-convex marginal likelihood (5.9) as a sum of a concave function and a convex function. This figure is from (Yuille and Rangarajan, 2001).
Therefore we can use the following optimization algorithm:
Algorithm 5: MM procedure overview
Step 1: Initialize α with α⁰ (e.g., α_i = 0, ∀i)
while not converged do
Step 2: Given α, update θ by:
\[
\theta_{\text{new}} \leftarrow \mathrm{diag}\big[X^T C_\alpha^{-1} X\big]
\]
Step 3: Given θ, update α by solving the following minimization problem:
\[
\alpha \leftarrow \arg\min_\alpha\; \ell_\theta(\alpha) \triangleq \theta^T\frac{1}{\eta} + Y^T C_\alpha^{-1} Y \tag{4.20}
\]
end
Here 1/η denotes the vector of elementwise reciprocals of η. The minimization problem (4.20) is a convex optimization problem and can be expressed as an iterative reweighted-L2 optimization problem. According to Wipf and Nagarajan (2008), the data-dependent term Yᵀ C_α⁻¹ Y can be re-expressed as:
\[
Y^T C_\alpha^{-1} Y = \min_\delta\; \frac{1}{\sigma^2}\|Y - X\delta\|^2 + \sum_j \eta_j\delta_j^2 \tag{4.21}
\]
Therefore the upper-bounding auxiliary function for ℓ_θ(α) is:
\[
\ell_\theta(\alpha, \delta) \triangleq \theta^T\frac{1}{\eta} + \sum_j \eta_j\delta_j^2 + \frac{1}{\sigma^2}\|Y - X\delta\|^2 \;\ge\; \ell_\theta(\alpha) \tag{4.22}
\]
The value of α that minimizes (4.20) can be estimated by iteratively updating δ and α in (4.22). For any δ, α is estimated by minimizing
\[
\theta^T\frac{1}{\eta} + \sum_j \eta_j\delta_j^2 = 2\theta^T\exp(-2Z\alpha) + \sum_j \tfrac{1}{2}\exp(2Z_{j\cdot}\alpha)\,\delta_j^2 \tag{4.23}
\]
There is no closed-form solution for the α that minimizes (4.23), so we optimize it using numerical methods. The gradient of (4.23) is given by:
\[
\frac{\partial \ell_\theta(\alpha)}{\partial \alpha_i} = -2\sum_j\frac{\theta_j - \delta_j^2\eta_j^2}{\eta_j}\,Z_{ji} \tag{4.24}
\]
This step can be implemented using standard optimization algorithms such as gradient descent or BFGS. Substituting the updated α into (4.22), we can then update δ by:
\[
\delta \leftarrow \arg\min_\delta\; \|Y - X\delta\|^2 + \sum_j \eta_j\delta_j^2 \tag{4.25}
\]
Equation (4.25) is a weighted convex L2-regularized cost function, and it can be optimized efficiently using standard ridge software such as glmnet.
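As an illustration of this step, the sketch below maps (4.25) onto glmnet's ridge parameterization. The scaling of lambda is one way to reconcile glmnet's objective (which divides the squared error by 2n and rescales penalty factors to sum to the number of variables) with (4.25); X, Y, and the current precision vector eta are assumed to be in memory.

```r
# Sketch: weighted ridge update for delta (equation 4.25) via glmnet.
library(glmnet)

n <- nrow(X); p <- ncol(X)
fit_delta <- glmnet(X, Y, alpha = 0,                 # pure ridge penalty
                    standardize = FALSE, intercept = FALSE,
                    penalty.factor = eta,            # per-coefficient weights eta_j
                    lambda = sum(eta) / (n * p))     # matches sum_j eta_j * delta_j^2 overall
delta <- as.vector(coef(fit_delta))[-1]              # drop the intercept slot
```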
In summary, the iterative reweighted-L2 algorithm has the following schema:
Algorithm 6: Reweighted-L2 algorithm to implement xtune LASSO
Step 1: Initialize α with α⁰ (e.g., α_i = 0, ∀ i = 1, 2, …, q)
while not converged do
Step 2: Given α, update θ by:
\[
\theta_{\text{new}} \leftarrow \mathrm{diag}\big[X^T C_\alpha^{-1} X\big]
\]
Step 3: Given θ, update α by the following inner loop:
• Step 3.1: initialize α with the estimates from the last iteration
• Step 3.2: given α, update δ by:
\[
\delta \leftarrow \arg\min_\delta\; \|Y - X\delta\|^2 + \sum_j \eta_j\delta_j^2
\]
We utilize the glmnet package to implement this step.
• Step 3.3: given δ, update α by:
\[
\alpha \leftarrow \arg\min_\alpha\; \theta^T\frac{1}{\eta} + \sum_j \eta_j\delta_j^2
\]
The gradient of the objective function in this step is given by:
\[
L = \theta^T\frac{1}{\eta} + \sum_j \eta_j\delta_j^2, \qquad
L' = -2\sum_j\frac{\theta_j - \delta_j^2\eta_j^2}{\eta_j}\,Z_{j\cdot}
\qquad \Big(\eta = \tfrac{1}{2\sigma^2}\exp(2Z\alpha)\Big) \tag{4.26}
\]
We utilize the optim package with the L-BFGS optimizer to implement this step.
• Step 3.4: iterate Step 3.2 and Step 3.3 until convergence
end
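The outer loop of Algorithm 6 can be sketched in R as follows. update_alpha_inner() is a hypothetical placeholder for the inner loop alternating Steps 3.2 and 3.3 (for example, the glmnet and optim calls sketched above); X, Y, Z, sigma2, and a starting value alpha0 are assumed to be in memory.

```r
# Sketch of the outer MM loop: update theta, then refresh alpha via the inner loop.
reweighted_l2 <- function(X, Y, Z, sigma2, alpha0, tol = 1e-4, maxit = 100) {
  alpha <- alpha0
  for (it in seq_len(maxit)) {
    eta   <- as.vector(exp(2 * Z %*% alpha)) / (2 * sigma2)
    C     <- sigma2 * diag(nrow(X)) + X %*% (t(X) / eta)   # C_alpha
    theta <- colSums(X * solve(C, X))                      # diag[X^T C_alpha^{-1} X]
    alpha_new <- update_alpha_inner(theta, eta, X, Y, Z, sigma2, alpha)
    if (max(abs(alpha_new - alpha)) < tol) break
    alpha <- alpha_new
  }
  alpha
}
```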
4.3 Results
We implemented xtune LASSO using the four algorithms (EM, L-BFGS, gradient descent with momentum, and the iterative reweighted-L2 algorithm) described in the previous section. In this section, we compare their performance in practice in three respects: (1) computation time, (2) the number of iterations until convergence, and (3) the estimated hyperparameter values α.
The simulations were run at a sample size of n = 200 with p = 500 features. The number of variables in the external information Z was q = 10. Each element of Z was randomly sampled to be 0 with probability 0.8 and 1 with probability 0.2, and a column of 1s was appended to Z. We set α₀ = 2, and α_k, k = 1, …, q, were randomly drawn from a normal distribution with mean 0 and variance 1. The values of β were randomly sampled from a Double Exponential(exp(Zα)) distribution, and the true error variance was σ² = 1. We also generated a testing dataset from the same model with the same sample size (n = 200). The model was trained on the training set and validated on the test set. We also ran a cross-validated LASSO model using glmnet as a reference.
Figure 4.4 shows the convergence plots and coefficient estimates for the EM, gradient descent, and iterative reweighted-L2 algorithms. All algorithms use the same convergence criterion of 10⁻⁴. The EM algorithm takes approximately 150 iterations to converge, gradient descent takes about 120 iterations, and the iterative reweighted-L2 algorithm takes about 20 iterations. The coefficient estimates are the same for all three algorithms, which is not surprising because their α estimates are almost identical.
Table 4.1 shows the hyperparameter estimates α obtained with the EM, gradient descent, L-BFGS, iterative reweighted-L2, and Nelder-Mead algorithms. Among the five algorithms compared, the L-BFGS algorithm uses the lbfgs package in R and the Nelder-Mead algorithm uses the optim package in R. The EM, gradient descent, L-BFGS, and iterative reweighted-L2 algorithms are first-order methods that make use of the gradient, while the Nelder-Mead algorithm optimizes the likelihood function directly. The α estimates from the first four optimization algorithms are very similar; the estimates from the Nelder-Mead algorithm differ somewhat from the other four.
The regression coefficient estimates are the same for the first four optimization algorithms, since their hyperparameter estimates α are the same.
Figure 4.4 Convergence plots and estimated coefficient values for the EM, gradient descent, and iterative reweighted-L2 algorithms: (a) EM algorithm; (b) gradient descent algorithm; (c) iterative reweighted-L2 algorithm. The marginal likelihood plot shows the likelihood (5.9) at each iteration. The coefficient histogram shows the distribution of the regression coefficients, and the coefficient values plot shows the estimated value of each regression coefficient.
Table 4.1 Comparison of hyperparameter α estimates across all optimization algorithms

         EM        Gradient Descent   L-BFGS    Reweighted-L2   Nelder-Mead   True
α0       2.4333    2.43               2.4335    2.4323          2.2675        2
α1      -1.1486   -1.1463            -1.1488   -1.1463         -1.0395       -0.7668
α2      -0.8108   -0.8099            -0.8109   -0.8114         -0.6805       -0.8165
α3      -0.264    -0.2632            -0.264    -0.2642         -0.1788       -0.1415
α4      -0.2126   -0.2128            -0.2126   -0.2121         -0.1388       -0.2776
α5       0.4493    0.4485             0.4494    0.4504          0.4871        0.4363
α6      -1.4591   -1.4568            -1.4593   -1.4585         -1.4563       -1.1869
α7       1.4773    1.4847             1.4781    1.4797          2.3239        1.192
α8       0.1057    0.1064             0.1057    0.1058          0.1507       -0.0182
α9      -0.5533   -0.5508            -0.5534   -0.5546         -0.5977       -0.2481
α10     -0.3296   -0.328             -0.3296   -0.3291         -0.3027       -0.3629
MSE^a    23.9      23.9               23.9      23.9            24.9          34.5
R²^b     0.839     0.839              0.839     0.839           0.833         0.768
Time^c   3.33      2.96               1.63      1.06            6.81          0.52

a,b: MSE and R² are calculated on the testing set. The last column presents LASSO using cross-validation.
c: Computation time in minutes. The last column shows the computation time for LASSO using cross-validation.
Figure 4.5 Comparison of computation time for cross-validation LASSO, xtune LASSO using the iterative reweighted-L2 algorithm, and xtune LASSO using L-BFGS. The number of predictors is p = 5000, the sample size n = 500, and the number of variables in the external data q = 10. The median running time is 4 seconds for LASSO, 56 seconds for the iterative reweighted-L2 algorithm, and 35 minutes (2158 seconds) for the L-BFGS algorithm.
Therefore, the MSE and R² are the same for the four algorithms; Nelder-Mead gives a slightly higher MSE. Under this simulation setting, where the external data is informative, we see an improvement in prediction using xtune LASSO compared to standard LASSO. However, standard LASSO using glmnet is the most computationally efficient.
In this simulation setting, we see that L-BFGS is fast for implementing xtune LASSO. However, when p is larger, the iterative reweighted-L2 algorithm is much faster than the L-BFGS algorithm. Figure 4.5 compares the computation time of LASSO, the iterative reweighted-L2 algorithm, and the L-BFGS algorithm for the scenario with n = 500, p = 5000, and q = 10. When n and p are large, the iterative reweighted-L2 algorithm is much faster than the L-BFGS algorithm for implementing xtune LASSO, although both methods are slower than standard cross-validated LASSO implemented with glmnet.
4.4 Discussion
In this chapter, we proposed several different algorithms for fitting the xtune LASSO model and compared their performance. EM algorithms are among the most straightforward and effective algorithms for maximum likelihood estimation, as they consistently increase the likelihood by maximizing a simple surrogate function for the log-likelihood at each step. A faster and more direct approach is to optimize the objective function directly. Computing the gradient function allows us to use gradient-based and second-order optimization algorithms. We have shown that, at convergence, the results are identical to those obtained by the EM algorithm. However, since the objective function is non-convex, the results can depend on the initial values (Murphy, 2012).
The reweighted-L2 algorithm showed conceptual and computational advantages over the other two approaches. By using a "nested" MM procedure, we convert the non-convex optimization problem into a series of easier convex optimization problems that avoid large matrix inversions, which become computationally expensive as the number of predictors grows. In addition, every EM algorithm can be viewed as a particular case of the more general MM procedure, which typically uses convexity rather than missing data in constructing the majorizing (or minorizing) auxiliary function. Properties of the MM procedure have been studied in Hunter and Lange (2004). Building on the MM principle, the reweighted-L2 algorithm is easy to implement and competitive on large data problems. Besides reducing the computational burden, this representation of the marginal likelihood also allows us to explore other upper bounds for both components and extend to more complicated penalty forms. Zhou et al. (2019) proposed an MM algorithm for variance components estimation, which is related to our penalized estimation and could possibly be adapted to our problem in the future.
Chapter 5
Incorporation of external information into Elastic-Net regression and classification
This chapter is a modified version of a manuscript in preparation for submission to Statistics in
Medicine.
5.1 Introduction
In Chapter 3, we presented a method named xtune LASSO that incorporates external information into LASSO regression for linear models. Rather than using a single penalty parameter to control the global amount of shrinkage across all regression coefficients, our model allows each regression coefficient to have its own individual penalty parameter. The penalization is adapted in a data-driven manner, where the vector of penalty parameters is modeled as a function of the external information Z. While demonstrating promising performance for many problems, xtune LASSO still has some drawbacks inherent to the LASSO estimator. The main shortcomings are:
1. the number of predictors selected by the LASSO estimator cannot be larger than the sample size, which is often an unrealistic restriction in practice;
2. when the predictors are highly correlated, the LASSO estimator usually selects only one predictor from the correlated group while ignoring the others (Zou and Hastie, 2005).
Elastic-Net regression is also a penalized least squares method, with a penalty term that combines the LASSO L1 penalty and the ridge L2 penalty. Zou and Hastie (2005) showed that Elastic-Net regression improves on the LASSO estimator, especially in the two scenarios described above.
In this chapter, capitalizing on the idea of Elastic-Net regression, we extend the previous methods to develop an external-information-guided Elastic-Net regression (xtune EN) that allows the joint use of primary data and external information. We introduce the method for linear models first and then extend it to classification problems. We demonstrate, through simulation studies and applications to real data, that the proposed method can improve the prediction performance of penalized regression.
5.2 Methods
We describe the xtune Elastic-Net regression model first and consider the adjustments required for classification later.
5.2.1 Regression
Elastic-Net regression with data-driven penalties
Given a data set with n observations and p predictors, consider a linear model:
\[
Y = X\beta + \epsilon
\]
where Y is the vector of target responses one is interested in predicting based on X, an n×p "design" matrix whose columns comprise the predictors, and where ε represents the error vector. The Elastic-Net framework makes the conventional assumption that the errors are independent and follow a Gaussian distribution with mean zero and a common variance σ².
Elastic-Net penalized regression solves the following problem:
\[
\min_{\beta\in\mathbb{R}^p}\; \|Y - X\beta\|_2^2 + \lambda\big[(1-c)\|\beta\|_2^2/2 + c\|\beta\|_1\big] \tag{5.1}
\]
where
\[
\|\beta\|_1 = \sum_{j=1}^{p}|\beta_j|, \qquad \|\beta\|_2 = \Big(\sum_{j=1}^{p}\beta_j^2\Big)^{1/2}
\]
The Elastic-Net penalty is a mixture of the LASSO penalty (c = 1) and the ridge penalty (c = 0), with the weight of the two penalties controlled by the parameter c, which ranges from 0 to 1. The non-negative penalty parameter λ controls the overall amount of shrinkage. The Elastic-Net estimator β̂_EN is defined as the minimizer of equation (5.1) and is typically computed with the LARS-EN algorithm (Zou and Hastie, 2005) for a given penalty parameter λ.
As in the LASSO case, the tuning parameter λ is typically selected by k-fold cross-validation (Hastie et al., 2009). The weighting parameter c can also be tuned by cross-validation simultaneously with λ (Kohannim et al., 2012; Cho et al., 2010). Many other researchers use preselected c values and then estimate λ by cross-validation (Abram et al., 2016; Shen et al., 2011). In our analysis, we employed the latter approach.
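For reference, the sketch below shows a standard Elastic-Net fit of this kind in R. Note that glmnet's `alpha` argument corresponds to the weighting parameter c used here (not to the hyperparameter vector α of our model); X and Y are assumed to be in memory.

```r
# Sketch: standard Elastic-Net with a preselected weighting parameter and
# lambda chosen by 10-fold cross-validation.
library(glmnet)

cv_fit  <- cv.glmnet(X, Y, alpha = 0.5, nfolds = 10)   # weighting parameter c = 1/2
beta_en <- coef(cv_fit, s = "lambda.min")              # coefficients at the selected lambda
```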
In order to incorporate the external information Z that we have on each feature, we allow the Elastic-Net model to apply a differential amount of shrinkage guided by Z. For this, we generalize the single penalty parameter λ in (5.1) to a vector of penalty parameters λ = (λ₁, …, λ_p), where each λ_j is associated with the corresponding coefficient β_j. The λ vector is, in turn, modeled as a log-linear function of the external information Z:
\[
\begin{aligned}
&\min_{\beta\in\mathbb{R}^p}\; \|Y - X\beta\|_2^2 + \sum_{j=1}^{p}\lambda_j\big[(1-c)\beta_j^2/2 + c|\beta_j|\big] \\
&\log\lambda = Z\alpha
\end{aligned} \tag{5.2}
\]
When there is only a single penalty parameter, λ can be estimated using cross-validation. However, cross-validation cannot make use of the external information Z and is very computationally intensive when there is more than one tuning parameter. On the other hand, the Bayesian framework provides a convenient way to incorporate external information. We therefore propose using a Bayesian formulation of the Elastic-Net model and solving the problem with an empirical Bayes approach.
Bayesian framework
Since ε is independent noise assumed to be mean-zero Gaussian with variance σ², the response variables are conditionally normally distributed, i.e.,
\[
Y \mid X, \beta \sim N(X\beta, \sigma^2 I_n) \tag{5.3}
\]
Li and Lin (2010) pointed out that the marginal conditional posterior mode of β | Y is equivalent to the Elastic-Net estimator β̂_EN under the following prior for β, conditional on σ²:
\[
f(\beta \mid \sigma^2) \propto \exp\Big\{-\frac{\lambda}{2\sigma^2}\big[(1-c)\|\beta\|_2^2/2 + c\|\beta\|_1\big]\Big\} \tag{5.4}
\]
The parameter σ² can be estimated from the data or given a point-mass prior (Li and Lin, 2010). In this paper, we estimate σ² using the method proposed by Reid et al. (2013) and treat it as "set in advance". Park and Casella (2008) and Li and Lin (2010) used a specification that assigns a non-informative prior to σ² and used a Gibbs sampler based on full conditional distributions to obtain regression coefficient estimates. Our method is different in that, rather than extending the model to include Bayesian inference over the hyperparameters, we use an empirical Bayes approach that maximizes the log-likelihood of the hyperparameters. This procedure is similar to the type-II maximum likelihood procedure described in Tipping and Faul (2003) for sparse Bayesian learning and the relevance vector machine.
The prior f(β | σ²) is a compromise between the Gaussian prior of ridge regression and the double exponential prior of LASSO regression. Conditional on a given σ², the Elastic-Net estimator β̂_EN is exactly the marginal posterior mode of β | Y, σ². Based on the discussion above, we have the following hierarchical model to integrate external information into Elastic-Net regression:
\[
\begin{aligned}
Y \mid X, \beta &\sim N(X\beta, \sigma^2 I_n) \\
f(\beta_j \mid \sigma^2) &\propto \exp\Big\{-\frac{\lambda_j}{2\sigma^2}\big[(1-c)\beta_j^2/2 + c|\beta_j|\big]\Big\} \\
\log\lambda &= Z\alpha
\end{aligned} \tag{5.5}
\]
Once α is estimated, so that the tuning parameter vector λ is known, we can estimate β using standard software such as glmnet.
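As an illustration of this last step, the sketch below passes the estimated individual penalties to glmnet via `penalty.factor`. The particular rescaling of `lambda` is an assumption chosen to mimic the per-coefficient penalties in (5.2) under glmnet's internal conventions (the squared error is divided by 2n and penalty factors are rescaled to sum to p); the exact mapping used in practice may differ by constant factors.

```r
# Sketch: final Elastic-Net fit with feature-specific penalties lambda_j = exp(Z_j. alpha_hat).
library(glmnet)

lambda_j <- as.vector(exp(Z %*% alpha_hat))                  # individual penalty parameters
n <- nrow(X); p <- ncol(X)
fit_beta <- glmnet(X, Y, alpha = 0.5,                        # weighting parameter c = 1/2
                   standardize = FALSE,
                   penalty.factor = lambda_j,
                   lambda = sum(lambda_j) / (2 * n * p))
beta_hat <- as.vector(coef(fit_beta))[-1]
```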
Approximating and maximizing the marginal likelihood
Given σ², the likelihood of α is obtained by integrating the random effect β out of the joint distribution of Y and β:
\[
\ell(\alpha) = \int_{\mathbb{R}^p} p(Y \mid X, \beta)\, f(\beta \mid \alpha)\, d\beta \tag{5.6}
\]
The Elastic-Net prior, like the LASSO prior, is not differentiable at 0. When X is not orthogonal, the p-dimensional integral in (5.6) does not have a usable closed-form solution (Foster et al., 2008).
We have developed an efficient xtune LASSO algorithm to optimize α for the LASSO regression case. To obtain a closed-form approximate log-likelihood, a Gaussian prior with the same variance was used to approximate the LASSO prior, i.e., a normal distribution β_j ~ N(0, 2τ_j²) was used to approximate the double exponential distribution β_j ~ Double Exponential(τ_j). The Elastic-Net prior is a compromise between the ridge prior and the LASSO prior, and it is less "sharp" than the LASSO prior (Li and Lin, 2010). Following the same approximation procedure described in Chapter 3, we propose approximating the L1-norm part in (5.6) by an L2 norm with the same variance. Specifically,
\[
\begin{aligned}
f(\beta_j) &= \zeta_{\lambda_j,c}\exp\big\{-\lambda_j(1-c)\beta_j^2/2 - \lambda_j c|\beta_j|\big\} \\
&\approx \zeta'_{\lambda_j,c}\exp\Big\{-\lambda_j(1-c)\beta_j^2/2 - \frac{\beta_j^2}{2\cdot\frac{2}{(\lambda_j c)^2}}\Big\} \\
&= N\Big(0,\; \frac{2}{2\lambda_j(1-c) + (c\lambda_j)^2}\Big)
\end{aligned} \tag{5.7}
\]
Using this approximation, the method described in Chapter 3 is a particular case of the empirical
Bayes elastic net regression when c=1. The approximation equals exactly the Bayesian ridge for
c=0. The approximated hierarchical model is:
\[
\begin{aligned}
Y \mid X, \beta &\sim N(X\beta, I_n\sigma^2) \\
\beta &\sim N\Big(0,\; \frac{2}{2\lambda(1-c) + (c\lambda)^2}\Big) \\
\log\lambda &= Z\alpha.
\end{aligned} \tag{5.8}
\]
The approximate marginal likelihood of α can be obtained by integrating out β. In log form, the approximate log-likelihood is:
\[
-\ell(\alpha) = \log|C_\alpha| + Y^T C_\alpha^{-1} Y \tag{5.9}
\]
where
\[
C_\alpha = \sigma^2 I + XV^{-1}X^T, \qquad
V = \mathrm{diag}(\eta) = \mathrm{diag}\Big(\frac{2\lambda(1-c) + (c\lambda)^2}{2}\Big) \tag{5.10}
\]
5.2.2 Classification
Elastic-Net classification follows the same framework as described for regression. In two-class problems with a binary response variable y ∈ {0, 1}, we generalize the linear model by applying the logistic sigmoid link function
\[
t_n = \sigma(X\beta) = 1/\big(1 + e^{-X\beta}\big)
\]
to model the probability of being in class 1, adopting the Bernoulli distribution for P(Y | X). We write the objective function for penalized logistic regression, using the negative of the Bernoulli log-likelihood, as:
\[
\min_{\beta\in\mathbb{R}^p}\; -\sum_{n=1}^{N}\big[y_n\log t_n + (1 - y_n)\log(1 - t_n)\big] + \lambda\big[(1-c)\|\beta\|_2^2/2 + c\|\beta\|_1\big] \tag{5.11}
\]
Similar to (5.5), the hierarchical model that integrates external information into Elastic-Net logistic regression is:
\[
\begin{aligned}
y \mid X, \beta &\sim \text{Bernoulli}\big(\sigma(X\beta)\big) \\
f(\beta_j) &\propto \exp\big\{-\lambda_j\big[(1-c)\beta_j^2/2 + c|\beta_j|\big]\big\} \\
\log\lambda &= Z\alpha.
\end{aligned} \tag{5.12}
\]
The prior distribution of β is approximated using the normal approximation as in the regression case. In contrast to the regression model, we can no longer integrate analytically over the parameter vector β, and there is no closed-form expression for ℓ(α). We utilize the Laplace approximation procedure described in Tipping (2001).
The idea is to begin by initializing the hyperparameter vector α. For a given value of α, we then use the Laplace approximation to obtain an approximation to the marginal likelihood. The approximate marginal likelihood takes the same form as (5.9) in the regression case, so we can utilize the same optimization algorithm to obtain a re-estimated value of α. The process is repeated until convergence. In more detail, the following procedure is used:
1. For the current value of α, the mode of the posterior distribution p(β | Y, α) over β is found by maximizing
\[
\begin{aligned}
\log p(\beta \mid y, \alpha) &= \log\{p(y \mid \beta)\,p(\beta \mid \alpha)\} - \log p(y \mid \alpha) \\
&= \sum_{n=1}^{N}\big[y_n\log t_n + (1 - y_n)\log(1 - t_n)\big] - \tfrac{1}{2}\beta^T V\beta + \text{const}
\end{aligned} \tag{5.13}
\]
where V = diag((2λ(1−c) + (cλ)²)/2), as in (5.10). This is a standard penalized logistic log-likelihood and can be maximized using a standard second-order Newton method via software such as glmnet (Friedman et al., 2010b).
2. The mean and covariance of the Laplace approximation are obtained by setting the gradient vector of the log posterior distribution (5.13) to zero:
\[
\beta^{*} = \Sigma X^T B\hat{y}, \qquad
\Sigma = \big(X^T BX + V\big)^{-1} \tag{5.14}
\]
and
\[
\hat{y} = X\beta^{*} + B^{-1}(y - t) \tag{5.15}
\]
where B is an N×N diagonal matrix with elements b_n = t_n(1 − t_n). Using the Laplace approximation, we can write the approximate log marginal likelihood in the form:
\[
\log p(y \mid \alpha) = -\tfrac{1}{2}\big\{N\log(2\pi) + \log|C| + \hat{y}^T C^{-1}\hat{y}\big\} \tag{5.16}
\]
where
\[
C = B^{-1} + XV^{-1}X^T.
\]
The approximate log marginal likelihood takes the same form as (5.9), with B ≡ σ⁻²I_n and ŷ ≡ y in the regression case. The Laplace approximation essentially maps the classification model to a linearized variant with a new response variable ŷ and data-dependent noise, with noise precision given by σ_n⁻² = t_n(1 − t_n) (Tipping, 2001). Therefore, we can adopt the same optimization algorithm as in the linear case to maximize (5.16).
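A minimal sketch of this linearization step is shown below; laplace_linearize() is a hypothetical helper that, given the current penalized-logistic mode β* and the prior precision matrix V, returns the working weights B and the pseudo-response ŷ of (5.14)-(5.15).

```r
# Sketch: Laplace linearization mapping the logistic model back to the linear form (5.9).
laplace_linearize <- function(X, y, beta_star, V) {
  t_n   <- 1 / (1 + exp(-as.vector(X %*% beta_star)))   # fitted probabilities
  b_n   <- t_n * (1 - t_n)                              # diagonal elements of B
  Sigma <- solve(t(X) %*% (X * b_n) + V)                # (X^T B X + V)^{-1}, equation (5.14)
  y_hat <- as.vector(X %*% beta_star) + (y - t_n) / b_n # pseudo-response, equation (5.15)
  list(b = b_n, Sigma = Sigma, y_hat = y_hat)
}
```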
5.2.3 Marginal likelihood maximization
The algorithm proposed in this paper essentially follows the same reweighted-L2 procedure described in Chapter 4, with modifications to extend the method to the Elastic-Net regression and classification cases. The proposed marginal likelihood maximization algorithm is as follows:
1. Initialize α.
2. If regression, estimate σ² using the estimator σ̂²_{L,λ̂} defined by Reid et al. (2013).
3. If classification, compute the β* that maximizes (5.13). Given β*, compute B and ŷ using (5.15).
4. Update α by maximizing (5.16), where B ≡ σ⁻²I_n and ŷ ≡ y in the regression case. Specifically, the following iterative reweighted-L2 algorithm is used (greater detail about its derivation was provided in Chapter 4):
(a) Given α, update θ by:
\[
\theta_{\text{new}} \leftarrow \mathrm{diag}\big[X^T C_\alpha^{-1} X\big]
\]
(b) If regression, for the given θ, update α using the following inner loop:
• Initialize α with α_old
• Update δ by optimizing Σ_j η_jδ_j² + (1/σ²)||Y − Xδ||² using glmnet
• Given δ, estimate α by optimizing θᵀ(1/η) + Σ_j η_jδ_j² using standard convex optimization software such as optim
• Iterate the last two steps until convergence
(c) If classification, for the given θ, update α using the following inner loop:
• Let X_B = B^{1/2}X and Y_B = B^{1/2}ŷ; initialize α with α_old
• Update δ by optimizing Σ_j η_jδ_j² + ||Y_B − X_Bδ||² using glmnet
• Given δ, estimate α by optimizing θᵀ(1/η) + Σ_j η_jδ_j² using standard convex optimization software such as optim
• Iterate the last two steps until convergence
5.3 Results
5.3.1 Simulation studies for xtune EN logistic regression
Simulation studies were conducted to show the effectiveness of the proposed xtune Elastic-Net logistic regression modeling strategy. The simulation schema is similar to that of Chapter 3 but with Elastic-Net logistic models instead of LASSO linear models. The performance of xtune Elastic-Net logistic regression was compared with standard Elastic-Net logistic regression in terms of prediction accuracy under various simulation scenarios. For Elastic-Net logistic regression, the glmnet R package was used to fit the model.
In each simulation run we first sampled 200 instances from a p-dimensional multivariate normal distribution with mean vector 0 and covariance matrix Σ_{i,j} = ρ^{|i−j|}. The external metadata matrix Z of dimension p×q has 0/1 entries indicating the grouping of each feature. The entries were independently generated to be zero or one with probability 0.8 and 0.2, respectively. Note that each feature can belong to multiple groups.
The true regression coefficients β were generated by sampling from a double exponential distribution with location parameter 0 and scale parameter e^{Zα}. To achieve sparsity of the underlying regression coefficients, θ percent of the βs were set to 0. The coefficients α are (2, −2, −1, 0, 1, 2). The corresponding response variables y were simulated from a Bernoulli distribution with model-based probabilities. The performance of the two models was assessed by prediction accuracy measured by the area under the ROC curve (AUC), computed on the test data set (n = 400) using models fitted in the training set. One hundred replicates were generated for each scenario.
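For illustration, one replicate of this data-generating scheme could be simulated as in the sketch below (parameter values are assumed for concreteness, and rmvnorm() from the mvtnorm package is used for the correlated predictors).

```r
# Sketch: generating one simulated data set for the xtune EN logistic scenarios.
library(mvtnorm)

n <- 200; p <- 800; q <- 6; rho <- 0.5; sparsity <- 0.9
Sigma <- rho^abs(outer(1:p, 1:p, "-"))                    # covariance rho^|i-j|
X     <- rmvnorm(n, sigma = Sigma)
Z     <- matrix(rbinom(p * q, 1, 0.2), p, q)              # 0/1 grouping metadata
alpha <- c(2, -2, -1, 0, 1, 2)
b_scl <- as.vector(exp(Z %*% alpha))                      # scale parameter exp(Z alpha)
beta  <- rexp(p, rate = 1 / b_scl) * sample(c(-1, 1), p, replace = TRUE)  # double exponential
beta[sample(p, floor(sparsity * p))] <- 0                 # impose sparsity
y     <- rbinom(n, 1, prob = 1 / (1 + exp(-X %*% beta)))  # Bernoulli outcomes
```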
The five scenarios, varying different key parameters, are summarized as follows:
1. Number of predictor features p = 400, 800 and 1200. For each number of predictor features, a fixed setting of ρ = 0.5 and θ = 0.9 is used. Both models are fit with the Elastic-Net penalty parameter c equal to 1/2.
2. Degree of correlation between features ρ = 0.5, 0.7 and 0.9. The number of predictors p and the proportion of zero coefficients θ are fixed at 800 and 0.9, respectively. Both models are fit with the Elastic-Net penalty parameter c equal to 1/2.
3. Proportion of zero regression coefficients θ = 0.1, 0.5 and 0.9. p and ρ are fixed at 800 and 0.5. Both models are fit with the Elastic-Net penalty parameter c equal to 1/2.
4. Elastic-Net penalty parameter c = 0, 1/2 and 1. Data are simulated with p = 800, ρ = 0.5, and θ = 0.9. xtune Elastic-Net logistic regression and standard Elastic-Net logistic regression are fit with the different penalty parameters.
5. Proportion of regression coefficients with wrong external information κ = 10%, 20% and 30%. In each case, κ percent of the rows of Z are "flipped", i.e., entries with value 0 are set to 1 and entries with value 1 are set to 0. p, ρ and θ are fixed at 800, 0.5 and 0.9.
Figure 5.1 summarizes the prediction accuracy across the five scenarios. The boxplots are based on 100 runs, and the numbers shown in the figure are mean AUC values over the 100 runs. Several observations can be made from Figure 5.1. The xtune Elastic-Net logistic regression performs significantly better than the standard Elastic-Net regression when the external information is informative, but it can lose performance when a large proportion of the external information is misleading.
Both methods had higher prediction accuracy when the features are highly correlated. Also, the benefit of borrowing information from external data becomes smaller as the correlation increases.
Figure 5.1 Results for the five scenarios considered in the xtune EN logistic model simulation studies.
Hebiri and Lederer (2013) pointed out that the prediction errors for LASSO regression are mostly smaller in correlated settings in experimental studies. In scenario 1, the xtune Elastic-Net logistic regression performs better than standard Elastic-Net logistic regression when the p/n ratio is high. In scenarios 3 and 4, xtune Elastic-Net performs better than standard Elastic-Net across different levels of sparsity and Elastic-Net penalty parameters. As shown in scenario 5, the performance of xtune Elastic-Net logistic regression decreases when a large proportion of the external information used is wrong.
5.3.2 Simulation studies for xtune EN linear regression
In this section, we illustrate the effectiveness of the xtune EN linear model. For this comparison, we generated data from the linear model. The simulations followed a schema similar to the logistic regression case with some adaptations. The performance of xtune EN regression with weighting parameter c = 1/2 was compared with standard EN regression with c = 1/2 under various simulation scenarios.
As in the logistic regression case, we sampled 200 training instances and 1000 testing instances from a p-dimensional multivariate normal distribution with mean vector 0 and covariance matrix Σ_{i,j} = ρ^{|i−j|}. The external metadata matrix Z of dimension p×q has 0/1 entries indicating the grouping of each feature. The entries were independently generated to be zero or one with probability 0.8 and 0.2, respectively.
The true regression coefficients β were generated by sampling from a double exponential distribution with mean 0 and scale parameter e^{Zα}. To achieve sparsity of the underlying regression coefficients, θ percent of the βs were set to 0. The hyperparameter vector α was of length 10, with entries ranging from −2 to 2, and α₀ was set to 3 to control the overall amount of penalty. Specifically, α = (3.00, −2.00, −1.56, −1.11, −0.67, −0.22, 0.22, 0.67, 1.11, 1.56, 2.00).
For each simulation replicate, an independent error term ε was sampled from a normal distribution with mean 0 and a variance corresponding to the preset signal-to-noise ratio. The outcome was generated as Y = Xβ + ε. Each simulated data set was split into a training set (n = 200) and a test set (n = 1000). The performance of the xtune EN (linear) model was assessed by the prediction R² and mean squared error computed on the test set using the model fitted in the training set. The large test set guarantees that the generalization/test error is accurately estimated. Penalty parameter tuning for the standard EN with c = 1/2, which was fitted for comparison with the proposed xtune EN, was performed by 10-fold cross-validation. One hundred replicates were generated for each scenario.
The five scenarios, varying different key parameters, are summarized as follows:
1. Number of predictor features p = 500, 1000 and 3000. For each number of predictor features, a fixed setting of ρ = 0.6 and θ = 0.9 is used. Both models are fit with the Elastic-Net penalty parameter c equal to 1/2.
2. Signal-to-noise ratio (SNR) = 1, 3, 5, where SNR is defined as Var(Xβ)/σ². Data are simulated with p = 1000, ρ = 0.6, and θ = 0.9. Both models are fit with the Elastic-Net penalty parameter c equal to 1/2.
3. Degree of correlation between features ρ = 0.3, 0.6 and 0.9. The number of predictors p, the proportion of zero coefficients θ, and the SNR are fixed at 1000, 0.9 and 3, respectively. Both models are fit with the Elastic-Net penalty parameter c equal to 1/2.
4. Proportion of zero regression coefficients θ = 0.1, 0.5 and 0.9. p and ρ are fixed at 800 and 0.5. Both models are fit with the Elastic-Net penalty parameter c equal to 1/2.
5. Proportion of regression coefficients with wrong external information κ = 10%, 20% and 30%. In each case, κ percent of the rows of Z are "flipped", i.e., entries with value 0 are set to 1 and entries with value 1 are set to 0. p, SNR, ρ and θ are fixed at 1000, 3, 0.5 and 0.9.
Figure 5.2 (a)-(e) compare the performance of the linear xtune EN model to standard linear EN
model as we vary the number of predictors, the SNR, the correlation between predictors, the sparsity
of regression coecients, and the informativeness of external information. We obtain trends in
performance that are consistent with those found for the comparison of xtune LASSO vs. standard
LASSO in Chapter 3. The gain in predictive performance increases as the number of predictor
variables increases when the external information is informative. When the external information is
comprised of a high proportion of misleading information, there is some loss of performance if we
incorporate the external data. Both methods have a higher prediction ability when the features are
more correlated.
(a) Comparison of standard EN vs. xtune EN varying the number of predictors p = 500, 1000, 3000. Sample size n = 200, SNR = 3, q = 10, α₀ = 3, δ = 0.5, ρ = 0. Evaluation measures are MSE, R², and relative MSE, moving left to right.
(b) Comparison of standard EN vs. xtune EN varying the first-level data SNR = 1, 3, 5. Sample size n = 200, number of features p = 1000, q = 10, α₀ = 3, δ = 0.5, ρ = 0. Evaluation measures are MSE, R², and relative MSE, moving left to right.
(c) Comparison of standard EN vs. xtune EN varying the correlation ρ = 0.3, 0.6, 0.9. The indices of the non-zero (signal) regression coefficients are selected randomly. Sample size n = 200, p = 1000, SNR = 3, q = 10, δ = 0.5. Evaluation measures are MSE, R², and relative MSE, moving left to right.
(d) Comparison of standard EN vs. xtune EN varying the regression coefficient sparsity δ = 0.3, 0.5, 0.7. Sample size n = 200, p = 1000, SNR = 3, q = 10, α₀ = 3, ρ = 0. Evaluation measures are MSE, R², and relative MSE, moving left to right.
(e) Comparison of standard EN vs. xtune EN varying the percentage of wrong external information. Sample size n = 200, p = 1000, SNR = 3, q = 10, δ = 0.5, ρ = 0. Evaluation measures are MSE, R², and relative MSE, moving left to right.
Figure 5.2 Results for the five scenarios considered in the xtune EN linear model simulation studies.
5.3.3 The prostate cancer example
We illustrate the proposed xtune Elastic-Net logistic method through two real-world examples: prostate and breast cancer classification based on gene expression values. The prostate cancer data set consists of 187 samples and 29,336 gene expression values. One hundred fifty-four patients had "no evidence of disease" (NED) after radical prostatectomy, and 33 patients had a clinical recurrence (CR) after radical prostatectomy. We consider Elastic-Net logistic regression modeling to classify the patients into these two clinical responses (i.e., NED or CR).
Many previous studies have examined gene expression profiles associated with prostate cancer. Those studies revealed a large variety of candidate genes associated with prostate cancer risk, recurrence, or survival, with only a few genes consistent across different studies. In this example, we searched PubMed for publications and studies that identified genes associated with prostate cancer and used the results as the external information. Specifically, the external information matrix consists of three variables: column one codes genes identified by one previous study as 1 and the rest as 0, column two codes genes identified by two previous studies as 1 and the rest as 0, and column three codes genes identified by three or more previous studies as 1 and the rest as 0. Publications used to construct the external information matrix include: (Takata et al., 2010; Long et al., 2012; Fraser et al., 2003; Roscilli et al., 2014; Kazma et al., 2012; Luo et al., 2014; Su et al., 2014; Antunes et al., 2013; Loeb and Partin, 2008; Wei et al., 2015; Merola et al., 2015; Tennakoon et al., 2014; Kaminski et al., 2019; Yu et al., 2007; Pettigrew et al., 2012; Georgescu et al., 2019; Yang et al., 2018; Pudova et al., 2019; Zhang et al., 2019; Comes et al., 2014; Vaarala et al., 2000; Pletscher-Frankild et al., 2014; Mulin Jun et al., 2011; Rouillard et al., 2016a). Details about the grouping of genes are provided in Appendix 5.A.
In order to evaluate the proposed xtune Elastic-Net logistic regression, we compared the classification results based on the AUC score. The performance was evaluated in a cross-validation scheme. Stratified on recurrence status, the data were randomly split into a training set consisting of 70% of the observations and a test set consisting of the remaining 30%. We fitted the ridge logistic model, xtune ridge logistic model, Elastic-Net logistic model with c = 1/2, xtune Elastic-Net logistic model with c = 1/2, LASSO logistic model, and xtune LASSO logistic model in the training data and evaluated their prediction performance in the testing data. One hundred random splits were replicated.
Figure 5.3 Left: Comparison of prediction AUC for xtune LASSO vs. standard LASSO, xtune EN vs. standard EN, and xtune ridge vs. standard ridge across 100 splits of the prostate cancer dataset. Middle: Penalty parameter(s) estimated by standard EN and xtune EN across the 100 splits in the prostate cancer data example. Gene group 1 corresponds to genes that have not been identified by any paper, gene group 2 to genes identified in one paper, gene group 3 to genes identified in two papers, and gene group 4 to genes identified by three or more papers. Right: Number of selected features by standard EN and xtune EN (both with weighting parameter 1/2) across the 100 splits in the prostate cancer example.
The left panel of Figure 5.3 shows the AUC scores across the 100 splits. Borrowing information improved the prediction accuracy over standard LASSO, ridge, and Elastic-Net regression. The penalty applied to the genes identified by previous studies is smaller than the penalty applied to the rest of the genes. This suggests that our modeling strategy is a useful tool for classification in real-world applications. There is currently much interest in finding the crucial genes for cancer classification, and we expect that our methodology can help in these research areas by incorporating previous study results.
5.3.4 The breast cancer example
In the second example, our aim is to predict breast cancer five-year survival (survived or died) using patients' clinical information and gene expression data. The breast cancer data consist of 29,476 gene expression features and two clinical variables (age at diagnosis and lymph node status). The data contain a discovery set of 594 observations and an additional validation set of 564 observations. The data originate from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) cohort (Curtis et al., 2012) (https://ega-archive.org/dacs/EGAC00001000484), where more information can be found on the clinical details and on the preprocessing of the gene expression data. As before, our aim is to improve the prediction of the base standard Elastic-Net model.
Figure 5.4 ROC curves for standard LASSO, xtune LASSO, standard EN, xtune EN, standard ridge, and xtune ridge applied to the breast cancer dataset.
The external information data consist of previous study results from Cheng et al. (2013b), where groups of genes referred to as "metagenes" were identified as prognostic of breast cancer. Specifically, the three metagenes are associated, respectively, with mitotic chromosomal instability, mesenchymal transition, and lymphocyte-based immune recruitment. Each metagene group is modeled as a binary variable in the external information data. An additional binary variable is added to the external information data to encode whether a variable is a clinical variable or a gene expression variable (see Chapter 3.3.2 for a more detailed description of the data).
Both the xtune Elastic-Net logistic (ENL) and standard ENL models were fit in the discovery set and evaluated using the validation set. Figure 5.4 shows the receiver operating characteristic (ROC) curves with corresponding AUC scores on the validation set.
Table 5.1 Comparison of standard logistic LASSO, standard logistic EN, and standard logistic ridge vs. xtune logistic LASSO, xtune logistic EN, and xtune logistic ridge in terms of AUC and number of selected covariates in application to the breast cancer data.

                            AUC     No. of selected covariates
standard LASSO              0.711   130
xtune LASSO                 0.746   6
standard EN with c = 1/2    0.698   217
xtune EN with c = 1/2       0.747   11
standard ridge              0.669   29479
xtune ridge                 0.760   29479

Table 5.1 compares the base penalized regression models with the external-information-guided penalized regression models. The two clinical variables were given a very small penalty in the xtune models and were estimated to have relatively large effect sizes, which seems reasonable as clinical features are usually considered more informative. In addition, the xtune models shrunk the coefficients of expression features that do not belong to any metagene (the vast majority) toward zero much more aggressively than the standard LASSO, while they shrunk the clinical variables and the expression features in metagenes 2 and 3 much less than the base models. The xtune LASSO and xtune EN also yielded more parsimonious models (6 and 11 selected features, respectively), whereas the standard LASSO selected 130 features and the standard EN selected 217 features.
5.4 Discussion
By using “data-driven” penalization, our proposed method incorporates information from external
data into penalized regression. We showed through simulation and real data examples that the
prediction performance of xtune EN is often better than that of the standard EN when the external
covariates are informative for coecients’ eect sizes.
The type of external information can take dierent forms. In the data example, grouping of
features based on previous study results was used. However, the range of external information the
106
proposed method can take is very general and the external covariates can be continuous, categorical,
or a mixture of both. Other possible forms include structural knowledge (omics type) or continuous
external variables (e.g. external p-values, evidence scores).
The proposed method fits the dichotomous classification problem by using a logit link function.
One could generalize the classification algorithm to the polychotomous case where the number of
classes K is greater than two. For that, we can use the multinomial likelihood form:
P(Y|— )=
N
Ÿ
n=1
K
Ÿ
k=1
‡ {X— k
}
Y
nk
(5.17)
where the multi-class response Y is coded using the conventional “one-of-K” coding and each has its
own coecient vector — k
and associated hyperparameters. However, the number of hyperparameters
scales with K, which can be disadvantageous from a computational perspective.
Most penalized estimators correspond to the mode of a Bayesian posterior distribution (Bhat-
tacharya et al., 2014). Using this connection, the empirical Bayes estimation employed in this paper
estimates the individual variances of the first-level coecients by maximizing the marginal likelihood.
This is also known as Type II maximum likelihood estimation, which has been employed for fitting
the relevance vector machine (RVM) model, a popular technique in the pattern recognition machine
learning community. Instead of relying on cross-validation, the “Bayesian” set-up has distinct
advantages in terms of the penalty parameter choice by marginalizing them over the posterior
distribution (Bhattacharya et al., 2014). Bayesian penalization methods ((Park and Casella, 2008),
(Li and Lin, 2010)) also have employed this connection and use hierarchical model structures to
select the penalty parameter based on sample strategies. However, these methods do not incorporate
information from external covariates and are computationally challenging in high dimensions. Not
addressed in our current method is the probabilistic measure of the uncertainty of the regression
coecients. To provide inference on the parameters, alternatives to our empirical Bayes approach
may be considered in the future, for example, based on a fully Bayesian approach. In this paper,
we used a log-linear function to model the relationship between dierential penalties and external
information with no variability. A generalized linear model (GLM) with a log-link function, for
example, could be considered in future research.
107
The gain in prediction performance of the proposed model depends on the relevance of the external information that is used to guide the penalization. The external information is relevant or “informative” if it can help differentiate groups of predictors with different effect sizes or correlates with the importance of the predictors. Therefore, expert knowledge of the study domain is often crucial. A question that we have not addressed in this paper is how to decide whether the inclusion of an external covariate would be helpful before it is included in the model. In practice, the use of xtune penalized regression is more demanding than the use of standard penalized regression methods, as it requires more time and thought about the extraction, processing, and interpretation of the external information. On the other hand, the process of finding and understanding external information about the predictors can help investigators better understand the predictors, rather than fitting the model as a “black box”.
In terms of computational efficiency, xtune EN models require significantly greater training time than standard penalized regression with no external information fitted by glmnet. Our implementation and experiments were run using “prototype”-level R code, so some improvement in run time is to be expected from a more optimized implementation. In principle, the xtune penalized model is designed for any number of external variables. In practice, however, since the hyperparameters are estimated by maximum likelihood, the number of external variables that can reasonably be included is related to the sample size and the number of predictors. For example, if the data set has n = 1,000 observations and p = 50,000 variables, the algorithm is manageable for up to 50 external variables.
Implementation of the proposed method can be found in the R package xtune, available for download through CRAN. It provides all the functions needed to perform external-information-guided penalization (LASSO, ridge and Elastic-Net) for both linear and logistic models.
Appendix 5.A
5.A.1 Supplementary tables
Table 5.2 External information used for the prostate cancer example
Genes identified by one paper
ABCB5 ABCC11 ABCC5 ABLIM2 ACER1
ADAM10 ADAM23 ADAMTS9 ADAMTSL3 ADGRF2
ADRB2 AGPAT9 AIG1 AKAP6 AKNAD1
ALK ALPK3 AMIGO2 ANK1 ANKFN1
ANKRD44 ANKS1B ANO2 AOX2P APBB3
APP AR ARFGEF1 ARHGAP10 ARHGAP15
ARHGAP8 ARMC8 ASB14 ASCL2 ASPA
ASRGL1 ASS1 ASTN2 ATF3 ATF7IP
ATG14 ATP1B2 ATP6V1E1 ATXN1 AUTS2
BICC1 BLOC1S5-TXNDC5 BMPR1B BOLL BRE
BTNL8 C16ORF78 C20ORF196 C4ORF22 C4ORF45
C6ORF7 C6ORF89 C9ORF3 CACNA2D3 CADM1
CADM2 CALCR CAMK2N1 CAP2 CCDC174
CCDC67 CCHCR1 CD200R1L CD38 CDH12
CDH13 CDKAL1 CELA2A CELSR1 CEP112
CFAP61 CHAT CHD9 CHL1 CHMP2B
CHST11 CLIC5 CLOCK CLPB CLVS2
CNNM3 CNTN4 CNTN5 CNTN6 CNTNAP2
COL13A1 COL6A1 COQ10B CPS1 CPVL
CS CSMD1 CSPP1 CTDSPL CTIF
CTNNA2 CTNNBL1 CUL5 CYP2C8 CYR61
DAB1 DAB2IP DACH2 DCAF4 DCLK1
DFNB31 DGKB DLGAP1 DMC1 DNAH12
DOCK1 DOPEY1 DSCAM DUOX1 DYSF
E2F5 EGFLAM EGFR EGR1 EIF3E
ELAVL2 ELMO1 ELOVL6 ENAM EPHA10
ERBB4 ERC2 ERICH4 ESRRG EYS
FADS1 FADS2 FAM111B FAM163A FAM189A1
FAM207A FAM227B FAM71F1 FAM9A FARP1
FAT4 FBLN1 FCHSD1 FGF12 FGF5
FGF6 FHIT FLJ40194 FLJ45872 FLVCR2
FMNL2 FOLH1 FOSL2 FRMD4A FRMD5
FRMD6 FRY FSHR FSTL4 FSTL5
FUT10 GALNT1 GALNT13 GALNT14 GALNTL6
GARNL3 GAS6 GATA2 GBA3 GBP3
GDF3 GIMAP8 GLB1 GLI2 GLIS3
GNA12 GOLM1 GPN1 GRID2 GRIN2B
GRIN3A GRM7 GSPT2 GTPBP1 HAPLN1
HARS HERC2 HLA-F-AS1 HMCN2 HNRNPA1P4
HS3ST4 HS6ST3 HTT ICAM5 IK
IL16 IL18 IL1RAPL1 INSC IQCJ
IRAK3 IRF2BP2 IRX4 ITGA9 ITGB2
ITGB4 ITGB6 KCNB1 KCNJ3 KCNMA1
KIAA0825 KIAA1217 KIAA1456 KIAA1804 KIF13B
KIF16B KIF1A KLF10 KLF5 KLF6
KLHL32 KLK2 KLKP1 KMO KRT8
LACC1 LAMA5 LAMB2 LDB2 LHPP
LHX2 LILRA3 LIMCH1 LINC00540 LINC00578
LMO7 LOC145845 LOC339166 LOC339862 LOC63930
LOC728755 LPA LRCH1 LRP1B LRP5L
LRRCC1 LUZP2 LYN MACROD1 MACROD2
MAD1L1 MAGI1 MALRD1 MAN2B2 MAP2
MAPK4 MAST4 MBD5 MDGA2 MED12L
MEMO1 MFGE8 MGAT4A MIA3 MICAL3
MINOS1-NBL1 MLLT3 MMP7 MOB3B MRO
MYBPC1 MYEOV MYRF NAALADL2 NBEA
NCOR2 NFIA NGEF NKAIN3 NKX3-1
NLGN1 NOX4 NPHS2 NRD1 NRG3
NRXN3 NSF NTM NTNG1 NUDT11
NUP160 NUP210 NXPH1 OAS1 OPRD1
OSBPL1A OTX1 PADI3 PALLD PARD3B
PAX2 PCDH15 PCDH9 PCDHA1 PCDHA2
PCDHA4 PCDHA6 PCDHA7 PCDHGA1 PCDHGA2
PDE11A PDE4D PDZRN3 PELP1 PEX5L
PHACTR1 PIP PKHD1 PKIA PLA2G16
PLA2G4C PLEKHA7 PLEKHH2 POU5F1B POU6F1
PPP1R14A PPP2R2B PRKCB PRKG1 PRR16
PTGFRN PTK6 PTPRD PTPRG PTPRJ
PTPRM PXDNL QSOX1 R3HDM1 RAB30
RAB3C RAD23B RAD51B RALYL RBBP8NL
RBFOX1 RBFOX3 RBM19 RBMS3 REEP3
RELN RERE RFTN2 RGS11 RMDN2
RNF103-CHMP3 RNF141 ROBO4 RPS6KA2 RPSAP52
RPTOR RYR1 RYR2 SAMD3 SBF2
SCHIP1 SCN3A SCN9A SCOC SCUBE1
SDHAF3 SDK1 SEMA3E SERINC5 SERPINB11
SETBP1 SEZ6L SF3B1 SFMBT2 SFXN1
SGCD SGK1 SH2D4B SHBG SLC22A1
SLC2A14 SLC2A3 SLC45A3 SLC51A SLC6A14
SLC8A1 SLC8B1 SLC9A9 SLC9C2 SLIT1
SLX4IP SMAD6 SMAGP SMYD3 SNAP25
SNCA SND1 SORCS1 SOX5 SPON1
SPTBN1 SRA1 SRD5A3 SRGAP1 ST6GALNAC1
STYK1 SUN2 SUPT3H SYK TACC2
TBC1D22A TBC1D4 TBRG4 TBX3 TEAD1
TET2 TFEC TIPARP TLL1 TLR1
TLR6 TMEFF2 TMEM132D TMEM165 TMEM167A
TMEM241 TMEM258 TNFSF10 TOP1 TOX
TPK1 TRAPPC9 TTC4 TTC7A TUBA1C
UBE2S UBE2W UMODL1 UNC13C USP45
UST VAMP8 VGLL3 VPS53 VTI1A
VWDE VWF WDR11 WWOX ZFAND4
ZFAT ZHX3 ZNF385B ZNF385D ZNF827
ZNRF1 ZRANB1
Genes identified by two papers
ABLIM1 ACOXL AFM ANK2 ANXA13
ARMC2 B4GALNT4 BAIAP2L1 BIK C2orf43
CCDC178 CEP152 CLDN11 CLPTM1L CTBP2
CTNND2 EBF2 EEFSEC EHBP1 EVC
FAM124A FARP2 FERMT2 FGF10 FYCO1
GPRC6A GRHL1 HNF1B ITGA6 JAZF1
KCNA3 KCNN3 KIAA1211 KIF13A KLK15
KLK3 LAMA2 LDAH LMTK2 MAGED1
MDM4 MLPH MSMB NOTCH4 OAS2
PDLIM5 PLCL1 RGS17 RNASE9 SHROOM2
SIDT1 SLC22A3 SPATA16 TERT THADA
TMTC2 TNC TNRC6B TRIM8 TSEN2
UPF2 ZBTB38 ZGPAT ZNF280B ZNF652
Genes identified by three papers or more
FOXP4 MMP11 PCA3 PGC RFX6
Table 5.3 Features selected by standard LASSO, xtune LASSO, standard Elastic-Net and xtune Elastic-Net in the breast cancer data example
Features selected by standard logistic LASSO
lymph node status RNF215 SNCG HIST1H1B WWC2 ABO
LOC221442 P53TG1 C11orf74 MX1 PTP4A3 ADCK2
C20orf12 ZNF283 SERPINE3 P2RX2 METAP2 CLIC4
F12 CYP3A43 SIRPG PRR7 MTERFD3 IL13RA2
ZNF432 C21orf84 GPR50 RPP25 KIF20A PLD2
NLRX1 CD244 CYP26C1 PSKH2 NNAT ARHGEF5
ABHD8 RGS12 KBTBD12 NGF CLEC4F PBRM1
DCUN1D5 KRAS PA2G4 PPP6R3 THSD1 REEP1
MATN4 ADCY1 ACSS2 TSPAN16 SLC6A1 GOLGA6L6
BECN1 TBX4 PAEP C17orf79 NKX6-1 DHRS11
C19orf56 GABPB1 BCAT1 PRDX2 PTP4A3 KLF5
COQ5 FGD3 SOX11 ANGPT2 SYCP2L ZNF707
EPHA3 KCNE3 WDR60 WDR4 KIAA0430 KRTAP10-9
MGAT4B CD55 VPS13A PPFIBP1 SLC39A2 TMEM179
CCDC27 MTCP1NB AW084149 LOC254099 BX090143 AI184572
NUP210L EPG5 AW594636 AW451392 BX096685 AI681512
AI205778 CD691743 LOC729175 CA398130 AI928453 AI769920
AW297905 AA393387 BE148564 HGC6.3 DB298273 BX089019
BU737861 AA903082 BM980906 TMEM116 C6orf126 BTC
VIT BLNK PI4K2B CYR61 MPND C17orf61
WDR16 PEG10 NQO1 CPSF4 PTP4A3 PIGA
AGL LIAS ARID1B
Features selected by xtune logistic LASSO
age PR status lymph node status KIF20A RAD51 CENPA
Features selected by standard EN with c=1/2
lymph node status RNF215 SNCG HIST1H1B WWC2 ABR
ZGLP1 ABO LOC221442 TM4SF19 P53TG1 C11orf74
SHMT2 MX1 PTP4A3 ADCK2 AK096593 C20orf12
AK128447 ZNF283 SERPINE3 P2RX2 METAP2 CLIC4
F12 CYP3A43 NBPF1 LOC401233 UBE2R2 ZNF460
INSIG2 SIRPG PRR7 MTERFD3 OR2L2 NDUFAF2
BAX CDCA5 STAT5B OR6K2 C9orf102 PLXNB3
POLDIP3 IL13RA2 ZNF432 C21orf84 GPR50 RPP25
KIF20A PLD2 NLRX1 LOC729264 CD244 CYP26C1
PSKH2 ZNF454 ABI2 NNAT ARHGEF5 PLEKHM1
ABHD8 RGS12 KBTBD12 ZNF274 VGLL3 NGF
C2orf53 ZCCHC9 CLEC4F PBRM1 DCUN1D5 KRAS
SCAMP1 PA2G4 GDPD1 CHRNA4 BTF3L4 PPP6R3
THSD1 REEP1 RUNDC1 RPS2 MATN4 SACS
TBXA2R ADCY1 ACSS2 TSPAN16 SLC6A1 GOLGA6L6
BECN1 TBX4 FJX1 PAEP ACO1 C17orf79
DCAF16 COASY SLC25A18 NKX6-1 DHRS11 C19orf56
ODF3L1 GABPB1 ELAVL1 PDCD10 BCAT1 PRDX2
MMP12 PTP4A3 KLF5 AGRN COQ5 FGD3
SOX11 ANGPT2 SYCP2L ZNF707 EPHA3 SERPINB4
TMEM105 FAM102B KIAA2015 KCNE3 TAC1 TBX2
WDR60 WDR4 KIAA0430 AK125462 ZSCAN5D KRTAP10-9
MGAT4B GATA1 CD55 DEPDC5 VPS13A PPM1N
PPFIBP1 SLC39A2 TMEM179 TDRD9 CCDC27 GDA
MTCP1NB SPC25 AW084149 LOC254099 BX090143 BG941175
LOC100133461 BF675965 AI184572 NUP210L EPG5 AW594636
AK123174 AW451392 BX106045 BX096685 AI681512 AI205778
AI671241 AW593145 CD691743 LOC729175 BX109079 W36271
AA400223 CA398130 AI928453 AI769920 AW297905 AA393387
BX104313 BE148564 HGC6.3 DB298273 BX089019 BU737861
AA903082 BM980906 TMEM116 C6orf126 POU5F1B SLC18A2
BTC VIT CNOT6 BLNK PI4K2B HCRT
CYR61 MPND C17orf61 HHIP WDR16 SLC5A12
PEG10 TMUB2 EYA4 BIRC7 TAGLN3 NLGN4X
ZNF384 NQO1 CPSF4 PTP4A3 ARID4B PIGA
AGL LIAS CABP2 VPS13A ARID1B
Features selected by xtune logistic EN with c=1/2
age PR status lymph node status CDCA5 KIF20A TROAP
ANLN PTTG1 FAM83D RAD51 CENPA
Chapter 6
xtune: An R package for
incorporating external data into
penalized regression
This chapter is a modified version of the vignette of the R package xtune, available at https://cran.r-project.org/web/packages/xtune/index.html
6.1 Overview
6.1.1 Motivation
xtune is a package that extends standard penalized regression (LASSO, ridge, and Elastic-Net) to allow differential shrinkage based on external information about the features, in order to achieve better prediction accuracy. External information on the predictors is often available to investigators. Combining information from the primary data with information from an external source can improve accuracy in estimation and prediction.
The xtune package solves the problem of building penalized regression models based on the “primary” data while incorporating summary information from an external source. It has the following objective function:
$$\min_{\beta \in \mathbb{R}^p} \; \ell(Y \mid X, \beta) + \sum_{j=1}^{p} \lambda_j \left[ (1-c)\,\beta_j^2/2 + c\,|\beta_j| \right], \qquad \log \lambda = Z\alpha$$

For linear models, $\ell(Y \mid X, \beta) = \lVert Y - X\beta \rVert_2^2$; for logistic models, $\ell(Y \mid X, \beta) = -\sum_{n=1}^{N} \left[ y_n \log t_n + (1-y_n)\log(1-t_n) \right]$, where t_n is the fitted probability for observation n.
The Elastic-Net penalty is a mixture of the LASSO penalty (c = 1) and the ridge penalty (c = 0), and the weight of those two penalties is controlled by the parameter c ranging from 0 to 1. The tuning parameter λ_j controls the strength of the penalty for each regression coefficient and is, in turn, modeled as a log-linear function of the external information. This is different from standard penalized regression, where a single penalty parameter λ is applied equally to all regression coefficients to control the amount of regularization in the model. Ideally, we want to give a small penalty to important features and a large penalty to unimportant features. We guide the penalized regression model with external data Z that are potentially informative for the importance/effect size of the coefficients.
Note that the Greek letter α is commonly used to represent the mixture parameter; here we use c instead, since α is used to represent the hyperparameter vector linking the penalty parameters and the external information. The current implementation requires the user to pre-specify the value of c.
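To make the log-linear link between the external data and the penalties concrete, the toy R sketch below maps a hypothetical Z and a made-up second-level coefficient vector alpha to a penalty vector; in practice alpha is estimated from the data by xtune rather than chosen by the user.

# Toy illustration of the log-linear model log(lambda) = Z %*% alpha.
# Z is a hypothetical 6 x 3 matrix of non-overlapping group indicators and
# alpha is a made-up vector used only for illustration.
Z <- cbind(group1 = c(1, 1, 0, 0, 0, 0),
           group2 = c(0, 0, 1, 1, 0, 0),
           group3 = c(0, 0, 0, 0, 1, 1))
alpha <- c(-2, 0, 2)

lambda <- exp(drop(Z %*% alpha))   # one penalty parameter per regression coefficient
round(lambda, 3)
# Features in group1 (alpha = -2) receive a small penalty and are shrunk little;
# features in group3 (alpha = 2) receive a large penalty and are shrunk strongly.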
6.1.2 Data structure examples
The external information Z is of dimension p × q, where p is the number of features and q is the number of variables in Z. Examples of Z include:
• Grouping of predictors
• Prior knowledge of biological importance
• External p-values
• Functional annotations
• ...
Two data structure examples are given below. In the first example, the goal is to predict a person’s weight loss (yes or no) using his/her weekly dietary intake. Our external information Z could incorporate information about the levels of relevant food constituents of the dietary items. Table 6.1 illustrates the primary data X and Y and Table 6.2 illustrates the corresponding external information Z, which contains the nutrition facts about each dietary item.

Table 6.1 Data structure example 1: structure of primary data X and Y. Each row consists of a set of observations on a person’s weekly dietary intake. The outcome is a binary variable indicating the result of weight loss.

                 Milk (8 fl oz)  Egg (1)  Celery (4-inch)  Cookies (2)  ...  Hot dogs (1)  Outcome (weight loss)
Observation 1    2               3        1.5              1            ...  4.5           No
Observation 2    1.5             2        1                1.2          ...  3             Yes
...
Observation n    2               1        2                12           ...  3             No

Table 6.2 Data structure example 1: structure of external information data Z. Each row in Z corresponds to a column/feature in X. Columns in Z are extra information about the features in X.

Diet Items (Units/Week)   Cholesterol  Calories  Protein  ...  Carbohydrates
Milk (8 fl oz)            33           150       8.03     ...  12.0
Egg (1)                   212          105       1.17     ...  26.7
Celery (4-inch stick)     3            6.67      0.30     ...  1.46
Cookies (2)               6            121       0        ...  0
...
Hot dogs (1)              40           144       5.08     ...  1.15

In the second example, consider predicting human height from gene expression profiles. Gene function annotation potentially holds valuable information for differentiating the importance of genes and thus can be used as external information. The data structure is shown in Table 6.3 and Table 6.4.

Table 6.3 Data structure example 2: structure of primary data X and Y. Features in the primary data are gene expression profiles and the outcome is human height.

                 Gene 1   Gene 2   Gene 3   Gene 4   ...  Gene p   Height (cm)
Observation 1    6.349    9.326    3.125    2.713    ...  4.325    162
Observation 2    5.021    7.342    4.973    3.513    ...  4.453    173
...
Observation n    3.434    10.234   8.473    2.452    ...  4.628    169

Table 6.4 Data structure example 2: structure of external information data Z. Gene function annotations are used as external information.

          Gene Function 1   Gene Function 2   ...   Gene Function q
Gene 1    1                 0                 ...   0
Gene 2    0                 1                 ...   0
Gene 3    0                 0                 ...   0
Gene 4    0                 0                 ...   0
...
Gene p    0                 1                 ...   0
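As a small illustration of how a binary annotation matrix like the Z in Table 6.4 might be assembled, the R sketch below builds one from a hypothetical list of gene-set memberships; all gene and function names are made up.

# Illustrative only: build a p x q binary Z from hypothetical annotations.
genes <- c("GeneA", "GeneB", "GeneC", "GeneD")        # columns of the primary X
annotations <- list(Function1 = c("GeneA", "GeneC"),  # made-up gene sets
                    Function2 = c("GeneB"))

# Z[j, k] = 1 if gene j is annotated with function k, and 0 otherwise.
Z <- sapply(annotations, function(set) as.integer(genes %in% set))
rownames(Z) <- genes
Z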
6.1.3 Tuning multiple penalty parameters
Penalized regression fitting consists of two phases:
1. learning the tuning parameter(s)
2. estimating the regression coefficients given the tuning parameter(s).
Phase 1 is the key to achieving good performance. Cross-validation is widely used to tune a single penalty parameter, but it is computationally infeasible to tune more than three penalty parameters. xtune employs an empirical Bayes approach to learn the multiple tuning parameters. The individual penalties are interpreted as variance terms of the priors in a random effect formulation of penalized regression. A majorization-minimization algorithm is employed for the implementation. Once the tuning parameters λ are estimated, and therefore the penalties are known, phase 2 (estimating the regression coefficients) is carried out using glmnet. Utilities for post-fitting summary and prediction are also provided.
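To make phase 2 concrete, the sketch below shows one way an estimated penalty vector could be handed to glmnet through its penalty.factor argument; the data and the vector lambda_hat are simulated stand-ins, and this is a simplified illustration of the idea rather than the exact internal code of xtune.

library(glmnet)

# Simulated stand-in data (illustration only): in the xtune workflow X and Y are
# the primary data and lambda_hat comes from the phase-1 empirical Bayes step.
set.seed(1)
n <- 50; p <- 10
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1] - X[, 2] + rnorm(n)
lambda_hat <- exp(rnorm(p))        # placeholder for the phase-1 estimates

# Phase 2: fit the coefficients at the estimated per-feature penalties.
# glmnet rescales penalty.factor to sum to p, so supplying lambda_hat as the
# penalty factors with a global lambda equal to mean(lambda_hat) recovers the
# intended per-feature penalties (up to glmnet's overall objective scaling).
fit <- glmnet(X, Y,
              alpha = 1,                   # LASSO penalty; use alpha = 0 for ridge
              penalty.factor = lambda_hat,
              lambda = mean(lambda_hat))
coef(fit)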
Key elements of the xtune package include:
1. xtune() tunes differential shrinkage parameters in penalized regression based on external information.
xtune(X, Y, Z = NULL, family = c("linear", "binary"), c = 1, sigma.square = NULL, message = TRUE, control = list())
Two main usages:
• The basic usage is to choose the tuning parameter in LASSO and ridge regression using an empirical Bayes approach, as an alternative to the widely used cross-validation. This is done by calling xtune without specifying the external information matrix Z.
• More importantly, if external information Z about the predictors X is provided, xtune can allow differential shrinkage parameters for the regression coefficients in penalized regression models. The idea is that Z might be informative for the effect sizes of the regression coefficients, so we can guide the penalized regression model using Z.
2. predict.xtune() provides predicted values for a linear response and predicted class probabilities for a binary response.
predict(object, newX, type = c("response", "class"), X = NULL, Y = NULL, ...)
3. coef.xtune() extracts model coefficients from objects returned by xtune().
coef(object, ...)
4. auc(), mse(), and misclassification() evaluate the performance of fitted models.
5. Three example datasets: diet, example and gene.
6.1.4 Get started
As with many other R packages, the simplest way to obtain xtune is to install it directly from CRAN. Type the following command in the R console:
# install the package
install.packages("xtune")
library(xtune)
6.2 Usage examples
We use three simulated examples to illustrate the usage and syntax of xtune. The first example gives users a general sense of the data structure and model fitting process. The second and third examples use simulated data in concrete scenarios. The second example data set, diet, contains simulated data representing the dietary example described in Witte et al. (1994). The third example data set, gene, contains simulated data representing the bone density data published in the European Bioinformatics Institute (EMBL-EBI) ArrayExpress repository, ID: E-MEXP-1618.
6.2.1 General example
In the first example, Y is an n = 100-dimensional continuous observed outcome vector, X is a matrix of p = 300 potential predictors observed on the n observations, and Z is a set of q = 4 external features available for the p predictors.
## load the example data
data("example")
X <- example$X; Y <- example$Y; Z <- example$Z
dim(X);dim(Z)
#> [1] 100 300
#> [1] 300 4
Each column of Z contains information about the predictors in the design matrix X. The number of rows in Z equals the number of predictors in X.
X[1:3,1:4]
#> feature 1 feature 2 feature 3 feature 4
#> obs 1 -0.7667960 0.9212806 2.0149030 0.79004563
#> obs 2 -0.8164583 -0.3144157 -0.2253684 0.08712746
#> obs 3 -0.1415352 0.6623149 -1.0398456 1.87611212
The external information is encoded as follows:
Z[1:4,]
#> ext var 1 ext var 2 ext var 3 ext var 4
#> feature 1 1 0 0 0
#> feature 2 1 0 0 0
#> feature 3 0 1 0 0
#> feature 4 0 1 0 0
Here, each variable in Z is a binary variable: Z_jk indicates whether predictor j has external variable k or not. This Z is an example of a (non-overlapping) grouping of predictors. Predictors 1 and 2 belong to group 1; predictors 3 and 4 belong to group 2; predictors 5 and 6 belong to group 3; and the rest of the predictors belong to group 4. To fit an xtune LASSO model to this data:
fit.example1 <- xtune(X,Y,Z) # default c = 1, family = "linear"
#> Z provided, start estimating individual tuning parameters
The individual penalty parameters are returned by:
fit.example1$penalty.vector
In this example, predictors in each group get different estimated penalty parameters.
unique(fit.example1$penalty.vector)
#> [,1]
#> Predictor_1 0.008024536
#> Predictor_3 0.014687127
#> Predictor_5 0.034129109
#> Predictor_7 1.202465611
Coefficient estimates and predicted values can be obtained via coef() and predict():
coef(fit.example1)
predict(fit.example1, newX = X)
The mse() function can be used to compute the mean squared error between the predicted values and the true values.
mse(predict(fit.example1, newX = X), Y)
6.2.2 The dietary data example (binary outcome)
Suppose we want to predict a person’s weight loss (binary outcome) using his/her weekly dietary
intake. Our external information Z could incorporate information about the levels of relevant food
constituents in the dietary items.
data(diet)
head(diet$DietItems)[1:4,1:5]
#> Milk Margarine Eggs Apples Lettuce
#> [1,] 1 1 4 1 2
#> [2,] 1 0 0 0 2
#> [3,] 0 1 2 3 1
#> [4,] 0 2 0 1 0
head(diet$weightloss)
#> [1] 0 1 1 1 1 1
The external information Z in this example is:
diet$NutritionFact[1:5,]
#> Calories Protein Carbohydrates
#> Milk 0.8293411 0.9765283 3.1774423
#> Margarine 0.5591902 0.8340626 0.2589148
#> Eggs 0.9751737 0.1534241 0.4487985
#> Apples 3.3310695 1.3267249 1.3918929
#> Lettuce 1.3401675 0.7208144 0.3664577
In this example, Z is not a grouping of the predictors. The idea is that the nutrition facts about
the dietary items might give us some information on the importance of each predictor in the model.
Similar to the previous example, the xtune model could be fit by:
fit.diet = xtune(X = diet$DietItems, Y = diet$weightloss, Z = diet$NutritionFact, family = "binary")
#> Z provided, start estimating individual tuning parameters
Each dietary predictor is assigned an individual tuning parameter based on its nutrition facts.
fit.diet$penalty.vector
#> [,1]
#> Milk 0.169339210
#> Margarine 0.031858661
#> Eggs 0.044613793
#> Apples 0.009547992
#> Lettuce 0.023354286
#> Celery 0.023490297
#> Hot dogs 0.047758476
#> Liver 0.062248125
#> Dark bread 0.008485647
#> Pasta 0.052605046
#> Beer 0.016805193
#> Liquor 0.047894353
#> Cookies 0.067277037
#> Bran 0.081038328
To make predictions using the trained model:
predict(fit.diet,newX = diet$DietItems)
#> [1] 0.27156793 0.99985867 1.01171441 1.06317164 0.54136333 0.77975008 ...
The above code returns the predicted probabilities (scores). To make a class prediction, provide the
original X and Y used for the fit and use the type = “class” option.
predict(fit.diet, newX = diet$DietItems, type = "class", X = diet$DietItems, Y = diet$weightloss)
#> [1] 0 1 1 1 1 1 0 1 0 1 0 0 0 1 0 0 0 1 0 1 1 0 1 1 1 0 0 0 1 0 0 0 1 1 0 ...
6.2.3 The gene expression data example (linear outcome)
The gene data contains simulated gene expression data. The dimension of the data is 50 × 200. The outcome Y is continuous (bone mineral density). The external information consists of the results of four previous studies that identify the biological importance of genes. For example, Z_jk indicates whether gene j was identified as biologically important by previous study k: Z_jk = 1 means that gene j was identified by previous study k, and Z_jk = 0 indicates that it was not.
data(gene)
gene$GeneExpression[1:3,1:5]
#> Gene_1 Gene_2 Gene_3 Gene_4 Gene_5
#> [1,] -0.7667960 1.7520578 0.9212806 -0.6273008 2.0149030
#> [2,] -0.8164583 -0.5477714 -0.3144157 -0.8796116 -0.2253684
#> [3,] -0.1415352 -0.8585257 0.6623149 -0.3053110 -1.0398456
gene$PreviousStudy[1:5,]
#> study 1 study 2 study 3 study 4
#> Gene_1 0 0 0 0
#> Gene_2 1 0 0 0
#> Gene_3 0 0 1 0
#> Gene_4 1 0 1 0
#> Gene_5 0 0 0 0
A gene can be identified as important by several previous studies; therefore, the external information Z in this example can be seen as an overlapping grouping of variables. Model fitting:
fit.gene = xtune(X = gene$GeneExpression, Y = gene$bonedensity, Z = gene$PreviousStudy)
The rest of the steps are the same as the previous two examples.
6.2.4 Two special cases
No external information Z
To tune a single penalty parameter using empirical Bayes tuning, simply use the xtune() function
without providing Z. If no external information Z is provided, the function will perform empirical
Bayes tuning to choose the single penalty parameter in penalized regression, as an alternative to
cross-validation. For example:
fit.eb <- xtune(X,Y)
#> No Z matrix provided, only a single tuning parameter will be estimated using empirical Bayes tuning
The estimated tuning parameter is returned by:
fit.eb$lambda
#> 0.001
Z as an identity matrix
If an identity matrix is provided as the external information Z, the xtune() function will estimate a separate tuning parameter λ_j for each regression coefficient β_j. Note that this is not advised when the ratio of the number of predictors p to the number of observations n is high. Using the dietary example, the following code estimates a separate penalty parameter for each coefficient.
Z_iden = diag(ncol(diet$DietItems))
fit.diet.identity = xtune(diet$DietItems,diet$weightloss,Z_iden)
#> Z provided, start estimating individual tuning parameters
fit.diet.identity$penalty.vector
#> [,1]
#> Milk 0.054815200
#> Margarine 0.087389204
#> Eggs 6.809513817
#> Apples 0.020539316
#> Lettuce 23.190397121
#> Celery 0.019095557
#> Hot dogs 0.019681366
#> Liver 0.327122069
#> Dark bread 0.005405395
#> Pasta 3.223042821
#> Beer 0.042084237
#> Liquor 0.112456551
#> Cookies Inf
#> Bran 0.056912390
A predictor is excluded from the model (regression coefficient estimated to be zero) if its corresponding penalty parameter is estimated to be infinity.
6.3 Conclusion
In this chapter, we presented the main usage of the xtune package. More details about each
function are provided in the package documentation. Computational efficiency and scalability may
be improved by the use of Rcpp integration or parallel computing. We leave this direction for future
research.
Bibliography
Abram, S., Helwig, N., Moodie, C., Deyoung, C., MacDonald, A., and Waller, N. (2016). Boot-
strap enhanced penalized regression for variable selection with neuroimaging data. Frontiers in
Neuroscience, 10.
Antunes, A., Reis, S., Leite, K., Real, D., Canavez, J., Camara-Lopes, L., Dall’oglio, M., and Srougi,
M. (2013). Pgc and psma in prostate cancer diagnosis: Tissue analysis from biopsy samples.
International braz j urol: official journal of the Brazilian Society of Urology, 39:649–656.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P.,
Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis,
A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G.
(2000). Gene ontology: tool for the unification of biology. the gene ontology consortium. Nat
Genet, 25(1):25–29.
Barber, R., Drton, M., andTan, K.M.(2016). Laplace Approximation in High-Dimensional Bayesian
Regression, volume 11, pages 15–36.
Berger, J. (2008). Statistical Decision Theory, pages 1–7.
Bergersen, L. C., Glad, I. K., and Lyng, H. (2011). Weighted lasso with data integration. Statistical
Applications in Genetics and Molecular Biology, 10(1).
Bhattacharya, A., Pati, D., Pillai, N., and Dunson, D. (2014). Dirichlet–laplace priors for optimal
shrinkage. Journal of the American Statistical Association, 110.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer Science+Business
Media, LLC.
Blei, D., Kucukelbir, A., and McAuliffe, J. (2016). Variational inference: A review for statisticians.
Journal of the American Statistical Association, 112.
Boulesteix, A.-L., De Bin, R., Jiang, X., and Fuchs, M. (2017). IPF-LASSO: integrative L1-penalized regression with penalty factors for prediction based on multi-omics data. Computational and Mathematical Methods in Medicine, 2017.
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, New
York, NY, USA.
Boyd, S. and Vandenberghe, L. (2010). Convex Optimization, volume 25.
Cheng, W., Ou Yang, T.-H., and Anastassiou, D. (2013a). Biomolecular events in cancer revealed
by attractor metagenes. PLoS computational biology, 9:e1002920.
Cheng, W.-Y., Yang, T.-H. O., and Anastassiou, D. (2013b). Development of a prognostic model
for breast cancer survival in an open challenge environment. Science Translational Medicine,
5(181):181ra50–181ra50.
Cho, S., Kim, K., Kim, Y. J., Lee, J.-K., Cho, Y., Lee, J.-Y., Han, B.-G., Kim, T. H., Ott, J., and
Park, T. (2010). Joint identification of multiple genetic variants via elastic-net variable selection
in a genome-wide association analysis. Annals of human genetics, 74:416–28.
Comes, N., Serrano Albarrás, A., Capera, J., Serrano-Novillo, C., Condom, E., Ramon y Cajal,
S., Ferreres, J., and Felipe, A. (2014). Involvement of potassium channels in the progression of
cancer to more malignant phenotype. Biochimica et biophysica acta, 1848.
Coppola, A., Stewart, B., and Okazaki, N. (2014). lbfgs: Limited-memory BFGS Optimization.R
package version 1.2.1.
Curtis, C., Shah, S. P., Chin, S. F., Turashvili, G., Rueda, O. M., Dunning, M. J., Speed, D.,
Lynch, A. G., Samarajiwa, S., Yuan, Y., Gräf, S., Ha, G., Haffari, G., Bashashati, A., Russell,
R., McKinney, S., Aparicio, S., Brenton, J. D., Ellis, I., Huntsman, D., Pinder, S., Murphy, L.,
Bardwell, H., Ding, Z., Jones, L., Liu, B., Papatheodorou, I., Sammut, S. J., Wishart, G., Chia, S.,
Gelmon, K., Speers, C., Watson, P., Blamey, R., Green, A., MacMillan, D., Rakha, E., Gillett, C.,
Grigoriadis, A., De Rinaldis, E., Tutt, A., Parisien, M., Troup, S., Chan, D., Fielding, C., Maia,
A. T., McGuire, S., Osborne, M., Sayalero, S. M., Spiteri, I., Hadfield, J., Bell, L., Chow, K., Gale,
N., Kovalik, M., Ng, Y., Prentice, L., Tavaré, S., Markowetz, F., Langerød, A., Provenzano, E.,
Purushotham, A., Børresen-Dale, A. L., and Caldas, C. (2012). The genomic and transcriptomic
architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486(7403):346–352.
Hunter, D. R. and Lange, K. (2004). A tutorial on MM algorithms. The American Statistician, 58:30–37.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Annals of
Statistics, 32:407–499.
Figueiredo, M. A. (2003). Adaptive sparseness for supervised learning. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 25(9):1150–1159.
Forbes, S., Bindal, N., Bamford, S., Cole, C., Kok, C., Beare, D., Jia, M., Shepherd, R., Leung,
K., Menzies, A., Teague, J., Campbell, P., Stratton, M., and Futreal, P. (2010). Cosmic: Mining
complete cancer genomes in the catalogue of somatic mutations in cancer. Nucleic acids research,
39:D945–50.
Foster, S. D., Verbyla, A. P., and Pitchford, W. S. (2008). A random model approach for the LASSO.
Computational Statistics, 23(2):217–233.
Fraser, S., Grimes, J., Diss, J., Stewart, D., Dolly, O., and Djamgoz, M. (2003). Predominant
expression of kv1.3 voltage-gated k+ channel subunit in rat prostate cancer cell lines: Electrophys-
iological, pharmacological and molecular characterisation. Pflügers Archiv : European journal of
physiology, 446:559–71.
Friedman, J., Hastie, T., and Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332.
Friedman, J., Hastie, T., and Tibshirani, R. (2010a). Regularization Paths for Generalized Linear
Models via Coordinate Descent. 33(11):1212–1217.
Friedman, J., Hastie, T., and Tibshirani, R. (2010b). Regularization paths for generalized linear
models via coordinate descent. Journal of Statistical Software, 33(1):1–22.
Georgescu, C., Corbin, J., Thibivilliers, S., Webb, Z., Zhao, Y., Koster, J., Fung, K.-M., Asch, A.,
Wren, J., and Ruiz Echevarria, M. (2019). A tme2-regulated cell cycle derived gene signature is
prognostic of recurrence risk in prostate cancer. BMC Cancer, 19.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer
New York Inc., second edition.
Hebiri, M. and Lederer, J. (2013). How correlations influence lasso prediction. IEEE Transactions
on Information Theory, 59(3):1846–1854.
Hoerl, A. E. and Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 42(1):80–86.
Kamel, H. and Al-Amodi, H. (2017). Exploitation of gene expression and cancer biomarkers in
paving the path to era of personalized medicine. Genomics, Proteomics Bioinformatics, 15.
Kaminski, L., Torrino, S., Dufies, M., Djabari, Z., Haider, R., Roustan, F.-R., Emilie, J., Laurent, K.,
Nottet, N., Michiels, J.-F., Gesson, M., Rocchi, S., Mazure, N., Durand, M., Tanti, J., Ambrosetti,
D., Clavel, S., Ben-Sahra, I., and Bost, F. (2019). Pgc1 inhibits polyamine synthesis to suppress
prostate cancer aggressiveness. Cancer Research, 79:canres.2043.2018.
Kazma, R., Mefford, J., Cheng, I., Plummer, S., Levin, A., Rybicki, B., Casey, G., and Witte, J.
(2012). Association of the innate immunity and inflammation pathway with advanced prostate
cancer risk. PloS one, 7:e51680.
Kohannim, O., Hibar, D., Jahanshad, N., Stein, J., Hua, X., Toga, A., Jack, C., Weiner, M.,
and Thompson, P. (2012). Predicting temporal lobe volume on mri from genotypes using l1-l2
regularized regression. Proceedings / IEEE International Symposium on Biomedical Imaging:
from nano to macro. IEEE International Symposium on Biomedical Imaging, pages 1160–1163.
Kraemer, N., Schaefer, J., and Boulesteix, A.-L. (2009). Regularized estimation of large-scale gene
regulatory networks using gaussian graphical models. BMC Bioinformatics, 10(384).
Krstajic, D., Buturovic, L., Leahy, D., and Thomas, S. (2014). Cross-validation pitfalls when
selecting and assessing regression and classification models. Journal of cheminformatics, 6:10.
Le Thi, H. A. and Pham Dinh, T. (2018). Dc programming and dca: thirty years of developments.
Mathematical Programming, 169.
Lei, J. (2017). Cross-Validation with Confidence. pages 1–35.
Li, Q. and Lin, N. (2010). The bayesian elastic net. Bayesian Analysis, 5.
Lim, C. and Yu, B. (2016). Estimation Stability With Cross-Validation (ESCV). Journal of
Computational and Graphical Statistics, 25(2):464–492.
Liu, J., Liang, G., Siegmund, K. D., and Lewinger, J. P. (2018). Data integration by multi-tuning
parameter elastic net regression. BMC Bioinformatics, 19(1):369.
Loeb, S. and Partin, A. (2008). Review of the literature: Pca3 for prostate cancer risk assessment
and prognostication. Reviews in urology, 13:e191–5.
Long, Q.-Z., Du, Y.-F., Ding, X.-Y., Li, X., Song, W.-B., Yang, Y., Zhang, P., Zhou, J.-P., and Liu,
X.-G. (2012). Replication and fine mapping for association of the c2orf43, foxp4, gprc6a and rfx6
genes with prostate cancer in the chinese population. PloS one, 7:e37866.
Luo, Y., Gou, X., Huang, P., and Mou, C. (2014). The pca3 test for guiding repeat biopsy of
prostate cancer and its cut-o score: A systematic review and meta-analysis. Asian journal of
andrology, 16.
MacKay, D. J. C. (1992). Bayesian Interpolation. Neural Computation, 4(3):415–447.
Mai, Q., Zou, H., and Yuan, M. (2012). A direct approach to sparse discriminant analysis in
ultra-high dimensions. Biometrika, 99(1):29–42.
Mallick, H. and Yi, N. (2013). Bayesian methods for high dimensional linear models. Journal of
Biometrics Biostatistics, S1:005.
Merola, R., Tomao, L., Antenucci, A., Sperduti, I., Sentinelli, S., Masi, S., Mandoj, C., Orlandi, G.,
Papalia, R., Guaglianone, S., Costantini, M., Cusumano, G., Cigliana, G., Ascenzi, P., Gallucci,
M., and Conti, L. (2015). Pca3 in prostate cancer and tumor aggressiveness detection on 407
high-risk patients: A national cancer institute experience. Journal of experimental clinical cancer
research : CR, 34:15.
Li, M. J., Wang, P., Liu, X., Lim, E. L., Wang, Z., Yeager, M., Wong, M., Sham, P., Chanock, S., and Wang, J. (2011). GWASdb: A database for human genetic variants identified by genome-wide association studies. Nucleic Acids Research, 40:D1047–54.
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective.
Neal, R. M. (1995). Bayesian Learning for Neural Networks.
Nuyten, D., Kreike, B., Hart, A., Chi, J.-T., Sneddon, J., Wessels, L., Peterse, H., Bartelink, H., Chang, H., and Vijver, M. (2006). Predicting a local recurrence after breast-conserving therapy by gene expression profiling. Breast Cancer Research: BCR, 8:R62.
Osborne, M. R., Presnell, B., and Turlach, B. A. (2000). On the LASSO and Its Dual. Journal of
Computational and Graphical Statistics, 9(2):319–337.
Pan, W., Xie, B., and Shen, X. (2010). Incorporating predictor network in penalized regression
with application to microarray data. Biometrics, 66(2):474–484.
Park, T. and Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association,
103(482):681–686.
Pettigrew, C., Clerkin, J., and Cotter, T. (2012). Duox enzyme activity promotes akt signalling in
prostate cancer cells. Anticancer research, 32:5175–81.
Pletscher-Frankild, S., Palleja, A., Tsafou, K., Binder, J., and Jensen, L. (2014). Diseases: Text
mining and data integration of disease–gene associations. Methods (San Diego, Calif.), 74.
Pletscher-Frankild, S., Pallejà, A., Tsafou, K., Binder, J. X., and Jensen, L. J. (2015). Diseases:
Text mining and data integration of disease–gene associations. Methods, 74:83–89.
Polson, N., Scott, J., and Windle, J. (2012). Bayesian inference for logistic models using Polya-Gamma latent variables. Journal of the American Statistical Association, 108.
Pudova, E., Lukyanova, E., Nyushko, K., Mikhaylenko, D., Zaretsky, A., Snezhkina, A., Savvateeva,
M., Kobelyatskaya, A., Melnikova, N., Volchenko, N., Efremov, G., Klimina, K., Belova, A.,
Kiseleva, M., Kaprin, A., Alekseev, B., Krasnov, G., and Kudryavtseva, A. (2019). Dierentially
expressed genes associated with prognosis in locally advanced lymph node-negative prostate
cancer. Frontiers in Genetics, 10:730.
Reid, S., Tibshirani, R., and Friedman, J. (2013). A Study of Error Variance Estimation in Lasso
Regression. pages 1–30.
Reppe, S., Refvem, H., Gautvik, V. T., Olstad, O. K., Høvring, P. I., Reinholt, F. P., Holden, M.,
Frigessi, A., Jemtland, R., and Gautvik, K. M. (2010). Eight genes are highly associated with
BMD variation in postmenopausal Caucasian women. Bone, 46(3):604–612.
Roscilli, G., Cappelletti, M., Vitis, C., Ciliberto, G., Napoli, A., Ruco, L., Mancini, R., and
Aurisicchio, L. (2014). Circulating mmp11 and specific antibody immune response in breast and
prostate cancer patients. Journal of translational medicine, 12:54.
Rouillard, A., Gundersen, G., Fernandez, N., Wang, Z., Monteiro, C., McDermott, M., and Ma’ayan,
A. (2016a). The harmonizome: a collection of processed datasets gathered to serve and mine
knowledge about genes and proteins. Database, 2016:baw100.
Rouillard, A. D., Gundersen, G. W., Fernandez, N. F., Wang, Z., Monteiro, C. D., McDermott,
M. G., and Ma’ayan, A. (2016b). The harmonizome: a collection of processed datasets gathered
to serve and mine knowledge about genes and proteins. Database, 2016.
Shen, L., Kim, S., Qi, Y., Inlow, M., Swaminathan, S., Nho, K., Wan, J., Risacher, S., Shaw, L.,
Trojanowski, J., Weiner, M., and Saykin, A. (2011). Identifying neuroimaging and proteomic
biomarkers for mci and ad via the elastic net. volume 7012, pages 27–34.
Su, B., Gao, L., Baranowski, C., Gillard, B., Wang, J., Ransom, R., Ko, H.-K., andGelman, I.(2014).
A genome-wide rnai screen identifies foxo4 as a metastasis-suppressor through counteracting
pi3k/akt signal pathway in prostate cancer. PloS one, 9:e101411.
Sun, Y., Babu, P., and Palomar, D. (2017). Majorization-minimization algorithms in signal
processing, communications, and machine learning. IEEE Transactions on Signal Processing,
65:794–816.
Tai, F. and Pan, W. (2007). Incorporating prior knowledge of predictors into penalized classifiers
with multiple penalty terms. Bioinformatics, 23(14):1775–1782.
Takata, R., Akamatsu, S., Kubo, M., Takahashi, A., Hosono, N., Kawaguchi, T., Tsunoda, T.,
Inazawa, J., Kamatani, N., Ogawa, O., Fujioka, T., Nakamura, Y., and Nakagawa, H. (2010).
Genome-wide association study identifies five new susceptibility loci for prostate cancer in the
japanese population. Nature genetics, 42:751–4.
Tennakoon, J., Shi, Y., Han, J., Tsouko, E., White, M., Burns, A., Zhang, A.-J., Xia, X., Ilkayeva,
O., Xin, L., Ittmann, M., Rick, F., Schally, A., and Frigo, D. (2014). Androgens regulate prostate
cancer cell growth via an ampk-pgc-1-mediated metabolic switch. Oncogene, 33:5251–61.
Tharmaratnam, K., Sperrin, M., Jaki, T., Reppe, S., and Frigessi, A. (2016). Tilting the lasso by
knowledge-based post-processing. BMC Bioinformatics, 17(1):1–9.
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal
Statistical Society. Series B: Statistical Methodology, 58(1):267–288.
Tibshirani, R. (2011). Regression Shrinkage and Selection Via the Lasso: a restrospective. Journal
of the Royal Statistical Society. Series B: Statistical Methodology, 73:273–282.
Tipping, M. (2001). Sparse Bayesian Learning and the Relevance Vector Machine. Journal of Machine
Learning Research, 1:211–244.
Tipping, M. E. and Faul, A. C. (2003). Fast marginal likelihood maximisation for sparse Bayesian
models. Proceedings of the ninth ..., (x):1–13.
Vaarala, M., Porvari, K., Kyllönen, A., and Vihko, P. (2000). Dierentially expressed genes in two
lncap prostate cancer cell lines reflecting changes during prostate cancer progression. Laboratory
investigation; a journal of technical methods and pathology, 80:1259–68.
Van de Wiel, M., Beest, D., and Münch, M. (2017). Learning from a lot: Empirical bayes in
high-dimensional prediction settings.
van de Wiel, M. A., Lien, T. G., Verlaat, W., van Wieringen, W. N., and Wilting, S. M. (2016).
Better prediction by use of co-data: Adaptive group-regularized ridge regression. Statistics in
Medicine, 35(3):368–381.
Vandenberghe, L. (2011). EE236C: Optimization methods for large-scale systems. Lecture notes.
Wei, W., Leng, J., Shao, H., and Wang, W. (2015). High pca3 scores in urine correlate with poor-
prognosis factors in prostate cancer patients. International journal of clinical and experimental
medicine, 8:16606–16612.
Wipf, D. and Nagarajan, S. (2008). A New View of Automatic Relevance Determination. Advances in Neural Information Processing Systems, 20:1625–1632.
Wipf, D. and Nagarajan, S. (2014). Iterative Reweighted ℓ1 and ℓ2 Methods for Finding Sparse Solutions. (August):1–29.
Witte, J., Greenland, S., Haile, R., and Bird, C. (1994). Hierarchical regression analysis applied
to a study of multiple dietary exposures and breast cancer. Epidemiology (Cambridge, Mass.),
5:612–21.
Yang, L., Roberts, D., Takhar, M., Erho, N., Bibby, B., Thiruthaneeswaran, N., Bhandari, V.,
Cheng, W.-C., Haider, S., McCorry, A., Mcart, D., Jain, S., Alshalalfa, M., Ross, A., Schaeffer, E.,
Den, R., Karnes, R., Klein, E., Hoskin, P., and West, C. (2018). Development and validation of a
28-gene hypoxia-related prognostic signature for localized prostate cancer. EBioMedicine, 31.
Yu, J., Cao, Q., Mehra, R., Laxman, B., Yu, J., Tomlins, S., Creighton, C., Dhanasekaran, S., Shen,
R., Chen, G., Morris, D., Marquez, V., Shah, R., Ghosh, D., Varambally, S., and Chinnaiyan, A.
(2007). Integrative genomics analysis reveals silencing of β-adrenergic signaling by polycomb in
prostate cancer. Cancer cell, 12:419–31.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society. Series B: Statistical Methodology, 68(1):49–67.
Yuille, A. and Rangarajan, A. (2001). The concave-convex procedure (cccp). Neural Computation,
15:1033–1040.
Zhang, Q., Yin, X., Pan, Z., Cao, Y., Han, S., Gao, G., Gao, Z., Pan, Z., and Feng, W. (2019).
Identification of potential diagnostic and prognostic biomarkers for prostate cancer. Oncology
Letters, 18.
Zhou, H., Hu, L., Zhou, J., and Lange, K. (2019). Mm algorithms for variance components models.
Journal of Computational and Graphical Statistics, 28(2):350–361.
Zou, H. (2006a). The adaptive lasso and its oracle properties. Journal of the American statistical
association, 101(476):1418–1429.
Zou, H. (2006b). The adaptive lasso and its oracle properties. Journal of the American Statistical
Association, 101(476):1418–1429.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.
Abstract
The rapid advancement of high-throughput sequencing technologies has produced unprecedented amounts and types of omic data. Predicting clinical outcomes based on genomic features like gene expression, methylation, and genotypes is becoming increasingly important for individualized risk assessment and treatment. Associated with genomic features, there is also a rich set of meta-features such as functional annotation, pathway information, and knowledge from previous studies, that comprise valuable additional information. Traditionally, such meta-feature information is used in a post-hoc manner to enhance model explainability. For example, after model fit, analysis can be conducted to formally assess whether the selected gene features are enriched in particular metabolic pathways or gene ontology annotations. This kind of post-hoc analysis can provide biological insights and validation for a prediction model. In this dissertation, we propose novel methods that exploit genomic meta-features a-priori rather than post-hoc, to better identify important markers and improve prediction performance. We aim at addressing one central question: how can we predict an outcome of interest and identify relevant features while taking additional information on the features into account?

Since genomic data sets are typically high-dimensional, penalized regression methods are commonly used to select relevant features and build predictive models. Standard penalized regression applies one penalty parameter to all features, ignoring the structural difference or heterogeneity of features. Based on this, we integrate meta-features into penalized regression by adapting the penalty parameters to be meta-feature-driven. The penalty parameters are modeled as a log-linear function of the meta-features and are estimated from the data using an approximate empirical Bayes approach.

This dissertation is structured as follows. Chapter 1 introduces how penalized regression techniques can be used to solve high dimensional data problems. Chapter 2 describes an empirical Bayes approach to select the penalty parameter(s) in penalized regression. Chapter 3 discusses our method for incorporating meta-features into LASSO linear regression. Chapter 4 is devoted to the optimization algorithms for marginal likelihood maximization. Chapter 5 extends the model to Ridge and Elastic-Net linear and logistic regression. Finally, Chapter 6 presents the R package we developed to implement our method.