HIERARCHICAL REGULARIZED REGRESSION FOR INCORPORATION OF EXTERNAL
DATA IN HIGH-DIMENSIONAL MODELS
by
Garrett M. Weaver
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOSTATISTICS)
December 2019
Table of Contents

List of Tables
List of Figures
Abstract
Chapter 1 Introduction
    1.1 High-Dimensional Data and Associated Challenges in Prediction
    1.2 Regularization, Optimization, and Regression
        1.2.1 Definition
        1.2.2 Convex Optimization
        1.2.3 Ridge Regression
        1.2.4 Lasso Regression
    1.3 Shrinkage, Regularization, and Hierarchical Models
    1.4 Integration of External Data
Chapter 2 Hierarchical Regularized Linear Regression
    2.1 Overview
    2.2 Methods
        2.2.1 General Framework
        2.2.2 Ridge-Lasso
        2.2.3 Hyperparameter Tuning
    2.3 Simulations
        2.3.1 Independence Case
        2.3.2 Normal-Normal Case
        2.3.3 Magnitude and Direction of Effect
    2.4 Results
        2.4.1 Independence Case
        2.4.2 Normal-Normal Case
        2.4.3 Magnitude and Direction of Effect
    2.5 Discussion
Chapter 3 Coordinate Descent for Hierarchical Regularized Regression
    3.1 Overview
    3.2 Methods
        3.2.1 Coordinate Gradient Descent
        3.2.2 Coordinate Gradient Descent for Regularized Regression
        3.2.3 Application of Coordinate Descent to Hierarchical Regularized Regression
        3.2.4 Hyperparameter Tuning Revisited
        3.2.4 Pathwise Coordinate Descent in Two-Dimensions
    3.3 Discussion
Chapter 4 Applications to Real Data
    4.1 Overview
    4.2 Prediction of Chronological Age with Methylation Data
        4.1.1 Introduction
        4.1.2 Methods
        4.1.3 Results
    4.2 Using Gene Annotations to Improve Prediction of Prostate Cancer Recurrence
        4.2.1 Introduction
        4.2.2 Methods
        4.2.3 Results
    4.3 Recovery of Gene Expression Signatures for Breast Cancer Mortality
        4.3.1 Introduction
        4.3.2 Methods
        4.3.3 Results
    4.5 Discussion
Chapter 5 xrnet: An R package for Hierarchical Regularized Regression
    5.1 Overview
    5.2 Methods
        5.2.1 Coordinate Descent Solver
        5.2.2 Safe Sets, Strong Sets, and Active Sets
        5.2.3 Memory
    5.3 Simulations
    5.4 Results
        5.3.2 Speed
        5.3.3 Memory
        5.3.4 Ease of Use
    5.4 Discussion
Chapter 6 Hierarchical Regularized Logistic Regression
    6.1 Overview
    6.2 Methods
        6.2.1 Regularized Logistic Regression
        6.2.2 Hierarchical Regularized Logistic Regression
    6.3 Simulations
    6.4 Real Data Applications
        6.4.1 Revisiting the METABRIC Breast Cancer Analysis
        6.4.2 PrediXcan + Hierarchical Regularized Regression: Understanding Gene-based Associations in Colorectal Cancer
    6.5 Results
        6.5.1 Simulations
        6.5.2 METABRIC Analysis Revisited
        6.5.3 PrediXcan + Hierarchical Regularized Regression
    6.6 Discussion
Appendices
    A.1 Derivation of Ridge-Lasso in Orthonormal Case with a 2nd Level Intercept
    A.2 Derivation of Ridge-Lasso Solution for General Case with a 2nd Level Intercept
    A.3 Relationship between linear discriminant analysis (LDA) and linear regression
References
List of Tables

Table 1.1 Comparison of Maximum Likelihood and James-Stein Estimators on Baseball Data (Adapted from Efron 2010)
Table 2.1 True Parameter Estimates and External Data Settings for Independence Case
Table 2.2 Behavior of Ridge-Lasso Estimates as a Function of λ1 and λ2
Table 4.1 Mean and Median Test MSE by Method
Table 4.2 Summary of Enriched Molecular Functions Gene Ontology Annotations in Shahabi et al. 2016
Table 4.3 Average Selection Rate and Average Coefficient Estimates of External Data (Enriched Annotations)
Table 4.4 Comparison of Test AUC between Ridge Regression and Ridge-Lasso
Table 4.5 Coefficient Estimates for External Data Variables (Metagenes)
Table 5.1 Feature Comparison of xrnet, glmnet, and biglasso
Table 5.2 Maximum Memory Allocated to Fit Lasso Regression on All Training Data and to Compute 5-Fold Cross-Validation Sequentially
Table 6.1 Comparison of Test AUC between Ridge Regression and Ridge-Lasso Regression (Linear and Logistic Versions)
Table 6.2 Coefficient Estimates for External Data Variables (Metagenes)
Table 6.3 Summary of Significant PrediXcan Results Compared to Genes Selected in EN-EN Hierarchical Regularized Regression Model and Results of Bien et al. (2019)
Table 6.4 Marginal Association with Colorectal Cancer Case-Control Status and EN-EN Coefficient Estimates for SNPs Used in COLCA1 Imputation Model
Table 6.5 Marginal Association with Colorectal Cancer Case-Control Status and EN-EN Coefficient Estimates for SNPs Used in COLCA2 Imputation Model
Table 6.6 Marginal Association with Colorectal Cancer Case-Control Status and EN-EN Coefficient Estimates of SNPs Used in TRIM4 and OR2AE1 Imputation Models
Table 6.7 Marginal Association with Colorectal Cancer Case-Control Status and EN-EN Coefficient Estimates for SNPs Used in LIMA1 Imputation Model
List of Figures

Figure 1.1 Original results of Hoerl and Kennard comparing least-squares and ridge estimators (Annotated)
Figure 1.2 Contours of objective and constraint functions for the lasso (left) and the ridge (right) (James et al., 2013)
Figure 2.1 Complete SVD Decomposition
Figure 2.2 Behavior of Ridge-Lasso Estimates as a Function of λ1 and λ2
Figure 2.3 Relative Prediction Performance, as measured by decrease in prediction MSE, of Ridge-Lasso Compared to Standard Ridge Regression by Number of Predictors, External Data Signal-to-Noise Ratio, and External Data Correlation (SNR_X = 2)
Figure 2.4 Relative Prediction Performance, as measured by decrease in prediction MSE, of Ridge-Lasso Compared to Standard Ridge Regression by Number of Predictors, External Data Signal-to-Noise Ratio, and External Data Correlation (SNR_X = 1)
Figure 2.5 Relative decrease in MSE(β) of Ridge-Lasso Compared to Standard Ridge Regression by Number of Predictors, External Data Signal-to-Noise Ratio, and External Data Correlation (SNR_X = 2)
Figure 2.6 Relative decrease in MSE(β) of Ridge-Lasso Compared to Standard Ridge Regression by Number of Predictors, External Data Signal-to-Noise Ratio, and External Data Correlation (SNR_X = 1)
Figure 2.7 Average True Positive Selection Rate of Nonzero External Variables (α ≠ 0) by Number of Predictors and External Data Signal-to-Noise Ratio (SNR_X = 2)
Figure 2.8 Average True Positive Selection Rate of Nonzero External Variables (α ≠ 0) by Number of Predictors and External Data Signal-to-Noise Ratio (SNR_X = 1)
Figure 2.9 Average True Negative Selection Rate of Nonzero External Variables (α ≠ 0) by Number of Predictors and External Data Signal-to-Noise Ratio (SNR_X = 2)
Figure 2.10 Average True Negative Selection Rate of Nonzero External Variables (α ≠ 0) by Number of Predictors and External Data Signal-to-Noise Ratio (SNR_X = 1)
Figure 2.11 Relative Prediction Performance, as measured by decrease in prediction MSE, of Ridge-Lasso Compared to Standard Ridge Regression by Number of External Variables, External Data Signal-to-Noise Ratio, and External Data Correlation (SNR_X = 2)
Figure 2.12 Relative Prediction Performance, as measured by decrease in prediction MSE, of Ridge-Lasso Compared to Standard Ridge Regression by Number of External Variables, External Data Signal-to-Noise Ratio, and External Data Correlation (SNR_X = 1)
Figure 2.13 Relative decrease in MSE(β) of Ridge-Lasso Compared to Standard Ridge Regression by Number of External Variables, External Data Signal-to-Noise Ratio, and External Data Correlation (SNR_X = 2)
Figure 2.14 Relative decrease in MSE(β) of Ridge-Lasso Compared to Standard Ridge Regression by Number of External Variables, External Data Signal-to-Noise Ratio, and External Data Correlation (SNR_X = 1)
Figure 2.15 Average True Positive Selection Rate of Nonzero External Variables (α ≠ 0) by Number of External Variables and External Data Signal-to-Noise Ratio (SNR_X = 2)
Figure 2.16 Average True Positive Selection Rate of Nonzero External Variables (α ≠ 0) by Number of External Variables and External Data Signal-to-Noise Ratio (SNR_X = 1)
Figure 4.1 Relationship between Gene Expression and Gene Ontology Annotation Data
Figure 4.2 Gene Expression Data Matrix (bottom) and Gene-Annotation Data Matrix (top)
Figure 4.3 Gene Ontology Ancestry Tree for Molecular Function Annotation GO:0004470 (malic enzyme activity)
Figure 4.4 Correlation Between Enriched Molecular Function Gene Ontology Terms
Figure 4.5 Comparison of Cross-validation AUC by Method
Figure 4.6 Comparison of Coefficient Estimates for Metagene Probes Between Ridge Regression and Ridge-Lasso Regression
Figure 5.1 xrnet R Package Structure
Figure 5.2 Average Computing Time (s) to Solve Linear Lasso Regression Using Standard Path from R Package glmnet as a Function of Number of Features (n = 1,000) and Number of Observations (p = 10,000)
Figure 5.3 Average Computing Time (s) to Solve Logistic Lasso Regression Using Standard Path from R Package glmnet as a Function of Number of Features (n = 1,000) and Number of Observations (p = 10,000)
Figure 5.4 Comparison of Memory Allocation During Sequential 5-fold Cross-Validation
Figure 5.5 Example of Mean Cross-Validation Error Contour Plot in xrnet R package
Figure 6.1 PrediXcan Framework for Gene Expression Imputation
Figure 6.2 Prediction Performance, as measured by test AUC, of Ridge-Lasso Compared to Standard Ridge Regression by Number of Predictors and External Data Signal-to-Noise Ratio (ρ_z = 0.2, q = 100)
Figure 6.3 Prediction Performance, as measured by test AUC, of Ridge-Lasso Compared to Standard Ridge Regression by Number of External Data Variables and External Data Signal-to-Noise Ratio (ρ_z = 0.2, p = 500)
Abstract
The evolution of technologies in the last half century has led to an exponential growth in
data across multiple domains. With a corresponding increase in processing power, there is
significant interest in methods to extract novel information from a number of growing data
stores. Our focus is primarily on biological data generated as a result of these advances that can
also be classified as high-dimensional. High-dimensionality is attributed to data where the
number of variables (𝑝 ) exceeds the number of available observations (𝑛 ) . We are particularly
interested in array and sequencing data that encodes information on the structure, function, and
regulation of the genome. The potential number of features under consideration in current
genomic analyses can range from tens of thousands in microarray studies to easily over a million
variants in genome-wide association studies (GWAS). Sample sizes can range from under a
hundred to hundreds of thousands. Increased availability of genomic data has enabled research
on the causal relationships between genomic regions and health-related outcomes and the
incorporation of genomic features into prediction models. Analyses have also expanded from the
use of a single type of genomic data to methods that can integrate multiple data types into a joint
analysis.
The primary aim of this thesis is the development of a statistical method that enables the
integration of external information that is potentially informative for the effects of genomic
features on health-related outcomes. One potential source of external information is the growing
collection of structural and functional annotations that help describe how genes encode
biological functions. We are interested in using such external sources to improve the predictive
ability of genomic data. In the introduction, we describe the general challenges associated with
high-dimensional data and how a class of methods, commonly referred to as regularization, can
address these issues. The connection between regularization and hierarchical models is then used
as motivation for our general framework that enables the incorporation of external data for high-
dimensional data predictors in what we will refer to as hierarchical regularized regression.
Chapter 2 is devoted to the implementation of our method for the prediction of continuous
outcomes under specific types of regularization. Chapter 3 implements a more efficient
algorithm to fit our model based on coordinate gradient descent. Chapter 4 provides several
applications of our method to real data examples. Chapter 5 describes the R package and
associated functions we have developed to implement our method. We finish our discussion by
extending the model to binary outcomes.
Chapter 1 Introduction
1.1 High-Dimensional Data and Associated Challenges in Prediction
For the majority of the 20th century, the primary methods for statistical inference and
prediction focused on scenarios in which the sample size (𝑛) was much larger than the number of
variables under consideration (𝑝); before the advent of technologies generating large volumes
of data, these scenarios were the norm. A result of this focus on ‘large n, small p’ was the
development of a rich collection of likelihood-based approaches with estimators that have good
statistical properties in terms of unbiasedness and minimal variance. However, with increased
interest in high-dimensional data, many popular methods for inference and prediction were found
to be inadequate. One example of an issue that becomes more apparent as 𝑝 → 𝑛 is
multicollinearity, which occurs when one predictor is highly correlated with a linear combination
of the other predictors in a sample. For finite n, we expect the probability of spurious correlations
between features in a data set to increase as 𝑝 becomes large (Fan, Han and Liu, 2014).
Additionally, for a given outcome of interest, the probability of finding a false association of any
variable with the outcome dramatically increases with p. Both spurious and true correlations
between variables in a finite data set drive the central challenge underlying the
development of any prediction model: overfitting. Intuitively, overfitting occurs when a model is
capturing random noise in a set of data and not the true signal describing the association between
a set of features and the outcome.
More formally, consider the prediction of an outcome 𝒀 , as a function of p features from
n independent observations, designated by the 𝑛 × 𝑝 matrix 𝑿 .
$\mathbf{Y} = f(\mathbf{X}) + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim MVN(\mathbf{0}_n, \sigma^2 \mathbf{I}_n)$
Here, 𝝐 represents random noise associated with each measured observation, with errors assumed
to be uncorrelated across observations. The goal is to estimate the true regression function, 𝑓 , by
a member of a finite dimensional (i.e. parametric) or infinite dimensional family of functions (i.e.
non-parametric). To assess the performance of 𝑓 ̂
, we can evaluate its ‘predictive ability’ at a new
data point, (𝑋 0
,𝑌 0
) , not part of the training data, using a loss function to quantify the distance
between prediction 𝑓 ̂
(𝑋 0
) and 𝑌 0
. A common loss function that will be employed throughout this
paper is the quadratic loss, whose expectation is decomposed as follows.
𝑅𝑖𝑠𝑘 (𝑌 0
,𝑓 ̂
(𝑋 0
)) = 𝐸 [(𝑌 0
− 𝑓 ̂
(𝑋 0
))
2
] = 𝐵𝑖𝑎𝑠 [𝑓 ̂
(𝑋 0
)]
2
+ 𝑉𝑎𝑟 [𝑓 ̂
(𝑋 0
)] + 𝑉𝑎𝑟 (𝜖 )
The decomposition is referred to as the bias-variance tradeoff. The first term captures the error
associated with misspecification of $\hat{f}$. The second term captures the error due to variability in the
estimator, $\hat{f}$. The last term represents the irreducible error that is present regardless of the choice
of $f$. As 𝑝 increases, the flexibility of $\hat{f}$ to model the outcome tends to increase and an initial
decrease in bias occurs if the additional predictors decrease the model misspecification.
However, as p approaches n, many potential estimators can become highly sensitive to the input
data, 𝑿, leading to an estimator with high variance and poor generalization to external data. In
addition to overfitting, when 𝑝 > 𝑛, many methods can become computationally intractable.
As an example, consider the use of multiple linear regression to model the continuous
outcome, $\mathbf{Y} \in \mathbb{R}^n$, centered on its mean, and the predictor matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$, assumed to be
standardized such that $\frac{1}{n}\sum_{i=1}^{n} x_{ij} = 0$ and $\frac{1}{n}\sum_{i=1}^{n} x_{ij}^2 = 1$ for all $j = 1, \ldots, p$. Note that we will
assume that 𝒀 is centered and 𝑿 is standardized through the rest of the discussion.

$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim MVN(\mathbf{0}_n, \sigma^2 \mathbf{I}_n)$
𝜷 is the 𝑝 × 1 vector of regression coefficients, describing the association of each feature with the
outcome, that we would like to estimate. The standard solution for 𝜷, the least-squares (LS)
estimator, is obtained by solving a system of linear equations known as the normal equations, with
solution:

$\hat{\boldsymbol{\beta}}_{LS} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$

The Gauss-Markov theorem states that the LS estimator is the best linear unbiased estimator (BLUE).
Along with these favorable statistical properties, ease of computation and interpretation have
made linear regression a popular statistical method. However, the solution is intractable
when 𝑝 > 𝑛, as the matrix $\mathbf{X}^T\mathbf{X}$ is not of full rank and its inverse does not exist. The LS
estimator also exhibits high variance when predictors are highly correlated.
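A minimal R sketch of the normal-equations solution, with simulated data and arbitrary dimensions, illustrates both the closed form and its breakdown when 𝑝 exceeds 𝑛.

# Least-squares via the normal equations; the fit breaks down once p > n
# because X'X is no longer full rank. Dimensions here are arbitrary.
set.seed(1)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))              # standardized predictors
beta <- c(2, -1, 0.5, 0, 0)
y <- drop(X %*% beta + rnorm(n))
y <- y - mean(y)                                     # centered outcome
beta_ls <- solve(crossprod(X), crossprod(X, y))      # (X'X)^{-1} X'y
# In the high-dimensional case the normal equations have no unique solution:
X_wide <- matrix(rnorm(50 * 200), 50, 200)
qr(crossprod(X_wide))$rank                           # rank <= 50 < 200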
1.2 Regularization, Optimization, and Regression
1.2.1 Definition
In response to the previous limitations, the analysis of high-dimensional data has become
an active research area with numerous methodological approaches developed in both statistics
and computer science. Our interest is in the use of regularization or shrinkage-type methods to
address high-dimensionality, with a specific focus on genomic data. Before describing current
methods, we provide a broader definition of regularization and its relation to mathematical
optimization. Again, consider the previous model to predict an outcome 𝒀 based on a function of
𝑝 features, 𝑓 (𝑿 ) . We now assume that f takes on a parametric form that is dependent on a finite
number of parameters. As mentioned previously, we want to find the estimator, $\hat{f}$, that
minimizes the expected prediction error. Additionally, the estimator needs to be computationally
tractable when 𝑝 > 𝑛 and relatively insensitive to variability in the set of data, 𝑿 , used to
construct the estimator.
From an optimization perspective, regularization achieves these aims by adding
additional information into the estimation in the form of constraints on the parameter space.
Returning to our example using linear regression, the LS estimator can be viewed as the solution
to an unconstrained optimization. Unconstrained optimization is the process of minimizing (or
maximizing) an objective function with respect to a set of variables and takes the general form:

$\underset{\boldsymbol{\beta}}{\text{minimize}} \;\; f(\boldsymbol{\beta})$
The objective function for the LS problem minimizes the sum of squared errors. Note that we
have replaced the random vector 𝒀 with the observed vector 𝒚 .
$\underset{\boldsymbol{\beta}}{\text{minimize}} \;\; \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2$

where $\|\mathbf{x}\|_q = \left(\sum_i |x_i|^q\right)^{1/q}$ is referred to as the $l_q$-norm.
Constrained optimization also optimizes an objective function, but within a restricted solution
space defined by a set of constraints that are a function of the variables to be optimized. The set
of possible solution points that satisfy the constraints is known as the feasible set. Two types of
constraints can be used to represent the feasible set, equality constraints and inequality
constraints. The standard form for a constrained optimization is shown below and is referred to
as the primal problem.
$\underset{\boldsymbol{\beta}}{\text{minimize}} \;\; f(\boldsymbol{\beta})$
$\text{subject to} \;\; g_i(\boldsymbol{\beta}) - c_i \leq 0, \quad i = 1, \ldots, r \quad (\text{inequality})$
$\phantom{\text{subject to}} \;\; h_j(\boldsymbol{\beta}) - d_j = 0, \quad j = 1, \ldots, s \quad (\text{equality})$
As an example, consider the addition of one inequality constraint to the LS problem.
$\underset{\boldsymbol{\beta}}{\text{minimize}} \;\; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 \quad \text{subject to} \;\; g(\boldsymbol{\beta}) - c \leq 0, \; c > 0$

In this case, the feasible set is defined as $\{\boldsymbol{\beta} \in \mathbb{R}^p : g(\boldsymbol{\beta}) \leq c\}$, where $c \in \mathbb{R}_{++}$ is a fixed value.
An alternate representation is the Lagrange dual problem. To form the dual problem, a Lagrange
function is used to express the primal problem as a weighted sum of the objective and constraint
functions, where 𝝀 and 𝜸 are referred to as the dual variables.
$\mathcal{L}(\boldsymbol{\beta}, \boldsymbol{\lambda}, \boldsymbol{\gamma}) = f(\boldsymbol{\beta}) + \sum_i \lambda_i \left(g_i(\boldsymbol{\beta}) - c_i\right) + \sum_j \gamma_j \left(h_j(\boldsymbol{\beta}) - d_j\right), \quad \boldsymbol{\lambda} \succeq 0$
The Lagrange dual function is then expressed as a function of the dual variables,
$g_{dual}(\boldsymbol{\lambda}, \boldsymbol{\gamma}) = \inf_{\boldsymbol{\beta}} \mathcal{L}(\boldsymbol{\beta}, \boldsymbol{\lambda}, \boldsymbol{\gamma})$
Note that for any solution in the feasible set, the Lagrange dual function forms a lower bound for
the primal solution. To reduce the difference between the primal and dual solutions, referred to
as the duality gap, we want to maximize the Lagrange dual function, which gives us the
Lagrange dual problem.
$\underset{\boldsymbol{\lambda}, \boldsymbol{\gamma}}{\text{maximize}} \;\; g_{dual}(\boldsymbol{\lambda}, \boldsymbol{\gamma}) \quad \text{subject to} \;\; \boldsymbol{\lambda} \succeq 0$
Returning to the constrained LS problem, the Lagrange function and Lagrange dual function are,
$\mathcal{L}(\boldsymbol{\beta}, \lambda) = \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\left(g(\boldsymbol{\beta}) - c\right)$

$g_{dual}(\lambda) = \inf_{\boldsymbol{\beta}} \left\{ \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\left(g(\boldsymbol{\beta}) - c\right) \right\}$
To minimize the Lagrange function, differentiate with respect to 𝜷, set to zero, and solve for 𝜷:

$\nabla_{\boldsymbol{\beta}} \mathcal{L}(\boldsymbol{\beta}, \lambda) = \mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} - \mathbf{y}) + \lambda \frac{\partial}{\partial \boldsymbol{\beta}} g(\boldsymbol{\beta}) = 0$

$\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y} - \lambda \frac{\partial}{\partial \boldsymbol{\beta}} g(\boldsymbol{\beta})$
Without knowing the functional form of 𝑔 (𝜷 ) , we cannot proceed further. In general, one would
solve for the primal variable and then substitute the solution back into the Lagrange dual
function to form the objective function for the Lagrange dual problem. Under conditions known
as strong duality, the primal and dual problems yield equivalent solutions. In either form, we must
specify 𝑔(𝜷) both to reflect our beliefs about the underlying form of the parameter space and to be
computationally tractable, as will be discussed in the next section.
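As a concrete illustration (an added example, not part of the original derivation), suppose we take $g(\boldsymbol{\beta}) = \frac{1}{2}\|\boldsymbol{\beta}\|_2^2$. Then $\frac{\partial}{\partial \boldsymbol{\beta}} g(\boldsymbol{\beta}) = \boldsymbol{\beta}$ and the stationarity condition above becomes

$(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I}_p)\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}$

which is precisely the linear system that defines the ridge estimator introduced in Section 1.2.3.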
1.2.2 Convex Optimization
Least-squares falls under the category of a special class of problems known as convex
optimization problems, for which efficient algorithms exist to compute their solution. A
constrained optimization of the form
$\underset{\mathbf{x}}{\text{minimize}} \;\; f(\mathbf{x}) \quad \text{subject to} \;\; g_i(\mathbf{x}) \leq b_i, \; i = 1, \ldots, k, \quad \mathbf{x} \in \mathbb{R}^p$
is considered to be a convex optimization if it meets the following criteria (Boyd and
Vandenberghe, 2004).
1. The objective function, $f: \mathbb{R}^p \rightarrow \mathbb{R}$, is convex with respect to the parameters
2. The constraint functions, $g_i: \mathbb{R}^p \rightarrow \mathbb{R}, \; i = 1, \ldots, k$, are convex with respect to the parameters
* Strong duality is guaranteed to hold under the assumptions that
a) The objective function is quadratic
b) The constraint functions are affine
The attractive computational properties of convex optimizations have resulted in the
development of regularization methods that tend to have objective and constraint functions that
meet these criteria. In the next few sections, we review a family of such methods known as
regularized regression in the literature. In the following set of previously developed methods, the
objective function is quadratic (and convex), and the constraint function tends to take the general
form below. The reason for this form will become clear in the next few sections.

$g(\boldsymbol{\beta}) = \delta \sum_i |\beta_i| + \frac{1 - \delta}{2} \sum_i \beta_i^2 \leq t, \quad t \geq 0, \; 0 \leq \delta \leq 1$
As we will see, regularized regression can improve prediction in high-dimensional data sets by
restricting the parameter space to a class of biased estimators that can have significantly lower
variance.
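This penalty family is implemented in the widely used glmnet R package, whose alpha argument plays the role of δ above. The following brief sketch uses simulated data with arbitrary dimensions purely to show how δ interpolates between the two methods reviewed next.

# delta = alpha interpolates between ridge (0) and lasso (1) in glmnet's
# elastic-net penalty; X, y, and the dimensions below are simulated.
library(glmnet)
set.seed(1)
n <- 100; p <- 200
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:5] %*% rep(1, 5) + rnorm(n))
fit_ridge <- glmnet(X, y, alpha = 0)     # delta = 0: pure l2 penalty
fit_lasso <- glmnet(X, y, alpha = 1)     # delta = 1: pure l1 penalty
fit_enet  <- glmnet(X, y, alpha = 0.5)   # an intermediate mix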
1.2.3 Ridge Regression
One of the original motivations for regularization in a regression context was to address
multicollinearity (Hoerl and Kennard, 1970) and the ill-conditioning of the covariance matrix
associated with it. The solution developed by Hoerl and Kennard, coined ridge regression, adds a
positive quantity, 𝜆 , to the diagonal of 𝑿 𝑻 𝑿 and results in the ridge estimator
$\hat{\boldsymbol{\beta}}_{Ridge}(\lambda) = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I}_p)^{-1}\mathbf{X}^T\mathbf{y}, \quad \lambda \geq 0$
Motivation for their method stems from the relationship between the mean squared error (MSE)
of the regression estimates and the eigenvalues of 𝑿 𝑻 𝑿 . The expectation and variance of
the $MSE(\hat{\boldsymbol{\beta}}_{LS}, \boldsymbol{\beta})$ are

$E\left[MSE(\hat{\boldsymbol{\beta}}_{LS}, \boldsymbol{\beta})\right] = \sigma^2 \, \text{Trace}\left[(\mathbf{X}^T\mathbf{X})^{-1}\right] = \sigma^2 \sum_{i=1}^{p} \frac{1}{d_i}$

$Var\left[MSE(\hat{\boldsymbol{\beta}}_{LS}, \boldsymbol{\beta})\right] = 2\sigma^4 \, \text{Trace}\left[(\mathbf{X}^T\mathbf{X})^{-2}\right] = 2\sigma^4 \sum_{i=1}^{p} \frac{1}{d_i^2}$
where $d_i, \; i = 1, \ldots, p$ are the eigenvalues of $\mathbf{X}^T\mathbf{X}$. We see that the performance of the LS
estimator is highly dependent on the minimum eigenvalue. Significant multicollinearity leads to
small eigenvalues and results in an estimator with a large expected difference from the true
parameter vector and high variance. By adding a positive constant to the diagonal, the eigenvalues
of $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}_p)^{-1}$ become $(d_i + \lambda)^{-1}$, removing the instability induced by $(\mathbf{X}^T\mathbf{X})^{-1}$. For any choice of $\lambda > 0$, the
resulting estimator is biased, but for a subset of $\lambda \in \mathbb{R}_+$, the reduction in variance can be much
larger relative to the increase in bias. By proper choice of $\lambda$, we can obtain an overall reduction in
MSE compared to the LS estimator (Figure 1.1).
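The effect on the eigenvalues can be seen directly in a small R sketch; the design below, with one nearly duplicated column, is purely hypothetical.

# Adding lambda to the diagonal shifts each eigenvalue of X'X from d_i to
# d_i + lambda, taming the 1/d_i terms that blow up under near-collinearity.
set.seed(1)
X <- matrix(rnorm(60 * 40), 60, 40)
X[, 2] <- X[, 1] + rnorm(60, sd = 0.01)     # near-collinear pair
d <- eigen(crossprod(X), symmetric = TRUE, only.values = TRUE)$values
lambda <- 5
c(sum(1 / d), sum(1 / (d + lambda)))        # trace((X'X)^-1) vs trace((X'X + lambda*I)^-1)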
As an unconstrained optimization, ridge regression adds a penalty on the sum of squares
of the regression coefficients,
$\underset{\boldsymbol{\beta}}{\text{minimize}} \;\; \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \frac{\lambda}{2}\|\boldsymbol{\beta}\|_2^2, \quad \lambda \geq 0$

A similar constrained optimization problem that leads to the same solution can be written as,

$\underset{\boldsymbol{\beta}}{\text{minimize}} \;\; \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 \quad \text{subject to} \;\; \|\boldsymbol{\beta}\|_2^2 \leq t, \; t \geq 0$
From the form of the constraint, the coefficients shrink towards zero as $t \rightarrow 0$ or, equivalently,
as $\lambda \rightarrow \infty$. This is more easily seen from the solution to the ridge estimator, obtained by
differentiation of the objective function, when 𝑿 is orthogonal (i.e. $\mathbf{X}^T\mathbf{X} = \mathbf{I}_p$):

$\hat{\boldsymbol{\beta}}_{Ridge}(\lambda) = \frac{\hat{\boldsymbol{\beta}}_{LS}}{1 + \lambda}$

An interesting side note is that Hoerl and Kennard focused on the case where 𝑿 has full column
rank but is nearly singular. Ridge regression has since been extended to high-dimensional data, where
the ridge penalty also ensures that $\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}_p$ is nonsingular, so that its inverse exists when 𝑝 > 𝑛.
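A minimal R sketch of the ridge closed form (not the solver developed later in this thesis), with simulated data and an arbitrary grid of penalty values, shows both points: the estimate remains computable when 𝑝 > 𝑛, and it shrinks towards zero as 𝜆 grows.

# The ridge closed form (X'X + lambda*I_p)^{-1} X'y stays computable when p > n,
# and the l2 norm of the estimate shrinks as lambda grows. Data are simulated.
set.seed(1)
n <- 50; p <- 200
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:10] %*% rep(0.5, 10) + rnorm(n))
y <- y - mean(y)
ridge_fit <- function(lambda) solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
sapply(c(0.1, 1, 10, 100), function(l) sum(ridge_fit(l)^2))   # decreasing squared norms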
Figure 1.1 Original results of Hoerl and Kennard comparing least-squares and ridge estimators (Annotated). The plot shows $MSE(\hat{\beta}, \beta)$ as a function of $\lambda$, with $MSE(\hat{\beta}_{LS})$, $MSE(\hat{\beta}_{Ridge})$, $Var(\hat{\beta}_{Ridge})$, and $Bias(\hat{\beta}_{Ridge})^2$ overlaid.

1.2.4 Lasso Regression
Tibshirani (1996) introduced a similar regularization method that possesses the shrinkage
property of ridge regression and the added ability to perform variable selection. The least
absolute shrinkage and selection operator, or lasso, replaces the $l_2$-norm with the $l_1$-norm, the sum
of the absolute value of the regression coefficients.
$\underset{\boldsymbol{\beta}}{\text{minimize}} \;\; \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1, \quad \lambda \geq 0$

where $\|\mathbf{x}\|_1 = \sum_i |x_i|$.
This constraint also results in shrinkage of the coefficients towards zero. However, unlike ridge
regression, the lasso can shrink some coefficients exactly to zero, a fact that is best visualized by
the shape of the constraint function (Figure 1.2) and by the case when 𝑿 is orthogonal, in which
the solution simplifies to soft-thresholding:

$\hat{\boldsymbol{\beta}}_{lasso}(\lambda) = S(\mathbf{X}^T\mathbf{y}, \lambda), \qquad S(\mathbf{x}, \lambda) = \text{sign}(\mathbf{x})(|\mathbf{x}| - \lambda)_+$
This enables lasso regression to produce a sparse solution, where only a subset of the p variables
have nonzero estimates, making it a strong candidate for a number of high-dimensional problems
in which we believe the number of true nonzero effects is small relative to 𝑝 . This behavior
contrasts with ridge regression, which tends to perform best when we expect a moderate number
of correlated nonzero effects. Despite the non-differentiability of the $l_1$-norm, which initially
posed computational challenges, a number of different optimization methods are now available
to efficiently compute its solution (Efron et al., 2004; Friedman, Hastie and Tibshirani, 2010).
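A small R sketch of the soft-thresholding operator described above, applied to an arbitrary vector chosen for illustration:

# Soft-thresholding S(x, lambda) = sign(x) * max(|x| - lambda, 0): entries within
# lambda of zero are set exactly to zero, the rest are shrunk towards zero by lambda.
soft_threshold <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)
soft_threshold(c(-3, -0.4, 0.2, 1.5), lambda = 0.5)
# [1] -2.5  0.0  0.0  1.0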
Figure 1.2 Contours of objective and constraint functions for the lasso (left) and the ridge
(right) (James et al., 2013)
1.3 Shrinkage, Regularization, and Hierarchical Models
Regularized regression is also referred to as a shrinkage method. Copas (1983) defines
shrinkage as the general reduction in prediction performance of a model when used on a new
data set compared to performance in the data set used to construct the model (i.e. overfitting). He
goes on to note interest in “preshrunk” estimators that can “anticipate” potential overfitting by
incorporating additional information. These estimators have also earned this name due to their
general tendency to shrink estimates towards potentially informative values. Both views align
nicely with the motivation behind regularized regression and the solutions to ridge and lasso
might be thought of as “preshrunk” estimators.
Another method that relates to regularized regression is a Bayesian hierarchical model in
which we treat the model parameters as random variables. One example of a shrinkage estimator
based on a hierarchical model is the James-Stein estimator (James and Stein, 1961), which is
derived as the posterior mean of the following two-level model
$y_i \,|\, \mu_i \sim N(\mu_i, \sigma^2)$
$\mu_i \,|\, M, A \sim N(M, A)$
The parameters of the normal distributions are empirically estimated from the data, which results
in the following general form for the James-Stein (JS) estimator.
$\hat{\boldsymbol{\theta}} = \bar{y} + \left(1 - \frac{(k - 3)\sigma^2}{\|\mathbf{y} - \bar{y}\|_2^2}\right)(\mathbf{y} - \bar{y}), \quad k \geq 4$
An example by Efron (2010) compared the maximum likelihood (ML) estimator to the JS
estimator on baseball statistics. He predicts future batting averages (#hits / # at bats) for 18 major
league baseball players based on their first 45 attempts at bat. If we let each initial batting
average be a binomial proportion,
$n p_i \sim \text{binomial}(n, P_i), \quad n = 90, \quad P_i = \text{true batting average}$

then a normal approximation of the estimate is

$p_i \,|\, P_i \sim N(P_i, \sigma^2)$
Table 1.1 presents the true batting average of each player for the rest of the season, $p_i^{(True)}$,
along with the corresponding ML and JS estimates, defined as

$\hat{p}_i^{(ML)} = p_i$

$\hat{p}_i^{(JS)} = \bar{p} + \left(1 - \frac{(k - 3)\sigma^2}{\sum_i (p_i - \bar{p})^2}\right)(p_i - \bar{p}), \quad k = \text{number of parameters}$

where $p_i$ is the initial batting average for the $i$th player, $\bar{p}$ is the grand mean of the initial
averages across all 18 players, and $\sigma^2 = \bar{p}(1 - \bar{p})/90$ is the binomial variance.
From Table 1.1, we see that for the JS estimator, the individual batting averages are
shrunk towards the grand mean of 0.265 and there is an overall improvement in prediction
compared to the ML estimator. As in regularized regression, the JS estimator is biased, but has
decreased variance compared to the ML estimator which leads to an overall decrease in
prediction error. However, the JS estimator also significantly underestimates the true batting
average for one player, Roberto Clemente (an MLB hall of famer). Certainly, if we had
considered prior information beyond an initial batting average, we might have wanted to shrink
his ML estimate less.
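A minimal R sketch of the JS formula above, applied to simulated normal means rather than the baseball data (the number of means, the true spread, and the noise level are arbitrary, and σ² is treated as known, matching the plug-in style used above):

# James-Stein shrinkage of k observed means toward their grand mean.
set.seed(1)
k <- 18; sigma <- 0.05
mu <- rnorm(k, mean = 0.265, sd = 0.04)       # hypothetical "true" means
y  <- rnorm(k, mean = mu, sd = sigma)         # one noisy observation per mean
js <- mean(y) + (1 - (k - 3) * sigma^2 / sum((y - mean(y))^2)) * (y - mean(y))
c(mle = mean((y - mu)^2), js = mean((js - mu)^2))  # JS typically has lower risk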
Building off the previous example, a connection can be made between regularized
regression and hierarchical models. Hoerl and Kennard originally noted that their method could
be viewed in a Bayesian context, where we assume that each 𝛽 𝑗 is normally distributed with
mean zero and variance $\tau^2$:

$\mathbf{y} \,|\, \mathbf{X}, \boldsymbol{\beta} \sim MVN(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n)$
$\beta_j \,|\, \tau^2 \sim N(0, \tau^2), \quad j = 1, \ldots, p$
Table 1.1 Comparison of Maximum Likelihood and James-Stein Estimators on Baseball Data (Adapted from Efron 2010)

Player Name        Initial Batting Average (Hits / At bats)    p_i (True)    p̂_i (ML)    p̂_i (JS)
Clemente           18 / 45                                     0.346         0.400        0.294
F Robinson         17 / 45                                     0.298         0.378        0.289
F Howard           16 / 45                                     0.276         0.356        0.285
Johnstone          15 / 45                                     0.222         0.333        0.280
Berry              14 / 45                                     0.273         0.311        0.275
Spencer            14 / 45                                     0.270         0.311        0.275
Kessinger          13 / 45                                     0.263         0.289        0.270
L Alvarado         12 / 45                                     0.210         0.267        0.266
Santo              11 / 45                                     0.269         0.244        0.261
Swoboda            11 / 45                                     0.230         0.244        0.261
Unser              10 / 45                                     0.264         0.222        0.256
Williams           10 / 45                                     0.256         0.222        0.256
Scott              10 / 45                                     0.303         0.222        0.256
Petrocelli         10 / 45                                     0.264         0.222        0.256
E Rodriguez        10 / 45                                     0.226         0.222        0.256
Campaneris          9 / 45                                     0.286         0.200        0.252
Munson              8 / 45                                     0.316         0.178        0.247
Alvis               7 / 45                                     0.200         0.156        0.242
Overall Average                                                0.265         0.265        0.265
MSE(p_i, p̂_i)                                                                0.0755       0.0214
The posterior mean for 𝜷 turns out to be the ridge estimator, and the optimal value for 𝜆 is the
ratio of the error variances, $\sigma^2/\tau^2$. We also note that a hierarchical model in terms of the
coefficients can be viewed as a multivariate version of the model used to define the JS estimator,
with $M = 0$ and $A = (\sigma^2/\lambda)\mathbf{I}_p$:

$\hat{\boldsymbol{\beta}} \,|\, \boldsymbol{\beta} \sim MVN\left(\boldsymbol{\beta}, \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}\right)$
$\boldsymbol{\beta} \sim MVN\left(\mathbf{0}_p, \frac{\sigma^2}{\lambda}\mathbf{I}_p\right)$
A similar relation for lasso regression is described by Park and Casella (2008), where they
expand on the initial observation by Tibshirani that the Bayesian equivalent assumes the 𝛽 ′𝑠
have independent and identically distributed Laplace priors.
$\mathbf{y} \,|\, \mathbf{X}, \boldsymbol{\beta} \sim MVN(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n)$
$\beta_j \,|\, \lambda \sim DE(\lambda), \quad j = 1, \ldots, p$
We draw attention to the relationship between regularized regression and hierarchical
models to emphasize that the constraints imposed by the additional penalty terms in regularized
regression act as ‘prior’ information for the underlying parameter space. When analyzing high-
dimensional data, such as gene expression or GWAS, a natural prior to assume is that most of the
genomic features have near zero or no effect on the outcome of interest. The constraints imposed
by ridge and lasso regression capture this underlying belief as exhibited by their Bayesian
counterparts. The connection between regularization and hierarchical models also raises the
question of what other types of ‘prior’ information we could potentially include in terms of
constraint functions that may improve upon the current approaches that default to zero.
1.4 Integration of External Data
In terms of incorporating other ‘prior’ information, our primary aim is to integrate
external information for genomic data within a regularization framework. Several previous
studies have already investigated methods that use regularized regression to incorporate a priori
information, including genomic annotations, and have shown improvements in prediction
performance and variable selection (Bergersen, Glad and Lyng, 2011; Tai and Pan, 2007; van de
Wiel et al., 2016).
Bergersen et al. (2011) present an extension of regularized regression that is similar to
the adaptive lasso (Zou, 2006). Briefly, the adaptive lasso generalizes the lasso by assigning
different penalty weights to different coefficients. This modification is commonly referred to as a
weighted lasso. The objective function is modified to include a diagonal matrix, 𝑷, with a
variable-specific weight vector, 𝒘, on the diagonal:

$\underset{\boldsymbol{\beta}}{\text{minimize}} \;\; \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\mathbf{P}\boldsymbol{\beta}\|_1, \quad \lambda \geq 0, \qquad \mathbf{P} = \text{diag}(w_j), \; j = 1, \ldots, p$
The adaptive lasso defines the weights as

$w_j = \frac{1}{|\hat{\beta}_j|^{\gamma}}$
where $\hat{\beta}_j$ is a root-n-consistent estimator of $\beta_j$ (e.g. the LS estimator) and $\gamma > 0$. The weighting allows
predictors with larger estimates $\hat{\beta}_j$ to be penalized less. The estimation process for the
adaptive lasso is also an oracle procedure (Fan and Li, 2001), defined as an estimation procedure
whose resulting estimator has the following properties (asymptotically):
1) Consistency in variable selection (i.e. it identifies the right subset of variables with true nonzero effects)
2) Asymptotic normality / optimal estimation rate
Bergersen et al. (2011) extend this idea by allowing the weights to be a function of the
outcome, predictors, and/or external information, 𝒁, where 𝛾 is user-defined or tuned by cross-validation:

$w_j = \frac{1}{|g_j(\mathbf{y}, \mathbf{X}, \mathbf{Z})|^{\gamma}}$
As with the adaptive lasso, the weights are intended to capture the importance of each predictor
and thus how strongly that predictor should be penalized. The method was applied to cervix
cancer data in which time to metastasis was the outcome, cDNA microarray gene expression data
acted as the predictors, and gene copy number was used as the external information. They
proceed to define the weights for each gene as the Spearman correlation coefficient between the
gene expression level and gene copy number, $w_j = \hat{\rho}_j$, with the belief that genes with high
correlation between expression and copy number may be drivers in terms of time to metastasis.
Using a time-integrated version of the area under the ROC curve, iAUC, they found a significant
improvement in performance in a validation data set by using the weighted lasso (iAUC = 0.753)
compared to the standard lasso (iAUC = 0.561).
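A hedged R sketch of this kind of externally informed weighted lasso is shown below, using glmnet's penalty.factor argument to supply variable-specific weights. The external scores z, the weight function, and all dimensions are hypothetical and stand in for the correlation-based weights described above.

# Weighted lasso via variable-specific penalty factors; the weights mimic
# w_j = 1 / |g_j(y, X, Z)|^gamma with gamma = 1 and a hypothetical score z_j.
library(glmnet)
set.seed(1)
n <- 100; p <- 200
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:10] %*% rep(0.5, 10) + rnorm(n))
z <- c(abs(rnorm(10, mean = 2)), abs(rnorm(p - 10, mean = 0.2)))  # external scores
w <- 1 / abs(z)                                 # larger score -> smaller penalty
fit_weighted <- glmnet(X, y, penalty.factor = w)
fit_standard <- glmnet(X, y)                    # equal penalties for comparison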
Van de Wiel et al. (2016) developed a method that has elements of the group lasso
(Yuan and Lin, 2006) in which predictors are partitioned into groups based on the external data.
Ridge regression is then applied to the grouped variables and an application of empirical Bayes
is used to estimate group-specific penalties. In two data examples, they use their method to
classify precancerous cervical lesions from normal cervix tissue using methylation data. The
authors grouped each probe by its sample variance and by whether the probe was in a CpG island,
where methylation is more prevalent. The underlying belief is that probes that are in locations
with higher methylation may be more important for discriminating precancerous and normal
cervix tissue. Based on a 10-fold cross-validated AUC, they found that their group-regularized
method outperformed several other regularization methods, including standard ridge regression,
adaptive ridge, and group lasso.
Among currently available methods that use regularized regression, the common strategy
is to use external information to customize penalization of individual variables or sets of
variables. Our method takes a different approach motivated by the traditional two-level
hierarchical model where parameters are treated as random variables that are dependent on a
second level of hyperparameters (Greenland, 1993). As a result, we do not use the external data
to modify the rate of shrinkage for each variable, but rather the value to which each variable is
shrunk. In relation to the discussion of the James-Stein estimator, James and Stein (1961) were
the first to demonstrate that shrinkage, not just towards zero but towards any fixed point, can
outperform maximum likelihood estimation. This was clearly seen in the baseball example where the JS
estimator shrinks towards the average effect as opposed to zero. Current methods to incorporate
external information also require careful selection of the external data to define weighting
functions or to complete a priori grouping of variables. Our method is data-driven, where the
importance of the external information is estimated based on the data. This approach allows us to
include various types of external information (e.g. dichotomous, continuous) and for the external
features to be high-dimensional relative to the sample size. We aim to show that shrinking
estimates towards informative values based on external information improves prediction
performance as compared to standard approaches (e.g. ridge regression). In the next chapter, we
introduce our model as applied to continuous outcomes. The chapters that follow will expand
upon those results by presenting a more efficient method of estimation, applications to real data
examples, and how we can extend our method to other outcomes.
Chapter 2 Hierarchical Regularized Linear Regression
2.1 Overview
There is significant interest in methods that can incorporate prior information for
genomic data, with the hope that the external information can improve the predictive ability of
these data sources. To address this need, we propose an extension of regularized regression that
enables incorporation of external information in a hierarchical framework. Our primary focus in
this chapter is to introduce our general modelling framework in terms of a continuous outcome
and common types of regularization. To exemplify how the model might be used in a real data
application, we describe it in terms of an example. Specifically, we consider the prediction of a
continuous health outcome with gene expression data as predictor variables and genomic
annotations for those genes as the external data that we want to incorporate in the model. Gene
expression data, such as microarrays, contain quantitative measures of gene expression levels in
a cell or tissue. Genomic annotations are a catalogue of terms that classify the structural and
functional attributes of genes. The annotations are curated by large consortiums (e.g. GO
Consortium) of individuals that review literature to assess evidence for how genes encode
biological functions. In the application section (Chapter 4), we will provide a more thorough
overview of these data sources. We close the chapter with several simulations to demonstrate the
behavior of the model and under what scenarios our model has improved performance over
current methods.
2.2 Methods
2.2.1 General Framework
In the context of our example, let 𝒚 be the 𝑛 × 1 observed vector for the continuous
outcome of interest measured on 𝑛 subjects, 𝑿 the 𝑛 × 𝑝 matrix containing the genomic features
(i.e. gene expression levels) measured on all subjects, and 𝒁 the 𝑝 × 𝑞 matrix of the external
annotations for the genomic features. We denote 𝜷 as the 𝑝 × 1 vector of regression coefficients
describing the association of each genomic feature with the outcome and 𝜶 as the 𝑞 × 1 vector of
regression coefficients describing the association of each annotation feature with the regression
coefficients, 𝜷 . We assume that 𝒚 is centered and the matrices 𝑿 and 𝒁 have been centered and
scaled such that

$\frac{1}{n}\sum_{i=1}^{n} x_{ij} = 0, \quad \frac{1}{n}\sum_{i=1}^{n} x_{ij}^2 = 1, \quad j = 1, \ldots, p$

$\frac{1}{p}\sum_{j=1}^{p} z_{jk} = 0, \quad \frac{1}{p}\sum_{j=1}^{p} z_{jk}^2 = 1, \quad k = 1, \ldots, q$
This allows the intercept associated with 𝑿, $\beta_0$, to be excluded and the intercept associated
with 𝒁, $\alpha_0$, to equal $\bar{\beta}$. It is important to note that estimates from regularized regression are
not invariant to scaling. The decision of whether to standardize is data-dependent, but there is a
general notion that standardizing ensures all variables are on the same scale when penalized. The
results presented in this discussion generalize to the case when 𝑿 and/or 𝒁 are not scaled.
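As a small illustrative R sketch (simulated matrices with arbitrary dimensions), the centering and scaling convention above, which uses 1/n and 1/p rather than the unbiased 1/(n-1) scaling of scale(), can be applied as follows.

# Column-wise centering and scaling using the 1/n (and 1/p) convention above.
std_cols <- function(M) {
  M <- sweep(M, 2, colMeans(M), "-")
  sweep(M, 2, sqrt(colMeans(M^2)), "/")
}
set.seed(1)
n <- 100; p <- 200; q <- 10
X <- std_cols(matrix(rnorm(n * p), n, p))   # predictors, standardized over samples
Z <- std_cols(matrix(rnorm(p * q), p, q))   # annotations, standardized over features
y <- rnorm(n); y <- y - mean(y)             # centered outcome
round(c(mean(X[, 1]), mean(X[, 1]^2), mean(Z[, 1]), mean(Z[, 1]^2)), 6)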
Our method is partially motivated by a two-level hierarchical model, where in addition to
a regression of the outcome, 𝒚 , on the predictor variables (i.e. gene expression levels):
$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim MVN(\mathbf{0}_n, \sigma_1^2 \mathbf{I}_n)$
a second-level regression is used to model the association between the first stage coefficients, 𝜷 ,
and the external data (i.e. genomic annotations) (Conti and Witte, 2003)
$\boldsymbol{\beta} = \alpha_0 + \mathbf{Z}\boldsymbol{\alpha} + \boldsymbol{\gamma}, \quad \boldsymbol{\gamma} \sim MVN(\mathbf{0}_p, \sigma_2^2 \mathbf{I}_p)$
To exemplify the use of a two-level regression, we consider the simulations completed by Witte
and Greenland (1993) to incorporate nutrient levels when assessing the association of food intake
with risk of breast cancer. The underlying reasoning for their use of the second level is that the
nutrients (e.g. Calcium, Vitamin E) that make up foods may be associated with breast cancer.
Therefore, foods with similar nutrient profiles may have similar effects on the risk of breast
cancer. Inclusion of the second level shrinks the effects of foods with common nutrient levels
towards each other. As discussed in Chapter 1, shrinkage can reduce the variance of estimators,
resulting in improved estimation of the regression coefficients based on MSE.
In the context of our example, the hierarchy assumes that genes with common
annotations have a similar effect on the outcome. It is plausible that genes with similar molecular
functions or that are implicated in common biological processes may also have similar effects on
a health-related outcome. As an example, consider a model where 𝒁 contains a single binary
annotation that indicates whether a gene has function “A”. Then $\alpha_0 + \alpha$ represents the average
effect of all genomic features with function “A”. Due to the high dimensionality of the gene
expression data and potentially of the external annotation data, we cannot use conventional
methods for estimation. To jointly estimate 𝜷 and 𝜶 in a high-dimensional setting, we propose an
extension of regularized regression, termed here as ‘hierarchical regularized regression’, that
accounts for the hierarchical nature of the data and enables regularization of both sets of
parameters.
$$\underset{\boldsymbol{\beta},\,\alpha_0,\,\boldsymbol{\alpha}}{\text{minimize}}\;\; \frac{1}{2}\|\boldsymbol{y} - \boldsymbol{X\beta}\|_2^2 + \frac{\lambda_1}{r}\sum_{j=1}^{p}\bigl|\beta_j - \alpha_0 - \boldsymbol{Z}_j^T\boldsymbol{\alpha}\bigr|^r + \frac{\lambda_2}{s}\sum_{k=1}^{q}|\alpha_k|^s \qquad (2.1)$$
The objective function in equation 2.1 is minimized simultaneously with respect to $\boldsymbol{\beta}$, $\alpha_0$, and $\boldsymbol{\alpha}$, where $\lambda_1$ and $\lambda_2$ are treated as hyperparameters that must be tuned, and $r$ and $s$ are user-defined values that determine the type of regularization. To better understand the joint model, consider a
naïve approach to estimate 𝜷 and 𝜶 through two separate regressions. To estimate 𝜷 through a
regularized regression, an 𝑙 𝑟 penalty term is added to the standard least squares function,
resulting in the convex optimization
$$\underset{\boldsymbol{\beta}}{\text{minimize}}\;\; \frac{1}{2}\|\boldsymbol{y} - \boldsymbol{X\beta}\|_2^2 + \frac{\lambda_1}{r}\sum_{j=1}^{p}|\beta_j|^r \qquad (2.2)$$
Equivalently, equation 2.2 can be obtained by setting $\alpha_0$ and $\boldsymbol{\alpha}$ to zero in equation 2.1. The hyperparameter $\lambda_1$ controls the degree of regularization (i.e. shrinkage towards zero). Setting $r = 1$ yields lasso regression, while $r = 2$ results in ridge regression. For specified $r$ and $\lambda_1$, $\boldsymbol{\alpha}$ is then estimated through a separate regularized regression, i.e. by solving the optimization problem
$$\underset{\alpha_0,\,\boldsymbol{\alpha}}{\text{minimize}}\;\; \frac{1}{2}\|\hat{\boldsymbol{\beta}} - \alpha_0 - \boldsymbol{Z\alpha}\|_2^2 + \frac{\lambda_2}{s}\sum_{k=1}^{q}|\alpha_k|^s \qquad (2.3)$$
where $\hat{\boldsymbol{\beta}}$ is the vector of estimates obtained from equation 2.2. However, as the simulations will show, because this two-step approach does not directly incorporate the external data into the estimation of $\boldsymbol{\beta}$, its predictive performance is suboptimal compared to the joint estimation in equation 2.1.
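To make the naïve two-step approach concrete, the following R sketch (using the glmnet package; the objects X, Z, and y are placeholders) fits a cross-validated ridge regression of y on X and then a separate cross-validated lasso of the resulting estimates on Z. This is only the baseline comparator described above, not the proposed joint estimator.

```r
library(glmnet)

# Step 1: cross-validated ridge regression of y on X (alpha = 0 -> ridge penalty)
fit_x    <- cv.glmnet(X, y, alpha = 0)
beta_hat <- as.vector(coef(fit_x, s = "lambda.min"))[-1]   # drop the intercept

# Step 2: separate cross-validated lasso of the first-level estimates on Z
fit_z     <- cv.glmnet(Z, beta_hat, alpha = 1)             # alpha = 1 -> lasso penalty
alpha_hat <- as.vector(coef(fit_z, s = "lambda.min"))[-1]
```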
Comparing equation 2.1 to equation 2.2, note that the coefficients of the genomic features, $\boldsymbol{\beta}$, are not shrunk towards zero as $\lambda_1$ increases. Rather, the estimates are shrunk towards $\boldsymbol{Z\alpha}$. If the external data is truly noninformative, the additional penalty on the external data, $\lambda_2$, is intended to shrink the $\alpha$'s to zero, and equation 2.1 simplifies to
$$\underset{\boldsymbol{\beta},\,\alpha_0}{\text{minimize}}\;\; \frac{1}{2}\|\boldsymbol{y} - \boldsymbol{X\beta}\|_2^2 + \frac{\lambda_1}{r}\sum_{j=1}^{p}|\beta_j - \alpha_0|^r \qquad (2.4)$$
Since $\alpha_0 = \bar{\beta}$, $\boldsymbol{\beta}$ is shrunk towards the overall average effect of all genomic features as $\lambda_1$ increases. The idea of shrinking towards the average effect rather than towards zero has been proposed previously (Copas, 1983), and with proper cross-validation of the hyperparameters we show that there is little to no loss in performance when the external data is truly noninformative.
By retaining the penalty on the external data, we can also include high-dimensional or highly
correlated external data.
2.2.2 Ridge-Lasso
For the remainder of this chapter, we focus on the case where 𝑟 = 2 (ridge) and 𝑠 = 1
(lasso) and will refer to this setting as ridge-lasso regression. The ridge-lasso penalty scheme is
potentially a good fit for gene expression data where we might expect a moderate number of
correlated genes to be associated with the outcome, but only a small number of the available
annotations to be associated with the effect of those genes. We rewrite equation 2.1 with the specified $r$ and $s$, and for clarity the following derivations consider the case where the intercept for the external data is excluded ($\alpha_0 = 0$). The results generalize to the inclusion of an intercept, and a full derivation is provided in the appendix (A.1 and A.2). We have also replaced the summation notation with the equivalent norm representations.
$$\underset{\boldsymbol{\beta},\,\boldsymbol{\alpha}}{\text{minimize}}\;\; \frac{1}{2}\|\boldsymbol{y} - \boldsymbol{X\beta}\|_2^2 + \frac{\lambda_1}{2}\|\boldsymbol{\beta} - \boldsymbol{Z\alpha}\|_2^2 + \lambda_2\|\boldsymbol{\alpha}\|_1 \qquad (2.5)$$
To gain better intuition for the ridge-lasso model, consider the case where $\boldsymbol{X}$ and $\boldsymbol{Z}$ have orthonormal columns ($\boldsymbol{X}^T\boldsymbol{X} = \boldsymbol{I}_p$, $\boldsymbol{Z}^T\boldsymbol{Z} = \boldsymbol{I}_q$), so that an analytical solution to equation 2.5 exists. Under this orthonormality assumption, the score equations for $\boldsymbol{\beta}$ and $\boldsymbol{\alpha}$ are
$$(1 + \lambda_1)\boldsymbol{\beta} - \boldsymbol{X}^T\boldsymbol{y} - \lambda_1\boldsymbol{Z\alpha} = 0 \qquad (2.6)$$
$$\boldsymbol{\alpha} - \boldsymbol{Z}^T\boldsymbol{\beta} + \frac{\lambda_2}{\lambda_1}\frac{\partial}{\partial\boldsymbol{\alpha}}\|\boldsymbol{\alpha}\|_1 = 0 \qquad (2.7)$$
where $\hat{\boldsymbol{\beta}}_{LS} = \boldsymbol{X}^T\boldsymbol{y}$ is the vector of individual least-squares estimates for the regression of $\boldsymbol{y}$ on each $\boldsymbol{x}_j$. Solving equation 2.6 for $\boldsymbol{\beta}$,
$$\boldsymbol{\beta} = \left(\frac{1}{1 + \lambda_1}\right)\hat{\boldsymbol{\beta}}_{LS} + \left(\frac{\lambda_1}{1 + \lambda_1}\right)\boldsymbol{Z\alpha}$$
and substituting this solution into equation 2.7, the solution for $\boldsymbol{\alpha}$ is a lasso regression of the least-squares estimate of $\boldsymbol{\beta}$ on $\boldsymbol{Z}$ with penalty $(1 + \lambda_1)\lambda_2/\lambda_1$:
$$\tilde{\boldsymbol{\alpha}} = \boldsymbol{Z}^T\hat{\boldsymbol{\beta}}_{LS} - \frac{(1 + \lambda_1)\lambda_2}{\lambda_1}\frac{\partial}{\partial\boldsymbol{\alpha}}\|\boldsymbol{\alpha}\|_1$$
In terms of the soft-thresholding operator, the solution is
$$\tilde{\boldsymbol{\alpha}} = S\!\left(\boldsymbol{Z}^T\hat{\boldsymbol{\beta}}_{LS},\; \frac{(1 + \lambda_1)\lambda_2}{\lambda_1}\right) \qquad (2.8)$$
where $S(x, t) = \operatorname{sign}(x)(|x| - t)_+$ is applied elementwise.
The solution for $\boldsymbol{\beta}$ is then found by back-substituting $\tilde{\boldsymbol{\alpha}}$ into the score equation for $\boldsymbol{\beta}$:
$$\tilde{\boldsymbol{\beta}} = \left(\frac{1}{1 + \lambda_1}\right)\hat{\boldsymbol{\beta}}_{LS} + \left(\frac{\lambda_1}{1 + \lambda_1}\right)\boldsymbol{Z}\tilde{\boldsymbol{\alpha}} \qquad (2.9)$$
From the form of equation 2.9, the ridge-lasso estimator for 𝜷 is a weighted average of the least-
squares estimator and the “group-average” effect, 𝒁𝜶 , like the two-level hierarchical model.
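A minimal R sketch of this closed-form solution (equations 2.8 and 2.9), valid only under the orthonormality assumption above; the function name and inputs are illustrative.

```r
# Soft-thresholding operator S(x, t) = sign(x) (|x| - t)_+
soft_threshold <- function(x, t) sign(x) * pmax(abs(x) - t, 0)

# Closed-form ridge-lasso solution under orthonormal X and Z (eqns 2.8 and 2.9)
ridge_lasso_orthonormal <- function(X, Z, y, lambda1, lambda2) {
  beta_ls   <- drop(crossprod(X, y))                       # least-squares estimates X'y
  alpha_hat <- soft_threshold(drop(crossprod(Z, beta_ls)),
                              (1 + lambda1) * lambda2 / lambda1)
  beta_hat  <- beta_ls / (1 + lambda1) +
               (lambda1 / (1 + lambda1)) * drop(Z %*% alpha_hat)
  list(alpha = alpha_hat, beta = beta_hat)
}
```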
For general 𝑿 and 𝒁 , the optimization problem in equation 2.5 can be fit by several
convex optimization routines. However, a number of these methods are not necessarily the most
efficient for this problem and do not easily extend to a generalized linear model (GLM)
framework. Friedman et al. (2010) developed an efficient algorithm that utilizes coordinate
gradient descent to handle large-scale regularized regression under multiple outcome types.
Along with the algorithm, they developed the R package glmnet that implements their method
for lasso, ridge, and elastic net regression. We propose an algorithm that utilizes coordinate
gradient descent, as implemented in glmnet, and singular value decomposition (SVD) to jointly
estimate 𝜷 and 𝜶 . The following provides a derivation of the algorithm in the case when 𝑝 > 𝑛 .
We first use the relationship between ridge regression and the SVD. Let the SVD of $\boldsymbol{X}_{n \times p}$ be written as
$$\boldsymbol{X} = \boldsymbol{U}\boldsymbol{D}\boldsymbol{V}^T$$
where, under the assumption that $p > n$, $\boldsymbol{U}$ is an $n \times n$ orthogonal matrix, $\boldsymbol{V}$ is a $p \times n$ matrix with orthonormal columns, and $\boldsymbol{D}$ is an $n \times n$ diagonal matrix with singular values $d_1 \geq d_2 \geq \cdots \geq d_n > 0$. Note that the columns of $\boldsymbol{U}$ and $\boldsymbol{V}$ form bases for the column and row space of $\boldsymbol{X}$,
respectively. The columns of 𝑼 and 𝑽 are also referred to as the left singular vectors and right
singular vectors, respectively. Now consider the standard solution for ridge regression:
$$\hat{\boldsymbol{\beta}}_{Ridge}(\lambda) = (\boldsymbol{X}^T\boldsymbol{X} + \lambda\boldsymbol{I}_p)^{-1}\boldsymbol{X}^T\boldsymbol{y}$$
Rewriting the ridge estimator in terms of the SVD, where $\boldsymbol{X}^T\boldsymbol{X} = \boldsymbol{V}\boldsymbol{D}^2\boldsymbol{V}^T$, the ridge solution can be computed as
$$\hat{\boldsymbol{\beta}}_{Ridge}(\lambda) = \boldsymbol{V}\,\mathrm{diag}\!\left(\frac{d_j}{d_j^2 + \lambda}\right)\boldsymbol{U}^T\boldsymbol{y}$$
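As a small illustration, this SVD form of the ridge solution can be computed in a few lines of R (an illustrative sketch, not the implementation used in this work):

```r
# Ridge solution from the reduced SVD of X: beta = V diag(d / (d^2 + lambda)) U'y
ridge_svd <- function(X, y, lambda) {
  s <- svd(X)
  drop(s$v %*% ((s$d / (s$d^2 + lambda)) * crossprod(s$u, y)))
}
```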
To extend the use of this decomposition to our model, we note that in the prior definition
of the SVD, the number of left singular vectors and the number of right singular vectors are
equal to the min(𝑛 ,𝑝 ) . This definition is referred to as the “reduced” or “truncated” SVD. A
complete computation of the SVD when 𝑝 > 𝑛 results in 𝑛 left singular vectors and 𝑝 right
singular vectors (Figure 2.1).
The additional $p - n$ right singular vectors can be collected into a $p \times (p - n)$ matrix, $\boldsymbol{V}_\perp$, whose columns form an orthonormal basis for the orthogonal complement of the space spanned by the columns of $\boldsymbol{V}$, or equivalently a basis for the nullspace of $\boldsymbol{X}$. If we combine $\boldsymbol{V}$ and $\boldsymbol{V}_\perp$, the resulting partitioned matrix $\boldsymbol{Q} = [\boldsymbol{V} \,|\, \boldsymbol{V}_\perp]$ is orthogonal. This is easily seen by recognizing that $\boldsymbol{V}$ and $\boldsymbol{V}_\perp$ are bases for the row space and (right) nullspace of $\boldsymbol{X}$, respectively. We then use the change of variables
$$\boldsymbol{\gamma} = \boldsymbol{Q}^T\boldsymbol{\beta} = \begin{bmatrix}\boldsymbol{V}^T \\ \boldsymbol{V}_\perp^T\end{bmatrix}\boldsymbol{\beta} = \begin{bmatrix}\boldsymbol{\theta} \\ \boldsymbol{\eta}\end{bmatrix}$$
to partition the solution into a component projected onto the column space of $\boldsymbol{V}$, $\boldsymbol{\theta} \in \mathbb{R}^n$, and a component projected onto the nullspace of $\boldsymbol{X}$, $\boldsymbol{\eta} \in \mathbb{R}^{p-n}$, such that the following relations hold:
$$\boldsymbol{\beta} = \boldsymbol{Q\gamma} = \boldsymbol{V\theta} + \boldsymbol{V}_\perp\boldsymbol{\eta}$$
$$\boldsymbol{X\beta} = \boldsymbol{U}\boldsymbol{D}\boldsymbol{V}^T(\boldsymbol{V\theta} + \boldsymbol{V}_\perp\boldsymbol{\eta}) = \boldsymbol{U}\boldsymbol{D}\boldsymbol{\theta}$$
$$\|\boldsymbol{\beta} - \boldsymbol{Z\alpha}\|_2^2 = \|\boldsymbol{Q\gamma} - \boldsymbol{Z\alpha}\|_2^2 = \|\boldsymbol{\gamma} - \boldsymbol{Q}^T\boldsymbol{Z\alpha}\|_2^2 = \|\boldsymbol{\theta} - \boldsymbol{V}^T\boldsymbol{Z\alpha}\|_2^2 + \|\boldsymbol{\eta} - \boldsymbol{V}_\perp^T\boldsymbol{Z\alpha}\|_2^2$$
Figure 2.1 Complete SVD Decomposition (the complete SVD of $\boldsymbol{X}$ when $p > n$, partitioned as $\boldsymbol{U}_{n \times n}$, $\boldsymbol{D}_{n \times p}$, $\boldsymbol{V}^T_{n \times p}$, and $\boldsymbol{V}_{\perp,(p-n) \times p}^T$)
The objective function for the ridge-lasso (equation 2.5) can then be written in terms of the new variables:
$$\underset{\boldsymbol{\theta},\,\boldsymbol{\eta},\,\boldsymbol{\alpha}}{\text{minimize}}\;\; \frac{1}{2}\|\boldsymbol{y} - \boldsymbol{U}\boldsymbol{D}\boldsymbol{\theta}\|_2^2 + \frac{\lambda_1}{2}\left\|\begin{bmatrix}\boldsymbol{\theta} \\ \boldsymbol{\eta}\end{bmatrix} - \begin{bmatrix}\boldsymbol{V}^T\boldsymbol{Z\alpha} \\ \boldsymbol{V}_\perp^T\boldsymbol{Z\alpha}\end{bmatrix}\right\|_2^2 + \lambda_2\|\boldsymbol{\alpha}\|_1 \qquad (2.10)$$
and the resulting score equations for $\boldsymbol{\theta}$, $\boldsymbol{\eta}$, and $\boldsymbol{\alpha}$ are
$$\boldsymbol{\theta} = (\boldsymbol{D}^2 + \lambda_1\boldsymbol{I}_n)^{-1}(\boldsymbol{D}\boldsymbol{U}^T\boldsymbol{y} + \lambda_1\boldsymbol{V}^T\boldsymbol{Z\alpha}) \qquad (2.11)$$
$$\boldsymbol{\eta} = \boldsymbol{V}_\perp^T\boldsymbol{Z\alpha} \qquad (2.12)$$
$$\boldsymbol{Z}^T\boldsymbol{Z\alpha} = \boldsymbol{Z}^T(\boldsymbol{V\theta} + \boldsymbol{V}_\perp\boldsymbol{\eta}) - \frac{\lambda_2}{\lambda_1}\frac{\partial}{\partial\boldsymbol{\alpha}}\|\boldsymbol{\alpha}\|_1 \qquad (2.13)$$
By substituting equations 2.11 and 2.12 into equation 2.13 and rearranging with respect to $\boldsymbol{\alpha}$,
$$\boldsymbol{Z}^T\boldsymbol{V}\,\mathrm{diag}\!\left(\frac{d_i^2}{d_i^2 + \lambda_1}\right)\boldsymbol{V}^T\boldsymbol{Z\alpha} = \boldsymbol{Z}^T\boldsymbol{V}\,\mathrm{diag}\!\left(\frac{d_i}{d_i^2 + \lambda_1}\right)\boldsymbol{U}^T\boldsymbol{y} - \frac{\lambda_2}{\lambda_1}\frac{\partial}{\partial\boldsymbol{\alpha}}\|\boldsymbol{\alpha}\|_1$$
the form of the equation resembles a lasso regression. If we define two new variables,
$$\boldsymbol{Z}^* = \mathrm{diag}\!\left(\frac{d_i}{\sqrt{d_i^2 + \lambda_1}}\right)\boldsymbol{V}^T\boldsymbol{Z}, \qquad \boldsymbol{y}^* = \mathrm{diag}\!\left(\frac{1}{\sqrt{d_i^2 + \lambda_1}}\right)\boldsymbol{U}^T\boldsymbol{y}$$
the solution for $\boldsymbol{\alpha}$ is a standard lasso regression of $\boldsymbol{y}^*$ on $\boldsymbol{Z}^*$ with penalty $\lambda_2/\lambda_1$:
$$(\boldsymbol{Z}^*)^T\boldsymbol{Z}^*\boldsymbol{\alpha} = (\boldsymbol{Z}^*)^T\boldsymbol{y}^* - \frac{\lambda_2}{\lambda_1}\frac{\partial}{\partial\boldsymbol{\alpha}}\|\boldsymbol{\alpha}\|_1 \qquad (2.14)$$
The solution to equation 2.14, $\tilde{\boldsymbol{\alpha}}$, can then be substituted into the score equations for $\boldsymbol{\theta}$ and $\boldsymbol{\eta}$, from which we obtain the solution for $\boldsymbol{\beta}$ without ever forming $\boldsymbol{V}_\perp$, by recognizing that $\boldsymbol{V}_\perp\boldsymbol{V}_\perp^T = \boldsymbol{I}_p - \boldsymbol{V}\boldsymbol{V}^T$:
$$\tilde{\boldsymbol{\beta}} = \boldsymbol{V}\tilde{\boldsymbol{\theta}} + \boldsymbol{V}_\perp\tilde{\boldsymbol{\eta}} = \boldsymbol{V}\tilde{\boldsymbol{\theta}} + (\boldsymbol{I}_p - \boldsymbol{V}\boldsymbol{V}^T)\boldsymbol{Z}\tilde{\boldsymbol{\alpha}} \qquad (2.15)$$
Consequently, to compute the joint model we only need a single SVD of $\boldsymbol{X}$, followed by one lasso regression for each fixed combination of $\lambda_1$ and $\lambda_2$. Given the efficiency of currently available software for computing both the SVD of a matrix and lasso regressions (e.g. the glmnet R package), the algorithm scales to high-dimensional data. We also note that this computational method is specific to the case where $r = 2$ in the general model framework (equation 2.1), but it generalizes to different penalties on $\boldsymbol{\alpha}$, including ridge ($s = 2$) and the elastic net (a mixture of the lasso and ridge penalties). For the case when $n \geq p$, the nullspace of $\boldsymbol{X}$ is trivial, $\boldsymbol{V}_\perp$ is empty, and the solution for $\boldsymbol{\beta}$ simplifies to $\tilde{\boldsymbol{\beta}} = \boldsymbol{V}\tilde{\boldsymbol{\theta}}$. The solution for $\boldsymbol{\alpha}$ is unchanged.
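Putting the pieces together, one possible R sketch of the SVD-based fitting procedure for a single $(\lambda_1, \lambda_2)$ pair is given below. It uses base R's svd() and glmnet for the lasso step; the function name is hypothetical, and the division of the lasso penalty by the number of rows of $\boldsymbol{Z}^*$ only compensates for glmnet scaling its least-squares loss by 1/nobs. It is a sketch of the idea, not the exact implementation.

```r
library(glmnet)

# Illustrative SVD-based ridge-lasso fit for a single (lambda1, lambda2) pair;
# assumes y is centered and X, Z are standardized as described in Section 2.2.1.
ridge_lasso_svd <- function(X, Z, y, lambda1, lambda2) {
  s <- svd(X)                                   # reduced SVD: X = U D V'
  d <- s$d; U <- s$u; V <- s$v
  m <- length(d)

  # Transformed data for the lasso step
  Z_star <- (d / sqrt(d^2 + lambda1)) * crossprod(V, Z)   # diag(d/sqrt(d^2+lambda1)) V'Z
  y_star <- drop(crossprod(U, y)) / sqrt(d^2 + lambda1)   # diag(1/sqrt(d^2+lambda1)) U'y

  # Lasso of y* on Z* with penalty lambda2/lambda1 (eqn 2.14); glmnet scales its
  # least-squares loss by 1/nobs, so the penalty is divided by m to compensate
  fit   <- glmnet(Z_star, y_star, alpha = 1, lambda = (lambda2 / lambda1) / m,
                  intercept = FALSE, standardize = FALSE)
  alpha <- as.vector(coef(fit))[-1]

  # Back-substitute to recover theta, eta, and finally beta (eqns 2.11, 2.12, 2.15)
  Za    <- Z %*% alpha
  theta <- (d * drop(crossprod(U, y)) + lambda1 * drop(crossprod(V, Za))) / (d^2 + lambda1)
  beta  <- V %*% theta + (Za - V %*% crossprod(V, Za))    # V theta + (I - VV') Z alpha
  list(alpha = alpha, beta = drop(beta))
}
```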
2.2.3 Hyperparameter Tuning
To tune the hyperparameters $\lambda_1$ and $\lambda_2$, we utilize k-fold cross-validation in a two-dimensional grid search. Initial values for both $\lambda_1$ and $\lambda_2$ are computed using the naïve approach, in which $\boldsymbol{\beta}$ and $\boldsymbol{\alpha}$ are estimated in two separate regularized regressions as discussed in Section 2.2.1. The R functions glmnet() and cv.glmnet() are used to fit the naïve approach and to determine the initial values for $\lambda_1$ and $\lambda_2$. The initial penalties chosen are those that produce
the minimum cross-validation MSE. A grid of potential values is then selected around the initial
optimal values. A manual iterative search is employed in which we complete a k-fold cross-
validation on a grid of user-defined values selected around the initial optimal penalties. If the
minimum cross-validated MSE occurs on the upper or lower bounds of the grid, we change the
range of the penalties in the appropriate direction and complete the k-fold CV for the new values
of the grid. The iterations continue until the minimum cross-validated MSE lies within the two-
dimensional grid.
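As an illustration, this grid search can be organized as a double loop over candidate penalty values, reusing the hypothetical ridge_lasso_svd() sketch from Section 2.2.2; the boundary check and grid expansion are left as the manual step described above.

```r
# k-fold cross-validated MSE over a two-dimensional grid of (lambda1, lambda2) values
cv_grid <- function(X, Z, y, lambda1_grid, lambda2_grid, k = 10) {
  folds  <- sample(rep(1:k, length.out = nrow(X)))
  cv_mse <- matrix(NA, length(lambda1_grid), length(lambda2_grid))
  for (i in seq_along(lambda1_grid)) {
    for (j in seq_along(lambda2_grid)) {
      errs <- sapply(1:k, function(f) {
        train <- folds != f
        fit   <- ridge_lasso_svd(X[train, ], Z, y[train],
                                 lambda1_grid[i], lambda2_grid[j])
        mean((y[!train] - X[!train, ] %*% fit$beta)^2)
      })
      cv_mse[i, j] <- mean(errs)
    }
  }
  # If the minimum lies on a boundary of the grid, shift the grid in that
  # direction and rerun, as described above
  cv_mse
}
```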
2.3 Simulations
2.3.1 Independence Case
As a first example, we consider the case where both the predictor variables, 𝑿 , and the
external variables, 𝒁 , are independent. We simulated 100 observations from the model
$$\boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim MVN(\boldsymbol{0}_n, \boldsymbol{I}_n)$$
where $\boldsymbol{\beta} = (6, 2.6, 2, 1.75, 0.6, 1, 0.4, 0.3)^T$, $\beta_0 = 0$, and $\boldsymbol{X}$ is multivariate normal with mean
zero and identity covariance matrix. The external data has two dichotomous variables to indicate
whether a predictor has attribute “A” and/or attribute “B” (Table 2.1), where 1 = “yes” and 0 =
“no”. To make the two external variables independent, the first four predictors only have
attribute “A” and the last four predictors only have attribute “B”. We have purposefully chosen
the first four 𝜷 ′𝑠 to have somewhat larger effects compared to the last four 𝜷 ′𝑠 . Given this
design, the intercept for the external data, $\alpha_0$, is excluded; otherwise the model is overparameterized. The purpose of this simulation is to demonstrate the model's behavior as we vary $\lambda_1$ and $\lambda_2$.
Table 2.1 True Parameter Values and External Data Settings for the Independence Case

Variable   β      Z1 = Attribute "A"   Z2 = Attribute "B"
1          6.00   1                    0
2          2.60   1                    0
3          2.00   1                    0
4          1.75   1                    0
5          0.60   0                    1
6          1.00   0                    1
7          0.40   0                    1
8          0.30   0                    1
2.3.2 Normal-Normal Case
To assess the performance of the ridge-lasso regression, we utilize a Bayesian
interpretation of the ridge-lasso based on the following three-level hierarchical model:
$$\alpha_k \sim DE(\lambda_2), \quad k = 1,\dots,q$$
$$\beta_j \mid \boldsymbol{Z} \sim N(\boldsymbol{Z}_j^T\boldsymbol{\alpha}, \sigma_2^2), \quad j = 1,\dots,p$$
$$Y_i \mid \boldsymbol{X} \sim N(\boldsymbol{X}_i^T\boldsymbol{\beta}, \sigma_1^2), \quad i = 1,\dots,n$$
Above, we assume that the effects of the genomic features, $\boldsymbol{\beta}$, are independently normally distributed, with means that depend on the external data and a common variance, $\sigma_2^2$. The third level
of the model induces sparsity on the 𝜶 ′𝑠 by assuming that the external data regression
coefficients, 𝜶 , have independent and identically distributed Laplace (i.e. double exponential)
priors. This hierarchy is a natural extension of the Bayesian interpretations of ridge and lasso
regression discussed in Section 1.3.
We use this hierarchy as a basis for the following simulations in which we assess the
predictive performance of the model under various settings, using standard ridge regression as
the baseline comparison. The simulations focus on how the relative performance is impacted by
the number of predictor variables (p), the number of external variables (q), and the
informativeness of the predictors and external data. To quantify the level of informativeness of
both the predictor variables and the external data, we use the signal-to-noise ratio (SNR), where
an increase in informativeness is equivalent to an increase in the SNR. For a given linear model
of the form
$$\boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim MVN(\boldsymbol{0}_n, \sigma^2\boldsymbol{I}_n)$$
the theoretical SNR can be defined in terms of the true values of the regression coefficients, the
covariance matrix of 𝑿 , 𝚺 , and the residual variance (Dicker, 2014).
$$SNR = \frac{\text{Signal}}{\text{Noise}} = \frac{\boldsymbol{\beta}^T\boldsymbol{\Sigma}\boldsymbol{\beta}}{\sigma^2}$$
To represent the sparsity associated with the Laplace prior, we only set four of the
external variables to have a true nonzero effect in all simulations. The variables with true
nonzero effects are chosen such that they have little to no correlation with each other.
$$\boldsymbol{\alpha} = (0.02, 0, \dots, 0, 0.08, 0, \dots, 0, -0.03, 0, \dots, 0, 0.025)^T$$
We manually chose the values of $\boldsymbol{\alpha}$ rather than simulating directly from a Laplace distribution because a direct simulation would produce coefficients that are near zero but not exactly zero. Alternatively, one could simulate from a Laplace distribution and threshold to zero any values below a minimum magnitude. The
external data, 𝒁 , and predictors, 𝑿 , are assumed to be multivariate normal with mean zero and to
have a lag-1 autoregressive, AR(1), correlation structure with common variance $\sigma^2 = 1$. The correlation used to simulate the AR(1) structure for $\boldsymbol{X}$, $\rho_X$, is fixed at 0.5, while the correlation for $\boldsymbol{Z}$, $\rho_Z$, is varied across settings.
$$\boldsymbol{Z} \sim MVN(\boldsymbol{0}_q, \boldsymbol{\Sigma}_1), \qquad \boldsymbol{X} \sim MVN(\boldsymbol{0}_p, \boldsymbol{\Sigma}_2)$$
$$\boldsymbol{\Sigma}_1 = \sigma^2\boldsymbol{R}_{q \times q}, \quad R_{ij} = Corr(Z_i, Z_j) = \rho_Z^{|i-j|}, \quad i, j = 1,\dots,q$$
$$\boldsymbol{\Sigma}_2 = \sigma^2\boldsymbol{R}_{p \times p}, \quad R_{ij} = Corr(X_i, X_j) = \rho_X^{|i-j|}, \quad i, j = 1,\dots,p$$
Given $\boldsymbol{\alpha}$, $\boldsymbol{X}$, and $\boldsymbol{Z}$, we then simulate the first-level coefficients, $\boldsymbol{\beta}$, and the outcome, $\boldsymbol{y}$, from the three-level hierarchical model. Both $\alpha_0$ and $\beta_0$ are fixed at zero across all trials.
We proceed to complete two sets of simulations. In the first set of simulations, we fix the
number of external variables at 100, and proceed to vary the number of predictors, the
correlation among the external variables, and the SNR for both the external data and the
predictors. We complete 1000 replicates under each unique combination of the settings below:
Number of external variables ($q$): 100
Number of first-level predictors ($p$): 500, 1000, 2000
External data ($\boldsymbol{Z}$) signal-to-noise ratio ($SNR_Z$): 0, 0.5, 1, 2
Predictor data ($\boldsymbol{X}$) signal-to-noise ratio ($SNR_X$): 1, 2
External data correlation ($\rho_Z$): 0.2, 0.5, 0.8
In the second set of simulations, we fix the number of predictor variables at 500, and vary the
number of external variables, the correlation among the external variables, and the SNR for both
the external data and the predictors. Again, we complete 1000 replicates under each combination
of the settings below:
Number of first-level predictors ($p$): 500
Number of external variables ($q$): 500, 1000, 2000
External data ($\boldsymbol{Z}$) signal-to-noise ratio ($SNR_Z$): 0, 0.5, 1, 2
Predictor data ($\boldsymbol{X}$) signal-to-noise ratio ($SNR_X$): 1, 2
External data correlation ($\rho_Z$): 0.2, 0.5, 0.8
To obtain the desired theoretical SNRs, we set the residual variances according to the formulas below:
$$\sigma_2^2 = \frac{\boldsymbol{\alpha}^T\boldsymbol{\Sigma}_Z\boldsymbol{\alpha}}{SNR_Z}, \qquad \sigma_1^2 = \frac{\boldsymbol{\beta}^T\boldsymbol{\Sigma}_X\boldsymbol{\beta}}{SNR_X}$$
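A condensed R sketch of one simulation replicate under this hierarchy (using the MASS package for multivariate normal draws; the sample sizes, SNR values, and positions of the nonzero $\alpha$'s shown here are placeholder choices):

```r
library(MASS)

# One replicate under the three-level hierarchy (sizes and settings are placeholders)
n <- 300; p <- 500; q <- 100
snr_x <- 2; snr_z <- 1; rho_x <- 0.5; rho_z <- 0.2

ar1 <- function(m, rho) rho^abs(outer(1:m, 1:m, "-"))    # AR(1) correlation matrix

Z <- mvrnorm(p, mu = rep(0, q), Sigma = ar1(q, rho_z))   # p x q external data
X <- mvrnorm(n, mu = rep(0, p), Sigma = ar1(p, rho_x))   # n x p predictors

alpha <- rep(0, q)
alpha[c(1, 30, 60, q)] <- c(0.02, 0.08, -0.03, 0.025)    # sparse second-level effects
                                                         # (positions chosen for illustration)
sigma2_sq <- drop(t(alpha) %*% ar1(q, rho_z) %*% alpha) / snr_z
beta      <- drop(Z %*% alpha) + rnorm(p, sd = sqrt(sigma2_sq))

sigma1_sq <- drop(t(beta) %*% ar1(p, rho_x) %*% beta) / snr_x
y         <- drop(X %*% beta) + rnorm(n, sd = sqrt(sigma1_sq))
```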
For each trial, a training data set is used to estimate the ridge regression and ridge-lasso
parameters. A fixed sample size of 300 observations is used in all trials. 10-fold cross-validation
is used to tune the hyperparameters in both models. The fitted models based on the optimal
hyperparameters, where optimal is defined as the hyperparameters that obtain the minimum
cross-validated MSE, are then evaluated on a validation data set with a fixed sample size of 1000
observations. The validation predictors, 𝑿 𝒗𝒂𝒍 , and outcome, 𝒚 𝒗𝒂𝒍 , are simulated under the same
hierarchical model. The prediction error for both methods is then evaluated as the prediction
MSE in the validation data set.
$$MSE(\text{validation}) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i^{(val)} - \hat{y}_i^{(val)}\right)^2$$
Note that 𝜶 is the only parameter that is fixed across all trials and scenarios that were considered.
2.3.3 Magnitude and Direction of Effect
One of the potential limitations of our method is its reliance on modelling the expected
value of 𝛽 and the assumption that the external information is informative for both the magnitude
and direction of effect. To demonstrate this behavior, we simulate two scenarios. In both, the
external information consists of two dichotomous variables (i.e. 1 = “yes”/ 0 = “no”), but only
one of the external variables is truly informative. We will refer to these variables as 𝑍 1
(informative) and 𝑍 2
(noninformative). For the first scenario, 𝑍 1
is informative for both the
magnitude and direction of effect. For the second scenario, 𝑍 1
is only informative for the
magnitude of effect. The number of predictors is fixed at 300 and the number of observations is
fixed at 150 across all simulations. As in the previous simulations, we use a linear model where
𝑿 is again assumed to have a multivariate normal distribution with mean zero and AR(1)
correlation structure ($\rho = 0.8$ and $\sigma^2 = 1$):
$$\boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim MVN(\boldsymbol{0}_n, \sigma^2\boldsymbol{I}_n), \quad \sigma = 3$$
For both scenarios, 1000 training data sets and 1000 corresponding validation data sets with a
sample size of 3000 were generated. The predictive accuracy of our method is compared to
standard ridge regression in terms of prediction MSE in the validation data set.
Scenario 1
For each replicate, 30 of the 300 predictors are randomly selected and the effect of these
predictors on the outcome is simulated to be independent and identically distributed normal with
mean one and variance 0.25.
𝛽 𝑗 ~ 𝑁 (1,0.25)
The effects of the remaining predictors on the outcome are assumed to be normally distributed
with mean zero and variance 0.125.
𝛽 𝑗 ~ 𝑁 (0,0.125)
We selected the “best” configuration, in terms of informativeness, for 𝑍 1
by coding the 30
randomly selected predictors as one (‘yes’) and the rest as zero (‘no’). This configuration
effectively partitions the predictors into two sets, one set that tends to have positive effects and
another set that tends to have near-zero effects. $Z_2$ is independently sampled as $Bernoulli(p)$, where $p = P(Z_2 = 1) = 0.5$.
Scenario 2
Similar to the first scenario, for each replicate, we randomly select 30 variables and
simulate their effects to be normal with some nonzero mean. However, we now simulate the first
15 as $N(1, 0.25)$ and the last 15 as $N(-1, 0.25)$. The effects of the remaining predictors are still simulated as $N(0, 0.125)$. The first 30 randomly selected predictors are still coded as one (‘yes’) for $Z_1$, and the rest are set to zero. The second, noninformative external variable, $Z_2$, is still sampled as $Bernoulli(0.5)$.
2.4 Results
2.4.1 Independence Case
Recall that $\boldsymbol{\beta}$ and $\boldsymbol{\alpha}$ have closed-form solutions when $\boldsymbol{X}$ and $\boldsymbol{Z}$ have orthonormal columns (Section 2.2.2). In this simulation, the columns of $\boldsymbol{X}$ and $\boldsymbol{Z}$ are independent but not exactly orthonormal. The corresponding closed-form solution for the independence case is
$$\tilde{\alpha}_k = S\!\left(\hat{\alpha}_k^{(LS)},\; \frac{(1 + \lambda_1)\lambda_2}{\lambda_1}\right), \quad k = 1,\dots,q$$
$$\tilde{\boldsymbol{\beta}} = \left(\frac{1}{1 + \lambda_1}\right)\hat{\boldsymbol{\beta}}_{LS} + \left(\frac{\lambda_1}{1 + \lambda_1}\right)\boldsymbol{Z}\tilde{\boldsymbol{\alpha}}$$
where $\hat{\alpha}_k^{(LS)} = \boldsymbol{Z}_k^T\hat{\boldsymbol{\beta}}_{LS} / \|\boldsymbol{Z}_k\|_2^2$ is the least-squares estimate from regressing $\hat{\boldsymbol{\beta}}_{LS}$ on $\boldsymbol{Z}_k$. Based on these closed-form solutions, we first note that for fixed $\lambda_2$,
$$\lim_{\lambda_1 \to 0} \frac{(1 + \lambda_1)\lambda_2}{\lambda_1} = \infty, \qquad \lim_{\lambda_1 \to \infty} \frac{(1 + \lambda_1)\lambda_2}{\lambda_1} = \lambda_2$$
Thus, as $\lambda_1 \to 0$ or $\lambda_2 \to \infty$, $\boldsymbol{\alpha}$ shrinks toward zero. Table 2.2 exemplifies the behavior of the ridge-lasso as the penalties become very large or very small relative to the estimates. Figure 2.2 demonstrates that for fixed $\lambda_2$, $\boldsymbol{\beta}$ converges towards $\boldsymbol{Z\alpha}$ as $\lambda_1$ increases. The red and blue lines represent the average least-squares estimates of the first four predictors and the last four predictors, respectively.
Figure 2.2 Behavior of Ridge-Lasso Estimates as a Function of 𝝀 𝟏 and 𝝀 𝟐
Table 2.2 provides numeric examples that illustrate the following properties:
Column 1: Given $\lambda_2 = 0$, as $\lambda_1 \to 0$, $\tilde{\boldsymbol{\alpha}} = \hat{\boldsymbol{\alpha}}_{LS}$ and $\tilde{\boldsymbol{\beta}} \to \hat{\boldsymbol{\beta}}_{LS}$
Column 2: Given $\lambda_2 = 0$, as $\lambda_1 \to \infty$, $\tilde{\boldsymbol{\alpha}} = \hat{\boldsymbol{\alpha}}_{LS}$ and $\tilde{\beta}_j \to \boldsymbol{Z}_j^T\hat{\boldsymbol{\alpha}}_{LS}$, $j = 1,\dots,p$
Column 3: As $\lambda_1 = \lambda_2 \to 0$, $\frac{(1+\lambda_1)\lambda_2}{\lambda_1} \to 1$ and $\tilde{\boldsymbol{\beta}} \to \hat{\boldsymbol{\beta}}_{LS}$
Column 4: As $\lambda_1 \to \infty$ and $\lambda_2 \to \infty$, $\tilde{\boldsymbol{\alpha}} \to 0$ and $\tilde{\boldsymbol{\beta}} \to 0$
Penalty settings for the four β̃ columns (with the corresponding α̃):
(1) λ1 = 1e-5, λ2 = 0:     α̃ = (3.0, 0.52)
(2) λ1 = 10,   λ2 = 0:     α̃ = (3.0, 0.52)
(3) λ1 = 1e-5, λ2 = 1e-5:  α̃ = (2.0, 0)
(4) λ1 = 10,   λ2 = 10:    α̃ = (0, 0)

Z1  Z2  β      β̂_LS    β̃(1)   β̃(2)   β̃(3)   β̃(4)
1   0   6.00   5.97    5.97   3.27   5.97   0.54
1   0   2.60   2.45    2.45   2.95   2.45   0.22
1   0   2.00   1.95    1.95   2.90   1.95   0.18
1   0   1.75   1.62    1.62   2.87   1.62   0.15
0   1   0.60   0.55    0.55   0.52   0.55   0.05
0   1   1.00   0.82    0.82   0.55   0.82   0.07
0   1   0.40   0.35    0.35   0.50   0.35   0.03
0   1   0.30   0.35    0.35   0.50   0.35   0.03
Table 2.2 Behavior of Ridge-Lasso Estimates as a Function of λ1 and λ2
2.4.2 Normal-Normal Case
Figures 2.3 – 2.10 display results for the first set of simulations in which we varied the
number of predictors, the correlation of the external data, and the signal-to-noise ratio of both the
external data and the predictors. The performance of the ridge-lasso compared to standard ridge
regression, as measured by prediction MSE in the test data set, always improves as the
informativeness of the external information increases across all scenarios (Figures 2.3 – 2.4). For
fixed $SNR_X$ and $SNR_Z$, the relative predictive performance also improves as the number of predictor variables increases. The average MSE of the regression coefficients for the predictor variables, $\boldsymbol{\beta}$,
is also lower in ridge-lasso compared to the ridge model, with a similar trend to improve as the
informativeness of the external data increases (Figures 2.5 – 2.6). When the external data is not
informative (𝑆𝑁 𝑅 𝑍 = 0) , on average, there is little to no penalty if we incorporate the external
data. The predictive performance of the ridge-lasso also appears to be robust to increasing correlation among the external variables, with little to no reduction in performance. Predictive performance
relative to ridge regression decreases when the SNR of the predictor variables is decreased from
two to one across all scenarios.
Figure 2.3 Relative Prediction Performance, as measured by decrease in prediction MSE,
of Ridge-Lasso Compared to Standard Ridge Regression by Number of Predictors,
External Data Signal-to-Noise Ratio, and External Data Correlation ( 𝑺 𝑵 𝑹 𝑿 = 𝟐 )
Figure 2.4 Relative Prediction Performance, as measured by decrease in prediction MSE,
of Ridge-Lasso Compared to Standard Ridge Regression by Number of Predictors,
External Data Signal-to-Noise Ratio, and External Data Correlation ( 𝑺𝑵 𝑹 𝑿 = 𝟏 )
Figure 2.5 Relative decrease in 𝑴𝑺𝑬 (𝜷 ̂
) of Ridge-Lasso Compared to Standard Ridge
Regression by Number of Predictors, External Data Signal-to-Noise Ratio, and External
Data Correlation (𝑺𝑵 𝑹 𝑿 = 𝟐 )
Figure 2.6 Relative decrease in 𝑴𝑺𝑬 (𝜷 ̂
) of Ridge-Lasso Compared to Standard Ridge
Regression by Number of Predictors, External Data Signal-to-Noise Ratio, and External
Data Correlation (𝑺𝑵 𝑹 𝑿 = 𝟏 )
In terms of variable selection, the ability to recover the underlying sparse solution for the
external data variables improves as both the external information becomes more informative and
the number of predictor variables increases (Figures 2.7 – 2.8). A similar improvement is
observed for standard ridge regression as well. Sensitivity with respect to the true non-zero
external variables shows a modest improvement in the ridge-lasso model compared to ridge
regression. The lasso penalty on the external data also produces a relatively sparse solution by
setting the non-informative effects to zero with an average specificity of at least 80% in all
scenarios (Figures 2.9 – 2.10). However, the specificity for ridge-lasso is consistently lower
compared to ridge regression. Interestingly, there is not a clear trend in specificity as the number
of predictor variables increases. Sensitivity and specificity decrease in both ridge-lasso and ridge
regression when the signal-to-noise ratio of the predictors is decreased to one.
Figures 2.11 – 2.18 display the results of the second set of simulations in which we
varied the number of external variables. As the number of external variables increases, there is a
general decrease in the prediction performance compared to ridge regression, but still a clear
trend of improvement as the external data becomes more informative. Additionally, there is still
not a significant reduction in predictive ability compared to ridge regression when the external
data is not informative (𝑆𝑁 𝑅 𝑍 = 0). As might be expected, both the sensitivity and specificity
decrease in both ridge-lasso and ridge regression as the number of external data variables is
increased.
Figure 2.7 Average True Positive Selection Rate of Nonzero External Variables (𝜶 ≠ 𝟎 ) by
Number of Predictors and External Data Signal-to-Noise Ratio (𝑺𝑵 𝑹 𝑿 = 𝟐 )
Figure 2.8 Average True Positive Selection Rate of Nonzero External Variables (𝜶 ≠ 𝟎 ) by
Number of Predictors and External Data Signal-to-Noise Ratio (𝑺𝑵 𝑹 𝑿 = 𝟏 )
Figure 2.9 Average True Negative Selection Rate of Zero External Variables ($\alpha = 0$) by Number of Predictors and External Data Signal-to-Noise Ratio ($SNR_X = 2$)
Figure 2.10 Average True Negative Selection Rate of Zero External Variables ($\alpha = 0$) by Number of Predictors and External Data Signal-to-Noise Ratio ($SNR_X = 1$)
Figure 2.11 Relative Prediction Performance, as measured by decrease in prediction MSE,
of Ridge-Lasso Compared to Standard Ridge Regression by Number of External Variables,
External Data Signal-to-Noise Ratio, and External Data Correlation (𝑺𝑵 𝑹 𝑿 = 𝟐 )
Figure 2.12 Relative Prediction Performance, as measured by decrease in prediction MSE,
of Ridge-Lasso Compared to Standard Ridge Regression by Number of External Variables,
External Data Signal-to-Noise Ratio, and External Data Correlation (𝑺𝑵 𝑹 𝑿 = 𝟏 )
Figure 2.13 Relative decrease in MSE(𝜷 ̂
) of Ridge-Lasso Compared to Standard Ridge
Regression by Number of External Variables, External Data Signal-to-Noise Ratio, and
External Data Correlation (𝑺𝑵 𝑹 𝑿 = 𝟐 )
Figure 2.14 Relative decrease in MSE(𝜷 ̂
) of Ridge-Lasso Compared to Standard Ridge
Regression by Number of External Variables, External Data Signal-to-Noise Ratio, and
External Data Correlation (𝑺𝑵 𝑹 𝑿 = 𝟏 )
Figure 2.15 Average True Positive Selection Rate of Nonzero External Variables (𝜶 ≠ 𝟎 ) by
Number of External Variables and External Data Signal-to-Noise Ratio (𝑺𝑵 𝑹 𝑿 = 𝟐 )
Figure 2.16 Average True Positive Selection Rate of Nonzero External Variables (𝜶 ≠ 𝟎 ) by
Number of External Variables and External Data Signal-to-Noise Ratio (𝑺𝑵 𝑹 𝑿 = 𝟏 )
2.4.3 Magnitude and Direction of Effect
Figure 2.16 demonstrates the ability of our method to improve prediction performance
when the external information can partition the predictors into groups of variables with the same
direction of effect (Scenario 1). Our method is not able to utilize external information that is only
informative for the magnitude of effect (Scenario 2). Another simulation that would lead to
similar results is to let the external information be informative for the variance of the effects, as
opposed to the means.
2.5 Discussion
In this section, we introduced a novel method to incorporate external information through
an extension of regularized regression. In the spirit of a standard two-level hierarchical model,
we allow the mean effect of the predictors to depend on a set of external data features and apply
regularization to both sets of parameters. Unlike standard regularized methods (i.e. ridge
regression), our model framework allows for the effects of predictors to shrink towards
potentially informative nonzero values. An additional regularization term on the external data
enables variable selection on these external features and reduces the impact on prediction
performance when the external information is not informative. To efficiently fit the model when
a ridge penalty is applied to the first level predictors, an algorithm that utilizes singular value
decomposition (SVD) and coordinate descent was developed. This fitting method was used to
demonstrate the behavior of our framework when a lasso penalty is applied to the external data.
Simulations validate that incorporating external information by using hierarchical
regularized regression (i.e. ridge-lasso) improves the predictive ability as measured by prediction
MSE (PMSE) relative to ridge regression, a comparable method that cannot utilize this
information. The more informative the external data, the larger performance gain we see relative
to ridge regression. The predictive ability decreases for both methods (i.e. larger PMSE) as 𝑝
increases. The decrease is due to the dense solution for $\boldsymbol{\beta}$ (i.e. $\beta_j \neq 0$ for all $j = 1,\dots,p$), which makes the signal strength, $\boldsymbol{\beta}^T\boldsymbol{\Sigma}\boldsymbol{\beta}$, an increasing function of $p$. To maintain a constant $SNR_X$, the
error variance is increased to match the increasing signal strength. The result is an overall
decrease in the predictive ability of 𝑿 as we increase the number of predictors. We can contrast
this result with the relatively unchanged prediction performance as the number of external
variables (𝑞 ) is increased. By purposefully making the solution for 𝛼 sparse, the signal strength
does not change as we add more noninformative external variables and the error variance is not
adjusted to maintain a constant 𝑆𝑁 𝑅 𝑍 . Another set of simulations that would improve the ability
to compare the prediction performance across 𝑝 would be to fix the out-of-sample predictive
ability of 𝑿 as opposed to the signal-to-noise ratio.
The performance of ridge-lasso regression does not appear to be significantly affected if
we increase the correlation among the external variables. This is most likely due to our focus on
prediction using the first level variables and not variable selection of the external data. In terms
of selecting the true nonzero external variables, the average true positive selection rate increases
as the second level data becomes more informative. Relative to ridge regression, ridge-lasso
regression consistently has higher sensitivity, but lower specificity across all scenarios evaluated.
Although our primary aim is to improve prediction of the outcome, ridge-lasso appears to be an
effective method for variable selection. This could make ridge-lasso regression a potential
method to identify underlying molecular mechanisms that drive the association between genomic
regions and health outcomes. If the external data exhibits high correlation, our method easily
extends to the use of a ridge or elastic net penalty on the external variables.
An underlying assumption of our model is that the external data is informative for both
the direction and magnitude of effect. This is a potential limitation compared to other methods
that attempt to estimate the variance as opposed to the mean. As an example, if we wanted to use
a biological pathway in our external data, the direction of association of genes in a pathway with
a specific outcome will most likely differ and researchers generally do not know the direction of
effect a priori. In this scenario, our method can miss an informative genomic annotation if the
average effect across all genes in the pathway is near zero. In a setting where the external
information is only informative for the magnitude, one might want a different model where
$E[\boldsymbol{\beta}] = \boldsymbol{0}$ and $\ln(\mathrm{Var}[\boldsymbol{\beta}]) = \boldsymbol{Z\alpha}$ (Thomas et al., 2009). To enable the incorporation of external data
that is informative for magnitude only, one could consider shrinking predictors toward their
absolute value of effect based on the external information. If we have strong prior evidence for
the direction of effect across a set of predictors, there is the potential to include this information
in the external data or to modify the design matrix to fix all effects in the same direction (i.e.
make all estimates positive). However, this might be a relatively strong assumption to make in
practice. Despite these limitations, the simulation results provide evidence that our method can
improve prediction of a continuous outcome given the right set of external information.
Chapter 3 Coordinate Descent for Hierarchical Regularized Regression
3.1 Overview
The method for fitting ridge-lasso regression described in the previous chapter works
well for moderate sizes of 𝑝 (i.e. tens of thousands of variables) and 𝑛 (i.e. hundreds of
observations). However, there may exist computational limitations to extend this algorithm to
very high-dimensional 𝑝 and n, such as GWAS, where we might expect millions of variables and
potentially sample sizes over 100,000. More importantly, the previous fitting method does not
easily extend to different outcomes (i.e. binary) or to different types of regularization other than
ridge on the predictors. In this chapter we derive an alternative optimization problem that
produces an equivalent solution, but with the added benefit that it can be solved using coordinate
descent. We describe coordinate descent and how this optimization method can be used to
accommodate our problem. As we will show, this alternate optimization not only enables the use
of different penalty types on the predictors, but can also take advantage of pathwise coordinate
descent and ‘warm starts’ to compute the solution along a path of hyperparameter values. In
regularized regression, ‘warm starting’ refers to using a previously computed solution for a given
hyperparameter value as the initial value to compute the solution for a new hyperparameter
value. We demonstrate how pathwise coordinate descent can be extended to a two-dimensional
grid search that is needed to tune the two hyperparameters in our ridge-lasso model. We end with
a discussion of the advantages of the new method to fit hierarchical regularized regression
models.
3.2 Methods
3.2.1 Coordinate Gradient Descent
Consider the class of convex optimization problems briefly introduced in Section 1.2.2
that solve problems of the form
$$\underset{\boldsymbol{x} \in \mathbb{R}^p}{\text{minimize}}\;\; f(\boldsymbol{x}) \qquad \text{subject to} \;\; g_i(\boldsymbol{x}) \leq b_i, \quad i = 1,\dots,m$$
under the assumptions that:
1. The objective function, $f: \mathbb{R}^p \to \mathbb{R}$, is convex with respect to the parameters
2. The constraint functions, $g_i: \mathbb{R}^p \to \mathbb{R}$, $i = 1,\dots,m$, are convex with respect to the parameters
We now make the additional assumptions that 𝑓 (𝒙 ) is smooth (i.e. continuously differentiable)
and the constraint functions, $g_i(\boldsymbol{x})$, are separable, but not necessarily smooth, such that $g(\boldsymbol{x}) = \sum_{i=1}^{m} g_i(\boldsymbol{x}_i)$. The following unconstrained optimization problem represents this general class of problems:
$$\underset{\boldsymbol{x}}{\text{minimize}}\;\; f(\boldsymbol{x}) + \sum_{i=1}^{m}\lambda_i g_i(\boldsymbol{x}_i) \qquad (3.1)$$
Tseng (2001) demonstrated that (block) coordinate descent algorithms will converge to the
global minimum of problems that take the form of equation 3.1. Intuitively, this means that if the objective has reached a minimum along each coordinate, $x_i$ ($i = 1,\dots,p$), then we have reached the global minimum. As will be shown, Friedman et al. (2010) used these results to develop the
coordinate-wise minimization algorithm that is implemented in the R package glmnet for ridge,
lasso, and elastic net regression.
As a first example, we use linear regression with an outcome $\boldsymbol{y} \in \mathbb{R}^n$ and a matrix of predictors $\boldsymbol{X} \in \mathbb{R}^{n \times p}$. As before, we assume $\boldsymbol{y}$ is centered and $\boldsymbol{X}$ is standardized such that $\frac{1}{n}\sum_{i=1}^{n} x_{ij} = 0$ and $\frac{1}{n}\sum_{i=1}^{n} x_{ij}^2 = 1$ for each column $j$. The quadratic loss used for the corresponding unconstrained optimization is both convex and differentiable:
$$\underset{\boldsymbol{\beta}}{\text{min}}\;\; \frac{1}{2}\|\boldsymbol{y} - \boldsymbol{X\beta}\|_2^2$$
To solve this optimization problem by coordinate gradient descent, we first choose an initial solution for $\boldsymbol{\beta}$; here we use $\hat{\boldsymbol{\beta}}^{(0)} = \boldsymbol{0}_p$. We then determine the coordinate-wise update for $\beta_j$ by computing the partial derivative with respect to $\beta_j$ and setting the resulting equation to zero:
$$\frac{\partial}{\partial\beta_j} = \boldsymbol{x}_j^T(\boldsymbol{X\beta} - \boldsymbol{y}) = \boldsymbol{x}_j^T(\boldsymbol{X}_{-j}\boldsymbol{\beta}_{-j} + \boldsymbol{x}_j\beta_j - \boldsymbol{y}) = \boldsymbol{x}_j^T\boldsymbol{x}_j\beta_j + \boldsymbol{x}_j^T(\boldsymbol{X}_{-j}\boldsymbol{\beta}_{-j} - \boldsymbol{y}) = 0$$
where $\boldsymbol{x}_j$ is the $j$th column of $\boldsymbol{X}$, $\boldsymbol{X}_{-j}$ is $\boldsymbol{X}$ without the $j$th column, and $\boldsymbol{\beta}_{-j}$ is $\boldsymbol{\beta}$ without the $j$th element. Solving for $\beta_j$, the coordinate-wise update, given the current estimates of the other parameters, $\hat{\boldsymbol{\beta}}_{-j}$, can be written as
$$\hat{\beta}_j \leftarrow \frac{\boldsymbol{x}_j^T(\boldsymbol{y} - \boldsymbol{X}_{-j}\hat{\boldsymbol{\beta}}_{-j})}{\boldsymbol{x}_j^T\boldsymbol{x}_j} \qquad (3.2)$$
After we have determined the coordinate-wise update, the general algorithm to fit linear regression by coordinate gradient descent is relatively straightforward (Algorithm 1), where the standardization assumption $\frac{1}{n}\boldsymbol{x}_j^T\boldsymbol{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}^2 = 1$ is used to simplify the update.
Algorithm 1: Coordinate Gradient Descent for Linear Regression
  $\hat{\boldsymbol{\beta}}^{(0)} \leftarrow \boldsymbol{0}$  (set initial value for $\boldsymbol{\beta}$, here the zero vector)
  $k \leftarrow 1$
  while (not converged)
      foreach $j \in [1, \dots, p]$
          $\hat{\beta}_j^{(k)} \leftarrow \frac{1}{n}\,\boldsymbol{x}_j^T\left(\boldsymbol{y} - \boldsymbol{X}_{-j}\begin{bmatrix}\hat{\boldsymbol{\beta}}_{l<j}^{(k)} \\ \hat{\boldsymbol{\beta}}_{l>j}^{(k-1)}\end{bmatrix}\right)$
      end
      $k \leftarrow k + 1$  (check convergence)
  end
In this algorithm, $\hat{\boldsymbol{\beta}}_{l<j}^{(k)}$ consists of all elements of $\hat{\boldsymbol{\beta}}$ from iteration $k$ with index less than $j$, and $\hat{\boldsymbol{\beta}}_{l>j}^{(k-1)}$ consists of all elements of $\hat{\boldsymbol{\beta}}$ from iteration $k - 1$ with index greater than $j$. In effect, we always use the most recent estimate of each parameter as we cycle through all $p$ variables. The order of the variables is arbitrary. After one cycle through all $p$ variables, we check a user-defined convergence criterion. In the case of linear regression, we could use the maximum change in the fitted values between the $(k-1)$th and $k$th iterations:
$$\max_{j=1,\dots,p} \; \frac{1}{n}\sum_{i=1}^{n}\left|x_{ij}\left(\hat{\beta}_j^{(k-1)} - \hat{\beta}_j^{(k)}\right)\right|$$
After this difference has reached a user-defined minimum threshold, the algorithm is complete.
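A direct, unoptimized R translation of Algorithm 1, maintaining the residual vector so that each coordinate update costs $O(n)$ (the function name is illustrative):

```r
# Coordinate gradient descent for linear regression (Algorithm 1); assumes y is
# centered and the columns of X are standardized so that x_j'x_j / n = 1
cd_linear <- function(X, y, tol = 1e-7, max_iter = 1000) {
  n <- nrow(X); p <- ncol(X)
  beta <- rep(0, p)                 # initial solution beta^(0) = 0
  r <- y                            # full residual y - X beta, updated incrementally
  for (k in 1:max_iter) {
    max_delta <- 0
    for (j in 1:p) {
      old     <- beta[j]
      # partial residual adds back x_j's current contribution
      beta[j] <- drop(crossprod(X[, j], r + X[, j] * old)) / n
      delta   <- beta[j] - old
      r       <- r - X[, j] * delta
      max_delta <- max(max_delta, mean(abs(X[, j] * delta)))  # change in fitted values
    }
    if (max_delta < tol) break      # convergence check after each full cycle
  }
  beta
}
```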
Note that computing the partial gradient for the univariate update requires $O(n)$ operations, and thus $O(np)$ operations are required for a single cycle through all $p$ variables. Compare this with the truncated SVD, which has a complexity of $O(\min(np^2, n^2p))$ (Hogben, 2006). There is a clear computational advantage for a method that uses only coordinate descent in the large-$n$, large-$p$ scenario. Computing the SVD also requires storage for additional matrices, including one of size $p \times n$, $\boldsymbol{V}$, and one of size $p \times p$, $\boldsymbol{V}\boldsymbol{V}^T$. We will show in
Chapter 5 that it is possible to fit our model with coordinate gradient descent without ever
making a copy of the feature matrix, which gives yet another advantage from a memory
standpoint for using a method that avoids the need to compute the SVD.
3.2.2 Coordinate Gradient Descent for Regularized Regression
Friedman et al. (2010) extend coordinate gradient descent to lasso, ridge, and elastic net regression by solving the optimization problem
$$\underset{\boldsymbol{\beta}}{\text{min}}\;\; \frac{1}{2}\|\boldsymbol{y} - \boldsymbol{X\beta}\|_2^2 + \frac{\lambda}{2}(1 - \delta)\|\boldsymbol{P\beta}\|_2^2 + \lambda\delta\|\boldsymbol{P\beta}\|_1 \qquad (3.3)$$
𝛿 = 1 corresponds to lasso regression, 𝛿 = 0 corresponds to ridge regression, and 𝛿 ∈ (0,1)
corresponds to elastic net regression. Additionally, they provide the option to perform a weighted
lasso, where variables are individually penalized by the diagonal penalty matrix, 𝑷 , along with
an overall penalty, 𝜆 . This convex optimization problem is the sum of a differentiable objective
function and separable constraint functions
$$g(\boldsymbol{\beta}) = (1 - \delta)\|\boldsymbol{P\beta}\|_2^2 + \delta\|\boldsymbol{P\beta}\|_1 = \sum_{j=1}^{p}\left[(1 - \delta)w_j\beta_j^2 + \delta w_j|\beta_j|\right]$$
Thus, based on the results of Tseng (2001), coordinate gradient descent will converge to the global minimum. Taking the partial derivative with respect to $\beta_j$ and setting it to zero,
$$\frac{\partial}{\partial\beta_j} = \boldsymbol{x}_j^T(\boldsymbol{X}_{-j}\boldsymbol{\beta}_{-j} + \boldsymbol{x}_j\beta_j - \boldsymbol{y}) + \lambda(1 - \delta)w_j\beta_j + \lambda\delta w_j\frac{\partial}{\partial\beta_j}\|\boldsymbol{\beta}\|_1 = 0$$
the coordinate-wise update for $\beta_j$, given the current estimates of the other parameters, still only requires $O(n)$ operations (equation 3.4):
$$\hat{\beta}_j \leftarrow \frac{\boldsymbol{x}_j^T(\boldsymbol{y} - \boldsymbol{X}_{-j}\hat{\boldsymbol{\beta}}_{-j}) - \lambda\delta w_j\frac{\partial}{\partial\beta_j}\|\boldsymbol{\beta}\|_1}{\boldsymbol{x}_j^T\boldsymbol{x}_j + \lambda(1 - \delta)w_j} = \frac{S\!\left(\boldsymbol{x}_j^T(\boldsymbol{y} - \boldsymbol{X}_{-j}\hat{\boldsymbol{\beta}}_{-j}),\; \lambda\delta w_j\right)}{\boldsymbol{x}_j^T\boldsymbol{x}_j + \lambda(1 - \delta)w_j} \qquad (3.4)$$
where $S(x, t) = \operatorname{sign}(x)(|x| - t)_+$ is the soft-thresholding operator.
Along with a number of other computational ‘tricks’, the implementation of coordinate gradient descent in the R package glmnet was found to be more efficient for solving large-scale regularized regression problems than competing methods such as LARS (Efron et al., 2004).
3.2.3 Application of Coordinate Descent to Hierarchical Regularized Regression
To use coordinate gradient descent for ridge-lasso regression, and more generally for
hierarchical regularized regression, we return to its relationship with the following two-level
regression:
$$\boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim MVN(\boldsymbol{0}_n, \sigma_1^2\boldsymbol{I}_n)$$
$$\boldsymbol{\beta} = \boldsymbol{Z\alpha} + \boldsymbol{\gamma}, \qquad \boldsymbol{\gamma} \sim MVN(\boldsymbol{0}_p, \sigma_2^2\boldsymbol{I}_p)$$
The two-level model can be written as a single regression by substituting the second level into
the first level (equation 3.5). The form of this regression is similar to that of a mixed effects
model, where we would generally think of 𝜸 as random effects that are normally distributed with
common variance. However, we are now effectively treating the 𝜶 ′
𝑠 as independent and
identically distributed Laplace random variables, the Bayesian interpretation of the lasso penalty.
𝒚 = 𝑿𝒁𝜶 + 𝑿𝜸 + 𝝐 (3.5)
𝛼 𝑗 ~ 𝐷𝐸 (𝜆 2
),𝑗 = 1,…,𝑞
𝜸 ~ 𝑀𝑉𝑁 (𝟎 𝒑 ,𝜎 2
2
𝑰 𝒑 )
𝝐 ~ 𝑀𝑉𝑁 (𝟎 𝒏 ,𝜎 1
2
𝑰 𝒏 )
53
The equivalent optimization problem is found using the variable substitution $\boldsymbol{\beta} = \boldsymbol{Z\alpha} + \boldsymbol{\gamma}$ in the objective function for the ridge-lasso (equation 2.5):
$$\underset{\boldsymbol{\alpha},\,\boldsymbol{\gamma}}{\text{minimize}}\;\; \frac{1}{2}\|\boldsymbol{y} - \boldsymbol{X\gamma} - \boldsymbol{XZ\alpha}\|_2^2 + \frac{\lambda_1}{2}\|\boldsymbol{\gamma}\|_2^2 + \lambda_2\|\boldsymbol{\alpha}\|_1 \qquad (3.6)$$
As in the results of Friedman et al. (2010), equation 3.6 consists of a smooth objective function and separable constraint functions, making it a good candidate for coordinate descent. To express equation 3.6 in a form that mimics equation 3.3, we define two diagonal penalty matrices, $\boldsymbol{P}_1$ and $\boldsymbol{P}_2$, that account for both the penalty type and the penalty value (equation 3.7). The overall penalty, $\lambda$, is subsumed into the individual penalties.
$$\underset{\boldsymbol{\beta}^*}{\text{min}}\;\; \frac{1}{2}\|\boldsymbol{y} - \boldsymbol{X}^*\boldsymbol{\beta}^*\|_2^2 + \frac{1}{2}\|\boldsymbol{P}_1\boldsymbol{\beta}^*\|_2^2 + \|\boldsymbol{P}_2\boldsymbol{\beta}^*\|_1 \qquad (3.7)$$
$$\boldsymbol{X}^* = [\boldsymbol{X} \,|\, \boldsymbol{XZ}], \qquad \boldsymbol{\beta}^* = \begin{bmatrix}\boldsymbol{\gamma} \\ \boldsymbol{\alpha}\end{bmatrix}, \qquad \boldsymbol{P}_1 = \mathrm{diag}\bigl(\lambda_j(1 - \delta_j)\bigr), \qquad \boldsymbol{P}_2 = \mathrm{diag}\bigl(\lambda_j\delta_j\bigr)$$
$\lambda_j$ is the penalty on the $j$th variable and $\delta_j$ determines the type of penalty on the $j$th variable ($\delta_j = 1$ is lasso, $\delta_j = 0$ is ridge, and $\delta_j \in (0,1)$ is elastic net). Differentiating with respect to $\beta_j^*$,
$$\frac{\partial}{\partial\beta_j^*} = (\boldsymbol{x}_j^*)^T(\boldsymbol{X}_{-j}^*\boldsymbol{\beta}_{-j}^* + \boldsymbol{x}_j^*\beta_j^* - \boldsymbol{y}) + \lambda_j(1 - \delta_j)\beta_j^* + \lambda_j\delta_j\frac{\partial}{\partial\beta_j^*}\|\boldsymbol{\beta}^*\|_1$$
the coordinate-wise update for $\beta_j^*$, given the current estimates of the other parameters, is
$$\hat{\beta}_j^* \leftarrow \frac{(\boldsymbol{x}_j^*)^T(\boldsymbol{y} - \boldsymbol{X}_{-j}^*\hat{\boldsymbol{\beta}}_{-j}^*) - \lambda_j\delta_j\frac{\partial}{\partial\beta_j^*}\|\boldsymbol{\beta}^*\|_1}{(\boldsymbol{x}_j^*)^T\boldsymbol{x}_j^* + \lambda_j(1 - \delta_j)} = \frac{S\!\left((\boldsymbol{x}_j^*)^T(\boldsymbol{y} - \boldsymbol{X}_{-j}^*\hat{\boldsymbol{\beta}}_{-j}^*),\; \lambda_j\delta_j\right)}{(\boldsymbol{x}_j^*)^T\boldsymbol{x}_j^* + \lambda_j(1 - \delta_j)} \qquad (3.8)$$
Thus, we can use coordinate gradient descent to efficiently compute 𝜶 and 𝜸 and then solve for
𝜷 by back substitution into the second level model, 𝜷 = 𝒁𝜶 + 𝜸 . From the form of equation 3.7,
it is clear that we can consider other penalty types for the coefficients associated with the
predictors, 𝜸 . In Chapter 6, we will also demonstrate how this algorithm can extend to other
outcome types.
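A hedged R sketch of this reparametrized fit for a single $(\lambda_1, \lambda_2)$ pair is shown below. It applies the coordinate-wise update of equation 3.8 to the expanded design matrix $\boldsymbol{X}^* = [\boldsymbol{X} \,|\, \boldsymbol{XZ}]$, with a ridge penalty on $\boldsymbol{\gamma}$ and a lasso penalty on $\boldsymbol{\alpha}$; it is a simplified illustration (no screening rules, warm starts, or convergence refinements), not the production algorithm.

```r
# Coordinate descent for the ridge-lasso via the reparametrized problem (eqn 3.7):
# X* = [X | XZ], ridge penalty lambda1 on gamma, lasso penalty lambda2 on alpha.
soft_threshold <- function(x, t) sign(x) * pmax(abs(x) - t, 0)

ridge_lasso_cd <- function(X, Z, y, lambda1, lambda2, tol = 1e-7, max_iter = 5000) {
  p <- ncol(X); q <- ncol(Z)
  X_star <- cbind(X, X %*% Z)                    # expanded design matrix [X | XZ]
  lam    <- c(rep(lambda1, p), rep(lambda2, q))  # per-variable penalty values
  delta  <- c(rep(0, p), rep(1, q))              # 0 = ridge (gamma), 1 = lasso (alpha)
  xtx    <- colSums(X_star^2)
  b      <- rep(0, p + q)
  r      <- y
  for (k in 1:max_iter) {
    b_old <- b
    for (j in 1:(p + q)) {
      z_j  <- drop(crossprod(X_star[, j], r)) + xtx[j] * b[j]   # x_j'(y - X*_{-j} b_{-j})
      b[j] <- soft_threshold(z_j, lam[j] * delta[j]) /
              (xtx[j] + lam[j] * (1 - delta[j]))                 # coordinate update (eqn 3.8)
      r    <- r - X_star[, j] * (b[j] - b_old[j])
    }
    if (max(abs(b - b_old)) < tol) break
  }
  gamma <- b[1:p]; alpha <- b[(p + 1):(p + q)]
  list(alpha = alpha, gamma = gamma,
       beta = drop(Z %*% alpha) + gamma)         # recover beta = Z alpha + gamma
}
```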
3.2.4 Hyperparameter Tuning Revisited
As described in Section 2.2.3, the hyperparameters $\lambda_1$ and $\lambda_2$ must be “tuned” to find the
combination of values that optimize a defined loss function (i.e. mean square error). In the
previous chapter, we utilized a grid of penalty combinations that were manually formed around
an initial penalty combination. This approach provides no guarantee that the grid contains the
appropriate range of values for each hyperparameter and requires manual adjustment of the range
when the optimal penalty combination falls outside the grid boundary. The new formulation of
the model (equation 3.7) allows us to determine more appropriate upper bounds for each
hyperparameter.
To determine the upper bounds for the ridge-lasso model, we first consider an upper
bound in the case of standard lasso regression. A natural choice is the penalty value that is just
large enough to shrink all coefficients to zero,
$$\lambda_{max} = \inf\{\lambda \in \mathbb{R}_{++} \,|\, \hat{\boldsymbol{\beta}}(\lambda) = \boldsymbol{0}_p\}$$
where $\hat{\boldsymbol{\beta}}(\lambda)$ is the vector of coefficient estimates for a given penalty value, $\lambda$. Based on the coordinate-wise update for lasso regression and the fact that $\hat{\boldsymbol{\beta}}(\lambda) = \boldsymbol{0}_p$ for all $\lambda > \lambda_{max}$, an individual coefficient estimate, $\hat{\beta}_j$, becomes nonzero as we decrease $\lambda$ only if the absolute value of its least-squares estimate, $|\boldsymbol{x}_j^T\boldsymbol{y}|$, is larger than $\lambda_j$. This is clearly seen in the univariate update of equation 3.4, which simplifies when we set the initial solution to $\boldsymbol{\beta} = \boldsymbol{0}_p$ (i.e. the solution for any $\lambda > \lambda_{max}$) and $\delta_j = 1$, $j = 1,\dots,p$. In this simplified form, an estimate can only become nonzero if $|\boldsymbol{x}_j^T\boldsymbol{y}| > \lambda_j$:
$$\hat{\beta}_j \leftarrow \frac{S(\boldsymbol{x}_j^T\boldsymbol{y}, \lambda_j)}{\boldsymbol{x}_j^T\boldsymbol{x}_j}$$
Thus, if we let $\lambda_{max} = \max_{j=1,\dots,p}|\boldsymbol{x}_j^T\boldsymbol{y}|$, the maximum of the least-squares estimates across all variables, the solution is guaranteed to set all coefficients to zero. Similarly, for elastic net regression we can define $\lambda_{max} = \max_j\left|\frac{1}{\delta}\boldsymbol{x}_j^T\boldsymbol{y}\right|$, where $\delta \in (0,1)$ is the elastic net mixing parameter. In the case of ridge regression, estimates shrink to zero only in the limit as $\lambda \to \infty$. A common heuristic is to let $\lambda_{max} = \max_j\left|\frac{1}{0.001}\boldsymbol{x}_j^T\boldsymbol{y}\right|$ for ridge regression and for elastic net regression with any $\delta < 0.001$. The upper bound for all three models can then be written as
$$\lambda_{max} = \max_j\left|\frac{1}{\max(0.001, \delta)}\boldsymbol{x}_j^T\boldsymbol{y}\right| \qquad (3.9)$$
For ridge-lasso regression, we likewise define the maximum values of $\lambda_1$ and $\lambda_2$ as the values that are just large enough to shrink their respective coefficients, $\boldsymbol{\gamma}$ and $\boldsymbol{\alpha}$, to zero. However, the penalty value that guarantees the corresponding coefficients shrink to zero depends on the value of the other hyperparameter. To understand the relationship between the two hyperparameters, first consider the minimum value of $\lambda_2$ that still ensures all second-level coefficients, $\boldsymbol{\alpha}$, are zero, as a function of $\lambda_1$:
$$\lambda_{2(max)}(\lambda_1) = \inf\{\lambda_2 \in \mathbb{R}_{++} \,|\, \hat{\boldsymbol{\alpha}}(\lambda_1, \lambda_2) = \boldsymbol{0}_q\}$$
To determine $\lambda_{2(max)}(\lambda_1)$, consider the solution to equation 3.7 when $\boldsymbol{\gamma} = \boldsymbol{0}_p$, the limiting solution as $\lambda_1 \to \infty$. The problem reduces to a lasso regression of $\boldsymbol{y}$ on $\boldsymbol{XZ}$, and the lasso update for each $\hat{\alpha}_j$ is nonzero only when $|(\boldsymbol{XZ})_j^T\boldsymbol{y}| > \lambda_2$. The maximum second-level penalty in this limit is
$$\lim_{\lambda_1 \to \infty}\lambda_{2(max)}(\lambda_1) = \max_j\left|(\boldsymbol{XZ})_j^T\boldsymbol{y}\right|$$
For finite $\lambda_1$, we use the fact that $\hat{\boldsymbol{\alpha}} = \boldsymbol{0}_q$ for all $\lambda_2 > \lambda_{2(max)}$, in which case equation 3.7 reduces to a standard ridge regression. The lasso update for $\hat{\alpha}_j$ now becomes nonzero only if
$$\left|(\boldsymbol{XZ})_j^T\left(\boldsymbol{y} - \boldsymbol{X}\hat{\boldsymbol{\gamma}}_{ridge}(\lambda_1)\right)\right| > \lambda_2$$
and the maximum second-level penalty can then be defined as
$$\lambda_{2(max)}(\lambda_1) = \max_{j=1,\dots,q}\left|(\boldsymbol{XZ})_j^T\left(\boldsymbol{y} - \boldsymbol{X}\hat{\boldsymbol{\gamma}}_{ridge}(\lambda_1)\right)\right| \qquad (3.10)$$
where $\hat{\boldsymbol{\gamma}}_{ridge}(\lambda_1)$ is the solution to a standard ridge regression with penalty $\lambda_1$. Using a similar argument, we can define a maximum penalty value for $\lambda_1$ conditional on the value of $\lambda_2$. Again, because the ridge penalty never shrinks any estimate exactly to zero, we inflate the penalty:
$$\lambda_{1(max)}(\lambda_2) = \max_j\left|\frac{1}{0.001}\boldsymbol{x}_j^T\left(\boldsymbol{y} - \boldsymbol{XZ}\hat{\boldsymbol{\alpha}}_{lasso}(\lambda_2)\right)\right| \qquad (3.11)$$
With these improved upper bounds, a lower bound for each hyperparameter is defined as a fraction of the upper bound, $\lambda_{min} = c\,\lambda_{max}$, $c \in (0,1)$. A sequence of equally spaced values, generally on a log scale, is generated between the upper and lower bounds to form the path of penalty values considered for each hyperparameter.
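These bounds can be sketched in R as follows; gamma_ridge() and alpha_lasso() are hypothetical helpers that return the conditional fits referenced in equations 3.10 and 3.11 (a standard ridge of y on X, and a lasso of y on XZ), and X, Z, and y are assumed to already exist.

```r
# Data-driven upper bounds for the two hyperparameters (eqns 3.9 - 3.11)
XZ <- X %*% Z

# lambda_2,max given lambda1: smallest lasso penalty that zeroes all alpha (eqn 3.10)
lambda2_max <- function(lambda1) {
  gamma_hat <- gamma_ridge(lambda1)
  max(abs(crossprod(XZ, y - X %*% gamma_hat)))
}

# lambda_1,max given lambda2: inflated because ridge never shrinks exactly to zero (eqn 3.11)
lambda1_max <- function(lambda2) {
  alpha_hat <- alpha_lasso(lambda2)
  max(abs(crossprod(X, y - XZ %*% alpha_hat))) / 0.001
}

# Penalty path: log-spaced sequence from lambda_max down to a fraction c of it
lambda_path <- function(lambda_max, c = 0.01, m = 20)
  exp(seq(log(lambda_max), log(c * lambda_max), length.out = m))
```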
3.2.5 Pathwise Coordinate Descent in Two Dimensions
To evaluate the penalty grid, pathwise coordinate descent is used to efficiently evaluate
sequences of penalties by using “warm starts”. Given two consecutive penalties, $\lambda^{(k)}$ and $\lambda^{(k+1)}$, a ‘warm start’ refers to using the solution found at $\lambda^{(k)}$ as the initial solution in the coordinate descent algorithm at $\lambda^{(k+1)}$. Because the solutions at the two penalty values should be relatively close, initializing at the previous solution should reduce the number of iterations needed for coordinate descent to converge at $\lambda^{(k+1)}$. The penalty sequence is generally evaluated from the
largest penalty value to the smallest penalty value. Reasons for evaluating the sequence in this
order will be discussed in chapter 5. As a precursor to that discussion, evaluating the penalty
sequence in a decreasing fashion will allow us to take advantage of screening rules that can
significantly improve computational speed.
Pathwise coordinate descent is adapted to a two-dimensional setting by first ordering both
penalty sequences from largest to smallest, where $m_1$ and $m_2$ are the lengths of the respective sequences:
$$\lambda_1^{(1)} > \lambda_1^{(2)} > \cdots > \lambda_1^{(m_1)}, \qquad \lambda_2^{(1)} > \lambda_2^{(2)} > \cdots > \lambda_2^{(m_2)}$$
Figure 3.1 Two-Dimensional Pathwise Coordinate Descent (a grid with rows indexed by $\lambda_1^{(1)}, \dots, \lambda_1^{(m_1)}$ and columns indexed by $\lambda_2^{(1)}, \dots, \lambda_2^{(m_2)}$)
The solution to the hierarchical regularized regression is first computed for the largest penalties under consideration, $\lambda_1 = \lambda_1^{(1)}$ and $\lambda_2 = \lambda_2^{(1)}$. After determining the solution for the largest penalties, we solve along the remaining sequence of second-level penalties from largest to smallest, holding $\lambda_1 = \lambda_1^{(1)}$. The first-level penalty is then set to the next smallest value, $\lambda_1 = \lambda_1^{(2)}$, and the solutions are again computed across the sequence of second-level penalties. Following this pattern for all penalty combinations, the path can be visualized as traversing the rows of a two-dimensional grid (Figure 3.1). Alternatively, one could consider a strategy that traverses the columns of the grid instead.
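A schematic R sketch of the row-wise traversal with warm starts, where fit_ridge_lasso() is a hypothetical single-fit routine that accepts an initial solution:

```r
# Row-wise traversal of the penalty grid with warm starts
lambda1_seq <- sort(lambda1_seq, decreasing = TRUE)
lambda2_seq <- sort(lambda2_seq, decreasing = TRUE)

fits <- vector("list", length(lambda1_seq))
init <- NULL                                   # NULL -> start from the zero solution
for (i in seq_along(lambda1_seq)) {
  row_fits <- vector("list", length(lambda2_seq))
  for (j in seq_along(lambda2_seq)) {
    fit <- fit_ridge_lasso(X, Z, y, lambda1_seq[i], lambda2_seq[j], init = init)
    init <- fit                                # warm start for the next grid point
    row_fits[[j]] <- fit
  }
  fits[[i]] <- row_fits
}
```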
3.3 Discussion
In this section, we proposed an alternative method for fitting our model using only
coordinate descent and highlighted some of the conceptual and computational advantages. By
rewriting the model, we avoid the need to compute the SVD, which can become computationally
expensive as both the number of observations and number of predictors grow. The alternative
representation also allows us to generate well-defined upper bounds for both hyperparameters,
leading to an easier to implement cross-validation procedure that ensures we are searching in the
right range of values. Beyond reducing the computational burden, the alternate fitting method
also opens the opportunity to explore other variants of the model. Specifically, we can now
consider different penalty types other than ridge for the predictors and we can explore other
outcome types, including binary or survival, which tend to be more prevalent among those
available for health-related outcomes.
Despite the success of coordinate descent to solve convex optimization problems such as
ours, there are other algorithms that are actively used to solve these types of problems. For
example, if we want to expand the use of our model to data that is stored in a distributed
environment, coordinate descent is much less appealing computationally due to the serial nature
of the algorithm. More recent approaches to handle very large multi-node data sources rely on
algorithms that are parallelizable. One example is Apache Spark, which defaults to the use of
either l-BFGS (Nocedal and Wright, 2006), a quasi-Newton approach, or stochastic gradient
descent to solve regularized regression problems in a distributed environment. However, until
data within our domain of interest reaches a size that is not manageable on a single large
computational node, we prefer to work with a method that requires less upfront effort to setup.
There are potential modifications to the strategy presented to evaluate the two-
dimensional grid of hyperparameter combinations that could improve the computational speed of
parameter tuning. As an example, rather than traversing the hyperparameter grid row-by-row, we
could first fit the model for all $\lambda_1$ values (i.e. the first column of Figure 3.1). The resulting problem is a standard regularized regression model in which $\boldsymbol{\alpha} = \boldsymbol{0}_q$. Then pathwise coordinate
descent could be applied in parallel across the rows of the grid. We could also abandon the grid
search altogether in favor of new approaches for hyperparameter optimization that include
stochastic search methods or Bayesian optimization algorithms (Bergstra and Bengio, 2012;
Snoek, Larochelle and Adams, 2012). However, it is unclear whether these approaches could still
utilize warm-starts to speed up the computation across a path of solutions. Full-scale simulations
would be necessary to determine if any speed gains can be achieved with these newer
approaches. Customization of these off-the-shelf optimization methods to our model would also
be necessary for a fair comparison.
The most important takeaway of this chapter is that reframing the original problem
allowed us to use a well-studied algorithm with good convergence properties to fit our model.
Another benefit is that the reparametrization in (3.5) provides an alternative way to interpret our
model. From equation 3.7, we see that our model is a regression of the original predictor
variables, 𝑿 , and linear combinations of those predictor variables constructed from the external
data, 𝑿𝒁 , on the outcome. Thus, our model is effectively capturing whether the meta-features
constructed from 𝒁 capture the underlying signal better than 𝑿 itself, 𝜶 , and whether the effect of
any individual feature in 𝑿 deviates significantly from these summarized effects, 𝜸 . Overall, the
combination of improved model interpretation and increased model flexibility expand the
applicability of hierarchical regularized regression to real world data sets.
Chapter 4 Applications to Real Data
4.1 Overview
Beyond the promising results of the simulations in chapter 2, we are truly interested in
assessing the ability of our model to improve the prediction of health-related outcomes in real
data sets. We also want to evaluate how the resulting estimates compare to standard approaches
and if any additional insights can be gained by jointly modelling the predictors and external data.
In the following examples, we evaluate our modelling framework on three data sets that are
made up of different types of genomic data sources and different sources of external information.
The first example explores the association of DNA methylation with age and whether this
association can be captured at the gene level. The second and third examples demonstrate the use
of genomic annotations and gene expression signatures to improve the prediction of cancer-
related outcomes.
4.2 Prediction of Chronological Age with Methylation Data
4.2.1 Introduction
Horvath (2013) developed a model that demonstrated strong correlation between DNA
methylation levels and chronological age (r > 0.9). The multi-tissue prediction model he
developed is the result of training an elastic net regression model on 39 data sets for which
methylation data was publicly available. The data sets were from one of two Illumina platforms
(Infinium 450K and Infinium 27K) for which 21,369 of the CpG probe sites overlapped. For our
analysis, we used another publicly available dataset consisting of 656 individuals with
methylation data collected on the Infinium 450K platform. The total number of measured CpG
sites available in this dataset was 473,034.
Given that DNA methylation plays an important role in the regulation of gene expression,
an interesting question is whether the association of DNA methylation with age can be captured
at the gene level. Specifically, if we can map the DNA methylation probes back to the gene or
genes that they regulate, do probes that map to the same gene tend to have a similar association
with age, and does this additional information improve the prediction of age?
4.2.2 Methods
To evaluate this question, we generated an external data matrix that maps each of the p methylation probes to its closest gene, in terms of genomic distance, and used these mappings to group the probes by gene. The resulting external data matrix, Z (p × q), consists of q columns representing the q unique genes. The j-th column of Z acts as a 0-1 indicator variable that codes all probes that map to the j-th gene as one and the rest as zero. The columns of Z are normalized by the number of probes that map to the corresponding gene such that each column sums to one. Each second-level coefficient, α_j, then represents the average effect on chronological age of all probes that map to gene j. As mentioned previously, the use of genes as external data is reasonable given that we might expect probes that map to the same gene to have a similar association with age. Additionally, the number of probes that map to the same gene is moderate in size, with a mean of approximately 33 probes contributing to the estimation of each gene-level effect.
Rather than consider all probes, the dimensionality of the methylation data is reduced to
only include the top 250,000 probes by empirical variance. To reduce the number of genes in the
external data, we only consider those genes that have at least 10 probes mapped to them. The
resulting external data matrix consists of 6,766 unique genes.
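The sketch below illustrates how such a probe-by-gene indicator matrix could be constructed. The data frame probe_gene, with one row per probe and columns probe and gene, is a hypothetical input used only for illustration and is not the exact code used in the analysis.

library(Matrix)

# Build a sparse p x q probe-by-gene indicator matrix and normalize each column
# by its number of mapped probes so that every column sums to one.
build_probe_gene_Z <- function(probe_gene) {
  probes <- unique(probe_gene$probe)
  genes  <- unique(probe_gene$gene)
  Z <- sparseMatrix(
    i = match(probe_gene$probe, probes),
    j = match(probe_gene$gene, genes),
    x = 1,
    dims = c(length(probes), length(genes)),
    dimnames = list(probes, genes)
  )
  Z %*% Diagonal(x = 1 / Matrix::colSums(Z))
}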
We compared lasso, ridge, and elastic net regression to their equivalent models in our
framework. To evaluate the predictive performance of our model compared to the standard
regularized regression models, 50 training and test data sets were generated by randomly
splitting the original methylation data set, with 80% of the observations reserved for a training
data set and the other 20% reserved for a test data set. 10-fold cross-validation was used to tune the hyperparameters in each training data set. As in the previous simulations, the optimal hyperparameters are those that produce the minimum cross-validated MSE. The MSE in the corresponding test data set was then used to compare the predictive ability of the models.
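A single replicate of this evaluation could look like the following sketch. The objects x, y, and Z denote the methylation matrix, the age vector, and the external matrix described above, and the tune_xrnet() argument names shown (x, y, external, loss, nfolds) are assumptions about the interface rather than a transcription of the analysis code.

library(xrnet)

set.seed(1)
train <- sample(nrow(x), size = floor(0.8 * nrow(x)))

# Tune both penalties by 10-fold cross-validation on the training split
cvfit <- tune_xrnet(
  x = x[train, ], y = y[train], external = Z,
  loss = "mse", nfolds = 10
)

# Evaluate the tuned model on the held-out 20%
pred     <- drop(predict(cvfit, newdata = x[-train, ]))
test_mse <- mean((y[-train] - pred)^2)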
4.2.3 Results
Table 4.1 reports the results for each standard model and the corresponding hierarchical
model with the second-level penalty on the external data that performed best with respect to the
test MSE. As in the results of Horvath (2013), we also find that among the standard methods,
elastic net has the best performance in terms of test MSE. Each hierarchical model decreases the test MSE compared to its standard counterpart, and the elastic net-ridge model obtains the lowest median and mean MSE overall. The largest improvement is an 18% reduction in the test MSE of
ridge-ridge regression compared to ridge regression. Regardless of the penalty applied to the
first-level predictors (i.e. methylation probes), a ridge penalty on the external probe-gene
mapping data consistently performed better than either a lasso or elastic net penalty.
Method            Ridge   Ridge-Ridge   Lasso   Lasso-Ridge   Elastic Net   Elastic Net-Ridge
Mean Test MSE     51.8    44.1          33.0    32.4          29.1          28.7
Median Test MSE   51.6    43.1          32.5    31.3          28.5          28.0
Table 4.1 Mean and Median Test MSE by Method
4.3 Using Gene Annotations to Improve Prediction of Prostate Cancer Recurrence
4.3.1 Introduction
Identification of patients with localized prostate cancer who are at high risk for
recurrence is a challenging problem. Current methods to identify this subset of patients rely
primarily on clinical features of the tumor (i.e. Gleason score) and the patient (i.e. age, prostate-
specific antigen levels) (Cooperberg et al., 2005; D'Amico et al., 1998; Kattan et al.,
1998). However, several results have shown that incorporating additional biomarker information
derived from the tumor can improve the detection of aggressive tumors that are more likely to
result in recurrence later on (Cullen et al., 2015; Cuzick et al., 2011; Erho et al., 2013). As an
example, Shahabi et al. (2016) developed an elastic net regression model that utilized both
clinical variables and gene expression data. To develop the model, genome-wide expression profiles (50–200 ng RNA, Whole-Genome DASL-HT Assay) were collected from each of 187 patients. Their final model defined a 28-gene expression signature that was shown to improve classification of clinical recurrence compared to a model that used the clinical variables only. In addition to the elastic net model, gene set enrichment analyses were completed for the
28-gene signature to determine whether there was enrichment for any Gene Ontology (GO)
terms. Table 4.2 shows the molecular function terms that were found to be significantly enriched
after adjustment for multiple comparisons.
4.3.2 Methods
We extend the analysis of Shahabi et al. (2016) by considering the enriched molecular
functions as a source of external data. Our belief is that genes that have common molecular
functions or that are involved in the same biological processes may have similar effects on an
outcome of interest. Like the methylation example, the enriched annotations are incorporated into an external data design matrix as 0-1 indicator variables, where the j-th column indicates whether each gene has the j-th enriched annotation. Figures 4.1 and 4.2 visualize the relationship between the gene expression data, the GO annotations, and the external data matrix. Note that this coding of the genomic annotations does not differentiate between the case when gene “G” does not have genomic annotation “A” and the case when gene “G” has not yet been evaluated for genomic annotation “A”. How to treat GO annotations more generally is an entirely separate problem, but we want to clearly define how the annotations are coded here.
To determine the appropriate subset of enriched GO terms to use, additional processing is
needed to account for the hierarchical nature of the Gene Ontology and to accurately identify all
genes that have evidence for one of the enriched annotations. In the Gene Ontology, annotations
are related to each other through a set of standardized parent-child relationships that can be
visualized as an ancestry tree.
Figure 4.1 Relationship between Gene Expression and Gene Ontology Annotation Data
Figure 4.2 Gene Expression Data Matrix (bottom) and Gene-Annotation Data Matrix (top)
Category          Molecular Function GO Term                                    Gene Ontology ID   Included in Analysis
Down Regulated    oxidoreductase activity                                       GO:0016491         Yes
Down Regulated    coenzyme binding                                              GO:0050662         Yes
Down Regulated    NADP binding                                                  GO:0050661         Yes
Down Regulated    oxidoreductase activity, acting on CH-OH group of donors      GO:0016614         Yes
Down Regulated    oxidoreductase activity, acting on the CH-OH group of
                  donors, NAD or NADP acceptor                                  GO:0016616         No
Down Regulated    peptide binding                                               GO:0042277         Yes
Down Regulated    cofactor binding                                              GO:0048037         No
Down Regulated    amide binding                                                 GO:0033218         No
Up Regulated      growth factor binding                                         GO:0019838         Yes
Table 4.2 Summary of Enriched Molecular Function Gene Ontology Annotations in Shahabi et al. (2016)
For each gene, a recursive search starting at the most granular level of the ancestry tree (i.e. the tips of the tree) is used to determine whether the gene has one of the enriched GO annotations.
As an example, consider the ME2 gene that has the following molecular function
annotations available from the Panther Classification System (www.pantherdb.org):
GO:0004470, GO:0004471, GO:0004473, GO:0008948, GO:0009055, GO:0046872, GO:0051287
Now, consider the first of these annotation terms for ME2, GO:0004470.
The GO ancestry tree in Figure 4.3 describes the relationship of this annotation with its parent
annotation terms. Notice that two of the parents in the ancestry tree for GO:0004470 are enriched
annotations in Table 4.2 (GO:0016491, GO:0016491).
As a result of the hierarchical relationship between annotation terms, there exists
significant overlap among terms that results in high correlation between GO terms (Figure 4.4).
To reduce the correlation between the external data features, only one of the GO terms in each of
the three highly correlated pairs is kept for the analysis as summarized in Table 4.2.
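The recursive ancestry check can be sketched as follows. The named list go_parents (mapping each GO term to a character vector of its direct parent terms) and the vector enriched_terms (the annotations retained from Table 4.2) are hypothetical objects used only to illustrate the idea.

# TRUE if `term` or any of its ancestors is one of the enriched annotations
has_enriched_ancestor <- function(term, go_parents, enriched_terms) {
  if (term %in% enriched_terms) return(TRUE)
  parents <- go_parents[[term]]
  if (is.null(parents) || length(parents) == 0) return(FALSE)  # reached a root
  any(vapply(parents, has_enriched_ancestor, logical(1),
             go_parents = go_parents, enriched_terms = enriched_terms))
}

# A gene is flagged if any of its annotated terms (e.g. the ME2 terms listed above)
# reaches an enriched term while walking up the ancestry tree.
gene_has_enrichment <- function(gene_terms, go_parents, enriched_terms) {
  any(vapply(gene_terms, has_enriched_ancestor, logical(1),
             go_parents = go_parents, enriched_terms = enriched_terms))
}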
To assess the performance of ridge-lasso compared to ridge regression, stratified 3-fold
cross-validation was performed 50 times. Stratified k-fold cross-validation was used due to the
significant imbalance in the case/control ratio. Additionally, the small sample size in this data set
did not allow us to separate the samples into training and testing data sets. For each cross-
validation run, the highest average cross-validated AUC was recorded for both methods. To fit a
binary outcome with the linear version of our model, we utilized the relationship between linear
discriminant analysis (LDA) and linear regression (Hastie, Friedman and Tibshirani, 2001).
Specifically, it can be shown that the linear regression coefficients for a two-class outcome will
be proportional to the discriminant direction in LDA. The intercept in the linear regression model
needs to be adjusted to account for the multivariate normality assumption in LDA. However,
since the computation of the AUC only requires the relative ranking of predictions, we do not
need to compute the adjusted intercept, as this only shifts the predictions and would not change
their relative ranking. For a full description of the relationship between linear regression and
LDA, see Appendix A.3.
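Because only the relative ranking of the linear scores matters for the AUC, scoring and evaluation reduce to a few lines. In the sketch below, beta_hat, x_test, and y_test are placeholders for the fitted coefficients, held-out predictors, and held-out 0/1 labels.

# Linear (LDA-based) scores; the intercept adjustment can be skipped for AUC
score <- as.numeric(x_test %*% beta_hat)

# Rank-based (Wilcoxon) estimate of the AUC for 0/1 labels
auc <- function(y, s) {
  r  <- rank(s)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc(y_test, score)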
Figure 4.3 Gene Ontology Ancestry Tree for Molecular Function Annotation GO:0004470
(malic enzyme activity)
Figure 4.4 Correlation Between Enriched Molecular Function Gene Ontology Terms
4.3.3 Results
Figure 4.5 summarizes the average cross-validation AUC across the 50 trials for both
methods. The average difference in AUC between ridge-lasso and ridge was approximately
0.015, a small improvement. Each external variable (i.e. annotation) had a nonzero estimate in at least 15 of the 50 cross-validation trials (Table 4.3), with GO annotations GO:0016614 and GO:0050662 being selected in over 80% of the trials. Additionally, the direction of effect
tends to match the results of Shahabi et al. (2016). As an example, the coefficient estimate for
GO:0016614 (-0.51) can be interpreted as an increased risk of clinical recurrence when genes
that have this genomic annotation exhibit decreased expression.
Figure 4.5 Comparison of Cross-validation AUC by Method
Annotation                     GO:0016491   GO:0050662   GO:0016614   GO:0019838   GO:0050662   GO:0042277
Average Selection Rate         32%          78%          86%          20%          84%          38%
Average Coefficient Estimate   0.11         -0.18        -0.51        0.011        -0.30        0.003
Shahabi Regulation Direction   Down         Down         Down         Up           Down         Down
Table 4.3 Average Selection Rate and Average Coefficient Estimates of External Data (Enriched Annotations)
4.4 Recovery of Gene Expression Signatures for Breast Cancer Mortality
4.4.1 Introduction
Molecular characteristics of breast cancer tumors are an important factor for determining
treatment regimen and are also independent prognostic factors for cancer recurrence and survival
(Kwa, Makris and Esteva, 2017). METABRIC is an international consortium that further
classified tumors based on molecular signatures. An analysis of 2,000 breast cancer patients in
the METABRIC cohort successfully divided tumors into ten novel subgroups based on molecular
characteristics that correlated strongly with survival (Curtis et al., 2012). The DREAM Breast
Cancer Prognosis Challenge (Margolin et al., 2013) used the METABRIC data set to further
improve prediction of survival based on clinical variables, gene expression, and copy number
variation. The top model in this competition made significant use of four gene signatures,
referred to as attractor metagenes, that capture molecular events known to be associated with
clinical outcomes in many cancers (Cheng, Yang and Anastassiou, 2013b). Their final model
used the average expression of the 10 top-ranked genes, in terms of average mutual information,
in each metagene as predictors. Of the four metagenes, only two were predictive across all
patients, CIN and FGD3-SUSD3, while the other two metagenes were predictive within
subgroups, MES and LYM.
4.4.2 Methods
As in the previous example for prostate cancer, we use the gene expression data,
consisting of 29,477 probes, as our primary features. However, we now utilize the previously
defined attractor metagenes as external data by grouping probes that are in the same metagene. In the resulting external data matrix, the j-th column codes all probes that map to the j-th metagene as one and the rest as zero. The CIN, MES, and LYM metagenes each consist of 100 genes, but for
our analysis, we only considered the 50 top-ranked genes in terms of average mutual information
(Cheng, Yang and Anastassiou, 2013a). The FGD3-SUSD3 metagene consists of only two genes.
Each column of the external data matrix was normalized by the number of probes in the
corresponding metagene such that each column summed to one. In addition to the gene
expression data, we also considered two unpenalized clinical variables, age and whether a patient
was lymph node positive.
The training and testing METABRIC data sets were subset to those patients who are ER+
and HER2- due to significant heterogeneity in expression between ER+/HER2-, ER-, and
HER2+ tumors. This subgroup analysis aligns with previously developed gene signatures that
were shown to only be predictive within this subgroup (Kwa et al., 2017). The patient data was
previously divided into training and validation datasets for the METABRIC open source
competition. We use the same division of training and testing data. The time to mortality
outcome was converted to a dichotomous variable to indicate whether a patient died due to breast
cancer within 7.5 years of time of diagnosis. Sample sizes after subsetting to ER+/HER2-
patients not censored within 7.5 years were 485 and 452 in the training and test data sets,
respectively. In addition to using a cutoff of 7.5 years, cutoffs of five and ten years were also
evaluated. Here we again use the relationship between LDA and linear regression to fit this
binary outcome as continuous in our framework.
The predictive performance of the ridge-lasso model was compared to both linear ridge
regression and logistic ridge regression. The hyperparameters for all three models were tuned by
using stratified repeated 5-fold cross-validation. Stratification was used due to the class
imbalance of cases and controls. AUC in the test data set was used to compare the predictive
ability of the models.
4.4.3 Results
Ridge-lasso regression improves class prediction over both linear and logistic ridge
regression. Similar improvements were found if we used a cutoff of five or ten years as well
(data not shown). Addition of age and lymph node status improved the AUC significantly in all
three methods, with positive lymph node status being a strong predictor of the outcome. The signs of the estimates for the metagene variables at the optimal penalty combination match those estimated in Cheng et al. (2013), with a proliferative (i.e. positive) effect for genes in the CIN metagene and a protective effect for the FGD3-SUSD3 metagene (Table 4.5). It is also interesting to note that the CIN and FGD3-SUSD3 metagenes were found to be predictive in the entire sample, while the MES and LYM metagenes were only predictive in a sub-sample in the results of Cheng et al. (2013). These results align with our model, which is fit on the entire sample and shrinks the effects of the MES and LYM metagenes to zero.
Since only a small number of probes map to a metagene, the effect of the external data on
the regression coefficients for these probes can be visualized (Figure 4.6). The FGD3-SUSD3 probes exhibit the largest coefficient change, moving from a near-zero effect estimate in ridge regression to a strong protective effect in the ridge-lasso model.
Method           Test AUC (Gene Expression Only)   Test AUC (Gene Expression + Clinical)
Linear Ridge     0.695                              0.741
Logistic Ridge   0.684                              0.731
Ridge-Lasso      0.713                              0.755
Table 4.4 Comparison of Test AUC between Ridge Regression and Ridge-Lasso
Metagene      Coefficient Estimate (Gene Expression Only)   Coefficient Estimate (Gene Expression + Clinical)
CIN           0.13                                           0.16
MES           0.0                                            0.0
LYM           0.0                                            0.0
FGD3-SUSD3    -0.06                                          -0.05
Table 4.5 Coefficient Estimates for External Data Variables (Metagenes)
Figure 4.6 Comparison of Coefficient Estimates for Metagene Probes Between Ridge
Regression and Ridge-Lasso Regression
4.5 Discussion
In this chapter, we applied our model to predict health-related outcomes with different
types of genomic data sources and different types of external information. These examples
demonstrate that in addition to continuous external data variables, as used in the Chapter 2
simulations, we can also incorporate binary indicator variables to group genomic features with
similar characteristics. The use of the relationship between linear regression and LDA also
allowed for exploration of binary outcomes. All three examples demonstrated the potential for
hierarchical regularized regression to improve predictive performance by incorporating external
data sources in our regularization framework. For each example, we now discuss the individual
challenges encountered when trying to incorporate the external information.
The prediction of age based on methylation data highlighted our ability to incorporate
both high-dimensional predictors and high-dimensional external data. The external information
is rather simple and assumes that the measured methylation at each CpG site is associated with
expression of the nearest gene. However, methylation of CpGs could impact more distant genes rather than the nearest. The ideal information for this example would be an external sample that has both gene expression and methylation data available. We could then model the association between methylation and gene expression. By capturing the strength of
the association between CpG sites and transcription sites, we could use this information in place
of the potentially misclassified 0-1 binary indicators. Despite not having this level of
information, comparing ridge regression to the ridge-ridge model, it does seem that the binary
indicators are capturing groups of probes that have a common association with age. An
additional challenge of this example is that the performance of the comparison methods is
already remarkably high, with a test R² greater than 0.95 for both lasso and elastic net regression.
With little room for improvement, it is of no surprise that we see smaller gains in performance in
our model compared to these two comparison methods.
The prediction of prostate cancer recurrence with gene expression data presented a much
more complicated source of external information: functional annotations from the Gene Ontology (GO). As mentioned in the description of this example, the primary challenge lies in the
hierarchical structure of the GO terms. Given the correlation between sets of terms, we realized
that one cannot simply throw all GO terms into a model as a set of external data features and
hope that the regularization term will properly account for their correlation. Another challenge
when using the Gene Ontology is the sparsity of the annotations. Selecting GO terms high in the
ancestry tree (i.e. near the roots) leads to broad terms that most likely do not capture processes
that are specific to the health outcome of interest. At the other end of the spectrum, highly
specific GO terms may only be associated with one or two genes. An expert opinion is most
likely necessary to compile a set of relevant GO terms that are potentially informative for both
the direction and magnitude of effect of a set of genomic features on a given outcome.
It is also important to note that the improvements in prediction in the prostate cancer example cannot be fully trusted, for two reasons. The first is that the GO terms used are based on an enrichment analysis of the same sample. The second is that we did not have a sample size adequate for generating a hold-out test data set. Both issues raise a strong possibility that our model could be overfitting and would not generalize as well to an external data set. However, the primary
purpose of this example was to verify whether we could capture the same trends found by the
enrichment analysis. With this objective in mind, the ridge-lasso was effective in capturing the
same trends at a functional level.
Many of the challenges encountered in the prostate cancer example were not present in
the breast cancer example. The metagenes were based on an external data set and a separate test
data set was available to evaluate the predictive performance. However, the number of probes
that make up the metagenes is small (195) relative to the total number of expression probes
(20,000+). To have an impact on predictive performance (i.e. AUC), the regression coefficients
for the metagene probes need to exhibit a large enough change relative to the other probes to
affect their relative ranking. We also recognized that there is still a potential for overfitting when
cross-validation is used to tune the two hyperparameters. Overfitting within cross-validation is a
known issue (Cawley and Talbot, 2010), and a heuristic “one standard error” rule has been
applied in practice for standard regularized regression models. The “one standard error” rule
increases the regularization until the cross-validation error is approximately within one standard
error of the optimal value. Future work could determine whether a similar empirical “one
standard error” rule could be applied to both hyperparameters in our model.
A clear theme across all examples that is not specific to our model is the need to identify
potential sources of external information and to translate that information into a form that can be
used within a given modelling framework. As data related to genomic features continues to be
generated through curation of experimental results, the usefulness of this information in any
modelling context needs further evaluation. Overall, our results demonstrate that biological
annotations and other sources of external information for genomic data are a promising resource
to improve the prediction of health-related outcomes.
Chapter 5 xrnet: An R package for Hierarchical Regularized Regression
5.1 Overview
We now describe the R package developed to fit our model, including the primary R
functions to fit and tune hierarchical regularized regression models. Beyond providing a set of
methods to fit our model, the xrnet package also attempts to improve upon the popular R
package glmnet in terms of relative speed and memory usage by using design elements found in
the more recent R package biglasso (Zeng and Breheny, 2017). Figure 5.1 provides a high-level
view of the R package. Beyond functions to fit and ‘tune’ hierarchical regularized regression
models, xrnet also provides additional functionality to visualize and digest model results. A
significant amount of time was devoted to the overall software design to ensure ease of use and
extensibility. In the next section, we discuss several key components of xrnet that allow for
efficient computation of our model in terms of computational speed and memory usage. xrnet
can be downloaded from the public code repository https://github.com/USCbiostats/xrnet, and full documentation can be viewed at https://uscbiostats.github.io/xrnet/.
Figure 5.1 xrnet R Package Structure
5.2 Methods
5.2.1 Coordinate Descent Solver
As discussed in Chapter 3, we extended the coordinate descent algorithm of Friedman et
al. (2010) to enable variable-specific penalty values and penalty types. The base solver for
weighted least-squares regression and the extension of the base solver to logistic regression are
written in C++. Rcpp (Eddelbuettel and Francois, 2011) was used as the interface between R and
C++ and the Eigen C++ template library was used to handle the primary data structures due to its
strong linear algebra library and support for both dense and sparse matrices. Algorithm 2 shows the base algorithm used to solve a weighted elastic net regression problem via coordinate descent, which includes both lasso and ridge regression as subproblems. The update for each variable requires steps that can be completed in O(n) operations, leading to an overall computational cost of O(np) operations for a single pass over all variables. Additionally, many of the required quantities do not change across iterations and can be pre-computed.
For the case of a binary outcome, a quadratic approximation of the logistic likelihood is
used to form a weighted least squares problem that can then be used with Algorithm 2. The full
solver requires an additional loop outside of our primary coordinate solver to update the
weighted least squares approximation. For more details on how we form the quadratic
approximation and weighted least squares problem, see Chapter 6.
Algorithm 2: Coordinate Descent for Weighted Elastic Net Regression
Input:
    y: outcome variable (n × 1 vector)
    X: predictor / design matrix (n × p matrix)
    W: diagonal weight matrix (observation-specific, n × n)
    λ: penalty values (variable-specific, p × 1 vector)
    δ: penalty types (variable-specific, p × 1 vector)
Output:
    β: regression coefficient estimates (p × 1 vector)
Procedure:
    β^(0) ← 0
    r ← Wy
    k ← 1
    maxDelta ← ∞
    while (maxDelta > ε)
        maxDelta ← 0
        foreach j ∈ [1, …, p]
            β_j^(k) ← S(x_j^T r + (x_j^T W x_j) β_j^(k−1), λ_j δ_j) / (x_j^T W x_j + (1 − δ_j) λ_j),
                      where S(z, γ) = sign(z) · max(|z| − γ, 0) is the soft-thresholding operator
            Δβ ← β_j^(k) − β_j^(k−1)
            r ← r − Δβ · W x_j
            maxDelta ← max(maxDelta, Δβ² · x_j^T W x_j)
        end
        k ← k + 1
    end
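As an illustration only (the production solver in xrnet is written in C++ via Rcpp and Eigen), a direct R transcription of this update loop might look as follows, where w is the diagonal of W stored as a vector.

coord_descent_wls <- function(X, y, w, lambda, delta, eps = 1e-7, max_iter = 1e4) {
  p    <- ncol(X)
  beta <- numeric(p)
  r    <- w * y                      # weighted residual W(y - X beta), beta = 0 initially
  xwx  <- colSums(w * X^2)           # precomputed x_j' W x_j
  soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

  for (k in seq_len(max_iter)) {
    max_delta <- 0
    for (j in seq_len(p)) {
      bj_new <- soft(sum(X[, j] * r) + xwx[j] * beta[j], lambda[j] * delta[j]) /
                (xwx[j] + (1 - delta[j]) * lambda[j])
      d <- bj_new - beta[j]
      if (d != 0) {
        r         <- r - d * (w * X[, j])       # keep the weighted residual current
        beta[j]   <- bj_new
        max_delta <- max(max_delta, d^2 * xwx[j])
      }
    }
    if (max_delta <= eps) break
  }
  beta
}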
5.2.2 Safe Sets, Strong Sets, and Active Sets
Following the introduction of lasso regression and its variants, research efforts have
focused on methods that can reduce the time it takes to fit a full solution path, which refers to a
sequence of hyperparameter values that the model is evaluated at. One of the primary strategies
relies on the assumption of a sparse solution and attempts to filter out predictors that are zero or
will most likely be zero. By substantially reducing the number of variables that we must cycle
through in Algorithm 2, the fitting time can decrease drastically. El Ghaoui et al. (2010) introduced a filtering rule termed the “SAFE” rule, which utilizes the dual problem to determine which variables are guaranteed to be zero for a given penalty value in lasso regression. Tibshirani et al. (2012) approach the problem from a different perspective and develop their own filtering method, termed the “strong” rule, which generally filters out a larger proportion of variables compared to the “SAFE” rule but does not guarantee that the filtered variables are truly zero. The “strong” rule therefore requires post-convergence checks to verify that the filtered variables are truly zero once the algorithm converges. In addition to the basic “SAFE” and “strong” rules, each filtering rule has its own sequential version that is meant to be used when solving a sequence of penalties in decreasing order. Despite the need to perform post-convergence checks,
simulations and real data examples by Tibshirani et al. (2012) demonstrated that the sequential
“strong” rule tends to perform best in terms of computation time.
The xrnet R package utilizes both the sequential “strong” rule and a heuristic filtering
rule from the glmnet R package that is referred to by the package authors as an “active set” rule.
The “active set” rule further filters to only include variables that have had their estimate change
from zero at least once during any iteration of the coordinate descent update. In practice, the
authors of the glmnet R package found this additional filtering to further improve speed.
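For reference, the sequential strong rule of Tibshirani et al. (2012) for the lasso screens out predictor j, when moving from penalty λ_{k−1} to the smaller penalty λ_k, if |x_j^T r(λ_{k−1})| < 2λ_k − λ_{k−1}, where r(λ_{k−1}) is the residual at the previous solution. A minimal sketch of this screen (assuming standardized predictors and a precomputed residual r_prev) is:

# Indices surviving the sequential strong rule; survivors still require the
# post-convergence KKT checks described above.
strong_set <- function(X, r_prev, lam_new, lam_prev) {
  score <- abs(drop(crossprod(X, r_prev)))   # |x_j' r_prev| for every column j
  which(score >= 2 * lam_new - lam_prev)
}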
5.2.3 Memory
In addition to computational efficiency, the physical memory required to analyze
genomic data must be taken into consideration when developing a solution to fit our model. By
default, most standard R objects are loaded into primary memory (RAM). Thus, any analysis that
utilizes standard R objects, such as matrices, is going to be limited by the memory available on
the machine used to conduct the analysis. Handling data that are too large for in-memory processing is a very active field, with many different approaches that include (but are not limited to):
1) Buy more RAM or find a bigger computer (the simplest solution)
2) Out-of-memory computing
3) Distributed computing
4) In-database computing
With a primary goal of the package being ease of use, the solution to handle any ‘big data’ scenario should require minimal effort to set up the technical infrastructure. After reviewing
potential options, we were inspired by the biglasso R package (Zeng and Breheny, 2017) that
utilizes out-of-memory matrices from the bigmemory R package (Kane, Emerson and Weston,
2013). The functionality in the bigmemory package is derived directly from the C++ Boost
library to enable the creation of memory-mapped matrices. Memory mapping stores the contents
of a file in virtual memory and utilizes paging to access segments of the file when it is too large
to fit in physical memory. The OS (operating system) automatically handles access to the file and
allows the file to be shared in virtual memory across different processes. The only additional step
required of the user is to create the memory-mapped file using the bigmemory package.
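A sketch of that single extra step is shown below; the file names and the xrnet() argument names are illustrative assumptions rather than the exact interface.

library(bigmemory)
library(xrnet)

# Create a file-backed, memory-mapped matrix; only a small descriptor lives in RAM.
x_big <- filebacked.big.matrix(
  nrow = 1000, ncol = 200000, type = "double",
  backingfile = "x.bin", descriptorfile = "x.desc"
)
# ... fill x_big with the feature data (e.g. in column blocks read from disk) ...

# The memory-mapped matrix can then be passed to xrnet like an ordinary matrix.
fit <- xrnet(x = x_big, y = y, external = Z, family = "gaussian")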
To fully utilize this out-of-memory solution, we follow another strategy of the biglasso R
package that avoids creating any copies of the feature matrix. There are two steps where copying typically occurs: feature standardization and cross-validation of the hyperparameters. Rather than copy and
standardize the entire matrix, the feature vectors are standardized ‘on the fly’ within the
weighted coordinate descent algorithm. A similar ‘on the fly’ strategy was also adopted to handle
the matrix multiplication required to construct the matrix 𝑿𝒁 in our algorithm. Additionally,
since the memory-mapped files can be shared across processes, cross-validation can be
completed in parallel using read-only access to the same memory-mapped matrix. It is important
to note that the ‘on the fly’ standardization increases the number of computations required within
a single iteration of the coordinate descent algorithm. However, as simulations will demonstrate,
this does not have a significant impact on the overall computation time.
5.3 Simulations
Speed
To evaluate xrnet relative to glmnet, the simulation settings used by the authors of
biglasso were replicated (Zeng and Breheny, 2017). These simulations compare the
computational time of both linear and logistic lasso regression along a sequence of 100 penalty
values as a function of the number of variables and the number of observations. In addition to
speed, memory usage is compared for one of these scenarios to demonstrate the significant
impact of memory-mapped matrices and the ability to scale analyses without the need to scale
computing resources.
For linear regression, 𝑝 independent features were sampled from a standard normal
distribution to form a design matrix 𝑿 with 𝑛 observations. A continuous outcome, 𝒚 , is
generated as a linear combination of the features with a normally distributed error term,

y = X\beta + \epsilon, \quad \epsilon \sim MVN(0_n, \sigma^2 I_n)

where β is a p × 1 vector of regression coefficients. To produce a sparse solution, all regression coefficients are set to zero except for 20 that are sampled from a Uniform[−1, 1] distribution.
The first set of simulations varies the number of variables from 10,000 to 200,000 with a step
size of 10,000 and fixed sample size of 1,000. The second set of simulations varies the number of
observations from 1,000 to 20,000 with a step size of 1,000 and the number of variables fixed at
10,000. A lasso regression is fit with both xrnet and glmnet using the standard path of 100
penalty values generated by glmnet. Timings to fit the full path were evaluated on 50 generated
data sets for each unique combination of 𝑛 and 𝑝 .
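One replicate of this timing comparison could be sketched as follows (shown for n = 1,000 and p = 10,000); the penalty_main argument name for xrnet is an assumption about the interface.

library(glmnet)
library(xrnet)

set.seed(1)
n <- 1000; p <- 10000
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
beta <- numeric(p)
beta[sample(p, 20)] <- runif(20, -1, 1)        # 20 nonzero coefficients
y <- drop(X %*% beta + rnorm(n))

t_glmnet <- system.time(glmnet(X, y, alpha = 1))                            # lasso path
t_xrnet  <- system.time(xrnet(x = X, y = y, penalty_main = define_lasso()))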
A similar set of simulations were used to evaluate the performance of logistic lasso
regression. The 𝑝 features are simulated as independent standard normal variables and 20
coefficients are sampled from a 𝑢𝑛𝑖𝑓𝑜𝑟𝑚 [−1,1] while the rest are set to zero. The outcome is
then simulated by assuming that 𝑙𝑜𝑔𝑖𝑡 (𝑃 (𝑌 = 1)) is a linear function of the 𝑝 features,
𝑙𝑜𝑔𝑖𝑡 (𝑃 (𝑌 = 1|𝑋 ))= 𝑋𝛽
The outcome is then a set of 𝑛 independent Bernoulli random variables,
𝑌 𝑖 |𝑿 𝑖 = 𝒙 𝑖
~ 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖 (
exp(𝒙 𝑖 𝑇 𝜷 )
1 + exp (𝒙 𝑖 𝑇 𝜷 )
As in the linear regression case, we vary the number of variables from 10,000 to 200,000 with a
fixed sample size of 1,000, and then vary the number of observations from 1,000, to 20,000 with
the number of variables fixed at 10,000. Timings were again evaluated in 50 data sets using the
data generating process above for all combinations of sample size and number of predictors.
Memory
Memory usage was evaluated using a single setting from the speed simulations (n = 1,000, p = 200,000). The resulting design matrix has an approximate size of 1.5 GB. To monitor
memory usage, the profvis R package was used to sample the allocated memory during model
fitting. Memory allocation was compared for both fitting a single model to the entire data set and
k-fold cross-validation. Note that we compare glmnet to the best-case scenario for our package in
which a file-backed memory-mapped matrix is used when fitting the model in xrnet.
5.4 Results
5.4.1 Feature Comparison
In addition to implementing an R package to fit our model, we also integrated popular
features available in the R packages glmnet and biglasso. Table 5.1 provides a feature
comparison summary between all three R packages. Here, we demonstrate that xrnet not only fits hierarchical regularized regression models, but also supports fitting standard regularized regression models and unifies features that are not jointly supported by glmnet and biglasso.
5.4.2 Speed
Across all combinations of sample size and number of predictor variables, xrnet was
consistently faster than glmnet for both lasso regression (Figure 5.2) and logistic lasso regression
(Figure 5.3). Performance relative to glmnet improved as sample size or number of predictors
increased. However, the relative performance gain is not as large for logistic lasso regression in
the case of increasing sample size.
Feature                                  xrnet                     glmnet                      biglasso
Matrix types supported                   Dense (In-Memory),        Dense (In-Memory),          Memory-Mapped
                                         Sparse (In-Memory),       Sparse (In-Memory)
                                         Memory-Mapped
Outcome types supported                  Gaussian, Binomial        Gaussian, Binomial,         Gaussian, Binomial
                                                                   Poisson, Cox,
                                                                   Multiresponse Gaussian
Feature-specific penalty scaling         Yes                       Yes                         Yes
Feature-specific penalty types           Yes                       No                          No
User controls feature standardization    Yes                       Yes                         No
User controls inclusion of intercept     Yes                       Yes                         No
Box (upper/lower) constraints            Yes                       Yes                         No
Hybrid Safe/Strong Rules                 No                        No                          Yes
Integration of External Data             Yes                       No                          No
Table 5.1 Feature Comparison of xrnet, glmnet, and biglasso
Figure 5.2 Average Computing Time (s) to Solve Linear Lasso Regression Using Standard Path from R Package glmnet as a Function of Number of Features (n = 1,000) and Number of Observations (p = 10,000)
Figure 5.3 Average Computing Time (s) to Solve Logistic Lasso Regression Using Standard Path from R Package glmnet as a Function of Number of Features (n = 1,000) and Number of Observations (p = 10,000)
5.4.3 Memory
The reduced memory footprint of a file-backed memory-mapped matrix is clearly seen
when fitting a single model or tuning the hyperparameters through k-fold cross-validation (Table
5.2). Figure 5.4 tracks memory allocation over time to compute sequential 5-fold cross-
validation. On average, the memory usage for glmnet increases to over 4 times the memory
required to store the original matrix. Additionally, there are small spikes in memory due to the
computation of the cross-validation error within R. The single spike in memory allocation for xrnet corresponds to allocating space to store the estimated regression coefficients. If k-fold cross-validation were completed in parallel, the difference in memory usage would be even more drastic, as glmnet would end up generating 2(k + 1) copies of the original predictor matrix, while xrnet would still use the same memory-mapped matrix, which can be accessed by multiple R sessions simultaneously. In
addition to not copying the primary feature matrix, we also reduce memory usage during cross-
validation by immediately computing the loss within folds as opposed to storing 𝑘 + 1
coefficient estimate matrices, which can take up a significant amount of memory when 𝑝 and/or
𝑞 are large.
Figure 5.4 Comparison of Memory Allocation During Sequential 5-fold Cross-Validation
Method                    glmnet    xrnet
Fit single model          3.63 GB   0.65 GB
5-fold Cross-Validation   6.40 GB   0.66 GB
Table 5.2 Maximum Memory Allocated to Fit Lasso Regression on All Training Data and to Compute 5-Fold Cross-Validation Sequentially
5.4.4 Ease of Use
Beyond the two primary R functions used to fit our model on a set of data, xrnet(), and to tune the two hyperparameters, tune_xrnet(), we also utilized R’s S3 objects to develop helper functions that assist in visualizing the results of k-fold cross-validation, predicting an outcome in a new data set, extracting estimates for a particular set of penalty values, and defining the regularization for both the predictors and the external data variables. Below is a list that summarizes the helper functions we believe are integral to making the model fitting process as simple as possible for a user; a brief usage sketch follows the list.
• define_penalty(): Defines the type of regularization used on a set of variables, the path of
penalty values to evaluate (automatic or manually set), and variable-specific penalty
scaling. The following additional functions can shorten the syntax required for defining
the regularization, by specifying the type in the function name.
o define_ridge(), define_lasso(), define_enet()
• predict(): Used with a model generated from xrnet() or tune_xrnet() to:
o Extract coefficient estimates for a single penalty combination or a set of penalty
combinations
o Predict an outcome in another data set using estimates from a single penalty
combination or a set of penalty combinations
• plot(): Used with a model generated from tune_xrnet() to:
o Generate a contour plot of cross-validated error across the two-dimensional grid
o Generate a cross-validated error curve with respect to changes in a single penalty
while the other penalty is at a fixed value.
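The sketch below strings these pieces together for a Gaussian outcome. It is illustrative only: object names (x, y, Z, x_new) are placeholders, and the argument and option names (external, family, penalty_main, penalty_external, type, newdata) are assumptions that should be checked against the package documentation.

library(xrnet)

# Tune a ridge penalty on the predictors and a lasso penalty on the external data
cvfit <- tune_xrnet(
  x = x, y = y, external = Z, family = "gaussian",
  penalty_main     = define_ridge(),
  penalty_external = define_lasso(),
  nfolds = 5
)

plot(cvfit)                                        # contour plot of the CV error surface
betas <- predict(cvfit, type = "coefficients")     # estimates at the optimal penalty pair
y_hat <- predict(cvfit, newdata = x_new)           # predictions in a new data set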
Figure 5.5 provides an example of the contour plot generated from one of the results in the
METABRIC breast cancer analysis.
Figure 5.5 Example of Mean Cross-Validation Error Contour Plot in xrnet R package
5.5 Discussion
In this chapter we introduced the R package, xrnet, that was developed to efficiently fit our model by using coordinate descent. Beyond enabling us to both fit and tune hierarchical regularized regression models, xrnet incorporates useful features from the R packages glmnet and biglasso to ensure our implementation is efficient with respect to computational speed and
memory usage. The use of ‘on the fly’ standardization and memory-mapped matrices
significantly decreased memory usage. Speed comparisons revealed that xrnet also tends to be
faster than glmnet for both linear and logistic lasso regression. The observed computational
improvements were not necessarily expected and could be due to a combination of factors. These
factors include the efficient vectorized implementation of linear algebra routines within the
Eigen library and the fact that we avoid the need to allocate and store a copy of the predictor
matrix. The performance gains of xrnet over glmnet observed for linear lasso regression diminish for logistic lasso regression as n increases. The decrease in relative performance is most likely due to the need to compute n exponentials every time we update the quadratic approximation to the logistic log-likelihood (see Chapter 6 for a
description of this approximation). Exponentials are expensive and most likely make up a large
percentage of the computation time. More detailed profiling could reveal if this is in fact why we
do not see the same performance gains. However, given that the goal was to develop an
implementation that was at least as efficient as glmnet, there was not a need to profile the code to
this depth.
Beyond raw performance, we highlighted some of the functions available in the package
that ensure our software provides the functionality necessary to conduct a complete analysis
within our modelling framework. We would also like to note that a significant amount of time was required upfront to design these R functions and to create tests that verify their functionality. The benefits of completing these software development steps outweighed the additional effort, not only with regard to the speed at which an analysis can be completed by using these functions, but also the increased confidence gained in the correctness of the results returned by a given function. Overall, regardless of the size of data that a method is
intended for, we found that clearly documented and tested code translates to reproducible results
that are essential if we want to distribute our software for use in the general scientific
community.
Chapter 6 Hierarchical Regularized Logistic Regression
6.1 Overview
Many prediction models used in clinical applications involve binary outcomes (e.g.
presence vs. absence of a disease). The natural extension of regularized regression models to
binary outcomes comes in the form of regularized logistic regression. In this chapter, we describe
how the coordinate descent algorithm derived in Chapter 3 can be used to extend our general
framework to binary outcomes in what we refer to as ‘hierarchical regularized logistic
regression’. To do so, we describe regularized logistic regression and an efficient algorithm that
uses a combination of iterated reweighted least squares (IRLS) and coordinate gradient descent
to fit the model. We demonstrate how this algorithm can be extended to our hierarchical
framework. Simulations are used to compare the performance of this new binary version of the
ridge-lasso to logistic ridge regression, and we discuss how this method can be applied to other types of outcomes (e.g. count data) in a generalized linear modelling framework.
6.2 Methods
6.2.1 Regularized Logistic Regression
In the context of the previous example used in Chapter 2, let 𝑿 be an 𝑛 × 𝑝 matrix of
gene expression levels measured on 𝑛 individuals and 𝒁 be a 𝑝 × 𝑞 matrix of genomic
annotations for the 𝑝 genes. The observed 𝑛 × 1 outcome vector, 𝒚 , is now binary. In terms of an
actual outcome, we let y_i indicate whether the i-th observation experienced a clinical recurrence of a particular disease. The outcome can be modeled as n independent Bernoulli(π_i) random variables, where

Y_i = \begin{cases} 1 & \text{clinical recurrence} \\ 0 & \text{no clinical recurrence} \end{cases}, \qquad P(Y_i = 1) = \pi_i, \quad P(Y_i = 0) = 1 - \pi_i

The corresponding binomial likelihood function (without the normalizing constant) for the n observations is

L(\pi, y) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} \qquad (6.1)
The most widely used statistical method to model the probability of a binary outcome is logistic
regression. By using a logistic transformation, we can model the log-odds of the outcome as a
linear combination of covariates, in this case, the gene expression levels.
\mathrm{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + x_i^T \beta \qquad (6.2)

P(y_i \mid x_i, \beta_0, \beta) = \pi_i = \frac{\exp(\beta_0 + x_i^T \beta)}{1 + \exp(\beta_0 + x_i^T \beta)} \qquad (6.3)

where x_i is the vector of gene expression levels for the i-th observation and β is the p × 1 vector of log-odds ratios describing the association of each genomic feature with clinical recurrence. Taking the log of equation 6.1 and rewriting it in terms of X, β_0, and β, the log-likelihood can be expressed as

l(\beta_0, \beta) = \sum_{i=1}^{n} y_i (\beta_0 + x_i^T \beta) - \sum_{i=1}^{n} \log\left[1 + \exp(\beta_0 + x_i^T \beta)\right] \qquad (6.4)
The estimates for β_0 and β are generally found by using numerical methods (e.g. Newton-Raphson) to maximize the log-likelihood (or, equivalently, to minimize the negative log-likelihood). As with linear regression, to extend the use of logistic regression to high-dimensional data, constraints are placed on the parameter space. In the primal form, the constrained optimization replaces the least-squares objective function with the negative log-likelihood, where g(β) can be any of the constraint terms we considered in the case of linear regression in Chapter 2.

\min_{\beta_0, \beta} -l(\beta_0, \beta) \quad \text{subject to} \quad g(\beta) \le t, \; t \ge 0

A similar optimization with an equivalent solution adds a penalty term to the objective function, just as in the case of linear regression.

\min_{\beta_0, \beta} -l(\beta_0, \beta) + \lambda g(\beta), \quad \lambda \ge 0 \qquad (6.5)

The objective function is again convex, and so long as we choose convex constraint functions that are either smooth or separable, we can apply coordinate gradient descent to solve equation 6.5. As an example, Friedman et al. (2010) propose the use of iteratively reweighted least squares (IRLS) and coordinate gradient descent to fit regularized logistic regression.
To derive their algorithm, we first expand the negative log-likelihood using a second-
order Taylor series and show that minimizing the approximation by Newton-Raphson is
equivalent to IRLS. The general form of a second-order approximation at the point 𝒙 = 𝒙 𝟎 is
shown below, where ∇f(x_0) is the gradient of f evaluated at x_0 and ∇²f(x_0) is the Hessian of f at x_0.

f(x) \approx f(x_0) + \nabla f(x_0)^T (x - x_0) + \frac{1}{2}(x - x_0)^T \nabla^2 f(x_0)(x - x_0), \quad x \in \mathbb{R}^p

To minimize the approximation, we differentiate with respect to x and set the resulting equation to zero. Rearranging, we see that the minimization is equivalent to the Newton-Raphson update, where x = β^(k+1) (new estimate) and x_0 = β^(k) (current estimate).

0 = \nabla f(x_0) + \nabla^2 f(x_0)(x - x_0)
x = x_0 - \left(\nabla^2 f(x_0)\right)^{-1} \nabla f(x_0) \qquad (6.6)
Below is a derivation of the approximation for the binomial likelihood. To simplify notation, we
have modified X to be an n × (p + 1) matrix with an additional column of ones and β to be a (p + 1) × 1 vector with β_0 as its first element.

-l(\beta) = -\sum_{i=1}^{n} y_i x_i^T \beta + \sum_{i=1}^{n} \log\left[1 + \exp(x_i^T \beta)\right]

The first- and second-order partial derivatives with respect to β_j are

\frac{\partial}{\partial \beta_j} = -\sum_{i=1}^{n} y_i x_{ij} + \sum_{i=1}^{n} x_{ij} \frac{\exp(x_i^T \beta)}{1 + \exp(x_i^T \beta)} = x_{\cdot j}^T (\mu - y)

\frac{\partial^2}{\partial \beta_j^2} = \sum_{i=1}^{n} \frac{x_{ij}^2 \exp(x_i^T \beta)}{\left[1 + \exp(x_i^T \beta)\right]^2} = x_{\cdot j}^T \, \mathrm{diag}\left(\mu_i(1 - \mu_i)\right) x_{\cdot j}

where x_{\cdot j} is the j-th column of X and

\mu = \left( \frac{\exp(x_1^T \beta)}{1 + \exp(x_1^T \beta)}, \ldots, \frac{\exp(x_n^T \beta)}{1 + \exp(x_n^T \beta)} \right)^T

The gradient and Hessian for β are then easily seen to be

\nabla f(\beta) = X^T (\mu - y)
\nabla^2 f(\beta) = X^T \mathrm{diag}\left(\mu_i(1 - \mu_i)\right) X = X^T W X, \qquad W = \mathrm{diag}\left(\mu_i(1 - \mu_i)\right)
If we substitute the gradient and Hessian into equation 6.6, the resulting Newton-Raphson update is the solution to a weighted least-squares regression of y* on X with weight matrix W.

\beta^{(k+1)} = \beta^{(k)} - (X^T W X)^{-1} X^T (\mu - y) = (X^T W X)^{-1} X^T W \left( W^{-1}\left[y - \mu(\beta^{(k)})\right] + X \beta^{(k)} \right)

y^* = \mathbf{1}_n \beta_0^{(k)} + X \beta^{(k)} + W^{-1}\left[y - \mu(\beta^{(k)})\right]

W = \mathrm{diag}\left[\mu(\beta^{(k)})\left(1 - \mu(\beta^{(k)})\right)\right]
The algorithm of Friedman et al. (2010) replaces the negative log-likelihood in equation 6.5 with the weighted least-squares objective derived from the quadratic approximation:

\min_{\beta_0, \beta} \frac{1}{2} \left\| \sqrt{W}\,(y^* - X\beta) \right\|_2^2 + \lambda g(\beta), \quad \lambda \ge 0 \qquad (6.7)

Given an initial estimate, β^(0), we compute the weight matrix, W, and the “working” response, y*. Coordinate gradient descent is then used to solve equation 6.7. The resulting estimates, β^(1), are used to reweight the quadratic approximation, and coordinate descent is applied again. The algorithm iterates between IRLS and coordinate descent until a specified convergence criterion is reached.
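A minimal R sketch of this IRLS outer loop, reusing the coord_descent_wls() sketch from Chapter 5 for the inner weighted least-squares problem, might look as follows; the weight clamp and the simple intercept update are simplifications of what a production implementation would do.

irls_logistic <- function(X, y, lambda, delta, n_outer = 25, tol = 1e-8) {
  p <- ncol(X)
  beta0 <- 0; beta <- numeric(p)
  for (iter in seq_len(n_outer)) {
    eta <- beta0 + drop(X %*% beta)
    mu  <- 1 / (1 + exp(-eta))
    w   <- pmax(mu * (1 - mu), 1e-5)           # IRLS weights, bounded away from zero
    z   <- eta + (y - mu) / w                  # working response y*
    old <- c(beta0, beta)
    # inner problem: penalized weighted least squares solved by coordinate descent
    beta  <- coord_descent_wls(X, z - beta0, w, lambda, delta)
    beta0 <- weighted.mean(z - drop(X %*% beta), w)   # unpenalized intercept
    if (sum((c(beta0, beta) - old)^2) < tol) break
  }
  list(intercept = beta0, beta = beta)
}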
6.2.2 Hierarchical Regularized Logistic Regression
Returning to the ridge-lasso model, the linear regression in the first level of the two-stage
model is replaced with a logistic regression of the binary outcome 𝑦 on the genomic features.
\mathrm{logit}(y) = \beta_0 + X\beta + \epsilon
\beta = Z\alpha + \gamma

Our original implementation of the ridge-lasso, presented in Chapter 2, represents this hierarchy by the following convex optimization

\min_{\beta_0, \beta, \alpha} -l(\beta_0, \beta) + \frac{\lambda_1}{2} \left\| \beta - Z\alpha \right\|_2^2 + \lambda_2 \left\| \alpha \right\|_1

By rewriting the two-level model as a single regression

\mathrm{logit}(y) = \beta_0 + XZ\alpha + X\gamma + \epsilon

we can express the optimization in terms of β_0, α, and γ.

\min_{\beta_0, \alpha, \gamma} -l(\beta_0, \alpha, \gamma) + \frac{\lambda_1}{2} \left\| \gamma \right\|_2^2 + \lambda_2 \left\| \alpha \right\|_1 \qquad (6.8)
Comparing to equation (6.5), we can use a quadratic approximation of the log-likelihood to
compute the solution by coordinate gradient descent.
6.3 Simulations
To verify the behavior in the case of a binary outcome, we completed preliminary
simulations that utilize a similar data generation process to that presented in Chapter 2. As in
Chapter 2, 𝑿 is an 𝑛 × 𝑝 matrix of predictors measured in 𝑛 individuals and 𝒁 is a 𝑝 × 𝑞 matrix
of meta-features available for the 𝑝 features. The external data meta-features are simulated as
multivariate normal with mean zero, AR(1) correlation structure, and common variance of one.
The coefficients for the predictors, 𝜷 𝒑 ×𝟏 , are generated as a linear combination of the meta-
features,
\beta = Z\alpha + \gamma

where α_0 is fixed at zero, α is a sparse q × 1 vector, and γ is the error term, assumed to be MVN(0, I_p σ_2²). To quantify the informativeness of Z, the signal-to-noise ratio is used to specify the variance term σ_2².

\sigma_2^2 = \frac{\alpha^T \Sigma_Z \alpha}{SNR_Z}
The Bayes’ formulation of linear discriminant analysis (LDA) is used to simulate X and y such that the theoretically achievable AUC is fixed at 0.8 across all simulated data sets. As a reminder, in LDA we assume that, conditional on the outcome, the predictors are distributed multivariate normal with class-specific means and a common covariance matrix.

f_k(x) \sim MVN(\mu_k, \Sigma), \quad k = 0, 1

To fix the theoretical AUC, we use the fact that the regression coefficients in a standard linear regression, β, are a scalar multiple of the discriminant direction in LDA (Appendix A.3).

\beta = c\,\Sigma^{-1}(\mu_1 - \mu_0), \quad c > 0

To simplify, we fix the mean in the control group, μ_0, to zero and assume that Σ has an AR(1) correlation structure with common variance of one. Under the assumption of a linear model and for given β and c, the AUC is computed as the probability that a randomly selected case has a higher predicted risk than a randomly selected control.

AUC = P(X_1 \beta > X_0 \beta) = P(X_1 \beta - X_0 \beta > 0) = P(Y > 0), \qquad Y \sim N\left(\tfrac{1}{c}(\Sigma\beta)^T \beta, \; 2\beta^T \Sigma \beta\right)

For a specified AUC, the scaling factor c can be calculated empirically by solving, with respect to c, for the root of the function P(Y > 0) − AUC. After determining the appropriate value of c for the desired AUC, the mean vector in the case group is computed as μ_1 = Σβ/c.
With μ_0, μ_1, and Σ defined, X is generated for 150 controls and 150 cases (n = 300) by sampling from the corresponding conditional distributions, f_k(x | μ_k, Σ). The intercept, β_0, is computed based on the normality assumptions of LDA as follows (see Appendix A.3).

\beta_0 = -\frac{1}{2}(\mu_0 + \mu_1)^T \beta + \log\left(\frac{n_1}{n_0}\right)
A test set of 500 controls and 500 cases is also created by using the same data generating
process.
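A small sketch of how the scaling factor could be found numerically is given below; beta and Sigma are placeholders for the simulated coefficient vector and AR(1) covariance matrix, and uniroot() from base R is used to solve P(Y > 0) − AUC = 0.

find_c <- function(beta, Sigma, auc = 0.8) {
  Sb <- drop(Sigma %*% beta)
  v  <- 2 * sum(beta * Sb)                     # Var(Y) = 2 * beta' Sigma beta
  f  <- function(cc) {
    pnorm(0, mean = sum(beta * Sb) / cc, sd = sqrt(v), lower.tail = FALSE) - auc
  }
  uniroot(f, interval = c(1e-6, 1e6))$root
}

c_star <- find_c(beta, Sigma, auc = 0.8)
mu1    <- drop(Sigma %*% beta) / c_star        # case-group mean; controls have mean zero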
As a preliminary analysis of how our model behaves in the context of a binary outcome, we compared the performance of the logistic ridge-lasso to standard logistic ridge regression. The effects of the external data signal-to-noise ratio, correlation
between external data features, number of predictor variables, and number of external data
variables on relative prediction performance were evaluated for all combinations of the following
settings, first varying the number of predictors for a fixed number of external variables.
Number of External Variables: 100
Number of 1st-Level Predictors: 300, 600, 1200
External Data (Z) Signal-to-Noise Ratio (SNR_Z): 0, 0.5, 1, 2
External Data Correlation (ρ_Z): 0.2, 0.5, 0.8
We then varied the number of external variables for a fixed number of predictor variables.
Number of External Variables: 300, 600, 1200
Number of 1st-Level Predictors: 500
External Data (Z) Signal-to-Noise Ratio (SNR_Z): 0, 0.5, 1, 2
External Data Correlation (ρ_Z): 0.2, 0.5, 0.8
For each unique combination of parameter settings, 500 training and 500 corresponding testing
data sets were generated. 10-fold cross-validation was used to determine the ‘optimal’
hyperparameter values, where optimal is the set of hyperparameters that attained the largest
average cross-validation AUC. Model performance is compared between logistic ridge-lasso and
logistic ridge regression by evaluating the AUC in the test set.
6.4 Real Data Applications
6.4.1 Revisiting the METABRIC Breast Cancer Analysis
In Chapter 4, we took advantage of the relationship between linear regression and LDA to model the association between breast cancer mortality and gene expression. We now use logistic ridge-lasso regression to determine whether similar performance gains are attained by
incorporating the metagenes. The gene expression, external metagenes, and clinical covariates
are unchanged from their definitions in Chapter 4. Repeated stratified 5-fold cross-validation is
again used to tune the hyperparameters with AUC used as the loss function. The test AUC was
used to evaluate the performance at the optimal hyperparameter values.
6.4.2 PrediXcan + Hierarchical Regularized Regression: Understanding Gene-based Associations
in Colorectal Cancer
Introduction
Gamazon et al. (2015) introduced a gene-based association method, PrediXcan, that
models the molecular mechanisms through which genetic variation affects a given phenotype.
Their model looks for effects of genetic variation on a phenotype that are mediated through gene
expression.
To identify genetic variants that are associated with gene expression, PrediXcan uses
reference transcriptome data sets that contain both genotype and expression levels for multiple
tissues (e.g. GTEx) to build additive models of gene expression as a function of genetic
variation. As an example, let 𝑮 be an 𝑛 × 𝑝 matrix that contains the genotypes of 𝑛 individuals
across 𝑝 SNPs. For a given tissue, let 𝑬 be an 𝑛 × 𝑞 matrix that contains the expression levels
across 𝑞 genes in the same 𝑛 individuals. For each of the 𝑞 genes, a linear model is built based
on the 𝑝 SNPs,
E_k = G\beta_k + \epsilon

where E_k is an n × 1 vector of the expression levels of the k-th gene, the j-th column of G, G_j, is an n × 1 vector of the number of reference alleles for SNP j, and β_k is a p × 1 vector describing the association of each SNP with the expression of gene k. Generally, not all p SNPs are included in each model; rather, only SNPs within a fixed distance (in base pairs) of the gene are included. To handle the high-dimensionality of G, the authors used elastic net regression to perform variable selection and obtain an estimate for β_k. Estimates for each of the q genes can be combined into a p × q matrix, W, where the k-th column of W contains the regression coefficients used to model expression of the k-th gene. Gene expression in a new GWAS sample, G_new, can then be imputed by a simple matrix multiplication, G_new W.
by a simple matrix multiplication, 𝑮 𝒏𝒆𝒘 𝑾 . Under the PrediXcan framework, univariate
regression models (i.e. logistic regression) are then used to test the association between imputed
gene expression level and a phenotype of interest. Figure 6.1 summarizes the construction of the
weight matrix and imputation of gene expression.
Figure 6.1 PrediXcan Framework for Gene Expression Imputation
In the context of our hierarchical regularized regression framework, we can treat the
genotype matrix in the new sample, 𝑮 𝒏𝒆𝒘 , as the primary predictors and the matrix of regression
coefficients, 𝑾 , as the external information.
y = G_{new}\beta + \epsilon
\beta = W\alpha + \gamma

Rewriting this as a single-level regression, we see that the n × q matrix G_new W is the same imputed gene expression matrix used in PrediXcan.

y = G_{new}W\alpha + G_{new}\gamma + \epsilon

Unlike PrediXcan, we are now jointly modelling the association of the imputed gene expression, α (q × 1), and genetic variation, γ (p × 1), with a health outcome of interest. β (p × 1) can be interpreted as the total effect of genetic variation on the outcome, where we can think of the total effect as being composed of a “direct” effect of genetic variation and an “indirect” effect captured by the imputed gene expression. By capturing the SNP-level and gene-level effects in a joint model, we
can potentially better understand the association of a genomic region with an outcome of interest.
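The sketch below makes the single-level rewriting explicit: the first 𝑞 columns of the flattened design are the PrediXcan imputed expression values (receiving 𝜶) and the remaining 𝑝 columns are the raw genotypes (receiving the deviations 𝜸). The function and variable names are illustrative assumptions.

```r
# Sketch: the single-level form y = G_new W alpha + G_new gamma + error
# corresponds to an augmented design with q gene-level and p SNP-level columns.
flatten_design <- function(G_new, W) {
  X_gene <- G_new %*% W   # n x q imputed expression (the PrediXcan features)
  X_snp  <- G_new         # n x p genotypes carrying the deviations gamma
  cbind(X_gene, X_snp)    # n x (q + p) design for the joint model
}
```

Penalizing the two column blocks separately (lasso or elastic net on 𝜶 and on 𝜸) recovers the hierarchical regularized regression described above.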
Methods
To understand if new insights can be gained from our model in a PrediXcan framework,
we follow up the analysis of Bien et al. (2019) that investigated the role of genetic risk variants
in colorectal cancer. The sample used in their analysis comprised 26,904 participants
aggregated from 16 studies for which colorectal cancer case-control status, individual-level
genotyping, and basic clinical data were available. Additional information describing each of the
16 studies can be reviewed in their online materials (Bien et al., 2019), but it is important to note
that the studies were primarily composed of individuals of European ancestry. In their analysis,
gene expression for both transverse colon and whole blood was imputed using previously
developed models that are publicly available at http://predictdb.org (Gamazon et al., 2015). The
PredictDB models were developed using GTEx (Lonsdale et al., 2013) reference data sets that contain
both genotype and gene expression data across multiple tissues. For a given tissue and gene,
elastic net regression was used to build a model that predicts gene expression based on cis-
regulatory variants, defined as being within 1Mb of the gene. Further information on how the
models were trained is available on PredictDB.org.
The PrediXcan software package was used with the weights from the PredictDB models
to impute gene expression in the aggregated study sample. Logistic regression was then used to
model the association between imputed gene expression and case-control status with adjustment
for age, sex, study center, genotyping batch, and the first four principal components to account
for population substructure. A false discovery rate (FDR) threshold of 0.2 was used in the
discovery phase to select significant associations between case/control status and imputed gene
expression that were then carried forward into a replication phase. In the replication phase,
statistically significant associations were found for TRIM4 (replication 𝑝 = 0.01) and PYGL
(replication 𝑝 = 6.7 × 10⁻⁴) in the transverse colon results.
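A minimal sketch of this discovery-stage screen is shown below. The object names (expr, status, covars) are assumptions, and the per-gene logistic regression with a Benjamini-Hochberg adjustment is only a schematic of the published analysis.

```r
# Sketch: marginal PrediXcan-style tests with an FDR screen (illustrative).
# `expr` is an n x q matrix of imputed expression; `covars` is a data frame of
# adjustment variables (age, sex, study center, batch, PCs); `status` is 0/1.
prediXcan_screen <- function(expr, status, covars, fdr = 0.2) {
  pvals <- apply(expr, 2, function(g) {
    fit <- glm(status ~ g + ., data = covars, family = binomial())
    summary(fit)$coefficients["g", "Pr(>|z|)"]
  })
  qvals <- p.adjust(pvals, method = "BH")
  names(qvals)[qvals <= fdr]   # genes carried into the replication phase
}
```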
We extend the previously described analysis to a larger sample of 128,380 individuals
aggregated from the same 16 studies plus an additional 37 studies for which genotype, outcome,
and clinical covariates were available. Our analysis also uses the transverse colon imputation
models that are available from PredictDB.org. The resulting external data matrix constructed
based on these models consists of weights to impute 5,612 genes based on 129,529 genetic
variants (SNPs). Despite having access to the full genome-wide set of SNPs, for this analysis we
only considered the subset of variants for which weights were available in the PredictDB colon
transverse models. The resulting predictor matrix consists of 129,529 SNPs across the set of
128,380 individuals.
Using the larger sample size, we compare the results of a standard PrediXcan analysis
with results obtained from our hierarchical regularized regression model. For the regularization,
we use an elastic net penalty on both the genotype data (𝐺 𝑛𝑒𝑤 ) and the external weight matrix
(𝑊 ), which we will refer to as the EN-EN model for the rest of the discussion. Note that we fit
our model on each chromosome separately and do not fit all 22 chromosomes jointly. All models
include adjustment for age, sex, study, and the first four principal components to account for
population substructure. In the EN-EN models, the adjustment variables (i.e. age, sex, study,
principal components) are not penalized.
In contrast to the previous real data examples where prediction accuracy was of primary
interest, here we focus on model selection and the behavior of our model relative to the
PrediXcan approach. With this goal in mind, we used the binomial deviance as the metric of
choice, rather than the AUC, for tuning the hyperparameters using 10-fold cross-validation for
both our model and the elastic net regression models. The estimates reported are those that
obtained the minimum average deviance across the 10 folds.
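The following sketch conveys the spirit of this setup using a single elastic net fit on the flattened design with unpenalized adjustment covariates and deviance-tuned 10-fold cross-validation. It is an approximation rather than the EN-EN model itself, since the actual model tunes separate penalties for the gene-level and SNP-level coefficients; the objects G_new, W, X_adjust (numeric/dummy-coded adjustment covariates), and status are assumed to be prepared beforehand.

```r
# Sketch: elastic net on the flattened design [G_new W | G_new | adjustment covariates],
# leaving the adjustment covariates unpenalized and tuning by binomial deviance.
library(glmnet)

X_joint <- cbind(G_new %*% W, G_new, X_adjust)
pf <- c(rep(1, ncol(W) + ncol(G_new)),   # penalize gene- and SNP-level terms
        rep(0, ncol(X_adjust)))          # leave adjustment covariates unpenalized

cvfit <- cv.glmnet(X_joint, status, family = "binomial", alpha = 0.5,
                   type.measure = "deviance", nfolds = 10, penalty.factor = pf)
coef_min <- coef(cvfit, s = "lambda.min")  # estimates at minimum average deviance
```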
6.5 Results
6.5.1 Simulations
Figures 6.2 - 6.3 compare the performance of the logistic ridge-lasso to logistic ridge
regression as we vary the number of predictor variables, the number of external data variables,
and the correlation between the external data variables. We obtain trends in performance that are
consistent with those found for the case of linear regression in Chapter 2. Increasing the
informativeness of the external data leads to significant improvement in performance relative to
logistic ridge regression. Performance relative to logistic ridge regression also still shows a small
improvement as 𝑝 increases and a minor decrease as 𝑞 increases. Additionally, the logistic
version of our model can discard noninformative external data, and in this scenario performance
is not significantly impacted.
Figure 6.2 Prediction Performance, as Measured by Test AUC, of Ridge-Lasso Compared to
Standard Ridge Regression by Number of Predictors and External Data Signal-to-Noise
Ratio (ρ_z = 0.2, q = 100)
Figure 6.3 Prediction Performance, as Measured by Test AUC, of Ridge-Lasso Compared to
Standard Ridge Regression by Number of External Data Variables and External Data
Signal-to-Noise Ratio (ρ_z = 0.2, p = 500)
6.5.2 METABRIC Analysis Revisited
As in Chapter 4, the incorporation of the metagene information in a logistic ridge-lasso
regression improves the prediction of clinical recurrence both with and without clinical
information relative to standard logistic ridge regression (Table 6.1). Inclusion of age and
lymph node-positive status results in a similar improvement in the test AUC as well. CIN and
FGD3-SUSD3 are the only two metagenes selected, and their direction of effect is again positive
for the proliferative metagene and negative for the protective metagene (Table 6.2).
Method | Test AUC (Gene Expression Only) | Test AUC (Gene Expression + Clinical)
Linear Ridge | 0.695 | 0.741
Logistic Ridge | 0.684 | 0.731
Ridge-Lasso | 0.713 | 0.755
Logistic Ridge-Lasso | 0.711 | 0.753
Table 6.1 Comparison of Test AUC between Ridge Regression and Ridge-Lasso Regression (Linear and Logistic Versions)
Metagene | Coefficient Estimate (Gene Expression Only) | Coefficient Estimate (Gene Expression + Clinical)
CIN | 0.70 | 0.62
MES | 0.0 | 0.0
LYM | 0.0 | 0.0
FGD3-SUSD3 | -0.35 | -0.29
Table 6.2 Coefficient Estimates for External Data Variables (Metagenes)
6.5.3 PrediXcan + Hierarchical Regularized Regression
After Bonferroni correction for multiple comparisons by the number of genes included in
the analysis (5,612), the standard PrediXcan approach on the larger set of patients identified 32
genes that have a statistically significant association between predicted gene expression and
case-control status (Table 6.3). A number of these genes were previously identified in Bien et al.
(2019) and were also selected by the EN-EN model. We also want to highlight that the two novel
discoveries from Bien et al. (2019), TRIM4 and PYGL, are also selected in both the PrediXcan
analysis and by the EN-EN model. Comparing the genes selected by the EN-EN model to those
not selected shows that our model selects genes with a wide range of test 𝑅² values. The test 𝑅²
for each PredictDB imputation model provides an estimate of what proportion of gene expression
is captured by genetic variation. Across all imputed genes, test 𝑅² varies substantially and in some
cases appears to be worse than just using mean gene-level expression. As an example, consider
the genes LIMA1 and COX14. Despite having significant marginal p-values, the associated
imputation test 𝑅² is low for both genes. As will be discussed later, this may be a partial
explanation for why the EN-EN model does not select either gene. It is important to note that a
lower test 𝑅² does not preclude a gene from being of practical importance with respect to its
effect through gene expression. Rather, this measure only tells us if some fraction of variation in
gene expression is captured by the genetic variants in a given imputation model.
Table 6.3 Summary of Significant PrediXcan Results Compared to Genes Selected in the EN-EN Hierarchical Regularized Regression Model and the Results of Bien et al. (2019)

Chr | Gene | Marginal Z-Score | Marginal P-value | Selected in EN-EN | In Bien Discovery | PredictDB Test 𝑅²
1 ARPC5 -5.14 1.55e-03 Yes Yes 0.148
1 LAMC2 -4.69 1.54e-02 No No 0.0484
2 TMBIM1 4.81 8.55e-03 No No 0.0148
3 LRRC34 4.87 6.36e-03 Yes No 0.0323
3 SFMBT1 4.85 6.88e-03 Yes No 0.139
6 HLA-DRB9 5.46 2.73e-04 Yes No 0.178
6 HLA-DRB6 5.24 8.77e-04 Yes No 0.614
6 CCHCR1 5.06 2.41e-03 Yes No 0.437
6 HLA-DRB5 -5 3.18e-03 Yes No 0.498
6 TCF19 4.87 6.41e-03 No No 0.0313
7 OR2AE1 5.12 1.67e-03 No No 0.0593
7 TRIM4 -5.06 2.35e-03 Yes Yes 0.466
8 POU5F1B -15.03 2.81e-47 Yes Yes 0.0295
10 EBAG9P1 5.59 1.30e-04 Yes No 0.0468
10 RP11-80H5.9 4.45 4.88e-02 No No -0.008
11 COLCA2 -11.26 1.14e-25 Yes Yes 0.336
11 C11orf53 -11.12 5.72e-25 Yes Yes 0.185
11 COLCA1 -10.26 6.30e-21 Yes Yes 0.136
12 LIMA1 -10.8 1.96e-23 No Yes 0.0117
12 COX14 7.52 3.19e-10 No Yes -0.0134
12 LRP1 5.09 2.04e-03 Yes No 0.0627
12 TMEM116 4.86 6.73e-03 No Yes 0.009
12 RP3-462E2.5 -4.57 2.72e-02 No No 0.0166
14 PYGL -5.89 2.16e-05 Yes Yes 0.241
14 NIN -5.42 3.41e-04 No No 0.096
14 ABHD12B -5.27 7.47e-04 No No 0.211
14 NID2 4.5 3.78e-02 Yes No 0.0735
17 DBIL5P 4.95 4.21e-03 Yes No 0.033
19 RPL28 -4.67 1.70e-02 Yes No 0.142
19 DOHH 4.46 4.60e-02 Yes No 0.0564
20 MAP1LC3A -4.93 4.56e-03 Yes No 0.0801
20 LIME1 4.77 1.05e-02 Yes No 0.0412
1. Marginal association of imputed gene expression with colorectal cancer case-control status
2. P-value for marginal association after Bonferroni adjustment for multiple comparisons
3. Whether gene-level regression coefficient is nonzero in EN-EN model
4. Marginal association of imputed gene expression with colorectal cancer case-control status significant at FDR level of 0.2 in results of Bien et al. (2019)
5. Predictive performance of PredictDB model in GTEx test data set as measured by 𝑅²
To better understand the behavior of hierarchical regularized regression in the PrediXcan
framework, we provide a more detailed overview of select findings from Table 6.3.
We first focus on the ‘ideal’ scenario where the EN-EN model simultaneously identifies
known colorectal cancer associated genes and identifies underlying risk variants. The genes of
interest are COLCA1 (Colorectal Cancer Associated 1) and COLCA2 (Colorectal Cancer
Associated 2). Both genes have some level of variation explained by the variants in their
respective imputation models (PredictDB test 𝑅² > 0.1), demonstrate a significant marginal
association with case-control status based on the PrediXcan analysis, and are selected by the EN-
EN model. Of equal interest is the selection of the underlying variants in the EN-EN model and
whether the estimates for these SNPs differ from the imputed gene-level effect. Tables 6.4 – 6.5
summarize the results for the top 20 SNPs, with respect to their marginal association with case-
control status, among the subset of SNPs used in the imputation models for COLCA1 and
COLCA2, respectively. In addition to the marginal association of each SNP with case-control
status, we also report the weight (i.e. regression coefficient) for each SNP in the imputation
model, the SNP-level regression coefficient in the EN-EN model (𝛽 ) , and the deviation of the
SNP-level effect from the gene-level effect in the EN-EN model (i.e. 𝛾 = 𝛽 − 𝑍𝛼 ). We first note
that there is significant overlap in the SNPs used for imputation in both models. The SNPs of
greatest interest are rs7130173 and rs3087967, which not only have the largest marginal effects,
but also have an overall SNP-level effect (𝛽) that deviates from the gene-level effect in the
EN-EN model (𝛾 ≠ 0). This large deviation results in both SNPs having estimates that are much
larger in magnitude compared to the other SNPs in the imputation models. The selection of these
SNPs and their larger magnitudes align with previous studies that have identified rs7130173 and
rs3087967 as potential risk variants for colorectal cancer (Biancolella et al., 2014; Law et al.,
2019; Thean et al., 2012). The relative impact these two SNPs have on the overall imputation of
gene expression is larger for COLCA2 compared to COLCA1 based on the magnitude of their
weights in their respective imputation models. We also want to note that the top 8 SNPs in both
Tables 6.4 and 6.5 are highly correlated (𝑟 > 0.95) with each other. Even in the presence of high
correlation, the EN-EN model identifies genes that are associated with colorectal cancer and risk
variants whose effect is potentially mediated through both the expression of these genes and
other mechanisms not captured by these genes.
The second example focuses on two genes on chromosome 7, TRIM4, one of the novel
discoveries in Bien et al. (2019), and OR2AE1. The standard PrediXcan analysis identifies both
genes as being associated with case-control status after adjustment for multiple comparisons. In
contrast, the EN-EN model only selects TRIM4 and not OR2AE1. Upon review of the underlying
SNPs used to impute the expression for these genes, we find significant overlap (Table 6.6). All
but two of the SNPs included in the imputation model for OR2AE1 are also in the imputation
model for TRIM4. Additionally, the SNPs that are common to both models are highly correlated,
which results in similar marginal associations with case-control status. Given the similar
magnitude for the imputation weights (note that the sign flips for OR2AE1 (-) imputation weights
compared to TRIM4 (+)), we might expect the marginal association of imputed gene expression
with case-control status to be similar for the genes. This is in fact what we find with OR2AE1
having a slightly stronger marginal association with case-control status (𝑝 = 1.7𝑒 − 3)
compared to TRIM4 (𝑝 = 2.4𝑒 − 3). Despite having similar results for the PrediXcan analysis,
the proportion of variation in gene expression explained by the imputation model for OR2AE1 is
much lower (0.0593) compared to TRIM4 (0.466). The combination of shared SNPs and the
difference in imputation quality may explain why the EN-EN model selects TRIM4 and not OR2AE1.
In contrast to the previous example, none of the underlying variants have an effect that deviates
from the gene-level effect (𝛾 = 0), which makes a case for the effect of the true underlying
variant(s) being mediated largely through TRIM4.
In the last example, we return to the LIMA1 gene mentioned previously. This gene ranks
4th with respect to its marginal association with case-control status among all genes but is not
selected by the EN-EN model. The imputation model has a test 𝑅² of 0.012. The low test 𝑅² may
be a potential reason why it is not selected in our model. Despite setting the gene-level effect to
zero, the EN-EN model does select a subset of the SNPs used in the imputation model for LIMA1
(Table 6.7). As in the COLCA1/COLCA2 example, multiple SNPs in the imputation model for
LIMA1 have estimates that deviate from the gene-level effect of zero. The two SNPs with the
strongest marginal association (rs12372718 and rs10783387) also have the largest EN-EN
coefficients (𝛽 ). Unlike the prior example, the SNPs used to impute LIMA1 are not used to
impute the expression of any other gene. The top SNPs are again highly correlated, resulting in
similar marginal associations with case-control status and similar weights in the imputation
model. Based on these results, there does appear to be an underlying variant associated with
colorectal cancer in the region of LIMA1, but there is not strong evidence that the effect of this
variant is mediated through the expression levels of LIMA1; it may instead act through some
other mechanism (e.g. a nearby gene).
Table 6.4 Marginal Association with Colorectal Cancer Case-Control Status and EN-EN
Coefficient Estimates for SNPs Used in COLCA1 Imputation Model
SNP | PredictDB Weight | Marginal Z-Score | Marginal P-value | 𝜷 (EN-EN) | 𝜸 (EN-EN)
rs7130173 0.0836 -12.5 5.98e-36 -0.036 -0.0313
rs3087967 0.0353 -12.5 7.09e-36 -0.0431 -0.0392
rs4477469 0.0112 -12.4 1.58e-35 -0.0097 -0.0086
rs10789822 0.0118 -12.4 1.95e-35 -0.0037 -0.0026
rs7122375 0.0218 -12.2 2.09e-34 -0.0022 0
rs4608113 0.0122 -12.2 2.24e-34 -0.0010 0
rs3802842 0.0122 -12.2 2.43e-34 -0.001 0
rs3802840 0.0119 -12.2 2.78e-34 -0.001 0
rs6589220 0.316 -11.7 1.57e-31 -0.0071 0
rs7947849 -0.0486 8.63 6.20e-18 0.0003 0
rs1986684 0.144 -4.84 1.27e-06 -0.0038 0
rs4938639 -0.0017 -3.44 5.80e-04 0.0004 0
rs4936462 0.0601 -2.77 5.57e-03 -0.0018 0
rs1784635 -0.0251 2.73 6.42e-03 0.0064 0.0057
rs7939170 -0.022 2.72 6.47e-03 0.00064 0.0001
rs2724431 -0.0236 2.72 6.48e-03 0.00086 0.0003
rs1870962 -0.0254 2.71 6.79e-03 0.001 0.0003
rs1676518 -0.0205 2.71 6.82e-03 0.0005 0
rs10891268 0.119 -2.14 3.22e-02 -0.0048 0
rs1792802 -0.0033 2.11 3.52e-02 0.0004 0
rs12798554 -0.0523 1.99 4.65e-02 0.0003 0
rs11213927 0.129 -1.99 4.69e-02 -0.0008 0
rs7927079 -0.296 -1.96 5.00e-02 0.0018 0
rs7930326 0.089 -1.95 5.07e-02 -0.0006 0
rs7946320 -0.0234 -1.52 1.28e-01 0.0001 0
1. Regression coefficient for SNP in COLCA1 PredictDB imputation model
2. Marginal association of SNP with case-control status adjusted for age, sex, study, and first four principal components
3. SNP-level regression coefficient estimate in EN-EN model
4. Deviation of SNP-level effect from gene-level effect, 𝛾 = 𝛽 − 𝒁𝜶
Table 6.5 Marginal Association with Colorectal Cancer Case-Control Status and EN-EN
Coefficient Estimates for SNPs Used in COLCA2 Imputation Model
SNP | PredictDB Weight | Marginal Z-Score | Marginal P-value | 𝜷 (EN-EN) | 𝜸 (EN-EN)
rs7130173 0.158 -12.5 5.98e-36 -0.036 -0.0313
rs3087967 0.13 -12.5 7.09e-36 -0.0431 -0.0392
rs4477469 0.0377 -12.4 1.58e-35 -0.0097 -0.0086
rs10789822 0.0383 -12.4 1.95e-35 -0.0037 -0.0026
rs7122375 0.0762 -12.2 2.09e-34 -0.0022 0
rs4608113 0.0363 -12.2 2.24e-34 -0.001 0
rs3802842 0.036 -12.2 2.43e-34 -0.001 0
rs3802840 0.0356 -12.2 2.78e-34 -0.001 0
rs6589220 0.176 -11.7 1.57e-31 -0.0071 0
rs3087587 -0.0022 5.61 6.20e-18 0.0003 0
rs1986684 0.115 -4.84 1.27e-06 -0.0038 0
rs17111116 0.0319 3 5.80e-04 0.0004 0
rs4936462 0.0555 -2.77 5.57e-03 -0.0018 0
rs1784635 -0.0167 2.73 6.42e-03 0.0064 0.0057
rs7939170 -0.0133 2.72 6.47e-03 0.0006 0.0001
rs2724431 -0.0152 2.72 6.48e-03 0.0009 0.0003
rs1870962 -0.0173 2.71 6.79e-03 0.001 0.0003
rs1676518 -0.0113 2.71 6.82e-03 0.0005 0
rs10891268 0.158 -2.14 3.22e-02 -0.0048 0
rs12366145 -0.0994 -1.46 3.52e-02 0.0004 0
rs1123066 -0.0263 1.41 4.65e-02 0.0003 0
rs7113896 -0.147 -1.34 4.69e-02 -0.0008 0
rs11213674 0.0427 1.32 5.00e-02 0.0018 0
rs10891329 -0.0412 1.31 5.07e-02 -0.0006 0
rs12364447 0.0738 -1.3 1.28e-01 0.0001 0
1. Regression coefficient for SNP in COLCA2 PredictDB imputation model
2. Marginal association of SNP with case-control status adjusted for age, sex, study, and first four principal components
3. SNP-level regression coefficient estimate in EN-EN model
4. Deviation of SNP-level effect from gene-level effect, 𝛾 = 𝛽 − 𝒁𝜶
Table 6.6 Marginal Association with Colorectal Cancer Case-Control Status and EN-EN
Coefficient Estimates of SNPs Used in TRIM4 and OR2AE1 Imputation Models
SNP | PredictDB Weight (TRIM4) | PredictDB Weight (OR2AE1) | Marginal Z-Score | Marginal P-value | 𝜷 (EN-EN) | 𝜸 (EN-EN)
rs2527899 0.051 Not in Model -5.15 2.55e-07 -0.0021 0
rs2247761 0.073 Not in Model -5.15 2.57e-07 -0.0031 0
rs2527919 0.074 Not in Model -5.15 2.61e-07 -0.0031 0
rs2572005 0.075 Not in Model -5.14 2.71e-07 -0.0031 0
rs2572006 0.077 Not in Model -5.14 2.71e-07 -0.0032 0
rs2571998 0.033 -0.042 -5.14 2.73e-07 -0.0014 0
rs2527922 0.078 Not in Model -5.14 2.73e-07 -0.0033 0
rs2571997 0.033 -0.044 -5.13 2.88e-07 -0.0014 0
rs2527903 Not in Model -0.01 -5.13 2.92e-07 0 0
rs2572009 0.004 Not in Model -5.12 3.00e-07 -1.74e-04 0
rs2571996 Not in Model -0.011 -5.12 3.05E-07 0 0
rs2527907 0.034 -0.042 -5.12 3.07e-07 -0.0014 0
rs2571994 0.034 -0.043 -5.11 3.15e-07 -0.0014 0
rs2571992 0.034 -0.043 -5.11 3.25e-07 -0.0014 0
rs2572019 0.082 Not in Model -5.1 3.36e-07 -0.0034 0
rs2572010 0.08 Not in Model -5.09 3.54e-07 -0.0033 0
rs2082744 0.002 Not in Model -5.08 3.85e-07 -7.79e-05 0
rs474229 0.016 Not in Model -4.82 1.40e-06 -6.47e-04 0
rs1025576 0.034 Not in Model -4.76 1.89e-06 -0.0014 0
rs2572022 0.016 Not in Model -4.76 1.97e-06 -6.57e-04 0
rs2525550 0.008 Not in Model -3.2 0.0014 -3.50e-04 0
rs1128097 0.018 Not in Model 1.27 0.20 -7.45e-04 0
rs4236540 0.063 Not in Model -1.02 0.31 -0.0026 0
rs10953283 0.001 Not in Model -0.94 0.35 -5.60e-05 0
rs472660 0.027 Not in Model -0.73 0.46 -0.0012 0
rs10953281 0.033 Not in Model -0.62 0.54 -0.0014 0
rs11489584 -0.031 Not in Model -0.57 0.57 0.0013 0
rs12113906 -0.027 Not in Model -0.56 0.58 0.0011 0
1. Regression coefficient for SNP in TRIM4 PredictDB imputation model
2. Regression coefficient for SNP in OR2AE1 PredictDB imputation model
3. Marginal association of SNP with case-control status adjusted for age, sex, study, and first four principal components
4. SNP-level regression coefficient estimate in EN-EN model
5. Deviation of SNP-level effect from gene-level effect, 𝛾 = 𝛽 − 𝒁𝜶
Table 6.7 Marginal Association with Colorectal Cancer Case-Control Status and EN-EN
Coefficient Estimates for SNPs Used in LIMA1 Imputation Model
SNP | PredictDB Weight | Marginal Z-Score | Marginal P-value | 𝜷 (EN-EN) | 𝜸 (EN-EN)
rs12372718 -0.012 11.03 2.69e-28 0.031 0.031
rs10783387 -0.013 11.01 3.53e-28 0.023 0.023
rs11169578 -0.037 11 3.70e-28 0.009 0.009
rs11169571 -0.012 10.98 4.60e-28 0.009 0.009
rs10876098 -0.016 10.96 5.79e-28 0 0
rs7306677 -0.014 10.95 6.41e-28 0.005 0.005
rs7133974 -0.014 10.95 6.61e-28 2.3e-04 2.3e-04
rs1129406 -0.014 10.72 8.33e-27 0 0
rs11169567 -0.014 10.72 8.53e-27 0 0
rs7136702 0.035 -8.04 9.14e-16 0 0
rs7964643 -0.004 3.62 2.93e-04 0 0
rs7489251 -0.014 3.39 6.97e-04 0 0
rs10876033 0.008 -3.16 0.0016 0 0
rs10876035 0.009 -3.08 0.0021 0 0
rs4343103 0.012 -3.05 0.0023 0 0
rs11169423 0.013 -3.03 0.0024 0 0
rs4424740 0.011 -3.02 0.0025 0 0
1. Regression coefficient for SNP in LIMA1 PredictDB imputation model
2. Marginal association of SNP with case-control status adjusted for age, sex, study, and first four principal components
3. SNP-level regression coefficient estimate in EN-EN model
4. Deviation of SNP-level effect from gene-level effect, 𝛾 = 𝛽 − 𝒁𝜶
6.6 Discussion
In this chapter, we explored hierarchical regularized regression for binary outcomes
through an extension of regularized logistic regression. By modifying the fitting algorithm
presented in Chapter 3 to use a combination of iteratively reweighted least squares and
coordinate descent, we demonstrated that coordinate descent can still be used to efficiently fit
our model. The need to update the quadratic approximation does incur an increased
computational cost, especially as 𝑛 increases. This issue was hinted at in Chapter 5 where our
implementation of logistic lasso regression did not have significantly improved speed compared
to currently available software packages. Nevertheless, this approach allowed us to easily extend
to binary outcomes and could also be extended to other outcomes that can be fit with IRLS (e.g.
Poisson regression, Cox regression).
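For reference, a minimal sketch of the quadratic approximation step underlying this scheme is shown below (a generic logistic IRLS update, not the packaged implementation from Chapter 5); the working weights and response it returns are what the penalized weighted least-squares/coordinate descent update would consume.

```r
# Sketch: one IRLS step for logistic regression. The weighted working response
# (z, w) defines the quadratic approximation that coordinate descent then solves
# with the hierarchical penalties applied.
irls_step <- function(X, y, beta) {
  eta <- as.numeric(X %*% beta)   # current linear predictor
  p   <- 1 / (1 + exp(-eta))      # fitted probabilities
  w   <- pmax(p * (1 - p), 1e-5)  # IRLS weights, bounded away from zero
  z   <- eta + (y - p) / w        # working response
  list(weights = w, working_response = z)
}
```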
Simulations to assess the behavior of logistic ridge-lasso regression align with the results
of Chapter 2. Increasing the informativeness of the external data improved the predictive
performance of logistic ridge-lasso regression compared to standard logistic ridge regression.
Additionally, the impact on predictive ability of increasing the number of predictor variables or
the number of external variables matched those found for linear regression as well. There are
perhaps additional simulation scenarios that are specific to binary outcomes, such as class
imbalance, that should be evaluated in future work to assess their impact on prediction
performance. The results for the METABRIC breast cancer example were also replicated with
the predictive performance found to be similar regardless of whether the logistic or linear version
of our model was used to predict breast cancer recurrence. This alignment is expected, as logistic
regression and LDA tend to agree unless the LDA multivariate normality assumption for the
predictors is severely violated. Given that linear regression has a lower computational burden, its
use when both 𝑝 and 𝑛 become extremely large may be warranted, especially if prediction is of
primary concern.
Beyond verifying the behavior of the logistic version of our model, we recognized that
hierarchical regularized regression falls naturally into the PrediXcan framework used to assess
the effect of genetic variation on a phenotype that is mediated through gene expression. Shifting
our focus to model selection, we utilized a large consortium of case-control data for colorectal
cancer to compare a standard PrediXcan analysis to hierarchical regularized logistic regression.
Both methods were used to identify genes that are associated with colorectal cancer based on the
association between imputed gene expression, derived by external models of gene expression
and genetic variation, and case-control status. Within our framework, we further identified
potential risk variants among the SNPs used in the imputation models.
We found that additional insight can be gained by using hierarchical regularized logistic
regression to jointly model the imputed gene expression with the SNPs used to generate the
imputations. Specifically, the joint model allowed us to capture the effect of a variant through
two components. The first component is the effect of the variant as mediated through expression
of a nearby gene. The second component is an additional effect of the variant not captured by gene
expression of the genes included in the model, which could be explained by a combination of
other unknown mechanisms. One simple mechanism could be another nearby gene through
which a variant’s effect is truly mediated, but due to the hard cutoff of 1Mb for inclusion in any
given imputation model, the variant was not included in the imputation of that nearby gene. To
some extent, our method also accounts for imputation quality. If none of the variants are truly
associated with expression of a gene, the association of these variants with the outcome will be
poorly captured by the imputed expression. Our model can discard the gene-level effect if the
association is better explained by the variants directly. PrediXcan tends to still select these genes
when the variants have a strong marginal association with the outcome. The selection of only a
SNP and not the associated gene does not completely exclude the possibility that some
proportion of the SNP's effect is truly mediated through a given gene. As an example, our model
may only select the SNP if the variance explained by the overall imputation model is low and the
model includes other SNPs that do not have a marginal association with the outcome. The
resulting imputed gene expression will simply be a noisy version of the original true SNP signal.
The ideal scenario for capturing gene-level effects would be a set of SNPs that have moderate to
strong marginal associations with the outcome and that do explain some proportion of variation
in gene expression.
In our analysis, we did not follow up potential new discoveries because all 22
chromosomes were fit in separate models, which can result in a high false positive rate. Within a
given chromosome, the solution to the EN-EN model is relatively sparse (~ 40-60 nonzero
coefficients) but combining the nonzero coefficients from all models results in over 1000
“discoveries”. For proper selection, we would need to jointly fit all 22 chromosomes in a single
model, which could yield a sparser solution. In addition to fitting all 22 chromosomes jointly, to
make any formal inference for the nonzero estimates, we need to capture the uncertainty
associated with these estimates. Formal inference for regularized regression models is an active
field of research (Dezeure et al., 2015; Lee et al., 2016; Lockhart et al., 2014; Taylor and
Tibshirani, 2015) and applicability of currently available methods to our model would require
further evaluation.
Even in the absence of novel discoveries, applying hierarchical regularized logistic
regression in the PrediXcan framework demonstrated that our method can be used for more than
just prediction. The idea of using our method for variable selection has been alluded to in prior
chapters, but with a focus on selection of the external variables. In the context of the current
example, selection of both the first- and second-level variables has allowed us to generate a more
comprehensive picture of how genetic variation may influence a phenotype through gene
expression.
Appendices
A.1 Derivation of Ridge-Lasso in the Orthonormal Case with a 2nd-Level Intercept

Under the orthonormal assumption, the score equations for 𝜷, 𝛼₀, and 𝜶 are

$$(1 + \lambda_1)\boldsymbol{\beta} - \mathbf{X}^T\mathbf{y} - \lambda_1(\mathbf{1}_p\alpha_0 + \mathbf{Z}\boldsymbol{\alpha}) = 0 \quad (A1.1)$$

$$p\alpha_0 - \mathbf{1}_p^T(\boldsymbol{\beta} - \mathbf{Z}\boldsymbol{\alpha}) = 0 \quad (A1.2)$$

$$\boldsymbol{\alpha} + \mathbf{Z}^T\mathbf{1}_p\alpha_0 - \mathbf{Z}^T\boldsymbol{\beta} + \frac{\lambda_2}{\lambda_1}\frac{\partial}{\partial\boldsymbol{\alpha}}\|\boldsymbol{\alpha}\|_1 = 0 \quad (A1.3)$$

Solving equation A1.1 for 𝜷,

$$\boldsymbol{\beta} = \left(\frac{1}{1+\lambda_1}\right)\hat{\boldsymbol{\beta}}_{LS} + \left(\frac{\lambda_1}{1+\lambda_1}\right)(\mathbf{1}_p\alpha_0 + \mathbf{Z}\boldsymbol{\alpha}) \quad (A1.4)$$

and substituting this solution into equations A1.2 and A1.3, $\alpha_0$ is the mean of the LS estimates for 𝜷, $\bar{\beta}_{LS}$ (using $\mathbf{1}_p^T\mathbf{Z} = \mathbf{0}$ and $\mathbf{X}^T\mathbf{y} = \hat{\boldsymbol{\beta}}_{LS}$),

$$\tilde{\alpha}_0 = \frac{1}{p}\mathbf{1}_p^T\hat{\boldsymbol{\beta}}_{LS} = \bar{\beta}_{LS} \quad (A1.5)$$

and 𝜶 is a lasso regression of the centered LS estimates for 𝜷, $\hat{\boldsymbol{\beta}}_{cent(LS)} = \hat{\boldsymbol{\beta}}_{LS} - \mathbf{1}_p\bar{\beta}_{LS}$, on 𝒁 with penalty $(1+\lambda_1)\lambda_2/\lambda_1$:

$$\tilde{\boldsymbol{\alpha}} = \mathbf{Z}^T(\mathbf{X}^T\mathbf{y} - \mathbf{1}_p\bar{\beta}_{LS}) - \frac{(1+\lambda_1)\lambda_2}{\lambda_1}\frac{\partial}{\partial\boldsymbol{\alpha}}\|\boldsymbol{\alpha}\|_1 = \mathbf{Z}^T\hat{\boldsymbol{\beta}}_{cent(LS)} - \frac{(1+\lambda_1)\lambda_2}{\lambda_1}\frac{\partial}{\partial\boldsymbol{\alpha}}\|\boldsymbol{\alpha}\|_1 \quad (A1.6)$$

In terms of the soft-thresholding operator, $S(x, t) = \mathrm{sign}(x)(|x| - t)_+$, the solution for 𝜶 is

$$\tilde{\boldsymbol{\alpha}} = S\!\left(\mathbf{Z}^T\hat{\boldsymbol{\beta}}_{cent(LS)},\ \frac{(1+\lambda_1)\lambda_2}{\lambda_1}\right) \quad (A1.7)$$

The solution for 𝜷 can then be found by back-substitution of $\tilde{\alpha}_0$ and $\tilde{\boldsymbol{\alpha}}$ into equation A1.4:

$$\tilde{\boldsymbol{\beta}} = \left(\frac{1}{1+\lambda_1}\right)\hat{\boldsymbol{\beta}}_{LS} + \left(\frac{\lambda_1}{1+\lambda_1}\right)(\mathbf{1}_p\tilde{\alpha}_0 + \mathbf{Z}\tilde{\boldsymbol{\alpha}}) \quad (A1.8)$$
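A direct numerical transcription of equations A1.5-A1.8 is sketched below, assuming orthonormal 𝑿 and 𝒁 with the columns of 𝒁 centered, as in the derivation; it is intended only as a check of the closed-form solution, not as a general-purpose fitting routine.

```r
# Sketch: closed-form orthonormal ridge-lasso solution (equations A1.5 - A1.8).
soft_threshold <- function(x, t) sign(x) * pmax(abs(x) - t, 0)

ridge_lasso_orthonormal <- function(X, y, Z, lambda1, lambda2) {
  beta_ls   <- drop(crossprod(X, y))          # LS estimates when X'X = I
  alpha0    <- mean(beta_ls)                  # A1.5: 2nd-level intercept
  beta_cent <- beta_ls - alpha0               # centered LS estimates
  alpha     <- soft_threshold(drop(crossprod(Z, beta_cent)),
                              (1 + lambda1) * lambda2 / lambda1)            # A1.7
  beta      <- beta_ls / (1 + lambda1) +
               lambda1 / (1 + lambda1) * (alpha0 + drop(Z %*% alpha))       # A1.8
  list(beta = beta, alpha0 = alpha0, alpha = alpha)
}
```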
A.2 Derivation of the Ridge-Lasso Solution for the General Case with a 2nd-Level Intercept

As in Section 2.2.2, we utilize the SVD, $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$, and the variable transformation

$$\boldsymbol{\gamma} = \mathbf{Q}^T\boldsymbol{\beta} = \begin{bmatrix} \mathbf{V}^T \\ \mathbf{V}_{\perp}^T \end{bmatrix}\boldsymbol{\beta} = \begin{bmatrix} \boldsymbol{\theta} \\ \boldsymbol{\eta} \end{bmatrix}, \qquad \boldsymbol{\beta} = \mathbf{V}\boldsymbol{\theta} + \mathbf{V}_{\perp}\boldsymbol{\eta}$$

to rewrite the objective function, with the 2nd-level intercept now included for the external data:

$$\min_{\boldsymbol{\theta},\boldsymbol{\eta},\alpha_0,\boldsymbol{\alpha}}\ \frac{1}{2}\|\mathbf{y} - \mathbf{U}\mathbf{D}\boldsymbol{\theta}\|_2^2 + \frac{\lambda_1}{2}\left\|\begin{bmatrix}\boldsymbol{\theta}\\ \boldsymbol{\eta}\end{bmatrix} - \begin{bmatrix}\mathbf{V}^T(\mathbf{1}_p\alpha_0 + \mathbf{Z}\boldsymbol{\alpha})\\ \mathbf{V}_{\perp}^T(\mathbf{1}_p\alpha_0 + \mathbf{Z}\boldsymbol{\alpha})\end{bmatrix}\right\|_2^2 + \lambda_2\|\boldsymbol{\alpha}\|_1 \quad (A2.1)$$

The corresponding score equations for $\boldsymbol{\theta}$, $\boldsymbol{\eta}$, $\alpha_0$, and $\boldsymbol{\alpha}$ are

$$\boldsymbol{\theta} = (\mathbf{D}^2 + \lambda_1\mathbf{I}_n)^{-1}\left(\mathbf{D}\mathbf{U}^T\mathbf{y} + \lambda_1\mathbf{V}^T(\mathbf{1}_p\alpha_0 + \mathbf{Z}\boldsymbol{\alpha})\right) \quad (A2.2)$$

$$\boldsymbol{\eta} = \mathbf{V}_{\perp}^T(\mathbf{1}_p\alpha_0 + \mathbf{Z}\boldsymbol{\alpha}) \quad (A2.3)$$

$$p\alpha_0 = \mathbf{1}_p^T(\mathbf{V}\boldsymbol{\theta} + \mathbf{V}_{\perp}\boldsymbol{\eta}) \quad (A2.4)$$

$$\mathbf{Z}^T\mathbf{Z}\boldsymbol{\alpha} = \mathbf{Z}^T(\mathbf{V}\boldsymbol{\theta} + \mathbf{V}_{\perp}\boldsymbol{\eta} - \mathbf{1}_p\alpha_0) - \frac{\lambda_2}{\lambda_1}\frac{\partial}{\partial\boldsymbol{\alpha}}\|\boldsymbol{\alpha}\|_1 \quad (A2.5)$$

By substituting equations A2.2 and A2.3 into equation A2.5 and rearranging with respect to $\boldsymbol{\alpha}$,

$$\mathbf{Z}^T\mathbf{V}\,\mathrm{diag}\!\left(\frac{d_j^2}{d_j^2+\lambda_1}\right)\mathbf{V}^T\mathbf{Z}\boldsymbol{\alpha} = \mathbf{Z}^T\mathbf{V}\,\mathrm{diag}\!\left(\frac{d_j}{\sqrt{d_j^2+\lambda_1}}\right)\left[\mathrm{diag}\!\left(\frac{1}{\sqrt{d_j^2+\lambda_1}}\right)\mathbf{U}^T\mathbf{y} - \mathrm{diag}\!\left(\frac{d_j}{\sqrt{d_j^2+\lambda_1}}\right)\mathbf{V}^T\mathbf{1}_p\alpha_0\right] - \frac{\lambda_2}{\lambda_1}\frac{\partial}{\partial\boldsymbol{\alpha}}\|\boldsymbol{\alpha}\|_1$$

we obtain a form similar to a lasso regression. Letting $\mathbf{Z}_{inter}$ be the $p \times (q+1)$ matrix obtained by adding a column of ones for the intercept term,

$$\mathbf{Z}_{inter} = [\mathbf{1}_p \mid \mathbf{Z}]$$

and defining the two new variables

$$\mathbf{Z}^* = \mathrm{diag}\!\left(\frac{d_j}{\sqrt{d_j^2+\lambda_1}}\right)\mathbf{V}^T\mathbf{Z}_{inter}, \qquad \mathbf{y}^* = \mathrm{diag}\!\left(\frac{1}{\sqrt{d_j^2+\lambda_1}}\right)\mathbf{U}^T\mathbf{y}$$

the resulting solution for $\boldsymbol{\alpha}$ is a lasso regression of $\mathbf{y}^*$ on $\mathbf{Z}^*$ with penalty $\lambda_2/\lambda_1$, where we do not penalize the first variable (i.e. the intercept):

$$(\mathbf{Z}^*)^T\mathbf{Z}^*\boldsymbol{\alpha} = (\mathbf{Z}^*)^T\mathbf{y}^* - \frac{\lambda_2}{\lambda_1}\frac{\partial}{\partial\boldsymbol{\alpha}}\|\boldsymbol{\alpha}_{-1}\|_1 \quad (A2.6)$$

The solution to equation A2.6, $\tilde{\boldsymbol{\alpha}}$, is then substituted into equations A2.2 and A2.3, where $\alpha_0 = \tilde{\alpha}_1$ and $\boldsymbol{\alpha} = \tilde{\boldsymbol{\alpha}}_{-1}$. The resulting solutions to equations A2.2 and A2.3, $\tilde{\boldsymbol{\theta}}$ and $\tilde{\boldsymbol{\eta}}$, are then used to compute the solution for $\tilde{\boldsymbol{\beta}}$:

$$\tilde{\boldsymbol{\beta}} = \mathbf{V}\tilde{\boldsymbol{\theta}} + (\mathbf{I}_p - \mathbf{V}\mathbf{V}^T)(\mathbf{1}_p\tilde{\alpha}_1 + \mathbf{Z}\tilde{\boldsymbol{\alpha}}_{-1}) \quad (A2.7)$$
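The construction of 𝒁* and 𝒚* and the solve of the lasso sub-problem in equation A2.6 can be sketched as follows. The small coordinate descent loop, the fixed number of iterations, and the handling of the unpenalized intercept column through a per-coefficient penalty of zero are illustrative choices, not the implementation described in Chapter 3.

```r
# Sketch: general-case alpha sub-problem (A2.6) via the SVD construction,
# solved with a small coordinate descent (no standardization, no convergence frills).
solve_alpha_general <- function(X, y, Z, lambda1, lambda2, n_iter = 200) {
  sv <- svd(X)                                  # X = U D V'
  d  <- sv$d
  Z_inter <- cbind(1, Z)                        # prepend unpenalized intercept column
  Z_star  <- diag(d / sqrt(d^2 + lambda1)) %*% crossprod(sv$v, Z_inter)
  y_star  <- drop(diag(1 / sqrt(d^2 + lambda1)) %*% crossprod(sv$u, y))
  pen <- c(0, rep(lambda2 / lambda1, ncol(Z)))  # intercept is not penalized
  a   <- numeric(ncol(Z_inter))
  r   <- y_star - drop(Z_star %*% a)            # residual
  for (it in seq_len(n_iter)) {
    for (j in seq_along(a)) {
      zj <- Z_star[, j]
      if (sum(zj^2) == 0) next                  # skip degenerate columns
      rho  <- sum(zj * r) + sum(zj^2) * a[j]    # partial-residual correlation
      anew <- sign(rho) * max(abs(rho) - pen[j], 0) / sum(zj^2)
      r    <- r - zj * (anew - a[j])
      a[j] <- anew
    }
  }
  list(alpha0 = a[1], alpha = a[-1])            # alpha0 = first (intercept) coefficient
}
```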
A.3 Relationship between linear discriminant analysis (LDA) and linear regression
Consider the problem of modelling the relationship between 𝑝 predictors and a two-class outcome (i.e. a binary outcome). Let $\mathbf{X}$ be an $n \times p$ matrix that contains measurements of the $p$ predictors for $n$ observations and let $\mathbf{y}$ be an $n \times 1$ vector of the observed binary responses. In LDA, we assume that the conditional distribution of the $p$ predictors given the outcome, $f(X = x \mid Y = k) = f_k(x)$, is multivariate normal with class-dependent means $\mu_k$ $(k = 1, 2)$ and common covariance matrix $\Sigma$:

$$f_k(x) = \left((2\pi)^{p/2}|\Sigma|^{1/2}\right)^{-1}\exp\!\left(-\frac{1}{2}(x - \mu_k)^T\Sigma^{-1}(x - \mu_k)\right)$$

Under this assumption, LDA is derived by using Bayes' theorem,

$$P(Y = k \mid X = x) = \frac{P(X = x \mid Y = k)\,P(Y = k)}{\sum_{j=1}^{K} P(X = x \mid Y = j)\,P(Y = j)}$$

to solve for the conditional probability of belonging to class $k$ given $X = x$ (i.e. the posterior probability),

$$P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{j=1}^{K}\pi_j f_j(x)}$$

Here, $\pi_k = P(Y = k)$ is the prior probability for class $k$, which can be empirically estimated as the proportion of observations in class $k$, $\hat{\pi}_k = n_k/n$. Assigning each observation to the class with the highest posterior probability is then equivalent to assigning it to the class with the largest value of $\hat{\pi}_k f_k(x)$. For a two-class problem, we assign to class one if

$$\frac{f_1(x)\pi_1}{f_2(x)\pi_2} > 1$$

Taking logs on both sides and rearranging terms, an observation is assigned to class one if

$$x^T\Sigma^{-1}(\mu_1 - \mu_2) > \frac{1}{2}(\mu_1 + \mu_2)^T\Sigma^{-1}(\mu_1 - \mu_2) - \log\!\left(\frac{\pi_1}{\pi_2}\right)$$

LDA can also be derived from Fisher's linear discriminant, which aims to maximize the between-group variance relative to the within-group variance. Again, consider two classes, $k = 1, 2$, with class-specific means $\mu_1, \mu_2$ on the $p$ predictors and a common covariance matrix $\Sigma = \Sigma_1 = \Sigma_2$. Let $\beta$ be a $p \times 1$ vector that generates a linear combination of the means, $\beta^T\mu_k$, and variances, $\beta^T\Sigma\beta$. We want to find the linear combination (i.e. $\beta$) that maximizes the between-class variance relative to the within-class variance:

$$\max_{\beta}\ \frac{\beta^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T\beta}{\beta^T\Sigma\beta} = \frac{\beta^T S_B \beta}{\beta^T\Sigma\beta}$$

Differentiating with respect to $\beta$, setting to zero, and dividing by $\beta^T\Sigma\beta$, we find that the maximization problem becomes a generalized eigenvalue problem,

$$S_B\beta = \lambda(\Sigma\beta), \qquad \lambda = \frac{\beta^T S_B\beta}{\beta^T\Sigma\beta}$$

Assuming $\Sigma$ is invertible, this problem can be rewritten as a standard eigenvalue problem,

$$\Sigma^{-1}(\mu_1 - \mu_2)\left[(\mu_1 - \mu_2)^T\beta\right] = \lambda\beta$$

and we see that the direction of the linear boundary for the discriminant function is proportional to $\Sigma^{-1}(\mu_1 - \mu_2)$, since $(\mu_1 - \mu_2)^T\beta$ is a scalar, or equivalently

$$\beta = c\,\Sigma^{-1}(\mu_1 - \mu_2), \qquad c > 0$$
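The proportionality between the discriminant direction and the least-squares coefficients can be checked numerically with a small simulation under the stated two-class Gaussian assumptions; the sample sizes and mean shift below are arbitrary.

```r
# Sketch: numerical check that the LDA direction and the linear regression
# coefficients on a binary-coded outcome are proportional (two classes, shared Sigma).
set.seed(1)
n <- 500; p <- 4
X1 <- matrix(rnorm(n * p), n, p) + 1            # class 1, shifted mean
X2 <- matrix(rnorm(n * p), n, p)                # class 2
X  <- rbind(X1, X2)
y  <- c(rep(1, n), rep(0, n))

mu1 <- colMeans(X1); mu2 <- colMeans(X2)
S   <- ((n - 1) * cov(X1) + (n - 1) * cov(X2)) / (2 * n - 2)  # pooled covariance
lda_dir <- solve(S, mu1 - mu2)                  # direction of Fisher's discriminant

lm_coef <- coef(lm(y ~ X))[-1]                  # drop intercept
lda_dir / lm_coef                               # approximately constant across coordinates
```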
References
Bergersen, L. C., Glad, I. K., and Lyng, H. (2011). Weighted lasso with data integration. Stat
Appl Genet Mol Biol 10.
Bergstra, J., and Bengio, Y. (2012). Random search for hyper-parameter optimization. J. Mach.
Learn. Res. 13, 281-305.
Biancolella, M., Fortini, B. K., Tring, S., et al. (2014). Identification and characterization of
functional risk variants for colorectal cancer mapping to chromosome 11q23.1. Human
molecular genetics 23, 2198-2209.
Bien, S. A., Su, Y.-R., Conti, D. V., et al. (2019). Genetic variant predictors of gene expression
provide new insight into risk of colorectal cancer. Human genetics 138, 307-326.
Boyd, S. P., and Vandenberghe, L. (2004). Convex optimization. Cambridge, UK ; New York:
Cambridge University Press.
Cawley, G. C., and Talbot, N. L. C. (2010). On Over-fitting in Model Selection and Subsequent
Selection Bias in Performance Evaluation. J. Mach. Learn. Res. 11, 2079-2107.
Cheng, W.-Y., Yang, T.-H. O., and Anastassiou, D. (2013a). Biomolecular Events in Cancer
Revealed by Attractor Metagenes. PLOS Computational Biology 9, e1002920.
Cheng, W.-Y., Yang, T.-H. O., and Anastassiou, D. (2013b). Development of a Prognostic
Model for Breast Cancer Survival in an Open Challenge Environment. Science Translational
Medicine 5, 181ra150.
Conti, D. V., and Witte, J. S. (2003). Hierarchical modeling of linkage disequilibrium: genetic
structure and spatial relations. Am J Hum Genet 72, 351-363.
Cooperberg, M. R., Pasta, D. J., Elkin, E. P., et al. (2005). The University of California, San
Francisco Cancer of the Prostate Risk Assessment score: a straightforward and reliable
preoperative predictor of disease recurrence after radical prostatectomy. Journal of Urology 173,
1938-1942.
Copas, J. B. (1983). Regression, Prediction and Shrinkage. Journal of the Royal Statistical
Society. Series B (Methodological) 45, 311-354.
Cullen, J., Rosner, I. L., Brand, T. C., et al. (2015). A Biopsy-based 17-gene Genomic Prostate
Score Predicts Recurrence After Radical Prostatectomy and Adverse Surgical Pathology in a
Racially Diverse Population of Men with Clinically Low- and Intermediate-risk Prostate Cancer.
European Urology 68, 123-131.
Curtis, C., Shah, S. P., Chin, S.-F., et al. (2012). The genomic and transcriptomic architecture of
2,000 breast tumours reveals novel subgroups. Nature 486, 346-352.
Cuzick, J., Swanson, G. P., Fisher, G., et al. (2011). Prognostic value of an RNA expression
signature derived from cell cycle proliferation genes in patients with prostate cancer: a
retrospective study. The Lancet Oncology 12, 245-255.
D'Amico, A. V., Whittington, R., Malkowicz, S. B., et al. (1998). Biochemical Outcome After
Radical Prostatectomy, External Beam Radiation Therapy, or Interstitial Radiation Therapy for
Clinically Localized Prostate Cancer. JAMA 280, 969-974.
Dezeure, R., Bühlmann, P., Meier, L., and Meinshausen, N. (2015). High-Dimensional
Inference: Confidence Intervals, p-Values and R-Software hdi. Statistical Science 30, 533-558.
Dicker, L. H. (2014). Variance estimation in high-dimensional linear models. Biometrika 101,
269-284.
Eddelbuettel, D., and Francois, R. (2011). Rcpp: Seamless R and C++ Integration. Journal of
Statistical Software; Vol 1, Issue 8 (2011).
Efron, B. (2010). Large-scale inference : empirical Bayes methods for estimation, testing, and
prediction. Cambridge ; New York: Cambridge University Press.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Annals of
Statistics 32, 407-451.
Erho, N., Crisan, A., Vergara, I. A., et al. (2013). Discovery and Validation of a Prostate Cancer
Genomic Classifier that Predicts Early Metastasis Following Radical Prostatectomy. PLOS ONE
8, e66855.
Fan, J., Han, F., and Liu, H. (2014). Challenges of Big Data Analysis. National science review 1,
293-314.
Fan, J., and Li, R. (2001). Variable Selection via Nonconcave Penalized Likelihood and its
Oracle Properties. Journal of the American Statistical Association 96, 1348-1360.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization Paths for Generalized Linear
Models via Coordinate Descent. J Stat Softw 33, 1-22.
Gamazon, E. R., Wheeler, H. E., Shah, K. P., et al. (2015). A gene-based association method for
mapping traits using reference transcriptome data. Nat Genet 47, 1091-1098.
Ghaoui, L. E., Viallon, V., and Rabbani, T. (2010). Safe Feature Elimination in Sparse
Supervised Learning. CoRR abs/1009.4219.
Greenland, S. (1993). Methods for epidemiologic analyses of multiple exposures: a review and
comparative study of maximum-likelihood, preliminary-testing, and empirical-Bayes regression.
Stat Med 12, 717-736.
Hoerl, A. E., and Kennard, R. W. (1970). Ridge Regression - Biased Estimation for
Nonorthogonal Problems. Technometrics 12, 55-67.
Hogben, L. (2006). Handbook of Linear Algebra: CRC Press.
Horvath, S. (2013). DNA methylation age of human tissues and cell types. Genome Biol 14,
R115.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical
Learning: with Applications in R. New York, NY: Springer New York, New York, NY.
James, W., and Stein, C. (1961). Estimation with Quadratic Loss. In Proceedings of the Fourth
Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the
Theory of Statistics, 361-379. Berkeley, Calif.: University of California Press.
Kane, M., Emerson, J. W., and Weston, S. (2013). Scalable Strategies for Computing with
Massive Data. Journal of Statistical Software; Vol 1, Issue 14 (2013).
Kattan, M. W., Eastham, J. A., Stapleton, A. M. F., Wheeler, T. M., and Scardino, P. T. (1998).
A Preoperative Nomogram for Disease Recurrence Following Radical Prostatectomy for Prostate
Cancer. JNCI: Journal of the National Cancer Institute 90, 766-771.
Kwa, M., Makris, A., and Esteva, F. J. (2017). Clinical utility of gene-expression signatures in
early stage breast cancer. Nature Reviews Clinical Oncology 14, 595.
Law, P. J., Timofeeva, M., Fernandez-Rozadilla, C., et al. (2019). Association analyses identify
31 new risk loci for colorectal cancer susceptibility. Nature Communications 10, 2154.
Lee, J. D., Sun, D. L., Sun, Y., and Taylor, J. E. (2016). Exact post-selection inference, with
application to the lasso. Ann. Statist. 44, 907-927.
Lockhart, R., Taylor, J., Tibshirani, R. J., and Tibshirani, R. (2014). A significance test for the
lasso. Ann. Statist. 42, 413-468.
Lonsdale, J., Thomas, J., Salvatore, M., et al. (2013). The Genotype-Tissue Expression (GTEx)
project. Nature Genetics 45, 580.
Margolin, A. A., Bilal, E., Huang, E., et al. (2013). Systematic analysis of challenge-driven
improvements in molecular prognostic models for breast cancer. Science translational medicine
5, 181re181-181re181.
Nocedal, J., and Wright, S. (2006). Numerical Optimization: Springer New York.
Park, T., and Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical
Association 103, 681-686.
Shahabi, A., Lewinger, J. P., Ren, J., et al. (2016). Novel Gene Expression Signature Predictive
of Clinical Recurrence After Radical Prostatectomy in Early Stage Prostate Cancer Patients. The
Prostate 76, 1239-1256.
Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine
learning algorithms. In Proceedings of the 25th International Conference on Neural Information
Processing Systems - Volume 2, 2951-2959. Lake Tahoe, Nevada: Curran Associates Inc.
Tai, F., and Pan, W. (2007). Incorporating prior knowledge of predictors into penalized
classifiers with multiple penalty terms. Bioinformatics 23, 1775-1782.
Taylor, J., and Tibshirani, R. J. (2015). Statistical learning and selective inference. Proceedings
of the National Academy of Sciences 112, 7629.
Thean, L. F., Li, H. H., Teo, Y. Y., et al. (2012). Association of Caucasian-identified variants
with colorectal cancer risk in Singapore Chinese. PloS one 7, e42407-e42407.
Thomas, D. C., Conti, D. V., Baurley, J., Nijhout, F., Reed, M., and Ulrich, C. M. (2009). Use of
pathway information in molecular epidemiology. Human Genomics 4, 21-42.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal
Statistical Society Series B-Methodological 58, 267-288.
Tibshirani, R., Bien, J., Friedman, J., et al. (2012). Strong rules for discarding predictors in lasso-
type problems. Journal of the Royal Statistical Society. Series B, Statistical methodology 74,
245-266.
Tseng, P. (2001). Convergence of a Block Coordinate Descent Method for Nondifferentiable
Minimization. Journal of Optimization Theory and Applications 109, 475-494.
van de Wiel, M. A., Lien, T. G., Verlaat, W., van Wieringen, W. N., and Wilting, S. M. (2016).
Better prediction by use of co-data: adaptive group-regularized ridge regression. Stat Med 35,
368-381.
Yuan, M., and Lin, Y. (2006). Model selection and estimation in regression with grouped
variables. Journal of the Royal Statistical Society Series B-Statistical Methodology 68, 49-67.
Zeng, Y., and Breheny, P. (2017). The biglasso Package: A Memory- and Computation-Efficient
Solver for Lasso Model Fitting with Big Data in R. arXiv.
Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American
Statistical Association 101, 1418-1429.