GENERALIZED LINEAR DISCRIMINANT ANALYSIS FOR HIGH-DIMENSIONAL GENOMIC DATA WITH EXTERNAL INFORMATION
by
Sisi Li
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree of
DOCTOR OF PHILOSOPHY (BIOSTATISTICS)
May 2019
Copyright 2019 Sisi Li
Table of Contents
Chapter 1 Introduction
1.1 High-dimensional classification
1.2 Linear discriminant analysis
1.3 Existing methods for high-dimensional LDA
1.3.1 Lasso penalized discriminant analysis
1.3.2 Scout method
1.3.3 Penalized Fisher's linear discriminant
Chapter 2 Sparse LDA for high-dimensional classification
2.1 Introduction
2.2 Methods
2.2.1 Graphical lasso algorithm
2.2.2 Soft-thresholding operator
2.2.3 Sparse LDA: an extension of scout LDA
2.3 Simulation studies
2.3.1 Simulation 1: Effect size, correlation, dimensionality
2.3.2 Simulation 2: Three forms of correlation
2.3.3 Simulation 3: Robustness
2.4 Application: Breast Cancer Survival Prediction on METABRIC Dataset
2.5 Discussion
Chapter 3 Incorporating external information for sparse LDA
3.1 Introduction
3.2 Convex optimization problem
3.3 Incorporating external information in inverse covariance estimation
3.4 Simulation studies
3.5 Room for improvement
3.6 Discussion
Chapter 4 Self-training semi-supervised algorithms for high-dimensional LDA
4.1 Introduction
4.2 Methods
4.2.1 Algorithm 1
4.2.2 Algorithm 2
4.2.3 Algorithm 3
4.3 Simulation studies
4.4 Discussion
Chapter 5 Regularized Hierarchical Regression and Classification for High-Dimensional Genomic Data with External Information
5.1 Abstract
5.2 Introduction
5.3 Methods
5.4 Simulation studies
5.4.1 Simulation settings
Scenario 1: second-level information
Scenario 2: first-level information
Scenario 3: correlation among features
Scenario 4: sample size
Scenario 5: first-level dimensionality
Scenario 6: second-level dimensionality
5.4.2 Results
5.5 Application: Breast Cancer Survival Prediction on METABRIC Dataset
5.5.1 Results I: repetition of k-fold cross-validation
5.5.2 Results II: k-fold repeated cross-validation
5.5.3 Results III: a ratio of 2:1 vs 1:1 between training set and test set
5.5.4 Results IV: standardization vs non-standardization on gene expressions
5.5.5 Results V: all samples vs a subset of ER+/HER2- tumors
5.5.6 Results VI: 200 vs 20,000 probes
5.5.7 Results VII: top 100 vs top 10 genes in attractor metagenes
5.5.8 Results VIII: other variants of external information
5.5.9 Results IX: including vs not including clinical variables
5.5.10 Summary
5.6 Three strategies to choose values for tuning parameters using k-fold cross-validation
5.6.1 Strategy 1: K-fold nested cross-validation on one-dimensional grids
5.6.2 Strategy 2: K-fold cross-validation on a two-dimensional grid
5.6.3 Strategy 3: K-fold repeated cross-validation on a two-dimensional grid
5.6.4 Discussion
5.7 Discussion
5.8 Appendix A
REFERENCES
Acknowledgements
Working as a Ph.D. student in biostatistics at the University of Southern California has been a pleasant experience for me. I want to thank my advisor, Dr. Juan Pablo Lewinger, for mentoring my doctoral work. Dr. Lewinger introduced me to the world of statistics and genetics, taught me how to do research, and provided insightful guidance throughout my dissertation research. His friendly personality, endless patience and kind encouragement helped me overcome many difficulties during these years. It would hardly have been possible for me to complete this journey without his support. I also want to thank Dr. Duncan Thomas, Dr. David Conti, Dr. Kimberly Siegmund and Dr. Robertas Gabrys for their valuable advice. I also appreciate the help of Garrett Weaver and Chubing Zeng; it was my great fortune to work with them as a team. In addition, I want to thank all faculty and staff of the Department of Preventive Medicine for their help. Finally, I would like to thank my husband, Kan Wang, for his constant support, and I am especially grateful to my parents for their unconditional love.
Chapter 1 Introduction
1.1 High-dimensional classification
Statistical classification is a supervised learning problem where the goal is to predict
categorical outcomes. There are a variety of methods for classification, such as logistic
regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), k-nearest
neighbors, tree-based methods and support vector machine. In the low-dimensional setting where
the number of observations is much larger than the number of features ($n > p$), traditional classification techniques like logistic regression, LDA and QDA perform quite well. However, the rapid development of data-collection technologies in recent years has led to a substantial increase in the number of measurable features. This puts us in the high-dimensional setting, where the number of features is much larger than the number of observations ($p \gg n$). The high-dimensional setting presents huge challenges for traditional methods and has become an important topic in many fields, such as bioinformatics, medicine, finance and geography. As an example, consider a tumor classification problem based on whole-genome gene expression features, which typically involves more than 20,000 expression probes measured in only hundreds of patients. When the number of features far exceeds the sample size, traditional methods either cannot be applied at all or perform suboptimally. For example, the LDA and QDA classifiers do not even have a well-defined discriminant direction when $p \gg n$, because the empirical covariance matrix is not invertible. When the dimensionality is high, the aggregated error in estimating population centroids can be so large that a classifier using all features performs as poorly as random guessing (1). Moreover, classifiers based on all $p$ features are difficult to interpret and risk overfitting (2).
During the past decades, many methods have been proposed to address the challenges
presented by high-dimensional settings. Most of the approaches enhance prediction performance
through feature selection or dimensionality reduction. Penalized methods such as ridge
regression (3), the lasso (4) and the elastic net (5) are widely used. They impose different
regularization penalties in the regression model to shrink the estimated regression coefficients
towards zero. The $L_1$-norm penalty used in the lasso and the elastic net shrinks some coefficients exactly to zero, effectively performing feature selection. Well-known dimensionality reduction methods include principal components regression and partial least squares regression (6). They use a set of linear combinations to project the $p$-dimensional feature space onto a lower-dimensional subspace where we can perform classical, low-dimensional classification.
Linear discriminant analysis is a simple classification technique that performs well in
low-dimensional settings. Recently, several methods that extend LDA for high-dimensional data
have been proposed (7-9). In this thesis, we focus on novel extensions of LDA for high-
dimensional settings using penalized methods. We propose two types of high-dimensional LDA
that derive the classifiers in different ways. In Chapter 2, we present a simple modification on the
Scout classification method (10), an existing high-dimensional LDA method based on a
covariance-regularized regression. We use sparsity-inducing penalties to estimate both the inverse covariance matrix of the features and the difference in class mean vectors. More specifically, as in Scout LDA, the graphical lasso algorithm (11) is used to obtain a sparse estimate of the inverse covariance matrix. In addition, a soft-thresholding operator is applied to the vector of differences in empirical means to make the latter sparse. We show that this novel classifier, combining the graphical lasso and the soft-thresholding operator, performs much better than the Scout classification method, and even outperforms some widely used high-dimensional classifiers, such as lasso penalized logistic regression, in some scenarios.
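The soft-thresholding operator used to sparsify the mean-difference vector can be sketched in a few lines. This is a generic illustration rather than the thesis code; the threshold `t` stands in for a tuning parameter that would be chosen by cross-validation in practice.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding: shrink each entry toward zero by t,
    setting entries with absolute value <= t exactly to zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Sparsify a toy vector of differences in empirical class means.
mean_diff = np.array([2.5, -0.3, 0.0, 1.1, -1.8])
sparse_diff = soft_threshold(mean_diff, t=1.0)
# entries with |value| <= 1 become exactly zero; the rest shrink by 1
```

Applying the operator to the vector of empirical mean differences zeroes out features whose class means are nearly equal, which is what makes the resulting discriminant direction sparse.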
With the advent of large-scale sequencing and other high-throughput technologies, large
international projects like The Cancer Genome Atlas (TCGA) and Gene Ontology (GO) have
generated massive amounts of data on the role of the genome in both normal and disease
processes. These projects provide great opportunities, as well as challenges, for refining current gene expression analysis by incorporating such external information. Therefore, in Chapter 3 and
Chapter 4, we discuss two approaches to incorporate a specific type of external information (e.g.
TCGA gene expression data) into the high-dimensional LDA classifier proposed in Chapter 2.
Chapter 3 presents a convex optimization framework that incorporates the external data in the
estimation of inverse covariance in graphical lasso to improve the classification performance. In
Chapter 4, we treat the external information as unlabeled data and formalize the problem from a
semi-supervised learning perspective. We train classification models from the labeled and the
unlabeled data together through a self-training algorithm, which is the simplest semi-supervised
learning method. Simulation studies show that both approaches have a rather limited effect on prediction performance. A study of the estimation error of the coefficients is performed to investigate the potential room for improvement. We find that sparsity is more important than accuracy when the dimensionality is high. Thus, a better estimate of the inverse covariance matrix does not necessarily guarantee a better classifier for high-dimensional data.
Many techniques have been developed either to deal with high dimensions or to
incorporate external information, but not both. For instance, penalized methods shrink coefficient
estimates without considering external information, while traditional hierarchical models (12)
that can incorporate external information in a second level cannot handle high dimensionality.
Motivated by the fact that the standard low-dimensional LDA classifier is equivalent to linear
regression of binary outcomes, we propose a new method to extend LDA through least squares in
Chapter 5. Specifically, we introduce a regularized hierarchical regression model to handle the
challenges of high-dimensional data and to incorporate external information (e.g. gene
annotations) simultaneously. In this approach, we model gene annotations as covariates of gene
expression features. A penalized least squares formulation is used to include a second-level
regression for gene expression coefficients on gene-annotations and to impose a ridge penalty
term on the second-level coefficients. Even though this model has an explicit analytical solution,
we transform it into an equivalent ridge regression which can be fitted efficiently. The
regularized hierarchical model demonstrates significant improvement over methods that do not
incorporate external information.
1.2 Linear discriminant analysis
Linear discriminant analysis (LDA) is a classification method that has been widely
applied in various fields of study. In particular, its application to the analysis of gene expression
data is of interest to us. In LDA, linear boundaries (i.e. hyperplanes) in feature space are used to separate the different classes. Here we focus on binary classification, but LDA applies to
multiclass problems as well. Features in each class are assumed to follow a multivariate normal
distribution with a class-specific mean vector and a common covariance matrix. There are three
different ways to formulate the hyperplane that defines the LDA classifier. We will show in the
following that these three formulations are equivalent and lead to the same classification rule.
The first approach is a simple application of Bayes' theorem. Suppose the outcome variable $Y$ has $K$ classes. Let $f_k(x) = P(X = x \mid Y = k)$ denote the density function of the vector of features $X = (X_1, X_2, \ldots, X_p)$ for an observation coming from the $k$th class. By Bayes' theorem, the posterior probability that an observation belongs to the $k$th class given the values of $X$ is

$$P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{i=1}^{K} \pi_i f_i(x)} \quad (1.1)$$

where $\pi_i$ is the prior probability for the $i$th class. It is then natural to classify an observation to the class with the largest posterior probability. To implement this idea, we need a procedure to estimate the posterior probability from data. First, we need to specify the prior probability $\pi_i$ for each class. A straightforward approach is to estimate it empirically as the fraction of the training observations that belong to the $k$th class, namely $\hat{\pi}_k = n_k / n$, where $n_k$ is the sample size of the $k$th class and $n$ is the total sample size across all classes. Since linear discriminant analysis assumes that the vector of features in the $k$th class is normally distributed with a class-specific mean vector $\mu_k$ and a common covariance matrix $\Sigma$, the density function is

$$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)\right) \quad (1.2)$$

Plugging $\hat{\pi}_k$ and $f_k(x)$ into (1.1) and performing some straightforward algebra yields the LDA classifier, which assigns an observation to the class with the largest value of

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log\left(\frac{n_k}{n}\right) \quad (1.3)$$
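The classification rule in (1.3) can be evaluated directly with plug-in estimates. The sketch below uses simulated data purely for illustration, and assumes $n > p$ so that the empirical pooled covariance is invertible; it is not the thesis implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3
X1 = rng.normal(size=(30, p))              # class 1: mean 0
X2 = rng.normal(size=(30, p)) + 2.0        # class 2: mean 2 in every coordinate

# Plug-in estimates: empirical means, pooled within-class covariance, empirical priors.
mu1_hat, mu2_hat = X1.mean(axis=0), X2.mean(axis=0)
n1, n2 = len(X1), len(X2)
n = n1 + n2
Sigma_hat = ((X1 - mu1_hat).T @ (X1 - mu1_hat) +
             (X2 - mu2_hat).T @ (X2 - mu2_hat)) / n
Theta_hat = np.linalg.inv(Sigma_hat)       # invertible here since n > p

def delta(x, mu_k, n_k):
    # delta_k(x) = x' Sigma^{-1} mu_k - 0.5 mu_k' Sigma^{-1} mu_k + log(n_k / n)
    return x @ Theta_hat @ mu_k - 0.5 * mu_k @ Theta_hat @ mu_k + np.log(n_k / n)

x_new = np.full(p, 1.8)                    # a point close to the class-2 centroid
label = 2 if delta(x_new, mu2_hat, n2) > delta(x_new, mu1_hat, n1) else 1
```

The observation is assigned to whichever class gives the larger discriminant score, exactly as (1.3) prescribes.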
The second formulation of LDA is essentially a special case of Fisher's linear discriminant method, which aims to maximize the variance between classes while minimizing the variance within classes. Suppose two classes of observations differ both in their means $(\mu_1, \mu_2)$ and in their covariance matrices $(\Sigma_1, \Sigma_2)$. Fisher's linear discriminant seeks a linear combination of features $\beta^T x = \beta_1 x_1 + \cdots + \beta_p x_p$ such that the ratio of the variance between classes to the variance within classes,

$$\frac{(\beta^T \mu_2 - \beta^T \mu_1)^2}{\beta^T \Sigma_2 \beta + \beta^T \Sigma_1 \beta} = \frac{\left(\beta^T (\mu_2 - \mu_1)\right)^2}{\beta^T (\Sigma_2 + \Sigma_1) \beta},$$

is maximized. This ratio measures the separation between the distributions of the linear combination in the two classes. To maximize this separation, the vector of parameters $\beta$ should be proportional to $(\Sigma_1 + \Sigma_2)^{-1} (\mu_2 - \mu_1)$. Because LDA assumes that the features in different classes share the same covariance matrix, substituting $\Sigma_1 = \Sigma_2 = \Sigma$ gives the discriminant direction $\beta = c\,\Sigma^{-1}(\mu_2 - \mu_1)$, where $c$ is a positive constant. Thus, an observation is classified to class 2 or class 1 according to whether

$$\left(x - \frac{\mu_2 + \mu_1}{2}\right)^T \Sigma^{-1} (\mu_2 - \mu_1) + \log(\pi_2 / \pi_1) > 0 \quad (1.4)$$

Plugging into (1.4) the sample mean vectors as estimates of $\mu_1$ and $\mu_2$, the pooled sample covariance as an estimate of $\Sigma$, and the prior estimates $\hat{\pi}_1 = n_1/n$ and $\hat{\pi}_2 = n_2/n$ yields the LDA classifier.
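The equivalence of rule (1.4) with the Bayes formulation can be checked numerically: the left-hand side of (1.4) equals $\delta_2(x) - \delta_1(x)$ from (1.3). A minimal check on arbitrary simulated parameters (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n1, n2 = 4, 25, 35
n = n1 + n2
mu1, mu2 = rng.normal(size=p), rng.normal(size=p)
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)            # an arbitrary symmetric positive definite covariance
Theta = np.linalg.inv(Sigma)
x = rng.normal(size=p)

# Fisher-style rule (1.4), with priors pi_k = n_k / n:
fisher_score = (x - (mu2 + mu1) / 2) @ Theta @ (mu2 - mu1) + np.log(n2 / n1)

# Difference of the Bayes discriminant scores delta_2(x) - delta_1(x) from (1.3):
def delta(mu_k, n_k):
    return x @ Theta @ mu_k - 0.5 * mu_k @ Theta @ mu_k + np.log(n_k / n)

bayes_diff = delta(mu2, n2) - delta(mu1, n1)
# fisher_score and bayes_diff agree up to floating-point rounding
```

The agreement follows algebraically because $\mu_2^T \Theta \mu_2 - \mu_1^T \Theta \mu_1 = (\mu_2 + \mu_1)^T \Theta (\mu_2 - \mu_1)$ when $\Theta$ is symmetric.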
The third, equivalent formulation of the LDA hyperplane can be obtained from a linear regression on a modified binary outcome with an adjusted intercept. Code the class labels as $y_i = -n/n_1$ for observations in class 1 and $y_i = n/n_2$ for observations in class 2. The least squares problem

$$(\hat{\boldsymbol{\beta}}, \hat{\beta}_0) = \arg\min \sum_{i=1}^{n} (y_i - \beta_0 - \boldsymbol{x}_i^T \boldsymbol{\beta})^2 \quad (1.5)$$

yields $\hat{\boldsymbol{\beta}} = c\,\hat{\boldsymbol{\Sigma}}^{-1}(\hat{\boldsymbol{\mu}}_2 - \hat{\boldsymbol{\mu}}_1)$ for a positive constant $c$, which is the same as the discriminant direction of LDA. However, using $\hat{\beta}_0$ from (1.5) will not give the same classifier as LDA in (1.4). To obtain from linear regression the classifier equivalent to the previous two formulations, Mai and Zou (2012) (9) proved that we only need to adjust the intercept to the optimal value

$$\hat{\beta}_0^{\mathrm{opt}} = -(\hat{\boldsymbol{\mu}}_2 + \hat{\boldsymbol{\mu}}_1)^T \hat{\boldsymbol{\beta}} / 2 + \hat{\boldsymbol{\beta}}^T \hat{\boldsymbol{\Sigma}} \hat{\boldsymbol{\beta}} \left[(\hat{\boldsymbol{\mu}}_2 - \hat{\boldsymbol{\mu}}_1)^T \hat{\boldsymbol{\beta}}\right]^{-1} \log(n_2 / n_1) \quad (1.6)$$

In summary, we can obtain the LDA classifier by estimating the discriminant direction from the linear regression and using the adjusted intercept in (1.6).
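A small numerical sketch (illustrative, not the thesis code) confirms that least squares on the coded outcome recovers the LDA direction, and computes the adjusted intercept (1.6):

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, p = 20, 30, 3
n = n1 + n2
X1 = rng.normal(size=(n1, p))
X2 = rng.normal(size=(n2, p)) + 1.0
X = np.vstack([X1, X2])
# Coded outcome: -n/n1 for class 1 and n/n2 for class 2.
y = np.concatenate([np.full(n1, -n / n1), np.full(n2, n / n2)])

# Ordinary least squares with an intercept: problem (1.5).
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
beta = coef[1:]

# Plug-in LDA quantities: empirical means and pooled within-class covariance.
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / n
lda_dir = np.linalg.solve(Sigma, mu2 - mu1)
# beta is proportional to lda_dir: the same discriminant direction.

# Adjusted intercept (1.6):
beta0_opt = (-(mu2 + mu1) @ beta / 2
             + (beta @ Sigma @ beta) / ((mu2 - mu1) @ beta) * np.log(n2 / n1))
```

The proportionality is an exact algebraic identity, since the centered normal equations decompose $X^T X$ into within-class and between-class pieces, so the check holds to machine precision rather than only approximately.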
Because of its simplicity and good performance, linear discriminant analysis is still
widely used in different fields even though more sophisticated classification techniques have
been proposed. With the rapid increase in the scale of available features, classification problems
based on high-dimensional datasets have become important in many applications, such as tumor
classification with gene expression data. However, in its original formulation, linear discriminant analysis is not applicable to high-dimensional data because, when the number of features $p$ is much larger than the sample size $n$, the empirical covariance matrix is not invertible. In recent years, LDA has been successfully extended to high-dimensional classification problems. For example, Mai and Zou (2012) proposed a sparse discriminant analysis specifically for high-dimensional data via lasso penalized least squares. Witten & Tibshirani (2011) (8) extended LDA through Fisher's linear discriminant. The Scout family of methods, which use a shrunken estimate of the inverse covariance matrix of the features, was proposed in Witten & Tibshirani (2009) (10). We review these methods in detail below.
1.3 Existing methods for high-dimensional LDA
1.3.1 Lasso penalized discriminant analysis
Motivated by the fact (reviewed in the previous section) that linear discriminant analysis
can be obtained by a standard linear regression on a modified binary outcome, Mai and Zou
(2012) proposed an efficient approach for high-dimensional sparse discriminant analysis based
on a lasso penalized least squares formulation. One crucial advantage of the lasso penalized
discriminant analysis is that it can simultaneously identify a subset of discriminative features and
estimate a sparse Bayes classification direction when the number of features is much larger than
the sample size. In addition, the lasso penalized discriminant analysis is computationally efficient
even when the dimension is very high, since the computational burden of the penalized least
squares formulation is rather low. Moreover, unlike methods that rely on a strong assumption of
independence among features, such as the nearest shrunken centroids classifier by Tibshirani et
al. (2002) and features annealed independence rules by Fan and Fan (2008), this approach allows
for correlation among features.
As mentioned in the previous section, the least squares formulation in (1.5) yields the same classification direction as linear discriminant analysis in (1.4) in the traditional $p < n$ setting. However, this relation does not hold in high-dimensional settings, since the empirical covariance matrix is not invertible when $p > n$, and thus the classification direction of LDA is not well defined. A regularized regression with a lasso penalty can shrink most unimportant regression coefficients to exactly zero, which is equivalent to selecting a relatively small subset of effective features. When the true number of predictive features is indeed small,
this can lead to a significant improvement in prediction. Specifically, the lasso penalized discriminant analysis uses the lasso penalized least squares formulation

$$(\hat{\beta}^{\lambda}, \hat{\beta}_0^{\lambda}) = \arg\min \left\{ n^{-1} \sum_{i=1}^{n} (y_i - \beta_0 - x_i^T \beta)^2 + \sum_{j=1}^{p} \lambda |\beta_j| \right\} \quad (1.7)$$

to compute a discriminant direction $\hat{\beta}^{\lambda}$ for a binary classification problem, where $\lambda$ is a parameter that can be tuned by cross-validation. The classifier takes the form

$$x^T \hat{\beta}^{\lambda} + \hat{\beta}_0 = 0. \quad (1.8)$$

As in the $p < n$ case, the intercept $\hat{\beta}_0$ used in (1.8) is different from $\hat{\beta}_0^{\lambda}$ in (1.7). As mentioned in Section 1.2, the optimal intercept of the LDA classifier is given by

$$\hat{\beta}_0^{\mathrm{opt}} = -(\hat{\mu}_2 + \hat{\mu}_1)^T \hat{\beta} / 2 + \hat{\beta}^T \hat{\Sigma} \hat{\beta} \left[(\hat{\mu}_2 - \hat{\mu}_1)^T \hat{\beta}\right]^{-1} \log(n_2 / n_1).$$

The classification direction together with the optimal intercept fully determine the classification rule of LDA (1.8) in high-dimensional settings.
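A minimal sketch of this procedure is shown below, using scikit-learn's `Lasso` as the penalized least squares solver. This is an illustration under simulated data, not Mai and Zou's implementation; note that scikit-learn parameterizes the penalty slightly differently from (1.7), and the penalty level is fixed here rather than tuned by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n1 = n2 = 30
n, p = n1 + n2, 100                       # high-dimensional: p > n
X1 = rng.normal(size=(n1, p))
X2 = rng.normal(size=(n2, p))
X2[:, :5] += 1.5                          # only the first 5 features discriminate
X = np.vstack([X1, X2])
# Coded outcome: -n/n1 for class 1 and n/n2 for class 2.
y = np.concatenate([np.full(n1, -n / n1), np.full(n2, n / n2)])

# Lasso penalized least squares, as in (1.7); penalty level fixed for illustration.
fit = Lasso(alpha=0.5, fit_intercept=True).fit(X, y)
beta = fit.coef_                          # sparse discriminant direction

# Adjusted intercept (1.6) with plug-in estimates.
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / n
beta0 = (-(mu2 + mu1) @ beta / 2
         + (beta @ Sigma @ beta) / ((mu2 - mu1) @ beta) * np.log(n2 / n1))

predicted_class_2 = (X @ beta + beta0 > 0)   # classify to class 2 when positive
```

Most of the 95 noise coordinates of `beta` are exactly zero, so the fitted direction doubles as a feature selection: only features surviving the penalty enter the classification rule.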
The lasso penalized least squares formulation in (1.7) yields a sparse estimate of the classification direction, which is essentially equivalent to a feature selection procedure. This method shows better performance than the $\ell_1$-penalized linear discriminant of Witten & Tibshirani (2011) and other sparse discriminant methods that assume independence among features, such as the features annealed independence rules proposed by Fan and Fan (2008) and the nearest shrunken centroids classifier proposed by Tibshirani et al. (2002). Admittedly, methods based on independence rules are computationally efficient, especially when the dimension is high, but they may select misleading features. By allowing for a correlation structure among the features, the lasso penalized discriminant analysis (Mai and Zou, 2012) identifies discriminative features more accurately, and hence classifies better than methods that rely on independence rules. Because of its
consistently good performance in classification, we will use the lasso penalized discriminant
analysis as a reference, to which we will compare our high-dimensional LDA method along with
lasso penalized logistic regression in Chapter 2.
1.3.2 Scout method
Scout is a family of prediction methods proposed by Witten and Tibshirani (2009) that uses a regularized estimate of the inverse covariance matrix of the features in the estimation of regression coefficients for high-dimensional settings. The general idea is to apply a penalty term to the log-likelihood of the data under a multivariate normal model to obtain an estimate of the inverse covariance matrix. The shrunken estimate of the inverse covariance matrix is then used in generalized linear models, linear discriminant analysis, or Cox regression in settings where $p \gg n$. Common regularized regressions such as the lasso, the ridge and the elastic net are special cases of the Scout framework.
Consider the matrix containing both the features and the outcome, $\tilde{X} = (X, y)$, which has dimension $n \times (p + 1)$. Assume that $\tilde{X}$ is multivariate normally distributed with covariance $\Sigma$. The empirical covariance matrix of $\tilde{X}$ is

$$S = \begin{pmatrix} S_{XX} & S_{Xy} \\ S_{Xy}^T & S_{yy} \end{pmatrix}$$

Then $\Sigma^{-1}$ is estimated by maximizing the multivariate normal log-likelihood

$$\log\{\det(\Sigma^{-1})\} - \mathrm{tr}(S \Sigma^{-1}).$$

Let $\Theta = \begin{pmatrix} \Theta_{XX} & \Theta_{Xy} \\ \Theta_{Xy}^T & \Theta_{yy} \end{pmatrix}$ represent the estimate of $\Sigma^{-1}$. A linear regression of the outcome on the features can be obtained by estimating $\Sigma^{-1}$, because the least squares solution is $\beta = (X^T X)^{-1} X^T y = -\Theta_{Xy} / \Theta_{yy}$. However, $S$ is not invertible when $p > n$, so it is necessary to apply a penalty term in the estimation of the regression coefficients.
Under a multivariate normal assumption, zero entries in the inverse covariance matrix of the features correspond to pairs of features that are conditionally independent given all the other variables, while a non-zero element indicates a pair of partially correlated features. It is reasonable to assume that most features are conditionally independent of, or only weakly correlated with, each other given the remaining features when $p > n$. Therefore, we can impose an additional penalty on the log-likelihood of the data to estimate a shrunken inverse covariance. Scout is based on a two-stage maximization procedure for covariance-regularized regression. The first step estimates the inverse covariance matrix of the features, $\hat{\Theta}_{XX}$, by maximizing

$$\log\{\det(\Theta_{XX})\} - \mathrm{tr}(S_{XX} \Theta_{XX}) - \lambda_1 \|\Theta_{XX}\|_{p_1}. \quad (1.9)$$

The second step computes $\hat{\Theta}$, the inverse covariance of $\tilde{X}$, by plugging $\hat{\Theta}_{XX}$ into the top-left block of $\hat{\Theta}$ and then maximizing

$$\log\{\det(\Theta)\} - \mathrm{tr}(S\Theta) - \frac{\lambda_2}{2} \|\Theta\|_{p_2}. \quad (1.10)$$

The regression coefficients are then obtained as $\hat{\beta} = -\hat{\Theta}_{Xy} / \hat{\Theta}_{yy}$. In the two Scout criteria (1.9) and (1.10), $\|\cdot\|_{p_1}$ and $\|\cdot\|_{p_2}$ denote $\ell_p$-norm penalties of a matrix, where $p = 1$ gives the sum of absolute values of all entries and $p = 2$ the sum of squared entries, and $\lambda_1, \lambda_2 \geq 0$ are tuning parameters that control the magnitude of shrinkage in each step. The first step estimates the inverse covariance matrix of the features only, while the second step combines the response variable with the features to estimate the regression coefficients.
Some well-known regularized regressions are special cases of the Scout framework. For example, when $\lambda_1 = 0$ and $\lambda_2 = 0$, the Scout method corresponds to the ordinary least squares formulation. If $\lambda_1 = 0$, Scout is equivalent to a lasso regression for $p_2 = 1$ and to a ridge regression for $p_2 = 2$. The elastic net also belongs to the Scout family, with $p_1 = 2$ and $p_2 = 1$.

The Scout method can be used to extend linear discriminant analysis to high-dimensional classification. Scout provides a regularized estimate of the inverse within-class covariance matrix, from which a classification rule for high-dimensional settings can be built through linear discriminant analysis. For example, when $\lambda_2 = 0$ and $p_1 = 1$, Scout corresponds to the graphical lasso, a method for estimating a sparse inverse covariance matrix. Beyond regression and classification, the Scout method can also be applied to generalized linear models and to Cox regression for survival analysis. In the next chapter, we discuss an extension of the Scout LDA method with $\lambda_2 = 0$ and $p_1 = 1$ for high-dimensional classification problems, and compare it to Scout LDA, lasso penalized logistic regression and lasso penalized discriminant analysis.
1.3.3 Penalized Fisher’s linear discriminant
The lasso penalized discriminant analysis (Mai and Zou, 2012) and the Scout LDA (Witten and Tibshirani, 2009) extend LDA to high-dimensional classification by imposing additional penalty terms on the least squares criterion and on the log-likelihood of the data under a multivariate normal model, respectively. Both approach the problem of extending LDA to high dimensions through its connection to linear regression. The penalized Fisher's discriminant (Witten and Tibshirani, 2011) takes a different approach, extending LDA to high-dimensional settings through Fisher's linear discriminant.
When p < n, Fisher's linear discriminant estimates the classification coefficients by maximizing the between-class variance and minimizing the within-class variance. For an outcome with K classes, the kth discriminant vector \hat{\beta}_k of Fisher's linear discriminant is estimated by

\max_{\beta_k} (\beta_k^T \hat{\Sigma}_b \beta_k) subject to \beta_k^T \hat{\Sigma}_w \beta_k \le 1, \beta_k^T \hat{\Sigma}_w \beta_i = 0 \forall i < k   (1.11)

where \hat{\Sigma}_b is the estimate of the between-class covariance and \hat{\Sigma}_w is the estimate of the within-class covariance. Let \hat{\mu}_k denote the empirical mean vector of the features in class k. The within-class covariance can be estimated by

\hat{\Sigma}_w = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in C_k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T,

where C_k denotes the indices of the observations in class k. The estimate of the between-class covariance is

\hat{\Sigma}_b = \frac{1}{n} X^T X - \hat{\Sigma}_w = \frac{1}{n} X^T Y (Y^T Y)^{-1} Y^T X,

where Y is an n \times K matrix whose entries Y_{ik} indicate whether the ith observation is in class k. When p > n, \hat{\Sigma}_w is singular, so the estimate of the discriminant vectors is not unique. Some existing methods modify Fisher's discriminant problem (1.11) by enforcing a positive definite estimate \tilde{\Sigma}_w of the within-class covariance in high-dimensional settings. The problem then becomes

\max_{\beta_k} (\beta_k^T \hat{\Sigma}_b \beta_k) subject to \beta_k^T \tilde{\Sigma}_w \beta_k \le 1, \beta_k^T \tilde{\Sigma}_w \beta_i = 0 \forall i < k.   (1.12)

Different types of positive definite estimates \tilde{\Sigma}_w are suggested in Xu et al. (2009), Bickel and Levina (2004), Dudoit et al. (2001), Krzanowski et al. (1995) and Friedman (1989).
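With a diagonal positive definite estimate \tilde{\Sigma}_w (the choice used by the penalized Fisher's discriminant below), the substitution \tilde{\beta}_k = \tilde{\Sigma}_w^{1/2}\beta_k turns (1.12) into a standard symmetric eigenproblem. A minimal sketch, under the assumption of that diagonal estimate (the function name is our own, not from the cited papers):

```python
import numpy as np

def fisher_directions(X, y, n_components=1, eps=1e-6):
    """Discriminant vectors for the modified Fisher problem (1.12)
    with a diagonal within-class covariance estimate: substituting
    beta = D^{-1/2} v reduces it to a standard eigenproblem."""
    n, p = X.shape
    Sw = np.zeros((p, p))
    for k in np.unique(y):
        Xc = X[y == k] - X[y == k].mean(axis=0)
        Sw += Xc.T @ Xc
    Sw /= n
    Xc = X - X.mean(axis=0)
    Sb = Xc.T @ Xc / n - Sw                  # between-class covariance
    d = 1.0 / np.sqrt(np.diag(Sw) + eps)     # D^{-1/2} for diagonal PD estimate
    M = (d[:, None] * Sb) * d[None, :]       # D^{-1/2} Sb D^{-1/2}
    evals, evecs = np.linalg.eigh(M)         # ascending eigenvalues
    V = d[:, None] * evecs[:, ::-1]          # back-transform, descending order
    return V[:, :n_components]
```

The top eigenvectors of the whitened between-class covariance, mapped back through D^{-1/2}, are the discriminant directions.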
The penalized Fisher's discriminant (Witten and Tibshirani, 2011) further modifies problem (1.12) by introducing a penalty term on the estimates. The kth discriminant vector \hat{\beta}_k is estimated by

\max_{\beta_k} (\beta_k^T \hat{\Sigma}_b^k \beta_k - P_k(\beta_k)) subject to \beta_k^T \tilde{\Sigma}_w \beta_k \le 1   (1.13)

where \hat{\Sigma}_b^k = \frac{1}{n} X^T Y (Y^T Y)^{-1/2} Q_k^{\perp} (Y^T Y)^{-1/2} Y^T X. When k = 1, Q_1^{\perp} is the identity matrix; when k > 1, Q_k^{\perp} is an orthogonal projection matrix onto the space orthogonal to (Y^T Y)^{-1/2} Y^T X \hat{\beta}_i for all i < k. \tilde{\Sigma}_w is the positive definite estimate \tilde{\Sigma}_w = \mathrm{diag}(\hat{\sigma}_1^2, \ldots, \hat{\sigma}_p^2), which retains only the diagonal elements of \hat{\Sigma}_w. P_k(\beta_k) is a convex penalty function on the kth discriminant vector. The solution to problem (1.13) corresponds to the penalized LDA-L_1 method when P_k(\beta_k) is a lasso penalty and to the penalized LDA-FL method when P_k(\beta_k) is a fused lasso penalty (Tibshirani et al., 2005). The penalized LDA-FL method is applicable only when the important features have a linear ordering. Because it exploits this type of ordering among the features, the penalized LDA-FL method performs better than the penalized LDA-L_1 method in simulation studies under a linear ordering of the features.

The penalized Fisher's discriminant is closely connected to penalized principal components analysis (7), because Fisher's discriminant problem can be transformed into a standard eigenproblem by substituting \tilde{\beta}_k = \hat{\Sigma}_w^{1/2} \beta_k in problem (1.11). Note that the objective function in problem (1.13) is not concave, so it cannot be solved as a convex optimization problem; instead, a minorization-maximization algorithm (13) is used to obtain numerical solutions.
Chapter 2 Sparse LDA for high-dimensional classification
2.1 Introduction
As described in Chapter 1, linear discriminant analysis (LDA) is a classification method that only applies to data in which the sample size is larger than the number of features. In recent years, the scale of available data has grown to unprecedented levels, and it is now common for the number of features to far exceed the total sample size. To better leverage the benefit of sparsity when estimating coefficients in high-dimensional settings, we propose an extension of an existing high-dimensional classifier based on LDA, the scout LDA (10), applying sparsity-inducing penalties in the estimation of both the inverse covariance matrix of the genomic features and the difference in class means. Specifically, we first use the graphical lasso (11) to obtain a sparse estimate of the inverse covariance matrix, just like the scout LDA. However, we also apply a soft-threshold operator to the vector of differences in class means, to obtain a sparse estimate of the difference. Through simulation studies we show that the sparse LDA using soft-thresholding yields better prediction performance than the scout LDA, and is comparable to competitive high-dimensional classifiers such as the lasso logistic regression and sparse discriminant analysis (lasso LDA).
Suppose X = (x_1, x_2, \ldots, x_p) is an n \times p matrix of data values for p features and n observations. Let y = (y_1, \ldots, y_n) be a binary outcome vector, with y_i \in \{0, 1\} representing the class label of the ith observation. Linear discriminant analysis assumes that the features in the two classes follow multivariate normal distributions with class-specific means but a common covariance matrix: X \mid y = 0 \sim N(\mu_0, \Sigma) and X \mid y = 1 \sim N(\mu_1, \Sigma). As described in Chapter 1, the optimal classifier minimizing the probability of misclassification has the form

(x - (\mu_0 + \mu_1)/2)^T \Sigma^{-1} (\mu_0 - \mu_1) + \log(\pi_0 / \pi_1) = 0.

Linear discriminant analysis estimates \mu_0, \mu_1 and \Sigma by the empirical mean vectors and the pooled empirical covariance. The linear discriminant direction is then given by \hat{\beta} = \hat{\Sigma}^{-1}(\hat{\mu}_0 - \hat{\mu}_1).

However, the empirical covariance matrix is not invertible when the number of features is much larger than the number of observations (p \gg n), so the first challenge in applying linear discriminant analysis to high-dimensional classification problems is estimating the inverse covariance matrix \Sigma^{-1}. Bickel and Levina (2008), Rothman et al. (2008) and Cai et al. (2010) have provided good estimates of the inverse covariance matrix when p \gg n. However, when the number of features is large, even though each component of the population mean vectors may be estimated accurately, the accumulated estimation error can be very large (Fan and Fan, 2008). In this case, it is worth forcing most classification coefficients to zero, regardless of the accuracy of their estimates, as long as their effect sizes are relatively small. To enhance sparsity for a high-dimensional classifier, we propose to use the graphical lasso algorithm and a soft-threshold operator to obtain a sparse inverse covariance matrix and a sparse difference in mean vectors between classes, respectively.
2.2 Methods
2.2.1 Graphical lasso algorithm
Graphical lasso is an algorithm developed by Friedman et al. (2008) that provides a sparse estimate of the inverse of a covariance matrix by applying a lasso penalty to the normal log-likelihood. Assume the continuous variables follow a multivariate normal distribution with mean vector \mu and variance-covariance matrix \Sigma. The idea of the graphical lasso is to shrink many elements to zero by introducing a lasso penalty when maximizing the log-likelihood in the estimation of the inverse covariance matrix \Sigma^{-1}. The graphical lasso algorithm shrinks non-zero entries of small magnitude in the inverse covariance matrix to zero, which means that the corresponding pairs of variables are estimated to be conditionally independent given all the other variables.
Here is how the graphical lasso works. Given n observations from a multivariate normal distribution N(\mu, \Sigma), where \mu is a vector of length p, we estimate the inverse covariance matrix \Theta = \Sigma^{-1} by maximizing the penalized log-likelihood of the data

\hat{\Theta} = \arg\max (\log \det \Theta - \mathrm{tr}(S\Theta) - \rho \|\Theta\|_1)   (2.1)

where \Theta = \Sigma^{-1} is non-negative definite, S is the empirical covariance matrix of the n observations, \det \Theta is the determinant of \Theta and \mathrm{tr}(S\Theta) is the trace of the matrix S\Theta. The lasso penalty term is \|\Theta\|_1 = \sum_{i,j} |\Theta_{ij}|. The constant \rho is a tuning parameter that controls the sparsity of the estimate; its value is determined by k-fold cross-validation.
There are several approaches to solving this problem (14). Instead of focusing on the estimation of \Sigma^{-1}, Banerjee et al. (14) start from the estimation of \Sigma, denoted by W. Consider the partitions of W and S

W = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^T & w_{22} \end{pmatrix}, \quad S = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}^T & s_{22} \end{pmatrix}.

The problem can then be solved by optimizing over each row-and-column pair of W with a block coordinate descent algorithm. Banerjee et al. (14) show that problem (2.1) is equivalent to

\min_{\beta} \left\{ \frac{1}{2} \| W_{11}^{1/2} \beta - b \|^2 + \rho \|\beta\|_1 \right\},   (2.2)

where b = W_{11}^{-1/2} s_{12}. Once (2.2) is solved, w_{12} is obtained as w_{12} = W_{11}\beta. Friedman et al. (11) propose the graphical lasso algorithm to solve (2.2) by using the current estimate of the upper block W_{11}, updating w_{12} and iterating until convergence.
The graphical lasso algorithm starts by initializing W = S + \rho I. The next step is to solve the lasso least squares problem (2.2) for each j = 1, 2, \ldots, p, 1, 2, \ldots, p, \ldots, cycling over the columns. Note that the diagonal of W remains fixed throughout. Each sub-problem takes the inner products W_{11} and s_{12} as input and yields a (p-1)-vector solution \hat{\beta}. The corresponding row and column of W are then filled in with w_{12} = W_{11}\hat{\beta}. The procedure is repeated until convergence.
In general, this algorithm provides a sparse estimate of the inverse covariance matrix of the features. The tuning parameter \rho controls the number of entries shrunken to zero, and therefore the sparsity of the estimate. The computational efficiency of the graphical lasso algorithm is superior to that of competing methods, such as the COVSEL procedure of Banerjee et al. (14) and the Meinshausen-Bühlmann approximation. Considering the importance of sparsity in high-dimensional classification and its computational efficiency, we choose the graphical lasso algorithm to estimate the inverse covariance matrix when computing the LDA discriminant direction in high dimensions. The graphical lasso has been implemented in the R package glasso by Friedman et al. (11). In our implementation, we use the R package glassoFast, a fast version of the graphical lasso based on the algorithm (FORTRAN subroutine) of Sustik and Calderhead (2018).
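Although our implementation is in R, the estimator in (2.1) can be sketched with scikit-learn's graphical_lasso function; the simulated data and penalty value below are illustrative only:

```python
import numpy as np
from sklearn.covariance import graphical_lasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

S = np.cov(X, rowvar=False)   # empirical covariance matrix
# Solve (2.1): maximize log det(Theta) - tr(S Theta) - rho * ||Theta||_1;
# sklearn's alpha plays the role of the tuning parameter rho.
cov_est, theta_hat = graphical_lasso(S, alpha=0.2)

# Many off-diagonal precision entries are shrunken exactly to zero,
# i.e. the corresponding variable pairs are estimated to be
# conditionally independent given the rest.
n_zero = np.sum(theta_hat[~np.eye(p, dtype=bool)] == 0)
```

In practice \rho (alpha) would be chosen by cross-validation rather than fixed as here.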
2.2.2 Soft-thresholding operator
Soft-thresholding is another technique that has been widely used in statistics and
computer science to enhance sparsity. It was proposed by Donoho and Johnstone in 1994 to filter
noise in the wavelet transformation. For a positive parameter \lambda, the soft-threshold operator is a function of t defined as

\mathcal{O}(t, \lambda) := \mathrm{sign}(t)(|t| - \lambda)_+ = \begin{cases} t - \lambda, & t > \lambda \\ 0, & -\lambda \le t \le \lambda \\ t + \lambda, & t < -\lambda \end{cases}   (2.3)

A plot of the soft-threshold function is shown in Figure 2.1.

Figure 2.1. Soft-threshold function in (2.3)
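The operator in (2.3) takes only a few lines of Python, applied elementwise with NumPy:

```python
import numpy as np

def soft_threshold(t, lam):
    """Soft-threshold operator (2.3): sign(t) * max(|t| - lam, 0)."""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)
```

For example, soft_threshold applied to (-2, -0.3, 0, 0.5, 1.7) with \lambda = 0.5 zeroes the middle three entries and pulls the outer two toward zero by 0.5.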
Many properties of the soft-threshold operator and its more traditional alternative, the hard-threshold operator, have been studied extensively. From a model selection perspective, the soft-threshold operator is asymptotically equivalent to subset selection, with the same asymptotic rate of risk. From a prediction modeling perspective, the soft-threshold operator introduces a small amount of bias in exchange for a considerable reduction in variance. This bias-variance tradeoff reduces the total risk, which can be decomposed into variance, bias and noise. These properties make the lasso, which applies the soft-threshold operator, a competitive model selection method compared to subset selection. As mentioned in Tibshirani (1996) (4), the difference between these two selectors is more dramatic when the variables are at least moderately correlated; in this scenario, theoretical results are more ambiguous than in the uncorrelated design.

For lasso regularized regression, an L_1 penalty is added to the residual sum of squares to yield a penalized objective function. The non-differentiability of this objective makes it harder to optimize than ridge regression, which uses an L_2 penalty and has an analytical solution. In the orthogonal design case, the lasso solution takes the form of a soft-threshold function in which estimates of small magnitude are shrunken exactly to zero. This property justifies the lasso as a model selection method.
2.2.3 Sparse LDA: an extension of scout LDA
As described above, the LDA classifier for a binary classification problem in the traditional p < n setting has the hyperplane defined by

(x - (\hat{\mu}_0 + \hat{\mu}_1)/2)^T \hat{\Sigma}^{-1} (\hat{\mu}_0 - \hat{\mu}_1) + \log(n_0 / n_1) = 0   (2.4)

as its classification boundary, where \hat{\mu}_0, \hat{\mu}_1 and n_0, n_1 are the empirical mean vectors of the features and the sample sizes in each class, and \hat{\Sigma}^{-1} is the common (i.e. pooled across the two classes) estimate of the inverse covariance matrix of the features. The most important step in extending linear discriminant analysis to high-dimensional classification is to obtain a discriminant direction through \hat{\beta} = \hat{\Sigma}^{-1}(\hat{\mu}_0 - \hat{\mu}_1). Scout is a family of methods that use a shrunken estimate of the inverse covariance matrix \Sigma^{-1} when predicting the outcome in high dimensions. As introduced in the previous section, the LDA classifier using the graphical lasso estimate of the inverse covariance matrix is a special case of the Scout method of covariance-regularized regression. One way to generalize linear discriminant analysis to high-dimensional settings via the Scout method is to replace \hat{\Sigma}^{-1} in (2.4) with the graphical lasso estimate \hat{\Theta} from (2.1) and to plug in the empirical mean vectors \hat{\mu}_0 and \hat{\mu}_1 directly. We call this approach the scout LDA for convenience. Hence, the discriminant direction of the scout LDA has the form

\hat{\beta}_{\text{scout}} = \hat{\Theta} \cdot (\hat{\mu}_0 - \hat{\mu}_1).   (2.5)
We propose an extension of the scout LDA that further introduces sparsity into the estimate of the difference in population means, \mu_0 - \mu_1. Instead of plugging the difference of the empirical means \hat{\mu}_0, \hat{\mu}_1 directly into the direction formula (2.5), we apply a soft-threshold operator to \hat{\mu}_0 - \hat{\mu}_1 to obtain the sparse estimate

\mathcal{O}(\hat{\mu}_0 - \hat{\mu}_1, \lambda) = \mathrm{sign}(\hat{\mu}_0 - \hat{\mu}_1)(|\hat{\mu}_0 - \hat{\mu}_1| - \lambda)_+   (2.6)

where \lambda \ge 0 is a tuning parameter. With this modification, the discriminant direction of the extended scout LDA becomes

\hat{\beta}_{\text{sparse}} = \hat{\Theta} \cdot \mathcal{O}(\hat{\mu}_0 - \hat{\mu}_1, \lambda),   (2.7)

where \hat{\Theta} = \arg\max (\log \det \Theta - \mathrm{tr}(S\Theta) - \rho \|\Theta\|_1) is the graphical lasso estimate of the inverse covariance matrix of the features. Compared to the scout LDA direction \hat{\beta}_{\text{scout}}, which uses a sparse estimate only for the inverse covariance matrix of the features, the direction \hat{\beta}_{\text{sparse}} in (2.7) combines sparse estimates of both the inverse covariance matrix and the difference in means between classes. We call the proposed classification method based on the discriminant direction (2.7) the sparse LDA.
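Putting the pieces together, the direction (2.7) and the resulting classification rule (2.4) can be sketched in Python. Our implementation is in R with glassoFast; here scikit-learn's graphical_lasso stands in, and the function names are ours:

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def sparse_lda_direction(X, y, rho, lam):
    """Sparse LDA direction (2.7): graphical-lasso precision estimate
    applied to the soft-thresholded difference in class means."""
    X0, X1 = X[y == 0], X[y == 1]
    # Pooled within-class covariance (each class centered at its own mean)
    S = (np.cov(X0, rowvar=False) * (len(X0) - 1)
         + np.cov(X1, rowvar=False) * (len(X1) - 1)) / len(X)
    _, theta = graphical_lasso(S, alpha=rho)                    # sparse precision, (2.1)
    diff = X0.mean(axis=0) - X1.mean(axis=0)
    diff = np.sign(diff) * np.maximum(np.abs(diff) - lam, 0.0)  # soft-threshold, (2.6)
    return theta @ diff                                         # direction, (2.7)

def classify(X, beta, mu0, mu1, n0, n1):
    """Predict class 0 when the discriminant score in (2.4) is positive."""
    score = (X - (mu0 + mu1) / 2) @ beta + np.log(n0 / n1)
    return (score <= 0).astype(int)
```

Both the precision matrix and the mean-difference vector are sparse, so the resulting direction has many exact zeros.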
There are two tuning parameters in the new direction \hat{\beta}_{\text{sparse}}: \rho from the graphical lasso (2.1) and \lambda from the soft-threshold operator (2.6). We apply k-fold cross-validation over a two-dimensional grid of candidate values for \rho and \lambda. The best values of \rho and \lambda are chosen simultaneously as the pair that maximizes the average AUC on the held-out subsamples.
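The two-dimensional grid search can be sketched as follows. To keep the example self-contained and fast, `direction` is a stand-in for the sparse LDA fit (a diagonal precision estimate with a simple rho ridge, plus the soft-thresholded mean difference), not the full graphical lasso procedure:

```python
import numpy as np
from itertools import product
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

def direction(X, y, rho, lam):
    # Stand-in direction: soft-thresholded mean difference scaled by a
    # diagonal precision estimate (feature variances + rho).
    d = X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0)
    d = np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)
    return d / (X.var(axis=0) + rho)

def tune(X, y, rhos, lams, k=5):
    """Choose (rho, lambda) jointly by maximizing average held-out AUC."""
    best, best_auc = None, -np.inf
    for rho, lam in product(rhos, lams):
        aucs = []
        for tr, te in KFold(k, shuffle=True, random_state=0).split(X):
            beta = direction(X[tr], y[tr], rho, lam)
            score = -X[te] @ beta          # larger score -> class 1
            aucs.append(roc_auc_score(y[te], score))
        if np.mean(aucs) > best_auc:
            best, best_auc = (rho, lam), np.mean(aucs)
    return best, best_auc
```

The same loop applies unchanged when `direction` is replaced by the actual graphical-lasso-based fit.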
2.3 Simulation studies
In this section, we conduct a simulation study to evaluate the performance of the sparse LDA and to compare it with the scout LDA (10), lasso penalized discriminant analysis (9) and lasso penalized logistic regression. The direct sparse discriminant analysis is implemented in the R package dsda, available at https://ani.stat.fsu.edu/~mai/research.html. The lasso penalized logistic regression is implemented in the R package glmnet. Both the scout LDA and the sparse LDA are implemented using the R package glassoFast to estimate a sparse inverse covariance matrix of the classification features. All simulation analyses were performed in R version 3.4.0.
2.3.1 Simulation 1: Effect size, correlation, dimensionality
We randomly generate a training set of 200 samples from two multivariate normal distributions with different mean vectors and a common covariance matrix. Each sample is a feature vector of p dimensions. Among them, 100 samples generated from N(0_p, \Sigma) are labeled Y = 0 and another 100 generated from N(\Sigma\beta_{\text{true}}, \Sigma) are labeled Y = 1. We set the two class mean vectors to 0_p and \Sigma\beta_{\text{true}} so that the true discriminant coefficients are \beta_{\text{true}}. \Sigma is a block-diagonal matrix of five blocks of the form \Sigma = \mathrm{BlockDiag}(\Sigma_B, \ldots, \Sigma_B), where \Sigma_B is a square matrix of dimension p/5. In this way, we partition the p features into five groups, each with the same correlation structure \Sigma_B. Features within a group may be correlated with each other, while there is no correlation between features from different groups. In the simulation studies, we consider two types of structures for \Sigma_B: an autoregressive AR(1) structure and a uniform correlation model.
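This data-generating process can be sketched in Python (the simulations themselves were run in R; the parameter values below follow setting 1):

```python
import numpy as np
from scipy.linalg import block_diag

def simulate(n_per_class, p, rho, beta_true, rng):
    """Two classes N(0, Sigma) and N(Sigma @ beta_true, Sigma),
    with Sigma block-diagonal: five AR(1) blocks of size p/5."""
    b = p // 5
    idx = np.arange(b)
    ar1 = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1) block
    Sigma = block_diag(*([ar1] * 5))
    mu1 = Sigma @ beta_true                            # class-1 mean
    X0 = rng.multivariate_normal(np.zeros(p), Sigma, size=n_per_class)
    X1 = rng.multivariate_normal(mu1, Sigma, size=n_per_class)
    return np.vstack([X0, X1]), np.repeat([0, 1], n_per_class)

rng = np.random.default_rng(0)
p = 400
beta_true = 0.3 * np.concatenate([[1.8, 3.2, -2, -2.3, 2.6], np.zeros(p - 5)])
X, y = simulate(100, p, 0.8, beta_true, rng)
```

Because the class-1 mean is \Sigma\beta_{\text{true}}, the Bayes-optimal discriminant direction \Sigma^{-1}(\mu_1 - \mu_0) equals \beta_{\text{true}} by construction.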
An independent test set of 10,000 samples is also generated from the above two distributions, with balanced class sizes. We use the training set to fit models and the test set to evaluate prediction performance. After a classifier is developed on the training set, an external AUC is calculated on the test set to assess the performance of each model. We fix the sample size at n = 200 and consider different scenarios with regard to correlation, effect size and sparsity, using the eleven simulation settings shown in Table 2.1.
Table 2.1. Simulation settings

Setting   p     \Sigma_B                                           \beta_true
1         400   \Sigma^B_{ij} = 0.8^{|i-j|}                        0.3(1.8, 3.2, -2, -2.3, 2.6, 0_{p-5})
2         400   \Sigma^B_{ij} = 0.8^{|i-j|}                        0.4(1.8, 3.2, -2, -2.3, 2.6, 0_{p-5})
3         400   \Sigma^B_{ij} = 0.8^{|i-j|}                        0.5(1.8, 3.2, -2, -2.3, 2.6, 0_{p-5})
4         400   \Sigma^B_{ij} = 0.5^{|i-j|}                        0.3(1.8, 3.2, -2, -2.3, 2.6, 0_{p-5})
5         400   \Sigma^B_{ij} = 0.2^{|i-j|}                        0.25(1.8, 3.2, -2, -2.3, 2.6, 0_{p-5})
6         400   \Sigma^B_{ij} = 0.8^{|i-j|}                        0.22(1.8, 0_{79} | 3.2, 0_{79} | -2, 0_{79} | -2.3, 0_{79} | 2.6, 0_{79})
7         400   \Sigma^B_{ij} = 0.5^{|i-j|}                        0.22(1.8, 0_{79} | 3.2, 0_{79} | -2, 0_{79} | -2.3, 0_{79} | 2.6, 0_{79})
8         400   \Sigma^B_{ij} = 0.2^{|i-j|}                        0.22(1.8, 0_{79} | 3.2, 0_{79} | -2, 0_{79} | -2.3, 0_{79} | 2.6, 0_{79})
9         600   \Sigma^B_{ij} = 0.8^{|i-j|}                        0.3(1.8, 3.2, -2, -2.3, 2.6, 0_{p-5})
10        800   \Sigma^B_{ij} = 0.8^{|i-j|}                        0.3(1.8, 3.2, -2, -2.3, 2.6, 0_{p-5})
11        400   \Sigma^B_{jj} = 1, \Sigma^B_{ij} = 0.8 (i \ne j)   0.3(1.8, 3.2, -2, -2.3, 2.6, 0_{p-5})
The first setting is a baseline scenario. All the other scenarios modify the simulation settings in different ways and will be compared to setting 1. First, we use the AR(1) correlation structure for the matrix \Sigma_B. The true discriminant coefficients are \beta_{\text{true}} = (\beta_1, \ldots, \beta_{400}) = C(1.8, 3.2, -2, -2.3, 2.6, 0_{395}), where C is a constant controlling the effect size of the data. In settings 1-3, we fix the correlation parameter at \rho = 0.8 and consider different effect sizes C = 0.3, 0.4, 0.5. Next, we consider different correlation coefficients \rho = 0.8, 0.5, 0.2 in settings 1, 4 and 5. In the previous settings, since only the first five elements of \beta_{\text{true}} are non-zero, all the truly discriminative features lie in the same correlation group and are correlated with each other. In settings 6-8, we consider a different type of discriminant coefficient vector for the true model. We relocate the non-zero coefficients to five different groups by setting (\beta_1, \ldots, \beta_{400}) = 0.22(1.8, 0_{79} | 3.2, 0_{79} | -2, 0_{79} | -2.3, 0_{79} | 2.6, 0_{79}), so the truly discriminative features are no longer correlated with one another. In settings 9 and 10, we increase the number of features from 400 to 600 and 800, giving the data higher dimensionality. In the last scenario, we change the correlation structure from AR(1) to a uniform correlation model. Simulation results based on 500 replications are summarized in Figures 2.2-2.6.
Figure 2.2. Settings 1-3: Different effect size with C = 0.3, 0.4, 0.5

Figure 2.3. Settings 1, 4 & 5: Different correlation degree with \rho = 0.8, 0.5, 0.2, where discriminative features are correlated

Figure 2.4. Settings 6-8: Different correlation degree with \rho = 0.8, 0.5, 0.2, where discriminative features are uncorrelated

Figure 2.5. Settings 1, 9 & 10: Different sparsity with p = 400, 600, 800

Figure 2.6. Setting 11: Uniform correlation structure
In general, the sparse LDA performs better than the scout LDA in all eleven scenarios. This means that when the graphical lasso is used to estimate the inverse covariance \hat{\Sigma}^{-1} for the direction \hat{\beta} = \hat{\Sigma}^{-1}(\hat{\mu}_0 - \hat{\mu}_1), applying soft-thresholding to \hat{\mu}_0 - \hat{\mu}_1 yields better performance than using the difference of the empirical means directly. The lasso penalized discriminant analysis (Mai et al., 2012) and the lasso penalized logistic regression have very comparable performance to the sparse LDA, and both are better than the scout LDA in all settings.
More specifically, Figure 2.2 shows that the performance of each method improves as the effect size increases. The sparse LDA performs better than the lasso penalized logistic regression and the lasso penalized discriminant analysis in settings 1-3, and the difference between the sparse LDA and the lasso penalized logistic regression does not change with the effect size, while the difference between the scout LDA and the other methods is smaller for larger effect sizes. In Figure 2.3 we can observe the effect of increasing the correlation when the truly discriminative features are correlated with each other and with another 75 non-discriminative features in the same group. The sparse LDA shows better performance when the correlation is high (\rho = 0.8) or medium (\rho = 0.5), and its improvement over the lasso penalized logistic regression increases as \rho becomes higher. In the weak-correlation scenario (\rho = 0.2), the sparse LDA, the lasso penalized logistic regression and the lasso penalized discriminant analysis have comparable average performance, but the sparse LDA is more stable than the other two, as its boxplot shows a smaller interquartile range. Figure 2.4 demonstrates similar results for different correlations when the discriminative features are not correlated with each other but are instead each correlated with another 79 non-discriminative features in the same group. The sparse LDA has similar performance to the lasso penalized logistic regression and the lasso penalized discriminant analysis for \rho = 0.2, 0.5 and 0.8, and the difference between the scout LDA and the other three methods becomes smaller as \rho increases. In Figure 2.5, the number of features evidently has very little effect on prediction performance. The sparse LDA outperforms both the lasso penalized logistic regression and the lasso penalized discriminant analysis for p = 400, 600 and 800 when the within-block correlation \Sigma^B_{ij} = 0.8^{|i-j|} is fixed. Figure 2.6 shows the results for the uniform correlation structure. When \Sigma^B_{jj} = 1 and \Sigma^B_{ij} = 0.8 for i \ne j, the sparse LDA does not perform better than the lasso penalized logistic regression. The performance of the scout LDA improves in this setting, but it still does not perform as well as the other three methods.
The only difference between the sparse LDA and the scout LDA is that the sparse LDA uses a soft-threshold operator on the difference of the empirical class means to increase sparsity, while the scout LDA uses the difference of the empirical means directly to compute the classification direction. Both methods use the graphical lasso algorithm to obtain a sparse estimate of the inverse covariance matrix. Nevertheless, the sparse LDA has substantially better performance than the scout LDA in all simulation settings. For example, in simulation setting 1, the median test AUC is 0.62 for the scout LDA but increases to 0.75 for the sparse LDA. In most scenarios, there is more than a 20% improvement in test AUC over the scout LDA. We further compare the estimation error, bias and variance of the discriminant coefficients between the scout LDA and the sparse LDA. The results for setting 1 are shown in Figures 2.7-2.9.

First, the overall average estimation error is much higher for the scout LDA than for the sparse LDA. On average, the estimates of all coefficients in the scout LDA have slightly lower bias but much higher variance than those in the sparse LDA. This high estimation variance in the scout LDA mainly comes from the estimates of the zero coefficients. Since the soft-thresholding function shrinks many elements of the difference vector of class means towards zero, the discriminant coefficients of the sparse LDA have many zero entries as well. The sparse LDA estimates most true zero coefficients exactly as zero and thus has a very small estimation error for the zero coefficients. By contrast, although the scout LDA estimates the zero coefficients with small bias, those estimates are not exactly zero; they are small values close to zero with relatively large variance. When only a small subset of features is truly discriminative in the high-dimensional setting, the estimates of the zero coefficients dominate the overall estimation quality. Therefore, the sparse LDA estimates the classification coefficients much better overall than the scout LDA, which leads to improved prediction performance, even though the estimation error for the non-zero coefficients is lower in the scout LDA than in the sparse LDA.
Figure 2.7. Estimation error, bias and variance of all coefficients

Figure 2.8. Estimation error, bias and variance of non-zero coefficients

Figure 2.9. Estimation error, bias and variance of zero coefficients
2.3.2 Simulation 2: Three forms of correlation
In the simulation studies, we can separate the features into two categories: discriminative features, which have non-zero discriminant coefficients, and non-discriminative features, which correspond to zero discriminant coefficients. Accordingly, the covariance matrix \Sigma of all features involves three forms of correlation with respect to these two types of features: correlation between discriminative features only, correlation between discriminative and non-discriminative features, and correlation between non-discriminative features only. Simulation study 1 showed that correlation is, in general, an important factor in the prediction performance of the sparse LDA, but it did not distinguish correlations between the different types of features and thus could not separate their potential effects on prediction performance. Our goal in this section is therefore to investigate the influence of these three types of correlation, one by one, on the performance of the classification methods in our study.

In this simulation study, a special type of covariance matrix is designed for \Sigma to help us investigate the effects of these three kinds of correlation on prediction performance.
Figure 2.10. Correlation structure used in simulation 2
We consider the simplest case in which correlation coefficients of the same type are equal. The true covariance matrix \Sigma is shown in Figure 2.10, where \rho_1, \rho_2 and \rho_3 are the correlation between discriminative features, the correlation between discriminative and non-discriminative features, and the correlation between non-discriminative features, respectively. In our settings, the first five features are truly discriminative and the following ten non-discriminative features are correlated with them. The remaining p - 15 non-discriminative features are not correlated with the five discriminative features. All p - 5 non-discriminative features are equally correlated with each other. Under these circumstances, we consider the four different triples of values for (\rho_1, \rho_2, \rho_3) listed in Table 2.2.
Table 2.2. Simulation settings

Setting   p     (\rho_1, \rho_2, \rho_3)   \beta_true
12        400   (0.8, 0.1, 0.5)            0.3(1.8, 3.2, -2, -2.3, 2.6, 0_{p-5})
13        400   (0.2, 0.1, 0.5)            0.3(1.8, 3.2, -2, -2.3, 2.6, 0_{p-5})
14        400   (0.8, 0.1, 0.2)            0.3(1.8, 3.2, -2, -2.3, 2.6, 0_{p-5})
15        400   (0.8, 0.2, 0.5)            0.3(1.8, 3.2, -2, -2.3, 2.6, 0_{p-5})
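The structured covariance of Figure 2.10 can be constructed as follows (a sketch; the block sizes of 5 and 10 follow the description above):

```python
import numpy as np

def sim2_covariance(p, rho1, rho2, rho3):
    """Covariance with three correlation types: rho1 among the first 5
    (discriminative) features, rho2 between those and the next 10
    non-discriminative features, rho3 among all p-5 non-discriminative."""
    Sigma = np.zeros((p, p))
    Sigma[5:, 5:] = rho3                        # non-discriminative pairs
    Sigma[:5, :5] = rho1                        # discriminative pairs
    Sigma[:5, 5:15] = rho2                      # cross-type (first 10 only)
    Sigma[5:15, :5] = rho2
    np.fill_diagonal(Sigma, 1.0)
    return Sigma

Sigma = sim2_covariance(400, 0.8, 0.1, 0.5)     # setting 12
```

The remaining p - 15 non-discriminative features stay uncorrelated with the discriminative block, as specified in the text.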
Across settings 12-15, we fix the number of features p and the true discriminant coefficients \beta_{\text{true}} to be the same as in setting 1. The first scenario, with (\rho_1, \rho_2, \rho_3) = (0.8, 0.1, 0.5), is the baseline. We then modify the values of \rho_1, \rho_3 and \rho_2 in settings 13, 14 and 15, respectively. By comparing settings 13-15 to setting 12, we can tell which type of correlation plays the more important role in prediction performance. Simulation results based on 500 replications are summarized in Figure 2.11.
Figure 2.11. Effect of the three types of correlation (\rho_1, \rho_2, \rho_3) on the prediction performance of high-dimensional classifiers
In Figure 2.11, the lasso penalized logistic regression, the lasso penalized discriminant analysis and the sparse LDA outperform the scout LDA across all scenarios. The two top panels illustrate that the correlation among discriminative features, \rho_1, heavily affects the performance of the sparse LDA. When the discriminative features are highly correlated (\rho_1 = 0.8), the sparse LDA performs better than the lasso logistic regression; however, it performs slightly worse than the lasso logistic regression when the discriminative features are weakly correlated (\rho_1 = 0.2). The two bottom panels show results very similar to the top-left panel: both the correlation between non-discriminative features (\rho_3) and the correlation between discriminative and non-discriminative features (\rho_2) have little impact on the prediction performance of the sparse LDA.
2.3.3 Simulation 3: Robustness
Unlike logistic regression, LDA assumes that the features in the different classes are normally distributed with a common covariance matrix. In this simulation study, we evaluate the robustness of the sparse LDA to violations of the common covariance assumption. We therefore simulate two balanced classes of features from multivariate normal distributions with different means and different covariance matrices, N(0_p, \Sigma_1) and N(\Sigma_1\beta, \Sigma_2). The covariances \Sigma_1 and \Sigma_2 are block-diagonal matrices of five blocks of the form \Sigma_k = \mathrm{BlockDiag}(\Sigma_k^B, \ldots, \Sigma_k^B), k = 1, 2, where \Sigma_k^B is a square matrix with the AR(1) correlation structure \Sigma_{k,ij}^B = \rho_k^{|i-j|}, k = 1, 2, i, j = 1, \ldots, p/5. The total sample size n = 200, the number of features p = 400 and \beta = 0.3(1.8, 3.2, -2, -2.3, 2.6, 0_{p-5}) are all fixed. We fix the first-class correlation coefficient at \rho_1 = 0.8 and consider three values of the second-class correlation coefficient, \rho_2 = 0.8, 0.5, 0.2. The simulation settings are listed in Table 2.3. Simulation results from 500 replications are summarized in Figure 2.12.
Table 2.3. Simulation settings

Setting   1st-class \rho_1   2nd-class \rho_2   \Delta\rho   \beta
1         0.8                0.8                0            0.3(1.8, 3.2, -2, -2.3, 2.6, 0_{p-5})
16        0.8                0.5                0.3          0.3(1.8, 3.2, -2, -2.3, 2.6, 0_{p-5})
17        0.8                0.2                0.6          0.3(1.8, 3.2, -2, -2.3, 2.6, 0_{p-5})
Figure 2.12 demonstrates that, when the two classes of features have the same covariance structure but different correlation coefficients, the sparse LDA performs only slightly worse even when the difference in correlations is as large as 0.6, and its performance is not noticeably affected by a small difference in correlation between classes. The scout LDA is less robust than the sparse LDA as the difference in correlations increases. Penalized regression methods, such as the lasso logistic regression and the lasso discriminant analysis, are relatively more robust to the difference in correlation between classes. Nevertheless, the sparse LDA still outperforms the other high-dimensional classifiers, including the lasso penalized logistic regression, even when one class of features is highly correlated (\rho_1 = 0.8) while the other is weakly correlated (\rho_2 = 0.2). This simulation study therefore illustrates that the sparse LDA is robust to the common covariance assumption: it performs consistently well across scenarios in which the two classes share the same covariance structure but have different correlation coefficients.
Figure 2.12. Effect of the difference in correlation between classes on the prediction
performance of high-dimensional classifiers
2.4 Application: Breast Cancer Survival Prediction on METABRIC
Dataset
We applied sparse LDA and compared it to scout LDA, lasso logistic regression and
lasso LDA on a gene expression dataset of breast cancer patients from the Molecular Taxonomy of
Breast Cancer International Consortium (METABRIC) study, which is available at the European
Genome-Phenome Archive under accession EGAS00000000083. This dataset contains
cDNA microarray profiling of breast cancer specimens that were processed on the Illumina
HT-12 v3 platform. Among all specimens, 997 samples were grouped as a discovery set and another
995 samples composed a validation set. In our analysis, we fit models on the discovery set and then
evaluate their performance on the validation set. In addition to gene expression, patients'
long-term clinical outcomes and pathological variables are also available in this dataset.
Considering that the main outcome to predict should be binary, we dichotomized patients'
survival time at 5 years. Therefore, we can predict whether breast cancer patients will survive longer
than 5 years after undergoing treatment, given their gene expression levels. The normalized
gene expression values of primary tumors were used as predictor variables for each model. After
dichotomizing patients' survival time, we excluded censored observations from the analysis.
Consequently, 803 and 839 samples remained in the discovery and validation sets, respectively.
Patients in the METABRIC study underwent chemotherapy only if they had estrogen
receptor (ER) negative tumors; those with ER-positive tumors did not. First, we consider all
tumors together in the analysis, without making a distinction between tumor subtypes.
However, considering that ER-negative breast cancers are difficult to treat and that ER-positive
subtypes occur more frequently, we then restricted the analysis to patients with ER-positive
tumors. In addition, the subtype of human epidermal growth factor receptor 2 (HER2) positive
tumors is usually associated with poor clinical outcomes, so we excluded HER2-positive patients
from the analysis as well. After eliminating the ER-negative and HER2-positive subtypes, there are
finally 594 samples in the discovery set and 564 samples in the validation set. We chose the top 1,000
probes with the highest variance as predictors and fit all models on them.
To obtain a stable measurement of prediction performance, we applied k-fold
repeated cross-validation to optimize the values of the tuning parameters on a two-dimensional
grid. To be more specific, we randomly split the discovery set into k folds and repeated this
process multiple times. We first obtained an average hold-out AUC over the k folds under each split
and then averaged the results over splits. The optimal pair of values is the point on the grid where
the averaged hold-out AUC is maximized. Finally, we computed an ultimate AUC on
the validation set given the optimized parameters. This repeated cross-validation
strategy removes the randomness in the results caused by the splitting into k folds in the cross-
validation process.
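The repeated cross-validation scheme above can be sketched generically. This is a minimal illustration, not the exact code used in the analysis; `fit_score` is a hypothetical placeholder for any routine that fits a classifier (e.g., sparse LDA) on the training fold at a given (ρ, λ) and returns the hold-out AUC:

```python
import numpy as np

def repeated_cv_grid_search(fit_score, X, y, rho_grid, lam_grid,
                            k=5, n_repeats=10, seed=0):
    """Pick the (rho, lam) pair maximizing the hold-out score averaged
    over k folds and over repeated random splits of the data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    mean_score = np.zeros((len(rho_grid), len(lam_grid)))
    for _ in range(n_repeats):
        idx = rng.permutation(n)                 # a fresh random split
        folds = np.array_split(idx, k)
        for test_idx in folds:
            train_idx = np.setdiff1d(idx, test_idx)
            for i, rho in enumerate(rho_grid):
                for j, lam in enumerate(lam_grid):
                    mean_score[i, j] += fit_score(
                        X[train_idx], y[train_idx],
                        X[test_idx], y[test_idx], rho, lam)
    mean_score /= (n_repeats * k)                # average over folds and splits
    i, j = np.unravel_index(np.argmax(mean_score), mean_score.shape)
    return rho_grid[i], lam_grid[j]
```

Averaging over repeats before taking the argmax is what removes the dependence on any single random fold assignment.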
When creating a two-dimensional grid to search on, we first generate the maximum
values for both parameters. For the tuning parameter ρ, which is used in the graphical lasso, the
grid maximum is defined as c1 · ρmax, where ρmax = max|Ŝ| and Ŝ is the average of the empirical
covariance matrices of the predictors in the two classes. Likewise, the maximum for λ, the tuning
parameter used in the soft-threshold operator, is c2 · λmax, where λmax = max|μ̂2 − μ̂1| and μ̂1 and μ̂2
are the vectors of class means. Here ρmax and λmax are the smallest values that shrink the
corresponding estimates of the inverse covariance matrix and of the difference between class means
to zero. Then, we compute the minimum values for ρ and λ by multiplying their maximum
values by a ratio, so that ρmin = r1 · c1 · ρmax and λmin = r2 · c2 · λmax. Given those boundaries,
a sequence of equally spaced values between the maximum and the minimum
can be generated for ρ and λ separately. Finally, we obtain a two-dimensional grid based on
the two sequences.
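The grid construction can be written down directly. A minimal sketch under the definitions above (the function name and defaults mirror the constants used in this section, but this is an illustrative reconstruction, not the original code):

```python
import numpy as np

def tuning_grid(S_hat, mu1_hat, mu2_hat,
                c1=0.15, c2=1.0, r1=0.4, r2=0.1, n_points=10):
    """Equally spaced (rho, lambda) sequences between r*c*max and c*max."""
    rho_max = np.max(np.abs(S_hat))              # fully sparsifies the glasso estimate
    lam_max = np.max(np.abs(mu2_hat - mu1_hat))  # zeroes the thresholded mean difference
    rhos = np.linspace(r1 * c1 * rho_max, c1 * rho_max, n_points)
    lams = np.linspace(r2 * c2 * lam_max, c2 * lam_max, n_points)
    return rhos, lams
```

The outer product of the two sequences then forms the two-dimensional search grid.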
In our analysis, we applied 10-times repeated 5-fold cross-validation to optimize the two
tuning parameters in sparse LDA. The two multiplying constants used in the maximum values of
ρ and λ were set as c1 = 0.15 and c2 = 1, and the two ratios that determine the minimum values were
set to r1 = 0.4 and r2 = 0.1. We considered a sequence of 10 candidate values for both ρ and
λ. The AUC values computed on the validation set as a measure of prediction
performance for each method are listed in Table 2.4.1. When fitting the lasso logistic regression and
lasso LDA models, we considered both standardizing and not standardizing the gene
expression data, and reported the best result for each model. For the sparse LDA and scout
LDA methods, we did not standardize gene expressions and always fit both models on the data
on the original scale.
Table 2.4.1. AUC computed on the validation set for lasso logistic regression (lasso logit), lasso
LDA, sparse LDA and scout LDA using the top 1,000 or 20,000 probes, for all tumors and ER+/HER2- subtypes

             N     p        Lasso logit   Lasso LDA   Sparse LDA   Scout LDA
All tumors   803   1,000    0.720         0.719       0.725        0.705
ER+/HER2-    594   1,000    0.687         0.696       0.724        0.689
ER+/HER2-    594   20,000   0.680         0.693       0.738        0.695
The above results show, first, that every method performs better when using all
tumors than when using the subset of ER+/HER2- tumors, with the number of features fixed.
Second, sparse LDA outperforms all the other methods, whether all tumors or
only the ER+/HER2- subtypes are included in the analysis. When analyzing all tumors, lasso logistic
regression and lasso LDA produce very similar AUCs, slightly lower than that of sparse LDA but
higher than the AUC of scout LDA. However, when the analysis is restricted to the ER+/HER2-
subset, sparse LDA performs considerably better than lasso logistic regression, lasso LDA
and scout LDA, and the latter three have almost the same performance.
In addition, we increased the number of features from 1,000 to 20,000 and refit all
models on the expression levels of the top 20,000 probes with the highest variance for the ER+/HER2-
subset. We observed that sparse LDA performs better than when using 1,000 probes, while
there is a small reduction in the performance of lasso logistic regression and lasso LDA. Sparse
LDA achieves an AUC of 0.738, much higher than that of the other methods. Lasso LDA
performs slightly better than lasso logistic regression, with an AUC of 0.693 versus 0.680. Scout
LDA performs similarly to lasso LDA, with an AUC of 0.695.
Therefore, this analysis of the METABRIC data provides a practical example demonstrating
that the proposed method, sparse LDA, can effectively improve the prediction performance of
the scout method. Furthermore, it even outperforms a well-performing existing high-dimensional
LDA method, lasso LDA, as well as the widely used and competitive high-dimensional
classification method, lasso logistic regression.
2.5 Discussion
Scout LDA using the graphical lasso successfully extends traditional linear discriminant
analysis to high-dimensional classification settings. We proposed an extension of scout
LDA that further applies soft-thresholding to the difference between class means to improve
prediction performance. The proposed approach, sparse LDA, introduces sparsity in μ̂2 − μ̂1 so
that there is an enhancement in sparsity among the final estimates of the discriminant coefficients.
Since the accumulated estimation error in the discriminant coefficients can be large when the
number of features is high, the sparsity imposed on μ̂2 − μ̂1 leads to a reduction in the overall
estimation error and therefore an improvement in prediction performance.
Simulation results have shown that sparse LDA performs much better than scout
LDA in all scenarios and even demonstrates some improvement over lasso penalized
logistic regression in several scenarios. In simulation study 1, we investigated the influence
of effect size, correlation and dimension on its prediction performance. We further categorized
the correlation into three forms in simulation 2. In general, the effect size and the dimension of the
features have very limited influence on the difference between sparse LDA and lasso logistic
regression, while the correlation of the discriminative features is crucial to prediction performance.
When discriminative features are correlated with each other, sparse LDA performs better than
lasso penalized logistic regression, and the improvement is quite considerable when
the correlation of the discriminative features is strong. If discriminative features are weakly
correlated or uncorrelated, sparse LDA has performance comparable to lasso logistic
regression when the correlation of the non-discriminative features is also weak. Sparse LDA
performs slightly worse than lasso logistic regression when the correlation of the non-discriminative
features is strong while the correlation of the discriminative features is weak. The results of
simulation 3 indicate the robustness of sparse LDA with respect to the assumption that
different classes of features share the same covariance matrix. There is only a small reduction in the
prediction performance of sparse LDA when the two classes of features are differentially
correlated, although penalized regression methods are more robust than sparse LDA in this respect.
In the example of breast cancer survival prediction on the METABRIC data, sparse LDA
consistently outperforms scout LDA, lasso logistic regression and lasso discriminant analysis
under different circumstances.
However, the computational cost of sparse LDA is relatively high compared to
penalized regression methods such as lasso penalized logistic regression and lasso penalized
discriminant analysis, especially when the dimension of the features is high. Even though the
graphical lasso is a remarkably fast algorithm for estimating the inverse covariance matrix, the
sparse LDA classifier, which depends on the inverse covariance matrix, requires much more
computational time than penalized linear or logistic regression. In practice, if prior
information indicates high correlation among potential discriminative features and the number of
features in the dataset is within computational reach, then sparse LDA is recommended
over lasso penalized logistic regression and lasso penalized discriminant analysis.
Chapter 3 Incorporating external information
for sparse LDA
3.1 Introduction
With the development of sequencing techniques, genomic data that provides information on a
huge number of genes across the whole genome has become available, providing opportunities to
refine the prediction of cancer-related outcomes using gene expression data. Many studies have
applied diverse classification techniques to different types of cancer, such as
prostate cancer (15, 16), lung cancer (17) and colorectal cancer (18). However, along with
the overwhelming size of genomic data comes the challenge of high dimensionality, where the
number of genes is much larger than the number of available individuals in a study population.
Because the number of features is much larger than the sample size in gene expression data,
regularized regression methods, such as lasso (4), ridge (3) and elastic net (5), and extensions
of classical classification methods, such as sparse discriminant analysis (9) and the scout family
of methods (10), have been developed for high-dimensional classification problems. However, these
are all general-purpose classification techniques that do not take advantage of the large amount of
data on cancers and gene expression that has become available on public platforms such as The
Cancer Genome Atlas (TCGA). A central question is thus how to incorporate those publicly
available datasets as external information in gene expression classification problems to improve
prediction accuracy.
In this chapter, we focus on incorporating prior information from publicly available
studies such as TCGA, which contains gene expression data for the most common
cancers, into the sparse LDA framework proposed in Chapter 2. In Chapter 5 we will discuss
incorporating another type of external information, gene annotations, through a regularized
hierarchical regression framework. The Cancer Genome Atlas (TCGA) is a publicly accessible
platform that provides molecular data sets for comprehensive and multi-dimensional maps of the
key genomic changes in 33 types of cancer. The TCGA dataset, which describes tumor tissues
and matched normal tissues from more than 11,000 patients, has been used widely by the
research community and has contributed to more than a thousand studies of cancer by
independent researchers and to the TCGA research network publications.
In a gene expression classification problem, such as predicting whether breast cancer
patients will survive more than five years or whether prostate cancer patients will experience
clinical recurrence after treatment, we usually apply a supervised classification model to a
gene expression dataset that has been labeled with the outcome variable to predict. The
TCGA gene expression data, however, does not provide values for the outcome variable of
interest and thus lacks class labels. Considering that LDA assumes features in different
classes have a common covariance matrix but different population means, the general idea of this
chapter is to use TCGA gene expressions in the estimation of the inverse covariance matrix, which
is an interim quantity required in the sparse LDA proposed in the previous chapter,
rather than in the class means. To be more specific, we modify the objective function
used in the graphical lasso to estimate the inverse covariance matrix in two different ways. In the
first approach, we use a weighted average of the empirical covariance matrices from the internal
and external data in the graphical lasso algorithm to estimate a sparse inverse covariance. In the
second approach, we first obtain a sparse estimate of the inverse covariance from the external data
using the graphical lasso; then we add a penalty term to the objective function to
penalize the difference between the final estimate and the estimate from the external dataset. In this
approach, we consider two types of penalties, the ℓ1- and ℓ2-norm, for the additional term. Because
the modified objective functions in both approaches are convex, they can be formalized
as convex optimization problems and solved directly through existing convex optimization packages.
We evaluate these two approaches in simulation studies and find no
improvement in prediction performance for either method. We then study the estimation
error in the coefficients to investigate the potential room for improvement in
performance. The results show that sparsity plays a much more crucial role than accuracy when
the dimensionality is high: a more accurate estimate of the inverse covariance matrix does not
necessarily result in a better classification model for high-dimensional data.
3.2 Convex optimization problem
Convex optimization is a specific class of optimization problems with prevalent
applications in many areas, such as data science, structural optimization, finance, automatic
control systems and statistics. The principal advantage of convex minimization problems is that
the objective function to minimize is convex, so that every local minimum of the function must
be a global minimum. Because of this property of convex functions, the optimization
problem can be solved efficiently. With the development of optimization theory and methods, it
is often feasible and beneficial to formulate a mathematical optimization problem as a convex
minimization problem. The well-known method of least squares and linear programming are
examples of convex minimization problems.
A convex optimization problem is defined as a problem minimizing a convex function
over a convex set, which is of the form
minimize f(x)
subject to g_i(x) ≤ 0, i = 1, …, m    (3.1)

where the functions f, g1, …, gm: ℝ^n → ℝ are convex. A function f: ℝ^n → ℝ is convex if it satisfies

f(t·x1 + (1 − t)·x2) ≤ t·f(x1) + (1 − t)·f(x2)    (3.2)

for all x1, x2 ∈ ℝ^n and all t ∈ [0, 1]. Geometrically, the line segment between (x1, f(x1)) and
(x2, f(x2)) lies above or on the graph of f. An example of a simple convex function f: ℝ → ℝ
is shown in Figure 3.1. For a convex function f, the function −f is concave. Therefore, a
problem maximizing a concave objective function can be reformulated as a convex minimization
problem.
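The defining inequality (3.2) can also be checked numerically on random points. A minimal sketch (the test functions below are arbitrary illustrative choices; passing such a random test suggests but does not prove convexity, while a single failure disproves it):

```python
import numpy as np

def is_convex_on_samples(f, n_dim=1, n_trials=2000, scale=5.0, seed=0):
    """Randomly test the convexity inequality
    f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        x1 = rng.uniform(-scale, scale, n_dim)
        x2 = rng.uniform(-scale, scale, n_dim)
        t = rng.uniform()
        # small tolerance guards against floating-point noise
        if f(t * x1 + (1 - t) * x2) > t * f(x1) + (1 - t) * f(x2) + 1e-9:
            return False
    return True
```

For example, x ↦ ‖x‖² passes the check, while an oscillating function such as sin(3x) fails it quickly.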
Figure 3.1 Graph of a convex function 𝒇 : 𝓡 → 𝓡
Convex functions have several useful properties. The most important is that a local
minimum of a convex function is also a global minimum. In general, the convex
optimization problem (3.1) does not have an analytical solution, but it can be solved
efficiently with various approaches because of convexity. For example, interior-point
methods can usually solve a convex optimization problem in 10 to 100 iterations to a
reasonable level of accuracy. Developed convex optimization algorithms include, but are
not limited to, unconstrained minimization methods such as gradient descent, steepest descent
and Newton's method; equality constrained minimization; and interior-point methods,
which include the barrier method, the feasibility and phase I methods, and primal-dual
interior-point methods.
Software is available for convex optimization. For example, CVX is a MATLAB-
based convex modeling framework developed by Michael Grant and Stephen Boyd
(http://cvxr.com), and CVXPY is Python-embedded and implemented by Steven Diamond
(http://www.cvxpy.org/en/latest/). Both CVX and CVXPY are interfaces to open source
convex optimization solvers such as SeDuMi, SDPT3, ECOS and CVXOPT.
3.3 Incorporating external information in inverse covariance
estimation
In this section, we incorporate TCGA gene expression data as external information into the
sparse LDA through a regularized risk minimization problem. The sparse LDA classifier
discussed in Chapter 2 has discriminant direction β̂_sparse = Θ̂ · 𝒪(μ̂2 − μ̂1, λ), where Θ̂ is a
sparse estimate of the inverse covariance matrix Θ = Σ^{−1} from the graphical lasso and
𝒪(μ̂2 − μ̂1, λ) is the soft-threshold function applied to the difference of the empirical mean
vectors between classes. Since the external information we would like to incorporate does not
contain class labels for the observations, it is not feasible to incorporate this type of external
information into μ̂2 − μ̂1. However, linear discriminant analysis assumes that features in different
classes have a common covariance matrix, so we can take advantage of the external data in the
estimation of the inverse covariance matrix, even though the two classes of features are
mixed together.
The estimation of Θ̂ in (2.1) can be reformulated as a convex optimization problem that
minimizes the loss function

−log det Θ + tr(SΘ) + ρ‖Θ‖1    (3.3)

over non-negative definite matrices, where S is the empirical covariance matrix. The ℓ1-norm
penalty shrinks most components of Θ̂ to zero, introducing sparsity into the estimate. We
consider two approaches to incorporate external information into the inverse covariance estimation
based on (3.3).
The first approach is to merge the original data and the external data to obtain a pooled
empirical covariance matrix. Suppose S1 and S2 are the empirical covariance matrices of the two
datasets. The pooled empirical covariance is

S_p = ((n1 − 1) / (n1 + n2 − 2)) · S1 + ((n2 − 1) / (n1 + n2 − 2)) · S2    (3.4)

where n1 and n2 are the numbers of observations in the two datasets. Then, problem (3.3) can be
modified by replacing S with the pooled covariance S_p, which is to minimize

−log det Θ + tr(S_p Θ) + ρ‖Θ‖1    (3.5)
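The pooled covariance (3.4) is straightforward to compute. A minimal sketch (function name chosen here for illustration):

```python
import numpy as np

def pooled_covariance(X1, X2):
    """Pooled empirical covariance of two datasets, as in eq. (3.4)."""
    n1, n2 = X1.shape[0], X2.shape[0]
    S1 = np.cov(X1, rowvar=False)   # uses the (n - 1) denominator
    S2 = np.cov(X2, rowvar=False)
    w1 = (n1 - 1) / (n1 + n2 - 2)
    w2 = (n2 - 1) / (n1 + n2 - 2)
    return w1 * S1 + w2 * S2
```

The weights reduce to 1/2 each when the two datasets are of equal size, and the external dataset dominates when it is much larger.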
The second approach uses the external data to estimate the inverse covariance matrix
directly. Usually, the external dataset has more observations than the original one, so we can
assume that the external estimate of the inverse covariance is more accurate than the original
estimate. Suppose Θ_ex is an external inverse covariance matrix, which can also be estimated by
the graphical lasso. We modify the loss function in (3.3) by imposing an ℓ1- or ℓ2-norm penalty
on the difference between Θ and Θ_ex. Then, the optimization problem becomes to minimize

−log det Θ + tr(SΘ) + ρ‖Θ‖1 + γ‖Θ − Θ_ex‖_l,  l = 1, 2    (3.6)
The first penalty, ‖Θ‖1, makes the estimate sparse, while the second, ‖Θ − Θ_ex‖_l, shrinks the
estimate towards Θ_ex. The tuning parameters ρ ≥ 0 and γ ≥ 0 control the weights of the two
effects: sparsity and accuracy.
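For reference, the objective in (3.6) can be evaluated directly in plain NumPy; a minimal sketch (the solver that minimizes it, e.g., via CVXPY, is not shown, and the function name is illustrative):

```python
import numpy as np

def external_glasso_objective(Theta, S, Theta_ex, rho, gamma, l=1):
    """Value of the modified graphical lasso loss (3.6):
    -log det(Theta) + tr(S Theta) + rho*||Theta||_1
    + gamma*||Theta - Theta_ex||_l (elementwise l1 or Frobenius l2)."""
    sign, logdet = np.linalg.slogdet(Theta)
    if sign <= 0:
        return np.inf                  # outside the positive definite cone
    diff = Theta - Theta_ex
    pen = np.abs(diff).sum() if l == 1 else np.sqrt((diff ** 2).sum())
    return -logdet + np.trace(S @ Theta) + rho * np.abs(Theta).sum() + gamma * pen
```

Setting γ = 0 recovers the original graphical lasso loss (3.3).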
Since problems (3.5) and (3.6) are convex, they can be solved using convex
optimization packages such as CVX in MATLAB or CVXPY in Python. Let Θ̂_p and Θ̂_e denote
the solutions of problems (3.5) and (3.6), respectively. Based on the new estimates of the inverse
covariance matrix, we can develop two sparse LDA classifiers incorporating external
information, with discriminant directions β̂_pooled = Θ̂_p · 𝒪(μ̂2 − μ̂1, λ) and β̂_external = Θ̂_e ·
𝒪(μ̂2 − μ̂1, λ). There are two tuning parameters (ρ and λ) in β̂_pooled, while β̂_external has three: ρ
and γ from problem (3.6), and λ from the soft-threshold function 𝒪(μ̂2 − μ̂1, λ) in (2.6).
Five-fold cross-validation is used to select the optimal values at which the average AUC on the
hold-out subsamples is maximized.
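The discriminant directions above combine a precision-matrix estimate with a soft-thresholded mean difference. A minimal sketch (the precision estimate `Theta_hat` is assumed to come from a graphical lasso fit, which is not shown):

```python
import numpy as np

def soft_threshold(x, lam):
    """Elementwise soft-threshold operator O(x, lam)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def discriminant_direction(Theta_hat, mu1_hat, mu2_hat, lam):
    """beta_hat = Theta_hat @ O(mu2_hat - mu1_hat, lam)."""
    return Theta_hat @ soft_threshold(mu2_hat - mu1_hat, lam)
```

The same two lines yield β̂_sparse, β̂_pooled or β̂_external, depending on which precision estimate (Θ̂, Θ̂_p or Θ̂_e) is supplied.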
3.4 Simulation studies
We randomly generate 100 observations as a training dataset from two multivariate
normal distributions 𝒩(0_200, Σ) and 𝒩(Σβ_true, Σ), with n1 = n2 = 50. The mean of the first
class, 0_200, is a vector of zeros of length 200. The covariance matrix is block-diagonal with the
form Σ = BlockDiag(Σ^B, Σ^B, Σ^B, Σ^B, Σ^B), where Σ^B is a 40 × 40 matrix of the form
Σ^B_ij = 0.8^|i−j|. The vector of discriminant coefficients is β_true = 0.32(1.8, 3.2, −2, −2.3, 2.6, 0_195). An
independent test set of 10,000 observations is also generated from the two distributions on a
fifty-fifty basis, and the AUC is calculated on the test set to assess prediction performance. Another
200 observations are randomly generated from 𝒩(0_200, Σ) as an external dataset.
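This simulation design can be sketched directly; a minimal illustration of the covariance construction and the two-class draw (function names are illustrative, and the external dataset would be one extra draw from the first distribution):

```python
import numpy as np

def ar1_block_cov(p=200, block=40, rho=0.8):
    """Block-diagonal covariance with AR(1) blocks: Sigma_B[i, j] = rho^|i-j|."""
    idx = np.arange(block)
    Sigma_B = rho ** np.abs(idx[:, None] - idx[None, :])
    Sigma = np.zeros((p, p))
    for start in range(0, p, block):
        Sigma[start:start + block, start:start + block] = Sigma_B
    return Sigma

def simulate_classes(n_per_class=50, p=200, seed=0):
    """Draw the two classes: N(0, Sigma) and N(Sigma @ beta_true, Sigma)."""
    rng = np.random.default_rng(seed)
    Sigma = ar1_block_cov(p=p)
    beta_true = np.zeros(p)
    beta_true[:5] = 0.32 * np.array([1.8, 3.2, -2.0, -2.3, 2.6])
    X0 = rng.multivariate_normal(np.zeros(p), Sigma, n_per_class)
    X1 = rng.multivariate_normal(Sigma @ beta_true, Sigma, n_per_class)
    return X0, X1, Sigma, beta_true
```

Note that the second-class mean is Σβ_true, so that β_true = Σ^{−1}(μ2 − μ1) is exactly the Bayes discriminant direction.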
Figure 3.1. Effect of incorporating external information in sparse LDA
Simulation results based on 100 replications are shown in Figure 3.1. There is no obvious
improvement in prediction performance after incorporating external information into the inverse
covariance estimation. Classifiers that incorporate external information by minimizing
problem (3.5) or (3.6) perform very similarly to sparse LDA. Using the ℓ1- or ℓ2-norm
for the penalty ‖Θ − Θ_ex‖_l in (3.6) does not appreciably affect the prediction performance.
We considered different scenarios by changing the correlation coefficient, the correlation
structure and the effect size; there was still no obvious improvement in prediction performance
after incorporating external information in these scenarios.
3.5 Room for improvement
In this section, we compare sparse LDA with three specially designed classifiers that
use the population mean vectors, the population inverse covariance matrix, or both, to
investigate the potential room for improvement when incorporating external information in
sparse LDA. Recall that the discriminant direction of sparse LDA is β̂_sparse = Θ̂ ·
𝒪(μ̂2 − μ̂1, λ), where Θ̂ is the graphical lasso estimate of the inverse covariance matrix and
𝒪(μ̂2 − μ̂1, λ) is the soft-threshold function applied to the difference of the empirical mean vectors. We can
develop three classifiers by replacing the estimates with true values for the inverse
covariance matrix and the mean vectors, separately or jointly, in the discriminant direction. First, the
Bayes classifier, which is the gold standard, uses the population parameters for both the inverse
covariance and the mean vectors. Its discriminant direction is β_Bayes = Σ^{−1}(μ2 − μ1), where Σ^{−1} is
the population inverse covariance matrix and μ1, μ2 are the population mean vectors of the features in
the two classes. Next, we consider a classifier that uses the population mean vectors and the
estimated inverse covariance, with direction β̂_inv = Θ̂ · (μ2 − μ1). The last classifier applies
the population inverse covariance and the empirical mean vectors, so it has direction β̂_mean =
Σ^{−1} · 𝒪(μ̂2 − μ̂1, λ).
We compare the prediction performance of the four methods in a simulation study. We
randomly generate 100 samples of 200-dimensional feature vectors from two multivariate
normal distributions. Among them, 50 observations labeled Y = 0 are drawn from
𝒩(0_200, Σ), where 0_200 is a vector of zeros of length 200, and the other 50 observations
labeled Y = 1 are drawn from 𝒩(Σβ_true, Σ). The true model has discriminant coefficients
β_true = (β1, …, β200) = 0.3(3, 1.7, −2.2, −2.1, 2.55, 0_195). The covariance Σ =
BlockDiag(Σ^B, …, Σ^B) is a block-diagonal matrix with five blocks on the diagonal, where Σ^B
has an AR(1) correlation structure of the form Σ^B_ij = 0.8^|i−j|. Prediction performance is
measured by the external AUC, which is obtained on an independent test set of 1,000
observations containing the two classes in a 1:1 ratio. Simulation results
based on 100 replications are shown in Figure 3.2.
Table 3.1. List of classifiers to compare

Method 1   True inverse and means:           β_Bayes = Σ^{−1}(μ2 − μ1)
Method 2   Empirical inverse, true means:    β̂_inv = Θ̂ · (μ2 − μ1)
Method 3   True inverse, empirical means:    β̂_mean = Σ^{−1} · 𝒪(μ̂2 − μ̂1, λ)
Method 4   Empirical inverse and means:      β̂_sparse = Θ̂ · 𝒪(μ̂2 − μ̂1, λ)
Figure 3.2. Prediction performance of four LDA classifiers
The first method uses the true values for both the inverse covariance and the means. This
classifier has the highest test AUC and represents the best prediction performance in theory.
The second classifier, which uses the true means and the estimated inverse covariance, performs
worse than the gold standard but better than the sparse LDA classifier, which uses estimated values
for both parts. The difference between the second and the fourth classifier illustrates that there is
some room for improvement in prediction performance by increasing the estimation accuracy of
the class means μ̂1 and μ̂2 in sparse LDA. However, the external information does not help in the
estimation of the class means, so we are unable to improve the performance of sparse LDA
from this direction.
Notice that the third classifier, which is built from the true inverse covariance and
estimated means, performs worse than the fourth, which uses estimated inverse covariance and means.
This result indicates that when the estimated class means are used (with the soft-threshold operator)
in the discriminant direction, a sparse estimate of the inverse
covariance matrix yields better prediction performance than the exact one. Therefore, a
better estimate of the inverse covariance matrix does not guarantee a better sparse LDA
classifier; the prediction performance can even become worse when the estimate of the inverse
covariance is more accurate. To explore the rationale for this phenomenon, we take a further
look at the estimates of the discriminant coefficients for each method. The estimation
error, bias and variance of the estimates of all coefficients, the non-zero coefficients and the zero
coefficients are shown in Figures 3.3-3.5.
Figure 3.3. Estimation error, bias and variance of all coefficients
Figure 3.4. Estimation error, bias and variance of non-zero coefficients
Figure 3.5. Estimation error, bias and variance of zero coefficients
Figure 3.3 shows the overall estimation error, bias and variance of the estimates across
methods. The first classifier uses the true inverse covariance and the true class means, so the
discriminant coefficients are exactly β_true; the estimation bias, variance and total error in all
coefficients are all zero, reflecting the best classifier in theory. The second classifier, which uses the
true means, has a smaller estimation error than the third and fourth, so its prediction performance
is just below the gold standard. In the third classifier, even though the estimated coefficients
have relatively small bias, the variance of the estimates is huge. This produces a large estimation
error in the coefficients, so that the third classifier is even worse than the fourth, which has
slightly more bias but much lower variance. Therefore, the phenomenon that the
sparse LDA classifier using a sparse estimate of the inverse covariance performs better than the one
using the true inverse covariance can be explained by the bias-variance tradeoff in the estimates: the
sparse estimate of the inverse covariance used in sparse LDA reduces the variance enormously
by introducing a relatively small bias, so the total error in the final coefficient estimates is
decreased. Figures 3.4 and 3.5 demonstrate that the third classifier has a better estimate of the non-
zero coefficients than the fourth, but a much worse estimate of the zero coefficients. In Figure 3.4,
the third classifier, using the true inverse, has more variance in the estimates of the non-zero
coefficients than sparse LDA, but the estimation bias is much higher in sparse LDA, so the total
estimation error for these coefficients is larger in sparse LDA than in the third classifier. Figure 3.5
shows that the bias in the estimates of the zero coefficients is very small across all methods, but the
third classifier has much more estimation variance than the others: using the true inverse leads
to huge variance in the estimates of the zero coefficients. In the high-dimensional setting, most
features are non-predictive, so most coefficients are close to zero, and the large estimation error
in the zero coefficients finally results in a high overall estimation error.
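The decomposition reported in Figures 3.3-3.5 follows the standard identity MSE = bias² + variance, computed from the replicated coefficient estimates. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def error_decomposition(beta_hats, beta_true):
    """Squared bias, variance, and total squared error of coefficient estimates.

    beta_hats: (n_replications, p) array of estimated coefficient vectors.
    Returns per-coefficient averages; total = bias^2 + variance holds exactly.
    """
    mean_est = beta_hats.mean(axis=0)                    # E[beta_hat] per coordinate
    bias_sq = ((mean_est - beta_true) ** 2).mean()       # averaged squared bias
    variance = beta_hats.var(axis=0).mean()              # averaged sampling variance
    total = ((beta_hats - beta_true) ** 2).mean()        # averaged squared error
    return bias_sq, variance, total
```

Restricting `beta_hats` and `beta_true` to the first five columns (the non-zero coefficients) or the remaining columns (the zero coefficients) gives the breakdowns in Figures 3.4 and 3.5.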
3.6 Discussion
In this chapter, we attempted to incorporate external data of features (e.g., TCGA gene
expression data) in the estimation of the inverse covariance matrix for sparse LDA. Two types of
modified objective functions were considered to develop convex optimization problems for the
59
estimation: 1) using the pooled empirical covariance matrix of internal and external data; 2)
adding an additional ℓ
1
- or ℓ
2
-norm penalty for the difference between internal and external
inverse covariance. Both methods have little effect on improving the prediction performance of
the Sparse LDA. In order to assess the potential room for improvement, we compared the sparse
LDA with three theoretical classifiers that use the true Σ⁻¹ and μ₂ − μ₁ partially or completely.
The simulation result shows that the theoretical classifier using the true inverse covariance and
empirical means performs much worse than the sparse LDA, which uses empirical values
throughout. To understand this result, we compared the error, bias and variance in the
estimates of coefficients across the four classifiers. We find that the true inverse covariance,
which is not sparse, can lead to high estimation error when the feature means are estimated,
because the variance in the coefficient estimates is huge. To be more specific, there is
extremely high variance in the estimates of zero coefficients when the classifier is using the true
inverse covariance and estimated means. The sparse estimate of inverse covariance used in the
sparse LDA results in a small increase in bias but a huge reduction in variance, so the overall
estimation error is kept low. Therefore, even though we can improve the estimation accuracy of
the inverse covariance by incorporating external information, a better estimate of Σ⁻¹ does not
guarantee a better classifier when the dimensionality is high. The sparsity of the estimate of the
inverse covariance matrix is much more important than its accuracy when p ≫ n.
Chapter 4 Self-training semi-supervised
algorithms for high-dimensional LDA
4.1 Introduction
In this chapter, we will discuss another approach to incorporate external information in
high-dimensional LDA. As mentioned in the previous chapter, the TCGA program provides
datasets of gene expressions for 33 types of cancers. Our objective is to take advantage of this
kind of publicly available information in gene expression classification problems and thus to
improve the prediction accuracy of our method. Notably, the TCGA data differ from the typical
gene expression datasets used in these analyses because they lack class labels, that is, an
outcome variable to predict. For example, when predicting whether breast cancer patients will
survive more than five years or whether prostate cancer patients will experience clinical
recurrence after treatment, we usually choose a supervised classification model (e.g., lasso
logistic regression) and apply it to a dataset that contains both gene expressions and the main
outcome variable, such as five-year survival status or clinical recurrence status, which indicates
the class label for every sample. The data from the TCGA program, however, only contain gene
expressions for breast or prostate cancers, without indicating whether those patients survived
more than five years or experienced clinical recurrence. The external information we would like
to incorporate from TCGA can therefore be considered unlabeled data, while the set of gene
expressions marked with outcomes, to which a supervised learning method applies, is labeled
data. In other words, our purpose is to use both labeled and unlabeled data to devise a
classification model that can outperform the one developed from
labeled data solely. In machine learning terminology, this type of problem is called
semi-supervised learning, which lies between supervised and unsupervised learning.
In general, semi-supervised learning can be applied to classification, regression and
clustering. The most well-known semi-supervised learning methods include self-training, co-
training, mixture models, graph-based semi-supervised learning and semi-supervised support
vector machines. Self-training is perhaps the simplest semi-supervised learning method.
Therefore, in this chapter, we devise three types of self-training algorithms to incorporate
unlabeled data into high-dimensional LDA models, aiming to improve the prediction
performance of the supervised classifier that depends on labeled data only.
4.2 Methods
The basic idea of self-training is to use a supervised classifier's predictions to teach itself. More
specifically, we first train a prediction function f on the labeled data (X_l, Y_l). We then apply f
to each observation x ∈ X_u of the unlabeled data to predict its outcome. Next, the unlabeled
observations together with their predicted outcomes, (x, f(x)), are added to the labeled data, so
that we can retrain a new prediction function on the updated labeled data and repeat the
procedure. There are, however, several choices in how unlabeled observations and their
predicted labels are added to the labeled data. For example, we can move all unlabeled
observations with their predicted labels to the labeled data at once, or we can add only the few
most confident predictions at each iteration. Considering these variations, we devised three
types of algorithms for self-training high-dimensional LDA.
62
4.2.1 Algorithm 1
In the first self-training algorithm, we use sparse LDA as the supervised learning
classifier. In each iteration, we only pick a few most confident predictions from the unlabeled
data and add them to the labeled data.
Algorithm 1
1. Randomly hold out a small subset of the labeled data (X_l^h, Y_l^h) as a validation set;
2. Train a sparse LDA classifier f on the remaining labeled data (X_l^r, Y_l^r);
3. Apply f to classify the observations in the unlabeled data x ∈ X_u, and sort them by
prediction confidence, measured by their distance to the decision boundary;
4. Add the top ten unlabeled observations with the most confident predictions (x, f(x)) to the
labeled data (X_l^r, Y_l^r);
5. Re-train a sparse LDA classifier f′ on the updated labeled data and validate it on the held-out
labeled data (X_l^h, Y_l^h);
6. Repeat steps 2-5 twenty times, and choose the classifier with the highest validation AUC as
the final model.
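The steps above can be sketched in Python. In this illustrative version, scikit-learn's `LogisticRegression` stands in for the sparse LDA classifier, and the predicted-probability margin stands in for the distance to the decision boundary; the function name and defaults are assumptions, not the dissertation's implementation:

```python
# Sketch of Algorithm 1: iterative self-training with hold-out validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def self_train_algorithm1(X_l, y_l, X_u, n_iter=20, n_add=10, seed=0):
    # Step 1: hold out a small validation subset of the labeled data
    X_r, X_h, y_r, y_h = train_test_split(
        X_l, y_l, test_size=0.2, random_state=seed, stratify=y_l)
    X_u = X_u.copy()
    best_auc, best_clf = -np.inf, None
    for _ in range(n_iter):
        if len(X_u) < n_add:
            break
        # Step 2: train the supervised classifier on the remaining labeled data
        clf = LogisticRegression(max_iter=1000).fit(X_r, y_r)
        # Step 3: rank unlabeled points by confidence (probability margin
        # stands in for distance to the decision boundary)
        conf = np.abs(clf.predict_proba(X_u)[:, 1] - 0.5)
        top = np.argsort(conf)[-n_add:]
        # Step 4: promote the most confident pseudo-labeled points
        X_r = np.vstack([X_r, X_u[top]])
        y_r = np.concatenate([y_r, clf.predict(X_u[top])])
        X_u = np.delete(X_u, top, axis=0)
        # Step 5: re-train and validate on the held-out labeled data
        clf2 = LogisticRegression(max_iter=1000).fit(X_r, y_r)
        auc = roc_auc_score(y_h, clf2.predict_proba(X_h)[:, 1])
        # Step 6: keep the classifier with the highest validation AUC
        if auc > best_auc:
            best_auc, best_clf = auc, clf2
    return best_clf, best_auc
```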
4.2.2 Algorithm 2
In the second self-training algorithm, we also use sparse LDA as the supervised learning
classifier. However, differing from algorithm 1, we add all unlabeled observations with their
predictions to the labeled data at once and no iteration is performed.
Algorithm 2
1. Train a sparse LDA classifier f on the labeled data (X_l, Y_l);
2. Use f to classify the observations in the unlabeled data x ∈ X_u;
3. Add all unlabeled observations with their predicted labels (x, f(x)) to the labeled data
(X_l, Y_l);
4. Re-train a sparse LDA classifier f′ on the updated labeled data.
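Because Algorithm 2 is a single pass, its sketch is short. As before, `LogisticRegression` is an illustrative stand-in for the sparse LDA classifier:

```python
# Sketch of Algorithm 2: pseudo-label all unlabeled points at once and refit.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train_algorithm2(X_l, y_l, X_u):
    # Step 1: train the supervised classifier on the labeled data
    f = LogisticRegression(max_iter=1000).fit(X_l, y_l)
    # Steps 2-3: predict labels for ALL unlabeled observations and pool them
    X_aug = np.vstack([X_l, X_u])
    y_aug = np.concatenate([y_l, f.predict(X_u)])
    # Step 4: re-train once on the augmented data; no further iteration
    return LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
```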
4.2.3 Algorithm 3
The third algorithm is inspired by a self-training semi-supervised SVM algorithm
proposed by Li et al. (2008) (19). In this algorithm, we choose lasso penalized discriminant
analysis by Mai et al. (2012) (9) as the supervised learning classifier. The main idea is to keep
re-training the classifier until its objective function converges. The objective function of
standard lasso penalized discriminant analysis is a lasso-penalized least-squares criterion of the
following form:
\[
\min_{\beta_0,\beta}\;\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\beta_0-x_i^{T}\beta\right)^2+\lambda\lVert\beta\rVert_1
\]
In short, we predict labels for the unlabeled data points and use them to update the objective
function, repeating this process until the change in the objective function is small enough. The
details of
this algorithm are summarized below.
Algorithm 3
1. Train a classifier f on the labeled data (X_l, Y_l) using lasso penalized discriminant analysis;
2. Use f to classify the observations in the unlabeled data x ∈ X_u;
3. Add all unlabeled observations with their predicted labels (x, f(x)) to the labeled data
(X_l, Y_l);
4. Re-train a classifier f′ on the updated labeled data;
5. Compute the objective function value
\[
g^{(k)}=\frac{1}{n+m}\left[\sum_{i=1}^{n}\left(y_i-\beta_0-x_i^{T}\beta\right)^2+\sum_{j=1}^{m}\left(f'(x_{n+j})-\beta_0-x_{n+j}^{T}\beta\right)^2\right]+\lambda\lVert\beta\rVert_1
\]
where n is the sample size of the original labeled data, m is the sample size of the original
unlabeled data, and k denotes the k-th iteration;
6. Update the labels for the originally unlabeled data points;
7. Repeat steps 4 to 6;
8. Stop if |g^{(k)} − g^{(k−1)}| < δ, where δ is a pre-determined positive constant.
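A sketch of Algorithm 3, using scikit-learn's `Lasso` on a ±1-coded outcome as a stand-in for lasso penalized discriminant analysis. The objective below mirrors g^(k); the penalty level, the tolerance, and the sign-thresholding rule for pseudo-labels are illustrative assumptions:

```python
# Sketch of Algorithm 3: self-training until the penalized objective converges.
import numpy as np
from sklearn.linear_model import Lasso

def objective(model, X, y, lam):
    # penalized least-squares value, analogous to g^(k)
    r = y - model.predict(X)
    return np.mean(r ** 2) + lam * np.abs(model.coef_).sum()

def self_train_algorithm3(X_l, y_l, X_u, lam=0.1, delta=1e-4, max_iter=50):
    # Step 1: y_l is coded -1/+1 so the least-squares targets are symmetric
    f = Lasso(alpha=lam).fit(X_l, y_l)
    g_prev = np.inf
    for _ in range(max_iter):
        # Steps 2-3 and 6: pseudo-label the unlabeled points and pool the data
        y_u = np.where(f.predict(X_u) >= 0, 1.0, -1.0)
        X_all = np.vstack([X_l, X_u])
        y_all = np.concatenate([y_l, y_u])
        # Step 4: re-train on the pooled data
        f = Lasso(alpha=lam).fit(X_all, y_all)
        # Steps 5 and 8: stop when the objective change falls below delta
        g = objective(f, X_all, y_all, lam)
        if abs(g_prev - g) < delta:
            break
        g_prev = g
    return f
```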
4.3 Simulation studies
In this section, we present simulation studies to evaluate the effect of the above three
self-training algorithms. First, we fix the number of features at p = 600 and randomly generate
two classes of labeled observations from the multivariate normal distributions N(0_p, Σ) and
N(Σβ_true, Σ). The total sample size of the labeled data is fixed at n = 200, with 100 samples per
class. Then, another 2,000 observations are generated from the same two distributions but
without class labels; these 2,000 observations are treated as unlabeled data. The vector of true
coefficients is β_true = c · (3, 1.7, −2.2, −2.1, 2.55, 0_{p−5}). The variance-covariance matrix Σ
is a block-diagonal matrix consisting of five blocks, Σ = BlockDiag(Σ_B, ..., Σ_B), where Σ_B
has an exchangeable correlation structure: (Σ_B)_{jj} = 1 and (Σ_B)_{ij} = ρ for i ≠ j. Last, an
independent set of 10,000 test observations is generated from the same two distributions to
compute an external AUC as a measure of prediction performance. Simulation results based on
100 replications are shown in Figures 4.2-4.4.
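The data-generating mechanism above can be sketched as follows (the dimension is reduced from p = 600 for speed, and the function name and defaults are illustrative):

```python
# Sketch of the simulation design: two multivariate normal classes with a
# block-diagonal exchangeable covariance and a sparse coefficient vector.
import numpy as np

def simulate_classes(n_per_class, p=100, n_blocks=5, rho=0.8, c=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # block-diagonal Sigma: each block has 1 on the diagonal, rho off-diagonal
    b = p // n_blocks
    B = np.full((b, b), rho) + (1 - rho) * np.eye(b)
    Sigma = np.kron(np.eye(n_blocks), B)
    # sparse true coefficients: c * (3, 1.7, -2.2, -2.1, 2.55, 0, ..., 0)
    beta = np.zeros(p)
    beta[:5] = c * np.array([3.0, 1.7, -2.2, -2.1, 2.55])
    # class distributions N(0, Sigma) and N(Sigma beta, Sigma)
    X1 = rng.multivariate_normal(np.zeros(p), Sigma, n_per_class)
    X2 = rng.multivariate_normal(Sigma @ beta, Sigma, n_per_class)
    X = np.vstack([X1, X2])
    y = np.repeat([0, 1], n_per_class)
    return X, y
```

The same generator, called with fresh seeds, can produce the labeled, unlabeled, and test sets.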
Before discussing the simulation results, let us consider two possibilities in the
implementation of algorithm 1. As presented above, the supervised learning method used in
algorithm 1 is sparse LDA, which contains two tuning parameters: ρ from the graphical lasso
and λ from the soft-thresholding operator. As mentioned in the chapter on sparse LDA, their
values are
optimized through k-fold cross-validation. Notice that we need to refit sparse LDA at each
iteration in self-training algorithm 1, so there are two possible approaches to choosing the
tuning parameter values. One approach is to perform k-fold cross-
validation to choose values for 𝜌 and 𝜆 in the first fit of the model. Their values will be fixed
when we refit sparse LDA during the iteration process. The other approach is to apply k-fold
cross-validation to optimize the tuning parameters at each iteration, so that the values of ρ and λ
differ across iterations, whereas in the first approach they are kept fixed throughout. We
implemented both approaches for algorithm 1 and compared their performance at
each iteration.
In Figure 4.1, we plot the external AUC of the sparse LDA fitted at each iteration for the two
variants of algorithm 1. Plot (a) shows the approach in which the values of ρ and λ are updated
at each iteration, and plot (b) the approach in which their values are kept the same across
iterations. When implementing both approaches, we let the algorithm stop at the 20th iteration.
First, plot (a) shows that the performance of the classifier fitted at each iteration is very
unstable. The difference in external AUC between two consecutive iterations can be so dramatic
that the refitted classifier outperforms the original one at some iterations but falls behind at
others, and the pattern of changes is quite random. By contrast, the AUC curve shown in plot (b)
is smooth
compared to plot (a). The external AUC goes down over the first three iterations and bounces
back after the third. It exceeds the original AUC at the 5th iteration and keeps increasing
afterward. Even though there is a small reduction at the 16th and 20th iterations, the curve stays
above the baseline. The change in AUC is much smaller than what is shown in plot (a), so the
performance of the classifiers at each iteration is relatively stable. In summary, the two plots in
Figure 4.1 demonstrate that the
each iteration is relatively stable. In summary, the two plots in Figure 4.1 demonstrate that the
choice of tuning parameters at each iteration has a substantial effect on prediction performance.
Considering that fixing tuning parameters across iterations leads to more stable classifiers, we
finally decided, when implementing algorithm 1, to optimize the values of ρ and λ with k-fold
cross-validation only when fitting the first classifier and to fix their values across all iterations.
The
simulation results shown in Figure 4.2 are based on this approach.
Figure 4.1 Variation in prediction performance across iterations for algorithm 1
(a) choose tuning parameters for sparse LDA at each iteration
(b) fix tuning parameters for sparse LDA across iterations
In Figure 4.2, we compare algorithm 1 to the supervised learning model, sparse LDA, under
correlation coefficients of 0.8 and 0. In both scenarios, the classifier obtained from the
self-training procedure performs slightly worse than the baseline, a supervised classifier trained
on the labeled data alone. Under ρ = 0.8, the median external AUC is almost the same for
supervised and semi-supervised learning, but the semi-supervised learner shows less variance in
performance. However, there is an obvious reduction in both prediction accuracy and stability
when using self-training under ρ = 0. Therefore, algorithm 1 is not an effective semi-supervised
learning procedure: it does not perform as well as the supervised learning method that uses no
unlabeled data.
Figure 4.2 Simulation results of algorithm 1
(a) ρ = 0.8 (b) ρ = 0
In Figure 4.3, we compare algorithm 2 to the supervised learning model, sparse LDA, also
under correlation coefficients of 0.8 and 0. Additionally, we consider two levels of effect size
for each correlation scenario. In all scenarios, self-training classifiers trained on both labeled
and unlabeled data perform better than supervised classifiers obtained from the labeled data
alone. Under the scenario with high correlation (ρ = 0.8) and a large effect size, algorithm 2
yields a considerable improvement in external AUC over using sparse LDA directly. If we keep
the correlation coefficient at ρ = 0.8 and lower the effect size, the increase in AUC diminishes.
When ρ = 0, the semi-supervised procedure performs slightly better than supervised learning,
regardless of the effect size. In general, algorithm 2 provides a more effective way to perform
semi-supervised learning than algorithm 1: it consistently improves on the performance of
supervised learning across all scenarios. However, its superiority is modest; it performs close to
supervised learning in most cases.
Figure 4.3 Simulation results of algorithm 2
(a) ρ = 0.8
(b) ρ = 0
In Figure 4.4, we compare test AUC’s of algorithm 3 to a supervised learning model,
lasso discriminant analysis (lasso LDA), under correlation coefficients of 0.8 and 0. For each
scenario, we also plot the values of objective functions and test AUC’s at each iteration to further
demonstrate how the algorithm works. Under both scenarios, the objective function keeps
decreasing across iterations and tends to converge after a few steps. When ρ = 0.8, the average
test AUC increases over the first few iterations and then changes little. Consequently, the
median test AUC of this semi-supervised algorithm is almost the same as that of the supervised
method, but there is a noticeable decline in performance variability. However, the average test
AUC keeps decreasing across iterations when ρ = 0, and hence the semi-supervised learner
underperforms supervised learning in both prediction accuracy and stability. We therefore
conclude that algorithm 3 is not a good choice for self-training lasso discriminant analysis: it
shows no remarkable enhancement in prediction performance over supervised learning and can
even lose substantial performance under some circumstances.
Figure 4.4 Simulation results of algorithm 3
(a) ρ = 0.8
(b) ρ = 0
4.4 Discussion
In this chapter, we discussed incorporating external information in high-dimensional
classification problems from a semi-supervised learning perspective, which treats the external
information as unlabeled data and aims to train a better classifier from the labeled and unlabeled
data together than from the labeled data alone. We focused on self-training, the simplest
approach to semi-supervised learning, and designed three self-training algorithms for the
high-dimensional LDA classifiers sparse LDA and lasso discriminant analysis. From the
simulation studies, we found that only algorithm 2 improves prediction performance to some
degree, while algorithms 1 and 3 do not work as well as their corresponding supervised learning
methods.
Since self-training uses a classifier's predictions to teach itself, it requires that the predictions of
the supervised classifier, at least the most confident ones, be correct. This assumption is crucial
to the direction of the effect during the self-training process. On one hand, improvement in
performance may accumulate when the predictions are correct. On the other hand, performance
can become worse and worse as self-training proceeds if the predictions are not accurate
enough. Unfortunately, the assumption that the most confident predictions are correct is not
easy to satisfy in many cases, so it is difficult to improve prediction performance through
self-training. Usually, self-training can slightly improve prediction performance if the
supervised classifier trained on the labeled data accurately predicts the outcomes of the
unlabeled observations; if it does not predict the unlabeled data well enough, performance may
drop substantially after applying the self-training algorithms.
In these two chapters, we tried two different but straightforward ways to incorporate a specific
type of external information (e.g., TCGA gene expressions) in high-dimensional LDA.
However, both approaches have a very limited effect on prediction performance. The approach
that incorporates external information to estimate a more accurate inverse covariance matrix
turned out not to be beneficial at all, whereas semi-supervised learning shows more potential.
Even though none of the approaches we tried in this chapter performs very well, self-training is
just the tip of the iceberg for semi-supervised learning. There is a wealth of more sophisticated
methods in this field, such as co-training, mixture models and graph-based semi-supervised
learning. These can be interesting topics for future research on incorporating unlabeled external
information.
Chapter 5 Regularized Hierarchical Regression and
Classification for High-Dimensional Genomic Data
with External Information
5.1 Abstract
Recent studies have shown that gene annotations informative about disease biological
processes can help prioritize causal genes. Therefore, the integration of genomic annotation data
and more generally, meta-data external to the study at hand, is a biostatistical problem of great
interest. We propose a 2-level penalized hierarchical regression model to integrate external meta-
data in the prediction of health-related outcomes. The model has an analytical solution and can
be fitted efficiently using a modified ridge regression. Additionally, motivated by the connection
between linear discriminant analysis and linear regression, this regression framework can be
conveniently adapted to classification problems. Through simulation we show that in a wide
range of scenarios the proposed approach yields improvement in prediction performance. We
also use the METABRIC breast cancer data as an example to demonstrate that the integrated data
can better predict the outcome than the gene expression data alone.
Keywords: data integration, gene expressions, regularized regression, hierarchical model
5.2 Introduction
Integration of high-dimensional ‘omic’ data has become an important area of statistical
research(20, 21) as the quantity and complexity of generated data continues to increase. Large
international projects like The Cancer Genome Atlas (TCGA) have generated massive amounts
of data on the role of the genome in both normal and disease processes. Leveraging this
information to enhance discovery and prediction has become a key statistical challenge(22). In
this paper, we focus on how to integrate this ‘external information’ to enhance
discovery and prediction. Previous work has shown that incorporating external information in a
regression or classification problem has the potential to improve model selection and prediction
accuracy. Specifically, for high-dimensional genomic studies, several approaches have been
developed in recent years to leverage prior information in various ways. For example, the
weighted lasso proposed by Bergersen et al. (2011)(23) integrates relevant external information
on the covariates to customize the lasso penalties for each regression coefficient, providing more
stable model selection results and improved predictability than standard lasso. Van de Wiel et al.
(2015) (24) proposed an adaptive group-regularized ridge regression, which derives empirical
Bayes estimates for group-specific penalties by utilizing co-data such as gene annotations or
external p-values, resulting in improved prediction and greater distinction between estimates of
small and large effects. Tai and Pan (2007) (25) propose a framework to incorporate prior
knowledge of predictors into developing penalized classifiers based on nearest shrunken
centroids and penalized partial least squares regression. Liu et al. (2018) (26) integrate molecular
features from multiple platforms in a regression model with multi-tuning parameters.
Furthermore, some methods incorporate SNP functional information to prioritize GWAS results
for post-GWAS analyses. For instance, Lewinger et al. (2007) (27) include marker function and
prior linkage or association data in a prior model for the true non-centrality parameters of
marker-disease associations. Chen and Witte (2007) (28) use existing information about SNPs in
a hierarchical model to select the most-promising SNPs. All these data integration approaches,
which utilize external information to customize regression penalties through which to adjust the
final regression coefficients, demonstrate improvement upon methods that do not incorporate
external information. An alternative approach is to construct a hierarchical model (12), which
regresses coefficients directly on second-level covariates that are summarized from external
information. However, standard hierarchical models are not applicable to high-dimensional
settings.
Considering the rising demand of data integration, we propose a regularized hierarchical
model, which can handle high-dimensional omic features arising from gene-expression or
GWAS studies and high-dimensional annotations such as those arising from gene ontology. The
first level is a standard linear regression of a disease-related outcome on genomic features such
as gene expression levels or SNP genotypes. The second level regresses the first-level
coefficients on ‘metavariables’ representing annotations for the genomic features used in the
first level. Since genomic annotations can themselves be high-dimensional, the second-level
coefficients are regularized with an additional ℓ₂ penalty. The coefficients for the first-level
genomic features are shrunken towards the product of the gene annotations and the second-level
coefficients through the second-level least squares, while the additional ℓ₂ penalty shrinks the
second-level coefficients towards zero. Therefore, the first-level regression can also be
high-dimensional.
In practice, however, it is more common to predict a binary health-related outcome than a
continuous one when using the genomic data. The regression model we present can be
conveniently applied to a high-dimensional classification problem with external information, due
to the connection between linear regression and linear discriminant analysis (LDA).
LDA is a simple classification method which uses a hyperplane in the feature space to separate
different classes. In the classical p < n setting, the LDA hyperplane can be derived through a
linear regression for a binary outcome with an adjusted intercept(9). Because of its simplicity,
many methods have been recently proposed to extend LDA for high-dimensional data, such as
the Scout method by Witten and Tibshirani (2009) (10), the lasso penalized discriminant analysis
by Mai and Zou (2012) (9) and the penalized Fisher’s discriminant by Witten and Tibshirani
(2011) (8). However, all these methods only overcome challenges presented by high-dimensional
settings. None of them considers taking advantage of prior information for gene expression
classification studies. Therefore, the proposed regression model can also be considered as an
extension of LDA which can handle high-dimensional settings and incorporate external
information simultaneously.
In terms of model implementation, since each term in the proposed objective is a squared L2
norm and hence differentiable, the model has analytical solutions for the coefficients of both
levels. However, the
analytical solutions involve the inverse of a 𝑝 × 𝑝 matrix, which is expensive to compute when
the dimensionality is extremely high. Instead of solving the problem directly, we fit this
regularized hierarchical model efficiently by applying a standard ridge regression to two groups
of features whose coefficients are shrunk by different amounts.
5.3 Methods
In this paper, we assume a high-dimensional regression setting where y = (y₁, ..., y_n) is a
vector representing a quantitative outcome that we want to regress on a set of high-dimensional
(p ≫ n) genomic features arranged in an n × p design matrix X. For example, the features in X
can consist of gene expression, methylation, or genotypes, and the outcome y can be the
quantitative level of a certain biomarker measured on n subjects. We will later relax the
assumption that 𝑦 is quantitative and extend the approach to binary outcomes as well. In a typical
application there will also be a low-dimensional set of adjustment covariates, such as sex, age
and clinical variables, arranged in an n × r matrix X₀ with r < n. Associated with each
column/feature in X, we assume there is a set of q meta-features, such as gene or SNP
annotations, arranged in a p × q matrix Z. For example, if X represents gene expression
features, each row in the matrix Z represents a gene and each column may represent a gene
function derived from the Gene Ontology (GO)(29, 30). If X represents SNP genotypes, Z may
contain SNP functional information that classifies SNPs according to whether they are exonic,
intronic, intergenic, synonymous, etc.(31). In both these examples Z will be a matrix of zeros
and ones indicating the presence or absence of a gene function or whether a SNP belongs to a
particular class of SNPs, but the external information may also contain quantitative variables
such as weights or scores derived from test statistics or p-values from previous studies. The
number of meta-features is usually smaller than the number of features, p, but it can be larger
than the sample size, n.
We assume that the main analytical goal is to either develop a predictive model for 𝑦
based on 𝑋 (predictive goal), to identify a subset of features in 𝑋 associated with 𝑦 (inferential
goal), or a mixture of both. The meta-features 𝑍 encode information external to the study at hand
but that the researcher believes may be relevant to the outcome of interest. Specifically, the
belief is that features in X with similar meta-feature patterns are likely to have more similar
effects on the outcome y than features with dissimilar patterns. For example, if X represents
expression features, we may expect that those corresponding to genes with similar functions may
have more similar effects on the outcome.
The regularized linear regression model we propose solves the following minimization
problem:
\[
\min_{\alpha,\alpha_0,\beta,\beta_0}\;\lVert y - X_0\beta_0 - X\beta\rVert_2^2
+ \lambda_1\lVert \beta - \alpha_0 1_p - Z\alpha\rVert_2^2
+ \lambda_2\lVert \alpha\rVert_2^2 \qquad (1)
\]
where ‖·‖₂ denotes the L2 norm of a vector, 1_k represents a k-dimensional vector of ones, and
λ₁, λ₂ ≥ 0 are tuning parameters. We assume the matrix of adjustment covariates X₀ contains a
column of ones corresponding to an intercept term. Minimization of the first term
‖y − X₀β₀ − Xβ‖₂² alone would yield a standard least-squares linear regression of y on X and
the adjustment covariates X₀. The penalty term ‖β − α₀1_p − Zα‖₂² shrinks the estimates of the
coefficients β corresponding to features in X that share similar patterns of meta-features in Z to
be similar to each other. When λ₂ > 0, the ridge penalty ‖α‖₂² regularizes the coefficients, α, of
the ‘second level’ regression of β on Z. Considering the high-dimensional problem at both
levels,
we introduce a quadratic penalty in the second-level model, which hence has the fitting criterion
\[
\min_{\alpha,\alpha_0}\;\lVert \beta - Z\alpha - \alpha_0 1_p\rVert_2^2 + \lambda\lVert \alpha\rVert_2^2
\]
for λ > 0. The effect of the two penalties ‖α‖₂² and ‖β − Zα‖₂² is to control the size of the
coefficients α and β by shrinking them towards zero and Zα, respectively. The tuning
parameters λ₁ and λ₂ determine the amount of shrinkage and are selected by cross-validation.
The intercept terms α₀ and β₀ in (1) are separated from the rest of the coefficients because they
are not penalized. Both intercept terms can be removed by mean-centering y and the columns of
X and Z. Similarly, the effect of any adjustment covariates can be removed by replacing y and
the columns of X with the residuals from regressing them on the covariates. In what follows we
assume this centering and adjusting has already been performed, yielding the simplified model:
\[
\min_{\alpha,\beta}\;\lVert y - X\beta\rVert_2^2 + \lambda_1\lVert \beta - Z\alpha\rVert_2^2 + \lambda_2\lVert \alpha\rVert_2^2 \qquad (2)
\]
Letting λ₂ → +∞, and therefore forcing α → 0 in (2), shows that the proposed model extends
standard ridge regression:
\[
\min_{\beta}\;\lVert y - X\beta\rVert_2^2 + \lambda_1\lVert \beta\rVert_2^2 \qquad (3)
\]
Model (1) also arises naturally as a three-level hierarchical model. Specifically, the first level
can be written as a standard linear regression of the form
\[
y_i = \beta_0 + x_i^{T}\beta + \delta_i, \qquad i = 1,\dots,n
\]
where x_i is the vector of feature values for sample i and the δ_i are independent error terms
with normal distribution N(0, σ₁²). The second-level model relates the coefficients
β = (β₁, ..., β_p) of the first-level regression to the meta-features in Z using in turn a linear
regression of the form
\[
\beta_j = \alpha_0 + z_j^{T}\alpha + \epsilon_j, \qquad j = 1,\dots,p
\]
where z_j is the vector of meta-features corresponding to feature j. The errors ε_j also follow
independent normal distributions N(0, σ₂²). The third level in the hierarchy consists of ‘priors’
on the coefficients α = (α₁, ..., α_q), assumed to have independent normal distributions
N(0, σ₃²). More compactly, the three-level hierarchical model is
\[
y \mid \beta \sim N(X\beta,\ \sigma_1^2 I), \qquad
\beta \mid \alpha \sim N(Z\alpha,\ \sigma_2^2 I), \qquad
\alpha \sim N(0,\ \sigma_3^2 I)
\]
Maximum a posteriori (MAP) estimation based on the full likelihood P(y|β)P(β|α)P(α), where
P(·) denotes a probability density, yields the objective function in (1) with λ₁ = σ₁²/σ₂² and
λ₂ = σ₁²/σ₃².
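This correspondence can be seen by writing out the negative log of the joint density, up to additive constants and ignoring intercepts as in the simplified model (2):

```latex
-\log\bigl[P(y\mid\beta)\,P(\beta\mid\alpha)\,P(\alpha)\bigr]
 = \frac{1}{2\sigma_1^2}\lVert y - X\beta\rVert_2^2
 + \frac{1}{2\sigma_2^2}\lVert \beta - Z\alpha\rVert_2^2
 + \frac{1}{2\sigma_3^2}\lVert \alpha\rVert_2^2 + \text{const.}
```

Multiplying through by 2σ₁² leaves the minimizers unchanged and gives the objective in (2) with λ₁ = σ₁²/σ₂² and λ₂ = σ₁²/σ₃².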
The objective function in (1) is jointly convex in all the parameters θ = (β₀, β, α₀, α) and can
therefore be easily minimized using standard convex optimization solvers such as those
implemented in CVX (32, 33), CVXPY (34), or CVXR. However, we show in Appendix A that
implemented in CVX (32, 33) or CVXPY (34) , CVXR. However, we show in appendix A that
the problem in (1) is equivalent to two separate ridge regression problems for which there are
well established efficient methods(35).
A third way to formulate model (1), which provides further insight, is to perform the change of
variables γ = β − Zα. First, without loss of generality, we rewrite (1) in the following form
without intercept terms:
\[
\min_{\alpha,\beta}\;\frac{1}{2}\lVert y - X\beta\rVert_2^2
+ \frac{1}{2}\lambda_1\lVert \beta - Z\alpha\rVert_2^2
+ \frac{1}{2}\lambda_2\lVert \alpha\rVert_2^2 \qquad (4)
\]
Substituting γ = β − Zα into (4), we get
\[
\min_{\alpha,\gamma}\;\frac{1}{2}\lVert y - X\gamma - XZ\alpha\rVert_2^2
+ \frac{1}{2}\lambda_1\lVert \gamma\rVert_2^2
+ \frac{1}{2}\lambda_2\lVert \alpha\rVert_2^2 \qquad (5)
\]
This is a standard ridge regression in which X* = [X, XZ] is a matrix of (p + q) features and
γ* = [γᵀ, αᵀ]ᵀ is the vector of regression coefficients. The solution to the ridge regression in (5)
is

    γ̂* = (X*ᵀX* + Λ*)⁻¹ X*ᵀ y   (6)

where

    Λ* = [ λ₁ I_{p×p}   0
           0            λ₂ I_{q×q} ] .

Now we consider the singular value decomposition X* = UDVᵀ,
where U is an n × (p + q) matrix with orthonormal columns, V is a (p + q) × (p + q)
orthogonal matrix, and D is a diagonal matrix with elements d₁ ≥ d₂ ≥ … ≥ d_{p+q} ≥ 0. After
applying the singular value decomposition to X*, the fitted vector for the ridge regression is

    X*γ̂* = X*(X*ᵀX* + Λ*)⁻¹X*ᵀy
          = UD(D² + Λ*)⁻¹DUᵀy
          = Σ_{j=1}^{p} u_j [d_j² / (d_j² + λ₁)] u_jᵀy + Σ_{j=p+1}^{p+q} u_j [d_j² / (d_j² + λ₂)] u_jᵀy   (7)
where the u_j are the columns of U. This computes the coordinates of y with respect to the orthonormal
basis U, using a partition of X* into two blocks (36). To be more specific, the left p columns of X* capture
the original features X, such as genes, while the right q columns, XZ, consist of linear
combinations of X, namely averages of the genes sharing the same annotation. Instead of
modeling the features X only, we incorporate their linear combinations XZ, which might also be
important, as additional features in a standard ridge regression model, and penalize the
corresponding two groups of coefficients separately. The ridge regression shrinks the computed
coordinates by the factors d_j²/(d_j² + λ₁) and d_j²/(d_j² + λ₂) for X and XZ, respectively. Therefore, the regression
coefficients corresponding to the two groups of features in X and XZ are shrunk by
different amounts.
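To make the augmented-design view concrete, the sketch below solves the closed form (6) directly with NumPy and recovers the first-level coefficients via β = γ + Zα. The data are small synthetic stand-ins for gene expressions X and binary annotations Z (all dimensions and values are illustrative, not the dissertation's data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 50, 20, 5

# Synthetic stand-ins for gene expressions X and binary annotations Z
X = rng.standard_normal((n, p))
Z = rng.binomial(1, 0.2, size=(p, q)).astype(float)
y = rng.standard_normal(n)
lam1, lam2 = 2.0, 1.0

# Augmented design X* = [X, XZ] and block-diagonal penalty Lambda*
X_star = np.hstack([X, X @ Z])
Lam = np.diag(np.r_[np.full(p, lam1), np.full(q, lam2)])

# Closed-form ridge solution (6): gamma* = (X*'X* + Lambda*)^{-1} X*'y
gamma_star = np.linalg.solve(X_star.T @ X_star + Lam, X_star.T @ y)
gamma_hat, alpha_hat = gamma_star[:p], gamma_star[p:]

# First-level coefficients recovered via beta = gamma + Z alpha
beta_hat = gamma_hat + Z @ alpha_hat
```

Because (5) is an ordinary ridge problem on the augmented design, any ridge solver that accepts per-coefficient penalties returns the same γ̂ and α̂.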
To understand the solutions better, we present a simple case. Let X = [X₁, X₂, X₃] be an n × 3 matrix
with orthonormal columns and let Z = (1, 1, 0)ᵀ. First, we have

    X*ᵀX* = [ Xᵀ ; (XZ)ᵀ ] [X, XZ] = [ XᵀX      XᵀXZ
                                        ZᵀXᵀX    ZᵀXᵀXZ ]   (8)

In this case, XᵀX = I₃ₓ₃, XᵀXZ = Z, ZᵀXᵀXZ = ZᵀZ = 2, and Λ* = [ λ₁I₃ₓ₃  0 ; 0  λ₂ ]. Then the
solution in (6) is

    γ̂* = λ⁻¹ X*ᵀ y,   (9)
where

    λ = [ 1+λ₁   0      0      1
          0      1+λ₁   0      1
          0      0      1+λ₁   0
          1      1      0      2+λ₂ ] .

Let β̂_X^ols and β̂_XZ^ols be the ordinary least squares
solutions for the linear regression of y on X and the linear regression of y on XZ, respectively.
We have β̂_X^ols = (XᵀX)⁻¹Xᵀy = Xᵀy, and β̂_XZ^ols = (ZᵀXᵀXZ)⁻¹ZᵀXᵀy =
(ZᵀZ)⁻¹ZᵀXᵀy = ½ZᵀXᵀy = ½(X₁ + X₂)ᵀy, where ½ZᵀXᵀ = ½(X₁ + X₂)ᵀ captures the average effect
of the first two genes. Then, the solution in (9) has the form

    γ̂* = λ⁻¹ [ Xᵀ ; (XZ)ᵀ ] y = λ⁻¹ [ Xᵀy ; ZᵀXᵀy ] = λ⁻¹ [ X₁ᵀy ; X₂ᵀy ; X₃ᵀy ; (X₁+X₂)ᵀy ] = λ⁻¹ [ β̂_X^ols ; 2β̂_XZ^ols ]   (10)

Therefore, the solution of the ridge regression on X and XZ together is linked to the solutions
of the separate linear regressions on X and on XZ through the shrinkage matrix λ⁻¹.
When λ₂ = 0, the solution in (10) is

    γ̂* = ( [(1+2λ₁)/(2λ₁(1+λ₁))] X₁ᵀy + [1/(2λ₁(1+λ₁))] X₂ᵀy − [1/(2λ₁)] (X₁+X₂)ᵀy ,
           [1/(2λ₁(1+λ₁))] X₁ᵀy + [(1+2λ₁)/(2λ₁(1+λ₁))] X₂ᵀy − [1/(2λ₁)] (X₁+X₂)ᵀy ,
           [1/(1+λ₁)] X₃ᵀy ,
           −[1/(2λ₁)] X₁ᵀy − [1/(2λ₁)] X₂ᵀy + [(1+λ₁)/(2λ₁)] (X₁+X₂)ᵀy )ᵀ .
When λ₂ = 0 and λ₁ → +∞,

    λ⁻¹ → [ 0  0  0  0
            0  0  0  0
            0  0  0  0
            0  0  0  ½ ] .

Then we get

    γ̂* = [ γ̂ ; α̂ ] = [ β̂ − Zα̂ ; α̂ ] = [ β̂₁ − α̂ ; β̂₂ − α̂ ; β̂₃ ; α̂ ] → [ 0 ; 0 ; 0 ; ½(X₁+X₂)ᵀy ]   (11)
Therefore, β̂₁ → α̂, β̂₂ → α̂, and α̂ → ½(X₁+X₂)ᵀy = β̂_XZ^ols as λ₁ → +∞ when λ₂ = 0. This
means that if we do not penalize the second-level coefficients α, then the hierarchical model of
the form

    min_{α,β}  ½‖y − Xβ‖₂² + ½λ₁‖β − Zα‖₂²

will shrink the first-level coefficients β₁ and β₂ towards the ordinary least squares solution β̂_XZ^ols
of y on XZ, which depends on the average effect of the first-level features, ½(X₁ + X₂).
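This limiting behavior is easy to check numerically. The sketch below reproduces the simple case on simulated data (orthonormal X with three columns, Z = (1, 1, 0)ᵀ): with λ₂ = 0, α̂ coincides with β̂_XZ^ols for any λ₁, and for large λ₁ the first two coefficients collapse onto α̂:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
# Orthonormal columns via QR decomposition (illustrative toy data)
X, _ = np.linalg.qr(rng.standard_normal((n, 3)))
Z = np.array([[1.0], [1.0], [0.0]])
y = rng.standard_normal(n)

def hier_ridge(lam1, lam2):
    # Solve the augmented ridge problem (6) for this 3-feature toy case
    X_star = np.hstack([X, X @ Z])
    Lam = np.diag([lam1, lam1, lam1, lam2])
    g = np.linalg.solve(X_star.T @ X_star + Lam, X_star.T @ y)
    gamma, alpha = g[:3], g[3]
    beta = gamma + Z[:, 0] * alpha          # beta = gamma + Z alpha
    return beta, alpha

# OLS solution of y on XZ = X1 + X2, i.e. 0.5 * (X1 + X2)' y
beta_ols_XZ = 0.5 * (X[:, 0] + X[:, 1]) @ y

beta, alpha = hier_ridge(lam1=5.0, lam2=0.0)        # lam2 = 0
beta_big, alpha_big = hier_ridge(lam1=1e6, lam2=0.0)  # lam1 large
```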
When implementing this method, we use a more efficient approach to fit the model in (1).
First, we apply a singular value decomposition on the matrix 𝑋 . Then, the problem in (1) can be
transformed into two equivalent ridge regressions with lower dimensionality. We use the R
package glmnet (35) to fit the resulting ridge regressions. More details are presented in Appendix
A.
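Appendix A describes the SVD-based reduction actually used with glmnet. As a complementary illustration of why a single-penalty ridge solver suffices for the two-penalty problem, the standard column-rescaling trick converts the block penalty (λ₁ on γ, λ₂ on α) into a uniform one: scale column j by √(λ/pen_j), fit with uniform penalty λ, then rescale the coefficients back. The data below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 40, 15, 4
X = rng.standard_normal((n, p))
Z = rng.binomial(1, 0.25, size=(p, q)).astype(float)
y = rng.standard_normal(n)
lam1, lam2 = 3.0, 0.5

X_star = np.hstack([X, X @ Z])
pen = np.r_[np.full(p, lam1), np.full(q, lam2)]

# Direct block-penalized solution, as in (6)
direct = np.linalg.solve(X_star.T @ X_star + np.diag(pen), X_star.T @ y)

# Same fit via a single-penalty ridge solve: rescale column j by
# sqrt(lam / pen_j), solve with uniform penalty lam, rescale back.
lam = 1.0
scale = np.sqrt(lam / pen)
X_scaled = X_star * scale                 # broadcasts over columns
c = np.linalg.solve(X_scaled.T @ X_scaled + lam * np.eye(p + q),
                    X_scaled.T @ y)
rescaled = scale * c
```

The same rescaling idea is what lets a uniform-penalty solver such as glmnet (with per-feature penalty factors) handle the block penalty, although the implementation in Appendix A additionally reduces the dimensionality via the SVD.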
Binary outcomes
As discussed in the introduction section, binary outcomes are much more common than
quantitative ones in genomic studies and it is therefore desirable to extend the approach above to
handle them. One approach would be to replace the linear regression in the first level of the
hierarchical model by a logistic regression. However, the regularized hierarchical model (1) can
be directly applied to binary outcomes by exploiting the connection between linear regression
and linear discriminant analysis (LDA) for binary classification. Specifically, LDA assumes that
features in the two classes defined by the binary outcome are normally distributed with class-
specific means 𝜇 1
, 𝜇 2
and a common covariance matrix Σ. The classification boundary in the
feature space is given by
84
[𝑥 − (𝜇 1
+ 𝜇 2
) 2 ⁄ ]
𝑇 Σ
−1
(𝜇 2
− 𝜇 1
)+ log(𝜋 2
𝜋 1
⁄ )= 0 (12)
where 𝜋 1
and 𝜋 2
are prior probabilities for two classes. The direction of the LDA’s classification
boundary Σ
̂
−1
(𝜇 ̂
2
− 𝜇 ̂
1
) can be equivalently estimated by using linear regression(9).
The solution to the least squares formulation

    (β̂, β̂₀) = argmin Σᵢ₌₁ⁿ (yᵢ − β₀ − xᵢᵀβ)²   (13)

is β̂ = c Σ̂⁻¹(μ̂₂ − μ̂₁), which is proportional to the LDA slope up to a constant multiplier.
Therefore, the least squares formulation yields the same classification boundary as linear
discriminant analysis if the intercept is chosen correctly. Consider a new boundary

    xᵀβ̂ + β̂₀^opt = 0   (14)
where the optimal intercept can be estimated by
    β̂₀^opt = −(μ̂₂ + μ̂₁)ᵀβ̂ / 2 + β̂ᵀΣ̂β̂ [(μ̂₂ − μ̂₁)ᵀβ̂]⁻¹ log(n₂/n₁)   (15)
It can be shown that the boundary (14) is equivalent to the LDA’s classification boundary (12)
(9).
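This equivalence is easy to verify numerically. The sketch below simulates two Gaussian classes with a common covariance (the means and covariance are arbitrary choices made for illustration), fits least squares to the 0/1 labels as in (13), compares the resulting direction with Σ̂⁻¹(μ̂₂ − μ̂₁) computed from the pooled within-class covariance, and evaluates the intercept formula (15):

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, p = 200, 150, 4
mu1, mu2 = np.zeros(p), np.array([1.0, 0.5, 0.0, -0.5])
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p + np.eye(p)            # a valid common covariance

X1 = rng.multivariate_normal(mu1, Sigma, size=n1)
X2 = rng.multivariate_normal(mu2, Sigma, size=n2)
X = np.vstack([X1, X2])
y = np.r_[np.zeros(n1), np.ones(n2)]

# Least squares of the 0/1 class label on X (with intercept), as in (13)
Xc = np.column_stack([np.ones(n1 + n2), X])
beta = np.linalg.lstsq(Xc, y, rcond=None)[0][1:]

# Pooled within-class covariance and the LDA direction
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S = ((X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)) / (n1 + n2 - 2)
lda_dir = np.linalg.solve(S, m2 - m1)

# Cosine similarity between the two estimated directions
cos = beta @ lda_dir / (np.linalg.norm(beta) * np.linalg.norm(lda_dir))

# Optimal intercept from (15), using the sample quantities
b0 = -(m2 + m1) @ beta / 2 \
     + (beta @ S @ beta) / ((m2 - m1) @ beta) * np.log(n2 / n1)
```

With sample estimates plugged in, the least squares direction is exactly proportional to the LDA direction, so the cosine similarity is 1 up to floating-point error.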
Motivated by this connection between linear discriminant analysis and linear regression
in the classical p < n setting, we can apply the regularized hierarchical model to binary
classification problems in high-dimensional settings by estimating the direction β̂* from the regularized
hierarchical least squares problem

    β̂* = argmin_{α,β}  ½‖y − Xβ‖₂² + ½λ₁‖β − Zα‖₂² + ½λ₂‖α‖₂²

Plugging β̂* into (15), we can calculate the optimal intercept β̂₀* and then obtain the classification
boundary xᵀβ̂* + β̂₀* = 0.
5.4 Simulation studies
To evaluate the performance of the above model, we compare the regularized hierarchical
linear discriminant analysis with second-level covariates to the standard ridge linear regression
across a wide range of scenarios. The R package glmnet provides a highly efficient linear
regression implementation. Considering that the regularized hierarchical model can be efficiently
fitted via a ridge linear regression on suitably transformed data, we implement it by using the R
package glmnet as well. We apply 5-fold repeated cross-validation to choose the optimal
values of the tuning parameters λ₁ and λ₂, selecting the pair that maximizes the average AUC on the hold-out
subsamples.
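The dissertation's implementation uses glmnet in R; the following pure-NumPy sketch on synthetic data illustrates the same selection procedure, a k-fold cross-validation over a two-dimensional grid of (λ₁, λ₂) scored by the average hold-out AUC (grid values, dimensions, and data are illustrative assumptions):

```python
import numpy as np

def auc(y_true, score):
    # Mann-Whitney formulation of the AUC
    pos, neg = score[y_true == 1], score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

def fit_hier_ridge(Xtr, ytr, Z, lam1, lam2):
    # Hierarchical ridge via the augmented design [X, XZ], as in (6)
    p, q = Z.shape
    Xs = np.hstack([Xtr, Xtr @ Z])
    pen = np.diag(np.r_[np.full(p, lam1), np.full(q, lam2)])
    g = np.linalg.solve(Xs.T @ Xs + pen, Xs.T @ ytr)
    return g[:p] + Z @ g[p:]               # beta = gamma + Z alpha

rng = np.random.default_rng(4)
n, p, q, k = 60, 12, 3, 5
X = rng.standard_normal((n, p))
Z = rng.binomial(1, 0.3, size=(p, q)).astype(float)
beta_true = Z @ np.array([1.0, -0.5, 0.8]) + 0.1 * rng.standard_normal(p)
y = (X @ beta_true + rng.standard_normal(n) > 0).astype(float)

# Roughly class-balanced fold assignment
order = np.argsort(y)
folds = np.empty(n, dtype=int)
folds[order] = np.arange(n) % k

grid = [(l1, l2) for l1 in (0.5, 2.0, 8.0) for l2 in (0.5, 2.0, 8.0)]
mean_auc = {}
for l1, l2 in grid:
    scores = []
    for f in range(k):
        tr, ho = folds != f, folds == f
        beta = fit_hier_ridge(X[tr], y[tr], Z, l1, l2)
        scores.append(auc(y[ho], X[ho] @ beta))
    mean_auc[(l1, l2)] = float(np.mean(scores))

best_l1, best_l2 = max(mean_auc, key=mean_auc.get)
```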
5.4.1 Simulation settings
We randomly generate a training set of n samples such that n₁ = n₂ = n/2. Two classes
of p-dimensional feature vectors (X₁, X₂), labeled y₁ = 0 and y₂ = 1, are simulated from the
multivariate normal distributions 𝒩(0_p, Σ) and 𝒩(Σβ_true, Σ), respectively. The covariance
matrix Σ has an autoregressive correlation structure, Σ_ij = ρ^|i−j| for i, j = 1, …, p. The external
information Z is a p × q matrix in which each element is drawn from a Bernoulli distribution with
success probability 0.05. The true model has second-level regression coefficients (α₀, α) = (α₀, α₁, …, α₂₀₀) =
(0.2, 0.1·(1₁₀₀, 2₅₀, 0₅₀)), that is, α₀ = 0.2 and α consists of 100 entries equal to 0.1, 50 entries equal to 0.2, and 50 zeros,
and first-level regression coefficients β = c(Zα + α₀·1_p + ε),
with error terms ε_j ~ 𝒩(0, σ²), where c is a constant. An independent test set of 10,000
samples is also generated from the same distributions as the training set, and an external
AUC on this test set measures the prediction performance.
To capture the predictive ability of the features in each dataset, we consider two quantities. The
second-level information content is measured by the theoretical R-squared of β, which has the
form

    R_β² = λ / (λ + σ²)

where λ = αᵀΣ_Z α measures the signal and Σ_Z is the true covariance of Z. The first-level
information content is measured by the theoretical AUC, calculated from

    P(βᵀx₂ > βᵀx₁) = P(βᵀ(x₂ − x₁) > 0) = P(G > 0),   G ~ 𝒩(βᵀΣβ, 2βᵀΣβ)
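A small sketch of these two ingredients follows (synthetic β; p and ρ as in the simulations): the AR(1) covariance and the theoretical AUC, which from the normal distribution of G equals Φ(√(βᵀΣβ / 2)):

```python
import numpy as np
from math import erf, sqrt

def ar1_cov(p, rho):
    # Autoregressive correlation structure Sigma_ij = rho^|i-j|
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def theoretical_auc(beta, Sigma):
    # G = beta'(x2 - x1) ~ N(beta' Sigma beta, 2 beta' Sigma beta),
    # so P(G > 0) = Phi(sqrt(beta' Sigma beta / 2))
    s = beta @ Sigma @ beta
    z = sqrt(s / 2)
    return 0.5 * (1 + erf(z / sqrt(2)))    # standard normal CDF at z

p, rho = 400, 0.5
Sigma = ar1_cov(p, rho)
rng = np.random.default_rng(5)
beta = 0.05 * rng.standard_normal(p)       # illustrative coefficients
auc0 = theoretical_auc(beta, Sigma)
```

Scaling β up increases βᵀΣβ and therefore the theoretical AUC, which is how the constant c in the simulations pins the AUC at a target level.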
We designed six simulation scenarios to evaluate the influence of the second-level
information, the first-level information, the correlation, sample size and the dimensionality of
both levels on the prediction performance. Each simulation setting is performed with 100
replications.
Scenario 1: second-level information
We fix the first-level information and vary the second-level information degree R_β² by
changing the standard deviation σ of the error term in the regression coefficients β. Three
settings, σ = 8, 1.2, 0.7, are designed to achieve R_β² = 0, 0.1, 0.2, representing
non-informative, weakly informative, and strongly informative external
information, respectively. We fix the theoretical AUC at a level of 0.82 by
appropriately adjusting the constant c in the regression coefficients β in each simulation setting.
The sample size 𝑛 = 100, the number of first-level features 𝑝 = 400, the number of second-
level covariates 𝑞 = 200, and the correlation coefficient 𝜌 = 0.5 are kept consistent across all
simulation settings. We expect the regularized hierarchical linear discriminant analysis to
perform better when the second-level information increases.
Scenario 2: first-level information
Next we consider the opposite of the first scenario: the second-level information
degree is fixed at R_β² = 0.2, while the first-level information degree varies across settings in
which the theoretical AUC has different values of 0.73, 0.82, and 0.91. Settings for 𝑛 = 100 ,
𝑝 = 400, 𝑞 = 200 and 𝜌 = 0.5 are also kept fixed. This scenario demonstrates the effect of
incorporating identical external information 𝑍 given features 𝑋 with different levels of prediction
ability.
Scenario 3: correlation among features
We fix both the first-level and the second-level information but change the correlation
degree of the first-level features. The theoretical AUC is kept around 0.82 and the second-level
information degree is fixed at R_β² = 0.2. We set ρ = 0.2, 0.5, 0.8 for weak, medium, and strong
correlation degrees. This scenario shows that the correlation degree of the first-level features
influences the effect of incorporating external information.
Scenario 4: sample size
To evaluate the influence of the sample size on the prediction performance, we compare
three situations of 𝑛 = 100,200,300 for fixed dimensionality on both levels where 𝑝 = 400 and
q = 200. The correlation coefficient is kept at ρ = 0.5. The theoretical AUC is around 0.82 and
the second-level information degree is R_β² = 0.05. This illustrates how the effect of
incorporating external information varies with the sample size of the first-level data.
Scenario 5: first-level dimensionality
Next we study the effect of varying dimensionality on performance. In this
scenario, we use the same settings as scenario 4 (q = 200, ρ = 0.5, R_β² = 0.05) and consider
three situations with different numbers of first-level features, p = 400, 1000, 2000, given a fixed
sample size n = 100.
Scenario 6: second-level dimensionality
In the last scenario, we further examine a varying second-level dimensionality by
comparing three situations of q = 12, 100, 200 under fixed settings of n = 100, p = 400, ρ = 0.5,
and R_β² = 0.05.
5.4.2 Results
The simulation results for scenarios 1-6 are summarized in Figure 1. In each plot, the
method colored in yellow represents the theoretically best AUC for each setting. We compare the
prediction performance between the hierarchical ridge regression in purple and the standard ridge
regression in blue. Scenario 1 illustrates that, when the theoretical AUC is fixed, the amount of
second-level information has a crucial impact on the benefit the hierarchical ridge regression
derives from the external information. Incorporating second-level covariates substantially improves the
prediction performance of the standard ridge regression when the second-level is informative,
and the improvement grows as the second-level information increases. When twenty percent
of the variance in the first-level regression coefficients can be explained by the second-level
covariates, the hierarchical ridge regression achieves performance very close to the theoretical
best. However, the hierarchical ridge regression performs slightly worse than the standard
ridge regression when the second level is non-informative.
In scenario 2 we can observe considerable improvement in prediction performance across
all levels of the theoretical AUC when the second-level covariates are informative. Scenario 3
demonstrates that the effect of incorporated external information depends on the correlation
degree of the first-level features. The prediction ability of standard ridge regression is sensitive
to the first-level correlation when the theoretical AUC is kept fixed. The standard ridge
regression works well when the first-level features are highly correlated and deteriorates as the
correlation becomes smaller. However, the hierarchical ridge regression is stable and always
performs better than the standard ridge regression when the correlation coefficient varies.
Therefore, incorporating second-level covariates can improve the prediction performance across
different degrees of the first-level correlation. The improvement in performance diminishes as
the correlation of first-level features increases.
In scenario 4 we can study the effect of the sample size for fixed dimensionality 𝑝 and 𝑞
on the improvement in performance. When the sample size becomes smaller, the prediction
ability of the standard ridge declines substantially, while the hierarchical ridge regression is relatively
stable. This indicates that when the sample size is small, standard ridge regression does not
guarantee satisfactory performance; however, we can expect a substantial improvement by
incorporating external information using the proposed model. Consequently, the improvement
over the standard ridge increases as the n/p ratio decreases. Scenario 5 shows the same pattern
with respect to the n/p ratio as scenario 4. When the sample size n is fixed, the improvement over
ridge grows as the number of features p increases. Therefore, there is more potential
improvement for the regularized hierarchical regression over methods that do not use external
information for problems with higher dimensionality. We also examined the effect of the number
of second-level features 𝑞 in scenario 6. We observed more improvement over ridge for higher 𝑞 .
Figure 1. Simulation results
5.5 Application: Breast Cancer Survival Prediction on METABRIC
Dataset
To further evaluate our methods, we applied the proposed model to a dataset of breast
cancer tumors from the Molecular Taxonomy of Breast Cancer International Consortium
(METABRIC) study. The METABRIC microarray dataset is available at the European Genome-Phenome
Archive under accession EGAS00000000083. It includes cDNA microarray
profiling of around 2,000 breast cancer specimens processed on the Illumina HT-12 v3 platform
(Illumina_Human_WG-v3) (37).
of 997 samples and a validation set of 995 samples. In our analysis, we used the discovery set as
the training set to fit the model and the validation set as the test set to evaluate the model
performance in prediction. The METABRIC dataset also contains patients’ long-term clinical
outcomes and pathological variables. We dichotomized patients’ survival time at 5 years and
used this binary indicator of 5-year breast cancer survival as the outcome to
predict. After removing censored observations, 803 and 839 samples remained in the training and
test sets, respectively. The predictor variables were normalized gene expression values of primary
tumors.
In this dataset, only estrogen receptor (ER)-negative patients received chemotherapies
while ER-positive patients did not. Considering that ER-positive breast cancers occur most
frequently and that the ER-negative subtype has shown highly heterogeneous responses to different
chemotherapies (38) and is difficult to treat (39), we restricted our analysis to ER-positive
patients. Additionally, because the human epidermal growth factor receptor 2 (HER2)-positive subtype is
associated with poor clinical outcomes and is typically ER-negative (39), we further excluded
HER2-positive patients from the analysis. Eventually, 594 samples remained in the
training set and 564 in the test set.
We used four types of attractor metagenes, described in a prognostic model for breast cancer
survival (40), as an example of external information to incorporate. Among the four attractor
metagenes, three are universal metagenes that contain genes associated with mitotic chromosomal
instability (CIN), mesenchymal transition (MES), and lymphocyte-specific immune recruitment (LYM) (41).
The mitotic CIN attractor metagene consists of genes that are strongly associated with tumor grade,
and only its top 10 genes, CENPA, DLGAP5, MELK, BUB1, KIF2C, KIF20A, KIF4A, CCNA2, CCNB, and
NCAPG, are annotated. The MES attractor metagene is associated with tumor stage, and its top
10 genes, COL5A2, VCAN, SPARC, THBS2, FBN1, COL1A2, COL5A1, FAP, AEBP1, and
CTSK, are annotated. The LYM attractor metagene is strongly protective in ER-negative breast
cancer in the METABRIC dataset, and its top 10 genes, PTPRC, CD53, LCP2, LAPTM5,
DOCK2, IL10RA, CYBB, CD48, ITGB2, and EVI2B, are annotated. The last attractor metagene
consists of the two genes that are most protective in the METABRIC data set: FGD3 and
SUSD3. Therefore, the external data we incorporated consist of four variables, each
indicating the top annotated genes of one attractor metagene. In total, the 32 annotated genes
correspond to 36 Illumina probe IDs. We ranked the probes in the METABRIC dataset by the
variance of their normalized expression values. Only the top 20,000 probes with the highest variance,
along with the 36 annotated ones (20,003 unique probes in total), were used in our
analysis.
5.5.1 Results I: repetition of k-fold cross-validation
We fit the standard ridge regression on the METABRIC dataset and the hierarchical ridge
regression on the METABRIC data with metagene annotations as external information. The
optimal values of tuning parameters in both methods were selected by a 5-fold cross-validation
with AUC used as the criterion. Considering the randomness in the generation of five folds, we
repeated the cross-validation procedure 50 times with randomized fold IDs for each method. In
each replicate, the fold IDs were fixed between the two methods, and a final AUC was computed
on the test set as a measurement for the prediction performance. We found that the average test
AUC of the standard ridge regression was 0.657 while the average test AUC of the hierarchical
ridge regression was 0.691. There was an improvement of 0.034 (over 5%) in the average test
AUC by incorporating the external information from metagenes for the METABRIC dataset.
5.5.2 Results II: k-fold repeated cross-validation
The previous results are based on repeating a 5-fold cross-validation 50 times, and the
only thing that differs between replicates is the randomness involved in splitting the
training data into 5 folds. In each replicate, we obtain a pair of optimized λ₁ and λ₂ by applying
5-fold cross-validation on the training set and then compute an AUC on the test set using the
optimized λ₁ and λ₂. Therefore, we obtain 50 test AUCs after repeating the whole process 50
times and use the average of those results as a measurement of prediction performance, to avoid
the effect of randomness in splitting the data.
the effect of randomness in splitting data.
Here, we consider an alternative approach to handle randomness, which also involves
repetition of cross-validation but reports only one AUC on the test set as an evaluation of
model performance. To be more specific, we randomly split the training data into k folds and repeat
this process 10 times. Instead of choosing a pair of optimized λ₁ and λ₂ and computing a
corresponding test AUC for each split, we first obtain an average hold-out AUC over the k folds
under each split and then average the results over the 10 splits. The optimal λ₁ and λ₂ are those
that maximize this doubly averaged hold-out AUC. Finally, we compute a single AUC
on the test set using the optimal λ₁ and λ₂. Using this repeated cross-validation strategy, we found
that the test AUC of the standard ridge regression is 0.661 and the test AUC of the hierarchical
ridge regression is 0.705. There is an improvement of 0.044 (over 6%) in the test AUC by
incorporating the external information from metagenes for the METABRIC dataset. The test
AUC’s for both standard ridge regression and hierarchical ridge regression are quite close to the
average results we got from repeating the regular k-fold cross-validation 50 times in the previous
part.
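The averaging order that distinguishes this strategy can be shown in a few lines. Here the per-fold hold-out AUCs are random placeholders; in the real analysis each entry would come from fitting the model on the other k − 1 folds:

```python
import numpy as np

rng = np.random.default_rng(6)
n_lambda_pairs, n_splits, k = 6, 10, 5

# Hypothetical hold-out AUCs indexed by (lambda pair, split, fold)
auc = rng.uniform(0.5, 0.8, size=(n_lambda_pairs, n_splits, k))

# Average over folds within each split, then over the 10 splits,
# and pick the lambda pair maximizing the doubly averaged AUC.
avg_per_split = auc.mean(axis=2)
overall = avg_per_split.mean(axis=1)
best_pair = int(np.argmax(overall))
```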
5.5.3 Results III: a ratio of 2:1 vs 1:1 between training set and test set
Notice that the training and test sets in the above analysis contain nearly equal numbers of samples,
because the original discovery and validation sets in the METABRIC study have similar sample
sizes. However, the training set is usually larger than the test set, so that most samples can be used
to fit the model while only a small fraction is needed to assess performance. Therefore, we
pooled the training and test sets together, randomly split the whole dataset into a new
training set of 772 samples and a new test set of 386 samples, and repeated the 5-fold cross-validation
procedure 50 times on the new training set. The average test AUC became 0.686 for the
standard ridge regression and 0.695 for the hierarchical ridge regression. There is only a slight
improvement after incorporating external information. This result is consistent with what we
observed in the simulation studies. In section 5.4, we found that sample size is an influential factor
for the performance of standard ridge regression, while it does not appreciably affect the performance of
hierarchical ridge regression. There is considerable improvement in prediction accuracy for ridge
regression as the sample size increases. Therefore, when the sample size of the training set changes
from 594 to 772 after pooling and randomly splitting the whole dataset, the AUC of the standard
ridge regression increases from 0.657 to 0.686 while the AUC of the hierarchical ridge regression
stays almost the same, around 0.69.
5.5.4 Results IV: standardization vs non-standardization on gene expressions
In the previous two sections, we discussed two different strategies for repeating the process of
cross-validation when fitting models on the training data. Under both strategies, we fit the standard
ridge regression and the hierarchical ridge regression on standardized gene expressions. In this part,
we fit both models on gene expressions without standardization. Since the two cross-validation
strategies do not substantially affect the performance of either model, we use the repeated cross-validation
as an example in this part. We found that the test AUC increases from 0.66 to 0.68
when fitting the standard ridge regression on gene expressions without standardization, compared
to fitting it on the standardized dataset. However, the test AUC of the hierarchical ridge regression
does not change much; it is around 0.705 no matter whether we fit the model on standardized or
unstandardized gene expressions. Therefore, the improvement in test AUC from incorporating
external information dwindles from 0.04 to 0.02.
5.5.5 Results V: all samples vs a subset of ER+/HER2- tumors
In the above analyses, we restricted the samples to a subset of ER-positive and HER2-negative
tumors, because the survival of ER-negative patients is difficult to predict and HER2-positive
tumors usually demonstrate gene expression patterns very different from the others. There
were 594 tumors in the training set and 564 tumors in the test set. In this section, we redo the
analysis using all samples instead of that subset, so both the standard ridge regression
and the hierarchical ridge regression are fitted and evaluated on training and test sets with larger
sample sizes. After removing censored observations, there are 803 and 839 samples in the new
training and test sets, respectively. Remarkably, the control-to-case ratio differs between all
samples and the subset of ER-positive and HER2-negative tumors. When restricted to ER-positive
and HER2-negative tumors, the control-to-case ratio is 7.4 in the training set and 7.2 in the test
set. However, when all subjects are considered together, the control-to-case ratio is 4.5 in both
the training and test sets. This means that the 5-year survival rate among ER-positive and HER2-negative
patients is higher than the average survival rate of all patients.

We find that both the standard ridge regression and the hierarchical ridge regression perform
better when fitted on all samples compared to the subset. The average test AUC is 0.70
for the standard ridge regression and 0.72 for the hierarchical ridge regression when using all subjects,
while the average test AUC is 0.66 and 0.69 for ridge and hierarchical ridge, respectively, when using the
subset of ER-positive and HER2-negative tumors.
5.5.6 Results VI: 200 vs 20,000 probes
Previously, we ranked the probes in the METABRIC dataset by the variance of their normalized
expression values and included only the top 20,000 probes with the highest variance as predictors in the
analysis. Now we refit the models on a dataset of the top 200 probes with the highest variance and observe
that both models perform better than with 20,000 probes. Specifically, the average test AUC of
ridge regression is 0.710 and the average test AUC of hierarchical ridge is 0.725 when fitted on
the subset of ER-positive and HER2-negative tumors. We also performed the 200-probe
analysis on all subjects and obtained a slight improvement in both models: the average test AUC
of ridge rose to 0.715 and the average test AUC of hierarchical ridge increased to 0.732.
These findings are consistent with the results in section 5.5.5, where both models performed better
when fitted on all samples compared to the subset. However, since the model
performance with 200 probes is much better than with 20,000 probes when fitted on the
subset of ER-positive and HER2-negative tumors, there is little room for further improvement when fitting
the models on all subjects with 200 probes rather than 20,000.
5.5.7 Results VII: top 100 vs top 10 genes in attractor metagenes
The external information we incorporated contains four metagenes. Other than the one
consisting of the FGD3 and SUSD3 genes, each of the remaining attractor metagenes (CIN, MES and
LYM) is an average of a group of genes that are most closely correlated. In the above analysis, we
annotated the top 10 genes within each attractor metagene as one and all other genes as zero
when generating the matrix of external information. However, the three attractor metagenes CIN,
MES and LYM are originally created from 100 genes each. Therefore, we annotated the 100 genes in
each attractor metagene as one and all others as zero, creating another matrix of
external information. Fitting the hierarchical ridge regression with this new external matrix, we
obtained a test AUC of 0.68, which is lower than the test AUC (0.71) obtained using the external matrix
generated from the top 10 genes.
5.5.8 Results VIII: other variants of external information
We also explored other variants of external information and examined their effects on the
prediction performance of hierarchical ridge regression. For example, we compared
standardizing the Z matrix of external information by standard errors with dividing Z by its
column sums. In addition, we tried removing the genes that compose the LYM attractor metagene
from the external information because their coefficients are almost zero. We found that these
modifications of the external information do not appreciably affect model performance. The test
AUC of the hierarchical ridge regression is always around 0.70 when all other settings are kept
the same.
5.5.9 Results IX: including vs not including clinical variables
In the previous analyses, the predictors of all models consist of gene expressions
only. We now perform an additional analysis that includes clinical variables alongside gene
expressions as predictors. We consider two clinical variables, a continuous variable for age and a
binary variable indicating lymph node involvement in the tumors, and do not penalize their
coefficients. We use 5-fold repeated cross-validation to optimize the tuning parameters. When
fitted on gene expressions and clinical variables, both the ridge regression and the hierarchical ridge
regression perform better: the test AUC is 0.73 for standard ridge and 0.75 for hierarchical
ridge.
5.5.10 Summary
In this example of survival prediction for breast cancer patients, we applied our model to
the METABRIC gene expression data with external information obtained from attractor
metagenes. It demonstrates that incorporating external information using hierarchical ridge
regression can improve the prediction performance of standard ridge regression. In this study, we
considered different approaches to k-fold cross-validation to generate stable and
reasonable measurements of model performance. We also investigated a series of comparisons
in the analysis, such as a varying ratio between training and test set, whether or not to standardize
gene expression values, fitting the model on all subjects or on a subset of ER-positive and HER2-negative
subjects, using the top 20,000 or the top 200 probes with the highest variance as predictors,
whether or not to include clinical variables as predictors, and several variants of the external
information to incorporate.
First, there is not much difference in results between the two kinds of repetition we used
for k-fold cross-validation; under both approaches, we can avoid fluctuating results caused by
randomness in the generation of the k folds. Second, among the circumstances above, the sample size of
the training set has a noticeable impact on model performance. In general, both the ridge
regression and the hierarchical ridge have a higher test AUC when fitted on a training set with
more samples. We find that including all subjects in the analysis makes both models perform better
than using the subset of ER-positive and HER2-negative samples. In addition, an increased ratio
between training and test set is linked to a considerable improvement in the performance of ridge
regression when the analysis samples are restricted to the ER+/HER2- subset; however, a varying
ratio does not essentially affect the performance of hierarchical ridge regression. Third, the
number of predictors is another important factor in model performance. Both ridge and
hierarchical ridge perform considerably better when fitted on 200 predictors than on 20,000
predictors. Moreover, whether clinical variables are included as predictors also influences the
performance of both models: ridge and hierarchical ridge fitted on clinical variables as well as
gene expressions perform better than those using gene expressions alone. Additionally,
standardizing the gene expression data is associated with a reduction in the prediction performance
of ridge regression, but hierarchical ridge regression performs the same regardless of whether
gene expressions are standardized. Furthermore, none of the variants of the external information
extracted from attractor metagenes that we considered exhibits an evident influence on the
performance of hierarchical ridge regression.
In summary, even though the amount of improvement in test AUC over ridge regression varies across circumstances, the enhancement in performance from the external information incorporated in our model persists across all situations we explored. This analysis therefore provides a practical example showing that the proposed model, which incorporates external information about genes via a hierarchical ridge regression, has better prediction performance than the standard method.
5.6 Three strategies to choose values for tuning parameters using k-fold cross-validation
The regularized hierarchical regression model presented in (1) contains two tuning parameters λ₁ and λ₂. Their values determine the amount of shrinkage applied to the regression coefficients, so they are crucial to model performance. As mentioned in Section 5.4, we use k-fold cross-validation to optimize λ₁ and λ₂ so that the model fits the training data as well as possible. However, there are several possible variants of regular k-fold cross-validation. In this section, we present three versions of k-fold cross-validation for choosing the values of λ₁ and λ₂, and evaluate how these different approaches affect the prediction performance of the model. In strategy 1, we first find an optimal value for λ₁ using k-fold cross-validation and then optimize λ₂ given the optimal λ₁ in another cross-validation. In strategies 2 and 3, we directly apply k-fold cross-validation on a two-dimensional grid of λ₁ and λ₂, and search for the optimal combination of their values on that grid. Unlike strategy 1, these two versions optimize λ₁ and λ₂ simultaneously. In strategy 2, we use regular k-fold cross-validation on the two-dimensional grid of λ₁ and λ₂, while in strategy 3 a repeated cross-validation is used to improve stability. We implement these three versions of k-fold cross-validation separately and compare their performance across the six simulation scenarios using the same settings listed in Section 5.3.
5.6.1 Strategy 1: K-fold nested cross-validation on one-dimensional grids
In this version, we only need to specify a sequence of λ₁ values. A sequence of λ₂ values is automatically created by cv.glmnet for each λ₁. We first randomly split the original training data into k equally sized subsamples and retain a single subsample as a validation set. For each value of λ₁, we perform a nested 10-fold cross-validation on the remaining k − 1 subsamples, using cv.glmnet, to optimize λ₂. Based on this interim optimal λ₂, we compute a validation AUC for the given λ₁ on the held-out subsample. The outer cross-validation is then repeated k times, with each of the k subsamples used exactly once as the validation set. Next, we average the k validation AUCs into a single measurement for that value of λ₁. The value with the highest average validation AUC is selected as the optimal λ₁. Given this optimal λ₁, we apply another 10-fold cross-validation on the whole training data, again using cv.glmnet, to obtain the final λ₂. In our implementation, the default sequence of λ₁ consists of a series of values equally spaced between 0.01 and 10,000.
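The nested structure of strategy 1 can be sketched in Python. This is an illustrative toy, not the dissertation's R/glmnet implementation: a two-penalty ridge (in which a hypothetical λ₁ penalizes one block of coefficients and λ₂ the other) stands in for the hierarchical model, and validation MSE stands in for the validation AUC. All function names here are hypothetical.

```python
import numpy as np

def two_penalty_ridge(X, y, lam1, lam2, split):
    # Stand-in for the hierarchical model: a ridge fit in which lam1
    # penalizes the first `split` coefficients and lam2 the rest.
    p = X.shape[1]
    d = np.r_[np.full(split, lam1), np.full(p - split, lam2)]
    return np.linalg.solve(X.T @ X + np.diag(d), X.T @ y)

def cv_score(X, y, lam1, lam2, folds, split):
    # Average validation MSE over the folds (MSE stands in for AUC).
    err = 0.0
    for i in range(len(folds)):
        val = folds[i]
        trn = np.concatenate([folds[j] for j in range(len(folds)) if j != i])
        b = two_penalty_ridge(X[trn], y[trn], lam1, lam2, split)
        err += np.mean((y[val] - X[val] @ b) ** 2)
    return err / len(folds)

def strategy1(X, y, lam1_grid, lam2_grid, k=5, seed=0):
    rng = np.random.default_rng(seed)
    split = X.shape[1] // 2
    folds = np.array_split(rng.permutation(len(y)), k)
    outer_scores = []
    for l1 in lam1_grid:
        s = 0.0
        for i in range(k):
            val = folds[i]
            trn = np.concatenate([folds[j] for j in range(k) if j != i])
            # Nested CV on the k-1 training folds picks an interim lam2.
            inner = np.array_split(rng.permutation(trn), k)
            l2 = min(lam2_grid,
                     key=lambda l2_: cv_score(X, y, l1, l2_, inner, split))
            b = two_penalty_ridge(X[trn], y[trn], l1, l2, split)
            s += np.mean((y[val] - X[val] @ b) ** 2)
        outer_scores.append(s / k)  # averaged validation error for this lam1
    best_l1 = lam1_grid[int(np.argmin(outer_scores))]
    # Final CV on the whole training data gives the ultimate lam2.
    final = np.array_split(rng.permutation(len(y)), k)
    best_l2 = min(lam2_grid,
                  key=lambda l2_: cv_score(X, y, best_l1, l2_, final, split))
    return best_l1, best_l2
```

The key design point mirrors the text: λ₂ is re-optimized separately for every candidate λ₁ inside the outer loop, so the λ₂ search is customized to each λ₁ rather than shared across all of them.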
Simulation results are shown in Figure 5.2, where the method colored in purple represents the regularized hierarchical regression using strategy 1 for 5-fold cross-validation. The plots in Figure 5.2 show patterns similar to the simulation results presented in Section 5.3.2, across all six scenarios. In general, with this cross-validation strategy, the proposed regularized hierarchical regression outperforms ridge regression in both prediction accuracy and stability. In all situations except the first one in scenario 1, where the external data is non-informative, there are substantial increases in test AUC after incorporating informative external data with our model. In addition, our model exhibits less variance in test AUC than ridge regression in most situations, meaning that it performs more stably across replicates. We therefore conclude that strategy 1 provides an efficient way to carry out k-fold cross-validation for choosing λ₁ and λ₂.
Figure 5.2 Simulation results of strategy 1
5.6.2 Strategy 2: K-fold cross-validation on a two-dimensional grid
This approach is more straightforward than strategy 1. In strategy 2, we simply create a two-dimensional grid of λ₁ and λ₂ and apply regular k-fold cross-validation directly on that grid to evaluate each combination of λ₁ and λ₂. More specifically, we randomly split the original training data into k equally sized subsamples and retain a single subsample as a validation set. For each combination of λ₁ and λ₂ on the grid, we fit the model on the remaining k − 1 subsamples and compute an AUC on the validation set. We then repeat this process k times, with each of the k subsamples used exactly once as the validation set. Next, we average the k validation AUCs into a single measurement for each combination of λ₁ and λ₂. The optimal values are the pair for which the average validation AUC is maximized.
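The grid search above can be sketched as follows. As before, this is an illustrative Python toy rather than the dissertation's R implementation: a two-penalty ridge stands in for the hierarchical model, validation MSE stands in for AUC, and all names are hypothetical.

```python
import numpy as np

def two_penalty_ridge(X, y, lam1, lam2, split):
    # Stand-in for the hierarchical model: ridge where lam1 penalizes the
    # first `split` coefficients and lam2 the remaining ones.
    p = X.shape[1]
    d = np.r_[np.full(split, lam1), np.full(p - split, lam2)]
    return np.linalg.solve(X.T @ X + np.diag(d), X.T @ y)

def strategy2(X, y, lam1_grid, lam2_grid, k=5, seed=0):
    """Regular k-fold CV over the full two-dimensional (lam1, lam2) grid."""
    rng = np.random.default_rng(seed)
    split = X.shape[1] // 2
    folds = np.array_split(rng.permutation(len(y)), k)
    best, best_score = None, np.inf
    for l1 in lam1_grid:
        for l2 in lam2_grid:
            score = 0.0
            for i in range(k):
                val = folds[i]
                trn = np.concatenate([folds[j] for j in range(k) if j != i])
                b = two_penalty_ridge(X[trn], y[trn], l1, l2, split)
                score += np.mean((y[val] - X[val] @ b) ** 2)
            score /= k  # average validation error over the k folds
            if score < best_score:
                best, best_score = (l1, l2), score
    return best
```

Note that every (λ₁, λ₂) pair is evaluated on the same single random split into k folds, which is exactly the source of the instability discussed below.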
In our implementation, the default sequences of λ₁ and λ₂ are generated as follows. Given the predictor data $X$, outcomes $Y$, and an external information matrix $Z$, the maximum value of the λ₁ sequence is $\lambda_{1,\max} = 1000 \cdot \max|X^T Y|$, and the maximum value of the λ₂ sequence is $\lambda_{2,\max} = 1000 \cdot \max|(XZ)^T Y|$. The minimum values are derived from the maximums: $\lambda_{1,\min} = 0.0001 \cdot \lambda_{1,\max}$ and $\lambda_{2,\min} = 0.0001 \cdot \lambda_{2,\max}$. A sequence for each of λ₁ and λ₂ is then created independently as a series of values equally spaced between its minimum and maximum on the log scale.
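The default grid construction described above can be written compactly. This is a minimal sketch of the stated rules; the function name is hypothetical, and the real implementation may differ in details such as grid length.

```python
import numpy as np

def default_grids(X, Y, Z, n_lam=10):
    """Default tuning-parameter sequences per the text:
    lam_max = 1000 * max|X^T Y| (and 1000 * max|(XZ)^T Y| for lam2),
    lam_min = 1e-4 * lam_max, values equally spaced on the log scale."""
    lam1_max = 1000 * np.max(np.abs(X.T @ Y))
    lam2_max = 1000 * np.max(np.abs((X @ Z).T @ Y))
    lam1 = np.logspace(np.log10(1e-4 * lam1_max), np.log10(lam1_max), n_lam)
    lam2 = np.logspace(np.log10(1e-4 * lam2_max), np.log10(lam2_max), n_lam)
    return lam1, lam2
```

Anchoring the grid to max|XᵀY| (and its second-level analogue) scales the search range to the data, so the largest penalty is always strong enough to shrink all coefficients essentially to zero.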
Figure 5.3 Simulation results of strategy 2 vs strategy 1
We compare strategy 2 with strategy 1; simulation results are summarized in Figure 5.3. The method colored in purple represents the regularized hierarchical regression using strategy 2 for 5-fold cross-validation, and the orange represents the regularized hierarchical regression using strategy 1. As shown in Figure 5.3, the regularized hierarchical regression using strategy 2 to choose λ₁ and λ₂ does improve prediction accuracy over ridge regression on average. However, the improvement in test AUC is smaller than with strategy 1. Moreover, strategy 2 does not work well in terms of stability: the test AUC has more variance under strategy 2 than under strategy 1, and even more than under ridge regression, across all scenarios. In some situations, such as the low first-level information case in scenario 2, the n = 100 case in scenario 4, the p = 1000 case in scenario 5, and the q = 200 case in scenario 6, strategy 2 is extremely unstable across replicates compared to the hierarchical ridge regression using strategy 1 and the standard ridge regression.
Because of this unsatisfactory stability, we designed four experiments to examine whether strategy 2 can be made more stable. We considered two directions for enhancing stability: searching on a finer grid, and repeating the cross-validation process multiple times. When exploring the effect of one direction, we fixed the settings of the other. Taking the original settings of strategy 2 as a baseline, we ran three additional experiments: two of them modify the settings in only one direction, and the last modifies the settings in both.
Figure 5.4 Results of four experiments for situation #1 in scenario 2
We take one of the worst cases in Figure 5.3, the low first-level information case in scenario 2, as a focal example and run the four experiments on it. The first experiment uses the original settings of strategy 2, where the two-dimensional grid consists of 10 values of λ₁ and 100 values of λ₂. In the second, we explore the finer-grid direction by increasing the number of λ₁ values from 10 to 100 while keeping 100 values of λ₂. In the first two experiments, we perform the k-fold cross-validation only once. In the third experiment, we search on the original grid of 10 λ₁ values and 100 λ₂ values, but repeat the splitting of the training data into 5 folds and the cross-validation process 3 times. In the last experiment, we search on the finer grid of 100 λ₁ values and 100 λ₂ values and repeat the splitting and cross-validation 3 times. The results of these four experiments are summarized in Figure 5.4, where we observe that the stability of the hierarchical ridge regression improves only in the third experiment. This means that a finer grid of λ₁ does not help stabilize performance, while repeated cross-validation does. We therefore implement a repeated version of strategy 2 and evaluate it as strategy 3 in the next section.
5.6.3 Strategy 3: K-fold repeated cross-validation on a two-dimensional grid
This approach is a simple extension of strategy 2. Instead of randomly splitting the training data only once, we repeat the split multiple times. For each split, prediction performance is assessed by the average AUC on the validation data over the k folds. After repeating the split, and the k-fold cross-validation under each split, multiple times, the results are averaged over the splits. The optimal values of λ₁ and λ₂ are the combination for which this averaged average validation AUC is maximized. The default sequences of λ₁ and λ₂ are generated automatically from the given data in the same way as in strategy 2.
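The repeated procedure can be sketched as follows, again as an illustrative Python toy rather than the dissertation's R implementation: a two-penalty ridge stands in for the hierarchical model, validation MSE stands in for AUC, and all names are hypothetical.

```python
import numpy as np

def two_penalty_ridge(X, y, lam1, lam2, split):
    # Stand-in for the hierarchical model: ridge with lam1 on the first
    # `split` coefficients and lam2 on the rest.
    p = X.shape[1]
    d = np.r_[np.full(split, lam1), np.full(p - split, lam2)]
    return np.linalg.solve(X.T @ X + np.diag(d), X.T @ y)

def strategy3(X, y, lam1_grid, lam2_grid, k=5, n_repeats=3, seed=0):
    """Repeat the random k-fold split n_repeats times; for each (lam1, lam2)
    pair, average the per-split mean validation error over all repeats and
    return the pair minimizing that averaged average."""
    rng = np.random.default_rng(seed)
    split = X.shape[1] // 2
    grid = [(l1, l2) for l1 in lam1_grid for l2 in lam2_grid]
    totals = np.zeros(len(grid))
    for _ in range(n_repeats):
        folds = np.array_split(rng.permutation(len(y)), k)
        for g, (l1, l2) in enumerate(grid):
            err = 0.0
            for i in range(k):
                val = folds[i]
                trn = np.concatenate([folds[j] for j in range(k) if j != i])
                b = two_penalty_ridge(X[trn], y[trn], l1, l2, split)
                err += np.mean((y[val] - X[val] @ b) ** 2)
            totals[g] += err / k  # per-split average validation error
    return grid[int(np.argmin(totals))]
```

Averaging over several independent fold assignments reduces the chance that one unlucky split dominates the choice of (λ₁, λ₂), which is the stabilizing effect observed in the simulations.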
We also compare strategy 3 with strategy 1. Simulation results are shown in Figure 5.5. The method colored in purple represents a regularized hierarchical regression using strategy 3 with 5-fold cross-validation repeated 3 times, and the orange represents a regularized hierarchical regression using strategy 1.
The results show that strategy 3 does enhance the stability of our model in the situations where strategy 2 is extremely unstable, such as the low first-level information case in scenario 2, the n = 100 case in scenario 4, the p = 1000 case in scenario 5, and the q = 200 case in scenario 6. However, repeated cross-validation is not guaranteed to be more stable than regular k-fold cross-validation under all circumstances. For example, there is no obvious improvement in stability in any situation in scenario 1, in the medium and high first-level information cases in scenario 2, in any situation in scenario 3, or in the n = 200 and n = 300 cases in scenario 4. In those situations, the regular k-fold cross-validation used in strategy 2 does not lose much stability, and the repeated strategy performs similarly to the regular one. In addition, even though strategy 3 is more stable than strategy 2 in some situations, it remains less stable than strategy 1 across all scenarios. On the other hand, the average test AUCs of strategies 2 and 3 are at the same level across all scenarios, so we believe these two approaches are comparable in prediction accuracy. In summary, although strategy 3 does not perform as well as strategy 1 in either prediction accuracy or stability, it offers a way to improve the stability of strategy 2, especially in some extreme cases, with little to lose.
Considering its straightforward intuition and relatively satisfactory performance, we ultimately use strategy 3 as the k-fold cross-validation procedure for choosing the values of λ₁ and λ₂ in the implementation of the regularized hierarchical regression model.
Figure 5.5 Simulation results of strategy 3 vs strategy 1
5.6.4 Discussion
In this section, we discussed three versions of k-fold cross-validation for choosing λ₁ and λ₂. Strategy 1 differs from strategies 2 and 3 in several respects. First, the grids of values they search over are different. In strategy 1, we start from a one-dimensional grid of λ₁. For each value of λ₁, we use a nested k-fold cross-validation to find an interim optimal λ₂, and the sequence of λ₂ we search over varies across values of λ₁. By comparing the interim optimal results across the pre-specified sequence of λ₁ in an outer k-fold cross-validation, we determine the optimal value of λ₁. We then optimize λ₂ through another k-fold cross-validation given that optimal λ₁. The optimal values of λ₁ and λ₂ are thus determined separately: we first find an interim optimal λ₂ for each λ₁, then determine the optimal λ₁ based on the interim λ₂'s, and finally optimize the ultimate λ₂. In strategies 2 and 3, by contrast, the optimal values of λ₁ and λ₂ are selected simultaneously, because both approaches search over a two-dimensional grid of λ₁ and λ₂. Second, unlike strategy 1, where each λ₁ has its own sequence of λ₂, the sequence of λ₂ is fixed across values of λ₁ in strategies 2 and 3. This may be an advantage of strategy 1, because using customized sequences of λ₂ for different λ₁ appears to yield better accuracy and stability than using one overall sequence of λ₂ for every λ₁. Third, in strategies 2 and 3 both λ₁ and λ₂ are equally spaced on the log scale, whereas in strategy 1 only the sequence of λ₂ is equally spaced on the log scale; the sequence of λ₁ is equally spaced on the original scale. In strategy 2, we directly apply regular k-fold cross-validation on a two-dimensional grid. Because of its unsatisfactory stability, strategy 3 extends strategy 2 by repeating the splitting and cross-validation process multiple times on the same two-dimensional grid.
We compared these three strategies in simulation studies and found that strategy 1 performs best in terms of both accuracy and stability. It yields the largest improvement in prediction accuracy and is more stable than ridge regression across all scenarios. Strategy 2 performs slightly worse than strategy 1 but considerably better than ridge regression in average prediction accuracy; however, it is quite unstable compared to strategy 1, and even to ridge regression, in some scenarios. By using repeated cross-validation, strategy 3 performs more stably in the situations where strategy 2 loses stability most severely, but repeated cross-validation does not noticeably affect prediction accuracy.
Through the comparison of different cross-validation strategies, we learned that the values of the tuning parameters λ₁ and λ₂ are crucial to the performance of our model, and therefore the way their values are chosen substantially affects model performance. Since the regularized hierarchical model contains more than one tuning parameter, there are more possible variants of k-fold cross-validation than in the usual one-tuning-parameter case. Here we considered three versions of k-fold cross-validation and found that the extent of improvement in prediction performance over ridge regression differs across the three strategies. Among them, strategy 1 outperforms the other two in both prediction accuracy and stability. This distinction may be due to the different sequences of λ₂ used in strategy 1: we suspect that searching a customized range of λ₂ for each λ₁ is more efficient than searching a fixed λ₂ sequence for all λ₁'s when there are two parameters. We also learned that repeated cross-validation can enhance stability when regular k-fold cross-validation is not stable enough. Even though the gain in stability does not always materialize, repeated cross-validation does not sacrifice accuracy or stability.
In conclusion, questions such as how to choose values for tuning parameters and how to design k-fold cross-validation are difficult but important for models with more than one tuning parameter. Even though the three strategies we explored all produce improvements in prediction performance over ridge regression, it remains promising to explore better cross-validation strategies, or even competitive approaches other than cross-validation, to optimize the tuning parameters more efficiently.
5.7 Discussion
In this chapter, we have presented a regularized hierarchical approach that can
incorporate external information of predictors (e.g. annotation of genes) as second-level
covariates in a high-dimensional regression. By introducing a second-level regression on the
external data, the parameters of predictors that share the same annotations are shrunken towards
their average. In this way, the magnitude of shrinkage in parameters across predictors with
different annotations differs. With the informative prior knowledge on predictors, this would
lead to better estimates and therefore greater prediction accuracy than the ridge regression in
which all parameters are shrunken towards zero. In addition, the regularization on the second-
level parameters makes it possible to incorporate a high-dimensional external data. We have
shown that the proposed model is theoretically equivalent to a standard ridge regression with
transformed predictors. Therefore, parameters of this model can be estimated efficiently. Due
to the connection between linear regression and LDA, our approach can adapt to both
continuous and binary outcomes conveniently.
In simulation studies, we have shown the effectiveness of incorporating external
information in improving the prediction accuracy. The proposed model has exhibited better
predictive performance than the standard ridge regression across a wide range of scenarios,
when the incorporated external dataset is informative. We investigated the effect of multiple factors on the performance of both methods, including the degree of informativeness of the
original and the external dataset, the correlation of predictors, the sample size, and the
dimensionality of features on both levels. We found that ridge regression usually doesn’t
perform well when predicting features are weakly correlated or the sample size is small. In
these cases, there is considerable potential to improve prediction accuracy by incorporating external information. In practice, many datasets contain only a limited number of available samples, and the prediction performance of standard methods, such as ridge regression, is restricted by the small sample size. With the proposed approach, we can exploit the growing knowledge of biology and genetics as external information in a regularized hierarchical regression framework, to substantially improve the otherwise unsatisfactory performance of standard ridge regression.
In addition to effectiveness, we have also examined the robustness of the proposed method and found that it performs almost as well as ridge regression even when the external data is completely non-informative. Once the external data is informative, the proposed method clearly outperforms ridge regression. In general, as expected, the improvement over the standard method depends on how informative the incorporated dataset is. This points to the main limitation of the proposed method: the choice of external information to incorporate is crucial to its performance. When applying the method, users need to identify beneficial prior knowledge about the predictors themselves and to seek out high-quality information to incorporate, which is subject to the scope of relevant knowledge as well as the availability of data.
In the breast cancer example, we have demonstrated how to incorporate a specific type of external information (i.e., metagenes) into the analysis. Nevertheless, the prior knowledge that can be considered is not limited to the type used in this example: the proposed method provides a flexible way to incorporate various types of prior knowledge into current studies. Therefore, with the accumulating knowledge about cancers and the increasing number of available datasets, this method can potentially provide even greater improvement in predictive performance.
Finally, our work provides a flexible framework to integrate multiple types of data and
results in an improvement in predictive performance. Even though all derivations are conducted
for the combination of a linear regression on the first level and a ridge regression on the second,
this approach can be applied to any generalized linear models on the first level, including but
not limited to the logistic regression for categorical outcomes, the poisson regression for count
variables and the cox regression for survival analysis. It can also be generalized to other types
of penalties on the second level, such as the lasso and the elastic net. These are all interesting
topics for future work.
5.8 Appendix A
For fixed $\lambda_1 > 0$ and $\lambda_2 > 0$, problem (1) has solutions

$\hat{\beta} = \left(X^T M_n X + \lambda_1 M_p - \lambda_1^2 M_p Z L Z^T M_p\right)^{-1} X^T M_n y$  (a1)

$\hat{\alpha} = \lambda_1 L Z^T M_p \hat{\beta}$  (a2)

$\hat{\beta}_0 = n^{-1} 1_n^T (y - X\hat{\beta})$

$\hat{\alpha}_0 = p^{-1} 1_p^T (\hat{\beta} - Z\hat{\alpha})$

where $M_n = I_n - \frac{1}{n} 1_n 1_n^T$, $M_p = I_p - \frac{1}{p} 1_p 1_p^T$, and $L = (\lambda_1 Z^T M_p Z + \lambda_2 I_q)^{-1}$. Solutions (a1) and (a2) are expensive to compute, because they require the inverse of a $p \times p$ matrix. We now present a more efficient way to solve the problem.
Let $X = UDV^T = RV^T$ be the singular value decomposition of $X$, where $U$ is an $n \times n$ orthogonal matrix, $D$ is a diagonal matrix with elements $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$, $V$ is a $p \times n$ matrix with orthonormal columns, and $R = UD$ is also an $n \times n$ matrix. Let $V_\perp$ be a $p \times (p - n)$ matrix that spans the orthogonal complement of the column space of $V$. Then the $p \times p$ matrix $Q = (V \mid V_\perp)$ is orthogonal, so $\beta = Q Q^T \beta = (V \mid V_\perp) \begin{pmatrix} V^T \beta \\ V_\perp^T \beta \end{pmatrix}$. Let $\theta = V^T \beta$ and $\eta = V_\perp^T \beta$. Then

$\beta = V\theta + V_\perp \eta$  (a3)

Notice that $X\beta = R V^T (V\theta + V_\perp \eta) = R\theta$ and that

$\|\beta - Z'\alpha'\|_2^2 = \|Q^T(\beta - Z'\alpha')\|_2^2 = \left\| \begin{pmatrix} \theta - V^T Z'\alpha' \\ \eta - V_\perp^T Z'\alpha' \end{pmatrix} \right\|_2^2 = \|\theta - V^T Z'\alpha'\|_2^2 + \|\eta - V_\perp^T Z'\alpha'\|_2^2.$

After these substitutions, problem (1) is equivalent to

$\min_{\alpha, \alpha_0, \theta, \eta, \beta_0} \; \frac{1}{2}\|y - R\theta - \beta_0 1_n\|_2^2 + \frac{1}{2}\lambda_1 \|\theta - V^T Z'\alpha'\|_2^2 + \frac{1}{2}\lambda_1 \|\eta - V_\perp^T Z'\alpha'\|_2^2 + \frac{1}{2}\lambda_2 \|\alpha\|_2^2$  (a4)
Fix $\alpha'$ and let $\theta^* = \theta - V^T Z'\alpha'$. Then we get

$\min_{\theta^*, \eta} \; \frac{1}{2}\|y^* - R\theta^*\|_2^2 + \frac{1}{2}\lambda_1 \|\theta^*\|_2^2 + \frac{1}{2}\lambda_1 \|\eta - V_\perp^T Z'\alpha'\|_2^2$  (a5)

where $y^* = y - \beta_0 1_n - R V^T Z'\alpha'$. Minimizing the first two terms is a standard ridge regression of $y^*$ on $R$, hence $\theta^* = (R^T R + \lambda_1 I_n)^{-1} R^T y^*$ (42). Therefore,

$\theta = (D^2 + \lambda_1 I_n)^{-1} D U^T \left(y - \beta_0 1_n + \lambda_1 U D^{-1} V^T Z'\alpha'\right)$

$\eta = V_\perp^T Z'\alpha'$
Plugging $\theta$ and $\eta$ back into (a4), and after some linear algebra, we get

$\min_{\alpha, \alpha_0, \beta_0} \; \|\beta^* - Z^*\alpha\|_2^2 + \frac{\lambda_2}{\lambda_1}\|\alpha\|_2^2$  (a6)

where

$\beta^* = \mathrm{diag}\!\left(\frac{1}{\sqrt{\sigma_i^2 + \lambda_1}}\right) U^T (y - \beta_0 1_n) - \mathrm{diag}\!\left(\frac{\sigma_i}{\sqrt{\sigma_i^2 + \lambda_1}}\right) V^T \alpha_0 1_p$

$Z^* = \mathrm{diag}\!\left(\frac{\sigma_i}{\sqrt{\sigma_i^2 + \lambda_1}}\right) V^T Z$

This is a standard ridge regression of the modified outcome $\beta^*$ on $Z^*$, which gives

$\hat{\alpha} = \left(Z^{*T} Z^* + \frac{\lambda_2}{\lambda_1} I_q\right)^{-1} Z^{*T} \beta^*$  (a7)

We can eliminate the intercept term $\beta_0$ by using centered $y$ and standardized $X$. Since

$\beta = V\theta + V_\perp \eta = V\theta + V_\perp V_\perp^T Z'\alpha' = V\theta + (I_p - V V^T) Z'\alpha'$  (a8)

$\beta$ can be obtained by substituting $\theta$ and $\hat{\alpha}$; the potentially large matrix $V_\perp$ never needs to be computed. In our code, we use the glmnet package to fit the ridge regression in (a6), which gives the estimated coefficients $\hat{\alpha}$. We then plug the estimates of $\alpha$ into (a8) to obtain the estimates of the coefficients $\beta$. Compared to the original solutions (a1) and (a2), which require the inverse of a $p \times p$ matrix, the transformed solution (a7) only requires the inverse of a lower-dimensional $q \times q$ matrix. The computational cost is reduced whenever $p > q$. In practice, expression data may contain as many as 20,000 genes, while gene annotations identify 10 to 500 regions. Therefore, the improvement in computational efficiency is substantial.
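The computational savings rest on the classical SVD reduction for ridge regression (42): for a ridge fit, the $p \times p$ inverse in the direct formula can be replaced by operations on the $n$ singular values. A minimal NumPy sketch of this identity (illustrative code, not the dissertation's implementation) for a plain ridge problem with $p \gg n$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 20, 200, 3.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Direct ridge solution: requires solving a p x p linear system.
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# SVD-based solution: X = U D V^T with U (n x n), V (p x n); only the
# n singular values are "inverted" (elementwise), never a p x p matrix.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((d / (d**2 + lam)) * (U.T @ y))

assert np.allclose(beta_direct, beta_svd)
```

The appendix applies the same idea twice: the first-level fit is reduced to the singular values of $X$, and the second-level fit (a7) involves only a $q \times q$ system.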
REFERENCES
1. Fan J, Fan Y. High dimensional classification using features annealed independence rules. Annals
of statistics. 2008;36(6):2605.
2. James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning: Springer; 2013.
3. Hoerl A, Kennard R. Ridge regression, in ‘Encyclopedia of Statistical Sciences’, Vol. 8. Wiley,
New York; 1988.
4. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical
Society Series B (Methodological). 1996:267-88.
5. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal
Statistical Society: Series B (Statistical Methodology). 2005;67(2):301-20.
6. Wold S, Sjöström M, Eriksson L. PLS-regression: a basic tool of chemometrics. Chemometrics
and intelligent laboratory systems. 2001;58(2):109-30.
7. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to
sparse principal components and canonical correlation analysis. Biostatistics. 2009;10(3):515-34.
8. Witten DM, Tibshirani R. Penalized classification using Fisher's linear discriminant. Journal of
the Royal Statistical Society: Series B (Statistical Methodology). 2011;73(5):753-72.
9. Mai Q, Zou H, Yuan M. A direct approach to sparse discriminant analysis in ultra-high
dimensions. Biometrika. 2012;99(1):29-42.
10. Witten DM, Tibshirani R. Covariance-regularized regression and classification for high-dimensional problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2009;71(3):615-36.
11. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso.
Biostatistics. 2008;9(3):432-41.
12. Raudenbush S, Bryk AS. A hierarchical model for studying school effects. Sociology of
education. 1986:1-17.
13. Lange K. Elementary optimization. Optimization: Springer; 2004. p. 1-17.
14. Banerjee O, Ghaoui LE, d’Aspremont A. Model selection through sparse maximum likelihood
estimation for multivariate gaussian or binary data. Journal of Machine learning research.
2008;9(Mar):485-516.
15. Shahabi A, Lewinger JP, Ren J, April C, Sherrod AE, Hacia JG, et al. Novel gene expression
signature predictive of clinical recurrence after radical prostatectomy in early stage prostate cancer
patients. The Prostate. 2016;76(14):1239-56.
16. Yu YP, Landsittel D, Jing L, Nelson J, Ren B, Liu L, et al. Gene expression alterations in prostate
cancer predicting tumor aggression and preceding development of malignancy. Journal of clinical
oncology. 2004;22(14):2790-9.
17. Ye Q-H, Qin L-X, Forgues M, He P, Kim JW, Peng AC, et al. Predicting hepatitis B virus–
positive metastatic hepatocellular carcinomas using gene expression profiling and supervised machine
learning. Nature medicine. 2003;9(4):416.
18. Wang Y, Jatkoe T, Zhang Y, Mutch MG, Talantov D, Jiang J, et al. Gene expression profiles and
molecular markers to predict recurrence of Dukes' B colon cancer. Journal of clinical oncology.
2004;22(9):1564-71.
19. Li Y, Guan C, Li H, Chin Z. A self-training semi-supervised SVM algorithm and its application
in an EEG-based brain computer interface speller system. Pattern Recognition Letters. 2008;29(9):1285-
94.
20. Gomez-Cabrero D, Abugessaisa I, Maier D, Teschendorff A, Merkenschlager M, Gisel A, et al.
Data integration in the era of omics: current and future challenges. BioMed Central; 2014.
21. Lapatas V, Stefanidakis M, Jimenez RC, Via A, Schneider MV. Data integration in biological
research: an overview. Journal of Biological Research-Thessaloniki. 2015;22(1):9.
22. Huang S, Chaudhary K, Garmire LX. More is better: recent progress in multi-omics data
integration methods. Frontiers in genetics. 2017;8:84.
23. Bergersen LC, Glad IK, Lyng H. Weighted lasso with data integration. Stat Appl Genet Mol Biol.
2011;10(1).
24. van de Wiel MA, Lien TG, Verlaat W, van Wieringen WN, Wilting SM. Better prediction by use of co-data: adaptive group-regularized ridge regression. Statistics in Medicine. 2016;35(3):368-81.
25. Tai F, Pan W. Incorporating prior knowledge of predictors into penalized classifiers with multiple
penalty terms. Bioinformatics. 2007;23(14):1775-82.
26. Liu J, Liang G, Siegmund KD, Lewinger JP. Data integration by multi-tuning parameter elastic
net regression. BMC bioinformatics. 2018;19(1):369.
27. Lewinger JP, Conti DV, Baurley JW, Triche TJ, Thomas DC. Hierarchical Bayes prioritization of
marker associations from a genome-wide association scan for further investigation. Genet Epidemiol.
2007;31(8):871-82.
28. Chen GK, Witte JS. Enriching the analysis of genomewide association studies with hierarchical
modeling. Am J Hum Genet. 2007;81(2):397-404.
29. Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res.
2017;45(D1):D331-d8.
30. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for
the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25(1):25-9.
31. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-
throughput sequencing data. Nucleic Acids Res. 2010;38(16):e164.
32. Grant M, Boyd S, Ye Y. CVX: Matlab software for disciplined convex programming. 2008.
33. Grant MC, Boyd SP. Graph implementations for nonsmooth convex programs. Recent advances
in learning and control: Springer; 2008. p. 95-110.
34. Diamond S, Boyd S. CVXPY: A Python-embedded modeling language for convex optimization.
The Journal of Machine Learning Research. 2016;17(1):2909-13.
35. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via
coordinate descent. Journal of statistical software. 2010;33(1):1.
36. Friedman J, Hastie T, Tibshirani R. The elements of statistical learning: Springer series in
statistics New York; 2001.
37. Curtis C, Shah SP, Chin S-F, Turashvili G, Rueda OM, Dunning MJ, et al. The genomic and
transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature.
2012;486(7403):346.
38. Chen Y-Z, Kim Y, Soliman HH, Ying G, Lee JK. Single drug biomarker prediction for ER−
breast cancer outcome from chemotherapy. Endocrine-related cancer. 2018;25(6):595-605.
39. Rivenbark AG, O’Connor SM, Coleman WB. Molecular and cellular heterogeneity in breast
cancer: challenges for personalized medicine. The American journal of pathology. 2013;183(4):1113-24.
40. Cheng W-Y, Yang T-HO, Anastassiou D. Development of a prognostic model for breast cancer
survival in an open challenge environment. Science translational medicine. 2013;5(181):181ra50-ra50.
41. Cheng W-Y, Yang T-HO, Anastassiou D. Biomolecular events in cancer revealed by attractor
metagenes. PLoS computational biology. 2013;9(2):e1002920.
42. Hastie T, Tibshirani R. Efficient quadratic regularization for expression arrays. Biostatistics.
2004;5(3):329-40.
Abstract
Because of its simplicity, linear discriminant analysis (LDA) performs well in classification problems when the sample size is relatively small. However, standard LDA cannot be applied to high-dimensional genomic data, where the number of features is much larger than the sample size. We propose two different extensions of LDA for classification on high-dimensional data. The first extension is based on Scout, a family of regression and classification methods for high-dimensional settings that uses a sparse estimate of the inverse covariance matrix of the features derived from the graphical lasso. We show that introducing additional sparsity into the difference in class means results in improved predictive performance over Scout. The second extension of LDA, motivated by a connection between LDA and linear regression, is based on a two-level hierarchical regression model regularized with ridge penalties. This version of high-dimensional LDA can incorporate external information relevant to the classification task, such as gene annotations or existing information from previous studies. Through simulations we show that the proposed approaches yield better predictions than existing methods for high-dimensional LDA in a wide range of scenarios. We also illustrate our methods with an application to the METABRIC gene expression data.
Asset Metadata
Creator: Li, Sisi (author)
Core Title: Generalized linear discriminant analysis for high-dimensional genomic data with external information
School: Keck School of Medicine
Degree: Doctor of Philosophy
Degree Program: Biostatistics
Publication Date: 02/14/2020
Defense Date: 01/14/2019
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tags: data integration, gene expression, high-dimensional classification, linear discriminant analysis, OAI-PMH Harvest, regularized regression
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Lewinger, Juan Pablo (committee chair), Conti, David (committee member), Gabrys, Robertas (committee member), Siegmund, Kimberly (committee member)
Creator Email: sisili@usc.edu, sisixteen@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-122198
Unique Identifier: UC11675578
Identifier: etd-LiSisi-7073.pdf (filename), usctheses-c89-122198 (legacy record id)
Legacy Identifier: etd-LiSisi-7073.pdf
Dmrecord: 122198
Document Type: Dissertation
Rights: Li, Sisi
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA