ROBUST FEATURE SELECTION WITH PENALIZED REGRESSION IN
IMBALANCED HIGH DIMENSIONAL DATA
by
Jie Ren
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(STATISTICAL GENETICS)
August 2014
Copyright 2014 Jie Ren
To Ting, and my dear parents
Acknowledgments
First of all, I would like to express my sincere appreciation to my mentors, Dr. Juan Pablo Lewinger and Dr. Richard Watanabe. It would have been impossible for me to finish the work required for the degree without them. I was introduced to the area of statistical genetics and presented with all kinds of opportunities to learn cutting-edge methods, perform data analysis, write scientific papers, attend scientific conferences, and develop my own methods. I learnt a lot from them, not only in terms of performing scientific research, but also in terms of becoming a qualified researcher and a responsible person.
I would also like to extend my appreciation to the other committee members, Dr. Mariana Stern, Dr. Kimberly Siegmund, and Dr. Jinchi Lv. Dr. Stern gave me access to the data that motivated the ideas in this dissertation and also helped me refine my methods. I was introduced to the regularized regression framework by Dr. Siegmund, and I also learnt a lot from her about analyzing genomic data. I learnt a lot about the theoretical development of penalized regression during the course taught by Dr. Lv. I would also like to thank all the faculty in the Department of Preventive Medicine, who taught me important knowledge in biostatistics and epidemiology.
I am very grateful to all my friends, both in the Los Angeles area and all over the rest of the world. The six years as a Ph.D. student were memorable because of all the moments spent with them.
Last but not least, I am extremely grateful to my parents and my fiancé. Their love has consistently supported me during the last six years.
Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
1: Introduction
1.1 High dimensional data problems
1.1.1 Characteristics and challenges of high dimensional data
1.1.2 Penalized linear regression
1.1.3 Variable selection limitation of LASSO and relevant strategies
1.2 Ensemble methods
1.3 Imbalanced data problems
1.3.1 Problem description
1.3.2 Strategies for imbalanced data problems
1.3.3 Metrics for imbalanced problems
2: Imbalance learning in the framework of penalized logistic regression for high dimensional data problems
2.1 Motivation and introduction
2.1.1 Cutoff-dependent single assessment metrics
2.1.2 Receiver operating characteristics (ROC) curves
2.1.3 Precision-recall (PR) curves
2.2 Relevant studies
2.3 Simulation study
2.3.1 Simulation plan
2.4 Results
2.5 Discussion
3: Parameter-free feature selection with stability selection
3.1 Stability selection
3.2 SS-test: test-based method of error control
3.3 Simulation study to compare the performance of SS-thr and SS-test in balanced data
3.3.1 Simulation plan
3.3.2 Results
3.3.3 Summary
3.4 Simulation study to compare the performance of SS-thr and SS-test in imbalanced data
3.4.1 Simulation plan
3.4.2 Results
3.4.3 Summary
4: Rank-Based Stability Selection
4.1 Motivation and description of framework
4.2 Simulation plan
4.3 Results
4.4 Summary
5: Stability Selection in Ultra High-Dimensional Data
5.1 Motivation and description of framework
5.2 Simulation plan
5.3 Results
5.4 Summary
6: Application to prostate cancer recurrence study
6.1 Methods
6.2 Results
6.3 Discussion
Bibliography
List of Tables

2.1 Performance of cutoff-dependent single metrics with independent design evaluated by G-mean
2.2 Performance of cutoff-dependent single metrics with independent design evaluated by AUC-ROC
2.3 Performance of cutoff-independent metrics with independent design evaluated by G-mean
2.4 Performance of cutoff-independent metrics with independent design evaluated by AUC-ROC
6.1 Performance of models selected by metrics with 0.5 cutoff in prostate cancer recurrence data
6.2 Performance of models selected by metrics with case proportion cutoff in prostate cancer recurrence data
6.3 Features selected by LASSO and stability selection based methods in prostate cancer recurrence data
6.4 Features selected by LASSO and stability selection based methods in prostate cancer recurrence data by adapting a two-step approach
List of Figures

1.1 Augmented correlation in high dimensions
1.2 Estimation pictures for (left) LASSO and (right) ridge regression (Tibshirani 1996)
1.3 Forms of penalized least square functions (Hastie et al. 2009)
1.4 Geometric view of LARS algorithm with two predictors
1.5 Between-class imbalance and within-class imbalance (He & Garcia 2009)
2.1 Confusion matrix to evaluate classification performance
2.2 ROC curves and PR curves taken from the same models on a highly-skewed cancer detection data (Davis et al. 2005)
2.3 Performance of cutoff-dependent single metrics with independent design using G-mean (top) and AUC-ROC (bottom) for evaluation
2.4 Performance of single metrics with independent design using G-mean (top) and AUC-ROC (bottom) for evaluation
2.5 Performance of single metrics with independent design of sample size 720 using G-mean (top) and AUC-ROC (bottom) for evaluation
2.6 Performance of selected metrics compared with informative sampling using independent design and evaluated by G-mean (top) and AUC-ROC (bottom)
2.7 Performance of selected metrics compared with informative sampling using block design and evaluated by G-mean (top) and AUC-ROC (bottom)
2.8 Performance of selected "1se model"s using independent design and evaluated by G-mean (top) and AUC-ROC (bottom)
3.1 An example of regularization path (left) and the corresponding stability path (right)
3.2 Illustration of SS-test
3.3 Impact of varying the frequency cut-offs on the TPN with n = 200
3.4 Error control for SS-thr and SS-test across various critical values
3.5 Error control for SS-thr and SS-test across various numbers of active features
3.6 Error control for SS-thr and SS-test across various critical values with block design
3.7 Capability of identifying active set for different selectors given FPN at 1
3.8 Capability of identifying active set for different selectors given FPN at 1 with block design
3.9 Capability of identifying active set for SS-thr and SS-test with targeted error control bound at 1
3.10 FPN and TPN of different selectors in imbalanced data
3.11 FPN and TPN of different selectors in imbalanced data with block design
4.1 True features from the top seven features detected by different selectors
4.2 Error control for SS-rank and SS-ranktest across various critical values
4.3 Error control for SS-rank and SS-ranktest across various critical values
4.4 Capability of identifying active features for different selectors given fixed FPN at 1
4.5 Capability of identifying active features for different selectors given fixed FPN at 1
5.1 FPN and TPN for different selectors given p = 10000
5.2 FPN and TPN for different selectors given p = 10000 with block design
5.3 Capability of detecting active features for different selectors given FPN fixed at around 1, p = 10000 and with block design
6.1 The stability path when SS-thr was applied to all features in prostate cancer recurrence data. The red curves indicate features identified by the method with π_thr = 0.6
6.2 The stability path when SS-thr was applied as the second step to features remaining in prostate cancer recurrence data after filtering by a univariate screener. The red curves indicate features identified by the method with π_thr = 0.6
Abstract
This work is motivated by an ongoing USC/Illumina study of prostate cancer recurrence after radical prostatectomy. The study generated gene expression data for nearly thirty thousand probes from 187 tumor samples, of which 33 came from patients with recurrent prostate cancer and 154 came from patients with non-recurrent prostate cancer after years of follow-up. Our goal was to use penalized logistic regression and stability selection to find a "gene signature" of recurrence that can improve upon PSA and Gleason score, which are well-established but poor predictors. For interpretability and future clinical use, the gene signature should ideally involve a small proportion of probes to predict recurrence in new patients. Due to the skewness in the data, the model selected by tuning the LASSO penalty parameter based on the average misclassification rate in cross validation did not have a balanced performance, i.e., it predicted non-recurrent cancer with high accuracy but predicted recurrent cancer with very low accuracy. In addition, standard penalized regression with cross validation appeared to select many noise features. In my simulation study in Chapter 2, I evaluated the performance of models selected by different metrics in imbalanced data. I concluded that G-mean-thr (G-mean with an alternative cutoff) and the area under the ROC curve (AUC-ROC) were the most robust metrics to class imbalance. In Chapter 3, I examined the performance of stability selection (SS-thr) in simulation studies and found that its feature selection capability (a) depended on the stability cutoff chosen and (b) was conservative as a result of a stringent error control. To address these problems, I proposed new feature selectors based on stability selection, including SS-test, an essentially parameter-free test-based outlier-detection approach, and SS-rank and SS-ranktest, parameter-free rank-based methods. I demonstrated their advantage over SS-thr, ULR, and LASSO with cross validation in extensive simulation studies and also found that all these stability selection based methods were robust to class imbalance. These newly developed methods and procedures were applied to the prostate cancer recurrence data. I used a variety of metrics to do model selection within the penalized logistic regression framework using the imbalanced prostate cancer recurrence data and demonstrated that G-mean with the case-proportion cutoff selected the model with the most balanced prediction accuracy in cross validation. In addition, I showed the importance of using an appropriate cutoff to evaluate models when models were built from skewed data. I also applied stability selection based methods, including SS-thr, SS-test, SS-rank and SS-ranktest, to select important genes from the same data. Three genes, ABCC1, NKX2-1 and ZYG11A, were identified by all of the methods and also appeared to stand out from the other features in the stability path plot. I fit a logistic regression model using these genes and clinical features, which had significantly higher prediction accuracy than clinical-only models when evaluated by cross-validation.
Chapter 1
Introduction
1.1 High dimensional data problems
1.1.1 Characteristics and challenges of high dimensional data
Due to technological advances in high-throughput genomic technologies, we are currently able to measure thousands to millions of features in a given sample at relatively low cost. This high dimensionality has brought both blessings and curses. On one hand, there is an opportunity to gather information on a scale that can lead to unprecedented biological insights. On the other hand, the massive amount of data has brought numerous new challenges and difficulties in data preprocessing, storage, analysis and interpretation. These issues are not unique to genomics. High dimensional data structures have become increasingly common and essential in a variety of fields, ranging from medical studies and financial modeling to information retrieval (Fan & Li 2006).

From a statistical perspective, high dimensional data analysis is faced with three important problems: statistical accuracy, model interpretability and computational complexity (Fan & Lv 2010). In conventional statistics, the number of features p is usually much smaller than the number of observations n. In this case, standard statistical techniques such as un-penalized regression have well understood properties and the algorithms that implement them are computationally efficient. However, when the dimensionality p is comparable to, or larger than, the sample size n, standard statistical methods can break down. In addition, most algorithms are not efficient as p increases to the scale of n, and the asymptotic theory is not well developed.
For example, consider the standard multivariate linear regression model,

y = X\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 I_n)   (1.1)

in which y = (y_1, y_2, \ldots, y_n)^T is the response vector, X = (x_1, x_2, \ldots, x_n)^T is the design matrix, \beta = (\beta_1, \beta_2, \ldots, \beta_p)^T is the vector of regression coefficients, and \epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)^T is the error vector. The least squares solution to this problem is

\hat{\beta}^{OLS} = (X^T X)^{-1} X^T y   (1.2)
When the dimensionality p is comparable to n, numerical instabilities can occur as X^T X becomes ill-conditioned. Moreover, once p is larger than n, the least squares solution is not unique. Ridge regression addresses this problem by penalizing the least squares criterion, which leads to a problem requiring the inverse of X^T X + \lambda I (which exists for all \lambda > 0, Tikhonov & Arsenin 1977) rather than that of X^T X. Nevertheless, the model performance is affected by several other issues, such as spurious association and noise accumulation.
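To make this concrete, here is a minimal numerical sketch in Python (numpy only; the dimensions and the penalty value are arbitrary illustrations, not settings from this study) showing that the ridge system X^T X + \lambda I stays invertible when p > n even though X^T X itself is rank-deficient:

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                      # p > n, so X^T X is singular
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0                 # a sparse true signal
y = X @ beta_true + rng.standard_normal(n)

gram = X.T @ X
print(np.linalg.matrix_rank(gram))  # at most n (= 50) < p, so no unique OLS solution

lam = 1.0                           # illustrative ridge penalty
beta_ridge = np.linalg.solve(gram + lam * np.eye(p), X.T @ y)
print(beta_ridge[:5])               # ridge still returns a unique (shrunken) estimate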
An important problem of high dimensional data is the increased chance of finding "spurious associations". Spurious features that are independent of the response of interest at the population level can appear to be associated simply because of their high sample correlation with the true predictors. It is easy to show that, as the number of predictors increases, the maximum sample correlation and the maximum multiple correlation with a given predictor become arbitrarily high with high probability, even when all the predictors are sampled from a Gaussian distribution with independent components (Figure 1.1) (Fan & Lv 2010). Even worse, extremely high correlations between predictors, i.e. collinearity, can result in unstable parameter estimation, over-fitting and model misspecification (Fan & Lv 2010).
A second problem with high dimensional data comes from noise accumulation. Again consider the standard multivariate linear regression model. Assuming the design matrix X has full rank p, the prediction risk for a new observation using the ordinary least squares estimator \hat{\beta}^{OLS} is

E\{ n^{-1} \| \tilde{y} - X\hat{\beta} \|_2^2 \} = \left(1 + \frac{p}{n}\right) \sigma^2   (1.3)
which increases as the dimensionality increases. A similar impact of high dimensionality on noise accumulation has also been noticed in classification problems. Fan & Fan (2008) have shown that, when all features are taken into consideration, the classification rate can be as poor as random guessing due to noise accumulation in the estimation of the population centroid. Hastie et al. (2001) propose the "bet on sparsity" principle for high dimensional problems:

"Use a procedure that does well in sparse problems, since no procedure does well in dense problems."

This principle was illustrated in the context of the linear model using simulation. In each run, there were 50 observations plus 300 independent predictors, and the performance of ridge regression was compared with that of LASSO. Essentially, the ridge regression model performed equally poorly when 10, 30, or 300 nonzero features were respectively used in the true model. On the other hand, the LASSO method (with the L_1 penalty) performed reasonably well when the true model was sparse.
Given these statistical challenges of high dimensionality, feature selection turns out to be critical. In order to conduct valid feature selection, we need to assume that only a small proportion of features contribute to the prediction of the response, which is a common assumption in genetic studies. In this case, a statistical procedure that shrinks most of the parameters to zero is desirable. Such a procedure can alleviate difficulties including spurious association and noise accumulation to some extent. The resulting model is also more interpretable and computationally more efficient.

Last but not least, univariate regression methods and two-sample tests have been popular in the analysis of microarray data and genome-wide association studies (Dudoit et al. 2009, Storey & Tibshirani 2003). These methods have helped us gain insight into the biological architecture of metabolism and diseases. From the perspective of feature selection, however, they are essentially rankings based on marginal correlation with the outcome, assuming genes independently affect the disease status or quantitative phenotypes. These tests are not immune to spurious associations and might miss important signature genes (Fan et al. 2009).

In summary, high dimensional data has brought researchers not only opportunities to explore the relationship between numerous features and the outcome, but also challenges in statistical properties, model interpretation and computational complexity. Conventional methods that are feasible in low dimensional settings are likely to suffer from issues including spurious association and computational difficulty. Given a limited sample size, a reasonable strategy is to apply a sparse modeling framework. This confines the hypothesis space to relatively low dimensions, which allows us to recover the good interpretability and desired statistical properties of the model.
Figure 1.1: Augmented correlation in high dimensions
The left panel shows the distribution of the maximum absolute sample correlation coefficient of the first predictor with any of the remaining predictors. The right panel shows the distribution of the maximum absolute multiple correlation coefficient between the first predictor and the linear combination of 5 features, computed by the greedy maximizing residual algorithm (the actual values are larger than presented here). The predictors are generated from independent normal distributions, with n = 50, p = 1000 (solid curve) or p = 10000 (dashed curve), based on 100 simulations.
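The left-panel phenomenon is easy to reproduce; a rough Python sketch (numpy only, with p = 1000 and 100 replicates as in the figure, though otherwise an independent illustration) computes the maximum absolute sample correlation between the first predictor and the remaining, truly independent predictors:

import numpy as np

rng = np.random.default_rng(1)
n, p, reps = 50, 1000, 100
max_corr = []
for _ in range(reps):
    X = rng.standard_normal((n, p))      # truly independent Gaussian predictors
    Xc = X - X.mean(axis=0)
    Xc /= np.linalg.norm(Xc, axis=0)     # center and scale each column
    r = Xc[:, 1:].T @ Xc[:, 0]           # sample correlations of X_1 with X_2, ..., X_p
    max_corr.append(np.abs(r).max())
print(np.mean(max_corr))  # typically around 0.5, even though every population correlation is 0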
1.1.2 Penalized linear regression

In later parts of this section, we will discuss the properties, advantages and disadvantages of different methods in the framework of penalized linear regression (PLS). We will start with the classical best subset selection, and then introduce LASSO, a fundamental approach to deal with high dimensionality. After that, we will discuss a few methods aiming to overcome the problems of LASSO.
Best subset selection    The balance between model complexity and statistical accuracy has drawn attention for years. In general, we prefer simple models; however, complex models tend to have low bias. We could always increase the likelihood of the model by adding more predictors, but this might incur overfitting because the training likelihood measures prediction error poorly. A strategy to estimate the in-sample error is the Akaike information criterion (AIC) (Akaike 1973, Akaike 1974), which holds asymptotically as N \to \infty:

-\frac{2}{N}\,\mathrm{loglik}(\hat{\theta}) + \frac{2\,\mathrm{dim}(\hat{\theta})}{N}   (1.4)

where \mathrm{dim}(\hat{\theta}) is the dimension of the model and \mathrm{loglik}(\hat{\theta}) is the log likelihood function. In order to use AIC for model selection, we simply adopt the model that minimizes AIC among all candidate models. Additionally, the Bayesian approach to model selection motivates the Bayesian information criterion (BIC), with the generic form:

-2\,\mathrm{loglik}(\hat{\theta}) + (\log N)\,\mathrm{dim}(\hat{\theta})   (1.5)

Schwarz proves that the model with minimum BIC is equivalent to the model with the largest approximate posterior probability (Schwarz 1978).

Written in a unified form, AIC and BIC both select a model that maximizes the penalized log-likelihood:

\mathrm{loglik}(\theta) - \lambda \|\theta\|_0   (1.6)

where the L_0-norm of \theta is the number of non-zero components in \theta and \lambda \ge 0 is a tuning parameter; \lambda = 1 in AIC while \lambda = \frac{\log N}{2} in BIC. Therefore, compared to AIC, BIC tends to penalize complex models more heavily. Given \lambda and \|\theta\|_0 = m, the solution to the penalized likelihood is the subset with the largest maximum likelihood among all models of size m. Unfortunately, solving this problem is of NP-complexity. Also, subset selection is not stable because of its discrete nature (Breiman 1995).
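As a concrete illustration of equations 1.4-1.6, the small Python sketch below (Gaussian linear models on simulated data; the sizes and the true model are arbitrary choices, not taken from this study) computes AIC and BIC for a sequence of nested OLS fits:

import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + rng.standard_normal(n)

def gaussian_loglik(resid):
    sigma2 = np.mean(resid ** 2)                      # MLE of the error variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

for k in range(1, p + 1):                             # nested models using the first k predictors
    Xk = X[:, :k]
    beta = np.linalg.lstsq(Xk, y, rcond=None)[0]
    ll = gaussian_loglik(y - Xk @ beta)
    d = k + 1                                         # coefficients plus the error variance
    aic = -2 * ll / n + 2 * d / n                     # equation 1.4
    bic = -2 * ll + np.log(n) * d                     # equation 1.5
    print(k, round(aic, 3), round(bic, 3))            # both are typically minimized near k = 3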
L_1-norm regularized method    Let us use the linear regression model in equation 1.1 and consider the penalized least squares method,

\min_{\beta \in \mathbb{R}^p} \Big\{ \frac{1}{2n} \| y - X\beta \|_2^2 + \sum_{j=1}^{p} p_\lambda(|\beta_j|) \Big\}   (1.7)

in which p_\lambda(t), t > 0, is the penalty function indexed by the tuning parameter \lambda. Different penalty functions have been proposed to solve the problem. Bridge regression defines the penalty to be p_\lambda(t) = \lambda t^q for 0 < q \le 2 (Frank & Friedman 1993). This method bridges best subset selection with ridge regression, including the L_1 penalty as a special case. Tibshirani (1996) proposes a new method called the least absolute shrinkage and selection operator (LASSO) for producing sparse models. In terms of the description in equation 1.1, with the additional assumption that each x_j in the design matrix is standardized to have mean 0 and variance 1, the LASSO estimator becomes:

\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \| y - X\beta \|_2^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t   (1.8)

where t \ge 0 is a tuning parameter which controls the amount of shrinkage. When t \ge \sum_{j} |\hat{\beta}_j^{OLS}|, the solution reduces to the ordinary least squares solution. When t < \sum_{j} |\hat{\beta}_j^{OLS}|, all the components of \hat{\beta}^{OLS} shrink towards 0, with some coefficients exactly equal to 0. This property naturally reduces the LASSO model to a sparse and interpretable model.
Why does LASSO, and not ridge regression, result in a sparse model? Figure 1.2 illustrates LASSO and ridge regression geometrically when there are only two input variables (Tibshirani 1996). The residual sum of squares has elliptical contours, centered at the OLS estimate. The constraint \beta_1^2 + \beta_2^2 \le t is a circle for ridge regression, while the constraint |\beta_1| + |\beta_2| \le t is a diamond for LASSO. The solution is found where the elliptical contours first touch the constraint region. Geometrically, this contact is much more likely to occur at a corner of the diamond, therefore leaving one of the parameters \beta_j equal to 0.

Generally, ridge regression performs a proportional shrinkage. Best subset selection draws a line when the size of a model reaches some point and drops all the other variables below the line, so it can be regarded as hard-thresholding shrinkage. The shrinkage of LASSO is not as intuitive. However, when X^T X = nI_p, the LASSO estimate is equivalent to

\hat{\beta}_j = \mathrm{sign}(\hat{\beta}_j^{OLS}) \, \big( |\hat{\beta}_j^{OLS}| - \gamma \big)_+   (1.9)

where \gamma \ge 0 is related to the LASSO constraint; this is the soft-thresholding estimator proposed by Donoho and Johnstone (Donoho & Johnstone 1994). Figure 1.3 helps to understand the shrinkage of these three methods geometrically.
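To make equation 1.9 concrete, the brief Python sketch below (the coefficient values and the threshold of 1 are made-up illustrations) contrasts soft-thresholding with the hard-thresholding of best subset selection and the proportional shrinkage of ridge regression:

import numpy as np

beta_ols = np.array([-3.0, -1.2, -0.3, 0.4, 1.5, 4.0])  # hypothetical OLS estimates
gamma = 1.0                                              # illustrative threshold / penalty level

soft = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - gamma, 0.0)   # LASSO (eq. 1.9)
hard = np.where(np.abs(beta_ols) > gamma, beta_ols, 0.0)               # best-subset style
ridge = beta_ols / (1.0 + gamma)                                       # proportional shrinkage

print(soft)   # small coefficients are set exactly to zero, large ones are shifted by gamma
print(hard)   # coefficients are either kept untouched or dropped
print(ridge)  # every coefficient is shrunk by the same factor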
Given these different penalty functions, which one should we choose? Fan & Li (2001) advocate penalty functions that give estimators with a sparsity property leading to a simpler model, an unbiasedness property, especially when the true coefficient \beta_j is large, and a continuity property so that the model prediction is stable. When X^T X = nI_p, simple algebra turns the problem in equation 1.7 into a univariate penalized least squares problem:

\hat{\theta}(z) = \arg\min_{\theta \in \mathbb{R}} \Big\{ \frac{1}{2}(z - \theta)^2 + p_\lambda(|\theta|) \Big\}   (1.10)

Antoniadis & Fan (2001) show that in such a simple model, the properties can be characterized more easily:

(1) sparsity if \min_{t \ge 0} \{ t + p_\lambda'(t) \} > 0, which suggests singularity at the origin;

(2) approximate unbiasedness if p_\lambda'(t) = 0 when t is large, which suggests concavity on [0, \infty);

(3) continuity if and only if \arg\min_{t \ge 0} \{ t + p_\lambda'(t) \} = 0.

Some simple calculation suggests that the L_q penalty with q > 1 does not have the sparsity property, the L_1 penalty does not satisfy the unbiasedness condition, and the L_q penalty with 0 \le q < 1 does not satisfy the continuity condition. LASSO tends to have higher bias for predictors with large coefficients, especially when \lambda is large. Therefore, to compensate for the bias of LASSO, a smaller \lambda is preferred, and that is why LASSO tends to select many noisy variables in the model. This problem will be further reviewed in the next part of this section.
Nowadays, LASSO is almost synonymous with L_1-norm based penalized regression methods. Another important contribution to its popularity is the invention of the least angle regression algorithm (LARS) (Efron et al. 2004). LARS links LASSO to forward stagewise regression and reveals the rationale behind the optimization geometrically.

Essentially, forward selection is a greedy algorithm. When a new feature is added to the model, steps are taken as large as possible, that is, until the residual of the response is perpendicular to the newly added feature. At the other extreme, forward stagewise regression takes very small steps along the direction of the selected features, checking at each step the correlation between the residual and all the features. LARS takes an intermediate step size between these two approaches. At first, it identifies the predictor with the largest correlation with the response. Then it continuously moves the coefficient of this feature toward its OLS solution. However, instead of fitting this feature completely as in forward selection, the process stops when the correlation of the residual with another predictor equals that with the first predictor. The new direction of movement is then the equiangular path among all the selected features. This process continues until all the features are included in the model, and ends with the OLS solution. In general, LARS gives estimates very similar to LASSO. They deviate from each other only when the coefficient of an estimator crosses zero. When the estimate of a predictor changes sign, LASSO will regard it as a noise feature and drop it. Therefore, the LASSO modification of LARS is to drop a predictor from the active set when its non-zero coefficient hits zero, and then to recompute the direction using the remaining predictors in the set. The LARS algorithm is extremely efficient, requiring the same order of computation as a single least squares fit using the p predictors. This computational efficiency makes LASSO popular among researchers.
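The LARS/LASSO path is available in standard software; a short sketch using scikit-learn's lars_path on simulated data (the data and settings are arbitrary illustrations, and the use of this particular library is an assumption rather than part of this dissertation) traces how coefficients enter the model one at a time:

import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(3)
n, p = 100, 20
X = rng.standard_normal((n, p))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + rng.standard_normal(n)

# method="lasso" applies the LASSO modification that drops a predictor
# whose coefficient crosses zero and recomputes the equiangular direction.
alphas, active, coefs = lars_path(X, y, method="lasso")
print(active[:5])      # order in which features enter the active set
print(coefs.shape)     # (p, number of steps along the path)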
Figure 1.2: Estimation pictures for (left) LASSO and (right) ridge regression (Tibshirani 1996)

Figure 1.3: Forms of penalized least square functions (Hastie et al. 2009)

Figure 1.4: Geometric view of the LARS algorithm with two predictors
1.1.3 Feature selection limitation of LASSO and relevant strategies

Besides prediction accuracy, another important concern in statistical learning is feature selection. In studies of genetics and epigenetics, we are often more interested in identifying biomarkers with biological effects. As a matter of fact, given the variety of sequencing platforms in use, it is often very difficult to find a validation study with exactly the same markers. Instead, it is much easier to find a replication data set mapping to the same genes. That is, it is often more meaningful to detect a set of interesting genes than to fit a model with good prediction accuracy. In addition, given the sparse nature of L_1-norm penalization, a good feature selection procedure will further enhance the prediction accuracy.
Let S = \{ j : \beta_j^0 \ne 0 \} and assume a sparse model so that |S| = s < p. Denote by \hat{\beta}(\delta) the estimate of the coefficients produced by a procedure \delta. \delta is called an oracle procedure if it (1) identifies the right subset model: \{ j : \hat{\beta}_j \ne 0 \} = S; and (2) has the optimal estimation rate: \sqrt{n}\,(\hat{\beta}(\delta)_S - \beta_S) \to_d N(0, \Sigma^*), where \Sigma^* is the true covariance matrix knowing S (Fan & Li 2001).
It turns out LASSO has neither property. Zou (2006) proves that the convergence rate of the LASSO procedure is not optimal. With regard to consistency of feature selection, certain designs never achieve this property asymptotically; the requirement on the design is known as the irrepresentable condition (Meinshausen & Bühlmann 2006, Zhao & Yu 2006, Zou 2006, Yuan & Lin 2006). Assume S = \{1, \ldots, s_0\} consists of the first s_0 features. Let

\hat{\Sigma} = \begin{pmatrix} \hat{\Sigma}_{1,1} & \hat{\Sigma}_{1,2} \\ \hat{\Sigma}_{2,1} & \hat{\Sigma}_{2,2} \end{pmatrix}

in which \hat{\Sigma}_{1,1} is the s_0 \times s_0 matrix corresponding to the active features and \hat{\Sigma}_{2,2} is the (p - s_0) \times (p - s_0) matrix corresponding to the noise features. Then the irrepresentable condition is:

\big\| \hat{\Sigma}_{2,1} \hat{\Sigma}_{1,1}^{-1} \, \mathrm{sign}(\beta_1^0, \ldots, \beta_{s_0}^0) \big\|_\infty \le 1 - \eta \quad \text{for some } 0 < \eta < 1

Intuitively, it says that the correlation between active features and noise features has to remain relatively low in order to avoid selecting noise features. In high dimensional data structures, this condition is not trivial at all.
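The condition can be checked numerically for any given design; a minimal Python sketch follows (a toy covariance in which one noise feature is correlated with both active features; the sizes and correlation values are arbitrary illustrations):

import numpy as np

rng = np.random.default_rng(4)
n, p, s0 = 500, 10, 2                      # the first s0 features are active
cov = np.eye(p)
cov[0, 2] = cov[2, 0] = 0.7                # noise feature 3 is correlated with both
cov[1, 2] = cov[2, 1] = 0.7                # active features
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

sigma_hat = X.T @ X / n
S11 = sigma_hat[:s0, :s0]                  # active-active block
S21 = sigma_hat[s0:, :s0]                  # noise-active block
sign_beta = np.array([1.0, 1.0])           # signs of the true active coefficients

lhs = np.abs(S21 @ np.linalg.solve(S11, sign_beta))
print(lhs.max())                           # around 1.4 here, i.e. larger than 1,
                                           # so the irrepresentable condition fails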
To overcome the limitations of LASSO, other L_1-norm based penalization methods and methods using other penalty functions have been proposed. Below we focus on L_1-norm based methods by illustrating three typical ones: adaptive LASSO, stability selection and elastic net.
Adaptive LASSO was proposed by Zou (2006), and it addresses the weighted LASSO problem:

\arg\min_{\beta \in \mathbb{R}^p} \Big\{ \| y - X\beta \|_2^2 + \lambda \sum_{j=1}^{p} w_j |\beta_j| \Big\}   (1.11)

where w is a known weight vector. Note that for the standard LASSO, w_j = 1 \; \forall j. Zou proves that if the weights are data-dependent and cleverly chosen, the adaptive LASSO estimate enjoys the oracle properties. A suggested weight vector is

\hat{w} = \frac{1}{|\hat{\beta}|^\gamma}

where \hat{\beta} could be \hat{\beta}^{OLS} or \hat{\beta}^{ridge} and \gamma > 0 is a tuning parameter. Intuitively, adaptive LASSO penalizes a coefficient less when its initial estimate is large, therefore decreasing the bias of LASSO. Adaptive LASSO is also an L_1-norm based approach, in common with LASSO, so it can achieve high computational efficiency by borrowing algorithms from LASSO. However, adaptive LASSO can be problematic if the weight vector it obtains from the data is wrong. Also, when the first-step estimate of a coefficient is close to 0, the weight for that coefficient goes to infinity, making it almost impossible for the corresponding feature to be chosen by adaptive LASSO.
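Because the adaptive LASSO is a weighted L_1 problem, it can be fitted with ordinary LASSO software by rescaling the columns of X; a rough Python sketch with scikit-learn follows (ridge initial estimates, \gamma = 1, and simulated data are all illustrative choices rather than the settings used in this dissertation):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(5)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.standard_normal(n)

# Step 1: initial estimate (ridge is usable even when p is close to or above n).
beta_init = Ridge(alpha=1.0).fit(X, y).coef_
w = 1.0 / (np.abs(beta_init) + 1e-8)       # weights; small epsilon avoids division by zero

# Step 2: ordinary LASSO on rescaled columns, then undo the scaling.
X_scaled = X / w
fit = Lasso(alpha=0.1).fit(X_scaled, y)
beta_adaptive = fit.coef_ / w
print(np.nonzero(beta_adaptive)[0])        # ideally just the first three features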
Elastic net    aims to deal with a different limitation of LASSO (Zou & Hastie 2005); its penalty is

\sum_{j=1}^{p} \big\{ \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \big\}   (1.12)

where \alpha is a tuning parameter, usually chosen by cross validation. Apparently, elastic net is a compromise between ridge regression and LASSO. It has an advantage over LASSO and ridge regression when dealing with a set of strong yet correlated features: LASSO tends to randomly pick one of them, while ridge regression tends to shrink the coefficients towards each other. Also, in the p > n case, LASSO always stops when n features are in the model, while elastic net can go beyond that. Moreover, elastic net has computational complexity of the same order as LASSO.
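In scikit-learn the elastic net mixing proportion appears as the l1_ratio argument; a short illustrative fit on simulated data (all settings arbitrary, offered only as a usage sketch):

import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(6)
X = rng.standard_normal((80, 200))                # p > n
y = X[:, 0] + X[:, 1] + rng.standard_normal(80)   # two true features

# l1_ratio interpolates between ridge (0) and LASSO (1);
# cross validation chooses the penalty strength.
model = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y)
print(np.flatnonzero(model.coef_))                # features kept by the fitted elastic net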
Stability selection is a bagging method (Meinshausen & Bühlmann 2010) which takes multiple subsamples from the data and runs a LASSO procedure on each subsample. The selection probability of each feature is then calculated across these subsamples, and features with selection probability above a certain threshold are named stable features. An advanced version of stability selection runs a randomized LASSO procedure on each subsample to give the data more perturbation. The framework of randomized LASSO is similar to the weighted problem in equation 1.11; however, unlike adaptive LASSO, the weights are not data dependent but randomly chosen. The authors prove that stability selection with randomized LASSO can select features consistently under a condition on the sparse eigenvalues of the design (Candes & Tao 2007, Van De Geer & Van Houwelingen 2004, Meinshausen & Yu 2009, Bickel et al. 2009), which is weaker than the irrepresentable condition. They also demonstrate an approach to select the range of \lambda that controls the false positive rate. This method will be explored in more detail in Chapter 3. Unlike adaptive LASSO and elastic net, stability selection (at least the original method) cannot be used for prediction.
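The core of the procedure is easy to prototype: repeatedly subsample half of the data, run a LASSO at a fixed penalty, and record how often each feature is selected. A minimal Python sketch follows (the penalty, the number of subsamples, and the 0.6 stability cutoff are arbitrary illustrations; the published method additionally randomizes the penalty weights and ties the cutoff to an error control bound):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
n, p, B = 120, 300, 100
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0
y = X @ beta_true + rng.standard_normal(n)

counts = np.zeros(p)
for _ in range(B):
    idx = rng.choice(n, size=n // 2, replace=False)    # subsample half of the data
    fit = Lasso(alpha=0.1, max_iter=5000).fit(X[idx], y[idx])
    counts += (fit.coef_ != 0)

selection_prob = counts / B
stable = np.flatnonzero(selection_prob >= 0.6)          # features above the stability cutoff
print(stable)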
In summary, LASSO-based methods are popular among penalized regression methods for solving high dimensional data problems because of their favorable properties, including stable and sparse estimation and computational efficiency. However, the bias of LASSO for predictors with large coefficients often causes it to select many noise features. To overcome this obstacle, methods like adaptive LASSO and stability selection have been developed to achieve higher feature selection consistency than LASSO.
1.2 Ensemble methods

Ensemble methods build a prediction model by combining the strengths of a collection of weak learners (typically by weighted or unweighted voting). These methods outperform single classifiers when (1) each committee member surpasses random guessing, and (2) the predictions are diverse across the committee (Hansen & Salamon 1990). Statistically, ensemble methods perform well when each committee member has low bias and large variance: despite a good performance on average, such a classifier remains unattractive on its own because of its numerical instability, whereas a committee built from these individual classifiers can maintain the low bias while reducing the variance. For this reason, regression methods, which are generally stable, are seldom used as committee members. In contrast, tree-based methods are known for their good interpretability yet numerical instability, which is why many popular ensemble methods are based on tree classifiers.
Bagging    The most straightforward ensemble method is called bagging, i.e. bootstrap aggregation (Breiman 1996). Suppose the training data is Z = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N) \}. B bootstrap samples of the training set, Z^{*b}, b = 1, 2, \ldots, B, are collected. On each bootstrap sample Z^{*b}, we fit the model and obtain the prediction \hat{f}^{*b}(x) for a new test point x. The bagging estimate is the average of the predictions from the B models:

\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)   (1.13)

which is a Monte Carlo estimate of the true bagging estimate.

Bagging has been widely used to reduce the variance of certain methods. However, the correlation between the trees grown on the bootstrap samples is still high, which limits the decrease in variance.
Random forest    modifies bagging by building a large collection of de-correlated trees and averaging them (Breiman 2001). The key modification in the random forest is reducing the number of candidate features at each split from p to m; typical values for m are \sqrt{p}, or even as low as 1. Notably, this introduces a large perturbation of the data, and the trees in the random forest are very different from each other. Random forests have been shown to outperform bagging. However, such a large perturbation could be problematic for other methods, although it benefits tree methods because of their highly nonlinear nature (Hastie et al. 2009).
Boosting is another class of ensemble methods. Unlike bagging and random forests, the committee of classifiers evolves over time, and the final prediction is a weighted vote from the committee. For instance, a popular boosting algorithm, designated "AdaBoost.M1", applies this idea in a very straightforward way (Freund & Schapire 1995). Initially, an equal weight is assigned to each observation and a base classifier is trained on the data. Then, in each iteration step m = 2, 3, \ldots, M, the observation weights are modified according to whether the observations were classified correctly in the previous step. That is, the weight of an observation misclassified by the last learner is increased in the next iteration; meanwhile, learners with a lower misclassification rate receive a higher weight. At last, the final prediction is a weighted majority vote of all M classifiers. Friedman et al. (2000) showed that, statistically, AdaBoost fits an additive model of base learners by optimizing an exponential loss function. To incorporate other loss functions into the additive model framework, Friedman proposed gradient boosting, which is more robust to the structure of the data (Friedman 2001). Shrinkage and sub-sampling can be further integrated into gradient boosting to prevent overfitting problems.
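For reference, the AdaBoost.M1 reweighting scheme described above is available off the shelf; a quick illustrative fit with scikit-learn's AdaBoostClassifier on simulated data (all settings arbitrary, shown only as a usage sketch):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each boosting round reweights misclassified observations and adds a new weak learner
# (a shallow tree by default); the final prediction is a weighted majority vote.
clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))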
1.3 Imbalanced data problems

1.3.1 Problem description

In binary classification problems, the class distribution is often imbalanced; sometimes one class severely outnumbers the other (He & Shen 2007, Pearson et al. 2003). When a standard classifier is applied to imbalanced data, a model that blindly classifies all samples as the majority class can still achieve a low misclassification rate (Chawla et al. 2011). Consider the cancer classification problem: it is obviously undesirable for a model to classify all the patients as noncancerous, even though it could achieve an overall misclassification rate below 1% simply because the prevalence of cancer is low. Apparently, we need not only a framework more suitable for imbalanced data problems but also a system to select models with a more balanced misclassification rate across classes. The former problem is generally addressed by data- or algorithm-based modifications of classical methods, which are reviewed in this section. The latter problem concerns which metric is better suited for imbalanced classes, which is described extensively in Chapter 2.
Notably, between-class imbalance might be only the tip of the iceberg. Within-class imbalance, that is, the imbalanced distribution of subclasses within a class, can be masked by between-class imbalance (Jo & Japkowicz 2004, Japkowicz 2003, Prati et al. 2004). In Figure 1.5b, clusters A and B respectively represent the dominant majority and dominant minority class, while clusters C and D represent the minority subclass of the minority class and of the majority class, respectively. Cluster C could be easily identified if all the data from cluster B were removed; otherwise, it will probably be regarded as noisy data due to its underrepresentation (He & Garcia 2009). A close connection has been observed between within-class imbalance and small disjuncts (Japkowicz & Stephen 2002, Jo & Japkowicz 2004, Japkowicz 2003, Prati et al. 2004).
The problem of small disjuncts concerns heterogeneous concepts: rules that cover a small cluster of examples pertaining to the main concept, which arise as a cluster separate from the "major disjunct". Although found in both the majority and the minority class, small disjuncts are more difficult to differentiate from noise in the minority class (He & Garcia 2009). For example, consider trying to differentiate diabetes patients from healthy people using genetic data. Type I diabetes greatly differs from type II diabetes in terms of genetic architecture, and it only accounts for about 5% of diabetes cases. When we train the model, type I diabetes patients are likely to form a small disjunct. Since type I diabetes patients are highly underrepresented, there is a great chance they will be regarded as noise. Therefore, the misclassification rate for type I diabetes patients would be high if the model were used for prediction.
Figure 1.5: Between-class imbalance and within-class imbalance (He & Garcia 2009)
(a) shows a data set with between-class imbalance; (b) shows a complicated data set
with both between-class imbalance and within-class imbalance.
1.3.2 Strategies for imbalanced data problems

Sampling methods    Huge efforts have been made to address the imbalanced data problem with stratified sampling strategies that yield a more balanced sample. Kubat et al. (1997) selectively under-sampled the majority class while keeping the original minority class. They used the 1-NN misclassification rule and the Tomek links concept (Tomek 1976) to remove noisy samples and borderline examples, respectively, from the majority class. Another strategy is to classify all the samples in the borderline region as minority, as proposed in the SHRINK system (Kubat et al. 1997). Japkowicz (2000) studied the effect of randomly oversampling the minority class and randomly subsampling the majority class. The results showed that both strategies were very effective, although there was no significant improvement when sampling was applied to borderline examples only. Ling & Li (1998) combined these strategies and found that the combination hardly outperformed the undersampling approach alone. Therefore, undersampling the majority class seems to be the better strategy for finding a good model.
Despite their popularity, undersampling and oversampling can incur problems such as loss of information and overfitting (Mease et al. 2007). Informed under-sampling methods built on ensembles have been proposed to address this issue, such as EasyEnsemble and BalanceCascade (Liu et al. 2009). EasyEnsemble applies the concept of bagging by building a committee composed of random samples from the majority group; the final decision is based on multiple classifiers, each learned from one such subsample together with the minority class. BalanceCascade, on the other hand, borrows the idea of boosting. It starts by drawing a random subsample from the majority class. Subsequently, correctly classified examples in the majority class are removed from the sampling pool while the misclassified ones are sampled again. That is, hard examples are recycled and processed by another classifier. These methods can fix the problem of information loss caused by subsampling.
To improve the over-sampling approach, Chawla et al. introduced the synthetic minority oversampling technique (SMOTE) algorithm, which generates new data based on the feature space of existing minority examples (Chawla et al. 2011). For each sample x_i in the minority class, they randomly selected one of its K nearest neighbors \hat{x}_i, and created a new example along the line connecting x_i and \hat{x}_i. These synthetic samples break the ties introduced by simple oversampling and alleviate the effect of overfitting.
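The interpolation step at the heart of SMOTE is only a few lines; a rough Python sketch of the idea follows (not the reference implementation; in practice a maintained implementation such as the one in the imbalanced-learn package would normally be used):

import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]              # k nearest neighbors, excluding itself
        j = rng.choice(nbrs)
        lam = rng.random()                         # position along the connecting line
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

X_minority = np.random.default_rng(8).standard_normal((20, 5))
print(smote_like(X_minority, n_new=40).shape)      # (40, 5)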
Numerous sampling ideas have been developed on the basis of the SMOTE algorithm. For example, Borderline-SMOTE was introduced to fix the problem that new samples are generated without considering the classes of their neighbors (Han et al. 2005). The minority samples are classified into "SAFE", "DANGER", or "NOISE" instances based on the composition of their neighbors. No synthetic samples are generated for "NOISE" instances. As for "DANGER" instances, which have over half of their neighbors from the majority class, Borderline-SMOTE only creates new instances for those minority examples "closer" to the border.

One drawback of the SMOTE family is computational complexity. These methods are computationally expensive because of their reliance on local search, and may break down in high dimensions due to the curse of dimensionality (Bellman 1961): all the examples lie on the edge of the feature space and are very distant from each other.
Cost sensitive learning    Another way of addressing the imbalanced class problem is to assign different costs to misclassified samples, mainly by means of a cost matrix (Pazzani et al. 1994, Domingos 1999). For instance, let C(i, j) be the cost of classifying an i-th class example as the j-th class. Then, typically, we set C(i, i) = 0 and C(i, j) > C(j, i) if the prevalence of the i-th class is lower. We aim to minimize the total cost after the cost matrix is implemented.
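In practice, the simplest weighting approach folds the cost matrix into per-class observation weights; a brief sketch using scikit-learn's class_weight argument on simulated data (the 10:1 cost ratio and other settings are purely illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Roughly 10% minority class; the weights argument controls the class proportions.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
costed = LogisticRegression(max_iter=1000,
                            class_weight={0: 1, 1: 10}).fit(X, y)  # minority errors cost 10x

print(confusion_matrix(y, plain.predict(X)))   # few minority samples predicted positive
print(confusion_matrix(y, costed.predict(X)))  # sensitivity improves at the cost of more FP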
Most cost-sensitive methodologies fall into three categories based on how the cost matrix is integrated. Weighting methods integrate the cost matrix before model building; that is, weights are assigned to samples from different categories according to the cost matrix (Zadrozny et al. 2003, Maloof 2003). The second category is algorithm-based methods, which integrate the matrix into the specific algorithm. For instance, motivated by the success of AdaBoost (Freund & Schapire 1995), Sun et al. (2007) propose AdaC1, AdaC2 and AdaC3, which incorporate the cost matrix into the weight-updating steps inside AdaBoost. As a result, examples with a higher cost are more likely to receive a higher weight. These modified algorithms outperformed AdaBoost on imbalanced data.

The last category integrates the cost matrix after model building, such as the MetaCost framework built on top of a general ensemble method (Domingos 1999). After obtaining a committee of classifiers, the prediction originates from votes weighted by a combination of the conditional probability and the cost for each class. Another example comes from tuning the decision cutoff when models output a probability for each class. Due to their post-hoc nature, these methods are closely related to the choice of summary metrics.

The legitimacy of all three categories relies on a valid cost matrix, whose estimation, unfortunately, is very challenging and discourages practical application. In this case, summary measures such as ROC curves are useful since they can evaluate performance over a range of different settings; these will be reviewed in the next chapter.
One-class learning, as another family of methods, aims to recognize instances of a concept by using mainly, or merely, one class of examples. One-class learning is particularly useful for extremely imbalanced classes in high dimensional data structures (Raskutti & Kowalczyk 2004), and it surpasses discriminative methods under conditions such as multi-modality of the feature space (Japkowicz 2001). One-class learning methods can be categorized into one-class support vector machines (SVM) (Tax 2001, Manevitz & Yousef 2002) and autoencoder methods (Japkowicz 2000, Japkowicz 2001).
1.3.3 Metrics for imbalanced problems

Simple measures like the misclassification rate suffice to assess algorithm validity on balanced data, but probably not on imbalanced data. Therefore, we need metrics that take both the majority and the minority class into consideration and that are relatively robust to diverse imbalance ratios.

These metrics can be divided into three groups based on their nature: cutoff-dependent single assessment metrics, including the F-measure and G-mean; cutoff-independent single assessment metrics, such as the area under the ROC curve and the break-even point of the precision-recall curve; and cutoff-independent whole-curve metrics, such as the ROC curve itself. They will be thoroughly reviewed in Chapter 2.

In summary, when standard classifiers are applied without adjustment to discriminate imbalanced data, it is likely that most observations in the minority group will be classified as the majority class. This problem can be fixed, or partially fixed, by clever sampling schemes or cost-sensitive learning. Equally important is the choice of robust summary metrics for performance evaluation and model selection. We will continue to elaborate these concepts and present the work on imbalanced class problems in high dimensional data in Chapters 2 and 3.
Chapter 2
Imbalance learning in the framework of penalized logistic regression for high dimensional data problems
2.1 Motivation and introduction
This work is motivated in part by an ongoing USC/Illumina study of prostate cancer recurrence after radical prostatectomy. The study generated gene expression data for nearly thirty thousand probes from 187 tumor samples, of which 33 came from patients with recurrent prostate cancer and 154 came from patients with non-recurrent prostate cancer after years of follow-up. Our goal was to build a penalized logistic regression classifier to predict the recurrence of new patients. However, the model selected by tuning the LASSO penalty parameter based on the average misclassification rate in cross validation was neither balanced nor stable. Here, balance means the misclassification rates in both cases and controls were reasonably low when the model was used to predict new test samples. For example, a model that classifies all samples as controls could still achieve a low overall misclassification rate in the presence of imbalanced data, but it is not useful because it gives no help in predicting the cases. Stability means the selected best model is robust to perturbations of the data and therefore more likely to perform well on new samples. Well-known alternatives to the misclassification rate are prediction metrics such as the area under the ROC curve (AUC-ROC) and Cohen's Kappa. However, their performance in imbalanced data contexts is not clear, especially in a high dimensional feature space. I decided to perform a simulation study to determine whether there is a prediction metric for model selection with imbalanced data that performs well, especially in the high dimensional feature context. In addition, sampling methods could be utilized to generate balanced subsamples to further improve performance. In our simulations we studied the effect of sampling methods in conjunction with several prediction metrics. Below follows a review of the most common metrics used for imbalanced data.
2.1.1 Cutoff-dependent single assessment metrics

Although many classifiers (e.g. support vector machines) simply return a classification rule, some, including logistic regression, also return class probabilities. For a binary classification problem, a probability cutoff at p = 0.5 is often used, but alternative cutoffs will yield different classification rules implying different trade-offs between sensitivity and specificity. Once the cutoff is determined, we can draw a confusion matrix to evaluate classification performance, as illustrated in Figure 2.1. Here [p, n] are the true positive and negative class labels, while [Y, N] are the predicted positive and negative labels. The counts in each cell depend on the chosen cutoff; thus summary metrics based on the confusion matrix are also cutoff-dependent.
In this chapter, we designate the minority class as the positive class and the majority class as the negative class, and only consider the binary classification case. With this convention, common prediction metrics are easily derived from the confusion matrix:

Accuracy = \frac{TP + TN}{P_C + N_C}   (2.1)

Misclassification\ Rate = 1 - Accuracy   (2.2)

Recall = Sensitivity = \frac{TP}{TP + FN}   (2.3)

Specificity = \frac{TN}{TN + FP}   (2.4)

Precision = \frac{TP}{TP + FP}   (2.5)

Sensitivity is equivalent to the true positive rate (TPR) and specificity is equivalent to 1 - false positive rate (FPR). Most of these metrics describe the misclassification rate in one of the two classes. Although accuracy is a summary metric taking both classes into consideration, a simple derivation shows that Accuracy = \pi\,SN + (1 - \pi)\,SP, where the weight \pi = P_C / (P_C + N_C) is a function of the class distribution. Thus, either specificity or sensitivity will dominate the value of accuracy when the data is imbalanced.
In contrast, metrics that attempt to average the effect of two summary measures are less sensitive to the class distribution. Three examples are:

Kappa = \frac{Accuracy - P_e}{1 - P_e}, \quad P_e = \frac{Y}{Y + N}\cdot\frac{P_C}{P_C + N_C} + \frac{N}{Y + N}\cdot\frac{N_C}{P_C + N_C}   (2.6)

G\text{-}Mean = \sqrt{SN \cdot SP}   (2.7)

F_\beta = \frac{(1 + \beta^2)\, Recall \cdot Precision}{Recall + \beta^2\, Precision}   (2.8)

Specifically, the F-measure combines precision and recall, and uses the coefficient \beta to weight the importance of precision against recall (Van Rijsbergen 1974). When \beta = 1, the F-measure is the harmonic mean of precision and recall. The geometric mean (G-mean) naturally combines the accuracy in both classes by taking the geometric mean of sensitivity and specificity (Kubat & Matwin 1997). Cohen's kappa (Cohen 1968), on the other hand, is used to measure the agreement between predicted and observed classifications, while correcting for agreement occurring by chance alone.
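For reference, the metrics in equations 2.1-2.8 can be computed directly from a confusion matrix; a short Python sketch follows (the counts are made up to illustrate an imbalanced test set, not results from this study):

import numpy as np

# Hypothetical confusion-matrix counts for an imbalanced test set.
TP, FN = 12, 8      # positive (minority) class: P_C = 20
FP, TN = 15, 165    # negative (majority) class: N_C = 180

accuracy    = (TP + TN) / (TP + FN + FP + TN)            # eq. 2.1
sensitivity = TP / (TP + FN)                             # recall, eq. 2.3
specificity = TN / (TN + FP)                             # eq. 2.4
precision   = TP / (TP + FP)                             # eq. 2.5
g_mean      = np.sqrt(sensitivity * specificity)         # eq. 2.7
f1          = 2 * precision * sensitivity / (precision + sensitivity)  # eq. 2.8, beta = 1

p_yes = (TP + FP) / (TP + FN + FP + TN)                  # predicted-positive proportion
p_pos = (TP + FN) / (TP + FN + FP + TN)                  # true-positive proportion
p_e   = p_yes * p_pos + (1 - p_yes) * (1 - p_pos)        # chance agreement
kappa = (accuracy - p_e) / (1 - p_e)                     # eq. 2.6

print(round(accuracy, 3), round(g_mean, 3), round(f1, 3), round(kappa, 3))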
Given their simplicity, these measures can be easily integrated into a model selection framework: once the probability cutoff is set, we can generate a confusion matrix and compute them directly. However, this simplicity makes comparisons between different models over a range of settings difficult. For example, it is difficult to compare the performance of two models over a range of decision cutoff values.
Figure 2.1: Confusion matrix to evaluate classification performance
2.1.2 Receiveroperatingcharacteristics(ROC)curves
ROC curves are recommended to tackle the abovementioned issue. The ROC curve is
obtained by plotting the TPR(p) (sensitivity) against FPR(p) (1-specificity) for the entire
range of possible cutoff probability p (Provost & Fawcett 1997, Provost et al. 1998).
Evidently, each point in the ROC space corresponds to a confusion matrix. Generally, a
point above the diagonal corresponds to a classifier exceeding random guessing, while a
point closer to the topleft corner, i.e. point (0,1), corresponds to a classifier with higher
discriminative capability. If points on curve A are all above or equal to points on curve
B in ROC space, termed as “A dominates B”, the model corresponding to A definitely
surpasses the other model. Otherwise, they are only conditionally better.
Various metrics related to ROC curves have been proposed to aid model selection, the main
one being the AUC, the area under the ROC curve (Bradley 1997). The AUC is equivalent
to the Wilcoxon rank-sum statistic and to the probability that a randomly chosen positive
instance is rated higher than a randomly chosen negative one (Hanley & McNeil 1982).
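As a rough illustration of this rank-based interpretation (a sketch rather than code from the dissertation; the function name is hypothetical), the AUC can be computed directly from the ranks of the predicted probabilities:

# AUC via the Mann-Whitney/Wilcoxon rank formulation: the probability that a
# randomly chosen positive is scored higher than a randomly chosen negative.
auc_rank <- function(prob, label) {    # label: 1 = positive, 0 = negative
  r  <- rank(prob)                     # mid-ranks handle tied probabilities
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}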
2.1.3 Precision-recall (PR) curves
Commonly used in information retrieval, PR curves plot precision against recall (Manning
& Schütze 1999), and the top-right corner of PR space is optimal. Figure 2.2 shows
ROC curves and PR curves obtained from a highly skewed cancer detection data set (Davis
et al. 2005). The ROC curves indicate that the performance of both algorithms is comparable and
fairly close to optimal. However, the PR curves suggest that Algorithm 2 is clearly
better than Algorithm 1, and that both have ample room for improvement. The difference is
explained by the highly skewed class distribution: ROC analysis is overly optimistic because
a large change in the number of false positives (FPN) may only slightly alter the FPR due to the
large denominator N_C (Davis & Goadrich 2006). The PR curve can therefore be more advantageous
for model selection, since precision is more sensitive to changes in FPN.
There is a one-to-one correspondence between a point in PR space and a point in
ROC space. Therefore, a curve that dominates in ROC space also dominates in PR space
(Davis & Goadrich 2006). In particular, a convex hull can be constructed in PR space
from the points of a convex hull in ROC space and further interpolated into an achievable
PR curve; the area under the achievable PR curve (AUC-PR) is a useful summary
metric. Another useful summary metric from the PR curve is the precision value at the
break-even point, at which precision equals recall.
Above we reviewed different metrics that are more sensitive to imbalanced ratios
than naive accuracy rates. These measurements could be easily implemented for model
selection and performance evaluation in any framework. In section 2.3 we evaluated
their performance in a high dimensional feature space.
Figure 2.2: ROC curves and PR curves obtained from the same models on a highly-skewed
cancer detection data set (Davis et al. 2005)
2.2 Relevant studies
The imbalanced class problem for high-dimensional data has been studied using ran-
dom sampling and informative sampling methods. Lin & Chen (2013) evaluated the
performance of diagonal linear discriminant analysis (DLDA), SVM, and random forest
under the effect of various imbalance ratios, sample sizes and correlation structures by
using metrics including sensitivity, specificity, geometric mean, overall accuracy and
F-measure. DLDA turned out more robust to extreme class imbalance than SVM and
random forest. All three algorithms showed considerable improvements after adopting
under-sampling ensemble modification inspired by the informative sampling idea. The
performance of SVM improved by adjusting the decision cutoff in order to favor minor-
ity class prediction. Interestingly, the effect of class imbalance on performance was less
pronounced in the presence of correlation structure.
Blugas et al. (2010) conducted more extensive studies that added further classifiers,
such as K-nearest neighbors and penalized logistic regression (PLR), and used performance
evaluation metrics including overall accuracy, class-specific accuracy and the area under
the ROC curve. Similarly, DLDA was the most robust standard approach, and random
under-sampling or informative under-sampling considerably enhanced the performance
of most classifiers. Particularly, the performance of PLR was greatly boosted by combi-
nation with a sampling scheme or adjustment of decision threshold.
Here we want to investigate a different aspect of the imbalanced data problem. The
performance of penalized logistic regression depends on the selection of tuning parameters,
such as λ in the LASSO, or λ and α in the elastic net. In practice we usually choose
the hyper-parameters that minimize the misclassification rate or the binomial deviance in cross
validation. In imbalanced data, however, this procedure can be troublesome because
the misclassification rate does not treat the imbalanced classes "fairly", as previously reviewed.
Our goal is to study alternative strategies for tuning penalized logistic regression
when the data distribution is skewed. To our knowledge, the strategy of what prediction
metric to use when tuning the LASSO or elastic net penalties by cross-validation in
the imbalanced data context has not been investigated thoroughly in the literature. This
study provides guidance for building predictive models with imbalanced, high dimensional
data. I use simulation studies to evaluate the performance of a series of metrics as selection
criteria, as well as the performance of integrating sampling strategies. The methods with
good performance in the simulation study are further examined in real studies in Section 2.4.
2.3 Simulation study
2.3.1 Simulation plan
Within each replicate of simulation, high dimensional imbalanced training set and bal-
anced test set were generated from the same distribution. Multiple penalized logistic
regression models were built from the training set, each corresponding to a particular
combination of the tuning parameters α and λ. The average performance for each model
in the training set was measured by a variety of metrics in cross validation, and for each
metric, the model with the best performance was selected. Each model was also applied
to the test set, with the performance measured by the same metrics. I was interested in
the performance of the metric-selected best models in the test set. Their performances
were also compared to that of the classifier generated by informative sampling. The
comparison of performances was investigated in a variety of settings including the imbalance
ratio, sample size, number of effective variables, correlation structure, and best-model
criterion, as detailed in the following paragraphs.
Balanced test sets were used to evaluate the performance of the models. I believe this is a
fair way to compare different metrics: the metrics investigated in this study should give a
similar ordering of model preferences in a balanced data set, and the simulation results
confirmed that this was the case.
Data generation. We first constructed an N × p design matrix X, where we fixed the
number of features at p = 1000 and the population size at N = 50000 so that the pool
was large enough for sampling. The design matrix had the following types according to the
correlation structure:
(a) Independent: the p predictors followed an N(0, I_p) distribution, in which I_p is
a p × p identity matrix.
(b) Equi-correlation: the p predictors followed an N(0, Σ) distribution, where the population
variance-covariance matrix Σ is the p × p matrix with 1 on the diagonal and r in every
off-diagonal entry, i.e. Σ_ij = r for all i ≠ j.    (2.9)
Three values of r were tested in the simulation: 0.1, 0.5, and 0.9. We included 0.9 to
evaluate the performance of the different metrics in more extreme scenarios. Note that
setting (a) is a special case of (b), in which r = 0.
(c) Toeplitz or autocorrelated: the p predictors followed an N(0, Σ) distribution in
which Σ_ij = r^|i−j|. Two values of r were tested: 0.9 and 0.99.
(d) Toeplitz design within blocks: the p predictors followed an N(0, Σ) distribution,
where Σ_ij = r^|i−j| if i and j are in the same block, and Σ_ij = 0 otherwise.    (2.10)
In this setting we constructed 10 blocks, each of which had 100 predictors, and fixed
r = 0.9.
(e) Design matrix constructed using the sample covariance matrix calculated from real
data as the population covariance matrix.
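As an illustration of the correlation settings above (an assumed sketch, not the dissertation's own code; names such as make_sigma are hypothetical), the population design matrix for the equi-correlation and Toeplitz cases can be simulated with MASS::mvrnorm:

# Simulate an N x p multivariate normal design matrix for settings (b) and (c).
# (p = 1000 and N = 50000 as in the text; smaller values can be used for testing.)
library(MASS)
make_sigma <- function(p, r, type = c("equi", "toeplitz")) {
  type <- match.arg(type)
  if (type == "equi") {
    S <- matrix(r, p, p); diag(S) <- 1; S      # 1 on the diagonal, r elsewhere
  } else {
    r^abs(outer(1:p, 1:p, "-"))                # Sigma_ij = r^|i - j|
  }
}
p <- 1000; N <- 50000
X_equi <- mvrnorm(N, mu = rep(0, p), Sigma = make_sigma(p, 0.5, "equi"))
X_toep <- mvrnorm(N, mu = rep(0, p), Sigma = make_sigma(p, 0.9, "toeplitz"))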
Given a design matrix X constructed using any of the settings above, I next specified
the active set by setting β^(0) for a randomly chosen set S. I tested different sizes of the
active set, s = |S| = 4, 7, 15, in the simulation. For s = 4, I fixed β_0^T = (1.5, −2, 1, 0.7).
For s = 7, I fixed β_0^T = (1, −0.5, 0.7, −1.2, 0.9, 0.3, 0.55). For s = 15, I fixed β_0^T =
(1, −0.7, 0.5, 0.4, 0.4, −0.35, 0.3, 0.3, 0.3, 0.25, −0.25, −0.25, 0.25, 0.2, −0.2).
Given the active set and its coefficients, I used a logistic regression model to simulate
the responses. First, the predictors for an observation, x_n, were randomly sampled
from the design matrix X, and then the response y_n was sampled from the Bernoulli
distribution Bernoulli(expit(α_0 + β_0^T x_n)), where the intercept α_0 was set to achieve an
approximate population disease prevalence of 5%.
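A minimal sketch of this outcome-generating step is given below; the helper names and the grid search for the intercept are assumptions, since the dissertation only states that the intercept was chosen to give roughly 5% prevalence.

# Draw binary responses from the logistic model and tune the intercept so that
# the population prevalence is approximately the target value.
expit <- function(z) 1 / (1 + exp(-z))
find_intercept <- function(X, beta0, target = 0.05) {   # crude grid search (assumption)
  lp   <- drop(X %*% beta0)
  grid <- seq(-8, 0, by = 0.05)
  prev <- sapply(grid, function(a) mean(expit(a + lp)))
  grid[which.min(abs(prev - target))]
}
simulate_y <- function(X, beta0, intercept) {
  rbinom(nrow(X), size = 1, prob = expit(intercept + drop(X %*% beta0)))
}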
For each simulation replicate, a training set consisting of n_1 cases and n_0 controls was
generated using the above method. When one class reached its target size n_i, i = 0, 1,
sampling continued until the target number for the other class was achieved (any
new sample falling in the already "full" class was discarded).
The total sample size for the training set took values of 360 and 720. In the 360 case,
the ratio of case to control ranged from 1:1, 1:2, 1:3, 1:4, 1:5, 1:6.2, 1:7, 1:8, 1:9, 1:11,
1:14 to 1:17. In the 720 case, the ratio of case to control ranged from 1:1, 1:2, 1:3, 1:4,
1:5, 1:6.2, 1:7, 1:8, 1:9, 1:11, 1:14, 1:17, 1:19, to 1:23. A test set of 100 cases and 100
controls was generated in the same way as the training set.
Model building and performance evaluation. I applied LASSO and elastic-net regularized
generalized logistic regression (RGLR) (Friedman et al. 2010) to the training set,
with the elastic net parameter α taking the values 0.2, 0.4, 0.6, 0.8 and 1. The regularization
path for the other parameter, λ, was selected by the original algorithm specified
in the R package glmnet. Five-fold cross validation was used to find the best combination of
(α, λ). The summary metrics used to perform model selection included accuracy, G-mean,
F-measure, κ, accuracy-thr, G-mean-thr, F-thr, κ-thr, binomial deviance, AUC-ROC,
AUC-PR, and the precision-recall breakeven point. The suffix "thr" after the name of a metric
indicates that the case-proportion cutoff, n_1/(n_1 + n_0), was used instead of the 0.5 cutoff
when evaluating performance.
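As a sketch of how such metric-based tuning can be implemented (an assumed implementation, not the code used in the dissertation; cv_gmean and its arguments are hypothetical), the following computes the cross-validated G-mean with the case-proportion cutoff over the glmnet λ path for a fixed α:

# 5-fold CV G-mean for a fixed elastic net alpha, using the case-proportion cutoff.
# y is assumed to be coded 0/1 with 1 = case.
library(glmnet)
cv_gmean <- function(x, y, alpha, nfolds = 5) {
  cutoff <- mean(y)                                    # case-proportion cutoff n1/(n1+n0)
  folds  <- sample(rep(1:nfolds, length.out = length(y)))
  lambda <- glmnet(x, y, family = "binomial", alpha = alpha)$lambda   # common lambda path
  gmeans <- sapply(1:nfolds, function(k) {
    fit   <- glmnet(x[folds != k, ], y[folds != k], family = "binomial",
                    alpha = alpha, lambda = lambda)
    prob  <- predict(fit, x[folds == k, ], type = "response")
    ytest <- y[folds == k]
    apply(prob > cutoff, 2, function(pred) {
      sqrt(mean(pred[ytest == 1]) * mean(!pred[ytest == 0]))   # G-mean = sqrt(SN * SP)
    })
  })
  list(lambda = lambda, gmean = rowMeans(gmeans))      # pick the lambda maximizing G-mean
}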
In addition to the metrics above, we also incorporated informative sampling into
RGLR framework. For this purpose, 101 balanced subsamples were generated from
the training set. Specifically, all the cases remained in the subsample, while an equal
number of controls was randomly sampled from the controls. A separate RGLR model
fitting and selection was conducted in each subsample. The resulting 101 classifiers
constituted a committee that determined the classification of a new observation by the
majority vote.
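A rough sketch of this ensemble (an assumed implementation; the function name, the use of cv.glmnet inside each subsample, and the 0.5 within-subsample cutoff follow the description above but are otherwise my own choices) is:

# Informative under-sampling ensemble: 101 balanced subsamples, one penalized
# logistic classifier per subsample, final prediction by majority vote.
library(glmnet)
informative_ensemble <- function(x, y, xnew, B = 101, alpha = 1) {
  cases    <- which(y == 1)
  controls <- which(y == 0)
  votes <- replicate(B, {
    idx <- c(cases, sample(controls, length(cases)))   # keep all cases, subsample controls
    fit <- cv.glmnet(x[idx, ], y[idx], family = "binomial", alpha = alpha, nfolds = 5)
    as.integer(predict(fit, xnew, s = "lambda.min", type = "response") > 0.5)
  })
  as.integer(rowMeans(votes) > 0.5)                    # majority vote over B classifiers
}

An odd number of subsamples such as B = 101 conveniently avoids tied votes.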
Within each specific method, two criteria were used for model selection. The "minimum
error model" selected the model with the largest metric value (except for binomial deviance,
for which the minimum was used). The "1se model" chose the simplest model whose measure
was within one standard error of the best metric value. In addition, all the models
were evaluated in the test data and the best of them was deemed the "oracle model",
whose performance provided an upper limit on the performance of the selected models.
The performance of the classifier(s) built by the above methods was evaluated in the
test set. All the metrics used in the model selection process were used to evaluate the
performance in the test set.
All the above procedures from data generation to performance evaluation were
repeated 100 times. The mean and standard deviation across the 100 iterations were
reported. Statistical analysis and simulations were carried out using R language for
statistical computing (R version 3.0.0) (R core team 2014).
2.4 Results
As expected, the relative order of the performances of different models depended very
little on the choice of metrics to evaluate performance (Figure 2.3 - Figure 2.6, Table
2.1 - Table 2.6). Meanwhile, the relative order of model performances was also quite
consistent across all the scenarios with varying design matrix and total sample sizes
with only a few exceptions. Therefore, I here reported most of the results in terms of
sample size of 360, independent design matrix, 7 active features and models evaluated
by G-mean.
Models selected by metrics with conventional cutoff at 0.5 were among the worst of
all the selected models, especially when the case control ratio was below 1:5. On the
other hand, using a ”case proportion” cutoff could improve model performance except
for overall accuracy (Fig 2.3). Models selected by accuracy performed equally poorly
regardless of the cutoff chosen. Among the other three metrics with "case proportion"
thresholds, G-mean consistently outperformed F measure and both of them consistently
outperformed Cohen’s Kappa.
The performances of all the models selected by cutoff-independent metrics reached
the same level as that selected by G-mean when the case control ratio was between 1:1
and 1:10. When the case control ratio was below 1:10, G-mean and binomial deviance
selected the best models; the performance of the model selected by AUC-ROC was slightly
worse than the previous two; the model selected by the precision-recall breakeven point
was worse but still much better than that selected by accuracy (Fig 2.4). I did not
include AUC-PR in the figure because the performance of models selected was very
similar to that selected by AUC-ROC. When the sample size increased to 720, the mod-
els selected by AUC-ROC and breakeven point rivaled that selected by G-mean and
binomial deviance when the case control ratio was above 1:15 (Fig 2.5).
Sampling methods, in which balanced subsets are generated from the original data, have
been shown to be useful in dealing with the imbalance problem. As expected, when a
sampling approach was applied before model building, the performance was optimal when
the conventional cutoff of 0.5 was used. Additionally, there was little difference across
models selected by different metrics (data not shown). The performance of classifiers
constructed by informative sampling was very similar to previous models selected by G-
mean, deviance and AUC-ROC when the case control ratio was between 1:1 and 1:10.
When the case control ratio was below 1:10, the performance of classifiers from infor-
mative sampling was not as good as that selected by G-mean and deviance, and even
slightly worse than that selected by AUC-ROC (Fig 2.6). As observed in the case of
AUC-ROC, an increase in sample size helped improve informative sampling.
When correlation structures were added to the design matrix, the effect of the imbalance
problem was alleviated to some extent for all metrics, as observed previously
(Lin & Chen 2012). Meanwhile, the relative order of the performance of classifiers was
quite consistent, with few exceptions (Fig 2.7). When the design matrix followed the
Toeplitz design within 10 blocks, the performance of informative sampling was in the
same range as that of G-mean, deviance and AUC-ROC, and even slightly better than
that of AUC-ROC. When the design matrix was the equi-correlation matrix, the classifiers
constructed by informative sampling could not discriminate well in the test set, and
neither could the model selected by binomial deviance.
So far, I was focused on comparing the “minimum error” models selected by differ-
ent methods, which meant the model with the minimal prediction error across cross val-
idation was selected and validated in the test sets. The performances of models selected
by the “1se” criteria were illustrated in Fig 2.8. The performances of models selected
by G-mean and informative sampling were very similar to their corresponding “mini-
mum error model", while the performance of those selected by AUC-ROC and binomial
deviance declined considerably. Additionally, an increase in total sample size could help improve the per-
formance of models selected by AUC-ROC but could not help that selected by binomial
deviance.
2.5 Discussion
Moving the probability cutoff for PLR from 0.5 to the "case proportion" cutoff was shown
to be very instrumental in tackling imbalanced data, together with a summary metric
that treats the minority and majority classes in a balanced manner. Moving the cutoff
was useful to improve the performance of PLR and support vector machine in previous
studies (Blugas et al. 2010). Here I demonstrated its usefulness in model selection as
long as a balanced summary metric was used.
Across all the scenarios examined, G-mean with “case proportion” cutoff was con-
sistently the best or one of the best methods to select the model with good prediction
accuracy in the test set. This consistency was remarkably stable across a variety of case
proportions, total sample sizes, number of effective features and correlation structures.
When using the “1se model” instead of “minimum error model”, the performance of
models selected by G-mean was essentially unaffected. Moreover, these models achieved
at least 80% of the performance of the best possible models in most settings. These
convincing results, together with its simplicity, lead us to suggest G-mean as a criterion
for model selection of PLR in skewed data.
The AUC-related measures AUC-ROC and AUC-PR showed very little difference from
each other in the simulation. Meanwhile, their performance was the best or among the best
provided that the total sample size was large enough or the imbalance ratio was not very
extreme. A careful look at the results revealed that the major factor was not total sample
size or imbalance ratio alone, but the number of samples in the minority class. When the
sample size in the minority class was low (below 40), the number of minority observations
in some folds of a 5-fold cross validation was often 5 or fewer, and these metrics are known
to be unstable when calculated from a curve with so few points. Therefore, we suggest
choosing the number of cross-validation folds so that the sample size in each fold is large
enough; ten is probably a safe bound. When the sample size is too low to perform
cross-validation, G-mean is preferred for dealing with imbalanced data. Another cutoff-
independent metric, precision-recall breakeven point, selected models with performance
close to but often slightly worse than models selected by AUC-related metrics in all of
the settings.
Binomial deviance performed surprisingly well in the simulation as long as the "minimum
error" criterion was applied and pairwise correlation between predictors was not
high. Unlike all the other metrics, however, binomial deviance is a measure of model fit
rather than of prediction error. Although the two are closely related in model
building, model selection is a separate issue. The "1se" criterion did not work for deviance,
possibly because a model in the neighborhood of a good model in terms of fit is
not necessarily optimal for prediction. We expect model fit in real studies to be
worse than in the simulation. In practice, using the minimum-error model
for PLR often results in a complicated model with poor interpretability yet no clear
advantage in terms of prediction; sometimes the issue of overfitting also arises. The
alternative "1se" criterion is therefore recommended, since the resulting much simpler
model often gives more useful results. Thus, due to its incompatibility with the "1se"
criterion, I do not suggest using binomial deviance to select models.
In the simulation, the performance of informative sampling did not go beyond that
of the models selected by the metrics. Evidently, the PLR algorithm is quite
robust to the imbalance problem, and the problem can be overcome by choosing appropriate
decision cutoffs and summary metrics. When the correlation structure was not
very dense, classifiers constructed by informative sampling achieved the same level as
models selected by G-mean and AUC-related metrics. An advantage of the ensemble
subsampling method was its stability. Since the decision was made by the commit-
tee, a change in a single classifier would not hurt much. Therefore, the performance
of the "1se model" and the "minimum error model" was almost identical for informative
sampling. Additionally, informative sampling, like the AUC-related metrics, could not
handle small sample sizes well. This was probably because the loss of model fit had a heavier
impact on the small subsamples. Furthermore, a potential problem with the sampling method
in practice is that the samples are likely to be heterogeneous, unlike in the
simulation studies, leading to classifiers built from subsamples with large heterogeneity.
Due to these concerns, together with its computational burden, I do not recommend using
informative sampling to handle imbalance problems with PLR.
In conclusion, I have studied several summary metrics for model selection with PLR
when the data are imbalanced, together with strategies including a change of
probability cutoff and sampling approaches. I have demonstrated that model selection can
be conducted appropriately when an alternative decision cutoff and robust summary
metrics, especially G-mean, are applied. The final PLR model obtained by adopting
these approaches can achieve satisfactory prediction performance in spite of data
imbalance in the training set.
Figure 2.3: Performance of cutoff-dependent single metrics with independent design
using G mean (top) and AUC-ROC (bottom) for evaluation.
n = 360;s = 7. Included metrics are accuracy (black), G mean (red), F measure
(green), Cohen’s Kappa (blue), and oracle (gold). Solid curves and dash curves respec-
tively indicate 0.5 cutoff and “case proportion” cutoff.
Figure 2.4: Performance of single metrics with independent design using G mean (top)
and AUC-ROC (bottom) for evaluation.
n = 360;s = 7. Included metrics are accuracy with 0.5 cutoff (black), G mean with
”case proportion” cutoff (red), binomial deviance (green), AUC-ROC (blue), AUC-PR
(magenta), breakeven (grey) and oracle (gold).
Figure 2.5: Performance of single metrics with independent design of sample size 720
using G mean (top) and AUC-ROC (bottom) for evaluation.
n = 720;s = 7. Included metrics are accuracy with 0.5 cutoff (black), G mean with
“case proportion” cutoff (red), binomial deviance (green), AUC-ROC (blue), AUC-PR
(magenta), breakeven (grey) and oracle (gold).
Figure 2.6: Performance of selected metrics compared with informative sampling using
independent design and evaluated by G mean (top) and AUC-ROC (bottom)
n = 360;s = 7. Included metrics are accuracy with 0.5 cutoff (black), G mean with
“case proportion” cutoff (red), binomial deviance (green), AUC-ROC (blue), informative
sampling (magenta) and oracle (gold).
Figure 2.7: Performance of selected metrics compared with informative sampling using
block design and evaluated by G mean (top) and AUC-ROC (bottom)
n = 360; s = 7; r = 0.9. Included metrics are accuracy with 0.5 cutoff (black), G
mean with “case proportion” cutoff (red), binomial deviance (green), AUC-ROC (blue),
informative sampling (magenta) and oracle (gold).
Figure 2.8: Performance of selected “1se model”s using independent design and evalu-
ated by G mean (top) and AUC-ROC (bottom)
n = 360;s = 7. Included metrics are accuracy with 0.5 cutoff (black), G mean with
“case proportion” cutoff (red), binomial deviance (green), AUC-ROC (blue), informa-
tive sampling (magenta) and oracle (gold).
Table 2.1: Performance of cutoff-dependent single metrics with independent design, evaluated by G-mean
n1 n0 | 0.5 cutoff | case proportion cutoff
Acc sd G sd F sd κ sd | Acc sd G sd F sd κ sd
20 340 0.086 0.190 0.13 0.214 0.13 0.211 0.13 0.210 0.22 0.057 0.60 0.069 0.48 0.134 0.43 0.141
24 336 0.139 0.236 0.25 0.252 0.25 0.253 0.24 0.249 0.26 0.055 0.64 0.068 0.56 0.111 0.51 0.128
30 330 0.321 0.287 0.39 0.223 0.39 0.224 0.39 0.227 0.31 0.059 0.66 0.062 0.57 0.121 0.52 0.130
36 324 0.479 0.266 0.53 0.097 0.53 0.096 0.53 0.097 0.37 0.063 0.69 0.055 0.61 0.103 0.55 0.113
40 320 0.499 0.274 0.52 0.073 0.53 0.074 0.54 0.079 0.41 0.058 0.69 0.050 0.63 0.094 0.59 0.096
45 315 0.503 0.264 0.52 0.061 0.53 0.068 0.55 0.077 0.44 0.054 0.70 0.049 0.65 0.086 0.60 0.105
50 310 0.625 0.175 0.54 0.055 0.56 0.065 0.57 0.066 0.47 0.055 0.71 0.048 0.67 0.079 0.64 0.092
60 300 0.696 0.085 0.59 0.055 0.59 0.062 0.61 0.065 0.54 0.052 0.73 0.040 0.70 0.056 0.67 0.074
72 288 0.698 0.087 0.62 0.045 0.63 0.049 0.65 0.055 0.58 0.062 0.74 0.036 0.72 0.051 0.70 0.067
90 270 0.738 0.044 0.67 0.052 0.69 0.056 0.71 0.052 0.68 0.071 0.75 0.039 0.74 0.041 0.74 0.043
120 240 0.762 0.033 0.73 0.045 0.74 0.043 0.75 0.037 0.76 0.038 0.77 0.033 0.76 0.034 0.76 0.034
180 180 0.770 0.033 0.77 0.031 0.77 0.026 0.77 0.029 0.77 0.030 0.77 0.029 0.76 0.027 0.76 0.027
The results were reported as mean and standard deviation across 100 replications
Notes: s = 7; n = 360; Acc: accuracy; G: G-mean; F: F-measure; κ: Kappa
Table 2.2: Performance of cutoff-dependent single metrics with independent design, evaluated by AUC-ROC
n1 n0 | 0.5 cutoff | case proportion cutoff
Acc sd G sd F sd κ sd | Acc sd G sd F sd κ sd
20 340 0.53 0.094 0.55 0.102 0.55 0.102 0.55 0.101 0.63 0.044 0.68 0.078 0.66 0.069 0.65 0.071
24 336 0.56 0.108 0.61 0.120 0.61 0.120 0.61 0.119 0.65 0.046 0.73 0.070 0.70 0.061 0.69 0.060
30 330 0.64 0.119 0.67 0.099 0.67 0.099 0.67 0.100 0.65 0.049 0.75 0.061 0.73 0.066 0.71 0.066
36 324 0.71 0.123 0.75 0.056 0.75 0.055 0.75 0.056 0.69 0.054 0.79 0.055 0.76 0.068 0.74 0.065
40 320 0.72 0.129 0.74 0.049 0.75 0.050 0.75 0.052 0.69 0.054 0.79 0.045 0.78 0.052 0.76 0.057
45 315 0.72 0.124 0.74 0.045 0.74 0.046 0.75 0.050 0.70 0.048 0.80 0.045 0.78 0.049 0.76 0.056
50 310 0.78 0.087 0.75 0.046 0.75 0.050 0.75 0.049 0.71 0.047 0.81 0.042 0.80 0.052 0.78 0.056
60 300 0.81 0.055 0.77 0.045 0.77 0.046 0.77 0.046 0.74 0.049 0.82 0.038 0.81 0.042 0.80 0.050
72 288 0.81 0.055 0.77 0.040 0.78 0.041 0.79 0.043 0.75 0.052 0.83 0.031 0.82 0.039 0.81 0.049
90 270 0.83 0.037 0.79 0.040 0.80 0.041 0.82 0.040 0.80 0.050 0.84 0.032 0.84 0.032 0.83 0.034
120 240 0.85 0.031 0.83 0.044 0.83 0.042 0.85 0.035 0.85 0.034 0.85 0.029 0.85 0.030 0.85 0.029
180 180 0.85 0.030 0.85 0.026 0.85 0.025 0.85 0.027 0.85 0.033 0.85 0.033 0.85 0.027 0.85 0.028
The results were reported as mean and standard deviation across 100 replications
Notes: s = 7; n = 360; Acc: accuracy; G: G-mean; F: F-measure; κ: Kappa
Table 2.3: Performance of cutoff-independent metrics with independent design, evaluated by G-mean
n1 n0 Acc sd G sd dev sd ROC sd PR sd br-even sd
20 340 0.086 0.190 0.60 0.069 0.59 0.113 0.56 0.163 0.54 0.173 0.50 0.226
24 336 0.139 0.236 0.64 0.068 0.63 0.111 0.60 0.150 0.59 0.142 0.54 0.219
30 330 0.321 0.287 0.66 0.062 0.66 0.059 0.64 0.118 0.63 0.125 0.61 0.126
36 324 0.479 0.266 0.69 0.055 0.68 0.056 0.69 0.074 0.67 0.103 0.65 0.130
40 320 0.499 0.274 0.69 0.050 0.70 0.044 0.69 0.088 0.70 0.048 0.67 0.133
45 315 0.503 0.264 0.70 0.049 0.70 0.047 0.71 0.055 0.70 0.053 0.69 0.061
50 310 0.625 0.175 0.71 0.048 0.71 0.040 0.72 0.041 0.71 0.059 0.71 0.064
60 300 0.696 0.085 0.73 0.040 0.72 0.037 0.73 0.040 0.73 0.044 0.72 0.049
72 288 0.698 0.087 0.74 0.036 0.73 0.038 0.74 0.037 0.73 0.046 0.73 0.047
90 270 0.738 0.044 0.75 0.039 0.74 0.035 0.75 0.034 0.75 0.034 0.74 0.040
120 240 0.762 0.033 0.77 0.033 0.76 0.032 0.77 0.030 0.77 0.031 0.76 0.034
180 180 0.770 0.033 0.77 0.029 0.77 0.029 0.77 0.026 0.77 0.026 0.77 0.031
The results were reported as mean and standard deviation across 100 replications
Notes: s = 7; n = 360; Acc: 0.5-cutoff accuracy; G: case-proportion-cutoff G-mean; dev: binomial deviance;
ROC: AUC-ROC; PR: AUC-PR; br-even: breakeven point.
Table 2.4: Performance of cutoff-independent metrics with independent design, evaluated by AUC-ROC
n1 n0 Acc sd G sd dev sd ROC sd PR sd br-even sd
20 340 0.53 0.094 0.68 0.078 0.68 0.079 0.68 0.088 0.67 0.085 0.66 0.102
24 336 0.56 0.108 0.73 0.070 0.73 0.074 0.72 0.076 0.71 0.074 0.70 0.090
30 330 0.64 0.119 0.75 0.061 0.75 0.065 0.74 0.074 0.74 0.067 0.73 0.072
36 324 0.71 0.123 0.79 0.055 0.79 0.055 0.78 0.060 0.77 0.066 0.77 0.071
40 320 0.72 0.129 0.79 0.045 0.80 0.045 0.79 0.059 0.79 0.049 0.78 0.068
45 315 0.72 0.124 0.80 0.045 0.80 0.045 0.80 0.050 0.80 0.050 0.79 0.049
50 310 0.78 0.087 0.81 0.042 0.82 0.037 0.81 0.041 0.81 0.046 0.80 0.048
60 300 0.81 0.055 0.82 0.038 0.82 0.036 0.82 0.039 0.82 0.040 0.82 0.043
72 288 0.81 0.055 0.83 0.031 0.83 0.032 0.83 0.032 0.83 0.039 0.82 0.039
90 270 0.83 0.037 0.84 0.032 0.84 0.032 0.84 0.032 0.84 0.034 0.83 0.038
120 240 0.85 0.031 0.85 0.029 0.85 0.030 0.85 0.028 0.86 0.028 0.85 0.032
180 180 0.85 0.030 0.85 0.033 0.85 0.026 0.86 0.024 0.85 0.024 0.85 0.034
The results were reported as mean and standard deviation across 100 replications
Notes: s = 7; n = 360; Acc: 0.5-cutoff accuracy; G: case-proportion-cutoff G-mean; dev: binomial deviance;
ROC: AUC-ROC; PR: AUC-PR; br-even: breakeven point.
Chapter 3
Parameter-free feature selection with stability selection
In this chapter, I focus on feature selection, a common task in the analysis of high
dimensional data. As reviewed in Chapter 1, the standard LASSO penalized regression
method with cross validation tends to select too many noise features. Stability selection
is a flexible framework for feature selection, based on subsampling of the original data,
that can reduce the selection of noise features by providing an explicit bound on the
expected number of falsely selected features. However, I found that the error rate of
stability selection, though controlled below the target as predicted by theory, appears
too conservative, especially for small samples. An additional unde-
sirable feature of standard stability selection is that it can yield quite different sets of
selected features for different choices of its tuning parameters. In this chapter, I propose
a modification to the original stability selection algorithm to address the sensitivity to
tuning parameters and the conservativeness for small samples.
I first provide a detailed description of the original stability selection and several
related methods that motivate the modifications I propose. I then introduce the novel
approach to robustly control the error in the framework of stability selection and evaluate
its performance in both balanced data and imbalanced data in a simulation study.
3.1 Stability selection
Current genome-wide array-based technologies can measure a very large number of
biomarkers (e.g., genotypes, gene expression, methylation) but it is costly and time-
consuming to follow up multiple biomarkers with functional experiments to determine
their biological role. An ideal feature selection algorithm should be capable of detecting
biologically important features while avoiding unrelated noise features. Stability selec-
tion is a general algorithm for feature selection that can be used to improve upon any
given feature selection algorithm (Meinshausen & Bühlmann 2010). It can be thought of
as an ensemble method where the original data is “perturbed” multiple times and in each
‘version’ of the data the base selection algorithm selects a set of features. These sets of
features are then aggregated to obtain a single set of selected features (He & Yu 2010).
Stability selection identifies features that are consistently selected while avoiding those
that are selected by chance. The idea behind stability selection has been proposed mul-
tiple times before being formalized and studied theoretically. For example, BoLASSO
(Bach 2008) used LASSO as the base feature selector and perturbed the data by boot-
strap sampling. This is very similar to the stability selection algorithm except that sub-
sampling is used instead of bootstrapping. Davis et al. used several feature selectors
as the base learner, including a simple F-test, Pearson correlation, SVM and decision tree
methods, to achieve a higher prediction accuracy (Davis et al. 2006).
Stability selection can be used with any regularized regression method. Here we
introduce it in the context of LASSO linear regression:
y = Xβ + ε,   ε ~ N(0, σ²)    (3.1)
where y = (y_1, y_2, ..., y_n)^T is the response vector, X = (x_1, x_2, ..., x_n)^T is the
design matrix, β = (β_1, β_2, ..., β_p)^T is the vector of regression coefficients, and
ε = (ε_1, ε_2, ..., ε_n)^T is the random error whose components are assumed independent
and identically distributed. L_1-norm penalization leads to the LASSO estimator
(Tibshirani 1996):

min over β ∈ R^P of { (1/(2n)) · ||y − Xβ||_2^2 + λ · Σ_{j=1}^P |β_j| }    (3.2)
where λ > 0 is a tuning parameter. We define S^(0) = {i : β_i^(0) ≠ 0} ⊆ {1, ..., P} as the
set of active features, and for every value λ ∈ Λ we define

Ŝ^λ = {i : β̂_i(λ) ≠ 0} ⊆ {1, ..., P}    (3.3)

which is an empirical estimate of S^(0). In feature selection, the questions of interest
are a) whether Ŝ^λ is a good estimate of S^(0) for any λ and, if so, b) how to find an
appropriate λ. The answer to the first question is affirmative when the design matrix
satisfies the irrepresentable condition, as discussed in Chapter 1. However, LASSO
with a relatively small λ often has lower prediction error, and therefore cross validation
tends to favor small λ, yielding larger models with many noise features.

Let I be a random subsample of {1, ..., n} of size m, drawn without replacement, and
let Ŝ^λ(I) be the selected set Ŝ^λ given λ ∈ Λ for the subsample I. Then, for a feature
subset K ⊆ {1, ..., p}, the frequency of selection of the subset K across the r subsamples
is

π̂_K^λ = P*{K ⊆ Ŝ^λ(I)}    (3.4)
These frequencies are estimates of the corresponding probability P* of selection with
respect to all possible random subsamples. The size of a subsample, m, is usually set to
half of the total number of observations, which is closest to bootstrap sampling and has
several computational advantages. The higher the frequency of selection of K across
the subsamples, the more stable it is deemed. In practice, one is typically interested in
subsets K of size 1, i.e. containing a single feature.
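As an illustration (a sketch under the above definitions, not the original implementation; stability_freq is a hypothetical name), the frequencies π̂_k^λ can be estimated by refitting the LASSO on repeated subsamples of size n/2:

# Estimate selection frequencies over a lambda grid from B subsamples of size n/2.
library(glmnet)
stability_freq <- function(x, y, lambda, B = 100) {
  n <- nrow(x); p <- ncol(x)
  counts <- matrix(0, p, length(lambda))
  for (b in 1:B) {
    I   <- sample(n, floor(n / 2))                 # subsample without replacement
    fit <- glmnet(x[I, ], y[I], lambda = lambda)   # LASSO linear regression as in (3.1)
    counts <- counts + (as.matrix(fit$beta) != 0)  # which features enter the model
  }
  counts / B                                       # p x length(lambda) matrix of frequencies
}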
A useful graphical representation of the stability selection algorithm is given by
the stability path, constructed in the spirit of regularization paths. The regularization
path plots {β̂_k^λ : λ ∈ Λ, k = 1, ..., p} as a function of λ. The stability path plots
{π̂_k^λ : λ ∈ Λ, k = 1, ..., p} as a function of λ. An example of a stability path is given
in Figure 3.1. Compared to the regularization path, active features are more likely to
"stand out" in the stability path.
For a cut-off π_thr with 0 < π_thr < 1 and a set of regularization parameters Λ, the set of
stable features is defined as

Ŝ^stable = {k : max_{λ∈Λ} π̂_k^λ ≥ π_thr}    (3.5)

Ŝ^stable is determined by the cut-off parameter π_thr and the regularization region Λ. The
choice of the grid values in the region Λ depends on the particular algorithm used. In
the R package glmnet, the largest λ corresponds to a model with only one feature, the
smallest λ is set by default to 0.01 times the largest, and the remaining values are taken
evenly spaced between these two. As illustrated in Figure 3.1, both Ŝ^stable and Ŝ^λ(I)
change with Λ. In particular, Ŝ^stable(Λ') is a subset of Ŝ^stable(Λ) if Λ' is a subset of Λ.
In the original paper introducing stability selection (Meinshausen & Bühlmann 2010),
an approach was proposed to control the error and to select π_thr and Λ, as described below.
Let q_Λ be the average number of selected features in the region Λ,

q_Λ = E(|Ŝ^Λ(I)|)    (3.6)

where the expectation is taken with respect to P*, and define V to be the number of
falsely selected features,

V = |N ∩ Ŝ^stable|    (3.7)

where N is the set of noise features. Assuming that the distribution of {1_{k ∈ Ŝ^λ} : k ∈ N}
is exchangeable for all λ ∈ Λ, and also assuming that the original base feature selection
procedure is not worse than random guessing, i.e.

E(|S ∩ Ŝ^stable|) / E(|N ∩ Ŝ^stable|) ≥ |S| / |N|    (3.8)

then, when the size of the subsample is m = n/2, it is proved that the expectation of V for
π_thr ∈ (1/2, 1) is bounded by

E(V) ≤ (1 / (2π_thr − 1)) · (q_Λ² / p)    (3.9)
where p is the number of features. It should be noted that while the better-than-
random-guessing assumption is likely to be satisfied by the majority of base feature-
selection algorithms used in practice, the exchangeability assumption is a rather strong
requirement. Meinshausen & Bühlmann (2010) stated that values of π_thr in the range of
0.6-0.9 yield very similar sets of selected features. To control the number of false discoveries
in practice, one can set E(V) a priori to reflect the tolerated expected number of falsely
selected features, and determine q_Λ from equation 3.9 using an a priori chosen π_thr.
The stability selection algorithm is then run and the set of stable features is determined
with the cutoff π_thr for each λ in Λ. Finally, a subset Λ' of Λ is found that yields a number
of stable features close to q_Λ, ensuring that the bound on E(V) given by equation 3.9 holds.
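As a small worked example of this procedure (the numbers are only for illustration): rearranging equation 3.9 gives q_Λ = √(E(V)·(2π_thr − 1)·p), so with p = 1000 features, π_thr = 0.6 and a tolerance of E(V) = 1,

# target average number of selections per subsample implied by the bound (3.9)
p <- 1000; pi_thr <- 0.6; EV <- 1
sqrt(EV * (2 * pi_thr - 1) * p)      # about 14.1

so Λ would be chosen such that roughly 14 features are selected per subsample on average.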
The advantage of stability selection over LASSO with cross validation is supported
by the simulation studies in Meinshausen & Bühlmann (2010), who report that com-
pared to LASSO with cross validation, in most scenarios stability selection selected (a)
a higher number of true features before selecting a noise one and (b) fewer noise features
when attaining a target proportion of uncovered true features.
Meinshausen & Bühlmann (2010) investigated their proposed bound on the expected
number of false discoveries in simulation studies using permuted data from the vitamin
gene expression data, where the number of noise features was shown to be effectively
controlled below the bound set a priori and was also much smaller than the number of
noise features selected by LASSO with cross validation. Unfortunately, they did not
investigate the bound using extensive simulated data with varying sample size, correla-
tion structure and size of active set.
Although stability selection provides a reasonable way to control the error rate when
performing feature selection, I realized, during the course of a pilot simulation to inves-
tigate its performance, that its practical usefulness was limited by two important issues.
First, as mentioned previously, the error control bound in equation 3.9 provides an
approach to select a region Λ given π_thr and E(V), and the stable features are then
identified in this region to keep the expected number of false discoveries below the target.
However, in the simulation study I found that the error control is too loose in most cases
and the number of noise features selected in this fashion falls considerably below the target.
This implies that although the error control bound controls the error as predicted, it is
too conservative, and the capability of detecting active features suffers as a result.
Second, when one attempts to build a prediction model from the training data, the
varying performance of models with different tuning parameters is not an issue, because the
model that minimizes prediction error can be determined by methods such as cross
validation. For feature selection procedures, the major obstacle is that the active set is
unknown, so it is impossible to use any kind of prediction "error" to select the tuning
parameters. Therefore, a good feature selection procedure should have few parameters to
tune and be robust to their specification, i.e. give a consistent set of true and false
discoveries over a wide range of tuning parameter choices. With standard stability selection,
the selected feature set varies considerably with the cutoff value π_thr, as illustrated in
Section 3.2.3.
3.2 SS-test: test-based method of error control
In this section, I propose a modification of stability selection, which I name test-based
stability selection (SS-test), based on framing the selection of stable features as an out-
lier detection problem. SS-test addresses the deficiencies of standard stability selection
(SS-thr) described above.
The idea is to frame the selection of features in stability selection as an outlier detection
problem. We assume that for all the noise predictors the selection frequency π̂_k follows a
beta distribution with parameters α and β (to be determined from the data), while for the
active features the π̂_k can be regarded as outliers of this 'null distribution' (Fig 3.2). In
other words, given a null beta distribution, if an empirical frequency π̂_k is much higher
than expected under the null, it is likely to have been drawn from a different distribution
as an outlier, and the corresponding feature is likely to be a true positive.
The idea of our proposed method is as follows. The outliers are identified by an
algorithm that iteratively re-estimates the null distribution and updates the current set
of outliers. Specifically, we maintain three feature sets: O, containing all the currently
identified outliers; N, the complement of O; and P, which is updated at each iteration
and contains the outliers identified based on the current estimate of the null distribution.
At the start of the algorithm, N is initialized to contain all the features and both O and P
are empty. We rank the features by their corresponding selection frequencies and move
the feature with the maximum frequency from set N to set O. The null beta distribution
is estimated using all of the features in the set N by matching the first two moments (this
is faster than maximum likelihood estimation and performs well). With these estimates
and an a priori significance level q, we find the critical value for the selection frequencies
as the corresponding upper-tail quantile of the null distribution. All the features with
selection frequencies above the critical value (including those features in the set O) are
added to set P. The algorithm continues to iterate until P is a subset of O; after it stops,
all features in the set P are designated as the stable features.
The specific steps of SS-test are as follows. Given X (a vector of size p with x_k = π̂_k) and q:

Copy the set X to the set N and move the predictor with maximum π̂_k from N to O.
Repeat {
    Calculate the mean and variance of N and use them to fit a beta distribution f_null.
    Given f_null and q, find the critical value C such that P(π̂_j > C) < q/p if j is from the null distribution.
    Identify the potential outliers P: π̂_k > C for all k in X.
    If P ⊆ O, break the loop; otherwise move the predictor with maximum π̂_k from N to O.
}
A feature k is stable if it is in P.
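A minimal R sketch of this procedure for a single λ is given below (an assumed implementation; ss_test is a hypothetical name, and the moment-matching beta fit follows the description above):

# SS-test for one lambda: pi_hat is the vector of selection frequencies, q the significance level.
ss_test <- function(pi_hat, q) {
  p <- length(pi_hat)
  O <- which.max(pi_hat)                        # currently identified outliers
  N <- setdiff(seq_len(p), O)                   # current null (non-outlier) features
  repeat {
    m <- mean(pi_hat[N]); v <- var(pi_hat[N])
    k <- m * (1 - m) / v - 1                    # beta parameters by matching the first two moments
    C <- qbeta(1 - q / p, m * k, (1 - m) * k)   # critical value: P(pi > C | null) = q / p
    P <- which(pi_hat > C)                      # potential outliers at this iteration
    if (all(P %in% O)) break                    # stop once P is a subset of O
    nxt <- N[which.max(pi_hat[N])]              # otherwise move the most frequent null feature to O
    O <- c(O, nxt); N <- setdiff(N, nxt)
  }
  P                                             # features declared stable
}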
By iteratively re-estimating the null distribution based on the non-outliers, the algo-
rithm prevents the true features from influencing the parameter estimation of the null
distribution, thereby preserving the capability to detect active features. Note that the
parameter q in the algorithm has the same role as the significance level in any test-based
procedure: it controls the expected number of false discoveries in the selection, as E(V)
does in SS-thr.
The algorithm above is repeatedly applied for all of the λs in Λ. In order to prevent
the selection of too many features, the largest λ was usually set at a value such that, on
average, one feature was selected in a subsample, while the smallest λ was usually set
at a value such that, on average, a small proportion (say 10%) of the features was selected
in a subsample. We define the SS-test stable features as

Ŝ^SS-test = {k : (1/|Λ|) · Σ_{λ∈Λ} I(k ∈ O_λ) ≥ π_SS-test}    (3.10)

In this way, the information from the whole region of λ and π̂ values is integrated into the
algorithm. Simulation results demonstrated that the O_λ were remarkably consistent across
the λ sequence. Also, choices of π_SS-test from 0.2 to 0.8 gave very similar results, so the
procedure is essentially parameter free. Our recommendation is to set π_SS-test = 0.5.
We should point out that the outlier detection framework of SS-test relies crucially on
a sparsity assumption that a relatively small proportion of features have a real biological
relationship to the outcome under investigation. This is a common assumption in genetic
studies and is also key to the LASSO and virtually all other forms of feature selectors in
high dimensional contexts, as described in Chapter 1. Without sparsity, inference with
p ≫ n is not feasible.
3.3 Simulation study to compare the performance of SS-thr and SS-test in balanced data
In this section, I compared the error control and the ability to detect active features of
SS-test with those of SS-thr in a simulation study. Within each simulation replicate, a high
dimensional data set was generated in which a subset of features, denoted S^(0), was set as
active. I compared several approaches to feature selection, including univariate
logistic regression, LASSO with cross validation, standard stability selection (denoted
SS-thr), and test-based stability selection (denoted SS-test). For each simulation
replicate we performed feature selection using each of the methods to estimate the active
set by Ŝ. First, I examined the impact of the chosen cutoff for SS-thr and SS-test by
looking at the fluctuations in Ŝ as the cutoff was "tuned". I then investigated the number of
falsely selected features (FPN) for the different feature selectors across a variety of settings,
including several sample sizes, correlation structures, and sizes of the active set S^(0).
I also examined the number of correctly selected features (TPN) when the FPN was
maintained at (approximately) the same level across all the feature selectors.
3.3.1 Simulation plan
Data generation. The predictors and outcomes in the simulation were generated in the
same way as in Section 2.3.1. In this simulation the data set was always balanced
(n_0 = n_1), and n_1 ranged from 50 to 500.
Feature selection. I evaluated the performance of the different feature selection methods
based on the number of falsely selected features (FPN) and the number of correctly
selected features (TPN). In the complete null case with no active features (s = 0), only
the FPN was evaluated.
With univariate logistic regression, we selected features based on a Bonferroni-corrected
cutoff of p-value < α/p, where α = 0.05. Together with this cutoff, we also tested a range
of cutoffs from α/p to 0.01 to compare the ability to detect the active set.
For LASSO with cross validation, the common procedure was to use the Ŝ corresponding
to the 1se model, i.e. the simplest model whose prediction error was within one
standard error of that of the minimum error model. The prediction error was measured by
AUC-ROC using 5-fold cross validation.
For the stability selection based methods, we used 100 subsamples and a grid of 100
values for λ.
With SS-thr, to test the robustness of feature selection to changes in the stability
frequency cutoff, we first fixed E(V) = 1 while changing π_thr from 0.6 to 0.9 (recall that
π_thr > 0.5 is required for equation 3.9 to apply) in increments of 0.05, and evaluated
the fluctuation of Ŝ across π_thr. Then we fixed π_thr = 0.6 and examined whether error
control is achieved for other values of E(V) ranging from 0.1 to 10.
For SS-test, to test the robustness of feature selection to π_SS-test, we first fixed
q = 0.2 while changing π_SS-test from 0.2 to 0.8 in increments of 0.05 and evaluated
the stability of Ŝ across the range of π_SS-test values. We also examined whether error
control was achieved with different q values ranging from 0.02 to 1 with a fixed
π_SS-test = 0.5.
For each simulation scenario above we ran 100 replicates after which we summa-
rized the mean and standard deviation of the performance metrics. All statistical anal-
ysis and simulations were carried out using the R language for statistical computing (R
version 3.0.0) (R core team 2014).
3.3.2 Results
Robustness of feature selection. First, we examined whether the number of selected
features was affected by changes of the cutoff in SS-thr and SS-test. For SS-thr, the
cutoff π_thr was varied in the range 0.6 to 0.9, and for each cutoff, with the E(V) level set
a priori, the stable features were those with selection frequencies above the cutoff in the
region Λ determined by the error control bound. For SS-test, the cutoff π_SS-test was varied
from 0.2 to 0.8, and for each cutoff, with the q level set a priori, the stable features were
those that were frequently identified as stable across the Λ region, with frequency above
the cutoff. We observed large fluctuations in TPN for SS-thr. By contrast, the features
selected by SS-test were stable across the range of π_SS-test. Figure 3.3 depicts a scenario
with a sample size of 200 and independent features and illustrates the typical qualitative
behavior for all scenarios.
Effectiveness of error control. We next investigated the effectiveness of the error control
bound. The number of noise features (FPN) included by SS-thr was much smaller
than the expected number of chosen noise features E(V) (Figure 3.4, top). When the
features were independent, FPN remained at a low level across varying sample sizes and
numbers of active features (Figure 3.5, top). Larger FPN was observed for a given E(V)
in the presence of correlation structure (Figure 3.6, top), especially for a large sample.
However, FPN was still well controlled by the set bound.
FPN for SS-test remained low across the various sample sizes and tended to decrease for
larger samples (Figure 3.4, bottom). We observed a similar trend in FPN for different
numbers of active features (Figure 3.5, bottom). When s = 0, FPN remained stable, did
not decrease with larger sample sizes, and tended to be slightly higher for the same q
value compared to the s = 7 scenario. The consistency of error control for SS-test was
also robust to correlation structure (Figure 3.6, bottom). Moreover, the presence of
correlation structure had little effect on FPN for a given q value.
Capability of detecting true features. It would be unfair to compare different methods
if the type I error rate were not controlled at the same level. Thus, we tested a
series of E(V) values or significance levels for SS-thr, SS-test, and ULR, and then found
the appropriate level for each method to control FPN at around the targeted level
FPN_target for all sample sizes. We additionally included LASSO with cross validation
and ULR with a Bonferroni-corrected significance level in the comparison. When the
predictors were independent and the number of active features was set to 7, LASSO with
cross validation included 6-10 noise features, and Bonferroni-corrected ULR appeared
conservative (Figure 3.7). ULR, SS-thr and SS-test had a similar capability of detecting
active features for any given FPN_target. Furthermore, their TPN approached that of
LASSO with cross validation when FPN was bounded at 1. A similar trend in TPN was
observed when the size of the active set was set to 4 or 15.
When the predictors followed the Toeplitz-within-blocks design, the false positives
of Bonferroni-corrected ULR increased dramatically as the sample size increased. On the
other hand, TPN for SS-thr and SS-test also declined when correlation structure was
present (Figure 3.8).
3.3.3 Summary
The simulation shows that the performance characteristics of SS-thr can depend heavily on
a parameter that is difficult to tune, which has strong implications for practical use. The
feature set selected by SS-thr varies considerably as the frequency cutoff π_thr changes,
while SS-test, which also has a tuning parameter, yields results that are very consistent
across a wide range of values of π_SS-test. In addition, the number of noise features
selected by both SS-thr and SS-test was robust to varying numbers of active features
and sample sizes. The impact of correlation on FPN was higher for SS-thr than for SS-test,
especially for large samples. Based on these empirical studies, we recommend setting q
in the range 0.15-0.2 to bound FPN around 1, and setting q in the range 0.3-0.4 to bound
FPN around 2.
In addition, SS-test had at least the same capability of detecting true features as SS-thr
when FPN was controlled at the same level. Both SS-test and SS-thr can perform as well as
univariate logistic regression when the FPN for these methods is controlled at the same
level. A criticism of stability selection is that it tends to be conservative. For example,
an application of stability selection in a genome-wide association study showed lower
power than a simple univariate regression selector (ULR) (Alexander & Lange 2011). I
speculate that this power loss is partially due to the conservative error control of stability
selection, especially in the p ≫ n setting of a GWAS (please refer to Chapter 5). In
our simulation, I found that the power to detect active features with modest effect
sizes for SS-thr with E(V) = 1 was not as good as for SS-test with q = 0.16 (Figure 3.9).
However, that is simply because the actual error rate was much smaller than the targeted
bound, while the FPN for SS-test was approximately 1. When the error was well controlled
and p was not too large compared to n, I demonstrated here that stability selection can
achieve the same power as ULR and SS-test when the predictors are independent of
each other. In the presence of correlated features, ULR failed to control the error rate at
the target level, so that stability selection was preferred.
Nevertheless, when one wants to maximize the ability to detect true features without
selecting many noise features, we recommend using SS-test with an appropriate q value
corresponding to the tolerated error rate. Although stringent error control can be
achieved by SS-thr, its loose error control bound and its dependence on the cutoff value
affect its ability to detect active features. We recommend using SS-thr, together with the
stability path plot, when stringent error control is desired. In light of these results one
can argue that SS-test is essentially a tuning-free method that successfully addresses
the sensitivity to the tuning parameter of SS-thr, making it an attractive alternative for
feature selection.
3.4 Simulation study to compare the performance of SS-thr and SS-test in imbalanced data
It is potentially challenging to perform feature selection in a skewed data set. To overcome
this obstacle, we considered other sampling strategies to combine with stability selection,
since the subsamples in the standard procedure maintain the same case control ratio as the
original data. The rationale of these new strategies is to generate multiple balanced
subsamples from the original data and select features in each of them. In the balanced
sampling framework, N = 0.5·n_1 observations are sampled without replacement
from each class, where n_1 is the sample size of the minority class. In the bootstrap sampling
framework, N = n_1 observations are sampled with replacement from each class; note
this is equivalent to generating bootstrap samples in the minority class, and the size of
each subsample is twice that for balanced sampling. As with informative
sampling introduced in Chapter 2, although some information is lost in each balanced
subsample due to the decrease in sample size, it can be recovered by
aggregating the information from multiple subsamples, which together cover the entire data set.
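For concreteness (a sketch of the index generation only; the function names are hypothetical), the two schemes differ only in how the subsample indices are drawn:

# Balanced sampling: 0.5 * n1 observations per class, without replacement.
balanced_subsample <- function(y) {
  n1 <- sum(y == 1)
  c(sample(which(y == 1), floor(0.5 * n1)),
    sample(which(y == 0), floor(0.5 * n1)))
}
# Bootstrap sampling: n1 observations per class, with replacement.
bootstrap_subsample <- function(y) {
  n1 <- sum(y == 1)
  c(sample(which(y == 1), n1, replace = TRUE),
    sample(which(y == 0), n1, replace = TRUE))
}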
The modification of the sampling framework has little impact on the subsequent
selection procedure, since a selection frequency matrix with the same dimensions is
still generated. However, the error control bound for SS-thr is likely to change due
to the change in the size of each subsample. In the extreme case, when the data are
highly imbalanced, the selection probability for any variable may not rise above 50%,
since the subsamples in balanced sampling or bootstrap sampling are very different from
each other. In contrast, I suspected SS-test to be relatively immune to the alteration of
the sampling framework; that is, the selection frequencies could still be modeled by a
beta distribution.
72
In this section, the performance of SS-thr and SS-test is evaluated in imbalanced
data using a simulation study. Besides the comparison under the standard subsampling
framework, the performance under a balanced sampling framework (leading to the Bal-thr
and Bal-test procedures) and a bootstrap sampling framework (leading to the Boot-thr and
Boot-test procedures) is evaluated.
3.4.1 Simulation plan
The simulation procedure had the same spirit as in the balanced setting. The major
differences were that a) an imbalanced data set was generated, and b) four new feature selectors
were added, including Bal-thr and Bal-test, which combined SS-thr or SS-test with the
balanced sampling framework, and Boot-thr and Boot-test, which combined SS-thr or
SS-test with the bootstrap sampling framework.
Data generation. The approach used to generate the design matrix and outcomes was the same
as in Section 3.3.1.
The total sample size for the data was set at 360 and 720. In the 360 case, the ratio
of case to control ranged from 1:1, 1:2, 1:3, 1:4, 1:5, 1:6.2, 1:7, 1:8, 1:9, 1:11, 1:14 to
1:17. In the 720 case, the ratio of case to control ranged from 1:1, 1:2, 1:3, 1:4, 1:5,
1:6.2, 1:7, 1:8, 1:9, 1:11, 1:14, 1:17, 1:19, to 1:23.
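Since the data-generating model itself is specified in Section 3.3.1 and not repeated here, the sketch below only illustrates one plausible way to obtain a data set of a given size and case-to-control ratio from a logistic model with a handful of active features; the intercept tuning, effect size, and independent design are assumptions made for illustration.

# Sketch: generate an imbalanced binary data set with a given case:control ratio.
# p features, s of them active; the intercept is tuned to hit the target prevalence.
simulate_imbalanced <- function(n = 360, p = 1000, s = 7, ratio = 1/9,
                                beta_active = 0.5) {
  x <- matrix(rnorm(n * p), n, p)            # independent design for simplicity
  beta <- c(rep(beta_active, s), rep(0, p - s))
  target_prev <- ratio / (1 + ratio)         # e.g., 1:9 -> 10% cases
  eta <- drop(x %*% beta)
  intercept <- uniroot(function(b0) mean(plogis(b0 + eta)) - target_prev,
                       interval = c(-20, 20))$root
  y <- rbinom(n, 1, plogis(intercept + eta))
  list(x = x, y = y)
}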
Feature selection  I applied a series of feature selection methods to identify true features, including ULR, LASSO with cross validation, SS-thr, SS-test, Bal-thr, Bal-test, Boot-thr, and Boot-test. I evaluated the performance of the different methods based on FPN and TPN.
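For concreteness, FPN and TPN can be computed from the index set of selected features and the index set of truly active features as in the small sketch below; the function name is illustrative.

# FPN: number of selected features that are not active (false positives)
# TPN: number of selected features that are active (true positives)
evaluate_selection <- function(selected, active) {
  c(FPN = length(setdiff(selected, active)),
    TPN = length(intersect(selected, active)))
}

# Example: active features 1..7, selector returned 1, 2, 3, 12, 40
# evaluate_selection(c(1, 2, 3, 12, 40), 1:7)   # FPN = 2, TPN = 3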
I used the same settings as in Section 3.3.1 for the feature selectors ULR, LASSO, SS-thr, and SS-test.
For Bal-thr, Bal-test, Boot-thr, and Boot-test, the settings for SS-thr and SS-test were adopted, except that a different sampling framework was integrated. We used 100 or 500 subsamples and a grid of 100 values for λ.
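The sketch below shows, under simplifying assumptions, how a selection-frequency matrix can be assembled by pairing a subsampling scheme with glmnet LASSO fits over a common λ grid; the function and argument names are illustrative, and draw_subsample from the earlier sketch can be supplied for the balanced or bootstrap schemes.

library(glmnet)

# Sketch: build a selection-frequency matrix (features x lambda values) from B
# subsamples. draw_idx returns the row indices of one subsample; the default is
# the standard 50% subsample.
selection_frequency <- function(x, y, B = 100, nlambda = 100,
                                draw_idx = function(y) sample(length(y), floor(length(y) / 2))) {
  # fit once on the full data to fix a common lambda grid for all subsamples
  lambda_grid <- glmnet(x, y, family = "binomial", nlambda = nlambda)$lambda
  counts <- matrix(0, ncol(x), length(lambda_grid))
  for (b in seq_len(B)) {
    idx <- draw_idx(y)
    fit <- glmnet(x[idx, ], y[idx], family = "binomial", lambda = lambda_grid)
    counts <- counts + (as.matrix(fit$beta) != 0)   # 1 if selected at that lambda
  }
  counts / B   # selection frequencies
}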
For each simulation scenario above I ran 100 replicates after which I summarized
the mean and standard deviation of the performance metrics. All statistical analysis and
simulations were carried out using the R language for statistical computing (R version
3.0.0) (R core team 2014).
3.4.2 Results
When the features were independent, I set E(V) = 15 for the cutoff-based methods and q = 0.3 for the test-based methods to control FPN at around 2. In most cases, the false positives were controlled at the targeted level. As the case-control ratio decreased, the FPN for SS-thr, SS-test, Bal-test, and Boot-test remained stable, while the TPN declined for all methods (Figure 3.10). When the case-control ratio was below 1:10, these methods identified more true features than LASSO. There was little difference among the three types of "test" methods and SS-thr.
On the other hand, the FPN and TPN for Bal-thr and Boot-thr were much lower than those of SS-thr when the case-control ratio was below 1:5, even when E(V) was set as high as 50. This observation was expected and implied that the selection frequency of very few variables could go above 60% for any λ.
Similar trends were observed in the settings using a sample size of 720 and in the settings with correlation structure. As observed previously, correlation structure generally helped alleviate the imbalance problem to some extent for LASSO-based methods (Figure 3.11).
3.4.3 Summary
The simulation results demonstrate that the skewness of the data had little impact on the performance of SS-thr and SS-test. The decrease in true positives with increasing skewness is caused by the decrease in the effective sample size. Together with the observations in the previous chapter, it appears that penalized logistic regression is relatively robust in an imbalanced data environment in terms of both class prediction and feature selection. In addition, the number of noise features selected by SS-thr and SS-test was robust to the imbalance ratio. When the data were highly skewed, the capability of SS-thr and SS-test to detect true features was superior to that of LASSO with cross validation.
The performances of the test-based and cutoff-based procedures revealed another advantage of the test-based approach: when integrated with different sampling schemes, the test-based procedure demonstrated remarkable flexibility. The number of false positives was remarkably stable and controlled at the desired level. By contrast, the error control bound for standard stability selection was targeted specifically at the standard subsampling scheme and would need to be modified in order to combine with an alternative sampling framework. Theoretically, the features selected by the test-based procedure could always be attained by a cutoff-based procedure with a certain cutoff π and E(V); the trick for Bal-thr or Boot-thr to achieve comparable FPN and TPN was to lower the cutoff to below 0.5. Instead of deriving such a bound for every specific alteration, SS-test provides a robust approach to control the error while achieving very competitive power at the same time.
When using LASSO as the basic learner, a more balanced sampling framework did not benefit feature selection in skewed data. However, it was still encouraging that balanced sampling performed equally well when integrated with the test-based algorithm. The size of each subsample under balanced sampling was at most 4n_1/(n_1 + n_0) of that of an imbalanced subsample, leading to a considerable gain in computational efficiency when n_1 << n_0. In addition, generalized stability selection does not require LASSO to be the basic learner, and the balanced sampling framework could contribute when the basic learner is not robust to skewness.
Figure 3.1: An example of a regularization path (left) and the corresponding stability path (right). The red curves in the stability path indicate features in the active set.
Figure 3.2: Illustration of SS-test. A cross section at λ = 0.2034 was taken from the stability path (left), and the empirical cumulative distribution function and the expected cumulative distribution function of a Beta distribution with the first two moments matched were plotted using the selection frequencies of all features except the five largest (right). The red curves in the stability path indicate features in the active set.
Figure 3.3: Impact of varying the frequency cut-offs on the TPN with n = 200. For SS-thr (top, red), E(V) was set at 1. For SS-test (bottom, blue), q was set at 0.2.
Figure 3.4: Error control for SS-thr and SS-test across various critical values. Trend of FPN for SS-thr (top) and SS-test (bottom) as a function of sample size. From lower to higher FPN, the curves correspond to E(V) = 1, 2, 4, and 6 for SS-thr and q = 0.1, 0.2, 0.4, and 0.6 for SS-test, respectively. Predictors were independent.
Figure 3.5: Error control for SS-thr and SS-test across various numbers of active features (s = 0, 4, 7, and 15). Trend of FPN for SS-thr (top) and SS-test (bottom) as a function of sample size. E(V) was set at 1 for SS-thr and q at 0.16 for SS-test. Predictors were independent.
Figure 3.6: Error control for SS-thr and SS-test across various critical values with block design. Trend of FPN for SS-thr (top) and SS-test (bottom) as a function of sample size. From lower to higher FPN, the curves correspond to E(V) = 1, 2, 4, and 6 for SS-thr and q = 0.1, 0.2, 0.4, and 0.6 for SS-test, respectively. Predictors followed a block design.
Figure 3.7: Capability of identifying the active set for different selectors given FPN at 1. Trend of FPN (top) and TPN (bottom) of LASSO (black), Bonferroni-corrected ULR (green solid), ULR (green dashed), SS-thr (red), and SS-test (blue) as a function of sample size. Predictors were independent.
Figure 3.8: Capability of identifying the active set for different selectors given FPN at 1 with block design. Trend of FPN (top) and TPN (bottom) of LASSO (black), Bonferroni-corrected ULR (green solid), SS-thr (red), and SS-test (blue) as a function of sample size. Predictors followed a block design.
Figure 3.9: Capability of identifying the active set for SS-thr and SS-test with the targeted error control bound at 1. Trend of FPN (top) and TPN (bottom) of SS-thr (red) and SS-test (blue) with increasing sample size. Predictors followed an independent design.
Figure 3.10: FPN and TPN of different selectors in imbalanced data. Trend of FPN (top) and TPN (bottom) of LASSO (black), Bonferroni-corrected ULR (green solid), ULR (green dashed), SS-thr (red solid), SS-test (blue solid), Bal-thr (red dashed), Bal-test (blue dashed), Boot-thr (red dotted), and Boot-test (blue dotted) with decreasing case-control ratio. Predictors followed an independent design. E(V) was set at 0.4 for the cutoff-based approaches while q was set at 0.3 for the test-based approaches.
Figure 3.11: FPN and TPN of different selectors in imbalanced data with block design. Trend of FPN (top) and TPN (bottom) of LASSO (black), Bonferroni-corrected ULR (green solid), SS-thr (red solid), SS-test (blue solid), Bal-thr (red dashed), Bal-test (blue dashed), Boot-thr (red dotted), and Boot-test (blue dotted) with decreasing case-control ratio. Predictors followed a block design. E(V) was set at 0.2 for the cutoff-based approaches while q was set at 0.3 for the test-based approaches.
Chapter 4
Rank-Based Stability Selection
The framework in which stability selection resides is composed of three major components: perturbation of the data, feature selection with a basic learner, and information aggregation. In SS-thr or SS-test, these tasks are performed by subsampling, LASSO feature selection, and a summary of the selection of features across subsamples in terms of selection frequency. From the perspective of information aggregation, features selected in a subsample are assigned the same weight in standard stability selection. In practice, active features usually enter the model at an early stage since they have the highest correlation with the outcome. I suspect that the capability of stability selection to identify active features could be further boosted by assigning a higher weight to features that enter the model earlier. The easiest way to incorporate this idea into the stability selection framework is to calculate the average rank of selection across subsamples instead of the selection frequency. Here I propose cutoff-based SS-rank and test-based SS-ranktest and investigate their performance in a simulation study.

4.1 Motivation and description of framework

Stability selection has been applied to various gene signature identification studies. A common modification in the aggregation step is to compute ranks of features in a model instead of selection frequencies. The rank can be derived from a univariate score (Siebourg et al. 2012), the order of selection in SVM-RFE (Abeel et al. 2010), or the order of selection as λ decreases in LASSO or elastic net (Haury et al. 2011, Haury et al. 2012). In the aggregation step, the ranks are usually averaged across subsamples, and the features with the lowest average ranks are selected. However, most studies investigating these strategies simply used this idea to rank the importance of features and did not examine its error control properties.
As observed in Section 3.3, stability selection tended to detect fewer active features than LASSO in exchange for much better error control. In standard stability selection, I assigned 1 to all features selected and 0 to all features not selected in a given subsample, i.e., I gave identical weights to all of the features selected by LASSO. However, most features selected along the regularization path are false, and their selection can potentially degrade the ability to detect active features. Rather than 0-1 weights, one could give more weight to features that enter earlier in the regularization path, for instance by using the order of selection instead of the selection frequency. I should point out that the assumption one needs to make here is that active features are likely to enter the path earlier, and I acknowledge that active features with a small effect size may not enter the path early. However, the ultimate goal is to build a gene signature that can predict the outcome, and in this case features with a small effect on the outcome are not likely to help much with prediction.
Notably, ranks are calculated using information along the whole path. Thus it is not necessary to find a subset of Λ as in SS-thr or to average over a λ region as in SS-test. In other words, rank-based stability selection is parameter-free.
To integrate ranks into the test-based procedure, ranks are transformed into scores ranging between 0 and 1. The rank score is defined as

f(r) = (m + 1 - r) / (m + 1),     (4.1)

where r is the rank of a feature and m equals the largest rank among all selected features. When a feature is not selected, its rank is assigned the value m + 1 so that its score is 0. One is added to both the numerator and the denominator so that the last feature selected has a higher score than features that are not selected. The scores are averaged across subsamples, resulting in a vector of average scores of length p. To guarantee sparsity, the value of m should be relatively small, so that most of the average scores are close to 0. The test-based procedure is then used to learn this vector and detect outliers from the null distribution. The cutoff-based procedure is integrated more directly: the cutoff is set a priori and features with scores above the cutoff are declared stable. The cutoff-based and test-based versions are respectively named "SS-rank" and "SS-ranktest" in the simulation studies.
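As a small illustration of equation 4.1, the sketch below converts per-subsample selection ranks into rank scores and averages them across subsamples; the function names and the input format (a rank vector and a logical "selected" vector per subsample) are assumptions made for illustration.

# Rank score of equation 4.1: unselected features get rank m + 1 and score 0.
rank_score <- function(ranks, selected) {
  m <- max(ranks[selected])               # largest rank among selected features
  r <- ifelse(selected, ranks, m + 1)
  (m + 1 - r) / (m + 1)
}

# Average rank scores over lists of per-subsample ranks and selection indicators.
average_rank_score <- function(rank_list, selected_list) {
  scores <- mapply(rank_score, rank_list, selected_list)
  rowMeans(scores)                        # one average score per feature
}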
4.2 Simulation plan
I used the same procedures for data generation and for the implementation of ULR, LASSO, SS-thr, and SS-test as in Section 3.3.1.

For SS-rank and SS-ranktest, a LASSO procedure with 100 grid values of λ was fit in each subsample. I used the rank of t_k, the number of times the coefficient of feature k was nonzero across the 100 fits, to define the rank of feature k.

In practice this works similarly to ranking features by their order of selection. It has the advantage of not assigning high ranks to features that entered once and then exited the path; according to the LARS and LASSO algorithms, such features exit because their sign changes during optimization, and they are usually noise features. In this scheme, all features that are never selected share the same, largest rank. The rank of each feature was then transformed into a rank score using equation 4.1, with m equal to the largest rank among all selected features. The scores were averaged across subsamples. The performance of SS-rank was evaluated in terms of FPN and TPN using a range of cutoff values; SS-ranktest was evaluated using a range of q values.
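A minimal sketch of this rank derivation, assuming a glmnet LASSO path over 100 λ values, is shown below; combining its output with the rank_score function sketched earlier gives the per-subsample scores. The function name is illustrative.

library(glmnet)

# Sketch: within one subsample, rank features by how often their coefficient is
# nonzero along a 100-value lambda path (t_k); larger t_k means a better rank.
path_ranks <- function(x, y) {
  fit <- glmnet(x, y, family = "binomial", nlambda = 100)
  t_k <- rowSums(as.matrix(fit$beta) != 0)      # nonzero count per feature
  selected <- t_k > 0
  ranks <- rank(-t_k, ties.method = "average")  # rank 1 = largest t_k; ties averaged
  list(ranks = ranks, selected = selected)      # unselected features share the largest ranks
}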
For each simulation scenario above I ran 100 replicates, after which I summarized the mean and standard deviation of the performance metrics. All statistical analyses and simulations were carried out using the R language for statistical computing (R version 3.0.0) (R Core Team 2014).
4.3 Results
First, the set of top s features, Ŝ_s, was identified for each feature selector based on the selection frequency (SS-thr), the rank score (SS-rank), the p value (ULR), or the order of entering the path (LASSO with cross validation). A larger overlap between Ŝ_s and the active set S^(0) indicates a better ability to select features. When s = 7 and the predictors were independent, the performance of SS-thr and SS-rank was very similar to that of LASSO in terms of both the mean number of overlaps and its standard deviation (Figure 4.1, top). ULR, on the other hand, did not perform as well as the above methods. The performance of ULR was more severely affected by correlation structure than that of the other methods (Figure 4.1, bottom), as has been reported previously (Haury et al. 2011).

I next examined the FPN for different cutoffs for SS-rank and SS-ranktest. When the features were independent, as the sample size increased, the FPN for SS-rank increased initially and then became stable after the sample size reached 300 (Figure 4.2, top), while the number of false positives for SS-ranktest remained steady as the sample size increased (Figure 4.2, bottom). The correspondence between FPN and the q value in SS-ranktest was similar to that in SS-test. When correlation structure was present, the general trend of FPN for both selectors appeared similar and the magnitude was generally larger at the same significance level (Figure 4.3).

The ability of SS-rank and SS-ranktest to detect active features was not superior to that of SS-thr and SS-test with FPN controlled at 1 (Figure 4.4). The same trend was observed in the presence of correlation structure (Figure 4.5).
4.4 Summary
I did not observe an advantage of SS-rank and SS-ranktest in terms of detecting active features, either by examining the top s features selected or by fixing FPN at the same level. It appeared that the ability of stability selection to detect active features could not be improved by assigning weights according to rank. I argued that features with large effect sizes could be detected by all of the selectors, so that a higher weight did not help, while features with small effect sizes were difficult to select because it was hard to differentiate them from "top" false signals with comparable apparent effect sizes, so that both they and the "top" false signals received relatively high weights. Nevertheless, the rank-based method has one advantage over standard stability selection in that the procedure does not involve any parameters.

Compared with SS-thr, the cutoff-based SS-rank controlled FPN more consistently across changes in sample size. When the sample size was below 300, however, FPN appeared to increase linearly with sample size. This is probably because the total number of features selected in each subsample was lower for smaller sample sizes. The rank scores calculated by equation 4.1 are influenced by m, the largest rank among all selected features, i.e., the total number of features selected. For example, a feature ranked at 7 has a score of 0.93 when there are 99 selected features but only a score of 0.3 when there are 9 selected features. This bias could be fixed by deriving a cutoff based on the number of selected features, either theoretically or empirically. However, the most natural solution is already at hand: the test-based algorithm fixes this problem by determining the cutoff from the distribution of the data.

The test-based procedure once again demonstrated its flexibility to incorporate alternative approaches. The combination of the test-based algorithm with rank scores was almost seamless, and when the features were independent of each other, FPN was controlled at the desired level by the corresponding q value used in SS-test.
However, the FPN for SS-ranktest was more heavily influenced by correlation structure than that of SS-test at a given significance level, which was also observed for SS-thr and SS-rank. We expected this impact of correlation structure, since noise features highly correlated with active features are usually also correlated with the outcome and are difficult for feature selectors to avoid. When the data were perturbed, these "spurious features" either did not enter the model at all, because their correlated active features were already in the model, or entered the model relatively early owing to their apparent effect on the outcome. Therefore, they were easier to avoid in SS-test, due to their low frequency across subsamples, than in SS-ranktest, where they received relatively high average rank scores.

As mentioned previously, several application studies have used stability selection as an approach to rank the importance of features. Here I demonstrated that there was no clear advantage of using either selection frequencies or average ranks over standard LASSO. Notably, all of these methods performed much better than ULR in high dimensional data with some correlation structure, which implies that ULR is more vulnerable than penalized regression to spurious associations.
Figure 4.1: True features among the top seven features detected by different selectors. The trend of the overlap (TPN) between the active feature set and the top seven features selected by SS-thr, SS-rank, ULR, and LASSO is plotted against increasing sample size. The features followed either an independent design (top) or a block design (bottom).
Figure 4.2: Error control for SS-rank and SS-ranktest across various critical values. Trend of FPN for SS-rank (top) and SS-ranktest (bottom) with increasing sample size. The cutoff took the values 0.66, 0.54, 0.42, and 0.36, giving the curves from bottom to top; q took the values 0.1, 0.2, 0.4, and 0.6, giving the curves from bottom to top. The features followed an independent design.
Figure 4.3: Error control for SS-rank and SS-ranktest across various critical values with block design. Trend of FPN for SS-rank (top) and SS-ranktest (bottom) with increasing sample size. The cutoff took the values 0.66, 0.54, 0.42, and 0.36, giving the curves from bottom to top; q took the values 0.1, 0.2, 0.4, and 0.6, giving the curves from bottom to top. The features followed a block design.
Figure 4.4: Capability of identifying active features for different selectors given FPN fixed at 1. Trend of FPN (top) and TPN (bottom) of LASSO (black), Bonferroni-corrected ULR (green solid), ULR (green dashed), SS-thr (red solid), SS-test (blue solid), SS-rank (red dashed), and SS-ranktest (blue dashed), plotted against increasing sample size. Predictors followed an independent design.
Figure 4.5: Capability of identifying active features for different selectors given FPN fixed at 1 with block design. Trend of FPN (top) and TPN (bottom) of LASSO (black), Bonferroni-corrected ULR (green solid), ULR (green dashed), SS-thr (red solid), SS-test (blue solid), SS-rank (red dashed), and SS-ranktest (blue dashed), plotted against increasing sample size. Predictors followed a block design.
Chapter 5
Stability Selection in Ultra High-Dimensional Data
In the previous two chapters, the performance of SS-thr, SS-test, SS-rank, and SS-ranktest was examined in high-dimensional data, where the number of features was on the same scale as, or several times larger than, the sample size. However, ultra high dimensional data, i.e. the p >> n case, is commonly encountered in genomic studies. The number of gene expression features is on the order of tens of thousands, and the number of methylation and genome-wide SNP features is on the order of hundreds of thousands. A common strategy is to use a fast procedure to reduce the dimensionality p from this very large scale to a relatively large scale that the feature selection methods can handle. I propose to integrate the stability selection based methods into a two-step procedure in which a univariate screening step is used to reduce the scale before feature selection. The performance of these approaches, together with that of the stability selection based methods used alone, is examined in this challenging environment.
5.1 Motivation and description of framework
As mentioned in the introduction, it is of interest and necessity to examine the performance of stability selection when p >> n. Several studies have reported that the performance of LASSO decreases considerably for much larger p, possibly because the problems of noise accumulation and spurious association become more serious as the dimensionality increases (see Chapter 1). In addition, the computational burden is heavy in such an environment, even for a relatively simple procedure. To deal with this issue, the most common approach is to first apply a dimensionality reduction method so that the data can then be handled comfortably by a feature selection method. These dimensionality reduction methods should be relatively simple to guarantee computational efficiency; thus a screener based on the univariate relation between each feature and the outcome is most popular. Specifically, a two-step procedure is used in which a screening method is first applied to exclude most features, so that the regular model can then learn from data of a much smaller scale. A commonly used screener for classification is the two-sample t-test (Tibshirani 2003). Fan & Lv (2008) introduced sure independence screening and proposed correlation ranking as a screener, which ranks features according to their marginal correlation with the response; they proved that under certain regularity conditions this method selects all active features with probability close to 1. Due to its simplicity, I use the univariate two-sample t-test as a screener. The p values for the t-test are ranked, and the 1000 features with the smallest p values are kept in the data from which the feature selectors then identify important features. The performance of these selectors with and without this two-step approach is evaluated in the simulation study.
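A minimal sketch of this screening step, assuming a feature matrix x and a binary outcome y, is given below; the top-1000 default follows the text, and the function name is illustrative.

# Sketch: univariate two-sample t-test screener.
# Keeps the k features with the smallest p values for the second-stage selector.
t_test_screen <- function(x, y, k = 1000) {
  pvals <- apply(x, 2, function(col) t.test(col[y == 1], col[y == 0])$p.value)
  keep <- order(pvals)[seq_len(min(k, ncol(x)))]
  list(keep = keep, x_reduced = x[, keep, drop = FALSE])
}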
5.2 Simulation plan
The simulation procedure had the same spirit as in the basic setting. The major differences were a) the number of features p in the data, and b) that each of the stability selection related feature selectors had a corresponding two-step version, named SS-thr2, SS-test2, SS-rank2, and SS-ranktest2 accordingly.

Data Generation  The approach to generate the design matrix and outcomes was the same as in Section 3.3.1, except that (a) the number of features p was set at 10000 and 50000; (b) the largest sample size tested was 600 instead of 1000; and (c) only the independent design and the block design were investigated, and in the block design there were p/100 blocks, each with 100 features.
Feature selection I applied a series of feature selection methods to identify active
features, including ULR, LASSO with cross validation, SS-thr, SS-test, SS-rank, SS-
ranktest, SS-thr2, SS-test2, SS-rank2, SS-ranktest2. In all of the two-step-version selec-
tors, a two-sample t-test was first applied univariately to all features and the 1000 fea-
tures with the smallest p values were identified and included in the data used for the
second step. I evaluated the performance of different methods based on FPN and TPN.
For each simulation scenario above I ran 100 replicates after which I summarized
the mean and standard deviation of the performance metrics. All statistical analysis and
simulations were carried out using the R language for statistical computing (R version
3.0.0) (R core team 2014).
5.3 Results
First, I examined the error rate for a much larger number of features, using the significance levels corresponding to an FPN of around 1 when p = 1000. When the features were independent, SS-thr and SS-rank had an FPN lower than 0.5, while the FPN for SS-test and SS-ranktest was around 1.5 and 2, respectively (Figure 5.1). A screener before feature selection brought the error rates of all feature selectors closer to 1, although the change was subtle. Another thing to notice was that the conservativeness of SS-thr for small sample sizes still held even when E(V) was increased to 10.

When a block design was used for the features, the FPN for SS-test and SS-rank was around 3, and the FPN for both methods returned to around 1 when a first screening step was used (Figure 5.2). The screener also improved SS-thr in terms of stability across sample sizes. When critical values were used to achieve an FPN of around 1 for all methods, the two-step SS-test and SS-ranktest showed a better capability of detecting active features than their one-step counterparts (Figure 5.3).
5.4 Summary
I did not observe considerable differences in the FPN or TPN of SS-test or SS-ranktest for a much larger number of features, especially when the correlation among features was sparse. The error rates for SS-test and SS-ranktest increased in the presence of both high dimensionality and correlation structure. As observed previously, the impact appeared stronger on SS-ranktest. This increase in error rates was not unexpected: correlation structure in combination with high dimensionality increases the number of spurious features, making it difficult for feature selectors to differentiate them from active features. In fact, I found that most of the selected noise features were in the neighborhood of active features.

This problem was alleviated when SS-test or SS-ranktest was performed in a lower dimension, achieved by adding a filtering step that excludes most features based on their association with the outcome. With the two-step approach, the ability to identify active features dropped only a little compared to using SS-test or SS-ranktest alone, while the error rates were much lower. The two-step approach was also more computationally efficient, especially when the number of features was large.

On the other hand, the major issue with SS-thr and SS-rank was that their ability to detect active features decreased considerably, especially when the sample size was small. In most cases, the FPN and TPN for SS-thr were independent of the E(V) chosen, implying that few features had selection frequencies above the cutoff level. The screening step before SS-thr helped improve its ability to identify active features, and the two-step procedure had FPN and TPN comparable to SS-test or SS-ranktest. The conservativeness of SS-rank probably stems from the way the scores are generated in its algorithm, as mentioned in Chapter 4, and a change of cutoff could probably fix that.

Despite the good performance of the two-step approach, I acknowledge that the simulation study is still rather preliminary owing to the limits of time and computational resources. In an ideal two-stage procedure, the two steps should be independent; otherwise the significance level of the second step is likely to be affected. This impact is generally acceptable in terms of prediction, but in feature selection the significance level often determines the probability of false positives. One should take caution when using these two-step approaches and use an independent first step if possible.
Figure 5.1: FPN and TPN for different selectors given p = 10000. Trend of FPN (top) and TPN (bottom) of LASSO (black), Bonferroni-corrected ULR (green solid), ULR (green dashed), SS-thr (red solid), SS-test (blue solid), SS-rank (yellow solid), SS-ranktest (magenta solid), SS-thr2 (red dashed), SS-test2 (blue dashed), SS-rank2 (yellow dashed), and SS-ranktest2 (magenta dashed), plotted against increasing sample size. Predictors were independent. Note that the FPN scale was narrowed to show the differences between the stability selection based methods, so LASSO is not plotted.
Figure 5.2: FPN and TPN for different selectors given p = 10000 with block design. Trend of FPN (top) and TPN (bottom) of LASSO (black), Bonferroni-corrected ULR (green solid), ULR (green dashed), SS-thr (red solid), SS-test (blue solid), SS-rank (yellow solid), SS-ranktest (magenta solid), SS-thr2 (red dashed), SS-test2 (blue dashed), SS-rank2 (yellow dashed), and SS-ranktest2 (magenta dashed), plotted against increasing sample size. Predictors followed a block design. Note that the FPN scale was narrowed to show the differences between the stability selection based methods, so LASSO and ULR-bf are not plotted.
Figure 5.3: Capability of detecting active features for different selectors given FPN fixed at around 1, p = 10000, with block design. Trend of FPN (top) and TPN (bottom) of LASSO (black), Bonferroni-corrected ULR (green solid), ULR (green dashed), SS-thr (red solid), SS-test (blue solid), SS-rank (yellow solid), SS-ranktest (magenta solid), SS-thr2 (red dashed), SS-test2 (blue dashed), SS-rank2 (yellow dashed), and SS-ranktest2 (magenta dashed), plotted against increasing sample size. Predictors followed a block design. Note that the FPN scale was narrowed to show the differences between the stability selection based methods, so LASSO and ULR-bf are not plotted.
Chapter 6
Application to prostate cancer recurrence study
Our application study concerns the prostate cancer whole genome gene expression data that motivated the work in this dissertation. This is an ongoing USC/Illumina study of prostate cancer recurrence after radical prostatectomy. The study generated gene expression data for nearly thirty thousand probes from 187 tumor samples, of which 33 came from patients with recurrent prostate cancer and 154 came from patients with non-recurrent prostate cancer after 8-20 years of follow-up. Our goal was to use penalized logistic regression and stability selection to find a "gene signature" of recurrence that can improve upon PSA and Gleason score, which are well-established but poor predictors. For interpretability and future clinical use, the gene signature should ideally involve a small proportion of probes to predict recurrence in new patients.

In my simulation study in Chapter 2, I evaluated the performance of models selected by different metrics in imbalanced data. I concluded that G-mean-thr (G-mean with an alternative cutoff) and the area under the ROC curve (AUC-ROC) were the metrics most robust to class imbalance.

In Chapter 3, I examined the performance of stability selection (SS-thr) in simulation studies and found that its feature selection capability (a) depended on the stability cutoff chosen and (b) was conservative as a result of stringent error control. To address these problems, I proposed new feature selectors based on stability selection, including SS-test, an essentially parameter-free test-based outlier-detection approach, and SS-rank and SS-ranktest, parameter-free rank-based methods. I demonstrated their advantage over SS-thr, ULR, and LASSO with cross validation in extensive simulation studies and also found that all of these stability selection based methods were robust to class imbalance.

In this chapter I apply the lessons learned and the methods developed in Chapters 2-5 to the prostate cancer recurrence data. I focus on investigating a) the performance of penalized regression models selected by different metrics, including accuracy, G-mean-thr, AUC-ROC, etc., and b) the features selected by ULR, LASSO with cross validation, SS-thr, SS-test, SS-rank, and SS-ranktest, with or without the combination of a filtering step.
6.1 Methods
Sample description  The prostate cancer recurrence study includes patients who underwent a radical retropubic prostatectomy and bilateral pelvic lymph node dissection (RRP/PLND) at the University of Southern California from 1988 to 2008. Malignant glands representative of the overall Gleason score of each patient were obtained using laser capture microdissection. After RNA isolation, whole genome gene expression profiles for over 29,000 transcripts were obtained using the Whole Genome DASL HT platform (Illumina, Inc.). In the analysis, we included 187 organ-confined patients with available formalin-fixed paraffin-embedded prostate tissue from RRP. Of these, 154 were controls, i.e., with no recurrence following RRP, while 33 were cases, i.e., with clinical metastatic recurrence after RRP. The raw gene expression data were preprocessed by background correction, normalization, and batch effect adjustment. In addition to the gene expression profiles, the data contained clinical features, including operation year, age, ethnicity, PSA, Gleason score, disease stage, and capsular invasion. Since PSA and Gleason score are the best-known clinical predictors of recurrence, we included them in all our models (i.e., their coefficients were not penalized in the penalized regression models). Also, in all analyses below the genetic and clinical features were left unstandardized.
Regularization by cross validation  LASSO and elastic-net regularized logistic regression were applied to the data set with α ranging over 0.2, 0.4, 0.6, 0.8, and 1, using the R package glmnet (Friedman et al. 2010). A regularization path with 80 grid values of λ was selected by the original algorithm. 5-fold cross validation was used to find the best combination of (α, λ). The performance metrics used to choose the regularization parameters included accuracy, G-mean, F-measure, κ, accuracy-thr, G-mean-thr, F-thr, κ-thr, binomial deviance, AUC-ROC, AUC-PR, and the precision-recall breakeven point. Specifically, the chosen regularization parameters were those yielding the lowest error across the cross-validation folds according to each metric. The performance of each of the constructed classifiers was evaluated by its accuracy in both the case and the control classes, using both a 0.5 cutoff and a case-proportion cutoff on the predicted probability of recurrence.
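A sketch of this tuning step with the glmnet package is shown below; it uses AUC as the cross-validation measure for brevity (the thresholded metrics such as G-mean-thr would require a custom fold loop), it assumes the clinical covariates occupy the first columns of the feature matrix, and the object and function names are illustrative.

library(glmnet)

# Sketch: elastic-net logistic regression tuned over a grid of alpha values by
# 5-fold cross validation, keeping the clinical covariates unpenalized.
# x: feature matrix with the clinical columns first; y: 0/1 recurrence outcome.
tune_enet <- function(x, y, n_clinical = 2, alphas = c(0.2, 0.4, 0.6, 0.8, 1)) {
  pf <- c(rep(0, n_clinical), rep(1, ncol(x) - n_clinical))  # 0 = not penalized
  fits <- lapply(alphas, function(a)
    cv.glmnet(x, y, family = "binomial", alpha = a, nfolds = 5,
              type.measure = "auc", penalty.factor = pf))
  best <- which.max(sapply(fits, function(f) max(f$cvm)))    # cvm holds mean AUC here
  list(alpha = alphas[best], fit = fits[[best]])
}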
Feature selection  I applied several feature selection methods to identify true features, including Bonferroni-corrected univariate logistic regression (ULR), LASSO-regularized logistic regression with cross validation (Friedman et al. 2010), SS-thr, SS-test, SS-rank, and SS-ranktest. Due to the large number of features, I also used a two-step alternative for all methods except ULR, in which a filtering step testing the association of each feature with the outcome was applied before feature selection and only the top 1000 features with the smallest p values were kept.

The features selected by LASSO with cross validation were based on the "1se method" using G-mean-thr as the performance measure. For the stability selection based methods, I used 500 subsamples and a grid of 50 values for λ. The feature set selected
by the stability selection related methods was based on the critical value (E(V) or q) with a corresponding average FPN of around 2. The selected probes and their mapped genes for each method were reported. The selected features were fitted in a logistic regression model, which we named the "SS3 model". The performance of the model was evaluated by G-mean-thr or AUC-ROC using 5-fold cross validation and compared with the model containing only clinical features. Recognizing the bias introduced by feature selection, I also applied a parametric bootstrap resampling approach to assess the significance. Specifically, in each bootstrap resample, the original observations for all the features were kept in the data while the outcome was resampled from the Bernoulli distribution Bernoulli(expit(β_0 + β^T x_n)), where β_0 and β were estimated from a logistic regression model with the clinical features only and x_n is the vector of clinical feature values for that observation. Stability selection was applied to this resampled data, and the three features with the highest selection frequencies were selected and composed the feature set together with the clinical features. A logistic regression model was fitted using this feature set and its AUC-ROC across 5-fold cross validation was recorded. The procedure was repeated 10000 times, generating a null distribution of AUC-ROC. We report the p value of the "SS3 model", which corresponds to the probability that the AUC-ROC of the "SS3 model" is smaller than or equal to that of a model in the resamples.
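A condensed sketch of this parametric bootstrap is given below; it assumes a data frame dat with clinical columns psa and gleason, a 0/1 outcome recur, and two helper functions (stability_top_features and cv_auc) that are not shown and exist only for illustration.

# Sketch of the parametric bootstrap null distribution for the SS3 model's AUC.
# cv_auc(features, data) is assumed to return the 5-fold cross-validated AUC of a
# logistic model using 'features' plus the clinical covariates.
parametric_bootstrap_p <- function(dat, observed_auc, B = 10000) {
  clin_fit <- glm(recur ~ psa + gleason, family = binomial, data = dat)
  p_hat <- fitted(clin_fit)                      # expit(b0 + b^T x_n) per subject
  null_auc <- replicate(B, {
    dat_b <- dat
    dat_b$recur <- rbinom(nrow(dat), 1, p_hat)   # resample outcome under the null
    top3 <- stability_top_features(dat_b, k = 3) # assumed helper: top 3 by frequency
    cv_auc(top3, dat_b)
  })
  mean(null_auc >= observed_auc)                 # bootstrap p value
}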
All statistical analysis was carried out using the R language for statistical computing
(R version 3.0.0) (R core team 2014).
6.2 Results
Model selection  The performance of the models selected by the different metrics, evaluated with the 0.5 cutoff and with the case-proportion cutoff, is summarized in Table 6.1 and Table 6.2. When 0.5 was used as the probability cutoff between cases and controls in the evaluation, all selected models had a skewed performance, with near-perfect prediction in controls (accuracy of at least 0.974) and poor prediction in cases (accuracy of at most 0.329). By contrast, when the case-proportion cutoff, which was 0.176 in this data set, was used, all selected models predicted better in the minority class, with accuracy of at least 0.5. The large differences revealed that the probability cutoff separating cases from controls had moved away from 0.5 in these data and reminded us of the importance of carefully choosing the cutoff value when validating models built from imbalanced data.

In the evaluation of the models selected by all of the examined metrics, the model selected by G-mean-thr had the best prediction accuracy in the cases and also predicted reasonably well, with at least 70% accuracy in both classes. The model selected by the precision-recall breakeven point also had at least 68% accuracy in both classes.
Feature selection  First I tested the performance of the stability selection based methods using LASSO as the basic learner and then compared it with that of ULR and LASSO with cross validation. The genes corresponding to the identified probes are summarized in Table 6.3. While no probes were found significant using ULR with Bonferroni correction, LASSO with cross validation selected at least 6 features more than any of the other methods. The features identified by the stability selection related methods were more consistent and formed a subset of the features identified by LASSO. ABCC11, NKX2-1, and ZYG11A were identified by all of these methods, while PGC was identified by SS-ranktest only. Figure 6.1 shows the stability path drawn from these data by plotting the selection frequencies in SS-thr and SS-test against λ. The 4 probes identified by the stability selection based methods are marked by red curves (the three features that were never penalized are also marked by red curves, with selection frequencies of 1), and they were well separated from the other curves for most λs. Based on the stability path, another potentially active feature was age, whose selection frequency was around 0.5 for most λ and appeared to stand out when λ was large.

Then I reduced the number of features to 1000 with the univariate screener and selected features from the reduced data using all of the above procedures. The genes corresponding to the identified probes are summarized in Table 6.4. LASSO with cross validation identified more features. When the same significance level was used, the cutoff-based SS-thr and SS-rank selected more features, while the test-based SS-test and SS-ranktest selected fewer. The stability path for this reduced data looked similar to that for the full data, and the frequencies of most of the features were higher in the reduced data (Figure 6.2).
Using a feature set including the three selected probes and the unpenalized clinical features PSA and Gleason score, I fit a logistic regression model, which I named the "SS3 model". It had an AUC-ROC of 0.834 and a G-mean (using the case-proportion cutoff) of 0.776. The logistic regression model with only PSA and Gleason score as predictors had an AUC-ROC of 0.675 and a G-mean of 0.604. In the bootstrap resamples, only very few models had a higher prediction accuracy than the "SS3 model" (p = 0.041), suggesting that the "SS3 model" performed significantly better than the clinical-feature-only model.
6.3 Discussion
The application of model selection with different metrics again illustrated the importance of choosing an appropriate cutoff in both model selection and model validation. When a wrong cutoff was used to evaluate performance, all of the selected models, and probably all of the models built, had skewed prediction accuracy. Thus, when a selected model does not perform well in practice, one should consider an alternative cutoff in the evaluation before resorting to models selected by other methods.

G-mean with the case-proportion cutoff performed well in all simulation settings and was also the most favorable here, since its corresponding model predicted well in both classes. Although other simple metrics also combine information from the two classes, they did not perform as well as G-mean, probably because these measures are not targeted at achieving the most balanced prediction accuracies in the two classes. I would expect F-measure to select the most favorable model if the aim were to find a model with both a low false positive rate and a low false discovery rate. Generally, in model selection one should be aware of the properties the desired model should have and use a metric that is designed to achieve those properties. In addition, the models selected by most of the whole-curve metrics did not perform well, probably because of the restricted sample size in each fold.

I also used LASSO with cross validation and the four types of stability selection based methods to select features from the prostate cancer recurrence data. All of the stability selection based approaches selected a subset of the genes identified by LASSO with cross validation, with large overlap among them. The effect sizes of ABCC11, NKX2-1, and ZYG11A were large and consistent, so that their selection frequencies were above 60% over most of the λ region. It is questionable whether PGC is an active gene: although its selection frequency was well separated from the noise features for larger λs, its frequency decreased with decreasing shrinkage for smaller λ.
This unusual behavior indicates that PGC exits the model in some subsamples. An early exit from the regularization path usually implies that the effect of that feature on the outcome is unstable and that the feature is possibly a false signal.

I demonstrated that the "SS3" model had significantly higher prediction accuracy than the clinical-feature-only model in our data. The bias resulting from larger models or from using the top selected features was corrected by a bootstrap resampling approach. Nevertheless, external validation data are still necessary to confirm that the "SS3" model can predict the recurrence of prostate cancer better than clinical-feature-only models.

In the simulation study, I demonstrated that SS-thr could control the number of noise features at a very low level as long as the correlation among features was not high, as is the case here. Therefore, the three features identified were probably active features, and they remained the only features SS-thr could identify even when E(V) was increased to 50. In this particular study we aim to identify more active features with some tolerance of noise features. Therefore SS-test with q = 4 (corresponding to an average of 5 noise features selected in the simulation) was also applied to the data, which selected 11 genes and the clinical feature age. It would be interesting to apply this larger model to the prediction of prostate cancer recurrence when external data become available.

Most of the features identified by SS-test and SS-ranktest were no longer selected when the two-step procedure was used. This is probably owing to the change in the actual significance level for the second step. The same significance level was used in the second step when SS-test or SS-ranktest was applied to the data, which is only valid if the distribution of the features retained for the second stage is the same as the distribution of the excluded features. The background noise was augmented in the stability path plot for the second stage (Figure 6.2), probably because many of the remaining features were weakly associated with the outcome. Therefore, signals failed to be identified at the same significance level as used in SS-test.
On the other hand, when the two-step procedure was used, the frequencies, or average ranks, of most features increased, prompting more features to be identified by stability selection. This is potentially problematic in terms of error control, and I would expect even more features to be identified if a smaller number of features were used in the second step. For example, the gene CPVL, identified using the two-step approach, does not appear plausible on the stability path plot. I suggest that one always consult the stability path plot when interpreting the results of stability selection.

Although the performance in the simulation study was good, the results of this application study suggest that one should still take caution when using the two-step approach. Since the two steps were not independent, it was generally difficult to find an appropriate significance level for the second step. Depending on the particular problem under investigation, a first screening step independent of the second step is more desirable and easier to control, e.g., filtering out features with small variance. In addition, when the correlation structure in the data is not very dense, the performance of the stability selection based methods is not affected considerably by the dimensionality.

To summarize, I used a variety of metrics to perform model selection within the penalized logistic regression framework using the imbalanced prostate cancer recurrence data and demonstrated that G-mean with the case-proportion cutoff selected the model with the most balanced prediction accuracy in cross validation. In addition, I showed the importance of using an appropriate cutoff to evaluate models built from skewed data. I also applied the stability selection based methods SS-thr, SS-test, SS-rank, and SS-ranktest to select important genes from the same data. Three genes, ABCC11, NKX2-1, and ZYG11A, were identified by all of the methods and also appeared to stand out from the other features in the stability path plot. I fit a logistic regression model using these genes and the clinical features, which had significantly higher prediction accuracy than the clinical-feature-only model when evaluated by cross-validation.
Table 6.1: Performance of models selected by metrics with 0.5 cutoff in prostate cancer recurrence data

Metric       α     d.f.   Acc (sd)        Case Acc (sd)   Control Acc (sd)
Acc          0.4    89    0.877 (0.027)   0.329 (0.11)    0.993 (0.015)
G-mean       0.4    89    0.877 (0.027)   0.329 (0.11)    0.993 (0.015)
F            0.4    89    0.877 (0.027)   0.329 (0.11)    0.993 (0.015)
κ            0.4    89    0.877 (0.027)   0.329 (0.11)    0.993 (0.015)
Acc-thr      0.4   134    0.855 (0.048)   0.295 (0.131)   0.974 (0.043)
G-mean-thr   0.6    22    0.839 (0.021)   0.152 (0.013)   0.987 (0.03)
F-thr        0.4   134    0.855 (0.048)   0.295 (0.131)   0.974 (0.043)
κ-thr        0.4   134    0.855 (0.048)   0.295 (0.131)   0.974 (0.043)
deviance     0.4    58    0.855 (0.033)   0.238 (0.12)    0.987 (0.03)
AUC-ROC      0.4    86    0.871 (0.025)   0.3 (0.095)     0.993 (0.015)
br-even      0.4    21    0.823 (0.028)   0.09 (0.083)    0.98 (0.045)
AUC-PR       0.4   119    0.866 (0.05)    0.329 (0.11)    0.98 (0.045)

The results are reported as mean (standard deviation) across 5-fold cross validation.
α: elastic net mixing parameter, d.f.: number of nonzero features in the model, Acc: accuracy, br-even: breakeven point
Table 6.2: Performance of models selected by metrics with case proportion cutoff in prostate cancer recurrence data

Metric       α     d.f.   Acc (sd)        Case Acc (sd)   Control Acc (sd)
Acc          0.4    89    0.805 (0.129)   0.59 (0.281)    0.849 (0.131)
G-mean       0.4    89    0.805 (0.129)   0.59 (0.281)    0.849 (0.131)
F            0.4    89    0.805 (0.129)   0.59 (0.281)    0.849 (0.131)
κ            0.4    89    0.805 (0.129)   0.59 (0.281)    0.849 (0.131)
Acc-thr      0.4   134    0.859 (0.105)   0.5 (0.263)     0.934 (0.085)
G-mean-thr   0.6    22    0.736 (0.105)   0.719 (0.206)   0.739 (0.1)
F-thr        0.4   134    0.859 (0.105)   0.5 (0.263)     0.934 (0.085)
κ-thr        0.4   134    0.859 (0.105)   0.5 (0.263)     0.934 (0.085)
deviance     0.4    58    0.747 (0.129)   0.619 (0.286)   0.771 (0.132)
AUC-ROC      0.4    86    0.8 (0.125)     0.59 (0.281)    0.843 (0.124)
br-even      0.4    21    0.694 (0.086)   0.686 (0.229)   0.694 (0.096)
AUC-PR       0.4   119    0.832 (0.112)   0.562 (0.232)   0.888 (0.112)

The results are reported as mean (standard deviation) across 5-fold cross validation.
α: elastic net mixing parameter, d.f.: number of nonzero features in the model, Acc: accuracy, br-even: breakeven point
Table 6.3: Features selected by LASSO and stability selection based methods in prostate cancer recurrence data
Gene LASSO ULR SS-thr SS-test SS-rank SS-ranktest
ABCC11 X X X X X
ADRA2 X
CCLEC4F X
CPVL X
GPR81 X
MMP11 X
NKX2-1 X X X X X
PGC X X X
UPK1A X
ZYG11A X X X X X
Table 6.4: Features selected by LASSO and stability selection based methods in prostate cancer recurrence data using a two-step approach
Gene LASSO2 ULR SS-thr2 SS-test2 SS-rank2 SS-ranktest2
ABCC11 X X X
ANO7 X
ADRA2 X
CD96 X
CCLEC4F X
CPVL X
DUOX1 X
DUOXA1 X
FMNL3 X
GPR81 X
IP6K3 X
KCNQ2 X
MMP11 X
NKX2-1 X X X
OAS2 X
PGC X X X X
PLEKHM3 X
RAPGEF1 X
RRP12 X
S100P X
TXLNB X
UPK1A X
UPK3B X
UTS2D X
ZYG11A X X X
The “2” in the names of some feature selectors indicated a two-step approach was applied
Figure 6.1: The stability path when SS-thr was applied to all features in the prostate cancer recurrence data. The red curves indicate features identified by the method with the cutoff π_thr = 0.6.
Figure 6.2: The stability path when SS-thr was applied as the second step to the features remaining in the prostate cancer recurrence data after filtering by a univariate screener. The red curves indicate features identified by the method with the cutoff π_thr = 0.6.
Bibliography
[1] Thomas Abeel, Thibault Helleputte, Yves Van de Peer, Pierre Dupont, and Yvan
Saeys. Robust biomarker identification for cancer diagnosis with ensemble feature
selection methods. Bioinformatics, 26(3):392–398, 2010.
[2] Hirotugu Akaike. Information theory and an extension of the maximum likelihood
principle. In Second international symposium on information theory, pages 267–
281. Akademinai Kiado, 1973.
[3] Hirotugu Akaike. A new look at the statistical model identification. Automatic
Control, IEEE Transactions on, 19(6):716–723, 1974.
[4] David H Alexander and Kenneth Lange. Stability selection for genome-wide asso-
ciation. Genetic epidemiology, 35(7):722–728, 2011.
[5] Anestis Antoniadis and Jianqing Fan. Regularization of wavelet approximations.
Journal of the American Statistical Association, 96(455), 2001.
[6] Francis R Bach. Bolasso: model consistent lasso estimation through the bootstrap.
In Proceedings of the 25th international conference on Machine learning, pages
33–40. ACM, 2008.
[7] Richard Bellman. Adaptive control processes: a guided tour, volume 4. Princeton
university press Princeton, 1961.
[8] Peter J Bickel, Ya’acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis
of lasso and dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
[9] Andrew P Bradley. The use of the area under the roc curve in the evaluation of
machine learning algorithms. Pattern recognition, 30(7):1145–1159, 1997.
[10] Leo Breiman. Better subset regression using the nonnegative garrote. Technomet-
rics, 37(4):373–384, 1995.
[11] Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
[12] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
[13] Emmanuel Candes and Terence Tao. The dantzig selector: statistical estimation
when p is much larger than n. The Annals of Statistics, pages 2313–2351, 2007.
[14] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. Smote: synthetic minority over-sampling technique. arXiv preprint arXiv:1106.1813, 2011.
[15] Jacob Cohen. Weighted kappa: Nominal scale agreement provision for scaled
disagreement or partial credit. Psychological bulletin, 70(4):213, 1968.
[16] Chad A Davis, Fabian Gerick, Volker Hintermair, Caroline C Friedel, Katrin Fundel, Robert Küffner, and Ralf Zimmer. Reliable gene signatures for microarray classification: assessment of stability and performance. Bioinformatics, 22(19):2356–2363, 2006.
[17] Jesse Davis, Elizabeth S Burnside, Inês de Castro Dutra, David Page, Raghu Ramakrishnan, Vitor Santos Costa, and Jude W Shavlik. View learning for statistical relational learning: With an application to mammography. In IJCAI, pages 677–683. DTIC Document, 2005.
[18] Jesse Davis and Mark Goadrich. The relationship between precision-recall and roc
curves. In Proceedings of the 23rd international conference on Machine learning,
pages 233–240. ACM, 2006.
[19] Pedro Domingos. Metacost: a general method for making classifiers cost-sensitive.
In Proceedings of the fifth ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 155–164. ACM, 1999.
[20] David L Donoho and Jain M Johnstone. Ideal spatial adaptation by wavelet shrink-
age. Biometrika, 81(3):425–455, 1994.
[21] Sandrine Dudoit, Juliet Popper Shaffer, and Jennifer C Boldrick. Multiple hypoth-
esis testing in microarray experiments. Statistical Science, pages 71–103, 2003.
[22] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle
regression. The Annals of statistics, 32(2):407–499, 2004.
[23] Jianqing Fan. Comments on wavelets in statistics: A review by a. antoniadis.
Journal of the Italian Statistical Society, 6(2):131–138, 1997.
[24] Jianqing Fan and Yingying Fan. High dimensional classification using features
annealed independence rules. Annals of statistics, 36(6):2605, 2008.
[25] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likeli-
hood and its oracle properties. Journal of the American Statistical Association,
96(456):1348–1360, 2001.
[26] Jianqing Fan and Runze Li. Statistical challenges with high dimensionality: Fea-
ture selection in knowledge discovery. arXiv preprint math/0602133, 2006.
125
[27] Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimen-
sional feature space. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 70(5):849–911, 2008.
[28] Jianqing Fan and Jinchi Lv. A selective overview of variable selection in high
dimensional feature space. Statistica Sinica, 20(1):101, 2010.
[29] Jianqing Fan, Richard Samworth, and Yichao Wu. Ultrahigh dimensional feature
selection: beyond the linear model. The Journal of Machine Learning Research,
10:2013–2038, 2009.
[30] LLdiko E Frank and Jerome H Friedman. A statistical view of some chemometrics
regression tools. Technometrics, 35(2):109–135, 1993.
[31] Yoav Freund and Robert E Schapire. A desicion-theoretic generalization of on-line
learning and an application to boosting. In Computational learning theory, pages
23–37. Springer, 1995.
[32] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for
generalized linear models via coordinate descent. Journal of statistical software,
33(1):1, 2010.
[33] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regres-
sion: a statistical view of boosting (with discussion and a rejoinder by the authors).
The annals of statistics, 28(2):337–407, 2000.
[34] Jerome H Friedman. Greedy function approximation: a gradient boosting machine.
Annals of Statistics, pages 1189–1232, 2001.
[35] Hui Han, Wen-Yuan Wang, and Bing-Huan Mao. Borderline-smote: A new over-
sampling method in imbalanced data sets learning. In Advances in intelligent com-
puting, pages 878–887. Springer, 2005.
[36] JA Hanely and BJ McNeil. The meaning and use of the area under a receiver
operating characteristic (roc) curve. Radiology, 143(1):29–36, 1982.
[37] Lars Kai Hansen and Peter Salamon. Neural network ensembles. Pattern Analysis
and Machine Intelligence, IEEE Transactions on, 12(10):993–1001, 1990.
[38] Trevor. Hastie, Robert. Tibshirani, and J Jerome H Friedman. The elements of
statistical learning, volume 1. Springer New York, 2001.
[39] Anne-Claire Haury, Pierre Gestraud, and Jean-Philippe Vert. The influence of
feature selection methods on accuracy, stability and interpretability of molecular
signatures. PloS one, 6(12):e28210, 2011.
126
[40] Anne-Claire Haury, Fantine Mordelet, Paola Vera-Licona, and Jean-Philippe Vert.
Tigress: trustful inference of gene regulation using stability selection. BMC sys-
tems biology, 6(1):145, 2012.
[41] Haibo He and Edwardo A Garcia. Learning from imbalanced data. Knowledge
and Data Engineering, IEEE Transactions on, 21(9):1263–1284, 2009.
[42] Haibo He and Xiaoping Shen. A ranked subspace learning method for gene expres-
sion data classification. In IC-AI, pages 358–364, 2007.
[43] Zengyou He and Weichuan Yu. Stable feature selection for biomarker discovery.
Computational biology and chemistry, 34(4):215–225, 2010.
[44] Nathalie Japkowicz. Supervised versus unsupervised binary-learning by feedfor-
ward neural networks. Machine Learning, 42(1-2):97–122, 2001.
[45] Nathalie Japkowicz. Class imbalances: are we focusing on the right issue. In
Workshop on Learning from Imbalanced Data Sets II, volume 1723, page 63, 2003.
[46] Nathalie Japkowicz et al. Learning from imbalanced data sets: a comparison of
various strategies. In AAAI workshop on learning from imbalanced data sets, vol-
ume 68, 2000.
[47] Nathalie Japkowicz and Shaju Stephen. The class imbalance problem: A system-
atic study. Intelligent data analysis, 6(5):429–449, 2002.
[48] Taeho Jo and Nathalie Japkowicz. Class imbalances versus small disjuncts. ACM
SIGKDD Explorations Newsletter, 6(1):40–49, 2004.
[49] Miroslav Kubat, Robert C Holte, and Stan Matwin. Machine learning for the
detection of oil spills in satellite radar images. Machine learning, 30(2-3):195–
215, 1998.
[50] Miroslav Kubat, Stan Matwin, et al. Addressing the curse of imbalanced training
sets: one-sided selection. In ICML, volume 97, pages 179–186, 1997.
[51] Wei-Jiun Lin and James J Chen. Class-imbalanced classifiers for high-dimensional
data. Briefings in bioinformatics, 14(1):13–26, 2013.
[52] Charles X Ling and Chenghui Li. Data mining for direct marketing: Problems and
solutions. In KDD, volume 98, pages 73–79, 1998.
[53] Xu-Ying Liu, Jianxin Wu, and Zhi-Hua Zhou. Exploratory undersampling for
class-imbalance learning. Systems, Man, and Cybernetics, Part B: Cybernetics,
IEEE Transactions on, 39(2):539–550, 2009.
127
[54] Lara Lusa et al. Class prediction for high-dimensional class-imbalanced data. BMC
bioinformatics, 11(1):523, 2010.
[55] Marcus A Maloof. Learning when data sets are imbalanced and when costs are
unequal and unknown. In ICML-2003 workshop on learning from imbalanced
data sets II, volume 2, 2003.
[56] Larry M Manevitz and Malik Yousef. One-class svms for document classification.
The Journal of Machine Learning Research, 2:139–154, 2002.
[57] Christopher D Manning and Hinrich Sch¨ utze. Foundations of statistical natural
language processing, volume 999. MIT Press, 1999.
[58] David Mease, Abraham J Wyner, and Andreas Buja. Boosted classification
trees and class probability/quantile estimation. The Journal of Machine Learning
Research, 8:409–439, 2007.
[59] Nicolai Meinshausen and Peter B¨ uhlmann. High-dimensional graphs and variable
selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
[60] Nicolai Meinshausen and Peter B¨ uhlmann. Stability selection. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.
[61] Nicolai Meinshausen and Bin Yu. Lasso-type recovery of sparse representations
for high-dimensional data. The Annals of Statistics, pages 246–270, 2009.
[62] Michael J Pazzani, Christopher J Merz, Patrick M Murphy, Kamal Ali, Timothy
Hume, and Clifford Brunk. Reducing misclassification costs. In ICML, volume 94,
pages 217–225, 1994.
[63] R Pearson, G Goney, and J Shwaber. Imbalanced clustering for microarray time-
series. In Proceedings of the ICML, volume 3, 2003.
[64] Ronaldo C Prati, Gustavo EAPA Batista, and Maria Carolina Monard. Class imbal-
ances versus class overlapping: an analysis of a learning system behavior. In
MICAI 2004: Advances in Artificial Intelligence, pages 312–321. Springer, 2004.
[65] Foster Provost and Tom Fawcett. Robust classification for imprecise environments.
Machine Learning, 42(3):203–231, 2001.
[66] Foster J Provost and Tom Fawcett. Analysis and visualization of classifier per-
formance: Comparison under imprecise class and cost distributions. In KDD,
volume 97, pages 43–48, 1997.
[67] Foster J Provost, Tom Fawcett, and Ron Kohavi. The case against accuracy esti-
mation for comparing induction algorithms. In ICML, volume 98, pages 445–453,
1998.
128
[68] R Core Team. R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna, Austria, 2014.
[69] Bhavani Raskutti and Adam Kowalczyk. Extreme re-balancing for svms: a case
study. ACM Sigkdd Explorations Newsletter, 6(1):60–69, 2004.
[70] Gideon Schwarz. Estimating the dimension of a model. The annals of statistics,
6(2):461–464, 1978.
[71] Juliane Siebourg, Gunter Merdes, Benjamin Misselwitz, Wolf-Dietrich Hardt, and
Niko Beerenwinkel. Stability of gene rankings from rnai screens. Bioinformatics,
28(12):1612–1618, 2012.
[72] John D Storey and Robert Tibshirani. Statistical significance for genomewide stud-
ies. Proceedings of the National Academy of Sciences, 100(16):9440–9445, 2003.
[73] Yanmin Sun, Mohamed S Kamel, Andrew KC Wong, and Yang Wang. Cost-
sensitive boosting for classification of imbalanced data. Pattern Recognition,
40(12):3358–3378, 2007.
[74] David MJ Tax. One-class classification. 2001.
[75] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
[76] Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu.
Class prediction by nearest shrunken centroids, with applications to dna microar-
rays. Statistical Science, pages 104–117, 2003.
[77] AN Tikhonov and V Ya Arsenin. Solutions of ill-posed problems. 1977. WH
Winston, Washington, DC, 330.
[78] Ivan Tomek. Two modifications of cnn. IEEE Trans. Syst. Man Cybern., (11):769–
772, 1976.
[79] Sara A Van De Geer and Hans C Van Houwelingen. High-dimensional data: p¿¿ n
in mathematical statistics and bio-medical applications. Bernoulli, 10(6):939–943,
2004.
[80] Cornelis Joost van Rijsbergen. Foundation of evaluation. Journal of Documenta-
tion, 30(4):365–373, 1974.
[81] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped
variables. Journal of the Royal Statistical Society: Series B (Statistical Methodol-
ogy), 68(1):49–67, 2006.
129
[82] Bianca Zadrozny, John Langford, and Naoki Abe. Cost-sensitive learning by cost-
proportionate example weighting. In Data Mining, 2003. ICDM 2003. Third IEEE
International Conference on, pages 435–442. IEEE, 2003.
[83] Cun Hui Zhang. Penalized linear unbiased selection. Department of Statistics,
2007.
[84] Peng Zhao and Bin Yu. On model selection consistency of lasso. The Journal of
Machine Learning Research, 7:2541–2563, 2006.
[85] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American
statistical association, 101(476):1418–1429, 2006.
[86] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic
net. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
67(2):301–320, 2005.
130
Abstract
This work is motivated by an ongoing USC/Illumina study of prostate cancer recurrence after radical prostatectomy. The study generated gene expression data for nearly thirty thousand probes from 187 tumor samples, of which 33 came from patients with recurrent prostate cancer and 154 came from patients whose cancer had not recurred after years of follow-up. Our goal was to use penalized logistic regression and stability selection to find a "gene signature" of recurrence that improves upon PSA and Gleason score, which are well established but poor predictors. For interpretability and future clinical use, the gene signature should ideally involve only a small number of probes while still predicting recurrence in new patients. Because of the class imbalance in the data, the model selected by tuning the LASSO penalty parameter on the average cross-validated misclassification rate did not perform in a balanced way: it predicted non-recurrent cancer with high accuracy but recurrent cancer with very low accuracy. In addition, standard penalized regression with cross-validation appeared to select many noise features.

In the simulation study in Chapter 2, I evaluated the performance of models selected by different metrics in imbalanced data and concluded that G-mean-thr (G-mean with an alternative cutoff) and the area under the ROC curve (AUC-ROC) were the metrics most robust to class imbalance. In Chapter 3, I examined the performance of stability selection (SS-thr) in simulation studies and found that its feature selection capability (a) depended on the stability cutoff chosen and (b) was conservative as a result of stringent error control. To address these problems, I proposed new feature selectors based on stability selection: SS-test, an essentially parameter-free, test-based outlier-detection approach, and SS-rank and SS-ranktest, parameter-free rank-based methods. I demonstrated their advantage over SS-thr, ULR, and LASSO with cross-validation in extensive simulation studies and found that all of these stability-selection-based methods were robust to class imbalance.

These newly developed methods and procedures were applied to the prostate cancer recurrence data. Using a variety of metrics for model selection within the penalized logistic regression framework, I showed that G-mean with the case-proportion cutoff selected the model with the most balanced prediction accuracy in cross-validation, and I demonstrated the importance of using an appropriate cutoff to evaluate models built from skewed data. I also applied the stability-selection-based methods SS-thr, SS-test, SS-rank, and SS-ranktest to select important genes from the same data. Three genes, ABCC1, NKX2-1, and ZYG11A, were identified by all of the methods and stood out from the other features in the stability path plot. A logistic regression model fit with these genes and clinical features had significantly higher prediction accuracy than clinical-only models when evaluated by cross-validation.
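To make the quantities mentioned above concrete, the following R sketch shows (i) tuning a LASSO-penalized logistic regression by cross-validated AUC-ROC versus misclassification rate with glmnet, (ii) computing the G-mean at a case-proportion cutoff, and (iii) estimating stability-selection frequencies over subsamples. This is a minimal illustration on simulated data, not the code used for the analyses reported here: the toy data (x, y), the helpers gmean_at_cutoff and stab_freq, the number of subsamples B = 50, and the stability cutoff of 0.6 are all illustrative assumptions.

library(glmnet)

## Simulated imbalanced, high-dimensional toy data (illustrative only)
set.seed(1)
n <- 200; p <- 1000
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(-2 + x[, 1] + x[, 2]))   # 1 = case (minority class)

## Tune the LASSO penalty by two different cross-validation metrics
cv_auc  <- cv.glmnet(x, y, family = "binomial", type.measure = "auc")
cv_miss <- cv.glmnet(x, y, family = "binomial", type.measure = "class")

## G-mean = sqrt(sensitivity * specificity) at a chosen probability cutoff;
## here the cutoff is the observed case proportion rather than the default 0.5.
gmean_at_cutoff <- function(cvfit, x, y, lambda, cutoff = mean(y)) {
  p_hat <- predict(cvfit, newx = x, s = lambda, type = "response")
  pred  <- as.integer(p_hat > cutoff)
  sens  <- sum(pred == 1 & y == 1) / sum(y == 1)
  spec  <- sum(pred == 0 & y == 0) / sum(y == 0)
  sqrt(sens * spec)
}
gmean_at_cutoff(cv_auc,  x, y, lambda = cv_auc$lambda.min)
gmean_at_cutoff(cv_miss, x, y, lambda = cv_miss$lambda.min)

## Plain stability selection (SS-thr flavour): selection frequency of each
## feature over B subsamples of half the data, thresholded at an assumed 0.6.
stab_freq <- function(x, y, lambda, B = 50) {
  freq <- numeric(ncol(x))
  for (b in seq_len(B)) {
    idx <- sample(nrow(x), floor(nrow(x) / 2))
    fit <- glmnet(x[idx, ], y[idx], family = "binomial")
    freq <- freq + (as.numeric(coef(fit, s = lambda))[-1] != 0)
  }
  freq / B
}
pi_hat <- stab_freq(x, y, lambda = cv_auc$lambda.min)
which(pi_hat >= 0.6)   # candidate features at the assumed stability cutoff

For brevity the sketch evaluates the G-mean in-sample and subsamples without stratifying by class; in an imbalanced setting one would evaluate on held-out folds and subsample within each class to preserve the case proportion.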