Prediction and Feature Selection with Regularized Regression in Integrative Genomics

by

Dixin Shen

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOSTATISTICS)

August 2021

Copyright 2021 Dixin Shen
Acknowledgements

I would like to express my greatest gratitude to my advisor and chair of my committee, Dr. Juan Pablo Lewinger, for his guidance and support throughout my Ph.D. Your wisdom helped me become a better researcher and a better person. I would also like to thank my committee members, Dr. David Conti, Dr. Duncan Thomas, Dr. Meredith Franklin, and Dr. Joseph Hacia, for their insights and advice. It was a great pleasure to work with you.

I would like to thank all my friends during my time at USC, who shared many great moments with me.

Last but not least, to my family, especially my parents: thank you for your unconditional support and encouragement.
Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Background
  1.2 Genomics-based prediction of disease traits
    1.2.1 Regression
    1.2.2 Other machine learning methods
    1.2.3 Comparison of predictive methods in genomics
  1.3 Regularized regression
    1.3.1 Ridge regression
    1.3.2 Sparse regularized regression and feature selection
      1.3.2.1 The lasso
      1.3.2.2 The elastic net
    1.3.3 Discussion of ridge regression, the lasso, the elastic net, and best subset selection
    1.3.4 Nonconvex regularized regression
      1.3.4.1 Smoothed clipped absolute deviation penalty (SCAD) and minimax concave penalty (MCP)
      1.3.4.2 The adaptive lasso: approximation to nonconvex regularization
  1.4 Incorporating meta-features into the modeling process

Chapter 2: A Regularized Cox's Hierarchical Model for Incorporating Annotation Information in Omics Studies
  2.1 Abstract
  2.2 Introduction
  2.3 Methods
    2.3.1 Setup and notation
    2.3.2 Model fitting
    2.3.3 Two-dimensional hyperparameter tuning
    2.3.4 Summary
  2.4 Simulations
    2.4.1 Simulation methods
    2.4.2 Simulation results
  2.5 Applications
    2.5.1 Gene expression signatures for breast cancer survival
    2.5.2 Anti-PD1 predictive biomarker for melanoma survival
  2.6 Discussion

Chapter 3: Meta-Feature Guided Regularized Regression for Survival Outcomes
  3.1 Abstract
  3.2 Introduction
  3.3 Methods
    3.3.1 Setup and notation
    3.3.2 Model fitting
      3.3.2.1 Empirical Bayes estimation of the hyperparameters
      3.3.2.2 Marginal likelihood function optimization
    3.3.3 Summary
    3.3.4 Inclusion of unpenalized features
  3.4 Simulations
    3.4.1 Simulation methods
    3.4.2 Simulation results
  3.5 Applications
    3.5.1 Gene expression signatures for breast cancer survival
    3.5.2 Anti-PD1 predictive biomarker for melanoma survival
  3.6 Discussion

Chapter 4: $\ell_0$ Regularized Regression with Correlated Features
  4.1 Introduction
  4.2 Proximal distance algorithm for $\ell_0$ regularized regression
  4.3 Incorporating the data covariance matrix
  4.4 Simulation

Chapter 5: Discussion and Future Work
  5.1 The two proposed methods
  5.2 High dimensionality of genomic data
  5.3 Feature selection
  5.4 Inference

References

Appendices
  A Appendix for chapter 2
    A.1 Computation of diagonal elements of weight matrix
    A.2 Solve regularized weighted least squares with cyclic coordinate descent
    A.3 Example codes for R package 'xrnet'
    A.4 More simulation results
  B Appendix for chapter 3
    B.1 More simulation results
List of Tables

1.1 Meta-feature matrix $Z$ for gene sets/pathways
1.2 Meta-feature matrix $Z$ for multiple types of genomics data
1.3 Meta-feature matrix $Z$ for summary statistics
2.1 METABRIC: Test C-index between standard ridge and ridge-lasso
2.2 METABRIC: Coefficient estimates for metagenes
2.3 Anti-PD1: Coefficient estimates (nonzero) for hallmark meta-features
3.1 METABRIC: Test C-index between standard elastic net and metagene-guided elastic net
3.2 METABRIC: Metagene weights $\alpha$
3.3 Anti-PD1: List of hallmark meta-features and their respective estimated weights
4.1 Prediction and feature selection comparison between lasso and $\ell_0$
List of Figures

1.1 Soft thresholding function $S(x, \lambda) = \mathrm{sign}(x)(|x| - \lambda)_+$
1.2 Estimators of $\beta_j$ in the case of orthonormal columns of $X$
1.3 Constraint regions $\sum_{j=1}^{p} |\beta_j|^q \le 1$ for different values of $q$
1.4 SCAD and MCP penalty functions and their corresponding threshold functions
2.1 Two-dimensional grid pathwise optimization
2.2 Simulation results ('xrnet'): prediction performance
2.3 Simulation results ('xrnet'): meta-feature selection
3.1 Normal approximation of the elastic net prior
3.2 Simulation results (meta guided): prediction performance
3.3 Simulation results (meta guided): feature selection
4.1 Comparison of signal estimates
5.1 Simulation: 'xrnet' prediction performance (i)
5.2 Simulation: 'xrnet' prediction performance (ii)
5.3 Simulation: 'xrnet' prediction performance (iii)
5.4 Simulation: 'xrnet' meta-feature selection (i)
5.5 Simulation: 'xrnet' meta-feature selection (ii)
5.6 Simulation: 'xrnet' meta-feature selection (iii)
5.7 Simulation: meta guided feature selection accuracy (i)
5.8 Simulation: meta guided feature selection accuracy (ii)
5.9 Simulation: meta guided number of selected features (i)
5.10 Simulation: meta guided number of selected features (ii)
Abstract

Advances in high-throughput technologies have enabled the genome-wide interrogation of genetic variation, the transcriptome, and the epigenome. Translating these advances into personalized health care requires the ability to predict disease outcomes based on individual omic profiles. In parallel, extensive research in genomics has produced a vast amount of annotation on the function, structure, and regulation of the genome. This explosion of genomic information has made genomics-based predictive modeling extremely appealing but also challenging in terms of integrating multiple types of genomics data. The purpose of this thesis is to develop modeling techniques that can integrate diverse genomics information to achieve improved prediction performance.

In chapter 1, we review existing predictive modeling techniques and introduce the problem of integrating annotations and other kinds of external information. We discuss the challenges presented by the high dimensionality of genomics data and the best current tools based on regularized regression. We present detailed information about ridge, the lasso, the elastic net, and selected nonconvex regularized regression methods. Since, in addition to prediction, model interpretation is also of great interest, we discuss the feature selection properties provided by sparse regularized regression.

Two predictive models for integrating external information into regularized regression for survival outcomes are proposed in chapters 2 and 3. In chapter 2, we present a regularized hierarchical model that integrates meta-features linearly by modeling the regression coefficients as depending on the external information through their means. The method developed in chapter 3 allows for differential penalization of features guided by external information. The method can also be viewed as a hierarchical model, but one where the regression coefficients depend on the external information through their variances.

We thus provide two different alternatives for integrating external data into predictive modeling with survival outcomes. We show in simulations that with informative external data, prediction performance can improve considerably, and that the quality of feature/meta-feature selection also improves. The proposed models are also applied to two high-dimensional gene expression datasets: a study of gene expression signatures for breast cancer survival, and an anti-PD1 immunotherapy predictive biomarker study for melanoma survival.

Chapter 4 introduces a novel algorithm to solve $\ell_0$-regularized least squares based on the proximal distance algorithm. This extends convex regularization methods like the lasso and ridge to nonconvex approaches for sparser models and less biased coefficient estimation.

Perspectives on data integration, high dimensionality, feature selection, and statistical inference are discussed in chapter 5, where we also propose potential future work.

Keywords: genomics; data integration; regularized regression; feature selection; high dimensional data; convex and nonconvex optimization
Chapter 1

Introduction

1.1 Background

Technological advances of recent decades, such as microarrays and next generation sequencing (NGS), have revolutionized our ability to interrogate the structure, function, and regulation of the genome at an unprecedented scale. This has accelerated the investigation of the effects of the genome, epigenome, and transcriptome on human health, and opened the door to personalized medicine. Personalized medicine refers to our ability to diagnose and predict health outcomes and response to treatment based on an individual's 'omic' profile, so that treatment can be tailored to optimize the individual's health outcomes.

On top of the genomic, transcriptomic, and epigenomic profiles measured on individuals, there sits a layer of 'meta knowledge' about gene function, gene products, and gene regulation that is critical for interpreting the results of studies of genomics and disease. These 'meta-features' can include functional annotations and results from previous studies.

The purpose of this thesis is to develop high-dimensional statistical methods to integrate genomics data with meta-features into the predictive modeling process, with the ultimate goal of improving prediction performance and producing more interpretable results.
1.2 Genomics-based prediction of disease traits

Among the many uses of predicting disease outcomes from genomic data, we can distinguish two important applications: 1) disease prognosis, namely prediction of the course of disease and survival; we refer to genomic features predictive of these outcomes as prognostic biomarkers; and 2) prediction of treatment effect based on patients' genomic profiles, i.e., how patients respond to a treatment; we refer to genomic features predictive of these outcomes as predictive biomarkers (Mandrekar and Sargent, 2010).

In this thesis, the focus is on predictive modeling in cancer genomics, as array- and sequencing-based assays of tumor tissue have enabled clinically actionable insights for many types of cancer. We give a brief overview of some of the predictive modeling techniques commonly used.
1.2.1 Regression

Regression methods are widely used techniques for constructing predictive models. The most popular are linear regression for quantitative traits, logistic regression for binary and multi-category traits, and Cox's proportional hazards regression for survival traits. These regression methods rely on linear combinations of the genomic features/predictors to predict the outcome of interest. The training data consist of a response vector $\mathbf{y}$ of length $n$ (for survival outcomes, it is a response matrix of dimension $n \times 2$ holding survival time and censoring status) and a data/design matrix $X$ of dimension $n \times p$, where $n$ is the number of independent samples and $p$ is the number of features. The linear predictors for each of the $n$ samples are defined as $\boldsymbol\beta^T \mathbf{x}_i$, where $\mathbf{x}_i$ is the feature vector for the $i$-th sample and $\boldsymbol\beta$ is the vector of regression coefficients to be estimated. The optimization problems for the linear and logistic cases are shown below:

$$ \min_{\boldsymbol\beta} \left\{ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \boldsymbol\beta^T \mathbf{x}_i \right)^2 \right\} \qquad \text{linear regression} \qquad (1.1) $$

$$ \min_{\boldsymbol\beta} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i\, \boldsymbol\beta^T \mathbf{x}_i - \ln\!\left(1 + e^{\boldsymbol\beta^T \mathbf{x}_i}\right) \right] \right\} \qquad \text{logistic regression} \qquad (1.2) $$
We will deal in detail with Cox's proportional hazards regression in later sections, as survival is the main type of trait this thesis is concerned with. Non-linearities can be incorporated into regression models through the inclusion of higher-order and interaction terms.

There are several requirements for applying standard regression methods. First, the number of features $p$ should be less than the number of samples $n$ for the model to be fit. However, this is usually not the case in genomics studies because of the large number of features examined and the limits on sample size due to costs and logistics. For example, biopsies can be highly invasive depending on the location of the tumor, and hence are reserved for some patients. As a result, the number of samples is small to moderate at best, typically ranging from hundreds to thousands. By contrast, the number of genomic features is often huge: there may be tens of thousands of features (e.g., gene expression) or even hundreds of thousands (e.g., methylation), and if there are multiple omic types the numbers can be even larger. Therefore, genomics data are typically very high dimensional ($p \gg n$). Multicollinearity between features is another serious concern. Highly correlated features should have similar regression coefficients (negatively correlated markers have opposite signs, but similar magnitudes), yet in standard regression the coefficient estimates are highly unstable when the features are highly correlated. Other concerns include marker-marker interactions and missing data.
Standard regression models cannot be fit with high-dimensional data; therefore, regularization needs to be introduced. Consider the linear regression of equation (1.1): its solution is the ordinary least squares (OLS) estimator, $\hat{\boldsymbol\beta} = (X^T X)^{-1} X^T \mathbf{y}$. If $X_{n \times p}$ is high-dimensional, $p > n$, the highest possible rank of the matrix $(X^T X)_{p \times p}$ is $\min(p, n) = n$, so it is singular and there is no unique solution for $\boldsymbol\beta$. Intuitively, the model is too complex to fit because there is not enough data. Regularization is a technique for controlling model complexity by shrinking the regression coefficients. Regularized regression can be written as an optimization problem of the form:

$$ \min_{\boldsymbol\beta} \left\{ -\ell(\boldsymbol\beta) + \lambda P(\boldsymbol\beta) \right\} \qquad (1.3) $$

where $\ell(\boldsymbol\beta)$ is the log-likelihood function and $P(\boldsymbol\beta)$ is the regularization/penalty function, which penalizes the regression coefficients and shrinks their estimates toward zero. Shrinkage stabilizes the coefficient estimates (reduces model variance) and decreases model complexity (increases bias). Thus, by controlling the amount of shrinkage, the hyperparameter $\lambda \ge 0$ controls the trade-off between model complexity (bias) and model stability (variance). Different types of regularization techniques will be discussed in section 1.3.
1.2.2 Other machine learning methods

Diagnosis and prognosis of disease outcomes based on omics features are supervised classification problems in machine learning parlance. Therefore, machine learning methods such as tree-based models and neural networks can be applied to these tasks. Classification and regression trees (CART) are known to excel at capturing complex interaction patterns. With multiple tree splits at different nodes on different features, tree-based methods are, at their essence, multi-way interaction models. Omic features can contribute to the prediction of disease outcomes in very different ways. Classical regression models assume additive contributions of the features and cannot handle multi-way interactions parsimoniously; tree-based methods are therefore a great complement to linear models in omics-based prediction. Gradient boosted tree methods are ensembles of many simple trees with only a few terminal nodes. While simple trees can often achieve only slightly better than random predictions, ensemble tree methods can often provide state-of-the-art prediction performance. We briefly describe 'xgboost', one of the most widely used tree boosting algorithms (Chen and Guestrin, 2016). 'xgboost' has the objective function

$$ \mathrm{obj}(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \qquad \hat{y}_i = \sum_{k} f_k(x_i) $$

where $l(y_i, \hat{y}_i)$ is the loss function: a measure of 'distance', for sample $i$, between its label $y_i$ and the model prediction $\hat{y}_i$. The prediction function $\hat{y}_i$ is an ensemble of a series of simple trees (weak learners), $\sum_{k} f_k(x_i)$. The objective function $\mathrm{obj}(\theta)$ is to be minimized, with the added regularization $\Omega(f_k)$ controlling model complexity. The optimization algorithm greedily optimizes one tree at a time with a gradient descent method. The 'xgboost' style of gradient tree boosting has enjoyed great success in many machine learning applications, such as the Netflix prize challenge and a number of Kaggle challenges. Because of its tree-based ability to explore interaction patterns, it is a good alternative to linear models.
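As a concrete illustration of how such a boosted tree ensemble is specified, the sketch below fits a small model with the 'xgboost' R package on mock data; the data and parameter values are purely illustrative (not from this thesis) and are not tuned.

# Minimal gradient tree boosting sketch with the 'xgboost' R package
library(xgboost)

set.seed(1)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)                 # mock omic features
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))      # mock binary outcome

fit <- xgboost(
  data      = X,
  label     = y,
  nrounds   = 50,                  # number of boosted trees (weak learners)
  max_depth = 2,                   # shallow trees with few terminal nodes
  eta       = 0.1,                 # learning rate
  lambda    = 1,                   # L2 regularization on leaf weights (the Omega term)
  objective = "binary:logistic",
  verbose   = 0
)
pred <- predict(fit, X)            # in-sample predicted probabilities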
1.2.3 Comparison of predictive methods in genomics

Gradient boosting machines and neural networks are often superior when the sample size is large, since there is then enough information for them to explore complex non-linear patterns. But in scenarios with smaller sample sizes, regression approaches can often better capture overall feature effects in high-dimensional settings. This is the reason why regression methods are still the most widely used models in genomics, rather than tree-based methods and neural networks, despite the huge success of the latter in other areas.

1.3 Regularized regression

Regularization is essential for dealing with high-dimensional genomics data. There are many types of penalty functions; we describe below some of the most common.
1.3.1 Ridge regression

Ridge regression was proposed by Hoerl and Kennard (1970). It places a regularization penalty on the magnitude of the regression coefficients, namely their squared size. Ridge regression is written as the following optimization problem:

$$ \min_{\boldsymbol\beta} \left\{ -\ell(\boldsymbol\beta) + \lambda \|\boldsymbol\beta\|_2^2 \right\} \qquad (1.4) $$

Here $\lambda$ is the hyperparameter controlling the amount of regularization: the greater $\lambda$ is, the greater the amount of regularization on the coefficients $\boldsymbol\beta$, and the more the coefficients are shrunk toward zero. An equivalent way to write the ridge problem is

$$ \min_{\boldsymbol\beta}\ -\ell(\boldsymbol\beta) \quad \text{subject to} \quad \|\boldsymbol\beta\|_2^2 \le t \qquad (1.5) $$

which is the constrained optimization form of the Lagrangian form (1.4). The coefficients are restricted to lie within an $\ell_2$-norm ball whose size is determined by $t$, and there is a one-to-one correspondence between the parameters $\lambda$ and $t$. When there are many correlated features in a standard regression model, their coefficients can be unstable due to high variance; imposing a size constraint alleviates the problem.

The solution to the ridge optimization problem for linear regression is $\hat{\boldsymbol\beta}^{ridge} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}$. Like the OLS solution, the ridge solution is a linear function of the outcome $\mathbf{y}$. It adds a positive constant $\lambda$ to the diagonal of $X^T X$ before inversion, making the matrix nonsingular even if $X^T X$ is not of full rank (the high-dimensional setting). This is how ridge regression can fit high-dimensional data and other ill-conditioned design matrices $X$. In the case of orthonormal columns of $X$, the ridge solution becomes a scaled version of the OLS solution, i.e., $\hat{\boldsymbol\beta}^{ridge} = \frac{1}{1+\lambda}\hat{\boldsymbol\beta}^{OLS}$, shrinking each coefficient by a factor of $1/(1+\lambda)$. If the columns of $X$ are not orthonormal, ridge regression shrinks the directions with the smallest variance the most. Those directions are in fact the principal component directions of $X$. Principal components are linear combinations of the columns of $X$: the first principal component has the largest sample variance (eigenvalue), and subsequent principal components have maximum variance subject to being orthogonal to the earlier ones. Hence, the small-eigenvalue principal component directions are shrunk the most.
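The closed-form ridge solution above can be verified numerically; the sketch below uses mock data and an illustrative $\lambda$ to show that adding $\lambda$ to the diagonal of $X^T X$ makes the system solvable even when $p > n$.

# Ridge regression closed-form solution for a linear model with p > n
set.seed(1)
n <- 50; p <- 200; lambda <- 2
X <- scale(matrix(rnorm(n * p), n, p))           # standardized mock features
beta_true <- c(rep(1, 5), rep(0, p - 5))
y <- drop(X %*% beta_true + rnorm(n))

# t(X) %*% X alone is singular (rank at most n < p); the ridge term fixes that
beta_ridge <- solve(t(X) %*% X + lambda * diag(p), crossprod(X, y))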
Ridge regression has a Bayesian interpretation. Assuming a linear regression,

$$ \mathbf{y} \mid \boldsymbol\beta, X \sim N(X\boldsymbol\beta,\ \sigma^2 I), \qquad \pi(\boldsymbol\beta) \sim N(0,\ \gamma^2 I) $$

Both the likelihood and the prior are normal; therefore the posterior distribution is also normal, because the normal is its own conjugate family. The negative log posterior density of $\boldsymbol\beta$ is equal to the expression in equation (1.4), with $\lambda = \sigma^2/\gamma^2$. Thus the ridge estimate is the mean and mode of the posterior distribution. In genomics, Bayesian ridge regression is the motivation for the genomic best linear unbiased predictor (G-BLUP) (de Los Campos et al., 2013).
Ridge regression shrinks coefficients toward zero, but never to exactly zero. In other words, it does not perform feature selection through the magnitude of the regression coefficients. If a coefficient is shrunk to zero, that feature is no longer in the model and thus not associated with the outcome; features with coefficients of larger magnitude, whether positive or negative, are more strongly associated with the outcome. Nevertheless, ridge regularization is a widely used technique for controlling model complexity to balance the trade-off between bias and variance: the more complex the model, the less bias but the larger the variance, and vice versa. It is also used in neural networks and gradient boosting machines, where it is known as weight decay.
1.3.2 Sparse regularized regression and feature selection

1.3.2.1 The lasso

Proposed by Tibshirani (1996), the lasso is a regularization method like ridge, but one that performs feature selection. The lasso optimization problem, in Lagrangian form, is defined as

$$ \min_{\boldsymbol\beta} \left\{ -\ell(\boldsymbol\beta) + \lambda \|\boldsymbol\beta\|_1 \right\} \qquad (1.6) $$

It can also be written in the equivalent constrained optimization form, just like ridge:

$$ \min_{\boldsymbol\beta}\ -\ell(\boldsymbol\beta) \quad \text{subject to} \quad \|\boldsymbol\beta\|_1 \le t \qquad (1.7) $$

The similarity to ridge regression is that the $\ell_2$-norm penalty of ridge is replaced by an $\ell_1$-norm penalty for the lasso. The term sparse refers to a model with few nonzero coefficients, and a key property of the lasso is its ability to yield sparse solutions. Let us look at the lasso estimator for linear regression. For the $j$-th feature, i.e., the $j$-th element of the coefficient vector $\boldsymbol\beta$, the coordinate-wise update, for standardized features with mean 0 and variance 1, has the form

$$ \hat\beta_j^{lasso} = S\!\left( \frac{1}{n}\sum_{i=1}^{n} x_{ij}\, r_i^{(j)},\ \lambda \right) \qquad (1.8) $$

where

• $r_i^{(j)} = y_i - \sum_{l \ne j} x_{il}\hat\beta_l$ is the partial residual, which removes from the outcome the current fit from all but the $j$-th predictor. Because features are usually standardized to make the shrinkage comparable, $\frac{1}{n}\sum_{i=1}^{n} x_{ij} r_i^{(j)}$ is the simple least squares solution when fitting this partial residual to $x_{ij}$.
• $S(x, \lambda)$ is the soft-thresholding operator defined as

$$ S(x, \lambda) = \mathrm{sign}(x)\,(|x| - \lambda)_+ = \begin{cases} x - \lambda & \text{if } x > 0 \text{ and } \lambda < |x| \\ x + \lambda & \text{if } x < 0 \text{ and } \lambda < |x| \\ 0 & \text{if } \lambda \ge |x| \end{cases} \qquad (1.9) $$

One can derive this result using the notion of subgradients; the detailed derivation of coordinate descent is described in Friedman et al. (2007). Notice that the lasso solution shrinks the regression coefficient (the least squares solution $\frac{1}{n}\sum_{i=1}^{n} x_{ij} r_i^{(j)}$) by an amount $\lambda$, as long as its magnitude/absolute value is larger than $\lambda$. Features with effect sizes smaller than $\lambda$ are shrunk to 0 and thus excluded from the model (Figure 1.1). This is the main difference between ridge and the lasso: while ridge regression performs a proportional shrinkage, the lasso translates each coefficient by a constant $\lambda$, truncating at zero. Therefore, the lasso has the ability to perform feature selection by excluding unimportant features. In this way, the lasso model is more parsimonious and more interpretable than ridge, which keeps all the features in the model.

Figure 1.1: Soft thresholding function $S(x, \lambda) = \mathrm{sign}(x)(|x| - \lambda)_+$. The figure is from Hastie et al. (2019). The blue broken line is the soft-thresholding estimator, along with the 45° line in black.
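The updates (1.8)-(1.9) are simple to implement directly; the sketch below (my own illustration, assuming standardized features) performs one full cyclic coordinate-descent pass of the lasso for a linear model.

# Soft-thresholding operator S(x, lambda) from (1.9)
soft_threshold <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

# One cyclic coordinate-descent pass over all coefficients, using the
# coordinate-wise lasso update (1.8)
lasso_cd_pass <- function(X, y, beta, lambda) {
  n <- nrow(X)
  for (j in seq_len(ncol(X))) {
    r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]   # partial residual r^(j)
    z_j <- sum(X[, j] * r_j) / n                    # univariate least squares fit
    beta[j] <- soft_threshold(z_j, lambda)          # shrink by lambda, truncate at 0
  }
  beta
}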
There are some important properties of the lasso in addition to feature selection.

• Just like ridge regression, the lasso has a Bayesian interpretation. The prior distribution of $\boldsymbol\beta$ is the double exponential (Laplace) distribution for the lasso, instead of the normal distribution for ridge.

• Degrees of freedom: suppose there are $p$ features and we fit a linear regression using only a subset of $k$ of them. If these $k$ features were chosen without regard to the outcome, the procedure spends $k$ degrees of freedom. However, if the $k$ features were chosen using knowledge of the outcome, as for example in best subset selection, then the fitting procedure spends more than $k$ degrees of freedom; such a fitting strategy is adaptive, and so is the lasso. For the lasso with a fixed penalty parameter $\lambda$, the number of nonzero coefficients $k_\lambda$ is an unbiased estimate of the degrees of freedom (Zou et al., 2007; Tibshirani and Taylor, 2012). The reason the lasso spends exactly $k_\lambda$ degrees of freedom, rather than more, is that it not only selects features, which inflates the degrees of freedom, but also shrinks the coefficients.

• The number of nonzero coefficients is at most $n$, the sample size, when the data are high dimensional, $p > n$.

• If the underlying true signal is sparse, the lasso recovers the true signal well. If the underlying truth is not sparse, the lasso does not work well.
1.3.2.2 The elastic net

Proposed by Zou and Hastie (2005), the elastic net combines ridge and the lasso; it solves the convex optimization problem

$$ \min_{\boldsymbol\beta} \left\{ -\ell(\boldsymbol\beta) + \lambda \left[ \tfrac{1}{2}(1-c)\|\boldsymbol\beta\|_2^2 + c\|\boldsymbol\beta\|_1 \right] \right\} \qquad (1.10) $$

where $c \in [0, 1]$ is a parameter that controls whether the penalty function is closer to the lasso or to ridge. When $c = 1$ it reduces to the $\ell_1$ norm, the lasso penalty; when $c = 0$ it reduces to the squared $\ell_2$ norm, the ridge penalty. The coordinate-wise update for the elastic net linear regression, again assuming the features are standardized to mean 0 and variance 1, is

$$ \hat\beta_j^{enet} = \frac{ S\!\left( \frac{1}{n}\sum_{i=1}^{n} x_{ij}\, r_i^{(j)},\ \lambda c \right) }{ 1 + \lambda(1-c) } \qquad (1.11) $$

We can see that the elastic net estimator shrinks the regression coefficients in the manner of both the lasso and ridge: it has the soft-thresholding portion, truncating at $\lambda c$, and it also shrinks the coefficients proportionally by a factor of $1 + \lambda(1-c)$. Hence, the elastic net shrinks the coefficients and sets some of them to exactly 0, thereby performing feature selection.
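Compared with the lasso pass shown earlier, only the coordinate update changes; a minimal sketch of the elastic net update (1.11), reusing the soft_threshold() helper defined above, is:

# Elastic net coordinate-wise update (1.11): soft-threshold at lambda * c,
# then shrink proportionally by 1 + lambda * (1 - c)
enet_update <- function(z_j, lambda, c) {
  soft_threshold(z_j, lambda * c) / (1 + lambda * (1 - c))
}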
1.3.3 Discussion of ridge regression, the lasso, the elastic net, and best subset selection

Best subset selection finds, for each $k \in \{0, 1, 2, \ldots, p\}$, the subset of size $k$ that gives the smallest residual sum of squares (validation error). Best subset selection for linear regression is equivalent to $\ell_0$-constrained regression when the design matrix $X$ is orthogonal:

$$ \min_{\boldsymbol\beta}\ \frac{1}{2n}\|\mathbf{y} - X\boldsymbol\beta\|_2^2 \quad \text{subject to} \quad \|\boldsymbol\beta\|_0 \le k \qquad (1.12) $$

where $\|\boldsymbol\beta\|_0 = \sum_{j=1}^{p} I(\beta_j \ne 0)$ is defined as the number of nonzero coefficients. Strictly speaking, $\ell_0$ is not a norm because it does not have the properties of a norm, but the naming and notation are widely used. The $\ell_0$ constraint penalizes the number of nonzero coefficients instead of their magnitude, which exactly describes the best subset selection setting. And because it does not shrink the coefficients, the $\ell_0$ estimates are unbiased, while other regularization estimates are biased toward 0. Although best subset selection, or $\ell_0$-constrained regression, is superior in coefficient estimation and feature selection, it has no efficient algorithm when $X$ is not orthogonal. If we want to select the best subset without knowing how many features, $k$, it should contain, then there are $2^p$ combinations of features to be examined; in other words, there is no polynomial-time algorithm to solve it, and the problem is NP-hard. Many approximation methods have been proposed; among them, iterative hard thresholding performs well. The closed-form hard-thresholding solution for $\ell_0$-constrained linear regression when $X$ is orthogonal is
$$ \hat{\boldsymbol\beta}^{\ell_0} = H_{\sqrt{2\lambda}}\!\left( \tfrac{1}{n} X^T \mathbf{y} \right) \qquad (1.13) $$

where $H_{\sqrt{2\lambda}}(\cdot)$ is the hard-thresholding operator,

$$ H_{\sqrt{2\lambda}}\!\left( \tfrac{1}{n} X^T \mathbf{y} \right) = \begin{cases} \tfrac{1}{n} X^T \mathbf{y} & \text{if } \left|\tfrac{1}{n} X^T \mathbf{y}\right| > \sqrt{2\lambda} \\[4pt] 0 & \text{if } \left|\tfrac{1}{n} X^T \mathbf{y}\right| \le \sqrt{2\lambda} \end{cases} \qquad (1.14) $$

It does not shrink the regression coefficients at all, but truncates at $\sqrt{2\lambda}$. This is in close relation to the soft-thresholding estimator of the lasso, equation (1.9), which shrinks coefficients by the amount $\lambda$ and truncates at $\lambda$. This is why hard thresholding yields an unbiased estimator of the regression coefficients.
In fact, the lasso is one of many approximations to the $\ell_0$-constrained problem, and it is the closest convex approximation, whereas $\ell_0$ is a nonconvex optimization problem. Figure 1.2 shows the estimators for best subset/$\ell_0$, ridge, and lasso in the case of orthonormal columns of $X$. We can see the unbiasedness of the best subset estimator, the feature selection ability of best subset and lasso, and the different shrinkage schemes of ridge and lasso.

Figure 1.2: Estimators of $\beta_j$ for best subset, ridge, and lasso, in the case of orthonormal columns of $X$. Estimators are shown by broken red lines. The figure is from Hastie et al. (2009).
Ridge regression and the elastic net are also convex optimization problems. Since ridge, lasso, and elastic net share this nice property, they have a highly efficient computational algorithm in pathwise coordinate descent (Friedman et al., 2007). The algorithm solves the problems along a decreasing sequence of $\lambda$ values, for the purpose of tuning $\lambda$ via cross-validation. Apart from giving a path of solutions, the algorithm exploits warm starts, initializing $\hat{\boldsymbol\beta}$ with the solution at the previous $\lambda$ value. This works because convex objective functions have continuous solutions along the path. By starting at the previous solution, coordinate descent needs fewer iterations, which leads to a more efficient and stable algorithm. By contrast, iterative hard thresholding is not as efficient, and only guarantees reaching a local minimum due to the nonconvexity of $\ell_0$.

The lasso does not deal with highly correlated features very well; the solutions tend to be unstable. If there is a group of variables among which the pairwise correlations are very high, the lasso tends to select only one variable from the group, essentially at random. Ridge regression, on the other hand, shrinks the coefficients of correlated features toward each other; in other words, the features in a correlated group share the group effect evenly, having roughly equal coefficients that add up to the group effect size. The elastic net is a combination of ridge and lasso. As seen in the elastic net solution (1.11), it truncates the coefficients like the lasso and shrinks them proportionally like ridge. It thus complements the lasso's inability to deal with groups of correlated features with ridge regularization's grouping effect, which makes it a better choice when features are believed to have a strong correlation structure.
1.3.4 Nonconvex regularized regression

Ridge regression, the lasso, and the elastic net are all convex optimization problems. There are stable and efficient algorithms to solve them, and they always reach their global minimum solutions. Because of this, they are widely used for regularization and for controlling model complexity. However, in moving from the $\ell_2$ (ridge) penalty to the $\ell_1$ (lasso) penalty, the shrinkage of some of the coefficients gets heavier, until they are finally set to 0 with the lasso penalty. In fact, the lasso is the only convex regularization that produces sparse models. When the number of features is large and the true underlying model has only a few features, the lasso is not able to shrink enough coefficients to 0. Nonconvex regularization leads to sparser, less biased solutions. To see this, consider $\ell_q$ regularization,

$$ \min_{\boldsymbol\beta} \left\{ \frac{1}{2n}\|\mathbf{y} - X\boldsymbol\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \right\} \qquad (1.15) $$

for $q \ge 0$. It is the lasso for $q = 1$ and ridge for $q = 2$. Figure 1.3 displays $\ell_q$ regularization in the case of two inputs. For $0 \le q < 1$, the regularization is nonconvex, with the limiting case $q = 0$ corresponding to best subset selection. These nonconvex constraint regions concentrate more mass along the coordinate directions, so the solutions tend to be sparser and less shrunk (less biased toward 0). Unfortunately, along with nonconvexity come combinatorial computational complexity and unstable algorithms. Alternative nonconvex regularization penalties have been proposed.

Figure 1.3: Constraint regions $\sum_{j=1}^{p} |\beta_j|^q \le 1$ for different values of $q$. For $q < 1$, the constraint region is nonconvex. The figure is from Hastie et al. (2009).
1.3.4.1 Smoothed clipped absolute deviation penalty (SCAD) and minimax concave penalty (MCP)

Proposed by Fan and Li (2001), the SCAD penalty, defined on $[0, \infty)$ (and symmetric on $(-\infty, 0)$), is given by

$$ P_{\lambda,\gamma}(\beta) = \begin{cases} \lambda\beta & \text{if } \beta \le \lambda \\[4pt] \dfrac{\gamma\lambda\beta - 0.5(\beta^2 + \lambda^2)}{\gamma - 1} & \text{if } \lambda < \beta \le \gamma\lambda \\[8pt] \dfrac{\lambda^2(\gamma^2 - 1)}{2(\gamma - 1)} & \text{if } \beta > \gamma\lambda \end{cases} \qquad (1.16) $$

for $\lambda \ge 0$ and $\gamma > 2$. The univariate solution for a SCAD-regularized simple linear regression coefficient is as follows:

$$ \hat\beta = f_{SCAD}(z, \lambda, \gamma) = \begin{cases} S(z, \lambda) & \text{if } |z| \le 2\lambda \\[4pt] \dfrac{S\!\left(z,\ \frac{\gamma\lambda}{\gamma-1}\right)}{1 - \frac{1}{\gamma-1}} & \text{if } 2\lambda < |z| \le \gamma\lambda \\[8pt] z & \text{if } |z| > \gamma\lambda \end{cases} \qquad (1.17) $$

where $z = \frac{1}{n} X^T \mathbf{y}$ is the OLS solution.

Proposed by Zhang et al. (2010), the MC+ penalty, defined on $[0, \infty)$ (and symmetric on $(-\infty, 0)$), is given by

$$ P_{\lambda,\gamma}(\beta) = \begin{cases} \lambda\beta - \dfrac{\beta^2}{2\gamma} & \text{if } \beta \le \gamma\lambda \\[8pt] \dfrac{1}{2}\gamma\lambda^2 & \text{if } \beta > \gamma\lambda \end{cases} \qquad (1.18) $$

for $\lambda \ge 0$ and $\gamma > 1$. The univariate solution for an MC+-regularized simple linear regression coefficient is as follows:

$$ \hat\beta = f_{MCP}(z, \lambda, \gamma) = \begin{cases} \dfrac{S(z, \lambda)}{1 - \frac{1}{\gamma}} & \text{if } |z| \le \gamma\lambda \\[8pt] z & \text{if } |z| > \gamma\lambda \end{cases} \qquad (1.19) $$
The rationale of SCAD and MCP is similar in that both penalties begin by applying the same penalty as the lasso and then reduce the amount of penalty as the regression coefficient gets further away from zero. As a result, when the coefficient is small in magnitude it is shrunk to zero just like the lasso; when the coefficient is large, there is a transition region that shrinks the coefficient less than the lasso, and beyond the transition region the estimate equals the OLS solution without any shrinkage. This is a trend from a less biased (toward 0) estimator to an unbiased one for the features with large effect sizes, which are thereby more likely to be associated with the outcome. Without the transition region, it is the hard-thresholding estimator. The difference between SCAD and MCP lies in the way they make the transition. Figure 1.4 shows the penalty functions of SCAD and MCP and their threshold functions. Indexed by the nonconvexity parameter $\gamma$, they bridge the gap between the lasso ($\gamma = \infty$) and best subset/hard thresholding ($\gamma = 2^{+}$ for SCAD, $1^{+}$ for MCP).

Figure 1.4: SCAD and MCP penalty functions and their corresponding threshold functions. Both are shown with $\lambda = 1$ and different values of $\gamma$. The figure is from Mazumder et al. (2011).

The two nonconvex regularization techniques achieve less biased estimates and sparser subsets than the lasso. The choice between convex and nonconvex regularization depends on the application. For example, for gene expression profile data, where the underlying model is sparse but involves a relatively large subset of features, the lasso is a better choice: with a large number of features in the model, say 1000-2000, the direction of the feature coefficients is more meaningful than their magnitude. For genetic association data, where the underlying model only consists of a few markers, the accuracy of selection and the unbiasedness of the estimators are important; hence nonconvex regularization is a better fit in that situation.

We close the section by mentioning an approximation to $\ell_q$ ($0 < q < 1$) regularization that retains convexity.
1.3.4.2 The adaptive lasso: approximation to nonconvex regularization

Proposed by Zou (2006), the adaptive lasso is a way of fitting models sparser than the lasso. Using a pilot estimate $\tilde\beta$, the adaptive lasso has the form

$$ \min_{\boldsymbol\beta} \left\{ \frac{1}{2n}\|\mathbf{y} - X\boldsymbol\beta\|_2^2 + \lambda \sum_{j=1}^{p} w_j |\beta_j| \right\} \qquad (1.20) $$

where $w_j = 1/|\tilde\beta_j|^{v}$. The adaptive lasso can be seen as an approximation to $\ell_q$ regularization with $q = 1 - v$. Note that the adaptive lasso objective is convex in $\boldsymbol\beta$. Moreover, when the pilot estimates satisfy certain regularity conditions, the method recovers the true model under more general conditions than does the lasso. One can use the least squares solution as the pilot estimate when $p \le n$, and the univariate least squares solutions when $p > n$. The implication is that when the least squares solution is small, the amount of penalty $w_j$ for feature $j$ is large, so $\hat\beta_j$ is more likely to be set to 0; when the least squares solution is large, feature $j$ is penalized less (small $w_j$), so $\hat\beta_j$ is shrunk less, making it less likely to be 0 and less biased.
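In practice the adaptive lasso can be fit with standard lasso software by supplying feature-specific penalty weights. A minimal sketch is shown below, assuming the 'glmnet' package and mock data, with the univariate least-squares pilot estimate appropriate for $p > n$:

library(glmnet)

set.seed(1)
n <- 100; p <- 500; v <- 1
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(X[, 1:5] %*% rep(2, 5) + rnorm(n))

beta_pilot <- drop(crossprod(X, y) / n)     # univariate least-squares pilot estimates
w <- 1 / (abs(beta_pilot)^v + 1e-8)         # adaptive weights w_j (small offset avoids division by 0)

# alpha = 1 gives the L1 penalty; penalty.factor rescales the penalty feature by feature
fit <- glmnet(X, y, alpha = 1, penalty.factor = w)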
1.4 Incorporating meta-features into the modeling process

As the types and volume of genomics data keep growing thanks to advanced high-throughput sequencing technologies, as well as ever-growing annotation databases, there is an increased need to integrate multiple types of genomics data and annotation data into the modeling process. Because more related features provide more information about the outcome of interest, prediction performance can be improved. The traditional approach is to model one type of genomics data at a time and combine the models in some form; annotation data and summary statistics are usually brought in only after the genomics data have been modeled. This style of modeling different genomics data separately may ignore the interplay between them and their collective effect on the outcome. In this thesis, we introduce the concept of meta-features, the features of the features. We introduce a second data matrix $Z$ that systematically stores the extra data. Consider an annotation data set that consists of several functional gene sets/pathways, each containing a group of genes. If we have $p$ genomic features and $q$ functional gene sets (meta-features), the meta-feature matrix $Z$ has dimension $p \times q$; each row represents one genomic feature and holds values of 0 or 1 indicating whether that genomic feature belongs to each gene set (1 indicates it belongs to the gene set, 0 that it does not). Table 1.1 shows the meta-feature matrix for gene sets.

           gene set 1   gene set 2   gene set 3   ...
  gene 1       1            0            0        ...
  gene 2       0            1            0        ...
  gene 3       0            1            1        ...

Table 1.1: Meta-feature matrix $Z$ for gene sets/pathways
The use of the meta-feature matrix is not limited to annotation data; in fact, it can accommodate many types of information. We discuss two situations that show the flexibility of putting external data into the meta-feature matrix $Z$.

• There are three types of genomics data, gene expression, single nucleotide polymorphisms (SNPs), and DNA methylation, to be integrated into the modeling process. The meta-feature matrix indicates whether each genomic feature is a SNP, a gene expression probe, or a methylation marker. Table 1.2 shows the indicator meta-feature matrix. For example, ILMN_343291 is a microarray probe (gene expression), and rs10853372 is a SNP locus.

                  Gene expression   SNP   Methylation
  ILMN_343291            1           0         0
  rs10853372             0           1         0
  ILMN_1651210           1           0         0
  463100A3               0           0         1

Table 1.2: Meta-feature matrix $Z$ for multiple types of genomics data

• There are summary statistics from similar studies on the same set of genomic features. These statistics from meta-analyses can be highly informative. They include p-values, hazard ratios, and the source of the features. In Table 1.3, gene BAX has a p-value of 0.0006 for association with the outcome, its hazard ratio is 0.7605, and the reason it is included in the model is that it came from previous GWAS studies. This is a hybrid matrix holding continuous and indicator values: continuous values like p-values and hazard ratios convey the importance of the features, while indicator variables record the reason a feature is included.

          p_value       Hazard ratio   Literature   GWAS
  BAX     0.0006            0.7605          0          1    ...
  IL6     0.2611           -0.2077          1          0    ...
  LDHB    8.78e-06          0.0768          0          1    ...

Table 1.3: Meta-feature matrix $Z$ for summary statistics
With the above examples, we have shown the flexibility of the meta-feature matrix for housing external information. Through the meta-feature matrix we can integrate multiple types of genomics data, genomic annotation data, summary statistics from similar studies, and so on. It is the heart of our modeling process for integrating extra information that might be useful for prediction.
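As a small illustration of how such a matrix might be assembled, the sketch below builds a binary $p \times q$ gene-set membership matrix like Table 1.1 from made-up gene and gene-set names:

# Building a p x q meta-feature indicator matrix Z from gene-set membership
genes <- c("gene1", "gene2", "gene3")
gene_sets <- list(geneset1 = "gene1",
                  geneset2 = c("gene2", "gene3"),
                  geneset3 = "gene3")

Z <- sapply(gene_sets, function(set) as.integer(genes %in% set))
rownames(Z) <- genes   # rows: genomic features; columns: meta-features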
We propose two modeling strategies based on regularized regression. In chapter 2, we incorporate meta-features through a hierarchical structure:

$$ \min_{\boldsymbol\alpha, \boldsymbol\beta} \left\{ -\ell(\boldsymbol\beta) + \frac{\lambda_1}{2}\|\boldsymbol\beta - Z\boldsymbol\alpha\|_2^2 + \lambda_2\|\boldsymbol\alpha\|_1 \right\} $$

where $\ell(\boldsymbol\beta)$ is the log-likelihood of regressing the outcome on the genomic features, and the coefficients of the genomic features, $\boldsymbol\beta$, are in turn regressed on the meta-features. This integrates the meta-features linearly. In chapter 3, we integrate meta-features in a non-linear way by allowing differential penalties for the individual features,

$$ \min_{\boldsymbol\beta} \left\{ -\ell(\boldsymbol\beta) + \sum_{j=1}^{p} \lambda_j \left[ \tfrac{1}{2}(1-c)\beta_j^2 + c|\beta_j| \right] \right\}, \qquad \lambda_j = e^{\mathbf{z}_j^T \boldsymbol\alpha} $$

The individualized penalty parameters $\lambda_j$ are guided by the meta-features, non-linearly through an exponential function.
Chapter 2

A Regularized Cox's Hierarchical Model for Incorporating Annotation Information in Omics Studies

2.1 Abstract

Associated with high-dimensional omics data there are often "meta-features" such as pathways and functional annotations that can be informative for predicting an outcome of interest. We extend to Cox regression the regularized hierarchical framework of Kawaguchi et al. (2021) for integrating meta-features, with the goal of improving prediction and feature selection performance with time-to-event outcomes. Regularization is applied to the omic features as well as the meta-features so that high-dimensional data can be handled at both levels. The proposed hierarchical Cox model can be efficiently fitted by a combination of iterative reweighted least squares and cyclic coordinate descent. In a simulation study we show that when the external meta-features are informative, the regularized hierarchical model can substantially improve prediction performance over standard regularized Cox regression. Importantly, when the external meta-features are uninformative, the prediction performance based on the regularized hierarchical model is on par with standard regularized Cox regression, indicating robustness of the framework. We illustrate the proposed model with applications to the prediction of breast cancer survival and of overall survival in melanoma, based on gene expression.
2.2 Introduction

Prediction based on high-dimensional omics data such as gene expression, methylation, and genotypes is important for developing better prognostic and diagnostic signatures of health outcomes. However, developing prediction models with high-dimensional omics data, where the number of features is often orders of magnitude larger than the available number of subjects, is challenging. Sparse regularized regression methods, including the lasso (Tibshirani, 1996) and its variants, the elastic net (Zou and Hastie, 2005), adaptive lasso (Zou, 2006), group lasso (Yuan and Lin, 2006) and others, control model complexity by shrinking all regression coefficients toward zero while setting some exactly to zero, effectively selecting features associated with the outcome.

Kawaguchi et al. (2021) have shown that incorporating meta-features related to the omics features can yield improved prediction of an outcome of interest, and they developed a regularized hierarchical regression framework to incorporate external meta-feature information into the analysis of omics data. Examples of meta-features are biological pathways containing different gene sets, functional information from databases like Gene Ontology, and results and summary statistics from previous studies. Their approach handles quantitative and binary outcomes, and the regularization applied to both features and meta-features is ridge only. Here we extend the regularized hierarchical model of Kawaguchi et al. (2021) to handle time-to-event outcomes, and we also add lasso and elastic net penalties for the meta-features.
There are numerous approaches for testing enrichment of meta-features after an analysis relating the genomic features to an outcome of interest has been performed. For example, Subramanian et al. (2005) developed gene set enrichment analysis (GSEA) to yield insights at the gene-group level. The genes in a group, which is a meta-feature of the genes, share a common chromosomal location or biological function. The enrichment analysis is performed after a prior single-gene differential expression analysis. However, there are few approaches capable of incorporating meta-features directly into the modeling process, rather than in a post-hoc fashion. Approaches to incorporating meta-features include the application of differential penalization based on external information, and two-stage regression methods, where the outcome is first regressed on the genomic features and the resulting effect estimates are in turn regressed on the external meta-features. Tai and Pan (2007) grouped genes based on existing biological knowledge and applied group-specific penalties to nearest shrunken centroids and penalized partial least squares. Bergersen et al. (2011) incorporate external meta-feature information by weighting the lasso penalty of each genomic feature with some function of the meta-features. Zeng et al. (2021), on the other hand, incorporate external meta-features to customize the penalty of each feature with a different function of the meta-features. These three methods are based on the idea of differential penalization: they no longer assume that every genomic feature is equally important, but allow different degrees of importance based on external information. However, they cannot handle a large number of meta-features. Chen and Witte (2007) applied the idea of hierarchical modeling in a Bayesian framework, where the second-stage linear regression serves as a normal prior distribution, the first-stage regression is a normal conditional distribution, and the first-stage regression coefficients are estimated with a posterior estimator. This method does not apply to high-dimensional data, since it is standard regression with no regularization. The above data integration methods improve prediction compared to modeling with genomic features only. However, none of them can handle time-to-event outcomes.

In this chapter, we introduce a regularized Cox proportional hazards hierarchical model to integrate meta-features. We will see that when external meta-features are informative, regularized hierarchical modeling improves prediction performance considerably. On the other hand, we also show that when the external meta-features are not informative, it does not perform worse than the standard regularized model, which does not use any external information. This shows that the model is robust to the informativeness of the meta-features and can be safely used when the meta-feature informativeness is a priori unknown, as is typically the case. The model can be efficiently fitted using a combination of iterative reweighted least squares and cyclic coordinate descent, as proposed for lasso Cox regression by Simon et al. (2011).
2.3 Methods

2.3.1 Setup and notation

We assume a survival analysis setting with outcome $(\mathbf{y}, \boldsymbol\delta) = (y_1, \ldots, y_n, \delta_1, \ldots, \delta_n)$, where $\boldsymbol\delta = (\delta_1, \ldots, \delta_n)$ is the vector of censoring statuses for the $n$ subjects: $\delta_i = 1$ indicates that the event occurred and $\delta_i = 0$ indicates right-censoring. The vector $\mathbf{y} = (y_1, \ldots, y_n)$ holds the observed times: if $\delta_i = 1$, $y_i$ is the event time, and if $\delta_i = 0$, $y_i$ is the censoring time. Let $X$ denote the $n \times p$ design matrix, where $p$ is the number of features; each row contains the observations on one subject, and each column the values of one feature across the $n$ subjects. We are particularly interested in the high-dimensional setting, $p \gg n$, where the number of features is larger than the sample size. The goal is to develop a predictive model for the outcome $(\mathbf{y}, \boldsymbol\delta)$ based on the data $X$.

In a genomics context, the time-to-event outcome $(\mathbf{y}, \boldsymbol\delta)$ could be event-free time, time to disease relapse, or time to death. The design matrix $X$ could contain genotypes, gene expression, or DNA methylation. For example, in the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) data, the outcome $(\mathbf{y}, \boldsymbol\delta)$ is breast cancer specific survival, and the data matrix $X$ contains gene expression values with dimension number of patients $\times$ number of genes.

Associated with each feature there is typically a set of meta-features/annotations. If $X$ consists of gene expression values, pathway gene sets could be meta-features indicating the sets of genes involved. For the METABRIC example, 4 meta-features are believed to be associated with breast cancer: genes with mitotic chromosomal instability (CIN), mesenchymal transition (MES), lymphocyte-specific immune recruitment (LYM), and the FGD3-SUSD3 genes. Each meta-feature consists of a vector of indicator variables for whether a gene belongs to the functional gene group. The genomic meta-features can be collected into a matrix $Z$ of dimension $p \times q$, where $q$ is the number of meta-features. We propose a penalized hierarchical regression for integrating the external meta-feature information in $Z$ when predicting time-to-event outcomes based on the features in $X$:

$$ \min_{\boldsymbol\alpha, \boldsymbol\beta}\ -\frac{1}{n} \ln L^*(\boldsymbol\beta) + \frac{\lambda_1}{2}\|\boldsymbol\beta - Z\boldsymbol\alpha\|_2^2 + \lambda_2\|\boldsymbol\alpha\|_1 \qquad (2.1) $$
where $L^*(\boldsymbol\beta)$ is the Breslow approximation to the Cox partial likelihood (see below), $\boldsymbol\beta$ is a length-$p$ vector of regression coefficients corresponding to the $p$ features in $X$, and $\boldsymbol\alpha$ is a length-$q$ vector of regression coefficients corresponding to the meta-features. The objective function in (2.1) can be viewed as arising from a hierarchical model. In the first level of the hierarchy, the negative log partial likelihood term $-\ln L^*(\boldsymbol\beta)$ in (2.1) corresponds to the time-to-event outcome modeled as a function of $X$ via a Cox proportional hazards regression model. In the second level, the $\ell_2$ penalty term $\|\boldsymbol\beta - Z\boldsymbol\alpha\|_2^2$ corresponds to a linear regression of the estimate of $\boldsymbol\beta$ on the meta-feature information $Z$. It can also be thought of as an $\ell_2$ regularization term that shrinks the estimate of $\boldsymbol\beta$ toward $Z\boldsymbol\alpha$ rather than the usual shrinkage toward zero. In the third level of the hierarchy, the term $\|\boldsymbol\alpha\|_1$ is an $\ell_1$ regularization penalty on the vector of estimated effects $\hat{\boldsymbol\alpha}$. It enables the selection of important meta-features by shrinking many of its components to 0. The hyperparameters $\lambda_1, \lambda_2 \ge 0$ control the degree of shrinkage/regularization applied to each of the penalty terms and can be tuned by cross-validation. Finally, note that when $\boldsymbol\alpha = 0$, the objective function (2.1) reduces to a standard $\ell_2$-regularized Cox regression.
The partial likelihood function $L^*(\boldsymbol\beta)$ in (2.1) is the Breslow approximation (Breslow, 1972) to the Cox partial likelihood. Letting $t_1 < t_2 < \cdots < t_k$ ($k = 1, 2, \ldots$) be the unique event times arranged in increasing order, the Cox model assumes proportional hazards:

$$ h(t, \mathbf{x}_j) = h_0(t)\, \exp(\mathbf{x}_j^T \boldsymbol\beta) \qquad (2.2) $$

where $h(t, \mathbf{x}_j)$ is the hazard rate for subject $j$ with feature values $\mathbf{x}_j$ at time $t$, and $h_0(t)$ is the baseline hazard rate at time $t$, regardless of the feature values. The Cox partial likelihood (Cox, 1972) can then be written as

$$ L(\boldsymbol\beta) = \prod_{k} \frac{ e^{\mathbf{x}_k^T \boldsymbol\beta} }{ \sum_{j \in R_k} e^{\mathbf{x}_j^T \boldsymbol\beta} } \qquad (2.3) $$

where $R_k = \{j : y_j \ge t_k\}$ is the risk set at time $t_k$, i.e., the set of all subjects who have not experienced the event and are uncensored just prior to time $t_k$. The partial likelihood allows estimation of $\boldsymbol\beta$ without explicitly modeling the baseline hazard $h_0$, and it depends only on the order in which events occur, not on the exact times of occurrence. However, the partial likelihood assumes that the event times are unique. To handle ties, where multiple individuals experience the event at the same time, we use the Breslow approximation of the partial likelihood in (2.3):

$$ L^*(\boldsymbol\beta) = \prod_{k} \frac{ \exp\!\left( \sum_{j \in D_k} \mathbf{x}_j^T \boldsymbol\beta \right) }{ \left( \sum_{j \in R_k} e^{\mathbf{x}_j^T \boldsymbol\beta} \right)^{d_k} } \qquad (2.4) $$

where $D_k = \{j : \delta_j = 1, y_j = t_k\}$ is the set of individuals who have an event at time $t_k$, and $d_k = \sum_j I(\delta_j = 1, y_j = t_k)$ is the number of events at time $t_k$. Breslow's likelihood automatically reduces to the partial likelihood when there are no ties.
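A direct (if not especially fast) implementation of the negative log of the Breslow likelihood (2.4) is shown below as a minimal sketch; here y is the vector of observed times, d the event indicators, and eta the linear predictor $X\boldsymbol\beta$.

# Negative log Breslow partial likelihood (2.4)
neg_log_breslow <- function(y, d, eta) {
  event_times <- sort(unique(y[d == 1]))
  loglik <- 0
  for (t_k in event_times) {
    D_k <- which(d == 1 & y == t_k)        # subjects with an event at t_k
    R_k <- which(y >= t_k)                 # risk set at t_k
    loglik <- loglik + sum(eta[D_k]) - length(D_k) * log(sum(exp(eta[R_k])))
  }
  -loglik
}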
2.3.2 Model fitting

The objective function (2.1) can be minimized efficiently using iterative reweighted least squares combined with coordinate descent (Simon et al., 2011). If the current estimates of the regression coefficients are $(\tilde{\boldsymbol\beta}, \tilde{\boldsymbol\alpha})$, we form a quadratic approximation to the negative log partial likelihood by a Taylor series expansion around the current estimates. The approximated objective function has the form:

$$ \min_{\boldsymbol\alpha, \boldsymbol\beta}\ \frac{1}{2n}(\mathbf{y}' - X\boldsymbol\beta)^T W (\mathbf{y}' - X\boldsymbol\beta) + \frac{\lambda_1}{2}\|\boldsymbol\beta - Z\boldsymbol\alpha\|_2^2 + \lambda_2\|\boldsymbol\alpha\|_1 \qquad (2.5) $$

where

$$ \mathbf{y}' = \tilde{\boldsymbol\eta} + W^{-1}\!\left( \boldsymbol\delta - \exp\!\left[\ln H_0(\mathbf{y}) + \tilde{\boldsymbol\eta}\right] \right) \qquad (2.6) $$

$$ W = \mathrm{diag}\!\left[ \exp\!\left(\ln H_0(\mathbf{y}) + \tilde{\boldsymbol\eta}\right) \right] - \mathrm{diag}\!\left[ e^{\tilde{\boldsymbol\eta}} \right] M\, \mathrm{diag}\!\left[ \frac{h_{0k}^2}{d_k} \right] M^T \mathrm{diag}\!\left[ e^{\tilde{\boldsymbol\eta}} \right] \qquad (2.7) $$

In (2.6) and (2.7), $\mathrm{diag}[\mathbf{a}]$ is a diagonal matrix with the vector $\mathbf{a}$ as its diagonal elements, the exponential in (2.6) is applied elementwise, and $M$ is an $n \times k$ indicator matrix whose $(i, k)$-th element is $I(y_i \ge t_k)$. Also, $\tilde{\boldsymbol\eta} = X\tilde{\boldsymbol\beta}$ is the linear predictor; $h_{0k} = d_k \big/ \sum_{j \in R_k} \exp(\tilde\eta_j)$ is the estimated baseline hazard rate at event time $t_k$; and $H_0(y_i) = \sum_{k: t_k \le y_i} h_{0k}$ is the cumulative baseline hazard at time $y_i$. In the first part of the quadratic approximation (2.5), $\frac{1}{2n}(\mathbf{y}' - X\boldsymbol\beta)^T W (\mathbf{y}' - X\boldsymbol\beta)$ can be viewed as a weighted version of least squares, with $\mathbf{y}'$ acting as the responses and $W$ as the weights. The weight matrix $W$ is usually a diagonal matrix; for the Cox proportional hazards model, however, $W$ is a full symmetric matrix, as shown in (2.7). This leads to computational difficulty, as it requires the calculation of $O(n^2)$ entries. Following Simon et al. (2011), we can compute only the diagonal entries of $W$ without much loss of accuracy, thereby speeding up the implementation. The diagonal elements of $W$, $w_i$, have the form:
$$ w_i = \sum_{k \in C_i} \frac{ d_k\, e^{\tilde\eta_i} }{ \sum_{j \in R_k} e^{\tilde\eta_j} } \;-\; \sum_{k \in C_i} \frac{ d_k\, \left(e^{\tilde\eta_i}\right)^2 }{ \left( \sum_{j \in R_k} e^{\tilde\eta_j} \right)^2 } \qquad (2.8) $$

where $C_i$ is the set of unique event times $t_k$ such that $t_k \le y_i$ (the times for which observation $i$ is still at risk). In computing the weights $w_i$, one bottleneck is that for each $k$ in $C_i$ we need to calculate $\sum_{j \in R_k} e^{\tilde\eta_j}$. Both $C_i$ and $R_k$ have $O(n)$ elements, so the weight computation takes $O(n^2)$ time. However, if we sort the $y_i$ in non-decreasing order, then $\sum_{j \in R_k} e^{\tilde\eta_j}$ can be calculated in a cumulative summation fashion: the only difference between the risk sets $R_{k+1}$ and $R_k$ are the observations that are in $R_k$ but not in $R_{k+1}$, i.e., $\{j : t_k \le y_j < t_{k+1}\}$. The same idea can be applied to calculate $\sum_{k \in C_{i+1}} d_k \big/ \sum_{j \in R_k} e^{\tilde\eta_j}$ from the corresponding sum for $C_i$. The weight computation complexity can thereby be reduced to $O(n)$. Details are in Appendix A.1.
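The cumulative-summation idea can be sketched directly; the code below is my own illustration (not the thesis's implementation) of computing the diagonal weights $w_i$ of (2.8) without nested $O(n^2)$ loops.

# Diagonal Cox weights w_i from (2.8) via cumulative sums
cox_diag_weights <- function(y, d, eta) {
  t_k <- sort(unique(y[d == 1]))                              # unique event times
  d_k <- tabulate(match(y[d == 1], t_k), nbins = length(t_k)) # events per event time
  K <- length(t_k)

  # Risk-set sums S_k = sum_{j: y_j >= t_k} exp(eta_j), accumulated by scanning event
  # times from largest to smallest and adding subjects as they enter the risk set
  S_k <- numeric(K)
  ord <- order(y, decreasing = TRUE)
  running <- 0; idx <- 1
  for (k in K:1) {
    while (idx <= length(ord) && y[ord[idx]] >= t_k[k]) {
      running <- running + exp(eta[ord[idx]])
      idx <- idx + 1
    }
    S_k[k] <- running
  }

  # Cumulative pieces of (2.8): A_i = sum_{k: t_k <= y_i} d_k / S_k,
  #                             B_i = sum_{k: t_k <= y_i} d_k / S_k^2
  A_cum <- cumsum(d_k / S_k)
  B_cum <- cumsum(d_k / S_k^2)
  m <- findInterval(y, t_k)                    # number of event times <= y_i
  A <- ifelse(m > 0, A_cum[pmax(m, 1)], 0)
  B <- ifelse(m > 0, B_cum[pmax(m, 1)], 0)

  exp(eta) * A - exp(eta)^2 * B                # w_i
}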
Now let $\boldsymbol\gamma = \boldsymbol\beta - Z\boldsymbol\alpha$, and use only the diagonal elements of $W$; then (2.5) can be written as:

$$ \min_{\boldsymbol\alpha, \boldsymbol\gamma} \left\{ \frac{1}{2n} \sum_{i=1}^{n} w_i \left( y'_i - \boldsymbol\gamma^T \mathbf{x}_i - \boldsymbol\alpha^T (XZ)_i \right)^2 + \frac{\lambda_1}{2}\|\boldsymbol\gamma\|_2^2 + \lambda_2\|\boldsymbol\alpha\|_1 \right\} \qquad (2.9) $$

where $(XZ)_i$ is the $i$-th row of the $n \times q$ matrix $XZ$. This reduces the problem to repeatedly solving the regularized, weighted least squares problem (2.9) using cyclic coordinate descent (Friedman et al., 2010). Details are given in Appendix A.2.
2.3.3 Two-dimensional hyperparameter tuning

The optimization approach described above fits the model for one combination of the tuning parameters λ₁, λ₂. More than one combination of values of λ₁, λ₂ is usually of interest, as λ₁, λ₂ are tuned by cross-validation to get the best performance out of the model. For the proposed model, a two-dimensional grid of λ₁, λ₂ values is constructed, and pathwise coordinate optimization (Friedman et al., 2007) is applied along the two-dimensional path. The pathwise algorithm uses the current estimates as warm starts; since the solution of the convex problem (2.9) is continuous in the tuning parameters, this makes the algorithm remarkably efficient and stable.

In detail, λ₁ controls the amount of shrinkage of the L₂ term ‖β − Zα‖²₂, or ‖γ‖²₂ in the transformed form. Since we initialize γ̃ = 0 and α̃ = 0, starting with

\lambda_{1,\max} = 1000 \cdot \max_j \frac{1}{n} \Big| \sum_{i=1}^{n} w_i(0)\, x_{ij}\, y'_i(0) \Big|

gives solutions (γ̂, α̂) with small values, making the algorithm converge faster. For λ₂, which controls the amount of shrinkage of the L₁ term ‖α‖₁,

\lambda_{2,\max} = \max_k \frac{1}{n} \Big| \sum_{i=1}^{n} w_i(0)\, (XZ)_{ik}\, y'_i(0) \Big|

is the smallest value that makes the entire vector α̂ = 0. We compute the solutions for a decreasing sequence of λ₁, λ₂ values. More specifically, we start with (λ_{1,max}, λ_{2,max}), select λ_min = 0.0001·λ_max, and construct a sequence of 20 λ values from λ_max to λ_min on the log scale, so that there are 400 (λ₁, λ₂) combinations in total. In order to apply warm starts, we fix one value λ₁^(m₁), 1 ≤ m₁ ≤ 20, and decrease λ₂ along its sequence, keeping the solution at (λ₁^(m₁), λ_{2,max}); when the solutions for that λ₂ sequence are done and we move to (λ₁^(m₁+1), λ_{2,max}), where λ₁^(m₁+1) is the next value along the λ₁ sequence, we use the solution at (λ₁^(m₁), λ_{2,max}) as the warm start (Figure 2.1).
Figure 2.1: Two-dimensional grid pathwise optimization.
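As an illustration of the grid just described, the R sketch below builds the 20 × 20 set of (λ₁, λ₂) values on the log scale. The two λ_max values shown are placeholders for the data-driven quantities defined above, and the traversal order with warm starts follows Figure 2.1.

lambda_seq <- function(lambda_max, n_lambda = 20, ratio = 1e-4) {
  exp(seq(log(lambda_max), log(ratio * lambda_max), length.out = n_lambda))
}
lambda1_max <- 50    # placeholder for 1000 * max_j |(1/n) sum_i w_i(0) x_ij y'_i(0)|
lambda2_max <- 2     # placeholder for max_k |(1/n) sum_i w_i(0) (XZ)_ik y'_i(0)|
grid <- expand.grid(lambda1 = lambda_seq(lambda1_max),
                    lambda2 = lambda_seq(lambda2_max))
nrow(grid)           # 400 (lambda1, lambda2) combinations
# The grid is traversed one lambda1 value at a time; within each lambda1 the
# lambda2 sequence is solved with warm starts, and the solution at
# (lambda1^(m1), lambda2_max) seeds the next lambda1 value (Figure 2.1).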
2.3.4 Summary

Summarizing the procedure for fitting the regularized hierarchical Cox model:

1. Initialize β and α with β̃ and α̃.
2. For each (λ₁, λ₂), repeat until convergence of (β̂, α̂):
   • Compute the weights W and working response y' with the current estimates (β̃, α̃), and form the quadratic approximation, equation (2.5).
   • Find the minimizer (β̂, α̂), the solution to equation (2.9), using coordinate descent.
   • Set (β̃, α̃) = (β̂, α̂).
2.4 Simulations

2.4.1 Simulation methods

We performed a simulation study to evaluate the predictive performance of the hierarchical Cox regression model compared to standard penalized Cox regression. The main parameters we control include the informativeness of the meta-features, the sample size, the number of features, and the number of meta-features. We generated the p × q meta-feature matrix Z with each element drawn from an independent Bernoulli variable with probability 0.1. This mimics binary indicators for whether a gene belongs to a particular biological pathway.

The first-level regression coefficients are generated as β = Zα + ε, where ε ∼ N(0, σ²I). To control the predictive power of the meta-features, we set the signal-to-noise ratio, SNR = α^T cov(Z) α / σ², where the signal is the variance of β explained by the model Zα, and σ² is the noise. A higher signal-to-noise ratio implies a higher level of informativeness of the meta-features with respect to the coefficients β. The data matrix X is generated by sampling from a multivariate normal distribution N(0, Σ), where the covariance matrix Σ has an autoregressive correlation structure, Σ_ij = ρ^{|i−j|} for i, j = 1, …, p.
The cumulative distribution function of the Cox proportional hazards model is given by

F(t \mid x) = 1 - \exp\big[ -H_0(t)\, e^{\beta^T x} \big]

where H₀(t) is the baseline cumulative hazard function. Using the inverse probability integral transform (Bender et al., 2005), we generated survival times t as:

t = H_0^{-1}\big[ -\ln(U)\, e^{-\beta^T x} \big]    (2.10)

where U ∼ uniform[0, 1]. For the baseline hazard we used a Weibull distribution, which has cumulative hazard function H₀(t) = (t/b)^v. The baseline Weibull parameters were set to b = 5, v = 8, which results in survival times in the range 0 to 20. We simulated the censoring time, c, from an exponential distribution with density f(c) = λ exp(−λc), with λ = 0.06. The time-to-event outcome (y_i, δ_i) is then generated as (min(t_i, c_i), I(t_i ≤ c_i)). The value of the exponential distribution parameter λ was chosen to result in a ratio of subjects experiencing the event vs. subjects experiencing censoring of about 2 to 1.
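The simulation of the time-to-event outcome can be sketched in R as follows: a minimal illustration of equation (2.10) with the Weibull baseline b = 5, v = 8 and exponential censoring. The coefficient vector used here is arbitrary, not the Zα + ε construction above.

set.seed(1)
n <- 100; p <- 200
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(0.1, 20), rep(0, p - 20))          # illustrative coefficients
eta <- drop(X %*% beta)
U <- runif(n)
t_event <- 5 * (-log(U) * exp(-eta))^(1 / 8)     # inverse of H0(t) = (t/5)^8
cens <- rexp(n, rate = 0.06)                     # exponential censoring times
y <- pmin(t_event, cens)
delta <- as.numeric(t_event <= cens)             # (y_i, delta_i) as in the text
mean(delta)                                      # fraction of subjects with an event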
To control the predictivity of the features X for the outcome y, we used Harrell's concordance index (C-index) (Harrell et al., 1982) as the performance metric. It is defined as the probability that a randomly selected patient who experienced an event has a higher risk score β^T x than a patient who has not experienced an event at a given time. The C-index is an analog of the area under the ROC curve for time-to-event data. The higher the C-index, the better the model can discriminate between subjects who experience the outcome of interest and subjects who do not or have not yet. To control the C-index we added random noise to the survival times t, where the noise is normally distributed with mean zero and a variance set to yield a C-index of 0.8 across all simulation scenarios. This is the population/theoretical C-index of the generated survival data, achievable if β were known or if one had an infinite sample size. When β is estimated from a finite training set, the achieved model C-index will be lower.
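For reference, a direct and deliberately simple R implementation of Harrell's C-index is sketched below; it scans all usable pairs and ignores ties in the event times, which suffices for simulated continuous survival times.

## Harrell's C-index by pairwise comparison (O(n^2), illustration only).
c_index <- function(y, delta, score) {
  num <- 0; den <- 0
  n <- length(y)
  for (i in 1:n) {
    for (j in 1:n) {
      if (delta[i] == 1 && y[i] < y[j]) {        # usable pair: i had the event first
        den <- den + 1
        if (score[i] > score[j]) num <- num + 1          # concordant
        else if (score[i] == score[j]) num <- num + 0.5  # tied risk scores
      }
    }
  }
  num / den
}
## e.g. c_index(y, delta, drop(X %*% beta_hat)) on an independent test set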
We simulated a base case scenario with sample size N = 100, number of features p = 200, and number of meta-features q = 50. This is a high-dimensional setting, p >> N, typical of genomic studies. The first 20% of the coordinates of the meta-feature level coefficients α were set to 0.2, and the rest were set to 0. In the base scenario, the meta-features are highly informative, with a signal-to-noise ratio set to 2. The covariance matrix of X has an autoregressive-1 structure with parameter ρ = 0.5, so that the features are moderately correlated. In the following simulation scenarios, we vary one of the parameters and hold the others fixed. Simulations were performed with 100 replicates for all scenarios. The models were trained on a training set of size N (100 in the base scenario but varied in the experiments below), with the hyper-parameters λ₁, λ₂ tuned on an independent validation set of the same size as the training set. The final predictive performance was evaluated on a large test set of size 10,000.

We ran a series of experiments varying one key parameter at a time from the base case scenario as follows:

Experiment 1: varying the signal-to-noise ratio of the meta-features from completely uninformative (SNR = 0), to slightly informative (SNR = 0.1), to moderately informative (SNR = 0.8), to highly informative (SNR = 2).

Experiment 2: varying the sample size from low to high, N = 100, 200, 500.

Experiment 3: varying the number of features from low to high, p = 200, 500, 1000.

Experiment 4: varying the number of meta-features from low to high, q = 20, 50, 100.
2.4.2 Simulation results

The results of the experiments are shown in Figure 2.2. In each panel, the horizontal dashed line representing the population/theoretical C-index, the maximum achievable with infinite training data, is provided as a reference for each parameter setting. We compared the performance of the hierarchical ridge-lasso Cox model incorporating meta-features to that of a standard ridge Cox model.

With informative meta-features (SNR > 0 in experiments 1-4), the hierarchical ridge-lasso model consistently outperforms the standard ridge model, with the performance gain over the standard ridge model increasing with the informativeness of the meta-features (experiment 1). Importantly, when the meta-features are completely uninformative, the hierarchical ridge-lasso model performs only slightly worse than the standard ridge model (experiment 1, SNR = 0). This shows the robustness of the hierarchical ridge-lasso to uninformative meta-features.

Experiment 2 shows that the gains in performance of the hierarchical ridge-lasso over the standard ridge model can be quite large, particularly when the sample size is small. As the sample size N increases, the performance of both models increases and the difference between the two is reduced.

As the dimensionality p of the features increases (experiment 3), the performance of the standard ridge model deteriorates dramatically, while the performance of the hierarchical model only decreases slowly, as the information in the meta-features helps stabilize its performance.

In experiment 4, the performance of the standard ridge model does not change, as it does not utilize meta-feature information. However, for the hierarchical ridge-lasso model, the performance decreases as the number of noise meta-features increases (the number of informative meta-features is fixed at 10 and the additional meta-features are noise meta-features).
We also examined the ability of the model to select informative meta-features through the second-level lasso penalty. In particular, we looked at the true and false positive meta-feature selection rates in experiment 1, where the informativeness of the second-level meta-features varies (Figure 2.3). We see that as the SNR of the meta-features increases, the true positive selection rate of informative meta-features improves dramatically (Figure 2.3a), at the cost of a slight increase in the false positive rate (Figure 2.3b).

Figure 2.2: Simulation results: prediction performance.

Figure 2.3: Simulation results: meta-feature selection.
2.5 Applications

2.5.1 Gene expression signatures for breast cancer survival

To illustrate the performance of our approach, we applied the hierarchical survival model to the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) study. The METABRIC microarray dataset is available at the European Genome-Phenome Archive with accession EGAS00000000083. It includes cDNA microarray profiling of around 2,000 breast cancer specimens processed on the Illumina HT-12 v3 platform (Illumina_Human_WG-v3) (Curtis et al., 2012). The dataset was divided into a discovery/training set of 997 samples and a validation/test set of 995 samples (Cheng et al., 2013a). The goal is to build a prognostic model for breast cancer survival based on gene expression and clinical features. The data X consist of 29,477 gene expression probes and two clinical features, age at diagnosis and the number of positive lymph nodes. The meta-feature data Z consist of four "attractor metagenes", which are selected gene co-expression signatures associated with the ability of cancer cells to divide uncontrollably, to invade surrounding tissues, and with the effort of the organism to fight cancer with a particular immune response (Cheng et al., 2013b). Three are universal "attractor metagenes": genes involved in mitotic chromosomal instability (CIN), in mesenchymal transition (MES), and in lymphocyte-specific immune recruitment (LYM). In addition, a fourth metagene, whose expression is associated with good prognosis, contains the expression values of two genes, FGD3 and SUSD3. The CIN, MES, and LYM metagenes each consist of 100 genes, but for our analysis we only considered the 50 top-ranked genes. The data matrix Z is an indicator matrix of whether a specific expression probe corresponds to a gene in a metagene.
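Constructing such an indicator matrix is straightforward; the R sketch below shows the idea with hypothetical probe and gene identifiers (only FGD3 and SUSD3 are real gene names from the text).

## Hypothetical probe-to-gene map and metagene membership lists.
probes <- c("probe_1", "probe_2", "probe_3", "probe_4", "probe_5")
probe2gene <- c(probe_1 = "FGD3", probe_2 = "SUSD3", probe_3 = "GENE_A",
                probe_4 = "GENE_B", probe_5 = "GENE_C")
metagenes <- list(CIN = c("GENE_A", "GENE_B"),
                  MES = "GENE_C",
                  LYM = character(0),
                  FGD3_SUSD3 = c("FGD3", "SUSD3"))
Z <- sapply(metagenes, function(g) as.integer(probe2gene[probes] %in% g))
rownames(Z) <- probes   # p x q indicator matrix: Z[j, m] = 1 if probe j maps to metagene m
Z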
Model building was based on the samples that were ER-positive and HER2-negative, as treatments are homogeneous in this group, and they are associated with good prognosis (Rivenbark et al., 2013). There were 740 samples in the discovery set and 658 samples in the validation set in the ER+ and HER2− subset after removing samples with missing values. We used 5-fold cross-validation to tune the hyper-parameters λ₁, λ₂ in the discovery set. The test set was used to evaluate model performance. The same training/test scheme was used to fit a standard ridge regression without attractor metagene information as a comparison.

With only gene expression features in the model and no clinical features, the test C-index for the ridge-lasso hierarchical model with metagene information was 0.678, which compares favorably with the test C-index of 0.648 for the standard Cox ridge counterpart. When adding the clinical features, age at diagnosis and number of positive lymph nodes, the test C-index increased to 0.752 and 0.727 for the Cox hierarchical model and the standard Cox ridge model, respectively (Table 2.1). The metagenes "CIN" and "FGD3-SUSD3" were identified by the hierarchical model as being important (they had non-zero coefficients). "CIN", which is a breast cancer inducing metagene, had a positive coefficient, indicating that genes in "CIN" had an overall increased risk relative to other genes, while the "FGD3-SUSD3" metagene had a negative coefficient estimate, indicating that FGD3 and SUSD3 had a reduced risk (Table 2.2). The identified metagenes were also found in a previous analysis (Cheng et al., 2013a).
  Test C-index                              Standard Ridge    Ridge-Lasso
  Gene expressions only                          0.648           0.678
  Gene expressions + clinical features           0.727           0.752

Table 2.1: METABRIC: Test C-index for standard ridge and ridge-lasso

  Metagene        Coefficient estimate
                  Gene expressions only    Gene expressions + clinical
  CIN                    0.0094                     0.0080
  MES                    0.0021                     0.0033
  LYM                    0.0011                     0.0008
  FGD3-SUSD3            -0.2083                    -0.1111

Table 2.2: METABRIC: Coefficient estimates for metagenes
2.5.2 Anti-PD1 predictive biomarker for melanoma survival

We also applied the model to a melanoma dataset to predict overall survival after treating patients with anti-PD-1 immune checkpoint blockade. The programmed death 1 (PD-1) pathway is an immune-regulatory mechanism used by cancer to hide from the immune system. Antagonistic antibodies to the PD-1 pathway and its ligand, programmed death ligand 1 (PD-L1), have demonstrated high clinical benefit rates and tolerability. Immune checkpoint blockades such as nivolumab and pembrolizumab are anti-PD-1 antibodies showing improved overall survival for the treatment of advanced melanoma. However, less than 40% of patients respond to the treatments (Moreno et al., 2015). Therefore, predicting treatment outcomes and identifying predictive signals are of great interest to appropriately select the patients most likely to benefit from anti-PD-1 treatments. We explored transcriptomes and clinical data using our model to illustrate prediction performance and predictive signal selection.

The dataset combined 3 clinical studies in which RNA-sequencing was applied to patients treated with anti-PD1 antibodies: Gide et al. (2019); Riaz et al. (2017); Hugo et al. (2016). The gene expression values were normalized toward the all-sample average in each study as the control, so that they are comparable to one another across features within a sample and comparable to one another across samples. There are 16,010 genes in common across the 3 studies and 117 subjects combined. We built predictive models for overall survival based on gene expression profiles. Since the subjects are all treated with anti-PD1 antibodies, the transcriptomic features selected by the model are predictive signals for treatment efficacy or resistance. We selected meta-features from the molecular signature database hallmark gene sets (Liberzon et al., 2015). 13 gene sets were enriched with false positive rates less than 0.25. An indicator matrix Z is formed to indicate whether each of the 16,010 genes belongs to one of the 13 hallmark gene sets.

We performed 5-fold cross-validation to tune the hyperparameters and report the validation prediction performance. We see improvement in prediction with the hallmark gene set information, with a C-index of 0.663 for ridge-lasso compared to 0.637 for standard ridge when only including gene expressions, and 0.754 for ridge-lasso compared to 0.739 for standard ridge when including gene expressions and clinical features. At the gene set level, among the model-selected sets (non-zero coefficient α), 3 gene sets have absolute effect sizes larger than 0.01 (Table 2.3). Specifically, genes in response to interferon gamma and genes involved in KRAS regulation were identified. A subset of the genes in the gene sets identified by our model was in concordance with previously published anti-PD1 gene signatures (Riaz et al., 2017; Hugo et al., 2016).
  Gene set                            Coefficient estimate
  IFNG interferon gamma response           -0.0100*
  Interferon alpha response                -0.0013
  IL-2_STAT5 signaling                      0.0072
  Bile acid metabolism                     -0.0011
  KRAS signaling down regulated            -0.0135*
  KRAS signaling up regulated               0.0100*
  Apoptosis                                 0.0004
  Xenobiotic metabolism                     0.0013

Table 2.3: Coefficient estimates (nonzero) for hallmark meta-features. * Gene sets with absolute value of coefficients larger than 0.01.
2.6 Discussion

In this chapter we extended the regularized hierarchical regression model of Kawaguchi et al. (2021) to time-to-event data and to accommodate a lasso or elastic-net penalty in the second-level model. The hierarchical regularized regression model enables integration of external meta-feature information directly into the modeling process. We showed that prediction performance improves when the external meta-feature data is informative, and that the improvements are largest for smaller sample sizes, when prediction is hardest and performance improvement is most needed. Key to obtaining performance gains, though, is prior knowledge of external information that is potentially informative for the outcome. For example, clinicians, epidemiologists, or other substantive experts may provide insights into what type of annotations are likely to be informative. However, the model is robust to incorporating a set of meta-features that is completely irrelevant to the outcome of interest. In this scenario, a very small price in prediction performance is paid relative to a standard ridge model (i.e., without external information). This should encourage the user to integrate meta-features even if uncertain about their informativeness.

An underlying assumption of the proposed regularized hierarchical model is that the effects in a group determined by the meta-features (e.g., genes in a pathway) are mostly in the same direction. A limitation of the method is that if the effects have opposite signs and 'cancel each other out', there would be little or no improvement in prediction, even if the pathway information is informative.

In addition to developing predictive signatures, the model can also be deployed in discovery applications where the main goal is to identify important features associated with the outcome rather than to develop a predictive model. However, there is no standard way to perform formal inference (standard errors, p-values, confidence intervals) with high-dimensional regression models. Several approaches exist (Meinshausen et al., 2009; Shah and Samworth, 2013) and this is an active area of research. Adding formal statistical inference would be important future work to expand the range of use of the proposed model.

The regularized hierarchical model is implemented in the "xrnet" R package available from CRAN. The implementation is efficient and can be used to perform analyses with large numbers of features, meta-features, and subjects. While the models we focused on in the simulations and data applications are all "ridge-lasso", i.e., with an L₂ norm penalty applied to β − Zα and an L₁ norm applied to the meta-feature coefficients α, the package offers the flexibility of using the lasso, elastic net, and ridge penalties to penalize the meta-features, depending on the application. For example, if selection at the meta-feature level is desired and the meta-features are highly correlated, the elastic net penalty is a better option for regularizing α: when a group of variables is highly correlated, the lasso tends to select one of them, while the elastic net enjoys a grouping effect that selects all the variables in the group with estimated coefficients close to equal in magnitude (Zou and Hastie, 2005). The approach does not perform feature selection at the first level, as it uses a ridge penalty there. In a high-dimensional setting, standard regularized regressions like the lasso and elastic net often select relatively large numbers of features. It can then be valuable to identify groups of genes defined by meta-features that may jointly have significant predictive power for the outcome of interest. Another potential improvement of the model is to extend the range of penalty types to nonconvex penalties, such as SCAD (Fan and Li, 2001) and MCP (Zhang et al., 2010). These penalties yield less biased effect size estimates than the lasso and elastic net.
Chapter 3

Meta-Feature Guided Regularized Regression for Survival Outcomes

3.1 Abstract

Regularized regression is a widely used technique for building prognostic and diagnostic models based on genomic features. Regularized regression can handle the high-dimensional data common in genomic studies, and in the case of sparse regularized methods it can also perform feature selection. Associated with genomic features, such as gene expression, genotypes, and DNA methylation, there is a great deal of functional information that, if directly incorporated into the modeling process, could yield better prediction performance. Examples of functional information are pathways and other gene sets, gene ontology annotations, and knowledge from previous studies. Functional information can often be represented as meta-features, i.e., attributes of the features rather than of the subjects.

In this chapter, we extend the approach of Zeng et al. (2021) to survival outcomes. The method can incorporate prior information in the form of meta-features to guide regularized Cox regression models. This is accomplished by letting each genomic feature have its own 'custom' penalty parameter, rather than having a single penalty parameter controlling the degree of regularization across all features. The individual penalty parameters are in turn modeled as functions of the meta-features. We show the benefits of the method by a simulation study and applications to two genomic studies.
3.2 Introduction

Predicting health outcomes based on genomic profiles is an important and active area of biomedical research. A commonly used statistical tool for developing predictive models in genomic studies is regularized regression. When the number of features is larger than the number of samples, which is usually the case for genomic data, regularization needs to be introduced so that the model complexity, relative to the amount of data available, can be controlled. Sparse regularized regression is a popular choice, as it not only shrinks the regression coefficients toward zero to make the model less complex by reducing its variance, but it also shrinks some of the coefficients with small or no effect on the outcome to exactly zero, thereby performing feature selection. The most widely used sparse regularized regression methods are the lasso (Tibshirani, 1996) and the elastic net (Zou and Hastie, 2005). While the lasso and elastic net share many properties, the elastic net can better deal with correlated features. Specifically, while the lasso tends to select a single feature among a highly correlated group of features predictive of the outcome, the elastic net tends to select the entire group and make their effect-size estimates similar to each other. Ridge regression (Hoerl and Kennard, 1970) is another widely used regularization technique to cope with high-dimensional and collinear data. However, ridge regression does not yield sparse models, i.e., it does not perform feature selection.

Several extensions of the lasso have been developed to exploit additional structure of the features, such as groupings or natural orderings. For example, the group lasso (Yuan and Lin, 2006) takes grouping information, such as genes mapping to a pathway or probes mapping to a gene, and shrinks the coefficients by group so that either all the features in a group are selected or no feature in the group is selected. The sparse group lasso (Simon et al., 2013) further allows sparsity within each group. The fused lasso deals with the setting where an ordering of the features is available (e.g., along the genome) by adding an L₁ penalty on the differences of neighbouring coefficients. These methods extend regularized regression to account for attributes of the features, which can often be encoded as features of the features, or meta-features. Examples of meta-features common in genomic studies are functional gene sets like hallmark (Liberzon et al., 2015), gene ontology pathways like those in the Reactome database (Jassal et al., 2020), and summary statistics like p-values and regression coefficients from meta-analyses or other previous relevant work. Meta-features can be encoded as data matrices, where the rows represent the features and the columns represent the meta-features. None of the regularization approaches above can systematically utilize general meta-feature information. For example, the group lasso requires features in different groups to be mutually exclusive. The fused lasso can take an ordering of the features into account, which can be captured by a single quantitative meta-feature, but it is not designed to integrate multiple quantitative meta-features.

One way of utilizing functional meta-features that define groupings is by performing gene set enrichment analysis (Subramanian et al., 2005) after identifying features that are predictive of the outcome of interest. However, incorporating meta-features directly into the modeling process can potentially improve both prediction performance and the quality of feature selection. In Chapter 2, we incorporated meta-features in a hierarchical fashion, in an approach that can be conceptualized as regressing the outcome on the original features, y = Xβ + ε, where y is the length-n outcome vector, X is the n × p data matrix, and β is the length-p feature coefficient vector to be estimated in the model, and then regressing the estimates β on the meta-features, β = Zα + γ, where Z is the p × q meta-feature matrix and α is the coefficient vector for the meta-features. To integrate both features and meta-features jointly, an optimization problem is formulated as follows:
\min_{\beta, \alpha} \; \frac{1}{2n} \|y - X\beta\|_2^2 + \frac{\lambda_1}{2} \|\beta - Z\alpha\|_2^2 + \lambda_2 \|\alpha\|_1
The feature data, X, and the meta-feature data, Z, are both integrated through the L₂ terms. The additional L₁ term penalizes the meta-feature coefficients α to control the model complexity and perform meta-feature selection. This approach incorporates meta-features in a linear way. In Chapter 2 it was shown that this approach can considerably improve the prediction performance of regularized Cox regression when integrating informative meta-features. Zeng et al. (2021) developed an alternative method for integrating the meta-features Z in a non-linear way, for quantitative and binary outcomes. This is accomplished by letting each feature have its own 'customized' penalty parameter, rather than a single global penalty parameter for all features. The customized penalty parameter approach can also improve prediction performance and feature selection. In this chapter, we extend this approach to survival outcomes.
3.3 Methods

3.3.1 Setup and notation

Starting with the time-to-event model setup, we let the outcome be (y, δ), where y = (y₁, y₂, …, y_n) denotes the vector of observed times and δ = (δ₁, δ₂, …, δ_n) denotes the vector of censoring statuses for the n subjects. For subject i, i = 1, 2, …, n, δ_i = 1 if the event of interest (e.g., death) took place within the follow-up period and y_i is the time of the event; if δ_i = 0, the event did not happen within the follow-up period and y_i is the censoring time. We assume there are p genomic features (e.g., expression or methylation levels, genotypes) measured on each of the n subjects. The n × p data matrix X stores the feature values, i.e., x_i = (x_{i1}, x_{i2}, …, x_{ip}) is the vector of feature values for subject i. Associated with the features, we assume there are q meta-features. A p × q matrix Z stores the meta-feature values for the p original features, i.e., z_j = (z_{j1}, z_{j2}, …, z_{jq}) for j = 1, 2, …, p is the vector of meta-feature values for feature j. The most common regression method for time-to-event data is Cox's proportional hazards model (Cox, 1972). It assumes proportional hazard functions with respect to the feature values at any given time point, which allows model fitting without explicitly knowing the form of the baseline hazard function, depending only on the order in which events occur, not on the exact times of occurrence. We let t₁ < t₂ < … < t_l < … < t_m be the event times arranged in increasing order and D_l = {i : δ_i = 1, y_i = t_l} be the set of subjects that experienced the event at time t_l. Letting β be a length-p vector of regression coefficients, Breslow's adjustment to the Cox partial likelihood (Breslow, 1972) takes the form:
L(\beta) = \prod_{l=1}^{m} \frac{e^{\sum_{i \in D_l} x_i^T \beta}}{\big( \sum_{i \in R_l} e^{x_i^T \beta} \big)^{d_l}}
where R_l = {i : y_i ≥ t_l} is the risk set at event time t_l, i.e., the set of all subjects who have not experienced the event and are uncensored just prior to time t_l, and d_l = |D_l| is the number of events at time t_l. Breslow's partial likelihood handles potential ties at each event time (i.e., more than one subject experiencing the event at the same time). When there are no ties, L(β) automatically reduces to Cox's partial likelihood. We can see that neither the hazard functions nor the actual times are involved in the partial likelihood; only the order of the event times matters.
To control the model complexity we consider an elastic net regularized Cox model. Denoting the log partial likelihood by ℓ(β), the regularized Cox regression model is the solution to the following optimization problem:

\min_{\beta \in \mathbb{R}^p} \; -\ell(\beta) + \lambda \left[ \frac{1}{2}(1 - c)\|\beta\|_2^2 + c\|\beta\|_1 \right]    (3.1)
The regularization function includes the lasso (c = 1), elastic net (0 < c < 1), and ridge (c = 0) penalties as particular cases. When 0 < c ≤ 1, the penalty is sparsity inducing, i.e., it shrinks some coefficients to exactly zero, producing more interpretable models. In (3.1) the global penalty parameter λ is the same for all the features, which implicitly assumes all features are a priori equally important. To incorporate meta-features, which might be informative about the importance of the features, we give each feature its own unique penalty parameter λ_j, which in turn we model as a function of the meta-features. Specifically, we assume λ_j = e^{z_j^T α}, where α is a weight vector of length q. The model is now given by the following optimization problem:

\min_{\beta \in \mathbb{R}^p} \; \left\{ -\ell(\beta) + \sum_{j=1}^{p} \lambda_j \left[ \frac{1}{2}(1 - c)\beta_j^2 + c|\beta_j| \right] \right\}, \qquad \lambda_j = e^{z_j^T \alpha}    (3.2)
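For a fixed weight vector α, the feature-specific penalties λ_j = e^{z_j^T α} can be handed to an off-the-shelf elastic net Cox solver as differential penalty factors. The R sketch below illustrates this with glmnet's penalty.factor argument on simulated data; the split between the lambda argument and the penalty factors is one convention for recovering the intended per-feature amounts (glmnet rescales penalty factors internally), and the full method additionally estimates α, as described in the next section.

library(glmnet)
set.seed(2)
n <- 100; p <- 50; q <- 3
X <- matrix(rnorm(n * p), n, p)
Z <- matrix(rbinom(p * q, 1, 0.1), p, q)
alpha_w <- c(0.5, -0.5, 1)                        # illustrative meta-feature weights (alpha)
lam_j <- exp(drop(Z %*% alpha_w))                 # feature-specific penalties lambda_j
time   <- rexp(n, rate = exp(drop(X[, 1:5] %*% rep(0.2, 5))))
status <- rbinom(n, 1, 0.7)
fit <- glmnet(X, cbind(time = time, status = status), family = "cox",
              alpha = 0.5,                        # elastic net mixing c
              penalty.factor = lam_j / mean(lam_j),   # relative per-feature penalties
              lambda = mean(lam_j) / n)               # overall scale (one convention)
sum(coef(fit) != 0)                               # number of selected features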
3.3.2 Model fitting

The standard elastic net Cox proportional hazards model in equation (3.1) can be fitted by pathwise coordinate descent (Simon et al., 2011). Since the penalty parameter λ is a single global hyper-parameter, the algorithm typically constructs a grid of λ values from which the best is selected by cross-validation. By contrast, the proposed model in equation (3.2) has p penalty parameters defined by the weights α, Λ = (λ₁, λ₂, …, λ_p) = e^{Zα}, so it is not feasible to tune them using traditional approaches like cross-validation. Instead, we estimate the weights α, and hence Λ, directly from the data using an empirical Bayes approach. With Λ known, the model in (3.2) can be fit via coordinate descent.
3.3.2.1 Empirical Bayes estimation of the hyperparameters

To estimate α, we rely on an alternative and natural Bayesian/random effects formulation of the model in (3.2). We then estimate the hyper-parameters by maximizing the marginal likelihood obtained by integrating out the random effects β. Based on the Bayesian elastic net (Li et al., 2010), the model given by equation (3.2) can be re-formulated as the following hierarchical model:

f(y \mid \beta; X) = \prod_{l=1}^{m} \frac{e^{\sum_{i \in D_l} x_i^T \beta}}{\big( \sum_{i \in R_l} e^{x_i^T \beta} \big)^{d_l}}    (3.3)

\pi(\beta_j; \alpha) \propto \exp\left\{ -\lambda_j \left[ \frac{1}{2}(1 - c)\beta_j^2 + c|\beta_j| \right] \right\}    (3.4)
With the likelihood (3.3) and prior distribution (3.4), we construct the joint distribution of y and β, and integrate out β to get the marginal likelihood of y:

f(y; \alpha) = \int_{\beta \in \mathbb{R}^p} f(y, \beta; \alpha)\, d\beta
= \int_{\beta \in \mathbb{R}^p} f(y \mid \beta; X)\, \pi(\beta; \alpha)\, d\beta
= \int_{\beta \in \mathbb{R}^p} \prod_{l=1}^{m} \frac{e^{\sum_{i \in D_l} x_i^T \beta}}{\big( \sum_{i \in R_l} e^{x_i^T \beta} \big)^{d_l}} \; \prod_{j=1}^{p} \exp\left\{ -\lambda_j \left[ \frac{1}{2}(1 - c)\beta_j^2 + c|\beta_j| \right] \right\} d\beta
This integral does not have a closed-form expression because the elastic net prior is not conjugate for the likelihood. We propose to approximate the integral by first approximating the elastic net prior with a normal prior, and then applying the Laplace approximation to get a tractable final approximation. To approximate the elastic net prior, we use a normal prior with the same variance as the lasso component. The lasso component of the prior corresponds to a double exponential distribution, DE(λ_j c), with variance 2/(λ_j c)². The normal approximation of this double exponential distribution is then N(0, 2/(λ_j c)²). Hence, the elastic net prior can be approximated as follows,

\pi(\beta_j; \alpha) \propto \exp\left\{ -\lambda_j \left[ \frac{1}{2}(1 - c)\beta_j^2 + c|\beta_j| \right] \right\}
\approx \exp\left\{ -\frac{1}{2}\lambda_j (1 - c)\beta_j^2 - \frac{\beta_j^2}{2 \cdot 2/(\lambda_j c)^2} \right\}
= N\!\left( 0, \; \frac{2}{2\lambda_j(1 - c) + c^2 \lambda_j^2} \right)    (3.5)
Figure 3.1 shows how the normal prior approximation compares to the elastic net prior. With this prior approximation, the joint log-distribution, ln f(y, β; α), takes the form of a ridge regularized Cox regression with individual penalty parameters,

\ln f(y, \beta; \alpha) = \sum_{l=1}^{m} \left[ \sum_{i \in D_l} x_i^T \beta - d_l \ln\Big( \sum_{i \in R_l} e^{x_i^T \beta} \Big) \right] - \sum_{j=1}^{p} \frac{1}{2} v_j \beta_j^2 + \text{const}, \qquad v_j = \frac{2\lambda_j(1 - c) + c^2\lambda_j^2}{2}    (3.6)
Figure 3.1: Normal approximation of the elastic net prior (three panels with λ = 0.5 and c = 0.25, 0.5, 0.75, each overlaying the normal and elastic net densities).
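The quality of the approximation in (3.5) can be checked numerically; the short R sketch below compares the two densities for λ_j = 0.5 and c = 0.5, one of the settings shown in Figure 3.1.

lambda_j <- 0.5; c_mix <- 0.5
beta <- seq(-4, 4, length.out = 401)
en_log <- -lambda_j * (0.5 * (1 - c_mix) * beta^2 + c_mix * abs(beta))
step <- diff(beta)[1]
en_dens <- exp(en_log) / sum(exp(en_log) * step)            # numerically normalized elastic net prior
v_norm <- 2 / (2 * lambda_j * (1 - c_mix) + c_mix^2 * lambda_j^2)  # variance from (3.5)
norm_dens <- dnorm(beta, 0, sqrt(v_norm))
max(abs(en_dens - norm_dens))                                # maximum discrepancy on this grid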
Applying the Laplace approximation (Laplace, 1986) to (3.6), the integral with respect to the p-dimensional vector β has a closed-form solution. Consider a Taylor series of ln f(y, β; α) at the stationary point β̃, where ∇ ln f(y, β̃; α) = 0:

\ln f(y, \beta; \alpha) \approx \ln f(y, \tilde{\beta}; \alpha) - \frac{1}{2}(\beta - \tilde{\beta})^T H (\beta - \tilde{\beta})

β̃ is the solution of the ridge regularized Cox regression (3.6), which can be computed, for fixed α, using for example the glmnet R package (Simon et al., 2011). Here H is the Hessian matrix,

H = -\nabla\nabla_{\beta} \ln f(y, \beta; \alpha)\big|_{\beta = \tilde{\beta}} = X^T W X + V

where V = diag[v] = diag[v₁, …, v_p], and W is a diagonal matrix with elements
W_{ii} = \sum_{l \in C_i} \frac{d_l\, e^{x_i^T \beta}}{\sum_{k \in R_l} e^{x_k^T \beta}} - \sum_{l \in C_i} \frac{d_l \big( e^{x_i^T \beta} \big)^2}{\big( \sum_{k \in R_l} e^{x_k^T \beta} \big)^2}

where C_i = {l : t_l ≤ y_i} indexes the event times at which subject i is still at risk.
The Hessian is in turn approximated to avoid computing the full matrix W, which is costly. Here we use only its diagonal elements to speed up the computation without much loss of accuracy; for greater detail, refer to Simon et al. (2011). We see now that the Taylor approximation of f(y, β; α) has a multivariate normal form with mean β̃ and variance H⁻¹. Integrating out β yields the normalizing constant:

\ln f(y; \alpha) \approx \ln f(y \mid \tilde{\beta}; X) + \ln \pi(\tilde{\beta}; \alpha) + \frac{p}{2}\ln 2\pi - \frac{1}{2}\ln|H|
= -\frac{1}{2}\left[ -\ln|V| + \tilde{\beta}^T V \tilde{\beta} + \ln|H| \right] + \ln f(y \mid \tilde{\beta}; X) + \text{const}    (3.7)

The approximate negative log marginal likelihood in equation (3.7) is the objective function we minimize to estimate α.
3.3.2.2 Marginal likelihood function optimization

Although the marginal likelihood objective given by equation (3.7) is nonconvex, it can be decomposed as the difference of two convex functions: g(α) := −ln|V| + β̃^T V β̃ and h(α) := −ln|H|. This decomposition makes it possible to use optimization algorithms specifically designed for differences of convex functions (DCA) (Le Thi et al., 2015). The main idea of DCA is to approximate the nonconvex objective function by a sequence of convex ones. At each iteration we approximate the concave part, −h(α) = ln|H|, by its affine majorization, i.e., the supporting hyperplane obtained by calculating its gradient (or subgradients if it is not differentiable), and then minimize the resulting convex approximation. The resulting algorithm can also be thought of as a majorization-minimization algorithm (Hunter and Lange, 2004). The affine approximation of the concave part is the majorization step, which forms a surface lying above the objective function and tangent to it, i.e., at the current estimate of the target parameter the majorization equals the objective function. This ensures the majorization is a tight upper bound for the objective. Minimizing the convex upper bound is the minimization step. The DCA algorithm for the marginal likelihood ln f(y; α) is:
1. Initialize α with α̃ ∈ R^q.
2. Majorization:
   • Calculate the gradient at the current estimate α̃,
     \Gamma = \nabla_{v} \ln|H| = \mathrm{diag}[H^{-1}]
   • Form the convex upper bound,
     u(\alpha) = g(\alpha) - h(\tilde{\alpha}) + \Gamma^T (v - \tilde{v}) = -\ln|V| + \tilde{\beta}^T V \tilde{\beta} + \Gamma^T v + \text{const}
3. Minimization: α̂ = argmin_α {u(α)}.
4. Set α̃ = α̂.
5. Repeat steps 2-4 until convergence of α̂.
The minimization of u(α) can be performed with a standard first order method like gradient descent, or a second order method like Newton-Raphson. The gradient and Hessian of u(α) are:

\nabla_{\alpha} u(\alpha) = Z^T \left[ \left( -\frac{1}{v} + \tilde{\beta}^2 + \Gamma \right) \circ \big( (1 - c)\Lambda + c^2 \Lambda^2 \big) \right]

\nabla\nabla_{\alpha} u(\alpha) = Z^T \mathrm{diag}\left[ \frac{\Lambda^2}{v^2} \big( (1 - c) + c^2 \Lambda \big)^2 + \left( -\frac{1}{v} + \tilde{\beta}^2 + \Gamma \right) \circ \Lambda \big( (1 - c) + 2c^2 \Lambda \big) \right] Z

where all operations on the vectors v, β̃², Γ, and Λ are taken elementwise.
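As a self-contained illustration of the DCA template (unrelated to the Cox model itself), the toy R example below minimizes the one-dimensional difference of convex functions f(x) = x² − 2|x|, i.e., g(x) = x² and h(x) = 2|x|, by repeatedly majorizing the concave part and minimizing the resulting convex upper bound.

## DCA on f(x) = g(x) - h(x), g(x) = x^2, h(x) = 2|x|; minimizers are x = -1 and x = 1.
x <- 3                                   # initial value
for (iter in 1:20) {
  s <- 2 * sign(x)                       # subgradient of h at the current iterate
  # majorization: f(x') <= x'^2 - h(x) - s * (x' - x); minimize the convex upper bound in x'
  x_new <- s / 2                         # argmin_{x'} { x'^2 - s * x' }
  if (abs(x_new - x) < 1e-10) break
  x <- x_new
}
c(x = x, f = x^2 - 2 * abs(x))           # converges to 1 (or -1), with objective value -1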
3.3.3 Summary

We incorporate the meta-features into feature-specific penalty parameters modeled as a log-linear function of the meta-features. We then use a Bayesian interpretation of regularized regression to obtain the marginal likelihood function for the meta-feature weights α. The α parameters are estimated by maximum marginal likelihood. The resulting nonconvex objective function can be decomposed into a difference of two convex functions, which can be optimized with a difference of convex functions algorithm. The estimated α can then be plugged in to compute the penalty parameters. The complete model fitting procedure is:

1. Initialize α with α̃.
2. Repeat until convergence of α̂:
   (a) Laplace approximation of the marginal likelihood with known α̃ (Section 3.3.2.1):
       • Approximate the elastic net prior with a normal prior, equation (3.6).
       • Calculate β̃ and H.
   (b) Optimize the Laplace approximation of the marginal likelihood, equation (3.7), to obtain the solution α̂, with the DCA described in Section 3.3.2.2.
   (c) Set α̃ = α̂.
3. Calculate the customized penalty vector Λ = e^{Z α̂}.
4. Fit the regularized Cox regression, equation (3.2), with Λ.
3.3.4 Inclusion of unpenalized features

We have introduced the computational algorithm for meta-feature guided regularized regression (3.2), in which all the features are regularized and have meta-features associated with them. However, more often than not there are other types of features, in addition to the genomic ones, that provide substantial predictive power for the outcome: for example, demographics such as age at diagnosis and gender, and clinical features such as symptoms, imaging, and lab results. These features usually do not have meta-feature information, and it is desirable to include them in the model without penalization. To extend our model to include unpenalized features, we denote the n × p₀ unpenalized feature matrix by X₀, i.e., there are p₀ unpenalized features, and β₀ ∈ R^{p₀} is their coefficient vector. Then (3.2) can be rewritten as

\min_{\beta_0 \in \mathbb{R}^{p_0},\, \beta \in \mathbb{R}^p} \; \left\{ -\ell(\beta_0, \beta) + \sum_{j=1}^{p} \lambda_j \left[ \frac{1}{2}(1 - c)\beta_j^2 + c|\beta_j| \right] \right\}, \qquad \lambda_j = e^{z_j^T \alpha}    (3.8)

The only difference between (3.8) and (3.2) is the addition of the unpenalized features X₀ and their coefficients β₀ in the log partial likelihood ℓ(β₀, β). The algorithm described in Section 3.3.3 carries over directly, except that the alternating optimization of β and α is now based on ℓ(β₀, β), and β₀ is estimated within the regularized regression with its penalty parameters set to 0.
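In software, leaving features unpenalized typically amounts to giving them a zero penalty factor. The R sketch below illustrates this on simulated data with glmnet, again using penalty.factor as a stand-in for the customized penalties; the data, the penalty values, and the scaling convention are illustrative, not part of the proposed algorithm.

library(glmnet)
set.seed(3)
n <- 100; p <- 50
X  <- matrix(rnorm(n * p), n, p)                          # genomic features (penalized)
X0 <- cbind(age = rnorm(n, 60, 10), nodes = rpois(n, 2))  # p0 = 2 clinical features
lam_j <- exp(rnorm(p, 0, 0.3))                            # stand-in for exp(Z %*% alpha)
time <- rexp(n, 0.1); status <- rbinom(n, 1, 0.7)
pf <- c(rep(0, ncol(X0)), lam_j / mean(lam_j))            # penalty factor 0 => never shrunk
fit <- glmnet(cbind(X0, X), cbind(time = time, status = status), family = "cox",
              alpha = 0.5, penalty.factor = pf, lambda = mean(lam_j) / n)
coef(fit)[1:2, ]                                          # clinical coefficients stay unpenalized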
3.4 Simulations

3.4.1 Simulation methods

In this section, we perform a series of simulation experiments to evaluate the proposed model in terms of prediction and feature selection performance compared to standard regularized Cox regression. We first generate the meta-feature data Z by sampling from independent Bernoulli variables with probability 0.1. This mimics biological pathway/functional gene set meta-features: each pathway contains a group of genes, with 1 indicating that a gene belongs to the pathway and 0 that it does not. The meta-feature weights α were fixed from an equally spaced grid of values ranging from -1 to 1. We then generate β from a normal prior distribution with mean 0 and variance computed from Zα, based on equation (3.5). Because we want the underlying model to be sparse, we only keep the top 20% of the elements of β with the largest absolute values, and set the rest to 0. To model uninformative meta-features, we also consider scenarios where we randomly flip some of the rows of Z from 0 to 1 or vice versa. The proportion of modified rows tracks the overall informativeness of the meta-features; e.g., 10% of the rows modified corresponds to high informativeness, while 90% of the rows modified corresponds to low informativeness. The rows of the data matrix X are sampled from a multivariate normal distribution with autoregressive correlation structure, Σ_ij = ρ^{|i−j|}, with ρ = 0.5. Survival times are generated using the inverse probability integral transform:

t = H_0^{-1}\big[ -\ln(U)\, e^{-\beta^T x} \big]

where U ∼ uniform[0, 1] and H₀(t) = (t/5)^8 is a baseline cumulative hazard function corresponding to a Weibull distribution. Censoring times are sampled from an exponential distribution, c ∼ Exp(0.1). The parameter values for the Weibull and exponential distributions are set to generate survival times in the 0 to 20 range and a fixed ratio (3/2) of events to censoring. We then add normal noise to the survival times to control the overall underlying predictive ability of the features for the outcome. Specifically, we set the concordance measure (Harrell et al., 1982) to 0.8 by controlling the standard deviation of the added noise. The survival outcome is set to be the minimum of the generated survival and censoring times, y = min(t, c).

For each simulation replicate, we generated training data as described above. We then fit a standard elastic net regularized Cox regression without the external meta-features Z, and also fit our proposed model with meta-features. The elastic net regression is tuned with 5-fold cross-validation, while the proposed model does not require penalty parameter tuning, as the penalties are estimated as part of the model fitting procedure. We then compare the prediction performance of the two models on an independent test set of size 1,000, generated following the same procedure described for the training data. We used the C-index as the performance metric. For each simulation scenario we used 100 replicates.

We ran a series of experiments varying one key parameter while keeping the others fixed. The base case parameters are sample size n = 100, feature size p = 200, meta-feature size q = 10, and meta-feature informativeness such that 5% of the features (rows of Z) are modified to have incorrect values. Four experiments are conducted by varying one parameter at a time:

1. Meta-feature informativeness level from high to low: proportion of rows of Z modified 5%, 15%, 30%.
2. Feature size, p = 200, 600, 1000.
3. Sample size, n = 100, 200, 300.
4. Meta-feature size, q = 10, 20, 30.

In experiment 1, we also examined the feature selection performance of both models, to evaluate how the informativeness of the meta-features influences model interpretation.
3.4.2 Simulation results

Figure 3.2 shows the results of the 4 simulation experiments. The horizontal dashed line in each panel represents the population/theoretical C-index, achievable with an infinite amount of training data. It is provided as a reference for each parameter setting. The performance of the meta-feature guided elastic net Cox model (denoted as 'meta' in the figure) is compared to that of a standard elastic net Cox model.

In experiment 1, there are consistent improvements in prediction performance as long as the meta-features are informative. The higher the informativeness of the meta-features, the larger the gains in prediction performance over the standard elastic net model.

Experiment 2 illustrates the model performance with respect to the number of features p. A larger number of features relative to the sample size makes it harder for both models to predict well, as both models' test C-indices are substantially below the theoretical C-index of 0.8 when p = 600 or p = 1,000. However, the meta-feature model consistently outperforms the standard elastic net model.

Experiment 3 evaluates a similar situation as experiment 2, but instead of varying the number of features, it varies the sample size n while keeping p = 200. As the sample size gets larger, i.e., the ratio of the number of features to the sample size becomes smaller, both models perform better, and the meta-feature guided elastic net consistently outperforms the standard elastic net. Furthermore, a larger sample size yields more stable prediction metrics (smaller variance of the test C-index).

As the number of meta-features increases (experiment 4), the prediction performance improvement over the standard elastic net becomes smaller. This indicates the meta-feature guided model's inability to handle a large number of meta-features.
Figure 3.2: Simulation results: prediction performance.

In terms of feature selection performance, we define accurate selection as follows: features with non-zero simulated coefficients are estimated with non-zero values, and features with zero simulated coefficients are estimated as zero. In Figure 3.3, we compared the feature selection accuracy of the meta-feature guided elastic net model, the standard elastic net model with the λ value that gives the maximum cross-validated C-index (denoted 'enet.min'), and the standard elastic net model with the λ value that gives the most regularized model such that the cross-validated C-index is within one standard error of the maximum (denoted 'enet.1se'), in experiment 1, where the informativeness of the meta-features varies. The '1se' elastic net model is sparser (fewer nonzero coefficients) than the 'min' elastic net model. The proposed meta-feature guided elastic net model again outperforms both standard elastic net models in feature selection accuracy. Moreover, the selection is more stable with meta-features compared to either of the standard elastic net models. Note that the selection accuracy of the 'min' elastic net model is highly unstable, since it yields less sparse models than the '1se' elastic net and the meta-feature guided model.

Figure 3.3: Simulation results: feature selection. 'enet.min' is the standard elastic net model with the λ value that gives the maximum cross-validated C-index; 'enet.1se' is the elastic net model with the λ value that gives the most regularized model such that the cross-validated C-index is within one standard error of the maximum.
3.5 Applications

3.5.1 Gene expression signatures for breast cancer survival

To illustrate the performance of our approach on real data, we applied the meta-feature guided regularized regression to the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) study. The data include cDNA microarray profiling of close to 2,000 breast cancer specimens processed on the Illumina HT-12 v3 platform (Illumina_Human_WG-v3) (Curtis et al., 2012). The data were divided into a discovery/training set of 997 samples and a validation/test set of 995 samples. The goal is to build a prognostic model for breast cancer survival based on gene expression and clinical features. The feature data X consist of 29,477 gene expression probes and two clinical features, age at diagnosis and the number of positive lymph nodes. The meta-feature data Z consist of four "attractor metagenes", which are selected gene co-expression signatures associated with the ability of cancer cells to divide uncontrollably, to invade surrounding tissues, and with the effort of the organism to fight cancer with a particular immune response (Cheng et al., 2013b). Three metagenes are universal "attractor metagenes" and consist of genes involved in mitotic chromosomal instability (CIN), mesenchymal transition (MES), and lymphocyte-specific immune recruitment (LYM), respectively. A fourth metagene is associated with good prognosis and contains two genes: FGD3 and SUSD3. The CIN, MES, and LYM metagenes each consist of 100 genes, but for our analysis we considered only the 50 top-ranked within each metagene. The data matrix Z is an indicator matrix for whether a specific expression probe corresponds to a gene in a metagene.

Model building was based on the samples that were ER-positive and HER2-negative, as treatments are more homogeneous in this group and it is associated with good prognosis (Rivenbark et al., 2013). There were 740 samples in the discovery set and 658 samples in the validation set in the ER+ and HER2− subset after removing samples with missing values. We applied meta-feature guided elastic net regression to train the model. The test set was used to evaluate model performance. The same training/test scheme was used to fit a standard elastic net regression without attractor metagene information as a comparison.

With only gene expression features in the model and no clinical features, the test C-index for the metagene guided elastic net model is 0.637, compared to a test C-index of 0.663 for the standard Cox elastic net counterpart. When adding the clinical features, age at diagnosis and number of positive lymph nodes, the test C-index increased to 0.715 and 0.728 for the metagene guided elastic net model and the standard Cox elastic net model, respectively (Table 3.1). Our metagene guided elastic net model does not perform as well as the standard elastic net model in either situation, with or without clinical features. This contrasts with the experience with 'xrnet', which improved performance in the METABRIC data. To understand the difference, we looked at the metagene weights estimated by the empirical Bayes method (Table 3.2). The intercept represents the penalty applied to probes that do not map to any of the metagenes. The metagene weights indicate the differential penalty applied to the genes in a metagene; a positive (negative) weight means the genes in the metagene are penalized more (less) strongly, indicating they are less (more) important for predicting mortality. The metagenes 'CIN' and 'FGD3-SUSD3' are more important than 'MES' and 'LYM', conforming to the findings with 'xrnet' (Chapter 2, Section 2.5). Also, the number of features selected by the metagene guided model is much smaller than that of the standard elastic net: 9 vs. 30 for gene expression only, and 9 vs. 49 for gene expression plus clinical features, respectively. These findings suggest that although the methods agree in identifying which metagenes are more important, the non-linear integration of meta-features does not fit this particular dataset well, and that the underlying relationship between gene expression and metagenes in METABRIC is better captured by the linear model implicit in 'xrnet'.
  C-index                              Elastic net    Metagene elastic net
  Gene expressions only                    0.663             0.637
  Gene expressions + clinical              0.728             0.715

Table 3.1: METABRIC: Test C-index for the standard elastic net and the metagene guided elastic net

  Metagene        Metagene weights α
                  Gene expressions only    Gene expressions + clinical
  Intercept              4.7355                    4.7582
  CIN                   -1.3677                   -1.4057
  MES                    0.5520                    0.7581
  LYM                    0.1074                    0.2953
  FGD3-SUSD3            -2.0187                   -1.8620

Table 3.2: METABRIC: Metagene weights α
3.5.2 Anti-PD1 predictive biomarker for melanoma survival

We also applied the proposed meta-feature model to a melanoma dataset to predict overall survival in patients treated with PD-1 immune checkpoint blockade. The programmed death 1 (PD-1) pathway is an immune-regulatory mechanism used by cancer to hide from the immune system. Antagonistic antibodies to the PD-1 pathway and its ligand, programmed death ligand 1 (PD-L1), have demonstrated clinical benefits and tolerability. Immune checkpoint blockades such as nivolumab and pembrolizumab are anti-PD-1 antibodies showing improved overall survival for the treatment of advanced melanoma. However, less than 40% of patients respond to them (Moreno et al., 2015). Therefore, identifying predictive signals of treatment outcome is of great interest in order to select the patients most likely to benefit from anti-PD-1 treatment. We explored transcriptomes and clinical data using our model to illustrate prediction performance and predictive signal selection.

The dataset combined 3 clinical studies in which RNA-sequencing profiles of patients treated with anti-PD1 antibodies were obtained: Gide et al. (2019); Riaz et al. (2017); Hugo et al. (2016). The gene expression levels were normalized toward the all-sample average within each study as the control, so that they are comparable to one another across features within a sample and comparable to one another across samples. There are 16,010 genes in common across the 3 studies, and a combined total of 117 subjects. The clinical variables considered are age, gender, and tumor response. We build predictive models for overall survival based on transcriptomics and clinical variables. Since the subjects are all treated with anti-PD1 antibodies, the transcriptomic features selected by the model are predictive signals for treatment efficacy or resistance. We selected hallmark gene sets as meta-features from the molecular signature database (Liberzon et al., 2015). A total of 13 gene sets were enriched (Subramanian et al., 2005) with false positive rates less than 0.25 (Table 3.3). The meta-feature matrix Z is formed as an indicator of whether each of the 16,010 genes belongs to one of the 13 hallmark gene sets.

We compared prediction and feature selection performance between the meta-feature guided elastic net model and the standard elastic net model. The data were split into a training set (75%) and a test set (25%). The standard elastic net was trained using 5-fold cross-validation, while the meta-feature model was trained with estimated hyperparameters. The test concordance index is C = 0.7340 for the elastic net and C = 0.7609 for our meta-feature guided model. As for feature selection, the meta-feature model, which selects 4 genes (GPAA1, COX6C, VPS28, PLCB4), is sparser, compared to 11 genes selected by the elastic net model.
  Hallmark meta-feature              Estimated α
  Interferon gamma response            -0.0102
  Allograft rejection                   0.1570
  Interferon alpha response            -0.0314
  IL6 JAK STAT3 signaling               0.1131
  Inflammatory response                 0.0744
  Complement                            0.1577
  TNFA signaling via NFKB               0.1180
  IL2 STAT5 signaling                   0.1613
  Bile acid metabolism                  0.0876
  Kras signaling down                   0.2338
  Xenobiotic metabolism                 0.2598
  Apoptosis                             0.2557
  Kras signaling up                     0.2737

Table 3.3: Anti-PD1: List of hallmark meta-features and their respective estimated weight α.
3.6 Discussion

In this chapter, we extended to survival outcomes the customized regularization model guided by genomic meta-features of Zeng et al. (2021). This model has individualized penalty parameters for each of the features, instead of the single global penalty parameter traditional in commonly used regularized regression methods. Differential penalization allows the importance of each feature in predicting the outcome of interest to be determined from the data through the features' underlying characteristics/meta-features. With informative meta-features, the more important features are penalized less, resulting in a lower likelihood of being shrunk to exactly 0, and the less important features are penalized more strongly, resulting in a higher likelihood of being excluded from the model. In our simulation experiments and in the anti-PD1 immunotherapy predictive modeling application, the customized regularization model showed benefits in prediction performance, and the selected model tended to be sparser (fewer features selected). Of note is that to derive prediction performance gains, the external meta-feature data need to be informative and low-dimensional. Therefore, substantive prior knowledge to carefully select a potentially relevant and limited meta-feature set is critical.

We have developed two different methods to systematically integrate meta-feature information into the modeling process with survival outcomes and high-dimensional data. The approach in this chapter integrates meta-features in a non-linear way: it allows the meta-features to dictate the importance of each feature by defining the penalty parameters as a non-linear function of the meta-features. By contrast, the 'xrnet' regularized hierarchical model in Chapter 2 integrates meta-features linearly. The fundamental difference between the two methods is that 'xrnet' models the meta-features through the mean of β, while the proposed model in this chapter models the meta-features through the variance of β. Specifically, 'xrnet' models β ∼ N(Zα, τI), while the customized regularization approach models β as having the approximate prior N(0, 2/(2λ_j(1−c) + c²λ_j²)). In most scenarios, a user would not know which model to favor in advance. Depending on the application at hand, one would want to consider both approaches, selecting the best based on performance on test data.

Our proposed model applies empirical Bayes to estimate a penalty parameter for each feature, using the Laplace approximation to obtain the marginal likelihood. However, the Laplace approximation works well when the sample size is relatively large. We see in simulation experiment 2 that, while our model shows prediction benefits for all feature set sizes, there is a decreasing trend in the gain over standard elastic net Cox regression, which may be caused by a drop in the quality of the Laplace approximation when the sample size is small relative to the feature set size. We did not, however, directly assess how the quality of the Laplace approximation affects model performance.

We have already mentioned that the proposed customized regularization model does not handle high-dimensional meta-features well. This is because no regularization is applied to the meta-feature weights α, nor is there another mechanism to control the complexity of this part of the model. As a result, the model fitting algorithm is not able to deal with large numbers of meta-features efficiently and stably. Careful filtering of the meta-features is therefore required in a high-dimensional meta-feature setting. Future work to regularize the meta-feature part of the model is a potential improvement to expand its ability to handle high-dimensional meta-feature data.
Chapter 4

L₀ Regularized Regression with Correlated Features

4.1 Introduction

In Chapter 1, we discussed the feature selection properties of the lasso, elastic net, and best subset selection. The latter is equivalent to L₀ constrained regression when the data matrix X is orthogonal. In fact, the lasso and elastic net can be thought of as approximation methods to best subset selection. We see this in Section 1.3.4: in the spectrum of L_q constrained regression, q = 0 corresponds to best subset selection, 0 < q < 1 to nonconvex regularization, q = 1 to the lasso, and q = 2 to ridge regression, which does not perform selection. As the value of q moves away from 0 toward larger values, the estimated regression coefficients become more biased toward 0, and the resulting model becomes less sparse. Under orthogonality of the design matrix, best subset selection (L₀) yields unbiased coefficient estimates. Therefore, L₀ constrained regression would generally be the preferred model. However, in reality the features are not orthogonal and exhibit some degree of correlation. In this situation, L₀ constrained regression is an NP-hard problem (Huo and Ni, 2007).

Several approaches have been proposed for fitting L₀-regularized linear regression. Blumensath and Davies (2008) introduced an iterative hard thresholding algorithm, which is a proximal gradient-based method. Based on hard thresholding, several variants of the method have also been developed, such as proximal iterative hard thresholding (Zhang and Zhang, 2019) and proximal alternating iterative hard thresholding (Yang et al., 2017). These methods share the main idea of applying the proximal operator univariately to obtain coordinate-wise solutions. However, these methods ignore the correlation structure of the data matrix X and might not work well with correlated features. We look at a novel algorithm for L₀ constrained least squares and modify it to take the covariance matrix X^T X into account. The original plan for this work was to later extend L₀ constrained least squares to integrate meta-features. However, the approach did not work as expected, and we turned to the methods in Chapters 2 and 3 instead. We include it here for completeness and because some of the ideas may prove useful in the future.
4.2 Proximal distance algorithm for L₀ regularized regression
Proposed by Keys et al. (2019), the proximal distance algorithm is a general method for solving a constrained optimization problem. It converts a constrained minimization problem into an unconstrained one, with a penalty on the distance to the constraint set. The constrained minimization problem is
\[
\min_{x \in \mathbb{R}^p} f(x) \quad \text{subject to} \quad x \in C
\tag{4.1}
\]
where C is a closed set. This general constrained optimization problem can be turned into a penalized (unconstrained) version
\[
\min_{x \in \mathbb{R}^p} f(x) + \frac{\rho}{2}\,\mathrm{dist}(x, C)^2
\tag{4.2}
\]
where the squared distance is defined as \(\mathrm{dist}(x, C)^2 = \inf_{y \in C} \|x - y\|_2^2\), i.e., the square of the Euclidean distance of x to the closed set C. The distance penalty function is nonnegative and vanishes precisely on C. As the penalty parameter ρ tends to ∞, the distance of x to the set C is penalized so strongly that it is driven to 0, i.e., x lies in the set, which means the minimizer found by the unconstrained version (4.2) is equivalent to that of the constrained one (4.1). To minimize (4.2), we again use a majorization-minimization (MM) algorithm similar to that used in chapter 3 to estimate the model hyperparameters. The majorization step, which forms an upper bound of f(x) around the current iterate \(x_n\), replaces the distance penalty function \(\mathrm{dist}(x, C)^2\) with the spherical quadratic \(\|x - P_C(x_n)\|_2^2\), where \(P_C(x_n)\) is the projection of the n-th iterate \(x_n\) onto C: the point in C that attains the minimum distance to \(x_n\), \(P_C(x_n) = \operatorname{argmin}_{y \in C} \|x_n - y\|_2\). Therefore, by the definition of projection, \(\|x - P_C(x_n)\|_2^2 \ge \mathrm{dist}(x, C)^2\), and at the current iterate \(x_n\) the tangency condition \(\|x_n - P_C(x_n)\|_2^2 = \mathrm{dist}(x_n, C)^2\) of a majorization is satisfied. The minimization step is then the proximal map of the current projection
\[
x_{n+1} = \operatorname{prox}_{\rho^{-1} f}\big(P_C(x_n)\big) = \operatorname{argmin}_{x} \; f(x) + \frac{\rho}{2}\|x - P_C(x_n)\|_2^2
\tag{4.3}
\]
The MM principle guarantees that \(x_{n+1}\) decreases the penalized loss.
For the L₀-regularized least squares problem, the constrained optimization form can be written as
\[
\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|_2^2 \quad \text{subject to} \quad \beta \in S_k^p
\tag{4.4}
\]
where \(S_k^p\) is the set of vectors with at most k nonzero entries out of p. The unconstrained form of (4.4), i.e., the penalized distance objective function, is
\[
\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|_2^2 + \frac{\rho}{2}\,\mathrm{dist}(\beta, S_k^p)^2
\tag{4.5}
\]
To use the majorization-minimization algorithm to solve problem (4.5), we first perform distance majorization
\[
\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|_2^2 + \frac{\rho}{2}\|\beta - P_{S_k^p}(\beta_n)\|_2^2
\tag{4.6}
\]
The minimization step is the proximal operator applied to the projection onto the set \(S_k^p\),
\[
\beta_{n+1} = \operatorname{prox}_{\rho^{-1}\mathrm{OLS}}\big(P_{S_k^p}(\beta_n)\big) = (X^T X + \rho I)^{-1}\big(X^T y + \rho\, P_{S_k^p}(\beta_n)\big)
\tag{4.7}
\]
We discussed earlier that the distance penalty parameter ρ should be increased systematically until a constrained minimum is reached. In practice, starting at ρ₀ = 1 and increasing its value through a sequence ρ_n = min(αⁿρ₀, ρ_max), with α slightly larger than 1, works well. The overall algorithm can be summarized as follows:
1. Initialize β₀ = 1, ρ₀ = 1, dist₀ = ∞, loss₀ = ∞.
2. For iteration counter n from 1 to max iteration:
   (a) Update β_n = prox_{ρ⁻¹ OLS}(P_{S_k^p}(β_{n−1}))
   (b) Compute the current distance dist_n = ‖β_n − P_{S_k^p}(β_{n−1})‖₂
   (c) Compute the current loss function value loss_n = OLS(β_n)
   (d) If |dist_n − dist_{n−1}| ≤ tolerance and |loss_n − loss_{n−1}| ≤ tolerance:
       • break
       • return P_{S_k^p}(β_n)
       else:
       • dist_{n−1} = dist_n
       • loss_{n−1} = loss_n
       • increase ρ by a small multiplier α every few iterations
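To make the steps concrete, the following R sketch implements the algorithm for L₀ constrained least squares. It is a minimal illustration rather than the exact code used in our experiments; the function name `prox_dist_l0`, the multiplier schedule (increase ρ every 5 iterations), and `rho_max` are our own choices.

```r
# Minimal proximal distance algorithm for L0-constrained least squares.
# Projection onto S_k^p: keep the k largest coordinates in absolute value.
project_sparse <- function(beta, k) {
  out <- numeric(length(beta))
  keep <- order(abs(beta), decreasing = TRUE)[seq_len(k)]
  out[keep] <- beta[keep]
  out
}

prox_dist_l0 <- function(X, y, k, alpha = 1.2, rho_max = 1e8,
                         max_iter = 500, tol = 1e-6) {
  p <- ncol(X)
  XtX <- crossprod(X)             # X'X
  Xty <- as.vector(crossprod(X, y))  # X'y
  beta <- rep(1, p)               # beta_0 = 1
  rho <- 1                        # rho_0 = 1
  dist_old <- Inf; loss_old <- Inf
  for (n in seq_len(max_iter)) {
    proj <- project_sparse(beta, k)
    # Proximal update (4.7): (X'X + rho I)^{-1} (X'y + rho * projection)
    beta <- as.vector(solve(XtX + rho * diag(p), Xty + rho * proj))
    dist_new <- sqrt(sum((beta - proj)^2))
    loss_new <- 0.5 * sum((y - as.vector(X %*% beta))^2)
    if (abs(dist_new - dist_old) <= tol && abs(loss_new - loss_old) <= tol) {
      return(project_sparse(beta, k))   # return the projected (sparse) solution
    }
    dist_old <- dist_new; loss_old <- loss_new
    if (n %% 5 == 0) rho <- min(alpha * rho, rho_max)  # increase rho every few iterations
  }
  project_sparse(beta, k)
}
```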
4.3 Incorporating the data covariance matrix

The proximal distance algorithm uses the Euclidean distance as the distance metric. But this ignores the correlations between the features. To incorporate the covariance matrix, we can use a weighted (Mahalanobis-type) distance:
\[
d(x, y) = \sqrt{(x - y)^T W (x - y)}
\]
where, assuming the columns of X are mean-centered, \(W = X^T X\) is the data covariance matrix. With weighted distance majorization, the new update is
\[
\beta_{n+1} = \operatorname{prox}_{\rho^{-1}\mathrm{OLS}}\big(P_{S_k^p}(\beta_n)\big) = (X^T X + \rho W)^{-1}\big(X^T y + \rho\, W P_{S_k^p}(\beta_n)\big)
\tag{4.8}
\]
However, we still need to project the current iterate onto the L₀ constraint set, \(P_{S_k^p}(\beta_n)\). With the standard Euclidean distance, the projection simply keeps the largest k coordinates of β and sets the remaining coordinates to 0. With the weighted distance, the projection problem is again NP-hard. We notice that if the weight matrix is diagonal, the projection keeps the k coordinates with the largest values of \(w_j \beta_j^2\). Based on this observation, we propose the following approximation to the projection: for each coordinate j, compute \(e_j^T W \beta / (e_j^T W e_j)\), which is the weighted projection onto axis j, and keep the k coordinates with the largest axis projection values \((e_j^T W \beta)^2 / (e_j^T W e_j)\), setting the rest of the coordinates to 0. This is similar to the projection used with the standard Euclidean distance, but in a weighted inner-product space.
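As an illustration, one plausible reading of this weighted axis projection can be sketched in R as follows. This is a schematic of the approximation only, with hypothetical function names; it assumes W = XᵀX (with mean-centered columns) has already been computed, and it sets each kept coordinate to its weighted axis projection, which reduces to keeping the k largest |β_j| when W is the identity.

```r
# Approximate projection onto the L0 constraint set in the weighted
# inner-product space defined by W (here W = X'X with centered columns).
project_sparse_weighted <- function(beta, W, k) {
  d <- diag(W)                              # e_j' W e_j for each axis j
  Wb <- as.vector(W %*% beta)               # e_j' W beta for each axis j
  axis_proj <- Wb / d                       # weighted projection onto axis j
  score <- Wb^2 / d                         # squared weighted axis projection
  keep <- order(score, decreasing = TRUE)[seq_len(k)]
  out <- numeric(length(beta))
  out[keep] <- axis_proj[keep]              # keep the k largest axis projections
  out
}
```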
4.4 Simulation

We simulated a data matrix X of dimension 100 × 200 following a multivariate normal distribution N(0, Σ), where Σ has an autoregressive(1) correlation structure with ρ = 0.1, i.e., the features are close to uncorrelated. The regression coefficients β are also simulated, with β ∼ N(0, Σ_β), where Σ_β is autoregressive(1) with ρ = 0.6. The variance components of β (the diagonal elements of Σ_β) are set to be large for the first 10 elements and small for the remaining 190. This setting makes the true underlying model sparse, i.e., the first 10 elements of β are non-zero and the rest are all zeros. The true predictive R² of the data is fixed at R² = 0.5. The hyperparameters of the lasso (λ) and of L₀ (k) are tuned using a simulated validation set. The experiment is repeated 100 times. The results (Table 4.1) show that the validation R² of the lasso is higher than the validation R² of standard L₀ regression. In terms of feature selection, the true positive selection rate of the lasso is also better than that of L₀ regression. However, this comes at the price of a higher false positive rate. We also looked at the bias of the first element of β, where bias is the absolute value of the difference between the true value and the estimated value. The average bias of signal 1 for L₀ regression is much smaller than for the lasso.
        Validation R²   True positive rate   False positive rate   Bias (signal 1)
L₀      0.268           0.506                0.019                 0.231
lasso   0.318           0.770                0.114                 0.394

Table 4.1: Prediction and feature selection comparison between the lasso and L₀
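For reference, the simulation setup described above can be reproduced along the following lines. This is a sketch under our stated assumptions (AR(1) correlation structures, exactly the first 10 coefficients carrying the signal, and a noise variance chosen to make the true R² equal to 0.5); it is not the exact script used to produce Table 4.1.

```r
set.seed(1)
n <- 100; p <- 200
# AR(1) correlation matrix: Sigma[i, j] = rho^|i - j|
ar1 <- function(p, rho) rho^abs(outer(seq_len(p), seq_len(p), "-"))
Sigma_x <- ar1(p, 0.1)                              # features close to uncorrelated
X <- MASS::mvrnorm(n, mu = rep(0, p), Sigma = Sigma_x)

# Coefficients: AR(1) correlation with rho = 0.6; large variance for the
# first 10 components (true signals), zero variance for the remaining 190.
sd_beta <- c(rep(1, 10), rep(0, p - 10))
Sigma_b <- diag(sd_beta) %*% ar1(p, 0.6) %*% diag(sd_beta)
beta <- MASS::mvrnorm(1, mu = rep(0, p), Sigma = Sigma_b)

# Choose the noise variance so that the true predictive R^2 is 0.5
signal_var <- as.numeric(t(beta) %*% Sigma_x %*% beta)
y <- as.vector(X %*% beta + rnorm(n, sd = sqrt(signal_var)))  # noise var = signal var
```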
We then looked at the 10 true signal coefficient estimates from a single run of the above experiment. Estimates from standard L₀ regression, from our proposed weighted L₀ regression incorporating the data correlation structure, and from the lasso, along with the simulated true values, are compared (Figure 4.1). The standard L₀ selected 5 out of the 10 true signals, while the weighted L₀ and the lasso both chose 8 of them. However, the lasso estimates are heavily shrunk. Note that we used the weighted distance metric in the weighted L₀ algorithm, but with the standard projection.

Figure 4.1: Comparison of signal estimates

The simulation results showed some potential for the weighted L₀ algorithm to improve feature selection in terms of accuracy and bias of the estimates. However, when we tried the proposed approximate solution to the weighted projection in simulation, the results did not show improvement in feature selection, and we have not yet come up with a better solution to the weighted projection problem.
Chapter 5
Discussion and Future Work
5.1 The two proposed methods

We have developed two high-dimensional predictive models for integrating meta-features. When the meta-features are informative, both models show improvement in prediction performance compared to standard regularized regression, and the improvement increases as the meta-features become more informative. When the meta-features are irrelevant, the regularized hierarchical model (‘xrnet’) performs only slightly worse than standard regularized regression, showing robustness in prediction. This robustness can reassure users experimenting with different meta-feature sets when they are uncertain about their informativeness. We also observe that for the ‘xrnet’ model, the prediction performance gains are largest when the number of features is large relative to the number of samples (Figure 2.2). Thus, although prediction is generally hardest with very high dimensional data, informative meta-features can help compensate for the relatively small number of samples. This is an attractive property for practical applications, as in many prognostic and diagnostic studies with genomic data, the sample sizes are often small, in the hundreds of subjects or fewer. The meta-feature guided regularized regression (chapter 3) does not exhibit this same behavior. Its performance improvement is constant as the feature set size increases (Figure 3.2), so it provides stable prediction improvements that depend only on the informativeness of the meta-features.
The two models are also quite different with regard to feature selection. While the hierarchical ‘xrnet’ model performs selection at the meta-feature level, the meta-feature guided regularized regression does not. For example, if the meta-features are gene sets/pathways, ‘xrnet’ selects gene sets, i.e., groups of genes, rather than individual genes. When standard regularized regression yields a relatively dense model, it hardly provides any insight into disease mechanisms. However, with groups of genes that share common biological functions, the hierarchical model reveals associations between the outcome and those biological functions. Meanwhile, the meta-feature guided regularized regression selects at the individual feature level, the same as standard regularized regression. Nevertheless, with an individualized penalty for each feature, it usually yields a sparser model. Both of the proposed data integration strategies therefore produce more interpretable models, each in its own way.

Mathematically, although both methods are based on regularized regression, they are fundamentally different. The regularized hierarchical model integrates meta-features linearly and models meta-features through the mean of the feature-level regression coefficients β, while the meta-feature guided regularized regression integrates meta-features non-linearly and models them through the variance of β. Thus they provide truly alternative options for building predictive models with meta-features, and the user may want to try both approaches and choose the better performing one for the problem at hand. Similarities and differences between the two models are summarized below.
Regularized hierarchical model
• Prediction improves with informative meta-features
• Selection at the meta-feature level
• The two penalty parameters are tuned via cross-validation
• Models meta-features through the mean of β, linear integration

Meta-feature guided model
• Prediction improves with informative meta-features
• Selection at the feature level
• Estimates individual penalty parameters via empirical Bayes
• Models meta-features through the variance of β, non-linear integration
The two methods have both been applied to gene expression signatures of breast cancer survival (METABRIC) and to an anti-PD1 immunotherapy predictive biomarker study for melanoma survival. In the METABRIC data application, the two methods yielded different prediction performances. While the regularized hierarchical model (‘xrnet’) showed improvement in prediction after integrating the metagene information, the metagene guided regularized regression did not perform as well as standard regularized regression. This is an example where one method can outperform the other in a specific data application. By contrast, in the anti-PD1 immunotherapy predictive biomarker application, both methods had comparable prediction performance and both outperformed standard regularized regression.
5.2 High-dimensionality of genomic data

Genomics data are often high dimensional. The human genome contains approximately 3 billion base pairs, which reside in the 23 pairs of chromosomes within the nucleus of all our cells. There are about 20K protein-coding genes and 20K non-coding genes in the human genome, and about 30K CpG islands, so genome-wide gene expression and methylation studies have tens of thousands of features. However, the number of samples, especially in oncology settings, is often small relative to the number of features. For cancer patients, tumor genomic profiles are often obtained from biopsies, which can be highly invasive, risky, and costly. As a result, the number of patients with genomic data can be limited. A typical data set in cancer genomics consists of hundreds to thousands of subjects, while the number of genomic variables is in the tens to hundreds of thousands, or even millions.
Novel liquid biopsy techniques, which require only blood rather than tumor tissue, are less invasive and painful, much easier to access, and can be performed in most health conditions. This could potentially generate many more samples than tissue biopsies, and high dimensionality may no longer be an issue for genomics data. With both samples and features in high dimension, simple linear models are no longer necessarily among the best choices. Since machine learning has made great progress over the past years, there are better modeling options for this regime. Non-linear patterns, like interactions between features, are hard to detect with limited sample sizes. With large samples, non-linear patterns can be explored with methods like gradient boosting, a tree-based method that specializes in detecting complicated interaction patterns. Deep neural networks, with enough units and layers, can also capture non-linear patterns. These two approaches are hugely successful in settings (e.g., image detection, language translation) where non-linear patterns dominate and where large amounts of samples are available. Liquid biopsies open the possibility of having sample sizes large enough to be analyzed using deep learning in the near future, with potentially a large impact on cancer genomics.
5.3 Feature selection

So far we have discussed feature selection from the subset of features already included in the modeling process. But in high dimensional settings one may want to pre-screen the set of features to be included in the analysis in order to gain predictive power. A common approach, which is not recommended, is to let the data decide the set of features by conducting a pre-analysis on the data itself. However, pre-screening features based on the data and then building a predictive model using the same data can result in optimistically biased estimates of prediction performance. What should be done instead, to the degree possible, is to pre-screen based on knowledge from prior studies and the literature. However, this strategy risks missing important features not previously studied.
The two methods developed in this thesis both use common sparse regularization techniques: the lasso and the elastic net. Both are convex and have stable and efficient fitting algorithms. Moreover, as we discussed in section 1.3.2, lasso subset selection is consistent under certain conditions, and the elastic net can complement the lasso to deal with groups of correlated features. However, as convex approximations to best subset selection, they both produce estimators biased toward 0, and there exist conditions under which their subset selection is not consistent. As discussed in section 1.3.4, nonconvex regularization methods, which are closer approximations to best subset selection, produce more parsimonious models, and the estimators are less biased toward 0. The SCAD and the MCP, along with the adaptive lasso, an approximation to L_q regularization, enjoy the oracle property; namely, when the true coefficient vector has some zero components, they are estimated as 0 with probability tending to 1, and the nonzero components are estimated as well as when the correct submodel is known (Fan and Li, 2001). This property improves model accuracy compared to the lasso and the elastic net. As all three regularized regression techniques are amenable to fitting by cyclic coordinate descent, they are natural extensions for the two methods developed in this thesis; existing implementations are sketched below.
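As a pointer to how these penalties can be used in practice, the sketch below fits SCAD and MCP penalized regressions with the ncvreg package and an adaptive lasso with glmnet using penalty factors. This is not part of the methods developed in this thesis; it only illustrates existing software, assuming a numeric predictor matrix X and a continuous outcome y, and the ridge-based initial weights for the adaptive lasso are one common choice among several.

```r
library(ncvreg)
library(glmnet)

# SCAD and MCP penalized linear regression, tuned by cross-validation
fit_scad <- cv.ncvreg(X, y, penalty = "SCAD")
fit_mcp  <- cv.ncvreg(X, y, penalty = "MCP")

# Adaptive lasso: weights from an initial ridge fit, passed as penalty factors
ridge   <- cv.glmnet(X, y, alpha = 0)
w       <- 1 / abs(as.numeric(coef(ridge, s = "lambda.min"))[-1])  # drop intercept
fit_ada <- cv.glmnet(X, y, alpha = 1, penalty.factor = w)
```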
5.4 Inference

Although the approaches developed in this dissertation are conceived mainly for predictive modeling, the two regression methods can also be used in discovery settings to uncover genomic risk factors. However, statistical inference providing p-values, standard errors, and confidence intervals for the model parameter estimates is not readily available. In general, inference is not straightforward in high dimensions, since there is no general asymptotic theory comparable to that on which most low-dimensional frequentist inference is based.
One way to provide inference is to rely on the Bayesian interpretation of regularized regression, but this requires a change of paradigm as well as relying on computationally expensive MCMC approaches. For example, the Bayesian lasso (Park and Casella, 2008) specifies
\[
y \mid \beta, \lambda, \sigma \sim N(X\beta, \sigma^2 I)
\]
\[
\beta \mid \lambda, \sigma \sim \prod_{j=1}^{p} \frac{\lambda}{2\sigma}\, e^{-\lambda |\beta_j| / \sigma}
\]
The prior distribution of β is a double exponential (Laplace) distribution. Using Bayes rule, the negative log posterior distribution of β given y, λ, σ is, up to a constant,
\[
\frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \frac{\lambda}{\sigma}\|\beta\|_1
\]
The lasso estimator coincides with the mode of this posterior distribution, given λ and σ. With the posterior distribution, we can obtain not only the point estimate (the posterior mode) but also the posterior variance, which in turn yields credible intervals. The exact Bayesian calculation for the lasso does not have a closed form solution; Markov chain Monte Carlo (MCMC) can be used to sample realizations from the posterior distribution. However, the complete Bayesian lasso requires prior distributions for σ and λ, introducing complications. Moreover, the MCMC procedure is computationally heavy, especially when the number of features is large, which is exactly the situation with genomics data.
Another way of obtaining the variance of the estimator is the bootstrap (Efron, 1979). The nonparametric bootstrap is a method for approximating sampling distributions. It draws resamples of the same size n from the training set, with replacement. On each resample, the model is trained with cross-validation to obtain the parameter estimate \(\hat{\beta}(\lambda_{CV})\). Repeating the procedure B times (for a sufficiently large B), the bootstrap variance is
\[
\mathrm{Var}(\hat{\beta}) = \frac{1}{B-1}\sum_{b=1}^{B}\big(\hat{\beta}_b - \bar{\hat{\beta}}\big)^2
\]
where \(\hat{\beta}_b\) is the estimate calculated from the b-th resample and \(\bar{\hat{\beta}} = \frac{1}{B}\sum_{b=1}^{B}\hat{\beta}_b\) is the mean of the resample estimates. Appealing to the central limit theorem, we can then obtain a 95% confidence interval for \(\hat{\beta}(\lambda_{CV})\). The nonparametric bootstrap is computationally tractable compared to the Bayesian approach that relies on MCMC, as each resample iteration only requires running the efficient pathwise coordinate descent algorithm.
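As a rough sketch of how such a bootstrap could be carried out for the lasso, the following R code refits cv.glmnet on each resample and collects the coefficient estimates. It is an illustration only: it assumes a numeric matrix x and response y are available, and the number of resamples, the use of lambda.min, and the normal-approximation interval are our own choices.

```r
library(glmnet)

set.seed(1)
B <- 200                                  # number of bootstrap resamples
n <- nrow(x); p <- ncol(x)
boot_coefs <- matrix(NA, nrow = B, ncol = p)

for (b in seq_len(B)) {
  idx <- sample(n, n, replace = TRUE)                 # resample rows with replacement
  cvfit <- cv.glmnet(x[idx, ], y[idx], alpha = 1)     # lasso tuned by cross-validation
  boot_coefs[b, ] <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]  # drop intercept
}

# Bootstrap standard errors and normal-approximation 95% confidence intervals
beta_mean <- colMeans(boot_coefs)
beta_se   <- apply(boot_coefs, 2, sd)
ci <- cbind(lower = beta_mean - 1.96 * beta_se,
            upper = beta_mean + 1.96 * beta_se)
```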
Frequentist statistical inference with high dimensional regularized regression is an active research area, and several recent approaches exist. Among them, family-wise error rate control with the single split method (Wasserman and Roeder, 2009) and its extension, the multi-split method (Meinshausen et al., 2009), are resampling methods based on reducing the number of features to a manageable size before analyzing a hold-out set. The idea is to perform sparse regularized regression in one part of the data, which gives an active set of features \(\{j : \hat{\beta}_j \neq 0\}\) with size at most n (the sample size). Inference can then be made in the hold-out set simply by using standard regression with the active features. Because the single split suffers heavily from the randomness of an arbitrary split, the multi-split procedure, which repeats the split many times, is a natural extension. The summary statistic used to combine the repeated p-values is a quantile chosen to control the family-wise error rate at a pre-specified type I error level. This line of work offers a potential option for extending our methods from prediction to inference.
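To make the single split idea concrete, a minimal R sketch for a continuous outcome might look as follows. The 50/50 split, the use of lambda.min, the final lm() fit, and the Bonferroni adjustment over the active set are illustrative assumptions rather than a prescribed recipe; x and y are assumed to be a numeric predictor matrix and outcome vector.

```r
library(glmnet)

set.seed(1)
n <- nrow(x)
half <- sample(n, floor(n / 2))           # random 50/50 split of the samples

# Screening half: lasso to obtain an active set of features
cvfit  <- cv.glmnet(x[half, ], y[half], alpha = 1)
active <- which(as.numeric(coef(cvfit, s = "lambda.min"))[-1] != 0)

# Inference half: ordinary least squares restricted to the active set
if (length(active) > 0) {
  fit_ols <- lm(y[-half] ~ x[-half, active, drop = FALSE])
  pvals <- summary(fit_ols)$coefficients[-1, "Pr(>|t|)"]      # p-values for active features
  pvals_adj <- pmin(pvals * length(active), 1)                # Bonferroni over the active set
}
```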
Bibliography
Sumithra J Mandrekar and Daniel J Sargent. Predictive biomarker validation in practice: lessons
fromrealtrials. Clinicaltrials,7(5):567–573,2010.
Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of
the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages
785–794,2016.
Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal
problems. Technometrics,12(1):55–67,1970.
Gustavo de Los Campos, Ana I Vazquez, Rohan Fernando, Yann C Klimentidis, and Daniel
Sorensen. Predictionofcomplexhumantraitsusingthegenomicbestlinearunbiasedpredictor.
PLoSGenet,9(7):e1003608,2013.
RobertTibshirani. Regressionshrinkageandselectionviathelasso. JournaloftheRoyalStatistical
Society: SeriesB(Methodological),58(1):267–288,1996.
Jerome Friedman, Trevor Hastie, Holger Höfling, Robert Tibshirani, et al. Pathwise coordinate
optimization. Annalsofappliedstatistics,1(2):302–332,2007.
Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity: the
lassoandgeneralizations. ChapmanandHall/CRC,2019.
Hui Zou, Trevor Hastie, Robert Tibshirani, et al. On the “degrees of freedom” of the lasso. The
AnnalsofStatistics,35(5):2173–2192,2007.
Ryan J Tibshirani and Jonathan Taylor. Degrees of freedom in lasso problems. The Annals of
Statistics,pages1198–1232,2012.
HuiZouandTrevorHastie. Regularizationandvariableselectionviatheelasticnet. Journalofthe
royalstatisticalsociety: seriesB(statisticalmethodology),67(2):301–320,2005.
TrevorHastie,RobertTibshirani,andJeromeFriedman. Theelementsofstatisticallearning: data
mining,inference,andprediction. SpringerScience&BusinessMedia,2009.
Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle
properties. JournaloftheAmerican statisticalAssociation,96(456):1348–1360,2001.
Cun-Hui Zhang et al. Nearly unbiased variable selection under minimax concave penalty. The
Annalsofstatistics,38(2):894–942, 2010.
Rahul Mazumder, Jerome H Friedman, and Trevor Hastie. Sparsenet: Coordinate descent with
nonconvex penalties. Journal of the American Statistical Association, 106(495):1125–1138,
2011.
HuiZou. Theadaptivelassoanditsoracleproperties. JournaloftheAmericanstatisticalassocia-
tion,101(476):1418–1429,2006.
Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables.
Journal oftheRoyalStatisticalSociety: SeriesB(StatisticalMethodology),68(1):49–67,2006.
Aravind Subramanian, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert,
Michael A Gillette, Amanda Paulovich, Scott L Pomeroy, Todd R Golub, Eric S Lander, et al.
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide ex-
pression profiles. Proceedings of the National Academy of Sciences, 102(43):15545–15550,
2005.
Feng Tai and Wei Pan. Incorporating prior knowledge of predictors into penalized classifiers with
multiplepenaltyterms. Bioinformatics,23(14):1775–1782,2007.
Linn Cecilie Bergersen, Ingrid K Glad, and Heidi Lyng. Weighted lasso with data integration.
Statisticalapplicationsingeneticsandmolecularbiology,10(1),2011.
Chubing Zeng, Duncan Campbell Thomas, and Juan Pablo Lewinger. Incorporating prior knowl-
edgeintoregularizedregression. Bioinformatics,37(4):514–521,2021.
Gary K Chen and John S Witte. Enriching the analysis of genomewide association studies with
hierarchicalmodeling. TheAmericanJournalofHumanGenetics,81(2):397–404,2007.
Noah Simon, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for cox’s
proportionalhazardsmodelviacoordinatedescent.Journalofstatisticalsoftware,39(5):1,2011.
NormanEBreslow. Contributiontodiscussionofpaperbydrcox. J.Roy.Statist.Soc.,Ser.B,34:
216–217,1972.
DavidRCox. Regressionmodelsandlife-tables. JournaloftheRoyalStatisticalSociety: SeriesB
(Methodological),34(2):187–202,1972.
Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear
modelsviacoordinatedescent. Journalofstatisticalsoftware,33(1):1,2010.
Ralf Bender, Thomas Augustin, and Maria Blettner. Generating survival times to simulate cox
proportionalhazardsmodels. Statisticsinmedicine,24(11):1713–1723,2005.
Frank E Harrell, Robert M Califf, David B Pryor, Kerry L Lee, and Robert A Rosati. Evaluating
theyieldofmedicaltests. Jama,247(18):2543–2546,1982.
Christina Curtis, Sohrab P Shah, Suet-Feung Chin, Gulisa Turashvili, Oscar M Rueda, Mark J
Dunning, Doug Speed, Andy G Lynch, Shamith Samarajiwa, Yinyin Yuan, et al. The genomic
and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486
(7403):346–352,2012.
Wei-YiCheng,Tai-HsienOuYang,andDimitrisAnastassiou. Developmentofaprognosticmodel
for breast cancer survival in an open challenge environment. Science translational medicine, 5
(181):181ra50–181ra50,2013a.
Wei-Yi Cheng, Tai-Hsien Ou Yang, and Dimitris Anastassiou. Biomolecular events in cancer
revealedbyattractormetagenes. PLoSComputBiol,9(2):e1002920,2013b.
Ashley G Rivenbark, Siobhan M O’Connor, and William B Coleman. Molecular and cellular
heterogeneity in breast cancer: challenges for personalized medicine. The American journal of
pathology,183(4):1113–1124,2013.
Blanca Homet Moreno, Giulia Parisi, Lidia Robert, and Antoni Ribas. Anti–pd-1 therapy in
melanoma. Seminarsinoncology,42(3):466–473,2015.
Tuba N Gide, Camelia Quek, Alexander M Menzies, Annie T Tasker, Ping Shang, Jeff Holst,
Jason Madore, Su Yin Lim, Rebecca Velickovic, Matthew Wongchenko, et al. Distinct immune
cell populations define response to anti-pd-1 monotherapy and anti-pd-1/anti-ctla-4 combined
therapy. Cancercell,35(2):238–255,2019.
Nadeem Riaz, Jonathan J Havel, Vladimir Makarov, Alexis Desrichard, Walter J Urba, Jennifer S
Sims, F Stephen Hodi, Salvador Martín-Algarra, Rajarsi Mandal, William H Sharfman, et al.
Tumor and microenvironment evolution during immunotherapy with nivolumab. Cell, 171(4):
934–949,2017.
Willy Hugo, Jesse M Zaretsky, Lu Sun, Chunying Song, Blanca Homet Moreno, Siwen Hu-
Lieskovan, Beata Berent-Maoz, Jia Pang, Bartosz Chmielowski, Grace Cherry, et al. Genomic
and transcriptomic features of response to anti-pd-1 therapy in metastatic melanoma. Cell, 165
(1):35–44,2016.
ArthurLiberzon,ChetBirger,HelgaThorvaldsdóttir,MahmoudGhandi,JillPMesirov,andPablo
Tamayo. The molecular signatures database hallmark gene set collection. Cell systems, 1(6):
417–425,2015.
NicolaiMeinshausen,LukasMeier,andPeterBühlmann.P-valuesforhigh-dimensionalregression.
Journal oftheAmericanStatistical Association,104(488):1671–1681,2009.
Rajen D Shah and Richard J Samworth. Variable selection with error control: another look at
stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
75(1):55–80,2013.
Noah Simon, Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A sparse-group lasso.
Journal ofcomputationalandgraphicalstatistics,22(2):231–245,2013.
BijayJassal,LisaMatthews,GuilhermeViteri,ChuqiaoGong,PascualLorente,AntonioFabregat,
KonstantinosSidiropoulos,JustinCook,MarcGillespie,RobinHaw,etal.Thereactomepathway
knowledgebase. Nucleicacidsresearch,48(D1):D498–D503,2020.
QingLi,NanLin,etal. Thebayesianelasticnet. Bayesiananalysis,5(1):151–170,2010.
PierreSimonLaplace. Memoirontheprobabilityofthecausesofevents. Statisticalscience,1(3):
364–378,1986.
HoaiAnLeThi,TPhamDinh,HoaiMinhLe,andXuanThanhVo. Dcapproximationapproaches
forsparseoptimization. EuropeanJournalofOperationalResearch,244(1):26–46,2015.
David R Hunter and Kenneth Lange. A tutorial on mm algorithms. The American Statistician, 58
(1):30–37,2004.
Xiaoming Huo and Xuelei Ni. When do stepwise algorithms meet subset selection criteria? The
AnnalsofStatistics,pages870–887,2007.
ThomasBlumensathandMikeEDavies. Iterativethresholdingforsparseapproximations. Journal
ofFourieranalysisandApplications,14(5-6):629–654,2008.
XueZhangandXiaoqunZhang. Anewproximaliterativehardthresholdingmethodwithextrapo-
lationforl0minimization. JournalofScientificComputing,79(2):809–826,2019.
Fan Yang, Yi Shen, and Zhi Song Liu. The proximal alternating iterative hard thresholding
method for l0 minimization, with complexity o (1/k). Journal of Computational and Applied
Mathematics,311:115–129,2017.
KevinLKeys,HuaZhou,andKennethLange. Proximaldistancealgorithms: Theoryandpractice.
J.Mach. Learn.Res.,20(66):1–38,2019.
TrevorParkandGeorgeCasella.Thebayesianlasso.JournaloftheAmericanStatisticalAssociation,
103(482):681–686,2008.
Bradley Efron. Bootstrap methods: another look at the jackknife. The Annals of Statistics, 7(1):
1–26,1979.
Larry Wasserman and Kathryn Roeder. High dimensional variable selection. Annals of statistics,
37(5A):2178,2009.
Appendices
A Appendix for chapter 2
A.1 Computation of diagonal elements of the weight matrix

The diagonal elements of the weight matrix W, the Hessian of the log of Cox's partial likelihood, have the form
\[
w_i = \sum_{k \in C_i} \frac{d_k\, e^{\tilde{\eta}_i}}{\sum_{j \in R_k} e^{\tilde{\eta}_j}} \;-\; \sum_{k \in C_i} \frac{d_k\, (e^{\tilde{\eta}_i})^2}{\big(\sum_{j \in R_k} e^{\tilde{\eta}_j}\big)^2}
\]
The two sums, over \(k \in C_i\) and over \(j \in R_k\), each have up to n elements, hence this is an \(O(n^2)\) computation. However, if we notice that the difference between \(R_k\) and \(R_{k+1}\) is the set of observations that are in \(R_k\) but not in \(R_{k+1}\), i.e., \(\{j : t_k \le y_j < t_{k+1}\}\), then, provided that the observed times y are sorted in non-decreasing order, \(\sum_{j \in R_k} e^{\tilde{\eta}_j}\) can be calculated as a cumulative sum:
\[
\sum_{j \in R_k} e^{\tilde{\eta}_j} = \sum_{j \in R_{k+1}} e^{\tilde{\eta}_j} + \sum_{j \in R_k,\; j \notin R_{k+1}} e^{\tilde{\eta}_j}
\]
The same cumulative sum idea can be applied to the sums over \(C_i\):
\[
\sum_{k \in C_{i+1}} \frac{d_k\, e^{\tilde{\eta}_i}}{\sum_{j \in R_k} e^{\tilde{\eta}_j}} = \sum_{k \in C_i} \frac{d_k\, e^{\tilde{\eta}_i}}{\sum_{j \in R_k} e^{\tilde{\eta}_j}} + \sum_{k \in C_{i+1},\; k \notin C_i} \frac{d_k\, e^{\tilde{\eta}_i}}{\sum_{j \in R_k} e^{\tilde{\eta}_j}}
\]
\[
\sum_{k \in C_{i+1}} \frac{d_k\, (e^{\tilde{\eta}_i})^2}{\big(\sum_{j \in R_k} e^{\tilde{\eta}_j}\big)^2} = \sum_{k \in C_i} \frac{d_k\, (e^{\tilde{\eta}_i})^2}{\big(\sum_{j \in R_k} e^{\tilde{\eta}_j}\big)^2} + \sum_{k \in C_{i+1},\; k \notin C_i} \frac{d_k\, (e^{\tilde{\eta}_i})^2}{\big(\sum_{j \in R_k} e^{\tilde{\eta}_j}\big)^2}
\]
The equations above calculate each sum only once and add terms at each sample index, which brings the computational cost down to linear time, \(O(n)\). Counting the sorting of the observed times as a data pre-processing step, the overall computation time for the weights is \(O(n \log n)\).
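A compact R sketch of this linear-time computation is given below. It assumes the observed times are already sorted in non-decreasing order and treats each event row as contributing one death (the Breslow-type handling of ties used elsewhere in this thesis); the function and variable names (cox_weights, time, status, eta) are our own.

```r
# O(n) computation of the Cox partial likelihood weights via cumulative sums.
# time:   observed times, sorted in non-decreasing order
# status: event indicator (1 = event, 0 = censored)
# eta:    current linear predictor (eta tilde)
cox_weights <- function(time, status, eta) {
  exp_eta <- exp(eta)
  # Risk-set sums: sum_{j in R_k} exp(eta_j), a reverse cumulative sum
  risk_sum <- rev(cumsum(rev(exp_eta)))
  # Running sums over event times t_k <= y_i of d_k / S_k and d_k / S_k^2
  inv1 <- cumsum(status / risk_sum)
  inv2 <- cumsum(status / risk_sum^2)
  # w_i = exp(eta_i) * sum_{k in C_i} d_k/S_k - exp(eta_i)^2 * sum_{k in C_i} d_k/S_k^2
  exp_eta * inv1 - exp_eta^2 * inv2
}
```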
A.2 Solving regularized weighted least squares with cyclic coordinate descent

To solve the regularized weighted least squares problem, equation (2.9), we first compute the gradient at the current estimates \((\tilde{\gamma}, \tilde{\alpha})\). Let \(\gamma_j\) be the j-th coordinate of γ, \(1 \le j \le p\), and \(\alpha_k\) the k-th coordinate of α, \(1 \le k \le q\). The gradient of equation (2.9) with respect to \(\gamma_j\) is
\[
-\frac{1}{n}\sum_{i=1}^{n} w_i x_{ij}\big(y'_i - \gamma^T x_i - \alpha^T (XZ)_i\big) + \lambda_1 \gamma_j
\]
Setting the gradient with respect to \(\gamma_j\) to 0, the coordinate-wise update for \(\gamma_j\) has the form
\[
\gamma_j = \frac{\frac{1}{n}\sum_{i=1}^{n} w_i x_{ij}\, r_i^{(j)}}{\frac{1}{n}\sum_{i=1}^{n} w_i x_{ij}^2 + \lambda_1}
\]
where \(r_i^{(j)} = y'_i - \sum_{l \ne j}\tilde{\gamma}_l x_{il} - \tilde{\alpha}^T (XZ)_i\) is the partial residual excluding the contribution of \(x_{ij}\). As for \(\alpha_k\), if \(\tilde{\alpha}_k > 0\), the gradient of equation (2.9) with respect to \(\alpha_k\) is
\[
-\frac{1}{n}\sum_{i=1}^{n} w_i (XZ)_{ik}\big(y'_i - \gamma^T x_i - \alpha^T (XZ)_i\big) + \lambda_2
\]
A similar expression holds if \(\tilde{\alpha}_k < 0\), and \(\tilde{\alpha}_k = 0\) is treated separately. Setting the gradient with respect to \(\alpha_k\) to 0, the coordinate-wise update for \(\alpha_k\) has the form
\[
\alpha_k = \frac{S\Big(\frac{1}{n}\sum_{i=1}^{n} w_i (XZ)_{ik}\, s_i^{(k)},\; \lambda_2\Big)}{\frac{1}{n}\sum_{i=1}^{n} w_i (XZ)_{ik}^2}
\]
where \(s_i^{(k)} = y'_i - \tilde{\gamma}^T x_i - \sum_{l \ne k}\tilde{\alpha}_l (XZ)_{il}\) is the partial residual excluding the contribution of \((XZ)_{ik}\), and \(S(z, \lambda)\) is the soft-thresholding operator:
\[
S(z, \lambda) = \operatorname{sign}(z)\,(|z| - \lambda)_+ =
\begin{cases}
z - \lambda & \text{if } z > 0 \text{ and } \lambda < |z| \\
z + \lambda & \text{if } z < 0 \text{ and } \lambda < |z| \\
0 & \text{if } \lambda \ge |z|
\end{cases}
\]
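For illustration, one full cycle of these coordinate-wise updates can be written in R as below. This is a didactic sketch of the updates above, not the implementation used in the ‘xrnet’ package; X, XZ, w, y_prime, lambda1, and lambda2 are assumed to come from the current quadratic approximation, with a ridge penalty on γ and a lasso penalty on α.

```r
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# One cycle of coordinate descent for the regularized weighted least squares
# objective: ridge penalty lambda1 on gamma, lasso penalty lambda2 on alpha.
cd_cycle <- function(X, XZ, w, y_prime, gamma, alpha, lambda1, lambda2) {
  n <- nrow(X)
  eta <- as.vector(X %*% gamma + XZ %*% alpha)   # current fitted values
  for (j in seq_len(ncol(X))) {                  # update gamma_j (ridge)
    r_j <- y_prime - eta + X[, j] * gamma[j]     # partial residual r_i^(j)
    num <- sum(w * X[, j] * r_j) / n
    den <- sum(w * X[, j]^2) / n + lambda1
    new_gj <- num / den
    eta <- eta + X[, j] * (new_gj - gamma[j])
    gamma[j] <- new_gj
  }
  for (k in seq_len(ncol(XZ))) {                 # update alpha_k (lasso)
    s_k <- y_prime - eta + XZ[, k] * alpha[k]    # partial residual s_i^(k)
    num <- soft_threshold(sum(w * XZ[, k] * s_k) / n, lambda2)
    den <- sum(w * XZ[, k]^2) / n
    new_ak <- num / den
    eta <- eta + XZ[, k] * (new_ak - alpha[k])
    alpha[k] <- new_ak
  }
  list(gamma = gamma, alpha = alpha)
}
```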
A.3 Example code for the R package ‘xrnet’

The regularized hierarchical model of chapter 2 is implemented in the R package ‘xrnet’. The package is on CRAN and can be installed like any other R package. Up to the time of writing this thesis, the CRAN version of ‘xrnet’ supports continuous and binary outcomes. The survival module is in the development branch of the GitHub repository. It can be installed with the command

devtools::install_github("USCbiostats/xrnet", ref = "development")

We give example code for using ‘xrnet’ with survival outcomes. As a minimal example, the function ‘xrnet’ fits the regularized hierarchical model; the data and the type of model to be fitted must be provided.

library(xrnet)
fit <- xrnet(x = x, y = y, external = z, family = "cox")

The argument x is the data matrix, y is the outcome, and external is the meta-feature data matrix. If external data are not provided, a standard regularized regression is performed. family = "cox" indicates the type of model, in this case Cox's proportional hazards model. To specify the regularization type and regularization path, the helper function ‘define_penalty’ can be used.
• Regularization type
  – 0 := ridge
  – 1 := lasso
  – (0, 1) := elastic net
• Regularization path
  – Number of values in the sequence for each of λ₁ or λ₂
  – Ratio λ_min / λ_max
• User defined sequence of regularization parameters
The arguments ‘penalty_main’ and ‘penalty_external’ are used to specify the above regularization options for the features in the data matrix X and for the meta-features in Z, respectively. For example, suppose we apply ridge to the features and lasso to the meta-features, with each penalty parameter sequence having 20 values. The code is as follows

fit <- xrnet(x = x, y = y, external = z, family = "cox",
             penalty_main = define_penalty(0, num_penalty = 20),
             penalty_external = define_penalty(1, num_penalty = 20))
The helper functions ‘define_ridge’, ‘define_enet’, and ‘define_lasso’ are also available to directly specify the type of regularization.
In order to tune the hyperparameters λ₁ and λ₂, cross-validation is used. In the ‘xrnet’ package, the function ‘tune_xrnet’ performs cross-validation:

cvfit <- tune_xrnet(x = x.train, y = y.train, external = z,
                    family = "cox",
                    penalty_main = define_ridge(),
                    penalty_external = define_lasso(),
                    loss = "c-index", nfolds = 5)
The example code performs 5-fold cross-validation (nfolds = 5), and the validation metric is the C-index (Harrell's concordance index). The folds can also be specified by the user with the argument ‘foldid’. To predict and evaluate prediction performance on a hold-out test set with the cross-validated model, use the following code

library(glmnet)  # for function Cindex
pred <- predict(cvfit, newdata = x.test)
test_cindex <- Cindex(pred, y.test)
coefficient <- predict(cvfit, type = "coefficients")
A.4 More simulation results

We conducted 6 experiments for the regularized hierarchical Cox's regression (‘xrnet’). The base case scenario is meta-feature signal-to-noise ratio SNR = 2, sample size N = 100, number of features p = 200, number of meta-features q = 50, theoretical/population concordance index 0.8, and data matrix X correlation ρ = 0.5. In each experiment, we vary one of the 6 parameters and hold the others fixed. Prediction performance and meta-feature selection accuracy are shown for each experiment.

Figure 5.1: Simulation: ‘xrnet’ prediction performance (i)
Figure 5.2: Simulation: ‘xrnet’ prediction performance (ii)
Figure 5.3: Simulation: ‘xrnet’ prediction performance (iii)
Figure 5.4: Simulation: ‘xrnet’ meta-feature selection (i)
Figure 5.5: Simulation: ‘xrnet’ meta-feature selection (ii)
Figure 5.6: Simulation: ‘xrnet’ meta-feature selection (iii)
B Appendix for chapter 3

B.1 More simulation results

We conducted 4 experiments for the meta-feature guided regularized regression model. The base case scenario is high meta-feature informativeness, sample size N = 100, number of features p = 200, and number of meta-features q = 10. In each experiment, we vary one of the 4 parameters and hold the others fixed. We have already seen the prediction performance results (Figure 3.2); here we show feature selection accuracy and the number of features selected.

Figure 5.7: Simulation: meta guided feature selection accuracy (i)
Figure 5.8: Simulation: meta guided feature selection accuracy (ii)
Figure 5.9: Simulation: meta guided number of selected features (i)
Figure 5.10: Simulation: meta guided number of selected features (ii)
Abstract
Advances in high-throughput technologies have enabled the genomewide interrogation of genetic variation, the transcriptome, and the epigenome. Translating these advances into personalized health care requires the ability to predict disease outcomes based on individual omic profiles. In parallel, extensive research in genomics has produced a vast amount of annotations on the function, structure and regulation of the genome. This explosion of genomic information has made genomics-based predictive modeling extremely appealing but also challenging in terms of integrating multiple types of genomics data. The purpose of this thesis is to develop modeling techniques that can integrate the diverse genomics information to achieve improved prediction performance. ❧ In chapter 1, we review existing predictive modeling techniques and introduce the problem of integrating annotations and other kinds of external information. We discuss the challenges presented by the high dimensionality of genomics data and the best current tools based on regularized regression. We present detailed information about ridge regression, the lasso, the elastic net, and selected nonconvex regularized regression. Since, in addition to prediction, model interpretation is also of great interest, we discuss the feature selection properties provided by sparse regularized regression. ❧ Two predictive models for integrating external information into regularized regression for survival outcomes are proposed in chapters 2 and 3. In chapter 2, we present a regularized hierarchical model that integrates meta-features linearly by modeling the regression coefficients to depend on external information through their means. The method developed in chapter 3 allows for differential penalization of features guided by external information. The method can also be viewed as hierarchical, but one where the regression coefficients depend on the external information through their variance. ❧ We thus provide two different alternatives for integrating external data into predictive modeling with survival outcomes. We showed in simulations that with informative external data, prediction performance can improve considerably, and that the quality of feature/meta-feature selection also improves. The proposed models are also applied to two high dimensional gene expression datasets: a study of gene expression signatures for breast cancer survival, and an anti-PD1 immunotherapy predictive biomarker study for melanoma survival. ❧ Chapter 4 introduces a novel algorithm to solve L₀-regularized least squares based on the proximal distance algorithm. This extends convex regularization methods like the lasso and ridge to nonconvex approaches for sparser models and less biased coefficient estimation. ❧ Perspectives about data integration, high dimensionality, feature selection, and statistical inference are discussed in chapter 5, where we also propose potential future work.