COMPARISON OF COX REGRESSION AND MACHINE LEARNING
METHODS FOR SURVIVAL ANALYSIS OF PROSTATE CANCER
by
Chelsea Haosin Lee
A Thesis Presented to the
FACULTY OF THE USC KECK SCHOOL OF MEDICINE
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
BIOSTATISTICS
May 2020
Copyright 2020 Chelsea Haosin Lee
Table of Contents
Acknowledgements
List of Tables
List of Figures
Abstract
1 Introduction
2 Methods
2.1 Dataset Preparation
2.1.1 Case Identification
2.1.2 Study Variables
2.1.3 Missing Data
2.2 Cox Proportional Hazards (PH) Regression
2.3 Machine Learning
2.3.1 Regularized Cox PH Regressions
2.3.2 Random Survival Forest
2.3.3 Boosting
2.4 Model Performance and Selection
2.5 Interpretable Package
3 Results
3.1 Dataset Characteristics
3.2 Model Tuning
3.3 Model Results
4 Discussion
4.1 Survival Analysis: Traditional Statistics vs Machine Learning
4.2 Effects of Predictors
4.3 User Experience Difficulties
5 Conclusion
References
Acknowledgements
I wish to express my sincere appreciation to my supervisor, Professor Juan Pablo Lewinger,
for the guidance, insightful and concrete suggestions, and encouragement. I would also
like to show gratitude to Professor Mariana Stern and Professor Lihua Liu, for lending their
expertise and time. I also thank Professor Wendy Mack and Renee Stanley for the mentorship
and providing me support throughout my graduate school career. Lastly, I thank all of my
friends who have been cheering me on the sidelines.
List of Tables
3.1 Distribution of values by race group
3.2 Model Performances on Validation Set (n=11,290)
3.3 Model Performances on Testing Set (n=11,291)
3.4 Model Coefficients
List of Figures
3.1 5-fold CV Loss Plots on Regularized Cox PH Regressions
3.2 Random Survival Forest Out-of-Bag Error Plot and Feature Importances
3.3 mboost Loss Plot
3.4 Testing Error Variations
3.5 Predicted survival probabilities against time in months - Race and Cancer Stage
3.6 Predicted survival probabilities against time in months - Treatment
3.7 Predicted survival probabilities against time in months - Age and Node Involvement
Abstract
This thesis seeks to develop a risk prediction model for estimating time from diagnosis to
mortality from Prostate Cancer using various machine learning methods, including Random
Survival Forest, in comparison to the traditional Cox Proportional Hazards (PH) approach.
Furthermore, there has been evidence of health disparities among Latinos alone, and so we
analyzed several demographic and clinical features among non-Latino Whites (NLW),
non-Latino Blacks (NLB), and Latinos by nativity. Features also included two new variables
that have not been previously studied for survival from Prostate Cancer: religious status
and longest-held occupation. A total of 56,451 records from the California Cancer Registry from
2004-2006 were analyzed, with 5,467 deaths. Foreign-born Latinos had longer survival than
NLW (Cox PH stratified on cancer stage: HR=0.84, p<0.05; Unstratified Cox PH: HR=0.81,
p<0.05), but Latinos with unknown birthplace had the best overall survival prospects (Cox
PH stratified on cancer stage: HR=0.24, p<0.05; Unstratified Cox PH: HR=0.23, p<0.05).
Only cancer stage, treatment type, age, regional lymph node involvement, and race/ethnicity
were important variables that contributed to survival predictions. Unstratified Cox PH
achieved a C-index of 0.85 on the testing set, performing similarly to Random Survival
Forest, which had a C-index of 0.87. Cox PH serves as a reliable model for predicting survival
time and would be the preferred model to maintain interpretability benefits for clinicians
and relevant stakeholders.
Chapter 1 Introduction
Efforts have been made to provide an accurate understanding of survival duration following a
life-changing disease diagnosis such as cancer. Knowing survival time can affect the ability to
deliver a good prognosis and also medical decision-making, as there are some high-risk
treatment options that might be worthwhile if the patient can survive long enough to experience
a better quality of life (Mackillop & Quirt, 1997).
Survival data, or time-to-event data, include observations that are censored, meaning the
event of interest is not observed within the study period. Reasons for censoring include, but
are not limited to, loss to follow-up or a follow-up period too short to observe the
event. Casting this as a classification problem would lose the ability to predict time-to-event
and would need separate classification models trained on each time point when events were
observed. Linear regression may fix this issue but would omit censored observations, leading
to skewed data.
Parametric and semi-parametric methods, commonly known as survival methods, have
been developed to analyze time-to-event data and can appropriately handle censoring. One
of these is the popular Cox Proportional Hazards (PH) model, which has readily interpretable
coefficients that quantify hazard ratios (Bradburn, Clark, Love, & Altman, 2003;
Cox, 1972).
However, these parametric restraints may constrain these models from better understanding
survival time. Machine Learning (ML) algorithms can help fill in the gap, as they have
been reported to outperform traditional statistical methods in various areas. Several
applications in healthcare have already been made (Wiens & Shenoy, 2017; Mozaffari-Kermani,
Sur-Kolay, Raghunathan, & Jha, 2014). One of the major benefits is the ability to capture
non-linearities between the predictors and the outcome of interest. Several ML implementations
adapted for survival data have shown success in predicting mortality of patients with
cardiac arrhythmia (Miao, Cai, Zhang, Li, & Zhang, 2015) or identifying genes for pancreatic
survival (Wu, Gong, & Clarke, 2011).
On the other hand, ML brings an increase in complexity, which may result in improved
predictions but at the cost of interpretability. These so-called "black box" models make it
difficult for physicians and other relevant stakeholders to draw meaningful insights.
Several interpretability packages were built to alleviate this issue by creating visualizations
that illuminate how the models arrive at their predictions.
One such cancer with a major public health burden is Prostate Cancer (PCa), the most
diagnosed type of cancer and the second major cause of death from cancer in American men
(Siegel, Miller, & Jemal, 2019). Although the 5-year survival rate for men with localized PCa
is almost 100%, the 5-year survival rate for men whose PCa has spread is 30% (Siegel et al.,
2019). Survival time for those in distant stages of their cancer may be of prime importance.
There exist disparities among minority groups; non-Latino Blacks (NLB) and non-Latino
Whites (NLW) have the highest incidence rates, while Hispanics are less likely to be diagnosed
with PCa and have lower mortality rates (ACS, 2018). However, there is evidence
suggesting that such differences are present among Hispanics as well. Overall, US Latinos
have higher incidence but lower mortality rates than their foreign counterparts across Latin
America (Ferlay et al., 2015). Schupp and colleagues also found that survival patterns differed
by nativity among Hispanic men in California (Schupp, Press, & Gomez, 2014). With
Hispanics being the largest minority group in the US, totaling nearly 60 million in 2017, there is
also considerable heterogeneity among the various ethnic groups that exist among Hispanics alone.
These give rise to diverse incidence and mortality rates (Noe-Bustamante & Flores, 2019;
Pinheiro et al., 2017). How social and clinical determinants relate to PCa mortality
among NLB, NLW, and Latinos born in or outside of the US is not clear. Moreover, not
much is known about whether ML is superior to its traditional counterpart concerning time
to death from PCa.
This thesis seeks to investigate whether ML methods could further improve on traditional
survival analysis techniques in explaining time to PCa fatality among US-born (USB) and
Foreign-born (FB) Latinos compared to NLB and NLW using data from the California Cancer
Registry (CCR); CCR is the statewide surveillance program monitoring cancer occurrences
in California, where the largest proportion of US Latinos currently resides. This thesis will also
implement an interpretability package called survxai to better comprehend the inner workings
of these complex models (Grudziaz, Gosiewska, & Biecek, 2018). Lastly, it will assess
known and potentially new additional determinants of PCa survival.
Chapter 2 Methods
2.1 Dataset Preparation
2.1.1 Case Identification
Data were drawn from the California Cancer Registry (CCR) January 2019 research file, where
only primary and malignant PCa cases were identified (Surveillance, Epidemiology, and End
Results Program (SEER) Site Recode 2080 based on the site code C619 and eligible histology
codes as stated in the International Classification of Diseases for Oncology, Third Edition
(ICD-O-3)) (Fritz et al., 2001). The event of interest was defined as time from diagnosis
of PCa to death from PCa as recorded on the death certificate (ICD Ninth Revision code
beginning with 185 for cases 1995-1998 and ICD Tenth Revision code beginning with C61
for 1999 and beyond) or the end of the study period, December 31st, 2018. Patients who
died of other causes were censored.
Data were filtered to those who were identified as non-Latino Whites (NLW), non-Latino
Blacks (NLB), and Hispanics based on both self-reported race information by CCR and
the North American Association of Central Cancer Registries (NAACCR) Hispanic Identification
Algorithm (NAACCR Race and Ethnicity Work Group, 2011). Hispanics were further
classified as US-born (USB) Latinos or foreign-born (FB) Latinos by birthplace information.
If birthplace was missing, those issued a social security number (SSN) within 20 years
of a known birth date were labeled as USB Latino. Likewise, if the SSN was issued at least 21
years from a known birth date, then the patient was labeled as FB Latino.
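The nativity rule above can be sketched in a few lines. This is an illustrative Python sketch rather than the STATA/R code used for this thesis. The function name, the coding of birthplace as the string "US", the use of whole years between birth and SSN issue, and the "Unknown" fallback are all assumptions made for the example.

```python
from typing import Optional


def classify_latino_nativity(birthplace: Optional[str],
                             years_birth_to_ssn: Optional[int]) -> str:
    """Label a Hispanic patient as 'USB Latino', 'FB Latino', or 'Unknown'.

    Assumptions (not from the registry): birthplace is coded "US" for
    US-born patients, and years_birth_to_ssn is a whole number of years.
    """
    # Recorded birthplace takes priority over the SSN-based rule.
    if birthplace is not None:
        return "USB Latino" if birthplace == "US" else "FB Latino"
    # Without birthplace and without a usable SSN date, nativity stays unknown.
    if years_birth_to_ssn is None:
        return "Unknown"
    # SSN issued within 20 years of birth: treated as US-born.
    if years_birth_to_ssn <= 20:
        return "USB Latino"
    # SSN issued 21 or more years after birth: treated as foreign-born.
    return "FB Latino"
```

For example, a patient with no recorded birthplace whose SSN was issued 25 years after birth would be labeled FB Latino under this sketch.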
Data cleaning was split between STATA (StataCorp, 2017) and R (R Core Team, 2019).
All ML was performed in R.
2.1.2 Study Variables
Age at diagnosis (in years) was drawn from CCR. Socioeconomic status (SES) (low, middle,
high) was obtained from two sources: Census 2000 results for patients diagnosed between
1996 and 2005, and the American Community Survey (ACS) 2007-2011 data for those diagnosed
in 2006 and beyond. We also considered PSA results at diagnosis (normal, elevated,
and borderline), payer type (not insured, Managed Care, Medicaid, Medicare, and other),
marital status (single, married, and separated/divorced/widowed), and positive lymph node
involvement (N0 for no cancer in lymph nodes and N1 for lymph nodes containing cancer).
Stage of disease at diagnosis was considered using SEER Summary Stage 2000 (in situ + localized,
regional, and remote). The summary stage was used instead of the four stages of disease
defined by the SEER-modified American Joint Committee on Cancer Staging System because it
had fewer missing values. In situ and localized stages were combined due to low counts of in situ
cancers.
A composite treatment variable was made (none, hormone therapy only, surgery only,
chemotherapy or radiation only, and other types of therapies as defined by CCR +
immunotherapy + radiation therapy with surgery).
Religion was also considered, as spirituality was found to correlate with
better quality of life among Latino men with PCa (Maliski, Husain, Connor, & Litwin,
2012), but little is known about whether spirituality impacts survival time. Hence, religion
was dichotomized (none/agnostic/atheist and religious).
Occupation longest held by the patient was also considered, as several studies have found
that occupations exposed to pesticides, such as farming, or white-collar work were associated
with a higher chance of PCa death (Krstev & Knutsson, 2019; Sritharan et al., 2018).
Therefore, we included occupation as a categorical variable (executives and white collars,
farmers/machine operators and similar, other, or not working).
Because the occupation variable had very high and varied frequencies of missingness,
the data were filtered to cases diagnosed between January 1st, 2004 and December 31st, 2006,
as these years had the least missingness.
2.1.3 Missing Data
Those with missing survival time, either from lack of a diagnosis date or date of death, or
with a date of death recorded before the diagnosis date, were dropped from the analysis. Patients
whose survival status was missing were also filtered out because a clear status was needed.
For all other variables, records stated as unknown in the registry or left missing were
grouped into a general 'Unknown' category for each study variable.
2.2 Cox Proportional Hazards (PH) Regression
The Cox PH model is a semi-parametric method that links the features of an observation
and its corresponding risk based on time. Represented by the baseline hazard function $h_0(t)$ for time
$t$ and for $N$ total observations denoted with $i = 1, \dots, N$, Cox PH can be written as:
$$h(t; x_i) = h_0(t) \exp(\beta^\top x_i) \tag{2.1}$$
where $x_i$ is the covariate vector for observation $i$ and $\beta$ is the coefficients vector for $p$
covariates. Because survival ties are present in this dataset, Cox PH maximizes the Breslow
approximation for the partial likelihood to estimate $\beta$ (Equation 2.2) (Breslow, 1972).
For $J$ total distinct event times:
$t_{(j)}$ = time at time point $j$
$R_{(j)}$ = set of observations at risk at $t_{(j)}$
$D_{(j)}$ = set of observations whose events were observed at $t_{(j)}$
$d_{(j)}$ = number of observations that fall in $D_{(j)}$
$$L(\beta) = \prod_{j=1}^{J} \frac{\exp\left(\sum_{i \in D_{(j)}} \beta^\top x_i\right)}{\left(\sum_{i \in R_{(j)}} \exp(\beta^\top x_i)\right)^{d_{(j)}}} \tag{2.2}$$
The baseline hazard function $h_0(t)$ assumes no distribution, allowing the Cox PH to be
more flexible than fully parametric models. Furthermore, it can handle multiple covariates
simultaneously unlike other popular non-parametric methods, such as the Kaplan-Meier
estimator, thereby opening additional possibilities for more accurate predictions. Thus, the
Cox PH model will be used as the base model of comparison between variants of the Cox
PH and other ML approaches.
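To make the Breslow partial likelihood of Equation 2.2 concrete, the following is a minimal Python sketch that evaluates its logarithm for a given coefficient vector. It is an illustration only, not the R routine used in this thesis. The function name and the plain-list data layout (times, 0/1 event indicators, one covariate row per observation) are assumptions for the example.

```python
import math
from collections import defaultdict


def breslow_log_partial_likelihood(times, events, X, beta):
    """Log of Equation 2.2: Cox partial likelihood with Breslow's tie handling."""
    # Linear predictor beta^T x_i for every observation.
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    # D_(j): indices of observed events, grouped by distinct event time t_(j).
    event_sets = defaultdict(list)
    for i, (t, e) in enumerate(zip(times, events)):
        if e == 1:
            event_sets[t].append(i)
    loglik = 0.0
    for t_j, d_set in event_sets.items():
        # R_(j): every observation still under follow-up at t_(j).
        risk_sum = sum(math.exp(eta[i]) for i in range(len(times)) if times[i] >= t_j)
        # Numerator sums eta over D_(j); denominator is raised to the power d_(j).
        loglik += sum(eta[i] for i in d_set) - len(d_set) * math.log(risk_sum)
    return loglik
```

With all coefficients at zero and three untied events at times 1, 2, and 3, the risk sets have sizes 3, 2, and 1, so the sketch returns $-\log 6$.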
A key assumption of the model is the Proportional Hazards (PH) assumption, which
states that hazard ratios must stay constant over time. Both the Schoenfeld residuals and
graphical plots of residuals against time for each covariate will be analyzed for violation of
the PH assumption (Grambsch & Therneau, 1994). If the correlation between Schoenfeld
residuals and time is significant (p<0.05) and the residual plots reveal substantial changes
throughout time, the Cox PH model will be stratified for those covariates.
However, the models that follow cannot be tested for proportional hazards, since one model
is fully non-parametric (Random Survival Forest) and the remaining models have biased
coefficients, which makes computation of the Schoenfeld residuals nonsensical.
2.3 Machine Learning
2.3.1 Regularized Cox PH Regressions
The Cox PH model can be extended via regularization, a technique that produces shrunken
coefficient estimates and/or selects a subset of variables from a large set while reducing the overall
error (Simon, Friedman, Hastie, & Tibshirani, 2011). Regularization adds a shrinkage penalty
to the model's loss, controlled by a non-negative tunable parameter usually
denoted as $\lambda$, that can lower the variance of the estimators by shrinking model coefficients
but at the cost of increasing bias. However, the reduction of the former can offset the latter,
resulting in lower overall error. A value of $\lambda = 0$ indicates no penalization, maintaining the model's
original full complexity. Higher values of $\lambda$ shrink the coefficients more strongly.
A very high value of $\lambda$ can shrink every coefficient to (or near) zero, leaving a model with no covariate effects.
Because signals from variables may vary by data, dierent shrinkage techniques will be
used leading to three variants of the regularized Cox model: (1) Ridge Cox PH (2) LASSO
Cox PH (3) Elastic Net Cox PH. All three models use the Breslow estimator for ties.
Ridge Cox PH
Ridge seeks to minimize the negative partial log-likelihood subject to a penalty on the $\ell_2$ norm of the $p$ model
coefficients (Hoerl & Kennard, 1988). Coefficients are shrunk toward zero but can never truly be equal
to zero, and thus Ridge works well when the outcome of interest depends on a large
combination of variables with similar signals.
$$\hat{\beta}_{\text{Ridge}} = \operatorname*{argmin}_{\beta} \left( -\log(L(\beta)) + \lambda \sum_{j=1}^{p} \beta_j^2 \right) \tag{2.3}$$
Lasso Cox PH
The LASSO (least absolute shrinkage and selection operator) technique minimizes the same
loss function but subject to a penalty on the $\ell_1$ norm of the model coefficients (Tibshirani, 1996). Switching
from a squared penalty to an absolute-value penalty allows LASSO to shrink
coefficients exactly to zero, effectively performing feature selection. LASSO typically performs best on datasets
with relatively few, strong predictors.
$$\hat{\beta}_{\text{Lasso}} = \operatorname*{argmin}_{\beta} \left( -\log(L(\beta)) + \lambda \sum_{j=1}^{p} |\beta_j| \right) \tag{2.4}$$
Elastic Net Cox PH
Elastic Net combines the advantages of both Ridge and LASSO (Zou & Hastie, 2005). When
variables are correlated, Elastic Net typically outperforms the rest.
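$$\hat{\beta}_{\text{Elastic Net}} = \operatorname*{argmin}_{\beta} \left( -\log(L(\beta)) + \lambda \left[ \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right] \right) \tag{2.5}$$
Here, let $\alpha = 0.5$.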
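The three penalty terms in Equations 2.3 to 2.5 differ only in how the coefficients enter the sum. The short Python sketch below computes each penalty exactly as written in those equations; it is illustrative only and is not the R code used in this thesis. The function names are hypothetical.

```python
def ridge_penalty(beta, lam):
    """lambda * sum of squared coefficients (penalty term of Equation 2.3)."""
    return lam * sum(b * b for b in beta)


def lasso_penalty(beta, lam):
    """lambda * sum of absolute coefficients (penalty term of Equation 2.4)."""
    return lam * sum(abs(b) for b in beta)


def elastic_net_penalty(beta, lam, alpha=0.5):
    """Weighted mix of both penalties (penalty term of Equation 2.5)."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * (alpha * l1 + (1 - alpha) * l2)
```

Setting alpha to 1 recovers the LASSO penalty and setting it to 0 recovers the Ridge penalty, so the value 0.5 used here weights the two equally.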
2.3.2 Random Survival Forest
Ensemble methods are another popular machine learning tool; they use a collection of
statistical learning models together to form one final predictive model in the hopes of having
better prediction (Dietterich, 2000).
One such building block model is a decision tree, a type of non-parametric model that is a
series of splitting rules that best divides observations into groups of similar outcomes based
upon a criterion (e.g. mean squared error) (James, Witten, Hastie, & Tibshirani, 2013).
An ensemble of decision trees leads to a popular method called Random Forest (Breiman,
2001). Instead of using one tree to make predictions, Random Forest uses bagging, where b
bootstrapped samples are drawn from the data and separate decision trees are trained on
each of these bootstrapped datasets (Breiman, 1996). The predictions are then averaged
together, which reduces variance in the prediction error. Taking one step further in reducing
the overall error, traditional Random Forest decorrelates the trees by having each split of
a decision tree consider only a random subset of predictors. This allows for more robust
predictions and thereby lowers the variance.
Bootstrapping leads to an average of about 37% of the observations being left out of each
bootstrapped sample. These out-of-bag (OOB) data are not used to build that tree and can be
considered for validation. Therefore, predictions can be made from the average predicted
outcome of either in-bag models or out-of-bag (OOB) models.
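The 37% figure follows from sampling with replacement. As a brief worked step, assuming a bootstrap sample of the same size $n$ as the original data, each draw misses a given observation with probability $1 - 1/n$, so:

```latex
\[
P(\text{observation } i \text{ is out-of-bag})
  = \left(1 - \frac{1}{n}\right)^{n}
  \;\xrightarrow{\,n \to \infty\,}\; e^{-1} \approx 0.368
\]
```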
Because Random Forest relaxes the assumption of proportional hazards and takes advantage
of bagging and decorrelating properties, a survival version called Random Survival
Forest (RSF) will be used (Ishwaran, Kogalur, Blackstone, Lauer, et al., 2008). This alternative
model utilizes the Random Forest structure but is tailored instead for censored data.
At each split of the tree, the covariate and split value that maximize the log-rank test statistic
between the two groups of data they produce are selected (Mantel, 1966). High
values of log-rank indicate more survival differences and hence better separation of survival
outcomes. For each of the terminal nodes in all b decision trees that were made on each
bootstrapped sample, a cumulative hazard function (CHF) and survival function are built
using Nelson-Aalen and Kaplan-Meier estimators, respectively. If there are n terminal nodes
in a tree, then there will be n CHF and survival estimators (i.e., all patients in a particular
terminal node of a tree will have the same cumulative hazard and survival estimator that
was made from the data in the node itself).
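As an illustration of the estimators built inside each terminal node, the Python sketch below computes the Kaplan-Meier survival function and the Nelson-Aalen cumulative hazard from the observations in a single node. It is a sketch under simple assumptions (plain lists of times and 0/1 event indicators, a hypothetical function name) and is not the RSF implementation used in this thesis.

```python
def node_estimators(times, events):
    """Return (event_times, survival, cumulative_hazard) for one terminal node."""
    # Distinct times at which at least one event was observed.
    distinct = sorted({t for t, e in zip(times, events) if e == 1})
    surv, chf = [], []
    s, h = 1.0, 0.0
    for t_j in distinct:
        at_risk = sum(1 for t in times if t >= t_j)
        deaths = sum(1 for t, e in zip(times, events) if t == t_j and e == 1)
        s *= 1.0 - deaths / at_risk   # Kaplan-Meier product-limit step
        h += deaths / at_risk         # Nelson-Aalen increment
        surv.append(s)
        chf.append(h)
    return distinct, surv, chf
```

Every patient routed to the same terminal node would receive these same two curves as their prediction from that tree.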
There are two general steps to predict the outcome of an observation with a given
covariate vector. In a tree, the CHF is found for this observation by following the covariate
values down to the terminal node with matching values. Then, an ensemble CHF is built by
averaging the CHFs obtained from each tree in this way. There can be two
ensemble CHFs: one made by averaging trees where the observation is in-bag and another for
trees where the observation is OOB.
The performance metric is the C-index made by the OOB ensemble CHF computed over
all possible unique times of permissible pairs, which provides a less biased estimate than that
found from in-sample data. There is a considerable number of hyperparameters that can be
tuned, but empirical evidence reveals that default values are close to optimal. Hence, no tuning
is performed. Due to computational size, the fast implementation of RSF is utilized with
100 trees, random splitting of variables, and sampling the $3/4$th power of the bootstrap samples.
2.3.3 Boosting
Boosting is another widely used ensemble method known for improved predictions and
reduced chance of overfitting (James et al., 2013). Boosting builds `weak learners' or simple
models in a sequential fashion, with each learner improving on the residuals made by the prior
learner via gradient descent of a specified loss function. Results are then aggregated together
in a weighted fashion to form the final output. The idea is that fitting each subsequent learner
on residuals would place more emphasis on observations that were predicted poorly and help
`boost' the model performance.
There are two widely known parameters that can be tuned to prevent boosting from
overfitting: (1) Number of `weak learners' or boosting iterations - Describes how many such
models should be built, indicating the depth of the overall learning (2) Shrinkage or Learning
rate - Similar to the shrinkage penalty in regularized regressions, it is a value from 0 to 1
that reduces the influence of each additional learner. Having such incremental improvements
in the overall model safeguards against overfitting and can redirect the model in subsequent
steps should it move away from the optimal parameters.
However, for the following two methods, only (1) will be tuned given that the value of
(2) is small (e.g., 0.1), since the authors of both methods have stated that tuning the penalty
parameter is of minor importance.
The Cox PH model serves as the `weak learner' for the following two boosting methods,
both of which use different loss functions but still result in the familiar Cox PH structure
allowing for the same interpretations.
mboost
For model-based boosting (mboost) (Hofner, Mayr, Robinzonov, & Schmid, 2014), the loss
function is based on the original partial log-likelihood described by Cox for events without
ties (Cox, 1975), and the derived gradients build upon this following Ridgeway (Ridgeway,
1999). The partial log-likelihood is negated, and thus mboost attempts to minimize
this `error'. The `weak learners' are least squares estimates of each covariate that are fit on
the negative gradient. Instead of fitting all weak learners, mboost implements component-wise
boosting by setting all coefficients to zero in the first step and then only updating
one estimate of the coefficient in each boosting step. Specifically, mboost moves towards the
negative gradient one covariate at a time by picking the covariate that is most correlated
with the negative gradient. This leads to a potentially sparser model, as it can result in
at most k added covariates in the final Cox model if there were k iterations of boosting.
Here, 25 bootstrapped samples are used to tune the number of boosting iterations.
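The component-wise update can be illustrated with a simplified Python sketch. To keep it short, the sketch swaps in squared-error loss for the negative Cox partial log-likelihood, so the negative gradient is simply the residual vector; it is therefore not the mboost algorithm for survival data, only the same selection-and-update pattern. The function name and the default step size of 0.1 are assumptions for the example.

```python
def componentwise_l2_boost(X, y, n_iter=100, nu=0.1):
    """Component-wise boosting with squared-error loss (NOT the Cox loss)."""
    p = len(X[0])
    beta = [0.0] * p                       # all coefficients start at zero
    for _ in range(n_iter):
        fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
        # Negative gradient of 0.5 * (y - f)^2 with respect to f is the residual.
        grad = [yi - fi for yi, fi in zip(y, fitted)]
        best_j, best_coef, best_sse = None, 0.0, float("inf")
        for j in range(p):
            col = [row[j] for row in X]
            sxx = sum(c * c for c in col)
            if sxx == 0.0:
                continue
            # Least-squares fit of the negative gradient on covariate j alone.
            coef = sum(c * g for c, g in zip(col, grad)) / sxx
            sse = sum((g - coef * c) ** 2 for g, c in zip(grad, col))
            if sse < best_sse:
                best_j, best_coef, best_sse = j, coef, sse
        if best_j is None:
            break
        beta[best_j] += nu * best_coef     # update only the selected coefficient
    return beta
```

Because a single coefficient changes per iteration, stopping after k iterations leaves at most k non-zero coefficients, which is the source of the sparsity described above.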
CoxBoost
CoxBoost seeks to maximize the penalized partial log likelihood function using offset gradient
boosting instead (Binder & Schumacher, 2008). This unique loss includes a penalty term
similar to the $\ell_2$ norm in Ridge Cox PH. The offset accounts for what was already fitted
in previous iterations and helps the model move closer to its maximum partial likelihood
estimator, a vector of coefficients that maximizes this loss function. Like mboost, CoxBoost
can result in a sparser model. In each boosting step, there is a collection of candidate sets of covariates to
consider that may be overlapping and different in size. All sets are simultaneously updated
for their corresponding covariates. The best set that improves overall fit by maximizing the
penalized partial log likelihood is then selected for the update. Unlike traditional gradient
descent, the update is not shrunk at all; instead, shrinkage is captured by the $\lambda$ penalty
parameter in the loss function.
CoxBoost's tuning capabilities only allow unstratified cross-validation to tune the number
of boosting iterations. (Due to technical difficulties, only the default 10 folds are used.)
2.4 Model Performance and Selection
To appropriately handle censored data, Harrell's concordance index (C-index) is used to assess
model performance (Harrell Jr, 2015). A generalization of the Area Under the Curve (AUC),
a measure of discrimination quality for binary outcomes, the C-index ranges from 0 to 1 and
indicates how well the model can distinguish between individuals with longer survival time
versus those with shorter survival time among all comparable pairs (pairs that are either
both uncensored or in which the uncensored observation is paired with a later censored observation).
The C-index is the proportion of all such pairs that are concordant, which is defined as when
the model correctly assigns a higher predicted probability of survival to the individual with
the observed longer survival time. Higher C-indexes indicate better discrimination, where a
value of 1 means perfect discrimination ability and a value of 0.5 means random predictions.
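The pair-counting definition above can be sketched directly in Python. This is an illustrative sketch, not the implementation used in this thesis. It assumes predictions are supplied as risk scores (higher score means shorter expected survival), skips pairs with tied observed times, and counts tied predictions as half-concordant, a common convention that is assumed here rather than taken from the text.

```python
def harrell_c_index(times, events, risk_scores):
    """Proportion of comparable pairs ordered correctly by the risk scores."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable: i has an observed event strictly before j's time,
            # whether j later has an event or is censored.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied predictions count as half
    return concordant / comparable
```

A model that assigns every patient the same risk score gets 0.5 under this sketch, matching the interpretation of 0.5 as random prediction.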
The dataset is split into three subsets via stratified random sampling on censoring rates
to prevent bias in learning: Training (60%), Validation (20%), Testing (20%). All model
building will be performed on the Training set. Within this set, 5-fold Cross-Validation
with the same stratified random sampling procedure is used to tune the relevant model
hyperparameters if necessary. The tuned models are then compared on their performance on
the Validation set, where the model with the highest C-index is chosen. Finally, the selected
model is assessed on the Testing set to obtain a less biased estimate of the model's
generalizability.
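A stratified 60/20/20 split of this kind can be sketched as follows in Python. The sketch is illustrative and is not the R procedure used in this thesis. The function name, the fixed seed, and the choice to send rounding remainders to the Testing set are assumptions for the example.

```python
import random


def stratified_split(events, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return index lists (train, validation, test), stratified by event status."""
    rng = random.Random(seed)
    splits = ([], [], [])
    # Split censored (0) and event (1) observations separately so that each
    # subset keeps roughly the same censoring rate as the full data.
    for status in sorted(set(events)):
        idx = [i for i, e in enumerate(events) if e == status]
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_valid = round(fractions[1] * len(idx))
        splits[0].extend(idx[:n_train])
        splits[1].extend(idx[n_train:n_train + n_valid])
        splits[2].extend(idx[n_train + n_valid:])   # remainder goes to testing
    return splits
```

The same function could be reused within the Training indices to form cross-validation folds that preserve the censoring rate.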
2.5 Interpretable Package
Interpretable packages can provide both global and local visualizations. A global method
describes overall model behavior by finding covariates that contribute to overall predictions,
while a local method determines which of these covariates drive a particular patient's
prediction. Given the nature of this dataset, only global methods will be considered: Variable
Response/Partial Dependence Plots and a Feature Importance Plot.
Package survxai can explain a limited number of survival analysis methods at the time
of writing. In this thesis, only Cox PH and RSF are applicable (Grudziaz et al., 2018).
In survxai, a Variable Response Plot elucidates the effect of a covariate on survival
probability over time by illustrating the mean survival curves of unique values of the covariate.
Within RSF, a feature importance plot can be made, which is a box plot of `importance'
scores across all covariates (Fisher, Rudin, & Dominici, 2018). As the name implies, this
calculates the importance of a feature for predictions by randomizing the values of the covariate,
measuring the change in error, and taking a ratio of the error from permuted values over the
original error to make an importance score. A relatively high increase in error suggests that
the covariate is "important" for the model since the predictions rely heavily on it, leading
to a higher importance score.
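The permutation-based importance score can be sketched generically in Python. This is an illustration of the ratio described above, not the RSF or survxai implementation. The function name and its arguments (a `predict` callable applied to one covariate row, an `error` callable comparing observed and predicted values, and the number of repeats) are hypothetical.

```python
import random


def permutation_importance(predict, error, X, y, feature, n_repeats=10, seed=0):
    """Ratios of error after permuting one feature to the original error."""
    rng = random.Random(seed)
    base_error = error(y, [predict(row) for row in X])
    ratios = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)                 # break the feature-outcome link
        X_perm = [row[:feature] + [v] + row[feature + 1:]
                  for row, v in zip(X, column)]
        perm_error = error(y, [predict(row) for row in X_perm])
        ratios.append(perm_error / base_error)
    return ratios
```

A ratio near 1 means the model barely uses the covariate, while larger ratios indicate heavier reliance; the repeated ratios are what a box plot of importance scores would summarize.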
Chapter 3 Results
3.1 Dataset Characteristics
The values and distribution of the studied variables are presented in Table 3.1. The final
sample consisted of 56,451 individuals diagnosed with PCa during 2004-2006 in California.
Of them, 90.3% were censored (n=50,984) and 9.7% experienced death from malignant PCa
(n=5,467). Of the total sample, NLW constituted the majority (n=41,564) and USB Latinos
were the least represented (n=3,686). Median survival time was 125 months or 10.4 years
among patients with a median age of 67 years. Both the medians of survival time and age were
relatively similar in NLW, NLB, USB and FB Latinos. Most participants had in
situ or localized cancer stages (80.0%), high SES (50.7%), elevated PSA results at diagnosis
(81.0%), and no regional lymph node involvement (90.4%), and were married (69.9%).
The most common treatment was surgery only (33.3%), and the most common payer type was
Managed Care (49.8%). Participants mostly either practiced religion (45.2%) or had unknown
religious status (50.1%), and more than half had an unknown longest held occupation (55.2%).
Patients overall and among each of the four different race groups had very similar
distributions of censoring rates, cancer summary stages, PSA results, and regional lymph node
involvements.
Those with low SES were over-represented among FB Latinos (62.3%), followed by NLB
(47.8%) and US Latinos (42.7%).
Compared to the overall population, FB Latinos were under-represented for being treated
with only chemotherapy or radiation (14.2%) but over-represented for only hormone
treatment (11%). On the other hand, NLB had the highest proportion of not receiving any
treatment (20%) and the smallest percentage for opting only for surgery (27.7%).
While NLW and the total sample had comparable distributions for payer type, FB Latinos
had an over-representation of uninsured (3.1%), Medicaid (8.6%), and Medicare (40.4%)
and under-representation of Managed Care (39.6%). NLB also had an over-representation
of Medicaid (4.3%) and 'Other' insurance, which includes TRICARE, military,
Veterans Affairs, and Indian/Public Health Service (11%). However, they had an
under-representation of Medicare (28.3%). Contrary to FB Latinos, US Latinos had an
under-representation for Medicare (34.3%).
NLB had the lowest proportion of married men (57%) and the highest proportions of single
(17.9%) and separated/divorced/widowed men (18.3%). On the other hand, FB Latinos
had the opposite pattern, with the majority being married (72.6%) and smaller
proportions being single (7.9%) or separated/divorced/widowed (10.9%).
Both FB Latinos and USB Latinos had the highest percentages of religiosity (57.9% for FB
and 54.1% for USB).
FB Latinos had the lowest proportion holding executive and white-collar jobs (3.3%) and the
highest proportions working in farming, machinery, etc. (9.1%) and not working (20.6%).
3.2 Model Tuning
No tuning was required for Cox PH models.
Penalized Cox PH partial likelihood loss plots are shown in Figure 3.1. The two extracted
$\lambda$'s reveal that the less complex model, represented by the higher $\lambda$, which in
this case is $\lambda_{1se}$, yields a very similar partial likelihood to a model trained on
$\lambda_{min}$. Hence, all three models are tuned with $\lambda_{1se}$ to obtain more
parsimonious models.
Figure 3.1: 5-fold CV Loss Plots on Regularized Cox PH Regressions
Note: $\lambda \in (10^{3}, 10^{5})$ with error bars representing $\pm 1$ sd. Two vertical
dotted lines denote $\lambda_{min}$ that minimizes the average partial likelihood and
$\lambda_{1se}$ that is within 1 standard error of $\lambda_{min}$. Numbers at the top
represent df.
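The choice between the two penalty values can be illustrated with a minimal Python sketch. The function name `one_se_rule` and all numbers below are hypothetical and are not taken from the thesis. The rule shown, taking the largest penalty whose mean CV loss lies within one standard error of the minimum, is the usual convention and is assumed here.

```python
def one_se_rule(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validated losses.

    lambdas, cv_mean and cv_se are parallel lists: candidate penalties,
    mean CV loss per penalty, and the standard error of that mean.
    """
    # Index of the penalty with the smallest mean CV loss.
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    # Any penalty whose mean loss is at or below this threshold is "within 1 SE".
    threshold = cv_mean[i_min] + cv_se[i_min]
    within = [lam for lam, loss in zip(lambdas, cv_mean) if loss <= threshold]
    # The largest such penalty shrinks the most, giving the simplest model.
    return lambdas[i_min], max(within)


# Hypothetical CV results (illustrative values only).
lambdas = [0.1, 0.05, 0.01, 0.005, 0.001]
cv_mean = [12.90, 12.45, 12.40, 12.41, 12.43]
cv_se = [0.06, 0.06, 0.06, 0.06, 0.06]

lam_min, lam_1se = one_se_rule(lambdas, cv_mean, cv_se)
print(lam_min, lam_1se)
```

In this toy example the larger penalty is selected even though its mean loss is slightly higher, which mirrors the trade-off of a small loss in fit for a more parsimonious model.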
For RSF, a loss plot of 500 trees revealed that the C-index plateaued at 100 trees and beyond.
So, 100 trees were chosen (Figure 3.2 shows a truncated plot).
Figure 3.2: Random Survival Forest Out-of-Bag Error Plot and Feature Importances
Note: Both on training set with left barplot of feature importances by permutation and right barplot of
(1 - C-index) against number of trees.
For mboost and CoxBoost, after tuning through their respective cross-validations,
the optimal numbers of iterations were 498 and 249, respectively (the loss plot for mboost is
shown in Figure 3.3, but no plot can be extracted for CoxBoost within mlr). Both needed almost
all of the iterations provided to continue improving the error.
Figure 3.3: mboost Loss Plot
Note: Five shaded curves represent each fold of the 5-fold CV process. The bolded line represents the mean
CV error, the negative partial log likelihood.
3.3 Model Results
Model performances as measured by C-index are displayed in Table 3.2 with results across
eight survival models (RSF is excluded from the coefficients in Table 3.4 because it does not
provide coefficients). The stratified Cox PH model was stratified on cancer stage. The ML
algorithms, none of which stratified, performed very similarly to Unstratified Cox PH, with
most C-indexes hovering around 0.85. RSF was the best performing model with a slightly
higher C-index of 0.87. Both Unstratified Cox PH and RSF were evaluated again on the testing
set. The same C-indexes were found after rounding to the thousandths place (Table 3.3).
Confidence intervals could not be computed. Instead, both Unstratified Cox PH and RSF were
resampled several times as another way to demonstrate variability of their performances. In
Figure 3.4, boxplots of their C-indexes reveal substantial overlap, indicating that RSF does
not necessarily outperform Cox PH.
Figure 3.4: Testing Error Variations
Note: Boxplot of 5 different C-indexes using 5-fold CV with Stratified Random Sampling of the Testing set.
In Table 3.4, all methods had a consensus on which group of predictors affected PCa
survival: age, cancer stage, regional lymph node involvement, NLB, high SES, treatment
type, Medicaid as a payer, married and separated men. Variable response plots only had five
of these variables (age, cancer stage, node involvement, race, and treatment) with distinct
predicted curves for both Cox PH models and RSF (Figures 3.5-3.7). (Note that all Cox PH
plots are those without stratification. Stratified Cox was omitted since plots were identical.)
Race was important only in the two Cox models and Ridge Cox: NLW and USB Latinos had
comparable survival rates, while FB Latinos had a survival advantage (Stratified Cox PH:
HR=0.84, p<0.05; Unstratified Cox PH: HR=0.81, p<0.05; Ridge Cox: HR=0.86). However,
Latinos with unknown nativity had the largest effect estimate (Stratified Cox PH: HR=0.24,
p<0.05; Unstratified Cox PH: HR=0.23, p<0.05), and their predicted survival curve was the
highest for Cox in Figure 3.5. Further exploratory analysis revealed that this group mostly
had in situ+localized stages and low SES, with high proportions of both unknown marital
status and no treatment recorded.
Figure 3.5: Predicted survival probabilities against time in months - Race and Cancer Stage
Likewise for PSA results, those with unknown diagnostics had poor survival (Stratified
Cox PH: HR=1.78, p<0.05; Unstratified Cox PH: HR=1.76, p<0.05 and all ML models
have HR>1). Among those with missing PSA results, relatively high percentages opted for
surgery (43%) or no treatment (28%) and Medicare as a payer (47%) compared to the whole
population (results not shown).
Type of treatment was also very important, with different options having opposite effects
on survival. Hormonal treatment only was associated with higher hazards (both Stratified and
Unstratified Cox PH: HR=1.49, p<0.05 and all HR>1 for ML), whereas surgery had a protective
effect (Stratified Cox PH: HR=0.29, p<0.05; Unstratified Cox PH: HR=0.30, p<0.05 and all
HR<1 for ML), followed by chemotherapy or radiation with a less protective effect (Stratified
Cox PH: HR=0.52, p<0.05; Unstratified Cox PH: HR=0.53, p<0.05). Similar findings were found
in RSF, with those who chose hormonal-only treatment having the greatest survival disadvantage
(Figure 3.6).
Figure 3.6: Predicted survival probabilities against time in months - Treatment
For insurance payers, Medicaid was associated with worse survival (Stratified Cox PH:
HR=1.22, p>0.05; Unstratified Cox PH: HR=1.21 and all HR>1 for ML). Being married was
protective (Stratified Cox PH: HR=0.77, p<0.05; Unstratified Cox PH: HR=0.76, p<0.05
and all HR<1 for ML) but being separated was not (both Stratified and Unstratified Cox PH:
HR=1.03, p>0.05 and all HR>1 for ML).
Cases whose nodes were affected also had worse survival (Stratified Cox PH: HR=1.82,
p<0.05; Unstratified Cox PH: HR=1.84, p<0.05 and all HR>1 for ML). The same pattern
is further highlighted in RSF (Figure 3.7).
Figure 3.7: Predicted survival probabilities against time in months - Age and Node Involvement
Note: Age was automatically split into quintiles by survxai.
Although present, the effects of occupation type and religion disappear when adjusted
for these social and clinical predictors. All of their HRs were non-significant (p>0.05) in
both Cox PH models, and their values were close to the null HR of 1 in the Cox PH and ML
models alike.
Table 3.1: Distribution of values by race group
Variable NLW NLB USB Latinos FB Latinos Unknown Total
N=41564 (%) N=5704 (%) N=3686 (%) N=5245 (%) N=252 (%) N=56451 (%)
Status
Censored 37676 (90.6%) 5042 (88.4%) 3326 (90.2%) 4696 (89.5%) 244 (96.8%) 50984 (90.3%)
Death 3888 (9.4%) 662 (11.6%) 360 (9.8%) 549 (10.5%) 8 (3.2%) 5467 (9.7%)
Summary Stage
In Situ+Localized 33280 (80.1%) 4480 (78.5%) 2900 (78.7%) 4048 (77.2%) 211 (83.7%) 44919 (80.0%)
Regional 5347 (12.9%) 704 (12.3%) 483 (13.1%) 701 (13.4%) 16 (6.3%) 7251 (12.8%)
Remote 1817 (4.4%) 337 (5.9%) 196 (5.3%) 281 (5.4%) 9 (3.6%) 2640 (4.7%)
Unknown 1120 (2.7%) 183 (3.2%) 107 (2.9%) 215 (4.1%) 16 (6.3%) 1641 (2.9%)
SES
Low SES 8638 (20.8%) 2725 (47.8%) 1575 (42.7%) 3270 (62.3%) 126 (50%) 16334 (28.9%)
Middle SES 8369 (20.1%) 1256 (22%) 851 (23.1%) 951 (18.1%) 50 (19.8%) 11477 (20.2%)
High SES 24557 (59.1%) 1723 (30.2%) 1260 (34.2%) 1024 (19.5%) 76 (30.2%) 28640 (50.7%)
PSA Results
Normal 3686 (8.9%) 255 (4.5%) 249 (6.8%) 309 (5.9%) 7 (2.8%) 4506 (8.0%)
Elevated 33113 (79.7%) 5032 (88.2%) 3061 (83%) 4324 (82.4%) 207 (82.1%) 45737 (81.0%)
Borderline 1481 (3.6%) 126 (2.2%) 116 (3.1%) 119 (2.3%) 6 (2.4%) 1848 (3.3%)
Unknown 3284 (7.9%) 291 (5.1%) 260 (7.1%) 493 (9.4%) 32 (12.7%) 4360 (7.7%)
Treatment
None 6718 (16.2%) 1146 (20.1%) 587 (15.9%) 910 (17.3%) 96 (38.1%) 9457 (16.8%)
Hormonal only 3584 (8.6%) 577 (10.1%) 353 (9.6%) 577 (11%) 28 (11.1%) 5119 (9.1%)
Surgery only 14191 (34.1%) 1579 (27.7%) 1226 (33.3%) 1783 (34%) 32 (12.7%) 18811 (33.3%)
Chemo or radiation only 8186 (19.7%) 1062 (18.6%) 665 (18%) 747 (14.2%) 49 (19.4%) 10709 (18.9%)
Other including combination 1121 (2.7%) 135 (2.4%) 117 (3.2%) 181 (3.5%) 1 (0.4%) 1555 (2.8%)
Unknown 7764 (18.7%) 1205 (21.1%) 738 (20%) 1047 (20%) 46 (18.3%) 10800 (19.1%)
Payer
Not Insured 340 (0.8%) 118 (2.1%) 27 (0.7%) 164 (3.1%) 11 (4.4%) 660 (1.2%)
Managed Care 21018 (50.6%) 2899 (50.8%) 1978 (53.7%) 2075 (39.6%) 120 (47.6%) 28090 (49.8%)
Medicaid 446 (1.1%) 246 (4.3%) 64 (1.7%) 450 (8.6%) 20 (7.9%) 1226 (2.2%)
Medicare 16174 (38.9%) 1616 (28.3%) 1266 (34.3%) 2119 (40.4%) 42 (16.7%) 21217 (37.6%)
Other 1535 (3.7%) 626 (11%) 202 (5.5%) 125 (2.4%) 5 (2%) 2493 (4.4%)
Unknown 2051 (4.9%) 199 (3.5%) 149 (4%) 312 (5.9%) 54 (21.4%) 2765 (4.9%)
Marital Status
Single 3947 (9.5%) 1019 (17.9%) 351 (9.5%) 412 (7.9%) 23 (9.1%) 5752 (10.2%)
Married 29707 (71.5%) 3249 (57%) 2557 (69.4%) 3807 (72.6%) 112 (44.4%) 39432 (69.9%)
Separated/Divorced/Widowed 5332 (12.8%) 1046 (18.3%) 531 (14.4%) 573 (10.9%) 18 (7.1%) 7500 (13.1%)
Unknown 2578 (6.2%) 390 (6.8%) 247 (6.7%) 453 (8.6%) 99 (39.3%) 3767 (6.7%)
Nodes
N0 37761 (90.9%) 5123 (89.8%) 3350 (90.9%) 4576 (87.2%) 213 (84.5%) 51023 (90.4%)
N1 728 (1.8%) 119 (2.1%) 71 (1.9%) 119 (2.3%) 4 (1.6%) 1041 (1.8%)
Unknown 3075 (7.4%) 462 (8.1%) 265 (7.2%) 550 (10.5%) 35 (13.9%) 4387 (7.8%)
Religion
None/Agnostic/Atheist 2213 (5.3%) 229 (4%) 104 (2.8%) 90 (1.7%) 4 (1.6%) 2640 (4.7%)
Religious 17754 (42.7%) 2698 (47.3%) 1994 (54.1%) 3037 (57.9%) 49 (19.4%) 25532 (45.2%)
Unknown 21597 (52%) 2777 (48.7%) 1588 (43.1%) 2118 (40.4%) 199 (79%) 28279 (50.1%)
Occupation
Executives + white collars 5698 (13.7%) 414 (7.3%) 243 (6.6%) 171 (3.3%) 2 (0.8%) 6528 (11.6%)
Farming, machinery, etc 880 (2.1%) 181 (3.2%) 180 (4.9%) 478 (9.1%) 11 (4.4%) 1730 (3.1%)
Other occupations 5521 (13.3%) 977 (17.1%) 608 (16.5%) 556 (10.6%) 12 (4.8%) 7674 (13.4%)
Not working 6837 (16.4%) 779 (13.7%) 648 (17.6%) 1078 (20.6%) 31 (12.3%) 9373 (16.6%)
Unknown 22628 (54.4%) 3353 (58.8%) 2007 (54.4%) 2962 (56.5%) 196 (77.8%) 31146 (55.2%)
Median (Mean)
Survival Time 126 (108.5) 123 (105.2) 126 (109.6) 124 (106.4) 123 (95.6) 125 (108)
Age 68 (68) 64 (64.5) 67 (66.8) 68 (67.7) 68 (66.6) 67 (67.5)
Note: NLW = non-Latino whites; NLB = non-Latino blacks; USB = US-born; FB = foreign-born
Table 3.2: Model Performances on Validation Set (n=11,290)
Model C-index
Cox PH with stratification 0.685
Cox PH without stratification 0.851
Ridge Cox PH 0.852
Lasso Cox PH 0.851
Elastic Net Cox PH 0.851
Random Survival Forest 0.868
mboost 0.843
CoxBoost 0.851
Note: Stratification was performed on cancer stage.
ML models starting from Ridge Cox PH do not use
stratification.
Table 3.3: Model Performances on Testing Set (n=11,291)
Model C-Index
Cox PH without stratification 0.851
Random Survival Forest 0.868
Table 3.4: Model Coefficients
Variable  Cox PH with Stratification  Cox PH without Stratification  Ridge  Elastic Net  Lasso  mboost  CoxBoost
Age 1.04* 1.04* 1.03 1.04 1.04 1.03 1.48
Race group
NLW Ref Ref 1.01 - - - -
NLB 1.18* 1.17* 1.15 1.11 1.1 - 1.04
USB Latinos 1.04 1.03 1.03 - - - -
FB Latinos 0.84* 0.81* 0.86 - - - 0.98
Unknown 0.24* 0.23* 0.47 0.93 - - 0.98
Summary Stage
In Situ+Localized NA Ref 0.56 0.5 0.48 - 0.87
Regional NA 2.18* 1.01 - - 1.01 1.12
Remote NA 12.74* 6.16 6.4 6.32 14.33 1.62
Unknown NA 2.99* 1.56 1.49 1.43 3.12 1.15
SES
Low SES Ref Ref 1.07 1.01 - - 1.01
Middle SES 0.97 0.96 1.03 - - - -
High SES 0.85* 0.85* 0.92 0.93 0.93 - 0.96
PSA Results
Normal Ref Ref 0.85 - - - -
Elevated 1.17 1.17 0.98 - - - -
Borderline 0.84 0.84 0.77 - - - 0.99
Unknown 1.78* 1.76* 1.38 1.4 1.4 1.15 1.09
Treatment
None Ref Ref 1.28 - - - -
Hormonal only 1.49* 1.49* 1.96 1.49 1.43 1.29 1.09
Surgery only 0.29* 0.3* 0.58 0.42 0.38 0.62 0.63
Chemo or radiation only 0.52* 0.53* 0.77 0.66 0.64 0.82 0.82
Other including combination 1.08 1.09 1.39 - - - -
Unknown 1.2* 1.2* 1.5 1.19 1.14 1.03 1.06
Payer
Not Insured Ref Ref 1.09 - - - -
Managed Care 0.82 0.82 0.93 - - - -
Medicaid 1.22 1.21 1.28 1.19 1.17 - 1.04
Medicare 0.82 0.82 1 - - - -
Other 0.82 0.82 0.98 - - - -
Unknown 1.19 1.22 1.35 1.22 1.18 - 1.05
Marital Status
Single Ref Ref 1.11 1.05 1.03 - 1.02
Married 0.77* 0.76* 0.89 0.91 0.91 0.98 0.95
Separated/Divorced/Widowed 1.03 1.03 1.2 1.17 1.16 1.04 1.05
Unknown 0.78* 0.76* 0.92 - - - 0.99
Nodes
N0 Ref Ref 0.72 0.56 0.56 - 0.82
N1 1.82* 1.84* 1.45 1.04 1.02 1.69 -
Unknown 1.64* 1.68* 1.36 - - 1.68 0.98
Religion
None/Agnostic/Atheist Ref Ref 0.96 - - - -
Religious 1.14 1.14 1.08 1.08 1.08 - 1.05
Unknown 0.94 0.94 0.93 0.98 0.99 - 0.98
Occupation
Executives + white collars Ref Ref 0.98 - - - -
Farming, machinery, etc 1.18 1.2 1.15 - - - 1.01
Other occupations 1.04 1.04 1.02 - - - -
Not working 1.1 1.09 1.09 1.01 - - -
Unknown 0.94 0.93 0.94 0.95 0.96 - 0.96
Note: Cox models can provide significance. Those with * are significant with p<0.05. The numbers displayed
above are Hazard Ratios (HR).
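Reading the hazard ratios in Table 3.4 involves a little arithmetic, sketched here in Python. The helper names are hypothetical. The sketch assumes the standard Cox relationship HR = exp(coefficient), and it assumes that age enters the model per year, which the table itself does not state.

```python
import math


def log_hazard_coefficient(hr):
    """Cox coefficient (log hazard ratio) corresponding to a hazard ratio."""
    return math.log(hr)


def percent_change_in_hazard(hr):
    """Relative change in hazard versus the reference group, in percent."""
    return (hr - 1.0) * 100.0


def hr_over_units(hr_per_unit, units):
    """Hazard ratio for a multi-unit change, assuming a log-linear effect."""
    return hr_per_unit ** units


# Values copied from Table 3.4 (Cox PH with stratification).
fb_latino_hr = 0.84   # FB Latinos vs NLW
surgery_hr = 0.29     # surgery only vs no treatment
age_hr = 1.04         # age, assumed here to be per year

print(round(percent_change_in_hazard(fb_latino_hr), 1))
print(round(percent_change_in_hazard(surgery_hr), 1))
print(round(hr_over_units(age_hr, 10), 2))
```

An HR below 1 therefore reads as a percentage reduction in hazard relative to the reference category, and a per-unit HR compounds multiplicatively over several units.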
Chapter 4 Discussion
4.1 Survival Analysis: Traditional Statistics vs Machine Learning
Various ML models were compared against Cox PH regression as a baseline to see whether
prediction of survival time to PCa death could be further improved, using Harrell's
concordance index as the measure of prediction accuracy. In addition, an interpretable package
was evaluated to better understand which covariates are affecting survival in some of these
black box methods.
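To make the accuracy measure concrete, the following is a minimal Python sketch of Harrell's concordance index. It is a simplified illustration, not the implementation used in the analysis: tied event times are ignored, ties in predicted risk count as one half, and the data are made up.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs ranked correctly by the risk score.

    A pair (i, j) is comparable when subject i has an observed event
    (events[i] == 1) at a time strictly earlier than times[j]. The pair is
    concordant when the subject who died earlier has the higher risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical data: follow-up in months, event indicator, predicted risk.
times = [5, 8, 12, 20, 25]
events = [1, 0, 1, 1, 0]
risk = [0.9, 0.4, 0.7, 0.8, 0.1]

c_index = harrell_c_index(times, events, risk)
print(round(c_index, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported values of 0.69, 0.85 and 0.87 should be read.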
For a priori reasons, stratification of the baseline hazards was performed on cancer stage.
However, stratification lowered the C-index substantially compared with no stratification
(0.69 vs 0.85). Stratification groups the observations into strata of cancer stage, each with
its own baseline hazard, but it also loses the ability to adjust for stage since its effects
are not estimated. Since stage is a very predictive feature, as shown in Figure 3.5 with
survival probabilities becoming progressively worse as cancer stage becomes more severe, the
predictions suffer when regressing without cancer stage.
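The reason stage drops out of the predictions can be seen from the usual form of a stratified Cox model, written here in LaTeX. The notation is an assumption introduced for illustration (stratum $s$ for cancer stage, $D_s$ for the set of deaths in stratum $s$, $R_s(t_i)$ for the subjects in stratum $s$ still at risk at time $t_i$) and may differ from the notation used elsewhere in the thesis.

```latex
% Hazard for a subject in stratum s with covariates x (stage is not in x):
h_s(t \mid x) = h_{0s}(t)\,\exp\!\left(x^{\top}\beta\right), \qquad s = 1, \dots, S

% Partial likelihood: each stratum contributes its own risk sets.
L(\beta) = \prod_{s=1}^{S} \prod_{i \in D_s}
  \frac{\exp\!\left(x_i^{\top}\beta\right)}
       {\sum_{j \in R_s(t_i)} \exp\!\left(x_j^{\top}\beta\right)}
```

Because stage appears only through the unspecified baseline hazards $h_{0s}(t)$, the linear predictor $x^{\top}\beta$ carries no stage information. Ranking patients from different strata by that linear predictor alone, as a single C-index does, would therefore be expected to score lower than a model in which stage has its own coefficients.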
Furthermore, the C-indexes on the validation set show that almost all ML methods were
comparable to Cox PH's unstratified performance, with the exception of RSF, which performed
the best with a C-index of 0.87. These results match what was reported in the literature. The
creators of RSF demonstrated that it performed either just as well as, or slightly better
than, Cox PH on a variety of health-based time-to-event datasets (Ishwaran et al., 2008).
Similar results were reported on breast cancer data, again using both Cox PH and RSF, with the
latter having a small improvement in predictive performance (Omurlu, Ture, & Tokatli, 2009).
Kattan and colleagues compared Cox PH on several urological datasets for time-to-recurrence
and also found that it performed on par with ML methods, including other tree-based methods
and neural networks (Kattan, 2003).
As a result, ML may not always be superior to traditional statistical methods for
survival data, specifically for time-to-death. There could also be a lack of complicated,
nonlinear relationships in this dataset, in which case ML would have little room to outperform
Cox PH.
4.2 Effects of Predictors
Only a subset of variables had substantial evidence of influencing PCa survival. The feature
selection models, Elastic Net Cox, LASSO Cox, and mboost, generally agree with the
significant predictors found in both Cox PH models.
Results with increasing age and NLB having higher risk of PCa mortality match what
was previously reported (ACS, 2018). Despite the fact that there is evidence of survival
disparities among USB and FB Latinos with the latter having better survival (Schupp et al.,
2014; Pinheiro et al., 2017), nativity effects were not as pronounced in the ML models with
the exception of Ridge Cox. However, all models reveal that those with missing nativity have
the best survival prospects of all races analyzed. Hence, it is still crucial to recognize the
heterogeneity of Latinos by nativity and to better understand cancer patterns by improving
data collection quality.
For PCa, a review of social determinants found that both SES and marital status play
important roles in survival with low SES having worse survival but married men having
improved survival (Coughlin, 2019). Although SES as a variable itself was not deemed
important in our results, it may contribute to certain predictors being identified as either
risk or protective factors. Namely those who chose surgery or chemotherapy/radiation only
had relatively high proportions of high SES compared to the total sample. However, this is
not to say SES is the lone explainer as those who opted for only hormonal treatments also
had high percentages of high SES.
On the other hand, treatment type has been shown to affect survival. A meta-analysis
reviewing studies of localized PCa found that brachytherapy only, a type of radiation
therapy, was "superior" for the low-risk group (as defined by the National Comprehensive
Cancer Network), combination radiation therapy was better for the intermediate-risk group, and
combination radiation with or without hormonal therapy was better for the high-risk group
(Grimm et al., 2012). As such, our result for survival depending on treatment reveals similar
patterns, especially with treatment as the second most important feature in RSF as shown in
Figure 3.2. However, our dataset included cancers that have spread, and so other treatments
may prove to be superior, as we have shown surgery was the most protective factor.
Another known social determinant of PCa is insurance status where a study using the
National Cancer Database found that those who are uninsured or have public insurance
were associated with more severe cancers (Fedewa, Etzioni, Flanders, Jemal, & Ward, 2010).
Our results show that, although not significant by Cox, Medicaid was associated with worse
survival. The lack of effects from other insurance payer types could be that their effects on
disease progression do not necessarily translate to impacting survival time.
However, this was not the case for regional lymph node involvement. A study found
that having positive lymph node findings was significantly associated with PCa progression
(Stamey, McNeal, Yemoto, Sigal, & Johnstone, 1999), and our results show that cases whose
nodes were affected have poorer survival prospects.
As for religion, most models had slightly elevated HR, but the effect estimates were not
significant in either Cox model. While religion and occupation alone may be correlated with
PCa mortality, their effects were subdued as they did not shed any additional light into
predicting survival time. This could partly be due to the lack of recorded data as
approximately 50% of religion and 55% of occupation values were missing (Table 3.1).
Although we are using a large, comprehensive population-based cancer registry, statistical
power was reduced when cases were restricted to 2004-2006. As an unintended consequence,
we are also unable to incorporate Gleason stage, an important clinical risk factor, since it
was over 98% missing for the years analyzed. Losing Gleason and having cases with missing
information across several predictors may introduce bias. We also lack information on family
history of PCa, another known risk factor.
Lastly, creating separate categories for missing data in each variable may produce biased
estimates (Raghunathan, 2004). However, simply removing these rows would result in a
drastic reduction in data size. Imputation using mean or mode would not change the
distributions but would add little to no additional information for our analysis. Having an
explicit missing category would enable us to measure its predictive power (i.e., is there a
difference between those who have been identified as a Hispanic born in the US vs those with
unknown birthplace information). Therefore, we felt this ad-hoc approach of handling missing
data was the most plausible.
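The explicit missing-category approach can be sketched in a few lines of Python. The helper names and the toy nativity values are hypothetical; the point is only that "Unknown" becomes a level with its own indicator column, so a model can estimate a separate effect for it.

```python
from collections import Counter


def with_unknown_category(values, label="Unknown"):
    """Replace missing entries (None or empty string) with an explicit level."""
    return [label if v is None or v == "" else v for v in values]


def indicator_columns(values, reference):
    """Dummy-code a categorical variable, leaving out the reference level."""
    levels = sorted(set(values) - {reference})
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}


# Hypothetical nativity field with missing entries.
nativity = ["USB", "FB", None, "FB", "", "USB", None]

coded = with_unknown_category(nativity)
counts = Counter(coded)
dummies = indicator_columns(coded, reference="USB")

print(counts)
print(dummies)
```

No rows are dropped, and the "Unknown" column can be compared against the reference group in the same way as any observed category.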
4.3 User Experience Difficulties
mlr was used to streamline the ML pipeline by wrapping many algorithms
in R with a unified interface, similar to scikit-learn in Python (Pedregosa et al., 2011).
However, several capabilities were unavailable. Cox PH could not perform stratification of
the baseline hazards within mlr, so a new learner with its own training and prediction
functions had to be written by explicitly modifying the formula to include strata(). Other
issues included the inability to supply user-defined folds for tuning in surv.cv.glmnet
for regularized Cox PH and surv.mBoost for mBoost. This was a problem as mlr did not
maintain stratified sampling of the folds. Moreover, the number of folds could not be changed
in surv.cv.CoxBoost for CoxBoost. Thus, some models had to be tuned outside of
mlr by using the underlying packages themselves and then trained in mlr with these tuned
hyperparameters. These workarounds defeat some of the purposes of mlr's ecosystem.
XGBoost, a popular boosting method for regression and classification problems (Chen &
Guestrin, 2016), had capabilities to perform survival analysis but had no implementation
available in mlr.
A newer package called mlr3 (Lang et al., 2019) was discovered late in the process and
may have solved these issues, and so we recommend using this instead especially since mlr
is no longer being maintained.
Run time also varied substantially. Cox PH and its penalized versions took only a few
minutes to run. The Newton-Raphson method was used to find the maximum partial likelihood
estimates for Cox PH and a form of partial Newton algorithm was used for penalized models.
Here, both provide quadratic convergence. On the other hand, RSF, mBoost, and CoxBoost were
very time consuming (ranging from 2h-22h). A server with significant memory,
more than a conventional personal laptop, had to be used to run these models; otherwise
R would abort.
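The Newton-Raphson step mentioned above can be sketched in Python for the simplest case of one covariate and no tied event times. This is a teaching sketch under those stated assumptions, with made-up data and hypothetical function names; it is not the routine used by the R packages in the analysis.

```python
import math


def score_and_information(beta, times, events, x):
    """First derivative (score) and negative second derivative (information)
    of the Cox partial log-likelihood for a single covariate."""
    score, info = 0.0, 0.0
    for i in range(len(times)):
        if events[i] != 1:
            continue  # only observed deaths contribute terms
        # Risk set: everyone still under observation at this death time.
        risk = [j for j in range(len(times)) if times[j] >= times[i]]
        weights = [math.exp(beta * x[j]) for j in risk]
        s0 = sum(weights)
        s1 = sum(w * x[j] for w, j in zip(weights, risk))
        s2 = sum(w * x[j] ** 2 for w, j in zip(weights, risk))
        mean = s1 / s0                  # weighted mean of x in the risk set
        score += x[i] - mean
        info += s2 / s0 - mean ** 2     # weighted variance of x in the risk set
    return score, info


def newton_raphson_cox(times, events, x, beta=0.0, tol=1e-10, max_iter=25):
    """Iterate beta <- beta + score / information until the step is tiny."""
    for iteration in range(1, max_iter + 1):
        score, info = score_and_information(beta, times, events, x)
        step = score / info
        beta += step
        if abs(step) < tol:
            return beta, iteration
    raise RuntimeError("Newton-Raphson did not converge")


# Hypothetical data: follow-up time, event indicator, one binary covariate.
times = [2, 3, 5, 7, 11, 13]
events = [1, 1, 0, 1, 1, 0]
x = [1, 0, 1, 0, 1, 0]

beta_hat, n_iter = newton_raphson_cox(times, events, x)
final_score, _ = score_and_information(beta_hat, times, events, x)
print(beta_hat, math.exp(beta_hat), n_iter)
```

Each iteration costs one pass over the risk sets, and the number of correct digits roughly doubles per step near the solution, which is consistent with Cox PH finishing in minutes while the ensemble methods took hours.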
Visualizing the thinking process using survxai is also limited at the time of writing. As
the field of ML is developing and applications in survival are still young, we would expect
future developments with more diverse applications.
Chapter 5 Conclusion
Our results indicate that PCa survival is unequally distributed by race. Relative to
non-Latino Whites (NLW), non-Latino Blacks (NLB) had the worst survival, but disparate
survival patterns exist among Latino subgroups by nativity. Foreign-born Latinos exhibited
significantly better survival than NLW, while US-born Latinos experienced similar survival
to NLW. These findings give evidence that nativity should be considered when assessing
population health and reducing cancer disparities. Although religious and occupation status
had no significant effect on survival, improved data collection quality is important as these
data were predominantly missing and may influence outcomes. For PCa, Cox PH performs well
enough and would be the recommended model as it provides explicit hazard ratios of the
predictors, providing tangible results important for clinical decision-making. A better
performing model, Random Survival Forest, might be too much of a black box model for
clinicians and might not be worth the sacrifice for a minor improvement.
References
ACS. (2018). Cancer facts & figures for hispanics/latinos: 2018-2020. Atlanta: American
Cancer Society.
Binder, H., & Schumacher, M. (2008). Allowing for mandatory covariates in boosting
estimation of sparse high-dimensional survival models. BMC bioinformatics, 9(1), 14.
Bradburn, M. J., Clark, T. G., Love, S. B., & Altman, D. G. (2003). Survival analysis
part ii: multivariate data analysis - an introduction to concepts and methods. British
journal of cancer, 89(3), 431-436.
Breiman, L. (1996). Bagging predictors. Machine learning, 24(2), 123-140.
Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.
Breslow, N. E. (1972). Discussion of the paper by dr cox. Journal of the Royal Statistical
Society, Series B, 34, 216-217.
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In
Proceedings of the 22nd acm sigkdd international conference on knowledge discovery
and data mining (pp. 785-794). New York, NY, USA: ACM. Retrieved from
http://doi.acm.org/10.1145/2939672.2939785 doi: 10.1145/2939672.2939785
Coughlin, S. S. (2019). A review of social determinants of prostate cancer risk, stage, and
survival. Prostate International.
Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society:
Series B (Methodological), 34(2), 187-202.
Cox, D. R. (1975). Partial likelihood. Biometrika, 62(2), 269-276.
Dietterich, T. G. (2000). Ensemble methods in machine learning. In International workshop
on multiple classifier systems (pp. 1-15).
Fedewa, S. A., Etzioni, R., Flanders, W. D., Jemal, A., & Ward, E. M. (2010). Association
of insurance and race/ethnicity with disease severity among men diagnosed
with prostate cancer, national cancer database 2004-2006. Cancer Epidemiology and
Prevention Biomarkers, 19(10), 2437-2444.
Ferlay, J., Soerjomataram, I., Dikshit, R., Eser, S., Mathers, C., Rebelo, M., . . . Bray,
F. (2015). Cancer incidence and mortality worldwide: sources, methods and major
patterns in globocan 2012. International journal of cancer, 136(5), E359-E386.
Fisher, A., Rudin, C., & Dominici, F. (2018). Model class reliance: Variable importance
measures for any machine learning model class, from the "Rashomon" perspective.
arXiv preprint arXiv:1801.01489.
Fritz, A., Percy, C., Jack, A., Shanmugaratnam, K., Sobin, L., Parkin, D. M., . . . others
(2001). International classification of diseases for oncology. 2000. Geneva: World
Health Organization, 3.
Grambsch, P. M., & Therneau, T. M. (1994). Proportional hazards tests and diagnostics
based on weighted residuals. Biometrika, 81(3), 515-526.
Grimm, P., Billiet, I., Bostwick, D., Dicker, A. P., Frank, S., Immerzeel, J., . . . others (2012).
Comparative analysis of prostate-specific antigen free survival outcomes for patients
with low, intermediate and high risk prostate cancer treatment by radical therapy.
results from the prostate cancer results study group. BJU international, 109, 22-29.
Grudziaz, A., Gosiewska, A., & Biecek, P. (2018). survxai: an r package for structure-
agnostic explanations of survival models. J. Open Source Software, 3(31), 961.
Harrell Jr, F. E. (2015). Regression modeling strategies: with applications to linear models,
logistic and ordinal regression, and survival analysis. Springer.
Hoerl, A., & Kennard, R. (1988). Ridge regression, in 'encyclopedia of statistical sciences',
vol. 8. Wiley, New York.
Hofner, B., Mayr, A., Robinzonov, N., & Schmid, M. (2014). Model-based boosting in r:
a hands-on tutorial using the r package mboost. Computational statistics, 29(1-2),
3-35.
Ishwaran, H., Kogalur, U. B., Blackstone, E. H., Lauer, M. S., et al. (2008). Random survival
forests. The annals of applied statistics, 2(3), 841-860.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical
learning (Vol. 112). Springer.
Kattan, M. W. (2003). Comparison of cox regression with other methods for determining
prediction models and nomograms. The Journal of urology, 170(6), S6-S10.
Krstev, S., & Knutsson, A. (2019). Occupational risk factors for prostate cancer: A meta-
analysis. Journal of cancer prevention, 24(2), 91.
Lang, M., Binder, M., Richter, J., Schratz, P., Pfisterer, F., Coors, S., . . . Bischl,
B. (2019, dec). mlr3: A modern object-oriented machine learning
framework in R. Journal of Open Source Software. Retrieved from
https://joss.theoj.org/papers/10.21105/joss.01903 doi: 10.21105/joss.01903
Mackillop, W. J., & Quirt, C. F. (1997). Measuring the accuracy of prognostic judgments
in oncology. Journal of clinical epidemiology, 50(1), 21-29.
Maliski, S. L., Husain, M., Connor, S. E., & Litwin, M. S. (2012). Alliance of support for
low-income latino men with prostate cancer: God, doctor, and self. Journal of religion
and health, 51(3), 752-762.
Mantel, N. (1966). Evaluation of survival data and two new rank order statistics arising in
its consideration. Cancer Chemother Rep, 50, 163-170.
Miao, F., Cai, Y.-P., Zhang, Y.-X., Li, Y., & Zhang, Y.-T. (2015). Risk prediction of
one-year mortality in patients with cardiac arrhythmias using random survival forest.
Computational and mathematical methods in medicine, 2015.
Mozaffari-Kermani, M., Sur-Kolay, S., Raghunathan, A., & Jha, N. K. (2014). Systematic
poisoning attacks on and defenses for machine learning in healthcare. IEEE journal of
biomedical and health informatics, 19(6), 1893-1905.
Noe-Bustamante, L., & Flores, A. (2019). Facts on latinos in the us. Pew Research Center.
Omurlu, I. K., Ture, M., & Tokatli, F. (2009). The comparisons of random survival forests
and cox regression analysis with simulation and an application related to breast cancer.
Expert Systems with Applications, 36(4), 8582-8588.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Duchesnay,
E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine
Learning Research, 12, 2825-2830.
Pinheiro, P. S., Callahan, K. E., Siegel, R. L., Jin, H., Morris, C. R., Trapido, E. J., & Gomez,
S. L. (2017). Cancer mortality in hispanic ethnic groups. Cancer Epidemiology and
Prevention Biomarkers, 26(3), 376-382.
R Core Team. (2019). R: A language and environment for statistical computing [Computer
software manual]. Vienna, Austria. Retrieved from https://www.R-project.org/
Raghunathan, T. E. (2004). What do we do with missing data? some options for analysis
of incomplete data. Annu. Rev. Public Health, 25, 99-117.
NAACCR Race and Ethnicity Work Group. (2011). Naaccr guideline
for enhancing hispanic/latino identification: Revised naaccr
hispanic/latino identification algorithm [nhia v2.2.1]. Springfield, IL: North
American Association of Central Cancer Registries. Retrieved from
https://www.naaccr.org/wp-content/uploads/2016/11/NHIA-v2.2.1.pdf
Ridgeway, G. K. (1999). Generalization of boosting algorithms and applications of bayesian
inference for massive datasets (Unpublished doctoral dissertation).
Schupp, C. W., Press, D. J., & Gomez, S. L. (2014). Immigration factors and prostate cancer
survival among hispanic men in california: does neighborhood matter? Cancer, 120(9),
1401-1408.
Siegel, R. L., Miller, K. D., & Jemal, A. (2019). Cancer statistics, 2019. CA: a cancer
journal for clinicians, 69(1), 7-34.
Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2011). Regularization paths for cox's
proportional hazards model via coordinate descent. Journal of statistical software,
39(5), 1.
Sritharan, J., MacLeod, J., Harris, S., Cole, D. C., Harris, A., Tjepkema, M., . . . Demers,
P. A. (2018). Prostate cancer surveillance by occupation and industry: the canadian
census health and environment cohort (canchec). Cancer medicine, 7(4), 1468-1478.
Stamey, T. A., McNeal, J. E., Yemoto, C. M., Sigal, B. M., & Johnstone, I. M. (1999).
Biological determinants of cancer progression in men with prostate cancer. Jama,
281(15), 1395-1400.
StataCorp. (2017). Stata statistical software: Release 15. College Station, TX: StataCorp
LLC.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society: Series B (Methodological), 58(1), 267-288.
Wiens, J., & Shenoy, E. S. (2017). Machine learning for healthcare: on the verge of a major
shift in healthcare epidemiology. Clinical Infectious Diseases, 66(1), 149-153.
Wu, T. T., Gong, H., & Clarke, E. M. (2011). A transcriptome analysis by lasso penalized cox
regression for pancreatic cancer survival. Journal of Bioinformatics and Computational
Biology, 9(supp01), 63-73.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal
of the royal statistical society: series B (statistical methodology), 67(2), 301-320.
Abstract
This thesis seeks to develop a prediction risk model for estimating time from diagnosis to mortality from Prostate Cancer using various machine learning methods including Random Survival Forest, in comparison to the traditional Cox Proportional Hazards (PH) approach. Furthermore, there has been evidence of health disparities among Latinos alone and so we analyzed several demographic and clinical features among non-Latino Whites (NLW), non-Latino Blacks (NLB), and Latinos by nativity. Features also included two new variables that have not been previously studied for survival from Prostate Cancer: religious status and longest-held occupation. A total of 56,451 records from California Cancer Registry from 2004-2006 were analyzed with 5,467 deaths. Foreign-born Latinos had longer survival than NLW (Cox PH stratified on cancer stage: HR=0.84, p<0.05
Asset Metadata
Creator
Lee, Chelsea Haosin (author)
Core Title
Comparison of Cox regression and machine learning methods for survival analysis of prostate cancer
School
Keck School of Medicine
Degree
Master of Science
Degree Program
Biostatistics
Publication Date
05/05/2021
Defense Date
03/25/2020
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
machine learning, OAI-PMH Harvest, prostate cancer, survival analysis
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Lewinger, Juan Pablo (committee chair), Stern, Mariana Carla (committee chair), Liu, Lihua (committee member)
Creator Email
chelsehl@usc.edu, leechelseahaosin@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-299267
Unique identifier
UC11663912
Identifier
etd-LeeChelsea-8361.pdf (filename), usctheses-c89-299267 (legacy record id)
Legacy Identifier
etd-LeeChelsea-8361.pdf
Dmrecord
299267
Document Type
Thesis
Rights
Lee, Chelsea Haosin
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA