Identifying prognostic gene mutations in colorectal cancer with random forest survival analysis
Copyright 2020 Zihang Chen
IDENTIFYING PROGNOSTIC GENE MUTATIONS IN
COLORECTAL CANCER WITH RANDOM FOREST SURVIVAL
ANALYSIS
by
Zihang Chen
A Thesis Presented to the
FACULTY OF THE KECK SCHOOL OF MEDICINE
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(BIOSTATISTICS)
August 2020
Dedication
I dedicate this work to my beloved parents for always being supportive and loving to
me.
Acknowledgments
I would like to thank my advisor, Dr. Joshua Millstein, for being a great mentor who has
given me great help with my thesis. I am also especially grateful to my
committee members, Dr. Meredith Franklin and Dr. Wendy Mack, whose guidance has made me
a more professional scholar.
TABLE OF CONTENTS
Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Introduction
Methods
Results
Discussion
References
List of Tables
Table 1. Baseline characteristics (n=264)
Table 2. Interaction for top 10 variables in the OS model
Table 3. Interaction for top 10 variables in PFS model
List of Figures
Figure 1. Bar chart for single gene mutations in all patients
Figure 2. Group bar chart for single gene mutations in location of primary tumor
Figure 3. Group bar chart for single gene mutations in INV Response
Figure 4. Group bar chart for single gene mutations in Folfox6-BV and Folfiri-BV
Figure 5. Bar chart for two-gene combination mutations in all patients
Figure 6. Group bar chart for two-gene combination mutations in location of primary tumor
Figure 7. Group bar chart for two-gene combination mutations in INV Response
Figure 8. Group bar chart for two-gene combination mutations in Folfox6-BV and Folfiri-BV
Figure 9. Bar chart for three-gene combination mutations in all patients
Figure 10. Group bar chart for three-gene combination mutations in location of primary tumor
Figure 11. Group bar chart for three-gene combination mutations in INV Response
Figure 12. Group bar chart for three-gene combination mutations in Folfox6-BV and Folfiri-BV
Figure 13. Bar chart for four-gene combination mutations in all patients
Figure 14. Group bar chart for four-gene combination mutations in location of primary tumor
Figure 15. Group bar chart for four-gene combination mutations in INV Response
Figure 16. Group bar chart for four-gene combination mutations in Folfox6-BV and Folfiri-BV
Figure 17. Gene mutations network analysis for two-gene combinations
Figure 18. Error rate curve and importance of variables in OS model
Figure 19. Survival linear prediction in OS model
Figure 20. Error rate curve and importance of variables in PFS model
Figure 21. Survival linear prediction in PFS model
ABSTRACT
Background: A gene mutation is a permanent alteration in the DNA sequence that composes
a gene. Colorectal cancer results from gene mutations and chromosomal sequence changes in
important genes. In this thesis, I used the MAVERICC dataset to identify gene mutations that are
prognostic for survival in colorectal cancer using random survival forest.
Methods: A total of 264 patients with colorectal cancer were included in the study (164 males
and 100 females), with ages ranging from 31 to 87 years (mean 60.17 years). Genes and gene
combinations mutated in at least 5% of patients (n = 13) were summarized for all patients and
for groups defined by primary tumor site, Best of Response by Investigators (INV response),
and responders and non-responders to FOLFOX6-BV and FOLFIRI-BV. A gene mutations
network analysis was conducted to address the complexity of interrelationships between genes.
Covariates and gene mutations were the independent variables in random survival forest models
of overall survival (OS) and progression free survival (PFS). Interactions and out-of-bag error
rates were used to evaluate the models and their prediction performance.
Results: The APC, TP53 and KRAS genes were the most frequently mutated genes in all groups.
Four-gene combinations were the last gene combination group analyzed because all five-gene
combinations were present in less than 5% of patients (n = 13). Thirty-one independent variables
were eventually selected in the OS and PFS models; half of these independent variables were
interactions among pairs of main effects. The prediction error rate for the OS model was 38.62%.
The prediction error rate for the PFS model was 43.56%.
Conclusion: The APC, TP53 and KRAS genes were the most frequently mutated genes in
colorectal cancer, from single gene mutations to four-gene combinations. The model
predicted survival outcomes poorly, with a relatively high error rate. Additional models are
suggested to improve prediction accuracy.
INTRODUCTION
A gene mutation is a permanent alteration in the DNA sequence that composes a gene.
Mutations range in size from a single DNA building block to a large segment of a chromosome
that includes multiple genes (1). Cancers can be caused by DNA mutations that turn on
oncogenes or turn off tumor suppressor genes (2). There are no oncogenes or tumor suppressor
genes that are activated or deleted from all cancers. Although tumor types from one specific
organ have a tendency to share mutations in certain genes or in different genes within a single
growth-regulatory pathway, even tumors of a single organ do not show uniform genetic
alterations (3). The abnormal behavior demonstrated by cancer cells is the result of a series
of mutations in key regulatory genes (4). The cells become progressively more abnormal as more
genes become damaged (4). When cells grow out of control, gene mutations are usually needed to
cause colorectal cancer.
Colorectal cancer is a cancer that starts in the colon or rectum (5). It usually affects older
adults and begins as small noncancerous clumps of cells called polyps that form on the inside of
the colon (6). Currently, colorectal cancer is the third most common cancer in men and the
second most common cancer in women worldwide (7). Even though survival for patients with
unresectable metastatic colorectal cancer has improved over the past decade, due to the
introduction of agents targeting the Epidermal Growth Factor Receptor and the Vascular
Endothelial Growth Factor, these treatments are usually not completely curative, and intrinsic
and acquired drug resistance is frequently clinically observed (8). It is believed that many
common, low-penetrance genetic risk variants exist for colorectal cancer. These variants explain
a substantial proportion of genetic variation for colorectal cancer. Genome-wide association
studies (GWAS) have become a powerful tool to uncover genetic susceptibility factors for
complex diseases. More than 40 colorectal cancer GWAS risk locations have been identified,
which have expanded our understanding of the etiology of colorectal cancer (9).
Survival analysis and gene mutations network analysis have become popular methods to
analyze gene mutations and cancer in recent years. Survival analysis is applied to the analysis of
gene mutations in relation to cancer outcomes by analyzing the time to an event of interest (10).
By detecting a relationship between gene profiles and time to an event such as cancer recurrence
or death, a good survival model is expected to achieve accurate predictions of prognoses or
diagnoses (10). Although the common choice for survival analysis is Cox proportional hazards
regression modeling, random survival forest, a non-parametric ensemble method
constructed by bagging survival trees, has become a notable method for survival prediction
(11). Compared with other survival analysis methods, random survival forest can avoid
limitations of univariate regression approaches such as overfitting, unreliable estimation of
regression coefficients or inflated standard errors (11). It can easily deal with high dimensional
data and does not force a restrictive structure on how variables should be combined (12). In
addition, gene network approaches provide insights into the patterns of transcriptome
organization and suggest common biological functions for networked genes (13). Furthermore,
these approaches have the potential to infer the regulatory network of genes and the causality of
relationships between genes.
In this study, I aim to identify prognostic gene mutations in colorectal cancer using
random survival forest. As many cancers are caused by gene mutations, this analysis sought to
identify what genes may contribute to colorectal cancer through mutations and whether these
gene mutations affect patient survival time using random forest survival analysis. The
MAVERICC clinical dataset was used for this purpose. MAVERICC was the first prospective
study to evaluate tumor excision repair cross-complementing 1 and plasma VEGF-A as potential
biomarkers for clinical outcomes following first-line treatment with oxaliplatin and bevacizumab
(14). Descriptive statistical analysis was used to describe some basic features of the dataset and
included simple graphics analysis (15). Inferential statistical analyses tested hypotheses related to
the study objectives (15). Bar charts displayed individual gene mutations and combined-gene
mutations in all patients, and by primary tumor site, Best of Response by Investigators (INV
response), and responders and non-responders to FOLFOX6-BV and FOLFIRI-BV. Gene
mutations network analysis and random survival forest analysis of OS and PFS were conducted in
the R language to address the complexity of interrelationships among genes, to develop a
predictive model, and to evaluate model performance.
METHODS
MAVERICC Data Management
MAVERICC was a global randomized, biomarker-stratified, open-label, phase II,
multicenter study (14). MAVERICC clinical and gene mutation data were merged by patient ID
and observations with missing values were deleted.
The MAVERICC clinical dataset included age, sex, race, ECOG performance status at
baseline, primary tumor site, primary tumor resection, number of metastatic sites, KRAS status
and other basic variables. Race categories included Asian, Black or African American, Native
Hawaiian or Other Pacific Islander, American Indian or Alaska Native, and White. The primary
tumor restriction was colorectal and the
cancer type included colon cancer, rectal cancer and combined colon and rectal cancer. The
primary tumor site was classified as left or right. The left side was everything distal to the
splenic flexure, including rectal cancer, and the right side was the right colon and transverse
colon up to the splenic flexure.
The INV response measure is the best response recorded from the start of the study treatment
until the disease progression (16); four categories included complete response (CR), partial
response (PR), stable disease (SD) and progressive disease (PD).
The gene mutations dataset indicated whether or not each patient had a mutation in each
measured gene, coded as {0, 1} (1 if a mutation was present, 0 if not). To prepare for the
analysis, frequently mutated genes were identified. Gene correlations were also examined among
patients with combined-gene mutations to check whether the most frequent individual gene
mutations were also the most frequent combined-gene mutations and whether patients who had the
most frequent individual gene mutations also had the most frequent combined-gene mutations.
Analysis objectives were to evaluate associations between independent variables and
gene mutations with overall survival (OS) and progression free survival (PFS). OS was defined
as time from randomization to death from any cause (17). PFS was defined as time from
treatment initiation until disease progression or worsening (18). Unlike OS, PFS captures
disease deterioration and can be used to assess the clinical benefits of some treatments (19).
Bar Charts of Mutations by Group
Grouped bar charts display a numeric value for a set of entities split in groups and
subgroups. Bars are grouped by position for levels of one categorical variable, with different
colors indicating the secondary category level within each group (20). The values are displayed
for levels of two categorical variables within study groups instead of a basic bar chart that simply
shows frequency in an overall sample. Bar charts were created for all patients with one or more
mutations in specific genes. Genes were displayed if 5% or more of
patients had one or more mutations in them; the 20 most frequent genes are displayed in each
bar chart.
Grouped bar charts were created using the groups defined by primary tumor site, INV response,
responders and non-responders to treatment with FOLFOX6-BV, and responders and
non-responders to treatment with FOLFIRI-BV. FOLFOX6-BV and FOLFIRI-BV are two different
therapies for metastatic colorectal cancer patients. The same 5% cutoff was applied as for all
patients, and each graph showed the 20 most frequent genes. The same method was used to create
the grouped bar charts for two-gene, three-gene and four-gene
combinations.
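The counting behind these charts can be illustrated with a minimal sketch. The thesis analysis was done in R; the Python below is only an illustration under stated assumptions: the toy `patients` dictionary, the function name `frequent_combinations` and the toy cutoff of 2 patients are hypothetical, and the rule "5% of 264 patients, rounded down to 13" is inferred from the reported n = 13.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: patient ID -> set of genes with at least one mutation.
patients = {
    "P1": {"APC", "TP53", "KRAS"},
    "P2": {"APC", "TP53"},
    "P3": {"APC", "KRAS"},
    "P4": {"TP53"},
    "P5": set(),
}


def frequent_combinations(patients, k, min_patients):
    """Count k-gene combinations mutated together and keep the frequent ones."""
    counts = Counter()
    for genes in patients.values():
        # sorted() gives every combination a single canonical label.
        for combo in combinations(sorted(genes), k):
            counts[".".join(combo)] += 1
    return {combo: n for combo, n in counts.items() if n >= min_patients}


# Assumed cutoff rule: 5% of 264 patients, rounded down.
cutoff_264 = int(0.05 * 264)

# Toy cutoff of 2 patients, because the toy cohort has only 5 patients.
singles = frequent_combinations(patients, k=1, min_patients=2)
pairs = frequent_combinations(patients, k=2, min_patients=2)
print(cutoff_264, singles, pairs)
```

Setting `k` to 3 or 4 gives the three-gene and four-gene counts in the same way; the combination labels follow alphabetical order here, which may differ from the label order used in the figures.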
Gene Mutations Network Analysis
Network analysis is an analytical method that offers a potential new framework for
analysis (21). It considers how a set of nodes are connected to each other through edges.
Network analysis provides a systems-oriented perspective because it uncovers patterns and
relations among objects and gives a view on how system components are tied to a larger web of
interactions (21). Gene mutations network analysis was used in this dataset to identify the
associations between genes, which prioritizes candidate disease genes or discerns transcriptional
regulatory programs (22).
The R package “tidyr” was used to split the two-gene combinations into individual genes
and create a dataset in which each column showed one gene of the two-gene combination. I treated
the gene-by-gene matrix as a gene-similarity matrix and used undirected edges to connect nodes.
Edges are drawn as lines connecting two vertices, and nodes are genes. The R package for gene
mutations network analysis, “igraph”, was used for this purpose. I created the network graph by
specifying node degree and edge weight. Node degree is the number of connections a node has to
other nodes; the node degree distribution is the probability distribution of these degrees over
the whole network. Edge weight is the value used to determine the thickness of the edges that
connect nodes, indicating the strength of co-regulation between two genes. The network created
from node degree and edge weight provides information on which genes are most frequent.
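Node degree and edge weight can be sketched without a graph library. This is not the igraph code used in the thesis; it is a small Python illustration in which the `pair_counts` values are invented and the edge weight is assumed to be the number of patients carrying both mutations.

```python
from collections import defaultdict

# Hypothetical two-gene combinations with co-mutation counts (assumed edge weights).
pair_counts = {"APC.TP53": 5, "APC.KRAS": 3, "KRAS.TP53": 2, "APC.MLL3": 1}


def build_network(pair_counts):
    """Return node degrees and undirected edge weights from 'GENE1.GENE2' labels."""
    neighbors = defaultdict(set)
    weights = {}
    for pair, count in pair_counts.items():
        gene_a, gene_b = pair.split(".")
        neighbors[gene_a].add(gene_b)
        neighbors[gene_b].add(gene_a)
        # frozenset makes the edge undirected: {A, B} equals {B, A}.
        weights[frozenset((gene_a, gene_b))] = count
    degree = {gene: len(partners) for gene, partners in neighbors.items()}
    return degree, weights


degree, weights = build_network(pair_counts)
# Rank genes by degree, breaking ties alphabetically.
ranked = sorted(degree, key=lambda gene: (-degree[gene], gene))
print(ranked, degree)
```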
Random Forest Survival Analysis
Random forest is a nonparametric machine learning strategy that can be used for building
a risk prediction model in survival analysis. In survival settings, the predictor is an ensemble
formed by combining the results of numerous survival trees and bootstrap aggregation (23). For
this analysis, the dataset was randomly split into training and testing datasets. A common ratio
of 70% training dataset and 30% testing dataset was used. Separating training and testing
datasets helps detect overfitting that can go unnoticed when only one dataset is used.
Performance on the testing dataset reflects how a model would perform on data it has never seen
before; a model may simply memorize the training examples and fail completely on testing
examples that it has never seen (24). The testing dataset is important because it can help us
choose the best model and guard against overfitting.
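A 70/30 random split can be sketched as follows. The seed, the patient ID format and the function name `train_test_split` are assumptions made for illustration; the thesis does not report how the split was implemented in R.

```python
import random


def train_test_split(ids, train_fraction=0.7, seed=2020):
    """Shuffle the IDs reproducibly and cut them into training and testing sets."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical IDs for a cohort of 264 patients.
patient_ids = [f"P{i:03d}" for i in range(1, 265)]
train_ids, test_ids = train_test_split(patient_ids)
print(len(train_ids), len(test_ids))
```

With 264 patients this rounding gives 185 training and 79 testing patients.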
The R package “randomSurvivalForest” was used. In random survival forests, the
ensemble is constructed by aggregating tree-based Nelson-Aalen estimators (25). In each
terminal node of a tree, the conditional cumulative hazard function at time t is estimated using
the Nelson-Aalen estimate using the in-bag data (24)
\[
\hat{H}_b(t \mid x) = \int_0^t \frac{\tilde{N}_b^{*}(ds, x)}{\tilde{Y}_b^{*}(s, x)}
\]
where \(\tilde{N}_b^{*}(s, x)\) counts the uncensored events until time s and
\(\tilde{Y}_b^{*}(s, x)\) is the number at risk at time s.
The ensemble survival function at time t of random survival forest is
\[
\hat{S}^{\mathrm{rsf}}(t \mid x) = \exp\left(-\frac{1}{B}\sum_{b=1}^{B}\hat{H}_b(t \mid x)\right)
\]
where B is the number of bootstrap samples; a survival tree is grown on the data of each
bootstrap sample b = 1, ..., B.
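The two formulas can be checked on a small example. The sketch below is a plain Python illustration, not the randomSurvivalForest implementation: each "tree" is represented only by the hypothetical in-bag times and event indicators that fall in the terminal node of x, and the function names are made up for this example.

```python
import math


def nelson_aalen(times, events, t):
    """Cumulative hazard at t: sum over event times s <= t of d(s) / Y(s)."""
    hazard = 0.0
    event_times = sorted({ti for ti, ei in zip(times, events) if ei == 1 and ti <= t})
    for s in event_times:
        d = sum(1 for ti, ei in zip(times, events) if ti == s and ei == 1)
        at_risk = sum(1 for ti in times if ti >= s)  # still under observation at s
        hazard += d / at_risk
    return hazard


def ensemble_survival(node_samples, t):
    """exp(-average of the per-tree cumulative hazards), as in the ensemble formula."""
    hazards = [nelson_aalen(times, events, t) for times, events in node_samples]
    return math.exp(-sum(hazards) / len(hazards))


# Hypothetical terminal-node data for B = 2 trees: (times, events), 1 = event.
tree_1 = ([2, 3, 5, 8], [1, 0, 1, 1])
tree_2 = ([1, 4, 4, 9], [1, 1, 0, 0])

h1 = nelson_aalen(*tree_1, t=5)  # 1/4 + 1/2
h2 = nelson_aalen(*tree_2, t=5)  # 1/4 + 1/3
s = ensemble_survival([tree_1, tree_2], t=5)
print(h1, h2, s)
```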
In random forest, variables that have high importance are drivers of the outcome and their
values have a substantial impact on the outcome values. Variables with low importance can be
omitted from the model. Because there were in total 85 variables in the dataset, only independent
variables that had positive importance were selected, to simplify the model and
improve prediction. Subsequent steps checked interactions to identify how paired variables
affected the outcomes. Finally, the testing data were used to predict survival over
OS time and PFS time, and the out-of-bag error was checked to evaluate the practicality of the
model.
Random survival trees and forests are widely applied in time-to-event analysis and can
automatically detect covariate interactions without specifying them beforehand (26). In
principle, a random survival forest predicts new data by aggregating the predictions of its
trees. For each bootstrap iteration and its tree, the prediction error was estimated using data
not in the bootstrap sample (the out-of-bag data). It is important to check the out-of-bag error
to assess whether the predictions are accurate and whether the model is a good fit for the
dataset.
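The thesis does not state how the error rate is computed. In random survival forest software it is commonly reported as one minus Harrell's concordance index, and the sketch below assumes that definition. The data and the function name are hypothetical, the risk scores stand in for out-of-bag ensemble predictions, and ties in survival time are ignored for simplicity.

```python
def concordance_index(times, events, risk):
    """Harrell's C for right-censored data (simplified: tied times are skipped)."""
    concordant = 0.0
    permissible = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable when subject i has an observed event before time j.
            if events[i] == 1 and times[i] < times[j]:
                permissible += 1
                if risk[i] > risk[j]:
                    concordant += 1.0  # higher predicted risk failed earlier
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / permissible


# Hypothetical out-of-bag predictions for four patients (1 = event, 0 = censored).
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]

c_index = concordance_index(times, events, risk)
error_rate = 1 - c_index
print(c_index, error_rate)
```

Under this definition an error rate of 50% corresponds to random guessing and 0% to perfect ranking of patients by risk.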
Results
Baseline Characteristics
Baseline characteristics for the sample of 264 patients are presented in Table 1. The mean
age was 60.17 years, with a range of 31 to 87 years. Males represented 63% of
patients (n = 164) and females 37% (n = 100). Most patients were White (83%, n = 218)
and not Hispanic or Latino (81%, n = 215). A total of 143
patients had an ECOG score of 0 and 120 patients had an ECOG score of 1. Wild-type KRAS status
was found in 57% of patients (n = 150), and 72% of patients (n = 191) had colon cancer.
Overall, 36% of patients (n = 95) had 2 metastatic sites, the most common number of
metastatic sites in the range from 1 to 5. The mean overall survival was 18.62 months and the
median was 18.79 months. The mean progression free survival was 10.84 months and the
median was 9.45 months.
Table 1. Baseline characteristics (n=264)
Characteristic                                   Value
Age, mean (range)                                60.17 (31, 87)
Sex
  Male                                           164 (63%)
  Female                                         100 (37%)
Race
  American Indian or Alaska Native               3 (1%)
  Asian                                          9 (3%)
  Black or African American                      25 (9%)
  Native Hawaiian or Other Pacific Islander      2 (1%)
  White                                          218 (83%)
  Not available                                  7 (3%)
Ethnicity
  Hispanic or Latino                             40 (15%)
  Not Hispanic or Latino                         215 (81%)
  Not available                                  9 (4%)
ECOG Score
  0                                              143 (54%)
  1                                              120 (45%)
  Not Done                                       1 (1%)
KRAS Status
  Mutant                                         96 (36%)
  Wild-type                                      150 (57%)
  Unknown                                        18 (7%)
Cancer Type
  Colon and Rectal Cancer                        10 (4%)
  Colon Cancer                                   191 (72%)
  Rectal Cancer                                  63 (24%)
Primary tumor resected
  Colorectal                                     16 (6%)
No. of Metastatic Sites
  1                                              77 (29%)
  2                                              95 (36%)
  3                                              67 (25%)
  4                                              17 (7%)
  5                                              8 (3%)
Overall Survival Months
  Mean                                           18.62
  Median                                         18.79
Progression Free Survival Months
  Mean                                           10.84
  Median                                         9.45
ECOG, Eastern Cooperative Oncology Group;
KRAS, V-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog.
Grouped Bar Charts of Individual and Combined Gene Mutations
1) Single Gene Mutations
Figure 1. Bar chart for single gene mutations in all patients
[Bar chart, panel "All patients". x-axis: Gene; y-axis: Count. Genes shown, in axis order: APC, TP53, KRAS, MLL3, LRP1B, PIK3CA, SMAD4, FAT3, MLL2, SPTA1, FAT1, ARID1B, ATM, BRCA2, ARID1A, BRAF, POLE, PRKDC, SOX9, GNAS. Bar labels, in extraction order: 221, 206, 129, 56, 51, 51, 50, 45, 45, 43, 42, 40, 40, 38, 35, 34, 34, 34, 34, 32.]
Figure 2. Group bar chart for single gene mutations in location of primary tumor
[Grouped bar chart, panel "Location of Primary Tumor". x-axis: Gene (the same 20 genes as in Figure 1); y-axis: Count; groups: left, right.]
Figure 3. Group bar chart for single gene mutations in INV Response
[Grouped bar chart, panel "Responders vs Non-responders". x-axis: Gene (the same 20 genes as in Figure 1); y-axis: Count; groups: response, nonresponse.]
Figure 4. Group bar chart for single gene mutations in Folfox6-BV and Folfiri-BV
[Two grouped bar charts, panels "Responders vs Non-responders" for folfox6bevacizumab and folfiribevacizumab. x-axis: Gene (the same 20 genes as in Figure 1); y-axis: Count; groups: response, nonresponse.]
A total of 387 individual genes were measured in the study. Overall, 88 genes were
selected for analysis because they were mutated in 5% or more of the 264 patients (n = 13).
Each bar chart shows the 20 most frequently mutated genes. APC, TP53 and KRAS were the most
frequently mutated genes in Figures 1 to 4, and their mutation counts were far higher
than those of other single genes. For each gene, mutation counts were generally higher in
left-sided tumors, INV responders, Folfox6-BV responders and Folfiri-BV responders than in
right-sided tumors, INV non-responders, Folfox6-BV non-responders and
Folfiri-BV non-responders.
2) Two-gene combinations
Figure 5. Bar chart for two-gene combination mutations in all patients
[Bar chart, panel "All patients". x-axis: Gene; y-axis: Count. Combinations shown, in axis order: APC.TP53, APC.KRAS, KRAS.TP53, APC.MLL3, APC.PIK3CA, APC.LRP1B, MLL3.TP53, APC.SMAD4, APC.SPTA1, APC.FAT1, APC.FAT3, FAT3.TP53, LRP1B.TP53, TP53.SMAD4, APC.ARID1B, TP53.SPTA1, APC.ATM, APC.MLL2, KRAS.PIK3CA, MLL2.TP53. Bar labels, in extraction order (not matched to combinations): 118, 44, 34, 37, 49, 172, 36, 37, 39, 45, 38, 34, 93, 34, 37, 34, 37, 44, 37, 35.]
Figure 6. Group bar chart for two-gene combination mutations in location of primary tumor
[Grouped bar chart, panel "Location of Primary Tumor". x-axis: Gene (the same 20 two-gene combinations as in Figure 5); y-axis: Count; groups: left, right.]
Figure 7. Group bar chart for two-gene combination mutations in INV Response
[Grouped bar chart, panel "Responders vs Non-responders". x-axis: Gene (the same 20 two-gene combinations as in Figure 5); y-axis: Count; groups: response, nonresponse.]
Figure 8. Group bar chart for two-gene combination mutations in Folfox6-BV and Folfiri-BV
[Two grouped bar charts, panels "Responders vs Non-responders" for folfox6bevacizumab and folfiribevacizumab. x-axis: Gene (the same 20 two-gene combinations as in Figure 5); y-axis: Count; groups: response, nonresponse.]
A total of 144 two-gene mutation combinations were selected, each showing a frequency in more
than 13 patients. The APC.TP53, APC.KRAS and KRAS.TP53 combinations, which combine the top 3
individual genes, were the most frequent in Figures 5 to 8. The frequency of
APC.TP53, APC.KRAS and KRAS.TP53 gene mutations was far higher than that of other two-gene
combinations. For each combination, the number of gene mutations was generally higher in
left-sided tumors, INV responders, Folfox6-BV responders and Folfiri-BV responders than in
right-sided tumors, INV non-responders, Folfox6-BV non-responders and Folfiri-BV
non-responders. Overall, the two-gene combinations within
groups showed results similar to those for individual gene mutations.
3) Three-gene combinations
Figure 9. Bar chart for three-gene combination mutations in all patients
[Bar chart, panel "All patients". x-axis: Gene; y-axis: Count. Combinations shown, in axis order: APC.KRAS.TP53, APC.MLL3.TP53, APC.TP53.SPTA1, APC.LRP1B.TP53, APC.KRAS.PIK3CA, APC.FAT3.TP53, APC.KRAS.MLL3, APC.KRAS.SMAD4, APC.TP53.PIK3CA, APC.TP53.SMAD4, APC.TP53.FAT1, APC.KRAS.FAT1, APC.MLL2.TP53, APC.TP53.ARID1B, KRAS.MLL3.TP53, APC.KRAS.FAT3, APC.KRAS.LRP1B, APC.KRAS.SPTA1, APC.MLL.TP53, KRAS.FAT3.TP53. Bar labels, in extraction order (not matched to combinations): 22, 22, 29, 83, 25, 28, 31, 22, 32, 25, 30, 22, 39, 25, 26, 27, 27, 33, 22, 24.]
Figure 10. Group bar chart for three-gene combination mutations in location of primary tumor
[Grouped bar chart, panel "Location of Primary Tumor". x-axis: Gene (the same 20 three-gene combinations as in Figure 9); y-axis: Count; groups: left, right.]
Figure 11. Group bar chart for three-gene combination mutations in INV Response
[Grouped bar chart, panel "Responders vs Non-responders". x-axis: Gene (the same 20 three-gene combinations as in Figure 9); y-axis: Count; groups: response, nonresponse.]
Figure 12. Group bar chart for three-gene combination mutations in Folfox6-BV and Folfiri-BV
[Two grouped bar charts, panels "Responders vs Non-responders" for folfox6bevacizumab and folfiribevacizumab. x-axis: Gene (the same 20 three-gene combinations as in Figure 9); y-axis: Count; groups: response, nonresponse.]
A total of 64 three-gene combinations were selected, each showing a frequency of more than 13
patients. The APC.KRAS.TP53 combination, which combines the top 3 individual genes, was the
most frequent combination in Figures 9 to 12. The frequency of
APC.KRAS.TP53 was far higher than that of other three-gene combinations. Overall, the results
were similar to those for individual gene mutations. We conclude that the most frequently
mutated individual genes were also the most frequent in the two-gene and three-gene
combinations.
4) Four-gene combinations
Figure 13. Bar chart for four-gene combination mutations in all patients
[Bar chart, panel "All patients". x-axis: Gene; y-axis: Count. Combinations shown, in axis order: APC.KRAS.MLL3.TP53, APC.KRAS.TP53.SPTA1, APC.KRAS.FAT3.TP53, APC.KRAS.TP53.FAT1, APC.KRAS.TP53.SMAD4, APC.KRAS.TP53.PIK3CA, APC.KRAS.LRP1B.TP53, APC.KRAS.TP53.ARID1B. Bar labels, in extraction order (not matched to combinations): 14, 18, 24, 14, 17, 17, 16, 19.]
Figure 14. Group bar chart for four-gene combination mutations in location of primary tumor
[Grouped bar chart, panel "Location of Primary Tumor". x-axis: Gene (the same 8 four-gene combinations as in Figure 13); y-axis: Count; groups: left, right.]
Figure 15. Group bar chart for four-gene combination mutations in INV Response
[Grouped bar chart, panel "Responders vs Non-responders". x-axis: Gene (the same 8 four-gene combinations as in Figure 13); y-axis: Count; groups: response, nonresponse.]
Figure 16. Group bar chart for four-gene combination mutations in Folfox6-BV and Folfiri-BV
[Two grouped bar charts, panels "Responders vs Non-responders" for folfox6bevacizumab and folfiribevacizumab. x-axis: Gene (the same 8 four-gene combinations as in Figure 13); y-axis: Count; groups: response, nonresponse.]
Only 8 four-gene combinations were selected, each showing mutations in at least 13
patients. The result differed from that for the three-gene combinations.
APC.KRAS.MLL3.TP53 was the most frequent gene combination in Figures 13, 14 and 15 and in the
Folfiri-BV panel of Figure 16. For all combinations, counts were higher in non-responders than
in responders for FOLFIRI-BV, but the result was reversed for FOLFOX6-BV. The
APC.KRAS.TP53.SPTA1 combination appeared most frequently in FOLFOX6-BV, but less frequently in
FOLFIRI-BV. The APC, TP53 and KRAS genes were still the most prominent genes in four-gene
combinations. Four-gene combinations were the last combination group evaluated, because every
five-gene combination was present in fewer than 5% of patients (n = 13).
Gene Mutations Network Analysis
Figure 17. Gene mutations network analysis for two-gene combinations
The 144 gene pairs involved a total of 70 individual genes. In
Figure 17, APC had the highest node degree, with 67 connections. TP53 had the second highest
node degree, with 46 connections, and KRAS had the third highest, with 26
connections. These three genes had far more connections than the other genes, all of which had
fewer than 10 connections. The probabilistic weight of an
edge is derived from the similarity measure between the two gene vectors and reflects the
probability that these two genes are mates (27).
Random Forest Survival
OS Model
The 264 patients were split into a 70% training dataset (n = 185) and a 30% testing
dataset (n = 79). The survival objects for OS were OS months and OS censoring. The survival
objects for PFS were PFS months and PFS censoring. Initially, 96 independent variables were
included in both the OS model and the PFS model: 88 individual gene mutations and 8
covariates. Only about one third of these variables had positive importance in the OS and PFS
models; only the independent variables with positive importance were retained in order to
decrease the error rate.
Eventually, 31 variables were selected in the OS model. The number of deaths in the
training dataset was 74. The error rate was 32.52%. The number of trees was 1000. In
Figure 18, the error rate was lowest when the number of trees was around 450. Only
6 variables had negative importance once these 31 variables were selected.
Figure 18. Error rate curve and importance of variables in OS model
Table 2. Interaction for top 10 variables in the OS model
Variable pair (Var1:Var2)                     Var1     Var2     Paired   Additive  Difference
No.of.Metastatic.Sites:Baseline.ECOG.Score    0.0213   0.0195   0.0465   0.0409    0.0056
Baseline.ECOG.Score:MLL3                      0.0186   0.0106   0.0326   0.0292    0.0034
APC:NOTCH2                                    0.0094   0.0023   0.0148   0.0117    0.0031
Baseline.ECOG.Score:KRAS                      0.0186   0.0025   0.0240   0.0211    0.0029
Ethnicity:BRAF                                0.0113   0.0034   0.0173   0.0147    0.0025
MLL3:FBXW7                                    0.0109   0.0013   0.0147   0.0122    0.0025
APC:BRAF                                      0.0094   0.0045   0.0161   0.0139    0.0023
FAT3:RET                                      0.0058   0.0026   0.0103   0.0084    0.0020
Ethnicity:EPHB4                               0.0113   0.0002   0.0134   0.0114    0.0020
PTEN:RET                                      0.0082   0.0027   0.0128   0.0109    0.0019
The interaction between each pair of independent variables was evaluated. Paired is the
importance of the two variables considered jointly. Additive is the sum of the two variables'
individual importance values. Difference is the paired value minus the additive value; a
positive difference means that the pair was more important than variable 1 and variable 2
taken separately, which indicates an interaction effect. Of the 465 variable pairs, 220 had
interaction effects. The mean additive value was 0.0074 and the mean paired value was 0.0075.
Table 2 shows the 10 variable pairs with the largest interaction effects. Variables that had
high importance individually also had high interaction effects when paired.
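The arithmetic behind these quantities can be checked with a short sketch. The Paired and Additive values are copied from the first three rows of Table 2; the constant and variable names are introduced only for this example.

```python
from math import comb

N_SELECTED = 31

# Every unordered pair of the 31 selected variables: 31 * 30 / 2 = 465.
n_pairs = comb(N_SELECTED, 2)

# (pair, paired importance, additive importance) from Table 2.
rows = [
    ("No.of.Metastatic.Sites:Baseline.ECOG.Score", 0.0465, 0.0409),
    ("Baseline.ECOG.Score:MLL3", 0.0326, 0.0292),
    ("APC:NOTCH2", 0.0148, 0.0117),
]

# Difference = Paired - Additive; a positive value suggests an interaction.
differences = {
    pair: round(paired - additive, 4) for pair, paired, additive in rows
}

# Share of pairs with a positive difference in the OS model (220 of 465).
share_interacting = 220 / n_pairs

print(n_pairs, differences, round(share_interacting, 3))
```

The count of 465 pairs follows directly from having 31 selected variables, and roughly 47% of those pairs showed a positive difference in the OS model.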
Figure 19. Survival linear prediction in OS model
The testing dataset (n = 79) was then used for prediction. The number of deaths was 41
and the error rate was 38.62%. In Figure 19, the lowest predicted survival rate was 0.35, at
an OS time of 42.62 months. The red line shows the predicted survival rate, and the black
lines show the range within which the survival rate could fall.
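For readers unfamiliar with this error measure, the sketch below assumes that the reported error rate is one minus Harrell's concordance index, which is the convention in common random survival forest software; the thesis text above does not restate the definition. The toy times, event indicators, and risk scores are hypothetical.

```python
def concordance_index(times, events, risk):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when patient i had the event and
    patient j was still under observation after that time. It is
    concordant when the patient who died earlier has the higher risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable


# Hypothetical toy data: months, event indicator (1 = death), predicted risk.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.5, 0.1]

c_index = concordance_index(times, events, risk)
error_rate = 1 - c_index

print(round(c_index, 2), round(error_rate, 2))
```

Under that assumption, an error rate of 0.5 corresponds to random ranking, and an error rate of 38.62% corresponds to a concordance of about 0.61.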
PFS Model
In the PFS model, 31 variables were selected. The number of deaths in the training
dataset was 137, the error rate was 37.9%, and the number of trees was 1000. Figure 20 shows
that the error rate was lowest when the number of trees was around 580 and around 780. Once
the 31 variables were selected, only 5 of them had negative importance.
Figure 20. Error rate curve and importance of variables in PFS model
Table 3. Interaction for top 10 variables in PFS model
Variable pair (Var1:Var2) Var1 Var2 Paired Additive Difference
TP53:PARP4 0.0090 0.0029 0.0139 0.0119 0.0020
TP53:FBXW7 0.0090 0.0004 0.0113 0.0094 0.0019
POLE:MLL2 0.0105 -0.0020 0.0103 0.0085 0.0018
POLE:MLL 0.0105 0.0032 0.0154 0.0137 0.0018
TP53:SPTA1 0.0090 -0.0026 0.0080 0.0064 0.0016
Ethnicity:TP53 0.0092 0.0106 0.0212 0.0198 0.0014
Ethnicity:Age.at.Baseline..years. 0.0092 0.0035 0.0141 0.0127 0.0014
POLE:BRAF 0.0105 0.0040 0.0158 0.0145 0.0013
POLE:FBXW7 0.0105 0.0008 0.0126 0.0113 0.0013
POLE:Ethnicity 0.0105 0.0085 0.0203 0.0190 0.0012
Of the 465 variable pairs, 206 had interaction effects. The mean additive value was
0.0057 and the mean paired value was 0.0056. Table 3 shows the 10 variable pairs with the
largest interaction effects. These interactions differed from those identified in the OS
model. In the PFS model, the largest interaction effects usually arose when a variable with
high individual importance was paired with a variable of lower importance.
Figure 21. Survival linear of prediction in PFS model
In the testing dataset, the sample size was 79, the number of deaths was 52, and the
error rate was 43.56%. In Figure 21, the lowest predicted survival rate was 0.10, at a PFS
time of 35.32 months. The red line shows the predicted survival rate, and the black lines
show the range within which the survival rate could fall.
Discussion
Genome-wide association studies have identified several common genetic markers that
are significantly associated with colorectal cancer (28). A very small proportion of colorectal
cancers are caused by inherited gene mutations. Further mutations may accumulate in other genes
with age, which can lead cells to grow out of control. Familial adenomatous polyposis (FAP) is
inherited in an autosomal dominant manner through a germline mutation in the APC gene (29). The
APC gene is a tumor suppressor that helps keep cell growth in check. When patients have
inherited changes in the APC gene, this inhibition is lost. In this analysis, the APC gene had
the highest number of mutations among all of the 387 gene mutations measured. Even among the
gene-combination mutations, the most frequent combinations included the APC gene. These
results are consistent with the conclusions of most research articles (28)(29).
The results also showed that mutations of the TP53 gene and the KRAS gene were far
more frequent than those of other genes, which suggests that these two genes are also
important in colorectal cancer. Approximately half of colorectal cancers show TP53 gene
mutations, with higher frequencies observed in distal colon and rectal tumors and lower
frequencies in proximal tumors and those with the microsatellite instability or methylator
phenotypes (30). Evidence shows that TP53 gene mutations are associated with clinical
features, including prognosis and response to therapy, and that TP53 is usually required for
the response of colorectal cancers to FOLFOX6-BV and FOLFIRI-BV (30). The KRAS gene is one of
the most frequently mutated oncogenes in cancer and is mutated in 35% of colon cancers (31).
It is one of the most significant targets in cancer drug development. Although mutations in
the TP53 and KRAS genes are present in a large proportion of colorectal cancers, they are
probably not involved in the primary initiating events, in contrast to mutations of the APC
gene.
Gene network analysis is a powerful technique for multigene analysis of large-scale
datasets (32). In the present study, gene network analysis clearly indicated the interactions
and structure between genes. The result was consistent with the mutation frequencies shown in
the grouped bar charts, demonstrating that the APC, TP53 and KRAS genes had more
interrelationships with other genes.
The predictive performance of the random forest survival models was less than
satisfactory. Although random forest survival has been identified as an appropriate model for
analyzing survival data, there is not much literature that confirms this assertion (12). The
prediction error rate in the testing dataset was 38.62% for the OS model and 43.56% for the
PFS model, which is unacceptably high. Given these outcomes, I suggest fitting a Cox
proportional hazards model and comparing its predictions with those of the random forest
survival model. The Cox proportional hazards model is the established model commonly used in
medical research for investigating the association between patient survival time and one or
more predictor variables. The Cox proportional hazards model had better predictive performance
than the random forest survival model in the presence of covariates that satisfy the
proportional hazards assumption (12). The random forest survival model is more robust for
complex survival functions and maintains a low out-of-bag prediction error rate when
covariates have non-proportional hazard associations (33).
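For reference, the proportional hazards assumption mentioned above can be written out in the standard textbook form of the Cox model. The symbols (baseline hazard $h_0$, coefficient vector $\beta$, covariate vectors $x_a$ and $x_b$ for two patients) are notation introduced here, not taken from the thesis.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right),
\qquad
\frac{h(t \mid x_a)}{h(t \mid x_b)}
  = \exp\!\left(\beta^{\top}(x_a - x_b)\right)
```

The hazard ratio on the right does not depend on $t$. When a covariate's effect changes over follow-up time, this ratio is no longer constant, and that is the setting in which a random survival forest is reported to be more robust.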
In conclusion, further research on the use and performance of random forest survival
models in gene mutation datasets is recommended. Although we did not obtain acceptable
predictive performance, the random survival forest approach is innovative and has potential to be
applied to other datasets and in other settings.
References
1. U.S National Library of Medicine Genetics Home Reference, What is a gene mutation
and how do mutations occur. 2020. Available from
https://ghr.nlm.nih.gov/primer/mutationsanddisorders/genemutation.
2. The American Cancer Society Medical and Editorial Content Team. What Causes
Colorectal Cancer. 2018. Available from
https://www.cancer.org/cancer/colon-rectal-cancer/causes-risks-prevention/what-causes.html#:~:text=A%20very%20small%20portion%20of,changes%20in%20the%20APC%20gene.
3. Boland R.C, Ricciardiello L. How many mutations does it take to make a tumor. Proc
Natl Acad Sci USA 1999;96(26):14675-14677.
4. Emory Winship Cancer Institute Cancer Quest. Mutation. 2020.
5. The American Cancer Society Medical and Editorial Content Team. What Causes
Colorectal Cancer. 2018. Available from
https://www.cancer.org/cancer/colon-rectal-cancer/about/what-is-colorectal-cancer.html.
6. Mayo Clinic. Colon Cancer. 2020. Available from
https://www.mayoclinic.org/diseases-conditions/colon-cancer/symptoms-causes/syc-20353669.
7. Armaghany T, Wilson J.D, Chu Q, Mills G. Genetic alterations in colorectal cancer.
Gastrointest Cancer Res 2012;5(1):19-27.
8. Donnard E, Asprino P.F, Correa B.R, Bettoni F, Koyama F.C, et al. Mutational analysis
of genes coding for cell surface proteins in colorectal cancer cell lines reveal novel
altered pathways, druggable mutations and mutated epitopes for targeted therapy.
Oncotarget 2014;5(10):9199-9213.
9. Zeng C, Matsuda K, Jia W, Chang J, Kweon S. Identification of susceptibility loci and
genes for colorectal cancer risk. Gastroenterology 2016;150(7):1633-1645.
10. Zhang W, Ota T, Shridhar V, et al. Network-based survival analysis reveals subnetwork
signatures for predicting outcomes of ovarian cancer treatment. PLoS Comput Biol
2013;9(3):e1002975.
11. Wang H, Li G. A selective review on random survival forests for high dimensional data.
Quant Biosci 2017;36(2):85-96.
12. Nasejje J.B, Mwambi H. Application of random survival forests in understanding the
determinants of under-five child mortality in Uganda in the presence of covariates that
satisfy the proportional and non-proportional hazards assumption. BMC Res Notes
2017;(10):459.
13. Li L, Briskine R, Schaefer R, et al. Co-expression network analysis of duplicate genes in
maize (Zea mays L.) reveals no subgenome bias. BMC Genomics 2016;(17):875.
14. Parikh A, Lee F, Yau L, et al. MAVERICC, a Randomized, Biomarker-stratified, Phase
II Study of mFOLFOX6-Bevacizumab Versus FOLFIRI-Bevacizumab as First-line
Chemotherapy in Metastatic Colorectal Cancer. Clin Cancer Res 2019;25(10):2988-2995.
15. Trochim W M.K. Research Methods Knowledge Base. 2020: Available from:
https://conjointly.com/kb/.
16. Eisenhauer E.A, Therasse P, Bogaerts J, et al. New response evaluation criteria in solid
tumours: Revised RECIST guideline (version 1.1). European Journal of Cancer
2009;228-247.
17. Cheema P.K, Burkes R.L. Overall survival should be the primary endpoint in clinical
trials for advanced non-small-cell lung cancer. Curr Oncol 2013; 20(2):e150-e160.
18. Hess M.L, Brnabic A, Mason O, Lee P, Barker S. Relationship between Progression-free
Survival and Overall Survival in Randomized Clinical Trials of Targeted and Biologic
Agents in Oncology. J Cancer 2019;10(16):3717-3727.
19. Korn R.L, Crowley J.J. Overview: Progression-Free Survival as an Endpoint in Clinical
Trials with Solid Tumors. Clin Cancer Res 2013;19(10):2607-2612.
20. Yi M. A Complete Guide to Grouped Bar Charts. Chartio 2019. Available from:
https://chartio.com/learn/charts/grouped-bar-chart-complete-guide/.
21. Fath B.D, Scharler U.M. Systems Ecology: Ecological Network Analysis. Encyclopedia
of Ecology 2008;1083-1088.
22. Van Dam S, Vosa U, Van Der Graaf A, Franke L, Pedro De Magalhaes J. Gene co-
expression analysis for functional classification and gene-disease predictions. Briefings in
Bioinformatics, 2018;19(4):575-592.
23. Mogensen U.B, Ishwaran H, Gerds T.A. Evaluating Random Forests for Survival
Analysis Using Prediction Error Curves. J Stat Softw 2012;50(11):1-23.
24. Draelos R. Best Use of Train/Val/Test Splits, with Tips for Medical Data. Glass Box
Machine Learning 2019. Available from:
https://glassboxmedicine.com/2019/09/15/best-use-of-train-val-test-splits-with-tips-for-medical-data/.
25. Ishwaran H, Kogalur U.B, Blackstone E.H, Lauer M.S. Random Survival Forests. The
Annals of Applied Statistics 2008;2(3):841-860.
26. Nasejje J.B, Mwambi H, Dheda K, Lesosky M. A comparison of the conditional
inference survival forest model to random survival forests based on a simulation study as
well as on two applications with time-to-event data. 2017;17:115.
27. Kerr G, Perrin D, Ruskin H.J, Crane M. Edge Weighting of Gene Expression Graphs.
ResearchGate 2009;9:45.
28. Cho Y.A, Lee J, Oh J.H, et al. Genetic risk score, combined lifestyle factors and risk of
colorectal cancer. Cancer Res Treat 2019;51(3):1033-1040.
29. Bogaert J, Prenen H. Molecular genetics of colorectal cancer. Annals of Gastroenterology
2014;27:9-14.
30. Iacopetta B. TP53 mutation in colorectal cancer. Human Mutation 2003;21:271-276.
31. Porru M, Pompili L, Caruso C, Biroccio A, Leonetti C. Targeting KRAS in metastatic
colorectal cancer: current strategies and emerging opportunities. Journal of Experimental
& Clinical Cancer Research 2018;37:57.
32. Tang J, Kong D, Cui Q, et al. Prognostic genes of breast cancer identified by gene co-
expression network analysis. Front Oncol 2018;8:374.
33. Ehrlinger J, Rajeswaran J, Blackstone EH. ggrandomforests: exploring random forest
survival. R Vignette; 2016.
Abstract
Background: A gene mutation is a permanent alteration in the DNA sequence that composes a gene. Colorectal cancer results from gene mutations and chromosomal sequence changes in important genes. In this thesis, I used the MAVERICC dataset to identify gene mutations that are prognostic for survival in colorectal cancer using random survival forest.
Methods: A total of 264 patients with colorectal cancer were included in the study (164 males and 100 females), with ages ranging from 31 to 87 years and a mean age of 60.17 years. Gene mutations present in at least 5% of patients (n = 13) were examined in all patients and in groups defined by primary tumor site, Best of Response by Investigators (INV response), and responders and non-responders for FOLFOX6-BV and FOLFIRI-BV. A gene mutation network analysis was conducted to deal with the complexity of interrelationships between genes. Covariates and gene mutations were the independent variables in random survival forest models of overall survival (OS) and progression-free survival (PFS). Interactions and out-of-bag error rates were used to evaluate the models and their prediction performance.
Results: The APC, TP53 and KRAS genes were the most frequently mutated genes in all groups. Four-gene combinations were the last gene combination group analyzed because all five-gene combinations were present in less than 5% of patients (n = 13). Thirty-one independent variables were eventually selected in the OS and PFS models
Asset Metadata
Creator: Chen, Zihang (author)
Core Title: Identifying prognostic gene mutations in colorectal cancer with random forest survival analysis
School: Keck School of Medicine
Degree: Master of Science
Degree Program: Biostatistics
Publication Date: 07/30/2020
Defense Date: 06/23/2020
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: colorectal cancer, OAI-PMH Harvest, prognostic gene mutations, random forest survival analysis
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Millstein, Joshua (committee chair), Franklin, Meredith (committee member), Mack, Wendy (committee member)
Creator Email: curryczh@gmail.com, zihangc@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-350396
Unique identifier: UC11663233
Identifier: etd-ChenZihang-8801.pdf (filename), usctheses-c89-350396 (legacy record id)
Legacy Identifier: etd-ChenZihang-8801.pdf
Dmrecord: 350396
Document Type: Thesis
Rights: Chen, Zihang
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA