ANALYSIS OF FACTORS ASSOCIATED WITH BREAST CANCER
USING MACHINE LEARNING TECHNIQUES
by
CHUHAN ZHANG
A Thesis Presented to the
FACULTY OF THE USC KECK SCHOOL OF MEDICINE
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(APPLIED BIOSTATISTICS AND EPIDEMIOLOGY)
December 2023
Copyright 2023 CHUHAN ZHANG
ACKNOWLEDGEMENTS
I would like to express my deep gratitude to my thesis advisor, Ming Li, for her invaluable
guidance, unwavering support, and constructive feedback throughout the research process. Her
expertise and dedication have been instrumental in defining the direction and quality of this
work. Professor Li has given me a great deal of help in both life and study and has always
supported my decisions; her encouragement, both personal and academic, carried me through
very difficult moments.
I am also grateful to Professor Kimberly Siegmund and Professor Jin Piao for their
encouragement and motivation to strive for excellence. I would like to thank the Keck School of
Medicine of the University of Southern California for providing the resources and a supportive
academic environment that allowed me to complete this project from start to finish.
Additionally, I am sincerely grateful to my family for their endless encouragement, patience, and
belief in my abilities. Their continued support has been a source of strength for me.
Finally, I thank all those who contributed to the completion of this thesis. Your assistance was
vital to this endeavor.
Chuhan Zhang
September 2023
TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
Chapter 1: Introduction
  1.1 Current Situation of Breast Cancer
  1.2 Research Objectives
Chapter 2: Machine Learning Algorithms in Medical Research
  2.1 Foundations of Machine Learning Algorithms
  2.2 Applications in Healthcare and Biomedicine
  2.3 Application of Machine Learning in Breast Cancer Analysis
Chapter 3: Data Preprocessing and Modeling Method
  3.1 Data Collection
  3.2 Data Preprocessing
  3.3 Theory and Application of Machine Learning
  3.4 Performance Evaluation
Chapter 4: Results and Analysis
  4.1 Feature Engineering
  4.2 Model Selection and Performance Metrics
  4.3 Interpreting AUC/ROC Results
  4.4 Feature Importance
Chapter 5: Discussion
  5.1 Summary of Findings
  5.2 Contributions and Limitations
  5.3 Future Research Directions
References
LIST OF TABLES

Table 1: Summary of Data Features
Table 2: Percentage of Differentiated and Grade
Table 3: Statistics for Numerical Columns
Table 4: Key Matrix for Models
Table 5: Comparing SMOTE with Random Oversampling
Table 6: Decision Tree Matrix with SMOTE
Table 7: Random Forest Matrix with SMOTE
Table 8: AdaBoost Matrix with SMOTE
Table 9: Gradient Boosting Matrix with SMOTE
Table 10: Top 10 Important Features
LIST OF FIGURES

Figure 1: Workflow diagram
Figure 2: Analysis of Outliers
Figure 3: Distribution of data points for numerical features
Figure 4: Relationship of Hormone Receptor with Patient Outcome
Figure 5: Distribution of the Alive and Dead
Figure 6: Heatmap of Numerical Feature Correlations
Figure 7: Distribution of Continuous Features
Figure 8: Relationship of Survival Months with Tumor Size, Regional Node Examined, and Regional Node Positive
Figure 9: Comparison of Stage Group with Tumor Size and Survival Months
Figure 10: Relationship of Hormone Receptor with Tumor Size and Survival Months
Figure 11: Categorical Feature Correlation with Status_Dead
Figure 12: Confusion Matrix for Different Models (No Oversampling)
Figure 13: Confusion Matrix for Different Models (Random Oversampling)
Figure 14: Decision Tree
Figure 15: Random Forest
Figure 16: AdaBoost
Figure 17: Gradient Boosting
Analysis of Factors Associated with Breast Cancer Using Machine
Learning Techniques
ABSTRACT
Breast cancer, being one of the most prevalent malignancies among women, imposes significant
physiological and psychological burdens on patients and their families. In the pursuit of enhanced
understanding and prediction of breast cancer incidence, the utilization of machine learning
techniques for factor analysis has become increasingly pivotal. The core objective of the study is
to unveil the intricately linked elements contributing to breast cancer onset through the application
of diverse machine learning methodologies, thus providing a scientific basis for early prediction
and intervention. This study encompasses six classical machine learning methods, including
Logistic Regression, Support Vector Classifier (SVC), Random Forest, Decision Tree, AdaBoost
Classifier, and Gradient Boosting Classifier. The outcomes of this research hold significance not
only for early prediction and intervention in breast cancer but also methodologically, offering
insights for further investigation into breast cancer factors. By dissecting the application of
machine learning techniques in the analysis of breast cancer factors, this study provides valuable
insights into the medical domain and offers inspiration for future research directions.
Keywords: breast cancer, machine learning, random forest, decision tree, prediction and
intervention of breast cancer
Chapter 1: Introduction
1.1 Current Situation of Breast Cancer
Breast cancer, the most prevalent form of cancer among women, consistently exhibits high
incidence and mortality rates (Asri et al., 2016; Giaquinto et al., 2022). It is the second leading
cause of cancer death in women, following closely behind lung cancer. Scientists recognized the
dangers associated with breast cancer early on, leading to substantial initial research
efforts toward its treatment (Giaquinto et al., 2022; Yue et al., 2018). Thanks to the dedication of
researchers and the implementation of early detection methods, the mortality rate has shown a
consistent and gradual decrease over the past decades. In 2016, approximately 246,660 new cases
of invasive breast cancer among women were reported in the United States, with 40,450 resulting
in fatalities. Breast cancer constitutes roughly 12% of all new cancer cases and accounts for 25%
of all cancers diagnosed in women (Asri et al., 2016; Siegel et al., 2016).
Statistics from 2017 to 2019 indicate that the probability of breast cancer in women of all ages
surpasses that of other cancers (DeSantis et al., 2019; Siegel et al., 2017). Roughly 13% of females
(1 in 8) will receive a diagnosis of invasive breast cancer during their lifetime, and about 3% (1 in
39) will succumb to this condition. Unlike the diagnosis risk, which peaks in women aged 70-79
(4.1%) and decreases thereafter, the likelihood of succumbing to breast cancer rises consistently
with age. Additionally, female breast cancer incidence rates have slowly increased by about 0.5%
per year since the mid-2000s (DeSantis et al., 2019; Siegel et al., 2017; Siegel et al., 2023).
In 2022, approximately 287,850 new cases of invasive breast cancer and 51,400 cases of ductal carcinoma in situ (DCIS)
were identified among women in the United States (Giaquinto et al., 2022). It is anticipated that
breast cancer will result in 43,250 female fatalities. Of these cases, 83% of invasive breast cancers
are detected in women aged 50 and above, with 91% of breast cancer-related deaths occurring
within this age bracket. Furthermore, half of all breast cancer fatalities occur among women aged
70 years or older (Giaquinto et al., 2022).
In 2023, an estimated 297,790 new cases of female breast cancer are expected, accounting for
15.2% of all newly diagnosed cancer cases, along with an estimated 43,170 breast cancer deaths
in the same year, constituting 7.1% of all cancer-related deaths (National Cancer Institute).
According to Cancer Research UK's data, the survival rate for breast cancer over five years
approaches 100% when detected in its earliest stage, in sharp contrast to a survival rate as low as
15% when identified in its advanced stages (Giaquinto et al., 2022). These statistics underscore
the significant impact of breast cancer on public health and emphasize the importance of
continued research and efforts in prevention and treatment.
1.2 Research Objectives
In recent years, the field of breast cancer research has witnessed a substantial increase in
machine learning techniques in the realms of diagnosis and prognosis. These methods employ
classification algorithms to differentiate individuals afflicted by breast cancer, discern between
benign and malignant tumors, and forecast disease prognosis (Lukasiewicz et al., 2021). The
precise stratification achieved through these techniques offers invaluable support to medical
practitioners in determining the most appropriate treatment approaches.
This study aims to investigate the factors that influence the survival rate of breast cancer
patients by implementing six distinct machine-learning models. Additionally, it seeks to identify
the optimal machine learning model for accurate prognostication of future patient outcomes.
Through the utilization of key features, this research endeavors to enhance the effectiveness of
treatment plan formulation. Finally, it strives to contribute to the existing body of literature on
breast cancer research by assembling a comprehensive dataset, informed by machine learning
models.
These advancements hold paramount clinical implications, enabling the customization of
individualized treatment regimens and the delivery of more precisely targeted medical services.
These tools are poised to greatly assist medical professionals in patient management, offering
practical implications for tailored treatment plans.
Chapter 2: Machine Learning Algorithms in Medical Research
2.1 Foundations of Machine Learning Algorithms
Machine learning (ML) algorithms are computational methods that learn from data to make
predictions or decisions. Typically, ML algorithms are categorized into three main types:
supervised, unsupervised, and reinforcement learning. 1) Supervised Learning: supervised
learning involves training a model to map input data to output data. It requires labeled training
data and is used for tasks like classification and regression. Key algorithms include decision
trees, support vector machines, and neural networks (Mitchell, 1997). 2) Unsupervised Learning:
this type of learning involves modeling with datasets that do not have labeled responses. The
system tries to learn patterns and structure from the input data without any supervised feedback.
Clustering and association are examples of unsupervised learning tasks (Mitchell, 1997).
3) Reinforcement Learning: an agent learns by interacting with its environment to maximize
cumulative reward. It is widely used in areas such as game playing, navigation, and real-time
decision-making (Mitchell, 1997).
2.2 Applications in Healthcare and Biomedicine
Machine learning (ML) has transformative potential in healthcare and biomedicine. It can
analyze large and complex data sets, enabling advanced diagnostic and predictive capabilities
(Rajkomar et al., 2019). One major application is the development of predictive models for disease
diagnosis and prognosis, enabling early and accurate identification of diseases such as diabetes,
cardiovascular disease, and cancer (Rajkomar et al., 2019).
In diagnostic imaging, convolutional neural networks (a class of deep learning models)
have proven effective at interpreting medical images to detect abnormalities and diseases.
Furthermore, machine learning algorithms can assist drug discovery and development by
predicting drug responses and identifying potential drug candidates (Rajkomar et al., 2019; Yawen
et al., 2019).
Machine learning also provides personalized medical solutions by analyzing patient
data to customize treatment plans, optimize drug selection and dosage, and predict patient
outcomes (Rajkomar et al., 2019; Yawen et al., 2019).
2.3 Application of Machine Learning in Breast Cancer Analysis
Machine learning (ML) is particularly important in the field of breast cancer analysis. Machine
learning algorithms facilitate early detection of breast cancer by analyzing mammogram images
and identifying malignant tumors with high accuracy. Advanced ML models can process
histopathology images to differentiate between benign and malignant lesions and predict cancer
staging (Esteva et al., 2017).
Moreover, ML techniques enable the analysis of genomic and proteomic data to identify
biomarkers and molecular signatures associated with breast cancer, aiding in the development of
targeted therapies (Yue et al., 2018). They also support the design of personalized treatment
strategies by predicting patient responses to different therapeutic interventions, based on individual
genetic and clinical profiles. ML models are also vital in evaluating the risk factors and predicting
the likelihood of breast cancer recurrence, which is crucial for patient management and treatment
planning. Therefore, machine learning models are considered effective in exploring related factors
leading to death in breast cancer patients (Yue et al., 2018).
Chapter 3: Data Preprocessing and Modeling Method
This chapter first describes the source and attributes of the dataset used. The collected data
include a variety of information relevant to breast cancer, incorporating clinical markers, gene
expression details, and lifestyle factors. During the data preprocessing phase, particular attention
was given to essential procedures such as data filtering, selecting appropriate features, and
handling outliers. These steps aim to guarantee the accuracy and dependability of the subsequent
machine learning models (Yawen et al., 2019). This research was implemented with the
scikit-learn library in the Python programming language. The steps followed in this
implementation are shown in Fig. 1.
Figure 1: Workflow diagram.
3.1 Data Collection
The dataset is publicly available at https://ieee-dataport.org/open-access/seer-breast-cancer-data.
The breast cancer patient dataset was sourced from the November 2017 release of the NCI's
SEER Program, which aimed to offer comprehensive cancer statistics derived from population
data. The dataset specifically focused on women diagnosed with infiltrating duct and lobular
carcinoma breast cancer during the years 2006-2010 (TENG, 2019). To enhance data completeness,
individuals with unknown tumor size, an unknown number of examined regional lymph nodes, an
unknown number of positive regional lymph nodes, or a survival duration of less than one month
were omitted. Consequently, the final dataset comprised a total of 4024 patients (TENG, 2019).
Table 1: Summary of Data Features

- Age (Numeric): The age of the patient at diagnosis for this cancer, coded as the patient's actual age in years (TENG, 2019).
- Race (Categorical: White, Black, Other): Race recode based on the race variables and the American Indian/Native American IHS link variable. This recode should be used to link to the populations for white, black, and other; it is independent of Hispanic ethnicity (TENG, 2019).
- Marital Status (Categorical: Married, Widowed, Divorced, Separated, Single): The patient's marital status at the time of diagnosis for the reportable tumor (TENG, 2019).
- T Stage (Categorical: T1, T2, T3, T4): SEER*Stat name: Breast Adjusted AJCC 6th T (1988+).
- N Stage (Categorical: N1, N2, N3): SEER*Stat name: Breast Adjusted AJCC 6th N (1988+).
- 6th Stage (Categorical: IIA, IIB, IIIA, IIIB, IIIC): SEER*Stat name: Breast Adjusted AJCC 6th Stage (1988+).
- Differentiate (Categorical) and Grade (Categorical: 1, 2, 3, 4): Tumor differentiation and grade (see Table 2).
- A Stage (Categorical: Regional, Distant): Regional — a neoplasm that has extended 1) beyond the limits of the organ of origin directly into surrounding organs or tissues; 2) into regional lymph nodes by way of the lymphatic system; or 3) by a combination of extension and regional lymph nodes. Distant — a neoplasm that has spread to parts of the body remote from the primary tumor, either by direct extension or by discontinuous metastasis (e.g., implantation or seeding) to distant organs or tissues, or via the lymphatic system to distant lymph nodes (TENG, 2019).
- Tumor Size (Numeric): Information on tumor size; each value indicates the exact size in millimeters.
- Estrogen Status (Categorical: Positive, Negative): Created by combining information from Tumor marker 1 (1990-2003) (NAACCR Item #=1150) with information from CS site-specific factor 1 (2004+) (NAACCR Item #=2880) (TENG, 2019).
- Progesterone Status (Categorical: Positive, Negative): Created by combining information from Tumor marker 2 (1990-2003) (NAACCR Item #=1150) with information from CS site-specific factor 2 (2004+) (NAACCR Item #=2880) (TENG, 2019).
- Regional Node Examined (Numeric): The total number of regional lymph nodes that were removed and examined by the pathologist (TENG, 2019).
- Regional Node Positive (Numeric): The exact number of regional lymph nodes examined by the pathologist that were found to contain metastases (TENG, 2019).
- Survival Months (Numeric): Months of survival during follow-up.
- Status (Categorical: Alive, Dead): Patient vital status, censored at the follow-up cutoff date.
3.2 Data Preprocessing
3.2.1 Dealing with Outliers
The boxplot representations (Fig. 2) of certain numerical attributes such as Age, Tumor Size,
Regional Node Examined, Regional Node Positive, and Survival Months revealed notable
disparities among data points. By employing the IQR (interquartile range) method, outliers were
detected in these columns. There were a total of 617 outliers, roughly constituting 15% of the
overall data points. Given that the dataset comprises only 4024 entries, eliminating 15% of the
data could exert a substantial influence on the final predictions. Consequently, the outliers were
retained for machine learning modeling purposes.
Figure 2: Analysis of Outliers
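The IQR rule used above can be sketched in plain Python. This is an illustrative implementation on made-up tumor-size values, not the thesis code (which would more likely use pandas' quantile method); the quartile interpolation shown is one of several common conventions.

```python
def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the standard IQR rule."""
    xs = sorted(values)
    n = len(xs)

    def quantile(q):
        # Linear interpolation between order statistics (conventions vary by library).
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        frac = pos - lo
        return xs[lo] * (1 - frac) + xs[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lower or v > upper]

# Hypothetical tumor sizes in mm; only the extreme 140 mm value is flagged.
print(iqr_outliers([12, 16, 22, 25, 30, 33, 38, 45, 140]))  # → [140]
```

Applying this per column and counting flagged rows is what yields the 617 outliers reported above.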
3.2.2 Patient Characteristics
The dataset includes female patients aged 30 to 69 years, with an average age of 53.9 years.
Patients were further characterized based on race and marital status. Among the patients, three
races were reported: White (84.82%), Black (7.95%), and Other (7.23%). The relationship status
of patients was classified as Married (65.68%), Single (12.08%), Divorced (15.28%), Widowed
(5.84%), or Separated (1.12%). A significant proportion of diagnosed cancer patients are married.
The T Stage defines patients’ tumor sizes. It includes four categories. Category T2, corresponding
to tumor sizes between 20mm and 50mm, was the most common at 44.38%. This was followed by
T1, T3, and T4, with proportions of 39.84%, 13.25%, and 2.53% respectively. The N Stage
represents cancer spread to lymph nodes and has three categories: N1(67.89%), N2(20.38%), and
N3(11.73%). Around 68% of patients were in the N1 Stage, indicating cancer spread to 1 to 3
axillary lymph nodes and/or internal mammary lymph nodes. The 6th Stage column indicates
breast cancer stage grouping. It was determined by a combination of T Stage, N Stage, and M
Stage (Metastasis). The column contains five categories: IIA (32.43%), IIB (28.08%), IIIA
(26.09%), IIIC (11.73%), and IIIB (1.67%). The Differentiate and Grade column indicates how
closely tumors resemble normal tissue (see Table 2). Table 3 describes the patients' Tumor Size in
millimeters (mm); the average size is 30.47 mm, ranging from 1 to 140 mm. For Survival Months,
the average value is 71.3 months, with a range between 1 and 107 months. The distribution of
values for the numerical features is illustrated with a histogram plot (Fig. 3). From this plot,
the Age column has a skewness of -0.22. Columns such as Tumor Size and Regional Node
Positive are highly right-skewed, whereas the Regional Node Examined and Survival Months
columns are moderately skewed.
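Skewness values like those reported above would normally come from pandas' skew method; a dependency-free version of the underlying (population-form) moment coefficient is sketched below on toy data. Note that library implementations usually apply a small-sample bias correction, so values differ slightly.

```python
def skewness(values):
    """Fisher-Pearson moment coefficient of skewness (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# A symmetric sample has zero skew; a long right tail gives a positive value.
print(skewness([1, 2, 3, 4, 5]))              # → 0.0
print(skewness([1, 2, 2, 3, 3, 3, 20]) > 0)   # → True
```

A negative value, like the -0.22 reported for Age, indicates a slightly longer left tail.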
Table 2: Percentage of Differentiated and Grade

Grade   Differentiated                  Percentage
I       Well differentiated             13.49%
II      Moderately differentiated       58.42%
III     Poorly differentiated           27.61%
IV      Undifferentiated (anaplastic)    0.47%

Table 3: Statistics for Numerical Columns

        Age      Tumor Size (mm)  Regional Node Examined  Regional Node Positive  Survival Months
Count   4024.00  4024.00          4024.00                 4024.00                 4024.00
Mean    53.97    30.47            14.36                   4.16                    71.30
Std     8.96     21.12            8.10                    5.11                    22.92
Min     30.00    1.00             1.00                    1.00                    1.00
25%     47.00    16.00            9.00                    1.00                    56.00
50%     54.00    25.00            14.00                   2.00                    73.00
75%     61.00    38.00            19.00                   5.00                    90.00
Max     69.00    140.00           61.00                   46.00                   107.00
Figure 3: Distribution of data points for numerical features
3.2.3 Hormone Receptor Status
Receptors are proteins in cells that can attach to certain substances in the blood. To make the
best decision for treating breast cancer, a test was conducted to check for hormone receptor status
(Sohail et al., 2020). This hormone receptor status is either positive or negative based on the
receptors that attach to estrogen and progesterone hormones (Sohail et al., 2020). Cancer that has
estrogen receptors is called Estrogen Positive and cancer that has progesterone receptors is called
Progesterone Positive. The hormone receptor is positive if the estrogen or progesterone is positive,
or both are positive. Otherwise, the hormone receptor is negative. For positive hormone receptors,
hormone therapy drugs can be used to lower estrogen levels. Breast cancer cells with positive
hormone receptor status grow slowly compared to cells with negative hormone receptor status.
To study the effect of the hormone receptor, a new variable named Hormone Receptor
Status was created from the Estrogen Status and Progesterone Status variables (Fig. 4).
Figure 4: Relationship of Hormone Receptor with Patient Outcome
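The derivation follows the rule stated above: the combined status is positive if either individual receptor status is positive. A minimal sketch with hypothetical records rather than the actual SEER rows:

```python
def hormone_receptor_status(estrogen, progesterone):
    """Combined status: Positive if either individual receptor status is Positive."""
    if estrogen == "Positive" or progesterone == "Positive":
        return "Positive"
    return "Negative"

# Hypothetical (Estrogen Status, Progesterone Status) pairs for four patients.
patients = [("Positive", "Positive"), ("Positive", "Negative"),
            ("Negative", "Positive"), ("Negative", "Negative")]
print([hormone_receptor_status(e, p) for e, p in patients])
# → ['Positive', 'Positive', 'Positive', 'Negative']
```

In pandas, the same derivation would be a vectorized comparison over the two status columns.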
3.2.4 Imbalanced Data
The barplot (Fig.5) displays the percentage of patients belonging to the
Alive and Dead classes. It is also important to note that the given dataset is highly imbalanced.
Only 15.31% of the data belongs to the Dead class and the remaining belong to the Alive class.
Since most of the data points are associated with the Alive class, there is a high chance that the
machine learning models may be biased towards that class and could lead to poor performance for
the Dead class.
In the feature selection process, race was excluded. Its data are not uniformly distributed
across the racial categories, and this imbalance could lead to varying error rates in the
subsequent statistical analysis; the variable was therefore omitted to protect the models'
accuracy.
Figure 5: Distribution of the Alive and Dead
3.3 Theory and Application of Machine Learning
In this section, six machine learning models were applied to predict patient outcome
(Alive/Dead): logistic regression, support vector classifier, decision tree classifier, random forest
classifier, AdaBoost classifier, and gradient boosting classifier. Given the significant class
imbalance in the dataset, the initial step involves testing these models on the imbalanced dataset
to predict the target variable. To achieve this, the data were split into training and testing sets,
allocating 25% of the data for testing and using the remaining 75% for training. Additionally, a 5-
fold cross-validation strategy was employed in this project (Chen et al., 2021).
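With scikit-learn this split and cross-validation would typically use train_test_split and cross_val_score; the index bookkeeping they perform can be sketched without dependencies (the function names here are illustrative, not from the thesis code):

```python
import random

def train_test_split_indices(n, test_frac=0.25, seed=42):
    """Shuffle record indices and hold out test_frac of them for testing."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    return idx[n_test:], idx[:n_test]  # (train, test)

def kfold_indices(train_idx, k=5):
    """Partition the training indices into k roughly equal validation folds."""
    return [train_idx[i::k] for i in range(k)]

train, test = train_test_split_indices(4024)   # the dataset's 4024 patients
folds = kfold_indices(train, k=5)
print(len(train), len(test), [len(f) for f in folds])
# → 3018 1006 [604, 604, 604, 603, 603]
```

For an imbalanced outcome such as this one, stratified splitting (e.g. scikit-learn's StratifiedKFold) is usually preferable so each fold preserves the Alive/Dead ratio.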
Subsequently, this research addressed the class imbalance by utilizing Random oversampling
and SMOTE techniques to augment data entries for underrepresented classes, particularly those
labeled as "Dead”. Random Oversampling involves randomly duplicating samples from the
underrepresented class to balance the dataset. In contrast, SMOTE (Synthetic Minority
Oversampling Technique) generates synthetic samples to augment the underrepresented class,
offering a widely recognized approach to oversampling.
Following this augmentation, the performance of the machine learning models was re-evaluated
on the adjusted dataset.
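Both techniques are provided by the imbalanced-learn package (RandomOverSampler and SMOTE); the underlying ideas can be illustrated in a few lines. This sketch is deliberately simplified: real SMOTE interpolates toward one of the k nearest minority neighbors, whereas here any second minority sample is used.

```python
import random

def random_oversample(minority, target_n, seed=0):
    """Random oversampling: duplicate randomly chosen minority rows up to target_n."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(target_n - len(minority))]
    return minority + extra

def smote_like(minority, target_n, seed=0):
    """SMOTE-style idea: create synthetic rows by interpolating between minority rows."""
    rng = random.Random(seed)
    synthetic = []
    while len(minority) + len(synthetic) < target_n:
        a, b = rng.sample(minority, 2)  # real SMOTE picks b among k nearest neighbors
        lam = rng.random()              # random position along the segment a -> b
        synthetic.append(tuple(ai + lam * (bi - ai) for ai, bi in zip(a, b)))
    return minority + synthetic

# Hypothetical "Dead"-class feature rows: (tumor size in mm, positive nodes).
dead = [(30.0, 4.0), (45.0, 7.0), (55.0, 10.0)]
print(len(random_oversample(dead, 10)), len(smote_like(dead, 10)))  # → 10 10
```

The duplicated rows add no new information, while the synthetic rows lie between existing minority points; this is one reason SMOTE often generalizes better than plain duplication.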
3.3.1 Logistic Regression
Logistic regression employs maximum likelihood to predict outcomes by estimating the
probabilities of outcome classes (Lorena et al., 2011). It is a popular and easily interpretable
method that does not impose distribution assumptions on the explanatory data (Pohar et al., 2004).
Nonetheless, the statistical complexity of logistic regression is limited since it assumes a linear
connection between inputs and the logarithm of the outcome odds (Tu, 1996). When aiming to
capture nonlinear associations between variables, developers of the logistic regression model need
to proactively identify and potentially apply intricate variable transformations before training.
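The linearity assumption discussed above can be made concrete: the model's predicted probability is the sigmoid of a weighted sum of the inputs, so any curvature must be supplied manually as a transformed feature. The coefficients below are hypothetical, chosen only for illustration.

```python
import math

def predict_proba(x, coefs, intercept):
    """Logistic regression: sigmoid of a log-odds score that is linear in x."""
    log_odds = intercept + sum(c * xi for c, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients for (tumor_size_mm, regional_node_positive).
coefs, intercept = [0.03, 0.15], -3.0
print(round(predict_proba([30.0, 4.0], coefs, intercept), 3))  # → 0.182

# To capture a nonlinear effect of tumor size, a transformed input such as
# tumor_size**2 would have to be appended as an extra feature before training.
```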
3.3.2 Support Vector Classifier (SVC)
In healthcare, Support Vector Classification (SVC) is an adaptation of the Support Vector
Machine (SVM) algorithm, optimized for categorizing data. As a crucial component of medical
decision-support systems, it distinguishes between different states of health by identifying patterns
within clinical data. This technique relies on the strategic selection of kernel functions and the
tuning of parameters to improve the accuracy of predictive models (Cortes & Vapnik, 1995).
SVC has proven particularly effective in medical research for a variety of diagnostic
applications. For instance, it assists in the early identification of patients at risk of cognitive decline
following cardiac surgery and enhances the screening process for cervical cancer. Moreover, it
contributes to obesity prognosis and facilitates the differentiation between benign and malignant
breast tissue, thereby aiding in the early detection and treatment of breast cancer (Son et al., 2010).
3.3.3 Decision Tree
Decision trees categorize information by asking successive inquiries about predictor variables
(Mitchell, 1997). These trees consist of nodes (depicting tests for specific inputs), branches
(representing outcomes of node tests), and leaves (found at the tree's bottom, offering final
categorizations) (Miguel-Hurtado et al., 2016). Decision trees offer great interpretability, as the
process of determining an individual's classification is readily understandable (Podder et al., 2021).
In a clinical context, a transparent model like this could be more desirable.
3.3.4 Random Forest
Random Forest is a popular ensemble learning algorithm that combines the power of multiple
decision trees to improve the accuracy, robustness, and generalization of predictions (Zhang & Ma,
2012). A Random Forest consists of a collection of decision trees, where each tree is trained on a
random subset of the data and uses a random subset of features. The randomness injected into both
data and feature selection helps to reduce overfitting and improve the model's performance.
Random Forest has several advantages. It reduces overfitting through random data sampling and
feature selection, enhancing the model robustness (Zhang & Ma, 2012). Its accuracy is high due
to multiple trees' combined predictions. It reveals feature importance, indicating influential
predictors. Moreover, it adeptly manages missing data without imputation. However, Random
Forest's complexity can rise with many trees, causing longer training and more memory usage. Its
ensemble nature makes overall model interpretation challenging.
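The row- and feature-subsampling described above can be sketched as follows (assuming scikit-learn's RandomForestClassifier on synthetic data; the thesis does not specify its implementation). Each tree sees a bootstrap sample of the rows, and `max_features` limits the random feature subset considered at each split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the breast cancer feature matrix.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)

# Bootstrap row sampling plus per-split feature sampling is the source of
# the variance reduction (and overfitting resistance) described above.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0).fit(X, y)

# Impurity-based importances: one nonnegative score per feature, summing to 1.
importances = rf.feature_importances_
```

The `feature_importances_` attribute is what later supports rankings of influential predictors.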
3.3.5 AdaBoost Classifier
AdaBoost, short for Adaptive Boosting, is an ensemble learning algorithm designed to
enhance the classification performance of weak learners sequentially (Zhang & Ma, 2012). A weak
learner is a classification model that performs only slightly better than random guessing. The core
idea of AdaBoost lies in its sequential training process. Weak learners are trained one after another,
with each subsequent learner emphasizing data points that the previous learner misclassified. This
allows AdaBoost to focus on challenging instances and improve overall classification accuracy.
During training, each weak learner is assigned a weight based on its accuracy. More accurate
learners receive higher weights, while poorer performers receive lower weights. This weighting
mechanism ensures that more accurate classifiers have a stronger impact on the final classification
(Zhang & Ma, 2012). AdaBoost has several advantages, including the ability to achieve high
accuracy with a relatively small number of weak learners and its resistance to overfitting. However,
it may be sensitive to noisy data and outliers.
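The weighting mechanism can be written out directly. Below is a minimal pure-Python sketch of one round of the classic AdaBoost.M1 update (a textbook formulation, not the thesis's implementation): a weak learner with weighted error err receives vote weight alpha = ½·ln((1−err)/err), and sample weights are scaled up on mistakes and down on correct cases before renormalizing.

```python
import math

def adaboost_update(sample_weights, misclassified, err):
    """One AdaBoost round: compute the learner's vote weight (alpha) and
    re-weight samples so the next weak learner focuses on the mistakes.

    sample_weights: current weights (sum to 1)
    misclassified:  booleans, True where this weak learner was wrong
    err:            weighted error of this weak learner (0 < err < 0.5)
    """
    alpha = 0.5 * math.log((1 - err) / err)  # accurate learner -> large vote
    new_w = [w * math.exp(alpha if miss else -alpha)
             for w, miss in zip(sample_weights, misclassified)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

# Four equally weighted samples, one mistake -> err = 0.25.
alpha, w = adaboost_update([0.25] * 4, [True, False, False, False], 0.25)
```

After one round the single misclassified sample carries half of the total weight, which is exactly the "emphasizing data points that the previous learner misclassified" behavior described above.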
3.3.6 Gradient Boosting
Gradient Boosting is a powerful ensemble learning method used in machine learning for both
regression and classification tasks. Unlike traditional bagging methods such as Random Forest,
which build multiple independent models in parallel, Gradient Boosting builds a series of weak
learners sequentially. These weak learners are typically shallow decision trees (Friedman, 2001).
The key idea behind Gradient Boosting is to train each weak learner to correct the errors of
the previous one. In each iteration, the algorithm focuses on the data points where the previous
models performed poorly, assigning higher weights to them. This way, the subsequent models are
specialized in handling the more challenging instances (Chen et al., 2021). The predictions of all
weak learners are then combined through a weighted sum to form the final prediction. The weights
are determined by an optimization process that minimizes a specified loss function. This is usually
done using gradient descent.
One of the strengths of Gradient Boosting lies in its ability to capture complex relationships
in data and produce highly accurate predictions (Chen et al., 2021; Friedman, 2001). It is also
robust to overfitting, though some hyperparameter tuning may be necessary to achieve optimal
performance.
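For squared loss the "errors of the previous one" are simply the residuals, so the sequential procedure can be sketched in a few lines of pure Python. This is a toy regression version with threshold stumps as the shallow weak learners, not the classifier used in the thesis:

```python
def fit_stump(x, r):
    # Weak learner: a depth-1 tree (single threshold) fit to residuals r
    # by minimizing squared error over candidate thresholds.
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - (lm if xi <= t else rm)) ** 2 for xi, ri in zip(x, r))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, n_rounds=20, lr=0.1):
    # Squared loss: the negative gradient IS the residual y - F(x), so each
    # round fits a stump to the residuals and takes a small step (lr).
    f0 = sum(y) / len(y)
    stumps = []
    def predict(xi):
        return f0 + sum(lr * s(xi) for s in stumps)
    for _ in range(n_rounds):
        residuals = [yi - predict(xi) for xi, yi in zip(x, y)]
        stumps.append(fit_stump(x, residuals))
    return predict

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.0, 3.2, 2.9]
model = gradient_boost(x, y)
mse = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(x)
```

The learning rate plays the role of the weighting in the combined prediction; for general losses the residual is replaced by the negative gradient of the loss, which is where the method's name comes from.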
3.4 Performance Evaluation
Performance evaluation in the context of machine learning refers to the process of assessing
how well a trained model performs on a specific task. It involves using various metrics and
techniques to quantitatively measure the model's accuracy, precision, recall, and other relevant
factors.
3.4.1 Accuracy:
Accuracy is a metric that measures the overall correctness of a classification model. It
calculates the ratio of correctly predicted instances to the total instances in the dataset. Higher
accuracy indicates better performance (Podder et al., 2021).
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)
3.4.2 Precision:
Precision is a metric that assesses the accuracy of positive predictions made by the model. It
is the ratio of true positive predictions to the sum of true positives and false positives. Precision is
particularly important when minimizing false positives is crucial (Podder et al., 2021).
Precision = TP / (TP + FP)    (2)
3.4.3 Recall:
Recall, also known as sensitivity or true positive rate, measures the ability of the model to
identify all the relevant instances. It is the ratio of true positives to the sum of true positives and
false negatives. Recall is significant when avoiding false negatives is a priority (Podder et al.,
2021).
Recall = TP / (TP + FN)    (3)
3.4.4 F1-Score:
The F1-score is the harmonic mean of precision and recall. It provides a balanced assessment
of a model's performance by considering both false positives and false negatives. F1-score is
particularly useful when there is an uneven class distribution (Podder et al., 2021).
F1-score = (2 × Precision × Recall) / (Precision + Recall)    (4)
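The four metrics above reduce to simple arithmetic on the confusion-matrix counts. A minimal sketch with hypothetical counts:

```python
def classification_metrics(tp, tn, fp, fn):
    # Direct implementation of equations (1)-(4) from the confusion matrix.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical confusion-matrix counts for illustration.
acc, prec, rec, f1 = classification_metrics(tp=40, tn=50, fp=10, fn=20)
```

Note how precision and recall pull in different directions: lowering the decision threshold raises recall (fewer false negatives) at the cost of precision, which is why the F1 harmonic mean is reported alongside both.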
3.4.5 AUC/ROC
Receiver Operator Characteristic (ROC) curves are a common tool used in machine learning
to visualize the performance of algorithms in binary decision scenarios (Davis & Goadrich,
2006). The ROC curve helps us understand how well the model performs at different thresholds,
particularly useful when dealing with imbalanced datasets. The closer the curve is to the top-left
corner, the better the model's performance. AUC is the area under the ROC curve, representing
the classifier's ability to distinguish between positive and negative classes at various thresholds.
AUC values range from 0.5 (random guessing) to 1 (a perfect classifier); values below 0.5 indicate worse-than-chance ranking.
AUC provides a single numerical summary of a classification model's performance without
the need to set a specific threshold. A model with a high AUC value generally performs well
across many thresholds. In summary, ROC curves and AUC are crucial tools for evaluating the
performance of the classification of the six machine learning models. The ROC curve provides a
visual means to compare the six models' performance at different thresholds, while AUC offers a
single numerical value to summarize model performance. Both metrics offer valuable insights
when selecting the right model or adjusting thresholds.
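AUC can also be computed directly from its probabilistic interpretation: it is the fraction of positive/negative score pairs that the classifier ranks correctly, counting ties as half. A pure-Python sketch:

```python
def auc_score(pos_scores, neg_scores):
    # AUC = P(score of a random positive > score of a random negative),
    # with ties contributing one half.
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# 3 of the 4 positive/negative pairs are ranked correctly -> AUC = 0.75.
auc = auc_score([0.9, 0.6], [0.7, 0.3])
```

Because only the ranking of scores matters, AUC needs no threshold, which is exactly why it summarizes a model across all operating points.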
Chapter 4: Results and Analysis
4.1 Feature Engineering
We performed correlation analysis through heatmap visualization (Fig 6) and further studied
the relationships between features. Patients who survived had a higher number of survival months
than patients who died. Tumor sizes were smaller in breast cancer survivors than in patients who
died; few patients with larger tumors survived.
Figure 6: Heatmap of Numerical Feature Correlations
Regional nodes examined record the exact number of regional nodes removed and examined
by the pathologist. Positive regional lymph nodes recorded the exact number of regional lymph
nodes found by the pathologist to contain metastases. The boxplot shows that in patients who died,
their cancer had spread to other body parts (metastasis) (Fig.7 and Fig.8).
Figure 7: Distribution of Continuous Features
Figure 8: Relationship of Survival Months with Tumor Size, Regional Node Examined, and Regional Node Positive
There were five stage grouping categories under the Stage 6 feature. Tumor size varied
between stage groups, but within each stage group, tumors in surviving and deceased patients
were of similar size, indicating no difference in tumor size between the two outcomes within a
specific stage group. Furthermore, the range of survival months for surviving patients remained
almost the same across stage groups, and the pattern was similar for patients who died.
(Fig.9)
Figure 9: Comparison of Stage Group with Tumor Size and Survival Months
Upon analysis (Fig.10), it was observed that when hormone receptors are negative, patients
tend to experience lower survival durations, typically ranging from 20 to 40 months. This
observation suggests that hormone receptor status may significantly impact the length of survival.
Figure 10: Relationship of Hormone Receptor with Tumor Size and Survival Months
Finally, exploring the correlation between different categorical features and the 'Status_Dead'
outcome (Fig.11), the features are presented along the x-axis, while the correlation values are
shown along the y-axis. Bars to the right indicate a positive correlation, suggesting that as the value
of the feature increases, the likelihood of 'Status_Dead' also increases. Conversely, bars to the left
indicate a negative correlation, suggesting an inverse relationship. For the Status level
corresponding to the 'Dead' class, there is no strong correlation: the correlation values range
between -0.2 and +0.2.
Figure 11 Categorical Feature Correlation with Status_Dead
4.2 Model Selection and Performance Metrics
4.2.1 Baseline Models (No Oversampling)
We used Python to assess the performance of six models, namely logistic regression, support
vector classifier, decision tree, random forest, AdaBoost, and gradient boosting, leveraging
predefined functions. This approach allowed us to identify the most effective indicators for this
dataset and pinpoint areas where certain indicators may encounter challenges. We computed key
metrics such as accuracy, precision, recall, F1-score, and AUC for these six models, facilitating a
comprehensive comparison of their performance (Table 4). We can categorize the models into
two groups: linear models, which include the Support Vector Classifier (SVC) and logistic
regression, and non-linear models, which encompass Decision Tree (DT), Random Forest (RF),
AdaBoost (Ada), and Gradient Boosting (GBDT). It was evident that the non-linear models
outperformed the linear ones (Fig.12). Based on the results, we selected four models, DT, RF,
Ada, and GBDT, to compare.
Figure 12: Confusion Matrix for Different Models (No Oversampling)
Table 4 Key Metrics for Models

Model                Set       Accuracy  Precision  Recall  F1 Score  AUC Score
Logistic Regression  Training  0.90      0.78       0.45    0.57      0.87
                     Testing   0.89      0.76       0.44    0.56      0.86
SVC                  Training  0.89      0.81       0.38    0.52      0.83
                     Testing   0.89      0.81       0.35    0.49      0.81
Decision Tree        Training  1.00      1.00       1.00    1.00      1.00
                     Testing   0.85      0.51       0.50    0.50      0.71
Random Forest        Training  1.00      1.00       1.00    1.00      1.00
                     Testing   0.90      0.82       0.45    0.59      0.85
AdaBoost             Training  0.90      0.78       0.53    0.63      0.90
                     Testing   0.90      0.75       0.50    0.60      0.85
Gradient Boosting    Training  0.93      0.91       0.59    0.72      0.93
                     Testing   0.91      0.83       0.52    0.64      0.87
4.2.2 No Oversampling and Random Oversampling
Given that our dataset exhibited a high degree of imbalance, there was a potential risk of
diminished model accuracy. Hence, it became imperative to implement measures to enhance our
model's predictive capability: oversampling.
In the comparison of model performances (Fig.12 and Fig.13), the application of Random
Oversampling demonstrated marked improvements across various machine learning algorithms.
The primary rationale behind adopting Random Oversampling was its ability to enhance overall
accuracy and notably reduce the misclassification rates, particularly in the "Dead" class. A
consistent observation across models like Logistic Regression, Random Forest, and Gradient
Boosting was the reduced False Positives for the "Dead" class. This is of paramount significance
in fields such as medical or life sciences, where overlooking a positive case can lead to severe
consequences.
Additionally, Random Oversampling addresses the class imbalance by artificially augmenting
the minority class, promoting a more balanced representation. This, in turn, bolsters the robustness
of the models, as they are exposed to a more diverse set of sample combinations during training.
Therefore, this project adopted the oversampling method.
Figure 13 Confusion Matrix for Different Models (Random Oversampling)
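Random oversampling itself is a small procedure: duplicate minority-class rows, sampled with replacement, until the classes are balanced. A pure-Python sketch of the idea (in practice a library such as imbalanced-learn's RandomOverSampler would likely be used, though the thesis does not name one):

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    # Duplicate minority-class rows (sampling with replacement) until the
    # two classes are the same size, then shuffle.
    rng = random.Random(seed)
    minority = [(xi, yi) for xi, yi in zip(X, y) if yi == minority_label]
    majority = [(xi, yi) for xi, yi in zip(X, y) if yi != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    rows = majority + minority + extra
    rng.shuffle(rows)
    return [xi for xi, _ in rows], [yi for _, yi in rows]

# 8 'Alive' (0) vs 2 'Dead' (1), mimicking the dataset's imbalance.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_oversample(X, y, minority_label=1)
```

Crucially, oversampling must be applied only to the training split; duplicating minority rows before the train/test split would leak copies of test samples into training.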
4.2.3. Random Oversampling & SMOTE
Based on Table 4, there is a potential risk of the training models being over-fitted. SMOTE
tends to be more effective at addressing over-fitting than random oversampling. Because the
Gradient Boosting Classifier achieved the highest accuracy in Table 4, we used it as the basis for
comparing Random Oversampling and SMOTE (Table 5).
The Random Oversampling yielded an 86% accuracy in the training set, indicating correct
classification for a majority of samples. Precision and recall stood at 88% and 84% respectively,
showing effectiveness in identifying positive cases, albeit with some misclassifications. The F1
score, balancing precision and recall, reached 0.86, denoting an overall good performance. The
high AUC score of 0.94 indicates strong class discrimination.
With SMOTE, the training set saw outstanding performance. An accuracy of 94% showcases
a high level of correct classifications. Precision and recall both stood at an impressive 97% and
91%, resulting in a remarkable F1 score of 0.94. The exceptionally high AUC score of 0.98
indicates superior class discrimination.
In the testing set, the model maintained strong performance with a 90% accuracy, suggesting
good generalization to new data. Precision was 75%, indicating a relatively low false positive
rate. However, recall dropped to 54%, implying more false negatives compared to the training
set. F1 and AUC scores were 0.63 and 0.87, respectively.
In summary, the SMOTE method significantly enhanced the model's performance in various
metrics compared to the Random Oversampling. This approach resulted in a more accurate and
reliable model for this specific dataset.
Table 5 Comparing SMOTE with Random Oversampling

Gradient Boosting    Set       Accuracy  Precision  Recall  F1 Score  AUC Score
Random Oversampling  Training  0.86      0.88       0.84    0.86      0.94
                     Testing   0.83      0.47       0.69    0.56      0.87
SMOTE                Training  0.94      0.97       0.91    0.94      0.98
                     Testing   0.90      0.75       0.54    0.63      0.87
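Unlike random oversampling, SMOTE synthesizes new minority points by interpolating between a minority sample and one of its k nearest minority neighbors. A minimal pure-Python sketch of that idea (toy two-feature data; real use would rely on a library implementation such as imbalanced-learn's SMOTE, which the thesis does not name):

```python
import random

def smote(minority, n_synthetic, k=2, seed=0):
    # SMOTE: each synthetic point lies a random fraction of the way from a
    # minority sample toward one of its k nearest minority-class neighbors.
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbors)
        u = rng.random()  # interpolation fraction in [0, 1)
        synthetic.append([a + u * (b - a) for a, b in zip(x, nb)])
    return synthetic

minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]]
new_points = smote(minority, n_synthetic=5)
```

Because the synthetic points are interpolated rather than duplicated, the classifier sees new (if nearby) minority examples, which is the mechanism behind SMOTE's reduced tendency to overfit relative to exact duplication.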
4.3 Interpreting AUC/ROC Results
The Decision Tree model (Fig 14 and Table 6) demonstrated perfect performance on the
training data, achieving 100% accuracy, precision, recall, F1 score, and AUC score. This might
indicate overfitting. On the testing data, the model performs reasonably well with an accuracy of
82%. However, there is room for improvement, especially in terms of precision and recall.
Figure 14 Decision Tree
Table 6 Decision Tree Metrics with SMOTE

Set       Accuracy  Precision  Recall  F1 Score  AUC Score
Training  1.00      1.00       1.00    1.00      1.00
Testing   0.82      0.41       0.46    0.43      0.67
Similar to the Decision Tree, the Random Forest model (Fig 15 and Table 7) achieves perfect
scores on the training data. This indicates a potential for overfitting, but Random Forest generally
mitigates overfitting compared to a single decision tree. On the testing data, the model
demonstrates good performance with an accuracy of 89%. It shows improvements in precision,
but recall could be enhanced.
Figure 15 Random Forest
Table 7 Random Forest Metrics with SMOTE

Set       Accuracy  Precision  Recall  F1 Score  AUC Score
Training  1.00      1.00       1.00    1.00      1.00
Testing   0.89      0.73       0.48    0.58      0.85
The AdaBoost model (Fig. 16 and Table 8) performs impressively on the training data, with an
accuracy of 93% and high precision, recall, F1 score, and AUC score. It seems to have learned the
underlying patterns effectively. On the testing data, the model maintains strong performance with
an accuracy of 89%. While precision could be improved, it demonstrates a good balance between
precision and recall.
Figure 16 AdaBoost
Table 8 AdaBoost Metrics with SMOTE

Set       Accuracy  Precision  Recall  F1 Score  AUC Score
Training  0.93      0.95       0.90    0.92      0.97
Testing   0.89      0.65       0.56    0.61      0.85
The Gradient Boosting model (Fig. 17 and Table 9) excelled on the training data, achieving
an accuracy of 94% and high precision, recall, F1 score, and AUC score. It effectively learned
from the training data. On the testing data, the model maintained a high accuracy of 90%. While
precision improved compared to the other models, there was still room for enhancement in recall.
Figure 17 Gradient Boosting
Table 9 Gradient Boosting Metrics with SMOTE

Set       Accuracy  Precision  Recall  F1 Score  AUC Score
Training  0.94      0.97       0.91    0.94      0.98
Testing   0.90      0.75       0.54    0.63      0.87
In summary, all models performed well, but Gradient Boosting stood out as the most
promising with high scores on both training and testing data. It demonstrated strong generalization
capabilities and a good balance between precision and recall. However, further fine-tuning and
potentially exploring different techniques may increase performance.
4.4 Feature Importance
By enhancing the Gradient Boosting model with SMOTE, we achieved significant
performance improvements (Table 10). Survival months, the follow-up time recorded for each
breast cancer patient, dominated the prediction of alive/dead status with a feature importance
score of 0.5068. In addition, N1 stage (0.0598) and married marital status (0.0487) were also
significant in the model's predictions.
At the molecular level, both negative (0.0412) and positive (0.0394) progesterone status
showed significant effects, emphasizing the relevance of hormonal status in breast cancer
development and prognosis. Furthermore, the Grade 2 classification (0.0348) also played a
significant role in the prediction.
Marital status also had an impact on model performance. Married (0.0487) and single (0.0318)
both had some degree of importance. Furthermore, tumor differentiation of moderately
differentiated (0.0280) and well-differentiated (0.0281) was also important in prediction.
Finally, lymph node stages N1 (0.0598) and N2 (0.0236) were considered key features by the
model, indicating that lymph node status is critical for the prognosis of breast cancer patients.
In conclusion, by integrating breast cancer-specific biological features and clinical indicators,
our improved model provided accurate prognostic prediction for breast cancer patients.
Table 10 Top 10 Important Features

Feature                                  Importance Score
Survival Months                          0.5068
N Stage_N1                               0.0598
Marital Status_Married                   0.0487
Progesterone Status_Negative             0.0412
Progesterone Status_Positive             0.0394
Grade_Grade 2                            0.0348
Marital Status_Single                    0.0318
Differentiate_Well differentiated        0.0281
Differentiate_Moderately differentiated  0.0280
N Stage_N2                               0.0236
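A ranking like Table 10 is produced by pairing each one-hot feature name with its importance score and sorting in descending order. A sketch using the reported values (obtaining the mapping from `zip(feature_names, model.feature_importances_)` is an assumption about a scikit-learn-style API, not something the thesis states):

```python
# Importance scores as reported in Table 10; in a scikit-learn-style
# workflow this dict would come from zip(feature_names, model.feature_importances_).
importances = {
    "Survival Months": 0.5068,
    "N Stage_N1": 0.0598,
    "Marital Status_Married": 0.0487,
    "Progesterone Status_Negative": 0.0412,
    "Progesterone Status_Positive": 0.0394,
    "Grade_Grade 2": 0.0348,
    "Marital Status_Single": 0.0318,
    "Differentiate_Well differentiated": 0.0281,
    "Differentiate_Moderately differentiated": 0.0280,
    "N Stage_N2": 0.0236,
}

# Sort by score, largest first, to reproduce the Table 10 ordering.
top = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
```

Note that the top-10 scores sum to about 0.84, so the remaining one-hot columns together contribute only ~16% of the model's total importance.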
Chapter 5: Discussion
5.1 Summary of Findings
This study aimed to conduct a thorough examination of diverse factors and attributes
associated with breast cancer patients and to assess the performance of various machine learning
models in predicting patient outcomes. The patient demographics ranged from ages 30 to 69, with
a predominance of White and married individuals. A comprehensive analysis unveiled the intricate
nature of breast cancer and its correlations with factors like tumor size, survival months, etc.
Regarding cancer characteristics, a substantial portion of the studied patients fell into the T2
category (tumor sizes ranging from 20mm to 50mm) and N1 stage (indicating cancer spread to 1
to 3 axillary lymph nodes and/or internal mammary lymph nodes). Most patients were diagnosed
with cancer in Stage IIA, suggesting advanced disease at the time of diagnosis.
Interestingly, survivors generally exhibited smaller tumor sizes as compared to non-survivors.
However, within specific stage groups, tumor size did not show significant variation between
surviving and non-surviving patients. The study also observed that negative hormone receptor
status is associated with shorter survival durations, indicating the pivotal role of hormone receptors
in breast cancer prognosis.
The dataset exhibited a notable imbalance, with many data points attributed to the 'Alive' class.
This imbalance posed a potential risk of model bias favoring the ‘Alive’ class, necessitating the
application of oversampling techniques to rectify the class distribution. Different models including
Gradient Boosting, Decision Tree, Random Forest, and AdaBoost were evaluated, showcasing
varying performances. Gradient Boosting emerged as the most promising model, demonstrating
high accuracy, precision, recall, and AUC scores, indicating robust generalization capabilities and
balanced precision and recall.
To address dataset imbalance, techniques like Random Oversampling and SMOTE were
employed. Notably, SMOTE significantly enhanced the model's performance across various
metrics compared to Random Oversampling, resulting in a more accurate and reliable model for
this specific dataset. However, it was observed that some models, particularly the Decision Tree,
exhibited signs of overfitting, achieving perfect scores on the training data but displaying reduced
performance on the testing data.
There is clear room for improvement and optimization, particularly in precision and recall,
suggesting the potential benefits of further algorithm tuning and exploring additional techniques.
In conclusion, this comprehensive analysis provided profound insights into the relationships
between various features and breast cancer survival, paving the way for developing enhanced
predictive models and advanced clinical applications. Despite inherent limitations, such as
potential biases and overfitting, the findings significantly contribute to understanding breast cancer
characteristics and their impact on patient outcomes.
5.2 Contributions and Limitations
This study conducts a comprehensive and in-depth analysis, along with visualization, of breast
cancer data categorized by various factors such as alive, dead, tumor size, and stage. This provides
a broad understanding of the role of these various factors in breast cancer survival. The study also
carefully evaluates different machine learning models, with a special focus on gradient boosting,
to select the most accurate and reliable predictive model. Evaluations include metrics like precision,
recall, F1 score, and AUC, offering insights into the overall efficiency of each model. By
comparing random oversampling and SMOTE, the study finds that SMOTE yields better
improvement. This approach successfully addresses the inherent imbalance in the dataset, reducing
bias and enhancing the reliability of the model, ultimately improving the classification of the
outcome (alive/died). Moreover, detailed analyses yield important insights into the impact of
hormone receptor status on survival time, the correlation between tumor size and survival, and the
lack of significant correlation between tumor size and survival within specific stage groupings.
However, the study has some limitations. Some models, especially decision trees, exhibit
overfitting, evident in perfect scores on the training data but diminished performance on the test
data, affecting their ability to generalize. Regarding race, the imbalance in its categories made it
necessary to remove race from the analysis; including it would require performing SMOTE twice,
which could lead to overfitting. Additionally, models show varying performance
in metrics like precision, recall, and F1 score on the test data, potentially influencing the reliability
of predictions in real-world applications. The study is inherently constrained by the scope and
diversity of available datasets, with most of the data falling under the "alive" category, potentially
impacting the depth and breadth of the analysis and its applicability to a broader context.
Furthermore, there may be potential biases introduced by relying on existing data and its high
representativeness, necessitating caution in interpreting the results.
5.3 Future Research Directions
Future research endeavors can explore advanced oversampling and under-sampling
techniques to rectify dataset imbalance and bolster model generalization, with particular attention
to enhancing performance for the 'Dead' category. Collecting additional data could also help
balance the 'Black' and 'White' categories. Additionally, delving deeper into optimizing the
successful gradient-boosting model through hyperparameter tuning and ensemble methods could
further refine the balance between precision and recall. The integration of more diverse and
inclusive datasets, representative of various demographic groups and cancer subtypes, holds
promise in developing models with broader applicability and heightened predictive accuracy.
Further exploration into feature engineering may uncover deeper insights and more robust
correlations, enriching the model's learning and prediction capabilities. Dedicated studies on the
significance of hormone receptor status in survival durations, along with an examination of its
underlying mechanisms, can provide enhanced understanding and guide targeted therapeutic
strategies.
Moreover, the insights and models from this research could form the basis for developing
clinical applications and decision-support systems, aiding medical professionals in diagnosing,
staging, and predicting outcomes for breast cancer patients. Future studies should also prioritize the evaluation
and enhancement of model interpretability, ensuring that the insights and predictions are
effectively understood and applied in clinical settings. By building on these insights and addressing
the observed limitations, future research can significantly advance the understanding and
management of breast cancer.
References
Asri, H., Mousannif, H., Moatassime, H. A., & Noel, T. (2016). Using Machine Learning
Algorithms for Breast Cancer Risk Prediction and Diagnosis. Procedia Computer Science,
83, 1064-1069. https://doi.org/10.1016/j.procs.2016.04.224
Chen, Z., Wang, M., De Wilde, R. L., Feng, R., Su, M., Torres-de la Roche, L. A., & Shi, W. (2021).
A Machine Learning Model to Predict the Triple Negative Breast Cancer Immune Subtype.
Front Immunol, 12, 749459. https://doi.org/10.3389/fimmu.2021.749459
Davis, J., & Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves.
ICML '06, 233–240. https://doi.org/10.1145/1143844.1143874
DeSantis, C. E., Ma, J., Gaudet, M. M., Newman, L. A., Miller, K. D., Goding Sauer, A., Jemal,
A., & Siegel, R. L. (2019). Breast cancer statistics, 2019. CA Cancer J Clin, 69(6), 438-
451. https://doi.org/10.3322/caac.21583
Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The
Annals of Statistics, 29, 1189–1232. http://www.jstor.org/stable/2699986
Giaquinto, A. N., Sung, H., Miller, K. D., Kramer, J. L., Newman, L. A., Minihan, A., Jemal, A.,
& Siegel, R. L. (2022). Breast Cancer Statistics, 2022. CA Cancer J Clin, 72(6), 524-541.
https://doi.org/10.3322/caac.21754
Lorena, A. C., Jacintho, L. F. O., Siqueira, M. F., Giovanni, R. D., Lohmann, L. G., de Carvalho,
A. C. P. L. F., & Yamamoto, M. (2011). Comparing machine learning classifiers in potential
distribution modelling. Expert Systems with Applications, 38(5), 5268-5275.
https://doi.org/10.1016/j.eswa.2010.10.031
Lukasiewicz, S., Czeczelewski, M., Forma, A., Baj, J., Sitarz, R., & Stanislawek, A. (2021). Breast
Cancer-Epidemiology, Risk Factors, Classification, Prognostic Markers, and Current
Treatment Strategies-An Updated Review. Cancers (Basel), 13(17).
https://doi.org/10.3390/cancers13174287
Miguel-Hurtado, O., Guest, R., Stevenage, S. V., Neil, G. J., & Black, S. (2016). Comparing
Machine Learning Classifiers and Linear/Logistic Regression to Explore the Relationship
between Hand Dimensions and Demographic Characteristics. PLoS One, 11(11), e0165521.
https://doi.org/10.1371/journal.pone.0165521
Mitchell, T. M. (1997). Machine learning. McGraw-Hill.
National Cancer Institute, Surveillance, Epidemiology, and End Results Program. Cancer Stat
Facts: Female Breast Cancer. https://seer.cancer.gov/statfacts/html/breast.html
Podder, P., Bharati, S., Mondal, M. R. H., & Kose, U. (2021). Application of machine learning for
the diagnosis of COVID-19. In Data Science for COVID-19 (pp. 175-194).
https://doi.org/10.1016/b978-0-12-824536-1.00008-3
Pohar, M., Blas, M., & Turk, S. (2004). Comparison of logistic regression and linear discriminant
analysis. Advances in Methodology and Statistics, 1(1), 143-161.
https://doi.org/10.51936/ayrt6204
Siegel, R. L., Miller, K. D., & Jemal, A. (2016). Cancer statistics, 2016. CA Cancer J Clin, 66(1),
7-30. https://doi.org/10.3322/caac.21332
Siegel, R. L., Miller, K. D., & Jemal, A. (2017). Cancer Statistics, 2017. CA Cancer J Clin, 67(1),
7-30. https://doi.org/10.3322/caac.21387
Siegel, R. L., Miller, K. D., Wagle, N. S., & Jemal, A. (2023). Cancer statistics, 2023. CA Cancer
J Clin, 73(1), 17-48. https://doi.org/10.3322/caac.21763
Sohail, S. K., Sarfraz, R., Imran, M., Kamran, M., & Qamar, S. (2020). Estrogen and Progesterone
Receptor Expression in Breast Carcinoma and Its Association With Clinicopathological
Variables Among the Pakistani Population. Cureus, 12(8), e9751.
https://doi.org/10.7759/cureus.9751
TENG, J. (2019). SEER Breast Cancer Data. https://dx.doi.org/10.21227/a9qy-ph35
Tu, J. V. (1996). Advantages and disadvantages of using artificial neural networks versus logistic
regression for predicting medical outcomes. J Clin Epidemiol, 49(11), 1225-1231.
https://doi.org/10.1016/s0895-4356(96)00002-9
Yawen, L., Liu, Y., Bohan, Y., Ning, W., & Tian, W. (2019). Application of interpretable machine
learning models for the intelligent decision. 333, 273-283.
Yue, W., Wang, Z., Chen, H., Payne, A., & Liu, X. (2018). Machine Learning with Applications in
Breast Cancer Diagnosis and Prognosis. Designs, 2(2).
https://doi.org/10.3390/designs2020013
Zhang, C., & Ma, Y. (2012). Ensemble Machine Learning. https://doi.org/10.1007/978-1-4419-
9326-7
Abstract
Breast cancer, being one of the most prevalent malignancies among women, imposes significant physiological and psychological burdens on patients and their families. In the pursuit of enhanced understanding and prediction of breast cancer incidence, the utilization of machine learning techniques for factor analysis has become increasingly pivotal. The core objective of the study is to unveil the intricately linked elements contributing to breast cancer onset through the application of diverse machine learning methodologies, thus providing a scientific basis for early prediction and intervention. This study encompasses six classical machine learning methods, including Logistic Regression, Support Vector Classifier (SVC), Random Forest, Decision Tree, Adaboost Classifier, and Gradient Boosting Classifier. The outcomes of this research hold significance not only for early prediction and intervention in breast cancer but also methodologically, offering insights for further investigation into breast cancer factors. By dissecting the application of machine learning techniques in the analysis of breast cancer factors, this study provides valuable insights into the medical domain and offers inspiration for future research directions.
Asset Metadata
Creator
Zhang, Chuhan (author)
Core Title
Analysis of factors associated with breast cancer using machine learning techniques
School
Keck School of Medicine
Degree
Master of Science
Degree Program
Applied Biostatistics and Epidemiology
Degree Conferral Date
2023-12
Publication Date
12/01/2023
Defense Date
11/30/2023
Publisher
Los Angeles, California (original); University of Southern California (original); University of Southern California. Libraries (digital)
Tag
breast cancer, decision tree, machine learning, prediction and intervention of breast cancer, random forest
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Li, Ming (committee chair), Piao, Jin (committee member), Siegmund, Kimberly (committee member)
Creator Email
cathiezch@gmail.com,chuhanz@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113781966
Unique identifier
UC113781966
Identifier
etd-ZhangChuha-12511.pdf (filename)
Legacy Identifier
etd-ZhangChuha-12511
Document Type
Thesis
Rights
Zhang, Chuhan
Internet Media Type
application/pdf
Type
texts
Source
20231205-usctheses-batch-1111 (batch); University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu