Application of Statistical Learning on Breast Cancer Dataset
by
Jing Jin
A Thesis Presented to the
FACULTY OF THE USC DANA AND DAVID DORNSIFE COLLEGE OF
LETTERS, ARTS AND SCIENCES
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(APPLIED MATHEMATICS)
May 2024
Copyright 2024 Jing Jin
Acknowledgements
I would like to thank my committee members, Professor Sergey Lototsky, Professor Jianfeng Zhang, and Professor Ricardo Mancera.
I would like to thank all of my friends.
Finally, I would like to thank my family and myself for making all this happen.
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Breast cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.3 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.4 LASSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Project Scope and Significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Chapter 2: Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 3: Wisconsin Breast Cancer Data set . . . . . . . . . . . . . . . . . . . . . . . . . 9
Chapter 4: Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1 Feature Selection Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1.1 Analysis of Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1.2 LASSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2 Model Selection Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2.2 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.3 Some model training details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.4 Experiment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.4.1 Objective and Response . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.4.2 Factors and Levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.5 Experiment Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Chapter 5: Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.1 Model training results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.1.1 Logistic Regression with all features . . . . . . . . . . . . . . . . . . . . . 20
5.1.2 Logistic Regression with features selected by LASSO . . . . . . . . . . . . 20
5.1.3 Random Forest with all features . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.4 Random Forest with features selected by LASSO . . . . . . . . . . . . . . 23
5.2 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.2.1 One-way Layout for Feature selection . . . . . . . . . . . . . . . . . . . . 25
5.2.2 One-way Layout for Model Selection . . . . . . . . . . . . . . . . . . . . 25
5.2.3 Two-way layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Chapter 6: Conclusion and ongoing work . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
List of Figures
4.1 ANOVA Feature selection result . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2 AIC of Logistic Regression by features selected by LASSO . . . . . . . . . . . . . 13
4.3 Table used for DoE(Partially) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.1 ROC/AUC for Logistic Regression with all features . . . . . . . . . . . . . . . . . 21
5.2 ROC/AUC for Logistic Regression with features selected by LASSO . . . . . . . . 21
5.3 ROC/AUC for Random Forest with all features . . . . . . . . . . . . . . . . . . . 22
5.4 ROC/AUC for Random Forest with features selected by LASSO . . . . . . . . . . 23
5.5 One-way layout for feature selection . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.6 One-way Layout for Model Selection . . . . . . . . . . . . . . . . . . . . . . . . 26
5.7 Two-way layout Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Abstract
This study investigates the impact of model selection and feature selection on the predictive performance of models for breast cancer diagnosis. This study uses the Wisconsin Breast Cancer Dataset and
applies Design of Experiments (DOE) methodologies to systematically explore different combinations of models and feature selection methods. The results show that the choice of model (Logistic
Regression or Random Forest) has a significant effect on the predictive power of the model, as
measured by the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC).
However, the feature selection method and the interaction between the model and feature selection
method do not have a statistically significant impact on the AUC. This suggests that model selection is critical in optimizing performance, while the choice of feature selection method may not be
as important in this context. The study concludes by suggesting future research directions, such
as investigating alternative feature selection methods, assessing model robustness with different
datasets, and exploring more advanced machine learning techniques.
Chapter 1
Introduction
1.1 Background
1.1.1 Breast cancer
Breast cancer remains one of the most pervasive and challenging diseases [1][2], demanding precise
diagnostic tools for effective treatment planning. As such, the development of predictive models
using machine learning offers promising avenues for advancements in diagnosis and prognosis.
However, the construction of these models is complex, involving crucial decisions about feature
selection and modeling techniques that significantly impact their performance and utility in clinical
practice.
1.1.2 Logistic Regression
Logistic Regression is a commonly used prediction model for classification problems. James et al. (2021) [3] used a dataset named ’default’, which has a binary response variable taking the values ’Yes’ or ’No’, to illustrate how logistic regression is utilized to estimate the likelihood of default (the predicted value) based on certain predictors. The likelihood at a certain
‘balance’ is denoted as Pr(default = Yes | balance). This probability, abbreviated as p(balance),
is expected to lie in the interval [0, 1]. Here ’balance’ is the predictor used to determine whether the predicted outcome is ’Yes’ or ’No’; a threshold is then applied to this probability. Normally, 0.5 is a widely used threshold, in which case if p(balance) > 0.5, one might predict a default status of ‘Yes’.
James et al. (2021) [3] further illustrated how one might model the relationship between p(X) = Pr(Y = 1 | X) and the predictor ‘X’, where ‘Y’ is coded generically as 0 or 1. A linear regression model might initially be considered, p(X) = β0 + β1X. However, this model has a limitation: it can yield probabilities less than 0 or greater than 1 for certain values of ‘X’, which does not align with a true probability, since probabilities are bounded by [0, 1].
To solve this issue, p(X) must be modeled using a function that ensures outputs remain within
the [0, 1] range for any value of ‘X’. The logistic function serves this purpose, given by:
$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}.$$
The maximum likelihood method is used to fit this model. As illustrated, for smaller values
of ‘X’, logistic regression ensures the predicted probability is close to zero. For higher values, it
approaches but never exceeds one.
After some rearrangement of the logistic model, James et al. (2021) [3] derive the odds, $\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}$, which can vary from 0 to infinity. The odds ratio is a measure used to compare the likelihood of a certain event happening in two different groups. It is often used in
medical studies, for example to understand if a treatment increases or decreases the chances of a
specific outcome such as recovery from a disease. If the odds ratio is greater than 1, the event is
more likely in the first group; if it’s less than 1, it’s more likely in the second group; and if it’s
exactly 1, the odds are the same in both groups.
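As a small worked example (not taken from the source): if the event probability is $p_1 = 0.6$ in the first group and $p_2 = 0.3$ in the second, then

$$\text{OR} = \frac{p_1/(1-p_1)}{p_2/(1-p_2)} = \frac{0.6/0.4}{0.3/0.7} = \frac{1.50}{0.43} \approx 3.5,$$

so the odds of the event are roughly 3.5 times higher in the first group.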
Taking the logarithm of the odds yields the logit function, which linearly relates to ‘X’ in
logistic regression:
$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X.$$
While in linear regression β1 signifies the average change in ‘Y’ with a one-unit increase in ‘X’,
in logistic regression it denotes the change in the log odds. Consequently, the odds are multiplied
by $e^{\beta_1}$ for each unit increase in ‘X’. Nevertheless, due to the nonlinear relation between p(X) and
‘X’, β1 does not represent the change in p(X) per unit increase in ‘X’, as this is influenced by the
current value of ‘X’. Regardless, a positive β1 implies that increasing ‘X’ increases p(X), while a
negative β1 indicates the opposite.
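As a minimal numerical sketch (in Python, with illustrative coefficient values that are not estimates from this thesis), the logistic function can be evaluated directly to confirm that its output always stays inside [0, 1]:

```python
import numpy as np

def logistic(x, beta0=-10.65, beta1=0.0055):
    # p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X));
    # the coefficients are purely illustrative, not fitted values from the thesis.
    z = beta0 + beta1 * x
    return np.exp(z) / (1 + np.exp(z))

balances = np.array([0, 1000, 2000, 3000])
print(logistic(balances))  # every value lies strictly between 0 and 1
```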
1.1.3 Random Forest
The Random Forest algorithm, developed by L. Breiman in 2001 [4], offers a fresh perspective
compared to traditional logistic regression. It has been extremely successful thanks to its high accuracy and interpretability. This method effectively combines multiple tree models, each built on a
randomly selected set of data, to enhance prediction quality in both classification and regression
tasks.
Random Forest maintains the advantages of decision trees and enhances performance by employing bagging with samples, selecting random variable subsets, and utilizing a majority voting
system [4]. It effectively manages various types of data, including missing values and different
variable types (like continuous, binary, and categorical), making it ideal for modeling complex,
high-dimensional data. Unlike traditional decision trees, Random Forest doesn’t require tree pruning, as its ensemble approach and bootstrapping techniques naturally prevent overfitting. The
impressive effectiveness of Random Forest has spurred ongoing research into its variants within
the field of computational biology [5].
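The mechanisms described above (bagging over samples, random variable subsets at each split, and majority voting across trees) correspond directly to parameters of a typical implementation. A minimal sketch, assuming scikit-learn and synthetic data rather than anything from the thesis:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=30, random_state=0)

# bootstrap=True -> bagging over samples; max_features="sqrt" -> a random
# subset of variables at each split; predict() aggregates the trees by
# majority vote for classification.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            bootstrap=True, random_state=0)
rf.fit(X, y)
print(rf.predict(X[:5]))
```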
1.1.4 LASSO
LASSO, the least absolute shrinkage and selection operator, was introduced by the statistician Robert Tibshirani in 1996 [6]. It was proposed to improve the prediction accuracy and interpretability of regression models by reducing the set of covariates used in a model.
Tibshirani introduced the Lasso’s basic form as follows [6]: suppose there is a sample consisting of N cases, each consisting of p covariates and a single outcome. Let $y_i$ be the outcome and $x_i := (x_{i1}, x_{i2}, \dots, x_{ip})^T$ the covariate vector for the i-th case. Let $\beta_0$ be the constant coefficient, $\beta := (\beta_1, \beta_2, \dots, \beta_p)$ the coefficient vector, and s a free parameter that determines the lasso coefficient estimates.
The objective function of the Lasso is:
$$\min_{\beta_0, \beta} \left\{ \sum_{i=1}^{N} \big( y_i - \beta_0 - x_i^T \beta \big)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s.$$
Here $x_{ij}$ are the elements of the covariate matrix X, with $X_{ij} = (x_i)_j$ so that $x_i^T$ is the i-th row of X, and the expression can be transformed as follows:
$$\min_{\beta_0, \beta} \; \lVert y - \beta_0 - X\beta \rVert_2^2 \quad \text{subject to} \quad \lVert \beta \rVert_1 \le s,$$
where $\lVert \alpha \rVert_p = \left( \sum_{i=1}^{N} |\alpha_i|^p \right)^{1/p}$ is the standard $\ell^p$ norm.
Let $\bar{x}$ be the mean of the data points $x_i$ and $\bar{y}$ the mean of the response variables $y_i$. Then the resulting estimate for $\beta_0$ is $\hat{\beta}_0 = \bar{y} - \bar{x}^T \beta$. In other words:
$$y_i - \hat{\beta}_0 - x_i^T \beta = y_i - (\bar{y} - \bar{x}^T \beta) - x_i^T \beta = (y_i - \bar{y}) - (x_i - \bar{x})^T \beta,$$
and therefore it is standard to work with variables that have been made zero-mean. Additionally, the covariates are typically standardized so that $\sum_{i=1}^{N} x_{ij}^2 = 1$ and the solution does not depend on the measurement scale.
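A minimal sketch of the Lasso in practice, assuming scikit-learn (which solves the equivalent penalized form of the constrained problem above) and synthetic data; this is not the thesis's own code:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Standardize covariates so the solution does not depend on measurement
# scale, mirroring the sum_i x_ij^2 = 1 convention above (up to a factor of N).
X_std = StandardScaler().fit_transform(X)

# scikit-learn minimizes (1/2N) * ||y - X beta||_2^2 + alpha * ||beta||_1,
# the Lagrangian form of the constraint ||beta||_1 <= s.
lasso = Lasso(alpha=1.0).fit(X_std, y)
print("nonzero coefficients:", np.flatnonzero(lasso.coef_))
```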
1.2 Motivations
The motivation of this study arises from the imperative to enhance the accuracy and interpretability
of predictive models for breast cancer. We aim to systematically evaluate how different machine
learning models and feature selection methods influence the diagnostic accuracy of such models. In
particular, we focus on the Area Under the Receiver Operating Characteristic Curve (AUC-ROC)
as a measure of performance, given its importance in medical decision making.
The ultimate goal is to aid medical professionals in making more informed decisions, leading to
improved patient outcomes in the fight against breast cancer. Central to this study is the exploration
of how different modeling techniques and feature selection methods influence the accuracy of these
models.
In the medical field, particularly in breast cancer diagnosis, the selection of features and choice
of predictive models are critical. The efficiency and accuracy of these tools determine the nature
and severity of the cancer, directly impacting patient care and treatment planning.
1.3 Project Scope and Significance
In this project, I delve into the application of Design of Experiments (DOE) for predictive model
building in the Wisconsin Breast Cancer Dataset (WBCD). WBCD provides a robust foundation
for this investigation, offering a well-defined set of features and cases for analysis.
The primary aim is to implement DOE methodologies to evaluate the effectiveness of various
features and model configurations. This approach allows for an in-depth examination of causal
relationships and interactions between predictors and the model’s performance, with a special focus
on the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC). By concentrating on streamlined and relevant feature sets, the study aspires to develop a decision-making
tool that balances practicality with robustness. The significance of this research lies in its potential
to refine the diagnostic process, leading to more rapid and accurate diagnoses, thereby benefiting
medical practitioners and enhancing patient care.
In summary, through detailed analysis and optimization across different model building approaches, this study aims to provide insightful revelations on the impact of feature and model selection on the performance of breast cancer predictive models. The anticipated outcome is a model
that is not only statistically sound but also clinically relevant, ensuring a harmonious blend of
accuracy and interpretability.
Chapter 2
Literature Review
The application of logistic regression and machine learning in medical diagnostics has gained considerable attention in recent years, particularly in the context of breast cancer prediction. Ahmed F.
S. and Shawky, D.M. [7] presented a pivotal study on the use of Logistic Regression for automatic
breast cancer diagnosis. Their work underscores the potential of logistic regression in achieving
high accuracy in differentiating between benign and malignant cases, a key aspect that aligns with
the objectives of our current study.
Wang et al. [8] explored a combined approach using Logistic Regression and Artificial Neural
Networks for chronic disease prediction. Their methodology, focused on hypertension, provides
insights into the integration of traditional statistical methods with advanced machine learning techniques. This hybrid approach resonates with our study’s exploration of feature selection methods
and their impact on model performance, particularly in enhancing predictive accuracy.
In a similar vein, Haq et al. [9] proposed a hybrid framework employing various machine
learning algorithms for heart disease prediction. Their research highlights the importance of algorithmic synergy in medical diagnostics, an aspect that is central to our study’s exploration of
different predictive models in breast cancer diagnosis.
Recently, Random Forest has also been widely used in the medical field. Rigatti [10] used the Random
Forest algorithm for survival analysis, highlighting its advantage in detecting complex interactions
and nonlinear effects without the predefined specifications required by traditional methods like the
Cox model. ”Random Forest for Bioinformatics” by Yanjun Qi [11] delves into the application
of the Random Forest algorithm in bioinformatics, highlighting its efficacy in handling complex
biological data characterized by small sample sizes and high-dimensional spaces, and its unique
advantages in feature selection and interaction analysis within computational biology.
These studies collectively demonstrate the relevance of advanced predictive modeling techniques in medical diagnostics. They provide a foundation for this research, yet there remains
a significant gap in the literature regarding the systematic application of Design of Experiments
(DOE) methodologies in this domain. Specifically, there is a lack of focused investigation into
how DOE can be employed to evaluate and optimize feature selection and model configurations in
breast cancer prediction. This is precisely where this study aims to contribute, bridging this gap by
leveraging DOE approaches to explore these crucial aspects of predictive modeling.
Chapter 3
Wisconsin Breast Cancer Data set
This study uses the Wisconsin Breast Cancer Dataset (WBCD), an influential dataset in the field of medical machine learning, particularly in breast cancer research. Developed by Dr. William H. Wolberg at the University of Wisconsin Hospitals, the dataset was made available online in 1992 [12]. It consists of nuclear features derived
from Fine Needle Aspirate (FNA) biopsy test results of breast tissues. They describe characteristics
of the cell nuclei present in the image.
The WBCD encompasses a total of 569 samples, comprising 357 benign (62.7%) and 212 malignant (37.3%)
malignant (37.3%) cases. Each sample in the dataset is described by a set of 32 attributes. The first
attribute is the sample ID, the second denotes the diagnostic result (benign or malignant), and the
remaining 30 attributes are numerical features that offer detailed information about the cell nuclei
characteristics.
These features include the following key measurements for each cell nucleus:
• Radius: Mean of distances from the center to points on the perimeter.
• Texture: Standard deviation of gray-scale values.
• Perimeter.
• Area.
• Smoothness: Local variation in radius lengths.
• Compactness: Calculated as the ratio of perimeter squared to area.
• Concavity: Severity of concave portions of the contour.
• Concave Points: Number of concave portions of the contour.
• Symmetry.
• Fractal Dimension: A measure of the “coastline approximation”.
For each of these features, the mean, standard error, and the ’worst’ or largest value (mean of the three largest values) were computed, resulting in 30 different feature measurements for each sample. Further details of the statistical analysis results are attached in the appendix.
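The same 569-sample, 30-feature data set is also distributed with scikit-learn (without the ID column); a minimal loading sketch, assuming that packaged copy matches the WBCD version used in the thesis:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target       # 569 rows, 30 numeric features
print(X.shape)                      # (569, 30)
print(data.target_names)            # ['malignant' 'benign']
print(X.columns[:3].tolist())       # ['mean radius', 'mean texture', 'mean perimeter']
```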
This comprehensive dataset offers an extensive foundation for analyzing and predicting breast
cancer malignancy. Its richness and complexity make it an ideal candidate for feature selection
processes. This study leverages the Wisconsin Breast Cancer Dataset in its application of Design of Experiments (DOE) methodologies. The objective is to optimize predictive modeling in
breast cancer diagnosis by systematically exploring and selecting the most significant features.
This approach aims to enhance the accuracy and efficiency of the models, thereby contributing
significantly to the field of medical diagnostics.
Chapter 4
Methodology
4.1 Feature Selection Techniques
4.1.1 Analysis of Variance
Analysis of Variance (ANOVA) is a statistical method used to compare the means of three or more
samples to determine if at least one sample mean is significantly different from the others. When
employed as a feature selection technique in the context of predictive modeling, ANOVA involves
fitting a regression model to all predictors and examining the statistical significance of each feature
through their p-values.
In this study, ANOVA was leveraged to discern the impact of individual features on the prediction of breast cancer outcomes. By evaluating the p-values in the ANOVA table, we can identify
which predictors have a statistically significant effect on the dependent variable. A low p-value
indicates that a feature has a meaningful contribution to the model, helping to refine the feature set
for more accurate predictions.
The results from ANOVA (Figure 4.1) show that all features are significant. Consequently, it was determined that all features would be retained for the subsequent training of the logistic regression and random forest models.
Figure 4.1: ANOVA Feature selection result
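A minimal sketch of this kind of per-feature significance screen, assuming scikit-learn's ANOVA F-test rather than the exact regression-based procedure used in the thesis:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# One-way ANOVA F-test of each feature against the benign/malignant label.
F, p = f_classif(X, y)
for name, pval in zip(X.columns, p):
    flag = "significant" if pval < 0.05 else "not significant"
    print(f"{name:25s} p = {pval:.3e} ({flag})")
```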
4.1.2 LASSO
The second feature selection method used in this study is the Least Absolute Shrinkage and Selection Operator (LASSO). The LASSO method
represents a refined approach to feature selection, particularly valued for its efficiency in simplifying complex models while retaining predictive power. It operates by imposing an L1 penalty on
the regression model, which encourages a sparse solution: coefficients of less influential variables
are shrunk towards zero. This penalty term is a critical component of the LASSO method, as it
directly influences the model by eliminating non-contributory predictors, effectively performing
feature selection while the model is being fitted.
This approach not only enhances the prediction accuracy but also improves the model’s interpretability, crucial in the healthcare context. Specific thresholds and parameters were adjusted to
optimize the balance between model complexity and prediction accuracy.
Figure 4.2: AIC of Logistic Regression by features selected by LASSO
According to the results from LASSO, the following 13 variables are selected: texture worst, compactness se, radius se, symmetry worst, concavity worst, area worst, concave points worst, smoothness worst, concave points mean, smoothness se, perimeter worst, concavity mean, fractal dimension mean. Figure 4.2
presents a visualization of the stepwise changes in the Akaike Information Criterion (AIC) values
for a Logistic Regression model, as determined by the sequential inclusion of important features
selected via LASSO. Each plotted point on the graph corresponds to the AIC value of a model that
includes an additional feature from the prioritized list derived by LASSO. The first data point reflects the AIC of the model using only the most important feature, the second data point represents
the AIC when the two most important features are used, and so on. Consequently, the third data
point includes the top three features, with each subsequent point adding one more feature in order
of their selection by LASSO. This incremental approach illustrates the impact of each feature on
the model’s information criterion, providing insight into how the combination of selected features
influences the overall model fit.
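A minimal sketch of the stepwise-AIC procedure described above, assuming an L1-penalized logistic regression to rank features and statsmodels to compute the AIC of each nested model; the ranking and tooling are assumptions, not the thesis's exact implementation:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
X_std = StandardScaler().fit_transform(X)

# Rank features by the magnitude of their L1-penalized coefficients.
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_std, y)
order = np.argsort(-np.abs(l1.coef_[0]))

# Refit a plain logistic regression on the top-1, top-2, ... features and
# record the AIC of each nested model, as in Figure 4.2.
for k in range(1, 11):
    cols = order[:k]
    fit = sm.Logit(y, sm.add_constant(X_std[:, cols])).fit(disp=0)
    print(k, "features -> AIC =", round(fit.aic, 2))
```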
4.2 Model Selection Techniques
4.2.1 Logistic Regression
This study chose Logistic Regression as its primary model since Ahmed F. S. and Shawky, D. M. [7], Haq et al. [9], and other existing literature have already verified the effectiveness of this model. It is often used for modeling and analyzing datasets where the outcome variable is binary.
Logistic Regression was selected due to its widespread use in healthcare, particularly for its
interpretability, a crucial factor in medical decision-making. This model’s ability to provide insight
into the relationship between features and the binary outcome (benign or malignant) makes it
particularly suitable for breast cancer prediction. In this study, Logistic Regression with all features is treated as a baseline model for comparison with other combinations.
4.2.2 Random Forest
Random Forest, a powerful ensemble learning technique, was selected as a complementary model
in our study, primarily due to its exceptional capability to process complex datasets that exhibit
non-linear relationships among features. This method constructs a multitude of decision trees
during training and outputs the class that is the mode of the classes (classification) of the individual
trees. The strength of Random Forest lies in its ensemble approach, which combines the predictions
of multiple decision trees to produce a more accurate and stable prediction than any single tree
could achieve.
In the context of this study, with a comprehensive set of 30 features to consider, Random Forest
stands out as a particularly suitable choice. Its inherent capacity to handle a large number of input
variables allows us to utilize the full spectrum of available data without the need for prior feature
reduction. Moreover, its method of randomly selecting subsets of features at each split point in
the decision trees makes it less likely to overfit to our training data compared to models that might
consider all possible feature splits.
As we go further into the study, the design of experiments (DoE) methodology will be pivotal
in validating our hypothesis that Random Forest can outperform other models when managing
datasets rich in features. Through this approach, we aim to empirically ascertain the model’s
effectiveness and establish a clear understanding of how feature selection influences the predictive
power of our chosen models.
4.3 Some model training details
Only one irrelevant column was omitted during data pre-processing. There is no missing data in the dataset, so no data imputation or further pre-processing was involved. The dataset was split into a training set (75%) and a testing set (25%), ensuring a representative sample of the overall dataset. In the training phase, both Logistic Regression and Random Forest models are fitted with their default hyperparameters.
The accuracy of each model was assessed on the testing set, focusing on the Area Under the
Curve (AUC) of the Receiver Operating Characteristic (ROC) curve. The AUC is a pivotal metric
in this context, as it enables us to quantitatively measure the models’ accuracy and dependability
in differentiating benign and malignant breast cancer cases. By emphasizing the AUC in our
evaluation, we ensure that the developed predictive models are not only statistically robust but also
hold significant clinical relevance and effectiveness. This approach aligns with the overarching
goal of the study to create models that are as practical in real-world medical diagnostics as they
are rigorous in their statistical foundation.
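A minimal sketch of the training protocol described in this section (75/25 split, default hyperparameters apart from a raised iteration limit so the logistic fit converges, AUC on the test set), assuming scikit-learn:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)  # 75% train / 25% test

for name, model in [("Logistic Regression", LogisticRegression(max_iter=5000)),
                    ("Random Forest", RandomForestClassifier())]:
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.4f}")
```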
4.4 Experiment Design
Design of experiments is a popular approach to optimizing a system in various research areas. Model optimization is a natural objective for design of experiments, and feature selection/model selection can be seen as procedures that improve the model’s accuracy. Thus, different model and feature combinations can be seen as the factors of the experiment. By modifying these factors, the machine
learning model or algorithm implemented on the breast cancer dataset can be optimized.
4.4.1 Objective and Response
In our study of model and feature selections, the primary focus centers on identifying significant
differences across various dimensions. These include: 1) assessing the effectiveness of different
feature selection methods; 2) evaluating the performance of two distinct predictive models; and 3)
exploring potential interactions between feature selection methods and model performance. This
multifaceted approach is designed to dissect and understand the intricate dynamics of model and
feature selection in breast cancer prediction.
The response in this study is the ROC/AUC value.
The methodology is structured as a comprehensive analysis, where each aspect of the study -
feature selection, model performance, and their interaction - contributes to a holistic understanding of the predictive modeling process. The initial phase of the study emphasizes identifying the
most impactful features, setting the stage for a deeper examination of model performances. Subsequently, the interaction analysis between chosen features and models provides nuanced insights, enriching our understanding of the synergy between different components in predictive modeling.
This layered approach not only enhances model accuracy but also offers valuable insights into the
mechanics of model and feature selection in medical diagnostics.
4.4.2 Factors and Levels
Table 4.1: Factors and levels
Factors            Level 1                Level 2
Feature Sets       All features           Features selected by LASSO
Model Selection    Logistic Regression    Random Forest
This study examines two primary factors, each with two distinct levels. The first factor concerns
feature selection methods. We explore two different feature sets: one derived from ANOVA, which
includes all available features, and another refined through LASSO, featuring a subset of selected
predictors. This contrast aims to ascertain the impact of comprehensive versus selective feature
inclusion on model performance.
The second factor centers around the choice of predictive models. Logistic Regression is incorporated due to its prevalent use in medical and machine learning contexts, prized for its interpretability. In contrast, Random Forest is chosen for its robustness and high accuracy. This pairing of models allows us to evaluate how each, and their combinations, perform in terms of their
AUC values, thereby shedding light on the efficacy of different model types under varying feature
selection scenarios. The study’s focus extends to understanding the influence of these factors, both
individually and collectively, on the predictive accuracy in breast cancer diagnosis.
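Because both factors have two levels, the full factorial design consists of four model/feature-set combinations, which can be enumerated directly; a small sketch with illustrative labels:

```python
from itertools import product

feature_sets = ["all features", "LASSO-selected features"]
models = ["Logistic Regression", "Random Forest"]

# The 2x2 full factorial design crosses every level of each factor.
for run, (fs, m) in enumerate(product(feature_sets, models), start=1):
    print(f"run {run}: features = {fs:25s} model = {m}")
```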
4.5 Experiment Implementation
The experiment was conducted using a structured approach designed to assess the impact of various
model and feature selection combinations. As previously mentioned, a dedicated dataset (Figure 4.3)
was created specifically for the Design of Experiments (DOE) framework. This dataset facilitated
the evaluation of different combinations of models and feature selection methods.
Figure 4.3: Table used for DoE (partially)
For each combination of model and feature selection, the model was executed ten times. This
repetitive approach was adopted to generate a diverse set of AUC values, ensuring a robust assessment of model performance. The variability inherent in these multiple runs provided a comprehensive view of each combination’s efficacy under different iterations.
To analyze the results, we employed a one-way layout analysis. This method was instrumental
in determining if there were significant differences in performance between the two feature selection methods (ANOVA-selected features vs. LASSO-selected features) and between the two
predictive models (Logistic Regression and Random Forest).
Further, a two-way layout analysis was utilized to explore potential interactions between the
feature selection methods and the predictive models. This analysis was crucial in understanding
how the choice of feature set interacts with the choice of model and what impact this interaction has
on the AUC values. The combination of one-way and two-way layouts provided a comprehensive
understanding of both individual and interactive effects on model performance in breast cancer
diagnosis.
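A minimal sketch of this analysis pipeline, assuming statsmodels for the ANOVA tables; the AUC values below are placeholders standing in for the ten replicate runs per combination, not the thesis's recorded results:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Placeholder AUCs for ten replicate runs per model/feature-set combination;
# in the thesis these values come from the table shown in Figure 4.3.
rng = np.random.default_rng(0)
base = {"LR": 0.93, "RF": 0.98}
rows = [{"model": m, "features": f, "auc": base[m] + rng.normal(0, 0.01)}
        for m in ("LR", "RF") for f in ("all", "lasso") for _ in range(10)]
records = pd.DataFrame(rows)

# One-way layouts: each factor considered on its own.
print(sm.stats.anova_lm(ols("auc ~ C(features)", data=records).fit(), typ=1))
print(sm.stats.anova_lm(ols("auc ~ C(model)", data=records).fit(), typ=1))

# Two-way layout with the model x feature-selection interaction (Figure 5.7).
print(sm.stats.anova_lm(ols("auc ~ C(model) * C(features)", data=records).fit(), typ=2))
```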
Chapter 5
Results
5.1 Model training results
5.1.1 Logistic Regression with all features
Figure 5.1 illustrates the ROC/AUC curve for a Logistic Regression model utilizing the full set
of features. An AUC value of 93.22% signifies a strong predictive capability, suggesting that the
model is able to distinguish between classes with high accuracy. However, it is crucial to consider
the practical application of such a model in clinical settings. While a comprehensive feature set
may contribute to a high degree of accuracy, it may also lead to a model that is complex and
difficult to interpret. Clinicians require models that not only predict with high accuracy but are
also interpretable and actionable. Therefore, the inclusion of every available feature may not be
the most effective approach for developing a decision-making tool for breast cancer diagnosis.
The trade-off between model accuracy and interpretability must be carefully balanced to ensure
the tool’s utility and acceptance in a real-world medical environment.
5.1.2 Logistic Regression with features selected by LASSO
Figure 5.2 illustrates the ROC/AUC curve for a Logistic Regression model leveraging a feature set
refined by LASSO. The AUC value, having ascended from 93.22% to 99.77%, indicates that the
model achieves superior predictive accuracy with a reduced subset of predictors.
Figure 5.1: ROC/AUC for Logistic Regression with all features
Figure 5.2: ROC/AUC for Logistic Regression with features selected by LASSO
Figure 5.3: ROC/AUC for Random Forest with all features
This outcome aligns with the requirements of clinical settings, where models must balance
simplicity with predictive precision. Naturally, we might hypothesize that within the Wisconsin
Breast Cancer dataset, the choice of feature set has a substantial effect on model accuracy. Yet, to
establish the statistical significance of these findings, further examination needs to be done. Such
analysis would not only corroborate the preliminary observations but also reinforce the validity
of feature selection methods like LASSO in enhancing model performance in medical diagnostic
tools.
5.1.3 Random Forest with all features
Figure 5.3 shows the ROC/AUC curve for a Random Forest model using the complete set of features. The AUC of this model stands at 98.59%, which is a notable enhancement over the baseline
Logistic Regression model that recorded an AUC of 93.22%. This increase signifies that the predictive accuracy of the model has improved with the application of a different machine learning
algorithm.
Figure 5.4: ROC/AUC for Random Forest with features selected by LASSO
The goal of this project is to augment model accuracy by exploring various modeling approaches. Therefore, the implementation of different models with the potential for high accuracy, such as Random Forest, is imperative. The improved AUC observed in Figure 5.3 supports the hypothesis
that the choice of the predictive model influences the AUC values. To substantiate this hypothesis
and fully assess the impact of different models on predictive accuracy, additional rigorous analysis
is required.
5.1.4 Random Forest with features selected by LASSO
Figure 5.4 depicts the ROC/AUC curve for a Random Forest model that utilizes features selected by
LASSO. With an AUC of 98.59%, this model demonstrates a significant improvement in predictive
accuracy over the baseline model, which is a Logistic Regression using all features with an AUC
of 93.22%. However, it is noteworthy that this AUC does not show a considerable variance from
the Random Forest model using the full feature set. Consequently, these ROC/AUC results alone
do not conclusively determine the most optimized combination of model and feature selection for
the Wisconsin Breast Cancer Dataset.
The subsequent phase of this project will employ the Design of Experiments (DoE) methodology to rigorously evaluate whether there is a significant interaction effect between feature selection
methods and predictive modeling techniques on the AUC values. The integration of Random Forest with LASSO-selected features introduces a diverse dimension to the study, paving the way for
a comprehensive analysis of model-feature interactions.
5.2 Main results
This section delves into the results and analysis derived from the experiments, with a focus on elucidating the advantages of the Design of Experiments (DOE) approach in enhancing the accuracy
of predictive models in breast cancer diagnosis. It is important to note that while the overarching
methodological framework remains consistent, the specific outcomes and insights gleaned may
differ due to the variability in models and features employed in the model building process.
A key aspect of this analysis revolves around the Area Under the Curve (AUC) of the Receiver
Operating Characteristic (ROC). The AUC serves as a pivotal metric in evaluating the efficacy of
the models. It provides a quantifiable measure of how well each model, influenced by the chosen
features, can discriminate between benign and malignant cases.
The DOE methodology plays a crucial role in this study, as it allows for a systematic exploration of how the different factors, namely the model selection and feature selection methods, influence the AUC values. Through this approach, we are able to discern the relative impact
of these factors on model performance. This is not just about achieving the highest possible AUC,
but also about gaining deeper insights into how different algorithmic choices and data characteristics influence the model’s predictive capabilities.
Ultimately, the results from this DOE approach provide valuable insights into the complex
interplay between feature selection, model choice, and predictive accuracy. This analysis helps to identify which factors are most critical in optimizing the AUC, thereby guiding the development of more effective and accurate predictive models for breast cancer diagnosis.
5.2.1 One-way Layout for Feature selection
Figure 5.5: One-way layout for feature selection
From the ANOVA of the one-way layout for feature selection (Figure 5.5), the F-statistic value of
0.106 is quite low, indicating that the variance explained by the feature selection method is not
significantly greater than the variance within the groups. Furthermore, the high p-value of 0.747,
which is much greater than the common alpha level of 0.05, leads to the conclusion that there is
no significant difference in AUC scores between the two feature selection methods. In the context
of model development, this implies that the predictive performance, as measured by AUC, is not
dependent on whether all features or only those selected by LASSO are used.
5.2.2 One-way Layout for Model Selection
The ANOVA test results (Figure 5.6) show an F-statistic value of approximately 40.31 and a highly
significant p-value of 1.89e-07, which is well below the conventional alpha level of 0.05. This
indicates that there is a statistically significant difference in the AUC scores for the two models
tested. The triple asterisks (***) highlight the strong statistical significance.
Figure 5.6: One-way Layout for Model Selection
Figure 5.7: Two-way layout Results
Given the significant F-statistic and the associated p-value, we can conclude that the choice
of model (whether Logistic Regression or Random Forest) has a profound effect on the predictive
power of the model as measured by AUC. This suggests that when considering model accuracy for
breast cancer prediction, model selection is a critical factor that should be carefully considered to
optimize performance.
5.2.3 Two-way layout
The two-way ANOVA conducted to evaluate the effects of model type, feature selection method,
and their interaction on the AUC yielded insightful results (Figure 5.7). The model factor demonstrated a significant influence on the AUC, with an F-value of approximately 38.866 and a p-value of 3.38e-07, which is highly significant (***). In contrast, the feature selection method did not show
a significant effect on AUC, with an F-value of approximately 0.209 and a non-significant p-value
of 0.650. The interaction between model and feature selection method also did not significantly
affect AUC, indicated by an F-value of approximately 0.432 and a p-value of 0.515.
These results indicate that while the choice of model plays a crucial role in the predictive
performance as measured by AUC, the feature selection method and the interaction between the
model and feature selection method do not have a statistically significant impact. This underscores
the importance of model selection in the predictive modeling process for breast cancer datasets,
and suggests that the benefit of a particular feature selection technique may not be as critical
in this context. The absence of a significant interaction implies that the model’s performance
enhancement is consistent across different feature selection methods used in this study.
Chapter 6
Conclusion and ongoing work
This study found that the type of predictive model used significantly affects performance on the
Wisconsin Breast Cancer Dataset, particularly regarding AUC values. This confirms the importance of selecting the right algorithm. However, feature selection methods and their interaction
with the model type did not significantly impact AUC values. This might indicate the inherent informativeness of the dataset’s features or the robustness of the models to less informative
features. The consistent model performance across different feature selection methods opens up
possibilities for exploring model-specific feature selection techniques.
Future research could explore various directions, such as testing other feature selection methods
(like recursive feature elimination or principal component analysis), assessing model robustness
on different datasets, incorporating advanced machine learning techniques like deep learning, and
evaluating the clinical applicability and interpretability of the models. These efforts could provide
significant insights into medical diagnostics and improve decision-making tools for breast cancer
prognosis.
References
1. Ahmed, O. S., Omer, E. E., Alshawwa, S. Z., Alazzam, M. B. & Khan, R. A. Approaches to federated computing for the protection of patient privacy and security using medical applications. Applied Bionics and Biomechanics 2022, 6. doi:10.1155/2022/1201339 (2022).
2. World Health Organization. Breast cancer: prevention and control. https://www.who.int/cancer/prevention/diagnosis-screening/breast-cancer/en/ (2016).
3. James, G., Witten, D., Hastie, T. & Tibshirani, R. An Introduction to Statistical Learning: with Applications in R 2nd ed. (Springer, 2021).
4. Breiman, L. Random Forests. Machine Learning 45 (ed Schapire, R. E.), 5–32 (2001).
5. Yang, P., Hwa Yang, Y., Zhou, B., Zomaya, A. Y., et al. A review of ensemble methods in bioinformatics. Current Bioinformatics 5, 296–308 (2010).
6. Tibshirani, R. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 267–288. doi:10.1111/j.2517-6161.1996.tb02080.x (1996).
7. Ahmed, F. S. & Shawky, D. M. Logistic Regression Model for Breast Cancer Automatic Diagnosis. In SAI Intelligent Systems Conference, November 10–11 (2015).
8. Wang, A., An, N., Xia, Y., Li, L. & Chen, G. A Logistic Regression and Artificial Neural Network-based Approach for Chronic Disease Prediction: a Case Study of Hypertension. In 2014 IEEE International Conference on Internet of Things (iThings 2014), Green Computing and Communications (GreenCom 2014), and Cyber-Physical-Social Computing (CPSCom 2014) (2014).
9. Haq, A. U., Li, J. P., Memon, M. H., Nazir, S. & Sun, R. A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms. Mobile Information Systems, 1–21 (2018).
10. Rigatti, S. J. Random Forest. Journal of Insurance Medicine 47, 31–39 (2017).
11. Qi, Y. Random Forest for Bioinformatics. In Ensemble Machine Learning (eds Zhang, C. & Ma, Y.), Chapter 11 (Springer, New York, NY, 2012). doi:10.1007/978-1-4419-9326-7_11.
12. Wolberg, W. H., Street, W. N., Heisey, D. M. & Mangasarian, O. L. Computer-derived nuclear grade and breast cancer prognosis. Analytical and Quantitative Cytology and Histology 17, 257–264 (1995).