Explainable AI Architecture for Automatic Diagnosis of Melanoma Using Skin Lesion
Photographs
by
Ruitong Sun
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(COMPUTER SCIENCE)
December 2022
Copyright 2023 Ruitong Sun
Dedication
I would like to dedicate this thesis to my parents, who gave me the opportunity to see the wider world. At the same time, I am also very grateful to Professor Mohammad; without his encouragement, I might not have been able to complete this thesis.
Acknowledgements
I am very grateful to Prof. Mohammad Rostami; without his guidance and encouragement, I would not have had the opportunity to complete my research and chase my dreams.
Table of Contents
Dedication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Chapter 2: Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 An overview of Skin cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Melanoma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Deep Learning methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Basic in Deep learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1.1 Supervised Learning in Deep Learning . . . . . . . . . . . . . . . . . . . 6
2.2.1.2 Unsupervised Learning in Deep Learning . . . . . . . . . . . . . . . . . 6
2.2.1.3 Weakly-Supervised Learning in Deep Learning . . . . . . . . . . . . . . 6
2.2.1.4 Zero-Shot Learning in Deep Learning . . . . . . . . . . . . . . . . . . . 7
2.2.1.5 Self-Supervised Learning in Deep Learning . . . . . . . . . . . . . . . . 7
2.2.2 Image classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2.1 CNN In image classification . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2.2 Deep Residual Network (ResNet) . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.3.1 UNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3.2 Combination of ResNet and U-Net . . . . . . . . . . . . . . . . . . . . . 13
2.3 An overview of Melanoma diagnosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Human expert Melanoma diagnosis method . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Deep Learning for Melanoma Diagnosis . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Explainability methods in Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.1 Grad-CAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.2 Lime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Explainability frameworks in Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Chapter 3: Proposed work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Problem Explainable Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.1 Overview of the network architecture . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.2 Bio-UNet Baseline Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.3 Self-Supervised Learning for Bio-UNet . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Experimental Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.2 Performance Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.3 Ablative Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Chapter 4: Conclusions and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
List of Tables
3.1 Comparison of the number of non-empty masks and the number of empty masks for each
attribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Evaluating the localization accuracy for the five clinical indicators. Dice coefficient in
percentage is reported. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Localization performance results for the ablative study on the importance of subnetworks
of Bio-Unet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
List of Figures
1.1 Left: a skin mole photograph; right: explainability heatmap generated using Grad-Cam [66] for a trained DNN. We observe that the heatmap includes the whole mole without offering a good explanation for the model decision outcome. . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.1 Example dermoscopic images from the ISIC 2018 Challenge “Task to classify the
dermoscopic images into one of the following categories: melanoma, melanocytic nevus,
basal cell carcinoma, actinic keratosis / Bowen’s disease, benign keratosis, dermatofibroma,
and vascular lesion.” Dermoscopy provides high-resolution magnified images of the skin
that can reveal pathological details. Although the images already contain many details, it
is still difficult for experts to distinguish between malignant and benign lesions. Top row:
benign nevi. Bottom row: malignant melanomas. . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 SIMCLR architecture, Figure from [13] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 CNN architecture, Figure from [38] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Residual learning, Figure from [27] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Semantic segmentation (left) and Instance segmentation (right); Figure from [68] . . . . . 12
2.6 U-Net architecture [59] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.7 ABCDE rule [48] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.8 7-point checklist; Figure from [46] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.9 An example of using the Grad CAM method on medical images . . . . . . . . . . . . . . . 18
2.10 Lime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.11 XAI taxonomy; Figure from [4] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.12 interpretable features; Figure from [18] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.13 Using majority voting of ensemble models; Figure from [40] . . . . . . . . . . . . . . . . . 21
3.1 Proposed architecture for explainable diagnosis of melanoma: the architecture is trained
to simultaneously classify skin lesion pictures using a CNN classifier and learn to localize
melanoma clinical indicators on the input image using a U-Net-based segmentation
network. The classification branch receivers its input from the segmentation path to
enforce classifying images based on clinical indicators. . . . . . . . . . . . . . . . . . . . . 26
3.2 Examples of clinical indicator biomarkers. . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 The second image on the left is the ground truth; the third on the left is the output in the form of probabilities; the first on the right is the binarized map (probability > 0.3). . . . . . . . . . . . . .
3.4 Localization performance for samples of dermatoscopic images: from top to bottom, we have included a sample input image along with binary localization maps generated for the globules, milia like cyst, negative network, pigment network, and streaks biomarker indicators. From left to right, the input image, the ground truth mask of the indicator, and the masks generated by Grad-CAM, Grad-CAM++, LayerCAM, and Bio-Unet are visualized. The CAM-based feature maps are generated with a ResNet50 backbone trained for classification. . . . . . . . . . . . . . . . . . . . . . . . . .
3.5 Binary masks for the ablative study: from top to bottom are samples of globules, milia_like_cyst, negative network, pigment network, and streaks. From left to right are the input image, the ground truth mask, and the masks generated by f_Res(·) + f_Final_Seg(·), f_Res(·) + f_CLR(·) + f_Final_Seg(·), f_Res(·) + f_Seg(·) + f_Final_Seg(·), and Bio-Unet, following Table 3.3. . . . . 41
3.6 Effect of each stage of optimization on localization quality: the three rows are the localization maps generated after completion of training of the Bio-UNet subnetworks. The network in the first row is f_Res, the network in the second row is composed of f_CLS(·) and f_Res(·), and the network in the third row is Bio-Unet. . . . . . . . . . . . . . . . . . . 42
3.7 Impact of repeating f_Seg(·): the columns from left to right are ground truth, not repeated, repeated once, and repeated four times. The first and second rows are examples of “streaks”. The third and fourth rows are examples of “milia_like_cyst”. . . . . . . . . . . . . . . . . . . . .
Abstract
Melanoma is a prevalent, lethal type of cancer that is treatable mostly if diagnosed at early stages of development. As a result, there is an urgent need to develop and implement screening tools that can effectively and conveniently identify patients with or at risk for melanoma at scale and at early stages, so that life-saving treatment can be applied on time. Skin lesions are a typical indicator for diagnosing melanoma, but they often lead to delayed diagnosis due to the high similarity of cancerous and benign lesions at early stages of melanoma. In other words, despite the possibility of diagnosing melanoma through inspecting skin lesions, the task can reliably be performed only by expert dermatologists. Deep learning (DL) can be used as a solution to classify skin lesion pictures with high accuracy, but clinical adoption of deep learning faces a significant challenge. The reason is that the decision processes of deep learning models are often uninterpretable, which makes them black boxes that are challenging to trust. In this thesis, we develop an explainable deep learning architecture for melanoma diagnosis which generates clinically interpretable visual explanations for its decisions. Our idea is based on supervising a deep neural network to learn to identify clinical indicators of melanoma and then base the diagnosis task on these indicators. As a result, the model is trained to diagnose melanoma similar to expert dermatologists. More importantly, our model is able to localize the clinical indicators on the input skin lesion images. We conduct experiments on a real-world melanoma dataset. Our experiments demonstrate that our proposed architecture matches clinical explanations significantly better than existing architectures.
Chapter 1
Introduction
Melanoma is a prevalent type of skin cancer that can be highly deadly in advanced stages. For this reason, early detection of melanoma is the most important factor for successful treatment. New skin moles or changes in existing moles are the most distinct symptoms of melanoma. However, due to the similarity of benign and cancerous moles, melanoma diagnosis is a sensitive task that can only be performed by trained dermatologists. If skin moles are not screened and graded on time, melanoma may be detected by patients very late. Unfortunately, this is often the case for low-income populations with limited access to healthcare. Advances in deep learning, along with the accessibility of smartphones, have led to the emergence of automatic diagnosis of melanoma using ordinary skin lesion photographs [16, 76, 35, 1, 33, 49, 32]. When evaluated only in terms of diagnosis of melanoma, deep models have accuracy rates close to those of dermatologists. Despite this success, adoption of these models in clinical settings has been limited.
A primary challenge for adopting deep learning in clinical tasks is the challenge of interpretability.
Deep neural networks (DNNs) are sometimes called “black boxes” because their internal decision-making
process is opaque. Existing explainability methods [70, 66, 87] try to clarify decisions of these black boxes
to help users or developers understand the most important areas of the image in making the classification,
in the form of a heatmap. However, the highlighted area alone is not particularly helpful: for example, Grad-Cam [66] simply highlights the entire mole in the melanoma image in Figure 1.1. The heatmap is not very helpful because it is trivial that the mole is the important region for diagnosis, and useful clinical explanations should consider finer regions and features of the mole. In other words, the highlighted regions are often too large to show the shape of an interpretable region, or deviate considerably from the regions of interest to dermatologists. A reason behind this deficiency is that many explainability methods primarily consider the last DNN layer for heatmap generation, whereas some interpretable features may be encoded at earlier layers. Hence, improving DL explanation methods may be helpful.

Figure 1.1: Left: a skin mole photograph; right: explainability heatmap generated using Grad-Cam [66] for a trained DNN. We observe that the heatmap includes the whole mole without offering a good explanation for the model decision outcome.
More importantly, there is no guarantee that a trained DNN uses human-interpretable indicators for decision-making, irrespective of improvements to DL explainability algorithms. We argue that existing explainability methods may not be sufficient for explainable DL in high-stakes domains such as medicine, due to the end-to-end training pipeline of DL. In other words, a model that is trained only using a high-level abstract
label, e.g., cancerous vs benign, may learn to extract indicator features that are totally different com-
pared to the features human experts use. In contrast, dermatologists are trained to perform their diagnosis
through identifying intermediate indicator biomarkers [5]. The solution that we propose is to benefit from
intermediate-level annotations that denote human-interpretable features in the training pipeline to enforce
a DNN to learn to make decisions similar to clinical experts. However, data annotation, particularly in medical applications, is an expensive and time-consuming task, and generating a finely annotated dataset is infea-
sible. To circumvent this challenge, we use self-supervised learning [12] to train a human-interpretable
model using only a small annotated dataset. Our empirical experiments demonstrate that our approach
can generate explanations more similar to expert dermatologists.
Chapter 2
Related Work
In this chapter we offer a brief survey and background on existing works in machine learning for diagnosing melanoma. A comprehensive survey of all existing works is beyond the scope of this thesis; we only survey the most closely related works.
2.1 An overview of Skin cancer
The most prevalent form of cancer in the US is skin cancer [24]. In their lifetimes, one in five Americans will
get skin cancer [74]. The three primary types of skin cancer are malignant melanoma, basal cell carcinoma
(BCC), and squamous cell carcinoma (SCC). Each year, more than 3 million Americans are diagnosed with
BCC and SCC, two non-melanoma skin cancers (NMSC) [57]. But the majority of skin cancer fatalities
are caused by melanoma [9], which affects more than 1 million Americans [42]. Over the past 30 years,
melanoma incidence has sharply increased in the United States [71].
2.1.1 Melanoma
The malignant transformation of melanocytes results in the tumor known as melanoma. It is extremely dangerous and usually affects the skin, though it can also affect mucous membranes. It affects adults more frequently than children. Compared with non-Hispanic Blacks or Asian/Pacific Islanders, skin cancer rates are nearly 30 times higher in non-Hispanic whites. Patients with darker skin frequently experience skin cancer diagnosis at a late stage, which makes treatment more challenging. Other than early surgical resection, there is no effective treatment for malignant melanoma. Therefore, it is essential to find and treat malignant melanoma as soon as possible.

Figure 2.1: Example dermoscopic images from the ISIC 2018 Challenge (“Task to classify the dermoscopic images into one of the following categories: melanoma, melanocytic nevus, basal cell carcinoma, actinic keratosis / Bowen’s disease, benign keratosis, dermatofibroma, and vascular lesion.”) Dermoscopy provides high-resolution magnified images of the skin that can reveal pathological details. Although the images already contain many details, it is still difficult for experts to distinguish between malignant and benign lesions. Top row: benign nevi. Bottom row: malignant melanomas.
2.2 Deep Learning methods
Deep neural networks are inspired by the human nervous system [45] and have increased the performance
of AI algorithms significantly over the past two decades. We offer a brief survey about methods of deep
learning that are most relevant to our work.
2.2.1 Basic in Deep learning
2.2.1.1 Supervised Learning in Deep Learning
Supervised learning is the most common approach in machine learning [17]. In general, supervised learn-
ing requires feeding a large amount of labeled data into the model so that the model learns the relationship
between the input x and the label y. Deep learning uses supervised learning in situations such as image
classification or object detection. Since the image and its corresponding label are already known, the net-
work modifies the parameters to make the prediction result gradually approach the ground truth label.
This process is called supervised learning.
2.2.1.2 Unsupervised Learning in Deep Learning
Due to the difficulty in finding human-annotated labels [62], people are turning to unsupervised learning,
which can make use of sizable amounts of unlabeled data. Contrary to supervised learning, there is no
right or wrong answer; the algorithm finds similarities in the data rather than assigning them labels from
outside sources. Unsupervised domain adaptation is a special case where we benefit from a secondary
domain with labeled data to train the model [20, 37, 61].
2.2.1.3 Weakly-Supervised Learning in Deep Learning
Weak supervision is a middle ground between supervised and unsupervised learning, where the goal is to
use noisy and limited labeled data along with large amounts of training data [92, 83, 64] to train a model.
As a result, the burden of data labelling can be mitigated significantly. Semi-supervised learning [94, 93]
can be considered a sub-class of weakly-supervised learning where the training data is mostly unlabeled.
There are many methods to implement weakly-supervised learning. A common approach is to use the
labeled portion to generate relatively accurate labels for the unlabeled portion.
2.2.1.4 Zero-Shot Learning in Deep Learning
Zero-shot learning is a learning paradigm that attempts to relax the need for having labeled data for novel
classes or tasks by learning useful relationships from previously seen tasks or classes [84, 58, 81, 63]. A
common approach to zero-shot learning is to map data to a modality in which relations are easier to use. For example, if we map an image to its semantic description in natural language, we may be able to
use natural language processing to identify an image that belongs to an unseen class.
2.2.1.5 Self-Supervised Learning in Deep Learning
The basic idea of self-supervised learning (SSL) is to hide parts of the input and use the observable part to predict the hidden part [31, 86, 43, 28]. Self-supervised learning does not group and cluster data like unsupervised learning does. Nevertheless, SSL can enable models to benefit from learning more powerful data
representations with large amounts of unlabeled data.
Contrastive learning Contrastive learning is one type of self-supervised representation learning. It focuses on learning common features between positive samples and distinguishing differences between negative samples. The goal of contrastive learning is to learn an encoder that encodes data of the same class similarly and data of different classes as differently as possible. There are two main contrastive learning methods: one is Momentum Contrast [26] and the other is SimCLR [13].
Momentum Contrast - The idea is that learning good representations requires a large dictionary with a lot of negative samples, while keeping the dictionary keys’ encoder as consistent as possible. Instead of using static memory banks or small batches, this approach treats the dictionary as a queue, with the most recent mini-batch enqueued and the oldest mini-batch dequeued. This enables the pool of negative samples to grow as necessary.
Figure 2.2: SIMCLR architecture, Figure from [13]
SimCLR - The core idea is to use a large batch size (in the paper, a batch size of 8192) and data augmentation (a combination of random crop, flip, color jitter, and grayscale), and to add a non-linear projection head before similarity matching. Specifically, to create the two augmented images x_1 and x_2, an image is taken and subjected to random augmentations. To obtain representations, an encoder is applied to each of the two images in the pair. The representations z_1 and z_2 are then obtained by applying a non-linear fully connected layer. The goal is to maximize how similar these two representations, z_1 and z_2, are for the same image.
2.2.2 Image classification
Image classification is the task of classifying and assigning labels to groups of pixels or vectors in an image,
and its results for specific tasks have exceeded human-level accuracy.
2.2.2.1 CNN In image classification
Convolutional neural networks (CNNs), a subset of deep neural networks (DNNs), have demonstrated
outstanding performance in computer vision tasks, particularly image classification [52, 2, 23].
Background of CNN In the 1990s, LeCun et al. [34] published the first research on contemporary
convolutional neural networks (CNNs). Using the MNIST database of handwritten numbers, a CNN was
trained. The MNIST dataset, which has become well-known, contains pictures of handwritten digits with
the ground truth labels 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. A CNN model is trained on MNIST by using images,
asking it to forecast the digits shown in the images, and then altering the model’s settings in response to
whether it correctly identified the digit. In 2012, a large deep convolutional neural network named AlexNet
performed well on ImageNet, after which a series of CNN-based networks such as VGGNet, ResNet, and
DenseNet appeared.
CNN Architecture The basic unit of the CNN architecture is the neuron. A CNN layer is composed of a group of neurons, and each layer has a different purpose. The Convolution Layer, ReLU Layer, Pooling Layer, and Fully-Connected Layer are the four primary types of CNN layers.
Figure 2.3: CNN architecture, Figure from [38]
Convolution Layer: CNNs are primarily composed of convolutional layers. Each convolutional layer has a number of kernels whose parameters are learned during training. Typically, filters are smaller than the original image. An activation map is produced after each filter is convolved with the image.
ReLU Layer: The ReLU layer is always placed after the convolutional layer and the fully connected layer, as seen in Figure 2.3. This is because using ReLU helps stop the exponential growth in the amount of computation needed to run a neural network.
Pooling Layer: The convolutional layer is followed, as shown in Figure 2.3, by a pooling layer. The pooling layer is used to reduce the dimensionality of the feature map. In other words, it lessens the number of parameters that must be learned, accelerating the execution of the network. The pooling layer also provides a summary of the features in each region at the same time.
Fully-Connected Layer: The fully connected layer is a feedforward neural network; fully connected layers form the final part of the network. A fully connected layer is used to learn a non-linear combination of the features output by the previous convolutional layers.
2.2.2.2 Deep Residual Network (ResNet)
Figure 2.4: Residual learning, Figure from [27]
ResNet is one of the most famous CNN architectures because it overcomes the "vanishing gradient" problem. The vanishing/exploding gradient problem arose as CNN-based architectures added more layers to reduce error rates; as a result, gradients may become 0 or excessively large. Residual blocks are a solution to the vanishing/exploding gradient issue. In this network, skip connections bypass some intermediate layers to link activations from one layer to another layer. This creates a residual block. These residual blocks are stacked to create the ResNet [27].
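To make the idea of a residual block concrete, the following minimal PyTorch sketch shows a simplified block (same number of channels, no downsampling) in which the skip connection adds the input back to the convolutional output. It illustrates the general idea rather than the exact ResNet50 bottleneck block.

import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: the skip connection adds the input back to the
    output of two convolutions, so gradients can bypass the block."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                       # skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)   # F(x) + x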
There are many works in the literature [7, 88, 91, 79, 82] using ResNet for medical image classification. Recently, [14] proposed a modular group attention block that can capture feature dependencies in medical images in two independent dimensions (channel and space). By stacking these group attention blocks, their architecture is superior to state-of-the-art backbone models in medical image classification tasks. [89] proposes a variant of the ResNet model by replacing global average pooling with adaptive dropout for medical image classification. Their results show that their model achieves large performance gains in medical image classification compared to traditional architectures without significant loss of efficiency. Due to the limited performance of a single network, more and more studies combine ResNet with other network models. [3] proposed an architecture combining ResNet and DenseNet for accurate super-resolution of medical images, called wavelet-based medical image super-resolution using a cross-connected residual-in-dense grouped convolutional neural network. The authors solve the vanishing gradient problem by using residual and dense skip connections. Some researchers combine ResNet and Inception structures to improve the network performance of ResNet. [21] proposes a residual inception block, where an additional inception path with 1×1 convolution is applied for feature map resizing to solve the problem of different input and output feature map channels.
2.2.3 Image Segmentation
In the field of computer vision, image segmentation is just as important as image classification. Image segmentation is the task of grouping together image parts that belong to the same object class. There are two types of image segmentation [68]: one is to classify pixels and give each pixel a label, which is called semantic segmentation; the other is to obtain the boundaries of individual objects, which is called instance segmentation.
Figure 2.5: Semantic segmentation (left) and Instance segmentation (right); Figure from [68]
semantic segmentation Semantic segmentation assigns a category label to each pixel in the image [22, 80, 73]. If there are sky, cars, etc. in the picture, then each category is given a unique color. As shown on the left of Figure 2.5, all pixels belonging to people form one category, so all the people appear connected together.
instance segmentation The instance segmentation method is similar to object detection, except that it produces a mask whereas object detection produces a bounding box [25, 36, 8]. Instance segmentation does not need to label every pixel; it only needs to find the boundary of each object of interest. The outline of each person is clearly marked in the image on the right of Figure 2.5.
2.2.3.1 UNet
There are many algorithms for semantic segmentation, such as U-Net, Mask R-CNN, etc. U-Net is one of the most commonly used image segmentation algorithms; its architecture appears in many new algorithms and has been used extensively in the domain of medical image analysis [30, 10, 72, 85].
U-Net architecture The left half of the U-Net [59] network is used for feature extraction, that is, downsampling, and the right half is used for upsampling. This architecture is also called an Encoder-Decoder. According to Figure 2.6, each blue box corresponds to a multi-channel feature map. The top of the box is marked with the number of channels. The box's size is given in the bottom left corner. The cropped and copied feature maps are indicated by the white boxes.

Figure 2.6: U-Net architecture [59]
Reasons for Copying Feature Maps As shown in Figure 2.6, a white box is copied from the encoder (indicated by a gray arrow); this copied feature map is concatenated with the upsampled blue feature map. In short, this concatenation step adds information to the upsampled feature map, lets the model rely on more information, and alleviates the problem of insufficient information during upsampling.
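A minimal PyTorch sketch of one such decoder step is given below; it upsamples the feature map, concatenates the copied encoder feature map, and convolves the merged result. For simplicity it assumes padded convolutions so the spatial sizes match, whereas the original U-Net crops the copied map.

import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One U-Net decoder step: upsample, concatenate the encoder feature map
    (the 'copied' white box in Figure 2.6), then convolve the merged map."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([skip, x], dim=1)   # add encoder information back
        return self.conv(x)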
2.2.3.2 Combination of ResNet and U-Net
In recent years, many researchers have introduced ResNet into the field of medical image segmentation by combining ResNet and U-Net, which helps to improve segmentation accuracy. [29] proposed UNet 3+, an architecture based on ResNet and U-Net; UNet 3+ can reduce network parameters and improve computational efficiency. At the same time, full-scale skip connections are used in the architecture, which helps capture organs of different scales. To address the difficulty of pancreas segmentation, [39] proposed RRA-UNet, which uses a ring residual module and an attention module. RRA-UNet outperforms other methods in pancreas segmentation and provides more reliable auxiliary diagnostic data in clinical medicine applications.
2.3 An overview of Melanoma diagnosis
2.3.1 Human expert Melanoma diagnosis method
The appearance of a new pigmented spot on the skin or a change in the shape, size, color, or feel of an existing mole are the first signs of melanoma. Some melanomas can be seen to develop from moles, while others grow directly on the skin. Given this complex situation, experts have proposed several diagnostic methods: (1) the ABCDE rule, (2) the 3-point checklist, and (3) the 7-point checklist.
ABCDE rule: ABCDE stands for Asymmetry, Border, Color, Diameter, and Evolution.
Figure 2.7: ABCDE rule [48]
As shown in Figure 2.7, every criterion except Evolution has a corresponding value and weight factor. The total score can be calculated as:

Total_Scores = A × 1.3 + B × 0.1 + C × 0.5 + D × 0.5

1. Total_Scores < 4.75: likely benign; in practice, no melanoma scored lower than 4.75.
2. 4.75 < Total_Scores < 5.45: lesions with scores in this range cannot be completely ruled out as early-stage melanoma; they are treated as suspicious lesions.
3. 5.45 < Total_Scores: diagnosed as malignant melanoma.

A small scoring function illustrating this rule is sketched below.
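In the illustrative helper below, the inputs a, b, c, and d stand for the Asymmetry, Border, Color, and Diameter criterion values; the weights and thresholds follow the formula and ranges listed above, and the function itself is not part of the thesis.

def abcde_total_score(a, b, c, d):
    """Weighted ABCD score (Evolution is assessed qualitatively over time)."""
    total = a * 1.3 + b * 0.1 + c * 0.5 + d * 0.5
    if total < 4.75:
        return total, "likely benign"
    elif total < 5.45:
        return total, "suspicious lesion"
    else:
        return total, "suspicious of malignant melanoma"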
3-point checklist The 3-point checklist is intended to prevent non-experts from overlooking melanoma.
1. Asymmetry: Along one or two vertical axes, there is asymmetry in the color and structure.
2. Atypical pigment network: Thick lines and irregularly shaped holes in the pigment network.
3. Blue-white structures: Blue and/or white in any form.
The presence of any one of the three conditions indicates the possibility of melanoma.
7-point checklist The 7-point checklist is a very well-known method for diagnosing melanoma. As shown in Figure 2.8, this checklist contains seven skin attributes that need to be identified. The total score is calculated as:

Total_Score = (#major × 2) + (#minor).
If the total score is greater than 3, the lesion is considered malignant.
Figure 2.8: 7-point checklist; Figure from [46]
2.3.2 Deep Learning for Melanoma Diagnosis
Dermatology is one of the most common use cases of DL in medicine, with many existing works in the literature [16, 76, 35, 1, 33, 49, 32]. Despite significant progress in DL, these methods simply train a DL model on a binary labeled dataset using supervised learning. Despite being naive in terms of the artificial intelligence (AI) algorithms they use, these works lead to decent performances, comparable with expert clinicians. There is still room for improving the explainability characteristics of these methods to convince clinicians to adopt AI for melanoma diagnosis in practice. However, only a few works have explored explainability of AI models for melanoma diagnosis. Murabayashi et al. [47] use clinical indicators and benefit from virtual adversarial training [44] and multitask learning to train a model that predicts the clinical indicators in addition to the binary label to improve explainability. Nigar et al. [50] simply use LIME to study the interpretability of their
algorithm.
Stieler et al. [75] use the ABCD-rule, a diagnostic approach of dermatologists, while training a model
to improve interpretability. Shorfuzzaman [69] used meta-learning to train an ensemble of DNNs, each
predicting an indicator, to use indicators to explain decisions. These existing works, however, do not
spatially locate the indicators. We develop an architecture that generates spatial masks on the input
image to locate clinical indicators spatially.
2.4 Explainability methods in Deep Learning
Existing explainability methods in deep learning primarily determine which spatial regions of the input
image or combination of regions led to a specific model decision or contribute significantly to the network
prediction (see Figure 1.1). There are two main approaches to identify regions of interest when using deep
learning: Model-based methods and model agnostic methods. Model-based methods work based on the
details of the specific structures of a deep learning model. They examine the activations or weights of the
deep network to find regions of importance [66, 65, 53]. Grad-CAM and Layerwise Relevance propagation
[65] are examples of such methods. Attention-based methods [19] similarly identify important image
regions. Model agnostic methods separate the explanations from the model which offers wide applicability.
These methods (e.g., LIME [55]) manipulate inputs (e.g., pixels, regions or superpixels) and measure how
changes in input affect output. If an input perturbation has no effect, it is not relevant to decision-making.
In contrast, if a change has a major impact (e.g., changing the classification from melanoma to benign),
then the region is important to the classification. SHapley Additive exPlanations (SHAP) [41] can assign
each feature or region an importance value for a particular prediction. Note, however, the regions found
by these algorithms do not necessarily correspond to intermediate concepts or diagnostic features that are
known to experts or novices. Hence, while these algorithms are helpful to explain the classifications of DNNs, they do not help train models that mimic humans when making predictions.
Identifying regions of interest is also related to semantic segmentation [51], which divides an image into segments that are semantically meaningful (e.g., separating moles from background skin in diagnosing melanoma). However, these methods mostly segment based on spatial similarities and do not offer any explanation of how the generated segments can be used for classification of the input image. U-Nets [60] also identify regions within images but do not indicate the importance of regions to the overall classification, a key step in explaining model decisions.
2.4.1 Grad-CAM
Figure 2.9: An example of using the Grad CAM method on medical images
Grad-CAM is the most commonly used model-based method. The Grad-CAM method works as follows: forward propagation is used to obtain the feature map A after an image is fed into the CNN. After obtaining the feature map A, the predicted score y^c of the target class c is backpropagated to obtain gradient information, which reflects the contribution of each feature map pixel to the target class. Averaging the gradient information yields the importance α^c_k of each channel k, and the final weighted sum (followed by a ReLU) is the Grad-CAM heatmap:

L^c_{Grad-CAM} = ReLU( Σ_k α^c_k A^k )

1. L^c_{Grad-CAM}: the Grad-CAM result for the target class.
2. k: channel index of the feature map A.
3. c: the target class.
4. α^c_k: the weight of A^k, computed as

α^c_k = (1/Z) Σ_i Σ_j ∂y^c / ∂A^k_{ij}

1. i: index of the width dimension.
2. j: index of the height dimension.
3. k: channel index of the feature map A.
4. c: the target class.
5. y^c: the classification score for class c (whose gradient is taken).
6. Z: width × height.
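A minimal PyTorch sketch of these formulas is shown below; it assumes a classification model and a chosen convolutional target_layer (e.g., the last convolutional block), and the helper names are illustrative rather than part of any specific library.

import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, target_layer):
    """alpha_k^c is the spatially averaged gradient of the class score w.r.t.
    feature map A^k, and the heatmap is ReLU(sum_k alpha_k^c * A^k)."""
    feats = {}

    def hook(module, inputs, output):
        output.retain_grad()          # keep the gradient of the feature maps A
        feats["A"] = output           # shape (1, K, h, w)

    handle = target_layer.register_forward_hook(hook)
    model.zero_grad()
    scores = model(image)             # image: (1, 3, H, W)
    scores[0, target_class].backward()
    handle.remove()

    A = feats["A"]
    alpha = A.grad.mean(dim=(2, 3), keepdim=True)        # alpha_k^c
    cam = F.relu((alpha * A).sum(dim=1, keepdim=True))    # ReLU(sum_k alpha_k A^k)
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).detach()            # normalized heatmap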
2.4.2 Lime
Marco Tulio Ribeiro et al. proposed the XAI method LIME [56] (Local Interpretable Model-agnostic Explanation). The basic idea behind LIME is to use linear models to locally approximate the predictions of the target black-box model. Through minor changes in the input, the method detects what changes in the black-box model's output, and an explainable model is created based on this variation.
Figure 2.10: Lime
ξ(x) = argmin_{g∈G} L(f, g, π_x) + Ω(g)

1. f: the model being explained.
2. g: the interpretable model (e.g., a linear model).
3. G: the class of potentially interpretable models, such as linear models or decision trees.
4. π_x: the proximity measure between a perturbed sample z and the original data point x.
5. Ω(g): the complexity of the model g.
6. The objective is to minimize L(f, g, π_x) while keeping Ω(g) low enough for g to be interpreted by humans.
2.5 Explainability frameworks in Deep Learning
As shown in Figure 2.11, the model-specific and model-agnostic methods we mentioned before belong to the post-hoc explainability methods. Another important part of explainable AI is developing explainable models. However, most architectures are called explainable models only because they use post-hoc explainability methods, and only a few papers aim to make the architecture itself explainable. Therefore, we introduce more closely related interpretable architectures. Some models use interpretable features (shown in Figure 2.12)
Figure 2.11: XAI taxonomy; Figure from [4]
to make explainable architectures; e.g., M. Deramgozin et al. proposed a hybrid explainable AI framework that utilizes an encoder to extract 13 facial expression action units from an input image [18].
Figure 2.12: interpretable features; Figure from [18]
Others design interpretable models by using majority voting over ensemble models or by using transparent models (KNN, decision trees); e.g., Siyuan et al. [40] first utilize discriminant correlation analysis (DCA) to generate better representations, then feed these representations as input to three randomized neural networks, and finally obtain the results by majority voting (shown in Figure 2.13).
Figure 2.13: Using majority voting of ensemble models; Figure from [40]
In general, most explainable frameworks focus on how to make black boxes more understandable. But the decision logic of a black box is not necessarily what humans expect; especially in the medical field, we need a model whose logic is similar to that of human experts. Therefore, an explainable architecture in this thesis means an architecture that can make decisions based on logic similar to that of humans.
In the next chapter, we benefit from the concepts and methods that we surveyed in this chapter and
develop an explainable AI architecture for diagnosing melanoma.
Chapter 3
Proposed work
In this chapter, we provide our architecture for human-explainable diagnosis of melanoma. Our idea is to enable a deep learning model to identify clinical indicators of melanoma and use them to perform the diagnosis task.∗
3.1 Problem Explainable Architecture
Most works based on using AI for melanoma diagnosis consider that we have access to a dataset D_B = {(x_i, y_i)}_{i=1}^N that includes skin lesion images x_i along with corresponding binary labels y_i for cancerous vs. benign cases. Standard supervised learning is then used to train a suitable binary classifier, primarily based on convolutional neural networks (CNNs). In our work, we also consider such a dataset to be accessible, referred to as Dataset B in our formulation. Despite being a simple procedure for AI, it has been used extensively in the literature [54, 90, 16, 76, 35, 1, 33, 49, 32] due to high accuracy rates. However, as explained, this simple baseline does not lead to a human-centered explainable model.
∗
Results of this chapter are currently under review [77]
To develop an explainable AI model, we consider that we also have access to a second dataset, where images are annotated with clinically plausible indicators [5]. These indicators are commonly used by dermatologists, and residents are trained to diagnose melanoma based on identifying them. We try to implement a similar approach, where the model is trained in an end-to-end scheme to first predict the indicators as intermediate-level labels and then use them for diagnosis label prediction. Let D_A = {(x'_i, y'_i, (z_ij)_{j=1}^d)}_{i=1}^M denote this dataset, where x'_i and y'_i similarly denote the images and their binary diagnostic labels. Additionally, z_ij denotes a feature mask array with the same size as the input image, where for each j, the mask denotes the spatial location of a clinically interpretable indicator, e.g., pigment network, on the input image in the form of a binary segmentation map. We refer to this dataset as Dataset A. Clearly, preparing Dataset A is a significantly more challenging task than preparing Dataset B. It suffices to go through existing medical records to prepare Dataset B according to diagnosis. In contrast, existing medical records rarely include instances suitable for Dataset A and hence, a dermatologist would have to determine the absence or presence of each indicator and locate them on the image in addition to providing a simple binary label. Even if we can annotate some images to generate Dataset A, the size of Dataset A will be significantly smaller than Dataset B (M ≪ N) due to the scarcity of dermatologists who would accept serving as data annotators. Our goal is to benefit from both Dataset A and Dataset B to train an architecture that can be used for melanoma diagnosis with interpretable explanations.
A naive idea to train an explainable model is to use a suitable architecture and train one segmentation model, e.g., U-Net [60], to predict indicator masks. Previously, this idea has been used for training explainable AI models for medicine [67]. In our problem, we can use one U-Net for each of the d indicators and train them using Dataset A. Hence, we will have d image segmentation models that determine the spatial location of each indicator for input images. However, there are two shortcomings. First, we will still need a secondary classification model to determine the diagnosis label from the indicators [47]. More importantly, the size of Dataset A may not be sufficient for this purpose, particularly because only a subset of instances will contain a particular type of indicator, and semantic segmentation is a complex task compared to classification. Since we likely will encounter the challenge of attribute sparsity, we likely will face overfitting during model training.
The idea that we will explore is to benefit from the information encoded in Dataset B. Dataset B is not finely attributed, yet it is similar to Dataset A, and transferring knowledge between these two datasets might be feasible. We formulate a weakly supervised learning problem for this purpose. Specifically, we rely on self-supervised learning using Dataset B to train an encoder that can represent input images better so that the model becomes generalizable for localization of biomarkers. Additionally, rather than training segmentation models in isolation and individually, we propose to train a shared segmentation model in a multitask learning setting, where each indicator defines a task. As a result, we can benefit from transfer learning to resolve the challenge of data sparsity. Finally, we also train the classifier for prediction of the labels y_i using the indicators so that the model is trained to make predictions similar to an expert.
3.2 Proposed Algorithm
We develop an explainable architecture for melanoma diagnosis. The network adopts U-Net as the basic backbone. By letting the ResNet encoder perform the downsampling task, the backbone gains a sense of boundaries. This section describes the components of the network architecture, explains why we use each part, and presents the training procedure we used.
3.2.1 Overview of the network architecture
Figure 3.1 visualizes the architecture of our proposed model. The goal of this architecture is to learn to generate localization masks while the network is trained to perform diagnosis classification. As opposed to only using an explainability method, our architecture is trained to directly generate localization masks for the indicators in Dataset A. Our architecture is composed of four key subnetworks:
Figure 3.1: Proposed architecture for explainable diagnosis of melanoma: the architecture is trained to simultaneously classify skin lesion pictures using a CNN classifier and learn to localize melanoma clinical indicators on the input image using a U-Net-based segmentation network. The classification branch receives its input from the segmentation path to enforce classifying images based on clinical indicators.
(1) A pretrained ResNet50 network for attribute classification; we denote subnetwork (1) as f_Res, where f_Res(x_i) = E_1(x_i) + fc_1(x_i), x_i ∈ D_L, and E_1(·) represents the encoder, which consists of four consecutive blocks, each composed of a downsampling step and two ResNet residual blocks; fc_1(·) represents a fully connected layer which produces probabilistic outputs.

(2) An architecture based on U-Net for biomarker indicator localization. We replace the U-Net encoder with the encoder E_1(·) of f_Res(·); we denote subnetwork (2) as f_Seg, where f_Seg(h_i) = E_1(h_i) + D_1(h_i), h_i denotes the heatmaps that are generated using Grad-Cam, and D_1(·) denotes the decoder which outputs the biomarker indicator location.

(3) A ResNet50-based projection head added for self-supervised learning; we denote subnetwork (3) as f_CLR(·), where f_CLR(x'_i) = E_1(x'_i) + P_1(x'_i) and x'_i ∈ D_UL; P_1(·) is a projection head consisting of two fully connected layers that outputs a feature vector.

(4) Subnetwork (4) has the same architecture as f_Seg(·), but its encoder E_2(·) is independent of the previous encoder E_1(·); we denote subnetwork (4) as f_Final_Seg(·), where f_Final_Seg(h_i) = E_2(h_i) + D_2(h_i), and D_2(·) represents the decoder that outputs the final biomarker indicator location.
In summary, our architecture learns to classify the input images. Then, we use Grad-Cam to generate
attention heatmaps. These heatmaps do not necessarily show interpretable indicators, but are useful for
classification which means that they should have correlations with features expert clinicians use. Our idea
is to feed these heatmaps to the segmentation subnetwork and generate the indicator biomarker local-
ization maps using the Grad-Cam heatmaps. As a result, the architecture learns to generate explanations
along with diagnosis labels. We can see that our full architecture can only be trained using Dataset A.
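The following hypothetical sketch summarizes the inference path described above; the names f_res, f_seg, and grad_cam_fn are placeholders for the trained classification subnetwork, the segmentation subnetwork, and a Grad-CAM heatmap generator (none of them are code from the thesis), and the 0.3 binarization threshold anticipates the choice described in Section 3.3.

import torch

def explain_and_diagnose(f_res, f_seg, grad_cam_fn, image, indicator_index):
    """Classify the lesion, generate a Grad-CAM heatmap, and feed the heatmap
    to the segmentation subnetwork to localize the selected clinical indicator."""
    with torch.enable_grad():
        heatmap = grad_cam_fn(f_res, image, indicator_index)     # attention map
    with torch.no_grad():
        logits = f_res(image)                                    # diagnosis prediction
        indicator_prob = torch.sigmoid(f_seg(heatmap))           # indicator probability map
    return logits, (indicator_prob > 0.3).float()                # binarized localization mask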
3.2.2 Bio-UNet Baseline Training
Our Bio-Unet baseline consists of the networks f_Res(·), f_Seg(·), and f_Final_Seg(·). At first, we train the network f_Res(·) to perform the skin lesion classification task using simple supervised learning. After this stage, the classification network can predict diagnosis labels with high accuracy. We then apply Grad-CAM on the network to generate attention heatmaps. Grad-CAM is generally used by computing the gradient of the classification score with respect to the last convolutional feature map to identify the part of the input image that is most influential for the classification score of the selected target. However, the final convolutional feature map primarily provides a high-level region of the input image which corresponds to the area of interest. Our experiments demonstrate that this mostly will just highlight the full lesion mole, similar to Figure 1.1, which is usually much larger than the region delineated by experts for a specific indicator and usually cannot reflect the location and area of an indicator. However, the earlier convolutional feature
maps contain low-level information, e.g., the boundary of a region of interest. Hence, combining attention maps at all convolutional layers looks like an option that may generate a good estimate of the location of an indicator, but we empirically observed that simply averaging all heatmaps results in poor output. Therefore, we would like to benefit from the binary masks provided by experts and reconstruct these heatmaps using the segmentation network f_Seg(·). We learn to use appropriate weights for each heatmap so that all bottleneck blocks of E_1(·) contribute to the reconstruction of the final heatmap. Because the encoder E_1(·) has been used for training, in order not to affect its parameters, we create an encoder E_3(·) with the same architecture as E_1(·). Specifically, the ResNet encoder E_3(·) consists of 12 bottleneck blocks, each loaded with the optimal checkpoint parameters. We then input a training image and a bottleneck block into Grad-CAM to obtain a heatmap, and then replace the bottleneck block in turn, obtaining a total of 12 localization masks. We denote this operation f_Grad_CAM(b_i, j), where b_i is the bottleneck block and j is the index of the selected label.
In order to capture the strongest connections between different bottleneck blocks, we use f_Final_Seg(·) to reconstruct these heatmaps. As shown in Figure 3.1, the subnetwork f_Final_Seg(·) does not perform any classification task, but reconstructs the final localization mask for the biomarker indicators based on the classification pathway. We have observed that the result is better than using f_Seg(·) when heatmaps are constructed via the subnetwork E_1(·).
For the segmentation task, the Dice coefficient, with a value ranging from 0 to 1, is commonly used as the basis of the loss function: the larger its value, the more similar two binary masks are. However, the Dice loss is only designed for binary data. In order to avoid introducing an artificial threshold during model training, we adopt the soft Dice loss as the segmentation loss in Eq. (3.1). The soft Dice loss directly uses the predicted probability instead of using a threshold to change the output to 0 or 1. It is defined as:

L_SoftDice = 1 − 2 · ( Σ_Pixels y_true · y_pred ) / ( Σ_Pixels y_true² + Σ_Pixels y_pred² )      (3.1)
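A direct PyTorch implementation of Eq. (3.1) could look like the sketch below; the small epsilon added to the denominator for numerical stability is an implementation detail not stated in the equation.

import torch

def soft_dice_loss(y_pred, y_true, eps=1e-6):
    """Soft Dice loss of Eq. (3.1): uses predicted probabilities directly,
    so no binarization threshold is needed during training."""
    y_pred = y_pred.flatten(start_dim=1)   # (batch, pixels), probabilities in [0, 1]
    y_true = y_true.flatten(start_dim=1)   # (batch, pixels), binary ground truth
    intersection = (y_pred * y_true).sum(dim=1)
    denom = (y_true ** 2).sum(dim=1) + (y_pred ** 2).sum(dim=1)
    return (1.0 - 2.0 * intersection / (denom + eps)).mean()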
By performing downsampling on the input heatmap, the encoder will learn the boundaries of the attributes. During training, we adopt an ensemble strategy where we only feed the heatmaps of one attribute at a time as input into f_Seg(·), making the classification model f_Res(·) more accurate for the selected attribute region of interest. As a result, explanations can improve the diagnosis accuracy as well.
3.2.3 Self-Supervised Learning for Bio-UNet
The proposed Bio-Unet baseline architecture can learn boundaries efficiently, but this is only effective when the number of annotated images for a biomarker indicator, e.g., pigment network, is large, or when the area corresponding to the indicator on a lesion is contiguous and large. To enable the classification model to generate accurate heatmaps when Dataset A is small, or when the area of an indicator is very small or scattered over the image, we benefit from self-supervised learning on Dataset B to improve the baseline of our proposed network architecture that is obtained by training on Dataset A. Specifically, we use SimCLR [12], which uses contrastive learning for improved visual representations. As shown in Figure 3.1, there are two independent data augmenters T_1(·) and T_2(·), which are randomly selected from rotation, scaling, cropping, brightness, contrast, saturation, and flipping transforms to generate augmented versions of samples of Dataset B so that we can compute the contrastive learning loss. Each training image x'_i ∈ D_UL is passed through the two data augmenters to produce two augmented images. The two augmented images then pass through our shared-weight encoder E_1(·) and projection head P_1(·), resulting in two 128-dimensional feature vectors. In a minibatch of N input images, 2N augmented images are produced; each pair of augmented images coming from the same input is treated as a positive pair, and the other 2(N − 1) augmented images are negative examples. We adopt the contrastive loss as the self-supervised loss in Eq. (3.2):
L_CLR = − log [ exp( sim( f_CLR(T_1(x_i)), f_CLR(T_2(x_j)) ) / τ ) / Σ_{k=1}^{2N} exp( sim( f_CLR(T_1(x_i)), f_CLR(T_2(x_k)) ) / τ ) ]      (3.2)

where sim(u, v) = uᵀv / (‖u‖ ‖v‖), k ≠ i, and τ denotes a temperature parameter. Upon training the encoders
on Dataset B using self-supervised learning, we can benefit from transferring obtained knowledge across
Dataset A and Dataset B. Due to the space limit, the full training procedure for our architecture is described
in Algorithm 1.
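For illustration, a common way to implement the loss in Eq. (3.2) (often called NT-Xent) in PyTorch is sketched below; z1 and z2 are the projected features of the two augmented views of a minibatch, and the temperature value is an illustrative choice rather than the one used in the thesis.

import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, tau=0.5):
    """z1[i] and z2[i] are projections of two augmentations of the same image;
    all other augmented samples in the batch act as negatives."""
    z = torch.cat([z1, z2], dim=0)                    # (2N, dim)
    z = F.normalize(z, dim=1)                         # cosine similarity via dot product
    sim = z @ z.t() / tau                             # (2N, 2N) similarity matrix
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))        # exclude self-similarity (k != i)
    # the positive pair of sample i is sample i + N (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)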
Algorithm 1 Proposed Architecture Training Approach

Input 1: (x_i, y_i)_{i=1}^N ∈ D_B = D_UL
Input 2: (x'_i, y'_i, (z_ij)_{j=1}^d)_{i=1}^M ∈ D_A = D_L
Input 3: (b_i)_{i=1}^{12}
Output 1: parameters θ for encoder E_1
Output 2: the final biomarker indicator location

1:  f_CLR(x'_i) = E_1(x'_i) + P_1(x'_i)
2:  f_Res(x_i) = E_1(x_i) + fc_1(x_i)
3:  f_Seg(h_i) = E_1(h_i) + D_1(h_i)
4:  f_Final_Seg(h_i) = E_2(h_i) + D_2(h_i)
5:  f_Grad_CAM(b_i, j), where b_i ∈ E_3
6:  T_1, T_2: two separate data augmentation operators
7:  while j < 5 do                                        ▷ for the 5 attributes
8:      while not stop do
9:          Sample batch B_1 = {x'_i ∈ D_UL}
10:         Generate f_CLR(T_1(B_1)) and f_CLR(T_2(B_1))
11:         Calculate the loss L_CLR as in Eq. (3.2)
12:         Compute the gradient of L_CLR and update E_1 parameters θ and P_1 parameters θ_1
13:         Sample batch B_2 = {(x_i, y_i) ∈ D_L}
14:         Generate f_Res(B_2)
15:         Calculate the loss L_BCE
16:         Compute the gradient of L_BCE and update E_1 parameters θ and fc_1 parameters θ_2
17:         Load the optimal checkpoint parameters θ on E_3 and use f_Grad_CAM(b_i, j), b_i ∈ E_3, to get the heatmap h_i
18:         while k < 1 do                                ▷ f_Seg(·) is not repeated
19:             Generate f_Seg(h_i)
20:             Calculate the loss L_SoftDice as in Eq. (3.1)
21:             Compute the gradient of L_SoftDice and update E_1 parameters θ and D_1 parameters θ_3
22:         end while
23:     end while
24:     return E_1 parameters θ
25:     Load the optimal checkpoint parameters θ on E_3 and use f_Grad_CAM(b_i, j), b_i ∈ E_3, to get the heatmap h_i
26:     while not stop do
27:         Generate f_Final_Seg(h_i)
28:         Calculate the loss L_SoftDice as in Eq. (3.1)
29:         Compute the gradient of L_SoftDice and update E_2 parameters θ_4 and D_2 parameters θ_5
30:     end while
31:     return the final biomarker indicator location
32: end while
3.3 Experimental Validation
3.3.1 Experimental Setup
Datasets We used the ISIC 2018 dataset [15, 78] to simulate our semi-supervised learning framework. It
is a large collection of dermatoscopic images of common pigmented skin lesions with several prediction
tasks. We used its Task 2 data as Dataset A and its Task 3 data as Dataset B.
Dataset A: Task 2 of the ISIC 2018 dataset poses a challenge for melanoma clinical indicator detection. The task is to detect the following five dermoscopic attributes that are melanoma indicators: pigment network, negative network, streaks, milia-like cysts, and globules. Samples of these clinical indicator biomarkers are shown in Figure 3.2. There are 2594 images in this task with binary labels for melanoma diagnosis. Table 3.1 shows the statistics for these indicators. As can be guessed from our previous discussion, the dataset is sparse.
Figure 3.2: Examples of clinical indicator biomarkers.
Dataset B: Task 3 of the ISIC 2018 dataset consists of 10015 images with only binary diagnosis labels. As can be seen, this dataset is much larger than Dataset A; however, the images are not annotated with the indicator biomarkers. The Task 3 dataset was designed for the lesion classification challenge. The classification task is to classify dermoscopic images into one of the following classes: melanoma, melanocytic nevus, basal cell carcinoma, actinic keratosis / Bowen's disease, benign keratosis, dermatofibroma, and vascular lesion.

Attribute | Non-empty masks | Empty masks | Total skin images
globules | 603 | 1991 | 2594
milia_like_cyst | 682 | 1912 | 2594
negative network | 190 | 2404 | 2594
pigment network | 1523 | 881 | 2594
streaks | 100 | 2494 | 2594
Table 3.1: Comparison of the number of non-empty masks and the number of empty masks for each attribute.
Relationship between Dataset A and Dataset B: Almost all images (2593/2594) in Dataset A are also present in Dataset B. After combining the labels of the two datasets, we noticed that 2385/2593 (92%) of the Dataset A images are melanocytic nevi or melanomas. This shows that the attributes in Dataset A are significantly associated with melanoma and nevus.
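As an illustration of how this overlap statistic can be computed, the following is a small sketch that assumes the two tasks' ground-truth tables are available as CSV files; the file names and column names are placeholders, not the actual ISIC 2018 file layout.

import pandas as pd

# Hypothetical file and column names; the real ISIC 2018 ground-truth files may differ.
task2 = pd.read_csv("task2_images.csv")        # column: image_id
task3 = pd.read_csv("task3_groundtruth.csv")   # columns: image_id, MEL, NV, ...

shared = task2.merge(task3, on="image_id", how="inner")
overlap = len(shared) / len(task2)
mel_or_nv = ((shared["MEL"] == 1) | (shared["NV"] == 1)).mean()

print(f"Images of Dataset A present in Dataset B: {overlap:.1%}")
print(f"Shared images labeled melanoma or nevus: {mel_or_nv:.1%}")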
Baseline for Comparison: Bio-UNet was compared against variants of CAM applied to the classification subnetwork for generating heatmap explanations for each of the indicators. We included Layer-CAM, Grad-CAM, and Grad-CAM++ in our experiments. Our goal is to demonstrate that the ability to classify images with high accuracy does not by itself lead to human-interpretable explanations, irrespective of the particular algorithm used for generating heatmaps.
Evaluation metrics: Our primary goal is to localize the melanoma indicators on the input image. For this goal, we use the Dice metric [11]. We computed the Dice metric between the generated mask for each indicator and the provided ground-truth map. For melanoma diagnosis, we used classification accuracy as the evaluation metric.
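For reference, a minimal sketch of the Dice coefficient between a predicted binary mask and a ground-truth mask is shown below; the function and argument names are illustrative.

import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    # Dice coefficient between two binary masks (numpy arrays of 0/1 or booleans).
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)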
Implementation Details
Threshold selection for binarization: The threshold that we use for binary mask generation has a significant impact on the results. When the threshold is small, pixels in low-confidence areas are set to 1, which greatly reduces the Dice coefficient. We performed a grid search over all thresholds in the range [0, 1] with an increment of 0.05. Our experiments demonstrated that 0.3 is an optimal choice that results in a high Dice coefficient. To unify the results, we used the threshold of 0.3 in all experiments.
Figure 3.3: The second image from the left is the ground truth; the third from the left is the output in the form of probabilities; the rightmost image is the binary mask obtained by setting pixels with probability > 0.3 to 1.
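A simple sketch of this threshold search over a validation set is given below, reusing the dice_coefficient helper from the previous sketch; the variable names and the data-loading step are assumptions.

import numpy as np

def select_threshold(prob_maps, gt_masks, step=0.05):
    # Grid-search the binarization threshold that maximizes the mean Dice score.
    # prob_maps: list of arrays with values in [0, 1]; gt_masks: list of binary arrays.
    best_t, best_dice = 0.0, -1.0
    for t in np.arange(0.0, 1.0 + 1e-9, step):
        scores = [dice_coefficient(p > t, m) for p, m in zip(prob_maps, gt_masks)]
        mean_dice = float(np.mean(scores))
        if mean_dice > best_dice:
            best_t, best_dice = t, mean_dice
    return best_t, best_dice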
Hardware and Optimizer: The entire framework is implemented using PyTorch and trained on four NVIDIA RTX 2080Ti GPUs with 11GB of memory. We use ResNet50 with weights pretrained on ImageNet as the encoder for our architecture Bio-UNet. The Adam optimizer with one set of hyperparameters (lr = 1e-4, weight decay = 1e-4) is used for all tasks during the training stage.
Preprocessing: We start by training the subnetwork f_CLR(·). Dataset B serves as the input for f_CLR(·). Each image is first resized to 512×512 and then normalized to have zero mean and unit variance. Finally, data augmentation is carried out to improve model generalization. Operations for data augmentation include random rotation, cropping, brightness, contrast, saturation, and flipping. Each mini-batch contains one positive example and 22 negative examples, with the batch size set to 24. The temperature parameter τ is set to 0.5.
We then perform the classification task using all of Dataset A's data, each image resized to 512×512 and normalized to zero mean and unit variance, without performing any data augmentation. Due to the large size of the images and the memory cap, the batch size is set to 16.
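A sketch of a torchvision augmentation pipeline matching the operations listed above is shown below; the exact rotation, crop, and jitter magnitudes are assumptions, since the text does not specify them.

from torchvision import transforms

# Two independent samples of this pipeline play the roles of T_1 and T_2.
# Magnitudes and normalization statistics are illustrative assumptions.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=30),
    transforms.RandomResizedCrop(512, scale=(0.6, 1.0)),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])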
Segmentation training: Once the classification task is complete, we use Grad-CAM to create an attention heatmap, with a bottleneck block and the desired target attribute index as input. We then use this method to replace the bottlenecks one at a time until every bottleneck has been tried. After gathering all heatmaps, we input them into the f_Seg(·) subnetwork to train it as the reconstruction function. We used a batch size of 8.
After the network training is complete, we create heatmaps using Grad-CAM as f_Grad-CAM(b_i, j), where b_i ∈ E_1(·), using the bottleneck block b_i in the encoder E_1(·) loaded with the optimal checkpoint parameters and the index of the selected attribute as input. We then input these heatmaps with a batch size of 8 into the f_Final_Seg(·) subnetwork to generate the final localization masks for the biomarker indicators.
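The following is a minimal sketch of extracting a Grad-CAM heatmap from one bottleneck block of a ResNet50 classifier for a chosen attribute index; it uses forward/backward hooks and illustrates the general technique rather than our exact implementation.

import torch
import torch.nn.functional as F

def grad_cam_heatmap(model, bottleneck, image, attr_idx):
    # model: classifier returning per-attribute logits; bottleneck: a module inside the
    # encoder; image: (1, 3, H, W) tensor. Returns a heatmap normalized to [0, 1].
    feats, grads = {}, {}
    fwd = bottleneck.register_forward_hook(lambda m, i, o: feats.update(a=o))
    bwd = bottleneck.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

    logits = model(image)
    model.zero_grad()
    logits[0, attr_idx].backward()
    fwd.remove(); bwd.remove()

    weights = grads["a"].mean(dim=(2, 3), keepdim=True)          # global-average-pooled gradients
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))  # weighted sum of feature maps
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return cam / (cam.max() + 1e-8)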
3.3.2 Performance Results
After training the ResNet50 subnetwork and the Bio-UNet architecture for classification, we obtained 76% and 77% classification accuracy, respectively. We observe that both architectures have better melanoma diagnosis rates than human expert diagnosis, and that using the additional subnetworks for segmentation and localization does not lead to a significant diagnosis performance boost. Table 3.2 presents results for localizing the five indicators when measured using the Dice metric. As can be seen, utilizing the Bio-UNet architecture along with our proposed training scheme significantly enhances localization results, by 15.54%, 11.54%, 17.44%, 27.13%, and 17.78% over the standard architecture for the five indicators listed in Table 3.1, respectively. This is a significant observation because it verifies again that merely being able to perform a classification task with high accuracy does not automatically mean that the model's decisions will be human-interpretable. We conclude from our results that, in order to make AI explainable, we may need to incorporate intermediate-level human-interpretable annotations in end-to-end deep learning training pipelines and design DNN architectures that learn to perform a downstream task using human-interpretable intermediate indicators.
Method | globules | milia_like_cyst | negative | pigment | streaks | AVG
Grad-CAM | 12.06% | 3.05% | 10.16% | 27.98% | 11.93% | 13.04%
Grad-CAM++ | 13.22% | 3.83% | 12.39% | 26.47% | 12.66% | 13.71%
Layer-CAM | 12.42% | 4.26% | 11.76% | 27.52% | 16.03% | 14.40%
Bio-UNet | 28.76% | 15.8% | 29.83% | 55.11% | 28.4% | 31.58%
Table 3.2: Evaluating the localization accuracy for the five clinical indicators. The Dice coefficient in percentage is reported.
To provide intuition behind the quality of the results presented in Table 3.2, we have visualized samples of heatmaps generated using the different methods in Figure 3.4 for visual inspection. In this figure, we have selected example images that are annotated with the corresponding indicator biomarker and presented the binary localization maps that the AI pipelines generate. We observe that the three CAM-based techniques generate feature maps that are far larger than the ground-truth mask and largely cover the majority of the input mole. This means that the generated maps are not interpretable, because they point to the whole mole, which even a novice knows should be the primary area of attention. In contrast, a close visual comparison between columns two and six demonstrates that our method generates binary feature maps that are quite similar to the ground truth, focusing on a subarea of the mole that actually pertains to the clinical indicator. We also conclude that although the Dice metric is the predominant metric for segmentation, it is a sensitive metric when the semantic classes in the images are imbalanced.
3.3.3 Ablative Experiments
Experiments on the importance of the subnetworks: We conduct an ablation study to investigate the contribution of each component of Bio-UNet. Table 3.3 presents the results of our ablative study. We observe that when the subnetwork f_CLR(·) is removed, localization results for the "milia_like_cyst", "negative network", and "streaks" indicators reduce significantly. This observation is expected because "streaks" and "negative network" exist in only a very small number of samples, while the "milia_like_cyst" indicator usually appears as discrete points. We can conclude that self-supervised learning is extremely helpful for localizing indicators that are infrequent and appear in a scattered, discontinuous manner on the input images. Figure 3.5 presents samples of the generated masks from our ablative experiment. The second row presents a case of the "milia_like_cyst" indicator. It can be seen that it is only a single dot-like region in the ground truth. When we remove f_CLR(·), the Dice metric drops from 0.727 to 0.001. The fifth column presents the result after removing f_CLR(·). From Figure 3.5, we see that the predicted point is only slightly moved to the right, which greatly reduces the Dice coefficient. Therefore, we conclude that contrastive learning helps to generate more accurate localization masks in some cases. We also observe that removing f_Seg(·) from Bio-UNet results in a significant degradation of the Dice metric for the "pigment network" indicator in the majority of samples. The reason might be that when the encoder E_1(·) downsamples the heatmaps generated by the model for segmentation, it produces heatmaps with better quality. Finally, when f_CLR(·) and f_Seg(·) are both removed from Bio-UNet, the average result drops by 3.78%, which demonstrates the importance of both components for improved performance. We conclude that all new aspects of our architecture are critical for improved performance.
Configuration | f_Final_Seg | f_Seg | f_CLR | globules | milia_like_cyst | negative | pigment | streaks | AVG
f_Res(·) + f_Final_Seg(·) | ✓ | | | 26.94 | 7.5 | 27.15 | 47.9 | 29.5 | 27.80
f_Res(·) + f_Seg(·) + f_Final_Seg(·) | ✓ | ✓ | | 27.76 | 11.61 | 25.01 | 53.58 | 22.25 | 28.04
f_Res(·) + f_CLR(·) + f_Final_Seg(·) | ✓ | | ✓ | 27.21 | 15.71 | 28.06 | 41.81 | 34.23 | 29.40
Bio-UNet | ✓ | ✓ | ✓ | 28.76 | 15.8 | 29.83 | 55.11 | 28.4 | 31.58
Table 3.3: Localization performance results for the ablative study on the importance of the subnetworks of Bio-UNet.
Experiments on the effect of each optimization stage: We also provide ablative samples to study the effect of each stage of the optimization pipeline according to Algorithm 1 in Figure 3.6. Note that when f_Final_Seg(·) has not been trained, we used the last bottleneck block as the final localization mask for the biomarker indicators. We observe that the corresponding localization masks are not accurate; apparently the model only bases its predictions on the left side (top right corner of Figure 3.6). Note that there are still some white pixels in ResNet50-bottleneck1 (top left corner of Figure 3.6), which indicates that information may be lost during the forward pass. By using f_Seg(·), the model uses more human-expert-approved regions for prediction. When we compare the network consisting of f_CLR(·) and f_Res(·) with the network consisting of only f_Res(·), the localization masks are not significantly different. In contrast, Bio-UNet generates a more fine-grained localization mask.
Impact of repeating f_Seg(·) within one epoch: Finally, we performed an experiment to study whether repeating f_Seg(·) within each epoch is helpful. As shown in Figure 3.7, we have visualized examples of the "streaks" indicator in the first and second rows. We observe that when we repeat f_Seg(·) once in a loop, the resulting localization mask is more similar to the ground truth than the result without any repetition. The average Dice coefficient of "streaks" increased significantly from 28.4% to 37.38%. This can be explained by the fact that the "streaks" indicator appears as continuous thin lines rather than discrete dots, and the subnetwork f_Seg(·) improves the sensitivity of the classification model to boundaries. When the number of repetitions increases from 1 to 4, the average Dice metric decreases from 37.38% to 30.51% for "streaks". We conclude that repetition of f_Seg(·) is useful when it is performed once. The third and fourth rows in Figure 3.7 are examples of the "milia_like_cyst" indicator, which usually appears as a single dot-like shape or scattered dots. In the example of the third row, when we repeat f_Seg(·) once in the loop, the Dice metric is higher than the Dice value obtained without repeating f_Seg(·). In the example of the fourth row, the Dice metric for repeating once is lower than the Dice value when there is no repetition. The average Dice metric for "milia_like_cyst" decreased from 15.8% to 12.3%, and it decreased further from 12.3% to 5.2% when the number of repetitions increased from one to four. This is because f_Seg(·) does not work well for localization masks with a single dot or scattered dots. Because of these observations, we chose not to repeat f_Seg(·) for any of the attributes, to keep our architecture consistent.
Figure 3.4: Localization performance for samples of dermatoscopic images: from top to bottom, we have included a sample input image along with binary localization maps generated for the globules, milia-like cyst, negative network, pigment network, and streaks biomarker indicators. From left to right, the input image, the ground-truth mask of the indicator, and the masks generated by Grad-CAM, Grad-CAM++, Layer-CAM, and Bio-UNet are visualized. The CAM-based feature maps are generated with a ResNet50 backbone trained for classification.
Figure 3.5: Binary masks for the ablative study: from top to bottom are samples of globules, milia_like_cyst, negative network, pigment network, and streaks. From left to right are the input image, the ground-truth mask, and the masks generated by f_Res(·) + f_Final_Seg(·), f_Res(·) + f_CLR(·) + f_Final_Seg(·), f_Res(·) + f_Seg(·) + f_Final_Seg(·), and Bio-UNet, following Table 3.3.
Figure 3.6: Effect of each stage of optimization on localization quality: the three rows are the localization maps generated after completion of training of the Bio-UNet subnetworks. The network in the first row is f_Res(·), the network in the second row is composed of f_CLR(·) and f_Res(·), and the network in the third row is Bio-UNet.
Figure 3.7: Impact of repeating f_Seg(·): the columns from left to right are the ground truth, not repeated, repeated once, and repeated four times. The first and second rows are examples of "streaks". The third and fourth rows are examples of "milia_like_cyst".
Chapter 4
Conclusions and Future Work
We developed an architecture for explainable diagnosis of melanoma using skin lesion images. Our architecture is designed to localize melanoma clinical indicators spatially and use them to predict the diagnosis label. As a result, it performs the task similarly to a clinician, leading to interpretable decisions. We benefited from contrastive learning and self-supervised learning to address the challenge of annotated data scarcity for our task, which requires coarse annotations with respect to clinical indicators. Experimental results demonstrate that our model is able to generate localization masks for identifying clinical biomarkers and generates more plausible explanations compared to existing classification architectures.
Our model can be improved on several fronts. Although our architecture effectively localizes melanoma clinical indicators, we rely on an ensemble strategy in our experiments. The training time of the ensemble method is relatively long and its operation is slow. Therefore, we expect to build a single, simpler model that can achieve the same results as the ensemble method for all attributes. Additionally, we think that the Dice metric might not be the perfect metric to evaluate the localization ability of deep learning segmentation algorithms. Finally, we used a shared encoder for all clinical indicators, but there are existing works that demonstrate that using a multi-branch model might be more useful [6]. The idea is to use indicator-specific encoders and perform a preprocessing step to enhance the possibility of attending to a given clinical indicator.
Future work includes clinical verification of the results in consultation with dermatologists. This is a challenging task given that finding collaborators from medical schools is not straightforward. The other direction that we envision is exploring the problem of model fairness. In our exploration, we have noticed that existing datasets are extremely biased in favor of people with light skin. This raises the concern that even if we train a model with good generalization error on the testing split of existing datasets, it may not generalize well on images from people with dark skin tones. We tried to mitigate this challenge by augmenting the dataset we used with new images, but unfortunately, collecting images from people with dark skin tones is not an easy task, and we leave this for future exploration. Finally, extending our results to a multimodal setting can be beneficial because dermatologists often rely on several factors and data modalities when they diagnose melanoma. As wearable measurement devices are becoming cheaper, AI-based precision health screening can be improved by considering multimodal data analysis.
Bibliography
[1] Adekanmi A Adegun and Serestina Viriri. “Deep learning-based system for automatic melanoma
detection”. In: IEEE Access 8 (2019), pp. 7160–7172.
[2] Saad Albawi, Tareq Abed Mohammed, and Saad Al-Zawi. “Understanding of a convolutional neural
network”. In:2017internationalconferenceonengineeringandtechnology(ICET). Ieee. 2017, pp. 1–6.
[3] Gadipudi Amaranageswarao, S. Deivalakshmi, and Seok-Bum Ko. “Wavelet based medical image
super resolution using cross connected residual-in-dense grouped convolutional neural network”.
In: Journal of Visual Communication and Image Representation 70 (2020), p. 102819.issn: 1047-3203.
doi: https://doi.org/10.1016/j.jvcir.2020.102819.
[4] Plamen P. Angelov, Eduardo A. Soares, Richard Jiang, Nicholas I. Arnold, and Peter M. Atkinson.
“Explainable artificial intelligence: an analytical review”. In: Wiley Interdisciplinary Reviews: Data
Mining and Knowledge Discovery 11 (2021).
[5] G Argenziano, G Fabbrocini, P Carli, V De Giorgi, E Sammarco, and M Delfino. “Epiluminescence
microscopy for the diagnosis of ABCD rule of dermatoscopy and a new 7-point checklist based on
pattern analysis”. In: Archives of Dermatology 134 (1998), pp. 1536–1570.
[6] Kleanthis Avramidis, Mohammad Rostami, Melinda Chang, and Shrikanth Narayanan.
“Automating Detection of Papilledema in Pediatric Fundus Images with Explainable Machine
Learning”. In: 2022 IEEE International Conference on Image Processing (ICIP). IEEE. 2022,
pp. 3973–3977.
[7] Swarnambiga Ayyachamy, Varghese Alex, Mahendra Khened, and Ganapathy Krishnamurthi.
“Medical image retrieval using Resnet-18”. In: Medical Imaging 2019: Imaging Informatics for
Healthcare, Research, and Applications. Vol. 10954. SPIE. 2019, pp. 233–241.
[8] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. “Yolact: Real-time instance
segmentation”. In: Proceedings of the IEEE/CVF international conference on computer vision. 2019,
pp. 9157–9166.
[9] Cancer facts & figures 2022 . en. https://www.cancer.org/research/cancer-facts-statistics/all-
cancer-facts-figures/cancer-facts-figures-2022.html. Accessed: 2022-11-27.
[10] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and
Manning Wang. “Swin-unet: Unet-like pure transformer for medical image segmentation”. In:
arXiv preprint arXiv:2105.05537 (2021).
[11] Aaron Carass, Snehashis Roy, Adrian Gherman, Jacob C Reinhold, Andrew Jesson, Tal Arbel,
Oskar Maier, Heinz Handels, Mohsen Ghafoorian, Bram Platel, et al. “Evaluating white matter
lesion segmentations with refined Sørensen-Dice analysis”. In: Scientific reports 10.1 (2020),
pp. 1–19.
[12] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. “A simple framework for
contrastive learning of visual representations”. In: International conference on machine learning.
PMLR. 2020, pp. 1597–1607.
[13] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. “A Simple Framework
for Contrastive Learning of Visual Representations”. In: CoRR abs/2002.05709 (2020). arXiv:
2002.05709.url: https://arxiv.org/abs/2002.05709.
[14] Junlong Cheng, Shengwei Tian, Long Yu, Chengrui Gao, Xiaojing Kang, Xiang Ma, Weidong Wu,
Shijia Liu, and Hongchun Lu. “ResGANet: Residual group attention network for medical image
classification and segmentation”. In: Medical Image Analysis 76 (2022), p. 102313.issn: 1361-8415.
doi: https://doi.org/10.1016/j.media.2021.102313.
[15] Noel C. F. Codella, Veronica Rotemberg, Philipp Tschandl, M. Emre Celebi, Stephen W. Dusza,
David A. Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael A. Marchetti,
Harald Kittler, and Allan Halpern. “Skin Lesion Analysis Toward Melanoma Detection 2018: A
Challenge Hosted by the International Skin Imaging Collaboration (ISIC)”. In:CoRR abs/1902.03368
(2019). arXiv: 1902.03368.url: http://arxiv.org/abs/1902.03368.
[16] Noel CF Codella, Q-B Nguyen, Sharath Pankanti, David A Gutman, Brian Helba, Allan C Halpern,
and John R Smith. “Deep learning ensembles for melanoma recognition in dermoscopy images”. In:
IBM Journal of Research and Development 61.4/5 (2017), pp. 5–1.
[17] Pádraig Cunningham, Matthieu Cord, and Sarah Jane Delany. “Supervised learning”. In: Machine
learning techniques for multimedia. Springer, 2008, pp. 21–49.
[18] M. Deramgozin, S. Jovanovic, H. Rabah, and N. Ramzan. “A Hybrid Explainable AI Framework
Applied to Global and Local Facial Expression Recognition”. In: 2021 IEEE International Conference
on Imaging Systems and Techniques (IST). 2021, pp. 1–5.doi: 10.1109/IST50367.2021.9651357.
[19] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al.
“An image is worth 16x16 words: Transformers for image recognition at scale”. In: arXiv preprint
arXiv:2010.11929 (2020).
[20] Yaroslav Ganin and Victor Lempitsky. “Unsupervised domain adaptation by backpropagation”. In:
International conference on machine learning. PMLR. 2015, pp. 1180–1189.
[21] Fei Gao, Teresa Wu, Xianghua Chu, Hyunsoo Yoon, Yanzhe Xu, and Bhavika Patel. “Deep residual
inception encoder–decoder network for medical imaging synthesis”. In: IEEE journal of biomedical
and health informatics 24.1 (2019), pp. 39–49.
[22] Alberto Garcia-Garcia, Sergio Orts-Escolano, Sergiu Oprea, Victor Villena-Martinez, and
Jose Garcia-Rodriguez. “A review on deep learning techniques applied to semantic segmentation”.
In: arXiv preprint arXiv:1704.06857 (2017).
[23] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu,
Xingxing Wang, Gang Wang, Jianfei Cai, et al. “Recent advances in convolutional neural
networks”. In: Pattern recognition 77 (2018), pp. 354–377.
[24] Gery P Guy Jr, Cheryll C Thomas, Trevor Thompson, Meg Watson, Greta M Massetti,
Lisa C Richardson, and Centers for Disease Control and Prevention (CDC). “Vital signs: melanoma
incidence and mortality trends and projections - United States, 1982-2030”. en. In: MMWR Morb.
Mortal. Wkly. Rep. 64.21 (June 2015), pp. 591–596.
[25] Abdul Mueed Hafiz and Ghulam Mohiuddin Bhat. “A survey on instance segmentation: state of the
art”. In: International journal of multimedia information retrieval 9.3 (2020), pp. 171–189.
[26] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. “Momentum Contrast for
Unsupervised Visual Representation Learning”. In: CoRR abs/1911.05722 (2019). arXiv: 1911.05722.
url: http://arxiv.org/abs/1911.05722.
[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image
Recognition”. In: CoRR abs/1512.03385 (2015). arXiv: 1512.03385.url:
http://arxiv.org/abs/1512.03385.
[28] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. “Using self-supervised
learning can improve model robustness and uncertainty”. In: Advances in neural information
processing systems 32 (2019).
[29] Huimin Huang, Lanfen Lin, Ruofeng Tong, Hongjie Hu, Qiaowei Zhang, Yutaro Iwamoto,
Xianhua Han, Yen-Wei Chen, and Jian Wu. “UNet 3+: A Full-Scale Connected UNet for Medical
Image Segmentation”. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). 2020, pp. 1055–1059.doi: 10.1109/ICASSP40776.2020.9053405.
[30] Huimin Huang, Lanfen Lin, Ruofeng Tong, Hongjie Hu, Qiaowei Zhang, Yutaro Iwamoto,
Xianhua Han, Yen-Wei Chen, and Jian Wu. “Unet 3+: A full-scale connected unet for medical
image segmentation”. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). IEEE. 2020, pp. 1055–1059.
[31] Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and
Fillia Makedon. “A survey on contrastive self-supervised learning”. In: Technologies 9.1 (2020), p. 2.
[32] Mario Fernando Jojoa Acosta, Liesle Yail Caballero Tovar, Maria Begonya Garcia-Zapirain, and
Winston Spencer Percybrooks. “Melanoma diagnosis using deep learning techniques on
dermatoscopic images”. In: BMC Medical Imaging 21.1 (2021), pp. 1–11.
[33] Sara Hosseinzadeh Kassani and Peyman Hosseinzadeh Kassani. “A comparative study of deep
learning architectures on melanoma detection”. In: Tissue and Cell 58 (), pp. 76–83.
[34] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet Classification with Deep
Convolutional Neural Networks”. In: Advances in Neural Information Processing Systems. Ed. by
F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger. Vol. 25. Curran Associates, Inc., 2012.url:
https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
[35] Yuexiang Li and Linlin Shen. “Skin lesion analysis towards melanoma detection using deep
learning network”. In: Sensors 18.2 (2018), p. 556.
[36] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. “Path aggregation network for instance
segmentation”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2018, pp. 8759–8768.
[37] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. “Unsupervised domain
adaptation with residual transfer networks”. In: Advances in neural information processing systems
29 (2016).
[38] Walter Hugo Lopez Pinaya, Sandra Vieira, Rafael Garcia-Dias, and Andrea Mechelli. “Chapter 10 -
Convolutional neural networks”. In: Machine Learning. Ed. by Andrea Mechelli and Sandra Vieira.
Academic Press, 2020, pp. 173–191.isbn: 978-0-12-815739-8.doi:
https://doi.org/10.1016/B978-0-12-815739-8.00010-9.
[39] Lin Lu, Liqiong Jian, Jun Luo, and Bin Xiao. “Pancreatic Segmentation via Ringed Residual U-Net”.
In: IEEE Access 7 (2019), pp. 172871–172878.doi: 10.1109/ACCESS.2019.2956550.
[40] Siyuan Lu, Di Wu, Zheng Zhang, and Shui-Hua Wang. “An Explainable Framework for Diagnosis
of COVID-19 Pneumonia via Transfer Learning and Discriminant Correlation Analysis”. In: ACM
Trans. Multimedia Comput. Commun. Appl. 17.3s (Oct. 2021).issn: 1551-6857.doi: 10.1145/3449785.
[41] Scott M Lundberg and Su-In Lee. “A unified approach to interpreting model predictions”. In:
Proceedings of the 31st international conference on neural information processing systems. 2017,
pp. 4768–4777.
[42] Melanoma of the Skin - cancer stat facts. en. https://seer.cancer.gov/statfacts/html/melan.html.
Accessed: 2022-11-27.
[43] Ishan Misra and Laurens van der Maaten. “Self-supervised learning of pretext-invariant
representations”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 2020, pp. 6707–6717.
[44] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. “Virtual adversarial training: a
regularization method for supervised and semi-supervised learning”. In: IEEE transactions on
pattern analysis and machine intelligence 41.8 (2018), pp. 1979–1993.
[45] Yaniv Morgenstern, Mohammad Rostami, and Dale Purves. “Properties of artificial networks
evolved to contend with natural spectra”. In: Proceedings of the National Academy of Sciences
111.supplement_3 (2014), pp. 10868–10872.
[46] Seiya Murabayashi and Hitoshi Iyatomi. “Towards Explainable Melanoma Diagnosis: Prediction of
Clinical Indicators Using Semi-supervised and Multi-task Learning”. In: 2019 IEEE International
Conference on Big Data (Big Data). 2019, pp. 4853–4857.doi: 10.1109/BigData47090.2019.9005726.
[47] Seiya Murabayashi and Hitoshi Iyatomi. “Towards explainable melanoma diagnosis: prediction of
clinical indicators using semi-supervised and multi-task learning”. In: 2019 IEEE International
Conference on Big Data (Big Data). IEEE. 2019, pp. 4853–4857.
[48] Franz Nachbar, Wilhelm Stolz, Tanja Merkle, Armand B Cognetta, Thomas Vogt,
Michael Landthaler, Peter Bilek, Otto Braun-Falco, and Gerd Plewig. “The ABCD rule of
dermatoscopy”. en. In: J. Am. Acad. Dermatol. 30.4 (Apr. 1994), pp. 551–559.
[49] Ahmad Naeem, Muhammad Shoaib Farooq, Adel Khelifi, and Adnan Abid. “Malignant melanoma
classification using deep learning: datasets, performance measurements, challenges and
opportunities”. In: IEEE Access 8 (2020), pp. 110575–110597.
[50] Natasha Nigar, Muhammad Umar, Muhammad Kashif Shahzad, Shahid Islam, and Douhadji Abalo.
“A Deep Learning approach based on Explainable Artificial Intelligence for Skin Lesion
Classification”. In: IEEE Access (2022).
[51] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. “Learning deconvolution network for
semantic segmentation”. In: Proceedings of the IEEE international conference on computer vision.
2015, pp. 1520–1528.
[52] Keiron O’Shea and Ryan Nash. “An introduction to convolutional neural networks”. In: arXiv
preprint arXiv:1511.08458 (2015).
[53] Phillip E Pope, Soheil Kolouri, Mohammad Rostami, Charles E Martin, and Heiko Hoffmann.
“Explainability methods for graph convolutional neural networks”. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. 2019, pp. 10772–10781.
[54] J Premaladha and KS Ravichandran. “Novel approaches for diagnosing melanoma skin lesions
through supervised and deep learning algorithms”. In: Journal of medical systems 40.4 (2016),
pp. 1–12.
[55] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “"Why should I trust you?" Explaining
the predictions of any classifier”. In: Proceedings of the 22nd ACM SIGKDD international conference
on knowledge discovery and data mining. 2016, pp. 1135–1144.
[56] Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. “"Why Should I Trust You?": Explaining
the Predictions of Any Classifier”. In: CoRR abs/1602.04938 (2016). arXiv: 1602.04938.url:
http://arxiv.org/abs/1602.04938.
[57] H W Rogers, M A Weinstock, S R Feldman, and B M Coldiron. “Incidence estimate of nonmelanoma
skin cancer (keratinocyte carcinomas) in the US population”. In: JAMA Dermatol (2015).
[58] Bernardino Romera-Paredes and Philip Torr. “An embarrassingly simple approach to zero-shot
learning”. In: International conference on machine learning. PMLR. 2015, pp. 2152–2161.
[59] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. “U-Net: Convolutional Networks for
Biomedical Image Segmentation”. In: CoRR abs/1505.04597 (2015). arXiv: 1505.04597.url:
http://arxiv.org/abs/1505.04597.
[60] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for
biomedical image segmentation”. In: International Conference on Medical image computing and
computer-assisted intervention. Springer. 2015, pp. 234–241.
[61] Mohammad Rostami. “Lifelong domain adaptation via consolidated internal distribution”. In:
Advances in Neural Information Processing Systems 34 (2021), pp. 11172–11183.
[62] Mohammad Rostami, David Huber, and Tsai-Ching Lu. “A crowdsourcing triage algorithm for
geopolitical event forecasting”. In: Proceedings of the 12th ACM Conference on Recommender
Systems. 2018, pp. 377–381.
[63] Mohammad Rostami, David Isele, and Eric Eaton. “Using task descriptions in lifelong machine
learning for improved performance and zero-shot transfer”. In: Journal of Artificial Intelligence
Research 67 (2020), pp. 673–704.
[64] Mohammad Rostami, Soheil Kolouri, Eric Eaton, and Kyungnam Kim. “Deep transfer learning for
few-shot SAR image classification”. In: Remote Sensing 11.11 (2019), p. 1374.
[65] Wojciech Samek, Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, and
Klaus-Robert Müller. “Evaluating the visualization of what a deep neural network has learned”. In:
IEEE transactions on neural networks and learning systems 28.11 (2016), pp. 2660–2673.
[66] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh,
and Dhruv Batra. “Grad-cam: Visual explanations from deep networks via gradient-based
localization”. In: Proceedings of the IEEE international conference on computer vision. 2017,
pp. 618–626.
[67] Neeraj Sharma, Luca Saba, Narendra N Khanna, Mannudeep K Kalra, Mostafa M Fouda, and
Jasjit S Suri. “Segmentation-Based Classification Deep Learning Model Embedded with Explainable
AI for COVID-19 Detection in Chest X-ray Scans”. In: Diagnostics 12.9 (2022), p. 2132.
[68] Pulkit Sharma. Image segmentation: Types of image segmentation. June 2022.url: https:
//www.analyticsvidhya.com/blog/2019/04/introduction-image-segmentation-techniques-python/.
[69] Mohammad Shorfuzzaman. “An explainable stacked ensemble of deep learning models for
improved melanoma skin cancer detection”. In: Multimedia Systems 28.4 (2022), pp. 1309–1323.
[70] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. “Learning important features through
propagating activation differences”. In: International conference on machine learning. PMLR. 2017,
pp. 3145–3153.
[71] Rebecca L. Siegel, Kimberly D. Miller, Hannah E. Fuchs, and Ahmedin Jemal.Cancerstatistics,2022.
en. Jan. 2022.doi: 10.3322/caac.21708.
[72] Serban Stan and Mohammad Rostami. “Privacy Preserving Domain Adaptation for Semantic
Segmentation of Medical Images”. In: arXiv preprint arXiv:2101.00522 (2021).
[73] Serban Stan and Mohammad Rostami. “Unsupervised model adaptation for continual semantic
segmentation”. In: Proceedings of the AAAI Conference on Artificial Intelligence . Vol. 35. 3. 2021,
pp. 2593–2601.
[74] Robert S Stern. “Prevalence of a history of skin cancer in 2007: results of an incidence-based
model”. en. In: Arch. Dermatol. 146.3 (Mar. 2010), pp. 279–282.
[75] Fabian Stieler, Fabian Rabe, and Bernhard Bauer. “Towards domain-specific explainable AI: model
interpretation of a skin image classifier using a human approach”. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. 2021, pp. 1802–1809.
[76] Nazneen N Sultana and Niladri B Puhan. “Recent deep learning methods for melanoma detection:
a review”. In: International Conference on Mathematics and Computing. Springer. 2018, pp. 118–132.
[77] Ruitong Sun and Mohammad Rostami. “Explainable Artificial Intelligence Architecture for
Melanoma Diagnosis Using Indicator Localization and Self-Supervised Learning”. In: ().
[78] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. “The HAM10000 Dataset: A Large Collection
of Multi-Source Dermatoscopic Images of Common Pigmented Skin Lesions”. In: CoRR
abs/1803.10417 (2018). arXiv: 1803.10417.url: http://arxiv.org/abs/1803.10417.
[79] Mingrui Wang and Xuhui Gong. “Metastatic cancer image binary classification based on resnet
model”. In: 2020 IEEE 20th International Conference on Communication Technology (ICCT). IEEE.
2020, pp. 1356–1359.
[80] Panqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison Cottrell.
“Understanding convolution for semantic segmentation”. In: 2018 IEEE winter conference on
applications of computer vision (WACV). Ieee. 2018, pp. 1451–1460.
[81] Wei Wang, Vincent W Zheng, Han Yu, and Chunyan Miao. “A survey of zero-shot learning:
Settings, methods, and applications”. In: ACM Transactions on Intelligent Systems and Technology
(TIST) 10.2 (2019), pp. 1–37.
[82] Weibin Wang, Dong Liang, Qingqing Chen, Yutaro Iwamoto, Xian-Hua Han, Qiaowei Zhang,
Hongjie Hu, Lanfen Lin, and Yen-Wei Chen. “Medical image classification using deep learning”. In:
Deep learning in healthcare. Springer, 2020, pp. 33–51.
[83] Yaqing Wang, Quanming Yao, James Tin Yau Kwok, and Lionel Ming-Shuan Ni. “Generalizing from
a few examples: A survey on few-shot learning”. In: ACM Computing Surveys 53.3 (2020), p. 63.
[84] Yongqin Xian, Bernt Schiele, and Zeynep Akata. “Zero-shot learning-the good, the bad and the
ugly”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017,
pp. 4582–4591.
[85] Xiangyi Yan, Hao Tang, Shanlin Sun, Haoyu Ma, Deying Kong, and Xiaohui Xie. “After-unet: Axial
fusion transformer unet for medical image segmentation”. In: Proceedings of the IEEE/CVF Winter
Conference on Applications of Computer Vision. 2022, pp. 3971–3981.
[86] Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. “S4l: Self-supervised
semi-supervised learning”. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision. 2019, pp. 1476–1485.
[87] Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff.
“Top-down neural attention by excitation backprop”. In: International Journal of Computer Vision
126.10 (2018), pp. 1084–1102.
[88] Qingchen Zhang, Changchuan Bai, Zhuo Liu, Laurence T Yang, Hang Yu, Jingyuan Zhao, and
Hong Yuan. “A GPU-based residual network for medical image classification in smart medicine”.
In: Information Sciences 536 (2020), pp. 91–100.
[89] Qingchen Zhang, Changchuan Bai, Zhuo Liu, Laurence T. Yang, Hang Yu, Jingyuan Zhao, and
Hong Yuan. “A GPU-based residual network for medical image classification in smart medicine”.
In: Information Sciences 536 (2020), pp. 91–100.issn: 0020-0255.doi:
https://doi.org/10.1016/j.ins.2020.05.013.
[90] Xiaoqing Zhang. “Melanoma segmentation based on deep learning”. In: Computer assisted surgery
22.sup1 (2017), pp. 267–277.
[91] Tao ZHOU, Bing-qiang HUO, Hui-ling LU, and Hai-ling REN. “Research on residual neural
network and its application on medical image processing”. In: ACTA ELECTONICA SINICA 48.7
(2020), p. 1436.
[92] Zhi-Hua Zhou. “A brief introduction to weakly supervised learning”. In: National science review 5.1
(2018), pp. 44–53.
[93] Xiaojin Zhu and Andrew B Goldberg. “Introduction to semi-supervised learning”. In: Synthesis
lectures on artificial intelligence and machine learning 3.1 (2009), pp. 1–130.
[94] Xiaojin Jerry Zhu. “Semi-supervised learning literature survey”. In: (2005).
Abstract
Melanoma is a prevalent lethal type of cancer that is treatable mostly if diagnosed at early stages of development. As a result, there is an urgent need to develop and implement screening interventions that can effectively and conveniently identify patients with or at risk for melanoma at scale and at early stages, so that treatment can be applied on time. Skin lesions are a typical indicator for diagnosing melanoma, but they often lead to delayed diagnosis due to the high similarity of cancerous and benign lesions at early stages of melanoma. In other words, despite the possibility of diagnosing melanoma through inspecting skin lesions, the task can reliably be performed only by expert dermatologists. Deep learning (DL) can be used as a solution to classify skin lesion pictures with high accuracy, but clinical adoption of deep learning faces a significant challenge: the decision processes of deep learning models are often uninterpretable, which makes them black boxes that are challenging to trust. In this thesis, we develop an explainable deep learning architecture for melanoma diagnosis which generates clinically interpretable visual explanations for its decisions. Our idea is based on supervising a deep neural network to learn to identify clinical indicators of melanoma and then base the diagnosis task on these indicators. As a result, the model is trained to diagnose melanoma similarly to expert dermatologists. More importantly, our model is able to localize the clinical indicators on the input skin lesion images. We conduct experiments on a real-world melanoma dataset. Our experiments demonstrate that our proposed architecture matches clinical explanations significantly better than existing architectures.