Molecular classification of breast
cancer specimens from tissue
morphology
Rishi R. Rawat
Dedicated to Buddy Rawat
A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of
Philosophy (Cancer Biology and Genomics) from the University of Southern California
Graduate School
August 2019
Table of Contents
Introduction ---------------------------------------------------------------------------------------------------- 3
Chapter 1. Correlating nuclear morphometric patterns with estrogen receptor status in breast
cancer pathologic specimens ------------------------------------------------------------------------------- 10
Chapter 2. Machine learning of tissue “fingerprints” improves H&E-based molecular
classification of breast cancer ------------------------------------------------------------------------------ 27
Chapter 3. Development of a high throughput, reproducible immune cell quantifier in breast
cancer specimens -------------------------------------------------------------------------------------------- 49
Chapter 4. A methodological synthesis ------------------------------------------------------------------- 60
Conclusion ---------------------------------------------------------------------------------------------------- 66
References ---------------------------------------------------------------------------------------------------- 74
Introduction
Richard Feynman, the famous physicist, once said: "It is very easy to answer many of these fundamental biological questions; you just look at the thing!"
But where should we look, and what should we look for? Since the invention of the light microscope in the 1600s, our
understanding of health and disease has evolved considerably. We’ve learned about cells,
subcellular and extracellular components, and these insights have shaped the way we think about
disease. However, there’s a tremendous amount of information available under the lens of the
microscope, potentially far more than we currently appreciate and use. What’s exciting about
recent advances in artificial intelligence (AI) is that they offer a means to learn patterns directly
from raw data, which may be significantly different from the patterns commonly studied in
medicine.
One of the most compelling recent examples of this was the C-Path study by Beck et al. [1]
Historically, pathologists looking at breast cancer specimens have focused intensely on the
cancerous regions within the tissue: looking at features like tubule formation, mitotic figures, and
nuclear grade to gauge the risk to the patient. However, when Beck et al. used automated image
analysis, to the surprise of many researchers, they found that computers could identify features in
the stromal regions that could predict survival independent of the cancer regions. This finding went
against the then-traditional view of cancer being a disease of cancer cells, and in recent years, a
large body of research has emerged to support the importance of stroma-cancer cross talk, which
can drive cancer growth, immune evasion, and drug resistance. [2]
In hindsight, the importance of the stromal regions in influencing the behavior of cancer seems
obvious. However, this feeling illustrates the blinding power of bias and dogma. Once we accept
a principle, it's hard to see past it. Unless, of course, we challenge these assumptions. The power
and potential of digital image analysis lies in its ability to rapidly identify features and test them
for clinical significance.
In the seven years since the C-Path study, the fields of digital image analysis and the related branch
of artificial intelligence called computer vision have evolved considerably. While the C-Path study
focused on a large collection of hand-crafted features related to cell morphology, cell type
clustering, etc., novel algorithms based on artificial neural networks make it possible to study the
patterns in images with minimal human input and bias. While artificial neural network algorithms
have been studied for over 50 years and have gone through waves of intense development (Figure 1), [3] followed by long periods of disinterest; we are currently in a third cycle, which is much like a renaissance. Whereas previous versions of these algorithms (called cybernetics or connectionism) failed to gain traction due to limited computing resources, recent advances in hardware, driven by the gaming industry, have made it possible for computers to perform a range of complex image prediction tasks with human-level or greater-than-human-level accuracy.
The landmark result that sparked the renewed interest in neural networks came in 2012, when a
specific type of neural network called a “deep neural network” (DNN) nearly halved the error rate
on the annual ImageNet Large Scale Visual Recognition Challenge (Figure 2). [4]

Figure 1. Two previous waves of interest in artificial neural network algorithms. Figure adapted from Goodfellow et al. [3]

Figure 2. Error rates on the ImageNet image recognition challenge decreased by nearly 50% in 2012 through the use of deep neural networks. Figure source: The Economist. [4]

The ImageNet challenge is designed to allow researchers to compare progress in image detection algorithms across a wide variety of natural everyday objects. A training set is provided to researchers
containing labeled images from over 1000 classes (including cars, trucks, desks, cats, dogs, etc.).
Then the algorithms are tested on a set of held-out data, blinded to the researchers, but known by
the contest organizers. Until 2011, the error rate of multiple algorithms seemed stuck at around
25%. However, in 2012 a deep network called “AlexNet” (named after its first author) was able to
cut the error by 40%, down to 15%. This was the beginning of a major shift in the field. The
next year, a deep network inspired by AlexNet, but significantly larger and more complex, got
nearly 10% error, and then in 2015, an even more complex neural network classified images better
than human-level performance.
The difference between a network like AlexNet and previous networks is that AlexNet is a deep convolutional neural network, whereas most previous networks were shallow. "Deep" and "shallow" refer to the number of layers in these networks and how they process inputs. More than 5 layers is considered "deep"; AlexNet contains 7 layers. Conceptually, neural networks are
typically represented as connected nodes in layers. Each node is a neuron, and the connections
(inspired by synapses between biological neurons) are weight parameters that are refined during
training. When a network is trained on a task, such as identifying cats versus dogs, numerous
pictures from both categories are shown to the network and the weights adjusted to make better
predictions. Importantly, neural networks can learn features directly from the data. We do not
always tell them to pay attention to eyes, hair, or the shapes of the animals; they can learn these
(or other features) on their own, in the process of trying to make the right classification.
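To make this concrete, the toy sketch below (included only as an illustration of my own, not one of the experiments in this thesis) shows a small convolutional classifier in PyTorch whose connection weights are adjusted by gradient descent from labeled examples; the random tensors simply stand in for labeled photographs of, say, cats and dogs.

    # Illustrative sketch: a tiny convolutional classifier whose weights
    # ("synapses") are adjusted from labeled examples during training.
    import torch
    import torch.nn as nn

    model = nn.Sequential(                              # a shallow convolutional network
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 2),                               # two classes, e.g., cat vs. dog
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    images = torch.randn(8, 3, 64, 64)                  # stand-in for a batch of labeled photos
    labels = torch.randint(0, 2, (8,))
    for _ in range(10):                                 # each step nudges the weights slightly
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()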
It is believed that depth is key to networks like AlexNet performing so well at image classification.
Since each network layer builds complexity on the preceding layer’s output, deeper architectures
may make it possible for networks to learn hierarchies of features in describing an image. Indeed,
a growing body of evidence suggests that neurons from earlier layers tend to learn fine-grained features such as edges and colors, whereas neurons from deeper layers in the network learn concepts that are more holistic, such as body parts in the case of animals. [5]
This thesis describes the application of deep learning to discover novel visual biomarkers of breast
cancer from routinely stained pathology slides. This research is both clinically and scientifically motivated. Deep learning can learn patterns of cells and tissue architecture that go beyond traditional definitions, and the ubiquitous use of hematoxylin and eosin (H&E) stains at hospitals around the world suggests that automated analysis could improve care for many patients. Specifically, this work focuses on two questions: how to predict clinically relevant subtypes of breast cancer, and how to reliably quantify immune cells, based on the H&E image alone.

Figure 3. Shallow versus deep convolutional neural networks.
In the process of developing these tools, I have developed three deep learning approaches,
described in three chapters. Chapter 1 describes a pilot study of CellNet, a hybrid neural network,
which uses deep learning to learn spatial relationships between nuclei. CellNet was used to predict
the clinical status of estrogen receptor (ER) from H&E images, a task which is normally assessed
by immunostaining. Chapter 2 approaches the question of cancer subtyping from a different angle,
which my co-authors and I call, “tissue fingerprinting.” Instead of directly predicting clinical
information—such as molecular markers—from H&E images, networks are first trained to learn
histologic features that can distinguish patients from each other. After this training is
accomplished, the network's features are extracted and correlated with clinical targets such as ER,
progesterone receptor (PR) and Her2. Chapter 3 describes an approach using deep learning to
characterize the immune infiltrate of breast cancer images. We trained a neural network to identify
immune cells from H&E images and found that machine-learned lymphocyte features could
significantly predict survival.
While these chapters outline approaches that were developed independently, Chapter 4 describes
a framework to synthesize the three projects to create a powerful, human-interpretable deep
learning system that learns from whole slides. As this work is ongoing, Chapter 4 may be regarded
as an introduction to works in progress.
I’m grateful for the mentors and teachers who have supported this research and my growth as a
scientist. In particular, I wish to acknowledge the following mentors and co-authors who have
directly contributed to the research described. Throughout the manuscript, when I use the pronoun
“we” I generally mean to include a subset of these collaborators who are coauthors and inventors
of the technologies described (in chronological order):
Paul Macklin, David B. Agus, Daniel Ruderman, Fei Sha, David Rimm, Darryl Shibata,
Michael Press, Yanling Ma, Itzel Ortega, Preeyam Roy.
In addition, I wish to acknowledge the Breast Cancer Research Foundation and Oracle Corporation
for supporting these projects with funding or donated computational resources, as well as the
Australian Breast Cancer Tissue Bank for providing images and clinical annotations.
Chapter 1. Correlating nuclear
morphometric patterns with estrogen
receptor status in breast cancer
pathologic specimens
I’m intrigued by the diversity of cellular nuclei. Some breast cancer nuclei are
two to three times larger than their normal counterparts. Others are half the
size. Multiple times I’ve had to go back and check that I’m using the ‘right’
magnification to make sure I’m looking at the right scale. What are the
implications of differently sized and shaped nuclei? This question prompted the
beginning of my work with image processing, computer vision, and neural
network algorithms. I asked whether nuclear variation could be linked to
differences in growth factor pathways.
Working with Daniel Ruderman, Paul Macklin, David L. Rimm, and David B. Agus, we conducted a pilot study [6] focused on predicting clinical estrogen receptor (ER) status—defined as greater than one percent of cells positive for estrogen receptor by immunohistochemistry (IHC) staining—from the spatial arrangement of nuclear features. The motivation to study the relationship between
morphology and growth factors comes from the way breast cancer patients are managed in the US.
Characterizing growth receptor pathways in breast cancer (via hormone receptor and HER2 status)
is critical for patient management in breast cancer. In the US, the standard of care uses multiple
IHC stains for ER, progesterone receptor (PR), and HER2 to categorize the breast tumor, determine
prognosis, and select treatment regimens. [7,8] However, these assays may be inconsistent across laboratories, [9] and they are somewhat expensive and often challenging in low-resource settings. Even though it has relatively low sensitivity and specificity for response, [7,10] marker status is one of the oldest and most widely used companion diagnostic tests. For example, only 50% of women with ER-positive tumors and 60%-70% of women with ER-positive and PR-positive tumors show partial or complete response to tamoxifen therapy. [11-13]
While pathologists have long seen a correlation between low-grade morphology and positive ER status, we hypothesized that a quantitative approach using deep learning could learn the visual features of ER status.
In this study, we focused on the morphometric features of nuclei and explored how deep learning
on these features, specifically, could distinguish between ER-negative and ER-positive breast
cancer. We constructed a learning pipeline consisting of 5 steps: (1) data acquisition, (2) extraction
of morphometric features (descriptors of position, shape and orientation), (3) quality control, (4)
training the neural network, and (5) testing the neural network.
Data Acquisition
The data for the study come from publicly available H&E images and corresponding clinical ER
status (positive/negative, determined by IHC) for a tissue microarray of 131 treatment-naïve
invasive ductal carcinoma (IDC) patients [14] (Table 1). The H&E images were acquired from the
website of the tissue microarray supplier, US Biomax, Inc. (Derwood, MD 20855). As a service
to customers, US Biomax, Inc. provides JPEG-compressed H&E images of many tissue
microarrays along with IHC staining information, such as ER receptor status. With permission
from US Biomax, Inc., we used the array titled “HBre-Duc140Sur-01”
(http://www.biomax.us/tissue-arrays/Breast/HBre-Duc140Sur-01), which contains 140 tissue
cores (1.5 mm diameter) from 140 patients diagnosed with invasive ductal carcinoma. We chose
this particular microarray because the H&E images displayed minimal staining artifacts and
included molecular marker staining status. To collect the data, I used the digital slide viewer on
the US Biomax, Inc. website, zoomed in to 20× resolution (0.5 µm per pixel) and took screenshots
of each core. These images were associated with ER status annotations from the US Biomax, Inc. website.

Table 1: Patient Information
Extraction of morphometric features
Following image acquisition, we used image analysis pipelines to segment nuclei and extract
features of nuclear morphometry. I implemented an automated nuclear segmentation pipeline
using Python (version 2.7.12) and Fiji [15] (version 1.0, a distribution of ImageJ [16]). The steps consist of the following:
1. Scale images as necessary to a resolution of 0.5 µm per pixel, using bicubic interpolation.
2. Transform the RGB image into hue, saturation, brightness channels, retaining only the
brightness channel for downstream analysis.
3. Apply an automatic, global Otsu threshold [17] to roughly identify cellular regions.
4. Apply a local adaptive threshold with a radius of 20 pixels (10 µm) to provide fine-scale
local separation of nuclei.
5. Use the built-in Fiji watershed transform to separate overlapping nuclei.
6. Calculate the following morphometric parameters for each detected nucleus using the
particle analysis functions in ImageJ: center of nucleus (x,y coordinates), major axis length,
minor axis length, major axis to minor axis ratio, area, perimeter, and circularity.
7. Convert data into a MultiCellDS digital tissue snapshot (a standardized XML
representation for spatial multicellular data) [18] for storage.
The pipeline identified on average 4960 nuclei per image.
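For readers who prefer a pure-Python reference, the sketch below approximates the same sequence of operations with scikit-image in place of Fiji/ImageJ. It is not the pipeline used in this study; the function name, threshold block size, and peak-separation distance are assumptions chosen only to mirror the steps listed above.

    # Rough approximation of the Fiji-based pipeline using scikit-image
    # (illustrative; parameters here are assumptions, not the thesis settings).
    import numpy as np
    import scipy.ndimage as ndi
    from skimage import color, filters, measure, segmentation, feature

    def segment_nuclei(rgb):
        hsv = color.rgb2hsv(rgb)
        brightness = hsv[..., 2]                                  # keep only the brightness channel
        cells = brightness < filters.threshold_otsu(brightness)   # global Otsu: dark pixels = nuclei
        local = brightness < filters.threshold_local(brightness, block_size=41)  # local threshold
        mask = cells & local
        distance = ndi.distance_transform_edt(mask)               # watershed to split touching nuclei
        peaks = feature.peak_local_max(distance, min_distance=5, labels=mask)
        markers = np.zeros(mask.shape, dtype=int)
        markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
        labels = segmentation.watershed(-distance, markers, mask=mask)
        rows = []
        for r in measure.regionprops(labels):                     # per-nucleus morphometric features
            circ = 4 * np.pi * r.area / (r.perimeter ** 2 + 1e-9)
            rows.append(dict(x=r.centroid[1], y=r.centroid[0],
                             major=r.major_axis_length, minor=r.minor_axis_length,
                             aspect=r.major_axis_length / max(r.minor_axis_length, 1e-9),
                             area=r.area, perimeter=r.perimeter, circularity=circ))
        return rows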
Quality Control
Looking at the output of the segmentations (Figure 1), it was clear that some images had not been
segmented properly. Thus, we applied a blinded quality control step to exclude over-segmented
images. Then we randomized the images into a training set (57 patients) and a test set (56 patients).
Figure 1. Exemplary over-segmented and well-segmented images. An image
segmentation was deemed “well segmented” if it appeared to be more than 70%
concordant.
Training the neural network
Reformatting the data
Following the exclusion of over-segmented images, we prepared the data for training (Figure 2).
Each MultiCellDS digital tissue snapshot was converted into a sparse 12 channel image consisting
of zeros everywhere except at the cell centers, which contain information about the nuclei. The
first six channels correspond to cellular shape features (major axis, minor axis, major:minor ratio,
area, perimeter, circularity). In addition, we constructed 6 “binary angle” features from the nuclear
angle measurement, leading to a total of 12 feature channels; if the major axis of cell i has an angle
θ_i (0 < θ_i < 180) with the positive x-axis, I define six orientation features φ_{i,j} (1 ≤ j ≤ 6) by

φ_{i,j} = 1 if 30 × (j−1) < θ_i ≤ 30 × j
φ_{i,j} = 0 otherwise.
The rationale for constructing binary features relates to the training process for the neural network.
We wanted the network to learn rotationally invariant features, which are robust to flips and
rotations (in the spatial image coordinates) of the 12-D image. The final step before training
involved downscaling the sparse images 4× via nearest-neighbor scaling to reduce downstream
computation. Thus, the DNN sees cell features at a resolution of 2 µm per pixel. Following downsampling, cells positioned at physical coordinates (x_1, y_1) are positioned at matrix indices (x_2, y_2) such that:

x_2 = floor(x_1 / 4)
y_2 = floor(y_1 / 4)
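The sketch below (an illustration of my own, not the original code) shows how the per-nucleus measurements could be rasterized into such a sparse 12-channel image, including the six binary angle channels; for brevity it places each cell directly at the downscaled indices given by the formulas above.

    # Minimal sketch of building the sparse 12-channel image from per-nucleus features.
    import numpy as np

    def sparse_feature_image(nuclei, height, width, downscale=4):
        """nuclei: list of dicts with x, y (pixels at 0.5 um/px), shape features, and angle in degrees."""
        h, w = height // downscale, width // downscale
        image = np.zeros((12, h, w), dtype=np.float32)
        for n in nuclei:
            r, c = int(n["y"]) // downscale, int(n["x"]) // downscale
            if not (0 <= r < h and 0 <= c < w):
                continue
            image[0:6, r, c] = [n["major"], n["minor"], n["aspect"],
                                n["area"], n["perimeter"], n["circularity"]]
            j = min(int(n["angle"] // 30), 5)          # one of six 30-degree orientation bins
            image[6 + j, r, c] = 1.0
        return image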
Figure 2. Construction of a sparse 12-channel image. a Hematoxylin and eosin stained tissue are
processed by a nuclear segmentation algorithm. Each nuclear feature is measured and represented on a
single 2D array, where individual cells are represented as points. Arrays are stacked to form a 12-channel image. b Detailed view of 12 individual channels that would be stacked to form a 12-channel image.
Network Design
The overall structure of our neural network was inspired by previous work applying deep learning
to image segmentation [19] and high-content screening [20]. The network has approximately 4.6 × 10^5 parameters arranged in six fully convolutional layers, 5 max pooling layers, one global mean layer, and one batch-normalization layer (Figure 3). Through cross-validation on the training set, it was decided to use leaky rectifying linear neurons with cross-entropy loss. Using a batch normalization layer [21] was necessary for convergence. Over one batch of training data, a batch normalization layer
produces outputs with zero mean and unit variance. In training, this leads to a well-distributed set
of output predictions, which accelerates the learning process. In addition, a dropout layer was used, which randomly eliminates 50% of the neurons during each round of training to prevent co-adaptation of neurons (a form of over-fitting). [22]

Figure 3. Schematic of the deep neural network. a The 12-channel image is loaded into a fully convolutional network with six convolutional and max-pooling layers (not shown for simplicity). The output is a 1D map of ER predictions, which is averaged and normalized (not shown) to produce an ER score for the image. The size of the matrix that holds the convolutional weights is indicated in red, where a matrix N × C × X × Y has N kernels that act on a C-channel input of size X × Y × C. b An example of convolutional and max pooling operations. In convolution, the starting image (left) is convolved by four kernels (middle) to produce four feature maps (right). In max pooling, the maximum value of each 2 × 2 square is used to produce an output image.
Using a global mean layer gives the option of training the network on images of arbitrary size.
However, we chose to train on small patches extracted from sparse images to increase the relative
size of the training set. Thus, during the training process, we used randomly extracted small patches
(100 × 100 pixels, 200 × 200 µm) from the downscaled feature maps (approx. 750 x 750 pixels,
1500 × 1500 µm) and assigned them the same class as the overall image. At runtime, these patches
were randomly flipped and rotated (in multiples of 90 degrees) to augment the dataset and promote
the learning of rotationally invariant features. Theoretically, the augmented training set consists of 10^8 different patches; however, only a subset of these images was actually used to train the network.
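A minimal sketch of this patch sampling and flip/rotation augmentation is given below; it is illustrative only, and the function name and bookkeeping are my own rather than the code used for the experiments.

    # Sketch of random patch extraction with flip/rotate augmentation.
    import numpy as np

    def random_augmented_patch(feature_image, patch=100):
        """feature_image: (12, H, W) sparse map at 2 um/pixel; returns a (12, patch, patch) crop."""
        _, h, w = feature_image.shape
        r = np.random.randint(0, h - patch + 1)
        c = np.random.randint(0, w - patch + 1)
        crop = feature_image[:, r:r + patch, c:c + patch]
        crop = np.rot90(crop, k=np.random.randint(4), axes=(1, 2))   # rotate by a multiple of 90 degrees
        if np.random.rand() < 0.5:
            crop = crop[:, :, ::-1]                                  # random horizontal flip
        return np.ascontiguousarray(crop)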
Each layer in the neural network combines features from the previous layer, and deeper layers can
learn higher order features. The model uses a fully convolutional architecture, which means that it
can process images of arbitrary size, producing output in the form of a spatial map that scales with
the size of the input image [19]. Thus, the final classification layer produces a spatial map for ER score over the image, and the average prediction over the map is treated as the score for the image. All experiments were conducted on an Nvidia K80 GPU using the deep learning libraries Theano [23] and Lasagne [24].
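The original model was implemented in Theano/Lasagne; purely for illustration, the sketch below re-expresses the same design (fully convolutional layers with leaky rectified linear units, batch normalization, dropout, and a global mean over the spatial output map) in PyTorch. The channel widths here are assumptions, not the trained model's exact configuration.

    # Illustrative PyTorch sketch of a fully convolutional classifier with a global mean layer.
    import torch
    import torch.nn as nn

    class PatchERNet(nn.Module):
        def __init__(self, in_channels=12, n_classes=2):
            super().__init__()
            widths = [32, 32, 64, 64, 128, 128]            # assumed channel widths
            layers, prev = [], in_channels
            for i, w in enumerate(widths):
                layers += [nn.Conv2d(prev, w, 3, padding=1), nn.LeakyReLU(0.1)]
                if i < 5:
                    layers += [nn.MaxPool2d(2)]            # five max-pooling layers
                prev = w
            layers += [nn.BatchNorm2d(prev), nn.Dropout2d(0.5),
                       nn.Conv2d(prev, n_classes, 1)]      # 1x1 conv -> spatial map of class scores
            self.body = nn.Sequential(*layers)

        def forward(self, x):
            score_map = self.body(x)                       # accepts inputs of arbitrary size
            return score_map.mean(dim=(2, 3))              # global mean over the spatial map

    logits = PatchERNet()(torch.randn(2, 12, 100, 100))    # patch-sized or full-sized input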
Training
The datasets consist of training (n = 57) and test (n = 56) sets. From the training set, we held out 20% of the data for cross-validation during the training process. From the training set, we subsampled small patches (100 × 100 pixels, 200 × 200 µm) and trained the network using image-level labels (ER+, ER-) for the patches and a cross-entropy loss function. After approximately 450 epochs (corresponding to training on approx. 7 × 10^4 individual patches), the training loss began to plateau
(Figure 4). The loss had plateaued by epoch 825, so we added back the held-out cross-validation
data and trained the net for approximately 1000 epochs to maximize accuracy on the entire training
dataset.
Figure 4. Training Loss.
Step 5: Testing the Neural Network
Following training, all parameters and weights in the neural network were fixed. Full sized images
were classified, and the predictions were stored in a text file for analysis. The test sets were held
out during training and were only evaluated after the network had been trained.
Results
Nuclear morphometric features predict ER status
After training the neural network, we tested the pipeline on the test set and measured area under
the receiver operating characteristic curve (AUROC) scores of 0.70 (95% CI = 0.56-0.85) and 0.72 (95% CI = 0.55-0.89) on the training and test sets, respectively (Figure 5). Because the 95% confidence intervals do not include the null value of 0.5 AUROC, this suggests that our pipeline learned to predict ER status at a statistically significant level.
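For reference, the snippet below shows one way to compute an AUROC and a 95% confidence interval from the stored per-image scores. The thesis does not specify how its intervals were derived; the bootstrap used here is a common choice and is shown only as an illustration.

    # Illustrative AUROC with a bootstrap 95% confidence interval.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def auroc_with_ci(y_true, y_score, n_boot=2000, seed=0):
        rng = np.random.default_rng(seed)
        y_true, y_score = np.asarray(y_true), np.asarray(y_score)
        point = roc_auc_score(y_true, y_score)
        boots = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(y_true), len(y_true))
            if len(np.unique(y_true[idx])) < 2:       # a resample must contain both classes
                continue
            boots.append(roc_auc_score(y_true[idx], y_score[idx]))
        lo, hi = np.percentile(boots, [2.5, 97.5])
        return point, (lo, hi)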
A correlation between nuclear size, heterogeneity, and ER status
While deep networks are typically considered to be uninterpretable “black boxes,” we applied
several techniques to reverse-engineer the system and understand the morphometric patterns the
DNN used to classify ER status. Our first step was to visualize the heatmap the DNN learned to
predict (Figure 6). This analysis is similar to laying an IHC image over an H&E image; however,
while an IHC image shows the real protein expression, the DNN heatmap shows regions estimated
by the DNN to reflect clinical ER status. Because the DNN was trained to predict an accurate
patient-level classification (not the spatial pattern of ER-staining), the regions predicted on the
heatmap may be different from regions predicted by IHC. However, regions on the DNN heatmap
contain information that leads to an accurate ER+/- prediction and are thus diagnostic regions for
ER-assessment.
For this analysis, we selected several cases that were classified correctly and overlaid the predicted
heatmaps on the H&E image to form a “digital stain” where ER-negative regions are colored red
and ER-positive regions are uncolored (Figure 6). By visual inspection, we observed a subset of
epithelial areas were predicted ER-negative. Thus, it appears that features in epithelial regions are
used by the DNN to classify ER status.
Figure 5. left Receiver operating characteristic (ROC) curves for the training dataset (AUROC = 0.70, 95% CI = 0.56-0.85), and right test dataset (AUROC = 0.72, 95% CI = 0.55-0.89).
Figure 6. Digital stain for regions predicted to be ER-negative. Pixels are shaded red in
regions predicted to be ER-negative with probability greater than 50%. Enlarged regions of
ER-negative tissue left reveal that the network classifies sub-regions of epithelial tissue as ER
negative. For comparison, ER positive tissue is shown right.
Next, we used the DNN to define spatial parameters related to the specific nuclear features linked
to the ER prediction. We divided all of the training images (n=57) into small image patches (64 ×
64 pixels, 128 × 128 µm, 11,161 total). Then we predicted the ER score for each patch and sorted
the patches by the score from ER positive to ER negative. When we looked at the patches most
strongly predicted to be ER positive or ER negative, we noticed a difference in nuclear size and
the variation in nuclear features: ER-negative predictions seemed correlated with larger, more variable nuclei than ER-positive predictions. To formally investigate whether the pipeline learned features related to nuclear size and heterogeneity, we divided the sorted list of image patches into 15 groups ranked by predicted ER score (744 patches per group; randomly chosen patches from these 15 groups are illustrated in Figure 7a). For each patch, we calculated the mean value of each nuclear feature (intra-patch mean) and the variance of the feature (intra-patch variance). We also calculated the inter-patch mean and standard error across all patches in each group (Figure 7b). This revealed that several nuclear morphometric quantities, such as mean height, width, area, and perimeter, were elevated in patches classified as ER negative. Additionally, nuclear heterogeneity (variance of nuclear features) was correlated with an ER-negative prediction.
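The ranking analysis can be summarized in a short sketch (illustrative only; the variable names are mine): patches are sorted by predicted ER score, split into 15 equal groups, and the intra-patch feature statistics are averaged within each group.

    # Sketch of grouping patches by predicted score and averaging intra-patch statistics.
    import numpy as np

    def group_feature_trends(patch_scores, patch_features, n_groups=15):
        """patch_scores: (N,) predicted scores; patch_features: list of (n_nuclei, n_feat) arrays."""
        order = np.argsort(patch_scores)                      # sort patches by predicted score
        groups = np.array_split(order, n_groups)
        inter_mean, inter_sem = [], []
        for g in groups:
            intra_means = np.stack([patch_features[i].mean(axis=0) for i in g])  # intra-patch means
            inter_mean.append(intra_means.mean(axis=0))       # inter-patch mean within the group
            inter_sem.append(intra_means.std(axis=0, ddof=1) / np.sqrt(len(g)))  # standard error
        return np.stack(inter_mean), np.stack(inter_sem)      # analogous code applies to variances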
Based on these observations, we directly tested if the mean and variance of nuclear features in a
patch could predict ER status. We randomly sampled 5000 patches from the training set, calculated
the intra-patch means and variances of nuclei within each patch and trained a logistic regression
model on these features. Next, we applied the trained logistic regression model to full-sized images
in the test set. We divided each image into equally-spaced non-overlapping patches, calculated an
ER score for each patch, and averaged the ER score from all patches in each test image. On the
training set, we obtained an AUROC of 0.648 (95% CI: 0.498-0.799). On the test set, we obtained
an AUROC of 0.672 (95% CI: 0.494-0.850). While these linear classifiers are less accurate than
the DNN, the trend suggests that these features capture information about ER status. Analyzing a
DNN trained on expert-defined features helped us interpret the DNN in terms of biological
relationships.
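A compact rendering of this baseline is sketched below (for illustration; it is not the original script): the intra-patch means and variances form the feature vector, a logistic regression model is fit on training patches, and per-patch scores are averaged over each test image.

    # Sketch of the logistic regression baseline on intra-patch nuclear statistics.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def patch_vector(patch_nuclei):                 # patch_nuclei: (n_nuclei, n_features) array
        return np.concatenate([patch_nuclei.mean(axis=0), patch_nuclei.var(axis=0)])

    def train_baseline(train_patches, train_labels):
        X = np.stack([patch_vector(p) for p in train_patches])
        return LogisticRegression(max_iter=1000).fit(X, train_labels)

    def image_score(model, image_patches):          # average patch scores over one full image
        X = np.stack([patch_vector(p) for p in image_patches])
        return model.predict_proba(X)[:, 1].mean()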
Figure 7. Correlating nuclear morphometric features with ER predictions from the neural
network. Image “patches” were extracted from the training dataset, ranked by predicted
probability of ER-status, and divided into 15 groups by prediction status. a Two representative
patches classified as ER positive and ER negative are shown. b left The mean of each nuclear
feature was calculated within each patch (intra-patch mean); within each group, intra-patch
means were averaged to calculate the inter-patch mean. b right The variance of each nuclear
feature was calculated in each patch (intra-patch variance); within each group, intra-patch
variances were averaged. The x-axis in b indicates group number; higher group numbers correspond to ER-negative predictions.
Discussion
We aimed to test feasibility of predicting ER status in breast cancer specimens based on nuclear
morphometric features in H&E stained specimens as a way of identifying molecular markers
and/or pathway activation without DNA sequencing or other molecular studies. For this pilot
study, we define ER-positive by clinical ER status (greater than one percent of cancer cells staining
positive for ER on an IHC stain). Using deep learning and labeled tissue images, we trained a
learning pipeline to correlate patterns of nuclei to ER status and found that it learned to predict ER
with statistical significance. Analysis of the trained model revealed that the network learned an
association between large pleomorphic nuclei and ER-negative tumors. While this finding is not novel, [25] it is significant that this is the first time a neural network learned this relationship without human supervision. As the size of the training dataset grows, we anticipate that it may learn novel patterns not currently recognized in the field. In fact, the ultimate goal of this work would be to evolve into a highly sensitive and specific theragnostic test of clinical benefit from hormonal therapy.
A core factor in this work was the development of a hybrid machine-learning approach that
combined expert-defined local features with the powerful feature-learning framework of
convolutional neural networks. While convolutional neural networks can learn high-order features
from the raw image data, training these models typically requires thousands to millions of training
images to minimize the impact of noise and color variations. To reduce the impact of stain
variation, our study introduced a pre-processing step to extract nuclear morphometric data and
developed a novel method for deep learning on these features instead of the raw RGB image pixels.
Preprocessing effectively compresses each training image into a vector of morphometric data.
While this constrains the types of features the neural network can learn, it also prevents the learning
of spurious correlations between nonsensical variables (e.g., staining variation). Thus, we believe
using expert-defined features as input allowed the network to learn patterns that generalized well
between the training and test datasets.
There are a number of limitations to this work that can be expected in a proof-of-concept study.
Most significant is the relatively low AUROC achieved, compared to the molecular methods to
predict expression of estrogen receptor. We recognize that in this early stage, this test is not close
to being a replacement for immunohistochemistry. However, similarly, the best molecular tests for
ER status also have a relatively low AUROC with respect to prediction of response to hormonal
therapy [26,27]. Furthermore, AUROC may not be the best way to evaluate predictive tests, since in
treating patients, specificity is always sacrificed for increased sensitivity to prevent any patient
from missing the opportunity to benefit from the drug. It is possible that with further effort, deep
learning on larger, more comprehensively annotated cohorts will be able to improve the specificity
without sacrificing sensitivity.
Another weakness of the work is the relatively small sample size and pilot nature of the study,
which focuses on tissue microarray cores. This work focused on the generation of the algorithms and the approach, prior to going through the challenging process of obtaining large sets of comprehensively annotated whole slide images from cooperative group studies. The publication of these pilot studies represents a prerequisite for obtaining and scanning whole sections from the valuable multi-institutional, evidence-level-1 trials.
This proof-of-concept demonstrates a technique to correlate morphometric features to a clinical
ER receptor status and provides a means to begin understanding the relationships between
morphometry and variables of potentially greater clinical significance, such as ER staining
heterogeneity or anti-estrogen response. Our hybrid system is not a “black-box” learning system.
It learns high-order features based on lower-order, human-defined features that can be reverse-
engineered to capture morphologic features that are highly correlated to molecular biology. In this
study, we used digital staining and patch analysis to visualize the correlation between large
pleomorphic nuclei with ER negative tumors. By incorporating subcellular or extra-cellular
features, future works may explore how the spatial distribution of nuclei and other features (e.g.
nucleoli, mitotic figures, collagen, lymphocytes) correlate to subtypes and outcomes. In fact, the
results of the C-Path study [1] suggest that the information we may extract from the extracellular features may be more informative for prediction of response than the cellular features. We believe
such algorithms will help researchers understand how the spatial relationships between different
types of cells correlate to disease severity and clinical outcomes.
Code availability
We used custom python and R scripts provided in the supplementary materials on the npj Breast
Cancer website. The nuclear segmentations that were used to train the neural network are freely
available under the Creative Commons CC-BY 4.0 license as MultiCellDS digital snapshots [18] and are available upon request. In addition, the raw H&E images used to generate cell segmentations are available from the website of Biomax.us (IDC, https://www.biomax.us/tissue-arrays/Breast/HBre-Duc140Sur-01).
Acknowledgements
We thank US Biomax, Inc. for giving permission to analyze H&E images from their website, and
Dr. Samuel Friedman for help with image processing and MultiCellDS XML validation. This
research was supported by a grant from the Breast Cancer Research Foundation (BCRF-16-103).
Chapter 2. Machine learning of tissue
“fingerprints” improves H&E-based
molecular classification of breast
cancer
Although deep learning (DL) has potential to teach us novel aspects of biology, the most
impressive use cases to date recapitulate patterns that experts already recognize. [28-33] While these approaches may reduce inter-observer variability and accelerate clinical workflows, my goal was to use DL to learn how morphology from H&E images can be used to predict known biomarkers, [34] prognosis, [35] and theragnosis—tasks which pathologists cannot currently perform by eye. These capabilities could improve our understanding of cancer biology. The biggest challenge is obtaining large, well-annotated training sets to support accurate learning.
While computer vision datasets often contain millions of annotated images, [28] clinical pathology case sets generally number in the hundreds. Moreover, noise in the clinical annotations dilutes the learning signal and increases the probability that the network will learn spurious features like stain color, clinical site, or other technical variations. [36-38] To overcome this limitation, together with Fei Sha, Darryl Shibata, Daniel Ruderman, David B. Agus, Preeyam Roy, and Itzel Ortega, I developed the concept of "tissue fingerprints." It is based on the hypothesis that molecular differences of
the tumor are often translated into subtle differences in morphologic phenotypes. This idea is akin
to the paradigm of precision medicine, where instead of grouping patients, individual patients are
treated based on their specific molecular and environmental parameters. Hence, instead of training
a network to distinguish between groups of samples, we first pre-configured the network to
recognize or “fingerprint” individual tumors. This task can leverage unannotated pathology image
datasets, which are widely available. By pretraining a network to first accurately fingerprint
tissues, we expect that far less annotated data will be necessary to then adapt it to a clinical task.
To implement tissue fingerprints, we trained a neural network to fingerprint pathologic tumor samples from a training set and subsequently tested it on a simple matching task using tumor images from new patients. Briefly, multiple images were halved, and the network attempted to learn
a vector of features (a “fingerprint”) that could match the pairs (Figure 1). An important aspect of
this work was using the matching task to learn stain- and site-invariant features of architecture.
Figure 1. Networks are first trained to learn tissue fingerprints, which are patterns of cells and tissue
visible on H&E images that can be used to distinguish between patients. Following this training
internship, which can be scaled to very large numbers of patients without clinical outcome
annotations, the fingerprints are repurposed to make clinically relevant predictions from small labeled
datasets.
We controlled for these sources of noise by testing whether the fingerprints could match tissues from the same patients that had been stained and scanned at different sites. Optimizing on
the task of matching, we performed experiments testing the impact of training set size and methods
of image normalization.
Once this training internship was accomplished, the fingerprints were extracted from the network
and used as features to classify between groups of tumors with a clinically relevant difference. In
particular, we chose estrogen receptor (ER) status in breast cancer, an important predictive and
prognostic molecular marker that is currently assessed in the clinic by immunohistochemistry
(IHC) staining. [39] We tested whether a fingerprint-based classifier could predict this molecular
information from tissue architecture as represented in a hematoxylin and eosin (H&E) image.
Results
Fingerprinting datasets
We trained the networks to learn fingerprints from breast cancer cores in tissue microarrays
(TMAs). The TMA format makes it easy to process one set of tissues in multiple ways and allowed
us to simulate the batch-effects that are commonly encountered at pathology labs. By staining/
scanning one section at USC and using another section stained by the TMA provider, we obtained
paired images with the same architectural features, but different staining colors. A downside to the
use of TMAs is that they contain tumor regions selected by a pathologist to be representative of the
tumor, and thus present a particular selection bias that may not accurately represent tumor tissue
generally. Our goal was to learn a fingerprint that summarized the architecture but ignored the
staining differences.
We used one TMA to train (BR20823) and another TMA to test (BR20819). Each TMA contains
approximately 208 tissue cores from 104 patients, with no patient overlap between arrays. We used
three serial sections of the training TMA. One section was stained/scanned by the TMA supplier
(US Biomax), the other two were stained at USC. We similarly collected two serial sections of the
test array, stained at USC and Biomax.
Table 1. Fingerprinting Datasets
Table 2. Fingerprinting Results
Stain normalization is necessary for fingerprinting
We performed four experiments, varying training set size and image normalization, to determine
how to best train a fingerprint network (Table 2). We hypothesized that training on large numbers
of cores would improve accuracy, and that image color normalization would greatly improve
training on small data but have a smaller effect on larger datasets. In experiment 1, the baseline,
we collected 207 tissue cores from slides 1 and 2 (serial sections of the training TMA that were stained at Biomax and USC, respectively), divided them in half, and then trained the network to recognize patients based on patterns in one of the halves. (We arbitrarily chose to train on the left halves.) Each core was assigned a categorical number from 1 to 207, and the network was trained to identify the core from a patch sampled from the image half (patch size: 224 × 224 px, 0.5 microns/pixel).
In experiment 2, we scaled the dataset over 20-fold: adding 13,000 additional training images
(from new patients). Again, we trained the network to predict the index.
In experiments 3 and 4, we used the same datasets as before, but included a color normalization
procedure based on neural style transfer. [40,41] In the first two experiments, we predicted that as the network was trained to recognize increasing numbers of images, it would automatically learn stain-invariant features. In the second two experiments, we used the style-transfer algorithm CycleGAN [41] to recolor images (Figure 2), making them appear as if they were prepared at a
different site. CycleGAN can exchange the texture between two spatially similar image sets, while
preserving overall structural information. Compelling examples include transforming photographs
into impressionist paintings and horses into zebras. Here, we use CycleGAN to transfer the H&E
staining coloration from a reference site to images from other sites. Then we trained the networks
to look at both images and predict the same features, cancelling out the effect of stain variation.
To compare the quality of the fingerprints learned in the four experiments, we performed “tissue
matching” on the test TMA sections. Using the thus-trained NN, we calculated fingerprints for left
halves of cores from one section (stained at Biomax, slide 4) and the right halves from the other
(stained at USC, slide 5), and matched each left fingerprint to the nearest right fingerprint in 512D
fingerprint space. Since there were 208 cores in the test set, we report a core-level accuracy (acc.
= number of cores matched correctly / 208). The null accuracy by chance is 0.4% (1/208 cores).
Figure 2. Converting horses to zebras (adapted from Zhu et al. [41], left) and normalizing histology images (right).

While fingerprints from all four experiments matched cores better than chance, the accuracy was highest in experiment 4, which used a large stain-normalized training set, with a matching accuracy of 63% (131 of 208 cores). Surprisingly, stain normalization seems to be necessary to get the
performance gains of larger training sets. Comparing the results of experiments 2 and 3 to the
baseline, increasing training set size in the absence of stain-normalization (experiment 2) provided
only a minuscule improvement in matching accuracy over the baseline; however, stain-
normalization nearly doubled the accuracy. It is important to note that in all four experiments we
used standard image augmentation during training. Immediately before the image was shown to
the fingerprint network, it was randomly adjusted for brightness and contrast and converted to
grayscale. Even with these procedures, which were intended to make networks invariant to color
differences between images, [33] doing an additional "style" normalization step before the
augmentation provided a significant improvement.
A large portion of the mistakes in experiment 4 were due to using a test set with 2 cores per patient.
Several of the misclassified cores were actually from the same patient. This is because some tumors
are morphologically homogeneous and have similar patterns across cores. Thus, we also calculated
a pooled accuracy, which uses the fingerprints from both left cores to match both right cores and
found that fingerprints could match patients with 93% accuracy (see methods for details).
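The matching evaluation itself is simple to express; the sketch below (an illustration, not the project code) matches each left-half fingerprint to its nearest right-half fingerprint in the 512-dimensional feature space and reports the core-level accuracy.

    # Sketch of core-level matching by nearest neighbor in fingerprint space.
    import numpy as np

    def matching_accuracy(left_fp, right_fp):
        """left_fp, right_fp: (n_cores, 512) arrays; row i comes from the same core on both slides."""
        left = left_fp / np.linalg.norm(left_fp, axis=1, keepdims=True)
        right = right_fp / np.linalg.norm(right_fp, axis=1, keepdims=True)
        dists = np.linalg.norm(left[:, None, :] - right[None, :, :], axis=2)   # pairwise distances
        best = dists.argmin(axis=1)                     # nearest right fingerprint for each left half
        return (best == np.arange(len(left_fp))).mean() # fraction of cores matched correctly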
Encouraged by the high patient-level accuracy of the matching, we started studying the features of
the final neural network layer. When the network is shown an image, this layer produces a 512-
dimensional vector of continuous numbers, which we call the “tissue fingerprint.” In the remainder
of the work, we apply the network to extract fingerprints and explore how they can be used to link
histologic patterns to clinical subgroups of breast cancer.
Fingerprint visualizations reveal style-invariant histologic patterns
As both a control and as a starting point for understanding the fingerprint features, we performed
low-dimensional visualizations. We took the left and right halves of cores from the test slides,
which had been stained at different sites, and calculated fingerprints for patches from the halves.
When these fingerprints are embedded in a tSNE plot (Figure 3a), colored by patient, we observed
that the left and right halves from the same patient are close in the embedding space, reflecting the
stain invariant properties of fingerprints. Moreover, visualizing the same embedding as a map of
image patches, instead of colored points (Figure 3c), reveals that different regions of the map contain different architectures, such as nucleoli, fat, micropapillary growth, and mucin patterns (Figure 3d). Thus, in the embedding space, fingerprints are clustered by histologic patterns even if
the patches they come from exhibit markedly different colorations.
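The embedding in Figure 3a can be reproduced schematically as follows (illustrative only; the perplexity and plotting details are assumptions, not the settings used for the figure).

    # Sketch of a t-SNE embedding of patch fingerprints, colored by patient.
    import numpy as np
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    def plot_fingerprint_tsne(fingerprints, patient_ids):
        """fingerprints: (n_patches, 512) array; patient_ids: (n_patches,) integer labels."""
        xy = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(fingerprints)
        plt.scatter(xy[:, 0], xy[:, 1], c=patient_ids, s=4, cmap="tab20")
        plt.axis("off")
        plt.show()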
Figure 3. a: Representative tSNE visualization of fingerprints from the test set. In this visualization, left halves are from slide 5 and right halves from slide 4. b: Visualization of a representative pair. Left half presented on top, right
half on the bottom, middle shows a heat map of fingerprint distance (distance from fingerprints from the bottom
image to the average fingerprint of the top image). c: Visualization of tissue patches in the fingerprint embedding
(a). d: Left, exploded displays of the original patches in the embedding show similar histologic features (nucleoli,
micro-papillae, fat, mucin).
Fingerprints can be used to visualize similar regions between tissues
In Figure 3b, we focus on a specific Left/Right pair from the test set. We calculated the average
fingerprint of the right half and plotted a heat map showing the similarity (defined as 1 - normalized
Euclidean distance) from each patch in the left half to the average fingerprint of the right half (red
is similar, blue is dissimilar). The overlay (bottom) shows that similarity between the right and left
halves is highest in a discrete region that appears to contain epithelial cells. This observation is
consistent with the abundance of epithelial cells in the right image, suggesting that fingerprint
similarity may have utility in histologic search (for similar-appearing tissues).
Fingerprints combine epithelium and stroma
To directly visualize image components that comprise a fingerprint, we generated heat maps of
tissue cores, highlighting regions that most accurately predict patient identity (Figure 4). We
show the original H&E images alongside heat maps of patient prediction accuracy using
corresponding image regions. Red areas identify the patient accurately, and blue ones do so poorly.
Based on the presence of both red and blue areas, some core regions are more predictive of patient
identity than others, meaning their patterns are specific to that patient’s tumor.
Fingerprints correlate to molecular status of breast cancer
Because each tumor has unique genetic and microenvironmental interactions, we hypothesized
that networks trained to recognize patients would implicitly learn features that reflect the
underlying biological processes. For breast cancer, ER/PR/Her2 status is among the most
important indicators of prognosis. Hormone-receptor (ER/PR) positive tumors tend to be less
aggressive and occur in older patients. Additionally, ER-positive tumors can be treated effectively
with drugs that target the estrogen axis. Similarly, Her2-positive tumors can be treated with drugs
that target the Her2 axis. For these reasons, the NCCN task force mandate
42
that ER, PR and Her2
status be measured for every new case of breast cancer.
Figure 4. a: heat maps of distinctive regions. Red indicates areas most predictive of patient identity and blue
regions that are less predictive. b: an exploded view of two cores from (a).
While ER, PR, and Her2 status is routinely assessed by immunohistochemistry (IHC) staining for
the receptors (Her2 can also be measured by FISH staining), we explored whether H&E morphology, quantified via fingerprints, could serve as a surrogate marker of these proteins. Initially, we queried this hypothesis on a set of breast cancer images curated by The Cancer Genome Atlas (TCGA). [43] These samples consist of H&E whole slide images (WSIs) from 939 patients at 40 sites. First, we scaled the images to 20X resolution, randomly extracted 50 image patches per image, and fed them through the fingerprint network to calculate fingerprints (Figure 5, step 1). Then, we trained a second neural network to compress the fingerprints (512D vectors) into a patch-wise prediction of a particular receptor (step 2). For simplicity, the figure indicates the process for predicting ER; however, the same procedure was used for PR and Her2. Finally, we averaged the predictions across the image to estimate the patient's receptor status (step 3). We trained on the TCGA dataset with 5-fold cross-validation. The patients were split into 5 groups: 3 groups were used to train the second network; one group was used to monitor training progress and decide when to stop training; the last group was tested. The plot in Figure 5b (left) shows the ROC curve for a representative test set from the TCGA data, for ER classification. The average AUROC of the test sets was 0.88 (n=183, per test set). This is the highest ER classification score we have observed, including our previous work using nuclear morphometric features (0.72 AUROC) [44] and another recent work that predicts molecular characteristics of breast cancer from H&E images [45]. To validate these findings, we obtained WSIs of 2531 breast cancers from the Australian Breast Cancer Tissue Bank [46] (ABCTB), and tested whether the TCGA-trained classifier could predict ER status in this group. We measured an AUROC of 0.89 on this dataset (n=2531). Applying the same pipeline to PR and Her2, we found that fingerprints could predict PR in the TCGA dataset with an average test set AUROC of 0.78 (n=180, per test set), and 0.81 (ABCTB, n=2514). The results for Her2 were AUROC=0.71 (TCGA, n=124) and AUROC=0.79 (ABCTB, n=2487).
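The three-step whole-slide procedure can be paraphrased in code as below; this is a schematic rather than the published implementation, and the architecture of the small classifier acting on the fingerprints is an assumption.

    # Schematic of whole-slide receptor prediction from patch fingerprints.
    import torch
    import torch.nn as nn

    er_head = nn.Sequential(            # maps a 512-D fingerprint to a patch-level receptor score
        nn.Linear(512, 128), nn.ReLU(),
        nn.Linear(128, 1),
    )

    def slide_score(fingerprint_net, patches, head=er_head):
        """patches: (50, 3, 224, 224) tensor of H&E patches sampled from one whole slide image."""
        with torch.no_grad():
            fp = fingerprint_net(patches)               # step 1: (50, 512) fingerprints
            logits = head(fp).squeeze(1)                # step 2: per-patch receptor predictions
            return torch.sigmoid(logits).mean().item()  # step 3: average -> slide-level score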
Figure 5. a: Illustration of whole slide clinical estrogen receptor (ER) classification. An analogous
procedure was used for progesterone receptor and Her2 classification. Fingerprints were extracted from
50 random image patches, and a second ER-classifier, acting on the fingerprints, made local predictions,
which were averaged to produce a continuous whole-slide-level ER-score. b: Receiver operating
characteristic curves (ROC) for clinical ER, PR, and Her2 prediction. The TCGA ROC curve reflects a
test set from 5-fold cross validation, and the AUROC corresponds to the average AUROC of all 5 TCGA
test sets. All samples in the ABCTB dataset are test-samples, and were never seen during training.
Sample sizes vary depending on the availability of clinical annotations.
Discussion
Deep learning is transforming computer vision and cancer pathology. As training sets scale, there are
substantial increases in accuracy on diagnosis, prognosis, and theragnosis [47]. However, the biggest gains
are likely to come when we learn to leverage a greater spectrum of available pathology images,
including the vast majority of images which are mostly or completely unlabeled. Here, we illustrate a
novel first step, using tissue matching to discern features that are distinctive for a patient but differ
between individuals. While tissue matching is not a skill formally taught in pathology training, it allows
a neural network to discover key discriminating histologic features from a large set of unannotated
images. Interestingly, these discriminatory features, or fingerprints, tend to reside at the interfaces
between epithelial and stromal cells, and may reflect tissue specific genetic and microenvironmental
parameters. While this study used serial sections of a TMA to design a rigorous implementation of
core-matching, the resulting network trained in experiment 4 demonstrates that a training paradigm
that incorporates style normalization may benefit significantly from histology images of any type, not
just matched TMA images. Building on these results, we are currently exploring how to train a
fingerprint network on style-normalized whole-slide images with promising results.
Training a network on the task of tissue identification also improves the interpretability of DNNs and
provides insights about the elusive “black box” of deep learning. The ground truth (tissue identity) is
indisputable, and visualizations reveal cohesive, biologically interpretable patterns that leverage
parameters which are likely to reflect unique underlying genetic and microenvironmental interactions.
Hence, we anticipate that fingerprints, which identify features that discriminate between individuals,
will be useful when applied to tasks that seek to discriminate between groups of biologically different
tissues.
Our experiments demonstrate the significant predictive power of such fingerprints to predict the
molecular status of a tumor. Taking the fingerprint network, we extracted fingerprint features from
whole slide images and used them to predict ER, PR, and Her2 status from two independent breast
cancer cohorts. We initially trained and validated our algorithms on images from The Cancer Genome
Atlas (TCGA), with cross-validation. Then, we performed independent validation on samples from the Australian Breast Cancer Tissue Bank (ABCTB, n = 2531), achieving the following areas under the curve: 0.89 (ER), 0.81 (PR), and 0.79 (Her2). These metrics are higher than all previously published attempts to predict molecular information from H&E images. The improved performance is secondary to the implementation of tissue fingerprinting. The performance we found is similar to previous studies assessing the correlation between IHC and microarray assessments of ER and PR, which found good concordance between frozen and IHC for ER (93%) and lower for PR (83%). [42,48] We believe that using
tissue fingerprints will ultimately enable direct treatment response prediction in breast and other
cancers, to an accuracy above that provided by current molecular approaches.
Methods
Datasets
The tissue microarrays (TMAs) used in this study were obtained from the supplier US Biomax, Inc.
Array BR20823, containing 207 tissue cores from 104 patients, was used to train the fingerprint
network. Array BR20819, containing 208 cores from a separate group of 104 patients, was used
to test the trained model. We obtained one or two sections from each array (for BR20819 and BR20823, respectively), which were stained with hematoxylin and eosin under standard laboratory
protocols, before scanning was performed at 40x resolution (0.249 microns per pixel) on a Carl
Zeiss slide scanner. US Biomax, Inc. kindly provided us with images of serial sections of these
microarrays that had been stained and scanned by their protocols on a Leica Aperio slide scanner
at lower resolution (0.49 microns per pixel).
Additional breast cancer tissue images were used to increase the size of the training set for
experiments 2 and 4. These images are of breast cancer tissue from a variety of sources having
distinct patients from BR20819 and BR20823.
The whole slide images used in this study were obtained from The Cancer Genome Atlas [43] (TCGA) and the Australian Breast Cancer Tissue Bank [46]. We included 939 breast carcinoma cases from TCGA and 2531 breast cancer cases from the ABCTB. Clinical characteristics are summarized in Table 3.
Table 3. Clinical characteristics of breast cancer patients

Breast Cancer Cohort        TCGA    ABCTB
Number of Patients          939     2531
ER-positive                 723     2007
ER-negative                 216     524
PR-positive                 623     1798
PR-negative                 313     716
Her2-positive               151     371
Her2-negative               508     2116
Age (years)                 min 26, mean 58.2, max 90
Stage                       I: 156, II: 539, III: 210, IV: 16
Histologic Grade            Grade 1: 364, Grade 2: 904, Grade 3: 1004
Tumor Size                  T1: 234, T2: 546, T3: 123, T4: 34
Node Status                 N0: 443, N1: 315, N2: 100, N3: 65
node negative               443     1143
> 1 node positive           480     1388
Metastasis Status           M0: 785, M1: 18
Neural Network Training (experiments 1 and 2)
The fingerprint network was trained on image patches randomly extracted from the BR20823
images. Each circular tissue core was isolated, scaled to 0.5 micron/pixel resolution (bilinear
resampling), and cropped to a 1600x1600 pixel square. These squares were assigned a numeric
index from 1 to 207, reflecting their position on the array. Each square was then divided into left and right halves. During training, a small image patch (224 x 224 px) was sampled from the left half, augmented on the fly through rotation, color spectrum augmentation [28], color normalization, and finally, conversion to grayscale. [33] It was then passed to a neural network with the objective of
minimizing cross entropy loss between patient identity and a predicted index. During training, we
monitored progress by measuring how well the network could predict the core index from patches
from the right halves of the tissue images, which it hadn’t seen. When this accuracy plateaued, we
stopped training and tested the quality of the features on the tissue matching game. Experiments 3
and 4 used the more complex loss function described below.
In the four experiments described, we used a standard implementation of the Resnet34 architecture [49] provided by the PyTorch library [50]. The network was randomly initialized and trained from scratch. In addition, we trained larger networks (including Resnet50 and Resnet100; results not shown) under the conditions of experiment 4, but found the same performance as Resnet34. The benefit of larger networks may depend on the size of the training dataset. Here we demonstrate the concept of fingerprinting with a relatively small dataset; training on larger datasets may reveal added benefits of deeper networks.
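As a concrete illustration, a fingerprint network along these lines could be set up in PyTorch as sketched below. The optimizer, learning rate, and the handling of grayscale patches as three replicated channels are illustrative assumptions, not the exact training configuration used in the experiments.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CORES = 207  # one class per core on array BR20823

# Randomly initialized Resnet34, trained from scratch to predict the core index.
model = models.resnet34(num_classes=NUM_CORES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer and learning rate are assumptions

def training_step(patches, core_indices):
    """patches: (B, 3, 224, 224) grayscale patches replicated to three channels (assumed);
    core_indices: (B,) integer labels in [0, NUM_CORES)."""
    logits = model(patches)
    loss = criterion(logits, core_indices)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```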
Promoting style invariance by style transfer (experiments 3 and 4):
Neural style transfer was performed offline using CycleGAN [41], which aims to alter the style of an image while preserving fine details. We trained the network to transfer styles between images of BR20823 that were stained at US Biomax or at our site (style transfer between slides 1 and 2, respectively, as shown in Figure 2b). Thus, our original set of 13,415 cores was augmented to three times its original size via neural style transfer (each core has an original image, a virtual USC stain, and a virtual Biomax stain). Following style transfer, we adapted the loss function to promote style invariance. The new loss function has two components: a cross entropy loss (abbreviated 'CE') to predict the identity of each patch, which was the loss term used in experiments 1 and 2, plus an additional term that minimizes the distance between fingerprints of different styles. The additional term is the squared error (abbreviated 'SE') between the L2-normalized fingerprints, where a fingerprint is the 512-dimensional feature vector from the last layer of the Resnet.
Loss is defined for a pair of images from tissue core $i$ ($1 \le i \le 207$): $\mathrm{Image}_{i1}$ and $\mathrm{Image}_{i2}$, where $\mathrm{Image}_{i2}$ is a re-styled version of $\mathrm{Image}_{i1}$ (Figure 2c). Thus, both images contain the same "content" but are colored/styled differently. The loss is the sum of two cross entropy losses and a fingerprint distance loss, weighted by the constant $\gamma$; in our experiments, we used $\gamma = 0.5$:

$$\mathrm{loss}(\mathrm{Image}_{i1},\mathrm{Image}_{i2},y=i) = \mathrm{CE}(\mathrm{Image}_{i1},y) + \mathrm{CE}(\mathrm{Image}_{i2},y) + \gamma\,\mathrm{FPdist}(\mathrm{Image}_{i1},\mathrm{Image}_{i2})$$

Cross entropy is defined using the classification vector produced by the network for each image. This vector contains 207 elements; $\mathrm{Image}_{ix}$ from core $i$ produces the classification vector $c_{ix}$:

$$\mathrm{CE}(\mathrm{Image}_{ix},y=i) = -\ln\!\left(\frac{e^{c_{ix}[i]}}{\sum_{j} e^{c_{ix}[j]}}\right)$$

The fingerprint distance loss is defined using the fingerprints produced by the neural network for each image. If the fingerprints for $\mathrm{Image}_{i1}$ and $\mathrm{Image}_{i2}$ are called $FP_{i1}$ and $FP_{i2}$, the fingerprint distance is the squared error between the L2-normalized fingerprints, where $\lVert\cdot\rVert_2$ denotes the L2 norm:

$$\mathrm{FPdist}(\mathrm{Image}_{i1},\mathrm{Image}_{i2}) = \left\lVert \frac{FP_{i1}}{\lVert FP_{i1}\rVert_2} - \frac{FP_{i2}}{\lVert FP_{i2}\rVert_2} \right\rVert_2^2$$
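For concreteness, a minimal PyTorch sketch of this combined loss is given below. It assumes the fingerprint is the pooled 512-D feature vector taken before the classification layer; the variable names are illustrative, not the original code.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.5  # weight of the fingerprint distance term, as in the text

def style_invariant_loss(logits1, logits2, fp1, fp2, core_index):
    """logits1/logits2: (B, 207) classification vectors for the original and
    re-styled patch; fp1/fp2: (B, 512) fingerprints; core_index: (B,) labels."""
    ce = F.cross_entropy(logits1, core_index) + F.cross_entropy(logits2, core_index)
    # squared error between L2-normalized fingerprints of the two styles
    fp1n = F.normalize(fp1, p=2, dim=1)
    fp2n = F.normalize(fp2, p=2, dim=1)
    fp_dist = ((fp1n - fp2n) ** 2).sum(dim=1).mean()
    return ce + GAMMA * fp_dist
```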
Creating heat maps of recognized regions
We generated heat maps of tissue cores showing the parts of an image most predictive of tumor
identity (Figure 3e). The region colors were determined as follows: each core image was divided
into overlapping square patches (size 224x224 pixels, with 80% linear overlap). Each patch was
passed to the neural network, and a probability vector was calculated predicting the identity of
each core via Softmax. Since there are multiple cores per patient, we aggregated the 207
probabilities into 104 probabilities (one per patient) by summing the probabilities of cores that
came from the same patient. Each heat map shows the probability of predicting the correct patient
and is shaded from 0 (blue) to 1 (red).
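A minimal sketch of this core-to-patient aggregation step is shown below; the `core_to_patient` mapping is a hypothetical stand-in for the array annotation that links each of the 207 cores to one of the 104 patients.

```python
import numpy as np

def softmax(logits, axis=1):
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def correct_patient_probability(patch_logits, core_to_patient, true_patient, n_patients=104):
    """patch_logits: (n_patches, 207) network outputs for one core image;
    core_to_patient: (207,) array mapping each core index to its patient (assumed annotation)."""
    core_probs = softmax(patch_logits)                           # per-patch probabilities over 207 cores
    patient_probs = np.zeros((patch_logits.shape[0], n_patients))
    for core_idx, patient_idx in enumerate(core_to_patient):
        patient_probs[:, patient_idx] += core_probs[:, core_idx]  # sum cores from the same patient
    return patient_probs[:, true_patient]                        # shading value per patch, in [0, 1]
```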
Whole Slide Image Analysis:
Each whole slide image was divided into non-overlapping squares of (112 x 112 microns), which
were scaled to a final patch size of 224 x 224 pixels. The pre-trained fingerprint network was used
to extract a 512-D fingerprint per patch. 500 fingerprints per patient were randomly selected for
analysis. These fingerprints were z-score normalized (scaled to zero-mean, unit variance, within
the patient). Following fingerprint extraction and normalization, we trained a small neural network
to predict ER status from fingerprints using 5-fold cross validation. The entire set of TCGA
patients was split into five groups. In each cross-validation fold, three groups were used to train;
one group was used to monitor overfitting and perform early stopping; the last group was used to
test the network’s final performance. We split groups on the patient level so that no patient would
be included in more than one group. The network predicting ER from fingerprints was a perceptron classifier with a single hidden layer of eight neurons, using exponential linear units (ELU) as the non-linearity. To predict patient-level ER status, we predicted ER status for each fingerprint and
averaged them to calculate a patient-level ER-score. The reported ROC curves compare the
patient-level ER-score to clinical ER status. We also used the same workflow to process image
features obtained from a Resnet34 network pretrained on the ImageNet dataset. These parameters
were obtained from the torchvision library.
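For illustration, the per-patient workflow described above (within-patient z-scoring followed by a small ELU classifier and score averaging) might be sketched as follows; the class and variable names are assumptions rather than the original implementation.

```python
import torch
import torch.nn as nn

class ERClassifier(nn.Module):
    """Small classifier mapping a 512-D fingerprint to an ER score."""
    def __init__(self, in_dim=512, hidden=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ELU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return torch.sigmoid(self.net(x)).squeeze(-1)

def patient_er_score(fingerprints, model):
    """fingerprints: (500, 512) tensor of patch fingerprints for one patient."""
    # z-score normalize within the patient (zero mean, unit variance per feature)
    z = (fingerprints - fingerprints.mean(0)) / (fingerprints.std(0) + 1e-8)
    with torch.no_grad():
        scores = model(z)              # one ER prediction per fingerprint
    return scores.mean().item()        # average to a patient-level ER score
```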
Additional validation of whole-slide ER classifier
We validated our classifier on an independent test set from the Australian Breast Cancer Tissue
Bank (ABCTB). These whole slides were processed like the slides from the TCGA dataset: 500
patches (112 x 112 microns, resized to 224 x 224 pixels) were extracted per patient, and
fingerprints were calculated. Then, we used the perceptron classifiers that were fitted to the TCGA dataset in an ensemble manner. Training on five folds of TCGA data produced five ER classifiers. These classifiers were applied to the ABCTB fingerprints to produce five predictions
for each fingerprint, which were averaged to produce a fingerprint-level ER prediction and
averaged again to produce a patient-level ER-score. We calculated an ROC curve comparing the
neural network predictions to clinical ER status from the ABCTB.
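A minimal sketch of this ensemble scoring is given below, assuming the five fold-specific classifiers from the TCGA cross-validation are available as a list of models and the fingerprints are already z-scored within each patient (as in the TCGA workflow).

```python
import torch

def ensemble_patient_er_score(fingerprints, fold_models):
    """fingerprints: (500, 512) z-scored patch fingerprints for one ABCTB patient;
    fold_models: the five ER classifiers trained on the TCGA folds."""
    with torch.no_grad():
        # average the five classifiers' outputs per fingerprint...
        per_fingerprint = torch.stack([m(fingerprints) for m in fold_models]).mean(dim=0)
    # ...then average over fingerprints to obtain the patient-level ER score
    return per_fingerprint.mean().item()
```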
Acknowledgements
I want to thank US Biomax, Inc. and the Australian Breast Cancer Tissue Bank for providing
H&E images and especially acknowledge the volunteer support of Mythily Sachchithananthan and
Usman Jawaid at the ABCTB. I thank Oracle Corporation for providing cloud computing
resources. This research was supported in part by a grant from the Breast Cancer Research
Foundation.
Chapter 3. Development of a high
throughput, reproducible immune cell
quantifier in breast cancer specimens
Our appreciation of the immune system in breast cancer is rapidly evolving. Due to low numbers of nonsynonymous mutations, neoepitopes, and modest immune infiltration relative to melanoma or lung cancer, breast cancer was long thought to be poorly immunogenic [51]. However, a landmark study in 2010 shifted this perspective, showing an independent association between the extent of tumor infiltrating lymphocytes (TILs) and pathologic complete response [52,53]. Subsequent work has shown that TILs [51,54,55] and lymphocyte hotspots are prognostic for survival [56]. Moreover, recent clinical trials involving the use of PD-1 and PD-L1 inhibitors suggest that immunotherapy may be effective for a subset of breast cancer patients. They include those with triple negative disease [57,58], and some patients with estrogen receptor (ER) positive and Her2 positive breast cancer [59]. Together, these findings highlight the potential for a sustained immune response in breast cancer. Thus, to extend the reach of immune therapies, it is a top priority to understand cellular interactions, signaling, and spatial relationships that govern immune infiltration, priming, and suppression in the breast microenvironment [51,53].
Pathology image analysis is one way to study these spatial relationships [56,60,61]. Within this field, there is an urgent need for automated technologies to quantify and qualify the immune infiltrate by cell type [55,62,63]. Although pathologists have identified patterns of immune infiltration, such as immune hotspots, tertiary lymphoid structures, and intraepithelial lymphocytes, these assessments are qualitative and of limited utility. Variables like "extent of TILs" are still estimated by eye, intraepithelial lymphocyte estimates are not reproducible, and the significance of TILs in different parts of whole slide images (e.g. borders vs. center) is not known due to the difficulty of collecting these metrics [55]. Moreover, little is known about the spatial correlations between characteristic patterns of microenvironmental change (e.g. angiogenesis [64], desmoplasia [65], remodeling [66]) and the breast immune infiltrate. Understanding how these patterns collocate with nearby immune cells will shed light on the biology of immune hotspot formation and the heterogeneity of the infiltrate within and across patients.
Motivated by the urgent need for tools that can accurately measure the distribution of immune
cells on hematoxylin and eosin (H&E) breast cancer pathology images, we designed a study using
deep learning to quantify and qualify the immune infiltrate across a tumor. Whereas conventional methodologies for studying the immune infiltrate are manual and qualitative, computational machine learning approaches have the potential to accurately and reproducibly quantify cell phenotypes. Our
central hypothesis was that machine learning on H&E images can reproducibly quantify cells and
features of the immune microenvironment, and that their spatial distribution can be correlated to
prognostic subtypes of breast cancer.
The primary challenge in training networks to recognize cell types is the scarcity of annotated
datasets. In this work we explored two sources of training data: immunostaining for an immune
cell marker (CD45 for B cells, T cells and macrophages) and manually annotated data, curated by
an expert. While the latter is the standard method for obtaining annotations to train networks, the
former technique trains the network to make predictions based on a molecular stain, which may be more accurate than a person. However, this procedure is novel and has not been explored in
the literature. Thus, we sought to train networks to detect immune cells by both methods and
explore how these features correlate to patient outcomes.
The study can be summarized in three steps:
1. Prepare training datasets
2. Train the neural network
3. Correlate immune cell counts to patient survival
Prepare training datasets
We used three tissue microarray (TMA) datasets in this study (Table 1). The first was prepared locally, and the other two were obtained from external sources [67,68]. The local dataset consists of a single H&E-stained tissue microarray of 112 breast cancer specimens. This slide was scanned using an Olympus VS120 slide scanner (20x resolution, 0.5 microns per pixel). The slide was then sent to Cedars-Sinai for destaining and restaining (IHC for CD45), before being scanned and registered to the first scan. An example of the stained/destained/restained cores is shown in Figure 1 (A, B). Finally, a threshold was applied to the IHC image to produce a binary mask for the IHC signal (Figure 1D).
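As an illustration, the mask-generation step might look like the following sketch, which assumes Otsu thresholding on a grayscale version of the registered IHC image; the actual thresholding method used in the study is not specified, so this is only one reasonable choice.

```python
import numpy as np
from skimage import io, color, filters

def ihc_to_mask(ihc_path):
    """Convert a registered CD45 IHC image into a binary mask of positive signal."""
    ihc = io.imread(ihc_path)
    gray = color.rgb2gray(ihc)           # darker pixels = stronger chromogen (assumption)
    thresh = filters.threshold_otsu(gray)
    mask = gray < thresh                  # stained regions are darker than background
    return mask.astype(np.uint8)
```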
The second dataset consists of images from a public dataset of pathologist-annotated lymphocytes.
The centers of lymphocytes were marked by hand in one hundred image patches. The third dataset
consists of a set of 352 images from a cohort of patients at Yale University. These images are
associated with long term follow-up data.
Train the neural network
We used a standard implementation of the U-Net neural network architecture [69] to do image segmentation (Figure 2). This architecture combines features at low and high resolution and has been successfully used in image segmentation competitions. Networks were shown H&E images and trained to predict the annotations, which were either the IHC mask or the pathologist point annotations depending on the dataset. The objective function was the mean squared error between the predictions and the ground truth. All experiments were conducted with cross-validation and training was stopped when the loss stopped converging.

Table 1. Dataset Descriptions

                 CD45 Stained         Manually annotated    Survival
Use              CD45 Net training    Manual Net training   Prediction
Type             TMA                  H&E images            TMA
Ground truth     CD45                 Pathologist           Survival / censor
Stain            H&E, CD45            H&E                   H&E
Image count      112                  100                   352
Image width      1000 µm              100 µm                1000 µm
Magnification    20X                  20X                   20X
Source           Biomax.us            Janowczyk 2016        Yale (Dr. D. Rimm)

Figure 1. Images used to train the CD45 network
Figure 2. U-net architecture used to train CD45 Net and Manual Net
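A minimal sketch of this training objective is shown below; `unet` stands for any U-Net implementation with a single-channel output, and the data loader and optimizer are assumptions for illustration.

```python
import torch
import torch.nn as nn

# `unet` is assumed to be a U-Net returning a (B, 1, H, W) map for a (B, 3, H, W) H&E input.
def train_epoch(unet, loader, optimizer):
    criterion = nn.MSELoss()            # mean squared error between prediction and annotation map
    unet.train()
    total = 0.0
    for images, targets in loader:      # targets: IHC mask or pathologist point map, (B, 1, H, W)
        preds = unet(images)
        loss = criterion(preds, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / len(loader)
```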
Although the networks were trained on different information (one dataset was manually curated
for visual lymphocytes but the other was stained for CD45 which detects B cells, T cells, and
macrophages), we attempted to compare the predictions from the two networks. Using a small
portion (10%) of the pathologist-annotated data that was held out from training, we compared the
predictions of the two networks to the pathologist ground truth. As expected, the network trained
on the pathologist annotations made predictions that were concordant (Figure 3, left, P < 1E-5,
Pearson correlation). On the other hand, the predictions from the CD45 network were not as
correlated (Figure 3, right, P < 1E-4, Pearson correlation). However, a visual comparison of the
areas predicted to be immune infiltrate by the CD45 network reveals a significant amount of
overlap between the two predictions and between regions that appear to be lymphocytes by eye
(Figure 4). Thus, we proceeded to test whether the features learned by these classifiers can be
correlated to clinical outcomes.
Figure 3. Correlation of networks with pathologist ground truth
Figure 4. Virtual CD45 stain recognizes cells that look like TILs (hold-out set)
Correlating immune cell counts to patient survival
In this analysis, we defined a single feature, the ratio of TIL area to total nuclear area (TIL score), and evaluated how it predicts survival before and after stratifying by ER status, a known marker of prognosis (Figures 5, 6). We found that while a high TIL score was prognostic for the overall cohort using both networks, the high TIL score predicted by the CD45 network was only useful in ER-negative patients. In contrast, the TIL score predicted by the manually trained network was prognostic regardless of ER status.
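A sketch of how such a TIL score and survival comparison could be computed is shown below, using the lifelines package for Kaplan-Meier analysis. The median split and the column names are assumptions, not the study's exact analysis choices.

```python
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

def til_score(til_mask, nuclear_mask):
    """Ratio of predicted TIL area to total nuclear area for one core."""
    return til_mask.sum() / max(nuclear_mask.sum(), 1)

def compare_survival(df):
    """df columns (assumed): 'til_score', 'time', 'event' (1 = death observed)."""
    high = df['til_score'] > df['til_score'].median()   # median split into high/low TIL groups
    km = KaplanMeierFitter()
    for label, group in [('high TIL', df[high]), ('low TIL', df[~high])]:
        km.fit(group['time'], event_observed=group['event'], label=label)
        km.plot_survival_function()
    result = logrank_test(df[high]['time'], df[~high]['time'],
                          event_observed_A=df[high]['event'],
                          event_observed_B=df[~high]['event'])
    return result.p_value
```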
Figure 5.
Figure 6.
Discussion
Through the development of the CD45-based network, we demonstrate a simple approach to train
classifiers to identify TILs from H&E images. The stain-based approach may be important in
situations where an expert does not know the visual feature, when there is ambiguity, or when a
large quantity of training data is required.
This approach can likely be scaled to additional features. I have adapted the protocol to other
markers, including Vimentin (stroma), Pan-Cytokeratin (epithelium), and others (CD138 plasma
cells, PHH3 mitotic figures) using multiplexed immunofluorescence staining followed by H&E
staining. One of the biggest challenges is making sure the images are properly curated: well-
registered, color-normalized, properly thresholded. All of these experiments were done on TMA
sections containing tissues that were fixed independently. Thus, autofluorescence background as
well as staining intensity varies from core to core. Training directly on the stained image without
a quality-control protocol is challenging. A second challenge is making sure that a classifier trained
on one set of TMA images will perform well on images from a different TMA, or a different
dataset altogether. Transfer learning from one set of images to another works in some cases but not others, so performing a quality control step on the output of the neural network is important. Looking
forward, a good approach to mitigate some of the problems with these variations is to train and
test on multiple TMAs stained and scanned at different sites. While this is a giant undertaking, it
may be possible to get a lot of features out of just a few markers.
The motivation for taking a segmentation-based approach that aims to identify cellular components
before correlating them with clinical findings is that it’s easy to interpret the results. Unlike the
complex analysis used to interpret the learning of CellNet or the heatmaps produced by tissue
fingerprinting, the output of a segmentation network is simple: the heatmap corresponds to a
protein. The ultimate goal of this research direction is to use machine learning to learn complex
features and then make those insights available to us. Between the three approaches, segmentation
provides the most explainable output, but does so without learning a relationship between different
cells or areas. The next step, described in Chapter 4, is to train networks to be explainable and
interpretable simultaneously.
Chapter 4. A methodological synthesis
In the three previous chapters, we’ve seen that deep learning is a powerful tool for learning spatial
relationships (Chapter 1), identifying cell types (Chapter 3) and correcting noise like staining
variation that’s a significant confounding variable in pathology image datasets (Chapter 2). We’ve
also seen how networks can learn features that allow them to perform challenging tasks. Tissue fingerprinting, for instance, would be challenging for a human with limited memory and attention span, but works remarkably well on a computer. The focus of this section is to discuss how these
approaches can be combined and scaled to create systems that learn powerful visual features that
are also human-interpretable.
Explainable fingerprinting
A major criticism of end-to-end deep learning (where a system is trained to predict an outcome
directly from raw pixels) is that the network is treated like a black box. While this makes little
difference for the majority of deep learning applications, it’s a major challenge in pathology, where
we aim to learn novel aspects of biology. One approach to improve explainability is to use the
powerful “black box” networks to perform segmentations, which can be human-validated, before
performing predictive modeling using simple, transparent, statistical tools. For example, in
Chapter 3, we used deep networks to learn the shapes of immune cells, which we validated by eye.
Then we took these inputs and calculated a hand-crafted feature (ratio of immune cells to total
nuclear area), which could be tested for clinical significance. This approach leverages the power
of deep learning to ask questions that would be too tedious to answer by hand.
On the other end of the spectrum is the tissue fingerprinting approach. It leverages the full power
of deep networks to learn a complete set of features that describe tissue. While the segmentation
method is biased by the features we believe important, fingerprinting is almost completely
unbiased, and offers the possibility for learning features that are orthogonal to our concepts. If we
can learn to unite these approaches, we might be able to construct a learning pipeline that learns
complex cell- and tissue- level features in an unbiased fashion while also being interpretable. In
Chapter 1, I described CellNet, which uses the output from segmentation as input to a deep
network, in essence combining the approaches of Chapters 2 and 3. However, the performance of
CellNet was fairly poor compared to the other results in this work, likely because the input to the
neural network was sparse. It contained just six nuclear shape features per cell. Including additional
parameters for color, texture, cytoplasm, extracellular matrix, and other components is one way to
scale the idea and may lead to better accuracy. However, a more promising approach may be to use networks to learn segmentation and fingerprinting (or another complex task) simultaneously. I also propose to define segmentation more loosely. Instead of segmenting individual cells, I think networks have the potential to learn how to segment regions that are predictive or prognostic, but this requires us to change the scale at which we learn: from patches to the whole-slide level.
In practice, due to limits of computer memory, all deep learning experiments are done on image
patches extracted from either tissue microarrays or whole slide images. However, most clinical
annotations, such as biomarker status or survival, apply to the whole image, not to individual patches. The simple approach that most research takes is to treat every patch as if it
contains the signal. In Chapters 1 and 2, I did this too. I took patches from images, which were
labeled for estrogen receptor on the “patient-level” and trained a classifier on individual patches,
as though the patches were positive or negative based on the patient-level annotation. However,
this could introduce noise; while a network may learn the true signal, it may also learn nonsense
features in patches that do not contain the true signal. Some evidence for this comes from our
observation that style normalization was essential for learning fingerprints in Chapter 2. If
networks are trained directly on pixels, there’s no guarantee that they’ll learn the correct features
over convenient features. Calling all patches positive or negative, when in fact they are not, may
train networks to learn incorrect features. For this reason, I want to move from patch-level to image-level learning, where networks are trained directly on entire whole slides.
To get over the problems of limited computer memory, it is necessary to take the whole slide image
and reduce its memory footprint to fit on a single computing instance. This can be done through
numerous means (segmentation, sparse representations, autoencoding such as the input for CellNet), all of which impart some type of information loss. For example, segmentation and sparse representations compress the image based on the features we choose to segment, and autoencoding loses fine cellular details in exchange for information about the large-scale structure of the tissue. The alternative I propose is to use tissue fingerprints as a compressive “front-
end” to distill the information in a whole-slide into a small matrix that is amenable for whole-slide
level machine learning. Unlike the other compressive options, fingerprints can represent subtle
biologic differences that may be difficult to describe manually. Additionally, fingerprints are pre-
trained to be staining invariant, meaning that the compressed whole-slide may have a higher signal
to noise ratio than the original image.
Towards the goal of whole-slide learning, I propose to retrain the fingerprint network on a larger collection of image patches using a smaller neural network. While the bias of the computer vision community is to train ever larger networks, I draw inspiration from several recently published works which illustrate that powerful deep networks can often be distilled into smaller networks with surprisingly high accuracy [70]. One example is the teacher-student training paradigm. Here, a giant network is trained on a task, such as binary image classification. Then, a smaller network is trained on the same images; however, instead of learning to predict the binary classes, the student is trained to mimic the decision-making process of the teacher. Thus, one neural network trains another. Numerous studies suggest that student networks significantly smaller than the teacher can learn functions that surpass the teacher's performance.
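A minimal sketch of a teacher-student (distillation) loss along the lines described above is given below, following the standard temperature-scaled formulation; the temperature, the mixing weight, and the inclusion of the hard-label term are illustrative assumptions.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend of soft-target mimicry (teacher) and the usual hard-label loss."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    # KL divergence between the softened teacher and student distributions
    kd = F.kl_div(soft_student, soft_targets, reduction='batchmean') * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```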
Another example which highlights the power of small deep nets is recent work on “BagNets” [71], which use neural networks to learn fine-grained local features that are subsequently averaged spatially (following the bag-of-words model) to produce higher-order features. The finding that BagNets perform nearly as well as traditional deep convolutional nets with many more parameters suggests that most contemporary networks can learn to make predictions using local features without any spatial context. Another study, published by the same group, suggests that texture augmentation can train networks to learn spatial (as opposed to local) features. These studies confirm observations from my previous work. While developing CellNet, I also explored a hierarchical averaging model, which produced similar results with much less spatial pooling. Additionally, the findings about texture augmentation agree with my results from Chapter 2 about the importance of style normalization for learning quality features. In fact, both works use similar style augmentation schemes.
Related to the general theme of using small networks better is the concept that the output of a standard neural network can be made more explainable by exploiting aspects of the network's architecture. In experiment 2, I used a Resnet34, which takes an image patch (224 x 224 pixels) and produces a compressed feature matrix (7 x 7 x 512) that is spatially averaged to produce a fingerprint (1 x 1 x 512 elements). Because averaging is a linear operation, any classification made from the fingerprint can be propagated back onto the compressed matrix, making it possible to produce a 7x7 heatmap for every image patch. This can produce images similar to the heatmap visualizations in Chapter 2, but for any prediction, not just patient identity.
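The sketch below illustrates this idea: because global average pooling and a linear classifier commute, applying the classifier weights at each of the 7 x 7 spatial positions yields a per-location contribution map whose spatial mean equals the patch-level score. The variable names are illustrative.

```python
import torch
import torch.nn as nn

# feature_map: (512, 7, 7) output of the last convolutional block for one patch
# classifier: nn.Linear(512, 1) trained on the pooled 512-D fingerprint
def patch_heatmap(feature_map: torch.Tensor, classifier: nn.Linear) -> torch.Tensor:
    w = classifier.weight.view(512)           # (512,)
    b = classifier.bias                        # (1,)
    # apply the linear classifier at every spatial location
    heatmap = torch.einsum('c,chw->hw', w, feature_map) + b   # (7, 7)
    # sanity check: the spatial mean of the heatmap equals the pooled prediction
    pooled_score = classifier(feature_map.mean(dim=(1, 2)))
    assert torch.allclose(heatmap.mean(), pooled_score.squeeze(), atol=1e-5)
    return heatmap
```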
Finally, given what I’ve observed regarding the importance of staining normalization and domain
adaptation, which is the problem of making images stained at different sites appear more similar,
I think it's essential to improve the way we do image normalization. In Chapter 2, I used a framework called CycleGAN [41] to produce matched images that appear to come from two sites. I propose to use a similar framework to produce matched images that appear to come from many more than two sites, the objective being to train networks to ignore H&E staining variation regardless of slide origin.
Combining these observations, we trained an updated version of the fingerprint network, which is smaller, lighter, and faster. The network is inspired by the original Resnet34 design (Chapter 2) but is nearly one quarter the size. Instead of producing 512-D fingerprints, it produces 32-D fingerprints. An example of the input to the new network is shown in Figure 1. The image labeled X2 is the original patch; X1 is the style-augmented patch. These and 58 other images (29 other pairs) were passed into the neural network, which computes fingerprints. The low-dimensional embedding of the fingerprints is shown at the right. Lines are drawn between all X2s and X1s, the original and restyled images, in the fingerprint space. The close distance between these pairs indicates that the network embeds them near each other in the 32-D space.
Figure 1
Ongoing experiments suggest it will be possible to take an RGB whole slide image (approximately 50,000 x 50,000 pixels at 20x resolution) and compress it into a tiny matrix of roughly 220 x 220 x 32 elements, one 32-D fingerprint per 224 x 224 pixel patch (50,000 / 224 ≈ 223 patches per side). Performing learning on matrices of this size is feasible and has the potential to reveal patterns at a scale not currently possible.
Conclusion
The studies presented in the preceding chapters show that deep learning has potential to
accelerate and augment the clinical pathology workflow. We’ve shown that deep learning
algorithms can learn significant correlations between tissue morphology and clinical subtypes
(assessed by IHC staining) and can be trained to efficiently identify cell types. However,
reaching higher levels of accuracy may require significantly more training data than is currently
available. Some biomarkers (e.g. Estrogen Receptor) can be predicted more accurately than
others (Her2), and it isn’t clear if this is due to the inherent unpredictability of the marker, or the
lack of training data. Over the course of developing and applying these technologies, I’ve learned
a few kernels of information that might be useful to others embarking on similar work.
1. Accurate ground truth is essential
Training a network to predict biomarkers was challenging because the visual correlate is
not known. It is still not known whether there is a visual marker for ER status or whether
the network learns other features that grade tumors, which ‘correlate’ to ER. I perceive
fingerprinting as a breakthrough in my work because this enabled me to develop machine
learning techniques on a problem with known ground truth. With patient identity there was no ambiguity about the quality of the ground truth, so I could be confident that training error reflected problems with the algorithm rather than mistakes in the training data. Training on a problem with well-defined ground truth simplifies debugging.
2. Staining and scanning variation are not trivial
I’ve spent the bulk of the past year learning how to deal with H&E staining and scanning
noise. Images stained at different sites have different colors, which are handled differently
by neural networks. I discovered that training on larger image sets (Chapter 2) does not
automatically remove the effects of differential staining as I had expected. Explicitly
handling these variations is important to ensure that the network learns invariant features.
3. Z-scoring is a simple way to reduce the impact of batch effects
However, even after learning features that appear invariant, networks can still be biased by
batch effects. These batch effects may be on the scale of the hospital, where a pathology
lab may use one formulation of H&E, or on the scale of individual slides, where patches
of images are highly correlated only by virtue of being stained at the same time. Evidence
for this lingering effect comes from the performance gains of using Z-scoring (normalizing
all fingerprints from a slide to zero-mean, unit variance) for each slide in biomarker
prediction. Z scoring significantly improves ER classification accuracy on the validation
as well as the hold-out (Australian samples) datasets.
4. Z scoring helps with domain-adaptation
Just because images are scanned at similar resolution does not mean they look the same to
the neural network. However, this point still surprises me. If I compare the mean fingerprint
across all samples from the TCGA dataset (which come from 40 sites) to the mean
fingerprint of the samples from the Australian samples (number of sites unknown), there
is a significant difference and evidence for domain shift. This shift cannot be explained by
H&E coloration alone, since the samples in both datasets are highly heterogeneous in staining color. I worry about the slight differences between the images that lead to these
shifts. One hypothesis is that the Australian samples are compressed differently (the file
format for the Australian samples is ‘ndpi’ from Hamamatsu, versus ‘svs’ from Aperio for
TCGA). Additionally, some formats use ‘jpeg’ compression to reduce file size. I think
these details seem small but may have significant implications for unbiased algorithms. If
we don’t control for this source of noise, the network may learn to use it as a signal.
5. Self-supervision is possible
Self-supervised learning is one of the most exciting concepts I’ve gotten to explore. It has
tremendous potential to teach us new aspects of biology, provided that we can remove the
majority of noise in the input data. More on this concept in the final point.
6. Two major challenges: the lack of compatible file formats and fast visualization tools
I’m surprised and not pleased by the difficulty of working with microscopy image formats.
While there are well-developed tools for the common formats (svs, ndpi), a majority of
my in-house work relies on images produced by a slide scanner that uses a proprietary file
format. This format is notoriously hard to work with. Moreover, most whole-slide viewing
and annotation tools (the few that exist, anyways) are not compatible. Thus, when given
the option, I worked with public image datasets which were easier to access and manipulate
in Python versus samples I could prepare locally.
7. There’s a need for segmentation tools that work on the cloud
At several points, I attempted to develop tools to perform interactive segmentation with
deep learning. All open segmentation solutions are lacking in accuracy (none use state-of-the-art deep networks). However, a whole-slide
viewing/annotation platform that is capable of deep learning would be a valuable tool for
pathology researchers. The bottleneck to developing this system is that most of our deep
learning models are trained on the cloud. However, annotation must be done on local
systems. Thus, there is a need for GUIs that make it possible to segment locally and perform
calculations and heatmap prediction on the cloud.
8. Smaller nets can learn useful features, too
I grew up believing bigger nets were better. This is based, generally, on results from image classification challenges that use giant 100+ layer neural networks [49] to classify natural images. Recent works like BagNets [71] challenge the importance of larger networks. This
fits with my own observation that fingerprinting networks with 34 layers performed
equivalently to networks with 50+ layers. Thus, training on smaller deep nets may be
reasonable.
9. Explainability is important
In general, the deep learning field has been strongly biased towards the concept of “end-to-end” black-box learning. This phrase is emphasized in most papers, leading to the conclusion that hand-crafted, feature-based approaches are intrinsically worse than deep-learned features. However, this notion is starting to change. First, examples like BagNets are
beginning to show that deep networks are not magic—the features they learn may not be
as holistic as one might be led to believe. Second, hybrid approaches that increase the
transparency of deep networks may shed light on the features the network learns, which
can be powerful for diagnostics. Deep networks should be seen as a tool to advance our
understanding of biology, not as an end in themselves. However, when a field is growing,
it is all too easy to just “drink the Kool-Aid.”
10. Autofluorescence is challenging but intriguing
While exploring how immunofluorescence staining can be used to train segmentation
networks, I’ve observed that cores on tissue microarrays exhibit varying levels of
autofluorescence. While we have spent time optimizing stains to produce strong
fluorescent signals to overpower the auto-fluorescence, occasionally I scan the un-stained
slide to look at the autofluorescence. I look at these pictures with awe (Figure 1, next page).
The images are pseudo-colored based on emission from four fluorescent channels, DAPI,
FITC, TRITC and Cy5. A number of cores contain what look like granules. I’m curious
what compounds explain these patterns.
11. It may be possible to do a lot more, with less:
Determining cell types by self-supervised learning
Similar to how fingerprints can be trained by assigning each patient an index and training
a network to predict patient identity, I hypothesize that cell types can be learned in a similar
fashion by segmenting an H&E image into individual nuclei and training a network to
identify individual nuclei. While this idea is still in its infancy, the preliminary data provide
strong evidence that it’s possible. Consider the images below (Figure 2). The image on the left is an H&E image. A U-Net similar to the network used in Chapter 3 to detect lymphocytes was initially trained to perform nuclear segmentation, a binary classification task (1 = nucleus, 0 = not nucleus). The features learned by the network to predict nuclear status were then compressed: each pixel is colored by the first three components of a principal component analysis of those features. Despite no explicit training about cell type, the network already seems to have a concept of different cell types.
Figure 2
Figure 1. Autofluorescence of Breast Cancer Tissue Microarray with enlarged inset.
Over the past four years, I’ve experienced moments of great inspiration, relief, pride, and confusion. I end this chapter with a new feeling: confidence. Only now, as I near the conclusion of my Ph.D.
studies, do I feel like I understand how to use these tools effectively. I hope that others can build
on these insights to accelerate their research and ultimately impact patients.
References
1. Beck, A. H. et al. Systematic analysis of breast cancer morphology uncovers stromal
features associated with survival. Sci. Transl. Med. 3, (2011).
2. Hanahan, D. & Weinberg, R. A. Hallmarks of Cancer: The Next Generation. (2011).
doi:10.1016/j.cell.2011.02.013
3. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning. (MIT Press, 2016).
4. From not working to neural networking - Technology. The Economist Jun (2016).
Available at: https://www.economist.com/special-report/2016/06/23/from-not-working-to-
neural-networking.
5. Zeiler, M. D. & Fergus, R. Visualizing and Understanding Convolutional Networks.
(2013).
6. Rawat, R. R., Ruderman, D., Macklin, P., Rimm, D. L. & Agus, D. B. Correlating nuclear
morphometric patterns with estrogen receptor status in breast cancer pathologic
specimens (in submission).
7. Allred, D. C. Issues and updates: evaluating estrogen receptor-α, progesterone receptor,
and HER2 in breast cancer. Mod. Pathol. 23, S52–S59 (2010).
8. Gradishar, W. J. et al. NCCN Clinical Practice Guidelines in Oncology (NCCN
Guidelines) Breast Cancer. Natl. Compr. Cancer Netw. 2, 4 (2016).
9. Goldstein, N. S., Hewitt, S. M., Taylor, C. R., Yaziji, H. & Hicks, D. G.
Recommendations for improved standardization of immunohistochemistry. Appl
Immunohistochem Mol Morphol 15, 124–133 (2007).
10. Elizabeth Hammond, M. H. et al. American Society of Clinical Oncology/College of
American Pathologists Guideline Recommendations for Immunohistochemical Testing of
Estrogen and Progesterone Receptors in Breast Cancer. Arch Pathol Lab Med 134, (2010).
11. Ingle, J. N. et al. A Double-Blind Trial of Tamoxifen Plus Prednisolone Versus
Tamoxifen Plus Placebo in Postmenopausal Women With Metastatic Breast Cancer.
Cancer 68, 34–39 (1991).
12. Robert, N. Clinical Efficacy of Tamoxifen. Oncology 11, 15–20 (1997).
13. Wood, A. J. J. & Osborne, C. K. Tamoxifen in the Treatment of Breast Cancer. N. Engl. J.
Med. 339, 1609–1618 (1998).
14. US Biomax, I. Breast carcinoma tissue microarray, 140 cases, with ER/PR/HER2 and
survival data, followed up 9-12 years. (2015). Available at: http://www.biomax.us/tissue-
arrays/Breast/HBre-Duc140Sur-01.
15. Schindelin, J. et al. Fiji: an open-source platform for biological-image analysis. Nat.
Methods 9, 676–682 (2012).
16. Schneider, C. a, Rasband, W. S. & Eliceiri, K. W. NIH Image to ImageJ: 25 years of
image analysis. Nat. Methods 9, 671–675 (2012).
17. Otsu, N. A Threshold Selection Method from Gray-Level Histograms. IEEE Trans. Syst.
Man. Cybern. (1979).
18. Friedman, S. H. et al. MultiCellDS: a community-developed standard for curating
microenvironment-dependent multicellular data. bioRxiv (2016).
19. Long, J., Shelhamer, E. & Darrell, T. Fully Convolutional Networks for Semantic
Segmentation. bioRxiv 1, 1–10 (2014).
20. Kraus, O. Z., Lei Ba, J. & Frey, B. J. Classifying and segmenting microscopy images with
deep multiple instance learning. doi:10.1093/bioinformatics/btw252
21. Ioffe, S. & Szegedy, C. Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift. Arxiv 1–11 (2015). doi:10.1007/s13398-014-0173-7.2
22. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: A
Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 15,
1929–1958 (2014).
23. Al-Rfou, R. et al. Theano: A Python framework for fast computation of mathematical
expressions. (2016).
24. Dieleman, S. et al. Lasagne: First release. (2015). doi:10.5281/zenodo.27878
25. Nadji, M., Gomez-Fernandez, C., Ganjei-Azar, P. & Morales, A. R.
Immunohistochemistry of Estrogen and Progesterone Receptors Reconsidered Experience
With 5,993 Breast Cancers. Am J Clin Pathol 123, 21–27 (2005).
26. Welsh, A. W. et al. Standardization of Estrogen Receptor Measurement in Breast Cancer
Suggests False-Negative Results Are a Function of Threshold Intensity Rather Than
Percentage of Positive Cells. J. Clin. Oncol. 29, 2978–2984 (2011).
27. Wolff, A. C. Estrogen Receptor: A Never Ending Story? Antonio. J. Clin. Oncol. 29,
2955–2957 (2011).
28. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet Classification with Deep
Convolutional Neural Networks.
29. Ehteshami Bejnordi, B. et al. Diagnostic Assessment of Deep Learning Algorithms for
Detection of Lymph Node Metastases in Women With Breast Cancer. JAMA 318, 2199
(2017).
30. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural
networks. Nature 542, 115–118 (2017).
31. Teare, P., Fishman, M., Benzaquen, O., Toledano, E. & Elnekave, E. Malignancy
Detection on Mammography Using Dual Deep Convolutional Neural Networks and
Genetically Discovered False Color Input Enhancement. J. Digit. Imaging 30, 499–505
(2017).
32. Liu, Y. et al. Artificial Intelligence–Based Breast Cancer Nodal Metastasis Detection.
Arch. Pathol. Lab. Med. (2018). doi:10.5858/arpa.2018-0147-oa
33. Beck, A. H., Irshad, H., Gargeya, R., Khosla, A. & Wang, D. Deep Learning Based
Cancer Metastases Detection.
34. Coudray, N. et al. Classification and mutation prediction from non–small cell lung cancer
histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).
35. Bychkov, D. et al. Deep learning based tissue analysis predicts outcome in colorectal
cancer. Sci. Rep. 8, 3395 (2018).
36. Allison, K. H. et al. Understanding diagnostic variability in breast pathology: Lessons
learned from an expert consensus review panel. Histopathology 65, 240–251 (2014).
37. Elmore, J. G. et al. Diagnostic Concordance Among Pathologists Interpreting Breast
Biopsy Specimens. 98104, 1122–1132 (2017).
38. Robbins, P. et al. Histological grading of breast carcinomas: A study of interobserver
agreement. Hum. Pathol. 26, 873–879 (1995).
39. Hammond, M. E. H. et al. American society of clinical oncology/college of American
pathologists guideline recommendations for immunohistochemical testing of estrogen and
progesterone receptors in breast cancer (unabridged version). Arch. Pathol. Lab. Med.
134, (2010).
40. Gatys, L. A., Ecker, A. S. & Bethge, M. A Neural Algorithm of Artistic Style. (2015).
41. Zhu, J.-Y., Park, T., Isola, P. & Efros, A. A. Unpaired Image-to-Image Translation using
Cycle-Consistent Adversarial Networks. (2017).
42. Allred, D. et al. NCCN Task Force Report: Estrogen Receptor and Progesterone Receptor
Testing in Breast Cancer by Immunohistochemistry. J Natl Compr Canc Netw 7, 1–21
(2009).
43. Koboldt, D. C. et al. Comprehensive molecular portraits of human breast tumours. Nature
490, 61–70 (2012).
44. Rawat, R. R., Ruderman, D., Macklin, P., Rimm, D. L. & Agus, D. B. Correlating nuclear
morphometric patterns with estrogen receptor status in breast cancer pathologic
specimens. npj Breast Cancer 4, 32 (2018).
45. Couture, H. D. et al. Image analysis with deep learning to predict breast cancer grade, ER
status, histologic subtype, and intrinsic subtype. npj Breast Cancer 4, 30 (2018).
46. Carpenter, J., Marsh, D., Mariasegaram, M. & Clarke, C. The Australian Breast Cancer
Tissue Bank (ABCTB). Open J. Bioresour. 1, e1 (2014).
47. Campanella, G., Silva, V. W. K. & Fuchs, T. J. Terabyte-scale Deep Multiple Instance
Learning for Classification and Localization in Pathology. (2018).
48. Roepman, P. et al. Microarray-Based Determination of Estrogen Receptor, Progesterone
Receptor, and HER2 Receptor Status in Breast Cancer. Clin. Cancer Res. 15, 7003–7011
(2009).
49. He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition.
50. Paszke, A. et al. Automatic differentiation in PyTorch. 31st Conf. Neural Inf. Process.
Syst. 1–4 (2017). doi:10.1017/CBO9781107707221.009
51. Vonderheide, R. H., Domchek, S. M. & Clark, A. S. Immunotherapy for Breast Cancer:
What Are We Missing? Clin. Cancer Res. 23, 2640–2646 (2017).
52. von Minckwitz, G. et al. Definition and Impact of Pathologic Complete Response on
Prognosis After Neoadjuvant Chemotherapy in Various Intrinsic Breast Cancer Subtypes.
J Clin Oncol 30, 1796–1804
53. Gingras, I., Azim, H. A., Ignatiadis, M. & Sotiriou, C. Immunology and Breast Cancer:
Toward a New Way of Understanding Breast Cancer and Developing Novel Therapeutic
Strategies. Clin. Adv. Hematol. Oncol. 13, (2015).
54. Salgado, R. et al. Tumor-Infiltrating Lymphocytes and Associations With Pathological
Complete Response and Event-Free Survival in HER2-Positive Early-Stage Breast Cancer
Treated With Lapatinib and Trastuzumab. JAMA Oncol. 1, 448 (2015).
55. Salgado, R. et al. The evaluation of tumor-infiltrating lymphocytes (TILs) in breast
cancer: recommendations by an International TILs Working Group 2014.
doi:10.1093/annonc/mdu450
56. Nawaz, S., Heindl, A., Koelble, K. & Yuan, Y. Beyond immune density: critical role of
spatial heterogeneity in estrogen receptor-negative breast cancer. Mod. Pathol. 28, 766–
777 (2015).
57. Dent, R. et al. Triple-Negative Breast Cancer: Clinical Features and Patterns of
Recurrence. Clin. Cancer Res. 13, (2007).
58. Collins, L. & Laronga, C. Breast ductal carcinoma in situ: Epidemiology, clinical
manifestations, and diagnosis - UpToDate. Available at: https://www-uptodate-
com.libproxy2.usc.edu/contents/breast-ductal-carcinoma-in-situ-epidemiology-clinical-
manifestations-and-diagnosis?source=see_link. (Accessed: 27th November 2016)
59. Nanda, R., Liu, M., Yau Christina & Asare, S. Pembrolizumab plus standard neoadjuvant
therapy for high-risk breast cancer (BC): Results from I-SPY 2. (2017).
60. Natrajan, R. et al. Microenvironmental Heterogeneity Parallels Breast Cancer Progression:
A Histology– Genomic Integration Analysis. doi:10.1371/journal.pmed.1001961
61. Nawaz, S. & Yuan, Y. Computational pathology: Exploring the spatial dimension of
tumor ecology. Cancer Lett. 380, 296–303 (2016).
62. Degnim, A. C. et al. Alterations in the Immune Cell Composition in Premalignant Breast
Tissue that Precede Breast Cancer Development Short Running Title: Immune cells and
premalignant breast tissue.
63. Bense, R. D. et al. Relevance of Tumor-Infiltrating Immune Cell Composition and
Functionality for Disease Outcome in Breast Cancer. J. Natl. Cancer Inst. 109, djw192
(2017).
64. Hanahan, D. & Folkman, J. Patterns and Emerging Mechanisms of the Angiogenic Switch
during Tumorigenesis. Cell 86, 353–364 (1996).
65. Cardone, A., Tolino, A., Zarcone, R., Borruto Caracciolo, G. & Tartaglia, E. Prognostic
value of desmoplastic reaction and lymphocytic infiltration in the management of breast
cancer. Panminerva Med. 39, 174–7 (1997).
66. Conklin, M. W. & Keely, P. J. Why the stroma matters in breast cancer. Cell Adh. Migr. 6,
249–260 (2012).
67. Janowczyk, A. & Madabhushi, A. Deep learning for digital pathology image analysis: A
comprehensive tutorial with selected use cases. J. Pathol. Inform. 7, 29 (2016).
68. Zarrella, E. R. et al. Automated measurement of estrogen receptor in breast cancer : a
comparison of fluorescent and chromogenic methods of measurement. 96, 1016–1025
(2016).
69. Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional Networks for Biomedical
Image Segmentation. Miccai 234–241 (2015). doi:10.1007/978-3-319-24574-4_28
70. Hinton, G., Vinyals, O. & Dean, J. Distilling the Knowledge in a Neural Network. 1–9
(2015).
71. Brendel, W., Bethge, M. & Karls, E. Approximating Cnns With Bag-of-Local-Features
Models Works Surprisingly Well on Imagenet. 1–15 (2019).
Abstract
This thesis explores the use of deep learning on hematoxylin and eosin (H&E) stained pathology slides to distinguish between breast cancers with different molecular markers.