Brain Tumor Segmentation
by
Sachin Raja
A thesis submitted in partial satisfaction of the
requirements for the degree of
Master of Science
in
Computer Science
in the
Graduate Division
of the
University of Southern California
Committee in charge:
Dr. Cauligi S. Raghavendra, Chair
Dr. Yan Liu
Dr. Sathyanaraya Raghavachary
May 2016
Abstract
Brain Tumor Segmentation
by
Sachin Raja
Master of Science in Computer Science
University of Southern California
Dr. Cauligi S. Raghavendra, Chair
This thesis presents a fully automatic technique to segment brain tumors from MRI scans using deep neural networks. The proposed approach is tailored to segmenting high- and low-grade glioblastomas; however, with minor modifications, it can be applied to several other image-based segmentation and diagnostic tasks. Glioblastomas can appear in any shape, size, location and severity in the brain. Further, there is a significant degree of disagreement between human expert annotators in their segmentations. These reasons motivated the exploration of a machine learning approach to the problem of brain tumor segmentation. In this work, several previous works aiming to solve this problem are discussed, and different architectures based on deep convolutional neural networks are proposed. Ways to speed up the segmentation for real-world application are also discussed. Results of the work are presented on the BraTS 2014 dataset (Brain Tumor Segmentation challenge, held in conjunction with MICCAI).
To Family and Friends
I dedicate my thesis work to my family, friends and all my teachers who have guided me
throughout my academic life. A special feeling of gratitude to my parents Dr. Deepak
Raja, Dr. Suman Raja, my grandparents Dr. Om Prakash Gupta, Late Kaushalya Devi
and my brother Rohan Raja whose words of encouragement and push for tenacity ring in
my ears. I also dedicate this thesis to all my friends for their support throughout.
Acknowledgments
I wish to thank my committee members who were more than generous with their expertise
and precious time. A special thanks to Dr. Cauligi S Raghavendra, my committee chairman
for his countless hours of reflecting, reading, encouraging, and most of all patience throughout the entire process. Thank you Dr. Yan Liu, and Dr. Sathyanaraya Raghavachary for
agreeing to serve on my committee. I would also like to thank PhD student Ayush Jaiswal
for his continuous support and technical help.
I would like to acknowledge and thank my school division for allowing me to conduct my
research and providing any assistance requested. Special thanks go to the members of the staff
development and human resources department for their continued support.
Finally I would like to thank the beginning teachers, mentor-teachers and administrators
in our school division that assisted me with this project. Their excitement and willingness
to provide feedback made the completion of this research an enjoyable experience.
Contents

Abstract
Dedication
Acknowledgements
List of Figures
List of Tables

1 Introduction
1.1 Background
1.2 Diagnosis
1.3 Motivation and Challenge
1.4 The Data
1.5 Expert Annotation of Tumor Structures
1.6 Evaluation Technique
1.7 Contribution of the Thesis

2 Convolutional Neural Networks
2.1 Why Deep Learning for Segmentation
2.2 Basic Concepts in Machine Learning
2.3 Introduction to Deep Learning
2.4 Artificial Neural Networks
2.5 Convolutional Neural Networks (CNN)
2.6 Convolutional Neural Networks for Segmentation

3 Related Work
3.1 Generative Models
3.2 Discriminative Models
3.3 Recent Work using Deep Convolutional Networks

4 Methodology
4.1 Pre-processing
4.2 CNNs for Feature Extraction and Classification
4.3 Network Initialization
4.4 Regularization to Prevent Overfitting
4.5 Data Augmentation
4.6 Balancing the Training Set and Two Phase Training
4.7 Network Configurations
4.8 Network Architectures
4.9 Training the Network
4.10 Post-Processing

5 Implementation Details
5.1 Overview of the Framework
5.2 Data Splitting
5.3 Hyper-parameter Tuning

6 Results

7 Conclusion and Future Work

Bibliography

List of Figures

1.1 T1, T1C, T2 and FLAIR modalities for a high-grade glioma patient
1.2 Ground truth image for glioma case in figure 1.1
2.1 Neuron - basic unit of an ANN
2.2 Multi-layer feed forward neural network
2.3 Back Propagation
2.4 Example of a convolutional layer for a 2-dimensional image
2.5 Example of a pooling layer for a 2-dimensional input volume
2.6 Example of a convolutional neural network for a classification task
2.7 Sliding window approach for segmentation
2.8 Axial, Sagittal and Coronal views of T1 modality of a high-grade glioma patient
3.1 Generative model proposed by Menze, Bjoern H., et al. [31]
4.1 A 3D convolutional neural network designed by Ji, Shuiwang, et al. [21] for human action recognition
4.2 Two pathway CNN proposed by Havaei, Mohammad, et al. [17]
4.3 Network architecture proposed by Davy, A., et al. [10] for brain tumor segmentation
4.4 Input cascade CNN as proposed by Havaei, Mohammad, et al. [17]
4.5 Local pathway cascade CNN as proposed by Havaei, Mohammad, et al. [17]
4.6 Pre-output or Mean-field cascade CNN as proposed by Havaei, Mohammad, et al. [17]

List of Tables

4.1 Network architecture of the first CNN of the cascaded network
4.2 Network architecture of the second CNN of the cascaded network
6.1 Results from the work of Havaei, Mohammad, et al. [17]
6.2 Results in terms of dice scores for the networks implemented in the work of this thesis
Chapter 1
Introduction
1.1 Background
Gliomas are the most frequently observed primary brain tumors in adults. They originate from glial cells and infiltrate the surrounding tissues. Patients diagnosed with brain tumors show highly variable symptoms depending on the type, grade, size and location of the tumor. Cancerous or malignant brain tumors are the most dangerous, as they are characterized by uncontrolled growth of tumor cells and are capable of invading and spreading to surrounding tissues. There are many grades of gliomas; however, they are often referred to simply as 'low-grade gliomas' and 'high-grade gliomas', where the grade denotes the growth potential and aggressiveness of the tumor. Patients diagnosed with high-grade gliomas have an average survival time of two years or less and require immediate treatment. Patients with low-grade variants such as low-grade astrocytomas or oligodendrogliomas can have a life expectancy of several years, and aggressive treatment for such patients is often delayed as long as possible. [32]
1.2 Diagnosis
If a neurologist suspects a brain tumor, a number of tests are carried out to confirm its presence.

A neurological exam: During a neurological exam, the doctor examines vision, hearing, balance, coordination, strength and reflexes. Problems in one or more of these areas may indicate the part of the brain that could be affected by a brain tumor.

Imaging tests: Magnetic Resonance Imaging (MRI) is most commonly used to help diagnose brain tumors. In some cases, a dye or contrast material is injected into the blood to better indicate differences between healthy and unhealthy tissues in the brain. A number of specialized MRI components such as functional MRI, perfusion MRI and magnetic resonance spectroscopy are used by radiologists to identify tumor region, grade, type and size. Imaging tests like Computerized Tomography (CT) scans and Positron Emission Tomography (PET) scans may also be used for diagnosing brain tumors.

Biopsy: Depending on the location of the glioma, a sample of the abnormal tissue is collected and analysed under a microscope to determine whether it is benign or cancerous before treatment.
1.3 Motivation and Challenge
Despite considerable research advances in the field of glioma detection and treatment, patient diagnosis remains difficult. Intensive neuro-imaging protocols are used before and after treatment to evaluate the progression of the disease and the success of the chosen treatment strategy.

In current clinical practice, images are evaluated either using qualitative criteria alone, that is, by identifying hyper-intense tissue appearance in contrast-enhanced scans, or by relying on rudimentary quantitative measures such as the largest diameter of the lesion visible in axial images. Currently, segmentation is done manually by human experts or by using semi-automated segmentation methods under human supervision. Segmentation by human experts is a time-consuming process: 3D images of the brain are segmented one slice at a time, which typically results in jagged and inexact 3D segmentations. Moreover, a study carried out by Mazzara et al. [30] shows inter- and intra-operator average segmentation variability of 28% ± 12% and 20% ± 15% respectively. This clearly illustrates the difficulty of the segmentation task.

Following are some of the reasons which make glioma segmentation a difficult problem for human experts and for any computational process [32]:

The tumor region in an MRI scan is segmented by computing intensity changes relative to the surrounding tissues in all given modalities. The intensity gradients are often smooth, which typically results in ambiguous boundaries.

The type, grade, location and size of a glioma vary significantly from patient to patient. This prevents making use of any prior information on shape and location for segmentation, which has proved useful in other segmentation tasks.

The mass effect induced by the growing lesion tends to displace normal brain tissues, which limits spatial prior knowledge of the healthy part of the brain.

Because of its high clinical relevance and its challenging nature, the problem of computational brain tumor segmentation has gained considerable attention during the past 20 years, resulting in a range of different algorithms for automated, semi-automated and interactive segmentation of tumor structures. [32]
1.4 The Data
The available clinical dataset consists of full brain MRI scans from 54 low-grade (histological diagnosis: astrocytomas or oligoastrocytomas) and 278 high-grade (anaplastic astrocytomas and glioblastoma multiforme tumors) glioma cases. The images are a mix of pre- and post-treatment brain scans. The data was acquired over the course of several years using scanners from different vendors and different field strengths (1.5T and 3T) at four different centres: Bern University, Debrecen University, Heidelberg University and Massachusetts General Hospital. [32]
All scans in the image dataset contain the following four modalities along with an image
describing the ground truth annotated by human experts:
T1: T1-weighted, native image, sagittal or axial 2D acquisition, with 1-6mm slice
thickness.
T1C: T1-weighted, contrast-enhanced (Gadolinium) image, with 3D acquisition and
1mm isotropic voxel size for most patients.
T2: T2-weighted image, axial 2D acquisition, with 2-6mm slice thickness.
FLAIR: T2-weighted FLAIR image, axial, coronal or sagittal 2D acquisitions, 2-6mm
slice thickness.
To homogenize the data gathered from different sources, each subject's image volume was rigidly co-registered to the T1C MRI, which had the highest spatial resolution in most cases, and all images were resampled to 1mm isotropic resolution in a standardized axial orientation with a linear interpolator.

Figure 1.1: T1, T1C, T2 and FLAIR modalities for a high-grade glioma patient
Each MRI image has 155 axial slices, where each slice is 240 × 240. The ground truth image has the same dimensionality, with each tumor substructure annotated with a different intensity in the corresponding region. The purpose of using four different modalities for tumor segmentation is that different brain tissues and abnormal tissues show up with different intensities in different scans, which makes it easier to identify the different substructures of the tumor. Each pixel in the image belongs to one of the following five categories, listed in order of increasing seriousness:

Normal tissue
Edema
Non-enhancing core
Necrotic core
Enhancing core

Figure 1.2: Ground truth image for glioma case in figure 1.1
Based on the above-mentioned tumor substructures, three main regions are defined. The 'whole-tumor' region consists of edema, non-enhancing core, necrotic core and enhancing core. The 'core-tumor' region consists of areas affected by non-enhancing core, necrotic core and enhancing core. The 'active-tumor' region is the enhancing core of the tumor. The biological properties of these tumor substructures are not discussed here; the machine learning and feature extraction algorithms are expected to extract features on their own.
1.5 Expert Annotation of Tumor Structures
MRI scans were annotated by an expert team of radiologists across Bern, Debrecen and Boston, and it took about 60 minutes to annotate the brain of each patient. Since gliomas are highly infiltrative tissues for which clear boundaries are hard to define, each subject was annotated by several experts and their results were fused to obtain a single segmentation through consensus. The following annotation protocol was used for each high-grade and low-grade glioma scan. [32]

T2 scans were primarily used to segment edema. The extension of edema was cross-checked using FLAIR scans. FLAIR scans were also used to distinguish edema from ventricles and other fluid-filled structures. The initially segmented edema contained the core structures, which were segmented in later steps.

The core tumor structure, which includes both enhancing and non-enhancing structures, was first segmented by evaluating hyper-intensities in T1C for high-grade glioma cases, along with the inhomogeneous component of the hyper-intense lesion and the hypo-intense lesion visible in T1.

The enhancing core of the tumor was then segmented by thresholding T1C intensities within the core tumor structure segmented in the previous step, so as to include the Gadolinium-enhancing tumor rim and exclude the necrotic center and vessels. The intensity threshold was visually determined by the experts on a case-by-case basis.

The necrotic or fluid-filled region was segmented by identifying the low-intensity necrotic structures within the enhancing rim visible in T1C scans.

The non-enhancing or solid core structures were segmented by subtracting the enhancing core and the necrotic core structures from the core tumor region obtained in the second step.
1.6 Evaluation Technique
When evaluating the segmentation, each pixel of the scan is assigned one of the mutually exclusive labels. The whole tumor region is all four tumor labels combined, the core region is the three most serious labels, and the active region is the enhancing core label. For each tumor region, a prediction label set P and a true label set T are calculated, whose size equals the number of pixels in the scan. If a pixel is predicted to be in a tumor region, its value in the prediction set is set to one, else to zero. Similarly, if a pixel belongs to a tumor region in the ground truth image, the value corresponding to that pixel in the true label set is set to one, else to zero. For each of the three tumor regions, the dice score is evaluated, which is defined as follows:
Dice(P, T) = |P_1 ∩ T_1| / ((|P_1| + |T_1|) / 2)   (1.1)

where |P_1| denotes the size of the set of positively predicted voxels, |T_1| denotes the size of the set of truly positive voxels, and |P_1 ∩ T_1| denotes the size of the set of voxels which are both truly positive and predicted as positive.
For the purpose of detailed analysis, F-scores for all the tumor substructures and the three defined tumor regions can also be evaluated. To evaluate the F-score for each label l, precision and recall for that label are calculated as follows:

Precision(l) = (Number of pixels correctly classified as l) / (Number of pixels classified as l)   (1.2)

Recall(l) = (Number of pixels correctly classified as l) / (Number of pixels with ground truth label l)   (1.3)

After the precision and recall values for a label are evaluated, the F-score for that label can be calculated as follows:

F1-Score(l) = (2 · Precision(l) · Recall(l)) / (Precision(l) + Recall(l))   (1.4)
The F-score can be used as a single measure of the performance of the test for the positive class. It considers both the precision and the recall of the test, and can be interpreted as their harmonic mean, where a value of 0 indicates worst performance and a value of 1 indicates best performance.
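As a concrete illustration, the metrics in equations (1.1) to (1.4) can be computed directly from label maps with a few lines of NumPy. This is only a sketch: the array names and the assumption that the region masks are supplied as binary arrays are illustrative, not part of the evaluation protocol described above.

```python
import numpy as np

def dice_score(pred, truth):
    """Dice score (eq. 1.1) for two binary masks of equal shape."""
    p1, t1 = pred.astype(bool), truth.astype(bool)
    overlap = np.logical_and(p1, t1).sum()       # |P1 intersect T1|
    denom = (p1.sum() + t1.sum()) / 2.0          # (|P1| + |T1|) / 2
    return overlap / denom if denom > 0 else 0.0

def f1_score(pred_labels, true_labels, l):
    """F1-score (eq. 1.4) for a single label l over label maps."""
    tp = np.sum((pred_labels == l) & (true_labels == l))
    precision = tp / max(np.sum(pred_labels == l), 1)   # eq. 1.2
    recall = tp / max(np.sum(true_labels == l), 1)      # eq. 1.3
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```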
For visual evaluation, a new image is created from the predicted image, containing an overlay of the prediction contours onto the ground truth image. For every test subject, one overlay image is created for the evaluation of the predicted tumor substructures and another for the evaluation of the three defined tumor regions against the ground truth.
1.7 Contribution of the Thesis
The work of this thesis investigates the use of deep neural networks for the problem of segmenting the tumor region in MRI scans. It uses the work carried out by Havaei, Mohammad, et al. [17] and Davy, A., et al. [10] as a baseline to create architectures based on deep convolutional networks for tackling this problem. It analyses problems associated with previous models based on generative and discriminative approaches, and proposes techniques to overcome them.

Further, most conventional approaches take around 90 minutes to segment the tumor of one subject, which is too slow for real-world use. In this work, a distributed approach is discussed that significantly reduces the segmentation time. Moreover, a post-processing technique based on Conditional Random Fields is also discussed, which is used to smooth the predicted segmentation.
Chapter 2
Convolutional Neural Networks
2.1 Why Deep Learning for Segmentation
During the past several years, the problem of brain tumor segmentation has attracted researchers, and several computational methods have been devised to solve it. Most of these techniques are based on conventional machine learning and computer vision methods, which require manually hand-engineered features such as image gradients, Gabor filters, Local Binary Patterns and Histograms of Oriented Gradients. One major problem with such techniques is that hand-engineered features often require a large number of computations to make predictions with high accuracy. This is usually very slow to compute and expensive in terms of memory usage. Some of the more advanced techniques employ dimensionality reduction techniques such as Principal Component Analysis (PCA) or feature selection techniques such as Maximum Relevance Minimum Redundancy (MRMR), but this often leads to a reduction in accuracy.

Also, most hand-engineered features exploit only very generic edge-related information, and designing such features for segmenting brain tumors is particularly difficult, since the tumor structures are highly infiltrative and the intensity gradients are usually too smooth to clearly identify boundaries.

Ideally, especially for image-related tasks, one would want features that are composed and refined into higher-level, task-specific representations obtained directly from raw or minimally preprocessed input images. Designing such features manually can be very difficult, especially when dealing with medical image scans where no priors on shape and location are available for segmentation.

Deep learning has been shown to excel at learning a hierarchy of increasingly complex features directly from the raw input. In this work, deep convolutional neural networks have been applied to learn feature hierarchies that are adapted specifically to the task of brain tumor segmentation.
2.2 Basic Concepts in Machine Learning
This section explains some of the basic machine learning concepts that are important to understand before dealing with segmentation using deep learning and post-processing of the output using other machine learning techniques.

About Machine Learning

Machine learning algorithms are capable of figuring out how to perform predictive tasks by generalizing from data. This is generally a cost-effective and feasible approach for tasks that would be very tedious to program manually. Machine learning algorithms typically require an amount of data large enough to identify patterns from, which depends on the complexity of the task. As a consequence, machine learning has seen wide use in computer science and other related fields [11].

Growing volumes and varieties of available data, cheaper and more powerful computation, and affordable data storage have led to a resurging interest in machine learning, both in academia and in industry. This has made it possible to quickly and automatically produce models that can analyze bigger, more complex datasets and produce faster, more accurate results on a very large scale with minimal or no human intervention. Machine learning has found use in a wide range of applications today, such as fraud detection, web-search results, text analysis, recommendation systems, and pattern and image recognition, to name a few.

The fundamental goal of machine learning is to generalize beyond the examples that the predictive model observes during the process of learning. This set of examples is referred to as the training set. Typically, a machine learning model is inaccurate when first initialized but improves by making predictions on the training data and adjusting its parameters based on a cost function. The learning paradigm where the cost for learning model parameters is a function of both the predictions made on the training set and true labels annotated by domain experts is known as supervised learning. With supervised learning, the algorithm receives a set of inputs along with the corresponding correct outputs, and it learns by comparing its actual outputs with the correct outputs to find errors. It then modifies the model by updating its parameters based on the gradient of the error function. Through methods like classification, regression, prediction and gradient boosting, supervised learning uses patterns in the training data to predict values of the label on unseen test data. This class of algorithms is typically used in applications where historical data predicts likely future events. Unsupervised learning is used on data that has no annotated labels. This class of learning algorithms is expected to explore the data and find some structure within it. Popular techniques include self-organizing maps, K-means clustering and singular value decomposition, to name a few. Such algorithms are typically used in topic modeling for text, recommendation systems, identifying outliers in the data, and obtaining representations of inputs that can be further used for supervised learning tasks.

In this thesis, we explore a number of supervised and unsupervised learning techniques to handle the problem of brain tumor segmentation.
Splitting Data
When working with machine learning algorithms, it is important to evaluate the performance of the model on unseen data. For this purpose, the available annotated data is split into three sets:

Training set: This set of data is used to train the model and update the model parameters based on the selected cost function.

Validation set: This set is used to evaluate the performance of the model during the training process and to optimize the hyper-parameters of the model, which are selected empirically. It is also used to identify the stopping criterion for the training algorithm. Typically, when the performance of the model stops improving by a significant amount on the validation set, training is stopped and the model is said to be trained. The validation set is not used to train the model; it only guides the training process.

Test set: This set of data remains unseen during the entire training process. It is used only to evaluate the final performance of the model and should not be used to update the learning algorithm in any manner.

Data should be split randomly to ensure that the distribution of data is similar across the training and evaluation data sources. In the present work, this strategy has been used for splitting data for training and evaluation purposes.
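A minimal sketch of such a random subject-level split is shown below; the 70/15/15 proportions and the function name are illustrative assumptions, not the split used in this thesis.

```python
import numpy as np

def split_indices(n, train_frac=0.70, val_frac=0.15, seed=0):
    """Randomly partition n subject indices into train/validation/test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                    # random order of subjects
    n_train = int(train_frac * n)
    n_val = int(val_frac * n)
    return (idx[:n_train],                      # training set
            idx[n_train:n_train + n_val],       # validation set
            idx[n_train + n_val:])              # test set
```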
Overfitting and Underfitting

When training a machine learning algorithm, it is important to determine how well it has generalized by evaluating its performance on the test data. One of the common problems faced when training a machine learning model is overfitting, which occurs when the model performs well on the training data but is not able to generalize to unseen test data; that is, it fits the training data too well. When overfitting occurs, the algorithm captures the noise in the data. In this case, the model shows low bias and high variance. Overfitting is often the result of an excessively complicated model that has more parameters than required to identify the underlying patterns in the data. It can be prevented by techniques such as regularization and by continuously monitoring the prediction accuracy of the model on the validation set during training.

Another problem faced when training a machine learning algorithm is underfitting. It occurs when a statistical model or machine learning algorithm cannot capture the underlying patterns in the data. Underfitting is the result of too simple a model that is unable to fit the data well enough and shows poor performance on both training and test datasets. An underfitted model typically shows low variance and high bias.
Curse of Dimensionality

Dimensionality refers to the number of features in the input. The curse of dimensionality refers to the fact that many machine learning algorithms that work decently in low dimensions become intractable when the input is high-dimensional. Generalizing correctly becomes exponentially harder as the dimensionality of the input grows, since a fixed-size training set covers only a very small fraction of the input space. High-dimensional spaces show surprising geometrical properties that are counter-intuitive with respect to the behavior of low-dimensional spaces. When dealing with high-dimensional data, it is desirable to have enough data for learning so that it fills the space, or the part of the space, where the model must be valid. Another possible solution is to decrease the dimension of the input space without losing much relevant information, by applying techniques such as principal component analysis or feature selection methods.
2.3 Introduction to Deep Learning
Machine learning developed based on the ability to use computers to probe data for structure, even if that structure cannot be expressed theoretically. Because machine learning often uses an iterative approach to learn from data, the learning can be easily automated, and passes are run through the data until a robust pattern is found. The increased computing power available today has helped data mining techniques evolve for use in machine learning and allows the creation of neural networks with several layers. This paradigm is known as deep learning. Artificial neural networks are a group of algorithms that are loosely based on our understanding of the brain. In theory, ANNs can model any type of relationship within a dataset, but obtaining reliable results can be very tricky, as they are highly prone to overfitting. Research on neural networks has been carried out since the 1950s and has seen successes as well as failures. Deep learning combines advances in computing power and different flavors of neural networks to learn complicated patterns in large amounts of data. Current state-of-the-art results in the areas of image analysis and speech recognition are obtained using deep learning techniques. Research is ongoing to apply this category of algorithms to more complex tasks such as automatic language translation, medical diagnosis and time-series analysis, to name a few. In the work of this thesis, the use of deep learning has been evaluated to solve the complex problem of brain tumor segmentation using modestly sized networks.
2.4 Artificial Neural Networks

Artificial Neural Networks (ANNs) are networks of simple processing elements called neurons. The design of ANNs was motivated by the structure of real brains. The structure of a neural network may differ depending on the task under consideration, but the basic principles are very similar. Theoretically, a neural network has the power of a universal approximator; that is, it can realize any arbitrary mapping of one vector space onto another. Neural networks are able to capture unknown information hidden in the data, and this process is known as training of the neural network. With neural networks, training of both supervised and unsupervised models is possible. Before going into the details, the basic units of ANNs are discussed first.

Neuron

The neuron is the basic unit of any artificial neural network. As shown in figure 2.1, each neuron of the network receives input signals and processes them to send an output signal. Each input signal of a neuron is associated with a real-valued weight that reflects the degree of importance of the associated input signal or connection in the neural network. A neuron performs a simple computation by taking a weighted sum of all input signals, and sends an activated value of this weighted sum as the output signal.
Figure 2.1: Neuron - basic unit of an ANN
The output of a neuron is determined by the following equation:

y = f( ∑_{i=1}^{N} w_i · x_i + b )   (2.1)

where w_i is the weight associated with the corresponding input signal x_i, b is the bias value of the neuron and f(·) is the activation function.
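For concreteness, equation (2.1) corresponds to the following few lines; the sigmoid used here is just one possible choice of the activation function f.

```python
import numpy as np

def neuron_output(x, w, b):
    """Output of a single neuron (eq. 2.1): y = f(sum_i w_i * x_i + b)."""
    z = np.dot(w, x) + b                 # weighted sum of inputs plus bias
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation f(z)
```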
Multi-Layer Feed-Forward Neural Networks
A multi-layer feed-forward (MLF) neural network consists of neurons that are ordered into layers, as shown in figure 2.2.
Figure 2.2: Multi-layer feed forward neural network
The first layer is called the input layer, the last layer is called the output layer, and the layers in between are known as hidden layers. Typically, in a fully connected neural network, each neuron in a particular layer is connected to all the neurons of the subsequent layer. An MLF network performs computations by passing the input in one direction through the network, from the input layer to the output layer. If the neurons in an MLF network do not apply a non-linear activation to the weighted sum of input signals, adding more layers does not help, since a linear transformation of a linear transformation remains linear. Thus, each neuron generally uses an activation function such as tanh or sigmoid to introduce non-linearity. As explained earlier, an MLF network is theoretically capable of learning any continuous function provided there is a sufficient number of neurons in the hidden layer. However, this number is typically very large and, especially in the case of images (where the dimensionality is usually very high), learning becomes intractable, since each input signal is connected to all the neurons of the subsequent layer. This makes the number of parameters very large. A related approach which addresses these problems is the convolutional neural network, described in the next section.
Learning with Back Propagation
Back-propagation is a common method for training a neural network. The algorithm looks for the minimum of the error function in weight space using the method of gradient descent. The most commonly used error or cost functions are the squared loss for regression problems and the cross-entropy loss for classification problems.
The squared loss function is defined as:

C = (1/2) ∑_j (y_j − a_j)²   (2.2)

The cross-entropy loss function is defined as:

C = −∑_j a_j ln y_j   (2.3)

where a_j and y_j are the predicted output and the true output for input sample j, respectively.
Figure 2.3: Back Propagation
The combination of weights that minimizes the error function is considered to be the solution of the learning problem. Since this method requires computing the gradient of the error function at each iteration step, the error function must be continuous and differentiable. To ensure this, the activation function of each neuron within the network also needs to be continuous and differentiable. One problem associated with this technique is that it is susceptible to local minima.
At each iteration step, the weights and biases are updated as follows:

w_ij^(t+1) = w_ij^(t) − η (∂C/∂w_ij)^(t)   (2.4)

b_j^(t+1) = b_j^(t) − η (∂C/∂b_j)^(t)   (2.5)

where w_ij is the weight associated with the connection from neuron i to neuron j, b_j is the bias of neuron j, η is the learning rate, C is the cost function and t is the current time-step.
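The update rules (2.4) and (2.5) translate directly into code once back-propagation has produced the gradients. The sketch below assumes the gradients dC_dw and dC_db have already been computed for one layer; the names are illustrative.

```python
def gradient_descent_step(w, b, dC_dw, dC_db, lr=0.01):
    """One update of weights and biases (eqs. 2.4 and 2.5).

    w, b are NumPy arrays; dC_dw, dC_db are the gradients of the cost C
    with respect to them, as computed by back-propagation; lr is the
    learning rate eta.
    """
    w_new = w - lr * dC_dw
    b_new = b - lr * dC_db
    return w_new, b_new
```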
2.5 Convolutional Neural Networks (CNN)

Regular neural networks do not scale well to training with images. Each pixel in an image is a feature for the network, and this amounts to an unmanageably large number of parameters in a fully connected neural network. Convolutional neural networks, which take only images as inputs, constrain the architecture in a more sensible way. The neurons in each layer of a CNN are arranged in three dimensions, height, width and depth, where depth equals the number of channels in the input. Unlike fully connected neural nets, neurons in a layer of a CNN are connected only to a small region of the layer before it. Following is a brief description of the different types of layers used to build a convolutional neural network.
Convolutional Layer

This layer is the core building block of a CNN; it holds neurons in a 3-dimensional volume. The parameters of a conv layer consist of a set of learnable filters that are small along height and width and extend through the full depth. When a filter is convolved across the input, a dot product is computed between the input and the entries of the filter. Through back-propagation, the network learns filters that activate when they see some specific type of feature at some spatial position in the input. Each neuron of a conv layer is connected to only a small region of the input volume; the size of this region is a hyper-parameter known as the receptive field of the neuron. The extent of connectivity along depth is always equal to the number of channels in the input. The hyper-parameters associated with a conv layer that control the dimensionality of the output are the spatial dimensions of the filter, the depth of the filter, the stride and the zero-padding.

Spatial height and width: The height and width of the filter control the height and width of the output volume.

Depth: It controls the number of neurons in the conv layer that connect to the same spatial region of the input volume, and hence the number of channels in the output volume.

Stride: This parameter controls the overlap of receptive fields on the input. When the stride is high, receptive fields overlap less, which results in a spatially smaller output.

Zero-padding: It pads the input volume with zeros along its border. This hyper-parameter comes in handy when it is desired to control the spatial size of the output volume.
Figure 2.4: Example of a convolutional layer for a 2-dimensional image
The spatial size of the output volume is given as:

OW = (IW − FW + 2P) / S + 1   (2.6)

OH = (IH − FH + 2P) / S + 1   (2.7)

where OW and OH are the spatial width and height of the output volume respectively, IW and IH are the spatial width and height of the input volume respectively, FW and FH are the spatial width and height of the conv layer filter, P is the zero-padding and S is the stride. Note that the depth of the output volume is equal to the number of filters in the conv layer.
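Equations (2.6) and (2.7) are easy to sanity-check in code; for example, a 240 × 240 slice convolved with a 5 × 5 filter at stride 1 and no padding yields a 236 × 236 output.

```python
def conv_output_size(in_size, filter_size, padding, stride):
    """Spatial output size of a conv layer (eqs. 2.6 / 2.7)."""
    return (in_size - filter_size + 2 * padding) // stride + 1

assert conv_output_size(240, 5, 0, 1) == 236
```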
Typically, the ReLU function is used to activate the output of conv filters, since it is very fast to compute. It is defined as max(0, x). A parameter sharing scheme is used in convolutional layers to control the number of parameters. This is based on one simple assumption: if one patch feature is useful to activate at some spatial position, then it should also be useful to activate at other spatial positions. This makes it possible to share the same conv layer parameters across different receptive fields of the input, drastically reducing the number of parameters. Parameter sharing may not be used in cases where it is desired to learn completely different features on different sides of the image. However, this is not the case when dealing with the problem of brain tumor segmentation, so making use of the parameter sharing scheme makes sense.
Maxout Layer

Having multiple layers in an artificial neural network makes sense only if the intermediate layers produce a non-linear activation of the weighted inputs. In the case of convolutional layers, this is done by applying an element-wise non-linearity to the result of the filter convolution. There are multiple options for introducing non-linearity using different activation functions on the filter neurons, such as the sigmoid, hyperbolic tangent, rectified linear and maxout functions. Sigmoid and hyperbolic tangent functions are generally avoided when dealing with modestly sized neural networks, because computing them during the forward pass, and computing their gradients during the backward pass, is time consuming. The rectified linear function is defined as max(0, v), where v is the weighted sum of the input at a neuron. Recently, the maxout non-linearity has been shown to be very effective at modelling useful features in the work of Goodfellow, Ian J., et al. [15]. Maxout features are associated with multiple kernels and correspond to taking the max over K different feature maps individually for each spatial position. Each maxout map Z_s is associated with K feature maps. The maxout activation function is given as:

Z_{s,i,j} = max(O_{s,i,j}, O_{s+1,i,j}, ..., O_{s+K−1,i,j})   (2.8)

where the O are individual feature maps and i and j are spatial positions.
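A direct implementation of equation (2.8) takes the element-wise maximum over groups of K feature maps. The sketch below assumes the standard maxout convention of disjoint groups of K consecutive maps, with feature maps stored along the first axis.

```python
import numpy as np

def maxout(feature_maps, K):
    """Maxout activation (eq. 2.8) over groups of K feature maps.

    feature_maps has shape (S*K, H, W); the output has shape (S, H, W),
    where output map s is the element-wise max of maps s*K .. s*K + K-1.
    The number of maps must be divisible by K.
    """
    n, h, w = feature_maps.shape
    grouped = feature_maps.reshape(n // K, K, h, w)
    return grouped.max(axis=1)
```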
Pooling Layer

A pooling layer is inserted between successive conv layers in a convolutional neural network to reduce the spatial size of the input representation. This is usually done to reduce the number of parameters and the amount of computation in the network, and to help avoid overfitting. It operates independently on every depth slice of the input and resizes it spatially using a max, min or average operation, depending on the choice. Typically, max pooling is used, to retain the maximum intensity values in the input volume. The depth of the output volume remains the same as that of the input after applying the pooling operation.
Figure 2.5: Example of a pooling layer for a 2-dimensional input volume
The spatial dimensions of the output volume after passing the input through the pooling layer are given as:

OW = (IW − FW) / S + 1   (2.9)

OH = (IH − FH) / S + 1   (2.10)

where OW and OH are the spatial width and height of the output volume respectively, IW and IH are the spatial width and height of the input volume respectively, FW and FH are the spatial width and height of the pooling filter and S is the stride. Note that the depth of the output volume is unchanged by the pooling layer.
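The following sketch implements non-overlapping max pooling on a single 2-D depth slice, consistent with equations (2.9) and (2.10) for the common case where the stride equals the filter size.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling of a 2-D slice (eqs. 2.9 / 2.10, S = FW = FH)."""
    h, w = x.shape
    oh, ow = h // size, w // size            # output height and width
    trimmed = x[:oh * size, :ow * size]      # drop edge pixels that don't fit
    return trimmed.reshape(oh, size, ow, size).max(axis=(1, 3))
```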
An example of a convolutional neural network for a classification task using supervised learning is shown in the figure below.
Figure 2.6: Example of a convolutional neural network for a classification task
2.6 Convolutional Neural Networks for Segmentation
For segmentation problems, each pixel of the image is classified and assigned a label, unlike ordinary classification problems where the whole image is assigned a single label. In such cases, a sliding window approach is used: a fixed-size window around a given pixel is fed to the convolutional neural network as input, and the pixel in the center of that window is classified. The window then slides to the neighboring pixel to classify the next one. The image is said to be segmented when every pixel of the image has been assigned a label.
Figure 2.7: Sliding window approach for segmentation
This is a time-consuming approach, since every pixel in the image must be classified individually rather than the entire image at once; consequently, the algorithm can be trained on far fewer images in a given amount of time.
Figure 2.8: Axial, Sagittal and Coronal views of T1 modality of a high-grade glioma patient
For three-dimensional images, for example the brain image scans which are the main focus of this work, two approaches have been used to train the model for segmentation; a small patch-extraction sketch follows their descriptions below.

In the first approach, the input image is processed slice by slice and a 2-dimensional patch is created for each pixel in the axial (X-Y) plane, as described earlier. This patch is used as an input for training and for classifying that pixel into a label. In later sections of this thesis, this type of patch is referred to as a 2-D input patch.

In the second approach, the input image is also processed slice by slice. However, in this case three patches are created for each pixel, one each in the axial, sagittal and coronal (X-Y, Y-Z and X-Z respectively) planes, as shown in figure 2.8. During training, the three patches corresponding to a single pixel are treated as three different training samples with the same label. During classification, labels for all three patches are obtained for the same pixel, and the pixel is assigned the label that occurs most often among the three. In later sections of this thesis, this type of patch is referred to as a 3-D input patch.
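The sketch below extracts the three orthogonal patches of the second approach for one voxel. The patch size and the (z, y, x) volume layout are assumptions of this illustration, not the configuration used later in this thesis.

```python
import numpy as np

def orthogonal_patches(volume, z, y, x, half=16):
    """Extract axial, sagittal and coronal patches centred on voxel (z, y, x).

    volume is a 3-D array indexed as (z, y, x); each returned patch is a
    (2*half) x (2*half) window in one of the three planes. Assumes the
    centre lies at least `half` voxels from the volume border.
    """
    axial    = volume[z, y - half:y + half, x - half:x + half]   # X-Y plane
    sagittal = volume[z - half:z + half, y - half:y + half, x]   # Y-Z plane
    coronal  = volume[z - half:z + half, y, x - half:x + half]   # X-Z plane
    return axial, sagittal, coronal
```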
Chapter 3
Related Work
There has been a significant surge of interest in solving the problem of automated brain tumor segmentation in the past few decades, as is evident from the number of publications in this area [24, 38, 7, 27, 48, 49]. This observation motivates the need for automatic brain tumor segmentation tools and underlines the fact that this is still an active area of research. Most of the work devoted to this problem can be categorized into one of two broad families: generative models and discriminative models.
3.1 Generative Models
Approaches based on generative models make use of detailed prior information about the appearance and spatial distribution of healthy and of different tumorous tissue types. These types of models usually achieve good generalization on unseen images and represent the state of the art for many brain tissue segmentation tasks [45, 23, 37, 36, 22, 14, 1]. However, encoding prior knowledge for tumors is difficult, because lesions vary significantly from patient to patient and growing tumorous tissues tend to displace normal tissues, which limits the spatial prior knowledge about the healthy part of the brain. Spatial priors for tumors are often derived either from the appearance of tumor-specific bio-markers or by modelling tumors as outliers relative to the expected shape and healthy tissues of the brain. Most techniques based on generative models rely on accurate alignment between the images to be segmented and the spatial priors, which often poses problems in the case of large lesions. Also, generative models do not allow for modelling differences between the biological processes observed in different modalities, which is useful for categorizing tumorous tissues.
The work done by Prastawa, Marcel, et al. [37] is an example of a generative model approach based on outlier detection. Their segmentation method consists of three main stages: first, detecting abnormal regions where intensities deviate from expectation; second, determining whether the abnormal regions are composed of both tumor and edema; and third, determining proper sample locations using spatial and geometric prior information. The method makes use of a probabilistic brain atlas obtained by sampling specific regions of the brain. The subject image data to be segmented is then aligned with this atlas to detect abnormality. This method, however, does not capture other information such as curvature or brain asymmetry, apart from suffering from the difficulty of capturing prior knowledge about the shape, size and localization of the glioma. It also fails to handle cases where there is a large deformation in the brain structure of the subject.
In the work done by Menze, Bjoern H., et al. [31], the authors propose a fully automatic generative model to segment tumors in multi-modal images that provides channel-specific segmentation. In their approach, they model the normal state of the healthy brain using a spatially varying probabilistic prior for each of the tissue classes; this atlas is derived from prior examples. They model the tumor state using a spatially varying latent probabilistic atlas and define a latent tumor state for each modality, which indicates the presence of tumor at a particular voxel of that modality using a Bernoulli distribution.
Figure 3.1: Generative model proposed by Menze, Bjoern H., et al. [31]
Figure 3.1 shows the graphical diagram of the model proposed by Menze et al. Voxels are indexed with i and modalities are indexed with c. The prior π_k, determined from prior samples, is used to determine the label k of normal healthy tissue. The latent atlas α is used to determine the modality-specific presence of tumor t. The normal state k, the tumor state t (modelled using a Bernoulli distribution) and the intensity distribution parameters θ are jointly used to determine the multi-modal image observations y. The goal of the segmentation process is to estimate the tumorous tissue, given by p(t_i^c | y), and the healthy tissue, given by p(k_i | y).
3.2 Discriminative Models
The second category of algorithms makes use of discriminative models to segment brain tumors [8, 9, 20, 41, 40]. In this approach, the difference between the appearance of the lesion and that of normal tissues is learnt directly. These models do not rely on spatial priors on size, location and shape for segmentation. However, substantial amounts of training data are required to train discriminative models [6, 28, 46, 19, 16, 26, 48]. These approaches require the extraction of a large number of low-level and high-level features from training images, and a direct relationship between these features and the label of a given voxel is modelled. Some of the features that have been used to develop discriminative models are raw or preprocessed input pixel values, texture features such as Gabor filterbanks, and alignment-based features such as inter-image gradient, region shape difference and symmetry analysis. Hand-engineering features from images is often a difficult task, and the difficulty increases for the problem of glioma segmentation, because the intensity gradients are often very smooth.
Nithyapriya, G., et al. [33] have explored the use of AdaBoost and SVM classifiers in their work. As the first step, features such as fractal features, textons and intensities are extracted from the different modalities of the MRI scan, and different combinations of the feature sets are explored for tumor segmentation to find the best set. Features are then fed to the AdaBoost classifier for the classification of tumorous and non-tumorous regions. They also experimented with the support vector machine classifier, which searches for an optimal separating hyperplane between members and non-members of a label in a high-dimensional feature space, using the one-against-all strategy for multi-class classification. For the task of brain tumor segmentation, the authors found the Radial Basis Function to be the best kernel for the SVM classifier, due to its ability to map vectors non-linearly into a very high-dimensional feature space. In the work done by Madheswaran, M., and D. Anto Sahaya Dhas [29], the authors evaluated the performance of an SVM classifier with a GRBF kernel. They selected optimized features using a genetic algorithm along with joint entropy; these are contrast, homogeneity of the image, entropy, correlation, energy, maximum probability, inverse difference moment, variance, auto-correlation, directionality and coarseness. They also extracted additional features using a penalized fuzzy C-means algorithm: cluster shade, cluster prominence, inertia and cluster tendency.
Much work has been done using decision forests as the classifier for segmenting brain tumors, because they are capable of producing maximum-margin boundaries, are resistant to overfitting and perform intrinsic feature selection. In the work done by Bianchi, Alberto, et al. [2], the authors used an ensemble of multiple separately trained decision trees, where each tree differs due to randomness injected during the training of each node by selecting a random subset of features and thresholds. In their approach, they used long-range features such as those based on texture and symmetry. In their results, the authors observed that features based only on pixel intensities yield many false positives in the segmentation, while adding both texture- and symmetry-based features resulted in fewer false positives and fewer false negatives. The winning methods of the BraTS 2012 challenge used decision forests as their central classifier.
3.3 Recent Work using Deep Convolutional Networks
Deep neural networks have recently attracted increasing attention from researchers due to their state-of-the-art performance on several image datasets such as ImageNet and CIFAR-10 [25]. As explained in the earlier section, CNNs are an efficient and effective class of models for computer vision that have been shown to learn and extract visual features and to generalize well across many tasks. Deep neural networks have proved successful in many segmentation problems as well [4, 43, 34]. One example of a successful application of deep convolutional neural networks to segmentation is natural scene labelling. For this task, the input to the CNN is a patch from the image consisting of red, green and blue channels. In the work done by Pinheiro and Collobert [35], the authors use a basic convolutional network to make predictions for each pixel, and they propose to improve the results by using the predictions of the first network as an additional input to a second CNN model. They also propose to have several different CNN models processing the image at different resolutions and to integrate information from all the learnt CNNs to make a final prediction. In the medical domain, however, not much use of deep learning has been explored.

Roth, Holger R., et al. [39] present an example of the successful use of deep convolutional neural networks to detect sclerotic spine metastases. With their architecture, they achieved a drastic increase in recall, from the previous state-of-the-art of 0.79 to 0.92. They used a sliding window approach over the input dataset to detect the presence of metastases in a window. Several architectures based on deep convolutional neural networks have been proposed to tackle the problem of brain tumor segmentation. The works of Urban, G., et al. [44] and Havaei, Mohammad, et al. [17], in 2014 and 2015 respectively, have been shown to outperform the existing state-of-the-art discriminative models based on decision forests. In a later section of this work, the novel architecture and different previously proposed architectures based on deep neural networks, together with their results, are discussed.
Chapter 4
Methodology
4.1 Pre-processing
Compared to standard machine learning algorithms, deep convolutional networks require relatively little pre-processing of the images. Every image is cropped or scaled to the same resolution so that it has a fixed size that can be fed to the input layer of the network. Every modality of each subject is standardized to zero mean and unit standard deviation, and the pixel intensity values are then re-scaled to have a minimum value of zero and a maximum of one.
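This per-modality standardization and re-scaling can be expressed in a few lines. The sketch below assumes the modality is given as a NumPy array and ignores the background masking that a full pipeline would need.

```python
import numpy as np

def normalize_modality(img):
    """Standardize to zero mean / unit std, then re-scale into [0, 1]."""
    img = (img - img.mean()) / (img.std() + 1e-8)   # zero mean, unit std
    img = img - img.min()
    return img / (img.max() + 1e-8)                 # min 0, max 1
```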
MRI images collected from different MRI scanners and under different hardware configurations often suffer from magnetic field inhomogeneities. This inhomogeneity, due to the bias field, reduces the high-frequency content of the image, such as edges and contours, and changes the intensity values of image pixels, so that the same tissue has a different gray-level distribution across the image. This reduces the performance of most image processing algorithms, especially those based on the assumption of spatial invariance of the processed image. In the approach of this work, ITK's N4 bias field correction algorithm has been used on the T1 and T1C modalities to correct the inhomogeneities due to the bias field of the MRI machine. Bias field correction is often not applied to the T2 and FLAIR modalities, as it can attenuate the intensity of the tumor when the tumor region is large, especially at its centre.
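In practice, the N4 algorithm is available through SimpleITK, ITK's simplified Python interface. A minimal invocation might look as follows; the file names are placeholders, and Otsu thresholding is just one common way to obtain a foreground mask.

```python
import SimpleITK as sitk

# Read a T1 or T1C volume (file name is a placeholder).
image = sitk.ReadImage("subject_T1C.mha", sitk.sitkFloat32)

# A coarse foreground mask; Otsu thresholding is one common choice.
mask = sitk.OtsuThreshold(image, 0, 1)

# Apply N4 bias field correction and save the result.
corrected = sitk.N4BiasFieldCorrection(image, mask)
sitk.WriteImage(corrected, "subject_T1C_n4.mha")
```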
Several other pre-processing techniques have been tried by different research groups working
towards brain segmentation. Some groups removed the top 1% highest and lowest intensities
after normalizing the image. The work of [42] used an additional pre-processing
step known as histogram matching. Typically, MRI scans taken from different MRI machines
vary significantly in terms of minimum and maximum pixel intensity values, that is, in terms
of mean and standard deviation of intensities, which can degrade the performance of
the image processing algorithm on unseen images. To tackle this problem, the pixel intensities
of each input image are normalized with respect to a reference image to have the same minimum
and maximum intensities and similar means and standard deviations across all images before
training and segmentation. This process of histogram matching can be easily implemented
using a tool known as 3DSlicer.
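Besides 3DSlicer, histogram matching can also be scripted; the sketch below uses SimpleITK's histogram matching filter as one possible implementation (the file names and parameter values are illustrative assumptions, not taken from the thesis).

    import SimpleITK as sitk

    moving = sitk.ReadImage("subject02_T1.mha", sitk.sitkFloat32)
    reference = sitk.ReadImage("reference_T1.mha", sitk.sitkFloat32)

    matcher = sitk.HistogramMatchingImageFilter()
    matcher.SetNumberOfHistogramLevels(1024)
    matcher.SetNumberOfMatchPoints(7)
    # Exclude background voxels when building the histograms
    matcher.ThresholdAtMeanIntensityOn()

    matched = matcher.Execute(moving, reference)
    sitk.WriteImage(matched, "subject02_T1_matched.mha")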
4.2 CNNs for Feature Extraction and Classification
With artificial neural networks, it is possible to implement both supervised and unsupervised
learning algorithms. To perform classification directly as the output of the network,
typically a softmax layer is used, whose outputs can be interpreted as probabilities
for the different classes. Another approach is to use neural networks as auto-encoders.
An auto-encoder is implemented by having an output layer of the same size as the input layer,
and the network is trained by minimizing the input reconstruction error. This is generally
used to reduce the dimensionality of the input and obtain a good input representation by
taking the outputs of intermediate layers of the trained auto-encoder. This representation
can be further used as an input to train other supervised classifiers such as SVMs. The
benefit of this type of model is that it is much more efficient to train multiple models
using a one-against-one or one-against-all strategy for multi-class classification. This class of
architectures, where the output of one classifier is used as an input to train other classifiers,
is known as cascaded architectures. [47] and [18] have shown significant gains in their results
by using representations from the convolutional network as an input to a supervised
classifier. In the work of this thesis, both types of architectures are evaluated for their performance.
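To make the auto-encoder idea concrete, here is a minimal sketch using the Keras API of TensorFlow (a modern stand-in for the Caffe/TensorFlow code of 2016; the layer sizes are illustrative assumptions). The encoder's output can then feed an SVM or a softmax layer as described above.

    import tensorflow as tf

    # Flattened 4-modality 34x34 patch as input (size is an assumption)
    inputs = tf.keras.Input(shape=(4 * 34 * 34,))
    # Encoder: a compact representation of the input
    code = tf.keras.layers.Dense(512, activation="relu")(inputs)
    # Decoder: reconstruct the input from the code
    recon = tf.keras.layers.Dense(4 * 34 * 34, activation="sigmoid")(code)

    autoencoder = tf.keras.Model(inputs, recon)
    # Train by minimizing the input reconstruction error
    autoencoder.compile(optimizer="sgd", loss="mse")
    # autoencoder.fit(patches, patches, epochs=10)  # unlabeled patches

    # Intermediate-layer outputs serve as the learned representation
    encoder = tf.keras.Model(inputs, code)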
4.3 Network Initialization
One major challenge with any machine learning algorithm trained using gradient descent
is to avoid poor local minima during training. With artificial neural networks this is a big
problem because the number of parameters in the network is generally very large. One
approach generally used to mitigate this problem is a good network initialization. To
achieve this, the intermediate layers of the neural network are first trained using an auto-encoder.
This provides a good weight initialization of the layers, and the network is then
trained in a supervised fashion using a softmax output layer, or its representations are used as
input to train another supervised classifier. Erhan, Dumitru, et al. [13, 12] have reported
significant gains in accuracy with this approach. In the work of this thesis, the layers of the
convolutional neural network are first trained using an auto-encoder. Once the network is
trained, domain-specific fine-tuning of the layers can be done to specifically learn about brain
tumor segmentation.
4.4 Regularization to Prevent Overfitting
Over-fitting generally occurs because of increased model complexity; that is, the model has
more parameters than are required to capture the patterns in the data. However, any
successful convolutional neural network usually contains several layers, which results in a
large number of parameters. Thus, such models are prone to over-fitting if the amount of
training data is not large enough. While tackling the problem of brain tumor segmentation,
since the training data was not large enough, regularization was found to be very important
in obtaining good results.

Regularization is usually applied in a machine learning algorithm to keep the variance of
the model low by bounding the absolute values of parameters in the network. This can be
achieved by adding norm(s) of the weights to the loss function. Another way to keep the model
variance low is known as Dropout, which can be applied to algorithms based on artificial
neural networks. With dropout, a few neuron units from the network are dropped randomly
during training to enforce low variance of the model. It has been shown in the works of
Havaei, Mohammad, et al. [17] and Davy, A., et al. [10] that the best results for brain tumor
segmentation are obtained with both L1 and L2 regularization along with dropout. In the
work of this thesis, the same combination is used to avoid overfitting.
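A minimal sketch of this combination in Keras follows; the fully connected layers stand in for the convolutional ones purely for brevity, and the regularization constant matches the value reported later in Section 5.3.

    import tensorflow as tf

    # L1 and L2 penalties on the weights, lambda = 10^-5 (see Section 5.3)
    reg = tf.keras.regularizers.l1_l2(l1=1e-5, l2=1e-5)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu",
                              kernel_regularizer=reg,
                              input_shape=(4 * 34 * 34,)),
        # Randomly drop half of the units during training
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(5, activation="softmax"),
    ])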
4.5 Data Augmentation
As discussed in previous sections, often large amounts of training data is required to train
deep neural networks in order to avoid overtting. One way to increase the available amount
of data for training is to make label preserving transformation of the existing input data and
add it to the training set. While dealing with the problem of brain tumor segmentation, this
becomes important since data from just 373 subjects is available. Also, the fact the tumors
can vary signicantly in shape, size, grade and location, it makes it feasible to add training
data by making transformations on the images. Transformations such as translation, scaling,
rotation, horizontal and vertical re
ections are often used to increase the amount of available
data for training and improves the network's generalizing ability. These transformed images
used as additional data for training makes the network relatively invariant to the alterations
in the image to be predicted from the images used for training. In the work done by Cheng,
Jun, et al. [5] for classifying brain tumors into three categories - meningioma, glioma and
pitutary tumor, he shows a signicant gain in accuracies using augmentation.
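A sketch of such label-preserving transformations on a single (channels, height, width) patch is shown below; since the training label is the class of the centre pixel, reflections and in-plane rotations about the centre preserve it. The exact set of transformations used in this work is an assumption here.

    import numpy as np
    from scipy.ndimage import rotate

    def augment_patch(patch):
        """Yield label-preserving variants of a (channels, H, W) patch."""
        yield patch
        yield np.flip(patch, axis=2)   # horizontal reflection
        yield np.flip(patch, axis=1)   # vertical reflection
        for angle in (90, 180, 270):
            # In-plane rotation; order=0 leaves intensity values unchanged
            yield rotate(patch, angle, axes=(1, 2), reshape=False, order=0)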
4.6 Balancing the Training Set and Two-Phase Training
When dealing with any classification problem where the numbers of samples belonging to
different categories are skewed, a challenge is faced during training. If input samples are
chosen at random in this case, the network usually tends to predict every unseen input as the
category that has been observed most during training. This happens because the parameters
of the network get biased towards the category that has been observed most while training.
In order to solve this problem, a balanced set is used for training, containing equal or almost
equal numbers of samples from each category. This lets the network learn the patterns in the
data, but leaves the network with no prior information about the data. This is crucial when
dealing with the problem of brain tumor segmentation: the number of tumorous voxels in the
image is far smaller than the number of healthy voxels. In most previous works [10, 17], and
in the work of this thesis, the network is first trained using balanced training data to capture
differences in the intensity patterns of healthy and tumorous tissues. In the second phase of
the training, an unbalanced (random) set of input samples is used for training to capture the
prior information. In this phase, the parameters of only the softmax layer are updated, so that
the prior information is captured in its final values, which are interpreted as class probabilities.
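The sketch below illustrates phase-one balanced sampling; the `samples_by_class` structure and the sampling-with-replacement choice are assumptions for illustration.

    import numpy as np

    def balanced_batch(samples_by_class, batch_size):
        """Draw (almost) the same number of samples from each class.
        `samples_by_class` maps each class label to a list of
        (patch, label) training pairs."""
        per_class = batch_size // len(samples_by_class)
        batch = []
        for label, samples in samples_by_class.items():
            # Sample with replacement so rare classes can still fill a batch
            idx = np.random.choice(len(samples), per_class, replace=True)
            batch.extend(samples[i] for i in idx)
        np.random.shuffle(batch)
        return batch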
4.7 Network Configurations
A convolutional neural network for brain tumor segmentation can be trained using either
two-dimensional or three-dimensional structures, depending on the choice of experiment.
Two-Dimensional Networks
With two-dimensional networks, the input to the network is a two-dimensional patch around a
coordinate in a certain plane (axial, coronal or sagittal, typically axial). This approach has
been shown to work well in previous works carried out by Havaei, Mohammad, et al. [17]
and Davy, A., et al. [10]. Each filter of a convolutional layer is two-dimensional, that is, it has a
height and a width. It is important to note that the filter parameters are shared across all the
overlapping regions in the input patch fed to the network. In the work of Havaei, Mohammad,
et al. [17], the authors have used patches for each coordinate only in the axial plane, based
on the fact that the MRI scan has the highest resolution in this view. An ensemble of two-dimensional
networks can also be used when dealing with three-dimensional data. In this case, three
different networks are trained, one each for inputs from the axial, sagittal and coronal planes.
Another possible approach is to train a single deep CNN with inputs from all three views.
The work of this thesis evaluates results obtained by considering patches from just the axial
plane and by training the same network using inputs across all three planes. The results are
compared in a later chapter.
Three-Dimensional Networks
Figure 4.1: A 3D convolutional neural network designed by Ji, Shuiwang, et al. [21] for
human action recognition
When dealing with three-dimensional networks, the input is a three-dimensional fixed-size volume
around a coordinate. In this case, each filter is three-dimensional, with an extent in each of
the X-Y, Y-Z and X-Z planes. There can be multiple such filters in the convolutional layer of a 3-D
network. Each filter of a pooling layer needs to be three-dimensional as well: it takes in
a 3D volume as input and produces another 3D volume as output, depending on
the pooling function. It is important to note that the parameters of each filter are shared across
all the overlapping three-dimensional cubes in the input volume, across all coordinates in
the MRI image. The number of parameters associated with a 3D network is usually large,
which makes it prone to overfitting. So, this approach usually needs large amounts of
data for training. In the work carried out by Havaei, Mohammad, et al. [17], it was
observed that there are no significant gains in dice scores with the use of 3D networks. On
the contrary, they impose the overhead of longer training time and larger storage requirements
for the network. In this thesis, results using three-dimensional networks have not been evaluated.
4.8 Network Architectures
The network architecture used in the work of this thesis is mostly based on the
work carried out by Havaei, Mohammad, et al. [17]. In their work, the authors have explored a
variety of architectures to deal specifically with the problem of brain tumor segmentation.
Two-Pathway Architecture
The main idea behind two-pathway architectures is to depth-wise concatenate the feature
maps from different layers when creating the convolutional neural network. This operation
provides the ability to explore architectures with different computational paths. In the two-pathway
architecture, the input image is convolved with one or more larger filter(s) in
one computational path, and the same input is convolved with multiple smaller filters in the
other path. At the point where the output volumes from these two paths have the same spatial
dimensions, the outputs are concatenated and fed as input to the following part of the
network. The intuition behind this type of architecture is that the pathway with
multiple smaller filters captures more of the visual detail around the pixel, while the computational
pathway with larger filters captures more of the global context. This architecture
for the problem of brain tumor segmentation was proposed by Havaei, Mohammad, et
al. [17] and is shown in the figure below. They refer to this architecture as TwoPathCNN.
Figure 4.2: Two pathway CNN proposed by Havaei, Mohammad, et al. [17]
Integrated Learning from Different-Sized Patches
This architecture is motivated by the work of Davy, A., et al. [10] for brain tumor
segmentation. Their proposed architecture is shown in Figure 4.3.
Figure 4.3: Network architecture proposed by Davy, A., et al. [10] for brain tumor segmentation
The first computational pathway uses the entire patch, while the other pathway receives a
spatially cropped patch as input. Outputs from these two pathways are depth-wise appended
and used as inputs to the final layer or intermediate layers of the network. The motivation
behind this architecture is intuitive. Since the tumor itself is divided into four different
categories, capturing the visual details of the pixels in its close vicinity becomes important.
The computational pathway that uses the entire patch captures more of the global context,
and the second pathway that uses a smaller window of the patch captures more of the visual
detail of the region around the pixel.
Cascaded Architectures
Cascaded architectures are the second class of CNN architectures proposed by Havaei, Mohammad,
et al. [17]. When dealing with any segmentation problem, the label of the current
pixel may depend significantly on the labels of the neighboring pixels. Conventional CNNs fail
to capture this information while training and making predictions. One possible approach
for taking this information into consideration is to post-process the predicted segmentation
using a probabilistic graphical model to produce the final segmentation. This process is
explained in detail in a later section. In the second approach, two separate CNNs are
trained in a supervised fashion. The probability outputs for a fixed-size neighborhood of the pixel
under consideration are depth-wise appended to a computational path of the second CNN.
The main idea behind this approach is to capture the output probability information of the
first network and use it in the second CNN to make smoother predictions. Depending on
the level at which the predicted probabilities from the first network are appended to the
computational path of the second CNN, three types of cascaded architectures have been proposed
by Havaei, Mohammad, et al. [17].
Input Cascaded CNN: In this case, the output probabilities of the neighborhood from the
first convolutional neural network are directly appended depth-wise to the input patch
of the second network. Concatenating the probabilities depth-wise to the input of the
second network can be interpreted as adding channels to the input patch. The input
cascaded CNN proposed by Havaei, Mohammad, et al. [17] for segmenting brain
tumors is shown in the figure below.
Figure 4.4: Input cascade CNN as proposed by Havaei, Mohammad, et al. [17]
Local Pathway Cascaded CNN: In this case, the output probabilities of the neighborhood
obtained from the predictions of the first CNN are depth-wise concatenated to
the input of intermediate hidden layers of the second convolutional neural network.
The figure below shows the local pathway cascaded network proposed by Havaei, Mohammad,
et al. [17].
Figure 4.5: Local pathway cascade CNN as proposed by Havaei, Mohammad, et al. [17]
Pre-output Cascade CNN: In this case, the output probabilities of the neighborhood
obtained from the predictions of the first CNN are depth-wise concatenated right before
the output layer or the fully connected layer of the second convolutional neural
network. The computations made by this type of architecture are similar to the computations
made by one iteration of mean-field inference, which is used for estimating the
parameters of CRFs. However, this approach differs from a mean-field iteration in the
sense that the output for the pixel is influenced by its previous prediction, and the convolutional
filters used for making predictions are different in the two CNNs. The figure below shows
the pre-output cascaded network proposed by Havaei, Mohammad, et al. [17].
Figure 4.6: Pre-output or mean-field cascade CNN as proposed by Havaei, Mohammad, et al. [17]
Architecture Trained in the Work of This Thesis
Apart from evaluating the existing architectures, the network trained in the work of this
thesis is an Input Cascade CNN. The first network of this cascaded architecture contains
two computational paths: one for the entire input patch and the other for a smaller cropped
window around the pixel. The first computational pathway, which uses the entire input patch,
itself uses a two-pathway architecture. The outputs from the two paths are depth-wise
concatenated before the output layer. The architecture of the first CNN of the cascaded
network is shown in Table 4.1.
Global context path (input patch: 4*34*34, itself a two-pathway architecture):
    Sub-path 1: Conv 160*14*14 -> output: 160*21*21
    Sub-path 2: Conv 64*8*8, Pool 4*4 -> output: 64*24*24
                Conv 64*3*3, Pool 2*2 -> output: 64*21*21
    Depth-wise concatenation -> output: 224*21*21
    Conv 360*13*13, Pool 2*2 -> output: 360*8*8
Local detail path (input patch: 4*34*34):
    Spatial crop layer 8*8 -> output: 4*8*8
Depth-wise concatenation of the two paths -> output: 364*8*8
Conv 420*4*4 -> output: 420*5*5
Conv 500*3*3 -> output: 500*3*3
Conv 5*3*3, Softmax -> output: 5*1*1 (probabilities of 5 different classes)
Table 4.1: Network architecture of the first CNN of the cascaded network
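As a cross-check of the dimensions in Table 4.1, the following is a hedged Keras sketch of this first CNN; it uses the modern Keras API rather than the Caffe/TensorFlow code actually used, and valid (unpadded) convolutions, stride-1 pooling, and ReLU activations are assumptions.

    import tensorflow as tf
    from tensorflow.keras import layers

    # Input: a 34x34 patch with 4 modalities (channels-last layout)
    inp = tf.keras.Input(shape=(34, 34, 4))

    # Global sub-path: one large 14x14 filter bank        -> 21x21x160
    big = layers.Conv2D(160, 14, activation="relu")(inp)

    # Local sub-path: smaller filters with stride-1 pooling
    small = layers.Conv2D(64, 8, activation="relu")(inp)    # 27x27x64
    small = layers.MaxPool2D(pool_size=4, strides=1)(small) # 24x24x64
    small = layers.Conv2D(64, 3, activation="relu")(small)  # 22x22x64
    small = layers.MaxPool2D(pool_size=2, strides=1)(small) # 21x21x64

    # Depth-wise concatenation of the two sub-paths        -> 21x21x224
    x = layers.Concatenate()([big, small])
    x = layers.Conv2D(360, 13, activation="relu")(x)        # 9x9x360
    x = layers.MaxPool2D(pool_size=2, strides=1)(x)         # 8x8x360

    # Second path: the central 8x8 spatial crop of the raw input
    crop = layers.Cropping2D(cropping=13)(inp)              # 8x8x4

    x = layers.Concatenate()([x, crop])                     # 8x8x364
    x = layers.Conv2D(420, 4, activation="relu")(x)         # 5x5x420
    x = layers.Conv2D(500, 3, activation="relu")(x)         # 3x3x500
    out = layers.Conv2D(5, 3, activation="softmax")(x)      # 1x1x5

    model = tf.keras.Model(inp, out)

Every intermediate shape in the sketch matches the corresponding row of Table 4.1, which confirms the reconstructed table layout.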
The second CNN of the Input Cascade architecture also uses a two-pathway architecture
over the entire input patch. A spatial map of the output probabilities of the pixel and its
neighboring pixels from the first CNN is depth-wise appended to the input image patch.
This technique essentially lets the predictions for neighboring pixels influence the label
of the pixel under consideration. The architecture developed in the work of this thesis is
based on the Input Cascade CNN because the work of Havaei, Mohammad, et al. [17] observed
the best results with it. The architecture of the second CNN is shown in Table 4.2.
Input patch: 4*34*34
Output probabilities of the patch from the first CNN: 1*34*34
Depth-wise concatenation -> concatenated input patch: 5*34*34
Global context path (concatenated input patch: 5*34*34):
    Sub-path 1: Conv 160*14*14 -> output: 160*21*21
    Sub-path 2: Conv 64*8*8, Pool 4*4 -> output: 64*24*24
                Conv 64*3*3, Pool 2*2 -> output: 64*21*21
    Depth-wise concatenation -> output: 224*21*21
    Conv 360*13*13, Pool 2*2 -> output: 360*8*8
Local detail path (concatenated input patch: 5*34*34):
    Spatial crop layer 8*8 -> output: 5*8*8
Depth-wise concatenation of the two paths -> output: 365*8*8
Conv 420*4*4 -> output: 420*5*5
Conv 500*3*3 -> output: 500*3*3
Conv 5*3*3, Softmax -> output: 5*1*1 (probabilities of 5 different classes)
Table 4.2: Network architecture of the second CNN of the cascaded network
CHAPTER 4. METHODOLOGY 37
4.9 Training the Network
All input images for training and testing are rst pre-processed using bias ltering algorithm
on T1 and T1C modalities. The image is then normalized to have zero mean and one stan-
dard deviation and scaled to have a minimum value of zero and maximum value of one. The
next step is to generate training data which is done using sliding window approach. A xed
sized patch around each coordinate is used as an input sample and label for training is the
one of the center pixel of the input patch.
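A minimal sketch of this sliding-window patch extraction is given below; the array layout and the absence of boundary padding are simplifying assumptions.

    import numpy as np

    def extract_patch(volume, z, y, x, size=34):
        """Return the size x size axial patch over all 4 modalities
        centred near (y, x) in slice z; `volume` is a (4, Z, H, W)
        array. Boundary handling (padding) is omitted for brevity."""
        half = size // 2
        return volume[:, z, y - half:y + size - half, x - half:x + size - half]

    # The training label for this sample is the ground-truth class of
    # the centre voxel (z, y, x).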
The deep CNNs are trained using stochastic gradient descent, and the parameters of the
different layers are updated using the back-propagation learning algorithm. The parameters
of the layers are initialized by conducting unsupervised pre-training on the layers of the network,
minimizing the input reconstruction error. This helps in achieving a good weight
initialization, which can be crucial to avoid poor local minima during optimization. When working
with cascaded architectures, the first CNN is trained in a supervised fashion. After the first
CNN is trained, it is used to predict the output probabilities of all pixels, which are depth-wise
appended to the input of the second network; the second network is also trained in a supervised
manner. The networks are first trained using a balanced set of input batches, and in the second
phase of training, the parameters of only the softmax layers are updated to capture the prior
information.
4.10 Post-Processing
When working with segmentation problems using deep learning, the labels predicted for each
pixel often do not depend on the labels of the surrounding neighborhood. This is usually not
true in reality: the label of a pixel depends on the labels of its neighborhood in most cases. One
possible way to capture this without an additional step is to make use of cascaded
architectures. The results of [17] show a significant improvement in dice scores using the Input
Cascaded network. Another possible way to handle this problem is to smooth the segmentation
from the deep CNN using a probabilistic graphical model such as Conditional Random
Fields (CRFs). Smoothing the segmentation using CRFs can be interpreted as minimizing
the energy function of the CRF, which is defined as follows:
E(l) = \sum_{p \in P} D_p(i_p, l_p) + \sum_{(p,q) \in N} V_{p,q}(l_p, l_q)    (4.1)
where E(l) denotes the energy function, P denotes the set of all pixels in the image,
i is the observed data, l denotes the labelling, and N denotes the set of pairs of adjacent
pixels. The first term of this equation is called the data term, D_p(i_p, l_p) = (i_p - l_p)^2.
It ensures that the current labelling l is coherent with the observed data i by penalizing a
label l_p of a pixel p if it differs too much from the observed data i_p. The second term is
known as the smoothness term, V_{p,q}(l_p, l_q) = I(l_p \neq l_q), where I is the indicator
function. It aims to make the overall labelling smooth by penalizing neighboring pixels whose
labels differ. Minimizing this function exactly when post-processing MRI scan predictions is
costly both in terms of time and memory. Instead, the fast approximate energy minimization
approach via graph cuts proposed by Boykov, Yuri, Olga Veksler, and Ramin Zabih [3] is
used. The image is post-processed on a slice-by-slice basis from the axial view.
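As a simplified illustration, the sketch below smooths a single axial slice with the PyMaxflow wrapper around the min-cut solver of [3]; it treats the problem as binary (tumor vs. healthy), whereas the five-class problem requires the alpha-expansion moves of [3]. The weighting of the two terms is an assumption.

    import numpy as np
    import maxflow  # PyMaxflow, a wrapper around the solver of [3]

    def smooth_slice(prob_tumor, smoothness=1.0):
        """Graph-cut smoothing of one axial slice. `prob_tumor` holds
        the CNN's per-pixel tumor probability."""
        g = maxflow.Graph[float]()
        nodes = g.add_grid_nodes(prob_tumor.shape)
        # Smoothness term: penalize neighbors taking different labels
        g.add_grid_edges(nodes, smoothness)
        # Data term: terminal capacities from the negative
        # log-probabilities of the two labels
        eps = 1e-8
        g.add_grid_tedges(nodes,
                          -np.log(1.0 - prob_tumor + eps),
                          -np.log(prob_tumor + eps))
        g.maxflow()
        # Boolean label map (polarity follows PyMaxflow's grid convention)
        return g.get_grid_segments(nodes)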
Chapter 5
Implementation Details
5.1 Overview of the Framework
The entire glioma segmentation framework is developed in two modules: an input generation
module and a training/prediction module. The code for image pre-processing and input
batch generation is developed in Python. The learning module is developed and tested with
the CUDA versions of the TensorFlow and Caffe libraries, and the program is executed on a GPU. The
results discussed later in this thesis are obtained by executing the program on an NVIDIA K20
GPU. The two modules are integrated using an asynchronous queue. Data, that is, the input
batches for learning and prediction along with coordinate information, is shared through this queue.
Input Generation Module
The input generation module implements the code for pre-processing the MRI images and
generating the input batches for training or segmentation. Multiple Python threads read images,
apply the pre-processing steps, and generate batches for training and prediction.
Only pixels that have at least one non-zero value in any of the modalities are used to create
input samples. For each subject, information about the coordinates and their labels is stored
in RAM for generating patches for training or segmentation. Each thread generates patches
from the data assigned to it and adds them to an asynchronous queue. The learning module
of the framework consumes batches from this queue to train the model or make predictions.
During segmentation, copies of the trained model are assigned to different GPU processors,
and a separate queue is maintained for each processor holding the input batches for which
predictions are to be made. Each queue receives batches for a certain part of the MRI image
to be segmented. After all processors have completed execution, the results are aggregated
and the final segmentation is generated. This is implemented using the Apache Spark
framework and yields a significant speed-up in the time taken to segment one scan.
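A stripped-down sketch of this producer/consumer design in pure Python follows; `make_batches` and `predict` are hypothetical helpers standing in for the pre-processing and learning code.

    import queue
    import threading

    batch_queue = queue.Queue(maxsize=64)  # shared between the two modules

    def producer(subject_files):
        """Input-generation thread: pre-process subjects and enqueue
        (coordinates, batch) pairs; several such threads run in parallel."""
        for subject in subject_files:
            for coords, batch in make_batches(subject):  # hypothetical helper
                batch_queue.put((coords, batch))

    def consumer():
        """Learning module: consume batches for training or prediction."""
        while True:
            coords, batch = batch_queue.get()
            predict(coords, batch)  # hypothetical helper
            batch_queue.task_done()

    threading.Thread(target=consumer, daemon=True).start()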
Training and Post-processing Module
The learning module consumes input batches from the asynchronous queue for training or
prediction. It is developed using the Python interfaces of the Caffe and TensorFlow libraries
with GPU execution support. The NumPy library is used to create arrays for faster access.
The weight values of each unit in the network are bounded by both L1 and L2 norm
regularization. Dropout with a probability of 0.5 was also used before the output layer to avoid
overfitting. Both networks of the Input Cascaded architecture are trained using stochastic
gradient descent with cross entropy as the loss function. Training happens in two phases.
In the first phase, the network is trained with a balanced set of input batches (each batch contains
almost the same number of samples from each category) to capture patterns from the images. In the
second phase, only the parameters of the softmax layer are updated; this is done by multiplying
the learning rate of all other layers by zero. Segmentation is a distributed process. Each node
maintains its own asynchronous queue, which is filled with the subset of batches for that
process by the input generation module. Each node is also assigned a copy of the learnt
model, and the prediction process is parallelized using the Apache Spark framework. Experiments
in this thesis are conducted using four GPU nodes. The output of segmentation is an
MHA image that can further be used as an input for post-processing if desired. The open source
contribution of [3] is used to implement the CRFs that smooth the segmentation.
5.2 Data Splitting
The dataset used for the work of this thesis is the BraTS 2014 dataset, which consists of 237
high-grade and 55 low-grade training cases with expert annotations available for all. The
dataset was divided into three groups as discussed earlier.

Test Set: A total of 30 cases, 22 from high-grade and 8 from low-grade glioma scans.

Validation Set: A total of 10 cases, 7 from high-grade and 3 from low-grade glioma scans.

Training Set: The remaining 252 cases, 208 from high-grade and 44 from low-grade glioma
scans, were used for training the network(s).
The change in accuracy on the validation set is used as the stopping criterion for training and
is also used to monitor overfitting and underfitting of the network. The results discussed in
the next section are from the 30 cases of the test set.
5.3 Hyper-parameter Tuning
Hyper-parameters are those parameters of a classifier that affect the prediction performance
of a network and have to be determined empirically before starting the learning procedure.
The naive way to identify them is to search exhaustively through the entire hyper-parameter
space. However, this becomes intractable for classifiers based on neural networks because of
the huge number of hyper-parameters involved. Their estimation is usually done using greedy
search and prior knowledge. The following are the hyper-parameters associated with the
learning architecture used in this work:

Network Architecture: The entire structure of the network, which involves the number of
layers, the input dimensionality, the number of units in each layer and the layer connectivity,
is a hyper-parameter and needs to be determined empirically. The network architecture
discussed in this thesis relies mostly on the previous works of Havaei, Mohammad, et al.
[17] and Davy, A., et al. [10], with a few minor modifications.
Learning Rate: The learning rate controls the amount by which the weights and biases in
the network are updated from their older values. In this work, it is initialized to 0.005
and decreased by a constant factor after every epoch.

Momentum: Set to a fixed value of 0.9. This parameter adds momentum to the weight
adjustment, making changes to the weights persist in the same direction for a number of
adjustment cycles.

Learning Rate Decay: Set to a fixed value of 2^{-1}; after every epoch, the learning rate
is halved. During training it is effective to decrease the learning rate as the number of
epochs increases: a smaller learning rate near the optimum helps as the gradient tends to
oscillate less, whereas in the beginning of training a larger learning rate is desired to
approach the optimum at greater speed. (A minimal sketch of this update rule follows
the list.)

Regularization Parameter: The constant factor multiplied with the L1 and L2 regularization
terms in the cost function to avoid over-fitting. It is set to a fixed value of 10^{-5}.
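The sketch below spells out the parameter update implied by the list above (momentum 0.9, learning rate halved each epoch); the exact update form used by the original code is an assumption.

    def sgd_momentum_step(w, v, grad, lr, mu=0.9):
        """One SGD update with momentum: the velocity v keeps the
        change moving in the same direction across adjustment cycles."""
        v = mu * v - lr * grad
        return w + v, v

    lr = 0.005                  # initial learning rate
    # After every epoch the learning rate is halved (decay of 2^-1):
    # lr = lr * 0.5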
Chapter 6
Results
The work of this thesis is mostly dependent on the work carried out by Havaei et al., with
small modifications to their network architecture and an additional post-processing step.
For detailed analysis and comparison, the following table presents the results obtained by
Havaei, Mohammad, et al. [17]. They refer to the conventional CNN as Local Path CNN, the
CNN consisting of only larger filters as Global Path CNN, the model averaging the outputs of
the Local and Global Path CNNs as Average CNN, and the two-pathway architecture as Two
Path CNN. It is important to note that all results in their work have been obtained by
training the model on 30 subjects, 20 from high-grade and 10 from low-grade glioma cases.
They used two-dimensional fixed-size patches only along the axial plane for training and
prediction. Further, the results in the table below are reported before and after (denoted
by *) completing the second phase of the training.
Configuration            Dice Scores (Whole / Core / Enhancing)
Input Cascade CNN*       0.88   0.79   0.73
Output Cascade CNN*      0.86   0.77   0.73
Local Cascade CNN*       0.88   0.76   0.72
Two Path CNN*            0.85   0.78   0.73
Local Path CNN*          0.85   0.74   0.71
Average CNN*             0.84   0.75   0.70
Global Path CNN*         0.82   0.73   0.68
Two Path CNN             0.78   0.63   0.68
Local Path CNN           0.77   0.64   0.68
Table 6.1: Results from the work of Havaei, Mohammad, et al. [17]
The authors of [17] observed several interesting facts from their results. The second phase
of the training (updating the parameters of just the softmax layer) using unbalanced batches
was found to be critical to capture the true distribution of tumorous tissues, thereby improving
the prediction performance. Also, it was observed that having larger convolutional
filters in the network reduced false positives but increased true negatives, especially for the
enhancing-core region. With smaller filters, however, fewer true negatives but more false
positives were observed. The overall performance of the network with smaller filters (Local Path
CNN) was found to be better than that of the one with larger filters (Global Path CNN).
Combined training of the two computational pathways and depth-wise concatenation of their
outputs before the softmax layer (Two Path CNN) out-performed both the Local Path and Global
Path CNNs. This type of architecture updates the parameters of both pathways simultaneously
to capture both local and global context. Further, to smooth the segmentation (to
make the label of a pixel depend on the labels of neighboring pixels), the use of cascaded
architectures is proposed by Havaei, Mohammad, et al. [17]. As seen from the table above, the
Input Cascade network out-performed the Local Cascade and Output Cascade networks.
Dice Scores (Whole / Core / Enhancing)

Input type: 2D input patch, sagittal plane, 34*34
    First CNN (Table 4.1) of the cascaded architecture:           0.89   0.81   0.78
    Post-processing using the Input Cascade architecture (4.2):   0.91   0.81   0.78
    Post-processing using CRF denoising:                          0.93   0.84   0.75

Input type: 2D input patches in sagittal, axial and coronal planes, 34*34
    First CNN (Table 4.1) of the cascaded architecture:           0.91   0.82   0.80
    Post-processing using the Input Cascade architecture (4.2):   0.94   0.83   0.80
    Post-processing using CRF denoising:                          0.96   0.84   0.77

Table 6.2: Results in terms of dice scores for the networks implemented in the work of this
thesis
All numeric results listed in this section are averaged across all the subjects in the test
set. Table 6.2 shows the dice scores of the whole tumor, core tumor and active tumor regions
under different configurations using the convolutional neural network architecture discussed
in the work of this thesis.
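For reference, the dice score reported throughout this chapter can be computed per region as below; the implementation details (binary masks per region, the epsilon guard) are assumptions.

    import numpy as np

    def dice_score(pred, truth):
        """Dice similarity between two binary masks:
        2 * |pred AND truth| / (|pred| + |truth|)."""
        pred = pred.astype(bool)
        truth = truth.astype(bool)
        intersection = np.logical_and(pred, truth).sum()
        return 2.0 * intersection / (pred.sum() + truth.sum() + 1e-8)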
A variation of up to 9% from the average reported above was observed in the dice scores of
individual subjects. From the table, it can be seen that using patches from all three planes
around the coordinate for training a single network results in slightly better dice scores than
those obtained by using patches only along the axial plane. This is intuitive, as better
predictions can be made if features from all three planes are learnt during training. However,
with this approach, the network can be trained on only about one third of the subjects in a
given amount of time, and the time taken to segment one brain almost triples.
When comparing the Input Cascade CNN of this work against the Input Cascade CNN of the
work by Havaei, Mohammad, et al. [17], the slight improvement in dice scores can be attributed
to two factors:

Increased amount of training data: Because of the huge number of parameters involved in
training deep convolutional networks comprising multiple computational pathways, large
amounts of training data are typically required to achieve generalization and prevent
overfitting. The results of Havaei, Mohammad, et al. [17] were obtained by training their
network with scans of 30 patients, as opposed to a total of 252 in this work.

Use of multi-sized input patches: Initial experiments using input patches of different sizes
revealed several interesting facts. Networks that used smaller input patches tended to
produce more false positives and fewer true negatives for tumorous regions. A smaller
region around the pixel was found to be better for identifying the tumor region; however,
it generally resulted in a significant number of normal tissues being classified as tumorous.
Larger input patches were able to capture the global context well, but under-performed when
categorizing the tumorous region into its sub-categories (edema, enhancing core, non-enhancing
core and necrosis). To handle this, the larger patch was used as the input to the first
computational pathway (which uses a two-pathway architecture), and a cropped version of
the larger patch was depth-wise appended to the output of the first path and fed as the
input to the following layer. This led to an increase in dice scores of almost 1% compared
to the work of Havaei, Mohammad, et al. [17] (both networks were trained using the same
training and test data for comparison).
Both networks of the cascaded architecture were trained on GPUs for faster computation,
and unsupervised pre-training of the layers was conducted before supervised training.
It is also interesting to observe that there is no fully-connected layer in the networks trained
in this work: the last layer of the network is a convolutional layer instead of a fully connected
layer. This significantly reduces the number of parameters and connections in the network.
Further, prediction in this work is done in a distributed manner. Each node is assigned a
copy of the learnt model and a subset of the input pixels to be predicted. The results are
aggregated at the end, and a segmentation MHA file is generated as the output. The framework
developed takes less than 90 seconds to segment a brain using the Input Cascade network and
around 190 seconds to post-process the segmentation using CRF denoising.
Chapter 7
Conclusion and Future Work
The work of this thesis investigated the application of modestly sized deep convolutional
networks in an attempt to automate the process of brain tumor detection and segmentation.
Different convolutional architectures were evaluated for performance. The results of this
work show that segmentation performance can be improved by having more data for training
modestly sized neural network architectures. The best model proposed in this work (a multi-patch
two-pathway architecture with post-processing using CRF denoising), trained on the BraTS
2014 dataset, managed to improve on the current state-of-the-art method in accuracy. Also,
with the distributed approach to segmentation, the overall time taken to segment one case was
reduced from the current state of the art of 2 minutes to under a minute when using the Input
Cascade architecture.
The work of this thesis also presents a direction for future work. It was observed that
segmentation performance can be improved by having more training data for modestly
sized network architectures. Training such networks becomes extremely challenging if the data is
too large to be processed by a single node in a given time, or if the network parameters are too
large to fit into the memory of a GPU node. This problem can be tackled by using a cluster
of machines to distribute training and inference in large networks, and it can be achieved in
two ways. The first is to keep a copy of the model at all nodes and update parameters at a central
node that collects gradients from all nodes in the cluster and sends updated parameters back
to all other nodes. The second is to have both model and data parallelism with a separate
coordinator that executes the optimization algorithm, with each node in the cluster executing
commands independently. Using these approaches for training and prediction can allow the
model to be trained with larger amounts of data while keeping the networks modestly sized.
Bibliography
[1] John Ashburner and Karl J Friston. "Unified segmentation". In: Neuroimage 26.3 (2005), pp. 839–851.
[2] Alberto Bianchi et al. "Brain tumor segmentation with symmetric texture and symmetric intensity-based decision forests". In: Biomedical Imaging (ISBI), 2013 IEEE 10th International Symposium on. IEEE. 2013, pp. 748–751.
[3] Yuri Boykov, Olga Veksler, and Ramin Zabih. "Fast approximate energy minimization via graph cuts". In: Pattern Analysis and Machine Intelligence, IEEE Transactions on 23.11 (2001), pp. 1222–1239.
[4] Liang-Chieh Chen et al. "Semantic image segmentation with deep convolutional nets and fully connected CRFs". In: arXiv preprint arXiv:1412.7062 (2014).
[5] Jun Cheng et al. "Enhanced Performance of Brain Tumor Classification via Tumor Region Augmentation and Partition". In: PloS one 10.10 (2015), e0140381.
[6] Dana Cobzas et al. "3D variational brain tumor segmentation using a high dimensional feature set". In: Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on. IEEE. 2007, pp. 1–8.
[7] Jason J Corso et al. "Efficient multilevel brain tumor segmentation with integrated bayesian model classification". In: Medical Imaging, IEEE Transactions on 27.5 (2008), pp. 629–640.
[8] Corinna Cortes and Vladimir Vapnik. "Support-vector networks". In: Machine learning 20.3 (1995), pp. 273–297.
[9] Antonio Criminisi and Jamie Shotton. Decision forests for computer vision and medical image analysis. Springer Science & Business Media, 2013.
[10] A Davy et al. Brain tumor segmentation with deep neural networks. 2014.
[11] Pedro Domingos. "A few useful things to know about machine learning". In: Communications of the ACM 55.10 (2012), pp. 78–87.
[12] Dumitru Erhan et al. "The difficulty of training deep architectures and the effect of unsupervised pre-training". In: International Conference on artificial intelligence and statistics. 2009, pp. 153–160.
[13] Dumitru Erhan et al. "Why does unsupervised pre-training help deep learning?" In: The Journal of Machine Learning Research 11 (2010), pp. 625–660.
[14] Bruce Fischl et al. "Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain". In: Neuron 33.3 (2002), pp. 341–355.
[15] Ian J Goodfellow et al. "Maxout networks". In: arXiv preprint arXiv:1302.4389 (2013).
[16] L Görlitz et al. "Semi-supervised tumor detection in magnetic resonance spectroscopic images using discriminative random fields". In: Pattern Recognition. Springer, 2007, pp. 224–233.
[17] Mohammad Havaei et al. "Brain Tumor Segmentation with Deep Neural Networks". In: arXiv preprint arXiv:1505.03540 (2015).
[18] Geoffrey E Hinton and Ruslan R Salakhutdinov. "Reducing the dimensionality of data with neural networks". In: Science 313.5786 (2006), pp. 504–507.
[19] Sean Ho, Elizabeth Bullitt, and Guido Gerig. "Level-set evolution with region competition: automatic 3-D segmentation of brain tumors". In: Pattern Recognition, 2002. Proceedings. 16th International Conference on. Vol. 1. IEEE. 2002, pp. 532–535.
[20] Juan Eugenio Iglesias et al. "Is synthesizing MRI contrast useful for inter-modality analysis?" In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2013. Springer, 2013, pp. 631–638.
[21] Shuiwang Ji et al. "3D convolutional neural networks for human action recognition". In: Pattern Analysis and Machine Intelligence, IEEE Transactions on 35.1 (2013), pp. 221–231.
[22] Frederik O Kaster et al. "Comparative validation of graphical models for learning tumor segmentations from noisy manual annotations". In: Medical Computer Vision. Recognition Techniques and Applications in Medical Imaging. Springer, 2010, pp. 74–85.
[23] Michael R Kaus et al. "Automated segmentation of MR images of brain tumors". In: Radiology 218.2 (2001), pp. 586–591.
[24] Hassan Khotanlou et al. "3D brain tumor segmentation in MRI using fuzzy classification, symmetry analysis and spatially constrained deformable models". In: Fuzzy Sets and Systems 160.10 (2009), pp. 1457–1473.
[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. "Imagenet classification with deep convolutional neural networks". In: Advances in neural information processing systems. 2012, pp. 1097–1105.
[26] Chi-Hoon Lee et al. "Segmenting brain tumors using pseudo-conditional random fields". In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2008. Springer, 2008, pp. 359–366.
[27] Chi-Hoon Lee et al. "Segmenting brain tumors with conditional random fields and support vector machines". In: Computer vision for biomedical image applications. Springer, 2005, pp. 469–478.
[28] Aaron E Lefohn, Joshua E Cates, and Ross T Whitaker. "Interactive, GPU-based level sets for 3D segmentation". In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2003. Springer, 2003, pp. 564–572.
[29] M Madheswaran and D Anto Sahaya Dhas. "Classification of brain MRI images using support vector machine with various Kernels". In: Biomedical Research 26.3 (2015), pp. 505–513.
[30] Gloria P Mazzara et al. "Brain tumor target volume determination for radiation treatment planning through automated MRI segmentation". In: International Journal of Radiation Oncology* Biology* Physics 59.1 (2004), pp. 300–312.
[31] Bjoern H Menze et al. "A generative model for brain tumor segmentation in multimodal images". In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2010. Springer, 2010, pp. 151–159.
[32] Bjoern H Menze et al. "The multimodal brain tumor image segmentation benchmark (BRATS)". In: Medical Imaging, IEEE Transactions on 34.10 (2015), pp. 1993–2024.
[33] G Nithyapriya and C Sasikumar. "Detection and Segmentation of Brain Tumors using AdaBoost SVM". In: International Journal of Innovative Research in Computer and Communication Engineering 2.1 (2014), pp. 2323–2328.
[34] Nikhil R Pal and Sankar K Pal. "A review on image segmentation techniques". In: Pattern recognition 26.9 (1993), pp. 1277–1294.
[35] Pedro O Pinheiro and Ronan Collobert. "From image-level to pixel-level labeling with convolutional networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 1713–1721.
[36] Kilian M Pohl et al. "A unifying approach to registration, segmentation, and intensity correction". In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2005. Springer, 2005, pp. 310–318.
[37] Marcel Prastawa et al. "A brain tumor segmentation framework based on outlier detection". In: Medical image analysis 8.3 (2004), pp. 275–283.
[38] Marcel Prastawa et al. "Automatic brain tumor segmentation by subject specific modification of atlas priors". In: Academic radiology 10.12 (2003), pp. 1341–1348.
[39] Holger R Roth et al. "Detection of sclerotic spine metastases via random aggregation of deep convolutional neural network classifications". In: Recent Advances in Computational Methods and Clinical Applications for Spine Imaging. Springer, 2015, pp. 3–12.
[40] Sandip Roy et al. "Longitudinal intensity normalization in the presence of multiple sclerosis lesions". In: Biomedical Imaging (ISBI), 2013 IEEE 10th International Symposium on. IEEE. 2013, pp. 1384–1387.
[41] Snehashis Roy, Aaron Carass, and Jerry Prince. "A compressed sensing approach for MR tissue contrast synthesis". In: Information Processing in Medical Imaging. Springer. 2011, pp. 371–383.
[42] Shan Shen et al. "MRI fuzzy segmentation of brain tissue using neighborhood attraction with neural-network optimization". In: Information Technology in Biomedicine, IEEE Transactions on 9.3 (2005), pp. 459–467.
[43] Richard Socher et al. "Parsing natural scenes and natural language with recursive neural networks". In: Proceedings of the 28th international conference on machine learning (ICML-11). 2011, pp. 129–136.
[44] G Urban et al. "Multi-modal brain tumor segmentation using deep convolutional neural networks". In: MICCAI BraTS (Brain Tumor Segmentation) Challenge. Proceedings, winning contribution (2014), pp. 31–35.
[45] Koen Van Leemput et al. "Automated model-based bias field correction of MR images of the brain". In: Medical Imaging, IEEE Transactions on 18.10 (1999), pp. 885–896.
[46] Ragini Verma et al. "Multiparametric tissue characterization of brain neoplasms and their recurrence using pattern classification of MR images". In: Academic radiology 15.8 (2008), pp. 966–977.
[47] Pascal Vincent et al. "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion". In: The Journal of Machine Learning Research 11 (2010), pp. 3371–3408.
[48] Michael Wels et al. "A discriminative model-constrained graph cuts approach to fully automated pediatric brain tumor segmentation in 3-D MRI". In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2008. Springer, 2008, pp. 67–75.
[49] Darko Zikic et al. "Decision forests for tissue-specific segmentation of high-grade gliomas in multi-channel MR". In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2012. Springer, 2012, pp. 369–376.