Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Ancestral inference and cancer stem cell dynamics in colorectal tumors
(USC Thesis Other)
Ancestral inference and cancer stem cell dynamics in colorectal tumors
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
i
ANCESTRAL INFERENCE AND CANCER STEM CELL DYNAMICS IN
COLORECTAL TUMORS
By
Junsong Zhao
______________________________________________________________________
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(MOLECULAR BIOLOGY)
August 2016
Copyright 2016 Junsong Zhao
ii
Dedication
To my family: past, present, and future.
iii
Acknowledgments
I owe my gratitude to many people without whom I would have not reached this point.
They accompanied, supported, and guided me during my Ph.D. study.
First and foremost, I would like to express my greatest gratitude to my advisor Dr.
Paul Marjoram for his endless support both as a mentor and as a friend. He patiently guides
and advises me on my research project, writing of manuscripts and this thesis, and
presentation preparation. I really appreciate his effort to back me when there were obstacles
in research and in my personal life. I truly admire his personal integrity and inner virtue as a
scientist. He has taught me how to do research, but most importantly, how to perform
research collaboratively and ethically.
Besides my advisor, I am also sincerely grateful to Dr. Kimberley Siegmund, who is
not officially my advisor but has supported and guided me all the time. I also would like to
sincerely thank Dr. Sergey Nuzhdin for serving as the chair of both my qualification and my
dissertation committee. I would also like to extend my gratefulness to other committee
members: Dr. Peter Ralph, and Dr. Matthew Dean for their valuable advices and critiques on
my research proposal and thesis.
I am very thankful for Brad Foley, Sarah Singor, and Mohammad Abbasi who all
helped me tremendously with my presentations. I also want to thank many of my friends:
Weisheng Xie, Maoqi Xu, Long Pei, Meng Zhou, Shiliang Zhou, Guangtong Zeng, Kan
Wang, Yongjian Kang, Lunce Fu, Asif Zubair, Katrina Sherbina who helped me proofread
this thesis, and many more friends that I am not going to enumerate here.
iv
I would like to thank my parents and sister for their everlasting support throughout
my life. I was very fortunate compared to my classmates in elementary school and middle
school. Despite of financial difficulties, my parents supported me financially all the way until
my graduation from college and support me spiritually thereafter.
Finally, I owe my sincerest gratitude to Zeyi Zhou who accompanied me and offered
me numerous advices along this journey. Without her, I would probably have not joint Dr.
Paul Marjoram’s lab and lost the opportunity to be a student of such great advisor. I sincerely
appreciate the sacrifices she made in order to be part of my life and my love for her will last
eternally.
Five years of Ph.D. life is short and long. It seems like it was only yesterday that I
landed at LAX and knew nobody here. Now, five years has pasted in a blink of an eye and so
many new faces have come to my life. I thank all the people who help, support, guide and
love me once again.
v
Table of contents
Dedication ...................................................................................................................................... ii
Acknowledgments ....................................................................................................................... iii
Table of contents ........................................................................................................................... v
List of Tables .............................................................................................................................. viii
List of Figures .............................................................................................................................. ix
Abstract .......................................................................................................................................... 1
Chapter 1: Introduction ................................................................................................................ 3
1.1 Colorectal tumors ...................................................................................................... 3
1.2 Tumor growth model ................................................................................................. 5
1.3 Intratumor heterogeneity ........................................................................................... 6
1.4 Approximate Bayesian computation ......................................................................... 7
1.5 Summary of chapters ................................................................................................. 8
Chapter 2: Ancestral inference in tumors: how much can we know? .................................. 11
2.1 Overview ................................................................................................................. 11
2.2 Introduction ............................................................................................................. 12
2.3 Experimental data and model .................................................................................. 14
2.4 Statistical methods................................................................................................... 17
2.4.1 Approximate Bayesian computation ................................................................ 17
2.4.2 Summary statistics ............................................................................................ 20
2.5 Results ..................................................................................................................... 23
2.5.1 Analysis of simulated data ................................................................................ 23
2.5.1.1 Estimation of ancestor ................................................................................ 24
2.5.1.2 Estimation of the methylation and demethylation error rates .................... 26
2.5.1.3 Number of cancer stem cells, probability of asymmetric division, and age
................................................................................................................................ 29
2.5.2 Analysis of experimental data .......................................................................... 30
2.6 Discussion ............................................................................................................... 33
2.7 Additional supplemental results .............................................................................. 36
vi
2.7.1 Estimation of DNA (de)methylation error rates: .............................................. 36
2.7.2 Estimation of asymmetric division rate and age ............................................... 39
2.7.3 Choice of summary statistics ............................................................................ 42
Chapter 3: Characterization of the early divisions of colorectal tumors ............................. 45
3.1 Overview ................................................................................................................. 45
3.2 Introduction ............................................................................................................. 46
3.3 Data and methods .................................................................................................... 48
3.3.1 Exome sequencing for bulk tissue .................................................................... 48
3.3.2 SNP microarray ................................................................................................ 48
3.3.3 AmpliSeq .......................................................................................................... 49
3.4 Results ..................................................................................................................... 50
3.4.1 Mutation patterns .............................................................................................. 50
3.4.2 Calling integer copy numbers ........................................................................... 54
3.4.3 Reconstruction of the early divisions ............................................................... 59
3.5 Discussion ............................................................................................................... 61
Chapter 4: Early mutation bursts and cancer stem cell division asymmetry in colorectal
tumors ........................................................................................................................................... 65
4.1 Overview ................................................................................................................. 65
4.2 Introduction ............................................................................................................. 66
4.3 Data, model and methods ........................................................................................ 69
4.3.1 Tumor growth model and DNA mutation embedding ...................................... 69
4.3.2 Statistical methods ............................................................................................ 71
4.3.2.1 Approximate Bayesian computation .......................................................... 71
4.3.2.2 Summary statistics ..................................................................................... 72
4.3.2.3 Selecting weights for summary statistics ................................................... 73
4.3.2.4 Local linear regression method .................................................................. 74
4.3.2.5 Global linear regression method ................................................................ 75
4.3.3 Data simulation ................................................................................................. 76
4.3.3.1 Perfect synthetic data ................................................................................. 76
4.3.3.2 Simulated data with sequencing noise ....................................................... 76
4.4 Results ..................................................................................................................... 78
vii
4.4.1 Synthetic tumor data ......................................................................................... 78
4.4.1.1 Weight summary statistics ......................................................................... 78
4.4.1.2 Minimum detectable allele frequency ........................................................ 79
4.4.1.3 Number of sampled glands ........................................................................ 82
4.4.1.4 Adding Experimental noise........................................................................ 84
4.4.2 Experimental data ............................................................................................. 87
4.4.2.1 Is there a mutation burst? ........................................................................... 87
4.4.2.2 How stem cells divide? .............................................................................. 88
4.5 Discussion ............................................................................................................... 89
Chapter 5: Concluding remarks ................................................................................................ 93
5.1 ITH carries information regarding cancer stem cells in colorectal tumor ............... 93
5.2 Early divisions during the development of a tumor ................................................ 94
5.3 Mutation burst ......................................................................................................... 95
5.4 Future directions ...................................................................................................... 96
5.4.1 Cell motility ...................................................................................................... 96
5.4.2 Scope of the data ............................................................................................... 96
References .................................................................................................................................... 99
viii
List of Tables
Table 2.1 | Parameters in our model and the corresponding prior distributions. .............. 16
Table 2.2 | Summary Statistics. See definition of each summary statistics below ........... 21
Table 3.1 | Allele frequencies under different genotype and ploidy. ................................ 55
Table 4.1 | Parameters and the corresponding prior distributions in our model. .............. 70
ix
List of Figures
Figure 2.1 | The tumor growth model ............................................................................... 14
Figure 2.2 | Posterior distributions of the initial methylation status ................................. 24
Figure 2.3 | The mean square error of the estimates of the ancestor................................. 26
Figure 2.4 | Posterior Distributions of MER and DER. .................................................... 27
Figure 2.5 | The estimate of ratio of MER and DER. ....................................................... 28
Figure 2.6 | Posterior distributions of the NCSC for four illustrative tumors. .................. 29
Figure 2.7 | Mean square error of the mode of the posterior distribution of NCSC with
respect to varying age and true NCSC .............................................................................. 30
Figure 2.8 | Ancestral inference for experimental data. The x-axis is the CpG site, the y-
axis is the probability of each site being methylated in the ancestral cell ........................ 31
Figure 2.9 | Marginal posterior distribution of the DNA methylation and demethylation.
error rate in the experimental data. ................................................................................... 32
Figure 2.10 | Joint posterior distribution of the DNA methylation and demethylation error
rate in the experimental data. ............................................................................................ 32
Figure 2.11 | The posterior distributions of the number of cancer stem cells in
experimental data .............................................................................................................. 33
Figure 2.12 | Posterior Distributions of MER and DER. .................................................. 37
Figure 2.13 | Posterior Distributions of MER and DER with different true MER and DER
values ................................................................................................................................ 38
Figure 2.14 | The joint distribution of DNA methylation error rate and demethylation
error rate for Figure 2.4a. .................................................................................................. 39
Figure 2.15 | Posterior distribution of the probability of asymmetric division and age ... 40
Figure 2.16 | Posterior distribution of different parameters for the 9 real tumors ............ 41
Figure 2.17 | Difference in the posterior distributions on the same data set when using
different weights on different summary statistics ............................................................. 42
Figure 3.1 | Multiregional sampling scheme ..................................................................... 49
Figure 3.2 | Mutation allele frequencies and categorization of AmpliSeq data for Tumor
K ........................................................................................................................................ 52
x
Figure 3.3 | Mutations in AmpliSeq data .......................................................................... 54
Figure 3.4 | Segmentation and smoothing of SNP array data for a gland (KA6) of tumor K
........................................................................................................................................... 56
Figure 3.5 | Call CN estimates from SNP array data after smoothing .............................. 58
Figure 3.6 | Adjusted copy number for Tumor K. x-axis is the chromosomal coordinate 59
Figure 3.7 | Different scenarios that depict the early events for tumor K ......................... 60
Figure 3.8 | Mutation allele frequencies and categorization of AmpliSeq data for Tumor
N ........................................................................................................................................ 62
Figure 4.1 | Schematic of tumor growth and the two types of CSC division. .................. 70
Figure 4.2 | Summary of posterior distributions for each parameter under different
weighting schemes ............................................................................................................ 81
Figure 4.3 | Different minimum detectable allele frequency. 300 tumors with a range of
generating parameters were tested .................................................................................... 82
Figure 4.4 | Summary of posterior distributions as a function of the number of glands
sampled from each tumor ................................................................................................. 84
Figure 4.5 | Summary of posterior distributions for different sequencing depths. ........... 86
Figure 4.6 | Posterior distributions of mutation rate for tumor U ..................................... 88
Figure 4.7 | Posterior distribution of asymmetric division rate (x-axis) for tumor U ....... 89
1
Abstract
Tumorigenesis is the process by which a tumor is formed. It starts with the
transformation of normal cells into cells with uncontrolled growth, coupled with genetic,
epigenetic, and cellular changes. One of the challenges that researchers have to face when
studying tumor growth is that only the end point of the growth process is observed. Namely, we
are not able to examine the clonal expansion of the first transformed cell. To design an effective
therapeutic strategy to treat tumors and to prevent tumorigenesis, basic understanding of the
entire developmental details of tumors is needed. Intratumor heterogeneity (ITH) of sequencing
data, i.e. the fact that not all parts of a given tumor are the same, has been observed in tumors in
various genomic data sets. Thus all tumors are not the same, and it is likely that they should not
all be treated in the same way. In this thesis, I aim to better understand tumor growth and the
variations in this process between individuals by viewing tumor growth as an evolutionary
process. I will decipher the past of a given tumor using techniques borrowed from molecular
phylogeny using the information about that past that is contained within the ITH.
In summary, here I describe methods and models to study ITH contained in methylation
data, exome sequencing data and SNP array data in order to decipher the ancestry of colorectal
tumors and estimate several important parameters, such as methylation error rate, demethylation
error rate, mutation rate, number of cancer stem cells, and the probability that a stem cell
undergoes symmetric division or asymmetric division.
In Chapter 1, I give a general introduction to the field. In Chapter 2, I demonstrate that
DNA methylation data that are collected from colorectal tumors can be used to infer the ancestral
methylation state, the number of cancer stem cells, and methylation/demethylation error rate. In
2
addition, I demonstrate that methylation data have limitations for parameter inference. In Chapter
3, I present work I performed as part of the development of an alternative tumor growth model, a
“big bang” model, for colorectal tumors, which is supported by genomic data from spatial
sampling. In Chapter 4, I explore the possibility that the DNA mutations from single gland
exome sequencing data can be utilized to infer the division patterns of cancer stem cells and to
infer whether or not there is a mutation burst in the early stage of tumorigenesis. Finally, in
Chapter 5, I summarize the work presented in my thesis and discuss future questions of interest.
3
Chapter 1: Introduction
This thesis consists of work reported in 4 papers (3 published; one in the process of
submission). These comprise chapters 2-4 of this document. For narrative reasons, we included
separate “Introduction” sections in each of those chapters. In the current section we therefore
provide a brief “big-picture” overview of the field in which the studies were undertaken, and the
analysis approach that was employed throughout.
1.1 Colorectal tumors
Cancer has become the second most common cause of death in the United States. In 2015,
according to the American Cancer Society, approximately one person died of cancer every minute.
However, despite decades of research, and large investments of time and money (e.g. the “War on
Cancer”), the specifics of tumor growth, especially those of the early stages of tumor growth, are
poorly understood. This is in part, at least, due to the inability of directly observing the clonal
expansion of a single cell, which is thought to initiate tumor growth [1-3].
Colorectal cancer, the third most common cancer in the United States according to the
American Cancer Society, develops in the large intestine, which includes the colon and the
rectum. The development of colorectal cancers typically begins with a polyp, which is an
abnormal but non-cancerous growth of tissue on the inner lining of the large intestine [4]. The
most common type of polyp is an adenomatous polyp, which originates within the intestinal
glands, which are also called “colon crypts”. Colorectal adenomas have the potential to grow into
cancer, and the risk of such adenomas becoming cancerous increases with time. Hence, there has
been a recent initiative to screen regularly for such polyps using colonoscopy. However, only a
4
small proportion of adenomas will eventually progress to adenocarcinoma, a cancer that develops
from glandular cells [5,6]. Therefore, it would be clearly advantageous to try to understand this
heterogeneity in tumor outcomes. To this end, we focus on proposing models for tumor growth
and fitting those models in the hope that, in the future, one might relate heterogeneity in outcome
to heterogeneity in fitted model parameters.
Among various kinds of tumors, colorectal tumors have been the focus for our research
because the disparity between benign and malignant colorectal tumors is significant, and both
tumor types are naturally organized into well-defined glandular substructures [7,8]. Each gland
forms a physically distinct collection of ~10000 cells. The tumor itself is then made up of a large
collection of such glands. Such spatial organization provides a powerful sampling mechanism in
that a homogeneous sub-clonal population can be acquired with minimal cross contamination
from either normal or neighboring tissues, which is a critical problem in general. The approach of
sampling multiple glands from the two halves of a tumor has been proven successful, and we
demonstrate this in Chapter 2 when we study glandular passenger DNA methylation patterns,
showing that the glandular structures are robust and form early after initial transformation of the
progenitor cell [1,9-11].
Some of the work in this thesis focuses on patterns of polymorphism observed in
methylation status of glandular cells extracted from colorectal tumors. Unlike nucleotide base
replication fidelity, epigenetic replication fidelity is markedly lower at certain CpG rich regions.
Therefore, DNA methylation patterns measurably change during normal human aging and are
often highly polymorphic within an individual [12]. Consequently, they can be used to infer the
history of a tumor in a way that is directly analogous to the use of nucleotide variation to infer the
history of human individuals [13]. The analysis of Chapter 2 uses such data. However, as part of
that analysis, it also becomes clear that the DNA methylation data we collected is insufficient to
5
infer all the details about tumor growth with the desired level of resolution [11]. This led us to
consider exploiting additional types of data that might have higher resolution. Thus, in Chapter 3
and Chapter 4, we incorporate data from those other technologies (next generation sequencing,
SNP arrays), together with methylation data, to understand the picture of tumor growth more fully.
1.2 Tumor growth model
Numerous models have been proposed to help better understand the growth process of
tumors. One of the earliest models was the Gompertzian model that was proposed by Laird in
1964 [14]. In this model, the Gompertzian function, which is a sigmoid function, is used to
describe the tumor dynamics. Thus, tumor growth initially follows an exponential growth model,
before it eventually becomes constrained, for example, by limitations of space, at which point the
tumor size asymptotes to some finite value. The Gompertzian model proved quite successful and
was shown to fit the experimental data well. It provided researchers with a simple yet powerful
tool to study the growth of a tumor at a macroscopic level in terms of the number of cells.
However, the model is rather simplistic and has been proven to fit the data poorly when the tumor
is small in size or when the interaction between the tumor and the host immune system needs to
be included in the model [15].
More recently in tumor modeling, the “selective sweeps” theory, in which a sequential
clonal expansion of advantageous clones (driver mutations) eventually dominates the final cancer,
has been widely adopted. According to this “selective sweeps” theory, the final metastasis in a
cancer is a consequence of the accumulation of and the action of selection on a set of driver
mutations [16,17]. Namely, in each step of this accumulation, a tumor will select for the driver
mutations, sweeping them to higher frequency, which results in a sequential expansion of clones
6
with such selective advantages [18,19]. This model has now become broadly used when
describing the progression of established tumors. Such a model has implications for the patterns
of mutational heterogeneity that should be observed in the final tumor. Detailed examination of
such patterns has led to investigators proposing refinements to this model, in the form of a
branched evolution model instead of the linear “selective sweeps” model [18].
1.3 Intratumor heterogeneity
The growth of tumors is an example of an evolutionary process, and, as such, the history
of that growth is encoded in the final genome [18,20]. The key intuition we exploit in this thesis
is that this ancestry can be inferred from the patterns of variation within genetic data from cells
sampled from a tumor [21,22]. On average, the greater the observed genetic differences between
cells, the greater the time since a common ancestor. This is known as the “molecular clock”
hypothesis [23]. Molecular phylogeny is usually employed to reconstruct the pasts of
macroscopic populations, such as individuals or species, but it can also be used to infer the fates
of somatic cells within an individual. The ability to accurately infer somatic cell phylogenies
would be extremely valuable, especially for human tissues, because more direct experimental
observations are often impractical. Many research groups, including our own, have shown that
multiregional sequencing reveals genomic intratumor heterogeneity (ITH) and thereby reveals the
genomic architecture of evolution in tumors [24,25]. Therefore, although the exact details of what
happened in early tumor development are unknown and unobservable, the ITH in the genomic
data from that tumor contains the information needed to decipher the origin of tumorigenesis.
This is the approach I adopt in the work discussed in this thesis.
7
1.4 Approximate Bayesian computation
The data I will analyze in this thesis are, generally speaking, large and complex, as are
the models upon which the analyses are based. This leads to difficulties in performing exact,
statistical analysis. Given these restrictions, I will exploit a method known as Approximate
Bayesian Computation (ABC), which was designed for precisely such a paradigm.
ABC was proposed and pioneered by Tavaré et al (1997) as a method to allow inference
of coalescence times using DNA sequence data in population genetics. However, many ABC-like
ideas can be traced back to the 1980s. In population genetics, statistical inference is typically
difficult as the data are usually complex and the population demographic models often contain
many parameters prohibiting explicit theoretical solutions, largely because the complexity of the
model generally makes it impossible to write down the likelihood function. As a consequence,
from a Bayesian perspective, direct calculation of the posterior distribution for the model
parameters is mathematically impossible or computationally intractable.
ABC side-steps these issues by proposing a simulation-based estimator of the likelihood
function. ABC exists in many forms, but in this thesis I will exploit a simulation-based rejection-
acceptance algorithm to overcome these obstacles. Since a detailed description of this and other
ABC algorithms depends on exactly what model is being simulated, I defer further discussion of
the method to the appropriate chapters.
8
1.5 Summary of chapters
In Chapter 2, I demonstrate that the DNA methylation patterns collected from different
glands in colorectal tumors do indeed contain sufficient ancestral information to allow parameter
inference regarding tumor growth. Combining a mathematical tumor growth model with an ABC
analysis, we retrieve posterior distributions for the initial methylation status of each locus, the
methylation/demethylation error rate, and the number of cancer stem cells in each gland of the
tumor. I also explore the limits of our ability to estimate those parameters. Using this framework,
we can generate tumor-specific profiles that might potentially guide therapeutics.
In Chapter 3, I discuss work related to the development of a new tumor growth model to
explain the intratumor heterogeneity that is present in glandular data that we collected via the
following techniques: SNP microarray, exome sequencing in bulk tumor tissue, and targeted
DNA sequencing in colorectal tumors. The model sheds light onto our understanding of the
initiation and progression of colorectal tumors. Contradicting the present view of tumor growth,
the model predicts that the pervasiveness of each mutation in a tumor is primarily associated with
the time at which it appeared rather than selective pressure. Namely, the earlier the mutation was
introduced to the tumor, the more pervasive the mutation is.
In Chapter 4, I address one of the interesting questions that is raised by the big bang
model for colorectal tumors discussed in chapter 3: whether there is a mutation burst (i.e. a period
of elevated mutation rate) during the initial stage of tumor development. As the pervasiveness of
mutations is a function of time, the physical distribution of the mutations in the final lesion and
the existence of a given mutation in every cell in a tumor might reasonably carry the information
required to answer this question. In addition, we address other questions regarding the processes
9
by which new tumor cells are formed. For example, the details of the process by which cancer
stem cells divide, such as whether they undergo symmetric or asymmetric division, can shape the
distribution of mutations within a tumor. Therefore, remnants of these processes observed in
patterns of mutations found at the end point of colorectal tumor growth can potentially be used to
determine such details of how cancer stem cells divide. In this chapter, I conduct a simulation
study to assess our power to answer such questions and conclude that we are indeed able to infer
the existence of a mutation burst and more details of the specific mechanisms by which cancer
stem cells divide.
In Chapter 5, I summarize the work presented in this thesis and discuss the scope of
future research.
10
11
Chapter 2: Ancestral inference in tumors:
how much can we know?
This work appears essentially as published in Zhao J, Siegmund KD, Shibata D, Marjoram P.
Ancestral inference in tumors: how much can we know?
Journal of Theoretical Biology, 359:136–145, 2014. PMCID:4138290
2.1 Overview
A tumor is thought to start from a single cell and genome. Yet genomes in the final tumor
are typically heterogeneous. The mystery of this intratumoral heterogeneity (ITH) has not yet
been uncovered, but much of the ITH may be secondary to replication errors. Methylation of
cytosine bases often exhibits ITH and therefore may encode the ancestry of the tumor. In this
study, we measure the passenger methylation patterns of a specific CpG region in 9 colorectal
tumors by bisulfite sequencing and apply a tumor development model. Based on our model, we
are able to retrieve information regarding the ancestry of each tumor using approximate Bayesian
computation. With a large simulation study we explore the conditions under which we can
estimate the model parameters, and the initial state of the first transformed cell. Finally we apply
our analysis to clinical data to gain insight into the dynamics of tumor formation.
12
2.2 Introduction
The mechanisms by which tumors grow remain poorly understood. Various models have
been proposed to study tumor initiation, growth and progression. An early study [26] showed that
the Gompertzian model fitted experimental data remarkably, although later research indicated that
a Gompertzian model will fail when the tumor is small or when the interaction between the tumor
and the host immune system is included in the model [15]. Tumor growth can also be modeled by
partial differential equations and mixture theory [27,28] with an emphasis on mass build-up and
the geometry of the tumor. Some later tumor models [29,30] focus on single-cell level behavior.
Technologic advances such as single-cell tumor sequencing [31] will increasingly provide more
experimental data for inferring tumor population structure.
Fitting models of tumor growth is problematic because we do not typically observe that
growth. Rather, we observe an end point of the growth. Furthermore, we are not able to observe
the clonal expansion of a single cell that is thought to initiate tumor growth [1,2]. Since the
parameters of tumor growth, or state of initial single cell before clonal expansion, might contain
important prognostic flags for future tumor behavior, it is vital to explore how well they might be
inferred from data collected from the final tumor. In this paper we explore this issue using
approximate Bayesian computation (ABC), a method that allows principled analysis in contexts
such as ours where models are of sufficient complexity to make more traditional analysis methods
intractable.
The key intuition that we exploit is that ancestry can be inferred from the variation
between genomes (c.f., inference of mtEVE, or Y-chromosome Adam, from human genotype
data [21,22]). The greater the differences between genomes, on average the greater the time since
13
a common ancestor (the molecular clock hypothesis [23]). Molecular phylogeny is usually
employed to reconstruct the pasts of macroscopic populations such as individuals or species, but
it can also be used to infer the fates of somatic cells within an individual. Accurate inference of
somatic cell phylogenies would be extremely valuable, especially for human tissues, because
more direct experimental observations are often impractical. However, a problem with comparing
somatic cell genomes within an individual is that few somatic mutations are expected to
accumulate within a lifetime [32]. To overcome this practical shortcoming, recent studies have
employed epigenetic measurements such as DNA methylation patterns. DNA methylation is a
covalent modification at CpG dinucleotides that is also copied after DNA replication. However,
unlike base replication, epigenetic replication fidelity is markedly lower at certain CpG-rich
regions. Therefore, DNA methylation patterns measurably change during normal human aging
and are often highly polymorphic within an individual [12]. Consequently, the 5’ to 3’ order of
DNA methylation can be used to infer the history of a tumor in a way that is directly analogous to
the use of nucleotide variation to infer history of individuals [13].
DNA methylation patterns at non-expressed CpG-rich regions (“passenger methylation”)
have been used to reconstruct the past of human tissues such as colon crypts and tumors [33].
However, it is uncertain with how much precision the pasts of somatic cells can be inferred from
methylation patterns. Complicating factors include uncertainties imposed by rapid replication
errors, stepwise changes (both methylation and demethylation are possible), and possible
variations in error rates between neighboring CpG sites that may depend on the methylation status
of neighboring sites. Potentially, certain aspects of ancestry are more recoverable from passenger
methylation patterns.
Specifically for human tumorigenesis, simple unknowns are the ancestral state of the first
tumor cell, how fast a tumor grows, and its mitotic age (numbers of divisions between the first
14
tumor cell and tumor removal). To further explore the utility of passenger methylation patterns
for the reconstruction of human tumorigenesis, we simulate data under a variety of tumor growth
models, and evaluate our ability to estimate parameters capturing tumor growth behavior,
extending earlier work [1,2] in which we focused on estimation of three parameters: the total
number of cell divisions (tumor age), the number of cancer stem cells per gland, and the
probability of asymmetric stem cell division.
2.3 Experimental data and model
We applied our analysis methodology to a data set that consists of information from 9
colorectal tumors. The methylation patterns of a short CpG-rich region (LOC, 14 CpG sites) were
measured using bisulfite sequencing. We sampled eight cells per gland, and eight glands per half,
in each tumor.
Figure 2.1 | The tumor growth model. Top graph shows the division of the 1st transformed cell
into a gland. The bottom graph shows the exponential growth and the constant-size growth of the
glands in one tumor half. See text for more details.
15
We model actual physical tumor growth, beginning with the clonal expansion of a single
cell [1,2], applying a biological constraint on the total number of tumor cells (e.g. assuming 1
billion cells/1cm
3
), and making use of clinical data on tumor size to inform our model. Tumors
arising from glandular tissues such as the colon, with cells organized into small tubular units, are
typically adenocarcinomas, which are composed of many neoplastic glands. Adenocarcinomas
are also common in the breast, prostate, lung, pancreas, and stomach. As such, dividing cancer
cells in our model are geographically confined to cancer glands, which also divide, with
constraints on the total number of cells based on the size of the tumor (see figure 2.1). Our model
directly reflects this glandular structure.
A tumor is simulated as the clonal expansion of a single transformed cell. A 4 cm
3
tumor
contains approximately 4 billion cells, which is impossible to simulate at the single-cell level by
forward simulation. However, the organization of tumor cells within glands allows for a flexible
growth modeling across two different scales, cell level and gland level. Since one gland contains
approximately 8,000 cells, a 4 cm
3
tumor can be approximated by “only” 500,000 glands. This
size is achieved after only 19 generations of exponential growth. We mimic the structure of our
sampled data by sampling only eight glands from the ~500,000, and storing their ancestral tree.
This is followed by the simulation of single-cells along the ancestral tree for the sampled glands.
This approach allows us to simulate for each tumor a sample of ~131K cells (=8,192 cells/gland x
16 sampled glands) instead of a total of ~4 billion. This ensures computational tractability.
The cells and glands follow separate models for growth. We model gland growth as
exponential growth followed by a period of constant size (see Figure 2.1). At the cell-level, the
single transformed cell undergoes exponential growth (cell doubling) until it attains the number
required of the first cancer gland (see Figure 2.1). In subsequent generations, the cells in the
gland divide until they double in number, and then the gland divides. Both the cells and glands
16
continue to divide, forming a second period of exponential growth (phase one for gland tree
growth), until the tumor reaches its fixed biological size. The tumor then enters the second phase
of gland tree growth, in which the gland number remains constant, but the cells within glands
divide and die, allowing for continued “aging” in a tumor of fixed size (no growth). Cell division
and death occurs via both symmetric and asymmetric division. We refer to long-lived dividing
cells lines as cancer stem cell lines. The model for cancer stem cell division is as follows. Under
asymmetric division, a cancer stem cell differentiates into one cancer stem cell and one normal
cancer cell, while under symmetric division, a cancer stem cell has a 0.5 probability of giving
birth to two cancer stem cells and a 0.5 probability of dividing into two normal cancer cells. This
is parameterized by probability of asymmetric division (PAD) that controls the proportion of
cancer stem cells experiencing asymmetric division. Finally, the DNA methylation patterns are
sampled from approximately 16 glands per tumor, eight per tumor half. For a detailed
mathematical description of the model, see [1], in which the same parameterization is used.
Parameters Possible Value and Prior Distribution
DNA Methylation Error Rate (MER) Unif(0,0.1)
DNA Demethylation Error Rate (DER) Unif(0,0.1)
Number of Cancer Stem Cells (NCSC) 2 to the power of 1,2,3,4,5,6,7,8,9
Probability of asymmetric division (PAD, R) Unif(0.5,1)
Number of generations of constant size (T3) Unif(10,365)
Table 2.1 | Parameters in our model and the corresponding prior distributions.
Our analysis explores variation in a total of five parameters (see Table 2.1). The
remaining model parameters were fixed at either their most likely value or at values reported in
the literature. The parameter NCSC is the number of cancer stem cells present in one gland of a
tumor. In earlier work we studied how quickly a tumor grows by comparing two different growth
models; the first was a two-phase model of exponential growth, followed by constant tumor size,
but allowing for cell aging through continued cell division and death, while the second was a
17
slower growing tumor that follows a smooth, Gompertzian (S-shaped) growth curve [2]. We
found little difference in parameter estimates from the more computationally demanding slow-
growth model [2], and do not pursue it further here. Instead, we explore our ability to estimate
parameters that were treated as fixed (at arbitrary values) in our previous work, such as the DNA
methylation and demethylation copy error rates. We also focus on our ability to infer the DNA
methylation pattern of the ancestral cell.
2.4 Statistical methods
2.4.1 Approximate Bayesian computation
As the level of detail in models used to answer biological questions grows, due to the
increased richness of data and our eagerness to use it to obtain a more comprehensive
understanding of the biological phenomena, the computational difficulties of analysis also grow,
to the extent that analysis often becomes intractable. This intractability is frequently caused by
the need to calculate the likelihood function – the function that gives the likelihood of the data for
a particular combination of model parameters. This has lead to the rising popularity of methods in
which, instead of calculation, we use simulation to perform statistical inference. Here, we exploit
one of these methods: approximate Bayesian computation [ABC]. The advantage of such
methods lies in the fact that simulation generally remains tractable long after the rise in data
richness and/or model complexity causes exact calculation to become impossible. The goal of the
simulation is to estimate the likelihood directly, for example using so-called Monte-Carlo
methods. The very first documented Monte-Carlo simulation was the Buffon’s needle experiment
to approximate π [34], in which the analyst performed a repeated series of experiments in which a
needle was tossed onto a table.
18
Many ABC methods rely upon versions of the Acceptance-rejection Algorithm [35].
Essentially, the goal is to find the joint posterior distribution for the model parameters conditional
on the observed data. This characterizes our beliefs about model parameters and states given the
observed data. The analysis proceeds by comparing simulated datasets to the data that were
observed experimentally and asking what parameter combinations lead to data that matches the
observed data. However, for complex data structures exact matches happen very infrequently,
even if data is simulated using the correct model and true parameter values. Therefore, ABC
methods use an approximation in which “near misses” are also accepted [36-38].
In this paper we use an ABC form of the Acceptance-rejection Algorithm [15,36]. We
aim to find the posterior distribution of parameters θ=(θ
1
, θ
2
,…, θ
n
) based on the data (or sample)
D that were observed from experiments. This is written as:
ƒ(θ|D)= ƒ(D|θ)π(θ) /ƒ(D).
The likelihood term, ƒ(D|θ), is impossible to obtain in many contexts, including our own,
hence our need to exploit ABC. In the ABC scheme, we use summary statistics S=(S
1
,S
2
,…,S
k
) to
represent key features of the observed data D, thereby reducing the computational complexity
significantly and, in the process, regaining tractability of analysis. The summary statistics used in
our analysis are normalized and listed in Table 2.2. So-called “sufficient statistics” are the best
choice for summary statistics if they are available (since then, by definition, ƒ(θ|S)=ƒ(θ|D)).
However, in nearly all practical cases sufficient statistics are unavailable. As a consequence, we
estimate ƒ(θ|S), for which the analysis is tractable, rather than ƒ(θ|D), for which it is not.
The ABC version of rejection sampling is as follows:
19
For i=1 to N
1. Sample parameters θ’ from the prior distribution π(θ)
2. Simulate data D’ using model M with the sampled parameters θ’
,
and summarize
D’ as S’.
3. Accept θ’ if d(S’, S) < ε, for a given threshold ε. Where d(S’, S) is a measure of
distance between S’ and S.
4. Go to 1.
After N iterations, a set of accepted parameter values, denoted by θε, is generated, which
consists of samples from the posterior distribution ƒ(θ|S’≈S), a statistical approximation to ƒ(θ|S)
[36]. Ideally, we want to choose as small ε as possible, because the posterior distribution will then
be as close as possible to ƒ(θ|S). However, the consequence of a small ε is fewer accepted θ’s,
leading to a higher level of empirical noise in the estimate of ƒ(θ|S’≈S). In other words, more
computation is needed to get a sufficient number of samples from the posterior distribution of
interest.
In our paper we use a common variation on the above description. Rather than using a
fixed threshold, ε, we sort all N distances calculated in step 3, and accept the θ’ that generated the
smallest 100*η percent distances. Such an approach has been used in a variety of ABC papers e.g.,
[38,39].
20
2.4.2 Summary statistics
The question of which statistics to include in an analysis such as this is a subtle one. In
principle, given infinite computational resources, adding extra statistics can never hurt, and will
always help unless the new statistic is completely correlated with the existing statistics, or is
completely uninformative. However, in practice, adding less informative statistics increases the
noise in the measure of distance between observed and simulated datasets, and thereby increases
the error in the degree of agreement between the empirical estimate of ƒ(θ|S’≈S) and ƒ(θ|S) itself.
As such, a small, informative set of summary statistics is best [40]. The set of summary statistics
used in our analysis are given in Table 2.2. They represent a selection of statistics we have used
in similar analyses [1,9,26], with some additional statistics designed to be informative regarding
questions of particular interest in the present paper. We generate N = 60 million simulations,
using the prior distributions shown in Table 2.1, with the methylation status of the single
ancestral cell being randomly generated for each data point, and we accept 0.005% of these
simulations (3,000 data sets) to generate the posterior distributions (thereby making ε as small as
is reasonably possible). The distance metric is defined as:
d(S’, S)=||(S’-S)*W
T
||
2
,
where W is a weight vector. We use equal weights on the seven summary statistics
shown in Table 2.2 for all analysis except that of the ancestral state, where we placed higher
weight on S
7
, a statistic that is particularly informative regarding ancestor.
We now give full details of calculation of the summary statistics we used in our analysis.
We start with some definitions of terms:
21
Hamming Distance (HD):
Given two strings, S
1
and S
2
, with equal length, the hamming distance between S
1
and S
2
is the total number positions at which these two strings have different characters. For example, if
S
1
=abc, S
2
=bac, then the HD(S
1
,S
2
)=2.
S
1
Percentage of methylation
S
2
Average number of unique tags
S
3
Average hamming distance within all glands
S
4
Average hamming distance among all glands
S
5
Average hamming distances between segments
S
6
Average number of transitions
S
7
Vector of sitewise percentage of methylation
Table 2.2 | Summary Statistics. See definition of each summary statistics below.
Number of Unique Tags (NUT):
Given N strings, S
1
, S
2
,…, S
N
, the number of unique tags is the total number of different
patterns among these N strings. For example, S
1
=abc, S
2
=abd, S
3
=abc, then NUT(S
1
,S
2
,S
3
)=2.
Number of Transitions (NT):
Given a string S, the number of transitions is defined as the total number of times the
characters in the string differ from their adjacent characters. For example, S=1011011, then
NT(S)=4.
To explain the actual statistics used, assume we cut the tumor into K fragments and
sample M glands per fragments, extracting N cells per gland, from each fragment. Each CpG
22
island locus is a string of zeros and ones with length L (zeros correspond to the state
“unmethylated”, while ones correspond to “methylated”). Define S
km
to be all the methylation
patterns in the m
th
gland in the k
th
fragment. Then set S
kmn
to be the methylation pattern of the n
th
cell in S
km
. Similarly, S
kmnl
denotes the methylation status at the l
th
site of S
kmn
. We then define the
following statistics.
Average Percentage of Methylation (APM):
APM=
Average Number of Unique Tags (ANUT):
ANUT=
Average Hamming Distance Within all Glands (AHDWG):
AHDWG =
Average Hamming Distance Among all Glands (AHDAG):
AHDAG =
Average Hamming Distances Among Fragments (AHDAF):
23
AHDAF =
Average Number of Transitions (ANT):
ANT=
Vector of Sitewise Percentage of Methylation (VSPM):
VSPM=
2.5 Results
2.5.1 Analysis of simulated data
To benchmark the performance of our analysis machinery, we begin with an analysis of
simulated datasets. By analyzing simulated data we are able to compare summaries of our
estimated posterior parameter distributions to the (in reality unobserved) generating parameter
values. We describe several such analyses below. For each analysis, in order to help intuition, we
begin by presenting some representative, illustrative results for single simulated datasets, before
presenting overall results of a more comprehensive simulation study.
24
Figure 2.2 | Posterior distributions of the initial methylation status. (a) Posterior distributions
of the estimates for the probability of the initial methylation status being 1 in the ancestors of
illustrative simulated tumors (generating parameters are shown above each figure, followed by
the DNA methylation pattern of the initial transformed cell). (b) The effect of age on posterior
distributions of the estimates for the probability of the initial methylation status being 1 in the
ancestors of the simulated tumors.
2.5.1.1 Estimation of ancestor
We begin with one of our questions of primary importance: the state of the ancestral cell.
This is not, as such, a parameter, but its posterior distribution can be estimated in a way that is
directly analogous to the method used for parameters. We assume an uninformative prior
distribution for the ancestor - a Bernoulli distribution with p=0.5 at each site. Illustrative results
are shown in Figure 2.2a. We show the posterior distribution of the methylation status of each site
for four simulated “test” tumors. In our analysis, the number of CpG sites is 14, and the
parameters used to generate these four simulated tumors are shown in the title of each figure,
25
using the notation of Table 2.1 (where the 4
th
entry in the title is the ancestral state of the
simulated tumor: 1 being methylated; 0 being unmethylated). The height of each bar represents
the posterior probability of that site being methylated in the very first transformed tumor cell. The
posterior distribution indicates that our analysis retrieves the state of the original cell of the tumor
successfully in each case (although, in many cases, not with high confidence). This is more
impressive when one considers that there are 2
14
=16,384 possible ancestors for each tumor in our
simulation.
One key feature revealed by our simulations is that the reliability of the estimate of
ancestral methylation status depends on the age of the tumor (see Figure 2.2b). When the age of
the tumor becomes great, for example T3=240 in Figure 2.2b, the initial methylation status
becomes largely not inferable. This is not unexpected. The pattern of observed variation in the
final tumor is a complicated perturbation of that in the progenitor cell. Over time, the perturbation
becomes greater and greater, until eventually all memory of initial state is lost. (In technical terms,
the stochastic process modeling methylation at each site becomes stationary.) The key question,
then, is how quickly this memory is lost. In Figure 2.3 we plot the mean square error (MSE) of
the most likely ancestral methylation state, compared to the true ancestral state, as a function of
tumor age and DNA methylation error rate (MER). As expected, the MSE increases as the tumor
becomes older and the MER becomes larger. Note that there is a degree of confounding here.
Comparing two tumors, A and B say, if tumor B is twice as old as tumor A, but has a MER that is
half that for tumor A, the pattern of variation in the two tumors will be quite similar (since it is
related to the expected numbers of changes to methylation status). In Figure 2.3 we see signal
regarding the methylation status of the ancestral cell is relatively quickly lost. But if a tumor is
young, (Age<100), and MER is low (<0.03), signal regarding ancestral state is present.
26
Figure 2.3 | The mean square error of the estimates of the ancestor. NCSC=256, R=0.8624,
ancestor 11001001100100 ( the DER was set to be half of the MER).
2.5.1.2 Estimation of the methylation and demethylation error rates
Next we explore how quickly methylation and demethylation errors propagate. We
simulate a number of tumors with known methylation and demethylation error rates (generated
uniformly through 0 to 0.1), with other parameter values kept constant (and assumed as known in
the subsequent analysis). We then estimate the posterior distribution for both MER and DER.
0
0.015
0.03
0.045
0.06
0.075
0
100
200
300
0
0.2
0.4
0.6
0.8
MER
Ancestor=11001001100100, R=0.8624, NCSC=256
Age
MSE of ancestor estimate
0
0.1
0.2
0.3
0.4
0.5
27
Figure 2.4 | Posterior Distributions of MER and DER. 4 simulated tumors generated under
MER=0.005, DER=0.002, NCSC=256, ancestor= 01111101100100, R=0.5 and (a) T3=10; (b)
T3=345; (c) top view of the joint distribution of MER and DER from Figure 2.4b.
In Figure 2.4 we show example results for four simulated tumors. These were all
generated using MER=0.005, DER=0.002, and T3=10, and have posterior distribution for the
MER and DER centered at around 0.005 and 0.002 respectively, suggesting that the MER and
DER can be correctly estimated in these illustrative cases. Increasing the age of the tumor from
T3=10 (Figure 2.4a) to T3=80 (Figure 2.12a, Supp. Mats, section 2.6.1) still results in a good
estimate of MER and DER. However, when T3=345 the estimation of MER and DER becomes
more difficult (Figure 2.4b). The reason that it appears to become harder to estimate MER and
DER for older tumors is that the tumor eventually reaches a stationary phase in which the overall
methylation level is largely a function of the relative rates of MER & DER, (expressed through
their ratio) rather than the marginal value of either. Thus, a wide range of values for MER is
supported provided we choose DER appropriately (i.e., to maintain the necessary ratio – see
Figure 2.4c). This flexibility is lost for younger tumors that have not grown for long enough to
exhibit this kind of “stationarity” behavior. These conclusions are robust across a range of
28
simulated tumor data, in which ancestral state was also varied (Figure 2.12b, Supp. Mats., section
2.6.1).
Figure 2.5 | The estimate of ratio of MER and DER. The tumors were generated using
NCSC=256, R=0.8624, and an ancestor state of 11001001100100, with the ratio of MER and
DER set to be 2.
We note in passing that the shape of the joint distribution of MER and DER provides us
with evidence of the age of the tumor. If the joint distribution is clustered into a small cloud, the
tumor is still in an early stage in terms of DNA methylation alterations (Figure 2.14, Supp. Mats,
section 2.6.1), On the other hand, if the joint distribution forms a line, as in Figure 2.4c, the tumor
is older.
Moving to summary results from the full simulation study, Figure 2.5 summarizes our
ability to estimate the ratio MER/DER over a full range of example datasets. We show results for
the estimated value of MER/DER over 40 replicates for each combination of parameter values, in
which we analyzed data that was simulated with MER/DER=2. We plot the ratio of MER and
DER against age and MER. As we can see, we obtain very good estimates of the ratio, except for
very low MERs, when, whatever the ratio is (within reason), no changes of methylation state
occur (and so accurate inference of relative rates is impossible). As discussed above, because of
0
0.015
0.03
0.045
0.06
0.075
0
100
200
300
-1
0
1
2
MER
Estimates of Ratio
Age
Ratio
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
29
the fact that patterns of methylation depend largely upon the ratio MER/DER, inference of MER
without reference to DER is much less successful.
Figure 2.6 | Posterior distributions of the NCSC for four illustrative tumors. The dotted line
indicates the true value of NCSC, and the values above each figure indicate the true parameter
values.
2.5.1.3 Number of cancer stem cells, probability of asymmetric division, and age
We now consider the remaining model parameters. In Figure 2.6 we focus on estimation of
the number of cancer stem cells (NCSC). We show results for 4 illustrative simulated tumors. The
results indicate that we can recover NCSC with reasonable accuracy, despite their being a very small
fraction of the total cell mass in the tumor. The probability of asymmetric division (PAD), R, plays an
important role in establishing the heterogeneity of tumor cells. Our results shed some insight on likely
parameter values, but overall results indicate that estimation is difficult (Figure 2.15a, Supp. Mats,
section 2.6.2).
2 4 8 16 32 64 128 256 512
0
0.1
0.2
0.3
0.4
NCSC=4, R=0.8614,T3=45, 11101111110101
2 4 8 16 32 64 128 256 512
0
0.1
0.2
0.3
0.4
NCSC=16, R=0.7514,T3=49, 01010111011100
2 4 8 16 32 64 128 256 512
0
0.05
0.1
0.15
0.2
NCSC=128, R=0.7003,T3=51, 11111010000110
2 4 8 16 32 64 128 256 512
0
0.1
0.2
0.3
0.4
NCSC=256, R=0.6142,T3=30, 11101011000011
NCSC,Mode=2 NCSC,Mode=8
NCSC,Mode=64 NCSC,Mode=512
30
Figure 2.7 | Mean square error of the mode of the posterior distribution of NCSC with
respect to varying age and true NCSC. The tumors were generated using MER=0.004,
DER=0.002, R=0.8624, and with an ancestral state of 11001001100100.
A full simulation study is shown in Figure 2.7. In contrast to results for estimation of
ancestral type, the estimation of NCSC is not very sensitive to age when NCSC is small. However,
estimation of NCSC (in absolute terms) becomes more difficult when NCSC is high, particularly
for old tumors.
2.5.2 Analysis of experimental data
Having benchmarked the performance of our method using the analysis of simulated data,
for which answers are known, we move to an analysis of the experimental data described in
section 2.2. We begin by focusing on inference of ancestral state (see Figure 2.8). The estimates
show that the ancestral state of the first 5-6 sites of the LOC tag are most likely to be methylated
for each patient. The signal regarding state is weakest in CN and CR, suggesting that those
tumors are older (see Figure 2.8 and Figure 2.10 ). These outcomes support our ability to recover
information regarding the ancestral state of tumors, but with the important caveats noted earlier: a)
that while this is possible for tumors that have divided, broadly speaking, up to 100 times, this
ability is lost for older tumors; b) inference, when possible, is accurate but relatively weak.
1
2
3
4
5
6
7
8
9
10
0
50
100
150
200
250
300
-5
0
5
10
15
NCSC (log)
MSE of Mode NCSC
Age
MSE
0
2
4
6
8
10
12
31
Figure 2.8 | Ancestral inference for experimental data. The x-axis is the CpG site, the y-axis is
the probability of each site being methylated in the ancestral cell.
The fact that DNA methylation increases with age suggests that MER is higher than DER
[41]. As shown in Figure 2.9, we do observe that the posterior distribution of MER lies to the
right of that for DER. The joint posterior distributions of MER and DER again show the
confounding that we observed for the analysis of simulated data (see Figure 2.10). These results
suggest that CN and CR are older than the other tumors (c.f. the discussion of Figure 2.4c). Both
MER and DER vary between tumors, suggesting either a patient-specific profile of tumor
progression or that the tumors are of significantly different ages. The variability of MER is much
larger than DER, but this may simply reflect that, as noted earlier, posterior variance increases
with magnitude of MER (DER). The estimated NCSC values also show a patient specific pattern
(see Figure 2.11). Tumors CN, CR, CS and CU have very small numbers of cancer stem cells.
The remaining tumors appear to have many more cancer stem cells in each gland. The posterior
distributions of PAD and T3 for these tumors are shown in Figure 2.16 (Supp. Mats, section
2.6.2).
1 2 3 4 5 6 7 8 9 10 11 12 13 14
0
0.2
0.4
0.6
0.8
CN
1 2 3 4 5 6 7 8 9 10 11 12 13 14
0
0.2
0.4
0.6
0.8
CO
1 2 3 4 5 6 7 8 9 10 11 12 13 14
0
0.2
0.4
0.6
0.8
CP
1 2 3 4 5 6 7 8 9 10 11 12 13 14
0
0.2
0.4
0.6
0.8
CR
1 2 3 4 5 6 7 8 9 10 11 12 13 14
0
0.2
0.4
0.6
0.8
CS
1 2 3 4 5 6 7 8 9 10 11 12 13 14
0
0.2
0.4
0.6
0.8
CT
1 2 3 4 5 6 7 8 9 10 11 12 13 14
0
0.2
0.4
0.6
0.8
CU
1 2 3 4 5 6 7 8 9 10 11 12 13 14
0
0.2
0.4
0.6
0.8
CW
1 2 3 4 5 6 7 8 9 10 11 12 13 14
0
0.2
0.4
0.6
0.8
CX
32
Figure 2.9 | Marginal posterior distribution of the DNA methylation and demethylation
error rate in the experimental data.
Figure 2.10 | Joint posterior distribution of the DNA methylation and demethylation error
rate in the experimental data. Shown is a top view of the distribution; the color map shows the
density.
0 0.02 0.04 0.06 0.08 0.1
0
0.2
0.4
0.6
0.8
CN
0 0.02 0.04 0.06 0.08 0.1
0
0.1
0.2
0.3
0.4
CO
0 0.02 0.04 0.06 0.08 0.1
0
0.1
0.2
0.3
0.4
CP
0 0.02 0.04 0.06 0.08 0.1
0
0.5
1
CR
0 0.02 0.04 0.06 0.08 0.1
0
0.1
0.2
0.3
0.4
CS
0 0.02 0.04 0.06 0.08 0.1
0
0.1
0.2
0.3
0.4
CT
0 0.02 0.04 0.06 0.08 0.1
0
0.5
1
CU
0 0.02 0.04 0.06 0.08 0.1
0
0.1
0.2
0.3
0.4
CW
0 0.02 0.04 0.06 0.08 0.1
0
0.1
0.2
0.3
0.4
CX
MER,Mean=0.0427
DER,Mean=0.0046
MER,Mean=0.0114
DER,Mean=0.0050
MER,Mean=0.0181
DER,Mean=0.0060
MER,Mean=0.0350
DER,Mean=0.0022
MER,Mean=0.0277
DER,Mean=0.0055
MER,Mean=0.0054
DER,Mean=0.0034
MER,Mean=0.0136
DER,Mean=0.0027
MER, Mean=0.0118
DER,Mean=0.0045
MER,Mean=0.0088
DER,Mean=0.0035
33
2.6 Discussion
Tumorigenesis is a complex process that requires considerable effort to decipher. This is
particularly true since we typically observe data from a single time-point at the end of tumor
growth, rather than being able to watch the tumor as it grows (at least in human subjects). Here,
we presented a model and an analysis method that can be used to study the ancestral state, the
methylation error rate, and the number of the cancer stem cells in individual tumors. The former
is of particular importance because it might contain important prognostic flags for future tumor
behavior.
Figure 2.11 | The posterior distributions of the number of cancer stem cells in experimental
data. The x-axis is the number of cancer stem cells.
In previous papers, researchers derived point estimates of a given parameter by fixing
other parameters based on prior knowledge [1]. The results inevitably depend on the quality of
the prior knowledge that is used. In this paper, we presented an ABC scheme that allows us to
obtain inference on the model parameters simultaneously when desired. Of course, some
2 4 8 16 32 64 128256512
0
0.1
0.2
0.3
0.4
CN
2 4 8 16 32 64 128256512
0
0.05
0.1
0.15
0.2
CO
2 4 8 16 32 64 128256512
0
0.1
0.2
0.3
0.4
CP
2 4 8 16 32 64 128256512
0
0.1
0.2
0.3
0.4
CR
2 4 8 16 32 64 128256512
0
0.1
0.2
0.3
0.4
CS
2 4 8 16 32 64 128256512
0
0.05
0.1
0.15
0.2
CT
2 4 8 16 32 64 128256512
0
0.2
0.4
0.6
0.8
CU
2 4 8 16 32 64 128256512
0
0.05
0.1
0.15
0.2
CW
2 4 8 16 32 64 128256512
0
0.05
0.1
0.15
0.2
CX
NCSC,Mode=2
NCSC,Mode=64 NCSC,Mode=2
NCSC,Mode=2 NCSC,Mode=2 NCSC,Mode=64
NCSC,Mode=2 NCSC,Mode=64 NCSC,Mode=64
34
parameters are harder to estimate than others, because they leave less of signal in the observed
experimental data. Furthermore, the ability to estimate a feature of tumor growth may depend
upon the age of the tumor. For example, we found that the ability to infer likely methylation
status of the ancestral cell decreases as the tumor ages (and, in essence, forgets where it started
from).
Parameter confounding is also an issue. If we have two tumors, A and B, with A being
twice as old as B, but with MER/DER rates that are half as high as those for B, then the properties
of the final patterns of variation will be very similar. (This is related to the concept of non-
identifiability in statistics.) Consequently, we propose that the measure of age most relevant here
is to think of the expected number of changes to methylation status, say. This is the product of the
methylation/demethylation rate and the number of cell divisions.
Our analysis shows that we can obtain informative posterior distributions for methylation
error rate and the ancestral state for experimentally observed tumor data, provided the tumors are
not so old that information regarding ancestor is lost. Therefore, we have shown that, even though
we cannot observe the early stages of tumor growth directly, important features of that growth
may be inferred by careful analysis of data collected from the tumor at a later time.
The cancer stem cell, being a long-lived dividing cell, is the primary engine driving the
tumor to develop. Therefore, understanding how many cancer stem cells there are in a tumor will
reveal tremendous amount of information on tumor progression. Experimentally, researchers
perform xenotransplantation using markers CD133
+
to count the NCSC. In our study, the NCSC
is retrieved through careful analysis of data under a model for the tumor’s natural history. Our
analysis show less than 2.5% of cells are likely to be cancer stem cells (the mode of the posterior
distribution divided by the number of cells in one gland). This confirms experimental results [42].
35
However, within this constraint, the NCSC appears to vary across tumors (see Figure 2.11). In
particular, our experimental data appears to fall into two classes: one for which the NCSC appears
to be most likely to take a very low value (2); another in which higher values are best supported
(32-128). This suggests different organization patterns and progressiveness may exist in different
tumors. Siegmund, Marjoram et al. 2011 showed that MMR deficient tumors (CR and CU) have
low NCSC [9]. This is confirmed by our analysis that CR and CU's posterior distributions are
concentrated on small NCSC.
One interesting statistical question arises during an analysis such as this: which summary
statistics should be used for the analysis? In this paper we have presented results for particular
choices of statistics that were most informative for the particular problem at hand. However, if
statistics are chosen poorly, quality of inference will suffer. We discuss this at greater length in
Section 2.7 Supplemental Material for the case of inference of ancestral state. Development of
rigorous procedures for choice of statistics, as a function of analysis question at hand, is an area
of active research [2,40,43-45].
In summary, the parameters of tumor growth, and the early stages of tumor growth play
an important role in the patterns of variation seen in tumors. They may also carry important
prognostic information. This has been hard to assess since the early stages of tumor growth are
not typically observed. Our results show that with careful statistical analysis we can obtain
estimates of both tumor growth parameters, (captured in terms of full posterior distribution rather
than simple point estimates), and of the initial methylation status at loci (which is impossible to
observe directly). Our hope is that the ability to do this may open new avenues by which
understanding of tumor growth, and future behavior, might be better understood.
36
2.7 Additional supplemental results
2.7.1 Estimation of DNA (de)methylation error rates:
Figure 2.12 presents additional illustrative examples demonstrating the ability to estimate
DNA methylation error rate (MER), α, and demethylation error rate (DER), β. Here we use a
variety of representative parameter sets, assuming that the true MER and DER are small and the
tumors are young.
Figure 2.13 shows that the performance of the posterior distributions of MER and DER
varies with different true MER, α, and DER, β, values. Two illustrative samples are shown. When
α=0.01 and β=0.005 (see Figure 2.13a), we obtain good estimates for both error rates when the
tumor is very young. However, when we increases α and β to α=0.05 and β=0.03 (see Figure
2.13b), the estimated posterior distributions are all significantly deviated from the truth. This is a
product of the non-identifiability we discussed earlier (inference is better when considering the
ratio MER/DER).
37
(a)
(b)
Figure 2.12 | Posterior Distributions of MER and DER. (a). Posterior Distributions of MER
and DER for four tumors generated using α=0.005, β=0.002, NCSC=256,
Ancestor=01111101100100, T3=80 and R=0.5. (b). Posterior Distributions of MER and DER for
tumors generated using α=0.005, β=0.002, and an illustrative variety of other parameter values
(shown in caption to each plot).
0 0.02 0.04 0.06 0.08 0.1
0
0.2
0.4
#01
0 0.02 0.04 0.06 0.08 0.1
0
0.2
0.4
#02
0 0.02 0.04 0.06 0.08 0.1
0
0.2
0.4
#03
0 0.02 0.04 0.06 0.08 0.1
0
0.2
0.4
#04
MER,mean=0.0055
DER,mean=0.0023
MER,mean=0.0061
DER,mean=0.0025
MER,mean=0.0050
DER,mean=0.0026
MER,mean=0.0057
DER,mean=0.0026
0 0.02 0.04 0.06 0.08 0.1
0
0.5
1
NCSC=4, R=0.8614,T3=45, 11101111110101
0 0.02 0.04 0.06 0.08 0.1
0
0.5
1
NCSC=16, R=0.7514,T3=49, 01010111011100
0 0.02 0.04 0.06 0.08 0.1
0
0.5
1
NCSC=128, R=0.7003,T3=51, 11111010000110
0 0.02 0.04 0.06 0.08 0.1
0
0.5
1
NCSC=256, R=0.6142,T3=30, 11101011000011
MER,Mean=0.0096
DER,Mean=0.0008
MER,Mean=0.0060
DER,Mean=0.0026
MER,Mean=0.0081
DER,Mean=0.0018
MER,Mean=0.0060
DER,Mean=0.0017
38
(a)
(b)
Figure 2.13 | Posterior Distributions of MER and DER with different true MER and DER
values. NCSC=256, ancestor=01111101100100 and T3=10. (a). α=0.01, β=0.005. (b). α=0.05,
β=0.03.
As tumors age, the level of DNA methylation seen in them is, to a large degree, a
consequence of the relative rates of methylation and demethylation (an example of statistical non-
identifiability). This is because the behavior of the tumor has, from a statistical perspective,
begun to approach stationarity (and has essentially lost all memory of the methylation status of
the initial tumor cell). We showed this property earlier, where we plotted illustrative examples of
the joint posterior distribution for α and β. Consequently, marginal posterior distributions of
single tumors will be somewhat misleading. If the joint distribution forms a line as shown in
Figure 2.4c, the tumor has reached stationary phase and the marginal distribution of either MER
0 0.02 0.04 0.06 0.08 0.1
0
0.4
0.8
#01
0 0.02 0.04 0.06 0.08 0.1
0
0.4
0.8
#02
0 0.02 0.04 0.06 0.08 0.1
0
0.4
0.8
#03
0 0.02 0.04 0.06 0.08 0.1
0
0.4
0.8
#04
M,mean=0.0128
DM,mean=0.0034
M,mean=0.0116
DM,mean=0.0042
M,mean=0.0084
DM,mean=0.0050
M,mean=0.0100
DM,mean=0.0048
0 0.02 0.04 0.06 0.08 0.1
0
0.2
0.4
#01
0 0.02 0.04 0.06 0.08 0.1
0
0.2
0.4
#02
0 0.02 0.04 0.06 0.08 0.1
0
0.2
0.4
#03
0 0.02 0.04 0.06 0.08 0.1
0
0.2
0.4
#04
M,mean=0.0700
DM,mean=0.0388
M,mean=0.0685
DM,mean=0.0386
M,mean=0.0707
DM,mean=0.0385
M,mean=0.0582
DM,mean=0.0373
39
or DER is uninformative. However, the ratio of MER and DER is now estimable. On the contrary,
when the joint distribution of MER and DER clumps to a rounded shape, as shown in Figure 2.14,
estimates of MER and DER are of good quality, as indicated by Figure 2.4a.
Figure 2.14 | The joint distribution of DNA methylation error rate and demethylation error
rate for Figure 2.4a. Darker shading corresponds to higher posterior probability.
2.7.2 Estimation of asymmetric division rate and age
Figure 2.15a illustrates that we could not obtain a good posterior distributions for
probability of asymmetric division (PAD,R). Estimation of T3 also appears to perform poorly on
occasion; see illustrative examples in Figure 2.15b. However, as is discussed earlier, MER/DER
are confounded with age, thus resulting in marginal distributions which are often misleading.
40
(a)
(b)
Figure 2.15 | Posterior distribution of the probability of asymmetric division and age. The
generating parameters are shown above each figure. Dotted lines show the true parameter values
with MER=0.005 and DER=0.002. (a) Posterior distribution of the probability asymmetric
division. (b) Posterior distribution of the tumor age (T3).
We show the posterior distributions for T3 and PAD, the joint posterior distribution of
MER and DER in Figure 2.16.
0.5 0.6 0.7 0.8 0.9 1
0
0.05
0.1
CSC=64, R=0.9171,T3=328, 11010101101110
0.5 0.6 0.7 0.8 0.9 1
0
0.05
0.1
CSC=256, R=0.6877,T3=176, 01100101110100
0.5 0.6 0.7 0.8 0.9 1
0
0.05
0.1
CSC=4, R=0.869,T3=123, 11101111000010
0.5 0.6 0.7 0.8 0.9 1
0
0.05
0.1
0.15
0.2
CSC=4, R=0.713,T3=286, 01110110000111
PAsyD,Mean=0.77437 PAsyD,Mean=0.77055
PAsyD,Mean=0.75579 PAsyD,Mean=0.64898
0 100 200 300
0
0.005
0.01
CSC=64, R=0.9171,T3=328, 11010101101110
0 100 200 300
0
0.01
0.02
0.03
0.04
CSC=256, R=0.6877,T3=176, 01100101110100
0 100 200 300
0
0.005
0.01
CSC=4, R=0.869,T3=123, 11101111000010
0 100 200 300
0
0.005
0.01
CSC=4, R=0.713,T3=286, 01110110000111
T3, Mode=220 T3, Mode=14
T3, Mode=127 T3, Mode=303
41
(a)
(b)
(c)
Figure 2.16| Posterior distribution of different parameters for the 9 real tumors. (a) T3. (b)
Joint distribution of MER and DER. (c) Probability of asymmetric division (R).
0 100 200 300
0
2
4
6
x 10
-3 CN
0 100 200 300
0
0.02
0.04
0.06
CO
0 100 200 300
0
0.02
0.04
0.06
CP
0 100 200 300
0
0.005
0.01
CR
0 100 200 300
0
0.005
0.01
0.015
CS
0 100 200 300
0
0.02
0.04
CT
0 100 200 300
0
0.005
0.01
CU
0 100 200 300
0
0.02
0.04
0.06
CW
0 100 200 300
0
0.02
0.04
CX
T3, Mode=172 T3, Mode=12 T3, Mode=12
T3, Mode=225 T3, Mode=23 T3, Mode=12
T3, Mode=42 T3, Mode=13 T3, Mode=12
0.5 0.6 0.7 0.8 0.9
0
0.05
0.1
CN
0.5 0.6 0.7 0.8 0.9
0
0.02
0.04
0.06
CO
0.5 0.6 0.7 0.8 0.9
0
0.05
0.1
CP
0.5 0.6 0.7 0.8 0.9
0
0.05
0.1
CR
0.5 0.6 0.7 0.8 0.9
0
0.05
0.1
CS
0.5 0.6 0.7 0.8 0.9
0
0.05
0.1
CT
0.5 0.6 0.7 0.8 0.9
0
0.05
0.1
CU
0.5 0.6 0.7 0.8 0.9
0
0.05
0.1
CW
0.5 0.6 0.7 0.8 0.9
0
0.05
0.1
CX
PAsyD,Mean=0.71444 PAsyD,Mean=0.76431 PAsyD,Mean=0.71498
PAsyD,Mean=0.70295 PAsyD,Mean=0.68401 PAsyD,Mean=0.76424
PAsyD,Mean=0.66824 PAsyD,Mean=0.76414 PAsyD,Mean=0.76097
42
2.7.3 Choice of summary statistics
(a)
(b)
Figure 2.17 | Difference in the posterior distributions on the same data set when using
different weights on different summary statistics. (a). Not including the statistic recording the
sitewise percentage of DNA methylation. (b). Including the statistic recording the sitewise
percentage of DNA methylation.
1 2 3 4 5 6 7 8 9 1011121314
0
0.2
0.4
0.6
0.8
CSC=4, R=0.8614,T3=45, 11101111110101
1 2 3 4 5 6 7 8 9 1011121314
0
0.2
0.4
0.6
0.8
CSC=16, R=0.7514,T3=49, 01010111011100
1 2 3 4 5 6 7 8 9 1011121314
0
0.2
0.4
0.6
0.8
CSC=128, R=0.7003,T3=51, 11111010000110
1 2 3 4 5 6 7 8 9 1011121314
0
0.2
0.4
0.6
0.8
CSC=256, R=0.6142,T3=30, 11101011000011
1 2 3 4 5 6 7 8 9 1011121314
0
0.2
0.4
0.6
0.8
CSC=4, R=0.8614,T3=45, 11101111110101
1 2 3 4 5 6 7 8 9 1011121314
0
0.2
0.4
0.6
0.8
CSC=16, R=0.7514,T3=49, 01010111011100
1 2 3 4 5 6 7 8 9 1011121314
0
0.2
0.4
0.6
0.8
CSC=128, R=0.7003,T3=51, 11111010000110
1 2 3 4 5 6 7 8 9 1011121314
0
0.2
0.4
0.6
0.8
CSC=256, R=0.6142,T3=30, 11101011000011
43
The choice of summary statistics is an important consideration in an ABC analysis. The
statistics used must be informative regarding the parameters of interest. We illustrate this using
the estimation of ancestral states. We compare results of two analyses, which either do, or do not,
include the statistic recording the sitewise percentage of methylation (see table 1), which clearly
one can expect to be informative regarding ancestral state, at least for tumors that are not old
enough to have lost all memory of their starting configuration. Figure 2.17 shows that, unless we
utilize this summary statistic we are not able to get good inference of ancestral methylation status.
We note in passing that inference of MER and NCSC (for example) remains good in this example,
because other statistics capture information that is key for their inference. A literature on choice
of summary statistics in ABC analyses is beginning to appear [44,46] .
44
45
Chapter 3: Characterization of the early
divisions of colorectal tumors
This chapter describes work I undertook as part of research reported in two papers:
Kang, H., Salomon, M. P., Sottoriva, A., Zhao, J., Toy, M., Press, M. F., et al. (2015). Many
private mutations originate from the first few divisions of a human colorectal adenoma. The
Journal of Pathology, 237(3), 355–362. http://doi.org/10.1002/path.4581
Sottoriva, A., Kang, H., Ma, Z., Graham, T. A., Salomon, M. P., Zhao, J., et al. (2015). A Big
Bang model of human colorectal tumor growth. Nature Genetics, 47(3), 209–216.
http://doi.org/10.1038/ng.3214
3.1 Overview
Over the last few decades, researchers have been constantly updating their view of tumor
growth. As part of this effort, numerous new models have been proposed. However, the details of
exactly what happens during the early stage of the tumor development are still unknown, due to
lack of direct observations. As sequencing technology has become more advanced, more detailed
genomic information regarding tumors has become available, offering the chance of a finer
degree of resolution in both our data and the models that we fit using those data. These data
confirm that genomic heterogeneity in different parts of a tumor, also known as intratumor
heterogeneity (ITH), is widely observed. Here, I present work I undertook as part of a study that
reconstructs the early divisions of a colorectal adenoma, which is referred to as tumor K in Kang,
46
et al. 2015, based on genomic data that are collected from different glands of spatially separated
regions in that tumor. Parts of this study appear in Kang, 2015. Specifically, first I looked at how
observed allele frequency needs to be understood in terms of the copy number of the
chromosomal region upon which that allele is found. And then, second, I show how those
observed allele frequencies are “quantum” in nature, and that these quantal values reflect when
those mutations, or copy number alterations, might have occurred. I then draw connections from
this analysis relevant to the big bang model for colorectal tumor growth which was reported in
Sottoriva, et al, 2015.
3.2 Introduction
A tumor is thought to be the final product of the clonal expansion of the first transformed
cell. This progenitor cell carries the necessary genetic aberrations, for example, the APC mutation,
to initiate uncontrolled growth and eventually result in a lesion of several centimeters in diameter
[47]. Researchers postulate many theories to explain the abundance of genetic variations in the
final malignancy. The most prominent one is the “selective sweep” theory, which depicts tumor
growth as a stepwise accumulation of advantageous mutations [19,48,49]. This model is well
supported by experiments observing growth of established tumors [35]. As a consequence of the
occurrence of “selective sweeps”, driver mutations are expected to accumulate, and to ultimately
be found in large quantities in the final tumor. In addition, the genomic variation across different
regions of the tumor is predicted to be relatively homogeneous as only the genomic alterations
associated with the occurrence of the driver mutations survive the selection. However, this view
of tumor growth is challenged by the fact that not many driver mutations are found in a single
tumor and the genome of the tumor is spatially heterogeneous [50]. Consequently, the theory of
star-shaped phylogeny, or trunk-branch phylogeny, has emerged as a way of describing colorectal
tumor growth [51-53].
47
Researchers study ITH by looking at the genomic information contained in bulk tissues
located in different regions of a tumor [51,54]. However, these bulk tissues are normally a
heterogeneous mixture of tumor cells and adjacent normal cells, a mixture that undermines the
reliability of the sample and the analysis/conclusion associated with it. To tackle this obstacle,
colorectal tumors, which have a distinct glandular structure [7,8], have been chosen in this
analysis. Such spatial organization of cells enables us to isolate single glands, which are nearly
homogenous and which have minimal contamination from normal tissues (<5%), for experiments
such as sequencing and SNP microarray. The purity of the sample simplifies the statistical
complications that would arise if we had a mixture of normal and cancer tissue. The glandular
structure also enables us to simulate the tumor growth process in a more well-defined geospatial
compartmental sub-unit, because the offspring cells will stay adjacent after cell division within a
gland. There are about 10,000 cells in a gland, which makes it possible to study and simulate
tumor growth at the level of individual cells in a single gland.
The ITH revealed by multiregional sampling is the product of the evolutionary process of
tumor growth [1-3]. Hence, the ancestral information of the tumor, especially the early events
during the tumor development after the transformation, can be uncovered by analyzing the ITH.
Any genomic alterations, such as DNA mutations and copy number alterations, found in all cells
of the tumor (here referred to as “public”), must have been initiated before the transformation of
the first cell. Anything found only in some regions of the tumor (here referred to as “private”)
must have taken place after the transformation of the first cell. Therefore, the clear distinction
between private and public changes provide us with the information needed to deconvolute the
timing of events in tumor development. Additionally, the distribution of a private change across
different regions of a tumor sheds light on the motility of the tumor cells after the initiation of that
private change, because the absence of motility would result in clustering of the cells with that
48
change in a confined region rather than a dispersion through the tumor. By examining the ITH,
one should be able to acquire a view of what might have happened during tumor growth.
3.3 Data and methods
Bulk tissues (about 0.5 cm
3
) were taken from both tumor halves, then EDTA solution was
used to wash out individual glands. A number of glands were sampled for each experiment and,
in addition, leftover glands from each side were collected as a bulk specimen.
3.3.1 Exome sequencing for bulk tissue
The DNA sample extracted from bulk tissue was sequenced using Illumina TruSeq
Exome Enrichment Kit, HiSeq 2000 platform, using 100 bp paired end reads. Mutations in
tumors were called against the associated normal tissues using MuTect [55].
3.3.2 SNP microarray
The DNA samples extracted from single glands and bulk samples were analyzed using
SNP microarrays (Human OmniExpress,∼700,000 SNPs) with standard Illumina protocols and
using GenomeStudio. The quality threshold was set to be 0.15. Data with SNP microarray call
rates lower than 90% were discarded. The allelic intensity values from GenomeStudio were
quantile-normalized with a threshold of 1.5 and filtered using normal adjacent colon tissue for a
patient-specific reference [56]. The normalized data were split into regions with equal copy
number (CN) using psCBS [57], which analyzes paired tumor and normal gland from the same
patient and then outputs regional CN and mBAF (“B allele frequency”) estimates, denoted as
49
CN
ps
and mBAF
ps
respectively, for each segment. We then applied further smoothing to CN
ps
and mBAF
ps
by combining any adjacent segments across which CN
ps
and mBAF
ps
copy number
did not appear to change significantly.
Figure 3.1 | Multiregional sampling scheme. A tumor is divided into two halves, side A and
side B. Each dot represents a gland. We sample glands at distal regions. Bulk tissues for both
halves are also sampled in distal regions for each experiment. The SNP microarray and AmpliSeq
data are the focus of this chapter.
3.3.3 AmpliSeq
Targeted re-sequencing (AmpliSeq) was performed on the DNA extracted from single
glands with PCR and Ion Torrent sequencing. Data with poor quality calls (Q<20) were discarded.
Mutations with allele frequencies, which were calculated as the number of mutant calls divided
by total calls, less than the averaged background allele frequency of non-mutated loci were
filtered out.
50
3.4 Results
In Chapter 2, I utilized ABC to make inference on all parameters in our tumor growth
model simultaneously in a study of ITH of DNA methylation patterns in colorectal tumors [11].
Possibly due to the relatively high turnover rate (methylation error rate and demethylation error
rate) in DNA methylation patterns, and the limited amount of DNA methylation data collected,
some resolution regarding tumor growth is lost. As a consequence, inference of tumor age and the
probability of asymmetric division failed to return very informative posterior distributions for
synthetic data. In order to better understand tumor growth and overcome these limitations, we
here collected various other types of data, such as SNP array data (Human OmniExpress), exome
sequencing data (Illumina TruSeq Exome Enrichment Kit), as described above, and targeted
sequencing data (AmpliSeq, IonTorrent sequencing), from multiple regions of tumors (both
single gland and bulk tissue). The sampling scheme is illustrated in Figure 4.1.
3.4.1 Mutation patterns
I analyzed the mutations found in the exome sequencing data and categorized them into
three categories: pubic mutations, that are found on both tumor sides; private A mutations, that
are only found in the bulk tissue of side A; and private B mutations, that are found only in the
bulk tissue of side B. I then selected a subset of the mutations identified using the exome
sequencing data for further targeted sequencing (referred to here as the “AmpliSeq data”) in 6
randomly sampled glands per tumor half. These mutations were then classified into four
categories (see Figure 3.2):
a public mutation, i.e. one that is found in all glands,
51
a variegated mutation, i.e. one that is found on both side A and B but not
in all glands,
a side A specific mutation, i.e. one that is only found in some (possibly all)
glands in side A, but not in side B,
a side B specific mutation, i.e. one that is only found in some (possibly all)
glands in side B, but not in side A.
Non-public mutations are also referred as private mutations here. The intuition behind
these categories is as follows. A public mutation, since it is found in all glands on both side of the
tumor, is likely to be present universally in all cells. Therefore, if we trace back the phylogenic
tree for that tumor, we will find that the mutation was present in the first transformed cell, which
is the ancestor of the tumor. A side-specific mutation, because it is only discovered in some
glands of one side of the tumor, could not have been present in the first cell, and so must have
appeared during subsequent tumor development. The exact time at which such a mutation was
introduced into the tumor is unknown.
52
Figure 3.2 | Mutation allele frequencies and categorization of AmpliSeq data for Tumor K.
Tumor K is an adenoma that is a benign tumor. No variegated mutation was found in tumor K.
chr position KA1 KA2 KA3 KA4L KA5 KA6 KB2L KB4 KB5 KB6 KB7 KB9
2 233321937 0.672 0.649 0.666 0.664 0.660 0.670 0.497 0.506 0.512 0.498 0.456 0.533
3 16268895 0.326 0.334 0.328 0.299 0.317 0.342 0.476 0.468 0.503 0.461 0.467 0.501
5 661368 0.677 0.718 0.730 0.768 0.528 0.633 0.709 0.731 0.709 0.726 0.517 0.710
5 112173656 0.955 0.946 0.996 0.929 0.981 0.952 0.992 0.976 0.991 0.961 0.986 0.960
6 49421407 0.344 0.340 0.357 0.334 0.282 0.400 0.478 0.480 0.516 0.498 0.445 0.448
13 46287452 0.499 0.465 0.537 0.519 0.467 0.469 0.330 0.338 0.332 0.328 0.411 0.380
14 104501374 0.662 0.638 0.694 0.660 0.665 0.630 0.468 0.494 0.512 0.500 0.560 0.436
15 34117848 0.344 0.333 0.331 0.328 0.358 0.357 0.533 0.493 0.457 0.512 0.532 0.502
17 19459211 0.637 0.657 0.673 0.628 0.667 0.726 0.447 0.488 0.460 0.475 0.545 0.418
19 17769024 0.674 0.665 0.669 0.658 0.664 0.594 0.470 0.498 0.489 0.500 0.510 0.494
19 56934487 0.303 0.288 0.294 0.297 0.304 0.326 0.476 0.453 0.484 0.443 0.529 0.444
20 43726629 0.458 0.439 0.379 0.414 0.434 0.625 0.660 0.614 0.646 0.580 0.654 0.543
21 41710104 0.463 0.476 0.508 0.482 0.499 0.502 0.495 0.502 0.501 0.494 0.528 0.513
21 42770835 0.497 0.483 0.492 0.488 0.480 0.484 0.510 0.471 0.469 0.469 0.434 0.570
1 53975644 0.353 0.320 0.000 0.000 0.356 0.000 0.000 0.000 0.000 0.000 0.000 0.000
1 110716715 0.645 0.656 0.668 0.639 0.620 0.593 0.000 0.000 0.000 0.000 0.000 0.000
1 243493893 0.333 0.328 0.000 0.000 0.308 0.000 0.000 0.000 0.000 0.000 0.000 0.000
2 198950324 0.638 0.631 0.661 0.610 0.677 0.677 0.000 0.000 0.000 0.000 0.000 0.000
3 10015366 0.306 0.325 0.319 0.324 0.282 0.368 0.000 0.000 0.000 0.000 0.000 0.000
3 53804564 0.639 0.666 0.685 0.617 0.637 0.674 0.000 0.000 0.000 0.000 0.000 0.000
4 125591644 0.306 0.302 0.262 0.243 0.293 0.346 0.000 0.000 0.000 0.000 0.000 0.000
5 56181850 0.353 0.321 0.309 0.296 0.303 0.336 0.000 0.000 0.000 0.000 0.000 0.000
5 93820565 0.000 0.000 0.350 0.000 0.000 0.303 0.000 0.000 0.000 0.000 0.000 0.000
6 76660584 0.337 0.330 0.338 0.326 0.332 0.278 0.000 0.000 0.000 0.000 0.000 0.000
8 132051723 0.000 0.000 0.342 0.000 0.000 0.182 0.000 0.000 0.000 0.000 0.000 0.000
9 117846547 0.000 0.000 0.227 0.250 0.000 0.210 0.000 0.000 0.000 0.000 0.000 0.000
10 75148164 0.651 0.643 0.681 0.673 0.654 0.630 0.000 0.000 0.000 0.000 0.000 0.000
11 5989098 0.201 0.221 0.000 0.000 0.219 0.000 0.000 0.000 0.000 0.000 0.000 0.000
11 48266847 0.341 0.316 0.313 0.325 0.308 0.352 0.000 0.000 0.000 0.000 0.000 0.000
11 62371433 0.317 0.341 0.346 0.333 0.309 0.298 0.000 0.000 0.000 0.000 0.000 0.000
11 78600872 0.343 0.342 0.351 0.336 0.363 0.344 0.000 0.000 0.000 0.000 0.000 0.000
11 118403748 0.345 0.351 0.343 0.331 0.294 0.308 0.000 0.000 0.000 0.000 0.000 0.000
12 50754633 0.000 0.000 0.319 0.000 0.000 0.328 0.000 0.000 0.000 0.000 0.000 0.000
15 65499273 0.679 0.643 0.670 0.745 0.776 0.805 0.000 0.000 0.000 0.000 0.000 0.000
17 79207479 0.680 0.691 0.663 0.655 0.605 0.724 0.000 0.000 0.000 0.000 0.000 0.000
19 15052783 0.000 0.000 0.333 0.000 0.000 0.292 0.000 0.000 0.000 0.000 0.000 0.000
19 42791356 0.000 0.000 0.325 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
20 32369107 0.000 0.000 0.242 0.000 0.000 0.255 0.000 0.000 0.000 0.000 0.000 0.000
1 15668234 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.494 0.468 0.465 0.368 0.554
2 136575511 0.000 0.000 0.000 0.000 0.000 0.000 0.425 0.497 0.487 0.483 0.493 0.489
2 182542671 0.000 0.000 0.000 0.000 0.000 0.000 0.542 0.519 0.523 0.487 0.465 0.506
4 173232894 0.000 0.000 0.000 0.000 0.000 0.000 0.388 0.490 0.543 0.510 0.618 0.372
5 94865823 0.000 0.000 0.000 0.000 0.000 0.000 0.508 0.518 0.478 0.494 0.512 0.538
5 148619355 0.000 0.000 0.000 0.000 0.000 0.000 0.489 0.572 0.495 0.514 0.591 0.529
6 130761734 0.000 0.000 0.000 0.000 0.000 0.000 0.575 0.522 0.507 0.485 0.519 0.580
7 2472446 0.000 0.000 0.000 0.000 0.000 0.000 0.333 0.349 0.330 0.335 0.254 0.314
7 28547272 0.000 0.000 0.000 0.000 0.000 0.000 0.354 0.339 0.316 0.330 0.336 0.345
7 142569487 0.000 0.000 0.000 0.000 0.000 0.000 0.319 0.340 0.348 0.324 0.359 0.402
8 61765797 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.499 0.503 0.487 0.493 0.533
8 67488411 0.000 0.000 0.000 0.000 0.000 0.000 0.533 0.507 0.528 0.505 0.477 0.532
10 5957464 0.000 0.000 0.000 0.000 0.000 0.000 0.447 0.488 0.513 0.485 0.536 0.462
11 64537849 0.000 0.000 0.000 0.000 0.000 0.000 0.509 0.503 0.495 0.478 0.428 0.455
11 66627607 0.000 0.000 0.000 0.000 0.000 0.000 0.338 0.406 0.459 0.402 0.325 0.484
12 48736982 0.000 0.000 0.000 0.000 0.000 0.000 0.514 0.488 0.511 0.462 0.533 0.528
14 88431966 0.000 0.000 0.000 0.000 0.000 0.000 0.352 0.347 0.389 0.295 0.423 0.383
20 62196367 0.000 0.000 0.000 0.000 0.000 0.000 0.350 0.295 0.354 0.312 0.354 0.309
S
i
d
e
A
S
i
d
e
B
P
u
b
l
i
c
Type
V
a
r
i
e
g
a
t
e
d
V
a
r
i
e
g
a
t
e
d
53
Interestingly for the sample we study in this chapter, tumor K, which is an adenoma that
is a benign tumor of epithelial tissue, the majority of the allele frequencies of public mutations are
around 0.33, 0.5 and 0.67, and this is consistent across different glands in a half at a given locus
(Figure 3.3A and Figure 3.3B). Table 3.1 shows the possible allele frequencies assuming copy
number (CN) and genotype are known. Since a public mutation is, by definition, universal and
present in all cells, its allele frequency in a given gland can be calculated based on Table 3.1 if
the CN is an integer and known. Surprisingly, the measured allele frequency of public mutations
from AmpliSeq data is perfectly in accordance with integer CN. For example, the frequency 0.33
corresponds to CN=3. This suggests that the gland has an integer CN on average. Furthermore, if
we observe an integer CN after averaging over thousands of cells in a gland, it is highly probable
that most of the cells in the gland have the same CN. Consistency of allele frequencies among
different glands in a tumor half supports the idea that the CN is consistent in these glands, and
this shared feature among different glands suggests that the copy number alterations (CNA)
happened very early, perhaps in the very first few divisions (otherwise, the chance of observing
consistent CNA among glands from different regions of a tumor half is extremely low). The allele
frequency of side-specific mutations also exhibit similar pattern for tumor K.
54
Figure 3.3 | Mutations in AmpliSeq data. (A) The allele frequencies of public mutations in
different glands harvested from side A of tumor K. (B) The allele frequencies of public mutations
in different glands harvested from side B of tumor K. (C) The mBAF of public mutations and side
A specific mutations on the same chromosome are plotted for different chromosomes. (D) The
mBAF of public mutations and side B specific mutations on the same chromosome are plotted for
different chromosomes.
3.4.2 Calling integer copy numbers
Without knowledge of the true CN, it is more difficult to study private mutations. As
discussed above, we know that a public mutation is universally present in all cells. Unlike public
mutations, private mutation are not universal in the tumor. Thus, direct study of private
mutations without knowledge of CN may lead to false conclusions. For example, suppose a
mutation in a gland has allele frequency 0.33. If it is a public mutation, we can conclude that the
gland is clonal, with CN=3, due to its universal presence in all cells. But if it is a private mutation,
we cannot draw the same conclusions, because we do not know the CN. For example, there are
0 5 10 15
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Different public mutation locus
Allele frequecy
KA1
KA2
KA3
KA4L
KA5
KA6
0 5 10 15
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Different public mutation locus
Allele frequecy
KB2L
KB4
KB5
KB6
KB7
KB9
A B
C D
0 5 10 15 20
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Chromosome
mBAF
Public
Side A specific
0 5 10 15 20
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Chromosome
mBAF
Public
Side B Specific
55
several possibilities that might explain that observation: (a) the CN of all the cells in this gland is
2 (no loss of heterozygosity), but only 66.7% of the cells have this mutation; (b) all the cells in
this gland have this mutation and the CN is 3 for every cell. Both case (a) and (b) result in a
mutation allele frequency of 0.33. Thus, we cannot call allele frequency without knowing the
copy number of the chromosomal piece upon which that mutation is found. However, despite this,
we can still use chromosomal context when comparing public and private mutations. Namely,
two mutations on the same chromosomal segment, regardless of whether they are public or
private, should have the same CN. This will allow us to better call allele frequency of private
mutations, since we can essentially infer the CN of that segment from the observed frequency of
the nearby public mutation (the probability that the locus of a private mutation has the same CN
as the locus of a public mutation in nearby regions on the same chromosome is high).
Copy Number Genotype B Allele Frequency (BAF) Mirrored B Allele Frequency (mBAF)
1 A 0 1
1 B 1 1
2 AA 0 1
2 AB 0.5 0
2 BB 1 1
3 AAA 0 1
3 AAB 0.33 0.33
3 ABB 0.67 0.33
3 BBB 1 1
4 AAAA 0 1
4 AAAB 0.25 0.5
4 AABB 0.5 0
4 ABBB 0.75 0.5
4 BBBB 1 1
Table 3.1 | Allele frequencies under different genotype and ploidy. mBAF=2*|BAF-0.5|.
As shown in Figure 3.3C and 3.3D, mirrored B allele frequency (mBAF) of private
mutations, which is defined as twice the absolute value of allele frequency minus 0.5, overlaps
56
with the mBAF of public mutation for chromosomes, suggesting that the two types of mutations
have similar distribution of frequency. The reason why mBAF is used here instead of BAF is that
mBAF removes variation in allele frequency caused by genotype, for example, AAB and ABB
have same mBAF but different BAF. This is extremely useful when we compare allele
frequencies without the knowledge of genotype. Since we know that public mutations are likely
to be universal in tumors, Figure 3.3C and 3.3D suggest that private mutations also become
universal within glands (rather than within entire tumors) in the same manner. It is worth noting
that a number of private mutations do not appear in all glands of a tumor half, but still appear to
be universal within the glands in which they do appear (see Figure 3.2). This suggests that private
mutations that occurred in early divisions eventually become clonal in glands. These findings
support the glandular fission process in tumor development, which was used in our previous
mathematical tumor growth model in Chapter 2.
Figure 3.4 | Segmentation and smoothing of SNP array data for a gland (KA6) of tumor K.
(A) The CN estimates of each segment given by psCBS before smoothing for gland KA6. (B) The
scatter plot for mBAF and CN estimates of each segments before smoothing for gland KA6. (C)
0 0.5 1 1.5 2 2.5 3
x 10
9
0
1
2
3
4
CN
ps
Before smoothing
0 0.5 1 1.5 2 2.5 3
x 10
9
0
1
2
3
4
Chromsome coordinate
CN
ps
After smoothing
1 2 3 4 5 6
0
0.2
0.4
0.6
0.8
1
CN
mBAF
Before smoothing
Quantum
PSCBS
1 2 3 4 5 6
0
0.2
0.4
0.6
0.8
1
CN
mBAF
After smoothing
Quantum
PSCBS
A
B
C D
57
The CN estimates after smoothing for gland KA6. (D) The scatter plot for mBAF and CN
estimates after smoothing for gland KA6.
As discussed earlier, in order to better understand the allele frequencies, we need
information on the CN at that locus. SNP microarray data can be used to gain insight on CN. We
did this as follows. The allelic intensity values from GenomeStudio were quantile-normalized
with a threshold of 1.5 [56]. Subsequently, the normalized data were split into regions with equal
CN using psCBS [57], which analyzes paired tumor and normal gland from the same patient and
then outputs regional CN and mBAF estimates, denoted as CN
ps
and mBAF
ps
respectively, for
each segment. We then applied further smoothing to CN
ps
and mBAF
ps
by combining segments
in which CN
ps
and mBAF
ps
copy number did not appear to change significantly (Figure 3.4). If a
given single segment has integer CN, the mBAF of this segment can only take a set of possible
values. For example, mBAF can only be 0 or 1 for a diploid, and 0.33 or 1 for a triploid, etc (see
Table 3.1). In other words, the mBAFs of all segments should fall into clusters given that the CNs
are integers. Clearly, as shown in Figure 3.4, the chromosomal segments form clusters separated
by mBAF, which suggests that we should be able to infer the integer CN from these values.
58
Figure 3.5 | Call CN estimates from SNP array data after smoothing. (A) Scatter plot of CN
estimates and mBAF after smoothing. (B) After heuristic assignment of CN according to mBAF
and weighted linear regression, the predicted CN estimates for each segment are shown to be
integer. (C) K-mean clustering on segments. (D) Fit into “quantum” for the clusters generated by
k-mean clustering.
We estimated CN using clustering algorithms. Since the segments fall into clusters as
shown in Figure 3.5A, we can employ clustering algorithms, such as k-means clustering, to first
identify each cluster (Figure 3.5C), then stretch the clusters linearly to fit into the “quantum”
positions to minimize the total distance between each cluster and its nearest “quantum” (Figure
3.5D).
As shown in Figure 3.6, the CN of glands in the two halves are different from each other.
The glands sampled from the left side (side A) of Tumor K are predominately triploid, with
potentially loss of heterozygosity (LOH) at Chromosome 5 and 19. On the other hand, the right
side (side B) is predominately diploid with LOH at Chromosome 5 and 19, suggesting that the
0 1 2 3 4 5
0
0.2
0.4
0.6
0.8
1
CN
mBAF
KA6 -- after smoothing
0 1 2 3 4 5
0
0.2
0.4
0.6
0.8
1
CN
mBAF
KA6 -weighted linear regression
0 1 2 3 4 5
0
0.2
0.4
0.6
0.8
1
CN
mBAF
KA6 -- k-mean clustering
Quantum
Cluster 1
Cluster 2
Cluster 3
Cluster 4
Center
0 1 2 3 4 5
0
0.2
0.4
0.6
0.8
1
CN
mBAF
KA6 -- k-mean clustering
A
C
B
D
59
LOH is possibly present in the first transformed cell. The homogeneity among the same tumor
half indicates that the chromosomal catastrophe probably took place at the first cell division.
Figure 3.6 | Adjusted copy number for Tumor K. x-axis is the chromosomal coordinate. Stars
indicate the locations of mutations in AmpliSeq data. Diamonds represent the locations of
mutations in bulk exome sequencing data. The lines in the top subfigure show the mirrored-
mBAF for each chromosomal segment and the ones in the bottom subfigure represent the copy
number.
3.4.3 Reconstruction of the early divisions
By combining the SNP array data and AmpliSeq data, I can qualitatively illustrate what
happened in the first few cell divisions for Tumor K. The early events that 56 distinct mutations
from Tumor K experienced are categorized into four scenarios (see Figure 3.7).
60
Figure 3.7 | Different scenarios that depict the early events for tumor K. (A) Side A of the
tumor has CN 3 for these chromosomes while the other side has CN 2. 16 chromosomes (45
mutations) fall into this category. (B) CN is 1 on both sides of the tumor. One piece of
chromosome 5 containing a driver mutation of colorectal cancer fits category B. (C) CN is 4 for
side A while CN is 3 for the other side. 7 mutations on 3 chromosomes are of this category. (D).
CN of side A is 4 while CN of side B is 2. Chromosome 9 and 21 are of this category.
61
3.5 Discussion
In this chapter, I presented a study aimed at reconstructing the early events of tumor
development for a colorectal adenoma denoted as tumor K via analyzing the mutations and copy
number alterations in different glands. The variation of the genomic alterations across different
regions of the tumor preserves the chronological information, since the spatial distributions of
those alterations reflect the time at which they might have occurred and, therefore, can be used to
retrieve the history of the growth of that tumor. In order to be present in a significant fraction of
the final lesion, a genomic alteration has to happen very early, perhaps in the first several
divisions.
The mutations and CNAs depicted in Figure 3.6 and Figure 3.7 strongly favor the “star-
like” expansion that is predicted by the so called “big bang model” for colorectal growth. In
addition, there are sets of mutations (private mutations) and CNAs (except one segment of
Chromosome 5 and Chromosome 19) that are seen only in glands in one tumor half. This
indicates that these alterations occurred later in tumor development. If tumor cells were motile in
the early stages of tumor growth, we should more commonly observe that mutations, which
obviously initially have to appear on one side of the tumor or the other, might ultimately be
present on both tumor sides. Since we do not see this here, we therefore conclude that such
motility of cells at the early stage of the tumor development is not present in the studied tumor K,
which is an adenoma. Conversely, the mutations in carcinomas (e.g. tumor N in Figure 3.8), often
show different mutation patterns, known as variegated patterns, and are present in multiple
distinct locations in a tumor, suggesting that cell motility exists in such tumors and that this
motility is probably present at the very beginning of the tumor development. In other words, the
carcinoma is “born to be bad” and the high probability of ultimate metastasis is present even
before the tumor is fully established.
62
Figure 3.8 | Mutation allele frequencies and categorization of AmpliSeq data for Tumor N.
Tumor N is a carcinoma that is a malignant tumor. A number of variegated mutations are present
in tumor N.
chr position NA3L NA6L NA7L NA9L NA10L NB1L NB3 NB5L NB6L NB7L
1 55518070 0.353 0.330 0.316 0.360 0.307 0.463 0.344 0.347 0.373 0.219
1 120295237 0.612 0.584 0.584 0.600 0.615 0.587 0.601 0.643 0.538 0.599
1 158639502 0.330 0.276 0.323 0.317 0.327 0.339 0.321 0.325 0.300 0.309
1 180905704 0.327 0.308 0.298 0.216 0.298 0.305 0.302 0.312 0.338 0.224
2 1653373 0.327 0.380 0.309 0.316 0.312 0.326 0.311 0.332 0.319 0.380
2 74492351 0.311 0.328 0.296 0.342 0.314 0.311 0.300 0.337 0.300 0.274
2 197584236 0.326 0.333 0.311 0.357 0.328 0.318 0.304 0.327 0.320 0.344
3 36873026 0.312 0.326 0.283 0.368 0.324 0.289 0.301 0.340 0.395 0.277
3 51749482 0.650 0.656 0.701 0.691 0.673 0.609 0.635 0.655 0.672 0.578
3 134898730 0.618 0.629 0.603 0.625 0.668 0.596 0.693 0.626 0.741 0.671
3 156181506 0.316 0.295 0.291 0.318 0.295 0.290 0.305 0.305 0.319 0.326
3 185783660 0.646 0.693 0.668 0.715 0.668 0.661 0.664 0.652 0.604 0.606
4 3768880 0.579 0.605 0.638 0.637 0.586 0.614 0.595 0.609 0.604 0.605
4 88903784 0.278 0.328 0.276 0.262 0.319 0.342 0.286 0.271 0.245 0.411
4 183659696 0.666 0.657 0.631 0.657 0.668 0.662 0.625 0.641 0.633 0.668
4 183675635 0.310 0.340 0.340 0.291 0.324 0.356 0.333 0.326 0.291 0.429
4 183720928 0.661 0.665 0.632 0.706 0.664 0.659 0.655 0.672 0.664 0.664
5 112173917 0.631 0.635 0.603 0.608 0.670 0.629 0.650 0.638 0.606 0.659
5 137520578 0.273 0.287 0.547 0.250 0.222 0.215 0.259 0.291 0.218 0.248
5 141033696 0.632 0.673 0.611 0.691 0.636 0.650 0.684 0.658 0.647 0.659
6 38994380 0.318 0.311 0.334 0.276 0.290 0.303 0.347 0.320 0.318 0.328
6 55924961 0.649 0.655 0.643 0.648 0.655 0.646 0.621 0.654 0.660 0.624
6 142689004 0.324 0.333 0.297 0.315 0.270 0.291 0.286 0.318 0.327 0.317
7 11630195 0.662 0.661 0.662 0.646 0.675 0.546 0.603 0.618 0.689 0.567
7 66415955 0.316 0.303 0.299 0.274 0.281 0.360 0.372 0.361 0.291 0.338
7 81591314 0.349 0.334 0.309 0.313 0.351 0.379 0.425 0.395 0.333 0.343
7 103175931 0.337 0.347 0.322 0.330 0.337 0.388 0.417 0.400 0.318 0.319
7 106509370 0.308 0.268 0.244 0.267 0.248 0.329 0.353 0.361 0.260 0.309
7 148801692 0.341 0.322 0.331 0.311 0.340 0.423 0.376 0.406 0.385 0.404
8 48777193 0.324 0.361 0.306 0.347 0.323 0.293 0.341 0.349 0.307 0.324
8 67576838 0.644 0.595 0.632 0.633 0.682 0.592 0.647 0.658 0.644 0.642
8 124392898 0.241 0.309 0.345 0.421 0.361 0.304 0.297 0.237 0.275 0.160
8 139617006 0.326 0.328 0.278 0.358 0.337 0.282 0.334 0.316 0.277 0.333
8 144511844 0.331 0.347 0.348 0.346 0.334 0.321 0.293 0.349 0.345 0.292
9 34107468 0.649 0.658 0.651 0.636 0.650 0.645 0.496 0.492 0.592 0.637
9 137806649 0.643 0.683 0.650 0.650 0.657 0.633 0.486 0.492 0.656 0.581
10 34636994 0.317 0.336 0.281 0.333 0.367 0.277 0.302 0.327 0.288 0.421
10 73571121 0.676 0.713 0.562 0.724 0.682 0.598 0.660 0.640 0.676 0.600
10 106074979 0.655 0.673 0.629 0.689 0.686 0.653 0.647 0.670 0.632 0.578
10 128840959 0.628 0.669 0.633 0.636 0.652 0.684 0.656 0.650 0.589 0.531
11 32410611 0.661 0.617 0.622 0.635 0.648 0.636 0.313 0.311 0.662 0.567
11 87020637 0.569 0.577 0.560 0.624 0.582 0.549 0.237 0.265 0.527 0.634
11 102713491 0.629 0.640 0.660 0.654 0.670 0.639 0.334 0.324 0.561 0.581
11 123988418 0.649 0.627 0.622 0.666 0.656 0.644 0.325 0.322 0.619 0.661
12 3692252 0.248 0.230 0.217 0.243 0.236 0.236 0.233 0.341 0.187 0.233
12 25398284 0.731 0.716 0.746 0.722 0.740 0.676 0.742 0.659 0.686 0.673
12 49690310 0.290 0.287 0.235 0.296 0.267 0.259 0.303 0.364 0.254 0.294
12 104142897 0.752 0.768 0.750 0.730 0.757 0.728 0.729 0.658 0.710 0.720
12 109629722 0.235 0.241 0.263 0.251 0.237 0.254 0.218 0.335 0.253 0.233
13 30782787 0.785 0.794 0.958 0.808 0.822 0.754 0.722 0.738 0.793 0.730
13 110960440 0.796 0.800 0.965 0.811 0.803 0.756 0.727 0.761 0.755 0.776
14 77808176 0.335 0.354 0.503 0.374 0.308 0.236 0.327 0.342 0.274 0.219
14 94128976 0.646 0.675 0.471 0.630 0.649 0.630 0.645 0.643 0.652 0.688
15 23688968 0.331 0.326 0.310 0.350 0.316 0.302 0.322 0.339 0.343 0.382
15 42457259 0.936 0.919 0.917 0.989 0.974 0.805 0.914 0.951 0.714 0.870
15 83932115 0.645 0.638 0.656 0.661 0.647 0.640 0.672 0.649 0.633 0.630
15 99715279 0.319 0.289 0.288 0.323 0.320 0.303 0.296 0.322 0.317 0.295
16 81181854 0.645 0.687 0.686 0.722 0.692 0.696 0.528 0.507 0.383 0.512
17 7578403 0.241 0.244 0.235 0.245 0.224 0.224 0.481 0.486 0.219 0.309
17 47039145 0.701 0.691 0.649 0.644 0.661 0.543 0.956 0.980 0.534 0.615
17 48615484 0.260 0.271 0.273 0.235 0.248 0.269 0.323 0.326 0.275 0.412
17 56621125 0.731 0.747 0.667 0.735 0.743 0.732 0.982 0.985 0.708 0.704
18 73000016 0.466 0.524 0.501 0.438 0.492 0.467 0.507 0.482 0.472 0.516
19 36322588 0.216 0.241 0.197 0.214 0.231 0.179 0.227 0.189 0.156 0.211
19 54759401 0.750 0.748 0.722 0.781 0.750 0.734 0.757 0.753 0.715 0.747
20 31877712 0.569 0.569 0.562 0.603 0.531 0.534 0.569 0.571 0.570 0.528
22 37769698 0.971 0.971 0.942 0.973 0.982 0.955 0.972 0.977 0.922 0.982
1 116930073 0.325 0.310 0.343 0.332 0.336 0.310 0.000 0.000 0.314 0.280
1 150923066 0.349 0.346 0.365 0.365 0.369 0.351 0.000 0.000 0.331 0.375
1 193111014 0.357 0.401 0.348 0.325 0.282 0.271 0.000 0.000 0.337 0.313
1 248512743 0.320 0.318 0.306 0.323 0.328 0.314 0.000 0.000 0.281 0.337
3 100170755 0.342 0.321 0.312 0.265 0.325 0.319 0.000 0.000 0.314 0.333
3 110845177 0.333 0.325 0.300 0.344 0.346 0.335 0.000 0.000 0.298 0.364
4 85416920 0.324 0.320 0.372 0.364 0.338 0.327 0.000 0.000 0.346 0.327
5 140188455 0.200 0.250 0.000 0.000 0.529 0.000 0.000 0.227 0.000 0.000
5 140800799 0.321 0.329 0.000 0.329 0.308 0.325 0.325 0.310 0.287 0.330
5 175959158 0.697 0.704 0.696 0.000 0.692 0.692 0.738 0.703 0.615 0.000
5 179201778 0.321 0.303 0.620 0.335 0.320 0.299 0.000 0.000 0.335 0.285
6 32363933 0.326 0.300 0.324 0.332 0.333 0.270 0.000 0.000 0.298 0.363
6 43005686 0.437 0.392 0.406 0.403 0.369 0.438 0.000 0.000 0.338 0.382
6 43153812 0.325 0.332 0.307 0.333 0.350 0.316 0.000 0.000 0.327 0.299
8 98289605 0.263 0.277 0.347 0.182 0.329 0.285 0.000 0.000 0.333 0.348
9 135698654 0.346 0.322 0.350 0.341 0.343 0.341 0.000 0.000 0.367 0.384
10 43288012 0.000 0.667 0.000 0.000 0.000 0.000 0.000 0.524 0.000 0.000
10 118434364 0.367 0.333 0.343 0.286 0.398 0.000 0.000 0.000 0.200 0.000
11 50003372 0.348 0.351 0.310 0.342 0.334 0.308 0.000 0.000 0.336 0.348
11 57466379 0.328 0.313 0.321 0.301 0.314 0.371 0.000 0.000 0.276 0.368
12 56436344 0.222 0.000 0.000 0.000 0.000 0.000 0.000 0.297 0.000 0.000
13 58207622 0.190 0.175 0.000 0.201 0.195 0.191 0.236 0.243 0.177 0.197
13 101936304 0.199 0.216 0.239 0.193 0.203 0.164 0.000 0.000 0.254 0.190
17 4594195 0.246 0.241 0.217 0.315 0.246 0.247 0.000 0.000 0.239 0.289
17 7366895 0.328 0.361 0.298 0.305 0.336 0.300 0.000 0.000 0.260 0.346
17 48072347 0.244 0.251 0.247 0.276 0.244 0.225 0.000 0.000 0.221 0.303
19 6383454 0.240 0.239 0.218 0.228 0.230 0.247 0.000 0.000 0.228 0.263
19 13928110 0.367 0.279 0.307 0.256 0.256 0.278 0.000 0.000 0.270 0.253
20 3669866 0.206 0.192 0.195 0.213 0.184 0.160 0.000 0.000 0.188 0.159
20 47266570 0.200 0.210 0.182 0.189 0.181 0.193 0.000 0.000 0.205 0.233
21 47811147 0.484 0.496 0.428 0.531 0.497 0.436 0.000 0.000 0.327 0.526
22 32650169 0.330 0.338 0.301 0.415 0.328 0.378 0.000 0.000 0.367 0.260
6 110300609 0.333 0.363 0.330 0.410 0.413 0.000 0.000 0.000 0.000 0.000
9 130263315 0.000 0.000 0.000 0.000 0.000 0.000 0.268 0.253 0.000 0.000
11 113934751 0.000 0.000 0.000 0.000 0.000 0.000 0.379 0.335 0.000 0.000
21 33318368 0.000 0.000 0.000 0.000 0.000 0.500 0.308 0.000 0.367 0.492
type
P
u
b
l
i
c
V
a
r
i
e
g
a
t
e
d
Side A Variegated
Side B Variegated
63
Our ability to retrieve integer CN estimates is based on the assumption that the
substructure of the tumor, or gland, is nearly homogenous (i.e. clonal). However, copy number is
not necessarily integer in a gland (i.e. not all cells within a gland need have the same copy
number at the same chromosomal position). It is particularly possible that this assumption might
fail in certain kinds of tumors which have vast copy number alterations from generation to
generation, or in which cells move around from gland to gland. In such scenarios, our analysis
would not return the integer CNs that we in fact saw for the studied tumor, as each gland might be
a mixture of cells with different CNs. Therefore, the appearance of non-integer CN is also
informative regarding the evolution of tumor growth, and in particular is diagnostic of the
presence or absence of cell motility (which is thought to be associated with poor prognosis, as
discussed above).
64
65
Chapter 4: Early mutation bursts and
cancer stem cell division asymmetry in
colorectal tumors
This work is shortly to be submitted as “Early mutation bursts and cancer stem cell division
asymmetry in colorectal tumors”, J. Zhao, M. Salomon, D.Shibata, C. Curtis, K. Siegmund, P.
Marjoram
4.1 Overview
Once again, we treat tumor growth as an evolutionary process involving accumulation
of mutations, copy number alterations, and cancer stem cell (CSC) division and differentiation.
Earlier in this thesis we presented analyses that exploited patterns of ITH as seen in methylation
status, to infer key aspects of this evolutionary process. However, the resolution offered by such
data was low, so in this section we continue to exploit other, richer forms of variation data. Using
these data we present a framework that allows simulation of these processes and estimation of
mutation rates at the various stages of tumor development and CSC division patterns for
colorectal tumors. We parameterize the mutation rate and the CSC division pattern, and
successfully retrieve their posterior distributions based on DNA sequence level data. Our
approach again exploits Approximate Bayesian Computation.
66
4.2 Introduction
Tumorigenesis is the process by which normal cells transform to ultimately become
malignant and experience uncontrolled growth. During this process, numerous genomic and
epigenomic events take place. The normal spontaneous DNA mutation rate ranges from 10
-10
to
10
-9
per base per cell division due to replication error [58,59], which means the overall mutation
rate is between 0.3 and 3 mutations per cell division in the whole genome, and 0.003 - 0.03
mutations per cell division per exome. In tumors, the point mutation rate is ~5× 10
-10
per base per
division, which is within the range for normal cells, based on sequencing and microarray results
from pooled DNA [19,49,60]. The accuracy of these estimates relies on reliable detection of
mutations and estimation of the number of divisions the tumor has experienced. However, the
technology typically used does not detect all mutations since the probability of detection depends
upon depth of sequencing. Neither do we typically observe the number of generations through
which the tumor has passed during its existence.
In this chapter we seek to better estimate point mutation rates in tumors, and to then
understand how they might vary during the lifetime of the tumor. A recent study, reported in
Chapter 3, has proposed a “big bang” tumor growth model for colorectal tumors, proposing that
each tumor is the result of a single clonal expansion [61]. Furthermore, it is proposed that the
mutational intratumor heterogeneity that is typically observed in genomic data originates largely
from the first few divisions [62,63]. An open question is whether the mutation rate during those
first few divisions was elevated, resulting in a “mutation burst” at the very beginning of the tumor
development, or whether the rate was constant during tumor development. Colorectal tumors
provide an exceptional advantage when we investigate this issue, because of their glandular
structure. Each gland within the tumor is a relatively pure cell population, and mutations that
originate early in the tumors history will typically be present in all cells within a gland (this kind
67
of mutation is therefore called a “fixed” mutation in this manuscript) [63]. It is the prevalence of
such fixed mutations, relative to that of the non-fixed mutations, that will be informative
regarding the existence of an initial mutation burst. Therefore, profiling single glands, and
looking for this signal, for example, using exome sequencing, can provide us with unprecedented
information regarding both the initial mutation burst and the genomic structure within a tumor.
On the contrary, traditional bulk tissue sampling that consists of thousands of glands loses this
power, since we lose both structural information and the ability to determine that a mutation is
fixed within one or more glands.
Stem cells are undifferentiated cells that reside in multicellular organisms. They are
capable of making more stem cells, a process called self-renewal, as well as generating other
types of cells, a process known as differentiation. Stem cell division, through which the stem cells
self-renew and differentiate, has been extensively studied in simple organisms, for example, C.
elegans [64], as well as in higher organisms, such as humans [65]. Two types of stem cell
division have been discovered: asymmetric and symmetric. A stem cell that is undergoing
asymmetric cell division produces one daughter cell that is itself a stem cell, and one daughter
cell that loses stem cell properties and differentiates [66,67]. One of the advantages of this
asymmetric division is that it maintains and constrains the cell population while it produces two
different cells. This advantage is also its disadvantage under certain circumstances; for example if
the stem cell population needs to expand. In contrast, in symmetric cell division, a stem cell
divides into two daughter cells that are destined to have identical fate - in other words, both are
cells that differentiate or both are stem cells. Symmetric cell division is essential for population
expansion of stem cells during the initial stages of embryo development and during wound
healing and regeneration [68-70].
Asymmetric cell division has been extensively studied in model systems and many
processes are involved in its regulation. The localization of apical and basal determinant [71-73],
68
the orientation of spindle assembly [74], and the niche environment, such as the presence of Wnt
[75], regulate asymmetric division. Abnormality in these pathways results in disruption of
asymmetric cell division and eventually causes development of cancers [71,76]. Therefore,
equilibrium between asymmetric division and symmetric division is crucial to organisms. If the
equilibrium is disturbed, abnormal growth will take place and tumors will typically arise. Several
studies have shown that some protein markers for asymmetric division are still present during
cancer cell division, suggesting that asymmetric cell division is not totally lost in cancers [77].
The partial loss of asymmetric division in cancers suggests a way to study the
regenerative ability of a cancer. It may, therefore, be useful to understand to what extent each
tumor has lost asymmetric division. Currently, there is no technique that can measure the
proportion of stem cells that undergo asymmetric division or predict the probability that a given
stem cell will divide asymmetrically. However, these two different division mechanisms change
the probability that mutations carried by a cancer stem cell (CSC) will survive into the next
generation. For example, if a CSC with a unique mutation undergoes symmetric division and
gives birth to two non-stem cells, this mutation will vanish in the stem cell population (and
ultimately will vanish from the tumor itself, since differentiated cells have relatively short
lifespans). In this paper we propose a simulation-based method that can be used to estimate both
the mutation rate and asymmetric division rate of CSCs, and can in addition infer whether a
mutation burst occurred in that tumor. More specifically, our tumor simulation framework, which
includes the exploitation of the next-generation sequencing (NGS) data for single tumor glands,
provides researchers with more detailed information on the genomic landscape of a tumor.
69
4.3 Data, model and methods
4.3.1 Tumor growth model and DNA mutation embedding
Our analysis models possible scenarios for tumor growth. The model supposes that
any given tumor contains a particular number of CSCs. Our simulation model begins with
repeated division of the first transformed cell until the number of cells reaches the number of
CSCs existing in a gland of that tumor. This results in the formation of the first gland (a gland
typically contains ~10000 cells). Once the first gland forms, we model a gland fission process,
described in detail below, consisting of a sequence of cell divisions, until the gland ultimately
splits to form 2 glands. Initially, the tumor experiences an “exponential growth” stage, in which
the glands double in number every generation (see Figure 4.1A), for 19 generations, to ultimately
~500,000 glands and ~4 billion cells, which is approximately the size of a 4 cm
3
colon tumor.
Then the tumor enters a “constant size” phase, in which the gland fission process stops. During
the constant size phase, the cell population in each gland, which consists of both CSCs and non-
cancer stem cells (non-CSCs), is maintained by the division of CSCs and the death of non-CSCs.
As discussed earlier, a CSC can undergo two types of division: asymmetric and symmetric (see
Figure 4.1B). The probability of asymmetric division, r, is a parameter to be estimated in our
model: with probability r, a given CSC will undergo asymmetric division, in which only one of
the progeny is a CSC. Otherwise, (so with probability 1-r), a CSC will undergo symmetric
division, in which case, with equal probability, the CSC divides into two CSCs or two non-CSCs.
The possibility of DNA mutation is incorporated into each cell division. Since
mutation rates are relatively low, we model the number of DNA mutations, n, introduced into
each daughter cell according to a Poisson distribution, the parameter of which is referred to as the
mutation rate. Since we are interested in asking whether there is a mutation burst at the early
70
stage of tumor growth, we parameterize the DNA mutation rate separately before (pre rate, α) and
after (post rate, β) the first gland formation. Our parameters are summarized in Table 4.1. Since
the focus of this study is the possible existence of an early mutation burst, and the details of stem
cell division, in our simulation study NCSC and T3 are set to “typical constant” values (see Table
4.1) so that we can keep the dimensionality of the simulation study manageable. The prior
distributions of α and β are set to cover a range wider than the mutation rates, 0.015 - 0.15
mutations per cell division per exome, that are typically reported in the literature.
Figure 4.1 | Schematic of tumor growth and the two types of CSC division. (A) The three
stages of the growth model: formation of the first gland, exponential growth of gland number, and
constant size phase (the length of which is 100 generations). (B). Schematic of cell differentiation
process during Constant Size phase. Each yellow oval represents a CSC, while each white oval
represents a non-CSC (cells which have limited differentiation capability).
Parameter Range of Possible Values
and Prior Distribution
DNA point mutation rate before the first gland formation (α) Unif(0,5)
DNA point mutation rate after the first gland formation (β) Unif(0,1)
Number of Cancer Stem Cells (NCSC) 32
Probability of asymmetric division (r) Unif(0.5,1)
Number of generations in constant size phase (T3) 100
Table 4.1 | Parameters and the corresponding prior distributions in our model.
71
4.3.2 Statistical methods
4.3.2.1 Approximate Bayesian computation
Our goal here is to find the posterior distribution of tumor growth parameters, in
general denoted by θ=(θ
1
, θ
2
,…, θ
L
), based on data, D, that were observed experimentally. This is
expressed as:
ƒ(θ|D)= ƒ(D|θ)π(θ)/ƒ(D). (4.1)
In our context the likelihood term ƒ(D|θ) is intractable and not available in closed
form. Therefore we replace the likelihood calculation with an acceptance-rejection simulation
step that accepts parameter values that result in “similarity” between observed samples, D, and
simulated samples, D’, generated by θ. θ-values are sampled from the prior distribution π(θ).
Furthermore, to further reduce computational complexity, summary statistics S=(s
1
,s
2
,…,s
M
) are
used to represent the key features of the original data – in other words, ƒ(θ|S) is used to
approximate ƒ(θ|D). We use the same ABC version of rejection sampling as was used in Chapter
2:
For i=1 to N
1. Sample parameters θ’ from the prior distribution π(θ)
2. Simulate data D’ using the tumor growth model described earlier with the sampled
parameters θ’, and summarize D’ as S’.
3. Accept θ’ if d(S’, S) < ε, for a given threshold ε, where d(S’, S) is a measure of
distance (which can be thought of as 1/“similarity”) between S’ and S.
72
Adding extra non-informative or less informative summary statistics increases the
noise in the measure of distance, and thereby increases the error of matching S’ to S [78,79].
Therefore, as we saw earlier, we must carefully select a minimal set of summary statistics that
capture all important information regarding tumor growth. A number of methods have been
invented to choose a concise set of summary statistics, ensuring that they maintain
informativeness with regard to inferring posterior distributions for model parameters [40,79-82].
We explore related methods below and choose a set of summary statistics that performs well
when estimating our model parameters. As is common, we define d(S’, S) as a variant of the
Euclidean distance metric. Specifically, we use a variant of traditional Euclidean distance in
which each statistic is weighted. Some summary statistics s* might be more informative for a
particular parameter θ* than others, therefore, a higher weight on s* will help infer the posterior
distribution of θ* [43,79]. So including the weights of each summary statistic, denoted as W, the
distance metric is defined as:
d(S’,S)=||(S’-S)W
T
||
2
. (4.2)
4.3.2.2 Summary statistics
Assume that G glands are sampled from the tumor. Each gland has K
g
(1≤g≤G)
mutations. The allele frequency of k-th mutation in g-th gland is denoted as F
gk
. In a diploid
system, if a mutation is present in every cell of a gland, i.e. it has an allele frequency of 0.5, we
refer to it as a “fixed” mutation. Otherwise, the mutation is called “non-fixed”. We also define a
“gland-specific” mutation as a mutation that is only present in one gland, while a “shared”
mutation is one that is found in more than one gland.
Our analysis then includes the following summary statistics:
73
The mean of the allele frequencies of all non-fixed mutations,
The variance of the allele frequencies of all non-fixed mutations,
The number of gland-specific mutations among non-fixed mutations,
The number of gland-specific mutations among fixed mutations,
The number of shared mutations among non-fixed mutations,
The number of shared mutation among fixed mutations,
The variance of the number of non-fixed mutation across all glands,
The variance of the number of fixed mutation,
The variance of the allele frequency of non-fixed mutations across the two tumor
halves.
4.3.2.3 Selecting weights for summary statistics
In this section, we compare three methods to assign weights to summary statistics. As
a baseline comparison we employ an analysis that uses equal weight for every statistic, and infers
the parameters jointly. To reduce the dimensionality of the simulation study, which is already
large and extremely computationally intensive, our perspective here will be to focus on one
parameter of interest at a time. As a consequence, we will also learn which of our statistics are
74
most informative for each of our parameters. We show results for three methods in which we
infer parameters one-at-a-time (i.e. we infer marginal parameter posterior distributions). The
first weighting method puts equal weight on each summary statistics. The second weighting
method we use, the local linear regression method, applies local regression for the parameter of
interest θ at each data point, based on each statistic, and uses the R-square measure of fit of the
linear regression between parameter and statistic as the weight of that statistic (for details, see
below). The third method we compare, global linear regression, proceeds similarly, but now
utilizes a global measure of correspondence between parameter and each summary statistic to
determine the weight for each statistic (again, we use the R-square measure of fit of the linear
regression).
4.3.2.4 Local linear regression method
Here we describe our second analysis method, that is based on local linear regression
[83]. We conduct N simulations, each of which simulates a single dataset, and in each of which
the L parameters were sampled from the prior distributions in Table 2.1. We denote the summary
statistics observed in these N simulations by S’=(s
1
,s
2
,s
3
,…,s
M
), and denote the L generating
parameter values by ϴ’=( θ
1
, θ
2
, θ
3
,…, θ
L
), where s
m
(1≤m≤M) and θ
l
(1≤l≤L) are column vectors
of length N. For a given parameter θ
l
, we wish to assess how much information each summary
statistic carries regarding that parameter. In order to do this, we denote the collection of
generating parameter values, and the resulting summary statistic values, by the pair ( θ
l
, s
m
). For
each summary statistic s
m
, we then perform a linear regression of θ
l
, on s
m
in the vicinity of s
m
using the simulated datasets. We define this vicinity as the 100*η percent of simulated data points
that are closest to s
m
(in terms of their resulting summary statistic values, using the Euclidean
distance metric), for some η. We denote these closest points by S’
m(1)
, S’
m(2)
,…, S’
m(kN)
(where the
subscript
()
denotes the rank ordered values, starting with the value closest to s
m
) and the
75
corresponding generating parameters are denoted by ϴ’
l(1)
, ϴ’
l(2)
,…, ϴ’
l(kN)
. We then fit a linear
regression using S’
m()
as predictor variable and ϴ’
l()
as response variable. The R-square measure
of fit of the linear regression is recorded as R
2
lm
. After all weights are calculated between the l
th
parameter and m
th
summary statistic, we define final normalized weights to use as:
. (4.3)
We use the above weights as the weights for the summary statistics in a subsequent ABC
analysis for the parameter of interest, using previously unseen data.
4.3.2.5 Global linear regression method
The global correlation between a parameter and a summary statistic also indicates how
informative that summary statistic is for a parameter. While it may be less accurate, in principle,
locally, it will also be less subject to the noise that might arise from using just local values to
estimate correlation. For that reason, our third method proceeds as above but now calculates
statistic weights based on the R-square measure of fit,
, between parameter θ
l
and summary
statistic S
m
derived from a global linear regression for a set of simulated data. Again, these
weights are then used when estimating that same parameter in previously unseen data in order to
assess estimation performance of this weighting method. The weights, W
lm
, are then defined and
normalized in the same way as in (4.3).
76
4.3.3 Data simulation
4.3.3.1 Perfect synthetic data
When we generate simulated tumor data for our study we use the tumor growth model
described earlier in this section. In our first set of analyses we explore a situation in which we
have “complete” data, in the sense that no noise (e.g. sampling variation) was simulated. For
example, we assume that we can calculate the exact allele frequency of a mutation from the
output of the simulation. In reality, of course, such data is subject to error, and the effects of this
will be explored in the next subsection, but the performance of our methods on this “perfect” data
provides a benchmark for the rest of our study.
4.3.3.2 Simulated data with sequencing noise
While it is useful to benchmark the analysis method in this best case scenario, we also
wish to assess how it performs in more realistic setting. In idealized models, the sequencing depth
follows a Poisson-like distribution with a small variance [84,85]. However, in reality, the
distribution of sequencing depth has a larger variance than would be predicted by such a model.
Therefore, here, as is common, we use the more flexible negative binomial distribution,
X~NB(p,t), to model sequencing depth. If we think of the negative binomial distribution as the
number of successes, k, in a sequence of independent Bernoulli trials (with success rate p) before
the t
th
failure, then,
, for k=1,2,3…. (4.4)
Based on this definition, the mean (m) and the variance (v) of the negative binomial
distribution are:
77
, (4.5)
. (4.6)
We can re-write these constraints as:
, (4.7)
. (4.8)
Given an average sequencing depth (m) with variance (v), we model the sequencing
depth individually for each segregating locus based on the negative binomial distribution
described above. For a given locus, we draw a number k from the negative binomial distribution
with parameters p and t that are calculated by (4.7) and (4.8) to represent the sequencing depth at
this locus. Having done this we use a binomial distribution B(k, q) with q equal to the true allele
frequency of mutated (alternative) allele, to generate the number, n
a
of reads sampled for the
mutated (alternative) allele, with the remaining k - n
a
reads being of the reference allele.
As discussed earlier, mutant alleles with frequency 0.5 in the simulated diploid data
are referred to as “fixed”. However, when we are modeling the sequencing process itself, the
exact allele frequency is no longer obtained. Instead we observe counts of the number of reads at
any given position with each of the two possible allelic types. For this reason, at each locus we
conduct a hypothesis test to determine whether or not we might regard the allele as “fixed”.
Specifically, we use a standard binomial hypothesis test to assess whether we can reject the null
hypothesis that the underlying q, the true allele frequency of mutated allele, is 0.5. If we do not
reject the null hypothesis for a given mutation, then we consider this mutation as (potentially)
“fixed”. Here type II errors will increase the number of mutations called as “fixed”, typically
because when the actual allele frequency is relatively close to 0.5 the test will typically fail to
78
reject the null hypothesis even though the mutation is actually not fixed. However, we reflect this
same calling method in the production of simulated data during the ABC analysis of our synthetic
data. Thus, while our anti-conservative calling of fixed sites can be expected to introduce some
loss of precision in our analysis, there is no reason, a priori, to expect it to introduce bias for the
estimation of the parameters.
4.4 Results
4.4.1 Synthetic tumor data
4.4.1.1 Weight summary statistics
Figure 4.2 shows the results of a simulation study comparing the performance of these
three methods in estimating model parameters for our tumor growth model. For each simulated
dataset we modeled tumor growth and then sampled 6 glands from each half of the resulting
tumor. The other generating parameter values were: NCSC=32, T3=100. The figures in the first
column illustrate the mean of the posterior distributions for mutation rate before gland formation
(top), mutation rate after gland formation (middle) and the asymmetric division rate (bottom), for
300 simulated test tumors, while the ones in the second column represent the posterior standard
deviation. Each of the three methods generates good estimates, although there are some
exceptional cases in which the posterior means deviated from the true values by large amounts
when the local regression was used. This illustrates the noisier nature of estimates of local (as
compared to global) regression between parameters and summary statistics. However, the three
methods have differing performances in terms of the standard deviation of the posterior
distribution (Figure 4.2, right column). The standard deviations are very high for the local
79
regression method, again likely because local estimates of regression parameters are relatively
unstable. The method that uses equal weights performs relatively well compared to the local
regression methods. However, the posterior distributions generated by weighting the summary
statistics by R-square measure of fit of the global linear regression have a consistent tendency to
have the smallest standard deviations, indicating that the R-square measure of fit of the global
linear regression serves as a better weighting method than other two methods for our analysis.
4.4.1.2 Minimum detectable allele frequency
A key feature of our simulated data is the allele frequency of each mutation. If a
mutation occurs early during tumor development, its allele frequency will typically be large. For
example, a mutation introduced to the tumor in the early stem cell divisions has allele frequency
0.5 in a diploid setting. If a mutation first appears in the constant growth phase (T3), the allele
frequency can stay very low or can increase to 0.5, the relative probabilities of these outcomes
depending on the asymmetric division rate and the number of generations that occur before the
tumor is extracted. With real experimental data, because of the realities of sequencing
technologies, we are not able to reliably detect mutations with very low allele frequencies,
because they cannot be distinguished from sequencing errors. The ability to detect mutations with
low allele frequency depends on the depth of the sequence data that has been collected for that
tumor. For example, if the sequencing depth is 20, the lowest allele frequency that is detectable is
1/20, which is 0.05.
To explore this, we generated 300 test tumors, for each of a range of parameter values,
using a range of lowest detectable allele frequency thresholds (to reflect the inability to reliably
detect low frequency mutations). Mutations with frequency lower than this threshold were
removed from the analysis. Figure 4.3 shows results for two such thresholds: 0.05 and 0.001. We
show the means and standard deviations of the posterior distributions of the corresponding
80
parameter estimates. As we can see, the minimum detectable allele frequency does not affect the
estimation of the mutation rate before gland formation. Mutations before gland formation all
become fixed, with allele frequency 0.5, due to the gland fission processes, and so are easily
detected using either threshold. However, for the other two parameters, the detection threshold
does have an impact on the posterior distributions. The means of the posterior distributions are
largely unaffected, and are still centered around the true (generating) parameter values, but the
variance increases with the higher threshold. This suggests that our ABC analysis continues to
function well in this more realistic setting. However, we do see an effect in the standard
deviations of the posterior distribution. These grow larger as the threshold value increases,
reflecting the loss of data that results as our ability to detect lower frequency mutations is lost.
81
Figure 4.2 | Summary of posterior distributions for each parameter under different
weighting schemes. Black represents the results using equally weighted summary statistics. Red
shows the results of using the R-square (coefficient of determination) of the local linear
regression as the weight. Cyan corresponds to the results using weights given by the global linear
regression. 300 simulated test tumors were included in this analysis. The x-axis is the true
generating parameter value for each tumor. The top row shows results for mutation rate before
gland formation. The middle row shows results for the mutation rate after gland formation, while
the third row shows the results for the asymmetric division rate. The first column shows the mean
of the posterior parameter estimate, while the second column shows the standard deviation of the
posterior distribution.
82
Figure 4.3 | Different minimum detectable allele frequency. 300 tumors with a range of
generating parameters were tested. Blue dots are for a threshold of 0.001 and red dots show
results for a threshold of 0.05. 6 glands were sampled from each half. Other generating parameter
values are: NCSC=32, T3=100.
4.4.1.3 Number of sampled glands
Another important question is how the number of glands we sample from a tumor
affects our ability to estimate the model parameters. Intuitively, the more glands we sample the
more information we will have about the tumor, and so a large number of sampled glands is of
course preferred. However, the ability to sample glands from a tumor may be limited for practical
83
reasons. Therefore, we explore how estimation performance trades off against number of glands
sampled. To do this, we simulated tumor growth and harvested differing numbers of glands upon
which to base the analysis. 300 such tumors were analyzed and posterior distributions of each
parameter for each tumor were summarized via the mean and the standard deviation. As we can
see in Figure 4.4, with only one gland sampled from each tumor half the posterior distributions
have significantly larger standard deviations and the means of the posterior distributions deviate
more from the true values. Among the three parameters tested, this is most severe for estimation
of the mutation rate before gland formation. If we have only one gland sampled from each half,
the summary statistics contain relatively little information about events very early during the
tumorigenesis. More precisely, the heterogeneity within each tumor half is not captured by a
single gland at all. However, with 3 or more glands, performance is relatively good (and improves
as the number of sampled glands increases, as expected). We suggest that sampling 6 glands from
each tumor half is a reasonable compromise. This number generates reasonably accurate
estimates, that have a small standard deviation. Therefore, we assume 6 glands from each tumor
half in later analyses in this paper.
84
Figure 4.4 | Summary of posterior distributions as a function of the number of glands
sampled from each tumor. Colors correspond to the different numbers of sampled glands (see
legend). Top row: mutation rate before gland formation; middle row: mutation rate after gland
formation; Bottom row: asymmetric division rate. Other generating parameter values are:
NCSC=32, T3=100. The numbers in the legend represent the number of glands that were sampled
from each half of the simulated tumors. The first column and second column show the mean and
variance, respectively, of each parameter’s posterior distribution across 300 simulated tumors.
4.4.1.4 Adding Experimental noise
In the previous sections, we assumed that we had perfect data in the sense that all
information is recorded completely and accurately. For example, the allele frequency of each
85
mutation in the glands is assumed known, when detectable. We now add a more realistic filter to
the data. For example, estimation of allele frequency will depend upon the quality of the
sequencing data collected, which itself is subject to the sequencing technology used, the DNA
quality, and requested sequencing depth, etc. As such, we only obtain estimates of these
underlying true data characteristics (for example, we may completely fail to detect a mutant allele,
or estimate its frequency incorrectly). Also, some regions of the genome are hard to sequence,
resulting in lower coverage in those regions. To explore these considerations we now examine the
effect of sequencing depth, a key determinant of data quality, on the performance of our
estimation. Higher coverage is always better from the perspective of data accuracy. But, of course,
data is not free, so we explore what depth of coverage is necessary to ensure high quality
estimation.
Intuitively, we expect that higher sequencing depth should produce better results in
parameter estimation. The real question is “how much (coverage) is enough?” As shown in
Figure 4.5, a mean sequencing depth of just 20 appears to be sufficient to obtain good
performance in our parameter estimates. Even with significantly higher mean sequencing depth,
e.g. 80, the performance of the posterior distribution does not differ by much. There is no
detectable difference between the posterior variances that result when using a depth of just 20.
This demonstrates the advantage of the approximate Bayesian computation approach (or indeed
any other model-based approach): performance can be robust to the presence of sampling error so
long as the processes resulting in that sampling error are captured in the model itself, as they
were here.
86
Figure 4.5 | Summary of posterior distributions for different sequencing depths. The first
column and second column show the mean and standard deviation of each parameter’s posterior
distribution for the 300 simulated tumors respectively. The top row is the summary of for
mutation rate before gland formation. The middle row is for the mutation rate after gland
formation, while the third row shows the results for the asymmetric division rate. The black dots,
red dots and green dots represent the results when the mean sequencing depths are 20, 40, and 80
respectively. We see little improvement of parameter estimation with even higher sequencing
depths. The other generating parameter values used were: NCSC=32, T3=100, and v=1.1m for the
negative binomial distribution used to generate sequencing depth.
87
4.4.2 Experimental data
Having shown that our parameter estimation procedure performs well in realistic
settings, we close with an analysis of a small dataset (referred to as tumor U in Sottoriva et al.,
2015. In this new experiment, sequence data was obtained for just one gland per tumor half.
While any conclusions drawn will clearly be tentative, the results we reported earlier in this
section show that some power for parameter estimation remains.
4.4.2.1 Is there a mutation burst?
Sottoriva et al. (2015) proposed a big bang tumor growth model in which colorectal
tumors develop as a single clonal expansion containing subclonal events [61]. Subsequently, all
the early mutations, which are widely detected in the tumors, are pervasive in the final lesion.
One fundamental question arises: is there a mutation burst at the very beginning of the tumor
development? In other words, is the mutation rate before the first gland formation much bigger
than the mutation rate after the first gland formation? This question is key for understanding the
early evolution of a tumor.
Our results for analysis of tumor U are shown in Figure 4.6 and Figure 4.7. Tumor U
contains many copy number variations in the genome. We use the method described in Kang et
al., 2015 to call integer copy numbers for each chromosomal segment. The allele frequency of
each mutation then is adjusted according to the copy number of its location. (For example, if a
mutation has frequency 0.33 in a region for which the copy number is 3, the mutation is
interpreted as “fixed”.) We see that while the posterior distribution has relatively high variance,
there is still signal in the data regarding mutation rate. The mean and the mode for mutation rate
after gland formation are 0.48 and 0.5 respectively (Figure 4.6B), while the mutation rate before
88
gland formation has mean 1.01 and mode 0.33. Based on this one tumor, no clear conclusion
regarding the relative magnitudes of the two mutation rates can be drawn. While the posterior for
the pre-gland mutation rates supports higher values, and that for the post-gland mutation rate does
not, the posteriors have a large degree of overlap.
Figure 4.6 | Posterior distributions of mutation rate for tumor U. (A) mutation rate before
gland formation. (B) mutation rate after gland formation. The dashed line indicates the mean of
the posterior distribution.
4.4.2.2 How stem cells divide?
The results of our simulation study on synthetic tumors suggests that we can now
successfully infer the asymmetric division rate. Applying the same procedure to the data from
tumor U we obtain the results shown in Figure 4.7. We see that the mean of the posterior
distribution is around 0.76. Thus, even with data for just one tumor, and one gland from each side
of that tumor, we see that CSCs in this tumor (at least) almost certainly undergo asymmetric
division.
89
Figure 4.7 | Posterior distribution of asymmetric division rate (x-axis) for tumor U. The
dashed line indicates the mean of the posterior distribution.
4.5 Discussion
Inference regarding properties of tumor growth may well be crucial in understanding
both their behavior and, ultimately, how best to impact growth through medical treatment. But
tumor growth is non-trivial to understand because it is typically not observed. However, modern
technologies allow high-resolution data to be collected. Here we focus on our ability to now
collect data regarding sequence level mutation on small numbers of cells within a tumor gland
(e.g. 10,000 cells). While this application of the technology is relatively new we have access to
little actual data, our simulation study shows that model-based analyses based upon ABC have
the ability to successfully infer key parameters of tumor growth using such data.
The number of mutations thought to have originated during the first several cell
divisions, which can be detected by comparing the mutations profile in glands from multiregional
90
samples, does not match the number generated by the normal mutation rate (10
-10
to 10
-9
per base
per cell division). A mutation burst at early stage tumor growth has been proposed to explain this
phenomenon [63]. However no obvious evidence has been presented to rule out the possibility
that the mutation rate in the tumor is elevated throughout development, rather than just during an
initial “burst”. In this chapter we demonstrated that the mutation rates both during the initial stage
of tumor development and at the later stage can be estimated using sequence-level data study,
even if only a little such data is available. Based on the data from just one gland per tumor half,
the posterior distributions of mutation rates in tumor U do not allow a decisive conclusion to be
drawn regarding the burst (see Figure 4.6). However, these data are consistent with the idea of a
burst. As more data, for more glands, is collected in future, our analysis framework is likely to
allow investigators to decisively conclude whether or not such a burst has occurred.
Researchers have used bio-markers to confirm the existence of “stem cell like” cells in
various tumors [86,87]. However, the details of CSC behavior have been hard to uncover, in part
because that behavior is hard to directly observe. As a consequence, the details of how CSCs
divide are also unknown. Some experiments have shown that not all CSCs divide in the same
manner. Some go through symmetric division and others undergo asymmetric division [88,89].
However, it is still unknown that what determines how a CSC divides and with what probability
CSCs utilize asymmetric division to produce progeny. Despite all of these unknowns, CSCs are
thought to be a frequent cause of recurrence after treatment and have been targeted for
therapeutics in many studies [90-92]. In this chapter we demonstrated a method for estimating
how CSCs divide in a given tumor. The posterior distribution of the asymmetric division rate
probabilistically represents the behavior of a CSC. In previous work, we showed how to
determine the number of CSC in each gland of a colorectal tumor[11]. Together, these results
might be used to find correlations with the severity and resilience of a tumor, which might then be
used to guide the individualized therapeutics.
91
We also show that our analysis performs robustly in the face of experimental noise
when we consider both the limit of detectable allele frequency and variation of sequencing depth.
This is because in a simulation-based method it is relatively straightforward to model the
processes that result in such experimental noise. When ABC is performed, the perturbation
caused by the noise in the data is also present in simulated data and therefore it is captured by the
ABC procedure (see Figure 4.3, 4.5). We also showed that some conclusions can be drawn even
if we have just a single gland from each side of a single tumor (see Figure 4.4, 4.7, and 4.8).
We focus on marginal analysis of parameters in this paper, modeling a situation in
which there is a particular parameter that is of interest to the investigator. The method extends
naturally to joint analysis of multiple parameters. However, for the experimental data presented in
this paper, (one gland from each side of a single tumor), such a joint analysis would probably
have little discriminative power.
In summary, the simulation study presented in this chapter showed that the mutation
rate at different stages of tumor development and the asymmetric division rate of CSCs can be
retrieved based on mutation data collected from single gland sequencing. We also showed that
relatively little data is required to extract at least some useful information regarding the existence
of a mutation burst and the asymmetric division rate. Our ABC framework provides a widely-
applicable tool for extracting information from genomic data, and in particular the parameters that
closely govern the development of a tumor, which may potentially shed light on the post-
diagnosis and post-surgery treatment.
92
93
Chapter 5: Concluding remarks
In this thesis, I describe model-based analyses of genomic data, such as traditional
bisulfite sequencing data for DNA methylation patterns, next generation sequencing data for
exomes, and SNP microarray data for copy number variations, to develop our current
understanding of colorectal tumor development to a more detailed level. In this chapter, I
summarize and integrate the key findings in Chapter 2 through Chapter 4. I then discuss future
directions that researchers can explore based on the work presented in this thesis.
5.1 ITH carries information regarding cancer stem cells in
colorectal tumor
At the cell level, the driving force of tumor growth is the cancer stem cell (CSC). CSCs
serves as a perpetual source for tumor growth, expansion and survival. Researchers dedicate
tremendous amount of effect to identify, isolate, and quantify CSCs in various types of tumors.
Yet, the CSC still has not been fully understood. As the “engine” of tumor growth, the quantity of
CSCs in a given tumor is expected to affect both the speed of growth and the ability of the tumor
to resist treatment. In addition, the exact manner in which the CSCs operate, for example whether
they divide symmetrically or asymmetrically (as briefly reviewed in the introduction of Chapter
4) may also determine the capability of a tumor to resist treatment. As such, I hope that the
studies I present in this thesis might, eventually, lead to improved treatment of tumors.
In Chapter 2, I demonstrate a model-based study aimed at estimating the number of CSCs
using data from passenger DNA methylation tags that are sampled from individual colorectal
94
tumor cells (Figure 2.6 and Figure 2.11). The posterior distributions of the number of CSCs
indicate that the number of CSCs varies across different clinical tumors. In Chapter 4, I present
another simulation based study, which illustrates that the mechanism of division of the CSCs in a
tumor can be retrieved through the analysis of DNA point mutations in the context of a multi-
regional sampling of single glands. Here the mechanism of division of a CSC is parameterized as
the probability of asymmetric division. Limited by the availability of experimental data, only one
tumor is analyzed.
The successful estimation of both the number and the division pattern of CSCs delineates
the intrinsic architecture of growth potential, differentiation ability, and population composition
in tumors, which can potentially provide researchers a path to fully understand tumor
development and can facilitate clinicians designing patient-specific therapeutic strategies.
5.2 Early divisions during the development of a tumor
The early events are thought to contain crucial information regarding tumor development.
However, what happens during the initial stage of a human tumor is unknown because direct
observation is not available. Many genomic catastrophes are speculated to take place during that
early period of tumor growth, such as kataegis [93,94] and chromothripsis [95]. However, all that
is available to researchers is the data collected upon surgical removal of the tumor. Then, the key
question is can the pattern of genetic variation in an extracted tumor be used to infer key features
of the early stages of tumor growth, such as indicators of likely prognosis or best treatment
strategies?
95
In Chapter 3, I present an analysis to reconstruct the first several cell divisions in a
colorectal adenoma (tumor K). The results challenge the traditional “selective sweep” theory for
colorectal tumor development. The distributions of mutations and copy number changes across
glands offer us clues to infer the time of initiation of these genomic alterations, and the behavior
of the cells within the tumor (e.g. motility). Rather than seeing echoes of selective pressure, our
results suggest that the timing of these genomic alterations dictates their pervasiveness in the final
lesion. This supports the “big bang” model for colorectal tumor growth.
5.3 Mutation burst
One of the questions associated with the “big bang” model is whether there is a “burst” of
mutations that occurs at the initial stage of tumorigenesis. The events prior the first transformed
cell cannot be retrieved, as those genomic aberrations are present in all descendants of that
transformed cell. However, events that occur later are recorded differently, being present in parts,
but not all, of a tumor. Thus, their time of appearance, and subsequent behavior, can be inferred.
In Chapter 4, I establish a mathematical tumor growth model to investigate the existence
of a mutation burst. The essential components of the analysis are the different possible types of
mutations: “fixed” mutations, gland-specific mutations, and shared mutations. The timing of the
mutagenesis event determines the physical distribution and the allele frequency of a mutation,
therefore these types of mutations can serve as messengers, enabling us to estimate the mutation
rate in a time-specific way. The simulation study in Chapter 4 indicates that we do have the
power to estimate the mutation rates in the two stages of tumor development: the initial stage
before the first gland formation and the subsequent stage.
96
5.4 Future directions
5.4.1 Cell motility
One important topic researcher would like to explore is the motility of tumor cells.
Abnormal motility of cells within a tumor is an indicator of invasion and metastasis, and is
present in malignant tumors. My analysis suggests that it is rare for movement of early cells to be
observed in benign tumors (e.g. tumor K), as shown by the point mutation data (see Figure 3.2).
No variegated mutation is found in tumor K, suggesting that the motility of cells in adenomas is
limited. On the other hand, private mutations (variegated mutations) present in both sides of the
tumor reveal cell motility in cancers, for example tumor N in Figure 3.8. It is hypothesized that
this motility for cancer cells is acquired as soon as the cell transforms into a cancer progenitor
cell, rather than being a product of chronic selection sweeps on driver mutations. The reason for
this is that the timing of a mutation needs to be early in order for it to be detectable and universal
in a gland. In order for it to also be visible in the other tumor half, the migration event also needs
to occur early on. Using our model we can formally infer mixing rates within tumors, and relate
those to tumor behavior. Clearly it would be useful to extend this analysis to a greater number of
tumor samples, to see whether the patters we report here are widespread.
5.4.2 Scope of the data
One of the biggest issues that constrains our view of tumors is the size and scope of our
experimental data. Tumor growth models that have been proposed by scientists are typically built
upon the intuition formed from observation of data collected from tumors. However, the data
generated by current technology cannot fully represent the genomic structure of a tumor.
97
Traditionally, studies of tumor genomes are based on genomic data extracted from bulk tissue
samples of a tumor, which essentially provides an averaged view of the genomes across tens of
millions of cells. All the local variations at a small region are possibly lost and cannot be
recovered. Crucially, such samples also contain widespread contamination from normal tissue.
This motivates single gland sampling schemes, such as those we report here, in which relatively
homogenous samples are taken from extremely small cell population. However, even these
sampling schemes are prone to contamination, albeit on a much smaller scale. In addition, and
obviously, not all glands from a given tumor are sampled. A typical single gland of a colorectal
tumor consists of about 10,000 cells. In the future, it will be important to understand how well a
small sample of those cells can represent the population of ~10000 cells within a typical gland,
and how well a small sample of glands can represent the large number of glands within an entire
tumor. Our work only begins to answer such questions.
One way of avoiding contamination is to focus on single cells. A cell is the smallest
macro-unit of multicellular organisms. Sequencing technology now allows us to amplify and
sequence the whole genome of a single cell [96-98]. This technique is called single-cell
sequencing. It provides the finest level of genomic information as compared to other sequencing
methods. If, in the future, we can sequence a large number of cells within a tumor, the genomic
heterogeneity among them would provide an unprecedented tool to study the tumor development
and reveal the mystery of tumorigenesis.
In summary, we are still at the early stages of understanding the process of tumor growth
on a cellular level and how these processes might vary from tumor to tumor. Such growth
processes cannot be observed directly, but they can be inferred from the patterns of genetic
variation observed in a tumor at any given moment in time. Our work begins to address the
question of tumor growth, and clearly demonstrates that there is significant variation in patterns
98
of genetic heterogeneity across tumors. This is consistent with the idea that these tumors behave
differently, and, as is commonly seen, may therefore respond differently to a given treatment. By
beginning to understand tumor behavior on an individual level, as we do here, we might
reasonably hope to eventually allow investigators to propose individualized treatment regimens
for patients.
99
References
1. Siegmund KD, Marjoram P, Woo YJ, Tavare S, Shibata D (2009) Inferring clonal expansion
and cancer stem cell dynamics from DNA methylation patterns in colorectal cancers.
Proc Natl Acad Sci U S A 106: 4828-4833.
2. Hong YJ, Marjoram P, Shibata D, Siegmund KD (2010) Using DNA methylation patterns to
infer tumor ancestry. PloS one 5: e12002.
3. Hong YJ, Marjoram P, Shibata D, Siegmund KD (2010) Using DNA Methylation Patterns to
Infer Tumor Ancestry. Plos One 5.
4. Stryker SJ, Wolff BG, Culp CE, Libbe SD, Ilstrup DM, et al. (1987) Natural history of
untreated colonic polyps. Gastroenterology 93: 1009-1013.
5. Levine JS, Ahnen DJ (2006) Clinical practice. Adenomatous polyps of the colon. N Engl J
Med 355: 2551-2557.
6. Risio M (2010) The natural history of adenomas. Best Pract Res Clin Gastroenterol 24: 271-
280.
7. Garcia SB, Park HS, Novelli M, Wright NA (1999) Field cancerization, clonality, and
epithelial stem cells: the spread of mutated clones in epithelial sheets. J Pathol 187: 61-81.
8. Wright NA, Poulsom R (2002) Top down or bottom up? Competing management structures in
the morphogenesis of colorectal neoplasms. Gut 51: 306-308.
9. Siegmund KD, Marjoram P, Tavare S, Shibata D (2011) High DNA methylation pattern
intratumoral diversity implies weak selection in many human colorectal cancers. PLoS
One 6: e21657.
100
10. Sottoriva A, Spiteri I, Shibata D, Curtis C, Tavare S (2013) Single-molecule genomic data
delineate patient-specific tumor profiles and cancer stem cell organization. Cancer Res 73:
41-49.
11. Zhao J, Siegmund KD, Shibata D, Marjoram P (2014) Ancestral inference in tumors: how
much can we know? J Theor Biol 359: 136-145.
12. Shibata D (2009) Inferring human stem cell behaviour from epigenetic drift. The Journal of
pathology 217: 199-205.
13. Shibata D, Tavaré S (2006) Counting divisions in a human somatic cell tree: how, what and
why. Cell Cycle 5: 610-614.
14. Laird AK (1964) Dynamics of Tumor Growth. Br J Cancer 13: 490-502.
15. d’Onofrio A (2005) A general framework for modeling tumor-immune system competition
and immunotherapy: Mathematical analysis and biomedical inferences. Physica D:
Nonlinear Phenomena 208: 220-235.
16. Beerenwinkel N, Antal T, Dingli D, Traulsen A, Kinzler KW, et al. (2007) Genetic
progression and the waiting time to cancer. PLoS Comput Biol 3: e225.
17. Bozic I, Antal T, Ohtsuki H, Carter H, Kim D, et al. (2010) Accumulation of driver and
passenger mutations during tumor progression. Proc Natl Acad Sci U S A 107: 18545-
18550.
18. Greaves M, Maley CC (2012) Clonal evolution in cancer. Nature 481: 306-313.
19. Jones S, Chen W-d, Parmigiani G, Diehl F, Beerenwinkel N, et al. (2008) Comparative lesion
sequencing provides insights into tumor evolution. Proceedings of the National Academy
of Sciences 105: 4283-4288.
20. Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA, Jr., et al. (2013) Cancer
genome landscapes. Science 339: 1546-1558.
101
21. Marjoram P, Donnelly P (1997) Human demography and the time since mitochondrial Eve.
Institute for Mathematics and Its Applications 87: 107.
22. Pritchard JK, Seielstad MT, Perez-Lezaun A, Feldman MW (1999) Population growth of
human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology
and Evolution 16: 1791-1798.
23. Bromham L, Penny D (2003) The modern molecular clock. Nature Reviews Genetics 4: 216-
224.
24. Logothetis CJ (2013) Re: intratumor heterogeneity and branched evolution revealed by
multiregion sequencing. Eur Urol 64: 170.
25. Tannock IF (2014) Words of wisdom. Re: Intratumor heterogeneity and branched evolution
revealed by multiregion sequencing. Eur Urol 65: 846-847.
26. Laird AK (1964) Dynamics of tumour growth. British journal of cancer 18: 490.
27. Ambrosi D, Preziosi L (2002) On the closure of mass balance models for tumor growth.
Mathematical Models and Methods in Applied Sciences 12: 737-754.
28. Byrne H, Preziosi L (2003) Modelling solid tumour growth using the theory of mixtures.
Mathematical Medicine and Biology 20: 341-366.
29. Anderson A, Chaplain M, Rejniak K, Fozard J (2008) Single-cell-based models in biology
and medicine. Mathematical Medicine and Biology.
30. Klein CA, Hö lzel D (2006) Systemic cancer progression and tumor dormancy: mathematical
models meet single cell genomics. Cell Cycle 5: 1788-1798.
31. Navin N, Kendall J, Troge J, Andrews P, Rodgers L, et al. (2011) Tumour evolution inferred
by single-cell sequencing. Nature 472: 90-94.
102
32. Shibata DK, Lieber MR (2010) Is there any genetic instability in human cancer? DNA repair
9: 858.
33. Yatabe Y, Tavaré S, Shibata D (2001) Investigating stem cells in human colon by using
methylation patterns. Proceedings of the National Academy of Sciences 98: 10839-10844.
34. Ramaley J (1969) Buffon's noodle problem. The American Mathematical Monthly 76: 916-
918.
35. Ripley B (1987) Stochastic Simulation. 1987. Wiley, New York.
36. Tavare S, Balding DJ, Griffiths RC, Donnelly P (1997) Inferring coalescence times from
DNA sequence data. Genetics 145: 505-518.
37. Marjoram P, Tavare S (2006) Modern computational approaches for analysing molecular
genetic variation data. Nat Rev Genet 7: 759-770.
38. Beaumont MA, Zhang W, Balding DJ (2002) Approximate Bayesian computation in
population genetics. Genetics 162: 2025-2035.
39. Ripley BD (2009) Stochastic simulation: Wiley. com.
40. Fearnhead P, Prangle D (2012) Constructing summary statistics for approximate Bayesian
computation: semi ‐automatic approximate Bayesian computation. Journal of the Royal
Statistical Society: Series B (Statistical Methodology) 74: 419-474.
41. Woo YJ, Siegmund KD, Tavaré S, Shibata D (2009) Older individuals appear to acquire
mitotically older colorectal cancers. The Journal of pathology 217: 483-488.
42. Ricci-Vitiani L, Lombardi DG, Pilozzi E, Biffoni M, Todaro M, et al. (2006) Identification
and expansion of human colon-cancer-initiating cells. Nature 445: 111-115.
43. Barnes CP, Filippi S, Stumpf MP, Thorne T (2012) Considerate approaches to constructing
summary statistics for ABC model selection. Statistics and Computing 22: 1181-1197.
103
44. Jung H, Marjoram P (2011) Choice of summary statistic weights in approximate Bayesian
computation. Stat Appl Genet Mol Biol 10.
45. Joyce P, Marjoram P (2008) Approximately sufficient statistics and Bayesian computation.
Statistical Applications in Genetics and Molecular Biology 7.
46. Nunes MA, Balding DJ (2010) On optimal selection of summary statistics for approximate
Bayesian computation. Statistical applications in genetics and molecular biology 9.
47. Booth RA (2007) Minimally invasive biomarkers for detection and staging of colorectal
cancer. Cancer Lett 249: 87-96.
48. Fearon ER, Vogelstein B (1990) A genetic model for colorectal tumorigenesis. Cell 61: 759-
767.
49. Wood LD, Parsons DW, Jones S, Lin J, Sjoblom T, et al. (2007) The genomic landscapes of
human breast and colorectal cancers. Science 318: 1108-1113.
50. Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, et al. (2013) Mutational
heterogeneity in cancer and the search for new cancer-associated genes. Nature 499: 214-
218.
51. Gerlinger M, Rowan AJ, Horswell S, Larkin J, Endesfelder D, et al. (2012) Intratumor
heterogeneity and branched evolution revealed by multiregion sequencing. N Engl J Med
366: 883-892.
52. Humphries A, Cereser B, Gay LJ, Miller DS, Das B, et al. (2013) Lineage tracing reveals
multipotent stem cells maintain human adenomas and the pattern of clonal expansion in
tumor evolution. Proc Natl Acad Sci U S A 110: E2490-2499.
53. Siegmund KD, Marjoram P, Tavare S, Shibata D (2009) Many colorectal cancers are "flat"
clonal expansions. Cell Cycle 8: 2187-2193.
104
54. Gerlinger M, Horswell S, Larkin J, Rowan AJ, Salm MP, et al. (2014) Genomic architecture
and evolution of clear cell renal cell carcinomas defined by multiregion sequencing. Nat
Genet 46: 225-233.
55. Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, et al. (2013) Sensitive
detection of somatic point mutations in impure and heterogeneous cancer samples. Nat
Biotechnol 31: 213-219.
56. Staaf J, Vallon-Christersson J, Lindgren D, Juliusson G, Rosenquist R, et al. (2008)
Normalization of Illumina Infinium whole-genome SNP data improves copy number
estimates and allelic intensity ratios. BMC Bioinformatics 9: 409.
57. Olshen AB, Bengtsson H, Neuvial P, Spellman PT, Olshen RA, et al. (2011) Parent-specific
copy number in paired tumor-normal studies using circular binary segmentation.
Bioinformatics 27: 2038-2046.
58. Kunkel TA, Bebenek K (2000) DNA replication fidelity. Annu Rev Biochem 69: 497-529.
59. Drake JW (1991) A constant rate of spontaneous mutation in DNA-based microbes. Proc Natl
Acad Sci U S A 88: 7160-7164.
60. Bozic I, Gerold JM, Nowak MA (2016) Quantifying Clonal and Subclonal Passenger
Mutations in Cancer Evolution. PLoS Comput Biol 12: e1004731.
61. Sottoriva A, Kang H, Ma Z, Graham TA, Salomon MP, et al. (2015) A Big Bang model of
human colorectal tumor growth. Nat Genet 47: 209-216.
62. Shibata D (2012) Cancer. Heterogeneity and tumor history. Science 336: 304-305.
63. Kang H, Salomon MP, Sottoriva A, Zhao J, Toy M, et al. (2015) Many private mutations
originate from the first few divisions of a human colorectal adenoma. J Pathol 237: 355-
362.
105
64. Kimble JE, White JG (1981) On the control of germ cell development in Caenorhabditis
elegans. Dev Biol 81: 208-219.
65. Zwaka TP, Thomson JA (2005) Differentiation of human embryonic stem cells occurs
through symmetric cell division. Stem Cells 23: 146-149.
66. Betschinger J, Knoblich JA (2004) Dare to be different: asymmetric cell division in
Drosophila, C. elegans and vertebrates. Curr Biol 14: R674-685.
67. Clevers H (2005) Stem cells, asymmetric division and cancer. Nat Genet 37: 1027-1028.
68. Bodine DM, Seidel NE, Orlic D (1996) Bone marrow collected 14 days after in vivo
administration of granulocyte colony-stimulating factor and stem cell factor to mice has
10-fold more repopulating ability than untreated bone marrow. Blood 88: 89-97.
69. Sasaki M, Abe R, Fujita Y, Ando S, Inokuma D, et al. (2008) Mesenchymal stem cells are
recruited into wounded skin and contribute to wound repair by transdifferentiation into
multiple skin cell type. J Immunol 180: 2581-2587.
70. Hearn JP (2001) Embryo implantation and embryonic stem cell development in primates.
Reprod Fertil Dev 13: 517-522.
71. Knoblich JA (2010) Asymmetric cell division: recent developments and their implications for
tumour biology. Nature Reviews Molecular Cell Biology 11: 849-860.
72. Gomez-Lopez S, Lerner RG, Petritsch C (2014) Asymmetric cell division of stem and
progenitor cells during homeostasis and cancer. Cell Mol Life Sci 71: 575-597.
73. Chia W, Somers WG, Wang H (2008) Drosophila neuroblast asymmetric divisions: cell cycle
regulators, asymmetric protein localization, and tumorigenesis. J Cell Biol 180: 267-272.
74. Wodarz A, Nathke I (2007) Cell polarity in development and cancer. Nat Cell Biol 9: 1016-
1024.
106
75. Habib SJ, Chen BC, Tsai FC, Anastassiadis K, Meyer T, et al. (2013) A localized Wnt signal
orients asymmetric stem cell division in vitro. Science 339: 1445-1448.
76. Caussinus E, Hirth F (2007) Asymmetric stem cell division in development and cancer. Prog
Mol Subcell Biol 45: 205-225.
77. Zhang D, Wang Y, Zhang S (2014) Asymmetric cell division in polyploid giant cancer cells
and low eukaryotic cells. Biomed Res Int 2014: 432652.
78. Jung H, Marjoram P (2011) Choice of Summary Statistic Weights in Approximate Bayesian
Computation. Statistical Applications in Genetics and Molecular Biology 10.
79. Joyce P, Marjoram P (2008) Approximately sufficient statistics and bayesian computation.
Stat Appl Genet Mol Biol 7: Article26.
80. Blum MG, Nunes MA, Prangle D, Sisson SA (2013) A comparative review of dimension
reduction methods in approximate Bayesian computation. Statistical Science 28: 189-208.
81. Nunes MA, Balding DJ (2010) On optimal selection of summary statistics for approximate
Bayesian computation. Stat Appl Genet Mol Biol 9: Article34.
82. Wegmann D, Leuenberger C, Excoffier L (2009) Efficient approximate Bayesian computation
coupled with Markov chain Monte Carlo without likelihood. Genetics 182: 1207-1218.
83. Hamilton G, Currat M, Ray N, Heckel G, Beaumont M, et al. (2005) Bayesian estimation of
recent migration rates after a spatial expansion. Genetics 170: 409-417.
84. Hong JY, Liu X, Mao M, Li M, Choi DI, et al. (2013) Genetic aberrations in imatinib-
resistant dermatofibrosarcoma protuberans revealed by whole genome sequencing. PLoS
One 8: e69752.
85. Nielsen R, Paul JS, Albrechtsen A, Song YS (2011) Genotype and SNP calling from next-
generation sequencing data. Nat Rev Genet 12: 443-451.
107
86. Ricci-Vitiani L, Lombardi DG, Pilozzi E, Biffoni M, Todaro M, et al. (2007) Identification
and expansion of human colon-cancer-initiating cells. Nature 445: 111-115.
87. Li T, Su Y, Mei Y, Leng Q, Leng B, et al. (2010) ALDH1A1 is a marker for malignant
prostate stem cells and predictor of prostate cancer patients' outcome. Lab Invest 90: 234-
244.
88. Wodarz A, Gonzalez C (2006) Connecting cancer to the asymmetric division of stem cells.
Cell 124: 1121-1123.
89. Neumuller RA, Knoblich JA (2009) Dividing cellular asymmetry: asymmetric cell division
and its implications for stem cells and cancer. Genes Dev 23: 2675-2699.
90. Chiba T, Iwama A, Yokosuka O (2016) Cancer stem cells in hepatocellular carcinoma:
Therapeutic implications based on stem cell biology. Hepatol Res 46: 50-57.
91. Tanaka S (2015) Cancer stem cells as therapeutic targets of hepato-biliary-pancreatic cancers.
J Hepatobiliary Pancreat Sci 22: 531-537.
92. Sehl ME, Shimada M, Landeros A, Lange K, Wicha MS (2015) Modeling of Cancer Stem
Cell State Transitions Predicts Therapeutic Response. PLoS One 10: e0135797.
93. Heng HH, Stevens JB, Liu G, Bremer SW, Ye KJ, et al. (2006) Stochastic cancer progression
driven by non-clonal chromosome aberrations. J Cell Physiol 208: 461-472.
94. Nik-Zainal S, Alexandrov LB, Wedge DC, Van Loo P, Greenman CD, et al. (2012)
Mutational processes molding the genomes of 21 breast cancers. Cell 149: 979-993.
95. Stephens PJ, Greenman CD, Fu B, Yang F, Bignell GR, et al. (2011) Massive genomic
rearrangement acquired in a single catastrophic event during cancer development. Cell
144: 27-40.
96. Navin N, Kendall J, Troge J, Andrews P, Rodgers L, et al. (2011) Tumour evolution inferred
by single-cell sequencing. Nature 472: 90-94.
108
97. Eberwine J, Sul JY, Bartfai T, Kim J (2014) The promise of single-cell sequencing. Nat
Methods 11: 25-27.
98. Mato Prado M, Frampton AE, Stebbing J, Krell J (2016) Single-cell sequencing in cancer
research. Expert Rev Mol Diagn 16: 1-5.
Abstract (if available)
Abstract
Tumorigenesis is the process by which a tumor is formed. It starts with the transformation of normal cells into cells with uncontrolled growth, coupled with genetic, epigenetic, and cellular changes. One of the challenges that researchers have to face when studying tumor growth is that only the end point of the growth process is observed. Namely, we are not able to examine the clonal expansion of the first transformed cell. To design an effective therapeutic strategy to treat tumors and to prevent tumorigenesis, basic understanding of the entire developmental details of tumors is needed. Intratumor heterogeneity (ITH) of sequencing data, i.e. the fact that not all parts of a given tumor are the same, has been observed in tumors in various genomic data sets. Thus all tumors are not the same, and it is likely that they should not all be treated in the same way. In this thesis, I aim to better understand tumor growth and the variations in this process between individuals by viewing tumor growth as an evolutionary process. I will decipher the past of a given tumor using techniques borrowed from molecular phylogeny using the information about that past that is contained within the ITH. ❧ In summary, here I describe methods and models to study ITH contained in methylation data, exome sequencing data and SNP array data in order to decipher the ancestry of colorectal tumors and estimate several important parameters, such as methylation error rate, demethylation error rate, mutation rate, number of cancer stem cells, and the probability that a stem cell undergoes symmetric division or asymmetric division. ❧ In Chapter 1, I give a general introduction to the field. In Chapter 2, I demonstrate that DNA methylation data that are collected from colorectal tumors can be used to infer the ancestral methylation state, the number of cancer stem cells, and methylation/demethylation error rate. In addition, I demonstrate that methylation data have limitations for parameter inference. In Chapter 3, I present work I performed as part of the development of an alternative tumor growth model, a “big bang” model, for colorectal tumors, which is supported by genomic data from spatial sampling. In Chapter 4, I explore the possibility that the DNA mutations from single gland exome sequencing data can be utilized to infer the division patterns of cancer stem cells and to infer whether or not there is a mutation burst in the early stage of tumorigenesis. Finally, in Chapter 5, I summarize the work presented in my thesis and discuss future questions of interest.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Developing a robust single cell whole genome bisulfite sequencing protocol to analyse circulating tumor cells
PDF
CpG methylation profiling in lung cancer cell lines, tumors and non-tumors
PDF
Evolutionary genomic analysis in heterogeneous populations of non-model and model organisms
PDF
DNA hypermethylation: its role in colorectal tumorigenesis and potential clinical applications
PDF
Modeling mutational signatures in cancer
PDF
Understanding DNA methylation and nucleosome organization in cancer cells using single molecule sequencing
PDF
Molecular role of EZH2 overexpression in Colorectal Cancer progression
PDF
RNA methylation in cancer plasticity and drug resistance
PDF
Functional DNA methylation changes in normal and cancer cells
PDF
Effects of chromatin regulators during carcinogenesis
PDF
Exploration of the roles of cancer stem cells and survivin in the pathogenesis and progression of prostate cancer
PDF
Identification and characterization of cancer-associated enhancers
PDF
Investigating the effects of polycomb repressive complex inhibitors on cancer cell phenotypic plasticity
PDF
Identifying prognostic gene mutations in colorectal cancer with random forest survival analysis
PDF
Harnessing the power of stem cell self-renewal pathways in cancer: dissecting the role of BMI-1 in Ewing’s sarcoma initiation and maintenance
PDF
The cancer stem-like phenotype: therapeutics, phenotypic plasticity and mechanistic studies
PDF
Polycomb repressive complex 2 subunit stabilizes NANOG to maintain self-renewal in hepatocellular carcinoma tumor-initiating stem-like cells
PDF
DNA methylation and gene expression profiles in Vidaza treated cultured cancer cells
PDF
Importance of alpha-catenin and cell-cell interactions in hair follicle stem cells homeostasis
PDF
Deconvolution of circulating tumor cell heterogeneity and implications for aggressive variant prostate cancer
Asset Metadata
Creator
Zhao, Junsong
(author)
Core Title
Ancestral inference and cancer stem cell dynamics in colorectal tumors
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Molecular Biology
Publication Date
07/17/2016
Defense Date
05/10/2016
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
asymmetric division ancestry intratumor heterogeneity tumor growth model,cancer stem cell approximate Bayesian computation methylation DNA mutation,colorectal tumor,OAI-PMH Harvest
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Nuzhdin, Sergey (
committee chair
), Marjoram, Paul (
committee member
), Ralph, Peter (
committee member
), Siegmund, Kimberly (
committee member
)
Creator Email
junsongz@usc.edu,junsongzhao.ah@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-271104
Unique identifier
UC11280414
Identifier
etd-ZhaoJunson-4555.pdf (filename),usctheses-c40-271104 (legacy record id)
Legacy Identifier
etd-ZhaoJunson-4555.pdf
Dmrecord
271104
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Zhao, Junsong
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
asymmetric division ancestry intratumor heterogeneity tumor growth model
cancer stem cell approximate Bayesian computation methylation DNA mutation
colorectal tumor