University of Southern California Dissertations and Theses

Modeling mutational signatures in cancer

(USC Thesis Other)

Modeling mutational signatures in cancer

PDF

Download a page range

Download transcript

Copy asset link

Request this asset

Transcript (if available)

Content Modeling Mutational Signatures in
Cancer
by
Zhi Yang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOSTATISTICS)
December 2019
Copyright 2019 Zhi Yang
Acknowledgements
I would like to express the deepest appreciation to my committee chair, Dr. Kimberly
Siegmund. From the first day, she believed in me and took me as her research assistant
despite my failure in the screening exam. Later on, every milestone achievement was
accompanied by her insightful suggestions and endless support. At every group meeting,
her thoughtful comments have helped me develop independent thinking skills. And
her encouragement and support have a tremendous impact on building my confidence
and improving public speaking. I will always remember her dedication to teaching and
research and take it along with me.
I alsowould like toexpress my gratitude to my committee member, Dr. Paul Marjoram.
He has manifested his keen intuitive vision for research problems at every single research
meeting that we attended together. I am thankful for his wise academic advice and
encouragement through my Ph.D. work. The work would not have been done without the
support of other committee members including Dr. David Conti for his crucial remarks
on my work, and Dr. Darryl Shibata for providing the data and invaluable biological
insights.
I would like to thank my peers in doctoral programs including Ashley Song, Ugonna
Ihenacho, and Intira Sriprasert. They have been supporting me emotionally and academi-
cally in the past four years. I would like to thank my R friends, Malcolm Barrett, George
Vega Yon, and Emil Hvitfeldt for including me in organizing the Los Angeles East R
Users Group and introducing me so many awesome things about R and the community.
Besides, I would like to express my gratitude to faculty including Dr. Wendy Mack for
helping me innumerouswayswith passingthe screeningexamand warmcare byfrequently
asking me how I feel about my school and research work. Besides my dissertation work,
I am also grateful to Dr. Jessica L. Barrington-Trimis. Dr. Kiros Berhane, and Dr.
Rob Scot McConnell for letting me undertake the electronic cigarette projects and their
mentorship.
A special thanks to my parents and parents-in-law who have supported my decision
about pursuing a doctoral degree far away from home. I would especially like to thank
my husband, Ethan Wang, and my pet, Fluffy, for supporting me along the way and
giving me the complete freedom and continued support to pursue my dream.
iii
Abstract
Human cancer genomes carry somatic mutations that arise from a variety of biological
processes. Much scientific research has been focused on very few " driver" mutations,
which leaves a massive number of remaining "passenger" mutations largely unexplained.
Those widespread somatic mutations in cancer genomes are a lifetime accumulation of
mutations that resulted from a variety of mutational processes that could be possibly
linked to tumorigenesis. Different biological processes produce different patterns of
somatic mutations that leave their fingerprints on human genomes. These fingerprints
are called mutational signatures of the mutational processes. Currently, there are two
frameworks used to characterize and visualize mutational signatures. One framework was
proposed by Alexandrov et al. and the second by Shiraishi et al. We refer to these as the
saturated model and the independent model, respectively.
The study of mutational signatures poses a number of challenges. The number of
mutational signatures needs to be determined before inferring mutational signatures
and their mutation exposures (frequencies) in cancer samples. Therefore, the first
objective of this dissertation is focused on developing an efficient algorithm to detect
the number of underlying mutational signatures. In Chapter 2, we present an automatic
and efficient approach, based on perplexity and cross-validation, to provide a simple and
straightforward way to obtain the number of signatures. The method is evaluated in a
simulation study that considers a variety of different scenarios, including different numbers
of mutational catalogs, different numbers of somatic mutations, and different maximum
cosine similarity among mutational signatures. A comparison of this method to other
approaches, including Akaike information criterion, corrected Akaike information criterion,
and Bayesian information criterion, is conducted to assess their relative performance
in estimating the correct number of signatures. The work then serves as a vital step
that is carried out before the second objective, which is to investigate the association of
mutational exposures across groups.
Mutational signatures provide a practical tool to study heterogeneity in tumor samples.
Our particular interest is to identify signatures that may occur de novo during tumor
growth. Just like phylogenetics and human development, tumor growth requires genome
replication, which generates intratumor heterogeneity as a consequence of replication
errors. Early somatic mutations accumulated in the period between the zygote and the
initiating tumor cell should appear in all descendant cells, while those that appear later
in growth will appear in progressively smaller subsets of cells in the tumor. These are
called trunk and branch mutations, respectively. Using multi-regional tumor sampling,
we can distinguish trunk from branch mutations and ask whether the mutation exposures
(compositions of the associated fractions) of mutational signatures in the first tumor cell
differ from that of signatures of tumor growth. Similarly, this idea can be applied to any
associationsbetweenmutationalexposuresandclinicallyimportantvariables/demographic
features. Researchers usually conduct post hoc analyses on the estimated proportions to
further examine this difference. The second objective is to compare mutation exposures
across different features. To achieve the second objective, we develop a hierarchical latent
Dirichlet allocation model to fit two groups of samples (before and after tumor initiation)
as well as to test whether the contributions of the signatures differ across those different
stages of cancer development. This improves the accuracy and robustness of comparison
v
by including every mutation in the likelihood estimation. This approach was implemented
under various scenarios and can be further used for discovering differences according to
tumor prognosis, ethnic group, tumor subgroup, therapeutic group and so on. In our first
application, we test for differences by time of occurrence, before or after tumor initiation.
In a esophageal cancer dataset, we look at the association between mutational exposures
and four variables including sex, age, smoking status, and tumor site.
Lastly, regardless of which framework is used to extract signatures, researchers
would like to compare and contrast the signature findings to known signature databases,
or even across two methods, to further interpret their findings. This process can be
easily completed through statistical software such as R by importing the signature data,
calculating similarities, and subsequent data visualization. To facilitate this process,
we provide a simple and intuitive computer interface using R shiny that allows such
comparisons to be easily performed. Users can compare signatures from one method to
another. Furthermore, this web application enables users to input a self-defined signature
and examine its relationship to known signatures from both existing frameworks.
vi
Table of contents
List of figures x
List of tables xiii
Nomenclature xiv
Chapter 1 Anoverviewofconcepts,methods,andobjectivesformodeling
somatic mutations 1
1.1 Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Somatic mutations . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Mutational processes . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Mutational signatures and mutation features . . . . . . . . . . . . 5
1.1.4 Mutational catalogs . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1 Modeling mutational signatures under the saturated model . . . . 9
1.2.2 Modeling mutational signatures under the independent model . . 13
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Chapter 2 Selectingthenumberofmutationalsignaturesusingaperplexity-
based measure and cross-validation 25
Table of contents
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.1 Method evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.1 Number of catalogs . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3.2 Maximum pairwise cosine similarity . . . . . . . . . . . . . . . . . 36
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Chapter 3 Inferring the difference in mutational signatures via HiLDA 41
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.1 Hierarchical Bayesian Mixture Model . . . . . . . . . . . . . . . . 43
3.2.2 Testing for Differences in Signature Exposures . . . . . . . . . . . 46
3.2.3 Two-stage Inference Methods using the Point Estimates of muta-
tional exposures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2.4 Choosing the Number of Signatures . . . . . . . . . . . . . . . . . 50
3.2.5 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.2.6 Data preparation in 17 CRC tumors . . . . . . . . . . . . . . . . 52
3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.3.1 Tumor Evolution in USC Colon Cancer Data . . . . . . . . . . . . 56
3.3.2 Esophageal Adenocarcinoma . . . . . . . . . . . . . . . . . . . . . 59
3.3.3 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
viii
Table of contents
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Chapter 4 iMutSig: a web application to identify the most similar
mutational signature using shiny 71
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2.1 Main features of iMutSig . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Chapter 5 Conclusions and future work 78
5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
References 81
ix
List of figures
1.1 An illustration of non-differentiation of two different mutation substitution 6
1.2 A mutational spectrum (catalog) of the final cancer genome. . . . . . . . 7
1.3 An illustration of a signature under the saturated model . . . . . . . . . 10
1.4 An illustration of some of all 1536 possible mutation patterns for signatures
consisting of one substitution and four flanking bases. . . . . . . . . . . . 14
1.5 An illustration of the independent model of characterizing the mutational
signatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6 An illustration of the independent model of characterizing the mutational
signatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1 Comparison of percentages of underestimation, success, and overestimation,
whendeterminingthenumberofsignaturespresent, asafunctionofmethod
used and number of mutations per catalog. . . . . . . . . . . . . . . . . . 35
2.2 Comparison of percentages of underestimation, success, and overestimation
when determining the number of signatures present as a function of method
and number of mutational catalogs. . . . . . . . . . . . . . . . . . . . . . 36
x
List of figures
2.3 Comparison of how the percentage of success when estimating the true
number of signatures, obtained by each of the four methods, varies both as
a function of the number of mutations in the data sets and the maximum
pairwise cosine similarity value. . . . . . . . . . . . . . . . . . . . . . . . 38
3.1 A detailed comparison of two-step method and the hierarchical latent
Dirichlet allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Plate notation for hierarchical LDA with Dirichlet-distributed signature-
mutation distributions exemplified by two groups, i.e., trunk mutations
and branch mutations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3 Bayesian comparison of signature fractions from the null model and alter-
native model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4 Thelog-likelihood, bootstrap-erros, andmaximumcorrelationvaluesamong
estimated mutational exposure parameters for a series of K number of
signatures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5 Mutation accumulation under the clonal evolution model . . . . . . . . . 53
3.6 The numbers of somatic mutations in 32 mutational catalogs obtained
from 16 colon cancer patients in the USC data and their mutation spectra. 57
3.7 Mutational exposures and three mutational signatures from the analysis
of 16 trunk mutational catalogs and 16 branch mutational catalogs in the
USC data (16 colon cancer patients) . . . . . . . . . . . . . . . . . . . . 67
3.8 Estimated mutational exposures and posterior distributions of mean dif-
ferences in mutational exposures from the analysis of the EAC data (146
esophageal adenocarcinoma patients) . . . . . . . . . . . . . . . . . . . . 68
xi
List of figures
3.9 Estimated mutational exposures and posterior distributions of mean dif-
ferences in mutational exposures from the analysis of the EAC data (146
esophageal adenocarcinoma patients) by pmsignature and HiLDA. . . . . 69
3.10 The generative process of a mutation in the simulation study is illustrated
here. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.11 A small simulation study demonstrates how the rejection rate of the two-
stage method could be influenced by different variances in two groups with
the same means. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.1 Overview of three workflows in the iMutSig interface. . . . . . . . . . . . 75
xii
List of tables
1.1 List of notation for the saturated model . . . . . . . . . . . . . . . . . . 9
1.2 List of notation under the independent model . . . . . . . . . . . . . . . 15
1.3 An example of signatures under the independent model. . . . . . . . . . . 17
2.1 List of measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 Observed pair-wise cosine similarity for the four EAC mutational signatures 31
3.1 The scale for interpretation of the Bayes factor . . . . . . . . . . . . . . . 48
3.2 The number of somatic mutations in the 32 mutational catalogs from
16 colon tumors in the USC data, along with their associated trunk and
branch mutations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3 Comparing mutational exposures from two sets of mutational catalogs,
side A and side B, in the USC data. . . . . . . . . . . . . . . . . . . . . . 61
3.4 Comparing mutational exposures in colorectal cancer between trunk and
branch mutational catalogs. . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.5 The false positive rates (FPR), true positive rates (TPR), and updated
true positive rates of both the two-stage method and HiLDA . . . . . . . 63
4.1 An example of PM signatures. . . . . . . . . . . . . . . . . . . . . . . . . 73
xiii
Nomenclature
Roman Symbols
J
i
The total number of mutations within the ith mutational catalog
E The matrix of the intensities in mutational catalogs
G The matrix of the mutational catalogs
I The total number of the mutational catalogs
K The total number of signatures
L The total dimensions of mutation features
M The vector of mutation features
Q The matrix of frequency distributions in multiple mutational signatures
S The matrix of signatures
X The mutation charateristics vector
Superscripts
g Index of being a trunk or branch sample
xiv
Nomenclature
Subscripts
i the ith mutational catalog
j the jth mutation
k the kth signature
l the lth mutation feature
Acronyms / Abbreviations
AIC Akaike Information Criterion
AICc Corrected Akaike Information Criterion
ALL Acute Lymphoblastic Leukaemia
AML Acute Myeloid Leukaemia
BIC Bayesian Information Criterion
CRC Colorectal Cancer
EAC Esophageal Cancer
EM Expectation Maximization
HiLDA Hierarchical Latent Dirichlet Allocation
JAGS Just Another Gibbs Sampler
LDA Latent Dirichlet Allocation
LL Log-Likelihood
xv
Nomenclature
MCMC Markov chain Monte Carlo
MSI Microsatellite Instable
MSS Microsatellite Stable
NMF Nonnegative Matrix Factorization
PY Perplexity
R3PY Repeated 3-fold Cross-validation Perplexity
TCGA The Cancer Genome Altas
USC University of Southern California
WES Whole-exome Sequencing
WGS Whole-genome Sequencing
xvi
Chapter 1
An overview of concepts, methods, and
objectives for modeling somatic
mutations
Somatic mutations found in human cancer genomes are a lifetime’s accumulated result
of a variety of biological processes occurring in human tissue. A biological process,
causing somatic mutations in cells and following specific mechanisms, is defined as a
mutational process. Whole-genome sequencing (WGS) and whole-exome sequencing
(WES) allow researchers to identify these mutations generated by mutational processes
and use them to study the development of cancer. A mere handful of "driver" mutations,
which confer growth advantage and are causally implicated in oncogenesis, have been
extensively studied, whereas thousands of co-occurring "passenger" mutations are less
vigorously studied [1]. Recently, there has been an increasing emphasis on associating
those "passenger" mutations with mutational processes, which might provide insights
into understanding and characterizing the mutation-causing processes [1, 2]. Initially,
1
1.1 Concepts
"passenger" mutations were not vigorously studied using complex statistical modeling.
Instead investigators constructed simple frequency tables of mutational patterns. For
example, the prevalence of mutational patterns was recorded and then associated with
clinically important features or demographics (e.g., there is a higher fraction of AA
transversions in the regions above the gastric-esophageal junction [3]; C > T is the most
common substitution in esophageal squamous cell carcinoma of a Chinese population [4].
In the first chapter of this proposal, we will introduce several important concepts for
modeling somatic mutations such as mutational process, mutational signature, mutational
catalog. We will then specify and distinguish two modeling methods, the saturated model
and the independent model, under each of which these key concepts are emphasized in
mathematical expressions [5, 6]. The main aim of this first chapter to provide a detailed
overview of two current methods, and summarize their strengths and weaknesses in
characterizing mutational processes. We then will introduce intrinsic motivations for
comparing metrics in determining the number of mutational signatures, followed by a
new perspective on inferring the differences in mutation exposures across groups using a
hierarchical model. At last, we also present an easy-to-use web-application to associate
the findings between two current methods. Three objectives will be briefly introduced
and summarized in this chapter to provide a theoretical framework for signature selection,
inferring the change in mutation exposures, and identifying the most similar mutational
signatures across methods.
1.1 Concepts
In section 1.1, we will introduce some new concepts, such as driver mutations, passenger
mutation, mutational process, mutational signature, mutational catalog, that one needs
2
1.1 Concepts
to know to understand how to model and characterize biological processes in somatic
mutations. Specific terms will be explained in depth, and their mathematical specifications
will be provided later in Methods section 1.2.
1.1.1 Somatic mutations
Prevalence across cancer types
In 2013, Alexandrov et al. examined somatic mutations from whole-exome sequencing
(WES) and whole-genome sequencing (WGS) on 7,402 cancers of 30 different classes, and
found a large variation in mutation prevalences across cancer types [7]. Later in 2018,
Alexandrov et al. has expanded the experiment to a much wider range by examining
the somatic mutations of 23,829 samples of most cancer types [8]. They found that the
childhood cancer types have a relatively lower prevalence of somatic mutation, whereas
those cancer types related to chronic mutagenic exposures exhibit much higher prevalence.
This fact is consistent with the well-established findings of mutagenesis in those cancers,
such as tobacco smoking for lung cancer, ultraviolet light exposure for skin cancer, and
defective DNA mismatch repair for some colorectal cancer. In contrast, early-onset
cancer such as Acute myeloid leukemia (AML) and Acute lymphoblastic leukemia (ALL)
tend to have lower mutation prevalence since fewer mutations have been accumulated
(because these patients are typically much younger). The variation in prevalence of
somatic mutations is in some part due to the duration of mutagenic exposure from the
fertilized egg to the time when the tumor cell is sequenced.
3
1.1 Concepts
Driver and passenger mutations
All cancer genomes carry somatic mutations that arise from a variety of biological
processes and can now be identified by WES and WGS. A few "driver" mutations have
been known to confer growth advantage and are causally implicated in oncogenesis.
This is a reflection of the fact that investigating and discovering driver mutations has
always been something of a priority in the study of tumor development [1]. However,
the handful of driver mutations do not completely account for the causes of cancers [9].
Compared to driver mutations, the remaining "passenger" mutations make up a large
proportion of the total mutation burden and are less vigorously studied. This is perhaps
because the "passenger" mutations are mostly unmodified by selection [ 10]. However,
those uncharacterized somatic mutations are the result of a lifetime’s accumulation of
the results of different mutational processes, such as those relating to DNA damage
and failure of DNA repair processes [11]. Those biological processes might be linked to
tumorigenesis and so further used to help uncover the underlying mechanisms for cancer
development. Thus, investigating the observed patterns of somatic mutation can allow us
to better uncover the underlying mechanisms of tumor development.
1.1.2 Mutational processes
Different biological processes produce different mutation patterns and leave those as
fingerprints on human genomes. These somatic mutations identified in cancer genomes
represent the accumulation of mutations over time from all processes. A biological process,
causing somatic mutations and following specific mechanisms, is defined as a mutational
process. There is growing interest in unraveling the mutational processes in cancer to
better predict patient outcomes and help treatment decision making. In Secrier et al’s
4
1.1 Concepts
recent work on investigating the mutational process in esophageal cancer, they speculated
that mutational signatures can aid in contributing to the development of potential tailored
therapies [12].
Characterization of a mutational process: its pattern and intensity
Mutational processes will vary by their impact on the mutation spectrum and duration
of mutation accumulation. In a published work of Helleday et al., they illustrated four
possible hypothetical mutational processes with varying patterns and intensities [13]. Two
mutational processes represent steady but low-intensity mutational processes, with varying
operating periods. They might be associated with different underlying mechanisms, for
instance, the aging process occurs throughout the lifetime whereas an acquired DNA
repair defect occurs along with the onset of a disease. (For example, deficient DNA
mismatch repair in hereditary non-polyposis colon cancer [14]). Alternatively, one type
of mutational processes might exist for a short time of period during which there are
high levels of exposure, such as occupational exposure to environmental hazards. Also, a
mutational process might have intermittent bursts along the timeline due to the recovery
from and relapse into a particular disease that causes accumulation of somatic mutations.
1.1.3 Mutational signatures and mutation features
Closer inspection of four hypothetical mutational processes, proposed by Helleday et al.,
revealed that the consequence of each mutational process might result in differences in
observed patterns of somatic mutations, aside from different levels of intensity found in
mutational processes [13]. For example, one mutational process might tend to result in
different mutated bases compared to another one, while a third one might tend to cause
the same mutated bases as those resulting from the first one, but at much higher frequency.
5
1.1 Concepts
Thus, the different combination of mutation patterns are called mutational signatures
[7], and these can be associated with specific underlying mechanisms. Given a particular
mutational signature, the complexity of mutation patterns are directly determined by the
number of mutation features, which include the substitution itself, the flanking bases on
either side of the substitution, and the transcription strand direction [7].
Mutation substitutions
Since any given base can mutate to any of the other three possible nucleotides, there are
a total of 12 possible base substitutions. However, due to the double-stranded structure
of DNA, we only distinguish six types, characterizing them in terms of how the C or
G base has changed [7]. For example, substitution C > A in the plus strand results in
the same observed change as the substitution G > T occurring in the minus strand as
illustrated in Figure 1.1. Thus each substitution is characterized by one of six possible
types (C > A, C > G, C > T, T > A, T > C and T > G).
Fig. 1.1 An illustration of non-differentiation of two different mutation substitution. The
plus strand is drawn in blue whereas the minus strand is in pink.
6
1.1 Concepts
Flanking bases
Thus far we have classified each substitution into one of six possible substitutions, but
we can extend this to include contextual information from the neighboring bases. If we
wish to include contextual information, we might also include the possible 3’ and 5’ bases
flanking the nucleotide substitution position (A, C, G, T). The most frequent approach,
therefore, is to model mutational signatures using 96 (= 6× 4× 4) possible mutation
patterns, thereby including the single flanking base on either side of the substitution.
Transcription strand
Some mutational signatures might have transcription strand biases, indicating that
substitutions occur differently on the two complementary strands, which can be captured
by the transcription strand direction. This feature has two possible directions: being
either plus or minus.
Fig.1.2Asanillustration, amutationalcatalog(left)ofthefinalcancergenomeisdisplayed
as a frequency distribution over six possible base substitutions. The corresponding
normalized mutational catalog, measured in fractions, is shown to the right.
7
1.2 Methods
1.1.4 Mutational catalogs
A human cancer genome is a record of the accumulated results from a variety of co-
activating mutational processes, which is defined as a mutational catalog [7]. As a further
illustration, Figure 1.2 shows a mutational catalog of some hypothetical mutational
processes that are characterized in terms of the six possible base substitutions. The
corresponding normalized mutational catalog is obtained by dividing the frequency of
each mutation type by the total number of mutations. When more mutation features are
considered, the complexity of mutational catalogs will increase to reflect these additional
features.
1.2 Methods
In the previous section, we introduced several important concepts and summarized three
critical ideas in characterizing mutational processes: 1) a mutational catalog represents
the accumulated results from a variety of mutational processes; 2) the consequence of a
mutational process is characterized by its mutational signature and associated intensity; 3)
a mutational signature consists of various mutation features. Therefore, to mathematically
specify the generative process of mutational catalogs, we need to know how to express
mutational catalogs, mutational signatures, their associated intensities, and mutation
features, respectively.
AfterAlexandrovetal. [5]firstproposedthesignatureframework, manycomputational
toolswerepublishedtobothestimatemutationalsignaturesandmutationalexposures. We
can group them into two categories based on different assumptions posed on constructing
mutational signatures. This section will be focused on introducing two modeling methods,
the saturated model and the independent model. Others have put them into two groups
8
1.2 Methods
based on whether the signatures are "de novo" (signatures extraction) or "refitting"
(known signatures) [15]. Here, we will only consider the former setting. Also, we will
provide a detailed description of the different mathematical specifications and implications
of those key concepts using two modeling methods, which are described in Section 1.2.1
and Section 1.2.2 respectively. A list of mathematical notations of the two models is
given in Table 1.1 and Table 1.2.
1.2.1 Modelingmutationalsignaturesunderthesaturated model
Notation Description
I Total number of mutational catalogs (indexed by i)
J
i
Number of observed mutations in ith mutational catalog (indexed
by j)
L Number of features to include. Here, we use the nucleotide substi-
tution, flanking bases and transcription strand (indexed by l)
M Vector of the maximum numbers of possible values, (M
1
,...,M
L
),
for each mutation feature, (indexed by M
l
), M
1
=6 for nucleotide
substitution, M
2
= 4 for flanking base, (A, C, G, T), M
L
= 2 for
transcription strand, (+,− )
m the total number of feature patterns,
Q
L
l=1
M
l
K Total number of mutational signatures (indexed by k)
Parameters exclusive to the saturated model
S a signature matrix with size m× K, each column is made of a
probability vector S
k
of mutational signature k, (s
k,1
,s
k,2
,...,s
k,96
),
with the sum equal to one.
G a mutational catalog matrix with size m× I, each column is made
of a vector G
i
, (g
i,1
,g
i,2
,...,g
i,m
)
T
, with mutations in catalog i with
P
m
i
g
i,m
=J
i
E a intensity matrix with size K× I, each column is made of a vector
E
i
, (e
i,1
,e
i,2
,...,e
i,K
)
T
Table 1.1 List of notation for the saturated model
9
1.2 Methods
Mutational signatures S in mathematical expression
Fig. 1.3 An illustration of a signature under the saturated model is presented with each
bar represents the probability of observing a particular mutation pattern made of a
substitution with two flanking bases.
Let us now consider the most frequent approach to modeling signatures (96 mutation
patterns) under the saturated model. For example, a signature is represented by plotting
its frequency distributions over 96 possible mutation patterns in Figure 1.3. The signature
spectrum was colored differently based on six possible substitution types. The histogram
bins within each substitution type correspond to the different possible 3’ and 5’ contexts
in which that substitution might occur. Therefore, the frequencies of mutation patterns
in the kth signature can be summarized by a vector, S
k
= (s
k,1
,s
k,2
,...,s
k,96
), of 96
probabilities (summing to one).
The lengths of signature vectors vary depending upon how many mutation features
are included. Thus it is convenient to define a feature vector M = (M
1
,...,M
L
) that
describes the number of possible states for each of the L mutation features in a signature.
Thus, the length of a signature vector S
k
is calculated as m=
Q
L
l=1
M
l
, the product of
all elements in the mutation feature vector M. For instance, M =(6,4,4) for a signature
with 96=6× 4× 4 mutation patterns.
Therefore, a signature S
k
with a feature vector V is denoted by a nonnegative vector,
(s
k,1
,s
k,2
,...,s
k,n
)
T
, with a length of n mutation patterns. If we have K signatures with
10
1.2 Methods
the same mutation features space. We express them as a signature matrix S with size
m× K.
S =

S
1
S
2
... S
K

=











s
1,1
s
2,1
... s
K,1
s
1,2
s
2,2
... s
K,2
... ...
.
.
. ...
s
1,m
s
2,m
... s
K,m











Mutational catalogs G and intensities E in mathematical expression
In a similar manner, a mutational catalog G
i
over m mutation patterns is expressed as a
vector with nonnegative integers, (g
i,1
,g
i,2
,...,g
i,m
)
T
, with a length of n where the sum
is equal to the number of observed mutations in the ith mutational catalog, J
i
. Hence,
sets of I mutational catalogs can be expressed as a catalog matrix G with size m× I.
G=

G
1
G
2
... G
I

=











g
1,1
g
2,1
... g
I,1
g
1,2
g
2,2
... g
I,2
... ...
.
.
. ...
g
1,m
g
2,m
... g
I,m











The cancer catalog of the ith cancer genome is a consequence of multiple muta-
tional signatures of different intensities. The intensities of the various signatures that
make up that catalog can be expressed as a nonnegative vector E
i
over K signatures,
(e
i,1
,e
i,2
,...,e
i,K
)
T
. Therefore, the intensities of the different signatures in sets of I cancer
11
1.2 Methods
genomes is expressed as an intensity matrix E with size K× I.
E =

E
1
E
2
... E
I

=











e
1,1
e
2,1
... e
I,1
e
1,2
e
2,2
... e
I,2
... ...
.
.
. ...
e
1,K
e
2,K
... e
I,K











There has been a study in which mutational intensities were fitted in a lasso logistic
regression model to predict BRCA1/BRCA2 defnicity.
Relationship between G, S and E
There is a simple relationship between the three matrices, G,S,E. The catalog matrix
G can be decomposed into the signature matrix S and the intensity matrixE described
below.
G=S× E











g
1,1
g
2,1
... g
I,1
g
1,2
g
2,2
... g
I,2
... ...
.
.
. ...
g
1,n
g
2,n
... g
I,n











=











s
1,1
s
2,1
... s
K,1
s
1,2
s
2,2
... s
K,2
... ...
.
.
. ...
s
1,n
s
2n
... s
K,n











× 










e
1,1
e
2,1
... e
I,1
e
1,2
e
2,2
... e
I,2
... ...
.
.
. ...
e
1,K
e
2,K
... e
I,K











To estimate S and E, the non-negative Matrix Factorization (NMF) method was
chosen because it ensures the elements of all estimated matrices, S andE, contain only
nonnegative values [16, 17]. Alexandrov et al. provided the source code of the framework
12
1.2 Methods
in Matlab at http://www.mathworks.com/matlabcentral/fileexchange/38724 [ 5], which
is less helpful for R users. Two R packages, SomaticSignatures and MutationalPatterns,
were later implemented using the NMF algorithm to provide an easy-to-use tool for
R users to integrate with the Bioconductor workflows [ 18, 19]. sigfit is an R package
that performs the signature estimation using NMF method and Bayesian inference [20].
Another R package published on Bioconductor, signeR, also enables users to both identify
mutational signatures and the number of signatures using an Bayesian implementation of
the NMF algorithm [21].
Disadvantages of modeling signatures under the saturated model
The saturated model includes a separate frequency for each possible combination of
mutation features. Thus, if we wish to include more flanking information, say including
thet nearest bases on each side, we would need a vector with a length of 6× 4
2t
to account
for the probabilistic distribution of 6× 4
2t
mutation patterns. For example, Figure 1.4
illustrates a common scenario in which four flanking bases are used (two on each side)
which results in the estimation of 1535= 6× 4
4
− 1 parameters. If we also wish to include
information regarding the transcription strand direction, this will double the number of
parameters. Essentially, with the increase in features, it could result in the estimation of
mutational signatures less stable [22].
1.2.2 Modelingmutationalsignaturesundertheindependent model
Under the saturated model, one can see that the number of signature-related parameters
can quickly become overwhelming to model. However, the number of signature-related
parameters can be reduced by assuming independence across mutation features, such as
substitution types, flanking bases, and the strand transcription direction. We refer to
13
1.2 Methods
Fig. 1.4 An illustration of some of all 1536 possible mutation patterns for signatures
consisting of one substitution and four flanking bases.
this model as the independent model. Under this subsection, the independent model will
be introduced and compared to the saturated model [6] in terms of their performance in
modeling mutational signatures in somatic mutations.
Understanding the assumption of independence
We assume each mutation belongs to only one of K distinct signatures, and that the
unique personal history of each individual leads to, for each sample, unique fractions
of mutations from the K signatures. The unique profile, the fractions of signatures for
the ith sample, can be expressed by a probability vector Q
i
= (q
i,1
,...,q
i,k
,...,q
i,K
)
T
where
P
k
q
i,k
= 1, and q
i,k
is the fraction of mutations from the kth signature found
in the ith tumor. In the same manner as the saturated model, a vector, Q
i
, represents
the composition of K mutational signatures in the ith mutational catalog whereas a
matrix,Q, represents the compositions of K mutational signatures in sets of I mutational
14
1.2 Methods
Notation Description
I Total number of mutational catalogs (indexed by i)
J
i
Number of observed mutations in ith mutational catalog (indexed
by j)
L Number of features to include. Here, we use the nucleotide substi-
tution, flanking bases and transcription strand (indexed by l)
M Vector of the maximum numbers of possible values, (M
1
,...,M
L
),
for each mutation feature, (indexed by M
l
), M
1
=6 for nucleotide
substitution, M
2
= 4 for flanking base, (A, C, G, T), M
L
= 2 for
transcription strand, (+,− )
m the total number of feature patterns,
Q
L
l=1
M
l
K Total number of mutational signatures (indexed by k)
Parameters exclusive to the independent model
Q a mutational exposure matrix with size m× I, each column is made
of a vector Q
i
, (q
i,1
,q
i,2
,...,q
i,k
)
T
, with
P
K
i
q
i,k
=1
X
i,j
Observed mutation characteristic vector, (x
i,j,1
,...,x
i,j,L
), for the
jth mutation from the ith mutational catalog (indexed by x
i,j,l
)
z
i,j
Index of the latent assignment for X
i,j
, z
i,j
∈{1,...,K}
q
i,k
Probability vector of signature k exposure in mutational catalog i,
(q
i,1
,...,q
i,K
), with
P
k
q
i,k
=1
f
k,l
Probability vector of observing any of M
l
elements for lth mutation
feature,f
k,l
=(f
k,l,1
,...,f
k,l,M
l
) with
P
m
l
f
k,l,m
l
=1
F
k
A tuple of probability vectors with length L, (f
k,1
,...,f
k,L
)
Table 1.2 List of notation under the independent model
catalogs.
Q=

Q
1
Q
2
... Q
I

=











q
1,1
q
2,1
... q
I,1
q
1,2
q
2,2
... q
I,2
... ...
.
.
. ...
q
1,K
q
2,K
... q
I,K











15
1.2 Methods
Let a vector X
i,j
= (x
i,j,1
,...,x
i,j,l
,...,x
i,j,L
) denote the jth mutation of the ith
mutational catalog, each element of which is the observed mutation type at the lth
mutation feature. Let z
i,j
denote the signature that generated the mutation X
i,j
, where
z
i,j
∈{1,...,K}. We can write,
f
k,l
(x
i,j,l
|z
i,j
)=Multinomial(f
z
i,j
,l
)
f(z
i,j
|k)=Multinomial(q
i,1
,...,q
i,k
,...,q
i,K
),
X
k
q
i,k
=1,
where f
k,l
= (f
k,l,1
,...,f
k,l,M
l
) is a vector representing the probabilities that the lth
feature came from the signature k (k = 1,...,K), and q
i,k
is the fraction of mutations
from the signature k found in the ith tumor. For l = 1, f
k,1
= (f
k,1,1
,...,f
k,1,6
) with
P
6
m=1
f
k,1,m
=1 for the six substitution patterns. When the feature is a flanking base,
f
k,l
= (f
k,l,1
,...,f
k,l,4
),
P
4
m=1
f
k,l,m
= 1 to indicate the chance of observing one of four
bases and when it is transcription strand, f
k,l
= (f
k,l,1
,f
k,l,2
), f
k,l,2
= 1− f
k,l,1
for the
chance of observing the plus or minus strand. Under the assumption of independence,
the probability of observing a mutation pattern X
i,j
of length L in the signature k is
calculated as the product of the mutation feature probabilities.
A common setting, and one we consider later, is to model six mutation features: the
nucleotide substitution, two 3’ bases, two 5’ bases, and the transcription strand. Under
the independent model, the probability of the mutation pattern is the product of the
probability of each of the six mutation features. For example, suppose we observe a C >
A mutation in the GpCpCpTpTp context on the minus strand. We use the distributions
from Figure 1.5 and Table 1.3 to calculate the probability of observing this particular
mutation pattern as 3.31× 10
− 5
=0.033× 0.139× 0.119× 0.246× 0.339× 0.728.
16
1.2 Methods
Table 1.3 An example of signatures under the independent model.
Nucleotide substitution
C >A C>G C>T T>A T>C T>G
0.033 0.000 0.106 0.062 0.799 0.000
Flanking bases
Position A C G T
-2 0.340 0.143 0.139 0.378
-1 0.567 0.119 0.166 0.148
+1 0.451 0.048 0.255 0.246
+2 0.2211 0.198 0.252 0.339
Transcription strand
Plus Minus
0.272 0.728
Fig. 1.5 An illustration of the independent model of characterizing the mutational
signatures, which contains six mutational features.
Comparing the visualization of mutational signatures between two models
We show an example of a signature using the independent model (Figure 1.5), for which
the corresponding multinomial distributions are listed in Table 1.3.
To get a sense of how the number of parameters can be reduced by assuming inde-
pendence, consider the example in Figure 1.5. The signature shown in Figure 1.6, that
includes two 3’ flanking bases, two 5’ flanking bases, and transcription strand, results in
17
1.2 Methods
Fig. 1.6 An illustration of the independent model of characterizing the mutational
signatures, which contains six mutational features.
a total of 3,072 (=6× 4
4
× 2) possible mutation patterns. In this setting, the saturated
model requires 3,071 parameters and is overspecified for any data sets that does not
include all possible mutation patterns. However, under the independent model, the
6-feature signature only requires 18=5+3× 4+1 parameters.
Figure 1.5 depicts a mutational signature under the assumption of independence.
Each mutation feature is shown by a rectangle, in which that area of each color
represents the expected proportion of that category and the heights are scaled by
1+0.5× log
P
m=1,2,3,4
f
2
k,l,m
[23], where f
k,l,m
is the probability of observing the specific
base m at the lth feature in the kth signature (m=1,2,3,4 for A, C, G, T). Therefore,
the height is positively associated with the sum of square probabilities over four bases,
which is minimized when f
k,l,m
=0.25 when m=1,2,3,4. In other words, the rectangle
is the shortest when all four bases are equally likely to appear. The more informative the
probabilistic distribution is, the taller the rectangular box is. For example, the box at
position -1 is slightly higher than that at position -2, the reason for this is the higher
frequency of observing a base of A (0.567 > 0.340) [6].
18
1.2 Methods
Conversion of signatures between two models
It is worth mentioning that an independent signature can be converted into a saturated
signature, but the other way around is only possible when the saturated signature satisfies
the assumptionof independence. For example, the signaturein Figure 1.5can beconverted
with the use of six feature vectors shown in Table 1.3, by multiplying together the relevant
feature probabilities. It is the same for remaining mutation patterns. Following that, a
probability vector P with a length of n, used in the saturated model, can be constructed
from a signature, under the independent model, with a mutation feature M with a length
of L. A recently published R package on Bioconductor, decompTumor2Sig, [24] allows
users to decompose an individual tumor genome into a given set of either saturated
signatures or independent signatures. In addition, it has provided functions to attain the
conversion from the independent signature to the saturated signature.
Intensities in mathematical expression
The intensity matrix E from the saturated model can be converted into the proportion
matrix Q by dividing elements in the ith column by the column-wise sum e
i· =
P
k
e
i,k
.
Q=













e
1,1
/e
1· e
2,1
/e
2· ... e
I,1
/e
I· e
1,2
/e
1· e
2,2
/e
2· ... e
I,2
/e
I· ... ...
.
.
. ...
e
1,K
/e
1· e
2,K
/e
2· ... e
I,K
/e
I· 












19
1.2 Methods
The mean proportions of K mutational signatures among I mutational catalogs can
be expressed by a vector, ¯q, which will be used later in chapter 3.
¯q =














(q
1,1
+q
2,1
+··· +q
I,1
)/I
(q
1,2
+q
2,2
+··· +q
I,2
)/I
...
(q
1,K
+q
2,K
+··· +q
I,K
)/I














=

¯q
·1
¯q
·2
... ¯q
·K

T
The likelihood under the independent model
Now, the likelihood can be written as a latent allocation or mixed membership model,
where the likelihood of the observed mutations X
i,j
is conditioned on the unobserved
signature of origin. As in the statistical approach described in [6], we model the features
under the assumption of independence, such that the complete data likelihood is shown
below and in which the indicator variable denoting a mutation’s signature of origin,
I[z
i,j
=k], is unobserved.
L(x,z)=
I
Y
i=1
J
i
Y
j=1
K
Y
k=1

L
Y
l=1
f
k,l
(x
i,j,l
|z
i,j
)q
i,k
!
I[z
i,j
=k]
We estimate the proportions Q and signature parametersf
k
(k =1,...,K) by using
the expectation-maximization (EM) algorithm to maximize the likelihood [6, 25].
Comparison of modeling signatures between two models
In the previous example in Figure 1.5, we saw the huge reduction in numbers of parameters
obtained by using an independence assumption across mutation features. The independent
20
1.3 Objectives
model overlooks the interactions between mutation features. However, key features of
signatures and their marginal effects, such as having high frequencies for certain mutation
patterns, can still be captured by this new model without losing too much information by
not modeling the interaction across mutation features. The independent model discovers
more signatures (27 v.s. 20) than the saturated model when applied to the same datasets
used in Alexandrov et al.’s analysis, which might suggest that the independent model is
a more sensitive tool in detection of novel mutation signatures [6, 7]
Despite the major difference between the saturated model and the independent model
in terms of the assumption of independence for mutational signatures, there is another
close relationship between these two frameworks. There has been a paper published
about the equivalence between NMF and LDA [26] in which LDA is equivalent to L1-
normalization NMF. In that case, they are essentially different algorithms, leading to
converging to different local minima, to optimize the same objective function.
In summary, a standardized description of mutational processes is rooted in charac-
terizing the observed data as a unique combination of mutation patterns and intensity,
which are measured as mutational signatures and their associated mutation exposures
measured in proportions. Both models provide insights into modeling these parameters,
however, one needs to weigh the advantages and disadvantages of a particular model as
this may influence the interpretation and generalizability of results.
1.3 Objectives
Regardless of which framework is used to analyze the dataset, determining the number
of mutational signatures to use is a challenging problem. Alexandrov et al. utilized two
criteria, both signature stability and reconstruction error, to select the best number of
21
1.3 Objectives
signatures for a given collection of mutational catalogs [5]. Shiraishi et al. proposed
to base the decision on three metrics, including the log-likelihood, bootstrap error, and
correlations between the estimated mutational exposures, but there are no clear guidelines
for exactly how to pick the number of signatures. Therefore, in Chapter 2, we propose
a novel method for determining the number of signatures and comparing it to other
traditional metrics in simulation studies. The selected number of signature can then be
treated as fixed for modeling somatic mutations.
Once mutational signatures and exposures are estimated, previous studies interested
in comparing mutational exposure estimates between different groups of tumor catalogs
conducted a post hoc analysis [27–34, 4]. The analysis proceeded in two stages. First, they
performed one of the several different approaches for mathematically extracting the latent
mutational signatures and their exposures from the mutational catalogs (see [35] for a
reviewofsuchmethods). Later, theyconductedanindependenttestofassociationbetween
the point estimates of the mutational exposures and external covariates. Examples of
covariates included cancer subtype, or patient history of alcohol or tobacco use. However,
variation of the exposure estimates is affected by two factors, the number of mutations
in the tumor and the variation in exposure frequency in the patient population. The
former, the number of mutations in the tumor, affects the accuracy of the exposure
estimates. The application of the Wilcoxon rank-sum test to the exposure estimates does
not take into consideration their uncertainty, which can lead to loss of efficiency and
test power. In Chapter 3, we address this by introducing a unified parametric model for
testing variation of mutational exposures between groups of mutational catalogs, where
the exposure frequencies are modeled using a Dirichlet distribution.
Researchersareawareoftheadvantagesanddisadvantagesofthetwocommonmethods,
so they often applied both methods to explore the mutational signatures embedded in
22
1.4 Summary
their dataset [36]. However, this step usually creates a challenge regarding how to link
and compare signatures estimated from two different methods. The question is often
formulated as identifying signature in one method that is most similar to a signature in
the other. In Sahasrabudhe et al’s work on gastric cancer [36], they used both Frobenius
and cosine similarity measures to compare the similarity. There are several web-interfaces
that enable users to associate signatures using cosine similarity but they always assume
that the comparison is to be conducted within the same framework, and not across
methods, between "de novo" and known signatures, i.e., matlof and MutationalPatterns
[37, 19]. In Chapter 4, we introduce the first web-interface application to enable users to
associate signatures across frameworks.
In summary, our aims are:
1. To select the optimal number of mutational signatures for estimating mutational
signatures and their associated proportions.
2. To develop a hierarchical model to allow for inferring mutation exposures in groups
of samples and for testing for differences between groups.
3. To implement a web application to facilitate the understanding and conversion of
mutational signatures between the saturated model and the independent model.
1.4 Summary
In this chapter, we introduced the background and vital concepts that allow us to
understand the importance of studying mutational processes, and then characterize
them with and without using the assumption of independence. To characterize these
23
1.4 Summary
mutational processes, we provide the technical details of the two most popular methods.
Those methods mathematically express and estimate parameters related to the operating
signatures of mutational processes, and their associated proportions in samples, from
sets of mutational catalogs. The advantages and disadvantages of those two models
were evaluated and summarized to allow us to decide which model to be employed. The
main three objectives of this proposal were briefly summarized to establish a theoretical
framework for conducting statistical inference on mutation exposuresQ, as well as serving
as a basis for assessments of reliable estimation of mutational signatures, S.
24
Chapter 2
Selecting the number of mutational
signatures using a perplexity-based
measure and cross-validation
2.1 Introduction
An important decision for modeling signatures is the choice of the number of mutational
signatures, K, reflected in samples. This needs to be determined before performing any
analysis to estimate parameters. The primary focus of this chapter is to develop an
efficient approach for determining the optimal number of mutational signatures from
cancer genome samples. The approach will be evaluated and compared to other prevailing
methods with simulated data.
25
2.1 Introduction
2.1.1 Background
The somatic mutations found in each human genome result from multiple mutational
processes that occur throughout a lifetime. These processes themselves correspond to
phenomena such as aging and exposures such as ultraviolet radiation and smoking. To
capture the pattern of resulting mutations, and the contexts in which those mutations
occur, a novel way of modeling the mutational process, “mutational signatures", has been
proposed and many research groups now work on this matter [7, 38–40]. Currently, there
are two different frameworks used to characterize and visualize mutational signatures: the
first one [ 41] was proposed by Alexandrov et al., who used a vector of 96 probabilities to
capture the possible composition of the four nucleotide substitutions (C>A, C>T, C>G,
T>A, T>C, T>G) and the neighboring base immediately on each of the 5
′
and 3
′
side of
the mutated base [7]; later, a mixed-membership model was proposed by Shiraishi et al
[6]. By assuming independence across bases, this latter model substantially reduces the
number of parameters needed to characterize a signature [6]. Specifically, the number of
parameters reduces from 6× 4× 4− 1= 95 to (6-1)+(4-1)+(4-1) = 11 [6]. The reduction
in the number of parameters is greater if more flanking bases are included. In this paper,
we will refer to these two models as Alexandrov et al.’s model and Shiraishi et al.’s model.
Regardless of which framework is used to analyze the dataset, determining the number
of mutational signatures to use is a challenging problem. Alexandrov et al. utilized two
criteria, both signature stability and reconstruction error to select the best number of
signatures to use to explain the observed data. In a very similar manner, Shiraishi et al.
proposed to base the decision on three metrics, including the log-likelihood, bootstrap
error, and correlations between the estimated mutational exposures, but there are no
clear guidelines for exactly how to pick the number of signatures.
26
2.2 Methods
It is common for researchers to use the same number of signatures when fitting both
models. For example, Secrier et al. [12] determined six mutational signatures as the
optimal number when using Alexandrov et al’s model. They then proceeded to also
extract six mutational signatures when using Shiraishi et al’s method. However, there is
no one-to-one correspondence between fitted signatures in the two models (because the
model of Alexandrov et al. is a more general model than that of Shiraishi et al.) As a
partial result of this, perhaps, there was only one Polε signature, signature 10, discovered
by Alexandrov et al. whereas Shiraishi et al. found two Polε signatures [6, 5]. Because of
the differences between the two models, it is not obvious that applying the same number
of mutational signatures when using the two methods is the correct decision.
In this paper, we provide a comparison of a single-value-based method and other
traditional metrics in order to provide guidelines for users when determining the number
of mutational signatures. Specifically, the performance of our algorithm, based on
perplexity [42], is compared to the Akaike Information Criterion (AIC), the corrected
Akaike Information Criterion (AICc) and the Bayesian Information Criterion (BIC). In
the following, we start by briefly introducing all four measures followed by a description
of the simulation study we used to evaluating them.
2.2 Methods
In this study, a total of five different simulation scenarios were used to evaluate the
performances of a total of four. Across these scenarios, we vary the number of catalogs,
the number of mutations per catalog, and the maximum pairwise cosine similarity among
signatures. We use the term mutational catalog to denote a collection of mutations at the
sample level. We assessed the performances of different approaches over 500 replicates
27
2.2 Methods
of randomly generated datasets. A better performance is indicated by a higher rate of
successful prediction of the number of signatures used to generate the data.
We compare a newly proposed perplexity-based method, repeated 3-fold cross-
validation perplexity (R3PY), to three traditional measures - AIC, AICc, and BIC
[43, 44]. Details of how the dataset is simulated, and of the methods we use to determine
the number of signatures, are given below. We start by describing the most important
component of all these measures, the log-likelihood.
Method specification - the log-likelihood
Theprobabilisticmodelproposedby[6]isasuccessfulapplicationofthemixedmembership
model in genomic data. This method imposes an assumption of independence across
mutational features (substitution, flanking bases and transcriptional strand). It assumes
that each of j
i
mutations in mutational catalog i is probabilistically generated by only
one of K mutational signatures. The probability of any given mutation coming from any
given signature is given by a probability vector [q
i1
,q
i2
,...,q
iK
]. We refer the fraction q
ik
as the mutational exposure of signature k. We denote this latent variable for signature
assignment by z
i,j
. A signature k is characterized by L features, each modeled by a
multinomial vector, f
k,l
. For a given observed mutational pattern, x
i,j,l
, we can index
the corresponding multinomial vector, f
k,l
, to extract the probability of observing that
pattern, which is denoted as f
k,l,x
i,j,l
. Therefore, we can write the log-likelihood (LL)
of a given collection of mutational catalogs X with K mutational signatures as in the
equation below.
28
2.2 Methods
LL=
I
X
i=1
J
X
j=1
K
X
k=1
I(z
i,j
=k)(
L
X
l=1
log(f
k,l,x
i,j,l
+log(q
i,k
)))
Each of the methods we compared our approach to depend upon the estimated
log-likelihood, LL, in addition to a penalization term that is some function of p, the
number of parameters, and n, the total number of somatic mutations, n =
P
i
J
i
(See
Table 2.1). All three measures share a common p, which is defined as the number of free
parameters in the fitted mixed membership model. p is equal to the sum ofK(5+3× 4+1)
and I(K− 1), which corresponds to the free parameters in mutational signatures and
mutational exposures, respectively. Thus, p increases with the number of mutational
signatures fitted, K, and the number of mutational catalogs, I. Since the optimal value
of K is defined to be the one for which these measures are minimized, clearly the K
value that is selected will alter according to the exact formulation of the penalization
term. Among these three measures, AIC is the only measure that is free of the number
of mutations, n. AICc has been designed to perform well for small-sample problems [45].
As sample size n gets larger, the behavior of AIC and AICc will converge. BIC penalizes
model complexity more heavily compared to both AIC and AICc. Since perplexity is a
less well-known concept compared to the others, we now introduce it in detail.
Table 2.1 List of measures
Measure Abbreviation Formula
Repeated 3-fold Cross-Validation Perplexity R3PY
1
9
P
3
r=1
P
3
m=1
PY
m
Akaike Information Criterion AIC − 2LL+2p
Corrected Akaike Information Criterion AICc − 2LL+2p+
2p(p+1)
n− p− 1
Bayesian Information Criterion BIC − 2LL+plog(n)
29
2.2 Methods
Perplexity measurement
We propose a new method based on perplexity, which has fundamental differences
compared to the former three traditional measures. AIC, AICc, and BIC are all calculated
based on the entire data while our proposed method measures how well the model predicts
the likelihood of unseen data. It does this by computing the perplexity. First, we divide
the data into two pieces: training data and hold-out data. We fit the model using the
training data and then take the estimated parameter values and compute the likelihood of
the hold-out data. This measure has been widely used to assess the predictive performance
of topic modeling on holdout samples, including mixed membership models [42]. The
lower the perplexity, the better the model fits the hold-out data.
In detail, we pool all mutations in the data prior to splitting them into three folds by
random sampling without replacement. We take two of those thirds (chosen at random)
and treat them as the training data; the remaining third is used as the hold-out, or test,
data. We denote the number of mutations in these two components by n
training
and
n
test
respectively. Using the training data, we estimate parameters using the R package
pmsignature [6], and denote these estimated parameters by ˆ q
i,k
and
ˆ
f
k,l
. We then use
these estimates to calculate the likelihood of observing the test sample. The perplexity
(PY) of the hold-out sample is defined as follows:
PY =exp
h
− Q
i,j∈ntest
i,j
Q
L
l
P(x
i,j,l
|ˆ q
i,k
,
ˆ
f
k,l
)
n
test
i
=exp(− LL
test
n
test
)
The procedure described above is how the perplexity is estimated on a single fold.
We reduce noise by repeating this process on each fold and calculate the average 3-fold
cross-validation perplexity. In order to further reduce noise, we repeat this entire process
three times and return the average value of the 3-fold cross-validation perplexity over
30
2.2 Methods
each of the three iterations. The optimal K-value is chosen to be the one that minimizes
this average perplexity value. (We considered K-values in the range K = 2,...,10).
The procedure is implemented in R version 3.6.0 with the source code available at
https://github.com/USCbiostats/selectKSigs. Note that while we chose to use three-fold
cross-validation, because it was observed to work well across all scenarios, users are free
to vary this number if desired when using this software. In fact, perplexity is not limited
to Shiraishi et al’s model but applicable to any likelihood-based model.
2.2.1 Method evaluation
We evaluated and compared the performances of each measure by applying it to 500
replicates of datasets simulated under five different scenarios. The datasets are simulated
using pmsignature with four mutational signatures, which is a number typical of the
number of signatures observed in any given cancer type. More specifically, we use the
four signatures estimated from esophageal cancer (EAC) published by [6]. We chose
to simulate data based on signatures observed in EAC because there has been some
discussion regarding the correct number of signatures to use for that cancer type [12].
Table 2.2 Observed pair-wise cosine similarity for the four EAC mutational signatures
1st 2nd 3rd 4th
1st 1.000 0.134 0.009 0.144
2nd - 1.000 0.098 0.271
3rd - - 1.000 0.027
4th - - - 1.000
The pair-wise cosine similarities among the four signatures estimated by Shiraishi et
al. range from 0.009 to 0.271 (see Table 2.2). We also explored whether using randomly-
generated signatures changes the performances of the methods, whose feature parameters
31
2.2 Methods
are generated by Dirichlet distribution. [6]. We assume that good performance is reflected
by the number of fitted signatures being equal to the number of signatures used to
generate the simulated data. Thus, since we used K = 4, if a measure returns a value
that is equal to four, we define it as a success for that data. Otherwise, we categorize it
as an underestimation ( predicted K is less than 4) or an overestimation (predicted K is
greater than 4). The five simulated scenarios were as follows:
I. A set of 10 mutational catalogs were simulated using the four EAC signatures. We
repeated this for each of the following choices for total number of mutations: 100,
200, 300, 400, 500.
II. Same as scenario I, but replacing the four EAC signatures with four randomly
generated signatures. We refer to these signatures as random. Recall, each signature
is characterized by L features, denoted by the vector f
k,l
. In this scenario, this
vector is sampled from a Dirichlet(1, ..., 1) distribution.
III. A set of mutational catalogs, each with 150 mutations, were simulated using the
four EAC signatures. We varied the total number of catalogs to be: 10, 12, 14, 16,
or 20.
IV. Same as scenario III, but replacing the four EAC signatures with four randomly
generated signatures.
V. We simulated a set of 10 catalogs, using four EAC-like signatures, varying the level
of cosine similarity between the mutational signatures. Specifically, we characterized
each scenario in terms of the maximum cosine similarity observed between any two
signatures. In order to obtain a wide range of maximum cosine similarity, i.e., from
0.30 to 0.83, we replaced the 4th EAC signature by an altered version of the 1st
32
2.2 Methods
signature. Every feature vector in the new signature is re-generated through the
following process: 1) create a vector by randomly drawing eight mutations from
feature vector f
1,l
; 2) normalize this vector such that it sums to 1 and can then be
treated as a probability vector which will be used as a new signature. The degree
of cosine similarity with signature 1 will vary from simulation to simulation, so
when presenting results for this scenario we group them according to the actualized
similarity.
We explored sampling different numbers of somatic mutations in determining
the 4th mutational signature and found that eight mutations covered the largest
range in maximum pairwise cosine similarity among the four mutational signatures
yielding values both above and below 0.5 (range 0.3,0.83). Sampling more than
eight mutations resulted in maximum cosine similarities > 0.5 and sampling fewer
resulted in maximum cosine similarities < 0.5. We created 100 different mutational
signature 4s, which were categorized into four intervals for visualization ((0.3, 0.5]:
20/100; (0.5, 0.6]: 34/100; (0.6, 0.7]: 26/100; (0.7, 1.0]: 20/100). For each of the 100
sets of four mutational signatures (three EAC signatures and the newly synthesized
signature), we simulated a set of 10 mutational catalogs (500 replicates). Each
scenario was evaluated using 150, 200, or 300 mutations per catalog.
For each of the scenarios above, mutational exposures were simulated using a Dirichlet(1,
1, 1, 1) distribution. This results in all mutational signatures having the same expected
mean frequency (mean∼ 0.25). As an indication of the likely variation around that mean,
we note that the expected 95% confidence interval is [0.04, 0.46] if there are 10 catalogs,
getting narrower as the number of catalogs increases.
Note that the chosen minimum values for both number of mutations and number
of catalogs in our study are relatively low. This reflects the lowest values that were
33
2.3 Results
determined to allow high signature reproducibility (cosine similarity for observed and
estimated signatures >95%) based on the simulation results in Shiraishi et al. [6].
2.3 Results
The results from these five scenarios are shown in Figure 2.1, 2.2, and 2.3. In each figure,
the relative proportions of overestimation (red), success (ivory), and underestimation
(blue) are shown as a function of the varying data characteristics. The wider the success
(ivory) bandwidth is, the better the method performs.
Number of mutations, J
i
Figure 2.1 shows that our perplexity-based approach performs well over the entire range
of J
i
values. We note that when the number of mutations is very small (between 100 and
150), AICc performs well also, and that it out-performs AIC across all J
i
values. However,
as J
i
increases, AICc’s performance starts to deteriorate, while the performance of both
R3PY and BIC increase. Interestingly, AIC is insensitive to the change in number of
mutations per catalog, which is likely explained by the lack of terms involving J
i
or n in
the penalization term. When signatures were used the general trend remains the same
except for BIC, which performs better than when applied to the datasets using EAC
signatures. Since the anonymous signatures represent a more general case, in terms of
signature identity, we suggest that R3PY is the more reliable method unless the number
of mutations is very small. We repeated the same simulation study, but this time based
on colorectal cancer signatures (using five signatures), and observed similar patterns
(results not shown).
34
2.3 Results
Fig. 2.1 Comparison of percentages of underestimation, success, and overestimation, when
determining the number of signatures present, as a function of method used and number
of mutations per catalog. We compare four methods using simulated data sets in which
the number of somatic mutations per catalog varies (500 replicates for each catalog size).
The number of mutational catalogs is 10 in all datasets. In the top row we show results
from data simulated using four EAC signatures, whereas in the bottom row we base
our data upon randomly generated signatures (see text). Exposures are distributed as
Dirichlet(1, 1, 1, 1).
2.3.1 Number of catalogs
In Figure 2.2, in the EAC results, we observe consistent improvement in performances
of all measures as the number of catalogs, I, increases. The most dramatic changes are
observed for AIC and BIC, where performance improves dramatically as I increases.
Once again, R3PY demonstrates high performance across the entire range, with AICc
performing well, but marginally less well than R3PY. When anonymous signatures were
35
2.3 Results
Fig. 2.2 Comparison of percentages of underestimation, success, and overestimation when
determining the number of signatures present as a function of method and number of
mutational catalogs. Here we vary the number of catalogs, again with 500 simulated
replicates for each case. The number of mutations per catalog is 150 in all datasets. In the
top row we show results from data simulated using four EAC signatures, whereas in the
bottom row we base our data upon randomly generated signatures (see text). Exposures
are distributed as Dirichlet(1, 1, 1, 1).
used, very similar results are seen for AIC, AICc and R3PY, but the performance of BIC
improves dramatically.
2.3.2 Maximum pairwise cosine similarity
Each boxplot in Figure 2.3 represents the distribution of proportions of success for a
specific combination of the number of mutations and the maximum pairwise cosine
similarity, where we group results according to maximal cosine similarity. As might be
expected as the maximum similarity between two signatures increases the performance
36
2.3 Results
of all methods degrades. This is most apparent for BIC and R3PY where the frequency
of inferring the correct number of signatures decreases rapidly as the maximum cosine
similarity increases. However, the greater the number of mutations per catalog, the better
these methods tolerate this increase in similarity. Once again, AICc perforns relatively
well for small catalog sizes (top row). Interestingly, increase in signature similarity has
little to no effect on the performance of AICc or AIC for higher value of J. For the range
of scenarios we consider. we observe that AICc out-performs AIC. However, as noted
earlier, as sample size increases, AICc will converge to behave similarly to AIC. When
the number of mutations is lower (∼ 150) and the similarity is medium/low (< 0.6),
AICc outperforms R3PY, an advantage that seems to disappear for∼ 200 mutations or
higher. Outside this case, R3PY appears to have the best performance. Of course, when
similarity between two signatures is high, we should expect all methods to struggle to
disentangle them, as is seen here. (Most of approaches tend to identify three rather than
four signatures in this case, by concluding that the two similar signatures are a single
signature.)
37
2.4 Discussion
Fig. 2.3 Comparison of how the percentage of success when estimating the true number
of signatures, obtained by each of the four methods, varies both as a function of the
number of mutations in the data sets (n = 150, 200, 300) and the maximum pairwise
cosine similarity value among three EAC signatures and one synthesized signature in 500
replicates. The data sets were simulated with 10 catalogs. Exposures are distributed as
Dirichlet(1, 1, 1, 1).
2.4 Discussion
Once a mutational signature is discovered, researchers often investigate the underlying
mutationalprocessesandhowtheymightbeassociatedwithexternalorintrinsicexposures.
Signaturevalidationcanbeverydifficultandsubjective, therefore, determiningthenumber
38
2.4 Discussion
of mutational signatures is a vital step before conducting any followup research. Through
comprehensive simulation studies, we have uncovered general trends of performances for
each of the four methods considered here. We hope this provides guidance on choosing
which measure to use in future studies
For most of the scenarios, the perplexity-based measure RP3Y provides the most
reliable approach to determining the number of mutational signatures. If the number of
mutations per catalog is very low (< 200), AICc provides slightly better performance.
Thus, its advantages are more likely to be appreciated for analyses of whole-exome
mutational catalogs that tend to have fewer mutations, particularly when analyzed as
part of a small numbers of mutational catalogs. A recent study of 4,645 whole-genomes
reported more than 95% of mutational catalogs with than 200 mutations each [8]. Once
the number of mutations is greater than 200, the simulation results favored using RP3Y.
Among all the scenarios, AIC has the worst performance since it frequently overesti-
mates K, which might suggest under penalization. AICc, the corrected AIC, performs
quite well for samples with lower number of mutations, presumably due to the additional
correction term in the penalization. In any case, in all cases AICc out-performs standard
AIC. On the other hand, BIC appears to over-penalize in this context, and frequently re-
sults in underestimated K-value. However, this over-penalization weakens as n increased.
Moreover, the performances of all methods improves when analyzing larger numbers of
mutational catalogs, as would be expected.
As a single-value-based approach, R3PY is more practical than the method proposed
by Shiraishi et al., which does not give a selection rule and involves handpicking the
trade-offs among three separate measures [ 6]. At the same time, using cross-validation and
repeated sampling can greatly overcome the weaknesses from three traditional approaches
by substantially reducing either overestimation or underestimation of the number of
39
2.5 Conclusions
mutational signatures [46]. Choosing the correct number of mutational signatures is
important for potential future application of conclusions and development of personalized
cancer interventions and prevention strategies [35]. Our work essentially lays a good
foundation for future research projects that compare mutational exposures [38] since
the number of signatures is often treated as a priori information. With our proposed
perplexity-based method, the selection of the correct number of mutational signatures
becomes more likely than with existing methods.
2.5 Conclusions
In this paper, we proposed a new approach, based on perplexity and cross-validation,
to estimating the number of mutational signatures. We compared its performance to
competing methods using a simulation study including five different scenarios to get a
broad sense of behavior. Our conclusion is that our new approach, R3PY, is the most
reliable method when considered across all scenarios, while, of the traditional methods,
AICc performs best, particularly for very small catalog sizes.
40
Chapter 3
Inferring the difference in mutational
signatures via HiLDA
3.1 Introduction
The latent Dirichlet allocation model, described in the subsection 1.2.2, estimates each
unobserved signature f
k
and the fractions this signature contributes to a mutational
catalog q
i,k
for each sample (k = 1,··· ,K). In the last chapter, we described how to
select the optimal number of signatures, K, which lays the foundation for further inference
on the mean mutation burdens µ g
= [µ g
1
,µ g
2
,...,µ g
K
]
T
, i.e. the average fractions of the
signatures across sets of I mutational catalogs of the gth sample type. For homogeneous
samples, we might not be very interested in testing the equality of ¯µ within one sample
type. However, in many contexts, samples are naturally labeled as two or more different
sample types, and here we may raise the natural question as to whether µ 1
is equal to µ 2
between two groups of samples. However, the current methods (saturated or independent
models) provide no formal tests for µ 1
=µ 2
between two sample types.
41
3.1 Introduction
Fig. 3.1 A instance of of the comparison between the two-step method and hierarchi-
cal latent Dirichlet allocation, in which pmsignature is used to estimate mutational
signatures and their associated fractions.
Given this lack, one may conduct a post hoc analysis on estimated ¯µ ′
s when testing
the difference in the average of signature burdens between groups become necessary.
For example, Tsukamoto et al. tested µ 1
= µ 2
in the estimated compositions of three
signatures between high- and standard-risk groups in follicular lymphoma to distinguish
their clinical features and improve the treatment outcome [47]. However, such a two-step
approach fails to account for the uncertainty in point estimates of signature burdens,
ˆ
¯µ ′
s.
It is evident that
ˆ
¯µ ′
s are derived from samples with a wide range of somatic mutations
per sample. Furthermore, the number of somatic mutations influences the accuracy of
point estimation, meaning that standard errors of estimates will typically vary across
samples.
We propose a unifying model to overcome the failure to allow for the uncertainty in
estimation in the mean signature fractions ¯µ ’s as shown in Figure 3.1. Specifically, we use
a hierarchical latent Dirichlet allocation model (HiLDA) to address this limitation. In
42
3.2 Methods
the simplest case, HiLDA can provide a global test of the difference between ¯µ 1
and ¯µ 2
.
When implemented with more parameters, HiLDA can provide individual testing results
on each signature, ¯µ 1
k
= ¯µ 2
k
for k =2,...,K, rather than a global test of ¯µ 1
= ¯µ 2
.
3.2 Methods
3.2.1 Hierarchical Bayesian Mixture Model
Here, we describe the model formulation for a hierarchical latent Dirichlet allocation
model (HiLDA) using the following notation (Please see Table 1.2 and Section 1.2.2).
Let i index the mutational catalog and j the mutation. The nucleotide substitutions
are reduced to six possible types (C>A, C>T, C>G, T>A, T>C, T>G) to eliminate
redundancy introduced by the complementary strands. Each observed mutation is
characterized by a vector, X
i,j
describing the nucleotide substitution (e.g. C>T) and
a set of genomic features in the neighborhood. Example features include the base(s) 3
′
and 5
′
of the nucleotide substitution (C, G, A, T), and the transcription strand (+, − ).
Each observed feature characteristic, x
i,j,l
for mutation feature l, takes values in the set
{1,2,...,M
l
} (where M
l
=6 for the nucleotide substitution, or 4 for a flanking base, and
2 for the transcription strand).
WeassumeeachmutationbelongstooneofK distinctsignatures. Aspecificmutational
signature k is defined by an l-tuple of probability vectors, F
k
, denoting the relative
frequencies of the M
l
discrete values for the l features, i.e., a vector f
k,l
for the M
l
values corresponding to feature l. We let z
i,j
denote the unique latent assignment of
mutationX
i,j
to a particular signature. Then, given the signature to which a mutation
belongs, the probability of observing a mutational pattern is calculated as the product
of the mutation feature probabilities for that signature. Thus, for signature k we write
43
3.2 Methods
Pr(X
i,j
|z
i,j
)=
Q
l
f
k,l
(x
i,j,l
|z
i,j
). This assumes independent contributions of each feature
to the signature. To model each multinomial distribution off
k,l
, we use a non-informative
Dirichlet prior distribution with all concentration parameters equal to one.
The unique personal exposure history of each individual leads to them having a
particular (latent) vector, q
i
, indicating the resulting contribution of each of the K
signatures to that individual’s mutational catalog. Theseqs are modeled using a Dirichlet
distribution with concentration parametersα , i.e.,q
i
∼ Dir(α ). Extending this model to
the two-group setting, we allow the Dirichlet parameters to depend on group, Dir(α (g
i
)
),
with g
i
indexing the group corresponding to the ith catalog (g
i
= 1 or 2). The mean
mutational exposures, E(q
i
), denoted byµ (g
i
)
, are represented by using the concentration
parameters, i.e.,µ (g
i
)
=α (g
i
)
/
P
α (g
i
)
.
With this extension, we can infer differences in mutational processes between groups
of catalogs by testing whether the mean mutational exposures differ between the two
sets, i.e., at least one µ k
(1)
̸=µ k
(2)
. The likelihood and prior of the multi-level model is
specified as follows,
Fig. 3.2 Plate notation for hierarchical LDA with Dirichlet-distributed signature-mutation
distributions exemplified by two groups, i.e., trunk mutations and branch mutations.
44
3.2 Methods
x
i,j,l
|z
i,j
∼ Multinomial(f
z
i,j
,l
)
z
i,j
∼ Multinomial(q
i
|g)
q
i
|g
i
∼ Dir(α (g
i
)
)
For full details, see Figure 3.2.
The Likelihood function
The likelihood can be constructed with an additional hierarchy based on a latent allocation
or mixed membership model, where the likelihood of the observed mutations is conditioned
on the unobserved signature of origin. As in the statistical approach described in the work
of Shiraishi et al. (2015), we model the features under the assumption of independence,
such that the complete data likelihood is a product of multinomial distributions over each
mutation feature f
k,l
, in which the indicator variable denoting a mutation’s signature
of origin, I[z
i,j
= k], is unobserved. In their previous work, Shiraishi et al. (2015).
assumed that mutational exposures q are drawn from the same Dirichlet distribution
Dir(α )=Dir(α 1
,...,α K
):
Pr(q|α )=
I
Y
i=1

Γ(
P
k
α k
)
Q
k
Γ( α k
)
K
Y
k=1
q
α k
− 1
i,k
!
Pr(X,Z|f,q)=
I
Y
i=1
J
i
Y
j=1
K
Y
k=1

L
Y
l=1
f
k,l
(x
i,j,l
|z
i,j
)q
i,k
!
I[z
i,j
=k]
45
3.2 Methods
where the elements of α k
can be used to calculate the means for q
i,k
, i.e., µ k
=
α k
P
K
1
α k
.
The denominator
P
K
1
α k
is a measure of precision for the means, so that the larger this
measure, the tighter the distribution of q
i,k
around the means.
In our case, there are two groups of mutational catalogs, referred to as group 1 and
group 2, respectively. Therefore, we extend the above model by using two different Dirich-
let distributions, Dir(α (1)
) = Dir(α (1)
1
,...,α (1)
K
) and Dir(α (2)
) = Dir(α (2)
1
,...,α (2)
K
),
from which mutational exposure parameters are drawn, using the same set of k signatures
for each individual (See Figure 3.2). Thus, we write,
Pr(q|α (1)
,α (2)
,g)=Pr(q|α (1)
)
I[g
i
=1]
· Pr(q|α (2)
)
I[g
i
=2]
Pr(X,Z|f,q)=
I
Y
i=1
J
i
Y
j=1
K
Y
k=1

L
Y
l=1
f
k,l
(x
i,j,l
|z
i,j
)q
i,k
!
I[z
i,j
=k]
3.2.2 Testing for Differences in Signature Exposures
To characterize the signature contributions for different sets of tumor catalogs, we wish
to conduct a hypothesis test that there is no difference in mean exposures versus the
alternative that the mean exposure of at least one signature differs between the two
groups, i.e. H
0
:µ (1)
=µ (2)
vs. H
1
: at least one µ k
(1)
̸= µ k
(2)
. We propose both local
and global tests, implemented in a Bayesian framework. The former provides signature-
level evaluations to determine where the differences in mean mutational exposures
occur, while the latter provides an overall conclusion about any difference in mean
mutational exposures. The details of our implementation are given in our Just Another
Gibbs Sampler (JAGS) scripts and Source code is freely available in GitHub at https:
//github.com/USCbiostats/HiLDA and on Bioconductor at https://bioconductor.org/
packages/devel/bioc/html/HiLDA.html [48].
46
3.2 Methods
A local test to identify signatures with different exposures
We propose a signature-level (local) hypothesis test to allow us to infer which signature(s)
contribute a different mean exposure to the mutational catalogs across tumor sets, i.e.,
µ k
(1)
̸= µ k
(2)
. To measure the difference between mean signature exposure vectors, we
implement HiLDA by specifying two Dirichlet distributions, Dir(α (1)
) and Dir(α (2)
),
as priors for the distribution of mutational exposures q
i
of each group [49]. Using this
formulation, the difference between the two groups of the mean exposure of signature k
is calculated as,
∆ k
=µ (2)
k
− µ (1)
k
=
α (2)
k
P
k
α (2)
k
− α (1)
k
P
k
α (1)
k
. (3.1)
For all parameters, α (1)
k
’s and α (2)
k
’s, we use independent, non-informative gamma
distribution priors with a rate of 0.001 and shape of 0.001. Since JAGS suffers from
convergence issue when estimating parameters very close to zeros, we truncate that distri-
bution to be≥ 0.05. This results in an approximate mean of 107.7 and an approximate
variance of 9.62× 10
4
.
We estimate parameters via Markov chain Monte Carlo (MCMC) using two chains [50].
We assess convergence of the two MCMC chains using the potential scale reduction factor
(Rhat) in [51], which is required to be less than or equal to 1.05 for all parameters in order
to conclude that the MCMC run has converged. After obtaining the posterior distribution
of the differences (i.e., of ∆ k
), there are two possible approaches to performing inference.
We can: 1) use the Wald test to compute the P-value using the means and standard
errors of the posterior distribution for ∆ k
; 2) determine whether the 95% credible interval
of the posterior distribution for ∆ k
contains zero.
47
3.2 Methods
Table 3.1 The scale for interpretation of the Bayes factor
Bayes Factor Evidence in favor of the alternative model M
2
1 to 3 not worth more than a bare mention
3 to 20 positive
20 to 150 strong
>150 very strong
A global test using the Bayes factor
We also propose a global test to provide an overall conclusion on whether the mean
exposures differ between groups of catalogs described in Figure 3.3. It uses the Bayes
factor, theratio ofposteriortoprior oddsinfavorof modelH
1
(H
1
: atleastoneµ (1)
k
̸=µ (2)
k
,
k = 1,...,K) compared to modelH
0
(H
0
: µ (1)
=µ (2)
), to indicate the strength of evidence
that they do differ, without explicit details on how they differ. Thus, we can calculate
the Bayes factor as:
BayesFactor =
Pr(H
1
|Data)
Pr(H
0
|Data)

Pr(H
1
)
Pr(H
0
)
. (3.2)
Since the likelihood is analytically intractable, the Bayes factor is calculated via
MCMC [50]. In order to estimate the Bayes factor, during the MCMC analysis, a single
binary hypothesis index variable is used to indicate which hypothesis explains the observed
data [52]. The parameters of two Dirichlet distributions, Dir(α (1)
) and Dir(α (2)
), are
drawn from the same prior if the index takes the value 1, whereas they are drawn from
different priors if it takes the value 2. Initially, the prior hypothesis odds is set to be
0.5/0.5 = 1, which means that both hypotheses are assumed equally likely under the
prior. In order to improve computational efficiency in extreme situations in which one
hypothesis dominates the other, we can use a different prior odds value [ 50]. As shown in
48
3.2 Methods
Fig. 3.3 Bayesian comparison of the null model and alternative model.
Table 3.1, a Bayes Factor (BF) between 3-10 indicates substantial support for the model
with different mean exposures in the two groups ( H
1
) [53]. A BF > 10 indicates strong
support.
3.2.3 Two-stage Inference Methods using the Point Estimates of
mutational exposures
An alternative approach is to perform hypothesis testing using point estimates of the
mutational exposures, ˆ q
i
, in a two-stage analysis, which we refer to as the "two-stage"
method (TS). We used the R package pmsignature to estimate ˆ q [6]. Other methods
are also available, but we selected pmsignature for the purpose of comparisons to the
results from HiLDA since it assumes the same model for estimating signatures under
independence of features. We summarize the steps of the TS method as follows:
49
3.2 Methods
1. Jointly estimate the vectors of mutational signature exposures, q
i
, for each muta-
tional catalog.
2. Test for differential mutational exposures for signature k by performing the Wilcoxon
rank-sum test on the ˆ q
k
.
However, we note that the Wilcoxon rank-sum test in stage 2 is also sensitive to
changes in variance across the two groups, which might lead to significant results even
when there has been no change in mean exposures [54, 55]. We implemented the two-stage
method using R version 3.6.0 [56]. A two-sided P value of less than 0.05 was considered
statistically significant.
3.2.4 Choosing the Number of Signatures
The number of signatures, K, needs to be determined prior to any of the above analyses.
At the time of this work we had not yet developed the perplexity measure in Chapter
2, hence we adopted the method of [6] to determine K. Their method is based on the
following criteria:
1. The optimal value of K is selected over a range of K values such that the likelihood
remains relatively high while simultaneously having relatively low standard errors
for the parameters.
2. Pairwise correlations between any two signatures (the kth signature and the k
′
th
signature, say) are measured by calculating the Pearson correlation between their
estimated mutational exposures across all samples, (i.e., the correlation between
(ˆ q
1,k
,...,ˆ q
I,k
) and (ˆ q
1,k
′,...,ˆ q
I,k
′)). K is chosen such that no strong correlation (i.e.,
>0.6) exists between any pair.
50
3.2 Methods
The result of estimating mutational signatures for the USC data, including 16 trunk
mutational catalogs and 16 branch mutational catalogs, using different values of K, the
number of mutational signatures.
Fig. 3.4 The log-likelihood, bootstrap-erros, and maximum correlation values among
estimated mutational exposure parameters for a series of K number of signatures.
log−likelihood bootstrap error max. corr.
2 3 4 5 6 7 8 9
−54000
−53500
−53000
−52500
−52000
0.0
0.1
0.2
0.3
0.4
−1.0
−0.5
0.0
0.5
1.0
#signature
value
We fit our latent Dirichlet allocation model using the two bases flanking on both sides
of our nucleotide substitution and select K = 3 signatures based on the three criteria
proposed by Shiraishi et al. (2015) (See Figure 3.4). We see that while increasing the
number of signatures always results in an increase in the likelihood, the bootstrap-errors
51
3.2 Methods
started to increase at K =4 while the correlation between estimated signature fractions
reaches high values at K = 4. This suggests that our 32 mutational catalogs can be
explained using three signatures of mutational processes.
3.2.5 Application
USC Colon Cancer Data
Our goal is to identify whether any new mutational signatures occur during colon cancer
growth that distinguish cancer evolution from normal tissue evolution. To achieve this,
we classify somatic mutations into two catalogs according to time of occurrence: those
that accumulated between the time of the zygote and the first tumor cell, which we call
trunk mutations, and those that occur de novo during tumor growth, which we refer to
as branch mutations. We then estimate mutational signatures in the two sets of catalogs
and test whether the mean mutational exposures differ between them.
3.2.6 Data preparation in 17 CRC tumors
In this section, we will describe the technical details of generating somatic mutations
from sets of CRC tumors, referred as the USC CRC dataset, and classifying them into
trunk/branch mutations for the following analysis.
STEP 1: Calling mutations and filtering
Mutational signatures in CRC was previously estimated for the Alexandrov et al. data,
which were generated by whole exome sequencing. In a similar way, in this proposal, our
data consist of 17 new CRC tumors that underwent multi-regional sampling, for each
of which we obtained two bulk tissue samples from opposite tumor halves (orange and
52
3.2 Methods
Fig. 3.5 (A) Mutation accumulation under the clonal evolution model and two samples
(oranges, blue samples) were obtained from the opposite of a tumor. (B) Suppose four
mutation (red lightning mark) occur before and after the tumor initiation. (C) The
classification of four mutations described in (B).
blue circles in Figure 3.5A), with whole exome sequencing performed for both tumor and
tumor-matched non-tumor adjacent tissue [57]. Data were processed using the GATK
pipeline version 3.7 [58] and somatic mutations called with MuTect version 1.1.7 [59] by
MuTect with filters KEEP (default parameters), COVERED (read depth of 14 in tumor
and 10 in matched normal) and allele fraction greater than 0.10. Then, mutations not
also found by Strelka, another mutation caller, were filtered out [ 60]. The remaining
mutations were considered to be high-quality mutations, and it is these we used for our
signature analysis.
STEP 2: Identifying trunk and branch mutations
Every mutation can be reclassified into one of two groups based on when it was acquired
during tumor development [57]. Specifically, trunk mutations are found in all tumor
cells and correspond to mutations that occurred before tumor initiation. In contrast,
those mutations only found in a subset of tumor cells are called branch mutations; they
53
3.2 Methods
correspond to mutations that occurred after tumor initiation. Therefore, for each tumor
we classify somatic mutations by time of occurrence, labeling them as either trunk (being
present in the first tumor cell) or branch (occurring sometime later). This mutation
classification can be performed by comparing tumor halves described above. Mutations
appearing in both tumor halves were identified as trunk mutations, and those found on
only one side as branch mutations.
We analyzed a total of 16 colon tumors. Tumor and adjacent normal tissue were
subject to whole exome sequencing, and somatic mutations called using the GATK
pipeline and MuTect (details below). Somatic mutations in the tumors were defined
as nucleotide variants that were detected in tumor tissue but did not also appear in
the patient-matched normal tissue. We used multi-region tumor sampling to allow us
to distinguish between trunk from branch mutations [57]. Each tumor was sampled
twice, with bulk tissue samples taken from opposite tumor halves. We classified somatic
mutations appearing in both tumor halves as trunk, because only trunk mutations are
likely to appear in both tumor halves, while mutations found on only one side of a tumor
were labeled as branch. This approach has previously been shown to be 99% sensitive for
calling trunk mutations and 85% sensitive for calling branch mutations [57]. Fifteen of
the 16 tumors were previously analyzed in a study of cell motility [61].
The sequence data were processed using the GATK pipeline version 3.7 [58] and
somatic mutations called with MuTect version 1.1.7 [59], applying the quality filters
KEEP (default parameters) and COVERED (read depth of 14 in tumor and 10 in matched
normal - use of a lower coverage threshold in normal tissue is as recommended in [59]).
We excluded any mutations that either had an allele frequency less than 0.10, because
sequencing errors are more common among low-frequency mutations [59], or that were not
also found by Strelka [60], which we used as a confirmatory control. Somatic mutations
54
3.2 Methods
on chromosomes 1 to 22 were used for mutational signature analysis. Our final data set
is available for download from https://osf.io/a8dzx/.
Table 3.2 The number of somatic mutations in the 32 mutational catalogs from 16 colon
tumors in the USC data, along with their associated trunk and branch mutations.
Tumor Side A Side B Trunk Branch
W
∗ 1751 1762 1578 357
N 630 733 430 503
M 567 526 452 189
G 546 407 244 465
S 384 375 235 289
T 295 375 175 320
D 295 299 173 248
J 308 285 227 139
O 268 269 210 117
A 236 217 146 161
U 224 220 171 102
K 200 218 52 314
P 174 215 103 183
X 181 208 126 137
Kc 136 84 49 122
F 82 84 50 66
Total 6277 6277 4421 3712
*
Sample W is microsatellite instable
Esophageal Adenocarcinoma (EAC) Data
Here, we test for possible group differences in esophageal adenocarcinoma mutational
exposures by four clinically important covariates. In papers by [6] and [7], 146 tumor
55
3.3 Results
samples of esophageal cancer patients from [3] were analyzed to extract mutational
signatures. We downloaded the somatic mutations for this analysis from (ftp://ftp.sanger.
ac.uk/pub/cancer/AlexandrovEtAl/somatic_mutation_data/Esophageal/). Information
for the four clinical variables were retrieved from cBioPortal (https://www.cbioportal.org/
study/summary?id=esca_broad) [62, 63, 3]. We extended the analysis of [6], applying
HiLDA to test whether the mutational exposures of the four signatures they found
differ by sex (120 male vs. 25 female), age group (120 ≥ 60 years vs. 25 < 60 years),
smoking status (47 smokers vs. 19 non-smokers), or tumor site (41 esophagus vs. 52
cardia/gastric-esophageal junction(GEJ)).
3.3 Results
3.3.1 Tumor Evolution in USC Colon Cancer Data
A total of 12,554 somatic single-nucleotide substitutions were identified, with a median
of 277 per sample (range: 82 - 1,762) (See Table 3.2). One tumor with microsatelite
instability has more than double the number of somatic mutations (1751 side A, 1762 side
B) than any of the remaining 30 catalogs (all <750 mutations). In our first analysis, we
compared the mutational exposures in side A to those in side B. If the tumors represent
a single clonal expansion, we would expect similar mutational exposure frequencies in the
two catalogs from the same tumor. Indeed, this is what we found (Table 3.2).
Weidentifiedamedianof174trunkand186branchmutationspertumor. Thenumbers
ranged from 49 to 1,578 trunk mutations and from 66 to 503 branch mutations (Figure
3.6A). Interestingly, the microsatellite instable tumor had the most trunk mutations,
but not the most branch mutations, suggesting that during tumor growth the mutation
frequency is similar in microsatellite stable and instable tumors. Figure 3.6B and 3.6C
56
3.3 Results
Fig. 3.6 The numbers of somatic mutations in 32 mutational catalogs obtained from 16
colon cancer patients in the USC data and their mutation spectra.
(A) The number of somatic mutations in 16 tumors, each of which contributes 2 mutational
catalogs denoted as trunk (dark blue) and branch (light blue).
(B) The percentage bar plot of relative frequencies for six substitution types in 16 trunk
mutational catalogs.
(C) The percentage bar plot of relative frequencies for six substitution types in 16 branch
mutational catalogs.
show that the C>T substitution is most common in all trunk catalogs, and most branch
catalogs. The spontaneous deamination of methylated Cs in CpGs is known to contribute
to hotspots of C>T mutation in the genome.
We identified three mutational signatures in our data (see Figure 3.4). Those three
signatures, and their corresponding exposures, are depicted in Figure 3.7A, 3.7B and 3.7C.
For each mutational signature, we compute the probabilities for the 1536 possible five-base
signature patterns by taking the product of the feature component probabilities. We use
these multinomial vectors to calculate the cosine similarity between pairs of signatures
[64]. The signature shown in the yellow box in Figure 2D, involving C>T mutations at
CpG sites, resembles signature 7 in [6] (cosine similarity 0.95), where it was identified
in 25 out of 30 cancer types and likely relates to the deamination of 5-methylcytosine
57
3.3 Results
(‘aging’); the signature in the orange box in Figure 3.7E, involving T>G mutations at
GpGpTpG sites, is novel; the third signature, in the red box in Figure 3.7F, is most
similar to signature 23 in [6] (cosine similarity 0.85), where it was identified in four other
cancer types. The pairwise cosine similarities between pairs of our yellow, orange and
red signatures are 0.12, 0.01, and 0.02 which are rather dissimilar from each other given
the [0, 1] range for cosine similarity. Using HiLDA, we test whether the three signatures
differ in mean exposure between trunk and branch mutations.
Our global test strongly suggests that, in our data, the signature exposures differ
between trunk and branch catalogs (Bayes Factor = 1265.0). A Bayes Factor greater
than 10 is considered strong evidence for model H
1
[53]. Each of the individual signatures
(depicted in Figure 3.7D, 3.7E and 3.7F) is found to differ in exposure between the
two sample groups, a conclusion supported by both HiLDA and the two-stage method
(Table 3.4). From Figure 3.7C, it is evident that the exposures of the first (‘aging’)
signature in trunk mutations is almost always greater than that for the matching catalog
of branch mutations, which is intuitively consistent with the fact that trunk mutations
may well reflect an accumulation of mutations over the life of the subject, whereas
branch mutations are accumulated only after tumor initiation. For the previously unseen
signature, the higher exposures in branch catalogs might suggest that this signature’s
underlying mechanism for generating mutations might be associated with the processes
occurring during tumor evolution as opposed to normal development. From Figure 3.7G,
we observed that the distributional ranges of the two groups of mutational exposures have
some overlaps, but that the centers of each group, i.e., the means of mutational exposures,
are clearly deviated from each other. However, the distributional radii, indicating the
variances of mutational exposures, do not substantially differ between the groups.
58
3.3 Results
We sought to validate the discovery of the previously unseen signature by repurposing
targeted sequencing data from the same tumor set [57] and using publicly available
data from the Cancer Genome Atlas. Four T>G substitutions that we assigned to the
previously unseen signature were part of a much larger independent validation set of
mutations subjected to targeted, high-coverage Ampliseq technology [57]; all four of these
T>G substitutions failed to validate. Further, a systematic analysis of data from the
Cancer Genome Atlas also did not find evidence for this signature [ 65]. Therefore, we
cannot rule out that the signature is the result of sequencing error.
3.3.2 Esophageal Adenocarcinoma
We reanalyzed the 146 EAC previously studied by [6] and recovered the same four
mutational signatures, C>T at CpG (S7), C>T or A at TpC (S14), T>G or C at
Cp(T>G/C)pT (S21), and a signature capturing the remaining mutations, i.e., those
that do not fall into the previous three signatures. We tested for differences in mutational
exposures by sex, age, smoking status, and tumor site. Only tumor site showed some
evidence of differences in mutational exposure by patient subgroup (Figure 3.8). The TS
approach showed the mutational exposure for signature S21 was lower in the cardia/GEJ
compared to the esophagus (p = 0.019) (Figure 3.8A). HiLDA only identified a significant
deficit in the mutational exposure for S21 in the cardia/GEJ location (-7.3 % with
95% credible interval: [-11.8%, -2.7%]; HiLDA-Wald p = 0.002) (Figure 3.8B). However,
the HiLDA global test showed no strong evidence for associations between mutational
exposures and any of the four clinical variables age, sex, smoking status or tumor site
(all Bayes Factors < 1), suggesting the differences with tumor site may not be real. Still,
both HiLDA-CI and HiLDA-Wald tests return significant results even when using the
Bonferroni method to adjust for multiple comparisons (-7.3 % with 98.75% credible
59
3.3 Results
interval: [-12.9%, -1.5%]; adjusted p = 0.019). See Figure 3.9 for more details. We now
go on to assess the reliability of results using a simulation study.
3.3.3 Simulation Study
We conducted a simulation study to assess the performance of both HiLDA and the two-
stage approach in terms of the false-positive rate (FPR) and true-positive rate (TPR), in
local, univariate tests of the difference in mean exposure between two groups of mutational
catalogs. We assess the functionality of the methods in a setting similar to that of the
USC data, simulating somatic mutations directly using the estimated signatures (f
k
)
from Figure 3.7D, 3.7E and 3.7F for the same number of mutational catalogs (two groups
of 16 catalogs each) and somatic mutations per catalog (J
i
in Table 3.2). The mutational
exposures (q
i
) were indirectly used to derive the concentration parameters of the Dirichlet
distributions. The scenarios are as follows:
1. The two groups of mutational catalogs are from separate Dirichlet distributions with
parametersα (1)
=(9.2,0.2,7.5) andα (2)
=(4.2,0.6,7.3). Here, theα s corresponds
to the maximum-likelihood estimated parameters from the three exposure distribu-
tions in the trunk and branch mutational catalogs. This gives mean exposures of
µ (1)
=(0.54,0.01,0.44) andµ (2)
=(0.35,0.05,0.60) in trunk and branch catalogs,
respectively, for the aging signature, new signature, and random signature.
2. The two groups of mutational catalogs are from the same Dirichlet distribution,
Dir(4.2,0.6,7.3), (so here we use the concentration parameters estimated from the
branch mutational catalogs).
For each tumor, mutational exposures q
i
, are drawn from the Dirichlet distribution.
Each set of probabilities parameterize a multinomial distribution later used to proba-
60
3.3 Results
Side A - Side B HiLDA-CI HiLDA-Wald TS-Wilcoxon
Tests
a
Coef.

95% C.I.

b
p value p value
∆ 1
0.002

-0.079, 0.083

0.986 0.780
∆ 2
0.000

-0.029, 0.029

0.988 0.897
∆ 3
-0.002

-0.083, 0.086

0.961 0.985
H
0
:∆ 1
=∆ 2
=∆ 3
=0 Bayes Factor
M
2
/M
1
=0.021
a
∆ k
=
α (2)
k
P
k
α (2)
k
− α (1)
k
P
k
α (1)
k
, the difference in the mean exposure of signature k in group 1 and 2.
b
95% credible interval from the posterior distribution.
Table 3.3 Comparing mutational exposures from two sets of mutational catalogs, Side A
and Side B, in the USC data.
bilistically choose the underlying mutational signature for a mutation (See Figure 3.10).
Then, every mutation feature in the mutational pattern of the mutation is simulated
independently from a corresponding multinomial distribution of the chosen signature. To
estimate the FPRs, 1000 sets of data were simulated for scenario 2, when there is no
difference in the exposure distribution between two groups of mutational catalogs. The
two-stage method is slightly conservative for 1st and 3rd signatures (resulting FPRs of
4.3%, 5.2%, and 4.3%) when testing at the 5% significant level (Table 3.5). In comparison,
HiLDA showed better control of the FPR by using the 95% credible interval of the
posterior distributions (4.8%, 5.0%, and 5.1%). The Wald test also showed control of
the FPR, except in the case of the rare signature when it was noticeably lower (3.7%),
presumably due to the asymmetric posterior distribution.
We then moved to scenario 1, where we simulated 200 data sets with a difference in
mean exposures between the two groups of catalogs. Here, the statistical powers of both
HiLDA and the two-stage method are high when detecting the difference in exposures for
the 1st and 3rd signatures (Table 3.4). In contrast, for the 2nd signature, which has the
lowest mean mutational exposure, the TPRs of all methods are lower (77.5% - 85.5%).
61
3.3 Results
By using the 95% credible interval of posterior distributions, HiLDA is able to distinguish
a difference more often than the two-stage method (99.5% vs. 99.0%, 85.5% vs. 77.5%,
and 91.5% vs. 88.0%). At the same time, using the credible interval resulted in higher
TPRs compared to performing a Wald test (85.5 % vs. 80.5% for the 2nd signature).
In summary, across tests involving these three mutational signatures, HiLDA provides
higher statistical power to the TS method with a tendency of better improvement for
signatures with lower mutational exposures, i.e., the power difference between HiLDA
and the TS method is the highest (8%) for signature 2 with the lowest mean mutational
exposures. The improvements in the power to detect the mean exposure difference is
presumably due to the fact that HiLDA accounts for the uncertainty in the estimated
mutational exposures and provides better model fit of the posterior distributions. All
data were simulated in R 3.6.0 using the hierarchical Bayesian mixture model described
in the methods section. All replicates reached convergence with an Rhat value less than
1.05 for each of the scenarios shown in Tables 3.3, 3.4, and 3.5.
Branch - Trunk HiLDA-CI HiLDA-Wald TS-Wilcoxon
Tests
a
Coef.

95% C.I.

p value p value
∆ 1
-0.210

-0.295, -0.127

<0.0001 0.0002
∆ 2
0.064

0.035, 0.099

0.0001 0.0075
∆ 3
0.146

0.056, 0.231

0.0011 <0.0001
H
0
:∆ 1
=∆ 2
=∆ 3
=0 Bayes Factor
M
2
/M
1
=1265.0
a
∆ k
=
α (2)
k
P
k
α (2)
k
− α (1)
k
P
k
α (1)
k
, the difference in the mean exposure of signature k in group 1 and 2.
b
95% credible interval from the posterior distribution.
Table 3.4 Comparing mutational exposures in colorectal cancer from two sets of mutational
catalogs, trunk and branch, in the USC data.
62
3.4 Discussion
Methods ∆ 1
∆ 2
∆ 3
FPRs HILDA-CI
a
4.8 % 5.0% 5.1%
HILDA-
Wald
b
5.1 % 3.7% 5.4%
TS-Wilcoxon 4.3 % 5.2% 4.3%
TPRs HILDA-CI 99.5% 85.5% 91.5%
HILDA-Wald 99.5% 80.5% 92.5%
TS-Wilcoxon 99.0% 77.5% 88.0%
a
Percentage of 95% credible intervals that exclude zero.
b
Percentage of P-values <0.05 after applying the Wald test to the
posterior distribution.
Table 3.5 The false positive rates (n = 1,000) and true positive rates (n = 200) of both
the two-stage method and HiLDA when applied to the simulated data.
3.4 Discussion
In this paper, we present a new hierarchical method, HiLDA, that allows the user to
simultaneously extract mutational signatures and infer mutational exposures between two
different groups of mutational catalogs, e.g., trunk and branch mutations in our colon
cancer application. Our method is built on the approach of [6], in which mutational
signatures are characterized under the assumption of independence, and it is the first to
provide a unified way of testing whether mutational processes differ between groups (here,
between early and late stages of tumor growth). As a result, our method allows us to
appropriately control the false positive rates while providing higher power by accounting
for the accuracy in the estimated mutational exposures.
In our analysis of the USC data, which consist of 32 mutational catalogs extracted
from tumors from 16 CRC patients, our method detected three signatures and indicated
a statistically significant difference in mean exposures between groups. Two of the three
signatures resemble signatures S7 and S23 found by [6]. But, in addition, we found a
63
3.4 Discussion
novel signature. Signature 7 appears significantly more often in trunk mutations, which is
consistent with the fact that it has previously been related to aging and trunk mutations
have a longer time over which to occur (conceivably over the lifetime of the patient) than
do branch mutations (which occur only during tumor growth). The new signature, which
occurred more often in low frequency branch mutations, is very similar to a sequencing
artifact described by [8] (cosine similarity = 0.93). We note that, for the USC data, the
conclusions obtained from HiLDA were qualitatively the same as those obtained from the
TS method. This is likely due to the relatively large effect size here (i.e., the difference of
mean exposures between the two groups, divided by the standard errors of same, also
known as the signal-to-noise ratio).
In the analysis of the EAC data using HiLDA, we detected a statistically significant
increase in the mutational exposure of S21, which is consistent with the findings of
excessive fraction of A(A>C) mutations in esophagus compared to cardia/GEJ found in
[3]. To explain, since mutational signatures features are defined in terms of substitutions
by the pyrimidine (T and C), an A(A>C) transversion is equivalent to a (T>G)T
transversion associated with S21. Also, we found that S21 greatly resembles Signature 17
published in [5] (cosine similarity = 0.96), the hallmark signature of EAC that has been
proposed to arise from oxidative damage due to gastrointestinal reflux [ 66]. Alexandrov
et al.’s Signature 17 has been shown to have a higher number of mutations in EAC
compared to stomach cancers, which reinforces our results showing higher mutational
exposures for S21 in tumors occurring in the tubular esophagus compared to those in
the GEJ [8]. By comparing different testing results, it seems that both HiLDA-CI and
HiLDA-Wald tests are more sensitive compared to the TS approach in detecting the
difference. However, the global test, based on the Bayes factor, disagrees with the local
64
3.4 Discussion
test in the EAC data which might suggest that more samples are needed for the global
test to sufficiently support model H
1
.
In the simulation study, both HiLDA and the TS approach were applied to datasets
consisting of 16 tumors simulated under two scenarios to test for between group differences
in the mutational exposures of three signature. The results indicated that our unified
approach has higher statistical power for detecting differences in exposures for these
signatures while controlling the 5% false positive rate. We suspect that the improvement
in statistical power is because our unified method explicitly allows for the uncertainty
of inferred mutational exposures, while the two-stage method fails to do so since it
incorporates only the point estimates of those exposures. In addition, HiLDA provides
posterior distributions for each parameter, thereby allowing construction of 95% credible
intervals for parameters, and their differences, for example. As expected, this fully
parametric approach is then more powerful than nonparametric approaches, which we
see particularly when testing for differences in the rarer signatures.
We also note that the two-stage approach can become problematic with regards
to controlling the type I error rate in particular scenarios, e.g., when the variances of
exposures differ widely between the two groups. In our simulation study, we aimed to
emulate the USC data, meaning that the exposure variances were quite similar between
groups. Consequently, the Wilcoxon rank-sum test, the second-stage of the TS approach,
was able to maintain a type I error of 5%. However, we note that the Wilcoxon rank-
sum test is sensitive to differences found in either location or scale parameters of the
two distributions being tested, i.e., it is sensitive to changes in both the mean and
the variance. Therefore, when the variances change between two groups, the Wilcoxon
rank-sum test may indicate statistically significant differences in distributions even when
the means have not changed, (i.e., due to the difference in shape parameters rather than
65
3.5 Conclusion
a difference between location parameters). In contrast, HiLDA explicitly focuses on
detecting differences in means, and is robust to effects such as changes in variance.
Consequently, when applying the TS method, one should be wary of interpreting
significant results as evidence of a "difference in means" when using the TS method (as
seems to be common [33, 31, 27]). We note that scenarios in which the variance of the
estimated exposures differs will be common if the numbers of mutations per tumor varies
between the two groups (e.g. when comparing microsatellite instable vs. microsatellite
stable colon tumors), leading to an inflated false-positive rate if results from the TS
method are interpreted as being evidence of a difference in means. (See Figure 3.11 for a
specific example of this.) We intend to explore this issue further in a future paper. We
also intend to more fully investigate the factors that drive the ability to detect significant
difference between groups across a much wider variety of scenarios.
3.5 Conclusion
In conclusion, we developed a unified method, HiLDA, along with an R package, which
enables researchers to simultaneously estimate mutational signatures and infer the mean
difference in mutational exposures between two groups. The simulation studies demon-
strated that HiLDA has higher statistical power for detecting differences in mutational
signatures, because it accounts for uncertainty in the exposure estimates. Application of
HiLDA to both the USC colon data and the EAC data suggest that future studies may
also benefit from using HiLDA, rather than the existing TS method, to better detect the
difference in mutational signatures.
66
3.5 Conclusion
Fig. 3.7 Mutational exposures and three mutational signatures from the analysis of 16
trunk mutational catalogs and 16 branch mutational catalogs in the USC data (16 colon
cancer patients). (A) Barplot of the somatic mutation counts, by signature type, sorted
in a descending order of the total number of mutations. Each grouped pair contain the
trunk and branch mutations. (B) Barplot of the somatic mutation counts in (A) with
the y-axis being rescaled to show proportions. (C) The same data as in (B), but now
separate into trunk and branch mutations. Within each group the plots are sorted by
the exposure frequency of the first signature (yellow). (D, E, F) The yellow, orange,
red mutational signature with four flanking bases. (G) The distributions of mutational
exposures of three signatures highlighted by group.
67
3.5 Conclusion
Fig. 3.8 Estimated mutational exposures and posterior distributions of mean differences in
mutational exposures from the analysis of the EAC data (146 esophageal adenocarcinoma
patients). (A) Barplot of mean mutational exposures of three signatures by sex, age
groups, smoking status, and tumor sites derived from pmsignature. The significance level
of TS approach is denoted by asterisks. (**: <0.005; *:<0.05). The mutational exposures
do not sum to one since the frequency of remaining mutations (those not assigned to
these three signatures) is not displayed. (B) 95% credible interval of mean differences
in mutational exposures of four signatures derived from HiLDA-CI with the significance
level of HiLDA-Wald test. (**: <0.005; *:<0.05). The difference in mean exposures from
HiLDA can differ from those estimated by pmsignature due to the covariate distribution
in the hierarchical model.
68
3.5 Conclusion
Fig. 3.9 Estimated mutational exposures and posterior distributions of mean differences in
mutational exposures from the analysis of the EAC data (146 esophageal adenocarcinoma
patients) by pmsignature and HiLDA.
69
3.5 Conclusion
Fig. 3.10 The generative process of a mutation in the simulation study is illustrated here.
The black box outline is used to indicate the presumed generative process of a mutation,
i.e., from the left to right, the simulated mutational exposures, the simulated signatures,
and the simulated mutation features.
0.04
0.06
0.08
0.10
50 100 150 200 250
Number of catalogs in total
Rejection Rates
Ratios
30% vs. 70%
50% vs. 50%
70% vs. 30%
Fig. 3.11 A small simulation study demonstrates how the rejection rate of the two-stage
method could be influenced by different variances in two groups with the same means.
The rejection rates are defined as the fraction of significant results with Wilcoxon P value
less than 0.05 given no mean difference in two groups.
70
Chapter 4
iMutSig: a web application to identify
the most similar mutational signature
using shiny
4.1 Background
Eachhumanissubjecttoavarietyofmutationalprocessesthroughouttheirlifetime, which
result in the formation of a catalog of somatic mutations as his/her unique mutational
profile [ 7]. A mutational signature captures the pattern of the mutations and contexts
in which those mutations occur (i.e., the neighboring bases). Examples of important
mutational processes with distinct mutational signatures include aging and ultraviolet
(UV) radiation. Many research groups are performing analysis to discover de novo
mutational signatures in cancer [7, 38–40].
Currently, there are two frameworks used to characterize and visualize mutational
signatures [35, 41]. The first was proposed by Alexandrov et al., who used a vector of
71
4.1 Background
96 probabilities to capture the possible composition of the six nucleotide substitutions
(C>A, C>T, C>G, T>A, T>C, T>G) and the neighboring base immediately on each
of the 5
′
and 3
′
side of the mutated base [7]. A list of published mutational signatures
can be downloaded from the Catalogue Of Somatic Mutations In Cancer (COSMIC)
website at https://cancer.sanger.ac.uk/cosmic/signatures [67]. Later, Shiraishi et al.
proposed a mixed-membership model, pmsignature, which substantially reduces the
number of parameters needed to characterize a signature [6]. They achieved this by
assuming independence across bases, thereby reducing the number of parameters from
6× 4× 4− 1= 95to(6-1)+(4-1)+(4-1)=11[6]. Thereductioninthenumberofparameters
is greater if more flanking bases are included. All Shiraishi’s signatures can be downloaded
from their GitHub repository at https://github.com/friend1ws/pmsignature_paper [6].
In this paper, we will refer to signatures resulting from these two methods as “COSMIC
signatures” (for those resulting from Alexandrov et al.’s method) and “PM signatures”
(for those resulting from Shiraishi et al.’s method).
A large number of researchers have published scientific findings resulting from the
COSMIC signature-based method [68–70], which was defined as the “gold standard" in
the field by Baez-Ortega et al. [ 35]. Meanwhile, an increasing number of researchers are
using the pmsignature-based method due to it requiring fewer parameters (and therefore
being particularly appealing when sample sizes are small, say) [6, 71, 4]. Given that both
methods are widely used, investigators need the ability to compare results from their
analysis with those reported in earlier databases, which may have been produced using
the alternate method. For example, researchers have adopted both tools for gastric cancer
and tried to compare and integrate the information from two data sources in a somewhat
ad hoc manner [36]. No rigorous tool exists for this task. In this paper we present iMutSig,
an easy-to-use tool that allows users to identify the most similar mutational signatures
72
4.2 Implementation
across methods and to compare the information characterizing those signatures using
simple mouse navigation.
4.2 Implementation
Table 4.1 An example of PM signatures.
Nucleotide substitution
C >A C>G C>T T>A T>C T>G
0.012 0.003 0.879 0.003 0.090 0.014
Flanking bases
Position A C G T
-2 0.159 0.042 0.486 0.314
-1 0.044 0.052 0.870 0.034
+1 0.076 0.237 0.571 0.116
+2 0.245 0.247 0.256 0.252
Transcription strand
Plus Minus
0.511 0.489
In order to measure the similarity between mutational signatures across two databases,
we need to represent PM signatures in a way that is comparable with those from COSMIC.
To do this we convert the PM signature into a probabilistic vector with the same length
as the COSMIC signature, i.e., 96. To calculate each of the 96 probabilities, we take
the constituent components that make up the COSMIC signature - which refer to
the nucleotide substitution and two flanking bases at -1 and +1 position - calculate
the probability of each component for given PM signature, and then multiply those
73
4.2 Implementation
probabilities using pmsignature’s assumption of independence. For example, to calculate
the probability of the COSMIC signature C[C >A]T we multiply three PM signature’s
probabilities: P(C), P(C >A), and P(T). This example is shown in Table 4.1 and the
following equation.
P(C[C >A]T)=P(C)P([C >A])P(T)=0.052× 0.012× 0.116=7.24× 10
− 5
Now that we have represented both forms of signature using probabilistic vectors of
length n=96,P andC say, we can directly compare the two signature types. In order
to measure the similarity between them we use cosine similarity, CS, defined as shown in
the following equation:
CS(P,C)=
P · C
||P||·||C||
=
P
n
i=1
P
i
C
i
P
n
i=1
P
i
P
n
i=1
C
i
Intuitively speaking, cosine similarity is the cosine of the angle between the two vectors.
As such, cosine similarity ranges from 0 to 1 (inclusive). If two mutational signatures
have a cosine similarity of 1, they must be identical, i.e., the angle between them is 0
◦ ; in
contrast, if two mutational signatures have a cosine similarity of 0, they are maximally
dissimilar (i.e., orthogonal). Computing the cosine similarity between the input signature
and each of the candidate signatures, and then sorting the similarities from highest to
lowest value, we identify the candidate signature with the highest similarity as the most
similar mutational signature.
74
4.2 Implementation
Fig. 4.1 Overview of three workflows in the iMutSig interface. The first two tabs allow
users to finding the most similar PM signature to an input COSMIC signature (highlighted
in green) and vice versa (highlighted in orange). In addition, users can identify the most
similar signatures from both data sources to an input signature (highlighted in blue).
4.2.1 Main features of iMutSig
iMutSig is built in R with its key features depending on the R package, pmsignature [6].
As shown in Figure 4.1, the Shiny app currently supports three possible workflows for
users to choose from, depending on the type of signatures they have already obtained: 1)
starting with a COSMIC signature; 2) starting with a PM signature; 3) starting with a
self-defined signature that could follow either the COSMIC or PM format.
The first tab in the Shiny app window, “COSMIC to pmsignature", allows users to
select an input COSMIC signature via a drop-down list and returns the best-matched
PM signature. The returned results are divided and organized separately in the top and
the bottom portion of the page. The top half tab summarizes background information
regarding the input signature by presenting: 1) visualized plots of the input signature
75
4.2 Implementation
and its membership among all cancer types, i.e., in which kind of cancers the mutational
signatures has been found; 2) a table showing the cosine similarity between this signature
and all PM signatures, sorted in decreasing order, along with a visualization of a similarity
heatmap with color and intensity proportional to assessed similarity. The bottom half
tab presents plots and descriptions of the input COSMIC signature, the most similar PM
signature, and a second PM signature that the user can select. Thus, users can easily
access all the vital information and results regarding these signatures rather than having
to manually gather and organize information from publications. The top half of the tab
will be automatically updated via a control panel in the middle section of the tab, which
enables users to select a signature to start with and also highlights information about the
currently selected signature, the most-similar signature to that signature, and the cosine
similarity.
The second tab was designed in a similar manner to the first tab but for the case in
which we are starting with a PM signature and looking for the most similar COSMIC
signature.
Unlike the first two tabs, the third tab enables users to enter a user-supplied signature,
which can be in either PM or COSMIC format, and then identify the most similar
signature from each online database. The user will be requested to enter a sub-menu
based on the type of the input signature and to upload a comma-separated values (CSV)
file containing a single signature. A sample CSV file is provided for download to give
the user a better sense of the format of the input file. Then, the tab will be updated to
display two tables, one from each data source, listing the signatures from that data source
and the cosine similarity of each signature to the user-uploaded signature. The tables are
ordered from most to least similar signature. In addition, the user is able to view figures
of the best-matched signatures (i.e., those with highest cosine similarity) from each data
76
4.3 Conclusions
source, allowing users to observe any similarities and dissimilarities. Below that, users
will see a list of cancer types that contain the best-matched signature, highlighting the
names of the cancer types that appear in both databases.
4.3 Conclusions
iMutSig is a user-friendly interactive browser-based application that allows users who
have a signature that they have discovered in an analysis of their own data to identify the
best-matched existing mutational signature from the COSMIC and PM databases. It also
allows users to directly compare signatures between the two databases. It does this in an
interactive way, and also allows straightforward visualization of results. iMutSig enables
researchers to easily identify the most similar mutational signature and to easily access
characteristic information from both data sources without additional software installation
and programming of their own.
77
Chapter 5
Conclusions and future work
In this last chapter, we will review the three main objectives of this thesis and the
implications of the entire thesis on modeling mutational signatures in cancer. In the first
section, we will summarize our main contributions to the signature-modeling community.
In the next section, we propose ideas that arise from these three objectives and their
implications for future work.
5.1 Conclusions
With the continuing advances of sequencing technologies, more data sets of somatic
mutations will be brought into this recent, growing field of signature research. The
validity of mutational signatures estimated from somatic mutations depends on the
quality of data and the number of signatures present in the data sets. Our first objective
in this thesis was to propose a method to determine the number of signatures by using an
efficient clustering algorithm. This was done by calculating the perplexity score on the
test data set using the estimates extracted from the training data set in a cross-validation
78
5.2 Future work
setting. The results of our simulation studies have shown that our new approach, R3PY,
provides a reliable method under the common scenarios considered, with the exception
that in very small catalogs sizes, AICc performs the best.
One of our major objectives in this thesis was to develop methodology for inferring
differences in mutational signature exposures across groups. We achieved this by esti-
mating the posterior distribution of mean exposure differences which can be identified
through the use of a hierarchical latent Dirichlet allocation model. Our method allowed
for explicit incorporation of uncertainty of exposure estimates into the statistical test
procedure. Experiments with our unified method, compared to a post hoc method, showed
higher statistical power for detecting differences. Two application of this method, to both
the USC colon data set and the publicly available EAC data set, showed its efficacy in
detecting significant differences as a function of time of occurrence in CRC cancer and
tumor site in EAC cancer.
Finally, our web application was built to serve as an easy-to-use tool which allows
users to select a signature and identify the most similar signature from another method.
This tool has essentially greatly simplified the task of picking the most similar signature
from one database compared to another.
5.2 Future work
This section briefly describes and summarizes some interesting research ideas which are
either a further extension of HiLDA or an enhancement of it. The first two points are
focused on feature enhancement on the existing method while the remaining three require
the development of new methods.
79
5.2 Future work
1. HiLDA allows the flexibility of estimating mutational signatures from a given data
set for the purpose of discovering new signatures. However, for those who do not
wish to do this, the mutational signatures can be treated as known information
and kept fixed during the analysis process. This approach is called "refitting" or
"signature refitting" [ 41]. So, one desirable enhancement to the current version of
HiLDA would be to include the ability to allow fixed signatures.
2. Currently, HiLDA is implemented under the assumption that samples from one
group are independent from those in the other group, aka, an unpaired test. We
might therefore also add an option to conduct a paired test.
3. HiLDA allows users to avoid any post hoc analyses by providing a unified estimation
and testing framework. However, the process of estimating the number of signatures
still follows an ad hoc method rather than being incorporated within the unified
method [72]. So, instead of treating K as fixed a priori, we might extend our model
to place a prior distribution on K (e.g., a Poisson distribution).
4. HiLDA uses MCMC and, as such, might sometimes be faced with issues of slow
convergence. There have been studies that looked at using variational Bayesian
inference (VB) to solve this issue [72]. VB approximates the posterior distribution
with the product of independent variational distributions. Based on existing work
in which VB is used in the context of LDA [42], we could incorporate an additional
distribution for the concentration variables in the Dirichlet distribution for each
group, which would then be estimated along with the mutational exposures and
mutational signatures.
80
References
[1] Michael R Stratton, Peter J Campbell, and P Andrew Futreal. The cancer genome.
Nature, 458(7239):719–724, 2009.
[2] Michael R Stratton. Exploring the genomes of cancer cells: progress and promise.
science, 331(6024):1553–1558, 2011.
[3] Austin M Dulak, Petar Stojanov, Shouyong Peng, Michael S Lawrence, Cameron Fox,
Chip Stewart, Santhoshi Bandla, Yu Imamura, Steven E Schumacher, Erica Shefler,
Aaron McKenna, Scott L Carter, Kristian Cibulskis, Andrey Sivachenko, Gordon
Saksena, Douglas Voet, Alex H Ramos, Daniel Auclair, Kristin Thompson, Carrie
Sougnez, Robert C Onofrio, Candace Guiducci, Rameen Beroukhim, Zhongren Zhou,
Lin Lin, Jules Lin, Rishindra Reddy, Rishindra Reddy, Rodney Landrenau, Arjun
Pennathur, Shuji Ogino, James D Luketich, Todd R Golub, Stacey B Gabriel, Eric S
Lander, David G Beer, Tony E Godfrey, Gad Getz, and Adam J Bass. Exome and
whole-genome sequencing of esophageal adenocarcinoma identifies recurrent driver
events and mutational complexity. Nature Genetics, 45(5):478, 2013.
[4] Jintao Guo, Jiankun Huang, Ying Zhou, Yulin Zhou, Liying Yu, Huili Li, Lingyun
Hou, Liuwei Zhu, Dandan Ge, Yuanyuan Zeng, et al. Germline and somatic variations
influence the somatic mutational signatures of esophageal squamous cell carcinomas
in a chinese population. BMC Genomics, 19(1):538, 2018.
[5] Ludmil B Alexandrov, Serena Nik-Zainal, David C Wedge, Peter J Campbell, and
Michael R Stratton. Deciphering signatures of mutational processes operative in
human cancer. Cell reports, 3(1):246–259, 2013.
[6] Yuichi Shiraishi, Georg Tremmel, Satoru Miyano, and Matthew Stephens. A simple
model-based approach to inferring and visualizing cancer mutation signatures. PLoS
genetics, 11(12):e1005657, 2015.
[7] Ludmil B Alexandrov, Serena Nik-Zainal, David C Wedge, Samuel AJR Aparicio,
Sam Behjati, Andrew V Biankin, Graham R Bignell, Niccolo Bolli, Ake Borg, Anne-
Lise Børresen-Dale, et al. Signatures of mutational processes in human cancer.
Nature, 500(7463):415, 2013.
81
References
[8] Ludmil Alexandrov, Jaegil Kim, Nicholas J Haradhvala, Mi Ni Huang, Alvin WT
Ng, Arnoud Boot, Kyle R Covington, Dmitry A Gordenin, Erik Bergstrom, Nuria
Lopez-Bigas, et al. The repertoire of mutational signatures in human cancer. BioRxiv,
page 322859, 2018.
[9] Douglas Hanahan and Robert A Weinberg. The hallmarks of cancer. cell, 100(1):
57–70, 2000.
[10] Alan F Rubin and Phil Green. Mutation patterns in cancer genomes. Proceedings of
the National Academy of Sciences, 106(51):21766–21770, 2009.
[11] Aziz Sancar, Laura A Lindsey-Boltz, Keziban Ünsal-Kaçmaz, and Stuart Linn.
Molecular mechanisms of mammalian DNA repair and the DNA damage checkpoints.
Annual review of biochemistry, 73(1):39–85, 2004.
[12] Maria Secrier, Xiaodun Li, Nadeera De Silva, Matthew D Eldridge, Gianmarco
Contino, Jan Bornschein, Shona MacRae, Nicola Grehan, Maria O’Donovan, Ahmad
Miremadi, et al. Mutational signatures in esophageal adenocarcinoma define etiolog-
ically distinct subgroups with therapeutic relevance. Nature genetics, 48(10):1131,
2016.
[13] Serena Nik-Zainal and Sandro Morganella. Mutational signatures in breast cancer:
the problem at the DNA level, 2017.
[14] Päivi Peltomäki. Deficient DNA mismatch repair: a common etiologic factor for
colon cancer. Human Molecular Genetics, 10(7):735–740, 2001.
[15] Hanane Omichessan, Gianluca Severi, and Vittorio Perduca. Computational tools
to detect signatures of mutational processes in DNA from tumours: a review and
empirical comparison of performance. bioRxiv, page 483982, 2019.
[16] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative
matrix factorization. Nature, 401(6755):788–791, 1999.
[17] Daniel D Lee and H Sebastian Seung. Algorithms for non-negative matrix fac-
torization. In Advances in neural information processing systems, pages 556–562,
2001.
[18] Julian S Gehring, Bernd Fischer, Michael Lawrence, and Wolfgang Huber. So-
maticsignatures: inferring mutational signatures from single-nucleotide variants.
Bioinformatics, 31(22):3673–3675, 2015.
[19] Francis Blokzijl, Roel Janssen, Ruben Van Boxtel, and Edwin Cuppen. Mutation-
alpatterns: comprehensive genome-wide analysis of mutational processes. Genome
medicine, 10(1):33, 2018.
82
References
[20] Kevin Gori and Adrian Baez-Ortega. sigfit: flexible bayesian inference of mutational
signatures. bioRxiv, page 372896, 2018.
[21] Rafael A Rosales, Rodrigo D Drummond, Renan Valieris, Emmanuel Dias-Neto, and
Israel T da Silva. signer: an empirical bayesian approach to mutational signature
discovery. Bioinformatics, 33(1):8–16, 2016.
[22] Samir Lal, Amy E McCart Reed, Xavier M de Luca, and Peter T Simpson. Molecular
signatures in breast cancer. Methods, 131:135–146, 2017.
[23] Alfréd Rényi et al. On measures of entropy and information. In Proceedings of the
Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1:
Contributions to the Theory of Statistics. The Regents of the University of California,
1961.
[24] Sandra Krüger and Rosario M Piro. decomptumor2sig: identification of mutational
signatures active in individual tumors. BMC bioinformatics, 20(4):152, 2019.
[25] Todd K Moon. The expectation-maximization algorithm. IEEE Signal processing
magazine, 13(6):47–60, 1996.
[26] Chris Ding, Tao Li, and Wei Peng. On the equivalence between non-negative matrix
factorization and probabilistic latent semantic indexing. Computational Statistics &
Data Analysis, 52(8):3913–3927, 2008.
[27] Cancer Genome Atlas Research Network. Integrated genomic characterization of
oesophageal carcinoma. Nature, 541(7636):169, 2017.
[28] Jiang Chang, Wenle Tan, Zhiqiang Ling, Ruibin Xi, Mingming Shao, Mengjie Chen,
Yingying Luo, Yanjie Zhao, Yun Liu, Xiancong Huang, et al. Genomic analysis of
oesophageal squamous-cell carcinoma identifies alcohol drinking-related mutation
signature and genomic alterations. Nature Communications, 8:15290, 2017.
[29] R Tyler Hillman, Gary B Chisholm, Karen H Lu, and P Andrew Futreal. Genomic
rearrangement signatures and clinical outcomes in high-grade serous ovarian cancer.
JNCI: Journal of the National Cancer Institute, 110(3):265–272, 2017.
[30] Eric Letouzé, Jayendra Shinde, Victor Renault, Gabrielle Couchy, Jean-Frédéric
Blanc, Emmanuel Tubacher, Quentin Bayard, Delphine Bacq, Vincent Meyer, Jérémy
Semhoun, et al. Mutational signatures reveal the dynamic interplay of risk factors
and cellular processes during liver tumorigenesis. Nature Communications, 8(1):1315,
2017.
83
References
[31] Bettina Meier, Nadezda V Volkova, Ye Hong, Pieta Schofield, Peter J Campbell,
Moritz Gerstung, and Anton Gartner. Mutational signatures of DNA mismatch
repair deficiency in C. elegans and human cancers. Genome Research, 28(5):666–675,
2018.
[32] NJ Haradhvala, J Kim, YE Maruvka, P Polak, D Rosebrock, D Livitz, JM Hess,
I Leshchiner, A Kamburov, K. W. Mouw, M. S. Lawrence, and G. Getz. Distinct
mutational signatures characterize concurrent loss of polymerase proofreading and
mismatch repair. Nature Communications, 9(1):1746, 2018.
[33] Tingting Qin, Yanxiao Zhang, Katie R Zarins, Tamara R Jones, Shama Virani,
Lisa A Peterson, Jonathan B McHugh, Douglas Chepeha, Gregory T Wolf, Laura S
Rozek, and Maureen A. Sartor. Expressed hnscc variants by hpv-status in a well-
characterized michigan cohort. Scientific Reports , 8(1):11458, 2018.
[34] Magali Olivier, Liacine Bouaoun, Stephanie Villar, Alexis Robitaille, Vincent Cahais,
Adriana Heguy, Graham Byrnes, Florence Le Calvez-Kelm, Gabriela Torres-Mejia,
Isabel Alvarado-Cabrero, et al. Molecular features of premenopausal breast cancers
in Latin American women: Pilot results from the precama study. PloS one, 14(1):
e0210372, 2019.
[35] Adrian Baez-Ortega and Kevin Gori. Computational approaches for discovery of
mutational signatures in cancer. Briefings in Bioinformatics , 20(1):77–88, 2017.
[36] Ruta Sahasrabudhe, Paul Lott, Mabel Bohorquez, Ted Toal, Ana P Estrada, John J
Suarez, Alejandro Brea-Fernández, José Cameselle-Teijeiro, Carla Pinto, Irma Ramos,
et al. Germline mutations in palb2, brca1, and rad51c, which regulate DNA recom-
bination repair, in patients with gastric cancer. Gastroenterology, 152(5):983–986,
2017.
[37] Anand Mayakonda, De-Chen Lin, Yassen Assenov, Christoph Plass, and H Phillip
Koeffler. Maftools: efficient and comprehensive analysis of somatic variants in cancer.
Genome research, 28(11):1747–1756, 2018.
[38] Zhi Yang, Priyatama Pandey, Darryl Shibata, David V Conti, Paul Marjoram, and
Kimberly Siegmund. HiLDA: a statistical approach to investigate differences in
mutational signatures. PeerJ, 7:e7557, 2019.
[39] Stefan Gröschel, Daniel Hübschmann, Francesco Raimondi, Peter Horak, Gregor
Warsow, Martina Fröhlich, Barbara Klink, Laura Gieldon, Barbara Hutter, Kortine
Kleinheinz, et al. Defective homologous recombination DNA repair as therapeutic
target in advanced chordoma. Nature Communications, 10(1):1635, 2019.
84
References
[40] Ege Ülgen, Özge Can, Kaya Bilguvar, Yavuz Oktay, Cemaliye B Akyerli, Ayça Erşen
Danyeli, M Cengiz Yakıcıer, O Uğur Sezerman, M Necmettin Pamir, and Koray
Özduman. Whole exome sequencing–based analysis to identify DNA damage repair
deficiency as a major contributor to gliomagenesis in adult diffuse gliomas. Journal
of Neurosurgery, 1(aop):1–12, 2019.
[41] Hanane Omichessan, Gianluca Severi, and Vittorio Perduca. Computational tools
to detect signatures of mutational processes in DNA from tumours: a review and
empirical comparison of performance. bioRxiv, 2018. doi: 10.1101/483982. URL
https://www.biorxiv.org/content/early/2018/12/04/483982.
[42] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet Allocation.
Journal of machine Learning research, 3(Jan):993–1022, 2003.
[43] Hamparsum Bozdogan. Model selection and akaike’s information criterion (aic): The
general theory and its analytical extensions. Psychometrika, 52(3):345–370, 1987.
[44] Gideon Schwarz et al. Estimating the dimension of a model. The annals of statistics,
6(2):461–464, 1978.
[45] Kenneth P Burnham and David R Anderson. Multimodel inference: understanding
aic and bic in model selection. Sociological methods & research, 33(2):261–304, 2004.
[46] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical
learning, volume 1. Springer series in statistics New York, 2001.
[47] Taku Tsukamoto, Masakazu Nakano, Ryuichi Sato, Hiroko Adachi, Miki Kiyota, Eri
Kawata, Nobuhiko Uoshima, Satoru Yasukawa, Yoshiaki Chinen, Shinsuke Mizutani,
et al. High-risk follicular lymphomas harbour more somatic mutations including
those in the aid-motif. Scientific Reports , 7(1):14039, 2017.
[48] Martyn Plummer et al. Jags: A program for analysis of bayesian graphical models
using gibbs sampling. In Proceedings of the 3rd international workshop on distributed
statistical computing, volume 124, page 125. Vienna, Austria, 2003.
[49] David Spiegelhalter, Andrew Thomas, Nicky Best, and Dave Lunn. Winbugs user
manual, 2003.
[50] Bradley P Carlin and Siddhartha Chib. Bayesian model choice via markov chain
monte carlo methods. Journal of the Royal Statistical Society. Series B (Methodolog-
ical), pages 473–484, 1995.
[51] Andrew Gelman, Donald B Rubin, et al. Inference from iterative simulation using
multiple sequences. Statistical Science, 7(4):457–472, 1992.
85
References
[52] Tom Lodewyckx, Woojae Kim, Michael D Lee, Francis Tuerlinckx, Peter Kuppens,
and Eric-Jan Wagenmakers. A tutorial on bayes factor estimation with the product
space method. Journal of Mathematical Psychology, 55(5):331–347, 2011.
[53] Harold Jeffreys. The theory of probability. OUP Oxford, 1998.
[54] Eiiti Kasuya. Mann-whitney u test when variances are unequal. Animal Behaviour,
6(61):1247–1249, 2001.
[55] Graeme D Ruxton. The unequal variance t-test is an underused alternative to
student’s t-test and the mann–whitney u test. Behavioral Ecology, 17(4):688–690,
2006.
[56] R Core Team. R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna, Austria, 2017. URL https://www.
R-project.org/.
[57] Kimberly Siegmund and Darryl Shibata. At least two well-spaced samples are needed
to genotype a solid tumor. BMC cancer, 16(1):250, 2016.
[58] Mark A DePristo, Eric Banks, Ryan Poplin, Kiran V Garimella, Jared R Maguire,
Christopher Hartl, Anthony A Philippakis, Guillermo Del Angel, Manuel A Rivas,
Matt Hanna, et al. A framework for variation discovery and genotyping using
next-generation DNA sequencing data. Nature genetics, 43(5):491–498, 2011.
[59] Kristian Cibulskis, Michael S Lawrence, Scott L Carter, Andrey Sivachenko, David
Jaffe, Carrie Sougnez, Stacey Gabriel, Matthew Meyerson, Eric S Lander, and Gad
Getz. Sensitive detection of somatic point mutations in impure and heterogeneous
cancer samples. Nature biotechnology, 31(3):213–219, 2013.
[60] Christopher T Saunders, Wendy SW Wong, Sajani Swamy, Jennifer Becq, Lisa J
Murray, and R Keira Cheetham. Strelka: accurate somatic small-variant calling from
sequenced tumor–normal sample pairs. Bioinformatics, 28(14):1811–1817, 2012.
[61] Marc D Ryser, Byung-Hoon Min, Kimberly D Siegmund, and Darryl Shibata. Spatial
mutation patterns as markers of early colorectal tumor cell mobility. Proceedings of
the National Academy of Sciences, 115(22):5774–5779, 2018.
[62] Ethan Cerami, Jianjiong Gao, Ugur Dogrusoz, Benjamin E Gross, Selcuk Onur
Sumer, Bülent Arman Aksoy, Anders Jacobsen, Caitlin J Byrne, Michael L Heuer,
Erik Larsson, Yevgeniy Antipin, Boris Reva, Arthur P. Goldberg, Chris Sander, and
Nikolaus Schultz. The cbio cancer genomics portal: an open platform for exploring
multidimensional cancer genomics data, 2012.
86
References
[63] Jianjiong Gao, Bülent Arman Aksoy, Ugur Dogrusoz, Gideon Dresdner, Benjamin
Gross, S Onur Sumer, Yichao Sun, Anders Jacobsen, Rileen Sinha, Erik Larsson,
Ethan Cerami, Chris Sander, and Nikolaus Schultz. Integrative analysis of complex
cancer genomics and clinical profiles using the cbioportal. Science Signaling, 6(269):
pl1–pl1, 2013.
[64] Zhi Yang, Priyatama Pandey, Paul Marjoram, and Kimberly D Siegmund. iMutSig:
a web application to identify the most similar mutational signature using shiny, 2019.
https://zhiyang.shinyapps.io/imutsig/.
[65] Marc J Williams, Benjamin Werner, Chris P Barnes, Trevor A Graham, and Andrea
Sottoriva. Identification of neutral tumor evolution across cancer types. Nature
Genetics, 48(3):238, 2016.
[66] Katia Nones, Nicola Waddell, Nicci Wayte, Ann-Marie Patch, Peter Bailey, Felicity
Newell, Oliver Holmes, J Lynn Fink, Michael CJ Quinn, Yue Hang Tang, et al.
Genomic catastrophes frequently arise in esophageal adenocarcinoma and drive
tumorigenesis. Nature Communications, 5:5224, 2014.
[67] Simon A Forbes, David Beare, Harry Boutselakis, Sally Bamford, Nidhi Bindal,
John Tate, Charlotte G Cole, Sari Ward, Elisabeth Dawson, Laura Ponting, et al.
Cosmic: somatic cancer genetics at high-resolution. Nucleic Acids Research, 45(D1):
D777–D783, 2016.
[68] Thomas Helleday, Saeed Eshtad, and Serena Nik-Zainal. Mechanisms underlying
mutational signatures in human cancers. Nature Reviews Genetics, 15(9):585, 2014.
[69] Serena Nik-Zainal, Peter Van Loo, David C Wedge, Ludmil B Alexandrov, Christo-
pher D Greenman, King Wai Lau, Keiran Raine, David Jones, John Marshall, Manasa
Ramakrishna, et al. The life history of 21 breast cancers. Cell, 149(5):994–1007,
2012.
[70] Kornelius Schulze, Sandrine Imbeaud, Eric Letouzé, Ludmil B Alexandrov, Julien
Calderaro, Sandra Rebouissou, Gabrielle Couchy, Clément Meiller, Jayendra Shinde,
Frederic Soysouvanh, et al. Exome sequencing of hepatocellular carcinomas identifies
new mutational signatures and potential therapeutic targets. Nature Genetics, 47
(5):505, 2015.
[71] Akira Yokoyama, Nobuyuki Kakiuchi, Tetsuichi Yoshizato, Yasuhito Nannya, Hi-
romichi Suzuki, Yasuhide Takeuchi, Yusuke Shiozawa, Yusuke Sato, Kosuke Aoki,
Soo Ki Kim, et al. Age-related remodelling of oesophageal epithelia by mutated
cancer drivers. Nature, 565(7739):312, 2019.
87
References
[72] Taro Matsutani, Yuki Ueno, Tsukasa Fukunaga, and Michiaki Hamada. Discovering
novel mutation signatures by latent Dirichlet allocation with variational bayes
inference. Bioinformatics, 2019.
88

Abstract (if available)

Linked assets

University of Southern California Dissertations and Theses

Conceptually similar

PDF

An analysis of conservation of methylation

PDF

Latent unknown clustering with integrated data (LUCID)

PDF

Hierarchical regularized regression for incorporation of external data in high-dimensional models

PDF

Ancestral inference and cancer stem cell dynamics in colorectal tumors

PDF

Associations between perfluoroalkyl substances exposure and metabolic pathways in youth

PDF

Evolutionary genomic analysis in heterogeneous populations of non-model and model organisms

PDF

Generalized linear discriminant analysis for high-dimensional genomic data with external information

PDF

Best practice development for RNA-Seq analysis of complex disorders, with applications in schizophrenia

PDF

Two-step study designs in genetic epidemiology

PDF

The accumulation of somatic mutations in humans with age

PDF

Incorporating prior knowledge into regularized regression

PDF

Use of cell-free nucleic acids in associating PD-L1 gene expression with presence of driver mutations in DNA and demographics across different cancers

PDF

DNA methylation and gene expression profiles in Vidaza treated cultured cancer cells

PDF

Computational analysis of genome architecture

PDF

Using artificial neural networks to estimate evolutionary parameters

PDF

Prediction modeling with meta data and comparison with lasso regression

PDF

Bayesian model averaging methods for gene-environment interactions and admixture mapping

PDF

Using multi-level Bayesian hierarchical model to detect related multiple SNPs within multiple genes to disease risk

PDF

Understanding DNA methylation and nucleosome organization in cancer cells using single molecule sequencing

PDF

Gene expression and angiogenesis pathway across DNA methylation subtypes in colon adenocarcinoma

Asset Metadata

Creator Yang, Zhi (author)

Core Title Modeling mutational signatures in cancer

School Keck School of Medicine

Degree Doctor of Philosophy

Degree Program Biostatistics

Publication Date 12/10/2019

Defense Date 10/21/2019

Publisher University of Southern California (original), University of Southern California. Libraries (digital)

Tag cancer,clustering,deconvolution,latent Dirichlet allocation,mutational signatures,OAI-PMH Harvest,somatic mutations

Language English

Contributor Electronically uploaded by the author (provenance)

Advisor Siegmund, Kimberly (committee chair), Marjoram, Paul (committee member), Conti, David (committee member), Shibata, Darryl (committee member)

Creator Email zhiyang@usc.edu,zyang895@gmail.com

Permanent Link (DOI) https://doi.org/10.25549/usctheses-c89-250218

Unique identifier UC11673597

Identifier etd-YangZhi-8034.pdf (filename),usctheses-c89-250218 (legacy record id)

Legacy Identifier etd-YangZhi-8034.pdf

Dmrecord 250218

Document Type Dissertation

Rights Yang, Zhi

Type texts

Source University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)

Access Conditions The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...

Repository Name University of Southern California Digital Library

Repository Location USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA