Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
The accumulation of somatic mutations in humans with age
(USC Thesis Other)
The accumulation of somatic mutations in humans with age
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
The Accumulation of Somatic Mutations in Humans with Age
By
Shuangchao Ma
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Cancer Biology and Genomics)
May 2019
1
Table of Contents
ACKNOWLEDGEMENTS 3
LIST OF FIGURES 6
LIST OF TABLES 9
ABSTRACT 11
Chapter 1: INTRODUCTION 12
1.1 Aging and DNA Mutation 12
1.1.1. Types and Mechanisms of Mutations 13
1.2 The Challenge of Studying Somatic Mutations 15
1.3 Our Approach via Colon Crypts 16
1.4 Previous Studies 18
1.5 Methods Used 20
1.5.1 Next Generation (Illumina) Sequencing 20
1.5.2 Sequencing Data Analysis 23
1.5.3 Novel Software 26
Chapter 2. METHODS 28
2.1 Subject and Sample Information 28
2.2 Crypt Isolation 31
2.3 Library Preparation 32
2.3.1 Bulk Sample Genomic DNA Extraction 32
2.3.2 Library Preparation (Bulk and Crypt) 33
2.4 Next Generation Sequencing 36
2.5 Data Analysis 37
2.5.1 Single Nucleotide Variation Calling 39
2.5.2 Deletion (>50bps) Calling 40
2.5.3 Other Analyses 41
2.6 Software Development 42
2.6.1 Motif and Breakpoint Statistical Analysis Tool Kit 42
2.6.2 Slippage-Associated Repeat Identification
and Analysis 44
Chapter 3. RESULTS 46
3.1 Library and Sequencing Read Quality Overview 46
3.2 Single Nucleotide Variations Accumulate with Age 49
3.2.1 Software Output Overlap 49
3.2.2 Additional Filters 53
3.2.3 SNVs Observed 55
3.2.4 Normalization of Counts 57
3.2.5 Annotation 63
2
3.2.6 Statistical Analysis: SNVs Accumulate with Age 66
3.3 Large Deletions Accumulate with Age 70
3.3.1 Identification of Deletions 70
3.3.2 Large Deletions Increase in Number and
Size with Age. 74
3.3.3 Deletion Frequencies Observed Are Consistent
with Our Previous Study 77
3.4 Testing Novel Software 78
3.4.1 Motif and Breakpoint Statistical Analysis Tool Kit 78
3.4.2 Slippage-Associated Repeat Identification
and Analysis 79
3.5 Alignment Free Analysis 81
Chapter 4. DISCUSSION 83
4.1 Summary 83
4.2 Mutations Observed and Comparison to Existing
Studies 83
4.2.1 Mutation Accumulation 84
4.2.2 Comparison to Other Studies 86
4.3 Mutations as a Mechanism of Aging 87
4.4 Future Directions 88
BIBLIOGRAPHY 90
APPENDIX I: Additional Material for the Thesis Project 98
Appendix I.1 All Library Preparation 98
Appendix I.2 Supplementary Data for SNV Counts 100
Appendix I.3 Supplementary Data for Deletions 124
Appendix I.4 Supplementary Data for Discussion 124
APPENDIX II: MISCELLANEOUS PROJECTS 130
PROJECT I 130
Project I.1 Key Questions Addressed 130
Project I.2 Methods 133
Project I.3 Results 136
Project I.4 Discussion and Conclusion 139
Project I.5 Future Work 140
PROJECT II 141
Project II.1 Introduction 141
Project II.2 Procedure and Results 141
3
ACKNOWLEDGEMENTS
I would like to thank my PhD advisor, Dr. Michael Lieber for
laying out this incredible aging project in front of me when I entered
USC, seeking excitement in science. I want to thank him for his
continuous support throughout the past five and a half years. I have
learned so much from him and his lab. It has been an incredible
journey.
I would like to thank all the professors who contributed to this
project. Without them, we would not be able to come this far and reach
the conclusions we did. Dr. Chih-Lin Hsieh has been the lead designer
for this project, especially for the sequencing library preparation
procedure. She is the one who enabled us to work with such a small
amount of starting material, and from her, I learned how essential
quantification and controls are for a project. Dr. Graham Casey kindly
provided all the crypts for the middle-aged and old subject groups in
this project and has supported it since the start. Dr. Tracy Grikscheit
provided all the colon specimens for the young subject group. Dr.
David Van Den Berg has advised us on sequencing design. Dr. David
W. Craig, Dr. Kai Wang and Dr. Paul Thomas have contributed
tremendously in the pipeline design, software development as well as
mentoring me on programming. Dr. Michael Press, the chair of my
qualifying exam committee and dissertation committee, has always
4
been very close to my lab and this project. Dr. Susan Groshen spent
hours sitting down with me to go over the statistical designs and
calculations.
I want to thank Laura-Marie Nucho, Jessica Brown and
Elizabeth Gilliam from CHLA who helped me collect colon specimens.
Previous Lieber lab members Dr. Albert Tsai and Dr. Zhengfei Lu for
contributing to the previous studies that laid the foundation for this
project. Jason Kim, Stephanie Tring, and Sue Suyeon Ryu from the
Genomics Core at the Keck Campus, who helped us with BioAnalyer,
TapeStation, and OD readings. I want to thank Dr. Baruch Frenkel, Dr.
Ite Offringa, Joyce Perez, Bami Andrada, Marisela Zuniga and Ashley
Flinn from the PIBBS Office and my department for their support
throughout my PhD years and on all my fellowship applications.
I want to thank Dr. Hui Yang, Dr. Yunfei Guo and Chris
Armoskus from Dr. Kai Wang’s lab for teaching me Python and R
basics and for helping me debug my code in the first few years of my
PhD. I want to thank Erin Shaw, Cesar Sul, Maureen Dougherty,
Avalon Johnson, Bill Jendrzejek, Chris Ho, Todd Stein, and many
others from USC HPCC for helping me with Shell, Python and R
questions so that my modules and scripts can be correctly
implemented.
I also want to thank all previous and current Lieber lab members
for aiding this project in routine discussions: Dr. Zheng Zhang, Dr.
Howard Chang, Dr. Noriko Shimazaki, Dr. Nicholas Punnunzio, Dr. Go
Watanabe, Dr. Bailin Zhao, Zitadel Anne Esguerra, Christina
Gerodimos, Di Liu, and Cindy Okitsu-Yen.
I thank my parents and my family for their continuous emotional
(and financial) support of me. I want to especially thank my grandfather
for everything he did for me. I will forever miss him. I want to thank my
mother for always being my friend, always being available to talk to me,
and always believing in me.
5
Finally, I would like to thank my husband, Sandy, for being my
grammar checker, space heater, heavy object lifter, scientific adviser,
good mood provider, and detail implementor. We are so different and
almost never agree with each other, yet we are the best collaborators
and friends. I have been, and always will be grateful for his love and
heart in life.
6
LIST OF FIGURES
Figure 1. Illustration of Cell Divergence with the Accumulation
of Somatic Mutations. 14
Figure 2. Illustration of How Sequencing Multiple Cells Leads
to A Consensus Sequence. 16
Figure 3. EM and Illustrated View of Colon Crypts. 17
Figure 4. Colon Crypt Regeneration Dynamics. 18
Figure 5. Large Deletion in One Crypt Identified through
SNP-chips. 19
Figure 6. Preparation of An Illumina Sequencing Library. 21
Figure 7. Design of An Illumina Flow Cell. 22
Figure 8. Examples of DNA Hairpins. 26
Figure 9. A Fresh Colon Sample Provided by CHLA. 28
Figure 10 A Single Colon Crypt under Light Microscopy
(from the 15-Year-Old Individual). 32
Figure 11. Examples of BioAnalyzer Outputs. 36
Figure 12. Simplified Single Nucleotide Variant Calling
Workflow. 37
Figure 13 Reads from Different Lanes Were Handled
Separately 38
Figure 14 Structural Variant Detection Workflow 40
Figure 15. The Macrogen Data Analysis Workflow. 42
Figure 16. GATK (vcf1) and IVC (vcf2) Overlap. 50
Figure 17. Overlap of Variant Callers for the L047 Crypt
from the 39 Year Old Subject. 51
Figure 18. Manual Check Reveals Reliable Mutations in the
Overlap of Strelka and GATK Calls. 53
7
Figure 19. Portion of Each SNV Type in Each Sample. 56
Figure 20. Normalized SNV Counts for All Samples. 61
Figure 21. Normalized SNV Counts (Sorted by Age) 62
Figure 22. Normalized Exon SNV Counts (Sorted by Age). 65
Figure 23. Output of the SAS “PRC GLM” Function. 68
Figure 24. Output of the SAS “lsmeans” Function. 69
Figure 25. IGV Output Showing a Deletion Called by Manta. 71
Figure 26. Sashimi Plot Shows Apparent Drop in Coverage
at Deletion Site Called by Manta. 72
Figure 27. Deletions Larger than 3000 Base Pairs per
Crypt Vs. Age. 75
Figure 28. Deletions Smaller than 2000 Base Pairs per
Crypt Vs. Age. 75
Figure 29. Results from the MBSAT R-Package Are
Consistent with Those Obtained in (Tsai & Lieber, 2010). 79
Figure 30. A Partial SARIA Output Table for the E2A Region. 80
Figure 31. Graphical Output from SARIA Analysis of the
BCL2 Fragile Zone Region. 81
Figure 32. CAFE Analysis of the Difference Between 4 Crypts. 82
Figure 33. Schematic overview of the Experimental Setup
Used in Another Study. 86
Figure 34. SNV Counts (Sorted by Age) With Only the
15X Coverage Filter Applied. 105
Figure 35. SNV Counts (Sorted by Age) With the 15X
Coverage and Heterogeneity Filters Applied. 106
Figure 36. SNV Counts (Sorted by Age) With Only the
20X Coverage Filter Applied. 109
Figure 37. SNV Counts (Sorted by Age) With the 20X
Coverage and Heterogeneity Filters Applied. 110
Figure 38. Normalized SNV Counts for All Samples Give
a 20X Cutoff. 114
Figure 39. Normalized SNV Counts (Sorted by Age) Given
a 20X Cutoff. 115
Figure 40. Exon SNV Counts (Sorted by Age)
(non-Normalized). 118
Figure 41. Exon SNV Counts (Sorted by Age)
(non-Normalized) Given a 20X Cutoff. 121
8
Figure 42. Normalized Exon SNV Counts (Sorted by
Age) Given a 20X Cutoff. 123
Figure 43. Average Deletion Size Increases with Age. 124
Figure 44. IGV Output Showing a Deletion With Possible
Polyclonality. 128
Figure 45 Workflow of This Study 134
Figure 46 Read depth count distributions (data from
sample: 305_LP6005390-DNA_G08). 137
Figure 47 MR and SRWGS consistency based on
SRWGS position quality (data from sample:
305_LP6005390-DNA_G08). 138
Figure 48 VCF quality score versus SRWGS and
MR consistency (data from sample:
305_LP6005390-DNA_G08) 139
Figure 49 Flowchart for Oligomer Ligation Sequencing
Data Analysis 142
9
LIST OF TABLES
Table 1. Individuals Whose Samples Were Obtained from. 29
Table 2. Sequencing Libraries Generated in This Study. 29
Table 3. Pre-sequencing PCR Program. 35
Table 4. Sample Duplication and Coverage Rates 47
Table 5 Normalization Values (15X) for Each Sample. 57
Table 6. Normalized Counts for All Samples (15X). 59
Table 7. Normalized Exon SNV Counts for All Samples (15X). 63
Table 8. Ti/Tv Ratio in Different Genomic Regions. 66
Table 9. Least Squares Means of the Log of SNV Counts for
Each Age Group. 70
Table 10. Identified Deletions in Each Sample Sorted by Size. 73
Table 11. Number of Deletions Observed in Each Size
Distribution. 74
Table 12. Size of all Identified Deletions. 75
Table 13. Counts of Large Deletions/Crypt From This
Study and (J. Hsieh et al., 2013). 77
Table 14. Fragil Zone Regions Used for Testing MBSTAT. 78
Table 15. Summary of All Prepared Libraries. 98
Table 16. GATK and Strelka SNV Calls Overlap and
Divergence. 100
Table 17. SNV Filtration Based on Position. 101
Table 18. SNV Counts for Each Crypt at Each of the Applied
Filtering Steps (15X cutoff for the Coverage Step). 103
Table 19. SNV Counts for Each Crypt at Each of the Applied
Filtering Steps (20X cutoff for the Coverage Step). 107
10
Table 20. Normalization Values for Each Sample
(20X Cutoff). 111
Table 21. Normalized SNV Counts for All Samples
(20X Cutoff). 112
Table 22. SNV Counts and Ti/Tv Ratios for Exons,
Introns and Non-Transcribed Regions for Each Sample
(15X Cutoff). 116
Table 23. SNV Counts and Ti/Tv Ratios for Exons,
Introns and Non-Transcribed Regions for Each Sample
(20X Cutoff). 119
Table 24. Normalized Counts for All Samples (20X Cutoff). 122
Table 25. Impact of Excluding SNVs Found in dbSNP on
Counts and Ti/Tv Ratios. 124
Table 26. Impact of Excluding Exon SNVs Found in dbSNP
on Counts and Ti/Tv Ratios. 126
Table 27 Partial Illumina MR output 132
Table 28 Heterogeneity Comparison of MR vs SRWGS 136
11
ABSTRACT
The rate and nature of somatic mutations that accumulate in
humans with age have long been subjects of interest for a variety of
reasons. These mutations are known to be the basis of cancer and
may also be a factor in the process of aging. A detailed profile showing
when and where these mutations arise may help explain the
occurrence of cancer, assess their role in aging, and offer insights into
their mechanisms.
Despite this interest, studying somatic mutations has long been
a challenge because it requires single cell sequencing, something that
is not possible without cell expansion in culture or DNA amplification,
both of which can introduce their own mutations.
In this study, we address this challenge by using individual
colon crypts to effectively represent a single stem cell that gave rise to
the crypt shortly beforehand. Using these crypts, we study somatic
mutations in young, middle aged, and old subjects. Our data suggests
that single nucleotide variation (SNV) somatic mutations increase with
age throughout life, but the difference between young and middle aged
subjects seems to be considerably more than the difference between
middle aged and old subjects. Our data also suggests that deletions
larger than 3000 bps also increase with age. The mutational profiles
are consistent with multiple mechanisms.
12
Chapter 1: INTRODUCTION
1.1 Aging and DNA Mutation
Aging is associated with increased risk of cancer,
cardiovascular disease, neurodegenerative disorders and numerous
other health problems (Crowson et al., 2013; Intlekofer & Cotman,
2013; Lieber & Karanjawala, 2004). Despite the universality of aging in
most metazoans, the question of why organisms undergo a
degenerative aging process remains unanswered and subject to
debate. Several hypotheses have been put forward, such as the
molecular damage caused by the continuous generation of harmful
free radicals, malfunctioning cellular structures (such as lysosomes),
the slow misfolding and degradation of essential proteins, and the
accumulation of harmful somatic mutations (da Costa et al., 2016). In
this study, we address the last of these hypotheses to see if mutations
do accumulate over the human lifespan in a manner consistent with
aging. We want to highlight that though other studies intended to
address this, we sought to examine this question with a minimal
amount of DNA amplification and no in vitro cell expansion via cell
culturing, to minimize artefactual changes introduced by these
processes.
13
1.1.1. Types and Mechanisms of Mutations
Mutations can occur through a variety of mechanisms. These
mechanisms can be grouped into two large categories: mutations that
occur due to mistakes during DNA replication, and those that occur
outside of replication.
The first category mostly exists due to failures in the replication
machinery or process and includes what is arguably the simplest of all
mutational mechanisms: mistakes by the DNA polymerases. It also
includes errors due to DNA slippage, which is when the DNA duplex
open and re-anneals in a mis-aligned state at small repeat regions.
Such a misalignment can lead to duplications or deletions as
replication continues, and can leave parts of the DNA single stranded,
making it potentially more vulnerable to mutations (Kimura, Iyehara-
Ogawa, & Kato, 1994).
The second category includes a range of mechanisms, and
typically begins with damage to one DNA strand. This damage may be
from the cell’s own machinery, such as the off target Recombination-
Activating Gene (RAG) proteins activity during VDJ recombination in
precursor B-cells that can lead to lymphoma (Tsai & Lieber, 2010), or
from foreign factors such as the crosslinking of pyrimidine dimers by
UV light (da Costa et al., 2016). Regardless of the origin of the
damage, failure of the cell’s DNA damage repair mechanisms to
correct this, or the usage of error prone or inherently inaccurate
correction methods (such as non-homologous end joining) will lead to
mutations.
If the mutation is not corrected by the cell’s mutation detection
and repair mechanisms, it will be passed down to at least one daughter
cell and then propagated further. Through this mechanism, mutations
can steadily build up in a cell lineage with increased rounds of cell
division. Such mutations can have a negative impact on the individual.
The sum of these negative impacts, from numerous mutations, is
14
hypothesized to lead to aging and other diseases (da Costa et al.,
2016).
Each cell in the human body has a unique mutation profile,
consisting of mutations acquired during that cell’s own lifetime and
mutations accumulated from its progenitors. As humans age, the
mutation profiles between different cells grow further apart, as they
separate in time and generations. Figure 1 illustrates this.
Figure 1. Illustration of Cell Divergence with the Accumulation of Somatic Mutations.
All cells are identical at the time of conception but become more different as each
one progresses down a different mutational lineage. In this example, all cells
originate from a single cell (left). One daughter cell acquires a mutation (blue dot) that
is inherited by both of its daughter cells, which also acquire their own unique
mutations. Through this trend, a unique mutation profile builds up in each cell.
Although there is at least a very small chance for a mutation to be reversed by an
opposite back mutation (orange dot).
This phenomenon has not been well studied. For example, we
still do not know the extent and rate of genetic divergence caused by
15
mutation accumulation. In this project, we aim to provide insight on
these unknowns and provide tools to help solve answer questions.
1.2 The Challenge of Studying Somatic
Mutations
To study the accumulation of somatic mutations in humans with
age, we would like to simply take a cell from an individual person and
determine how much the sequence of this cell differs from the person’s
sequence at the time of conception. All differences observed are
mutations that have occurred as the person aged. By doing this
comparison for multiple individuals from different age groups, we can
determine how mutations accumulate in humans in general with age.
There are two practical issues that prevent us from using the
method just described. First, it is, not possible to directly obtain the
sequence of an individual at the time of conception. Second, it is not
yet possible to sequence single cells with the desired degree of
accuracy. Methods for obtaining the sequences of single cells exist,
but these methods depend on either extensive ex-vivo culturing, which
places cells in a non-realistic environment and can introduce mutations
due to oxidative stress, or depend on extensive DNA amplification,
which has the potential to introduce artefactual mutations on its own
and can lead to over-PCR amplification of one strand.
To address the second issue, we could also consider
sequencing a group of cells rather than just one. However, this would
lead to an averaged genomic profile. In fact, it would give a consensus
sequence of all cells that effectively represents their common ancestor
(Figure 2). If the cells are a random grouping, then this ancestor cell
was likely present during or shortly after development, and thus cannot
be used for studying mutations with age.
16
Figure 2. Illustration of How Sequencing Multiple Cells Leads to A Consensus Sequence.
If numerous cells with different mutational signatures (marks) are sequenced, telling
true mutations from sequencing errors and normal heterozygosity becomes
impossible, and only a consensus sequence where all mutations are averaged out
can be obtained.
1.3 Our Approach via Colon Crypts
In order to work around the previously discussed challenges, we
have developed a model that depends on the consensus sequence
just discussed and the biology of the human colon. The human colon is
covered in crypts made up of 1000-2000 cells (Figure 3), and each
crypt is being continuously regenerated by a small niche of stem cells
located at its base (Figure 4). The effect of this is that each colon crypt
can be completely traced back to a single stem cell that existed only a
few months ago (J. Hsieh, Van Den Berg, Kang, Hsieh, & Lieber, 2013;
Snippert et al., 2010; Yatabe, Tavaré, & Shibata, 2001). Thus, when
17
we sequence all the cells of a single colon crypt, we are effectively
sequencing a single cell that recently existed.
Now that we have a way of sequencing a current single cell, we
need to obtain the individual’s sequence at the time of conception or
shortly after. For this we simply take a large (~2 mm
2
) piece of bulk
colon tissue and sequence it. Since it is extremely rare for crypts to
split or fuse (instead they grow essentially independently of each other)
(J. Hsieh et al., 2013; Yatabe et al., 2001), the consensus sequence
obtained from this bulk colon segment will be from a point very early in
development.
Now that we have our two sequences as previously described,
identifying mutations accumulated over a lifetime is just a matter of
comparing them.
Figure 3. EM and Illustrated View of Colon Crypts.
Taken from Figure 1b (Barker, 2014).The crypt is continuously regenerated from a
small number of stem cells at the base. As these stem cells divide, all cells are
pushed upward, towards the lumen, where they eventually shed. Thus, the entire
colon crypt is being continuously regenerated and all cells in it can be traced back to
a stem cell that existed a short time ago. TA cell = transit amplifying cell, LGR5 stem
cell = Leucine-rich repeat-containing G-protein coupled receptor 5 stem cell.
18
Figure 4. Colon Crypt Regeneration Dynamics.
Adopted from Figure 1 (J. Hsieh et al., 2013). Colon crypts are continuously
regenerated from a small stem cell niche at their base. A mutation that occurs in one
of these stem cells can quickly take over the entire crypt.
1.4 Previous Studies
We previously performed a SNP (single nucleotide
polymorphism)-chip based study using a very similar design to what
was just discussed (J. Hsieh et al., 2013). Only this study culminated in
performing a SNP-chip assay on the crypt and bulk tissue samples
rather than whole genome sequencing. By comparing SNP-chip results
for individual crypts and bulk tissue, large scale deletions and loss of
heterozygosity (LOH) events that occurred over the subject’s life could
be identified (Figure 5).
This study used SNP-chips with ~710,000 markers, giving an
average inter-marker difference of about 4.5 kb (assuming a 3.2 billion
base genome), and ~100 consecutive markers were required to spot a
deletion or LOH event. Thus, this study could not spot events smaller
than ~0.5 Mb.
Even with this limitation on the minimum size of deletions that
could be detected, this study demonstrated an increase in the number
of deletions with age. However, this study could not explore the
accumulation or impact of numerous smaller mutations, such as single
nucleotide variations (SNVs) or indels.
19
Figure 5. Large Deletion in One Crypt Identified through SNP-chips.
Adopted from Figure 3 (J. Hsieh et al., 2013). Allele frequency (A and C) and log R
ratio (B and D) for chromosome 17 from SNP-chips performed on a pair of colon
crypts on the same individual. Allele frequency plot for crypt 1 (A) shows a loss of
heterozygosity in the boxed region, while the log R ratio plot (B) shows a reduction in
signal in the same region. Taken together, these plots show a deletion of the boxed
region on one chromosome in crypt 1. No such deletion appears to be present in
crypt 2 (C and D), indicating that this is a somatic mutation.
Another study from the Sanger group (Blokzijl et al., 2016) using
a similar design has also been performed. However, this study used a
more than six week long ex-vivo culturing process, starting from a
single cell, to obtain organoids for sequencing. This study did
demonstrate a statistically significant increase in colon mutations with
age but potentially presents the same risks previously mentioned due
to ex-vivo culturing of stem cells: samples may have developed
mutations induced during this culturing process, especially due to the
artificial in-vitro non-biological environment.
20
1.5 Methods Used
1.5.1 Next Generation (Illumina) Sequencing
This project depends heavily on Illumina based next generation
sequencing (NGS). Here, we provide a brief overview of the NGS
process.
Library preparation:
NGS begins with the preparation of a sequencing library from
DNA isolated from the sample. DNA is first fragmented into ~400 bp
reads via sonication. Fragments are then ligated with 5’ and 3’
adaptors that facilitate PCR amplification. PCR amplification is
performed using primers matching the previously added adaptors.
These primers will also contain a unique barcode sequence (one
barcode used for each sample) and a terminal sequence that matches
the adaptors found on the Illumina flow chip. The result of this is to
produce a library of single stranded reads that have a ~400 base
center region of genomic DNA flanked by a pair of PCR adaptors,
flanked by a pair of unique (to each sample) barcode sequences,
flanked by a pair of adaptors matching the Illumina flow chip (Figure 6).
21
Figure 6. Preparation of An Illumina Sequencing Library.
Adapted from NEBNext Ultra DNA Library Prep Kit for Illumina (New England
BioLabs Inc., n.d.). Through a multistep process, genomic DNA is appended with
sequencing primers (green and yellow), a barcode sequence, and primers matching
the Illumina flow chip.
Sequencing:
Once a library has been prepared, it can be sequenced using an
Illumina sequencer. Fragments are washed over the flow cell (Figure
7) and allowed to bind, via the ends of their adaptors, to matching
adaptors on the flow cell. The bound fragments now serve as template
for extension of the flow cell adaptor. After the flow cell adaptor has
been extended into a match for the bound fragment, the bound
fragment is washed away, and the flow cell adaptor strand is allowed
22
to bend over and bind another flow cell adaptor. It then serves as a
template for the extension of this adaptor, producing two copies of the
original fragment that are attached to the flow cell. This process is
repeated several times to produce a small cluster of identical (or
complimentary) reads attached to the flow cell in close proximity.
Figure 7. Design of An Illumina Flow Cell.
Taken from (“Patterned Flow Cell Technology,” n.d.) The cell is divided into lanes,
columns and tiles. Multiple samples can be run in each lane (and thus column and
tile) to normalize the biases or effects of any one location. During sequencing, each
tile is filled with clusters, and each cluster gives data for a single read.
Now that a cluster has been generated, all reverse strands are
removed via a cleavage reaction. Leaving a cluster of identical forward
strands. The forward strands now undergo sequencing-by-synthesis,
where they are primed using the original adaptors ligated on during
library preparation. Sequencing goes in steps and the primers are
extended using fluorescently labeled terminating nucleotides. A
complementary nucleotide binds the cluster and is incorporated into
the growing sequencing strand. It is then fluorescently imaged, causing
the entire cluster to give off a color based on what nucleotide was
incorporated, a chemical reaction is then used to cleave the dye and
terminator off the nucleotide, allowing the process to be repeated. The
first 150 forward bases of the fragment are sequenced this way. The
23
cluster is then amplified as previously described to produce new
reverse strands. The forward strands are then cleaved away, and the
sequencing is repeated using the reverse strands. A detailed
illustration of this process can be found at (ECO, 2007).
In the end, this process gives a pair of sequences (paired-end
sequencing), running towards each other and separated by ~100 bp.
Pairing the sequences makes aligning the sequences and identifying
structural variations (discussed later) much easier. Further, a separate
sequencing reaction is carried out on the unique barcode in the
adaptor of the cluster, this identifies what sample the fragment came
from and allow multiple samples to be used on the same flow cell and
lane (multiplexing), thus normalizing lane specific effects.
1.5.2 Sequencing Data Analysis
Quality control:
When we receive sequencing data, we first perform quality
control on it using the FastQC program (Andrews Simon, n.d.). In
quality control, reads are analyzed for numerous factors to determine if
library preparation and sequencing have occurred correctly and
returned a useable result.
In particular, FastQC looks at the quality score returned by the
sequencer for each position of each read and the sequence content of
the reads. This quality score represents the sequencer confidence in
the base call it has made for that position. FastQC considers the
distribution of the average quality scores of entire reads, the average
quality score at each position for all reads, and the average quality
score for each position for all reads in a given tile on the flow cell.
When considering sequence content, FastQC looks at the base
distribution at each read position (which should be nearly even
between all bases) the distribution of reads by GC content (which
should be around 46.1%, and the Kmer distribution. FastQC also looks
for the presence of overrepresented sequences, duplicated sequences
24
and known adaptor sequences. Finally, FastQC considers the uncalled
(N) base content at each location in all reads (Andrews Simon, n.d.).
Following a positive quality control result, we can move on to
practical analysis and use of the data.
Alignment:
In alignment, each read is checked against a reference human
genome to determine what positions on the genome it covers. In the
process of alignment, short “seed” sequences from each read are
scanned against the reference genome looking for an exact match.
Once an exact match is found, the seed sequence is expanded
outward and the alignment is scored based on how well the read
matches. The best alignment match (if any) for each read is returned
and we now have a record of both sequence and where on the
genome it belongs (Heng Li, 2013).
Sort and remove duplicates:
A consequence of the PCR amplification used in library
preparation is that some reads will be totally identical to each other.
Multiple identical reads in our analysis will be source of bias when
determining allele counts and will not give us any additional
information. Thus, in the case of these duplicates, it is important to
remove all but one. In this relatively simple step we look for any reads
that are aligned to the exact same region and remove all but one of
them from further analysis.
Variant calling:
As discussed previously, a key step in our protocol is the
identification of mutations by looking for variants between the bulk
tissue sequence and the sequences of individual crypts. Genomic
variant calling can be done two ways: germline and somatic. In
germline variant calling, sequencing data is compared to a reference
genome and variations between the sequencing data and the
reference genome are identified. In somatic variant calling, sequencing
25
data from two different samples is directly compared to identify
variations between them (McKenna et al., 2010; Saunders et al.,
2012). A germline variant caller can be used as a somatic variant caller
if both samples are separately called against the reference and then
the lists of called variants for both samples are compared to each
other.
In both kinds of variant calling, a position by position
comparison (facilitated by the previous alignment step) is done. When
considering calling a variant, it is important to factor in normal
heterozygosity and sequencing errors. Thus, the count and quality of
each base type are actively considered. Different variant callers usually
use different mathematical models for such purposes (Xu, 2018). In
this study, we take advantage of more than one variant caller and carry
along their overlaps for analysis. The end output of variant calling is a
list of all likely variations between each crypt and the bulk tissue.
Alignment-free Analysis:
As an alternative to conventional variant calling, we also tried
performing alignment-free sequence comparisons between different
samples. The alignment-free sequence comparisons we performed
were based on comparing the counts of various Kmers found in two
samples, and thus, through differences in these counts, quantified
differences between the samples (Lu et al., 2016). Our goal was to see
if there was a reliable increase in Kmer difference in crypt samples
versus bulk from the same subject with the age of the subject.
Annotation:
Following the identification of all likely mutations, we use
annotation to look for patterns in the mutations observed. Annotation
software systemically looks through all identified variants and records
of what sort of genomic features they occur in (e.g. exons and introns),
based on their position and an annotated reference genome (K. Wang,
26
Li, & Hakonarson, 2010). By annotating our variant list, we can search
for patterns in the identified mutations.
1.5.3 Novel Software
To facilitate the analysis of identified variants and help elucidate
the mechanisms of their occurrence, we developed two novel pieces of
software.
SARIA (Slippage-Associated Repeat Identification and Analysis)
We developed this Python module to quickly search for potential
slippage inducing repeats near mutation sites. Slippage occurs during
DNA synthesis or transcription when part of the template or growing
strand forms a hairpin structure rather than binding to the other strand,
this pulls the strands out of alignment, attracts DNA repair proteins that
may delete a looped-out region or arm of a stem-loop structure. This
may lead to insertions of deletions in the strand being synthesized.
Figure 8. Examples of DNA Hairpins.
Formation of these hairpins during the DNA replication process can pull the template
and growing strand out of alignment, creating errors, and attract DNA repair proteins
that may modify the single stranded DNA.
Certain repeat sequences have a well demonstrated instability
that is likely caused by their ability to form hairpins and induce this
phenomenon. The most notable case of this may be the “CAG” repeats
27
in the Huntingtin gene, which are believed to cause their own
expansion through this mechanism and thus eventually expand to the
point that they cause Huntington’s disease (Bates et al., 2015). Our
software is designed to facilitate a quick and easy analysis of the
proximity of identified mutations to these sites.
Motif Detection: MBSAT
A previous paper by the Lieber lab demonstrated the relevance
of CpG motifs as sites of the initiating mutations in several types of
lymphomas. In this case, the close proximity of these motifs to the
established DNA break sites driving the cancer was critical to
elucidating the mechanism by which a mistake in VDJ recombination
could contribute to the risk of turning a normal precursor B-cell into a
lymphoma cell. This proximity was demonstrated to be statistically
significant using a custom made script (Tsai et al., 2008). We have
copied the logic of this script into a generalized and easy to use R
package that can check the proximity of any motif to any given list of
mutations and determine if there is a statistically significant association
between the two. Our hope is that this package, combined with a list of
age-related mutations will facilitate major insights into mutational
mechanisms in the same way that the previous project did.
28
Chapter 2. METHODS
2.1 Subject and Sample Information
Figure 9. A Fresh Colon Sample Provided by CHLA.
Healthy colon crypts and bulk tissue were obtained from tissue
extracted during normal clinical procedures. All tissue would have been
removed from the patient regardless of our study and samples were
de-identified before we received them. Colon samples from older and
middle-aged individuals were provided by Dr. Graham Casey at the
USC Norris Cancer Center. Samples from young individuals were
provided by Dr. Tracy Grikscheit at Children’s Hospital Los Angeles
(CHLA). Figure 9 shows a fresh colon specimen obtained from CHLA.
Subjects and samples are described in the following two tables.
In Table 1 and Table 2, green, yellow and red coloring is used to
identify subjects in young (0-15 years old), middle-aged (39-45 years
old), and old groups (83-93 years old), respectively. The uncolored
rows identify subjects that have crypts stored, but no library prepared
29
yet. In total, this project used 9 Individuals: 3 young, 3 middle-aged, 3
old. Note that many additional libraries were prepared, but not
sequenced, and these libraries are described in Table 15 in the
appendix.
Table 1. Individuals Whose Samples Were Obtained from.
Individual Label Age Race Gender Crypts Prepared By
062117 CHLA 0 NA
a
F Shuangchao Ma
121317 CHLA 11 NA
a
M Shuangchao Ma
090617 CHLA 15 Hispanic F Shuangchao Ma
cgnn03 CGNN 39 Asian M Dr. Graham Casey Lab
12105 CCTU 43 White F Dr. Graham Casey Lab
12011 CCTU 45 White M Dr. Graham Casey Lab
12260 CCTU 83 Asian F Dr. Graham Casey Lab
12216 CCTU 85 Black F Dr. Graham Casey Lab
12430 CCTU 93 White F Dr. Graham Casey Lab
031418
b
CHLA 4 Hispanic M Shuangchao Ma
101718
b
CHLA 2 NA
a
M Shuangchao Ma
a. CHLA did not have consent to collect this data.
b. Crypts were isolated and stored from this individual but have not been used
for library preparation in this study.
Table 2. Sequencing Libraries Generated in This Study.
Sample
Name
Individual Age
(year)
Bulk
Or
Crypt
Barcode
(i5 = 500,
i7 listed)
Prepared By
L090 062117 0.6 Bulk 4 Shuangchao Ma
L129 062117 0.6 Crypt 7 Shuangchao Ma
L133 062117 0.6 Crypt 11 Shuangchao Ma
L125 062117 0.6 Crypt 3 Shuangchao Ma
L131
a
062117 0.6 Crypt 9 Shuangchao Ma
L134 062117 0.6 Crypt 12 Shuangchao Ma
L116 121317 11 Bulk 6 Shuangchao Ma
L115 121317 11 Crypt 5 Shuangchao Ma
L122 121317 11 Crypt 12 Shuangchao Ma
L105 121317 11 Crypt 7 Shuangchao Ma
L113 121317 11 Crypt 3 Shuangchao Ma
L121 121317 11 Crypt 11 Shuangchao Ma
L123 121317 11 Crypt 1 Shuangchao Ma
L111 090617 15 Bulk 1 Shuangchao Ma
L108 090617 15 Crypt 10 Shuangchao Ma
L120 090617 15 Crypt 10 Shuangchao Ma
30
L118 090617 15 Crypt 8 Shuangchao Ma
L126 090617 15 Crypt 4 Shuangchao Ma
L071 cgnn03 39 Bulk 6 Shuangchao Ma
L044 cgnn03 39 Crypt 2 Dr. Chih-Lin Hsieh
L045 cgnn03 39 Crypt 3 Dr. Chih-Lin Hsieh
L046 cgnn03 39 Crypt 4 Dr. Chih-Lin Hsieh
L047 cgnn03 39 Crypt 5 Dr. Chih-Lin Hsieh
L048 cgnn03 39 Crypt 7 Dr. Chih-Lin Hsieh
L068 12105 41 Bulk 3 Shuangchao Ma
L036 12105 41 Crypt 11 Dr. Chih-Lin Hsieh
L037 12105 41 Crypt 12 Dr. Chih-Lin Hsieh
L038 12105 41 Crypt 7 Dr. Chih-Lin Hsieh
L039 12105 41 Crypt 8 Dr. Chih-Lin Hsieh
L040 12105 41 Crypt 9 Dr. Chih-Lin Hsieh
L067 12011 45 Bulk 2 Shuangchao Ma
L034 12011 45 Crypt 9 Dr. Chih-Lin Hsieh
L035 12011 45 Crypt 10 Dr. Chih-Lin Hsieh
L041 12011 45 Crypt 11 Dr. Chih-Lin Hsieh
L042 12011 45 Crypt 12 Dr. Chih-Lin Hsieh
L043 12011 45 Crypt 1 Dr. Chih-Lin Hsieh
L070 12260 83 Bulk 5 Shuangchao Ma
L052 12260 83 Crypt 6 Dr. Chih-Lin Hsieh
L053 12260 83 Crypt 7 Dr. Chih-Lin Hsieh
L058 12260 83 Crypt 8 Shuangchao Ma
L059 12260 83 Crypt 9 Shuangchao Ma
L086 12260 83 Crypt 11 Dr. Chih-Lin Hsieh
lp012 12216 85 Bulk 4 Dr. Chih-Lin Hsieh
lp010 12216 85 Crypt 2 Dr. Chih-Lin Hsieh
lp011 12216 85 Crypt 3 Dr. Chih-Lin Hsieh
m1 12216 85 Crypt 5 Shuangchao Ma
m2 12216 85 Crypt 7 Shuangchao Ma
L055 12216 85 Crypt 9 Shuangchao Ma
L027 12430 93 Bulk 4 Dr. Chih-Lin Hsieh
L030 12430 93 Crypt 5 Dr. Chih-Lin Hsieh
L031 12430 93 Crypt 6 Dr. Chih-Lin Hsieh
L064 12430 93 Crypt 3 Shuangchao Ma
L065 12430 93 Crypt 6 Shuangchao Ma
Merge
L028
b
12430 93 Crypt 7 Dr. Chih-Lin Hsieh
L029 12430 93 Crypt 5 Dr. Chih-Lin Hsieh
a. This sample was not included in the later data analysis, because its Strelka
variant calling result is very aberrant.
31
b. This sample library was prepared normally but sequenced in two separate
portions. The data from these portions was then merged following
sequencing.
2.2 Crypt Isolation
Samples were stored in serum free RPMI media and
transported on ice. The colon specimen was cut along one side of the
cylindrical portion of the colon to produce a “sheet”, and then cut into
1-2 mm
2
squares using a razor and tweezers. Squares were placed in
a new 50 ml falcon tube. 15 ml 1X PBS with 9 mM EDTA was added.
The tube was gently shaken 10 times by hand and let sit for 1 min
before liquid was poured out This procedure was repeated 3 times. In
a final washing step, 10 ml of 9 mM EDTA in 1X PBS (room
temperature) was added and the sample was incubated for an
additional 20 minutes at room temperature, with a gentle inversion
every 2-3 minutes. The tissue was allowed to settle, and liquid was
poured out.
Following washing, crypts were separated by adding 10 ml 1X
PBS (no EDTA), pipetting up and down 8-10 times, and vortexing at
the highest speed for 10 seconds. Tissue was allowed to settle for 30s
before 2 ml of supernatant was collected and transferred into a 60x15
mm culture dish.
Under a microscope, 1X PBS was added to dilute the sample
until all crypts were separated from each other. A 2-20 µl pipet set to
10 ul was then used to isolate individual crypts and place them in an
Eppendorf tube.
Remaining tissue squares were used as bulk tissue samples. All
crypt and bulk samples were stored at -80°C until use. Figure 10 is an
example of what colon crypts tended to look like under a microscope.
32
Figure 10 A Single Colon Crypt under Light Microscopy (from the 15-Year-Old Individual).
Crypts were broken up and then diluted with PBS until they could be separated from
each other.Library Preparation
2.3 Library Preparation
2.3.1 Bulk Sample Genomic DNA Extraction
Sequencing libraries were prepared using the NEBNext Ultra
DNA Library Prep Kit (New England BioLabs Inc., n.d.) for Illumina with
several adjustments to the protocol to compensate for the minimal
(1000-2000) number of starting cells.
For bulk samples, DNA extraction was first performed by cutting
the bulk sample into 2 mm
2
pieces. The sample was then spun down at
13,000 rpm in an Eppendorf microfuge, and any supernatant was
pipetted off. The pellet was washed with 1x PBS using a pipette.
Protein was removed using 717 µl 1X TE buffer with 18 µl 20% SDS
and 15 µl 20 mg/ml Proteinase. The total 750 µl was vortexed at the
highest speed and incubated at 60°C for 4 hours.
Following incubation, a phenol extraction was performed by
adding 750 µl phenol, vortexing for 20 seconds and centrifuging for 1
minute the highest speed 13,000 rpm in an Eppendorf microfuge.
Following centrifugation, the top layer of phenol was extracted. 750 µl
of chloroform was added, followed by vortexing for 20 seconds and
33
centrifuging for 1 minute at 13,000 rpm. The top layer was then
extracted again.
Phenol was removed by adding 75 µl of 3 M NaOAc, and 750 µl
of isopropanol. The tube was gently inverted, and a white DNA
precipitate would form. The liquid was carefully removed using a
pipette. 500 µl of 75% ethanol was added to wash the sample. The
sample was spun down at 13,000 rpm and the supernatant was
removed. The DNA pellet was air dried for at least 3 minutes. 50 µl TE
buffer was added without pipetting, and the tube was left at room
temperature, so the pellet could dissolve completely.
2.3.2 Library Preparation (Bulk and Crypt)
Initial Crypt Handling
When preparing crypt samples, we were extremely careful not
to lose any material, since we initially only had 1000-2000 cells. For
this reason, DNA extraction was not performed, and all reactions were
done using 1/3 of the reaction volume for any step indicated in
NEBNext Ultra DNA Library Prep Kit (New England BioLabs Inc., n.d.)
for Illumina. We also used low binding tubes and tips and avoided
vortex and tip-based mixing steps when possible.
1X TE buffer was added to the crypt sample tubes to make a
final volume of 50 µl. Samples were put through three freeze-thaw
cycles using an ethanol and dry ice bath. One µl 20mg/ml Proteinase
was added to each tube and tubes were incubated at 56°C for 20 min.
Afterwards, another 1 µl Proteinase was added and the incubation was
repeated.
Initial Bulk Handling
For bulk samples, we took out 1 µl DNA (roughly 50ng) from the
genomic DNA stock, and 1X TE buffer was used to make a final
volume of 50 µl.
Sonication
34
From this step on, bulk and crypt samples were handled the
same way. Samples were transferred to a microTUBE (Snap-Cap AFA
Fiber) and fragmented using a Covaris Ultra Sonicator S2 (model
SN000880) set to intensity 5, duty cycle 10%, 200 cycles/burst, and
treatment time 50 s, to achieve a DNA fragment range with a peak of
350 bp.
Following fragmentation, 70 µl of Ampure XP Beads in PEG was
immediately added and mixed in by pipetting, to achieve a
concentration of 1.4X. Beads were washed using 500 µl 100% ethanol
twice on a magnetic bar. Then DNA was eluted using 18.6 µl 1 mM
Tris and liquid was taken out and stored in a new tube.
End Repair and Adaptor Ligation
Following bead removal, the fragment end repair reaction was
induced by adding 2.17 µl NEBNext End Repair Reaction Buffer (10X)
and 1 µl NEBNext End Prep Enzyme Mix. After incubating first at 20°C
for 30 min and then at 65°C for 30 min, we ligated adaptors by adding
5 µl Blunt/TA Ligase Master Mix, 0.33 µl NEBNext Ligation Enhancer,
and NEBNext Adaptors for Illumina (methylated) (0.1X for crypt
samples and 1X for bulk samples). The reaction was then incubated at
20°C for 15 min. Finally, adaptors were cut by adding 1 µl USER
enzyme and incubating for 15 min at 37°C.
After adaptor addition, 34.6 µl of beads were added to the
reaction and the tube was mixed using pipette and left on a magnetic
bar. Beads were washed using 500 ul 100% ethanol twice on a
magnetic bar. We eluted by adding 9 µl 1 mM Tris to crypt samples,
and 17 µl 1 mM Tris to bulk samples, leaving the tube on a magnetic
bar for 1 minute, and then moving the liquid to a new tube. For bulk
samples, 3 µl of elute was used in the PCR reaction below.
Standard enhancement PCR for sequencing library construction
was done by mixing the sample (or 3 µl of it for bulk samples) with 5 µl
NEBNext Q5 Hot Start HiFi PCR Master Mix, 0.4 µl i5 universal primer
35
(25uM), 0.4 µl i7 index primer (25uM) and 1.2 µl dH2O. PCR was then
run with the following program (Table 3).
Table 3. Pre-sequencing PCR Program.
98 °C 30 s 1X
98 °C 10 s
10X (crypt), 8X (bulk)
65 °C 75 s
65 °C 5 min 1X
4 °C Hold
To prevent over amplification of any fragments, PCR was done
in triplicate for crypt samples and the resulting products were
combined.
Following PCR, 8.5 µl of beads were added to each tube and
mixed by pipette. Beads were washed using 500 ul 100% ethanol twice
on a magnetic bar. Sample was eluted with 12 µl 1 mM Tris. Samples
were analyzed by using either 1 µl of sample on a BioAnalyzer (Agilent
2100 Bioanalyzer) or .5 µl of sample on a TapeStation (Agilent
Technologies 4200 Tape Station) (Agilent-Technologies, 2005).
Size Selection (Not Universal)
This was only applied to samples where BioAnalyzer or
TapeStation analysis showed the average fragment size to be too
large (>400 bp, see Figure 11 for examples). If the sample was found
to be too large, Ampure XP Beads in PEG were to achieve a
concentration of 0.55X. The sample was then left at room temperature
for 5 min and liquid was extracted and stored in a different tube. We
then added another 0.3X beads to get a final concentration of 0.85X.
Beads were washed using 500 µl 100% ethanol twice on a magnetic
bar. The sample was then eluted with 14 µl 1 mM Tris.
36
Figure 11. Examples of BioAnalyzer Outputs.
The BioAnalyzer outputs (in two different formats) are shown for samples L124, L125
and L126. The numbers after each sample correspond to the PCR triplicate being
analyzed (e.g. 125-2 is sample L125, PCR triplicate 2). In A, the x-axis is the read
length (in base pairs) the y-axis is the fluorescence intensity (FU), which corresponds
to the number of reads with that length. In B, each sample is shown as a gel run with
darkness corresponding to fluorescence (and therefor read) intensity, A ladder of
standard DNA lengths has also been run (v13 and v18 are unrelated samples that
happened to be in the run). The L125 samples show a good size distribution, while
the L126 samples are too large and require size selection.
After size selection, DNA concentration was measured using a
NanoDrop 8-Sample Spectrophotometer ND-8000.
2.4 Next Generation Sequencing
50 ng of DNA from each sample was sent for sequencing.
Sequencing was done by Macrogen (Korean Branch), using their
Illumina platform HiSeq.
37
2.5 Data Analysis
Figure 12. Simplified Single Nucleotide Variant Calling Workflow.
An illustration showing the entire SNV calling workflow can be
seen in Figure 12. Sequencing data was first analyzed using FastQC
38
(Andrews Simon, n.d.). Once FastQC confirmed the quality of the
sequencing data, reads were aligned to hg19 (“Human genome
overview - Genome Reference Consortium,” n.d.). Because GATK
(next step) takes multiplexing (see introduction) into account when
performing its analysis, reads were split by lane using a custom script,
and alignment, sorting and removal of duplicates was performed
separately for each lane. Aligned data from separate lanes was then
merged, sorted, and duplicates across multiple lanes were removed.
Figure 13 illustrates this procedure.
Figure 13 Reads from Different Lanes Were Handled Separately
Each lane was aligned, sorted and had duplicates removed separately, before
merging lanes back together and sorting and removing duplicates again. This was
necessary to ensure GATK would receive accurate multiplexing information.
During this procedure, alignment was done using bwa-mem
(v0.7.12) with the -t 16 (number of processors) and -R (record read
group information, which includes sample ID, preparation time, and
platform) options enabled (Heng Li & Durbin, 2009) (Consortium, 2001).
The -R option was used because downstream software (such as GATK)
can use this information to control for batch effects and multiplexing
(McKenna et al., 2010).
39
2.5.1 Single Nucleotide Variation Calling
The resulting .bam file was sorted using Picard’s (v3a81eb1)
SortSam tool with the options SORT_ORDER=coordinate and
VALIDATION_STRINGENCY=SILENT. Duplicates reads were
identified and removed using Picard’s MarkDuplicates tool with the
options REMOVE_DUPLICATES=true and
VALIDATION_STRINGENCY=SILENT (Broad Institute, n.d.). Reads
were indexed using samtools (v1.2) (H. Li et al., 2009). To ensure
accuracy, we called SNVs using both Strelka (direct crypt to bulk
comparisons) and GATK 3.6 (crypt to reference comparison minus any
calls identified in a bulk to reference comparison). Strelka (v2.0.17)
was run with default configuration settings (Saunders et al., 2012).
GATK was run following the recommended best practice all the way till
prior to the Variant Quality Score Recalibration (VQSR) step (McKenna
et al., 2010), and key tools used were AnalyzeCovariates, PrintReads,
BaseRecalibrator, HaplotypeCaller. For Base Quality Score
Recalibration (BQSR) in GATK, the dbSNP, GoldIndel and kGIndel
references used were 138,
Mills_and_1000G_gold_standard.indels.hg19, and 1000G_phase1
respectively (McKenna et al., 2010).
Other variant calling softwares such as VarScan2 (Koboldt et al.,
2012), Mutect2 (Cibulskis et al., 2013), and IVC (Raczy et al., 2013)
were tested. We did not pursue VarScan2, because it requires a
samtools pileup output and that takes extremely long time to generate.
We did not use Mutect2 because at the time when we started with data
analysis Mutect2 was still under development. We did compare the
result of GATK and strelk to IVC (from Macrogen) side by side.
GATK was also used for extracting number of sites examined at
a given coverage. We did not use bedtools (Quinlan & Hall, 2010)
mainly because we lacked the storage space to save the large
coverage .bedgraph files it requires.
40
Small deletions and insertions (<50 bp in size) were also called
and outputted by both GATK and Strelka, but in this project we did not
perform any further analysis on these calls.
2.5.2 Deletion (>50bps) Calling
Figure 14 Structural Variant Detection Workflow
41
Figure 14 illustrates the structural variant calling workflow.
Structural variations were identified by inputting aligned reads with
duplications removed into Manta with default configurations (Chen et
al., 2015). Manta identifies deletions, insertions, tandem duplications,
inversions and translations. We used Manta’s default filtering strategy,
and a short script was written to separate the different output
categories. Only deletions were considered for further analysis. Many
variants called by Manta, including all deletions, were manually
examined in Integrative Genomics Viewer (IGV) (Robinson et al.,
2011a) to ensure quality. During this examination, we found that two
clusters of called small deletions, identified in the 11 and 15 year old
patients, showed almost perfect overlap with the exons of a gene.
While the origin of these calls is unclear, we believe that the near
perfect exon overlap means that these were not normal deletions, and
thus they were excluded from the analysis.
2.5.3 Other Analyses
We performed alignment free analysis of the 39 year old and 41
year old subject samples (all possible comparisons) using CAFE with
the default parameters.
We would like to note that we also compared our results to
those provided by Macrogen. Figure 15 shows their workflow, which
follows similar logic, but uses different software.
42
Figure 15. The Macrogen Data Analysis Workflow.
Taken from report files sent from Macrogen. The overall logic used by Macrogen is
similar to our workflow, but different aligners and variant callers are used.
2.6 Software Development
2.6.1 Motif and Breakpoint Statistical Analysis Tool Kit
The package MBSAT (Motif and Breakpoint Statistical Analysis
Tool Kit) was developed to asses if there is a statistically significant
association between a given motif of interest and a set of mutations.
As input, MBSAT requires a motif of interest, a collection of mutations
in a given genomic region and a window size. The collection of
mutations can be inputted directly as a VCF file or as a collection of
reads in FASTA format that are then aligned back to user defined
reference genome by a local BLAST search to check for mutations.
Once MBSAT has all input, it defines a genomic region of
interest in any version of the human genome that stretches from the
43
first mutation to the last, with additional wings equal to the given
window size. MBSAT performs several analyses in this region to
determine if the motif of interest is significantly associated with the
given mutations. There are three statistical tests built into this software,
the Binomial test (Equation 1), Student’s t Test (Equation 2) the Mann-
Whitney-U Test (Equation 3). They are used to assess the probability
that mutations are more likely to occur within the motif of interest
(binomial test) and determine if mutations are significantly close to the
motif of interest (Student’s t test and Mann-Whitney-U Test). All these
tests are based on the standard of comparing the observed mutations
to the pattern of a mutation at all possible sites.
Equation 1. Binomial Test
This test is used to determine if mutations are biased towards occurring in motifs of
interest. N = number of identified mutations, x = number of mutations occurring in the
motif of interest, p = probability of a randomly placed mutation landing inside a motif.
p is determined by taking the number of these sites that are in a motif and dividing by
the number of sites where a mutation can occur (length of region being considered +
1). Note that we count mutation/site as occurring in a motif if it is inside or on either
end of the motif. Thus, a motif of n bases has n + 1 sites in it.
Equation 2. Student’s t Test
This test used is to determine if mutations are biased towards occurring near motifs
of interest when the distance between mutations and motifs of interest follows a
normal distribution. n1 = number of mutations, n2 = number of possible mutation
positions (length of region being considered + 1), x
̅ 1 is the observed average distance
between a mutation and the closest motif of interest (going from the edge of the
mutation to the edge of the motif), x
̅ 2 is the average distance between all possible
mutation positions and the closest motif of interest, s1 and s2 are the observed
standard deviations of the distances, x1 and x2, respectively. The t-value is converted
into a p-value using the t.test function in R.
Equation 3. Mann-Whitney-U Test
This test is used to determine if mutations are biased towards occurring near motifs
of interest when the distance between the mutations and motifs of interest does not
44
follow a normal distribution. n1 = number of mutations and n2 = number of possible
mutation positions. To obtain R1, all mutations and all possible mutations sites were
ranked by their distance (smallest to largest) from the closest motif of interest. If
multiple ranks had the same distance, those ranks were averaged. R1 was set to the
sum of the ranks given to all mutations. The z-value is converted into a p-value using
the wilcox.test function in R.
In addition to statistical analysis, MBSAT offers a graphical
output showing the mutation and motif distribution. MBSAT was tested
using an already published dataset that previously demonstrated a
statistically significant association between lymphoma causing break
points in precursor B cells and CpG motifs (Tsai et al., 2008).
2.6.2 Slippage-Associated Repeat Identification and Analysis
The package SARIA (Slippage-Associated Repeat Identification
and Analysis) was developed to offer easy visualization of potential
slippage inducing repeats in mutation prone regions. SARIA allows the
user to specify the maximum distance between repeats, maximum
repeat length and maximum repeat mismatch tolerance. Using these
parameters, SARIA plots density of both direct and inverted of repeats
in a user defined region and can highlight user defined areas of
interest. Repeat mismatch tolerance is calculated using the
Levenshtein distance. The output of this package can help users
determine if a mutation rich region is also rich in slippage inducing
repeats.
SARIA was tested using data from Di Liu’s (Lieber lab member)
project, which covered breakpoints in the E2A region on chromosome
45
19 (nt position 1615821 to 1619109). Results from SARIA were
manually checked and confirmed by Di Liu.
46
Chapter 3. RESULTS
In this study, we sought to identify mutations that had
accumulated in humans, and to evaluate the association of the
presence of these mutations with age, by taking samples that
represented a person’s sequence at the time of conception, taking
samples that represented an individual’s sequence of single cells in
that individual at the current time, and comparing these two using
whole genome sequencing and variant calling. Prior to variant calling,
sequencing library preparation was optimized and several pre- and
post-sequencing quality control checks were performed to ensure the
results obtained were accurate. Single nucleotide variant calling was
performed using two callers and the results were overlapped and then
run through multiple filters to ensure a high degree of stringency. Large
deletions were called using one caller and the results were examined
using IGV (Robinson et al., 2011b). The types and counts of mutations
identified in people from different age groups were analyzed to
establish the patterns in which mutations accumulate with age.
3.1 Library and Sequencing Read Quality
Overview
Sequencing library preparation achieved good DNA
fragment length and quality. Our library preparation protocol was
designed to give a size distribution centered around 350-400 bps. This
47
distribution was chosen with the consideration that our paired-end
sequencing is 150 bps long from each end. Reads that are less than
300 bps will have their middle region sequenced twice, which provides
duplicated information and would be a waste of resources. Reads that
are too long may reduce the sequencing efficiency. Therefore, we
implemented size selection at the end of library preparation, when
needed, to ensure our read size distribution peaks at 350 to 400 bps.
Sequencing data achieved good quality. We analyzed our
sequencing results using FastQC (Andrews Simon, n.d.) to make sure
that there were no quality issues that would impact the accuracy of our
latter mutation calls. FastQC analysis did not show any warnings of
concern (result files saved in Lieber lab server), and this is consistent
with feedback from Macrogen. Our samples generally have slightly
higher duplication rates (~10-30%), compared with the field standards
(~5-10%) (Table 4). This may be caused by the relatively low amount
of starting material. With fewer DNA fragments present prior to the
PCR reaction, the chance that one fragment is overamplified is higher.
All samples showed good coverage, with an average of depth 34.7X
and a standard deviation of 5.2X (Table 4). These coverage and
duplication rate statistics are from the analysis provided by Macrogen,
and therefore may be slightly different from the statistics obtained
using BWA-mem. We used these statistics because we had to remove
duplications at two different steps when using BWA-mem (see
methods), making it hard to pull the correct statistics from the results
Figure 13.
Table 4. Sample Duplication and Coverage Rates
Sample
Name
Duplication
Rate %
Coverage (X)
after PCR duplication removal
L090 8.5 32.3
L129 35.8 46.1
L133 31.8 46.7
L125 18.8 49.0
48
L131
a
26.7 46.1
L134 35.2 46.8
L116 8.2 37.4
L115 12.1 38.0
L122 12 38.5
L105 9.7 36.2
L113 13.4 34.8
L121 10.7 37.6
L123 9.9 42.5
L111 11 33.2
L108 10.8 34.8
L120 18.6 30.4
L118 18.2 31.6
L126 17.8 32.3
L071 25.3 27.8
L044 21.7 31.3
L045 17.4 36.8
L046 12.8 29.2
L047 12.9 31.1
L048 15.2 34.1
L068 12.3 30.8
L036 23.4 30.5
L037 17.6 30.0
L038 24.9 31.3
L039 15.6 31.7
L040 21.8 31.0
L067 14.6 32.5
L034 29.5 32.0
L035 15.6 30.5
L041 17 31.7
L042 26.7 26.2
L043 14 37.8
L070 22.9 31.4
L052 19 43.1
L053 20.5 39.9
L058 22.4 30.0
L059 30.7 30.0
L086 23.1 36.5
lp012 14.2 38.2
lp010 18.3 35.9
lp011 13.7 36.7
m1 24.5 33.4
49
m2 25.9 31.0
L055 27.3 38.5
L027 29 30.4
L030 17.4 30.3
L031 12.5 34.4
L064 13.5 32.0
L065 11.9 35.0
Merge L028
b
11.1 30.1
L029 34.3 32.3
a. This sample was not included in the later data analysis, because its Strelka
variant calling result was very unusual
b. This sample library was prepared normally but sequenced in two separate
portions. The data from these portions was then merged following
sequencing. The data here is for the main portion (>90% of the reaction
volume).
3.2 Single Nucleotide Variations Accumulate
with Age
After sequencing all samples and ensuring their quality, we
sought to study whether the number of SNVs accumulated with age, as
well as their types and locations. To identify, SNVs we compared the
results of multiple variant callers, to see how we could identify
mutations with the most accuracy and settled on using the overlap of
Strelka and GATK. SNVs identified via this method were further filtered
to ensure the calls were accurate. Finally, we analyzed the regions the
remaining SNVs occurred in, how age related to number of SNVs
found in a crypt, and what types of SNVs were discovered.
3.2.1 Software Output Overlap
Germline callers largely agree with each other when calling
sample-reference variants. When comparing crypt and bulk samples
to identify SNVs, we used GATK, IVC, and Strelka side by side, as
described in methods, to find the optimal SNV calling pipeline. GATK
and IVC are both germline callers, meaning they call variants between
a sample and a reference genome. Calling bulk-crypt SNVs with them
50
requires first calling bulk-reference and crypt-reference SNVs
separately, and then identifying the SNVs that were not found in both
comparisons. GATK and IVC showed good overlap (~80-90%) when
calling bulk or crypt-reference variants (Figure 16).
Figure 16. GATK (vcf1) and IVC (vcf2) Overlap.
When identifying crypt-reference and bulk-reference variants for the L047 (from 39
year old subject), we found that the overlap between GATK and IVC is high.
Numbers indicate the number of called SNVs in each category.
This agreement decreased significantly when crypt and
bulk-reference variants were compared to obtain crypt-bulk
variants. We found that the overlap between the final output of each
caller, especially the overlap after bulk and crypt comparisons between
Strelka and GATK or Strelka and IVC, was poor (for example, see
Figure 17).
51
Figure 17. Overlap of Variant Callers for the L047 Crypt from the 39 Year Old Subject.
Numbers indicate the number of called SNVs in each category.
In order to determine what results to use for SNV identification,
we first did a manual check on the 186338135-186428182 region of
chromosome 3 from bulk L071 and crypt L044 samples to confirm that
our script for comparing crypt-reference and bulk-reference results
from GATK or Strelka was not producing any errors. This concern was
prompted by the significant decrease in overlap between the two
programs following this comparison, as just discussed. However, the
manual check validated the results of our script.
SNVs called by both GATK and Strelka are reliable. We next
sought to assess the accuracy of variant callers. Our first method was
using the transition/transversion (Ti/Tv) ratio. The Ti/Tv ratio is the ratio
of purine to purine or pyrimidine to pyrimidine mutations to purine to
pyrimidine or pyrimidine to purine mutations. ~2.8 is generally
accepted as the most likely value for the Ti/Tv ratio (J. Wang, Raskin,
Samuels, Shyr, & Guo, 2015). Although we have some concerns with
this standard (discussed in Chapter 4. DISCUSSION), we applied it
here to help guide our method optimization.
We found that each individual variant caller showed a Ti/Tv ratio
close to 1, which suggests a high rate of false positives. We
hypothesized the false positive rate might be reduced by only taking
52
mutations identified by both somatic and variant based calling, since
this would ensure the mutations were called twice through each of two
different methods. The use of overlapping variant callers is a good way
to remove false positives (Callari et al., 2017) Working from this
hypothesis, we checked the overlap of Stelka and GATK calls, and
found that this mutation set had a Ti/Tv ratio that was higher (~1.5)
(Table 19 shows Ti/Tv ratios following additional filtering steps). We
also noticed that the Ti/Tv ratio after this procedure was higher for
older individuals, but the reason for this is not clear.
Knowing that taking the overlap of GATK and Strelka identified
variants will provide improved an Ti/Tv ratio, we did a manual check of
a group of randomly selected of variants called by GATK, Strelka or
both on chromosome 18 for the comparison between the bulk L071
and crypt L044 samples. In this manual check, we looked at each
mutation called and then at the sample reads aligning to that position
on IGV to see if we agreed with the call. As can be seen in Figure 18,
taking only the overlap of GATK and Strelka seemed to ensure a
selection of reliable mutations, while each variant caller alone had a
large number of called mutations that were likely false positives. We
also considered only using variants that had passed Strelka’s internal
quality filter, but this did not seem to offer any increase in accuracy
compared to simply taking the overlap of Strelka and GATK. All reads
and calls considered in this step are available in the appendix. Based
on these analyses, we chose to use the overlap of Strelka and GATK
to identify SNVs. The total SNV counts for each caller and each crypt
sample can be found in Table 16 of the Appendix.
53
Figure 18. Manual Check Reveals Reliable Mutations in the Overlap of Strelka and GATK
Calls.
A random selection of SNV calls from GATK, Strelka and both in the chromosome 18
region of sample L044 from the 39 year old subject was manually assessed in IGV.
Calls that we found reliable are marked in green, calls that we found unreliable are
marked in red, calls that we were unsure of are marked in yellow.
3.2.2 Additional Filters
Additional filters for autosomal and well covered regions
were used to ensure the quality of our mutations calls. After taking
the overlap of raw mutations identified by both GATK and Strelka, we
applied several additional filters. As shown in Figure 12, we first
removed all mutations that were in non-alignable regions, sex
chromosomes and mitochondrial DNA. Non-alignable regions were
removed because the difficulty of placing them in the reference
indicates sequencing and analysis of them may be unreliable. Sex
chromosomes were removed because our subjects were a mix of male
and female. Mitochondrial DNA was removed because it replicates
independently of the main genome (and thus would not necessarily
have the same mutation rate). The resulting absolute counts can be
found in Table 17 of the Appendix.
54
Then, we filtered out all mutations at sites that did not have at
least 15X coverage in the corresponding bulk and crypt samples to
ensure that there was sufficient coverage to make mutation calls. We
have also tested filtering based on 20X coverage. Both methods give
similar results. The SNV counts with these filters applies can be found
in Appendix section: Table 18 and Table 19.
Finally, we examined the heterogeneity of each pair of calls.
During our manual check of Strelka outputs, we noticed that Strelka
frequently calls a site as heterozygous when the major allele is found
on the almost all the reads. We addressed this by switching a call to
homozygous (before evaluating the mutation status of the site) if more
than 95% of the reads were the major allele and fewer than 3 reads
were the minor allele. Following this correction, we sought to ensure
mutation calls based on opposite homozygous/heterozygous
assignments for the crypt were accurate. We did this by rejecting the
mutation call if the sample marked as homozygous had the major allele
was on less than 85% of the reads or if the sample marked as
heterozygous had the minor allele was on less than 20% of the reads.
This helps ensure that when a crypt-bulk pair is called as one being
homozygous and the other being heterozygous (and therefore a
mutation), there is a strong degree of evidence for both calls. We were
more stringent (95% cutoff) at this step compared to when we are
simply just trying to validate a heterozygosity call that the software
made, and less stringent (85% cutoff) than when we are trying to
determine if the software made a correct call. See Table 18 and Table
19 for SNV counts with all filters (15 or 20X coverage cutoff,
respectively).
We experimented with applying an additional filter where all
mutations at sites listed in dbSNP were removed, but this filter did not
offer any improvement in Ti/Tv ratio and we were concerned that it
may exclude true mutation calls, so it was not applied in the results.
55
3.2.3 SNVs Observed
SNV types are consistent with expectations. Figure 19
shows the portion of all different SNV types for each sample. Mutations
that begin at a heterozygous site are rare. This is consistent with the
fact that the human genome is estimated to have a ~1:1000 ratio of
heterozygous to homozygous sites based on whole genome
sequencing studies of multiple individuals (Bryc, Patterson, & Reich,
2013). Further, there are even fewer double mutations (where both
alleles have changed) observed. This is expected given that such a
case would require two separate mutations at the same site to occur.
56
Figure 19. Portion of Each SNV Type in Each Sample.
All possible SNVs are listed along the x-axis, the y-axis shows what portion of the
total SNVs that particular SNV was.
57
3.2.4 Normalization of Counts
SNV counts for each crypt-bulk pair were normalized based
on the coverage of that pair. We sought to determine if the number
of identified mutations increased with age. However, we could not do
this using the direct mutation counts for each crypt because some
samples had better sequencing coverage than others. These samples
would have a greater part of their genome with >=15X coverage and
thus more sites where mutations could be discovered, leading to a
larger number of called mutations without any real difference in
mutation occurrence. We controlled for this by normalizing the SNV
count for each crypt based on the coverage that crypt had.
The SNV count for each crypt was normalized by dividing by the
total number of sites (from GATK) in the bulk-crypt pair that could be
examined (located on an autosome and with more than 15X coverage
in both samples) and then multiplying by total number of sites that
could be examined in the sample with the fewest examinable sites.
Table 5 shows the values used for normalization. The last column is
the lower of the bulk and crypt pair, and thus maximum number of sites
that could be examined and the number that was used for
normalization. Table 6 shows the normalized counts. Figure 20 and
Figure 21 are bar graphs and boxplots of the normalized SNV counts.
(Equivalent items for a 20X coverage cutoff are in Figure 37, Table 21,
Figure 38 and, Figure 39 of the Appendix.)
Table 5 Normalization Values (15X) for Each Sample.
Sample Name Number
Sites >=1X
Number
Sites >=1X,
Autosome
Number
Sites >=15X,
Autosome
Used Final
Number
Sites >=15X,
Autosome
yr00_bulk_L090 2804843885 2652341251 2618526601 2618526601
yr00_crypt_L129 2804647447 2652168399 2616793007 2616793007
yr00_crypt_L133 2804677635 2652062846 2560005939 2560005939
yr00_crypt_L125 2804069325 2651693991 2617800279 2617800279
yr00_crypt_L134 2804579054 2652078928 2610202821 2610202821
58
yr11_bulk_L116 2823899715 2654578928 2610644417 2610644417
yr11_crypt_L115 2822402716 2653385280 2606101724 2606101724
yr11_crypt_L122 2821785286 2652968965 2593987896 2593987896
yr11_crypt_L105 2821763199 2652962154 2574357791 2574357791
yr11_crypt_L113 2821921822 2653014508 2603444331 2603444331
yr11_crypt_L121 2823519965 2654246255 2596673086 2596673086
yr11_crypt_L123 2823635092 2654448002 2610203463 2606101724
yr15_bulk_L111 2805510602 2652452463 2568899699 2568899699
yr15_crypt_L108 2806218042 2653208208 2547543945 2547543945
yr15_crypt_L120 2801209469 2648568218 2507618120 2507618120
yr15_crypt_L118 2802311170 2650042478 2559283797 2559283797
yr15_crypt_L126 2803637979 2651115530 2589658825 2568899699
yr39_bulk_L071 2824961137 2655936114 2571326890 2571326890
yr39_crypt_L044 2824378693 2655399058 2571244201 2571244201
yr39_crypt_L045 2823447062 2654697673 2601330829 2571326890
yr39_crypt_L046 2821759758 2653348090 2551555682 2551555682
yr39_crypt_L047 2821757359 2653349116 2570864122 2570864122
yr39_crypt_L048 2823090289 2654379771 2586932772 2571326890
yr41_bulk_L068 2806196336 2653346197 2584957018 2584957018
yr41_crypt_L036 2807872853 2654420845 2565421046 2565421046
yr41_crypt_L037 2808230771 2654886650 2571734665 2571734665
yr41_crypt_L038 2808091207 2654789157 2559263874 2559263874
yr41_crypt_L039 2807654589 2654482494 2581353925 2581353925
yr41_crypt_L040 2807476924 2654171760 2502439814 2502439814
yr45_bulk_L067 2826044910 2656459800 2601288311 2601288311
yr45_crypt_L034 2824256467 2655038353 2534777331 2534777331
yr45_crypt_L035 2824776410 2655439663 2582106345 2582106345
yr45_crypt_L041 2824506106 2656575222 2546639986 2546639986
yr45_crypt_L042 2822397965 2653512655 2413148724 2413148724
yr45_crypt_L043 2824158147 2654880352 2599722309 2599722309
yr83_bulk_L070 2809515488 2655630180 2601703673 2601703673
yr83_crypt_L052 2807642285 2654491886 2614657140 2601703673
yr83_crypt_L053 2807675815 2654498809 2601789017 2601703673
yr83_crypt_L058 2805537767 2654147653 2542142243 2542142243
yr83_crypt_L059 2805320320 2652592403 2565064721 2565064721
yr83_crypt_L086 2811002264 2656865283 2614605038 2601703673
yr85_bulk_lp012 2804189903 2651914148 2560737800 2560737800
yr85_crypt_lp010 2805382187 2652948441 2609075946 2560737800
yr85_crypt_lp011 2805002507 2652639062 2608590223 2560737800
yr85_crypt_m1 2720001052 2572946170 2000370826 2000370826
59
yr85_crypt_m2 2595898757 2458751108 1661919169 1661919169
yr85_crypt_L055 2810601541 2656949024 2612906704 2560737800
yr93_bulk_L027 2808137264 2654731584 2568415915 2568415915
yr93_crypt_L030 2807486287 2654235079 2524574921 2524574921
yr93_crypt_L031 2807525631 2654323867 2588058520 2568415915
yr93_crypt_L064 2807250287 2654078655 2569709539 2568415915
yr93_crypt_L065 2807776865 2654528522 2593459720 2568415915
yr93_crypt_mergeL028 2808461920 2655074543 2615160510 2568415915
yr93_crypt_L029 2808771213 2655097403 2487699043 2487699043
Table 6. Normalized Counts for All Samples (15X).
Sample Name Normalized Number
yr00_crypt_L129 1516.6
yr00_crypt_L133 1514.6
yr00_crypt_L125 1582.7
yr00_crypt_L134 1485.4
yr11_crypt_L115 1204.0
yr11_crypt_L122 979.0
yr11_crypt_L105 1051.0
yr11_crypt_L113 1011.2
yr11_crypt_L121 913.9
yr11_crypt_L123 1201.4
yr15_crypt_L108 1242.7
yr15_crypt_L120 2250.0
yr15_crypt_L118 1874.7
yr15_crypt_L126 1721.5
yr39_crypt_L044 2467.1
yr39_crypt_L045 2717.8
yr39_crypt_L046 2568.2
yr39_crypt_L047 2410.6
yr39_crypt_L048 2784.4
yr41_crypt_L036 2108.0
yr41_crypt_L037 2105.4
yr41_crypt_L038 2180.6
yr41_crypt_L039 2118.8
yr41_crypt_L040 2415.4
yr45_crypt_L034 1629.3
yr45_crypt_L035 2720.0
yr45_crypt_L041 2682.8
yr45_crypt_L042 1303.7
yr45_crypt_L043 1945.3
60
yr83_crypt_L052 2958.8
yr83_crypt_L053 2623.5
yr83_crypt_L058 1886.7
yr83_crypt_L059 2372.0
yr83_crypt_L086 1619.9
yr85_crypt_lp010 1165.6
yr85_crypt_lp011 2435.7
yr85_crypt_m1 2048.8
yr85_crypt_m2 2684.0
yr85_crypt_L055 1509.6
yr93_crypt_L030 3666.1
yr93_crypt_L031 4465.4
yr93_crypt_L064 3936.1
yr93_crypt_L065 2239.5
yr93_crypt_mergeL028 1736.7
yr93_crypt_L029 2377.6
61
Figure 20. Normalized SNV Counts for All Samples.
Samples are plotted on the x-axis in order of age, while the y-axis indicates the
normalized SNV count for that sample.
62
Figure 21. Normalized SNV Counts (Sorted by Age)
The samples have been grouped by age on the x-axis, and the y-axis shows the
number of normalized SNV counts for each sample. The box and lines define the
middle and outer quartiles respectively. The diamond is centered on the mean and
extends the standard deviation. The additional center line defines the median.
63
3.2.5 Annotation
The distribution of SNVs between exonic/intronic and non-
transcribed regions of the genome was consistent with the overall
area of each of these categories with at least 15X coverage. We
annotated our filtered SNVs to see if they could be found in exonic,
intronic, or non-transcribed regions. The ratio of SNVs in these regions
was consistent with the ratio of bases in these regions with at least
15X coverage. Table 7 shows the Normalized count of exon SNVs in
each sample. Figure 22 shows a boxplot for these counts. Table 22
and Figure 40, in the appendix, show the non-normalized SNV counts
for all genomic regions and a non-normalized SNV count boxplot,
respectively.
Table 7. Normalized Exon SNV Counts for All Samples (15X).
Sample Name
Normalized Number for
Exon
yr00_crypt_L129 13.3
yr00_crypt_L133 9.1
yr00_crypt_L125 11.4
yr00_crypt_L134 16.6
yr11_crypt_L115 7.0
yr11_crypt_L122 6.4
yr11_crypt_L105 6.5
yr11_crypt_L113 6.4
yr11_crypt_L121 5.8
yr11_crypt_L123 11.5
yr15_crypt_L108 4.6
yr15_crypt_L120 13.9
yr15_crypt_L118 13.6
yr15_crypt_L126 16.2
yr39_crypt_L044 31.7
yr39_crypt_L045 31.7
yr39_crypt_L046 33.2
yr39_crypt_L047 22.6
yr39_crypt_L048 30.4
yr41_crypt_L036 22.0
yr41_crypt_L037 20.7
yr41_crypt_L038 27.3
yr41_crypt_L039 24.5
64
yr41_crypt_L040 21.9
yr45_crypt_L034 22.3
yr45_crypt_L035 38.0
yr45_crypt_L041 28.1
yr45_crypt_L042 15.8
yr45_crypt_L043 22.4
yr83_crypt_L052 30.7
yr83_crypt_L053 28.7
yr83_crypt_L058 13.1
yr83_crypt_L059 18.1
yr83_crypt_L086 19.8
yr85_crypt_lp010 2.6
yr85_crypt_lp011 13.0
yr85_crypt_m1 17.4
yr85_crypt_m2 17.0
yr85_crypt_L055 9.7
yr93_crypt_L030 35.5
yr93_crypt_L031 45.9
yr93_crypt_L064 41.4
yr93_crypt_L065 30.4
yr93_crypt_mergeL028 22.0
yr93_crypt_L029 14.0
65
Figure 22. Normalized Exon SNV Counts (Sorted by Age).
66
The samples have been grouped by age on the x-axis, and the y-axis shows the
number of normalized exon SNV counts for each sample. The box and lines define
the middle and outer quartiles respectively. The diamond is centered on the mean
and extends the standard deviation. The additional center line defines the median.
Exon regions have the highest mean Ti/Tv ratio and
standard deviation compared with other parts of the genome.
Table 8 shows the mean and standard deviation of the Ti/Tv ratio in
different genome sections for all samples. The exonic regions show the
highest Ti/Tv mean and standard deviation, which may be explained by
the relatively small number of SNVs in the exons. Table 22 in the
appendix shows the Ti/Tv ratio for all regions in each sample.
Table 8. Ti/Tv Ratio in Different Genomic Regions.
All SNVs Exon
Ti/Tv
Ratio
Intron
Ti/Tv
Ratio
Non-
Transcribed
Ti/Tv Ratio
Exon &
Intron Ti/Tv
Ratio
Average 1.68 2.73 1.77 1.64 1.79
SD 0.25 1.55 0.32 0.24 0.33
Other annotation information, such as if a mutation can cause
missense, nonsense, splicing or other errors, and more are available in
the Lieber Lab server.
3.2.6 Statistical Analysis: SNVs Accumulate with Age
The Accumulation of SNVs With Age is Marginally
Significant. We used SAS to analyze the relation between SNVs and
age. During this analysis, we used the natural log transform of the
normalized number of SNVs found in each crypt as our dependent
variable. We used the natural log transform because it yielded data
that were reasonably compatible with the assumptions of the normal
distribution.
We fit our data to a linear regression model using the “proc glm”
function. This is a “parametric model” which assumes that the data are
approximately normally distributed and accounts for the different
number of crypts in the different age groups. It fits the following formula:
Log(count) = µage + εsubject + ecrypt + Erandom
67
Where µage is a fixed effect caused by age, εsubject is the random
variation due to differences between subjects; it is approximately
normally distributed with a true mean of 0 and variance σsubject
2
. ecrypt is
approximately normally distributed with a true mean of 0 and variance
σcrypt
2
, and Erandom is approximately normally distributed with a true
mean of 0 and variance σrandom
2
. Together, they are the random
variation due to errors and the differences between crypts. The last two
sources of variation must be fitted together because we did not
sequence the same crypt multiple times.
This analysis showed that εsubject and ecrypt + Erandom have a
variance (standard deviation squared) of 0.0373793 and 0.058504
respectively (Calculation: σsubject
2
= (0.230682 – 0.058504) / 4.9613 =
0.0347, with data from Figure 23). More importantly, the analysis
shows that there is a marginally statistically significant difference
between the three age groups (p = 0.059) (Figure 23).
The “proc glm” function can also be used to assess the level of
variability between subjects within each group. Doing this, it was
shown that the difference between individuals in the young group was
statistically significant (p = 0.0011), while the difference between
individuals within the middle aged and old groups was not (p = 0.1247,
and p = 0.1033, respectively) (the output for this analysis was too large
to be included in a figure, but is available in the Chapter 6. APPENDIX.
We also used the “lsmeans” function to perform Tukey’s honest
significant difference (honest comparisons) test to see if there were
pairwise significant differences between the individual age groups. This
multiple comparisons test showed that the difference between the
young and middle-aged groups was marginally significant (p = 0.09),
while the difference between the young and old age groups was clearly
significant (p = 0.05) and the difference between the middle-aged and
old groups was not significant (p = 0.50) (Figure 24).
68
Figure 23. Output of the SAS “PRC GLM” Function.
SAS was used to fit the log of normalized mutations for each crypt to the function:
Log(count) = µage + εsubject + ecrypt + Erandom and found a statistically significant
dependence on age.
69
Figure 24. Output of the SAS “lsmeans” Function.
Analysis showed the difference between the young and middle-age and young and
old groups was respectively marginally and strongly significant.
Power Calculation
Early in this study, we had plans to include 20 old and 20 middle
aged subjects. This was before we had access to young samples and
was based on a desire for 90% power, with assumptions of a 2.5-3.0
fold mutation count difference between the two groups (Blokzijl et al.,
2016), a 100% coefficient of variation and a lognormal distribution.
Following this study, we now sought to reassess how many
samples would be needed to have 80%-85% power to detect a
difference in SNV count based on age. Since we now had younger
samples and learned that crypts were not independent of each other,
we switched our model and used the Mixed Linear Model described
above. Assuming similar variation across different age groups, we
70
used the variation between individuals obtained above (0.034704) to
compute a common standard deviation: σ
2
= 0.034704 + (0.058504/3),
which gave us σ = 0.23282. We entered the least squares means
(lsmeans) for each age groups (Table 9) and σ into nQuery (Statsols,
n.d.), with options for a 0.05-level one-way ANOVA (Analysis of
Variance) test. We found that given the assumption that 3 crypts would
be collected for each age group, 6 individuals in each age group would
provide us with a statistical power of 85.4%, nearly the same power we
had aimed for originally, with far fewer samples.
Table 9. Least Squares Means of the Log of SNV Counts for Each Age Group.
Age group Log(counts) LSMEAN
Young 7.24932244
Mid-age 7.71167564
old 7.74209560
3.3 Large Deletions Accumulate with Age
3.3.1 Identification of Deletions
Deletions show a bimodal distribution. Deletions called by
Manta were validated using manual inspection in IGV (Robinson et al.,
2011b) (see Figure 25 and Figure 26 for an examples). The start and
end positions of some reads may be estimates because Manta cannot
precisely determine where the breakpoint occurred. This can happen
when the distance between the paired ends of a read indicates a
deletion (they are much further apart than they should be when
mapping to the genome) but neither end shows an alignment indicating
a breakpoint (they breakpoint occurred in the sequenced region
between them). It can also occur when read aligning shows a
breakpoint, but similar sequences on both sides of the deletion make it
impossible to determine where precisely the break occurred.
71
Figure 25. IGV Output Showing a Deletion Called by Manta.
This shows a deletion in Chromosome 1, at positions 240,448,893 to 240,449,081, in
the L052 crypt sample of the 83 year old subject. Data is also shown for the matching
L070 bulk sample. Grey bars represent reads, with the thick areas showing regions
that match the reference and the thin areas indicating parts of the reference not
found on the read. A clear drop off in coverage, corresponding reads spanning the
same area, is seen in the crypt sample, but not in the bulk sample.
72
Figure 26. Sashimi Plot Shows Apparent Drop in Coverage at Deletion Site Called by Manta.
A deletion in Chromosome 1, at positions 240,448,893 to 240,449,081, in the L052
crypt sample of the 83 year old subject is indicated be a drop in coverage in the crypt
sample (blue) for this region. Coverage is also shown for the matching L070 bulk
sample (red). The clear drop in coverage seen in the crypt sample, but not in the
bulks sample, indicates a deletion mutation.
73
Analysis with Manta identified 88 independent deletions (Table
21). Separating these deletions by size revealed a clear binomial
distribution, with 60 deletions centered around the low hundreds of
base pairs (Manta was set to detect in deletion larger than 50 bp) and
20 deletions over 5000 base pairs (and often much larger). Only 6
deletions were between 1000 and 5000 base pairs (Table 11).
Table 10. Identified Deletions in Each Sample Sorted by Size.
Sample Name
50-
150
150-
250
250-
500
500-
1000
1000
-
2000
2000
-
3000
3000
-
5000
>5000
yr00_crypt_L129 1 0 0 0 0 0 1 1
yr00_crypt_L133 0 0 0 0 0 0 1 0
yr00_crypt_L125 0 0 0 1 0 0 0 0
yr00_crypt_L134 0 0 0 0 0 0 0 0
yr11_crypt_L115 0 1 1 0 0 0 0 0
yr11_crypt_L122 2 0 2 0 0 0 0 0
yr11_crypt_L105 0 0 1 0 0 0 0 0
yr11_crypt_L113 0 1 3 0 0 0 0 0
yr11_crypt_L121 0 0 1 0 0 0 0 0
yr11_crypt_L123 0 0 1 0 0 0 0 0
yr15_crypt_L108 1 0 0 0 0 0 0 1
yr15_crypt_L120 1 2 0 0 1 0 0 0
yr15_crypt_L118 1 0 0 1 1 0 0 0
yr15_crypt_L126 2 1 1 0 0 0 0 0
yr39_crypt_L044 0 0 0 0 0 0 0 0
yr39_crypt_L045 0 0 1 0 0 0 0 0
yr39_crypt_L046 0 0 1 0 0 0 0 0
yr39_crypt_L047 0 0 1 0 0 0 0 1
yr39_crypt_L048 0 0 0 0 0 0 1 1
yr41_crypt_L036 1 0 0 0 0 0 0 0
yr41_crypt_L037 0 0 0 0 0 0 0 0
yr41_crypt_L038 1 0 0 0 0 0 0 1
yr41_crypt_L039 0 0 0 0 0 0 0 1
yr41_crypt_L040 0 0 2 0 0 0 0 0
yr45_crypt_L034 1 0 1 0 0 0 0 0
yr45_crypt_L035 0 0 0 0 0 0 0 0
yr45_crypt_L041 0 0 0 0 0 0 0 1
yr45_crypt_L042 0 0 0 0 0 0 0 0
yr45_crypt_L043 0 0 0 0 0 0 0 2
yr83_crypt_L052 1 1 1 0 0 0 0 2
yr83_crypt_L053 0 0 0 0 0 0 0 1
74
yr83_crypt_L058 1 0 0 0 0 0 0 1
yr83_crypt_L059 0 0 0 1 0 0 0 1
yr83_crypt_L086 1 0 0 1 0 0 0 3
yr85_crypt_lp010 1 1 0 0 0 0 0 0
yr85_crypt_lp011 2 2 0 0 0 0 0 0
yr85_crypt_m1 0 1 1 0 0 0 0 0
yr85_crypt_m2 1 0 0 0 0 0 0 0
yr85_crypt_L055 1 2 0 0 0 0 0 0
yr93_crypt_L030 0 1 0 2 0 0 0 0
yr93_crypt_L031 0 0 0 0 0 0 0 2
yr93_crypt_L064 0 0 0 0 0 0 0 0
yr93_crypt_L065 0 0 0 0 0 0 1 2
yr93_crypt_merg
eL028
1 0 0 0 0 0 0 2
yr93_crypt_L029 0 0 1 0 0 0 0 1
Table 11. Number of Deletions Observed in Each Size Distribution.
Deletion
size (bp)
50-
150
150-
250
250-
500
500-
1000
1000-
2000
2000-
3000
3000-
5000
>5000
Number
of
observed
deletions
20 13 19 6 2 0 4 24
3.3.2 Large Deletions Increase in Number and Size with Age.
Following deletion identification, we saw that the number of
deletions discovered increased with the age of the subject. We first
divided the deletion count for each subject by the number of crypts
taken from that subject to normalize for different levels of sampling. We
found a clear correlation (with noise) between age and the number of
deletions larger than 3000 bp per a crypt (Figure 27). By contrast, no
correlation with age was observed for deletions smaller than 2000 bp
(Figure 28).
75
Figure 27. Deletions Larger than 3000 Base Pairs per Crypt Vs. Age.
The number of large deletions observed in a patient’s crypts, divided by the number
of crypts analyzed for that patient, was plotted against patient age.
Figure 28. Deletions Smaller than 2000 Base Pairs per Crypt Vs. Age.
The number of small deletions observed in a patient’s crypts, divided by the number
of crypts analyzed for that patient, was plotted against patient age.
We also observed an increase in average deletion size with age
(Table 12 and Figure 43 in Appendix).
Table 12. Size of all Identified Deletions.
Sample Name Deletions Found
yr00_crypt_L129 -3200 -6763 -51
yr00_crypt_L133 -3200
76
yr00_crypt_L125 -684
yr00_crypt_L134
yr11_crypt_L115 -483 -153
yr11_crypt_L122 -92 -427 -140
-
440
yr11_crypt_L105 -440
yr11_crypt_L113 -212 -416 -276
-
440
yr11_crypt_L121 -440
yr11_crypt_L123 -440
yr15_crypt_L108 -119 -6210
yr15_crypt_L120 -119 -169 -1434
-
205
yr15_crypt_L118 -119 -640 -1434
yr15_crypt_L126 -119 -440 -233 -64
yr39_crypt_L044
yr39_crypt_L045 -364
yr39_crypt_L046 -301
yr39_crypt_L047 -11537 -364
yr39_crypt_L048 -3698 -23777
yr41_crypt_L036 -52
yr41_crypt_L037
yr41_crypt_L038 -58
-
3149029
yr41_crypt_L039
-
22784785
yr41_crypt_L040 -333 -336
yr45_crypt_L034 -144 -379
yr45_crypt_L035
yr45_crypt_L041 -224563
yr45_crypt_L042
yr45_crypt_L043 -224493 -13666
yr83_crypt_L052 -188 -87
-
1421814
-
332
-
1299792
yr83_crypt_L053 -1299755
yr83_crypt_L058 -120
-
1421813
yr83_crypt_L059 -567 -81071
yr83_crypt_L086 -699 -19498 -23504 -52
-
1421793
yr85_crypt_lp010 -159 -146
yr85_crypt_lp011 -159 -146 -94
-
171
yr85_crypt_m1 -215 -392
yr85_crypt_m2 -66
yr85_crypt_L055 -240 -99 -180
77
yr93_crypt_L030 -978 -212 -543
yr93_crypt_L031 -119965 -839746
yr93_crypt_L064
yr93_crypt_L065 -3606 -272725 -839868
yr93_crypt_mergeL028 -91 -428249 -377687
yr93_crypt_L029 -337
-
5473275
3.3.3 Deletion Frequencies Observed Are Consistent with Our
Previous Study
We also compared the count of deletions larger than 50,000 bp
that we observed to the results of a previous study where we detected
deletions in colon crypts using microarrays. We only compared
deletions larger than 50,000 bases because we calculate this is the
smallest deletion the microarray could reasonably be used to detect.
We found the overall count for these large deletions, as well as their
increase with age, was consistent with the results of our previous study
(Table 13).
Table 13. Counts of Large Deletions/Crypt From This Study and (J. Hsieh et al., 2013).
Present Study 2013 Microarray Study
Age Deletions/crypt Age Deletions/Crypt
0 0 17 0
11 0 27 0
15 0 28 0
39 0 36 0
41 0.4 45 0
45 0.4 57 0
72 0
78 0.2
80 0
83 1.0 85 0.1
85 0 89 1.0
93 1.2 98 0.3
78
3.4 Testing Novel Software
3.4.1 Motif and Breakpoint Statistical Analysis Tool Kit
We developed or adapted software to permit us to examine
individual mutation or deletion locations for preferred sequence
features. A previous version of this program with the same algorithm
was developed in the Lieber lab but was written in JavaScript and
could only take pre-formatted sequence input files. Here, this new
program, now called MBSAT, was written in the more useable R
language, and can take more common input files, such VCF.
MBSAT was tested using the ~2 kb sequence surrounding the
150 bp BCL1-MTC fragile zone region of chromosome 11 (hg19), the
~30 kb sequence surrounding the BCL2 (MBR, ICR, and MCR) fragile
zone region of chromosome 18 (Table 14), and a collection of reported
lymphoma breakpoints, with the CpG motif selected for analysis. This
is the same region as was previously analyzed in (Tsai et al., 2008). As
can be seen Figure 29, and was previously demonstrated, MBSAT
found a statistically significant association between lymphoma
breakpoints and the CpG motifs. We have not applied this software to
aging-associated mutations or deletions yet, but it remains an option.
Table 14. Fragil Zone Regions Used for Testing MBSTAT.
Chr Size Start End
BCL1-MTC chr11 150bps 69346757 69346906
BCL2-MBR chr18 150bps 60793424 60793598
BCL2-ICR chr18 105bps 60774480 60774584
BCL2-MCR chr18 561bps 60763906 60764466
79
Figure 29. Results from the MBSAT R-Package Are Consistent with Those Obtained in (Tsai &
Lieber, 2010).
Output from MBSAT (R-Package) is compared to that obtained in the previous paper
(Original) for BCL1-MTC fragile zone. The results are highly consistent between the
two.
3.4.2 Slippage-Associated Repeat Identification and Analysis
We were interested in asking if some mutation or deletion
locations were had nearby repeats that might account for DNA
replication slippage events that might then induce a subsequent
mutation event. We developed SARIA for this purpose.
SARIA was tested using the ~30 kb BCL2 region from
chromosome 18 (Table 14) as well as the E2A regions from
chromosome 19 (nt position 1615821 to 1619109). These regions were
selected because they contain several fragile zones (marked in pink in
Figure 31), known to experience double stranded breaks during
lymphoma development. We also instructed SARIA to track the
presence of CG motifs in the identified repeats. As can be seen in
Figure 30, SARIA returned a list of all repeats in the region of interest,
including their position, sequence, the window between them (if any), if
they are direct or inverted repeats, and the number of motifs of interest
(CG in this case) found inside the repeat. SARIA also returned a
graphical output showing the density of repeats through the search
80
region (Figure 31). No relation was observed between the repeat
density observed by SARIA and the known fragile zones for lymphoma
translocations. This program has not been applied to the deletions
identified in the aging project yet, but this could be done in the future.
Figure 30. A Partial SARIA Output Table for the E2A Region.
“Start Position” is the position (in the read in sequence) of the first base of the first
repeat. “Repeat1” and “Repeat2” are the two matching repeats. “Window” is the
sequence between the repeats (if any). “Direct/Inverted” specifies if the identified
repeat is a direct repeat or an inverted repeat. The final column shows the count of
the motif of interest (if given, CG in this case) within the identified repeat.
Start Position Repeat1 Window Repeat2 Direct/Inverted # of "CG" Motif
66 GGTGGG GGTCAG GGTGGG direct 0
67 GTGGGGG TCAGGGTGGGAGCCCAGGGCAGGTGTCACC GTGGGGG direct 1
162 CCGGCG GGG CCGGCG direct 4
183 CTGGGC GTGTGAGGTGGACAGCT CTGGGC direct 1
293 TGCCTC TGCCTC direct 0
445 GGAGGAGG CAAGAGTAGGCAAGTACCTGGA GGAGGAGG direct 0
450 AGGCAAG AGT AGGCAAG direct 0
487 CAGGCA GAGGGCGCAGCC CAGGCA direct 1
516 CCTGGG GCTGGACTGCA CCTGGG direct 0
549 AGGAGA CCCGTGGGGTTGGAGCAGAGTG AGGAGA direct 1
578 GGAGAG GGAGAG direct 0
579 GAGAGGGA GAGAGGGA direct 0
605 GCGGGGC AGGGCAGGTCAT GCGGGGC direct 2
623 TGCGGGG CCTTGTGGGC TGCGGGG direct 2
685 GGAGGG CTGA GGAGGG direct 0
832 GGGCCC TGCATCTCCTGAAGGCG GGGCCC direct 1
861 ACAGGA TTTGTGATGG ACAGGA direct 0
887 AGGGTG AGAGAAGGC AGGGTG direct 0
961 AAGGAA GGGGACA AAGGAA direct 0
967 GGGGACAAA GGAAAAGGTT GGGGACAAA direct 0
985 TGGGGA CAAATCTGGACCTGGGCC TGGGGA direct 0
1001 CCTGGG CCTGGG direct 0
1118 CTTTCCA GGCAGA CTTTCCA direct 0
1164 CCAGGG CATCTCACCGCAGCGGCTCTGCCC CCAGGG direct 2
1197 GGGGACA CTGGGTGATGTCT GGGGACA direct 0
81
Figure 31. Graphical Output from SARIA Analysis of the BCL2 Fragile Zone Region.
The input sequence is plotted along the x-axis in base pairs. The y-axis is the number
of repeats (with at least the minimum length and an allowed number of mismatches)
that are within the given window size of each position in the input sequence. Known
fragile zones are highlighted in pink.
3.5 Alignment Free Analysis
Alignment free analysis is not effective for studying
somatic mutations. We looked at our CAFE results to see if
alignment free analysis could be effectively used to measure the level
of difference between bulk and crypt samples, and thus measure the
increase in this difference with age. Unfortunately, in our testing, we
found that it was very difficult for CAFE to distinguish the difference
between crypts from the same person from crypts from different people
(Figure 32). Based on this, we concluded distinguishing the increase in
difference with age would be impossible and did not pursue the
analysis further.
82
Figure 32. CAFE Analysis of the Difference Between 4 Crypts.
Each box indicates the similarity of its two intersecting samples (e.g. the upper right
box indicates the similarity between crypt L040 from the 41 year old subject and crypt
L045 from the 39 year old subject). Boxes are colored based on the CAFE similarity
scale, where darker boxes indicate more similar samples.
83
Chapter 4. DISCUSSION
4.1 Summary
Here, we have studied the accumulation of somatic mutations in
humans with age, without resorting to either extensive PCR or cell
culture. Our results are generally similar to those of groups that did use
these methods, though our filtering criteria are more explicit and
inclusive, and in this sense, more reliable. As expected, we have found
an increase in SNVs with age; interestingly, the increase between the
young (aged 0, 11, 15) and middle aged (aged 39, 41, 45) subjects is
greater than the increase between the middle aged and the oldest
(aged 83, 85, 93) subjects, although the middle aged subjects are
closer in age to the younger subjects than the oldest subjects. We
have also observed an increase in large (>3000 bp) deletions with age,
but not deletions between 50 to 2000 bp in size, though <50 bp
deletions have not been studied yet. The patterns of the mutations we
have found suggest multiple mutational mechanisms.
4.2 Mutations Observed and Comparison to
Existing Studies
The SNVs observed in this study did not show any significant
bias for occurring at particular genomic locations, such as exons,
introns or non-coding regions. This suggests that the mechanisms
84
driving these mutations occur uniformly throughout the genome and
are therefore probably not very dependent on DNA accessibility or
openness. It may be that these mutations mainly occur during
replication (as DNA polymerase errors). It is also worth noting that no
bias against SNVs in exon regions was observed. This suggests that
there is little active selective pressure against these mutations.
As discussed in the Chapter 3. RESULTS section, observed
deletions can be split based on size, into at least 2 groups: a group of
small deletions, of 50-2000 bp, and a group of deletions larger than
3000 bp. This suggests at least two different mutational mechanisms
(such as slippage and non-homologous end joining) may at work.
4.2.1 Mutation Accumulation
In this study we observed that the number of SNVs in young
subjects (0-15 years old) was notably lower than the number of SNVs
found in middle aged (39-45 years old) and older (83-93 years old)
subjects. We found that the mutation rate could be modelled using the
formula: Log(count) = µage + εsubject + ecrypt + Erandom.
The SNV counts observed and the formula we have derived
based on them show a logarithmic increase in SNVs with age. While
there appears to be an SNV increase in the first half of the normal
human lifespan, in this study, the rate at which SNVs accumulate
seems to slow down afterwards. It appears that there is an increase in
SNV count variance between subjects at the older end of the age
spectrum. The reasons for this decreasing mutation rate could be
many-fold, but one possibility is an overall decrease in metabolic
activity as people age. It is also possible that are young subjects (who
are 15 or younger) have a colon that is still growing, and thus has
many more opportunities for mutation, while there should be no overall
colon growth between our middle aged and old subjects.
The logarithmic increase in mutations with age observed in this
study seems to preclude at least one hypothesis that has been
85
discussed since the start of this study: that mutations would hinder the
cell’s ability to accurately transcribe and repair DNA, leading to more
mutations. Such a scenario should have led to an exponential increase
in mutations with age, which is not compatible what was observed.
The SNV counts we have observed in this study are fairly
consistent with the counts observed in another study, which observed
~1000 SNVs in the colon crypts of younger (0-25 years old) subjects
and ~2500 SNVs in the colon crypts of older (50-75 years old) subjects
(Blokzijl et al., 2016). That study was done using organoids grown from
single colon crypt stem cells rather than directly sequencing single
colon crypts. That study reported a linear increase in mutations with
age but lacked an age group to cover the range covered by our middle-
aged group. A logarithmic trend would not have been obvious given
their sample set.
Our SNV counts were also consistent with SNV counts
observed in liver and small intestine stem cells (based on organoid
expansion and sequencing) (Blokzijl et al., 2016) and brain
neurons (based on single cell genome amplification and clonal
expansion) (Lodato et al., 2015). Put together, this shows that different
methods for studying the accumulation of somatic mutations in humans
return relatively consistent results, and that different cells in highly
different places and roles accumulate mutations at relatively similar
rates.
We also found that the number of large (>3000 base pairs)
deletions shows an increase with age. There is no such apparent trend
for smaller deletions in the range of 50 to 2000 bps. The basis for this
disparity is unclear, but since the mutations of different sizes likely
occur through different mechanisms, it might be expected that they
accumulate in a different manner. It should also be noted that this
disparity has not been observed before, and therefor requires further
confirmation. The number of very large (>50,000 base pairs) structural
86
changes (including deletions) we have observed is consistent with our
own and other’s previous studies (Blokzijl et al., 2016; J. Hsieh et al.,
2013). The Manta software is not useful for examining deletions <50
bps, and it will be interesting to study smaller deletions (in the vcf files)
using other methods in the future.
4.2.2 Comparison to Other Studies
Comparing with the paper (Blokzijl et al., 2016), which is one of
the leading studies in the field, our design differs mainly in two areas.
The first is library material collection. Figure 33 shows the culturing
steps used in this study. While we agree that organoid culture grown
from a single cell will likely to provide a uniform genetic profile, the
weeks long culture could potentially induce in vitro mutations. We
believe the design of our study, which avoided any culturing, provides
more accurate insights.
Figure 33. Schematic overview of the Experimental Setup Used in Another Study.
Adopted from (Blokzijl et al., 2016) Extended Data Figure 1a. Single cells were
grown into organoids for sequencing, in a potentially mutation inducing process.
Second, we took a different route when analyzing the variants.
In their Extended Data Figure 1, one of their significant filtering steps is
excluding variant positions that overlapped with the dbSNP databases.
We tested this step but did not keep it in our analysis due to concerns
that real somatic mutations that happen to match dbSNP records might
be excluded. Table 25 and Table 26 show the analysis results using
this filter. For almost all crypts, we noticed a minor increase in Ti/Tv
ratio, after removing mutations matching dbSNP138 records. When
applying this filter in exon regions, with their limited number of
mutations to begin with, it is difficult to observe any trend.
87
In addition, the paper excluded variants that were found to be
variable in at least two of three unrelated individuals. Again, we believe
that naturally occurring somatic mutations could fall into this category
and thus did not implement this step.
Throughout the variant calling procedure, we learned that
different variant caller programs are likely to disagree with each other,
as we showed in Software Output Overlap section. This has already
been observed and reported by another group (Hofmann et al., 2017).
This study was started with the assumption that all cells in a
colon crypt are monoclonal and can be traced back to a single stem
cell in a relatively short (in terms of the human lifespan) timeframe.
This assumption is supported by certain reports (J. Hsieh et al., 2013;
Snippert et al., 2010; Yatabe et al., 2001), but others suggest that
colon crypts may be somewhat polyclonal, with separate “ribbons” in
the crypt each being traceable back to a different stem cell (Bjerknes &
Cheng, 1999; Clevers, 2013). The deletion presented in Figure 25
indicates monoclonality since it shows a near 50% reduction in
coverage from the bulk sample. However, some other deletions we
have observed do not show such an ideal ratio (see Figure 44 for an
example) and therefore may indicate a degree of polyclonality in the
crypts. Still, our SNV calling pipeline does not depend very heavily on
allele frequency (see our workflow in Figure 12, and homozygous and
heterozygous cutoff details at the end of section 2.5.1) and Manta does
not use coverage at all when calling mutations (Chen et al., 2015).
Therefore, our mutation calling should be robust even in the face of
minor polyclonality.
4.3 Mutations as a Mechanism of Aging
We found that colon crypts of young subjects contained ~15
exon SNVs, while the colon crypts of older subjects contained ~30
exon SNVs. There is a clear increase with age and exon SNVs have a
88
good chance of being deleterious if they are missense or nonsense
mutations.
The extent to which these mutations might damage a person’s
health is unclear, large scale genetic studies have estimated that the
average individual is carrying 250-500 inherited missense or loss of
function variants in protein coding genes, with possibly as many as 85
of these mutations being homozygous (Xue et al., 2012) (1000
Genomes Project Consortium et al., 2010). However, all but ~80 of
these changes are inherited from the population, and thus must be
compatible with normal development. The new ~80 mutations have
also been developmentally ‘filtered’ for those mutations that would
have caused a spontaneous abortion. Thus, there is a clear selective
pressure on these mutations that may make them less harmful than
the somatic mutations acquired over a lifetime, which was the focus
here.
4.4 Future Directions
We believe this study can be expanded on in several ways in
future studies. Below is the list of items.
1. Include more subjects. This study only analyzed samples from
9 subjects and we believe the addition of more subjects will
improve statistical power when looking for age related trends
and allow for factors such as ethnicity and environment to be
controlled for. Depending on the future sample sizes (see Power
Calculation in Statistical Analysis: SNVs Accumulate with Age),
we can further control and study variables such as gender, race,
in addition to age. Preliminary power calculations suggest that
6 individuals in each group (young, mid-age, and older) would
be needed for a definitive study.
2. With this data sets there are additional factors that can be
studied.
89
a. Sex chromosomes, mitochondrial data (currently
being processed by Yingfei Wang), and telomeres
can be studied, as these are hypothesized to factor
into aging.
b. Analysis on small (<50 bp) insertions and deletions,
as well as other structural variations such as
duplications and translocations, can be informative.
c. The surrounding sequences of any identified
mutations can may offer insights into their
mechanisms of occurrence.
d. Variability between crypts from the same individual
can be studied to see how cells diverge from each
other with age.
e. It would be interesting to see if pooling all reads from
all crypts in an individual leads to a variant profile
consistent with the bulk sample from that individual. In
theory, this should occur due to the averaging effect
discussed at in the introduction.
f. The impact of modifying our mutation calling pipeline
could be explored in several ways. For example,
when we filtered based on heterogeneity, we only
considered Tier 2 counts from Strelka, but in future
studies, it may worth looking at Tier 1 counts as well.
We have tried several different coverage cutoffs (10X,
15X and 20X) for calling mutations without seeing a
major change in results, but it may be worth looking
into other options. Especially as deeper coverage
might allow us to better identify deletions and loss of
heterozygosity events.
90
BIBLIOGRAPHY
1000 Genomes Project Consortium, T. 1000 G. P., Abecasis, G. R.,
Altshuler, D., Auton, A., Brooks, L. D., Durbin, R. M., … McVean,
G. A. (2010). A map of human genome variation from population-
scale sequencing. Nature, 467(7319), 1061–73.
https://doi.org/10.1038/nature09534
Agilent-Technologies. (2005). Agilent 2100 Bioanalyzer 2100 Expert
User’s Guide. Retrieved from
http://www.genomics.agilent.com/literature.jsp?crumbAction=push
&tabId=AG-PR-1001
Andrews Simon. (n.d.). Babraham Bioinformatics - FastQC A Quality
Control tool for High Throughput Sequence Data.
https://doi.org/v.0.11.5
Barker, N. (2014). Adult intestinal stem cells: critical drivers of epithelial
homeostasis and regeneration. Nature Reviews Molecular Cell
Biology, 15(1), 19–33. https://doi.org/10.1038/nrm3721
Bates, G. P., Dorsey, R., Gusella, J. F., Hayden, M. R., Kay, C.,
Leavitt, B. R., … Tabrizi, S. J. (2015). Huntington disease. Nature
Reviews Disease Primers, 1, 15005.
https://doi.org/10.1038/nrdp.2015.5
Bjerknes, M., & Cheng, H. (1999). Clonal analysis of mouse intestinal
epithelial progenitors. Gastroenterology, 116(1), 7–14.
https://doi.org/10.1016/S0016-5085(99)70222-2
91
Blokzijl, F., de Ligt, J., Jager, M., Sasselli, V., Roerink, S., Sasaki, N.,
… van Boxtel, R. (2016). Tissue-specific mutation accumulation in
human adult stem cells during life. Nature, 538(7624), 260–264.
https://doi.org/10.1038/nature19768
Broad Institute. (n.d.). Picard Tools - By Broad Institute. Retrieved
November 12, 2018, from http://broadinstitute.github.io/picard/
Bryc, K., Patterson, N., & Reich, D. (2013). A novel approach to
estimating heterozygosity from low-coverage genome sequence.
Genetics, 195(2), 553–61.
https://doi.org/10.1534/genetics.113.154500
Callari, M., Sammut, S.-J., De Mattos-Arruda, L., Bruna, A., Rueda, O.
M., Chin, S.-F., & Caldas, C. (2017). Intersect-then-combine
approach: improving the performance of somatic variant calling in
whole exome sequencing data using multiple aligners and callers.
Genome Medicine, 9(1), 35. https://doi.org/10.1186/s13073-017-
0425-1
Chen, X., Schulz-Trieglaff, O., Shaw, R., Barnes, B., Schlesinger, F.,
Cox, A. J., … Saunders, C. T. (2015). Manta: Rapid detection of
structural variants and indels for clinical sequencing applications.
BioRxiv, 024232. https://doi.org/10.1101/024232
Cibulskis, K., Lawrence, M. S., Carter, S. L., Sivachenko, A., Jaffe, D.,
Sougnez, C., … Getz, G. (2013). Sensitive detection of somatic
point mutations in impure and heterogeneous cancer samples.
Nature Biotechnology, 31(3), 213–219.
https://doi.org/10.1038/nbt.2514
Clevers, H. (2013). XThe intestinal crypt, a prototype stem cell
compartment. Cell, 154(2), 274–284.
https://doi.org/10.1016/j.cell.2013.07.004
Consortium, I. H. G. S. (2001). Initial sequencing and analysis of the
human genome. Nature, 409(6822), 860–921.
https://doi.org/10.1038/35057062
92
Crowson, C. S., Therneau, T. M., Davis, J. M., Roger, V. L., Matteson,
E. L., & Gabriel, S. E. (2013). Brief report: accelerated aging
influences cardiovascular disease risk in rheumatoid arthritis.
Arthritis and Rheumatism, 65(10), 2562–6.
https://doi.org/10.1002/art.38071
da Costa, J. P., Vitorino, R., Silva, G. M., Vogel, C., Duarte, A. C., &
Rocha-Santos, T. (2016). A synopsis on aging-Theories,
mechanisms and future prospects. Ageing Research Reviews, 29,
90–112. https://doi.org/10.1016/j.arr.2016.06.005
ECO. (2007). Tech Summary: Illumina’s Solexa Sequencing
Technology. Retrieved from
http://seqanswers.com/forums/showthread.php?t=21
Hofmann, A. L., Behr, J., Singer, J., Kuipers, J., Beisel, C., Schraml,
P., … Beerenwinkel, N. (2017). Detailed simulation of cancer
exome sequencing data reveals differences and common
limitations of variant callers. BMC Bioinformatics, 18(1), 8.
https://doi.org/10.1186/s12859-016-1417-7
Hsieh, C.-L. (2018). Novel Lines of Evidence for the Asymmetric
Strand Displacement Model of Mitochondrial DNA Replication.
Molecular and Cellular Biology, MCB.00406-18.
https://doi.org/10.1128/MCB.00406-18
Hsieh, J., Van Den Berg, D., Kang, H., Hsieh, C. L., & Lieber, M. R.
(2013). Large chromosome deletions, duplications, and gene
conversion events accumulate with age in normal human colon
crypts. Aging Cell, 12(2), 269–279.
https://doi.org/10.1111/acel.12053
Human genome overview - Genome Reference Consortium. (n.d.).
Retrieved October 13, 2014, from
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human
/
Hurd, P. J., & Nelson, C. J. (2009). Advantages of next-generation
93
sequencing versus the microarray in epigenetic research.
Briefings in Functional Genomics & Proteomics, 8(3), 174–83.
https://doi.org/10.1093/bfgp/elp013
Infinium Omni2.5-8 Kit Support. (n.d.). Retrieved November 18, 2018,
from https://support.illumina.com/array/array_kits/humanomni2_5-
8_beadchip_kit/downloads.html
Intlekofer, K. A., & Cotman, C. W. (2013). Exercise counteracts
declining hippocampal function in aging and Alzheimer’s disease.
Neurobiology of Disease, 57, 47–55.
https://doi.org/10.1016/j.nbd.2012.06.011
Kimura, H., Iyehara-Ogawa, H., & Kato, T. (1994). Slippage--
misalignment: to what extent does it contribute to mammalian cell
mutagenesis? Mutagenesis, 9(5), 395–400. Retrieved from
http://www.ncbi.nlm.nih.gov/pubmed/7837971
Koboldt, D. C., Zhang, Q., Larson, D. E., Shen, D., Mclellan, M. D., Lin,
L., … Wilson, R. K. (2012). VarScan 2 : Somatic mutation and
copy number alteration discovery in cancer by exome sequencing
VarScan 2 : Somatic mutation and copy number alteration
discovery in cancer by exome sequencing. Genome Research,
22(3), 568–576. https://doi.org/10.1101/gr.129684.111
Li, H. (2013). Aligning sequence reads, clone sequences and
assembly contigs with BWA-MEM. ArXiv Preprint ArXiv, 00(00), 3.
https://doi.org/arXiv:1303.3997 [q-bio.GN]
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment
with Burrows-Wheeler transform. Bioinformatics (Oxford,
England), 25(14), 1754–60.
https://doi.org/10.1093/bioinformatics/btp324
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., …
1000 Genome Project Data Processing Subgroup. (2009). The
Sequence Alignment/Map format and SAMtools. Bioinformatics,
25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352
94
Lieber, M. R., & Karanjawala, Z. E. (2004). Ageing, repetitive genomes
and DNA damage. Nature Reviews. Molecular Cell Biology, 5(1),
69–75. https://doi.org/10.1038/nrm1281
Lodato, M. A., Woodworth, M. B., Lee, S., Evrony, G. D., Mehta, B. K.,
Karger, A., … Walsh, C. A. (2015). Somatic mutation in single
human neurons tracks developmental and transcriptional history.
Science (New York, N.Y.), 350(6256), 94–98.
https://doi.org/10.1126/science.aab1785
Lu, Y. Y., Tang, K., Ren, J., Fuhrman, J. A., Michael, S., & Sun, F.
(2016). Genome analysis CAFE : aCcelerated Alignment-FrEe
sequence analysis, 2015–2017.
Marinković, M., de Leeuw, W. C., de Jong, M., Kraak, M. H. S.,
Admiraal, W., Breit, T. M., & Jonker, M. J. (2012). Combining next-
generation sequencing and microarray technology into a
transcriptomics approach for the non-model organism Chironomus
riparius. PloS One, 7(10), e48096.
https://doi.org/10.1371/journal.pone.0048096
McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K.,
Kernytsky, A., … DePristo, M. A. (2010). The Genome Analysis
Toolkit: a MapReduce framework for analyzing next-generation
DNA sequencing data. Genome Research, 20(9), 1297–303.
https://doi.org/10.1101/gr.107524.110
New England BioLabs Inc. (n.d.). NEBNext DNA Library Prep Master
Mix Set for Illumina. Retrieved from
https://www.neb.com/~/media/Catalog/All-
Products/23EF8BC783C749D9B664D90C8C72FA73/Datacards
or Manuals/manualE7370.pdf
Patterned Flow Cell Technology. (n.d.). Retrieved November 12, 2018,
from https://www.illumina.com/science/technology/next-
generation-sequencing/sequencing-technology/patterned-flow-
cells.html
95
Quinlan, A. R., & Hall, I. M. (2010). BEDTools: a flexible suite of
utilities for comparing genomic features. Bioinformatics, 26(6),
841–842. https://doi.org/10.1093/bioinformatics/btq033
Raczy, C., Petrovski, R., Saunders, C. T., Chorny, I., Kruglyak, S.,
Margulies, E. H., … Tanner, S. W. (2013). Isaac: Ultra-fast whole-
genome secondary analysis on Illumina sequencing platforms.
Bioinformatics, 29(16), 2041–2043.
https://doi.org/10.1093/bioinformatics/btt314
Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M.,
Lander, E. S., Getz, G., & Mesirov, J. P. (2011a). Integrative
genomics viewer. Nature Biotechnology, 29(1), 24–6.
https://doi.org/10.1038/nbt.1754
Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M.,
Lander, E. S., Getz, G., & Mesirov, J. P. (2011b). Integrative
genomics viewer. Nature Biotechnology, 29(1), 24–6.
https://doi.org/10.1038/nbt.1754
Saunders, C. T., Wong, W. S. W., Swamy, S., Becq, J., Murray, L. J.,
& Cheetham, R. K. (2012). Strelka: accurate somatic small-variant
calling from sequenced tumor-normal sample pairs.
Bioinformatics, 28(14), 1811–1817.
https://doi.org/10.1093/bioinformatics/bts271
Snippert, H. J., van der Flier, L. G., Sato, T., van Es, J. H., van den
Born, M., Kroon-Veenboer, C., … Gilbertson, R. J. (2010).
Intestinal crypt homeostasis results from neutral competition
between symmetrically dividing Lgr5 stem cells. Cell, 143(1), 134–
44. https://doi.org/10.1016/j.cell.2010.09.016
Statsols. (n.d.). nQuery Advanced - User Manual. Retrieved from
https://www.statsols.com/hubfs/Resources_/nQuery
Manuals/nQuery Advanced User Manual/nQuery Advanced User
Manual (nQuery 8.0).pdf?hsLang=en-us
Tsai, A. G., & Lieber, M. R. (2010). Mechanisms of chromosomal
96
rearrangement in the human genome. BMC Genomics, 11 Suppl
1(Suppl 1), S1. https://doi.org/10.1186/1471-2164-11-S1-S1
Tsai, A. G., Lu, H., Raghavan, S. C., Muschen, M., Hsieh, C.-L., &
Lieber, M. R. (2008). Human chromosomal translocations at CpG
sites and a theoretical basis for their lineage and stage specificity.
Cell, 135(6), 1130–42. https://doi.org/10.1016/j.cell.2008.10.035
VCF (Variant Call Format) version 4.1 | 1000 Genomes. (n.d.).
Retrieved October 13, 2014, from
http://www.1000genomes.org/wiki/Analysis/Variant Call
Format/vcf-variant-call-format-version-41
Wang, J., Raskin, L., Samuels, D. C., Shyr, Y., & Guo, Y. (2015).
Genome measures used for quality control are dependent on gene
function and ancestry. Bioinformatics (Oxford, England), 31(3),
318–23. https://doi.org/10.1093/bioinformatics/btu668
Wang, K., Li, M., & Hakonarson, H. (2010). ANNOVAR: functional
annotation of genetic variants from high-throughput sequencing
data. Nucleic Acids Research, 38(16), e164.
https://doi.org/10.1093/nar/gkq603
Xu, C. (2018). A review of somatic single nucleotide variant calling
algorithms for next-generation sequencing data. Computational
and Structural Biotechnology Journal, 16, 15–24.
https://doi.org/10.1016/j.csbj.2018.01.003
Xue, Y., Chen, Y., Ayub, Q., Huang, N., Ball, E. V, Mort, M., … 1000
Genomes Project Consortium. (2012). Deleterious- and disease-
allele prevalence in healthy individuals: insights from current
predictions, mutation databases, and population-scale
resequencing. American Journal of Human Genetics, 91(6), 1022–
32. https://doi.org/10.1016/j.ajhg.2012.10.015
Yatabe, Y., Tavaré, S., & Shibata, D. (2001). Investigating stem cells in
human colon by using methylation patterns. Proceedings of the
National Academy of Sciences of the United States of America,
97
98(19), 10839–44. https://doi.org/10.1073/pnas.191225998
98
APPENDIX I: Additional Material for
the Thesis Project
Appendix I.1 All Library Preparation
Table 15. Summary of All Prepared Libraries.
Sample Name Individual Age
(year)
Bulk
Or
Crypt
Barcode
(i5 = 500,
i7 listed)
Prepared By
L090 062117 0.6 Bulk 4 Shuangchao Ma
L129 062117 0.6 Crypt 7 Shuangchao Ma
L133 062117 0.6 Crypt 11 Shuangchao Ma
L125 062117 0.6 Crypt 3 Shuangchao Ma
L131 062117 0.6 Crypt 9 Shuangchao Ma
L134 062117 0.6 Crypt 12 Shuangchao Ma
L087 062117 0.6 Crypt 1 Shuangchao Ma
L088 062117 0.6 Crypt 1 Shuangchao Ma
L089 062117 0.6 Crypt 3 Shuangchao Ma
L091 062117 0.6 Crypt 5 Shuangchao Ma
L093 062117 0.6 Crypt 7 Shuangchao Ma
L094 062117 0.6 Crypt 8 Shuangchao Ma
L095 062117 0.6 Crypt 9 Shuangchao Ma
L096 062117 0.6 Crypt 10 Shuangchao Ma
L097 062117 0.6 Crypt 11 Shuangchao Ma
L098 062117 0.6 Crypt 12 Shuangchao Ma
L099 062117 0.6 Crypt 1 Shuangchao Ma
L100 062117 0.6 Crypt 2 Shuangchao Ma
L101 062117 0.6 Crypt 3 Shuangchao Ma
L102 062117 0.6 Crypt 4 Shuangchao Ma
L103 062117 0.6 Crypt 5 Shuangchao Ma
L124 062117 0.6 Crypt 2 Shuangchao Ma
L127 062117 0.6 Crypt 5 Shuangchao Ma
L128 062117 0.6 Crypt 6 Shuangchao Ma
99
L130 062117 0.6 Crypt 8 Shuangchao Ma
L132 062117 0.6 Crypt 10 Shuangchao Ma
L116 121317 11 Bulk 6 Shuangchao Ma
L115 121317 11 Crypt 5 Shuangchao Ma
L122 121317 11 Crypt 12 Shuangchao Ma
L105 121317 11 Crypt 7 Shuangchao Ma
L113 121317 11 Crypt 3 Shuangchao Ma
L121 121317 11 Crypt 11 Shuangchao Ma
L123 121317 11 Crypt 1 Shuangchao Ma
L106 121317 11 Crypt 8 Shuangchao Ma
L107 121317 11 Crypt 9 Shuangchao Ma
L112 121317 11 Crypt 2 Shuangchao Ma
L114 121317 11 Crypt 4 Shuangchao Ma
L111 090617 15 Bulk 1 Shuangchao Ma
L108 090617 15 Crypt 10 Shuangchao Ma
L120 090617 15 Crypt 10 Shuangchao Ma
L118 090617 15 Crypt 8 Shuangchao Ma
L126 090617 15 Crypt 4 Shuangchao Ma
L104 090617 15 Crypt 6 Shuangchao Ma
L109 090617 15 Crypt 11 Shuangchao Ma
L110 090617 15 Crypt 12 Shuangchao Ma
L117 090617 15 Crypt 7 Shuangchao Ma
L119 090617 15 Crypt 9 Shuangchao Ma
L071 cgnn03 39 Bulk 6 Shuangchao Ma
L044 cgnn03 39 Crypt 2 Dr. Chih-Lin Hsieh
L045 cgnn03 39 Crypt 3 Dr. Chih-Lin Hsieh
L046 cgnn03 39 Crypt 4 Dr. Chih-Lin Hsieh
L047 cgnn03 39 Crypt 5 Dr. Chih-Lin Hsieh
L048 cgnn03 39 Crypt 7 Dr. Chih-Lin Hsieh
L068 12105 41 Bulk 3 Shuangchao Ma
L036 12105 41 Crypt 11 Dr. Chih-Lin Hsieh
L037 12105 41 Crypt 12 Dr. Chih-Lin Hsieh
L038 12105 41 Crypt 7 Dr. Chih-Lin Hsieh
L039 12105 41 Crypt 8 Dr. Chih-Lin Hsieh
L040 12105 41 Crypt 9 Dr. Chih-Lin Hsieh
L067 12011 45 Bulk 2 Shuangchao Ma
L034 12011 45 Crypt 9 Dr. Chih-Lin Hsieh
L035 12011 45 Crypt 10 Dr. Chih-Lin Hsieh
L041 12011 45 Crypt 11 Dr. Chih-Lin Hsieh
L042 12011 45 Crypt 12 Dr. Chih-Lin Hsieh
L043 12011 45 Crypt 1 Dr. Chih-Lin Hsieh
L070 12260 83 Bulk 5 Shuangchao Ma
L052 12260 83 Crypt 6 Dr. Chih-Lin Hsieh
L053 12260 83 Crypt 7 Dr. Chih-Lin Hsieh
100
L058 12260 83 Crypt 8 Shuangchao Ma
L059 12260 83 Crypt 9 Shuangchao Ma
L086 12260 83 Crypt 11 Dr. Chih-Lin Hsieh
L060 12260 83 Crypt 10 Shuangchao Ma
lp012 12216 85 Bulk 4 Dr. Chih-Lin Hsieh
lp010 12216 85 Crypt 2 Dr. Chih-Lin Hsieh
lp011 12216 85 Crypt 3 Dr. Chih-Lin Hsieh
m1 12216 85 Crypt 5 Shuangchao Ma
m2 12216 85 Crypt 7 Shuangchao Ma
L055 12216 85 Crypt 9 Shuangchao Ma
L027 12430 93 Bulk 4 Dr. Chih-Lin Hsieh
L030 12430 93 Crypt 5 Dr. Chih-Lin Hsieh
L031 12430 93 Crypt 6 Dr. Chih-Lin Hsieh
L064 12430 93 Crypt 3 Shuangchao Ma
L065 12430 93 Crypt 6 Shuangchao Ma
mergeL028 12430 93 Crypt 7 Dr. Chih-Lin Hsieh
L029 12430 93 Crypt 5 Dr. Chih-Lin Hsieh
L054 12430 93 Crypt 8 Dr. Chih-Lin Hsieh
Appendix I.2 Supplementary Data for SNV
Counts
Table 16. GATK and Strelka SNV Calls Overlap and Divergence.
Sample Name
GATK
Total
SNV
Calls
GATK
Only
SNV
Calls
GATK
Only
SNV
Calls
Ratio
Strelka
Total
SNV
Calls
Strelka
Only
SNV
Calls
Strel
ka
Only
SNV
Calls
Ratio
Strelka
and
GATK
SNV Calls
yr00_crypt_L129 303657 295281 0.97 69981 61605 0.88 8376
yr00_crypt_L133 300931 292896 0.97 66218 58183 0.88 8035
yr00_crypt_L125 302799 294038 0.97 74001 65240 0.88 8761
yr00_crypt_L134 307279 298995 0.97 71545 63261 0.88 8284
yr11_crypt_L115 275244 269599 0.98 67541 61896 0.92 5645
yr11_crypt_L122 278888 273475 0.98 56966 51553 0.90 5413
yr11_crypt_L105 292520 286786 0.98 57807 52073 0.90 5734
yr11_crypt_L113 280219 275150 0.98 62058 56989 0.92 5069
yr11_crypt_L121 286328 280977 0.98 55454 50103 0.90 5351
yr11_crypt_L123 278598 272098 0.98 64506 58006 0.90 6500
yr15_crypt_L108 317640 311014 0.98 44568 37942 0.85 6626
yr15_crypt_L120 340708 331453 0.97 63468 54213 0.85 9255
yr15_crypt_L118 338777 330616 0.98 61919 53758 0.87 8161
yr15_crypt_L126 326380 318459 0.98 64927 57006 0.88 7921
101
yr39_crypt_L044 317307 308640 0.97 49730 41063 0.83 8667
yr39_crypt_L045 315835 305408 0.97 58415 47988 0.82 10427
yr39_crypt_L046 327768 318857 0.97 61200 52289 0.85 8911
yr39_crypt_L047 324213 314174 0.97 67104 57065 0.85 10039
yr39_crypt_L048 316753 306327 0.97 55074 44648 0.81 10426
yr41_crypt_L036 291847 283859 0.97 49115 41127 0.84 7988
yr41_crypt_L037 286200 278322 0.97 46173 38295 0.83 7878
yr41_crypt_L038 291458 283449 0.97 50860 42851 0.84 8009
yr41_crypt_L039 293543 285073 0.97 58275 49805 0.85 8470
yr41_crypt_L040 322320 311404 0.97 88071 77155 0.88 10916
yr45_crypt_L034 302774 295196 0.97 70082 62504 0.89 7578
yr45_crypt_L035 291590 282343 0.97 55017 45770 0.83 9247
yr45_crypt_L041 319544 310385 0.97 60675 51516 0.85 9159
yr45_crypt_L042 332251 325764 0.98 79588 73101 0.92 6487
yr45_crypt_L043 284101 275237 0.97 52921 44057 0.83 8864
yr83_crypt_L052 272443 261865 0.96 58068 47490 0.82 10578
yr83_crypt_L053 274379 263937 0.96 54623 44181 0.81 10442
yr83_crypt_L058 304220 295856 0.97 51013 42649 0.84 8364
yr83_crypt_L059 287347 278304 0.97 56105 47062 0.84 9043
yr83_crypt_L086 257426 250771 0.97 39238 32583 0.83 6655
yr85_crypt_lp010 337709 331276 0.98 39033 32600 0.84 6433
yr85_crypt_lp011 336802 327441 0.97 41900 32539 0.78 9361
yr85_crypt_m1 772524 757994 0.98 69373 54843 0.79 14530
yr85_crypt_m2
123531
6
1222227 0.99 57621 44532 0.77 13089
yr85_crypt_L055 357751 347465 0.97 58990 48704 0.83 10286
yr93_crypt_L030 310175 298824 0.96 66072 54721 0.83 11351
yr93_crypt_L031 298860 284303 0.95 56657 42100 0.74 14557
yr93_crypt_L064 299983 288277 0.96 51075 39369 0.77 11706
yr93_crypt_L065 293515 281860 0.96 53150 41495 0.78 11655
yr93_crypt_mergeL
028
287331 275775 0.96 61291 49735 0.81 11556
yr93_crypt_L029 313115 304392 0.97 60879 52156 0.86 8723
Table 17. SNV Filtration Based on Position.
Sample Name
GATK
&
Strelka
Overla
pped
SNV
Call
Count
SNV
Calls
That
Fall into
Unalign
able
Contigs
Ratio of
SNV
Calls
That
Fall into
Unalign
able
Contigs
SNV Calls
That Fall
into Sex
Chromos
ome and
Mitochon
dria
Sequence
s
Ratio
of SNV
Calls
That
Fall
into
Sex
Chrom
osome
and
Mitoch
ondria
Seque
nces
Coun
t of
Auto
some
Only
SNVs
Autoso
me
Only
Ti/Tv
Ratio
yr00_crypt_L129 8376 404 0.05 542 0.06 7430 1.29
102
yr00_crypt_L133 8035 437 0.05 532 0.07 7066 1.24
yr00_crypt_L125 8761 434 0.05 531 0.06 7796 1.34
yr00_crypt_L134 8284 427 0.05 473 0.06 7384 1.29
yr11_crypt_L115 5645 274 0.05 467 0.08 4904 1.38
yr11_crypt_L122 5413 300 0.06 488 0.09 4625 1.25
yr11_crypt_L105 5734 293 0.05 622 0.11 4819 1.27
yr11_crypt_L113 5069 286 0.06 451 0.09 4332 1.31
yr11_crypt_L121 5351 310 0.06 537 0.10 4504 1.31
yr11_crypt_L123 6500 301 0.05 643 0.10 5556 1.48
yr15_crypt_L108 6626 463 0.07 370 0.06 5793 1.31
yr15_crypt_L120 9255 519 0.06 510 0.06 8226 1.19
yr15_crypt_L118 8161 376 0.05 481 0.06 7304 1.38
yr15_crypt_L126 7921 403 0.05 496 0.06 7022 1.27
yr39_crypt_L044 8667 819 0.09 581 0.07 7267 1.61
yr39_crypt_L045 10427 1004 0.10 773 0.07 8650 1.60
yr39_crypt_L046 8911 816 0.09 567 0.06 7528 1.58
yr39_crypt_L047 10039 949 0.09 651 0.06 8439 1.67
yr39_crypt_L048 10426 865 0.08 645 0.06 8916 1.63
yr41_crypt_L036 7988 766 0.10 478 0.06 6744 1.63
yr41_crypt_L037 7878 725 0.09 411 0.05 6742 1.71
yr41_crypt_L038 8009 729 0.09 434 0.05 6846 1.81
yr41_crypt_L039 8470 733 0.09 422 0.05 7315 1.68
yr41_crypt_L040 10916 884 0.08 580 0.05 9452 1.80
yr45_crypt_L034 7578 712 0.09 628 0.08 6238 1.60
yr45_crypt_L035 9247 734 0.08 679 0.07 7834 1.64
yr45_crypt_L041 9159 713 0.08 631 0.07 7815 1.89
yr45_crypt_L042 6487 883 0.14 549 0.08 5055 1.28
yr45_crypt_L043 8864 774 0.09 730 0.08 7360 1.85
yr83_crypt_L052 10578 886 0.08 639 0.06 9053 1.71
yr83_crypt_L053 10442 894 0.09 644 0.06 8904 1.73
yr83_crypt_L058 8364 676 0.08 455 0.05 7233 1.69
yr83_crypt_L059 9043 688 0.08 584 0.06 7771 1.74
yr83_crypt_L086 6655 317 0.05 441 0.07 5897 1.68
yr85_crypt_lp010 6433 766 0.12 329 0.05 5338 1.30
yr85_crypt_lp011 9361 731 0.08 585 0.06 8045 1.79
yr85_crypt_m1 14530 5235 0.36 548 0.04 8747 1.76
yr85_crypt_m2 13089 4941 0.38 404 0.03 7744 1.66
yr85_crypt_L055 10286 1712 0.17 451 0.04 8123 1.58
yr93_crypt_L030 11351 825 0.07 741 0.07 9785 1.88
yr93_crypt_L031 14557 778 0.05 1014 0.07
1276
5
1.93
yr93_crypt_L064 11706 834 0.07 776 0.07
1009
6
1.91
yr93_crypt_L065 11655 767 0.07 750 0.06 1013 2.01
103
8
yr93_crypt_mergeL
028
11556 938 0.08 700 0.06 9918 1.80
yr93_crypt_L029 8723 391 0.04 616 0.07 7716 1.75
Table 18. SNV Counts for Each Crypt at Each of the Applied Filtering Steps (15X cutoff for the
Coverage Step).
Sample Name
Autosom
e SNV
Count
Autos
ome
Ti/Tv
Ratio
>=15X
Coverage
Autosom
e SNV
Count
>=15X
Coverage
Autosom
e Ti/Tv
Ratio
>=15X
Coverage
Autosome
with
Heterogen
eity Filter
SNV Count
>=15X
Coverage
Autosome
with
Heterogeneit
y Filter Ti/Tv
Ratio
yr00_crypt_L129 7430 1.29 6306 1.27 2388 1.41
yr00_crypt_L133 7066 1.24 6067 1.22 2333 1.33
yr00_crypt_L125 7796 1.34 6843 1.32 2493 1.38
yr00_crypt_L134 7384 1.29 6331 1.28 2333 1.34
yr11_crypt_L115 4904 1.38 4366 1.38 1888 1.55
yr11_crypt_L122 4625 1.25 3977 1.25 1528 1.34
yr11_crypt_L105 4819 1.27 4275 1.27 1628 1.43
yr11_crypt_L113 4332 1.31 3746 1.30 1584 1.30
yr11_crypt_L121 4504 1.31 3851 1.27 1428 1.42
yr11_crypt_L123 5556 1.48 4980 1.46 1884 1.55
yr15_crypt_L108 5793 1.31 4663 1.27 1905 1.35
yr15_crypt_L120 8226 1.19 6394 1.19 3395 1.24
yr15_crypt_L118 7304 1.38 5839 1.41 2887 1.55
yr15_crypt_L126 7022 1.27 5598 1.27 2661 1.35
yr39_crypt_L044 7267 1.61 6523 1.60 3817 1.77
yr39_crypt_L045 8650 1.60 7873 1.58 4205 1.85
yr39_crypt_L046 7528 1.58 6802 1.57 3943 1.73
yr39_crypt_L047 8439 1.67 7709 1.67 3729 1.80
yr39_crypt_L048 8916 1.63 8157 1.63 4308 1.78
yr41_crypt_L036 6744 1.63 5974 1.64 3254 1.83
yr41_crypt_L037 6742 1.71 5984 1.71 3258 1.88
yr41_crypt_L038 6846 1.81 6017 1.83 3358 1.97
yr41_crypt_L039 7315 1.68 6644 1.69 3291 1.76
yr41_crypt_L040 9452 1.80 8363 1.80 3637 1.83
yr45_crypt_L034 6238 1.60 5617 1.60 2485 1.61
yr45_crypt_L035 7834 1.64 7225 1.65 4226 1.83
yr45_crypt_L041 7815 1.89 7105 1.90 4111 2.01
yr45_crypt_L042 5055 1.28 4249 1.26 1893 1.33
yr45_crypt_L043 7360 1.85 6807 1.86 3043 1.99
yr83_crypt_L052 9053 1.71 8496 1.73 4632 1.91
yr83_crypt_L053 8904 1.73 8276 1.72 4107 1.82
yr83_crypt_L058 7233 1.69 6611 1.68 2886 1.71
yr83_crypt_L059 7771 1.74 7201 1.74 3661 1.85
yr83_crypt_L086 5897 1.68 5438 1.68 2536 1.85
104
yr85_crypt_lp010 5338 1.30 4467 1.29 1796 1.25
yr85_crypt_lp011 8045 1.79 7141 1.83 3753 1.87
yr85_crypt_m1 8747 1.76 6309 1.79 2466 1.75
yr85_crypt_m2 7744 1.66 6088 1.73 2684 1.76
yr85_crypt_L055 8123 1.58 6467 1.61 2326 1.57
yr93_crypt_L030 9785 1.88 8751 1.89 5569 2.01
yr93_crypt_L031 12765 1.93 11872 1.94 6901 2.05
yr93_crypt_L064 10096 1.91 9217 1.92 6083 2.12
yr93_crypt_L065 10138 2.01 9338 2.00 3461 2.00
yr93_crypt_merg
eL028
9918 1.80 8976 1.80 2684 1.61
yr93_crypt_L029 7716 1.75 6776 1.74 3559 1.89
105
Figure 34. SNV Counts (Sorted by Age) With Only the 15X Coverage Filter Applied.
The samples have been grouped by age on the x-axis, and the y-axis shows the
number of SNV counts for each sample. The box and lines define the middle and
outer quartiles respectively. The diamond is centered on the mean and extends the
standard deviation. The additional center line defines the median.
106
Figure 35. SNV Counts (Sorted by Age) With the 15X Coverage and Heterogeneity Filters
Applied.
The samples have been grouped by age on the x-axis, and the y-axis shows the
number of SNV counts for each sample. The box and lines define the middle and
outer quartiles respectively. The diamond is centered on the mean and extends the
standard deviation. The additional center line defines the median.
107
Table 19. SNV Counts for Each Crypt at Each of the Applied Filtering Steps (20X cutoff for the
Coverage Step).
Sample Name
Autosom
e SNV
Count
Autos
ome
Ti/Tv
Ratio
>=20X
Coverage
Autosom
e SNV
Count
>=20X
Coverage
Autosom
e Ti/Tv
Ratio
>=20X
Coverage
Autosome
with
Heterogen
eity Filter
SNV Count
>=20X
Coverage
Autosome
with
Heterogeneit
y Filter Ti/Tv
Ratio
yr00_crypt_L129 7430 1.3 5514 1.3 2014 1.4
yr00_crypt_L133 7066 1.2 5256 1.2 1924 1.3
yr00_crypt_L125 7796 1.3 6036 1.3 2100 1.3
yr00_crypt_L134 7384 1.3 5472 1.3 1936 1.3
yr11_crypt_L115 4904 1.4 3831 1.4 1587 1.7
yr11_crypt_L122 4625 1.3 3497 1.2 1280 1.3
yr11_crypt_L105 4819 1.3 3809 1.2 1351 1.4
yr11_crypt_L113 4332 1.3 3232 1.3 1305 1.3
yr11_crypt_L121 4504 1.3 3370 1.3 1176 1.4
yr11_crypt_L123 5556 1.5 4484 1.4 1625 1.5
yr15_crypt_L108 5793 1.3 3876 1.2 1463 1.3
yr15_crypt_L120 8226 1.2 5197 1.2 2594 1.3
yr15_crypt_L118 7304 1.4 4826 1.4 2256 1.5
yr15_crypt_L126 7022 1.3 4661 1.3 2119 1.4
yr39_crypt_L044 7267 1.6 5454 1.5 3129 1.7
yr39_crypt_L045 8650 1.6 6654 1.5 3550 1.8
yr39_crypt_L046 7528 1.6 5693 1.5 3204 1.7
yr39_crypt_L047 8439 1.7 6539 1.6 3080 1.7
yr39_crypt_L048 8916 1.6 6967 1.6 3611 1.7
yr41_crypt_L036 6744 1.6 5062 1.6 2710 1.8
yr41_crypt_L037 6742 1.7 4972 1.7 2670 1.9
yr41_crypt_L038 6846 1.8 5066 1.8 2777 2.0
yr41_crypt_L039 7315 1.7 5613 1.7 2715 1.8
yr41_crypt_L040 9452 1.8 6932 1.8 2812 1.8
yr45_crypt_L034 6238 1.6 4721 1.6 1954 1.6
yr45_crypt_L035 7834 1.6 6352 1.6 3686 1.8
yr45_crypt_L041 7815 1.9 6195 1.8 3539 2.0
yr45_crypt_L042 5055 1.3 3364 1.2 1370 1.3
yr45_crypt_L043 7360 1.9 6107 1.8 2675 2.0
yr83_crypt_L052 9053 1.7 7660 1.7 4219 1.9
yr83_crypt_L053 8904 1.7 7454 1.7 3711 1.9
yr83_crypt_L058 7233 1.7 5772 1.7 2403 1.7
yr83_crypt_L059 7771 1.7 6253 1.7 3070 1.8
yr83_crypt_L086 5897 1.7 4880 1.7 2236 1.8
yr85_crypt_lp010 5338 1.3 3760 1.3 1438 1.3
yr85_crypt_lp011 8045 1.8 6282 1.9 3249 1.9
yr85_crypt_m1 8747 1.8 5277 1.8 1927 1.7
108
yr85_crypt_m2 7744 1.7 5211 1.8 2220 1.8
yr85_crypt_L055 8123 1.6 5206 1.6 1719 1.5
yr93_crypt_L030 9785 1.9 7239 1.9 4534 2.0
yr93_crypt_L031 12765 1.9 10330 1.9 5946 2.1
yr93_crypt_L064 10096 1.9 7848 1.9 5126 2.1
yr93_crypt_L065 10138 2.0 8147 2.0 2923 2.0
yr93_crypt_merg
eL028
9918 1.8 7763 1.8 2220 1.6
yr93_crypt_L029 7716 1.8 5755 1.7 2893 1.8
109
Figure 36. SNV Counts (Sorted by Age) With Only the 20X Coverage Filter Applied.
The samples have been grouped by age on the x-axis, and the y-axis shows the
number of SNV counts for each sample. The box and lines define the middle and
outer quartiles respectively. The diamond is centered on the mean and extends the
standard deviation. The additional center line defines the median.
110
Figure 37. SNV Counts (Sorted by Age) With the 20X Coverage and Heterogeneity Filters
Applied.
The samples have been grouped by age on the x-axis, and the y-axis shows the
number of SNV counts for each sample. The box and lines define the middle and
111
outer quartiles respectively. The diamond is centered on the mean and extends the
standard deviation. The additional center line defines the median.
Table 20. Normalization Values for Each Sample (20X Cutoff).
Sample Name
Number
Sites >=1X
Number
Sites >=1X,
Autosome
Number
Sites >=20X,
Autosome
Used Final
Number
Sites >=20X,
Autosome
yr00_bulk_L090 2804843885 2652341251 2598275085 2598275085
yr00_crypt_L129 2804647447 2652168399 2595243310 2595243310
yr00_crypt_L133 2804677635 2652062846 2375261911 2375261911
yr00_crypt_L125 2804069325 2651693991 2600201769 2598275085
yr00_crypt_L134 2804579054 2652078928 2574757211 2574757211
yr11_bulk_L116 2823899715 2654578928 2565943129 2565943129
yr11_crypt_L115 2822402716 2653385280 2558361580 2558361580
yr11_crypt_L122 2821785286 2652968965 2509676176 2509676176
yr11_crypt_L105 2821763199 2652962154 2442964636 2442964636
yr11_crypt_L113 2821921822 2653014508 2539588791 2539588791
yr11_crypt_L121 2823519965 2654246255 2528970319 2528970319
yr11_crypt_L123 2823635092 2654448002 2562544790 2562544790
yr15_bulk_L111 2805510602 2652452463 2431801371 2431801371
yr15_crypt_L108 2806218042 2653208208 2322232675 2322232675
yr15_crypt_L120 2801209469 2648568218 2296666576 2296666576
yr15_crypt_L118 2802311170 2650042478 2366732886 2366732886
yr15_crypt_L126 2803637979 2651115530 2460794264 2431801371
yr39_bulk_L071 2824961137 2655936114 2329166655 2329166655
yr39_crypt_L044 2824378693 2655399058 2414030545 2329166655
yr39_crypt_L045 2823447062 2654697673 2538443721 2329166655
yr39_crypt_L046 2821759758 2653348090 2338709045 2329166655
yr39_crypt_L047 2821757359 2653349116 2421739550 2329166655
yr39_crypt_L048 2823090289 2654379771 2487168752 2329166655
yr41_bulk_L068 2806196336 2653346197 2435207221 2435207221
yr41_crypt_L036 2807872853 2654420845 2366228996 2366228996
yr41_crypt_L037 2808230771 2654886650 2379189113 2379189113
yr41_crypt_L038 2808091207 2654789157 2369929757 2369929757
yr41_crypt_L039 2807654589 2654482494 2441652033 2435207221
yr41_crypt_L040 2807476924 2654171760 2251404052 2251404052
yr45_bulk_L067 2826044910 2656459800 2506449451 2506449451
yr45_crypt_L034 2824256467 2655038353 2254464537 2254464537
yr45_crypt_L035 2824776410 2655439663 2445286498 2445286498
yr45_crypt_L041 2824506106 2656575222 2364109173 2364109173
yr45_crypt_L042 2822397965 2653512655 1987690364 1987690364
112
yr45_crypt_L043 2824158147 2654880352 2540798034 2506449451
yr83_bulk_L070 2809515488 2655630180 2489038911 2489038911
yr83_crypt_L052 2807642285 2654491886 2592442737 2489038911
yr83_crypt_L053 2807675815 2654498809 2549162844 2489038911
yr83_crypt_L058 2805537767 2654147653 2316709303 2316709303
yr83_crypt_L059 2805320320 2652592403 2362807558 2362807558
yr83_crypt_L086 2811002264 2656865283 2563630344 2489038911
yr85_bulk_lp012 2804189903 2651914148 2415090453 2415090453
yr85_crypt_lp010 2805382187 2652948441 2554861447 2415090453
yr85_crypt_lp011 2805002507 2652639062 2559989543 2415090453
yr85_crypt_m1 2720001052 2572946170 1862880119 1862880119
yr85_crypt_m2 2595898757 2458751108 1533411195 1533411195
yr85_crypt_L055 2810601541 2656949024 2578864865 2415090453
yr93_bulk_L027 2808137264 2654731584 2363809742 2363809742
yr93_crypt_L030 2807486287 2654235079 2270465040 2270465040
yr93_crypt_L031 2807525631 2654323867 2484571723 2363809742
yr93_crypt_L064 2807250287 2654078655 2415426633 2363809742
yr93_crypt_L065 2807776865 2654528522 2515073609 2363809742
yr93_crypt_mergeL028 2808461920 2655074543 2584284548 2363809742
yr93_crypt_L029 2808771213 2655097403 2237055327 2237055327
Table 21. Normalized SNV Counts for All Samples (20X Cutoff).
Sample Name Normalized SNV Count
yr00_crypt_L129 1190.0
yr00_crypt_L133 1242.1
yr00_crypt_L125 1239.3
yr00_crypt_L134 1153.0
yr11_crypt_L115 951.2
yr11_crypt_L122 782.1
yr11_crypt_L105 848.0
yr11_crypt_L113 788.0
yr11_crypt_L121 713.1
yr11_crypt_L123 972.4
yr15_crypt_L108 966.0
yr15_crypt_L120 1731.9
yr15_crypt_L118 1461.7
yr15_crypt_L126 1336.2
yr39_crypt_L044 2060.0
yr39_crypt_L045 2337.1
yr39_crypt_L046 2109.4
yr39_crypt_L047 2027.7
113
yr39_crypt_L048 2377.3
yr41_crypt_L036 1756.2
yr41_crypt_L037 1720.8
yr41_crypt_L038 1796.8
yr41_crypt_L039 1709.6
yr41_crypt_L040 1915.2
yr45_crypt_L034 1329.0
yr45_crypt_L035 2311.4
yr45_crypt_L041 2295.5
yr45_crypt_L042 1056.9
yr45_crypt_L043 1636.5
yr83_crypt_L052 2599.2
yr83_crypt_L053 2286.2
yr83_crypt_L058 1590.5
yr83_crypt_L059 1992.4
yr83_crypt_L086 1377.5
yr85_crypt_lp010 913.0
yr85_crypt_lp011 2062.9
yr85_crypt_m1 1586.2
yr85_crypt_m2 2220.0
yr85_crypt_L055 1091.4
yr93_crypt_L030 3062.1
yr93_crypt_L031 3857.2
yr93_crypt_L064 3325.3
yr93_crypt_L065 1896.2
yr93_crypt_mergeL028 1440.1
yr93_crypt_L029 1983.0
114
Figure 38. Normalized SNV Counts for All Samples Give a 20X Cutoff.
Samples are plotted on the x-axis in order of age, while the y-axis indicates the
normalized SNV count for that sample.
115
Figure 39. Normalized SNV Counts (Sorted by Age) Given a 20X Cutoff.
The samples have been grouped by age on the x-axis, and the y-axis shows the
number of normalized SNV counts for each sample. The box and lines define the
middle and outer quartiles respectively. The diamond is centered on the mean and
extends the standard deviation. The additional center line defines the median.
116
Table 22. SNV Counts and Ti/Tv Ratios for Exons, Introns and Non-Transcribed Regions for
Each Sample (15X Cutoff).
Sample Name
Exon
SNV
Count
Exon
Ti/Tv
Ratio
Intron
SNV
Count
Intron
Ti/Tv
Ratio
Non-
Trans
cribed
SNV
Count
Non-
Transc
ribed
Ti/Tv
Ratio
Exon
&
Intron
SNV
Count
Exon
&
Intron
Ti/Tv
Ratio
yr00_crypt_L129 21 1.71 616 1.48 1751 1.38 637 1.49
yr00_crypt_L133 14 1.20 591 1.33 1728 1.33 605 1.32
yr00_crypt_L125 18 2.40 629 1.59 1846 1.30 647 1.61
yr00_crypt_L134 26 2.29 554 1.43 1753 1.30 580 1.46
yr11_crypt_L115 11 0.67 441 1.47 1436 1.59 452 1.44
yr11_crypt_L122 10 0.40 328 1.02 1190 1.46 338 1.00
yr11_crypt_L105 10 5.00 353 1.40 1265 1.43 363 1.43
yr11_crypt_L113 10 3.00 285 1.26 1289 1.30 295 1.29
yr11_crypt_L121 9 1.33 277 1.70 1142 1.36 286 1.68
yr11_crypt_L123 18 3.33 424 1.83 1442 1.47 442 1.87
yr15_crypt_L108 7 0.75 387 1.53 1511 1.32 394 1.50
yr15_crypt_L120 21 1.63 894 1.34 2480 1.20 915 1.34
yr15_crypt_L118 21 3.25 725 2.06 2141 1.40 746 2.09
yr15_crypt_L126 25 1.27 683 1.48 1953 1.31 708 1.47
yr39_crypt_L044 49 2.38 961 1.77 2807 1.76 1010 1.80
yr39_crypt_L045 49 3.50 1149 2.03 3007 1.76 1198 2.07
yr39_crypt_L046 51 1.88 1006 1.73 2886 1.73 1057 1.74
yr39_crypt_L047 35 4.00 1044 1.90 2650 1.74 1079 1.94
yr39_crypt_L048 47 3.20 1144 1.72 3117 1.79 1191 1.76
yr41_crypt_L036 34 2.44 839 2.17 2381 1.72 873 2.18
yr41_crypt_L037 32 4.33 816 2.11 2410 1.79 848 2.16
yr41_crypt_L038 42 2.50 904 2.11 2412 1.91 946 2.13
yr41_crypt_L039 38 3.00 840 1.68 2413 1.77 878 1.73
yr41_crypt_L040 33 3.43 1018 1.87 2586 1.81 1051 1.90
yr45_crypt_L034 34 1.07 623 1.53 1828 1.65 657 1.50
yr45_crypt_L035 59 2.35 1114 1.91 3053 1.79 1173 1.93
yr45_crypt_L041 43 4.14 1072 2.08 2996 1.97 1115 2.13
yr45_crypt_L042 23 1.13 450 1.35 1420 1.33 473 1.34
yr45_crypt_L043 35 2.56 774 2.08 2234 1.96 809 2.10
yr83_crypt_L052 48 1.37 1239 2.06 3345 1.86 1287 2.02
yr83_crypt_L053 45 2.25 1151 2.05 2911 1.73 1196 2.06
yr83_crypt_L058 20 2.80 722 1.83 2144 1.66 742 1.85
yr83_crypt_L059 28 4.40 973 2.00 2660 1.78 1001 2.04
yr83_crypt_L086 31 1.45 662 1.93 1843 1.84 693 1.90
yr85_crypt_lp010 4 0.00 371 1.20 1421 1.27 375 1.19
yr85_crypt_lp011 20 3.25 1001 1.95 2732 1.83 1021 1.97
yr85_crypt_m1 21 7.50 578 1.91 1867 1.69 599 1.98
117
yr85_crypt_m2 17 2.50 652 1.83 2015 1.73 669 1.84
yr85_crypt_L055 15 0.86 476 1.65 1835 1.55 491 1.61
yr93_crypt_L030 54 2.60 1580 2.34 3935 1.89 1634 2.35
yr93_crypt_L031 71 4.83 1869 2.30 4961 1.94 1940 2.36
yr93_crypt_L064 64 4.27 1712 2.25 4307 2.06 1776 2.29
yr93_crypt_L065 47 5.00 909 2.13 2505 1.92 956 2.21
yr93_crypt_mergeL
028
34 3.83 631 1.53 2019 1.61 665 1.60
yr93_crypt_L029 21 5.67 911 1.82 2627 1.91 932 1.86
118
Figure 40. Exon SNV Counts (Sorted by Age) (non-Normalized).
The samples have been grouped by age on the x-axis, and the y-axis shows the
number of exon SNV counts for each sample. The box and lines define the middle
and outer quartiles respectively. The diamond is centered on the mean and extends
the standard deviation. The additional center line defines the median.
119
Table 23. SNV Counts and Ti/Tv Ratios for Exons, Introns and Non-Transcribed Regions for
Each Sample (20X Cutoff).
Sample Name
Exon
SNV
Count
Exon
Ti/Tv
Ratio
Intron
SNV
Count
Intron
Ti/Tv
Ratio
Non-
Trans
cribed
SNV
Count
Non-
Transc
ribed
Ti/Tv
Ratio
Exon
&
Intron
SNV
Count
Exon
&
Intron
Ti/Tv
Ratio
yr00_crypt_L129 19 1.43 512 1.54 1483 1.41 531 1.53
yr00_crypt_L133 14 1.20 482 1.38 1428 1.28 496 1.38
yr00_crypt_L125 17 2.20 528 1.64 1555 1.24 545 1.65
yr00_crypt_L134 25 2.29 440 1.49 1471 1.29 465 1.52
yr11_crypt_L115 10 0.50 365 1.74 1212 1.65 375 1.68
yr11_crypt_L122 8 0.67 263 1.01 1009 1.38 271 1.00
yr11_crypt_L105 9 Infinite 285 1.34 1057 1.36 294 1.39
yr11_crypt_L113 7 Infinite 230 1.32 1068 1.29 237 1.39
yr11_crypt_L121 8 1.00 235 1.85 933 1.30 243 1.81
yr11_crypt_L123 16 3.00 371 1.77 1238 1.43 387 1.80
yr15_crypt_L108 7 0.75 267 1.26 1189 1.27 274 1.24
yr15_crypt_L120 14 3.67 675 1.34 1905 1.21 689 1.37
yr15_crypt_L118 18 4.00 544 1.94 1694 1.41 562 1.98
yr15_crypt_L126 20 1.22 553 1.41 1546 1.34 573 1.40
yr39_crypt_L044 43 2.25 771 1.75 2315 1.70 814 1.77
yr39_crypt_L045 43 3.00 948 2.00 2559 1.75 991 2.03
yr39_crypt_L046 43 2.17 790 1.71 2371 1.66 833 1.73
yr39_crypt_L047 27 2.83 835 1.80 2218 1.68 862 1.82
yr39_crypt_L048 43 2.90 943 1.68 2625 1.76 986 1.71
yr41_crypt_L036 27 2.25 702 2.26 1981 1.68 729 2.26
yr41_crypt_L037 24 3.80 633 2.11 2013 1.79 657 2.15
yr41_crypt_L038 32 1.80 726 2.18 2019 1.91 758 2.17
yr41_crypt_L039 29 3.50 675 1.64 2011 1.83 704 1.69
yr41_crypt_L040 25 2.29 750 1.94 2037 1.79 775 1.95
yr45_crypt_L034 27 1.08 455 1.60 1472 1.65 482 1.56
yr45_crypt_L035 48 1.94 963 1.87 2675 1.78 1011 1.88
yr45_crypt_L041 37 4.00 892 2.07 2610 1.91 929 2.11
yr45_crypt_L042 14 1.20 294 1.21 1062 1.27 308 1.21
yr45_crypt_L043 31 2.11 667 2.10 1977 1.92 698 2.10
yr83_crypt_L052 44 1.56 1137 2.08 3038 1.87 1181 2.06
yr83_crypt_L053 43 2.17 1041 2.09 2627 1.76 1084 2.09
yr83_crypt_L058 17 3.25 597 1.92 1789 1.60 614 1.95
yr83_crypt_L059 21 3.20 812 2.00 2237 1.79 833 2.03
yr83_crypt_L086 28 1.50 586 1.93 1622 1.82 614 1.91
yr85_crypt_lp010 4 0.00 294 1.19 1140 1.33 298 1.18
yr85_crypt_lp011 19 3.25 871 2.06 2359 1.89 890 2.08
yr85_crypt_m1 20 7.50 464 1.86 1443 1.68 484 1.94
120
yr85_crypt_m2 15 3.00 547 1.95 1658 1.74 562 1.97
yr85_crypt_L055 12 0.67 348 1.54 1359 1.47 360 1.50
yr93_crypt_L030 37 3.63 1266 2.43 3231 1.85 1303 2.46
yr93_crypt_L031 56 3.58 1617 2.36 4273 1.94 1673 2.39
yr93_crypt_L064 45 3.44 1435 2.20 3646 2.06 1480 2.23
yr93_crypt_L065 36 4.67 762 2.06 2125 1.93 798 2.12
yr93_crypt_mergeL
028
30 3.17 515 1.56 1675 1.65 545 1.61
yr93_crypt_L029 15 3.67 718 1.79 2160 1.80 733 1.81
121
Figure 41. Exon SNV Counts (Sorted by Age) (non-Normalized) Given a 20X Cutoff.
The samples have been grouped by age on the x-axis, and the y-axis shows the
number of exon SNV counts for each sample. The box and lines define the middle
and outer quartiles respectively. The diamond is centered on the mean and extends
the standard deviation. The additional center line defines the median.
122
Table 24. Normalized Counts for All Samples (20X Cutoff).
Sample Name
Normalized Number for
Exon
yr00_crypt_L129 11.2
yr00_crypt_L133 9.0
yr00_crypt_L125 10.0
yr00_crypt_L134 14.9
yr11_crypt_L115 6.0
yr11_crypt_L122 4.9
yr11_crypt_L105 5.6
yr11_crypt_L113 4.2
yr11_crypt_L121 4.9
yr11_crypt_L123 9.6
yr15_crypt_L108 4.6
yr15_crypt_L120 9.3
yr15_crypt_L118 11.7
yr15_crypt_L126 12.6
yr39_crypt_L044 28.3
yr39_crypt_L045 28.3
yr39_crypt_L046 28.3
yr39_crypt_L047 17.8
yr39_crypt_L048 28.3
yr41_crypt_L036 17.5
yr41_crypt_L037 15.5
yr41_crypt_L038 20.7
yr41_crypt_L039 18.3
yr41_crypt_L040 17.0
yr45_crypt_L034 18.4
yr45_crypt_L035 30.1
yr45_crypt_L041 24.0
yr45_crypt_L042 10.8
yr45_crypt_L043 19.0
yr83_crypt_L052 27.1
yr83_crypt_L053 26.5
yr83_crypt_L058 11.3
yr83_crypt_L059 13.6
yr83_crypt_L086 17.2
yr85_crypt_lp010 2.5
yr85_crypt_lp011 12.1
yr85_crypt_m1 16.5
yr85_crypt_m2 15.0
yr85_crypt_L055 7.6
yr93_crypt_L030 25.0
123
yr93_crypt_L031 36.3
yr93_crypt_L064 29.2
yr93_crypt_L065 23.4
yr93_crypt_mergeL028 19.5
yr93_crypt_L029 10.3
Figure 42. Normalized Exon SNV Counts (Sorted by Age) Given a 20X Cutoff.
The samples have been grouped by age on the x-axis, and the y-axis shows the
number of normalized exon SNV counts for each sample. The box and lines define
the middle and outer quartiles respectively. The diamond is centered on the mean
and extends the standard deviation. The additional center line defines the median.
124
Appendix I.3 Supplementary Data for Deletions
Figure 43. Average Deletion Size Increases with Age.
The natural logarithm of the average deletion size for each crypt (y axis, in base
pairs) is plotted against the age of each subject (x axis). The samples are grouped by
age. The box and lines define the middle and outer quartiles respectively. The
diamond is centered on the mean and extends the standard deviation. The additional
center line defines the median. All deletions 50 bps or more are included.
Appendix I.4 Supplementary Data for
Discussion
Table 25. Impact of Excluding SNVs Found in dbSNP on Counts and Ti/Tv Ratios.
Sample Name
SNV Count
(With
Coverage
Autosome
with
Heterogene
ity Filters)
Ti/Tv Ratio
SNVs
found in
dbSNP
Ti/Tv
Ratio of
SNVs
Found
in
dbSNP
SNVs
Not
Found
in
dbSNP
Ti/Tb Ratio
of SNVs Not
Found in
dbSNP
yr00_crypt_L
129
2388 1.41 1126 1.24 1262 1.58
yr00_crypt_L
133
2333 1.33 1103 1.29 1230 1.36
yr00_crypt_L 2493 1.38 1170 1.33 1323 1.42
125
125
yr00_crypt_L
134
2333 1.34 1114 1.38 1219 1.30
yr11_crypt_L
115
1888 1.55 838 1.24 1050 1.85
yr11_crypt_L
122
1528 1.34 810 1.28 718 1.42
yr11_crypt_L
105
1628 1.43 817 1.44 811 1.42
yr11_crypt_L
113
1584 1.30 794 1.12 790 1.51
yr11_crypt_L
121
1428 1.42 781 1.44 647 1.40
yr11_crypt_L
123
1884 1.55 869 1.42 1015 1.67
yr15_crypt_L
108
1905 1.35 1144 1.31 761 1.43
yr15_crypt_L
120
3395 1.24 1447 1.35 1948 1.17
yr15_crypt_L
118
2887 1.55 1145 1.45 1742 1.61
yr15_crypt_L
126
2661 1.35 1133 1.28 1528 1.41
yr39_crypt_L
044
3817 1.77 1131 1.63 2686 1.83
yr39_crypt_L
045
4205 1.85 1224 1.37 2981 2.09
yr39_crypt_L
046
3943 1.73 1103 1.48 2840 1.84
yr39_crypt_L
047
3729 1.80 1107 1.52 2622 1.93
yr39_crypt_L
048
4308 1.78 1214 1.64 3094 1.83
yr41_crypt_L
036
3254 1.83 997 1.69 2257 1.89
yr41_crypt_L
037
3258 1.88 1001 1.53 2257 2.05
yr41_crypt_L
038
3358 1.97 1031 1.59 2327 2.15
yr41_crypt_L
039
3291 1.76 1166 1.62 2125 1.83
yr41_crypt_L
040
3637 1.83 1491 1.58 2146 2.03
yr45_crypt_L
034
2485 1.61 848 1.33 1637 1.76
yr45_crypt_L
035
4226 1.83 1139 1.87 3087 1.82
yr45_crypt_L
041
4111 2.01 1127 1.78 2984 2.11
yr45_crypt_L
042
1893 1.33 837 1.31 1056 1.35
yr45_crypt_L
043
3043 1.99 898 1.83 2145 2.07
yr83_crypt_L
052
4632 1.91 1089 1.81 3543 1.93
yr83_crypt_L
053
4107 1.82 1113 1.70 2994 1.86
yr83_crypt_L
058
2886 1.71 867 1.55 2019 1.78
yr83_crypt_L
059
3661 1.85 980 1.58 2681 1.95
yr83_crypt_L
086
2536 1.85 773 1.64 1763 1.94
yr85_crypt_lp
010
1796 1.25 902 1.31 894 1.19
yr85_crypt_lp
011
3753 1.87 1056 1.53 2697 2.01
126
yr85_crypt_m
1
2466 1.75 1189 1.51 1277 2.00
yr85_crypt_m
2
2684 1.76 1160 1.50 1524 1.96
yr85_crypt_L
055
2326 1.57 1464 1.58 862 1.55
yr93_crypt_L
030
5569 2.01 1194 1.76 4375 2.08
yr93_crypt_L
031
6901 2.05 1398 1.96 5503 2.07
yr93_crypt_L
064
6083 2.12 1302 1.99 4781 2.16
yr93_crypt_L
065
3461 2.00 1008 1.70 2453 2.12
yr93_crypt_m
ergeL028
2684 1.61 919 1.49 1765 1.67
yr93_crypt_L
029
3559 1.89 984 1.56 2575 2.03
Table 26. Impact of Excluding Exon SNVs Found in dbSNP on Counts and Ti/Tv Ratios.
Sample Name
Exon SNV
Count
(With
Coverage
Autosome
with
Heterogene
ity Filters)
Exon Ti/Tv
Ratio
Exon
SNVs
found in
dbSNP
Ti/Tv
Ratio of
Exon
SNVs
Found in
dbSNP
Exon
SNVs
Not
Found
in
dbSNP
Ti/Tb Ratio
of Exon
SNVs Not
Found in
dbSNP
yr00_crypt_L
129
21 1.71 17 2.75 4 0.33
yr00_crypt_L
133
14 1.20 11 3.00 3 0.00
yr00_crypt_L
125
18 2.40 12 4.50 6 1.00
yr00_crypt_L
134
26 2.29 21 3.50 5 0.67
yr11_crypt_L
115
11 0.67 7 0.75 4 0.50
yr11_crypt_L
122
10 0.40 9 0.50 1 0.00
yr11_crypt_L
105
10 5.00 8 4.00 2 No Tv Found
yr11_crypt_L
113
10 3.00 8 2.50 2 No Tv Found
yr11_crypt_L
121
9 1.33 8 2.00 1 0.00
yr11_crypt_L
123
18 3.33 14 3.50 4 3.00
yr15_crypt_L
108
7 0.75 5 0.67 2 1.00
yr15_crypt_L
120
21 1.63 6 5.00 15 1.14
yr15_crypt_L
118
21 3.25 10 5.00 11 2.67
yr15_crypt_L
126
25 1.27 10 1.50 15 1.14
yr39_crypt_L
044
49 2.38 14 3.33 35 2.10
yr39_crypt_L
045
49 3.50 15 14.00 34 2.33
yr39_crypt_L
046
51 1.88 13 2.67 38 1.69
yr39_crypt_L
047
35 4.00 16 4.00 19 4.00
yr39_crypt_L
048
47 3.20 15 3.33 32 3.14
127
yr41_crypt_L
036
34 2.44 13 4.50 21 1.86
yr41_crypt_L
037
32 4.33 12 11.00 20 3.00
yr41_crypt_L
038
42 2.50 13
No Tv
Found
29 1.70
yr41_crypt_L
039
38 3.00 20 2.60 18 3.50
yr41_crypt_L
040
33 3.43 13 3.00 20 3.75
yr45_crypt_L
034
34 1.07 21 1.83 13 0.50
yr45_crypt_L
035
59 2.35 24 2.29 35 2.40
yr45_crypt_L
041
43 4.14 20 12.00 23 2.83
yr45_crypt_L
042
23 1.13 18 2.25 5 0.00
yr45_crypt_L
043
35 2.56 16 3.33 19 2.17
yr83_crypt_L
052
48 1.37 14 3.33 34 1.00
yr83_crypt_L
053
45 2.25 20 2.75 25 2.00
yr83_crypt_L
058
20 2.80 13 2.25 7 5.00
yr83_crypt_L
059
28 4.40 7 6.00 21 4.00
yr83_crypt_L
086
31 1.45 9 6.00 22 1.00
yr85_crypt_lp
010
4 0.00 4 0.00 0 No Tv Found
yr85_crypt_lp
011
20 3.25 3
No Tv
Found
17 3.00
yr85_crypt_m
1
21 7.50 10 2.50 11 No Tv Found
yr85_crypt_m
2
17 2.50 6 3.00 11 2.33
yr85_crypt_L
055
15 0.86 12 0.83 3 1.00
yr93_crypt_L
030
54 2.60 23 4.75 31 1.82
yr93_crypt_L
031
71 4.83 21 9.00 50 4.00
yr93_crypt_L
064
64 4.27 27 10.00 37 3.00
yr93_crypt_L
065
47 5.00 17 6.50 30 4.40
yr93_crypt_m
ergeL028
34 3.83 20 7.00 14 2.25
yr93_crypt_L
029
21 5.67 12 11.00 9 3.00
128
Figure 44. IGV Output Showing a Deletion With Possible Polyclonality.
This shows a deletion in Chromosome 7, at positions 23,177,372 to 23,177,516, in
the L034 crypt sample of the 45 year old subject. Data is also shown for the matching
L067 bulk sample. Grey bars represent reads, with the thick areas showing regions
that match the reference and the thin areas indicating parts of the reference not
found on the read. A drop off in coverage, corresponding to reads spanning the same
area, is seen in the crypt sample, but not in the bulk sample, indicating a deletion.
However, the coverage in the deletion area is ~15X for the bulk and ~12X for crypt,
even though the two samples have nearly the same coverage (32.5X and 32X
129
respectively). This higher than 50% coverage in the crypt indicates that it may be
polyclonal, with some cells not containing the deletion.
130
APPENDIX II: MISCELLANEOUS
PROJECTS
PROJECT I
Microarrays Do Not Provide Additional Accuracy
When Combined with Short-Read Whole Genome
Sequencing
Project I.1 Key Questions Addressed
As whole genome sequencing (WGS) techniques become more
available, the microarray (MR) technique has ceased to be the go to
method for human genetic variation studies. A major improvement
offered by WGS is that with enough coverage, WGS can provide single
base pair resolution continuously across the entire genome (~3 billion
base pairs), while MR is limited to millions of discrete base pairs.
However, there are other factors to consider.
First, MR is much less costly. Further, WGS does have its
shortcomings. Long-reads (LR) WGS is relatively new and is still very
expensive. Although it has a tremendous advantage in detecting
structure variations, the accuracy of LRWGS heavily relies on quality
control at the software side, and it is hard to correct errors. Therefore,
we will focus on short-read (SR) WGS techniques for this study.
SRWGS relies largely on sequencing depth, and without a reasonable
coverage (at least 15X in most cases) at a target site, it can be highly
131
inaccurate at detecting variations, including single nucleotide variants
(SNVs), which is our focus for this small study.
Assuming the variants called by MR are usually accurate due to
its deep coverage, how frequently does SRWGS miss variants called
by MR or obtain false positives? Simply put, could SRWGS completely
replace MR, when it has a reasonable coverage (say 30x)? We are
particularly interested in comparing MR to a 30X coverage SRWGS,
which is a minimal “sufficient” average coverage. In many studies,
initial DNA input is very limited (e.g. 5-20ng) and preparing a
sequencing library for 30X coverage is not a simple task, especially if
one wants to avoid PCR duplication.
We will compare the variant calling results from 30X SRWGS to
MR using 17 DNA samples from a family with a history of colitis. We
will determine if MR provides any benefit in our future studies when we
already have done SRWGS. Our finding will also help design genomic
variant detection projects in which scientists are limited by the amount
of starting materials.
Existing Studies
Although studies have combined both MR and next generation
sequencing in studying genotyping, epigenetics, methylation, and de
novo mutations, there are few existing studies looking into the
differences between MR and SRWGS using experimental data
(Marinković et al., 2012)(Hurd & Nelson, 2009). In particular, the field is
lacks knowledge about how 30X SRWGS compares with MR.
Understanding Illumina MR Outputs
This small section is a brief introduction on how to understand
the MR output from Illumina. More details can be found in Illumina
documents with the colitis data on the Lieber lab server. In the Table
27, the allele forward columns are real allele calls from the experiment
site (rs labeled site). “Forward” in the allele column name means it is
using the forward strand nucleotide record for the reference of this
SNP name. The reference here, however, is not consistent. It could be
132
from dbSNP131, 1000 Genomes Project (kgp), or from Illumina’s own
design. This inconsistency has caused a lot of complexity in this
project. For example, dbsnp131 strand orientation does not always
agree with hg19. The more recent dbsnp versions such as 138 do
agree, but not 131. The allele design columns specify the alleles on the
chip. More details about the position of each SNP used on this chip
can be found in (“Infinium Omni2.5-8 Kit Support,” n.d.)
Table 27 Partial Illumina MR output
SNP Name Sample ID
Allele1 -
Forward
Allele2 -
Forward
Allele1
-
Design
Allele2
-
Design
GC
Score
rs2661837
WG0227940-
DNAG08_LP6005390-
DNA_G08
C C G G 0.9062
rs17599091
WG0227940-
DNAG08_LP6005390-
DNA_G08
G C G C 0.9515
rs1492138
WG0227940-
DNAG08_LP6005390-
DNA_G08
A A A A 0.8471
rs7774255
WG0227940-
DNAG08_LP6005390-
DNA_G08
T T A A 0.8084
rs206018
WG0227940-
DNAG08_LP6005390-
DNA_G08
C C G G 0.9126
rs11929668
WG0227940-
DNAG08_LP6005390-
DNA_G08
C C C C 0.9621
rs4252072
WG0227940-
DNAG08_LP6005390-
DNA_G08
A A A A 0.9433
rs6455
WG0227940-
DNAG08_LP6005390-
DNA_G08
G G C C 0.8795
Assumptions in Our Design and Justifications
We will only determine whether MR and SRWG agree on the
heterozygosity of each site. This is not because comparing genotyping
is difficult. Rather, it is because our data were generated prior to when
GRCH37 (or hg19) was widely used as the standard reference in
microarray chip design. In our dataset, the strand orientation of MR
positions were defined based on a mixed standard from the
combination of dbSNP131, 1000 Genomes Project, and even
Illumina’s own design; which is often the opposite of GRCH37’s strand
133
orientation. Not having a uniform standard for the orientation will result
in many more incorrectly called differences when comparing the two
methods. In fact, when I did compare genotyping using one set of data
and ignored the orientation issues; most of the resulting different calls
had complementary genotyping values when comparing MR and
SRWGS. For example, MR would report AG on one site, and SRWGS
would report would TC on the same site. Having MR and SRWGS
disagree on numerous positions in such a manner reflects the
existence of a systematic error. Therefore, due to the limitation of our
data sets, we here only compare heterogeneity. Note that there might
be methods to solve this problem, and I will discuss them in the
discussion session.
Also, the SRWGS variant calling was done by Illumina in 2012
using the ELAND aligner and CASAVA variant caller, which were
Illumina standards at the time. These are not considered to be the best
pipeline software in practice, but we assume their ability to do the
simplest variant calling task: calling SNVs, is reliable.
Project I.2 Methods
First, all MR hybridization sites need to be matched back to the
genome. Thankfully Illumina has a standard genome position table for
comparison, using GRCH37. ANNOVAR was used to obtain the
reference genotypes for sites not shown in the VCF output from
SRWGS (K. Wang et al., 2010). No phasing is included in this study.
dbSNP uses a 0-based position system and this is accounted for in this
analysis.
134
Figure 45 Workflow of This Study
Filtration
Before comparing MR data to SRWGS data, there are certain
sites in MR that were filtered out for the convenience of programming.
The number of sites excluded is very minimal (4793 out of 2379855).
These sites can be added back and processed in different manners if
needed. They include Chr0 and ChrXY labeled sites. The prior label is
used by Illumina when the site is located on an unalignable reference
contig and the actual chromosome is unknown; and the latter is used
for pseudoautosomal cases, and it means that the position is found on
both ChrX and ChrY, just like an autosomal position in other
chromosomes. We then filter out all sites that failed in the MR
experiments. Last, we filter the sites that show up more than once in
the MR. These sites are duplicated because at the time when the MR
chip was designed, it was not known that the two sites were in fact the
same.
Comparison
135
The comparison procedure is shown in the workflow of Figure
45. The main idea is quite simple. The SRWGS SNV report (a VCF
file) is a truncated genome position genotyping summary that only
reports sites that have different genotypes with the human genome
reference (GRCH37) – we need to compare MR to VCF output when
any position in MR is reported by variant calling in SRWGS; and then
also compare MR positions to the human genome reference for the
positions in MR that are not reported in VCF files.
136
Project I.3 Results
Table 28 Heterogeneity Comparison of MR vs SRWGS
SRWGS results largely agree with MR results on heterogeneity.
Out of more than two million MR sites, 16 of 17 samples have more
than 99% sites showing the same heterogeneity (HT) as determined by
SRWGS with 30X coverage (Table 28). Only one sample showed a
2.5% difference rate, and this is still quite minor.
ChrXY/0 Failed
Duplicated
Positions
Position
in WGS
Positions
not in
VCF
Total
Position
in WGS
Positions
not in
VCF
Total
201_LP6005390-
DNA_G07 2,379,855 4,793 5,888 5,626 2,363,548 696,685 1,652,740 2,349,425 1,894 12,229 14,123 99.4025%
301_LP6005390-
DNA_B08 2,379,855 4,793 3,874 5,615 2,365,573 695,290 1,655,450 2,350,740 2,038 12,795 14,833 99.3730%
303_LP6005390-
DNA_C08 2,379,855 4,793 16,412 5,481 2,353,169 690,088 1,648,707 2,338,795 2,013 12,361 14,374 99.3892%
304_LP6005390-
DNA_H08 2,379,855 4,793 16,934 5,564 2,352,564 694,902 1,643,300 2,338,202 1,939 12,423 14,362 99.3895%
305_LP6005390-
DNA_G08 2,379,855 4,793 33,484 5,524 2,336,054 693,025 1,628,080 2,321,105 2,153 12,796 14,949 99.3601%
306_LP6005390-
DNA_D08 2,379,855 4,793 16,934 5,431 2,352,697 689,508 1,648,061 2,337,569 2,301 12,827 15,128 99.3570%
307_LP6005390-
DNA_A09 2,379,855 4,793 9,785 5,613 2,359,664 698,090 1,647,345 2,345,435 1,868 12,361 14,229 99.3970%
308_LP6005390-
DNA_B09 2,379,855 4,793 10,409 5,487 2,359,166 690,514 1,653,913 2,344,427 2,088 12,651 14,739 99.3752%
309_LP6005390-
DNA_C09 2,379,855 4,793 12,719 5,594 2,356,749 691,041 1,651,713 2,342,754 1,981 12,014 13,995 99.4062%
403_LP6005390-
DNA_D09 2,379,855 4,793 13,555 5,446 2,356,061 690,836 1,650,690 2,341,526 2,035 12,500 14,535 99.3831%
404_LP6005390-
DNA_A10 2,379,855 4,793 18,601 5,392 2,351,069 692,068 1,644,265 2,336,333 1,990 12,746 14,736 99.3732%
406_LP6005390-
DNA_E08 2,379,855 4,793 12,119 5,596 2,357,347 698,915 1,644,265 2,343,180 1,817 12,350 14,167 99.3990%
407_LP6005390-
DNA_F08 2,379,855 4,793 126,407 4,558 2,244,097 631,399 1,556,590 2,187,989 22,456 33,652 56,108 97.4998%
409_LP6005390-
DNA_E09 2,379,855 4,793 17,665 5,476 2,351,921 689,665 1,647,521 2,337,186 2,019 12,716 14,735 99.3735%
410_LP6005390-
DNA_F09 2,379,855 4,793 14,500 5,587 2,354,975 694,730 1,646,174 2,340,904 1,827 12,244 14,071 99.4025%
411_LP6005390-
DNA_G09 2,379,855 4,793 11,962 5,603 2,357,497 697,581 1,645,456 2,343,037 1,877 12,583 14,460 99.3866%
MRL_LP600539
0-DNA_H09 2,379,855 4,793 40,561 5,180 2,329,321 686,755 1,627,001 2,313,756 2,392 13,173 15,565 99.3318%
Same Different
MR=WGS Sample
Filter
Total SNP
Total
Comparing
137
Possible Reasons for Identified Differences
Although the differences between MR and SRWGS calls are
few, we also examined a few possible causes of such differences. We
compared a few SRWGS features between sites where MR agrees
with SRWGS for HT, to sites where MR disagree with SRWGS for HT.
The considered sites are only from those shown in MR and shown in
the VCF output of SRWGS. The features compared include: read
depth, whether the position passed the sequencing filtration threshold
and read quality from the SRWGS data. Note that MR details of these
features can also be examined, but it has not been done in this study
yet. Also note that for the features at sites not listed in VCF variant
calling output (meaning genotyping is the same as in GRCH37), these
three features can still be examined in the future by using samtools
pileup (Heng Li & Durbin, 2009).
Read Depth
Figure 46 Read depth count distributions (data from sample: 305_LP6005390-DNA_G08).
The x-axis is the SRWGS read depth and the y-axis is the number of posistions with
mathcing (top) or disperate (bottom) heterozygosity calls between SRWGS and MR.
Figure 46 shows the relation between SRWGS coverage and
matching heterozygosity calls. While the matching calls are nicely
distributed around 30, which is the average coverage, the non-
matching calls seem to skew towards low coverage areas. This
138
suggests that disagreement between SRWGS results and MR results
may be driven by low SRWGS coverage and that improved coverage
can make this disagreement even lower.
Pass Filtration
Figure 47 MR and SRWGS consistency based on SRWGS position quality (data from sample:
305_LP6005390-DNA_G08).
The number of posistions with “PASS” or “q20” SRWGS calls is shown for posistions
where HT did match between MR and SRWGS (top) and those where it did not
(bottom).
Figure 47 shows the relation between the SRWGS quality filter
and matching heterozygosity calls. The SRWGS variant caller used
here assess the average base quality for each position it reports. In our
case, the position is either given a “PASS” if this position has passed
all filters, or a “q20” if not (“VCF (Variant Call Format) version 4.1 |
1000 Genomes,” n.d.). We can see that positions given a “PASS” were
almost perfectly consistent with MR results, while lower quality
positions often did not match MR results. Again, this result suggests
that SRWGS would have even less differences comparing with MR
given better quality sequencing (which is often dependent on higher
coverage).
Quality
139
Figure 48 VCF quality score versus SRWGS and MR consistency (data from sample:
305_LP6005390-DNA_G08)
The x-axis is the SRWGS VCF quality score and the y-axis is the number of
posistions with mathcing (top) or disperate (bottom) heterozygosity calls between
SRWGS and MR
Figure 48 shows the relation SRWGS quality scores and
matching heterozygosity calls. Every SNP reported during SRWGS
variant calling is given a Phred-scaled quality score (“VCF (Variant Call
Format) version 4.1 | 1000 Genomes,” n.d.). We can see a clear trend
where lower quality scores assigned to a SNP in the SRWGS VCF file
associate with an increased likelihood of non-matching SRWGS and
MR results. Again, we take this as an indication that improving
SRWGS coverage and quality would have led to even less differences
when compared with MR.
Project I.4 Discussion and Conclusion
We believe the MR does not provide a significant amount of
new or more accurate information when SRWGS with 30X coverage
has already been applied. However, if knowing precise information for
every single concerned site is essential, one should consider either
increasing the coverage of SRWGS, or applying MR. Our results
suggest that the disagreement between MR and SRWGS is often due
to pore WGS quality and can be reduced by obtaining SRWGS with
140
better coverage, or if ignoring certain sites is affordable, simply
applying a tougher filtering threshold would also do. We do not
currently have a better covered SRWGS dataset to directly test this
hypothesis but some alternative approaches are described below.
Project I.5 Future Work
In future studies, several additional factors can be considered to
further elaborate on the value of MR when doing SRWGS. A study
could be more informative if it covered all MR sites, rather than just
known autosome ones and compared the actual base calls of the two
methods rather than just if they detected heterozygosity of not. Our
study has shown the major factor in different results between SRWGS
and MR is SRWGS coverage and it would be nice to work with a much
deeper SRWGS data set and complementary MR set so that this can
be further explored. Finally, it is worth noting that MR can detect
deletions and loss of heterozygosity events (J. Hsieh et al., 2013), and
it would be interesting to see if there is any difference in the detection
of these events between SRWGS and MR.
141
PROJECT II
Analysis of Mitochondria DNA Using Oligomer
Probes and Next Generation Sequencing (Method
Discussion)
Project II.1 Introduction
This project sought to distinguish between the two mechanisms
of human mitochondrial DNA replication by looking at the profile of
single stranded mitochondrial DNA (mtDNA), something that should
differ between the two, through a variety of methods. Dr. Hsieh
performed an assay where strand selective DNA probes would only
ligate each other and be sequenced if there was complementary single
stranded mtDNA. Twenty four pairs of probes were used for each
strand, and I was asked to process the sequencing data from these
probes.
Project II.2 Procedure and Results
I first performed quality control on the sequencing data using
FastQC (Andrews Simon, n.d.). FastQC returned some warnings
related to overrepresented sequences, but this was understandable
given that we were sequencing a limited set of oligomers. No other
FastQC warnings occurred.
Following quality control, reads were aligned to a mitochondrial
genome reference using BWA-mem (Heng Li, 2013). Reads were then
sorted based on alignment using Picard (Broad Institute, n.d.). Finally,
the number of reads aligning to each probe position that had an
alignment length above 37bps (the length of the ligated oligomers) was
counted (Figure 49). This was judged using the CIGAR string in the
aligned (SAM) files and the logic is shown in Figure 49.
142
Figure 49 Flowchart for Oligomer Ligation Sequencing Data Analysis
This procedure was repeated for 922 separate samples (with
different oligomers probes and ligation assay conditions) and counts of
oligomers at different positions were returned to Dr. Hsieh and used in
her analysis. Data from these assays provided evidence in favor of the
main hypothesis of mtDNA replication previously described. These
results were published in (C.-L. Hsieh, 2018).
Abstract (if available)
Abstract
The rate and nature of somatic mutations that accumulate in humans with age have long been subjects of interest for a variety of reasons. These mutations are known to be the basis of cancer and may also be a factor in the process of aging. A detailed profile showing when and where these mutations arise may help explain the occurrence of cancer, assess their role in aging, and offer insights into their mechanisms. ❧ Despite this interest, studying somatic mutations has long been a challenge because it requires single cell sequencing, something that is not possible without cell expansion in culture or DNA amplification, both of which can introduce their own mutations. ❧ In this study, we address this challenge by using individual colon crypts to effectively represent a single stem cell that gave rise to the crypt shortly beforehand. Using these crypts, we study somatic mutations in young, middle aged, and old subjects. Our data suggests that single nucleotide variation (SNV) somatic mutations increase with age throughout life, but the difference between young and middle aged subjects seems to be considerably more than the difference between middle aged and old subjects. Our data also suggests that deletions larger than 3000 bps also increase with age. The mutational profiles are consistent with multiple mechanisms.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Genomic, transcriptomic and immunologic landscapes of human cancers
PDF
Investigating the complexity of the tumor microenvironment's role in drug response
PDF
Modeling mutational signatures in cancer
PDF
Understanding and controlling mitotic errors leading to aneuploidy in early ovarian cancer development
PDF
Hierarchical regularized regression for incorporation of external data in high-dimensional models
PDF
The role of endoplasmic reticulum chaperone protein GRP78 in breast cancer
PDF
Applying multi-omics in cancer liquid biopsy for improved patient monitoring and biomarker discovery
PDF
The mechanisms of somatic cell reprogramming
PDF
Integrative genomic and epigenomic analysis of human cancer
PDF
Molecular classification of breast cancer specimens from tissue morphology
PDF
Cell non-autonomous control of proliferation in tissues of elevated cancer risk in BRCA1 mutation carriers
PDF
Characterization of the ZFX family of transcription factors that bind downstream of the start site of CpG island promoters
PDF
Examination of the effect of the ecPlexin-B1 on MDA-MB-231 cells
PDF
Developing a robust single cell whole genome bisulfite sequencing protocol to analyse circulating tumor cells
PDF
Exploration of the roles of cancer stem cells and survivin in the pathogenesis and progression of prostate cancer
PDF
The cancer stem-like phenotype: therapeutics, phenotypic plasticity and mechanistic studies
PDF
Cell surface translocation mechanism of stress-inducible GRP78 in human cancer
PDF
Effects of chromatin regulators during carcinogenesis
PDF
RNA methylation in cancer plasticity and drug resistance
PDF
Loss of checkpoint controls in acute lymphoblastic leukemia
Asset Metadata
Creator
Ma, Shuangchao
(author)
Core Title
The accumulation of somatic mutations in humans with age
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Cancer Biology and Genomics
Publication Date
02/12/2019
Defense Date
12/12/2018
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
aging,bioinformatics,genetics,NGS,OAI-PMH Harvest,software,WGS
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Press, Michael (
committee chair
), Casey, Graham (
committee member
), Craig, David (
committee member
), Groshen, Susan (
committee member
), Lieber, Michael (
committee member
)
Creator Email
shuangcm@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-117933
Unique identifier
UC11676879
Identifier
etd-MaShuangch-7049.pdf (filename),usctheses-c89-117933 (legacy record id)
Legacy Identifier
etd-MaShuangch-7049.pdf
Dmrecord
117933
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Ma, Shuangchao
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
bioinformatics
genetics
NGS
WGS