Identification and analysis of shared epigenetic changes in extraembryonic development and tumorigenesis
IDENTIFICATION AND ANALYSIS OF SHARED EPIGENETIC CHANGES IN
EXTRAEMBRYONIC DEVELOPMENT AND TUMORIGENESIS
by
Benjamin Edgar Decato
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computational Biology & Bioinformatics)
December 2018
Copyright 2018 Benjamin Edgar Decato
Acknowledgements
I am deeply thankful for all of the people who have supported me personally and professionally prior to and
during the course of this work. A Ph.D. is impossible to achieve in a vacuum, and this work would not have
been possible without the rich community of exceptional colleagues, collaborators, friends, and family that
surround me.
Foremost in my mind is my advisor, Prof. Andrew D. Smith. Andrew has been a constant source of
patient instruction, unwavering encouragement, and enthusiastic discussion about science. I will forever
appreciate his ability to push me to be the absolute best scientist I can be while maintaining our friendship.
I would like to thank my committee, Prof. Matthew Dean, Prof. Remo Rohs, and Prof. Peter Calabrese,
for their thoughtful input on my research, approachability, and candor. My cohort provided valuable
motivation, good food, and camaraderie. Additionally, I would like to thank current and former MCB staff
members Haley Peltz, Katie Boeck, Miguel Franco, and Doug Burleson for making the administrative side
of completing a Ph.D. a little less daunting and fielding many (sometimes silly) questions. It has been a
pleasure interacting regularly with every member of my department.
Although not all of my current collaborations are explicitly represented in this manuscript, the regular
conversations and personal interactions with my collaborators have greatly enriched my life and understand-
ing of the biological systems discussed here. The placenta work in Chapter 3 would have been dead in the
water if not for the contributions of Dr. Jorge Lopez-Tello and Prof. Amanda Sferruzzi-Perri, as well as
Prof. Matthew Dean’s cheerful guidance through my first experience with the publication process. I am
grateful to my circulating tumor cell project collaborators Prof. Min Yu and especially Veronica Ortiz, a
constant source of learning and optimism. And most recently, I am grateful for regular discussions about
the nature of genome accessibility and DNA methylation with Kelly Barnett and Prof. Emily Hodges.
Current and former Smithlab members made the lab a cheerful and welcoming place to work every day.
I’d like to thank many of the current and former Methpipe crew: Dr. Song Qiang, Dr. Egor Dolzhenko,
Dr. Jianghan (Jenny) Qu, Meng Zhou, and Xiaojing (Liz) Ji. It has been a pleasure discussing science,
academia, and life afterward with Dr. Timothy Daley, Dr. Philip Uren, Dr. Chao Deng, Dr. Hoda Mirsafian,
Dr. Ehsan Behnamgader, Dr. Emad Bahrami-Samani, Saket Choudhary, Rishvanth Prabakar, Amal Thomas,
Guilherme de Sena Brandine, and Wenzheng Li. The passion for science in this lab is infectious, and I
consider myself lucky to have such an eclectic and entertaining group of friends from the Smithlab.
Outside of the Smithlab, many friends of mine have contributed to my success and sanity during the
long hours of my graduate work. I am grateful to Dr. Krist Hausken, Dr. Kim Nguyen and Dr. David Liu,
the N244 slack channel (composed of my former UNH Computer Science classmates), the “happy Monday”
crew, and everyone else who has watched me learn to balance graduate school workloads with a personal
life for all of their support over the years.
Lastly, my family had a front row seat for six of the busiest and most trying years of my life. My
wife, Dr. Lily Zhang, was a daily source of love and inspiration, and I am confident that I would not have
succeeded without her. My parents Tom and Roxi Decato, as well as my sister Sandy Decato, supported my
decision to move 3,000 miles away to pursue my research, regularly accommodating my changing schedule
and painfully short visits. My cats, Noodles and Cheeto, turned a number of Los Angeles apartments into
welcoming homes to return to after long days in the lab. As anyone who knows me understands, I am deeply
perfectionist and it can be a challenge to continually listen to my private struggles with my work. My family
showed me unwavering love and support even at times when I was too overwhelmed or selfish to reciprocate
it, and for that I will forever be grateful.
Table of Contents
Acknowledgements
Abstract
Chapter 1: Introduction
Chapter 2: Background
2.1 DNA methylation
2.1.1 Quantifying DNA methylation
2.1.2 Analyzing DNA methylation datasets
2.2 Genome organization and accessibility
2.2.1 Genome organization
2.2.2 Methods for assessing chromatin accessibility
2.2.3 Biological dynamics of chromatin accessibility
2.3 Epigenome dynamics by biological context
2.3.1 Mammalian development
2.3.2 Tumorigenesis and metastasis
Chapter 3: DNA methylation divergence and tissue specialization in the developing mouse placenta
3.1 Experimental design
3.1.1 Inter-species whole placenta tissue collection
3.1.2 Intra-species junctional and labyrinthine zone tissue collection
3.1.3 DNA extraction and methylation assay
3.2 Data analysis
3.3 Placenta DNA methylation is globally heterogeneous but highly conserved at regulatory regions
3.3.1 Global heterogeneity and tissue specificity in the mouse placenta
3.3.2 Placenta-specific DNA methylation dynamics at regulatory regions
3.4 Quantifying species- and layer-specific methylation changes in the placenta
3.4.1 Promoter methylation differences between species, layers, and timepoints
3.4.2 Differential methylation in placental retrotransposons
3.5 Progressive PMD formation in the mouse placenta
3.6 Post-study analysis of effect of reference genome on global methylation levels
Chapter 4: Improved methods for identification and analysis of partially methylated domains
4.1 Towards a functional definition of partially methylated domains
4.2 Improving PMD identification through dynamic bin size selection
4.3 An improved method for identifying and scoring optimal PMD boundaries
4.4 PMD identification in methylation array data
4.5 Performance of our PMD identification method on RRBS data
4.6 Consensus segmentation of multiple samples
4.7 Identification of hypomethylated regions in PMD-containing samples
Chapter 5: Analysis of partially methylated domains across species and contexts
5.1 Data-driven refinement of the PMD decision criteria
5.2 Segmentation suggests gradual PMD expansion and conservation of PMD features across species
5.3 Some genes escape hypomethylation and downregulation in late-replicating regions
5.4 Discordant PMD state across species correlates with expression and gene density differences
5.5 Satellite-driven pericentromeric hypomethylation shows promise as a mitotic clock in healthy cell types
5.6 Analysis of hypomethylated regions in PMD-containing samples
5.7 Study design
5.7.1 An expanded methylome segmentation trackhub
5.7.2 Mouse mammary tumour cell lines
5.7.3 Lung cancer cell lines and healthy lung tissue
5.7.4 Miscellaneous methods
Chapter 6: Conclusions
Appendix A: Supplementary Figures
Appendix B: Supplementary Tables
Abstract
DNA methylation is a critical modification to DNA at CpG dinucleotides that associates strongly with gene
expression, aging, and changes in striking and characteristic ways during cellular differentiation and tumori-
genesis. The majority of CpGs in mammalian genomes remain highly methylated throughout differentiation.
Kilobase-scale hypomethylated regions (HMRs) frequently co-locate with regulatory regions and genome-
wide HMR profiles correlate with cell-type specific transcriptional profiles. In tumorigenesis and placental
development, additional megabase-scale regions of hypomethylation known as partially methylated domains
(PMDs) arise and appear to have a repressive effect on surrounding transcription. In this dissertation, I de-
scribe an initial project on DNA methylation dynamics in the placenta that spurred my interest in analysis of
PMDs, and then introduce improved methods to identify PMDs and analyze PMD-containing methylomes.
In the last chapter, I use those methods to explore PMD formation and function across a wide array of PMD-
containing methylomes in different species, improving our understanding of their evolution and relationship
with replication timing as well as identifying a set of genes that escape their repressive effects.
Chapter 1
Introduction
DNA methylation occurs primarily on cytosines of CpG dinucleotides in mammals [Bird, 1985]. Across
the genome, most CpGs are methylated. In somatic cells, hypomethylated regions (HMRs) of up to a few
kilobases co-locate with gene promoters and enhancers and methylation through these intervals typically
associates with gene silencing and restriction of regulatory activity [Jones, 2012]. Most retrotransposons
are observed methylated in most cell types, and this phenomenon is one form of genomic defense against
their expression [Walsh et al., 1998]. In some cell types, notably most cancers and the placenta, larger
megabase-scale regions of hypomethylation called partially methylated domains (PMDs) develop. DNA
methylation changes in measurable and consistent ways as tissues differentiate [Seisenberger et al., 2012]
and can be compared across species [Molaro et al., 2011, Pai et al., 2011], enabling a precise characterization
of cellular and species identity.
Since joining the Smithlab in June of 2012, I have focused on using DNA methylation to explore the
epigenetic basis of cellular development, differentiation, and disease. My contributions to this goal are
two-pronged: simultaneously developing methods for processing and analyzing DNA methylation data and
collaborating with leaders in development, evolution, and cancer research to answer questions about asso-
ciated epigenetic phenomena. My first project in the Smithlab was the processing and analysis of DNA
methylation data from developing intestinal stem cells [Kaaij et al., 2013]. In addition, I actively contribute
to the development and maintenance of our flagship software pipeline Methpipe and its associated database,
Methbase [Song et al., 2013].
The focus of my dissertation is on understanding the methylation landscapes of cellular contexts where
PMDs develop. Each of the three results chapters, summarized below, characterizes systems in which PMDs
exist, with the first providing a deep analysis of placental methylation, the second describing improved
methods for analysis of PMD-containing methylomes, and the third providing a comprehensive study of
PMDs across every context to better understand their formation and function. Before doing any of that, I
introduce in great detail the current state of the field of quantifying and analyzing DNA methylation, partially
drawing on and updating text from a book chapter I co-wrote with Dr. Huy Q. Dinh and Dr. Benjamin
Berman in 2012 that was not published.
In the first results chapter, I describe a specific project involving DNA methylation in the developing
mouse placenta, which was performed in collaboration with Dr. Matthew Dean and our collaborators from
Cambridge University, Drs. Amanda Sferruzzi-Perri and Jorge Lopez-Tello. The mouse placenta is a highly
heterogeneous landscape that, until our work, had been explored in limited contexts and sparsely across
distant species. Little was known about how the placental epigenome varied within an individual species,
but a full understanding of this within-species heterogeneity must precede the identification of meaning-
ful between-species differences. In this project, we explored the placental methylome from 6 strains of 3
closely related mouse species and characterized a globally deregulated epigenome relative to other differ-
entiated tissues. Species-specific methylation patterns primarily existed inside retrotransposon subfamilies,
particularly in RLTR10 and RLTR20 subfamilies. Regulatory regions such as CpG islands and promoter
regions displayed conservation across species. By comparing our data to previously published gene expres-
sion studies, we showed that the globally lower methylation state in placenta is sufficient for transcriptional
silencing, which appears to be a placenta-specific phenomenon.
Additionally, we produced the first purified methylomes of the functionally distinct placental junctional
and labyrinthine zones at two developmental timepoints. We identified a subtle but consistent hypomethy-
lation of the junctional zone globally, as well as many concentrated differences at gene promoters. Pro-
moter differences were enriched on the X chromosome, and most differentially methylated promoters were
hypermethylated in the junctional zone relative to the labyrinthine. Differential methylation between de-
velopmental timepoints uncovered evidence for progressive formation of large, lowly methylated regions
termed partially methylated domains in the placenta as well as widespread de novo methylation, suggesting
a role for the epigenome in mediating differentiation of the layers towards term. In spite of earlier studies
suggesting that methylation in the placenta is static, our studies demonstrate a dynamic methylation program
that varies by species, genetic strain, layer, and developmental timepoint.
In the second results chapter, I describe a set of improved methods for the identification of partially
methylated domains. Chief among them is a method for stabilizing PMD estimates in samples with differing
sequencing depth, which has hampered previous attempts to analyze PMDs across the wide range of samples
available in MethBase. Additionally, I describe an improved boundary refinement method, identification of
PMDs in RRBS and microarray data, and the identification of HMRs in PMD-containing methylomes.
In the last results chapter, I explore the formation and function of PMDs across all of the tissues and
conditions in which they exist. To do this, I utilized the full strength of MethBase, which my labmates and
I have spent over 6 years populating. Using this vast amount of data in collaboration with Dr. Jianghan Qu
and Xiaojing Ji from the Smith lab and Dr. Simon RV Knott from Cedars-Sinai Medical Center, we refined
the criteria for deciding whether a sample has PMDs or not by applying the methods from the previous
chapter to 267 bisulfite sequencing samples from 7 different species from both MethBase and The Cancer
Genome Atlas.
Refinement of PMDs using our updated methods allowed us to take an unprecedentedly clear look at
their formation, boundaries, and epigenomic correlates. We observed expansion of PMDs over mitotic
replication cycles and a tight link between late replicating genomic regions and PMD locations. PMD
boundaries are punctuated by CTCF bound sites, and often occur at the transcriptional start or end sites of
genes. We discovered a set of genes that escape the hypomethylation and downregulation of PMDs in these
late-replicating regions, and a connection between discordant PMD state, expression, and gene density in
homologous sequences across species. Lastly, we observed that even in samples that do not contain PMDs,
large regions of hypomethylation near the centromere lose methylation in a manner consistent with their
perceived replication history. Our results further refine the model of PMD formation as one where sequence
context and regional epigenomic features both play a role in gradual genome-wide hypomethylation.
Chapter 2
Background
The Background chapter is organized into roughly three sections. In the first section, I describe the methods
for quantifying and analyzing DNA methylation in great detail. In the next section, I describe the quantifi-
cation and analysis of genome organization and accessibility in considerably less detail. In the last section, I
provide in-depth descriptions of the changes in DNA methylation and chromatin accessibility that occur dur-
ing the processes of embryonic development and tumorigenesis, with special attention paid to subsystems
that I study in later chapters.
2.1 DNA methylation
DNA methylation occurs as an addition of a methyl group (CH₃) to DNA, typically at cytosine residues
(mC) and particularly at the CpG dinucleotide context in mammalian genomes. This epigenetic mark has
a known mechanism (via Dnmt1, a methyltransferase) for transmission from parent to daughter cells and
the methylation state of nearby genes has long been associated with regulation of gene expression, making
it a focal point of the epigenetics research community. As techniques for profiling DNA methylation state
improve, DNA methylation maps (or “methylomes”) have been constructed at single basepair resolution to
reveal functional roles across a large number of tissue and cell types. This section aims to provide a ground-
up description of DNA methylation quantification and analysis, including experimental techniques, quality
control considerations, and computational and statistical analysis methods.
In order to best explain these techniques and methods, the section is divided into two subsections. The
first gives an overview of methods for quantifying DNA methylation, including the wet-lab protocols for
probing methylation at low and high resolution and the dry-lab technical considerations for processing the
resulting data. The second subsection focuses on the identification and analysis of major features and func-
tions of DNA methylation, such as regions of hypomethylation, allele-specific methylation, and differen-
tially methylated regulatory regions. It also discusses a few miscellaneous considerations when analyzing
DNA methylation data, including how to compare methylation across species, visualize DNA methylation
data, and analyze non-CpG methylation in vertebrate genomes.
2.1.1 Quantifying DNA methylation
This subsection reviews whole genome bisulfite sequencing, the current gold standard for quantification of
DNA methylation. It also explains common mapping procedures and quality control considerations, such as
measuring bisulfite conversion rate and consistent calculation of methylation levels.
2.1.1.1 Whole genome bisulfite sequencing
Bisulfite sequencing is the most widely used method for profiling DNA methylation. Treating DNA
with sodium bisulfite converts unmethylated cytosines to uracils, but leaves methylated cytosines untouched
[Hayatsu et al., 1970]. Following polymerase chain reaction (PCR) and sequencing, the methylated sites
can be identified within the reads as those positions where a cytosine appears.
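The conversion logic described above can be illustrated with a minimal sketch. It assumes a single strand, complete bisulfite conversion, and error-free sequencing; the function names and the toy sequence are hypothetical and are not part of any published tool.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate the read obtained after bisulfite treatment and PCR.

    Unmethylated C becomes U, which is read as T after PCR;
    methylated C is left untouched and is still read as C.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylated_sites(reference, read):
    """Return positions where the reference has C and the read still shows C."""
    return [i for i, (ref_base, read_base) in enumerate(zip(reference, read))
            if ref_base == "C" and read_base == "C"]


# Toy example (hypothetical sequence): cytosines at positions 1 and 9 are methylated.
reference = "ACGTCCGATCGA"
read = bisulfite_convert(reference, methylated_positions=[1, 9])
called = call_methylated_sites(reference, read)
print(read, called)
```

In real data each site is covered by many reads from different molecules, so the comparison against the reference yields counts rather than a single yes/no call.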
Whole genome bisulfite sequencing (WGBS) is the combination of bisulfite treatment with whole-
genome sequencing, and was made possible by short-read high-throughput sequencing technologies, for
example Illumina sequencing. WGBS is the gold standard method for profiling DNA methylation [Laird,
2010] and covers more of the genome, with fewer biases, than any other method. The advantages
of WGBS data in understanding DNA methylation are tremendous and are explained throughout this
section. WGBS using 100 base reads can cover more than 96% of all CpG sites in the human genome,
including the satellite repeat and pericentromeric CpGs that alternative methods including RRBS (see the
next section) [Meissner et al., 2005] and regional methods often exclude. The whole-genome, high cover-
age scope of this technique has facilitated hundreds of studies in all manner of species and cell types, and
is playing a central role in the study of epigenomic impact on development, evolution and disease. There
are a few downsides. Bisulfite sequencing is unable to distinguish methylation from hydroxymethylation
[Huang et al., 2010] (see Section 2.1.1.10), which is becoming increasingly recognized as having biological
significance. As a resequencing method, WGBS relies on having an established reference genome and can
be quite costly, but the amount of information gained is immeasurable.
[Figure 2.1 (schematic): genomic DNA from a population of cells, with methyl groups marked at some CpG sites, undergoes sodium bisulfite treatment (unmethylated CpG becomes UpG), then PCR and sequencing (UpG is read as TpG); the reads are aligned to the reference genome and summarized per CpG site as a methylation level, C/(C+T), on a scale from 0 to 1.]
Figure 2.1: DNA is isolated from a population of cells and treated with sodium bisulfite, converting all
unmethylated (starred) cytosines to uracil. After PCR and sequencing, these converted bases are read as thymine.
The fraction of reads showing a cytosine at a site gives the methylation “level” of that site for the population. For more details
see the section on calculating methylation levels (Section 2.1.1.7).
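As a small illustration of the C/(C+T) level shown in Figure 2.1, the sketch below computes per-site methylation levels from read counts. The site names and counts are hypothetical, and returning `None` for uncovered sites is a choice made only for this example.

```python
def methylation_level(c_count, t_count):
    """Methylation level at one site: C / (C + T).

    c_count: reads showing C (methylated) at the site
    t_count: reads showing T (unmethylated, converted) at the site
    Returns None when the site has no coverage.
    """
    total = c_count + t_count
    if total == 0:
        return None
    return c_count / total


# Hypothetical (C, T) read counts for three CpG sites.
site_counts = {
    "chr1:100": (3, 1),
    "chr1:105": (1, 3),
    "chr1:110": (0, 4),
}
levels = {site: methylation_level(c, t) for site, (c, t) in site_counts.items()}
print(levels)
```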
2.1.1.2 Reduced representation bisulfite sequencing
Although sequencing costs continue to drop, the amount of sequencing involved in producing methylation
maps using WGBS remains prohibitive for many labs [Bock et al., 2010, Harris et al., 2010]. As a result,
several variant methods that still use bisulfite sequencing but selectively sequence regions of interest are
popularly used for studies with a large number of samples.
Meissner et al. [2005] introduced a method called reduced representation bisulfite sequencing (RRBS)
that combines restriction enzymes and bisulfite sequencing to profile DNA methylation at CpG-dense ge-
nomic regions. The RRBS procedure starts with DNA digestion with the restriction endonuclease MspI
[Meissner et al., 2005] or BglII [Meissner et al., 2008], followed by bisulfite sequencing with adapter lig-
ation. A major limitation of RRBS is that it only provides information about CpG sites associated with
restriction sites, and thus surveys a small and biased fraction of possible methylation sites. Another method,
DREAM [Jelinek and Madzo, 2016], uses even more specific restriction enzymes SmaI and XmaI, which
cut at CCCGGG recognition sites, yielding just over 50,000 out of the total 28 million CpG sites in the
human genome. Akalin et al. [2012] introduced an improved method, called Extended RRBS, that resulted
in better genomic coverage and alignment accuracy and allowed for a limited amount of input materials. Li
et al. [2013] presented an analysis framework to identify differentially methylated regions specialized for
RRBS data. However, all of these methods rely on reads starting at the same restriction site. When using
single-end reads, it becomes very difficult to distinguish duplicate reads arising from PCR overamplification
from distinct molecules. Recently, unique molecular identifiers (UMIs) like DNA barcodes have become
popular and provide an attractive solution to this problem [Hebert et al., 2003].
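To make the restriction-site dependence of RRBS concrete, here is a minimal in silico digest sketch. It assumes MspI cuts between the first and second base of every CCGG site (C^CGG), considers one strand only, and ignores the fragment size selection used in practice; the function name and toy sequence are hypothetical.

```python
import re


def mspi_fragments(seq):
    """Cut seq at every CCGG site (C^CGG) and return the resulting fragments."""
    # A zero-width lookahead finds every site start; the cut falls one base later.
    cut_points = [m.start() + 1 for m in re.finditer(r"(?=CCGG)", seq)]
    bounds = [0] + cut_points + [len(seq)]
    return [seq[start:end] for start, end in zip(bounds, bounds[1:])]


# Toy sequence with two MspI sites.
genome = "AAACCGGTTTTCCGGAA"
fragments = mspi_fragments(genome)
print(fragments)
```

Every internal fragment begins with CGG, so reads from different molecules covering the same fragment start at the same position, which is why PCR duplicates are hard to distinguish from distinct molecules in single-end RRBS data.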
RRBS data requires some minor adjustments to analyze. Though most bisulfite mapping programs can
deal with RRBS data, RRBSMAP [Xi et al., 2012] adapts conventional bisulfite
sequencing mapping (see Section 2.1.1.6) specifically to RRBS reads. Specialized techniques for mapping RRBS reads primarily differ
in that they can be faster and require less memory by only considering relevant genomic regions based on
the locations of the restriction sites in the genome. When calculating methylation levels in RRBS, coverage
is, by design, directed specifically to CpG-rich regions. This will result in different estimates of global
methylation “levels” and must be considered when comparing RRBS and WGBS methylomes (see Section
2.1.1.7 for more details on calculating methylation levels).
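The effect of CpG-rich enrichment on global estimates can be sketched with toy numbers. The counts below are hypothetical and assume, for illustration only, that the CpG-rich sites are mostly unmethylated while the remaining sites are mostly methylated; the averaging rule (mean of per-site levels) is also just one of several possible definitions.

```python
def mean_level(sites):
    """Mean of per-site methylation levels C / (C + T) over covered sites."""
    levels = [c / (c + t) for c, t in sites if c + t > 0]
    return sum(levels) / len(levels)


# Hypothetical (C, T) read counts per CpG site.
cpg_poor_sites = [(9, 1), (8, 2), (9, 1), (10, 0)]  # mostly methylated
cpg_rich_sites = [(0, 10), (1, 9)]                   # mostly unmethylated

wgbs_like = mean_level(cpg_poor_sites + cpg_rich_sites)  # all sites covered
rrbs_like = mean_level(cpg_rich_sites)                   # CpG-rich sites only
print(round(wgbs_like, 3), round(rrbs_like, 3))
```

Under these assumptions the RRBS-like estimate is far lower than the WGBS-like one even though the underlying methylome is identical, which is why the two should not be compared directly.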
2.1.1.3 Targeted bisulfite sequencing approaches
Bisulfite sequencing allows single-base resolution interrogation of methylation state. This is extremely
powerful, but can be very expensive on a genome wide scale. Here we summarize two methods for targeted
exploration of methylation state that significantly reduce the cost of DNA methylation studies: bisulfite capture
and bisulfite padlock probe sequencing. Although both were released at nearly the same time, the two
approaches are very different.
Bisulfite capture uses microarray hybridization following bisulfite treatment to enrich the sample for
fragments originating in a set of regions of interest prior to sequencing [Hodges et al., 2009]. One drawback
of this method is the possibility of biases related to different methylation states having different efficiency
of capture. The probes were designed as pairs, with one member of each pair enriching for molecules with
methylation (and hence retained Cs) at all CpG sites through the complementary region; the other enriches
for molecules lacking methylation. In addition, this method requires significant amounts of starting DNA,
which limits its applicability.
Bisulfite padlock probe sequencing provides an unbiased alternative to restriction enzyme based ap-
proaches [Ball et al., 2009]. Padlock probes are 100 nucleotide sequences whose center is a common
backbone sequence and whose ends hybridize to regions adjacent to the CpG-rich area of interest. This
forms circular DNA that includes the region of interest, and can be sequenced using the common backbone
sequence for priming during PCR. By doing so, bisulfite padlock probe sequencing allows for highly spe-
cific, unbiased, and single-base resolution analysis of DNA methylation at the sites contained in the probes.
Deng et al. [2009] introduced an experiment successfully profiling 66,000 CpG sites, mainly in CpG islands
with a set of 30,000 padlock probes. Padlock probe sequencing does require substantial library preparation
to create the necessary probes, but some work has been done towards automating this process [Diep et al.,
2012].
2.1.1.4 Low DNA input methods for bisulfite sequencing
A major technical weakness of WGBS lies in its high input DNA requirement, leading to the necessary
profiling of a large number of cells. This requirement limits WGBS to exploring questions related to a het-
erogeneous population of epigenotypes for cell types where DNA is abundant enough to profile. Obvious
drawbacks to this approach include considerations for the cellular heterogeneity of the sample (particu-
larly in fast-evolving populations like cancer) [Sottoriva et al., 2015], the inability to profile very rare or
difficult to obtain cell types, and the inherent obfuscation of methylation state by cell that could reveal long-
range interactions and co-methylation on a single-cell scale. These and other shortcomings of bulk WGBS
experiments motivate the study of methods for reducing the amount of input DNA required for accurate
quantification of genome-wide methylation profiles.
In some cases, it is not economically feasible or safe, or it is prohibitively difficult, to obtain a large enough
sample for bulk WGBS. The most obvious examples of such samples are the rapidly changing cells of early
embryonic development. Although some studies have produced composite methylomes from zygotes [Peat
et al., 2014], these bulk methylomes suffer from the same heterogeneity described above. During early
embryonic development every cell has a unique epigenotype, and by profiling each cell individually and
noting its geographic location on the embryo, it may be possible to anticipate cell fate and perform very
early screens for cell-type specific diseases as a form of preimplantation genetic diagnosis (PGD) during
in-vitro fertilization. In addition, the stability of allele-specific methylation (see Section 2.1.2.4) during this
very fluid period in embryonic development makes the PGD of ASM-related disorders such as Prader-Willi
or Angelman syndrome possible.
Outside of early embryonic development, difficult-to-survey cell types provide an alternative motivation
for production of single cell methylomes. Needle biopsies typically provide less DNA than adequate for
composite WGBS, and full biopsies of delicate regions like the brain can be unnecessarily dangerous and
are typically used only as a last resort for diagnosis of neurodegenerative disorders. By producing single cell
methylomes from very low numbers of cells, biopsies can become less invasive and comparisons to healthy
bulk methylomes can be made.
Efforts have been made to reduce the amount of input DNA bisulfite sequencing requires. Post Bisulfite
Adapter Tagging (PBAT) and T-WGBS are two such methods that work with as little as 20 ng of input DNA.
However, a diploid human cell contains roughly 6 picograms of DNA, meaning these methods are not applicable
to the study of methylation in single cells.
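The gap between these input requirements and a single cell can be made explicit with a short calculation. It uses the rough estimate of 6 picograms of DNA per diploid human cell quoted above; the function name is hypothetical.

```python
PG_PER_DIPLOID_CELL = 6.0  # rough estimate quoted in the text
PG_PER_NG = 1000.0


def cell_equivalents(input_ng):
    """Approximate number of diploid human cells supplying input_ng of DNA."""
    return input_ng * PG_PER_NG / PG_PER_DIPLOID_CELL


cells_for_20ng = cell_equivalents(20)      # about 3,333 cells
cells_for_6pg = cell_equivalents(0.006)    # about 1 cell
print(round(cells_for_20ng), round(cells_for_6pg))
```

A 20 ng library therefore still pools DNA from thousands of cells, more than three orders of magnitude above single-cell input.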
T-WGBS is an extension of a transposase-based in vitro shotgun library construction method first applied
to genomic sequencing [Adey et al., 2010]. A hyperactive transposase, Tn5, fragments the DNA and appends
sequencing adapters in the same step. Because Tn5 targets double-stranded DNA, this step is performed
prior to the sodium bisulfite treatment in T-WGBS, and the cytosines in the adapter sequence are methylated
prior to the treatment to retain their identity during sequencing [Adey and Shendure, 2012]. T-WGBS retains
reasonable library complexity down to 1ng of DNA and displays no regional coverage biases.
PBAT attempts to circumvent the loss of template sequence during the sodium bisulfite treatment by
ligating adapters afterward. This simple change substantially improved the coverage of WGBS samples,
and facilitates library preparation with “even sub-nanogram amounts of DNA” [Miura et al., 2012].
2.1.1.5 Single-cell methylation profiling
The next challenge in DNA methylation assays and analysis is certainly to accurately profile the methylation
state on a single-cell scale. Although single cell methylation is just emerging as technically feasible, a series
of approaches to querying and quantifying DNA methylation in single cells have been produced so far.
scRRBS and scWGBS are methods that mirror the similarly named bisulfite sequencing assays designed
for populations of cells: after isolation of a single cell, fragmentation via MspI digestion [Guo et al., 2013]
or simultaneous fragmentation and conversion via sodium bisulfite [Farlik et al., 2015] precede PCR and
high throughput sequencing. Due to the damaging effects of the sodium bisulfite treatment and extremely
low amount of starting DNA, both techniques suffer from very low coverage, reporting an average of 1
million CpG sites in single-cell human samples, which corresponds to just under 4% of all CpG sites in the
human genome. scRRBS additionally suffers from the coverage biases of traditional RRBS (see Section
2.1.1.2), but with such low coverage these biases allow us to assay arguably more important regions more
reliably. While the coverage is certainly low, the main redeeming benefit of these methods is their high
throughput, with their studies producing 35 and 81 single cell samples, respectively.
Other single-cell sequencing techniques rely on post-bisulfite adapter tagging (see Section 2.1.1.4) and
pre-amplification cycles prior to PCR [Smallwood et al., 2014] [Gravina et al., 2015]. These approaches
yield higher CpG coverage, and Smallwood et al. [2014] reports that sequencing to saturation with long reads
can yield CpG coverage up to 48.4%. One drawback to these approaches is that sequenced reads with the
same start and end coordinates will be produced and therefore duplicates arising from PCR overamplification
cannot be reliably removed. Without these duplicates, sites with higher coverage than the ploidy of the cell
could be reasoned as either duplicate locations mapping to the same region in the reference or a change in
the copy number of the region.
Recently, methods for single-cell methylation analysis have been refined further and applied to specific
biological contexts on increasingly large scales. Luo et al. [2017] developed single-nucleus methylcytosine
sequencing (snmC-seq), used it to produce over 6,000 single cell methylomes, and refined our understanding
of human and mouse neuronal subtypes. Additionally, other labs have focused on developing assays that
probe single cell methylation state together with other marks of interest, including transcriptional profiles
[Angermueller et al., 2016] and chromatin accessibility [Clark et al., 2018].
Despite the recent advances and high-throughput studies, the amount of methylation information gleaned
from a single cell is severely limited using bisulfite-sequencing based methods. One possible direction may
be to assay DNA methylation on DNA directly using nanopore sequencing [Simpson et al., 2017]. Because
the length of nanopore reads can reach megabase scale, and a single read is guaranteed to come from a single
cell, accurate identification of methylation on each read could exceed the current amount of information
gained by bisulfite-based methods in a single read.
2.1.1.6 Mapping reads from bisulfite sequencing
Mapping reads originating from bisulfite converted DNA was the first major analysis problem associated
with whole-genome bisulfite sequencing to receive significant attention. The comprehensive review by
Bock [2012] explains the surface level issues involved in read mapping, so here I briefly cover these and
refer the reader to that review for additional details and comparisons of various tools.
Aligning reads from bisulfite sequencing must take into account the conversion of unmethylated cy-
tosines, which appear as thymines in the reads. There are generally two approaches to solving this problem.
One approach converts all remaining cytosines in reads into thymines prior to mapping, and also performs a
similar conversion for all cytosines in the genome. The obvious effect is that more reads will map ambigu-
ously: a greater fraction of the genome is composed of sequences that appear identically at more than one
place in the genome. As in the case of genotyping, these must be excluded from analysis, which is of greater
importance when the project seeks to understand methylation at repeat elements like retrotransposons. This
can have a magnified effect when the read length is small, but is mitigated somewhat by using paired-end
reads.
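The loss of specificity in the converted space can be illustrated with a toy example. This is a minimal sketch, assuming exact matching on the forward strand only; the genome, read, and function names are invented for illustration and do not correspond to any published mapper.

```python
from typing import List


def c_to_t(seq: str) -> str:
    """In silico bisulfite conversion: every C becomes T."""
    return seq.upper().replace("C", "T")


def exact_hits(text: str, pattern: str) -> List[int]:
    """Start positions of all exact (possibly overlapping) occurrences of pattern."""
    return [i for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]


def map_three_letter(read: str, genome: str) -> List[int]:
    """Exact matching after converting both read and genome to three letters."""
    return exact_hits(c_to_t(genome), c_to_t(read))


# Toy forward-strand genome with two loci that differ only at one C/T position.
GENOME = "ACGTTAGGATTGTTAGG"
READ = "CGTTAGG"  # retains its C, so it came from a methylated copy of the first locus

if __name__ == "__main__":
    print(exact_hits(GENOME, READ))        # hits in the original four-letter space
    print(map_three_letter(READ, GENOME))  # hits after C-to-T conversion
```

In the four-letter space the read has a single hit, but once both sequences are converted the two loci become identical and the read maps ambiguously.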
The other category of mappers, called “wild-card” mappers, deals with the T→C mapping directly
rather than modifying the reads or genome. This category achieves greater specificity at a cost of biasing the
mappability of reads such that reads with a greater fraction of cytosines (and therefore higher methylation)
are more likely to map uniquely. This bias is almost completely circumvented in the RMAPBS [Smith
et al., 2009] program by allowing a C in reads to map over a T in the genome when the T is part of a TpG
dinucleotide.
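The wild-card rule and the mappability bias it introduces can be sketched as follows. This is an illustrative toy, assuming ungapped forward-strand matching; it implements only the basic rule that a T in the read may sit over a C in the genome, and does not implement the additional TpG rule attributed to RMAPBS above. All names are hypothetical.

```python
from typing import List


def wildcard_match(read: str, ref: str) -> bool:
    """A read base matches if identical, or if a read T sits over a reference C."""
    return len(read) == len(ref) and all(
        r == g or (r == "T" and g == "C") for r, g in zip(read, ref)
    )


def map_wildcard(read: str, genome: str) -> List[int]:
    """Start positions where the read matches the unmodified genome under the rule."""
    n = len(read)
    return [i for i in range(len(genome) - n + 1)
            if wildcard_match(read, genome[i:i + n])]


# Toy forward-strand genome with two loci that differ only at one C/T position.
GENOME = "ACGTTAGGATTGTTAGG"

if __name__ == "__main__":
    print(map_wildcard("CGTTAGG", GENOME))  # read that kept its C (methylated)
    print(map_wildcard("TGTTAGG", GENOME))  # read whose C was converted (unmethylated)
```

The read that retains its cytosine maps to one location, while the fully converted read is compatible with both loci, which is the bias toward methylated reads described above.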
As there are currently many read mapping tools that perform well for bisulfite sequencing data, the
primary considerations in selecting one include ease of use and efficiency in terms of computational speed
and memory. A review of bisulfite read mapping methods compared three of the most popular methods
for aligning bisulfite treated reads: Bismark, BSMAP, and RMAPBS [Chatterjee et al., 2012]. The review
cites efficient mapping from all of them, and explains optimal situations for using each of the different
packages. A different set of mappers was recently benchmarked by Kunde-Ramamoorthy et al. [2014] with
similar results. The comparison cited Bismark as a balanced tradeoff between mapping and performance,
and another method, Pash, as providing high mappability due to its uniquely mapping reads in regions
of structural variation. More recently, our lab produced a new wild-card bisulfite read mapper, Wildcard
ALignment Tool (WALT), that employs the periodic spaced seeds described in [Chen et al., 2009]. WALT
was used as the mapper in all of my subsequent analysis sections [Chen et al., 2016].
2.1.1.7 Summarizing DNA methylation levels
The idea that sites could be “called” as methylated or unmethylated appeared in the first whole genome
bisulfite papers [Cokus et al., 2008, Lister et al., 2009]. This notion treats potential methylation sites in a
manner similar to SNPs, for which calling an alternate or heterozygous status makes sense. In the context
of DNA methylation, the state at an individual site in a whole genome bisulfite experiment is only discrete
when the underlying data is from a single cell. Since current WGBS data sets are derived from populations of
cells, the reads mapping over each site are assumed to originate from distinct molecules (see Section 2.1.1.6).
The methylation “level” for a single site is the fraction of molecules in the sequenced sample that have the
methylation mark at that site. Analysis always assumes that the molecules are sampled uniformly at random
from those in the population of cells (a necessary assumption that may not be strictly correct), so we can
model the outcomes with a binomial distribution, and the “level” is the number of methylated observations
divided by total observations at the site. This is the unbiased parameter estimate for the underlying binomial
distribution.
For a single site, there is no ambiguity when talking about methylation level. Quantifying the methyla-
tion level in a region, however, opens the door to several possible ways of extending the concept. The issues
are explained in detail by Schultz et al. [2012]. The authors explain three distinct definitions for methylation
levels that include multiple sites. The mean methylation level first obtains a methylation level for each site
(as described above), and then takes their mean. The weighted mean methylation level takes the sum of
methylated observations over all sites considered, and then divides by the total observations. The fractional
methylation level is based on first “calling” sites as methylated, then computing the fraction of sites that are
actually methylated.
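The three definitions can be compared on a small example. This is a minimal sketch: each site is represented as a pair of counts, and the rule used to “call” a site as methylated (a simple cutoff on the site level) is an assumption made for illustration, not the calling procedure of Schultz et al. [2012].

```python
from typing import List, Tuple

Site = Tuple[int, int]  # (methylated observations, total observations)


def mean_level(sites: List[Site]) -> float:
    """Mean of per-site levels; every covered site counts equally."""
    levels = [m / t for m, t in sites if t > 0]
    return sum(levels) / len(levels)


def weighted_mean_level(sites: List[Site]) -> float:
    """Pooled methylated observations divided by pooled total observations."""
    return sum(m for m, _ in sites) / sum(t for _, t in sites)


def fractional_level(sites: List[Site], cutoff: float = 0.5) -> float:
    """Fraction of covered sites called methylated (assumed rule: level >= cutoff)."""
    calls = [m / t >= cutoff for m, t in sites if t > 0]
    return sum(calls) / len(calls)


if __name__ == "__main__":
    region = [(9, 10), (1, 10), (40, 40)]  # the third site has 4x the coverage
    print(mean_level(region), weighted_mean_level(region), fractional_level(region))
```

Because the third site has far more reads than the others, it dominates the weighted mean but not the other two measures, so the three values differ for the same region.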
For each of the methods described above to compute a methylation level, any bias in coverage associated
with CpG density will also bias the computation of the level. One commonly observed type of bias is an
enrichment of reads from parts of the genome with increased CpG density. This bias is bidirectional: Song
et al. [2013] showed that over a large set of samples from different projects, bias in coverage varied both
higher and lower in CpG dense regions, in seemingly equal proportions. In the case of weighted mean
methylation, greater weight will be given to observations in CpG dense regions, which are also expected a
priori to have lower methylation level, thus giving increased weight to regions of the genome with lower
methylation. The other two measures are more resilient. The mean methylation, taken over enough sites, will
provide a more accurate estimation for the true methylation level given enough coverage of the low-density
areas.
2.1.1.8 Quality control in analysis of DNA methylation
Several barriers exist to proper and unbiased analysis of DNA methylation. Inherent biases in the experi-
mental protocol itself as well as pitfalls in analysis can both lead to erroneous findings, but researchers have
done a great amount of work correcting these biases.
One of the most common experimental biases is the incomplete conversion of unmethylated cytosines
by the bisulfite treatment. This incomplete conversion has been shown to be concentrated towards the 5’
end of a read, most likely due to re-annealing of sequences adjacent to fully methylated adapters during
sequencing [Berman et al., 2012]. Liu et al. [2012] solves this problem using a filter that walks from the 5’
to 3’ end of a read, discarding cytosine information until it reaches the first unmethylated cytosine that has
been converted to a T when compared with the reference genome. More generally, a commonly accepted
method for quantifying the bisulfite conversion rate genome wide is to observe the conversion rate in DNA
known to be entirely unmethylated. This DNA can be spiked-in and entirely foreign, or can be sampled
from a region known to be entirely unmethylated, such as the human mitochondria [Hong et al., 2013].
After bisulfite treatment, the fraction of cytosines in this unmethylated sequence that were converted (one minus
the fraction left unconverted) is a reasonable estimate of the genome-wide conversion rate. While information can still be gained from samples with low
conversion rate, high confidence in results, especially at single-base resolution, requires conversion rates
above 95%.
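Both quality-control steps in this paragraph can be sketched on aligned read and reference strings. The code below is an illustration under simplifying assumptions: ungapped forward-strand alignments, reads given 5' to 3', and cytosine calls kept from the first converted cytosine onward (whether that first position is itself kept is a choice made here). It is not the implementation of Liu et al. [2012], and all names are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def first_conversion(read: str, ref: str) -> Optional[int]:
    """Index of the first reference C that appears as T in the read, if any."""
    for i, (r, g) in enumerate(zip(read, ref)):
        if g == "C" and r == "T":
            return i
    return None


def filtered_calls(read: str, ref: str) -> List[Tuple[int, bool]]:
    """(position, is_methylated) for reference cytosines from the first conversion on."""
    start = first_conversion(read, ref)
    if start is None:
        return []  # no evidence of conversion in this read, so discard all calls
    return [(i, r == "C") for i, (r, g) in enumerate(zip(read, ref))
            if i >= start and g == "C" and r in "CT"]


def conversion_rate(control_pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of cytosines converted in (read, ref) pairs from unmethylated DNA."""
    converted = unconverted = 0
    for read, ref in control_pairs:
        for r, g in zip(read, ref):
            if g == "C" and r == "T":
                converted += 1
            elif g == "C" and r == "C":
                unconverted += 1
    if converted + unconverted == 0:
        raise ValueError("no cytosines observed in the control sequence")
    return converted / (converted + unconverted)


if __name__ == "__main__":
    print(filtered_calls("ACCGTTAACGT", "ACCGTCAACGT"))
    print(conversion_rate([("ATTGT", "ACCGT"), ("ATCGT", "ACCGT")]))
```

In the first call the two leading cytosines are dropped because they precede the first conversion event; the second call pools a toy control and would be compared against the 95% threshold mentioned above.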
PCR is performed following bisulfite treatment, and comes with its own set of considerations. It has
long been known that PCR bias exists and is sequence dependent. The conversion of all unmethylated
cytosines leads to a specific sequence context and introduces bias [Warnecke et al., 1997]. Shen
et al. [2007] reported a decrease in PCR bias in bisulfite sequencing experiments by increasing the annealing
temperature for PCR, but the method relies on designing experiments using control mixtures of varying,
known methylation state to calibrate the temperature.
In addition to ensuring that methylated and unmethylated alleles are represented in equal proportions
during PCR, identifying and removing duplicate reads from the same PCR experiment after sequencing is
also important. After mapping, reads mapping to exactly the same genomic coordinates from the same
lane are likely duplicate reads, and should not be used to increase the power of methylation analyses. Most
techniques simply throw away one of the reads at random, but theoretically it is possible to consider bisulfite
conversion state in both reads while choosing which to retain [Robertson et al., 2008].
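The coordinate-based rule described here can be sketched as follows. This is a minimal illustration, assuming each mapped read carries its lane, chromosome, start, end, and strand; the record type and function name are hypothetical, and the random choice stands in for the "keep one at random" behavior described above.

```python
import random
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class MappedRead(NamedTuple):
    name: str
    lane: int
    chrom: str
    start: int
    end: int
    strand: str


def remove_duplicates(reads: List[MappedRead], seed: int = 0) -> List[MappedRead]:
    """Keep one randomly chosen read per (lane, chrom, start, end, strand) group."""
    groups: Dict[Tuple, List[MappedRead]] = defaultdict(list)
    for read in reads:
        groups[(read.lane, read.chrom, read.start, read.end, read.strand)].append(read)
    rng = random.Random(seed)  # seeded so the selection is reproducible
    return [rng.choice(group) for group in groups.values()]


if __name__ == "__main__":
    reads = [
        MappedRead("r1", 1, "chr1", 100, 200, "+"),
        MappedRead("r2", 1, "chr1", 100, 200, "+"),  # duplicate of r1
        MappedRead("r3", 2, "chr1", 100, 200, "+"),  # same coordinates, other lane
        MappedRead("r4", 1, "chr1", 150, 250, "+"),
    ]
    print([r.name for r in remove_duplicates(reads)])
```

A selection rule that inspects bisulfite conversion state, as suggested in the last sentence above, would replace the random choice inside each group.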
Bisulfite treatment is harsh and degrades DNA. As a consequence, the experiment requires significant
amounts of starting material. This can be a problem when using WGBS to profile DNA methylation in
rare cells (e.g. FACS purified cells) or single cells. In such situations the bisulfite sequencing libraries
tend to be of very low complexity, with most distinct original molecules corresponding to multiple reads as
PCR products. These “duplicate” reads must be discarded to avoid bias in estimating methylation levels.
In many cases significant amounts of data are thrown out. The Smith lab, in an effort to avoid this wasted
sequencing, developed a framework for sequencing a small amount from a prepared WGBS library, and then
predicting the yield of distinct molecules as a function of the number of reads sequenced [Daley and Smith,
2013]. The method, called PRESEQ, is highly accurate, can make predictions about large-scale sequencing
experiments, and ultimately can save significant expense in WGBS projects starting with small amounts of
cells.
2.1.1.9 Other techniques for methylation quantification
Before sequencing-based methylation quantification techniques were well-developed, three other techniques
for probing the methylation levels at CpG sites were commonplace: DNA methylation arrays, immunopre-
cipitation based techniques, and methylation-sensitive restriction enzyme approaches. The use of these
assays endures due to their relative cost effectiveness and the wealth of publicly available data already pro-
duced using them. Below I briefly describe each technique and discuss its benefits and drawbacks relative
to sequencing based methods, as well as considerations for its use alongside sequencing-based approaches.
Methylation arrays Fluorescent probe arrays are by far the most widely used technology for DNA methylation
quantification, even as their utility is being surpassed by sequencing based methods. Fluorescence based
methods rely on a set of pre-determined oligonucleotides annealed to a chip. For each CpG site of interest,
two sets of nearly identical oligonucleotides are annealed next to each other, one with a C and one with a
T at the CpG locus corresponding to the methylated and unmethylated fragments respectively. Bisulfite-
treated DNA is added to the chip and binding occurs with the oligonucleotides. Extension occurs with
labeled nucleotides, and subsequent fluorescent staining yields a ratio of two colors that is calculated as the
methylation level of the CpG site.
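The ratio described above can be written as a small function. This is a sketch only: the intensities are hypothetical inputs, and the optional offset in the denominator is an assumption added to keep the ratio stable when both signals are weak, not a detail taken from the text.

```python
def array_methylation_level(meth_intensity: float,
                            unmeth_intensity: float,
                            offset: float = 0.0) -> float:
    """Methylated signal as a fraction of total signal at one probe pair."""
    total = meth_intensity + unmeth_intensity + offset
    if total <= 0:
        raise ValueError("no signal at this probe")
    return meth_intensity / total


if __name__ == "__main__":
    print(array_methylation_level(3000.0, 1000.0))
```

Unlike a count-based level from sequencing, this value is continuous, which is the source of the discretization caveat when array and sequencing data are compared.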
Because oligonucleotides must be designed for each CpG site of interest, the number of CpGs assayed
is the main limitation to methylation arrays. The most widely used array is the Illumina Methylation 450K
beadchip, which assays 450,000 CpG sites in a wide range of genomic contexts. The Cancer Genome Atlas
assayed thousands of tumor samples using this technology, facilitating useful population-level variation
studies at this subset of CpG sites. More recently, this assay has been expanded to include over 850,000
CpG sites [Pidsley et al., 2016]. These chips are amenable to high-throughput studies owing to the fact that
up to 96 samples can be loaded onto a single chip.
Methylation array data suffers from cross-sample comparison issues owing to the relative differences
in total fluorescence. A number of normalization methods have been produced to reliably estimate the
methylation levels across chips [Aryee et al., 2014, Fortin et al., 2016, Maksimovic et al., 2012]. Because
there is no sequencing aspect of array methylation quantification, there is no risk of discretization of values,
and when comparing array data to sequencing data care must be taken to ensure that discretization is not
driving any observed differences.
Immunoprecipitation and methylation-sensitive restriction approaches In addition to microarray- and
sequencing-based techniques making use of the sodium bisulfite reaction, two other general approaches
have been used to profile DNA methylation: immunoprecipitation and methylation-sensitive restriction.
The basic techniques have existed for some time, and just like bisulfite-based methods, it was natural to
use them in the context of high-throughput sequencing. The readout for both types of methods is quite
different from the bisulfite sequencing, and more similar to methods like ChIP-seq and RNA-seq. In both
cases, the sample is enriched for molecules that originate from parts of the genome with greater amounts
of methylation. These methods can be quite easy and effective in testing a variety of hypotheses about
methylation, but they generally lack precision and can be difficult to normalize. These strategies follow
the same general formula, and involve using methylation-specific antibodies or restriction enzyme-based
fragmentation followed by sequencing to identify enrichment [Bock et al., 2010, Harris et al., 2010, Laird,
2010] or depletion [Illingworth et al., 2008] of methylation.
Methylated DNA Immunoprecipitation Sequencing (meDIP-seq) utilizes immunoprecipitation with monoclonal
antibodies specific to methylated cytosines followed by sequencing [Down et al., 2008, Weber et al.,
2005]. MeDIP-seq provides the lowest cost per CpG covered genome-wide [Harris et al., 2010] and requires
a relatively low starting DNA concentration (on the order of 160-300ng), which makes it an attractive, if
lower-resolution, alternative to RRBS [Taiwo et al., 2012]. A disadvantage of meDIP-seq is its inability to
selectively enrich for different levels of regional methylation during the immunoprecipitation stage.
Methyl-binding domain sequencing (MBD-seq) answers this shortcoming by applying salt cuts to the
DNA fragments during extraction [Aberg et al., 2012, Lan et al., 2011]. Using this technique, fragments with
low methylation levels are extracted at lower salt concentrations, and it is possible to select for regions known
to have high DNA methylation concentration such as CpG islands by first removing the low concentration
fragments [Hirst and Marra, 2010]. In addition, the starting concentrations of MBD-seq are much lower, on
the order of 10ng [Serre et al., 2010]. This increase in specificity makes MBD-seq an attractive alternative
to meDIP-seq and reduces extraneous sequencing, allowing researchers to save money or increase coverage
of important areas (for more on reducing sequencing cost, see Section 2.1.1.8). However, both of these
methods suffer from biases related to the binding affinity of their respective proteins.
Another major mechanism for DNA methylation quantification is methylation-sensitive restriction en-
zyme sequencing (MRE-seq) [Khulan et al., 2006]. MRE-seq utilizes a size-selective DNA digestion with
a specific enzyme, often HpaII, followed by sequencing to probe methylation state [Brunner et al., 2009,
Oda et al., 2009]. Later iterations of the mechanism cut restriction sites with unmethylated CpGs, followed
by calculation of methylation levels by observing the depletion of mapped reads at cutting sites [Ball et al.,
2009, Maunakea et al., 2010, Nair et al., 2011]. The similarity of MRE-seq and its derivatives to RRBS
(see Section 2.1.1.2) has led to a high correlation in the called methylation levels, but that correlation drops
significantly as global methylation level in the sample increases [Bock et al., 2010]. In addition, this method
relies on a large (on the order of 10mg) amount of starting DNA [Khulan et al., 2006].
A downside to restriction enzyme and immunoprecipitation based methods is the difficulty of analysis.
Because they are not single-site specific, a large number of statistical frameworks have been developed to
infer methylation levels by region. Down et al. [2008] developed a Bayesian tool for methylation level
estimation in meDIP-seq-type immunoprecipitation experiments (Batman) and compared its estimations to
control WGBS profiles of mammalian genomes, citing high correlation. Later work has integrated meDIP-
seq with restriction enzyme data to exploit their complementary nature and produce higher-quality estimates
[Harris et al., 2010, Maunakea et al., 2010].
This work has led to machine learning methods that leverage both types of experimental data along-
side other genomic features including CpG density, distribution of enzyme cut sites, and annotations of
genes and repeat elements to make precise estimates of methylation level at single-base resolution [Stevens
et al., 2013]. In addition to accurately estimating methylation level in single samples, recent methods de-
tect differentially methylated regions across samples which has increased the accessibility of comparative
epigenomics to labs where WGBS is not an option [Zhang et al., 2013].
The class of methods discussed in this section is widely used in large-scale epigenetic studies thanks to
its low-cost availability. However, even with robust analysis tools these methods still lack the sensitivity offered
by bisulfite-conversion based methods. Because bisulfite-based methods are internally normalized for PCR
coverage skewing, no such estimation is necessary, which establishes them as the gold
standard when studying DNA methylation.
2.1.1.10 Hydroxymethylation
Hydroxymethylation (5hmC) refers to the oxidized form of DNA methylation (5mC). Recently it has been shown
to play an important role in epigenetic regulation in multiple cell types by promoting demethylation via
iterative TET-mediated 5mC oxidation [Tahiliani et al., 2009]. Accurate observation of hydroxymethylation
is crucial for understanding the temporal changes that occur during differentiation of somatic tissues [Ficz
et al., 2011], but traditional bisulfite sequencing does not distinguish 5mC from 5hmC [Huang et al., 2010].
As a result, several technologies for observing 5hmC have been developed, beginning at lower resolution
and culminating with robust single-base techniques.
Early stage methods for profiling hydroxymethylation parallel the restriction enzyme and immunopre-
cipitation based methods used to quantify 5mC discussed in Section 2.1.1.9. Szwagierczak et al. [2010]
used biotin labeling via glycosylation to determine enrichment, and later used a purified recombinant
endonuclease, PvuRts1I, to selectively cleave 5hmC-containing sequences and determine the cleavage sites
in a manner similar to MRE-seq [Szwagierczak et al., 2011]. Another technique uses a procedure similar
to meDIP-seq, where immunoprecipitation is performed using 5hmC-specific antibodies [Ficz et al., 2011,
Williams et al., 2011] or 5-cytosine methylenesulfonate, which is the product of sodium bisulfite reacting
with 5hmC [Pastor et al., 2011].
Tet-Assisted Bisulfite Sequencing (TAB-seq) [Yu et al., 2012] allows interrogation of hydroxymethylation
at single-base resolution. It works by first tagging 5hmC with glucose via β-glucosyltransferase,
which protects the 5hmC from oxidation by Tet1. Introduction of an excess of Tet1 yields iterative oxidation
of 5mC to 5hmC, 5-formylcytosine (5fC), and finally 5-carboxylcytosine (5caC). This conversion followed
by bisulfite sequencing converts all unmethylated cytosines and 5caC bases to uracil, leaving the protected
5hmC identifiable using existing bisulfite sequencing analysis tools.
Another method for single-base resolution 5hmC identification is Oxidation Bisulfite Sequencing, or
OxBS-seq [Booth et al., 2012]. OxBS-seq uses potassium perruthenate (KRuO4) to specifically oxidize
5hmC to 5-formylcytosine (5fC). 5fC is converted to uracil after bisulfite treatment, leaving only 5mC sites
as cytosines in the treated reads. From there, performing bisulfite sequencing on the original sample and
subtracting out the 5mC sites leaves the 5hmC sites.
Both OxBS-seq and TAB-seq are powerful methods that produce an accurate 5hmC map on a single-
base scale, but they are not perfect. OxBS-seq provides a simple, low-cost procedure but requires repeating
the harsh bisulfite treatment to fully deaminate 5fC, leading to DNA damage. TAB-seq measures 5hmC
directly, but relies on highly active TET enzymes for high conversion rates from 5mC to 5caC [Song et al.,
2012].
Analysis of 5hmC genome-wide profiling has borrowed heavily from tools generated for 5mC analysis
[Booth et al., 2012, Yu et al., 2012]. This is acceptable for any one dataset, but combining any of OxBS-seq,
TAB-seq, and normal bisulfite sequencing may lead to mis-estimation of 5hmC levels due to an overshooting
problem. The subtraction of OxBS-seq methylation levels from bisulfite datasets, for example, can lead to
perceived negative methylation levels due to fluctuations in read coverage causing calculation of methylation
state to be higher in the OxBS-seq data for a single site. Qu et al. [2013] overcomes this problem by
introducing a maximum-likelihood framework to accurately estimate the levels of 5hmC and 5mC when
datasets from different protocols are being investigated simultaneously.
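The overshooting problem is easy to reproduce with naive subtraction. The sketch below assumes each dataset reports, for one site, the number of reads retaining a cytosine and the total number of reads; bisulfite sequencing is taken to measure 5mC plus 5hmC and OxBS-seq to measure 5mC alone, as described above. The clipping step is a crude workaround shown only for contrast and is not the maximum-likelihood framework of Qu et al. [2013].

```python
from typing import Tuple

Counts = Tuple[int, int]  # (reads retaining C, total reads) at one site


def level(counts: Counts) -> float:
    """Observed methylation level at a single site."""
    retained, total = counts
    return retained / total


def naive_5hmc(bs: Counts, oxbs: Counts) -> float:
    """5hmC estimated as BS level (5mC + 5hmC) minus OxBS level (5mC only)."""
    return level(bs) - level(oxbs)


def clipped_5hmc(bs: Counts, oxbs: Counts) -> float:
    """Same subtraction with negative estimates forced to zero (assumed workaround)."""
    return max(0.0, naive_5hmc(bs, oxbs))


if __name__ == "__main__":
    bs_counts, oxbs_counts = (7, 10), (16, 20)  # sampling noise alone can cause this
    print(naive_5hmc(bs_counts, oxbs_counts), clipped_5hmc(bs_counts, oxbs_counts))
```

Here the OxBS-seq level exceeds the bisulfite level purely through sampling, so the naive estimate is negative even though a true 5hmC level cannot be.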
2.1.2 Analyzing DNA methylation datasets
Analysis of DNA methylation datasets most often begins after bisulfite-treated reads have been mapped,
PCR duplicates have been removed, and methylation levels at each cytosine genome-wide have been called.
There are several methylation landmarks of interest, each existing in different frequencies, scales, and con-
texts across samples and species and each arising through a distinct underlying biological process. In this
section, I describe these landmarks, their major features, existing tools for their identification and analysis,
and where applicable, point to sections where I have developed improved methods.
[Figure 2.2 image omitted. Panels (hg19): chr15, 10 kb scale, axis ticks 78,630,000 to 78,645,000 (CRABP1); chr15, 5 Mb scale, axis ticks 50,000,000 to 55,000,000; chr20, 20 kb scale, axis ticks 57,410,000 to 57,490,000 (MIR296, MIR298, GNAS-AS1, GNAS, LOC101927932). Track labels: PBMCs, Leukemia, HMRs, PMDs, AMRs.]
Figure 2.2: UCSC Genome Browser plot of HMRs, PMDs, and AMRs in peripheral blood mononuclear
cells (PBMCs) and leukemia.
2.1.2.1 Hypomethylated regions
Methylation levels at single cytosines tend to display a bimodal distribution in mammals, with the site
either displaying near-complete methylation or hypomethylation in a given sample. Generally speaking,
most CpG sites in mammalian methylomes are highly methylated, and hypomethylated CpG sites group
together at regulatory regions to form kilobase-scale hypomethylated regions (HMRs). In contrast, plant
methylomes have low methylation by default, and the regions of interest are hypermethylated. Despite the
inversion between plants and mammals, the methods for detecting these regions are easily generalizable to
either case.
HMRs are highly dynamic and informative of cellular identity. They appear, disappear, widen, and
contract in reproducible ways throughout cellular differentiation [Hodges et al., 2011]. Not all HMRs
co-locate with obvious regulatory regions such as gene promoters or CpG islands, and intergenic HMRs have
been shown to have substantial enhancer activity [Schlesinger et al., 2013]. HMRs are widest in sperm
[Molaro et al., 2011], and analysis of sperm HMRs across species recently revealed evolutionary expansion
of HMRs in the mammalian germline [Qu et al., 2018]. Promoter HMRs have been shown to expand as
a result of transcription and then remain widened, representing an epigenetic memory whereby the gene is
poised for efficient transcriptional upregulation in the future [dos Santos et al., 2015].
The most accurate method in place today for detection of HMRs is to use a Hidden Markov Model (or
HMM), where methylation levels at individual sites are modeled using a Beta-Binomial distribution. The
HMM learns to classify regions as inside or outside an HMR by using stochastic segmentation to calculate
average methylation levels inside and outside of putative HMR regions, taking into account variation in read
coverage [Molaro et al., 2011]. The sophisticated model predicts HMRs very precisely, and for that reason
comparison of HMRs in regulatory regions can be more informative than probing the region with other
experiments, such as DNase hypersensitivity or H3K4me marks [Song et al., 2013].
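The general idea of segmenting a methylome with a two-state HMM and Beta-Binomial emissions can be sketched as follows. This is a toy under strong assumptions, not the method of Molaro et al. [2011]: the emission and transition parameters are fixed, hypothetical values rather than learned from the data, the state path is obtained by Viterbi decoding, and distances between CpG sites are ignored.

```python
import math
from typing import List, Tuple


def log_betabinom(k: int, n: int, a: float, b: float) -> float:
    """Log-probability of k methylated reads out of n under Beta-Binomial(a, b)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_num = math.lgamma(k + a) + math.lgamma(n - k + b) - math.lgamma(n + a + b)
    log_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_choose + log_num - log_den


# Hypothetical fixed parameters; a real tool would estimate these from the data.
EMISSION = {"HMR": (1.0, 8.0), "background": (8.0, 1.0)}  # (alpha, beta) per state
STAY = 0.95  # assumed probability of keeping the same state at the next CpG


def viterbi(counts: List[Tuple[int, int]]) -> List[str]:
    """Most likely state path for a list of (methylated, total) read counts."""
    states = list(EMISSION)
    log_trans = {(s, t): math.log(STAY if s == t else 1.0 - STAY)
                 for s in states for t in states}
    score = {s: math.log(0.5) + log_betabinom(*counts[0], *EMISSION[s])
             for s in states}
    back = []
    for k, n in counts[1:]:
        prev, score, pointers = score, {}, {}
        for t in states:
            best = max(states, key=lambda s: prev[s] + log_trans[(s, t)])
            pointers[t] = best
            score[t] = (prev[best] + log_trans[(best, t)]
                        + log_betabinom(k, n, *EMISSION[t]))
        back.append(pointers)
    path = [max(states, key=score.get)]
    for pointers in reversed(back):  # trace back from the last site to the first
        path.append(pointers[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    sites = [(9, 10)] * 5 + [(0, 10)] * 5 + [(10, 10)] * 5
    print(viterbi(sites))
```

Because the emission model works on read counts rather than observed levels, a site with few reads contributes weaker evidence than a deeply covered one, which is how variation in read coverage enters the segmentation.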
While the model takes variation in read coverage into account, accurate detection of HMRs is directly
impacted by coverage of the samples. Low coverage samples inherently predict noisier and less reliable
HMRs, due to the greater variance in single site methylation level. While there is no well-defined response
to this problem, WGBS with greater than 5x coverage has been shown to identify sufficiently sharp bound-
aries. In situations where no coverage-based issues exist, such as single-cell methylation samples where
the methylation level is necessarily tied to the copy number of the CpG site, methylation levels may be
directly modeled with a multinomial distribution; however, the coverage of single cell methylomes has not
yet reached sufficient levels to justify the use of such a method.
2.1.2.2 Partially methylated domains
DNA methylation is associated with a variety of gene regulatory functions in mammals, working in concert
with histone marks to stably repress transcription. Early studies of DNA methylation in cancer discovered a
globally reduced level of methylation, compared to healthy tissue analogues [Feinberg et al., 1983, Gama-
Sosa et al., 1983]. The development of modern whole-genome bisulfite sequencing (WGBS) allowed for a
high-resolution and full-genome view of DNA methylation. One of the most striking features to emerge from
the first application of this technique in mammals was partially methylated domains (PMDs), which were
observed in a human lung fibroblast cell line but not in embryonic stem cells [Lister et al., 2009]. Subsequent
studies found these broad domains of reduced methylation to be prevalent in cancer methylomes [Berman
et al., 2012, Hansen et al., 2011], and we can now attribute the global hypomethylation observed in early
cancer studies to this phenomenon.
Further WGBS studies have established PMDs as a universal feature in methylomes of cancers and
cultured cells [Hansen et al., 2014, Hon et al., 2012, Lister et al., 2011]. When they exist, PMDs can cover
as much as half the genome, with many contiguous domains larger than 1 Mb [Hansen et al., 2014]. In
addition to cancers and cultured cells, PMDs have been identified in the placenta at multiple developmental
stages and in several species [Decato et al., 2017, Schroeder et al., 2013, 2015].
An increasing number of studies have uncovered several genomic and epigenomic features associated
with PMDs. PMDs generally reside in gene-sparse genomic locations and coincide with lamina associated
domains and late-replicating regions [Berman et al., 2012, Hon et al., 2012]. Despite this general trend, their
locations show some degree of cell type specificity [Schroeder et al., 2011, 2013]. Boundaries of PMDs are
enriched for genomic regulatory features including promoters and insulators [Berman et al., 2012], as well
as CTCF bound sites [Salhab et al., 2018]. CpG island methylation inside PMDs is pronounced in many
cancers [Brinkman et al., 2018] and subtle but significant in placenta [Decato et al., 2017]. PMDs in a breast
cancer cell line largely overlap genomic regions occupied by histone modifications H3K9me3 or H3K27me3
[Hon et al., 2012], suggesting a link between PMDs and repressive chromatin that has been further validated
by association of PMDs with repressive chromatin states identified by chromHMM [Salhab et al., 2018].
PMDs can be reliably produced by immortalizing human B-lymphocytes with Epstein-Barr virus (EBV;
Hansen et al., 2014), and can be erased by inducing pluripotency [Ball et al., 2009, Lister et al., 2011].
There is mounting evidence that PMD formation is related to imperfect maintenance of methylation during
mitotic replication, and that certain sequence contexts are more susceptible to this imperfect maintenance
than others. Gaidatzis et al. [2014] were the first to suggest that different sequence contexts inside PMDs
could explain some of the variation in methylation levels. Recently, Zhou et al. [2018] showed that
widespread hypomethylation of CpGs in the WCGW context occurred in all tissues as a function of mitotic
age, and that these CpGs occurred more frequently in regions likely to contain PMDs. This, coupled with
new knowledge that hemimethylated states can persist for quite some time in nascent DNA, and that failure
to remethylate these nascent strands before the next round of replication leads to permanent loss of
methylation at that CpG site, provides a compelling model for PMD formation [Charlton et al., 2018].
While the model of PMD formation is becoming increasingly convincing, the decision of whether a
methylome contains PMDs or not, and if so, how to identify them, remains a difficult problem. There are
three main methods used to identify PMDs in existing literature. The first is to use sliding windows with
a hard cutoff on methylation level. Sliding windows of varying sizes were used in several of the earliest
PMD-centric methylation papers [Berman et al., 2012, Lister et al., 2009, 2011, Schroeder et al., 2013] and
the decision of whether PMDs existed in the sample was made based on summary statistics of the mean
size and methylation levels in the windows. These decision-making criteria were appropriate for these early
studies because PMDs are so large and easily observable in many methylomes, but decision criteria based on
bimodality of methylation levels in windows can be biased by the window size (Figure 2.3A) or the subtlety
of difference in methylation inside and outside PMDs, as in mouse placenta [Decato et al., 2017, Schroeder
et al., 2015].
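The sliding-window approach can be illustrated with a minimal Python sketch. It assumes per-CpG read counts are available as (position, methylated reads, total reads) tuples; the window size, step, and the 0.7 cutoff are illustrative placeholders and are not taken from any of the cited studies.

```python
def window_levels(sites, window, step):
    """Weighted mean methylation level in sliding windows.

    sites: list of (position, methylated_reads, total_reads), sorted by position.
    Returns a list of (window_start, level) for windows that contain reads.
    """
    if not sites:
        return []
    out = []
    last_pos = sites[-1][0]
    start = 0
    while start <= last_pos:
        end = start + window
        in_window = [(m, t) for pos, m, t in sites if start <= pos < end]
        meth = sum(m for m, _ in in_window)
        total = sum(t for _, t in in_window)
        if total > 0:
            out.append((start, meth / total))
        start += step
    return out


def call_low_windows(levels, cutoff):
    """Return the starts of windows whose level falls below a hard cutoff."""
    return [start for start, level in levels if level < cutoff]


# Toy data: a highly methylated stretch followed by a partially methylated one.
sites = [(i * 100, 9, 10) for i in range(10)]
sites += [(1000 + i * 100, 5, 10) for i in range(10)]
levels = window_levels(sites, window=500, step=500)
low = call_low_windows(levels, cutoff=0.7)
```

Rerunning the same toy data with a different window size changes which windows fall below the cutoff, which is the window-size sensitivity discussed above.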
The other two approaches for identification of PMDs are MethylSeekR [Burger et al., 2013] and MethPipe
[Song et al., 2013]. Both of these approaches use Hidden Markov Models to segment the genome into
PMD and non-PMD sections. MethylSeekR segments a set of α values that range from 0.1 to 3. These
α values correspond to the parameter of the symmetric Beta distribution (α = β) for which the posterior
probability of observing the methylation level in each 101-CpG bin is maximized (see Figure 2.3B for
examples of the extreme values of these distributions). Decisions are made by looking for bimodality in
the resulting α distribution over all 101-CpG bins, or for a heavy tail in the distribution, but no hard
cutoff is suggested. Because a high α value corresponds to the “intermediate” methylation state, the
resulting segments are necessarily regions of the genome that display intermediate methylation levels,
regardless of whether that is the true methylation state of the PMDs. This could under-report the
variability of methylation inside PMDs, exclude HMR regions inside PMDs, or fail to detect subtle or very
deep PMDs entirely. MethPipe segments directly on the methylation levels, modeling the in-PMD and
out-PMD emission distributions with more flexible beta-binomial distributions over non-overlapping 1kb bins.
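To make the role of the symmetric Beta parameter concrete, the following Python sketch picks, for one bin of CpG read counts, the value of α on a grid from 0.1 to 3 that maximizes a beta-binomial likelihood with α = β. This is a simplified maximum-likelihood stand-in written for illustration, not MethylSeekR's actual implementation; the function names, the grid, and the toy bins are all assumptions.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def symmetric_bb_loglik(counts, alpha):
    """Log-likelihood of (methylated, total) counts under a beta-binomial
    with both shape parameters equal to alpha. The binomial coefficient is
    omitted because it does not depend on alpha."""
    ll = 0.0
    for k, n in counts:
        ll += log_beta(k + alpha, n - k + alpha) - log_beta(alpha, alpha)
    return ll


def best_alpha(counts, grid=None):
    """Grid search for the alpha that maximizes the log-likelihood."""
    if grid is None:
        grid = [0.1 * i for i in range(1, 31)]  # 0.1, 0.2, ..., 3.0
    return max(grid, key=lambda a: symmetric_bb_loglik(counts, a))


# Toy bins: one polarized (sites near 0 or 1), one intermediate (sites near 0.5).
polarized_bin = [(0, 10), (10, 10)] * 50
intermediate_bin = [(5, 10), (4, 10), (6, 10)] * 34
polarized_alpha = best_alpha(polarized_bin)
intermediate_alpha = best_alpha(intermediate_bin)
```

In this sketch the polarized bin is assigned a small α and the intermediate bin a large α, mirroring the two extremes shown in Figure 2.3B.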
The last complication of working with methylomes that contain PMDs is identification of HMRs. Both
MethylSeekR and MethPipe have separate HMM-based methods for identification of HMRs. MethylSeekR
only applies HMR segmentation to regions outside of PMDs, leading to an incomplete picture of the regulatory
landscape. MethPipe applies HMR segmentation agnostic of PMD state, frequently leading to incorrect
learning of HMM parameters and exclusion of many true HMRs in favor of larger, quasi-PMD regions.
In Chapter 4, I outline a set of improved methods for PMD detection, refinement of decision criteria
necessary to decide whether or not a sample contains PMDs, and a method for identifying HMRs in PMD-
containing samples.
[Figure 2.3: two density plots, each with x-axis %mCpG and y-axis Density. Panel A shows bin sizes of 1kb, 50kb, and 1mb; panel B shows α=β=0.1, α=β=1, and α=β=3.]
Figure 2.3: (A) Distribution of global methylation levels in IMR90 when observed using different bin sizes.
(B) Examples of symmetric Beta distribution used by MethylSeekR to segment PMDs.
2.1.2.3 Differential methylation
Most analyses seek to understand context-specific features of DNA methylation, for example alterations in
methylation profiles between developmental stages, disease states [Robertson, 2005] or exposure to some
stimulus [Christensen et al., 2009]. Frequently the changes of interest are highly localized and organized
into “differentially methylated regions” (DMRs). Since hypomethylated regions frequently correspond to
regulatory regions in the genome, DMRs are often presumed to mark regulated changes in either accessibility
or activity of transcription factors and other DNA binding proteins [Doi et al., 2009].
The operational definition for DMRs in a given project must be tailored to the hypothesis being tested,
and could refer to either short or long genomic intervals, and also to dramatic or subtle changes. In the
context of disease, especially cancer, these changes could represent a localized state being disrupted rather
than a regulated change. Regions range from single point differences in critical regulatory sequences to
domains megabases large with widespread differences across samples, and this stretch can lead to a lack of
focus that makes it difficult for researchers to accurately identify regions of differential methylation.
The simplest way to view this problem is at single-CpG resolution, where the “regions” in question are
individual nucleotide positions, and one asks whether the methylation level of a given CpG site differs between
two methylomes. This is usually done by testing a null hypothesis stating that the methylation level of the site
is the same in both methylomes. If binomial confidence intervals have already been computed for each site
in both methylomes, then checking for overlapping confidence intervals is straightforward [Hodges et al.,
2009]. Another approach directly uses all four counts from reads: methylated and unmethylated counts in
both methylomes. These can be organized as a contingency table, and an exact test for homogeneity can
be conducted [Hodges et al., 2011]. Fisher’s exact test has been used in this context, but a better justified
test is the Bayesian formulation by Altham [1969], which can be viewed as a directional version of Fisher’s
exact test. A more recent method for identification of individual differentially methylated CpGs produced
by the Smith Lab is RADMeth, based on beta-binomial regression for modeling WGBS data [Dolzhenko
and Smith, 2014], and is useful in contexts where multiple replicates with various experimental factors
are used. Individual CpG differences are often only useful when specific CpG sites are already known
to have some significance, for example residing within a transcription factor binding site. Differences at
individual nucleotide sites can also be used in higher level analysis, for example to determine the density of
differentially methylated sites in some genomic interval.
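As an illustration of the contingency-table approach, the sketch below applies a two-sided Fisher's exact test to the four read counts at a single CpG site. It is a plain standard-library implementation written for this example (it is not the Altham formulation and not RADMeth), and the function and argument names are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(meth_a, unmeth_a, meth_b, unmeth_b):
    """Two-sided Fisher's exact test for a 2x2 table of read counts.

    Rows are the two methylomes; columns are methylated and unmethylated reads.
    Returns the p-value under the null that both methylomes share one level.
    """
    row_a = meth_a + unmeth_a
    row_b = meth_b + unmeth_b
    col_meth = meth_a + meth_b
    denom = comb(row_a + row_b, col_meth)

    def prob(x):
        # Hypergeometric probability of x methylated reads in methylome A.
        return comb(row_a, x) * comb(row_b, col_meth - x) / denom

    lowest = max(0, col_meth - row_b)
    highest = min(row_a, col_meth)
    p_obs = prob(meth_a)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_obs * (1 + 1e-9)
    )


p_different = fisher_exact_two_sided(10, 0, 0, 10)  # fully vs. un-methylated
p_same = fisher_exact_two_sided(5, 5, 5, 5)         # identical levels
```

With ten reads per methylome, the first call yields a very small p-value and the second yields a p-value of 1, which shows how strongly the power of a single-site test depends on coverage.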
Methods that identify genomic intervals of non-trivial size as having differential methylation take a va-
riety of approaches. One approach first computes probabilities that each individual CpG site differs between
two methylomes, and then attempts to identify intervals having an enrichment for such differences [Hodges
et al., 2011]. The BSmooth method by Hansen et al. [2012] was the first to account for the problem of
biological or technical variation between replicated methylomes. The method first smooths the methylation
levels using the approach of local likelihood. Then a signal-to-noise statistic, similar to a t-statistic, is
computed as the sum of the differences between groups divided by the standard error of the differences.
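A t-like signal-to-noise statistic of this general kind can be sketched in Python for a single CpG site, given (already smoothed) methylation levels from replicates in two groups. The pooled-variance form used here is an assumption made for illustration and should not be read as BSmooth's exact estimator.

```python
import math
from statistics import mean, variance


def signal_to_noise(group1, group2):
    """Difference in group means divided by its standard error.

    group1, group2: methylation levels at one CpG, one value per replicate.
    Uses a pooled variance estimate; each group needs at least two replicates,
    and the levels must not all be identical (the standard error would be 0).
    """
    n1, n2 = len(group1), len(group2)
    pooled_var = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / std_err


stat = signal_to_noise([0.80, 0.85, 0.90], [0.40, 0.45, 0.50])
```

Sites with large absolute values of the statistic would then be candidates for grouping into DMRs.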
The BiSeq method developed by Hebestreit et al. [2013] also identifies DMRs between groups of methy-
lomes. BiSeq is aimed at targeted bisulfite sequencing, including RRBS, but this is presumably because the
method is not yet efficient enough to be applied over entire mammalian genomes. The method begins by
identifying clusters of CpG sites that could potentially form a DMR – such clusters are naturally defined in
targeted bisulfite sequencing and RRBS. Similar to BSmooth, the methylation levels are smoothed within
each cluster. The smoothed levels are modeled with a beta distribution, a beta regression model is fit to
the smoothed methylation levels at each CpG site, and a Wald test is used to identify differences between
groups. Although BiSeq begins with defined regions (the clusters), a trimming stage is included to optimize
boundaries of the DMRs.
As with single-methylome analysis, smoothing methylation values is problematic: important methyla-
tion differences are often simply shifting boundaries of hypomethylated regions [dos Santos et al., 2015,
Hodges et al., 2011], and such boundaries are blurred when smoothing is applied. In the case of DMRs,
consider that for a small DMR, the methylome “context” is presumed not to change between two condi-
tions, only the levels within the DMR, and so borrowing information from outside the DMR by smoothing
obscures such differences.
2.1.2.4 Allele-specific methylation
Allele-specific methylation (ASM) refers to the methyl mark appearing on only one allele, typically caused
by genotypic differences between the alleles, such as SNPs or indels altering how proteins interact with
the allele and influence methylation. ASM can also be associated with the allele’s parent-of-origin, and is
the mechanism of genomic imprinting [Brandeis et al., 1993, Shoemaker et al., 2010]. In studies aiming
to associate altered DNA methylation with disease, identification of ASM must be considered as aberrant
patterns of ASM have been linked to disease states [Feinberg, 2007].
The most common method of identifying ASM in high-throughput sequencing experiments involves the
use of single nucleotide polymorphisms (SNPs) [Kerkel et al., 2008]. For each read, SNP information can
identify which allele the read was sequenced from and the methylation status of the read can be compared
against the other allele, with significant differences marked as ASM. This approach allows researchers to
probe the effects conformational changes resulting from SNPs have on regional affinity for DNA methylation
[Shoemaker et al., 2010]. However, this approach limits the number of investigable regions to those with
SNPs, and is further limited by the maximum reliable length of the reads.
Fang et al. [2012] developed a technique that does not rely on SNP information, enabling the identifica-
tion of ASM genome-wide. Rather than using SNPs directly, probing the methylation state in a probabilistic
manner can give a sense of how often clusters of methylation occur together on reads, and identify regions
where two distinct sets (alleles) exist. If reads with more than one CpG frequently occur in an “all methy-
lated” or “all unmethylated” state, sufficient coverage can provide one with reasonable evidence for the reads
originating from distinct molecules. By examining reads overlapping two or more CpGs and identifying re-
gions where this pattern exists in roughly equal proportions, similarly methylated reads can be assembled
into longer allelicly-methylated regions (AMRs).
2.1.2.5 Analysis of single cell methylomes
The introduction of high-throughput, low-coverage single cell methylomes has spurred the development of
a number of methods for their validation and exploration of single cell methylation dynamics. In papers
validating the new assays, a lot of emphasis was put on proving that the single cell methylomes were consis-
tent with each other across cell types and could recapitulate the composite methylome when merged [Farlik
et al., 2015, Smallwood et al., 2014]. They did this by visualizing binned methylation level across samples
to account for low coverage and computing Pearson correlations across each single cell to show that the
correlation was high within and low between cell types. Smallwood et al. [2014] also analyzed the variance
in methylation states of CpGs in different genomic contexts, and recapitulated cell types using hierarchical
clustering on the top 300 most variable sites among single cell ESC samples. All statistical analyses focused
on computing methylation levels within bins/cells.
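The binning and correlation steps described above can be sketched as follows. The input format (a list of (position, state) calls per cell, with state 0 or 1), the bin size, and the function names are assumptions for this example and do not reproduce the code used in the cited studies.

```python
import math


def binned_levels(calls, bin_size, n_bins):
    """Mean methylation per fixed-size bin for one cell.

    calls: list of (position, state), with state 1 (methylated) or 0.
    Bins with no covered CpGs are reported as None.
    """
    sums = [0] * n_bins
    counts = [0] * n_bins
    for pos, state in calls:
        b = pos // bin_size
        if b < n_bins:
            sums[b] += state
            counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


def pearson(x, y):
    """Pearson correlation over bins covered in both cells."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_x * var_y)


cell_a = [(10, 1), (20, 1), (110, 0), (120, 0), (210, 1), (220, 0)]
cell_b = [(15, 1), (115, 0), (130, 1), (215, 1)]
levels_a = binned_levels(cell_a, bin_size=100, n_bins=3)
levels_b = binned_levels(cell_b, bin_size=100, n_bins=3)
r = pearson(levels_a, levels_b)
```

Restricting the correlation to bins covered in both cells is one simple way to cope with the low coverage of individual single cell methylomes.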
The observed methylation levels in single cells are sparse and discrete. In diploid cells, the only pos-
sible methylation levels for a given single site are 0, 1, or 0.5, corresponding to hypo/hypermethylation of
both alleles or discordant allelic methylation states, respectively. Farlik et al. [2015] handled these sparse
measurements via binning methylation levels and developed a method to measure how much a single cell
differs from a set of control samples in terms of average DNA methylation across all regions of a given type
(promoter, chromatin binding peaks, etc). By analyzing the metaprofile of regional methylation variation,
they gained the statistical power necessary to detect cell-type specific changes in methylation state. This
model was applied to scenarios where directed methylation changes were happening, and they fit a linear
model to correct for differences between high and low CpG-density methylation dynamics.
2.1.2.6 Issues associated with heterogeneous samples
Samples in DNA methylation studies consist of populations of cells, and are not always homogeneous. They
are often made up of distinct cell lineages and subpopulations, which can be a result of contamination dur-
ing isolation, cells arising from different subtypes of the same tissue, or a mixture of normal and cancerous
cells. Even in purified samples, differing methylation levels at single sites across a population can be in-
dicative of a highly dynamic environment maintaining a methylation equilibrium. This randomness can be
mistaken for population structure in homogeneous samples or misinform the underlying population structure
in heterogeneous samples [Mikeska et al., 2010].
Early mechanisms for quantifying the level of diversity in methylation levels of a region involved
straightforward calculation of the average number of differences at each CpG site in all pairwise com-
parisons between reads that cover that site [Yatabe et al., 2001]. Later approaches [He et al., 2013, Xie et al.,
2011] introduced the concept of DNA methylation entropy, and are based on three parameters: the number
of CpG sites, the total number of mapped reads, and the occurrence of each methylation pattern. Entropy
represents the level of “uncertainty” or stochasticity in the data, and is 0 when every read overlapping a site
reports the same methylation state, and maximized when reads at a site report different methylation states
with equal frequency. High levels of methylation entropy have been reported in many cancer studies
[Korshunova et al., 2008, Shibata, 2012, Taylor et al., 2007] when compared to their normal tissue counterparts.
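The entropy idea can be illustrated with a short Python sketch that computes the Shannon entropy of read-level methylation patterns over a fixed set of CpG sites. Dividing by the number of CpG sites, so that the value lies between 0 and 1, is an assumption made here for illustration; the exact normalization in the cited methods may differ.

```python
import math
from collections import Counter


def methylation_entropy(patterns):
    """Shannon entropy (in bits) of methylation patterns, per CpG site.

    patterns: list of equal-length strings, one per read, where '1' marks a
    methylated CpG and '0' an unmethylated CpG (for example "1101").
    """
    n_reads = len(patterns)
    n_cpgs = len(patterns[0])
    counts = Counter(patterns)
    entropy = -sum(
        (c / n_reads) * math.log2(c / n_reads) for c in counts.values()
    )
    return entropy / n_cpgs


uniform_reads = ["11", "11", "11", "11"]   # every read agrees
mixed_reads = ["00", "01", "10", "11"]     # all patterns equally frequent
low_entropy = methylation_entropy(uniform_reads)
high_entropy = methylation_entropy(mixed_reads)
```

The first example gives an entropy of 0 and the second gives the maximum of 1, matching the two extremes described above.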
A high level of methylation entropy may suggest a large amount of de novo methylation and demethylation
activity balanced in equilibrium, but does not adequately inform population structure in heterogeneous
samples. Identifying population structure in heterogeneous methylation data is extremely difficult, and
methods have been produced that attack the problem from many angles. These methods can be split into
three main categories. Supervised methods assume that both the number of subpopulations and their
methylation states are known, and attempt to learn their proportions in a mixture. Semi-supervised methods
assume the number of subpopulations is known, but know nothing about their methylation states or proportions.
Unsupervised methods make no assumptions about the number of subpopulations and attempt to learn this
as well as their proportions and methylation states.
Supervised methods provide the most tractability for characterizing a sample, but rely on accurate in-
formation about the number and methylation states of the subpopulations in the mixture. Houseman et al.
[2012] devised a method that uses case/control state and replicated, purified references to implement a sur-
rogacy relation and utilized linking regression to learn proportion differences between the case and control
in Illumina 450k methylation array data. While their approach is easily generalizable to WGBS data and
single case/references for mixture deconvolution, it relies on relatively few CpG sites due to the necessity of
their parameter estimation methods (ordinary least squares) to invert matrices, which is extremely costly, as
well as the non-single-site resolution of array-based methylation technology.
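The supervised setting can be illustrated in its simplest form: two reference methylomes with known levels and an unknown mixing proportion. The Python sketch below uses the closed-form least-squares solution for that two-population case; it is a deliberately reduced example and not the method of Houseman et al. [2012], and all names are hypothetical.

```python
def estimate_proportion(mixture, ref_a, ref_b):
    """Least-squares estimate of the fraction p of population A, assuming
    mixture[i] is approximately p * ref_a[i] + (1 - p) * ref_b[i].

    All three arguments are lists of methylation levels at the same CpG sites.
    The estimate is clipped to the interval [0, 1].
    """
    num = sum((m - b) * (a - b) for m, a, b in zip(mixture, ref_a, ref_b))
    den = sum((a - b) ** 2 for a, b in zip(ref_a, ref_b))
    if den == 0:
        raise ValueError("references are identical; proportion not identifiable")
    return min(1.0, max(0.0, num / den))


ref_a = [0.9, 0.1, 0.8, 0.2]
ref_b = [0.1, 0.9, 0.3, 0.6]
true_p = 0.3
mixture = [true_p * a + (1 - true_p) * b for a, b in zip(ref_a, ref_b)]
p_hat = estimate_proportion(mixture, ref_a, ref_b)
```

Only sites where the references differ contribute to the estimate, which is why informative CpG sites matter more than the total number of sites.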
Semi-supervised methods include previous work on identifying allele-specific methylation, where two
subpopulations (one for each allele) were expected in a sample, and a probabilistic framework was used
to investigate the probability that a particular genomic region’s methylation states could segregate into two
equally sized bins [Fang et al., 2012] (see Section 2.1.2.4). Another semi-supervised method is MethylPurify
[Zheng et al., 2014], which focuses solely on the problem of assessing purity of cancer samples. It assumes
two subpopulations (healthy and cancer) and learns their relative proportions by utilizing proportions of
reads in differentially methylated regions. This can be an issue because while cancer has large regions of
lower than average methylation, their methylation states do not go down to zero in all cancers, which would
lead to an overestimation of healthy contamination in many cancers.
Early unsupervised methods relied on the concept of epipolymorphism, a framework that sought to
identify epialleles, or specific methylation signatures that occur in some proportion of the sample and likely
belong to a distinct subpopulation [Landan et al., 2012]. Epipolymorphism suffers from not being able to
link distant methylation states of the same population together, and instead uses read-length resolution to
identify groups of distinct methylation levels in short intervals. Later work used the concept of an overlap
graph borrowed from research in viral population reconstruction and a network flow algorithm to attempt
to determine number, state, and proportion of subpopulations [Dorri et al., 2016]. This approach focused
on only a few hundred CpG sites, and collapsed methylation states across regions to make the problem
more tractable with short reads. Other recent unsupervised work approached the problem in a linear model
context, by using two-stage regression to simultaneously learn subpopulation methylation states and mixture
proportions [Houseman et al., 2014] or by non-negative matrix factorization of the unknown subpopulation
states and mixture methylation observations [Houseman et al., 2016].
Even with the success of these statistical frameworks, the only sure-fire way to accurately characterize
the heterogeneity in a sample is to interrogate the methylation state of a single cell. Although it is being
actively researched, the ability to probe DNA methylation genome-wide at high coverage in a single cell is
not reliable enough to use in place of current techniques (see Section 2.1.2.5 for details).
2.1.2.7 Analysis of DNA methylation across species
DNA methylation is a well-conserved epigenetic modification, its presence ranging across kingdoms from
animals to plants and fungi. Extremely well-conserved functions such as heavy methylation of transposable
elements to prevent proliferation suggest that the molecular mechanisms behind DNA methylation dynamics
are derived from a common ancestor. Zemach et al. [2010] performed WGBS on 17 widely differing species
to try and understand the mechanisms associated with this “ancestral methylome.” They provide evidence
that the last common ancestor of plants, fungi, and animals likely methylated its gene bodies, and that the
methylation was probably correlated with gene expression. They also add to the body of evidence that TE
methylation is an ancient mechanism, despite not being present in some modern day methylomes. Reference
studies like these provide researchers with useful starting points for analyses depending on their kingdom of
interest.
More recently, Qu et al. [2018] focused on exploring the changes in germ line methylation through mammalian
speciation. This focus on germ line methylomes was motivated by the fact that methylation patterns
are highly dynamic during cellular differentiation, and that the only methylation patterns that can be reliably
transmitted to the next generation occur in the germ cells. Therefore, the best place to explore evolution of
the epigenome is the germ cells.
There are several challenges researchers face when trying to perform cross-species comparisons of DNA
methylation, the first being context. Because global methylation patterns vary widely across species, tradi-
tional differentially methylated region detection does not always work. A good example of this context shift
occurs when comparing mammals to plants. In plants, most of the genome is unmethylated, and non-CpG
methylation is far more abundant. This is a marked difference from mammalian genomes, where the ground
state of the genome is methylated, and non-CpG methylation makes up the minority of methylation (see
Section 2.1.2.9). Transcriptional repression via non-CpG sites may not be picked up if only CpG sites are
considered during DMR detection.
Another challenge is mutation. As species get further apart, single site comparison becomes trickier
as sites mutate, vanish, or appear. One solution to this is to use LiftOver, which is a useful tool originally
designed to lift data from one reference genome to the next. This tool has been retrofitted for cross-species
batch coordinate conversion through “chain files” that map nucleotides in one species to their orthologous
region in the target genome. This can be viewed as a prerequisite step to identifying differentially methylated
regions between species and can be useful for identifying “soft” mutations between closely related species
[Hinrichs et al., 2006].
In some cases, lifting over is not an option because there are not separate reference genomes, as in the
case of mouse strains all mapped to the mm10 reference. In Chapter 3, I highlight some of this problem and
describe my attempts to improve the methylomes of three mouse strains after my publication.
2.1.2.8 Characterization and visualization of DNA methylation
While features such as DMRs can act as excellent general quantifiers of biological differences between samples,
proper visualization and functional annotation of those DMRs can provide meaningful intuition that can
go further than observation, with the potential to identify markers for disease states and untangle subtle
phenotypic differences not directly explainable by genotype.
There are several useful tools available for visualization, the largest of which are the UCSC genome
browser [Kent et al., 2002], JBrowse [Skinner et al., 2009], and the Integrative Genomics Viewer (IGV)
[Robinson et al., 2011]. These browsers are interactive webpages that allow researchers to upload their own
next generation sequencing data including methylation levels and coverage information and compare them
side by side, at any resolution, with public and private samples.
As the number of public whole genome bisulfite sequencing experiments grows, however, the difficulty
of analyzing and uploading public tracks to these services alongside private experiments becomes non-
trivial. Researchers have approached this problem in different ways: Hackenberg et al. [2011] developed
NGSmethDB, which stores cytosine methylation information and allows users to search for samples with
specific methylation signatures. In contrast, Song et al. [2013] designed MethBase, a database that includes
methylation state but also high-level, precomputed features such as hypomethylated regions, allelicly methy-
lated regions, and bisulfite conversion rate. These types of database facilitate the use of samples of the same
cell type from different experiments to give a sense of the ground state or “reference” methylome for that
cell type. They also allow researchers to quickly and easily choose public methylomes relevant to their topic
of interest and visualize them side by side with their private work.
Databases like those mentioned above facilitate quick, powerful, and biologically meaningful analyses.
Researchers are able to choose, for example, distinct sets of case and control, compute differentially methy-
lated regions between the two sets, and then compare them in the browser. Filtering the DMRs by size and
absolute methylation difference and then selecting those with proximity to biologically meaningful regions
like transcription start sites or retrotransposons provides a small but reliable set of interesting regions. They
are almost certainly functional in some way, and can drive subsequent experimentation.
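The filtering steps just described can be sketched in Python. The DMR record layout (dictionaries with chrom, start, end, and diff keys, using half-open intervals), the thresholds, and the function name are assumptions made for this example.

```python
def filter_dmrs(dmrs, tss_positions, min_size, min_diff, max_tss_dist):
    """Keep DMRs that are large enough, different enough, and near a TSS.

    dmrs: list of dicts with keys 'chrom', 'start', 'end' (half-open interval)
          and 'diff' (methylation difference between the two sets).
    tss_positions: dict mapping chromosome name to a list of TSS coordinates.
    """
    kept = []
    for dmr in dmrs:
        if dmr["end"] - dmr["start"] < min_size:
            continue
        if abs(dmr["diff"]) < min_diff:
            continue
        # Distance is 0 when the TSS falls inside the DMR.
        near_tss = any(
            max(dmr["start"] - tss, tss - (dmr["end"] - 1), 0) <= max_tss_dist
            for tss in tss_positions.get(dmr["chrom"], [])
        )
        if near_tss:
            kept.append(dmr)
    return kept


dmrs = [
    {"chrom": "chr1", "start": 1000, "end": 1600, "diff": 0.5},
    {"chrom": "chr1", "start": 5000, "end": 5050, "diff": 0.6},    # too small
    {"chrom": "chr1", "start": 20000, "end": 21000, "diff": 0.1},  # weak change
    {"chrom": "chr2", "start": 100, "end": 900, "diff": -0.4},     # no TSS nearby
]
tss = {"chr1": [1200, 20500], "chr2": [50000]}
selected = filter_dmrs(dmrs, tss, min_size=200, min_diff=0.3, max_tss_dist=1000)
```

Only the first DMR survives all three filters in this toy example.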
If wet lab experimentation is not an option, functional analysis using terms derived from the literature
is another way of identifying interesting characteristics of methylation features. Computing distributions
of methylation level in classically functional regions, such as transposons, gene bodies, and promoters is a
good start to identifying potential methylation-driven function. At the forefront of functional annotation are
DAVID and GREAT [Huang et al., 2009, McLean et al., 2010], two programs designed to identify functional
motifs in a set of genomic regions. Given a set of, for example, differentially methylated promoter regions,
these programs aim to discern the primary function of the genes within a certain distance of those regions.
In addition, GREAT compiles some statistics related to how relevant the regions are, and how near to genes
they are.
2.1.2.9 Analysis of non-CpG DNA methylation in vertebrate genomes
While plants have a well-documented functional role for methylation of cytosines outside the context of
CpG dinucleotides, there has been a long-standing presumption that non-CpG methylation in genomes of
vertebrates has no functional importance. Plants and vertebrates, however, share much of their enzymatic
machinery related to DNA methylation, and appreciable amounts of non-CpG methylation in vertebrate
genomes were reported more than a decade ago [Ramsahoye et al., 2000]. The landmark article by
Lister et al. [2009] sparked much greater interest in non-CpG methylation. This article reported that only
0.02% of methylated cytosines in IMR90 cells were in a non-CpG context, while almost 25% of methylated
cytosines in H1 ES cells were found at non-CpG cytosines. This enrichment of non-CpG methylation is not
exclusively a characteristic feature of embryonic stem cells, however. Subsequently, non-CpG methylation has
been found to be prevalent in neuronal development [Lister et al., 2013] and oocytes [Shirane et al., 2013].
Because of these findings, current projects typically verify non-CpG methylation levels when bisulfite-based
experiments are conducted (see Section 2.1.1.1).
There are three technical issues to consider when trying to analyze a vertebrate methylome for non-CpG
methylation, all associated with the prior expectation that even interesting levels of non-CpG methylation
will be very low.
The most obvious considerations are bisulfite conversion rate and sequencing error rates. Any uncon-
verted cytosine will erroneously appear to reflect methylation. When analyzing non-CpG methylation it is
essential that bisulfite conversion rate be known very precisely based on reads from mitochondrial DNA
or (preferably) a spike-in control. Additionally, one should use these controls to estimate combined PCR
and sequencing error rate. Then for a global estimate of non-CpG methylation levels, one could use the
“weighted mean” methylation level just as described above for CpG methylation, and subtract the expected
fraction due to non-conversion or error.
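The correction described above amounts to a subtraction of two weighted means, as in the following Python sketch. It assumes read counts are given as (methylated, total) pairs and that the control (for example a spike-in) is fully unmethylated, so that any apparent methylation in it reflects non-conversion or error; the function names are hypothetical.

```python
def weighted_mean_level(sites):
    """Weighted mean methylation: total methylated reads over total reads."""
    meth = sum(m for m, _ in sites)
    total = sum(t for _, t in sites)
    return meth / total


def corrected_non_cpg_level(non_cpg_sites, control_sites):
    """Global non-CpG level minus the background seen in an unmethylated control.

    Both arguments are lists of (methylated_reads, total_reads) per cytosine.
    The result is floored at 0.
    """
    observed = weighted_mean_level(non_cpg_sites)
    background = weighted_mean_level(control_sites)
    return max(0.0, observed - background)


non_cpg = [(3, 100), (1, 100)]   # observed level 0.02
control = [(1, 200)]             # background level 0.005
corrected = corrected_non_cpg_level(non_cpg, control)
```

Because the correction is applied to aggregate counts, it gives a global estimate only and says nothing reliable about any individual site.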
Another essential consideration when examining non-CpG methylation is genetics. Ideally, as in any
form of resequencing experiment, one maps reads to a reference genome with the genotype of the cells used
in the experiment. To date, WGBS projects have not taken this route due to the current cost and effort
required for resequencing, although this may soon become cheaper and easier. In addition, many reference
genomes used for such studies do not yet include all major alleles at SNP sites. The scenario to avoid is having a
CpG site in the DNA used in the experiment correspond to a site in the reference genome that was a CpG
in the ancestral sequence but has recently mutated to, for example, a CpA by deamination of the cytosine
on the opposite strand. This situation cannot be identified by looking only at the states in reads mapping
over cytosines in the reference genome, so one must consider nearest neighbor nucleotides in reads when
counting methylated and unmethylated states mapping over non-CpG cytosines.
Finally, analysis aimed at understanding any functional implication of non-CpG methylation will likely be at
lower-resolution than analysis of CpG methylation. Overall average levels can be corrected for the influence
of genetics, sequencing errors or bisulfite conversion rates, but only in aggregate, and at individual sites or
small collections of sites the levels of non-CpG methylation are easily misestimated by a large margin.
2.2 Genome organization and accessibility
In this section, I briefly outline eukaryotic genome organization from a structural and packaging perspective,
describe methods for the identification and analysis of “accessible” genomic regions, and attempt to frame
the study of DNA methylation in the context of this larger picture.
2.2.1 Genome organization
Mammalian genomes display remarkably consistent structural organization, and differences in that structural
organization often reflect differences in accessibility of specific genes or regulatory regions to proteins that
influence transcriptional activity in a cell-type specific manner. DNA is organized in a helical structure
[Watson et al., 1953], and the majority of this helical DNA is wrapped around protein complexes known
as nucleosomes, the components of which were first identified decades earlier [Kossel et al., 1928]. In
the 1980s, a pair of papers showed that nucleosome occupancy at gene promoters was associated with
transcriptional silencing in vitro [Lorch et al., 1987] and in vivo in yeast [Kayne et al., 1988]. Nucleosomes
are composed of four core histone subunits (H2A, H2B, H3, and H4), and epigenetic modifications to the lysine
residues in the tails of these histone subunits are now well studied and associated with a large variety of repressive or
activating signals [Lawrence et al., 2016]. Nucleosomes are spaced relatively evenly throughout the majority
of the genome, wrapping approximately 147bp of DNA around each histone core in about 1.7 turns
and separated by “linker” DNA that varies in size between 20-50bp. The incompleteness of the second wrap
results in a highly compactable zig-zag 3D structure [Bednar et al., 1998].
Besides this “beads-on-a-string” level of DNA compaction into nucleosome-linker-nucleosome patterns
[Olins and Olins, 1974], mammalian genomes display higher order compartments of differential accessibil-
ity. Early research identified two distinct chromatin types via staining: euchromatin, which was less heavily
stained, and heterochromatin, which was heavily stained, indicating denser compaction of nucleosomes
[Passarge, 1979]. Since then, the number, formation, and function of specific types of heterochromatin have
been reclassified a number of times.
The separation of heterochromatin into facultative and constitutive heterochromatin is one such classification
that has led to substantial advances in our understanding of accessibility’s role in transcription. In
this partition, constitutive heterochromatin is highly repetitive and localizes primarily but not exclusively at
pericentromeric regions. Constitutive heterochromatin has been shown to downregulate
genes that are inserted nearby, a phenomenon termed “position-effect variegation” that was first observed in
Drosophila melanogaster [Muller, 1930]. In contrast, facultative heterochromatin is less repetitive, more
variable over the course of differentiation, and lacks a precise molecular signature, which motivated
later, more refined characterizations [Trojer and Reinberg, 2007].
Since then, definitions based on combinations of histone modifications and other epigenetic marks
have more satisfyingly probed the distinct molecular signatures of heterochromatin subtypes. Sullivan
and Karpen [2004] observed that the histone modification patterning of centromeric heterochromatin was
distinct from that of non-centromeric heterochromatin, including non-centromeric constitutive heterochromatin.
Most recently, genome segmentation tools such as Segway [Hoffman et al., 2012] and ChromHMM [Ernst
and Kellis, 2012] have separated regions of the genome into a number of interpretable states based on their
epigenetic signals, including repressive and activating histone marks, CTCF binding, and DNA methylation.
Tools like these are improving our understanding of the specific combinations of epigenetic modifications
that give rise to distinct biological states, but they still require the number of distinct states to be chosen
in advance. Future improved model selection techniques may be able to overcome this issue.
It is important to note that even the early separation of euchromatin and heterochromatin and the imprecise
definitions of heterochromatin subtypes remain useful for specific biological questions of
interest, and therefore endure as valuable tools for biological discovery.
2.2.2 Methods for assessing chromatin accessibility
The above breakthroughs in our understanding of how DNA is packaged into cells would not have been
possible without parallel advances in laboratory techniques for assaying DNA accessibility. The earliest
widespread methods for obtaining accessibility information relied on enzymes with endonuclease activity to cleave
regions that were not bound by nucleosomes. These methods included MAINE-seq, which uses micrococcal
nuclease (MNase) [Cusick et al., 1983], and DNase-seq, which uses DNase I [Boyle et al., 2008].
FAIRE-seq [Giresi et al., 2007] took advantage of the fact that formaldehyde cross-linking was more ef-
fective in regions of higher nucleosome density, though the increase in background cross-linking made
identification of accessible regions more difficult.
The most recent method for assaying accessibility is ATAC-seq [Buenrostro et al., 2013]. ATAC-seq
relies on the Tn5 transposase, which can transpose into regions of accessible chromatin in a (mostly) unbiased
way. By using Tn5 together with sequencing adapters, ATAC-seq avoids the destructive digestion used by
endonuclease-based methods, and the end of each sequenced read corresponds precisely to a site that Tn5 was
able to access.
The analysis of accessibility across all of the above methods hinges on the identification of peaks, or
regions where the pileup of read coverage differs substantially from the background. Methods for such peak
calling are numerous, and though many were originally designed for the identification of peaks in ChIP-
seq data, methods like MACS [Zhang et al., 2008] and ZINBA [Rashid et al., 2011] generalize well to the
analysis of accessibility data.
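The idea of a peak as a region whose read pileup differs substantially from the background can be illustrated with a minimal sketch. It is not the algorithm used by MACS or ZINBA; it assumes, purely for illustration, fixed-size windows, a single genome-wide Poisson background rate, and a hypothetical p-value cutoff.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed by direct summation."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)          # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i            # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def call_peaks(window_counts, p_cutoff=1e-5):
    """Return half-open (start, end) window index ranges enriched over background.

    Assumption: the background rate is the mean count over all windows.
    Adjacent significant windows are merged into one peak.
    """
    lam = sum(window_counts) / len(window_counts)
    peaks, start = [], None
    for i, count in enumerate(window_counts):
        significant = poisson_sf(count, lam) < p_cutoff
        if significant and start is None:
            start = i
        elif not significant and start is not None:
            peaks.append((start, i))
            start = None
    if start is not None:
        peaks.append((start, len(window_counts)))
    return peaks

# Hypothetical read counts in ten consecutive windows.
counts = [2, 3, 1, 2, 40, 45, 3, 2, 1, 2]
peaks = call_peaks(counts)
```

Real peak callers additionally model local background, fragment size, and multiple testing, all of which this sketch ignores.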
Kelly et al. [2012] introduced NOMe-seq, which simultaneously assays DNA methylation and
nucleosome depletion by introducing a GpC methyltransferase (M.CviPI) that methylates any unprotected
GpC dinucleotide. After bisulfite conversion and some special considerations for mapping, independent
readings of methylation state and nucleosome occupancy can be obtained from the same read. This
method also has the major advantage of avoiding peak-calling methods in favor of an “accessibility level”
at each GpC, similar to the methylation level described in Section 2.1.1.7. However, it can suffer from lack
of resolution if GpCs are sparse in regions of interest. As reviewed in Section 2.1.1.5, other methods for
simultaneous acquisition of accessibility and methylation are being developed for single cells.
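A per-site “accessibility level” of this kind can be sketched as the fraction of reads reporting a methylated cytosine at each GpC, in direct analogy to a CpG methylation level. The function names and read counts below are hypothetical, and the sketch assumes that sites where GpC and CpG contexts overlap have already been filtered out during mapping.

```python
def site_level(methylated, total):
    """Fraction of reads reporting a methylated cytosine; None if the site is uncovered."""
    return methylated / total if total > 0 else None

def weighted_mean_level(sites):
    """Read-weighted mean level over (methylated, total) pairs in a region."""
    methylated = sum(m for m, _ in sites)
    total = sum(t for _, t in sites)
    return methylated / total if total > 0 else None

# Hypothetical read counts for one region, as (methylated reads, total reads).
# At GpC sites, methylation by M.CviPI indicates an unprotected (accessible) site.
gpc_sites = [(9, 10), (8, 10), (0, 0), (1, 5)]
# At CpG sites, methylation reflects the endogenous methylation state.
cpg_sites = [(1, 10), (0, 8)]

accessibility = weighted_mean_level(gpc_sites)   # region-level accessibility
methylation = weighted_mean_level(cpg_sites)     # region-level methylation
```

Because uncovered sites contribute nothing to either sum, sparse GpC coverage lowers resolution rather than biasing the estimate, which matches the limitation noted above.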
Although nucleosomes are the primary packaging unit of DNA and are the focus of most accessibility
assays, binding by other proteins can also be informative and can itself cause an observed lack of accessibility.
Anecdotally, we have often observed bimodal distributions in ATAC-seq peaks at regulatory regions, and these
dips in accessibility could be caused by protein binding interfering with Tn5 transposition in the center of
the peak (data not shown). More sophisticated models will be required to combine ChIP-seq protein binding
assays and ATAC-seq accessibility peaks to fully understand the effects of protein binding on accessibility
profiles.
DNA methylation can also play a direct role in the binding affinity of transcription factors. Methylation
of individual CpGs has been shown to confer changes to the roll angle and minor-groove width that have a
bidirectional effect on binding (enhancing the cleavage rate of DNase I and reducing binding of Pbx-Hox
complexes) [Rao et al., 2018].
It is important to note that the precise definition of DNA accessibility is omitted here because it changes
based on the assay used. The operational and desired definition would be something like “accessible to
transcriptional machinery such as transcription factors or polymerases,” but it is not yet known whether
accessibility to a site by Tn5, an endonuclease, or other enzymes precisely translates to accessibility by
transcriptional machinery.
2.2.3 Biological dynamics of chromatin accessibility
As described in previous sections, DNA is highly condensed – coiled around nucleosomes and organized
into tightly packed 3D structures. Nucleosome spacing varies by tissue and nucleosomes are absent at
regions of heightened accessibility, including gene promoters and enhancers. The absence of a nucleosome
could imply that a protein is able to interact with the DNA there. The condensed state of DNA results in
inaccessibility of regulatory regions and gene promoters of packed genes by transcription factors and RNA
polymerase, and the dynamic activation of such genes relies on so-called “chromatin remodeling” factors.
There are two major modes of chromatin remodeling: covalent histone modifications that relax or tighten
the binding affinity of DNA to nucleosomes, and ATP-dependent remodeling proteins that actively add or re-
move nucleosomes. A variety of histone modifications have been implicated as strengthening or weakening
the attachment of DNA to nucleosomes, described as a “histone code” hypothesis whereby combinations
of these modifications can lead to a rich library of fully activated, poised, or silenced regulatory regions
[Allis and Jenuwein, 2016]. This, coupled with the fact that histone modifications are assayable through
immunoprecipitation techniques, facilitates in-depth study of their direct association with transcriptional
activity [Heintzman et al., 2009, Karlić et al., 2010].
DNA methylation and accessibility are intrinsically linked, with accessibility of regulatory regions fre-
quently correlating with hypomethylation and vice versa for silenced regions. Despite this association, the
precise dynamics and timing of methylation and accessibility changes during cellular differentiation remain
unclear. Analysis of MAINE-seq data revealed that depletion of nucleosomes in intergenic regions cor-
related with intergenic HMRs [Schlesinger et al., 2013]. There is a slight periodicity of methylation that
correlates with linker DNA, and this periodicity is amplified in count-based measures of accessibility such
as ATAC-seq, motivating the future study of both marks together to reduce false-positive peaks.
2.3 Epigenome dynamics by biological context
In this section, I briefly describe the changes in DNA methylation and chromatin accessibility that occur
during embryonic development, tumorigenesis, and cellular differentiation. DNA methylation dynamics of
whole-organism mammalian development and tumorigenesis are complex and not yet perfectly defined. The
methylation state of each CpG site is typically maintained during mitosis by DNA methyltransferase 1, or
Dnmt1, which uses the methylation state of the template strand to copy methylation to the nascent strand. De
novo methylation at CpG sites is achieved through Dnmt3a and Dnmt3b, although Dnmt1 also has some de
novo methyltransferase ability. The TET pathway is the primary demethylation pathway, whereby multiple
rounds of oxidation lead to erasure of the methyl group on methylcytosine [Kohli and Zhang, 2013]. In
this section, when I refer to “active demethylation” it is presumably by this mechanism. In contrast, “passive
demethylation” refers to failure of Dnmt1 to faithfully transmit the methylation state of the template strand
through cell division, leading to slow loss of methylation in a population of cells undergoing proliferation.
Combinations of active and passive demethylation and de novo methylation can lead to faithful transmission
of epigenomic states across generations, changes in response to environmental stimuli, and epigenetically
driven selective sweeps in tumorigenesis.
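The slow loss of methylation under passive demethylation can be illustrated with a simple toy model. It assumes, for illustration only, that each parental strand keeps its methylation, that Dnmt1 copies each mark to the nascent strand with a fixed probability (the hypothetical parameter maintenance_efficiency), and that no de novo methylation or active demethylation occurs. Under these assumptions the population-average level is multiplied by (1 + e) / 2 at every division.

```python
def passive_demethylation(initial_level, maintenance_efficiency, divisions):
    """Average methylation level after each cell division in a toy model.

    Each replicated duplex has one parental strand (level m) and one nascent
    strand (level e * m), so the mean level becomes m * (1 + e) / 2.
    """
    levels = [initial_level]
    for _ in range(divisions):
        levels.append(levels[-1] * (1 + maintenance_efficiency) / 2)
    return levels

# Complete failure of maintenance: methylation halves at every division.
no_maintenance = passive_demethylation(0.8, 0.0, 3)
# Perfect maintenance: the methylation level is transmitted unchanged.
perfect_maintenance = passive_demethylation(0.8, 1.0, 3)
# Slightly imperfect maintenance: gradual loss over many divisions.
leaky_maintenance = passive_demethylation(0.8, 0.95, 20)
```

Even a small per-division failure rate compounds geometrically, which is consistent with the gradual loss described above for proliferating cell populations.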
2.3.1 Mammalian development
2.3.1.1 Embryonic development: from totipotent to somatic cell types
Mammalian development can be roughly split into two main stages: establishment of totipotency in the pre-
implantation embryo and cellular differentiation as the embryo grows. DNA methylation plays a critical role
in both stages through global reprogramming and cell-type specific differential methylation, and as studies
move towards single site resolution whole genome bisulfite sequencing these phenomena are becoming
clearer.
Global epigenetic reprogramming is a hallmark regulatory event that occurs twice over the course of
mammalian development, and has been implicated as a potential driver for establishing nuclear totipotency
as well as genomic imprinting [Reik et al., 2001]. The first wave of reprogramming occurs in primordial
germ cells (PGCs), where the erasure and subsequent de novo methylation occurs during their migration to
the gonads during development. Seisenberger et al. [2012] carried out a time-course experiment, perform-
ing WGBS on mouse PGCs across key stages of development ranging from E6.5 to E16.5. They found that
demethylation in PGCs occurs in two stages, and appears to be passive. They also report a set of variably
erased CpG islands, suggesting that these regions may contribute to transgenerational epigenetic inheritance.
The second wave of reprogramming occurs in the pre-implantation embryo following fertilization. The pa-
ternal genome is actively demethylated first, likely due to its location outside the maternal pronucleus, but
interestingly imprinted genes escape demethylation (see Section 2.1.2.4 for more details). Passive demethylation
of both genomes follows through the first few cell divisions, and methylation is restored to varying degrees as
blastocyst-stage cells begin to differentiate. More recently, Zhang et al. [2018] described a more detailed
timecourse and replicated many of the results described above.
In addition to waves of reprogramming, DNA methylation changes have been observed in cellular differentiation
during development. Upon differentiation from ESCs, genes traditionally associated with pluripotency
such as OCT4 and NANOG are repressed by de novo methylation at their promoter regions [Meissner,
2010]. While numerous other methylation changes have been observed, whether they are connected with
subsequent changes in expression is not well explored. For some cell types, evidence suggests that the crit-
ical genes are already primed for expression in ESCs prior to differentiation [Kaaij et al., 2013]. Recently,
VanderKraats et al. [2013] developed a method to connect differential methylation patterns to expression
differences. The method is focused on smoothing methylation profiles around transcription start sites. The
types of changes in profiles associated with promoters are then connected to expression differences, in order
to assess the degree to which DNA methylation impacts the expression profiles of different somatic cell
lines.
Identifying regions of differential methylation in differentiating cells can inform generally on where to
look during subsequent analyses. The most familiar type of differential methylation is total alteration of
promoter methylation level, particularly at CpG-rich promoters, upon lineage commitment in development.
The CD19 gene is an example of a tissue-specific gene, in this case B cell specific, whose promoter loses
methylation when cells commit to the lymphoid lineage [Hodges et al., 2011]. Since the promoters of most
annotated protein-coding genes have at least some region of hypomethylation in nearly all somatic cell
types, complete methylation change at promoters represents the minority of methylation dynamics during
development. Much more commonly, existing regions of hypomethylation show significant expansions
upon differentiation. These expansions have been associated with CpG density, and have been called CpG
island “shores” but the association with increased CpG density might be coincidental [Molaro et al., 2011].
The expanded regions of hypomethylation are skewed towards the 3’ direction relative to the transcription
start site of a gene [dos Santos et al., 2015]. In addition to these expanded regions of hypomethylation at
promoters, other hypomethylated regions frequently appear as cells differentiate. These regions are often
cell type specific enhancers, and can occur both near and far relative to their targets.
2.3.1.2 Extraembryonic development
In the previous section, I discussed embryonic development and the changes that occur to the methylome
during this process. Concurrent with embryonic development is the development of extraembryonic tissue,
with the chief extraembryonic component being the placenta. The placenta forms the crucial link between
mother and developing offspring during mammalian pregnancy. It is responsible for anchoring the fetus
to the uterine wall, secretes hormones that adapt maternal physiology, prevents immunological rejection of
the fetus, and exchanges substrates between fetal and maternal blood spaces. These functions are strongly
conserved across mammals, despite extraembryonic tissues displaying remarkable morphological variation
[Furukawa et al., 2014] and placenta-specific genes accumulating a high rate of nonsynonymous mutations
[Chuong et al., 2010, Hughes et al., 2000]. After establishment of totipotency, specification of the trophoblast
(the precursor lineage to the placenta) is the first cellular differentiation event to take place.
To effectively perform its wide array of functions, the placenta is composed of multiple trophoblast cell
types, which in the mouse are organized into specialized zones. The junctional zone lies proximal to the
uterine wall and is composed of invasive endocrine trophoblast cells: the spongiotrophoblast, glycogen and
giant cells. These cell types are important in promoting maternal immune tolerance, decidual vasculariza-
tion and maternal metabolic adjustments that favour fetal nutrient delivery [Hu and Cross, 2010]. The other
layer of the mouse placenta is called the labyrinthine zone, and lies proximal to the developing embryo.
It is composed of a dense network of fetal capillaries and maternal blood spaces that are lined with syn-
cytiotrophoblast cells, which exchange nutrients, gases and waste between mother and fetus [Coan et al.,
2005, Sferruzzi-Perri et al., 2009]. Genomic studies have provided insight into placentation and diversifica-
tion of placental morphology [Cross, 2000, Roberts and Cooper, 2001], and the recent maturation of assays
designed to explore epigenetic features allows us to study these phenomena at unprecedented resolution.
In mammals, around 70% of the cytosines in CpG dinucleotides are methylated in somatic cells [Song
et al., 2013], compared with closer to 50% methylated in the placenta [Ehrlich et al., 1982, Razin et al.,
1984]. A large body of evidence suggests that the placental methylome plays a critical functional role,
and targeted assays have implicated aberrant DNA methylation in several placental disease phenotypes,
including pre-eclampsia [Hogg et al., 2013, Kulkarni et al., 2011, Yuen et al., 2010] and growth restriction
[Banister et al., 2011, Lambertini et al., 2011]. The recent advent of whole genome bisulfite sequencing
(WGBS) has provided higher resolution maps of DNA methylation in the placenta that have confirmed this
lower methylation and detected the presence of partially methylated domains (PMDs), long stretches of the
genome where methylation levels drop below the default, primarily hypermethylated state [Schroeder et al.,
2013]. PMD presence is correlated with changes in gene expression, and outside of placenta they have only
been observed in cancer and immortalized cell lines [Hansen et al., 2011, Lister et al., 2009].
A recent study found that despite widespread morphological changes, the placenta’s globally lowered
methylation state is conserved across mammalian species [Schroeder et al., 2015]. Another study identified
species-specific endogenous retroviral activity in the placenta, and suggested that the lowered methylation
levels in the placenta could represent a source of placenta-specific regulatory variation that is typically
silenced in other tissues [Chuong et al., 2013]. Prior work also observed similarity between trophoblast
methylomes and oocytes, suggesting minimal de novo methylation in extraembryonic-derived tissues following
fertilization [Schroeder et al., 2015]. Knocking out de novo methyltransferases Dnmt3a/3b in trophoblasts
resulted in few defects compared to wild type at embryonic day 9.5, further supporting a reduction
or lack of de novo methylation in placenta up to that timepoint [Branco et al., 2016].
The processes that lead to the globally lower methylation levels in the placenta are still not fully understood.
In Chapter 3, I comprehensively examine the methylation landscape in the placenta across
timepoints, placental layers, and species. In Chapter 5, I examine the similarities and differences of PMDs
in placenta with those formed in cultured cell lines and cancer.
2.3.2 Tumorigenesis and metastasis
Tumor DNA methylation profiles show a remarkable departure from their healthy-tissue counterparts. The
two major epigenetic changes that occur during tumorigenesis are the genome-wide hypermethylation of
CpG islands (CGIs) and the formation of partially methylated domains. It is easy to state these as two
discrete and unchanging phenomena, but in reality there is a great deal of subtlety in the number, degree,
and clonal dynamics of CGI methylation events and PMDs. In this section, I will describe what is known
about tumor origins, primary tumor proliferation, and metastasis, as well as the methylation
changes at CpG islands that accompany these events. See Section 2.1.2.2 and Chapter 5 for more information
about the formation and function of PMDs in cancer.
2.3.2.1 Tumor formation and metastasis
Cancer is an uncontrolled cellular growth that forms as a result of a combination of inherited genotype and
genetic and epigenetic mutations that accumulate over the lifetime of an organism [Garraway and Lander,
2013]. Cancer generally originates from a single cell type and ultimately spreads to distant regions of
the body, causing death of the organism, and is a multi-faceted disease with several initiating factors and
mechanisms. Gene mutations primarily occur during mitotic replication, leading to hypotheses that cancer
is more likely to originate from rapidly replicating cell types with more opportunity for cancer-causing
mutations than those that rarely divide [Tomasetti et al., 2017]. Through population level screening of
different tumor types, thousands of genes have been described as having a role in tumorigenesis. These genes
can be classed as either tumor suppressors, whereby silencing confers tumorigenic properties, or oncogenes,
where upregulation via knockout of upstream tumor suppressors or change in regulatory sequence can lead
to cancerous phenotypes. In cell types that rarely divide, epigenetic changes that can influence the expression
of a large number of oncogenes or tumor suppressors simultaneously can accelerate the formation of tumors,
as in glioblastoma [Nagarajan and Costello, 2009].
The shedding of tumor cells from their primary site into systemic circulation followed by hematogenous
spread has long been thought to be a major cause of distant metastasis [Ashworth, 1869]. Modern studies
have shown that the majority of circulating tumor cells (CTCs) [Fehm et al., 2002] and metastatic lesions
[Wagenblast et al., 2015] are derived from subclones of the primary tumor. Studying the epigenome of
this link between primary tumors and distant lesions is notoriously difficult. CTCs are extremely rare,
with approximately one cancer cell per 10,000,000 white blood cells, which makes their detection and
capture challenging. Nonetheless, if CTCs can be isolated from cancer patients as viable cells that can be
genotyped and functionally characterized over the course of therapy, treatment regimens can be devised that
most effectively target the evolving mutational profile of the cancer. Recently, CTCs have been isolated
from blood and grown in vitro [Yu et al., 2011], promising the future ability to study the changes in the
epigenome that confer ability to intravasate into the blood stream and extravasate to secondary sites.
2.3.2.2 CpG island methylator phenotype
Early studies of DNA methylation in cancer reported global hypomethylation of cancer tissues relative to
their matched healthy analogues. With the advent of array and sequencing based methylation quantification
techniques, researchers recognized that the observed global hypomethylation was organized into distinct
sections of the genome, later called partially methylated domains (PMDs). PMDs associate with lamina-
attachment regions, transcriptional repression of the genes they overlap, and are generally in regions of low
gene density. Concurrent with the discovery of this widespread demethylation was the discovery of focal
hypermethylation of CpG islands [Toyota et al., 1999] in colorectal cancer. Later studies confirmed the
existence of elevated methylation levels in the CGIs of other cancers, including glioma [Noushmehr et al.,
2010], breast cancer [Fang et al., 2011], and leukemia [Kelly et al., 2017].
The focus of many research groups has been on classifying tumor samples into categories based on the
presence and degree of CGI methylation, and of designing small sets of very informative CGI probes that can
discriminate between those samples having or not having the so-called “CpG Island Methylator Phenotype”
(CIMP). Using these informative CGI probes to assay methylation in large numbers of tumors, many groups
have explored association of CIMP with survival outcomes, leading to mixed results regarding prognosis of
patients with CIMP+ tumors that seem to vary based on cancer type. Recently, the distinction has been made
that all tumor samples have some degree of CGI methylation changes, and that CIMP methylation changes
are a distinct subset [de Souza et al., 2018]. Despite mixed directionality in prognosis predictions for CIMP
in different tumor types, the verifiable and significant differences in survival times between CIMP-positive
and CIMP-negative patients make CIMP identification panels a clinically relevant pursuit.
Parallel to identification of CGI methylation changes in cancer, research has also focused on the mechanisms
underlying them. Many causes have been proposed, and most fall into one of three categories. The
first and most comprehensively studied are genetic mutations that lead to change in the relative abundances
or effectiveness of methylation or demethylation machinery such as TET proteins or de novo methyltransferases.
One of the earliest detected CGI methylation-associated genetic aberrations was mutation of IDH1
[Noushmehr et al., 2010] in glioblastoma. Later in the same year, IDH1/IDH2 mutations were shown to
be mutually exclusive with TET2 mutations in leukemia, and it was proposed that IDH1 mutations led to
disruption of TET2 function, implying a causal mechanism whereby IDH1/IDH2 and TET2 mutations lead to
methylation of CGIs [Figueroa et al., 2010].
Another category of proposed causes of CGI methylation changes in cancer comprises changes to the CGIs
themselves that make them more amenable to methylation. The best studied examples of such mechanisms
are knockouts of proteins that bind to CGIs in healthy cells and protect them from methylation. In one recent
non-cancer study, it was shown that knockout of FBXL10, which co-occurs at CGIs bound by the polycomb
repressive complex, was sufficient for methylation of those CGIs [Boulard et al., 2015]. Wt1 is another
gene for which knockout confers CGI methylation [Wang et al., 2015]. In addition to the repeated oxidation
activity of TET family proteins, it has previously been shown that TET1 has a second function: namely
to bind to unmethylated DNA with its CXXC domain and prevent de novo methylation [Wu et al., 2011].
Expression of a novel isoform of TET1 lacking this CXXC domain was recently discovered and was found
to be overexpressed in cancer [Good et al., 2017]. It was theorized that the lack of this binding protection
role leads to increased methylation of CGIs. Mutations in chromatin remodeling proteins may also play a
role in increased accessibility of CGIs by Dnmts, leading to their methylation [Tahara et al., 2014]. Double
stranded breaks inside CGIs [Morano et al., 2013, O’Hagan et al., 2008] and more recently insertion of long
CpG-free sequences into CGIs [Takahashi et al., 2017] have also been suggested as instigators of de novo
methylation in CGIs.
The last category of proposed causes of CGI methylation in cancer relates to the impact of the tumor
microenvironment, such as hypoxia and arsenic exposure. Thienpont et al. [2016] recently claimed that hypoxia
resulted in lower TET2 enzymatic activity, leading to a gradual increase in de novo methylation at
CGIs that are actively and constantly targeted for demethylation. Earlier studies showed that in utero exposure of
infants to arsenic and mercury led to significant differences in CGI methylation in umbilical cord blood
[Cardenas et al., 2015], though the effect sizes of these environmental perturbations do not compare to the
near-complete hypermethylation observed as a result of genetic mutations in important binding proteins and
demethylation enzymes.
Chapter 3
DNA methylation divergence and tissue specialization in the developing
mouse placenta
The placental epigenome plays a vital role in regulating mammalian growth and development. Aberrations
in placental DNA methylation are linked to several disease states, including intrauterine growth restriction
and preeclampsia. Studying the evolution and development of the placental epigenome is critical to under-
standing the origin and progression of such diseases. Although high resolution studies have found substantial
variation between placental methylomes of different species, the nature of methylome variation has yet to
be characterized within any individual species. We conducted a study of placental DNA methylation at
high resolution in multiple strains and closely related species of house mice (Mus musculus musculus, Mus
m. domesticus, and M. spretus), across developmental timepoints (embryonic days 15 to 18), and between
two distinct layers (labyrinthine transport and junctional endocrine). We observed substantial genome-
wide methylation heterogeneity in mouse placenta compared to other differentiated tissues. Species-specific
methylation profiles were concentrated in retrotransposon subfamilies, specifically RLTR10 and RLTR20
subfamilies. Regulatory regions such as gene promoters and CpG islands displayed cross-species conser-
vation, but showed strong differences between layers and developmental timepoints. Partially methylated
domains exist in the mouse placenta and widen during development. Taken together, our results characterize
the mouse placental methylome as a highly heterogeneous and deregulated landscape globally, intermixed
with actively regulated promoter and retrotransposon sequences.
3.1 Experimental design
Our study relied upon two datasets, which we will refer to as the “interspecific” and “intraspecific” datasets,
respectively (Figure 3.1). The interspecific dataset included 12 WGBS methylomes from three species,
including four samples from each of Mus musculus musculus, M. m. domesticus, and M. spretus. We
reduced the potential confounds of inbreeding and litter effects by inter-crossing two strains per species. We
sequenced to an average depth of 6.2x per covered CpG per sample, and surveyed an average of 77% of
CpGs genome-wide per placenta.
[Figure 3.1, panel A, interspecies (whole) samples: 12 total, 6-10x coverage; 4 Domesticus (BIKxDOT1, BIKxDOT2, DOTxBIK1, DOTxBIK2), 4 Musculus (MPBxMBS1, MPBxMBS2, MBSxMPB1, MBSxMPB2), 4 Spretus (STFxSFM1, STFxSFM2, SFMxSTF1, SFMxSTF2); male samples are marked. Panel B, intraspecies (layer) samples: 24 total, 1-2x coverage; 12 labyrinthine (LZ) and 12 junctional (JZ), each split into 6 E15 and 6 E18, each of those split into 3 male and 3 female; one sample was too low quality.]
Figure 3.1: Experimental design. (A) Four whole placental samples each were sequenced from three mouse
species. Two male samples were included. Crosses were named with their maternal strain. (B) Twenty-
four placental samples from two timepoints: embryonic days 15 and 18 were evenly split between the
labyrinthine and junctional zone and had equal number of male and female embryos.
The intraspecific dataset concentrated on a single genetic strain, C57BL/6J, the genome reference. From
that strain, we produced the first purified methylomes of the two main placental layers. We sequenced 24
WGBS methylomes: 12 each from the labyrinthine zone [LZ] and junctional zone [JZ]. These samples
originated from two developmental timepoints (embryonic days 15 [E15] and 18 [E18]) and from male
and female siblings collected from three different litters. In this intraspecies dataset, each sample was
sequenced to an average depth of 1.45x per covered CpG, and surveyed on average 50% of CpGs genome-
wide. The high number of replicates for each factor allowed us to combine these methylomes for high
coverage where necessary and increase statistical power to detect differences across factors. One sample
was excluded due to poor quality. Quality control statistics for all methylomes produced in this study can
be found in Supplemental Tables B.1 and B.2.
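Combining low-coverage replicate methylomes into a higher-coverage one can be sketched as summing read counts per CpG across replicates. The data layout below (a dictionary mapping a site to methylated and total read counts) and the function names are assumptions made for illustration, not the format or tooling used in this study.

```python
from collections import defaultdict

def pool_replicates(replicates):
    """Sum (methylated, total) read counts per CpG site across replicate methylomes."""
    pooled = defaultdict(lambda: [0, 0])
    for replicate in replicates:
        for site, (methylated, total) in replicate.items():
            pooled[site][0] += methylated
            pooled[site][1] += total
    return {site: tuple(counts) for site, counts in pooled.items()}

def coverage_summary(methylome, n_cpgs_genome):
    """Return (mean depth per covered CpG, fraction of genomic CpGs covered)."""
    depths = [total for _, total in methylome.values() if total > 0]
    mean_depth = sum(depths) / len(depths) if depths else 0.0
    return mean_depth, len(depths) / n_cpgs_genome

# Two hypothetical low-coverage replicates keyed by (chromosome, position).
rep_a = {("chr1", 100): (1, 2), ("chr1", 200): (0, 1)}
rep_b = {("chr1", 100): (2, 2), ("chr1", 300): (1, 1)}

pooled = pool_replicates([rep_a, rep_b])
mean_depth, fraction_covered = coverage_summary(pooled, n_cpgs_genome=4)
```

Pooling increases both the depth at shared sites and the fraction of CpGs surveyed, at the cost of averaging over any true differences between the pooled samples.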
3.1.1 Inter-species whole placenta tissue collection
All animal husbandry, experimental procedures, and personnel were approved by the University of Southern
California’s Institutional Animal Care and Use Committee, protocol #11394. Mice were housed under
a 14:10 hour light:dark cycle with food and water ad libitum. To investigate species differences in placental
methylation, crosses were established between wild-derived inbred strains developed and distributed by
François Bonhomme and colleagues (U. Montpellier). For M. domesticus, we made reciprocal crosses
between strains BIK (originally isolated from Kefar Galim, Israel) and DOT (Tahiti); for M. musculus, MPB
(Bialowieza, Poland) and MBS (Sokolovo, Bulgaria); and for M. spretus, STF (Fondouk Djedid, Tunisia)
and SFM (Montpellier, France). For all crosses, a single stud and dam were housed together for four and
a half days, and then split. Ten days later, females were euthanized, uteri were collected and the number of
viable conceptuses counted, leading to gestational ages between E11 and E16. The embryos and placentae
were dissected and weighed. Two placentae were sampled from a single litter in each cross direction and
snap frozen in liquid nitrogen. Placentae are named according to their maternal strain. We attempted to
sample only female placentae, through two replicated attempts to PCR-amplify two Y-linked and one X-linked
region [Kunieda et al., 1992], but for logistical reasons we had to include two male placentae.
3.1.2 Intra-species junctional and labyrinthine zone tissue collection
All experiments were carried out under the UK Home Office Animals (Scientific Procedures) Act 1986.
C57Bl/6 females were housed under dark:light 12:12 conditions with free access to water and the stan-
dard diet used in the University of Cambridge Animal Facility. At 8-10 weeks, females were mated with
C57Bl/6 males and the day a copulatory plug was found was denoted as embryonic day 1 of pregnancy
(term=20.5 days). On embryonic days 15 or 18 of pregnancy (days correspond to the periods of rapid
placental and fetal growth respectively), mouse dams were killed by cervical dislocation (a Schedule 1 method). Uteri
were collected and the number of viable conceptuses counted. Embryos and placentas were dissected
and weighed. Each placenta from the litter was rapidly separated into the functionally distinct zones, the
labyrinthine transport and junctional endocrine zones [Sferruzzi-Perri et al., 2009], in ice cold sterile PBS
before rapid snap freezing in liquid nitrogen. Fetal tails were kept for sexing using standard genotyping
methods including using primers to detect the SRY gene (5’-CCCAGCATGCAAAATACAGA -3’ and 5’-
TCAACAGGCTGCCAATAAAA-3’), an internal control gene (5’-AGTGGCTAACGCTGAGTGGT-3’ and
5’-GTGCCTGTCGGAGGAGAAC-3’) and with agarose gel electrophoresis. From each litter, the male and
female placentas with weights closest to the litter mean were used for further analysis.
3.1.3 DNA extraction and methylation assay
Using a Qiagen DNA extraction kit, DNA was extracted and purified for all inter- and intra-species samples.
DNA was fragmented to 100-300bp fragments by sonication and end repaired before ligation of methylated
sequencing adapters. Bisulfite treatment was performed using the Zymo EZ DNA Methylation Gold kit.
Following bisulfite treatment, DNA fragments were de-salted and size selected to produce a 200-300bp
short-insert library, subjected to PCR, and size selected again before 100bp paired-end reads were sequenced
using an Illumina HiSeq 4000.
3.2 Data analysis
Reads were mapped to the mm10 reference genome using WALT [Chen et al., 2016]. Calculation of methy-
lation levels, bisulfite conversion rate, and identification of PMDs was performed as described in [Song
et al., 2013], and all quality control metrics and summary statistics are available in Supplemental Tables B.1
and B.2. Weighted methylation levels as defined by Schultz et al. [2012] were used to calculate average
methylation levels in genomic regions. All browser plots were created using the UCSC genome browser
tool [Kent et al., 2002]. Promoters were defined as +/-1kb from the mm10 RefSeq TSS based on the obser-
vation that hypomethylation frequently occurs there on the kilobase scale [Molaro et al., 2011]. CpG islands
were identified as described in Gardiner-Garden and Frommer [1987]. Retrotransposon copies were anno-
tated by RepeatMasker [Smit et al., 1996] and downloaded from the UCSC Table Browser [Karolchik et al.,
2004]. Pearson correlation and Euclidean distance at single CpG resolution in whole placenta samples were
calculated from the 2,801,446 CpG sites with 5x or greater sequencing depth across all samples. Pairwise
comparisons between whole placental samples included only within-species comparisons. Intraspecies sin-
gle CpG site correlations and distances were computed from 8,180 CpG sites with 3x or greater sequencing
depth across all samples. Boxplots for brain, intestine, and blood were produced with a random set of CpG
sites downsampled to the number covered in placental samples. Pearson correlation, Euclidean distance,
and distributions were produced using only promoters, CpG islands, and retrotransposons with at least 5
CpG observations to reduce the discretizing effect of low-coverage observations.
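The coverage filter and the two pairwise measures described above can be illustrated with a minimal Python sketch. It assumes each methylome is available as a mapping from CpG position to (methylated reads, total reads); the function names and the toy read counts are hypothetical and are not the code or data used in this study.

```python
import math


def shared_sites(samples, min_depth=5):
    """Return CpG positions covered at >= min_depth reads in every sample.

    Each sample is a dict: position -> (methylated_reads, total_reads).
    """
    common = set(samples[0])
    for sample in samples[1:]:
        common &= set(sample)
    return sorted(p for p in common
                  if all(s[p][1] >= min_depth for s in samples))


def levels(sample, sites):
    """Per-site methylation level (methylated / total reads)."""
    return [sample[p][0] / sample[p][1] for p in sites]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def euclidean(x, y):
    """Euclidean distance between two equal-length lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Toy data (hypothetical read counts, not from the study).
sample_a = {100: (5, 10), 200: (0, 8), 300: (6, 6), 400: (1, 2)}
sample_b = {100: (4, 8), 200: (1, 10), 300: (9, 9), 400: (5, 5)}

# Position 400 is dropped because sample_a has only 2 reads there.
sites = shared_sites([sample_a, sample_b], min_depth=5)
r = pearson(levels(sample_a, sites), levels(sample_b, sites))
d = euclidean(levels(sample_a, sites), levels(sample_b, sites))
```

Requiring the depth threshold in every sample keeps the comparison on a common set of sites, at the cost of discarding most CpGs when coverage is low.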
Global methylation levels were compared across species using a one-way analysis of variance (ANOVA)
in R, with species as the factor. Differentially methylated CpGs between species were called using
RADMeth [Dolzhenko and Smith, 2014], with species-specific methylation signatures identified using the
other two species as background samples. Multiple testing correction of combined DM CpG p-values was done
as described in [Benjamini and Hochberg, 1995] using an alpha level of 0.05. Observed over expected (O/E)
ratios used for enrichment and depletion analysis of subfamilies were calculated as follows:
\[
\frac{O}{E} = \frac{\text{DM CpGs in subfamily} \,/\, \text{DM CpGs total}}{\text{CpGs in subfamily} \,/\, \text{CpGs genome-wide}} \tag{3.1}
\]
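As a worked illustration of Equation 3.1, the following Python sketch computes the O/E ratio for a single subfamily. The function name and all counts are hypothetical, and the use of the natural logarithm for the log ratio is an assumption, since the base is not stated here.

```python
import math


def observed_over_expected(dm_in_subfamily, dm_total,
                           cpgs_in_subfamily, cpgs_genome_wide):
    """O/E ratio as in Equation 3.1."""
    observed = dm_in_subfamily / dm_total          # share of DM CpGs in the subfamily
    expected = cpgs_in_subfamily / cpgs_genome_wide  # share of all CpGs in the subfamily
    return observed / expected


# Hypothetical counts: 200 of 100,000 DM CpGs fall in a subfamily that
# holds 10,000 of 20,000,000 CpGs genome-wide.
ratio = observed_over_expected(200, 100_000, 10_000, 20_000_000)

# Log ratio; values above 0 indicate enrichment, below 0 depletion.
# The natural log is an assumption made for this sketch.
log_ratio = math.log(ratio)
```

In this toy case the subfamily holds 0.2% of DM CpGs but only 0.05% of all CpGs, so the ratio is about 4.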
Public RNA-seq reads were mapped using STAR [Dobin et al., 2013]. BAM files were converted to
read counts using HTSeq [Anders et al., 2014] and differentially expressed genes were identified using
edgeR [Robinson et al., 2010]. We placed an upper bound on the counts per million (CPM) of differentially
expressed genes analyzed to focus on those genes that were nearly silenced in one cell type relative to the
other.
3.3 Placenta DNA methylation is globally heterogeneous but highly conserved
at regulatory regions
3.3.1 Global heterogeneity and tissue specificity in the mouse placenta
We observed global hypomethylation of the mouse placenta relative to other tissues: genome-wide
methylation levels across interspecific samples ranged from 43.3% to 53.8%, and varied by species (ANOVA,
p < 0.015). Comparison of placental methylation levels in whole placental samples at single CpG
resolution also revealed significantly higher within-tissue heterogeneity when compared with other fully
differentiated tissues. To illustrate this, we computed the Pearson correlation and Euclidean distance between
all pairs of whole placenta samples from the same species. We plotted these as a boxplot, together with
boxplots of pairwise correlation and distance for three other tissues: brain [Lister et al., 2013], intestine
[Hon et al., 2013, Kaaij et al., 2013, Sheaffer et al., 2014] and blood [Kieffer-Kwon et al., 2013] (Figures
3.2A and 3.3A). Despite this within-tissue variability, genome-wide methylation levels clustered reasonably
well by strain and species in the placenta (Figure 3.2B), although they did not precisely capture the true
species-level evolutionary relationship [Sarver et al., 2017, Tucker, 2006]. Intra-species samples clustered
well by layer; however, both comparisons of single-CpG heterogeneity and pairwise binned correlations
suffered from substantially lower sequencing depth (Figure 3.3B). In the intraspecific dataset, the junctional
zone was less methylated than the labyrinthine zone (p < 0.017) (Figure 3.2C), but there was no significant
difference in global levels of CpG methylation by developmental timepoint or sex.
[Figure 3.2 image: panels A-C. (A) Correlation and Distance for Placenta and Brain; (B) Correlation-based clustering of samples MPB1, MPB2, MBS1, MBS2, SFM1, SFM2, STF1, STF2, DOT1, DOT2, BIK1, BIK2 (Musc., Spret., Dom.); (C) %mCpG by Layer (LZ, JZ), Age (E15, E18), and Sex (M, F).]
Figure 3.2: DNA methylation is variable in the placenta, except at regulatory regions. (A) Pearson correla-
tion and Euclidean distance between whole placenta samples and brain samples. (B) Hierarchical clustering
of pairwise binned correlation for whole placental samples. Three-letter codes indicate genetic strain, num-
ber indicates individual. (C) Global methylation between layers, timepoints, and sex.
[Figure 3.3 image: panels A-B. (A) Correlation and Distance for C57/Bl6 placenta, Brain, Intestine, and Blood; (B) Correlation-based clustering of 23 intraspecies samples (12 JZ, 11 LZ).]
Figure 3.3: (A) Pearson correlation and Euclidean distance between intra-species placenta samples and other
tissues at single CpG resolution. (B) Hierarchical clustering of intraspecies samples using methylation levels
in 1kb bins. Sample names indicate the individual ID, litter, gender, age (15/18), and layer.
The mouse placenta possesses substantially higher within-tissue methylation variance than in other dif-
ferentiated tissues. This noisy and globally hypomethylated state relative to other tissues remains a funda-
mentally distinct and poorly understood feature of placental cells. Recent studies have shown that this low
methylation likely originates very early and persists through the trophoblast lineage [Branco et al., 2016]. In
addition, this state appears to be conserved across distantly related mammalian species [Branco et al., 2016, Schroeder
et al., 2015] as well as in each mouse species in our study. Additionally, this global variation in placental
methylation is conserved in differentiated placental layers and across developmental timepoints. The source
of this variability could be rooted in the placenta’s transient nature, allowing its epigenome to erode during
development without much harm to the overall success of the embryo. One other possibility is an increased
ability of the placenta to buffer or respond to internal or external stimuli. It is possible that the placenta is
capable of flexibly altering its epigenome and transcriptome in subpopulations of cells based on information
from their microenvironment [Fowden and Moore, 2012, Sferruzzi-Perri and Camm, 2016]. These sub-
populations would be intermixed between layers based on proximity to maternal nutrients, and most easily
identified through application of single cell methods.
3.3.2 Placenta-specific DNA methylation dynamics at regulatory regions
DNA methylation is an important component of transcriptional regulation, with methylation of retrotrans-
posons and gene promoters strongly correlated with their repression [Boyes and Bird, 1991]. Figure 3.4A
presents methylation levels in gene promoters in the placenta of each mouse species and high quality WGBS
embryonic stem cell (ESC) methylomes from four separate projects curated in MethBase [Harten et al.,
2015, Li et al., 2015, Lu et al., 2014, Song et al., 2013, Yearim et al., 2015]. In the placenta, promoter
methylation remained bimodal, but with the high mode typically associated with transcriptional repression
shifted from the near-complete methylation seen in other tissues to intermediate levels.
To explore how this reduced promoter methylation level might relate to transcriptional regulation, we
identified differentially expressed (DE) genes between 5 E14.5 C57Bl6/J x FVB/n mouse placenta RNA-
seq experiments [Mould et al., 2013] and ESC RNA-seq data derived from two of the studies above with
matched methylation and expression data [Lu et al., 2014, Yearim et al., 2015]. We identified differentially
expressed genes (see Methods) and plotted them by their normalized counts and log fold change between
ESC and placenta (Figure 3.5). We filtered for DE genes with a log-fold change of greater than 5 and
counts per million of less than 5, leaving us 797 and 733 DE genes expressed higher in placenta and ESC,
respectively, and plotted the promoter methylation distributions for these genes (Figure 3.4B). For genes
that were more highly expressed in placenta, we observed nearly complete methylation of the promoter in ESCs
and hypomethylation of the promoter in placenta. In genes more highly expressed in ESCs, however, the promoter
methylation levels of placenta reached only intermediate levels. This pattern was conserved when consid-
ering the promoter methylation distributions of differentially expressed genes between placenta and other
tissues, including brain [Lister et al., 2013], intestine [Sheaffer et al., 2014], and blood [Kieffer-Kwon et al.,
2013] (Figure 3.6). These observations indicate that epigenomic repression of transcription in the placenta
does not require methylation levels as high as seen in other tissues.
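The filtering of differentially expressed genes described above can be sketched in Python as follows. This is an illustration under stated assumptions: the thresholds are read as an absolute log fold change above 5 and a log counts-per-million value below 5, a positive log fold change is taken to mean higher expression in placenta, and the FDR cutoff of 0.05 is assumed. The gene names and values are hypothetical.

```python
from typing import NamedTuple


class DEGene(NamedTuple):
    name: str
    log_fc: float   # log fold change, placenta relative to ESC (sign convention assumed)
    log_cpm: float  # average log counts per million
    fdr: float      # adjusted p-value


def split_by_direction(genes, min_abs_log_fc=5.0, max_log_cpm=5.0, max_fdr=0.05):
    """Split genes passing all filters into (higher in placenta, higher in ESC)."""
    higher_in_placenta, higher_in_esc = [], []
    for gene in genes:
        # Drop genes that are not significant or exceed the CPM upper bound.
        if gene.fdr >= max_fdr or gene.log_cpm >= max_log_cpm:
            continue
        if gene.log_fc > min_abs_log_fc:
            higher_in_placenta.append(gene.name)
        elif gene.log_fc < -min_abs_log_fc:
            higher_in_esc.append(gene.name)
    return higher_in_placenta, higher_in_esc


# Hypothetical genes, for illustration only.
toy_genes = [
    DEGene("geneA", 7.2, 1.5, 0.001),   # kept: higher in placenta
    DEGene("geneB", -6.1, 2.0, 0.01),   # kept: higher in ESC
    DEGene("geneC", 8.0, 9.0, 0.001),   # dropped: CPM above the upper bound
    DEGene("geneD", 2.0, 1.0, 0.001),   # dropped: fold change too small
    DEGene("geneE", -9.0, 0.5, 0.2),    # dropped: not significant
]
up, down = split_by_direction(toy_genes)
```

The upper bound on CPM favors genes that are close to silent in one of the two cell types, which is the contrast of interest for promoter methylation.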
Despite global methylation heterogeneity in the placenta, the function and methylation states of regula-
tory regions are conserved across species and stable within species. However, the placenta displays a shift
in this promoter methylation distribution with the hypermethylated state towards 0.5, suggesting that the
lowered global default state may represent a change in “high” methylation that remains within the range
acceptable as part of epigenomic gene silencing. Above, we showed that this shifted “high” state has the
same repressive effect on gene expression as full methylation does when we observe it in other tissues. This
opens up an interesting question: if the background methylation level in differentiated somatic cells is higher
than needed for its role in gene expression, why does it remain at consistently high levels with such small
variation between somatic cell types?
Another unique feature of the placental methylome is the relaxed methylation state of retrotransposons.
Almost all retrotransposons are methylated in most other tissues, but show a relaxed methylation state in the
placenta (Figure 3.4C). Interspecies comparisons of placenta methylomes according to genomic annotation
revealed uniformly high correlation and low Euclidean distance for promoters and CpG islands, indicating
conservation of epigenomic state at these regulatory regions. In contrast, we observed strong species-specific
patterns in all classes of retrotransposons (Figures 3.4D and E). These patterns are consistent with an arms
race hypothesis [Crespi and Nosil, 2013], where species-specific methylation patterns are associated with
genomic parasites.
[Figure 3.4 image: panels A-E. (A) Gene promoters and (C) Repeats: Proportion vs. %mCpG for Domesticus, Musculus, Spretus, and ESCs; (B) Density vs. %mCpG for Up and Down genes in ESCs and Placenta; (D, E) Correlation and Distance, Within vs. Between species, for Promoters, CpG Islands, LINEs, SINEs, and LTRs.]
Figure 3.4: (A) Promoter methylation density plot comparing inter-species placenta and ESCs. (B) Di-
rectional promoter methylation distributions for genes differentially expressed between public placenta and
ESC RNA-seq samples. (C) Retrotransposon methylation density plot. (D, E) Within- and between-species
pairwise correlation and distance by genomic feature.
3.4 Quantifying species- and layer-specific methylation changes in the placenta
3.4.1 Promoter methylation differences between species, layers, and timepoints
To identify the biologically meaningful and statistically significant epigenetic differences driven by species,
layer, and developmental timepoints, we used RADMeth to find differentially methylated (DM) CpGs, com-
bine the p-values of neighboring (within 100bp) CpGs, and perform false discovery rate correction according
to Benjamini and Hochberg [1995].
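The false discovery rate step can be illustrated with a short Python sketch of the Benjamini-Hochberg adjustment. It covers only the final correction, not RADMeth's regression or its combination of neighboring p-values; the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values are monotone non-decreasing in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical combined p-values for five CpGs.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(pvals)

# CpGs called significant at an alpha level of 0.05.
significant = [q < 0.05 for q in adj]
```

In this toy example the third and fourth CpGs have raw p-values below 0.05 but adjusted values just above it, so only the first two are called.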
To compare species, we identified DM CpGs for each species relative to the other two species combined.
This allows us to identify features specific to each species. To compare layers and developmental timepoints,
we accounted for the full experimental design to avoid confounding by other factors. Importantly, we
[Figure 3.5 image: panels A-D (ESC, Blood, Intestine, Brain); axes: log2(counts per million) and log2(fold change).]
Figure 3.5: MA Plots of differentially expressed genes between placenta and other tissue RNA-seq samples.
Red points pass the FDR cutoff as differentially expressed after multiple testing correction. (A) ESC (B)
Blood (C) Intestine and (D) Brain.
removed one female sample from the junctional zone (M1043-F5-F15-JZ) from the same developmental
timepoint as the missing LZ sample to ensure equal numbers of male and female samples while calling
differential methylation. Principal component analysis [Wold et al., 1987] using our identified DM CpGs
as the feature set for each factor showed a clear segregation by factor status (Figure 3.7). Most of the
significantly DM CpGs between layers are hypomethylated in the junctional zone relative to the labyrinthine
zone (Figure 3.8A). We also detected a sizeable number of CpG sites whose methylation level increased
[Figure 3.6 image: panels A-D (Upregulated in Placenta, Downregulated in Placenta, Upregulated in Other, Downregulated in Other); axes: Proportion vs. promoter %mCpG; tissues: Brain, Blood, Intestine.]
Figure 3.6: Promoter methylation distributions for differentially expressed genes (A) upregulated in placenta
(B) downregulated in placenta (C) upregulated in the other tissue and (D) downregulated in the other tissue.
between E15 and E18 (Figure 3.8B), suggesting that while previous studies on Dnmt3a/b knockout mice
revealed normal trophoblast formation during early development, de novo methylation likely plays a role in
the late stage development and differentiation of the placental layers.
[Figure 3.7 image: six PCA panels. Layer DM CpGs (JZ, LZ): PC1 18.4%, PC2 7.5%; Sex DM CpGs (F, M): PC1 20.3%, PC2 8.8%; Age DM CpGs (15, 18): PC1 19.7%, PC2 9.0%; Domesticus DM CpGs: PC1 44.2%, PC2 16.5%; Musculus DM CpGs: PC1 37.1%, PC2 23.3%; Spretus DM CpGs: PC1 38.7%, PC2 22.1%. Species panels are colored by Domesticus, Musculus, and Spretus.]
Figure 3.7: Principal component analysis on DM CpGs identified by RADMeth for each factor of interest.
In order to identify the sources of species-, layer-, and age-specific variation, we investigated DM CpG
occupancy inside various genomic regions (Figure 3.8C). We observed an order of magnitude more dif-
ferences by species than by layer or age, and DM CpGs seem to be uniformly distributed throughout the
genome. To identify the DM CpGs that are most likely to drive meaningful differences in transcriptional
[Figure 3.8 image: panels A-E. (A) Layer DM CpGs: Proportion vs. JZ-LZ (%mCpG); (B) Age DM CpGs: Proportion vs. E18-E15 (%mCpG); (C) Prop. DM CpGs in Promoter, Exon, SINE, LTR, LINE, and Intergenic regions for Layer, Age, Spretus, Musculus, and Domesticus, with n = 229578, 41450, 6274232, 4562165, and 2946231 in that listed order; (D) Labyrinthine and Junctional tracks at Prrg3 (chrX qA7.3, mm10, 5 kb scale); (E) Species DM CpG enrichment in repeats: Proportion vs. subfamily log(observed/expected).]
Figure 3.8: Methylomes differ between layers, developmental timepoints, and species. (A, B) Directional
DM CpG methylation distributions between layer and age. (C) DM CpG quantity and location by genomic
region, with N equal to the total number of DM CpGs for each factor. (D) Example promoter that is
differentially methylated between layers. (E) Distribution of enrichment of M. musculus-specific DM CpGs
in retrotransposon subfamilies, showing enrichment (log O/E > 0) of differential methylation in almost
every subfamily.
regulation, we focused on those located in gene promoters. For each promoter, we counted the number of
significantly DM CpGs with at least 30% methylation difference between levels for each factor. Between
layers, this yielded 5 genes with at least 10 DM promoter CpGs, with the top two (Srrt and Zmym3) having
45 and 25 DM CpGs respectively. Three of these genes were on the X-chromosome, and analysis of male
and female samples separately showed conservation of magnitude and directionality of this differential
methylation between sexes (Figure 3.9). Subject to the same analysis, differences between ages yielded only 2
genes with at least 10 DM CpGs (Supplemental Table B.3). Of note, the next highest difference between
ages (8 DM CpGs) was Tjp1, a human ortholog of which was previously associated with trophoblast cell
differentiation and whose promoter was methylated in E18 samples [Pidoux et al., 2010]. No gene set was
enriched for placenta-related gene ontology terms.
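The promoter-level counting described above can be sketched in Python. The sketch assumes promoters are given as half-open intervals of 1 kb on either side of a TSS and that each significant DM CpG carries its methylation difference between the two levels of a factor; the function name, gene names, and coordinates are hypothetical.

```python
from collections import Counter


def count_dm_cpgs_per_promoter(promoters, dm_cpgs, min_diff=0.30):
    """Count DM CpGs with |methylation difference| >= min_diff per promoter.

    promoters: list of (gene, chrom, start, end), half-open intervals.
    dm_cpgs:   list of (chrom, position, methylation_difference).
    """
    counts = Counter()
    for chrom, pos, diff in dm_cpgs:
        if abs(diff) < min_diff:
            continue
        for gene, p_chrom, start, end in promoters:
            if chrom == p_chrom and start <= pos < end:
                counts[gene] += 1
    return counts


# Hypothetical promoters: TSS at 5,000 and 20,000, extended by 1 kb each way.
toy_promoters = [("geneA", "chrX", 4000, 6000),
                 ("geneB", "chr1", 19000, 21000)]

# Hypothetical significant DM CpGs with their methylation differences.
toy_dm_cpgs = [("chrX", 4500, 0.45), ("chrX", 5100, -0.35),
               ("chrX", 5200, 0.10),   # below the 30% difference threshold
               ("chr1", 20500, 0.50), ("chr1", 30000, 0.60)]  # last one is outside any promoter

promoter_counts = count_dm_cpgs_per_promoter(toy_promoters, toy_dm_cpgs)
ranked = promoter_counts.most_common()
```

The nested loop is adequate for a toy example; with millions of DM CpGs an interval index or a sorted sweep would be the natural replacement.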
[Figure 3.9 image: panels A-C. UCSC Genome Browser tracks on chrX at Prrg3, Stag2, and Zmym3; tracks in each panel: Junctional Zone (composite, female, male) and Labyrinthine Zone (female, male, composite), with the identified DMR marked.]
Figure 3.9: UCSC Genome Browser plots showing location and methylation state in male and female in-
traspecies samples for each promoter differentially methylated between layers, with the region containing
DM CpGs in pink.
Interestingly, differentially methylated gene promoters between layers were enriched on the X-chromosome
and primarily hypermethylated in the junctional zone, despite most DM CpGs showing junctional hy-
pomethylation. An example of junctional zone promoter hypermethylation is shown in Figure 3.8D.
An analysis of the X-chromosome revealed global hypomethylation relative to autosomes (p < 3.56e-05,
Figure 3.10A) in females but similar methylation levels in males. We also observed elevated methylation
Rank  Layer (gene)     # DM CpGs   Age (gene)   # DM CpGs
1     Srrt             45          Cdc42        16
2     Zmym3*           25          Picalm       12
3     Stag2*           25          Tjp1         8
4     Prrg3*           11
5     1810009A15Rik    10
Table 3.1: Top differentially methylated promoters between placental layers and developmental timepoints.
Between layers, there is an enrichment for X-chromosome genes (asterisk).
levels in CpG islands of both male and female X chromosomes relative to autosomes, but with greater levels
in female placentas (p < 2.21e-10, Figure 3.10B), suggesting CpG islands have elevated methylation
levels on the inactive X. The male CpG island methylation increase is slightly enriched in the junctional
layer (p < 0.03, Figure 3.10C).
3.4.2 Differential methylation in placental retrotransposons
To better understand the strong species-specific retrotransposon signals we observed, we utilized the Repeat-
Masker annotation of retrotransposons in the LINE, SINE, and LTR classes (removing all non-retrotransposons
from the annotation). We computed the enrichment of species-specific DM CpGs in each retrotransposon
subfamily given each subfamily’s total CpG density. The distribution of observed over expected (O/E) ra-
tios of DM CpG occupancy inside retrotransposon subfamilies is notably shifted to the right, indicating that
almost all retrotransposon subfamilies are more differentially methylated between species than expected by
chance (Figure 3.8E). By filtering for subfamilies with at least 50 DM CpGs and O/E ratio of at least 2x, we
saw almost exclusive enrichment in RLTR10 and RLTR20 subfamilies (Supplemental Table B.4), notably in
the same broad group of ERVs (ERV2) as the retrotransposons bearing species-specific enhancers identified
previously [Chuong et al., 2013].
Despite our observation of species-specific methylation profiles in retrotransposons, the impact of the
globally lowered methylation state on retrotransposon activity in the placenta remains poorly understood.
The observed pattern of hypomethylation in specific retrotransposons stands in stark contrast to non-placental
tissues, where retrotransposons are usually methylated [Bestor, 2000]. This elevated tolerance to retro-
element hypomethylation and expression may be required for placenta-specific phenomena, such as previ-
ously identified exaptation events of specific retrotransposons by the placenta to evade the maternal immune
system [Feschotte and Gilbert, 2012, Mi et al., 2000] or co-option of certain retrotransposon subfamilies
as placenta-specific enhancer elements [Chuong et al., 2013]. In contrast to the study by Chuong et al.
[Figure 3.10 image: panels A-C. %mCpG for autosomes (Auto) vs. chrX in Male Whole and Female Whole samples; CpG island %mCpG in JZ vs. LZ for Male Islands and Female Islands; CpG island methylation for Autosomes, Male X, and Female X.]
Figure 3.10: DNA methylation compared between male and female samples. (A) Weighted mean methy-
lation level in autosomes and X-chromosome(s) for male and female samples. (B) Average CpG island
methylation level by layer in male and female samples. (C) Average CpG island methylation in autosomes,
males, and female samples.
[2013], which identified placenta-specific enhancer elements at mouse-specific retrotransposons not present
in the rat, we identified differential methylation in retrotransposons that are present in all species studied.
This could represent more recent adaptations of the placental regulatory program, although further study is
needed.
Though selection to maintain genome integrity in extra-embryonic tissues will certainly be lower than in
the embryo or its germline, too much retro-element expression still represents a potential danger to genome
integrity. Retro-element hypomethylation may be possible due to the redundant nature of mechanisms for
retrotransposon silencing [Aravin et al., 2007, Reichmann et al., 2013], allowing their sequences to act as
enhancers for nearby genes while limiting their transcriptional activity. In turn, differentially methylated
subfamilies may help fuel the rapid diversification of placenta-specific regulatory networks. Although the
mechanisms of retrotransposon-derived enhancers have been studied previously [McDonald et al., 1997,
Ruda et al., 2004], further study is needed to explore the direct impact of differential retrotransposon methy-
lation state between species on the transcription of nearby orthologous genes, and to understand what, if
any, role retrotransposon-mediated transcription may play in the human placenta.
3.5 Progressive PMD formation in the mouse placenta
As described previously in the Background chapter, PMDs are megabase-scale stretches of the genome with
consistently low methylation relative to the background of genome-wide equilibrium methylation level.
They were first observed in human immortalized cell lines [Lister et al., 2009] and later found to be present
in cancer [Berman et al., 2012, Hansen et al., 2011] and then observed in human placenta methylomes
[Schroeder et al., 2013]. In contrast to a prior study that reported the absence of PMDs in mouse placenta
[Schroeder et al., 2015], we identified a highly reproducible segmentation of the methylome in all three
mouse species into background and PMD regions using the HMM approach described in Song et al. [2013]
(Figure 3.12A). This method segments the methylome based on consecutive observations of weighted av-
erage methylation levels inside 1kb bins, reducing the effect of local hypermethylation introduced by regu-
latory regions such as gene promoters or enhancers. We compared the previously reported mouse placenta
methylome to our own data and found PMDs covering 26.5% of the genome and reaching similar in-PMD
methylation levels to our own using the same identification technique, with a bin size of 20kb to compensate
for substantially lower coverage than our own whole placental samples (Figure 3.12A).
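The binning step that precedes segmentation can be illustrated with a minimal Python sketch. It computes only the weighted average methylation level per fixed-width bin (methylated reads divided by total reads within the bin) and does not implement the HMM itself; the function name and the read counts are hypothetical.

```python
from collections import defaultdict


def binned_weighted_methylation(cpgs, bin_size=1000):
    """Weighted methylation level per bin on one chromosome.

    cpgs: iterable of (position, methylated_reads, total_reads).
    Returns a dict: bin start -> methylated reads / total reads in the bin.
    """
    methylated = defaultdict(int)
    total = defaultdict(int)
    for pos, meth_reads, total_reads in cpgs:
        bin_index = pos // bin_size
        methylated[bin_index] += meth_reads
        total[bin_index] += total_reads
    return {b * bin_size: methylated[b] / total[b]
            for b in sorted(total) if total[b] > 0}


# Hypothetical CpG read counts on a single chromosome.
toy_cpgs = [(100, 3, 4), (900, 1, 6), (1500, 0, 5), (2100, 8, 10)]

fine_bins = binned_weighted_methylation(toy_cpgs, bin_size=1000)
coarse_bins = binned_weighted_methylation(toy_cpgs, bin_size=20000)
```

Widening the bin pools more reads per observation, which is the reason a larger bin size compensates for lower coverage, at the cost of coarser domain boundaries.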
Placental PMDs are located in gene poor regions and exist in both layers of the placenta (Figure 3.11A).
Taking the union of PMDs across all inter-specific samples, we observed an overlap with 8,828 gene pro-
moters compared with an expected 12,025, given the size of the genome and assuming a hypergeometric
distribution of the overlaps. While PMD locations stayed generally constant, the overall fraction of the
genome inside PMDs varied substantially, from nearly absent in the MPB strain to extremely prevalent in
the DOT strain (Figure 3.11B). For this interspecific data, the collection method for the whole placenta sam-
ples allowed the age of the embryo at dissection to vary up to 5 days (see Methods), and we observed an
order-of-magnitude decrease in embryonic weight in the MPB strain that coincided with an earlier develop-
mental stage and absent PMDs (Supplemental Table B.5). Therefore we hypothesized that PMDs gradually
appear over developmental time in the placenta. In the intraspecific data, analysis of PMD size between
developmental timepoints in late gestation revealed an increase in the size of PMDs between E15 and E18
(p < 0.032) (Figure 3.11C). While PMDs widened over time, the methylation level inside PMDs did not
show significant differences between timepoints (p < 0.47). The methylation levels outside of PMDs
increased slightly but not significantly, consistent with the observed global methylation levels of the two
timepoints. Corroborating the observations reviewed in Novakovic and Saffery [2013], the Dnmt1 promoter
is hypomethylated in mouse placenta at all timepoints, in all layers, and in all species, suggesting that PMD
formation in the mouse placenta is likely not driven by differential expression of Dnmt1.
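The comparison of observed and expected promoter overlap with PMDs earlier in this section rests on the mean of a hypergeometric distribution. The following Python sketch shows that calculation on a deliberately small, hypothetical example; the unit of the partition (here, equal-sized genomic bins), the function names, and all counts are assumptions and do not reproduce the values reported above.

```python
from math import comb


def expected_overlap(n_draws, n_success, n_total):
    """Mean of a hypergeometric distribution: n * K / N."""
    return n_draws * n_success / n_total


def lower_tail(observed, n_draws, n_success, n_total):
    """P(X <= observed) for X ~ Hypergeometric(N=n_total, K=n_success, n=n_draws)."""
    denominator = comb(n_total, n_draws)
    numerator = sum(comb(n_success, k) * comb(n_total - n_success, n_draws - k)
                    for k in range(observed + 1))
    return numerator / denominator


# Hypothetical setup: 1,000 genomic bins, 400 of them inside PMDs,
# 50 promoter-containing bins, 10 of which fall inside PMDs.
n_total, n_in_pmd, n_promoters, observed = 1000, 400, 50, 10

expected = expected_overlap(n_promoters, n_in_pmd, n_total)
p_depletion = lower_tail(observed, n_promoters, n_in_pmd, n_total)
```

An observed count well below the expectation, with a small lower-tail probability, corresponds to the depletion of promoters inside PMDs described in the text.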
The globally lower methylation level in M. spretus compared to the other two species led to lower
average methylation inside PMDs (p < 0.0006) (Figure 3.11D) but similar PMD depth. Interestingly, despite
the lack of PMDs in MPB leading to large overall methylation differences between those and the other M.
musculus samples (Figure 3.11E), CpG island methylation remained extremely close regardless of PMD
presence. This suggests that CpG islands remain under direct regulation even inside PMDs (Figure 3.11F).
Average methylation levels for CpG islands inside PMDs were slightly elevated in our mouse placenta sam-
ples, corroborating the finding in [Schroeder et al., 2013]. However, few CpG islands inside placental PMDs
displayed methylation above 80% (Figure 3.12B), while methylation of CpG islands inside cancer PMDs
regularly exceeds 80% [Hansen et al., 2011, Toyota et al., 1999].
To function properly, the placenta must invade and integrate with maternal tissues, a process that shares
some similarities to the invasive behaviors of some cancers [Novakovic and Saffery, 2013]. Partially methy-
lated domains exist in the mouse placenta, are absent in our smallest, developmentally young embryos, and
widen between E15 and E18, suggesting that they arise as a function of developmental time. These PMDs
share conserved locations across species and layers, and are found in the same gene-poor regions as in cancers.
PMDs correlate with late replicating domains in human [Berman et al., 2012], and therefore may arise in
both cancer and placental cells as a consequence of rapid cell division outpacing the maintenance of methy-
lation in these regions. CpG island hypermethylation is a hallmark of cancer methylomes and is enriched
within cancer PMDs. Mouse placenta PMDs show no CpG island hypermethylation of the type reported in
the PMDs of cancer methylomes [Berman et al., 2012]. Further studies are required to determine if PMDs
have any significance in placental function. However, this shared feature of placental and cancer methy-
lomes is striking and any model to explain PMDs will be more appealing if its explanatory power extends to
both cancer and placenta epigenomes. See Chapter 5 for an extension of analysis to include cancer samples
as well as the placenta.
3.6 Post-study analysis of effect of reference genome on global methylation
levels
While all analysis done in the above study was performed on raw reads mapped to the mm10 reference
genome, recent publications have shown that divergence in sequence between mouse strains can have sub-
stantial effects on the observed methylation level, most often resulting in directional loss of methylation
caused by C-to-T mutations in the strain being interpreted as hypomethylation during mapping to a distant
reference [Wulfridge et al., 2016].
A few months after our publication, UCSC and the Mouse Genomes Project jointly developed a trackhub
including a multiple alignment of several mouse strains, including those used in our study. To improve
future analysis of our data and mitigate any strain bias in methylation, we remapped each sample from our
publication to its closest reference according to this trackhub. For some samples (specifically the intraspecies
dataset, which was C57BL/6J) mm10 was the closest reference. For the interspecific dataset, we mapped M.
m. musculus samples to PWK, M. m. domesticus to WSB, and M. spretus to SPRET.
Mapping samples to their closest references improved the mappability and eliminated the observed
bias in methylation levels (Supplemental Table B.6). Following mapping, we used the bigMAF multiple
alignment file provided by the trackhub to create one-to-one CpG liftover indices from the new reference
genomes to mm10 and lifted over the methylation information using methpipe's fast-liftover tool.
[Figure 3.11 image omitted. Recovered labels: browser view of chr8 (qA1.1-qE1, mm10, 10 Mb scale) with RefSeq gene density; samples MPB, SFM, STF, MBS, BIK, and DOT (replicates 1 and 2) grouped by species (Domesticus, Musculus, Spretus) and layer (Junctional, Labyrinthine); fraction of genome in PMDs by species and at E15 and E18; %mCpG inside and outside PMDs for Spretus and non-Spretus samples; genome-wide distance and CpG island distance heatmaps with color keys.]
Figure 3.11: Partially methylated domains exist in mouse placenta and are spatially conserved across
species. (A) Methylation levels (yellow barplots) and identified PMD locations (grey boxes) in a selected
genomic interval; genes shown in blue. (B) Genomic fraction covered by PMDs in whole placenta sam-
ples. (C) Genomic fraction covered by PMDs at E15 and E18 in intra-specific samples. (D) Distributions of
CpG methylation levels inside and outside PMDs in M. spretus and non-M. spretus samples. (E) Pairwise
distances between genome-wide methylation profiles (average level in 1kb bins). (F) Pairwise distance by
species in CpG islands.
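[Figure 3.12 image omitted. Recovered labels: browser tracks on chr8 (qA1.1-qE1, mm10, 50 Mb scale) for MethBase ESCs, the Domesticus, Spretus, and Musculus placenta samples, and Schroeder et al. 2015; CpG island methylation distributions (0.0 to 1.0) for samples BIK2380P1, BIK2380P2, DOT2381P2, DOT2381P5, MBS1852P5, MBS1852P8, MPB1574P1, MPB1574P2, SFM1923P4, SFM1923P5, STF1955P2, and STF1955P3.]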
Figure 3.12: PMDs in the mouse placenta methylome are conserved across projects. (A) UCSC Browser
plot of methylation levels on a large portion of chromosome 8 for ESCs, our species placenta data, and the
mouse placenta data from Schroeder et al. 2015. (B) CpG island methylation distributions for inside and
outside of PMDs.
Chapter 4
Improved methods for identification and analysis of partially methylated
domains
Partially methylated domains (PMDs) are a hallmark of epigenomes that have undergone significant mitotic
replication, including cancer cells, the placenta, and cultured cell lines. Existing methods for deciding
whether PMDs exist in a sample, and for identifying them, are few, are often tailored to specific biological
questions, and require high-coverage samples for accurate identification. Additionally, the presence of PMDs
complicates the identification of shorter, kilobase-scale hypomethylated regions (HMRs) typically asso-
ciated with gene regulation. In this chapter, I describe improved methods for identification of PMDs in
different situations, including across samples with substantially different sequencing depths, methylation
array data, reduced representation bisulfite sequencing data, and identification of consensus PMDs across
several samples. Additionally, I describe a method for identification of HMRs in PMD-containing methy-
lomes, which has historically been a difficult task owing to the obfuscating effect of partial methylation on
the HMM learning process.
In this chapter, I present improved segmentation methods used to identify PMDs and HMRs by making
modifications to the methods from Methpipe. Our PMD identification method now chooses a bin size to
fix the amount of information at each step in the Markov chain across samples, mitigating the discretizing
effect of low coverage and allowing for direct comparison of PMD attributes across samples with varying
sequencing depth. After segmentation, we improved our precise boundary selection by identifying the par-
tition that maximizes the joint likelihood of the in-PMD portion coming from the learned in-PMD emission
distribution and the out-PMD portion coming from the learned out-PMD emission distribution. When iden-
tifying HMRs, we pass the posterior probabilities of each CpG belonging to a PMD as an argument and train
two separate models, facilitating comparison of properties like size and depth of HMRs inside and outside
of PMDs within a sample.
Our study introduces improved methods for whole-methylome segmentation at multiple resolutions, and
uses them to improve our understanding of PMD formation, associations, and function. The expansion of
our PMD identification to very low coverage samples increases the possibility of using PMDs in a clinical
setting where high-replicate, low-coverage sample schemes are most cost-effective, and the ability to iden-
tify HMRs in PMD-containing samples allows us to more effectively explore the consequences of PMDs on
the methylation state of regulatory elements.
4.1 Towards a functional definition of partially methylated domains
Previous studies have frequently pointed out how easily one can see PMDs when visualizing DNA methy-
lation along chromosomes, assuming the scale is appropriate (e.g. several megabases). Unfortunately, to
date PMDs have no functional definition. Statistical definitions based on methylation data have, in almost
all cases, been tied to procedures for identifying the PMDs. Here we outline a set of necessary properties
of PMDs that capture the key ideas in previous studies. Importantly, these properties make no reference to
genome annotations or properties of the underlying DNA sequence. This is important in avoiding biases in
subsequent analysis of PMDs, for example in identifying features that correlate with PMDs. We claim that
the following properties are essential to defining PMDs:
• they have a lower methylation level than the rest of the genome;
• they cover a fraction of the genome that is distinctly larger than the fraction associated with regulatory
  features (e.g. CGIs, promoters, enhancers, etc.);
• and they are organized as contiguous genomic intervals whose size is distinctly larger than the size of
  the aforementioned regulatory features.
The reduced methylation level in PMDs has been assumed by all previous studies. The description of
PMDs as being “partially” methylated can be misleading, as we will show, but is often the case since the
rest of the genome tends to be highly methylated in most mammalian cells. Regarding the total fraction of
the genome covered by PMDs, previous studies have reported a range of 20-65% of the genome covered.
Despite the differences in methods used to obtain these numbers, the scale is consistent. The organization
of PMDs as contiguous intervals having a particular size distribution has also been common to all previous
studies. This is apparent in the use of large bins (e.g. 10kb; [Berman et al., 2012] and 20kb; [Schroeder
et al., 2015]) and in the care taken for these intervals not to be fragmented by features like CpG islands
[Schroeder et al., 2011].
Existing reports have found substantial overlap between the portions of the genome covered by PMDs
in different methylomes [Berman et al., 2012, Lister et al., 2011]. Should it result from a shared underlying
cause, it would be desirable for our definition to recapitulate this concordance without forcing it. At the same
time, the definition of PMDs should form the basis for determining whether or not a methylome contains
PMDs. For a variety of reasons, large intervals of reduced DNA methylation exist in methylomes that, on
a global level, do not appear to have PMDs. Examples include the HOX clusters in ESCs, which seem to
be covered by a cluster of small intervals with well defined break-points, and pericentromeric satellites in
human sperm [Molaro et al., 2011].
Previous studies have applied their PMD detection methods to all samples of interest and then used the
features of their segmentation to decide whether each sample contained PMDs or not. For example, Lister
et al. [2011] observed an order of magnitude difference in the total length of segmentation between the cell
lines with and without PMDs, and used that to determine which lines had PMDs. Similarly, Berman et al.
[2012] claimed PMDs are absent from hESCs and primary normal colon based on a low fraction of the
genome found to be covered by PMDs using their sliding window approach. While this method of deciding
whether PMDs exist in a sample is inherently circular, the clear divide between segmentations
in PMD-containing and non-PMD-containing methylomes has allowed it to endure as a reliable decision
criterion, and it is the approach we take in subsequent sections when deciding whether or not a sample contains
PMDs.
4.2 Improving PMD identification through dynamic bin size selection
In this chapter, I present an improved PMD identification method based on methpipe’s original PMD
detection method [Song et al., 2013] and, in the next chapter, apply it to a large number of samples. This
method is a two-state Hidden Markov Model (HMM) that segments non-overlapping bins of the genome
into one of two states: PMD or background. The distributions of methylation levels in these two states
are modeled with beta-binomial distributions and the transition and emission parameters of the model are
learned through the Baum-Welch algorithm. PMDs are segmented via posterior decoding and boundaries
are heuristically sharpened to single-basepair resolution. Lastly, false discovery correction filters out small
PMDs by segmenting a shuffled version of the methylome and comparing the size distribution of the result-
ing segments with the unshuffled segments.
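To make the emission model concrete, the following is a minimal sketch of a beta-binomial log-likelihood for the read counts in one bin, using only the Python standard library. The function names and the example parameter values are illustrative assumptions and are not taken from methpipe.

```python
from math import lgamma


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_logpmf(meth, total, alpha, beta):
    """Log-probability of observing `meth` methylated reads out of `total`
    reads in a bin under a beta-binomial with shapes (alpha, beta)."""
    log_choose = lgamma(total + 1) - lgamma(meth + 1) - lgamma(total - meth + 1)
    return (log_choose
            + log_beta(meth + alpha, total - meth + beta)
            - log_beta(alpha, beta))


# Hypothetical emission parameters (assumed for illustration only):
# a PMD state centred near 0.5 and a highly methylated background state.
pmd_params = (5.0, 5.0)
background_params = (9.0, 1.0)

# Compare how well each state explains a bin with 20 of 40 reads methylated.
ll_pmd = betabinom_logpmf(20, 40, *pmd_params)
ll_background = betabinom_logpmf(20, 40, *background_params)
print(ll_pmd, ll_background)
```

In an HMM these per-state log-likelihoods would feed the forward-backward recursions used by Baum-Welch training and posterior decoding.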
A key technical challenge in accurate comparison of PMDs across many samples from different studies
is the bias introduced by variable sequencing depth. As sequencing depth decreases, the fraction of bins in
the genome with methylation observations decreases, and the accuracy of the methylation estimation inside
those bins deteriorates as the observed value discretizes. A natural approach to reducing this bias is to vary
the size of these binned regions on a sample-by-sample basis to equalize the amount of information in each
bin used to do the segmentation. We defined the minimum amount of “sufficient” information in a bin as
40 observations, whether that is 40 observations of a single CpG or single observations of 40 CpGs. Forty
observations correspond to an 80% confidence interval around an observed methylation level of 50% that
ranges from 39% to 61%, which allows us to safely interpret the methylation level in a bin as at least being
low, medium, or high. Before segmentation occurs, we choose the minimum bin size that yields at least 40
observations in 80% of all bins genome-wide. We observed that the majority of human samples in MethBase
were sequenced deeply enough to meet this criterion with the default bin size of 1,000 basepairs. However,
most PMD-containing mouse samples were sequenced shallowly enough that bin size adjustment positively
affected our segmentation. Figure A.1 shows the mean CpG coverage depth by selected bin size for all
PMD-containing samples included in the next chapter.
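The two numerical ideas in this paragraph can be sketched in a few lines of Python. The first function computes an exact (Clopper-Pearson) binomial interval by bisection; the text does not state which interval was used, so this choice is an assumption, although it gives a range close to the one quoted for 20 of 40 observations. The second picks the smallest bin size meeting the 40-observation criterion in 80% of bins. The candidate sizes, step, 0-based positions, and single-chromosome input are hypothetical simplifications.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def bisect_increasing(g, target, tol=1e-9):
    """Find p in [0, 1] with g(p) == target for an increasing function g."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_interval(k, n, level=0.80):
    """Clopper-Pearson interval for k successes out of n trials."""
    tail = (1 - level) / 2
    lower = 0.0 if k == 0 else bisect_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), tail)
    upper = 1.0 if k == n else bisect_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - tail)
    return lower, upper


def fraction_sufficient(positions, coverages, chrom_length, bin_size, min_obs=40):
    """Fraction of bins holding at least `min_obs` methylation observations."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    totals = [0] * n_bins
    for pos, cov in zip(positions, coverages):
        totals[pos // bin_size] += cov
    return sum(t >= min_obs for t in totals) / n_bins


def choose_bin_size(positions, coverages, chrom_length,
                    start=1000, step=500, max_size=50000,
                    min_obs=40, min_frac=0.8):
    """Smallest candidate bin size for which `min_frac` of bins are sufficient."""
    size = start
    while size <= max_size:
        frac = fraction_sufficient(positions, coverages, chrom_length, size, min_obs)
        if frac >= min_frac:
            return size
        size += step
    return max_size


# Toy chromosome: one CpG every 100 bp, each covered by 2 reads.
toy_positions = list(range(0, 100000, 100))
toy_coverages = [2] * len(toy_positions)
print(exact_interval(20, 40))
print(choose_bin_size(toy_positions, toy_coverages, 100000))
```

Empty bins count against the 80% threshold in this sketch; whether the original program treats them the same way is not stated in the text.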
Figure 4.1A shows the performance of our dynamic bin size method against the fixed bin approach pub-
lished in Methpipe and MethylSeekR’s PMD function [Burger et al., 2013] (see the Background section on
PMDs for more detail on MethylSeekR’s PMD identification approach). In the first example, all three sam-
ples are highly covered and therefore the fixed and dynamic bin estimates are very similar and perform well,
with the beta-binomial emission distributions performing flexibly enough to model the very low methyla-
tion levels of Calu1 PMDs and subtle methylation differences of the healthy liver sample. MethylSeekR
performs well when methylation levels inside PMDs are near 0.5 and arguably when they are very subtly
lower than the background, as in liver, but struggles when PMD methylation levels are very low, as in the
Calu1 lung cancer cell line.
Because we don’t know the underlying process involved in generating PMDs, it is impossible to compare
methods against a ground truth, simulated or otherwise. However, we can explore the ability of each method
to segregate PMDs identified in a cell type specific manner. Figure 4.1B and C show the Jaccard indices
between pairs of tumor samples from The Cancer Genome Atlas (TCGA) for each of the three methods.
[Figure 4.1 image omitted. Recovered labels: browser tracks on chr6 (hg19, 25,000,000 to 65,000,000, 10 Mb scale) for IMR90, healthy liver, and Calu1 lung with RefSeq genes and PMD calls from Methpipe (dynamic bin), Methpipe (fixed bin), and MethylSeekR; within versus between cancer-type Jaccard index distributions for the dynamic bin, MethylSeekR, and fixed bin methods (p<1.9e-14, p<2.0e-14, p<1.5e-07, in the order extracted); Jaccard index heatmaps for uterine, lung, brain, digestive, breast, and bladder tumors; GM19204 tracks at downsampling percentages 0%, 50%, and 90% for each method; Jaccard index with the full sample against mean CpG coverage depth for dynamic and fixed bins.]
Figure 4.1: (A) Comparison of PMD estimates from all three methods on chromosome 6 for three samples
with substantial variation in PMD depth. (B) Pairwise Jaccard index distributions for within-cancer-type
and between-cancer-type pairs of PMD sets identified using each method. (C) Pairwise Jaccard index heatmaps
of TCGA cancer sample PMDs identified by each method. (D) Effect of downsampling and low coverage
on PMD estimates. (E) Jaccard index of downsampled PMD estimates against full-coverage sample for all
human PMD-containing samples.
While MethylSeekR has the highest mean pairwise Jaccard index, the dynamic bin method shows the largest
difference in pairwise Jaccard index for pairs of PMD sets coming from the same cancer type vs different
cancer types. This could prove useful in downstream classification of cell-free or circulating tumor cell
DNA into its originating tumor type based on their PMD profiles.
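The Jaccard index between two PMD sets can be sketched as follows. The text does not spell out how the index is computed, so this example assumes a basepair-level definition (bases in both sets divided by bases in either set) on half-open intervals from a single chromosome; the function names are hypothetical.

```python
def merge(intervals):
    """Merge overlapping half-open (start, end) intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total_length(intervals):
    """Number of basepairs covered by a set of intervals."""
    return sum(end - start for start, end in merge(intervals))


def jaccard(pmds_a, pmds_b):
    """Basepair-level Jaccard index of two interval sets."""
    union = total_length(list(pmds_a) + list(pmds_b))
    if union == 0:
        return 0.0
    # Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
    intersection = total_length(pmds_a) + total_length(pmds_b) - union
    return intersection / union


sample_a = [(0, 100), (200, 300)]
sample_b = [(50, 250)]
print(jaccard(sample_a, sample_b))
```

For genome-wide comparisons the same calculation would be accumulated over chromosomes before taking the ratio.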
Figure 4.1D shows the performance of each method in extremely low coverage situations by downsampling.
We took a sample from MethBase with very low sequencing depth (50% of CpGs covered to 1x
depth) and randomly downsampled observations to 90, 80, ..., 10% of the original levels. According to its manual,
MethylSeekR is not recommended for use on samples below 10x sequencing depth, and it performs poorly as
a result. The fixed bin methpipe method performs well at 1x sequencing depth but suffers from significant
erosion of PMD estimates as we downsample. Our method yields consistent PMDs even
in the face of extremely low coverage, facilitating future study of PMDs in high-throughput, low-coverage
sequencing experiments.
To further show that our variable bin size selection improves the stability of our PMD estimates, we
downsampled methylation observations from several samples, segmented the downsampled methylomes
using both a fixed 1kb bin size and dynamically selected bin size, and computed the Jaccard index with
the full-coverage PMD segmentation (Figure 4.1E). For each sample, we downsampled 9 times: randomly
selecting 10%, 20%, ..., 90% of the methylation observations to retain. The resulting plots show that our dy-
namic bin size selection significantly improves PMD stability in very low coverage samples by maintaining
the segmentation closer to what would be achieved with higher sequencing depth. Interestingly, even with
the updated method, we fail to identify accurate PMDs in two samples with the shallowest PMDs (human
liver and human H1 mesenchymal cells) until mean CpG sequencing depth exceeds 10x, possibly explaining
why PMDs were not observed in these samples until recently.
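A downsampling step of this kind can be sketched as below. This version keeps each read independently with the target probability, which retains the target fraction only in expectation; the text does not say whether the original procedure drew an exact fraction, so that is an assumption, as are the function names and the (methylated, total) count representation.

```python
import random


def downsample_site(meth, total, keep_fraction, rng):
    """Thin the reads at one CpG, keeping each read with probability
    `keep_fraction`. Returns (methylated, total) counts after thinning."""
    kept_meth = sum(rng.random() < keep_fraction for _ in range(meth))
    kept_unmeth = sum(rng.random() < keep_fraction for _ in range(total - meth))
    return kept_meth, kept_meth + kept_unmeth


def downsample_methylome(counts, keep_fraction, seed=0):
    """Apply the same thinning to every (methylated, total) pair."""
    rng = random.Random(seed)
    return [downsample_site(m, n, keep_fraction, rng) for m, n in counts]


toy_counts = [(8, 10), (1, 12), (0, 0), (5, 5)]
for percent in range(10, 100, 10):
    thinned = downsample_methylome(toy_counts, percent / 100)
    print(percent, thinned)
```

Each thinned methylome would then be segmented and compared with the full-coverage segmentation.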
4.3 An improved method for identifying and scoring optimal PMD boundaries
Identification of precise PMD boundaries and quantification of their strength facilitates downstream analysis
of strong vs weak boundary correlates and conservation of boundary strength across samples, making it a
very desirable metric to have.
A natural candidate for scoring PMD boundaries is the posterior transition probability, which is directly
calculable for each bin in the PMD detection program and reports the probability of transitioning into or out
of a PMD at each bin. However, these transition posteriors are subject to the arbitrary partitioning of the
genome into non-overlapping blocks and therefore can suffer from technical biases related to where the true
PMD boundary is in relation to the two bins the posterior is calculated for. A boundary score should be
calculated relative to the true boundary position.
We therefore first improved our precise boundary location method. We know that the true boundary
for each PMD lies somewhere inside the two bins where a maximally probable transition took place during
posterior decoding. For each such region, there is a set of CpG sites $C = \{C_1, \ldots, C_n\}$ with coverages
$N_1, \ldots, N_n$ and numbers of methylated observations $M_1, \ldots, M_n$ that are candidate boundaries. The goal
is to pick a boundary site $C_k$ from these sites that splits them into a left bin and a right bin. Using the weighted
methylation levels for the left and right bins, respectively, we aim to select the site $k$ that maximizes a score
representing the probability that the site is the true boundary.
Our method uses the learned PMD (foreground) and non-PMD (background) emission distributions, with
parameters $(\alpha_f, \beta_f)$ and $(\alpha_b, \beta_b)$, and calculates the joint likelihood of each side of the boundary
coming from its respective emission distribution. For all values of $k$, the likelihood was computed using
$M_L = \sum_{i=1}^{k} M_i$, $M_R = \sum_{i=k+1}^{n} M_i$, $N_L = \sum_{i=1}^{k} N_i$, $N_R = \sum_{i=k+1}^{n} N_i$, $p_L = M_L / N_L$ and $p_R = M_R / N_R$:

\[
\begin{aligned}
L(\alpha_f, \beta_f, \alpha_b, \beta_b \mid p_L, p_R)
  &= P(p_L, p_R \mid \alpha_f, \beta_f, \alpha_b, \beta_b)
   = P(p_L \mid \alpha_b, \beta_b)\, P(p_R \mid \alpha_f, \beta_f) \\
  &= \left( \frac{p_L^{\alpha_b - 1} (1 - p_L)^{\beta_b - 1}}{B(\alpha_b, \beta_b)} \right)
     \left( \frac{p_R^{\alpha_f - 1} (1 - p_R)^{\beta_f - 1}}{B(\alpha_f, \beta_f)} \right)
\end{aligned}
\]

where $B(\cdot, \cdot)$ is the Beta function. We considered using the maximized joint likelihood directly as
a boundary quality metric, but depending on where the partition is, the maximum value of the likelihood
may be biased (as in the case of all but one CpG belonging to the PMD or non-PMD state, respectively).
To overcome this, we optimized boundaries first and then, for each boundary, took one bin on either side and
calculated the joint likelihood using that equal weighting.
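The boundary search described above can be sketched as a scan over candidate split points. This is an illustration under stated assumptions, not the methpipe implementation: the function names are hypothetical, the left side is modelled as background and the right side as PMD (the orientation written in the likelihood above), pooled levels are clamped away from 0 and 1 so the beta log-density stays finite, and the example parameters are invented.

```python
from math import lgamma, log


def log_beta_pdf(p, alpha, beta, eps=1e-6):
    """Log-density of Beta(alpha, beta) at p, with p clamped to (eps, 1 - eps)."""
    p = min(max(p, eps), 1 - eps)
    log_b = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)
    return (alpha - 1) * log(p) + (beta - 1) * log(1 - p) - log_b


def best_boundary(meth, cov, bg_params, fg_params):
    """Return (k, log_likelihood) for the split that maximizes the joint
    likelihood: sites 1..k are pooled as background, k+1..n as PMD."""
    n = len(meth)
    best_k, best_ll = None, float("-inf")
    for k in range(1, n):
        n_left, n_right = sum(cov[:k]), sum(cov[k:])
        if n_left == 0 or n_right == 0:
            continue  # no observations on one side; level undefined
        p_left = sum(meth[:k]) / n_left
        p_right = sum(meth[k:]) / n_right
        ll = log_beta_pdf(p_left, *bg_params) + log_beta_pdf(p_right, *fg_params)
        if ll > best_ll:
            best_k, best_ll = k, ll
    return best_k, best_ll


# Toy region: five highly methylated sites followed by five lowly methylated ones.
toy_meth = [9] * 5 + [2] * 5
toy_cov = [10] * 10
k, ll = best_boundary(toy_meth, toy_cov, bg_params=(9.0, 1.0), fg_params=(2.0, 8.0))
print(k, ll)
```

For a boundary leaving a PMD, the same scan would be run with the two parameter pairs swapped.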
In order to get a joint likelihood for each boundary that is comparable across the genome, it is important
that the likelihoods are not influenced by coverage. Therefore, for each boundary, we calculate the joint likelihood
assuming a beta distribution rather than a beta-binomial, and report the coverage for the bin with less coverage
as well as the likelihood as a “certainty score.” This helped downstream to know whether we could trust a
boundary value, and allowed us to compare the quality of boundaries without worrying about differences in
sequencing depths affecting the likelihood.
4.4 PMD identification in methylation array data
Identification of PMDs has, in the past, been limited to sequencing-based methylation assays. However,
knowing whether a sample contains PMDs or not and where they are has clinically relevant uses and arrays
are still frequently used in this setting, owing to their low cost and ability to confidently assay many important
regulatory regions (see Chapter 2 for more details). In the past, PMD identification, or even deciding
whether PMDs are present, has been prohibitively difficult in microarray data due to the paucity of probes in
intergenic regions and gene bodies. Recently, Illumina produced an updated methylation assay called the MethylationEPIC array, which
contains over 850,000 probes that are less focused on regulatory regions and more amenable to potential
PMD detection.
There are a few special considerations when identifying PMDs in array data. First, due to the lack of
any concept of coverage, properly normalized methylation levels in array data should be modeled as beta
distributed, rather than beta-binomial. Second, each EPIC dataset has a fixed number of probes in fixed
locations, and anywhere with a probe can be considered a high-confidence observation; therefore no binning
is necessary, though some binning may help smooth the methylation levels learned with respect to regulatory
hypomethylation. The key challenge to identification of PMDs in EPIC data is the sparsity of these probes
along intergenic regions of the genome that may or may not contain PMDs.
To combat sparsity in the sequencing case, we use a concept of a “desert size.” If the distance between
two CpGs or probes is too great in absolute basepairs, we consider their methylation levels to be independent,
which violates the Markov property, and we separate these positions into two separate Markov chains. In sequencing
data, we use a fairly conservative desert size (set at 5 times the bin size) but in array data, this desert size can
be chosen more liberally to maximize the similarity between matched WGBS and EPIC segmentations, and
due to the fixed locations of probes can be reused for new EPIC samples. To do this, we utilized a dataset
with matched WGBS and EPIC samples that both do and do not contain PMDs [Pidsley et al., 2016]. For the
Human LNCaP data, we fixed the desert size at 100kb and segmented the EPIC data by itself, then computed
the Jaccard index with the matched WGBS sample. We observed decent performance and a clear divide
between the non-PMD-containing and PMD-containing samples (Figure 4.2A) that recapitulated cellular
identity (Figure 4.2B), but caution that the PMDs resulting are still imperfect and are best used only as a
mechanism for determining whether a sample contains PMDs or not (see the next chapter for guidelines on
making this choice).
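The desert-size rule amounts to cutting a sorted list of positions wherever neighbouring sites are too far apart. A minimal sketch follows; the function name is hypothetical and the positions are assumed to be sorted and on one chromosome.

```python
def split_at_deserts(positions, desert_size):
    """Split sorted genomic positions into separate chains wherever the gap
    between neighbouring positions exceeds `desert_size` basepairs."""
    chains = []
    for pos in positions:
        if chains and pos - chains[-1][-1] <= desert_size:
            chains[-1].append(pos)
        else:
            chains.append([pos])
    return chains


probe_positions = [100, 600, 1200, 150000, 150400]
print(split_at_deserts(probe_positions, 100000))  # liberal, array-style setting
print(split_at_deserts(probe_positions, 500))     # stricter setting
```

Each resulting chain would be treated as an independent Markov chain during training and decoding.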
While the resulting segmentation in array data can certainly be used to decide whether a sample contains
PMDs or not, the breakdown of the Markov property limits their accuracy. It is possible to greatly reduce
this breakdown by matching array data with very low coverage WGBS. We believe that a combination of
high-confidence methylation values at regions of regulatory importance, together with sparse methylation
observations along the genome, can maximize the amount of information in a single sample.
4.5 Performance of our PMD identification method on RRBS data
Reduced representation bisulfite sequencing (RRBS) is a cost-effective method for exploration of methyla-
tion levels at CpG-rich regions such as CpG islands and gene promoters (see the Background chapter for
more details). RRBS works by employing a restriction enzyme that specifically cuts at CCGG sequences,
which occur primarily in CpG dense regions. As a result, these low-coverage, low-cost libraries preferen-
tially observe hypomethylated regions of the genome and the global methylation levels are skewed towards
that of regulatory regions.
This bias is a cost-saving advantage for many applications, but is a hindrance in the identification of
PMDs. Here, we assessed the performance of the fixed bin, dynamic bin, and MethylSeekR methods on the
identification of PMDs in RRBS data using HCT116 cell line WGBS [Blattler et al., 2014] and RRBS [Akalin et al.,
2012] libraries from two different studies. Figure 4.3A shows the total basepairs segmented in the RRBS
sample compared to the dynamic bin PMDs from the WGBS library, which were used as a gold-standard.
While MethylSeekR segmented close to the same amount of the genome, the dynamic bin method identified
locations more consistent with the gold-standard PMDs (Figure 4.3B and C).
4.6 Consensus segmentation of multiple samples
Both our HMR and PMD segmentation methods have been updated to take several samples as input. By
including more samples in our PMD and HMR detection, we are incorporating population-level variation
and more effectively segmenting the “core” PMDs and HMRs for the phenotype of interest.
[Figure 4.2 image omitted. Recovered labels: tracks for array PMDs, WGBS PMDs, and no PMDs/healthy; heatmap rows and columns for Guthrie, PrEC, NAF, CAF, and LNCaP EPIC samples and for PrEC and LNCaP WGBS replicates, with LNCaP and PrEC replicate clusters marked; color key and histogram of values.]
Figure 4.2: (A) UCSC Browser plot showing snapshot of array PMDs compared to WGBS PMDs. (B)
Jaccard index heatmap of segment overlaps for all PMD segments.
[Figure 4.3 image omitted. Recovered labels: total basepairs inside PMDs (0 to 7.5e+08) for WGBS and for RRBS with the MethylSeekR, fixed bin, and dynamic bin methods; heatmap over the same four PMD sets; browser tracks on chr3:54,000,000-74,000,000 (hg19, 10 Mb scale) for Blattler HCT116 (WGBS) and Akalin HCT116 (RRBS) with PMD calls from Methpipe (dynamic bin), Methpipe (fixed bin), and MethylSeekR.]
Figure 4.3: (A) Basepairs segmented into PMDs for HCT116 RRBS using different methods vs WGBS.
(B) Jaccard index heatmap of segment overlaps for PMD segments in RRBS vs WGBS. (C) UCSC genome
browser plot showing PMD estimates for RRBS vs WGBS in a matched cell type.
When multiple samples are provided to either segmentation program, the two-state HMM models each
sample with its own emission distribution while transitions are shared. This assumes that the methylation
levels in each sample come from its own emission distribution, and allows for modeling of differences
across samples due to biological variation, in contrast to shared emission distributions plus the introduction
of technical noise parameters. The program dynamically selects the bin size required for the lowest coverage
sample provided and segments at that resolution.
This mode, because it models each replicate with its own beta-binomial emission distribution, represents
a mechanism by which to explore the biological consensus rather than reduce technical variance, which
would more efficiently be taken care of by learning a single beta-binomial emission with normally distributed
noise introduced by the replicates.
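One way to picture per-sample emissions with shared transitions is that the emission term for a bin in a given state is built from every sample's own beta-binomial. The sketch below sums per-sample log-likelihoods, which additionally assumes the samples are conditionally independent given the hidden state; that assumption, the function names, and the parameter values are illustrative and not taken from the program.

```python
from math import lgamma


def betabinom_logpmf(meth, total, alpha, beta):
    """Beta-binomial log-probability of `meth` methylated out of `total` reads."""
    def log_beta(a, b):
        return lgamma(a) + lgamma(b) - lgamma(a + b)
    log_choose = lgamma(total + 1) - lgamma(meth + 1) - lgamma(total - meth + 1)
    return log_choose + log_beta(meth + alpha, total - meth + beta) - log_beta(alpha, beta)


def joint_log_emission(bin_counts, state_params):
    """Log-emission of one bin in one state across samples.
    bin_counts:   [(meth, total), ...], one pair per sample
    state_params: [(alpha, beta), ...], that state's parameters per sample"""
    return sum(betabinom_logpmf(m, n, a, b)
               for (m, n), (a, b) in zip(bin_counts, state_params))


# Two hypothetical replicates with different PMD depths (assumed parameters).
pmd_state = [(5.0, 5.0), (3.0, 6.0)]
background_state = [(9.0, 1.0), (8.0, 1.5)]
one_bin = [(18, 40), (10, 35)]
print(joint_log_emission(one_bin, pmd_state),
      joint_log_emission(one_bin, background_state))
```

The transition probabilities stay shared across samples, so a single state path is decoded for all of them.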
4.7 Identification of hypomethylated regions in PMD-containing samples
Hypomethylated regions (or HMRs) are kilobase-scale regions of near-complete methylation loss that fre-
quently occur at gene promoters and enhancer regions in non-PMD-containing cell types. The identification
of these regions in PMD-containing methylomes has been challenging in the past, owing to the large vari-
ability of methylation levels inside PMDs leading to detection of spurious segments whose near complete
hypomethylation is more likely due to noise rather than meaningful biological signal.
[Figure 4.4 image omitted. Recovered labels: browser tracks at the CRABP1 locus for sperm, H1ESC, macrophage, IMR90, placenta, colon cancer, and breast cancer (HCC1954).]
Figure 4.4: HMRs can persist in PMD regions. No method to date has attempted to identify these HMRs,
and those that identify HMRs agnostic of PMDs suffer from the high variability of methylation levels inside
PMDs.
MethylSeekR does not attempt to learn HMRs inside PMDs at all, instead opting to segment HMRs
only on the non-PMD region of the genome. Our previous method assumed a single emission distribution
representing the background, composed of both PMD and non-PMD regions. By not separating these back-
grounds, our method overwhelmingly learned large HMRs with intermediate methylation level, and FDR
removed many of the small but true HMRs at gene promoters outside PMDs.
Simply masking the genome into PMD and non-PMD regions and running separate segmentations also
did not work perfectly. Because of the FDR correction method we use during PMD identification, this leads
to some false-negative regions existing in the non-PMD mask and imperfect learning of HMR parameters
on that portion of the genome.
Our new method uses the posterior probabilities of each CpG site belonging to a PMD learned from
the PMD detection method directly. Each CpG site is assigned to the PMD or non-PMD background
state by applying a cutoff of 0.5 to the posterior probability. Then, for each of these two backgrounds, a two-state
HMM is trained to recognize HMRs and that background by weighting the contribution of the other
background as zero. After training, we segment CpG sites from the background of interest into HMRs and
non-HMRs using posterior decoding.
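The assignment step can be sketched as turning PMD posteriors into two sets of 0/1 training weights, one per background. The function names are hypothetical, the handling of a posterior exactly equal to the cutoff is an assumption, and the weighted mean is only a stand-in for how a zero weight removes a site from parameter estimation.

```python
def background_weights(pmd_posteriors, cutoff=0.5):
    """Return (pmd_weights, non_pmd_weights): 1.0 where a CpG is assigned to
    that background and 0.0 otherwise. Ties go to non-PMD (an assumption)."""
    pmd_weights = [1.0 if p > cutoff else 0.0 for p in pmd_posteriors]
    non_pmd_weights = [1.0 - w for w in pmd_weights]
    return pmd_weights, non_pmd_weights


def weighted_mean_level(levels, weights):
    """Mean methylation level using only sites with non-zero weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        return None
    return sum(l * w for l, w in zip(levels, weights)) / total_weight


toy_posteriors = [0.95, 0.80, 0.10, 0.02, 0.60]
toy_levels = [0.50, 0.40, 0.90, 0.95, 0.45]
in_pmd, outside_pmd = background_weights(toy_posteriors)
print(weighted_mean_level(toy_levels, in_pmd),
      weighted_mean_level(toy_levels, outside_pmd))
```

Training one two-state HMM per weight vector then yields separate HMR and background parameters inside and outside PMDs.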
By utilizing the posterior probabilities obtained by segmenting PMDs, we can accurately split the back-
ground of the genome into the PMD and non-PMD emission distributions. This also facilitates explicit
comparison of HMR properties inside and outside of PMDs, including CpG density, size, and methylation
profiles.
Chapter 5
Analysis of partially methylated domains across species and contexts
After producing new methods for identifying PMDs that would be more stable in the face of differing se-
quencing depths, we sought to learn about the formation and function of PMDs by applying our methods
to 267 methylomes from 7 species. The summary statistics of the resulting segmentations allowed us to
precisely define acceptable decision criteria for labeling samples as PMD-containing (PC) or non-PMD-
containing (non-PC). Additionally, the improved PMDs allowed us to put previous results regarding ge-
nomic and epigenomic PMD correlates in perspective as well as explore the evolution of PMD state across
species. PMD boundaries co-located with CTCF-bound sites and gene promoters, with genes displaying
a bias towards transcription away from PMDs in every species studied. Repeat elements were enriched or
depleted in PMDs in a lineage specific manner, with SINE elements including equine repeat element, dog
SINEC elements, mouse B2/B4 elements and primate Alu elements all showing depletion inside PMDs and
LTR elements showing slight primate-specific enrichment inside PMDs. Oocytes showed highly divergent
PMD patterns from other PC samples.
While PMD presence and their within-PMD methylation loss correlated well with replication timing
data, we observed a set of highly methylated genes termed “escapees” that evaded their surrounding PMD
state and remained highly expressed despite their late replication timing. We observed that while most
human-mouse syntenic blocks shared similar PMD composition, blocks with discordant PMD state dis-
played differential expression and differences in their local gene density, suggesting that local genomic
context can influence PMD state in otherwise highly homologous regions between species.
Additionally, we observed slight hypomethylation in non-PC samples at pericentromeric regions that
were commonly PMDs, with the degree of hypomethylation correlating with approximate mitotic age. Using
HMRs in a healthy, primary non-PC lung sample as a guide, we characterized the regulatory methylation
landscape of healthy-conserved HMRs and concurrent rise in sample-specific HMRs as a result of culturing,
tumorigenesis, and tumor adjacency.
5.1 Data-driven refinement of the PMD decision criteria
To discern those methylomes that contain PMDs from those that do not, we applied our improved method for
PMD detection to a wide range of newly sequenced and public methylomes currently curated in MethBase
and TCGA, regardless of whether or not they have been studied in the context of PMDs, and without making
any inferences on whether they should or should not have PMDs (Supplementary Tables B.7 & B.8). Sam-
ples were manually annotated as either healthy or cancer, primary or cultured, and with their approximate
cell type. To distinguish between PMD-containing (PC) and non-PMD-containing (non-PC) methylomes,
we calculated summary statistics based on the properties described in the previous section for each sample.
Figure 5.1A shows the fraction of the genome segmented and the mean size of the segmented regions for
all samples analyzed. In all species, there is an inflection point at which both the size and amount of
segmentation increase. The fraction of the genome segmented jumps abruptly from roughly
5% (which corresponds well to the fraction of base pairs inside regulatory regions such as gene promoters
and CpG islands) to over 10%. At the same time, the mean segment size changes from tens of kilobases to
over a hundred. We used these metrics (fraction segmented > 5%, mean segment size > 50kb) as cutoffs to
distinguish PC methylomes from non-PC methylomes for the remainder of the study.
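The decision rule above can be sketched in a few lines. The function names are hypothetical, segments are assumed to be non-overlapping half-open (start, end) intervals in base pairs, and requiring both cutoffs to be exceeded simultaneously is an assumption about how the two criteria are combined.

```python
from typing import List, Tuple


def segmentation_summary(
    segments: List[Tuple[int, int]], genome_size: int
) -> Tuple[float, float]:
    """Return (fraction of genome segmented, mean segment size in bp)."""
    if not segments:
        return 0.0, 0.0
    total_bp = sum(end - start for start, end in segments)
    return total_bp / genome_size, total_bp / len(segments)


def is_pmd_containing(
    fraction_segmented: float,
    mean_segment_bp: float,
    min_fraction: float = 0.05,     # fraction segmented > 5%
    min_mean_bp: float = 50_000.0,  # mean segment size > 50kb
) -> bool:
    """Assumed rule: a sample is PC only if both cutoffs are exceeded."""
    return fraction_segmented > min_fraction and mean_segment_bp > min_mean_bp


# Toy 1 Mb "genome" with two large segments (illustrative values only).
fraction, mean_bp = segmentation_summary([(0, 200_000), (500_000, 650_000)], 1_000_000)
label = "PC" if is_pmd_containing(fraction, mean_bp) else "non-PC"
```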
After establishing cutoffs, it was clear that there was a substantial difference not only in segment size
and frequency but also location in PC samples and non-PC samples. Figure 5.1B shows a full-chromosome
view of the resulting segmentation for all human PC and non-PC methylomes, and reveals almost inverse
locations for segments correlated heavily with gene density. Some of the largest segments in non-PC samples
came from centromeric regions, which were also segmented in PC samples, and HOX clusters, which were
frequently not segmented in PC samples and are likely the result of extended enhancer activity in those
regions (Supplemental Figure A.2). Almost all PC samples fell into the categories of cancer, cultured cell
line, or placenta, with some notable exceptions. We corroborated a recent result that observed PMDs in the
human liver [Salhab et al., 2018] and observed highly divergent but reproducible PMDs in mouse oocytes
(Supplemental Figure A.3).
[Figure 5.1 graphic. Panel A axes: mean segment size (kb) and fraction of genome covered, by species (Cow, Dog, Horse, Human, Mouse, Rhesus, Squirrel Monkey). Panel B: chr1 (hg19) browser view with RefSeq Genes, PCs (n=76), and Non-PCs (n=81).]
Figure 5.1: (A) Fraction of genome covered and mean segment size of segmentation results on all samples
studied, with cutoffs drawn to delineate PMD-containing from non-PMD-containing samples. (B) Genomic
locations of segments in PCs vs non-PCs.
While PMD locations were fairly consistent across many of the cell types we studied (i.e., occurring in
approximately the same 5%-30% of the genome), we did observe some cell type specificity across different
cancer types. The sheer size of PMDs, coupled with our method’s ability to identify them in very low
coverage situations, makes them a large and attractive target for early diagnosis of cancer using cell-free
DNA. Additionally, we observed that oocyte PMDs departed strikingly from those of every other cell type.
Further study will be necessary to determine whether the large and reproducible regions of hypomethylation
in mature oocytes are formed through the same mechanism as other PMDs, and if so, whether differences
in their location reflect differences in replication timing during oogenesis compared to diploid samples. It
is possible that incomplete remethylation following epigenetic reprogramming of primordial germ cells,
rather than replication-induced methylation loss, gives rise to oocyte PMDs. We observed significant hy-
pomethylation of satellite-rich pericentromeric regions and HOX gene clusters in non-PC samples. These
pericentromeric regions may prove to be a reliable estimator of replication history, given the large variability
in methylation loss there in non-PC cell types. HOX cluster hypomethylation seems unlikely to arise via
the same gradual loss processes of methylation as PMDs, and PC samples with PMDs covering HOX gene
clusters show markedly different methylation patterns.
5.2 Segmentation suggests gradual PMD expansion and conservation of PMD
features across species
Analysis of PMD location overlap with other genomic features revealed substantial depletion of genes in-
side PMDs, corroborating previous results [Berman et al., 2012]. Additionally, we observed lineage-specific
enrichment of LTR families in primate PMDs, and depletion of nearly all SINE families inside PMDs across
species (Figure 5.2A). PMDs were depleted for nearly all chromHMM annotations, with the exceptions of
repetitive elements and repressive marks (Supplemental Figure A.5). PMDs were observable in non-CpG
methylation as well, suggesting differential Dnmt3a activity in PC samples vs non-PC samples
(Supplemental Figure A.5).
Using all 76 PC human methylomes, we observed that PMD boundaries often co-occur with short seg-
ments of high CpG density (Figure 5.2B), separate domains of high and low CpG density, and that methy-
lation levels inside PMDs increase towards PMD boundaries. This increase could reflect some degree of
variability in the precise boundary location at the single cell level, or the absence of a precise boundary in
favor of increasingly probable hypomethylation of CpG sites as a result of changes in the local environment.
This trend is conserved across species (Supplemental Figure A.6). Elevated CpG density at PMD boundaries
was primarily driven by strong association of transcription start and end sites (TSSs and TESs) with PMD
boundaries. TSSs at PMD boundaries displayed a directional affinity for genes transcribing away from the
PMD (Figure 5.2C). In samples with available CTCF binding data, we observed an enrichment of
CTCF-bound sites at PMD boundaries (Supplemental Figure A.7).
We observed only modest enrichment or depletion of repeat families at boundaries relative to the whole
genome, but there were substantial family-specific tendencies towards being included or excluded in the PMD
[Figure 5.2 graphic. Panel A: log(Observed/Expected) ratio in placental PMDs for genes and for SINE, LTR, and LINE families, by species. Panel B: %mCpG and mean CpG density versus distance from the PMD boundary. Panel C: counts (thousands) of transcription start and end sites versus distance to PMD boundary (kb), split by transcription away from or towards the PMD. Panel D: %mCpG in PMDs versus number of overlapping PMDs. Panel E: %mCpG by tissue type, cultured versus primary.]
Figure 5.2: (A) Observed/Expected ratios for genes and retrotransposon families inside PMDs (B) Meta-
gene plot showing methylation levels and CpG density near boundaries of PMDs (C) Histograms showing
distance from each RefSeq TSS and TES to its nearest PMD boundary (D) PMD depth as a function of
50kb bin PMD conservation across human samples (E) PMD depth boxplots in cultured vs primary cancer
samples
if an element occurred at the boundary. We plotted relative enrichment at the boundary using the difference
in O/E ratios for the 5kb inside vs 5kb outside the PMD boundary for each family (Supplemental Figure
A.8). In primate species, Alu elements were preferentially excluded from PMDs, with the magnitude of this
exclusion positively correlating with the age of the Alu subfamily. Interestingly, other lineage-specific SINE
elements such as the SINEC family specific to dog and equine repeat element (ERE) family specific to horse
showed similar exclusion at PMD boundaries.
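A rough sketch of the boundary enrichment measure follows. The text does not define how the expected value is computed, so this example assumes that the expected number of base pairs is the window size multiplied by the genome-wide fraction occupied by the repeat family; the function names and the toy numbers are hypothetical.

```python
def oe_ratio(observed_bp: float, window_bp: float, genome_fraction: float) -> float:
    """Observed/expected ratio for one repeat family in a set of windows.

    Assumption: expected bp = window size * genome-wide fraction of the family.
    """
    expected_bp = window_bp * genome_fraction
    return observed_bp / expected_bp


def boundary_oe_difference(
    inside_bp: float,
    outside_bp: float,
    n_boundaries: int,
    genome_fraction: float,
    window_bp: int = 5_000,  # 5kb on each side of every boundary
) -> float:
    """O/E inside the PMD minus O/E outside; negative values mean exclusion."""
    total_window_bp = n_boundaries * window_bp
    inside = oe_ratio(inside_bp, total_window_bp, genome_fraction)
    outside = oe_ratio(outside_bp, total_window_bp, genome_fraction)
    return inside - outside


# Toy family covering 12.5% of the genome, summed over 100 boundaries.
diff = boundary_oe_difference(31_250, 93_750, n_boundaries=100, genome_fraction=0.125)
```

Under these assumptions, a family that is preferentially excluded from PMDs at the boundary, as described for Alu elements, yields a negative difference.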
While a lot of attention has been placed on the locations and frequency of PMDs in different contexts,
relatively little if any attention has been focused on understanding how the methylation level within PMDs
varies within and across samples. For every 50kb bin in the genome that overlapped completely with a PMD
in at least one sample, we plotted the number of samples it had a PMD in against its methylation level. We
observed that the more conserved a PMD is across samples, the lower its methylation level (Figure 5.2D).
This could reflect an ordering whereby some PMDs become detectable earlier in their development and have
lost more methylation than others over time.
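The bin-level conservation count can be sketched as below. This is an illustration under stated assumptions, not the code used in the study: PMDs are half-open (start, end) intervals on a single chromosome, PMDs within a sample do not overlap or abut, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Interval = Tuple[int, int]


def fully_covered_bins(pmds: Iterable[Interval], bin_size: int = 50_000) -> Set[int]:
    """Indices of bins lying completely inside some PMD of one sample."""
    bins: Set[int] = set()
    for start, end in pmds:
        first = -(-start // bin_size)  # ceiling division: first bin starting at or after `start`
        last = end // bin_size         # exclusive: first bin not ending by `end`
        bins.update(range(first, last))
    return bins


def pmd_sample_counts(per_sample_pmds: List[List[Interval]], bin_size: int = 50_000) -> Counter:
    """Number of samples in which each bin is fully covered by a PMD."""
    counts: Counter = Counter()
    for pmds in per_sample_pmds:
        counts.update(fully_covered_bins(pmds, bin_size))
    return counts


# Two toy samples on one chromosome (illustrative coordinates only).
counts = pmd_sample_counts([[(0, 150_000)], [(60_000, 200_000)]])
```

Pairing each bin's count with its mean methylation level would then give the points summarized in Figure 5.2D.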
We plotted the methylation distribution inside PMDs for cultured vs primary cancer samples, with the
assumption that on average, cultured cancer cell lines will have a longer mitotic history than their matched-
tissue primary tumor counterparts (Figure 5.2E). We observed lower methylation levels in the PMDs of cul-
tured cancer samples for cancer types with both cultured and primary samples. To minimize the possibility
of these differences occurring due to stromal composition of primary tumors, we explored the relationship
between tumor purity and PMD prevalence in TCGA primary tumor samples and observed no correlation
(Supplemental Figure A.9).
5.3 Some genes escape hypomethylation and downregulation in late-replicating
regions
We made use of public Repli-seq data, which measure the relative timing of nascent DNA generation in
different regions of the genome throughout S phase, to explore the relationship between replication timing
and PMD state. We observed a strong link between late replication timing and PMD state, corroborating
previous results [Berman et al., 2012] (Figure 5.3A). PMD boundaries appeared to associate with chromatin
loop boundaries, possibly indicating a significant change in the accessibility of the adjacent regions by
Dnmt machinery. In addition, we observed a negative correlation between replication time and methylation
level within PMDs, suggesting that not only are PMDs associated with late replication, but the deepest
PMD regions are associated with the latest replication times (Figure 5.3B).
Given the increasingly tight link between late replication and PMD state, we asked whether any genes in
predominantly-PMD genomic regions escape their hypomethylating and downregulating effects. To explore
this, we filtered for all genes in mouse and human whose gene bodies were less than 20% covered by a
PMD, and whose 100kb flanking regions were both at least 80% covered by a PMD. We deemed these
genes “escapee” genes and identified 195 in human and 88 in mouse that occurred in at least 2 samples.
Figure 5.3C shows the top 6 highly conserved escapee genes in human from left to right, as well as the
methylation profiles of homologous regions across species. A full list of escapees as well as an analogous
figure for mouse escapees are available in Supplemental Table B.9 and Supplemental Figure A.10. We
observed modest conservation of escapee behavior across species, and variability in escapee status even
within cancer types (Supplemental Figure A.11). The top hit, Fam188a, was an escapee in 37 of 76 human
samples. It is an extremely conserved, ubiquitously expressed protein in mammals [Rehman et al., 2016].
Interestingly, it was only covered by a PMD in one of the 76 samples, a lung adenocarcinoma sample, and
has been previously identified as a tumor suppressor of non-small-cell lung cancer [Shi et al., 2011]. Other
top hits coded for equally important proteins, specifically key regulators of apoptosis (MAP3K7) and mitotic
replication, such as a subunit of the origin recognition complex (Orc5) and a protein involved in kinetochore
function (ZWINT) [Obuse et al., 2004].
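The escapee filter described above can be sketched as follows. The helper names are hypothetical, PMDs are assumed to be non-overlapping half-open intervals on the gene's chromosome, and flanks are not clipped at the chromosome end, which a real implementation would need to handle.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def covered_fraction(start: int, end: int, pmds: List[Interval]) -> float:
    """Fraction of [start, end) overlapped by PMDs (assumed non-overlapping)."""
    if end <= start:
        return 0.0
    covered = sum(max(0, min(end, p_end) - max(start, p_start)) for p_start, p_end in pmds)
    return covered / (end - start)


def is_escapee(
    gene_start: int,
    gene_end: int,
    pmds: List[Interval],
    flank_bp: int = 100_000,
    max_body_cov: float = 0.2,   # gene body less than 20% covered
    min_flank_cov: float = 0.8,  # both flanks at least 80% covered
) -> bool:
    body = covered_fraction(gene_start, gene_end, pmds)
    left = covered_fraction(max(0, gene_start - flank_bp), gene_start, pmds)
    right = covered_fraction(gene_end, gene_end + flank_bp, pmds)
    return body < max_body_cov and left >= min_flank_cov and right >= min_flank_cov


# Toy gene sitting in a gap between two PMDs (illustrative coordinates only).
pmds = [(0, 195_000), (260_000, 400_000)]
escapes = is_escapee(200_000, 250_000, pmds)
```

Applying this test per sample and counting the samples in which a gene passes would give the per-gene recurrence used to rank escapees.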
Analysis of gene locations revealed that escapee genes exhibit similar replication timing profiles to genes
inside PMDs, and substantially later replication timing than other genes outside PMDs (Figure 5.3D). To
explore whether escapees also evaded PMD-associated downregulation, we explored the expression profiles
of four lung cancer cell lines and compared them to healthy lung samples. Three of four cultured tissues
exhibited significantly lower transcript-per-million (TPM) distributions for genes inside PMDs than the same
genes in the healthy counterpart (p < 2.2e-16), while no significant difference in expression of genes
inside PMDs was observed for Calu1 (p < 0.309). In all four cultured tissues, escapee genes exhibited
significantly higher TPM than genes inside PMDs (p < 2.2e-16) and similar TPM distribution to genes
outside of PMDs, implying that they escape PMD-associated downregulation (Figure 5.3E).
Despite an increasingly tight association between replication timing and progressive loss of methylation,
we identified a set of genes residing in late-replicating domains that reliably escape this methylation loss
and associated downregulation. This observation, coupled with the observation that particular histone marks
like H3K36me3 co-locate with CpGs that maintain their high methylation in late replicating regions [Zhou
[Figure 5.3 graphic. Panel A: chr2 (hg19) browser view with RefSeq genes, PMDs, Repli-seq, chromatin loops, and methylation for MCF7, IMR90, GM12878, and HepG2. Panel B: Repli-seq signal by binned methylation level (low, medium, high) for the same four samples. Panel C: stitched browser view across species (Human, Rhesus, Sq. Monkey, Mouse, Dog, Horse, Cow) at chr10:p13, chr7:q22.1, chr10:q21.1, chr12:q12, chr5:p15.32, and chr6:q15. Panel D: Repli-seq signal for genes inside PMDs, escapees, and genes outside PMDs. Panel E: log(TPM) in cancer versus healthy context for genes inside PMDs, escapees, and genes outside PMDs in the four lung cancer cell lines.]
Figure 5.3: (A) Browser plot showing tight association of replication timing and PMD state. (B) Replication
timing for 50kb bins inside PMDs with high (> 66%) medium (33-66%) or low (0-33%) mean methylation
level (C) Stitched browser plot showing 6 most conserved escapee genes in human and their methylation
state in other species (D) Escapee replication timing violin plots (E) TPM distributions for genes inside
PMDs, escapees, and outside of PMDs in PMD containing samples and healthy analogues.
et al., 2018] are contributing to an increasingly clear picture of the major determinants of genome-wide
methylation state. It seems likely that starting from some early state, gradual reduction in methylation levels
occurs in late-replicating regions. This gradual reduction could roughly track the history of the cells in
question, and culminates with identifiable partially methylated domains in cells with a long history of mitotic
division, as in liver, placenta, many cancers, and cultured cell lines. Concomitant with this methylation
erosion, the actively maintained methylation of critical genes like escapees is exposed like shells at low
tide. Mechanisms have been proposed for the preservation of expression in genes close to the lamina, though
whether these mechanisms are responsible for escapee methylation is unclear [Bank and Gruenbaum, 2011].
Left unanswered are interesting questions about why critical genes remain in late-replicating regions of the
genome, how they remain active, whether their position affords them cyclical gene expression patterns, and
how their variable methylation state can be used to inform on their misregulation in cancer.
5.4 Discordant PMD state across species correlates with expression and gene
density differences
Several studies have explored the relationship between large-scale hypomethylation and DNA sequence
[Gaidatzis et al., 2014, Zhou et al., 2018]. We sought to explore the extent to which orthologous sequence
across species shared its PMD state, and in what cases the PMD state was discordant. Figure 5.4A shows,
for each one-to-one syntenic block larger than 50kb obtained from the human-mouse pairwise alignment
first reported by Kent et al. [2003], the proportion of that block covered by a PMD in human vs mouse,
with the color of the point reflecting the size of the block. There is a significant and positive correlation
between human and mouse, but many blocks display highly discordant PMD state despite their homologous
sequence.
Looking specifically at the impact of discordant PMD state on genes, we identified orthologous genes
between mouse and human and observed significant directional differences in expression for genes that are
differentially covered by PMDs (p < 2.2e-16 for genes in mouse PMDs but not human, and p < 1.6e-05
for the genes in human PMDs but not mouse) (Figure 5.4B). For these genes, we also explored their local
gene density and observed directional changes in the number of protein coding genes within 100kb, which
could indicate that genomic rearrangements or changes in genomic context can lead to differences in PMD
state (Figure 5.4C).
[Figure 5.4 graphic. Panel A: mouse block % PMD versus human block % PMD, colored by log syntenic block size (R^2 = 0.387, cor = 0.62). Panel B: log(TPM) in human and mouse by PMD state (Both, Mouse, Human, Neither), annotated p<2.2e-16 and p<1.6e-05. Panel C: surrounding +/-100kb in a gene (kb) in human and mouse by PMD state.]
Figure 5.4: (A) Syntenic block PMD conservation between mouse and human breast cancer (B) TPM dis-
tributions for orthologous genes between mouse and human binned by PMD state in each species (C) Com-
parison of local gene density for orthologous genes between mouse and human by PMD state
The discussion of PMD association with late-replicating domains and sequence is well explored, but
disentangling the relative effect of these factors on PMD state has been difficult. While we did not directly
explore the differences in replication timing across species in this manuscript, we observed that for
orthologous genes in mouse and human, changes in PMD state were linked to changes in gene expression and
local gene density. This result could be related to position-effect variegation [Muller, 1930] on a species
level, whereby a genomic rearrangement leading to the juxtaposition of an otherwise active gene close to
heterochromatin can lead to its subsequent downregulation (and in this case, coverage by a PMD).
One remaining and important unknown related to the model of PMD formation and maintenance is the
existence of an “equilibrium” methylation level inside PMDs. Some cultured cell lines have been passaged
for a very long time and display intermediate methylation levels in PMDs, while others display near 0%
methylation levels in PMDs. The speed at which methylation loss occurs in PMD regions, and where that
loss stops, may reflect intrinsic differences in the ability to maintain methylation in these regions between
cell types. Non-CpG methylation varies with transient Dnmt3a expression [Lister et al., 2013]. Given that
we observed PMD-like variation in non-CpG methylation in human placenta, it seems likely that differences
in Dnmt activity could play a central role in determination of equilibrium PMD depth.
5.5 Satellite-driven pericentromeric hypomethylation shows promise as a
mitotic clock in healthy cell types
In earlier sections, we observed that there were some regions of lower-than-average methylation in pericen-
tromeric regions that existed even in non-PC samples. These observations, plus the surprising and subtle
PMDs we detected in the human liver, led us to hypothesize that well-conserved PMD locations may begin
to lose methylation far before PMDs are detectable on a global scale, and that hypomethylation may be
observable in other healthy tissues. To explore this possibility, for each non-PC sample, we calculated the
difference in methylation between regions that were always PMD and never PMD in PC samples for mouse
and human, and called this difference “PMD shadow depth.”
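As a minimal sketch of this statistic, assume that methylation has already been averaged per genomic bin and that the sets of "always PMD" and "never PMD" bins have been derived from the PC samples. The function name, the dictionary layout, and the sign convention (never-PMD minus always-PMD, so that hypomethylation of conserved PMD locations gives a positive depth) are assumptions for illustration.

```python
from statistics import mean
from typing import Dict, Iterable


def pmd_shadow_depth(
    bin_methylation: Dict[int, float],
    always_pmd_bins: Iterable[int],
    never_pmd_bins: Iterable[int],
) -> float:
    """Mean methylation in never-PMD bins minus mean in always-PMD bins."""
    never = mean(bin_methylation[b] for b in never_pmd_bins)
    always = mean(bin_methylation[b] for b in always_pmd_bins)
    return never - always


# Toy non-PC sample: bins 0-1 are never PMD, bins 2-3 are always PMD in PC samples.
toy_methylation = {0: 0.80, 1: 0.82, 2: 0.70, 3: 0.72}
depth = pmd_shadow_depth(toy_methylation, always_pmd_bins=[2, 3], never_pmd_bins=[0, 1])
```

In this toy case the always-PMD bins are about 10 percentage points less methylated than the never-PMD bins, which would correspond to a shadow depth of roughly 0.10.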
Figure 5.5A shows the PMD shadow depth for many human samples stratified by differentiation lineage.
It is clear that PMD shadows begin to form as early as ESCs, and that differences in their depths are ob-
servable upon differentiation into the three germ layers, with ectoderm displaying shallower shadows than
endoderm or mesoderm. Shadow depth varied by cell type, with large depths in heart, liver, and adrenal
glands and a differential depth between neuronal and non-neuronal brain cells that was conserved with
mice (Supplemental Figure A.12). Shadow depth was significantly higher in adult tissues compared to fetal
tissues (Wilcoxon rank-sum test, p < 0.00032) and shadows deepened in differentiated fetal tissues compared
to undifferentiated/progenitor tissues, although this difference was not significant (p < 0.054) (Figure
5.5B). There was a noticeable increase in shadow depth with age in whole blood methylomes taken from
newborn, middle-aged, and centenarian patients (Figure 5.5C) and differences in shadow depth that seemed
to correlate with sun exposure and age in epidermal samples (Figure 5.5D). Lastly, there was a significant
increase in shadow depth in tumor-adjacent tissues from TCGA compared to their Roadmap epigenome
project healthy references (p < 0.0303) (Figure 5.5E).
While I believe it is appropriate at this stage to assume something about the mitotic history of the above
comparisons, establishment of PMDs or PMD shadows as a mitotic clock will require substantially more
controlled experiments for proper calibration of the loss rate. In the future, time courses where replication
[Figure 5.5 graphic. All panels plot PMD shadow depth (%). Panel A: ESCs, mesendoderm, neural progenitors, ectoderm, endoderm, and mesoderm lineages, including cortex, epidermis, neuron, non-neuron, gyrus, blood, and heart samples. Panel B: adult, fetal, and undifferentiated tissues (p<0.00032, p<0.054). Panel C: CD4T newborn, PBMC 40 yr, and CD4T 100 yr. Panel D: epidermal samples, old or young and sun-exposed or protected. Panel E: healthy versus tumor-adjacent tissue (p<0.0303).]
Figure 5.5: (A) PMD Shadow depth in non-PC samples organized by lineage (B,C) Shadow depth correlates
with tissue age (D) Shadow depth in epidermal samples that vary by age and sun exposure (E) Shadow depth
differs between tumor-adjacent tissues and their healthy analogues
timing is approximated and hypomethylation is tracked specifically in pericentromeric satellites may prove
useful as a way to track lineage-specific replicative history.
5.6 Analysis of hypomethylated regions in PMD-containing samples
We applied our improved method for HMR detection (see Chapter 4.7) to a set of lung methylomes including
primary healthy, cultured healthy, primary tumor-adjacent, primary cancer, and cultured cancer samples. As
a first step, we identified 40,251 HMRs that existed in all three of the healthy primary samples and deemed
them the “gold standard” (GS) HMRs that are most likely to play a functional role in the lung. We observed
varying degrees of loss of these GS HMRs based on context. Roughly 10,000 GS HMRs are lost in tumor-
adjacent healthy tissue. Cultured samples display the most striking loss of GS HMRs regardless of their
cancer/healthy state, and their loss was skewed towards GS HMRs inside PMDs. Concurrent with GS HMR
loss is an overall gain in non-gold-standard (NGS) HMRs (Figure 5.6A).
GS and NGS HMRs display strikingly different genomic feature occupancy profiles, with gold-standard
HMRs existing in regulatory regions such as TSS, TES, and distal CpG islands frequently and NGS HMRs
overwhelmingly occurring in gene bodies, intergenic regions and repeat elements (Figure 5.6B). Interest-
ingly, PMD state had little influence on these context profiles, indicating that the increase in HMRs in PMD-
containing samples is occurring in the same kinds of regions that are already hypomethylated in non-PC
samples.
We explored the methylation levels in 50bp bins as a function of distance from the core of each HMR
in different contexts (Figure 5.6C). Gold-standard HMRs outside of PMDs were widest. GS HMRs inside
PMDs were narrower and slightly shallower. NGS HMRs were very narrow and significantly shallower
than gold-standard HMRs. Surprisingly, NGS HMRs displayed hypermethylation just before the HMR in
healthy tissues, something we had not seen before. Analysis of HMR methylation dynamics by genomic
context in other healthy tissues confirms that this is a property of HMRs in coding sequences, intergenic, and
repeat elements across a wide group of healthy tissues (Supplementary Figure A.13). This hypermethylation
flanking healthy NGS HMRs is lost in other contexts.
[Figure 5.6 graphic. Panel A: number of gold-standard and non-gold-standard HMRs (thousands) by context (Healthy, Tumor adjacent, Primary tumor, Cultured healthy, Cultured cancer). Panel B: proportion of HMRs by genomic context (CDS, distalCGI, Intergenic, Repeat, TES, TSS-CGI, TSS-noCGI) for each phenotypic context, inside versus outside PMDs. Panel C: %mCpG versus distance from HMR core (kb), inside versus outside PMDs, for gold-standard and non-gold-standard HMRs.]
Figure 5.6: (A) Number of gold-standard and non-gold standard HMRs in samples ordered by their context
(B) Stacked barplot showing genomic context of GS and NGS HMRs stratified by PMD state (C) Metagene
plot showing 50bp-resolution average methylation level +/-4kb from HMR cores.
5.7 Study design
5.7.1 An expanded methylome segmentation trackhub
We produced three trackhubs containing methylation, coverage, PMD, and HMR segmentations for all the
samples in this study. One contains all PMD-containing methylomes organized by original study, another
contains all non-PMD-containing methylomes organized by original study, and the last contains
PMD-containing methylomes lifted to distant reference genomes (human to mouse and all species to human).
Previously, MethBase reported only the final PMD segmentation as a set of BED-format genomic
regions. We recognize that this may not be informative enough for some users and that providing probabilities
directly may help in downstream screening of PMDs within samples. Our trackhubs therefore also provide
posterior probabilities of PMD state at each bin and the boundary likelihood scores for each PMD. These are
public and can be accessed via the UCSC genome browser.
5.7.2 Mouse mammary tumor cell lines
Mouse mammary tumor cell lines were cultured as described previously [Wagenblast et al., 2015].
5.7.3 Lung cancer cell lines and healthy lung tissue
We generated WGBS and RNA-seq libraries for 4 non-small cell lung cancer cell lines and 2 primary normal
lung epithelial samples. The cancer cell lines consist of 3 adenocarcinoma (H1650, H441 and M3) and a
squamous cell carcinoma (Calu-1). The two primary lung samples are from small airway epithelium (SAE)
and bronchial epithelium (BE). These primary samples, along with the H1650, H441 and Calu-1
cell lines, were obtained from ATCC. The M3 cell line was isolated from H1650 for its resistance to the
drug erlotinib and displayed features suggestive of epithelial-to-mesenchymal transition [Yao et al., 2010].
Genome wide, the WGBS datasets reach 10-15X coverage and cover 93.8%-95.1% of all CpG sites, with
bisulfite conversion rates uniformly above 98%.
5.7.4 Miscellaneous methods
We downloaded genome annotations for repeats [Smit et al., 1996], CpG islands [Gardiner-Garden and
Frommer, 1987], and human gene bodies [O’Leary et al., 2015] using the UCSC table browser [Karolchik
et al., 2004]. Additionally, we made use of Ensembl biomart to identify orthologous genes across species
[Kinsella et al., 2011]. The bedtools software suite was used extensively throughout the manuscript [Quin-
lan, 2014]. Public Repli-seq [Hansen et al., 2010], CTCF binding [Sabo et al., 2004], and ChromHMM
segmentation data [Ernst and Kellis, 2010] were downloaded from the UCSC table browser.
We merged the intra-species placenta methylomes from [Decato et al., 2017] to produce high-coverage
methylomes of the labyrinthine and junctional placental zones. Mouse methylomes from Decato et al. [2017]
originating from strains other than C57BL/6J were remapped to their recently completed native reference
genomes [Lilue et al., 2018] using WALT [Chen et al., 2016] before being lifted to mm10. All lifts between
genome assemblies were performed using the lift-filter program in Methpipe, which acts as a methylation-
aware wrapper for the liftOver tool [Hinrichs et al., 2006]. No chainfile existed for bosTau8 to hg19, so we
lifted it from bosTau8 to hg38 and then from hg38 to hg19.
All RNA-seq libraries were mapped using STAR [Dobin et al., 2013] and processed using HTSeq [An-
ders et al., 2014].
Chapter 6
Conclusions
In this dissertation, I described the state of knowledge regarding the major epigenetic changes that occur
in extraembryonic development and tumorigenesis, as well as two projects aimed at better understanding
these changes. In the Background chapter, I gave a complete summary of all methods for quantifying DNA
methylation as well as the major landmarks of a methylome, then described the current state of scientific
knowledge on how those landmarks change during cellular differentiation, extraembryonic development,
and tumorigenesis.
In Chapter 3, I described a specific project that initiated my interest in partially methylated domains,
where we studied the dynamics of DNA methylation changes during speciation, development, and
differentiation of the mouse placenta. We corrected the record regarding the conservation of PMDs across
species by observing PMDs in mouse, and identified a placenta-specific promoter methylation distribution
that refined our understanding of the relationship between transcriptional repression and DNA methylation
by confirming that full methylation at gene promoters was not required for transcriptional silencing in the
placenta.
In Chapter 4, I described improvements to the methods originally published in Methpipe and used in the
placenta project, made in preparation for a comprehensive analysis of PMDs across placenta, cultured cell
lines, and tumor samples. These methods included ways to stabilize PMD estimates across samples of
different coverage depths and to segment “consensus” PMDs across a set of samples, considerations for
identifying PMDs in array data, and methods for identifying HMRs inside and outside of PMDs in
PMD-containing methylomes.
In Chapter 5, I used the aforementioned methods to refine the decision criteria for whether a sample
contains PMDs, drawing on the full collection of data in Methbase and TCGA, and then examined the
formation and function of PMDs in depth by integrating other types of epigenomic data, including
Repli-seq, gene expression, Hi-C, and ChIP-seq. This joint study of PMD formation across samples from
all three contexts (cancer, the placenta, and cultured cell lines) facilitated the discovery of a set of genes
that appear to escape the methylation loss inside PMDs despite their late-replicating state. Future study of
the mechanism behind this escape may yield insights into higher-order chromatin organization near the
nuclear periphery.
One other exciting future avenue for the study of PMDs is their use as a potential therapeutic biomarker.
Currently, several researchers and companies are exploring the possibility of detecting different types of
cancer early from blood screens, in which cell-free DNA from the tumor displays tumor-type-specific
signatures such as driver-gene mutations. PMDs are large and not specific to an individual site in the
genome, making them somewhat easier to detect than specific mutations. The fraction of cancer-originating
cell-free DNA in blood is low, so methods for detecting subtle PMDs in samples with low sequencing depth,
like those described in previous chapters, could be particularly valuable.
Additionally, future study of PMD formation through time courses, and especially of hypomethylation
around centromeres, may facilitate the creation of an accurate mitotic clock, whereby the full replicative
history of a cellular lineage following epigenetic reprogramming can be deduced.
Lastly, as single-cell methylome technology improves, the ability to discern the cell-to-cell heterogeneity
patterns inside PMDs will yield new insights into their formation, and possibly provide the necessary data,
together with information about relative methylation machinery expression rates, to build a unified statistical
model of hypomethylation through cell division.
Bibliography
Karolina A Aberg, Joseph L McClay, Srilaxmi Nerella, Lin Y Xie, Shaunna L Clark, Alexandra D Hudson,
Jozsef Bukszár, Daniel Adkins, Swedish Schizophrenia Consortium, Christina M Hultman, et al. MBD-seq
as a cost-effective approach for methylome-wide association studies: demonstration in 1500 case-control
samples. Epigenomics, 4(6):605–621, 2012.
Andrew Adey and Jay Shendure. Ultra-low-input, tagmentation-based whole-genome bisulfite sequencing.
Genome research, 22(6):1139–1143, 2012.
Andrew Adey, Hilary G Morrison, Xu Xun, Jacob O Kitzman, Emily H Turner, Bethany Stackhouse,
Alexandra P MacKenzie, Nicholas C Caruccio, Xiuqing Zhang, Jay Shendure, et al. Rapid, low-input,
low-bias construction of shotgun fragment libraries by high-density in vitro transposition. Genome
biology, 11(12):R119, 2010.
Altuna Akalin, Francine E Garrett-Bakelman, Matthias Kormaksson, Jennifer Busuttil, Lu Zhang, Irina
Khrebtukova, Thomas A Milne, Yongsheng Huang, Debabrata Biswas, Jay L Hess, et al. Base-pair reso-
lution DNA methylation sequencing reveals profoundly divergent epigenetic landscapes in acute myeloid
leukemia. PLoS genetics, 8(6):e1002781, 2012.
C David Allis and Thomas Jenuwein. The molecular hallmarks of epigenetic control. Nature Reviews
Genetics, 17(8):487, 2016.
Patricia ME Altham. Exact Bayesian analysis of a 2 × 2 contingency table, and Fisher’s “exact” significance
test. Journal of the Royal Statistical Society. Series B (Methodological), pages 261–269, 1969.
Simon Anders, Paul Theodor Pyl, and Wolfgang Huber. HTSeq: a Python framework to work with
high-throughput sequencing data. Bioinformatics, page btu638, 2014.
Christof Angermueller, Stephen J Clark, Heather J Lee, Iain C Macaulay, Mabel J Teng, Tim Xiaoming
Hu, Felix Krueger, Sébastien A Smallwood, Chris P Ponting, Thierry Voet, et al. Parallel single-cell
sequencing links transcriptional and epigenetic heterogeneity. Nature methods, 13(3):229, 2016.
Alexei A Aravin, Gregory J Hannon, and Julius Brennecke. The Piwi-piRNA pathway provides an adaptive
defense in the transposon arms race. Science, 318(5851):761–764, 2007.
Martin J Aryee, Andrew E Jaffe, Hector Corrada-Bravo, Christine Ladd-Acosta, Andrew P Feinberg,
Kasper D Hansen, and Rafael A Irizarry. Minfi: a flexible and comprehensive bioconductor package
for the analysis of infinium DNA methylation microarrays. Bioinformatics, 30(10):1363–1369, 2014.
TR Ashworth. A case of cancer in which cells similar to those in the tumours were seen in the blood after
death. Aust Med J, 14(3):146–149, 1869.
Madeleine P Ball, Jin Billy Li, Yuan Gao, Je-Hyuk Lee, Emily M LeProust, In-Hyun Park, Bin Xie,
George Q Daley, and George M Church. Targeted and genome-scale strategies reveal gene-body methy-
lation signatures in human cells. Nature biotechnology, 27(4):361–368, 2009.
Carolyn E Banister, Devin C Koestler, Matthew A Maccani, James F Padbury, E Andres Houseman, and
Carmen J Marsit. Infant growth restriction is associated with distinct patterns of DNA methylation in
human placentas. Epigenetics, 6(7):920–927, 2011.
Erin M Bank and Yosef Gruenbaum. The nuclear lamina and heterochromatin: a complex relationship,
2011.
Jan Bednar, Rachel A Horowitz, Sergei A Grigoryev, Lenny M Carruthers, Jeffrey C Hansen, Abraham J
Koster, and Christopher L Woodcock. Nucleosomes, linker DNA, and linker histone form a unique
structural motif that directs the higher-order folding and compaction of chromatin. Proceedings of the
National Academy of Sciences, 95(24):14173–14178, 1998.
Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach
to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289–300,
1995.
Benjamin P Berman, Daniel J Weisenberger, Joseph F Aman, Toshinori Hinoue, Zachary Ramjan, Yaping
Liu, Houtan Noushmehr, Christopher PE Lange, Cornelis M van Dijk, Rob AEM Tollenaar, et al. Re-
gions of focal DNA hypermethylation and long-range hypomethylation in colorectal cancer coincide with
nuclear lamina-associated domains. Nature genetics, 44(1):40–46, 2012.
Timothy H Bestor. The DNA methyltransferases of mammals. Human molecular genetics, 9(16):2395–
2402, 2000.
Adrian P Bird. CpG-rich islands and the function of DNA methylation. Nature, 321(6067):209–213, 1985.
Adam Blattler, Lijing Yao, Heather Witt, Yu Guo, Charles M Nicolet, Benjamin P Berman, and Peggy J
Farnham. Global loss of DNA methylation uncovers intronic enhancers in genes showing expression
changes. Genome biology, 15(9):469, 2014.
Christoph Bock. Analysing and interpreting DNA methylation data. Nature Reviews Genetics, 13(10):
705–719, 2012.
Christoph Bock, Eleni M Tomazou, Arie B Brinkman, Fabian Müller, Femke Simmer, Hongcang Gu, Natalie
Jäger, Andreas Gnirke, Hendrik G Stunnenberg, and Alexander Meissner. Quantitative comparison of
genome-wide DNA methylation mapping technologies. Nature biotechnology, 28(10):1106–1114, 2010.
Michael J Booth, Miguel R Branco, Gabriella Ficz, David Oxley, Felix Krueger, Wolf Reik, and Shankar
Balasubramanian. Quantitative sequencing of 5-methylcytosine and 5-hydroxymethylcytosine at single-
base resolution. Science, 336(6083):934–937, 2012.
Mathieu Boulard, John R Edwards, and Timothy H Bestor. Fbxl10 protects polycomb-bound genes from
hypermethylation. Nature genetics, 47(5):479–485, 2015.
Joan Boyes and Adrian Bird. DNA methylation inhibits transcription indirectly via a methyl-CpG binding
protein. Cell, 64(6):1123–1134, 1991.
Alan P Boyle, Sean Davis, Hennady P Shulha, Paul Meltzer, Elliott H Margulies, Zhiping Weng, Terrence S
Furey, and Gregory E Crawford. High-resolution mapping and characterization of open chromatin across
the genome. Cell, 132(2):311–322, 2008.
Miguel R Branco, Michelle King, Vicente Perez-Garcia, Aaron B Bogutz, Matthew Caley, Elena Fineberg,
Louis Lefebvre, Simon J Cook, Wendy Dean, Myriam Hemberger, et al. Maternal DNA methylation
regulates early trophoblast development. Developmental cell, 36(2):152–163, 2016.
Michael Brandeis, Tal Kafri, Mira Ariel, J.Richard Chaillet, John McCarrey, Aharon Razin, and Howard
Cedar. The ontogeny of allele-specific methylation associated with imprinted genes in the mouse. The
EMBO journal, 12(9):3669–3677, 1993.
Arie B Brinkman, Serena Nik-Zainal, Femke Simmer, F German Rodriguez-Gonzalez, Marcel Smid, Lud-
mil B Alexandrov, Adam Butler, Sancha Martin, Helen Davies, Glodzik Dominik, et al. Partially methy-
lated domains are hypervariable in breast cancer and fuel widespread CpG island hypermethylation.
bioRxiv, page 305193, 2018.
Alayne L Brunner, David S Johnson, Si Wan Kim, Anton Valouev, Timothy E Reddy, Norma F Neff, Eliza-
beth Anton, Catherine Medina, Loan Nguyen, Eric Chiao, et al. Distinct DNA methylation patterns char-
acterize differentiated human embryonic stem cells and developing human fetal liver. Genome research,
19(6):1044–1056, 2009.
Jason D Buenrostro, Paul G Giresi, Lisa C Zaba, Howard Y Chang, and William J Greenleaf. Transposition
of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins
and nucleosome position. Nature methods, 10(12):1213, 2013.
Lukas Burger, Dimos Gaidatzis, Dirk Schübeler, and Michael B Stadler. Identification of active regulatory
regions from DNA methylation data. Nucleic acids research, 41(16):e155–e155, 2013.
Andres Cardenas, Devin C Koestler, E Andres Houseman, Brian P Jackson, Molly L Kile, Margaret R
Karagas, and Carmen J Marsit. Differential DNA methylation in umbilical cord blood of infants exposed
to mercury and arsenic in utero. Epigenetics, 10(6):508–515, 2015.
Jocelyn Charlton, Timothy L Downing, Zachary D Smith, Hongcang Gu, Kendell Clement, Ramona Pop,
Veronika Akopian, Sven Klages, David P Santos, Alexander M Tsankov, et al. Global delay in nascent
strand DNA methylation. Nature structural & molecular biology, page 1, 2018.
Aniruddha Chatterjee, Peter Stockwell, Euan Rodger, and Ian Morison. Comparison of alignment software
for genome-wide bisulphite sequence data. Nucleic Acids Research, 40(10), 2012.
Haifeng Chen, Andrew D Smith, and Ting Chen. WALT: fast and accurate read mapping for bisulfite
sequencing. Bioinformatics, page btw490, 2016.
Yangho Chen, Tade Souaiaia, and Ting Chen. Perm: efficient mapping of short sequencing reads with
periodic full sensitive spaced seeds. Bioinformatics, 25(19):2514–2521, 2009.
BC Christensen, EA Houseman, CJ Marsit, S Zheng, MR Wrensch, et al. Aging and Environmental Ex-
posures Alter Tissue-Specific DNA Methylation Dependent upon CpG Island Context. PLoS Genetics, 5
(8), 2009.
Edward B Chuong, Wenfei Tong, and Hopi E Hoekstra. Maternal–fetal conflict: rapidly evolving proteins
in the rodent placenta. Molecular biology and evolution, 27(6):1221–1225, 2010.
Edward B Chuong, MA Karim Rumi, Michael J Soares, and Julie C Baker. Endogenous retroviruses function
as species-specific enhancer elements in the placenta. Nature genetics, 45(3):325–329, 2013.
Stephen J Clark, Ricard Argelaguet, Chantriolnt-Andreas Kapourani, Thomas M Stubbs, Heather J Lee,
Celia Alda-Catalinas, Felix Krueger, Guido Sanguinetti, Gavin Kelsey, John C Marioni, et al. scnmt-
seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells.
Nature communications, 9(1):781, 2018.
PM Coan, AC Ferguson-Smith, and GJ Burton. Ultrastructural changes in the interhaemal membrane and
junctional zone of the murine chorioallantoic placenta across gestation. Journal of anatomy, 207(6):783–
796, 2005.
Shawn J Cokus, Suhua Feng, Xiaoyu Zhang, Zugen Chen, Barry Merriman, Christian D Haudenschild, Sri-
harsa Pradhan, Stanley F Nelson, Matteo Pellegrini, and Steven E Jacobsen. Shotgun bisulfite sequencing
of the Arabidopsis genome reveals DNA methylation patterning. Nature, 452(7184):215–219, 2008.
Bernard Crespi and Patrik Nosil. Conflictual speciation: species formation via genomic conflict. Trends in
ecology & evolution, 28(1):48–57, 2013.
James C Cross. Genetic insights into trophoblast differentiation and placental morphogenesis. In Seminars
in cell & developmental biology, volume 11, pages 105–113. Elsevier, 2000.
Michael E Cusick, Keun Su Lee, Melvin L DePamphilis, and Paul M Wassarman. Structure of chromatin
at deoxyribonucleic acid replication forks: nuclease hypersensitivity results from both prenucleosomal
deoxyribonucleic acid and an immature chromatin structure. Biochemistry, 22(16):3873–3884, 1983.
Timothy Daley and Andrew D Smith. Predicting the molecular complexity of sequencing libraries. Nature
Methods, 10:325–327, 2013.
Camila Ferreira de Souza, Thais S Sabedot, Tathiane M Malta, Lindsay Stetson, Olena Morozova, Artem
Sokolov, Peter W Laird, Maciej Wiznerowicz, Antonio Iavarone, James Snyder, et al. A distinct DNA
methylation shift in a subset of glioma CpG island methylator phenotypes during tumor recurrence. Cell
reports, 23(2):637–651, 2018.
Benjamin E Decato, Jorge Lopez-Tello, Amanda N Sferruzzi-Perri, Andrew D Smith, and Matthew D Dean.
DNA methylation divergence and tissue specialization in the developing mouse placenta. Molecular
Biology and Evolution, 34(7):1702–1712, 2017.
Jie Deng, Robert Shoemaker, Bin Xie, Athurva Gore, Emily M LeProust, Jessica Antosiewicz-Bourget,
Dieter Egli, Nimet Maherali, In-Hyun Park, Junying Yu, et al. Targeted bisulfite sequencing reveals
changes in DNA methylation associated with nuclear reprogramming. Nature biotechnology, 27(4):353–
360, 2009.
Dinh Diep, Nongluk Plongthongkum, Athurva Gore, Ho-Lim Fung, Robert Shoemaker, and Kun Zhang.
Library-free methylation sequencing with bisulfite padlock probes. Nature methods, 9(3):270–272, 2012.
Alexander Dobin, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe
Batut, Mark Chaisson, and Thomas R Gingeras. STAR: ultrafast universal RNA-seq aligner. Bioinformatics,
29(1):15–21, 2013.
Akiko Doi, In-Hyun Park, Bo Wen, Thorsten Schlaeger, George Daley, Andrew Feinberg, et al. Differential
methylation of tissue- and cancer-specific CpG island shores distinguishes human induced pluripotent
stem cells, embryonic stem cells and fibroblasts. Nature Genetics, 41:1350–1353, 2009.
Egor Dolzhenko and Andrew D Smith. Using beta-binomial regression for high-precision differential methy-
lation analysis in multifactor whole-genome bisulfite sequencing experiments. BMC bioinformatics, 15
(1):215, 2014.
Faezeh Dorri, Lee Mendelowitz, and Héctor Corrada Bravo. methylflow: cell-specific methylation pattern
reconstruction from high-throughput bisulfite-converted DNA sequencing. Bioinformatics, 32(11):1618–
1624, 2016.
Camila O dos Santos, Egor Dolzhenko, Emily Hodges, Andrew D Smith, and Gregory J Hannon. An
epigenetic memory of pregnancy in the mouse mammary gland. Cell reports, 11(7):1102–1109, 2015.
Thomas A Down, Vardhman K Rakyan, Daniel J Turner, Paul Flicek, Heng Li, Eugene Kulesha, Stefan
Graef, Nathan Johnson, Javier Herrero, Eleni M Tomazou, et al. A bayesian deconvolution strategy for
immunoprecipitation-based DNA methylome analysis. Nature biotechnology, 26(7):779–785, 2008.
Melanie Ehrlich, Miguel A Gama-Sosa, Lan-Hsiang Huang, Rose Marie Midgett, Kenneth C Kuo, Roy A
McCune, and Charles Gehrke. Amount and distribution of 5-methylcytosine in human DNA from differ-
ent types of tissues or cells. Nucleic acids research, 10(8):2709–2721, 1982.
Jason Ernst and Manolis Kellis. Discovery and characterization of chromatin states for systematic annotation
of the human genome. Nature biotechnology, 28(8):817–825, 2010.
Jason Ernst and Manolis Kellis. Chromhmm: automating chromatin-state discovery and characterization.
Nature methods, 9(3):215, 2012.
Fang Fang, Sevin Turcan, Andreas Rimner, Andrew Kaufman, Dilip Giri, Luc GT Morris, Ronglai Shen,
Venkatraman Seshan, Qianxing Mo, Adriana Heguy, et al. Breast cancer methylomes establish an epige-
nomic foundation for metastasis. Science translational medicine, 3(75):75ra25–75ra25, 2011.
Fang Fang, Emily Hodges, Antoine Molaro, Matthew D Dean, Gregory J Hannon, and Andrew D Smith.
The genomic landscape of human allele-specific DNA methylation. Proceedings of the National Academy
of Sciences, 109(19):7332–7337, 2012.
Matthias Farlik, Nathan C Sheffield, Angelo Nuzzo, Paul Datlinger, Andreas Schönegger, Johanna
Klughammer, and Christoph Bock. Single-cell DNA methylome sequencing and bioinformatic inference
of epigenomic cell-state dynamics. Cell reports, 10(8):1386–1397, 2015.
Tanja Fehm, Arthur Sagalowsky, Edward Clifford, Peter Beitsch, Hossein Saboorian, David Euhus, Song-
dong Meng, Larry Morrison, Thomas Tucker, Nancy Lane, et al. Cytogenetic evidence that circulating
epithelial cells in patients with carcinoma are malignant. Clinical Cancer Research, 8(7):2073–2084,
2002.
Andrew Feinberg. Phenotypic plasticity and the epigenetics of human disease. Nature, 447:433–440, 2007.
Andrew P Feinberg, Bert Vogelstein, et al. Hypomethylation distinguishes genes of some human cancers
from their normal counterparts. Nature, 301(5895):89–92, 1983.
Cédric Feschotte and Clément Gilbert. Endogenous viruses: insights into viral evolution and impact on host
biology. Nature Reviews Genetics, 13(4):283–296, 2012.
G Ficz, MR Branco, S Seisenberger, F Santos, F Krueger, TA Hore, CJ Marques, S Andrews, and W Reik.
Dynamic regulation of 5-hydroxymethylcytosine in mouse ES cells during differentiation. Nature, 473
(7347):398–402, 2011.
Maria E Figueroa, Omar Abdel-Wahab, Chao Lu, Patrick S Ward, Jay Patel, Alan Shih, Yushan Li, Neha
Bhagwat, Aparna Vasanthakumar, Hugo F Fernandez, et al. Leukemic idh1 and idh2 mutations result in a
hypermethylation phenotype, disrupt tet2 function, and impair hematopoietic differentiation. Cancer cell,
18(6):553–567, 2010.
Jean-Philippe Fortin, Timothy J Triche, and Kasper D Hansen. Preprocessing, normalization and integration
of the illumina humanmethylationepic array with minfi. Bioinformatics, page btw691, 2016.
AL Fowden and T Moore. Maternal-fetal resource allocation: co-operation and conflict. Placenta, 33:
e11–e15, 2012.
Satoshi Furukawa, Yusuke Kuroda, and Akihiko Sugiyama. A comparison of the histological structure of
the placenta in experimental animals. Journal of toxicologic pathology, 27(1):11–18, 2014.
Dimos Gaidatzis, Lukas Burger, Rabih Murr, Anita Lerch, Sophie Dessus-Babus, Dirk Schübeler, and
Michael B Stadler. DNA sequence explains seemingly disordered methylation levels in partially methy-
lated domains of mammalian genomes. PLoS Genet, 10(2):e1004143, 2014.
Miguel A Gama-Sosa, Valerie A Slagel, Ronald W Trewyn, Ronald Oxenhandler, Kenneth C Kuo,
Charles W Gehrke, and Melanie Ehrlich. The 5-methylcytosine content of DNA from human tumors.
Nucleic acids research, 11(19):6883–6894, 1983.
M Gardiner-Garden and M Frommer. CpG islands in vertebrate genomes. Journal of molecular biology, 196
(2):261–282, 1987.
Levi A Garraway and Eric S Lander. Lessons from the cancer genome. Cell, 153(1):17–37, 2013.
Paul G Giresi, Jonghwan Kim, Ryan M McDaniell, Vishwanath R Iyer, and Jason D Lieb. Faire
(formaldehyde-assisted isolation of regulatory elements) isolates active regulatory elements from human
chromatin. Genome research, 17(6):877–885, 2007.
Charly R Good, Jozef Madzo, Bela Patel, Shinji Maegawa, Nora Engel, Jaroslav Jelinek, and Jean-Pierre J
Issa. A novel isoform of tet1 that lacks a cxxc domain is overexpressed in cancer. Nucleic Acids Research,
2017.
Silvia Gravina, Shireen Ganapathi, and Jan Vijg. Single-cell, locus-specific bisulfite sequencing (slbs) for
direct detection of epimutations in DNA methylation patterns. Nucleic acids research, page gkv366, 2015.
Hongshan Guo, Ping Zhu, Xinglong Wu, Xianlong Li, Lu Wen, and Fuchou Tang. Single-cell methylome
landscapes of mouse embryonic stem cells and early embryos analyzed using reduced representation
bisulfite sequencing. Genome research, 23(12):2126–2135, 2013.
M Hackenberg, G Barturen, and JL Oliver. Ngsmethdb: a database for next-generation sequencing single-
cytosine-resolution DNA methylation data. Nucleic Acids Research, 39, 2011.
Kasper D Hansen, Benjamin Langmead, Rafael A Irizarry, et al. BSmooth: from whole genome bisulfite
sequencing reads to differentially methylated regions. Genome Biol, 13(10):R83, 2012.
Kasper D Hansen, Sarven Sabunciyan, Ben Langmead, Noemi Nagy, Rebecca Curley, Georg Klein, Eva
Klein, Daniel Salamon, and Andrew P Feinberg. Large-scale hypomethylated blocks associated with
epstein-barr virus–induced b-cell immortalization. Genome research, 24(2):177–184, 2014.
Kasper Daniel Hansen, Winston Timp, Héctor Corrada Bravo, Sarven Sabunciyan, Benjamin Langmead,
Oliver G McDonald, Bo Wen, Hao Wu, Yun Liu, Dinh Diep, et al. Increased methylation variation in
epigenetic domains across cancer types. Nature genetics, 43(8):768–775, 2011.
R Scott Hansen, Sean Thomas, Richard Sandstrom, Theresa K Canfield, Robert E Thurman, Molly Weaver,
Michael O Dorschner, Stanley M Gartler, and John A Stamatoyannopoulos. Sequencing newly replicated
DNA reveals widespread plasticity in human replication timing. Proceedings of the National Academy of
Sciences, 107(1):139–144, 2010.
R Alan Harris, Ting Wang, Cristian Coarfa, Raman P Nagarajan, Chibo Hong, Sara L Downey, Brett E
Johnson, Shaun D Fouse, Allen Delaney, Yongjun Zhao, et al. Comparison of sequencing-based meth-
ods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nature
biotechnology, 28(10):1097–1105, 2010.
Sarah K Harten, Harald Oey, Lauren M Bourke, Vandhana Bharti, Luke Isbel, Lucia Daxinger, Pierre Faou,
Neil Robertson, Jacqueline M Matthews, and Emma Whitelaw. The recently identified modifier of murine
metastable epialleles, rearranged l-myc fusion, is involved in maintaining epigenetic marks at CpG island
shores and enhancers. BMC biology, 13(1):1, 2015.
Hikoya Hayatsu, Yusuke Wataya, Kazushige Kai, and Shigeru Iida. Reaction of sodium bisulfite with uracil,
cytosine, and their derivatives. Biochemistry, 9(14):2858–2865, 1970.
Jianlin He, Xinxi Sun, Xiaojian Shao, Liji Liang, and Hehuang Xie. DMEAS: DNA methylation entropy
analysis software. Bioinformatics, 2013.
Paul DN Hebert, Alina Cywinska, Shelley L Ball, et al. Biological identifications through DNA barcodes.
Proceedings of the Royal Society of London. Series B: Biological Sciences, 270(1512):313–321, 2003.
Katja Hebestreit, Martin Dugas, and Hans-Ulrich Klein. Detection of significantly differentially methylated
regions in targeted bisulfite sequencing data. Bioinformatics, 2013.
Nathaniel D Heintzman, Gary C Hon, R David Hawkins, Pouya Kheradpour, Alexander Stark, Lindsey F
Harp, Zhen Ye, Leonard K Lee, Rhona K Stuart, Christina W Ching, et al. Histone modifications at
human enhancers reflect global cell-type-specific gene expression. Nature, 459(7243):108, 2009.
Angela S Hinrichs, Donna Karolchik, Robert Baertsch, Galt P Barber, Gill Bejerano, Hiram Clawson, Mark
Diekhans, Terrence S Furey, Rachel A Harte, Fan Hsu, et al. The ucsc genome browser database: update
2006. Nucleic acids research, 34(suppl 1):D590–D598, 2006.
Martin Hirst and Marco A Marra. Next generation sequencing based approaches to epigenomics. Briefings
in functional genomics, 9(5-6):455–465, 2010.
Emily Hodges, Andrew D Smith, Jude Kendall, Zhenyu Xuan, Kandasamy Ravi, Michelle Rooks,
Michael Q Zhang, Kenny Ye, Arindam Bhattacharjee, Leonardo Brizuela, et al. High definition profiling
of mammalian DNA methylation by array capture and single molecule bisulfite sequencing. Genome
research, 19(9):1593–1605, 2009.
Emily Hodges, Antoine Molaro, Camila O Dos Santos, Pramod Thekkat, Qiang Song, Philip J Uren, Jin
Park, Jason Butler, Shahin Rafii, W Richard McCombie, et al. Directional DNA methylation changes
and complex intermediate states accompany lineage specificity in the adult hematopoietic compartment.
Molecular Cell, 44(1):17–28, 2011.
Michael M Hoffman, Orion J Buske, Jie Wang, Zhiping Weng, Jeff A Bilmes, and William Stafford Noble.
Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature
methods, 9(5):473, 2012.
Kirsten Hogg, John D Blair, Deborah E McFadden, Peter von Dadelszen, and Wendy P Robinson. Early
onset pre-eclampsia is associated with altered DNA methylation of cortisol-signalling and steroidogenic
genes in the placenta. PloS one, 8(5):e62969, 2013.
Gary C Hon, R David Hawkins, Otavia L Caballero, Christine Lo, Ryan Lister, Mattia Pelizzola, Armand
Valsesia, Zhen Ye, Samantha Kuan, Lee E Edsall, et al. Global DNA hypomethylation coupled to repres-
sive chromatin domain formation and gene silencing in breast cancer. Genome research, 22(2):246–258,
2012.
Gary C Hon, Nisha Rajagopal, Yin Shen, David F McCleary, Feng Yue, My D Dang, and Bing Ren. Epi-
genetic memory at embryonic enhancers identified in DNA methylation maps from adult mouse tissues.
Nature genetics, 45(10):1198–1206, 2013.
Elizabeth E. Hong, Cindy Y. Okitsu, Andrew D. Smith, and Chih-Lin Hsieh. Regionally specific and
genome-wide analyses conclusively demonstrate the absence of CpG methylation in human mitochon-
drial DNA. Molecular and Cellular Biology, 33(14):2683–2690, 2013. doi: 10.1128/MCB.00220-13.
URL http://mcb.asm.org/content/33/14/2683.abstract.
Eugene A Houseman, William P Accomando, Devin C Koestler, Brock C Christensen, Carmen J Marsit,
Heather H Nelson, John K Wiencke, and Karl T Kelsey. DNA methylation arrays as surrogate measures
of cell mixture distribution. BMC bioinformatics, 13(1):86, 2012.
Eugene Andres Houseman, John Molitor, and Carmen J Marsit. Reference-free cell mixture adjustments in
analysis of DNA methylation data. Bioinformatics, 30(10):1431–1439, 2014.
Eugene Andres Houseman, Molly L Kile, David C Christiani, Tan A Ince, Karl T Kelsey, and Carmen J Mar-
sit. Reference-free deconvolution of DNA methylation data and mediation by cell composition effects.
bioRxiv, page 037671, 2016.
Dong Hu and James C Cross. Development and function of trophoblast giant cells in the rodent placenta.
International Journal of Developmental Biology, 54(2):341, 2010.
DW Huang, BT Sherman, and RA Lempicki. Systematic and integrative analysis of large gene lists using
DAVID Bioinformatics Resources. Nature Protocols, 4(1):44–57, 2009.
Yun Huang, William A Pastor, Yinghua Shen, Mamta Tahiliani, David R Liu, and Anjana Rao. The be-
haviour of 5-hydroxymethylcytosine in bisulfite sequencing. PLoS One, 5(1):e8888, 2010.
Austin L Hughes, Jonathan A Green, Juana M Garbayo, and R Michael Roberts. Adaptive diversification
within a large family of recently duplicated, placentally expressed genes. Proceedings of the National
Academy of Sciences, 97(7):3319–3323, 2000.
Robert Illingworth, Alastair Kerr, Dina DeSousa, Helle Jørgensen, Peter Ellis, Jim Stalker, David Jackson,
Chris Clee, Robert Plumb, Jane Rogers, et al. A novel CpG island set identifies tissue-specific methylation
at developmental gene loci. PLoS biology, 6(1):e22, 2008.
Jaroslav Jelinek and Jozef Madzo. Dream: a simple method for DNA methylation profiling by high-
throughput sequencing. In Chronic Myeloid Leukemia, pages 111–127. Springer, 2016.
Peter A Jones. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nature Reviews
Genetics, 13(7):484–492, 2012.
LT Kaaij, Marc van de Wetering, Fang Fang, Benjamin Decato, Antoine Molaro, Harmen JG van de Werken,
Johan H van Es, Jurian Schuijers, Elzo de Wit, Wouter de Laat, et al. DNA methylation dynamics during
intestinal stem cell differentiation reveals enhancers driving gene expression in the villus. Genome Biol,
14:R50, 2013.
Rosa Karlić, Ho-Ryun Chung, Julia Lasserre, Kristian Vlahoviček, and Martin Vingron. Histone modification
levels are predictive for gene expression. Proceedings of the National Academy of Sciences, 107(7):
2926–2931, 2010.
Donna Karolchik, Angela S Hinrichs, Terrence S Furey, Krishna M Roskin, Charles W Sugnet, David
Haussler, and W James Kent. The ucsc table browser data retrieval tool. Nucleic acids research, 32(suppl
1):D493–D496, 2004.
Paul S Kayne, Ung-Jin Kim, Min Han, Janet R Mullen, Fuminori Yoshizaki, and Michael Grunstein. Ex-
tremely conserved histone h4 n terminus is dispensable for growth but essential for repressing the silent
mating loci in yeast. Cell, 55(1):27–39, 1988.
AD Kelly, H Kroeger, J Yamazaki, R Taby, F Neumann, S Yu, JT Lee, B Patel, Y Li, R He, et al. A CpG
island methylator phenotype in acute myeloid leukemia independent of idh mutations and associated with
a favorable outcome. Leukemia, 2017.
Theresa K Kelly, Yaping Liu, Fides D Lay, Gangning Liang, Benjamin P Berman, and Peter A Jones.
Genome-wide mapping of nucleosome positioning and DNA methylation within individual DNA
molecules. Genome research, 22(12):2497–2506, 2012.
W James Kent, Robert Baertsch, Angie Hinrichs, Webb Miller, and David Haussler. Evolution’s cauldron:
duplication, deletion, and rearrangement in the mouse and human genomes. Proceedings of the National
Academy of Sciences, 100(20):11484–11489, 2003.
WJ Kent, CW Sugnet, TS Furey, KM Roskin, TH Pringle, AM Zahler, and D Haussler. The human genome
browser at ucsc. Genome Research, 12(6):996–1006, 2002.
K Kerkel, A Spadola, E Yuan, J Kosek, L Jiang, E Hod, K Li, VV Murty, N Schupf, E Vilain, M Morris,
F Haghighi, and B Tycko. Genomic surveys by methylation-sensitive SNP analysis identify sequence-
dependent allele-specific DNA methylation. Nature Genetics, 40(7):904–908, 2008.
Batbayar Khulan, Reid F Thompson, Kenny Ye, Melissa J Fazzari, Masako Suzuki, Edyta Stasiek, Maria E
Figueroa, Jacob L Glass, Quan Chen, Cristina Montagna, et al. Comparative isoschizomer profiling of
cytosine methylation: the help assay. Genome research, 16(8):1046–1055, 2006.
Kyong-Rim Kieffer-Kwon, Zhonghui Tang, Ewy Mathe, Jason Qian, Myong-Hee Sung, Guoliang Li, Wolf-
gang Resch, Songjoon Baek, Nathanael Pruett, Lars Grøntved, et al. Interactome maps of mouse gene
regulatory domains reveal basic principles of transcriptional regulation. Cell, 155(7):1507–1520, 2013.
Rhoda J Kinsella, Andreas Kähäri, Syed Haider, Jorge Zamora, Glenn Proctor, Giulietta Spudich, Jeff
Almeida-King, Daniel Staines, Paul Derwent, Arnaud Kerhornou, et al. Ensembl biomarts: a hub for
data retrieval across taxonomic space. Database, 2011, 2011.
Rahul M Kohli and Yi Zhang. Tet enzymes, tdg and the dynamics of DNA demethylation. Nature, 502
(7472):472, 2013.
Yulia Korshunova, Rebecca K Maloney, Nathan Lakey, Robert W Citek, Blaire Bacher, Arief Budiman,
Jared M Ordway, W Richard McCombie, Jorge Leon, Jeffrey A Jeddeloh, et al. Massively parallel bisul-
phite pyrosequencing reveals the molecular complexity of breast cancer-associated cytosine-methylation
patterns obtained from tissue and serum DNA. Genome research, 18(1):19–29, 2008.
Albrecht Kossel et al. Protamines and histones. 1928.
Asmita Kulkarni, Preeti Chavan-Gautam, Savita Mehendale, Hemlata Yadav, and Sadhana Joshi. Global
DNA methylation patterns in placenta and its association with maternal hypertension in pre-eclampsia.
DNA and cell biology, 30(2):79–84, 2011.
Govindarajan Kunde-Ramamoorthy, Cristian Coarfa, Eleonora Laritsky, Noah J Kessler, R Alan Harris,
Mingchu Xu, Rui Chen, Lanlan Shen, Aleksandar Milosavljevic, and Robert A Waterland. Comparison
and quantitative verification of mapping algorithms for whole-genome bisulfite sequencing. Nucleic acids
research, page gkt1325, 2014.
T Kunieda, M Xian, E Kobayashi, T Imamichi, K Moriwaki, and Y Toyoda. Sexing of mouse preimplanta-
tion embryos by detection of y chromosome-specific sequences using polymerase chain reaction. Biology
of reproduction, 46(4):692–697, 1992.
Peter W Laird. Principles and challenges of genomewide DNA methylation analysis. Nat Rev Genet, 11(3):
191–203, Mar 2010. doi: 10.1038/nrg2732. URL http://dx.doi.org/10.1038/nrg2732.
Luca Lambertini, Tin-Lap Lee, Wai-Yee Chan, Men-Jean Lee, Andreas Diplas, James Wetmur, and Jia
Chen. Differential methylation of imprinted genes in growth-restricted placentas. Reproductive sciences,
18(11):1111–1117, 2011.
Xun Lan, Christopher Adams, Mark Landers, Miroslav Dudas, Daniel Krissinger, George Marnellos, Rus-
sell Bonneville, Maoxiong Xu, Junbai Wang, Tim H-M Huang, et al. High resolution detection and
analysis of CpG dinucleotides methylation using mbd-seq technology. PloS one, 6(7):e22226, 2011.
Gilad Landan, Netta Mendelson Cohen, Zohar Mukamel, Amir Bar, Alina Molchadsky, Ran Brosh, Shirley
Horn-Saban, Daniela Amann Zalcenstein, Naomi Goldfinger, Adi Zundelevich, et al. Epigenetic polymor-
phism and the stochastic formation of differentially methylated regions in normal and cancerous tissues.
Nature genetics, 44(11):1207–1214, 2012.
Moyra Lawrence, Sylvain Daujat, and Robert Schneider. Lateral thinking: how histone modifications regu-
late gene expression. Trends in Genetics, 32(1):42–56, 2016.
Sheng Li, Francine E Garrett-Bakelman, Altuna Akalin, Paul Zumbo, Ross Levine, Bik L To, Ian D Lewis,
Anna L Brown, Richard J D’Andrea, Ari Melnick, et al. An optimized algorithm for detecting and
annotating regional differential methylation. BMC bioinformatics, 14(Suppl 5):S10, 2013.
Zhiguang Li, Hongzheng Dai, Suzanne N Martos, Beisi Xu, Yang Gao, Teng Li, Guangjing Zhu, Dustin E
Schones, and Zhibin Wang. Distinct roles of dnmt1-dependent and-independent methylation patterns in
the genome of mouse embryonic stem cells. Genome biology, 16(1):115, 2015.
Jingtao Lilue, Anthony G Doran, Ian T Fiddes, Monica Abrudan, Joel Armstrong, Ruth Bennett, William
Chow, Joanna Collins, Anne Czechanski, Petr Danecek, et al. Multiple laboratory mouse reference
genomes define strain specific haplotypes and novel functional loci. bioRxiv, page 235838, 2018.
Ryan Lister, Mattia Pelizzola, Robert H Dowen, R David Hawkins, Gary Hon, Julian Tonti-Filippini,
Joseph R Nery, Leonard Lee, Zhen Ye, Que-Minh Ngo, et al. Human DNA methylomes at base reso-
lution show widespread epigenomic differences. nature, 462(7271):315–322, 2009.
Ryan Lister, Mattia Pelizzola, Yasuyuki S Kida, R David Hawkins, Joseph R Nery, Gary Hon, Jessica
Antosiewicz-Bourget, Ronan O’Malley, Rosa Castanon, Sarit Klugman, et al. Hotspots of aberrant epigenomic
reprogramming in human induced pluripotent stem cells. Nature, 471(7336):68–73, 2011.
Ryan Lister, Eran A Mukamel, Joseph R Nery, Mark Urich, Clare A Puddifoot, Nicholas D Johnson, Jacinta
Lucero, Yun Huang, Andrew J Dwork, Matthew D Schultz, et al. Global epigenomic reconfiguration
during mammalian brain development. Science, 341(6146), 2013.
Y. Liu, K. D. Siegmund, P. W. Laird, and B. P. Berman. Bis-SNP: Combined DNA methylation and SNP
calling for Bisulfite-seq data. Genome Biol., 13(7):R61, Jul 2012.
Yahli Lorch, Janice W LaPointe, and Roger D Kornberg. Nucleosomes inhibit the initiation of transcription
but allow chain elongation with the displacement of histones. Cell, 49(2):203–210, 1987.
Falong Lu, Yuting Liu, Lan Jiang, Shinpei Yamaguchi, and Yi Zhang. Role of tet proteins in enhancer
activity and telomere elongation. Genes & development, 28(19):2103–2119, 2014.
Chongyuan Luo, Christopher L Keown, Laurie Kurihara, Jingtian Zhou, Yupeng He, Junhao Li, Rosa Cas-
tanon, Jacinta Lucero, Joseph R Nery, Justin P Sandoval, et al. Single-cell methylomes identify neuronal
subtypes and regulatory elements in mammalian cortex. Science, 357(6351):600–604, 2017.
Jovana Maksimovic, Lavinia Gordon, and Alicia Oshlack. Swan: Subset-quantile within array normalization
for illumina infinium humanmethylation450 beadchips. Genome biology, 13(6):R44, 2012.
Alika K Maunakea, Raman P Nagarajan, Mikhail Bilenky, Tracy J Ballinger, Cletus D’Souza, Shaun D
Fouse, Brett E Johnson, Chibo Hong, Cydney Nielsen, Yongjun Zhao, et al. Conserved role of intragenic
DNA methylation in regulating alternative promoters. Nature, 466(7303):253–257, 2010.
John F McDonald, Lilya V Matyunina, Susanne Wilson, I King Jordan, Nathan J Bowen, and Wolfgang J
Miller. Ltr retrotransposons and the evolution of eukaryotic enhancers. In Evolution and Impact of
Transposable Elements, pages 3–13. Springer, 1997.
Cory McLean, Dave Bristor, Michael Hiller, Shoa Clarke, Bruce Schaar, Craig Lowe, Aaron Wenger,
and Gill Bejerano. GREAT improves functional interpretation of cis-regulatory regions. Nature
Biotechnology, 28(5):495–501, 2010.
Alexander Meissner. Epigenetic modifications in pluripotent and differentiated cells. Nature Biotechnology,
28:1079–1088, 2010.
Alexander Meissner, Andreas Gnirke, George W Bell, Bernard Ramsahoye, Eric S Lander, and Rudolf
Jaenisch. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation
analysis. Nucleic acids research, 33(18):5868–5877, 2005.
Alexander Meissner, Tarjei S Mikkelsen, Hongcang Gu, Marius Wernig, Jacob Hanna, Andrey Sivachenko,
Xiaolan Zhang, Bradley E Bernstein, Chad Nusbaum, David B Jaffe, et al. Genome-scale DNA methyla-
tion maps of pluripotent and differentiated cells. Nature, 454(7205):766–770, 2008.
Sha Mi, Xinhua Lee, Xiang-Ping Li, Geertruida M Veldman, Heather Finnerty, Lisa Racie, Edward Lavallie,
Xiang-Yang Tang, Philippe Edouard, Steve Howes, et al. Syncytin is a captive retroviral envelope protein
involved in human placental morphogenesis. Nature, 403(6771):785–789, 2000.
Thomas Mikeska, Ida LM Candiloro, and Alexander Dobrovic. The implications of heterogeneous DNA
methylation for the accurate quantification of methylation. Epigenomics, 2(4):561–573, 2010.
Fumihito Miura, Yusuke Enomoto, Ryo Dairiki, and Takashi Ito. Amplification-free whole-genome bisulfite
sequencing by post-bisulfite adaptor tagging. Nucleic acids research, 40(17):e136–e136, 2012.
Antoine Molaro, Emily Hodges, Fang Fang, Qiang Song, W Richard McCombie, Gregory J Hannon, and
Andrew D Smith. Sperm methylation profiles reveal features of epigenetic inheritance and evolution in
primates. Cell, 146(6):1029–1041, 2011.
Annalisa Morano, Tiziana Angrisano, Giusi Russo, Rosaria Landi, Antonio Pezone, Silvia Bartollino, Can-
dida Zuchegna, Federica Babbio, Ian Marc Bonapace, Brittany Allen, et al. Targeted DNA methylation by
homology-directed repair in mammalian cells. transcription reshapes methylation on the repaired gene.
Nucleic acids research, 42(2):804–821, 2013.
Arne W Mould, Zhenyi Pang, Miha Pakusch, Ian D Tonks, Mitchell Stark, Dianne Carrie, Pamela
Mukhopadhyay, Annica Seidel, Jonathan J Ellis, Janine Deakin, et al. Smchd1 regulates a subset of auto-
somal genes subject to monoallelic expression in addition to being critical for x inactivation. Epigenetics
Chromatin, 6(1):19, 2013.
Hans J Muller. Types of visible variations induced by x-rays in Drosophila. Journal of Genetics, 22(3):
299–334, 1930.
Raman P Nagarajan and Joseph F Costello. Epigenetic mechanisms in glioblastoma multiforme. In Seminars
in cancer biology, volume 19, pages 188–197. Elsevier, 2009.
Shalima S Nair, Marcel W Coolen, Clare Stirzaker, Jenny Z Song, Aaron L Statham, Dario Strbenac, Mark D
Robinson, and Susan J Clark. Comparison of methyl-DNA immunoprecipitation (medip) and methyl-CpG
binding domain (mbd) protein capture for genome-wide DNA methylation analysis reveal CpG sequence
coverage bias. Epigenetics, 6(1):34–44, 2011.
Houtan Noushmehr, Daniel J Weisenberger, Kristin Diefes, Heidi S Phillips, Kanan Pujara, Benjamin P
Berman, Fei Pan, Christopher E Pelloski, Erik P Sulman, Krishna P Bhat, et al. Identification of a CpG
island methylator phenotype that defines a distinct subgroup of glioma. Cancer cell, 17(5):510–522, 2010.
Boris Novakovic and Richard Saffery. Placental pseudo-malignancy from a DNA methylation perspective:
unanswered questions and future directions. Front. Genet, 4(285):10–3389, 2013.
Chikashi Obuse, Osamu Iwasaki, Tomomi Kiyomitsu, Gohta Goshima, Yusuke Toyoda, and Mitsuhiro
Yanagida. A conserved mis12 centromere complex is linked to heterochromatic hp1 and outer kineto-
chore protein zwint-1. Nature cell biology, 6(11):1135, 2004.
Mayumi Oda, Jacob L Glass, Reid F Thompson, Yongkai Mo, Emmanuel N Olivier, Maria E Figueroa,
Rebecca R Selzer, Todd A Richmond, Xinmin Zhang, Luke Dannenberg, et al. High-resolution genome-
wide cytosine methylation profiling with simultaneous copy number analysis and optimization for limited
cell numbers. Nucleic acids research, 37(12):3829–3839, 2009.
Heather M O’Hagan, Helai P Mohammad, and Stephen B Baylin. Double strand breaks can initiate gene
silencing and sirt1-dependent onset of DNA methylation in an exogenous promoter CpG island. PLoS
genetics, 4(8):e1000155, 2008.
Nuala A O’Leary, Mathew W Wright, J Rodney Brister, Stacy Ciufo, Diana Haddad, Rich McVeigh, Bhanu
Rajput, Barbara Robbertse, Brian Smith-White, Danso Ako-Adjei, et al. Reference sequence (refseq)
database at ncbi: current status, taxonomic expansion, and functional annotation. Nucleic acids research,
44(D1):D733–D745, 2015.
Ada L Olins and Donald E Olins. Spheroid chromatin units (ν bodies). Science, 183(4122):330–332, 1974.
Athma A Pai, Jordana T Bell, John C Marioni, Jonathan K Pritchard, and Yoav Gilad. A genome-wide study
of DNA methylation patterns and gene expression levels in multiple human and chimpanzee tissues. PLoS
Genet, 7(2):e1001316, 2011.
Eberhard Passarge. Emil Heitz and the concept of heterochromatin: longitudinal chromosome differentiation
was recognized fifty years ago. American journal of human genetics, 31(2):106, 1979.
WA Pastor, UJ Pape, Y Huang, HR Henderson, R Lister, M Ko, EM McLoughlin, Y Brudno, S Mahapatra,
P Kapranov, JR Ecker, S Agarwal, A Rao, et al. Genome-wide mapping of 5-hydroxymethylcytosine in
embryonic stem cells. Nature, 473(7347):394–397, 2011.
Julian R Peat, Wendy Dean, Stephen J Clark, Felix Krueger, Sébastien A Smallwood, Gabriella Ficz,
Jong Kyoung Kim, John C Marioni, Timothy A Hore, and Wolf Reik. Genome-wide bisulfite sequencing
in zygotes identifies demethylation targets and maps the contribution of tet3 oxidation. Cell reports, 9(6):
1990–2000, 2014.
Guillaume Pidoux, Pascale Gerbaud, Sédami Gnidehou, Michael Grynberg, Graziello Geneau, Jean Guibourdenche,
Diane Carette, Laurent Cronier, Danièle Evain-Brion, André Malassiné, et al. Zo-1 is
involved in trophoblastic cell differentiation in human placenta. American journal of Physiology-Cell
Physiology, 298(6):C1517–C1526, 2010.
Ruth Pidsley, Elena Zotenko, Timothy J Peters, Mitchell G Lawrence, Gail P Risbridger, Peter Molloy,
Susan Van Dijk, Beverly Muhlhausler, Clare Stirzaker, and Susan J Clark. Critical evaluation of the
illumina methylationepic beadchip microarray for whole-genome DNA methylation profiling. Genome
biology, 17(1):208, 2016.
Jianghan Qu, Meng Zhou, Qiang Song, Elizabeth E Hong, and Andrew D Smith. Mlml: consistent si-
multaneous estimates of DNA methylation and hydroxymethylation. Bioinformatics, 29(20):2645–2646,
2013.
Jianghan Qu, Emily Hodges, Antoine Molaro, Pascal Gagneux, Matthew D Dean, Gregory J Hannon, and
Andrew D Smith. Evolutionary expansion of DNA hypomethylation in the mammalian germline genome.
Genome research, 28(2):145–158, 2018.
Aaron R Quinlan. Bedtools: the swiss-army tool for genome feature analysis. Current protocols in
bioinformatics, pages 11–12, 2014.
Bernard H Ramsahoye, Detlev Biniszkiewicz, Frank Lyko, Victoria Clark, Adrian P Bird, and Rudolf
Jaenisch. Non-CpG methylation is prevalent in embryonic stem cells and may be mediated by DNA
methyltransferase 3a. Proceedings of the National Academy of Sciences, 97(10):5237–5242, 2000.
Satyanarayan Rao, Tsu-Pei Chiu, Judith F Kribelbauer, Richard S Mann, Harmen J Bussemaker, and Remo
Rohs. Systematic prediction of DNA shape changes due to CpG methylation explains epigenetic effects
on protein–DNA binding. Epigenetics & chromatin, 11(1):6, 2018.
Naim U Rashid, Paul G Giresi, Joseph G Ibrahim, Wei Sun, and Jason D Lieb. Zinba integrates local
covariates with DNA-seq data to identify broad and narrow regions of enrichment, even within amplified
genomic regions. Genome biology, 12(7):R67, 2011.
A Razin, C Webb, M Szyf, J Yisraeli, A Rosenthal, T Naveh-Many, N Sciaky-Gallili, and H Cedar. Vari-
ations in DNA methylation during mouse cell differentiation in vivo and in vitro. Proceedings of the
National Academy of Sciences, 81(8):2275–2279, 1984.
Syed Arif Abdul Rehman, Yosua Adi Kristariyanto, Soo-Youn Choi, Pedro Junior Nkosi, Simone Weidlich,
Karim Labib, Kay Hofmann, and Yogesh Kulathu. Mindy-1 is a member of an evolutionarily conserved
and structurally distinct new family of deubiquitinating enzymes. Molecular cell, 63(1):146–155, 2016.
Judith Reichmann, James P Reddington, Diana Best, David Read, Rupert Öllinger, Richard R Meehan, and
Ian R Adams. The genome-defence gene tex19.1 suppresses line-1 retrotransposons in the placenta and
prevents intra-uterine growth retardation in mice. Human molecular genetics, 22(9):1791–1806, 2013.
W Reik, W Dean, and J Walter. Epigenetic reprogramming in mammalian development. Science, 293
(5532):1089–1093, 2001.
JM Roberts and DW Cooper. Pathogenesis and genetics of pre-eclampsia. The Lancet, 357(9249):53–56,
2001.
A Gordon Robertson, Mikhail Bilenky, Angela Tam, Yongjun Zhao, Thomas Zeng, Nina Thiessen, Timothee
Cezard, Anthony P Fejes, Elizabeth D Wederell, Rebecca Cullum, et al. Genome-wide relationship
between histone h3 lysine 4 mono-and tri-methylation and transcription factor binding. Genome research,
18(12):1906–1917, 2008.
KD Robertson. DNA Methylation and human disease. Nature Reviews Genetics, 6(8):597–610, 2005.
James T Robinson, Helga Thorvaldsdóttir, Wendy Winckler, Mitchell Guttman, Eric S Lander, Gad Getz,
and Jill P Mesirov. Integrative genomics viewer. Nature biotechnology, 29(1):24, 2011.
Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. edger: a bioconductor package for differential
expression analysis of digital gene expression data. Bioinformatics, 26(1):139–140, 2010.
VM Ruda, SB Akopov, DO Trubetskoy, NL Manuylov, AS Vetchinova, LL Zavalova, LG Nikolaev, and
ED Sverdlov. Tissue specificity of enhancer and promoter activities of a herv-k (hml-2) ltr. Virus research,
104(1):11–16, 2004.
Peter J Sabo, Michael Hawrylycz, James C Wallace, Richard Humbert, Man Yu, Anthony Shafer, Janelle
Kawamoto, Robert Hall, Joshua Mack, Michael O Dorschner, et al. Discovery of functional noncoding
elements by digital analysis of chromatin structure. Proceedings of the National Academy of Sciences of
the United States of America, 101(48):16837–16842, 2004.
Abdulrahman Salhab, Karl Nordström, Kathrin Kattler, Peter Ebert, Fidel Ramirez, Laura Arrigoni, Fabian
Müller, Cristina Cadenas, Jan Hengstler, Thomas Lengauer, et al. Partially methylated domains are hallmarks
of a cell specific epigenome topology. bioRxiv, page 249334, 2018.
B Sarver, S Keeble, T Cosart, P Tucker, and Matthew D Dean. Phylogenomic insights into genome evolution
and speciation in mice. Genome Biology and Evolution, 2017.
Felix Schlesinger, Andrew D Smith, Thomas R Gingeras, Gregory J Hannon, and Emily Hodges. De novo
DNA demethylation and noncoding transcription define active intergenic regulatory elements. Genome
research, 23(10):1601–1614, 2013.
Diane I Schroeder, Paul Lott, Ian Korf, and Janine M LaSalle. Large-scale methylation domains mark a
functional subset of neuronally expressed genes. Genome research, 21(10):1583–1591, 2011.
Diane I Schroeder, John D Blair, Paul Lott, Hung On Ken Yu, Danna Hong, Florence Crary, Paul Ashwood,
Cheryl Walker, Ian Korf, Wendy P Robinson, et al. The human placenta methylome. Proceedings of the
national academy of sciences, 110(15):6037–6042, 2013.
Diane I Schroeder, Kartika Jayashankar, Kory C Douglas, Twanda L Thirkill, Daniel York, Pete J Dickinson,
Lawrence E Williams, Paul B Samollow, Pablo J Ross, Danika L Bannasch, et al. Early developmental
and evolutionary origins of gene body DNA methylation patterns in mammalian placentas. PLoS Genet,
11(8):e1005442, 2015.
Matthew D Schultz, Robert J Schmitz, and Joseph R Ecker. ‘Leveling’ the playing field for analyses of single-base
resolution DNA methylomes. Trends in genetics: TIG, 28(12):583, 2012.
Stefanie Seisenberger, Simon Andrews, Felix Krueger, Julia Arand, Jorn Walter, Fatima Santos, Christian
Popp, Bernard Thienpont, Wendy Dean, and Wolf Reik. The dynamics of genome-wide DNA methylation
reprogramming in mouse primordial germ cells. Molecular Cell, 48(6):849–862, 2012.
David Serre, Byron H Lee, and Angela H Ting. Mbd-isolated genome sequencing provides a high-
throughput and comprehensive survey of DNA methylation in the human genome. Nucleic acids research,
38(2):391–399, 2010.
Amanda N Sferruzzi-Perri and Emily J Camm. The programming power of the placenta. Frontiers in
physiology, 7, 2016.
Amanda N Sferruzzi-Perri, Anne M Macpherson, Claire T Roberts, and Sarah A Robertson. Csf2 null
mutation alters placental gene expression and trophoblast glycogen cell and giant cell abundance in mice.
Biology of reproduction, 81(1):207–221, 2009.
Karyn L Sheaffer, Rinho Kim, Reina Aoki, Ellen N Elliott, Jonathan Schug, Lukas Burger, Dirk Schübeler,
and Klaus H Kaestner. DNA methylation is required for the control of stem cell differentiation in the
small intestine. Genes & development, 28(6):652–664, 2014.
Lanlan Shen, Yi Guo, Xinli Chen, Saira Ahmed, and J Issa. Optimizing annealing temperature overcomes
bias in bisulfite pcr methylation analysis. Biotechniques, 42(1):48, 2007.
Y Shi, J Chen, Z Li, Z Zhang, H Yu, K Sun, X Wang, X Song, Y Wang, Y Zhen, et al. C10orf97 is a novel
tumor-suppressor gene of non-small-cell lung cancer and a functional variant of this gene increases the
risk of non-small-cell lung cancer. Oncogene, 30(39):4107, 2011.
D Shibata. Cancer. Heterogeneity and tumor history. Science, 336:304–305, 2012.
Kenjiro Shirane, Hidehiro Toh, Hisato Kobayashi, Fumihito Miura, Hatsune Chiba, Takashi Ito, Tomohiro
Kono, and Hiroyuki Sasaki. Mouse oocyte methylomes at base resolution reveal genome-wide accu-
mulation of non-CpG methylation and role of DNA methyltransferases. PLoS genetics, 9(4):e1003439,
2013.
Robert Shoemaker, Jie Deng, Wei Wang, and Kun Zhang. Allele-specific methylation is prevalent and is
contributed by CpG-SNPs in the human genome. Genome Research, 20:883–889, 2010.
Jared T Simpson, Rachael E Workman, PC Zuzarte, Matei David, LJ Dursi, and Winston Timp. Detecting
DNA cytosine methylation using nanopore sequencing. nature methods, 14(4):407, 2017.
ME Skinner, A V Uzilov, LD Stein, CJ Mungall, and IH Holmes. Jbrowse: a next-generation genome
browser. Genome Research, 19(9):1630–1638, 2009.
Sébastien A Smallwood, Heather J Lee, Christof Angermueller, Felix Krueger, Heba Saadeh, Julian Peat,
Simon R Andrews, Oliver Stegle, Wolf Reik, and Gavin Kelsey. Single-cell genome-wide bisulfite se-
quencing for assessing epigenetic heterogeneity. Nature methods, 11(8):817–820, 2014.
Arian FA Smit, Robert Hubley, and P Green. Repeatmasker. Published on the web at
http://www.repeatmasker.org, 1996.
Andrew D Smith, Wen-Yu Chung, Emily Hodges, Jude Kendall, Greg Hannon, James Hicks, Zhenyu Xuan,
and Michael Q Zhang. Updates to the rmap short-read mapping software. Bioinformatics, 25(21):2841–
2842, 2009.
Chun-Xiao Song, Chengqi Yi, and Chuan He. Mapping recently identified nucleotide variants in the genome
and transcriptome. Nature biotechnology, 30(11):1107–1116, 2012.
Qiang Song, Benjamin Decato, Elizabeth E Hong, Meng Zhou, Fang Fang, Jianghan Qu, Tyler Garvin,
Michael Kessler, Jun Zhou, and Andrew D Smith. A reference methylome database and analysis pipeline
to facilitate integrative and comparative epigenomics. PloS one, 8(12):e81148, 2013.
Andrea Sottoriva, Haeyoun Kang, Zhicheng Ma, Trevor A Graham, Matthew P Salomon, Junsong Zhao,
Paul Marjoram, Kimberly Siegmund, Michael F Press, Darryl Shibata, et al. A big bang model of human
colorectal tumor growth. Nature genetics, 47(3):209–216, 2015.
Michael Stevens, Jeffrey B Cheng, Daofeng Li, Mingchao Xie, Chibo Hong, Cécile L Maire, Keith L
Ligon, Martin Hirst, Marco A Marra, Joseph F Costello, et al. Estimating absolute methylation lev-
els at single-CpG resolution from methylation enrichment and restriction enzyme sequencing methods.
Genome research, 23(9):1541–1553, 2013.
Beth A Sullivan and Gary H Karpen. Centromeric chromatin exhibits a histone modification pattern that is
distinct from both euchromatin and heterochromatin. Nature Structural and Molecular Biology, 11(11):
1076, 2004.
A Szwagierczak, S Bultmann, CS Schmidt, F Spada, and H Leonhardt. Sensitive enzymatic quantification
of 5-hydroxymethylcytosine in genomic DNA. Nucleic Acids Research, 38(19):e181, 2010.
Aleksandra Szwagierczak, Andreas Brachmann, Christine S Schmidt, Sebastian Bultmann, Heinrich Leonhardt,
and Fabio Spada. Characterization of pvurts1i endonuclease as a tool to investigate genomic
5-hydroxymethylcytosine. Nucleic acids research, 39(12):5149–5156, 2011.
Tomomitsu Tahara, Eiichiro Yamamoto, Priyanka Madireddi, Hiromu Suzuki, Reo Maruyama, Woonbok
Chung, Judith Garriga, Jaroslav Jelinek, Hiro-o Yamano, Tamotsu Sugai, et al. Colorectal carcino-
mas with CpG island methylator phenotype 1 frequently contain mutations in chromatin regulators.
Gastroenterology, 146(2):530–538, 2014.
Mamta Tahiliani, Kian Peng Koh, Yinghua Shen, William A Pastor, Hozefa Bandukwala, Yevgeny Brudno,
Suneet Agarwal, Lakshminarayan M Iyer, David R Liu, L Aravind, et al. Conversion of 5-methylcytosine
to 5-hydroxymethylcytosine in mammalian DNA by mll partner tet1. Science, 324(5929):930–935, 2009.
Oluwatosin Taiwo, Gareth A Wilson, Tiffany Morris, Stefanie Seisenberger, Wolf Reik, Daniel Pearce,
Stephan Beck, and Lee M Butcher. Methylome analysis using medip-seq with low DNA concentrations.
Nature protocols, 7(4):617–636, 2012.
Yuta Takahashi, Jun Wu, Keiichiro Suzuki, Paloma Martinez-Redondo, Mo Li, Hsin-Kai Liao, Min-Zu Wu,
Reyna Hernández-Benítez, Tomoaki Hishida, Maxim Nikolaievich Shokhirev, et al. Integration of CpG-free
DNA induces de novo methylation of CpG islands in pluripotent stem cells. Science, 356(6337):
503–508, 2017.
Kristen H Taylor, Robin S Kramer, J Wade Davis, Juyuan Guo, Deiter J Duff, Dong Xu, Charles W Caldwell,
and Huidong Shi. Ultradeep bisulfite sequencing analysis of DNA methylation patterns in multiple gene
promoters by 454 sequencing. Cancer research, 67(18):8511–8518, 2007.
Bernard Thienpont, Jessica Steinbacher, Hui Zhao, Flora D’Anna, Anna Kuchnio, Athanasios Ploumakis,
Bart Ghesquière, Laurien Van Dyck, Bram Boeckx, Luc Schoonjans, et al. Tumor hypoxia causes DNA
hypermethylation by reducing tet activity. Nature, 537(7618):63, 2016.
Cristian Tomasetti, Lu Li, and Bert Vogelstein. Stem cell divisions, somatic mutations, cancer etiology, and
cancer prevention. Science, 355(6331):1330–1334, 2017.
Minoru Toyota, Nita Ahuja, Mutsumi Ohe-Toyota, James G Herman, Stephen B Baylin, and Jean-Pierre J
Issa. CpG island methylator phenotype in colorectal cancer. Proceedings of the National Academy of
Sciences, 96(15):8681–8686, 1999.
Patrick Trojer and Danny Reinberg. Facultative heterochromatin: is there a distinctive molecular signature?
Molecular cell, 28(1):1–13, 2007.
Priscilla K Tucker. Systematics of the genus mus. The Mouse in Biomedical Research: History, Wild mice,
and Genetics, 1:13–21, 2006.
Nathan D VanderKraats, Jeffrey F Hiken, Keith F Decker, and John R Edwards. Discovering high-resolution
patterns of differential DNA methylation that correlate with gene expression changes. Nucleic Acids
Research, 2013.
Elvin Wagenblast, Mar Soto, Sara Gutiérrez-Ángel, Christina A Hartl, Annika L Gable, Ashley R Maceli,
Nicolas Erard, Alissa M Williams, Sun Y Kim, Steffen Dickopf, et al. A model of breast cancer hetero-
geneity reveals vascular mimicry as a driver of metastasis. Nature, 520(7547):358, 2015.
Colum P Walsh, J Richard Chaillet, and Timothy H Bestor. Transcription of iap endogenous retroviruses is
constrained by cytosine methylation. Nature genetics, 20(2):116–117, 1998.
Yiping Wang, Mengtao Xiao, Xiufei Chen, Leilei Chen, Yanping Xu, Lei Lv, Pu Wang, Hui Yang,
Shenghong Ma, Huaipeng Lin, et al. Wt1 recruits tet2 to regulate its target gene expression and sup-
press leukemia cell proliferation. Molecular cell, 57(4):662–673, 2015.
Peter M Warnecke, Clare Stirzaker, John R Melki, Douglas S Millar, Cheryl L Paul, and Susan J Clark.
Detection and measurement of pcr bias in quantitative methylation analysis of bisulphite-treated DNA.
Nucleic acids research, 25(21):4422–4426, 1997.
James D Watson, Francis HC Crick, et al. Molecular structure of nucleic acids. Nature, 171(4356):737–738,
1953.
Michael Weber, Jonathan J Davies, David Wittig, Edward J Oakeley, Michael Haase, Wan L Lam, and
Dirk Schuebeler. Chromosome-wide and promoter-specific analyses identify sites of differential DNA
methylation in normal and transformed human cells. Nature genetics, 37(8):853–862, 2005.
K Williams, J Christensen, MT Pedersen, JV Johansen, PA Cloos, J Rappsilber, and K Helin. TET1 and
hydroxymethylcytosine in transcription and DNA methylation fidelity. Nature, 473(7347):343–348, 2011.
Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics and intelligent
laboratory systems, 2(1-3):37–52, 1987.
Hao Wu, Ana C D’Alessio, Shinsuke Ito, Kai Xia, Zhibin Wang, Kairong Cui, Keji Zhao, Yi Eve Sun, and
Yi Zhang. Dual functions of tet1 in transcriptional regulation in mouse embryonic stem cells. Nature,
473(7347):389–393, 2011.
Phillip Wulfridge, Ben Langmead, Andrew P Feinberg, and Kasper Hansen. Choice of reference genome
can introduce massive bias in bisulfite sequencing data. bioRxiv, page 076844, 2016.
Yuanxin Xi, Christoph Bock, Fabian Müller, Deqiang Sun, Alexander Meissner, and Wei Li. Rrbsmap:
a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing.
Bioinformatics, 28(3):430–432, 2012.
Hehuang Xie, Min Wang, Alexandre de Andrade, Maria de F Bonaldo, Vasil Galat, Kelly Arndt, Veena Ra-
jaram, Stewart Goldman, Tadanori Tomita, and Marcelo B Soares. Genome-wide quantitative assessment
of variation in DNA methylation patterns. Nucleic acids research, 39(10):4099–4108, 2011.
Zhan Yao, Silvia Fenoglio, Ding Cheng Gao, Matthew Camiolo, Brendon Stiles, Trine Lindsted, Michaela
Schlederer, Chris Johns, Nasser Altorki, Vivek Mittal, et al. Tgf-β il-6 axis mediates selective and adaptive
mechanisms of resistance to molecular targeted therapy in lung cancer. Proceedings of the National
Academy of Sciences, 107(35):15535–15540, 2010.
Yasushi Yatabe, Simon Tavaré, and Darryl Shibata. Investigating stem cells in human colon by using methylation
patterns. Proceedings of the National Academy of Sciences, 98(19):10839–10844, 2001.
Ahuvi Yearim, Sahar Gelfman, Ronna Shayevitch, Shai Melcer, Ohad Glaich, Jan-Philipp Mallm, Malka
Nissim-Rafinia, Ayelet-Hashahar S Cohen, Karsten Rippe, Eran Meshorer, et al. Hp1 is involved in
regulating the global impact of DNA methylation on alternative splicing. Cell reports, 10(7):1122–1134,
2015.
Miao Yu, Gary C Hon, Keith E Szulwach, Chun-Xiao Song, Liang Zhang, Audrey Kim, Xuekun Li, Qing
Dai, Yin Shen, Beomseok Park, et al. Base-resolution analysis of 5-hydroxymethylcytosine in the mam-
malian genome. Cell, 149(6):1368–1380, 2012.
Min Yu, Shannon Stott, Mehmet Toner, Shyamala Maheswaran, and Daniel A Haber. Circulating tumor
cells: approaches to isolation and characterization. The Journal of cell biology, 192(3):373–382, 2011.
Ryan KC Yuen, Maria S Peñaherrera, Peter von Dadelszen, Deborah E McFadden, and Wendy P Robinson.
DNA methylation profiling of human placentas reveals promoter hypomethylation of multiple genes in
early-onset preeclampsia. European journal of Human Genetics, 18(9):1006–1012, 2010.
Assaf Zemach, Ivy McDaniel, Pedro Silva, and Daniel Zilberman. Genome-Wide Evolutionary Analysis of
Eukaryotic DNA Methylation. Science, 328(5980):916–919, 2010.
Bo Zhang, Yan Zhou, Nan Lin, Rebecca F Lowdon, Chibo Hong, Raman P Nagarajan, Jeffrey B Cheng,
Daofeng Li, Michael Stevens, Hyung Joo Lee, et al. Functional DNA methylation differences between
tissues, cell types, and across individuals discovered using the m&m algorithm. Genome research, 23(9):
1522–1540, 2013.
Yong Zhang, Tao Liu, Clifford A Meyer, Jérôme Eeckhoute, David S Johnson, Bradley E Bernstein, Chad
Nusbaum, Richard M Myers, Myles Brown, Wei Li, et al. Model-based analysis of chip-seq (macs).
Genome biology, 9(9):R137, 2008.
Yu Zhang, Yunlong Xiang, Qiangzong Yin, Zhenhai Du, Xu Peng, Qiujun Wang, Miguel Fidalgo, Weikun
Xia, Yuanyuan Li, Zhen-ao Zhao, et al. Dynamic epigenomic landscapes during early lineage specifica-
tion in mouse embryos. Nature genetics, 50(1):96, 2018.
Xiaoqi Zheng, Qian Zhao, Hua-Jun Wu, Wei Li, Haiyun Wang, Clifford A Meyer, Qian Alvin Qin, Han Xu,
Chongzhi Zang, Peng Jiang, et al. Methylpurify: tumor purity deconvolution and differential methylation
detection from single tumor DNA methylomes. Genome Biol, 15(8):419, 2014.
Wanding Zhou, Huy Q Dinh, Zachary Ramjan, Daniel J Weisenberger, Charles M Nicolet, Hui Shen, Pe-
ter W Laird, and Benjamin P Berman. DNA methylation loss in late-replicating domains is linked to
mitotic cell division. Nature genetics, page 1, 2018.
Appendix A
Supplementary Figures
[Figure A.1 plot. Axes: mean CpG coverage depth (0 to 120) and bin size (2000 to 12000). Legend: Cow, Dog, Horse, Human, Mouse, Rhesus, Squirrel Monkey.]
Figure A.1: Selected bin sizes for each segmented sample colored by species and plotted against sequencing
depth.
[Figure A.2 genome browser views (hg19). Panel A: HOXA, HOXB, HOXC, and HOXD clusters, 1 Mb scale bar; tracks: RefSeq Genes, PMDs, Roadmap Segments, Calu1 Lung, HSC. Panel B: chr3, chr4, and chr7, 10 Mb scale bar; tracks: Satellite Repeats, PMDs, Roadmap Segments, Calu1 Lung, HSC, Centromere.]
Figure A.2: (A) HOX gene clusters display abnormally large, near-complete hypomethylated regions in
non-PC samples. (B) Hypomethylation in non-PC samples is prevalent near centromeres and appears linked
to satellite repeat density.
[Figure A.3 heatmap. Rows and columns list the same 26 samples, in plotted order: Decato.2017.Mouse_E15.JZ, Kobayashi.2012.Mouse_Oocyte, Wang.2014.Mouse_Oocyte, Decato.2018.Mouse_4T1AF5, Decato.2018.Mouse_4T1BB3, Decato.2018.Mouse_4T1BB2, Decato.2018.Mouse_4T1R3, Decato.2018.Mouse_4T1AO6, Decato.2018.Mouse_4T1H3, Decato.2018.Mouse_4T1parent, Decato.2017.Mouse_E18.JZ, Hon.2013.Mouse_Placenta, Decato.2017.Mouse_MBS1852P8_mm10, Decato.2017.Mouse_BIK2380P1_mm10, Decato.2017.Mouse_BIK2380P2_mm10, Decato.2017.Mouse_DOT2381P2_mm10, Decato.2017.Mouse_DOT2381P5_mm10, Decato.2017.Mouse_MBS1852P5_mm10, Decato.2017.Mouse_SFM1923P5_mm10, Decato.2017.Mouse_STF1955P3_mm10, Decato.2017.Mouse_SFM1923P4_mm10, Decato.2017.Mouse_STF1955P2_mm10, Branco.2016.Mouse_E7.5.EPC.Ctrl, Decato.2017.Mouse_E15.LZ, Decato.2017.Mouse_E18.LZ, Schroeder.2015.Mouse_Placenta. Color scale: 0.3 to 0.9.]
Figure A.3: Pairwise Jaccard index of segmented PMDs in mouse PC samples.
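The Jaccard index in Figure A.3 compares two PMD segmentations by the size of their intersection relative to their union. The text does not say how the index was computed, so the following Python sketch assumes a base-pair-level definition over half-open (chromosome, start, end) intervals. The function names are illustrative and are not taken from the thesis pipeline.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch on the same chromosome."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def total_bp(intervals: List[Interval]) -> int:
    """Number of base pairs covered by a merged interval list."""
    return sum(end - start for _, start, end in intervals)


def jaccard(a: List[Interval], b: List[Interval]) -> float:
    """Base-pair Jaccard index |A and B| / |A or B| (assumed definition)."""
    a_m, b_m = merge(a), merge(b)
    union = total_bp(merge(a_m + b_m))
    if union == 0:
        return 0.0
    # Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
    intersection = total_bp(a_m) + total_bp(b_m) - union
    return intersection / union


def pairwise_jaccard(samples: Dict[str, List[Interval]]) -> Dict[Tuple[str, str], float]:
    """Jaccard index for every ordered pair of samples, as in a heatmap."""
    names = sorted(samples)
    return {(x, y): jaccard(samples[x], samples[y]) for x in names for y in names}
```

Two samples with identical PMD calls score 1, and samples with no shared PMD bases score 0.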
[Figure A.4 plots. Panel A: genome browser tracks for GM12878, HMEC, and HepG2 with chromHMM annotations, PMDs, and RefSeq Genes. chromHMM legend: 1: Active promoter; 2: Weak promoter; 3: Poised promoter; 4/5: Strong enhancer; 6/7: Poised enhancer; 8: Insulator; 9/10: Transcriptional transition/elongation; 11: Weak transcribed; 12: Polycomb-repressed; 13: Heterochromatin; 14/15: Repetitive/CNV. Panel B: log(observed/expected), from -5 to 0, for each of the 15 chromHMM states in Human_HMEC, Human_HepG2, and Human_GM12878.]
Figure A.4: (A) UCSC genome browser plot showing chromHMM annotations overlaid with PMDs for cell
types that have been annotated. (B) Observed/expected ratios of different chromHMM annotations inside
PMDs.
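An observed/expected ratio of this kind can be illustrated with a short Python sketch. The exact definition and the base of the logarithm are not given here, so the sketch assumes that the expected overlap is the annotation's total length scaled by the fraction of the genome covered by PMDs, and it uses the natural logarithm. All names are hypothetical.

```python
import math
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), half-open, on a single chromosome


def overlap_bp(a: List[Span], b: List[Span]) -> int:
    """Total overlap between two sorted, non-overlapping interval lists."""
    i = j = total = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            total += end - start
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def log_obs_exp(annotation: List[Span], pmds: List[Span], genome_size: int) -> float:
    """log(observed/expected) overlap of an annotation with PMDs (assumed definition)."""
    observed = overlap_bp(annotation, pmds)
    annotation_bp = sum(end - start for start, end in annotation)
    pmd_bp = sum(end - start for start, end in pmds)
    expected = annotation_bp * pmd_bp / genome_size
    if expected == 0:
        raise ValueError("expected overlap is zero; the ratio is undefined")
    if observed == 0:
        return float("-inf")
    return math.log(observed / expected)
```

Under this definition, negative values indicate that an annotation is depleted inside PMDs relative to a uniform placement.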
[Figure A.5 genome browser plot. Region: chr1: p36.33-p12. Samples: Placenta, Liver, Lung. Tracks: RefSeq Genes; CpG (range: 0-1); non-CpG (range: 0-0.3); non-CpG (range: 0-0.08).]
Figure A.5: UCSC genome browser plot comparing CpG methylation levels (yellow) and PMD estimates
(black) to the fraction of non-CpG cytosines in each 50 kb bin that display nonzero methylation levels
(green and pink).
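The binned non-CpG statistic described in the caption can be sketched in Python as follows. The input format (position and methylation level per non-CpG cytosine on one chromosome) and the function name are assumptions made for illustration; only the 50 kb bin size comes from the caption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def nonzero_fraction_by_bin(
    sites: Iterable[Tuple[int, float]], bin_size: int = 50_000
) -> Dict[int, float]:
    """Fraction of non-CpG cytosines per bin whose methylation level is above zero.

    sites: (position, methylation_level) pairs for one chromosome.
    Returns a mapping from bin start coordinate to the fraction.
    Bins with no covered cytosines are absent from the result.
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [nonzero sites, all sites]
    for position, level in sites:
        entry = counts[position // bin_size]
        entry[1] += 1
        if level > 0:
            entry[0] += 1
    return {
        index * bin_size: nonzero / total
        for index, (nonzero, total) in sorted(counts.items())
    }
```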
[Figure A.6 plots. One panel per species: Human, Rhesus, Squirrel Monkey, Mouse, Dog, Horse, Cow. x-axis: position relative to the PMD left bound, with ticks at 50kb and 100kb on either side. Plotted quantities: %mCpG (0.00 to 1.00) and Mean CpG Density (30 to 70).]
Figure A.6: Metagene plot of methylation level and CpG density as a function of distance from PMD
boundary stratified by species.
[Figure A.7 histograms. Panels: HMEC, HCT116, MCF7, HepG2. x-axis: Distance (bp) from CTCF-bound site to nearest PMD boundary (-200kb to 200kb), with the PMD side marked. y-axis: Count (0 to 400).]
Figure A.7: Histograms of CTCF-bound site distances from PMD boundaries.
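The distance measured in Figure A.7 can be sketched with a binary search over PMD boundaries. The sign convention below (positive inside a PMD, negative outside) is an assumption chosen for illustration, as are the function name and the single-chromosome input format.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple


def signed_distance_to_boundary(site: int, pmds: List[Tuple[int, int]]) -> int:
    """Distance from a site to the nearest PMD boundary on one chromosome.

    pmds: sorted, non-overlapping (start, end) half-open intervals.
    Assumed convention: positive if the site lies inside a PMD, negative otherwise.
    """
    boundaries = sorted(b for pmd in pmds for b in pmd)
    if not boundaries:
        raise ValueError("no PMDs on this chromosome")
    # The nearest boundary is one of the two flanking the insertion point.
    k = bisect_left(boundaries, site)
    candidates = boundaries[max(k - 1, 0):k + 1]
    distance = min(abs(site - b) for b in candidates)
    # Find the last PMD starting at or before the site and test containment.
    starts = [start for start, _ in pmds]
    index = bisect_right(starts, site) - 1
    inside = index >= 0 and site < pmds[index][1]
    return distance if inside else -distance
```

Applying this to every CTCF-bound site and binning the results would give a histogram of the kind shown in the figure.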
[Figure A.8 plots. Main panel: log(external O/E) - log(internal O/E), from -0.4 to 0.2, for each repeat family (LINE:CR1, LINE:L1, LINE:L2, LINE:RTE-BovB, LTR:ERV1, LTR:ERVK, LTR:ERVL, SINE:Alu, SINE:B2, SINE:B4, SINE:Core-RTE, SINE:ID, SINE:MIR, SINE:tRNA, SINE:tRNA-Core-RTE) in Human, Rhesus, Squirrel monkey, Mouse, Dog, Horse, and Cow; additional labels: SINECs, EREs. Inset: the same quantity, from 0.0 to 0.4, for AluJ, AluS, and AluY in Human, Rhesus, and Squirrel Monkey.]
Figure A.8: Retrotransposons at PMD boundaries are included/excluded in the PMD in a family-specific
manner. Difference in observed/expected ratio for each family by species, with a zoom out showing that the
youngest Alu elements are excluded from PMDs more than the oldest Alu elements.
[Figure A.9 scatterplot. Axes: % Tumor Nuclei (60 to 100) and Total bp in PMDs (0.0e+00 to 1.2e+09).]
Figure A.9: Scatterplot of TCGA-reported primary tumor purity against number of basepairs segmented
into PMDs.
[Figure A.10 genome browser plots (mm10, 2 Mb scale). Regions: chr11:qA3.2, chr3:qG1, chr18:qB1, chr3:qE3, chr4:qC3, chr11:qB3. Tracks: RefSeq Genes; All PMDs; Mouse 4T1 Parent and Placenta; Human 1650Lung, 441NSCLC, Calu1, and M3Lung. RefSeq gene labels in view: Gm10466, 4933427E13Rik, 5730522E02Rik, Mir6372, Fancl, Vrk2, Gm12070, Plppr4, Plppr5, Snx7, Mir137, AK076759, Pik3c3, Pdcd10, Serpini1, Platr10, Golim4, Fstl5, Gm11758, Gm11757, Gm13871, 2310002L09Rik, Frmd3, Kdm4c, Elac2, Arhgap44, B430202K04Rik, Myocd, Gm12295.]
Figure A.10: Top 6 most conserved mouse escapee genes and their methylation state in homologous regions
of human.
[Figure A.11 genome browser plot (hg19, 500 kb scale). Region: chr12, 39,000,000 to 39,500,000. Bladder Urothelial Carcinoma sample tracks: BT-A2LA-01A, H4-A2HQ-01A, DK-A1AA-01A, BT-A20V-01A, DK-A1AG-01A. RefSeq Genes: ALG10B, CPNE8, KIF21A, ABCD2.]
Figure A.11: Adjacent escapee genes showing differential escapee state in TCGA bladder cancer samples.
[Figure A.12 scatterplots. Panels: Human, with x-axis Age (years), and Mouse, with x-axis Age (weeks). y-axis: PMD Shadow Depth (%), from -0.05 to 0.15. Legend: Whole frontal cortex, Neurons, Non-neurons, Middle front gyrus.]
Figure A.12: PMD Shadow depth of brain samples organized by age and colored by cell type.
[Figure A.13 plots. Panels: Coding Sequences; Intergenic; Distal CpG Islands; LINEs / SINEs / LTRs; Transcription End Sites; TSS with CGI; TSS without CGI. x-axis: Distance from HMR core (kb), from -4 to 4. y-axis: %mCpG, from 0.00 to 0.75. Legend (Cell type): Human_Adipose, Human_Adrenal-gland, Human_Esophagus, Human_Fetal-Adrenal-gland, Human_Fetal-Heart, Human_Fetal-Intestine-Large, Human_Fetal-Spinal-Cord, Human_Fetal-Stomach, Human_fMuscle-leg, Human_fMuscle-trunk, Human_Foreskin-fibroblast-iPSC-1911, Human_H1-BMP4, Human_H1-Mesenchymal, Human_H1-Mesendoderm, Human_H1-NP, Human_H9, Human_Heart-Aorta, Human_IMR90, Human_Left-ventrical, Human_Liver, Human_Lung, Human_Ovary, Human_Placenta, Human_Psoas-muscle, Human_Right-atrium, Human_Right-ventrical, Human_Sigmoid-Colon, Human_Small-intestine, Human_Spleen, Human_Thymus.]
Figure A.13: Methylation metagene plots for HMRs identified in healthy Roadmap data stratified by
genomic context.
Appendix B
Supplementary Tables
Name | Species | Conversion rate | % CpG covered | Mean %mCpG | Embryonic day | Sex | Embryonic weight | Placental weight | Litter size
BIK2380P1 | Domesticus | 0.995341 | 0.874622 | 0.443011 | 11 to 16 | Female | 0.0513 | 0.0399 | 7
BIK2380P2 | Domesticus | 0.996033 | 0.86594 | 0.44816 | 11 to 16 | Female | 0.0528 | 0.0527 | 7
DOT2381P2 | Domesticus | 0.99591 | 0.857325 | 0.473066 | 11 to 16 | Female | 0.0528 | 0.0527 | 6
DOT2381P5 | Domesticus | 0.995925 | 0.861084 | 0.43015 | 11 to 16 | Female | 0.0626 | 0.0348 | 6
MBS1852P5 | Musculus | 0.995578 | 0.776638 | 0.487414 | 11 to 16 | Female | 0.0457 | 0.0463 | 8
MBS1852P8 | Musculus | 0.995626 | 0.783063 | 0.466334 | 11 to 16 | Female | 0.0444 | 0.0448 | 8
MPB1574P1 | Musculus | 0.995278 | 0.781556 | 0.539841 | 11 to 16 | Female | 0.0048 | 0.0187 | 7
MPB1574P2 | Musculus | 0.995603 | 0.783858 | 0.516511 | 11 to 16 | Female | 0.0061 | 0.0274 | 7
SFM1923P4 | Spretus | 0.994381 | 0.671744 | 0.438989 | 11 to 16 | Male | 0.0555 | 0.0476 | 7
SFM1923P5 | Spretus | 0.994176 | 0.670684 | 0.456405 | 11 to 16 | Female | 0.0513 | 0.0399 | 7
STF1955P2 | Spretus | 0.994106 | 0.649878 | 0.47287 | 11 to 16 | Male | 0.0279 | 0.0281 | 3
STF1955P3 | Spretus | 0.994316 | 0.673734 | 0.452854 | 11 to 16 | Female | 0.029 | 0.032 | 3
Table B.1: Quality control information for interspecific dataset.
Name | Conversion rate | % CpG covered | Mean %mCpG | Embryonic day | Sex | Layer
M1043-F3-M15-JZ | 0.994285 | 0.567453 | 0.45596 | 15 | Male | Junctional
M1043-F3-M15-LZ | 0.993449 | 0.433127 | 0.468068 | 15 | Male | Labyrinthine
M1043-F5-F15-JZ | 0.994059 | 0.465059 | 0.433803 | 15 | Female | Junctional
M1049-F5-M18-JZ | 0.993262 | 0.4345 | 0.420003 | 18 | Male | Junctional
M1049-F5-M18-LZ | 0.992943 | 0.50895 | 0.479019 | 18 | Male | Labyrinthine
M1049-F6-F18-JZ | 0.994233 | 0.453696 | 0.466433 | 18 | Female | Junctional
M1049-F6-F18-LZ | 0.992795 | 0.563181 | 0.503374 | 18 | Female | Labyrinthine
M1053-F3-M18-JZ | 0.993936 | 0.46238 | 0.464525 | 18 | Male | Junctional
M1053-F3-M18-LZ | 0.986937 | 0.520218 | 0.489859 | 18 | Male | Labyrinthine
M1053-F7-F18-JZ | 0.992977 | 0.52284 | 0.406603 | 18 | Female | Junctional
M1053-F7-F18-LZ | 0.992915 | 0.511345 | 0.511294 | 18 | Female | Labyrinthine
M1054-F3-F18-JZ | 0.994117 | 0.421792 | 0.472106 | 18 | Female | Junctional
M1054-F3-F18-LZ | 0.993372 | 0.495431 | 0.478918 | 18 | Female | Labyrinthine
M1054-F7-M18-JZ | 0.993634 | 0.49015 | 0.428622 | 18 | Male | Junctional
M1054-F7-M18-LZ | 0.989262 | 0.529187 | 0.480076 | 18 | Male | Labyrinthine
M2253-F2-M15-JZ | 0.994751 | 0.465269 | 0.490862 | 15 | Male | Junctional
M2253-F2-M15-LZ | 0.994452 | 0.509366 | 0.43848 | 15 | Male | Labyrinthine
M2253-F3-F15-JZ | 0.99443 | 0.522923 | 0.495519 | 15 | Female | Junctional
M2253-F3-F15-LZ | 0.994362 | 0.423028 | 0.474785 | 15 | Female | Labyrinthine
M2255-F6-F15-JZ | 0.994533 | 0.533655 | 0.441479 | 15 | Female | Junctional
M2255-F6-F15-LZ | 0.993784 | 0.55917 | 0.462061 | 15 | Female | Labyrinthine
M2255-F8-M15-JZ | 0.993764 | 0.556873 | 0.43803 | 15 | Male | Junctional
M2255-F8-M15-LZ | 0.994123 | 0.515026 | 0.435446 | 15 | Male | Labyrinthine
Table B.2: Quality control information for intraspecific dataset.
Sample name | Species | Cancer | History | Cell type | PMDs | Source
Banovich-2014 Human GM18505 | Human | Healthy | Cultured | Lymphoblastoid | yes | MethBase
Banovich-2014 Human GM18507 | Human | Healthy | Cultured | Lymphoblastoid | yes | MethBase
Banovich-2014 Human GM18508 | Human | Healthy | Cultured | Lymphoblastoid | yes | MethBase
Banovich-2014 Human GM18516 | Human | Healthy | Cultured | Lymphoblastoid | yes | MethBase
Banovich-2014 Human GM18522 | Human | Healthy | Cultured | Lymphoblastoid | yes | MethBase
Banovich-2014 Human GM19141 | Human | Healthy | Cultured | Lymphoblastoid | yes | MethBase
Banovich-2014 Human GM19193 | Human | Healthy | Cultured | Lymphoblastoid | yes | MethBase
Banovich-2014 Human GM19204 | Human | Healthy | Cultured | Lymphoblastoid | yes | MethBase
Banovich-2014 Human GM19238 | Human | Healthy | Cultured | Lymphoblastoid | yes | MethBase
Banovich-2014 Human GM19239 | Human | Healthy | Cultured | Lymphoblastoid | yes | MethBase
Berman-2012 Human ColonCancer | Human | Cancer | Primary | Colon | yes | MethBase
Blattler-2014 Human HCT116 | Human | Cancer | Cultured | Bladder | yes | MethBase
Decato-2018-Human 1650Lung | Human | Cancer | Cultured | Lung | yes | New
Decato-2018-Human 441NSCLC | Human | Cancer | Cultured | Lung | yes | New
Decato-2018-Human Calu1Lung | Human | Cancer | Cultured | Lung | yes | New
Decato-2018-Human M3Lung | Human | Cancer | Cultured | Lung | yes | New
Grimmer-2014 Human MCF7 | Human | Cancer | Cultured | Breast | yes | MethBase
Hammoud-2014-Human Sperm | Human | Healthy | Primary | Sperm | no | MethBase
Hansen-2011 Human ColonCancer | Human | Cancer | Primary | Colon | yes | MethBase
Heyn-2012a-Human-CD4T-100yr | Human | Healthy | Primary | Blood | no | MethBase
Heyn-2012a-Human-CD4T-Newborn | Human | Healthy | Primary | Blood | no | MethBase
Heyn-2012a-Human-PBMC | Human | Healthy | Primary | Blood | no | MethBase
Heyn-2012b-Human-BCell-Healthy | Human | Healthy | Primary | Blood | no | MethBase
Hodges-2011-BCell | Human | Healthy | Primary | Blood | no | MethBase
Hodges-2011-HSPC | Human | Healthy | Primary | Blood | no | MethBase
Hodges-2011-Human CD133HSC | Human | Healthy | Primary | Blood | no | MethBase
Hodges-2011-Human Neut | Human | Healthy | Primary | Blood | no | MethBase
Hon-2012 Human HCC1954 | Human | Cancer | Cultured | Breast | yes | MethBase
Hon-2012 Human HMEC | Human | Healthy | Cultured | Breast | yes | MethBase
Lin-2015-Human BreastTumor089 | Human | Cancer | Primary | Breast | no | MethBase
Lin-2015-Human BreastTumor126 | Human | Cancer | Primary | Breast | yes | MethBase
Lin-2015-Human BreastTumor198 | Human | Cancer | Primary | Breast | yes | MethBase
Lin-2015-Human MCF7 | Human | Cancer | Cultured | Breast | yes | MethBase
Lin-2015-Human HealthyBreast | Human | Healthy | Primary | Breast | no | MethBase
Lister-2011 Human ADS | Human | Healthy | Cultured | Adipose | yes | MethBase
Lister-2011 Human ADSAdipose | Human | Healthy | Cultured | Adipose | yes | MethBase
Lister-2011 Human FF | Human | Healthy | Cultured | Fibroblast | yes | MethBase
Lister-2013-Human DorsPrefrontMale55YrTissue | Human | Healthy | Primary | Brain | no | MethBase
Lister-2013-Human DorsPrefrontNeuron53Yr | Human | Healthy | Primary | Brain | no | MethBase
Lister-2013-Human DorsPrefrontNeuronMale55Yr | Human | Healthy | Primary | Brain | no | MethBase
Lister-2013-Human DorsPrefrontNonNeuron53Yr | Human | Healthy | Primary | Brain | no | MethBase
Lister-2013-Human DorsPrefrontNonNeuronMale55Yr | Human | Healthy | Primary | Brain | no | MethBase
Lister-2013-Human FetalCerebCortex | Human | Healthy | Primary | Brain | no | MethBase
Lister-2013-Human FrontCortexFemale64Yr | Human | Healthy | Primary | Brain | no | MethBase
Lister-2013-Human HUES6 | Human | Healthy | Cultured | Nonsomatic | no | MethBase
Lister-2013-Human MidFrontGyr12Yr | Human | Healthy | Primary | Brain | no | MethBase
Lister-2013-Human MidFrontGyr16Yr | Human | Healthy | Primary | Brain | no | MethBase
Lister-2013-Human MidFrontGyr25Yr | Human | Healthy | Primary | Brain | no | MethBase
Lister-2013-Human MidFrontGyr2Yr | Human | Healthy | Primary | Brain | no | MethBase
Lister-2013-Human MidFrontGyr35Day | Human | Healthy | Primary | Brain | no | MethBase
Lister-2013-Human MidFrontGyr5Yr | Human | Healthy | Primary | Brain | no | MethBase
Lister-ESC-2009 Human IMR90 | Human | Healthy | Cultured | Lung | yes | MethBase
Lowe-2013-Human Buccals | Human | Healthy | Primary | Buccals | no | MethBase
LuWen-2014-Human Ad-front | Human | Healthy | Primary | Brain | no | MethBase
Lund-2014 Human AML3-Control | Human | Cancer | Primary | Blood | yes | MethBase
Ma-2014 Human HDF | Human | Healthy | Cultured | Fibroblast | yes | MethBase
Menafra-2014 Human MCF7 | Human | Cancer | Cultured | Breast | yes | MethBase
Molaro-2011-Sperm | Human | Healthy | Primary | Sperm | no | MethBase
Pacis-2015-Human DendriticCell | Human | Healthy | Primary | Skin | no | MethBase
Pidsley-2016-Human LNCaP | Human | Cancer | Cultured | Prostate | yes | MethBase
Pidsley-2016-Human PreC | Human | Healthy | Cultured | Prostate | no | MethBase
Roadmap-2015-Human Adipose | Human | Healthy | Primary | Adipose | no | MethBase
Roadmap-2015-Human Adrenal-gland | Human | Healthy | Primary | Adrenal | no | MethBase
Roadmap-2015-Human BipolarGastric | Human | Healthy | Primary | Gastric | no | MethBase
Roadmap-2015-Human Bladder | Human | Healthy | Primary | Bladder | no | MethBase
Roadmap-2015-Human Ectoderm | Human | Healthy | Cultured | Nonsomatic | no | MethBase
Roadmap-2015-Human Endoderm | Human | Healthy | Cultured | Nonsomatic | no | MethBase
Roadmap-2015-Human Esophagus | Human | Healthy | Primary | Esophagus | no | MethBase
Roadmap-2015-Human Fetal-Adrenal-gland | Human | Healthy | Primary | Adrenal | no | MethBase
Roadmap-2015-Human Fetal-Heart | Human | Healthy | Primary | Heart | no | MethBase
Roadmap-2015-Human Fetal-Intestine-Large | Human | Healthy | Primary | Intestine | no | MethBase
Roadmap-2015-Human Fetal-Spinal-Cord | Human | Healthy | Primary | SpinalCord | no | MethBase
Roadmap-2015-Human Fetal-Stomach | Human | Healthy | Primary | Stomach | no | MethBase
Roadmap-2015-Human H1-BMP4 | Human | Healthy | Cultured | Nonsomatic | no | MethBase
Roadmap-2015-Human H1-Mesenchymal | Human | Healthy | Cultured | Nonsomatic | yes | MethBase
Roadmap-2015-Human H1-Mesendoderm | Human | Healthy | Cultured | Nonsomatic | no | MethBase
Roadmap-2015-Human H1-NP | Human | Healthy | Cultured | Nonsomatic | no | MethBase
Roadmap-2015-Human HSC | Human | Healthy | Cultured | Nonsomatic | no | MethBase
Roadmap-2015-Human HUES64 | Human | Healthy | Cultured | Nonsomatic | no | MethBase
Roadmap-2015-Human HealthyGastric | Human | Healthy | Primary | Gastric | no | MethBase
Roadmap-2015-Human Heart-Aorta | Human | Healthy | Primary | Heart | no | MethBase
Roadmap-2015-Human Left-ventrical | Human | Healthy | Primary | Heart | no | MethBase
Roadmap-2015-Human Liver | Human | Healthy | Primary | Liver | no | MethBase
Roadmap-2015-Human Lung | Human | Healthy | Primary | Lung | no | MethBase
Roadmap-2015-Human Macrophage | Human | Healthy | Primary | Blood | no | MethBase
Roadmap-2015-Human Mesoderm | Human | Healthy | Primary | Nonsomatic | no | MethBase
Roadmap-2015-Human NK | Human | Healthy | Primary | Blood | no | MethBase
Roadmap-2015-Human Ovary | Human | Healthy | Primary | Ovary | no | MethBase
Roadmap-2015-Human PSAGastric | Human | Healthy | Primary | Gastric | no | MethBase
Roadmap-2015-Human Psoas-muscle | Human | Healthy | Primary | Muscle | no | MethBase
Roadmap-2015 Human Placenta | Human | Healthy | Primary | Placenta | yes | MethBase
Roadmap-2015-Human Right-atrium | Human | Healthy | Primary | Heart | no | MethBase
Roadmap-2015-Human Right-ventrical | Human | Healthy | Primary | Heart | no | MethBase
Roadmap-2015-Human Sigmoid-Colon | Human | Healthy | Primary | Colon | no | MethBase
Roadmap-2015-Human Small-intestine | Human | Healthy | Primary | Intestine | no | MethBase
Roadmap-2015-Human Spleen | Human | Healthy | Primary | Spleen | no | MethBase
Roadmap-2015-Human Tcell | Human | Healthy | Primary | Blood | no | MethBase
Roadmap-2015-Human Thymus | Human | Healthy | Primary | Thymus | no | MethBase
Roadmap-2015-Human fMuscle-leg | Human | Healthy | Primary | Muscle | no | MethBase
Roadmap-2015-Human fMuscle-trunk | Human | Healthy | Primary | Muscle | no | MethBase
Schlesinger-2013 Human GM12878 | Human | Healthy | Cultured | Lymphoblastoid | yes | MethBase
Schroeder-2010 Human SHSY5Y | Human | Cancer | Cultured | Brain | yes | MethBase
Schroeder-2013 Human Placenta | Human | Healthy | Primary | Placenta | yes | MethBase
Vandiver-2015-Human Epidermis-old-sun-exposed | Human | Healthy | Primary | Skin | no | MethBase
Vandiver-2015-Human Epidermis-old-sun-protected | Human | Healthy | Primary | Skin | no | MethBase
Vandiver-2015-Human Epidermis-young-sun-exposed | Human | Healthy | Primary | Skin | no | MethBase
Vandiver-2015-Human Epidermis-young-sun-protected | Human | Healthy | Primary | Skin | no | MethBase
Xie-2013 Human IMR90 | Human | Healthy | Cultured | Lung | yes | MethBase
Zeng-2012-Human-PreFrontCortex | Human | Healthy | Primary | Brain | no | MethBase
Ziller-2013-Human HepG2 | Human | Cancer | Cultured | Liver | yes | MethBase
AbuRemaileh-2015-Mouse Colon-epithelial | Mouse | Healthy | Primary | Colon | no | MethBase
Branco-2016-Mouse E7.5-EPC-Ctrl | Mouse | Healthy | Primary | Placenta | yes | MethBase
Decato-2017-Mouse BIK2380P1 mm10 | Mouse | Healthy | Primary | Placenta | yes | MethBase
Decato-2017-Mouse BIK2380P2 mm10 | Mouse | Healthy | Primary | Placenta | yes | MethBase
Decato-2017-Mouse DOT2381P2 mm10 | Mouse | Healthy | Primary | Placenta | yes | MethBase
Decato-2017-Mouse DOT2381P5 mm10 | Mouse | Healthy | Primary | Placenta | yes | MethBase
Decato-2017-Mouse E15-JZ | Mouse | Healthy | Primary | Placenta | yes | MethBase
Decato-2017-Mouse E15-LZ | Mouse | Healthy | Primary | Placenta | yes | MethBase
Decato-2017-Mouse E18-JZ | Mouse | Healthy | Primary | Placenta | yes | MethBase
Decato-2017-Mouse E18-LZ | Mouse | Healthy | Primary | Placenta | yes | MethBase
Decato-2017-Mouse MBS1852P5 mm10 | Mouse | Healthy | Primary | Placenta | yes | MethBase
Decato-2017-Mouse MBS1852P8 mm10 | Mouse | Healthy | Primary | Placenta | yes | MethBase
Decato-2017-Mouse MPB1574P1 mm10 | Mouse | Healthy | Primary | Placenta | yes | MethBase
Decato-2017-Mouse MPB1574P2 mm10 | Mouse | Healthy | Primary | Placenta | yes | MethBase
Decato-2017-Mouse SFM1923P4 mm10 | Mouse | Healthy | Primary | Placenta | yes | MethBase
Decato-2017-Mouse SFM1923P5 mm10 | Mouse | Healthy | Primary | Placenta | yes | MethBase
Decato-2017-Mouse STF1955P2 mm10 | Mouse | Healthy | Primary | Placenta | yes | MethBase
Decato-2017-Mouse STF1955P3 mm10 | Mouse | Healthy | Primary | Placenta | yes | MethBase
Decato-2018-Mouse 4T1AF5 | Mouse | Cancer | Cultured | Breast | yes | New
Decato-2018-Mouse 4T1AO6 | Mouse | Cancer | Cultured | Breast | yes | New
Decato-2018-Mouse 4T1BB2 | Mouse | Cancer | Cultured | Breast | yes | New
Decato-2018-Mouse 4T1BB3 | Mouse | Cancer | Cultured | Breast | yes | New
Decato-2018-Mouse 4T1R3 | Mouse | Cancer | Cultured | Breast | yes | New
Decato-2018-Mouse 4T1parent | Mouse | Cancer | Cultured | Breast | yes | New
Decato-2018-Mouse 4T1H3 | Mouse | Cancer | Cultured | Breast | yes | New
Gabel-2015-Mouse Cerebellum | Mouse | Healthy | Primary | Brain | no | MethBase
Gabel-2015-Mouse Cortex | Mouse | Healthy | Primary | Brain | no | MethBase
Hahn-Mouse-2017-Mouse Liver-AL-Old | Mouse | Healthy | Primary | Liver | no | MethBase
Hahn-Mouse-2017-Mouse Liver-AL-Young | Mouse | Healthy | Primary | Liver | no | MethBase
Hahn-Mouse-2017-Mouse Liver-DR-Old | Mouse | Healthy | Primary | Liver | no | MethBase
Hahn-Mouse-2017-Mouse Liver-DR-Young | Mouse | Healthy | Primary | Liver | no | MethBase
Hammoud-Mouse-2014-Mouse Sperm | Mouse | Healthy | Primary | Sperm | no | MethBase
Hammoud-Mouse-2014-Mouse Spermatid | Mouse | Healthy | Primary | Sperm | no | MethBase
Hammoud-Mouse-2014-Mouse Spermatocyte | Mouse | Healthy | Primary | Sperm | no | MethBase
Hammoud-Mouse-2014-Mouse Spermatogonia | Mouse | Healthy | Primary | Sperm | no | MethBase
Harten-2015-Mouse E14.5-Liver-WT | Mouse | Healthy | Primary | Liver | no | MethBase
Harten-2015-Mouse E18.5-Liver-WT | Mouse | Healthy | Primary | Liver | no | MethBase
He-Mouse-2014-Mouse Keratinocytes | Mouse | Healthy | Primary | Skin | no | MethBase
Hon-2013-Mouse BoneMarrow | Mouse | Healthy | Primary | BoneMarrow | no | MethBase
Hon-2013-Mouse Cerebellum | Mouse | Healthy | Primary | Brain | no | MethBase
Hon-2013-Mouse Colon | Mouse | Healthy | Primary | Colon | no | MethBase
Hon-2013-Mouse Cortex | Mouse | Healthy | Primary | Brain | no | MethBase
Hon-2013-Mouse Heart | Mouse | Healthy | Primary | Heart | no | MethBase
Hon-2013-Mouse Intestine | Mouse | Healthy | Primary | Intestine | no | MethBase
Hon-2013-Mouse Kidney | Mouse | Healthy | Primary | Kidney | no | MethBase
Hon-2013-Mouse Liver | Mouse | Healthy | Primary | Liver | no | MethBase
Hon-2013-Mouse Lung | Mouse | Healthy | Primary | Lung | no | MethBase
Hon-2013-Mouse OlfactoryBulb | Mouse | Healthy | Primary | OlfactoryBulb | no | MethBase
Hon-2013-Mouse Pancreas | Mouse | Healthy | Primary | Pancreas | no | MethBase
Hon-2013-Mouse Placenta | Mouse | Healthy | Primary | Placenta | yes | MethBase
Hon-2013-Mouse Skin | Mouse | Healthy | Primary | Skin | no | MethBase
Hon-2013-Mouse Spleen | Mouse | Healthy | Primary | Spleen | no | MethBase
Hon-2013-Mouse Stomach | Mouse | Healthy | Primary | Stomach | no | MethBase
Hon-2013-Mouse Thymus | Mouse | Healthy | Primary | Thymus | no | MethBase
Hon-2013-Mouse Uterus | Mouse | Healthy | Primary | Uterus | no | MethBase
Inoue-Mouse-2017-Mouse-Spermatogonia-WT | Mouse | Healthy | Primary | Sperm | no | MethBase
Kaaij-2013-Mouse IntestinalGen13 mm10 | Mouse | Healthy | Primary | Intestine | no | MethBase
Kaaij-2013-Mouse IntestinalGen15 mm10 | Mouse | Healthy | Primary | Intestine | no | MethBase
Kaaij-2013-Mouse IntestinalGen16 mm10 | Mouse | Healthy | Primary | Intestine | no | MethBase
Kobayashi-2012-Mouse Oocyte | Mouse | Healthy | Primary | Oocyte | yes | MethBase
Lister-2013-Mouse FetalFrontCortex | Mouse | Healthy | Primary | Brain | no | MethBase
Lister-2013-Mouse FrontCortexGliaS00bPos | Mouse | Healthy | Primary | Brain | no | MethBase
Lister-2013-Mouse FrontCortexMale10Wk | Mouse | Healthy | Primary | Brain | no | MethBase
Lister-2013-Mouse FrontCortexMale1Wk | Mouse | Healthy | Primary | Brain | no | MethBase
Lister-2013-Mouse FrontCortexMale22Mo | Mouse | Healthy | Primary | Brain | no | MethBase
Lister-2013-Mouse FrontCortexMale2Wk | Mouse | Healthy | Primary | Brain | no | MethBase
Lister-2013-Mouse FrontCortexMale4Wk | Mouse | Healthy | Primary | Brain | no | MethBase
Lister-2013-Mouse FrontCortexMale6Wk | Mouse | Healthy | Primary | Brain | no | MethBase
Lister-2013-Mouse FrontCortexNeuronFemale12Mo | Mouse | Healthy | Primary | Brain | no | MethBase
Lister-2013-Mouse FrontCortexNeuronFemale6Wk | Mouse | Healthy | Primary | Brain | no | MethBase
Lister-2013-Mouse FrontCortexNeuronMale7Wk | Mouse | Healthy | Primary | Brain | no | MethBase
Lister-2013-Mouse FrontCortexNonNeuronFemale12Mo | Mouse | Healthy | Primary | Brain | no | MethBase
Lister-2013-Mouse FrontCortexNonNeuronFemale6Wk | Mouse | Healthy | Primary | Brain | no | MethBase
Lister-2013-Mouse FrontCortexNonNeuronMale7Wk | Mouse | Healthy | Primary | Brain | no | MethBase
Mann-2014-Mouse Fib | Mouse | Healthy | Primary | Fibroblast | no | MethBase
Mellen-2017-Mouse CerebralGranule | Mouse | Healthy | Primary | Brain | no | MethBase
Mo-2015-Mouse EXneurons | Mouse | Healthy | Primary | Brain | no | MethBase
Mo-2015-Mouse PVneurons | Mouse | Healthy | Primary | Brain | no | MethBase
Mo-2015-Mouse VIPneurons | Mouse | Healthy | Primary | Brain | no | MethBase
Schroeder-2015-Mouse Placenta | Mouse | Healthy | Primary | Placenta | yes | MethBase
Sheaffer-2014-Mouse Intestine | Mouse | Healthy | Primary | Intestine | no | MethBase
Sheaffer-2014-Mouse IntestineSC | Mouse | Healthy | Primary | Intestine | no | MethBase
Wang-2014-Mouse Oocyte | Mouse | Healthy | Primary | Oocyte | yes | MethBase
WeiXie-Mouse-2012-Mouse FrontalCortex-129 | Mouse | Healthy | Primary | Brain | no | MethBase
WeiXie-Mouse-2012-Mouse FrontalCortex-Cast | Mouse | Healthy | Primary | Brain | no | MethBase
WeiXie-Mouse-2012-Mouse FrontalCortex-F1i | Mouse | Healthy | Primary | Brain | no | MethBase
WeiXie-Mouse-2012-Mouse FrontalCortex-F1r | Mouse | Healthy | Primary | Brain | no | MethBase
dosSantos-Mouse-2015-Mouse ParBasDiff | Mouse | Healthy | Primary | Breast | no | MethBase
dosSantos-Mouse-2015-Mouse ParBasMaSC | Mouse | Healthy | Primary | Breast | no | MethBase
dosSantos-Mouse-2015-Mouse ParBasProg | Mouse | Healthy | Primary | Breast | no | MethBase
dosSantos-Mouse-2015-Mouse ParLumAlve | Mouse | Healthy | Primary | Breast | no | MethBase
dosSantos-Mouse-2015-Mouse ParLumDuct | Mouse | Healthy | Primary | Breast | no | MethBase
dosSantos-Mouse-2015-Mouse ParLumProg | Mouse | Healthy | Primary | Breast | no | MethBase
dosSantos-Mouse-2015-Mouse VirBasDiff | Mouse | Healthy | Primary | Breast | no | MethBase
dosSantos-Mouse-2015-Mouse VirBasMaSC | Mouse | Healthy | Primary | Breast | no | MethBase
dosSantos-Mouse-2015-Mouse VirBasProg | Mouse | Healthy | Primary | Breast | no | MethBase
dosSantos-Mouse-2015-Mouse VirLumAlve | Mouse | Healthy | Primary | Breast | no | MethBase
dosSantos-Mouse-2015-Mouse VirLumDuct | Mouse | Healthy | Primary | Breast | no | MethBase
dosSantos-Mouse-2015-Mouse VirLumProg | Mouse | Healthy | Primary | Breast | no | MethBase
Mendizabal-2016-Rhesus Brain | Rhesus | Healthy | Primary | Brain | no | MethBase
Qu-2017-Rhesus Sperm | Rhesus | Healthy | Primary | Sperm | no | MethBase
Schroeder-2015-Rhesus Placenta | Rhesus | Healthy | Primary | X | yes | MethBase
Tung-2011-Rhesus PBMC High | Rhesus | Healthy | Primary | Blood | no | MethBase
Tung-2011-Rhesus PBMC Low | Rhesus | Healthy | Primary | Blood | no | MethBase
Carmona-2014-Dog MDCK | Dog | Healthy | Cultured | Kidney | yes | MethBase
Qu-2017-Dog Sperm | Dog | Healthy | Primary | Sperm | no | MethBase
Schroeder-2015-Dog Placenta | Dog | Healthy | Primary | Placenta | yes | MethBase
Schroeder-2015-SquirrelMonkey Placenta | Squirrel Monkey | Healthy | Primary | Placenta | yes | MethBase
Schroeder-2015-Horse Placenta | Horse | Healthy | Primary | Placenta | yes | MethBase
Schroeder-2015-Cow Placenta | Cow | Healthy | Primary | Placenta | yes | MethBase
TCGA-2018-Human A5-A0G2-01A-Uterine-Corpus-Endometrial-Carcinoma | Human | Cancer | Primary | Uterus | yes | TCGA
TCGA-2018-Human A7-A0CE-01A-Breast-Invasive-Carcinoma | Human | Cancer | Primary | Breast | yes | TCGA
TCGA-2018-Human A7-A0CE-11A-MatchedHealthy-Breast-Invasive-Carcinoma | Human | Healthy | Primary | Breast | no | TCGA
TCGA-2018-Human A8-A07I-01A-Breast-Invasive-Carcinoma | Human | Cancer | Primary | Breast | yes | TCGA
TCGA-2018-Human AA-3518-01A-Colon-Adenocarcinoma | Human | Cancer | Primary | Colon | yes | TCGA
TCGA-2018-Human AA-3518-11A-MatchedHealthy-Colon-Adenocarcinoma | Human | Healthy | Primary | Colon | no | TCGA
TCGA-2018-Human AA-A00R-01A-Colon-Adenocarcinoma | Human | Cancer | Primary | Colon | yes | TCGA
TCGA-2018-Human AF-2689-01A-Rectum-Adenocarcinoma | Human | Cancer | Primary | Rectum | yes | TCGA
TCGA-2018-Human AF-2689-11A-MatchedHealthy-Rectum-Adenocarcinoma | Human | Healthy | Primary | Rectum | no | TCGA
TCGA-2018-Human AG-3593-01A-Rectum-Adenocarcinoma | Human | Cancer | Primary | Rectum | yes | TCGA
TCGA-2018-Human AP-A05J-01A-Uterine-Corpus-Endometrial-Carcinoma | Human | Cancer | Primary | Uterus | yes | TCGA
TCGA-2018-Human AX-A1CI-01A-Uterine-Corpus-Endometrial-Carcinoma | Human | Cancer | Primary | Uterus | yes | TCGA
TCGA-2018-Human AX-A1CI-11A-MatchedHealthy-Uterine-Corpus-Endometrial-Carcinoma | Human | Healthy | Primary | Uterus | no | TCGA
TCGA-2018-Human AX-A1CK-01A-Uterine-Corpus-Endometrial-Carcinoma | Human | Cancer | Primary | Uterus | yes | TCGA
TCGA-2018-Human B5-A0K6-01A-Uterine-Corpus-Endometrial-Carcinoma | Human | Cancer | Primary | Uterus | yes | TCGA
TCGA-2018-Human BL-A13J-01A-Bladder-Urothelial-Carcinoma | Human | Cancer | Primary | Bladder | yes | TCGA
TCGA-2018-Human BR-6452-01A-Stomach-Adenocarcinoma | Human | Cancer | Primary | Stomach | yes | TCGA
TCGA-2018-Human BR-6452-11A-MatchedHealthy-Stomach-Adenocarcinoma | Human | Healthy | Primary | Stomach | no | TCGA
TCGA-2018-Human BT-A20V-01A-Bladder-Urothelial-Carcinoma | Human | Cancer | Primary | Bladder | yes | TCGA
TCGA-2018-Human BT-A20V-11A-MatchedHealthy-Bladder-Urothelial-Carcinoma | Human | Healthy | Primary | Bladder | no | TCGA
TCGA-2018-Human BT-A2LA-01A-Bladder-Urothelial-Carcinoma | Human | Cancer | Primary | Bladder | yes | TCGA
TCGA-2018-Human CG-5730-01A-Stomach-Adenocarcinoma | Human | Cancer | Primary | Stomach | yes | TCGA
TCGA-2018-Human D7-6519-01A-Stomach-Adenocarcinoma | Human | Cancer | Primary | Stomach | yes | TCGA
TCGA-2018-Human DK-A1AA-01A-Bladder-Urothelial-Carcinoma | Human | Cancer | Primary | Bladder | yes | TCGA
TCGA-2018-Human DK-A1AG-01A-Bladder-Urothelial-Carcinoma | Human | Cancer | Primary | Bladder | yes | TCGA
TCGA-2018-Human E2-A15H-01A-Breast-Invasive-Carcinoma | Human | Cancer | Primary | Breast | yes | TCGA
TCGA-2018-Human F1-6177-01A-Stomach-Adenocarcinoma | Human | Cancer | Primary | Stomach | yes | TCGA
TCGA-2018-Human H4-A2HQ-01A-Bladder-Urothelial-Carcinoma | Human | Cancer | Primary | Bladder | yes | TCGA
TCGA-2018-Human 06-0128-01A-Glioblastoma-Multiforme | Human | Cancer | Primary | Brain | yes | TCGA
TCGA-2018-Human 14-1401-01A-Glioblastoma-Multiforme | Human | Cancer | Primary | Brain | yes | TCGA
TCGA-2018-Human 14-1454-01A-Glioblastoma-Multiforme | Human | Cancer | Primary | Brain | yes | TCGA
TCGA-2018-Human 14-3477-01A-Glioblastoma-Multiforme | Human | Cancer | Primary | Brain | yes | TCGA
TCGA-2018-Human 16-1460-01A-Glioblastoma-Multiforme | Human | Cancer | Primary | Brain | yes | TCGA
TCGA-2018-Human 19-1788-01A-Glioblastoma-Multiforme | Human | Cancer | Primary | Brain | yes | TCGA
TCGA-2018-Human 21-1078-01A-Lung-Squamous-Cell-Carcinoma | Human | Cancer | Primary | Lung | yes | TCGA
TCGA-2018-Human 34-2600-01A-Lung-Squamous-Cell-Carcinoma | Human | Cancer | Primary | Lung | yes | TCGA
TCGA-2018-Human 38-4630-01A-Lung-Adenocarcinoma | Human | Cancer | Primary | Lung | yes | TCGA
TCGA-2018-Human 44-6148-01A-Lung-Adenocarcinoma | Human | Cancer | Primary | Lung | yes | TCGA
TCGA-2018-Human 44-6148-11A-MatchedHealthy-Lung-Adenocarcinoma | Human | Healthy | Primary | Lung | no | TCGA
TCGA-2018-Human 60-2695-01A-Lung-Squamous-Cell-Carcinoma | Human | Cancer | Primary | Lung | yes | TCGA
TCGA-2018-Human 60-2722-01A-Lung-Squamous-Cell-Carcinoma | Human | Cancer | Primary | Lung | yes | TCGA
TCGA-2018-Human 60-2722-11A-MatchedHealthy-Lung-Squamous-Cell-Carcinoma Human Healthy Primary Lung no TCGA
TCGA-2018-Human 67-6215-01A-Lung-Adenocarcinoma Human Cancer Primary Lung yes TCGA
TCGA-2018-Human 78-7156-01A-Lung-Adenocarcinoma Human Cancer Primary Lung yes TCGA
TCGA-2018-Human 91-6840-01A-Lung-Adenocarcinoma Human Cancer Primary Lung yes TCGA
TCGA-2018-Human A2-A04X-01A-Breast-Invasive-Carcinoma Human Cancer Primary Breast yes TCGA
TCGA-2018-Human A2-A0YG-01A-Breast-Invasive-Carcinoma Human Cancer Primary Breast yes TCGA
Table B.7: Summary of samples used in Chapter 4 with manual
annotation of culturing and cancer status.
Rank Layer # DM CpGs Age # DM CpGs Sex # DM CpGs
1 Zmym3 45 Cdc42 24 Rdx 24
2 Srrt 45 Picalm 21 Klhl15 21
3 Stag2 35 Tjp1 18 March10 20
4 1810009A15Rik 22 March10 18 Phka2 18
5 Sox3 21 Nnat 17 Rhox4e 16
6 Cdhr5 18 Sirpa 12 Nup62cl 16
7 Rcan2 15 Mettl9 12 Tab3 15
8 Prrg3 15 Fxr2 12 Fgd1 15
9 Phf6 14 Brdt 12 Chst7 14
10 Eif5 13 Rhox4a 10 E230019M04Rik 13
11 Pglyrp2 12 - - Mtm1 12
12 Fam110a 12 - - Med12 12
13 Cmtm2b 12 - - Gm6981 12
14 Armcx1 12 - - Gabrq 12
15 Mir5136 11 - - Slc16a2 11
16 A530016L24Rik 11 - - Ptms 11
17 5730408K05Rik 11 - - Tpbg 10
18 Tspan11 10 - - Rhox4d 10
19 Ppp2r5e 10 - - Rai2 10
20 Pcf11 10 - - Rab9 10
21 Nrk 10 - - Nrk 10
22 Insl5 10 - - H2-T10 10
23 Dcaf12l1 10 - - Fhl1 10
24 Cstf2 10 - - - -
25 Bcat1 10 - - - -
Table B.3: Full DM Promoter Table by placental layer, developmental timepoint, and sex.
Subfamily # DM CpGs # CpGs in subfamily O/E value
Musculus-Specific DM CpG Subfamily Enrichment:
RLTR10F 1864 4343 2.05727
RLTR10D2 892 2043 2.09282
RLTR44D 375 848 2.11968
RLTR10D 2788 6094 2.19293
RLTR10E 988 2109 2.24551
MLT1H1-int 51 104 2.35056
LTRIS5 221 447 2.36984
Domesticus-Specific DM CpG Subfamily Enrichment:
RLTR20A1 60 208 2.14105
RLTR46A2 157 513 2.27154
Spretus-Specific DM CpG Subfamily Enrichment:
RLTR20A1 122 208 2.04429
MLT1I-int 76 118 2.2448
Table B.4: Species-specific retrotransposon subfamily enrichment.
Sample Name Sex Embryonic weight Placental weight Litter size
BIK2380P1 Female 0.0513 0.0399 7
BIK2380P2 Female 0.0528 0.0527 7
DOT2381P2 Female 0.0528 0.0527 6
DOT2381P5 Female 0.0626 0.0348 6
MBS1852P5 Female 0.0457 0.0463 8
MBS1852P8 Female 0.0444 0.0448 8
MPB1574P1 Female 0.0048 0.0187 7
MPB1574P2 Female 0.0061 0.0274 7
SFM1923P4 Male 0.0555 0.0476 7
SFM1923P5 Female 0.0513 0.0399 7
STF1955P2 Male 0.0279 0.0281 3
STF1955P3 Female 0.029 0.032 3
Table B.5: Embryonic and placental weight of interspecific dataset.
Name Assembly Unique reads mm10 Unique reads new %mC mm10 %mC new
BIK2380P1 WSB 152103438 150041511 44.3% 44.1%
BIK2380P2 WSB 122112034 120725434 44.8% 44.6%
DOT2381P2 WSB 110041413 108912234 47.3% 47.1%
DOT2381P5 WSB 114551129 113625731 43.0% 42.9%
MBS1852P5 PWK 111800220 116035183 48.7% 49.1%
MBS1852P8 PWK 124878772 129260716 46.6% 47.1%
MPB1574P1 PWK 114685265 118774790 53.9% 54.7%
MPB1574P2 PWK 125901225 130944978 51.6% 52.3%
SFM1923P4 SPRET 122383008 115601824 43.8% 48.9%
SFM1923P5 SPRET 129909505 132094682 45.6% 47.4%
STF1955P2 SPRET 107024568 132094682 47.2% 46.3%
STF1955P3 SPRET 125066779 140772321 45.2% 47.6%
Table B.6: Mapping statistics and global methylation levels before and after mapping whole-placenta samples to their more accurate reference genomes compared to mm10.
Sample name; Species; Fraction of ref. genome called PMDs; Mean segment size; Number of segments; Fraction of CpGs covered in sample; Mean depth of CpG coverage; Bin size used for segmentation
AbuRemaileh-2015-Mouse Colon-epithelial Mouse 0.00811836 8543.43 2595 0.958209 16.2401 1000
Banovich-2014 Human GM18505 Human 0.468217 1.24E+06 1169 0.874074 6.42202 2000
Banovich-2014 Human GM18507 Human 0.456804 1.50E+06 944 0.873671 6.27756 2000
Banovich-2014 Human GM18508 Human 0.478912 1.28E+06 1161 0.856085 4.66965 3000
Banovich-2014 Human GM18516 Human 0.466751 1.32E+06 1095 0.809213 2.86282 3500
Banovich-2014 Human GM18522 Human 0.44135 1.63E+06 840 0.543079 0.910438 9500
Banovich-2014 Human GM19141 Human 0.46496 1.28E+06 1126 0.858013 4.59573 2500
Banovich-2014 Human GM19193 Human 0.498018 1.12E+06 1371 0.801636 2.61927 4000
Banovich-2014 Human GM19204 Human 0.456215 1.40E+06 1008 0.660505 1.35609 7000
Banovich-2014 Human GM19238 Human 0.507588 967571 1624 0.855734 4.46278 2500
Banovich-2014 Human GM19239 Human 0.485926 1.19E+06 1262 0.808246 2.6965 4000
Berman-2012 Human ColonCancer Human 0.332847 523307 1969 0.961465 25.7471 1000
Blattler-2014 Human HCT116 Human 0.269742 681665 1225 0.925586 6.09965 2000
Branco-2016-Mouse E7.5-EPC-Ctrl Mouse 0.312069 2.61E+06 326 0.706863 2.30804 9000
Carmona-2014-Dog MDCK Dog 0.45706 625403 1762 0.978372 41.0268 1000
Decato-2017-Mouse BIK2380P1 mm10 Mouse 0.302409 1.28E+06 647 0.982157 10.9966 3000
Decato-2017-Mouse BIK2380P2 mm10 Mouse 0.299521 1.33E+06 614 0.976802 8.20141 3000
Decato-2017-Mouse DOT2381P2 mm10 Mouse 0.308808 1.63E+06 516 0.970864 6.38417 3000
Decato-2017-Mouse DOT2381P5 mm10 Mouse 0.314643 1.09E+06 785 0.97539 6.73145 2500
Decato-2017-Mouse E15-JZ Mouse 0.24609 522988 1285 0.912139 12.9603 2000
Decato-2017-Mouse E15-LZ Mouse 0.239746 538417 1216 0.888781 9.27535 2500
Decato-2017-Mouse E18-JZ Mouse 0.292255 900804 886 0.899763 9.77492 2500
Decato-2017-Mouse E18-LZ Mouse 0.253335 567071 1220 0.912172 12.5167 2000
Decato-2017-Mouse MBS1852P5 mm10 Mouse 0.254326 1.15E+06 606 0.977375 6.75598 2500
Decato-2017-Mouse MBS1852P8 mm10 Mouse 0.289639 1.27E+06 624 0.983344 7.74858 2500
Decato-2017-Mouse MPB1574P1 mm10 Mouse 0.00207842 28098.5 202 0.978006 8.88303 3000
Decato-2017-Mouse MPB1574P2 mm10 Mouse 0.00264111 25130.8 287 0.979505 9.66731 3000
Decato-2017-Mouse SFM1923P4 mm10 Mouse 0.26706 1.54E+06 475 0.969764 6.69176 3000
Decato-2017-Mouse SFM1923P5 mm10 Mouse 0.26724 1.27E+06 574 0.979872 9.93913 3000
Decato-2017-Mouse STF1955P2 mm10 Mouse 0.265731 1.21E+06 598 0.9788 10.6785 3000
Decato-2017-Mouse STF1955P3 mm10 Mouse 0.269355 1.06E+06 694 0.981519 9.25844 2500
Decato-2018-Human 1650Lung Human 0.225587 364673 1915 0.941915 11.2996 2000
Decato-2018-Human 441NSCLC Human 0.263647 549610 1485 0.940849 9.39297 2000
Decato-2018-Human Calu1Lung Human 0.384588 386673 3079 0.942163 9.73472 1500
Decato-2018-Human M3Lung Human 0.336177 577846 1801 0.936502 10.5856 2000
Decato-2018-Mouse 4T1AF5 Mouse 0.317976 593136 1464 0.909296 12.6877 1500
Decato-2018-Mouse 4T1AO6 Mouse 0.338007 671801 1374 0.901006 10.1139 1500
Decato-2018-Mouse 4T1BB2 Mouse 0.36041 902138 1091 0.899267 9.37776 2000
Decato-2018-Mouse 4T1BB3 Mouse 0.385663 773271 1362 0.909208 12.5914 1500
Decato-2018-Mouse 4T1H3 Mouse 0.348705 667790 1426 0.904154 10.7605 1500
Decato-2018-Mouse 4T1parent Mouse 0.363094 815431 1216 0.904056 10.1364 2000
Decato-2018-Mouse 4T1R3 Mouse 0.357339 706624 1381 0.910051 12.9942 1500
dosSantos-Mouse-2015-Mouse ParBasDiff Mouse 0.00974545 20939.1 1271 0.909642 8.88064 1500
dosSantos-Mouse-2015-Mouse ParBasMaSC Mouse 0.00637647 22468.8 775 0.885232 4.67428 2000
dosSantos-Mouse-2015-Mouse ParBasProg Mouse 0.00709734 15580.3 1244 0.930264 17.7868 1500
dosSantos-Mouse-2015-Mouse ParLumAlve Mouse 0.0118271 20416.2 1582 0.927214 10.9552 1500
dosSantos-Mouse-2015-Mouse ParLumDuct Mouse 0.013123 27652.1 1296 0.924967 10.4977 2000
dosSantos-Mouse-2015-Mouse ParLumProg Mouse 0.0120628 23665.3 1392 0.925916 12.2673 1500
dosSantos-Mouse-2015-Mouse VirBasDiff Mouse 0.0083391 16094 1415 0.930212 14.1543 1500
dosSantos-Mouse-2015-Mouse VirBasMaSC Mouse 0.00600834 15976.6 1027 0.916694 9.73737 1500
dosSantos-Mouse-2015-Mouse VirBasProg Mouse 0.00731833 15899.3 1257 0.928392 15.6729 1500
dosSantos-Mouse-2015-Mouse VirLumAlve Mouse 0.0121589 21203.4 1566 0.922414 11.2254 1500
dosSantos-Mouse-2015-Mouse VirLumDuct Mouse 0.0117724 13181.2 2439 0.925918 14.1709 1000
dosSantos-Mouse-2015-Mouse VirLumProg Mouse 0.0116914 11882.3 2687 0.921235 18.6369 1000
Gabel-2015-Mouse Cerebellum Mouse 0.0292011 143943 554 0.628005 1.12829 8000
Gabel-2015-Mouse Cortex Mouse 0.0125051 129847 263 0.555096 0.891294 10000
Grimmer-2014 Human MCF7 Human 0.204951 481386 1318 0.91057 4.45512 3000
Hahn-Mouse-2017-Mouse Liver-AL-Old Mouse 0.0100304 10608.8 2582 0.942295 28.7687 1000
Hahn-Mouse-2017-Mouse Liver-AL-Young Mouse 0.00625229 13154.2 1298 0.929441 7.57105 1500
Hahn-Mouse-2017-Mouse Liver-DR-Old Mouse 0.00956187 10640.7 2454 0.940161 26.4248 1000
Hahn-Mouse-2017-Mouse Liver-DR-Young Mouse 0.00526827 13754.3 1046 0.925878 7.21105 1500
Hammoud-2014-Human Sperm Human 0.0353034 14614.7 7478 0.958828 51.8183 1000
Hammoud-Mouse-2014-Mouse Sperm Mouse 0.0360353 23752.8 4143 0.921733 29.5062 1000
Hammoud-Mouse-2014-Mouse Spermatid Mouse 0.0315108 20464.2 4205 0.923542 29.537 1000
Hammoud-Mouse-2014-Mouse Spermatocyte Mouse 0.0323675 41971.3 2106 0.911778 26.3135 1500
Hammoud-Mouse-2014-Mouse Spermatogonia Mouse 0.0306163 18263.3 4578 0.931759 41.6678 1000
Hansen-2011 Human ColonCancer Human 0.435548 1.08E+06 1245 0.840229 6.1631 2500
Harten-2015-Mouse E14.5-Liver-WT Mouse 0.00735869 11023.4 1823 0.938206 68.1083 1000
Harten-2015-Mouse E18.5-Liver-WT Mouse 0.00695759 9039.14 2102 0.918868 33.3353 1000
He-Mouse-2014-Mouse Keratinocytes Mouse 0.00762373 20371.3 1022 0.920834 12.7787 2000
Heyn-2012a-Human-CD4T-100yr Human 0.00407705 11708.1 1078 0.952845 14.0417 1500
Heyn-2012a-Human-CD4T-Newborn Human 0.0122679 16598.6 2288 0.953797 14.4172 1500
Heyn-2012a-Human-PBMC Human 0.00610024 12803 1475 0.949875 14.0199 1500
Heyn-2012b-Human-BCell-Healthy Human 0.469625 1.26E+06 1157 0.953899 18.213 1500
Hodges-2011-BCell Human 0.00700046 15070.4 1438 0.957382 11.8547 1500
Hodges-2011-HSPC Human 0.0112802 18584.4 1879 0.957503 11.329 1500
Hodges-2011-Human CD133HSC Human 0.0122123 31270.1 1209 0.958016 9.25521 2000
Hodges-2011-Human Neut Human 0.0114241 22228.4 1591 0.958288 11.602 2000
Hon-2012 Human HCC1954 Human 0.268447 271756 3058 0.960519 32.6548 1000
Hon-2012 Human HMEC Human 0.282187 400534 2181 0.959401 22.939 1000
Hon-2013-Mouse BoneMarrow Mouse 0.00290257 16479.3 481 0.927682 7.59714 2000
Hon-2013-Mouse Cerebellum Mouse 0.0119859 25552 1281 0.927432 8.59741 2000
Hon-2013-Mouse Colon Mouse 0.00656212 15623.6 1147 0.9239 8.5729 2000
Hon-2013-Mouse Cortex Mouse 0.0087221 13487.5 1766 0.933408 9.85818 1500
Hon-2013-Mouse Heart Mouse 0.00477551 17575.9 742 0.930586 7.05864 2000
Hon-2013-Mouse Intestine Mouse 0.0055599 17492.4 868 0.929826 8.85271 2000
Hon-2013-Mouse Kidney Mouse 0.00472748 17375.7 743 0.927155 6.96727 2000
Hon-2013-Mouse Liver Mouse 0.00412991 20175.7 559 0.914774 4.99081 2500
Hon-2013-Mouse Lung Mouse 0.00515766 17010.8 828 0.93139 8.42149 2000
Hon-2013-Mouse OlfactoryBulb Mouse 0.010122 14088.7 1962 0.933558 10.35 1500
Hon-2013-Mouse Pancreas Mouse 0.00507502 15930.1 870 0.930517 8.63513 2000
Hon-2013-Mouse Placenta Mouse 0.321378 927741 946 0.931035 8.42777 2000
Hon-2013-Mouse Skin Mouse 0.00623639 18633.2 914 0.927253 7.8287 2000
Hon-2013-Mouse Spleen Mouse 0.00580889 15721.8 1009 0.930611 6.77724 2000
Hon-2013-Mouse Stomach Mouse 0.00384354 16049.2 654 0.923821 7.72492 2000
Hon-2013-Mouse Thymus Mouse 0.00672981 17947.5 1024 0.930169 8.94456 2000
Hon-2013-Mouse Uterus Mouse 0.00420676 15256.5 753 0.928243 7.63936 2000
Inoue-Mouse-2017-Mouse-Spermatogonia-WT Mouse 0.0287933 244956 321 0.775759 2.19282 6500
Kaaij-2013-Mouse IntestinalGen13 mm10 Mouse 0.00911846 9573.76 2601 0.935309 15.9969 1000
Kaaij-2013-Mouse IntestinalGen15 mm10 Mouse 0.0079268 11829 1830 0.933981 11.446 1500
Kaaij-2013-Mouse IntestinalGen16 mm10 Mouse 0.00710395 12856.2 1509 0.924306 16.3126 1500
Kobayashi-2012-Mouse Oocyte Mouse 0.412013 355274 3167 0.432781 1.68381 6000
Lin-2015-Human BreastTumor089 Human 0.00577594 28336.8 631 0.866766 37.7327 2500
Lin-2015-Human BreastTumor126 Human 0.0603901 96167.3 1944 0.956102 27.1584 1500
Lin-2015-Human BreastTumor198 Human 0.112549 193137 1804 0.965329 14.7521 1500
Lin-2015-Human MCF7 Human 0.191271 536823 1103 0.795231 5.6552 4500
Lister-2011 Human ADS Human 0.271435 355148 2366 0.954075 37.6784 1000
Lister-2011 Human ADSAdipose Human 0.250506 332543 2332 0.954391 41.4368 1000
Lister-2011 Human FF Human 0.346546 493015 2176 0.95584 18.317 1000
Lister-2013-Human DorsPrefrontMale55YrTissue Human 0.019726 69630 877 0.885903 3.00561 3500
Lister-2013-Human DorsPrefrontNeuron53Yr Human 0.0411321 21425.6 5943 0.959114 24.4158 1000
Lister-2013-Human DorsPrefrontNeuronMale55Yr Human 0.038454 24034.3 4953 0.958272 13.6816 1000
Lister-2013-Human DorsPrefrontNonNeuron53Yr Human 0.0231018 20609.8 3470 0.952477 11.8716 1000
Lister-2013-Human DorsPrefrontNonNeuronMale55Yr Human 0.0255131 21608.9 3655 0.956727 12.8111 1000
Lister-2013-Human FetalCerebCortex Human 0.0208915 20248.5 3194 0.955768 25.0421 1000
Lister-2013-Human FrontCortexFemale64Yr Human 0.0173158 27349.2 1960 0.957201 14.7629 1500
Lister-2013-Human HUES6 Human 0.0126527 16293.3 2404 0.938982 8.85026 1500
Lister-2013-Human MidFrontGyr12Yr Human 0.0197063 17237.8 3539 0.960959 19.4822 1000
Lister-2013-Human MidFrontGyr16Yr Human 0.0194229 15927.8 3775 0.96131 19.5498 1000
Lister-2013-Human MidFrontGyr25Yr Human 0.0174462 29480.3 1832 0.939499 9.9812 1500
Lister-2013-Human MidFrontGyr2Yr Human 0.0187089 17903.3 3235 0.961213 19.6029 1000
Lister-2013-Human MidFrontGyr35Day Human 0.0205438 17544.1 3625 0.960846 19.1665 1000
Lister-2013-Human MidFrontGyr5Yr Human 0.0185233 18100.5 3168 0.961073 19.4953 1000
Lister-2013-Mouse FetalFrontCortex Mouse 0.0108999 8410.88 3539 0.928911 13.0411 1000
Lister-2013-Mouse FrontCortexGliaS00bPos Mouse 0.00878284 9409.5 2549 0.935886 31.4486 1000
Lister-2013-Mouse FrontCortexMale10Wk Mouse 0.0115599 9710.47 3251 0.935689 16.3553 1000
Lister-2013-Mouse FrontCortexMale1Wk Mouse 0.00914498 17649.3 1415 0.933898 11.9123 1500
Lister-2013-Mouse FrontCortexMale22Mo Mouse 0.0116626 10204.8 3121 0.936191 19.101 1000
Lister-2013-Mouse FrontCortexMale2Wk Mouse 0.00966302 8987.9 2936 0.936913 22.5663 1000
Lister-2013-Mouse FrontCortexMale4Wk Mouse 0.0121722 9893.11 3360 0.936906 23.3838 1000
Lister-2013-Mouse FrontCortexMale6Wk Mouse 0.0126473 9699 3561 0.936909 21.4634 1000
Lister-2013-Mouse FrontCortexNeuronFemale12Mo Mouse 0.0318426 20547.7 4232 0.934599 14.3156 1000
Lister-2013-Mouse FrontCortexNeuronFemale6Wk Mouse 0.0331892 32083.3 2825 0.932701 10.3069 1500
Lister-2013-Mouse FrontCortexNeuronMale7Wk Mouse 0.034476 20431.7 4608 0.932911 15.4892 1000
Lister-2013-Mouse FrontCortexNonNeuronFemale12Mo Mouse 0.00860716 16414.1 1432 0.931971 9.83999 1500
Lister-2013-Mouse FrontCortexNonNeuronFemale6Wk Mouse 0.00448265 18547.8 660 0.916558 5.58925 2000
Lister-2013-Mouse FrontCortexNonNeuronMale7Wk Mouse 0.00804612 9458.86 2323 0.936245 18.6647 1000
Lister-ESC-2009 Human IMR90 Human 0.359304 423732 2625 0.951786 34.605 1000
Lowe-2013-Human Buccals Human 0.0135645 12896.7 3256 0.975626 122.704 1000
Lund-2014 Human AML3-Control Human 0.287053 354600 2506 0.955973 30.6672 1000
LuWen-2014-Human Ad-front Human 0.0195539 26713.5 2266 0.933881 7.98179 1500
Ma-2014 Human HDF Human 0.300546 472763 1968 0.959558 16.9877 1000
Mann-2014-Mouse Fib Mouse 0.0152006 16806 2470 0.941395 70.1659 1000
Mellen-2017-Mouse CerebralGranule Mouse 0.0103802 32396.4 875 0.904704 4.85252 3000
Menafra-2014 Human MCF7 Human 0.227422 277504 2537 0.964264 37.8841 1500
Mendizabal-2016-Rhesus Brain Rhesus 0.0153995 13170.3 3316 0.938687 13.902 1000
Mo-2015-Mouse EXneurons Mouse 0.0389048 22706.5 4679 0.921133 25.0246 1000
Mo-2015-Mouse PVneurons Mouse 0.0507892 26509.7 5232 0.924662 32.8353 1000
Mo-2015-Mouse VIPneurons Mouse 0.026414 15870.9 4545 0.925207 29.241 1000
Molaro-2011-Sperm Human 0.025841 31042.3 2577 0.960371 17.0328 1500
Pacis-2015-Human DendriticCell Human 0.0177456 12996.2 4227 0.932468 44.8506 1000
Pidsley-2016-Human LNCaP Human 0.235832 98113.7 7441 0.974365 20.7252 1000
Pidsley-2016-Human PreC Human 0.0118625 18188.5 2019 0.974807 20.0961 1000
Qu-2017-Dog Sperm Dog 0.0200912 9455.27 5123 0.98036 88.4085 1000
Qu-2017-Rhesus Sperm Rhesus 0.0121999 7097.14 4875 0.973594 21.317 1000
Roadmap-2015-Human Adipose Human 0.0136271 13718.8 3075 0.97242 129.039 1000
Roadmap-2015-Human Adrenal-gland Human 0.0140375 14948.7 2907 0.967853 72.6502 1000
Roadmap-2015-Human BipolarGastric Human 0.0125644 13590.3 2862 0.943506 24.5137 1000
Roadmap-2015-Human Bladder Human 0.013804 14544.9 2938 0.957157 55.0663 1000
Roadmap-2015-Human Ectoderm Human 0.0273071 16559.1 5105 0.972281 86.7556 1000
Roadmap-2015-Human Endoderm Human 0.0255072 17804.4 4435 0.964282 37.7685 1000
Roadmap-2015-Human Esophagus Human 0.0110996 12014.3 2860 0.969035 69.9464 1000
Roadmap-2015-Human Fetal-Adrenal-gland Human 0.0121191 26198.9 1432 0.951926 21.9393 2000
Roadmap-2015-Human Fetal-Heart Human 0.0165312 14243.1 3593 0.973469 37.4233 1000
Roadmap-2015-Human Fetal-Intestine-Large Human 0.0138625 12599.5 3406 0.970095 15.1093 1000
Roadmap-2015-Human Fetal-Spinal-Cord Human 0.0156523 22920.9 2114 0.962126 33.9026 1500
Roadmap-2015-Human Fetal-Stomach Human 0.00896211 16743.5 1657 0.970968 36.2201 1500
Roadmap-2015-Human fMuscle-leg Human 0.00621188 19058.6 1009 0.951464 26.9837 2000
Roadmap-2015-Human fMuscle-trunk Human 0.00853238 16622.8 1589 0.967284 36.4691 1500
Roadmap-2015-Human H1-BMP4 Human 0.0120971 8142.81 4599 0.954039 38.3839 1000
Roadmap-2015-Human H1-Mesenchymal Human 0.167982 239091 2175 0.968395 37.041 1500
Roadmap-2015-Human H1-Mesendoderm Human 0.0136119 9093.26 4634 0.952882 25.4802 1000
Roadmap-2015-Human H1-NP Human 0.0207851 14754.5 4361 0.949749 27.1624 1000
Roadmap-2015-Human HealthyBreast Human 0.0082285 19253.9 1323 0.955665 33.8186 1500
Roadmap-2015-Human HealthyGastric Human 0.0129634 12687.5 3163 0.952097 23.4497 1000
Roadmap-2015-Human Heart-Aorta Human 0.0161865 16917.1 2962 0.939613 27.6348 1000
Roadmap-2015-Human HSC Human 0.0168811 12296.2 4250 0.972183 37.9275 1000
Roadmap-2015-Human HUES64 Human 0.0253439 13891.1 5648 0.966811 77.4734 1000
Roadmap-2015-Human Left-ventrical Human 0.0176663 17966.4 3044 0.969177 109.566 1000
Roadmap-2015-Human Liver Human 0.225681 290012 2409 0.969852 50.0962 1000
Roadmap-2015-Human Lung Human 0.0144978 12882 3484 0.972831 79.2619 1000
Roadmap-2015-Human Macrophage Human 0.00973419 15835 1903 0.965026 36.4291 1500
Roadmap-2015-Human Mesoderm Human 0.0266715 20977.4 3936 0.97036 64.8203 1000
Roadmap-2015-Human NK Human 0.00632847 13604.9 1440 0.963788 26.9727 1500
Roadmap-2015-Human Ovary Human 0.013548 17438.8 2405 0.965434 51.4479 1000
Roadmap-2015 Human Placenta Human 0.356069 702984 1568 0.967937 27.4747 1500
Roadmap-2015-Human PSAGastric Human 0.0139393 13786.5 3130 0.959341 49.0613 1000
Roadmap-2015-Human Psoas-muscle Human 0.0145291 17180.1 2618 0.958434 42.2433 1000
Roadmap-2015-Human Right-atrium Human 0.0135536 14189.3 2957 0.956922 48.4592 1000
Roadmap-2015-Human Right-ventrical Human 0.0150934 16996.9 2749 0.963242 60.5618 1000
Roadmap-2015-Human Sigmoid-Colon Human 0.0143048 13051.4 3393 0.962467 118.201 1000
Roadmap-2015-Human Small-intestine Human 0.0126222 11368.8 3437 0.975088 109.396 1000
Roadmap-2015-Human Spleen Human 0.0128813 10330.7 3860 0.966908 99.4409 1000
Roadmap-2015-Human Tcell Human 0.00504868 10808.6 1446 0.965837 34.3945 1500
Roadmap-2015-Human Thymus Human 0.0190555 12650.7 4663 0.974728 72.0496 1000
Schlesinger-2013 Human GM12878 Human 0.486969 1.07E+06 1406 0.94791 9.99356 1500
Schroeder-2010 Human SHSY5Y Human 0.184521 745719 766 0.535562 0.898255 11500
Schroeder-2013 Human Placenta Human 0.319553 489238 2022 0.947005 7.81454 2500
Schroeder-2015-Cow Placenta Cow 0.347765 1.52E+06 610 0.946001 7.24804 2500
Schroeder-2015-Dog Placenta Dog 0.376643 922843 984 0.946953 6.91865 2500
Schroeder-2015-Horse Placenta Horse 0.283547 862279 817 0.892904 4.09659 2500
Schroeder-2015-Mouse Placenta Mouse 0.282126 1.27E+06 607 0.871964 5.24208 4000
Schroeder-2015-Rhesus Placenta Rhesus 0.344907 1.15E+06 850 0.776269 1.9885 4500
Schroeder-2015-SquirrelMonkey Placenta SquirrelMonkey 0.358025 1.13E+06 823 0.867797 2.59832 12500
Sheaffer-2014-Mouse Intestine Mouse 0.00654173 12670 1410 0.901523 21.5699 1500
Sheaffer-2014-Mouse IntestineSC Mouse 0.00969644 8406.27 3150 0.916415 23.8125 1000
TCGA-2018-Human 06-0128-01A-Glioblastoma-Multiforme Human 0.205014 278360 2280 0.998375 10.4016 1500
TCGA-2018-Human 14-1401-01A-Glioblastoma-Multiforme Human 0.295503 563987 1622 0.998525 11.341 1500
TCGA-2018-Human 14-1454-01A-Glioblastoma-Multiforme Human 0.181726 201637 2790 0.997475 9.87494 1000
TCGA-2018-Human 14-3477-01A-Glioblastoma-Multiforme Human 0.102113 68944.8 4585 0.996781 10.0542 1500
TCGA-2018-Human 16-1460-01A-Glioblastoma-Multiforme Human 0.0185373 34097.3 1683 0.998726 11.2325 1500
TCGA-2018-Human 19-1788-01A-Glioblastoma-Multiforme Human 0.0190018 18155.5 3240 0.999022 12.555 1000
TCGA-2018-Human 21-1078-01A-Lung-Squamous-Cell-Carcinoma Human 0.155381 211527 2274 0.998418 10.7609 1500
TCGA-2018-Human 34-2600-01A-Lung-Squamous-Cell-Carcinoma Human 0.190027 193191 3045 0.997486 15.0699 1000
TCGA-2018-Human 38-4630-01A-Lung-Adenocarcinoma Human 0.159204 256157 1924 0.997759 10.913 1500
TCGA-2018-Human 44-6148-01A-Lung-Adenocarcinoma Human 0.00933386 19750.3 1463 0.998892 11.0023 1500
TCGA-2018-Human 44-6148-11A-MatchedHealthy-Lung-Adenocarcinoma Human 0.00985904 16815.7 1815 0.998846 11.0146 1500
TCGA-2018-Human 60-2695-01A-Lung-Squamous-Cell-Carcinoma Human 0.0657992 52269.5 3897 0.997885 15.6954 1000
TCGA-2018-Human 60-2722-01A-Lung-Squamous-Cell-Carcinoma Human 0.155421 134208 3585 1 15.1834 1000
TCGA-2018-Human 60-2722-11A-MatchedHealthy-Lung-Squamous-Cell-Carcinoma Human 0.0111004 10926.4 3145 0.999225 20.0658 1000
TCGA-2018-Human 67-6215-01A-Lung-Adenocarcinoma Human 0.230229 428317 1664 0.998355 10.7555 1500
TCGA-2018-Human 78-7156-01A-Lung-Adenocarcinoma Human 0.142153 159385 2761 0.99758 10.4383 1500
TCGA-2018-Human 91-6840-01A-Lung-Adenocarcinoma Human 0.320476 301366 3292 0.99845 12.0619 1500
TCGA-2018-Human A2-A04X-01A-Breast-Invasive-Carcinoma Human 0.239502 288942 2566 0.9997 29.917 1000
TCGA-2018-Human A2-A0YG-01A-Breast-Invasive-Carcinoma Human 0.0907478 82214.6 3417 0.999555 28.4931 1000
TCGA-2018-Human A5-A0G2-01A-Uterine-Corpus-Endometrial-Carcinoma Human 0.182604 229885 2459 0.998738 11.3695 1500
TCGA-2018-Human A7-A0CE-01A-Breast-Invasive-Carcinoma Human 0.158709 152772 3216 1 18.1708 1000
TCGA-2018-Human A7-A0CE-11A-MatchedHealthy-Breast-Invasive-Carcinoma Human 0.0104902 20978.4 1548 1 15.2756 1500
TCGA-2018-Human A8-A07I-01A-Breast-Invasive-Carcinoma Human 0.0723469 80187.6 2793 0.999635 27.4549 1000
TCGA-2018-Human AA-3518-01A-Colon-Adenocarcinoma Human 0.292346 689271 1313 1 20.1467 1500
TCGA-2018-Human AA-3518-11A-MatchedHealthy-Colon-Adenocarcinoma Human 0.0103355 11105.8 2881 1 18.5438 1000
TCGA-2018-Human AA-A00R-01A-Colon-Adenocarcinoma Human 0.451599 624113 2240 0.999463 18.624 1000
TCGA-2018-Human AF-2689-01A-Rectum-Adenocarcinoma Human 0.357862 381616 2903 0.998825 13.8711 1000
TCGA-2018-Human AF-2689-11A-MatchedHealthy-Rectum-Adenocarcinoma Human 0.00957256 8569.61 3458 0.999153 12.7134 1000
TCGA-2018-Human AG-3593-01A-Rectum-Adenocarcinoma Human 0.336108 831061 1252 0.999361 17.8 1000
TCGA-2018-Human AP-A05J-01A-Uterine-Corpus-Endometrial-Carcinoma Human 0.168228 282574 1843 0.997209 11.1481 1500
TCGA-2018-Human AX-A1CI-01A-Uterine-Corpus-Endometrial-Carcinoma Human 0.135517 138684 3025 1 14.4983 1000
TCGA-2018-Human AX-A1CI-11A-MatchedHealthy-Uterine-Corpus-Endometrial-Carcinoma Human 0.0142057 14678.4 2996 1 14.8402 1000
TCGA-2018-Human AX-A1CK-01A-Uterine-Corpus-Endometrial-Carcinoma Human 0.196215 193446 3140 0.998475 11.7104 1000
TCGA-2018-Human B5-A0K6-01A-Uterine-Corpus-Endometrial-Carcinoma Human 0.220697 250996 2722 0.998778 11.8715 1000
TCGA-2018-Human BL-A13J-01A-Bladder-Urothelial-Carcinoma Human 0.267046 540675 1529 0.998988 12.8328 1500
TCGA-2018-Human BR-6452-01A-Stomach-Adenocarcinoma Human 0.326219 1.03E+06 982 0.998752 11.9417 1500
TCGA-2018-Human BR-6452-11A-MatchedHealthy-Stomach-Adenocarcinoma Human 0.00305082 15282.2 618 0.998857 11.8944 1500
TCGA-2018-Human BT-A20V-01A-Bladder-Urothelial-Carcinoma Human 0.37983 362577 3243 0.998705 13.2644 1500
TCGA-2018-Human BT-A20V-11A-MatchedHealthy-Bladder-Urothelial-Carcinoma Human 0.00727951 15456.2 1458 0.99932 12.9582 1500
TCGA-2018-Human BT-A2LA-01A-Bladder-Urothelial-Carcinoma Human 0.100086 134595 2302 0.997683 10.4935 1500
TCGA-2018-Human CG-5730-01A-Stomach-Adenocarcinoma Human 0.180548 274250 2038 0.998659 12.2486 1500
TCGA-2018-Human D7-6519-01A-Stomach-Adenocarcinoma Human 0.136266 236588 1783 0.998696 12.7212 1500
TCGA-2018-Human DK-A1AA-01A-Bladder-Urothelial-Carcinoma Human 0.36945 464543 2462 0.999093 13.6432 1000
TCGA-2018-Human DK-A1AG-01A-Bladder-Urothelial-Carcinoma Human 0.213492 328483 2012 0.999003 12.7415 1500
TCGA-2018-Human E2-A15H-01A-Breast-Invasive-Carcinoma Human 0.120156 116312 3198 0.999693 29.6908 1000
TCGA-2018-Human F1-6177-01A-Stomach-Adenocarcinoma Human 0.293383 840170 1081 0.998808 11.469 1500
TCGA-2018-Human H4-A2HQ-01A-Bladder-Urothelial-Carcinoma Human 0.287211 271404 3276 0.998468 12.4326 1500
Tung-2011-Rhesus PBMC High Rhesus 0.00624636 7024.01 2522 0.98016 31.165 1000
Tung-2011-Rhesus PBMC Low Rhesus 0.00681779 7201.16 2685 0.981169 34.6181 1000
Vandiver-2015-Human Epidermis-old-sun-exposed Human 0.0123751 18303.7 2093 0.978648 19.0104 1000
Vandiver-2015-Human Epidermis-old-sun-protected Human 0.0125106 12685.5 3053 0.978996 19.319 1000
Vandiver-2015-Human Epidermis-young-sun-exposed Human 0.0138249 12415.9 3447 0.979187 20.8366 1000
Vandiver-2015-Human Epidermis-young-sun-protected Human 0.015178 12075.7 3891 0.979459 21.7911 1000
Wang-2014-Mouse Oocyte Mouse 0.396314 227084 4766 0.923001 13.8841 1500
WeiXie-Mouse-2012-Mouse FrontalCortex-129 Mouse 0.0119839 20201.6 1620 0.916568 19.6035 1500
WeiXie-Mouse-2012-Mouse FrontalCortex-Cast Mouse 0.00320083 22298.6 392 0.853837 11.908 3000
WeiXie-Mouse-2012-Mouse FrontalCortex-F1i Mouse 0.0102173 10946.3 2549 0.938507 50.1897 1000
WeiXie-Mouse-2012-Mouse FrontalCortex-F1r Mouse 0.00874258 9232.36 2586 0.934856 42.057 1000
Xie-2013 Human IMR90 Human 0.356954 425335 2598 0.952602 28.5332 1000
Zeng-2012-Human-PreFrontCortex Human 0.0165979 27744 1852 0.960971 11.1963 1500
Ziller-2013-Human HepG2 Human 0.444385 1.06E+06 1301 0.883163 7.9983 4000
Table B.8: Summary statistics of segmentation for each sample
studied in Chapter 4.
Species; # Samples escapee in; Overlapping RefSeq Annotations
Human 37 NM_001318330 NM_024948
Human 36 NM_002553 NM_181747
Human 31 NM_001005413 NM_007057 NM_032997
Human 22 NM_001013620 NM_001308340
Human 20 NM_003188 NM_145331 NM_145332 NM_145333
Human 20 NM_015325
Human 19 NM_001018041 NM_005670 NR_038246 NM_001348092 NM_032145 NR_038244 NR_038245 NM_001042683 NM_173082
Human 19 NM_017813
Human 17 NM_001318058 NM_001318059 NM_033428
Human 17 NM_013272 NM_001145044 NR_135775
Human 16 NM_001282748 NM_001282749 NM_001282753 NM_001282760 NM_001351541 NM_001351542 NM_001351550 NM_001351552 NM_001351558 NM_001351562 NM_001351564 NM_007005 NR_104239 NM_001351543 NM_001351546 NM_001351547 NM_001351556 NM_001351560 NM_001351563
Human 15 NM_002358
Human 15 NM_004737 NM_133642 NR_039921 NR_145833 NR_038949 NR_038950
Human 15 NM_015283
Human 15 NR_132999 NR_133000 NR_003660 NR_132997 NR_132998 NR_125727 NR_125728 NR_125729 NR_132994 NR_132995 NR_132996
Human 14 NM_001253881 NM_004269 NM_001253882
Human 14 NM_001345975 NM_001345976 NM_022828
Human 12 NM_001113434
Human 12 NM_001134232 NM_018374
Human 12 NM_001282986 NM_080651
Human 12 NM_153634
Human 11 NM_001144891 NM_001308110 NM_016586
Human 11 NM_001286984 NM_001286985 NM_001114395 NM_017738
Human 11 NM_001290223 NM_001380 NM_001039762
Human 11 NM_001308020 NM_002647
Human 11 NM_001568
Human 11 NM_014167 NR_033192 NM_001319675 NM_001347934 NM_032230 NR_144940 NR_144941 NR_144942 NR_144943
Human 11 NR_026668 NM_001018038 NM_015186 NM_001018037 NM_033305
Human 10 NM_001146071 NM_030794 NM_001146070
Human 10 NM_001270622 NM_001270623 NM_004731 NR_073055 NR_073056
Human 10 NM_001352681 NM_001352680 NM_152834
Human 10 NM_032505
Human 10 NR_137166 NM_001326542 NM_001326543 NM_003112 NR_031594
Human 9 NM_001256584 NM_002101 NM_016815
Human 9 NM_001286703 NM_001286704 NM_001286705 NM_001286706 NM_016617 NR_104584 NR_104585
Human 9 NM_003033 NM_173344
Human 9 NM 005482
Human 9 NM 018040 NM 001297754
Human 9 NM 018368
Human 9 NM 019002
Human 9 NM 020905 NM 001199103 NM 001199104 NM 001002006
NM 001199086 NM 001199087 NM 001199088 NM 033253
Human 8 NM 001024844 NM 002231
Human 8 NM 001104647 NM 018155
Human 8 NM 001244949 NM 020918
Human 8 NM 007249
Human 8 NM 015976 NM 152238 NR 033716
Human 8 NM 020751 NR 026745 NM 001145079 NR 036190
Human 8 NM 022166 NR 135179
Human 8 NM 032228
Human 8 NM 032834
Human 8 NM 178496
Human 8 NR 027027 NR 027028 NR 027026
Human 7 NM 000254 NM 001291939 NM 001291940
Human 7 NM 001018039 NM 001029880 NR 132972
Human 7 NM 001135703 NM 013437
Human 7 NM 001136262
Human 7 NM 001167614 NM 014319
Human 7 NM 001199865 NM 024704 NM 001199866
Human 7 NM 003747 NR 030327
Human 7 NM 006558
Human 7 NM 006828 NM 001284271 NM 022091
Human 7 NM 018686 NR 135117
Human 7 NM 020179
Human 7 NM 173471 NM 001164796 NM 001350993 NR 028475 NM 001350991
NM 001350992 NM 015541
Human 6 NM 001039580 NR 125937
Human 6 NM 001135610 NM 015866 NM 012231 NM 001007257
Human 6 NM 001146312 NM 153604 NR 104605 NR 104607 NM 001321164
NM 001321166 NM 001321167 NM 001321168 NM 014859 NR 135569
NR 039747 NM 001165962 NM 018127 NM 173717
Human 6 NM 001286139 NM 152313
Human 6 NM 001320302 NR 135199 NM 001017961 NR 030626
Human 6 NM 020943
Human 6 NM 178312 NM 178311
Human 6 NM 198449
Human 6 NM 198507
Human 6 NR 136252 NR 136253 NM 001142326 NM 001142327 NR 024549
NR 024550 NM 021145 NM 024315 NM 001329475 NM 001329472
NM 001329473 NM 001329474 NR 138030
Human 5 NM 000585 NM 172175 NR 037840
Human 5 NM 001163315
Human 5 NM 001308155 NM 014498
Human 5 NM 033026 NM 014510
Human 5 NR 033985
Human 5 NR 110759 NR 135636
Human 5 NR 133571 NM 152432 NR 145747
Human 4 NM 000158
Human 4 NM 001172509 NM 001172517 NM 015265 NR 134967 NR 026830
Human 4 NM 001258296 NM 001258297 NM 012340 NM 173091 NM 001136021
NM 001258292 NM 001258294 NM 001258295 NR 036162
Human 4 NM 001278555 NM 182678 NM 006357 NM 001278554
Human 4 NM 001286521 NM 001286522 NM 020367 NR 104461
Human 4 NM 001294161 NM 018431 NM 177959
Human 4 NM 001300985 NM 001300986 NM 001300987 NM 001300988
NM 001300989 NM 001300990 NM 021190 NR 125357 NR 125356
Human 4 NM 001321173 NM 001321174 NM 001321175 NM 001321176
NM 005486 NR 027941 NR 027942 NR 135677 NM 001162861
NM 001162862 NM 004375 NM 001321518 NM 178509
Human 4 NM 001330561 NM 004133
Human 4 NM 001350618 NM 001350617 NM 144974 NM 001128303
NM 001350638 NM 001350640 NM 001350641 NM 001350644
NM 001350645 NM 001350639 NM 001350642 NM 001350643
NM 001350646 NM 001350647 NM 001350648 NM 153218
Human 4 NM 001867 NR 145802
Human 4 NM 002412
Human 4 NM 002515 NM 006489 NM 006491
Human 4 NM 014243
Human 4 NM 015253
Human 4 NM 133494
Human 4 NR 121628 NR 121627
Human 4 NR 134669
Human 4 NR 149032
Human 3 NM 000127 NR 145799
Human 3 NM 001014972 NM 001252612 NM 001252613 NM 014497
Human 3 NM 001025616 NR 039656 NM 001042669 NM 001346093
NM 001287805 NM 031305
Human 3 NM 001130110 NM 015559 NR 036203
Human 3 NM 001160210 NM 014251 NR 027662 NR 030322
Human 3 NM 001173463 NM 001173464 NM 001173465 NM 017641
Human 3 NM 001178055 NM 001178056 NM 024615 NM 001331028
Human 3 NM 001243236 NM 001348216 NM 001348215 NM 001243234
NM 001243235 NM 001348214 NM 001330605 NM 001243232
NM 001306208 NM 001243233 NM 001348212 NM 001348213
NM 001243231 NM 001348211 NM 001306207 NM 001348217
NM 001348218 NM 001348219 NM 001348220 NM 001083962
NM 001243230 NM 001330604 NM 003199 NM 001243227
NM 001243228 NM 001243226 NR 132985 NR 039754
Human 3 NM 001320322 NM 001320321 NM 152588
Human 3 NM 001323291 NM 178532 NM 001323292 NM 001113561
Human 3 NM 001323312 NM 001015508 NM 013357 NM 001323311 NM 000553
Human 3 NM 001330061 NM 001330066 NM 001099687 NM 016212 NR 110886
NR 110897 NR 110910 NR 110911 NM 001205259 NR 110914
Human 3 NM 001330156 NM 001100818 NM 001330157 NM 001330158
NM 017933
Human 3 NM 001330242 NM 001330243 NM 024560 NR 039848
Human 3 NM 001356429 NM 006765 NM 178234
Human 3 NM 006080
Human 3 NM 021643 NR 027303 NR 036072
Human 3 NM 053002 NM 013308 NM 001081455 NM 014879 NM 023915
NM 176894 NM 176876 NM 022788 NM 001178145 NM 001178146
NM 178822
Human 3 NM 194250
Human 3 NM 201545 NR 034040 NM 006499 NM 201543 NM 201544 NM 018072
Human 3 NM 203394
Human 3 NR 003083
Human 3 NR 003663
Human 3 NR 033990
Human 3 NR 034024 NR 034025
Human 3 NR 077061
Human 3 NR 120583
Human 3 NR 120647
Human 3 NR 130740 NR 130741 NR 130742
Human 3 NR 146890
Human 2 NM 000700
Human 2 NM 000746 NM 001190455 NR 046324
Human 2 NM 001004722 NM 003581 NM 001004720
Human 2 NM 001007102 NM 032438 NM 001346551 NM 001346550
Human 2 NM 001007559 NM 005637 NM 001308201
Human 2 NM 001009909 NM 001252008 NM 001252010
Human 2 NM 001010848 NM 001165972 NM 001165973 NR 120666
Human 2 NM 001013735
Human 2 NM 001042601 NM 133462 NM 001288582 NM 181426
Human 2 NM 001080396 NR 145739 NR 031671 NR 046848
Human 2 NM 001098626
Human 2 NM 001101669 NM 003866 NM 001331040
Human 2 NM 001128174 NM 001322112 NM 001322113 NM 001322114
NM 003360 NR 030303
Human 2 NM 001136016 NM 000484 NM 001136129 NM 001136130
NM 001204301 NM 001204302 NM 001204303 NM 201413 NM 201414
NM 001136131
Human 2 NM 001136051 NM 153377
Human 2 NM 001145065 NM 207491
Human 2 NM 001177599 NM 003848
Human 2 NM 001195215 NR 125340 NM 001300858 NM 144977 NM 001195216
Human 2 NM 001198683 NM 017526 NM 001003680 NM 001003679 NM 002303
NM 001198681 NM 001198688 NM 001198687 NM 001198689
Human 2 NM 001199328 NM 001199329 NM 003838 NM 001199327
NM 001112808 NM 015978
Human 2 NM 001244813 NM 001349344 NM 001244815 NM 001349337
NM 001349342 NM 001349343 NM 001244814 NM 001244812
NM 001244816 NM 001244808 NM 001349338 NM 001349339
NM 001349340 NM 001349341 NR 146142 NR 146143 NM 001244810
NM 032682 NM 001012505 NR 126463 NR 031697
Human 2 NM 001244972 NM 001244973 NM 001330068 NM 003873 NR 045259
NM 001024628 NM 001024629
Human 2 NM 001258366 NM 001258367 NM 001258368 NM 001042517
NM 030932 NM 001258369 NM 001258370 NR 046539 NR 109838
NR 109839 NR 051993 NR 051994 NR 046540
Human 2 NM 001282736 NM 032812
Human 2 NM 001286449 NM 198524 NM 018365
Human 2 NM 001303100 NM 207332
Human 2 NM 001305203 NM 001305206 NM 001305207 NM 001305208
NM 133458 NR 130976 NR 130977 NR 130978 NM 001305204
Human 2 NM 001319146 NM 199427 NM 018197 NM 022088 NM 199426
Human 2 NM 001321825 NM 018086
Human 2 NM 001330557 NM 001281533 NM 001281534 NM 012307
NM 001281535
Human 2 NM 001349568 NM 001074 NM 001330719
Human 2 NM 001350960 NM 032549 NM 001244606 NM 001350959
NM 001350961 NM 001350963 NM 001350962 NM 001099658
NM 001099660 NM 018334 NM 001350964
Human 2 NM 001351420 NM 001351419 NM 001351415 NM 001351416
NM 001351417 NM 001351418 NM 001401 NM 057159 NM 001351406
NM 001351407 NM 001351408 NM 001351409 NM 001351410
NM 001351411 NM 001351412 NM 001351413 NM 001351414
NM 001351400 NM 001351401 NM 001351402 NM 001351403
NM 001351404 NM 001351405 NM 001351397 NM 001351398
NM 001351399
Human 2 NM 001352596 NM 001352594 NM 001352593 NM 001352591
NM 001352592 NM 001352595 NM 004540 NM 001352597
Human 2 NM 001355197
Human 2 NM 001796
Human 2 NM 003728
Human 2 NM 005098 NR 033652 NR 033651 NM 007332
Human 2 NM 006918 NM 001024956
Human 2 NM 016205 NR 036641
Human 2 NM 019087 NR 030307 NR 039664
Human 2 NM 024335
Human 2 NM 030797
Human 2 NM 031461 NM 001286777 NM 001286778
Human 2 NM 144633 NR 039954
Human 2 NM 152375
Human 2 NM 152573
Human 2 NM 152744 NM 001079653
Human 2 NM 173552 NM 001134470
Human 2 NM 173812
Human 2 NM 178123
Human 2 NM 194449
Human 2 NM 201264 NM 018534 NM 201267 NM 003872 NM 201266 NM 201279
Human 2 NR 003595 NM 152627
Human 2 NR 024249
Human 2 NR 027503
Human 2 NR 030754
Human 2 NR 031628
Human 2 NR 031737
Human 2 NR 033961
Human 2 NR 034115 NR 132442 NM 001127715 NM 139244
Human 2 NR 110917
Human 2 NR 146478
Human 2 NR 146712
Human 2 NR 147083
Mouse 20 NM 001277273 NM 025923 NR 102382 NM 001252447 NM 027260
Mouse 18 NM 001190156 NM 029655
Mouse 9 NM 001172095 NM 144787
Mouse 9 NM 175193
Mouse 9 NM 181414
Mouse 8 NM 023479 NM 001099288 NM 175003
Mouse 7 NM 009982
Mouse 7 NM 011931 NR 028518 NR 039543
Mouse 7 NM 172865
Mouse 7 NM 178407
Mouse 7 NM 198007
Mouse 6 NM 001162933
Mouse 6 NM 010305
Mouse 6 NM 030560 NM 172667
Mouse 6 NM 175088
Mouse 6 NR 046233
Mouse 5 NM 001045518
Mouse 5 NM 009234
Mouse 5 NM 010158
Mouse 5 NM 025675
Mouse 5 NM 172120
Mouse 5 NM 198865
Mouse 4 NM 001081230
Mouse 4 NM 001122733 NM 021099
Mouse 4 NM 001163301 NM 001163302 NM 001163305 NM 001163306
NM 001267714 NM 001267715 NM 008994
Mouse 4 NM 008858
Mouse 4 NM 009555
Mouse 4 NM 010167
Mouse 4 NM 011507
Mouse 4 NM 026321
Mouse 4 NM 027992
Mouse 4 NM 172920
Mouse 4 NR 045907
Mouse 3 NM 001177832
Mouse 3 NM 016853
Mouse 3 NM 024270
Mouse 3 NM 144552
Mouse 3 NM 172049
Mouse 3 NM 175187 NR 029418 NR 029419
Mouse 3 NR 033146
Mouse 2 NM 001013608 NM 023507
Mouse 2 NM 001025600 NM 018770 NM 207675 NM 207676 NR 046048
Mouse 2 NM 001033385 NM 001163833
Mouse 2 NM 001033500
Mouse 2 NM 001038643 NM 023908
Mouse 2 NM 001081345 NR 037569
Mouse 2 NM 001099299
Mouse 2 NM 001113198 NM 001178049 NM 008601
Mouse 2 NM 001127261 NM 001127259 NM 001127260 NM 001127263
NM 001127265 NM 001127262 NM 001127264 NM 011641
Mouse 2 NM 001145830 NM 019677 NR 040517 NR 039540
Mouse 2 NM 001146022 NM 001146021 NM 026253
Mouse 2 NM 001164593 NM 001164594
Mouse 2 NM 001166414 NM 011265
Mouse 2 NM 001167983 NM 172579
Mouse 2 NM 001199177 NM 133752
Mouse 2 NM 001267846 NM 001267847 NM 021716
Mouse 2 NM 001285831 NM 027379
Mouse 2 NM 007525
Mouse 2 NM 007960 NM 001163154
Mouse 2 NM 008598
Mouse 2 NM 008682
Mouse 2 NM 008913
Mouse 2 NM 010162
Mouse 2 NM 010210 NR 030680
Mouse 2 NM 011943
Mouse 2 NM 013464
Mouse 2 NM 018773
Mouse 2 NM 019550
Mouse 2 NM 019739
Mouse 2 NM 021424
Mouse 2 NM 021496 NM 021497 NM 021495
Mouse 2 NM 023284
Mouse 2 NM 027225 NM 177025
Mouse 2 NM 027444
Mouse 2 NM 028547
Mouse 2 NM 080285
Mouse 2 NM 130880
Mouse 2 NM 145990
Mouse 2 NM 172484 NM 001081756 NM 176957
Mouse 2 NM 172870
Mouse 2 NM 175448
Mouse 2 NM 177718
Mouse 2 NM 181401
Mouse 2 NR 033552 NM 011546
Mouse 2 NR 040756
Mouse 2 NR 045837
Mouse 2 NR 045905 NM 030708
Mouse 2 NR 106083
Table B.9: Mouse and human escapee genes. Overlapping RefSeq
annotations include alternative transcripts and non-coding or other
nearby small RNAs in addition to the primary full length transcript.
Abstract
DNA methylation is a critical modification to DNA at CpG dinucleotides that associates strongly with gene expression and aging, and that changes in striking and characteristic ways during cellular differentiation and tumorigenesis. The majority of CpGs in mammalian genomes remain highly methylated throughout differentiation. Kilobase-scale hypomethylated regions (HMRs) frequently co-locate with regulatory regions, and genome-wide HMR profiles correlate with cell-type-specific transcriptional profiles. In tumorigenesis and placental development, additional megabase-scale regions of hypomethylation known as partially methylated domains (PMDs) arise and appear to have a repressive effect on surrounding transcription. In this dissertation, I describe an initial project on DNA methylation dynamics in the placenta that spurred my interest in the analysis of PMDs, and then introduce improved methods to identify PMDs and analyze PMD-containing methylomes. In the last chapter, I use those methods to explore PMD formation and function across a wide array of PMD-containing methylomes in different species, improving our understanding of their evolution and relationship with replication timing and identifying a set of genes that escape their repressive effects.
Asset Metadata
Creator
Decato, Benjamin Edgar (author)
Core Title
Identification and analysis of shared epigenetic changes in extraembryonic development and tumorigenesis
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
10/11/2019
Defense Date
08/10/2018
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
bioinformatics,cancer,Computational Biology,DNA methylation,epigenome,OAI-PMH Harvest,partially methylated domains,Placenta
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Smith, Andrew D. (committee chair), Calabrese, Peter (committee member), Dean, Matthew D. (committee member), Rohs, Remo (committee member)
Creator Email
bdecato@gmail.com,decato@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-88394
Unique identifier
UC11675444
Identifier
etd-DecatoBenj-6816.pdf (filename),usctheses-c89-88394 (legacy record id)
Legacy Identifier
etd-DecatoBenj-6816.pdf
Dmrecord
88394
Document Type
Dissertation
Rights
Decato, Benjamin Edgar
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA