UNDERSTANDING DNA METHYLATION AND NUCLEOSOME ORGANIZATION IN
CANCER CELLS
USING SINGLE MOLECULE SEQUENCING
by
Yaping Liu
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(GENETIC, MOLECULAR, AND CELLULAR BIOLOGY)
Dec 2014
Copyright 2014 Yaping Liu
Epigraph
“Nothing in biology makes sense except in the light of evolution.”
by Theodosius Dobzhansky [1]
[1] Dobzhansky T. American Biology Teacher, March 1973 (35:125-129)
Dedication
To my parents, my wife and our "little bean". To me, you are the most beautiful and
important things in this world.
Acknowledgments
Foremost, I would like to express my deepest gratitude to my advisor, Dr. Benjamin P.
Berman, who profoundly influenced my whole scientific career. I was greatly impressed
by his passion, curiosity, critical thinking and wisdom in science, and decided to follow
him when he was a postdoc. I learned how to become a computational biologist entirely
from him: to program, to criticize my own work, to be independent, and to always, always,
always love not only scientific discovery but also the path that leads to exciting results.
I will never forget all his patience, guidance and encouragement in my research and life.
I would also like to thank my dissertation committee members, Drs. Peter Laird,
Kimberly Siegmund and Andrew Smith. Dr. Peter Laird, as my mentor in my first year
at the USC Epigenome Center, showed me how knowledgeable a scientist could be and
always brought insightful comments and questions. Dr. Kimberly Siegmund provided a
great deal of help and many suggestions when I was still a newbie in statistics. Dr. Andrew
Smith taught me not only the skills of computational biology but also how to think as a
computational biologist.
I am also incredibly lucky and honored to have collaborated with Dr. Peter Jones's
lab. Dr. Peter Jones always provided generous support for all the ideas we pursued. I really
enjoyed and learned a lot about the experimental side during the weekly discussions with
Dr. Terry Kelly and Fides Lay. The beautiful data produced by Dr. Fides Lay, who is also my
classmate, made my whole life easier and more relaxed. I would feel very lucky if we could
continue this long-term collaboration in the future.
I would like to thank Zack Ramjan for all his tremendous and generous help with IT
support. I also feel lucky to have had Dr. Huy Dinh and Lijing Yao for the weekly
discussions that formed a productive group in the lab. I would like to express my thanks to
Dr. Charlie Nicolet, Selene Tyndale and Helen Truong for the next generation sequencing
data production. I am also thankful for Drs. Hui Shen and Timothy Trich’s help with my
qualifying exam and insightful discussions in the ECBM meeting. I am also obliged to many
of my colleagues and friends at the USC Epigenome Center who have offered tremendous
help during my PhD study: Dr. Toshi Hinoue, Dr. Suhn Rhie, Dr. Houtan Noushmehr, Dr.
Vasu Punj, Dr. Daniel Weisenberger, Dr. David Van Den Berg, Dr. Jonathan Buckley,
Natalya Shatokhina, Moiz Bootwalla, Dennis Maglinte, Christine Lavoie, Jenifer Walker,
Janice Galler and Kendra Bergen.
I would like to thank my qualifying exam committee member, Dr. Peggy Farnham, who
provided many helpful suggestions and ideas in our discussions. I am thankful to
Dr. Adam Blatter and Heather Witt for their help in technical discussions.
I would also like to thank Dr. Amir Eden and his lab members for their help with
fibroblast cell culture and the OIS project. I would like to thank Dr. Xiaojing Yang and
Gangning Liang for the discussions comparing the AcceSssIble assay and NOMe-seq technology.
Thanks to Dr. Qiang Song and Jianghan Qu from Dr. Andrew Smith's lab for their great help
with the HMM model. Thanks to Ying Wu for the discussions about genotyping in Bisulfite-seq
and NOMeClonePlot. I would like to thank Dr. Clayton Collings for NOMeToolkit
manuscript preparation.
I would like to thank Drs. Ite A. Laird-Offringa and Sergey Nuzhdin for their support
during my rotations in the first year. I would also like to thank the HPCC staff for
high-performance computing IT support.
I would like to thank the Charles Heidelberger Memorial Predoctoral Scholarship, TCGA
and an NCI grant for financial support in writing this dissertation.
Last but not least, I would like to thank my wife, Li Wang, my mother, Xiaoling
Wu, my father, Dongbao Liu, and my family. Without their understanding, support, and
love, I would not have been able to finish this journey.
Table of Contents
Epigraph
Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 Introduction
1.2 Epigenetics
1.3 DNA methylation
1.4 Histone variants and histone modifications
1.5 Nucleosome organizations and chromatin accessibility
1.6 Epigenetic aberration in cancer
1.7 Epigenome consortium projects as the reference maps
Chapter 2: Bis-SNP: Combined DNA methylation and SNP calling for Bisulfite-seq data
2.1 Introduction
2.2 Materials and Methods
2.2.1 Local realignment, base quality recalibration and other BAM file preprocessing
2.2.2 BisSNP probabilistic model
2.2.3 Five-prime bisulfite non-conversion filter
2.2.4 Pre-SNP calling quality filters
2.2.5 Downsampling coverage
2.2.6 External tools used for comparison
2.2.7 Datasets used for whole-genome comparisons
2.3 Results
2.3.1 Bis-SNP workflow
2.3.2 Description of SNP calling algorithm
2.3.3 Evaluation of SNP calls at known SNPs
2.3.4 Accuracy of genome-wide methylation calling
2.3.5 Discussion
Chapter 3: NOMeToolkit: Integrative computational pipeline for NOMe-seq and whole genome bisulfite sequencing
3.1 Introduction
3.2 Results and discussion
3.2.1 NOMeToolkit workflow
3.2.2 Quality Control (QC) before reads mapping
3.2.2.1 Adapter contamination estimation and adapter trimming
3.2.2.2 Inverted duplication reads
3.2.3 QC after reads mapping
3.2.4 Bis-SNP pipeline
3.2.5 Beta-binomial HMM based segmentation
3.2.6 Annotation and visualization of NOMe-seq data and integration of external datasets
Chapter 4: NOMe-seq: Genome-wide mapping of nucleosome positioning and DNA methylation with single molecule sequencing
4.1 Introduction
4.2 Materials and methods
4.2.1 Cell culture
4.2.2 Nucleosome footprinting
4.2.3 Nucleosome footprinting in fresh-frozen colon tumor
4.2.4 Library construction and sequencing
4.2.5 Low-input generation of whole-genome bisulfite sequencing library
4.2.6 Sequence alignment and extraction of CG and GC methylation levels
4.2.7 Genomic element average profile plots
4.2.8 Promoter nucleosome-depleted region detection
4.2.9 NDR and linker identification
4.2.10 IMR90 MNase-seq
4.2.11 Nucleosome occupancy score
4.2.12 MSRE datasets
4.2.13 Combinatorial epigenomic signatures
4.2.14 Other data access
4.3 Results
4.3.1 Identifying optimal treatment conditions for accurate footprinting of a variety of genomic loci
4.3.2 NOMe-seq reveals expected nucleosome occupancy patterns at CTCF and transcription start sites
4.3.3 DNA methylation marks inter-nucleosome linker regions throughout the human genome
4.3.4 NOMe-seq reveals the genome-wide transcription factors binding strength in vivo
4.3.5 NOMe-seq reveals distinct chromatin configurations at specific promoter types in both of cell line and primary fresh frozen tumor sample
4.3.6 Combinatorial epigenomic signatures reveal functional chromatin
4.3.7 NOMe-seq reveals the genome-wide epigenetic switching from neural stem cell to glioblastoma
4.4 Discussion
Chapter 5: The context-dependent roles of DNA methylation in directing the functional organization of the epigenome
5.1 Introduction
5.2 Materials and methods
5.2.1 Cell Culture
5.2.2 Genome-wide nucleosome footprinting assay
5.2.3 Hidden-Markov model-based approach of NDR detection
5.2.4 Defining classes of promoters
5.2.5 ChIP-seq
5.2.6 ChromHMM
5.2.7 RNA-seq
5.3 Results
5.3.1 NOMe-seq detects NDR changes upon the global loss of DNA methylation
5.3.2 Reorganization of nucleosomes occurs in the absence of DNA methylation at CGI promoters
5.3.3 The loss of DNA methylation in CGI promoters results in the acquisition of active and poised histone marks
5.3.4 MU promoters fall into two distinct chromatin states marked by modification at H3K27
5.3.5 The loss of DNA methylation does not alter the chromatin structure of non-CGI promoters
5.3.6 Long range accessibility changes in partially methylated domains reveal association with heterochromatic H3K9me3 domain
5.4 Discussions
Chapter 6: The analysis of epigenomic map during oncogene induced senescence in fibroblast cells
6.1 Introductions
6.2 Materials and Methods
6.2.1 Cell culture and retroviral infection
6.2.2 Immunofluorescence, antibodies, SAHF, and SA-gal staining
6.2.3 Fluorescence in situ hybridization
6.2.4 DNA methylation assay and computational analysis
6.3 Results
6.3.1 Small but consistent changes of DNA methylation distinguish OIS and normal fibroblast cells
6.3.2 The number of significantly changed DNA methylation probes are small and enriched in non-coding regions
6.4 Discussion
Chapter 7: Discussion and future directions
7.1 Summary and discussion
7.2 Perspective and future directions
Bibliography
Appendix: Detailed description of TCGA Whole Genome Bisulfite sequencing Bis-SNP pipeline
List of Tables
2.1 Chromosome 1 Bis-SNP detection
2.2 Genome-wide cytosine counts and methylation
3.1 Reads mapping efficiency before and after InvertDupsHunter trimming
List of Figures
1.1 DNA demethylation pathway.
2.1 Detecting single nucleotide polymorphisms from Bisulfite-seq data.
2.2 Bis-SNP workflow.
2.3 Bis-SNP error frequencies in detecting SNPs on the Illumina 1M SNP array.
2.4 Bis-SNP error frequencies at C:T heterozygous SNPs.
2.5 Sensitivity as a function of sequence coverage.
2.6 Accurate methylation calling at SNPs.
3.1 NOMeToolkit workflow
3.2 Inverted duplication reads
3.3 Two states beta-binomial HMM segments NOMe-seq accessibility signal
4.1 NOMe-seq can footprint a variety of chromatin structures.
4.2 GCG excluded in the downstream computational analysis does not affect accessibility measurement greatly.
4.3 NOMe-seq displays nucleosome occupancy profiles at specific loci and globally.
4.4 Increased methylation in linker regions within different genomic contexts.
4.5 DNA methylation occurs primarily at linker regions in nucleosomal arrays flanking CTCF binding sites.
4.6 NOMe-seq detects chromatin configuration around transcription factor binding sites.
4.7 NOMe-seq reveals distinct chromatin configurations at CTCF sites and associated with specific histone modifications and promoter types.
4.8 Low-input NOMe-seq detects distinct chromatin configurations of different promoter types.
4.9 Combinatorial epigenomic signatures reveal functional chromatin.
4.10 Validation of DCA promoters.
4.11 NOMe-seq reveals the genome-wide epigenetic switching from neural stem cell to glioblastoma.
5.1 Significant loss of DNA methylation in DKO1 cells does not dramatically increase global accessibility
5.2 NOMe-seq detects NDRs and changes in NDRs following global hypomethylation
5.3 The loss of DNA methylation triggers nucleosome reorganization in CGI promoters
5.4 CGI promoters that lose DNA methylation acquire active and poised histone marks
5.5 Hypomethylated CGI promoters gain active and/or repressive histone marks
5.6 DNA methylation loss at CGI promoters re-establishes poised histone modification status and gene expression in normal cells
5.7 The global loss of DNA methylation does not result in dramatic nucleosome organization and chromatin remodeling in non-CGI promoters
5.8 Long range accessibility changes reveal association with PMDs and heterochromatic H3K9me3 domains
6.1 Oncogene induced senescence as a tumor barrier
6.2 DNA methylation variation distinguishes OIS and normal fibroblast cells
6.3 Significantly changed probes are few and enriched in non-coding regions
Abstract
Aberrant epigenetic changes, which include alterations in DNA methylation, nucleosome
positioning and histone modifications, are one of the hallmarks of cancer. Recent
advances in whole-exome sequencing have revealed common defects in many of the
chromatin modifier genes responsible for establishing these marks. Changes in tumor
DNA methylation patterns and, to a lesser extent, histone modifications have been
extensively characterized, but changes in nucleosome positioning and how they relate
to other epigenetic marks remain poorly understood.
The Nucleosome Occupancy and Methylation whole-genome sequencing (NOMe-seq)
assay was developed to understand the relationship between changes in DNA methylation
and nucleosome positioning on the same DNA strand in a single experiment. NOMe-seq
was first established in the fibroblast cell line IMR90. It was then adapted to
fresh-frozen tissue by incorporating a low-DNA-input bisulfite treatment kit. An
anti-correlation between nucleosome occupancy and DNA methylation was observed
genome-wide at distal regions, especially near CTCF motifs. By using different
salt-wash concentrations, NOMe-seq could also reveal genome-wide in vivo transcription
factor binding affinities in human cells. At promoter regions, NOMe-seq revealed three
distinct chromatin configurations defined by the combination of DNA methylation and
accessibility status. Combinatorial epigenomic signatures on the same DNA strand revealed
Divergent Chromatin Alleles (DCAs), in which different alleles carry discrete epigenetic
patterns. Furthermore, applying NOMe-seq to primary cultured glioblastoma cells (GBMs)
and neural stem cells (NSCs) revealed that the epigenetic switching from stem cells to
glioblastoma cells is mainly enriched in pre-marked poised enhancer regions. This
demonstrated the power of NOMe-seq to detect dynamic changes in DNA methylation and
accessibility among different cells.
A complete computational pipeline, NOMeToolkit, was developed to process NOMe-seq
data. NOMe-seq is a GpC methyltransferase footprinting technology followed by
Whole Genome Bisulfite Sequencing (WGBS). Most parts of NOMeToolkit, therefore,
can also be adapted for WGBS analysis. NOMeQC in NOMeToolkit was used to
assess the quality of NOMe-seq libraries and to preprocess low-quality reads before
and after read mapping. InvertDupsHunter, one of the major components of NOMeQC,
was developed to identify inverted duplicated reads and trim them in order to increase
read mapping efficiency and accuracy. Single-nucleotide polymorphisms (SNPs) can
result in inaccurate or missing methylation calls in Bisulfite-seq/NOMe-seq. Bis-SNP,
the core part of NOMeToolkit, was developed on the Genome Analysis Toolkit
(GATK) framework and uses Bayesian inference to determine genotypes and
methylation/accessibility levels simultaneously. After accurate DNA methylation/accessibility
calling at HCG/GCH sites, a "peak calling" method is required for further downstream
analysis. A segmentation method based on a two-state beta-binomial Hidden Markov
Model (HMM) was developed to identify individual nucleosomes, linkers and
Nucleosome Depleted Regions (NDRs). NOMeClonePlot and AlignWig2Loc were developed
for the locus-specific and genome-wide visualization, respectively, of NOMe-seq data.
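To make the HCG/GCH distinction concrete, the following minimal Python sketch classifies forward-strand cytosines by their sequence context. It is an illustration only and is not taken from NOMeToolkit: the function name classify_cytosines and the label "other" are hypothetical. It assumes the IUPAC meaning of H (A, C or T), that GCH sites report GpC methyltransferase accessibility, that HCG sites report endogenous CpG methylation, and that GCG sites are ambiguous between the two readouts and are therefore set aside.

```python
# Illustrative sketch only; names are hypothetical, not NOMeToolkit's API.
H = set("ACT")  # IUPAC code H: any base except G


def classify_cytosines(seq):
    """Label each forward-strand C that has both neighbours.

    Returns a list of (0-based position, context) tuples, where context is
    "GCH" (accessibility readout), "HCG" (endogenous CpG methylation),
    "GCG" (ambiguous, usually excluded) or "other" (anything else,
    including positions next to an N).
    """
    seq = seq.upper()
    calls = []
    for i in range(1, len(seq) - 1):
        if seq[i] != "C":
            continue
        prev_base, next_base = seq[i - 1], seq[i + 1]
        if prev_base == "G" and next_base in H:
            context = "GCH"
        elif prev_base in H and next_base == "G":
            context = "HCG"
        elif prev_base == "G" and next_base == "G":
            context = "GCG"
        else:
            context = "other"
        calls.append((i, context))
    return calls


if __name__ == "__main__":
    for position, context in classify_cytosines("AGCTACGGCGT"):
        print(position, context)
```

Only the forward strand is scanned here; a fuller treatment would apply the same rules to the reverse complement so that cytosines on both strands are covered.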
Using the experimental technology and computational tools described above, the
functional interactions between DNA methylation and other epigenetic modifiers were
evaluated in the human colon cancer cell line HCT116 with both DNMT3B and DNMT1 knocked
out. DNA methylation was globally depleted in this double knockout cell line (DKO1).
Upon global loss of DNA methylation, however, only a few chromatin accessibility changes
were observed. Strikingly, DNA methylation loss caused well-phased nucleosomes at
polycomb target CGI promoters and the formation of NDRs at non-polycomb target CGI
promoters in DKO1 cells compared with the parental cells. Interestingly, changes in DKO1
cells occurred in a direction such that the cells gained an epigenetic landscape and gene
expression similar to normal colonic mucosa. In distal regulatory elements, similar
chromatin remodeling and structural changes were observed at several different transcription
factor binding sites, such as AP1 and ELF1. Large-scale (> 20 kb) decreases in accessibility
relative to the adjacent regions largely overlapped with H3K9me3-marked
heterochromatin regions and Partially Methylated Domains (PMDs). Interestingly, large
H3K4me3 blocks were observed in these regions upon global loss of DNA methylation,
but not in other cell types.
We also tried to connect Oncogene Induced Senescence (OIS) with molecular subtypes
in cancers by using NOMe-seq and other epigenomic profiling technologies.
Preliminary studies on fibroblast cell lines using Infinium 450K array technology,
however, revealed that only a few, but consistent, DNA methylation changes occurred
during OIS, and that these were enriched outside gene coding regions.
Taken together, these studies in different normal and cancer cell lines and samples
may provide novel insights into the crosstalk between genetic variations, DNA methyla-
tion, nucleosome positioning and histone modification during cancer progression.
Chapter 1
Introduction
1.1 Introduction
What is cancer? Cancer is a large family of diseases in which abnormal cells undergo
unregulated growth and are able to invade other tissues (Bertram, 2000).
Progress in cancer research has been tightly correlated with technological advances
and with the way people interpret results. The earliest explanation of cancer was based
on Hippocrates' humoral theory (460-370 BCE), which proposed that the disease was due to
an excess of black bile (Weinstein and Case, 2008). The theory was not challenged
until autopsy became widely used in human medical research in the eighteenth
century, when tumors were thought to grow from lymph thrown out by the blood. Modern
cancer research really began after the microscope was used to study diseased tissue.
Several investigators, notably Rudolf Virchow, the founder of cellular pathology, revealed
that cancer is a disease of cells (Weinstein and Case, 2008). Cell culture, mouse genetics
and Drosophila genetics in the early 20th century moved the cancer field
into cytology and genetics, which showed that cancer is a defect of chromosomes.
The structure of DNA and the central dogma were revealed through X-ray diffraction
and biochemistry in the middle of the last century. Rapid progress in biochemistry and
molecular biology enabled the exploration of the disorders that arise during tumor
progression at the molecular level. Cancer was identified as a genetic disorder. The
remarkable advances in sequencing technology in the past decade, especially after the
human genome was fully sequenced and next generation sequencing became widely used,
validated the concept that cancer is a disease of both epigenetic and
genetic abnormalities (Baylin and Jones, 2011).
Moreover, very recently, whole-exome sequencing of tumor samples has identified
many recurrent mutations in genes that control the epigenome, which in turn cause
large epigenetic aberrations. Epigenetic alterations, such as changes in DNA methylation,
can also cause high-frequency point mutations and disrupt DNA repair function in many
different cancer types (You and Jones, 2012). The crosstalk between genetics, epigenetics
and different aspects of epigenetics has been known for a while, but the cause-and-effect
relationship among the different components remains a mystery. In the artificial
intelligence field, drawing a conclusion about a causal connection based on observed
conditions is called causal inference. The major difference between
causal inference and association inference is that the former analyzes the response of
the effect variable when the cause is changed. In order to describe the association and
causal relationships among genetics, epigenetics and the different components of
epigenetics, this thesis is organized around four aspects:
1. Development of computational tools for NOMe-seq and Whole Genome Bisulfite
Sequencing (WGBS) to explore genotype information, DNA methylation and nucleosome
positioning signals simultaneously in a single experiment.
2. Development of a new sequencing technology, named NOMe-seq, to simultaneously
measure DNA methylation and nucleosome positioning at the single-molecule level.
3. Application of the experimental and computational tools developed to a
methylation-deficient colon cancer cell line model, which hints at DNA methylation being a
major causal factor in cancer progression.
4. An attempt to apply NOMe-seq to a fibroblast cell line model to connect Oncogene
Induced Senescence (OIS) with the CpG Island Methylator Phenotype (CIMP) and thereby
identify new molecular subtypes in cancer. The preliminary study done with the Illumina
Infinium 450K array, however, indicated only a few, but consistent, DNA methylation
changes during OIS.
This chapter will introduce recent advances in the study of the human epigenome and the
crosstalk between different aspects of the epigenome. In addition, this chapter will
summarize the aberrant epigenetic changes associated with human cancer progression. It
provides the general background and ideas for the materials presented in Chapters 2-6.
1.2 Epigenetics
Epi- (Greek: επί- over, outside of, around) -genetics was originally used by C. H.
Waddington in 1942 to describe how interactions between genetics and environment
can affect phenotypes during development (Bell and Spector, 2011). The most recent
definition of this term was described in NIH Roadmap Epigenomics Project 2013
(Bernstein et al., 2010): "Epigenetics is an emerging frontier of science that involves the study
of changes in the regulation of gene activity and expression that are not dependent on
gene sequence. For purposes of this program, epigenetics refers to both heritable
changes in gene activity and expression (in the progeny of cells or of individuals) and
also stable, long-term alterations in the transcriptional potential of a cell that are not
necessarily heritable." Current understanding of the epigenetic modification system mainly
focuses on the following categories: DNA methylation, histone modifications, histone
variants, nucleosome organizations and noncoding RNAs (Shen and Laird, 2013). This
chapter will mainly focus on the background knowledge of DNA methylation, histone
modification and nucleosome organizations.
1.3 DNA methylation
DNA methylation on cytosine is a DNA modification in which a methyl group (-CH3)
is added to position 5 of the cytosine base. As early as 1975, methylation
of cytosine residues in the CpG dinucleotide context was found to serve as an epigenetic
mark in vertebrates (Holliday and Pugh, 1975; Riggs, 1975). Extensive studies later
showed that DNA methylation on cytosine mainly happens in the CpG dinucleotide context
in mammals. Non-CpG methylation, however, was recently found to be prevalent in embryonic
stem cells and neural cells (Lister et al., 2009, 2011, 2013). CpGs are not evenly
distributed in the genome due to the spontaneous deamination of methylated cytosine
over time (Lander et al., 2001). About 60% to 90% of CpGs are highly methylated in
normal cells, while unmethylated CpGs are grouped together into CpG islands (CGIs),
which co-localize with 60% of all promoters in humans and mice (Antequera, 2003).
The first definition of a CGI was proposed in 1987 by Gardiner-Garden and Frommer
(Gardiner-Garden and Frommer, 1987) as a 200-bp stretch of DNA with a C+G
content of 50% and an observed/expected CpG ratio in excess of 0.6, which has been
used in the UCSC genome browser. Later, in 2002, the CGI definition was revised to
exclude GC-rich genomic sequences such as Alu repeats, based on the completed sequences
of human chromosomes 21 and 22 (Takai and Jones, 2002): DNA sequences greater
than 500 bp, GC content greater than 55%, and an observed-to-expected CpG ratio of
0.65. Very recently, the CGI model was redefined based on a Hidden Markov Model (HMM)
and a first CGI list for invertebrates was created (Wu et al., 2010a). This list contains
many more CGI regions, including short elements not captured by the former two definitions.
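The threshold-based CGI definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind the UCSC track or the Takai and Jones algorithm: it tests one fixed window rather than scanning a genome, the function names are hypothetical, the observed/expected ratio is computed with the commonly used formula (CpG count x length) / (C count x G count), and treating every threshold as inclusive is an assumption.

```python
# Illustrative sketch only; function names and inclusive thresholds are assumptions.


def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for one sequence."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so this is exact
    gc_content = (c_count + g_count) / n if n else 0.0
    expected = c_count * g_count / n if n else 0.0
    obs_exp = cpg_count / expected if expected else 0.0
    return gc_content, obs_exp


def looks_like_cgi(seq, min_len=200, min_gc=0.50, min_obs_exp=0.60):
    """Check a window against length, GC content and obs/exp CpG thresholds.

    Defaults follow the 1987 criteria quoted in the text; pass
    min_len=500, min_gc=0.55, min_obs_exp=0.65 for the 2002 revision.
    """
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content >= min_gc and obs_exp >= min_obs_exp


if __name__ == "__main__":
    window = "CG" * 100  # toy 200-bp window made only of CpG dinucleotides
    print(cpg_stats(window))
    print(looks_like_cgi(window))
    print(looks_like_cgi(window, min_len=500, min_gc=0.55, min_obs_exp=0.65))
```

Keeping the thresholds as parameters makes it easy to compare how the 1987 and 2002 criteria classify the same window; the HMM-based definition is a different kind of model and is not represented here.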
CpG density at promoters in mammalian genomes has a bimodal distribution (Illingworth
and Bird, 2009). Previously, most of the work on DNA methylation focused on CGIs
at promoter regions. In most cases, CpG sites at CGI promoters are not methylated
if the genes are expressed. In normal somatic cells, methylation of CGI promoters
can happen at genes with long-term stabilization of repressed
states, such as imprinted genes and genes that are only expressed in germ cells. For
non-CGI promoters, there is some correlation between tissue-specific gene expression
and tissue-specific DNA methylation, but the cause-and-effect relationship is
not yet clear (Jones, 2012). Early research indicated that DNA methylation occurs
after a gene has been silenced, suggesting that DNA methylation serves as a lock (Lock
et al., 1987). More recent studies, however, indicate that methylation plays
an essential initiating role in cell fate determination (Challen et al., 2012). A very
recent genome-wide study of embryonic stem cell differentiation found that promoters
active in the early stage are CG-rich and that some of them are repressed by H3K27me3 in
the late lineage, whereas promoters that tend to be expressed in the later stage are often
CG-poor and are repressed by DNA methylation in the developed lineage (Xie et al., 2013).
Recent advances in Whole Genome Bisulfite Sequencing (WGBS) technology have
greatly expanded our knowledge of DNA methylation in distal regulatory elements (Lister
et al., 2009; Stadler et al., 2011). A WGBS study across 30 different human cell
and tissue types showed that, in the context of normal development, 21.8% of autosomal
CpGs have variable methylation status, most of which are enriched in the enhancer
category (Ziller et al., 2013). Enhancers were found to be neither fully methylated nor
fully unmethylated, but to coincide with Low-Methylated Regions (LMRs) (Stadler et al.,
2011). WGBS cannot resolve 5-methylcytosine (5mC) and its oxidized derivatives
5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC)
as distinct states. All three derivatives are intermediates in the cytosine demethylation
pathway mediated by the ten-eleven translocation (Tet) family proteins,
as shown in Figure 1.1 (Shen and Zhang, 2013). Further studies at
base-pair resolution showed that 5hmC, 5fC and 5caC are highly enriched in enhancer
regions in human and mouse (Yu et al., 2012; Shen et al., 2013), especially in poised
enhancers (Wen et al., 2014), indicating that the LMRs detected by WGBS may be caused by
enrichment of 5hmC and the other oxidized forms. Stadler et al. (2011) showed that
binding of DNA-binding factors is necessary and sufficient to create LMRs in CpG-poor
regions. For insulators, a well-studied earlier example, CTCF binding
at the imprinted IGF2-H19 locus, showed that DNA methylation blocks CTCF
binding (Bell and Felsenfeld, 2000), whereas Stadler et al. suggested that global
CTCF binding does not change in mouse ES cells in the absence of methylation
(ESCs with a DNMT triple knockout), but increases about two-fold at the imprinted
IGF2-H19 CTCF locus (Stadler et al., 2011). Thus, the causal effects of DNA methylation on
distal regulatory elements are still uncertain.
Figure 1.1: DNA demethylation pathway.
DNA demethylation pathways that involve 5mC, 5hmC, 5fC and 5caC. Alternatively, 5hmC may be
deaminated by AID/APOBECs to become 5hmU. 5fC, 5caC, and 5hmU can be excised from DNA by
glycosylases. Also, DNMT3A and DNMT3B may directly dehydroxymethylate 5hmC to generate unmethylated C.
Solid lines represent processes with strong evidence; dashed lines indicate controversial processes
that need further confirmation. Figure adapted from Shen and Zhang (2013).
In gene bodies, DNA methylation is not associated with repression but is positively
correlated with active transcription (Lister et al., 2009). Recent work linked gene body
DNA methylation with alternative splicing: CTCF binding was found to promote inclusion of
weak upstream exons by mediating Pol II pausing, both at a specific
locus (CD45) and genome-wide, whereas CTCF binding is inhibited by DNA methylation
(Shukla et al., 2011). WGBS studies of different tissues have also observed DNA
methylation changes at a large scale (>20kb). Two groups recently discovered DNA
Methylation Valleys (DMVs) and DNA methylation canyons in most cell lineages (Xie
et al., 2013; Jeong et al., 2014), which may be associated with H3K27me3-mediated gene
repression, as previously reported (Gal-Yam et al., 2008).
It has been known for decades that de novo methylation of DNA is catalyzed by
DNMT3a and DNMT3b in early developmental stages, while the maintenance of DNA
methylation is mediated by DNMT1. More recent work showed that DNMT3a and DNMT3b
are also required for DNA methylation maintenance (Jones and Liang, 2009). Complete
loss of all three DNMTs leads to cell death in somatic and cancer cells, but not in
embryonic stem cells (Jackson-Grusby et al., 2001; Chen et al., 2007; Tsumura et al.,
2006). Stable single and double DNMT knockout cancer cell lines have been established
(Rhee et al., 2002) and are thus extremely useful for understanding the causal relationships
between DNA methylation and other cellular processes.
1.4 Histone variants and histone modifications
Histones are a family of basic proteins that package DNA into nucleosomes. There are
five major histone families: H1/H5, H2A, H2B, H3 and H4. Two copies each of H2A,
H2B, H3 and H4 form the octameric nucleosome core, around which about 145bp of DNA
is wrapped and locked into position by the linker histone H1/H5 (Fisher, 2001).
Histones, especially their N-terminal tails, can be post-translationally modified by
methylation, acetylation, phosphorylation and ubiquitination (Tan et al., 2011). Different
types of histone modifications can affect histone properties and their interactions
with DNA and other proteins, and are highly associated with different levels of gene
expression or different chromatin states, which have been extensively studied in several
epigenomic projects (ENCODE Project Consortium, 2004; Bernstein et al., 2010).
The ability of histone modifications to cross-talk with each other in a great number of
combinations is referred to as the histone code (Jenuwein and Allis, 2001). One
well-known example is bivalent chromatin, in which the promoter regions of
developmentally regulated genes are marked by both the active histone mark H3K4me3 and the
repressive mark H3K27me3 (Bernstein et al., 2006). This poised chromatin state was also
found near enhancers (Rada-Iglesias et al., 2011).
In addition to histone modifications, the structure and function of chromatin can also
be affected by the incorporation of histone variants, which differ from their canonical
histones in primary sequence as well as in expression, deposition and regulatory
mechanisms (Kamakaka and Biggins, 2005). A large number of variants exist in the
H2A family, including H2A.Z, MacroH2A, H2A-Bbd, H2AvD and H2A.X (Kamakaka and
Biggins, 2005). H1 variants include H1⁰, H5, and the sperm- and testis-specific variants.
In the H3 family, there are two major variants: H3.3 and centromeric H3 (CenH3)
(Malik and Henikoff, 2003). H2B and H4, however, appear to lack variants in current
studies.
Recently, a growing number of studies have shown that histone variants and modifications, as
another layer of gene expression regulation, are tightly correlated with DNA methylation.
The first study, in Arabidopsis thaliana, demonstrated an anti-correlation between
H2A.Z and DNA methylation (Zilberman et al., 2008). The same phenomenon
was later observed in other organisms, including mammalian cell lines (Zemach et al., 2010;
Conerly et al., 2010). Studies of an H2A.Z loss-of-function mutant in Arabidopsis showed
that DNA methylation, rather than H2A.Z, plays the driving role in this anti-correlation
(Coleman-Derr and Zilberman, 2012).
Histone modifications such as H3K4me3 were also found to be anti-correlated with
DNA methylation in recent studies (Balasubramanian et al., 2012). The relationship
between H3K27me3 and DNA methylation remains controversial. An early study
reported that the polycomb group protein EZH2, which catalyzes H3K27me3, also positively
regulates DNA methylation (Viré et al., 2006). More recent work,
however, indicated that H3K27me3 and DNA methylation are anti-correlated (Lindroth
et al., 2008; Bartke et al., 2010; Wu et al., 2010b). Studies of DNA methyltransferase-deficient
mouse somatic cells showed that DNA hypomethylation causes genome-wide redistribution
of H3K27me3 (Reddington et al., 2013; Hagarman et al., 2013), while loss of
H3K27me3 may have only a modest effect on DNA methylation (Hagarman et al., 2013).
1.5 Nucleosome organizations and chromatin accessibility
The nucleosome is the basic unit of eukaryotic chromatin. Approximately 147bp of DNA is
wrapped 1.65 times around the histone octamer in each nucleosome (Jiang and Pugh, 2009).
Nucleosomes package DNA and stabilize the negative supercoiling of genomic DNA
in vivo (Luijsterburg et al., 2008). Depending on their position, nucleosomes can also
directly regulate the access of trans-acting factors to regulatory DNA elements (Iyer, 2012).
Nucleosome positioning describes where nucleosomes are located with respect to the
genomic DNA sequence. It can range from perfect positioning, in which a nucleosome
occupies a given position in every cell of a population, to no positioning, in which
nucleosomes occupy all possible genomic positions with equal frequency across the population
(Struhl and Segal, 2013). Nucleosome positioning is determined by multiple factors, including
DNA sequence, DNA-binding proteins, nucleosome remodelers, DNA methylation,
histone modifications and histone variants (Struhl and Segal, 2013; Andreu-Vieyra and
Liang, 2013).
Nucleosome positioning is affected by DNA sequence, although the role of DNA sequence
in nucleosome positioning in vivo and in vitro is debated (Struhl and Segal,
2013). Some evidence showed that intrinsic DNA sequence plays the central role
in determining the organization of nucleosomes in different eukaryotic genomes (Segal
et al., 2006; Kaplan et al., 2009; Tillo et al., 2010). A whole-genome
MNase-seq study across different cell types, however, revealed that the majority of the genome
shows substantial flexibility of nucleosome positioning among cell types in
vivo (Valouev et al., 2011). Because MNase-seq requires a large number of short reads
to be sequenced, a subsequent high-depth MNase-seq study was carried out, indicating that
about half of the genome contains regularly spaced arrays of nucleosomes, which are enriched in
active chromatin domains (Gaffney et al., 2012). In some regions nucleosome positioning can be
established by DNA sequence alone, whereas in other regions nucleosomes
are positioned mainly by chromatin remodelers or DNA-binding proteins. Two major sequence
determinants affect nucleosome binding: 1) bendable dinucleotides (AT, TA)
recur on one face of the helix about every 10 bp; 2) homopolymeric sequences, poly
(dA:dT) and poly (dG:dC), strongly inhibit nucleosome array formation (Struhl and Segal,
2013).
ATP-dependent chromatin remodelers give chromatin structural flexibility
by promoting the assembly or disruption of nucleosomes and the exchange of histone variants
(Mueller-Planitz et al., 2013). They can be grouped into families based on subunit
composition and activity: the SWI/SNF family (SWI/SNF, INO80 and SWR1 complexes),
the ISWI family (RSF, ACF/CHRAC, WICH and NURF complexes) and the CHD family (NURD
complexes) (Andreu-Vieyra and Liang, 2013).
Chromosomes also contain nucleosome depleted regions (NDRs) (Mueller-Planitz
et al., 2013). These regions, which increase chromatin accessibility, are often
located adjacent to nucleosomes containing the histone variant H2A.Z and are marked
with H3K4me3 at promoters or H3K27ac/H3K4me1 at enhancers (Jones, 2012).
Nucleosome arrays are often positioned around NDRs (Mueller-Planitz et al., 2013).
Over the last decade, a growing number of studies have investigated the relationship
between DNA methylation and nucleosomes. Nucleosomes may help shape
DNA methylation. DNMTs preferentially target nucleosome-bound DNA, even
at methylated non-CpG island promoters (Han et al., 2011). Several studies showed
that DNMT activity results in preferential DNA methylation outside of nucleosome
cores (Felle et al., 2011; Gowher et al., 2005; Jiang et al., 2011; Kelly et al., 2012;
Huff and Zilberman, 2014), while another study showed that genomic DNA methylation is
modestly enriched inside nucleosome core DNA (Chodavarapu et al., 2010). Recent
WGBS studies identified a 10bp periodicity in DNA methylation that tightly correlates
with nucleosome substructure (Lister et al., 2013). Conversely, DNA methylation
also plays an important role in nucleosome positioning (Collings et al., 2013;
Jimenez-Useche et al., 2013; Pérez et al., 2012). As the debate continues, whether DNA
methylation is correlated or anti-correlated with nucleosome occupancy needs
further study on both the computational and experimental sides (Portella et al.,
2013; Collings et al., 2013; Chodavarapu et al., 2010; Kelly et al., 2012; Portela et al.,
2013; Huff and Zilberman, 2014).
1.6 Epigenetic aberration in cancer
Both genetic and epigenetic changes play important roles in cancer initiation and
progression (Andreu-Vieyra and Liang, 2013). Most previous studies of epigenetic
aberrations in cancer focused on DNA methylation. Other epigenetic factors,
including long noncoding RNAs, miRNAs and histones, were recently also found to undergo
broad changes during cancer progression (Shen and Laird, 2013). Many
known tumor suppressor genes are silenced by promoter CpG island
hypermethylation (Jones and Baylin, 2002). A high frequency of CpG island hypermethylation
was first observed in mismatch repair-deficient colon tumors and is referred to as the CpG
island methylator phenotype (CIMP) (Toyota et al., 1999). Recent genome-wide
studies showed distinct epigenetic subtypes in colorectal cancer (Hinoue et al., 2012)
and glioblastoma (Noushmehr et al., 2010). Interestingly, colorectal CIMP shows a
very strong association with the BRAF V600E mutation (Weisenberger et al., 2006). Other
studies observed a tight correlation between the BRAF V600E mutation and oncogene-induced
senescence (OIS), which acts as a tumor barrier (Michaloglou et al., 2005). This may indicate
a relationship between OIS and CIMP that has not yet been resolved.
A global loss of DNA methylation in some cancer types was reported
decades ago (Diala and Hoffman, 1982). The apparent discrepancy between hypermethylation
and hypomethylation in the same cancer type was resolved only recently, after
WGBS was applied to tumor samples (Berman et al., 2012; Hansen et al., 2011;
Hon et al., 2012). Hypermethylation mainly occurs at focal regions, while hypomethylation
is often associated with increased repressive chromatin in large organized chromatin
lysine modification regions (LOCKs) (Hansen et al., 2011; Wen et al., 2009).
These long-range hypomethylation changes may arise because late-replicating
lamina-associated domains (LADs) undergo progressive loss of DNA methylation. This loss was also
observed in some normal cell lines as Partially Methylated Domains (PMDs) (Lister et al.,
2009), but the demethylation is more striking in cancer cells (Berman et al., 2012;
Hon et al., 2012).
Recently, many mutations have been discovered in chromatin remodeler subunits. Since
genomic instability is mostly absent in tumors with defective SWI/SNF complexes, it
is possible that aberrant nucleosome positioning contributes to the development
of these aggressive cancers (Wilson and Roberts, 2011). Histone modifications
are also involved in epigenetic switching during tumor progression (Andreu-Vieyra
and Liang, 2013). For instance, H3K27me3 is replaced by de novo DNA methylation,
likely through the recruitment of DNMTs (Schlesinger et al., 2007; Widschwendter et al.,
2007; Ohm et al., 2007).
1.7 Epigenome consortium projects as the reference maps
With the rapid progress of microarray and next-generation sequencing technologies, epigenetics,
which refers to the study of a single gene or a set of genes, has quickly been extended to
epigenomics, which refers to the global study of epigenetic changes at the genome-wide
level. Numerous epigenomics projects were funded in the last decade. In 1999, even
before the completion of the Human Genome Project, a Human Epigenome Project had already
been proposed (Beck et al., 1999; Rakyan et al., 2004). The Encyclopedia of DNA Elements
(ENCODE) project, launched in 2003, aims to identify all functional elements in
the human genome (ENCODE Project Consortium, 2004; ENCODE Project Consortium
et al., 2007; ENCODE Project Consortium, 2011), while the modENCODE project has
a similar goal for model organisms (modENCODE Consortium et al., 2010;
Gerstein et al., 2010). To capture the breadth of epigenomic differences among different
tissue and cell types, the NIH Roadmap Epigenomics Program was initiated in
2008. Unlike phases I and II of the ENCODE project, this project mainly focuses on
primary human tissues (Bernstein et al., 2010). For genomic and epigenomic
profiling in disease, The Cancer Genome Atlas (TCGA) project was initiated in 2005
by the US National Cancer Institute and the National Human Genome Research Institute.
Genomic and epigenomic profiles of at least 200 forms of cancer and many
more subtypes are under investigation (Cancer Genome Atlas Research Network et al.,
2013). TCGA is also part of the International Cancer Genome Consortium (ICGC)
(International Cancer Genome Consortium et al., 2010). Around 2011, the AACR Cancer
Epigenome Task Force (CETF) investigated the feasibility of an integrated International
Cancer Epigenome Project, following the success of the cancer genome projects (Beck
et al., 2012).
In Europe, similar epigenomic projects have been under way in recent years. The
BLUEPRINT project, started in 2011, focuses on distinct types of haematopoietic cells
from healthy individuals and on their malignant leukaemic counterparts. It aims to generate
at least 100 reference epigenomes and to study them to advance and exploit knowledge
of the underlying biological processes and mechanisms in health and disease.
The German epigenome programme DEEP, started in 2012, will produce 70 reference
epigenomes of selected human cells and tissues in normal and diseased states. It
mainly focuses on complex diseases, such as inflammatory diseases of the joints and bowel,
and on metabolic diseases such as hepatosteatosis and adiposity. The BLUEPRINT, DEEP and
NIH Roadmap Epigenomics projects are all part of the International Human Epigenome
Consortium (IHEC).
EpiTwin, a collaboration between Europe and the Beijing Genomics Institute (BGI)
in China, tries to capture the subtle epigenetic profiles that mark the differences between
5,000 twins. It looks for differences that explain why identical twins do not
develop the same diseases (Bell and Spector, 2011, 2012). China recently initiated the
Cancer Genome/Epigenome Project, focusing on certain types of cancer that have high
incidence and mortality rates in the Chinese population (He et al., 2008).
The large-scale epigenetic mapping projects mentioned above can provide global,
integrative reference maps of different cellular states, which will serve as
a launch point for downstream investigations with different experimental designs.
Future epigenome-wide association studies of individual epigenetic variation and its
interaction with human disease can be built on top of these epigenome maps. These
large-scale projects also pose huge computational challenges in storing, processing and
mining the useful information.
Chapter 2
Bis-SNP: Combined DNA
methylation and SNP calling for
Bisulfite-seq data
This chapter is modified from a manuscript previously published in the peer-reviewed
journal Genome Biology, on which I am the first author.
2.1 Introduction
Cytosine methylation of DNA plays an important role in mammalian gene regulation,
chromatin structure and imprinting during normal development and the development of
pathological conditions such as cancer. With the dramatic increase in throughput made
possible by next-generation DNA sequencing technologies, sodium bisulfite conversion
followed by massively parallel sequencing (Bisulfite-seq) has become an increasingly
popular method for investigating epigenetic profiles in the human genome (reviewed
in (Laird, 2010)). Several sequencing strategies have been applied that vary in terms
of cost and the regions of the genome covered. Reduced Representation Bisulfite-Seq
(RRBS; Meissner et al., 2008) uses restriction fragment size selection to select a
portion of the genome enriched for CpG Islands and gene regulatory sequences. Bisulfite
Padlock Probes (BSPP; Diep et al., 2012) or solution-based hybridization capture
(Agilent, Inc.) can be designed for customizable selection of hundreds of thousands
of regions throughout the genome. Whole-Genome Bisulfite-Seq (WGBS; Lister et al.,
2009) is the most comprehensive, covering more than 90% of cytosines in the human
genome. Bisulfite-seq is well-suited to the investigation of epigenetic changes from clinical
tissue samples (Hansen et al., 2011; Berman et al., 2012), and can be applied to
very small quantities of DNA (Adey and Shendure, 2012) including formalin-fixed samples
(Gu et al., 2010). WGBS and RRBS data have been used to profile a number of cell
lines and human tissues by large sequencing consortia including the ENCODE project
(ENCODE Project Consortium, 2004), the NIH Epigenomics Roadmap, and The Cancer
Genome Atlas (TCGA), and these datasets are publicly available for download.
Bisulfite treatment of DNA converts unmethylated cytosines to uracils, which are
replaced by thymines during amplification. This dramatic change to sequence composi-
tion necessitates specialized software for almost all sequence analysis tasks. Typically,
the first step in processing high-throughput sequencing data is to map and align each
read to the correct location in the reference genome (genome mapping), and a number
of powerful tools have been developed to map bisulfite-converted reads (reviewed in
(Krueger et al., 2012)). The next step is to identify differences between the reference
genome and the sample genome, including single-nucleotide polymorphisms (SNPs)
and insertion/deletion events (indels). The identification of SNPs has been an active
area of research and a number of powerful statistical tools have been developed for
SNP calling of non-bisulfite sequencing data (Li et al., 2009a; DePristo et al., 2011; Li
et al., 2009b). SNP calling of bisulfite sequencing data has significant complications.
First, reads from the two genomic strands are not complementary, and this assumption
of complementarity is made by all SNP calling algorithms. Second, true (evolution-
ary) C>T SNPs in the sample cannot be distinguished from C>T substitutions that are
caused by bisulfite conversion, and can thus be misidentified as unmethylated Cs. Con-
sequently, identification and proper handling of such SNPs is important for accurate
quantification of methylation levels, especially so given the fact that C>T is the most
common substitution in the human population (65% of all SNPs in dbSNP) and these
usually occur in the CpG context (Zhao and Boerwinkle, 2002).
Accurate SNP calling at the positions immediately surrounding a cytosine is equally
important. Those nucleotides lying one or two positions 3’ of the cytosine are
particularly critical, as they determine the specificity of methyltransferases. These
methyltransferase-specific context positions can be organism or cell type specific. In
mammals, CpG dinucleotides are often highly methylated in most cell types, while CpA
dinucleotides have much lower methylation levels and are cell type restricted (Lister
et al., 2009; Ramsahoye et al., 2000). In plants, by contrast, CHG trinucleotides are
often methylated (Lister et al., 2008; Cokus et al., 2008). Other sequences within a
slightly wider genomic neighborhood can also have strong cis effects on methylation,
perhaps due to the presence of key regulatory motifs (Lienert et al., 2011). Heterozygous
SNPs in proximity to cytosines can reveal widespread allele-specific methylation
patterns (Tycko, 2010) and important regulatory changes such as loss of imprinting
(Shoemaker et al., 2010; Gertz et al., 2011; Xie et al., 2012).
Despite the great interest in Bisulfite-seq and the availability of a number of tools for
genomic mapping, no adequate software exists for SNP calling (Krueger et al., 2012). In
order to overcome the difficulty in identifying SNPs in bisulfite-treated sequences, some
groups have relied on matched non-bisulfite sequencing data in the same sample (Li
et al., 2010; Stadler et al., 2011; Hon et al., 2012). Others have used non-bisulfite SNP
microarrays (Harris et al., 2010; Schalkwyk et al., 2010), or used study designs relying
on isogenic mouse strains with known parental genotypes (Stadler et al., 2011; Xie et al.,
2012).
A key property of some bisulfite-related protocols is that G nucleotides on the strand
opposing a C are not affected by conversion. This strand-specificity principle has been
exploited in order to distinguish bisulfite conversion from C>T SNPs (Weisenberger
et al., 2005). The Illumina-based protocol currently being used in most Bisulfite-seq
studies has this important property, and thus it has been classified as a directional
bisulfite-seq protocol (Krueger et al., 2012). Non-directional protocols (those that also
result in G>A substitutions) have been used (Cokus et al., 2008), but have not been
widely adopted. Figure 2.1 illustrates the directional protocol, where approximately half
the reads at a given cytosine position (those mapping to the ”C-strand”) can be used
for methylation quantification but cannot distinguish C/T SNPs. The other half (those
mapping to the ”G-strand”, boxed in Figure 2.1a) yield no methylation information but
can be used to identify C/T SNPs. Importantly, heterozygous C/T SNPs can be used in
the analysis of allele specific methylation (Figure 2.1b).
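The strand logic of the directional protocol can be sketched in a few lines of Python. This is an illustrative toy only, not the Bis-SNP method: the function names and the `(strand, base)` input encoding are hypothetical, and the rule of discarding a site whenever any G-strand read shows an A is a deliberately naive assumption of the ad hoc kind discussed in the text.

```python
from collections import Counter


def summarize_cytosine(reads):
    """Tally evidence at one reference cytosine.

    reads: iterable of (strand, base) tuples, where strand is "C" for reads
    from the cytosine strand and "G" for reads from the opposing strand
    (G-strand bases are given as read on that strand, i.e. G or A).
    """
    counts = Counter()
    for strand, base in reads:
        if strand == "C" and base in ("C", "T"):
            # C-strand reads carry methylation information only.
            counts["methylated" if base == "C" else "unmethylated"] += 1
        elif strand == "G" and base in ("G", "A"):
            # G-strand reads are unaffected by conversion and reveal the genotype.
            counts["supports_C" if base == "G" else "supports_T"] += 1
    return counts


def naive_methylation_level(counts):
    """Return the methylated fraction, or None if a C/T SNP is suspected."""
    if counts["supports_T"] > 0:  # naive rule: any A on the G-strand
        return None
    informative = counts["methylated"] + counts["unmethylated"]
    return counts["methylated"] / informative if informative else None
```

With four T reads on the C-strand and four G reads on the G-strand, the site is reported as fully unmethylated; replacing the G reads with A reads makes the same T reads uninformative for methylation.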
The inherent directionality of Illumina Bisulfite-seq has thus far been used in a lim-
ited and ad hoc way. The Salk Institute group filtered out cytosines which did not have
one or more unconverted Cs on the C-strand, but this approach results in lost information
about completely unmethylated cytosines (which play a crucial role in gene
regulation) (Lister et al., 2009, 2011). Our own group filtered out reference Cs if opposing
reads contained As, but the required number of such A reads was somewhat arbitrary (Berman
et al., 2012). A third group removed all C/T reads on the C-strand, and called SNPs by
requiring a minimum number of reads containing two different alleles (Chen et al., 2011).
Importantly, none of these so-called "k-allele" approaches took advantage of base calling
quality scores, which have been shown to be extremely important for distinguishing
true SNPs from sequencing errors (Li et al., 2008). Others used various methods that
did not attempt to identify C/T or other SNPs occurring at cytosines (Gertz et al., 2011;
Shoemaker et al., 2010; Diep et al., 2012). Such methods may be useful for analyzing
allele-specific patterns in a limited way, but do not address the need to improve
methylation quantification at SNPs.
Here, we describe a probabilistic SNP caller, Bis-SNP, that is based on methods
that have proven successful in non-bisulfite SNP calling (DePristo et al., 2011; Li et al.,
2009b). Bis-SNP uses Bayesian inference to evaluate a model of strand-specific base
calls and base call quality scores, along with prior information on population SNP
frequencies, experiment-specific bisulfite conversion efficiency, and site-specific DNA
methylation estimates. It also takes advantage of base call quality score recalibration,
an addition that has greatly improved SNP calling in the non-bisulfite context (DePristo
et al., 2011). Bis-SNP is open-source and based on the GATK framework (McKenna
et al., 2010), in order to take advantage of the parallel Map-Reduce computation strat-
egy and provide practical execution times. Bis-SNP accepts either single-end or paired-
end mapped Bisulfite-seq data in the form of BAM files, and outputs SNP and methy-
lation information using standard file formats. We show that Bis-SNP is a practical
tool that can both (1) improve DNA methylation calling accuracy by detecting SNPs
at cytosines and adjacent positions, and (2) identify heterozygous SNPs that can be
used to investigate mono-allelic DNA methylation and polymorphisms in cis-regulatory
sequences.
Figure 2.1: Detecting single nucleotide polymorphisms from Bisulfite-seq data.
Hypothetical bisulfite-sequencing data is shown, with reference genome at top, genome of the individual
sequenced (unobserved) middle, and bisulfite sequencing reads bottom. (a) shows three reference cyto-
sine positions, with the first being a match to the reference genome and the second two being homozygous
single nucleotide polymorphisms. The first case shows a true C:G genotype, and all reads on the same
strand as the C (the ”C-strand”) are read as T, indicating an unmethylated state (shown as blue). Because
the Illumina Bisulfite-seq protocol is ”directional”, reads on the opposite strand (the ”G-strand”) are read
as the true genotype, G (reads on the G-strand are boxed in this figure). The second case illustrates a
true T:A genotype, which can be distinguished from the first by the ”A” reads on the G-strand. In this case,
the reads on the C-strand are inferred to be from a true ”T” and should not be used for methylation calling
(crossed out here). The third case shows a T>C SNP, which again can be identified based on G-strand
reads. (b) A cytosine position with 50% unmethylated (T) and 50% methylated (C) reads can be associated
with a heterozygous SNP on the same sequencing reads. In this case, the unmethylated reads are those
on the ”A” allele chromosome (here shown as maternal) and the methylated reads are on the ”T” allele
chromosome.
2.2 Materials and Methods
2.2.1 Local realignment, base quality recalibration and other BAM file
preprocessing
Reads with mapping quality scores less than 30 and those mapped to multiple genomic
regions are removed, as are PCR duplicates (optional). For paired-end reads, we
remove read pairs that do not have the ProperlyPaired field set.
We use GATK to perform local multiple sequence realignment and sequence recalibration,
mostly as described (DePristo et al., 2011). Since most bisulfite sequencing
mapping tools (e.g. Bismark, BSMAP, MAQ) do not provide a correct CIGAR string
in the BAM file for GATK’s indel realignment, the CIGAR string is recalculated when
necessary. We extend GATK’s RealignerTargetCreator to count the number of mismatches
without counting thymine as a mismatch when the reference genome position is cytosine.
After we create a potential indel interval, we realign using a modified version of GATK’s
IndelRealigner. PCR duplicate reads are marked after indel realignment.
For base quality recalibration, we modify the GATK algorithm to account
for bisulfite conversion by extending the GATK CountVariantWalker and
TableRecalibrationWalker classes. The algorithm first tabulates empirical mis-
matches to the reference at all loci not known to vary in the population (i.e., not in
dbSNP build 135). These counts are categorized by their instrument-reported
quality score (R) and position (cycle) within the read (C). In tabulating mismatches, we
do not count thymine as a mismatch when the reference genome position is cytosine
(on the second end of a paired-end read, we instead don’t count adenine as a mismatch
when the reference is guanine).
By default, only positions with a recalibrated Base Calling Quality Score of greater
than 20 are used for SNP calling. This quality cutoff can be set using a command line
parameter (see Appendix).
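The filtering and mismatch rules described in this section can be summarized in a small Python sketch. It is a simplified illustration under stated assumptions, not the Bis-SNP or GATK code: the function names and constants are hypothetical, multi-mapped and duplicate reads are not modelled, and the first-end and second-end rule follows the text literally without considering which genomic strand a read maps to.

```python
MIN_MAPQ = 30          # reads with mapping quality below this are removed
MIN_BASE_QUALITY = 20  # only bases with recalibrated quality above this are used


def keep_read(mapq, is_paired, is_proper_pair):
    """Read-level filter: mapping quality and, for paired-end data, proper pairing."""
    if mapq < MIN_MAPQ:
        return False
    if is_paired and not is_proper_pair:
        return False
    return True


def is_mismatch(ref_base, read_base, second_end=False):
    """Bisulfite-aware mismatch test used when tabulating empirical errors.

    On the first end, T at a reference C is a possible conversion and is not
    counted; on the second end, A at a reference G is not counted.
    """
    ref_base, read_base = ref_base.upper(), read_base.upper()
    if read_base == ref_base:
        return False
    if not second_end and ref_base == "C" and read_base == "T":
        return False
    if second_end and ref_base == "G" and read_base == "A":
        return False
    return True


def usable_for_snp_calling(base_quality):
    """Base-level filter on the recalibrated quality score (default cutoff 20)."""
    return base_quality > MIN_BASE_QUALITY
```

For instance, `is_mismatch("C", "T")` is false on a first-end read, whereas the same observation on a second-end read counts as a mismatch.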
2.2.2 BisSNP probabilistic model
We begin with the Bayesian likelihood model of GATK (DePristo et al., 2011), and make
a number of bisulfite-specific adaptations. Assuming the underlying genome is diploid,
we let $D = (D_1, D_2, \ldots, D_r)$ represent the base calls at a particular genomic position $i$
that is covered by $r$ sequencing reads. We then calculate the posterior probability by
(2.1) as in GATK:
$$\Pr(G \mid D) = \frac{\pi(G)\,\Pr(D \mid G)}{\Pr(D)} \tag{2.1}$$
Here, $G$ is the underlying diploid genotype, $AB$, with $A$ and $B$ being the two parental
alleles. $\pi(G)$ is the prior probability of observing the given genotype, based
on the genotype of the reference genome and population frequencies, the same as
discussed in Table 1 of the SOAPsnp paper (Li et al., 2009b). $\Pr(D)$ is defined as the sum
over all possible genotypes, $\sum_{AB} \pi(AB)\,\Pr(D \mid AB)$, but it is the same in each case and
can generally be ignored since we are concerned with likelihood ratios. We
assume that each of the two alleles is equally likely to be sequenced, and calculate the
overall likelihood of $D$ as the product over all individual reads (2.2), (2.3):
$$\Pr(D \mid G) = \prod_{j=1}^{r} \Pr(D_j \mid G) \tag{2.2}$$
$$\Pr(D_j \mid G = AB) = \frac{1}{2}\Pr(D_j \mid A) + \frac{1}{2}\Pr(D_j \mid B) \tag{2.3}$$
The following steps are shown for single-end sequences. For paired end sequences,
the first end is treated as described, but the second end is reverse complemented before
performing these calculations (because the Illumina second end is the complementary
strand of the same template as the first end). This changes G>A bisulfite substitutions,
which occur on the second end, to the actual C>T substitutions present on the bisulfite-converted
template. The recalibrated base quality scores are on a phred scale, which
represents the probability $\varepsilon$ that the base call at that position is an error; this
probability is used in the following calculation.
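For reference, the standard phred relationship between a quality score $Q$ and an error probability is $\varepsilon = 10^{-Q/10}$. A minimal Python sketch follows; the function name is illustrative and not taken from Bis-SNP.

```python
def phred_to_error_prob(q):
    """Convert a phred-scaled base quality Q into an error probability."""
    return 10.0 ** (-q / 10.0)


def passes_quality_cutoff(q, min_q=20):
    """The text uses bases with recalibrated quality greater than 20 by default."""
    return q > min_q
```

Under this scale, the default cutoff of 20 corresponds to an error probability of 1 in 100.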
When the underlying allele is adenine (a) or thymine (t), bisulfite conversion does not
apply and the probability estimation is straightforward, as shown for t:
$$\Pr(D_j \mid B = t) = \begin{cases} \dfrac{\varepsilon_j}{3} & \text{if } D_j \neq t \\ 1 - \varepsilon_j & \text{if } D_j = t \end{cases} \tag{2.4}$$
Here, $\varepsilon_j$ is the probability of a sequencing or base calling error for read $j$, i.e. the
probability that the true allele $B$ is a t, but base call $D_j$ is observed as an a, c, or g. The
likelihood function for a is equivalent to that of Equation (2.4). When the underlying allele
is a c or a g, however, the probabilities are strand-specific since bisulfite conversion only
affects one strand in the directional Bisulfite-seq protocol (Figure 2.1). The probability
of seeing a t in the read depends on the probability that the read is methylated ($\beta$),
as well as the bisulfite conversion efficiency ($\alpha$ and $\gamma$). Bisulfite treatment converts all
unmethylated cytosines to thymine, but in practice it is not 100% efficient (Lister et al.,
2009). The parameter $\alpha$ is the estimated frequency of unmethylated cytosines which are
not converted (typically taken from unmethylated spiked-in DNA (Lister et al., 2009) or
the mammalian mitochondrial sequences, which we have found to be almost completely
unmethylated (Berman et al., 2012); in this case, $\alpha = \beta_{chrM}$). By default, $\alpha$ is set to 0.01
but can be specified by the user. We also include a $\gamma$ parameter for over-conversion,
i.e. the rate at which methylated cytosines are converted. Although this is not routinely
measured in practice, it could be estimated by including an enzymatically methylated
control DNA (Renbaum et al., 1990), or a sequencing library without bisulfite conversion.
By default, $\gamma$ is set to 0 but can be specified by the user. The full likelihood calculation
for cytosines is as follows:
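$$\Pr(D_j \mid B = c) = \begin{cases}
(1-\varepsilon_j)\,[\beta_j(1-\gamma) + (1-\beta_j)\alpha] & \text{if } D_j = c^{+} \\
\dfrac{\varepsilon_j}{3} + (1-\varepsilon_j)\,[\beta_j\gamma + (1-\beta_j)(1-\alpha)] & \text{if } D_j = t^{+} \\
1-\varepsilon_j & \text{if } D_j = c^{-} \\
\dfrac{\varepsilon_j}{3} & \text{otherwise}
\end{cases} \tag{2.5}$$
where
$\beta_j(1-\gamma)$ = methylated and (properly) not converted
$\beta_j\gamma$ = methylated and (improperly) converted
$(1-\beta_j)\alpha$ = unmethylated and (improperly) not converted
$(1-\beta_j)(1-\alpha)$ = unmethylated and (properly) converted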
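The per-read likelihoods for an underlying t allele and an underlying c allele can be sketched in Python as below. This is an illustrative transcription of the two case definitions, not the Bis-SNP source code. The function names are hypothetical, the symbols `eps`, `beta`, `alpha` and `gamma` stand for the error probability, methylation rate, under-conversion rate and over-conversion rate, and base calls are assumed to be expressed relative to the strand of the candidate allele (so a g read on the G-strand is passed as "c" with `same_strand=False`).

```python
def prob_given_t(base, eps):
    """Underlying allele t: bisulfite conversion does not apply."""
    return 1.0 - eps if base == "t" else eps / 3.0


def prob_given_c(base, same_strand, eps, beta, alpha=0.01, gamma=0.0):
    """Underlying allele c, with strand-specific bisulfite handling.

    alpha and gamma default to the values stated in the text (0.01 and 0).
    """
    if not same_strand:
        # Opposite-strand reads are unaffected by conversion of this cytosine.
        return 1.0 - eps if base == "c" else eps / 3.0
    not_converted = beta * (1.0 - gamma) + (1.0 - beta) * alpha
    converted = beta * gamma + (1.0 - beta) * (1.0 - alpha)
    if base == "c":
        return (1.0 - eps) * not_converted
    if base == "t":
        return eps / 3.0 + (1.0 - eps) * converted
    return eps / 3.0


def prob_given_genotype(prob_a, prob_b):
    """Each of the two alleles is assumed equally likely to be sequenced."""
    return 0.5 * prob_a + 0.5 * prob_b
```

A useful sanity check on the same-strand case is that the probabilities over the four possible base calls sum to one for any choice of parameters.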
The key to these calculations is that reads on the same strand as the inferred cytosine
allele (denoted with $+$) are treated differently than reads from the opposite strand
(denoted with $-$). As expected based on the example in Figure 2.1, a true allele of
$B = c$ results in a very high probability of seeing a $t^{+}$ (a "t" read on the C-strand), but
a very low probability of seeing a $t^{-}$ (an "a" read on the G-strand). The genotype $G_{best}$
with the highest posterior probability $\Pr(G \mid D)$ is chosen, and the final output score is
the odds ratio between the best ($G_{best}$) and the second best ($G_{nextbest}$), as in Equation
(2.6). In practice, we optimize execution by evaluating only a subset of the 10 possible
diploid genotypes.
$$\text{score} = \log\left(\frac{\Pr(G_{best} \mid D)}{\Pr(G_{nextbest} \mid D)}\right) \tag{2.6}$$
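The genotype selection and score can be illustrated with a short Python sketch. It assumes that per-genotype log10 likelihoods have already been computed, and it uses base-10 logarithms, which is an assumption since the text does not state the base. The names `GENOTYPES`, `log_posteriors` and `call_genotype` are hypothetical.

```python
import math
from itertools import combinations_with_replacement

# The 10 unordered diploid genotypes over A, C, G, T.
GENOTYPES = ["".join(pair) for pair in combinations_with_replacement("ACGT", 2)]


def log_posteriors(log10_likelihoods, priors):
    """Unnormalized log10 posterior: log10 pi(G) + log10 Pr(D | G).

    Pr(D) is omitted because it cancels in the ratio of two posteriors.
    """
    return {g: math.log10(priors[g]) + ll for g, ll in log10_likelihoods.items()}


def call_genotype(log_post):
    """Return the best genotype and its log ratio against the runner-up."""
    ranked = sorted(log_post, key=log_post.get, reverse=True)
    best, next_best = ranked[0], ranked[1]
    return best, log_post[best] - log_post[next_best]
```

For example, with equal priors and log10 likelihoods of -1 for CC and -3 for CT, the call is CC with a score of 2.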
The bisulfite efficiency parameters, i.e. $\alpha$ and $\gamma$, typically vary by less than 1%, so the
critical parameter included in Equation 2.5 is the methylation rate $\beta$. Since this rate varies
by genomic context, organism, and even cell type, we allow the user to specify the possible
contexts as a set of $n$ nucleotide sequences specified by their IUPAC degeneracy codes
(for instance, CH represents CC, CT, or CA). In mammalian genomes, where typically
only the single base 3' of the cytosine is considered relevant, the user would specify CG
and CH (the Bis-SNP default). For Arabidopsis, one might specify CG, CHH, and CHG.
Any arbitrary number of 5' and 3' bases may be specified in order to accommodate the
full range of Bisulfite-seq assays. For instance, a CCGG pattern could be specified for
MspI restriction sites inherent to the RRBS protocol (Smith et al., 2009).
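Matching a sequence against IUPAC context patterns can be sketched as follows. This simplified example assumes the cytosine of interest is the first base of every pattern, so it does not cover patterns with 5' flanking bases such as CCGG; the function names are hypothetical and not part of the Bis-SNP command line.

```python
# Standard IUPAC nucleotide degeneracy codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_context(sequence, pattern):
    """True if every base of sequence is allowed by the IUPAC pattern."""
    if len(sequence) != len(pattern):
        return False
    return all(b in IUPAC[p] for b, p in zip(sequence.upper(), pattern.upper()))


def classify_cytosine(downstream, patterns):
    """Return the first pattern matched by the sequence starting at the cytosine."""
    for pattern in patterns:
        if matches_context(downstream[:len(pattern)], pattern):
            return pattern
    return None
```

With the Arabidopsis-style contexts CG, CHH and CHG, the sequence CTT would be classified as CHH and CAG as CHG.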
One methylation output file (BED6+2 format) is created for each cytosine context
specified by the user. For each cytosine determined to have the particular sequence
context, the percent methylated (the number of C reads on the C-strand divided by the
number of C or T reads on the C-strand) is output as the score field. To aid in statistical
analysis, a second field contains the total number of C/T reads.
2.2.3 Five-prime bisulfite non-conversion filter
Non-conversion of unmethylated Cs is known to preferentially affect the 5’ end of
Illumina-generated reads, most likely driven by the re-annealing of sequences adja-
cent to the fully methylated sequence adapters during bisulfite conversion. We control
for this using a 5’ non-conversion filter as implemented in our earlier work(Berman et al.,
2012). For each read, we walk along the read from 5’ to 3’, and we remove any Cs on
the C-strand until we reach the first reference C which is converted to a T. By applying
this filter, bisulfite conversion in early cycles is brought to levels very similar
to those of late cycles, thus removing a potential source of methylation bias (data not
shown). Notice that this filter should be turned off for RRBS data, which gleans most of
its methylation data from the first cycle (see Appendix).
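The 5' walk described above can be sketched in Python. The example assumes the read and reference bases are already aligned base-for-base, both written 5' to 3' in C-strand orientation; the function name and the boolean-mask return value are choices made for illustration only.

```python
def five_prime_nonconversion_filter(read_bases, ref_bases):
    """Mark which positions of a read remain usable for methylation calling.

    Unconverted Cs (read C at reference C) are masked until the first
    converted reference C (read T at reference C) has been seen.
    """
    usable = []
    seen_conversion = False
    for read_b, ref_b in zip(read_bases.upper(), ref_bases.upper()):
        if ref_b == "C" and read_b == "T":
            seen_conversion = True
        unconverted_c = ref_b == "C" and read_b == "C"
        usable.append(seen_conversion or not unconverted_c)
    return usable
```

A read with no converted cytosine at all has every one of its Cs masked under this rule, which is consistent with the advice to disable the filter for RRBS data.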
2.2.4 Pre-SNP calling quality filters
Using the approach of GATK, we apply additional quality filters before SNP calling to
avoid known sources of false positives. SNPs found in clusters (two or more within a
ten-base-pair window) are filtered out. SNPs with coverage depth above 120, Strand
Bias (SB) score more than -0.02, or Quality by Depth (QD) less than 1.0 are also filtered out.
All of these parameters are configurable (see Appendix). If the BAM file contains Mapping
Quality scores, suspicious regions are filtered out when greater than 10% of aligned
reads (minimum of 40 reads) have a mapping quality of 0.
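As an illustration of the cluster filter, the sketch below drops any SNP that has a neighbouring SNP less than ten base pairs away on the same chromosome. The exact window semantics used by GATK and Bis-SNP are not spelled out here, so the strict "distance less than window" rule and the function name are assumptions.

```python
def filter_clustered_snps(positions, window=10):
    """Keep only SNP positions with no other SNP closer than `window` bases."""
    positions = sorted(positions)
    kept = []
    for i, pos in enumerate(positions):
        near_prev = i > 0 and pos - positions[i - 1] < window
        near_next = i + 1 < len(positions) and positions[i + 1] - pos < window
        if not (near_prev or near_next):
            kept.append(pos)
    return kept
```

For positions 100, 105, 200, 300, 309 and 400, only 200 and 400 survive the filter.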
Bisulfite sequencing can have higher strand biases since high bisulfite concentration
can lead to DNA degradation when the depurination step causes random strand breaks
(Raizis et al., 1995; Ehrich et al., 2007). We calculated strand bias score as in GATK,
but bisulfite converted reads have an apparent strand bias which is higher than the
actual strand bias, since the G-strand contributes more than the C-strand at cytosines.
For this reason, we used a substantially less stringent strand bias cutoff (-0.02) than the
GATK default.
2.2.5 Downsampling coverage
We downsampled the human colon mucosa Bisulfite-seq dataset to different mean
coverages using GATK, which randomly picks $z$ reads at each individual nucleotide
locus. The following formula is used, where $N$ is the mean coverage of the total dataset
before downsampling (32x in this case), $n$ is the desired downsampling coverage, and
$m$ is the actual coverage at the particular locus.
$$z = \frac{mn}{N} \tag{2.7}$$
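A per-locus version of this downsampling rule can be sketched in Python. Rounding $z$ to the nearest integer and the function name are assumptions; the GATK implementation may handle fractional values differently.

```python
import random


def downsample_locus(reads, target_mean, total_mean, rng=None):
    """Randomly keep z = m * n / N reads at one locus.

    m is the number of reads at the locus, n the desired mean coverage,
    and N the mean coverage of the full dataset.
    """
    rng = rng or random.Random()
    m = len(reads)
    z = min(m, round(m * target_mean / total_mean))
    return rng.sample(reads, z)
```

For a locus covered by 40 reads in a 32x dataset, downsampling to 8x keeps 10 reads.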
2.2.6 External tools used for comparison
K-allele method. The K-allele method was used to identify heterozygous SNPs as a
generalization of described methods (Chen et al., 2011; Gertz et al., 2011), both of which
count the number of alternate alleles present and exclude C/T SNPs. For reference
cytosine positions, we only use counts from the G-strand, while at other positions we
combine the two strands to get read counts. After these filters, we use a $K$ cutoff which
can vary from 0-10 and apply the K-allele threshold as follows. For positions with $n$
passing reads where $n$ is less than 10, we require that each of the two alleles have at
least $K$ reads. For positions where $n$ is greater than 10, we require at least $n \cdot \frac{K}{10}$ reads.
For reference, the Hudson Alpha group (Gertz et al., 2011) used a set definition $K$ of
7 reads and at least 10%, and excluded all C/T SNPs. The UCLA group (Chen et al.,
2011) specified that the allele with the lower read count had to contain at least 40% of
reads, and excluded C/T reads.
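The K-allele threshold can be written as a small Python function. The function name is hypothetical, and the treatment of exactly 10 passing reads is an assumption, although both branches of the rule give the same requirement of $K$ reads at that depth.

```python
def passes_k_allele(count_a, count_b, n, k):
    """Apply the K-allele rule to the read counts of the two candidate alleles.

    n is the number of passing reads at the position and k the cutoff (0-10).
    """
    required = k if n <= 10 else n * k / 10.0
    return min(count_a, count_b) >= required
```

With $K = 2$ and 30 passing reads, each allele needs at least 6 supporting reads to be called heterozygous.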
bisReadMapper. We downloaded bisReadMapper version 1 (Diep et al., 2012). We
first use genomePrep.pl to preprocess the reference genome and extract cytosine positions
in each chromosome. The built-in read mapper could not handle our large BAM file,
so we circumvented the mapping step and used the BAM files directly as input. This is
not a standard part of the bisReadMapper package, and required us to divide our BAM
alignment files to separate reads aligning to the forward strand of the reference genome
from those aligning to the reverse strand. We used the following bisReadMapper
parameters: allC=1; length=75; snp=dbsnp135.rod; alignMode=S; qualBase=33;
trim3=0; trim5=0; refDir=/path/to/GenomePreparationProcessedDir/
Shoemaker. The Shoemaker(Shoemaker et al., 2010) method was implemented as
described in their supplemental materials with clarifications from the author. The reads
were sorted into two categories, one with the higher C:T nucleotide ratio and one with
the lower. The reads in the lower C:T ratio category were converted to their reverse
complementary sequences. All reads are then demethylated in silico (Cs converted to
Ts). Input reads are filtered by their criteria: (1) Base calls at the examined SNP site
and three flanking positions on either side needed to have a minimum Base Quality
score of 15. (2) If a certain base was present in more than 20% of reads on one
strand, its reverse complement needed to be present on at least 20% of the reads on
the opposing strand. Only positions passing these two criteria were analyzed. Base
Quality scores were used to weight the nucleotide count contributions to the nucleotide
frequency matrix. This matrix was normalized and multiplied by the read count to get the final
nucleotide number matrix at each location (normalized and weighted A, C, G, T counts
at each locus). The Fisher exact test was applied to each nucleotide in each of the alleles
(e.g. nucleotide number of G vs. nucleotide number of not G, expected nucleotide
number of G vs. expected nucleotide number of not G). The two p-values of each allele
were multiplied together for each of the ten possible genotypes and then normalized.
SNPs were selected when (1) the best genotype was 10 times more likely than the next
most likely genotype, (2) the SNP was reported in dbSNP, and (3) the position had at least 10X
read depth.
Bismark. We downloaded Bismark-0.50(Krueger and Andrews,
2011). We converted our input BAM file to SAM format and ran
genome methylation bismark2bedGraph.pl to extract cytosines. Default settings
were used.
Berman2012. We implemented a generalized version of the method described in
our earlier work(Berman et al., 2012). We only included reference cytosine positions
that had at least 3 overlapping C or T reads. We required at least $k$% of reads on the
C-strand to be C or T, and $k$% of the reads on the G-strand to be G. The default setting
(used in (Berman et al., 2012) and shown as an orange rectangle in Figure 2.3) was
$k = 10\%$.
2.2.7 Datasets used for whole-genome comparisons
OTB-colon. 75bp Single End Whole-Genome Bisulfite-Seq data from (Berman et al.,
2012) was generated using Illumina GAIIx sequencing (available at dbGap:phs000385).
Sample was normal adjacent colon mucosa from a male colon cancer patient.
TCGA-lung and TCGA-breast. 100bp Paired End Whole Genome Bisulfite-Seq
(WGBS) data generated at USC by the TCGA (The Cancer Genome Atlas) USC-JHU
Epigenome Characterization Center. Data is unpublished, but available for download
via the UCSC Cancer Genomics Hub (CG-Hub https://cghub.ucsc.edu/). The lung
normal sample is adjacent tissue from case TCGA-60-2722 (data available in CG-
Hub analysis ID 964a8130-d061-472f-9839-9c1f07b24205), and the breast normal is
adjacent tissue from case TCGA-A7-A0CE (CG-Hub analysis ID 279507dd-4c62-4975-877d-5cfebd2e7c6f).
Mouse-F1i and Mouse-F1r. 100bp Paired-end sequence from two independent
mice(Xie et al., 2012). We downloaded alignments from the original publication (GEO
accessions GSM753569 and GSM753570), which were performed using Novoalign.
High-confidence genotypes were available for both parental strains via the Mouse
Genome Database. We inferred high-confidence genotypes for the progeny only when
each parent was homozygous at the particular position.
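The inference rule for the F1 progeny can be expressed in a few lines of Python. Genotypes are represented here as two-character strings, which is a convention chosen for the example rather than the format of the Mouse Genome Database.

```python
def infer_f1_genotype(parent1, parent2):
    """Infer the F1 genotype only when both parents are homozygous.

    Returns the genotype with alleles in alphabetical order, or None when
    either parent is heterozygous and the progeny genotype is ambiguous.
    """
    if len(set(parent1)) != 1 or len(set(parent2)) != 1:
        return None
    return "".join(sorted(parent1[0] + parent2[0]))
```

A CC parent crossed with a TT parent yields a high-confidence CT heterozygote, which makes such positions useful as known heterozygous sites.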
Figure 2.2: Bis-SNP workflow.
(a) Bis-SNP accepts .bam files, produced by most genome mapping tools (BSMAP , RMAP , Bismark, BS-
SEEKER, RRBSMAP , etc). The base quality recalibration step results in a new BAM with the recalibrated
base quality scores. Finally, Bis-SNP performs SNP calling and outputs both methylation levels and SNP
calls. (b) The SNP calling step is performed on each genomic position independently. Differences between
the reference genome and the sample genome can produce one of 10 possible allele pairs or genotype
($G$, only 4 shown here). Frequencies of all possible substitutions in the population are taken from the
dbSNP database, represented as $\pi(G)$. A probabilistic model that incorporates prior probabilities for
methylation level and bisulfite conversion efficiency is used to calculate the probability of observing the
actual bisulfite read data ($D$) assuming each of the 10 genotypes ($\Pr(D \mid G)$). Finally, Bayesian inference
uses the population frequencies of each SNP to calculate the posterior probability $\Pr(G \mid D)$.
2.3 Results
2.3.1 Bis-SNP workflow
The two primary steps in the Bis-SNP workflow are outlined in Figure 2.2a and include
base quality re-calibration and local realignment followed by SNP calling. Bis-SNP
accepts standard alignment files (.bam format), which can be generated by popu-
lar Bisulfite-seq mapping programs such as MAQ, Bismark, BSMAP , RMAPBS, BS-
SEEKER, PASH, or Novoalign. This allows the user to decide which mapping criteria
are most important for their specific application. This also makes Bis-SNP compatible
with specialized mappers such as RRBSMAP (Xi et al., 2012) and any other program
that can output (.bam) files.
The Bis-SNP model relies on the accuracy of base quality scores, which are initially
estimated by the instrument-specific base caller. However, these initial base scores
do not accurately represent true error probabilities, which are highly dependent on
local sequence context (DePristo et al., 2011). In the GATK workflow, empirical mis-
match rates for each nucleotide at each sequencing cycle are calculated by comparing
base calls to the reference genome, and these mismatch rates are used to recalibrate
instrument-generated values (DePristo et al., 2011). We cannot use this default
implementation with bisulfite-seq data, because true C>T sequencing errors cannot be
quantified correctly when the underlying methylation state of each bisulfite-converted
DNA fragment is unknown. Therefore, instead of treating Ts at reference cytosines as
errors, we treat them as a 5th base X, and estimate these as a group separately from
T>T, A>T, or G>T. The effect is that we can recalibrate base call quality
scores for all except the X nucleotide, improving our ability to accurately identify SNPs.
Importantly, we are able to improve SNP calling at cytosines by recalibrating ”G-strand”
reads complementary to the cytosine.
The user can choose among several output files. For methylation levels, Bis-SNP
can return a standard UCSC .bed or .wig file, and a separate output file is generated
for each cytosine context specified by the user on the command line. Example cytosine
contexts are CG, CH, or CHH (H is the IUPAC symbol for A, C, or T). The .wig output
contains the methylation percentage for each methylated cytosine, while the .bed format
also contains the number of C/T reads the percentage is based on, plus the strand of
each cytosine relative to the reference genome. For SNPs, Bis-SNP returns a Variant
Calling Format (.vcf) file, which contains all SNP calls and likelihood scores in addition
to methylation percentages.
2.3.2 Description of SNP calling algorithm
The core of the SNP calling algorithm is based on the Bayesian inference model of
GATK(DePristo et al., 2011), and implemented using GATK’s LocusWalker class. For
each locus, Bis-SNP evaluates one of ten possible diploid genotypes (G), as shown in
Figure 2.2b (a diploid genotype is made up of two parental alleles, referred to as $A$ and
$B$). The prior probability of each genotype, $\pi(G)$, is determined using population data
from dbSNP (including 1000 Genomes data) similar to SOAPsnp (Li et al., 2009b) (see
Materials and Methods). In this model, the likelihood of observing all base calls at a
particular locus, assuming a particular diploid genotype $AB$, is expressed as $\Pr(D \mid G = AB)$
and is the product of the probabilities of observing the base call at each individual read $j$ (Equation 2.2 of
Materials and Methods). As described below, $\Pr(D_j \mid G = AB)$ is calculated according
to the strand of read $j$ and several bisulfite-specific parameters, $\alpha$, $\beta$, and $\gamma$ (Figure
2.2b).
In the GATK non-bisulfite SNP calling model, the probability of observing a base call
different from the presumed genotype $G$ is simply the base call quality score. In the
case of Bisulfite-seq, this is true for A:T genotypes but not C:G. For C:G genotypes, the
probability of observing a T depends on the strand of the read, the methylation state,
and the efficiency of bisulfite conversion. Reads on the G-strand opposite the cytosine
are treated with the normal GATK model. Reads on the C-strand use an alternate
model that considers C>T substitutions as either potential errors or bisulfite conversions
(see Materials and Methods). The probability of observing a bisulfite conversion event
depends on both the underlying methylation state and bisulfite conversion errors. While
none of these are observed directly, they are included in the model as variables $\alpha$, $\beta$,
and $\gamma$, as described in Equation 2.5 in the Methods section.
After bisulfite treatment, an unmethylated C that fails to get converted to a T is
referred to as underconversion, while a methylated C that is converted to T is referred
to as overconversion. The underconversion rate, $\alpha$, is often estimated using either a
spike-in control (Lister et al., 2009) or the unmethylated mitochondrial genome (Berman
et al., 2012). This rate can be set manually by the user and has a value of 99.75%
by default. While bisulfite overconversion cannot be reliably measured using current
Bisulfite-seq data, we include an additional parameter, $\gamma$, which is set to 0% by default.
In the future, this could be estimated by including DNA not treated with bisulfite in the
same sequencing run.
The percentage of methylated reads at a given cytosine position can vary widely.
Since C reads and T reads yield asymmetric information about the presence of a C>T
SNP , the locus-specific methylation rate can strongly influence SNP calling. In mam-
malian genomes, CpG methylation levels are multimodal, with various classes of func-
tional elements having distinct methylation patterns. At least four different classes exist
with mean methylation rates ranging from around 0% to over 80%(Lister et al., 2009;
Stadler et al., 2011). Furthermore, methylation at particular di- or tri-nucleotide contexts
is organism and even cell type specific. To better understand how methylation estimates
could affect SNP calling performance, we implemented several different methods for
estimating the methylation frequency parameter $\beta$, and we describe these next.
First, we used a naive estimate for $\beta$ where the probability of a read being methylated
or unmethylated at any particular cytosine position was 0.5. Second, we used context-
specific estimates which were determined in a two-round procedure as follows. In the
first round, naive estimates were used as described above, and the resulting SNP calls
were used along with dbSNP to select a set of high-confidence non-SNP homozygous
cytosines (probability > 99.99%). These homozygous cytosines were used to estimate
average methylation levels for a set of cytosine sequence contexts that could be specified
on the Bis-SNP command line (by default, $\beta_{CG}$ and $\beta_{CH}$). In the third and final
estimation method, $\beta$ was estimated for each cytosine locus individually using the number of
C and T reads ($\frac{c}{c+t}$, see Methods). The rationale for this locus-specific method was our
concern that genome-wide estimates might be inappropriate for CpGs, given the strongly
bimodal nature of CpG methylation levels. Each of these three estimation methods
was run individually as described below. The default method for the public version of
Bis-SNP is locus-specific estimation.
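The three estimation strategies can be summarized in a short Python sketch. The function signature, the method labels, and the fallback value used when a locus has no C or T reads are assumptions made for illustration.

```python
def estimate_beta(c_reads, t_reads, method="locus", context_beta=None):
    """Estimate the methylation rate for one cytosine.

    method: "naive" (always 0.5), "context" (a supplied genome-wide
    estimate for the cytosine's sequence context), or "locus" (c / (c + t)).
    """
    if method == "naive":
        return 0.5
    if method == "context":
        if context_beta is None:
            raise ValueError("context_beta is required for the context method")
        return context_beta
    if method == "locus":
        total = c_reads + t_reads
        # Assumption: fall back to the naive value when there is no coverage.
        return c_reads / total if total else 0.5
    raise ValueError("unknown method: " + method)
```

For a locus with 3 C reads and 1 T read on the C-strand, the locus-specific estimate is 0.75, whereas the naive estimate stays at 0.5 regardless of the data.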
2.3.3 Evaluation of SNP calls at known SNPs
We evaluated Bis-SNP calling accuracy for each of the three different methylation esti-
mation methods (naive, context-specific, and locus-specific). The latter two methods
performed substantially better than naive estimation, so those are the only two dis-
cussed below. We evaluated accuracy using an actual whole-genome Bisulfite-seq
dataset from a normal (male) human colon mucosa sample published previously by
our lab(Berman et al., 2012) (sequence available via accession dbGap:phs000385). All
reads were 75bp long single-end, and generated using the Illumina Genome Analyzer
IIx platform. The complete dataset had an average read depth of 32X. The Bisulfite-seq
data were compared to Illumina Human1M-Duo BeadChip SNP array data from the same
sample.
The primary goal of bisulfite sequencing is the accurate determination of cytosine
methylation levels, so we first investigated the ability of Bis-SNP to correctly identify
homozygous cytosines. As the ”ground truth”, we used 435,120 positions identified as
homozygous cytosines on the 1M SNP array, and examined false negative and false
positive calls made by Bis-SNP (Figure 2.3a-c). Calls at varying stringencies were gen-
erated by adjusting the Bis-SNP score cutoff, which is defined as the odds ratio between
the first and second most likely genotype (see Methods). Evaluating the different Bis-
SNP methylation estimates with and without base quality recalibration showed that the
locus-specific estimation plus recalibration produced the most accurate results. Using
the complete sequence dataset and the default score cutoff (Figure 2.3c, red circle),
Bis-SNP was able to detect 95.22% of the true cytosines (414,327 features) with a false
positive rate of 0.37% (2,461 features). We simulated lighter sequencing coverage by
randomly picking reads from the full dataset to estimate accuracy at 8x (Figure 2.3a)
and 16x (Figure 2.3b) genomic coverage. The reader should note that these false pos-
itive rates are not indicative of the genome-wide false positive rates, since most false
positives come from heterozygous SNPs which are frequent on the SNP array but very
infrequent in the genome.
For comparison, we determined the accuracy of homozygous cytosine calling using
several published methods (Figure 2.3a-c). Bismark(Krueger and Andrews, 2011)
returns methylation estimates for all cytosines in the reference genome. It is thus not
surprising that Bismark performs poorly for features on the 1M SNP array, which were
selected for their polymorphism and differences with the reference genome. Several
other published studies use the same strategy and estimate methylation at all reference
cytosines(Laurent et al., 2010; Hodges et al., 2011). In our own earlier work(Berman
et al., 2012), we restricted methylation calling to reference cytosines. Thus it is not sur-
prising that when we applied this method (”Berman2012”) to the 1M SNP array dataset,
it achieved almost the same false negative rate as Bismark. However, Berman2012
filtered out positions where less than 90% of reads were C or T on the C-strand and G
on the G-strand, resulting in a substantially lower false positive rate than Bismark, but
not as low as Bis-SNP .
We next focused on the ability of Bis-SNP to determine heterozygous SNPs, which
can be used both for improving methylation calling accuracy as well as allele-specific
methylation analysis (see Figure 2.1b). Heterozygous SNPs are more difficult to identify
than homozygous SNPs, due to approximately 1/2 the read coverage for each
Figure 2.3: Bis-SNP error frequencies in detecting SNPs on the Illumina 1M SNP array.
Receiver Operating Characteristic (ROC) curves are shown for Bis-SNP's accuracy at detecting SNPs in
Bisulfite-seq data generated from human colonic mucosa tissue. The ”true” genotypes were determined
using an Illumina Duo 1M Human SNP array, and Bis-SNP results were only evaluated at these million
genomic positions. All datasets were from (Berman et al., 2012). The three ROC curves at the top (a-
c) show accuracy at positions corresponding to 435,120 homozygous cytosines on the 1M SNP array.
By randomly downsampling from the average 32x read depth of the Bisulfite-seq data, we are able to
show results corresponding to 8x coverage (a), 16x coverage (b), and 32x coverage (c). Bis-SNP in
three different conditions are compared to Bismark and the method used in ”Berman2012”(Berman et al.,
2012), both of which restrict their results to reference cytosines. For ”Berman2012”, we varied the number
of reverse strand G reads required to plot a range of stringencies. The three plots at the bottom (d-
f) show accuracy at the 303,656 positions that are heterozygous according to the 1M SNP array. For
comparison, we show results from the k-allele method (similar to the approach of (Chen et al., 2011)),
Shoemaker (Shoemaker et al., 2010) and bisReadMapper (Diep et al., 2012).
allele. We excluded the haploid X chromosome, leaving 303,656 autosomal loci called
as heterozygous by the 1M SNP array. As before, the locus-specific methylation esti-
mation plus recalibration performed the best of all methods. Using the full dataset with
the default Bis-SNP cutoff (Figure 2.3c, red circle), Bis-SNP was able to identify 93.18%
of heterozygous SNPs (282,944 loci) with a false positive rate of 0.094% (755 loci). Of
the 303,656 heterozygous loci examined, 242,347 (79.81%) were C/T heterozygotes.
C/T is the most common SNP in mammals, arising from evolutionary deamination of
methylated cytosines. It is also the most difficult SNP to detect in bisulfite-treated DNA,
because the C-strand reads are often uninformative (see Figure 2.1). As expected,
Bis-SNP (and other methods) performed more poorly on C/T heterozygous SNPs than
others, due to C>T conversion ambiguity (Figure 2.4).
[Figure 2.4 plot. Title: C/T SNP & Non C/T heterozygous SNPs, 32x sequencing coverage. X-axis: False Positive Rate. Y-axis: 1 - False Negative Rate. Series: C/T and non-C/T SNPs for Bis-SNP, K-allele, bisReadMapper, and Shoemaker.]
Figure 2.4: Bis-SNP error frequencies at C:T heterozygous SNPs.
The data for heterozygous SNP calling in Figure 2.3c is broken up into C:T SNPs vs. other heterozygous
SNPs.
We compared Bis-SNP results to heterozygous SNPs called using two alternate
”k-allele” techniques that used read count cutoffs without incorporating base quality
scores. We implemented a generalized form of the method used by (Chen et al., 2011;
Gertz et al., 2011) to use a variable read count cutoff. This cutoff, k, was defined as
the minimum percentage of reads with a secondary allele necessary to call a heterozy-
gous SNP . As in (Chen et al., 2011), we counted C and T as a single allele at reference
cytosines (on the C-strand only). In addition to k-allele, we also tried the Shoemaker
method(Shoemaker et al., 2010), which does not evaluate C/T SNPs at all and requires
observations of the less frequent allele on at least 20% of reads on each strand. Finally,
we tried the bisReadMapper algorithm(Diep et al., 2012), which calls SNPs indepen-
dently on each strand using a non-bisulfite SNP caller, SAMTOOLS(Li et al., 2009a),
and reports only those SNPs that agree between strands. Figures 2.3d-f show that
each variation of Bis-SNP performs better than other methods.
An important practical question is the minimum read depth required for accurate
SNP identification. We addressed this problem by downsampling our 32x Bisulfite-seq
genome to various coverage levels from 2x to 30x (Figure 2.5). For each coverage
level, we determined the number of false positives and false negatives across a range
of Bis-SNP stringency cutoffs using the 1M SNP array data, as in Figure 2.3. At each
coverage level, we then selected the least stringent cutoff that produced a False Discov-
ery Rate (FDR) of less than 5%, and plotted the number of true positives (sensitivity).
For both homozygous cytosines (Figure 2.5a) and heterozygous SNPs (Figure 2.5b),
sensitivity increased dramatically up to about 10x coverage and then began to level off.
Homozygous SNPs were almost fully detected (98% sensitivity) by 10x coverage, while
heterozygous SNPs had a more gradual increase from 80% detected at 10x to 95%
detected at 30x.
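The cutoff selection used for the coverage analysis can be sketched as follows. The input format (a list of score and truth pairs for the called positions) and the function name are assumptions; the sketch scans cutoffs from least to most stringent and returns the first one whose False Discovery Rate is below the limit, together with the resulting sensitivity.

```python
def least_stringent_cutoff(calls, n_true, max_fdr=0.05):
    """Find the lowest score cutoff with FDR below max_fdr.

    calls: list of (score, is_true_positive) for candidate calls.
    n_true: number of true positions on the array (sensitivity denominator).
    Returns (cutoff, sensitivity), or None if no cutoff qualifies.
    """
    for cutoff in sorted({score for score, _ in calls}):
        passed = [is_true for score, is_true in calls if score >= cutoff]
        true_pos = sum(passed)
        false_pos = len(passed) - true_pos
        if passed and false_pos / len(passed) < max_fdr:
            return cutoff, true_pos / n_true
    return None
```

Repeating this at each downsampled coverage level gives one sensitivity value per coverage, which is what the coverage curves summarize.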
2.3.4 Accuracy of genome-wide methylation calling
To verify the ability of Bis-SNP to correctly identify cytosines and improve methylation
quantification genome-wide, we ran Bis-SNP across an entire chromosome for the OTB
colon mucosa sample and four additional whole-genome bisulfite-seq samples (Table
2.1). TCGA normal lung and normal breast were generated by the USC Epigenome
Center and aligned using BSMAP , while the two mouse methylomes were generated by
UCSD and aligned using Novoalign(Xie et al., 2012). Runtimes for chromosome 1 were
about 3 hours using a standard 12-core Intel server with 10GB RAM. The entire human
genome takes about 30-40 hours on a single server (data not shown).
We used Bis-SNP to identify four classes of cytosines in the sample genome (Fig-
ure 2.6 and Table 2.2, ”Sample Genotypes”), and divided these by their corresponding
Figure 2.5: Sensitivity as a function of sequence coverage.
Comparisons between Bis-SNP SNP calls and 1M SNP array from Figure 2.3 ROC curves were extended to
a range of coverage levels from 2x-30x. At each coverage level, we selected the least stringent threshold
that yielded a False Discovery Rate (FDR) less than 0.05, and plotted the Sensitivity (1 - False Negative
rate). As in Figure 2.3, separate plots show sensitivity at detecting homozygous cytosines (a) and het-
erozygous SNPs (b). For heterozygous SNPs, we include the overall detection rate (red line), as well as
separate lines for C/T heterozygous SNPs (blue line) and non-C/T heterozygous SNPs (green line).
sequences in the reference genome (Figure 2.6 and Table 2.2, ”Reference Genotypes”).
As shown in Table 2.2, about 0.5-0.6% of reference CpGs were lost in the sample
genome, and 0.5-0.6% of CpGs in the sample genome had a different sequence in
the reference. The two mouse samples had significantly higher SNP rates, presum-
ably due to true strain differences between the crossed strains and the C57BL/6J strain
sequenced for the mouse reference genome. In both F1 mice, about 2.5% of reference
CpGs were lost in the sample genome, and about 1.1% of CpGs in the sample genome
had a different sequence in the reference.
We next compared average methylation levels across each sample genotype (Fig-
ure 2.6). As expected, homozygous CpHs were consistently low, while homozygous
CpGs were consistently high, regardless of the corresponding reference sequence.
Both mouse frontal cortex brain samples showed elevated levels of CpH methylation as
described in their original publication(Xie et al., 2012). Interestingly, homozygous CpGs
that represented SNPs (differing from the reference genome) had consistently higher
methylation. This fits with what is known about mammalian genome evolution: evolutionary
C>T changes occur much more frequently at methylated than at unmethylated
CpGs because the C>T deamination and deamination repair process is methylation-specific.
We next looked at heterozygous CpGs (Figure 2.6, right). CpG/CpH positions
had methylation about halfway between CpG homozygous and CpH homozygous
positions. At CpG/ApG or CpG/GpG heterozygous positions, methylation can only be
measured for the C allele, and the methylation state is about the same as at homozygous
CpGs. CpG/TpG heterozygous positions are not shown, because we cannot accurately
measure methylation at these positions. Together, these data show that Bis-SNP
genotype calling produces accurate methylation quantification even when the sample
genome differs from the reference genome.
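As a rough arithmetic check of the "about halfway" observation, the sketch below averages the homozygous CpG and CpH methylation levels and compares the midpoint with the reported CpG/CpH heterozygous level. The percentages are copied from the "Ref CpG" methylation column of Table 2.2; the dictionary layout and function names are illustrative only and are not part of Bis-SNP.

```python
# Methylation percentages from Table 2.2 ("Ref CpG" column), per sample:
# (homozygous CpG, homozygous CpH, CpG/CpH heterozygous).
TABLE_2_2 = {
    "OTB normal colon": (73, 1, 39),
    "TCGA normal lung": (76, 1, 37),
    "TCGA normal breast": (75, 1, 39),
    "Mouse F1i": (76, 3, 43),
    "Mouse F1r": (75, 3, 43),
}


def expected_het_methylation(cpg: float, cph: float) -> float:
    """Midpoint expected if one allele behaves like CpG and the other like CpH."""
    return (cpg + cph) / 2.0


def deviations(table: dict) -> dict:
    """Observed heterozygous level minus the expected midpoint, per sample."""
    return {
        sample: het - expected_het_methylation(cpg, cph)
        for sample, (cpg, cph, het) in table.items()
    }


if __name__ == "__main__":
    for sample, diff in deviations(TABLE_2_2).items():
        print(f"{sample}: observed minus expected = {diff:+.1f} percentage points")
```

Under this simple model, all five samples fall within a few percentage points of the midpoint.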
Figure 2.6: Accurate methylation calling at SNPs.
Bis-SNP was run on five different datasets: single-end sequencing from colon mucosa tissue (Berman
et al., 2012) (a), two TCGA normal samples using paired-end sequencing from breast and lung, and two mouse
samples using paired-end sequencing from Xie et al. (2012) (see Table 2.1). In each case, Bis-SNP was
used to identify cytosines in one of four sequence contexts in the sample genome. For each sample,
cytosines were further divided by their sequence context in the reference genome (refCpG, refCpH, or
refNotC). All cytosines within a particular category in a particular sample were averaged to yield a mean
methylation level. The number of cytosines in each category can be found in Table 2.2.
2.3.5 Discussion
We have described a publicly-available software tool, Bis-SNP, which extracts methy-
lation information and SNP information simultaneously from data generated using
the Illumina Bisulfite-seq protocol. Command-line executables and open-source code
are both freely available for download at http://epigenome.usc.edu/publicationdata/bissnp2011.
The directional nature of the Illumina protocol allows for analysis of DNA
methylation and the identification of a SNP at the same position, by combining infor-
mation from each strand separately. This is the dominant bisulfite-sequencing protocol
in use today by individual labs and genomics consortia such as ENCODE, the NIH
Epigenomics Roadmap, and The Cancer Genome Atlas. By identifying and
filtering SNPs correctly, we can obtain more accurate methylation levels. Heterozygous
SNPs, including C/T SNPs, can be used to identify allele-specific methylation patterns.
Bis-SNP is implemented using the efficient GATK framework, which allows for runtimes
that are reasonable for modern whole-genome analysis. An entire 32x whole-genome
dataset took about 30 hours to run on a typical 12-processor compute node with 10GB
of memory, or 3 hours when each chromosome was run in parallel on a separate com-
pute node. This performance profile makes Bis-SNP accessible to most users.
We included the capability to perform base quality re-calibration on bisulfite-seq
data, which improves the overall SNP calling accuracy of Bis-SNP. More
accurate base quality scores not only allow better identification of SNPs, as shown here, but
could also be used in the future to calculate more precise DNA methylation estimates. Biological
DNA samples do not typically have a large number of cytosines that are always
100% methylated, so there is not a reliable way to identify true C>T mismatches and
recalibrate quality scores at these positions. Recalibration could be improved in the
future by spiking a library of DNA that has not been treated with bisulfite into the same
sequencing lane.
The potential applications of Bisulfite-seq in basic biology and medicine are broad,
and Bis-SNP can be used for the majority of Bisulfite-seq experimental designs
including Whole-Genome Bisulfite-Seq (WGBS), Reduced Representation Bisulfite-
Seq (RRBS), and customizable genome selection methods. While we have focused
on human studies, Bis-SNP can output methylation levels split up according to user-
defined cytosine contexts, which makes it applicable to analysis of Arabidopsis or
any other organism. It also allows Bis-SNP to accommodate novel study designs,
such as in vitro methylation by methyltransferases with arbitrary sequence specificities,
or even the study of 5-hydroxymethylcytosine (5-hmC) using a novel bisulfite-sequencing
approach (Booth et al., 2012).
An intriguing potential use of Bisulfite-seq and Bis-SNP is the study of genome-wide
associations between SNPs and DNA methylation patterns (i.e. methQTLs, reviewed
in Rakyan et al., 2011). While the experimental designs thus far envisioned for such
studies include paired SNP and methylation assays, our encouraging results with Bis-SNP
suggest that both could be captured in a single bisulfite-sequencing experiment.
Sequencing depths of 50x or greater for Whole-genome Bisulfite-seq are not unattain-
able from a cost perspective, and would likely provide sufficient SNP and methylation
coverage for methQTL studies. Another potential application could be to a Genome-
Wide Association Study (GWAS) that used Bisulfite-seq rather than traditional sequenc-
ing, to identify disease associations at the genetic and epigenetic levels simultaneously.
This could be especially useful given the large number of GWAS hits that appear to
affect regulatory regions rather than gene coding regions. Bis-SNP and other Bisulfite-
seq analysis tools will be important in the development of these exciting new technolo-
gies.
Sample              Aligner    Reference  Coverage  Het SNPs  Hom SNPs  Callable bases  Runtime
OTB                 MAQ        hg18       32x       119,103   67,725    211,042,010     2.8h
TCGA-lung-normal    BSMAP      hg19       19x       118,412   58,309    222,763,786     3.1h
TCGA-breast-normal  BSMAP      hg19       19x       113,009   57,281    221,014,965     2.7h
Mouse-F1i           Novoalign  mm9        50x       663,528   65,364    178,718,615     3.1h
Mouse-F1r           Novoalign  mm9        41x       682,979   67,068    178,847,508     3.1h
Table 2.1: Chromosome 1 Bis-SNP detection
All benchmarking was performed using a single Intel(R) Xeon (X5650, 2.67GHz) server with 12 CPU cores and
10GB memory. SE refers to single-end sequencing and PE to paired-end.
                   Reference Genotypes: count (% of reference category)               % methylation
Sample genotype    Reference CpG        Reference CpH         Reference DpN (D=A,T,G)  Ref CpG  Ref CpH  Ref DpN
OTB normal colon
  CpG              3,758,803 (99.39%)   12,540 (0.02%)        11,838 (0.01%)           73%      80%      82%
  CpH              7,773 (0.21%)        78,427,918 (99.95%)   18,804 (0.01%)           1%       1%       1%
  DpN              5,658 (0.15%)        14,166 (0.02%)        128,570,817 (99.97%)     NA       NA       NA
  CpG/CpH het      7,218 (0.19%)        8,998 (0.01%)         NA                       39%      39%      NA
  CpG/RpG het      2,512 (0.07%)        NA                    1,826 (0.00%)            74%      NA       77%
TCGA normal lung
  CpG              4,153,196 (99.52%)   10,995 (0.01%)        10,511 (0.01%)           76%      84%      85%
  CpH              5,460 (0.13%)        85,031,960 (99.96%)   16,420 (0.01%)           1%       1%       1%
  DpN              5,310 (0.13%)        13,725 (0.02%)        133,490,905 (99.98%)     NA       NA       NA
  CpG/CpH het      6,682 (0.16%)        8,529 (0.01%)         NA                       37%      39%      NA
  CpG/RpG het      2,476 (0.06%)        NA                    1,993 (0.00%)            80%      NA       78%
TCGA normal breast
  CpG              4,100,643 (99.54%)   10,893 (0.01%)        10,657 (0.01%)           75%      85%      86%
  CpH              5,286 (0.13%)        80,654,084 (99.96%)   13,390 (0.01%)           1%       1%       1%
  DpN              4,954 (0.12%)        13,310 (0.02%)        136,180,779 (99.98%)     NA       NA       NA
  CpG/CpH het      6,289 (0.15%)        8,120 (0.01%)         NA                       39%      40%      NA
  CpG/RpG het      2,413 (0.06%)        NA                    1,854 (0.00%)            78%      NA       79%
Xie 2012 Mouse F1i (chr1)
  CpG              2,125,320 (97.51%)   10,990 (0.02%)        11,757 (0.01%)           76%      83%      84%
  CpH              4,314 (0.20%)        57,706,841 (99.87%)   20,312 (0.02%)           3%       3%       3%
  DpN              5,300 (0.24%)        20,905 (0.04%)        118,570,097 (99.96%)     NA       NA       NA
  CpG/CpH het      28,896 (1.33%)       36,735 (0.06%)        NA                       43%      42%      NA
  CpG/RpG het      15,754 (0.72%)       NA                    12,917 (0.01%)           78%      NA       82%
Xie 2012 Mouse F1r (chr1)
  CpG              2,199,907 (97.52%)   11,268 (0.02%)        11,974 (0.01%)           75%      83%      84%
  CpH              4,476 (0.20%)        58,685,115 (99.87%)   20,933 (0.02%)           3%       3%       4%
  DpN              5,171 (0.23%)        20,765 (0.04%)        117,647,445 (99.96%)     NA       NA       NA
  CpG/CpH het      29,983 (1.33%)       38,159 (0.06%)        NA                       43%      42%      NA
  CpG/RpG het      16,371 (0.73%)       NA                    13,147 (0.01%)           78%      NA       82%
Table 2.2: Genome-wide cytosine counts and methylation
"het" signifies heterozygous. Two non-reference bases in a row are automatically filtered out. CpH=C(A/C/T).
DpN=(A/T/G)(A/C/T/G). RpG=(A/G)G. CpG/TpG heterozygous genotypes are filtered out because they
cannot be used for methylation calling.
Chapter 3
NOMeToolkit: Integrative computational pipeline for NOMe-seq and whole genome bisulfite sequencing
Most of this chapter is modified from a manuscript prepared for publication, on which
I am the first author and share co-first authorship with Dr. Clayton Collings. The
part on inverted duplication reads is modified from the results of a collaborative project with
the German Cancer Research Center (DKFZ).
3.1 Introduction
Transcriptional regulation in eukaryotes is mediated by chromatin dynamics including
DNA methylation and the positioning of individual nucleosomes. Genome-wide epige-
nomic profiling has identified important attributes of each of these two marks, but very
little is known about their interaction in mammalian cells. NOMe-seq profiles GpC
nucleosomal footprints and endogenous CpG methylation simultaneously within individual
DNA molecules, which allows us to investigate the interactions between these two important
marks on a single chromosome of a single cell (Kelly et al., 2012). Currently, there
is no publicly available computational tool for the analysis of NOMe-seq data.
Here, we present NOMeToolkit, a comprehensive collection of tools for quality
control, methylation/accessibility calling, segmentation and visualization of NOMe-seq data.
Because of the intrinsic similarity between NOMe-seq and whole genome bisulfite sequencing
(WGBS), most of the quality control and methylation calling parts (Bis-SNP pipeline)
can also be used directly for WGBS.
3.2 Results and discussion
3.2.1 NOMeToolkit workflow
NOMeToolkit includes several components, which are outlined in Figure 3.1. It is
implemented in Java, Perl and R scripts that can be run across different
platforms, and it accepts standard FASTQ files (.fastq format). NOMeQC performs
quality control both before and after read mapping. Low-quality
reads are removed or trimmed to increase mapping efficiency and accuracy.
Bis-SNP, described in Chapter 2 and published previously (Liu et al., 2012), has been
wrapped into a user-friendly all-in-one Perl script. It outputs CpG
methylation calls for WGBS and GCH/HCG methylation calls for NOMe-seq. The methylation
information is written in .6plus2.bed and .bw formats for downstream analysis and
visualization. The script also outputs SNP information and GCH/HCG information within each
read. The accessibility information (GCH methylation level) is then segmented into
Nucleosome Depleted Regions (NDRs), linkers and nucleosome-occupied regions by
NdrHmmHunter. AlignWig2Loc is used to visualize signal intensity from NOMe-seq,
external ChIP-seq data, or sequence motifs in genomic regions of interest. NOMeClonePlot
is an R script used to visualize locus-specific NOMe data.
Figure 3.1: NOMeToolkit workflow
NOMeToolkit includes four major components: NOMeQC, the Bis-SNP pipeline, segmentation and visualization.
Each circle represents a module in the component. NOMeQC performs bisulfite-specific and NOMe-seq-specific
quality control before and after read mapping. The Bis-SNP pipeline outputs accurate genotype
and methylation information for any arbitrary cytosine context, including HCG/GCH in NOMe-seq. The
segmentation component segments accessibility information into NDRs, linkers and nucleosome occupancy
information in a single sample, or into dynamic NDRs across multiple samples. The annotation module annotates
these segmentation results with external datasets and calculates the odds ratio relative to random
segments. DNA methylation and accessibility information from the Bis-SNP pipeline, normalized ChIP-seq signals
from bam2normalizedwig and sequence motif signals are integrated for visualization at any genomic region
of interest (e.g. Transcription Start Sites (TSS) or CTCF motifs). NOMeClonePlot is used to align and
visualize locus-specific NOMe datasets.
3.2.2 Quality Control (QC) before reads mapping
3.2.2.1 Adapter contamination estimation and adapter trimming
During library preparation for next-generation sequencing, DNA fragments containing
adapter sequences may be sequenced, which leads to adapter contamination that
greatly affects read mapping efficiency and accuracy. Since cytosines in Illumina
TruSeq adapters are all methylated, adapter contamination also biases
the estimation of DNA methylation levels. Several adapter contamination detection and
trimming tools are currently available for bisulfite-seq data. Fastq-mcf (Aronesty, 2011)
is used in NOMeToolkit to report and trim adapter sequences for both single-end
and paired-end reads.
3.2.2.2 Inverted duplication reads
The part on inverted duplication reads is modified from the results of a collaborative project
with the German Cancer Research Center (DKFZ).
A large fraction of inverted duplication reads was discovered in some poor-quality
paired-end NOMe-seq libraries, as shown in Figure 3.2a. A similar phenomenon was
discovered in WGBS libraries at almost the same time at the German Cancer Research Center (DKFZ).
Read-filtering code inside early versions of the GATK UnifiedGenotyper also indicates that a similar
problem exists in non-bisulfite-treated paired-end genomic sequencing data. The occurrence
of inverted duplication reads seems to be correlated with small insert size during
the library preparation step, but the underlying mechanism is still unclear. These
inverted duplication reads greatly decrease read mapping efficiency and accuracy.
The inverted duplication read filter inside GATK depends on read mapping information,
which may already be biased by incorrect mapping of the reads. Therefore, we
developed InvertDupsHunter to detect inverted duplication reads from the mismatches
between the paired reads, even before read mapping. The new tool can also increase
read mapping efficiency by trimming those inverted duplication reads.
Four 100bp paired-end libraries generated for the Illumina HiSeq2000 platform were
selected as test examples: two libraries with poor read mapping efficiency as the bad
libraries (B1 and B2 in Figure 3.2b), and two libraries with good read mapping
efficiency as the control libraries (C1 and C2 in Figure 3.2b). Each library group
contains one library from NOMe-seq and one from WGBS. One million reads were
randomly sampled from the BAM file of each sequencing library. Read pairs with 0 mismatches
between the two ends in the first 20bp were defined as inverted duplication
reads by InvertDupsHunter. In the bisulfite-treated libraries (NOMe-seq/WGBS), a base T in the 1st end
paired with a base C in the 2nd end, or a base G in the 1st end paired with a base A in the 2nd end,
was considered a match rather than a mismatch.
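The detection rule above can be sketched in a few lines. This is a minimal illustration of the rule as described in the text, not the InvertDupsHunter source code: the function names, the `window` and `bisulfite` parameters, and the decision to reject reads shorter than the window are assumptions made for the example.

```python
# Hypothetical sketch of the detection rule described in the text.

# (base in 1st end, base in 2nd end) pairs treated as matches in
# bisulfite-treated libraries, in addition to identical bases.
BISULFITE_EQUIVALENT = {("T", "C"), ("G", "A")}


def bases_match(first: str, second: str, bisulfite: bool = True) -> bool:
    """Return True if a 1st-end base is considered to match a 2nd-end base."""
    if first == second:
        return True
    return bisulfite and (first, second) in BISULFITE_EQUIVALENT


def count_mismatches(end1: str, end2: str, window: int = 20,
                     bisulfite: bool = True) -> int:
    """Count mismatches between the two ends over the first `window` bases."""
    return sum(
        not bases_match(a, b, bisulfite)
        for a, b in zip(end1[:window], end2[:window])
    )


def is_inverted_duplicate(end1: str, end2: str, window: int = 20,
                          bisulfite: bool = True) -> bool:
    """Flag a read pair whose first `window` bases show 0 mismatches."""
    # Assumption: pairs shorter than the window are never flagged.
    if len(end1) < window or len(end2) < window:
        return False
    return count_mismatches(end1, end2, window, bisulfite) == 0
```

Setting `bisulfite=False` gives the stricter comparison that would apply to libraries that were not bisulfite-treated.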
Figure 3.2b shows that most of the reads in the B1 and B2 libraries are inverted
duplication reads, while reads in the C1 and C2 libraries do not show this pattern.
InvertDupsHunter identified 71% of the total reads in the B1 library and 61% of the total
reads in the B2 library as inverted duplication reads, while in control libraries C1 and C2, only
9% and 2% of the total reads were identified, respectively.
Inverted duplication reads cause incorrect read mapping, as indicated
by the high mismatch rate relative to the reference genome (Figure 3.2c). By
default, InvertDupsHunter scans the matches between the two ends until up to 3
mismatches are encountered, and the rest of each read is trimmed off. The 2nd-end reads are then
discarded because, after trimming, they duplicate the information in the 1st-end reads.
Figure 3.2c shows that the trimming decreases the mismatch rate, which
indicates more accurate read mapping. In addition, read mapping efficiency was greatly
improved in the bad libraries, while the efficiency in the control libraries was not affected (Table 3.1).
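One possible reading of the default trimming rule is sketched below. It is only an interpretation of the text, not the InvertDupsHunter implementation: the function name is hypothetical, and the sketch assumes that the scan stops at the base where the mismatch count would exceed `max_mismatches`, that the scanned prefix of the 1st end is kept, and that the 2nd end is dropped by the caller.

```python
# Hypothetical sketch; the exact stopping rule is an assumption.

# Pairs (1st-end base, 2nd-end base) counted as matches after bisulfite treatment.
BISULFITE_EQUIVALENT = {("T", "C"), ("G", "A")}


def trim_first_end(end1: str, end2: str, max_mismatches: int = 3) -> str:
    """Return the prefix of the 1st end scanned before the mismatch limit is exceeded.

    The two ends are compared base by base from the start of each read.
    Scanning stops at the first base that would bring the mismatch count
    above `max_mismatches`; everything from that base onward is trimmed.
    """
    mismatches = 0
    keep = 0
    for a, b in zip(end1, end2):
        if a != b and (a, b) not in BISULFITE_EQUIVALENT:
            mismatches += 1
            if mismatches > max_mismatches:
                break
        keep += 1
    return end1[:keep]


if __name__ == "__main__":
    # Three mismatches are tolerated; the fourth ends the scan.
    print(trim_first_end("AAAAAAAAAA", "AAAAACCCCC"))
```

The trimmed 1st end would then be mapped as a single-end read, which matches the "Single-end (after InvertDupsHunter trimming)" wording in the caption of Table 3.1.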
Some subsets of inverted duplication reads showed another characteristic: the 2nd-end
reads were the reverse complement of the 1st-end reads. This might indicate that some
inverted duplication reads form because of hairpin structure formation during
library construction (Figure 3.2d). Further investigation of the mechanism of inverted
duplication read formation is still needed.
3.2.3 QC after reads mapping
The estimation of coverage distribution inside/outside CGIs was done with a script by Zack Ramjan.
Library name      Before trimming   Trimmed adapter   Trimmed adapter + inverted duplication reads
Bad libraries
B1 (NOMe-seq)     10%               27%               82%
B2 (WGBS)         22%               23%               88%
Control libraries
C1 (NOMe-seq)     78%               78%               78%
C2 (WGBS)         80%               80%               81%
Table 3.1: Reads mapping efficiency before and after InvertDupsHunter trimming
Reads were mapped by BSMAP v2.6 with parameters (-s 16 -v 10 -q 2 -z 64). Each percentage
is the percentage of reads correctly mapped after the preprocessing. Correctly mapped reads are
single-end reads (after InvertDupsHunter trimming) and paired-end reads that are uniquely mapped and are not PCR
duplicates. Additionally, paired-end reads are required to be properly paired.
QC after read mapping includes several components: bisulfite conversion
rate, NOMe-seq enzyme accessibility/leak efficiency, coverage distribution inside/outside
CGIs, and library complexity.
The bisulfite conversion rate was estimated at two levels: the library level and the
read level. The library-level bisulfite conversion rate was estimated from the CpH/CpG
(HCH/HCG for NOMe-seq) methylation level in the mitochondrial genome and the spiked-in lambda
genome. The CpH (HCH for NOMe-seq) methylation level within each read was summarized
as a histogram, and reads with CpH methylation above a threshold were considered
incompletely converted. We also observed a higher incomplete conversion rate at
the 5' end of reads, as described in Chapter 2. Therefore, the CpG methylation status in a read was
only counted after the first successfully converted cytosine.
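The 5' filtering idea can be illustrated with a small sketch. This is not the Bis-SNP code: it assumes an ungapped forward-strand alignment of a read to its reference, assumes that the "first successfully converted cytosine" is the first reference CpH that reads as T, and ignores the HCG/HCH contexts used for NOMe-seq. The function name and the return format are hypothetical.

```python
# Hypothetical sketch of counting CpG calls only after the first converted CpH.

def cpg_calls_after_first_conversion(ref: str, read: str) -> list:
    """Return (position, is_methylated) for CpGs after the first converted CpH.

    `ref` is the reference sequence under the read (same strand, no indels)
    and must be at least as long as `read`.
    """
    seen_conversion = False
    calls = []
    for i, base in enumerate(read):
        # Only reference cytosines with a known next base are informative.
        if ref[i] != "C" or i + 1 >= len(ref):
            continue
        is_cpg = ref[i + 1] == "G"
        if not seen_conversion:
            # A reference CpH read as T shows that conversion has worked here.
            if not is_cpg and base == "T":
                seen_conversion = True
            continue
        if is_cpg and base in ("C", "T"):
            calls.append((i, base == "C"))  # C = methylated, T = unmethylated
    return calls


if __name__ == "__main__":
    reference = "ACGTCATCGACGT"
    read = "ACGTTATCGATGT"
    # The CpG at position 1 precedes the first converted CpH (position 4)
    # and is therefore skipped.
    print(cpg_calls_after_first_conversion(reference, read))
```

In this toy example the leading CpG is excluded because no conversion evidence has been seen yet, while the two later CpGs are reported as methylated and unmethylated, respectively.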
In theory, the M.CviPI enzyme should only methylate accessible GpC sites. In the
previous study (Kelly et al., 2012) (also described in Chapter 4), we observed some
minor off-target enzyme activity at CpC sites. The DNA methylation levels of different tri-nucleotide (NCN)
combinations in autosomes and mitochondria were summarized to estimate
the enzyme off-target efficiency. The enzyme accessibility efficiency was also estimated
from the average accessibility plot around conserved CTCF motifs and CGI promoters.
A previous study showed that next-generation sequencing has lower coverage in CpG island
areas (Wang et al., 2011), which limits the estimation of methylation levels at CpG islands.
We estimated the coverage drop-out ratio from the coverage within CGIs and in a random region outside CGIs.
A CGI drop-out ratio above 0.6-0.7 is usually considered to indicate good coverage quality in
CGIs. We used PRESEQ to estimate library complexity (Daley and Smith, 2013).
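The text does not give a formula for the drop-out ratio. One plausible definition, used here purely as an assumption, is the mean coverage inside CGIs divided by the mean coverage in the random regions outside CGIs, so that values near 1 mean little coverage loss. The function name and the example numbers are illustrative.

```python
from statistics import mean


def cgi_dropout_ratio(cgi_coverage, background_coverage) -> float:
    """Mean coverage inside CGIs divided by mean coverage outside CGIs.

    Assumed definition; both arguments are sequences of per-region (or
    per-base) coverage values.
    """
    background = mean(background_coverage)
    if background == 0:
        raise ValueError("background coverage is zero; ratio is undefined")
    return mean(cgi_coverage) / background


if __name__ == "__main__":
    ratio = cgi_dropout_ratio([14, 12, 16], [20, 20, 20])
    print(f"CGI drop-out ratio: {ratio:.2f}")
```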
3.2.4 Bis-SNP pipeline
The Bis-SNP software (described in Chapter 2 and in Liu et al., 2012) accepts either single-end
or paired-end mapped BS-seq/NOMe-seq/RRBS data in the form of BAM files and
outputs SNP and methylation information in standard VCF format and in
bed/bedDetail/bedGraph/wig formats, respectively. NOMeToolkit provides an updated,
easy-to-use version of Bis-SNP that is intended to prevent user error and to streamline
data processing. In addition to obtaining genotype statistics, the methylation information
from NOMe-seq data in bed and bigWig formats can be further analyzed with
the NdrHmmHunter and AlignWig2Loc programs.
3.2.5 Beta-binomial HMM based segmentation
NdrHmmHunter is a Java-based software tool that uses a two-state beta-binomial Hidden
Markov Model (HMM) algorithm (Molaro et al., 2011) to infer different segments in DNA
accessibility, as shown in Figure 3.3a-b. For a sequence of $n$ GCHs in a contiguous
chromosomal region, let $p_i$ denote the true probability that GCH $i$ is methylated. We
assumed that $p_i$ follows a beta distribution, $p_i \sim \mathrm{Beta}(\alpha, \beta)$. We estimated $\hat{p}_i = k_i / n_i$,
where $k_i$ represents the number of methylated reads and $n_i$ represents the total number of
methylated and unmethylated reads. In the forward-backward training step, as described in the
upper panel of Figure 3.3a, we used the EM algorithm to fit the beta distribution and estimate
the maximum likelihood parameters ($\alpha_j, \beta_j$, where $j = 0$ or $1$) for each of the two
states ($\theta_0, \theta_1$), as described in the supplementary methods of Molaro et al. (2011).
The initial probability was initialized randomly before the iteration, then updated as
the fraction of GCHs belonging to each state after each iteration. In the decoding step,
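we incorporated GCH coverage information and used a beta-binomial distribution to
calculate the likelihood of observations from a particular emission state $\theta_j$, as in Equation
3.1:
$$
\begin{aligned}
\Pr(k_i \mid n_i; \theta_j) &= \Pr(k_i \mid n_i; \alpha_j, \beta_j) \\
&= \int_0^1 L(k_i \mid p_i)\, \pi(p_i \mid \alpha_j, \beta_j)\, dp_i \\
&= \binom{n_i}{k_i} \frac{1}{B(\alpha_j, \beta_j)} \int_0^1 p_i^{k_i + \alpha_j - 1} (1 - p_i)^{n_i - k_i + \beta_j - 1}\, dp_i \\
&= \binom{n_i}{k_i} \frac{B(k_i + \alpha_j,\; n_i - k_i + \beta_j)}{B(\alpha_j, \beta_j)}
\end{aligned}
\tag{3.1}
$$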
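The closed form in the last line of Equation 3.1 is straightforward to evaluate in log space, which avoids overflow of the binomial coefficient and the beta functions at high coverage. The sketch below uses only the Python standard library; the function names are illustrative and are not taken from NdrHmmHunter.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_binomial_coefficient(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def beta_binomial_log_likelihood(k: int, n: int, alpha: float, beta: float) -> float:
    """log Pr(k | n; alpha, beta) for the last line of Equation 3.1.

    k: number of methylated reads at a GCH, n: total reads at that GCH,
    alpha, beta: beta-distribution parameters of one HMM state.
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    return (
        log_binomial_coefficient(n, k)
        + log_beta(k + alpha, n - k + beta)
        - log_beta(alpha, beta)
    )


if __name__ == "__main__":
    # The probabilities over all possible k should sum to 1.
    total = sum(exp(beta_binomial_log_likelihood(k, 10, 2.0, 5.0)) for k in range(11))
    print(f"sum over k = {total:.6f}")
```

With alpha = beta = 1 the beta prior is uniform, and the expression reduces to 1 / (n + 1) for every k, which is a convenient sanity check.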
The posterior probability that each GCH belongs to the Methyltransferase Accessible
Regions (MARs) state or the Methyltransferase Protective Regions (MPRs) state was
then calculated, and the Viterbi algorithm was used to determine the state. Adjacent
GCHs in the same state were connected into a segment. The boundary
coordinates of each segment were determined by the posterior probability of entering or exiting
the segment. The significance of each MAR was calculated by comparing it with all
MPRs within the surrounding +/-100kb region using a one-way binomial test or beta-binomial test.
The same procedure was applied to assess the significance of MPRs. Significant MARs
longer than 100bp were identified as NDRs, while shorter significant MARs
(<100bp) were identified as linkers. Significant MPRs with lengths between 100bp and
200bp were considered mono-nucleosomes. The lengths of these segment subtypes
can be specified by the user. Furthermore, users can also alter other parameters
of the training and decoding steps, which include but are not limited to coverage and
GCH content thresholds, P value cutoffs, and window sizes for observation.
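The final segment-calling step can be sketched as follows. This is a simplified, hypothetical illustration rather than the NdrHmmHunter code: it assumes the per-GCH states have already been decoded, takes segment boundaries from the first and last GCH of each run instead of from posterior probabilities, and assumes every segment has already passed the significance test. The function names and the "other MPR" label are made up for the example.

```python
from itertools import groupby


def merge_adjacent_states(gch_states):
    """Merge consecutive GCHs in the same state into (start, end, state) segments.

    `gch_states` is a list of (position, state) tuples sorted by position,
    where state is "MAR" or "MPR".
    """
    segments = []
    for state, run in groupby(gch_states, key=lambda item: item[1]):
        positions = [pos for pos, _ in run]
        segments.append((positions[0], positions[-1], state))
    return segments


def classify_segment(start, end, state, ndr_min=100, nuc_min=100, nuc_max=200):
    """Apply the default length rules from the text to one significant segment."""
    length = end - start + 1
    if state == "MAR":
        # The text leaves a MAR of exactly 100bp unspecified; here it is a linker.
        return "NDR" if length > ndr_min else "linker"
    if nuc_min <= length <= nuc_max:
        return "mono-nucleosome"
    return "other MPR"


if __name__ == "__main__":
    decoded = [(100, "MAR"), (150, "MAR"), (250, "MAR"),
               (300, "MPR"), (380, "MPR"), (450, "MPR"),
               (500, "MAR"), (530, "MAR")]
    for start, end, state in merge_adjacent_states(decoded):
        print(start, end, state, classify_segment(start, end, state))
```

The keyword arguments mirror the statement that the segment lengths can be specified by the user.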
3.2.6 Annotation and visualization of NOMe-seq data and integration of external datasets
NOMeClonePlot is based on an R script by Dr. Clayton Collings and Ying Wu.
AlignWig2Loc is a Perl script that can align NOMe-seq data in wig format as well as
other types of NGS data (BS-seq, ChIP-seq, MNase-seq, DNase-seq, RNA-seq, etc.) to
a given set of coordinates and is integrated with several R scripts to produce publication-
quality figures that can contain multiple panels for multiple experiments. AlignWig2Loc
was equipped with several options that allow the user to generate average plots, density
bar plots, or heatmaps (as shown in Figure 3.3c) with the ability to adjust color, labels,
and axes. Notably, the AlignWig2Loc program has been previously implemented to align
NOMe-seq and ChIP-seq data to transcription start sites, CTCF sites, and nucleosome
depleted regions (identified by NdrHmmHunter) in IMR90, HCT116, and DKO1 cells
(details described in Chapters 4 and 5).
In addition to investigating DNA methylation and nucleosome occupancy, ChIP-seq
could be used to supplement NOMe-seq in order to gain further insights into epigenomic
mechanisms. For these integrated types of analyses, the Bam2NormalizedWig software
could process ChIP-seq data in a variety of ways and ultimately outputs the data in
wig format, which in turn, could be used as input into the AlignWig2Loc program. For
example, Bam2NormalizedWig could give the user the option to normalize ChIP-seq
data by the well-established Wiggler method (ENCODE Project Consortium, 2011). To
normalize variation between biological replicates, Bam2NormalizedWig performs a Z-score
transformation: it subtracts the genome-wide mean Wiggler value from each value and
divides the result by the genome-wide standard deviation (Xie et al.,
2013).
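The Z-score transformation amounts to the short computation below. It is a minimal sketch, not the Bam2NormalizedWig code; the function name is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption.

```python
from statistics import mean, pstdev


def zscore_transform(values):
    """Subtract the genome-wide mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("signal is constant; Z-score is undefined")
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    signal = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    print(zscore_transform(signal))
```

After the transformation each replicate has mean 0 and standard deviation 1, which places replicates with different sequencing depths on a common scale.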
NOMeToolkit also provides an auxiliary R software package called NOMeClonePlot
that handles locus-specific, low-throughput NOMe data. With a reference
sequence and .seq files from one or more clones as input, NOMeClonePlot evaluates
bisulfite conversion, performs alignments using the Smith-Waterman method, and outputs
a bubble plot that conveys endogenous methylation and DNA accessibility for individual
cytosines within HCG and GCH contexts, respectively. NOMeClonePlot is unique
in that it analyzes endogenous methylation and accessibility simultaneously and generates
figures in the traditional format published in studies from Dr. Peter Jones's laboratory.
Figure 3.2: Inverted duplication reads
(a) An example of a paired-end read is shown. Orange circles represent the matches between the two ends,
while the blue line and shaded red rectangle indicate the mismatches between the two ends. (b) The global
match and mismatch rate in each library is shown as a heatmap. Yellow represents matches between
the two ends, while blue represents mismatches. Each row represents one pair of reads, while each column
represents the relative position within each read. Hierarchical clustering is performed to group read pairs
with similar patterns. (c) The mismatch rate of the reads compared with the reference genome indicates the
incorrect mapping of inverted duplication reads. After trimming by InvertDupsHunter, the mismatch rate
returns to a level similar to that of the control libraries. (d) The 2nd-end read in the B1 and B2 libraries is
transformed into its reverse complement. Hierarchical clustering is performed to group read pairs with similar patterns.
Figure 3.3: Two-state beta-binomial HMM segments NOMe-seq accessibility signal
(a) The details of the training and decoding steps in the two-state beta-binomial HMM. (b) NdrHmmHunter
workflow for segmenting NOMe-seq accessibility signals. GCH methylation values and coverage are input
into the HMM. MARs and MPRs are identified by the model. MARs and MPRs with FDR-corrected
p values less than 0.01 are further processed by length criteria to identify NDRs (>100bp), linkers (<100bp)
and mono-nucleosomes (100-200bp). (c) A heatmap visualization example generated by AlignWig2Loc.
NOMe-seq DNA methylation level (blue represents unmethylated, yellow represents methylated) and
accessibility level (white represents no accessibility, green represents high accessibility) are aligned to the center
of the NDRs and plotted +/-1 kb. NDRs are hierarchically clustered based on accessibility level, and
the clusters are separated by dashed horizontal red lines. Each NDR in both cell lines is annotated based
on its genomic location as defined by the chromHMM model for the Human Mammary Epithelial Cell (HMEC)
line in ENCODE. The total number of NDRs contained in each cluster is indicated.
Chapter 4
NOMe-seq: Genome-wide mapping of nucleosome positioning and DNA methylation with single molecule sequencing
Most parts of this chapter are modified from a manuscript published in Genome
Research, on which I share co-first authorship with Dr. Terry Kelly. The part on
the relationship between DNA methylation and linkers is modified from a manuscript posted by Dr.
Ben Berman on the PeerJ preprint server, on which I am the second author. The parts
on NOMe-seq in fresh-frozen tumor tissue and NOMe-seq for transcription
factor binding affinity are pilot projects carried out in collaboration with Dr. Fides Lay. The part
on NOMe-seq in NSC and GBM cells is modified from the results of a project
conducted by Drs. Terry Kelly and Ben Berman, in which I was primarily responsible for all of
the data analysis.
4.1 Introduction
Epigenetic mechanisms including DNA methylation and nucleosome positioning work
together to generate specific chromatin states which facilitate, inhibit, or allow for the
potential of gene expression. Active promoters have unmethylated DNA and lack nucleosomes
just upstream of the gene's transcriptional start site (TSS), while inactive promoters
have densely packed nucleosomes and can be unmethylated (poised or repressed) or
methylated (silent). Due to this variety of chromatin structures, gene activation potential
cannot be predicted by looking at nucleosome occupancy or DNA methylation alone.
Pioneering work by Michael Kladde and colleagues has demonstrated the ability
of methyltransferase-based footprinting to determine nucleosome positioning in yeast
and mammalian cells (Xu et al., 1998; Jessen et al., 2004; Kilgore et al., 2007; Pardo
et al., 2011). Using next generation sequencing, we describe a genome-wide nucleo-
some footprinting method termed NOMe-seq (nucleosome occupancy and methylome
sequencing), which uses a GpC methyltransferase (M.CviPI) (Xu et al., 1998) to obtain
nucleosome positioning information based on enzyme accessibility to GpC sites, while
obtaining endogenous DNA methylation information at the same time from CpG sites.
Importantly, both pieces of epigenetic information are obtained from the same individ-
ual DNA molecule, revealing the relationship between these two chromatin features on a
single chromosome. Thus, using a single methodology, one can generate genome-wide
maps of multiple epigenetic modifications at the single molecule level.
Using NOMe-seq with whole-genome bisulfite sequencing, we generated an integrated
map and showed that distinct nucleosome/methylation configurations are associated
with specific genomic features, that the strength of the NDR upstream of the TSS
is indicative of expression level, and that this NDR can accommodate several nucleosomes. By
examining promoters with reads from two distinct chromatin states, as defined by nucleosome
occupancy and methylation, we identified genes likely to be in two divergent allelic
states, which are strongly enriched on the X chromosome and at known imprinted loci.
Simultaneously measuring nucleosome occupancy and DNA methylation within individ-
ual DNA strands is an important tool for examining how chromatin structure across the
genome is altered in disease states.
4.2 Materials and methods
4.2.1 Cell culture
Cell culture was done by Drs. Terry Kelly and Fides Lay
IMR90 cells were cultured according to ATCC recommendations. Primary GBM cells
were cultured as previously described (Laks et al., 2009). Neural stem cells (NSCs)
were obtained from Dr. Ruchi Bajpai and cultured as described (Rada-Iglesias et al., 2011). Briefly, neurosphere
media contained DMEM/F12 supplemented with B27 (GIBCO), bFGF (20 ng/mL, R&D
Systems Inc.), epidermal growth factor (EGF; 50 ng/mL, Peprotech), penicillin/streptomycin
(1%, Invitrogen), and heparin (5 mg/mL, Sigma-Aldrich). Heparin, bFGF, and
EGF were added to the media every 3 or 4 d. Spheres were passaged every 7 to 14 d
following dissociation with TrypLE Express (Invitrogen).
4.2.2 Nucleosome footprinting
Nucleosome footprinting was done by Drs. Terry Kelly and Fides Lay
NOMe-seq is a modified version of our methylation-dependent single promoter
assay (Miranda et al., 2010). Nuclei from IMR90 cells (ATCC) were isolated as pre-
viously described (Miranda et al., 2010). Previous publications using locus-specific
NOMe-seq have used the minimal amount of M.CviPI that resulted in optimal footprinting
of the specific region of interest: 100 units (Wolff et al., 2010), 200 units (Taberlay
et al., 2011; You et al., 2011), or 200 + 100 units (Andreu-Vieyra et al., 2011).
Since whole-genome NOMe-seq required accurate footprinting of a variety of genomic
regions, we performed a dose response curve (Fig. 1); nuclei were incubated with 100
or 200 units of GpC methyltransferase (M.CviPI) and S-adenosyl methionine (SAM)
for 15 min at 37°C, or with 200 units of GpC methyltransferase (M.CviPI) and SAM for 7.5
min at 37°C followed by a boost with an additional 100 units of M.CviPI and SAM for 7.5
min. For whole-genome NOMe-seq, libraries were generated from nuclei that were
incubated with 200 units of GpC methyltransferase (M.CviPI) and SAM for 7.5 min at
37°C followed by a boost with an additional 100 units of M.CviPI and SAM for 7.5 min. The
reaction was stopped, DNA extracted and bisulfite-converted to distinguish methylated
from unmethylated Cs. For individual regions of interest, PCR was performed, using
PCR primers that do not contain any CpG or GpC dinucleotides, followed by TA cloning
and sequencing. Sequences of PCR primers are available upon request.
4.2.3 Nucleosome footprinting in fresh-frozen colon tumor
Nucleosome footprinting was done by Dr. Fides Lay.
Human colon tumors were collected in accordance with institutional guidelines and
processed per TCGA standard requirements (TCGA, 2012). The tissue was cut and
placed in a 2 mL cryocentrifuge tube. The whole tube was submerged in isopentane
at -80 °C for 60-90 seconds and stored at -80 °C long-term. Briefly, colon tissue was cut
into 1-3 mm pieces while still frozen and resuspended in DMEM media, using 10 mL of
media per gram of tissue in a conical tube. To cross-link, 37% formaldehyde was added
to a final concentration of 1% and the tube was gently rotated at room temperature
for no more than 15 minutes. Cold glycine was added to a final concentration of 0.125 M
to stop the reaction and the tube was rotated for an additional 5 minutes at
room temperature. After cross-linking, the tissue was washed twice with cold PBS. To
prepare the nuclei, the tissue was resuspended in 1 mL cold PBS per 100 mg of tissue and
dounced for 20 strokes using a chilled Dounce homogenizer. After centrifugation and
removal of supernatant, the nuclear pellet was washed with ice-cold NOMe-seq wash
buffer. Nuclei were resuspended in 1X GpC buffer and sonicated for 3 cycles of 30 s
on and 30 s off using the Bioruptor system (Diagenode). NOMe-seq was performed as
previously described with the following modifications: the nuclei were incubated with
100 U M.CviPI and 1.5 µL of SAM for 60 minutes followed by a boost of an additional
100 U M.CviPI and 0.75 µL of SAM for 60 minutes twice before stopping the reaction.
The chromatin was reverse cross-linked overnight at 65 °C and DNA was extracted using
a standard phenol-chloroform protocol and ethanol precipitation.
4.2.4 Library construction and sequencing
Library construction was done by Drs. Terry Kelly and Fides Lay.
For NOMe-seq, libraries were prepared from 5 µg of DNA as previously described
(Lister et al., 2009; Kelly et al., 2010; Berman et al., 2012). Briefly, M.CviPI-treated
DNA was fragmented into 200-bp pieces, end-repaired (Epicentre), ligated to methylated
adaptors (Illumina), bisulfite-converted (Zymo EZ DNA methylation), and subjected to
PCR. Clusters were generated following Illumina protocols, and the resulting library
was sequenced on an Illumina HiSeq 2000 using the 76-bp single-end configuration.
Neural stem cell (NSC) and glioblastoma (GBM) samples were prepared using the same approach,
except that they were sequenced using the Illumina HiSeq 2000 paired-end protocol.
Base calling was performed by Illumina Real Time Analysis (RTA) software, yielding a
total of 1,180 million reads that passed the Illumina quality filter (IMR90). NSC was
sequenced with one lane of 100-bp paired-end (375 million reads). GBM culture #157
was sequenced with one lane of 50-bp paired-end (310 million reads) and one lane of
100-bp paired-end (291 million reads), while culture #248 was sequenced with one lane
of 50-bp paired-end (313 million reads) and one lane of 100-bp paired-end (301 million
reads).
4.2.5 Low-input generation of whole-genome bisulfite sequencing library
This part was done by Dr. Fides Lay.
WGBS libraries were generated using the EpiGnome Methyl-Seq Kit (Epicentre).
Briefly, 50 ng of bisulfite-converted DNA was denatured and the DNA synthesis primer was
annealed at 95 °C for 5 minutes. A DNA copy was subsequently synthesized and excess
random primer was removed by exonuclease digestion. In the second round of DNA
synthesis, a terminal-tagging oligo was annealed to generate di-tagged cDNA, which was
subsequently purified using AMPure magnetic beads (BD) and whole-genome amplified
using high-fidelity Taq polymerase, yielding adaptor-tagged whole-genome bisulfite
sequencing libraries. Libraries were sequenced on a HiSeq 2000 using the 75PE method
and raw data were processed based on previously described methods (Berman et al.,
2012; Kelly et al., 2012; Liu et al., 2012; also described in Chapters 2 and 3).
4.2.6 Sequence alignment and extraction of CG and GC methylation levels
Genomic alignment and bisulfite sequence analysis were performed largely as previously
described (Berman et al., 2012; Chapter 3), with some adjustments for paired-end
sequencing. For single-end IMR90 libraries, MAQ (Li et al., 2008) was used with the
-c bisulfite mode (as in Berman et al., 2012), and for the paired-end NSC and GBM libraries,
BSMAP (Xi and Li, 2009) was used. IMR90 reads were aligned to the NCBI reference
genome hg18, and NSC and GBM sequences to hg19. Genomic alignments with a mapping
quality of less than 30 were filtered out, resulting in 678 million reads (IMR90), 375
million (NSC), 587 million (GBM #157), and 691 million (GBM #248). For IMR90, NSC
and GBM cells, we removed reads starting at exactly the same genomic position as
another read (PCR duplicate reads), yielding a total of 156 million analyzable reads for
IMR90 (11.8 gigabases). For the paired-end NSC and GBM libraries, we additionally removed reads
that were not properly paired (proper pairs map to opposing strands within 500 bp of each other), yielding
a total of 462 million analyzable reads (34.0 gigabases) for GBM #157 and 492 million
analyzable reads (36.4 gigabases) for GBM #248. Positions overlapping SNPs were excluded
when calculating GC and CG methylation levels.
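The two read-level filters described above (mapping quality and duplicate removal) can be sketched as follows. This is a minimal illustration and not the pipeline actually used: the `Alignment` record and the `filter_alignments` function are hypothetical names, and including the strand in the duplicate key is an assumption, since the text only says that reads starting at the same genomic position were removed.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical minimal alignment record (not a real aligner output format)."""
    chrom: str
    start: int   # leftmost genomic coordinate of the read
    strand: str  # "+" or "-"
    mapq: int    # mapping quality reported by the aligner


def filter_alignments(alignments, min_mapq=30):
    """Drop alignments with mapping quality below min_mapq, then keep only
    the first read seen at each (chrom, start, strand) position."""
    seen = set()
    kept = []
    for aln in alignments:
        if aln.mapq < min_mapq:
            continue  # mapping quality of less than 30 is filtered out
        key = (aln.chrom, aln.start, aln.strand)  # strand in key is an assumption
        if key in seen:
            continue  # treated as a PCR duplicate
        seen.add(key)
        kept.append(aln)
    return kept
```

Keeping the first read at each position is one arbitrary choice; a real pipeline might instead keep the read with the highest base quality.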
We only included cytosines present in the reference genome if at least 90% of reads
mapping to the bisulfite-C strand (BCS) were C or T, and this included at least three reads. Additionally,
we only included cytosines where 90% of the reads mapped to the GGS were G
(any other base indicates a genetic variant; importantly, only the GGS strand can reveal
the C>T transitions that can lead to false methylation calling). A cytosine was determined
to be in a particular XCX trinucleotide context using the same criteria, e.g., GCH
positions were only included if 90% of reads were G for the preceding base and 90%
of the reads were A, C, or T (the IUPAC symbol H includes A, C, and T) for the following base.
Reads on the BCS strand were treated as described above, i.e., either a C or T could
match a C in either of the X context positions. This approach was used to determine
the following trinucleotides discussed in this study: HCG (H includes A, C, or T), GCG,
WCG (W includes A or T), and GCH.
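The inclusion criteria above can be sketched as simple checks on per-position base counts. This is an illustrative sketch only: the function names and the dictionary-of-counts representation are hypothetical, and keeping positions that have no GGS coverage at all is an assumption, since the text does not say how such positions were handled.

```python
def fraction(counts, bases):
    """Fraction of observed bases at a position that belong to `bases`."""
    total = sum(counts.values())
    return sum(counts.get(b, 0) for b in bases) / total if total else 0.0


def is_reliable_cytosine(bcs_counts, ggs_counts, min_frac=0.9, min_reads=3):
    """bcs_counts: base counts from reads on the bisulfite-C strand.
    ggs_counts: base counts from reads on the opposite strand, where a
    reference cytosine should always be read as G."""
    c_or_t = bcs_counts.get("C", 0) + bcs_counts.get("T", 0)
    if c_or_t < min_reads:
        return False  # fewer than three supporting C/T reads
    if fraction(bcs_counts, "CT") < min_frac:
        return False  # too many non-C/T bases on the BCS
    if sum(ggs_counts.values()) == 0:
        return True   # assumption: no GGS coverage means no evidence of a variant
    return fraction(ggs_counts, "G") >= min_frac


def is_gch_context(prev_counts, next_counts, min_frac=0.9):
    """GCH context: preceding base is G and following base is A, C, or T."""
    return (fraction(prev_counts, "G") >= min_frac
            and fraction(next_counts, "ACT") >= min_frac)
```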
As in Berman et al. (2012) and Chapter 3, we filter out the 5' ends of reads that
have apparent bisulfite nonconversion, which is common in the Illumina protocol, presumably
due to reannealing of base pairs adjacent to the adapter sequences, which are
methylated and thus have 100% base complementarity (Hansen et al., 2011; Berman
et al., 2012). We accomplish this by walking inward from the 5' end of the sequencing read
and disregarding any unconverted cytosine (in any sequence context) until the first converted
cytosine is encountered. From that point onward, including all positions 3' of it within the read, we
include all converted and unconverted cytosines in methylation counts.
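The 5' nonconversion filter can be illustrated with a short sketch. The representation is an assumption made for the example: each read is reduced to the list of calls at its reference-cytosine positions, ordered 5' to 3', with "C" meaning unconverted and "T" meaning converted. The function name is hypothetical.

```python
def trim_five_prime_nonconversion(calls):
    """Return the calls that are kept for methylation counting.

    calls: list of "C" (unconverted) or "T" (converted), one per reference
    cytosine covered by the read, ordered from the 5' end to the 3' end.
    Leading unconverted cytosines are discarded; everything from the first
    converted cytosine onward is kept.
    """
    for i, call in enumerate(calls):
        if call == "T":
            return calls[i:]
    return []  # no converted cytosine anywhere: the whole read is disregarded
```

For example, a read with calls C, C, T, C, T contributes only T, C, T to the methylation counts.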
For all downstream analyses, we included CCG trinucleotides, despite the slight off-target
M.CviPI activity at these sites, which only affects CG methylation information. Thus,
methylation averages include all HCG trinucleotides. The single exception was the
within-read combination plots (Figure 4.9), where the very large number of data points
being averaged allowed us to exclude CCGs and use only WCG trinucleotides (W: A,T).
4.2.7 Genomic element average profile plots
Methylation values were extracted from regions surrounding genomic landmarks of
interest (promoters, CTCF sites, etc.), and all methylation values were averaged within
moving windows of 20 bp for all plots (genomic positions without cytosines of the
correct type were not included in averages). Twenty bp was chosen because it is
smaller than the average distance between adjacent GCs in the genome and clearly
able to resolve nucleosome phasing/positioning (as evidenced in CTCF alignments).
Promoter positions, chromatin marks, and expression values were taken from Supplemental
Table 7 (in Kelly et al., 2012) of a previous reference (Hawkins et al., 2010)
(GEO ID GSE16256). Enhancers with an H3K4me1+/H3K4me3- profile were taken
from Supplemental Table 12 (in Kelly et al., 2012) of the same reference (Hawkins
et al., 2010) (GEO ID GSE16256). IMR90 DNase hypersensitivity data are from GEO ID
GSM468792. Histone and EP300 (also known as p300) locations from neural progenitor
cells were taken from a second reference (Rada-Iglesias et al., 2011). H3K27me3-enriched
regions are those elements beginning with R (for region) in GEO record
GSM602301, while EP300 calls are from GEO record GSM602299. H3K4me3 marks
were not included in GEO and were provided by Alvaro Rada-Iglesias (available upon
request). For CpG island and non-CpG island promoters, we used the Takai-Jones
definition (Takai and Jones, 2002). For CTCF annotations, we used evolutionarily conserved
CTCF binding motifs (Xie et al., 2007) that were bound in vivo in either HeLa
cells (Kim et al., 2007) or CD4+ T cells (Figs. 4.3B, 4.7A, GBM) (Cuddapah et al., 2009)
and those obtained using ChIP-seq in IMR90 cells from GEO record GSM935404 (Fig.
4.7A, IMR90). We removed the 10% of these sites that fell within 2 kb of a known TSS.
Our final set contained 8722 nonpromoter CTCF sites (Supplemental Table S2 in Kelly
et al., 2012). For "CTCF regions with 0 CpGs", we used only those genomic positions
that contained no CpGs in the reference human genome within a span of two nucleosomes
on either side (+/- 370bp). This comprised about 1% of the full CTCF set.
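The 20-bp moving-window averaging used for the profile plots in this section can be sketched as below. This is a minimal sketch under stated assumptions: methylation values are assumed to be already pooled across all aligned elements by offset from the landmark, the window is taken as the half-open interval from 10 bp upstream to 10 bp downstream of each plotted position, and the function name is hypothetical.

```python
def windowed_average(values_by_offset, start, end, window=20):
    """Average methylation in a moving window around each offset.

    values_by_offset: dict mapping offset (bp relative to the landmark) to a
        list of methylation values (0..1) pooled across all aligned elements.
        Offsets without a cytosine of the correct type are simply absent,
        so they do not enter any average.
    Returns a dict offset -> mean; offsets whose window is empty are omitted.
    """
    half = window // 2
    profile = {}
    for center in range(start, end + 1):
        pooled = []
        for offset in range(center - half, center + half):
            pooled.extend(values_by_offset.get(offset, []))
        if pooled:
            profile[center] = sum(pooled) / len(pooled)
    return profile
```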
4.2.8 Promoter nucleosome-depleted region detection
Identification of promoter nucleosome-depleted regions (Supplemental Fig. S7 in Kelly
et al., 2012) was performed as follows: each unique TSS from the UCSC KnownGenes
track was considered independently. All sequencing reads overlapping the candidate
NDR region (-100 to +50 bp) were collected, and GCHs were analyzed on each read.
Every GCH on each of the overlapping reads was counted as an independent nucleosome
protection measurement, and only those with base quality Phred scores of greater
than 10 were included. Those TSSs with 10 or fewer such data points were removed from
the analysis as regions of inadequate sequence coverage. This coverage filter removed
27,312 of 41,054 (66%) hg18 TSSs for IMR90, and 6225 (15%) and 4009 (10%) of
41,017 hg19 TSSs for GBM cultures #157 and #248, respectively. For each sample,
the frequency of methylation among these independent GCH measurements within the
candidate NDR region was compared to the frequency within the surrounding 8 kb: the
4 kb directly upstream of the candidate -100- to +50-bp region and the 4 kb directly
downstream. We used a one-tailed binomial test to test whether the frequency of GCH
methylation within the candidate NDR region was higher (i.e., less nucleosome protection)
than in the surrounding 8 kb. The binomial test resulted in raw P-values, which were
corrected for multiple hypotheses (Benjamini-Hochberg) in each sample independently,
using the number of TSSs passing the initial coverage filter in that particular sample as
the number of hypotheses. Lists of all TSSs, methylation frequencies in candidate NDR
and surrounding regions, and raw and corrected P-values for each sample are available
as Supplemental Tables S5-S7 (in Kelly et al., 2012).
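The statistical procedure above can be sketched with the standard library alone. This is an illustration rather than the original implementation: it assumes the background methylation frequency from the surrounding 8 kb is used as the fixed success probability of the binomial null, and the function names (`binom_sf`, `ndr_pvalue`, `benjamini_hochberg`) are hypothetical.

```python
import math


def binom_sf(k, n, p):
    """One-tailed P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if k <= 0:
        return 1.0
    if k > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p) + (n - i) * math.log1p(-p)
        for i in range(k, n + 1)
    ]
    peak = max(log_terms)
    return min(1.0, math.exp(peak) * sum(math.exp(t - peak) for t in log_terms))


def ndr_pvalue(meth_ndr, total_ndr, meth_bg, total_bg):
    """Test whether GCH methylation in the candidate NDR exceeds background.
    Assumption: the background frequency is treated as a fixed probability."""
    return binom_sf(meth_ndr, total_ndr, meth_bg / total_bg)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

Here `len(pvalues)` plays the role of the number of TSSs passing the coverage filter in a given sample, so the correction is applied to each sample separately.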
Intersections between NDR calls from the three samples and histone marks (Venn
diagrams in Supplemental Fig. S7 in (Kelly et al., 2012)) were generated as follows: The
universe of TSSs considered for a given Venn diagram included only those that passed
the coverage filter for all the samples included in the intersection, i.e., for Supplemental
Figure S7C (in (Kelly et al., 2012)), only the 12,424 TSSs covered by all three cell
types were included, while in Supplemental Figure S7D (in (Kelly et al., 2012)), only the
33,425 TSSs covered by both GBM samples were included; all histone-marked TSSs
within this given subset were considered.
4.2.9 NDR and linker identification
A beta-binomial Hidden Markov Model (HMM) (Molaro et al., 2011) was used to identify
NDRs and linker regions (details are also described in Chapter 3).
4.2.10 IMR90 MNase-seq
The MNase-seq library was made by Dr. Terry Kelly and published on the PeerJ preprint server.
Mononucleosomes were generated by digesting 1 × 10^6 cells with 0.5, 1 and 5 units
of micrococcal nuclease (MNase; Worthington Biochemicals) for 15 minutes at 37 °C.
The three MNase preparations were combined, and mononucleosome fragments of
150 bp were gel extracted and libraries were prepared from 30 ng DNA using Illumina
single-end sequencing adapters as described in Bernstein et al. (2006). Sequencing
was performed on an Illumina Genome Analyzer IIx using standard Illumina reagents,
producing 153,469,077 high-quality 36-bp sequence reads. Reads were aligned using
MAQ with a minimum mapping quality of 30, resulting in 111,705,730 uniquely
alignable reads. All sequences and alignments are available at GEO GSE21823.
4.2.11 Nucleosome occupancy score
This part was done by Dr. Benjamin Berman and published on the PeerJ preprint server.
For genomic coordinate c and an estimated mononucleosome size s, the nucleo-
some occupancy score for a particular position was determined by summing the num-
ber of MNase tags on the forward genomic strand in the range c-(s/2) and the number of
tags on the reverse strand in the range c+(s/2). We estimated s to be 165 after examin-
ing a range of values (50bp-250bp) within 1kb of all CTCF binding sites. After alignment
to the genomic element of interest, the raw nucleosome occupancy score was normal-
ized for local tag density by dividing by the total number of reads within 200bp. Plots
were smoothed by taking a moving average of normalized occupancy scores within a
20bp window.
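One possible reading of the occupancy score is sketched below. The interval boundaries are an assumption: the text gives the ranges only as c-(s/2) and c+(s/2), and here they are interpreted as forward-strand tag starts between c - s/2 and c, and reverse-strand tag starts between c and c + s/2. The normalization radius of 200 bp on each side is likewise an assumed reading of "within 200bp", and all function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def count_in(sorted_positions, lo, hi):
    """Number of positions p in a sorted list with lo <= p <= hi."""
    return bisect_right(sorted_positions, hi) - bisect_left(sorted_positions, lo)


def occupancy_score(fwd_starts, rev_starts, c, s=165, norm_radius=200):
    """Normalized nucleosome occupancy score at genomic coordinate c.

    fwd_starts / rev_starts: sorted tag start coordinates on the forward and
    reverse strand of one chromosome. s is the estimated mononucleosome size.
    """
    half = s // 2
    raw = (count_in(fwd_starts, c - half, c)      # assumed forward-strand window
           + count_in(rev_starts, c, c + half))   # assumed reverse-strand window
    local = (count_in(fwd_starts, c - norm_radius, c + norm_radius)
             + count_in(rev_starts, c - norm_radius, c + norm_radius))
    return raw / local if local else 0.0
```

The smoothing step would then be a 20-bp moving average over these normalized scores.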
4.2.12 MSRE datasets
This part was done by Dr. Benjamin Berman and published on the PeerJ preprint server.
For the B-lymphocyte MSRE dataset, supplemental data files from Ball et al. (2009)
contained the number of tag counts for each possible HpaII site. Using the procedure
described in the Ball et al. methods section, we transformed these counts to
percent methylation using the following equation: $m = 1 - (0.1124 \times c)$, where m is the
estimated percent methylation, and c is the raw tag count.
4.2.13 Combinatorial epigenomic signatures
For each GCH genomic position, each read mapping to the bisulfite-C strand was analyzed
for within-read associations with nearby CGs. Each WCG within 20 bp upstream
or downstream was considered nearby (chosen as a distance that could resolve nucleosome
positioning). If the nearby WCG was methylated, the GCH methylation value
for the read was stratified into the methylated bin (red lines in Fig. 4.9); likewise, those
reads where the nearby CG was unmethylated went into the unmethylated bin (Fig.
4.9, blue lines). If a single GCH was within 20 bp of multiple CGs, the GCH methylation
value was entered once for each of those CGs in each read, into the appropriate (methylated or
unmethylated) bin, as an independent observation.
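The within-read stratification can be sketched for a single read as follows. The tuple representation of cytosine calls and the function name are hypothetical, and the sketch follows the reading that a GCH near several WCGs contributes one observation per WCG.

```python
def stratify_read(calls, max_dist=20):
    """Stratify GCH methylation values of one read by nearby WCG status.

    calls: list of (position, context, methylated) for the cytosines on one
        read, where context is "GCH" or "WCG" and methylated is a bool.
    Returns {"methylated": [...], "unmethylated": [...]}, each a list of
    (gch_position, gch_value) observations, with gch_value 1 or 0.
    """
    bins = {"methylated": [], "unmethylated": []}
    gchs = [(pos, meth) for pos, ctx, meth in calls if ctx == "GCH"]
    wcgs = [(pos, meth) for pos, ctx, meth in calls if ctx == "WCG"]
    for gch_pos, gch_meth in gchs:
        for wcg_pos, wcg_meth in wcgs:
            if abs(gch_pos - wcg_pos) <= max_dist:
                # one independent observation per nearby WCG
                name = "methylated" if wcg_meth else "unmethylated"
                bins[name].append((gch_pos, int(gch_meth)))
    return bins
```

Averaging the values in each bin by position across all reads and elements would then give the two GCH profiles.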
To generate the right-hand plots of Figure 4.9A (labeled "on same read"),
these methylated and unmethylated GCH bins were averaged across all genomic elements
to yield two average GC profiles for the methylated (red) and unmethylated (blue)
bins. For the left-hand plots (labeled "across all reads"), the entire analysis was performed
identically, except that nearby CG methylation values were taken from a randomly
selected read mapping to the same location, rather than the same read as the
GC. Generally, multiple reads overlapped the same position, but we only selected one
read at random to keep the number of observations identical to the "on same read" condition,
eliminating any possible effects from differences in variance between the two
conditions.
Divergent chromatin allele promoter detection.
Identification of promoters with divergent chromatin alleles (DCA; Fig. 4.9) was performed
as follows: we only counted reads that had two or more GCHs and two or more HCGs,
with 90% of cytosines in each category being in agreement. For each TSS from the
UCSC KnownGenes track, we selected those reads where more than half of the read
fell within the window from -150 to +100 bp. Any gene with at least one read in the "active" chromatin
combination state (CG unmethylated and GC nucleosome-accessible) and another read
in the "silenced" state (CG methylated and GC nucleosome-protected) was counted
as a DCA gene. The fraction of these falling onto chromosome X or associated with
imprinted genes was compared to size-matched sets picked randomly from the genome,
as described in the Figure 4.9 legend.
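The read-level classification behind the DCA call can be sketched as below. This is a hedged illustration: reads are assumed to have been pre-selected so that more than half of each read lies in the promoter window, each read is reduced to two lists of booleans (True meaning methylated), and the function names are hypothetical. Integer arithmetic is used for the 90% agreement threshold to avoid floating-point edge cases.

```python
def agreement(calls, min_pct=90):
    """Return True/False if at least min_pct percent of calls are
    methylated/unmethylated, or None if the calls do not agree."""
    n = len(calls)
    meth = sum(calls)
    if 100 * meth >= min_pct * n:
        return True
    if 100 * (n - meth) >= min_pct * n:
        return False
    return None


def read_state(gch_calls, hcg_calls, min_sites=2):
    """Classify one read as "active", "silenced", or None."""
    if len(gch_calls) < min_sites or len(hcg_calls) < min_sites:
        return None  # needs two or more GCHs and two or more HCGs
    gch = agreement(gch_calls)  # True = accessible (GCH methylated by M.CviPI)
    hcg = agreement(hcg_calls)  # True = endogenously methylated
    if hcg is False and gch is True:
        return "active"
    if hcg is True and gch is False:
        return "silenced"
    return None


def is_dca_gene(reads):
    """reads: iterable of (gch_calls, hcg_calls) for one promoter."""
    states = {read_state(gch, hcg) for gch, hcg in reads}
    return "active" in states and "silenced" in states
```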
4.2.14 Other data access
NOMe-seq tracks for genomic viewers (Fig. 4.3) are available as a supplemental document
and at http://epigenome.usc.edu and the NCBI Gene Expression Omnibus (GEO)
(http://www.ncbi.nlm.nih.gov/geo/) under accession number GSE21823. All source
code tools are available at http://sourceforge.net/projects/uecgatk/. See Supplemental
Material in Kelly et al. (2012) for instructions on using these tools. The tool and
source code for the new module of the IGV viewer to display NOMe-seq data from raw
BAM alignment files are publicly available for download at the IGV project website,
http://www.broadinstitute.org/igv/.
4.3 Results
4.3.1 Identifying optimal treatment conditions for accurate footprinting
of a variety of genomic loci
This part was done by Drs. Terry Kelly and Fides Lay and published in Genome Research.
To generate integrated DNA methylation and nucleosome occupancy information,
nuclei are treated with M.CviPI, which methylates GpC dinucleotides not protected by
nucleosomes or tight-binding proteins. Following bisulfite conversion to differentiate
between methylated and unmethylated cytosine residues, cytosines contained within
a CpG dinucleotide context provide endogenous methylation information, while nucleosome
positioning is derived from cytosines within GpC dinucleotides. The methylation
level of each individual cytosine is calculated as the number of methylated reads divided
by the number of all reads covering that position. Combining CpG and GpC methylation profiles, four
distinct chromatin structures can be visualized (Fig. 4.1A). We first identified a set of
reaction conditions that allowed for accurate footprinting (i.e., accessibility of nucleosome-depleted
regions, while not aberrantly accessing nucleosome-occupied regions,
defined as regions of 146 bp or larger that are inaccessible to M.CviPI) of a variety of chromatin
configurations (Fig. 4.1B).
4.3.2 NOMe-seq reveals expected nucleosome occupancy patterns at
CTCF and transcription start sites
We generated whole-genome NOMe-seq libraries and used our NOMeToolkit pipeline
(Liu et al., 2012) (Chapter 3) to segregate cytosines based on the trinucleotide containing
the cytosine in the central position. GCH cytosines were generally used to plot
enzyme accessibility (nucleosome protection or occupancy), while HCGs (where H = C,
T, or A) were used for endogenous methylation. GCGs were excluded due to ambiguity
between endogenous and enzymatic methylation. Exclusion of GCGs is not likely to
substantially impair the ability of M.CviPI to footprint nucleosomes since GCGs represent
less than 0.24% of the genome and make up only 5.6% of all GC dinucleotides (Fig. 4.2).
Furthermore, 93.4% of GCG trinucleotides have a GCH within 20 bp (half of which are
within 5 bp) from which nucleosome occupancy information can be derived (Fig. 4.2).
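The trinucleotide segregation described above can be illustrated on a reference sequence. This sketch is not the NOMeToolkit implementation: the function names are hypothetical, and only forward-strand cytosines are classified (cytosines on the reverse strand would be handled by applying the same function to the reverse complement).

```python
from collections import Counter


def classify_cytosine(seq, i):
    """Classify the cytosine at index i of an uppercase forward-strand
    sequence as "GCH", "HCG", or "GCG"; return None for anything else."""
    if i == 0 or i == len(seq) - 1 or seq[i] != "C":
        return None
    prev_base, next_base = seq[i - 1], seq[i + 1]
    if prev_base not in "ACGT" or next_base not in "ACGT":
        return None  # ambiguous neighbour such as N
    if prev_base == "G" and next_base == "G":
        return "GCG"  # ambiguous between endogenous and enzymatic methylation
    if prev_base == "G":
        return "GCH"  # read-out of M.CviPI accessibility
    if next_base == "G":
        return "HCG"  # read-out of endogenous CpG methylation
    return None       # HCH: neither a GpC nor a CpG


def context_counts(seq):
    """Count GCH, HCG, and GCG cytosines on the forward strand of seq."""
    seq = seq.upper()
    labels = (classify_cytosine(seq, i) for i in range(len(seq)))
    return Counter(label for label in labels if label is not None)
```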
Due to the availability of genome-wide data (Lister et al., 2009; Bernstein et al.,
2010), we performed whole-genome NOMe-seq in IMR90 cells and obtained 156 million
uniquely alignable reads, which can be displayed from raw BAM alignment files using a
newly developed module of the IGV viewer (Fig. 4.3A) (Thorvaldsdóttir et al., 2013).
Figure 4.1: NOMe-seq can footprint a variety of chromatin structures.
(A) After IMR90 cell nuclei are treated with M.CviPI, DNA is extracted, bisulfite-converted, and sequencing
is performed. DNA methylation status is obtained from CpG dinucleotides, and nucleosome occupancy
information is gained from the inaccessibility of the M.CviPI methyltransferase to GpC dinucleotides. The
combination of DNA methylation and nucleosome occupancy data can reveal four distinct chromatin signa-
tures: unmethylated and nucleosome-depleted, unmethylated and nucleosome-occupied, methylated and
nucleosome-occupied, and methylated and nucleosome-depleted. (Black circles) Methylated CpG sites;
(teal circles) accessible (methylated) GpC sites. (B) We found that 200 units of M.CviPI for 7.5 min fol-
lowed by a boost of 100 units accurately revealed an NDR upstream of the TSS of HSPA5 (also known as
GRP78), an active CGI promoter, while also showing that the polycomb repressed MYOD1 CGI promoter
and methylation-silenced CpG-poor LAMB3 promoter were occupied by nucleosomes and inaccessible to
M.CviPI, as expected. M.CviPI-inaccessible regions greater than 146 bp are covered by a pink rectangle
indicating nucleosome occupancy. PCR amplicon sizes: HSPA5, 447 bp; MYOD, 474 bp; and LAMB3, 426 bp.
Figure 4.2: GCG excluded in the downstream computational analysis does not affect
accessibility measurement greatly.
(A) Frequency of GCH, GCG and HCG in the reference genome. Only 5.6% of all GpC dinucleotides are
excluded in the downstream analysis. (B) Frequency of GCH location within specific distances to GCG
trinucleotides that are excluded in analyses. 47% of GCG trinucleotides have a GCH within 5 base pairs
or less, mitigating the impact of excluding GCGs from analyses. M.CviPI-inaccessible regions greater than
146 base pairs are covered by a pink rectangle indicating nucleosome occupancy.
We examined the relationship between nucleosome depletion and expression by
dividing promoters into quartiles based on expression level (Hawkins et al., 2010) and
found that promoters in the lowest bin (0%-25%) were nucleosome-occupied regardless
of whether they were CGI or non-CGI promoters (Fig. 4.3D,E). With increasing expression
quartiles, the NDR upstream of the TSSs and the positioning of the nucleosomes
after the TSS became more apparent for both CGI and non-CGI promoters. These
results suggest that an NDR upstream of the TSS and positioned nucleosomes downstream
from the TSS are strongly predictive of expression level and indicate similar
epigenetic regulation of CGI and non-CGI promoters.
We next compared the ability of NOMe-seq to accurately map the well-positioned
nucleosomes flanking CTCF binding sites. We aligned reads to conserved CTCF binding
motifs (Xie et al., 2007) located more than 2 kb away from TSSs that have been
experimentally validated as bound in vivo by CTCF (Supplemental Table S2 in Kelly
et al., 2012) (Kim et al., 2007; Cuddapah et al., 2009) and found that NOMe-seq
mapped nucleosomes similarly to MNase-seq data (Fig. 4.3B) (Schones et al., 2008).
Nucleosome occupancy is plotted as inaccessibility to M.CviPI (1 - GpC methylation)
(Fig. 4.3B, teal line); thus, regions of protection appear as peaks in the graph. The
first and second nucleosomes to the right of the CTCF binding site appear to be slightly
out of phase using NOMe-seq compared to MNase; however, here we are comparing
MNase-seq data from the benchmark CD4+ T cell data set (Schones et al., 2008) with
NOMe-seq data collected in IMR90 cells. This phase shifting is not apparent when we
compare NOMe-seq and MNase-seq data both generated from IMR90 cells (Supplemental
Fig. S3 in Kelly et al., 2012). The high resolution generated by NOMe-seq also
reveals a region of protection coinciding with the CTCF binding site, likely reflective of
CTCF binding. To investigate this further, we examined a CTCF binding region at high
resolution (Supplemental Fig. S4 in Kelly et al., 2012) and show a discrete protection
pattern overlapping the CTCF motif with a size (<40 bp) consistent with a nonnucleosomal
protein. This protected region is surrounded by clear nucleosome depletion
(accessibility), which is, in turn, surrounded by larger protected regions whose size is
consistent with nucleosome occupancy. We next aligned NOMe-seq reads to all TSSs
and again found that NOMe-seq was comparable to MNase-seq and able to identify
a nucleosome-depleted region (NDR) upstream of TSSs and well-positioned nucleosomes
downstream from TSSs (Fig. 4.3C).
4.3.3 DNA methylation marks inter-nucleosome linker regions throughout the human genome
This is part of the paper on the PeerJ preprint server. The MNase-seq experiment was done by Dr.
Terry Kelly. The MNase-seq analysis was done by Dr. Benjamin Berman.
We identified consistent linker regions from IMR90 NOMe-seq nucleosome occu-
pancy data (Kelly et al., 2012) (Figure 4.4). DNA within the linkers was consistently
more methylated than the flanking nucleosomes, most prominently in CTCF regions
and PMDs. Interestingly, the inter-nucleosome spacing was shorter in CTCF regions
(185bp) than PMDs or the rest of the genome (200bp).
To demonstrate that increased methylation in linker DNA was not cell type specific,
we examined methylation around CTCF sites in several additional WGBS datasets as
well as the non-bisulfite MSRE dataset described above. Indeed, all cell types showed
linker-specific methylation (Figure 4.5a), and almost identical global patterns have been
observed for dozens of other human tissues sequenced by WGBS in our lab (unpublished
data, not shown). Interestingly, whereas CpGs within +/-200bp of the CTCF
binding site were completely unmethylated in most tissues, H1 and HSF1 embryonic
stem cells (hESCs) showed increased methylation, possibly attributable to ESC-specific
5-hydroxymethylation at CTCF sites (Yu et al., 2012). MSRE could not accurately rep-
resent the methylation levels within this +/- 200bp region due to known limitations of the
method to measure very low methylation (Ball et al., 2009).
The large number of CTCF binding sites in the genome provided an opportunity to
investigate the interplay between methylation and nucleosome positioning. There is evi-
dence suggesting that methylation can influence nucleosome formation (Collings et al.,
2013) and vice versa (Ooi et al., 2007). This cannot be resolved with certainty
without additional experiments, but we reasoned that if DNA methylation were required
for nucleosome positioning, CpG dinucleotides would be required around functional
CTCF sites. To investigate this bioinformatically, we extracted CTCF-adjacent positions
that contained zero CpGs in the reference human genome within a region of two full
nucleosomes (+/-370bp). According to MNase occupancy and NOMe-seq chromatin
accessibility levels, the nucleosomes at these zero-CpG regions were positioned just as
well as other CTCF-adjacent nucleosomes, strongly suggesting that linker DNA methylation
is not necessary for nucleosome positioning (Figure 4.5b-c). Nevertheless, the
zero-CpG regions comprise only about 1-3% of CTCF-adjacent nucleosomes, so we
cannot completely rule out some role for DNA methylation in establishing or reinforcing
nucleosome positioning.
4.3.4 NOMe-seq reveals genome-wide transcription factor binding strength in vivo
The experimental part was done by Dr. Fides Lay for a collaborative pilot project.
Using NOMe-seq, we were also able to detect a protection pattern occurring due
to the binding of transcription factors or other DNA-binding proteins whose recognition
sites contain at least one GpC site, such as CTCF studied above. We next adapted
our standard NOMe-seq protocol to include a high-salt wash of freshly isolated nuclei
prior to treatment with the M.CviPI enzyme in order to measure the binding strength of different
transcription factors. We detected an increase in accessibility at the CTCF
binding sites following a 200 mM salt wash prior to M.CviPI treatment (Figure 4.6a, middle
panel). This increase is even more dramatic at 400 mM, suggesting that the effect
of the high-salt wash on CTCF binding is not restricted to individual regions (Figure 4.6a,
right panel). We furthermore show that we can detect a similar change in accessibility
pattern following the high-salt wash at NSRF1 binding motifs (Figure 4.6b), thus highlighting
the strength of our genome-wide footprinting method. The centers of CTCF and
NSRF1 binding sites are also markedly hypomethylated compared to the surrounding
regions. Unlike CTCF, however, the DNA methylation pattern is not phased surrounding
NSRF1 binding regions, which also generally lack well-spaced nucleosome arrays. These
data illustrate that different transcription factors may have differing roles in regulating
nucleosome organization. Although this regulatory aspect is beyond the scope of our
current study, we demonstrate that the NOMe-seq assay is uniquely suited for future investigations
of the role of transcription factors in the establishment of the epigenetic landscape.
4.3.5 NOMe-seq reveals distinct chromatin configurations at specific promoter types in both cell lines and primary fresh-frozen tumor samples
The experimental part on cell lines was done by Dr. Terry Kelly. The experimental part
on fresh-frozen tissue was done by Dr. Fides Lay for a collaborative pilot project.
We examined the combined nucleosome occupancy and methylation patterns at
CTCF sites, specific promoter classifications (Fig. 4.7) (Hawkins et al., 2010), and other
genomic regions including enhancers and intron/exon boundaries (Supplemental Figs.
S6, S7 in (Kelly et al., 2012)). Interestingly, DNA methylation and nucleosome occu-
pancy were strongly anti-correlated surrounding CTCF sites such that DNA methylation
peaked in the linker regions between nucleosomes (Fig. 4.7A). To examine whether this
correlation was cell type-specific, we performed NOMe-seq in two primary cultures from
glioblastoma tumors (157 and 248) and found that DNA methylation and nucleosome
positioning were also anti-correlated at CTCF sites in these cells (Fig. 4.7A). At promot-
ers, nucleosome occupancy and DNA methylation were consistent with transcription
potential (Fig. 4.7B): H3K4me3-marked (active) promoters were unmethylated with a
distinct NDR upstream of TSSs and at least four nucleosomes downstream from TSSs,
while H3K27me3-marked (repressed) promoters were unmethylated but nucleosome
occupied as indicated by inaccessibility to M.CviPI. DNA methylated (silent) promoters
were completely nucleosome occupied. NOMe-seq is able to distinguish these three
important and distinct promoter architectures in a single experiment. Surprisingly, there
was a bump in DNA methylation just upstream of the TSS of promoters marked by
H3K4me3, which we found to be due to off-target activity of M.CviPI and only affected
the endogenous methylation information obtained from cytosines that were preceded
by another cytosine at regions of peak M.CviPI accessibility. This artifact could be
removed completely by eliminating CCGs (Supplemental Fig. S8; Supplemental Material
in Kelly et al., 2012), and future analysis methods can better adjust for this known
off-target activity rate.
We next investigated chromatin configurations of CGI and non-CGI promoters (Fig.
4.7C,D). In general, CGI promoters had low levels of cytosine methylation near the
TSS (relative to 1 kb away from the TSS), a distinct NDR upstream of the TSS, and
well-positioned nucleosomes downstream from the TSS. Separating CGI promoters into
those that are methylated and unmethylated reveals that the CGI promoter pattern is
largely driven by unmethylated CGI promoters and the few CGI promoters that were
methylated were nucleosome- occupied. Separating non-CGI promoters into those that
were methylated and unmethylated revealed that the relatively few non-CGI promoters
that were unmethylated also had an NDR upstream of the TSS and a nucleosome
immediately downstream from the TSS, while the more commonly methylated non-CGI
promoters were nucleosome-occupied.
To demonstrate NOMe-seq's reproducibility, we sequenced two glioblastoma (GBM)
primary cell cultures and found similar nucleosome positioning patterns at promoters
and enhancers in the GBM cells as we did in IMR90 cells (Supplemental Fig. S7 in
(Kelly et al., 2012)). Using a statistical test to identify NDRs near TSSs (see Materials
and methods), we found high concordance among all samples at CGI promoters; the
two GBMs had NDRs that were 90% overlapping with each other and 88% and 91%
overlapping with IMR90, respectively (Supplemental Fig. S7C in (Kelly et al., 2012)).
Many genes which are essential for cellular function (i.e., housekeeping genes) have
CGI promoters; thus, it was not surprising to have such significant overlap between the
GBM and IMR90 cells. Nevertheless, the probability of getting such a 90% overlap of
NDRs in the two GBMs by chance is 10^-518 using a hypergeometric test. We found
significantly less overlap at non-CGI promoters between cell types, consistent with the
greater cell-type specificity of non-CGI genes. In these gene promoters, the two GBM
samples overlapped by 58%, while they overlapped IMR90 by 43% and 47%, respec-
tively.
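The hypergeometric overlap test mentioned above can be sketched as follows. This is only a minimal illustration, not the pipeline used in this work: the function name and the toy counts are assumptions, and the promoter and NDR counts behind the reported 10^-518 are not given in this section. The sketch counts outcomes with exact integers and reports log10 of the tail probability, so that very small values do not underflow.

```python
from math import comb, log10

def log10_hypergeom_tail(k, N, K, n):
    """Return log10 of P(X >= k) for a hypergeometric variable X.

    N: promoters tested
    K: promoters with an NDR in sample 1
    n: promoters with an NDR in sample 2
    k: promoters with an NDR in both samples
    """
    upper = min(K, n)
    # Exact integer count of draws whose overlap is at least k.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    if favourable == 0:
        return float("-inf")
    # log10 of Python integers works even when they exceed the float range.
    return log10(favourable) - log10(comb(N, n))

# Toy counts (hypothetical): 100 NDRs per sample, 90 shared, 1000 promoters.
print(log10_hypergeom_tail(k=90, N=1000, K=100, n=100))
```

With realistic genome-wide counts the same call applies unchanged; only the four integers differ.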
Next, we extended our experiment to apply the assay to fresh-frozen primary
human colon tumor tissue. We selected two adenocarcinoma specimens, E485101 and
E237101, which were collected from the right proximal colon of two female patients for
genome-wide NOMe-seq analysis. These tumors were previously genotyped and found
to be wild-type for KRAS, BRAF and TP53 (Hinoue et al., 2012). One of the limiting
factors in our effort to globally map the nucleosome occupancy and DNA methylation
pattern of uncultured human tumors is the low starting amount of tissue and subse-
quently, the yield of DNA following enzyme treatment, in particular when dealing with
adjacent normal tissues. To circumvent this issue, we tested the EpiGnome Methyl-Seq
Kit (Epicentre), which required <100 ng of input DNA for the generation of genome-wide
NOMe-seq libraries and performed low coverage sequencing to generate 62-63 million
uniquely mappable reads for each tumor sample (>90% mapping quality). In order to
characterize the promoter types within the colon tumors, we took advantage of publicly
available ChIP-seq data of normal colonic mucosa to annotate the genome
based on functional chromatin states (Ernst et al., 2011). We then examined the
changes in DNA methylation and nucleosome occupancy patterns in the functional
promoter states in both colon tumors. We show that active promoters generally remain
unmethylated and accessible (Figure 4.8, left panel), while poised promoters (Figure 4.8,
middle panel) are unmethylated and inaccessible and inactive promoters (Figure 4.8,
right panel) are methylated and inaccessible in the tumors.
Strikingly, we observed a distinct accessibility pattern within the active promoters of
the two colon tumors. E237101 showed peak accessibility immediately upstream of the
TSS (Figure 4.8a, left panel) whereas the accessibility of E485101 peaked downstream
of the TSS (Figure 4.8b, left panel). We also observed a minimal increase in
accessibility, relative to baseline, at the TSS of the poised promoter state in E237101
(Figure 4.8a, middle panel), which was not observed in E485101 (Figure 4.8b, middle
panel). Higher depth sequencing will be needed to comprehensively examine the differ-
ences between these samples and to determine whether different tumor subtypes may
have different patterns of nucleosome positioning and chromatin accessibility.
4.3.6 Combinatorial epigenomic signatures reveal functional chromatin
Unlike any other method used to assess nucleosome occupancy or DNA methylation,
NOMe-seq includes both nucleosome positioning and DNA methylation data for individ-
ual DNA strands, enabling a correlation between the two features at the single molecule
level. Because different chromatin states can exist on the two alleles in a single cell
or in different subpopulations of cells within a sample, we expected the combination
of two marks on a single molecule to yield more information than average levels taken
across a population of cells. To investigate this, we calculated nucleosome protection
patterns around genomic elements as a function of DNA methylation state, first using
methylation information from population averages from any read covering the same
position in the genome (Fig. 4.9A-C, left panels) and then using only the methylation
state from the same read (Fig. 4.9A-C, right panels). Some regulatory elements we
investigated, such as sequences with predicted AP-1 binding motifs, had a visible NDR
but showed almost no difference between population averages (Fig. 4.9A, left) and
within-read averages (Fig. 4.9A, right). These elements suggest uniformity of a specific
chromatin state across the entire population of cells. Other elements we investigated,
such as those annotated as DNase hypersensitive enhancers in IMR90 cells (Hawkins
et al., 2010), had a much stronger correlation between DNA methylation and acces-
sibility within individual reads than across the population of reads, suggesting that a
combinatorial chromatin signature exists within a subset of cells or alleles within the
sample (Fig. 4.9B).
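The difference between the across-read and within-read analyses can be illustrated with a toy calculation. The sketch below is an assumption-laden simplification: each read is reduced to one methylation call and one protection call, and the across-read case pairs every read with the methylation label of every read at the locus (an exhaustive stand-in for random sampling across reads). The function name and the toy data are hypothetical.

```python
def protection_by_methylation(reads, within_read):
    """Fraction of protected GpCs, stratified by CpG methylation label.

    reads: list of (cpg_methylated, gpc_protected) booleans, one per read.
    within_read=True  -> the label comes from the same read.
    within_read=False -> the label comes from every read at the locus.
    """
    totals = {True: [0, 0], False: [0, 0]}  # label -> [protected, total]
    for meth_i, prot_i in reads:
        labels = [meth_i] if within_read else [m for m, _ in reads]
        for label in labels:
            totals[label][0] += int(prot_i)
            totals[label][1] += 1
    return {label: p / n for label, (p, n) in totals.items() if n}

# Toy locus: half of the molecules are methylated and protected,
# the other half are unmethylated and accessible.
reads = [(True, True)] * 50 + [(False, False)] * 50
print(protection_by_methylation(reads, within_read=True))
print(protection_by_methylation(reads, within_read=False))
```

In this toy mixture the within-read analysis separates the two populations completely, while the across-read analysis assigns both methylation labels the same average protection, which mirrors the loss of correlation described for population averages.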
To investigate whether we could detect combinatorial chromatin signatures within
regions likely to be monoallelic, we applied this same approach to gene promoters
identified as having both DNA methylation and H3K4me3 marks in IMR90 cells (Fig.
4.9C) (Hawkins et al., 2010). These two states are generally antagonistic at promot-
ers, suggesting that they might exist on two different alleles in the same cell, especially
in a genetically female cell line like IMR90. The across-read vs. within-read compari-
son shows that any correlation between methylation and nucleosome occupancy is lost
when averaging across all reads (Fig. 4.9C, left) but clear when looking at within-read
correlations (Fig. 4.9C, right). To test whether we could exploit this within-read correla-
tion to identify allele-specific regions, we searched all promoters (-150 to +100 bp from
TSS) for a combination of the two opposing chromatin conformations (Fig. 4.9D), one
containing unmethylated CpGs and no nucleosome protection (Allele 1) and the other,
methylated CpGs and nucleosome protection (Allele 2). We found that 742 promoter
regions met these divergent chromatin alleles (DCA) criteria, of which 201 mapped to the
X chromosome (27% of DCA promoters compared to 2.7% of promoters genome-wide)
(Fig. 4.9E). Eighteen DCA promoters were associated with one of 58 known imprinted
genes (http://www.geneimprint.org/), compared to an average of four in matched
sets of randomly selected promoters (Fig. 4.9F). To validate our genome-wide findings,
we performed locus-specific NOMe-seq analysis on one X-linked (DLG3), one imprinted
gene (SNRPN), and one newly identified DCA promoter (ZNF597), which was recently
suggested to be imprinted (Fig. 4.10) (Choufani et al., 2011; Nakabayashi et al., 2011).
Our results clearly showed the presence of two distinct chromatin structures. We further
showed more overlap in DCA alleles between the two GBM samples compared to the
number of DCA alleles shared amongst the GBM and IMR90 samples (Supplemental
Fig. S9 in (Kelly et al., 2012)). The incorporation of both DNA methylation and nucleo-
some positioning information from individual DNA strands enabled the identification of
several monoallelic genes that have not been previously described, and we expect that
increased sequencing depth will greatly increase our sensitivity for these regions.
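The DCA search described above can be sketched in a few lines. This is an illustrative sketch only: the per-read thresholds (`max_low`, `min_high`), the function names and the read representation are assumptions, since the exact cutoffs are not stated in this section. A read is assigned to Allele 1 when its CpGs are mostly unmethylated and its GpCs mostly accessible, to Allele 2 in the opposite case, and a promoter is flagged when both kinds of read are present.

```python
def classify_read(cpg_calls, gpc_calls, max_low=0.2, min_high=0.8):
    """Classify one read covering a promoter window.

    cpg_calls: booleans, True = endogenously methylated CpG.
    gpc_calls: booleans, True = GpC methylated by M.CviPI (accessible).
    Returns "allele1", "allele2" or None (ambiguous or uninformative).
    """
    if not cpg_calls or not gpc_calls:
        return None
    cpg_meth = sum(cpg_calls) / len(cpg_calls)
    gpc_acc = sum(gpc_calls) / len(gpc_calls)
    if cpg_meth <= max_low and gpc_acc >= min_high:
        return "allele1"  # unmethylated and nucleosome-depleted
    if cpg_meth >= min_high and gpc_acc <= max_low:
        return "allele2"  # methylated and nucleosome-protected
    return None

def is_dca(reads, min_reads_per_state=1):
    """reads: list of (cpg_calls, gpc_calls); True if both states occur."""
    states = [classify_read(cpg, gpc) for cpg, gpc in reads]
    return (states.count("allele1") >= min_reads_per_state
            and states.count("allele2") >= min_reads_per_state)

open_read = ([False, False, False], [True, True, True])
closed_read = ([True, True, True], [False, False, False])
print(is_dca([open_read, closed_read]))  # both states present
print(is_dca([open_read, open_read]))    # only one state present
```

Raising `min_reads_per_state` trades sensitivity for specificity, which is relevant at the low coverage discussed here.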
4.3.7 NOMe-seq reveals the genome-wide epigenetic switching from neu-
ral stem cell to glioblastoma
The experimental work on the GBMs and NSCs was done by Dr. Terry Kelly.
We extended our study from the epigenomic status within a single sample to
epigenome switching between two different cell types by comparing NOMe-seq results
from NSCs and two GBMs. The union of NDRs detected in either cell type (17,000 in
total) was identified with a two-state beta-binomial HMM (described in detail in
Chapter 3). Hierarchical clustering of accessibility levels in the two cell types revealed
the NDR dynamics between them. Conserved NDRs were highly enriched in promoter
regions, while de novo NDRs in GBM were mainly enriched in enhancer regions
(Figure 4.11a). Gain of accessibility was accompanied by loss of DNA methylation,
while loss of accessibility in GBM was correlated with a mild increase in DNA
methylation. Cluster3 showed a significant H2A.Z increase in GBMs (Figure 4.11b,d). The weak
H3K4me1 signal in NSCs may indicate that these potential enhancers were already
pre-marked by H3K4me1 (poised enhancers). H2A.Z appears to play a key role in activating
poised enhancers during stem cell differentiation. Motif analysis of Cluster3 NDR regions
showed strong enrichment for AP1, HIF1b, NF1/CTF and Sox3, among others. The lack of
sequence specificity and histone modifications in Cluster4 might indicate a new class of
NDRs that has not been discussed before, or these might be false-positive NDR calls in NSCs.
This study showed the power of NOMe-seq for detecting accessibility and DNA
methylation dynamics among multiple samples.
4.4 Discussion
Using a novel approach to examine nucleosome occupancy and endogenous DNA
methylation genome-wide, we footprinted chromatin architecture at a variety of pro-
moter and nonpromoter regions. We showed that the NDR upstream of the TSS can
accommodate multiple nucleosomes and is indicative of expression levels for both CGI
and non-CGI promoters. We further show that the relationship between nucleosome
occupancy and DNA methylation is context-specific (depending on genomic location)
but widely exists across the genome and that incorporation of DNA methylation and
nucleosome information within the same DNA strand can facilitate the identification of
monoallelic chromatin patterns. We also explore the possibility of applying NOMe-seq
to investigate the binding affinity of distal elements in vivo. Importantly, we show that
NOMe-seq alone can distinguish among the three major chromatin states known to be
found at promoters: active (H3K4me3, meC-, NDR+), repressed/poised (H3K27me3,
meC-, NDR-), and silenced (meC+, NDR-). The ability to distinguish these three
promoter architectures within a single experiment, let alone a single molecule, holds
great promise for epigenomic mapping. Later, we observed a similar phenomenon
in fresh frozen tissue. Our pilot experiments overall provided a proof-of-principle that
epigenome-wide mapping of uncultured normal and tumor tissue can be done. Finally,
we showed that NOMe-seq could successfully identify accessibility and DNA methyla-
tion dynamics among different cell types as well as in a single sample.
Traditionally, genome-wide mapping of nucleosome positioning has been done using
MNase-seq or H3 ChIP-seq, which rely on DNA breakage. FAIRE-seq relies on
enhanced sensitivity to DNA breakage of nucleosome-depleted regions (Giresi et al.,
2007; Nagy and Price, 2009). Instead of using a nuclease, methyltransferase-based
footprinting, with either the CpG methyltransferase M.SssI (Gal-Yam et al., 2006; Lin et al.,
2007; Bouazoune et al., 2009; Kelly et al., 2010) or the GpC methyltransferase M.CviPI
(Kelly et al., 2010; Wolff et al., 2010; Andreu-Vieyra et al., 2011; Taberlay et al., 2011;
You et al., 2011), uses the placement of a biochemical mark (i.e., methylation) on DNA to
assess nucleosome occupancy. Since GpC dinucleotides are not endogenously methy-
lated, NOMe-seq provides both nucleosome positioning and DNA methylation within
the same individual DNA strand. In addition, since NOMe-seq signal is interpreted as
a percentage of sequencing reads at a given position, it provides a normalized and
unskewed measurement that does not rely upon the number of reads that map to a par-
ticular genomic locus and, thus, is an independent method of assessing nucleosome
occupancy that can complement and validate results from the established enrichment
methods.
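The read-percentage normalization can be made concrete with a short sketch. The function names below are hypothetical; the calculation simply follows the description in the Figure 4.3 legend, where methylated reads are divided by all reads at a site and GCH methylation is inverted (1-GCH) to express nucleosome protection.

```python
def methylation_level(n_methylated, n_unmethylated):
    """Fraction of reads methylated at one site, or None if uncovered."""
    total = n_methylated + n_unmethylated
    if total == 0:
        return None  # no coverage is not the same as 0% methylation
    return n_methylated / total

def nucleosome_protection(gch_methylated, gch_unmethylated):
    """1 - GCH methylation: high values indicate protection from M.CviPI."""
    level = methylation_level(gch_methylated, gch_unmethylated)
    return None if level is None else 1.0 - level

# The value depends on the ratio of reads, not on sequencing depth.
print(nucleosome_protection(3, 1))
print(nucleosome_protection(30, 10))
```

Both calls give the same protection value although coverage differs tenfold, which is the sense in which the measurement does not rely on the number of reads mapping to a locus.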
Using MNase-seq combined with DNA methylation information from bisulfite
sequencing libraries, previous work has found that nucleosomal DNA is preferentially
methylated (Chodavarapu et al., 2010). While this is true for the majority of the
genome, specific types of elements did not adhere to this genome-wide trend. For
example, nucleosomes surrounding CTCF sites were unmethylated while the linker
regions between nucleosomes were methylated, demonstrating a novel relationship at
a functionally important class of chromatin. We provided strong evidence for a pervasive
methylation pattern occurring at linker regions between arrays of positioned nucleo-
somes in the human genome. This observation has implications for methylome analy-
sis, suggesting that methylation levels may be used to deduce nucleosome positioning
in some cases. Nucleosomes adjacent to CTCF binding sites may account for a sig-
nificant fraction of these nucleosomal arrays, since it is estimated that approximately
one million nucleosomes may be positioned adjacent to CTCF sites (around 55,000
CTCF sites in any given cell type (Wang et al., 2012), with about 20 nucleosomes
positioned per site (Fu et al., 2008)). We additionally showed that methylation levels
within linker regions are unlikely to play a causal role in positioning of CTCF-adjacent
nucleosomes. This is consistent with the observation that strongly positioned
nucleosomes are stacked against a barrier that is possibly introduced by ATP-dependent
chromatin remodeling complexes (Zhang et al., 2011).
One hallmark of an active gene is the presence of an NDR immediately upstream of
the TSS. Previous work found the levels of active histone marks were correlated with
gene expression level (Barski et al., 2007); however, nucleosome occupancy itself was
not measured, and the regions upstream of the TSS appeared to be equally depleted of
nucleosomes regardless of transcript level. Here, we show that NDRs are more promi-
nent at more highly expressed genes. Importantly, this correlation between expression
and nucleosome depletion was similar for both CGI and non-CGI promoters, suggesting
that at least some key aspects of epigenetic gene regulation are shared between CGI
and non-CGI promoters. In addition, our results show that NDRs are large enough to
accommodate multiple nucleosomes. The inability to detect subtle nucleosome deple-
tion differences based on expression and the underestimation of NDR size by previous
studies is potentially reflective of the variability in fragment sizes generated by sonica-
tion, and highlights the subtleties of chromatin organization that can be identified using
NOMe-seq but have been overlooked in previous studies.
Whole genome NOMe-seq is a novel approach that footprints nucleosome occu-
pancy while retaining DNA methylation information to identify chromatin structures of a
variety of genomic regions including promoters, enhancers, and insulators. The combi-
nation of these two epigenetic marks on the same molecule can identify combinatorial
profiles within a mixed population of cells or alleles with greater sensitivity than the two
marks in isolation. The epigenetic landscape generated by these combinatorial epige-
nomic profiles has several important implications for biology, especially in the context
of profiling complex tissues containing multiple cell types. Furthermore, as mutations
in chromatin remodeling complexes are becoming increasingly associated with cancer
(Wilson and Roberts, 2011), whole-genome NOMe-seq is an ideal approach to address
the effects that these mutations have, both on nucleosome positions and DNA methyla-
tion, and can further investigate whether chromatin remodeling defects are dependent
on DNA methylation state.
Figure 4.3: NOMe-seq displays nucleosome occupancy profiles at specific loci and
globally.
(A) Broad view of the ATM promoter using a newly developed module of the IGV viewer (Thorvaldsdóttir
et al., 2013) to visualize NOMe-seq BAM alignment files. The top two tracks indicate endogenous DNA
methylation (at HCG sites) in each of two GBM samples, while tracks 5 and 6 indicate GCH accessibility
of the same GBM samples. (Red) Methylated sites (for both HCG and GCH); (blue) unmethylated sites
(for both HCG and GCH). The promoters of ATM and NPAT are unmethylated (blue in top two tracks) and
nucleosome-depleted (i.e., accessible and therefore methylated, and thus red in tracks 5 and 6). The same
methylation and nucleosome occupancy pattern is seen for both GBM samples. Tracks 3 and 4 show
average methylation levels derived from these tracks: at each individual HCG, the number of reads
methylated at that HCG is divided by the total number of reads (methylated and unmethylated). Average GCH
methylation in tracks 7 and 8 is calculated as before but inverted (1-GCH) to indicate nucleosome protection
as used throughout the main figures. The tool and source code are publicly available for download at
the IGV project website: http://www.broadinstitute.org/igv/. (B,C) NOMe-seq reads were aligned to
CTCF (B) and TSSs (C). Nucleosome positioning in IMR90 cells is indicated on the y-axis by inaccessibil-
ity to M.CviPI (1-GpC methylation; teal line) and the number of MNase sequencing reads (blue line). For
MNase-seq, reads were aligned to 8709 CTCF sites, while 8687 CTCF sites had at least one GpC site
that was covered by a minimum of three reads (B). For TSS, 42,103 promoters were used for MNase-seq,
and 41,292 promoters had at least one GpC site that was covered by a minimum of three reads. (D,E)
Gene promoters were divided into quartiles based on transcription level (Hawkins et al., 2010), and the
corresponding M.CviPI inaccessibility (1-GCH, teal line) is plotted on the y-axis. (D) CpG island promoters.
(E) Non-CpG island promoters. The NDR is stronger in more highly expressed genes and, in some cases,
can be several hundred bp long to accommodate multiple nucleosomes.
Figure 4.4: Increased methylation in linker regions within different genomic contexts.
Linkers identified from IMR90 NOMe-seq are shown aligned to IMR90 chromatin accessibility (GCH, green
line) and methylation (HCG, black line). H can include any A, C, or T nucleotide.
Figure 4.5: DNA methylation occurs primarily at linker regions in nucleosomal arrays
flanking CTCF binding sites.
(A) Methylation levels around motifs bound by CTCF in HeLa cells (see methods). Association between
methylation and nucleosome positioning is verified in several WGBS datasets and one non-bisulfite
(MSRE) dataset. (B) Nucleosome occupancy is shown around CTCF sites for IMR90 cells. The black
line includes all CTCF-adjacent regions from Figure 4.5A. The red line includes only positions that have
zero CpGs within +/-370 base pairs (a region the size of four full nucleosomes). (C) Same analysis, but
using NOMe-seq chromatin accessibility from IMR90 cells.
Figure 4.6: NOMe-seq detects chromatin configuration around transcription factor
binding sites.
(A) NOMe-seq demonstrates unmethylated NDRs at CTCF sites in IMR90 cells, which are marked by a
peak in inaccessibility at the CTCF site itself. Well positioned nucleosomes flank CTCF sites, with DNA
methylation peaking in between nucleosomes. 0 indicates the middle of the CTCF binding motif. Increased
accessibility at the CTCF site is seen when nuclei are washed with high-salt buffer to disrupt CTCF binding
to DNA. (B) NOMe-seq shows unmethylated NDRs associated with NSRF1 sites in IMR90 cells,
which are marked by a peak of inaccessibility corresponding to the NSRF1 binding site itself that gradually
disappears with increasing salt concentration in the wash. Unlike CTCF, NSRF1 does not show well-phased
nucleosomes.
Figure 4.7: NOMe-seq reveals distinct chromatin configurations at CTCF sites and
associated with specific histone modifications and promoter types.
(A) NOMe-seq demonstrates unmethylated NDRs at CTCF sites in IMR90 and GBM cells, which are
marked by a peak in inaccessibility at the CTCF site itself. Well-positioned nucleosomes flank CTCF
sites, with DNA methylation peaking in between nucleosomes. 0 indicates the middle of the CTCF bind-
ing motif. CTCF binding sites were obtained from GSM935404. (B) NOMe-seq distinguishes the three
major promoter states in IMR90 cells: active H3K4me3-marked promoters are unmethylated
and contain a NDR upstream and well-positioned nucleosomes after the TSS. TSSs are indicated on
the x-axes as 0. Repressed/poised H3K27me3-marked promoters are unmethylated and nucleosome-
occupied. Methylated promoters are nucleosome-occupied. The y-axis indicates M.CviPI inaccessibility
(1-GpC; teal) and CpG methylation level. (C) In IMR90 cells, CpG island promoters are characterized by
a lack of CpG methylation, an upstream NDR, and well-positioned nucleosomes after the TSS. The major-
ity of CpG island promoters are unmethylated (11,165) and display the same pattern, while methylated
CpG island promoters (781) are nucleosome-occupied and inaccessible to M.CviPI. (D) Non-CpG island
promoters are generally characterized by CpG methylation and inaccessibility to M.CviPI, indicating nucle-
osome occupancy. The few unmethylated non-CpG island promoters (1397) are depleted of nucleosomes
upstream of the TSS, while the majority of non-CpG island promoters (4668) are nucleosome-occupied
and inaccessible to M.CviPI. M.CviPI inaccessibility is plotted (1-GCH) in teal and CpG methylation (HCG)
in black.
Figure 4.8: Low-input NOMe-seq detects distinct chromatin configurations of different
promoter types.
NOMe-seq reads from two biological samples, (A) E237101 and (B) E485101, are aligned to TSSs of active (left
panel), poised (middle panel) and inactive promoters (right panel). The chromatin state was determined
by training a ChromHMM model on ChIP-seq data of normal colonic mucosa, as described in Chapter 5.
Figure 4.9: Combinatorial epigenomic signatures reveal functional chromatin.
(A-C) Nucleosome occupancy levels (percent of GpCs protected) are shown stratified by the methylation
status of nearby CpGs (within 20 bp). For each element type, this analysis was performed twice: once
sampling randomly across all reads covering the same genomic position as the GpC (left plots, labeled
across all reads) and a second time using only the methylation status from the same read (right plots,
labeled on same read) (see Methods). All three examples show nucleosome depletion associated primarily
with the unmethylated state, but while predicted AP-1 binding motifs (A) display this in both population and
within-read profiles, enhancers and promoters marked by the opposing K4me3 and meC (B,C) show this
association only in the within-read analysis. 0 refers to the center of the AP-1 binding motif (A), the peak
of DNase HS within K4me1-marked regions (B) and TSSs (C). (D) Search strategy for finding divergent
chromatin alleles (DCA) by searching TSS regions for at least two reads with opposing chromatin profiles
in IMR90 cells. (E) Promoters that exist in both a nucleosome-depleted, unmethylated state and a
nucleosome-occupied, methylated state are enriched on the X chromosome. Seven hundred and forty-two DCA genes
were compared to randomized sets of 742 genes; 1000 trials were performed and the standard deviation is
shown for the number on each chromosome. A P-value was determined from the X chromosome using a
binomial test with the probability determined by the random trials. (F) DCA genes were compared to 1000
randomized gene sets for the number within 50 kb of known imprinted genes.
Figure 4.10: Validation of DCA promoters.
PCR amplicons were cloned, and several colonies were sequenced to visualize two distinct chromatin
configurations of an imprinted gene, SNRPN (A), an X-linked locus, DLG3 (B), and a newly identified
DCA gene, ZNF597 (C). (Black) DNA methylation of CpG sites; (teal) GpC accessibility. Pink bars indicate
nucleosome positioning.
Figure 4.11: NOMe-seq reveals the genome-wide epigenetic switching from neural
stem cell to glioblastoma.
(A) The union of 17,000 NDRs detected in either cell type is merged into a single dataset. Accessibility
and DNA methylation levels from -1 kb to +1 kb around the NDR center are displayed as heatmaps. Hierarchical
clustering is used to order NDRs and define 4 clusters based on accessibility patterns. Below each heatmap,
average levels are shown for each cluster, using the same color coding as at the left of the heatmaps.
Additional left-hand columns indicate overlap with chromatin states in normal epithelial cells (HMEC)
as defined by ENCODE ChromHMM. (B) Histone modification and histone variant ChIP-seq Z-score
signals (minus input) are averaged and shown as heat density plots from -3 kb to +3 kb around NDRs. (C) Motif
enrichment p-values in each cluster are calculated by HOMER v4.3. Only top-ranked motifs are shown here.
(D) An example of de novo NDR formation in GBMs. Methylation and accessibility signals below 10 are
drawn under the middle line to distinguish sites with a signal of 0 from sites covered by 0 reads.
Chapter 5
The context-dependent roles of
DNA methylation in directing the
functional organization of the
epigenome
This chapter is modified from a prepared manuscript on which I share co-first authorship
with Dr. Fides Lay. Dr. Fides Lay did all of the experimental work, while I did all of
the computational analysis.
5.1 Introduction
Eukaryotic genomes are controlled by inter-related and heritable sets of epigenetic
mechanisms, consisting of DNA methylation, nucleosome positioning and histone
modifications, which can determine gene activation potential. DNA methylation, one of the
most-studied epigenetic mechanisms, is the covalent addition of a methyl group to the
cytosine of CpG dinucleotides. It has been known for many years that DNA methylation is
critical for suppressing transcriptional activity in normal cells, particularly during
imprinting, X-inactivation and silencing of retrotransposons, though recent studies suggest that
DNA methylation may play a more subtle role in fine-tuning the silencing of gene
expression rather than acting as a direct silencing mechanism (Jones, 2012; Rivera and Ren, 2013).
As many as 25 million CpG sites have been shown to be methylated in the human
genome, with the exception being those located in CpG-rich regions, or CpG islands
(CGI) (Lister et al., 2009; Ziller et al., 2013). This methylation pattern is faithfully copied
in a cell-cycle dependent process mediated by DNA methyltransferases DNMT1 and
DNMT3A/B (Jones and Liang, 2009; Sharma et al., 2011).
Another major component of transcriptional regulation, which has recently undergone a
revival due to the advancement of genome-wide mapping technology, is nucleosome
positioning (Bell et al., 2011; Lorch et al., 1987; Schones et al., 2008). The nucleosome is
the primary unit of chromatin structure and consists of 147 bp of DNA wrapped around
a histone octamer of H2A, H2B, H3 and H4. The organization of nucleosomes, along with
modifications on the histone tails, is important for maintaining a balance between com-
paction and accessibility of the genome by transcription factors and other DNA binding
proteins during cellular processes such as transcription, replication and repair (Cairns,
2009; Li et al., 2007). Various factors such as underlying DNA sequences, chromatin
modifications and DNA binding factors have been described to play a role in the posi-
tioning of nucleosomes (Bell et al., 2011; Tillo and Hughes, 2009; Valouev et al., 2011).
Often overlooked and sometimes controversial, however, is the role of DNA
methylation in directing nucleosome positioning in mammalian genomes (Bell et al., 2011;
Chodavarapu et al., 2010; Portela et al., 2013; Valouev et al., 2011).
Epigenetic changes, in particular aberrant DNA methylation and silencing of CGI
promoters, are a common signature of cancer (Baylin and Jones, 2011). This
observation, combined with the fact that more than 60% of promoters are located in CGIs,
has driven the focus on CGIs as a model for studying epigenetic regulation (Deaton and
Bird, 2011; Irizarry et al., 2009; Portela et al., 2013; Tazi and Bird, 1990). The rise
of the epigenomic era has revealed that CGI promoters may not be a homogeneous
class and that regulatory regions outside of CGI promoters, such as CpG shores
and non-CpG islands, may also play a role in tumorigenesis (Doi et al., 2009; Irizarry
et al., 2009; Rach et al., 2011). It is becoming increasingly clear, however, that the study of
gene regulation requires a holistic understanding of how the components of the epigenetic
machinery influence each other (Rivera and Ren, 2013). Despite the extensive study
of DNA methylation changes in cancer, we still lack an understanding of how DNA
methylation contributes to the establishment of the epigenetic landscape (Jones, 2012).
Here, we compared the colon cancer cell line HCT116 with its hypomethylated
derivative, DKO1, to evaluate the interaction between DNA methylation, nucleosome
positioning and histone modifications. Our study coupled NOMe-seq, ChIP-seq and
RNA-seq to generate an integrated map of chromatin architecture and gene expression
for HCT116 cells and for DKO1 cells, which have been genetically engineered to have a
complete depletion of DNMT3B and are hypomorphic for DNMT1 (Egger et al., 2006; Rhee
et al., 2002; Sharma et al., 2011). Using this model, we specifically profiled the distinct
chromatin structures of CGI and non-CGI promoters and elucidated how perturbations
in the global DNA methylation pattern may directly alter the functional organization of the
cancer epigenome and thus gene transcription.
5.2 Materials and methods
5.2.1 Cell Culture
This part was done by Dr. Fides Lay.
HCT116, obtained from ATCC, and DKO1 cells were cultured under recommended
conditions at 37°C and 5% CO2 in McCoy's 5A media supplemented with 10% FBS and
penicillin/streptomycin.
5.2.2 Genome-wide nucleosome footprinting assay
Library construction was done by Dr. Fides Lay.
NOMe-seq was performed as previously described (Kelly et al., 2012). Briefly,
exponentially growing cells were washed with PBS, trypsinized and incubated with ice-cold
lysis buffer (10 mM Tris, pH 7.4, 10 mM NaCl, 3 mM MgCl2, 0.1 mM EDTA and 0.5% NP-40)
for 5 minutes on ice to isolate intact nuclei. Nuclei were washed with ice-cold wash
buffer (10 mM Tris, pH 7.4, 10 mM NaCl, 3 mM MgCl2, 0.1 mM EDTA), resuspended
in ice-cold 1x GpC buffer (New England Biolabs) and treated with 200 U of M.CviPI
enzyme supplemented with 1.5 µl S-adenosylmethionine (SAM) for 7.5 minutes, followed by
a boost of 100 U enzyme and 0.75 µl SAM for an additional 7.5 minutes. Genomic DNA
was isolated by standard phenol-chloroform extraction and ethanol precipitation. WGBS
libraries were generated using 2-5 µg of DNA as previously described and sequenced
on a HiSeq 2000 (Berman et al., 2012; Lister et al., 2009). Sequencing reads were
mapped to the hg19 genome and methylation levels of CpG and GpC dinucleotides
were determined using a previously described pipeline (Kelly et al., 2012; Liu et al., 2012).
5.2.3 Hidden-Markov model-based approach of NDR detection
A two-state beta-binomial HMM was adapted from a previously described method
(Chapter 3) to segment regions into Methyltransferase Accessible Regions (MARs) and
Methyltransferase Protected Regions (MPRs) based on the methylation value of GCH in
HCT116 and DKO1 cells, with the model trained separately for each biological
replicate (Molaro et al., 2011). GCH coverage information and the Viterbi
algorithm were used to decode the state of each GCH, and only segments containing at
least 3 GCHs in the same state were categorized as MARs or MPRs.
A one-sided binomial test was used to calculate the significance level of each MAR in
comparison to all MPRs present in the adjacent +/-100kb region, with only MARs having
an FDR-corrected p-value<0.01 considered significant. MARs longer than 100 bp
were considered NDRs. For downstream analysis such as in Figure 5.2a, only NDRs
that perfectly overlapped in the two biological replicates of each cell line were used.
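The decoding and segment-filtering steps above can be illustrated with a short Python sketch. It is not the implementation used in this work: the emission and transition parameters are made-up values (here they are simply supplied, whereas the model was trained per replicate), the function names are hypothetical, the segment length is taken as the span between the first and last GCH, and the binomial significance test against neighboring MPRs is omitted.

```python
import math

MPR, MAR = 0, 1  # protected / accessible states (labels chosen for this sketch)

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def betabinom_logpmf(k, n, a, b):
    """Log-probability of k methylated reads out of n under BetaBinomial(n, a, b)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)

def viterbi(obs, emis, log_trans, log_init):
    """obs: list of (methylated_reads, total_reads), one entry per GCH site.
    emis: list of (a, b) beta-binomial parameters, one pair per state."""
    states = range(len(emis))
    score = [log_init[s] + betabinom_logpmf(*obs[0], *emis[s]) for s in states]
    back = []
    for k, n in obs[1:]:
        new_score, ptr = [], []
        for s in states:
            best = max(states, key=lambda p: score[p] + log_trans[p][s])
            ptr.append(best)
            new_score.append(score[best] + log_trans[best][s]
                             + betabinom_logpmf(k, n, *emis[s]))
        score = new_score
        back.append(ptr)
    state = max(states, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):  # trace the best path backwards
        state = ptr[state]
        path.append(state)
    return path[::-1]

def call_segments(positions, path, min_sites=3):
    """Group consecutive sites in the same state; keep runs with >= min_sites sites."""
    segments, start = [], 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[start]:
            if i - start >= min_sites:
                segments.append((positions[start], positions[i - 1], path[start]))
            start = i
    return segments

def call_ndrs(segments, min_length=100):
    """MARs longer than min_length bp (length cutoff exposed as a parameter)."""
    return [(s, e) for s, e, state in segments
            if state == MAR and e - s + 1 > min_length]
```

In practice the per-site counts would come from the GCH methylation calls of one replicate, and the resulting NDR lists from the two replicates would then be intersected.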
5.2.4 Defining classes of promoters
We merged the 2 NOMe-seq replicates of HCT116 and of DKO1 and included all known TSSs
as annotated by UCSC, filtering for regions having at least 10 reads and 3 WCG
sites from -300 to +500bp around the TSS. For both HCT116 and DKO1 cells, we considered
regions having a methylation level less than 5% as unmethylated. We used 60% and 35%
as the cutoffs for methylated regions in HCT116 and DKO1, respectively. The different
cutoffs for methylation level were determined based on the distribution of methylation
values in the promoter regions of each cell line. Based on these criteria, we included
15,000 CGI promoters and 8,000 non-CGI promoters.
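The filtering and thresholding rules above can be written as a small Python sketch. The function names are hypothetical, methylation levels are assumed to be fractions between 0 and 1, and whether a promoter exactly at a cutoff counts as methylated is an assumption of this sketch, since the text does not specify it.

```python
UNMETHYLATED_MAX = 0.05                          # below 5% = unmethylated (both cell lines)
METHYLATED_MIN = {"HCT116": 0.60, "DKO1": 0.35}  # cell-line-specific methylated cutoffs

def passes_coverage_filter(n_reads, n_sites, min_reads=10, min_sites=3):
    """At least 10 reads and 3 informative sites in the -300 to +500 bp window."""
    return n_reads >= min_reads and n_sites >= min_sites

def methylation_state(level, cell_line):
    """Return 'U', 'M', or None for intermediate methylation."""
    if level < UNMETHYLATED_MAX:
        return "U"
    if level >= METHYLATED_MIN[cell_line]:  # inclusive cutoff is an assumption
        return "M"
    return None

def classify_promoter(hct116_level, dko1_level):
    """Return 'UU', 'MM', 'MU' (or 'UM'), or None if either level is intermediate."""
    first = methylation_state(hct116_level, "HCT116")
    second = methylation_state(dko1_level, "DKO1")
    if first is None or second is None:
        return None
    return first + second
```

Promoters with intermediate methylation in either cell line return None here and would simply be left out of the UU, MM and MU classes.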
5.2.5 ChIP-seq
ChIP-seq library construction was done by Dr. Fides Lay and Heather Witt in Dr. Peggy
Farnham's lab
ChIP assays were performed using 50 µg of chromatin as previously described and
according to ENCODE's guidelines (Kelly et al., 2010; Landt et al., 2012). The following
antibodies were used: H2A.Z (Abcam, ab4174), H3K4me3 (Active Motif, 39160),
H3K4me1 (Active Motif, 39298), H3K27Ac (Active Motif, 39297), H3K27me3 (Active
Motif, 39155). Genome-wide libraries were generated from 20 ng of purified ChIP and
input DNA, barcoded and sequenced as 50-bp single-end reads on a HiSeq 2000 using a
previously described protocol (Barski et al., 2007; Kelly et al., 2012). Sequencing reads were
mapped to hg19 using bwa, removing non-unique reads and PCR duplicates. All ChIP-seq
reads were extended by each library's own mean fragment size, which was estimated
using the default settings of HOMER v.4.3's makeTagDirectory command. Each
dataset was normalized into a single value for each genomic position using the default
settings of Wiggler with globalmap_k20tok54 as the mappability parameter. The mean wiggler
value was calculated in 10-bp bins (Gerstein et al., 2012). To normalize variations between
biological replicates, we modified a previously described method and performed Z-score
transformation by subtracting the mean wiggler value across the genome and dividing
by the standard deviation of the genome-wide wiggler subtraction value (Xie et al., 2013).
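As a rough illustration of the binning and Z-score steps, the Python sketch below averages a per-base signal into 10-bp bins and standardizes the binned track against its own genome-wide mean and standard deviation. The function names are hypothetical, the use of the population standard deviation is an assumption, and details of the modified method (such as how the input signal enters the calculation) are not reproduced.

```python
import statistics

def bin_means(signal, bin_size=10):
    """Mean signal in consecutive non-overlapping bins (last bin may be shorter)."""
    return [statistics.fmean(signal[i:i + bin_size])
            for i in range(0, len(signal), bin_size)]

def zscore_track(values):
    """Subtract the genome-wide mean and divide by the genome-wide standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD; an assumption of this sketch
    if sd == 0:
        raise ValueError("constant track: standard deviation is zero")
    return [(v - mean) / sd for v in values]
```

Applying zscore_track to each replicate separately puts the replicates on a common scale before they are averaged or compared.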
5.2.6 ChromHMM
Segmentation and determination of chromatin states were performed as previously
described (Ernst et al., 2011). The model was trained using histone modification ChIP-seq
data shared among colonic mucosa, HCT116 and DKO1 cells to generate 11 chromatin
states.
5.2.7 RNA-seq
One replicate of the RNA-seq library was prepared by Dr. Fides Lay and the other replicate
was prepared by Dr. Adam Blatter in Dr. Peggy Farnham's lab
Cells were washed with PBS and subsequently lysed in Trizol. Total RNA from
two independent cultures was purified using Direct-zol RNA MiniPrep (Zymo Research)
and libraries were constructed using the poly-A selection method of the TruSeq RNA
Sample Prep Kit (Illumina) according to the manufacturer's instructions. Sequencing reads
were mapped to the reference genome using TopHat v.1.2, filtering out non-unique
reads and PCR duplicates. FPKM values were calculated using Cufflinks v.2.1.1 with
the parameters: -F 0.3 -u -b hg19.fa. Gene annotation was obtained as a GTF file from
the UCSC Genome Browser. The read count for each gene was extracted using htseq-count
v0.5.4p3. edgeR v.3.6.0 was used to determine the differentially expressed genes
between the two cell lines using two biological replicates for each cell line. Only genes
with FDR-corrected p-value<0.05 were considered differentially expressed.
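For readers unfamiliar with FDR correction, the Python sketch below shows the Benjamini-Hochberg adjustment followed by the 0.05 cutoff. It is only an illustration: the text does not state which FDR procedure was applied, so the choice of Benjamini-Hochberg is an assumption, the function name is hypothetical, and the actual p-values came from edgeR rather than from code like this.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def significant_genes(gene_pvalues, cutoff=0.05):
    """gene_pvalues: dict of gene name -> raw p-value; returns genes with FDR < cutoff."""
    genes = list(gene_pvalues)
    adjusted = benjamini_hochberg([gene_pvalues[g] for g in genes])
    return [g for g, q in zip(genes, adjusted) if q < cutoff]
```

Note that a raw p-value below 0.05 does not guarantee significance after adjustment, which is why the cutoff is applied to the corrected values.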
5.3 Results
5.3.1 NOMe-seq detects NDR changes upon the global loss of DNA
methylation
Using NOMe-seq, we characterized the relationship between DNA methylation and
chromatin accessibility by analyzing two biological replicates of HCT116 and the
severely hypomethylated DKO1 cells (Figure 5.1). To examine the effects of DNA
methylation loss on discrete genomic regions, we identified four clusters of nucleosome
depleted regions (NDRs) that were present in either cell type based on a beta-binomial
Hidden-Markov Model (HMM). We found that only a small fraction of NDRs was unique
to DKO1 cells (C3) with the majority conserved (C1,C2,C4) between HCT116 and DKO1
cells (Figure 5.2a, Materials and Methods). Most of the conserved NDRs are flanked by
well-phased nucleosomes and are associated with unmethylated genomic regions that,
in the case of C1 and C2, are also highly enriched for the active H3K27ac histone mark
as well as the permissive H2A.Z, H3K4me3, and H3K4me1 marks and depleted for the
polycomb repressive complex (PRC) mark, H3K27me3 (Figure 5.2a-b). C3 NDRs, on
the other hand, exhibit weakly-phased nucleosomes and are specific to regions that
lose DNA methylation in DKO1 cells. These regions also show a moderate increase in
active and permissive histone marks which are mostly absent in the parent cells.
Examining the distribution of functional chromatin states in each NDR cluster, we
found that C1 and C2 consisted largely of active and weak CGI promoters in both
HCT116 and DKO1 cells, whereas C3 and C4 contained higher fractions of distal reg-
ulatory regions (Figure 5.2c). Notably, we observed an increased enrichment of strong
enhancer state in C3 in DKO1 cells, suggesting that the loss of DNA methylation may
initiate the activation of enhancers that have previously been primed in HCT116 cells.
These distal NDRs are enriched for transcription factor motifs such as HIF1b and AP-1,
and may be responsible for the increase of expression in nearby genes. C4, on the
other hand, contains CTCF-associated weak enhancer state in HCT116 cells which is
Figure 5.1: Significant loss of DNA methylation in DKO1 cells does not dramatically
increase global accessibility
Genome-wide NOMe-seq was performed for HCT116 (x-axes) and DKO1 (y-axes) cells and (a) endogenous
DNA methylation and (b) accessibility levels of the two cell lines were compared. Methylation levels
of CpG and GpC sites were measured in genomic windows of 200bp. (c) Genomic windows (200bp)
exhibiting change in DNA methylation (x-axes, HCT116-DKO1) were compared to changes in accessibility
(y-axes, HCT116-DKO1). Dashed lines represent 20% change in methylation (vertical) and accessibility
(horizontal).
dramatically reduced in DKO1 cells. Strikingly, we also observed a trend of increased
poised CGI promoter state in all four NDR clusters which is consistent with the low level
increase of H3K27me3 seen in DKO1 cells (Figure 5.2b-c), suggesting that rearrange-
ment of the chromatin landscape may occur in the absence of DNA methylation inde-
pendent of accessibility changes in order to maintain the balance between chromatin
accessibility and compaction.
Figure 5.2: NOMe-seq detects NDRs and changes in NDRs following global
hypomethylation
(a) Hidden-Markov Model (HMM) is applied to identify Methyltransferase Accessible Regions (MARs) and
MARs with length more than 100-bp are considered as nucleosome-depleted regions (NDRs). Only NDRs
that were overlapping between the two biological replicates of HCT116 and DKO1 cells are included in this
heatmap (n=22,460). NOMe-seq reads are aligned to the center of the NDRs and plotted +/-1-kb. NDRs
are hierarchically clustered based on the accessibility of DKO1 cells and the clusters are separated by
dashed horizontal lines. (b) Z-score is calculated for each histone mark and the value is plotted +/-5kb
from the center of the NDRs, following the order seen in (a). (c) Each NDR in both cell lines is annotated
based on its chromatin state as defined by chromHMM model and the distribution of chromatin states within
each cluster is shown as a bar graph. The total number of NDRs contained in each cluster is indicated.
5.3.2 Reorganization of nucleosomes occurs in the absence of DNA
methylation at CGI promoters
Due to the established association between DNA methylation and gene silencing in
promoters, we further examined the contribution of DNA methylation in establishing dif-
ferent promoter architectures. We first analyzed the broad class of CGI promoters and
stratified them based on the methylation status of HCT116 and DKO1 (Materials and
methods). We found that CGI promoters are largely unmethylated in both cell types
(Figure 5.3, left panel, UU), which is consistent with reports that CGIs are generally
protected from DNA methylation (Deaton and Bird, 2011; Weber et al., 2007). We also
noted that while the wild-type cells exhibit bimodal DNA methylation pattern, hyper-
methylated CGI promoters are scarce in DKO1 cells where global DNA methylation is
significantly reduced. Subsequently, we identified a small subclass of promoters that
retained at least a residual methylation of 30% (MM) and promoters that dramatically
lost DNA methylation (MU) in DKO1 cells.
To analyze the effects of DNA methylation on nucleosome organization, we hierar-
chically clustered each subgroup of promoters based on the accessibility level of DKO1
cells. The majority of UU promoters have an open architecture strongly associated with
active promoters, namely they have highly accessible regions around the transcription
start sites (TSS) which are flanked by at least three well-phased nucleosomes at both
5 and 3 directions (Figure 5.3, left panel) (Schones et al., 2008). Not all unmethylated
CGI promoters have open and well-organized nucleosomes, but methylated CGI pro-
moters, even at residual level such as seen in DKO1 cells, are consistently inaccessi-
ble, lack nucleosome phasing and invariably closed (Figure 5.3, right panel). Complete
depletion of DNA methylation in MU promoters, however, results in the reorganization
of surrounding nucleosomes whereby promoters gain well-phased nucleosomes and/or
accessibility in DKO1 cells that are absent in the methylated parent HCT116 cells (Fig-
ure 5.3, left panel), thereby suggesting that DNA methylation may directly influence the
organization of nucleosomes in CGI promoters.
Figure 5.3: The loss of DNA methylation triggers nucleosome reorganization in CGI
promoters
NOMe-seq reads are aligned to 15,692 annotated CGI TSS and promoters are categorized based on
the methylation levels in both cell types as Unmethylated in HCT116 and Unmethylated in DKO1 (UU),
Methylated in HCT116 and Methylated in DKO1 (MM) and Methylated in HCT116 and Unmethylated in
DKO1 (MU). The number of promoters that fall in each class is shown on the left. Heatmaps are generated
for DNA methylation (left panel) and accessibility (right panel) +/-1kb from the TSS and clustering of each
row was done based on the accessibility pattern of DKO1 cells. DNA methylation and accessibility level in
MM group are zoomed in at the right most panel.
5.3.3 The loss of DNA methylation in CGI promoters results in the acqui-
sition of active and poised histone marks
To better understand the functional relevance of changes in accessibility and nucleosome
position in the absence of DNA methylation, we inspected the histone modification
and gene expression patterns of CGI promoters at various levels of methylation.
On average, UU promoters have a high enrichment of H2A.Z variant in the histone
core which is also strongly decorated by active and permissive histone marks (Figure
5.4a, left panel), consistent with our observation that the majority of UU promoters are
active and highly expressed in both cell types (Figure 5.4b-c, left panel). In contrast,
methylated CGI promoters as seen in HCT116 cells are largely devoid of permissive
histone modifications as well as polycomb repressive H3K27me3 and heterochromatic
H3K9me3 marks (Figure 5.4a, middle and right panel).
The reduction of DNA methylation subsequently initiates a dramatic remodeling of
CGI promoters and illustrates the interplay between various epigenetic components.
We observed that despite retaining residual methylation, MM promoters surprisingly
exhibit a minimal increase in H3K4me3 and H3K27me3 (Figure 5.4a, middle panel).
Invariably, we found that about 50% of the promoters had switched from an inactive to a
poised state (Figure 5.4b, middle panel) and gained a low-level increase in expression
in DKO1 cells (Figure 5.4c, middle panel). Changes in histone modifications are even
more pronounced in MU promoters where the complete ablation of DNA methylation
results in the establishment of a chromatin structure that overall is more permissive for
transcription (Figure 5.4, right panel). While on average the MU promoters do not gain
an enrichment of active H3K27ac mark, we observed a significant gain of permissive
H3K4me3 mark and H2A.Z variant along with a low level increase of H3K4me1 (Figure
5.4a, right panel). Aside from the active and permissive marks, we also found a striking
increase of the polycomb repressive complex mark H3K27me3 in the absence of DNA
methylation (Figure 5.4a), consistent with previous reports (Jin et al., 2009; Komashko
and Farnham, 2010). Functionally, the rearrangement of histone modification pattern in
the MU promoters results in the establishment of both active and poised promoter state
in DKO1 cells (Figure 5.4b, right panel), and a more dramatic change in expression
compared to what we observed in MM promoters (Figure 5.4c, right panel).
5.3.4 MU promoters fall into two distinct chromatin states marked by
modification at H3K27
More detailed analysis on the architectures of CGI promoters that lose DNA methyla-
tion in DKO1 cells reveals functionally distinct regulatory classes within CGI promoters.
Earlier, we found that approximately three quarters of MU promoters gain well-phased
Figure 5.4: CGI promoters that lose DNA methylation acquire active and poised his-
tone marks
(a) Enrichment level of histone marks is shown +/-3kb around the TSS as the average of all promot-
ers in each class. Enrichment level, expressed in terms of z-score was calculated based on normalized
experimental wiggler value compared to the input. (b) Distribution of chromatin states for each promoter
class in both cell types is shown as a bar chart. Chromatin states of promoters are defined based on
the chromHMM model (Ernst et al., 2011). Others include chromatin states covering various enhancers,
transcribed and heterochromatic regions. (c) Transcript level (based on FPKM) for each promoter class is
shown for two biological duplicates for both cell types.
nucleosomes without dramatic accessibility changes following the loss of DNA methyla-
tion while the remaining quarter of promoters exhibit a dramatic increase in accessibil-
ity at the TSS which are flanked by well-positioned nucleosomes, thereby adopting an
open chromatin configuration reminiscent of the UU promoters (Figure 5.3, Figure 5.5a).
Separating the MU promoters into two clusters based on their nucleosome pattern, we
found that the average enrichment signal for H3K27me3 was driven by the subset of
MU promoters that gain well-phased nucleosomes but not accessibility in DKO1 cells
(Figure 5.5b). Approximately 30% of the MU promoters became active and gained expression
whereas as many as 40%, including the CYP4X1 locus, became poised, suggesting
that an alternative repressive mechanism mediated by polycomb is critical in regulating
gene expression of CGI promoters in the absence of DNA methylation (Figure 5.5c-d,
left panel, Figure 5.5e) (Gal-Yam et al., 2008; Jin et al., 2009). Remarkably, we also
observed that the configuration of the poised state in DKO1 cells occurs such that CGI
promoters adopt a normal-like chromatin structure and gene expression pattern (Figure
5.6).
Furthermore, we found that MU promoters that exhibited an increase of accessi-
bility (Figure 5.5b, right panel) also gain a low level of active H3K27ac mark, but lack
H3K27me3 in DKO1 cells compared to the parent cells and the cluster of MU promoters
that did not gain accessibility (Figure 5.5b, left panel). Of these promoters, more than
half, including ZNF214, became active and gained expression (Figure 5.5c-d, right panel,
Figure 5.5f). This result is consistent with the previous observation that gene reactivation
following DNA demethylation requires the formation of an accessible region or NDR at the
TSS (Lin et al., 2007; Yang et al., 2012). These new NDRs, moreover, are enriched for
binding sites of the Sp1 transcription factor, which has been shown to protect CGIs from
methylation (Gebhard et al., 2010), and may contribute to the maintenance of demethy-
lation of CGI promoters in DKO1 cells in regions devoid of polycomb. Importantly, our
observation highlights the role of modification on lysine-27 of histone H3 in the absence
of DNA methylation in determining whether a CGI promoter is accessible and active or
nucleosome-occupied and poised, thus illustrating the interaction between components
of epigenetic machineries.
Figure 5.5: Hypomethylated CGI promoters gain active and/or repressive histone
marks
(a) DNA methylation and accessibility level in MU group as shown in Figure 5.4 are zoomed in. Two clusters
(NP: Nucleosome positioning cluster; NDR: NDR cluster) are separated by a red dashed line. Examples in
each cluster are shown in (e) and (f), which are marked by red "e" and "f". (b) Enrichment level of histone
marks is shown +/-3kb around the TSS as the average of all promoters in the NP cluster and NDR cluster
identified in (a). Enrichment level, expressed in terms of z-score, was calculated based on normalized
experimental wiggler value compared to the input. (c) Distribution of chromatin states for each promoter
class in both cell types is shown as a bar chart. Chromatin states of promoters are defined based on
the chromHMM model (Ernst et al., 2011). Others include chromatin states covering various enhancers,
transcribed and heterochromatic regions. (d) Transcript level (based on FPKM) for each promoter class is
shown for two biological duplicates for both cell types. (e) An IGV example for the NP cluster marked as red
"e" in (a) with NOMe-seq, RNA-seq and ChIP-seq results. (f) An IGV example for the NDR cluster marked as
red "f" in (a) with NOMe-seq, RNA-seq and ChIP-seq results.
5.3.5 The loss of DNA methylation does not alter the chromatin structure
of non-CGI promoters
The significant effects of the loss of DNA methylation on the chromatin landscape, how-
ever, do not extend to non-CGI promoters, thus illustrating an important difference on
the regulation of CGI and non-CGI promoters. The majority of non-CGI promoters were
Figure 5.6: DNA methylation loss at CGI promoters re-establishes the poised histone
modification status and gene expression seen in normal cells
(a) DNA methylation, accessibility and enrichment level of histone marks is shown +/-3kb around the
TSS as the average of all promoters in each class. Enrichment level, expressed in terms of z-score was
calculated based on normalized experimental wiggler value compared to the input. (b) Distribution of
chromatin states for each promoter class in all three cell types is shown as a bar chart. Chromatin states
of promoters are defined based on the chromHMM model (Ernst et al., 2011). Others include chromatin
states covering various enhancers, transcribed and heterochromatic regions. (c) Transcript level (based
on FPKM) for each promoter class is shown for two biological duplicates for all of the cell types.
found in the MU category, with UU and MM promoters being less prevalent (Figure
5.7a, left panel), and in further contrast to the wide-spread accessibility seen in CGI
promoters, only a very small subset of the promoters were accessible (Figure 5.7a, right
panel). However, we surprisingly observed that the majority of MU promoters in non-
CGI exhibit a weaker, but organized array of nucleosomes in HCT116 despite their high
DNA methylation levels (Figure 5.7a). This phased nucleosome pattern was maintained
and became more defined in DKO1 cells with the apparent increase of accessibility in
the linker regions. Well-positioned nucleosomes have often been described as a feature
of unmethylated and permissive promoters, but the co-occurrence of DNA methylation
and weak nucleosome phasing suggests that the organization of nucleosomes in non-
CGI promoters occurs independent of DNA methylation status.
DNA methylation also does not appear to directly influence the histone modification
patterns of non-CGI promoters. Generally, UU promoters, specifically those exhibit-
ing accessibility, are marked by permissive histone modifications in both cell types.
Compared to CGI promoters, however, only a small fraction of those are active and
expressed (Figure 5.7c-d, left panel). Meanwhile, MM promoters are devoid of per-
missive histone marks and are also largely inactive (Figure 5.7b-d, middle panel). The
loss of DNA methylation, however, does not result in dramatic rearrangement of histone
modification pattern such as seen in CGIs as the majority of MU promoters in non-CGIs
do not gain active or poised histone marks and thus remain inactive and not expressed
in DKO1 cells (Figure 5.7b-d, right panel). This observation suggests that the loss of
DNA methylation is insufficient to remodel and reactivate non-CGI promoters and that
DNA methylation itself may not directly modulate the chromatin landscape of non-CGI
promoters.
5.3.6 Long range accessibility changes in partially methylated domains
reveal association with heterochromatic H3K9me3 domain
Having established the varying roles of DNA methylation in regulating the focal chro-
matin structure of CGI and non-CGI promoters, we next inspected whether the genome-
wide loss of DNA methylation also alters the long-range chromatin landscape. We cal-
culated methylation and accessibility levels of non-overlapping 1 Mb windows across the
genome, excluding CGIs, and ranked the windows in both cell types based on the aver-
age accessibility signals of DKO1 cells. We observed that while not dramatic, the higher
level of long range accessibility in DKO1 cells compared to HCT116 cells is directly cor-
related with the lower methylation level in the derivative cells (Figure 5.8a). Within both
cell types, however, we also noted windows of significantly reduced accessibility that
are associated with hypomethylated regions. These regions of lower accessibility coin-
cided with partially methylated domains (PMDs), which have previously been linked to
heterochromatic regions (Berman et al., 2012; Hon et al., 2012; Lister et al., 2011). We
examined the large-scale enrichment of repressive H3K9me3 and H3K27me3 marks as
well as the permissive H3K4me1 mark and indeed observed that windows of low accessibility
are correlated with the presence of H3K9me3 in both cell types (Figure 5.8a). Con-
sistent with previous reports, these H3K9me3-associated PMDs are mutually exclusive
with H3K27me3 repressive domains as well as permissive H3K4me1 domains (Hon
et al., 2012), but interestingly not H3K4me3 blocks. Furthermore, it is apparent from
within a 48 Mb region of chromosome 2 that residual DNA methylation is preferen-
tially retained outside of PMDs and that the most dramatic changes in accessibility
occur within the boundaries of PMDs which are also enriched for H3K9me3 blocks,
thus demonstrating the compact and inaccessible structure of heterochromatic regions
(Figure 5.8b).
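The windowed summary used in this analysis can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code behind Figure 5.8: the function names are hypothetical, the input is assumed to be per-site (position, value) pairs for one chromosome with sites inside CGIs already removed, and windows are ranked in ascending order of DKO1 accessibility.

```python
from collections import defaultdict

def window_means(sites, window=1_000_000):
    """Average per-site values in non-overlapping windows.

    sites: iterable of (position, value) pairs on one chromosome.
    Returns a dict mapping window index (position // window) to the mean value.
    """
    totals = defaultdict(lambda: [0.0, 0])  # window index -> [sum, count]
    for position, value in sites:
        acc = totals[position // window]
        acc[0] += value
        acc[1] += 1
    return {w: total / count for w, (total, count) in totals.items()}

def rank_windows(dko1_accessibility):
    """Window indices sorted by mean DKO1 accessibility (lowest first)."""
    return sorted(dko1_accessibility, key=dko1_accessibility.get)
```

The same window order derived from DKO1 accessibility would then be applied to the methylation and histone-mark tracks of both cell types so that all rows of the comparison line up.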
5.4 Discussions
Overwhelming evidence shows that investigating the role of epigenetics in regulating
transcriptional programs is key to understanding the mechanisms governing normal
mammalian phenotypes as well as diseases (Baylin and Jones, 2011). Much of what we
know about DNA methylation has been described through the lens of CpG island
(CGI) promoters whose patterns of aberrant methylation are linked to cancer (Baylin
and Jones, 2011; Deaton and Bird, 2011; Tazi and Bird, 1990). However, the remarkable
growth in the field of epigenomics, propelled by advances in high-throughput genomic
sequencing, reveals that the functions of DNA methylation may be more nuanced than
previously understood and only recently have we begun to appreciate the subtle differ-
ences in how DNA methylation influences the chromatin landscape of CGIs and non-
CGIs (Jones, 2012; Rivera and Ren, 2013). Understanding the specific mode of regula-
tion by DNA methylation thus requires a holistic examination of its interactions with other
epigenetic mechanisms, including histone modifications and nucleosome positioning.
Well-spaced arrays of nucleosomes in promoters have been shown to facilitate transcription
initiation, but our knowledge of how DNA methylation contributes to nucleosome
organization in different promoter classes remains inconclusive (Chodavarapu
et al., 2010; Kelly et al., 2012; Valouev et al., 2011). Our results indicate that while the
global loss of DNA methylation does not result in dramatic changes in chromatin accessibility,
significant reorganization of nucleosomes and chromatin modifications occurs in
a context-specific manner. We show that while DNA methylation may mask phased nucleosomes
in CGIs, the same is not true for non-CGIs where methylated and unmethylated
promoters are almost equally phased. Furthermore, dramatic rearrangement of the
overall chromatin landscape occurs strictly in hypomethylated CGI and not in non-CGI
promoters, suggesting that DNA methylation controls CGI promoters at the chromatin
level, but may have little contribution in regulating the chromatin structure of non-CGI
promoters. Importantly, our findings also demonstrate that the transcriptional regulation
of the overall permissive CGI promoters in the absence of DNA methylation is medi-
ated by polycomb whose preferential targeting to CGI has been previously described
(Mikkelsen et al., 2007) while the repression of non-CGI promoters in DKO1 cells is
mediated by the presence of well-positioned nucleosomes around the TSSs (Han et al.,
2011). DNA methylation thus serves to fine-tune the repressed promoter state by mod-
ulating the inherently poised chromatin state in CGI and by directly targeting inherently
nucleosome occupied regions of non-CGI promoters, revealing a fundamentally differ-
ent logic in the regulation of CGI and non-CGI promoters (Cairns, 2009; Tazi and Bird,
1990). Indeed, the selective reestablishment of poised and active chromatin in CGI
also suggests that activation potentials of CGI promoters are facilitated by the overall
permissive chromatin structure whereas activities of non-CGI promoters may depend
more on other factors such as the availability of tissue-specific transcription factors and
ATP-dependent remodeling complexes (Ramirez-Carrozzi et al., 2009).
The establishment of both active and poised chromatin state in DKO1 cells is
strongly correlated with the chromatin states of normal colonic mucosa, suggesting
that the loss of DNA methylation in CGI promoters may restore normal-like chromatin
landscape while demonstrating the distinct mechanisms by which DNA methylation
reprograms the epigenetic landscape in cancer cells. The acquisition of DNA methy-
lation may directly silence genes during tumorigenesis and as such, the removal of
DNA methylation in DKO1 cells reestablishes the active promoter landscape (Baylin
and Jones, 2011; Gal-Yam et al., 2008). The presence or absence of polycomb, however,
demarcates the non-homogenous classes of CGI promoters (Gal-Yam et al., 2008;
Rach et al., 2011). DNA hypermethylation has been shown to preferentially occur
on polycomb-repressed promoters whereby repressed genes become permanently
silenced in a process known as epigenetic switching (Ohm et al., 2007; Schlesinger
et al., 2007; Widschwendter et al., 2007). In the absence of DNA methylation, the
reestablishment of poised CGI promoter state suggests a reversal of this epigenetic
switch. Further studies will be needed, however, to determine whether the removal of
DNA methylation truly restores normal chromatin architecture.
Focal CGI hypermethylation and global hypomethylation at PMDs have also been
described to be a signature of cancer cells and are associated with widespread gene
silencing (Berman et al., 2012; Hon et al., 2012; Hovestadt et al., 2014). We con-
firmed the correlation between PMDs and H3K9me3 domains in our model and more
importantly, demonstrated that heterochromatic regions are less accessible compared
to the rest of the genome, including regions that are more highly methylated, thus
providing a mechanistic link for gene repression. Taken together, our study reveals
the different contributions of DNA methylation in regulating the focal and long-range
chromatin landscape. Understanding the mechanisms of these differences is critical and
will contribute to a better understanding of how DNA methylation changes influence
the chromatin structure of normal and cancer cells and, ultimately, gene expression.
Figure 5.7: The global loss of DNA methylation does not result in dramatic nucleosome
organization and chromatin remodeling in non-CGI promoters
(a) NOMe-seq reads were aligned to n annotated non-CGI TSS and promoters were categorized based
on the methylation levels in both cell types as Unmethylated in HCT116 and Unmethylated in DKO1 (UU),
Methylated in HCT116 and Methylated in DKO1 (MM) and Methylated in HCT116 and Unmethylated in
DKO1 (MU). The number of promoters that fall in each class is shown on the left. Heatmaps were gener-
ated for DNA methylation (left panel) and accessibility (right panel) +/-1kb from the TSS and clustering of
each row was done based on the accessibility pattern of DKO1 cells (right most panel). DNA methylation
and accessibility in UU and MM group are zoomed in. (b) Enrichment level of histone marks is shown
+/-3kb around the TSS as the average of all promoters in each class. Enrichment level, expressed in
terms of z-score was calculated based on normalized experimental wiggler value compared to the input.
(c) Distribution of chromatin states for each promoter class in both cell types is shown as a bar chart. Chro-
matin states of promoters are defined based on the chromHMM model (Ernst et al., 2011). Others include
chromatin states covering various enhancers, transcribed and heterochromatic regions. (d) Transcript level
(based on FPKM) for each promoter class is shown for two biological duplicates for both cell types.
Figure 5.8: Long range accessibility changes reveal association with PMDs and hete-
rochromatic H3K9me3 domains
(a) The genome is divided into non-overlapping 1 Mb windows. Mean DNA methylation, accessibility level,
histone modification (H3K9me3, H3K27me3, H3K4me1) and input ChIP-seq signal intensity
are calculated after excluding signals inside CGIs. Windows are sorted by accessibility level in DKO1
cells (marked with a red star). (b) An IGV example of a 48Mb region in chr2 showing the relationship between
accessibility, DNA methylation and H3K9me3.
Chapter 6
The analysis of epigenomic map
during oncogene induced
senescence in fibroblast cells
This chapter is adapted from a preliminary study carried out in collaboration with Dr. Amir Eden's lab.
6.1 Introductions
Decades ago, Hayflick and Moorhead first demonstrated that after a finite number
of divisions, a cell enters a replicative senescence phase in which the cell
cycle is irreversibly arrested (Hayflick and Moorhead, 1961). Many years later, Serrano
and his colleagues found that in vitro expression of the highly activated oncogene Ras
induces premature senescence only a few days after transfection (Serrano et al.,
1997). Two main pathways are involved in this oncogene-induced senescence
(OIS); they are shown in Figure 6.1 and differ slightly among cell contexts.
There is increasing evidence that some players, like p16INK4a, may
initiate the senescence response and be responsible for forming senescence-associated
heterochromatin foci (SAHF), while others, such as p53, may then take over to ensure
maintenance of the permanent arrest (Beauséjour et al., 2003), implying that some crucial
mediators of cellular senescence may be essential only temporarily and can be
absent at other times (Braig and Schmitt, 2006).
The most dramatic change during OIS, besides cell cycle arrest, is the formation of SAHF. SAHF
was first characterized in Ras induced senescence in IMR90 lung fibroblast cells (Narita et al.,
2003) and then confirmed in many other cell types (Di Micco et al., 2006; Michaloglou et al.,
2005). In immunofluorescence studies, SAHF was shown to largely overlap with H3K9me3 repressive
marks and their binding protein, heterochromatin protein 1 (HP1) (Narita et al., 2003). SUV39H1, a
histone methyltransferase for H3K9me3, has been shown to be tightly associated with Ras induced
senescence (Braig et al., 2005). Recent data indicated that HDAC1 is required for OIS in
melanocytes (Bandyopadhyay et al., 2007). Up-regulation of HDAC1 engages a number of cooperating
chromatin remodelers, including transient recruitment of the ATP-dependent SWI/SNF chromatin
remodeling protein Brahma (Brm1) to high molecular weight complexes containing RB, HDAC1, HP1b and
others. HDAC1 induced redistribution of nucleoplasmic HP1b into foci that co-localized with
SUV39H1, which rapidly increases H3K9me2 and progressively increases H3K9me3 (Wallace and
Orr-Weaver, 2005). The rapid but transient formation of HDAC1-induced HP1b/Brm1/RB complexes
points to Brm1 as the likely provider of the energy required for sliding and dislodging specific
subsets of nucleosomes in preparation for SAHF formation.
Somatic mutations, especially the subset termed driver mutations, play a crucial role in
converting cells from normal to malignant proliferation (Stratton, 2011). Carcinogenesis is a
multistep process caused by a combination of mutations in oncogenes or tumor suppressor genes, or
by epigenetic changes such as DNA methylation. Activation of the oncogene K-Ras and inactivation
of tumor suppressor genes, such as p53, are thought to be critical determinants of tumor
initiation and progression (Smith et al., 2002). However, studies around 2005 demonstrated that an
activating mutation of an oncogene alone, such as Ras or BRAF, acts as a barrier to tumorigenesis
(Braig et al., 2005; Chen et al., 2005; Collado et al., 2005; Michaloglou et al., 2005).
Suppression of additional tumor suppressor genes or activation of additional oncogenes is needed
to escape or bypass OIS and enter malignant tumorigenesis. Michaloglou et al. showed that nevi,
which frequently carry a BRAF mutation, may lack any apparent proliferative activity, often for
decades, before lesions in a small subset of cases eventually progress into malignant melanoma.
Sustained expression of oncogenic BRAF V600E in primary melanocytes also provokes senescence. More
importantly, they showed strong SA-β-Gal positivity and negativity for the proliferation marker
Ki67 in biopsy specimens of melanocytic nevi in situ, which confirmed that premalignant BRAF
V600E-induced senescence occurs in vivo (Michaloglou et al., 2005). Moreover, cellular senescence
may also interrupt or delay Ras driven lung carcinogenesis and work as a barrier for lymphoma
(Braig et al., 2005) and some other kinds of tumors. Interestingly, oncogene induced senescence
seems to be a dosage dependent process. A low level of chronic Ras activation in mouse tissue
stimulates cellular proliferation and mammary epithelial hyperplasias, while a high level of acute
Ras activation induces senescence that is Ink4a-Arf-dependent and irreversible following Ras
down-regulation (Sarkisian et al., 2007). This indicates that completely different downstream
response mechanisms may exist even for the same activating oncogene mutation. Understanding the
OIS mechanism may suggest a novel strategy to treat human cancers with OIS relevant and OIS
irrelevant features separately.
The Laird lab and others have shown that a common colorectal cancer subtype (CpG Island Methylator
Phenotype; CIMP) is strongly associated with a particular Raf mutation, and exhibits widespread
gain of heterochromatin marks such as promoter DNA methylation along with widespread loss of
euchromatin marks such as H3K4 methylation (Hinoue et al., 2009; Suzuki et al., 2010).
Furthermore, CIMP cells appear to escape OIS by transcriptionally inactivating the p53 downstream
gene IGFBP7 (Suzuki et al., 2010). Genetic or epigenetic silencing of IGFBP7 expression has been
reported to allow bypass of BRAF V600E induced senescence (Wajapeyee et al., 2008), although this
is still under debate (Schrama et al., 2010; Scurr et al., 2010). For Ras induced OIS, silencing
of p53 and p16 leads to bypass of OIS in human fibroblasts (Serrano et al., 1997), and in tumors
p53 is often mutated while p16 is usually silenced by hypermethylation. Based on these
observations and the propensity for various cancer subtypes to develop aberrant promoter
methylation, we believe that OIS may be common to many cancers. The epigenomic analysis of OIS
may be useful to distinguish different molecular subtypes in cancer.
DNA methylation and nucleosome positioning during OIS remain poorly understood. Recently, growing
evidence has indicated that genome-wide epigenetic changes may happen during OIS. Ras and BRAF
were found to be tightly correlated with DNA methylation changes in different cancers (Hinoue et
al., 2009; Patra, 2008). Activated Ras was reported to induce hypermethylation of several genes,
e.g. the RECK metastasis suppressor gene in the NIH/3T3 mouse fibroblast cell line, where it also
increased the DNMT3b expression level (Chang et al., 2006), or the pro-apoptotic Fas ligand
encoding gene in epithelial cells (Peli et al., 1999). A genome-wide RNAi screen in murine cells
revealed a pathway of 28 genes for K-Ras mediated epigenetic silencing of the Fas gene. DNMT1,
HDAC and HMT, which are all included in this list of 28 genes, are tightly associated with
genome-wide DNA methylation and histone modification changes (Gazin et al., 2007). SAHF may induce
aberrant DNA hypermethylation through HP1 dependent recruitment of DNMTs (Smallwood et al., 2007).
A study on the loss of linker histone H1 and Brm1's function in nucleosome sliding during cellular
senescence (Funayama et al., 2006) indicated that some nucleosome occupancy changes may also
happen. Very recently, however, a genome-wide DNA methylation study on mouse embryonic fibroblasts
(MEFs) using MeDIP technology did not observe any significant DNA methylation changes during OIS
(Kaneda et al., 2011), which seems to contradict our hypothesis. We noticed that SAHF is not
formed during OIS in MEFs, which suggests that a large number of DNA methylation changes may only
accompany SAHF formation during OIS. Therefore, we performed a preliminary genome-wide study on
DNA methylation changes during SAHF formation in OIS, using a fibroblast cell line as a model
system.
Figure 6.1: Oncogene induced senescence as a tumor barrier
Oncogene activation alone, such as Ras/Raf, can induce OIS and act as a barrier to tumorigenesis.
Additional mutations of some tumor suppressor genes lead to escape from OIS to a transformed
status. There are two main pathways found in OIS, which differ slightly among cell contexts.
p16INK4a may initiate the senescence response and be responsible for forming SAHF, while others,
such as p53, may then take over to ensure maintenance of the permanent arrest. There are two main
markers to identify OIS: cell cycle arrest and SAHF (main components: H3K9me2/3, HP1, HMGA2, macroH2A).
SAHF represses E2F target genes. It exists only in OIS but not in other senescence or aging processes
(still controversial).
6.2 Materials and Methods
6.2.1 Cell culture and retroviral infection
This part was done by Dr. Amir Eden's lab.
WI38 cells were cultured according to ATCC guidelines. Retroviral infection of WI38 cells was
performed as described in (Zhang et al., 2007). The control group was transfected with the pBabe
plasmid carrying no oncogene.
6.2.2 Immunofluorescence, antibodies, SAHF, and SA-β-gal staining
This part was done by Dr. Amir Eden's lab.
Antibody staining of H3K9me2 and macroH2A, DAPI staining for SAHF, and SA-β-gal staining
in senescent cells were performed essentially as described previously (Zhang et al.,
2007).
6.2.3 Fluorescence in situ hybridization
This part was done by Dr. Amir Eden's lab.
Fluorescence in situ hybridization of H3K9me2 and macroH2A was performed as
described in (Zhang et al., 2007).
6.2.4 DNA methylation assay and computational analysis
We used the Illumina Infinium HumanMethylation450 (HM450) BeadChip DNA methylation platform
(Illumina, San Diego, CA) to obtain DNA methylation profiles of three biological replicates each
of OIS and normal WI38 cells. Library preparation was conducted as described in (Cancer Genome
Atlas Research Network, 2013).
The level of DNA methylation was summarized as a beta (β) value calculated as M/(M+U), where M
and U are the methylated and unmethylated signal intensities, ranging from 0 to 1. Detection
p-values were also calculated with the methylumi package. Data points were masked as "NA" under
any of the following conditions: 1) detection p-value >0.05; 2) probe overlapped with repeats;
3) probe overlapped with SNPs. Only probes without any "NA" in all 6 samples were kept for the
downstream analysis.
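The beta value and masking rules described above can be sketched in a few lines of Python. This is an illustration only, not the methylumi-based R code used for this study; the function names, the dictionary layout and the toy intensities are assumptions made for the example.

```python
from typing import Dict, List, Optional


def beta_value(m: float, u: float) -> Optional[float]:
    """Return beta = M / (M + U), or None (standing for "NA") when there is no signal."""
    total = m + u
    if total <= 0:
        return None
    return m / total


def masked_beta(m: float, u: float, detection_p: float,
                in_repeat: bool, overlaps_snp: bool,
                p_cutoff: float = 0.05) -> Optional[float]:
    """Apply the three masking conditions from the text; None stands for "NA"."""
    if detection_p > p_cutoff or in_repeat or overlaps_snp:
        return None
    return beta_value(m, u)


def keep_complete_probes(
        betas: Dict[str, List[Optional[float]]]) -> Dict[str, List[float]]:
    """Keep only probes that have no NA in any sample."""
    return {probe: values for probe, values in betas.items()
            if all(v is not None for v in values)}


# Hypothetical toy data: probe ID -> one value per sample.
example = {
    "probe_a": [masked_beta(300, 100, 0.01, False, False),
                masked_beta(200, 200, 0.01, False, False)],
    "probe_b": [masked_beta(300, 100, 0.20, False, False),  # fails detection p-value
                masked_beta(250, 250, 0.01, False, False)],
}
kept = keep_complete_probes(example)
```

In this toy case only `probe_a` survives, because one sample of `probe_b` fails the detection p-value cutoff.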
R v3.0.0 was used for the statistical analysis. The Pearson correlation score was calculated for
each pair of samples. Samples were clustered by hclust with the "average" method in R. A
two-sample t-test was used to calculate the significance of differences between the OIS and
control groups. P-values were adjusted for multiple testing with the qvalue package. Genomic
annotation of probes was based on the Illumina HM450K manifest. As a background, the same number
of probes was randomly rearranged by coordinate. The enrichment ratio was calculated for each
genomic annotation category as the ratio between the number of significantly changed probes and
the number of random probes that fall into that category.
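The enrichment ratio described above can be sketched as follows. This is a minimal illustration under stated assumptions: the random set is drawn from a supplied background list of probes (the exact randomization scheme used here is not specified beyond the description above), and the function names, the `seed` parameter and the probe-to-category dictionary are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, Sequence


def category_counts(probes: Sequence[str],
                    annotation: Dict[str, str]) -> Counter:
    """Count how many probes fall into each annotation category."""
    return Counter(annotation[p] for p in probes)


def enrichment_ratio(selected: Sequence[str], background: Sequence[str],
                     annotation: Dict[str, str],
                     seed: int = 0) -> Dict[str, float]:
    """Observed count divided by random count for each category.

    The random set has the same size as `selected` and is sampled without
    replacement from `background` (an assumption of this sketch).
    Categories that contain no random probe are reported as infinity.
    """
    rng = random.Random(seed)
    random_probes = rng.sample(list(background), len(selected))
    observed = category_counts(selected, annotation)
    expected = category_counts(random_probes, annotation)
    ratios = {}
    for category, n_obs in observed.items():
        n_rand = expected.get(category, 0)
        ratios[category] = n_obs / n_rand if n_rand else float("inf")
    return ratios
```

A single random draw is noisy when only around a hundred probes are selected, so in practice one would probably average the random counts over many draws before taking the ratio.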
6.3 Results
6.3.1 Small but consistent changes of DNA methylation distinguish OIS
and normal fibroblast cells
OIS was successfully induced in WI38 cells and SAHF formed during OIS, as shown in Figure 6.2a.
The top 3,000 most variable probes were selected based on the standard deviation of DNA
methylation level among the samples. Hierarchical clustering on the distance between samples
showed the global relationship among samples. Replicates within each group tended to cluster
together, indicating that DNA methylation variation can distinguish OIS from normal fibroblast
cells. Furthermore, pairwise scatter plots of DNA methylation level across all informative probes
were generated. The Pearson correlation scores and the shapes of the scatter plots indicated that
the differences in DNA methylation between OIS and normal fibroblast cells are very small.
However, these subtle changes are very consistent across the three replicates, which indicates
that subtle DNA methylation changes may indeed happen during the OIS process.
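The probe selection and sample comparison steps in this section can be sketched as below. The analysis here was done in R; this Python version is only an illustration, and the function names and data layout (probe ID mapped to one beta value per sample) are assumptions.

```python
import math
from statistics import mean, stdev
from typing import Dict, List, Sequence


def most_variable_probes(betas: Dict[str, Sequence[float]], n: int) -> List[str]:
    """Return the n probe IDs with the largest standard deviation across samples."""
    ranked = sorted(betas, key=lambda probe: stdev(betas[probe]), reverse=True)
    return ranked[:n]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two samples measured on the same probes."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

For the clustering step, the selected probes would form the feature vector of each sample; the distance measure passed to the clustering routine is not stated in the text, so it is left out of this sketch.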
6.3.2 Significantly changed DNA methylation probes are few and
enriched in non-coding regions
Only 581 probes were identified with a mean beta value difference of more than 0.2 between the
two groups. Further, we used a two-sample t-test to identify significantly changed probes (q
value <0.05) and presented the methylation changes as a heatmap in Figure 6.3a. 73 probes were
hypermethylated, while 46 probes were hypomethylated. The genomic enrichment levels for hyper- and
hypomethylated probes are shown in Figure 6.3b. Significantly changed probes were mostly enriched
in 3' UTRs, enhancers and CpG oceans.
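The two selection criteria used in this section (mean beta difference and a multiple-testing corrected t-test) can be sketched as below. Two caveats apply. First, the Welch form of the t statistic is shown because it is the default of R's `t.test`; the text does not say which variant was used. Second, the Benjamini-Hochberg adjustment shown here is a simpler stand-in for the Storey q-value computed by the qvalue package, not the same method. Converting a t statistic into a p-value needs the t distribution and is left to a statistics library. All function names are hypothetical.

```python
import math
from statistics import mean, variance
from typing import List, Sequence


def mean_beta_difference(ois: Sequence[float], control: Sequence[float]) -> float:
    """Mean beta in the OIS group minus mean beta in the control group."""
    return mean(ois) - mean(control)


def welch_t(ois: Sequence[float], control: Sequence[float]) -> float:
    """Welch two-sample t statistic (unequal variances)."""
    diff = mean(ois) - mean(control)
    se = math.sqrt(variance(ois) / len(ois) + variance(control) / len(control))
    if se == 0:
        # Degenerate case: no within-group variation at all.
        return math.copysign(math.inf, diff) if diff else 0.0
    return diff / se


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

With only three replicates per group the t-test has little power, which is one reason a filter on the mean beta difference and a filter on the adjusted p-value can select quite different numbers of probes.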
6.4 Discussion
Originally, we planned to perform NOMe-seq, ChIP-seq and RNA-seq to comprehensively profile the
epigenomic and gene expression changes during OIS. This preliminary study with the Infinium 450K
array provided evidence that DNA methylation changes accompany SAHF formation during OIS. The
changes, however, are very subtle, which is unexpected given that the global heterochromatin
structure increases substantially during OIS. There are a few possible explanations for what we
observed in this study. First, heterochromatin marks, like H3K9me2/3, may not really increase
during OIS. Recent work by the same group that discovered SAHF showed evidence of independence
between repressive histone marks and SAHF formation (Chandra et al., 2012). ChIP-seq results for
H3K27me3 and H3K9me3 indicated almost no change in heterochromatin level after OIS. Furthermore,
H3K9me3 and H3K27me3 depleted cells still exhibited DAPI-dense SAHF formation upon Ras induction.
The condensed H3K9me2/3 signal observed in SAHF by immunofluorescence may simply be a side effect
of the 3-D condensation of chromosomes during OIS. Thus, it is possible that DNA methylation is
not greatly affected either. Second, the significantly changed probes are highly enriched outside
coding regions, which may indicate that DNA methylation changes happen mainly outside gene coding
regions, in Partially Methylated Domains (PMDs) or enhancers that are poorly covered by Infinium
450K array probes. Finally, fibroblast cells, as a model system, may not truly reflect the
epigenetic dynamics during the possible OIS process in tumorigenesis. Further investigation of DNA
methylation changes during OIS at base-pair resolution is still needed.
Figure 6.2: DNA methylation variation distinguishes OIS and normal fibroblast cells
(a) SA-β-gal staining is a marker for cellular senescence; it shows blue in OIS cells but not in the control
group. DNA is stained using DAPI. Immunofluorescence results for H3K9me2 and macroH2A show SAHF
formed in OIS cells but not in control cells. (b) Hierarchical clustering is used to measure the distance
between the three biological replicates in each group, based on the DNA methylation level of the 3,000
most variable probes. pBabe1, pBabe2 and pBabe3 represent the 3 biological replicates in the control group,
while Ras1, Ras2 and Ras3 represent the 3 biological replicates in the OIS group. The replicates in each
group tend to cluster together. pBabe1, however, shows a somewhat larger distance to the other two replicates
in the control group. (c) Pairwise scatter plots of DNA methylation level for all probes show the global
trend of DNA methylation changes within and between groups. The Pearson correlation score is calculated for
each comparison. Replicates within each group are tightly correlated with each other. The comparison between
the OIS and control groups also shows a high correlation score. The shape, however, indicates that weak
hyper- and hypomethylation consistently exists in OIS compared to the control group.
Figure 6.3: Significantly changed probes are few and enriched in non-coding regions
(a) A two-sample t-test is used to select significantly changed probes. 119 probes are found to be
significantly changed with q value less than 0.05 (73 probes are hypermethylated in the OIS group, while
46 probes are hypomethylated in the OIS group). A heatmap is generated to represent the DNA methylation
level in these probes. Hierarchical clustering is applied to reorder the rows and columns. (b) Genomic
enrichment of significantly changed probes is calculated as described in the Materials and Methods section.
The blue bar represents the enrichment level of hypermethylated probes, while red represents the enrichment
level of hypomethylated probes. The significantly changed probes are highly enriched in CpG oceans, 3' UTRs
and enhancers. The CpG ocean category represents the probes outside any Illumina annotation category.
Chapter 7
Discussion and future directions
7.1 Summary and discussion
Overall, we developed a new sequencing technology named NOMe-seq and a complete computational
pipeline, NOMeToolkit (including Bis-SNP), to process the data. Then, we applied these
experimental and computational tools to a DNA methylation deficient cell line system to explore
the causal effects of DNA methylation on cancer epigenome organization. We also tried to connect
OIS with cancer molecular subtypes in terms of DNA methylation.
We first developed NOMe-seq technology in a fibroblast cell line system. For the first time,
genome-wide DNA methylation and nucleosome positioning information can be explored simultaneously
at the single molecule level. We validated DNA methylation and nucleosome occupancy patterns at
promoters and distal regulatory elements that had already been observed with other technologies,
such as MNase-seq/ChIP-seq. Then, we adapted the technology to primary fresh frozen human tissue
to explore aberrant genome-wide nucleosome patterns in primary human tumors, whose existence has
been suggested by mutations in a large number of nucleosome remodeler genes found via tumor
whole-exome sequencing. Combined with the old-fashioned salt wash biochemistry method, our pilot
studies showed the possibility of exploring genome-wide transcription factor binding affinity in
vivo.
Bis-SNP was developed primarily for the WGBS project in The Cancer Genome Atlas (TCGA).
Identifying SNPs is important for accurate quantification of methylation levels and for
identification of allele-specific methylation events such as imprinting. Bis-SNP provides an
opportunity to investigate genetic and epigenetic information in a single experiment. Therefore,
methyQTL and epigenome-wide association studies become possible in a single experiment, at a
lower price, for large population screening. It is also extremely useful for the investigation of
epigenetics at the single cell level, since the DNA in a single cell cannot be sequenced twice.
However, bisulfite treatment tends to cause strand breaks during depurination, resulting in higher
error rates and more false negatives in Bisulfite-seq compared with genomic DNA sequencing.
Bisulfite conversion also reduces genome complexity, which makes read mapping and genotyping more
difficult. Further improvements in Bisulfite-seq technology and more advanced computational
methods are still urgently needed for accurate genotyping in Bisulfite-seq.
Bis-SNP can be readily adapted for NOMe-seq because it supports methylation calling in any
arbitrary cytosine context. We developed a complete computational pipeline, named NOMeToolkit,
for NOMe-seq. HMM segmentation was adapted from the segmentation method previously applied in
WGBS. Further improvements are required to incorporate the distance information between GCH sites
and the accessibility changes at different scales.
Finally, we applied NOMe-seq to investigate the cancer cell line HCT116 and its derived DNA
methylation deficient cell line. We found that loss of DNA methylation affected the cancer
epigenome in a context dependent manner, and restored chromatin structure and gene expression
levels to normal-cell-like states. It remains mysterious why cancer epigenetic drugs, such as
5-Aza-CdR, can affect DNA methylation level globally without activating oncogenes. Our study,
however, provides a hint that the global loss of DNA methylation may only affect specific genomic
contexts due to the local sequence context or epigenomic effects. Further studies on patient
samples before and after 5-Aza-CdR treatment are critical to answer this question thoroughly.
7.2 Perspective and future directions
As stated in the introduction chapter, breakthroughs in cancer research are often led by the
development and application of new technology, followed by another wave of revolution in data
interpretation. Over the last decade, next generation sequencing technology has generated a
tremendous amount of data and therefore continues to recruit more and more computational
biologists into this field. Yet compared to the huge amount of data in the IT world, public data
in biology and human health is still limited, though it keeps growing drastically. Big data
processing and future deep learning for personalized diagnosis require the migration and
re-creation of current biological software and methods in cloud based environments. Bis-SNP,
although based on the GATK map-reduce framework (McKenna et al., 2010), is still not built on an
industrial-level, highly efficient map-reduce framework. A highly efficient computing framework
is needed to speed up future applications of Bis-SNP to large population scale EWAS studies. Also,
advanced statistical methods are needed to address population substructure, a problem that is well
known in GWAS but may have even larger effects in EWAS, for the downstream interpretation of
genetic and epigenetic interactions.
On the micro-scale end, epigenetic substructure within a cell population, rather than a human
population, is also interesting and will be feasible to study in the near future. Many gradual
models have been proposed for tumor progression, including clonal evolution, the mutator phenotype
and stochastic progression (Navin et al., 2011). Recent advances in next generation sequencing at
the single cell level have enabled studies to infer tumor evolution by investigating genetic
alterations and gene expression changes inside tumor samples (Hou et al., 2012; Zong et al., 2012;
Shalek et al., 2013), while studies of epigenetic alterations in cancer progression at the single
cell level are still limited. Single cell reduced representation bisulfite-seq (scRRBS) technology
(Guo et al., 2013), which came out very recently, opened up the possibility of studying epigenetic
changes in tumors at single cell level and single base-pair resolution. Single cell WGBS/NOMe-seq
combined with Bis-SNP software will, in the near future, enable us to explore genetic and
epigenetic interactions at the same time in a single bisulfite-seq experiment. Further analysis of
genetic and epigenetic evolutionary trees within tumor samples could reveal the driver events
during cancer progression. After the identification of driver events in different cancer types,
single cell epigenomic studies combined with microfluidic technology will benefit early detection
and personalized diagnosis in cancer. The rapid development of personalized medicine requires the
application of more advanced machine learning methods to build diagnostic models based on large
public datasets and to identify genetic/epigenetic networks at different scales in order to
distinguish diseased cells from healthy ones. My dissertation presented here is a pioneering
study, but it is definitely nowhere near achieving this goal.
Towards heaven and ocean, our journey leads.
Bibliography
Adey, A. and Shendure, J. (2012). Ultra-low-input, tagmentation-based whole genome
bisulfite sequencing. Genome Res.
Andreu-Vieyra, C., Lai, J., Berman, B. P., Frenkel, B., Jia, L., Jones, P. A., and Coetzee,
G. A. (2011). Dynamic nucleosome-depleted regions at androgen receptor enhancers
in the absence of ligand in prostate cancer cells. Mol Cell Biol, 31(23):4648–62.
Andreu-Vieyra, C. V. and Liang, G. (2013). Nucleosome occupancy and gene regulation
during tumorigenesis. Adv Exp Med Biol, 754:109–34.
Antequera, F. (2003). Structure, function and evolution of cpg island promoters. Cell
Mol Life Sci, 60(8):1647–58.
Aronesty, E. (2011). ea-utils: Command-line tools for processing biological sequencing
data.
Balasubramanian, D., Akhtar-Zaidi, B., Song, L., Bartels, C. F., Veigl, M., Beard, L.,
Myeroff, L., Guda, K., Lutterbaugh, J., Willis, J., Crawford, G. E., Markowitz, S. D.,
and Scacheri, P. C. (2012). H3k4me3 inversely correlates with dna methylation at a
large class of non-cpg-island-containing start sites. Genome Med, 4(5):47.
Ball, M. P., Li, J. B., Gao, Y., Lee, J.-H., LeProust, E. M., Park, I.-H., Xie, B., Daley, G. Q.,
and Church, G. M. (2009). Targeted and genome-scale strategies reveal gene-body
methylation signatures in human cells. Nat Biotechnol, 27(4):361–8.
Bandyopadhyay, D., Curry, J. L., Lin, Q., Richards, H. W., Chen, D., Hornsby, P. J., Timchenko,
N. A., and Medrano, E. E. (2007). Dynamic assembly of chromatin complexes
during cellular senescence: implications for the growth arrest of human melanocytic
nevi. Aging Cell, 6(4):577–91.
Barski, A., Cuddapah, S., Cui, K., Roh, T.-Y., Schones, D. E., Wang, Z., Wei, G., Chepelev,
I., and Zhao, K. (2007). High-resolution profiling of histone methylations in the
human genome. Cell, 129(4):823–37.
Bartke, T., Vermeulen, M., Xhemalce, B., Robson, S. C., Mann, M., and Kouzarides, T.
(2010). Nucleosome-interacting proteins regulated by dna and histone methylation.
Cell, 143(3):470–84.
Baylin, S. B. and Jones, P. A. (2011). A decade of exploring the cancer epigenome -
biological and translational implications. Nat Rev Cancer, 11(10):726–34.
Beauséjour, C. M., Krtolica, A., Galimi, F., Narita, M., Lowe, S. W., Yaswen, P., and
Campisi, J. (2003). Reversal of human cellular senescence: roles of the p53 and p16
pathways. EMBO J, 22(16):4212–22.
Beck, S., Bernstein, B. E., Campbell, R. M., Costello, J. F., Dhanak, D., Ecker, J. R.,
Greally, J. M., Issa, J.-P., Laird, P. W., Polyak, K., Tycko, B., Jones, P. A., and
AACR Cancer Epigenome Task Force (2012). A blueprint for an international cancer
epigenome consortium. a report from the aacr cancer epigenome task force. Cancer
Res, 72(24):6319–24.
Beck, S., Olek, A., and Walter, J. (1999). From genomics to epigenomics: a loftier view
of life. Nat Biotechnol, 17(12):1144.
Bell, A. C. and Felsenfeld, G. (2000). Methylation of a ctcf-dependent boundary controls
imprinted expression of the igf2 gene. Nature, 405(6785):482–5.
Bell, J. T. and Spector, T. D. (2011). A twin approach to unraveling epigenetics. Trends
Genet, 27(3):116–25.
Bell, J. T. and Spector, T. D. (2012). Dna methylation studies using twins: what are they
telling us? Genome Biol, 13(10):172.
Bell, O., Tiwari, V. K., Thomä, N. H., and Schübeler, D. (2011). Determinants and
dynamics of genome accessibility. Nat Rev Genet, 12(8):554–64.
Berman, B. P., Weisenberger, D. J., Aman, J. F., Hinoue, T., Ramjan, Z., Liu, Y., Noushmehr,
H., Lange, C. P. E., van Dijk, C. M., Tollenaar, R. A. E. M., Van Den Berg,
D., and Laird, P. W. (2012). Regions of focal dna hypermethylation and
long-range hypomethylation in colorectal cancer coincide with nuclear lamina-associated
domains. Nat Genet, 44(1):40–6.
Bernstein, B. E., Mikkelsen, T. S., Xie, X., Kamal, M., Huebert, D. J., Cuff, J., Fry, B.,
Meissner, A., Wernig, M., Plath, K., Jaenisch, R., Wagschal, A., Feil, R., Schreiber,
S. L., and Lander, E. S. (2006). A bivalent chromatin structure marks key develop-
mental genes in embryonic stem cells. Cell, 125(2):315–26.
Bernstein, B. E., Stamatoyannopoulos, J. A., Costello, J. F., Ren, B., Milosavljevic, A.,
Meissner, A., Kellis, M., Marra, M. A., Beaudet, A. L., Ecker, J. R., Farnham, P. J.,
Hirst, M., Lander, E. S., Mikkelsen, T. S., and Thomson, J. A. (2010). The nih roadmap
epigenomics mapping consortium. Nat Biotechnol, 28(10):1045–8.
Bertram, J. S. (2000). The molecular biology of cancer. Molecular Aspects of Medicine,
21(6):167–223.
Booth, M. J., Branco, M. R., Ficz, G., Oxley, D., Krueger, F., Reik, W., and
Balasubramanian, S. (2012). Quantitative sequencing of 5-methylcytosine and
5-hydroxymethylcytosine at single-base resolution. Science, 336(6083):934–7.
Bouazoune, K., Miranda, T. B., Jones, P. A., and Kingston, R. E. (2009). Analysis of
individual remodeled nucleosomes reveals decreased histone-dna contacts created
by hswi/snf. Nucleic Acids Res, 37(16):5279–94.
Braig, M., Lee, S., Loddenkemper, C., Rudolph, C., Peters, A. H. F. M., Schlegelberger,
B., Stein, H., Dörken, B., Jenuwein, T., and Schmitt, C. A. (2005). Oncogene-induced
senescence as an initial barrier in lymphoma development. Nature, 436(7051):660–5.
Braig, M. and Schmitt, C. A. (2006). Oncogene-induced senescence: putting the brakes
on tumor development. Cancer Res, 66(6):2881–4.
Cairns, B. R. (2009). The logic of chromatin architecture and remodelling at promoters.
Nature, 461(7261):193–8.
Cancer Genome Atlas Research Network (2013). Comprehensive molecular character-
ization of clear cell renal cell carcinoma. Nature, 499(7456):43–9.
Cancer Genome Atlas Research Network, Weinstein, J. N., Collisson, E. A., Mills, G. B.,
Shaw, K. R. M., Ozenberger, B. A., Ellrott, K., Shmulevich, I., Sander, C., and Stuart,
J. M. (2013). The cancer genome atlas pan-cancer analysis project. Nat Genet,
45(10):1113–20.
Challen, G. A., Sun, D., Jeong, M., Luo, M., Jelinek, J., Berg, J. S., Bock, C.,
Vasanthakumar, A., Gu, H., Xi, Y., Liang, S., Lu, Y., Darlington, G. J., Meissner, A., Issa,
J.-P. J., Godley, L. A., Li, W., and Goodell, M. A. (2012). Dnmt3a is essential for
hematopoietic stem cell differentiation. Nat Genet, 44(1):23–31.
Chandra, T., Kirschner, K., Thuret, J.-Y., Pope, B. D., Ryba, T., Newman, S., Ahmed, K.,
Samarajiwa, S. A., Salama, R., Carroll, T., Stark, R., Janky, R., Narita, M., Xue, L.,
Chicas, A., Nuñez, S., Janknecht, R., Hayashi-Takanaka, Y., Wilson, M. D., Marshall,
A., Odom, D. T., Babu, M. M., Bazett-Jones, D. P., Tavaré, S., Edwards, P. A. W., Lowe,
S. W., Kimura, H., Gilbert, D. M., and Narita, M. (2012). Independence of repressive
histone marks and chromatin compaction during senescent heterochromatic layer
formation. Mol Cell, 47(2):203–14.
Chang, H.-C., Cho, C.-Y., and Hung, W.-C. (2006). Silencing of the metastasis suppressor
reck by ras oncogene is mediated by dna methyltransferase 3b-induced promoter
methylation. Cancer Res, 66(17):8413–20.
Chen, P.-Y., Feng, S., Joo, J. W. J., Jacobsen, S. E., and Pellegrini, M. (2011). A
comparative analysis of dna methylation across human embryonic stem cell lines.
Genome Biol, 12(7):R62.
Chen, T., Hevi, S., Gay, F., Tsujimoto, N., He, T., Zhang, B., Ueda, Y., and Li, E. (2007).
Complete inactivation of dnmt1 leads to mitotic catastrophe in human cancer cells.
Nat Genet, 39(3):391–6.
Chen, Z., Trotman, L. C., Shaffer, D., Lin, H.-K., Dotan, Z. A., Niki, M., Koutcher, J. A.,
Scher, H. I., Ludwig, T., Gerald, W., Cordon-Cardo, C., and Pandolfi, P. P. (2005).
Crucial role of p53-dependent cellular senescence in suppression of pten-deficient
tumorigenesis. Nature, 436(7051):725–30.
Chodavarapu, R. K., Feng, S., Bernatavichute, Y. V., Chen, P.-Y., Stroud, H., Yu, Y.,
Hetzel, J. A., Kuo, F., Kim, J., Cokus, S. J., Casero, D., Bernal, M., Huijser, P., Clark,
A. T., Krämer, U., Merchant, S. S., Zhang, X., Jacobsen, S. E., and Pellegrini, M.
(2010). Relationship between nucleosome positioning and dna methylation. Nature,
466(7304):388–92.
Choufani, S., Shapiro, J. S., Susiarjo, M., Butcher, D. T., Grafodatskaya, D., Lou, Y.,
Ferreira, J. C., Pinto, D., Scherer, S. W., Shaffer, L. G., Coullin, P., Caniggia, I., Beyene,
J., Slim, R., Bartolomei, M. S., and Weksberg, R. (2011). A novel approach
identifies new differentially methylated regions (dmrs) associated with imprinted genes.
Genome Res, 21(3):465–76.
Cokus, S. J., Feng, S., Zhang, X., Chen, Z., Merriman, B., Haudenschild, C. D.,
Pradhan, S., Nelson, S. F., Pellegrini, M., and Jacobsen, S. E. (2008). Shotgun bisulphite
sequencing of the arabidopsis genome reveals dna methylation patterning. Nature,
452:215–9.
Coleman-Derr, D. and Zilberman, D. (2012). Deposition of histone variant h2a.z within
gene bodies regulates responsive genes. PLoS Genet, 8(10):e1002988.
Collado, M., Gil, J., Efeyan, A., Guerra, C., Schuhmacher, A. J., Barradas, M., Benguría,
A., Zaballos, A., Flores, J. M., Barbacid, M., Beach, D., and Serrano, M. (2005).
Tumour biology: senescence in premalignant tumours. Nature, 436(7051):642.
Collings, C. K., Waddell, P. J., and Anderson, J. N. (2013). Effects of dna methylation
on nucleosome stability. Nucleic Acids Res, 41(5):2918–31.
Conerly, M. L., Teves, S. S., Diolaiti, D., Ulrich, M., Eisenman, R. N., and Henikoff, S.
(2010). Changes in h2a.z occupancy and dna methylation during b-cell lymphoma-
genesis. Genome Res, 20(10):1383–90.
Cuddapah, S., Jothi, R., Schones, D. E., Roh, T.-Y., Cui, K., and Zhao, K. (2009). Global
analysis of the insulator binding protein ctcf in chromatin barrier regions reveals
demarcation of active and repressive domains. Genome Res, 19(1):24–32.
Daley, T. and Smith, A. D. (2013). Predicting the molecular complexity of sequencing
libraries. Nat Methods, 10(4):325–7.
Deaton, A. M. and Bird, A. (2011). Cpg islands and the regulation of transcription.
Genes Dev, 25(10):1010–22.
DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., Philip-
pakis, A. A., del Angel, G., Rivas, M. A., Hanna, M., McKenna, A., Fennell, T. J.,
Kernytsky, A. M., Sivachenko, A. Y., Cibulskis, K., Gabriel, S. B., Altshuler, D., and
Daly, M. J. (2011). A framework for variation discovery and genotyping using next-
generation dna sequencing data. Nat Genet, 43(5):491–8.
Di Micco, R., Fumagalli, M., Cicalese, A., Piccinin, S., Gasparini, P., Luise, C., Schurra,
C., Garre’, M., Nuciforo, P. G., Bensimon, A., Maestro, R., Pelicci, P. G., and d’Adda di
Fagagna, F. (2006). Oncogene-induced senescence is a dna damage response triggered
by dna hyper-replication. Nature, 444(7119):638–42.
Diala, E. S. and Hoffman, R. M. (1982). Hypomethylation of hela cell dna and the
absence of 5-methylcytosine in sv40 and adenovirus (type 2) dna: analysis by hplc.
Biochem Biophys Res Commun, 107(1):19–26.
Diep, D., Plongthongkum, N., Gore, A., Fung, H.-L., Shoemaker, R., and Zhang, K.
(2012). Library-free methylation sequencing with bisulfite padlock probes. Nat Meth-
ods, 9(3):270–2.
Doi, A., Park, I.-H., Wen, B., Murakami, P., Aryee, M. J., Irizarry, R., Herb, B., Ladd-Acosta,
C., Rho, J., Loewer, S., Miller, J., Schlaeger, T., Daley, G. Q., and Feinberg,
A. P. (2009). Differential methylation of tissue- and cancer-specific cpg island shores
distinguishes human induced pluripotent stem cells, embryonic stem cells and fibrob-
lasts. Nat Genet, 41(12):1350–3.
Egger, G., Jeong, S., Escobar, S. G., Cortez, C. C., Li, T. W. H., Saito, Y., Yoo, C. B.,
Jones, P. A., and Liang, G. (2006). Identification of dnmt1 (dna methyltransferase
1) hypomorphs in somatic knockouts suggests an essential role for dnmt1 in cell
survival. Proc Natl Acad Sci U S A, 103(38):14080–5.
Ehrich, M., Zoll, S., Sur, S., and van den Boom, D. (2007). A new method for accurate
assessment of dna quality after bisulfite treatment. Nucleic Acids Res, 35(5):e29.
ENCODE Project Consortium (2004). The encode (encyclopedia of dna elements)
project. Science, 306(5696):636–40.
ENCODE Project Consortium (2011). A user’s guide to the encyclopedia of dna ele-
ments (encode). PLoS Biol, 9(4):e1001046.
ENCODE Project Consortium, Birney, E., Stamatoyannopoulos, J. A., Dutta, A., Guigó,
R., Gingeras, T. R., Margulies, E. H., Weng, Z., Snyder, M., Dermitzakis, E. T., Thurman,
R. E., Kuehn, M. S., Taylor, C. M., Neph, S., Koch, C. M., Asthana, S., Malhotra,
A., Adzhubei, I., Greenbaum, J. A., Andrews, R. M., Flicek, P., Boyle, P. J., Cao, H.,
Carter, N. P., Clelland, G. K., Davis, S., Day, N., Dhami, P., Dillon, S. C., Dorschner,
M. O., Fiegler, H., Giresi, P. G., Goldy, J., Hawrylycz, M., Haydock, A., Humbert, R.,
James, K. D., Johnson, B. E., Johnson, E. M., Frum, T. T., Rosenzweig, E. R., Karnani,
N., Lee, K., Lefebvre, G. C., Navas, P. A., Neri, F., Parker, S. C. J., Sabo, P. J.,
Sandstrom, R., Shafer, A., Vetrie, D., Weaver, M., Wilcox, S., Yu, M., Collins, F. S.,
Dekker, J., Lieb, J. D., Tullius, T. D., Crawford, G. E., Sunyaev, S., Noble, W. S., Dunham,
I., Denoeud, F., Reymond, A., Kapranov, P., Rozowsky, J., Zheng, D., Castelo,
R., Frankish, A., Harrow, J., Ghosh, S., Sandelin, A., Hofacker, I. L., Baertsch, R.,
Keefe, D., Dike, S., Cheng, J., Hirsch, H. A., Sekinger, E. A., Lagarde, J., Abril, J. F.,
Shahab, A., Flamm, C., Fried, C., Hackermüller, J., Hertel, J., Lindemeyer, M., Missal,
K., Tanzer, A., Washietl, S., Korbel, J., Emanuelsson, O., Pedersen, J. S., Holroyd,
N., Taylor, R., Swarbreck, D., Matthews, N., Dickson, M. C., Thomas, D. J., Weirauch,
M. T., Gilbert, J., Drenkow, J., Bell, I., Zhao, X., Srinivasan, K. G., Sung, W.-K., Ooi,
H. S., Chiu, K. P., Foissac, S., Alioto, T., Brent, M., Pachter, L., Tress, M. L., Valencia,
A., Choo, S. W., Choo, C. Y., Ucla, C., Manzano, C., Wyss, C., Cheung, E., Clark,
T. G., Brown, J. B., Ganesh, M., Patel, S., Tammana, H., Chrast, J., Henrichsen,
C. N., Kai, C., Kawai, J., Nagalakshmi, U., Wu, J., Lian, Z., Lian, J., Newburger, P.,
Zhang, X., Bickel, P., Mattick, J. S., Carninci, P., Hayashizaki, Y., Weissman, S., Hubbard,
T., Myers, R. M., Rogers, J., Stadler, P. F., Lowe, T. M., Wei, C.-L., Ruan, Y.,
Struhl, K., Gerstein, M., Antonarakis, S. E., Fu, Y., Green, E. D., Karaöz, U., Siepel,
A., Taylor, J., Liefer, L. A., Wetterstrand, K. A., Good, P. J., Feingold, E. A., Guyer,
M. S., Cooper, G. M., Asimenos, G., Dewey, C. N., Hou, M., Nikolaev, S., Montoya-Burgos,
J. I., Löytynoja, A., Whelan, S., Pardi, F., Massingham, T., Huang, H., Zhang,
N. R., Holmes, I., Mullikin, J. C., Ureta-Vidal, A., Paten, B., Seringhaus, M., Church,
D., Rosenbloom, K., Kent, W. J., Stone, E. A., NISC Comparative Sequencing Pro-
gram, Baylor College of Medicine Human Genome Sequencing Center, Washington
University Genome Sequencing Center, Broad Institute, Children’s Hospital Oakland
Research Institute, Batzoglou, S., Goldman, N., Hardison, R. C., Haussler, D., Miller,
W., Sidow, A., Trinklein, N. D., Zhang, Z. D., Barrera, L., Stuart, R., King, D. C., Ameur,
A., Enroth, S., Bieda, M. C., Kim, J., Bhinge, A. A., Jiang, N., Liu, J., Yao, F., Vega,
V. B., Lee, C. W. H., Ng, P., Shahab, A., Yang, A., Moqtaderi, Z., Zhu, Z., Xu, X.,
Squazzo, S., Oberley, M. J., Inman, D., Singer, M. A., Richmond, T. A., Munn, K. J.,
Rada-Iglesias, A., Wallerman, O., Komorowski, J., Fowler, J. C., Couttet, P., Bruce,
A. W., Dovey, O. M., Ellis, P. D., Langford, C. F., Nix, D. A., Euskirchen, G., Hartman,
S., Urban, A. E., Kraus, P., Van Calcar, S., Heintzman, N., Kim, T. H., Wang,
K., Qu, C., Hon, G., Luna, R., Glass, C. K., Rosenfeld, M. G., Aldred, S. F., Cooper,
S. J., Halees, A., Lin, J. M., Shulha, H. P., Zhang, X., Xu, M., Haidar, J. N. S., Yu,
Y., Ruan, Y., Iyer, V. R., Green, R. D., Wadelius, C., Farnham, P. J., Ren, B., Harte,
R. A., Hinrichs, A. S., Trumbower, H., Clawson, H., Hillman-Jackson, J., Zweig, A. S.,
Smith, K., Thakkapallayil, A., Barber, G., Kuhn, R. M., Karolchik, D., Armengol, L.,
Bird, C. P., de Bakker, P. I. W., Kern, A. D., Lopez-Bigas, N., Martin, J. D., Stranger,
B. E., Woodroffe, A., Davydov, E., Dimas, A., Eyras, E., Hallgrímsdóttir, I. B., Huppert,
J., Zody, M. C., Abecasis, G. R., Estivill, X., Bouffard, G. G., Guan, X., Hansen, N. F.,
Idol, J. R., Maduro, V. V. B., Maskeri, B., McDowell, J. C., Park, M., Thomas, P. J.,
Young, A. C., Blakesley, R. W., Muzny, D. M., Sodergren, E., Wheeler, D. A., Worley,
K. C., Jiang, H., Weinstock, G. M., Gibbs, R. A., Graves, T., Fulton, R., Mardis, E. R.,
Wilson, R. K., Clamp, M., Cuff, J., Gnerre, S., Jaffe, D. B., Chang, J. L., Lindblad-Toh,
K., Lander, E. S., Koriabine, M., Nefedov, M., Osoegawa, K., Yoshinaga, Y., Zhu, B.,
and de Jong, P. J. (2007). Identification and analysis of functional elements in 1% of
the human genome by the encode pilot project. Nature, 447(7146):799–816.
Ernst, J., Kheradpour, P., Mikkelsen, T. S., Shoresh, N., Ward, L. D., Epstein, C. B.,
Zhang, X., Wang, L., Issner, R., Coyne, M., Ku, M., Durham, T., Kellis, M., and Bern-
stein, B. E. (2011). Mapping and analysis of chromatin state dynamics in nine human
cell types. Nature, 473(7345):43–9.
Felle, M., Hoffmeister, H., Rothammer, J., Fuchs, A., Exler, J. H., and Längst, G. (2011).
Nucleosomes protect dna from dna methylation in vivo and in vitro. Nucleic Acids Res,
39(16):6956–69.
Fisher, M. (2001). Lehninger Principles of Biochemistry, 3rd edition; By David L. Nelson
and Michael M. Cox, volume 6. Springer-Verlag.
Fu, Y., Sinha, M., Peterson, C. L., and Weng, Z. (2008). The insulator binding protein
ctcf positions 20 nucleosomes around its binding sites across the human genome.
PLoS Genet, 4(7):e1000138.
Funayama, R., Saito, M., Tanobe, H., and Ishikawa, F. (2006). Loss of linker histone h1
in cellular senescence. J Cell Biol, 175(6):869–80.
Gaffney, D. J., McVicker, G., Pai, A. A., Fondufe-Mittendorf, Y. N., Lewellen, N., Michelini,
K., Widom, J., Gilad, Y., and Pritchard, J. K. (2012). Controls of nucleosome
positioning in the human genome. PLoS Genet, 8(11):e1003036.
Gal-Yam, E. N., Egger, G., Iniguez, L., Holster, H., Einarsson, S., Zhang, X., Lin, J. C.,
Liang, G., Jones, P. A., and Tanay, A. (2008). Frequent switching of polycomb repressive
marks and dna hypermethylation in the pc3 prostate cancer cell line. Proc Natl
Acad Sci U S A, 105(35):12979–84.
Gal-Yam, E. N., Jeong, S., Tanay, A., Egger, G., Lee, A. S., and Jones, P. A. (2006).
Constitutive nucleosome depletion and ordered factor assembly at the grp78 pro-
moter revealed by single molecule footprinting. PLoS Genet, 2(9):e160.
Gardiner-Garden, M. and Frommer, M. (1987). Cpg islands in vertebrate genomes. J
Mol Biol, 196(2):261–82.
Gazin, C., Wajapeyee, N., Gobeil, S., Virbasius, C.-M., and Green, M. R. (2007).
An elaborate pathway required for ras-mediated epigenetic silencing. Nature,
449(7165):1073–7.
Gebhard, C., Benner, C., Ehrich, M., Schwarzfischer, L., Schilling, E., Klug, M., Diet-
maier, W., Thiede, C., Holler, E., Andreesen, R., and Rehli, M. (2010). General
transcription factor binding at cpg islands in normal cells correlates with resistance to
de novo dna methylation in cancer cells. Cancer Res, 70(4):1398–407.
Gerstein, M. B., Kundaje, A., Hariharan, M., Landt, S. G., Yan, K.-K., Cheng, C., Mu,
X. J., Khurana, E., Rozowsky, J., Alexander, R., Min, R., Alves, P., Abyzov, A., Addleman,
N., Bhardwaj, N., Boyle, A. P., Cayting, P., Charos, A., Chen, D. Z., Cheng,
Y., Clarke, D., Eastman, C., Euskirchen, G., Frietze, S., Fu, Y., Gertz, J., Grubert,
F., Harmanci, A., Jain, P., Kasowski, M., Lacroute, P., Leng, J., Lian, J., Monahan,
H., O’Geen, H., Ouyang, Z., Partridge, E. C., Patacsil, D., Pauli, F., Raha, D.,
Ramirez, L., Reddy, T. E., Reed, B., Shi, M., Slifer, T., Wang, J., Wu, L., Yang, X.,
Yip, K. Y., Zilberman-Schapira, G., Batzoglou, S., Sidow, A., Farnham, P. J., Myers,
R. M., Weissman, S. M., and Snyder, M. (2012). Architecture of the human regulatory
network derived from encode data. Nature, 489(7414):91–100.
Gerstein, M. B., Lu, Z. J., Van Nostrand, E. L., Cheng, C., Arshinoff, B. I., Liu, T., Yip,
K. Y., Robilotto, R., Rechtsteiner, A., Ikegami, K., Alves, P., Chateigner, A., Perry, M.,
Morris, M., Auerbach, R. K., Feng, X., Leng, J., Vielle, A., Niu, W., Rhrissorrakrai, K.,
Agarwal, A., Alexander, R. P., Barber, G., Brdlik, C. M., Brennan, J., Brouillet, J. J.,
Carr, A., Cheung, M.-S., Clawson, H., Contrino, S., Dannenberg, L. O., Dernburg,
A. F., Desai, A., Dick, L., Dosé, A. C., Du, J., Egelhofer, T., Ercan, S., Euskirchen, G.,
Ewing, B., Feingold, E. A., Gassmann, R., Good, P. J., Green, P., Gullier, F., Gutwein,
M., Guyer, M. S., Habegger, L., Han, T., Henikoff, J. G., Henz, S. R., Hinrichs, A.,
Holster, H., Hyman, T., Iniguez, A. L., Janette, J., Jensen, M., Kato, M., Kent, W. J.,
Kephart, E., Khivansara, V., Khurana, E., Kim, J. K., Kolasinska-Zwierz, P., Lai, E. C.,
Latorre, I., Leahey, A., Lewis, S., Lloyd, P., Lochovsky, L., Lowdon, R. F., Lubling, Y.,
Lyne, R., MacCoss, M., Mackowiak, S. D., Mangone, M., McKay, S., Mecenas, D.,
Merrihew, G., Miller, 3rd, D. M., Muroyama, A., Murray, J. I., Ooi, S.-L., Pham, H.,
Phippen, T., Preston, E. A., Rajewsky, N., Rätsch, G., Rosenbaum, H., Rozowsky, J.,
Rutherford, K., Ruzanov, P., Sarov, M., Sasidharan, R., Sboner, A., Scheid, P., Segal,
E., Shin, H., Shou, C., Slack, F. J., Slightam, C., Smith, R., Spencer, W. C., Stinson,
E. O., Taing, S., Takasaki, T., Vafeados, D., Voronina, K., Wang, G., Washington,
N. L., Whittle, C. M., Wu, B., Yan, K.-K., Zeller, G., Zha, Z., Zhong, M., Zhou, X., modENCODE
Consortium, Ahringer, J., Strome, S., Gunsalus, K. C., Micklem, G., Liu,
X. S., Reinke, V., Kim, S. K., Hillier, L. W., Henikoff, S., Piano, F., Snyder, M., Stein,
L., Lieb, J. D., and Waterston, R. H. (2010). Integrative analysis of the caenorhabditis
elegans genome by the modencode project. Science, 330(6012):1775–87.
Gertz, J., Varley, K. E., Reddy, T. E., Bowling, K. M., Pauli, F., Parker, S. L., Kucera,
K. S., Willard, H. F., and Myers, R. M. (2011). Analysis of dna methylation in a
three-generation family reveals widespread genetic influence on epigenetic regula-
tion. PLoS Genet, 7:e1002228.
Giresi, P. G., Kim, J., McDaniell, R. M., Iyer, V. R., and Lieb, J. D. (2007). Faire
(formaldehyde-assisted isolation of regulatory elements) isolates active regulatory
elements from human chromatin. Genome Res, 17(6):877–85.
Gowher, H., Stockdale, C. J., Goyal, R., Ferreira, H., Owen-Hughes, T., and Jeltsch,
A. (2005). De novo methylation of nucleosomal dna by the mammalian dnmt1 and
dnmt3a dna methyltransferases. Biochemistry, 44(29):9899–904.
Gu, H., Bock, C., Mikkelsen, T. S., Jäger, N., Smith, Z. D., Tomazou, E., Gnirke, A.,
Lander, E. S., and Meissner, A. (2010). Genome-scale dna methylation mapping of
clinical samples at single-nucleotide resolution. Nat Methods, 7:133–6.
Guo, H., Zhu, P., Wu, X., Li, X., Wen, L., and Tang, F. (2013). Single-cell methylome
landscapes of mouse embryonic stem cells and early embryos analyzed using
reduced representation bisulfite sequencing. Genome Res, 23(12):2126–35.
Hagarman, J. A., Motley, M. P., Kristjansdottir, K., and Soloway, P. D. (2013). Coordinate
regulation of dna methylation and h3k27me3 in mouse embryonic stem cells. PLoS
One, 8(1):e53880.
Han, H., Cortez, C. C., Yang, X., Nichols, P. W., Jones, P. A., and Liang, G. (2011). Dna
methylation directly silences genes with non-cpg island promoters and establishes a
nucleosome occupied promoter. Hum Mol Genet, 20(22):4299–310.
Hansen, K. D., Timp, W., Bravo, H. C., Sabunciyan, S., Langmead, B., McDonald, O. G.,
Wen, B., Wu, H., Liu, Y., Diep, D., Briem, E., Zhang, K., Irizarry, R. A., and Feinberg,
A. P. (2011). Increased methylation variation in epigenetic domains across cancer
types. Nat Genet, 43(8):768–75.
Harris, R. A., Wang, T., Coarfa, C., Nagarajan, R. P., Hong, C., Downey, S. L., Johnson,
B. E., Fouse, S. D., Delaney, A., Zhao, Y., Olshen, A., Ballinger, T., Zhou, X., Forsberg,
K. J., Gu, J., Echipare, L., O’Geen, H., Lister, R., Pelizzola, M., Xi, Y., Epstein, C. B.,
Bernstein, B. E., Hawkins, R. D., Ren, B., Chung, W.-Y., Gu, H., Bock, C., Gnirke,
A., Zhang, M. Q., Haussler, D., Ecker, J. R., Li, W., Farnham, P. J., Waterland, R. A.,
Meissner, A., Marra, M. A., Hirst, M., Milosavljevic, A., and Costello, J. F. (2010).
Comparison of sequencing-based methods to profile dna methylation and identifica-
tion of monoallelic epigenetic modifications. Nat Biotechnol, 28(10):1097–105.
Hawkins, R. D., Hon, G. C., Lee, L. K., Ngo, Q., Lister, R., Pelizzola, M., Edsall, L. E.,
Kuan, S., Luu, Y., Klugman, S., Antosiewicz-Bourget, J., Ye, Z., Espinoza, C., Agarwahl,
S., Shen, L., Ruotti, V., Wang, W., Stewart, R., Thomson, J. A., Ecker, J. R., and
Ren, B. (2010). Distinct epigenomic landscapes of pluripotent and lineage-committed
human cells. Cell Stem Cell, 6(5):479–91.
Hayflick, L. and Moorhead, P. S. (1961). The serial cultivation of human diploid cell
strains. Exp Cell Res, 25:585–621.
He, X., Chang, S., Zhang, J., Zhao, Q., Xiang, H., Kusonmano, K., Yang, L., Sun, Z. S.,
Yang, H., and Wang, J. (2008). Methycancer: the database of human dna methylation
and cancer. Nucleic Acids Res, 36(Database issue):D836–41.
Hinoue, T., Weisenberger, D. J., Lange, C. P. E., Shen, H., Byun, H.-M., Van Den Berg,
D., Malik, S., Pan, F., Noushmehr, H., van Dijk, C. M., Tollenaar, R. A. E. M., and
Laird, P. W. (2012). Genome-scale analysis of aberrant dna methylation in colorectal
cancer. Genome Res, 22(2):271–82.
Hinoue, T., Weisenberger, D. J., Pan, F., Campan, M., Kim, M., Young, J., Whitehall,
V. L., Leggett, B. A., and Laird, P. W. (2009). Analysis of the association between cimp
and braf in colorectal cancer by dna methylation profiling. PLoS One, 4(12):e8357.
Hodges, E., Molaro, A., Dos Santos, C. O., Thekkat, P., Song, Q., Uren, P. J., Park,
J., Butler, J., Rafii, S., McCombie, W. R., Smith, A. D., and Hannon, G. J. (2011).
Directional dna methylation changes and complex intermediate states accompany
lineage specificity in the adult hematopoietic compartment. Mol Cell, 44(1):17–28.
Holliday, R. and Pugh, J. E. (1975). Dna modification mechanisms and gene activity
during development. Science, 187(4173):226–32.
Hon, G. C., Hawkins, R. D., Caballero, O. L., Lo, C., Lister, R., Pelizzola, M., Valsesia, A.,
Ye, Z., Kuan, S., Edsall, L. E., Camargo, A. A., Stevenson, B. J., Ecker, J. R., Bafna,
V., Strausberg, R. L., Simpson, A. J., and Ren, B. (2012). Global dna hypomethyla-
tion coupled to repressive chromatin domain formation and gene silencing in breast
cancer. Genome Res, 22(2):246–58.
Hou, Y., Song, L., Zhu, P., Zhang, B., Tao, Y., Xu, X., Li, F., Wu, K., Liang, J., Shao, D.,
Wu, H., Ye, X., Ye, C., Wu, R., Jian, M., Chen, Y., Xie, W., Zhang, R., Chen, L., Liu,
X., Yao, X., Zheng, H., Yu, C., Li, Q., Gong, Z., Mao, M., Yang, X., Yang, L., Li, J.,
Wang, W., Lu, Z., Gu, N., Laurie, G., Bolund, L., Kristiansen, K., Wang, J., Yang, H.,
Li, Y., Zhang, X., and Wang, J. (2012). Single-cell exome sequencing and monoclonal
evolution of a jak2-negative myeloproliferative neoplasm. Cell, 148(5):873–85.
Hovestadt, V., Jones, D. T. W., Picelli, S., Wang, W., Kool, M., Northcott, P. A., Sultan,
M., Stachurski, K., Ryzhova, M., Warnatz, H.-J., Ralser, M., Brun, S., Bunt, J.,
Jäger, N., Kleinheinz, K., Erkek, S., Weber, U. D., Bartholomae, C. C., von Kalle, C.,
Lawerenz, C., Eils, J., Koster, J., Versteeg, R., Milde, T., Witt, O., Schmidt, S., Wolf,
S., Pietsch, T., Rutkowski, S., Scheurlen, W., Taylor, M. D., Brors, B., Felsberg, J.,
Reifenberger, G., Borkhardt, A., Lehrach, H., Wechsler-Reya, R. J., Eils, R., Yaspo,
M.-L., Landgraf, P., Korshunov, A., Zapatka, M., Radlwimmer, B., Pfister, S. M., and
Lichter, P. (2014). Decoding the regulatory landscape of medulloblastoma using dna
methylation sequencing. Nature.
Huff, J. T. and Zilberman, D. (2014). Dnmt1-independent cg methylation contributes to
nucleosome positioning in diverse eukaryotes. Cell, 156(6):1286–97.
Illingworth, R. S. and Bird, A. P. (2009). Cpg islands–’a rough guide’. FEBS Lett,
583(11):1713–20.
International Cancer Genome Consortium, Hudson, T. J., Anderson, W., Artez, A.,
Barker, A. D., Bell, C., Bernabé, R. R., Bhan, M. K., Calvo, F., Eerola, I., Gerhard,
D. S., Guttmacher, A., Guyer, M., Hemsley, F. M., Jennings, J. L., Kerr, D., Klatt, P.,
Kolar, P., Kusada, J., Lane, D. P., Laplace, F., Youyong, L., Nettekoven, G., Ozenberger,
B., Peterson, J., Rao, T. S., Remacle, J., Schafer, A. J., Shibata, T., Stratton,
M. R., Vockley, J. G., Watanabe, K., Yang, H., Yuen, M. M. F., Knoppers, B. M.,
Bobrow, M., Cambon-Thomsen, A., Dressler, L. G., Dyke, S. O. M., Joly, Y., Kato, K.,
Kennedy, K. L., Nicolás, P., Parker, M. J., Rial-Sebbag, E., Romeo-Casabona, C. M.,
Shaw, K. M., Wallace, S., Wiesner, G. L., Zeps, N., Lichter, P., Biankin, A. V., Chabannon,
C., Chin, L., Clément, B., de Alava, E., Degos, F., Ferguson, M. L., Geary,
P., Hayes, D. N., Hudson, T. J., Johns, A. L., Kasprzyk, A., Nakagawa, H., Penny,
R., Piris, M. A., Sarin, R., Scarpa, A., Shibata, T., van de Vijver, M., Futreal, P. A.,
Aburatani, H., Bayés, M., Botwell, D. D. L., Campbell, P. J., Estivill, X., Gerhard, D. S.,
Grimmond, S. M., Gut, I., Hirst, M., López-Otín, C., Majumder, P., Marra, M., McPherson,
J. D., Nakagawa, H., Ning, Z., Puente, X. S., Ruan, Y., Shibata, T., Stratton,
M. R., Stunnenberg, H. G., Swerdlow, H., Velculescu, V. E., Wilson, R. K., Xue, H. H.,
Yang, L., Spellman, P. T., Bader, G. D., Boutros, P. C., Campbell, P. J., Flicek, P., Getz,
G., Guigó, R., Guo, G., Haussler, D., Heath, S., Hubbard, T. J., Jiang, T., Jones, S. M.,
Li, Q., López-Bigas, N., Luo, R., Muthuswamy, L., Ouellette, B. F. F., Pearson, J. V.,
Puente, X. S., Quesada, V., Raphael, B. J., Sander, C., Shibata, T., Speed, T. P., Stein,
L. D., Stuart, J. M., Teague, J. W., Totoki, Y., Tsunoda, T., Valencia, A., Wheeler, D. A.,
Wu, H., Zhao, S., Zhou, G., Stein, L. D., Guigó, R., Hubbard, T. J., Joly, Y., Jones,
S. M., Kasprzyk, A., Lathrop, M., López-Bigas, N., Ouellette, B. F. F., Spellman, P. T.,
Teague, J. W., Thomas, G., Valencia, A., Yoshida, T., Kennedy, K. L., Axton, M.,
Dyke, S. O. M., Futreal, P. A., Gerhard, D. S., Gunter, C., Guyer, M., Hudson, T. J.,
McPherson, J. D., Miller, L. J., Ozenberger, B., Shaw, K. M., Kasprzyk, A., Stein,
L. D., Zhang, J., Haider, S. A., Wang, J., Yung, C. K., Cros, A., Cross, A., Liang, Y.,
Gnaneshan, S., Guberman, J., Hsu, J., Bobrow, M., Chalmers, D. R. C., Hasel, K. W.,
Joly, Y., Kaan, T. S. H., Kennedy, K. L., Knoppers, B. M., Lowrance, W. W., Masui,
T., Nicolás, P., Rial-Sebbag, E., Rodriguez, L. L., Vergely, C., Yoshida, T., Grimmond,
S. M., Biankin, A. V., Bowtell, D. D. L., Cloonan, N., deFazio, A., Eshleman, J. R.,
Etemadmoghadam, D., Gardiner, B. B., Gardiner, B. A., Kench, J. G., Scarpa, A.,
Sutherland, R. L., Tempero, M. A., Waddell, N. J., Wilson, P. J., McPherson, J. D.,
Gallinger, S., Tsao, M.-S., Shaw, P. A., Petersen, G. M., Mukhopadhyay, D., Chin, L.,
DePinho, R. A., Thayer, S., Muthuswamy, L., Shazand, K., Beck, T., Sam, M., Timms,
L., Ballin, V., Lu, Y., Ji, J., Zhang, X., Chen, F., Hu, X., Zhou, G., Yang, Q., Tian, G.,
Zhang, L., Xing, X., Li, X., Zhu, Z., Yu, Y., Yu, J., Yang, H., Lathrop, M., Tost, J., Brennan,
P., Holcatova, I., Zaridze, D., Brazma, A., Egevard, L., Prokhortchouk, E., Banks,
R. E., Uhlén, M., Cambon-Thomsen, A., Viksna, J., Ponten, F., Skryabin, K., Stratton,
M. R., Futreal, P. A., Birney, E., Borg, A., Børresen-Dale, A.-L., Caldas, C., Foekens,
J. A., Martin, S., Reis-Filho, J. S., Richardson, A. L., Sotiriou, C., Stunnenberg, H. G.,
Thoms, G., van de Vijver, M., van’t Veer, L., Calvo, F., Birnbaum, D., Blanche, H.,
Boucher, P., Boyault, S., Chabannon, C., Gut, I., Masson-Jacquemier, J. D., Lathrop,
M., Pauporté, I., Pivot, X., Vincent-Salomon, A., Tabone, E., Theillet, C., Thomas, G.,
Tost, J., Treilleux, I., Calvo, F., Bioulac-Sage, P., Clément, B., Decaens, T., Degos, F.,
Franco, D., Gut, I., Gut, M., Heath, S., Lathrop, M., Samuel, D., Thomas, G., Zucman-Rossi,
J., Lichter, P., Eils, R., Brors, B., Korbel, J. O., Korshunov, A., Landgraf, P.,
Lehrach, H., Pfister, S., Radlwimmer, B., Reifenberger, G., Taylor, M. D., von Kalle,
C., Majumder, P. P., Sarin, R., Rao, T. S., Bhan, M. K., Scarpa, A., Pederzoli, P.,
Lawlor, R. A., Delledonne, M., Bardelli, A., Biankin, A. V., Grimmond, S. M., Gress,
T., Klimstra, D., Zamboni, G., Shibata, T., Nakamura, Y., Nakagawa, H., Kusada, J.,
Tsunoda, T., Miyano, S., Aburatani, H., Kato, K., Fujimoto, A., Yoshida, T., Campo,
E., López-Otín, C., Estivill, X., Guigó, R., de Sanjosé, S., Piris, M. A., Montserrat,
E., González-Díaz, M., Puente, X. S., Jares, P., Valencia, A., Himmelbauer, H., Himmelbaue,
H., Quesada, V., Bea, S., Stratton, M. R., Futreal, P. A., Campbell, P. J.,
Vincent-Salomon, A., Richardson, A. L., Reis-Filho, J. S., van de Vijver, M., Thomas,
G., Masson-Jacquemier, J. D., Aparicio, S., Borg, A., Børresen-Dale, A.-L., Caldas,
C., Foekens, J. A., Stunnenberg, H. G., van’t Veer, L., Easton, D. F., Spellman, P. T.,
Martin, S., Barker, A. D., Chin, L., Collins, F. S., Compton, C. C., Ferguson, M. L.,
Gerhard, D. S., Getz, G., Gunter, C., Guttmacher, A., Guyer, M., Hayes, D. N., Lander,
E. S., Ozenberger, B., Penny, R., Peterson, J., Sander, C., Shaw, K. M., Speed,
T. P., Spellman, P. T., Vockley, J. G., Wheeler, D. A., Wilson, R. K., Hudson, T. J., Chin,
L., Knoppers, B. M., Lander, E. S., Lichter, P., Stein, L. D., Stratton, M. R., Anderson,
W., Barker, A. D., Bell, C., Bobrow, M., Burke, W., Collins, F. S., Compton, C. C.,
DePinho, R. A., Easton, D. F., Futreal, P. A., Gerhard, D. S., Green, A. R., Guyer,
M., Hamilton, S. R., Hubbard, T. J., Kallioniemi, O. P., Kennedy, K. L., Ley, T. J., Liu,
E. T., Lu, Y., Majumder, P., Marra, M., Ozenberger, B., Peterson, J., Schafer, A. J.,
Spellman, P. T., Stunnenberg, H. G., Wainwright, B. J., Wilson, R. K., and Yang, H.
(2010). International network of cancer genome projects. Nature, 464(7291):993–8.
Irizarry, R. A., Ladd-Acosta, C., Wen, B., Wu, Z., Montano, C., Onyango, P., Cui, H.,
Gabo, K., Rongione, M., Webster, M., Ji, H., Potash, J. B., Sabunciyan, S., and
Feinberg, A. P. (2009). The human colon cancer methylome shows similar hypo-
and hypermethylation at conserved tissue-specific cpg island shores. Nat Genet,
41(2):178–86.
Iyer, V. R. (2012). Nucleosome positioning: bringing order to the eukaryotic genome.
Trends Cell Biol, 22(5):250–6.
Jackson-Grusby, L., Beard, C., Possemato, R., Tudor, M., Fambrough, D., Csankovszki,
G., Dausman, J., Lee, P., Wilson, C., Lander, E., and Jaenisch, R. (2001). Loss of
genomic methylation causes p53-dependent apoptosis and epigenetic deregulation.
Nat Genet, 27(1):31–9.
Jenuwein, T. and Allis, C. D. (2001). Translating the histone code. Science,
293(5532):1074–80.
Jeong, M., Sun, D., Luo, M., Huang, Y., Challen, G. A., Rodriguez, B., Zhang, X.,
Chavez, L., Wang, H., Hannah, R., Kim, S.-B., Yang, L., Ko, M., Chen, R., Göttgens,
B., Lee, J.-S., Gunaratne, P., Godley, L. A., Darlington, G. J., Rao, A., Li, W., and
Goodell, M. A. (2014). Large conserved domains of low dna methylation maintained
by dnmt3a. Nat Genet, 46(1):17–23.
Jessen, W. J., Dhasarathy, A., Hoose, S. A., Carvin, C. D., Risinger, A. L., and Kladde,
M. P. (2004). Mapping chromatin structure in vivo using dna methyltransferases.
Methods, 33(1):68–80.
Jiang, C. and Pugh, B. F. (2009). Nucleosome positioning and gene regulation:
advances through genomics. Nat Rev Genet, 10(3):161–72.
Jiang, Y., Schneck, J. L., Grimes, M., Taylor, A. N., Hou, W., Thrall, S. H., and Sweitzer,
S. M. (2011). Methyltransferases prefer monomer over core-trimmed nucleosomes
as in vitro substrates. Anal Biochem, 415(1):84–6.
Jimenez-Useche, I., Ke, J., Tian, Y., Shim, D., Howell, S. C., Qiu, X., and Yuan, C.
(2013). Dna methylation regulated nucleosome dynamics. Sci Rep, 3:2121.
Jin, B., Yao, B., Li, J.-L., Fields, C. R., Delmas, A. L., Liu, C., and Robertson, K. D.
(2009). Dnmt1 and dnmt3b modulate distinct polycomb-mediated histone modifica-
tions in colon cancer. Cancer Res, 69(18):7412–21.
Jones, P. A. (2012). Functions of dna methylation: islands, start sites, gene bodies and
beyond. Nat Rev Genet, 13(7):484–92.
Jones, P. A. and Baylin, S. B. (2002). The fundamental role of epigenetic events in
cancer. Nat Rev Genet, 3(6):415–28.
Jones, P. A. and Liang, G. (2009). Rethinking how dna methylation patterns are maintained.
Nat Rev Genet, 10(11):805–11.
Kamakaka, R. T. and Biggins, S. (2005). Histone variants: deviants? Genes Dev,
19(3):295–310.
Kaneda, A., Fujita, T., Anai, M., Yamamoto, S., Nagae, G., Morikawa, M., Tsuji, S.,
Oshima, M., Miyazono, K., and Aburatani, H. (2011). Activation of bmp2-smad1 signal
and its regulation by coordinated alteration of h3k27 trimethylation in ras-induced
senescence. PLoS Genet, 7(11):e1002359.
Kaplan, N., Moore, I. K., Fondufe-Mittendorf, Y., Gossett, A. J., Tillo, D., Field, Y., LeProust,
E. M., Hughes, T. R., Lieb, J. D., Widom, J., and Segal, E. (2009). The dna-encoded
nucleosome organization of a eukaryotic genome. Nature, 458(7236):362–6.
Kelly, T. K., Liu, Y., Lay, F. D., Liang, G., Berman, B. P., and Jones, P. A. (2012). Genome-wide
mapping of nucleosome positioning and dna methylation within individual dna
molecules. Genome Res, 22(12):2497–506.
Kelly, T. K., Miranda, T. B., Liang, G., Berman, B. P., Lin, J. C., Tanay, A., and Jones, P. A.
(2010). H2a.z maintenance during mitosis reveals nucleosome shifting on mitotically
silenced genes. Mol Cell, 39(6):901–11.
Kilgore, J. A., Hoose, S. A., Gustafson, T. L., Porter, W., and Kladde, M. P. (2007).
Single-molecule and population probing of chromatin structure using dna methyl-
transferases. Methods, 41(3):320–32.
Kim, T. H., Abdullaev, Z. K., Smith, A. D., Ching, K. A., Loukinov, D. I., Green, R. D.,
Zhang, M. Q., Lobanenkov, V. V., and Ren, B. (2007). Analysis of the vertebrate
insulator protein ctcf-binding sites in the human genome. Cell, 128(6):1231–45.
Komashko, V. M. and Farnham, P. J. (2010). 5-azacytidine treatment reorganizes
genomic histone modification patterns. Epigenetics, 5(3).
Krueger, F. and Andrews, S. R. (2011). Bismark: a flexible aligner and methylation caller
for bisulfite-seq applications. Bioinformatics, 27(11):1571–2.
Krueger, F., Kreck, B., Franke, A., and Andrews, S. R. (2012). Dna methylome analysis
using short bisulfite sequencing data. Nat Methods, 9(2):145–51.
Laird, P. W. (2010). Principles and challenges of genomewide dna methylation analysis.
Nat Rev Genet, 11(3):191–203.
Laks, D. R., Masterman-Smith, M., Visnyei, K., Angenieux, B., Orozco, N. M., Foran, I.,
Yong, W. H., Vinters, H. V., Liau, L. M., Lazareff, J. A., Mischel, P. S., Cloughesy, T. F.,
Horvath, S., and Kornblum, H. I. (2009). Neurosphere formation is an independent
predictor of clinical outcome in malignant glioma. Stem Cells, 27(4):980–7.
Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K.,
Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., Harris, K., Heaford, A., Howland,
J., Kann, L., Lehoczky, J., LeVine, R., McEwan, P., McKernan, K., Meldrim, J.,
Mesirov, J. P., Miranda, C., Morris, W., Naylor, J., Raymond, C., Rosetti, M., Santos,
R., Sheridan, A., Sougnez, C., Stange-Thomann, N., Stojanovic, N., Subramanian,
A., Wyman, D., Rogers, J., Sulston, J., Ainscough, R., Beck, S., Bentley, D., Burton,
J., Clee, C., Carter, N., Coulson, A., Deadman, R., Deloukas, P., Dunham, A., Dunham,
I., Durbin, R., French, L., Grafham, D., Gregory, S., Hubbard, T., Humphray, S.,
Hunt, A., Jones, M., Lloyd, C., McMurray, A., Matthews, L., Mercer, S., Milne, S., Mul-
likin, J. C., Mungall, A., Plumb, R., Ross, M., Shownkeen, R., Sims, S., Waterston,
R. H., Wilson, R. K., Hillier, L. W., McPherson, J. D., Marra, M. A., Mardis, E. R.,
Fulton, L. A., Chinwalla, A. T., Pepin, K. H., Gish, W. R., Chissoe, S. L., Wendl, M. C.,
Delehaunty, K. D., Miner, T. L., Delehaunty, A., Kramer, J. B., Cook, L. L., Fulton,
R. S., Johnson, D. L., Minx, P. J., Clifton, S. W., Hawkins, T., Branscomb, E., Predki,
P., Richardson, P., Wenning, S., Slezak, T., Doggett, N., Cheng, J. F., Olsen, A.,
Lucas, S., Elkin, C., Uberbacher, E., Frazier, M., Gibbs, R. A., Muzny, D. M., Scherer,
S. E., Bouck, J. B., Sodergren, E. J., Worley, K. C., Rives, C. M., Gorrell, J. H., Met-
zker, M. L., Naylor, S. L., Kucherlapati, R. S., Nelson, D. L., Weinstock, G. M., Sakaki,
Y., Fujiyama, A., Hattori, M., Yada, T., Toyoda, A., Itoh, T., Kawagoe, C., Watanabe,
H., Totoki, Y., Taylor, T., Weissenbach, J., Heilig, R., Saurin, W., Artiguenave, F., Brottier,
P., Bruls, T., Pelletier, E., Robert, C., Wincker, P., Smith, D. R., Doucette-Stamm,
L., Rubenfield, M., Weinstock, K., Lee, H. M., Dubois, J., Rosenthal, A., Platzer, M.,
Nyakatura, G., Taudien, S., Rump, A., Yang, H., Yu, J., Wang, J., Huang, G., Gu, J.,
Hood, L., Rowen, L., Madan, A., Qin, S., Davis, R. W., Federspiel, N. A., Abola, A. P.,
Proctor, M. J., Myers, R. M., Schmutz, J., Dickson, M., Grimwood, J., Cox, D. R.,
Olson, M. V., Kaul, R., Raymond, C., Shimizu, N., Kawasaki, K., Minoshima, S.,
Evans, G. A., Athanasiou, M., Schultz, R., Roe, B. A., Chen, F ., Pan, H., Ramser, J.,
Lehrach, H., Reinhardt, R., McCombie, W. R., de la Bastide, M., Dedhia, N., Bl¨ ocker,
H., Hornischer, K., Nordsiek, G., Agarwala, R., Aravind, L., Bailey, J. A., Bateman, A.,
140
Batzoglou, S., Birney, E., Bork, P ., Brown, D. G., Burge, C. B., Cerutti, L., Chen, H. C.,
Church, D., Clamp, M., Copley, R. R., Doerks, T., Eddy, S. R., Eichler, E. E., Furey,
T. S., Galagan, J., Gilbert, J. G., Harmon, C., Hayashizaki, Y ., Haussler, D., Herm-
jakob, H., Hokamp, K., Jang, W., Johnson, L. S., Jones, T. A., Kasif, S., Kaspryzk, A.,
Kennedy, S., Kent, W. J., Kitts, P ., Koonin, E. V., Korf, I., Kulp, D., Lancet, D., Lowe,
T. M., McLysaght, A., Mikkelsen, T., Moran, J. V., Mulder, N., Pollara, V. J., Ponting,
C. P ., Schuler, G., Schultz, J., Slater, G., Smit, A. F ., Stupka, E., Szustakowski, J.,
Thierry-Mieg, D., Thierry-Mieg, J., Wagner, L., Wallis, J., Wheeler, R., Williams, A.,
Wolf, Y . I., Wolfe, K. H., Y ang, S. P ., Y eh, R. F ., Collins, F ., Guyer, M. S., Peter-
son, J., Felsenfeld, A., Wetterstrand, K. A., Patrinos, A., Morgan, M. J., de Jong, P .,
Catanese, J. J., Osoegawa, K., Shizuya, H., Choi, S., Chen, Y . J., Szustakowki, J.,
and International Human Genome Sequencing Consortium (2001). Initial sequencing
and analysis of the human genome. Nature, 409(6822):860–921.
Landt, S. G., Marinov, G. K., Kundaje, A., Kheradpour, P ., Pauli, F ., Batzoglou, S., Bern-
stein, B. E., Bickel, P ., Brown, J. B., Cayting, P ., Chen, Y ., DeSalvo, G., Epstein, C.,
Fisher-Aylor, K. I., Euskirchen, G., Gerstein, M., Gertz, J., Hartemink, A. J., Hoffman,
M. M., Iyer, V. R., Jung, Y . L., Karmakar, S., Kellis, M., Kharchenko, P . V., Li, Q., Liu,
T., Liu, X. S., Ma, L., Milosavljevic, A., Myers, R. M., Park, P . J., Pazin, M. J., Perry,
M. D., Raha, D., Reddy, T. E., Rozowsky, J., Shoresh, N., Sidow, A., Slattery, M.,
Stamatoyannopoulos, J. A., Tolstorukov, M. Y ., White, K. P ., Xi, S., Farnham, P . J.,
Lieb, J. D., Wold, B. J., and Snyder, M. (2012). Chip-seq guidelines and practices of
the encode and modencode consortia. Genome Res, 22(9):1813–31.
Laurent, L., Wong, E., Li, G., Huynh, T., Tsirigos, A., Ong, C. T., Low, H. M., Kin Sung,
K. W., Rigoutsos, I., Loring, J., and Wei, C.-L. (2010). Dynamic changes in the human
methylome during differentiation. Genome Res, 20:320–31.
Li, B., Carey, M., and Workman, J. L. (2007). The role of chromatin during transcription.
Cell, 128(4):707–19.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abeca-
sis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup (2009a).
The sequence alignment/map format and samtools. Bioinformatics, 25(16):2078–9.
Li, H., Ruan, J., and Durbin, R. (2008). Mapping short dna sequencing reads and calling
variants using mapping quality scores. Genome Res, 18(11):1851–8.
Li, R., Li, Y ., Fang, X., Y ang, H., Wang, J., Kristiansen, K., and Wang, J. (2009b).
Snp detection for massively parallel whole-genome resequencing. Genome Res,
19:1124–32.
Li, Y ., Zhu, J., Tian, G., Li, N., Li, Q., Y e, M., Zheng, H., Yu, J., Wu, H., Sun, J., Zhang,
H., Chen, Q., Luo, R., Chen, M., He, Y ., Jin, X., Zhang, Q., Yu, C., Zhou, G., Sun,
J., Huang, Y ., Zheng, H., Cao, H., Zhou, X., Guo, S., Hu, X., Li, X., Kristiansen, K.,
Bolund, L., Xu, J., Wang, W., Y ang, H., Wang, J., Li, R., Beck, S., Wang, J., and
Zhang, X. (2010). The dna methylome of human peripheral blood mononuclear cells.
PLoS Biol, 8:e1000533.
141
Lienert, F ., Wirbelauer, C., Som, I., Dean, A., Mohn, F ., and Sch¨ ubeler, D. (2011). Iden-
tification of genetic elements that autonomously determine dna methylation states.
Nat Genet, 43:1091–7.
Lin, J. C., Jeong, S., Liang, G., Takai, D., Fatemi, M., Tsai, Y . C., Egger, G., Gal-Y am,
E. N., and Jones, P . A. (2007). Role of nucleosomal occupancy in the epigenetic
silencing of the mlh1 cpg island. Cancer Cell, 12(5):432–44.
Lindroth, A. M., Park, Y . J., McLean, C. M., Dokshin, G. A., Persson, J. M., Herman, H.,
Pasini, D., Mir´ o, X., Donohoe, M. E., Lee, J. T., Helin, K., and Soloway, P . D. (2008).
Antagonism between dna and h3k27 methylation at the imprinted rasgrf1 locus. PLoS
Genet, 4(8):e1000145.
Lister, R., Mukamel, E. A., Nery, J. R., Urich, M., Puddifoot, C. A., Johnson, N. D.,
Lucero, J., Huang, Y ., Dwork, A. J., Schultz, M. D., Yu, M., Tonti-Filippini, J., Heyn,
H., Hu, S., Wu, J. C., Rao, A., Esteller, M., He, C., Haghighi, F . G., Sejnowski, T. J.,
Behrens, M. M., and Ecker, J. R. (2013). Global epigenomic reconfiguration during
mammalian brain development. Science, 341(6146):1237905.
Lister, R., O’Malley, R. C., Tonti-Filippini, J., Gregory, B. D., Berry, C. C., Millar, A. H., and
Ecker, J. R. (2008). Highly integrated single-base resolution maps of the epigenome
in arabidopsis. Cell, 133:523–36.
Lister, R., Pelizzola, M., Dowen, R. H., Hawkins, R. D., Hon, G., Tonti-Filippini, J., Nery,
J. R., Lee, L., Y e, Z., Ngo, Q.-M., Edsall, L., Antosiewicz-Bourget, J., Stewart, R.,
Ruotti, V., Millar, A. H., Thomson, J. A., Ren, B., and Ecker, J. R. (2009). Human dna
methylomes at base resolution show widespread epigenomic differences. Nature,
462(7271):315–22.
Lister, R., Pelizzola, M., Kida, Y . S., Hawkins, R. D., Nery, J. R., Hon, G., Antosiewicz-
Bourget, J., O’Malley, R., Castanon, R., Klugman, S., Downes, M., Yu, R., Stewart, R.,
Ren, B., Thomson, J. A., Evans, R. M., and Ecker, J. R. (2011). Hotspots of aberrant
epigenomic reprogramming in human induced pluripotent stem cells. Nature, 471:68–
73.
Liu, Y ., Siegmund, K. D., Laird, P . W., and Berman, B. P . (2012). Bis-snp: Combined
dna methylation and snp calling for bisulfite-seq data. Genome Biol, 13(7):R61.
Lock, L. F ., Takagi, N., and Martin, G. R. (1987). Methylation of the hprt gene on the
inactive x occurs after chromosome inactivation. Cell, 48(1):39–46.
Lorch, Y ., LaPointe, J. W., and Kornberg, R. D. (1987). Nucleosomes inhibit the initiation
of transcription but allow chain elongation with the displacement of histones. Cell,
49(2):203–10.
Luijsterburg, M. S., White, M. F ., van Driel, R., and Dame, R. T. (2008). The major
architects of chromatin: architectural proteins in bacteria, archaea and eukaryotes.
Crit Rev Biochem Mol Biol, 43(6):393–418.
142
Malik, H. S. and Henikoff, S. (2003). Phylogenomics of the nucleosome. Nat Struct Biol,
10(11):882–91.
McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A.,
Garimella, K., Altshuler, D., Gabriel, S., Daly, M., and DePristo, M. A. (2010). The
genome analysis toolkit: a mapreduce framework for analyzing next-generation dna
sequencing data. Genome Res, 20:1297–303.
Meissner, A., Mikkelsen, T. S., Gu, H., Wernig, M., Hanna, J., Sivachenko, A., Zhang,
X., Bernstein, B. E., Nusbaum, C., Jaffe, D. B., Gnirke, A., Jaenisch, R., and Lander,
E. S. (2008). Genome-scale dna methylation maps of pluripotent and differentiated
cells. Nature, 454:766–70.
Michaloglou, C., Vredeveld, L. C. W., Soengas, M. S., Denoyelle, C., Kuilman, T.,
van der Horst, C. M. A. M., Majoor, D. M., Shay, J. W., Mooi, W. J., and Peeper,
D. S. (2005). Brafe600-associated senescence-like cell cycle arrest of human naevi.
Nature, 436(7051):720–4.
Mikkelsen, T. S., Ku, M., Jaffe, D. B., Issac, B., Lieberman, E., Giannoukos, G., Alvarez,
P ., Brockman, W., Kim, T.-K., Koche, R. P ., Lee, W., Mendenhall, E., O’Donovan, A.,
Presser, A., Russ, C., Xie, X., Meissner, A., Wernig, M., Jaenisch, R., Nusbaum, C.,
Lander, E. S., and Bernstein, B. E. (2007). Genome-wide maps of chromatin state in
pluripotent and lineage-committed cells. Nature, 448(7153):553–60.
Miranda, T. B., Kelly, T. K., Bouazoune, K., and Jones, P . A. (2010). Methylation-
sensitive single-molecule analysis of chromatin structure. Curr Protoc Mol Biol, Chap-
ter 21:Unit 21.17.1–16.
modENCODE Consortium, Roy, S., Ernst, J., Kharchenko, P . V., Kheradpour, P ., Negre,
N., Eaton, M. L., Landolin, J. M., Bristow, C. A., Ma, L., Lin, M. F ., Washietl, S.,
Arshinoff, B. I., Ay, F ., Meyer, P . E., Robine, N., Washington, N. L., Di Stefano, L.,
Berezikov, E., Brown, C. D., Candeias, R., Carlson, J. W., Carr, A., Jungreis, I., Mar-
bach, D., Sealfon, R., Tolstorukov, M. Y ., Will, S., Alekseyenko, A. A., Artieri, C.,
Booth, B. W., Brooks, A. N., Dai, Q., Davis, C. A., Duff, M. O., Feng, X., Gorchakov,
A. A., Gu, T., Henikoff, J. G., Kapranov, P ., Li, R., MacAlpine, H. K., Malone, J., Min-
oda, A., Nordman, J., Okamura, K., Perry, M., Powell, S. K., Riddle, N. C., Sakai, A.,
Samsonova, A., Sandler, J. E., Schwartz, Y . B., Sher, N., Spokony, R., Sturgill, D., van
Baren, M., Wan, K. H., Y ang, L., Yu, C., Feingold, E., Good, P ., Guyer, M., Lowdon,
R., Ahmad, K., Andrews, J., Berger, B., Brenner, S. E., Brent, M. R., Cherbas, L.,
Elgin, S. C. R., Gingeras, T. R., Grossman, R., Hoskins, R. A., Kaufman, T. C., Kent,
W., Kuroda, M. I., Orr-Weaver, T., Perrimon, N., Pirrotta, V., Posakony, J. W., Ren, B.,
Russell, S., Cherbas, P ., Graveley, B. R., Lewis, S., Micklem, G., Oliver, B., Park, P . J.,
Celniker, S. E., Henikoff, S., Karpen, G. H., Lai, E. C., MacAlpine, D. M., Stein, L. D.,
White, K. P ., and Kellis, M. (2010). Identification of functional elements and regulatory
circuits by drosophila modencode. Science, 330(6012):1787–97.
143
Molaro, A., Hodges, E., Fang, F ., Song, Q., McCombie, W. R., Hannon, G. J., and Smith,
A. D. (2011). Sperm methylation profiles reveal features of epigenetic inheritance and
evolution in primates. Cell, 146(6):1029–41.
Mueller-Planitz, F ., Klinker, H., and Becker, P . B. (2013). Nucleosome sliding mecha-
nisms: new twists in a looped history. Nat Struct Mol Biol, 20(9):1026–32.
Nagy, P . L. and Price, D. H. (2009). Formaldehyde-assisted isolation of regulatory ele-
ments. Wiley Interdiscip Rev Syst Biol Med, 1(3):400–6.
Nakabayashi, K., Trujillo, A. M., Tayama, C., Camprubi, C., Y oshida, W., Lapunzina, P .,
Sanchez, A., Soejima, H., Aburatani, H., Nagae, G., Ogata, T., Hata, K., and Monk,
D. (2011). Methylation screening of reciprocal genome-wide upds identifies novel
human-specific imprinted genes. Hum Mol Genet, 20(16):3188–97.
Narita, M., N˜ unez, S., Heard, E., Narita, M., Lin, A. W., Hearn, S. A., Spector, D. L.,
Hannon, G. J., and Lowe, S. W. (2003). Rb-mediated heterochromatin formation and
silencing of e2f target genes during cellular senescence. Cell, 113(6):703–16.
Navin, N., Kendall, J., Troge, J., Andrews, P ., Rodgers, L., McIndoo, J., Cook, K.,
Stepansky, A., Levy, D., Esposito, D., Muthuswamy, L., Krasnitz, A., McCombie, W. R.,
Hicks, J., and Wigler, M. (2011). Tumour evolution inferred by single-cell sequencing.
Nature, 472(7341):90–4.
Noushmehr, H., Weisenberger, D. J., Diefes, K., Phillips, H. S., Pujara, K., Berman, B. P .,
Pan, F ., Pelloski, C. E., Sulman, E. P ., Bhat, K. P ., Verhaak, R. G. W., Hoadley, K. A.,
Hayes, D. N., Perou, C. M., Schmidt, H. K., Ding, L., Wilson, R. K., Van Den Berg, D.,
Shen, H., Bengtsson, H., Neuvial, P ., Cope, L. M., Buckley, J., Herman, J. G., Baylin,
S. B., Laird, P . W., Aldape, K., and Cancer Genome Atlas Research Network (2010).
Identification of a cpg island methylator phenotype that defines a distinct subgroup of
glioma. Cancer Cell, 17(5):510–22.
Ohm, J. E., McGarvey, K. M., Yu, X., Cheng, L., Schuebel, K. E., Cope, L., Mohammad,
H. P ., Chen, W., Daniel, V. C., Yu, W., Berman, D. M., Jenuwein, T., Pruitt, K., Sharkis,
S. J., Watkins, D. N., Herman, J. G., and Baylin, S. B. (2007). A stem cell-like chro-
matin pattern may predispose tumor suppressor genes to dna hypermethylation and
heritable silencing. Nat Genet, 39(2):237–42.
Ooi, S. K. T., Qiu, C., Bernstein, E., Li, K., Jia, D., Y ang, Z., Erdjument-Bromage, H.,
Tempst, P ., Lin, S.-P ., Allis, C. D., Cheng, X., and Bestor, T. H. (2007). Dnmt3l con-
nects unmethylated lysine 4 of histone h3 to de novo methylation of dna. Nature,
448(7154):714–7.
Pardo, C. E., Carr, I. M., Hoffman, C. J., Darst, R. P ., Markham, A. F ., Bonthron, D. T.,
and Kladde, M. P . (2011). Methylviewer: computational analysis and editing for bisul-
fite sequencing and methyltransferase accessibility protocol for individual templates
(mapit) projects. Nucleic Acids Res, 39(1):e5.
144
Patra, S. K. (2008). Ras regulation of dna-methylation and cancer. Exp Cell Res,
314(6):1193–201.
Peli, J., Schr¨ oter, M., Rudaz, C., Hahne, M., Meyer, C., Reichmann, E., and Tschopp, J.
(1999). Oncogenic ras inhibits fas ligand-mediated apoptosis by downregulating the
expression of fas. EMBO J, 18(7):1824–31.
P´ erez, A., Castellazzi, C. L., Battistini, F ., Collinet, K., Flores, O., Deniz, O., Ruiz, M. L.,
Torrents, D., Eritja, R., Soler-L´ opez, M., and Orozco, M. (2012). Impact of methylation
on the physical properties of dna. Biophys J, 102(9):2140–8.
Portela, A., Liz, J., Nogales, V., Seti´ en, F ., Villanueva, A., and Esteller, M. (2013). Dna
methylation determines nucleosome occupancy in the 5’-cpg islands of tumor sup-
pressor genes. Oncogene, 32(47):5421–8.
Portella, G., Battistini, F ., and Orozco, M. (2013). Understanding the connection
between epigenetic dna methylation and nucleosome positioning from computer sim-
ulations. PLoS Comput Biol, 9(11):e1003354.
Rach, E. A., Winter, D. R., Benjamin, A. M., Corcoran, D. L., Ni, T., Zhu, J., and Ohler,
U. (2011). Transcription initiation patterns indicate divergent strategies for gene reg-
ulation at the chromatin level. PLoS Genet, 7(1):e1001274.
Rada-Iglesias, A., Bajpai, R., Swigut, T., Brugmann, S. A., Flynn, R. A., and Wysocka,
J. (2011). A unique chromatin signature uncovers early developmental enhancers in
humans. Nature, 470(7333):279–83.
Raizis, A. M., Schmitt, F ., and Jost, J. P . (1995). A bisulfite method of 5-methylcytosine
mapping that minimizes template degradation. Anal Biochem, 226:161–6.
Rakyan, V. K., Down, T. A., Balding, D. J., and Beck, S. (2011). Epigenome-wide asso-
ciation studies for common human diseases. Nat Rev Genet, 12:529–41.
Rakyan, V. K., Hildmann, T., Novik, K. L., Lewin, J., Tost, J., Cox, A. V., Andrews, T. D.,
Howe, K. L., Otto, T., Olek, A., Fischer, J., Gut, I. G., Berlin, K., and Beck, S. (2004).
Dna methylation profiling of the human major histocompatibility complex: a pilot study
for the human epigenome project. PLoS Biol, 2(12):e405.
Ramirez-Carrozzi, V. R., Braas, D., Bhatt, D. M., Cheng, C. S., Hong, C., Doty, K. R.,
Black, J. C., Hoffmann, A., Carey, M., and Smale, S. T. (2009). A unifying model
for the selective regulation of inducible transcription by cpg islands and nucleosome
remodeling. Cell, 138(1):114–28.
Ramsahoye, B. H., Biniszkiewicz, D., Lyko, F ., Clark, V., Bird, A. P ., and Jaenisch, R.
(2000). Non-cpg methylation is prevalent in embryonic stem cells and may be medi-
ated by dna methyltransferase 3a. Proc Natl Acad Sci U S A, 97:5237–42.
145
Reddington, J. P ., Perricone, S. M., Nestor, C. E., Reichmann, J., Y oungson, N. A.,
Suzuki, M., Reinhardt, D., Dunican, D. S., Prendergast, J. G., Mjoseng, H., Ramsa-
hoye, B. H., Whitelaw, E., Greally, J. M., Adams, I. R., Bickmore, W. A., and Meehan,
R. R. (2013). Redistribution of h3k27me3 upon dna hypomethylation results in de-
repression of polycomb target genes. Genome Biol, 14(3):R25.
Renbaum, P ., Abrahamove, D., Fainsod, A., Wilson, G. G., Rottem, S., and Razin, A.
(1990). Cloning, characterization, and expression in escherichia coli of the gene
coding for the cpg dna methylase from spiroplasma sp. strain mq1(m.sssi). Nucleic
Acids Res, 18:1145–52.
Rhee, I., Bachman, K. E., Park, B. H., Jair, K.-W., Y en, R.-W. C., Schuebel, K. E.,
Cui, H., Feinberg, A. P ., Lengauer, C., Kinzler, K. W., Baylin, S. B., and Vogelstein,
B. (2002). Dnmt1 and dnmt3b cooperate to silence genes in human cancer cells.
Nature, 416(6880):552–6.
Riggs, A. D. (1975). X inactivation, differentiation, and dna methylation. Cytogenet Cell
Genet, 14(1):9–25.
Rivera, C. M. and Ren, B. (2013). Mapping human epigenomes. Cell, 155(1):39–55.
Sarkisian, C. J., Keister, B. A., Stairs, D. B., Boxer, R. B., Moody, S. E., and Chodosh,
L. A. (2007). Dose-dependent oncogene-induced senescence in vivo and its evasion
during mammary tumorigenesis. Nat Cell Biol, 9(5):493–505.
Schalkwyk, L. C., Meaburn, E. L., Smith, R., Dempster, E. L., Jeffries, A. R., Davies,
M. N., Plomin, R., and Mill, J. (2010). Allelic skewing of dna methylation is widespread
across the genome. Am J Hum Genet, 86:196–212.
Schlesinger, Y ., Straussman, R., Keshet, I., Farkash, S., Hecht, M., Zimmerman, J.,
Eden, E., Y akhini, Z., Ben-Shushan, E., Reubinoff, B. E., Bergman, Y ., Simon, I., and
Cedar, H. (2007). Polycomb-mediated methylation on lys27 of histone h3 pre-marks
genes for de novo methylation in cancer. Nat Genet, 39(2):232–6.
Schones, D. E., Cui, K., Cuddapah, S., Roh, T.-Y ., Barski, A., Wang, Z., Wei, G.,
and Zhao, K. (2008). Dynamic regulation of nucleosome positioning in the human
genome. Cell, 132(5):887–98.
Schrama, D., Kneitz, H., Willmes, C., Adam, C., Houben, R., and Becker, J. C.
(2010). Lack of correlation between igfbp7 expression and braf mutational status
in melanoma. J Invest Dermatol, 130(3):897–8.
Scurr, L. L., Pupo, G. M., Becker, T. M., Lai, K., Schrama, D., Haferkamp, S., Irvine, M.,
Scolyer, R. A., Mann, G. J., Becker, J. C., Kefford, R. F ., and Rizos, H. (2010). Igfbp7
is not required for b-raf-induced melanocyte senescence. Cell, 141(4):717 – 727.
Segal, E., Fondufe-Mittendorf, Y ., Chen, L., Th˚ astr¨ om, A., Field, Y ., Moore, I. K., Wang,
J.-P . Z., and Widom, J. (2006). A genomic code for nucleosome positioning. Nature,
442(7104):772–8.
146
Serrano, M., Lin, A. W., McCurrach, M. E., Beach, D., and Lowe, S. W. (1997). Onco-
genic ras provokes premature cell senescence associated with accumulation of p53
and p16ink4a. Cell, 88(5):593–602.
Shalek, A. K., Satija, R., Adiconis, X., Gertner, R. S., Gaublomme, J. T., Raychowdhury,
R., Schwartz, S., Y osef, N., Malboeuf, C., Lu, D., Trombetta, J. J., Gennert, D., Gnirke,
A., Goren, A., Hacohen, N., Levin, J. Z., Park, H., and Regev, A. (2013). Single-cell
transcriptomics reveals bimodality in expression and splicing in immune cells. Nature,
498(7453):236–40.
Sharma, S., De Carvalho, D. D., Jeong, S., Jones, P . A., and Liang, G. (2011). Nucleo-
somes containing methylated dna stabilize dna methyltransferases 3a/3b and ensure
faithful epigenetic inheritance. PLoS Genet, 7(2):e1001286.
Shen, H. and Laird, P . W. (2013). Interplay between the cancer genome and epigenome.
Cell, 153(1):38–55.
Shen, L., Wu, H., Diep, D., Y amaguchi, S., D’Alessio, A. C., Fung, H.-L., Zhang, K.,
and Zhang, Y . (2013). Genome-wide analysis reveals tet- and tdg-dependent 5-
methylcytosine oxidation dynamics. Cell, 153(3):692–706.
Shen, L. and Zhang, Y . (2013). 5-hydroxymethylcytosine: generation, fate, and genomic
distribution. Curr Opin Cell Biol, 25(3):289–96.
Shoemaker, R., Deng, J., Wang, W., and Zhang, K. (2010). Allele-specific methylation
is prevalent and is contributed by cpg-snps in the human genome. Genome Res,
20(7):883–9.
Shukla, S., Kavak, E., Gregory, M., Imashimizu, M., Shutinoski, B., Kashlev, M., Ober-
doerffer, P ., Sandberg, R., and Oberdoerffer, S. (2011). Ctcf-promoted rna poly-
merase ii pausing links dna methylation to splicing. Nature, 479(7371):74–9.
Smallwood, A., Est` eve, P .-O., Pradhan, S., and Carey, M. (2007). Functional coopera-
tion between hp1 and dnmt1 mediates gene silencing. Genes Dev, 21(10):1169–78.
Smith, G., Carey, F . A., Beattie, J., Wilkie, M. J. V., Lightfoot, T. J., Coxhead, J., Garner,
R. C., Steele, R. J. C., and Wolf, C. R. (2002). Mutations in apc, kirsten-ras, and
p53–alternative genetic pathways to colorectal cancer. Proc Natl Acad Sci U S A,
99(14):9433–8.
Smith, Z. D., Gu, H., Bock, C., Gnirke, A., and Meissner, A. (2009). High-throughput
bisulfite sequencing in mammalian genomes. Methods, 48:226–32.
Stadler, M. B., Murr, R., Burger, L., Ivanek, R., Lienert, F ., Sch¨ oler, A., van Nimwegen,
E., Wirbelauer, C., Oakeley, E. J., Gaidatzis, D., Tiwari, V. K., and Sch¨ ubeler, D.
(2011). Dna-binding factors shape the mouse methylome at distal regulatory regions.
Nature, 480(7378):490–5.
147
Stratton, M. R. (2011). Exploring the genomes of cancer cells: progress and promise.
Science, 331(6024):1553–8.
Struhl, K. and Segal, E. (2013). Determinants of nucleosome positioning. Nat Struct
Mol Biol, 20(3):267–73.
Suzuki, H., Igarashi, S., Nojima, M., Maruyama, R., Y amamoto, E., Kai, M., Akashi, H.,
Watanabe, Y ., Y amamoto, H., Sasaki, Y ., Itoh, F ., Imai, K., Sugai, T., Shen, L., Issa,
J.-P . J., Shinomura, Y ., Tokino, T., and Toyota, M. (2010). Igfbp7 is a p53-responsive
gene specifically silenced in colorectal cancer with cpg island methylator phenotype.
Carcinogenesis, 31(3):342–9.
Taberlay, P . C., Kelly, T. K., Liu, C.-C., You, J. S., De Carvalho, D. D., Miranda, T. B.,
Zhou, X. J., Liang, G., and Jones, P . A. (2011). Polycomb-repressed genes have
permissive enhancers that initiate reprogramming. Cell, 147(6):1283–94.
Takai, D. and Jones, P . A. (2002). Comprehensive analysis of cpg islands in human
chromosomes 21 and 22. Proc Natl Acad Sci U S A, 99(6):3740–5.
Tan, M., Luo, H., Lee, S., Jin, F ., Y ang, J. S., Montellier, E., Buchou, T., Cheng, Z.,
Rousseaux, S., Rajagopal, N., Lu, Z., Y e, Z., Zhu, Q., Wysocka, J., Y e, Y ., Khochbin,
S., Ren, B., and Zhao, Y . (2011). Identification of 67 histone marks and histone lysine
crotonylation as a new type of histone modification. Cell, 146(6):1016–28.
Tazi, J. and Bird, A. (1990). Alternative chromatin structure at cpg islands. Cell,
60(6):909–20.
Thorvaldsd´ ottir, H., Robinson, J. T., and Mesirov, J. P . (2013). Integrative genomics
viewer (igv): high-performance genomics data visualization and exploration. Brief
Bioinform, 14(2):178–92.
Tillo, D. and Hughes, T. R. (2009). G+c content dominates intrinsic nucleosome occu-
pancy. BMC Bioinformatics, 10:442.
Tillo, D., Kaplan, N., Moore, I. K., Fondufe-Mittendorf, Y ., Gossett, A. J., Field, Y ., Lieb,
J. D., Widom, J., Segal, E., and Hughes, T. R. (2010). High nucleosome occupancy
is encoded at human regulatory sequences. PLoS One, 5(2):e9129.
Toyota, M., Ahuja, N., Ohe-Toyota, M., Herman, J. G., Baylin, S. B., and Issa, J. P .
(1999). Cpg island methylator phenotype in colorectal cancer. Proc Natl Acad Sci U
S A, 96(15):8681–6.
Tsumura, A., Hayakawa, T., Kumaki, Y ., Takebayashi, S.-i., Sakaue, M., Matsuoka,
C., Shimotohno, K., Ishikawa, F ., Li, E., Ueda, H. R., Nakayama, J.-i., and Okano,
M. (2006). Maintenance of self-renewal ability of mouse embryonic stem cells in
the absence of dna methyltransferases dnmt1, dnmt3a and dnmt3b. Genes Cells,
11(7):805–14.
148
Tycko, B. (2010). Allele-specific dna methylation: beyond imprinting. Hum Mol Genet,
19:R210–20.
Valouev, A., Johnson, S. M., Boyd, S. D., Smith, C. L., Fire, A. Z., and Sidow, A.
(2011). Determinants of nucleosome organization in primary human cells. Nature,
474(7352):516–20.
Vir´ e, E., Brenner, C., Deplus, R., Blanchon, L., Fraga, M., Didelot, C., Morey, L.,
Van Eynde, A., Bernard, D., Vanderwinden, J.-M., Bollen, M., Esteller, M., Di Croce,
L., de Launoit, Y ., and Fuks, F . (2006). The polycomb group protein ezh2 directly
controls dna methylation. Nature, 439(7078):871–4.
Wajapeyee, N., Serra, R. W., Zhu, X., Mahalingam, M., and Green, M. R. (2008). Onco-
genic braf induces senescence and apoptosis through pathways mediated by the
secreted protein igfbp7. Cell, 132(3):363–74.
Wallace, J. A. and Orr-Weaver, T. L. (2005). Replication of heterochromatin: insights
into mechanisms of epigenetic inheritance. Chromosoma, 114(6):389–402.
Wang, H., Maurano, M. T., Qu, H., Varley, K. E., Gertz, J., Pauli, F ., Lee, K., Canfield,
T., Weaver, M., Sandstrom, R., Thurman, R. E., Kaul, R., Myers, R. M., and Stam-
atoyannopoulos, J. A. (2012). Widespread plasticity in ctcf occupancy linked to dna
methylation. Genome Res, 22(9):1680–8.
Wang, W., Wei, Z., Lam, T.-W., and Wang, J. (2011). Next generation sequencing
has lower sequence coverage and poorer snp-detection capability in the regulatory
regions. Sci Rep, 1:55.
Weber, M., Hellmann, I., Stadler, M. B., Ramos, L., P¨ a¨ abo, S., Rebhan, M., and
Sch¨ ubeler, D. (2007). Distribution, silencing potential and evolutionary impact of pro-
moter dna methylation in the human genome. Nat Genet, 39(4):457–66.
Weinstein, I. B. and Case, K. (2008). The history of cancer research: introducing an
aacr centennial series. Cancer research, 68(17):6861–6862.
Weisenberger, D. J., Campan, M., Long, T. I., Kim, M., Woods, C., Fiala, E., Ehrlich, M.,
and Laird, P . W. (2005). Analysis of repetitive element dna methylation by methylight.
Nucleic Acids Res, 33:6823–36.
Weisenberger, D. J., Siegmund, K. D., Campan, M., Y oung, J., Long, T. I., Faasse,
M. A., Kang, G. H., Widschwendter, M., Weener, D., Buchanan, D., Koh, H., Simms,
L., Barker, M., Leggett, B., Levine, J., Kim, M., French, A. J., Thibodeau, S. N., Jass,
J., Haile, R., and Laird, P . W. (2006). Cpg island methylator phenotype underlies spo-
radic microsatellite instability and is tightly associated with braf mutation in colorectal
cancer. Nat Genet, 38(7):787–93.
Wen, B., Wu, H., Shinkai, Y ., Irizarry, R. A., and Feinberg, A. P . (2009). Large histone
h3 lysine 9 dimethylated chromatin blocks distinguish differentiated from embryonic
stem cells. Nat Genet, 41(2):246–50.
149
Wen, L., Li, X., Y an, L., Tan, Y ., Li, R., Zhao, Y ., Wang, Y ., Xie, J., Zhang, Y ., Song, C., Yu,
M., Liu, X., Zhu, P ., Li, X., Hou, Y ., Guo, H., Wu, X., He, C., Li, R., Tang, F ., and Qiao,
J. (2014). Whole-genome analysis of 5-hydroxymethylcytosine and 5-methylcytosine
at base resolution in the human brain. Genome Biol, 15(3):R49.
Widschwendter, M., Fiegl, H., Egle, D., Mueller-Holzner, E., Spizzo, G., Marth, C.,
Weisenberger, D. J., Campan, M., Y oung, J., Jacobs, I., and Laird, P . W. (2007).
Epigenetic stem cell signature in cancer. Nat Genet, 39(2):157–8.
Wilson, B. G. and Roberts, C. W. M. (2011). Swi/snf nucleosome remodellers and
cancer. Nat Rev Cancer, 11(7):481–92.
Wolff, E. M., Byun, H.-M., Han, H. F ., Sharma, S., Nichols, P . W., Siegmund, K. D.,
Y ang, A. S., Jones, P . A., and Liang, G. (2010). Hypomethylation of a line-1 promoter
activates an alternate transcript of the met oncogene in bladders with cancer. PLoS
Genet, 6(4):e1000917.
Wu, H., Caffo, B., Jaffee, H. A., Irizarry, R. A., and Feinberg, A. P . (2010a). Redefining
cpg islands using hidden markov models. Biostatistics, 11(3):499–514.
Wu, H., Coskun, V., Tao, J., Xie, W., Ge, W., Y oshikawa, K., Li, E., Zhang, Y ., and Sun,
Y . E. (2010b). Dnmt3a-dependent nonpromoter dna methylation facilitates transcrip-
tion of neurogenic genes. Science, 329(5990):444–8.
Xi, Y ., Bock, C., M¨ uller, F ., Sun, D., Meissner, A., and Li, W. (2012). Rrbsmap: a
fast, accurate and user-friendly alignment tool for reduced representation bisulfite
sequencing. Bioinformatics, 28:430–2.
Xi, Y . and Li, W. (2009). Bsmap: whole genome bisulfite sequence mapping program.
BMC Bioinformatics, 10:232.
Xie, W., Barr, C. L., Kim, A., Yue, F ., Lee, A. Y ., Eubanks, J., Dempster, E. L., and Ren,
B. (2012). Base-resolution analyses of sequence and parent-of-origin dependent dna
methylation in the mouse genome. Cell, 148:816–31.
Xie, W., Schultz, M. D., Lister, R., Hou, Z., Rajagopal, N., Ray, P ., Whitaker, J. W., Tian,
S., Hawkins, R. D., Leung, D., Y ang, H., Wang, T., Lee, A. Y ., Swanson, S. A., Zhang,
J., Zhu, Y ., Kim, A., Nery, J. R., Urich, M. A., Kuan, S., Y en, C.-a., Klugman, S., Yu,
P ., Suknuntha, K., Propson, N. E., Chen, H., Edsall, L. E., Wagner, U., Li, Y ., Y e, Z.,
Kulkarni, A., Xuan, Z., Chung, W.-Y ., Chi, N. C., Antosiewicz-Bourget, J. E., Slukvin,
I., Stewart, R., Zhang, M. Q., Wang, W., Thomson, J. A., Ecker, J. R., and Ren, B.
(2013). Epigenomic analysis of multilineage differentiation of human embryonic stem
cells. Cell, 153(5):1134–48.
Xie, X., Mikkelsen, T. S., Gnirke, A., Lindblad-Toh, K., Kellis, M., and Lander, E. S.
(2007). Systematic discovery of regulatory motifs in conserved regions of the human
genome, including thousands of ctcf insulator sites. Proc Natl Acad Sci U S A,
104(17):7145–50.
150
Xu, M., Kladde, M. P ., Van Etten, J. L., and Simpson, R. T. (1998). Cloning, charac-
terization and expression of the gene coding for a cytosine-5-dna methyltransferase
recognizing gpc. Nucleic Acids Res, 26(17):3961–6.
Y ang, X., Noushmehr, H., Han, H., Andreu-Vieyra, C., Liang, G., and Jones, P . A.
(2012). Gene reactivation by 5-aza-2’-deoxycytidine-induced demethylation requires
srcap-mediated h2a.z insertion to establish nucleosome depleted regions. PLoS
Genet, 8(3):e1002604.
Y ou, J. S. and Jones, P . A. (2012). Cancer genetics and epigenetics: two sides of the
same coin? Cancer Cell, 22(1):9–20.
Y ou, J. S., Kelly, T. K., De Carvalho, D. D., Taberlay, P . C., Liang, G., and Jones, P . A.
(2011). Oct4 establishes and maintains nucleosome-depleted regions that provide
additional layers of epigenetic regulation of its target genes. Proc Natl Acad Sci U S
A, 108(35):14497–502.
Yu, M., Hon, G. C., Szulwach, K. E., Song, C.-X., Zhang, L., Kim, A., Li, X., Dai, Q.,
Shen, Y ., Park, B., Min, J.-H., Jin, P ., Ren, B., and He, C. (2012). Base-resolution
analysis of 5-hydroxymethylcytosine in the mammalian genome. Cell, 149(6):1368–
80.
Zemach, A., McDaniel, I. E., Silva, P ., and Zilberman, D. (2010). Genome-wide evolu-
tionary analysis of eukaryotic dna methylation. Science, 328(5980):916–9.
Zhang, R., Chen, W., and Adams, P . D. (2007). Molecular dissection of formation of
senescence-associated heterochromatin foci. Mol Cell Biol, 27(6):2343–58.
Zhang, Z., Wippo, C. J., Wal, M., Ward, E., Korber, P ., and Pugh, B. F . (2011). A packing
mechanism for nucleosome organization reconstituted across a eukaryotic genome.
Science, 332(6032):977–80.
Zhao, Z. and Boerwinkle, E. (2002). Neighboring-nucleotide effects on single nucleotide
polymorphisms: a study of 2.6 million polymorphisms across the human genome.
Genome Res, 12:1679–86.
Zilberman, D., Coleman-Derr, D., Ballinger, T., and Henikoff, S. (2008). Histone
h2a.z and dna methylation are mutually antagonistic chromatin marks. Nature,
456(7218):125–9.
Ziller, M. J., Gu, H., M¨ uller, F ., Donaghey, J., Tsai, L. T.-Y ., Kohlbacher, O., De Jager,
P . L., Rosen, E. D., Bennett, D. A., Bernstein, B. E., Gnirke, A., and Meissner,
A. (2013). Charting a dynamic dna methylation landscape of the human genome.
Nature, 500(7463):477–81.
Zong, C., Lu, S., Chapman, A. R., and Xie, X. S. (2012). Genome-wide detection
of single-nucleotide and copy-number variations of a single human cell. Science,
338(6114):1622–6.
151
Appendix: Detailed description of TCGA Whole Genome Bisulfite sequencing Bis-SNP pipeline
0. Prerequisites
● A genome mapper that can generate BAM output (BSMAP preferred)
● SAMTOOLS v. 0.1.18
● Picard v. 1.46
● Java: Java™ SE Runtime Environment 1.6. Bis-SNP is implemented using the GATK 1.5
toolkit, which does not support OpenJDK or Java 1.7. Java SE Runtime 1.6 has been
successfully tested on Mac and Linux.
● Perl 5.8.8 or later
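The Java requirement above can be checked before running the pipeline. The following is a minimal sketch, not part of the Bis-SNP distribution: the function name `is_supported_java` is hypothetical, and it assumes the usual `java -version` banner format (a line such as `java version "1.6.0_45"`, with OpenJDK builds mentioning "OpenJDK" somewhere in the banner).

```python
import re


def is_supported_java(version_output: str) -> bool:
    """Return True if a `java -version` banner looks like a non-OpenJDK Java 1.6 runtime.

    Hypothetical helper: the banner format is an assumption, not something
    specified by Bis-SNP or GATK.
    """
    # The prerequisites state that OpenJDK is not supported.
    if "openjdk" in version_output.lower():
        return False
    # Extract the major.minor pair from a string such as: version "1.6.0_45"
    match = re.search(r'version "(\d+)\.(\d+)', version_output)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) == (1, 6)
```

In practice the banner could be captured with `subprocess.run(["java", "-version"], capture_output=True, text=True).stderr`, since `java -version` writes to standard error.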
1. Genome mapping and BAM creation
Gerald (Illumina, Inc.) is used to create an initial FASTQ file for each sample in each
sequencing lane, and this is used as the primary input to the pipeline.
1.1 FASTQ preprocessing:
Trim inverted duplicates: In paired-end sequencing, a presumed PCR (or clustering) defect can
cause the 5’ portions of the two end reads to be exact reverse complements of each other. The 3’
ends of the reads diverge after some point and become garbage sequence (often poly-A). We
report the frequency of such inverted duplicates in our QC database and use it as a measure of
library quality (we typically see levels above 1% only in libraries with very little DNA input;
manuscript in preparation).
The sequences of inverted duplicates appear to be valid up until the point of divergence
between the two ends, because these 5’ ends map to the genome with similar rates as non
inverted dups if the 3’ garbage sequences are removed. We use the perl script
invert_dups_check.pl to trim the 3’ garbage sequences off of both reads so that they can
be properly aligned. In alignment post-processing, we mark the second end as a PCR Duplicate
so they will be filtered out from use in subsequent processing steps.
Clip adapters (skipped): A tool like fastq-mcf could be used to trim adapter
sequences from 3’ ends. Ideally, this step should be done by the genome mapper, which may
be able to use low quality information to improve mappings. The USC pipeline skips this step
because it is performed by the mapper BSMAP.
1.2 Genome mapping:
Mapping is performed with BSMAP v.1.15 using the command line parameters shown below.
Bis-SNP is compatible with other bisulfite aligners that use BAM as an output format, such as
Bismark/Bowtie and BS-SEEKER*. Unfortunately, different aligners use different conventions
for BAM fields, so there are a few important caveats. One important difference is how different
mappers mark reads as mapping to a single unique locus or multiple loci in the genome (which
should be removed for most standard analyses). BSMAP uses the 0x0100 bit in the flags field to
demarcate non-unique alignments. Bowtie2 also uses the 0x0100 bit, but only if the “-k”
command line setting is used. See the Bis-SNP read filtering section below for additional details.
bsmap -a sampx_read1.fastq -b sampx_read2.fastq -d hg19_rCRSchrm.fa -o sampx.sam
-p 12 -s 16 -v 10 -q 2
For mappers that output SAM (such as BSMAP), these need to be sorted and converted to
BAM. This is done using Picard:
java -Xmx14g -jar Picard/SortSam.jar VALIDATION_STRINGENCY=SILENT
INPUT=sampx.sam OUTPUT=sampx.bam SORT_ORDER=coordinate
* BS-SEEKER does not output base quality scores in their output files, so some custom merging
of FASTQ and BS-SEEKER output files would be required. See
http://pellegrini.mcdb.ucla.edu/BS_Seeker/USAGE.html
1.3 BAM merging and indexing
BAM files need to be indexed for almost any downstream processing, including Bis-SNP.
Indexing creates a small file (with the extension .bai) containing BAM file indices required for
instant access to an arbitrary genomic coordinate. This file is created with SAMTOOLS:
samtools index sampx.bam
If multiple sequencing lanes need to be merged, this is done using Picard, which can create
index files at the same time. Picard also maintains our metadata by propagating a separate
read group for each sequencing lane or flow cell. Each read is annotated with its correct read
group, which is important for recalibrating base quality scores within a sequencing run (see
below). Read groups can be added using the AddOrReplaceReadGroups function of Picard.
java -Xmx12g -jar Picard/MergeSamFiles.jar VALIDATION_STRINGENCY=SILENT
ASSUME_SORTED=true MERGE_SEQUENCE_DICTIONARIES=true CREATE_INDEX=true
USE_THREADING=true MAX_RECORDS_IN_RAM=2000000 OUTPUT='sampx.bam'
INPUT='sampx-lane1.bam' INPUT='sampx-lane2.bam'
2. Alignment post-processing
A number of post processing steps are performed to improve the quality of the sequence and
alignment data. These are largely based on the pipeline developed for the 1,000 Genomes
project at the Broad Institute (DePristo et al., Nature Genetics 2011). Their alignment post-
processing pipeline is shown as Phase 1 in the figure below (linked from the GATK website):
2.1 Local realignment:
BisulfiteRealignerTargetCreator and BisulfiteIndelRealigner are extensions of
GATK’s standard pipeline modules RealignerTargetCreator and IndelRealigner. These
tools take a BAM and perform local realignment to identify indels that were mistaken by the
primary mapper as single nucleotide variants. This is a common occurrence, since mappers
either ignore indels completely, or operate only on individual reads and therefore have problems
distinguishing a short indel from a series of single nucleotide variants.
BisulfiteRealignerTargetCreator uses all reads covering the mismatch region to
identify indels, rather than operating on individual reads in isolation.
BisulfiteRealignerTargetCreator uses a measurement of entropy to identify regions
with likely indels, either because the primary mapper called them indels, or because they have
an unusually high number of apparent SNPs using a very simple SNP calculation (footnote 1). The only
difference from the normal GATK RealignerTargetCreator is that bisulfite-related
mismatches (C->T in end 1, G->A in end 2) are ignored. This function outputs a file with a set of
intervals to realign. The function is hard-coded to filter out the following types of reads: (1) reads
with mapping quality 0 or a bad CIGAR string, (2) for paired-end data, reads that are not properly
paired.
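The footnoted SNP rule, combined with the bisulfite exception described above, can be sketched as follows. This is a minimal illustration and not the Bis-SNP source: the function names, the pileup tuple layout, and the choice to count bisulfite-compatible bases toward the total quality are all assumptions.

```python
def is_bisulfite_mismatch(ref_base, read_base, read_end):
    """True for mismatches explained by bisulfite conversion
    (C->T on end 1, G->A on end 2), which are ignored."""
    return (read_end == 1 and ref_base == "C" and read_base == "T") or \
           (read_end == 2 and ref_base == "G" and read_base == "A")


def looks_like_snp(ref_base, pileup, mismatch_threshold=0.1):
    """pileup: list of (read_base, base_quality, read_end) tuples at one locus.

    Applies the footnoted rule:
    sum_of_mismatch_bases_qualities / totalQualities >= mismatchThreshold.
    """
    total_quality = 0
    mismatch_quality = 0
    for read_base, quality, read_end in pileup:
        total_quality += quality  # assumption: every base counts toward the total
        if read_base != ref_base and not is_bisulfite_mismatch(
                ref_base, read_base, read_end):
            mismatch_quality += quality
    if total_quality == 0:
        return False
    return mismatch_quality / total_quality >= mismatch_threshold
```

With this rule, a reference C covered only by C and T bases on end 1 reads is never flagged, whatever the T fraction.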
BisulfiteIndelRealigner performs local realignment at the regions specified in this
interval file. The output is a new BAM containing any improved alignments. The new alignments
will be put into the main POS and CIGAR fields, and the values from the original mapper will be
retained in the ORIGINAL POSITION (OP) and ORIGINAL CIGAR (OC) fields. The only
difference from the normal GATK IndelRealigner is that bisulfite-related mismatches (C->T
in end 1, G->A in end 2) are ignored. The function is hard-coded to filter out unmapped reads.
Footnote 1: A SNP is called if (sum_of_mismatch_bases_qualities / totalQualities) >= mismatchThreshold
Note for future versions: If the library contains a lot of PCR duplicated reads, it is better to mark
duplicate reads first, and then do the indel realignment.
Runtime requirements: Requires 12GB RAM, takes ~13 hours to run using 12 CPUs
Command line example:
java -Xmx12G -jar bissnp-default.jar -R hg19_rCRSchrm.fa -I sampx.bam -T
BisulfiteRealignerTargetCreator -known 1000G_phase1.indels.hg19.sort.vcf
-known Mills_and_1000G_gold_standard.indels.hg19.sites.sort.vcf
-o sampx.indels.intervals -nt 12
java -Xmx12G -jar bissnp-default.jar -R hg19_rCRSchrm.fa -I sampx.bam -T
BisulfiteIndelRealigner -targetIntervals sampx.indels.intervals
-known 1000G_phase1.indels.hg19.sort.vcf
-known Mills_and_1000G_gold_standard.indels.hg19.sites.sort.vcf -compress 5 -cigar
-o sampx.realign.bam
Command line options for BisulfiteRealignerTargetCreator:
Important options are noted below. For a complete list of options, see
RealignerTargetCreator:
● -known: A VCF file specifying known indel regions in the reference genome.
Chromosomes must be sorted in the same order as your -I input BAM and -R reference
genome. You can supply multiple -known files, and we supply the two listed above,
which come from analysis of 1000Genomes indels.
● -mismatch: This feature is really only necessary when using an ungapped aligner (e.g.
MAQ in the case of single-end read data) and should be used in conjunction with
'--model USE_SW' in the IndelRealigner. BSMAP is an ungapped aligner, and we allow
10 mismatches during alignment. Because most of our TCGA reads are 100 bp paired-end,
we use a 10% mismatch threshold, e.g. -mismatch 0.1 (Default: 0.1. The default in the
previous version was 0.0, which means that only the known indel positions provided
are used).
● -minReads: We do not check for high entropy SNPs at loci with fewer than this many
reads covering them.
● -nt: Number of CPU threads to use in parallel. USC uses 12 for HPCC cluster nodes
● -L: interval file (useful if you’re performing targeted sequencing)
Command line options for BisulfiteIndelRealigner:
Important options are noted below. For a complete list of options, see IndelRealigner
● -targetIntervals: the output from the previous step
● -known: Use the same VCF files as for BisulfiteRealignerTargetCreator
● -cigar: Some ungapped aligners (e.g. BSMAP, PASH, Bismark with bowtie1) do not
output correct CIGAR strings. In this case, -cigar should be specified to ignore the
original CIGAR. This is used in the USC BSMAP pipeline
● -model: GATK recommends that users run with USE_READS when trying to realign high
quality longer read data mapped with a gapped aligner; Smith-Waterman is really only
necessary when using an ungapped aligner. Although BSMAP is an ungapped aligner,
we currently use USE_READS as a conservative option, but USE_SW will be preferable
in future versions (it will need to be tailored for bisulfite reads).
● -LOD: LOD threshold above which the realigner will begin to work. This term is
equivalent to "significance" - i.e. is the improvement significant enough to merit
realignment? The default is good for most datasets, but for low coverage and/or when
looking for indels with low allele frequency, this value should be set to be smaller.
(Default: 5.0)
● -entropy: Percentage of mismatches at a locus to be considered having high entropy.
Realigner will only proceed with the realignment (even above the given threshold) if it
minimizes entropy among the reads (and doesn't simply push the mismatch column to
another position). We determined that the default setting of 0.15 may be too high
because C->T substitutions (or G->A on end 2 reads) are not considered mismatches,
and we will test lower values in future versions of the pipeline.
● -maxReads: Intervals with more than this many reads are not attempted. This parameter
is for experts only, and should only be considered for very deep datasets (>1000x).
● -L: interval file in GATK format. Not required if you already specified the interval file in
BisulfiteRealignerTargetCreator (since all your targets will already be in the interval)
● SW_GAP, SW_GAP_EXTEND and other SW variables control Smith Waterman cost
function. We will explore these when SW is implemented in future versions of the
pipeline.
2.2 Mark inverted duplicate reads
Inverted duplicates, as discussed in section 1.1, should be filtered so only the 5’ end of the first
end read is used in subsequent analyses. A program is currently under development for this
step.
2.3 Mark duplicate reads
The MarkDuplicates function in Picard marks PCR duplicate reads in the appropriate BAM
field. This is important for downstream processing steps which ignore duplicate reads. In high-
depth sequencing, paired-end sequencing can limit the number of false duplicate reads marked,
since both ends must be mapped to the same position to be considered duplicates.
Note for future versions: This step should be moved before indel realignment so that duplicates
can be filtered out in that step. It should also be multiplexed with multiple mark duplicate jobs
running on a single node, since it is not multi-threaded.
Runtime requirements: Requires 12GB RAM, takes up to 7 hours to run using 1 CPU
Command line example:
java -Xmx12G -jar MarkDuplicates.jar I=sampx.realign.bam
O=sampx.realign.mdups.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT
MAX_RECORDS_IN_RAM=3000000 METRICS_FILE=sampx.mdups.dupsMetrics.txt
TMP_DIR=./
Command line options for MarkDuplicates:
For a complete list of options, see MarkDuplicates.
2.4. Base quality recalibration
Base quality score recalibration is described in detail in the GATK documentation. The
function BisulfiteCountCovariates calculates mismatch counts by covariate (read group
& cycle). BisulfiteTableRecalibration uses these to recalibrate instrument quality
scores to empirical mismatch frequencies. Original base quality scores are retained in the
custom OQ field in the new BAM.
Here is an example plot of HiSeq2000 whole-genome bisulfite-seq data before and after base
quality recalibration:
GATK filters used to exclude problematic reads: For base quality recalibration, we want to only
use the reads where we have high confidence that they are mapped to the correct and unique
location in the genome. We also want to include only a single read for sets of PCR “duplicate”
reads, to avoid overweighting of any properties of those reads. Specific GATK filters used:
reads with poor mapping confidence (MappingQualityZeroFilter), reads without mappings
(UnmappedReadFilter), paired reads where ends do not map near each other in opposing
orientations (BadMateFilter), reads mapping to multiple locations in the genome
(NotPrimaryAlignmentFilter), and reads that appear to be PCR duplicates of another read
(PCRDuplicateReadFilter).
Note: BSMAP uses the 0x0100 bit in the flags field (i.e., “Not Primary”) to demarcate non-unique
alignments. Bowtie2 uses this same field, but only if the “-k” command line setting is used. BWA
instead uses a custom field “XT”, which specifies the mapping as either “unique” (U) or “repeat” (R);
reads marked as repeats should therefore either be filtered out or manually marked with the 0x0100 bit
before using with Bis-SNP.
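As an illustration of the manual marking mentioned in the note, the sketch below sets the 0x0100 bit on SAM text records that carry the BWA tag XT:A:R. It assumes tab-separated SAM lines as input; the function names are hypothetical and are not part of Bis-SNP, BWA or SAMTOOLS.

```python
NOT_PRIMARY = 0x100  # SAM flag bit used to demarcate non-unique alignments


def mark_non_unique(sam_line):
    """Return the SAM record with the 0x0100 bit set if it carries XT:A:R."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    optional_tags = fields[11:]  # optional fields follow the 11 mandatory ones
    if "XT:A:R" in optional_tags:
        flag |= NOT_PRIMARY
    fields[1] = str(flag)
    return "\t".join(fields)


def is_non_unique(sam_line):
    """True if the record's flag has the 0x0100 bit set."""
    return bool(int(sam_line.split("\t")[1]) & NOT_PRIMARY)
```

Header lines (starting with "@") would need to be passed through unchanged before applying this to a whole SAM file.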
Runtime requirements: Requires 12 GB RAM, BisulfiteCountCovariates takes up to 5
hours, BisulfiteTableRecalibration up to 10 hours using 12 CPUs.
java -Xmx12G -jar bissnp-default.jar -R hg19_rCRSchrm.fa -T
BisulfiteCountCovariates -nt 12 -knownSites dbsnp_135.hg19.sort.vcf -cov
ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -recalFile
sampx.realign.mdups.recal.beforeRecal.txt -I sampx.realign.mdups.bam
java -Xmx12G -jar bissnp-default.jar -R hg19_rCRSchrm.fa -T
BisulfiteTableRecalibration -recalFile
sampx.realign.mdups.recal.beforeRecal.txt -maxQ 40 -I sampx.realign.mdups.bam
-o sampx.realign.mdups.recal.bam
Command line options for BisulfiteCountCovariates:
Important options are noted below. For a complete list of options, see GATK documentation
(CountCovariates has been renamed to BaseRecalibrator).
● -knownSites: A VCF file specifying known variant sites in the reference genome.
Chromosomes must be sorted in the same order as your -I input BAM and -R reference
genome. We supply a sorted VCF from dbSNP v.135. This field is required.
● -cov: Mismatches can be calibrated to one or more covariates.
● -recalFile: The output of mismatch frequencies (used in the next step)
● -nt: specify how many CPU to use for this step. e.g. -nt 12
● -L: Specifying an interval file speeds it up if you are doing targeted sequencing
Command line options for BisulfiteTableRecalibration:
Important options are noted below. For a complete list of options, see GATK documentation
(TableRecalibration has been deprecated. In newer versions of GATK, the functionality
has been moved to PrintReads using the “-BQSR” command line parameter).
● -recalFile: Output file from the previous step
● -o: output recalibrated bam file. e.g. -o input.recal.bam
● -maxQ: An integer value at which to cap the recalibrated quality scores. Covariate
conditions with 0 observed mismatches are set to this limit in the version of GATK that
Bis-SNP is based on (Default: 40).
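The idea of mapping an empirical mismatch frequency to a capped quality score can be sketched as below. This is a simplified illustration only: it ignores any smoothing or covariate combination that GATK applies, and the function name and rounding are assumptions.

```python
import math


def empirical_quality(mismatches, observations, max_q=40):
    """Phred-scaled quality from an observed mismatch frequency, capped at max_q."""
    if observations <= 0:
        raise ValueError("observations must be positive")
    if mismatches == 0:
        # Conditions with 0 observed mismatches are set to the cap (-maxQ).
        return max_q
    quality = -10 * math.log10(mismatches / observations)
    return min(int(round(quality)), max_q)
```

For example, 10 mismatches in 1,000 observations correspond to a mismatch frequency of 0.01, i.e. a quality of 20.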
3. SNPs and methylation calling
Genotype and methylation calling are described in detail in the Bis-SNP Genome
Biology article. The method relies on prior probability estimates for particular genotypes (usually
inferred from dbSNP information) and bisulfite conversion efficiency:
The function BisulfiteGenotyper performs most of the work, outputting VCF files which
contain detailed information about SNPs and methylation calls, which can be combined into a
single VCF file or split out into two.
Some post-processing of initial VCF files is required. This includes sorting (3.2), filtering out
potential false SNP calls (3.3), and outputting BED methylation files (4.1).
3.1 BisulfiteGenotyper
3.1.1 Description of function
BisulfiteGenotyper is the core of the Bis-SNP pipeline. It is a locus walker which takes a
BAM file as input and outputs VCF files containing base pair by base pair information about
SNPs, methylation counts, or both. BisulfiteGenotyper supports a wide range of experimental
designs and therefore has very flexible output options (described in the command line
parameters section below). The default settings will produce one VCF file that contains all loci
determined to be potential SNPs, and a second VCF file containing all loci determined to be
CpGs in the sample genome (regardless of genotype in the reference genome).
BisulfiteGenotyper filters out potentially problematic reads of various types: The following
reads are not used to call genotypes or methylation:
● Reads that have VendorQualityCheck value FAILED
● Reads with mapping quality less than 30 (can be changed using the -mmq command line
option)
● Reads that align to multiple locations in the genome*
● PCR Duplicate reads (from BAM duplicate field)
● Unmapped reads
● For paired-end sequences, reads not marked properly paired in BAM file. This can be
overridden with the -badMates command line option
● Second end for paired-end reads that are inverted duplicates of each other
● All reads in regions with >250 fold coverage. This is usually indicative of improper
mapping in highly similar regions of the genome (can be changed using the
-toCoverage command line option)
● Reads with more than 3 mismatches to reference (bisulfite space) within a 20bp window
Note: BSMAP uses the 0x0100 bit in the flags field (i.e., “Not Primary”) to demarcate non-unique
alignments. Bowtie2 uses this same field, but only if the “-k” command line setting is used. BWA
instead uses a custom field “XT”, which specifies the mapping as either “unique” (U) or “repeat” (R);
reads marked as repeats should therefore either be filtered out or manually marked with the 0x0100 bit
before using with Bis-SNP.
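The flag-based and mapping-quality-based read filters in the list above can be sketched with standard SAM flag bits. This is an illustration under stated assumptions, not the Bis-SNP implementation: the function and parameter names are hypothetical, and the inverted-duplicate, coverage and mismatch-window filters are not covered.

```python
# Standard SAM flag bits
PAIRED, PROPER_PAIR, UNMAPPED = 0x1, 0x2, 0x4
NOT_PRIMARY, QC_FAIL, DUPLICATE = 0x100, 0x200, 0x400


def passes_read_filters(flag, mapq, min_mapq=30, use_bad_mates=False):
    """True if a read with this SAM flag and mapping quality would be kept."""
    if flag & (UNMAPPED | NOT_PRIMARY | QC_FAIL | DUPLICATE):
        return False
    if mapq < min_mapq:  # threshold corresponds to the -mmq option
        return False
    if (flag & PAIRED) and not (flag & PROPER_PAIR) and not use_bad_mates:
        return False
    return True
```

Single-end reads (PAIRED bit unset) are not subject to the proper-pair check in this sketch.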
Additionally, BisulfiteGenotyper filters out particular bases within reads for a number of
reasons. These include trimming the 5’ end of the read for non-conversion, trimming of
positions with base quality below 5 (this can be changed with the -mbq command line
parameter), and arbitrary trimming of a specified number of base pairs at the 5’ and/or 3’ ends
(see full specification of command line parameters). In paired-end sequencing, the two ends can
overlap when the insert length is shorter than the end sequence length. In this case, we only
want to count the overlap region once (since the two ends are exact copies of the same molecule).
For each position, Bis-SNP retains the copy of the base with the higher base quality score (in
rare cases when they disagree, both bases are thrown out).
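The overlap rule can be written as a small function. This is a minimal sketch, assuming both bases have already been expressed relative to the same strand; the function name is hypothetical.

```python
def resolve_overlap(base1, qual1, base2, qual2):
    """Return the (base, quality) to keep at one overlapping position,
    or None when the two ends disagree and both bases are discarded."""
    if base1 != base2:
        return None
    return (base1, max(qual1, qual2))
```

Counting the overlapping position once avoids giving a single molecule two votes at that locus.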
3.1.2 Description of VCF output files
In order to provide low-level methylation information in these files, we have added custom
methylation fields to the VCF format (formal syntax of these fields is always contained in the
VCF header section). Here is an example of a typical VCF line:
#CHR POS ID REF ALT QUAL FILTER INFO FORMAT sampx
chr9 99600737 . C . 70.78 PASS
CS=+;Context=CG,C;DP=13;MQ0=0;NS=1;REF=0 GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS
0/0:25.2,NaN:7,0,0,6,0,0:7:CG:0:13:7,6,0,0:0,71,423:70.78:5
This case shows a cytosine in the sample which matches the reference (ALT is “.”). The QUAL
score is the lower of the scores for the C and the adjacent G. Each of these two scores is a
likelihood ratio in phred scale, so that a score of 10 means that the called genotype is 10x more
likely than the second best genotype, and a score of 20 means that it is 100x more likely. Only
those loci with scores above a value set by the -stand_emit_conf command line parameter
will be output into the VCF (by default, this is set to 0 to emit all loci where CpG is the most
likely genotype). If the QUAL score is lower than a value set by the -stand_call_conf command
line parameter, the FILTER field is set to LowQual rather than PASS (by default, this is set to 20
which this locus passes by a large margin).
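The phred-scale interpretation and the two thresholds can be illustrated as follows. This is a sketch: the function names are hypothetical, and the treatment of a score exactly equal to a threshold is an assumption.

```python
def likelihood_ratio(qual):
    """How many times more likely the called genotype is than the second best."""
    return 10 ** (qual / 10)


def filter_status(qual, stand_call_conf=20.0, stand_emit_conf=0.0):
    """Return None if the locus is not emitted, else its FILTER value."""
    if qual < stand_emit_conf:
        return None
    return "PASS" if qual >= stand_call_conf else "LowQual"
```

For the example locus above, a QUAL of 70.78 corresponds to a likelihood ratio of roughly 10^7, well above the default calling threshold of 20.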
The INFO column contains several custom fields:
CS: This field specifies whether the C strand is forward (+) relative to the reference genome or
reverse (-). Thus, the VCF outputs two lines for each CpG dinucleotide, one with methylation
information with the C (“+”, relative to the reference genome), and another for the G (“-”, relative
to the reference genome). This can be important for discovering strand specificity to methylation
patterns. However, the typical user will want to combine information from the two strands, and
we provide a perl script to do this.
Context: This contains the genotype of the cytosine, which will typically be CG,C in the default
configuration. This field uses IUPAC codes, so if the position is determined to be heterozygous
CG/CA, it will be noted as CR.
The FORMAT column contains several custom fields:
CP: The VCF format supports multiple FORMAT columns, one for each sample (the example
above only has one sample). This field is simply the sample-specific version of the Context
field.
CM: The count of methylated (i.e. cytosine) reads at this position
CU: The count of unmethylated (i.e. thymine) reads at this position
BRC6: More detailed read counts that went into determination of genotype. The field contains
the following six subfields, separated by commas:
1. C (methylated) read count on the cytosine strand
2. T (unmethylated) read count on the cytosine strand
3. A/G/N read count on the cytosine strand
4. G read count on the opposite strand
5. A read count on the opposite strand (evidence of C->T evolutionary deamination)
6. C/T/N read count on the opposite strand
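The custom FORMAT fields can be read with a few lines of code. The sketch below parses the example line shown earlier, assuming tab-separated columns as in standard VCF; the subfield names given to BRC6 and the methylation fraction CM/(CM+CU) are illustrative choices, not names defined by Bis-SNP.

```python
EXAMPLE = ("chr9\t99600737\t.\tC\t.\t70.78\tPASS\t"
           "CS=+;Context=CG,C;DP=13;MQ0=0;NS=1;REF=0\t"
           "GT:BQ:BRC6:CM:CP:CU:DP:DP4:GP:GQ:SS\t"
           "0/0:25.2,NaN:7,0,0,6,0,0:7:CG:0:13:7,6,0,0:0,71,423:70.78:5")

# Hypothetical labels for the six BRC6 subfields, in the order listed above
BRC6_NAMES = ("c_same", "t_same", "other_same",
              "g_opposite", "a_opposite", "other_opposite")


def parse_sample(vcf_line):
    """Extract methylation counts from the first sample column of a VCF line."""
    fields = vcf_line.rstrip("\n").split("\t")
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    methylated = int(sample["CM"])
    unmethylated = int(sample["CU"])
    covered = methylated + unmethylated
    return {
        "brc6": dict(zip(BRC6_NAMES, map(int, sample["BRC6"].split(",")))),
        "methylated": methylated,
        "unmethylated": unmethylated,
        "methylation": methylated / covered if covered else None,
    }
```

Calling parse_sample(EXAMPLE) gives 7 methylated and 0 unmethylated reads on the cytosine strand, with 6 G reads on the opposite strand.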
3.1.3 Example usage, including non-CpG methylation examples
Examples for CpH
-output_mode EMIT_ALL_CYTOSINES -vfn1 sampx.cpg.raw.vcf -vfn2
sampx.snp.raw.vcf
Examples for arabidopsis
-C CG,1 -C CHH,1 -C CHG,1 -output_mode EMIT_ALL_CYTOSINES -vfn1
sampx.cpg.raw.vcf -vfn2 sampx.snp.raw.vcf
Examples for NOMe-seq
-sm GM -output_mode EMIT_ALL_CYTOSINES -vfn1 sampx.cpg.raw.vcf -vfn2
sampx.snp.raw.vcf
3.1.4 Notes on model parameters
dbSNP has different evidence levels for different entries, and these are used to determine prior
probabilities of SNPs occurring in the sequenced sample. Users sequencing human (or other
mammalian) samples should not need to modify the estimates used, but command line
parameters are provided for some of the important assumptions. Assuming two alleles A and G
are observed at a particular locus, and that A is most common allele based on the reference
genome and dbSNP, here is how we calculate priors (all use a variable transition_rate
that has a default value of ⅔ but can be changed on the command line):
● If dbSNP contains a minor allele frequency maf (maf=0.3 in this example), the priors
will be 1-(3*maf/2)=0.55 for ref A/A, maf*transition_rate=0.3*⅔=0.2 for A/G and
transition_rate*validated_het/2=⅔*0.1/2=0.033 for G/G (validated_het has
a default value of 0.1 but can be changed with the -vdh command line option).
● If dbSNP does not provide a minor allele frequency, but the SNP has been validated by
experiment, the priors will be 1-(3*validated_het/2)=0.85 for ref A/A,
validated_het*transition_rate=0.1*⅔=0.067 for A/G and
transition_rate*validated_het/2=⅔*0.1/2=0.033 for G/G.
● If the dbSNP entry is not validated by experiment, the priors will be
1-(3*unvalidated_het/2)=0.97 for ref A/A,
unvalidated_het*transition_rate=0.02*⅔=0.013 for A/G and
transition_rate*unvalidated_het/2=⅔*0.02/2=0.0067 for G/G (unvalidated_het has a default
value of 0.02 but can be changed with the -ndh command line option).
● If the position has no entry in dbSNP, the priors will be
1-(3*nodbsnp_het/2)=1-(3*0.001/2)=0.9985 for ref A/A,
nodbsnp_het*transition_rate=0.001*⅔=0.00067 for A/G and
transition_rate*nodbsnp_het/2=⅔*0.001/2=0.00033 for G/G
(nodbsnp_het has a default value of 0.001 but can be changed with the -hets
command line option).
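The three heterozygosity-based cases (validated, unvalidated, and not in dbSNP) share the same arithmetic, which can be checked with a short function. This is a sketch of the formulas as written above, not the Bis-SNP code; the function name and dictionary keys are hypothetical, and the minor-allele-frequency case is not covered because its G/G term uses a different quantity.

```python
def genotype_priors(het, transition_rate=2 / 3):
    """Priors for ref A/A, A/G and G/G given a heterozygosity value."""
    return {
        "hom_ref": 1 - 3 * het / 2,
        "het": het * transition_rate,
        "hom_alt": transition_rate * het / 2,
    }


validated = genotype_priors(0.1)      # -vdh default
unvalidated = genotype_priors(0.02)   # -ndh default
not_in_dbsnp = genotype_priors(0.001) # -hets default
```

The three priors sum to 1 - het/2 when transition_rate is ⅔, so the remaining mass is what is left for genotypes not listed here.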
3.1.5 Non-conversion filter
total read: Show bad library and how it’s bimodal
● Reads with more than 40% methylation of cytosines in the [A/T]C[A/T] context (we use
this specific context because it is compatible with NOMe-seq)
5’ non-conversion (show bad library).
● Bases 5’ of the position specified by -minConv (this filters incomplete bisulfite
conversion at the beginning of reads)
Note for RRBS or ERRBS users: 5’ non-conversion is a problem.
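The 5’ non-conversion filter, as described for the -minConv option in section 3.1.6, can be sketched on a single read. This is an illustration under assumptions: the input is the list of cytosine calls in 5’ to 3’ read order, the nth converted cytosine itself is kept, and a read with fewer than n converted cytosines contributes nothing. The function name is hypothetical.

```python
def filter_five_prime_nonconversion(calls, min_conv=1):
    """calls: cytosine calls in 5'->3' read order,
    'C' = unconverted (read as methylated), 'T' = converted.
    Returns the calls that remain after dropping those 5' of the
    min_conv-th converted cytosine."""
    if min_conv <= 0:
        return list(calls)  # filter disabled, e.g. for RRBS/ERRBS
    converted_seen = 0
    for index, call in enumerate(calls):
        if call == "T":
            converted_seen += 1
            if converted_seen == min_conv:
                return list(calls[index:])
    return []
```

Setting min_conv to 0 keeps the first 5’ position, which matters for RRBS and ERRBS reads.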
3.1.6 BisulfiteGenotyper Usage details
Runtime requirements: Requires 12 GB RAM, BisulfiteGenotyper takes up to 48 hours,
using 12 CPUs.
java -Xmx12G -jar bissnp-default.jar -T BisulfiteGenotyper -R
hg19_rCRSchrm.fa -I sampx.realign.mdups.recal.bam -D dbsnp_135.hg19.sort.vcf
-vfn1 sampx.realign.mdups.recal.cpg.raw.vcf -vfn2
sampx.realign.mdups.recal.snp.raw.vcf -stand_call_conf 20 -stand_emit_conf 0
-dt NONE -bsRate 0.9975 -loc -1 -nt 12 -minConv 1 -vcfCache 1000000 -mmq 30
-mbq 5
Basic command line options for BisulfiteGenotyper:
● -D: dbSNP VCF file. Known SNPs and population frequencies are important for estimating
SNP priors, which are very important for low coverage loci. Chromosomes must be
sorted in the same order as your -I input BAM and -R reference genome. The perl script
sortByRefAndCor.pl (included with Bis-SNP) can be used to sort any VCF file.
● -stand_call_conf: Minimum phred-scale score which is marked PASS in the output VCF.
Default value is 20 (i.e. the called genotype is 100x more likely than second best
genotype). This works well for high-coverage (>15x) sequencing, but we typically use a
value of 10 for lower coverage sequencing.
● -stand_emit_conf: Minimum phred-scale score which is output. The default is 0 (i.e. emit
all loci where the SNP genotype is more likely than the second best genotype)
● -minConv: If value is >0, we do not output methylation information for cytosines that are
5’ of the nth converted cytosine (where n is the value of -minConv). See discussion of 5’
non-conversion, above. RRBS and ERRBS derive a significant portion of the methylation
information from the first 5’ position in the read, and thus minConv should be set to a
value of 0 to disable 5’ non-conversion filtering for RRBS and ERRBS samples. Default
value is 1.
● -mmq: Disregards reads with mapping qualities below this phred-scale value. Note
mapping quality interpretations can be mapper-specific. (Default: 30)
● -mbq: Read bases with base quality scores below this phred-scale value are not used for
genotyping or methylation calling. (Default: 5)
● -L: interval file in GATK format. Bis-SNP will only call genotypes within these intervals
(can be useful for targeted sequencing).
● -nt: specify how many CPUs to use for this step. e.g. -nt 12
Command line options to specify output information:
● -sm: Run Bis-SNP for a particular assay. Normal Bisulfite-seq mode: BM; NOMe-seq
mode: GM (Default: BM)
● -vfn1: output VCF containing CpG methylation information (see -out_modes)
● -vfn2: output VCF containing SNP information (see -out_modes)
● -out_modes: Specify one (default is DEFAULT_FOR_TCGA)
○ DEFAULT_FOR_TCGA: emit all homozygous, non-SNP CpG sites above
stand_emit_conf threshold into VCF1 file, and all SNP sites above
stand_emit_conf threshold into VCF2 file.
○ EMIT_ALL_CPG: emit only het/homozygous CpG sites above
stand_emit_conf threshold into VCF1 file.
○ EMIT_ALL_CYTOSINES: emit all of cytosines sites above stand_emit_conf
threshold into VCF1 file.
○ EMIT_VARIANTS_ONLY: emit only SNP sites above stand_emit_conf
threshold into VCF1 file.
○ EMIT_HET_SNPS_ONLY: emit only heterozygous SNP sites above
stand_emit_conf threshold into VCF1 file.
○ EMIT_VARIANT_AND_CYTOSINES: emit all of cytosines sites above
stand_emit_conf threshold into VCF1 file, and all SNP sites above
stand_emit_conf threshold into VCF2 file
○ EMIT_ALL_CONFIDENT_SITES: emit all sites (both het/homozygous CpG and
SNPs) above stand_emit_conf threshold into a single VCF1 file.
○ NOMESEQ_MODE: Special case for NOMe-seq assay. Emits GCH, HCG and
GCG sites above stand_emit_conf threshold into VCF1 file, and all of SNP
sites above emit threshold into VCF2 file.
● -C: Specify the cytosine contexts to check, e.g. -C CG,1 -C CH,1. The number is the
1-based position of the cytosine within the pattern. '-C' can be specified multiple
times for different cytosine patterns. (Default: -C CG,1 -C CH,1)
● -nonDirectional: Specify if sequencing used a non-directional method, i.e. where PCR
strands create both C->T and G->A substitutions corresponding to bisulfite conversion.
Note: this mode is experimental.
● -cpgreads: If a filename is specified, a methylation-by-read file is created. This file is a
format similar to the Bismark bismark_methylation_extractor output files, in that it
contains a methylation value for every CpG on every read. These files can be rather
large, up to 20GB or more for a deep-coverage whole-genome dataset.
● -includeSnp: also output heterozygous SNPs in the cpg reads file. (Default: not
enabled)
● -onlyDbsnp: only output heterozygous SNPs at known dbSNP positions in the cpg reads file.
-includeSnp must be enabled at the same time. (Default: not enabled)
Bis-SNP model parameters (not frequently changed if using for human sequencing):
● -hets: Prior for heterozygosity of any locus not in dbSNP (Default: 0.001)
● -vdh: Prior for heterozygous SNPs present in dbSNP (validated), but without a Minor
Allele Frequency specified (Default: 0.1)
● -ndh: Prior for heterozygous SNPs present in dbSNP (not validated), and no Minor Allele
Frequency specified (Default: 0.02)
● -tvt: Prior for Transition rate vs. Transversion rate (Default: 2.0)
● -rge: Prior for sequencing errors in the reference genome. (Default: 1e-6)
● -bsRate: bisulfite conversion rate of the library. 0-1 is the range, 1 means complete
bisulfite conversion (Default: 0.9975)
● -overRate: bisulfite over-conversion rate of the library. 0-1 is the range, 0 means no
bisulfite over-conversion (Default: 0)
Other Infrequently modified command line options:
● -toCoverage: maximum read coverage allowed for genotyping and methylation. (Default:
-toCoverage 250)
● -badMate: use bad mated reads for genotyping and methylation calling. Bad mates
means mates are not mapped, mapped into a different chromosome or not properly
paired reads. (Default: false)
● -mm40: If more than this many mismatches occur on a read within a 40 bp window
(20bp on either side) of a target position, that read is not used for calling the genotype of
the position (Default: 3)
● -trim5: Number of bases at 5' end of the reads to discard. (Default: 0)
● -trim3: Number of bases at 3' end of the reads to discard. (Default: 0)
● -trim2nd: Trimming will only be applied to 2nd end of paired-end reads. This should be
enabled for tagmentation-based bisulfite sequencing. (Default: false)
● -invDups: How to handle paired-end sequences with inverted duplicates (see description
above): USE_ONLY_1ST_END (default), USE_BOTH_END, NOT_TO_USE
3.2 Sort VCF file
The script sortByRefAndCor.pl is used to sort a VCF file (or any tab-separated file) by
reference contig order and coordinate.
Command-line Options:
● --k: contig name is in the field POS (1-based) of input lines.
● --c: contig coordinate is in the field COR (1-based) of input lines.
● --tmp: temp directory (Default: --tmp /tmp)
● input.vcf: input file to sort. If '-' is specified, input is read from STDIN.
● reference_genome_index_file: .fai file, or ANY file that has contigs, in the desired sorting
order, as its first column.
e.g.
perl sortByRefAndCor.pl --k 1 --c 2 input.raw.vcf hg19.fa.fai > input.raw.sort.vcf
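The sort order applied here (reference contig order first, then coordinate) can be sketched as follows. This is not a port of sortByRefAndCor.pl: the function name is hypothetical, the contig order is passed in as a list (for example the first column of a .fai file), and keeping "#" header lines at the top is an assumption.

```python
def sort_by_reference(lines, contig_order, contig_field=1, coord_field=2):
    """Sort tab-separated lines by reference contig order, then coordinate.

    contig_field and coord_field are 1-based, mirroring --k and --c.
    """
    rank = {name: index for index, name in enumerate(contig_order)}
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if not line.startswith("#")]

    def sort_key(line):
        fields = line.rstrip("\n").split("\t")
        return (rank[fields[contig_field - 1]], int(fields[coord_field - 1]))

    return header + sorted(records, key=sort_key)
```

Coordinates are compared as integers, so position 3 sorts before position 20 on the same contig.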
3.3 Filter fake SNP, and bad CpG calls
Algorithm description:
Important Command-line Options:
● -R: reference genome file. e.g. -R hg19.fa
● -T VCFpostprocess: tells Bis-SNP to run VCFpostprocessWalker, which filters out
low-quality positions.
● -oldVcf: input VCF file for the filtering. e.g. -oldVcf snp.raw.vcf
● -snpVcf: input VCF file which provide the SNP information for the filtering. e.g. -snpVcf
snp.raw.vcf
● -newVcf: output VCF file that has already been filtered. e.g. -newVcf cpg.filter.vcf
● -o: Output summary statistics file for CpG methylation with/without SNPs, e.g. -o
vcf.cpgSummary.txt (still experimental)
● -qual: positions with a genotype quality score below this threshold are filtered out.
When there are multiple samples in the same VCF file, the position is output into the
filtered VCF file if at least one of the samples satisfies the threshold. (Default: -qual 20)
Other Command-line Options:
● -C: Specify the cytosine contexts to check, e.g. -C CG,1 -C CH,1. The number is the
1-based position of the cytosine within the pattern. '-C' can be specified multiple
times for different cytosine patterns. (Default: -C CG,1 -C CH,1)
● -minCT: minimum number of CT reads required to calculate a methylation level. (Default: -minCT 0)
● -maxCov: positions exceeding this maximum coverage are filtered out. (Default:
-maxCov 250)
● -sb: heterozygous SNPs with a strand bias score higher than this value are filtered
out. (Default: -sb -0.02)
● -mq0: maximum fraction of reads with mapping quality zero. A higher fraction
indicates a possibly badly aligned region. (Default: -mq0 0.1)
● -minSNPinWind: minimum number of SNPs in the window. (Default: -minSNPinWind 2)
● -windSizeForSNPfilter: window size for detecting SNP clusters; 10 means that no second
SNP may lie within +/- 10bp. (Default: -windSizeForSNPfilter 10)
e.g.
For CpG filter:
-R hg19.fa -T VCFpostprocess -qual 20 -oldVcf cpg.raw.vcf -snpVcf snp.raw.vcf -newVcf cpg.filter.vcf -o vcf.cpgSummary.txt
For SNP filter:
-R hg19.fa -T VCFpostprocess -qual 20 -oldVcf snp.raw.vcf -snpVcf snp.raw.vcf -newVcf snp.filter.vcf -o vcf.cpgSummary.txt
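Since the algorithm description is not given here, the following Python sketch only illustrates how two of the options above could behave. It is not the VCFpostprocessWalker implementation. The function names `passes_qual` and `clustered_snps` are hypothetical, and the cluster rule is an assumed reading of -minSNPinWind and -windSizeForSNPfilter: a SNP is flagged when at least min_snp_in_wind SNPs (itself included) lie within +/- wind_size bp on the same contig.

```python
from bisect import bisect_left, bisect_right

def passes_qual(sample_quals, min_qual=20):
    """Keep a position if at least one sample reaches the genotype quality threshold."""
    return any(q >= min_qual for q in sample_quals)

def clustered_snps(positions, wind_size=10, min_snp_in_wind=2):
    """Return SNP positions (one contig) that have at least min_snp_in_wind SNPs,
    themselves included, within +/- wind_size bp. Assumed rule, see lead-in."""
    pos = sorted(positions)
    flagged = []
    for p in pos:
        lo = bisect_left(pos, p - wind_size)   # first SNP at or after p - wind_size
        hi = bisect_right(pos, p + wind_size)  # one past the last SNP at or before p + wind_size
        if hi - lo >= min_snp_in_wind:
            flagged.append(p)
    return flagged

snps = [100, 108, 150, 400, 411]
print(clustered_snps(snps))    # 100 and 108 are 8 bp apart; 400 and 411 are 11 bp apart
print(passes_qual([12, 35]))   # second sample meets the default threshold
```

With the default window of 10, only the pair 100/108 is treated as a cluster in this toy input.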
4. Convert VCF to 6plus2.bed file, methylation wiggle file, coverage wiggle file
4.1. convert VCF to 6plus2.bed format
Script vcf2bed6plus2.pl is used to convert VCF file to 6plus2.bed format file.
6plus2.bed format description:
It contains the following fields:
● chromosome name
● start position
● end position
● status: “snp” if there is a SNP at this position, “no_ref” if it is not a reference CG;
otherwise “.” by default.
● score: methylation level*10, to satisfy the BED format requirement of a 0-1000 score
● strand: “.” when both strands are combined; otherwise “+” for the forward
strand and “-” for the reverse strand
● methylation value: methylation level (0-100)
● CT_reads: number of CT reads covering this CG. When the CGs on both strands are combined,
it is the sum of the CT reads over both CGs.
e.g.
chr11 7000492 7000493 snp 0 . . 24
chr11 7000730 7000731 . 769 . 76.92 13
chr11 7000998 7000999 . 818 . 81.82 11
chr11 7001534 7001535 . 846 . 84.62 13
chr11 7001637 7001638 . 750 . 75.00 8
chr11 7002111 7002112 . 700 . 70.00 10
chr11 7003483 7003484 . 523 . 52.38 21
chr11 7003705 7003706 no_ref 1000 . 100.00 10
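For readers who want to load this format programmatically, here is a minimal Python sketch of a parser for the eight fields listed above. The names `Bed6Plus2` and `parse_bed6plus2` are hypothetical and not part of the Bis-SNP scripts; the sketch splits on any whitespace and treats a methylation value of '.' as missing.

```python
from typing import NamedTuple, Optional

class Bed6Plus2(NamedTuple):
    chrom: str
    start: int
    end: int
    status: str
    score: int
    strand: str
    methylation: Optional[float]  # None when the column is "."
    ct_reads: int

def parse_bed6plus2(line):
    """Parse one 6plus2.bed line into a Bed6Plus2 record."""
    f = line.split()
    methy = None if f[6] == "." else float(f[6])
    return Bed6Plus2(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5], methy, int(f[7]))

rows = [parse_bed6plus2(l) for l in (
    "chr11 7000492 7000493 snp 0 . . 24",
    "chr11 7000730 7000731 . 769 . 76.92 13",
    "chr11 7003483 7003484 . 523 . 52.38 21",
)]
for r in rows:
    if r.methylation is not None:
        # In the example rows, the score equals the methylation level times 10, truncated.
        print(r.chrom, r.start, r.score, int(r.methylation * 10))
```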
Command-line Options:
● --only_good_call: only output confident calls of the cytosine pattern. (Default: not enabled)
● --seperate_strand: do not combine two adjacent CGs on the two strands into one CG. It should
be enabled for non-symmetric cytosines, like GCH or CHH. (Default: not enabled)
● --qual: quality score for confidently called cytosine pattern (Default: --qual 20)
● --minCT: minimum number of CT reads; below this, the methy column will be '.' (Default: --minCT 1)
● --maxCov: maximum coverage allowed (Default: --maxCov 250)
● --sample_order: sample to output as a bed file when there are multiple samples in
the same VCF file. 1 means the first sample (Default: --sample_order 1)
● input_vcf: input CG VCF file.
● Cytosine_type: cytosine type whose methylation values are output. Without this argument,
all cytosines in the VCF file are output. IUPAC codes are supported, like CG, CH, CHH, CHG.
e.g.
perl vcf2bed6plus2.pl cpg.raw.vcf CG
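The effect of combining the two strands of a CG and of --minCT can be sketched in Python as below. This is an illustration under assumptions, not the logic of vcf2bed6plus2.pl: the function name `combine_cpg` is hypothetical, the methylation level is assumed to be C reads divided by CT reads, the counts of both strands are assumed to be pooled before the ratio is taken, and the score is assumed to be 0 when the methylation column is '.'.

```python
def combine_cpg(c_plus, t_plus, c_minus, t_minus, min_ct=1):
    """Pool C and T read counts from the + and - strand cytosines of one CG.

    Returns (score, methylation_text, ct_reads) in the style of the
    score, methylation value and CT_reads columns of 6plus2.bed.
    """
    c = c_plus + c_minus
    ct = c + t_plus + t_minus
    if ct == 0 or ct < min_ct:
        # Too few CT reads: methylation is reported as '.' (score 0 is an assumption).
        return 0, ".", ct
    methy = 100.0 * c / ct
    return int(methy * 10), f"{methy:.2f}", ct

print(combine_cpg(6, 2, 4, 1))            # 10 C reads out of 13 CT reads
print(combine_cpg(3, 1, 0, 0, min_ct=5))  # only 4 CT reads, below min_ct
```

With 10 C reads out of 13 CT reads the sketch reproduces the form of the second example row above (score 769, methylation 76.92, CT_reads 13).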
4.2. convert VCF to methylation level wiggle file
Command-line Options:
perl vcf2wig.pl cpg.filter.vcf CG
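The exact layout written by vcf2wig.pl is not described here. As a rough illustration of what a methylation wiggle file can look like, the Python sketch below writes per-position values in the standard variableStep wiggle layout (1-based positions). The function name `to_variable_step` and the two-decimal formatting are assumptions, and no track header line is emitted.

```python
def to_variable_step(records):
    """Convert (chrom, pos_1based, value) tuples, sorted by contig and position,
    into variableStep wiggle lines."""
    lines = []
    current = None
    for chrom, pos, value in records:
        if chrom != current:
            # A new declaration line starts each contig block.
            lines.append(f"variableStep chrom={chrom}")
            current = chrom
        lines.append(f"{pos} {value:.2f}")
    return lines

calls = [("chr11", 7000731, 76.92), ("chr11", 7000999, 81.82), ("chr12", 5000, 0.0)]
wig_lines = to_variable_step(calls)
print("\n".join(wig_lines))
```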
4.3. convert to CT_coverage wiggle file
Command-line Options:
perl vcf2wig_ct_coverage.pl cpg.filter.vcf CG
4.4. convert to raw_coverage wiggle file
Output the total reads coverage in each position in VCF file.
Command-line Options:
perl vcf2wig_raw_coverage.pl cpg.raw.vcf
4.5. convert to .tdf file for IGV visualization (methylation value and CT_coverage)
igvtools must be installed beforehand.
Command-line Options:
igvtools toTDF cpg.filter.wig hg19
igvtools toTDF cpg.filter.ct_coverage.wig hg19
igvtools toTDF cpg.raw.raw_coverage.wig hg19
BisSNP performance in low-coverage:
test lib:
/export/uec-gs1/laird/shared/production/ga/flowcells/D1W2BACXX/results/D1W2BACXX/D1W2BACXX_6_NIC1254A94
NOTES:
Bis-SNP version used: Bis-SNP 0.81.2 (/home/uec-00/shared/production/software/bissnp/bissnp-default.jar)
perl script used:
VCF sort:
/home/uec-00/shared/production/software/bissnp/sortByRefAndCor.pl
VCF to bed (combine CG in +/- strand together):
/home/uec-00/shared/production/software/bissnp/vcf2bed6plus2.pl
VCF to methylation level wiggle file:
/home/uec-00/shared/production/software/bissnp/vcf2wig.pl
VCF to CT_coverage:
/home/uec-00/shared/production/software/bissnp/vcf2wig_ct_coverage.pl
VCF to raw_coverage:
/home/uec-00/shared/production/software/bissnp/vcf2wig_raw_coverage.pl
Reference genome used:
/home/uec-00/shared/production/genomes/hg19_rCRSchrm/hg19_rCRSchrm.fa
dbSNP file used:
/home/uec-00/shared/production/software/bissnp/genomic_data/dbsnp_135.hg19.sort.vcf
Known Indel files used:
indel1:/home/uec-00/shared/production/software/bissnp/genomic_data/1000G_phase1.indels.hg19.sort.vcf,
indel2:/home/uec-00/shared/production/software/bissnp/genomic_data/Mills_and_1000G_gold_standard.indels.hg19.sites.sort.vcf
A PROCESS with -nt in its input parameters indicates that parallel computation is enabled in the
program.
Abstract
One of the hallmarks of cancer is aberrant epigenetic change, which includes alterations in DNA methylation, nucleosome positioning and histone modifications. Recent advances in whole-exome sequencing have revealed common defects in many of the chromatin modifier genes responsible for establishing these marks. Changes in tumor DNA methylation patterns and, to a lesser extent, histone modifications have been extensively characterized, but changes in nucleosome positioning and how they relate to other epigenetic marks remain poorly understood.
The Nucleosome Occupancy and Methylation whole-genome sequencing (NOMe-seq) assay was developed to understand the relationship between DNA methylation and nucleosome positioning changes on the same DNA strand in a single experiment. NOMe-seq was first established in the fibroblast cell line (IMR90) system. It was then adapted for fresh frozen tissue by combining it with a low-DNA-input bisulfite treatment kit. The anti-correlation between nucleosome occupancy and DNA methylation was observed genome-wide at distal regions, especially near CTCF motifs. NOMe-seq could also reveal genome-wide in vivo transcription factor binding affinities in human cells by using different salt-wash concentrations. At promoter regions, NOMe-seq could reveal three distinct chromatin configurations by the combination of DNA methylation and accessibility status. Combinatorial epigenomic signatures on the same DNA strand revealed Divergent Chromatin Alleles (DCAs), which capture the discrete epigenetic patterns in different alleles. Furthermore, the application of NOMe-seq to primary cultured glioblastoma cell lines (GBMs) and Neural Stem Cells (NSCs) revealed that the epigenetic switching from stem cells to glioblastoma cells is mainly enriched in pre-marked poised enhancer regions. This demonstrated the power of NOMe-seq to detect dynamic changes of DNA methylation and accessibility among different cells.
A complete computational pipeline, NOMeToolkit, was developed to process NOMe-seq data. NOMe-seq is a GpC methyltransferase footprinting technology followed by Whole Genome Bisulfite Sequencing (WGBS). Most parts of NOMeToolkit, therefore, could also be adapted for WGBS analysis. NOMeQC in NOMeToolkit was used to assess the quality of NOMe-seq libraries and to preprocess bad-quality reads before and after read mapping. InvertDupsHunter, one of the major components of NOMeQC, was developed to identify inverted duplicated reads and trim reads in order to increase read mapping efficiency and accuracy. Single-nucleotide polymorphisms (SNPs) could result in inaccurate or missing methylation calls in Bisulfite-seq/NOMe-seq. Bis-SNP, the core part of NOMeToolkit, was developed based on the Genome Analysis Toolkit (GATK) framework and uses Bayesian inference to determine genotypes and methylation/accessibility levels simultaneously. After accurate DNA methylation/accessibility calls at HCG/GCH sites, a "peak calling" method is required for further downstream analysis. A segmentation method based on a two-state beta-binomial Hidden Markov Model (HMM) was developed to identify individual nucleosomes, linkers and Nucleosome Depleted Regions (NDRs). NOMeClonePlot and AlignWig2Loc were developed for locus-specific and genome-wide visualization, respectively, of NOMe-seq data.
Using the experimental technology and computational tools developed above, the functional interactions between DNA methylation and other epigenetic modifiers were evaluated in the human colon cancer cell line HCT116 with both DNMT3B and DNMT1 knocked out. DNA methylation was globally depleted in this double knockout cell line (DKO1). Upon global loss of DNA methylation, however, only a few chromatin accessibility changes were observed. Strikingly, DNA methylation loss caused well-phased nucleosomes at polycomb target CGI promoters and formation of NDRs at non-polycomb target CGI promoters in DKO1 cells when compared to their parent cells. Interestingly, changes in DKO1 cells occurred in a direction such that the cells gained an epigenetic landscape and gene expression similar to normal colonic mucosa. In distal regulatory elements, similar chromatin remodeling and structure changes were observed at several different transcription factor binding sites, such as AP1 and ELF1. Decreases in accessibility at large scale (>20kb) compared to the adjacent regions largely overlapped with H3K9me3-marked heterochromatin regions and Partially Methylated Domains (PMDs). Interestingly, large H3K4me3 blocks were observed in these regions upon global loss of DNA methylation but not in other cell types.
We also tried to connect Oncogene Induced Senescence (OIS) with molecular subtypes in cancers by using NOMe-seq and other epigenomic profiling technologies. Preliminary studies on fibroblast cell lines using Infinium 450K array technology, however, revealed that only a few but consistent DNA methylation changes occurred during OIS, and these were enriched outside gene coding regions.
Taken together, these studies in different normal and cancer cell lines and samples may provide novel insights into the crosstalk between genetic variations, DNA methylation, nucleosome positioning and histone modification during cancer progression.
Asset Metadata
Creator
Liu, Yaping (author)
Core Title
Understanding DNA methylation and nucleosome organization in cancer cells using single molecule sequencing
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Genetic, Molecular and Cellular Biology
Publication Date
09/04/2015
Defense Date
07/07/2014
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
cancer,DNA methylation,nucleosome,OAI-PMH Harvest,SNP,whole genome bisulfite sequencing
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Berman, Benjamin P. (committee chair), Laird, Peter W. (committee member), Siegmund, Kimberly D. (committee member), Smith, Andrew D. (committee member)
Creator Email
lyping1986@gmail.com,yapingli@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-468903
Unique identifier
UC11286920
Identifier
etd-LiuYaping-2880.pdf (filename),usctheses-c3-468903 (legacy record id)
Legacy Identifier
etd-LiuYaping-2880.pdf
Dmrecord
468903
Document Type
Dissertation
Rights
Liu, Yaping
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA