IDENTIFICATION AND CHARACTERIZATION OF
CANCER-ASSOCIATED ENHANCERS
By
Lijing Yao
_________________________________________________________________
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirement for the Degree
DOCTOR OF PHILOSOPHY
(GENETIC, MOLECULAR AND CELLULAR BIOLOGY)
December 2015
Copyright 2015 Lijing Yao
Table of Contents
Table of Contents
Abstract
Acknowledgments
List of Figures
List of Supplementary Figures
List of Tables
Chapter 1. Introduction and background: demystifying the secret mission of enhancers
  1.1 Introduction
  1.2 Identifying target genes of enhancers using methods based on physical interactions between separated regions of chromatin
    1.2.1 Methods used to identify enhancer-promoter linkages
    1.2.2 3 dimension enhancer-promoter interaction patterns
  1.3 Identifying target genes of enhancers using computational analyses to link altered enhancer structure or activity to specific gene expression patterns
    1.3.1 Predicting target genes based on changes in enhancer sequence
    1.3.2 Predicting target genes based on changes in enhancer activity
  1.4 Identifying target genes of enhancers using experimental validation
    1.4.1 Using reporter assays to monitor enhancer activity
    1.4.2 Analyzing enhancers in their natural chromosomal context
  1.5 Conclusions and future perspectives
  1.6 Thesis overview
Chapter 2. Functional annotation of colon cancer risk SNPs
  2.1 Abstract
  2.2 Introduction
  2.3 Results
    2.3.1 CRC risk-associated SNPs linked to a specific gene
    2.3.2 CRC risk-associated SNPs in distal regulatory regions
    2.3.3 Effects of SNPs on binding motifs in the distal elements
    2.3.4 Expression analysis of candidate risk-associated genes
    2.3.5 The effect of enhancer deletion on the transcriptome
  2.4 Discussion
  2.5 Methods
    2.5.1 RNA-seq
    2.5.2 ChIP-seq analysis
    2.5.3 Enhancer deletion
    2.5.4 Analysis of FunciSNP and correlated SNPs effects
    2.5.5 Batch effects analysis
    2.5.6 eQTL analyses
    2.5.7 General Data Handling and Visualization
  2.6 Supplementary figures for chapter 2
Chapter 3. Inferring regulatory element landscapes and transcription factor networks from cancer methylomes
  3.1 Abstract
  3.2 Background
  3.3 Results
    3.3.1 Identifying cancer-specific DNA methylation changes in distal enhancer regions for 10 cancer types
    3.3.2 Linking methylation-affected enhancers to gene expression
    3.3.3 Identification of regulatory TFs in each cancer type
  3.4 Discussion
  3.5 Conclusions
  3.6 Methods
    3.6.1 Availability of source code and R package
    3.6.2 DNA Methylation and RNA-seq Datasets.
    3.6.3 Selecting enhancer probes
    3.6.4 Identifying enhancer probes with cancer-specific DNA methylation changes
    3.6.5 Linking enhancer probes with methylation changes to target genes with expression changes
    3.6.6 ChIA-PET analysis
    3.6.7 Gene Ontology (GO) enrichment analysis
    3.6.8 Motif analyses
    3.6.9 Associating TF expression with TF binding motif methylation
    3.6.10 Survival analyses
  3.7 Supplementary figures for chapter 3
Chapter 4. Conclusions and future studies
  4.1 Summary of main findings
  4.2 Functional annotation of colorectal cancer GWAS SNPs
  4.3 Inferring enhancer regulatory networks using DNA methylation
  4.4 Future directions
Appendix A: Publications as a contributing author
  Cell type-specific binding patterns reveal that TCF7L2 can be tethered to the genome by association with GATA3
  ZBTB33 binds unmethylated regions of the genome associated with actively expressed genes
  Global loss of DNA methylation uncovers intronic enhancers in genes showing expression changes
Appendix B: Supplementary Data Files in DVD
References
Abstract
Enhancers are short regulatory sequences located distal to transcription start sites that are bound
by sequence-specific transcription factors and play a major role in the spatiotemporal specificity of gene
expression patterns in development and disease. While it is now possible to identify enhancer regions
genome-wide in both cultured cells and primary tissues using epigenomic approaches, it has been more
challenging to develop methods to understand the function of individual enhancers because enhancers are
located far from the gene(s) they regulate. However, it is essential to identify target genes of enhancers
not only so that we can understand the role of enhancers in disease but also because this information will
assist in the development of future therapeutic options. In this dissertation, I employ two strategies to
characterize cancer-associated enhancers. The first strategy involves a functional characterization of
enhancers linked to genetic predisposition for colorectal cancer (CRC). Genome-wide association studies
(GWAS) have identified single nucleotide polymorphisms (SNPs) associated with increased risk for
CRC. By integrating genetic and epigenetic information, I identified 23 promoters and 28 enhancers
harboring CRC GWAS SNPs or correlated SNPs. Using gene expression data from normal and tumor
cells, I identified 66 putative target genes of these risk-associated enhancers. In the second strategy, I
developed ELMER (Enhancer Linking by Methylation/Expression Relationships), an R-based tool
integrating DNA methylation and gene expression profiles from primary tissues, to systematically infer
multi-level transcription factor regulatory networks. I then applied ELMER to 10 cancer types, including
more than 2,000 tumor samples from The Cancer Genome Atlas. I identified cancer-specific enhancers
and linked the enhancers to putative target genes and upstream regulatory TFs, developing cancer-
associated regulatory networks. My studies identified several networks regulated by well-known cancer
drivers, such as GATA3 and FOXA1 in breast cancer, SOX17 and FOXA2 (endometrial cancer), and
NFE2L2, SOX2 and TP63 (squamous cell lung cancer), and also identified novel networks with
prognostic associations, including RUNX1 in kidney cancer.
Acknowledgments
Foremost, I would like to express my sincere gratitude to my advisors Professor Benjamin
Berman and Professor Peggy Farnham. They not only provided me continuous guidance for my study and
research but also shared their tremendous wisdom and immense knowledge on both science and life with
me in the past four years. I learned to set high standards on whatever I do, such as making the paper
figures as beautiful as possible. I learned to set weekly bullet points so that I can efficiently finish my
work. I could not have achieved today’s success without their help and guidance. They also made my days full of
surprises and happiness (e.g. spicy dinners, Christmas gifts, and lab parties). If I had another chance
to choose mentors for my Ph.D., I would not hesitate to join their labs again.
In addition to my advisors, I would like to thank the rest of my thesis committee, Professor
Kimberly Siegmund and Professor James Knowles, for their encouragement, insightful comments, and
hard questions. I also thank them for taking time off from their busy schedules to attend my dissertation
committee meetings.
I thank my colleagues Seth Frietze, Adam Blattler, Matt Grimmer, Esther Tak, Malaina Gaddis,
Heather Witt, Shannon Wood, Yaping Liu, Huy Dinh, Hui Shen, Toshinori Hinoue, and Zack Ramjan for
stimulating discussions throughout the years. I enjoyed working with Adam Blattler, discussing science
with Matt Grimmer, hanging out with Esther Tak, Malaina Gaddis and Shannon Wood, and hearing about
adventures from Heather Witt. I spent amazing time in the bioinformatics suite with Yaping Liu, Huy
Dinh, Hui Shen, Toshinori Hinoue and Zack Ramjan; they never tired of teaching me new computing tips.
Last but not least, I would like to thank my family: my parents Baohua Yao and Lan Li, for
giving birth to me in the first place and supporting me spiritually throughout my life. I love you, my
mother and father. And I would like to thank my boyfriend Hilario Medina, who has kept me company
for the last four years, making me happy when I am frustrated, sending me food when I work late in the
lab, and driving me anywhere I want to go. He has been an important part of my Ph.D. journey.
List of Figures
Figure 1.1 Enhancer-mediated gene regulation.
Figure 1.2 3C-based technologies used to identify enhancer-promoter loops.
Figure 1.3 Chromatin 3D structures.
Figure 1.4 Computational methods to link enhancers to putative target genes.
Figure 1.5 Experimental strategies to study enhancer activity.
Figure 1.6 Experimental strategies to identify target genes.
Figure 2.1 Identification of potential functional SNPs for CRC.
Figure 2.2 Expression of risk-associated genes in colon cells.
Figure 2.3 Linking a transcript to an enhancer using TCGA data.
Figure 2.4 Identification of genes affected by deletion of enhancer 7.
Figure 2.5 Summary of identified candidate genes correlated with increased risk for CRC.
Figure 3.1 Identifying cancer-specific DNA methylation changes in distal enhancer regions.
Figure 3.2 Linking differentially methylated probes to expression of nearby genes.
Figure 3.3 Comparison of probe-gene pairs between the different cancer types.
Figure 3.4 Physical characteristics of the probe-gene pairs.
Figure 3.5 Gene Ontology (GO) enrichment analysis for genes identified in more than one cancer type.
Figure 3.6 Identification of enhancer sets predicted to be co-regulated by the same transcription factor.
Figure 3.7 High RUNX1 expression is associated with poor survival in clear cell renal carcinoma.
Figure 3.8 Identification of in vivo TF networks.
Figure 4.1 Four groups of putative enhancer-gene pairs in luminal A breast cancer.
List of Supplementary Figures
Supplementary figure 2.1 Correlated exon SNPs.
Supplementary figure 2.2 Analysis of RNA-seq data.
Supplementary figure 2.3 Correlated TSS SNPs.
Supplementary figure 2.4 ChIP-seq peak analysis.
Supplementary figure 2.5 Correlated enhancer SNPs.
Supplementary figure 2.6 TCGA batch effects analysis. RNA-seq batch effects.
Supplementary figure 2.7 TCGA batch effects analysis. CNV (GW SNP6 array) batch effects.
Supplementary figure 2.8 TCGA batch effects analysis. DNA methylation (Infinium HM450K microarray) batch effects.
Supplementary figure 2.9 Expression analysis of genes identified by promoter and exon SNPs and potential enhancer target genes in TCGA samples.
Supplementary figure 2.10 eQTL analysis summary.
Supplementary figure 3.1 Quantitative summary of links, probes, and genes for each cancer type.
Supplementary figure 3.2 Rank of putative target gene according to distance in the enhancer-gene pairs for each cancer type.
Supplementary figure 3.3 Motif enrichment heatmap.
Supplementary figure 3.4 Proportion of intragenic vs. intergenic enhancers that regulate the nearest gene.
Supplementary figure 3.5 MYC 3’ end enhancer regulates MYC expression in colorectal cancer tissue.
Supplementary figure 3.6 Survival analysis of commonly identified ZNFs.
List of Tables
Table 1.1 Summary of available software for chromatin interaction data.
Table 1.2 Available databases for eQTL
Table 1.3 Summary of computational methods
Table 2.1 Summary of regions linked to CRC tag SNPs.
Table 2.2 Expressed transcripts directly linked to CRC tag SNPs
Table 2.3 Distal regulatory regions correlated with CRC tag SNPs
Table 2.4 Effects of SNPs on motifs in the distal regulatory regions.
Table 2.5 Linking transcripts to enhancers using TCGA data.
Chapter 1. Introduction and background: demystifying the secret mission of
enhancers
Parts of the following introduction are adapted from a review article published in Critical Reviews in
Biochemistry and Molecular Biology; Lijing Yao, Benjamin P. Berman and Peggy J. Farnham.
Demystifying the secret mission of enhancers: linking distal regulatory elements to target genes. 2015.
Lijing Yao wrote the draft and compiled all suggested edits from co-authors; Benjamin P. Berman and
Peggy J. Farnham provided constructive suggestions and edits.
1.1 Introduction
There are two main types of regulatory elements involved in transcriptional activation, promoters
and enhancers. Whereas promoters are easy to identify, usually defined as the region spanning a few Kb
on either side of a transcription start site (TSS) of a coding or non-coding gene, enhancers are more
elusive. Enhancers, first identified in viral genomes more than 30 years ago (Banerji et al., 1981), were
initially defined simply as DNA fragments that are located outside of core promoter regions and that can
increase transcription from a particular gene. Early studies in which enhancers were removed from their
normal genomic location and analyzed in reporter assays indicated that their enhancing activities can be
independent from their exact location or orientation relative to the activated promoter [reviewed in (Plank
and Dean, 2014, Bulger and Groudine, 2011)], suggesting that enhancers can be located at long distances
upstream or downstream of target genes. Although the early reporter assays did not identify the natural
target(s) of the tested enhancers, the hypothesis that most enhancers work at a distance has been adopted
as a general consensus in the field. Multiple models have been proposed to explain how enhancers
regulate transcription of a target gene from a distance (Blackwood and Kadonaga, 1998, Bulger and
Groudine, 2011). The two most common models are “scanning or tracking”, in which TF-containing
protein complexes bind at an enhancer and diffuse (perhaps via rapid on/off events) along the genome to
search for a target promoter (Blackwood and Kadonaga, 1998) and “looping”, in which an enhancer
directly interacts with a target promoter by forming a physical interaction mediated by protein-protein
contact (Bulger and Groudine, 1999, Blackwood and Kadonaga, 1998); see Figure 1.1a. The “scanning”
model is consistent with a proposed mechanism by which insulator proteins such as CTCF function (i.e.
by creating distinct chromatin domains on either side of the bound protein). One study in support of this
model detected short RNAs transcribed from the DNA between an enhancer and the nearest promoter
(Zhu et al., 2007), and a second study observed that TF-containing protein complexes, including RNA Pol II,
bind at DNA between an enhancer and a nearby gene promoter (Hatzis and Talianidis, 2002). The “scanning”
model implies that an enhancer should regulate the nearest active promoter and thus would not be
consistent with long-range interactions in which an enhancer bypasses multiple promoters to regulate a
more distally located gene (Wang et al., 2005). Although it is possible that some enhancers function via
the “scanning” model and some function via the “looping” model, recent evidence provided by multiple
nuclear architecture studies (Krivega et al., 2014, Tolhuis et al., 2002) has tipped the balance in support of
the “looping” model.
In the looping model, two genomic regions separated by a distance are brought together via
protein-protein interactions mediated by transcription factors bound at a distal element and at a promoter.
One ramification of this model is that it should be possible to identify distal enhancers by the location of
transcription factor (TF) binding motifs. Unfortunately, TF binding motifs are usually less than 10 nts in
length (Stewart et al., 2012) and thus are found throughout the genome, making it difficult to identify an
enhancer using only bioinformatics approaches. However, recent improvements in genome-wide
technologies such as ChIP-seq and DNase-seq (ENCODE_Project_Consortium, 2012, Thurman et al.,
2012) now allow what is thought to be a comprehensive identification of distal regulatory regions within
a given cell type. The most common distal regulatory elements are DNase I hypersensitive sites (DHSs),
which are regions of nucleosome-free chromatin that harbor clusters of transcription factor binding
sites (ENCODE_Project_Consortium, 2012, RoadmapEpigenomicsConsortium, 2015). There are
estimated to be ~3 million DHSs in the human genome, although not all DHSs are present in a given cell
type (Thurman et al., 2012) and different subsets of DHSs have flanking regions marked by specifically
modified histones that are thought to distinguish enhancers from promoters. For example, potentially
active enhancers have flanking regions with well-positioned nucleosomes in which histone H3 is marked
by monomethylation of lysine 4 (H3K4me1) (Heintzman et al., 2007), and fully active enhancers have flanking
regions with well-positioned nucleosomes in which histone H3 is marked not only by monomethylation
of lysine 4 but also by acetylation of lysine 27 (H3K27Ac) (Heintzman et al., 2009, Rada-Iglesias et al.,
2011); see Figure 1.1b,c. Enhancers also have low levels of histone H3 trimethylated on lysine 4
(H3K4me3). In contrast, promoters are marked by high levels of H3K4me3, low levels of H3K4me1, and
variable levels of H3K27Ac. In addition to modified histones and site-specific DNA-binding TFs such as
TCF7L2, transcriptional coactivators such as EP300 and CBP also localize at enhancer regions (Blow et
al., 2010, Visel et al., 2009a). The combination of DHSs, modified histones, and TFs has allowed the
identification of putative enhancer elements throughout the genome in more than 100 cell types (Whitaker
et al., 2015, RoadmapEpigenomicsConsortium, 2015). Recent studies have also shown that enhancers can
be identified as distal regions that have low levels of DNA methylation (Stadler et al., 2011, Thurman et
al., 2012, RoadmapEpigenomicsConsortium, 2015, ENCODE_Project_Consortium, 2012).
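The point made above, that TF binding motifs shorter than 10 nt are found throughout the genome, can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the analyses in this dissertation; it assumes a genome of roughly 3 billion bp, equal frequencies of the four bases, independence between positions, and a single fixed (non-degenerate) motif searched on both strands. The function name and default value are hypothetical.

```python
def expected_motif_hits(motif_length: int, genome_size: int = 3_000_000_000) -> float:
    """Expected number of chance matches to one fixed motif in a random genome.

    Assumptions (illustrative only): each base occurs with probability 0.25,
    positions are independent, and both DNA strands are searched.
    """
    # Probability that any given position starts an exact match to the motif.
    match_probability = 0.25 ** motif_length
    # Two strands, and approximately genome_size possible start positions.
    return 2 * genome_size * match_probability


if __name__ == "__main__":
    for length in (6, 8, 10):
        hits = expected_motif_hits(length)
        print(f"{length}-nt motif: about {hits:,.0f} chance matches expected")
```

Under these simplifying assumptions, even a 10-nt motif is expected to match thousands of sites by chance, and real motifs are typically degenerate, which only increases the number of matches. This is why motif location alone is a poor predictor of enhancers.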
Figure 1.1 Enhancer-mediated gene regulation.
A) Shown are two models for gene regulation by enhancers. The left panel illustrates the “scanning or
tracking” model in which a TF-containing protein complex binds at an enhancer and moves along the
genome, searching for a target promoter (the nearest promoters are labeled in brown and distal promoters
are labeled in red). The right panel illustrates the “looping” model in which an enhancer directly interacts
with a target promoter by forming a DNA loop mediated by protein-protein contacts. B) Shown is an
illustration of the distinctive chromatin signatures at active vs. inactive enhancers and promoters. Active
enhancers provide nucleosome-free regions for binding of clusters of transcription factors (TF) and are
flanked by nucleosomes marked by H3K4me1 (cyan dots) and H3K27ac (green dots); active promoters
have flanking nucleosomes marked by H3K4me3 (blue dots). CpG sites throughout the human genome
have high levels of DNA methylation (red dots) except at active enhancers and promoters. C) DNA
methylation (WGBS), ENCODE ChIP-seq data (labeled according to the antibody used in each
experiment), and the location of DHSs and TF binding data for HCT116 cells from the University of
California, Santa Cruz genome browser are shown for an enhancer and a promoter region.
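The chromatin signatures described in the text and in Figure 1.1B can be summarized as a small set of rules. The following is a minimal sketch of those rules, assuming that each histone mark has already been reduced to a simple present/absent call for the nucleosomes flanking a region; real ChIP-seq signals are quantitative, and the class and function names here are hypothetical rather than part of any published tool.

```python
from dataclasses import dataclass


@dataclass
class FlankingMarks:
    """Simplified present/absent calls for histone marks flanking a region."""
    h3k4me1: bool  # monomethylation of H3 lysine 4
    h3k27ac: bool  # acetylation of H3 lysine 27
    h3k4me3: bool  # trimethylation of H3 lysine 4


def classify_region(marks: FlankingMarks) -> str:
    """Label a nucleosome-free region using the signatures described in the text."""
    # Promoters: high H3K4me3, low H3K4me1, variable H3K27ac.
    if marks.h3k4me3 and not marks.h3k4me1:
        return "promoter"
    # Enhancers: H3K4me1 with low H3K4me3; H3K27ac marks the fully active state.
    if marks.h3k4me1 and not marks.h3k4me3:
        if marks.h3k27ac:
            return "active enhancer"
        return "potentially active enhancer"
    # Any other combination is not covered by these simple rules.
    return "unclassified"


if __name__ == "__main__":
    example = FlankingMarks(h3k4me1=True, h3k27ac=True, h3k4me3=False)
    print(classify_region(example))
```

Treating the marks as binary is a deliberate simplification; in practice, thresholds on signal intensity and additional evidence such as DHSs and TF binding would be needed.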
Reporter assays using model organisms have revealed a high degree of spatiotemporal cell-type
specificity of enhancers. An early genome-wide ChIP-seq study in human cells compared undifferentiated
embryonic stem cells (hESCs) to induced early mesendoderm or neuroepithelium cells, finding that the
enhancer marks showed cell-type-specific patterns but that the promoter mark H3K4me3 was largely
invariant across cell types (Hawkins et al., 2011). A large ChIP-seq study across seven developmental
time points and three developmental lineages showed a very high degree of lineage and temporal
specificity at enhancer regions, but very few differences in promoter regions (Nord et al., 2013). In
addition, a recent study comparing enhancers in many different types of human cells has shown that
different cell lineages can be revealed using enhancer marks (RoadmapEpigenomicsConsortium, 2015).
It is thought that cell-type-specific enhancers bound by critical lineage-specifying transcription factors
help to orchestrate the precise order of expression of both protein-coding genes (e.g. SHH (Petit et al.,
2015)) and non-coding RNAs (e.g. let-7 family microRNAs (Cohen et al., 2015)), to ensure proper
development and differentiation (Boland et al., 2014, Buecker and Wysocka, 2012, Rada-Iglesias et al.,
2012).
Considering the important role of enhancers in orchestrating development and differentiation, it is
not surprising that many diseases are associated with changes in enhancer activity. Mutations in
sequence-specific enhancer-binding TFs (e.g. GATA factors (Zheng and Blobel, 2010) and Hox factors
(Quinonez and Innis, 2014)) and transcriptional coregulators (e.g. CBP and RB (Iyer et al., 2004, Vile and
Winterbourn, 1989, Janknecht, 2002)) have long been associated with disease. However, such mutations
are likely to affect all genes regulated by these enhancer-binding factors. In contrast, abnormal sequence
variants that alter the activity of individual enhancers can lead to disease by altering expression of specific
genes. Early examples showed that inherited deletions of enhancers in the β-globin locus led to the
decreased β-globin expression underlying β-Thalassemia (Kulozik et al., 1988, Kioussis et al., 1983).
Genome-wide association studies (GWAS) have identified thousands of single nucleotide polymorphisms
(SNPs), defined as germline nucleotide variations that occur within a population at a frequency above 1%,
that are associated with particular diseases [reviewed in (Freedman et al., 2011, Hindorff et al., 2009)].
Interestingly, most SNPs identified by GWAS are located in non-coding regions (Yao et al., 2014,
Freedman et al., 2011) and recent studies from the ENCODE Project, the Roadmap Epigenome Mapping
Consortium, and other groups, have found that many disease-associated SNPs fall within enhancer
regions, suggesting that these SNPs cause changes in gene expression that lead to an increased risk for
that disease (RoadmapEpigenomicsConsortium, 2015, Yao et al., 2014, Gjoneska et al., 2015, Farh et al.,
2015, Akhtar-Zaidi et al., 2012). In support of this hypothesis, multiple studies have shown that SNPs at
disease-relevant enhancers are likely to impact the binding of transcription factors (Hazelett et al., 2014,
Herz et al., 2014).
Certain rare somatic mutations (nucleotide changes that occur after birth in the genome of
specific tissues and thus are not inherited) at enhancers have been associated with various diseases. For
example, a mutation that disrupts an enhancer for the RET gene (located 15 Kb away from the TSS)
results in a 20-fold greater contribution to the risk of Hirschsprung’s disease than other known RET
mutations (Emison et al., 2005). Large-scale chromosomal changes that affect enhancers can also lead to
disease, such as the translocation of the active IgH enhancer to the MYC locus in Burkitt’s lymphoma
(Siebenlist et al., 1984). Also, chromosomal rearrangements can relocate an enhancer that regulates
GATA2 expression, leading to aberrant expression of the proto-oncogene EVI1 and causing acute
myeloid leukemia (Groschel et al., 2014). This type of oncogene “enhancer hijacking” appears to be
common in non-hematopoietic cancers as well. For example, large deletions between the TMPRSS2
enhancer and various ETS-family oncogenes (such as ERG) are common in prostate cancer (Tomlins et
al., 2005) and translocations between various active enhancers and the GLI oncogenes appear to underlie
~10% of pediatric medulloblastoma cases (Northcott et al., 2014). More complex structural changes
involving enhancers can also underlie disease. For example, a large genomic deletion (660 Kb) results in
the loss of a topological domain boundary that normally prevents interaction between forebrain-specific
enhancers and the LMNB1 promoter. This deletion is responsible for acquisition of autosomal dominant
adult-onset demyelinating leukodystrophy in some patients because of the overexpression of LMNB1
(Giorgio et al., 2015).
Most sequencing of disease tissues has thus far been limited to exonic sequences. With the
addition of whole genome sequencing of patients to the toolkit of personalized medicine, it is certain that
many more enhancers that harbor germline variants or somatic nucleotide mutations, or that are located within or
near chromosomal alterations, will be identified. In fact, a recent analysis of 436 complete cancer
genomes (Melton et al., 2015) generated by the TCGA Consortium identified recurrent mutations in distal
regulatory elements. Mutated or variant enhancers, when associated with a particular disease, may
become potential therapeutic targets. Importantly, the cell type-specific activity of enhancers may enable
more precise therapy, as compared to agents that inhibit entire signaling networks. A good example of an
enhancer that may be appropriate for such targeted therapy is one that regulates BCL11A, a transcription
factor that represses expression of fetal hemoglobin (Sebastiani et al., 2015). There are different types of
hemoglobin that show specificity of expression in adult vs. fetal cells. Patients with sickle cell disease
(SCD) have mutations in the HbA hemoglobin proteins that are normally expressed in adult cells. Clinical
observations have shown that SCD patients with higher levels of the fetal-specific type of hemoglobin
(HbF) have a better prognosis (Peterson et al., 2014, Dong et al., 2013). Therefore, it has been suggested
that reducing levels of BCL11A (which would allow re-activation of fetal hemoglobin) might be an
appropriate therapy for SCD patients. However, although BCL11A would seem to be an attractive
therapeutic target for SCD, there are concerns about directly inactivating this transcriptional repressor
because it plays important roles in other cell types (Yu et al., 2012, Wakabayashi et al., 2003). The
discovery of a red blood cell-specific enhancer for BCL11A located ~65 Kb away from the TSS (Sebastiani
et al., 2015) suggests an alternative approach: downregulating BCL11A expression specifically
in red blood cells by disabling the red blood cell-specific enhancer using chromatin editing or genomic
nucleases, thereby avoiding off-target effects in other cell types.
Because enhancers play cell type-specific roles in specifying gene expression patterns that regulate both
normal development and human diseases, it is clearly important to fully understand their function.
However, very few enhancers have been studied in the same detail as those described above.
In addition, there is still an unsolved fundamental question: what are the target genes of the hundreds of
thousands of enhancers that have been identified in the human genome? The most recent Gencode release
has identified ~60 000 coding and non-coding genes (http://www.gencodegenes.org/stats/current.html)
that are expressed from ~200 000 promoters
(http://fantom.gsc.riken.jp/5/datafiles/latest/extra/CAGE_peaks/). Current estimates are that 10 000-15
000 genes are expressed in a given cell type (Bengtsson et al., 2005) and that each cell type has 44 000-
294 000 active enhancers (resulting in a total of 389 967 non-overlapping enhancer regions across 98
tissue and cell lines) (RoadmapEpigenomicsConsortium, 2015, Yao et al., 2015). Thus, at both the global
level and within a given cell type, there are many more enhancers than expressed genes. Also, as
discussed above, enhancers tend to be cell type-specific, suggesting that a gene can be regulated by
different enhancers in different cell types. In addition, an enhancer can regulate different promoters in
different cells, as observed at the β globin locus (Holwerda and de Laat, 2013). Thus, enhancer targets
must be identified in a cell type-specific manner. Although these numbers suggest that in a particular cell
type, an enhancer may, on average, regulate only one or a small number of genes, the flexibility of the
distances between enhancers and target genes makes it very complicated to predict which gene is
regulated by a specific enhancer.
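
The comparison between enhancer and gene counts can be made concrete with a short calculation. The following is only a rough sketch: it uses the per-cell-type ranges quoted above, and the function name and the idea of reporting a simple ratio are illustrative assumptions rather than an analysis taken from the cited studies.

```python
# Per-cell-type ranges quoted in the text above; the ratio is a rough illustration only.
EXPRESSED_GENES = (10_000, 15_000)    # expressed genes per cell type
ACTIVE_ENHANCERS = (44_000, 294_000)  # active enhancers per cell type


def enhancers_per_gene(enhancers, genes):
    """Return the (lowest, highest) enhancer-to-gene ratio for two (min, max) ranges."""
    lowest = min(enhancers) / max(genes)
    highest = max(enhancers) / min(genes)
    return lowest, highest


low, high = enhancers_per_gene(ACTIVE_ENHANCERS, EXPRESSED_GENES)
print(f"{low:.1f} to {high:.1f} active enhancers per expressed gene")
```

Under these figures, every cell type has more active enhancers than expressed genes, which is consistent with an average enhancer regulating only one or a few genes.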
To date, very few experiments have been performed in mammalian cells with the goal of linking
an enhancer to a specific target gene. In model organisms such as fruitfly and mouse, linkages have been
established using in vivo reporter constructs followed by in situ imaging of developmental expression
patterns. When an enhancer reporter construct has a highly specific spatiotemporal pattern that matches
that of a neighboring gene, it is taken as strong functional evidence that an enhancer-gene target pair has
been identified. An analysis of thousands of candidate enhancers in Drosophila suggests that even in a
relatively compact genome, enhancers operate at large distances from the genes they regulate and that
significant numbers do not regulate the nearest annotated gene (Kvon et al., 2014). Such developmental
approaches to study enhancers are impractical in mammals. However, recent advances in experimental
techniques that allow the investigation of the three-dimensional architecture of chromatin, as well as
analytical approaches that take advantage of large multi-dimensional epigenomic datasets, provide
methods by which investigators can predict which genes are regulated by specific enhancers. This review
covers 1) methods based on physical interactions, including Chromosome Conformation Capture (3C)
(Dekker et al., 2002, Hagege et al., 2007), Circular Chromosome Conformation Capture (4C) (Zhao et al.,
2006, Simonis et al., 2006), Chromosome Conformation Capture Carbon Copy (5C) (Dostie and Dekker,
2007), Hi-C (Lieberman-Aiden et al., 2009, van Berkum et al., 2010), tethered conformation capture
(TCC) (Kalhor et al., 2011), Capture Hi-C (CHi-C) (Jager et al., 2015), Capture-C (Hughes et al., 2014),
DNase I Hi-C (Ma et al., 2015), and Chromatin Interaction Analysis by Paired-End Tag Sequencing
(ChIA-PET) (Fullwood and Ruan, 2009) and 2) methods based on gene expression associations using
SNPs (Westra and Franke, 2014), DHSs (Sheffield et al., 2013, Thurman et al., 2012), histone
modifications (Ernst et al., 2011, Shen et al., 2012), and DNA methylation (Aran et al., 2013, Aran and
Hellman, 2013, Yao et al., 2015) at enhancer regions. Of course, each of these methods produces
predictions of cell-type-specific enhancer-gene linkages that should be verified by follow-up experiments.
Therefore, I also describe methods, such as genomic deletion or epigenetic inactivation of an enhancer,
which can be employed in such experiments. By combining results from these computational and
experimental methods, an encyclopedia of enhancer-gene linkages can be developed to help guide future
biological studies or clinical therapeutic treatments.
1.2 Identifying target genes of enhancers using methods based on physical
interactions between separated regions of chromatin
1.2.1 Methods used to identify enhancer-promoter linkages
Many methods that study chromatin interactions are based on a technique termed 3C. The
principle of 3C technology relies on formaldehyde crosslinking of interacting chromatin fragments,
restriction enzyme digestion, ligation of the interacting fragments, and finally PCR analysis using primers
specific for the fragments of interest (Tolhuis et al., 2002, Hagege et al., 2007, Dekker et al., 2002).
Multiple variations of 3C (4C, 5C, Hi-C, and TCC) have been developed, the most recent ones being
adapted for genome-wide analyses; see previous reviews for methodological details (Dekker et al., 2013,
de Wit and de Laat, 2012, Lan et al., 2012). In brief, 3C and 4C-seq methods can produce interaction
profiles for individual genomic loci of interest, such as promoters and enhancers; see Figure 1.2 plus
Table 1.1 for a list of software associated with 3C-based methods. 3C investigates possible long-distance
interactions between two loci based on prior knowledge of a potential interaction in a “one-to-one”
manner (Dekker et al., 2002). In contrast, 4C-seq identifies all chromosomal regions interacting with a
single specific genomic locus (described as a “viewpoint” or “bait”), in a “one-to-all” manner (Zhao et al.,
2006, Simonis et al., 2006). The 5C, Hi-C, and TCC methods use distinct approaches to overcome the
limitation of choosing an individual locus for study, potentially allowing the investigation of all
chromatin interactions throughout the genome. For example, 5C detects ligation products in a 3C library
using ligation-mediated amplification (LMA) (Dostie and Dekker, 2007, Dostie et al., 2006). Hi-C labels
ligation products in a 3C library using biotin so that all ligated fragments can be enriched for sequencing
(Lieberman-Aiden et al., 2009). Finally, TCC uses a biotin-based enrichment strategy similar to that of Hi-C,
except that the ligation is performed on a solid substrate rather than in solution to improve the signal-to-noise
ratio (Kalhor et al., 2011). As expected, 5C provides less comprehensive interaction profiles than
the other two methods because genomic coverage of 5C is limited by the requirement for large numbers
of primers. For example, a 5C study using primers designed to identify interactions between 628 TSS-containing
restriction fragments and 4535 distal restriction fragments identified fewer than 2000 promoter-distal
interactions (Sanyal et al., 2012), whereas ~60 000 promoter-distal interactions were identified
using Hi-C (Jin et al., 2013). The resolution for detecting interactions by the three genome-wide
technologies is limited by sequencing depth and the frequency of restriction enzyme (RE) cutting sites. In
a typical Hi-C experiment, 3.4-8.4 billion reads can produce interaction profiles at a 5 Kb to 1 Mb
resolution, depending on bin size (Jin et al., 2013, Lieberman-Aiden et al., 2009). A recent study of 3D
nuclear architecture using a method called in situ Hi-C, which uses the original Hi-C protocol but
performs the digestion and ligation steps in intact nuclei, produced over 25 billion reads for nine human
cell types, and improved the resolution of chromatin interactions to 1 Kb to detect over 15 billion distinct
interactions across the nine cell types (Rao et al., 2014).
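
The dependence of resolution on bin size and sequencing depth can be illustrated with a minimal sketch of how read pairs are binned into a contact map. The input format (a list of coordinate pairs on one chromosome), the function name, and the toy coordinates are assumptions made for illustration; this is not the pipeline used in any of the cited studies.

```python
from collections import Counter


def bin_contacts(read_pairs, bin_size):
    """Count read pairs per (bin_i, bin_j) cell for a single chromosome.

    read_pairs: iterable of (pos1, pos2) base-pair coordinates (hypothetical input).
    bin_size:   bin width in bp; smaller bins give finer resolution but
                spread the same reads over many more cells.
    """
    counts = Counter()
    for pos1, pos2 in read_pairs:
        # Sort the two bin indices so that (i, j) and (j, i) share one cell.
        i, j = sorted((pos1 // bin_size, pos2 // bin_size))
        counts[(i, j)] += 1
    return counts


# Toy data: five read pairs on one chromosome.
pairs = [(1_200, 48_000), (3_900, 47_100), (4_100, 52_000), (960_000, 20_000), (6_000, 8_000)]

coarse = bin_contacts(pairs, 1_000_000)  # 1 Mb bins
fine = bin_contacts(pairs, 5_000)        # 5 Kb bins
print(len(coarse), max(coarse.values()))
print(len(fine), max(fine.values()))
```

With 1 Mb bins all five toy read pairs fall into a single cell, whereas with 5 Kb bins they are spread over four cells with at most two reads each. This is why far deeper sequencing is needed before fine-scale bins contain enough reads to call interactions.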
Figure 1.2 3C-based technologies used to identify enhancer-promoter loops.
All 3C-based technologies begin with formaldehyde treatment, leading to crosslinking of DNA fragments in
close proximity. The 3C, 4C, and 5C methods begin with restriction enzyme (RE) digestion of the chromatin
into small pieces (digestion sites represented by black bars). Crosslinked fragments are ligated to form
unique hybrid DNA molecules and then the DNA is purified. In 3C, a predicted ligation product can be
analyzed by PCR using a pair of primers; this is termed a one-to-one approach. In 4C, the 3C ligation
library is digested with a second RE to digest the DNA to smaller sizes (2nd digestion sites are labeled
as green ovals) and then the fragments are ligated to form a circle. Inverse
PCR is utilized to generate a genome-wide interaction profile for a single locus (analyzed by high
throughput sequencing); this is termed a one-to-all approach. 5C detects ligation products from a 3C
library using ligation-mediated amplification (LMA) followed by high throughput sequencing; this is
termed a many-to-many approach. Starting from 3C fragmentation products, Hi-C includes a unique
step in which sticky ends resulting from the RE digestion are filled in with biotinylated nucleotides
(shown as red dots). This facilitates a streptavidin-based enrichment of the ligation products for
sequencing. The difference between TCC and Hi-C is that TCC adds an initial protein biotinylation and
tethering step, such that the fragmentation and ligation are performed on a solid substrate; TCC and Hi-C
are termed all-to-all approaches. Specific subsets of TCC and Hi-C products can be selected prior to
sequencing using oligonucleotides or arrays in CHi-C and Capture-C, allowing an all-to-all analysis of
selected genomic regions. DNase Hi-C uses the conventional Hi-C protocol but replaces the RE
fragmentation step with DNase I digestion and thus is an all-to-all approach. ChIA-PET, which is quite
different from the other 3C-based methods, begins with sonication of the chromatin, which is followed by
a conventional chromatin immunoprecipitation step. Then, A (purple) and B (orange) linkers are added to
two groups of material which are mixed together for the ligation step, the ligation products are digested
with MmeI, and the DNA is sequenced. The frequency of random ligations between the two different
linkers (AB) is used to estimate the frequency of nonspecific ligation. ChIA-PET is termed an all-to-all
approach for interactions involving a specific protein.
Table 1.1 Summary of available software for chromatin interaction data.
Software | Data | Program | Input | Analysis | Software website | PMID
Basic4Cseq | 4C | R/Bioconductor package | BAM | 1. Create RE fragment library. 2. Filter 4C-seq data and map reads to fragments. 3. Visualization. 4. Quality control | http://www.bioconductor.org/packages/release/bioc/html/Basic4Cseq.html | 25078398
fourSig | 4C | perl and R | SAM | 1. Filter 4C-seq data and map reads to fragments. 2. Determination of significant enrichment by permutations. 3. Visualization | http://sourceforge.net/projects/foursig/ | 24561615
r3Cseq | 4C | R package | BAM | 1. Filter 4C-seq data and map reads to fragments. 2. Data normalization. 3. Identify significant interactions. 4. Visualization | http://r3cseq.genereg.net/Site/index.html | 23671339
3DG | 5C | web-based | Primer sets and contact matrix | 1. Design 5C primer sets. 2. Visualization | http://3DG.umassmed.edu | 19789528
HiTC | 5C, Hi-C | R/Bioconductor package | Interaction count matrix in csv | 1. Quality control. 2. Visualization. 3. Interaction map transformation and normalization | http://www.bioconductor.org/packages//2.10/bioc/html/HiTC.html | 22923296
HiFive | 5C, Hi-C | Python | BAM | 1. Filter 4C-seq data and map reads to fragments. 2. Data normalization. 3. Identify interaction enrichment and boundary index. 4. 3D modeling | http://bxlab-hifive.readthedocs.org/en/1.0.3/index.html | http://biorxiv.org/content/early/2014/10/03/009951
ChIA-PET Tool | ChIA-PET | java software | Fastq | 1. Linker filtering. 2. Reads mapping. 3. Identify interactions and ChIP-seq peak calling. | http://chiapet.gis.a-star.edu.sg/ | 20181287
Chiasig | ChIA-PET | Perl and R | Contacts in BEDPE-format | 1. Identify significant interactions using model based on non-central hypergeometric distribution. | http://folk.uio.no/jonaspau/chiasig/ | 25114054
Mango | ChIA-PET | R package | Fastq | 1. Linker filtering. 2. PET mapping to reference genome. 3. Identify interactions and ChIP-seq peak calling. | https://github.com/dphansti/mango | NA
HiC-inspector | Hi-C, TCC | perl and R | fastq | 1. Alignment. 2. Filter reads that are within the DNA fragment size window around restriction enzyme sites. 3. Count interactions. 4. Visualization heatmap | https://github.com/HiC-inspector/HiC-inspector | NA
HiC-Pro | Hi-C, TCC | C++, python, R, bash | fastq | 1. Alignment. 2. Count interactions. 3. Visualization | https://github.com/nservant/HiC-Pro | NA
ICE | Hi-C, TCC | Python | fastq | 1. Mapping against reference genome. 2. Use iterative correction to generate corrected Hi-C interaction map | http://mirnylab.bitbucket.org/hiclib/index.html | 22941365
HIPPIE | Hi-C, TCC | perl and bash | fastq | 1. Mapping against reference genome. 2. Quality control. 3. Identify Hi-C peaks. 4. Integrate epigenomic data to predict enhancer-gene linkages. | http://wanglab.pcbi.upenn.edu/hippie/ | 25480377
HiCUP | Hi-C, TCC | Perl and R | fastq | 1. Mapping against reference genome. 2. Filter experimental artifacts. 3. Quality control. 4. Generate BAM/SAM file for post analysis by other software. | http://www.bioinformatics.babraham.ac.uk/projects/hicup/ | NA
Homer | Hi-C, TCC | perl | SAM, BED and so on | 1. Normalization of interaction matrices. 2. Identify significant interactions. 3. Subnuclear compartment analysis. 4. Structure interaction matrix analysis. 5. Visualization | http://homer.salk.edu/homer/index.html | 20513432
HiCdat | Hi-C, TCC | C and R | BAM | 1. Interaction analysis. 2. Integrate multiple epigenetic information for interaction annotation. 3. Comparison analysis for Hi-C data | https://github.com/MWSchmid/HiCdat | 25132176
HiCNorm | Hi-C, TCC | R package | Raw Hi-C cis contact map and the local genomic | Normalization of contacts | http://www.people.fas.harvard.edu/~junliu/HiCNorm/ | 23023982
HiCorrector | Hi-C, TCC | C | Raw contact matrix | Normalization of contacts | http://zhoulab.usc.edu/Hi-Corrector/ | 25391400
hicpipe | Hi-C, TCC | perl and R | Raw Hi-C contacts file | Estimate Hi-C biases and normalize interactions. | http://compgenomics.weizmann.ac.il/tanay/?page_id=283 | 22001755
MDM | ChIA-PET | R package | Contact file | Identify the true interaction. | http://www.stat.osu.edu/~statgen/SOFTWARE/MDM/ | 24835279
HiCseg | Hi-C, TCC | R package | Contact matrix | Detect cis-interactions | http://cran.r-project.org/web/packages/HiCseg/index.html | 25161224
AutoChrom3D | Hi-C | web-based | Contact file | 1. Analyze chromatin interactions using sequencing-bias-relaxed structure parameter to normalize chromatin interactions. 2. 3D modeling | http://ibi.hzau.edu.cn/3dmodel/index.php | 23965308
InfMod3DGen | Hi-C, TCC | MATLAB | Contact file | 3D modeling | https://github.com/wangsy11/InfMod3DGen | 25690896
ChromSDE | Hi-C, TCC | Matlab | Output files from Hi-C pipeline of Tanay's Group | 3D modeling | http://biogpu.ddns.comp.nus.edu.sg/~chipseq/ChromSDE/ | 24195706
PASTIS | Hi-C, TCC | Python | Contact file | 3D modeling | http://cbio.ensmp.fr/~nvaroquaux/pastis/ | 24931992
LACHESIS | Hi-C, TCC | C, perl and R | fastq | De novo genome assembly by Hi-C | http://shendurelab.github.io/LACHESIS/ | 24185095
FisHiCal | Hi-C, TCC | R package | Normalized Hi-C interactions and FISH data | Integrate Hi-C data interaction and FISH to reconstruct nuclear 3D structure. | http://cran.r-project.org/web/packages/FisHiCal/index.html | 25061071
NuChart | Hi-C, TCC | R package | Contact file | 1. Integrate Hi-C and other genomic features to annotate and analyze a list of input genes. 2. Visualization | ftp://fileserver.itb.cnr.it/nuchart | 24069388
CytoHiC | Hi-C, TCC | Cytoscape plugin | Normalized Hi-C interactions | Interaction network analysis | http://apps.cytoscape.org/apps/cytohicplugin | 23508968
Juicebox | 5C, Hi-C, ChIA-PET | java software | | Visualization of published Hi-C data | http://www.aidenlab.org/juicebox/ | 25497547
Although the above-mentioned studies suggest that deep sequencing of Hi-C data can improve
resolution of a chromosome interaction map, it is monetarily and computationally expensive to obtain and
analyze the number of reads necessary to reach 1 Kb resolution. Also, among the more than 1 million
interactions detected in IMR90 fetal lung cells using Hi-C, only 57 585 were linkages between a known
promoter and a distal region (Jin et al., 2013). Thus, the interactions of major interest in the investigation
of enhancer-gene regulatory networks constitute a minority of all chromosomal interactions identified
using Hi-C-based methods. One way to increase resolution is to focus only on a subset of interactions of
interest, using methods such as ChIA-PET and Capture Hi-C (CHi-C). ChIA-PET is a ChIP-based
method that employs 3C principles but uses antibodies to a specific protein to collect the interacting
fragments. This method can be used to map all interactions at a subset of enhancers bound by a specific
TF (e.g. using an antibody to CTCF (Handoko et al., 2011) or estrogen receptor α (Fullwood et al., 2009))
or at all active promoters (using an antibody to RNA polymerase II (Li et al., 2012a)). CHi-C is a new
approach that uses sequence capture technology to enrich a Hi-C library for annotated promoters. This
technique resulted in a 10-fold enrichment of reads involving promoters, allowing more promoter
coverage at a fraction of the cost; sequencing of 754 million uniquely mapped paired-end reads identified
approximately 1 million promoter-based interactions. In addition to increasing sequencing depth at
regulatory regions, another way to fundamentally improve resolution in identifying interaction loci is to
replace the typically used REs having a six-nucleotide recognition sequence (e.g. HindIII) with REs having more
frequent recognition sites in the genome or to change to DNase I-based digestion or chromatin
fragmentation by sonication. The Capture-C method, which combines a 3C method using an RE that has a
four-nucleotide recognition sequence (DpnII) with a hybridization-based capture of targeted regions and
high-throughput sequencing, can provide an unbiased, high-resolution profile of cis interactions for
hundreds of genes in a single experiment (Hughes et al., 2014). DNase I Hi-C replaces the RE digestion
step in the conventional Hi-C protocol with digestion by DNase I and performs slightly better than RE-
based Hi-C in terms of biases in G+C content and mappability (Ma et al., 2015).
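
The gain in resolution from a more frequently cutting RE can be estimated with a simple calculation. The sketch below assumes an idealized random sequence with equal base frequencies, in which an n-nucleotide site is expected once every 4^n bp; real genomes deviate from this. The recognition sequences of HindIII (AAGCTT) and DpnII (GATC) are standard, whereas the function names and the toy sequence are illustrative assumptions.

```python
def expected_spacing(site):
    """Expected distance in bp between sites, assuming random sequence with equal base frequencies."""
    return 4 ** len(site)


def count_sites(sequence, site):
    """Count occurrences (overlaps allowed) of a recognition site on the given strand."""
    count = 0
    start = sequence.find(site)
    while start != -1:
        count += 1
        start = sequence.find(site, start + 1)
    return count


HINDIII = "AAGCTT"  # six-nucleotide recognition sequence
DPNII = "GATC"      # four-nucleotide recognition sequence

print(expected_spacing(HINDIII), expected_spacing(DPNII))

toy = "GATCAAGCTTGATCCGATCAAGCTTTGATC"  # toy sequence, not genomic
print(count_sites(toy, HINDIII), count_sites(toy, DPNII))
```

Under this idealized model, a six-nucleotide cutter is expected to cut about once every 4096 bp and a four-nucleotide cutter about once every 256 bp, a 16-fold difference in fragment density. Both sites are palindromic, so scanning a single strand is sufficient.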
1.2.2 Three-dimensional enhancer-promoter interaction patterns
A decade of 3D chromatin conformation studies has resulted in highly ordered and complicated
long-range chromosomal interaction maps for human and mouse cells; these studies have challenged a
common assumption that enhancers regulate the nearest genes. Instead, the complicated long-range
enhancer-promoter interactions identified from chromosome interaction studies strongly suggest that
enhancers frequently skip the nearest promoter to regulate a more distal gene. For example, a 5C study of
GM12878, K562 and HeLa-S3 cells found that only 27% of the distal elements interact with the nearest
TSS, with the number increasing to 47% when only expressed genes were used in the analysis (Sanyal et
al., 2012). Another study using ChIA-PET and an antibody to RNA polymerase II found that ~40% of the
enhancer elements did not interact with the nearest promoter (Li et al., 2012a). Studies have also shown
that the distances between interacting enhancers and promoters can be quite large. For example, only 25%
of the enhancer-promoter interacting fragments were within 50 Kb and about 57% of the contacts spanned
more than 100 Kb, as reported from Hi-C experiments in IMR90 cells (Jin et al., 2013). In addition,
enhancer-promoter interactions are not limited to one-to-one relationships. Rather, an enhancer can
contact multiple promoters and a promoter can contact multiple enhancers (Sanyal et al., 2012, Li et al.,
2012a, Schoenfelder et al., 2015). The different types of chromatin interactions have been classified into
two categories of higher-order interaction clusters: “single-gene” complexes, which consist of interactions
between a single gene and one or more enhancers and “multi-gene” complexes, in which multiple genes
are involved in interactions with one or more enhancers. Interestingly, the genes in the “single-gene”
interaction complexes tend to be cell-type-specific (Li et al., 2012a). Surprisingly, some promoter
sequences in the multi-gene interaction complexes show enhancer capacity affecting the expression of
other linked genes (Li et al., 2012a).
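
The kind of comparison behind these statistics can be sketched as follows: for each enhancer, the promoters it contacts are compared with the nearest TSS on the linear genome. The coordinates, gene names, and function names below are hypothetical and serve only to illustrate the logic; the cited studies used their own fragment-level definitions.

```python
def nearest_gene(enhancer_pos, tss_by_gene):
    """Return the gene whose TSS is closest to the enhancer (linear distance in bp)."""
    return min(tss_by_gene, key=lambda gene: abs(tss_by_gene[gene] - enhancer_pos))


def summarize_links(enhancer_pos, linked_genes, tss_by_gene):
    """For each contacted gene, report its distance and whether it is the nearest TSS."""
    closest = nearest_gene(enhancer_pos, tss_by_gene)
    return {
        gene: {
            "distance_bp": abs(tss_by_gene[gene] - enhancer_pos),
            "is_nearest": gene == closest,
        }
        for gene in linked_genes
    }


# Hypothetical coordinates on one chromosome (not real annotations).
tss = {"geneA": 100_000, "geneB": 180_000, "geneC": 420_000}
enhancer = 150_000

# Hypothetical chromatin interaction data: the enhancer contacts geneA and geneC.
links = summarize_links(enhancer, ["geneA", "geneC"], tss)
print(nearest_gene(enhancer, tss))
print(links)
```

In this toy case the enhancer contacts two promoters, one of them 270 Kb away, while skipping the nearest TSS (geneB), which mirrors the one-to-many and nearest-gene-skipping patterns described above.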
As described above, the high-resolution analyses of chromatin interactions by 5C, Hi-C and
ChIA-PET have identified thousands of enhancer-promoter interactions. Although the rules governing
enhancer-promoter specificity are still not clear, some experiments have suggested that enhancers are
restricted to regulating promoters within specified chromatin boundaries. For example, low-resolution
analyses of 3D chromatin data have introduced two new concepts: “genome spatial
compartmentalization” and “topologically associated domains” (TADs); see Figure 1.3. The first concept,
based on chromatin interactions at low resolution (1 Mb) and histone modifications, divides the genome
into two compartments. Compartment A is characterized by gene-dense regions and has active chromatin
marks and compartment B is associated with repressive chromatin (van Berkum et al., 2010, Lieberman-
Aiden et al., 2009, Dekker et al., 2013); analyses suggest that most interactions occur within the same
compartments. A recent deeply sequenced in situ Hi-C experiment with 25 Kb resolution suggested that
there are also sub-compartments (~300 Kb in size) with distinct patterns of histone modifications, with
compartment A consisting of two sub-compartments and compartment B consisting of four sub-compartments
(Rao et al., 2014). The second concept, based on TADs (~1 Mb in size), comes from the
observation that regions are bounded by segments where the chromatin interactions end abruptly.
Although TADs are defined independently from compartments (Pope et al., 2014, Jin et al., 2013, Dekker
et al., 2013), several adjacent TADs can organize to create a compartment (Rao et al., 2014). Studies
suggest that most chromatin interactions occur between elements within a TAD, with many fewer
interactions occurring between elements from different TADs. TADs are highly conserved among species
(Dixon et al., 2012) and most TADs are invariable across cell types and developmental stages.
Interestingly, although the boundaries of the TADs are relatively constant, a TAD can switch between the
active compartment A and the repressive compartment B in different cell types (Phillips-Cremins et al.,
2013, Dekker et al., 2013, Pope et al., 2014, Vietri Rudan et al., 2015). If, as proposed, enhancer-
promoter interactions are constrained by the boundaries of the TADs, then altering these boundaries
should have critical impacts on gene expression. Early work showed that a shift in topological domain
boundaries accompanied expression changes of homeotic genes during mouse development (Noordermeer
et al., 2011) and deletion of a TAD boundary on the X-chromosome has been shown to result in novel
chromatin interactions and alteration of gene expression (Nora et al., 2012). It is also important to note
that the invariability of TADs across cell types does not conflict with the concept of cell-type-specific
enhancer-promoter interactions. This is because the specific enhancer-promoter interactions within a
TAD can vary from cell type to cell type (Li et al., 2012a, Dixon et al., 2012).
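
The idea that TAD boundaries constrain interactions can be expressed as a simple classification of each contact as intra-TAD or inter-TAD. This is a minimal sketch that assumes TADs are given as a sorted list of boundary coordinates on one chromosome; the boundary positions, contacts, and function names are hypothetical.

```python
from bisect import bisect_right


def tad_index(position, boundaries):
    """Return the index of the TAD that contains a position.

    boundaries: sorted boundary coordinates (bp) on one chromosome;
    TAD k covers boundaries[k - 1] <= position < boundaries[k].
    """
    return bisect_right(boundaries, position)


def classify_interaction(pos1, pos2, boundaries):
    """Label a contact as 'intra-TAD' or 'inter-TAD'."""
    same = tad_index(pos1, boundaries) == tad_index(pos2, boundaries)
    return "intra-TAD" if same else "inter-TAD"


# Hypothetical boundaries and contacts (not taken from any dataset).
boundaries = [1_000_000, 2_100_000, 3_000_000]
contacts = [(200_000, 900_000), (950_000, 1_050_000), (2_200_000, 2_900_000)]

labels = [classify_interaction(a, b, boundaries) for a, b in contacts]
print(labels)
```

The second toy contact spans only 100 Kb but crosses a boundary, whereas the other two span 700 Kb within a single TAD. Under the boundary-constraint model it is the short contact that would be disfavored, so linear distance alone is not the relevant variable.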
Figure 1.3 Chromatin 3D structures.
Shown is a two-dimensional heatmap of Hi-C interaction frequencies in IMR90 cells from a 5 Mb region
of Chr2 generated using the website http://www.3dgenome.org; the color key represents the interaction
counts between two loci. Highlighted in gray is a repressed compartment and highlighted in orange is an
active compartment. Also shown is ChIP-seq data for CTCF and histone modifications, as well as a
Wavelet-smoothed Repli-seq track representing DNA replication timing; all datasets were taken from the
University of California, Santa Cruz genome browser. For each compartment, shown is a model of
chromatin interactions (which are more frequent within a TAD than between TADs) facilitated by CTCF,
Cohesin and Mediator. Long distance constitutive interactions require a pair of CTCF sites with
convergently orientated motifs as anchors; any combination of CTCF, Cohesin and Mediator can facilitate
medium-distance interactions. Many other CTCF binding sites (green bars) are not involved in chromatin
interactions and occur within loops.
Three major components have been shown to contribute to the formation of the 3D chromatin
architecture: CCCTC-binding factor (CTCF), Cohesin, and Mediator (Zuin et al., 2014, Wendt et al.,
2008, Sanyal et al., 2012, Rao et al., 2014, Phillips-Cremins et al., 2013, Parelho et al., 2008, Handoko et
al., 2011); see Figure 1.3. CTCF is a site-specific DNA binding protein that has insulator capacity that
can interfere with enhancer-promoter communications and block heterochromatin spreading [reviewed in
(Ong and Corces, 2014)]; Cohesin is a protein complex important for the separation of sister chromatids
during mitosis and meiosis (Brooker and Berkowitz, 2014) and is also involved in gene regulation
(Losada, 2014); and Mediator is a large, multi-protein complex that functions as a transcriptional
coactivator. CTCF, Cohesin, and Mediator were shown to bind to >80% of the interaction anchors
identified in ES cells (Phillips-Cremins et al., 2013) and another study showed that these three
components are located at 86% of ~10 000 constitutive chromatin interactions in 9 cell lines (Rao et al.,
2014). These complexes appear to have overlapping, but not equivalent, roles in defining chromatin
looping. For example, CTCF alone or CTCF plus Cohesin is highly associated with constitutive long-
range interactions, whereas Mediator plus Cohesin complexes are more associated with proximal
enhancer-promoter interactions (Phillips-Cremins et al., 2013, Dowen et al., 2014, Ing-Simmons et al.,
2015, Vietri Rudan et al., 2015). Also, CTCF and components of the Cohesin complex are present at
most of the TAD boundaries (Nora et al., 2012, Dixon et al., 2012) and the CTCF motifs at these sites are
conserved across species (which may explain the invariance of TADs) (Vietri Rudan et al., 2015). Pairs of
CTCF motifs, which have an orientation because of the non-palindromic motif sequence
(5’-CCACNAGGTGGCAG-3’), involved in constitutive long-range interactions on the same chromosome
are positioned in a convergent manner on opposite strands of the DNA (Rao et al., 2014, Vietri Rudan et
al., 2015). Other evidence that suggests non-equivalent functions of CTCF vs. Cohesin comes from
depletion studies. For example, in HEK293 cells the depletion of CTCF results in a higher frequency of
inter-TAD interactions and fewer intra-TAD interactions, whereas reduction of Cohesin has no impact on
TAD structure but leads to a global loss of intra-TAD interactions (Zuin et al., 2014). Also, a conditional
deletion of a component of Cohesin in thymocytes weakens enhancer-promoter interactions, without
affecting the location or strength of the histone marks H3K27ac and H3K4me1 (Ing-Simmons et al.,
2015). Moreover, CTCF tends to bind to the boundaries of large enhancer regions, restraining the
Cohesin-anchored interactions within the domains (Ing-Simmons et al., 2015, Dowen et al., 2014).
Deletion of a CTCF site at one side of a large enhancer region caused alteration of expression of genes
within and nearby the enhancer regions (Dowen et al., 2014). These results indicate that CTCF plays an
important role in maintaining chromosomal structure. However, it is important to note that only a small
portion of CTCF binding sites reside at the boundaries of TADs or super-enhancer domains; rather, most
CTCF sites are located within TADs (Handoko et al., 2011, Cuddapah et al., 2009). Although there is
evidence that CTCF sites located at enhancer or promoter regions can facilitate enhancer-promoter
interactions (Handoko et al., 2011, Pena-Hernandez et al., 2015, Jager et al., 2015), 79% of long-distance
interactions between distal elements and promoters actually bypass one or more CTCF sites (Sanyal et al.,
2012), suggesting a situation more complex than a simple “insulator” or “bridge” model. Perhaps the
bypassed CTCF sites are involved in enhancer-promoter interactions that occur in other cell types or poise
cells for changes in physical interactions in response to developmental progression or external cues.
Taken together, these recent studies support a critical role of CTCF, Cohesin, and Mediator in organizing
chromatin interactions and gene regulation, but the details of the mechanisms by which they govern the
3D architecture of the genome are still unsolved.
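The convergent-orientation rule for CTCF motif pairs described above can be illustrated with a minimal sketch. It assumes exact matches to the consensus quoted in the text; real analyses typically use a position weight matrix and a score threshold. The function names and the labels "convergent", "divergent" and "tandem" are illustrative choices, not taken from the cited studies.

```python
import re

# Consensus quoted in the text; N matches any base (exact matching is a simplification).
CTCF_CONSENSUS = "CCACNAGGTGGCAG"
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def to_regex(consensus):
    """Turn a consensus with N wildcards into a compiled regular expression."""
    return re.compile(consensus.replace("N", "[ACGT]"))


FORWARD = to_regex(CTCF_CONSENSUS)
REVERSE = to_regex(reverse_complement(CTCF_CONSENSUS))


def motif_strand(anchor_seq):
    """Return '+', '-', or None depending on which strand carries the motif."""
    seq = anchor_seq.upper()
    if FORWARD.search(seq):
        return "+"
    if REVERSE.search(seq):
        return "-"
    return None


def classify_pair(left_anchor_seq, right_anchor_seq):
    """Classify two loop anchors (left = lower coordinate) by motif orientation."""
    strands = (motif_strand(left_anchor_seq), motif_strand(right_anchor_seq))
    if None in strands:
        return "no motif"
    if strands == ("+", "-"):
        return "convergent"  # motifs point toward each other
    if strands == ("-", "+"):
        return "divergent"   # motifs point away from each other
    return "tandem"          # both motifs on the same strand
```

Both anchor sequences are read on the forward strand of the same chromosome, so a "+" motif at the left anchor and a "-" motif at the right anchor point into the loop.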
The chromatin interactions identified by the 3C-based technologies described above provide
evidence that distal elements can physically interact with specific promoter regions, allowing the
prediction of potential target genes. However, an interaction between a distal element and a promoter
does not guarantee that the distal element is actually involved in regulating expression of the linked gene.
For example, chromatin interactions in IMR90 cells show very few changes upon treatment with TNF-α,
even though large changes in gene expression occur (Jin et al., 2013). This suggests either that enhancers
are looped to the target promoters even before the genes are activated and/or that many chromatin
interactions are not related to gene expression. Other potential non-functional interactions identified by
3C-based methods, such as random collision in the nucleus or within a defined topologically associated
domain, have been recently discussed (Dekker et al., 2013). It is likely that the chromatin interaction
profile in a given cell type is composed of several different types of interactions, some involved in
maintaining overall nuclear structure, some involved in gene regulation, and some representing stable, but
non-functional, loops. Importantly, it is also possible that some enhancers regulate their target gene via a
mechanism distinct from looping. Thus, 3D studies may not provide a definitive identification of the set
of target genes of an enhancer. However, computational tools have been developed to predict enhancer-
gene linkages based on the association of target gene expression with dynamic enhancer activities (as
defined by sequence changes in the population or differences in epigenetic marks); some of these tools
integrate multiple layers of genetic and epigenetic information to improve the accuracy of prediction.
These computational tools, which provide alternative methods for understanding enhancer function, are
described in the next section.
1.3 Identifying target genes of enhancers using computational analyses to link
altered enhancer structure or activity to specific gene expression patterns
1.3.1 Predicting target genes based on changes in enhancer sequence
Expression quantitative trait locus (eQTL) mapping uses the correlation between
genetic polymorphism and variation of gene expression across many different individuals to identify
genomic loci that influence gene expression [reviewed in (Westra and Franke, 2014)]. This method
requires matched SNP information and gene expression patterns from multiple individuals (Figure 1.4a).
To date, eQTL studies have been performed for multiple cell types and tissues such as fibroblasts, liver,
lung, brain, muscle, adipose tissue, skin, whole blood, specific blood cell types (B-cells, monocytes and
T-cells) and lymphoblastoid cell-lines (Castaldi et al., 2015, Ramasamy et al., 2014, Nica et al., 2011,
Yang et al., 2010, GTExConsortium, 2015); see Table 1.2 for a list of eQTL databases. A comparison of
eQTL results using different cell or tissue types suggests that SNPs can influence the expression of
different genes in different cell types and can even have opposite effects on a given gene in different cell
types (Fu et al., 2012, Fairfax et al., 2012, Francesconi and Lehner, 2014). Approximately 14 million
validated SNPs have been identified in human populations. Although modern genotyping arrays only
contain features for ~1 million or so SNPs, they capture most of the common SNPs through linkage
disequilibrium-based SNP imputation (Howie et al., 2012, Porcu et al., 2013).
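The core association step of an eQTL analysis can be sketched as a simple linear regression of expression on genotype dosage. The sketch assumes genotypes are coded as the number of alternate alleles (0, 1 or 2) and leaves out covariates, normalization and multiple-testing correction, all of which a real analysis would need. The function name is hypothetical.

```python
import math


def eqtl_association(dosages, expression):
    """Regress expression on allele dosage for one SNP-gene pair.

    dosages:    alternate-allele counts (0, 1 or 2), one per individual
    expression: expression values of the candidate gene, same order
    Returns (slope, pearson_r); the slope is the per-allele effect size.
    """
    if len(dosages) != len(expression):
        raise ValueError("dosages and expression must have the same length")
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in genotype or expression")
    return sxy / sxx, sxy / math.sqrt(sxx * syy)
```

A positive slope means that each additional copy of the alternate allele is associated with higher expression; repeating the calculation in another tissue can yield a different or even opposite slope, which corresponds to the cell type-specific effects noted above.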
Figure 1.4 Computational methods to link enhancers to putative target genes.
A) The eQTL method uses the association between genotypes of a SNP within an enhancer and gene
expression levels across multiple individuals to predict target genes. B) A correlation between dynamic
enhancer activity and gene expression across multiple cell lines or tissues can be used to predict enhancer-
gene linkages. Levels of H3K4me1, H3K27ac and DHS show positive correlations with enhancer
activities; each color represents data from an individual cell line or tissue. C) Similar to panel B, DNA
methylation data can also be used to predict target genes; in this case, one expects a negative correlation
between DNA methylation at enhancers and gene expression. D) Integrating multiple layers of
information can help to predict a target gene; e.g. the gene with a higher score for phylogenetic proximity
and similar GO terms with the TF (indicating that the TF and target gene are in a same pathway) is
predicted to be the putative target gene.
Table 1.2 Available databases for eQTL
Data source | Cell type | Website | PMID
Genevar (GENe Expression VARiation) | adipose, LCL, skin, fibroblast and T-cell | https://www.sanger.ac.uk/resources/software/genevar/ | 20702402
GTExPortal | Blood, esophagus mucosa, esophagus muscularis, heart, lung, muscle skeletal, nerve tibial, skin, stomach, thyroid, adipose and artery | http://www.gtexportal.org/home/ | 25954001
MuTHER | adipose, LCL, skin | http://www.muther.ac.uk/Data.html | 21304890
UKBEC | Brain | http://www.braineac.org/ | 25174004
While some eQTL studies have analyzed SNP-expression associations genome-wide, this
approach requires expression profiles from a large number of individuals in order to attain statistical
significance after adjusting for the large number of hypotheses tested. A popular approach has been to
constrain eQTL analyses to genetic loci that have already been implicated in human diseases or traits in
Genome-Wide Association Studies (GWAS). Within the last decade, it has become clear that the majority
of disease-linked SNPs are outside of gene coding regions and likely represent variation within regulatory
elements (Freedman et al., 2011), suggesting that the SNP allele should co-vary with expression of a
nearby target gene. In support of this theory, enhancers and other regulatory elements mapped
experimentally by the ENCODE and Roadmap Consortia have consistently been found to be enriched for
disease-associated SNPs identified in GWAS studies (ENCODE_Project_Consortium, 2012,
RoadmapEpigenomicsConsortium, 2015). Therefore, after identifying an “index” SNP associated with a
particular disease, investigators have analyzed the expression of all genes within a certain region to try to
identify a gene whose expression co-varies with the SNP allele across multiple individuals. To gain
insights into risk for breast cancer, Li et al. conducted an eQTL-based analysis of 15 breast cancer risk
SNPs, integrating multi-level information such as copy number variation, promoter methylation, and
enhancer annotations from TCGA and ENCODE (Li et al., 2013a). Using a similar method, expression
of the TMED6 gene was linked to an enhancer, located 600 Kb away, which harbors three colon cancer-
associated SNPs (Yao et al., 2014). Similar studies have combined epigenomic enhancer data with eQTL
mapping to link particular enhancers to other diseases, such as chronic obstructive pulmonary disease (Castaldi et al.,
2015), asthma (Sharma et al., 2014), prostate cancer (Hazelett et al., 2014), and schizophrenia (Roussos et
al., 2014).
A working model in the field is that the disease-associated SNPs located within enhancers affect
enhancer function by disrupting or improving TF binding motifs, thus causing changes in expression of
target genes. This model has been investigated for certain important risk alleles, such as the risk variant
for both colon and prostate cancer rs6983267 that affects expression of the MYC oncogene via altered
transcription factor binding (Pomerantz et al., 2009). Other studies (Hazelett et al., 2014, Li et al., 2013a,
Gjoneska et al., 2015, Yao et al., 2014) have used TF binding motif prediction within eQTL-linked
enhancers to generate hypotheses that can be tested experimentally. Li et al. combined eQTL mapping,
epigenomic enhancer maps, and TF motif prediction in an innovative way to understand how risk variants
might affect entire transcriptional networks. In addition to predicting direct cis-interactions, eQTL can
predict putative indirect trans-interactions between a risk locus and distant loci in the genome. For
example, Li et al. identified a breast cancer risk SNP within 1 Mb of the gene for the ESR1 transcription
factor that affected expression of 476 different genes having nearby putative enhancers containing an
ESR1 binding motif. This approach represents an integrated method for correlating genome-wide
enhancer analyses with modifications in expression of upstream transcription factors to understand the
role of transcriptional networks in disease.
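The working model above, in which a SNP disrupts or improves a TF binding motif, is commonly tested by scoring both alleles against a position weight matrix. The following is a minimal sketch of that idea. The 4-bp matrix is a made-up toy example, not a real TF motif, and the function names are hypothetical; only the forward strand is scanned.

```python
import math

# Hypothetical 4-bp position probability matrix (illustration only, not a real TF motif).
TOY_PPM = [
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]


def log_odds(window, ppm, background=0.25):
    """Log2-odds score of one window against a uniform background."""
    return sum(math.log2(col[base] / background) for col, base in zip(ppm, window))


def best_score(seq, ppm):
    """Best forward-strand score over all windows of the motif width."""
    width = len(ppm)
    if len(seq) < width:
        raise ValueError("sequence shorter than motif")
    return max(log_odds(seq[i:i + width], ppm) for i in range(len(seq) - width + 1))


def allele_effect(left_flank, ref, alt, right_flank, ppm):
    """Return (ref_score, alt_score, alt_score - ref_score) for a single SNP."""
    ref_score = best_score(left_flank + ref + right_flank, ppm)
    alt_score = best_score(left_flank + alt + right_flank, ppm)
    return ref_score, alt_score, alt_score - ref_score
```

A negative difference suggests that the alternate allele weakens the predicted site and a positive difference that it strengthens it; either way the result is a hypothesis to be tested experimentally, as the text notes.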
Although eQTL mapping has produced lists of putative target genes for specific enhancers, the
results should be interpreted with caution. First, although a large sample size is required to generate
robust linkage predictions, many studies have been performed using small sample sizes and thus may
include false positives. Second, associations between genomic loci and gene expression predicted by
eQTL represent a mixture of direct (i.e. cis) and indirect (i.e. trans) regulation, and these are often not
easy to distinguish due to the large distances that can separate enhancers and their gene targets. However,
allele-specific expression linked to a heterozygous SNP can help to identify direct targets (Dixon et al.,
2015, Crowley et al., 2015). Third, because most enhancers are cell type-specific, eQTL mapping should
be performed using the cell type or tissue most relevant for a particular disease. Finally, it is important to
note that the SNPs identified by array-based GWAS studies may not be the causal SNP; all SNPs in LD
with the GWAS-linked SNP should be considered, along with fine mapping or a whole genome
sequencing-based approach.
1.3.2 Predicting target genes based on changes in enhancer activity
Epigenomic mapping techniques make it possible to correlate enhancer activity, via changes in
DNase I hypersensitivity, histone modifications, or DNA methylation levels, with target gene expression
across different tissues or cellular conditions; see Table 1.3 for a summary of the computational tools
used for these correlative analyses. A change in the nuclear level of a TF is perhaps the most
straightforward mechanism by which an enhancer activity can change. For example, upon hormone
stimulation, estrogen receptors (ERs) can translocate to the nucleus, bind to their motifs in enhancer
regions, cause changes in histone modifications, and stimulate expression of target genes (Levin, 2005,
Vrtacnik et al., 2014). Several methods have been developed to link target genes to regulatory TFs using
dynamic binding patterns of TFs; in general, TF binding is measured by ChIP-chip or ChIP-seq and gene
expression profiles are measured using microarrays or RNA-seq under conditions of differential
expression of the relevant TF. Although yeast are not considered to have distal enhancers per se,
regulation of RNA Polymerase II initiation by sequence-specific transcription factors is highly conserved,
and thus S. cerevisiae has been an important model system for developing these approaches. For example,
Gao et al. used a multivariate regression model to identify correlations between TF activity and gene
expression across various conditions in S. cerevisiae to produce lists of putative target genes of TFs (Gao
et al., 2004). Over the last decade, a number of different approaches have been developed to correlate
model organism or human TF ChIP-chip and ChIP-seq data with gene expression; such methods include
partial least squares (PLS) regression (Boulesteix and Strimmer, 2005), a genetic regulatory modules
(GRAM) algorithm (Bar-Joseph et al., 2003), a probabilistic model termed target identification from
profiles (TIP) (Cheng et al., 2011), support vector machines (SVMs) (Qian et al., 2003), a linear
activation model-based method (Honkela et al., 2010), and an unsupervised machine learning approach with an expectation
maximization algorithm (EMBER) (Maienschein-Cline et al., 2012). All of these methods are based on
positive correlations of TF activity and target gene expression, ignoring the important fact that TFs can
also repress target gene expression. However, one method, called binding and expression target analysis
(BETA), does include a step to evaluate if the effects of a given TF are activating or repressing (Wang et
al., 2013f). Also, BETA considers the distance between TF binding sites and the putative target gene, as
well as the location of conserved CTCF binding sites (which may delineate the boundaries of TADs, as
described above), when predicting the targets of TFs. Overall, these methods are useful to predict target
genes of TFs if one has available ChIP-seq and expression profiles performed in different conditions
(such as knockdown or overexpression of a TF or activation of a pathway). However, these methods,
which cannot distinguish direct from indirect effects, only produce a list of putative target genes for a
particular TF without pinpointing individual enhancer-gene linkages. Comparing the results of these
algorithms to direct interaction mapping approaches, such as the 3C methods described above, will likely
allow for improvements in predicting direct vs. indirect targets; however, this is an emerging area that has
not been well-explored.
Table 1.3 Summary of computational methods
Input data | Model | Distance range to enhancer | Statistics | Tissue type | List of enhancer-gene linkages | PMID
H3K27ac, H3K4me1, RNA-seq | Gene expression vs H3K27ac and H3K4me1 | 5 Kb-125 Kb around TSS | Machine learning using logistic regression classifier | 9 human cell lines | NA | 21441907
H3K4me1, RNAPII | RNAPII vs H3K4me1 | NA | Spearman correlation | 19 mouse tissues | supplementary table 7 in PMID:22763441 | 22763441
H3K4me1 and RNA-seq | Gene expression vs H3K4me1 (PreSTIGE) | within 100 Kb of TSS and CTCF boundary | Shannon entropy | 12 human cell lines | http://genetics.case.edu/prestige/ | 24196873
H3K4me1 and RNA-seq | Gene expression vs H3K4me1 (PreSTIGE) | within 100 Kb of TSS and CTCF boundary | Shannon entropy | 13 mouse tissues | http://genetics.case.edu/prestige/ | 24905156
DNase I | DHS at promoters vs enhancer | within 500 Kb around TSS | Spearman correlation | 79 human cell lines | ftp://ftp.ebi.ac.uk/pub/databases/ensembl/encode/integration_data_jan2011/byDataType/openchrom/jan2011/dhs_gene_connectivity/ | 22955617
DNase I and RNA-seq | Gene expression vs DHS | within 100 Kb of TSS | Pearson correlation | 72 human cell lines | http://dnase.genome.duke.edu | 23482648
DNA methylation and RNA-seq | Gene expression vs DNA methylation | within 1 Mb around TSS | Machine learning using SVM-MAP | 58 human cell lines | NA | 23497655
DNA methylation and RNA-seq | Gene expression vs DNA methylation (ELMER) | upstream 10 genes and downstream 10 genes | Mann-Whitney U test | ~2000 human primary tumor samples from TCGA | supplementary table 4 in PMID:25994056 | 25994056
CAGE | Gene expression vs eRNA level | within 500 Kb around TSS | Pearson correlation | ~400 human cell lines | http://enhancer.binf.ku.dk/presets/ | 24670763
H3K4me1, H3K27ac, H3Kme3, RNA-seq | Multi-layer genetic and epigenetic information | within 2 Mb of TSS | Machine learning using random forest classifier | 12 human cell lines | www.healthcare.uiowa.edu/labs/tan/EP_predictions.xlsx. | 24821768
Recent genome-wide studies from the ENCODE Project and the Roadmap Epigenome Mapping
Consortium have confirmed that DHSs and certain histone modifications are correlated with the binding
of transcription factors and the activity state of enhancers (Figure 1.4b). Ernst et al. developed a method
that combines multiple histone marks (including H3K27ac and H3K4me1) into chromatin state signals for
9 ENCODE human cell types. Correlation of enhancer activity states from this method and gene
expression in the same cell types identified putative target genes within a 5 Kb-125 Kb range (Ernst et al.,
2011). Based on a similar principle, Shen et al. used the signal intensity of H3K4me1 and RNA
Polymerase II ChIP-seq data, representing enhancer activity and gene expression, respectively, across 19
mouse tissues and cell lines to calculate a Spearman correlation coefficient (SSC) between nearby
elements. Enhancers and gene elements were clustered into enhancer-promoter units (EPUs) based on the
SSC along each chromosome. On average, 5.6 enhancers were linked to each promoter using this method;
multiple Hi-C and 3C experiments have verified the enhancer-gene association linkages identified by the
EPU method (Shen et al., 2012). To meet the need for publicly available tools to perform similar
analyses, a software package called predicting specific tissue interaction of genes and enhancers
(PreSTIGE) has been developed, which predicts enhancer-gene linkages using H3K4me1 or H3K27ac
and RNA-seq data (Corradin et al., 2014, Van Bortle and Corces, 2014). Several groups have begun to
use DNase I signal intensity to represent enhancer activity. For example, Thurman et al. calculated the
SSC between the DHS state at each TSS and all distal DHSs located within 500 Kb of that TSS and
separated from the TSS by at least one other DHS. This analysis, performed with 79 diverse cell types,
identified 578 905 DHSs that have intensities that are highly correlated with at least one promoter DHS
signal intensity; importantly, these DHS-promoter pairs are significantly overrepresented in interactions
identified by 5C and ChIA-PET (Thurman et al., 2012). Instead of correlating the DHS signals of TSSs
and enhancers, another study used gene expression levels to calculate Pearson correlations with DHSs
located within 100 Kb of each gene across 72 cell types, identifying 530 000 DHSs that have activities
significantly correlated with at least one gene (Sheffield et al., 2013). These correlation-based analyses
provide approaches to predict individual putative enhancer-gene linkages on a genome-wide scale. As
with other correlation-based methods, distinguishing direct vs. indirect linkages remains a challenge, and
the distance-based rules used to date have been relatively ad hoc.
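The correlation-based linkage idea described above can be sketched as follows: for one promoter, compute the Spearman correlation between its signal and the signal of each distal element across cell types, keeping only elements inside a distance window. The 500 Kb window follows the description of the Thurman et al. analysis in the text; the correlation cutoff of 0.7, the data layout and the function names are assumptions made for illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def link_enhancers(tss_pos, promoter_signal, enhancers,
                   max_dist=500_000, min_rho=0.7):
    """Return (name, rho) for distal elements correlated with one promoter.

    enhancers: list of (name, position, signal across the same cell types).
    min_rho is an assumed cutoff chosen for illustration.
    """
    links = []
    for name, pos, signal in enhancers:
        if abs(pos - tss_pos) > max_dist:
            continue  # outside the distance window
        rho = spearman(signal, promoter_signal)
        if rho >= min_rho:
            links.append((name, rho))
    return links
```

Because the score is a correlation across cell types, it cannot by itself separate direct from indirect linkages, which is the limitation noted in the paragraph above.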
In addition to specific histone modifications and the presence of DHSs, the level of 5-methylcytosine
at CpG dinucleotide sites is another epigenetic mark that can be used to identify
enhancers. In the majority of human cell types, 70%-80% of all CpG sites are methylated. However, short
CpG-rich regions called CpG islands (CGIs), which occur primarily at promoters, remain unmethylated in
somatic cells [reviewed in (Jones, 2012)]. Early studies of DNA methylation mainly focused on the CGIs
located in promoter regions, at which DNA hypermethylation was shown to correlate with transcriptional
repression. Studies of individual enhancers reported that active enhancers have low levels of DNA
methylation (Thomassin et al., 2001). However, an understanding of the relationship between DNA
methylation and enhancer activity was limited until unbiased genome-wide DNA methylation analyses
using whole-genome bisulfite sequencing (WGBS) technology were performed. The genome-wide studies
revealed that low levels of DNA methylation in distal regions could be used to identify enhancers. For
example, a genomic DNA methylation pattern analysis of mouse ES cells and neuronal progenitors (NPs)
identified low-methylated regions (LMRs), which are non-promoter CpG-poor regions that have an
average of less than 30% methylation. Integrative analyses strongly suggest that these LMRs are
enhancers because they are DNase I hypersensitive, have chromatin marks associated with active
enhancers, are occupied by TFs, and are associated with expression of nearby genes (Stadler et al., 2011).
Another WGBS study showed that 90% of the regions at which the methylation state changed from
methylated in normal colon to unmethylated in colon cancer overlapped with known enhancers (Berman
et al., 2012). Additionally, a comparison of methylomes for 30 distinct human tissues or cell lines showed
that over 26% of the regions displaying dynamic changes in DNA methylation are enhancers occupied by
cell type-specific TFs; in contrast, only 3% of the regions correspond to promoters (Ziller et al., 2013).
This cell type-specific demethylation at enhancers was confirmed by studies from the ENCODE and
REMC projects (RoadmapEpigenomicsConsortium, 2015, Thurman et al., 2012). Taken together, these
studies demonstrate that the level of DNA methylation at enhancers negatively correlates with enhancer
activity; thus, DNA methylation can be used to predict enhancer-gene linkages (Figure 1.4c).
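As a rough illustration of how low methylation can flag candidate enhancers, the sketch below scans per-CpG methylation fractions for runs of consecutive CpGs that each fall below 30%. This is a deliberate simplification: it is not the segmentation procedure used by Stadler et al., it does not exclude promoters or CpG islands, and the minimum run length and function names are assumptions.

```python
def summarize(run):
    """Return (start, end, mean methylation) for a run of (position, fraction) pairs."""
    mean = sum(fraction for _, fraction in run) / len(run)
    return run[0][0], run[-1][0], mean


def low_methylated_runs(cpgs, threshold=0.30, min_cpgs=3):
    """Find runs of consecutive CpGs whose methylation is below the threshold.

    cpgs: list of (position, methylation fraction in [0, 1]) sorted by position.
    min_cpgs is an assumed minimum run length, used here to suppress single-site noise.
    """
    runs, current = [], []
    for position, fraction in cpgs:
        if fraction < threshold:
            current.append((position, fraction))
            continue
        if len(current) >= min_cpgs:
            runs.append(summarize(current))
        current = []
    if len(current) >= min_cpgs:
        runs.append(summarize(current))
    return runs
```

Candidate regions found this way would still need to be intersected with promoter annotations and with chromatin marks before being called enhancers, in line with the integrative analyses described above.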
Aran et al. used machine learning to study correlations between DNA methylation at enhancers
and gene expression across 58 different cell lines. Strikingly, their results showed that the level of DNA
methylation at an enhancer closely anti-correlates with putative target gene expression. Importantly, 53%
of the enhancer-gene linkage predictions having a high score (> 0.85) were validated by 5C experiments
in three cell lines (Aran et al., 2013). Starting from interactions identified by ChIA-PET in MCF7 breast
cancer cells, Aran et al. also employed anti-correlations (evaluated by Pearson correlation coefficients)
between the DNA methylation at enhancers and the expression level of genes physically in contact with
the enhancers to identify functional enhancer-gene interactions in breast cancer cases from The Cancer
Genome Atlas (TCGA). Their results suggested that enhancer regions associated with gene expression are
enriched with cancer-associated risk loci from GWAS and that DNA methylation at enhancers can predict
gene expression better than can promoter methylation (Aran and Hellman, 2013). Recently, another
computational tool called ELMER was developed to integrate DNA methylation and gene expression
profiles from primary tissues, to systematically infer multi-level cis-regulatory networks (Yao et al.,
2015). Using ELMER, investigators can identify disease-specific enhancers, which are then linked to
putative target genes using a non-parametric statistical model to evaluate the significance of anti-
correlation between the DNA methylation at the enhancer and the expression of putative target genes.
ELMER also identifies upstream regulatory TFs that drive the changes in enhancer activity using motif
analysis and TF expression profiles. Applying ELMER to TCGA data for 10 distinct cancer types, Yao et
al. derived a list of 4280 putative enhancer-TF-gene linkages. A comparison of ChIA-PET enhancer-
promoter interactions identified using MCF7 cells (Li et al., 2012a) with the ELMER predictions in breast
cancer confirmed that 166 of the 2038 enhancer-target gene pairs had a physical interaction between the
enhancer and the predicted target gene promoter. It should be noted that most studies of expression-
associated DNA methylation at enhancers have been limited to the small portion of enhancers that are
represented on DNA methylation arrays (typically the Illumina Infinium HM450 array) or that can be
analyzed using reduced representation bisulfite sequencing (RRBS); both types of assays cover the
majority of promoter regions but are limited in the number of enhancer loci that can be analyzed. Early
studies using WGBS to study primary disease tissues have been able to predict enhancers corresponding
to disease-specific transcription factor motifs (Berman et al., 2012, Hovestadt et al., 2014), suggesting
that the approaches described above will be applicable to future normal vs. disease tissue studies and will
enable the discovery of additional in vivo putative enhancer-gene linkages.
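The non-parametric anti-correlation test mentioned above can be sketched as a comparison of gene expression between the samples with the lowest and the highest methylation at an enhancer probe, using a Mann-Whitney U test. This is an illustration of the general idea, not the ELMER implementation: the 20% group fraction, the normal approximation without tie correction, and all function names are assumptions.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def mann_whitney_greater(group_a, group_b):
    """One-sided test that group_a tends to exceed group_b.

    Returns (U, p) using a normal approximation without tie correction.
    """
    n1, n2 = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, 0.5 * math.erfc(z / math.sqrt(2))


def score_enhancer_gene_pair(methylation, expression, fraction=0.2):
    """Compare expression in the least vs. most methylated samples.

    methylation, expression: per-sample values in the same order.
    fraction: assumed share of samples placed in each extreme group.
    """
    n = len(methylation)
    k = max(1, int(n * fraction))
    order = sorted(range(n), key=lambda i: methylation[i])
    low_meth = [expression[i] for i in order[:k]]
    high_meth = [expression[i] for i in order[-k:]]
    return mann_whitney_greater(low_meth, high_meth)
```

A small p-value indicates that expression is higher in the hypomethylated group, that is, the anti-correlation expected for an active enhancer and its target gene; in practice the test would be repeated for each nearby gene and the p-values corrected for multiple testing.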
Instead of simply using the correlation between enhancer activity and gene expression, other
studies have integrated multiple layers of genetic and epigenetic data to predict regulatory networks
(Figure 1.4d). For example, studies have included information such as P300 ChIP-seq data, Gene
Ontology (GO) similarities between the TFs and putative target genes, phylogenetic similarity, and
genomic proximity to predict target genes within 2 Mb intervals centered at enhancers (Rodelsperger et
al., 2011, He et al., 2014). Another study started with eQTL mapping data to predict target genes of
enhancers and then integrated the location of insulators as enhancer-blocking elements, TF co-occurrence,
DHSs, and GO similarity terms between the TFs binding to enhancers and nearby genes to validate the
eQTL results (Wang et al., 2013a). Lu et al. combined chromatin interaction data from Hi-C experiments
and phylogenetic correlations across 45 vertebrate species to predict enhancer-gene linkages (Lu et al.,
2013).
The above computational methods all have advantages and disadvantages in terms of
understanding enhancer-gene regulatory networks. One obvious advantage is that most of these
approaches are relatively inexpensive. Methods based on dynamic TF binding can provide a list of
putative target genes for a particular TF using only a few ChIP-seq and RNA-seq experiments. The efforts
of big consortia such as ENCODE, REMC and TCGA have generated genetic and epigenetic profiles for
various cell lines and normal or diseased tissues (ENCODE_Project_Consortium, 2012,
RoadmapEpigenomicsConsortium, 2015, Weinstein et al., 2013) and investigators have made public
multiple methods using eQTL, DHS, histone modification, eRNA and DNA methylation to predict
individual enhancer-gene linkages (Li et al., 2013a, Sheffield et al., 2013, Thurman et al., 2012, Shen et
al., 2012, Ernst et al., 2011, Andersson et al., 2014, Aran et al., 2013, Corradin et al., 2014, Yao et al.,
2015). Because Hi-C experiments are fairly expensive and computationally time consuming, there is
currently a limited number of cell lines for which comprehensive chromatin interaction data is available;
however, it is anticipated that these new technologies will be integrated into the collection of datasets by
existing consortia and other groups, such as those funded by NIH’s new 4D nucleome project
(https://commonfund.nih.gov/4Dnucleome/index) and the proposed International Nucleome Project
(Tashiro and Lanctot, 2015). One common disadvantage inherent to all of these association methods is
that they provide only predictions of putative target genes. More importantly, the predictions from these
methods are limited to cis regulation and the distances allowed between enhancers and putative target
genes are limited to 100 Kb to 2 Mb, depending on the method. As these methods are based on statistical
associations, the limitations are largely due to the limited number of samples available in current datasets
and can in principle be overcome by profiling large numbers of samples from diverse tissues and
individuals. Nevertheless, experimental validation of enhancer-gene pairs is essential to evaluate and
improve the accuracy of the various prediction methods. To date, relatively few enhancer-gene pairs have
been experimentally validated, but this is changing as new technologies transform our ability to test
enhancer activity and predictions of enhancer-gene pairs using high throughput and efficient techniques.
1.4 Identifying target genes of enhancers using experimental validation
1.4.1 Using reporter assays to monitor enhancer activity
A common tool to study gene regulation is the reporter assay, which is based on expression of
certain genes whose activity (monitored as either RNA or protein) is easily identified and measured.
Examples of such reporters include luciferase, which is an enzyme catalyzing a reaction with luciferin to
produce light; green fluorescent protein (GFP); LacZ, which encodes an enzyme that turns X-gal blue; and
antibiotic-resistance genes such as the neomycin resistance gene and chloramphenicol acetyltransferase (CAT). In a reporter
assay, a regulatory sequence (such as an enhancer of interest) is cloned adjacent to the reporter gene in a
plasmid that will be transfected either transiently or stably into cell lines, animals, bacteria or plants with
the function of the regulatory element being monitored as changes in levels or activity of the reporter
RNA or protein; transient transfection is often used for human cell lines whereas stable integration is used
to create transgenic model organisms such as fruitflies or mice (Figure 1.5a,b).
Reporter constructs have classically been used to validate enhancers on a one-by-one basis, but
the simplicity of transient transfection, combined with the massive throughput of current sequencing
techniques, has recently allowed adaptation of the method for high-throughput multiplexed reporter readout
(Figure 1.5c). These methods begin with a plasmid reporter library containing thousands of different
putative regulatory elements, with different methods employing different designs for the reporter
construct. In the self-transcribing active regulatory regions sequencing (STARR-seq) method, genomic
fragments are cloned downstream of a TSS driving an open reading frame such that activity of the
enhancer results in increased levels of RNAs encoding the enhancer, as detected by RNA-seq (Arnold et
al., 2013, Arnold et al., 2014, Shlyueva et al., 2014). CapStarr-seq couples the standard STARR-seq
protocol with a method by which enhancers of interest are captured by array (Vanhille et al., 2015). In the
Massively Parallel Reporter Assay (MPRA), each reporter construct has a putative enhancer followed by
a reporter gene such as GFP with an identifying sequence tag added downstream of the open reading
frame of the reporter gene. Deep sequencing of the mRNA produced from each reporter plasmid in cells
transfected with the construct library is then performed to infer corresponding enhancer activity
(Patwardhan et al., 2012, Kheradpour et al., 2013, Smith et al., 2013, Melnikov et al., 2014). While these
high-throughput methods make valuable contributions to validating putative enhancers, they still suffer
from the fact that a) the reporter plasmids do not represent in vivo chromatin conditions, b) the reporter
genes have characteristics that may influence the results (Arnone et al., 2004), c) they do not take into
consideration the possibility that enhancers may only regulate promoters with specific characteristics not
represented in the minimal promoter used in the reporter plasmid, and d) the cell line used for the study
may lack certain characteristics (e.g. specific TFs) required for activity of a particular enhancer.
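The MPRA readout described above can be illustrated with a minimal sketch: the RNA count for each tag is compared with the count of the same tag in the starting plasmid library, and the log ratio serves as a proxy for enhancer activity. The function name, the counts-per-million scaling, and the pseudocount are illustrative assumptions and are not taken from any of the cited studies.

```python
import math

def mpra_activity(rna_counts, dna_counts, pseudocount=1.0):
    """Return log2(RNA/DNA) per tag after scaling each library to counts per million.

    rna_counts, dna_counts: dicts mapping tag -> raw read count.
    pseudocount is a hypothetical smoothing choice so that tags with zero
    RNA reads still receive a finite (strongly negative) score.
    """
    rna_total = sum(rna_counts.values())
    dna_total = sum(dna_counts.values())
    activity = {}
    for tag, dna in dna_counts.items():
        rna = rna_counts.get(tag, 0)
        rna_cpm = (rna + pseudocount) / rna_total * 1e6
        dna_cpm = (dna + pseudocount) / dna_total * 1e6
        activity[tag] = math.log2(rna_cpm / dna_cpm)
    return activity
```

Tags with a positive score are over-represented in the RNA relative to the plasmid library. The sketch assumes both libraries contain at least one read and does not model replicates or statistical significance.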
As noted above, studying enhancers identified by chromatin structure using a transient reporter
assay may not reproduce the normal in vivo activity of the enhancer because the plasmid lacks chromatin
structure and modifications. Thus, stable transfection assays or in vivo transgenic enhancer assays are
better models because the enhancer is in a chromatinized state. In addition, transgenic assays allow one
to study cell type-specific and development stage-specific enhancer activity (Visel et al., 2009a, Visel et
al., 2009c, Attanasio et al., 2013). Importantly, the spatiotemporal embryonic patterns of enhancer
reporter constructs allow high-confidence pairing of enhancers with target genes (Kvon et al., 2014,
Pfeiffer et al., 2008). The VISTA Enhancer Browser is a central repository of experimental validations
analyzing human and mouse putative enhancers in transgenic mice; to date, approximately half of the
tested elements have shown enhancer activity. Unfortunately, transgenic mouse assays are not high
throughput. However, highly multiplexed libraries of enhancer reporter constructs have been combined
with FACS flow-sorting and next-generation sequencing in a fly transgenic reporter assay called
enhancer-FACS-seq (eFS). For eFS, the reporter construct contains a putative enhancer followed by a
reporter gene such as GFP (Figure 1.5c). Cell sorting is used to select cells expressing the reporter gene
(GFP) in a specific tissue (identified using a separate tissue-specific reporter gene). Then, the enhancer
elements that are active in the selected cells are identified by sequencing (Gisselbrecht et al., 2013). In
vivo reporter constructs do have fewer caveats than transient assays and have been instrumental in
understanding the function of enhancers. However, they still have several drawbacks. As noted above, the
endogenous function of enhancers often involves looping at long distances between chromosomal
domains in the 3D space of the nucleus and this situation is not well reproduced by stably integrated
reporter constructs in which the enhancer and promoter are co-located within a very short genomic
distance. Reporter constructs can also not reproduce other interactions present in the native chromosomal
context, such as the location of the enhancer within a particular TAD.
Figure 1.5 Experimental strategies to study enhancer activity.
A) In transient transfection assays, the enhancer (orange) is placed upstream of a reporter gene (green)
driven by a heterologous promoter (brown) in a plasmid backbone and then the plasmid is transiently
transfected into cells. The activity of the enhancer is monitored by the level of reporter RNA or protein.
B) In a transgenic assay, a plasmid containing the enhancer and reporter gene is microinjected into a
mouse egg and then integrated into the mouse genome. Enhancer activity is monitored in the embryo
using LacZ staining. C) High-throughput enhancer assays can be used to test enhancer activity. In
STARR-seq, potential regulatory elements are inserted between an ORF and a polyA tail and plasmids are
transfected into cells; elements that can be detected in the RNA-seq data are functional enhancers. In the
massively parallel reporter assay (MPRA), sequence synthesis technology is used to link each potential
regulatory element to a unique tag sequence. Then, an open reading frame (ORF) is inserted between the
element and tag sequence to form plasmids that are transfected into cells. After performing RNA-seq, the
enrichment ratio between tag counts in the starting library and in the RNA-seq data is used to identify
functional enhancers. In the enhancer-FACS-seq (eFS) method, a pool of potential regulatory elements is
cloned upstream of the GFP reporter gene. The plasmids are injected into fly embryos and GFP fly lines
are created which are crossed with a fly line that expresses CD2 under control of the tissue-specific
enhancer Twi. Embryos from the cross are dissociated and fluorescence-activated cell sorting (FACS) is
used to select two groups of cells: CD2+GFP+ and CD2+GFP-/+ (input). Through sequence enrichment
analysis between the two groups, the elements that are functional enhancers can be identified.
1.4.2 Analyzing enhancers in their natural chromosomal context
Attempts to alleviate the problems associated with reporter assays have led to the development
of new methods by which enhancer function is disrupted within a normal chromosomal context. The
underlying rationale for these approaches is that loss or reduction of the activity of a specific enhancer
can reveal its natural target genes through consequent changes in gene expression. To delete or disrupt an
enhancer, a genomic nuclease must be brought to the enhancer using a sequence-specific DNA targeting
method. The first DNA targeting method used in genomic technologies consisted of tandem zinc finger
DNA binding domains (based on natural mammalian zinc finger TFs), each of which recognizes three
nucleotides. For example, an array of four to six zinc fingers is able to recognize a specific 12-18
nucleotide sequence [reviewed in (Urnov et al., 2010)], which should theoretically provide genomic
specificity (Figure 1.6a). However, years of efforts in engineering artificial zinc finger proteins suggest
that the recognition of DNA by the zinc finger domains is more complex than originally thought. For
example, the order of the zinc finger domains within an array may affect specificity. Although
new strategies for screening and assembling an array of zinc finger modules have been developed
(Maeder et al., 2008, Sander et al., 2011), these strategies are still labor intensive and not user friendly.
Additionally, a recent study of the genome-wide binding pattern of an artificial zinc finger protein
suggests that zinc finger proteins have thousands of off-target binding sites (Grimmer et al., 2014).
Fortunately, zinc finger proteins are not the only platform by which sequence-specific DNA binding
domains can be used for genomic targeting. Transcription activator-like effectors (TALEs) are derived
from the bacterial plant pathogen Xanthomonas and contain DNA-binding tandem repeats, each of which
consists of 33-35 amino acids and can specifically bind to a single nucleotide in a modular fashion
(Figure 1.6b). TALEs have several advantages over zinc finger DNA binding domains: they are easier to
design because each module only recognizes a single nucleotide, easier to construct, and have higher
DNA binding specificity than do zinc fingers (Cermak et al., 2011, Ochiai et al., 2014, Gaj et al., 2013).
For both artificial zinc fingers and TALEs, a non-specific nuclease such as Fok I can be fused to the DNA
binding array, creating sequence-specific genomic scissors termed zinc finger nucleases (ZFNs) or
TALE nucleases (TALENs) that can introduce a double strand break (DSB) at a specific genomic locus.
A complication is that the TALE and zinc finger platforms are most commonly used as heterodimers,
which means that two DNA targeting constructs must be created for each targeted site and four constructs
must be created to delete an enhancer (see below for more details), making these techniques laborious and
time consuming. The most recent DNA targeting platform is termed clustered regularly interspaced short
palindromic repeats (CRISPR). This is an efficient and versatile genomic targeting tool that utilizes guide
RNAs (gRNA) to bring a Cas9 bacterial nuclease to a complementary DNA target (Figure 1.6c). The
CRISPR/Cas9 method, which does not involve a complex assembly process, does not require
heterodimerization, and has high targeting specificity, is rapidly becoming the preferred genomic
targeting platform (Cho et al., 2014, Sander and Joung, 2014).
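The length argument made above for zinc finger arrays (three nucleotides per finger, so 12-18 nucleotides for four to six fingers) can be made concrete with a small calculation. The sketch below estimates how many exact matches to a single recognition site would be expected by chance, assuming a genome of roughly 3 billion base pairs in which the four nucleotides occur independently and with equal frequency. Both assumptions are simplifications introduced here for illustration; real genomes are repetitive and biased in composition, and the calculation ignores the partial-match binding that underlies the off-target sites discussed above.

```python
def expected_chance_sites(site_length, genome_size=3_000_000_000):
    """Expected number of exact matches to one site in a random genome.

    Both strands are counted, hence the factor of 2. genome_size is an
    approximate, assumed value for a haploid human genome.
    """
    return 2 * genome_size / 4 ** site_length

for fingers in (4, 5, 6):
    n = 3 * fingers  # each zinc finger recognizes three nucleotides
    print(fingers, "fingers,", n, "nt:", expected_chance_sites(n))
```

Under these assumptions a 12-nucleotide site is expected to occur hundreds of times by chance, whereas an 18-nucleotide site is expected less than once, which is the sense in which longer arrays should theoretically provide genomic specificity.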
Using genomic editing tools, multiple studies have successfully inactivated enhancers by
introducing a DSB (with consequent alteration of the nucleotides proximal to the cut site) precisely at a
critical TF binding site (Figure 1.6d). One such study showed that loss of CTCF sites at the boundary of
super-enhancer domains (SDs) caused expression changes of genes within the SD (Dowen et al., 2014).
Another study demonstrated that an enhancer located 30 Kb upstream of the Mmp13 promoter regulates
Mmp13 expression through RUNX2 binding and an enhancer located 10 Kb upstream of the promoter
represses Mmp13 expression through binding of 1α,25-dihydroxyvitamin D3 (Meyer et al., 2015).
However, introducing a DSB at a single motif may not totally inactivate an enhancer because enhancers
usually consist of a cluster of TF motifs; removing one motif may not substantially affect overall activity
of the enhancer. To achieve a total loss of enhancer activity, one can use genomic editing tools to create
two DSBs flanking the target enhancer region; the enhancer will be deleted and the gap will be
automatically repaired by non-homologous end joining (NHEJ) (Figure 1.6e). Alternatively, one can
replace the enhancer by coupling a single DSB with homologous recombination, using a plasmid
containing sequences having homology to the regions flanking the enhancer. One study deleted an allele-
specific sequence of a super-enhancer located 100 Kb downstream of Sox2 in a mouse ESC line and,
using allele information, showed that the super-enhancer is responsible for 90% of Sox2 expression (Li et
al., 2014); these results were supported by an independent CRISPR/Cas9-mediated enhancer deletion
study (Zhou et al., 2014). Also, deletion of enhancers that harbor colorectal cancer-associated SNPs
showed dramatic impacts on expression of MYC, even though the enhancer is ~ 350 Kb away [(Yao et
al., 2014) and unpublished data]. Another study successfully validated enhancer-gene linkages identified
by ChIA-PET or 5C using a TALEN-mediated homologous recombination knockout system (Kieffer-
Kwon et al., 2013).
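The two-cut deletion strategy described above can be sketched in a few lines. The example assumes the commonly used S. pyogenes Cas9, which requires an NGG motif immediately 3' of a 20-nucleotide target and cuts about three base pairs upstream of that motif; these details are not stated in the text and are assumptions of the sketch. Only the forward strand is scanned, and the repair is modeled as an idealized junction without the small insertions or deletions that NHEJ often introduces. The function names are hypothetical.

```python
import re

def find_protospacers(seq, guide_len=20):
    """Yield (start, protospacer, cut_position) for forward-strand sites
    that are immediately followed by an NGG motif."""
    seq = seq.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})[ACGT]GG)" % guide_len)
    for m in pattern.finditer(seq):
        start = m.start()
        cut = start + guide_len - 3  # assumed cut site, 3 bp upstream of the NGG
        yield start, m.group(1), cut

def delete_between(seq, cut_left, cut_right):
    """Model an idealized deletion: the sequence between the two cut
    positions is removed and the ends are joined without indels."""
    if not 0 <= cut_left <= cut_right <= len(seq):
        raise ValueError("cut positions must satisfy 0 <= left <= right <= len(seq)")
    return seq[:cut_left] + seq[cut_right:]
```

In use, one candidate site would be chosen on each side of the enhancer and the two cut positions passed to `delete_between` to predict the deleted allele; a real design would also scan the reverse strand and screen candidates for off-target matches.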
Deletion and disruption are not the only in vivo approaches for targeting an enhancer. As
described above, enhancers are characterized by distinct histone modification and DNA methylation
patterns and thus artificial modification of these epigenetic marks should be able to influence enhancer activity.
Fusing chromatin-modifying domains such as histone acetylases, histone methylases, DNA methylases,
or DNA demethylases to artificial zinc finger proteins and TALEs can create tools that can enhance or
repress enhancer activity (Grimmer et al., 2014). Also, fusion of a chromatin-modifying domain to a
catalytically deactivated Cas9 (dCas9) can be used to alter the activity of the enhancer at a target
sequence (Hilton et al., 2015, Kearns et al., 2015). A chromatin-modifying tool created by fusion of
TALE domains and the lysine-specific demethylase 1 (LSD1) that can demethylate H3K4 showed that
four of the nine tested TALE-LSD1 fusions can reduce the level of H3K4me1 or H3K27ac at enhancers
and cause downregulation of nearby genes by at least 1.5-fold (Mendenhall et al., 2013). Finally, both
dCas9-LSD1 and dCas9-KRAB fusion proteins have been shown to decrease Oct4 and Tbx3 expression
upon targeting their distal enhancers in mouse embryonic stem cells. However, the LSD1 and KRAB
effectors appear to use different mechanisms to repress gene expression when tethered to an enhancer
(Kearns et al., 2015); see Figure 1.6f. Instead of repressing enhancer activity, another study used a fusion
of dCas9 and the p300 histone acetyltransferase domain to increase enhancer activity, upregulating
expression of 4 nearby genes, the farthest of which was 54 Kb away (Hilton et al., 2015). Although
genome editing and chromatin modification technologies have great potential for studies of gene
regulation networks, these technologies are still at an early, low-throughput stage of development.
Figure 1.6 Experimental strategies to identify target genes.
A) DNA targeting tools can consist of tandem zinc finger DNA binding domains, each of which binds to
three nucleotides of DNA. Top: fusion of the non sequence-specific nuclease FokI to zinc finger arrays
creates genomic scissors called zinc finger nucleases (ZFNs); dimerization of two ZFNs targeting a
specific sequence from opposite sides is required for DNA cleavage. Bottom: effector domains can also
be fused to zinc finger arrays; the zinc finger-effector proteins do not require heterodimerization to function. B)
DNA targeting tools can consist of tandem TALE DNA binding domains, each of which binds to one
nucleotide of DNA. Top: fusion of the non sequence-specific nuclease FokI to the DNA binding array
creates TALENs. Bottom: effector domains can be fused to TALE domains. Similar to ZFNs, two
TALENs are necessary to perform a site-specific DNA cleavage, but only one TALE-effector is needed
to modify the genome. C) The CRISPR/Cas9 system utilizes guide RNAs (gRNAs) to bring a Cas9
nuclease to a complementary DNA target to perform site-specific genomic editing. Effector domains can
also be fused to a nuclease-deficient Cas9 (dCas9). D) Genomic editing tools can be used to create a
single DNA cleavage event that disrupts a TF motif. E) Two sets of heterodimeric ZFNs or TALENs or
one pair of guide RNAs can be used to create two DSBs flanking the target enhancer region. The
enhancer will be deleted and the gap will be repaired by non-homologous end joining (NHEJ). F)
Enhancer activity can be repressed using chromatin-editing tools if an effector domain, such as a DNA
methyltransferase (DNMT) which can methylate an enhancer or a histone demethylase (LSD1) which can
remove methylation from H3K4me1, is fused to the zinc finger or TALE arrays or to a nuclease-deficient
Cas9 (dCas9).
1.5 Conclusions and future perspectives
Many studies suggest that enhancers are critical regulators of cell-specific phenotypes and that
they contribute to the altered transcriptomes of diseased states (Giorgio et al., 2015, Groschel et al., 2014,
Sur et al., 2012). However, investigators face many challenges in trying to understand the function of
enhancers, including cell-type specificity, flexibility of distances between enhancers and target genes, and
multi-way interactions between enhancers and target genes. Fortunately, the advent of a plethora of
genome-wide assays based on next-generation sequencing has revolutionized our ability to interrogate
enhancer-gene regulatory networks, enabling a deeper understanding of the roles of enhancers in
development and disease. Above, I described three general approaches to study enhancer-gene regulatory
networks: chromosome interaction maps, computational predictions, and experimental validations.
However, there are still unanswered questions to consider and additional datasets that must be collected to
further our understanding of the mechanisms by which enhancers work.
A. Are all H3K27Ac-marked enhancers functional? It is a common assumption in the field
that all DHSs that are flanked by nucleosomes having H3K27Ac are active in that cell type. However,
recent experiments suggest that this is not necessarily true. For example, reporter studies have shown that
only a subset of enhancers predicted by DHS and histone modification are functionally validated using
transgenic mouse assays (Nord et al., 2013). This may result from limitations of the transgenic mouse
model; e.g. only enhancers active in a specific embryonic stage may show functionality or the reporter
constructs may lack important higher order chromosomal context as described above. Alternatively, the
relatively low rate of enhancer validation may suggest that DHS and histone modification are not the best
way to predict the subset of functional enhancers. For example, in a recent study, enhancers that had
lower levels of H3K27Ac were reported to have a higher validation rate in reporter assays than enhancers
that had higher levels of that mark (Kwasnieski et al., 2014). One can imagine that the level of H3K27Ac
at an enhancer is a direct consequence of the efficiency of binding of a TF that can recruit a HAT to that
site. However, the most efficient HAT-containing complexes might not include the factors most
important for interacting efficiently with promoters via looping. It is possible that all functional enhancers
will have some level of H3K27Ac but that not all enhancers with H3K27Ac are functional; clearly,
further studies are needed.
B. Are all enhancer-promoter loops functional? 3D nuclear architecture studies clearly show
that enhancers physically contact promoters in a cell-type-specific manner and that these interactions are
constrained by TADs that are associated with boundary proteins such as CTCF and Cohesin. However,
the 3D datasets have been collected in relatively few cell types and under a small number of conditions.
Fortunately, future improvements in sequencing technologies, further reduction of sequencing costs, and
improvement of computational analysis methods will result in high-resolution chromosome interaction
maps becoming available for more and more cell types. At this point, it is not clear if all enhancer-
promoter loops can be identified by methods such as Hi-C or if this method is more amenable to
identifying certain classes of structural chromatin interaction loops. More importantly, it is not known if
most enhancer-promoter loops are functional. There have been few focused analyses to understand the
epigenetic marks that are directly involved in these loops. It is possible that loops mediated by TF-TF
interactions between enhancers and promoters could occur prior to the recruitment of critical co-activators
needed to stimulate transcription; this could serve as a mechanism for poising genes for proper expression
in response to a later developmental or environmental cue. In this sense, the enhancers would be available
but not active; perhaps active enhancers are the subset of those distal regulatory elements that loop to
promoters and that also recruit a co-activator such as CBP (i.e. the enhancer must be involved in a loop
and have H3K27Ac).
C. How do enhancers choose target genes? A priori, it would seem reasonable that enhancers
would regulate the nearest promoter, and early studies may have propagated this view due to the fact
nearby sequences were the easiest to test. However, there is a paucity of data documenting the actual
percentage of enhancers that regulate the nearest gene. Most studies that address this question are based
on looping assays. For example, a 5C study in K562, GM12878, and HeLa cells showed that 73% of the
tested distal elements do not link to the nearest gene (Sanyal et al., 2012), an RNA Polymerase II ChIA-
PET study in K562 and MCF7 cells found that ~40% of the enhancers involved in loops do not interact
with the TSS of the nearest gene (Li et al., 2012a), a CHi-C study in GM12878 cells found that one third
of the distal interactions were not directed to the promoter of the nearest gene (Mifsud et al., 2015), and a
study using the ELMER computational method found that 85% of tumor-specific enhancers that could be
linked to expression of a nearby gene skipped the nearest gene (Yao et al., 2015). The reasons behind this
high level of nearest-promoter skipping are not clear. It is possible that the chromatin conformation
studies do not have the resolution or read depth to detect looping between closely spaced genomic
elements. For example, the CHi-C data, which as noted above is enriched for reads involving promoters,
showed a higher percentage of nearest-gene enhancer loops than did other assays (Mifsud et al., 2015). It
is also possible that how enhancers choose target genes is affected by genomic or epigenomic context. In
support of this hypothesis, multiple studies have shown that a skipped gene is often not expressed
in that cell type. For instance, the percentage of enhancers interacting with nearest genes in the 5C study
in K562, GM12878 and HeLa cells increased from 27% to 47% when only expressed genes were used in
the analysis (Sanyal et al., 2012). Additionally, a study in Drosophila showed that 79% of a set of
intragenic enhancers regulate their host gene (Kvon et al., 2014) and an ELMER analysis of primary
tumors found that 66% of a set of intragenic enhancers were linked to their host gene (Yao et al., 2015).
This higher percentage of nearest-gene regulation in the set of intragenic enhancers, as compared to the
set of all enhancers, may be due to the fact that intragenic enhancers tend to fall within genes that are
actively expressed within that cell type; therefore the nearest promoter to an intragenic enhancer is
usually an active promoter. Although these limited studies provide some clues as to how enhancers
choose target genes, this important topic still needs further investigation to define the relevant factors
ruling target gene selection; e.g. does gene density and/or the presence of a boundary element such as a
certain class of CTCF binding sites influence target gene choice? Clearly, it will be important to
determine the percentage of “nearest promoter” regulation observed upon enhancer deletion to see if the
same low percentage is observed as in the looping studies. However, this will require large numbers of
enhancers to be deleted and, to date, only a handful have been studied in this way (Dowen et al., 2014,
Meyer et al., 2015, Li et al., 2014, Zhou et al., 2014, Yao et al., 2014, Kieffer-Kwon et al., 2013).
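The nearest-gene percentages quoted above all rest on the same simple calculation, sketched here for a single chromosome. Given a list of enhancer-gene links (from looping data, computational prediction, or deletion experiments) and the TSS position of each gene, the function reports the fraction of links whose target is not the gene with the closest TSS. The optional restriction to a candidate set mirrors the reanalysis with expressed genes only. The data structures and names are hypothetical, and none of the cited studies necessarily computed the statistic in exactly this way.

```python
def fraction_skipping_nearest(links, tss_by_gene, candidate_genes=None):
    """Fraction of enhancer-gene links that skip the nearest gene.

    links: list of (enhancer_position, target_gene) on one chromosome.
    tss_by_gene: dict mapping gene -> TSS position on the same chromosome.
    candidate_genes: optional set restricting which genes may count as
        "nearest" (for example, only genes expressed in the cell type).
    """
    genes = [g for g in tss_by_gene
             if candidate_genes is None or g in candidate_genes]
    skipped = 0
    for enhancer_pos, target in links:
        nearest = min(genes, key=lambda g: abs(tss_by_gene[g] - enhancer_pos))
        if nearest != target:
            skipped += 1
    return skipped / len(links)
```

Comparing the result with and without `candidate_genes` shows how much of the apparent skipping is explained by nearest genes that are simply silent in that cell type.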
D. How do we distinguish direct from indirect actions of enhancers? In general, predictions
for connectivity between genes and enhancers (generated either from chromosome interaction maps or
computational methods) have considered only interactions on the same chromosome and within a defined
genomic distance. However, it is theoretically possible that interactions between enhancers and
promoters on different chromosomal arms or even on different chromosomes could occur. Genome-
editing tools provide the most unbiased method by which to obtain a list of putative target genes.
However, even in these cases, assumptions are generally made that the nearest gene that shows a decrease
in expression is likely the direct target and that genes located farther away or on other chromosomes are
indirectly affected due to changes in cell phenotype caused by a decreased expression of the direct
target(s); this is particularly plausible if the nearest gene that shows a change in expression is a TF or a
regulator of a signaling pathway. Perhaps one could begin to separate the direct target genes from the
indirect target genes in a follow-up experiment by simply providing another copy of the “nearest
regulated” target gene at a separate chromosomal location. The consequences of lowered expression of
that target gene would then be eliminated and then perhaps only direct targets of the deleted enhancer
would show changes in expression.
1.6 Thesis overview
In my dissertation work presented within, I attempted to functionally characterize cancer-
associated enhancers. My strategy was to identify cancer-associated enhancers and then understand their
regulatory functions through identifying their putative target genes. In Chapter 2, I used colorectal cancer
GWAS index SNPs and correlated SNPs that were selected using the 1000 Genomes dataset, identifying
large genomic regions that are associated with increased risk of colorectal cancer. Using epigenetic
information of colorectal tumor and normal cells from the ENCODE Consortium and the Roadmap
Epigenome Mapping Consortium, I identified cancer risk-associated SNPs that occur within coding exons
or regulatory elements (promoters and enhancers) that may affect the function of a protein or the
expression of a gene. To understand the functional consequences of SNPs within the risk-associated
enhancers, I identified putative target genes for these cancer-associated enhancers through differential
gene expression and eQTL analysis using gene expression profiles from colorectal tumor and normal
cells.
In Chapter 3, I developed a novel method to systematically study enhancer regulatory networks in
primary tissue using DNA methylation data. I gathered DNA methylation and gene expression profiles for
10 cancer types, including more than 2000 normal and tumor samples from The Cancer Genome Atlas
(TCGA) and compiled a comprehensive list of putative genomic enhancers using ChromHMM data from
the Roadmap Epigenome Mapping Consortium and eRNA data from FANTOM5. I identified a) cancer-
specific enhancers by comparing the DNA methylation level within enhancers between normal and tumor
samples for each cancer type; b) putative target genes by using correlations between gene expression and
DNA methylation at enhancers; and c) upstream transcription factors that may drive hypomethylation at
cancer-specific enhancers by using positive correlations of expression of transcription factors and the
average DNA methylation at cancer-specific enhancers. Finally, I developed an R-based package,
ELMER (Enhancer Linking by Methylation/Expression Relationships), to perform these analyses and have
made this package publicly available via Bioconductor.
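The idea behind step b) above, linking an enhancer to putative target genes through the relationship between its DNA methylation and gene expression across samples, can be illustrated with a minimal sketch. It ranks candidate genes by the Spearman rank correlation between one enhancer's methylation values and each gene's expression values. This is an illustration of the general principle only: the choice of Spearman correlation, the function names, and the input layout are assumptions made here and are not a description of the statistics implemented in the ELMER package.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))

def rank_candidate_targets(methylation, expression_by_gene):
    """Sort genes from most negative to most positive correlation with
    the enhancer's methylation across the same ordered set of samples."""
    scores = {g: spearman(methylation, e) for g, e in expression_by_gene.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])
```

Genes at the top of the returned list are those whose expression rises as the enhancer loses methylation. The sketch does not handle constant inputs (which would give a zero variance) and provides no significance test or multiple-testing correction.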
Chapter 2. Functional annotation of colon cancer risk SNPs
The work described in this chapter has been published in Nature Communications, 5: 5114. doi:
10.1038/ncomms6114. Lijing Yao, Yu Gyoung Tak, Benjamin P. Berman and Peggy J. Farnham.
“Functional annotation of colon cancer risk SNPs”. Lijing Yao is responsible for all bioinformatic
analyses and assisted with manuscript preparation; Yu Gyoung Tak performed the enhancer
characterizations (Figure 2.4) and helped to edit the manuscript; Benjamin P. Berman advised L.Y. in
bioinformatic analyses and Peggy J. Farnham conceived the project and wrote the manuscript.
2.1 Abstract
Colorectal cancer (CRC) is a leading cause of cancer-related deaths in the United States.
Genome-wide association studies (GWAS) have identified single nucleotide polymorphisms (SNPs)
associated with increased risk for CRC. A molecular understanding of the functional consequences of this
genetic variation has been complicated because each GWAS SNP is a surrogate for hundreds of other
SNPs, most of which are located in non-coding regions. Here I use genomic and epigenomic
information to test the hypothesis that the GWAS SNPs and/or correlated SNPs are in elements that
regulate gene expression, and identify 23 promoters and 28 enhancers. Using gene expression data from
normal and tumor cells, I identify 66 putative target genes of the risk-associated enhancers (10 of which
were also identified by promoter SNPs). Employing CRISPR nucleases, one risk-associated enhancer is
deleted to identify genes showing altered expression. I suggest that similar studies be performed to
characterize all CRC risk-associated enhancers.
2.2 Introduction
Colorectal cancer (CRC) ranks among the leading causes of cancer-related deaths in the United
States. The incidence of and death from CRC is in the top 3 of all cancers in the United States for both
men and women (http://apps.nccd.cdc.gov/uscs/toptencancers.aspx). It is estimated that 142,820 men and
women will be diagnosed with, and 50,830 men and women will die of, cancer of the colon and
rectum in 2013 (http://seer.cancer.gov/-statfacts/html/colorect.html). A better understanding of the
regulatory factors and signaling pathways that are deregulated in CRC could provide new insight into
appropriate chemotherapeutic targets. Decades of studies have revealed that certain genes and pathways,
such as WNT, RAS, PI3K, TGF-B, p53, and mismatch repair proteins, are important in the initiation and
progression of CRC (Fearon, 2011). In an attempt to obtain a more comprehensive view of CRC, two new
approaches have been used: exome sequencing of tumors and genome-wide population analyses of human
variation. The Cancer Genome Atlas (TCGA) has taken the first of these new approaches in the hopes of
moving closer to a full molecular characterization of the genetic contributions to CRC, analyzing somatic
alterations in 224 tumors (Fearon, 2011). These studies again implicated the WNT, RAS, and PI3K
signaling pathways. The second new approach identifies single nucleotide polymorphisms associated with
specific diseases using genome wide association studies (GWAS). GWAS has led to the identification of
thousands of single nucleotide polymorphisms (SNPs) associated with a large number of phenotypes
(Hindorff et al., 2009, Manolio, 2010). Such studies identify what are known as tag SNPs that are
associated with a particular disease. Specifically for CRC, 25-30 tag SNPs have been identified (Zanke et
al., 2007, Tomlinson et al., 2008, Tenesa et al., 2008, Peters et al., 2013, Jia et al., 2013, Houlston et al.,
2008, Houlston et al., 2010, Dunlop et al., 2012).
Although identification of tag SNPs is an important first step in understanding the
relationship between human variation and risk for CRC, a major challenge in the post-GWAS era is to
understand the functional significance of the identified SNPs (Schaub et al., 2012). It is critical to
advance the field by progressing from a statistical association between genetic variation and disease to a
molecular understanding of the functional consequences of the genetic variation. Progress towards this
goal has been mostly successful when the genetic variation falls within a coding region. Unfortunately,
most SNPs identified as associated with human disease in large GWAS studies are located within large
introns or distal to coding regions, in what in the past has been considered to be the unexplored territory
of the genome. However, recent studies from the ENCODE Consortium have shown that introns and
regions distal to genes contain regulatory elements. In particular, the ENCODE Consortium has made
major progress in defining hundreds of thousands of cell-type specific distal enhancer regions (Zentner
and Scacheri, 2012, ENCODE_Project_Consortium, 2012, Frietze et al., 2012). Comparison of GWAS
SNPs to these enhancer regions has revealed several important findings. For example, work from
ENCODE and others has shown that many GWAS SNPs fall within enhancers, DNase hypersensitive
sites, and transcription factor binding sites (ENCODE_Project_Consortium, 2012, Maurano et al., 2012,
Schaub et al., 2012, Akhtar-Zaidi et al., 2012). It is also clear that the SNP whose functional role is most
strongly supported by ENCODE data is often a SNP in linkage disequilibrium (LD) with the GWAS tag
SNP, not the actual SNP reported in the association study (Hardison, 2012).
These recent reports clearly show that regulatory elements can help to identify important
SNPs (Hardison, 2012, Farnham, 2012, Schaub et al., 2012). However, the studies were performed using
all available ENCODE data and did not focus the functional analysis of cancer-associated SNPs on the
regulatory information obtained using the relevant cell types.
Using epigenetic marks obtained from normal colon and colon cancer cells, I identify SNPs in
high LD with GWAS SNPs that are located in regulatory elements specifically active in normal and/or
tumor colon cells. Characterization of transcripts near CRC risk-associated promoters and enhancers
using RNA expression data allows the prediction of putative genes and non-coding RNAs associated with
an increased risk of colon cancer. Using genomic nucleases, one risk-associated enhancer was deleted to
identify the deregulated genes for comparison to those predicted to be targets of that enhancer. My studies
suggest that transcriptome characterization after precise deletion of a risk-associated enhancer could be a
useful approach for post-GWAS analyses.
2.3 Results
2.3.1 CRC risk-associated SNPs linked to a specific gene
For my studies, I chose 25 tag SNPs, 4 of which have been associated with an increased risk for
CRC in Asia-derived case-control cohorts and the rest in Europe-derived case-control cohorts; the
genomic coordinates of each SNP can be found in Table 2.1 and Supplementary Data File 2.1. Of these
25 tag SNPs, only one is found within an exon, occurring in the third exon of the MYNN gene and
resulting in a synonymous change that does not lead to a coding difference. However, there are hundreds
of SNPs in high LD with each tag SNP and it is possible that some of the high LD SNPs may reside in
coding exons. To address this possibility I used a bioinformatics program called FunciSNP to identify
SNPs correlated with CRC tag SNPs that also intersect the set of coding exons in the human genome
(Coetzee et al., 2012). FunciSNP is an R/Bioconductor package that allows a comparison of population-
based correlated SNPs from the 1000 Genomes Project (http://www.1000genomes.org/) with any set of
chromatin biofeatures. In this initial analysis, I chose coding exons from the Gencode 15 dataset
(http://www.gencodegenes.org/releases/) as the biofeature. Because LD varies with the population, to
identify population-based correlated SNPs I specified the Asian population for analysis of the 4 tag SNPs
identified using Asian-derived case-control cohorts and I specified the European population for analysis
of the rest of the tag SNPs. Using FunciSNP, I identified 240 unique SNPs that are correlated with the 25
tag SNPs at an r² > 0.1 and are within a coding exon (Supplementary Figure 2.1). I then used snpeff
(http://snpeff.sourceforge.net/ (Cingolani et al., 2012)) to determine that 40 of these correlated SNPs
create non-synonymous changes; however, limiting the SNPs to those with an LD of r² > 0.5 with the tag
SNP reduced the number to only 13. Using polyphen-2 (http://genetics.bwh.harvard.edu/pph2/ (Adzhubei
et al., 2010)) and provean (http://provean.jcvi.org/index.php (Choi et al., 2012)), only 2 potentially
damaging SNPs at r² > 0.5 were found, both in POU5F1B (Figure 2.1). At the less restrictive r² > 0.1, 4
other genes were also found to harbor a damaging SNP (RHPN2, UTP23, LAMA5, and FAM186A). To
determine if these genes are expressed in colon cells, I analyzed two replicates of RNA-seq for HCT116
cells and also used RNA-seq data from the Roadmap Epigenome Mapping Consortium for normal
sigmoid colon to examine expression. After analysis of both sets of RNA-seq data, I categorized
transcripts that are not expressed as having less than 0.5 FPKM (Supplementary Figure 2.2). Analysis of
the RNA-seq data revealed that POU5F1B and FAM186A are not expressed in either the normal sigmoid
colon or HCT116 cells (however, these genes are expressed in a cohort of TCGA colon tumors; see
Table 2.2).
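The exon filtering described above (an LD threshold, a damaging-variant call, and an expression cutoff of 0.5 FPKM) can be illustrated with a minimal Python sketch. This is not the FunciSNP, snpeff, or polyphen-2 code; the class, the function names, and all identifiers and values below are hypothetical, and a gene is assumed to count as expressed if either sample reaches the cutoff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExonSnp:
    snp_id: str     # placeholder identifier, not a real rsID
    gene: str       # gene whose coding exon contains the SNP
    r2: float       # LD (r^2) with the tag SNP
    damaging: bool  # flagged as damaging by a predictor


def is_expressed(fpkm_values, cutoff=0.5):
    """Treat a gene as expressed if any sample reaches the FPKM cutoff."""
    return any(v >= cutoff for v in fpkm_values)


def candidate_genes(snps, fpkm_by_gene, r2_min):
    """Expressed genes carrying a damaging exon SNP with r^2 above r2_min."""
    return sorted({
        s.gene for s in snps
        if s.damaging and s.r2 > r2_min
        and is_expressed(fpkm_by_gene.get(s.gene, []))
    })


# Toy data only: identifiers and values are invented for illustration.
snps = [
    ExonSnp("snp1", "GENE_A", 0.82, True),
    ExonSnp("snp2", "GENE_B", 0.31, True),
    ExonSnp("snp3", "GENE_C", 0.64, False),
    ExonSnp("snp4", "GENE_D", 0.90, True),   # damaging, but gene not expressed
]
fpkm_by_gene = {"GENE_A": [3.1, 0.2], "GENE_B": [12.0, 9.5],
                "GENE_C": [5.0, 4.0], "GENE_D": [0.1, 0.0]}

relaxed = candidate_genes(snps, fpkm_by_gene, r2_min=0.1)
strict = candidate_genes(snps, fpkm_by_gene, r2_min=0.5)
```

Raising `r2_min` from 0.1 to 0.5 shrinks the candidate list in the same way that the stricter LD threshold reduces the SNP counts reported in the text.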
Another way to link a SNP to a particular gene is if the SNP falls within a promoter region. I again
used FunciSNP, but this time the biofeature analyzed corresponded to the region from -2000 to +2000 nt
of the transcription start site (TSS) of each transcribed gene (I analyzed coding and non-coding
transcripts from GENCODE V15). I chose to include 2 kb upstream and downstream of the start site as
the promoter proximal regions because several studies (Koudritsky and Domany, 2008, Stergachis et al.,
2013), as well as visual inspection of the ENCODE TF ChIP-seq tracks, have shown that transcription
factors can bind on either side of a transcription start site. Using an r² > 0.1, I found 684 correlated
promoter SNPs, which were reduced to 233 SNPs at r² > 0.5 (Figure 2.1 and Supplementary Figure 2.3).
Many of these SNPs fall within the same promoter regions. When collapsed into distinct promoters, I
identified the TSS regions of 17 protein coding genes and 2 noncoding RNAs which are expressed in
HCT116 or sigmoid colon cells; promoter SNPs identified 4 additional expressed genes when a larger
number of TCGA colon tumor samples were analyzed (Table 2.2).
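As a minimal sketch of the promoter analysis described above, the following Python fragment builds the -2000 to +2000 nt window around each TSS and collapses the SNPs that fall inside it into distinct promoters. The function names, the data layout, and the coordinates are assumptions made for illustration; this is not how FunciSNP itself is implemented.

```python
def promoter_window(tss, flank=2000):
    """Return (start, end) of the region from -flank to +flank around a TSS."""
    return max(0, tss - flank), tss + flank


def promoters_with_snps(snps, transcripts, r2_min=0.5, flank=2000):
    """Collapse SNPs into distinct promoters: transcript name -> sorted SNP ids."""
    hits = {}
    for name, (chrom, tss) in transcripts.items():
        start, end = promoter_window(tss, flank)
        ids = sorted(
            snp_id for snp_id, (snp_chrom, pos, r2) in snps.items()
            if snp_chrom == chrom and start <= pos <= end and r2 > r2_min
        )
        if ids:
            hits[name] = ids
    return hits


# Toy coordinates, invented for illustration.
transcripts = {"TX_1": ("chr1", 10_000), "TX_2": ("chr1", 50_000)}
snps = {
    "snp_a": ("chr1", 9_200, 0.9),    # 800 nt from the TX_1 TSS
    "snp_b": ("chr1", 11_900, 0.7),   # 1,900 nt from the TX_1 TSS
    "snp_c": ("chr1", 48_500, 0.3),   # inside the TX_2 window, but r^2 is low
    "snp_d": ("chr2", 10_000, 0.95),  # different chromosome
}
hits_strict = promoters_with_snps(snps, transcripts, r2_min=0.5)
hits_relaxed = promoters_with_snps(snps, transcripts, r2_min=0.1)
```

Two SNPs map to the same promoter in this toy example, which mirrors the observation that many correlated SNPs fall within the same promoter regions.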
Figure 2.1 Identification of potential functional SNPs for CRC.
A) Shown is the number of SNPs identified by FunciSNP in each of 3 categories for 25 colon cancer risk
loci (see Table 2.1 for information on each CRC risk SNP). For exons, only non-synonymous SNPs are
reported; parentheses indicate the number of SNPs that are predicted to be damaging; see Table 2.2 for a
list of the expressed genes associated with the correlated SNPs. For TSS regions, the region from -2kb to
+2kb relative to the start site of all transcripts annotated in GENCODE V15, including coding genes and
non-coding RNAs was used; see Table 2.2 for a list of expressed transcripts associated with the
correlated SNPs. B) For H3K27Ac analyses, ChIP-seq data from normal sigmoid colon and HCT116
tumor cells were used; see Table 2.3 for further analysis of distal regions harboring SNPs in normal and
tumor colon cells. The SNPs having an r² > 0.1 that overlapped with H3K27Ac sites were identified
separately for HCT116 and sigmoid colon datasets. Because more than one SNP could identify the same
H3K27Ac-marked region, the SNPs were then collapsed into distinct H3K27Ac peaks. The sites that
were within +/- 2 kb of a promoter region were removed to limit the analysis to distal elements. To obtain
a more stringent set of enhancers, those regions having only SNPs with r² < 0.5 were removed. This
remaining set of 68 distal H3K27Ac sites was contained within 19 of the 25 risk loci. Visual inspection
to identify only the robust enhancers having linked SNPs not at the margins reduced the set to 27
enhancers located in 9 of the 25 risk loci; an additional enhancer was identified in SW480 cells (see
Table 2.3 for the genomic locations of all 28 enhancers). Color key: green=SNPs or H3K27Ac sites
unique to normal colon, red=unique to colon tumor cells, blue=present in both normal and tumor colon.
[Figure 2.1 schematic. Panel A: CRC tag SNPs (25) → genomic window (+/- 200 kb) around tag SNP →
extract all known SNPs (1000 Genomes database) → select correlated SNPs (LD r² > 0.1).
Exon: r² > 0.1: 40 (7); r² > 0.5: 13 (2). TSS region: r² > 0.1: 684; r² > 0.5: 233.
H3K27Ac: r² > 0.1: 746; r² > 0.5: 270.
Panel B: SNPs with r² > 0.1 and overlapping H3K27Ac (370, 236, 140) → different H3K27Ac sites
(111, 47, 41) → distal H3K27Ac sites (96, 27, 32) → distal H3K27Ac sites with r² > 0.5 (41, 18, 9) →
visual inspection (13, 13, 1).]
Figure 2.2 Expression of risk-associated genes in colon cells.
The left panel indicates if a transcript was identified by a SNP located in an exon or a TSS or is near a
risk-associated enhancer; the middle panel shows the expression values of each of the 41 transcripts in
sigmoid colon or HCT116 tumor cells; the right panel shows the fold change of each transcript in the
tumor cells (positive indicates higher expression in the tumor).
Table 2.1 Summary of regions linked to CRC tag SNPs.
Tag SNP | Position | Ref/Alt | Exons | Protein-Coding TSS | Non-Coding TSS | Enhancers | PMID
rs6691170 | chr1:222045446 | G/T | 0 | 0 | 2 | 0 | 20972440
rs6687758 | chr1:222164948 | A/G | 0 | 0 | 3 | 0 | 20972440
rs10936599 | chr3:169492101 | C/T | 0 | 4 | 1 | 0 | 20972440
rs647161 | chr5:134499092 | C/A | 0 | 0 | 1 | 6 | 23263487
rs1321311 | chr6:36622900 | C/A | 0 | 1 | 1 | 0 | 22634755
rs16892766 | chr8:117630683 | A/C | 1 | 1 | 0 | 0 | 18372905
rs10505477 | chr8:128407443 | A/G | *1 | 1 | 0 | 2 | 17618283
rs6983267 | chr8:128413305 | G/T | *1 | 1 | 0 | 2 | 23266556
rs7014346 | chr8:128424792 | A/G | *1 | 1 | 1 | 2 | 18372901
rs10795668 | chr10:8701219 | G/A | 0 | 0 | 1 | 0 | 18372905
rs1665650 | chr10:118487100 | T/C | 0 | 0 | 0 | 0 | 23263487
rs3824999 | chr11:74345550 | T/G | 0 | 1 | 0 | 1 | 22634755
rs3802842 | chr11:111171709 | C/A | 0 | 3 | 1 | 0 | 18372901
rs10774214 | chr12:4368352 | T/C | 0 | 0 | 0 | 1 | 23263487
rs7136702 | chr12:50880216 | T/C | 1 | 3 | 1 | 4 | 20972440
rs11169552 | chr12:51155663 | C/T | 0 | 2 | 2 | 3 | 20972440
rs4444235 | chr14:54410919 | T/C | 0 | 1 | 1 | 0 | 19011631
rs4779584 | chr15:32994756 | T/C | 0 | 1 | 1 | 0 | 18372905
rs9929218 | chr16:68820946 | G/A | 0 | 2 | 1 | 4 | 19011631
rs4939827 | chr18:46453463 | T/C | 0 | 0 | 0 | 2 | 18372905
rs10411210 | chr19:33532300 | C/T | 1 | 2 | 0 | 2 | 19011631
rs961253 | chr20:6404281 | C/A | 0 | 0 | 0 | 0 | 19011631
rs2423279 | chr20:7812350 | T/C | 0 | 0 | 0 | 0 | 23263487
rs4925386 | chr20:60921044 | T/C | 1 | 1 | 3 | 4 | 20972440
rs5934683 | chrX:9751474 | T/C | 0 | 0 | 0 | 0 | 22634755
The positions and classification of the CRC tag SNPs are based on the hg19 UCSC genome browser
reference genome; the hg19 reference alleles (Ref) and the alternative alleles (Alt) are indicated; the
risk alleles are in red. The number of exons having a non-synonymous, damaging correlated SNP with
an LD of r² > 0.1 is reported; the 3 regions marked with an asterisk are the only ones for which the
damaging SNP has an LD of r² > 0.5 with the tag SNP. For TSS and enhancers, the number of different
promoters or enhancers having at least one SNP with an LD of r² > 0.5 with the tag SNP is reported
(note that a given TSS or enhancer can be identified by more than one tag SNP; see Table 2.2 and
Table 2.3 for more details). PMID indicates the PubMed ID for a publication describing the
identification of the tag SNP. A list of all correlated SNPs with r² > 0.1 in exons, TSS, or enhancers can
be found in Supplementary Data File 2.1.
Table 2.2 Expressed transcripts directly linked to CRC tag SNPs
Tag SNP | Exons | RNAs of TSS SNPs
rs10936599 | - | ACTRT3, MYNN, (TERC)
rs1321311 | - | CDKN1A
rs16892766 | UTP23 | EIF3H
rs7014346 | - | (RP11-382A18.1)
rs3824999 | - | POLD3
rs3802842 | - | C11orf92, C11orf93, C11orf53
rs7136702 | - | DIP2B
rs11169552 | - | ATF1, DIP2B
rs4444235 | - | BMP4
rs4779584 | - | GREM1
rs9929218 | - | CDH3, CDH1
rs10411210 | RHPN2 | GPATCH1, RHPN2
rs4925386 | LAMA5 | LAMA5
Only 3 damaging SNPs having an r² > 0.1 were identified in the exons of genes expressed in either
HCT116 or normal sigmoid colon cells; of these, only UTP23 and RHPN2 were identified as damaging
by two different programs. RNAs expressed in HCT116 or sigmoid colon cells and having a correlated
SNP with r² > 0.5 within +/- 2kb of the TSS of protein coding transcripts or non-coding RNAs are
shown. The cases in which the tag SNP is located in the TSS region are in bold and non-coding RNAs
are in parentheses. I note that exon SNPs identified two additional expressed genes (POU5F1B and
FAM186A) and promoter SNPs identified 3 additional expressed genes (FAM186A, LRRC34, and
LRRIQ4) when a larger number of TCGA colon tumor samples were analyzed.
2.3.2 CRC risk-associated SNPs in distal regulatory regions
Most of the SNPs in LD with the CRC GWAS tag SNPs cannot be easily linked to a specific gene
because they do not fall within a coding region or a promoter-proximal region. However, it is possible
that a relevant SNP associated with increased risk lies within a distal regulatory element of a gene whose
function is important in cell growth or tumorigenicity. To address this possibility, I used the histone
modification H3K27Ac to identify active regulatory regions throughout the genome of colon cancer cells
or normal sigmoid colon cells. I used HCT116 H3K27Ac ChIP-seq data (Frietze et al., 2012) produced in
our lab for the tumor cells and obtained H3K27Ac ChIP-seq data for normal colon cells from the NIH
Roadmap Epigenome Mapping Consortium. The ChIP-seq data for both the normal and tumor cells
included two replicates. To demonstrate the high quality of the datasets, I called peaks on each replicate
of H3K27Ac from HCT116 and each replicate of H3K27Ac from sigmoid colon using Sole-search
(Blahnik et al., 2010, Blahnik et al., 2011) and compared the peak sets from the two replicates using the
ENCODE 40% overlap rule (after truncating both lists to the same number, 80% of the top 40% of one
replicate must be found in the other replicate and vice versa). After determining that the HCT116 and
sigmoid colon datasets were of high quality (Supplementary Figure 2.4), I merged the two replicates
from HCT116 and separately merged the two replicates from sigmoid colon and called peaks on the two
merged datasets; see Supplementary Data File 2.2 for a list of all ChIP-seq peaks. Using the merged
peak lists from each of the samples as biofeatures in FunciSNP, I determined that 746 of the 4894 SNPs
that were in LD with a tag SNP at r² > 0.1 were located in H3K27Ac regions identified in either the
HCT116 or sigmoid colon peak sets; of these, 270 SNPs had an r² > 0.5 with a tag SNP (Figure 2.1 and
Supplementary Figure 2.5).
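The replicate comparison described above (truncate both peak lists to the same length, then require that 80% of the top 40% of each replicate be found in the other) can be sketched in Python as follows. This is an illustrative reading of the rule as stated in the text, not the Sole-search or ENCODE code; the function names are hypothetical, and each list is assumed to be sorted from strongest to weakest peak.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def fraction_found(top_peaks, other_peaks):
    """Fraction of top_peaks that overlap any peak in other_peaks."""
    if not top_peaks:
        return 0.0
    found = sum(any(overlaps(p, q) for q in other_peaks) for p in top_peaks)
    return found / len(top_peaks)


def passes_overlap_rule(rep1, rep2, top_frac=0.4, min_found=0.8):
    """Apply the rule symmetrically after truncating to a common length."""
    n = min(len(rep1), len(rep2))
    a, b = rep1[:n], rep2[:n]
    k = int(n * top_frac)  # number of peaks in the top 40%
    return (fraction_found(a[:k], b) >= min_found
            and fraction_found(b[:k], a) >= min_found)


# Toy peak lists, invented for illustration (already ranked by strength).
rep1 = [("chr1", i * 1000, i * 1000 + 200) for i in range(10)]
rep2 = [("chr1", i * 1000 + 50, i * 1000 + 250) for i in range(12)]  # shifted
rep3 = [("chr2", i * 1000, i * 1000 + 200) for i in range(10)]       # unrelated

concordant = passes_overlap_rule(rep1, rep2)
discordant = passes_overlap_rule(rep1, rep3)
```

The quadratic overlap search is adequate for a toy example; genome-scale peak lists would normally be compared with an interval index or a dedicated tool.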
A comparison of the H3K27Ac peaks from normal and tumor cells indicated that the patterns are
very similar; in fact, ~24,000 H3K27Ac peaks are in common in the normal and tumor cells. However,
there are clearly some peaks unique to normal and some peaks unique to the tumor cells. Therefore, I
separately analyzed the normal and tumor H3K27Ac ChIP-seq peaks as different sets of biofeatures using
FunciSNP (Figure 2.1B). Of the 746 SNPs, 236 were located in a H3K27Ac site common to both normal
and tumor cells, whereas 140 were unique to tumor and 370 were unique to normal cells. Visual
inspection of the SNPs and peaks using the UCSC genome browser showed that many of the identified
enhancers harbored multiple correlated SNPs. Reduction of the number of SNPs to the number of
different H3K27Ac sites resulted in 47 common, 41 tumor-specific, and 111 normal-specific regions.
Visual inspection also showed that some of the H3K27Ac genomic regions corresponded to promoter
regions (Supplementary Figure 2.4). Because promoter regions having correlated SNPs were already
identified using TSS regions (see above), I eliminated the promoter-proximal H3K27Ac sites, resulting in
27 common, 32 tumor-specific, and 96 normal-specific distal H3K27Ac regions. As the next winnowing
step, I selected only those enhancers having at least one SNP with an r² > 0.5, leaving 18 common, 9
tumor-specific, and 41 normal-specific distal H3K27Ac regions. I noted that some of the identified
regions corresponded to low-ranked H3K27Ac peaks. For my subsequent analyses, I wanted to limit my
studies to robust enhancers that harbor correlated SNPs. Therefore, I visually inspected each of the
genomic regions identified as having distal H3K27Ac peaks harboring a correlated SNP. To prioritize the
distal regions for further analysis, I eliminated those for which the correlated SNP was on the edge of the
region covered by the H3K27Ac signal or corresponded to a very low-ranked peak. After inspection, I
was left with a set of 27 distal H3K27Ac regions in which a correlated SNP (r² > 0.5) was well within the
boundaries of a robust peak (Figure 2.1B). To confirm these results, I repeated the analysis using
H3K27Ac data from a different colon cancer cell line, SW480, identifying only one additional enhancer
harboring risk SNPs for CRC. The genomic coordinates of each of these 28 enhancers, which are
clustered in 9 genomic regions, are listed in Table 2.3 (see also Supplementary Data File 2.3).
Combining all data, enhancers in 5 of the 9 regions were identified in all 3 cell types and 8 of the 9
regions were identified in at least two of the cell types.
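The automated part of the winnowing described above (collapsing SNPs into peaks, discarding promoter-proximal peaks, requiring at least one SNP with r² > 0.5, and labelling peaks as common, tumor-specific, or normal-specific) can be sketched as follows. The function names, the interpretation of "promoter-proximal" as any peak coming within 2 kb of a TSS, and all coordinates are assumptions for illustration; the final visual inspection step is not modelled.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def is_distal(peak, tss_list, flank=2000):
    """A peak is distal if it does not come within `flank` nt of any TSS."""
    chrom, start, end = peak
    return not any(c == chrom and start - flank <= tss <= end + flank
                   for c, tss in tss_list)


def winnow(peaks, snps, tss_list, r2_min=0.5, flank=2000):
    """Keep distal peaks that contain at least one SNP with r^2 > r2_min."""
    kept = []
    for chrom, start, end in peaks:
        r2_values = [r2 for c, pos, r2 in snps
                     if c == chrom and start <= pos < end]
        if (r2_values and max(r2_values) > r2_min
                and is_distal((chrom, start, end), tss_list, flank)):
            kept.append((chrom, start, end))
    return kept


def peak_class(peak, normal_peaks, tumor_peaks):
    """Label a peak by the sample(s) in which an overlapping peak is present."""
    in_normal = any(overlaps(peak, p) for p in normal_peaks)
    in_tumor = any(overlaps(peak, p) for p in tumor_peaks)
    if in_normal and in_tumor:
        return "common"
    return "normal-specific" if in_normal else "tumor-specific"


# Toy data, invented for illustration.
normal_peaks = [("chr1", 100_000, 102_000), ("chr1", 200_000, 201_000)]
tumor_peaks = [("chr1", 100_500, 102_500), ("chr1", 300_000, 301_000),
               ("chr1", 400_000, 401_000)]
snps = [("chr1", 100_800, 0.9), ("chr1", 200_500, 0.3),
        ("chr1", 300_400, 0.7), ("chr1", 400_100, 0.8)]
tss_list = [("chr1", 402_000)]  # makes the last tumor peak promoter-proximal

kept = winnow(normal_peaks + tumor_peaks, snps, tss_list)
labels = [peak_class(p, normal_peaks, tumor_peaks) for p in kept]
```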
Table 2.3 Distal regulatory regions correlated with CRC tag SNPs
Enhancer | Tag SNP | No. of Correlated SNPs | Chromosome | Start | End | Location | Nearby Expressed Coding and Non-Coding RNAs
1 | rs647161 ASN | 4 | chr5 | 134468409 | 134473214 | CTC-203F4.1 intron | PITX1, CATSPER3, H2AFY
2 | rs647161 ASN | 6 | chr5 | 134474759 | 134478528 | CTC-203F4.1 intron | PITX1, CATSPER3, H2AFY
3* | rs647161 ASN | 4 | chr5 | 134520309 | 134523373 | CTC-203F4.1 intron | PITX1, CATSPER3, H2AFY
4 | rs647161 ASN | 7 | chr5 | 134525698 | 134531612 | CTC-203F4.1 intron | PITX1, CATSPER3, H2AFY
5 | rs647161 ASN | 7 | chr5 | 134543144 | 134548023 | CTC-203F4.1 intron | H2AFY, PITX1
6 | rs647161 ASN | 7 | chr5 | 134511610 | 134516426 | CTC-203F4.1 intron | PITX1, CATSPER3, H2AFY
7 | rs10505477, rs6983267, rs7014346 | 3 | chr8 | 128412778 | 128414859 | RP11-382A18.1 intron | MYC, RP11-382A18.1, RP11-382A18.2, RP11-255B23.3
8* | rs10505477, rs6983267, rs7014346 | 5 | chr8 | 128420412 | 128422114 | RP11-382A18.1 intron | MYC, RP11-382A18.1, RP11-382A18.2, RP11-255B23.3
9* | rs3824999 | 4 | chr11 | 74288844 | 74294943 | Intergenic | POLD3, LIPT2, KCNE3, AP001372.2
10 | rs10774214 ASN | 1 | chr12 | 4378128 | 4379840 | Intergenic | CCND2, C12orf5
11* | rs7136702 or rs11169552 | 1 | chr12 | 50908239 | 50913757 | DIP2B intron | DIP2B, LARP4
12 | rs7136702 or rs11169552 | 2 | chr12 | 50938468 | 50940796 | DIP2B intron | DIP2B, LARP4
13* | rs7136702 or rs11169552 | 2 | chr12 | 51018019 | 51020503 | DIP2B intron | DIP2B, ATF1, LARP4
14* | rs7136702 or rs11169552 | 1 | chr12 | 50973150 | 50974328 | DIP2B intron | DIP2B, LARP4
15 | rs7136702 or rs11169552 | 1 | chr12 | 51012054 | 51014942 | DIP2B intron | DIP2B, ATF1, LARP4
16 | rs11169552, rs7136702 | 3 | chr12 | 51040371 | 51042207 | DIP2B intron | DIP2B, ATF1
17 | rs9929218 | 1 | chr16 | 68740822 | 68742561 | Intergenic | CDH3, CDH1, TMCO7
18* | rs9929218 | 4 | chr16 | 68754658 | 68757192 | Intergenic | CDH1, CDH3, TMCO7
19 | rs9929218 | 5 | chr16 | 68774214 | 68780161 | CDH1 intron | CDH1, CDH3, TMCO7
20 | rs9929218 | 11 | chr16 | 68784044 | 68791839 | CDH1 intron | CDH1, CDH3, TMCO7
21 | rs4939827 | 4 | chr18 | 46448530 | 46450772 | SMAD7 intron | SMAD7, CTIF, DYM, RP11-15F12.1
22 | rs4939827 | 6 | chr18 | 46450800 | 46454601 | SMAD7 intron | SMAD7, CTIF, DYM, RP11-15F12.1
23 | rs10411210 | 6 | chr19 | 33537339 | 33541195 | RHPN2 intron | RHPN2, GPATCH1, C19orf40
24 | rs10411210 | 1 | chr19 | 33530860 | 33533823 | RHPN2 intron | RHPN2, GPATCH1, C19orf40
25 | rs4925386 | 3 | chr20 | 60929861 | 60935447 | LAMA5 intron | LAMA5, RPS21, CABLES2, RP11-157P1.4
26 | rs4925386 | 3 | chr20 | 60938278 | 60941762 | LAMA5 intron | LAMA5, RPS21, CABLES2, RP11-157P1.4
27 | rs4925386 | 6 | chr20 | 60948726 | 60951918 | Intergenic | LAMA5, RPS21, CABLES2
28 | rs4925386 | 6 | chr20 | 60955085 | 60958391 | Intergenic | RPS21, LAMA5, CABLES2
The tag SNP and the correlated SNPs for 28 distal, robust H3K27Ac regions are indicated; the enhancers
that are found only in normal sigmoid colon are indicated with an asterisk. The 3 nearest protein-coding
RNAs and 3 nearest non-coding RNAs were identified using the GENCODE V15 gene annotation; only
those RNAs that are expressed in HCT116 or sigmoid colon cells are shown (see also Supplementary
Data File 2.3).
2.3.3 Effects of SNPs on binding motifs in the distal elements
To determine possible effects of the correlated SNPs on transcription factor binding, I first
analyzed all SNPs having an r² > 0.1 with the 25 CRC tag SNPs. Using position weight matrices from
Factorbook (Wang et al., 2013b), all correlated SNPs that fell within a critical position in a transcription
factor binding motif were identified (Supplementary Data File 2.4). I identified ~800 SNPs that were
predicted to impact binding of a transcription factor to a known motif. However, most of these SNPs are
not in regulatory regions important for CRC. Therefore, I next limited my analysis to the set of correlated
SNPs that fall within the 28 robust enhancers (Supplementary Data File 2.5). I found 80 SNPs that
cause motif changes in a total of 124 motifs, representing binding sites for 40 different transcription
factors. Using RNA-seq data, I found that 36 of these factors are expressed in HCT116 and/or sigmoid
colon cells (Table 2.4), suggesting that perhaps the binding of these factors at the risk-associated
enhancers is influenced by the correlated SNPs. Of the 36 factors, most were expressed at either
approximately the same levels in normal and tumor colon or at higher levels in HCT116 cells than in
normal colon. However, several factors showed large decreases in gene expression in HCT116 as
compared to sigmoid colon cells, including FOS and JUN, which were ~10 fold higher in normal colon,
and HNF4A and ETS1, which were 30-40 fold higher in normal colon (Supplementary Data File 2.6).
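One common way to quantify the effect of a SNP on a binding motif is to score the reference and alternative alleles against a position weight matrix and compare the two scores. The sketch below illustrates that general idea only; the text does not specify how a "critical position" was defined, so the log-odds scoring, the pseudocount, the uniform background, and the toy matrix are all assumptions rather than the method used with the Factorbook matrices.

```python
import math

BASES = "ACGT"


def log_odds_score(pwm, seq, background=0.25, pseudocount=0.01):
    """Sum of per-position log2(probability / background) for seq under pwm."""
    score = 0.0
    for column, base in zip(pwm, seq):
        p = (column[BASES.index(base)] + pseudocount) / (1.0 + 4 * pseudocount)
        score += math.log2(p / background)
    return score


def allele_effect(pwm, ref_seq, alt_seq):
    """Score change caused by the alternative allele (negative = weaker match)."""
    return log_odds_score(pwm, alt_seq) - log_odds_score(pwm, ref_seq)


# Toy 4-position matrix (rows = positions, columns = A, C, G, T); invented values.
pwm = [
    [0.97, 0.01, 0.01, 0.01],  # strongly prefers A
    [0.25, 0.25, 0.25, 0.25],  # uninformative position
    [0.01, 0.01, 0.97, 0.01],  # strongly prefers G
    [0.10, 0.10, 0.10, 0.70],  # prefers T
]

disruptive = allele_effect(pwm, "ACGT", "ACTT")  # G > T at a constrained position
neutral = allele_effect(pwm, "ACGT", "AGGT")     # C > G at the uninformative position
```

A change at a highly constrained position produces a large negative score difference, whereas a change at an uninformative position leaves the score unchanged, which is the intuition behind restricting attention to critical motif positions.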
Table 2.4 Effects of SNPs on motifs in the distal regulatory regions.
AP1 EGR1 MYC/MAX TCF12
AP2 ELF1 NR2C2 TCF7L2
BHLHE40 ELK4 PBX3¹ TEAD1
CEBPB ESRRA PRDM1 THAP1²
CREB1 ETS1 RUNX1 USF1
CTCF GABP RXRA YY1
E2F1 GATA SP1 ZBTB7A³
E2F4 GFI1 SREBF1 ZEB1
EBF1 HNF4A STAT1 ZNF281
Details concerning the impacted motifs (SNP position, sequence of reference and alternative alleles, and
the direction of the effect on the motif) can be found in Supplementary Data File 2.4.
¹ identified by motif UA9, which was the top motif in NANOG hES ChIP-seq data; ² identified by motif
UA2, which was the top motif in PBX3 GM12878 ChIP-seq data; ³ identified by motif UA5; ⁴ identified by
motif UA3, which was the top motif in ZBTB7A K562 ChIP-seq data.
2.3.4 Expression analysis of candidate risk-associated genes
Although the genes identified by the exon or TSS SNPs are clearly good candidate genes for
analysis of their possible role in the development of colon cancer, it is difficult to definitively link a target
gene with a distal enhancer region because enhancers can function in either direction and do not
necessarily regulate the nearest gene. In fact, the ENCODE Consortium recently reported that, on
average, a distal element can physically associate with ~3 different promoter regions (Sanyal et al., 2012).
Also, only 27% of the distal elements showed an interaction with the nearest TSS, although this increased
to 47% when only expressed genes were used in the analysis (Sanyal et al., 2012). Taken together, these
analyses suggest that examining the 3 nearest genes may produce a reasonable list of genes potentially
regulated by the CRC risk-associated enhancers. Therefore, I used the GENCODE V15 dataset and
identified the 3 nearest promoters of coding genes and 3 nearest promoters of non-coding transcripts
around each of the 28 enhancers (Supplementary Data File 2.3). I next limited the nearby coding and
non-coding transcripts to those expressed in either sigmoid colon or HCT116 cells (Table 2.3); I
note that taking expression into account did not greatly change the list of coding transcripts but eliminated
most of the non-coding transcripts, which tend to be expressed in a very cell-type specific manner.
Interestingly, several of the genes near the risk-associated enhancers were also identified in the TSS
analyses, suggesting that a putative causal gene associated with CRC might be differentially regulated by
risk-associated SNPs found in the promoter and in a nearby enhancer (Figure 2.2). I note that in these
cases, the promoters and enhancers were identified by different risk-associated SNPs in high LD with a
tag SNP, with the promoters being identified by SNPs within +/- 2kb of the TSS and enhancers being
identified by distal SNPs. I further analyzed the expression levels of all genes directly linked to the risk
SNPs (by exons or TSS) and the expressed genes near the risk-associated enhancers in normal colon
and HCT116 tumor cells. Shown in Figure 2.2 are the expression levels of each of the 41 transcripts and
the fold change in expression in HCT116 vs. normal cells; several of these genes display robust changes
in expression in the tumor cells.
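The nearest-gene selection described above can be sketched in a few lines of Python. The function name and data layout are hypothetical, the coordinates are invented, and measuring distance from the enhancer midpoint to each TSS is an assumption; in practice the function would be applied once to coding transcripts and once to non-coding transcripts.

```python
def nearest_transcripts(enhancer, transcripts, k=3):
    """Names of the k transcripts whose TSS lies closest to the enhancer midpoint."""
    chrom, start, end = enhancer
    midpoint = (start + end) // 2
    by_distance = sorted(
        (abs(tss - midpoint), name)
        for name, (tss_chrom, tss) in transcripts.items()
        if tss_chrom == chrom
    )
    return [name for _, name in by_distance[:k]]


# Toy annotation, invented for illustration: name -> (chromosome, TSS).
coding = {
    "G1": ("chr1", 90_000),
    "G2": ("chr1", 130_000),
    "G3": ("chr1", 400_000),
    "G4": ("chr1", 101_000),
    "G5": ("chr2", 100_500),  # ignored: different chromosome
}
enhancer = ("chr1", 100_000, 102_000)

nearest = nearest_transcripts(enhancer, coding, k=3)
```

Because the search ignores orientation, transcripts on either side of the enhancer are eligible, consistent with the point that enhancers can act in either direction.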
As a second approach to identify transcripts potentially regulated by the identified enhancers, I
developed a new statistical approach that employs RNA-seq data from TCGA. I selected the 10 nearest
genes 5’ of and the 10 nearest genes 3’ of each of the 28 enhancers. Because of the difference in gene
density in different regions of the genome, the 20-gene span ranged from 786 kb to 7.5 Mb, depending on
the specific enhancer. Because several of the 28 enhancers are clustered near each other, this resulted in a
total of 182 unique genes. I downloaded the RNA-seq data for 233 colorectal tumor samples and 21
colorectal normal samples from the TCGA data download website
(https://tcga-data.nci.nih.gov/tcga/dataAccessMatrix.htm) and determined if any of the 182 genes show a
significant increase or decrease (>2 fold change and P value < 0.01) in colon tumors vs. normal colon (see
Methods and Supplementary Figure 2.6 for an analysis of potential TCGA batch effects). I then
eliminated those genes whose expression change did not correspond to the nature of the enhancer (e.g. a
tumor-specific enhancer should not regulate a gene that is higher in normal cells), leaving a total of 39
possible genes whose expression might be differentially regulated in colon cancer by the risk enhancers
(Table 2.5). I note that 5 of the genes shown to be differentially expressed in the TCGA data (MYC,
PITX1, POU5F1B, C5orf20, and CDH3) are also in the set of nearest 3 genes to an enhancer having CRC
risk-associated SNPs. I found that 0-6 differentially expressed genes were linked to an enhancer using the
TCGA data, with an average of 4 transcripts per enhancer that showed correct differential expression in
colon tumors. Heatmaps of the expression of the 39 putative enhancer-regulated genes, as well as the
expression of the genes identified by exon and promoter SNPs, in the TCGA samples are shown in
Supplementary Figure 2.9. To determine if I could validate any of the putative enhancer targets, I used
eQTL analyses based on data from TCGA. I began by identifying the SNPs within each of the 28
enhancers that are on the Affymetrix Genome-Wide Human SNP Array 6.0 used by TCGA. Unfortunately, these
arrays include only 8% of the SNPs of interest (i.e. the exon, promoter, and enhancer SNPs that are correlated with the
CRC tag SNPs), greatly limiting my ability to effectively utilize the eQTL methodology. However, I did
identify two examples of allelic expression differences in the set of putative enhancer targets that
correlated with SNPs in an enhancer region. Both of these SNPs fell within enhancer 19 and showed
correlation with allelic expression differences of the TMED6 gene (the two SNPs significantly associated
with TMED6 expression, rs7203339 and rs1078621, had an FDR-adjusted P value < 0.1); enhancer 19 falls
within an intron of the CDH1 gene, which is 600 kb from the transcription start site of the TMED6 gene
(Figure 2.3). A summary of the eQTL analysis of enhancer and promoter risk-associated SNPs can be
found in Supplementary Data File 2.7 and Supplementary Figure 2.10.
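The selection of candidate targets from the TCGA data (a fold change greater than 2, a P value below 0.01, and a direction of change consistent with the type of enhancer) can be sketched as below. The function name, the labels for enhancer types, and the toy values are hypothetical. Expressing the fold change on a log2 scale is also an assumption: the values in Table 2.5 all exceed 1 in magnitude, which would be consistent with log2 fold changes, but the text does not state the scale.

```python
def consistent_targets(genes, enhancer_type, min_abs_log2fc=1.0, max_p=0.01):
    """Genes passing the fold-change and P-value cutoffs whose direction of
    change matches the enhancer type ('tumor', 'normal', or 'common').

    genes: iterable of (name, log2 fold change tumor vs. normal, P value).
    """
    kept = []
    for name, log2fc, p_value in genes:
        if abs(log2fc) <= min_abs_log2fc or p_value >= max_p:
            continue  # not a significant >2-fold change
        if enhancer_type == "tumor" and log2fc < 0:
            continue  # a tumor-specific enhancer should not drive a gene higher in normal
        if enhancer_type == "normal" and log2fc > 0:
            continue  # a normal-specific enhancer should not drive a gene higher in tumor
        kept.append(name)
    return kept


# Toy values, invented for illustration.
genes = [
    ("GENE_UP", 1.6, 0.001),
    ("GENE_DOWN", -2.1, 0.0005),
    ("GENE_WEAK", 0.4, 0.0001),  # significant but below the fold-change cutoff
    ("GENE_NOISY", 3.0, 0.2),    # large change but not significant
]

tumor_targets = consistent_targets(genes, "tumor")
normal_targets = consistent_targets(genes, "normal")
common_targets = consistent_targets(genes, "common")
```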
Figure 2.3 Linking a transcript to an enhancer using TCGA data.
A) Shown is the location of enhancer 19 and the position of the three SNPs (in red) identified in the
eQTL studies and two other SNPs (in blue) identified by the FunciSNP analysis but not present on the
SNP array, in relation to the H3K27Ac, RNA-seq, and TCF7L2 ChIP-seq data for that region. Also shown
are the ENCODE ChIP-seq transcription factor tracks from the UCSC genome browser. B) The expression
of the TMED6 RNA is shown for samples having homozygous or heterozygous alleles for 3 SNPs in
enhancer 19. The upper and lower quartiles of the box plots are the 75th and 25th percentiles,
respectively. The whisker top and bottom are the 90th and 10th percentiles, respectively. The horizontal
line through the box is the median value. The P value corresponds to the regression coefficient based on
the residual expression level and the germline genotype. Sample size is listed under each genotype. C)
A schematic of the gene structure in the genomic region around enhancer 19 (yellow box) is shown; the
arrows indicate the direction of transcription of each gene. The 3 genes in the
enhancer 19 region that showed differential expression in normal vs tumor colon samples (Table 2.5) are
indicated; of these, only TMED6 was identified in the eQTL analysis.
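The regression of expression on germline genotype mentioned in the caption of Figure 2.3 can be illustrated with an ordinary least-squares slope, coding each sample by the number of alternative alleles it carries (0, 1, or 2). This is a minimal sketch under that additive-coding assumption; the function name and the toy values are hypothetical, and neither the covariate adjustment that produces residual expression levels nor the P value and FDR calculation is shown.

```python
def genotype_slope(dosages, expression):
    """Least-squares slope of expression on allele dosage (0, 1, or 2)."""
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(dosages, expression))
    if sxx == 0:
        raise ValueError("all samples share one genotype; slope is undefined")
    return sxy / sxx


# Toy data, invented for illustration: six samples, two per genotype class.
dosages = [0, 0, 1, 1, 2, 2]
residual_expression = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]

slope = genotype_slope(dosages, residual_expression)
```

A positive slope indicates that expression rises with each additional copy of the alternative allele; a slope near zero indicates no allelic effect.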
Table 2.5 Linking transcripts to enhancers using TCGA data.
Region | Enhancer | Correlated transcripts
1 | Enhancer1 | PITX1 (1.37_L1), C5orf20 (-1.96_R2), TIFAB (-1.64_R3), CXCL14 (-1.39_R5), SLC25A48 (-1.34_R7)
1 | Enhancer2 | PITX1 (1.37_L1), C5orf20 (-1.96_R2), TIFAB (-1.64_R3), CXCL14 (-1.39_R5), SLC25A48 (-1.34_R7)
1 | Enhancer3* | C5orf20 (-1.96_R2), TIFAB (-1.64_R3), CXCL14 (-1.39_R5), SLC25A48 (-1.34_R7)
1 | Enhancer4 | PITX1 (1.37_L2), C5orf20 (-1.96_R1), TIFAB (-1.64_R2), CXCL14 (-1.39_R4), SLC25A48 (-1.34_R6), TGFBI (2.74_R10)
1 | Enhancer5 | PITX1 (1.37_L2), C5orf20 (-1.96_R1), TIFAB (-1.64_R2), CXCL14 (-1.39_R4), SLC25A48 (-1.34_R6), TGFBI (2.74_R10)
1 | Enhancer6 | PITX1 (1.37_L1)
2 | Enhancer7 | SQLE (1.8_L6), FAM84B (1.01_L2), POU5F1B (3.02_L1), MYC (1.58_R2), PVT1 (2.44_R3), GSDMC (1.75_R4)
2 | Enhancer8* | none
3 | Enhancer9* | ARRB1 (-1.15_R8)
4 | Enhancer10 | NRIP2 (-1.07_L10), FOXM1 (1.47_L9), TEAD4 (2.15_L6), RAD51AP1 (1.36_R5), GALNT8 (-1.58_R9), KCNA6 (-2.16_R10)
5 | Enhancer11* | LIMA1 (-1.16_L4), METTL7A (-2.65_R3), POU6F1 (-1.14_R9)
5 | Enhancer12 | RACGAP1 (1.06_L10), ASIC1 (1.49_L9), LIMA1 (-1.16_L4), METTL7A (-2.65_R3), POU6F1 (-1.14_R9)
5 | Enhancer13* | LIMA1 (-1.16_L4), METTL7A (-2.65_R3), POU6F1 (-1.14_R9)
5 | Enhancer14* | LIMA1 (-1.16_L4), METTL7A (-2.65_R3), POU6F1 (-1.14_R9)
5 | Enhancer15 | RACGAP1 (1.06_L10), ASIC1 (1.49_L9), LIMA1 (-1.16_L4), METTL7A (-2.65_R3), POU6F1 (-1.14_R9)
5 | Enhancer16 | RACGAP1 (1.06_L10), ASIC1 (1.49_L9)
6 | Enhancer17 | SMPD3 (-1.3_L4), CDH3 (6.24_L2), TMED6 (-1.36_R10)
6 | Enhancer18* | SMPD3 (-1.3_L4), TMED6 (-1.36_R10)
6 | Enhancer19 | SMPD3 (-1.3_L4), CDH3 (6.24_L2), TMED6 (-1.36_R10)
6 | Enhancer20 | SMPD3 (-1.3_L4), CDH3 (6.24_L2), TMED6 (-1.36_R10)
7 | Enhancer21 | KATNAL2 (-1.06_L9), ZBTB7C (-2.81_L3), LIPG (1.3_R6), ACAA2 (-1.43_R7)
7 | Enhancer22 | KATNAL2 (-1.06_L9), ZBTB7C (-2.81_L3), LIPG (1.3_R6), ACAA2 (-1.43_R7)
8 | Enhancer23 | CHST8 (-2_R9), KCTD15 (-1.02_R10)
8 | Enhancer24 | CHST8 (-2_R9), KCTD15 (-1.02_R10)
9 | Enhancer25 | RBBP8NL (1.33_R3), C20orf166-AS1 (-3.81_R6), SLCO4A1 (3.19_R7), LOC100127888 (2.45_R8), NTSR1 (-2.34_R9), MRGBP (1.44_R10)
9 | Enhancer26 | RBBP8NL (1.33_R2), C20orf166-AS1 (-3.81_R5), SLCO4A1 (3.19_R6), LOC100127888 (2.45_R7), NTSR1 (-2.34_R8), MRGBP (1.44_R9)
9 | Enhancer27 | RBBP8NL (1.33_R3), C20orf166-AS1 (-3.81_R6), SLCO4A1 (3.19_R7), LOC100127888 (2.45_R8), NTSR1 (-2.34_R9), MRGBP (1.44_R10)
9 | Enhancer28 | RBBP8NL (1.33_R2), C20orf166-AS1 (-3.81_R5), SLCO4A1 (3.19_R6), LOC100127888 (2.45_R7), NTSR1 (-2.34_R8), MRGBP (1.44_R9)
Shown are the subset of the 10 nearest 5’ and 10 nearest 3’ transcripts for each enhancer that show
significant gene expression differences in normal vs. tumor samples, as determined using RNA-seq data
from TCGA. The numbers in parentheses indicate the fold change, with positive indicating a higher
expression in tumors. The 7 normal-specific enhancers are marked with an asterisk (as in Table 2.3) and
all genes correlated with these enhancers should be expressed higher in normal cells and thus have a
negative value. The R vs L designation indicates the direction and relative location of the transcript
with respect to each enhancer (e.g. R7 indicates that it is the 7th closest transcript to the enhancer on
the “right” side).
2.3.5 The effect of enhancer deletion on the transcriptome
The expression analyses described above provide a list of genes that potentially are regulated by
the CRC risk-associated enhancers. However, it is possible that the enhancers regulate only a subset of
those genes and/or the target genes are at a greater distance than was analyzed. One approach to identify
targets of the CRC risk-associated enhancers would be to delete an enhancer from the genome and
determine changes in gene expression. As an initial test of this method, enhancer 7 (located at 8q24) was
selected. The region encompassing this enhancer has previously been implicated in regulating expression
of MYC (Sur et al., 2012), which is located 335 kb from enhancer 7. Guide RNAs that flanked enhancer 7,
along with Cas9, were introduced into HCT116 cells, and cells were identified that showed deletion of the
enhancer. Expression analysis using gene expression arrays was performed, identifying 105 genes whose
expression was down-regulated in the cells having a deleted enhancer (Supplementary Data File 2.8);
the closest one was MYC, which was expressed 1.5 times higher in control vs. deleted cells (Figure 2.4).
Figure 2.4 Identification of genes affected by deletion of enhancer 7.
(A) Shown are the expression differences (x axis) and the significance of the change (y axis) of the
genes in the control HCT116 cells vs. HCT116 cells having complete deletion of enhancer 7. The
Illumina Custom Differential Expression Algorithm was used to determine P values to identify the
significantly altered genes; 3 replicates each for the control and deleted cells were used. Genes on
chromosome 8 (the location of enhancer 7) are shown in blue. The spot representing the MYC gene is
indicated by the arrow. (B) Shown are all genes on chromosome 8 that change in expression and the 10
genes showing the largest changes in expression upon deletion of enhancer 7. The location of the
enhancer is indicated and the chromosome number is shown on the outside of the circle. (C) The genes
identified as potential targets using TCGA expression data are indicated; of these, MYC is the only one
showing a change in gene expression upon deletion of the enhancer.
2.4 Discussion
I used the program FunciSNP (Coetzee et al., 2012), in combination with genomic, epigenomic,
and transcriptomic data, to analyze 25 tag SNPs (and all SNPs in high LD with those tag SNPs) that have
been associated with an increased risk for CRC (Zanke et al., 2007, Tomlinson et al., 2008, Tenesa et al.,
2008, Peters et al., 2013, Jia et al., 2013, Houlston et al., 2008, Houlston et al., 2010, Dunlop et al., 2012).
Taken together, I identified a total of 80 genes that may be regulated by risk-associated SNPs. Of these,
24 are directly linked to a gene via a SNP within an exon or proximal promoter region and 56 additional
genes are putative target genes of risk-associated enhancers; see Figure 2.5 for a schematic summary of
the location of the tag and LD SNPs and associated genes and Supplementary Data File 2.10 for a
complete list of genes and how they were identified.
Figure 2.5 Summary of identified candidate genes correlated with increased risk for CRC.
Shown are the 80 candidate genes identified in this study. For the gene names, green means that the
gene was only identified as a potential enhancer target; the other genes were identified as direct
targets either by an exon SNP or a TSS SNP. The putative enhancer target genes were selected as
described in the text. For each tag SNP, the relative number of SNPs that identified an exon (red
portion), a TSS (blue portion), or an enhancer (green portion) is shown by the bar graph. The 9
genomic regions that harbor CRC risk enhancers are shown by the green rectangles outside the circle.
[Figure 2.5 image: circular plot labeled with chromosome numbers, the 25 tag SNPs, and the 80 candidate genes.]
Of the 25 tag SNPs, only one is found within a coding exon, occurring in the third exon of the
MYNN gene and resulting in a synonymous change that does not lead to a coding difference. However, by
analysis of SNPs in high LD with the 25 tag SNPs, I identified 5 genes that harbor damaging SNPs and
which are expressed in colon cells (HCT116, normal sigmoid colon, or TCGA tumors); these are
POU5F1B, RHPN2, UTP23, LAMA5, and FAM186A. Interestingly, the retrogene POU5F1B, which
encodes a homolog of the stem cell regulator OCT4, has recently been associated with prostate cancer
susceptibility (Breyer et al., 2014). I also identified 23 genes (21 coding and 2 non-coding) that harbor
highly correlated SNPs in their promoter regions and are expressed in colon cells. Several of the genes
that were linked to increased risk for CRC by virtue of promoter SNPs show large changes in gene
expression in tumor vs. normal colon tissue. For example, TERC, the non-coding RNA that is a
component of the telomerase complex, was identified by a promoter SNP and has higher expression in a
subset of colon tumors (Supplementary Figure 2.9A). Similarly, CDH3 (P-cadherin) was identified by a
promoter SNP and shows increased expression in many of the colon tumors. Both TERC and CDH3 have
previously been linked to cancer (Cao et al., 2008, Paredes et al., 2012). Promoter SNPs also identified
three uncharacterized protein coding genes (C11orf93, C11orf92 and C11orf53) clustered together on
chromosome 11. Inspection of H3K4me3 and H3K27Ac ChIP-seq signals suggested that these genes are
in open chromatin in normal sigmoid colon, but not in HCT116. Accordingly, the TCGA gene expression
data showed that all 3 genes are down-regulated in a subset of human CRC tumors (Supplementary
Figure 2.9A). Additional genes identified by promoter SNPs that have been linked to cancer include
ATF1, BMP4, CDH1, CDKN1A, EIF3H, GREM1, LAMA5, and RHPN2 (Zhang et al., 2008, Paredes et
al., 2012, Li et al., 2012n, Li et al., 2013e, Karagiannis et al., 2013, Huang et al., 2012, Hu et al., 2013,
Deng et al., 2007, Danussi et al., 2013, Carneiro et al., 2013, Garte, 1993, Cheung et al., 2013). For
example, BMP4 is up-regulated in the HCT116 cells and has been suggested to confer an invasive
phenotype during progression of colon cancer (Deng et al., 2007). Interestingly, I also identified
GREM1, an antagonist of BMP proteins, and showed that expression of GREM1 is decreased in HCT116.
The down-regulation of the antagonist GREM1 and the up-regulation of the cancer-promoting BMP4 may
cooperate to drive colon cancer progression. LAMA5 is a subunit of laminin-10, laminin-11 and laminin-
15. Laminins, a family of extracellular matrix glycoproteins, are the major non-collagenous constituent of
basement membranes and have been implicated in a wide variety of biological processes including cell
adhesion, migration, signaling, and metastasis (Aumailley, 2013).
I identified 28 enhancers, clustered in 9 genomic regions, that harbor correlated SNPs. It is
important to note that these studies have used the appropriate cell types and the appropriate epigenetic
mark to identify CRC-associated enhancers. Previous analyses have attempted to link SNPs to enhancers
by using transcript abundance, epigenetic marks, or transcription factor binding from non-colon cell types
(Carvajal-Carmona et al., 2011). In contrast, I have used normal and tumor cells from the colon. Of
equal importance is the actual epigenetic mark that is used to identify enhancers. A previous study used
H3K4me1 to identify genomic regions that were differentially marked between normal and tumor colon
cells (Akhtar-Zaidi et al., 2012). However, although H3K4me1 is associated with enhancer regions, this
mark does not specifically identify active enhancers. Some regions marked by H3K4me1 are classified as
“weak” or “poised” enhancers and it is thought that these regions may become active in different cells or
developmental states (Ernst and Kellis, 2010). In contrast, H3K27Ac is strongly associated with active
enhancers (Calo and Wysocka, 2013, Bonn et al., 2012) and I feel that this mark is the most appropriate
one for identification of CRC-associated risk enhancers.
Although it is not possible to conclusively know a priori what gene is regulated by each of the
identified enhancers, I have derived a list of putative CRC risk-associated enhancer target genes by
examining gene expression data from HCT116 cells and from a large number of colon tumors. Several of
the genes that are possible enhancer targets are transcription factors that have previously been linked to
cancer, including H2AFY, MYC, SMAD7, PITX1, TEAD4, and ZBTB7C. MYC, of course, has been
linked to colon cancer by many studies because it is a downstream mediator of WNT signaling,
which is strongly correlated with colon cancer (Fearon, 2011). In addition, PITX1, TEAD4, and
ZBTB7C are all transcription factors that have been previously linked to the control of cell proliferation,
specification of cell fate, or regulation of telomerase activity (Home et al., 2012, Jeon et al., 2012, Knosel
et al., 2012, Qi et al., 2011). Also, PVT1 is a Myc-regulated non-coding RNA that may play a role in
neoplasia (Guan et al., 2007, Huppi et al., 2012).
In conclusion, I have used epigenomic and transcriptome information from normal and tumor
colon cells to identify a set of genes that may be involved in an increased risk for the development of
colon cancer. I realize that I cast a rather large net by analyzing 10 genes 5’ and 10 genes 3’ of each
enhancer. I note that 5 of the genes shown to be differentially expressed in the TCGA data (MYC, PITX1,
POU5F1B, C5orf20, and CDH3) are also in the set of nearest 3 genes to an enhancer having CRC risk-
associated SNPs. However, enhancers can also work at large distances. In fact, the eQTL analysis
identified TMED6 as a potential target of enhancer 19 (over 600 kb away) and deletion of enhancer 7
identified MYC as a potential target (335 kb away). Future analyses of the entire set of CRC risk-
associated enhancers are required to confirm the additional putative long range regulatory loops suggested
by our studies. Such studies will provide a high confidence list of genes which, when combined with the
genes identified by the TSS risk-associated SNPs, should be prioritized for analysis in tumorigenicity
assays.
2.5 Methods
2.5.1 RNA-seq
RNA-seq data was downloaded from the Reference Epigenome Mapping Center for analysis of
gene expression in sigmoid colon cells (GSM1010974 and GSM1010942). For HCT116 colon cancer
cells, RNA was prepared using Trizol (Life Technologies, Carlsbad, CA), and paired-end libraries were
prepared using the Illumina TruSeqV2 Sample Prep Kit (Catalog #15596-026), starting with 1 µg total
RNA. Libraries were barcoded, pooled, and sequenced using an Illumina HiSeq. For analysis of RNA-seq
data, I used Cufflinks (Trapnell et al., 2010), a program of “alignment to annotation” having
discontinuous mapping to the reference genome. mRNA abundance was measured by calculating FPKM
(expected fragments per kilobase of transcript per million fragments sequenced), to allow inter-sample
comparisons. I specified the –G option with the GENCODE V15 comprehensive annotation so that the
program uses only alignments that are structurally compatible with the reference transcripts provided.
Two biological replicates were performed and the mean FPKM of two biological replicates represents the
expression of each gene (GSM1266733 and GSM1266734). I categorized genes into non-expressed, low
expressed, and expressed based on the distribution of the Gene FPKM (Supplementary Figure 2.2)
generated by the R package “ggplot2”.
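The FPKM arithmetic and the averaging of the two replicates can be illustrated with a short Python sketch. This is not the Cufflinks implementation (Cufflinks estimates expected fragment counts with a statistical model); the function names and all input numbers are hypothetical and only show the normalization by transcript length and library size.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


def mean_of_replicates(values):
    """Average expression across biological replicates."""
    return sum(values) / len(values)


# Hypothetical example: one 2 kb transcript measured in two replicate libraries.
rep1 = fpkm(fragments=400, transcript_length_bp=2_000, total_mapped_fragments=20_000_000)
rep2 = fpkm(fragments=600, transcript_length_bp=2_000, total_mapped_fragments=25_000_000)
gene_expression = mean_of_replicates([rep1, rep2])
print(rep1, rep2, gene_expression)
```

Dividing by both transcript length and library size is what makes values comparable between transcripts and between samples.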
RNA-seq data for 233 colorectal tumor samples and 21 colorectal normal samples was
downloaded from the TCGA data download website
(https://tcga-data.nci.nih.gov/tcga/dataAccessMatrix.htm); Supplementary Data File 2.9. The data were all generated
on the Illumina HiSeq platform, and mapped with the RSEM algorithm and normalized so that the third
quartile for each sample equals 1000. Entrez gene IDs were used for mapping to genomic locations using
GenomicRanges (http://www.bioconductor.org/packages//2.12/bioc/html/GenomicRanges.html). To
identify transcripts differentially expressed in the tumor samples, I selected the 10 nearest genes 5’ of and
the 10 nearest genes 3’ of each of the 28 enhancers. After removing the non-expressed genes, I then log2-
transformed the expression data [log2(RSEM+1)], and performed a t test on gene expression between the
normal group and the tumor group for each gene using 254 TCGA colorectal RNAseq datasets. I selected
genes that showed a statistically significant 2-fold change in expression (P <0.01, after
adjustment by the Benjamini-Hochberg false discovery rate method).
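The selection step can be sketched in Python as follows. The sketch assumes that the raw P values from the per-gene t tests are already available, and it interprets the 2-fold criterion as an absolute difference of more than 1 between the mean log2(RSEM+1) values of the two groups; the gene names and all numbers are hypothetical.

```python
import math


def log2_rsem(value):
    """log2(RSEM + 1) transform used before testing."""
    return math.log2(value + 1)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_differential(genes, max_adj_p=0.01, min_abs_log2_diff=1.0):
    """genes: list of (name, mean_log2_normal, mean_log2_tumor, raw_p)."""
    adjusted = benjamini_hochberg([g[3] for g in genes])
    return [
        name
        for (name, normal, tumor, _), adj_p in zip(genes, adjusted)
        if adj_p < max_adj_p and abs(tumor - normal) > min_abs_log2_diff
    ]


# Hypothetical genes: (name, mean log2 normal, mean log2 tumor, raw P value).
genes = [
    ("GENE_A", log2_rsem(100), log2_rsem(450), 0.001),
    ("GENE_B", log2_rsem(200), log2_rsem(260), 0.0005),
    ("GENE_C", log2_rsem(50), log2_rsem(400), 0.02),
    ("GENE_D", log2_rsem(300), log2_rsem(310), 0.5),
]
selected = select_differential(genes)
print(selected)
```

In this example GENE_B is significant but changes less than 2-fold, and GENE_C changes strongly but does not pass the adjusted P value threshold, so only GENE_A is kept.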
To generate the heatmap showing expression of genes in the TCGA samples, I log2-transformed the
expression data of the 254 TCGA colorectal RNAseq samples [log2(RSEM+1)]. Then I computed the mean
and standard deviation of the expression of each gene ($\mu_g$ and $\sigma_g$). I normalized gene
expression by [$Z = (x - \mu_g)/\sigma_g$], where $x$ is the log2-transformed expression of the gene in a
given sample. Hierarchical clustering with Ward’s method was applied to the normalized
TSS/exon gene expression.
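The per-gene normalization can be sketched in Python as below. The use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not state which estimator was used, and the input values are hypothetical.

```python
import statistics


def z_scores(values):
    """Center one gene's log2 expression values and scale them to unit SD."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample SD; an assumption, see lead-in
    return [(v - mu) / sigma for v in values]


# Hypothetical log2(RSEM + 1) values of one gene across three samples.
normalized = z_scores([1.0, 2.0, 3.0])
print(normalized)
```

After this step every gene has mean 0 and standard deviation 1 across samples, so genes with very different absolute expression levels contribute comparably to the clustering.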
2.5.2 ChIP-seq analysis
Two replicate H3K27Ac ChIP-seq datasets from HCT116 cells (ENCODE accession number
wgEncodeEH002873) and two replicate H3K27Ac ChIP-seq datasets from normal sigmoid colon
(www.genboree.org/EdaccData/Current-Release/sampleexperiment/-
Sigmoid_Colon/Histone_H3K27ac/) were analyzed using the Sole-search ChIP-seq peak calling
program (Blahnik et al., 2010, Blahnik et al., 2011) using the following parameters (Permutation:5;
Fragment:250; AlphaValue: 0.00010 = 1.0E-4; FDR: 0.00010 = 1.0E-4; PeakMergeDistance:0;
HistoneBlurLength:1200). Each dataset was analyzed separately and also analyzed as a merged dataset
for HCT116 or sigmoid colon. The merged H3K27Ac peaks from HCT116 or sigmoid colon were
analyzed using the GenomicRanges package of Bioconductor to identify promoter vs. distal peaks.
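As an illustration of the promoter vs. distal split, the following Python sketch labels a peak by its distance to the nearest TSS. It is not the GenomicRanges code used for the analysis; the 2 kb promoter window, the function name, and the coordinates are assumptions chosen only for the example, because this section does not state the distance cutoff.

```python
def classify_peak(peak_start, peak_end, tss_positions, promoter_window=2_000):
    """Return 'promoter' if the peak lies within promoter_window bp of any TSS, else 'distal'."""
    for tss in tss_positions:
        if peak_start <= tss <= peak_end:
            distance = 0  # the TSS falls inside the peak
        else:
            distance = min(abs(tss - peak_start), abs(tss - peak_end))
        if distance <= promoter_window:
            return "promoter"
    return "distal"


# Hypothetical peaks and a single hypothetical TSS on the same chromosome.
tss_list = [11_000]
near_peak = classify_peak(10_000, 10_500, tss_list)
far_peak = classify_peak(50_000, 50_400, tss_list)
print(near_peak, far_peak)
```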
2.5.3 Enhancer deletion
Guide RNAs designed to recognize chr8: 128412821-128412843 and chr8: 128414816-
128414838 (hg19) were cloned into a gRNA cloning vector (Addgene plasmid 41824) and introduced
into HCT116 cells by transfection, along with a plasmid encoding Cas9 and GFP. Cells were sorted using
a flow cytometer to capture the cells having high GFP signals and then colonies were grown from single
cells. Complete deletion of all alleles for enhancer 7 was confirmed by PCR using primers flanking the
enhancer. RNA analysis was performed in triplicate using HumanHT-12 v4 Expression BeadChip arrays
(Illumina), comparing the deleted cells to parental HCT116 cells.
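The approximate size of the targeted region follows directly from the two guide RNA target coordinates given above. The Python sketch below only does this arithmetic; the exact deletion endpoints depend on where Cas9 cuts within each target site, which is not stated here, so the numbers are bounds rather than the measured deletion size.

```python
def parse_region(region):
    """Parse a string such as 'chr8: 128412821-128412843' into (chrom, start, end)."""
    chrom, coords = region.replace(" ", "").split(":")
    start, end = (int(x) for x in coords.split("-"))
    return chrom, start, end


left = parse_region("chr8: 128412821-128412843")
right = parse_region("chr8: 128414816-128414838")

guide_site_length = left[2] - left[1] + 1  # length of one target site
outer_span = right[2] - left[1] + 1        # first base of left site to last base of right site
inner_gap = right[1] - left[2] - 1         # bases strictly between the two sites
print(guide_site_length, outer_span, inner_gap)
```

The two target sites therefore bracket a region of roughly 2 kb (hg19 coordinates).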
2.5.4 Analysis of FunciSNP and correlated SNPs effects
To identify SNPs correlated with the 25 CRC tag SNPs and that overlap with chromatin
biofeatures, I used the R package for FunciSNP (Coetzee et al., 2012), which is available in
Bioconductor. As biofeatures, I used H3K27Ac ChIP-seq data from HCT116 cells and sigmoid colon tissue,
together with exon, intron, UTR, and TSS annotations generated from GENCODE V15. I ran
FunciSNP with the following parameters: +/- 200 kb around each of the 25 tag SNPs and $r^2$ > 0.1. To
analyze the potential effects of correlated SNPs on protein coding, I employed SnpEff and Provean
using suggested default parameters. For analysis of the effects of SNPs on transcription factor motifs, I
employed a method developed by Dennis Hazelett (personal communication).
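The window and linkage disequilibrium thresholds can be illustrated with a few lines of Python. This is not the FunciSNP interface (FunciSNP is an R package); the function name, SNP identifiers, positions, and r-squared values are hypothetical, and only the two cutoffs come from the text.

```python
def correlated_snps(tag_position, candidates, window_bp=200_000, min_r2=0.1):
    """candidates: list of (snp_id, position, r2_with_tag) on the tag SNP's chromosome."""
    return [
        snp_id
        for snp_id, position, r2 in candidates
        if abs(position - tag_position) <= window_bp and r2 > min_r2
    ]


# Hypothetical tag SNP at position 1,000,000 and four hypothetical candidates.
candidates = [
    ("snp_a", 1_050_000, 0.85),  # inside the window, correlated
    ("snp_b", 1_150_000, 0.05),  # inside the window, r2 too low
    ("snp_c", 1_250_000, 0.90),  # correlated, but outside the window
    ("snp_d", 820_000, 0.12),    # inside the window, weakly correlated
]
kept = correlated_snps(1_000_000, candidates)
print(kept)
```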
2.5.5 Batch effects analysis
I note that TCGA has strict sample criteria. Each frozen primary tumor specimen has a
companion normal tissue specimen which could be blood/blood components (including DNA extracted at
the tissue source site), or adjacent normal tissue taken from greater than 2 cm from the tumor. Each tumor
and adjacent normal tissue specimen (if available) were embedded in optimal cutting temperature (OCT)
medium and a histologic section was obtained for review. Each H&E stained case was reviewed by a
board-certified pathologist to confirm that the tumor specimen was histologically consistent with colon
adenocarcinoma and the adjacent normal specimen contained no tumor cells. The tumor sections were
required to contain an average of 60% tumor cell nuclei (TCGA has found that this provides a sufficient
proportion so that the tumor signal can be distinguished from other cells), with less than 20% necrosis for
inclusion in the study per TCGA protocol requirements. To assess potential batch effects, I applied
MBatch software, which was developed by the MD Anderson Cancer Center and has been widely used to
address batch effects in the TCGA Consortium (Fearon, 2011, TheCancerGenomeAtlas, 2012f), to perform
hierarchical clustering and Principal Component Analysis (PCA) on
the colorectal TCGA data sets: level 3 mRNA expression (RNA-seq Illumina HiSeq), level 3 DNA
methylation (Infinium HM450K microarray), and level 4 SNPs CNV by gene (GW SNP 6). I assessed batch
effects for two variables: batch ID and tissue source site. For hierarchical clustering, MBatch uses the
average linkage algorithm with 1 minus the Pearson correlation coefficient as the dissimilarity measure.
The samples were clustered after labeling with different colors, each of which corresponds to a batch ID
or a tissue source site (Supplementary Figure 2.6.1, 2.7.1, and 2.8.1). For PCA, MBatch plotted four
principal components (Supplementary Figure 2.6.2, 2.6.3, 2.7.2, 2.7.3, 2.8.2, and 2.8.3). Samples with the
same batch ID (or tissue source site) were labeled with the same color and shape and were connected to the
batch centroids. The centroids were computed by taking the mean across all samples in the same batch. To
assess batch effects on mRNA expression (Supplementary Figure 2.6), genes with zero values were
removed and normalized gene expression values were log2-transformed before analyzing batch effects.
Batches 132 and 154 stood out in one comparison (Comp1 vs Comp2) but not in the other comparisons
(Supplementary Figure 2.6.2). The remaining batches or tissue source sites did not stand out in
clustering or in any of the PCA plots; thus the data is not supportive of a strong batch effect and all data
was used for analysis. When batch effect on CNV (Supplementary Figure 2.7) was analyzed, the
centroid for the NH tissue source site stood out among other batches. The remaining batches or tissue
source sites did not stand out in clustering or in any of the PCA plots. I did not apply correction on the
data because (i) the NH tissue source site contained only two samples and a centroid calculated from only
two samples is likely not accurate, (ii) the two samples within the NH batch were not far from other
individual samples, and (iii) two samples would not dramatically affect our analysis of 233 samples. When
assessing batch effects on DNA methylation analysis, no batches or tissue source sites stood out in
clustering or in any of the PCA plots (Supplementary Figure 2.8). In summary, none of the samples
consistently show batch effects in
both clustering and PCA algorithms. Based on the above analysis, I believe that batch effects among the
data sets are not dramatically influencing our analysis.
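Two quantities from this analysis are easy to show concretely: the dissimilarity used for hierarchical clustering (1 minus the Pearson correlation coefficient) and the batch centroids drawn on the PCA plots (the mean of all samples in a batch). The Python sketch below is not MBatch code; the function names, sample profiles, principal component scores, and batch labels are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def dissimilarity(x, y):
    """1 - Pearson correlation: 0 for perfectly correlated profiles, 2 for anti-correlated."""
    return 1.0 - pearson(x, y)


def batch_centroids(points, batch_ids):
    """Mean coordinate of the samples in each batch; points are tuples such as PCA scores."""
    groups = {}
    for point, batch in zip(points, batch_ids):
        groups.setdefault(batch, []).append(point)
    return {
        batch: tuple(sum(axis) / len(axis) for axis in zip(*members))
        for batch, members in groups.items()
    }


# Hypothetical scores on two principal components for four samples in two batches.
scores = [(1.0, 2.0), (3.0, 4.0), (-2.0, 0.0), (-4.0, 2.0)]
centroids = batch_centroids(scores, ["batch_1", "batch_1", "batch_2", "batch_2"])
print(dissimilarity([1, 2, 3], [2, 4, 6]), centroids)
```

A batch whose centroid lies far from the others on several component pairs would suggest a batch effect; a centroid based on very few samples is unstable, which is the concern raised for the NH tissue source site.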
2.5.6 eQTL analyses
I employed a two-step linear regression model which considers germline genotype, somatic copy
number variation, and DNA methylation at gene promoters to perform eQTL analysis (Li et al., 2013a). I
selected 228 patients with both tumor samples and matched normal blood or normal tissue samples from
the TCGA colorectal cancer data set. For each of these patients, I obtained the germline genotypes from
normal blood or normal tissue samples using data from the GW SNP6 array platform. I directly
downloaded gene-level somatic copy number, gene isoform expression (from the RNAseqHiseq Illumina
platform) and DNA methylation data (from the HM450K platform) for each tumor sample from the
TCGA data download website
(http://gdac.broadinstitute.org/runs/analyses__2014_01_15/data/COAD/20140115/). To determine DNA
methylation of a promoter, I calculated the average DNA methylation over the region from 100 bp
upstream to 700 bp downstream of the TSS of a transcript. I fit the germline genotype of patients, the
continuous DNA methylation level of promoters, and the CNV of matched tumor samples into the two-step multivariate
linear regression model. 60 SNPs, including 6 tag SNPs, 18 SNPs within risk enhancers, and 45 SNPs
within TSS regions, were present on the GW SNP6 array. eQTL analyses were performed using these 60
SNPs and the genes identified by exon or TSS SNPs or by differential expression analysis (see Table 2.2
and Table 2.5). To reduce false positives, I excluded genes showing log2 expression less than 2 in over
90% of the samples. The Benjamini-Hochberg method was used to correct the original P values, and an FDR
of 0.1 was used as the threshold of significant association.
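One plausible reading of a two-step model of this kind is sketched below in Python: step 1 regresses tumor expression on copy number and promoter methylation, and step 2 regresses the residual expression on germline genotype. This is an assumption about the general structure, not the exact specification of Li et al. (2013a); significance testing is omitted, the helper names are hypothetical, and the data are synthetic.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(predictors, response):
    """Least squares with an intercept; returns [intercept, coefficient_1, ...]."""
    columns = [[1.0] * len(response)] + [list(col) for col in predictors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * y for u, y in zip(ci, response)) for ci in columns]
    return solve_linear_system(xtx, xty)


def two_step_eqtl(expression, copy_number, methylation, genotype):
    # Step 1: remove the part of expression explained by copy number and methylation.
    b0, b_cnv, b_meth = ols([copy_number, methylation], expression)
    residuals = [
        y - (b0 + b_cnv * c + b_meth * m)
        for y, c, m in zip(expression, copy_number, methylation)
    ]
    # Step 2: regress the residual expression on genotype (0, 1 or 2 copies of an allele).
    _, slope = ols([genotype], residuals)
    return slope


# Synthetic data in which genotype is uncorrelated with the two covariates.
genotype = [0, 0, 1, 1, 2, 2]
copy_number = [-1, 1, -1, 1, -1, 1]
methylation = [0.6, 0.4, 0.5, 0.5, 0.4, 0.6]
expression = [
    1 + 2 * c - 3 * m + 0.5 * g
    for c, m, g in zip(copy_number, methylation, genotype)
]
genotype_effect = two_step_eqtl(expression, copy_number, methylation, genotype)
print(round(genotype_effect, 6))
```

Because the synthetic genotype is uncorrelated with copy number and methylation, the second step recovers the simulated per-allele effect of 0.5; with correlated covariates a residual-based estimate of this kind can be attenuated.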
2.5.7 General Data Handling and Visualization
Throughout the analyses I used GenomicRanges to import, export and/or intersect genomic data
for plotting and annotation purposes; the R version 3.0.0 (2013-04-03) was used for all statistical
analyses, the R function ‘image’ was used for heatmap generation, and package ‘ggplot2’ was used to
generate scatterplots. To generate the circle plot, Circos software was used (Krzywinski et al., 2009). All
genomic location information is based on hg19.
2.6 Supplementary figures for chapter 2
Supplementary figure 2.1 Correlated exon SNPs.
Shown are correlated SNPs that fall within exons located within +/- 200 kb from each tag SNP. The $r^2$
value in relation to the tag SNP is shown on the x axis and the distance of the correlated SNP from the tag
SNP is shown on the y axis. The color indicates whether the exon is contained within a protein coding
(green), non-coding (red), or pseudogene (blue) transcript.
Supplementary figure 2.2 Analysis of RNA-seq data.
The top panels represent expression levels of coding (left panel) and non-coding (right panel) RNAs in
HCT116 and sigmoid colon samples. RNAs were divided into highly expressed (FPKM higher than 2),
modestly expressed (FPKM between 0.5 and 2) and not expressed (FPKM less than 0.5). The number of
genes in each category is shown in the table.
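The three expression categories can be written as a small Python function. The thresholds come from the caption; treating values of exactly 0.5 and exactly 2 as modestly expressed is an assumption, and the gene names and FPKM values are hypothetical.

```python
from collections import Counter


def expression_category(fpkm_value, low=0.5, high=2.0):
    """Assign an FPKM value to one of the three categories used in the caption."""
    if fpkm_value > high:
        return "highly expressed"
    if fpkm_value >= low:
        return "modestly expressed"
    return "not expressed"


# Hypothetical mean FPKM values for five genes.
example_fpkm = {"gene_1": 35.2, "gene_2": 1.4, "gene_3": 0.1, "gene_4": 0.5, "gene_5": 2.0}
categories = {gene: expression_category(v) for gene, v in example_fpkm.items()}
counts = Counter(categories.values())
print(counts)
```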
Supplementary figure 2.3 Correlated TSS SNPs.
Shown are correlated SNPs that fall within TSS regions located within +/- 200 kb from each tag SNP. The
$r^2$ value in relation to the tag SNP is shown on the x axis and the distance of the correlated SNP from the
tag SNP is shown on the y axis. The color indicates whether the TSS regulates a protein coding (green),
non-coding (red), or pseudogene (blue) transcript.
Supplementary figure 2.4 ChIP-seq peak analysis.
The top panels show the H3K27Ac peak height vs peak rank in the HCT116 (left) and sigmoid colon
(right) ChIP-seq datasets; blue indicates peaks present in both replicates, green indicates peaks present
in only one replicate, and red indicates peaks present only in the merged dataset. The bottom panels
show the H3K27Ac peak height vs peak rank in the HCT116 (left) and sigmoid colon (right) ChIP-seq
datasets; blue indicates peaks located in promoter proximal regions and red indicates peaks located in
distal regions.
Supplementary figure 2.5 Correlated enhancer SNPs.
Shown are correlated SNPs that fall within distal H3K27Ac regions located within +/- 200 kb from
each tag SNP. The $r^2$ value in relation to the tag SNP is shown on the x axis and the distance of the
correlated SNP from the tag SNP is shown on the y axis. The color indicates whether the enhancer is
found only in normal sigmoid colon (blue), only in HCT116 (green), or in both normal and HCT116
cells (red).
Supplementary figure 2.6 TCGA batch effects analysis. RNA-seq batch effects.
1) Hierarchical clustering plot for mRNA expression data. 2) PCA for mRNA expression, showing first
and second components comparison, second and third components comparison, and third and fourth
components comparison plots with samples connected by centroids according to batch ID. 3) PCA for
mRNA expression, showing first and second components comparison, second and third components
comparison and third and fourth components comparison plots with samples connected by centroids
according to tissue source site (TSS).
Supplementary figure 2.7 TCGA batch effects analysis. CNV (GW SNP6 array) batch effects.
1) Hierarchical clustering plot for copy number variation (CNV) data. 2) PCA for CNV, showing first and
second components comparison, second and third components comparison and third and fourth
components comparison plots with samples connected by centroids according to batch ID. 3) PCA for
CNV, showing first and second components comparison, second and third components comparison and
third and fourth components comparison plots with samples connected by centroids according to TSS.
Supplementary figure 2.8 TCGA batch effects analysis. DNA methylation (Infinium HM450K
microarray) batch effects.
1) Hierarchical clustering plot for DNA methylation data. 2) PCA for DNA methylation, showing first
and second components comparison, second and third components comparison and third and fourth
components comparison plots with samples connected by centroids according to batch ID. 3) PCA for
DNA methylation, showing first and second components comparison, second and third components
comparison and third and fourth components comparison plots with samples connected by centroids
according to TSS.
Supplementary figure 2.9 Expression analysis of genes identified by promoter and exon SNPs and
potential enhancer target genes in TCGA samples.
(A) Shown are the expression levels for the genes identified by exon or TSS SNPs in normal and
colorectal TCGA tumor samples. The heatmap was created by unsupervised clustering of normalized
gene expression, using 21 normal and 233 tumor colon tissues from TCGA. Each row represents a gene
and each column is one of the normal (left panel) or tumor (right panel) samples. (B) Shown are the
expression differences (x axis) in the normal vs tumor samples and the significance of the change (y axis)
of the genes identified by promoter or exon SNPs. Genes identified as differentially expressed are those
with an adjusted P value <0.01 and log2 expression difference >1; red indicates genes with a higher
expression in tumors and blue indicates genes with a lower expression in tumors. (C) Shown are the
expression levels of the differentially expressed genes in the set of the 10 nearest 5’ and 10 nearest 3’
genes for each of the 28 enhancers in normal and tumor TCGA samples; the
heatmap was created as described in panel A. (D) Shown are the expression differences (x axis) in the
normal vs tumor samples and the significance of the change (y axis) of the 10 nearest 5’ and 10 nearest 3’
genes for each of the 28 enhancers. Genes identified as differentially expressed are those with an adjusted
P value <0.01 and log2 expression difference >1; the colors refer to the 9 different regions within which
the 28 enhancers are located (see Table 2.3).
Supplementary figure 2.10 eQTL analysis summary.
(A) Distribution of promoter methylation of all genes in normal and tumor samples. (B) Distribution of
gene level copy number variation for all genes. (C) Shown are the subset of all related SNP-gene pairs
identified by eQTL analysis with a non-adjusted P-value <0.01. (D) Shown is the expression level for the
three genotypes for each SNP-gene pair, colored by the copy number variable of the gene. (E) Shown is
the expression level for the three genotypes for each SNP-gene pair, colored by the DNA methylation
level at the promoter.
Chapter 3. Inferring regulatory element landscapes and transcription factor
networks from cancer methylomes
The work described in this chapter has been published in an article in Genome Biology, 16(1):105 doi:
10.1186/s13059-015-0668-3. Lijing Yao, Hui Shen, Peter W. Laird, Peggy J. Farnham and Benjamin P.
Berman. “Inferring regulatory element landscapes and transcription factor networks from cancer
methylomes”. Lijing Yao is responsible for all bioinformatic analyses and assisted with manuscript
preparation; Benjamin P. Berman conceived of the project and participated in analysis and manuscript
preparation; Peggy J. Farnham participated in the enhancer and transcription factor analysis and
drafted the manuscript; Hui Shen and Peter W. Laird contributed key concepts to the analysis strategy
used and Hui Shen assisted with analysis.
3.1 Abstract
Recent studies indicate that DNA methylation can be used to identify transcriptional enhancers,
but no systematic approach has been developed for genome-wide identification and analysis of enhancers
based on DNA methylation. I describe ELMER (Enhancer Linking by Methylation/Expression
Relationships), an R-based tool that uses DNA methylation to identify enhancers and correlates enhancer
state with expression of nearby genes to identify transcriptional targets. Transcription factor motif
analysis of enhancers is coupled with expression analysis of transcription factors to infer upstream
regulators. Using ELMER, I investigated more than 2,000 tumor samples from The Cancer Genome
Atlas. I identified networks regulated by known cancer drivers such as GATA3 and FOXA1 (breast
cancer), SOX17 and FOXA2 (endometrial cancer), and NFE2L2, SOX2 and TP63 (squamous cell lung
cancer). I also identified novel networks with prognostic associations, including RUNX1 in kidney
cancer. I propose ELMER as a powerful new paradigm for understanding the cis-regulatory interface
between cancer-associated transcription factors and their functional target genes.
3.2 Background
ENCODE and other large-scale efforts have mapped transcription factor binding sites, histone
modifications, and chromatin accessibility in a common set of cell lines
(RoadmapEpigenomicsConsortium, 2015, ENCODE_Project_Consortium, 2012). Integration of these
genome-wide maps has led to the view that distinct epigenetic marks are not independent but rather that
chromatin is organized into discrete functional states marked by particular combinations of individual
features (Filion et al., 2010, Henikoff, 2007). Computational methods such as chromHMM (Ernst et al.,
2011) and Segway (Hoffman et al., 2012) have been developed to identify these states from individual
histone and accessibility features, and the state most consistently linked to cellular identity is the “active
enhancer” state defined by the presence of histone H3 lysine 27 acetylation and low levels of the
canonical promoter mark, H3 lysine 4 tri-methylation (Rada-Iglesias et al., 2011, Creyghton et al., 2010,
Ernst et al., 2011). Active enhancers are enriched for sequences bound by cell-type specific transcription
factors, reinforcing their preeminent role in encoding the cis-regulatory logic of the genome. Projects such
as the NIH Roadmap (Bernstein et al., 2010, RoadmapEpigenomicsConsortium, 2015) and Blueprint
(Beck et al., 2012) have also mapped histone modifications and chromatin accessibility in primary human
tissues, identifying a large set of enhancers from many different cell types. Others have employed these
datasets to identify large numbers of enhancer-promoter pairs in 12 human cell types (He et al., 2014,
Sheffield et al., 2013). However, approaches such as ChIP-seq or DNAse hypersensitivity assays require
careful tissue handling (to avoid protein degradation) and relatively large numbers of cells ($10^6$ to $10^7$)
and thus have not been applied to the identification of enhancers in primary tumor tissues.
Fortunately, enhancers can also be identified using patterns of 5-methylcytosine, an epigenetic
mark that is maintained more stably than protein marks, and can be detected genome-wide in as few as
1,000 cells (Pastor et al., 2014). Historically, DNA methylation research has focused on gene promoter
regions (reviewed in (Bergman and Cedar, 2013)). While early work suggested that DNA methylation
could mark enhancer regions of interest (Thomassin et al., 2001), this was not widely appreciated until the
first complete and unbiased study of DNA methylation in human cells revealed enhancer regions as being
unmethylated in a cell-type specific manner (Lister et al., 2009). A later study used the same whole-
genome bisulfite sequencing (WGBS) approach to identify all genomic regions containing little or no
methylation; these regions overwhelmingly corresponded to enhancers and other distal regulatory
elements (Stadler et al., 2011). Cell-type specific demethylation of enhancers was confirmed by targeted
bisulfite sequencing in the ENCODE project (ENCODE_Project_Consortium, 2012). More recently,
WGBS data from 30 diverse human cell types showed that enhancers had highly dynamic methylation
patterns - roughly 30% of the most cell type-specific regions in the genome overlapped known enhancers
(compared to 5% that overlapped gene promoters). The mechanism underlying these correlations is not
well understood, but could involve demethylation initiated by transcription factor binding (Stadler et al.,
2011; reviewed in Blattler and Farnham, 2013) and subsequent inhibition of DNA methylation by methylation of histone H3
lysine 4 (Ooi and Wood, 2007).
In cancer tissues, recent studies have shown that cancer-specific enhancers and transcription
factor binding sites can be identified from methylation profiles. The first genome-scale analysis of
transcription factor binding sites in cancer found that binding by transcription factors such as Sp1, NRF1,
and YY1 could protect CpG island gene promoters from cancer-specific hypermethylation (Gebhard et
al., 2010). Our WGBS study of a human colon cancer identified all genomic regions that changed from a
methylated state in the normal colon to an unmethylated state in the tumor; 90% of these regions
overlapped known enhancers, and a highly disproportionate number contained binding sites for the AP-1
transcription factor (Berman et al., 2012). A more recent study showed that DNA methylation changes at
enhancer elements were significantly better than those at promoters for predicting gene expression
changes of target genes in cancer (Aran et al., 2013). WGBS was recently used to show that unmethylated
regions were enriched for binding sites for subtype-specific transcription factors in pediatric
medulloblastoma (LEF1 for the WNT subtype and GLI2 for the SHH subtype (Hovestadt et al., 2014)).
Once an enhancer has been identified by methylation, identification of the specific target gene or
genes whose expression is modulated by that enhancer can be challenging because the target genes can be
thousands to millions of base pairs away from the enhancer. A study using chromatin interaction analysis by
paired-end tag sequencing (ChIA-PET) to study enhancer/promoter interactions found that the median distance between
an enhancer and a promoter was ~50kb, and that at least 40% of enhancers skip one or more annotated
genes to find their target promoter (Li et al., 2012a). The ChIA-PET dataset was used in conjunction with
DNA methylation and RNA-seq data from breast cancer cases in The Cancer Genome Atlas (TCGA) to
identify enhancer/promoter pairs in vivo (Aran and Hellman, 2013). Other reports have also shown that
methylation of distal regulatory sites is closely related to gene expression levels across the genome
(Wiench et al., 2011). Here, I present a statistical framework for identification of cancer-specific
enhancers and paired gene promoters, and use it to investigate ~3,000 cases from 11 tumor types in the
TCGA “Pan Cancer” analysis set (Weinstein et al., 2013). My R software package, ELMER, uses only
methylation and expression data, and does not require any chromatin conformation or ChIP-seq data.
Furthermore, by identifying transcription factor binding motifs present within enhancers, and
incorporating expression patterns of upstream transcription factors, ELMER is able to infer transcription
factor networks activated in specific cancer subtypes. This work suggests a general approach for
identifying in vivo transcription factor networks and the associated regulatory control sequences altered in
cancer.
3.3 Results
3.3.1 Identifying cancer-specific DNA methylation changes in distal enhancer regions for 10 cancer
types
To identify cancer-specific changes in DNA methylation, I obtained 3,381 DNA methylation
datasets for 11 types of primary tumors from the TCGA Pan Cancer analysis set (Weinstein et al., 2013).
The cancer types I included in my analyses were leukemia (LAML), lung adenocarcinoma (LUAD), lung
squamous cell carcinoma (LUSC), kidney renal clear cell carcinoma (KIRC), bladder urothelial
carcinoma (BLCA), uterine corpus endometrioid carcinoma (UCEC), glioblastoma (GBM), head and
neck squamous cell carcinoma (HNSC), breast cancer (BRCA), colon adenocarcinoma (COAD) and
rectal adenocarcinoma (READ). Based on previous TCGA studies (TheCancerGenomeAtlas, 2012c),
COAD and READ are very similar and are often combined for analyses. Therefore I combined these two
cancer types (indicated herein as CRC), resulting in 10 different primary tumor types. The TCGA ID
numbers for all samples can be found in Supplementary Data File 3.1.
The DNA methylation data sets were produced using the Illumina Infinium
HumanMethylation450 (HM450) BeadChip platform. The HM450 array allows the interrogation of more
than 485,000 methylation sites at single-nucleotide resolution, covering 96% of CpG islands and 99% of
RefSeq genes in the human genome. I used TCGA Level 3 data, which are normalized using platform-specific
internal controls and in which probes that failed or that overlap SNPs or repeats are masked.
Then, because I focused on distal enhancers, I selected only those probes located more than
2 kb from any known TSS (defined using GENCODE v15; http://www.gencodegenes.org/releases/15.html),
resulting in a set of 145,265 distal probes. I next wanted to limit the number of candidate probes tested, so
I filtered based on two large enhancer databases. While these databases do not include a large number of
primary tumors, they do include cancer cell lines and a large number of cell types. The largest enhancer
set came from a combination of enhancers from the Roadmap Epigenomics Mapping Consortium
(REMC) and the Encyclopedia of DNA Elements (ENCODE) Project, in which enhancers were identified
using ChromHMM (Ernst and Kellis, 2012) for 98 tissues or cell lines
(RoadmapEpigenomicsConsortium, 2015, Bernstein et al., 2010, ENCODE_Project_Consortium, 2012). I
used the union of genomic elements labeled as EnhG1, EnhG2, EnhA1 or EnhA2 (representing intergenic
and intragenic active enhancers) in any of the 98 cell types, resulting in a total of 389,967 non-
overlapping enhancer regions. A total of 101,918 distal probes from the HM450 array overlapped with
these enhancer regions. I also downloaded FANTOM5 enhancers with associated eRNAs identified across 400
distinct cell types (Andersson et al., 2014). The set of FANTOM5 enhancers (43,011) was much smaller
than the set of REMC/ENCODE enhancers and only added an additional 600 probes, resulting in a total
of 102,518 distal probe regions that overlapped with a previously identified enhancer region (Figure
3.1A). This set of 102,518 distal enhancer probes (Supplementary Data File 3.2) included at least one
CpG for 15% of all enhancers in our annotation set, suggesting that the HM450k array can be used to
sample a meaningful subset of enhancers genome-wide. It also included the majority (70%) of all 145,265
distal probes on the array, so I believe that the analysis described below covers the vast majority of
identifiable enhancers based on the HM450k array design. The ELMER R package also allows a complete
search of all distal probes on the array, without filtering out the 30% not associated with any known
enhancer.
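The probe selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ELMER code: the function names, the probe identifiers, the per-chromosome sorted lists, and the half-open interval convention are hypothetical choices made for the example.

```python
from bisect import bisect_left, bisect_right

def is_distal(pos, tss_sorted, min_dist=2000):
    """True if pos lies more than min_dist bp from every TSS (sorted list, one chromosome)."""
    i = bisect_left(tss_sorted, pos)
    # Only the TSS just below and just above pos can be the nearest one.
    neighbours = tss_sorted[max(i - 1, 0):i + 1]
    return all(abs(pos - t) > min_dist for t in neighbours)

def in_enhancer(pos, enhancers):
    """True if pos falls in one of the non-overlapping (start, end) intervals.
    Intervals are sorted by start and treated as half-open (an assumption)."""
    starts = [s for s, _ in enhancers]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < enhancers[i][1]

def distal_enhancer_probes(probes, tss_sorted, enhancers):
    """probes: dict probe_id -> position. Returns the ids that pass both filters."""
    return [pid for pid, pos in probes.items()
            if is_distal(pos, tss_sorted) and in_enhancer(pos, enhancers)]
```

Skipping the `in_enhancer` filter corresponds to the unfiltered search of all distal probes mentioned above.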
Figure 3.1 Identifying cancer-specific DNA methylation changes in distal enhancer regions.
(a) Out of 145,265 distal probes on the HM450k platform, 102,518 were contained within our annotated
enhancer regions (with ~1/8 of all distal enhancers being covered by at least one probe). (b) The statistical
method used to identify probes hypomethylated (or hypermethylated) in cancer (see Methods for
additional details). The heatmap in the top panel shows the DNA methylation level at each probe p_i for
each sample from a particular cancer type (either an adjacent normal, or a tumor). Each cell is a
methylation β value, reflecting the fraction of methylated DNA molecules at each CpG probe. The
remainder of the panel illustrates our statistical test, which compares only the most extreme 20% of
normal samples to the most extreme 20% of tumor samples, in order to identify probes hypomethylated in
only a subset of tumors. (c) Shown is a histogram representing the number of cancer-specific
hypomethylated (top graph) or hypermethylated (bottom graph) distal enhancer probes identified for each
cancer type. The fraction of these probes shared by one or more other tumor types is indicated by the
color bars (1 indicates that the probe is hypomethylated in only that tumor type, 2 indicates that it is
hypomethylated in one other tumor type, etc.)
To identify enhancers that displayed cancer-specific changes in DNA methylation, I applied a t-
test to identify enhancer probes that were significantly hypermethylated or hypomethylated within tumor
samples of each cancer type, relative to TCGA adjacent normal samples from the same tissue (Figure
3.1B; see Methods for details); a list of the identified hypermethylated or hypomethylated enhancer
probes for each tumor type can be found in Supplementary Data File 3.3. I identified many more
hypomethylated enhancer probes than hypermethylated probes for each of the 10 cancer types (Figure
3.1C). Interestingly, most of the probes showing DNA methylation changes were found to have similar
changes in DNA methylation in more than one cancer type. However, some probes were uniquely
hypermethylated or hypomethylated in only one of the 10 tumor types. I note that it is not possible for us
to be absolutely certain that the adjacent tissues collected by TCGA correspond to the same cell type from
which the cancer arose, and therefore some of these methylation changes may correspond to tissue-
specific differences rather than changes arising in the cancer. However, these differentially methylated
probes are only candidates, as the next steps of ELMER (described below) use differences across all
normal and tumor tissues (of the same cancer type) to determine true regulatory interactions.
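As a rough illustration of comparing only the extreme samples of each group, the sketch below computes a Welch t statistic on the lowest-methylated 20% of tumors versus the lowest-methylated 20% of normals. It rests on assumptions not stated in the text: that the lower tail is the relevant extreme for hypomethylation, that Welch's unequal-variance form is appropriate, and that at least two samples are kept per group. The function names are hypothetical, and the p-value step is omitted because the Python standard library has no t distribution.

```python
from math import ceil, sqrt
from statistics import mean, variance

def extreme_fraction(betas, frac=0.2, lowest=True):
    """Return the lowest (or highest) frac of beta values, keeping at least two samples."""
    k = max(2, ceil(frac * len(betas)))
    return sorted(betas, reverse=not lowest)[:k]

def welch_t(a, b):
    """Welch's t statistic for mean(a) - mean(b); undefined if both variances are zero."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def hypomethylation_t(tumor_betas, normal_betas, frac=0.2):
    """Negative values indicate lower methylation in the extreme tumor subset."""
    return welch_t(extreme_fraction(tumor_betas, frac),
                   extreme_fraction(normal_betas, frac))
```

Restricting the comparison to the extremes means a probe that is hypomethylated in only a subset of tumors can still produce a strongly negative statistic.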
3.3.2 Linking methylation-affected enhancers to gene expression
Although I identified ~100,000 enhancer probes that showed DNA methylation changes, it was
not clear if all of these enhancers were actually involved in regulating gene expression. Previous studies
have shown that only a portion of genomic regions classified as enhancers by chromatin marks or
recruitment of histone acetyltransferases show activity in various assays (Kwasnieski et al., 2014, Blow et
al., 2010). In addition, it is difficult to know which gene is regulated by each enhancer since enhancers
can work from a distance, in either orientation, and do not necessarily regulate the closest gene. For
example, in a ChIA-PET study using an antibody for RNA polymerase II, Li et al. (Li et al., 2012a)
identified ~20,000–30,000 enhancer-promoter loops in MCF7 or K562 cells. Of these, more than 40% of
the enhancers skipped over the nearest gene to loop to a farther one. In order to identify target genes
regulated by the distal regulatory elements, I analyzed expression data (RNA-seq) for 10 genes upstream
and 10 genes downstream from each distal regulatory element; these 20 nearby genes constituted
candidate gene targets. I preferred this method to those that evaluate all genes within a fixed-length
genomic window, because testing a fixed number of genes per probe keeps the statistical power comparable
despite the large variation in gene density across the genome. Because not all TCGA samples had matched gene expression datasets, I
selected the 2,841 TCGA samples that had matched gene expression (RNA-seq) and HM450k DNA
methylation data (in Supplementary Data File 3.1). Although I realize that this method cannot identify
target genes that are farther than ten genes away or on different chromosomes, I anticipated that many of
the enhancers would regulate a gene within this distance (Ernst et al., 2011). Genes that are positively
regulated by the enhancers should show a negative correlation between the DNA methylation level of the
probe and expression of a putative target gene. I identified statistically significant CpG probe-gene pairs
by comparing expression of the candidate gene in the upper vs. the lower quintile of samples, as measured
by enhancer probe methylation. For this and all other downstream analyses, I included both normal and
tumor samples, and only included samples within an individual cancer type (e.g. UCEC), to avoid effects
of tissue-specific differences and potential batch effects. I did not explicitly require expression changes
between normal and tumor samples, because the number of normal samples with expression data was
often quite limited. However, most genes identified did in fact show expression changes in the expected
direction (downregulated for hypermethylated enhancers, and upregulated for hypomethylated enhancers;
see the “tumor vs. normal expression” worksheet in Supplementary Data File 3.4). To compare
methylation quintiles vs. expression, I used a non-parametric U test, calculating an empirical p-value
using randomly assigned permutations of the methylation probe tested, and kept all pairs with an
empirical p-value < 0.001 (Figure 3.2A; see Methods for details). An example of one probe and its
relationship to the expression of the 20 nearby genes in UCEC is shown in Figure 3.2B. In this case, the
probe showed an inverse correlation of methylation with expression of TFAP2A, which was the nearest
gene upstream of the probe (~7 kb away). A list of all putative enhancer-gene interactions can be found in
Supplementary Data File 3.4.
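The choice of 10 genes on each side of a probe, rather than a fixed window, can be illustrated with a short Python sketch. It is hypothetical in its details: "upstream" and "downstream" are taken here to mean lower and higher genomic coordinate, each gene is represented by a single TSS, and the function and gene names are invented for the example.

```python
from bisect import bisect_left

def candidate_genes(probe_pos, genes, n_each=10):
    """genes: list of (tss, name) tuples on the probe's chromosome, sorted by tss.
    Returns up to n_each genes on either side of the probe."""
    tss = [t for t, _ in genes]
    i = bisect_left(tss, probe_pos)          # first gene at or beyond the probe
    lower_side = genes[max(i - n_each, 0):i]
    upper_side = genes[i:i + n_each]
    return lower_side + upper_side
```

Because the number of candidates is fixed, each probe contributes at most 20 tests regardless of local gene density; probes near a chromosome end simply receive fewer candidates.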
Figure 3.2 Linking differentially methylated probes to expression of nearby genes.
(a) Shown is an illustration of the method used to associate each differentially methylated enhancer probe
with one or more genes based on gene expression (see Methods section for additional details). For each of
n probes identified as hypomethylated in a given cancer type (shown as blue circles), 10 genes upstream
and 10 genes downstream were considered, yielding 20n statistical tests, one for each probe-gene pair.
Each statistical test is performed across the complete set of normal and tumor samples within a particular
cancer type. For instance, I show a scatterplot to illustrate such a test across the 258 endometrial (UCEC)
tumor samples and 10 UCEC adjacent normals, showing the desired inverse correlation between
methylation (x axis) and expression of the nearby gene (y axis). A Mann–Whitney U test was then
performed, with the null hypothesis that the gene expression of group M samples is greater than or equal to that of
group U samples. The U group consists of the 20% least methylated samples for probe P_i, and the M
group consists of the 20% most methylated. The raw p-value (p_r) was compared to a permutation-based
distribution of null p-values, generated by performing 10,000 U tests between the actual gene G_j
and the DNA methylation of a randomly selected distal non-enhancer probe. The empirical p-value (p_e) was
calculated from the rank of p_r within the 10,000 trials. (b) Each scatter plot shows the methylation level of
an example probe cg09606832 in all UCEC samples plotted against the expression of one of 20 adjacent
genes. Only one gene, TFAP2A, shows a significant p_e indicating negative correlation, and is considered
the linked gene.
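The quintile-based test in panel (a) can be sketched as follows. This is a simplified stand-in rather than the ELMER implementation: the one-sided Mann–Whitney p-value uses a normal approximation without tie or continuity correction, the empirical p-value adds a pseudo-count of one (an assumption), and all function names are hypothetical. In practice `null_ps` would hold the raw p-values obtained by repeating `raw_p` with methylation values from randomly chosen probes.

```python
from math import erfc, sqrt

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mwu_greater(x, y):
    """One-sided p-value for the alternative that x tends to be larger than y."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return 0.5 * erfc(z / sqrt(2))

def raw_p(meth, expr, frac=0.2):
    """Compare expression in the least (U) vs. most (M) methylated samples."""
    k = max(1, round(len(meth) * frac))
    order = sorted(range(len(meth)), key=meth.__getitem__)
    u_expr = [expr[i] for i in order[:k]]
    m_expr = [expr[i] for i in order[-k:]]
    return mwu_greater(u_expr, m_expr)

def empirical_p(p_raw, null_ps):
    """Rank of the raw p-value among null p-values, with a +1 pseudo-count."""
    return (sum(p <= p_raw for p in null_ps) + 1) / (len(null_ps) + 1)
```

A small `raw_p` indicates that expression is higher in the unmethylated group, which is the inverse relationship expected for a positively acting enhancer.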
Using this method, I identified a total of 11,972 hypomethylated probe-gene pairs and 2,308
hypermethylated probe-gene pairs in the set of 10 tumor types (Figure 3.3A), with the number of
hypomethylated probe-gene pairs ranging from 499 to 3,847 in different tumor types, and the number of
hypermethylated probe-gene pairs ranging from 119 to 464 (see Supplementary Figure 3.1 for a
breakdown by type). Analysis of the probe-gene pairs revealed that most of the identified pairs were only
found in one cancer type, suggesting that each enhancer regulates a specific gene in a tumor type-specific
manner (Figure 3.3A). Because some enhancers contained two or more probe features, I clustered probes
that were within 500 bp of each other into 6,068 hypomethylated and 1,288 hypermethylated enhancer
regions. Each enhancer was associated with an average of 1.0 to 1.7 genes, depending on tumor type, and
each gene was associated with an average of 1.2 to 2.1 enhancers (Figure 3.3B). My work is consistent
with previous studies indicating that distal elements commonly loop to or are associated with expression
from 1 to 3 promoters (Sanyal et al., 2012). Although the enhancer-gene pairs that I identified were highly
specific for a certain tumor type, we found that ~34% of the genes identified as regulated by a
hypomethylated probe and ~17% of the genes identified as regulated by a hypermethylated probe were
targets in more than one tumor type (Figure 3.3A), suggesting that a gene could utilize different
enhancers in different tumor types for cancer-specific regulation.
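Grouping probes that lie within 500 bp of each other into enhancer regions can be illustrated with a simple pass over sorted positions. The sketch assumes single-linkage chaining (a probe joins a cluster if it is within 500 bp of the cluster's last probe) on one chromosome; the text does not specify the exact rule, and the function name is hypothetical.

```python
def cluster_probes(positions, max_gap=500):
    """Group probe positions (one chromosome) into runs whose neighbours are <= max_gap apart."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)   # extend the current enhancer region
        else:
            clusters.append([pos])     # start a new enhancer region
    return clusters
```

For example, positions 100, 400, 1200, 1500 and 5000 would form three regions under this rule.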
Figure 3.3 Comparison of probe-gene pairs between the different cancer types.
(a) For the hypomethylated (top) and hypermethylated (bottom) probe-gene pairs, shown are pie charts
that indicate the percentage of probe-gene pairs, probes, and genes that are present in one (purple) or
shared by more than one of the 10 cancer types. (b) Using all probe-gene pairs, the distribution of the
number of genes per enhancer (top) and the number of enhancers per gene (bottom) is shown for each
individual cancer type. The mean of each is shown as a number within the bar plot
To further investigate the relationships between putative enhancers and linked target genes, I
determined the frequency with which the probe-gene pairs I identified were separated by specific
distances using window sizes of 50 or 200 kb (Figure 3.4A). I found that both hypomethylated and
hypermethylated probe-gene pairs were more frequent than random in the first 50 kb window, with
hypermethylated pairs more dramatically so. A previous study using HiC to identify promoter-enhancer
loops found that ~25% of enhancer-promoter pairs were within a 50 kb range and approximately 75%
spanned 100 kb or larger genomic distance, with a median distance of 124 kb (Jin et al., 2013), whereas a
recent study using in situ HiC identified contact domains ranging in size from 40 kb to 3 Mb, with a
median size of 185 kb (Rao et al., 2014). I then selected the set of probe-gene pairs where a single
enhancer was only linked to a single gene (the great majority), and determined how often the linked gene
corresponded to the nearest TSS. In previous studies, enhancers have been shown to loop to the nearest
promoter only 27–40% of the time, skipping over the nearest TSS to loop to promoters farther away (Li et
al., 2012a, Sanyal et al., 2012). I found that only ~15–30% of the time did the correlated gene correspond
to the nearest TSS, with the percentage being higher for hypermethylated probe-gene pairs than for
hypomethylated probe-gene pairs (Figure 3.4B). This was significantly higher than the frequency of an
enhancer being linked to any single more distant gene (4–8%); because my statistical test does not
favor the nearest gene, the disproportionate number of nearest-gene linkages gave me
confidence that many or most of the linkages were true cis-regulatory links, including those that linked to
more distant genes. If the linked gene did not correspond to the nearest TSS, there was very little
preference to link to a nearby gene; the one exception was that hypermethylated enhancers were more
likely to link to either the closest or second closest gene. This analysis is shown individually for each of
the 10 tumor types in Supplementary Figure 3.2.
Figure 3.4 Physical characteristics of the probe-gene pairs.
(a) A histogram of probe-gene distances for all pairs with a hypomethylated (green) or hypermethylated
(yellow) probe. Shown is the distribution of the distance between linked distal enhancer probes and genes.
The X-axis shows distances in bins of 50kb or 200kb. The Y-axis shows the proportion of all probe-gene
pairs in the category (hyper- or hypomethylated) that fall into each range. These were compared to
randomized datasets (gray bars), which were generated by randomly selecting 1,000 probes from the full
set of 145,265 distal probes, and randomly pairing each with one of its 20 adjacent genes. I generated
1,000 such datasets to compute 95% confidence intervals for each bin (±1.96 × SD). (b) For each probe
in a probe-gene pair, the 20 adjacent genes were ranked by distance, and shown is the proportion of all
probes linked to genes of a given rank. For this analysis, probes linked to more than one gene, and
multiple probes linked to the same gene, were omitted.
As indicated above, many of the genes that I identified as linked to enhancers with cancer-
associated DNA methylation differences were actually identified in more than one cancer type, suggesting
that they may have some common function in tumor initiation or progression. I selected all genes linked
to an enhancer probe in more than one cancer type and performed a Gene Ontology enrichment analysis
(Figure 3.5). The 1,959 genes linked to hypomethylated (activated) enhancer probes correspond to genes
upregulated in cancer, and the 284 genes linked to hypermethylated (inactivated) probes correspond to
genes downregulated in cancer. Interestingly, I found that genes linked to hypermethylated (inactivated)
enhancers were genes involved in development and differentiation. In contrast, genes linked to
hypomethylated (activated) enhancers were classified as involved in the cell cycle and other cellular
processes. Accordingly, I have identified known tumor suppressors (e.g. TSG1, RBM6, SPRY2, CDKN1A,
and UBE4B) in the set of genes potentially regulated by the hypermethylated enhancers and known
oncogenes and cancer-associated genes (e.g. MYC, TERT, ERBB3, ERBB4, FGFR3, VEGFA, CDK7, and
CCND1) in the set of genes potentially regulated by the hypomethylated enhancers.
Figure 3.5 Gene Ontology (GO) enrichment analysis for genes identified in more than one cancer
type.
All genes identified in more than one cancer type by probe-gene pairs were analyzed for enrichment in
particular GO categories, using the TopGO program. Activated genes (associated with hypomethylated
enhancer probes) are shown in (a) and inactivated genes (associated with hypermethylated enhancer
probes) are shown in (b). All GO categories with an adjusted enrichment p value of less than 0.01 (indicated
next to the category name) and fold change more than 1.5 are included in the figure, and categories within
the same biological process (color) are ordered by enrichment fold change (shown on the x axis). The
adjusted enrichment p values are labeled in white in the graph.
3.3.3 Identification of regulatory TFs in each cancer type
Changes in methylation status of an enhancer region can be due to gain (for hypomethylated
enhancers) or loss (for hypermethylated enhancers) of site-specific transcription factors. To obtain insight
into which site-specific TFs may be involved in setting the tumor-specific DNA methylation patterns, I
examined the correspondence between cancer-specific hypermethylated or hypomethylated probes and
known regulatory factor recognition sequence motifs. I used a combined set of motifs present in the
JASPAR-Core (Mathelier et al., 2014) and FactorBook (Wang et al., 2013b) datasets. I selected the
enhancer probes that were identified in probe-gene pairs (using a cutoff of 0.001), then used the +/−100bp
sequence around each probe to search for instances of the 145 transcription factor motifs. I calculated the
frequency of each motif within the hypomethylated (or hypermethylated) probe set for a given cancer vs.
the frequency of the motif within the entire enhancer probe set. An odds ratio (OR) was calculated from
these two frequencies, and only those motifs for which the lower bound of the 95% confidence interval of the OR
was greater than 1.1 were selected as enriched within the given cancer type (motifs with fewer than 10 instances within the
given probe set were excluded). All enriched motifs are listed in Supplementary Data File 3.5. For
hypermethylated loci, I found that many of the identified motifs (such as E2F, EGR1, NRF1, Sp1) were
associated with promoter regions (Supplementary Figure 3.3), suggesting that many of the
hypermethylated loci may actually correspond to previously uncharacterized promoter regions. This likely
accounts for the relatively high percentage of hypermethylated probe-gene pairs that showed linkage to the
nearest annotated gene (Figure 3.4B), which could reflect RNA-seq tags from the unannotated transcript
isoform. Because many of the hypermethylated cases might not represent true distal enhancers, and
because some may in fact be the result of cancer-related CpG Island promoter hypermethylation
(Bergman and Cedar, 2013), I focused the remaining analyses on the 38 motifs found to be enriched
within hypomethylated loci (Figure 3.6A). Some of these motifs were common to various different
cancers, such as AP1, which was enriched within 9 of the 10 cancer types. Many motifs were more
enriched in two or more specific tumor types, while others were limited to a single type, such as GATA
in BRCA, TP53/TP63 in LUSC, and HNF1A/B in UCEC.
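A motif enrichment odds ratio with a 95% confidence interval can be computed as in the sketch below. It assumes the standard log-scale (Woolf) interval, which the text does not name explicitly, and uses hypothetical function and argument names: a and b are the counts of probes with and without the motif in the cancer-specific probe set, and c and d are the corresponding counts in the background enhancer probe set.

```python
from math import exp, log, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Return (OR, lower, upper) using the Woolf log-scale interval; all counts must be > 0."""
    odds_ratio = (a / b) / (c / d)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # standard error of log(OR)
    return (odds_ratio,
            exp(log(odds_ratio) - z * se),
            exp(log(odds_ratio) + z * se))

def is_enriched(a, b, c, d, min_lower=1.1, min_instances=10):
    """Apply the two thresholds from the text: at least 10 instances and a lower bound above 1.1."""
    if a < min_instances:
        return False
    _, lower, _ = odds_ratio_ci(a, b, c, d)
    return lower > min_lower
```

Filtering on the lower bound rather than the point estimate avoids calling a motif enriched when it is supported by only a handful of probes.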
Figure 3.6 Identification of enhancer sets predicted to be co-regulated by the same transcription
factor.
(a) For 38 motifs enriched within hypomethylated probe-gene pairs in one or more cancer types, we
calculated the 95% confidence interval (CI) for the motif enrichment odds ratio; the lower bound
of the 95% CI is shown for each cancer type in the heatmap. (b) An illustration of the method for linking
sets of enhancers with the same motif to an upstream TF regulator (see Methods for additional details).
For each of the 38 (m) enriched motifs identified in panel (a), the average DNA methylation at all distal
enhancer probes having that motif (in a specific tumor type), was compared to the expression levels of
each of 1,777 (k) human TFs (Supplementary Data File 3.11). One such pair is shown as a scatter plot
of all Breast Cancer (BRCA) tumor and adjacent normal samples, for the GATA motif and the GATA3
TF. BRCA samples (660) are color coded by integrated molecular subtypes defined by the TCGA Pan
Cancer project, and extremes are selected as the 20% of samples with the lowest methylation (U) and the
20% with the highest methylation (M). A Mann–Whitney U test was performed to obtain the raw p value
(p_r). All 1,777 TFs were then ranked by p_r (plot at upper right), and the top 5% of the ranked TFs (dashed
blue line) were considered to be significantly associated. The top 3 ranked TFs, along with each member
of the specific DNA-binding family (in this case, GATAs) are labeled. Supplementary Data File 3.7
contains ranked TF plots for all motifs and all cancer types. (c) One of the 230 hypomethylated probe-
gene pairs in BRCA containing a GATA motif corresponds to a downstream enhancer of the CCND1
gene. ENCODE ChIP-seq data in the Luminal-subtype MCF7 cell line verifies that this enhancer is bound
by the ELMER-predicted GATA3 TF
Different members of a TF family have very similar DNA binding domains that can bind very
similar or identical motifs. For example, it has been previously shown that GATA1 and GATA2 bind to
the same regulatory regions (Fujiwara et al., 2009) and that members of the E2F family can bind to the
same promoters (Xu et al., 2007). Thus, identification of a motif does not uniquely identify the TF that
binds in vivo to a region containing that motif. However, there is evidence to support the hypothesis that
expression levels of a particular TF can correlate with levels of demethylation and subsequent gene
expression (Blattler and Farnham, 2013, Shakya et al., 2015, Lee et al., 2014). To discover which
members of a TF family are likely to be responsible for binding in vivo to the hypomethylated enhancer
probes identified above and regulating expression of their putative target genes, I analyzed the correlation
between the probes containing a particular motif and expression of all known TFs (Figure 3.6B, left). I
ranked all the TFs by the degree to which their expression inversely correlated with the methylation status
of the enhancers containing the motif (Figure 3.6B, right), which allowed us to determine the family
member most likely to be involved in regulation of the putative target genes in that particular cancer. For
example, the GATA motif was enriched in (expression-linked) enhancer probes in BRCA samples
(Figure 3.6A). There are six members of the GATA family, with different members being linked to
different differentiation phenotypes. For example, GATA1–3 have been linked to the specification of
different hematopoietic cell fates and GATA4–6 are involved in differentiation of cardiac and lung tissues
(Bresnick et al., 2010, Chou et al., 2010, Patient and McGhee, 2002, Kouros-Mehr et al., 2008). GATA3
is one of the most highly enriched transcription factors in the mammary epithelium, has been shown to be
necessary for mammary cell differentiation, and is specifically required to maintain the luminal cell fate
(Patient and McGhee, 2002, Kouros-Mehr et al., 2008). Studies of human breast cancers have shown that
GATA3 is expressed in early stage, well-differentiated tumors but not in advanced invasive cancers. In
addition, GATA3 expression is correlated with longer disease-free survival and evidence suggests that it
can prevent or reverse the epithelial to mesenchymal transition that is characteristic of cancer metastasis
(Yan et al., 2010). Not surprisingly, my analysis correlating TF expression with the methylation of the GATA
motif-containing hypomethylated probes identified GATA3 as the most likely member of the GATA family to
be responsible for the observed hypomethylation of GATA-containing enhancers in the BRCA samples
(Figure 3.6B). Not only was GATA3 the 2nd most correlated transcription factor overall, but the extent of
correlation made it easily distinguishable from other family members (GATA3 had a U test p-value less than
10^-40, vs. p-values greater than 10^-5 for all other GATA family members). Furthermore, expression of
GATA3 and methylation of GATA-containing enhancer probes were co-linked to breast cancer subtypes.
As shown using color-coding in the Figure 3.6B scatterplot, Luminal tumors had high expression of
GATA3 and low methylation of GATA-containing enhancer probes, while Basal-like subtype tumors
showed the converse. Figure 3.6C shows an example of one of these GATA-containing enhancer probes
(cg1396202), along with the target gene (CCND1) predicted by expression to be regulated by this putative
enhancer. ENCODE ChIP-seq data in the Luminal-subtype MCF7 cell line confirms that this putative
enhancer region is indeed bound by GATA3, confirming the relationship between transcription factor
binding and demethylation shown in (Aran and Hellman, 2013). This case was among the easier to detect,
since breast cancer has two large subtypes (Luminal and Basal-like), which are molecularly quite distinct
and are increasingly seen as two different diseases. As with all cancer genomic approaches, rarer subtypes
will require larger numbers of samples to be identified by ELMER. Nevertheless, my results on other more
challenging cancer types were also promising, as described below.
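The TF ranking step can be sketched as follows: average the methylation of all motif-containing probes per sample, split samples into the least and most methylated groups, and order TFs by how much higher their expression is in the least methylated group. This is a hypothetical simplification. It ranks by the Mann–Whitney z-score (normal approximation, no tie correction), which gives the same ordering as ranking by one-sided p-value, and the function names and input layout are assumptions.

```python
from math import sqrt

def u_zscore(x, y):
    """Mann-Whitney z-score; positive when values in x tend to exceed those in y."""
    n1, n2 = len(x), len(y)
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    return (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

def rank_tfs(motif_probe_meth, tf_expr, frac=0.2):
    """motif_probe_meth: list of per-probe lists (one beta value per sample).
    tf_expr: dict TF name -> per-sample expression list (same sample order).
    Returns TF names, most strongly anti-correlated with methylation first."""
    n = len(motif_probe_meth[0])
    avg = [sum(p[s] for p in motif_probe_meth) / len(motif_probe_meth)
           for s in range(n)]
    order = sorted(range(n), key=avg.__getitem__)
    k = max(1, round(n * frac))
    u_idx, m_idx = order[:k], order[-k:]       # least / most methylated samples
    scores = {tf: u_zscore([e[i] for i in u_idx], [e[i] for i in m_idx])
              for tf, e in tf_expr.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Taking the top 5% of the returned list would correspond to the cutoff drawn as the dashed line in the ranked TF plots.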
The same correlation analysis was performed for all motifs enriched in hypomethylated enhancer
probes, and the most highly correlated member of the TF family expected to bind to each motif was
identified (Supplementary Data File 3.6 and 3.10). In all, I identified 38 enhancer-TF pairs in the 10
tumor types. Although some of these TFs have previously been implicated in tumor development in the
cancer type in which they were identified (e.g. GATA3 in BRCA), many other associations were novel
and provide new hypotheses regarding basic cancer biology and new potential targets for cancer
prevention and treatment. In order to investigate potential clinical relevance of the new TF networks
identified, I searched for cases where the TF found to be overexpressed in a subset of cases was also
linked to patient survival. Our TF family member analysis showed that RUNX1, RUNX2, and RUNX3
were all within the top 5% of TFs correlated with hypomethylation of RUNX-containing enhancer probes
in clear cell renal carcinoma (KIRC) (Figure 3.7A-B). Of these, RUNX1 and RUNX2 were very highly
correlated, with RUNX3 being only moderately so (Figure 3.7A-B). When I investigated patient survival
in KIRC, RUNX1 and RUNX2 had highly significant associations with poor survival outcome after
controlling for other co-variates, while RUNX3 was more marginal (Figure 3.7C and Supplementary
Data File 3.8A). These results suggest that the identification of specific TFs based on enhancer
methylation analysis may lead to new insights into tumor classification and clinical outcomes (other
identified TFs with association to survival are listed in Supplementary Data File 3.8B).
Figure 3.7 High RUNX1 expression is associated with poor survival in clear cell renal carcinoma.
(a) Shown are scatter plots for the average DNA methylation at hypomethylated-paired probes containing
a RUNX motif, plotted against expression for RUNX family members RUNX1, RUNX2, RUNX3. The
number (and percentage) of hypomethylated-paired probes having a RUNX motif in each cancer type is
indicated underneath the name of each cancer type. (b) The ranked TF plot, as described in Figure 3.6, is
plotted for the RUNX motif in clear cell renal carcinoma (KIRC); RUNX1, RUNX2, and RUNX3 are all
within the top 5% (dotted line) of all TFs. (c) Kaplan-Meier survival curves for TCGA KIRC samples,
stratified by expression of RUNX1 (left), RUNX2 (middle), or RUNX3 (right). In each plot, the survival
data for patients having tumors with the highest (top 30%) vs. lowest (bottom 30%) expression for the
given RUNX family member is shown; the Log-Rank test p value between the high and low groups is
indicated.
3.4 Discussion
In my studies, I have used tumor-specific changes of the DNA methylation status within distal
enhancer regions to provide insight into the mechanisms of gene expression, transcription factor
networks, and tumor classification. I have shown that this can be a powerful approach for generating
hypotheses about master regulators in cancer, and I propose that ELMER analysis be applied along with
other hypothesis-generating approaches in high throughput cancer genomics. For the TCGA Pan-Cancer
dataset, I provide to the community prioritized lists of putative enhancer-target gene pairs for future
validation, and lists of site-specific transcription factors that should be further investigated for their role in
the development and progression of specific tumor types.
Starting with a set of ~100,000 distal enhancer probes, I identified tens of thousands of enhancer
regions that showed changes in methylation status in primary human tumors (Figure 3.8). I identified
many more hypomethylated (ostensibly activated) enhancers than hypermethylated (ostensibly
deactivated) enhancers and have focused mainly on the hypomethylated enhancers in this study. I
identified from 5,147 to 26,787 hypomethylated probes in different tumor types, corresponding to
between 4,841 and 21,374 distinct enhancer regions. However, only a smaller subset of these
hypomethylated enhancer probes (a total of 6,559 for all tumor types combined) could be linked to a
putative target gene (based on expression levels of the 10 nearest genes upstream and 10 nearest genes
downstream of the enhancer), ranging from a low of ~200 enhancer-putative target gene pairs in acute
myelogenous leukemia to ~4,000 enhancer-putative target gene pairs in lung squamous cell carcinomas. I feel
that the expression filtering step is important for identifying those regions truly associated with enhancer-
specific methylation, as other long-range methylation changes (such as global hypomethylation (Bergman
and Cedar, 2013)) may also affect enhancer probes.
Figure 3.8 Identification of in vivo TF networks.
The network includes upstream TFs and downstream enhancers and gene targets. The innermost black circle
represents the 102,518 distal enhancer probes from the HM450 platform. The next level (labeled Hypo)
shows the number of hypomethylated distal enhancer probes identified in each cancer type. The third
level (labeled Paired hypo) shows the number of hypomethylated probes that were significantly linked to
a putative target gene in each cancer type. The number in the outermost level corresponds to the number
of putative target genes (each linked by expression level to a specific hypomethylated enhancer) predicted
to be regulated by the indicated TF (fourth level); where multiple TF family members were identified,
only the most strongly associated family member is listed.
I found that most of the putative linkages between enhancer probes and local gene expression
were cancer type-specific and that within each cancer type, most enhancers correlated with the expression
of only one gene. In keeping with previous looping studies, I found that the putative target gene was
typically not the nearest gene. In fact, the gene identified was the nearest gene in only ~15% of the
hypomethylated enhancer-gene pairs. As in other studies (Heintzman et al., 2007, Blattler et al., 2014), I
found that the set of all hypomethylated enhancers was composed of similar proportions of intragenic and
intergenic enhancers. I found that as compared to the intergenic enhancers, intragenic enhancers were
75% more likely to be linked to expression of the nearest TSS (which in 88% of the cases was the gene in
which it resided); see Supplementary Figure 3.4. An intragenic enhancer can loop to regulate the
“upstream” promoter of the gene in which it resides but could also act as an alternative promoter. Although I
have eliminated all known promoters from our set of distal probes, I cannot eliminate the possibility that
some of the intragenic enhancers represent as-of-yet unannotated, tumor-specific alternative promoters for
the gene in which they reside (Maunakea et al., 2010, Kowalczyk et al., 2012).
My linking method is based strictly on correlation and therefore cannot absolutely rule out
indirect (trans) interactions. For instance, if the same transcription factor or set of factors regulate both
enhancer X and enhancer Y, the methylation patterns of X and Y across samples may be so similar that I
link enhancer X to a gene that is in fact the direct target of enhancer Y. I have used high-confidence
statistical thresholds in order to rule out as many of these indirect interactions as possible. My search
within the nearest 20 genes is unbiased, so the fact that I disproportionally find linkages to the gene
nearest the enhancer probe provides strong evidence that we are identifying true direct (cis) interactions. I
have provided a robust set of predicted linkages that can serve as a starting point for future experimental
validations. Of course, I realize that we are working under a largely untested assumption that
anti-correlation between methylation of an enhancer and the expression level of a nearby gene indicates functional regulation.
While this and prior correlative studies (Aran and Hellman, 2013, Aran et al., 2013, Hovestadt et al.,
2014) provide strong supporting evidence for this, further experimental studies (e.g. using CRISPR/Cas9
to delete the enhancers in appropriate tumor cell lines, followed by RNA-seq) will be needed to determine
with certainty that the enhancers regulate their putative target genes, and what degree of correlation is
required to infer functionality. Similarly, a comparison between our predicted enhancer-target pairs and
global analysis of long-range chromatin looping would be of interest. Unfortunately, chromatin
conformation assay data is not available for any of the tumor tissue samples and, in fact, very few studies
of global chromatin looping have been completed for cancer cell lines. However, I have identified a set of
chromatin loops derived from deep-sequenced ChIA-PET data from MCF7 cells (Li et al., 2012a).
Although MCF7 cells are not representative of all breast cancers (and are cultured cells, not tumor
tissues), I did find that 166 of the 2,038 enhancer probe-gene pairs I identified in breast cancer tumors (~8%)
were also identified as loops in the MCF7 ChIA-PET data. This was an almost 4-fold enrichment over
randomized enhancer probe-gene pairs (see Supplementary Data File 3.9 for an enrichment analysis,
along with a complete list of BRCA enhancer-gene pairs falling within loops in MCF7 cells). I note that
the various assays used to study looping are not yet optimized and do not always identify the same sets of
loops (Raviram et al., 2014); in addition, some loops may not be related to transcriptional regulation.
Thus, enhancer-gene pairs identified by expression assays are not necessarily concordant with the sets of
promoter-enhancer loops identified by chromatin conformation assays. Future comparisons between
indirect (i.e. correlative) mapping of enhancer-gene interactions of the type we described here, with direct
physical mapping of enhancer-gene interactions, will be important to help to resolve the different
mechanisms involved. However, in addition to the genome-wide confirmation by ChIA-PET, I note that
at least two of the putative enhancer-gene pairs from our analysis have been studied in functional models
confirming our results. The putative CCND1 enhancer we identified in breast tumors (Figure 3.6C) was
shown to directly regulate the CCND1 gene in response to estradiol in breast cancer cells (Eeckhoute et
al., 2015) and a putative MYC enhancer we identified in colon tumors (Supplementary Figure 3.5) was
shown to be directly responsible for MYC expression in colon cancer cells (Yochum et al., 2008), and in
vivo in a mouse model of colorectal cancer (Konsavage and Yochum, 2014).
I realize that the relationship between TF binding and DNA methylation can be complex (Blattler
and Farnham, 2013). For example, reduced DNA methylation in an enhancer region in a tumor cell
relative to a normal cell could allow a TF to bind and regulate a target gene in a tumor-specific manner
without changes in the expression level of that TF in the tumor. However, it is likely that increased levels
of a TF in a tumor can result in higher binding at a partially methylated enhancer, directly leading to loss
of DNA methylation (Stadler et al., 2011). Based on this second mechanism, I have attempted to identify
TFs that regulate the target genes of enhancers that are hypomethylated in tumors. First, I identified a list
of site-specific TF binding motifs that are enriched within the enhancers linked to putative target genes.
Then, by examining the expression patterns of each of the TF family members expected to bind to these
motifs, I have predicted the TF that regulates specific sets of genes in the different cancer types (Figure
3.8). For example, in bladder cancer (BLCA) we have provided a list of 65, 208, and 65 genes that may
be regulated by POU3F1, FOXA1, or CEBPA, respectively, by binding to a specific hypomethylated
enhancer. In all, utilizing enhancer methylation patterns, expression of putative target genes, motif
enrichment, and expression of TF family members that bind to the motif, we have derived a list of 4,280
enhancer-TF-putative target gene linkages.
Some of the cancer type-specific TF networks I show in Figure 3.8 are already known to have a
functional role in the same tumor type, such as PU.1 in AML (Rosenbauer et al., 2004) and TCF7L2 in
colorectal cancer (TheCancerGenomeAtlas, 2012c, Bass et al., 2011, Frietze et al., 2012, Sur et al., 2012,
Pomerantz et al., 2009). Two of the four TFs we identified in squamous cell lung cancer (LUSC), TP63
and SOX2, are oncogenes that are overexpressed in LUSC through genomic amplification
(TheCancerGenomeAtlas, 2012a, Massion et al., 2003). Recently, SOX2 and TP63 were shown to
interact functionally and co-localize to a large number of genomic binding sites in squamous cell lung
cancer (Watanabe et al., 2014). In a number of cases, incorporating TF expression data allowed us to
resolve between different members of the same family that would be indistinguishable by binding motif
alone. For instance, FOXA1 clearly appears to be responsible for hypomethylation of FOX-containing
enhancers in breast (BRCA) and bladder (BLCA) cancers, while FOXA2 appears to be responsible in
endometrial cancer (UCEC). Other TF networks I identified, such as RUNX1/2 and their association with poor
outcome in kidney cancer, have never been reported and will form the basis for future studies.
The method I describe is based on detecting methylation and expression differences between
samples of the same tumor type, and is therefore aimed at identifying changes that co-occur within
particular subsets of cases. For instance, I found that GATA-containing enhancer hypomethylation
occurred primarily in the subset of breast cancer cases belonging to the Luminal subtype, which also had
high expression of the GATA3 gene (Figure 3.6B-C). While GATA3 is a well-studied case, my method
can be applied to identify, understand, and find biomarkers for novel molecular subtypes. Understanding
the genome-wide transcriptional consequences of molecular subtypes will be particularly relevant for
those that are defined by genetic mutation of transcriptional regulators; indeed, transcription factors make
up the largest functional class within the list of 127 cancer genes with so-called “driver” mutations
identified by TCGA (Kandoth et al., 2013). A number of the altered transcription factor networks I
identified using ELMER (Figure 3.8) were also present within the 30 or so transcription factors included
in this TCGA driver gene list. These TFs included FOXA1, FOXA2, GATA3, NFE2L2, and SOX17.
Intriguingly, ELMER often identified a particular TF in the same cancer type or types where it is most
frequently mutated. For instance, FOXA1 is most frequently mutated in Breast and Bladder cancer, and
ELMER identified it in these specific cancers. Likewise, FOXA2 and SOX17 are primarily mutated in
endometrial cancers, and ELMER identified network alterations specifically in this cancer type (UCEC).
NFE2L2 is most frequently mutated in lung squamous cell carcinoma (LUSC), the same cancer type
where ELMER detected NFE2L2 alterations. It will take additional work to understand the relationship
between genetic mutations of TFs and epigenetic/transcriptomic changes in each of these different
examples, but the identification of important cancer driver genes underscores the power of studying
enhancers, which sit at the cis-regulatory interface between transcription factors, epigenetic modifiers,
and downstream effector genes.
I also note that in some cases, transcription factors that are not expected to bind to the specific
motif being analyzed were identified as being highly correlated with the degree of enhancer
hypomethylation. In all, I identified 186 TFs frequently correlated with multiple motifs that do not
correspond to the known motif for that TF family (Supplementary Data File 3.10). These correlations
could be due to indirect effects caused by TF networks. For example, transcription factors regulated by
GATA3 may show a similar correlation of expression with the hypomethylated probes in BRCA as does
GATA3 itself. Another possible cause is suggested by the case of AP-1. Our results indicate that
hypomethylation of AP-1-containing enhancers is a common feature of many or most cancer types
(including 9 of our 10 cancer types, see Figure 3.6A); this confirms earlier whole-genome observations in
colorectal cancer (Berman et al., 2012). While the AP-1 motif is classically described as a binding
sequence for FOS/JUN dimers, it is found to be enriched in many ChIP-seq datasets, including those
using antibodies that recognize factors other than FOS or JUN family members (Worsley Hunt and
Wasserman, 2014). Phosphorylation of JUN can lead to histone acetylation at AP-1 motif-containing
enhancers by inhibiting their association with the Mbd3 component of the NuRD complex (Aguilera et
al., 2011). This could in turn allow binding of other positive transcriptional regulators, activation of
downstream genes and a proliferative expression program. Because JUN activity is regulated post-
transcriptionally, it is logical that my method (which is based on expression) would miss JUN itself, and
instead identify the positive regulators binding these regions (which are often cell-type specific). For
instance, the most strongly associated TF with the AP-1 motif in kidney cancer is RUNX1, while in breast
cancer it is FOXA1, suggesting that many of the AP-1 motif-containing sites may require AP-1 dependent
de-repression along with positive RUNX1/FOXA activation.
Also included in the list of 186 “commonly correlated” TFs are around 50 zinc finger domain-
containing TFs (known as ZNFs). Although ZNFs are the most abundant class of human site-specific TFs,
comprising around half of all site-specific TFs (Tupler et al., 2001, Razin et al., 2012, Vaquerizas et al.,
2009), few of them have been well studied. One of the commonly correlated factors was ZNF703, which
correlated with 16 different motifs in the BRCA samples. Interestingly, high expression of ZNF703 has
been shown to correlate with poor prognosis in patients with luminal B breast cancer (Reynisdottir et al.,
2013). I suggest that our analyses can point to a role for other ZNFs in tumorigenesis. In fact, 11 of the
identified ZNFs showed associations with survival of the cancer in which they were identified
(Supplementary Data File 3.5). For example, ZNF273 was correlated with 4 motifs in CRC and
ZNF683 was correlated with 9 motifs in KIRC; neither of these TFs has ever been associated with cancer.
However, high expression of ZNF273 and ZNF683 is strongly correlated with poor
survival rates in colorectal and kidney cancers, respectively. Most of the time, the 186 “commonly
correlated” TFs showed cancer type-specific correlations. However, one factor (GRHL2) was identified in
the top 1% of all correlations for 31 different motifs spread amongst 5 of the 10 different cancer types
studied. GRHL2 has been shown to directly bind and activate the hTERT promoter and has been
suggested to be involved in telomerase activation during cellular immortalization (Kang et al., 2009).
Perhaps GRHL2 plays an important role in tumor development in many cancer types.
The results I describe use motif analysis primarily to help identify the transcription factors
responsible for enhancer hypomethylation. However, the most important output of this work may actually
be the identification of enhancers in which mutations in individual transcription factor binding sites can
be responsible for cancer risk or cancer progression. A number of studies have shown that population risk
alleles for cancer reside preferentially in enhancer regions (ENCODE_Project_Consortium, 2012, Schaub
et al., 2012) (Maurano et al., 2012, Akhtar-Zaidi et al., 2012, Hardison, 2012, Shannon, 2014, Yao et al.,
2014) and a recent paper demonstrated that these could be identified in breast cancer by combining DNA
methylation and chromatin conformation capture data to identify putative enhancers (Aran and Hellman,
2013). Somatic enhancer mutations are predicted to affect cancer progression, although these have not yet
been identified due to the overwhelming use of exon sequencing as a means to identify new cancer
mutations. The recent availability of whole-genome sequencing of tumors has started to allow the
identification of non-coding mutations, which have been shown to affect transcription factor binding sites
(Fredriksson et al., 2014, Huang et al., 2013, Weinhold et al., 2014). Methods like ELMER, which can
identify in vivo enhancer regions in tumors, will be essential for analyzing non-coding cancer mutations
arising from WGS studies.
3.5 Conclusions
Although my study is not comprehensive due to the nature of the DNA methylation platform used
by TCGA (which only contains coverage of 15% of known enhancers) and because enhancers have not
yet been mapped in all normal and tumor cell types, my analyses have allowed me to identify a number of
cancer type-specific transcriptional regulators, along with the cis-regulatory sequences mediating effects
on target genes. Large-scale identification of such cis-regulatory regions will be critical for understanding
the effects of non-coding genetic polymorphisms on cancer risk and non-coding somatic mutations on
cancer progression (TheCancerGenomeAtlas, 2012c, Rosenbauer et al., 2004, Bass et al., 2011).
Complete tumor methylation profiles using whole-genome bisulfite sequencing (Berman et al., 2012,
Hovestadt et al., 2014, Hansen et al., 2011) are rapidly becoming available, and these will dramatically
increase the power of the ELMER approach to reconstruct complete transcription factor networks and
identify important cis-regulatory regions.
3.6 Methods
3.6.1 Availability of source code and R package
All source code is available as an R package, ELMER, downloadable from the main
Bioconductor repository (http://www.bioconductor.org/) or from our GitHub repository
(https://github.com/lijingya/ELMER.git). Vignettes illustrating the use of the functions are available as
part of the Bioconductor package, along with an example replicating the results described in this paper
using the ELMER function TCGA.pipe. A user manual and tutorial can be downloaded from the GitHub
repository here: https://github.com/lijingya/ELMER/blob/master/vignettes/vignettes.pdf, and a full
manual can be downloaded here:
https://github.com/lijingya/ELMER/blob/master/inst/doc/ELMER_manual.pdf.
3.6.2 DNA Methylation and RNA-seq Datasets.
TCGA level 3 DNA methylation data based on the Illumina Infinium HumanMethylation450
BeadArray platform was downloaded from
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/. Only the samples whitelisted by
TCGA for Pan-Cancer Analysis Working Group were used in the study. The whitelist can be downloaded
from Sage Bionetworks Synapse (https://www.synapse.org/#) with identifier syn1571603. The version
numbers and final sample IDs for each cancer type are listed in Supplementary Data File 3.1. The DNA
methylation level at each CpG is referred to as a beta (β) value, calculated as (M/(M+U)), where M
represents the methylated allele intensity and U the unmethylated allele intensity, which are normalized
using the TCGA standard pipeline. Beta values range from 0 to 1, reflecting the fraction of methylated
alleles at each CpG in each tumor; beta values close to 0 indicate low levels of DNA methylation
and beta values close to 1 indicate high levels of DNA methylation. Since there are no available normal
tissues for Acute Myeloid Leukemia (LAML) and Glioblastoma multiforme (GBM) in TCGA, we also
downloaded Infinium HM450K DNA methylation data from publicly available sources as normal tissue
controls for these two cancer types. A set of 58 sorted glial cell samples from GEO (accession number
GSE41826) was used as normal reference samples for glioblastoma. A set of 11 sorted blood samples
from GEO (accession number GSE49618) was used for normal reference samples for leukemia. These
data were generated at the USC Epigenome Center and were processed through the same data analysis
pipeline that was used to create the TCGA Level 3 data files (all TCGA data was also generated by the
USC Epigenome Center). The sample IDs are also listed in Supplementary Data File 3.1.
TCGA Level 3 RNA-seq data was downloaded from
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/. The version number of each
package downloaded is listed in Supplementary Data File 3.1. TCGA uses gene-level expression
values, meaning any alternative isoforms are included in a single normalized RSEM expression value.
TCGA data production and analysis pipelines are described elsewhere, but a brief description follows: all
data were generated on the Illumina HiSeq platform, with the exception of UCEC, which was generated
on the Illumina GAII platform. Within each cancer type, data were mapped with MapSplice and
quantitated with RSEM (RNA-seq by Expectation Maximization). RSEM outputs expression values that
are normalized across samples, so that the third quartile for each sample equals 1000. Entrez gene IDs
were used for mapping to genomic locations using GenomicRanges (Lawrence et al., 2013). The final
RNA-seq sample IDs used in our analyses are listed in Supplementary Data File 3.1.
3.6.3 Selecting enhancer probes
Probes overlapping SNPs are removed as part of the standard TCGA Level 3 pipeline. Probes
located less than 2kb from an annotated transcription start site in GENCODE v.15 were filtered out to
remove promoter regions from our analysis. ENCODE/REMC chromHMM data were downloaded from
https://sites.google.com/site/epigenomeroadmapawg/project-updates/finalsignaltracksandalignmentfiles
and any HM450 probes falling within the genomic regions annotated as EnhG1, EnhG2, EnhA1 or
EnhA2 were selected. FANTOM5 data was downloaded from http://enhancer.binf.ku.dk/presets/ and any
HM450 probes falling within regions annotated as eRNA were selected. This resulted in 102,518
enhancer probes, which are listed in Supplementary Data File 3.2. This functionality is implemented in
the get.feature.probe function of the ELMER Bioconductor package.
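The selection logic above can be sketched as two filters applied to each probe: a distance filter against annotated transcription start sites and an overlap filter against enhancer annotations. The sketch below is written in Python purely for illustration; it is not the get.feature.probe implementation, and every name, coordinate and data structure in it is a hypothetical stand-in for the GENCODE, chromHMM and FANTOM5 inputs.

```python
from bisect import bisect_left

TSS_EXCLUSION_BP = 2000  # probes closer than this to a TSS are treated as promoter probes


def is_distal(chrom, pos, tss_by_chrom, min_dist=TSS_EXCLUSION_BP):
    """True if the probe is at least min_dist bp from every TSS on its chromosome."""
    sites = tss_by_chrom.get(chrom, [])          # sorted TSS positions (assumed input format)
    i = bisect_left(sites, pos)
    neighbours = sites[max(i - 1, 0):i + 1]      # nearest TSS on each side of the probe
    return all(abs(pos - s) >= min_dist for s in neighbours)


def in_any_region(chrom, pos, regions_by_chrom):
    """True if pos lies inside any [start, end) enhancer interval (linear scan for clarity)."""
    return any(start <= pos < end for start, end in regions_by_chrom.get(chrom, []))


def select_enhancer_probes(probes, tss_by_chrom, regions_by_chrom):
    """Keep probes that are distal to all TSSs and overlap an annotated enhancer region."""
    return [
        name
        for name, (chrom, pos) in probes.items()
        if is_distal(chrom, pos, tss_by_chrom) and in_any_region(chrom, pos, regions_by_chrom)
    ]


# Toy coordinates, not real annotation
probes = {
    "probe_promoter": ("chr1", 10500),     # 500 bp from a TSS, so it is removed
    "probe_enhancer": ("chr1", 50200),     # distal and inside an enhancer region, so it is kept
    "probe_distal_only": ("chr1", 90000),  # distal but outside every enhancer region
}
tss_by_chrom = {"chr1": [10000, 70000]}
regions_by_chrom = {"chr1": [(50000, 51000)]}
selected = select_enhancer_probes(probes, tss_by_chrom, regions_by_chrom)
```

With genome-scale inputs the linear scan in in_any_region would normally be replaced by an interval index; it is kept here only to make the two filters easy to read.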
3.6.4 Identifying enhancer probes with cancer-specific DNA methylation changes
Each of the 10 cancer types was processed independently to identify cancer-specific DNA
methylation changes. For each enhancer probe, I first ranked tumor samples and normal samples (within
the cancer type) by their DNA methylation beta values. To identify hypomethylated probes, I compared
the lower normal quintile (20% of normal samples with the lowest methylation) to the lower tumor
quintile (20% of tumor samples with the lowest methylation), using an unpaired one-tailed t-test. Only the
lower quintiles were used because I did not expect all cases to be from a single molecular subtype, and we
sought to identify methylation changes within cases from the same molecular subtype. 20% (i.e. a
quintile) was picked as a cutoff to include high enough sample numbers to yield t-test p-values that could
overcome multiple hypothesis correction, yet low enough to be able to capture changes in individual
molecular subtypes occurring in 20% or more of the cases. This number can be set arbitrarily as an input
to the get.diff.meth function in the ELMER package, and should be tuned based on sample sizes in
individual studies. The one-tailed t-test was used to rule out the null hypothesis $\mu_{\mathrm{tumor}} \geq \mu_{\mathrm{normal}}$, where
$\mu_{\mathrm{tumor}}$ is the mean methylation within the lowest tumor quintile and $\mu_{\mathrm{normal}}$ is the mean within the lowest
normal quintile. Raw p-values were adjusted for multiple hypothesis testing using the Benjamini-Hochberg
method, and probes were selected when they had an adjusted p-value less than 0.01. For additional
stringency, probes were only selected if the methylation difference $\Delta = \mu_{\mathrm{normal}} - \mu_{\mathrm{tumor}}$ was greater than 0.3.
This technique is illustrated in Figure 3.1B, and carried out in the get.diff.meth function of the ELMER
package. The same method was used to identify hypermethylated probes, except we used upper tumor
quintile and upper normal quintile, and chose the opposite tail in the t-test. The full set of
hypermethylated and hypomethylated probes we identified are provided in Supplementary Data File
3.3, and can be replicated using the TCGA.pipe vignette in the ELMER package.
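As a rough illustration of the hypomethylation test described in this section, the sketch below (Python, standard library only) computes the quintile-based difference and a t statistic for one probe, then applies the Benjamini-Hochberg adjustment and the two selection thresholds across probes. It is not the get.diff.meth code: the function names are hypothetical, the Welch form of the t statistic is an assumption (the text only specifies an unpaired one-tailed t-test), and the raw p-values in the example are made-up inputs rather than values derived from a t distribution.

```python
import math
from statistics import mean, variance


def lowest_fraction(values, fraction=0.2):
    """Return the lowest `fraction` of values (at least two, so a variance exists)."""
    k = max(2, int(len(values) * fraction))
    return sorted(values)[:k]


def delta_and_t(tumor_betas, normal_betas, fraction=0.2):
    """Delta = mean(lowest normal quintile) - mean(lowest tumor quintile), plus a Welch t."""
    t_low = lowest_fraction(tumor_betas, fraction)
    n_low = lowest_fraction(normal_betas, fraction)
    delta = mean(n_low) - mean(t_low)
    se = math.sqrt(variance(t_low) / len(t_low) + variance(n_low) / len(n_low))
    t_stat = (mean(t_low) - mean(n_low)) / se  # negative when tumors are less methylated
    return delta, t_stat


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_hypomethylated(deltas, raw_pvalues, alpha=0.01, min_delta=0.3):
    """Indices of probes with adjusted p < alpha and delta > min_delta."""
    adjusted = benjamini_hochberg(raw_pvalues)
    return [i for i, (d, q) in enumerate(zip(deltas, adjusted)) if q < alpha and d > min_delta]


# Toy beta values for a single probe (10 tumors, 10 normals)
tumor = [0.10, 0.15, 0.55, 0.60, 0.62, 0.65, 0.70, 0.72, 0.75, 0.80]
normal = [0.70, 0.72, 0.74, 0.75, 0.76, 0.78, 0.80, 0.81, 0.83, 0.85]
delta, t_stat = delta_and_t(tumor, normal)

# Toy per-probe summaries for four probes (hypothetical raw p-values)
deltas = [0.585, 0.35, 0.10, 0.40]
raw_p = [0.0001, 0.002, 0.0005, 0.2]
hypo_probes = select_hypomethylated(deltas, raw_p)
```

In the toy data only two tumors carry the low-methylation state, yet the probe still shows a large difference because only the lowest quintiles are compared, which is the motivation given above for not averaging over all cases.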
3.6.5 Linking enhancer probes with methylation changes to target genes with expression changes
For additional stringency and to avoid correlations due to non-cancer contamination, I selected
only those enhancer probes that had differential methylation as defined above, and where at least 5% of
all samples (combining tumor and normal) had beta values > 0.3. Then, for each of these differentially
methylated enhancer probes, the closest 10 upstream genes and the closest 10 downstream genes were
tested for correlation between methylation of the probe and expression of the gene. To select these genes,
the probe-gene distance was defined as the distance from the probe to a transcription start site specified
by the TCGA RNA-seq Level 3 data files. I used the Level 3 TCGA RNA-seq data files; these represent
expression at the gene level, and merge any alternate transcript isoforms into a single expression value for
each gene. Thus, exactly 20 statistical tests were performed for each probe, as follows. For each probe-
gene pair, the samples (all tumors and normals within a particular cancer type) were divided into two
groups: the M group, which consisted of the upper methylation quintile (the 20% of samples with the
highest methylation at the enhancer probe), and the U group, which consisted of the lowest methylation
quintile (the 20% of samples with the lowest methylation). The 20th-percentile cutoff is a configurable parameter
in the get.pair function of ELMER. I used 20% as a balance that would allow me to identify changes
in a molecular subtype making up a minority (i.e. 20%) of cases, while also yielding enough statistical
power to make strong predictions. For each candidate probe-gene pair, the Mann-Whitney U test was
used to test the null hypothesis that overall gene expression in group M was greater than or equal to that in
group U. This non-parametric test was used in order to minimize the effects of expression outliers, which
can occur across a very wide dynamic range. For each probe-gene pair tested, the raw p-value $P_r$ was
corrected for multiple hypothesis testing using a permutation approach as follows (implemented in the get.permu
function of the ELMER package). The gene in the pair was held constant, and 10,000 random methylation
probes were used to perform the same one-tailed U test, generating a set of 10,000 permutation p-values
($P_p$). I chose the 10,000 random probes only from among those that were “distal” (greater than 2kb from
an annotated transcription start site), in order to make these null-model probes qualitatively similar to the
probe being tested. I only used non-enhancer probes, as using enhancer probes would introduce large
numbers of co-regulated enhancers. An empirical p-value $P_e$ was calculated using the following
formula (which introduces a pseudo-count of 1):
$$P_e = \frac{\mathrm{num}(P_p \leq P_r) + 1}{10001}$$
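The pairing test and the permutation correction can be sketched as follows, in Python and using only the standard library. This is a simplified illustration rather than the get.pair or get.permu code: the function names are hypothetical, the Mann-Whitney p-value uses a normal approximation without tie or continuity correction (an assumption made to avoid external dependencies), and the null probes are simulated with random numbers instead of being drawn from real distal non-enhancer probes. Only 200 null probes are used to keep the example fast, so the denominator is 201 rather than 10001.

```python
import random
from statistics import NormalDist


def mann_whitney_less(m_group, u_group):
    """One-tailed p-value for the alternative 'expression in M is lower than in U'."""
    n1, n2 = len(m_group), len(u_group)
    # U statistic for the M group: pairs where M exceeds U, with ties counted as one half
    u_stat = sum((m > u) + 0.5 * (m == u) for m in m_group for u in u_group)
    mean_u = n1 * n2 / 2
    sd_u = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    return NormalDist().cdf((u_stat - mean_u) / sd_u)


def pair_p_value(methylation, expression, fraction=0.2):
    """Raw p-value for one probe-gene pair from the extreme methylation quintiles."""
    k = max(1, int(len(methylation) * fraction))
    order = sorted(range(len(methylation)), key=lambda i: methylation[i])
    u_group = [expression[i] for i in order[:k]]   # samples with the lowest methylation
    m_group = [expression[i] for i in order[-k:]]  # samples with the highest methylation
    return mann_whitney_less(m_group, u_group)


def empirical_p(raw_p, permutation_ps):
    """P_e = (num(P_p <= P_r) + 1) / (number of permutations + 1)."""
    hits = sum(p <= raw_p for p in permutation_ps)
    return (hits + 1) / (len(permutation_ps) + 1)


# Toy data: 25 samples in which expression falls as probe methylation rises
n_samples = 25
methylation = [i / (n_samples - 1) for i in range(n_samples)]
expression = [100 - 3 * i for i in range(n_samples)]
p_raw = pair_p_value(methylation, expression)

# Simulated null probes (stand-ins for random distal non-enhancer probes)
rng = random.Random(0)
null_ps = [
    pair_p_value([rng.random() for _ in range(n_samples)], expression)
    for _ in range(200)
]
p_emp = empirical_p(p_raw, null_ps)
```

Because of the pseudo-count, the smallest empirical p-value obtainable with N permutations is 1/(N+1), so the number of permutations sets the resolution of the corrected p-values.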
3.6.6 ChIA-PET analysis
MCF7 ChIA-PET linkage pairs were taken from a previous publication (Li et al., 2012a). The
random pairs were generated by randomly selecting the same number of probes from the set of distal
enhancer probes, and pairing each with one or more of the 20 adjacent genes; the number of links made
for each random probe was identical to the corresponding “true” probe. Thus, the random linkage set has
both the same number of probes and the same number of linked genes as the true set. 100 such random
datasets were generated to arrive at a 95% confidence interval (mean +/- 1.96 × SD).
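The interval calculation described above amounts to summarizing the loop-overlap counts of the random datasets by their mean and standard deviation. The Python sketch below shows the arithmetic; the observed count (166 of 2,038 pairs) is taken from the Discussion, whereas the 100 random overlap counts are hypothetical placeholder values that do not reproduce the actual randomization, and the function name is likewise an assumption.

```python
from statistics import mean, stdev


def random_overlap_interval(random_counts, z=1.96):
    """Mean and 95% interval (mean +/- z * SD) of loop overlaps across random datasets."""
    centre = mean(random_counts)
    spread = z * stdev(random_counts)
    return centre, centre - spread, centre + spread


observed_overlaps = 166  # BRCA enhancer-gene pairs falling within MCF7 ChIA-PET loops
total_pairs = 2038       # all BRCA enhancer-gene pairs tested

# Hypothetical overlap counts for 100 random datasets (illustrative values only)
random_counts = [40 + (i % 5) for i in range(100)]

centre, low, high = random_overlap_interval(random_counts)
observed_fraction = observed_overlaps / total_pairs  # roughly 8% of pairs
fold_enrichment = observed_overlaps / centre         # observed relative to the random mean
is_enriched = observed_overlaps > high               # observed lies above the random interval
```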
3.6.7 Gene Ontology (GO) enrichment analysis
Genes associated with hypo- or hypermethylated enhancer probes in more than one cancer type
were selected for GO analysis. GO analyses were performed using the R package “topGO” (Alexa and
Rahnenfuhrer, 2010). The classic Fisher test was used to generate enrichment p-values. To select the GO
terms that pass a significance cut-off, p-values were adjusted using the Benjamini-Hochberg method; only
those GO terms with a p-value < 0.01 and a fold change > 1.5 are shown in Figure 3.5.
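For readers unfamiliar with the classic Fisher test used for term enrichment, the one-sided version is equivalent to the upper tail of a hypergeometric distribution. The Python sketch below illustrates that calculation together with one plausible definition of fold change (the fraction of selected genes carrying the term divided by the fraction in the background); that definition and all names are assumptions, since the text does not spell out the fold-change formula, and the code is not topGO's implementation.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the GO term."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


def fold_change(k, n, K, N):
    """Term frequency among selected genes relative to the background frequency."""
    return (k / n) / (K / N)


# Toy numbers: 20 background genes, 5 annotated with the term, 4 selected, 3 of them annotated
p_value = fisher_enrichment_p(k=3, n=4, K=5, N=20)
fc = fold_change(k=3, n=4, K=5, N=20)
passes_fold_cutoff = fc > 1.5
```

The resulting p-values for all tested terms would then be adjusted with the Benjamini-Hochberg method before the 0.01 threshold is applied.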
3.6.8 Motif analyses
I used FIMO (Grant et al., 2011) with a p-value <1e–4 to scan a +/- 100bp region around each
probe using Factorbook motif position weight matrices (PWMs) (Wang et al., 2012, Wang et al., 2013b)
and JASPAR core human motif PWMs generated from the R package MotifDb (Shannon, 2014). For each
probe set tested (i.e. the list of gene-linked hypomethylated probes in a given cancer type), a motif
enrichment Odds Ratio and a 95% confidence interval were calculated using the following formulas:
$$p = \frac{a}{a+b}$$
$$P = \frac{c}{c+d}$$
$$\text{Odds Ratio} = \frac{p/(1-p)}{P/(1-P)}$$
$$SD = \sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}$$
$$\text{lower boundary of 95\% confidence interval} = \exp(\ln OR - SD)$$
where a is the number of probes within the selected probe set that contain one or more motif
occurrences; b is the number of probes within the selected probe set that do not contain a motif
occurrence; c and d are the same counts within the entire enhancer probe set. A probe set was considered
significantly enriched for a particular motif if the lower boundary of the 95% confidence interval of the Odds Ratio was greater
than 1.1, and the motif occurred at least 10 times in the probe set. As described in the text, Odds Ratios
were also used for ranking candidate motifs. This analysis is implemented in the get.enriched.motif
function of the ELMER package.
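The enrichment calculation can be sketched in Python as follows. The sketch assumes the formulas read p = a/(a+b), P = c/(c+d), Odds Ratio = [p/(1-p)]/[P/(1-P)], SD = sqrt(1/a + 1/b + 1/c + 1/d), and lower boundary = exp(ln OR - SD). It also assumes that the "at least 10 times" criterion can be approximated by requiring a >= 10. The function names are hypothetical and this is not the ELMER implementation.

```python
import math


def motif_enrichment(a, b, c, d):
    """Return (odds_ratio, lower_boundary) for one motif and one probe set.

    a, b: probes in the selected set with / without a motif occurrence
    c, d: the same counts within the entire enhancer probe set
    """
    p = a / (a + b)          # motif frequency in the selected probe set
    big_p = c / (c + d)      # motif frequency in the background set
    odds_ratio = (p / (1 - p)) / (big_p / (1 - big_p))
    sd = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - sd)
    return odds_ratio, lower


def is_enriched(a, b, c, d, min_lower=1.1, min_count=10):
    """Apply the two cut-offs described in the text (min_count on a is an assumption)."""
    _, lower = motif_enrichment(a, b, c, d)
    return lower > min_lower and a >= min_count


if __name__ == "__main__":
    print(motif_enrichment(20, 80, 100, 900))
    print(is_enriched(20, 80, 100, 900))
```

Note that the Odds Ratio defined this way is algebraically equal to (a * d) / (b * c), which provides a quick sanity check on the counts.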
3.6.9 Associating TF expression with TF binding motif methylation
For each motif considered to be enriched within a particular probe set, I compared the average
DNA methylation at all distal enhancer probes within +/- 100 bp of a motif occurrence to the expression
of 1,777 human TFs (Ravasi et al., 2010, with further refinements; see Supplementary Data File
3.11). A statistical test was performed for each motif-TF pair, as follows. The samples (all tumor and
normal samples within a particular cancer type) were divided into two groups: the M group, which consisted of the
20% of samples with the highest average methylation at all motif-adjacent probes, and the U group, which
consisted of the 20% of samples with the lowest methylation. The 20th percentile cutoff is a parameter to
the get.TFs function of the ELMER package, and was set to allow for identification of molecular subtypes
present in 20% of cases. For each candidate motif-TF pair, the Mann-Whitney U test was used to test the
null hypothesis that overall gene expression in group M was greater than or equal to that in group U. This
non-parametric test was used in order to minimize the effects of expression outliers, which can occur
across a very wide dynamic range. For each motif tested, this resulted in a raw p-value (P_r) for each of the
1,777 TFs. All TFs were ranked by -log10(P_r), and those falling within the top 5% of this ranking
were considered candidate upstream regulators. The best upstream TFs for each of these cases were
automatically extracted as high-value candidates, and are presented in Figure 3.8. These high-value
candidates are also shown in detail in Supplementary Data Files 3.6 and 3.10.
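The grouping and test described in this section can be illustrated with the following Python sketch. It is a simplified stand-in rather than the get.TFs implementation: the p-value uses a normal approximation to the Mann-Whitney U statistic without tie or continuity corrections, and all function names are hypothetical.

```python
import math


def split_m_u(mean_methylation, fraction=0.2):
    """Return (M, U): indices of the most and least methylated samples."""
    n = len(mean_methylation)
    k = max(1, int(n * fraction))
    order = sorted(range(n), key=lambda i: mean_methylation[i])
    return order[-k:], order[:k]


def mann_whitney_less(x, y):
    """One-sided test of H0: x >= y against H1: x < y.

    Returns (U statistic for x, approximate p-value). A small p-value
    suggests that expression in x (group M) is lower than in y (group U).
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    n, m = len(x), len(y)
    mean_u = n * m / 2
    sd_u = math.sqrt(n * m * (n + m + 1) / 12)
    z = (u - mean_u) / sd_u
    # Standard normal cumulative distribution function evaluated at z.
    p = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return u, p


if __name__ == "__main__":
    meth = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.0]
    expr = [1.0, 9.0, 5.0, 7.0, 3.0, 8.0, 2.0, 6.0, 4.0, 10.0]
    m_idx, u_idx = split_m_u(meth)
    print(mann_whitney_less([expr[i] for i in m_idx], [expr[i] for i in u_idx]))
```

In practice the test would be repeated for each of the 1,777 TFs, and the TFs would then be ranked by -log10 of the resulting raw p-values.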
3.6.10 Survival analyses
A Kaplan–Meier survival analysis was used to estimate the association of TF expression with
patient survival. For each selected TF and cancer type combination, tumor samples with the
highest (top 30%) and lowest (bottom 30%) TF expression were compared using a log-rank
test. Overall survival was calculated from the date of initial diagnosis of cancer to disease-specific
death (for patients whose vital status is recorded as dead) or to the last follow-up (for patients who are
alive).
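To make the survival estimate concrete, the following Python sketch computes a Kaplan-Meier curve for one group of patients. It is illustrative only: it does not implement the log-rank comparison between groups, and the function name and input layout are assumptions.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each observed death time.

    times:  follow-up duration for each patient (e.g. in months)
    events: 1 if death was observed, 0 if the patient was censored
            (alive at last follow-up)
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            # Multiply by the conditional probability of surviving time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after t.
        at_risk -= leaving
    return curve


if __name__ == "__main__":
    print(kaplan_meier([5, 8, 12, 12, 20], [1, 0, 1, 1, 0]))
```

The same function would be applied separately to the high-expression and low-expression groups before comparing the two curves.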
3.7 Supplementary figures for chapter 3
Supplementary figure 3.1 Quantitative summary of links, probes, and genes for each cancer type.
(A) Shown are histograms representing the number of putative probe-gene pairs, the number of total
probes in the set of paired-probes, and the number of total genes in the set of paired probes for the set of
hypomethylated (top) and hypermethylated (bottom) probe-gene pairs in each cancer type. For each plot,
the number of probes identified in one or more tumor types is indicated by the colored bars. (B) Shown is
a heatmap illustrating the similarity of probe-gene pairs, probes in the pairs, and genes in the pairs
amongst the different cancer types. The color bar indicates the Odds Ratio for the similarity (overlap)
between the indicated cancer types (a higher OR indicates a more significant similarity).
Supplementary figure 3.2 Rank of putative target gene according to distance in the enhancer-gene
pairs for each cancer type.
(A) Shown is the distribution for the ranking (by distance) of each putative target gene linked to an
enhancer for enhancers that are significantly associated with more than one gene. (B) Shown is the
distribution for the ranking (by distance) of each putative target gene linked to an enhancer for each
cancer type. The left panel shows the pairs for which the enhancer is significantly associated with more
than one gene and the right panel shows the pairs for which the enhancer is significantly associated with
only one gene.
Supplementary figure 3.3 Motif enrichment heatmap.
(A) Shown are the heatmaps for motifs that are enriched in the sets of all hypomethylated probes (top
panel) and all hypermethylated probes (bottom panel). (B) Shown are the heatmaps for motifs that are
enriched in the sets of only those hypomethylated (top panel) or hypermethylated (bottom panel) probes
that are linked to putative target genes.
Supplementary figure 3.4 Proportion of intragenic vs. intergenic enhancers that regulate the
nearest gene.
Shown are bar graphs indicating the number of intergenic vs. intragenic enhancers, the number of each
category that is associated with expression of the nearest gene, and the number of intragenic enhancers
associated with expression of the nearest gene with that gene being the one in which the enhancer resides.
Supplementary figure 3.5 MYC 3’ end enhancer regulates MYC expression in colorectal cancer
tissue.
(A) Shown is a scatter plot showing DNA methylation at probes located at the 3’ end of the MYC gene vs.
the expression of MYC RNA. Each dot represents a different patient sample; red and green indicate the
tumor and normal samples, respectively. (B) Shown is the location of the MYC 3’ enhancer and the
ENCODE ChIP-seq histone and transcription factor tracks from the University of California, Santa Cruz
genome browser. The green bar indicates the location of the enhancer that has been previously identified to
regulate MYC expression in the HCT116 colon cancer cell line (PMID: 24567369; PMID: 18852287).
Supplementary figure 3.6 Survival analysis of commonly identified ZNFs.
(A) Shown is a table listing the subset of ZNFs (the entire list can be found in Supplementary Data File
3.11) which were identified in the top 1% of ranked TFs, which were significantly associated with
multiple different motifs in a specific cancer type (the number of motifs with which the TF was associated
is listed in parentheses), and whose expression level significantly correlates with patient survival. The
direction of correlation is given in the column labeled “Survival” (red and green indicate that high
expression correlates with worse or better survival, respectively) and the log-rank test P value
between the high and low expression groups is provided in the column labeled “logRankP”. (B) Shown
are example Kaplan-Meier survival curves for two ZNFs. The survival data for patients having tumors
with the highest (top 30%) and lowest (bottom 30%) transcription factor expression is shown; the Log
Rank test P value between the high and low groups is indicated.
Chapter 4. Conclusions and future studies
4.1 Summary of main findings
Three decades of studies suggest that enhancers are critical regulators of cell-type-specific gene
expression patterns and that they contribute to the deregulation of gene expression in tumors. However,
investigators still face the arduous tasks of understanding a) how to accurately identify enhancers; b) how
enhancers regulate gene expression; c) what determines enhancer cell-type specificity; d) what determines
the flexibility of distances between enhancers and target genes; and e) why complex functional and non-
functional interactions between enhancers and target genes occur. Fortunately, the advent of a plethora of
genome-wide assays based on next-generation sequencing has transformed our capacity to identify
putative enhancers and interrogate enhancer-gene regulatory networks, enabling a deeper understanding
of the roles of enhancers in cancer. In addition, the efforts of large consortia and initiatives such as ENCODE, REMC, GWAS,
FANTOM, and TCGA have generated repositories for genetic and epigenetic profiles for various cell
lines and normal or diseased tissues, enabling thousands of investigators to study enhancer regulatory
networks in normal and diseased tissues. For my studies, I have used such datasets from these repositories
to identify critical cancer-associated enhancers for various cancer types and to develop methods to
identify putative target genes, detecting novel and known enhancer regulatory networks in cancers.
4.2 Functional annotation of colorectal cancer GWAS SNPs
GWAS has identified thousands of loci associated with an increased risk of various diseases, and
each locus is associated with a single representative SNP known as the index SNP. Specifically for CRC,
25-31 index SNPs have been identified and I chose 25 index SNPs for my study. Using the
R/Bioconductor package FunciSNP, I collected all SNPs correlated with CRC index SNPs at various LD
R² values and identified biofeatures harboring these index or correlated SNPs. I identified only 13 SNPs
within exons (at an LD R² > 0.5), and these are predicted to cause damage to only 2 proteins; my studies
confirm the finding of ENCODE and others that many GWAS SNPs fall within non-coding regions. To
test the hypothesis that cancer-associated SNPs are in regulatory elements, I used epigenomic datasets of
both normal and tumor cells from ENCODE and REMC and an LD R² > 0.5 to identify 233 SNPs within
promoter-proximal regions of 17 protein-coding genes and 2 non-coding RNAs (using TSS annotation
from GENCODE) and 270 SNPs within 27 distal enhancers. Together, these functional annotations of
CRC GWAS SNPs demonstrate that regulatory elements closely associate with CRC risk. To obtain a
deeper understanding of the functions of these risk-associated distal enhancers, I applied two methods:
differential gene expression and eQTL. The eQTL approach took copy number variation and promoter
DNA methylation into consideration in order to identify putative target genes for each enhancer using
gene expression profiles of 233 colorectal tumor samples and 21 colorectal normal samples from TCGA.
In total, I identified 66 putative target genes of the risk-associated enhancers. For example, TMED6 was
identified as a putative target gene of a risk-associated enhancer located 600 kb away from
TMED6. Interestingly, my target gene predictions reinforced the finding that enhancers can regulate
genes that are far away.
4.3 Inferring enhancer regulatory networks using DNA methylation
Using ChIP-seq, DNase Hypersensitivity, or Cap Analysis of Gene Expression assays, active
enhancers can be identified by diverse epigenetic markers such as histone modifications, DNase
hypersensitive sites, TF binding and eRNA. However, these assays require careful tissue handling (to
avoid protein degradation) and relatively large numbers of cells (10^6 to 10^7) and thus have not been
applied to the identification of enhancers in primary tumor tissues. Fortunately, enhancers can also be
identified using patterns of 5-methylcytosine, which is more stable than protein and can be measured
genome-wide using as few as 1,000 cells. Therefore, I developed an R-based package called ELMER
(Enhancer Linking by Methylation/Expression Relationship) to investigate enhancer regulatory networks
in tissues using DNA methylation and gene expression profiles from primary tissues. I applied ELMER to
10 cancer types including more than 2,000 samples from TCGA to identify cancer-specific enhancers
through detection of DNA methylation changes between tumor and normal samples. These cancer-specific
enhancers were then linked to putative target genes using a non-parametric and permutation-based
statistical model, in order to evaluate the significance of anti-correlation between DNA methylation
at the enhancer and the expression of putative target genes. Interestingly, only 15-30% of enhancers were
linked to the nearest gene. Genes linked to hypermethylated (inactive) enhancers in tumors were genes
involved in development, including tumor suppressors such as TSG1 and CDKN1A. In contrast, genes
linked to hypomethylated (active) enhancers in tumors were genes that are important to cell proliferation
including oncogenes such as MYC and CCND1. To assess the accuracy of predictions, I compared my
enhancer-gene linkage predictions identified in breast cancer with ChIA-PET enhancer-promoter
interactions identified using MCF7 cells, showing that 166 of the 2038 enhancer-gene linkages had a
physical interaction between the enhancer and the predicted target gene promoter. I also identified
networks regulated by known cancer drivers, such as GATA and FOXA1 in breast cancer, and revealed
novel networks with prognostic associations, such as RUNX1 in kidney cancer. In total, I not only
derived a list of 4280 putative enhancer-TF-gene linkages but also showed that ELMER is a powerful
new tool for systematically understanding cis-regulatory networks in primary tissue.
4.4 Future directions
A. Testing enhancer-promoter linkages. Although computational methods to link enhancers
with target genes are inexpensive and quick when using publicly available datasets, one common
disadvantage inherent to association methods is that they only provide predictions of target genes.
Experimental validation of enhancer-gene linkages is essential to evaluate and improve the accuracy of
the prediction methods. Although relatively few enhancer-gene linkages have been experimentally
validated to date in mammalian genomes, newly developed technologies now enable the field to more
efficiently test enhancer activity and enhancer-gene linkage predictions. Experiments that can be used to
obtain a deeper understanding of the function of enhancers in disease and development include: 1) validation
of putative enhancers using luciferase assays or other high-throughput assays (e.g. STARR-seq); 2)
verification of enhancer-gene predictions using chromatin conformation capture technologies (e.g. Hi-C,
TCC, 4C, 5C) or enhancer deletion/repression using genomic targeting technologies (e.g. CRISPR); and
3) determining cell phenotype changes upon altering, deleting, or epigenetically modifying the sequence
of a particular enhancer. In ELMER analyses, I discovered that enhancers are linked to the nearest gene only
15% of the time when analyzing expression of the nearest 20 genes (10 on either side of the enhancer).
However, 64% of the putative linkages that I identified in breast tumors and that also had physical
interactions identified in MCF7 cells using ChIA-PET were linked to nearest genes. It is known that
chromatin conformation capture technologies tend to identify loops with shorter distances because the
high noise level from proximity ligations makes it difficult to identify long-distance interactions. Therefore,
additional experiments to determine or validate the true targets are required.
Because of the large number of breast cancer samples available in TCGA, ELMER was able to
identify a large set of useful and believable enhancer-gene linkage predictions for breast cancer. I propose
that the MCF7 cell line would be a good model to test enhancer-promoter linkages predicted by ELMER
in breast cancer because it is a widely used cell line for breast cancer studies and a large amount of
genetic and epigenetic information (such as chromatin interaction and histone modification profiles) for
MCF7 is publicly available. Since MCF7 is a luminal A type breast cancer cell line, I selected the 1908 of
2038 enhancer-gene linkages that are luminal A breast-cancer-specific, which include 210 enhancers
paired with the nearest genes (Figure 4.1). I classified these pairs into four groups using the MCF7 ChIA-
PET data described in chapter 3, which included: 1) 94 pairs that have loops identified in MCF7 ChIA-
PET and link to nearest genes; 2) 53 pairs that have loops but do not link to nearest genes; 3) 116 pairs
that do not have loops but link to nearest genes; 4) 1645 pairs which do not have loops and do not link to
nearest genes. I suggest selecting 5-10 pairs from each group and performing 4C assays and CRISPR-
mediated deletions to test the predicted linkages. These experiments would help to understand 1) how
accurate or meaningful the ChIA-PET results are; 2) how often the nearest genes are the target genes; 3)
how often the farther genes are linked due to indirect associations; 4) the accuracy of the ELMER
predictions. Moreover, ELMER identified multiple important enhancers linked to the expression of
oncogenes, such as MYC, CCND1, and FGFR3. It would be worthwhile to perform 4C assays and
CRISPR-mediated deletions to verify linkages between these enhancers and oncogenes.
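The four-group classification and the counts quoted in this section can be cross-checked with a small Python sketch. The group sizes are those stated in the text; the function names and the two boolean inputs are hypothetical.

```python
def classify_pair(has_loop, is_nearest_gene):
    """Assign an enhancer-gene pair to one of the four groups in the text."""
    if has_loop and is_nearest_gene:
        return 1  # ChIA-PET loop, linked to nearest gene
    if has_loop:
        return 2  # ChIA-PET loop, not the nearest gene
    if is_nearest_gene:
        return 3  # no loop, linked to nearest gene
    return 4      # no loop, not the nearest gene


# Group sizes as reported for the luminal A breast cancer pairs.
GROUP_SIZES = {1: 94, 2: 53, 3: 116, 4: 1645}


def summarize(group_sizes):
    """Return (total pairs, pairs linked to nearest gene, nearest fraction among looped pairs)."""
    total = sum(group_sizes.values())
    nearest = group_sizes[1] + group_sizes[3]
    looped = group_sizes[1] + group_sizes[2]
    return total, nearest, group_sizes[1] / looped


if __name__ == "__main__":
    print(summarize(GROUP_SIZES))
```

The totals agree with the figures given in the text: 1908 pairs overall, 210 pairs linked to the nearest gene, and about 64% of ChIA-PET-supported pairs linked to the nearest gene.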
Figure 4.1 Four groups of putative enhancer-gene pairs in luminal A breast cancer.
Shown are the numbers of pairs that belong to each group. The four groups are classified based on
proximity and ChIA-PET data.
B. Characterizing cancer-associated transcription factors. Transcription factors are critical
components of regulatory pathways that bind to target genomic regions to modulate the activity of
enhancers and promoters. In tumors, certain TFs are frequently mutated or show dysregulated
expression, leading to tumor progression. My ELMER analyses generated a list of TFs that may drive
cancer-specific regulatory networks, some of which are associated with poor prognosis of cancer patients
(e.g. RUNX1 in kidney cancer). Additional studies are needed to reveal the mechanisms underlying the
ability of such TFs to promote tumor progression. For instance, knockdown experiments can be used to
identify genes whose expression is linked to that of the cancer-associated TFs; ChIP-seq experiments can
be performed to reveal the genome-wide binding pattern of the TFs; and ChIA-PET assays can be
performed to identify enhancers and promoters linked by the TFs.
Specifically, I suggest that future work should focus on the RUNX family in kidney cancer
because 1) expression of RUNX family members shows clear negative correlations with average DNA
methylation at RUNX motif sites in kidney cancer, indicating RUNX family members are active
transcription factors in kidney; 2) expression levels of all three RUNX family members strongly associate
with kidney cancer prognosis; and 3) there is no report studying the role of RUNX family members in
kidney cancer (there is one paper showing that RUNX1 is upregulated in kidney cancer tissues). Because
of the limited information on RUNX family members in kidney cancer cell lines, I suggest that the first
effort should focus on the identification of appropriate normal and cancer cell line models to study RUNX
family members in kidney clear cell carcinoma. Using RT-PCR and Western Blot assays, one can
examine the expression levels of RUNX family members in various kidney cell lines (e.g. normal kidney
cell lines including HK-2, HEK-293 and kidney cancer cell lines including 786-O, A498 and Caki-2),
with the goal of identifying normal cell lines with low expression of RUNX family members, one cancer
cell line with high expression of RUNX family members, and one cancer cell line with low expression of
RUNX family members. Using these cell lines, one could study binding patterns of RUNX family
members in normal and tumor cells using ChIP-seq; the binding patterns of RUNX family members could
help to predict the functions of these TFs in kidney cancer through identifying nearby possible target
genes. Finally, proliferation and cell migration assays could be performed after knockdown
or knockout of multiple RUNX TFs in cell lines that have high expression of RUNX family members and
after overexpression of RUNX family members in cell lines with low expression to detect phenotype
changes.
C. Characterizing mutated cancer-associated TFs. Interestingly, using ELMER I identified
several TFs as critical regulators of regulatory networks that were also identified to be frequently mutated
by TCGA. For example, FOXA1 is the most frequently mutated TF in breast and bladder cancer and I
identified FOXA1 as a TF that drives the hypomethylation at breast/bladder enhancers. Likewise,
NFE2L2 is frequently mutated in lung squamous cell carcinoma and I identified NFE2L2 as a TF leading
to demethylation at enhancers in tumors. Although past studies have revealed that some active TFs are
frequently mutated in tumors, the functions of these mutations remain controversial or unknown. For
example, GATA3 is an important TF in luminal breast cancer but it is also frequently mutated in luminal
breast cancer; this apparent contradiction of a high mutation rate of a cancer-promoting TF has not yet
been reconciled. Difficulties in studying the functional consequences of TF mutations are that only a
limited number of mutated cases have been identified and target genes of the TFs are not known. Because
ELMER provides a list of putative target genes for a particular TF, my program may open the door to
study these TF mutations using comparison of target gene expression differences between samples with a
normal TF and samples with a mutated TF.
Specifically, I propose further characterization of NFE2L2, which is a transcription factor that
regulates the expression of antioxidant proteins that protect against oxidative damage caused by injury
and inflammation. Under oxidative and electrophilic stress conditions, NFE2L2 is released from KEAP1-
mediated degradation (Bauer et al., 2013). Comprehensive genomic characterization of squamous cell
lung cancers from TCGA suggests that both KEAP1 and NFE2L2 are frequently mutated: mutations in
NFE2L2 are mostly in one of two KEAP1 interaction motifs and mutations in KEAP1 are loss-of-
function. Also, the NFE2L2 locus is amplified in lung cancer tissue (TheCancerGenomeAtlas, 2012a).
These data suggest that NFE2L2 is an indispensable active transcription factor in lung squamous cancer
and that the level of functional NFE2L2 in lung tumors is increased by gain of copy number of NFE2L2,
loss of interaction of NFE2L2 with the negative regulatory protein KEAP1, and loss of KEAP1 activity.
Moreover, NFE2L2 pathway mutations are not unique to lung squamous cancer but they are common for
other squamous cancer types such as head and neck squamous cancer and the squamous subtype of
bladder cancer (Cancer Genome Atlas, 2015, Cancer Genome Atlas Research, 2014). I used mutation and
expression information for lung squamous cancer samples from TCGA to test the hypothesis that
mutations of the NFE2L2 pathway, which cause increased activity of NFE2L2, lead to the upregulation of
NFE2L2 target gene expression. I found that the expression levels of NFE2L2 putative target genes
identified by ELMER, including SOX2 and TP63 (which are frequently amplified in squamous lung
tumors), are higher in samples with mutations in the NFE2L2 pathway than in the samples without
mutations (data not shown). This result not only confirms my hypothesis but also generates an interesting
question as to whether the NFE2L2 pathway and the SOX2/TP63 pathway are interconnected. To
examine a possible correlation of these two pathways, I propose that one should perform mutation
analyses to determine if NFE2L2 pathway mutation and SOX2/TP63 amplification are mutually exclusive
and conduct experiments to verify linkages between NFE2L2 motif sites and the predicted target genes
SOX2 and TP63. These experiments include 1) ChIP-seq in lung squamous cell lines (e.g. A549) to verify
if NFE2L2 binds to the motif sites; 2) 4C to detect chromatin interactions between NFE2L2 motif sites
and promoter regions of SOX2 and TP63; and 3) enhancer deletion to determine the impact of the
enhancers on regulation of SOX2 and TP63.
D. Expansion of ELMER using whole genome DNA methylation data.
To date, I have only been able to employ ELMER to study the small percentage of enhancers that
are represented on DNA methylation arrays (these arrays only cover 15% of known enhancers). However,
with the rapidly increasing number of complete tumor methylation profiles obtained using whole-genome
bisulfite sequencing for multiple cancer types and other diseases, ELMER will be able to reconstruct
complete transcription factor networks in primary tissues, including identifying active enhancer regions,
discovering genome-wide in vivo putative enhancer-gene linkages, detecting critical transcriptional
regulators, and exploring partnerships between transcriptional regulators. Such a large-scale identification
of regulatory networks will be critical for understanding the functional consequences of non-coding SNPs
or somatic mutations on disease.
Specifically, I suggest the following improvements of ELMER using WGBS data. First, others
have used regions with bisulfite reads that have discordant methylation levels to infer tumor purity
(Zheng et al., 2014). Adapting this method and taking purity into consideration, ELMER could more
accurately identify differentially methylated regions (DMRs). Second, partially methylated domains
(PMDs), which cover nearly half of the human genome, are known to associate with late replication and
lamina-associated domains (Berman et al., 2012). One reason why the current ELMER analyses
identified many more hypomethylated distal enhancer probes is the global hypomethylation of
tumors relative to normal tissues at PMDs. Using WGBS data, ELMER may be able to remove the DMRs
driven by PMDs, specifically identifying enhancers that have focal hypomethylation. Third, the HM450K
data cannot discover partnerships between transcription factors because of the sparse locations of probes.
However, the base pair resolution of WGBS would enable ELMER to identify partnerships of TFs
through identification of motifs that significantly cluster together in cancer-specific enhancer regions and
correlations between transcription factor expression levels and average DNA methylation at enhancers
with a particular cluster of motifs. Finally, WGBS data would also provide information about sequence
variants, allowing ELMER to analyze the influence of regulatory variants on DNA methylation regions or
motif sites and regulation of target genes.
Many cancer genomic studies, including TCGA, are performing RNA-seq and HM450K for
most samples, and WGBS on a subset, usually 10% or less. While the number of cancer samples currently
being run using WGBS is an order of magnitude lower, a hybrid solution is possible that takes advantage of
the genome-wide coverage of WGBS and the large case numbers of HM450K. I propose to run ELMER
on all samples with HM450K data and RNA-seq data, as I have published in my ELMER paper. For those
samples with WGBS, I can take the set of differentially regulated genes, and search within the 20-gene
genomic neighborhood for hypomethylated regions containing the motif or motifs predicted by the
HM450K ELMER analysis for that sample. While this approach may produce a significant number of
false positives, I believe it will be possible to achieve a usable false discovery rate, and thereby
determine direct targets in the gene regulatory network, as well as annotations for regulatory
genetic variants within the putative enhancers.
Appendix A: Publications as a contributing author
1) Frietze S, Wang R, Yao L, Tak YG, Ye Z, Gaddis M, Witt H, Farnham PJ, Jin VX. Cell type-specific
binding patterns reveal that TCF7L2 can be tethered to the genome by association with GATA3.
Genome Biol. 2012;13(9):R52. PMCID: PMC3491396
I contributed to this publication by performing ChIP-seq, RNA-seq, and RT-PCR experiments revealing an
association between TCF7L2 and GATA3 as part of a project focusing on cell type-specific binding of
TCF7L2.
2) Blattler A, Yao L, Wang Y, Ye Z, Jin VX, Farnham PJ. ZBTB33 binds unmethylated regions of the
genome associated with actively expressed genes. Epigenetics & Chromatin. BioMed Central Ltd;
2013;6(1):13. PMCID: PMC3663758
I contributed to this publication by assisting with the analysis of DNA methylation levels at ZBTB33 binding sites
demonstrating that ZBTB33 mainly binds to unmethylated genomic regions.
3) Blattler A, Yao L, Witt H, Guo Y, Nicolet CM, Berman BP, Farnham PJ. Global loss of DNA
methylation uncovers intronic enhancers in genes showing expression changes. Genome Biol.
BioMed Central Ltd; 2014 Sep 20;15(9):469. PMID: 25239471
I contributed to this publication by participating in analyses of the impacts of global DNA methylation
loss on enhancers and gene expression in HCT116 colorectal cancer cells with and without mutation of
DNA methyltransferases. These studies showed that global loss of DNA methylation is not sufficient to
reactivate most silenced genes and identified new intronic enhancers associated with up-regulation of
gene expression.
RESEARCH Open Access
Cell type-specific binding patterns reveal that
TCF7L2 can be tethered to the genome by
association with GATA3
Seth Frietze1†, Rui Wang2,3†, Lijing Yao1, Yu Gyoung Tak1, Zhenqing Ye3, Malaina Gaddis1, Heather Witt1,
Peggy J Farnham1* and Victor X Jin3*
Abstract
Background: The TCF7L2 transcription factor is linked to a variety of human diseases, including type 2 diabetes
and cancer. One mechanism by which TCF7L2 could influence expression of genes involved in diverse diseases is
by binding to distinct regulatory regions in different tissues. To test this hypothesis, we performed ChIP-seq for
TCF7L2 in six human cell lines.
Results: We identified 116,000 non-redundant TCF7L2 binding sites, with only 1,864 sites common to the six cell
lines. Using ChIP-seq, we showed that many genomic regions that are marked by both H3K4me1 and H3K27Ac are
also bound by TCF7L2, suggesting that TCF7L2 plays a critical role in enhancer activity. Bioinformatic analysis of the
cell type-specific TCF7L2 binding sites revealed enrichment for multiple transcription factors, including HNF4alpha
and FOXA2 motifs in HepG2 cells and the GATA3 motif in MCF7 cells. ChIP-seq analysis revealed that TCF7L2
co-localizes with HNF4alpha and FOXA2 in HepG2 cells and with GATA3 in MCF7 cells. Interestingly, in MCF7 cells the
TCF7L2 motif is enriched in most TCF7L2 sites but is not enriched in the sites bound by both GATA3 and TCF7L2.
This analysis suggested that GATA3 might tether TCF7L2 to the genome at these sites. To test this hypothesis, we
depleted GATA3 in MCF7 cells and showed that TCF7L2 binding was lost at a subset of sites. RNA-seq analysis
suggested that TCF7L2 represses transcription when tethered to the genome via GATA3.
Conclusions: Our studies demonstrate a novel relationship between GATA3 and TCF7L2, and reveal important
insights into TCF7L2-mediated gene regulation.
Background
The TCF7L2 (transcription factor 7-like 2) gene encodes a
high mobility group box-containing transcription factor
that is highly up-regulated in several types of human cancer,
such as colon, liver, breast, and pancreatic cancer
[1-4]. Although TCF7L2 is sometimes called TCF4, there
is a helix-loop-helix transcription factor that has been
given the official gene name of TCF4 and it is important,
therefore, to be aware of possible confusion in the literature.
Numerous studies have shown that TCF7L2 is an
important component of the WNT pathway [3,5,6].
TCF7L2 mediates the downstream effects of WNT signaling
via its interaction with CTNNB1 (beta-catenin) and it
can function as an activator or a repressor, depending on
the availability of CTNNB1 in the nucleus. For example,
TCF7L2 can associate with the members of the Groucho
repressor family in the absence of CTNNB1. The WNT
pathway is often constitutively activated in cancers, leading
to increased levels of nuclear CTNNB1 and up-regulation
of TCF7L2 target genes [3]. In addition to being linked to
neoplastic transformation, variants in TCF7L2 are thought
to be the most critical risk factors for type 2 diabetes
[7-10]. However, the functional role of TCF7L2 in these
diseases remains unclear. One hypothesis is that TCF7L2
regulates its downstream target genes in a tissue-specific
manner, with a different cohort of target genes being
turned on or off by TCF7L2 in each cell type. One way to
* Correspondence: pfarnham@usc.edu; Victor.Jin@osumc.edu
† Contributed equally
1 Department of Biochemistry and Molecular Biology, Norris Comprehensive
Cancer Center, University of Southern California, Los Angeles, CA 90089, USA
3 Department of Biomedical Informatics, The Ohio State University, Columbus,
OH 43210, USA
Full list of author information is available at the end of the article
Frietze et al. Genome Biology 2012, 13:R52
http://genomebiology.com/content/13/9/R52
© 2012 Frietze et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
test this hypothesis is to identify TCF7L2 target genes in a
diverse set of cell types.
Previous studies have used genome-wide approaches to
identify TCF7L2 target genes in human colon cancer cells
[11,12] and, more recently, chromatin immunoprecipita-
tion sequencing (ChIP-seq) analysis of TCF7L2 was
reported in hematopoietic cells [13]. In addition, TCF7L2
binding has been studied in rat islets and rat hepatocytes
[14,15]. However, to date no one study has performed
comparative analyses of genome-wide binding patterns of
TCF7L2 in diverse human cell types. We have now con-
ducted ChIP-seq experiments and comprehensively
mapped TCF7L2 binding loci in six human cell lines. We
identified datasets of common and cell-specific TCF7L2
binding loci and a set of predicted TCF7L2-regulated
enhancers (by comparing the TCF7L2 peak locations with
ChIP-seq data for the active enhancer marks H3K4me1
(histone H3 monomethylated on lysine 4) and H3K27Ac
(histone H3 acetylated on lysine 27)). We also predicted
bioinformatically and confirmed experimentally that
TCF7L2 co-localizes with cell type-specific factors. Finally,
we showed that GATA3 (GATA binding protein 3), which
co-localizes with TCF7L2 in MCF7 breast cancer cells, is
required for recruitment of TCF7L2 to a subset of binding
sites. Our studies reveal new insights into TCF7L2-
mediated gene regulation and suggest that cooperation
with other factors dictates different roles for TCF7L2 in
different tissues.
Results
Defining TCF7L2 genomic binding patterns
To identify TCF7L2 binding loci in a comprehensive man-
ner, we performed ChIP-seq using an antibody to TCF7L2
and profiled six human cell types, including colorectal car-
cinoma cells (HCT116), hepatocellular carcinoma cells
(HepG2), embryonic kidney cells (HEK293), mammary
gland adenocarcinoma cells (MCF7), cervical carcinoma
cells (HeLa), and pancreatic carcinoma cells (PANC1). We
chose these particular cell lines because TCF7L2 has been
associated with these types of cancers and because all of
these cells have various data sets associated with them as
part of the ENCODE project. The TCF7L2 gene has 17
exons, including 5 exons that are alternatively spliced in
different tissues [2,16-20]. Alternative splicing produces
two major isoforms of TCF7L2 in most cells, a cluster of
isoforms of approximately 79 kDa and a cluster of iso-
forms of approximately 58 kDa. All of these isoforms con-
tain the DNA binding domain, the CTNNB1 binding
domain, the Groucho binding domain, and the nuclear
localization signal. However, the CtBP (C-terminal binding
protein) binding domain is encoded at the carboxyl termi-
nus and is missing in the 58 kDa isoform [21,22]. The two
major isoforms are found at similar ratios in the six cell
lines that we analyzed (Additional file 1). For all cell types,
we performed duplicate ChIP-seq assays using chromatin
from two different cell culture dates (see Additional file 2
for details concerning all ChIP-seq experiments and infor-
mation on how to access the data). To ensure that our
data were of high quality and reproducible, we called
peaks [11,23,24] and then compared the peak sets using
the ENCODE overlap rules (Additional file 3); all datasets
had a high degree of reproducibility (Additional file 4). We
next combined the reads for each replicate experiment
and called TCF7L2 peaks for each cell type, identifying
tens of thousands of peaks in each cell type (Table 1; see
Additional file 5 for lists of all TCF7L2 binding sites in
each cell type and Additional file 6 for a summary of the
peak characteristics for each cell type). We used a satura-
tion analysis strategy (Additional file 3) to demonstrate
that the depth of sequencing of the ChIP samples was suf-
ficient to identify the majority of TCF7L2 binding sites in
each cell type (Additional file 7).
We next determined if the TCF7L2 binding sites identified
in each cell type are unique to that cell type or if TCF7L2
binds to the same locations in different cells. We first per-
formed two-way comparisons of the peaks from all six cell
types and found that the overlaps ranged from a low of
18% of HepG2 sites being present in the HEK293 peak set
to a high of 46% of the HCT116 sites being present in the
PANC1 peak set. These low overlaps suggested that each
cell type contributes a unique set of peaks. To demon-
strate the cell type specificity of binding of TCF7L2, the
top 500 binding sites were selected from the ChIP-seq
datasets from each of the 6 cell types (a total of 3,000
peaks). Then, the sequenced tags in all 6 datasets corre-
sponding to the genomic regions spanning ±3 kb from the
center of each of the combined 3,000 peaks were clustered
with respect to these genomic regions (Figure 1). This
analysis demonstrates a clear cell type-specificity in the
Table 1 TCF7L2 binding sites and target genes
Cell line      Total peaks   Nearest genes   Peaks per gene
HCT116         30,259        10,910          2.8
HEK293         24,457        6,254           3.9
HepG2          27,912        7,899           3.5
MCF7           27,721        6,934           4.0
HeLa           52,810        11,334          4.7
PANC1          31,744        9,438           3.4
Common         1,864         1,287           1.4
Total unique   116,270       14,193          8.2
Peaks were called using the BELT program on the merged duplicate datasets
for each cell type. Reported as the ‘Total peaks’ is the total number of TCF7L2
binding sites in each cell line, as well as the total number of unique peaks
when all six cell lines are considered together and the peaks identified as
being common to all six cell lines. To determine the number of ‘Nearest
genes’, the RefSeq gene nearest each TCF7L2 binding site was identified and
duplicates were removed (since TCF7L2 has multiple binding sites near many
genes), and the resultant number is indicated as the total number of target
genes for that peak set.
top-ranked TCF7L2 binding sites. We note that one char-
acteristic of cancer cell lines is that they often have exten-
sive genomic amplifications. Peak calling programs (such
as the ones used in our analyses) that use input DNA
from the specific cancer cell can help to prevent many
false positive peaks that would otherwise rise to the top of
the peak list due to the fact that the amplified regions are
‘over-sequenced’ in comparison to the rest of the genome.
However, it is difficult to completely account for amplifi-
cations. Therefore, to ensure that the cell type specificity
that we observed was not due to TCF7L2 peaks in ampli-
fied regions, we used our peak-calling program Sole-search
to identify all genomic amplifications in the six cancer cell
lines (Additional file 8). Then, we identified the TCF7L2
peaks that are in the amplified regions in each cell line
(Additional file 9); all peaks from amplified regions were
removed from the peak lists prior to the analysis shown in
Figure 1. In total, we found that each cell type had more
than 10,000 TCF7L2 binding sites that were not found in
any of the sets of called peaks for the other 5 cell types
(see Additional file 10 for lists of cell type-specific TCF7L2
binding sites). Of course, it is possible that some sites that
appear to be cell type-specific are actually very small peaks
in another cell type and fall below the cutoff used in our
Figure 1 ChIP-seq analysis of TCF7L2 in six different human cell lines. Shown is the distribution of TCF7L2 binding within ±3 kb windows
around distinct genomic regions (n = 3,000) bound by TCF7L2 in a given cell type. ChIP-seq tags for each cell line were each aligned with
respect to the center of the combined top 500 peaks from each dataset and clustered by genomic position.
analyses. Overall, we identified 116,270 non-redundant
TCF7L2 binding sites when the datasets from all 6 cell
lines are combined. Only 1,864 TCF7L2 binding loci were
common to all 6 cell lines, suggesting that TCF7L2 may
play an important, yet distinct, role in different cells.
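The two-way comparisons described above can be illustrated with a minimal sketch. It assumes that peaks are (chromosome, start, end) tuples and that two peaks overlap if they share at least one base; the ENCODE overlap rules used in this study (Additional file 3) may differ. The function name and the toy coordinates are hypothetical and are not taken from the datasets.

```python
from collections import defaultdict

def fraction_overlapping(peaks_a, peaks_b):
    """Fraction of peaks in peaks_a that share at least one base with a peak in peaks_b.

    Peaks are (chrom, start, end) tuples with half-open coordinates [start, end).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        by_chrom[chrom].append((start, end))
    hits = 0
    for chrom, start, end in peaks_a:
        # Linear scan per chromosome: adequate for a sketch, slow for full peak sets.
        if any(s < end and start < e for s, e in by_chrom[chrom]):
            hits += 1
    return hits / len(peaks_a) if peaks_a else 0.0

# Toy (made-up) peak lists, not real binding sites
hepg2 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150), ("chr3", 10, 20)]
hek293 = [("chr1", 150, 250), ("chr2", 400, 500)]

print(fraction_overlapping(hepg2, hek293))  # 0.25
print(fraction_overlapping(hek293, hepg2))  # 0.5
```

The measure is asymmetric, which is why the overlaps are reported directionally (for example, the percentage of HepG2 sites present in the HEK293 peak set).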
To confirm the cell-specific TCF7L2 binding loci that
we observed in the ChIP-seq data, we chose a set of three
targets identified to be cell type-specific for each of the
six cell lines, three common targets, and three negative
regions not bound by TCF7L2 in any cell line and per-
formed quantitative ChIP-PCR (ChIP-qPCR) using DNA
isolated from ChIP samples that were distinct from the
samples used for ChIP-seq (Additional file 11). The com-
mon targets were bound by TCF7L2 in all samples,
whereas the negative controls showed very low enrich-
ment in all samples. In general, regions identified as cell
type-specific showed the greatest enrichment for TCF7L2
binding in that corresponding cell line (for example, the
PANC1-specific sites showed very high enrichment in
ChIP samples from PANC1 cells, low enrichment in sam-
ples from HepG2, HeLa, and HCT116 cells, and no
enrichment in samples from HEK293 or MCF7 cells).
Thus, ChIP-qPCR confirms the specificity of targets iden-
tified in the ChIP-seq data in each cell line. Examples of
common and cell type-specific TCF7L2 binding sites are
shown in Figure 2.
To determine the potential set of genes regulated by
TCF7L2 in each cell type, we identified the closest
annotated gene to each TCF7L2 binding site in the six
different cell types and the closest annotated gene to the
set of 1,864 common TCF7L2 binding sites. The num-
ber of target genes (as defined by the nearest gene to a
TCF7L2 binding site) ranged from approximately 6,000
to 11,000 in the different cell lines (Table 1). In addi-
tion, we observed that the number of target genes
in each cell line was considerably smaller than the number
of TCF7L2 binding sites, demonstrating that TCF7L2
binds to multiple locations near each target gene (Table
1). Although less than 2% (1,864 of 116,270 peaks) of
the total number of peaks were commonly bound by
TCF7L2 in all 6 cell lines, 9% of target genes were com-
mon to all 6 cell lines (1,287 of 14,193 genes). This indi-
cates that TCF7L2 regulates certain genes in different
cell types using different binding sites. For example,
there are 12 TCF7L2 binding sites near the SH3BP4
gene, but these sites are different in MCF7, HCT116,
and PANC1 cells (Figure 2c).
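The ratios quoted in this paragraph follow directly from the counts in Table 1. The short sketch below recomputes them; the counts are copied from Table 1, whereas the variable names are only illustrative.

```python
# (total peaks, nearest genes) copied from Table 1
table1 = {
    "HCT116": (30259, 10910),
    "HEK293": (24457, 6254),
    "HepG2": (27912, 7899),
    "MCF7": (27721, 6934),
    "HeLa": (52810, 11334),
    "PANC1": (31744, 9438),
    "Common": (1864, 1287),
    "Total unique": (116270, 14193),
}

# Average number of binding sites per nearest gene
peaks_per_gene = {name: peaks / genes for name, (peaks, genes) in table1.items()}

common_peaks, common_genes = table1["Common"]
all_peaks, all_genes = table1["Total unique"]
pct_common_peaks = 100 * common_peaks / all_peaks  # about 1.6% of peaks
pct_common_genes = 100 * common_genes / all_genes  # about 9.1% of target genes

for name, ratio in peaks_per_gene.items():
    print(f"{name}: {ratio:.1f} peaks per gene")
print(f"common peaks: {pct_common_peaks:.1f}%, common genes: {pct_common_genes:.1f}%")
```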
The binding patterns shown in Figure 2c indicate that
TCF7L2 does not necessarily bind to promoter regions,
but rather binds to a variety of genomic locations near or
within the SH3BP4 locus. To evaluate the global distribu-
tion of TCF7L2 binding loci in each cell line, we plotted
the percentage of TCF7L2 sites versus their distance to
the nearest transcription start site. Even though TCF7L2
binds to different sites in the different cell lines, the trend
for the distribution of TCF7L2 target loci is the same for
each cell line (Figure 3a). Although some of the TCF7L2
binding sites are within 1 kb of a transcription start site,
most of the sites are located at distances greater than 10
kb from a start site. However, we did find that the com-
mon sites to which TCF7L2 is bound in all six cell lines
are more enriched near the 5’ end of a gene than are the
other sites (Figure 3a). A detailed analysis of the TCF7L2
binding sites, including the location of each site relative
to the transcription start site of the nearest gene for all
peaks in each of the six cell lines can be found in Addi-
tional file 5.
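As an illustration of how such a distance distribution can be computed, the following sketch bins peak centers by their distance to the nearest transcription start site. It assumes a single chromosome, a non-empty sorted list of TSS positions, and arbitrarily chosen bin edges (1 kb, 10 kb, 100 kb); the function names and toy positions are hypothetical, and the binning used for Figure 3a may differ.

```python
import bisect

def distance_to_nearest_tss(peak_center, sorted_tss):
    """Absolute distance (bp) from a peak center to the nearest TSS (same chromosome)."""
    i = bisect.bisect_left(sorted_tss, peak_center)
    # Only the TSS just before and just after the peak center can be the nearest.
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(peak_center - t) for t in candidates)

def bin_distances(distances, edges=(1_000, 10_000, 100_000)):
    """Percentage of peaks per bin: <1 kb, 1-10 kb, 10-100 kb, >=100 kb."""
    counts = [0] * (len(edges) + 1)
    for d in distances:
        counts[bisect.bisect_right(edges, d)] += 1
    return [100 * c / len(distances) for c in counts]

# Toy (made-up) positions
tss_positions = [1_000, 50_000, 200_000]
peak_centers = [1_200, 500, 30_000, 300_000]

distances = [distance_to_nearest_tss(c, tss_positions) for c in peak_centers]
print(distances)                 # [200, 500, 20000, 100000]
print(bin_distances(distances))  # [50.0, 0.0, 25.0, 25.0]
```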
TCF7L2 binds to enhancer regions
The fact that TCF7L2 can bind to regions far from core
promoters suggested that TCF7L2 might bind to enhan-
cers. Recent studies have shown that enhancers can be
identified by enrichment for both the H3K4me1 and
H3K27Ac marks [25-27]. To determine if the regions
bound by TCF7L2 are also bound by these modified his-
tones, we performed ChIP-seq experiments in PANC1,
HEK293, HCT116 and MCF7 cells using antibodies that
specifically recognize histone H3 only when it is mono-
methylated on lysine 4 or when it is acetylated on lysine
27; we also used H3K4me1 and H3K27Ac data for HeLa
and HepG2 cells from the ENCODE project. Duplicate
ChIP-seq experiments were performed using two different
cultures of cells for each cell type; peaks were called indivi-
dually to demonstrate reproducibility (Additional file 4),
the reads were merged and a final peak set for both
H3K4me1 and H3K27Ac was obtained. We then identified
predicted active enhancers as regions having both
H3K4me1 and H3K27Ac and determined the percentage
of the TCF7L2 sites that have either or both of the modi-
fied histones (Table 2). We found that, for most cells, the
majority of TCF7L2 sites co-localized with H3K4me1 and
H3K27Ac. However, a smaller percentage of the TCF7L2
sites in MCF7 cells co-localized with active enhancers.
Heatmaps of the tag density of the histone ChIP-seq
experiments for each cell line relative to the center of the
TCF7L2 peak locations are shown in Figure 3c. Although
most TCF7L2 binding sites show robust levels of both
marks, the TCF7L2 sites in MCF7 cells again show a smal-
ler percentage of sites having high levels of the modified
histones. To determine if the TCF7L2 binding sites in
MCF7 cells correspond to sites bound by histone modifi-
cations associated with transcriptional repression, we per-
formed duplicate ChIP-seq analysis using antibodies to
H3K9me3 (histone H3 trimethylated on lysine 9) and
H3K27me3 (histone H3 trimethylated on lysine 27); we
also used H3K4me3 (histone H3 trimethylated on lysine 4)
and RNA polymerase II ChIP-seq data from the ENCODE
project. As shown in Figure 3d, neither the proximal nor
distal TCF7L2 binding sites show high levels of H3K9me3
or H3K27me3.
To further investigate the role of TCF7L2 in cell type-
specific enhancers, we determined the percentage of
active enhancers in each of the six cell types (that is,
genomic regions bound by both H3K4me1 and
H3K27Ac) that are also bound by TCF7L2. We found
that more than 40% of all enhancers in the different cell
lines are occupied by TCF7L2 (Figure 3b). These results
indicate that TCF7L2 ChIP-seq data identify many of
the active enhancers in a given cell type and suggest
that TCF7L2 may play a critical role in specifying the
transcriptome in a variety of cancer cells. An example of
TCF7L2 binding to sites marked by H3K4me1 and
H3K27Ac in HepG2 cells is shown in Additional file 12;
TCF7L2 does not bind to this same site in HeLa cells
and these sites are also not marked by the modified his-
tones in HeLa cells.
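The definition of active enhancers used here (regions carrying both H3K4me1 and H3K27Ac) and the percentage of those regions occupied by TCF7L2 can be sketched as follows. This is a simplified illustration rather than the pipeline used in the study: it assumes a single chromosome, half-open intervals, and input lists that are sorted and non-overlapping; all names and coordinates are hypothetical.

```python
def intersect(intervals_a, intervals_b):
    """Regions covered by both interval lists (each sorted and non-overlapping)."""
    out, i, j = [], 0, 0
    while i < len(intervals_a) and j < len(intervals_b):
        start = max(intervals_a[i][0], intervals_b[j][0])
        end = min(intervals_a[i][1], intervals_b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if intervals_a[i][1] < intervals_b[j][1]:
            i += 1
        else:
            j += 1
    return out

def percent_bound(regions, peaks):
    """Percentage of regions that overlap at least one peak."""
    bound = sum(
        any(p_start < r_end and r_start < p_end for p_start, p_end in peaks)
        for r_start, r_end in regions
    )
    return 100 * bound / len(regions)

# Toy (made-up) peak calls on one chromosome
h3k4me1 = [(100, 400), (900, 1200), (2000, 2500)]
h3k27ac = [(300, 600), (1000, 1100), (3000, 3200)]
tcf7l2 = [(350, 360), (5000, 5100)]

active_enhancers = intersect(h3k4me1, h3k27ac)
print(active_enhancers)                         # [(300, 400), (1000, 1100)]
print(percent_bound(active_enhancers, tcf7l2))  # 50.0
```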
Motif analysis of genomic regions bound by TCF7L2
To investigate the predominant motifs enriched in
TCF7L2 binding sites, we applied a de novo motif discov-
ery program, ChIPMotifs [28,29], to the sets of TCF7L2
peaks in each cell type. We retrieved 300 bp for each locus
from the top 1,000 binding sites in each set of TCF7L2
peaks and identified the top represented 6-mer and 8-
mer (Additional file 13). For all cell lines, the same 6-mer
(CTTTGA) and 8-mer (CTTTGATC) motifs were identi-
fied (except for HCT116 cells, for which the 8-mer was
CCTTTGAT). These sites are almost identical to the
Transfac binding motifs for TCF7L2 (TCF4-Q5:
SCTTTGAW) and for the highly related family member
LEF1 (LEF1-Q2:CTTTGA) and to experimentally discov-
ered motifs in previous TCF7L2 ChIP-chip and ChIP-seq
data [11,30]. These motifs are present in a large percen-
tage of the TCF7L2 binding sites. For example, more
than 80% of the top 1,000 peaks in each dataset from
Figure 2 Cell type-specific binding of TCF7L2. (a,b) The ChIP-seq binding patterns of TCF7L2 are compared in six cell lines, demonstrating
both common peaks (a) and cell type-specific binding (b). (c) The ChIP-seq binding patterns of TCF7L2 near and within the SH3BP4 locus are
shown for three cell lines. The number of tags reflecting the ChIP enrichments is plotted on the y-axis; the chromosomal coordinates (hg19)
shown are: (a) chr19:7,701,591-7,718,750; (b) chr1:112,997,195-113,019,766; and (c) chr2:235,767,270-235,974,731.
each cell type contain the core TCF7L2 6-mer W1 motif,
with the percentage gradually dropping to approximately
50% of all peaks (Additional file 14).
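To make the idea of a top represented k-mer concrete, the sketch below counts k-mers across a set of sequences, treating a k-mer and its reverse complement as the same word, and checks a sequence against an IUPAC consensus such as the Transfac motif SCTTTGAW. This is a plain frequency count under our own simplifying assumptions and is not the ChIPMotifs algorithm; the function names and toy sequences are hypothetical.

```python
from collections import Counter
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def top_kmers(sequences, k, n=1):
    """Most frequent k-mers, counting each k-mer together with its reverse complement."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Use the lexicographically smaller strand as the canonical form.
            counts[min(kmer, reverse_complement(kmer))] += 1
    return counts.most_common(n)

# IUPAC codes needed for SCTTTGAW: S = C or G, W = A or T
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "S": "[CG]", "W": "[AT]"}

def contains_motif(seq, consensus):
    """True if the consensus occurs in seq on either strand."""
    pattern = re.compile("".join(IUPAC[c] for c in consensus))
    return bool(pattern.search(seq) or pattern.search(reverse_complement(seq)))

# Toy (made-up) sequences standing in for 300 bp peak regions
toy_sequences = ["AACTTTGATCGG", "GGATCAAAGTT", "TTTTCTTTGAAC"]
print(top_kmers(toy_sequences, 6))             # [('CTTTGA', 3)]
print(contains_motif("CCTTTGAT", "SCTTTGAW"))  # True
```

In this toy example the second sequence carries the motif on the opposite strand (TCAAAG), which is why both strands are counted together.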
Because the TCF7L2 motif is present in all the cell lines
at the same genomic locations, but TCF7L2 binds to
different subsets of the TCF7L2 motifs in the different
cell lines, this suggests that a cell type-specific factor may
help to recruit and/or stabilize TCF7L2 binding to speci-
fic sites in different cells. Also, as shown above, TCF7L2
binds to enhancer regions, which are typified by having
Figure 3 TCF7L2 binding sites are distal and enriched for active enhancer histone marks. (a) Shown for the TCF7L2 binding sites in the six
cell types and for the 1,864 peaks commonly bound in all six cells is the percentage of TCF7L2 binding sites in different genomic regions (hg19)
relative to the nearest transcription start site (TSS). (b) The percentage of active enhancer regions containing a TCF7L2 binding site; active
enhancers were defined by taking the regions that have an overlap of H3K4me1 and H3K27ac ChIP-seq peaks for the given cell line. (c) Heatmaps
of the ChIP-Seq tags for H3K27ac and H3K4me1 at TCF7L2-bound regions (±3 kb windows around the center of all TCF7L2 peaks) for each cell line
were generated by k-means cluster analysis. (d) The average RNA polymerase II and histone modification profiles of MCF7 cells are shown for the
±3 kb windows around the center of TCF7L2 peaks identified as proximal to RefSeq genes (upper graph) or distal to RefSeq genes (lower graph).
Table 2 TCF7L2 binds to enhancer regions
Cell line   Percentage of TCF7L2 peaks   Percentage of TCF7L2 peaks   Percentage of TCF7L2 peaks at active
            at H3K4me1 sites             at H3K27Ac sites             enhancers (H3K4me1 and H3K27Ac)
HCT116      68                           72                           57
HepG2       84                           59                           56
PANC1       63                           74                           55
HeLa        72                           58                           53
MCF7        34                           46                           25
TCF7L2 peaks were overlapped with called peaks for H3K27Ac or H3K4me1 or both H3K27Ac and H3K4me1 in the same cell line. The percentage of TCF7L2
peaks in each category is indicated.
binding sites for multiple factors. To test the hypothesis
that TCF7L2 associates with different transcription factor
partners in different cell types, we identified motifs for
other known transcription factors using the program
HOMER [31]. For these analyses, we used the subset of
TCF7L2 binding sites that were specific to each of the six
different cell types. The top four significantly enriched
non-TCF7L2 motifs for each dataset are shown in Table
3; many of these motifs correspond to binding sites for
factors that are expressed in a cell type-enriched pattern.
To assess the specificity of the identified motifs with
respect to TCF7L2 binding, we chose one motif specific
to HepG2 TCF7L2 binding sites (hepatocyte nuclear fac-
tor (HNF)4α) and one motif specific to MCF7 TCF7L2
binding sites (GATA3) and plotted motif densities in the
HepG2 cell type-specific TCF7L2 peaks (Figure 4a) and
the MCF7 cell type-specific TCF7L2 peaks (Figure 4b). In
HepG2 cells, the HNF4α motif, but not the GATA3
motif, is highly enriched at the center of TCF7L2 binding
regions. In contrast, in MCF7 cells the GATA3 motif, but
not the HNF4α motif, is highly enriched at the center of
TCF7L2 binding regions.
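A motif density plot of this kind is built from the positions of motif occurrences relative to the peak center. The sketch below shows one simplified way to obtain those offsets; it assumes that each sequence is a window centered on the peak, considers only exact matches on the forward strand, and uses AGATAA as a stand-in GATA-like word. The function name, the stand-in motif, and the toy windows are hypothetical and do not reproduce the HOMER-based analysis.

```python
def motif_offsets(sequence, motif):
    """Offsets (bp) of each motif occurrence relative to the center of the sequence.

    The sequence is assumed to be a window centered on the peak; only exact
    forward-strand matches are considered in this simplified sketch.
    """
    center = len(sequence) // 2
    offsets = []
    start = sequence.find(motif)
    while start != -1:
        motif_mid = start + len(motif) // 2
        offsets.append(motif_mid - center)
        start = sequence.find(motif, start + 1)
    return offsets

# Toy (made-up) 20 bp windows
centered = "C" * 7 + "AGATAA" + "C" * 7
off_center = "AGATAA" + "C" * 14

print(motif_offsets(centered, "AGATAA"))    # [0]
print(motif_offsets(off_center, "AGATAA"))  # [-7]
```

Pooling these offsets over all peaks in a set and plotting their histogram gives a density profile; enrichment at the peak center appears as a maximum near offset zero.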
TCF7L2 co-localizes with HNF4α and FOXA2 in HepG2
cells
To validate the co-localization of TCF7L2 with factors
binding to the identified motifs in HepG2 cells, we
obtained ChIP-seq data for HNF4α and FOXA2 (forkhead
box A2) from the ENCODE Consortium and overlapped
the peak sets with the set of TCF7L2 peaks specific for
HepG2 cells (Figure 5a). We found that approximately 50%
of all HepG2-unique TCF7L2 sites are shared by HNF4α
and FOXA2. The sites bound only by HNF4α, only by
TCF7L2, or by both factors were analyzed for enrichment
of the HNF4α and TCF7L2 motifs (Figure 5b). We found
that the motifs were only enriched in the set of peaks spe-
cifically bound by each factor. For example, the sites bound
only by TCF7L2 but not by HNF4α have TCF7L2 motifs
but do not have HNF4α motifs (and vice versa). However,
sites bound by both TCF7L2 and HNF4α have motifs for
both factors. These results indicate that the HNF4α motif
was not identified simply due to its sequence being similar
to the TCF7L2 motif and suggest that both factors bind
directly to the DNA at the co-localizing sites. We next
plotted the location of the experimentally determined
HNF4α and FOXA2 sequence tags relative to the center of
the TCF7L2 binding site in the set of 7,576 peaks bound
by all three factors. As shown in Figure 5c, both HNF4α
and FOXA2 localize near the center of the TCF7L2 bind-
ing sites. An example of the binding patterns of all three
factors at the GREB1 locus is shown in Figure 5d. These
results support the hypothesis that HNF4α and FOXA2
may be involved in specifying a portion of the binding of
TCF7L2 in liver cells.
GATA3 is required for TCF7L2 recruitment to a subset of
sites in MCF7 cells
We next examined the relationship between GATA3 and
TCF7L2 binding in MCF7 cells. We performed duplicate
ChIP-seq experiments for GATA3 in MCF7 cells, called
peaks, and then determined the overlap of GATA3 peaks
with the TCF7L2 peaks in MCF7 cells (Figure 6a). We
found that nearly half of all MCF7-unique TCF7L2 sites
are bound by GATA3 (49%); an example of the binding
patterns of both factors at the CDT1 locus is shown in
Figure 6a. The observation that two factors bind to the
same location in the genome could be a result of both fac-
tors binding to the same (or nearby) site at the same time
or could be due to one factor binding to the genomic
region in one cell with the other factor binding to that
same region in a different cell in the population. To
address these possibilities, we performed motif analyses,
co-immunoprecipitations, and knockdown experiments.
The sites bound only by GATA3, only by TCF7L2, or by
both factors were analyzed for the enrichment of the
GATA3 and TCF7L2 motifs (Figure 6b). We found that
the sites bound only by TCF7L2 contain the TCF7L2
motif but not the GATA3 motif and the sites bound only
by GATA3 contain the GATA3 motif but not the TCF7L2
motif. Interestingly, we found that sites bound by both
GATA3 and TCF7L2 are enriched for the GATA3 motif
but are not enriched for the TCF7L2 motif. These results
suggest that GATA3 may bind to the DNA and recruit
TCF7L2 to these sites. To determine if GATA3 can recruit
TCF7L2 to a GATA motif in the genome, we introduced
small interfering RNAs (siRNAs) specific for GATA3 into
MCF7 cells and then tested binding of TCF7L2 to sites
bound by both TCF7L2 and GATA3 and to sites bound
only by TCF7L2. We found that depletion of GATA3
resulted in the reduction of binding of TCF7L2 at the sites
normally bound by both factors but not at the TCF7L2
sites that are not bound by GATA3 (Figure 6c, left panel).
In contrast, knockdown of TCF7L2 reduced binding of
TCF7L2 but did not reduce binding of GATA3 (Figure 6c,
right panel). Thus, GATA3 is necessary for recruiting
Table 3 TCF7L2 cell type-specific modules
Cell line   Top four co-localized motifs
HCT116      AP1, CTCF, NF-E2, SP1
HEK293      HOXC9, CDX2, PDX1, FOXA1
HepG2       HNF4α, FOXA2, ERRα, PPARγ
HeLa-S3     ERG, MAZ, CEBP, NF-E2
MCF7        GATA3, AP2, TEAD, AP1
PANC1       CDX2, FOXA1, TEAD, RUNX2
The sets of TCF7L2 binding sites specific for each cell line (not found in any of
the other five cell lines) were analyzed using HOMER. The top four most
highly enriched motifs (as determined by P-value), not including the TCF7L2
motif, are shown for each cell line.
TCF7L2 to a subset of its genomic binding sites in MCF7
cells but TCF7L2 is not necessary for GATA3 binding to
those same sites. We also performed sequential ChIP
assays (a TCF7L2 ChIP followed by a GATA3 ChIP and a
GATA3 ChIP followed by a TCF7L2 ChIP) to address
whether both TCF7L2 and GATA3 are on the same DNA
fragments (Additional file 15). In both cases, the sites
bound by both TCF7L2 and GATA3 could be enriched by
Figure 4 Association of other motifs with TCF7L2 binding sites. (a,b) TCF7L2 binding sites unique to HepG2 cells (a) or MCF7 cells (b) were
analyzed for the indicated motifs; the position of each motif is plotted relative to the center of the TCF7L2 binding site.
Figure 5 Association of TCF7L2 and HNF4α in HepG2 cells. (a) HNF4α and FOXA2 ChIP-seq data were downloaded from the UCSC genome
browser, and peaks were called and overlapped with the HepG2 cell type-specific TCF7L2 peaks. (b) Peaks bound only by HNF4α, only by
TCF7L2, or by both factors were analyzed for the presence of HNF4α and TCF7L2 motifs. (c) For the set of 7,576 peaks bound by all three
factors, the locations of the HNF4α and FOXA2 peaks were plotted relative to the center of the TCF7L2 peak. (d) A comparison of TCF7L2,
HNF4α, and FOXA2 binding patterns near the GREB1 locus is shown. The hg19 genomic coordinates are chr2:11,636,208-11,708,654. The number
of tags reflecting the ChIP enrichments is plotted on the y-axis.
Figure 6 Association of TCF7L2 and GATA3 in MCF7 cells. (a) GATA3 ChIP-seq in MCF7 cells was performed, and peaks were called and
then overlapped with the MCF7 cell type-specific TCF7L2 peaks; a comparison of TCF7L2 and GATA3 binding patterns near the CDT1 locus is
shown. The hg19 genomic coordinates are chr16:88,861,964-88,880,233. (b) Peaks bound only by GATA3, only by TCF7L2, or by both factors
were analyzed for the presence of GATA3 and TCF7L2 motifs. The GATA3 motif is found in sites bound by GATA3 only and in sites bound by
both factors, whereas the TCF7L2 motif is found only in the sites bound only by TCF7L2 and not in the sites bound by both factors. (c)
Depletion of GATA3 results in loss of TCF7L2 occupancy at sites bound by both TCF7L2 and GATA3 but not at sites bound only by TCF7L2.
MCF7 cells were transfected with siRNAs specific for TCF7L2 or GATA3 or control siRNAs. ChIP-qPCR assays were performed using antibodies
specific for TCF7L2 (left panel) or GATA3 (right panel) using primers specific for peaks bound only by GATA3, only by TCF7L2, or by both factors.
Shown are ChIP-qPCR results performed in triplicate and plotted with the standard error of two independent experiments. (d) Co-
immunoprecipitation of endogenous GATA3 and FLAG-tagged TCF7L2 constructs from MCF7 cells. The left panel analyzes whole-cell extracts
(WCE) and FLAG immunoprecipitation (FLAG IP) eluates from MCF7 cells transfected with the indicated FLAG-tagged plasmids; the membrane
was incubated with both anti-FLAG and anti-GATA3 antibodies. Note that the GATA3 signal in input WCE extracts is quite weak and can
generally only be visualized after concentration by immunoprecipitation. The right panel is a separate blot prepared in the same way (using the
GATA antibody for immunoprecipitation), but does not include the WCE extracts. V, vector control; E, full length TCF7L2; E∆, TCF7L2 lacking the
amino terminus; B, TCF7L2 isoform lacking the carboxyl terminus; B∆, TCF7L2 isoform lacking the amino and carboxyl termini.
the second antibody, supporting the hypothesis that the
two factors are binding at the same time to the same
region. To further investigate the hypothesis that GATA3
tethers TCF7L2 to the genome, we sought to determine
whether GATA3 interacts with TCF7L2 in MCF7 cell
extracts using co-immunoprecipitation. Accordingly, we
expressed in MCF7 cells several different FLAG-tagged
TCF7L2 constructs that lack either or both of the amino-
or carboxy-terminal regions. The amino-terminal region
of TCF7L2 mediates the interaction with β-catenin and
the carboxy-terminal portion contains the so-called ‘E-tail’
important for the association with various co-regulators,
including CREBBP/EP300 (CREB binding protein/E1A
binding protein p300) [32-34]. A predominant isoform
lacking the E-tail has been referred to as the B-isoform
[17]. Immunoprecipitation of full-length TCF7L2 (the E-
isoform) and the B-isoform (which lacks the E-tail) as well
as B and E isoforms lacking the amino-terminal β-catenin
binding domain (termed E∆ and B∆) revealed that all iso-
forms are capable of co-precipitating GATA3 with equal
efficiency (Figure 6d). Conversely, immunoprecipitation of
GATA3 co-precipitated each of the tested TCF7L2 con-
structs, albeit with different degrees of efficiency (the full-
length E-tail TCF7L2 construct showed the greatest effi-
ciency of co-precipitation with GATA3). Importantly,
FLAG immunoprecipitation of extracts prepared from
MCF7 cells transfected with an empty vector failed to pre-
cipitate GATA3 and control IgG immunoprecipitation
reactions failed to precipitate GATA3 and the full-length
E construct. Therefore, endogenous GATA3 and exogen-
ously expressed TCF7L2 can interact in MCF7 cells.
Taken together, these data show that TCF7L2 and
GATA3 interact and co-localize to specific genomic loci
in MCF7 cells.
TCF7L2 functions as a repressor when tethered to the
genome by GATA3
To establish whether TCF7L2 and GATA3 have a co-
regulatory role in the expression of specific target genes,
we performed RNA-seq analysis of MCF7 cells before
and after knockdown of TCF7L2 or GATA3. We found
that, compared to cells treated with control siRNA, the
expression of 914 and 469 genes was significantly changed
upon knockdown of GATA3 or TCF7L2, respectively. Many of the genes
showing expression changes can be classified as having
functions involved in breast cancer, cell differentiation,
and response to hormone stimulus (Figure 7c); a list of
all genes whose expression was significantly altered by
each knockdown can be found in Additional file 16. To
identify genes that might be directly co-regulated by
GATA3 and TCF7L2, we first identified a set of 3,614
genes that are directly bound by both GATA3 and
TCF7L2 (Figure 7a). Then, we analyzed the expression of
these 3,614 GATA3+TCF7L2 target genes and found
that 268 and 163 genes have significantly altered expres-
sion levels in siGATA3- or siTCF7L2-treated cells,
respectively (Figure 7b). Approximately half of the set of
genes deregulated upon reduction of GATA3 show
increased expression and half show decreased expression,
suggesting that GATA3 can act as both an activator and
a repressor at the GATA3+TCF7L2 target genes. In con-
trast, most of the genes deregulated by reduction of
TCF7L2 show increased expression, suggesting that
TCF7L2 functions mainly as a repressor of the set of
genes co-bound by TCF7L2 and GATA3. As a final ana-
lysis, we identified genes that are co-bound by TCF7L2
and GATA3 and that show expression changes in both
the knockdown TCF7L2 cells and in the knockdown
GATA3 cells. Although this is a small set of genes, they
mainly clustered into two categories. For example, 16 co-
bound genes showed an increase in expression in both
the TCF7L2 and GATA3 knockdown cells, indicating
that both factors were functioning as a repressor of these
genes. In addition, we identified ten genes that decreased
with GATA3 knockdown but increased upon TCF7L2
knockdown, suggesting that TCF7L2 functioned to nega-
tively modulate GATA3-mediated activation at these
genes. A list of the genes that are cooperatively repressed
by direct binding of TCF7L2 and GATA3 and a list of
genes for which recruitment of TCF7L2 antagonizes
GATA3-mediated activation are shown in Table 4.
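The two categories described above amount to set intersections over the co-bound genes. The following is a minimal sketch of that logic only, not the authors' pipeline; the function name, argument names, and toy gene symbols are hypothetical, and the differential-expression calls are assumed to have been made upstream.

```python
def classify_cobound_genes(cobound, up_sigata3, down_sigata3, up_sitcf7l2):
    """Split co-bound genes into the two categories described in the text.

    All arguments are sets of gene symbols (hypothetical inputs):
      cobound      - genes bound by both GATA3 and TCF7L2
      up_sigata3   - genes whose expression increases after GATA3 knockdown
      down_sigata3 - genes whose expression decreases after GATA3 knockdown
      up_sitcf7l2  - genes whose expression increases after TCF7L2 knockdown
    """
    # Increased in both knockdowns: both factors behave as repressors.
    cooperative_repression = cobound & up_sigata3 & up_sitcf7l2
    # Decreased without GATA3 but increased without TCF7L2:
    # TCF7L2 opposes GATA3-mediated activation.
    antagonized_activation = cobound & down_sigata3 & up_sitcf7l2
    return sorted(cooperative_repression), sorted(antagonized_activation)


# Toy example; these placeholder symbols are for illustration only.
cobound = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
up_sigata3 = {"GENE_A", "GENE_E"}
down_sigata3 = {"GENE_B", "GENE_C"}
up_sitcf7l2 = {"GENE_A", "GENE_B", "GENE_F"}

cooperative, antagonized = classify_cobound_genes(
    cobound, up_sigata3, down_sigata3, up_sitcf7l2
)
print(cooperative)  # ['GENE_A']
print(antagonized)  # ['GENE_B']
```

Requiring membership in the co-bound set first keeps indirect targets (genes that change expression but are not bound by both factors) out of either category.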
Discussion
The TCF7L2 transcription factor has been linked to a vari-
ety of human diseases such as type 2 diabetes and cancer
[3,7-9,35]. To investigate the mechanisms by which this
site-specific DNA binding transcriptional regulator can
impact on such diverse diseases, we performed ChIP-seq
analysis for TCF7L2 in 6 different human cell lines, identi-
fying more than 116,000 non-redundant binding sites,
with only 1,864 sites being common to all 6 cell types.
Several striking discoveries that came from our ChIP-seq
analysis of the 6 different cell lines are: i) TCF7L2 has
multiple binding sites near each target gene; ii) TCF7L2
has developed cell type-specific mechanisms for regulating
a set of approximately 14,000 genes; iii) TCF7L2 binds to
more than 40% of the active enhancers in each of the 6
cancer cell lines; and iv) TCF7L2 functions as a repressor
when recruited to the genome via tethering by the master
regulator GATA3.
By analysis of the TCF7L2 ChIP-seq datasets from 6
different human cancer cell lines, we identified 116,270
TCF7L2 binding sites, with each cell type having approxi-
mately 25,000 to 50,000 TCF7L2 peaks. We note that
another group has examined TCF7L2 binding in human
HCT116 cells [12], identifying only 1,095 binding sites. It
is not clear why Zhao and colleagues [12] identified a much
smaller number of TCF7L2 binding sites in HCT116
Frietze et al. Genome Biology 2012, 13:R52
http://genomebiology.com/content/13/9/R52
cells, but it is not likely due to the antibody specificity
(the antibodies used in both studies give similar patterns
on western blots). It is more likely that the 30-fold differ-
ence in peak number is due to the ChIP protocol. Zhao
et al. [12] used protein A agarose beads, whereas we used
magnetic protein A/G beads; we have found that protein
A agarose beads produce low signals in many ChIP assays
(unpublished data). Interestingly, the 116,270 TCF7L2
Figure 7 Transcriptional regulation of TCF7L2 and GATA3 target genes. (a) The nearest gene to each TCF7L2 binding site and the nearest
gene to each GATA3 binding site were identified and the two lists were compared to identify 3,614 genes that are potentially regulated by both
GATA3 and TCF7L2. (b) The expression of the 3,614 GATA3+TCF7L2 bound genes was analyzed in control cells, cells treated with siRNAs to
TCF7L2, and cells treated with siRNAs to GATA3; the number of genes whose expression increases or decreases is shown. (c) A scatterplot of
expression data from RNA-seq experiments. Each point corresponds to one NCBI Reference Sequence (RefSeq) transcript with fragments per
kilobase of gene per million reads (FPKM) values for control and siGATA3 or control and siTCF7L2 knockdown samples shown on a log10 scale.
The dashed line represents no change in gene expression between the two samples. Differentially expressed genes whose function corresponds
to Gene Ontology categories of breast cancer, cell differentiation, and response to hormone stimulus are highlighted.
binding sites that we identified correspond to only 14,193
genes, with each target gene having an average of 8.2
TCF7L2 binding sites. Many of these binding sites are
cell type-specific, as exemplified by the fact that there are
only three to four TCF7L2 binding sites per target gene
in any one cell type (Figure 2c).
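The pooled average quoted above follows directly from the two totals given in the text. A quick check, using only those two numbers (the variable names are illustrative):

```python
total_sites = 116_270   # non-redundant TCF7L2 binding sites across the six cell lines
target_genes = 14_193   # genes to which those sites were assigned

sites_per_gene = total_sites / target_genes
print(round(sites_per_gene, 1))  # 8.2
```

The pooled figure is higher than the three to four sites per gene seen in any single cell type because sites from different cell types accumulate on the same gene.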
Cell type-specific binding patterns suggest that TCF7L2
binds cooperatively to the genome along with cell type-
specific factors. For example, the AP1 (activator protein 1)
motif is enriched in the sets of HCT116-specific and
MCF7-specific TCF7L2 binding sites. Interestingly,
TCF7L2 has previously been shown to physically interact
with JUN (which is one of the heterodimeric components
of AP1) and it has been suggested that the JUN and
TCF7L2 interaction is a molecular mechanism that inte-
grates the activation of the TCF and CTNNB1 pathway by
the JNK (Jun N-terminal kinase) pathway [36]. Although
ChIP-seq data for AP1 components are not available for
HCT116 or MCF7 cells, there are 7,400 genomic locations
that are bound by TCF7L2 in HCT116 cells that are also
bound by JUN in HeLa cells [11]; it is likely that a much
larger number of co-localizing regions would be identified
if the datasets were from the same cell type. Our detailed
bioinformatic analysis of the HepG2-specific TCF7L2
peaks suggested that HNF4a and FOXA2 might be bind-
ing partners of TCF7L2 in this cell type. A previous study
had shown that FOXA2 and HNF4a colocalize at a subset
of sites in mouse liver [37], but that study did not examine
the relationship of these sites with TCF7L2 binding.
Therefore, we experimentally validated our bioinformatic
prediction by comparing ChIP-seq data for all three fac-
tors. We found that greater than 50% of the TCF7L2
HepG2-specific binding sites are also bound by the liver
transcription factors HNF4a and FOXA2, suggesting that
this trio of factors cooperate in gene regulation. Based on
the identification of motifs for all three factors in the
TCF7L2 peaks, we suggest that TCF7L2, HNF4a, and
FOXA2 all bind directly to the DNA, perhaps with the
liver-specific factors helping to stabilize TCF7L2 genomic
binding to particular enhancer regions in HepG2 cells.
HNF4a and FOXA2 have been shown to be critical deter-
minants of hepatocyte identity; Hnf4a plus Foxa1, Foxa2,
or Foxa3 can convert mouse embryonic and adult fibro-
blasts into cells that closely resemble hepatocytes in vitro
[38]. The induced hepatocyte-like cells had multiple hepa-
tocyte-specific features and reconstituted damaged hepatic
tissues after transplantation. Future studies should address
a potential role of TCF7L2 in hepatocyte identity.
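The comparison of ChIP-seq peak sets described above reduces to asking what fraction of one factor's peaks overlap peaks of the other factors. The sketch below illustrates that calculation under stated assumptions: peaks are (chromosome, start, end) tuples with half-open coordinates, any shared base counts as an overlap, and a TCF7L2 peak is counted only if it overlaps both other factors. The source does not specify the overlap criterion actually used, and all names and coordinates here are hypothetical.

```python
from collections import defaultdict


def index_by_chrom(peaks):
    """Group (chrom, start, end) peaks into {chrom: [(start, end), ...]}."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    return by_chrom


def overlaps(peak, peaks_by_chrom):
    """True if the peak shares at least one base with any indexed peak."""
    chrom, start, end = peak
    return any(s < end and start < e for s, e in peaks_by_chrom.get(chrom, []))


def fraction_cobound(query_peaks, *other_peak_sets):
    """Fraction of query peaks that overlap a peak in every other set."""
    if not query_peaks:
        return 0.0
    indexed = [index_by_chrom(peaks) for peaks in other_peak_sets]
    hits = sum(all(overlaps(peak, idx) for idx in indexed) for peak in query_peaks)
    return hits / len(query_peaks)


# Toy coordinates for illustration only.
tcf7l2 = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 250), ("chr3", 10, 20)]
hnf4a = [("chr1", 250, 400), ("chr2", 0, 60), ("chr3", 500, 600)]
foxa2 = [("chr1", 90, 110), ("chr2", 200, 300), ("chr1", 1100, 1150)]

print(fraction_cobound(tcf7l2, hnf4a, foxa2))  # 0.5
```

The linear scan per peak is adequate for a sketch; for peak lists of the size reported here, a sorted or interval-tree index would be the usual choice.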
Bioinformatic analysis of the MCF7-specific TCF7L2
sites revealed that the GATA3 motif was highly enriched
and experimental analysis of MCF7 GATA3 ChIP-seq
data showed that nearly one-half of the MCF7-specific
TCF7L2 binding sites co-localize with GATA3. Interest-
ingly, we found that the TCF7L2 motif was not enriched
in the regions bound by both TCF7L2 and GATA3. These
results suggested that perhaps GATA3 binds directly to
the DNA at these sites and tethers TCF7L2 to the genome
at the MCF7-specific TCF7L2 binding sites. Accordingly,
we showed that depletion of GATA3 reduced recruitment
of TCF7L2 to a subset of genomic sites. We also demon-
strated that TCF7L2 functions mainly as a repressor when
tethered to the genome via GATA3. At some genes,
TCF7L2 cooperatively represses genes with GATA3 but at
other genes TCF7L2 antagonizes GATA3-mediated activation
(Figure 8).
Specification of cell phenotypes is achieved by sets of
master transcriptional regulators that activate the genes
specific for one cell fate while repressing genes that specify
other cell fates. The GATA factors, which include six site-
specific DNA binding proteins that bind to the sequence
(A/T)GATA(A/G), are master regulators that govern cell
differentiation [39-44]. For example, GATA1-3 have been
linked to the specification of different hematopoietic cell
fates and GATA4-6 are involved in differentiation of car-
diac and lung tissues. Also, GATA3 is the most highly
enriched transcription factor in the mammary epithelium,
has been shown to be necessary for mammary cell differ-
entiation, and is specifically required to maintain the lumi-
nal cell fate [43,44]. Studies of human breast cancers have
shown that GATA3 is expressed in early stage, well-differ-
entiated tumors but not in advanced invasive cancers. In
addition, GATA3 expression is correlated with longer dis-
ease-free survival and evidence suggests that it can prevent
or reverse the epithelial to mesenchymal transition that is
characteristic of cancer metastasis [45]. Our studies show
that TCF7L2 cooperates with the master regulator
GATA3 to repress transcription in the well-differentiated
Table 4 Genes repressed by TCF7L2 via a GATA motif
Cooperative repression by      TCF7L2 antagonizes GATA3-
TCF7L2 and GATA3               mediated activation
ABCA12                         CCND1
APP                            COL5A1
CAPN2                          CYP24A1
CDH11                          GPRC5A
EPHA4                          KRT80
GALC                           LYPD1
LTBP1                          MAP2
LYPD3                          RASGRP1
MARCKS                         RHOBTB3
NCOA3                          TGFB2
NFAT5                          TGFBR2
PAM
RNF145
S100A10
SECISBP2L
SORL1
MCF7 breast cancer cell line and suggest that a TCF7L2-
GATA3 complex may be a critical regulator of breast cell
differentiation.
Our finding that TCF7L2 co-localizes and cooperates in
gene regulation with a GATA factor in MCF7 breast
cancer cells is similar to a recent study of TCF7L2 in
hematopoietic cells. Trompouki et al. [13] showed that in
hematopoietic cells, TCF7L2 co-occupies sites with
GATA1 and GATA2, which are master regulators of
blood cell differentiation. Both the TCF7L2 motif and the
GATA motif were found at the co-bound sites (suggesting
adjacent binding of the two factors, not tethering) and
TCF7L2 functioned as a transcriptional activator at those
sites. In contrast, our studies indicate that co-localization
of TCF7L2 with GATA3 in MCF7 cells is not due to
adjacent binding but rather TCF7L2 is tethered to the genome
by interaction with GATA3 binding to a GATA motif
and that this tethering results in transcriptional repression.
A study of Drosophila TCF binding to the Ugt36Bc
upstream region indicated that TCF represses transcrip-
tion of the Ugt36Bc gene by binding to non-traditional
TCF motifs [46]. Interestingly, the three Ugt36Bc TCF
sites (AGAAAT, AGATAA, AGATAA) are almost identical
to the GATA3 motif. Blauwkamp et al. [46] suggest
that the sequence to which TCF binds has an important
function in determining whether a gene will be activated
or repressed. Their studies did not address whether TCF
bound directly to the GATA-like motifs. However, based
on our studies, it would be worthwhile to investigate a
possible genomic tethering mechanism of TCF by GATA
factors in Drosophila.
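The similarity between the Ugt36Bc sites and the GATA motif can be checked against the GATA consensus quoted earlier in the Discussion, (A/T)GATA(A/G). The sketch below is an illustration only: it treats that family consensus as a strict regular expression, scans both strands, and uses hypothetical helper names. It is not the position weight matrix scan used elsewhere in the study.

```python
import re

# (A/T)GATA(A/G) written as a regular expression.
GATA_MOTIF = re.compile(r"[AT]GATA[AG]")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def matches_gata(seq):
    """True if either strand contains an exact match to the consensus."""
    seq = seq.upper()
    return bool(
        GATA_MOTIF.search(seq) or GATA_MOTIF.search(reverse_complement(seq))
    )


for site in ("AGAAAT", "AGATAA"):
    print(site, matches_gata(site))
# AGAAAT False
# AGATAA True
```

Under this strict reading, AGATAA is an exact match and AGAAAT is a near match rather than an exact one, which fits the description of the sites as almost identical to the GATA motif.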
Conclusions
Our studies reveal numerous new insights into TCF7L2-
mediated gene regulation and suggest that TCF7L2 coop-
erates with other site-specific DNA binding factors to
regulate transcription in a cell type-specific manner. Speci-
fically, we show that TCF7L2 has highly cell type-specific
binding patterns, co-localizes with different factors in
different cell types, and can be tethered to the DNA by
GATA3 in breast cancer cells. Our work, in combination
with other studies [13,47], suggests that TCF7L2 may play
a critical role in creating and maintaining differentiated
phenotypes by cooperating with cell type-specific master
regulators such as HNF4a and FOXA2 in liver cells and
GATA3 in breast cells. Both FOXA and GATA family
members have been classified as pioneer factors, that is,
transcription factors that can access their binding sites
when other factors cannot, helping to create open chro-
matin to enable subsequent binding of other factors [48].
It is possible that FOXA2 and GATA3 serve as pioneer
factors that enhance the ability of TCF7L2 to access its
sites in liver and breast cells. In addition to having cell
type-specific partners, there are many different isoforms of
TCF7L2. Although the major isoforms of TCF7L2 are
similar in most cell types, it is possible that minor isoforms
contribute to the cell type-specificity of TCF7L2 binding
via interaction of co-localizing proteins with alternatively
Figure 8 Two modes of TCF7L2-mediated transcriptional repression of GATA3 target genes. (a) GATA3 tethers TCF7L2 to the genome
and both factors cooperate to repress target genes. (b) GATA3 tethers TCF7L2 to the genome with TCF7L2 antagonizing GATA3-mediated
transcriptional activation.
encoded exons of TCF7L2. We anticipate that future stu-
dies employing isoform-specific antibodies to identify
TCF7L2 binding sites in normal and diseased tissues will
provide additional insight into the transcriptional net-
works that are altered in diseases such as type 2 diabetes,
pancreatic cancer, and coronary artery disease.
Materials and methods
Cell culture
The human cell lines HCT116 (ATCC #CCL-247), HepG2
(ATCC #HB-8065), HEK293 (ATCC #CRL-1573), MCF7
(ATCC #HTB-22), HeLa (ATCC #CCL-2.) and PANC1
(ATCC #CRL-1469) were obtained from the American
Type Culture Collection. HCT116 cells were grown in
McCoy’s 5A Medium supplemented with 10% fetal bovine
serum and 1% penicillin/streptomycin until 80% confluent,
whereas HepG2, HEK293, MCF7, HeLa and PANC1 cells
were grown in Dulbecco’s modified Eagle’s medium supplemented
with 10% fetal bovine serum, 2 mM L-glutamine
and 1% penicillin/streptomycin until 75 to 90%
confluent.
siRNA-mediated knockdown
All siRNAs were purchased from Dharmacon (Thermo
Fisher Scientific-Dharmacon Products, Lafayette, CO,
USA; ON-TARGET plus SMART pool - Human GATA3,
TCF7L2 and Non-Targeting siRNA) and transfected using
Lipofectamine™ 2000 Transfection Reagent according to
the manufacturer’s instructions (Life Technologies, Grand
Island, NY, USA). Then, 48 to 56 h following transfection,
cells were either crosslinked for ChIP assays or collected
for RNA and protein extraction.
ChIP-seq assays
The antibodies used for ChIP-seq were: TCF7L2 (Cell Sig-
naling Technology, Danvers, MA, USA; C48H11 #2569),
GATA3 (Santa Cruz Biotechnology, Santa Cruz, CA, USA;
#sc-268), H3K4me1 (Cell Signaling Technology, Danvers,
MA, USA; 9723S lot1), and H3K27Ac (Abcam, Cam-
bridge, MA, USA; Ab4729 lot #GR16377-1). The TCF7L2
antibody will detect both major isoforms of TCF7L2. See
Additional file 2 for details of all ChIP-seq experiments.
For each factor or histone modification and cell type combination,
we performed duplicate ChIP-seq experiments
using chromatin from two different cell culture dates. For
the TCF7L2 ChIP-seq assays, 500 μg chromatin was incubated
with 25 μg of antibody; for the GATA3 experiments,
600 μg chromatin was incubated with 50 μg of antibody;
and for the histone ChIP-seq experiments, 10 to 12 μg
chromatin and 8 to 10 μg of antibody were used. TCF7L2
and histone ChIP assays were performed as described pre-
viously [49] using protein A/G magnetic beads to collect
the immunoprecipitates. GATA3 ChIP-seq experiments
were performed using StaphA (Sigma-Aldrich, St. Louis,
MO, USA) to collect the immunoprecipitates [50]. After
qPCR confirmed enrichment of target sequences in ChIP
versus input samples, libraries were created as previously
described with minor modifications [49]. Gel size selection
of the 200 to 500 bp fraction (TCF7L2 and histones) or
the 300 to 600 bp fraction (GATA3) was conducted after
the adapter ligation step, followed by 15 amplification
cycles. qPCR (see Additional file 17 for a list of primers
used in this study) was performed to confirm enrichment
of targets in the libraries and then the libraries were ana-
lyzed using an Illumina GAIIx. Sequence reads were
aligned to the UCSC human genome assembly HG19
using the Eland pipeline (Illumina).
ChIP-seq data processing
The BELT program [24] and Sole-search [11,51] were
used to identify peaks for TCF7L2 and for modified his-
tones. We used the ENCODE overlap rules to evaluate the
reproducibility of the two biological replicates for each factor
or histone modification and cell-type combination. For
this, we first truncated the peak lists of the two replicates
for a given factor/cell-type combination so that the replicate A
and replicate B peak lists were the same length. Then, we
overlapped the top 40% of the replicate A peak list with
the entire replicate B peak list (and vice versa). ENCODE
standards state that approximately 80% of the top 40% set
should be contained in the larger set. After determining
that replicate datasets met this standard (Additional file 4),
we merged the two replicates and called peaks on the
merged dataset. To determine if we had identified the
majority of the TCF7L2 peaks in each cell type, we per-
formed a saturation analysis. We randomly selected differ-
ent percentages of the reads (10%, 20%, 30%,...,100%) from
the merged datasets from the TCF7L2 ChIP-seq experi-
ments for each cell line and called peaks using the BELT
program; each merged dataset was analyzed three times.
The number of peaks identified in each subset of the total
reads was plotted to demonstrate that we had enough
reads for each dataset to identify the majority of peaks
(Additional file 7).
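The replicate check described above (truncate both peak lists to the same length, then ask whether roughly 80% of the top 40% of one replicate is found in the other) can be sketched as follows. This is a minimal illustration, not the code used in the study. It assumes peaks are (chromosome, start, end, score) tuples, that truncation keeps the highest-scoring peaks, and that any shared base counts as an overlap; none of these details is specified in the text, and all function names are hypothetical.

```python
def overlaps_any(peak, others):
    """True if the peak shares at least one base with any peak in others."""
    chrom, start, end, _score = peak
    return any(c == chrom and s < end and start < e for c, s, e, _ in others)


def top_fraction_recovered(rep_a, rep_b, top=0.40):
    """Fraction of the top-scoring share of rep_a that is found in rep_b.

    Both lists are first truncated to the length of the shorter one
    (assumption: the highest-scoring peaks are the ones kept).
    """
    n = min(len(rep_a), len(rep_b))
    if n == 0:
        return 0.0
    a = sorted(rep_a, key=lambda p: p[3], reverse=True)[:n]
    b = sorted(rep_b, key=lambda p: p[3], reverse=True)[:n]
    top_a = a[: max(1, int(n * top))]
    return sum(overlaps_any(p, b) for p in top_a) / len(top_a)


def passes_encode_rule(rep_a, rep_b, top=0.40, required=0.80):
    """Apply the overlap test in both directions (A vs B and B vs A)."""
    return (
        top_fraction_recovered(rep_a, rep_b, top) >= required
        and top_fraction_recovered(rep_b, rep_a, top) >= required
    )


# Toy replicates for illustration only.
rep_a = [("chr1", 100, 200, 50), ("chr1", 500, 600, 40), ("chr2", 100, 200, 30),
         ("chr2", 900, 1000, 20), ("chr3", 10, 60, 10)]
rep_b = [("chr1", 150, 250, 45), ("chr1", 520, 580, 42), ("chr2", 120, 180, 25),
         ("chr3", 400, 500, 22), ("chr4", 0, 100, 15), ("chr5", 0, 100, 5)]

print(top_fraction_recovered(rep_a, rep_b))  # 1.0
print(passes_encode_rule(rep_a, rep_b))      # True
```

Testing both directions matters because a replicate with many weak extra peaks can contain the other replicate's strongest peaks without the reverse being true.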
RNA-seq
RNA was extracted using Trizol Reagent (Life Technologies)
following the suggested protocol; 2 μg of each RNA
sample was used with the Illumina TruSeq RNA Sample
Prep Kit (catalogue number RS-122-2001) to make RNA
libraries following the Illumina TruSeq RNA Sample pre-
paration Low-Throughput protocol. Briefly, RNA was
fragmented, then first-strand cDNA was prepared using
the kit-supplied 1st Strand Master Mix and user-supplied
Superscript III (Life Technologies, catalogue number
18080-051) followed by second strand cDNA synthesis.
The Illumina protocol and reagents were used to com-
plete the library preparation, with 12 cycles of PCR
amplification. Libraries were sequenced using an Illumina
GAIIx and analyzed as described in Additional file 3.
ChIP-qPCR assays
ChIP assays were performed as described in the ChIP-seq
section, except that 30 μg equivalents of DNA was used
for each ChIP reaction. The ChIP eluates were analyzed
by qPCR using the Bio-Rad SsoFast™ EvaGreen® Supermix
(catalogue number 172-5202) according to the
manufacturer’s instructions (Bio-Rad, Hercules, CA, USA).
Generation of TCF7L2 expression constructs and co-
immunoprecipitation assays
TCF7L2 expression constructs were generated by PCR
amplification of cDNA prepared from RNA isolated
from MCF7 cell cultures and used for GATEWAY clon-
ing into the pTRED-N-FLAG expression vector, which
contains an amino-terminal FLAG tag. Control empty
vector or an expression construct was transfected into
MCF7 cells using Lipofectamine™ 2000 according to
the manufacturer’s instructions (Life Technologies); 36 h
following transfection, cells were harvested and lysed in
ice-cold NP-40 lysis buffer (phosphate-buffered saline,
0.25% NP-40, 0.1% sodium-deoxycholate, 2 mM phenyl-
methylsulfonyl fluoride (PMSF) and 10 μg/ml leupeptin
and aprotinin) for co-immunoprecipitation assays. Fol-
lowing extraction on ice for 30 minutes and clarification
by centrifugation, soluble protein extracts were diluted
1:10 with lysis buffer and incubated with either an anti-
FLAG M2 agarose conjugated antibody (Sigma catalogue
number A2220), an anti-GATA3 conjugated antibody
(Santa Cruz HG3-31-AC), or a control rabbit IgG agar-
ose conjugated antibody (Sigma catalogue number
A2909) for 4 hours at 4°C. The beads were then washed
four times and eluted with SDS-PAGE sample buffer
prior to SDS-PAGE and western blot analysis using anti-
bodies specific for GATA3 (Santa Cruz HG3-31) or
FLAG (Sigma catalogue number A8592).
Data access
All data are publicly available via the UCSC Genome
Preview Browser and/or have been submitted to the
Gene Expression Omnibus (information concerning how
to access the data is provided in Additional file 2).
Additional material
Additional file 1: Figure S1 - antibody validation.
Additional file 2: Table S1 - summary of ChIP-seq and RNA-seq
experiments. All ChIP-seq experiments performed by the Farnham
laboratory are listed; GEO numbers are provided and the data are
available at the UCSC genome browser.
Additional file 3: Supplementary Methods.
Additional file 4: Table S2 - ChIP-seq reproducibility. To determine
the reproducibility of the ChIP-seq data, we used the method of
evaluating replicates as described in the ENCODE Standards document
[53]. Briefly, the ENCODE consortium rules are as follows: ‘80% of the top
40% of the targets identified in one replicate should be contained within
the list of targets from the other replicate’. This metric was chosen based
on experiences of all the ENCODE production groups to allow an
achievable threshold of reproducibility while producing high quality
target lists. All ChIP-seq data for site-specific factors submitted to the
UCSC browser as part of ENCODE have to pass this quality metric and, as
can be seen in Table S2, all of the TCF7L2 data in our manuscript have
passed. We note that the metric for reproducibility of ‘broad peak’
histone marks (such as H3K4me1) has not yet been established by
ENCODE. Due to difficulties in calling peaks for such histone marks, the
overlap is sometimes lower than 80%.
Additional file 5: Table S3 - all TCF7L2 peaks in six cell types.
TCF7L2 peaks were called using BELT [54] and the merged datasets for
each cell type (see Additional file 4 for the peak calling parameters for
each dataset).
Additional file 6: Table S7 - summary of TCF7L2 peak
characteristics. TCF7L2 peaks were called using BELT [54] and the
merged datasets for each cell type (see Additional file 4 for the peak
calling parameters for each dataset).
Additional file 7: Figure S2 - saturation analysis.
Additional file 8: Table S9 - amplified regions in the six cancer cell
lines. Sole-search was used to call peaks for TCF7L2 in each of the six
cancer cell lines, using input from each line as the specific control. One
novel feature of Sole-search is that it provides a list of all amplified
regions found in the input control. For each cancer cell line, a worksheet
of all amplified regions is provided; column F indicates the fold
amplification and column I indicates the chromosomal coordinates of
the amplified region.
Additional file 9: Table S10 - TCF7L2 peaks in amplified regions. The
TCF7L2 peaks from each cell line were overlapped with the amplified
regions from that same cell line. Presented in this table is a summary of
all the overlaps and a worksheet for each cell line that lists each peak
that is found in the amplified regions (column A indicates the
chromosome and columns D and E indicate the chromosomal
coordinates of the amplified region that contains the peak; column F
indicates the fold amplification of the amplified region; and column I
indicates the chromosomal coordinate of the TCF7L2 peak and the
height of the peak).
Additional file 10: Table S4 - TCF7L2 peaks unique to each of the
six cell types. To identify cell type-specific TCF7L2 peaks for a particular
cell, we first combined the five sets of peaks from the other cell types,
and then identified the unique set of peaks for the given cell type by
removing sites in common with the combined set. For these analyses,
the merged replicate TCF7L2 datasets were used.
Additional file 11: Figure S3 - ChIP-qPCR validation of TCF7L2 sites.
Additional file 12: Figure S4 - TCF7L2 binds to cell type-specific
enhancer regions.
Additional file 13: Table S6 - TCF7L2 binding motifs in six cell
types. We used our ChIPMotifs program to identify two canonical
TCF7L2 motifs, W1 of 6 bp and W2 of 8 bp, for each cell type. We then
used each of two motifs’ position weight matrices to scan the sequences
of the peaks to determine how many peaks contained the motifs; we
examined the set of all peaks and the set of cell type-specific peaks for
all six cell types.
Additional file 14: Figure S5 - motif recovery percentage plots.
Additional file 15: Figure S6 - Re-ChIP analysis of GATA3 and
TCF7L2 sites.
Additional file 16: Table S8 - RNA-seq analysis of TCF7L2 and
GATA3 knockdowns. The RNAseq data were processed by TopHat and
Cufflinks programs essentially as described [55].
Additional file 17: Table S5 - primers used in ChIP-qPCR analyses.
The sequences of positive and negative control primers used for ChIP-
qPCR.
Abbreviations
AP1: activator protein 1; bp: base pair; ChIP: chromatin immunoprecipitation;
CTNNB1: catenin beta 1; FOX: forkhead box; GATA: GATA binding protein;
H3K27Ac: histone H3 acetylated on lysine 27; H3K4me1: histone H3
monomethylated on lysine 4; HNF: hepatocyte nuclear factor; PCR:
polymerase chain reaction; qPCR: quantitative PCR; siRNA: small interfering
RNA; TCF7L2: transcription factor 7-like 2.
Acknowledgements
The work was supported in part by 1U54HG004558 as a component of the
ENCODE Project and by P30CA014089 from the National Cancer Institute.
The HepG2 and HeLa H3K4me1 and H3K27Ac ChIP-seq data were generated
at the Broad Institute and in the Brad Bernstein lab at the Massachusetts
General Hospital/Harvard Medical School; the HNF4a and FOXA2 ChIP-seq
data were produced in the lab of Rick Myers at the HudsonAlpha Institute
for Biotechnology; the MCF7 H3K4me3 and RNA polymerase II ChIP-seq data
were generated by the Iyer lab at UT-Austin. For these ChIP-seq datasets,
data generation and analysis was supported, in part, by funds from the
NHGRI as part of the ENCODE project. We thank Drs Bernstein, Myers, and
Iyer and the Data Coordination Center at UCSC for providing access to these
data. All other ChIP-seq data and the RNA-seq data from control and siRNA-
treated MCF7 cells were generated by the Farnham lab at the University of
Southern California; libraries were sequenced at the USC Epigenome Data
Production Facility and at Stanford University.
Author details
1 Department of Biochemistry and Molecular Biology, Norris Comprehensive
Cancer Center, University of Southern California, Los Angeles, CA 90089, USA.
2 Department of Chemistry, Lanzhou University, Lanzhou 730000, China.
3 Department of Biomedical Informatics, The Ohio State University, Columbus,
OH 43210, USA.
Authors’ contributions
All authors have read and approved the manuscript for publication. SF
performed ChIP-seq assays and bioinformatics analyses, participated in the
study design and coordination, and helped to draft the manuscript. RW
performed bioinformatic analyses. YGT, LY, MG, and HW performed ChIP-seq
assays and participated in analysis of the data. VXJ participated in the design
of the study, performed data analyses, and helped to draft the manuscript.
PJF conceived of the study, participated in its design and coordination,
performed data analyses, and drafted the manuscript.
Competing interests
The authors declare that they have no competing interests.
Received: 9 March 2012 Revised: 9 March 2012 Accepted: 25 May 2012
Published: 5 September 2012
References
1. Poy F, Lepourcelet M, Shivdasani RA, Eck MJ: Structure of a human Tcf4-
beta-catenin complex. Nat Struct Biol 2001, 8:1053-1057.
2. Prokunina-Olsson L, Welch C, Hansson O, Adhikari N, Scott LJ, Usher N,
Tong M, Sprau A, Swift A, Bonnycastle LL, Erdos MR, He Z, Saxena R,
Harmon B, Kotova O, Hoffman EP, Altshuler D, Groop L, Boehnke M,
Collins FS, Hall JL: Tissue-specific alternative splicing of TCF7L2. Hum Mol
Genet 2009, 18:3795-3804.
3. Ravindranath A, O’Connell A, Johnston PG, El-Tanani MK: The role of LEF/
TCF factors in neoplastic transformation. Curr Mol Med 2008, 8:38-50.
4. Roose J, Clevers H: TCF transcription factors: molecular switches in
carcinogenesis. Biochim Biophys Acta 1999, 1424:M23-M37.
5. Shitashige M, Hirohashi S, Yamada T: Wnt signaling inside the nucleus.
Cancer Sci 2008, 99:631-637.
6. Grove EA: Wnt signaling meets internal dissent. Genes Dev 2011,
25:1759-1762.
7. Gambino R, Bo S, Gentile L, Musso G, Pagano G, Cavallo-Perin P,
Cassader M: Transcription factor 7-like 2 (TCF7L2) polymorphism and
hyperglycemia in an adult Italian population-based cohort. Diabetes Care
2010, 33:1233-1235.
8. Grant SF, Thorleifsson G, Reynisdottir I, Benediktsson R, Manolescu A,
Sainz J, Helgason A, Stefansson H, Emilsson V, Helgadottir A,
Styrkarsdottir U, Magnusson KP, Walters GB, Palsdottir E, Jonsdottir T,
Gudmundsdottir T, Gylfason A, Saemundsdottir J, Wilensky RL, Reilly MP,
Rader DJ, Bagger Y, Christiansen C, Gudnason V, Sigurdsson G,
Thorsteinsdottir U, Gulcher JR, Kong A, Stefansson K: Variant of
transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2
diabetes. Nat Genet 2006, 38:320-323.
9. Voight BF, Scott LJ, Steinthorsdottir V, Morris AP, Dina C, Welch RP,
Zeggini E, Huth C, Aulchenko YS, Thorleifsson G, McCulloch LJ, Ferreira T,
Grallert H, Amin N, Wu G, Willer CJ, Raychaudhuri S, McCarroll SA,
Langenberg C, Hofmann OM, Dupuis J, Qi L, Segre AV, van Hoek M,
Navarro P, Ardlie K, Balkau B, Benediktsson R, Bennett AJ, Blagieva R, et al:
Twelve type 2 diabetes susceptibility loci identified through large-scale
association analysis. Nat Genet 2010, 42:579-589.
10. Weedon MN: The importance of TCF7L2. Diabet Med 2007, 24:1062-1066.
11. Blahnik KR, Dou L, O’Geen H, McPhillips T, Xu X, Cao AR, Iyengar S,
Nicolet CM, Ludaescher B, Korf I, Farnham PJ: Sole-search: an integrated
analysis program for peak detection and functional annotation using
ChIP-seq data. Nucleic Acids Res 2010, 38:e13.
12. Zhao J, Schug J, Li M, Kaestner KH, Grant SF: Disease-associated loci are
significantly over-represented among genes bound by transcription
factor 7-like 2 (TCF7L2) in vivo. Diabetologia 2010, 53:2340-2346.
13. Trompouki E, Bowman TV, Lawton LN, Fan ZP, Wu DC, DiBiase A, Martin CS,
Cech JN, Sessa AK, Leblanc JL, Li P, Durand EM, Mosimann C, Heffner GC,
Daley GQ, Paulson RF, Young RA, Zon LI: Lineage regulators direct BMP
and Wnt pathways to cell-specific programs during differentiation and
regeneration. Cell 2011, 147:577-589.
14. Zhou Y, Zhang E, Berggreen C, Jing X, Osmark P, Lang S, Cilio CM,
Goransson O, Groop L, Renstrom E, Hansson O: Survival of pancreatic beta
cells is partly controlled by a TCF7L2-p53-p53INP1-dependent pathway.
Hum Mol Genet 2012, 21:196-207.
15. Norton L, Fourcaudot M, Abdul-Ghani MA, Winnier D, Mehta FF,
Jenkinson CP, Defronzo RA: Chromatin occupancy of transcription factor
7-like 2 (TCF7L2) and its role in hepatic glucose metabolism. Diabetologia
2011, 54:3132-3142.
16. Osmark P, Hansson O, Jonsson A, Ronn T, Groop L, Renstrom E: Unique
splicing pattern of the TCF7L2 gene in human pancreatic islets.
Diabetologia 2009, 52:850-854.
17. Weise A, Bruser K, Elfert S, Wallmen B, Wittel Y, Wohrle S, Hecht A:
Alternative splicing of Tcf7l2 transcripts generates protein variants with
differential promoter-binding and transcriptional activation properties at
Wnt/beta-catenin targets. Nucleic Acids Res 2010, 38:1964-1981.
18. Hansson O, Zhou Y, Renstrom E, Osmark P: Molecular function of TCF7L2:
Consequences of TCF7L2 splicing for molecular function and risk for
type 2 diabetes. Curr Diab Rep 2010, 10:444-451.
19. Le Bacquer O, Shu L, Marchand M, Neve B, Paroni F, Kerr Conte J, Pattou F,
Froguel P, Maedler K: TCF7L2 splice variants have distinct effects on
beta-cell turnover and function. Hum Mol Genet 2011, 20:1906-1915.
20. Locke JM, Da Silva Xavier G, Rutter GA, Harries LW: An alternative
polyadenylation signal in TCF7L2 generates isoforms that inhibit T cell
factor/lymphoid-enhancer factor (TCF/LEF)-dependent target genes.
Diabetologia 2011, 54:3078-3082.
21. Cuilliere-Dartigues P, El-Bchiri J, Krimi A, Buhard O, Fontanges P, Flejou JF,
Hamelin R, Duval A: TCF-4 isoforms absent in TCF-4 mutated MSI-H
colorectal cancer cells colocalize with nuclear CtBP and repress TCF-4-
mediated transcription. Oncogene 2006, 25:4441-4448.
22. Duval A, Rolland S, Tubacher E, Bui H, Thomas G, Hamelin R: The human T-
cell transcription factor-4 gene: structure, extensive characterization of
alternative splicings, and mutational analysis in colorectal cancer cell
lines. Cancer Res 2000, 60:3872-3879.
23. Frietze S, Lan X, Jin VX, Farnham PJ: Genomic targets of the KRAB and
SCAN domain-containing zinc finger protein 263. J Biol Chem 2010,
285:1393-1403.
24. Lan X, Bonneville R, Apostolos J, Wu W, Jin VX: W-ChIPeaks: a
comprehensive web application tool for processing ChIP-chip and ChIP-
seq data. Bioinformatics 2011, 27:428-430.
25. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G,
Chepelev I, Zhao K: High-resolution profiling of histone methylations in
the human genome. Cell 2007, 129:823-837.
26. Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alverez P,
Brockman W, Kim T-K, Koche RP, Lee W, Mendenhall E, O’Donovan A,
Presser A, Russ C, Xie X, Meissner A, Wernig M, Jaenisch R, Nusbaum C,
Lander ES, Bernstein BE: Genome-wide maps of chromatin state in
pluripotent and lineage-committed cells. Nature 2007, 448:553-560.
27. Wang Z, Zang C, Rosenfeld JA, Schones DE, Barski A, Cuddapah S, Cui K,
Roh T-Y, Peng W, Zhang MQ, Zhao K: Combinatorial patterns of histone
acetylations and methylations in the human genome. Nat Genet 2008,
40:897-903.
28. Jin V, Apostolos J, Nagisetty NS, Farnham PJ: W-ChIPMotifs: a web
application tool for de novo motif discovery from ChIP-based high
throughput data. Bioinformatics 2009, 25:3191-3193.
29. Jin VX, O’Geen H, Iyengar S, Green R, Farnham PJ: Identification of an
OCT4 and SRY regulatory module using integrated computational and
experimental genomics approaches. Genome Res 2007, 17:807-817.
30. Hatzis P, van der Flier LG, van Driel MA, Guryev V, Nielsen F, Denissov S,
Nijman IJ, Koster J, Santo EE, Welboren W, Versteeg R, Cuppen E, van de
Wetering M, Clevers H, Stunnenberg HG: Genome-wide pattern of
TCF7L2/TCF4 chromatin occupancy in colorectal cancer cells. Mol Cell
Biol 2008, 28:2732-2744.
31. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C,
Singh H, Glass CK: Simple combinations of lineage-determining
transcription factors prime cis-regulatory elements required for
macrophage and B cell identities. Mol Cell 2010, 38:576-589.
32. Behrens J, von Kries JP, Kuhl M, Bruhn L, Wedlich D, Grosschedl R,
Birchmeier W: Functional interaction of beta-catenin with the
transcription factor LEF-1. Nature 1996, 382:638-642.
33. Fasolini M, Wu X, Flocco M, Trosset JY, Oppermann U, Knapp S: Hot spots
in Tcf4 for the interaction with beta-catenin. J Biol Chem 2003,
278:21092-21098.
34. Hecht A, Stemmler MP: Identification of a promoter-specific
transcriptional activation domain at the C terminus of the Wnt effector
protein T-cell factor 4. J Biol Chem 2003, 278:3776-3785.
35. Hansen T, Ingason A, Djurovic S, Melle I, Fenger M, Gustafsson O,
Jakobsen KD, Rasmussen HB, Tosato S, Rietschel M, Frank J, Owen M,
Bonetto C, Suvisaari J, Thygesen JH, Petursson H, Lonnqvist J, Sigurdsson E,
Giegling I, Craddock N, O’Donovan MC, Ruggeri M, Cichon S, Ophoff RA,
Pietilainen O, Peltonen L, Nothen MM, Rujescu D, St Clair D, Collier DA,
et al: At-risk variant in TCF7L2 for type II diabetes increases risk of
schizophrenia. Biol Psychiatry 2011, 70:59-63.
36. Nateri A, Spencer-Dene B, Behrens A: Interaction of phosphorylated c-Jun
with TCF4 regulates intestinal cancer development. Nature 2005,
437:281-285.
37. Hoffman BG, Robertson G, Zavaglia B, Beach M, Cullum R, Lee S,
Soukhatcheva G, Li L, Wederell ED, Thiessen N, Bilenky M, Cezard T, Tam A,
Kamoh B, Birol I, Dai D, Zhao Y, Hirst M, Verchere CB, Helgason CD,
Marra MA, Jones SJ, Hoodless PA: Locus co-occupancy, nucleosome
positioning, and H3K4me1 regulate the functionality of FOXA2-, HNF4A-,
and PDX1-bound loci in islets and liver. Genome Res 2010, 20:1037-1051.
38. Sekiya S, Suzuki A: Direct conversion of mouse fibroblasts to hepatocyte-
like cells by defined factors. Nature 2011, 475:390-393.
39. Bresnick EH, Lee HY, Fujiwara T, Johnson KD, Keles S: GATA switches as
developmental drivers. J Biol Chem 2010, 285:31087-31093.
40. Brewer A, Pizzey J: GATA factors in vertebrate heart development and
disease. Expert Rev Mol Med 2006, 8:1-20.
41. Chou J, Provot S, Werb Z: GATA3 in development and cancer
differentiation: cells GATA have it! J Cell Physiol 2010, 222:42-49.
42. Patient RK, McGhee JD: The GATA family (vertebrates and invertebrates).
Curr Opin Genet Dev 2002, 12:416-422.
43. Kouros-Mehr H, Bechis SK, Slorach EM, Littlepage LE, Egeblad M, Ewald AJ,
Pai SY, Ho IC, Werb Z: GATA-3 links tumor differentiation and
dissemination in a luminal breast cancer model. Cancer Cell 2008,
13:141-152.
44. Kouros-Mehr H, Kim JW, Bechis SK, Werb Z: GATA-3 and the regulation of
the mammary luminal cell fate. Curr Opin Cell Biol 2008, 20:164-170.
45. Yan W, Cao QJ, Arenas RB, Bentley B, Shao R: GATA3 inhibits breast cancer
metastasis through the reversal of epithelial-mesenchymal transition. J
Biol Chem 2010, 285:14042-14051.
46. Blauwkamp TA, Chang MV, Cadigan KM: Novel TCF-binding sites specify
transcriptional repression by Wnt signalling. EMBO J 2008, 27:1436-1446.
47. Verzi MP, Hatzis P, Sulahian R, Philips J, Schuijers J, Shin H, Freed E,
Lynch JP, Dang DT, Brown M, Clevers H, Liu XS, Shivdasani RA: TCF4 and
CDX2, major transcription factors for intestinal function, converge on
the same cis-regulatory regions. Proc Natl Acad Sci USA 2010,
107:15157-15162.
48. Zaret KS, Carroll JS: Pioneer transcription factors: establishing
competence for gene expression. Genes Dev 2011, 25:2227-2241.
49. O’Geen H, Echipare L, Farnham PJ: Using ChIP-Seq technology to
generate high-resolution profiles of histone modifications. Methods Mol
Biol 2011, 791:265-286.
50. O’Geen H, Frietze S, Farnham PJ: Using ChIP-seq technology to identify
targets of zinc finger transcription factors. Methods Mol Biol 2010,
649:437-455.
51. Blahnik KR, Dou L, Echipare L, Iyengar S, O’Geen H, Sanchez E, Zhao Y,
Marra MA, Hirst M, Costello JF, Korf I, Farnham PJ: Characterization of the
contradictory chromatin signatures at the 3’ exons of zinc finger genes.
PLOS One 2011, 6:e17121.
doi:10.1186/gb-2012-13-9-r52
Cite this article as: Frietze et al.: Cell type-specific binding patterns
reveal that TCF7L2 can be tethered to the genome by association with
GATA3. Genome Biology 2012 13:R52.
RESEARCH Open Access
ZBTB33 binds unmethylated regions of the
genome associated with actively expressed genes
Adam Blattler1,2, Lijing Yao1, Yao Wang3, Zhenqing Ye3, Victor X Jin3 and Peggy J Farnham1*
Abstract
Background: DNA methylation and repressive histone modifications cooperate to silence promoters. One
mechanism by which regions of methylated DNA could acquire repressive histone modifications is via methyl DNA-
binding transcription factors. The zinc finger protein ZBTB33 (also known as Kaiso) has been shown in vitro to bind
preferentially to methylated DNA and to interact with the SMRT/NCoR histone deacetylase complexes. We have
performed bioinformatic analyses of Kaiso ChIP-seq and DNA methylation datasets to test a model whereby
binding of Kaiso to methylated CpGs leads to loss of acetylated histones at target promoters.
Results: Our results suggest that, contrary to expectations, Kaiso does not bind to methylated DNA in vivo but
instead binds to highly active promoters that are marked with high levels of acetylated histones. In addition, our
studies suggest that DNA methylation and nucleosome occupancy patterns restrict access of Kaiso to potential
binding sites and influence cell type-specific binding.
Conclusions: We propose a new model for the genome-wide binding and function of Kaiso whereby Kaiso binds
to unmethylated regulatory regions and contributes to the active state of target promoters.
Keywords: DNA methylation, Zinc finger proteins, Histone modifications, Transcription factor binding, Epigenetics,
Transcriptional regulation
Background
Genes are epigenetically regulated by a combination of
histone modifications and methylation of CpG dinucleo-
tides near their promoters [1,2]. Promoters that have high
levels of DNA methylation always show low activity
whereas the relationship of histone methylation and
promoter activity differs depending on exactly which
residue is methylated. For example, trimethylation of lysine
4 of histone H3 (H3K4me3) correlates with high promoter
activity whereas trimethylation of lysine 9 or lysine 27 of
histone H3 (H3K9me3 or H3K27me3) correlates with low
promoter activity. Recent studies have demonstrated
distinct spatial relationships of these two repressive histone
modifications with regions of methylated DNA [3,4]. These
analyses entailed chromatin immunoprecipitation using
antibodies that recognize specific histone modifications
followed by bisulfite sequencing of the ChIP DNA
(BisChIP-seq). One study showed that H3K27me3 can
sometimes be associated with methylated promoters but
that the H3K27me3-marked regions are repressed inde-
pendently of the level of DNA methylation [3]. A similar
study by Brinkman et al. [4] found that H3K27me3 and
DNA methylation are compatible except at CpG-rich
promoters where the two marks are mutually exclusive,
supporting a previous study by Komashko et al. [5]. The
studies of Komashko et al. also showed a very low overlap
of the sets of promoters having high levels of H3K9me3
versus high levels of DNA methylation. From their
BisChIP-seq results, Brinkman et al. suggest that H3K9me3
has a high degree of overlap with regions of methylated
DNA. However, the majority of H3K9me3 sites in the
genome are not at promoter regions and, unfortunately,
their analyses did not distinguish the DNA methylation
status of promoter versus non-promoter H3K9me3
sites. In summary, although the spatial relationship of
these repressive histone methylations has been described,
it is not yet clear if DNA methylation and H3K27me3 or
H3K9me3 cooperate to repress promoter regions.
* Correspondence: pfarnham@usc.edu
1 Department of Biochemistry & Molecular Biology, Norris Comprehensive
Cancer Center, University of Southern California, Los Angeles, CA 90089, USA
Full list of author information is available at the end of the article
© 2013 Blattler et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
Blattler et al. Epigenetics & Chromatin 2013, 6:13
http://www.epigeneticsandchromatin.com/content/6/1/13
In addition to repressive histone methylation modifications,
active versus inactive promoters can also be
distinguished by the acetylation status of histone H3.
Active promoters are marked by both acetylated
lysine 9 and acetylated lysine 27 on histone H3 (H3K9Ac
and H3K27Ac). In fact, H3K9Ac is the modified histone
most highly associated with transcriptional activity, as
shown in the integrative analysis of a large number of
datasets by the ENCODE consortium [6]. Although DNA
methylation may not cooperate with repressive histone
methylation modifications to silence genes, there may be a
functional relationship between DNA methylation and
histone deacetylases. For example, it is possible that
DNA methylation could be a signal for the binding of
a site-specific DNA-binding factor which could in
turn lead to the recruitment of a histone deacetylase. To
investigate this possible mechanism of transcriptional
repression, we have focused on ZBTB33 (also known as
Kaiso), a member of the zinc finger and BTB (ZBTB)
family of site-specific transcription factors, which has been
shown in vitro to preferentially bind a DNA motif
containing methylated CpG dinucleotides (Figure 1).
Among the ZBTB family are three proteins, ZBTB33,
ZBTB4, and ZBTB38; all of which have been shown
in vitro to use three tandem C2H2 zinc finger domains to
preferentially bind methylated DNA [7]. Specifically, Kaiso
has been shown in vitro to bind a motif containing CGCG
(Figure 1A), but only when both cytosine residues are
methylated [8]. Studies show that all three of Kaiso’s C2H2
zinc finger domains are required for its binding to this
methylated motif [8]. Other in vitro results also suggest
Kaiso is capable of binding to an unrelated unmethylated
motif, TCCTGCNA [9]. However, the relationship between
Kaiso binding motifs and DNA methylation has not yet
been investigated on a genome-wide scale in human cells.
A model has been proposed in which Kaiso binds to
methylated DNA in a promoter region, recruits compo-
nents of the NCoR histone deacetylase complex, and
represses transcription (Figure 1B). This model is based in
part on analysis of the MTA2 promoter. Yoon et al. [10]
showed that Kaiso can bind in vitro to a methylated version
of the sequence GGCGCGCGAGTCTTTGGGGCGCG,
which is found within the CpG island promoter of the
MTA2 gene. Using primers specific for methylated DNA,
they also showed that this sequence is methylated in HeLa
cells and they performed ChIP experiments to show that
Kaiso binds this site before, but not after, cells are treated
with 5aza-dC, a drug which reduces DNA methylation
levels. They show that Kaiso interacts with the NCoR
complex and demonstrate that Kaiso fused to the
GAL4 DNA binding domain can repress transcription
of a luciferase reporter containing a GAL4 binding
site. Finally, they show that treating cells with siRNAs
to Kaiso, 5aza-dC or the HDAC inhibitor TSA can
increase MTA2 RNA. Taken together, they proposed a
model in which Kaiso binds to the methylated MTA2
promoter and represses transcription by reducing
levels of acetylated histones near the start site. We
wished to test this model on a genome-wide scale and
therefore have analyzed Kaiso binding in normal and
cancer human cell lines, performed motif analyses of the
in vivo binding sites, and investigated the relationship
between Kaiso binding, DNA methylation, modified
histones, and gene expression levels. Our results suggest
that, in contrast to in vitro studies, in cells Kaiso binds to
highly active, unmethylated promoters.
Figure 1 Structure and function of Kaiso. (A) Kaiso is a 672 amino acid protein, which includes an N-terminal POZ/BTB domain required for
protein-protein interactions, a nuclear localization signal, and three tandem C2H2 zinc finger domains responsible for DNA binding; numbers
represent the amino acid borders of each domain. Shown below the zinc finger domains are the two motifs to which Kaiso has been shown to
bind in vitro. (B) The current model for Kaiso’s activity at promoters: in the left figure, Kaiso’s 3 C2H2 zinc finger domains recognize and bind
methylated DNA, recruiting the NCoR histone deacetylation complex to the region and causing the loss of active chromatin marks and the
repression of the adjacent promoter. The figure on the right shows an unmethylated promoter, which is not recognized by Kaiso. As a result, the
NCoR histone deacetylation complex is not recruited to the region and surrounding histones remain acetylated allowing for expression of
the promoter.
Results
Identification and characterization of Kaiso-binding sites
We began by downloading the sequenced tags for all
Kaiso ChIP-seq datasets from the UCSC browser
(from the ENCODE consortium); two replicate Bowtie-mapped
ChIP-seq datasets were downloaded for GM12878,
K562, A549, HepG2, HCT116, and SK-N-SH cells. Peaks
were called for each replicate in each cell type using
Sole-Search [11,12]; see Table 1. To determine the reproducibility
of the data, an overlap analysis was performed by
truncating both replicates to the same peak number and
comparing the top 40% of replicate 1 to the entire set of
peaks of replicate 2 (and vice versa, which is indicated as
the reciprocal overlap in Table 1). Of the six cell lines for
which Kaiso ChIP-seq data is available, we found that the
data from GM12878, K562, and A549 cells passed
ENCODE standards for biological replicates (number
of peaks in each set must be within a factor of two
and after truncation of peak sets to the same number,
80% of the top 40% of the peaks from one replicate must
overlap with the entire set of peaks of the other replicate);
see Landt et al. [13]. Although the HepG2 datasets are
close to ENCODE standards (numbers of peaks are within
a factor of two and there is a 71% overlap using the
40% rule), the Kaiso ChIP-seq datasets from HCT116
and SK-N-SH cells clearly do not pass the quality
standards [13], due in large part to the great difference in
the number of peaks identified in the two replicates of
each dataset. In our studies we have focused on the Kaiso
datasets from GM12878 and K562 cells.
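The replicate comparison described above can be sketched in a few lines of Python. This is only an illustration of the 40% overlap rule as stated in the text, not the code used in the study: the peak tuple layout, the function names, and the choice to require the 80% threshold in both directions are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical peak representation: (chromosome, start, end, tag height)
Peak = Tuple[str, int, int, float]


def overlaps(a: Peak, b: Peak) -> bool:
    """True if two peaks share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def top40_overlap(rep_a: List[Peak], rep_b: List[Peak]) -> float:
    """Fraction of the top 40% of rep_a (ranked by tag height) that overlaps
    any peak of rep_b, after truncating both sets to the same peak number."""
    n = min(len(rep_a), len(rep_b))
    if n == 0:
        return 0.0
    a = sorted(rep_a, key=lambda p: p[3], reverse=True)[:n]
    b = sorted(rep_b, key=lambda p: p[3], reverse=True)[:n]
    top = a[: max(1, int(0.4 * n))]
    hits = sum(1 for p in top if any(overlaps(p, q) for q in b))
    return hits / len(top)


def passes_replicate_standard(rep_a: List[Peak], rep_b: List[Peak],
                              min_overlap: float = 0.8) -> bool:
    """Peak counts within a factor of two, and at least min_overlap of the
    top 40% overlapping. Requiring both directions is an assumption."""
    if not rep_a or not rep_b:
        return False
    counts_ok = max(len(rep_a), len(rep_b)) <= 2 * min(len(rep_a), len(rep_b))
    return (counts_ok
            and top40_overlap(rep_a, rep_b) >= min_overlap
            and top40_overlap(rep_b, rep_a) >= min_overlap)
```

The pairwise scan is quadratic in the number of peaks, which is acceptable for peak sets of a few thousand entries; an interval index would be preferable for larger sets.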
After determining the quality of the Kaiso ChIP-seq
datasets, the next step was to select a defined peak set for
further analysis. Because of the nature of the ChIP-seq
assay, there is not a discrete number of peaks for any
transcription factor. Rather, as more reads are acquired,
more and more peaks are identified. The peaks that are
only identified when large numbers of reads are used tend
to be much smaller than the median height of the peaks
identified when less than 10 million reads are analyzed.
The issue of small peaks is especially problematic when
two ChIP-seq replicate datasets (each of which having
more than 10 million reads) are merged. As shown in
Figure 2 for Kaiso ChIP-seq data from GM12878 cells,
there is a clear inflection point (separating a minority of
large peaks having a peak height about 40 from 90% of the
peaks having a tag height below 40) when peak height is
plotted versus ranked peak number. Although some of the
small peaks may be of biological importance, we have
found that in many ChIP-seq analyses most of the small
peaks are not reproducible. Inclusion of false-positive
(non-reproducible) small peaks can greatly skew motif
analyses and muddy insights into the relationship of a par-
ticular DNA-binding protein and transcriptional regulation.
Table 1 Sequencing and peak metrics for ENCODE Kaiso ChIP-seq datasets
Metric                            GM12878      K562¹        A549         HepG2        HCT116       SK-N-SH
Unique reads, Rep1                16,619,899   26,444,144   34,298,781   12,992,054   20,455,526   20,984,176
Sole-Search peaks, Rep1           2,396        11,257       14,414       1,560        8,813        12,290
Median peak tag height, Rep1      24           25           25           26           21           23
Unique reads, Rep2                14,307,805   19,111,076   34,105,543   18,274,016   4,305,814    10,093,008
Sole-Search peaks, Rep2           2,784        15,395       11,559       2,342        936          2,789
Median peak tag height, Rep2      21           21           23           25           19           21
40% overlap (reciprocal overlap)  96% (97%)    83% (92%)    85% (81%)    66% (71%)    98% (94%)    65% (81%)
Unique reads, merged Reps         30,621,961   44,860,842   67,432,203   31,180,904   24,667,177   30,833,060
Sole-Search peaks, merged Reps    12,543       18,651       42,862       2,529        12,325       22,172
Median peak tag height, merged    20           34           27           30           23           25
High-confidence peaks             1,648        3,082        7,658        757          902          2,675
Median peak tag height            34           28           30           44           78           37
IDR peaks                         2,144        3,285        7,152        2,879        4,325        N/A²
For each dataset, two biological replicates of Bowtie-aligned .bam files were downloaded from the UCSC genome browser, and peaks were called using Sole-Search.
The replicate with the most Sole-Search-called peaks was truncated so that both replicates contained the same number of peaks. Replicates were then
compared by ENCODE standards using the 40% overlap rule (overlap percentage shown). Also shown are peak metrics for high-confidence Kaiso peaks in each
cell line.
¹ In the K562 dataset, amplified genomic regions were discovered to contain numerous false peaks that were not all removed by the peak-calling program; these
peaks were removed from the list of high-confidence K562 peaks (see Additional file 2: Figure S2). The removed peaks had very high peak tag numbers because
the amplified regions were over-sequenced. As a result of removing these false positive peaks, the median peak tag height of the high-confidence peaks is lower
than the median peak tag height of the merged peak set.
² At the time of paper submission, IDR peaks had not been called for Kaiso in SK-N-SH.
Therefore, we have chosen to use only high-confidence
Kaiso peaks in our analyses. We define high-confidence
peaks as peaks present in both replicates, regardless of peak
height. As shown in Figure 2 (inset), although 12,543 peaks
were identified in GM12878 cells using the merged
replicates, only 1,648 peaks were identified as common to
both replicates (each of which had approximately 2,400 to
2,800 peaks). The peak height versus peak rank of the
1,648 high-confidence Kaiso peaks is plotted in Figure 2A
(dashed line), for comparison with all peaks identified
using the merged datasets (solid line). Although some
small peaks with a tag height below 40 were retained
because they were reproducibly identified in both replicates,
most of the small peaks were removed using this analysis.
We next identified a set of ‘high-confidence’ Kaiso-binding
sites for each cell type using this method (Table 1); these
peak sets were used for further analyses in our studies. We
note that these Kaiso datasets have also been analyzed
by the ENCODE consortium using the IDR program, which
provides an estimate of the number of reproducible peaks
between two replicates [14]. Although for the three
high quality datasets (GM12878, K562, and A549),
the IDR-called peaks are similar in number to the
high-confidence peaks we have identified, we note
that there will be some difference in the exact peaks
which are retained using the different methods. Because
IDR uses the merged dataset to call a final peak set, large
peaks that are present in only one dataset may be included
whereas certain small, but reproducible, peaks may be
excluded. For the three datasets that do not pass quality
standards (HepG2, HCT116, and SK-N-SH), the IDR-called
peaks may contain many false positives; if these datasets are
used by the community, we recommend that only the
smaller set of high-confidence Sole-Search-called peaks be
used. Genomic coordinates of all high-confidence peak sets
for all Kaiso datasets can be found in Additional file 1.
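As a rough illustration of the high-confidence definition used here (peaks present in both replicates, regardless of peak height), the following Python sketch keeps every replicate 1 peak that overlaps at least one replicate 2 peak. The tuple layout and the function name are hypothetical, and reporting the replicate 1 coordinates for a shared peak is an assumption; the text does not specify which coordinates were retained.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical interval representation: (chromosome, start, end)
Interval = Tuple[str, int, int]


def high_confidence_peaks(rep1: List[Interval],
                          rep2: List[Interval]) -> List[Interval]:
    """Return the replicate 1 peaks that overlap any replicate 2 peak.
    Peak height plays no role, mirroring the definition in the text."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in rep2:
        by_chrom[chrom].append((start, end))

    shared = []
    for chrom, start, end in rep1:
        # Half-open interval overlap test against same-chromosome peaks
        if any(start < e and s < end for s, e in by_chrom.get(chrom, ())):
            shared.append((chrom, start, end))
    return shared
```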
Testing a model of Kaiso-mediated transcriptional
repression
As described above, Kaiso has been shown to bind to
methylated DNA and has been suggested to function
as a transcriptional repressor via recruitment of
HDACs via interaction with the NCoR and SMRT
complexes [8,10,15-18]. If true, this could provide a
mechanistic explanation for a link between promoter
methylation and transcriptional repression. However,
these studies were performed in vitro and/or focused on
small sets of genes. To test this model on a genome-wide
scale, we have examined the location of Kaiso binding
sites asking the following questions: 1) Are Kaiso
binding sites located in promoter regions? 2) What is
the epigenetic status of the regulatory regions bound
by Kaiso? and 3) What is the DNA methylation status
of Kaiso binding sites?
To investigate the mechanism(s) by which Kaiso
functions on a genome-wide scale, we first determined the
location of the high-confidence Kaiso ChIP-seq peaks in
GM12878 cells relative to transcription start sites
using the Sole-Search location analysis tool. We found
that approximately 70% of high-confidence Kaiso
peaks bind within 1 kb of a transcription start site of
a Refseq gene (Figure 3A). A similar percentage of Kaiso
peaks was also found to bind within CpG islands
(Figure 3B), indicating that Kaiso prefers to bind at
GC-rich promoters. To determine if Kaiso bound to
promoters that were also bound by RNA polymerase II
(Pol2), we first identified a set of top-ranked Pol2 peaks
(Additional file 2A). To do so, we downloaded the Pol2
ChIP-seq tags from GM12878 from the ENCODE data on
the UCSC browser and called peaks using Sole-Search,
identifying approximately 50,000 peaks. Pol2 is known to
be involved in looping [19,20] and it is likely that
many of the very small peaks are due to protein-protein
interactions.
Figure 2 Identification of high-confidence Kaiso peaks in GM12878 cells. The set of 12,543 Kaiso peaks identified in GM12878 cells using the
merged replicate datasets (solid line) and the set of 1,648 high-confidence Kaiso peaks (dashed line) is plotted as tag height versus ranked peak
number. The inset shows a Venn diagram of the overlap between the two Kaiso ChIP-seq replicates in GM12878 cells. The total number of peaks
in each replicate is represented in parentheses; 1,648 high-confidence peaks were found to be present in both replicates.
To enrich for those peaks that represent direct
binding of Pol2 to transcription start sites, we chose a set
of top-ranked peaks for downstream analysis. The peaks
chosen represent the top 20% of the Pol2 peaks and the
number of tags of the lowest ranked peak chosen is 2.6-fold
over the median value of the peaks not chosen. Pol2 peaks
chosen for analysis were compared to those discarded for
their proximity to known transcription start sites. The top
Pol2 peaks are highly enriched at transcription start sites
whereas the lower background peaks are primarily
promoter distal (Additional file 2B). We then overlapped
Kaiso peaks with the selected top-ranked Pol2 peaks; the
majority of Kaiso binding sites overlap the top 20% of Pol2
peaks (Figure 3C), indicating that the promoters bound by
Kaiso are either active or poised for transcription because
they have robust Pol2 signals.
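The two operations used in this paragraph, classifying peaks as promoter proximal or distal and selecting a top-ranked fraction of Pol2 peaks, can be sketched as follows. This is a minimal illustration rather than the Sole-Search location analysis tool; the data layout (peak centers, sorted TSS coordinates per chromosome) and all function names are assumptions made for the example.

```python
import bisect
from statistics import median
from typing import Dict, List, Tuple

# Hypothetical peak representation: (chromosome, center, tag height)
Peak = Tuple[str, int, float]


def is_proximal(chrom: str, center: int,
                tss_by_chrom: Dict[str, List[int]],
                window: int = 1000) -> bool:
    """True if the peak center lies within `window` bp of any TSS.
    Each TSS list must be sorted in ascending order."""
    sites = tss_by_chrom.get(chrom, [])
    i = bisect.bisect_left(sites, center)
    # Only the nearest TSS on either side can be the closest one
    nearest = [sites[j] for j in (i - 1, i) if 0 <= j < len(sites)]
    return any(abs(center - t) <= window for t in nearest)


def select_top_ranked(peaks: List[Peak], fraction: float = 0.2):
    """Split peaks into the top `fraction` by tag height and the rest."""
    ranked = sorted(peaks, key=lambda p: p[2], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[k:]


def fold_over_background(chosen: List[Peak], rest: List[Peak]) -> float:
    """Tag height of the lowest chosen peak divided by the median tag
    height of the peaks that were not chosen."""
    return min(p[2] for p in chosen) / median(p[2] for p in rest)
```

With real data, `fold_over_background` corresponds to the 2.6-fold figure quoted above for the lowest-ranked Pol2 peak that was retained.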
To characterize the epigenetic profiles of the Kaiso
binding sites, tag density plots were created to identify
enrichments of specific histone modifications proximal
to Kaiso binding sites. Tag density plots provide a more
accurate view of the relationship of different ChIP-seq
datasets than do simple peak overlap analyses. This is
because all tags are used in these analyses, eliminating
any user-selected cut-offs. ChIP-seq datasets for the
various histone modifications, Pol2, CTCF, and Sin3A in
GM12878 cells were downloaded from the UCSC browser
and compared to the Kaiso peak genomic coordinates.
Consistent with the overlap analysis of Kaiso peaks and
Pol2 peaks shown above, Pol2 ChIP-seq tags were highly
enriched at Kaiso peaks (Figure 4A). Analysis of the
epigenetic modifications of the nucleosomes surrounding
the Kaiso peaks revealed an enrichment of active histone
modifications at Kaiso binding sites. Specifically, there is a
large enrichment of acetylated histone H3, lysine 9: the
mark most predictive of actively-transcribed genes [6].
Another histone mark associated with open chromatin,
acetylated histone H3 lysine 27, was also highly enriched
at Kaiso peaks. These analyses, which suggest that most Kaiso
binding sites are at active promoters, were performed
using the entire set of Kaiso binding sites. However, as
shown above, 20% of the Kaiso binding sites did not
overlap with robust Pol2 peaks. It is possible that the
Kaiso bound to this subset of sites may be involved in a
different mode of transcriptional regulation. Therefore, we
separated the Kaiso peaks into those that overlapped with
Pol2 versus those that did not overlap Pol2. The Kaiso
peaks that overlapped with Pol2 had essentially the same
profile as the entire set of Kaiso peaks (Figure 4B).
However, the 378 Kaiso peaks that did not overlap with
the top-ranked Pol2 peaks had a very different epigenetic
profile; these peaks had very low enrichments for the
active histone modifications (Figure 4C). It should be
noted that the non-Pol2-bound Kaiso sites did not have a
robust enrichment for Sin3A, a component of the NCoR
HDAC complex. The highest enriched modification for this
set of peaks is H3K9me3, perhaps suggesting a link between
Kaiso binding to methylated DNA and transcriptional
repression via H3K9me3 at these sites; this subset of
Kaiso-binding sites is further analyzed as a separate subclass
below in the DNA methylation studies.
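A tag density plot of the kind described above can be approximated by counting the tags of one dataset in fixed-width bins around each peak center of another and averaging over peaks. The sketch below assumes tag positions are stored as sorted coordinate lists per chromosome; the flank size, bin size, and function name are illustrative choices, not parameters reported in the text.

```python
import bisect
from typing import Dict, List, Tuple


def tag_density(peak_centers: List[Tuple[str, int]],
                tags_by_chrom: Dict[str, List[int]],
                flank: int = 2000,
                bin_size: int = 100) -> List[float]:
    """Average number of tags per peak in each bin of the interval
    [center - flank, center + flank). Tag lists must be sorted."""
    n_bins = (2 * flank) // bin_size
    counts = [0] * n_bins
    for chrom, center in peak_centers:
        tags = tags_by_chrom.get(chrom, [])
        left = center - flank
        lo = bisect.bisect_left(tags, left)
        hi = bisect.bisect_left(tags, center + flank)
        for pos in tags[lo:hi]:
            counts[(pos - left) // bin_size] += 1
    n_peaks = len(peak_centers) or 1  # avoid division by zero
    return [c / n_peaks for c in counts]
```

Because every tag in the window contributes, no peak-calling threshold is applied to the second dataset, which is the property the text highlights.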
Figure 3 Kaiso binds to GC-rich promoters that are also bound by Pol2. (A) Shown is the percentage of Kaiso peaks in GM12878 cells
located within 1 kilobase of a Refseq transcription start site (proximal, blue), and those located further than 1 kilobase away from a transcription
start site (distal, red). (B) Shown is the percentage of Kaiso peaks in GM12878 cells that are found within CpG islands. (C) Shown is the
percentage of Kaiso peaks in GM12878 cells that overlap the top-ranked set of Pol2 peaks (see Additional file 2).
As a third method of characterization of the Kaiso
binding sites, we determined if Kaiso binds to promoters
having high levels of DNA methylation. Using whole
genome bisulfite sequencing (WGBS) and reduced
representation bisulfite sequencing (RRBS) data from
GM12878 cells, we determined the DNA methylation
profile of the sequences surrounding the Kaiso binding
sites. As shown in Figure 5A, the set of all Kaiso peaks
and the set of Kaiso peaks that overlap Pol2 peaks both
have very low DNA methylation profiles. Thus, the
majority of the genomic regions bound by Kaiso in
GM12878 cells are GC-rich promoters that have low
levels of DNA methylation and high levels of acetylated
histones. Although Kaiso peaks not overlapping Pol2 were
found to have slightly higher methylation, this level of
methylation (approximately 30%) is similar to that found
at the center of CTCF peaks, a factor shown to bind to
unmethylated DNA [21].
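A methylation profile around a set of binding sites can be sketched by averaging per-CpG methylation fractions in windows placed relative to each peak center. The example below is an assumption-laden illustration: it takes methylation calls as a dictionary keyed by (chromosome, position), uses non-overlapping windows whose width is a free parameter, and reports `None` for windows with no covered CpG. None of these choices is taken from the study's own pipeline.

```python
from typing import Dict, List, Optional, Tuple


def methylation_profile(centers: List[Tuple[str, int]],
                        cpg_meth: Dict[Tuple[str, int], float],
                        flank: int = 500,
                        window: int = 10
                        ) -> List[Tuple[int, Optional[float]]]:
    """Mean methylation fraction (0 to 1) per window, pooled over all
    centers. Each entry is (offset of window start, mean or None)."""
    profile = []
    for offset in range(-flank, flank, window):
        values = [
            cpg_meth[(chrom, center + offset + i)]
            for chrom, center in centers
            for i in range(window)
            if (chrom, center + offset + i) in cpg_meth
        ]
        mean = sum(values) / len(values) if values else None
        profile.append((offset, mean))
    return profile
```

Passing motif positions instead of peak centers as `centers` gives the motif-centered variant of the same analysis.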
If Kaiso binds to a short region of highly methylated
DNA that is not exactly in the middle of a peak having
an overall low level of DNA methylation, then it is
possible that a relationship between Kaiso binding and
DNA methylation is obscured when the analyses are
centered on the middle of the genomic regions identified
by the peak-calling program. To alleviate this concern,
the DNA methylation analyses can be centered on the
Kaiso binding motif. Kaiso has been shown to bind to a
methylated motif TCTCGCGAGA in vitro and in vivo
[18], and an unmethylated motif TCCTGCNA in vitro.
To identify an enriched motif within Kaiso binding sites,
a motif analysis was performed using Hypergeometric
Optimization of Motif EnRichment (HOMER) software
[22] to identify strings of nucleotide sequences that are
enriched within Kaiso peaks. HOMER produces separate
output files for de novo identified motifs and known
motifs identified in the peak set. HOMER’s de novo
motif algorithm identified the previously published
‘methylated’ Kaiso motif TCTCGCGAGA as the most
highly represented motif in the 1,648 high-confidence
Kaiso peaks, in the 1,270 Pol2-overlapping peaks, and in
the 378 non-Pol2-overlapping peaks. Between 36 and 43% of the
Kaiso binding sites contained a match to this motif.
However, the previously identified ‘unmethylated’
in vitro motif TCCTGCNA was not enriched in the
Kaiso peak sets (Table 2). Absence of Kaiso binding to
the TCCTGCNA motif has also been reported by Ruzov
et al. [23], who performed ChIP-seq of a tagged Kaiso
protein; however, the significance of these results was not
clear because mouse and Xenopus Kaiso were studied in
human HEK293 cells. If Kaiso is in fact binding methylated
DNA, it is likely doing so via the Kaiso methylated motif
(because the unmethylated motif does not contain a CpG).
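The percentage of peaks containing a given motif can be sketched in a few lines. This is a minimal illustration and not the HOMER procedure used in the study: it assumes peak sequences are already available as plain strings, treats N as any base, and scans both strands. The function names and the toy sequences are hypothetical.

```python
import re

# Minimal IUPAC table: only the codes needed for the motifs discussed here.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N stays N)."""
    return seq.translate(COMPLEMENT)[::-1]


def motif_regex(motif):
    """Compile a regex that matches the motif on either strand."""
    forward = "".join(IUPAC[base] for base in motif)
    reverse = "".join(IUPAC[base] for base in reverse_complement(motif))
    return re.compile(f"{forward}|{reverse}")


def fraction_with_motif(peak_seqs, motif):
    """Fraction of sequences containing at least one motif match."""
    if not peak_seqs:
        return 0.0
    pattern = motif_regex(motif)
    hits = sum(1 for seq in peak_seqs if pattern.search(seq.upper()))
    return hits / len(peak_seqs)


# Toy sequences (made up for illustration, not real Kaiso peaks).
toy_peaks = [
    "GGGTCTCGCGAGAACC",  # contains TCTCGCGAGA
    "AATTCCTGCAAGGTT",   # contains TCCTGCNA with N = A
    "ACGTACGTACGT",      # contains neither motif
    "TTTGCAGGATT",       # contains the reverse complement of TCCTGCNA
]

print(fraction_with_motif(toy_peaks, "TCTCGCGAGA"))
print(fraction_with_motif(toy_peaks, "TCCTGCNA"))
```

Note that TCTCGCGAGA is its own reverse complement, so scanning both strands only matters for the TCCTGCNA motif.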
Therefore, we selected the subset of Kaiso peaks
containing the TCTCGCGAGA motif for further analysis
(Additional file 3). Because there is a wide distribution of
the location of this motif relative to the center of Kaiso
peaks, this could perhaps mute the observation of a
peak of DNA methylation within the actual motif. To
correct for this possibility, we centered Kaiso peaks
on the TCTCGCGAGA motif and repeated the
methylation analysis (Figure 5B). This analysis again
did not show a high enrichment of DNA methylation
Figure 4 Epigenomic profiles of Kaiso binding sites in GM12878 cells. The density of ChIP-seq tags for several histone modifications and
transcription factors is plotted relative to subsets of Kaiso binding sites in GM12878; (A) all high-confidence Kaiso peaks, (B) high-confidence Kaiso
peaks that overlap with top-ranked Pol2 peaks, (C) high-confidence Kaiso peaks that do not overlap with top-ranked Pol2 peaks.
even when focusing only on the peaks that contain
the TCTCGCGAGA motif. However, these analyses use a
10 bp sliding window and therefore it is possible that the
majority of the CpG dinucleotides in the Kaiso peaks
are unmethylated (lowering the overall average DNA
methylation value of any 10 bp window) but that the
central CGCG of the recognition motif is methylated.
To determine if this is true, we identified all Kaiso
peaks having a central CGCG (Additional file 3); see
Table 2 for the percentage of peaks containing a core
CGCG. We then determined the methylation status of
each occurrence of a CpG dinucleotide in all Kaiso
peaks containing a central CGCG (Additional file 4).
Figure 5C shows the fraction of CGCG motifs in the Kaiso
peak sets that contain high (> 60%), medium (20% to 60%),
or low (< 20%) levels of methylation. Of the 4,781 CGCG
motifs containing greater than 3× sequencing coverage
identified within high-confidence Kaiso peaks, 26 motifs
contained medium levels of DNA methylation with the
highest level being 43%. Most of the motifs containing
medium levels of methylation are found in the subset of
Kaiso peaks not overlapping Pol2.
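The binning of CGCG motifs into high, medium, and low methylation classes can be sketched as follows. This is an illustration under stated assumptions: each motif is represented by hypothetical (methylated reads, total reads) counts, the coverage cut-off is applied as "at least 3 reads", and the thresholds are the ones given in the text (high above 60%, medium 20% to 60%, low below 20%).

```python
from collections import Counter

MIN_COVERAGE = 3  # assumption: "3x coverage" interpreted as >= 3 reads


def classify_methylation(methylated_reads, total_reads, min_coverage=MIN_COVERAGE):
    """Return 'high', 'medium', or 'low', or None if coverage is too low."""
    if total_reads < min_coverage:
        return None
    percent = 100.0 * methylated_reads / total_reads
    if percent > 60:
        return "high"
    if percent >= 20:
        return "medium"
    return "low"


def summarize(sites):
    """Count methylation classes over (methylated, total) read pairs."""
    counts = Counter()
    for methylated_reads, total_reads in sites:
        label = classify_methylation(methylated_reads, total_reads)
        if label is not None:  # skip sites below the coverage cut-off
            counts[label] += 1
    return counts


# Toy read counts (made up for illustration).
toy_sites = [(0, 10), (4, 10), (7, 10), (1, 2), (1, 20)]
print(summarize(toy_sites))
```

Sites that fail the coverage filter are dropped before counting, which is why the number of analyzable motifs depends strongly on the depth of the methylation dataset.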
Characterization of Kaiso binding sites in cancer cells
The analyses of Kaiso binding sites described above were
performed using GM12878 cells, an EBV-immortalized
normal lymphoblastoid cell line. Because
Kaiso binds to promoter regions, it is possible that the
lack of correlation between Kaiso binding and DNA
methylation is simply due to the fact that most CpG
island promoters are hypomethylated in normal cells. In
contrast, in cancer cells although the majority of the
genome is hypomethylated, many promoters become
hypermethylated. Thus, it is possible that there are more
Table 2 Motif analysis of Kaiso binding sites
Motif | KaisoHC GM12878 (1,648 sites) | KaisoHC+Pol2HC GM12878 (1,270 sites) | KaisoHC-No Pol2HC GM12878 (378 sites) | KaisoHC K562 (3,082 sites)
TCTCGCGAGA | 43% | 42% | 36% | 29%
CGCG | 76% | 85% | 35% | 47%
TCCTGCNA | 5% | 6% | 4% | 5%
The percentage of peaks containing each motif displayed on the left side of the chart was calculated for several subsets of Kaiso peaks in GM12878 and K562
cells. The total number of peaks in each subset is shown in parentheses.
Figure 5 DNA methylation analysis of Kaiso peaks. (A) The average percent DNA methylation in the sequences surrounding the Kaiso
binding sites (centered on the middle of the peak) was determined and plotted across all high-confidence Kaiso peaks (solid line),
high-confidence Kaiso peaks overlapping Pol2 (short dashed line), and high-confidence Kaiso peaks not overlapping Pol2 (long dashed line). (B) A
similar analysis was performed as in panel (A) except that only the subsets of high-confidence Kaiso peaks containing the 10 bp Kaiso motif were used
and the regions were centered on the Kaiso motif. (C) Pie charts depicting the methylation percentage of all CGCG motifs in various sets of Kaiso peaks
methylated TCTCGCGAGA motifs in cancer cells,
providing a platform for Kaiso to bind to methylated
DNA. To determine if Kaiso binds to methylated DNA in
a cancer cell, we analyzed Kaiso binding sites in the
myeloid leukemia cell line, K562. Kaiso was determined to
be similarly expressed in GM12878 and K562 by Western
blot and by RNA-seq data (see Methods). We first
identified a set of Kaiso binding sites that were
present in both replicates of the K562 Kaiso ChIP-seq
datasets. The number of these sites was much higher
than the number of GM12878 Kaiso binding sites and
therefore we performed further inspection of the
K562 peaks. We noted that a much lower percentage of
the K562 Kaiso peaks contained the Kaiso motif than did
the GM12878 peaks (12% versus 43%), suggesting that the
K562 peak set may contain a large number of false positive
peaks. K562 cells are cancer cells and have a number
of highly-amplified genomic regions. Although our
peak-calling program has features that remove many
of the false positive peaks due to genomic amplifications
[11,12], we observed that a large number of
peaks from K562 cells were in the amplified regions.
We removed these amplified regions from our analyses and
the peak set was reduced to 3,082 peaks,
approximately 30% of which contained a TCTCGCGAGA
motif (Additional file 5). This adjusted peak set was used
for all downstream analyses as the high-confidence Kaiso
peaks in K562. Additional file 6 contains the genomic
coordinates of the amplified regions in K562 and the Kaiso
peaks before and after removal of these amplified regions.
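Removing peaks that fall in amplified genomic regions amounts to an interval filter. The sketch below is a minimal illustration, assuming peaks and amplified regions are given as hypothetical (chromosome, start, end) tuples with half-open coordinates; it is not the procedure implemented in the authors' peak-calling program.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def remove_amplified(peaks, amplified_regions):
    """Keep only peaks that do not overlap any amplified region."""
    regions_by_chrom = {}
    for chrom, start, end in amplified_regions:
        regions_by_chrom.setdefault(chrom, []).append((start, end))

    kept = []
    for chrom, start, end in peaks:
        regions = regions_by_chrom.get(chrom, [])
        if not any(intervals_overlap(start, end, s, e) for s, e in regions):
            kept.append((chrom, start, end))
    return kept


# Toy coordinates (made up for illustration).
toy_peaks = [("chr22", 100, 200), ("chr22", 5000, 5100), ("chr9", 100, 200)]
toy_amplified = [("chr22", 4000, 6000)]

print(remove_amplified(toy_peaks, toy_amplified))
```

A linear scan per peak is adequate for a few thousand peaks; a sorted index or an interval tree would be the natural choice for much larger sets.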
To determine if Kaiso binds to methylated promoters
in K562 cells, we selected the set of Kaiso peaks that
contained a CGCG motif and determined the methylation
percentage of each of the CpG dinucleotides. In this case,
we used RRBS data for DNA methylation analysis because
WGBS data is not available for K562 cells. The genomic
coverage is lower in the K562 data than in the GM12878
data; the number of CpGs with 3× coverage or greater
in the K562 RRBS dataset is 1 million, compared to
45.6 million in the GM12878 methylation data for which
we could use a combination of WGBS and RRBS.
Therefore, with a 3× coverage cut-off, we could only
analyze 1,069 CGCG motifs in the K562 Kaiso peaks.
Of these, 28 peaks contained medium levels of DNA
methylation and only two peaks contained high levels of
methylation (Figure 5C); a snapshot of the chromosomal
region containing the highest methylated CGCG motif
(63% methylated) is shown in Additional file 7.
Cell type-specific binding of Kaiso
During our analysis of the methylation status of Kaiso
binding sites, we noted that there are many more Kaiso
binding sites in K562 cells than in GM12878 cells. This
suggested that there might be cell type-specific binding
of Kaiso. To test this hypothesis, we performed an overlap
analysis of the K562 and GM12878 Kaiso high-confidence
peak sets and found that 894 peaks were shared in the
two cell types, but that 760 and 2,198 sites were uniquely
bound by Kaiso in GM12878 and K562 cells, respectively
(Figure 6A). A location analysis revealed that the common
sites were highly specific for promoter regions (Figure 6B).
In contrast, only half of the sites specific to GM12878
were at promoter regions whereas the majority of the sites
specific to K562 were located outside of promoter regions.
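The overlap and location analyses can be sketched together. This is an illustration only: peaks are hypothetical (chromosome, start, end) tuples, and a peak is called promoter proximal when its center lies within 1 kb of a transcription start site. Using the peak center is an assumption; the text states only that sites within +/- 1 kb of a start site were classified as proximal.

```python
from bisect import bisect_left


def overlaps_any(peak, others):
    """True if the peak overlaps at least one interval in `others`."""
    chrom, start, end = peak
    return any(c == chrom and start < e and s < end for c, s, e in others)


def split_common_unique(peaks_a, peaks_b):
    """Return (peaks of A shared with B, peaks unique to A, peaks unique to B)."""
    common = [p for p in peaks_a if overlaps_any(p, peaks_b)]
    unique_a = [p for p in peaks_a if not overlaps_any(p, peaks_b)]
    unique_b = [p for p in peaks_b if not overlaps_any(p, peaks_a)]
    return common, unique_a, unique_b


def is_promoter_proximal(peak, tss_by_chrom, window=1000):
    """True if the peak center is within `window` bp of any TSS.

    `tss_by_chrom` maps chromosome -> sorted list of TSS positions.
    """
    chrom, start, end = peak
    center = (start + end) // 2
    tss = tss_by_chrom.get(chrom, [])
    i = bisect_left(tss, center)
    # Only the TSSs immediately left and right of the center can be nearest.
    neighbours = [tss[j] for j in (i - 1, i) if 0 <= j < len(tss)]
    return any(abs(center - t) <= window for t in neighbours)


# Toy data (made up for illustration).
gm_peaks = [("chr1", 100, 200), ("chr1", 5000, 5100)]
k562_peaks = [("chr1", 150, 250), ("chr2", 100, 200)]
toy_tss = {"chr1": [1000, 20000]}

common, gm_only, k562_only = split_common_unique(gm_peaks, k562_peaks)
print(common, gm_only, k562_only)
print([is_promoter_proximal(p, toy_tss) for p in gm_peaks])
```

In this sketch the common set is counted from the first peak set's point of view, so the count can differ slightly depending on which set is used as the reference when one peak overlaps several peaks in the other set.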
It is possible that Kaiso may have a different role in
transcriptional regulation when bound near versus far
from transcription start sites. Therefore, we determined
the epigenetic profiles of the promoters bound by Kaiso in
both GM12878 and K562 (mostly promoter proximal)
and of the sets of Kaiso binding sites unique to K562 cells
that were promoter proximal versus promoter distal. The
promoter proximal peaks common between K562 and
GM12878 are enriched for Pol2, the active marks H3K9Ac
and H3K27ac, and the transcription elongation mark
H3K79me2 (Figure 7A); this evidence suggests that these
regions are actively transcribed promoters. Very similar
epigenetic profiles are seen for the K562-specific Kaiso-
bound promoters (Figure 7B), suggesting that all promoters
bound by Kaiso in K562 cells (common or cell-type
specific) are actively transcribing. Interestingly, the 1,890
promoter distal peaks unique to K562 are also enriched for
Pol2 and acetylated histones (Figure 7C). The overall level
of enrichment for these active marks is lower at the distal
sites than at known promoters. This could be due either to
the presence of a small number of bona fide promoters of
genes or non-coding RNAs that are not in the Refseq
dataset or perhaps an indication of chromosomal looping
between active promoters and a distal site bound by
Kaiso. To address this issue, we overlapped the set of
K562-unique promoter distal Kaiso peaks with the top
20% of the Pol2 peaks from K562 cells. We found that 24%
of these Kaiso peaks overlap with the top-ranked Pol2
peaks, suggesting that perhaps a quarter of the 1,890 Kaiso
K562 binding sites classified as distal are at promoters that
are not annotated in the Refseq dataset. Most importantly, no
subset of Kaiso peaks in K562 was determined to have an
enrichment for components of the NCoR or SMRT histone
deacetylation complexes, contrary to what would be
expected based on previously published findings [10,18].
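The comparison of distal Kaiso peaks with the top 20% of Pol2 peaks can be sketched as a rank-then-overlap step. The sketch assumes Pol2 peaks carry a score (for example tag height) in a hypothetical (chromosome, start, end, score) tuple; how the top-ranked Pol2 peaks were actually ranked is not specified in this section.

```python
def top_fraction(scored_peaks, fraction=0.2):
    """Return the highest-scoring `fraction` of (chrom, start, end, score) peaks."""
    if not scored_peaks:
        return []
    ranked = sorted(scored_peaks, key=lambda p: p[3], reverse=True)
    n_keep = max(1, round(len(ranked) * fraction))
    return ranked[:n_keep]


def fraction_overlapping(query_peaks, reference_peaks):
    """Fraction of (chrom, start, end) query peaks overlapping any reference peak."""
    if not query_peaks:
        return 0.0
    hits = 0
    for chrom, start, end in query_peaks:
        if any(c == chrom and start < e and s < end
               for c, s, e, _score in reference_peaks):
            hits += 1
    return hits / len(query_peaks)


# Toy data (made up): ten Pol2 peaks with scores 1..10.
toy_pol2 = [("chr1", i * 1000, i * 1000 + 200, float(i)) for i in range(1, 11)]
toy_kaiso_distal = [
    ("chr1", 10100, 10300),  # overlaps the best-scoring Pol2 peak
    ("chr1", 1050, 1150),    # overlaps only a low-scoring Pol2 peak
    ("chr1", 50000, 50100),
    ("chr2", 9000, 9200),
]

top_pol2 = top_fraction(toy_pol2, 0.2)
print(fraction_overlapping(toy_kaiso_distal, top_pol2))
```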
As shown above, 46% of the Kaiso peaks in GM12878
cells and 71% of the Kaiso peaks in K562 cells are cell
type-specific. Although these sites are reproducible
(having been identified in both replicates for each cell
type) the median tag height for the cell type-specific
binding sites is lower than for the sites occupied by
Kaiso in both GM12878 and K562 cells (tag height is
shown in parentheses in Figure 8). This suggests that
Kaiso binding at the cell type-specific sites may occur
by a different mechanism than at sites common to both cell
types. There are several mechanisms that can mediate cell
type-specific binding. One mechanism is that Kaiso could
be recruited to the genome by an interaction with cell
type-specific DNA-binding factors at sites that contain no
or weak matches to the Kaiso motif. To examine the
possibility that Kaiso is being recruited to sites in
K562 or GM12878 cells via a cell type-specific binding
partner, we first determined if the cell type-specific peaks
contained the Kaiso motif. For both GM12878 and K562
cells, we found that approximately 80 to 85% of the cell
type-specific Kaiso peaks lacked a TCTCGCGAGA motif.
To test the possibility that another site-specific factor
aided in the recruitment of Kaiso to the non-motif
Figure 7 Epigenomic profiles of Kaiso binding sites in K562 cells. The density of ChIP-seq tags for several histone modifications and
transcription factors is plotted relative to subsets of Kaiso binding sites in K562 cells; (A) peaks common between K562 and GM12878 cells,
(B) promoter proximal peaks unique to K562, (C) promoter distal peaks unique to K562.
Figure 6 Characterization of cell type-specific Kaiso binding sites. (A) High-confidence Kaiso peaks in GM12878 and K562 were overlapped
to identify common peaks and cell type-specific peaks between the two datasets. (B) The common and cell type-specific Kaiso peaks were
analyzed for their position relative to the start site of transcription of the set of Refseq genes; sites within +/− 1 kb of a start site were classified as
promoter proximal whereas all other sites were classified as promoter distal.
containing cell type-specific sites, we performed a motif
analysis using HOMER. Strikingly, we found that a GATA
motif corresponded to five of the top eight most enriched
motifs in the set of K562-unique peaks that did not have a
Kaiso motif; the other 3 motifs corresponded to reported
motifs for CTCF or CTCFL (Figure 8; Additional file 8A).
We performed a similar analysis for the GM12878-specific
Kaiso peaks except that we separately analyzed promoter
proximal and distal sites not containing a Kaiso motif (this
separation was not necessary for K562 because the majority
of the sites were distal). The proximal regions did not
return significant results, likely due to the large number of
conserved sequences found at all promoter regions.
Separate analysis of the distal subset, which is more
similar to the K562 unique peaks, eliminated problems that
occur with motif analyses of promoter regions. We found
that motifs for ETS and Runt family members are enriched
in the GM12878-specific, distal, non-Kaiso motif set
(Figure 8; Additional file 8B). To determine if Kaiso
co-localizes with GATA or CTCF family members in
K562 cells or with ETS or Runt family members in
GM12878 cells, we downloaded the GATA1, GATA2,
CTCF, and CTCFL ChIP-seq data from K562 cells
and the ETS1, PU.1, and RUNX3 ChIP-seq data from
GM12878 cells from the UCSC Genome Browser and
created tag density plots relative to Kaiso peaks (Figure 9).
In K562, we found that GATA1, GATA2, CTCF, and
CTCFL were all enriched at the center of Kaiso peaks with
CTCF and GATA2 being the most enriched (Figure 9A).
Interestingly, a previous study has shown that Kaiso can
interact with CTCF [24]. When comparing binding of ETS
and Runt family members to Kaiso sites in GM12878, we
again found that members of both families co-localized
with Kaiso, with RUNX3 and PU.1 having the highest
enrichment (Figure 9B).
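A tag density plot of this kind averages ChIP-seq tag counts in bins around each peak center. The following sketch shows the idea for a single chromosome, with hypothetical integer tag positions and peak centers; the flank and bin sizes are illustrative choices, not the values used for Figure 9.

```python
from bisect import bisect_left


def tag_density_profile(peak_centers, tag_positions, flank=500, bin_size=100):
    """Average number of tags per bin in a +/- `flank` window around peak centers.

    All positions are assumed to lie on the same chromosome.
    """
    tags = sorted(tag_positions)
    n_bins = (2 * flank) // bin_size
    totals = [0] * n_bins
    for center in peak_centers:
        for b in range(n_bins):
            lo = center - flank + b * bin_size
            hi = lo + bin_size
            # Number of tags with lo <= position < hi, via binary search.
            totals[b] += bisect_left(tags, hi) - bisect_left(tags, lo)
    if not peak_centers:
        return [0.0] * n_bins
    return [total / len(peak_centers) for total in totals]


# Toy data (made up for illustration).
toy_centers = [1000, 5000]
toy_tags = [990, 1010, 1020, 4995, 5005, 5300]

profile = tag_density_profile(toy_centers, toy_tags, flank=200, bin_size=100)
print(profile)  # [0.0, 1.0, 1.5, 0.0]
```

A factor that co-localizes with the peaks produces a profile that rises toward the central bins, as in this toy example.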
A second mechanism that can contribute to cell type-
specific binding of a ubiquitously expressed transcription
factor is site availability. For example, high levels of DNA
methylation at a promoter result in a nucleosome-dense,
silenced region that is not accessible to transcription
factors. Thus, a promoter that has the potential to be a
Kaiso target (for example, because it contains a Kaiso
motif and is bound by Kaiso in GM12878 cells) may not
be bound by Kaiso in K562 cells if the promoter is highly
methylated and densely packed by nucleosomes. To address
whether differences in DNA methylation are responsible
for cell type-specific binding to promoters in GM12878
versus K562 cells, we determined the methylation status in
GM12878 cells of the K562-specific Kaiso peaks and the
methylation status in K562 cells of the GM12878-specific
peaks (Figure 10). We found that GM12878-unique sites
had higher overall levels of methylation in K562 cells and
K562-unique sites had higher overall levels of methylation
in GM12878, suggesting that at least a portion of the cell
type-specificity of Kaiso binding is due to exclusion from
promoters due to DNA methylation. Next, we looked at
the density of nucleosomes relative to GM12878- and
K562-unique and common binding sites. MNase-seq data
produced by ENCODE was downloaded from the
UCSC genome browser, and was used to determine the
nucleosome density of Kaiso binding sites. Tag density
plots show that sites common between GM12878 and
Figure 8 Motif analysis of cell type-specific Kaiso binding sites. Kaiso peaks from GM12878 and K562 were compared to identify common
and cell type-specific binding sites (the median tag height is shown in parentheses). GM12878- and K562-unique peaks were divided into two
sets of peaks, those that contain and those that lack a Kaiso motif; GM12878-unique peaks were further separated into promoter proximal and
promoter distal sets. A motif analysis was then performed for each set of peaks, identifying motifs for the GATA and CTCF families of transcription
factors in the K562-unique peaks and motifs for the ETS and Runt families of transcription factors in distal GM12878 peaks (see also Additional file 8).
K562 contain low levels of nucleosome density, particularly
at the center of the peak (Figure 11B). Conversely,
K562-unique peaks are nucleosome-free in K562 cells,
but are occupied in GM12878 (Figure 11A). This suggests
that the occupancy of a nucleosome at a potential binding
site in GM12878 restricts the binding of Kaiso to that
region, contributing to the cell-type-specificity of Kaiso
binding. GM12878-unique sites show a slight nucleosome
occupancy increase in K562 cells, but overall, very similar
nucleosome density is observed in both cell lines at these
sites. This suggests that nucleosome occupancy plays a
lesser role in the cell type-specific Kaiso binding in
GM12878 cells (Figure 11C), perhaps due to less
overall DNA methylation at promoters in GM12878
cells. Together, the DNA methylation and nucleosome
occupancy results suggest that chromatin structure
plays a partial role in the cell type-specific binding
of Kaiso between GM12878 and K562.
Figure 9 Testing bioinformatic predictions of transcription factor co-localization. The density of ChIP-seq tags for the factors identified by
motif analysis in Figure 8 were plotted relative to (A) non-motif containing K562 Kaiso peaks, and (B) promoter distal, non-motif containing
GM12878 Kaiso peaks.
Figure 10 Methylation of cell type-specific CGCG motifs in different cell lines. The percent methylation of individual CGCG motifs within
GM12878-unique and K562-unique peaks was calculated using GM12878 WGBS and RRBS data and K562 RRBS data, and graphed as box plots
showing differences in methylation between cell types.
Lastly, we have used RNA-seq data from GM12878
and K562 to examine the relationship between Kaiso
binding and gene expression. We identified the genes
nearest to the Kaiso binding sites that were in common
between the GM12878 and K562 Kaiso ChIP-seq
datasets and the genes that had Kaiso bound to their
promoter regions only in GM12878 or only in K562
cells. We then determined the expression levels of these
three sets of genes in both GM12878 and K562 cells
(Figure 12). We found that of these three gene sets, the
most highly expressed was the set of genes bound by
Kaiso in both GM12878 and K562 cells. Interestingly,
the GM12878-specific Kaiso targets are more highly
expressed in GM12878 cells than in K562 cells and the
K562-specific Kaiso targets are more highly expressed in
K562 cells than in GM12878 cells. Thus, we see a positive
correlation between gene expression and Kaiso binding.
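The expression comparison across the three gene sets can be summarized with a median per set and per cell line. The sketch below is illustrative only: the gene names and RPKM values are invented placeholders, and the median is used as a simple stand-in for the box plots in Figure 12.

```python
from statistics import median


def median_expression(gene_sets, rpkm_by_cell):
    """Median RPKM of each gene set in each cell line.

    gene_sets:     {set name: list of gene names}
    rpkm_by_cell:  {cell line: {gene name: RPKM}}
    """
    summary = {}
    for set_name, genes in gene_sets.items():
        summary[set_name] = {}
        for cell, rpkm in rpkm_by_cell.items():
            values = [rpkm[g] for g in genes if g in rpkm]
            summary[set_name][cell] = median(values) if values else None
    return summary


# Placeholder gene names and RPKM values (made up, not data from the study).
toy_gene_sets = {
    "common": ["geneA", "geneB"],
    "GM12878_only": ["geneC", "geneD"],
    "K562_only": ["geneE", "geneF"],
}
toy_rpkm = {
    "GM12878": {"geneA": 50.0, "geneB": 40.0, "geneC": 12.0,
                "geneD": 8.0, "geneE": 1.0, "geneF": 3.0},
    "K562": {"geneA": 45.0, "geneB": 55.0, "geneC": 2.0,
             "geneD": 4.0, "geneE": 9.0, "geneF": 11.0},
}

summary = median_expression(toy_gene_sets, toy_rpkm)
for set_name, by_cell in summary.items():
    print(set_name, by_cell)
```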
Discussion
The purpose of this study was to test the previously proposed
model that Kaiso functions as a DNA methylation-dependent
transcriptional repressor that causes local
depletion of acetylated histones near its binding sites.
In our studies, we bioinformatically analyzed genome-wide
binding of Kaiso in normal and cancer cells, examined
the epigenetic profiles of Kaiso-bound promoters, and
determined the percentage of DNA methylation of CpGs
bound by Kaiso. Although the analyses presented in this
study focused on only two cell lines, we have also examined
the relationship between Kaiso binding sites, Pol2, and
histone modifications in HCT116, A549, and SK-N-SH
cells (see Table 1 for the number of high-confidence Kaiso
peaks analyzed in each dataset). In each cell type, Kaiso is
associated with active promoters (unpublished data). In
addition, we performed whole genome bisulfite sequencing
of HCT116 and found that Kaiso binding sites are not
methylated; specifically, out of 3,051 CGCG motifs within
the high-confidence HCT116 Kaiso-binding sites, 3,036
(99.5%) had methylation levels less than 20% (unpublished
data). Thus, results from all tested cell lines are similar but
do not support the previously proposed model. Specifically,
we find that Kaiso does not bind to methylated DNA and
that it is associated with highly acetylated, actively
transcribed promoters. Our results are consistent with those
presented in Factorbook [25], an online repository that
hosts analyses of ENCODE transcription factor ChIP-seq
data. However, due to the large scale of that project,
individualized analyses were not performed for each factor.
For example, the Factorbook analysis of motifs found in
Kaiso binding sites included only the top 500 peaks and did
not specifically search for the non-methylated motif to
which Kaiso has been shown to bind in vitro. Another
difference is that Factorbook uses the ENCODE IDR peak
sets which, as we describe above, do not represent the same
peaks as our high confidence peaks that are found in both
ChIP-seq replicates for a given cell type. In addition, our
studies included a thorough analysis of cell type-specific
binding patterns, motifs, and nucleosome occupancy
of Kaiso sites. Finally, Factorbook did not include an
analysis of DNA methylation, which is the central
theme of our study.
Several different groups have used in vitro gel shift
assays to demonstrate that Kaiso binds to the motif
TCTCGCGAGA when both CpGs in the motif are
Figure 11 Nucleosome density of cell type-specific Kaiso peaks in GM12878 and K562. Nucleosome positioning data from MNase-seq was
used and plotted as the average density of nucleosomes relative to Kaiso peaks in A) GM12878, B) K562, and C) common sites.
methylated [7-10,26]. In addition, Bartels et al. performed
an unbiased in vitro screen for methyl DNA-binding
proteins and identified Kaiso using mass spectrometry
analysis of the proteins captured in the screen [27]. Also, a
recent study has solved the structure of Kaiso bound
to methylated DNA [28]. Clearly, Kaiso does bind to
methylated DNA in vitro. However, the vast majority
of in vivo Kaiso binding sites have very low levels of
DNA methylation. Visual inspection of the rare sites
that have high DNA methylation revealed that Kaiso
has the ability to bind methylated DNA in vivo, but
that the Kaiso peaks at those sites are among the
lowest enriched binding sites. In fact, a plot of peak
score versus percent DNA methylation clearly shows
that the strongest Kaiso sites have very low levels of
DNA methylation (Additional file 9). These data do not
support the in vitro studies that indicate that Kaiso prefers
to bind to the TCTCGCGAGA motif when both CpG
dinucleotides are fully methylated. However, we note that
we used Bowtie to align the sequenced tags to the human
genome and used Sole-Search [11,12] to call peaks. It was
possible that this combination of alignment and peak-
calling resulted in a loss of the set of highly methylated
Kaiso binding sites. To ensure that our method of
alignment and peak identification had not biased our
results, we reanalyzed the Kaiso ChIP-seq data using
LONUT, a new alignment tool which takes into account
both unique and non-unique tags (manuscript submitted)
and used BELT [29] to call peaks. Also, to eliminate the
possibility that we are missing a set of Kaiso peaks that are
highly methylated, but are not as robust or not as reproducible
as the peaks called using Sole-Search, we relaxed our
stringency and called peaks on the merged Kaiso ChIP-seq
datasets for each cell type (thus not requiring that each
peak be identified in both replicates). We identified many
thousands of Kaiso binding sites in GM12878 and K562
cells, most of which were very small and found in only one
of the two replicates. We found that most of the Kaiso
binding sites identified using this method again had low
levels of DNA methylation (Additional file 10). In fact, only
12 of the 8,069 Kaiso peaks identified by LONUT in
GM12878 had greater than 50% DNA methylation. Visual
inspection of these 12 sites revealed two peaks on the X
chromosome (red arrows, Additional file 10A) within the
CpG island promoters of the PHF8 and PRDX4 genes
(the other 10 sites had low Kaiso peaks and/or were
not reproducible between the two biological replicates). A
central CGCG motif was found to be 73% methylated in
Figure 12 Expression of cell type-specific Kaiso-bound promoters in GM12878 and K562. RPKM values from RNA-seq experiments in
GM12878 and K562 cells were used to create box plots showing expression of cell type-specific and common Kaiso-bound genes.
PHF8 and 55% methylated in PRDX4. However, upon
further characterization of these promoters, they were
discovered to be expressed, and have high levels of
active histone modifications surrounding the promoter
(Additional file 10B, PHF8 shown). For both promoters, the
methylated CGCG motif identified was discovered to be
just outside the borders of the Sole-Search-called peaks and
therefore was not analyzed in Figure 5. In general,
the combination of LONUT and BELT results in the
assignment of larger genomic regions to peaks than
does the combination of Bowtie and Sole-Search; as a
result the new analysis included some new CGCG
motifs in the final dataset. However, these CGCG motifs
were not highly methylated and the ones that had modest
methylation were on the edges of the peak and unlikely to
contribute to Kaiso binding.
Binding of transcription factors in the context of
chromatin is influenced by many factors. For example,
recent ChIP-seq analyses have revealed that not all
binding sites contain the ‘expected’ motif that was derived
from in vitro binding studies [30,31]. In addition, the
epigenetic profile of a promoter region can influence binding,
with repressive histone marks and nucleosome occupancy
limiting access of transcription factors to their binding sites
[32-34]. Taken in this context, we believe that although
Kaiso may ‘prefer’ to bind to methylated DNA, this is not
an option in the context of chromatin. Under conditions in
which the CpGs in a promoter are highly methylated, the
entire promoter region is often densely wrapped around
nucleosomes and in a repressive heterochromatic state. The
absence of a nucleosome-free region prevents access of
transcription factors to the promoter region and hence
prevents access of Kaiso to its methylated motif. Being
denied access to the preferred methylated motif, Kaiso
instead binds to the same motif in its unmethylated state. It
is interesting to note that this disconnect between in vitro
and in vivo binding to a methylated motif has been seen by
others. Bartels et al. identified RBP-J as a strong binder of
methylated DNA in vitro. However, they cannot find
evidence that this factor binds to methylated DNA in vivo
[27]. Other zinc finger proteins have been implicated in
binding methylated DNA. Quenneville et al. [35] showed
that a portion of ZFP57 consisting of two zinc fingers can
bind to a methylated TGCCGC motif in vitro and they
used bisulfite analysis of ChIP DNA to show that ZFP57
can bind to the methylated allele of three imprinted mouse
genes. However, the DNA methylation status of the other
11,000 ZFP57 ChIP-seq peaks was not analyzed and
therefore the significance of these three sites in the
context of the biological function of ZFP57 is not
clear. Spruijt et al. [36] used quantitative mass spectrometry
to identify proteins that bind to methylated DNA in vitro,
identifying the zinc finger factors KLF2, KLF4, and KLF5
as methyl DNA binding proteins. They report that
approximately 18% of the KLF4-binding sites in mouse
ES cells show high levels of DNA methylation when using
a 100 bp window centered on the peak. However, when
the analysis is restricted to the binding motif, the DNA
methylation levels dropped significantly. Therefore, it is
not clear if KLF4 actually binds to methylated DNA in the
context of chromatin.
While we were able to identify a small number of
methylated promoters to which Kaiso binds, these genes
are enriched for active histone modifications and are
expressed. The CpG methylated sites were found to be
located away from the center of the Kaiso peak and they
do not contain the full TCTCGCGAGA motif. Therefore,
it is unclear as to whether these sites are relevant to Kaiso
binding or to promoter activity. At this point, we cannot
rule out that there are methylated TCTCGCGAGA motifs
bound by Kaiso in certain cells under certain growth
conditions. As noted above, a previous study has reported
that Kaiso binds to a methylated MTA2 promoter region
in HeLa cells [10]. Unfortunately, we do not have Kaiso
ChIP data for HeLa cells and there is no available DNA
methylation data from HeLa cells for the precise region of
the MTA2 promoter that corresponds to the reported
Kaiso binding site. However, in HeLa cells the MTA2
promoter has marks of open chromatin and is bound by
Pol2; furthermore, the MTA2 transcript is expressed in
HeLa cells (Additional file 11). Therefore, if Kaiso does
bind at the MTA2 promoter in HeLa cells, it does not
silence transcription. We have examined the relationship
of Kaiso binding and MTA2 expression in six other cell
types. We find that Kaiso binds to the MTA2 promoter in
K562, A549, and SK-N-SH cells, and in all of these cell
lines the MTA2 promoter has acetylated histones.
Another study [15] suggested that Kaiso binds to and
represses the CDKN2A, HIC1, and MGMT promoters
in a methylation-dependent manner in HCT116 cells.
However, ChIP-seq data indicates that in HCT116
cells Kaiso does not bind within the genomic regions
of these promoters. A recent study [37] claims that
Kaiso binds the cyclin D1 CpG island promoter via a ‘KBS
motif’, TCCTGCNA, 69 bp upstream of the transcription
start site and functions to repress expression of the cyclin
D1 gene in HCT116 and MCF7 cells. They show that the
methylation of three adjacent CpG dinucleotides helps to
stabilize Kaiso’s interaction with sequences containing the
cyclin D1 promoter using in vitro assays, and that Kaiso
binds the region by ChIP. However, ENCODE ChIP-seq
data from HCT116 shows that Kaiso is not enriched at
the cyclin D1 promoter and RRBS data shows that the
region encompassing the KBS motif 69 bp upstream of the
transcription start site is unmethylated in both HCT116
and MCF7. The gene is also bound by Pol2 and acetylated
histones, and is expressed in both cell lines. Thus, it is hard
to find a clear relationship between Kaiso binding and
repression of any gene in any cell line. In fact, by comparison
of the expression levels of the genes bound by Kaiso in
both GM12878 and K562 versus the expression of the
genes whose promoters are bound by Kaiso in a cell
type-specific manner, we find a positive correlation
between Kaiso binding and gene expression (Figure 12).
Although many promoters are bound by Kaiso in
multiple cell types, we did observe cell type-specific
binding. While we do not yet fully understand what
specifies cell type-specific binding of Kaiso, it appears
that in some cases increased DNA methylation and
compact chromatin structure may prevent Kaiso from
binding in a particular cell type. Our studies also suggest
that interaction with other transcription factors can
influence cell type-specific binding of Kaiso. Motif
analysis indicated that most cell type-specific Kaiso
binding sites did not contain the TCTCGCGAGA
motif, suggesting that Kaiso may be recruited to these
sites in a manner distinct from direct binding to its
motif. To test this hypothesis we performed a motif
analysis and found motifs for GATA and CTCF family
members in the K562-specific Kaiso peaksand for ETS and
Runt family members in the GM12878-specific Kaiso
peaks. These results suggested that Kaiso might co-localize
with these specific factors in GM12878 or K562 cells.
ChIP-seq data from the ENCODE project confirmed the
validity of these bioinformatic predictions; we found CTCF,
GATA1, and GATA2 to be highly-enriched at K562-unique
Kaiso binding sites, and RUNX3 and PU.1 to be highly
enriched at GM12878-unique Kaiso binding sites. We note
that a previous study has shown that Kaiso can interact
with CTCF [24]. Taken together with our genome-wide
analyses, we suggest that CTCF may help to recruit
Kaiso to certain locations in the genome. Physical
interaction studies of Kaiso and GATA factors have
not been performed; therefore, further work is necessary to
determine the significance of these co-localizations. We
note that a previous study suggested that Kaiso can interact
with ZNF131 [16]. However, the ZNF131 motif was not
identified in any of our unbiased motif analyses. We
performed a direct search for the ZNF131 motif allowing
up to two mismatches, and found that only 24 of 603
GM12878-specific peaks that lack a Kaiso motif and only
38 of 1855 K562-specific peaks that lack a Kaiso motif
contained a match to the ZNF131 motif. There is no
ChIP-seq data for ZNF131, but based on motif analyses, it
may not play an important role in recruiting Kaiso to the
genome. Our results also indicate that nucleosome
occupancy plays a role in dictating Kaiso’s cell type-
specific binding between GM12878 and K562 (Figure 11).
Specifically, binding sites unique to K562 cells show
reduced nucleosome occupancy relative to the same
genomic loci in GM12878 cells. However, in GM12878
cells, it appears that the positioning of nucleosomes plays
a lesser role in differential Kaiso binding between the two
cell types.
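The mismatch-tolerant motif search described above can be sketched as follows. This is a minimal illustration and not the procedure used in the study: the function names are hypothetical, the ZNF131 motif itself is not given in the text, and the example therefore uses the Kaiso motif TCTCGCGAGA quoted earlier. Both strands are scanned, which is an assumption.

```python
# Illustrative sketch only: names and the scanning strategy are assumptions,
# not the authors' implementation.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def has_motif(seq: str, motif: str, max_mismatches: int = 2) -> bool:
    """True if any window of seq is within max_mismatches of the motif
    on either strand."""
    seq, motif = seq.upper(), motif.upper()
    candidates = (motif, reverse_complement(motif))
    k = len(motif)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if any(hamming(window, m) <= max_mismatches for m in candidates):
            return True
    return False


def count_peaks_with_motif(peak_seqs, motif, max_mismatches=2):
    """Return (number of peak sequences containing the motif, total peaks)."""
    hits = sum(has_motif(s, motif, max_mismatches) for s in peak_seqs)
    return hits, len(peak_seqs)
```

For scale, the counts reported above correspond to about 4% of the GM12878-specific peaks (24 of 603) and about 2% of the K562-specific peaks (38 of 1855).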
Conclusions
In summary, and in contrast to prediction, all of our
analyses suggest that there is a strong positive relationship
between the binding of Kaiso, the absence of DNA
methylation, the presence of active marks on a promoter,
and RNA expression levels. Thus, Kaiso does not appear to
play a major role in creating a repressive promoter
structure. Rather, our studies suggest that instead of
recruiting repressive proteins such as HDACs, Kaiso
may recruit positively-acting transcription factors. Future
studies are needed to more precisely define the mechanism
by which Kaiso contributes to the overall expression level
of its target genes. Importantly, our findings establish a
clear disconnect between in vitro studies and in vivo
studies. Rather than DNA methylation of a small motif, the
chromatin landscape, along with the cooperation of Kaiso
with other factors, dictates Kaiso binding patterns. We
suggest that the fact that Kaiso may prefer a methylated
motif is not relevant to in vivo function because the
methylated Kaiso motifs are not accessible for binding. A
recent study has highlighted the fact that two-thirds of the
members of the large family of KRAB domain-containing
zinc finger proteins contain a specific arginine-histidine
linker sequence between the C2H2 zinc finger domains
that may allow recognition of methylated cytosines [38].
Future studies may successfully identify a transcription
factor that can bind to a methylated motif in the context
of chromatin and contribute to gene silencing.
Methods
ChIP-seq peak calling and analysis
Two replicate Bowtie-mapped ChIP-seq datasets were
downloaded for GM12878, K562, A549, HepG2, HCT116,
and SK-N-SH from the UCSC browser. All Kaiso ChIP-seq
experiments were carried out in the Myers Lab using Santa
Cruz Biotechnology antibody sc-23871. The antibody
has been validated to be specific to Kaiso as required
by ENCODE standards [13], using Western blot analysis
and mass spectrometry. We note that the antibody
validation document shows that the levels of Kaiso in
GM12878 and K562 are similar. The Kaiso (ZBTB33)
antibody validation document from the ENCODE
consortium can be found at http://genome-preview.ucsc.
edu/cgi-bin/hgEncodeVocab?term=%22ZBTB33%22. Peaks
were called for each replicate in each cell type using
Sole-Search [11,12]. The parameters used in the Sole-
Search program for all Kaiso datasets were as follows:
a permutation of 5, fragment length of 150, alpha
value of 0.0010, an FDR of 0.0010, and a peak merge dis-
tance of 0. To determine the reproducibility between the
two replicates for the Kaiso datasets (and to compare
Kaiso peaks with other ChIP-seq datasets), the called peak
files for each replicate were compared using the overlap
analysis tool in Sole-Search [11,12]. A file containing the
genomic locations of all CpG islands was downloaded
from the UCSC genome browser. To determine the
location of Kaiso peaks (and the location of the Pol2
top-ranked peaks versus the Pol2 peaks excluded
from analysis) relative to transcription start sites, the
location analysis tool of Sole-Search was used. The
epigenetic profiles of the Kaiso binding sites were
analyzed using tag density plots that were created
using Hypergeometric Optimization of Motif EnRichment
software (HOMER, http://biowhat.ucsd.edu/homer/index.
html) [22]. All histone modification and transcription
factor ChIP-seq datasets used in comparison to Kaiso
peaks were downloaded from the UCSC genome browser.
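The replicate comparison described above amounts to an interval-overlap test between two called-peak lists. The sketch below is a generic illustration of that idea, not the Sole-Search overlap tool: the function name, the (chromosome, start, end) tuple layout, and the half-open coordinate convention are all assumptions.

```python
# Generic interval-overlap sketch; not the Sole-Search implementation.
from collections import defaultdict


def overlapping_peaks(peaks_a, peaks_b):
    """Return the peaks in peaks_a that overlap at least one peak in peaks_b.

    Peaks are (chrom, start, end) tuples with half-open coordinates,
    so peaks that merely touch end-to-start do not count as overlapping.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        by_chrom[chrom].append((start, end))

    shared = []
    for chrom, start, end in peaks_a:
        if any(start < b_end and b_start < end
               for b_start, b_end in by_chrom[chrom]):
            shared.append((chrom, start, end))
    return shared
```

The linear scan per peak keeps the example short; for genome-scale peak lists a sorted sweep or an interval tree would be the more usual choice.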
DNA methylation analysis
To determine the DNA methylation status of the genomic
regions corresponding to Kaiso peaks, a BigBed file for
GM12878 WGBS data was downloaded from the UCSC
genome browser and was converted to BED format using
the bigBedToBed utility available at http://hgdownload.soe.
ucsc.edu/admin/exe/macOSX.i386/. HOMER [22] was then
used to plot the percent methylation across Kaiso peak
regions using the annotatePeaks.pl script and the ‘-ratio’
option, comparing peak files to tag directories created using
the ‘-mCpGbed’ option. The resulting output files were
used to create histograms plotting the average percent
methylation in 500bp regions surrounding the center of
Kaiso peaks using a bin size of 10bp. To identify an
enriched motif within Kaiso-binding sites, a motif analysis
was performed using HOMER to identify strings of nucleo-
tide sequences that are enriched within Kaiso peaks. The
findMotifsGenome.pl command was run in HOMER using
hg19 as the reference genome. To limit the motif analysis
to regions pulled down by ChIP, the analysis was limited to
sequences contained within each peak using the ‘-size given’
option. For some analyses, we centered Kaiso peaks on the
methylated motif using the annotatePeaks.pl script and the
-center option in HOMER. To determine the methylation
status of individual CGCG motifs contained within peaks, the
genomic coordinates of each motif were compared to WGBS
and/or RRBS data, and a beta value was assigned to each
CpG within the motif. For GM12878 methylation analyses,
WGBS and RRBS data were combined to achieve higher
genomic coverage; for K562 analyses, only RRBS data was
available. Only bases with a minimum 3× sequencing
coverage were used in methylation analyses.
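The binned methylation profile described above can be sketched as follows. This is a minimal illustration rather than the HOMER computation: the function name and the input layout are hypothetical, and the 500 bp region is assumed to mean 250 bp on each side of the peak center. The 10 bp bin size and the minimum 3× coverage filter follow the text.

```python
# Illustrative sketch; HOMER's annotatePeaks.pl was used in the study itself.

def methylation_profile(cpgs, center, flank=250, bin_size=10, min_coverage=3):
    """Average percent methylation per bin around a peak center.

    cpgs: iterable of (position, methylated_reads, total_reads) for CpGs
          on the same chromosome as the peak.
    Returns one value per bin; bins with no sufficiently covered CpG are None.
    """
    window_start = center - flank
    n_bins = (2 * flank) // bin_size
    sums = [0.0] * n_bins
    counts = [0] * n_bins

    for pos, methylated, total in cpgs:
        if total < min_coverage:  # drop poorly covered bases
            continue
        offset = pos - window_start
        if 0 <= offset < 2 * flank:
            b = offset // bin_size
            sums[b] += 100.0 * methylated / total
            counts[b] += 1

    return [s / c if c else None for s, c in zip(sums, counts)]
```

Averaging such profiles across all peaks, bin by bin, would give the kind of histogram described in the text.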
RNA-seq analysis
The RefSeq gene annotation files for RNA-seq data in
GM12878 and K562 were downloaded from the
UCSC genome browser (http://hgdownload.cse.ucsc.edu/
goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/).
A union file was created by combining all transcripts for
individual genes with multiple transcripts. For both cell
types, all replicates were combined for analysis. All reads
were aligned to the hg19 reference genome using the
bowtie aligner, and those uniquely mapped were used to
calculate the expression level as reads per kilobase per
million reads (RPKM), using the following formula:
\[ \mathrm{RPKM} = \frac{n}{N L} \times 1.0 \times 10^{8} \]
where n is the number of mapped reads localized within
exons, N is the total number of uniquely mapped
reads in the experiment, and L is the length of gene
body summing from all union exons in base pairs.
We note that Kaiso had an RPKM value of
0.004081966 in GM12878 and 0.004755255 in K562,
and ranked within the top 40% of expressed genes in
both cell types. To verify the difference among groups
of peaks (K562-unique, common, GM12878-unique),
the Mann-Whitney rank test was applied.
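The expression calculation described above can be sketched as follows. The function names are hypothetical and exon intervals are assumed to be half-open (start, end) pairs. The scale factor defaults to the value in the formula as printed here; because the conventional reads-per-kilobase-per-million definition corresponds to a factor of 10^9 when L is in base pairs, the factor is exposed as a parameter rather than fixed.

```python
# Illustrative sketch under the stated assumptions; not the authors' code.

def union_length(exons):
    """Total base pairs covered by (start, end) exon intervals after
    merging overlaps, i.e. the length of the union of all transcripts."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(exons):
        if current_end is None or start > current_end:
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def rpkm(n_exon_reads, total_mapped_reads, gene_length_bp, scale=1.0e8):
    """Expression as n / (N * L) * scale, with n, N and L as defined in the text."""
    return n_exon_reads / (total_mapped_reads * gene_length_bp) * scale
```

For example, two overlapping exons spanning 0 to 150 plus a separate exon spanning 200 to 300 give a union length of 250 bp.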
Additional files
Additional file 1: Chromosomal locations and peak scores for Kaiso
peaks sets in all cell lines.
Additional file 2: Peak selection for Pol2 in GM12878 cells. Shown
are the peaks called using Sole-Search for the merged replicate
datasets of Pol2 in GM12878 (panel A). The parameters used in the
Sole-Search peak calling were as follows: permutation of 5, fragment
length of 150, alpha value of 0.0010, an FDR of 0.0010, and a peak merge
distance of 0. The top 20% of the called peaks were used (tag height >
50). The arrow indicates the position at which the peak list was
truncated. Panel B shows the location analysis for Pol2 peaks that were
used in our analysis (red) and those that were discarded (blue).
Additional file 3: HOMER Annotation of Kaiso peak files in
GM12878 and K562, including locations of Kaiso motifs.
Additional file 4: Chromosomal location and percent methylation
of CGCG motifs within GM12878 and K562 Kaiso peaks.
Additional file 5: Motif analysis of subsets of K562 Kaiso peaks
identified in highly amplified genomic regions. As shown, the Kaiso
motif is not highly represented in the removed peaks. The adjusted high-
confidence peak set is used for analyses in this paper. However, the
genomic coordinates of the retained and removed peaks are provided in
Additional file 6.
Additional file 6: Peak files used in the removal of amplified
regions in K562 cells.
Additional file 7: Methylation of Kaiso peaks in K562. Snapshot of
the region on chromosome 6 containing the highest methylated
CGCG motif identified within Kaiso K562 high-confidence peaks. The
ChIP-seq track for Kaiso is shown in black, with called peaks represented
by black bars below the track. ChIP-seq tracks for Pol2 and histone
modifications are shown in blue. The inset shows a zoom in of the
region bound by Kaiso containing a methylated CGCG motif. Red bars in
the RRBS track represent methylated cytosines.
Additional file 8: Motif analysis of cell type-specific peaks lacking a
Kaiso motif. (A) K562 Kaiso peaks that lack a Kaiso motif were searched
for known motifs for other TFs using Homer. (B) GM12878-unique distal
Kaiso peaks that lack a Kaiso motif were searched for known motifs for
other TFs using Homer.
Additional file 9: Plot of DNA methylation versus peak rank. The
methylation status of each CGCG motif within Kaiso peaks was calculated
using WGBS and RRBS data (for GM12878, left panel) and RRBS (for K562,
right panel), and plotted relative to the score of the peak containing the
motif. The left panel shows high-confidence Kaiso peaks in GM12878 and
the right panel shows high-confidence Kaiso peaks in K562.
Additional file 10: Methylation of Kaiso peaks identified by LONUT.
(A) Comparison of methylation levels within peaks called by different
programs. Peaks were called by BELT using non-uniquely mapped tags
(left) and uniquely-mapped tags (middle) that were mapped using
LONUT. These peaks were compared to Sole-Search-called peaks (right)
for methylation of CGCG motifs. Red arrows indicate CGCG motifs
identified in the promoters of the PHF8 and PRDX4 genes. (B) Genome
browser snapshot of the region surrounding the PHF8 gene, where a
CGCG motif is 73% methylated in GM12878 cells. ChIP-seq tracks for Pol2
and histone modifications are shown in red and the Kaiso ChIP-seq track
is shown in black.
Additional file 11: MTA2 promoter showing no DNA methylation
and active promoter marks in HeLa cells. Genome browser snapshot
showing the MTA2 gene. The region analyzed for Kaiso-binding and DNA
methylation in Yoon et al. is represented by the black box under the red
arrow. RRBS tracks for several cell types are shown in green, red, and
yellow, but methylation is absent in all cell lines for the region in question. ChIP-seq
density tracks for Pol2 and histone modifications in HeLa cells are shown
in grey-scale.
Abbreviations
Bp: Base pairs; ChIP: Chromatin immunoprecipitation; EBV: Epstein-Barr virus;
HDAC: Histone deacetylase; HOMER: Hypergeometric Optimization of Motif
EnRichment; MNase: Micrococcal nuclease; RPKM: Reads per kilobase per
million reads; RRBS: Reduced representation bisulfite sequencing;
WGBS: Whole genome bisulfite sequencing.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
AB carried out the analyses and interpretation of ChIP-seq and MNase-seq
data and drafted the manuscript. LY performed analyses of the DNA
methylation, YW performed LONUT alignments and BELT peak calling, ZY
performed the analysis of RNA-seq data, VXJ coordinated LONUT and BELT
ChIP-seq data analysis and RNA-seq data analysis, and PJF conceived of the
study, participated in its design and coordination, and helped to draft the
manuscript. All authors have read and approved the final manuscript.
Acknowledgments
All ChIP-seq and RNA-seq data was produced as part of the ENCODE
consortium, and is available at (http://genome.ucsc.edu); all data used in this
study is past the nine month moratorium. This work was supported in part
by 1U54HG004558 as a component of the ENCODE project and by
P30CA014089 from the National Cancer Institute; AB was supported in part
by a pre-doctoral training fellowship from the National Human Genome
Research Institute of the National Institutes of Health under grant number
F3100HG6114.
Author details
1 Department of Biochemistry & Molecular Biology, Norris Comprehensive
Cancer Center, University of Southern California, Los Angeles, CA 90089, USA.
2 Genetics Graduate Group, University of California-Davis, Davis, CA 95616,
USA.
3 Department of Biomedical Informatics, The Ohio State University,
Columbus, OH 43210, USA.
Received: 18 March 2013 Accepted: 16 April 2013
Published: 21 May 2013
References
1. Jones PA: Functions of DNA methylation: islands, start sites, gene bodies
and beyond. Nat Rev Genet 2012, 13:484–492.
2. Li G, Reinberg D: Chromatin higher-order structures and gene regulation.
Curr Opin Genet Dev 2011, 21:175–186.
3. Statham AL, Robinson MD, Song JZ, Coolen MW, Stirzaker C, Clark SJ: Bisulfite
sequencing of chromatin immunoprecipitated DNA (BisChIP-seq) directly
informs methylation status of histone-modified DNA. Genome Res 2012,
22:1120–1127.
4. Brinkman AB, Gu H, Bartels SJJ, Zhang Y, Matarese F, Simmer F, Marks H,
Bock C, Gnirke A, Meissner A, Stunnenberg HG: Sequential ChIP-bisulfite
sequencing enables direct genome-scale investigation of chromatin and
DNA methylation cross-talk. Genome Res 2012, 22:1128–1138.
5. Komashko VM, Farnham PJ: 5-azacytidine treatment reorganizes genomic
histone modification patterns. Epigenetics 2010, 5:229–240.
6. Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, Doyle F, Epstein CB,
Frietze S, Harrow J, Kaul R, Khatun J, Lajoie BR, Landt SG, Lee B-K, Pauli F,
Rosenbloom KR, Sabo P, Safi A, Sanyal A, Shoresh N, Simon JM, Song L,
Trinklein ND, Altshuler RC, Birney E, Brown JB, Cheng C, Djebali S, Dong X,
Dunham I, et al: An integrated encyclopedia of DNA elements in the
human genome. Nature 2012, 489:57–74.
7. Filion GJP, Zhenilo S, Salozhin S, Yamada D, Prokhortchouk E, Defossez P-A:
A family of human zinc finger proteins that bind methylated DNA and
repress transcription. Mol Cell Biol 2006, 26:169–181.
8. Prokhortchouk A, Hendrich B, Jørgensen H, Ruzov A, Wilm M, Georgiev G, Bird
A, Prokhortchouk E: The p120 catenin partner Kaiso is a DNA methylation-
dependent transcriptional repressor. Genes Dev 2001, 15:1613–1618.
9. Daniel JM, Spring CM, Crawford HC, Reynolds AB, Baig A: The p120(ctn)-binding
partner Kaiso is a bi-modal DNA-binding protein that recognizes both a
sequence-specific consensus and methylated CpG dinucleotides. Nucleic Acids
Res 2002, 30:2911–2919.
10. Yoon H-G, Chan DW, Reynolds AB, Qin J, Wong J: N-CoR mediates DNA
methylation-dependent repression through a methyl CpG-binding
protein Kaiso. Mol Cell 2003, 12:723–734.
11. Blahnik KR, Dou L, O’geen H, McPhillips T, Xu X, Cao AR, Iyengar S, Nicolet
CM, Ludäscher B, Korf I, Farnham PJ: Sole-Search: an integrated analysis
program for peak detection and functional annotation using ChIP-seq
data. Nucleic Acids Res 2010, 38:e13.
12. Blahnik KR, Dou L, Echipare L, Iyengar S, O’geen H, Sanchez E, Zhao Y, Marra
MA, Hirst M, Costello JF, Korf I, Farnham PJ: Characterization of the
contradictory chromatin signatures at the 3′ exons of zinc finger genes.
PLoS One 2011, 6:e17121.
13. Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S,
Bernstein BE, Bickel P, Brown JB, Cayting P, Chen Y, DeSalvo G, Epstein C,
Fisher-Aylor KI, Euskirchen G, Gerstein M, Gertz J, Hartemink AJ, Hoffman
MM, Iyer VR, Jung YL, Karmakar S, Kellis M, Kharchenko PV, Li Q, Liu T, Liu
XS, Ma L, Milosavljevic A, Myers RM, et al: ChIP-seq guidelines and
practices of the ENCODE and modENCODE consortia. Genome Res 2012,
22:1813–1831.
14. Li Q, Brown JB, Huang H, Bickel PJ: Measuring reproducibility of high-
throughput experiments. The Annals of Applied Statistics 2011, 5:1752–1779.
15. Lopes EC, Valls E, Figueroa ME, Mazur A, Meng F-G, Chiosis G, Laird PW,
Schreiber-Agus N, Greally JM, Prokhortchouk E, Melnick A: Kaiso contributes
to DNA methylation-dependent silencing of tumor suppressor genes in
colon cancer cell lines. Cancer Res 2008, 68:7258–7263.
16. Donaldson NS, Nordgaard CL, Pierre CC, Kelly KF, Robinson SC, Swystun L,
Henriquez R, Graham M, Daniel JM: Kaiso regulates Znf131-mediated
transcriptional activation. Exp Cell Res 2010, 316:1692–1705.
17. Kelly KF, Otchere AA, Graham M, Daniel JM: Nuclear import of the BTB/POZ
transcriptional regulator Kaiso. J Cell Sci 2004, 117:6143–6152.
18. Raghav SK, Waszak SM, Krier I, Gubelmann C, Isakova A, Mikkelsen TS,
Deplancke B: Integrative genomics identifies the corepressor SMRT as a
gatekeeper of adipogenesis through the transcription factors C/EBPβ
and KAISO. Mol Cell 2012, 46:335–350.
19. Li G, Ruan X, Auerbach RK, Sandhu KS, Zheng M, Wang P, Poh HM, Goh Y,
Lim J, Zhang J, Sim HS, Peh SQ, Mulawadi FH, Ong CT, Orlov YL, Hong S,
Zhang Z, Landt S, Raha D, Euskirchen G, Wei C-L, Ge W, Wang H, Davis C,
Fisher-Aylor KI, Mortazavi A, Gerstein M, Gingeras T, Wold B, Sun Y, et al:
Extensive promoter-centered chromatin interactions provide a
topological basis for transcription regulation. Cell 2012, 148:84–98.
20. Lee B-K, Iyer VR: Genome-wide studies of CCCTC-binding factor (CTCF)
and cohesin provide insight into chromatin structure and regulation.
J Biol Chem 2012, 287:30906–30913.
21. Kelly TK, Liu Y, Lay FD, Liang G, Berman BP, Jones PA: Genome-wide
mapping of nucleosome positioning and DNA methylation within
individual DNA molecules. Genome Res 2012, 22:2497–2506.
22. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C,
Singh H, Glass CK: Simple combinations of lineage-determining
transcription factors prime cis-regulatory elements required for
macrophage and B cell identities. Mol Cell 2010, 38:576–589.
23. Ruzov A, Savitskaya E, Hackett JA, Reddington JP, Prokhortchouk A, Madej
MJ, Chekanov N, Li M, Dunican DS, Prokhortchouk E, Pennings S, Meehan
RR: The non-methylated DNA-binding function of Kaiso is not required
in early Xenopus laevis development. Development 2009, 136:729–738.
24. Defossez PA: The human enhancer blocker CTC-binding factor interacts
with the transcription factor Kaiso. J Biol Chem 2005, 280:43017–43023.
25. Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, Pierce BG, Dong
X, Kundaje A, Cheng Y, Rando OJ, Birney E, Myers RM, Noble WS, Snyder M,
Weng Z: Sequence features and chromatin structure around the
genomic regions bound by 119 human transcription factors. Genome Res
2012, 22:1798–1812.
26. Buck-Koehntop BA, Martinez-Yamout MA, Dyson HJ, Wright PE: Kaiso uses
all three zinc fingers and adjacent sequence motifs for high affinity
binding to sequence-specific and methyl-CpG DNA targets. FEBS Lett
2012, 586:734–739.
27. Bartels SJJ, Spruijt CG, Brinkman AB, Jansen PWTC, Vermeulen M,
Stunnenberg HG: A SILAC-based screen for Methyl-CpG binding proteins
identifies RBP-J as a DNA methylation and sequence-specific binding
protein. PLoS One 2011, 6:e25884.
28. Buck-Koehntop BA, Stanfield RL, Ekiert DC, Martinez-Yamout MA, Dyson HJ,
Wilson IA, Wright PE: Molecular basis for recognition of methylated and
specific DNA sequences by the zinc finger protein Kaiso. Proc Natl Acad
Sci USA 2012, 109:15229–15234.
29. Lan X, Bonneville R, Apostolos J, Wu W, Jin VX: W-ChIPeaks: a
comprehensive web application tool for processing ChIP-chip and
ChIP-seq data. Bioinformatics 2011, 27:428–430.
30. Rabinovich A, Jin VX, Rabinovich R, Xu X, Farnham PJ: E2F in vivo binding
specificity: comparison of consensus versus nonconsensus binding sites.
Genome Res 2008, 18:1763–1777.
31. Farnham PJ: Insights from genomic profiling of transcription factors.
Nat Rev Genet 2009, 10:605–616.
32. Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E,
Sheffield NC, Stergachis AB, Wang H, Vernot B, Garg K, John S, Sandstrom R,
Bates D, Boatman L, Canfield TK, Diegel M, Dunn D, Ebersol AK, Frum T,
Giste E, Johnson AK, Johnson EM, Kutyavin T, Lajoie B, Lee B-K, Lee K,
London D, Lotakis D, Neph S, et al: The accessible chromatin landscape of
the human genome. Nature 2012, 489:75–82.
33. Gaffney DJ, McVicker G, Pai AA, Fondufe-Mittendorf YN, Lewellen N,
Michelini K, Widom J, Gilad Y, Pritchard JK: Controls of nucleosome
positioning in the human genome. PLoS Genet 2012, 8:e1003036.
34. Li B, Carey M, Workman JL: The role of chromatin during transcription.
Cell 2007, 128:707–719.
35. Quenneville S, Verde G, Corsinotti A, Kapopoulou A, Jakobsson J, Offner S,
Baglivo I, Pedone PV, Grimaldi G, Riccio A, Trono D: In embryonic stem
cells, ZFP57/KAP1 recognize a methylated hexanucleotide to affect
chromatin and DNA methylation of imprinting control regions. Mol Cell
2011, 44:361–372.
36. Spruijt CG, Gnerlich F, Smits AH, Pfaffeneder T, Jansen PWTC, Bauer C,
Münzel M, Wagner M, Müller M, Khan F, Eberl HC, Mensinga A, Brinkman
AB, Lephikov K, Müller U, Walter J, Boelens R, van Ingen H, Leonhardt H,
Carell T, Vermeulen M: Dynamic readers for 5-(hydroxy)methylcytosine
and its oxidized derivatives. Cell 2013, 152:1146–1159.
37. Donaldson NS, Pierre CC, Anstey MI, Robinson SC, Weerawardane SM, Daniel
JM: Kaiso represses the cell cycle gene cyclin D1 via sequence-specific and
methyl-CpG-dependent mechanisms. PLoS One 2012, 7:e50398.
38. Liu Y, Zhang X, Blumenthal RM, Cheng X: A common mode of recognition
for methylated CpG. Trends Biochem Sci 2013, 38:177–183.
doi:10.1186/1756-8935-6-13
Cite this article as: Blattler et al.: ZBTB33 binds unmethylated regions of
the genome associated with actively expressed genes. Epigenetics &
Chromatin 2013 6:13.
RESEARCH Open Access
Global loss of DNA methylation uncovers intronic
enhancers in genes showing expression changes
Adam Blattler1,2, Lijing Yao1, Heather Witt1, Yu Guo1, Charles M Nicolet1, Benjamin P Berman1 and Peggy J Farnham1,3*
Abstract
Background: Gene expression is epigenetically regulated by a combination of histone modifications and
methylation of CpG dinucleotides in promoters. In normal cells, CpG-rich promoters are typically unmethylated,
marked with histone modifications such as H3K4me3, and are highly active. During neoplastic transformation, CpG
dinucleotides of CG-rich promoters become aberrantly methylated, corresponding with the removal of active
histone modifications and transcriptional silencing. Outside of promoter regions, distal enhancers play a major role
in the cell type-specific regulation of gene expression. Enhancers, which function by bringing activating complexes
to promoters through chromosomal looping, are also modulated by a combination of DNA methylation and
histone modifications.
Results: Here we use HCT116 colorectal cancer cells with and without mutations in DNA methyltransferases;
the mutant cells show a 95% reduction in global DNA methylation levels. These cells are used to study
the relationship between DNA methylation, histone modifications, and gene expression. We find that the loss
of DNA methylation is not sufficient to reactivate most of the silenced promoters. In contrast, the removal of
DNA methylation results in the activation of a large number of enhancer regions as determined by the
acquisition of active histone marks.
Conclusions: Although the transcriptome is largely unaffected by the loss of DNA methylation, we identify two
distinct mechanisms resulting in the upregulation of distinct sets of genes. One is a direct result of DNA methylation
loss at a set of promoter regions and the other is due to the presence of new intragenic enhancers.
Background
Genes are regulated by epigenetic modifications and
transcription factor binding at their promoters and at
distally located regulatory regions. Studies over the past
two decades have shown that promoters having high
levels of DNA methylation are not transcriptionally active
[1-3]. Recent genome-wide epigenetic profiling efforts
demonstrate that promoter regions with high levels of
DNA methylation have low levels of active marks such as
H3K4me3 and that methylated distal regulatory regions
lack the active mark H3K27ac [4-8]. During neoplastic
transformation, DNA methylation is reduced genome-
wide, but accumulates at certain promoters. Because some
of the promoters that become highly methylated are
tumor suppressor genes [9-11], DNA de-methylating
agents are being used in the clinic to reactivate silenced
promoters. However, it has yet to be determined whether
the global eradication of DNA methylation is advanta-
geous for the cell or the patient. One could imagine that
global loss of DNA methylation would have major effects
on the transcriptome and epigenome of the cell. The
DNA de-methylating drug 5-azacytidine (5-Aza-CR) has
been approved for use as an epigenetic chemotherapeutic
agent [12,13]. 5-Aza-CR functions by incorporating into
DNA in place of cytosine and trapping DNA methyltrans-
ferases (DNMTs), which leads to their degradation and a
subsequent passive loss of DNA methylation via replica-
tion. Previously, we treated HEK293 cells with 5-Aza-CR
and analyzed the effects on histone modifications and
RNA expression [12]. We found that 5-Aza-CR treatment
* Correspondence: peggy.farnham@med.usc.edu
1 Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90089, USA
3 Department of Biochemistry & Molecular Biology, Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90089-9601, USA
Full list of author information is available at the end of the article
© 2014 Blattler et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain
Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article,
unless otherwise stated.
Blattler et al. Genome Biology 2014, 15:469
http://genomebiology.com/2014/15/10/469
caused changes in gene expression in approximately 1,500
genes (out of the 24,000 genes analyzed) but less than 800
of the genes were up-regulated as a result, and most genes
that showed increased expression were not regulated by
promoters that displayed DNA methylation prior to treat-
ment. In addition to affecting DNA methylation, 5-Aza-
CR can also incorporate into RNA and interrupt normal
cellular processes such as ribosomal assembly and transla-
tion [14,15]. Therefore, it was not clear if the observed
changes in transcript levels were due to changes in transcription
rate from de-methylated promoters or to changes
in RNA stability caused by intercalation of the 5-Aza-CR
into the transcripts, affecting cellular signaling pathways
due to translational defects. In addition, treatment with 5-
Aza-CR does not completely abolish DNA methylation.
Even with high doses, the overall levels of DNA methylation
are reduced only 50 to 60% [12]. Therefore, it was also
possible that de-repression of genes was incomplete after
treatment with the drug (due to the remaining DNA
methylation) and that many more transcripts whose pro-
moters are normally silenced by DNA methylation would
be identified if a more dramatic reduction in DNA methy-
lation could be achieved. Here we explore the relationship
between DNA methylation and the epigenome using
both HCT116 colorectal cancer cells and DKO1 cells,
a derivative of HCT116 cells that have a bi-allelic knock-
out of DNMT1 and bi-allelic deletion of exons 2 to 21 of
DNMT3b [16]. Surprisingly, we found only a modest effect
on the transcriptome and very limited increases in active
marks on promoter regions. In order to fully understand
the effects of global DNA methylation loss on the transcriptome
and the epigenome at promoters and distal regulatory
regions, we employed genome-wide methods for examining
DNA methylation, RNA expression changes, histone modification
patterns, and RNA polymerase II (RNAPII) occupancy.
We found that the most robust epigenomic changes
occurring after loss of DNA methylation were due to the
acquisition of thousands of new enhancers. Interestingly,
many of the genes that were up-regulated in DKO1 cells
via mechanisms distinct from de-methylation of promoter
regions had multiple newly acquired intragenic enhancers.
Results
Loss of DNA methylation does not result in an increase in
active histone marks at promoters
To determine the relationship between a reduction of
DNA methylation and global epigenetic marks, we per-
formed functional genomic analyses using DNMT-deficient
HCT116 DKO1 cells. The DKO1 cell line has a bi-allelic
knockout of DNMT1 and bi-allelic deletion of exons 2 to
21 of DNMT3b and is reported to have 5% of the overall
DNA methylation levels relative to the parental HCT116
cell line [16]. However, these results were obtained using a
liquid chromatography approach which monitored overall
5-methylcytosine content genome-wide and thus did not
examine DNA methylation reduction in specific genomic
compartments such as promoters or gene bodies. We
therefore performed whole genome bisulfite sequencing
(WGBS) on HCT116 parental and DKO1 cells (Figure 1A);
to achieve adequate coverage of GC-rich promoters,
WGBS library preparation was performed as discussed in
the Materials and methods section. We found that pro-
moters, gene bodies, and randomly selected regions of the
genome showed extensive losses of DNA methylation,
with the median level of DNA methylation in DKO1 cells
being <1%, 13%, and 9% of the parental cell line, respect-
ively (Figure 1B). We note that randomly selected regions
of the genome showed an overall reduction of 89% of their
original methylation levels, which is slightly different than
the value determined previously (95% reduction in methy-
lation). However, Rhee et al. [16] used a method that
monitors percentage methylation of all cytosines in the
genome whereas we measured methylated cytosines in the
context of CpG dinucleotides that are located in the
uniquely mappable, non-repetitive regions of the human
genome. As shown in Figure 1B, the promoters in parental
HCT116 cells display a wide range of methylation levels.
These promoters generally fall into two major groups,
those having very high methylation or very low methyla-
tion levels (Figure 1C); essentially all promoters have
greatly reduced DNA methylation in DKO1 cells.
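As a quick arithmetic check on the values quoted above, the relative loss of methylation can be recomputed from the medians reported for Figure 1B (HCT116: 30%, 84%, and 84%; DKO1: <1%, 13%, and 9%). The sketch below is illustrative only: the function name is hypothetical, and the promoter value of 1% is an assumed upper bound standing in for "<1%".

```python
def percent_reduction(parental, knockout):
    """Relative loss of methylation, as a percentage of the parental level."""
    return 100.0 * (parental - knockout) / parental


# Median percent methylation (HCT116, DKO1) as reported for Figure 1B.
# The promoter DKO1 value is an assumed upper bound for "<1%".
medians = {
    "promoters": (30.0, 1.0),
    "gene bodies": (84.0, 13.0),
    "random regions": (84.0, 9.0),
}

for region, (hct116, dko1) in medians.items():
    print(f"{region}: {percent_reduction(hct116, dko1):.0f}% reduction")
```

For the randomly selected regions this reproduces the 89% reduction stated in the text.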
Because approximately 30,000 promoters can be classi-
fied as highly methylated in HCT116 cells, having an
average DNA methylation level greater than 50% at the
CpGs within -100 to +700 bp of the transcription start site
(TSS), we anticipated that a loss in DNA methylation in
DKO1 cells at these promoters may reveal previously in-
accessible transcription factor binding sites, resulting in a
new set of active promoters marked by increased levels of
active histone marks. To determine the effect of losses in DNA
methylation on active histone marks, we compared ChIP-
seq datasets for H3K4me3 and H3K27ac in HCT116 and
DKO1 cells. The H3K4me3 data for HCT116 cells was
available as part of the ENCODE project [17]. To obtain
the other datasets, two biological replicates for each of the
H3K4me3 (DKO1), H3K27ac (DKO1), and H3K27ac
(HCT116) ChIP-seq samples were produced. The sequen-
cing and peak metrics for all ChIP-seq datasets used in
this study are provided in Additional file 1; see Additional
file 2 for ChIP-seq peaks in HCT116 cells and Additional
file 3 for ChIP-seq peaks in DKO1 cells. Surprisingly, we
found that the levels of H3K4me3 and H3K27ac signals
were greatly reduced in both replicates of the DKO1
ChIP-seq datasets compared with the HCT116 ChIP-seq
replicates. However, a small number of genomic locations
do have increases in the levels of H3K4me3 or H3K27ac in
DKO1 cells (Figure 2). To quantify these differences, we
first identified approximately 12,000 promoter-proximal
Blattler et al. Genome Biology 2014, 15:469 Page 2 of 16
http://genomebiology.com/2014/15/10/469
H3K4me3 peaks in both HCT116 and in DKO1 cells. A
comparison of the two datasets revealed that very few
new promoter-proximal H3K4me3 peaks were identified
(Additional file 4). In fact, the overall level of H3K4me3
was greatly reduced at promoters originally unmethylated
in HCT116, and promoters losing methylation in DKO1
did not gain H3K4me3 (Figure 3). To determine if other ac-
tive epigenetic marks were also reduced at these promoter
regions, we examined H3K27ac levels. Again, we found that
relatively few promoters gained H3K27ac (Additional file 5)
and that the active H3K27ac mark was reduced in DKO1
cells at promoters having H3K27ac in HCT116 (Figure 3).
It was possible that the loss of DNA methylation did result
in more active promoter regions, but somehow also inter-
fered with recruitment of histone modifying enzymes, thus
altering the histone ChIP-seq profiles. Therefore, we
next monitored the binding of RNAPII and found that
essentially no promoters gained RNAPII in DKO1 cells
(Additional file 6) and that, similar to the active histones,
the levels of RNAPII also decreased at the vast majority of
promoters in the DKO1 cells (Figure 3).
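The 50% cutoff used above to call a promoter highly methylated (average CpG methylation within -100 to +700 bp of the TSS) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names and input layout are hypothetical, coordinates are for a single chromosome, and the window is mirrored for minus-strand genes.

```python
def promoter_window(tss, strand, upstream=100, downstream=700):
    """Genomic (start, end) of the -100..+700 window, oriented by strand."""
    if strand == "+":
        return tss - upstream, tss + downstream
    return tss - downstream, tss + upstream


def mean_promoter_methylation(cpgs, tss, strand):
    """Mean percent methylation of CpGs in the window, or None if no CpG.

    cpgs: iterable of (position, percent_methylation) pairs (assumed layout).
    """
    start, end = promoter_window(tss, strand)
    values = [meth for pos, meth in cpgs if start <= pos <= end]
    return sum(values) / len(values) if values else None


def is_highly_methylated(cpgs, tss, strand, threshold=50.0):
    """True when the window average exceeds the 50% cutoff."""
    mean = mean_promoter_methylation(cpgs, tss, strand)
    return mean is not None and mean > threshold


# Toy data: three CpGs fall inside the plus-strand window around TSS 1000.
cpgs = [(950, 90.0), (1000, 80.0), (1500, 70.0), (1800, 10.0)]
print(mean_promoter_methylation(cpgs, 1000, "+"))
print(is_highly_methylated(cpgs, 1000, "+"))
```

Promoters with no covered CpG in the window return None and are treated as not highly methylated here; a real analysis might prefer to exclude them.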
Identification of genes affected by loss of DNA methylation
We had expected that the large increase in the number of
accessible promoters in DKO1 cells due to the loss of DNA
methylation (approximately 30,000 additional unmethylated
promoters) would result in the transcriptional activation of
a similar number of genes. However, we identified very few
promoters that gained RNAPII in DKO1, suggesting that
we would not find increased levels of transcripts for many
Figure 1 panel labels: (A) tracks for Coverage HCT116, Coverage DKO1, %mCpG HCT116, %mCpG DKO1, CpG Islands, and RefSeq Genes; (B) DNA methylation level (%) for Promoter, Gene body, and Random regions in HCT116 and DKO1; (C) Number of promoters by percent DNA methylation.
Figure 1 Whole genome bisulfite sequencing comparative analysis of HCT116 and DKO1 cells. (A) Light blue and pink tracks represent
the sequencing coverage along a segment of human chromosome 19 in HCT116 and DKO1 cells, respectively. Dark blue and red tracks illustrate
the percentage of methyl-C/C in HCT116 and DKO1, respectively. Light-colored lines within the percentage methylation (%mCpG) tracks
represent the average percentage methylation in the immediate region. CpG islands are shown above the RefSeq genes as green bars.
(B) Box plot illustrating the percentage methylation in promoters, gene bodies, and random regions of the genome; the horizontal line
in each bar indicates the median value. For HCT116 cells the median values are 30% (promoters), 84% (gene bodies), and 84% (random
regions) and for DKO1 cells the median values are <1% (promoters), 13% (gene bodies), and 9% (random regions). (C) The number of
promoters containing varying levels of methylation in HCT116 and DKO1 are shown; the minimum and maximum DNA methylation
values for the region between -100 and +700 relative to the start site at the promoters in each group is indicated by the color key.
Figure 2 track labels: RNAPII, H3K4me3, H3K36me3, and H3K27ac, each in HCT116 and DKO1, across a region on chromosome 6.
Figure 2 Overview of ChIP-seq data in HCT116 and DKO1 cells. Shown is a region of chromosome 6 illustrating the reduction in levels of
active marks in DKO1 cells compared with HCT116 cells, for RNAP II (blues), H3K4me3 (reds), H3K27ac (greens), and H3K36me3 (oranges). Dark
and light colors represent HCT116 and DKO1 cells, respectively. Although peaks are reduced in height at most sites, some genomic locations do
have increases in H3K4me3 (blue boxes) or H3K27ac (red boxes) in DKO1 cells.
Figure 3 panel labels: heat bars show %Methylation (%meC, scale 0 to 100) for HCT116 and DKO1; left panel, promoters unmethylated in HCT116; right panel, promoters methylated in HCT116; axes are tags per bp per peak versus distance from transcription start site.
Figure 3 Active histone modifications and RNA polymerase II are drastically reduced at promoters in DKO1 cells. Shown is the
percentage DNA methylation (heat bars) and the density of ChIP-seq tags for H3K4me3 (reds), H3K27ac (greens), and RNAPII (blues) surrounding
the transcription start sites of promoters that had less than 50% methylation (left) and promoters that had more than 50% methylation in HCT116
cells (right). Light-colored dashed lines represent the marks in DKO1 cells.
genes. Rather, because of the reduction in overall levels of
active marks at promoters, we thought it was possible that
lack of DNA methylation was causing a global decrease in
transcription. We also note that DNA methylation is
found in the body of genes and that studies have shown a
correlation between gene body methylation and transcript
levels [18-20]. Therefore, it was possible that the loss of
methylation was causing decreased transcription through-
out the genome. To examine the consequences of the loss
of DNA methylation on the transcriptome, we performed
RNA-seq in HCT116 and DKO1 cells; see Additional file 7
for replicate comparisons of the RNA-seq samples and
Additional file 8 for all gene expression values. Even
though the relative levels of DNA methylation across gene
Figure 4 panel labels: (A) and (B); the panel B axes are log2(HCT116 Expression) and log2(DKO1 Expression), each spanning 0.0 to 12.5.
Figure 4 A small set of promoters are de-repressed in DKO1 cells. (A) Plot showing relative DNA methylation across the gene bodies of all
RefSeq genes in HCT116 and DKO1 cells. (B) Gene expression differences in HCT116 and DKO1; log2 expression values are plotted for every gene
expressed in HCT116 and/or DKO1. The orange line marks x = y. The green dots represent a group of de-repressed genes with
log2(HCT116 normalized expression) <1.5 and log2(DKO1 normalized expression) >2.5.
bodies is much lower in DKO1 cells (Figure 4A), we found
relatively similar transcriptome profiles in the two cell
lines (Figure 4B), suggesting that gene body methylation
does not play a significant role in regulating gene expres-
sion in HCT116 and DKO1 cells.
We did, however, identify three sets of transcripts that
were affected by loss of DNA methylation (Additional
file 8). One set of transcripts affected by a loss of methy-
lation at promoters, shown in green in Figure 4B, corre-
sponds to 1,089 transcripts that are lowly expressed
(log2 normalized expression <1.5) in HCT116 and highly
expressed (log2 normalized expression >2.5) in DKO1
cells. The promoters of these genes, which we have termed
de-repressed, were highly methylated in HCT116 and
unmethylated in DKO1 (Figure 5A). Because there are
similar numbers of de-repressed genes as DKO1-specific
H3K4me3 peaks (Additional file 4), we thought that per-
haps the promoters of the de-repressed genes would show
an increase in levels of H3K4me3. Indeed, tag density
plots across the promoters of de-repressed genes show
that, on average, H3K4me3 is increased at these promoters.
However, an overlap analysis showed that only approximately
40% of the de-repressed genes gained a new
H3K4me3 peak (Figure 5B). It was possible that the pro-
moters of the 599 de-repressed genes that did not have a
new H3K4me3 site were bound by H3K4me3 in HCT116,
even though they were silenced. However, we found that
only 78 of these 599 promoters were in the set of common
H3K4me3 peaks. To investigate the possibility that the
remaining 521 de-repressed genes might be utilizing a
new (alternative) promoter region, we determined the
distance from the TSS of each gene to the nearest DKO1-specific
H3K4me3 site. Only 48 of these genes had a newly
acquired H3K4me3 site within 20 kb, with the median
distance from the known TSS to the nearest new H3K4me3
site in DKO1 cells being over 250 kb, suggesting that if
Figure 5 panel labels: (A) Promoters of de-repressed genes, with a %Methylation scale from 0 to 100. (B) Venn diagram of DKO1-unique H3K4me3 peaks (1,678) and de-repressed genes (1,089); region counts 599, 1,199, and 462. (C) Enriched terms with -log10(binomial p value) for the two gene sets, "All de-repressed genes" and "De-repressed genes overlapping new H3K4me3 peaks":
First term list: Krueppel-associated box (35.84); GAGE (18.21); Zinc finger, C2H2-type/integrase, DNA-binding (11.77); Homeobox, eukaryotic (4.30).
Second term list: Krueppel-associated box (23.81); Zinc finger, C2H2-type/integrase, DNA-binding (10.53); Zinc finger, C2H2 (8.68); Metallothionein, vertebrate, metal binding site (7.80); Zinc finger, C2H2-like (7.72); Metallothionein, vertebrate (7.36); Metallothionein domain, vertebrate (7.36); Homeobox, eukaryotic (6.50).
Figure 5 Identification and characterization of de-repressed genes. (A) Average DNA methylation percentage (heat bar) and densities of
ChIP-seq tags for H3K4me3 (reds), H3K27ac (greens), and RNAPII (blues) are plotted relative to the transcription start sites of de-repressed genes
in HCT116 and DKO1. Light-colored dashed lines represent the marks in DKO1 cells. (B) Venn diagram displaying the overlap of the promoters of
de-repressed genes and H3K4me3 peaks unique to DKO1 cells. (C) Gene Ontology results for the 1,089 de-repressed genes in DKO1 (top) and the
subset of those genes whose promoters have DKO1-unique H3K4me3 peaks (bottom).
this mechanism is used, the alternative promoters are
quite far from the rest of the gene. One hallmark of cancer
cell gene expression is the repression of tumor suppressor
genes via promoter methylation. Therefore, one would
expect that some of the de-repressed genes would be
tumor suppressor genes. Of the 1,089 de-repressed genes
in DKO1, 38 were identified as tumor suppressor genes
present in the current TSGene database [21] (Additional
file 9). A Gene Ontology analysis of the entire set of
de-repressed genes revealed enrichment for zinc finger and
Krueppel-associated genes (Figure 5C), many of which are
found in large clusters on chromosome 19. Zinc finger
genes were particularly enriched in the set of de-repressed
genes overlapping new H3K4me3 peaks in DKO1.
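The expression thresholds used earlier in this section to define de-repressed genes (log2 normalized expression <1.5 in HCT116 and >2.5 in DKO1) amount to a simple two-sided filter. The sketch below assumes the inputs are already log2 normalized expression values; the function name and the toy gene names are hypothetical.

```python
def classify_expression(log2_hct116, log2_dko1, low=1.5, high=2.5):
    """Label a gene 'de-repressed' if it is lowly expressed in HCT116
    (log2 < low) and highly expressed in DKO1 (log2 > high)."""
    if log2_hct116 < low and log2_dko1 > high:
        return "de-repressed"
    return "other"


# Toy values: (log2 HCT116 expression, log2 DKO1 expression).
genes = {"geneA": (0.2, 4.1), "geneB": (6.0, 6.3), "geneC": (1.0, 2.0)}
labels = {name: classify_expression(h, d) for name, (h, d) in genes.items()}
print(labels)
```

Only geneA passes both cutoffs; geneC is low in HCT116 but does not exceed the DKO1 threshold.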
The other sets of transcripts that were altered in
DKO1 cells are those that were expressed in HCT116
(and thus did not have high levels of DNA methylation
at the promoter region in HCT116 cells) but showed
expression changes (either up or down) in DKO1 cells
(Figure 6A). A Gene Ontology analysis of the 274 genes
up-regulated in DKO1 did not reveal any significant gene
categories, but 22 genes were identified as tumor sup-
pressor genes present in the TSGene database. Thus,
loss of DNA methylation resulted in the up-regulation
of tumor suppressor genes by both promoter methylation-
dependent (38 genes) and promoter methylation-independent
(22 genes) mechanisms. Analysis of the 1,366 down-regulated
genes showed enrichment for chaperonins (Figure 6B).
Chaperonins, which are involved in protein folding, are
overexpressed in cancers [22]; our studies suggest redu-
cing global DNA methylation levels may be a suitable op-
tion for reducing the levels of these proteins in cancer
cells. However, we note that the expression differences of
these genes in response to loss of DNA methylation are
modest. Unlike the promoters of the de-repressed genes
(Figure 5A), analysis of the promoter regions of the up-
regulated genes (Figure 6C, right panel) did not show an
increase in levels of active marks. Strikingly, active histone
marks and RNAPII have similar profiles for the promoters
of genes up- or down-regulated in DKO1, indicating that
perhaps some other mechanism is responsible for the dif-
ferential regulation of these genes. However, we have also
considered that there may be a modest global change in
gene expression due to loss of DNA methylation that is
difficult to observe using RNA-seq as a read-out. For ex-
ample, if expression of all genes is modestly increased in
DKO1 cells, then the genes identified as 'down-regulated'
may simply be those that did not increase as much as the
rest of the transcriptome. Because modest global changes
are difficult to quantify using current methods, it remains
possible that the 1,366 down-regulated genes do not show
a loss of active histone marks because they in fact repre-
sent a set of genes that simply do not respond to a loss of
DNA methylation (with all other genes showing increased
expression). Finally, to further classify the promoters of
the altered genes, we determined if they were located
within a CpG island or if they had a TATA box. We found
that for all three categories of genes that responded to loss
of DNA methylation, the majority of promoters were
categorized as CpG island promoters. However, a smaller
percentage of the de-repressed genes had CpG island promoters (58%)
compared with the up-regulated (75%) or down-regulated
(83%) genes. For all cases, the CpG islands averaged ap-
proximately 1 kb in length. Allowing one mismatch to the
TATAWAW motif, we found that few promoters in any
class contained a TATA box within -20 to -40 of the TSS
(de-repressed: 10%; up-regulated: 10%; down-regulated:
7%).
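The TATA box scan described above (TATAWAW with one mismatch allowed, within -40 to -20 of the TSS) can be sketched as a degenerate-motif search. This is an illustrative sketch rather than the authors' code: it assumes the caller supplies the -40 to -20 sequence in the orientation of the gene, and it assumes the whole 7-bp motif must lie inside that window, which the text does not specify. W is the IUPAC code for A or T.

```python
# IUPAC symbols used by the motif; W matches A or T.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "AT"}


def mismatches(kmer, motif="TATAWAW"):
    """Number of positions where kmer disagrees with the degenerate motif."""
    return sum(base not in IUPAC[sym] for base, sym in zip(kmer, motif))


def has_tata_box(upstream_seq, motif="TATAWAW", max_mismatches=1):
    """True if any window of upstream_seq matches the motif within the
    allowed number of mismatches."""
    seq = upstream_seq.upper()
    k = len(motif)
    return any(
        mismatches(seq[i:i + k], motif) <= max_mismatches
        for i in range(len(seq) - k + 1)
    )


print(has_tata_box("GGGCTATAAAAGGGCCGCGCG"))   # contains TATAAAA
print(has_tata_box("GCGCGCGCGCGCGCGCGCGCG"))   # GC-only sequence
```

Setting max_mismatches=0 restricts the search to exact matches of the degenerate motif.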
Loss of DNA methylation has major effects on distal
regulatory regions
We were particularly interested in the set of genes whose
expression was increased in DKO1 cells but whose pro-
moters were not highly methylated in HCT116 cells. Be-
cause the promoters of these genes were not highly
methylated in HCT116 cells and did not show large in-
creases in active histone marks in DKO1 cells, we hypoth-
esized that a loss of DNA methylation at enhancer regions
may be responsible for the increased transcript levels. To
test this hypothesis, we identified active enhancer regions,
as defined by H3K27ac regions more than 2 kb from a
TSS. Interestingly, although relatively few new active pro-
moter regions were identified in DKO1 cells (Additional
files 4, 5 and 6), the enhancer landscape was greatly al-
tered between the two cell types, with many new en-
hancers in DKO1 cells (Figure 7A; Additional file 10),
most of which were highly methylated in HCT116 cells
(Figure 7B). Similar to previous studies of DNase I hyper-
sensitive sites, the enhancers in HCT116 and DKO1 cells
are evenly divided between intergenic or intragenic loca-
tions [23]. Although the overall level of H3K27ac on com-
mon enhancers is lower in DKO1 cells, the unique
enhancers have higher levels of H3K27ac. While it is diffi-
cult to link an enhancer to a specific gene, we reasoned
that, in general, enhancers regulate genes in the 'nearby'
vicinity (studies from ENCODE have shown that an en-
hancer loops to the nearest active promoter approximately
50% of the time [17]). Therefore, for the sets of up-
regulated and down-regulated genes, we calculated the
distance to the nearest enhancer that was unique to
HCT116, unique to DKO1, or common to both cell lines
(Figure 7C). We found that genes that are up-regulated in
DKO1 cells have more DKO1-unique enhancers than
HCT116-unique enhancers located within 20 kb of their
TSS (right panel). In fact, 35% of the up-regulated genes
in DKO1 have a new enhancer within 20 kb of the
promoter (shown in red). By contrast, only 12% of these genes
are within 20 kb of a lost enhancer region (shown in blue).
The median distance of a DKO1-unique enhancer from a
gene up-regulated in DKO1 cells is approximately 42 kb,
whereas the median distance to an HCT116-unique en-
hancer is over 200 kb (Figure 7C, right panel). For com-
parison, we show that the DKO1-unique enhancers have a
median distance of approximately 132 kb to the down-
regulated genes and are farther from the down-regulated
genes than are the HCT116-unique enhancers (Figure 7C,
left panel).
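The two distance rules used in this analysis (an enhancer is an H3K27ac region more than 2 kb from a TSS, and genes are scored by the distance from their TSS to the nearest enhancer) can be sketched as below. The sketch makes several assumptions that are not in the text: peaks are represented by their center coordinate, all positions lie on one chromosome, and the function names are hypothetical.

```python
from bisect import bisect_left


def nearest_distance(position, sorted_sites):
    """Distance in bp from position to the closest entry of sorted_sites.

    sorted_sites must be sorted ascending and non-empty.
    """
    i = bisect_left(sorted_sites, position)
    candidates = sorted_sites[max(i - 1, 0):i + 1]
    return min(abs(position - site) for site in candidates)


def distal_peaks(peak_centers, tss_positions, min_distance=2000):
    """Keep peaks whose center is more than min_distance bp from every TSS."""
    tss_sorted = sorted(tss_positions)
    return [p for p in peak_centers
            if nearest_distance(p, tss_sorted) > min_distance]


# Toy coordinates on a single chromosome.
tss = [10_000, 50_000]
peaks = [9_000, 13_000, 48_500, 80_000]

enhancers = sorted(distal_peaks(peaks, tss))
within_20kb = [t for t in tss if nearest_distance(t, enhancers) <= 20_000]
print(enhancers, within_20kb)
```

Running the same calculation separately on the HCT116-unique, DKO1-unique, and common enhancer sets would give the three distance distributions compared in Figure 7C.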
As noted above, the median distance between a DKO1-
unique enhancer and the start site of a gene up-regulated
in DKO1 cells was approximately 42 kb, which is similar
to the median size of the up-regulated genes. Because en-
hancers can function in either direction, this suggested
that perhaps genes up-regulated in DKO1 cells contained
DKO1-specific intragenic enhancers. We determined the
number of enhancers within the genes up-regulated
(Figure 8A) and down-regulated (Figure 8B) in DKO1
cells. We found that in the genes up-regulated in DKO1
cells, the number of DKO1-specific intragenic enhancers
was much higher than the number of HCT116-unique
enhancers (40% of the genes had a DKO1-specific intragenic
enhancer but only 12% of the genes had an HCT116-specific
intragenic enhancer). In contrast, the number of
HCT116-specific versus DKO1-specific enhancers was
more similar for the genes down-regulated in DKO1 cells.
The relative levels of H3K27ac at the promoter regions did
not change for the 111 up-regulated genes that had new
intragenic enhancers. However, the H3K27ac signal at
the intragenic enhancers was higher in DKO1 than in
HCT116 (Figure 9A). A motif analysis of these new
intragenic enhancers did not reveal an enrichment for any
specific transcription factor binding sites. Of the 310
intragenic enhancers identified within the gene bodies of
111 genes, 285 were identified as intronic and 25 were exonic
(Figure 9A). We observed that 65 of the 111 genes that had
a DKO1-unique enhancer had more than one new intra-
genic enhancer. This suggested that perhaps the en-
hancers were spreading throughout the gene. This is in
fact what we observed, and an example is shown in Fig-
ure 9B. The SASH1 gene now has marks of active en-
hancers throughout the transcribed region and is
expressed 3.7-fold higher in DKO1 cells.
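Counting intragenic enhancers per gene, as done above for the up-regulated and down-regulated sets, reduces to an interval containment test. The sketch below is a minimal illustration with hypothetical names and toy coordinates; it assumes each enhancer is represented by its center and that all genes lie on one chromosome.

```python
def intragenic_counts(genes, enhancer_centers):
    """Number of enhancer centers falling inside each gene body.

    genes: {name: (start, end)} with start <= end (assumed layout).
    """
    return {
        name: sum(start <= center <= end for center in enhancer_centers)
        for name, (start, end) in genes.items()
    }


def summarize(counts):
    """Return (genes with >=1 enhancer, genes with >1 enhancer, percent with >=1)."""
    with_any = sum(c >= 1 for c in counts.values())
    with_multiple = sum(c > 1 for c in counts.values())
    return with_any, with_multiple, 100.0 * with_any / len(counts)


genes = {"g1": (100, 1_000), "g2": (2_000, 3_000), "g3": (5_000, 6_000)}
enhancer_centers = [150, 900, 2_500, 4_000]

counts = intragenic_counts(genes, enhancer_centers)
print(counts, summarize(counts))
```

Applied to the reported totals, the same percentage calculation gives 111 of 274 genes (about 40%) with a DKO1-unique intragenic enhancer.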
Discussion
Using DKO1 cells that have a severe reduction in global
DNA methylation levels due to genetic deletion of the
DNMT1 and DNMT3b genes, we have investigated the
global relationship of DNA methylation with histone
modifications, RNAPII binding, and gene expression. Al-
though other groups have previously analyzed DNA
methylation and gene expression in DKO1 cells [24-26],
this had not been done on a genome-wide scale. Because
methylated promoters are in condensed chromatin
which cannot be accessed by transcription factors and
because DNA methylation of recognition motifs has
been shown to inhibit transcription factor binding
[27,28], we had anticipated that the de-methylation of
promoters in DKO1 cells would expose thousands of
previously inaccessible binding motifs, many of which
would be recognized by transcription factors that are
ubiquitously expressed. Therefore, we hypothesized that
a reduction in DNA methylation would create thousands
of new binding sites for transcription complexes, which
would recruit RNAPII, leading to reactivation of these
promoters. In contrast, active enhancers are highly cell
type-specific and thus it seemed likely that even upon
removal of methylation at the distal regulatory regions,
few new enhancers would be created because the cell
type-specific factors would not be present in DKO1 cells
to bind to the motifs and recruit histone acetylation
complexes. Contrary to these expectations, we found that relatively few
promoters were activated in DKO1 cells but thousands of
newly active enhancers (which were highly methylated in
parental HCT116 cells) were created. Interestingly, 3,008
(47%) of the 6,376 new enhancers that appeared in
DKO1 cells had the H3K4me1 mark of a poised enhan-
cer in the parental HCT116 cell line. Our studies indi-
cate that 1) in general, loss of DNA methylation does
not lead to the acquisition of newly active promoters,
suggesting that DNA methylation is not the primary
driver of promoter repression, and 2) loss of DNA
methylation has a major effect on promoter-distal regu-
latory regions, uncovering intragenic enhancers within
genes whose expression increases upon loss of DNA
methylation.
We investigated the effect of loss of DNA methylation
on histone modifications at promoters, enhancers, and
gene bodies. We found that overall levels of all active
histone marks were reduced at most promoters in
DKO1 cells, regardless of their methylation status in
HCT116 cells. The reason for this overall decrease in ac-
tive promoter marks in DKO1 cells is not clear. How-
ever, DKO1 cells do grow slightly more slowly than
HCT116 cells [16] and it is possible that small differ-
ences in the percentage of cells in S phase may influence
Figure 6 Identification and characterization of de-regulated genes. (A) The 1,640 genes significantly differentially expressed between HCT116
and DKO1 cells are shown in red (P-value <0.05, fold-change >1.2). (B) Gene Ontology results for down-regulated genes; up-regulated genes were
not enriched for any Gene Ontology terms. (C) The epigenetic profiles at the promoters of genes up-regulated (right panel) and down-regulated
in DKO1 (left panel).
ChIP-seq analysis of promoters. Conversely, we identi-
fied thousands of new enhancers that have increased
levels of H3K27ac in DKO1 cells, indicating that the
ChIP-seq assay is able to detect high levels of modified
histones in DKO1 cells. These results suggest that DNA
methylation is in fact a primary regulator of the activity
of enhancers. In addition, we found only modest changes
in H3K36me3 or H3K9me3 levels in HCT116 versus
DKO1 cells (data not shown). Our previous studies
using ChIP-chip had suggested that DNA methylation
Figure 8 panel annotations: (A) 111 of 274 genes have DKO1-unique enhancers (40%); 33 of 274 genes have HCT116-unique enhancers (12%). (B) 102 of 1,366 genes have DKO1-unique enhancers (7%); 196 of 1,366 genes have HCT116-unique enhancers (14%).
Figure 8 Genes up-regulated in DKO1 cells have new intragenic enhancers. (A) Graph plotting the number of intragenic enhancers for the
274 genes up-regulated in DKO1 cells. (B) Graph plotting the number of intragenic enhancers for the 1,366 genes down-regulated in DKO1 cells.
Figure 7 Loss of DNA methylation uncovers thousands of distal H3K27ac sites. (A) Venn diagram comparing distal H3K27ac peaks in
HCT116 and DKO1 cells. (B) DNA methylation status of common and unique enhancers. Pie charts show the breakdown of genomic locations for
peaks residing within intergenic, intronic, and exonic regions of the genome. Tag density plots show the density of H3K27ac tags relative to the
centers of these peaks. (C) Distances from the TSS of genes up-regulated (right panel) or down-regulated (left panel) in DKO1 to the nearest categorized
enhancer. Also indicated is the median distance from each category of enhancer to the nearest up-regulated or down-regulated gene.
may be required for H3K9me3 deposition [29]. In fact,
follow-up studies using ChIP-seq showed a major reduc-
tion in H3K9me3 across the entire chromosome 19 (PJ
Farnham and S Iyengar, unpublished data). However, in
those studies DNA methylation levels were reduced by
treatment of cells with the DNMT inhibitor 5-Aza-CR. It
is likely that the effect on H3K9me3 may have been due
to redistribution of the KAP1/SETDB1 histone methyl-
transferase complex due to the activation of the DNA
damage response and not directly due to loss of DNA
methylation [30]. We now show that the H3K9me3 pat-
terns are essentially the same in HCT116 and DKO1 cells.
An analysis of chromosome 19 in DKO1 cells using a
ChIP-chip assay [26] is in agreement with our ChIP-seq
data showing that H3K9me3 and H3K36me3 are not dra-
matically affected in DKO1 cells. Thus, interpretation of
the mechanisms leading to changes in histone marks
caused by 5-Aza-CR must be made with caution.
Unexpectedly, we found that very few genes increased
in expression in DKO1 cells. However, we did identify
two sets of genes whose expression increased by differ-
ent mechanisms in these cells (Figure 10). One set of
genes was up-regulated due to the removal of DNA
methylation from their promoters (de-repressed
genes). The other set of up-regulated genes had
unmethylated, active promoters in HCT116 cells, but
showed increases in overall transcript levels in the
DKO1 cells. Interestingly, the latter set of genes
contained multiple active enhancers within their gene
bodies in DKO1 cells, and these same enhancers
were methylated in HCT116 cells. Intronic enhancers
have been previously described to regulate the genes
they reside within [23,31-33]. Here, we have identi-
fied several hundred genes that may be controlled by
increases in active histones at intragenic enhancers.
However, further functional studies in which these
Figure 9 New H3K27ac peaks are enriched within the gene bodies of up-regulated genes. (A) DNA methylation profiles and tag density
plots for H3K27ac at the promoters of genes containing intronic enhancers (left) and at the locations of intronic enhancers (right). Pie chart
shows the percentage of the 310 intragenic enhancers identified within the gene bodies of the up-regulated genes that are intronic or exonic.
(B) Genome browser snapshot of the genomic region surrounding the SASH1 gene. The H3K27ac track in green shows an increase in signal
within the gene’s intronic regions in DKO1 (light green) relative to HCT116 (dark green).
enhancer regions are knocked out or repressed will
be required to truly determine whether these intronic
enhancers are responsible for the activation of these
genes in DKO1 cells. Both the de-repressed and the
up-regulated genes included tumor suppressor genes,
and the resulting mutant cells display characteristics
of normal cells (slightly slower doubling times rela-
tive to HCT116) as was previously shown by Rhee et
al. [16]. Thus, although reducing DNA methylation
has modest effects on gene expression, perhaps it will
prove suitable as a therapeutic target.
Conclusions
While DNA methylation can play a role in the repres-
sion of gene expression, our studies indicate that it is
likely not the key determinant in the regulation of most
promoters in HCT116 cells. Our finding that loss of
DNA methylation does not result in the acquisition of
active histone marks or an increase in gene expression
from most de-methylated promoters suggests that these
promoters may remain in a closed or condensed
conformation. This interpretation is supported by other
studies showing that promoters that lose DNA methylation
in DKO1 cells do not gain accessibility to specific
restriction enzymes [34] and do not gain nucleosome-depleted
regions (FD Lay, Y Liu, and BP Berman, personal
communication). Thus, DNA methylation may be
a consequence, rather than a cause, of promoter silencing.
However, DNA methylation plays a much greater role in
the silencing of distal regulatory elements, with many of
the enhancers activated by loss of DNA methylation fall-
ing within the bodies of their probable target genes.
Materials and methods
Cell growth conditions
The human cell lines HCT116 (ATCC #CCL-247) and
DKO1 [16] were grown in McCoy’s 5A Medium supple-
mented with 10% fetal bovine serum and 1% penicillin/
streptomycin and were harvested for downstream exper-
iments at 80% confluence.
Whole genome bisulfite sequencing
Genomic DNA was collected from HCT116 and DKO1
cells using a Qiagen (Valencia, CA, USA) QIAeasy DNA
mini kit. Genomic DNA (2 μg) was sonicated using a Covaris
to an average molecular weight of 150 bp. Achievement
of the desired size range was verified by Bioanalyzer
(Agilent Technologies, Santa Clara, CA, USA) analysis.
Fragmented DNA was repaired to generate blunt ends
using the END-It kit (Epicentre Biotechnologies, Madison,
WI, USA) according to the manufacturer’s instructions.
Following incubation, the treated DNA was purified using
AmpureX beads (Beckman Coulter, Brea, CA, USA). In
general, magnetic beads were employed for all nucleic
acid purifications in the following protocol. Following
end repair, A-tailing was performed using the NEB
dA-tailing module according to the manufacturer’s in-
structions (New England Biolabs, Ipswich, MA, USA).
Adapters with a 3′ ‘T’ overhang were then ligated to
the end-modified DNA. For whole genome bisulfite
Figure 10 labels: 32,351 genes expressed in HCT116 or DKO1; 29,622 with no significant change in expression; 1,089 de-repressed genes (promoters lose DNA methylation; 40% gain H3K4me3 at their promoters; enriched for tumor suppressor genes; enriched for zinc finger genes); 1,366 down-regulated and 274 up-regulated genes (no significant differences in promoter histone modifications; enriched for intragenic enhancers).
Figure 10 Schematic illustrating the characterization of gene expression categories. Shown is the breakdown of all expressed genes in
HCT116 and DKO1 cells into groups based on their expression in the two cell types. Three categories of genes were identified as de-repressed,
up-regulated or down-regulated.
sequencing, modified Illumina paired-end adapters were
used in which cytosine bases in the adapter are replaced
with 5-methylcytosine bases. Depending on the specific
application, we utilized either Early Access Methylation
Adapter Oligos that do not contain barcodes, or the
adapters present in later versions of the Illumina DNA
Sample Preparation kits, which contain both indices and
methylated cytosines. Ligation was carried out using ultra-
pure, rapid T4 ligase (Enzymatics, Beverly, MA, USA) ac-
cording to the manufacturer’s instructions. The final
product was then purified with magnetic beads to yield an
adapter-ligation mix. Prior to bisulfite conversion, bac-
teriophage lambda DNA that had been through the same
library preparation protocol described above to generate
adapter-ligation mixes was combined with the genomic
sample adapter ligation mix at 0.5% w/w. Adapter-ligation
mixes were then bisulfite converted using the Zymo DNA
Methylation Gold kit (Zymo Research, Orange, CA, USA)
according to the manufacturer’s recommendations. Final
modified product was purified by magnetic beads and
eluted in a final volume of 20 μl. Amplification of one-half
the adapter-ligated library was performed using HiFi-U
Ready Mix (Kapa Biosystems, Wilmington, MA, USA) with
the following protocol: 98°C for 2 minutes; then six cycles
of 98°C for 30 s, 65°C for 15 s, 72°C for 60 s; with a final 72°C
10-minute extension, in a 50 μl total volume reaction.
The final library product was examined on the Agilent
Bioanalyzer, then quantified using the Kapa Biosystems
Library Quantification kit according to the manufacturer’s
instructions. Optimal concentrations to get the right clus-
ter density were determined empirically but tended to
be higher than for non-bisulfite libraries. Libraries were
plated using the Illumina cBot and run on the Hi-Seq
2000 according to the manufacturer’s instructions using
HSCS v.1.5.15.1. Image analysis and base calling were car-
ried out using RTA 1.13.48.0, and deconvolution and fastq
file generation were carried out using CASAVA_v1.7.1a5.
Raw reads were mapped using Bis-SNP [35], and percentage
methyl-C/C was calculated for every CpG dinucleotide
in the human genome. All CpG dinucleotides with a minimum
sequencing coverage of 3× were used for downstream
analyses. To determine the average DNA methylation sur-
rounding a set of promoters or enhancers, HOMER was
used to calculate the percentage methylation across pro-
moter or enhancer regions using the annotatePeaks.pl
script and the ‘-ratio’ option for 2,500 bp surrounding the
regions of interest using a bin size of 100 bp. The resulting
bins were plotted as a heatmap using heatmap.2 in R.
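The per-CpG calculation and the binning described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Bis-SNP or HOMER implementation: the input layout (position, methylated reads, total reads) and the function names are hypothetical, and the 3× coverage cutoff, 2,500 bp flank and 100 bp bin size are taken from the text.

```python
MIN_COVERAGE = 3   # minimum reads per CpG (from the text)
FLANK = 2500       # bp on each side of the region center (from the text)
BIN_SIZE = 100     # bp per bin (from the text)

def percent_methylation(meth_reads, total_reads):
    """Percent methyl-C/C for one CpG, or None if coverage is below 3x."""
    if total_reads < MIN_COVERAGE:
        return None
    return 100.0 * meth_reads / total_reads

def binned_profile(center, cpgs):
    """Average percent methylation per 100 bp bin around `center`.

    `cpgs` is an iterable of (position, meth_reads, total_reads) tuples;
    this layout is an assumption made for the example. Bins without any
    usable CpG are reported as None.
    """
    n_bins = 2 * FLANK // BIN_SIZE
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for pos, meth, total in cpgs:
        offset = pos - center + FLANK
        if not 0 <= offset < 2 * FLANK:
            continue  # CpG lies outside the plotted window
        pct = percent_methylation(meth, total)
        if pct is None:
            continue  # CpG fails the coverage cutoff
        idx = offset // BIN_SIZE
        sums[idx] += pct
        counts[idx] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

Calling `binned_profile` once per promoter or enhancer yields one row of 50 bins; stacking the rows gives the kind of matrix that a heatmap function takes as input.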
ChIP-seq
ChIP assays were performed in replicate for H3K4me3
(Cell Signaling Technology, Danvers, MA, USA; 9751S,
lot# 4), H3K27ac (Active Motif, Carlsbad, CA, USA;
#39133, lot# 21311004), and RNAPII (Covance, Princeton,
NJ, USA; MMS-126R (8WG16), Lot# D12LF0314) in
DKO1 cells and one replicate of RNAPII ChIP-seq was
performed in HCT116 cells, as previously described [36].
The ChIP-seq libraries were sequenced on a HiSeq2000;
the H3K4me3 datasets from HCT116 were available via
ENCODE and were downloaded from the UCSC browser,
accession number [UCSC: wgEncodeEH000949]. All ChIP-seq
data were mapped to hg19 using BWA (default parameters)
and peaks were called using Sole-Search [37,38] with
the following parameters: Permutation: 5; Fragment: 250;
AlphaValue: 0.00010 = 1.0E-4; FDR: 0.00010 = 1.0E-4;
PeakMergeDistance: 0; HistoneBlurLength: 1200 for H3K4me3
and H3K27ac. Each replicate ChIP-seq dataset was ana-
lyzed separately and only peaks present in both replicates
were used for the subsequent analyses; see Additional file 1
for ChIP-seq reproducibility measures. Peaks were sepa-
rated by their proximity to promoters, with promoter-
proximal peaks defined as those found within ±2 kb of a
TSS and promoter-distal as everything else; enhancers
were defined as promoter-distal H3K27ac sites. To create
TSS-centered tag density plots, mapped reads were used
with the HOMER annotatePeaks script [39] and the -hist
option to average ChIP-seq tags relative to all (or in some
cases to a subset of) RefSeq TSSs. To create enhancer-centered
tag density plots, the centers of the H3K27ac
peaks were used. The mergePeaks option in HOMER was
used to determine overlapping H3K27ac regions between
HCT116 and DKO1 at proximal and distal sites. H3K27ac
peak proximity to TSSs was determined using annotate-
Peaks. Gene Ontology analysis was performed using Stan-
ford’s Genomic Regions Enrichment of Annotations Tool
(GREAT) [40].
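The replicate filter and the promoter-proximal versus promoter-distal split described above can be illustrated with a short Python sketch. It is not the Sole-Search or HOMER code. Two choices here are assumptions that the text does not specify: a peak counts as present in both replicates if the intervals overlap by at least 1 bp, and proximity is measured from the peak center to the nearest TSS on the same chromosome. All function names are hypothetical.

```python
from bisect import bisect_left

PROXIMAL_WINDOW = 2000  # bp on either side of a TSS (the +/-2 kb rule in the text)

def distance_to_nearest_tss(position, sorted_tss):
    """Distance in bp from `position` to the closest TSS.

    `sorted_tss` must be a non-empty, ascending list of TSS coordinates
    for a single chromosome.
    """
    i = bisect_left(sorted_tss, position)
    neighbours = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(position - tss) for tss in neighbours)

def classify_peak(start, end, sorted_tss):
    """Label a peak 'proximal' or 'distal' using its center (an assumption)."""
    center = (start + end) // 2
    if distance_to_nearest_tss(center, sorted_tss) <= PROXIMAL_WINDOW:
        return "proximal"
    return "distal"

def peaks_in_both_replicates(rep1, rep2):
    """Keep replicate 1 peaks that overlap any replicate 2 peak by >= 1 bp."""
    return [
        (s1, e1)
        for s1, e1 in rep1
        if any(s1 < e2 and s2 < e1 for s2, e2 in rep2)
    ]
```

Under these assumptions, the enhancer set would correspond to the H3K27ac peaks that pass `peaks_in_both_replicates` and are labeled "distal".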
RNA-seq
RNA-seq was performed for HCT116 and DKO1 in rep-
licate. RNA was collected from cells using Trizol (Life
Technologies/ThermoFisher, Waltham, MA, USA; cata-
log #15596018) and paired-end libraries were prepared
using the Illumina TruSeqV2 Sample Prep Kit (catalog
#15596-026), starting with 1 μg total RNA. Libraries
were sequenced on an Illumina Hi-Seq 2000. RNA-seq
data were analyzed with Partek Flow version 3 (Partek Inc.,
St Louis, MO, USA). Raw reads were trimmed using the
Quality Score method and mapped to hg19 (Ensembl 72)
using Tophat2 [41]. Gencode V17 annotation was used to
quantify the aligned reads using the Partek E/M method.
Quantified reads were normalized using TMM in EdgeR
[42] and then analyzed for differential expression using
Partek’s Gene Specific Analysis. Differentially expressed
genes were determined as those with a P-value <0.05 and a
fold change >1.2. Gene expression scatter and volcano
plots were created using ggplot2 in R. To address concerns
about RNA and/or libraries, we prepared two replicates of
RNA from each cell type; a comparison of the replicates
for HCT116 and for DKO1 shows that the RNA-seq
data sets from the two replicates are highly reproducible
(Additional file 7). In addition, visual inspection of the RNA-seq
reads on a genome browser (with no normalization)
shows very similar RNA profiles in HCT116 and DKO1
cells, except for the set of genes that were identified to
have altered expression.
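The differential expression thresholds stated above (P-value <0.05 and fold change >1.2) can be applied as in the following Python sketch. It is an illustration only and does not reproduce Partek's Gene Specific Analysis. The function names are hypothetical, the fold change is assumed to be symmetric (a 1.2-fold change in either direction qualifies), and non-positive expression values are rejected rather than adjusted.

```python
P_CUTOFF = 0.05    # P-value threshold from the text
FC_CUTOFF = 1.2    # fold-change threshold from the text

def fold_change(expr_hct116, expr_dko1):
    """Symmetric fold change (always >= 1) between the two cell types."""
    if expr_hct116 <= 0 or expr_dko1 <= 0:
        # How zero expression is handled is not stated in the text,
        # so this sketch leaves that decision to the caller.
        raise ValueError("expression values must be positive")
    ratio = expr_dko1 / expr_hct116
    return ratio if ratio >= 1 else 1 / ratio

def is_differential(p_value, expr_hct116, expr_dko1):
    """True when both the P-value and fold-change criteria are met."""
    return p_value < P_CUTOFF and fold_change(expr_hct116, expr_dko1) > FC_CUTOFF

def direction(expr_hct116, expr_dko1):
    """'up' if expression is higher in DKO1 than in HCT116, else 'down'."""
    return "up" if expr_dko1 > expr_hct116 else "down"
```

The additional criteria that separate de-repressed genes from up-regulated genes are not part of this sketch.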
Data access
WGBS and RNA-seq datasets for HCT116 and DKO1
cells were produced for this manuscript and are deposited
in Gene Expression Omnibus (GEO) [GEO: GSE60106].
H3K27ac and H3K4me3 ChIP-seq datasets in DKO1 cells
and RNAPII ChIP-seq data for both HCT116 and DKO1
cells were produced for this manuscript and are deposited
in GEO [GEO: GSE60106]. H3K4me3 [UCSC: wgEncodeEH000949],
H3K27ac [UCSC: wgEncodeEH002873],
H3K4me1 [UCSC: wgEncodeEH002874] and part of the
RNAPII [UCSC: wgEncodeEH000651] HCT116 ChIP-seq
data were produced as part of the ENCODE Consortium
[17] and can be downloaded from the UCSC genome
browser.
Additional files
Additional file 1: Sequencing and peak metrics for all ChIP-seq,
RNA-seq, and WGBS datasets. ChIP-seq reads were mapped using BWA
and peaks were called using Sole-Search [37,38]. All data was collected as
part of this study except for H3K4me3 from HCT116, which was from
ENCODE. However, for consistency, the reads were remapped and peaks
were called for this dataset using Sole-Search. High-confidence (HC)
peaks were determined as those present in two independent biological
replicates. Promoter-proximal or promoter-distal peaks were selected by
their proximity to a TSS, with proximal being those peaks within 2000 bp
of a TSS and distal being everything else.
Additional file 2: ChIP-seq peaks in HCT116 cells. Worksheet
containing all peaks called using Sole-Search for H3K27ac (Sheet 1),
H3K4me3 (Sheet 2) and RNAPII (Sheet 3) in HCT116 cells.
Additional file 3: ChIP-seq peaks in DKO1 cells. Worksheet containing
all peaks called using Sole-Search for H3K27ac (Sheet 1), H3K4me3
(Sheet 2) and RNAPII (Sheet 3) in DKO1 cells.
Additional file 4: Characterization of H3K4me3 promoter-proximal
peaks in HCT116 and DKO1 cells. Venn diagram showing differences
in binding sites for promoter-proximal H3K4me3 peaks (top), and the
density of H3K4me3 ChIP-seq tags in HCT116 and DKO1 for all three peak
categories (bottom).
Additional file 5: Characterization of H3K27ac promoter-proximal
peaks in HCT116 and DKO1 cells. Venn diagram showing differences in
binding sites for promoter-proximal H3K27ac peaks (top), and the density
of H3K27ac ChIP-seq tags in HCT116 and DKO1 for all three peak categories
(bottom).
Additional file 6: Characterization of RNAP2 promoter-proximal
peaks in HCT116 and DKO1 cells. Venn diagram showing differences in
binding sites for promoter-proximal RNA polymerase 2 peaks (top), and
the density of RNA polymerase 2 ChIP-seq tags in HCT116 and DKO1 for
all three peak categories (bottom).
Additional file 7: Replicate RNA-seq analysis. log2 expression values
are plotted comparing replicates of the same cell type, and across cell
types for HCT116 and DKO1 cells.
Additional file 8: RNA-seq analysis of HCT116 and DKO1 cells.
Worksheet containing the output from RNA-seq analysis using Partek
Flow version 3.0.14.0321. Sheet 1 contains genes, statistics, and expression
values for all transcripts from the gene-specific analysis. Sheet 2 lists all
de-repressed genes, and their expression values in HCT116 and DKO1. Sheet
3 lists all up-regulated genes and their expression values in HCT116 and
DKO1. Sheet 4 lists all down-regulated genes and their expression values in
HCT116 and DKO1.
Additional file 9: Differentially expressed tumor suppressor and
zinc finger genes. Sheet 1 lists all de-repressed tumor suppressor
genes and their expression values in HCT116 and DKO1. Sheet 2 lists all
up-regulated tumor suppressor genes and their expression values in
HCT116 and DKO1. Sheet 3 lists all de-repressed zinc finger genes and
their expression values in HCT116 and DKO1.
Additional file 10: ChIP-seq peaks of enhancer regions. Sheet 1 lists
all H3K27ac peaks that are unique to HCT116 cells. Sheet 2 lists all
H3K27ac peaks that are unique to DKO1. Sheet 3 lists all H3K27ac peaks
that are common to HCT116 and DKO1. Sheet 4 lists all H3K27ac peaks
present within intragenic regions of genes up-regulated in DKO1.
Abbreviations
5-Aza-CR: 5-azacytidine; bp: base pair; ChIP: chromatin immunoprecipitation;
DNMT: DNA methyltransferase; GEO: Gene Expression Omnibus; RNAPII: RNA
polymerase II; TSS: transcription start site; WGBS: whole genome bisulfite
sequencing.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
AB prepared samples, performed analyses for figures, and assisted in drafting
the manuscript, LY performed analyses for figures, HW and YG performed
ChIP-seq experiments, CMN and BPB carried out the DNA methylation
library methods development, PJF conceived of the project, advised
on all experiments and drafted the manuscript. All authors read and
approved the final manuscript.
Acknowledgments
The H3K4me3 HCT116 ChIP-seq was produced by the laboratory of John
Stamatoyannopoulos at the University of Washington and the H3K27ac and
H3K4me1 HCT116 ChIP-seq data was produced by the Farnham laboratory,
each as part of the ENCODE Consortium [17] and is available at
(http://genome.ucsc.edu); all data used in this study is past the 9 month moratorium.
One replicate of H3K4me3 and one replicate of H3K27ac ChIP-seq in DKO, in
addition to one replicate of RNA-seq in each cell type was produced by Fides
Lay in the laboratory of Peter Jones at the University of Southern California. We
thank them for this data and for their commentary on the project. This work
was supported in part by 1U01ES017154, U54HG006996, and P30CA014089
from the National Cancer Institute; AB was supported in part by a pre-doctoral
training fellowship from the National Human Genome Research Institute of the
National Institutes of Health under grant number F3100HG6114. BB, YL, and LY
were supported in part by National Institutes of Health under grant number
U01ES017054. We thank Selene Tyndale and Helen Truong for services rendered
at the USC Molecular Genomics Sequencing core. We thank Meng Li and Yibu
Chen at the USC Norris Medical Library Bioinformatics Services Program for their
assistance in running Partek Flow. We thank Matt Grimmer and the members of
Farnham laboratory for helpful discussions.
Author details
1 Norris Comprehensive Cancer Center, University of Southern California, Los
Angeles, CA 90089, USA. 2 Integrated Genetics and Genomics, University of
California-Davis, Davis, CA 95616, USA. 3 Department of Biochemistry &
Molecular Biology, Norris Comprehensive Cancer Center, University of
Southern California, Los Angeles, CA 90089-9601, USA.
Received: 31 May 2014 Accepted: 11 September 2014
References
1. Boyes J, Bird A: Repression of genes by DNA methylation depends on
CpG density and promoter strength: evidence for involvement of a
methyl-CpG binding protein. EMBO J 1992, 11:327.
2. Cedar H, Siegfried Z, Eden S, Mendelsohn M, Feng X, Tsuberi B-Z: DNA
methylation represses transcription in vivo. Nat Genet
1999, 22:203–206.
3. Jones PA: The Role of DNA Methylation in Mammalian Epigenetics.
Science 2001, 293:1068–1070.
4. Laurent L, Wong E, Li G, Huynh T, Tsirigos A, Ong CT, Low HM, Kin Sung
KW, Rigoutsos I, Loring J, Wei C-L: Dynamic changes in the human
methylome during differentiation. Genome Res 2010, 20:320–331.
5. Hinoue T, Weisenberger DJ, Lange CPE, Shen H, Byun H-M, Van Den Berg D,
Malik S, Pan F, Noushmehr H, van Dijk CM, Tollenaar RAEM, Laird PW:
Genome-scale analysis of aberrant DNA methylation in colorectal cancer.
Genome Res 2012, 22:271–282.
6. Kelly TK, Liu Y, Lay FD, Liang G, Berman BP, Jones PA: Genome-wide
mapping of nucleosome positioning and DNA methylation within
individual DNA molecules. Genome Res 2012, 22:2497–2506.
7. Xie W, Schultz MD, Lister R, Hou Z, Rajagopal N, Ray P, Whitaker JW, Tian S,
Hawkins RD, Leung D, Yang H, Wang T, Lee AY, Swanson SA, Zhang J, Zhu
Y, Kim A, Nery JR, Urich MA, Kuan S, Yen C-A, Klugman S, Yu P, Suknuntha K,
Propson NE, Chen H, Edsall LE, Wagner U, Li Y, Ye Z, et al: Epigenomic
analysis of multilineage differentiation of human embryonic stem cells.
Cell 2013, 153:1134–1148.
8. Zhu Y, Sun L, Chen Z, Whitaker JW, Wang T, Wang W: Predicting enhancer
transcription and activity from chromatin modifications. Nucleic Acids Res
2013, 41:10032–10043.
9. Sakai T, Toguchida J, Ohtani N, Yandell DW, Rapaport JM, Dryja TP: Allele-specific
hypermethylation of the retinoblastoma tumor-suppressor gene. Am J Hum
Genet 1991, 48:880–888.
10. Merlo A, Herman JG, Mao L, Lee DJ, Gabrielson E, Burger PC, Baylin SB,
Sidransky D: 5′ CpG island methylation is associated with transcriptional
silencing of the tumour suppressor p16/CDKN2/MTS1 in human cancers.
Nat Med 1995, 1:686–692.
11. Esteller M, Sanchez-Cespedes M, Rosell R, Sidransky D, Baylin SB, Herman JG:
Detection of aberrant promoter hypermethylation of tumor suppressor
genes in serum DNA from non-small cell lung cancer patients. Cancer Res
1999, 59:67–70.
12. Komashko VM, Farnham PJ: 5-azacytidine treatment reorganizes genomic
histone modification patterns. Epigenetics 2010, 5:229–240.
13. Kaminskas E, Farrell AT, Wang Y-C, Sridhara R, Pazdur R: FDA drug approval
summary: azacitidine (5-azacytidine, Vidaza) for injectable suspension.
Oncologist 2005, 10:176–182.
14. Li LH, Olin EJ, Buskirk HH, Reineke LM: Cytotoxicity and mode of
action of 5-azacytidine on L1210 leukemia. Cancer Res 1970,
30:2760–2769.
15. Stresemann C, Lyko F: Modes of action of the DNA methyltransferase
inhibitors azacytidine and decitabine. Int J Cancer 2008, 123:8–13.
16. Rhee I, Bachman KE, Park BH, Jair K-W, Yen R-WC, Schuebel KE, Cui H,
Feinberg AP, Lengauer C, Kinzler KW, Baylin SB, Vogelstein B: DNMT1 and
DNMT3b cooperate to silence genes in human cancer cells. Nature 2002,
416:552–556.
17. ENCODE Project Consortium: An integrated encyclopedia of DNA
elements in the human genome. Nature 2012, 489:57–74.
18. Hellman A, Chess A: Gene body-specific methylation on the active X
chromosome. Science 2007, 315:1141–1143.
19. Feng S, Cokus SJ, Zhang X, Chen P-Y, Bostick M, Goll MG, Hetzel J, Jain J,
Strauss SH, Halpern ME, Ukomadu C, Sadler KC, Pradhan S, Pellegrini M,
Jacobsen SE: Conservation and divergence of methylation patterning in
plants and animals. Proc Natl Acad Sci 2010, 107:8689–8694.
20. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery
JR, Lee L, Ye Z, Ngo Q-M, Edsall L, Antosiewicz-Bourget J, Stewart R, Ruotti
V, Millar AH, Thomson JA, Ren B, Ecker JR: Human DNA methylomes at
base resolution show widespread epigenomic differences. Nature 2009,
462:315–322.
21. Zhao M, Sun J, Zhao Z: TSGene: a web resource for tumor suppressor
genes. Nucleic Acids Res 2012, 41:D970–D976.
22. Coghlin C, Carpenter B, Dundas SR, Lawrie LC, Telfer C, Murray GI:
Characterization and over-expression of chaperonin t-complex proteins
in colorectal cancer. J Pathol 2006, 210:351–357.
23. Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E,
Sheffield NC, Stergachis AB, Wang H, Vernot B, Garg K, John S, Sandstrom R,
Bates D, Boatman L, Canfield TK, Diegel M, Dunn D, Ebersol AK, Frum T,
Giste E, Johnson AK, Johnson EM, Kutyavin T, Lajoie B, Lee B-K, Lee K,
London D, Lotakis D, Neph S, et al: The accessible chromatin landscape of
the human genome. Nature 2012, 489:75–82.
24. Paz MF, Wei S, Cigudosa JC, Rodriguez-Perales S, Peinado MA, Huang TH-M,
Esteller M: Genetic unmasking of epigenetically silenced tumor suppressor
genes in colon cancer cells deficient in DNA methyltransferases. Hum Mol
Genet 2003, 12:2209–2219.
25. Jacinto FV, Ballestar E, Ropero S, Esteller M: Discovery of epigenetically
silenced genes by methylated DNA immunoprecipitation in colon cancer
cells. Cancer Res 2007, 67:11481–11486.
26. Hahn MA, Wu X, Li AX, Hahn T, Pfeifer GP: Relationship between gene
body DNA methylation and intragenic H3K9me3 and H3K36me3
chromatin marks. PLoS One 2011, 6:e18844.
27. Blattler A, Yao L, Wang Y, Ye Z, Jin VX, Farnham PJ: ZBTB33 binds unmethylated
regions of the genome associated with actively expressed genes.
Epigenetics Chromatin 2013, 6:13.
28. Blattler A, Farnham PJ: Cross-talk between site-specific transcription
factors and DNA methylation states. J Biol Chem 2013, 288:34287–34294.
29. Komashko VM, Acevedo LG, Squazzo SL, Iyengar SS, Rabinovich A, O’geen
H, Green R, Farnham PJ: Using ChIP-chip technology to reveal common
principles of transcriptional repression in normal and cancer cells.
Genome Res 2008, 18:521–532.
30. Iyengar S, Farnham PJ: KAP1 Protein: An Enigmatic Master Regulator of
the Genome. J Biol Chem 2011, 286:26267–26276.
31. Zhang X, Zhou L, Fu G, Sun F, Shi J, Wei J, Lu C, Zhou C, Yuan Q, Yang M:
The identification of an ESCC susceptibility SNP rs920778 that regulates
the expression of lncRNA HOTAIR via a novel intronic enhancer.
Carcinogenesis 2014, 31:2062–2067.
32. Mercer TR, Edwards SL, Clark MB, Neph SJ, Wang H, Stergachis AB, John S,
Sandstrom R, Li G, Sandhu KS, Ruan Y, Nielsen LK, Mattick JS,
Stamatoyannopoulos JA: DNase I-hypersensitive exons colocalize with
promoters and distal regulatory elements. Nat Genet 2013, 45:852–859.
33. Yoshida T, Landhuis E, Dose M, Hazan I, Zhang J, Naito T, Jackson AF, Wu J,
Perotti EA, Kaufmann C, Gounari F, Morgan BA, Georgopoulos K:
Transcriptional regulation of the Ikzf1 locus. Blood 2013, 122:3149–3159.
34. Pandiyan K, You JS, Yang X, Dai C, Zhou XJ, Baylin SB, Jones PA, Liang G:
Functional DNA demethylation is accompanied by chromatin
accessibility. Nucleic Acids Res 2013, 41:3973–3985.
35. Liu Y, Siegmund KD, Laird PW, Berman BP: Bis-SNP: Combined DNA
methylation and SNP calling for Bisulfite-seq data. Genome Biol 2012, 13:R61.
36. O’geen H, Nicolet CM, Blahnik K, Green R, Farnham PJ: Comparison
of sample preparation methods for ChIP-chip assays. Biotech 2006,
41:577–580.
37. Blahnik KR, Dou L, O’geen H, McPhillips T, Xu X, Cao AR, Iyengar S, Nicolet
CM, Ludäscher B, Korf I, Farnham PJ: Sole-Search: an integrated analysis
program for peak detection and functional annotation using ChIP-seq
data. Nucleic Acids Res 2010, 38:e13.
38. Blahnik KR, Dou L, Echipare L, Iyengar S, O’geen H, Sanchez E, Zhao Y,
Marra MA, Hirst M, Costello JF, Korf I, Farnham PJ: Characterization of the
contradictory chromatin signatures at the 3′ exons of zinc finger genes.
PLoS One 2011, 6:e17121.
39. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C,
Singh H, Glass CK: Simple combinations of lineage-determining transcription
factors prime cis-regulatory elements required for macrophage and B cell
identities. Mol Cell 2010, 38:576–589.
40. McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger AM,
Bejerano G: GREAT improves functional interpretation of cis-regulatory
regions. Nat Biotechnol 2010, 28:1630–1639.
41. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL: TopHat2:
accurate alignment of transcriptomes in the presence of insertions,
deletions and gene fusions. Genome Biol 2013, 14:R36.
42. Robinson MD, Oshlack A: A scaling normalization method for differential
expression analysis of RNA-seq data. Genome Biol 2010, 11:R25.
doi:10.1186/s13059-014-0469-0
Cite this article as: Blattler et al.: Global loss of DNA methylation
uncovers intronic enhancers in genes showing expression changes.
Genome Biology 2014 15:469.
Appendix B: Supplementary Data Files in DVD
Supplementary Data File 2.1 Exon, TSS, and enhancer correlated SNPs. (xlsx file) Individual
worksheets are provided for all correlated SNPs identified in TSS regions, all correlated SNPs
found in exons, the subset of nonsynonymous correlated SNPs found in exons, and all correlated
SNPs found in the set of 28 enhancers. In all cases, the tag SNP, the LD of the correlated SNP, and
genomic information for each SNP are provided.
Supplementary Data File 2.2. H3K27Ac ChIP-seq peaks. (xlsx file) The sets of H3K27Ac peaks called
on merged replicates for HCT116 and sigmoid colon datasets are provided in individual
worksheets.
Supplementary Data File 2.3. Distal regulatory regions correlated with CRC tag SNPs. (xlsx file)
The tag SNP and the correlated SNPs for 28 distal, robust H3K27Ac regions are indicated; each
region is classified according to its presence or absence in HCT116 (H), sigmoid colon (S), or
SW480 (SW) cells. The 3 nearest protein-coding genes and 3 nearest non-coding RNAs were
identified using the GENCODE V15 gene annotation. The red text indicates that the same gene
was identified by SNPs within the TSS (r² > 0.5) and the blue color means that the same gene was
identified by a SNP within a coding exon (r² > 0.1); see Table 2.2. The strike-out indicates that
the RNA is present at less than 0.5 FPKM in HCT116 and sigmoid colon cells; bold indicates that
the transcript is higher than 2 FPKM.
Supplementary Data File 2.4. Motif analysis of all correlated SNPs in enhancers. (xlsx file) All SNPs
having an r² > 0.1 with the 25 CRC tag SNPs were analyzed using motifs from Factorbook
(http://www.factorbook.org). For those SNPs that impacted a critical position of a motif, it was
determined if the change was predicted to be an improvement or a disruption. A more restrictive
list including only the subset of SNP-affected motifs within the robust enhancer regions (using an
r² > 0.5 cut-off) is shown in Supplementary Data File 2.5.
Supplementary Data File 2.5. Motif analysis of enhancer SNPs having r² > 0.5 with a tag SNP. (xlsx
file) Shown are the motifs in the risk-associated enhancers that are affected by the correlated
SNPs; “-” means that the alternative allele of the correlated SNP should disrupt TF binding and
“+” means that the alternative allele of the correlated SNP should improve TF binding.
Supplementary Data File 2.6. Expression of transcription factors with affected motifs. (xlsx file)
Shown are the names of the transcription factors (TFs), the expression levels in HCT116 and sigmoid
colon cells, the differential expression ratio (calculated both ways), and the gene id and genomic
location of each TF.
Supplementary Data File 2.7. eQTL analysis results. (xlsx file) Shown are the results for the SNP-
gene pairs using 60 SNPs (6 tag SNPs, 18 SNPs within risk enhancers, and 45 SNPs within TSS
regions) present on the GW SNP6 array and the expression of genes identified by exon or TSS
SNPs or by differential expression analysis (see Table 2.2 and Table 2.5).
Supplementary Data File 2.8. RNA analysis of cells having a deletion of enhancer 7. (xlsx file)
Shown are the genes whose expression decreased in the cells in which enhancer 7 was deleted, as
determined using Illumina HumanHT-12 v4 Expression BeadChip arrays.
Supplementary Data File 2.9. TCGA sample IDs. (xlsx file) The IDs for HM450K DNA methylation,
RNA-seq, SNP arrays, and copy number variation analyses for 228 TCGA samples are provided,
as well as the IDs for 254 samples used in normal-tumor gene expression analyses.
Supplementary Data File 2.10. Summary of putative CRC risk-associated genes. (xlsx file) The
names of the 80 genes identified as potentially involved in an increased risk for colon cancer are
indicated, along with the location of the SNP(s) that identified the gene. In some cases, the gene
was identified by more than one type of SNP. In the Enhancer column, “nearest 3” and “nearest
20” refer to the position of the gene relative to the location of the enhancer. In the Expressed
column, Yes means that the gene is expressed in colon cells and TCGA means that the gene
showed differential expression in a cohort of the TCGA normal vs. colon tumor samples.
Supplementary Data File 3.1. TCGA DNA methylation and RNA-seq sample ID numbers. (xlsx file)
The data platform and archive version number are listed in the sheet named “Version number”.
The “DNA methylation sample ID” sheet provides information concerning the TCGA sample ID,
the tissue type (normal or tumor), and the cancer type for the DNA methylation data sets. The
“RNA-seq sample ID” sheet provides information concerning the TCGA sample ID, the tissue
type (normal or tumor), and the cancer type for the RNA-seq data sets.
Supplementary Data File 3.2. Distal enhancer probes on the HM450 array. (xlsx file) The
chromosomal location and the name of each of the 102,518 distal enhancer probes used in this
study are indicated.
Supplementary Data File 3.3. Hypo- and hypermethylated enhancer probes identified for each
tumor type. (xlsx file) Individual worksheets are provided that list the hypermethylated probes
and hypomethylated probes identified for each specific cancer type.
Supplementary Data File 3.4. Probe-gene pairs showing inverse correlations between methylation
and expression. (xlsx file) Individual worksheets are provided that list all of the significant
probe-gene pairs for hypomethylated probes and hypermethylated probes in each cancer type. Pr
represents the real P value from the Mann–Whitney U test for each pair; Pe represents the empirical P
value for each pair; the distance between the probes and the putative target genes is shown in the
Distance column; the ranking based on the relative distance of the putative target gene amongst
the 20 adjacent genes (10 on either side of the enhancer) is shown; and the cancer type (CT) is
indicated. The “P-value for promoter methylation” column specifies the anticorrelation between
methylation at the promoter itself and the expression level of the gene. This is calculated using
the same Mann–Whitney statistic we use to evaluate enhancer-expression correlation, but we
average beta values within the standard promoter methylation region, from −300 to +500bp
relative to the transcription start site (TSS). This region is consistently methylated in all active
promoters based on whole-genome bisulfite sequencing (in cell lines (Wang et al., 2013b) and
primary TCGA tumors, manuscript in preparation). As with our enhancer method, we select for
strong changes in methylation, filtering out any case where 95% of samples have methylation less
than 0.3; in the spreadsheet, these have a p-value of 1.0. The worksheet “tumor vs. normal
expression” contains a table showing that the majority of target genes linked to hypermethylated
enhancers have lower expression in tumors than normal tissues, while the majority linked to
hypomethylated enhancers have higher expression in tumors. The worksheet “promoter
methylation” shows the fraction of enhancer-linked genes in each cancer type that also have
significant correlation with methylation of the promoter. It is under 10% of genes for all cancer
types.
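The promoter-level calculation described for this file (averaging beta values from −300 to +500 bp relative to the TSS, then filtering out cases where 95% of samples have methylation below 0.3) can be sketched in Python as follows. This is a hedged illustration, not the code used to build the spreadsheet: the function names and the (position, beta) input layout are hypothetical, and "95% of samples" is interpreted here as "at least 95%".

```python
UPSTREAM = 300     # bp upstream of the TSS (from the text)
DOWNSTREAM = 500   # bp downstream of the TSS (from the text)

def promoter_mean_beta(tss, strand, probes):
    """Mean beta value of probes located from -300 to +500 bp of the TSS.

    `probes` is a list of (position, beta) tuples for one sample; the
    layout is an assumption. Returns None when no probe is in the window.
    """
    in_window = []
    for pos, beta in probes:
        # Orient the offset so that positive values are downstream of the TSS.
        rel = pos - tss if strand == "+" else tss - pos
        if -UPSTREAM <= rel <= DOWNSTREAM:
            in_window.append(beta)
    if not in_window:
        return None
    return sum(in_window) / len(in_window)

def passes_methylation_filter(sample_betas, threshold=0.3, fraction=0.95):
    """False when at least 95% of samples have methylation below 0.3."""
    low = sum(1 for beta in sample_betas if beta < threshold)
    return low / len(sample_betas) < fraction
```

Genes that fail `passes_methylation_filter` would correspond to the entries reported with a p-value of 1.0 in the spreadsheet.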
Supplementary Data File 3.5. Summary of enriched motifs for enhancer-gene pairs with
hypomethylated distal enhancers. (xlsx file) On the worksheet entitled “Summary”, the fold
enrichment of each indicated motif in a specific cancer type (CT) is shown. Shown in the
Enhancer column is the number of the paired enhancers (after clustering distal enhancer probes
within 500bp) containing the enriched motif (the percentage of total paired hypomethylated
enhancers in that cancer type containing each motif is shown in parentheses). For each motif, also
shown is the number of genes linked to the probes containing the motif and the number of probe-
gene pairs. The worksheet entitled “Detail” contains information for each individual probe linked
to a putative target gene via a distal region containing an enriched motif. Pe represents the
empirical P value for each pair; the distance between the probes and the putative target genes is
shown in the Distance column; the ranking based on the relative distance of the putative target
gene amongst the 20 adjacent genes (10 on either side of the enhancer) is shown; and the cancer
type (CT) is indicated.
Supplementary Data File 3.6. Plots of association between all human TFs and DNA methylation at
enriched motif sites. (PDF file) Shown are TF ranking plots based on the score (−log10(Pr)) of
association between TF expression and DNA methylation of the motif in the cancer type in which
the motifs are enriched. The dashed blue line indicates the boundary of the top 5% association
score. The top 3 associated TFs and the TF family members (dots in red) that are associated with
that specific motif are labeled in the plot.
Supplementary Data File 3.7. Scatter plots for TF family members significantly associated with
DNA methylation at distal enhancer regions having enriched motifs. (PDF file) Shown are
scatter plots for average DNA methylation at probes having the indicated enriched motif (x axis,
shown on the top of each set of panels) vs. the expression of the significantly correlated motif-
relevant TF family members (y axis, shown on the right side of each panel). Each dot represents a
different patient sample; red and green indicate the tumor and normal samples, respectively. Pairs
that are within the top 5% of TFs linked to a given motif are indicated with a number inside the
cell. The number corresponds to the rank of the given TF relative to all 1,777 TFs (with “1” being
the most strongly correlated).
Supplementary Data File 3.8. Survival plots for TF family members significantly associated with
DNA methylation at distal enhancer regions having enriched motifs. (PDF file) (A) The
output of a Cox model regression analysis for the effects of expression of RUNX1 on survival
within KIRC samples. Leukocyte methylation signature was calculated as in (PMID 22120008),
and staging information was taken from TCGA clinical data. Leukocyte methylation signature
was included to rule out RUNX1 expression from contaminating leukocytes, which are the main
source of non-cancer cells in KIRC samples. (B) Kaplan-Meier survival curves for TF family
members significantly associated with DNA methylation at the distal enhancer regions with
enriched motifs in the indicated cancer type. The survival data for patients having tumors with the
highest (top 30%) and lowest (bottom 30%) transcription factor expression is shown; the Log
Rank test P value between the high and low groups is indicated.
Supplementary Data File 3.9. Overlapping analysis between putative probe-gene pairs in BRCA
and interactions from ChIA-PET data from the MCF7 breast cancer cell line. (xlsx file) A
list of putative probe-gene pairs in BRCA that overlap with interactions from ChIA-PET data
from the MCF7 cell line is provided. The bar graph shows the comparison of the number of
probe-gene pairs identified within MCF7 ChIA-PET data using the putative pairs from BRCA vs.
random pairs. The random pairs were generated by randomly selecting the same number of
probes from the set of distal enhancer probes, and pairing each with one or more of the 20
adjacent genes; the number of links made for each random probe was identical to the
corresponding “true” probe. Thus, the random linkage set has both the same number of probes
and the same number of linked genes as the true set. 100 such random datasets were generated to
arrive at a 95% confidence interval (±1.96 × SD).
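The random-pair control described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the code used in the study: the function names, the input data structures, sampling probes without replacement, and the use of the sample standard deviation are all assumptions.

```python
import random
import statistics


def random_linkage_set(true_links, enhancer_probes, nearby_genes, rng):
    """Build one random probe-gene set with the same shape as the true set.

    true_links: dict probe -> list of linked genes (the putative pairs).
    enhancer_probes: list of all distal enhancer probe IDs.
    nearby_genes: dict probe -> list of its 20 adjacent genes.
    """
    link_counts = [len(genes) for genes in true_links.values()]
    # Same number of probes as the true set (drawn without replacement).
    probes = rng.sample(enhancer_probes, len(link_counts))
    pairs = set()
    for probe, n_links in zip(probes, link_counts):
        # Same number of links as the corresponding true probe.
        for gene in rng.sample(nearby_genes[probe], n_links):
            pairs.add((probe, gene))
    return pairs


def random_overlap_interval(true_links, enhancer_probes, nearby_genes,
                            supported_pairs, n_sets=100, seed=0):
    """Mean +/- 1.96 * SD of the overlap count across n_sets random sets.

    supported_pairs: set of (probe, gene) pairs supported by interaction data.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sets):
        pairs = random_linkage_set(true_links, enhancer_probes,
                                   nearby_genes, rng)
        counts.append(len(pairs & supported_pairs))
    mean = statistics.mean(counts)
    sd = statistics.stdev(counts)
    return mean - 1.96 * sd, mean + 1.96 * sd
```

The overlap count for the true pairs would then be compared against the returned interval.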
Supplementary Data File 3.10. Transcription factors significantly associated with multiple different
motifs. (xlsx file) Each row represents an individual transcription factor and each column represents
a different cancer type. The numbers in the table show how many enriched motifs each
transcription factor is associated with in each cancer type; a transcription factor must be in the top
1% of all ranked TFs for a given motif in a given cancer type to be listed in the table.
Supplementary Data File 3.11. List of the TFs used in this study. (xlsx file) Shown is the list of TFs
used to compare expression analysis of all TFs to motif methylation in the different cancer types,
taken from.
References
Adzhubei, I. A., Schmidt, S., Peshkin, L., Ramensky, V. E., Gerasimova, A., Bork, P., Kondrashov, A. S.
& Sunyaev, S. R. (2010). A method and server for predicting damaging missense mutations. Nat
Methods, 7, 248-9.
Aguilera, C., Nakagawa, K., Sancho, R., Chakraborty, A., Hendrich, B. & Behrens, A. (2011). c-Jun N-
terminal phosphorylation antagonises recruitment of the Mbd3/NuRD repressor complex. Nature,
469, 231-5.
Akhtar-Zaidi, B., Cowper-Sal-lari, R., Corradin, O., Saiakhova, A., Bartels, C. F., Balasubramanian, D.,
Myeroff, L., Lutterbaugh, J., Jarrar, A., Kalady, M. F., Willis, J., Moore, J. H., Tesar, P. J.,
Laframboise, T., Markowitz, S., Lupien, M. & Scacheri, P. C. (2012). Epigenomic enhancer
profiling defines a signature of colon cancer. Science, 336, 736-9.
Alexa, A. & Rahnenfuhrer, J. (2010). topGO: Enrichment analysis for Gene Ontology. R package version
2.18.0.
Andersson, R., Gebhard, C., Miguel-Escalada, I., Hoof, I., Bornholdt, J., Boyd, M., Chen, Y., Zhao, X.,
Schmidl, C., Suzuki, T., Ntini, E., Arner, E., Valen, E., Li, K., Schwarzfischer, L., Glatz, D.,
Raithel, J., Lilje, B., Rapin, N., Bagger, F. O., Jorgensen, M., Andersen, P. R., Bertin, N.,
Rackham, O., Burroughs, A. M., Baillie, J. K., Ishizu, Y., Shimizu, Y., Furuhata, E., Maeda, S.,
Negishi, Y., Mungall, C. J., Meehan, T. F., Lassmann, T., Itoh, M., Kawaji, H., Kondo, N.,
Kawai, J., Lennartsson, A., Daub, C. O., Heutink, P., Hume, D. A., Jensen, T. H., Suzuki, H.,
Hayashizaki, Y., Muller, F., Forrest, A. R., Carninci, P., Rehli, M. & Sandelin, A. (2014). An
atlas of active enhancers across human cell types and tissues. Nature, 507, 455-61.
Aran, D. & Hellman, A. (2013). DNA methylation of transcriptional enhancers and cancer predisposition.
Cell, 154, 11-3.
Aran, D., Sabato, S. & Hellman, A. (2013). DNA methylation of distal regulatory sites characterizes
dysregulation of cancer genes. Genome Biol, 14, R21.
Arnold, C. D., Gerlach, D., Spies, D., Matts, J. A., Sytnikova, Y. A., Pagani, M., Lau, N. C. & Stark, A.
(2014). Quantitative genome-wide enhancer activity maps for five Drosophila species show
functional enhancer conservation and turnover during cis-regulatory evolution. Nat Genet, 46,
685-92.
Arnold, C. D., Gerlach, D., Stelzer, C., Boryn, L. M., Rath, M. & Stark, A. (2013). Genome-wide
quantitative enhancer activity maps identified by STARR-seq. Science, 339, 1074-7.
Arnone, M. I., Dmochowski, I. J. & Gache, C. (2004). Using reporter genes to study cis-regulatory
elements. Methods Cell Biol, 74, 621-52.
Attanasio, C., Nord, A. S., Zhu, Y., Blow, M. J., Li, Z., Liberton, D. K., Morrison, H., Plajzer-Frick, I.,
Holt, A., Hosseini, R., Phouanenavong, S., Akiyama, J. A., Shoukry, M., Afzal, V., Rubin, E. M.,
FitzPatrick, D. R., Ren, B., Hallgrimsson, B., Pennacchio, L. A. & Visel, A. (2013). Fine tuning
of craniofacial morphology by distant-acting enhancers. Science, 342, 1241006.
Aumailley, M. (2013). The laminin family. Cell Adh Migr, 7, 48-55.
Banerji, J., Rusconi, S. & Schaffner, W. (1981). Expression of a beta-globin gene is enhanced by remote
SV40 DNA sequences. Cell, 27, 299-308.
Bar-Joseph, Z., Gerber, G. K., Lee, T. I., Rinaldi, N. J., Yoo, J. Y., Robert, F., Gordon, D. B., Fraenkel,
E., Jaakkola, T. S., Young, R. A. & Gifford, D. K. (2003). Computational discovery of gene
modules and regulatory networks. Nat Biotechnol, 21, 1337-42.
Bass, A. J., Lawrence, M. S., Brace, L. E., Ramos, A. H., Drier, Y., Cibulskis, K., Sougnez, C., Voet, D.,
Saksena, G., Sivachenko, A., Jing, R., Parkin, M., Pugh, T., Verhaak, R. G., Stransky, N., Boutin,
A. T., Barretina, J., Solit, D. B., Vakiani, E., Shao, W., Mishina, Y., Warmuth, M., Jimenez, J.,
Chiang, D. Y., Signoretti, S., Kaelin, W. G., Spardy, N., Hahn, W. C., Hoshida, Y., Ogino, S.,
Depinho, R. A., Chin, L., Garraway, L. A., Fuchs, C. S., Baselga, J., Tabernero, J., Gabriel, S.,
Lander, E. S., Getz, G. & Meyerson, M. (2011). Genomic sequencing of colorectal
adenocarcinomas identifies a recurrent VTI1A-TCF7L2 fusion. Nat Genet, 43, 964-8.
Bauer, A. K., Hill, T., 3rd & Alexander, C. M. (2013). The involvement of NRF2 in lung cancer. Oxid
Med Cell Longev, 2013, 746432.
Beck, S., Bernstein, B. E., Campbell, R. M., Costello, J. F., Dhanak, D., Ecker, J. R., Greally, J. M., Issa,
J. P., Laird, P. W., Polyak, K., Tycko, B. & Jones, P. A. (2012). A blueprint for an international
cancer epigenome consortium. A report from the AACR Cancer Epigenome Task Force. Cancer
Res, 72, 6319-24.
Bengtsson, M., Stahlberg, A., Rorsman, P. & Kubista, M. (2005). Gene expression profiling in single
cells from the pancreatic islets of Langerhans reveals lognormal distribution of mRNA levels.
Genome Res, 15, 1388-92.
Bergman, Y. & Cedar, H. (2013). DNA methylation dynamics in health and disease. Nat Struct Mol
Biol, 20, 274-81.
Berman, B. P., Weisenberger, D. J., Aman, J. F., Hinoue, T., Ramjan, Z., Liu, Y., Noushmehr, H., Lange,
C. P., van Dijk, C. M., Tollenaar, R. A., Van Den Berg, D. & Laird, P. W. (2012). Regions of
focal DNA hypermethylation and long-range hypomethylation in colorectal cancer coincide with
nuclear lamina-associated domains. Nat Genet, 44, 40-6.
Bernstein, B. E., Stamatoyannopoulos, J. A., Costello, J. F., Ren, B., Milosavljevic, A., Meissner, A.,
Kellis, M., Marra, M. A., Beaudet, A. L., Ecker, J. R., Farnham, P. J., Hirst, M., Lander, E. S.,
Mikkelsen, T. S. & Thomson, J. A. (2010). The NIH Roadmap Epigenomics Mapping
Consortium. Nat Biotechnol, 28, 1045-8.
Blackwood, E. M. & Kadonaga, J. T. (1998). Going the distance: a current view of enhancer action.
Science, 281, 60-3.
Blahnik, K. R., Dou, L., Echipare, L., Iyengar, S., O'Geen, H., Sanchez, E., Zhao, Y., Marra, M. A., Hirst,
M., Costello, J. F., Korf, I. & Farnham, P. J. (2011). Characterization of the contradictory
chromatin signatures at the 3' exons of zinc finger genes. PLoS One, 6, e17121.
Blahnik, K. R., Dou, L., O'Geen, H., McPhillips, T., Xu, X., Cao, A. R., Iyengar, S., Nicolet, C. M.,
Ludaescher, B., Korf, I. & Farnham, P. J. (2010). Sole-search: An integrated analysis program for
peak detection and functional annotation using ChIP-seq data. Nucleic Acids Res, 38, e13.
Blattler, A. & Farnham, P. J. (2013). Cross-talk between site-specific transcription factors and DNA
methylation states. J Biol Chem, 288, 34287-94.
Blattler, A., Yao, L., Witt, H., Guo, Y., Nicolet, C. M., Berman, B. P. & Farnham, P. J. (2014). Global
loss of DNA methylation uncovers intronic enhancers in genes showing expression changes.
Genome Biol, 15, 469.
Blow, M. J., McCulley, D. J., Li, Z., Zhang, T., Akiyama, J. A., Holt, A., Plajzer-Frick, I., Shoukry, M.,
Wright, C., Chen, F., Afzal, V., Bristow, J., Ren, B., Black, B. L., Rubin, E. M., Visel, A. &
Pennacchio, L. A. (2010). ChIP-Seq identification of weakly conserved heart enhancers. Nat
Genet, 42, 806-10.
Boland, M. J., Nazor, K. L. & Loring, J. F. (2014). Epigenetic regulation of pluripotency and
differentiation. Circ Res, 115, 311-24.
Bonn, S., Zinzen, R. P., Girardot, C., Gustafson, E. H., Perez-Gonzalez, A., Delhomme, N., Ghavi-Helm,
Y., Wilczynski, B., Riddell, A. & Furlong, E. E. (2012). Tissue-specific analysis of chromatin
state identifies temporal signatures of enhancer activity during embryonic development. Nat
Genet, 44, 148-56.
Boulesteix, A. L. & Strimmer, K. (2005). Predicting transcription factor activities from combined analysis
of microarray and ChIP data: a partial least squares approach. Theor Biol Med Model, 2, 23.
Bresnick, E. H., Lee, H. Y., Fujiwara, T., Johnson, K. D. & Keles, S. (2010). GATA switches as
developmental drivers. J Biol Chem, 285, 31087-93.
Breyer, J. P., Dorset, D. C., Clark, T. A., Bradley, K. M., Wahlfors, T. A., McReynolds, K. M., Maynard,
W. H., Chang, S. S., Cookson, M. S., Smith, J. A., Schleutker, J., Dupont, W. D. & Smith, J. R.
(2014). An Expressed Retrogene of the Master Embryonic Stem Cell Gene POU5F1 Is
Associated with Prostate Cancer Susceptibility. Am J Hum Genet, 94, 395-404.
Brooker, A. S. & Berkowitz, K. M. (2014). The roles of cohesins in mitosis, meiosis, and human health
and disease. Methods Mol Biol, 1170, 229-66.
Buecker, C. & Wysocka, J. (2012). Enhancers as information integration hubs in development: lessons
from genomics. Trends Genet, 28, 276-84.
Bulger, M. & Groudine, M. (1999). Looping versus linking: toward a model for long-distance gene
activation. Genes Dev, 13, 2465-77.
Bulger, M. & Groudine, M. (2011). Functional and mechanistic diversity of distal transcription enhancers.
Cell, 144, 327-39.
Calo, E. & Wysocka, J. (2013). Modification of enhancer chromatin: what, how, and why? Mol Cell, 49,
825-37.
Cancer Genome Atlas Network (2015). Comprehensive genomic characterization of head and neck squamous
cell carcinomas. Nature, 517, 576-82.
Cancer Genome Atlas Research Network (2014). Comprehensive molecular characterization of urothelial
bladder carcinoma. Nature, 507, 315-22.
Cao, Y., Bryan, T. M. & Reddel, R. R. (2008). Increased copy number of the TERT and TERC
telomerase subunit genes in cancer cells. Cancer Sci, 99, 1092-9.
Carneiro, P., Figueiredo, J., Bordeira-Carrico, R., Fernandes, M. S., Carvalho, J., Oliveira, C. & Seruca,
R. (2013). Therapeutic targets associated to E-cadherin dysfunction in gastric cancer. Expert Opin
Ther Targets, 17, 1187-201.
Carvajal-Carmona, L. G., Cazier, J. B., Jones, A. M., Howarth, K., Broderick, P., Pittman, A., Dobbins,
S., Tenesa, A., Farrington, S., Prendergast, J., Theodoratou, E., Barnetson, R., Conti, D.,
Newcomb, P., Hopper, J. L., Jenkins, M. A., Gallinger, S., Duggan, D. J., Campbell, H., Kerr, D.,
Casey, G., Houlston, R., Dunlop, M. & Tomlinson, I. (2011). Fine-mapping of colorectal cancer
susceptibility loci at 8q23.3, 16q22.1 and 19q13.11: refinement of association signals and use of
in silico analysis to suggest functional variation and unexpected candidate target genes. Hum Mol
Genet, 20, 2879-88.
Castaldi, P. J., Cho, M. H., Zhou, X., Qiu, W., McGeachie, M., Celli, B., Bakke, P., Gulsvik, A., Lomas,
D. A., Crapo, J. D., Beaty, T. H., Rennard, S., Harshfield, B., Lange, C., Singh, D., Tal-Singer,
R., Riley, J. H., Quackenbush, J., Raby, B. A., Carey, V. J., Silverman, E. K. & Hersh, C. P.
(2015). Genetic control of gene expression at novel and established chronic obstructive
pulmonary disease loci. Hum Mol Genet, 24, 1200-10.
Cermak, T., Doyle, E. L., Christian, M., Wang, L., Zhang, Y., Schmidt, C., Baller, J. A., Somia, N. V.,
Bogdanove, A. J. & Voytas, D. F. (2011). Efficient design and assembly of custom TALEN and
other TAL effector-based constructs for DNA targeting. Nucleic Acids Res, 39, e82.
Cheng, C., Min, R. & Gerstein, M. (2011). TIP: a probabilistic method for identifying transcription factor
target genes from ChIP-seq binding profiles. Bioinformatics, 27, 3221-7.
Cheung, E. C., Athineos, D., Lee, P., Ridgway, R. A., Lambie, W., Nixon, C., Strathdee, D., Blyth, K.,
Sansom, O. J. & Vousden, K. H. (2013). TIGAR is required for efficient intestinal regeneration
and tumorigenesis. Dev Cell, 25, 463-77.
Cho, S. W., Kim, S., Kim, Y., Kweon, J., Kim, H. S., Bae, S. & Kim, J. S. (2014). Analysis of off-target
effects of CRISPR/Cas-derived RNA-guided endonucleases and nickases. Genome Res, 24, 132-
41.
Choi, Y., Sims, G. E., Murphy, S., Miller, J. R. & Chan, A. P. (2012). Predicting the functional effect of
amino acid substitutions and indels. PLoS One, 7, e46688.
Chou, J., Provot, S. & Werb, Z. (2010). GATA3 in development and cancer differentiation: cells GATA
have it! J Cell Physiol, 222, 42-9.
Cingolani, P., Platts, A., Wang le, L., Coon, M., Nguyen, T., Wang, L., Land, S. J., Lu, X. & Ruden, D.
M. (2012). A program for annotating and predicting the effects of single nucleotide
polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2;
iso-3. Fly (Austin), 6, 80-92.
Coetzee, S. G., Rhie, S. K., Berman, B. P., Coetzee, G. A. & Noushmehr, H. (2012). FunciSNP: an
R/bioconductor tool integrating functional non-coding data sets with genetic association studies
to identify candidate regulatory SNPs. Nucleic Acids Res, 40, e139.
Cohen, M. L., Kim, S., Morita, K., Kim, S. H. & Han, M. (2015). The GATA factor elt-1 regulates C.
elegans developmental timing by promoting expression of the let-7 family microRNAs. PLoS
Genet, 11, e1005099.
Corradin, O., Saiakhova, A., Akhtar-Zaidi, B., Myeroff, L., Willis, J., Cowper-Sal-lari, R., Lupien, M.,
Markowitz, S. & Scacheri, P. C. (2014). Combinatorial effects of multiple enhancer variants in
linkage disequilibrium dictate levels of gene expression to confer susceptibility to common traits.
Genome Res, 24, 1-13.
Creyghton, M. P., Cheng, A. W., Welstead, G. G., Kooistra, T., Carey, B. W., Steine, E. J., Hanna, J.,
Lodato, M. A., Frampton, G. M., Sharp, P. A., Boyer, L. A., Young, R. A. & Jaenisch, R. (2010).
Histone H3K27ac separates active from poised enhancers and predicts developmental state.
Proc Natl Acad Sci U S A, 107, 21931-6.
Crowley, J. J., Zhabotynsky, V., Sun, W., Huang, S., Pakatci, I. K., Kim, Y., Wang, J. R., Morgan, A. P.,
Calaway, J. D., Aylor, D. L., Yun, Z., Bell, T. A., Buus, R. J., Calaway, M. E., Didion, J. P.,
Gooch, T. J., Hansen, S. D., Robinson, N. N., Shaw, G. D., Spence, J. S., Quackenbush, C. R.,
Barrick, C. J., Nonneman, R. J., Kim, K., Xenakis, J., Xie, Y., Valdar, W., Lenarcic, A. B., Wang,
W., Welsh, C. E., Fu, C. P., Zhang, Z., Holt, J., Guo, Z., Threadgill, D. W., Tarantino, L. M.,
Miller, D. R., Zou, F., McMillan, L., Sullivan, P. F. & Pardo-Manuel de Villena, F. (2015).
Analyses of allele-specific gene expression in highly divergent mouse crosses identifies pervasive
allelic imbalance. Nat Genet, 47, 353-60.
Cuddapah, S., Jothi, R., Schones, D. E., Roh, T. Y., Cui, K. & Zhao, K. (2009). Global analysis of the
insulator binding protein CTCF in chromatin barrier regions reveals demarcation of active and
repressive domains. Genome Res, 19, 24-32.
Danussi, C., Akavia, U. D., Niola, F., Jovic, A., Lasorella, A., Pe'er, D. & Iavarone, A. (2013). RHPN2
Drives Mesenchymal Transformation in Malignant Glioma by Triggering RhoA Activation.
Cancer Res, 73, 5140-50.
de Wit, E. & de Laat, W. (2012). A decade of 3C technologies: insights into nuclear organization. Genes
Dev, 26, 11-24.
Dekker, J., Marti-Renom, M. A. & Mirny, L. A. (2013). Exploring the three-dimensional organization of
genomes: interpreting chromatin interaction data. Nat Rev Genet, 14, 390-403.
Dekker, J., Rippe, K., Dekker, M. & Kleckner, N. (2002). Capturing chromosome conformation. Science,
295, 1306-11.
Deng, H., Makizumi, R., Ravikumar, T. S., Dong, H., Yang, W. & Yang, W. L. (2007). Bone
morphogenetic protein-4 is overexpressed in colonic adenocarcinomas and promotes migration
and invasion of HCT116 cells. Exp Cell Res, 313, 1033-44.
Dixon, J. R., Jung, I., Selvaraj, S., Shen, Y., Antosiewicz-Bourget, J. E., Lee, A. Y., Ye, Z., Kim, A.,
Rajagopal, N., Xie, W., Diao, Y., Liang, J., Zhao, H., Lobanenkov, V. V., Ecker, J. R., Thomson,
J. A. & Ren, B. (2015). Chromatin architecture reorganization during stem cell differentiation.
Nature, 518, 331-6.
Dixon, J. R., Selvaraj, S., Yue, F., Kim, A., Li, Y., Shen, Y., Hu, M., Liu, J. S. & Ren, B. (2012).
Topological domains in mammalian genomes identified by analysis of chromatin interactions.
Nature, 485, 376-80.
Dong, A., Rivella, S. & Breda, L. (2013). Gene therapy for hemoglobinopathies: progress and challenges.
Transl Res, 161, 293-306.
Dostie, J. & Dekker, J. (2007). Mapping networks of physical interactions between genomic elements
using 5C technology. Nat Protoc, 2, 988-1002.
Dostie, J., Richmond, T. A., Arnaout, R. A., Selzer, R. R., Lee, W. L., Honan, T. A., Rubio, E. D.,
Krumm, A., Lamb, J., Nusbaum, C., Green, R. D. & Dekker, J. (2006). Chromosome
Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions
between genomic elements. Genome Res, 16, 1299-309.
Dowen, J. M., Fan, Z. P., Hnisz, D., Ren, G., Abraham, B. J., Zhang, L. N., Weintraub, A. S., Schuijers,
J., Lee, T. I., Zhao, K. & Young, R. A. (2014). Control of cell identity genes occurs in insulated
neighborhoods in mammalian chromosomes. Cell, 159, 374-87.
Dunlop, M. G., Dobbins, S. E., Farrington, S. M., Jones, A. M., Palles, C., Whiffin, N., Tenesa, A., Spain,
S., Broderick, P., Ooi, L. Y., Domingo, E., Smillie, C., Henrion, M., Frampton, M., Martin, L.,
Grimes, G., Gorman, M., Semple, C., Ma, Y. P., Barclay, E., Prendergast, J., Cazier, J. B., Olver,
B., Penegar, S., Lubbe, S., Chandler, I., Carvajal-Carmona, L. G., Ballereau, S., Lloyd, A.,
Vijayakrishnan, J., Zgaga, L., Rudan, I., Theodoratou, E., Starr, J. M., Deary, I., Kirac, I.,
Kovacevic, D., Aaltonen, L. A., Renkonen-Sinisalo, L., Mecklin, J. P., Matsuda, K., Nakamura,
Y., Okada, Y., Gallinger, S., Duggan, D. J., Conti, D., Newcomb, P., Hopper, J., Jenkins, M. A.,
Schumacher, F., Casey, G., Easton, D., Shah, M., Pharoah, P., Lindblom, A., Liu, T., Smith, C.
G., West, H., Cheadle, J. P., Midgley, R., Kerr, D. J., Campbell, H., Tomlinson, I. P. & Houlston,
R. S. (2012). Common variation near CDKN1A, POLD3 and SHROOM2 influences colorectal
cancer risk. Nat Genet, 44, 770-6.
Eeckhoute, J., Carroll, J. S., Geistlinger, T. R., Torres-Arzayus, M. I. & Brown, M. (2006). A cell-type-
specific transcriptional network required for estrogen regulation of cyclin D1 and cell cycle
progression in breast cancer. Genes Dev, 20, 2513-26.
Emison, E. S., McCallion, A. S., Kashuk, C. S., Bush, R. T., Grice, E., Lin, S., Portnoy, M. E., Cutler, D.
J., Green, E. D. & Chakravarti, A. (2005). A common sex-dependent mutation in a RET enhancer
underlies Hirschsprung disease risk. Nature, 434, 857-63.
ENCODE Project Consortium (2012). An integrated encyclopedia of DNA elements in the human
genome. Nature, 489, 57-74.
Ernst, J. & Kellis, M. (2010). Discovery and characterization of chromatin states for systematic
annotation of the human genome. Nat Biotechnol, 28, 817-25.
Ernst, J. & Kellis, M. (2012). ChromHMM: automating chromatin-state discovery and characterization.
Nat Methods, 9, 215-6.
Ernst, J., Kheradpour, P., Mikkelsen, T. S., Shoresh, N., Ward, L. D., Epstein, C. B., Zhang, X., Wang,
L., Issner, R., Coyne, M., Ku, M., Durham, T., Kellis, M. & Bernstein, B. E. (2011). Mapping
and analysis of chromatin state dynamics in nine human cell types. Nature, 473, 43-9.
Fairfax, B. P., Makino, S., Radhakrishnan, J., Plant, K., Leslie, S., Dilthey, A., Ellis, P., Langford, C.,
Vannberg, F. O. & Knight, J. C. (2012). Genetics of gene expression in primary immune cells
identifies cell type-specific master regulators and roles of HLA alleles. Nat Genet, 44, 502-10.
Farh, K. K., Marson, A., Zhu, J., Kleinewietfeld, M., Housley, W. J., Beik, S., Shoresh, N., Whitton, H.,
Ryan, R. J., Shishkin, A. A., Hatan, M., Carrasco-Alfonso, M. J., Mayer, D., Luckey, C. J.,
Patsopoulos, N. A., De Jager, P. L., Kuchroo, V. K., Epstein, C. B., Daly, M. J., Hafler, D. A. &
Bernstein, B. E. (2015). Genetic and epigenetic fine mapping of causal autoimmune disease
variants. Nature, 518, 337-43.
Farnham, P. J. (2012). Thematic Minireview Series on Results from the ENCODE Project: Integrative
Global Analyses of Regulatory Regions in the Human Genome. J Biol Chem, 287, 30885-7.
Fearon, E. R. (2011). Molecular genetics of colorectal cancer. Annu Rev Pathol, 6, 479-507.
Filion, G. J., Van Bemmel, J. G., Braunschweig, U., Talhout, W., Kind, J., Ward, L. D., Brugman, W., De
Castro, I. J., Kerkhoven, R. M., Bussemaker, H. J. & Van Steensel, B. (2010). Systematic protein
location mapping reveals five principal chromatin types in Drosophila cells. Cell, 143, 212-24.
Francesconi, M. & Lehner, B. (2014). The effects of genetic variation on gene expression dynamics
during development. Nature, 505, 208-11.
Fredriksson, N. J., Ny, L., Nilsson, J. A. & Larsson, E. (2014). Systematic analysis of noncoding somatic
mutations and gene expression alterations across 14 tumor types. Nat Genet, 46, 1258-63.
Freedman, M. L., Monteiro, A. N., Gayther, S. A., Coetzee, G. A., Risch, A., Plass, C., Casey, G., De
Biasi, M., Carlson, C., Duggan, D., James, M., Liu, P., Tichelaar, J. W., Vikis, H. G., You, M. &
Mills, I. G. (2011). Principles for the post-GWAS functional characterization of cancer risk loci.
Nat Genet, 43, 513-8.
Frietze, S., Wang, R., Yao, L., Tak, Y. G., Ye, Z., Gaddis, M., Witt, H., Farnham, P. J. & Jin, V. X.
(2012). Cell type-specific binding patterns reveal that TCF7L2 can be tethered to the genome by
association with GATA3. Genome Biol, 13, R52.
Fu, J., Wolfs, M. G., Deelen, P., Westra, H. J., Fehrmann, R. S., Te Meerman, G. J., Buurman, W. A.,
Rensen, S. S., Groen, H. J., Weersma, R. K., van den Berg, L. H., Veldink, J., Ophoff, R. A.,
Snieder, H., van Heel, D., Jansen, R. C., Hofker, M. H., Wijmenga, C. & Franke, L. (2012).
Unraveling the regulatory mechanisms underlying tissue-dependent genetic variation of gene
expression. PLoS Genet, 8, e1002431.
Fujiwara, T., O'Geen, H., Keles, S., Blahnik, K., Linnemann, A. K., Kang, Y. A., Choi, K., Farnham, P. J.
& Bresnick, E. H. (2009). Discovering hematopoietic mechanisms through genome-wide analysis
of GATA factor chromatin occupancy. Mol Cell, 36, 667-81.
Fullwood, M. J., Liu, M. H., Pan, Y. F., Liu, J., Xu, H., Mohamed, Y. B., Orlov, Y. L., Velkov, S., Ho,
A., Mei, P. H., Chew, E. G., Huang, P. Y., Welboren, W. J., Han, Y., Ooi, H. S., Ariyaratne, P.
N., Vega, V. B., Luo, Y., Tan, P. Y., Choy, P. Y., Wansa, K. D., Zhao, B., Lim, K. S., Leow, S.
C., Yow, J. S., Joseph, R., Li, H., Desai, K. V., Thomsen, J. S., Lee, Y. K., Karuturi, R. K.,
Herve, T., Bourque, G., Stunnenberg, H. G., Ruan, X., Cacheux-Rataboul, V., Sung, W. K., Liu,
E. T., Wei, C. L., Cheung, E. & Ruan, Y. (2009). An oestrogen-receptor-alpha-bound human
chromatin interactome. Nature, 462, 58-64.
Fullwood, M. J. & Ruan, Y. (2009). ChIP-based methods for the identification of long-range chromatin
interactions. J Cell Biochem, 107, 30-9.
Gaj, T., Gersbach, C. A. & Barbas, C. F., 3rd (2013). ZFN, TALEN, and CRISPR/Cas-based methods
for genome engineering. Trends Biotechnol, 31, 397-405.
Gao, F., Foat, B. C. & Bussemaker, H. J. (2004). Defining transcriptional networks through integrative
modeling of mRNA expression and transcription factor binding data. BMC Bioinformatics, 5, 31.
Garte, S. J. (1993). The c-myc oncogene in tumor progression. Crit Rev Oncog, 4, 435-49.
Gebhard, C., Benner, C., Ehrich, M., Schwarzfischer, L., Schilling, E., Klug, M., Dietmaier, W., Thiede,
C., Holler, E., Andreesen, R. & Rehli, M. (2010). General transcription factor binding at CpG
islands in normal cells correlates with resistance to de novo DNA methylation in cancer cells.
Cancer Res, 70, 1398-407.
Giorgio, E., Robyr, D., Spielmann, M., Ferrero, E., Di Gregorio, E., Imperiale, D., Vaula, G., Stamoulis,
G., Santoni, F., Atzori, C., Gasparini, L., Ferrera, D., Canale, C., Guipponi, M., Pennacchio, L.
A., Antonarakis, S. E., Brussino, A. & Brusco, A. (2015). A large genomic deletion leads to
enhancer adoption by the lamin B1 gene: a second path to autosomal dominant adult-onset
demyelinating leukodystrophy (ADLD). Hum Mol Genet, 24, 3143-54.
Gisselbrecht, S. S., Barrera, L. A., Porsch, M., Aboukhalil, A., Estep, P. W., 3rd, Vedenko, A., Palagi, A.,
Kim, Y., Zhu, X., Busser, B. W., Gamble, C. E., Iagovitina, A., Singhania, A., Michelson, A. M.
& Bulyk, M. L. (2013). Highly parallel assays of tissue-specific enhancers in whole Drosophila
embryos. Nat Methods, 10, 774-80.
Gjoneska, E., Pfenning, A. R., Mathys, H., Quon, G., Kundaje, A., Tsai, L. H. & Kellis, M. (2015).
Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer's disease.
Nature, 518, 365-9.
Grant, C. E., Bailey, T. L. & Noble, W. S. (2011). FIMO: scanning for occurrences of a given motif.
Bioinformatics, 27, 1017-8.
Grimmer, M. R., Stolzenburg, S., Ford, E., Lister, R., Blancafort, P. & Farnham, P. J. (2014). Analysis of
an artificial zinc finger epigenetic modulator: widespread binding but limited regulation. Nucleic
Acids Res, 42, 10856-68.
Groschel, S., Sanders, M. A., Hoogenboezem, R., de Wit, E., Bouwman, B. A., Erpelinck, C., van der
Velden, V. H., Havermans, M., Avellino, R., van Lom, K., Rombouts, E. J., van Duin, M.,
Dohner, K., Beverloo, H. B., Bradner, J. E., Dohner, H., Lowenberg, B., Valk, P. J., Bindels, E.
M., de Laat, W. & Delwel, R. (2014). A single oncogenic enhancer rearrangement causes
concomitant EVI1 and GATA2 deregulation in leukemia. Cell, 157, 369-81.
GTEx Consortium (2015). Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis:
multitissue gene regulation in humans. Science, 348, 648-60.
Guan, Y., Kuo, W. L., Stilwell, J. L., Takano, H., Lapuk, A. V., Fridlyand, J., Mao, J. H., Yu, M., Miller,
M. A., Santos, J. L., Kalloger, S. E., Carlson, J. W., Ginzinger, D. G., Celniker, S. E., Mills, G.
B., Huntsman, D. G. & Gray, J. W. (2007). Amplification of PVT1 contributes to the
pathophysiology of ovarian and breast cancer. Clin Cancer Res, 13, 5745-55.
Hagege, H., Klous, P., Braem, C., Splinter, E., Dekker, J., Cathala, G., de Laat, W. & Forne, T. (2007).
Quantitative analysis of chromosome conformation capture assays (3C-qPCR). Nat Protoc, 2,
1722-33.
Handoko, L., Xu, H., Li, G., Ngan, C. Y., Chew, E., Schnapp, M., Lee, C. W., Ye, C., Ping, J. L.,
Mulawadi, F., Wong, E., Sheng, J., Zhang, Y., Poh, T., Chan, C. S., Kunarso, G., Shahab, A.,
Bourque, G., Cacheux-Rataboul, V., Sung, W. K., Ruan, Y. & Wei, C. L. (2011). CTCF-mediated
functional chromatin interactome in pluripotent cells. Nat Genet, 43, 630-8.
Hansen, K. D., Timp, W., Bravo, H. C., Sabunciyan, S., Langmead, B., McDonald, O. G., Wen, B., Wu,
H., Liu, Y., Diep, D., Briem, E., Zhang, K., Irizarry, R. A. & Feinberg, A. P. (2011). Increased
methylation variation in epigenetic domains across cancer types. Nat Genet, 43, 768-75.
Hardison, R. C. (2012). Genome-wide Epigenetic Data Facilitate Understanding of Disease Susceptibility
Association Studies. J Biol Chem, 287, 30932-40.
Hatzis, P. & Talianidis, I. (2002). Dynamics of enhancer-promoter communication during differentiation-
induced gene activation. Mol Cell, 10, 1467-77.
Hawkins, R. D., Hon, G. C., Yang, C., Antosiewicz-Bourget, J. E., Lee, L. K., Ngo, Q. M., Klugman, S.,
Ching, K. A., Edsall, L. E., Ye, Z., Kuan, S., Yu, P., Liu, H., Zhang, X., Green, R. D.,
Lobanenkov, V. V., Stewart, R., Thomson, J. A. & Ren, B. (2011). Dynamic chromatin states in
human ES cells reveal potential regulatory sequences and genes involved in pluripotency. Cell
Res, 21, 1393-409.
Hazelett, D. J., Rhie, S. K., Gaddis, M., Yan, C., Lakeland, D. L., Coetzee, S. G., Henderson, B. E.,
Noushmehr, H., Cozen, W., Kote-Jarai, Z., Eeles, R. A., Easton, D. F., Haiman, C. A., Lu, W.,
Farnham, P. J. & Coetzee, G. A. (2014). Comprehensive functional annotation of 77 prostate
cancer risk loci. PLoS Genet, 10, e1004102.
He, B., Chen, C., Teng, L. & Tan, K. (2014). Global view of enhancer-promoter interactome in human
cells. Proc Natl Acad Sci U S A, 111, E2191-9.
Heintzman, N. D., Hon, G. C., Hawkins, R. D., Kheradpour, P., Stark, A., Harp, L. F., Ye, Z., Lee, L. K.,
Stuart, R. K., Ching, C. W., Ching, K. A., Antosiewicz-Bourget, J. E., Liu, H., Zhang, X., Green,
R. D., Lobanenkov, V. V., Stewart, R., Thomson, J. A., Crawford, G. E., Kellis, M. & Ren, B.
(2009). Histone modifications at human enhancers reflect global cell-type-specific gene
expression. Nature, 459, 108-12.
Heintzman, N. D., Stuart, R. K., Hon, G., Fu, Y., Ching, C. W., Hawkins, R. D., Barrera, L. O., Van
Calcar, S., Qu, C., Ching, K. A., Wang, W., Weng, Z., Green, R. D., Crawford, G. E. & Ren, B.
(2007). Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in
the human genome. Nat Genet, 39, 311-8.
Henikoff, S. (2007). ENCODE and our very busy genome. Nat Genet, 39, 817-8.
Herz, H. M., Hu, D. & Shilatifard, A. (2014). Enhancer malfunction in cancer. Mol Cell, 53, 859-66.
Hilton, I. B., D'Ippolito, A. M., Vockley, C. M., Thakore, P. I., Crawford, G. E., Reddy, T. E. &
Gersbach, C. A. (2015). Epigenome editing by a CRISPR-Cas9-based acetyltransferase activates
genes from promoters and enhancers. Nat Biotechnol, 33, 510-7.
Hindorff, L. A., Sethupathy, P., Junkins, H. A., Ramos, E. M., Mehta, J. P., Collins, F. S. & Manolio, T.
A. (2009). Potential etiologic and functional implications of genome-wide association loci for
human diseases and traits. Proc Natl Acad Sci U S A, 106, 9362-7.
Hoffman, M. M., Buske, O. J., Wang, J., Weng, Z., Bilmes, J. A. & Noble, W. S. (2012). Unsupervised
pattern discovery in human chromatin structure through genomic segmentation. Nat Methods,
9, 473-6.
Holwerda, S. J. & de Laat, W. (2013). CTCF: the protein, the binding partners, the binding sites and their
chromatin loops. Philos Trans R Soc Lond B Biol Sci, 368, 20120369.
Home, P., Saha, B., Ray, S., Dutta, D., Gunewardena, S., Yoo, B., Pal, A., Vivian, J. L., Larson, M.,
Petroff, M., Gallagher, P. G., Schulz, V. P., White, K. L., Golos, T. G., Behr, B. & Paul, S.
(2012). Altered subcellular localization of transcription factor TEAD4 regulates first mammalian
cell lineage commitment. Proc Natl Acad Sci U S A, 109, 7362-7.
Honkela, A., Girardot, C., Gustafson, E. H., Liu, Y. H., Furlong, E. E., Lawrence, N. D. & Rattray, M.
(2010). Model-based method for transcription factor target identification with limited data. Proc
Natl Acad Sci U S A, 107, 7793-8.
Houlston, R. S., Cheadle, J., Dobbins, S. E., Tenesa, A., Jones, A. M., Howarth, K., Spain, S. L.,
Broderick, P., Domingo, E., Farrington, S., Prendergast, J. G., Pittman, A. M., Theodoratou, E.,
Smith, C. G., Olver, B., Walther, A., Barnetson, R. A., Churchman, M., Jaeger, E. E., Penegar, S.,
Barclay, E., Martin, L., Gorman, M., Mager, R., Johnstone, E., Midgley, R., Niittymaki, I.,
Tuupanen, S., Colley, J., Idziaszczyk, S., Thomas, H. J., Lucassen, A. M., Evans, D. G., Maher,
E. R., Maughan, T., Dimas, A., Dermitzakis, E., Cazier, J. B., Aaltonen, L. A., Pharoah, P., Kerr,
D. J., Carvajal-Carmona, L. G., Campbell, H., Dunlop, M. G. & Tomlinson, I. P. (2010). Meta-
analysis of three genome-wide association studies identifies susceptibility loci for colorectal
cancer at 1q41, 3q26.2, 12q13.13 and 20q13.33. Nat Genet, 42, 973-7.
Houlston, R. S., Webb, E., Broderick, P., Pittman, A. M., Di Bernardo, M. C., Lubbe, S., Chandler, I.,
Vijayakrishnan, J., Sullivan, K., Penegar, S., Carvajal-Carmona, L., Howarth, K., Jaeger, E.,
Spain, S. L., Walther, A., Barclay, E., Martin, L., Gorman, M., Domingo, E., Teixeira, A. S.,
Kerr, D., Cazier, J. B., Niittymaki, I., Tuupanen, S., Karhu, A., Aaltonen, L. A., Tomlinson, I. P.,
Farrington, S. M., Tenesa, A., Prendergast, J. G., Barnetson, R. A., Cetnarskyj, R., Porteous, M.
E., Pharoah, P. D., Koessler, T., Hampe, J., Buch, S., Schafmayer, C., Tepel, J., Schreiber, S.,
Volzke, H., Chang-Claude, J., Hoffmeister, M., Brenner, H., Zanke, B. W., Montpetit, A.,
Hudson, T. J., Gallinger, S., Campbell, H. & Dunlop, M. G. (2008). Meta-analysis of genome-
wide association data identifies four new susceptibility loci for colorectal cancer. Nat Genet, 40,
1426-35.
Hovestadt, V., Jones, D. T., Picelli, S., Wang, W., Kool, M., Northcott, P. A., Sultan, M., Stachurski, K.,
Ryzhova, M., Warnatz, H. J., Ralser, M., Brun, S., Bunt, J., Jager, N., Kleinheinz, K., Erkek, S.,
Weber, U. D., Bartholomae, C. C., von Kalle, C., Lawerenz, C., Eils, J., Koster, J., Versteeg, R.,
Milde, T., Witt, O., Schmidt, S., Wolf, S., Pietsch, T., Rutkowski, S., Scheurlen, W., Taylor, M.
D., Brors, B., Felsberg, J., Reifenberger, G., Borkhardt, A., Lehrach, H., Wechsler-Reya, R. J.,
Eils, R., Yaspo, M. L., Landgraf, P., Korshunov, A., Zapatka, M., Radlwimmer, B., Pfister, S. M.
& Lichter, P. (2014). Decoding the regulatory landscape of medulloblastoma using DNA
methylation sequencing. Nature, 510, 537-41.
Howie, B., Fuchsberger, C., Stephens, M., Marchini, J. & Abecasis, G. R. (2012). Fast and accurate
genotype imputation in genome-wide association studies through pre-phasing. Nat Genet, 44,
955-9.
Hu, Y., Sun, Z., Zhang, A. & Zhang, J. (2013). SMAD7 rs12953717 polymorphism contributes to
increased risk of colorectal cancer. Tumour Biol.
Huang, F. W., Hodis, E., Xu, M. J., Kryukov, G. V., Chin, L. & Garraway, L. A. (2013). Highly recurrent
TERT promoter mutations in human melanoma. Science, 339, 957-9.
Huang, G. L., Guo, H. Q., Yang, F., Liu, O. F., Li, B. B., Liu, X. Y., Lu, Y. & He, Z. W. (2012).
Activating transcription factor 1 is a prognostic marker of colorectal cancer. Asian Pac J Cancer
Prev, 13, 1053-7.
Hughes, J. R., Roberts, N., McGowan, S., Hay, D., Giannoulatou, E., Lynch, M., De Gobbi, M., Taylor,
S., Gibbons, R. & Higgs, D. R. (2014). Analysis of hundreds of cis-regulatory landscapes at high
resolution in a single, high-throughput experiment. Nat Genet, 46, 205-12.
Huppi, K., Pitt, J. J., Wahlberg, B. M. & Caplen, N. J. (2012). The 8q24 gene desert: an oasis of non-
coding transcriptional activity. Front Genet, 3, 69.
Ing-Simmons, E., Seitan, V. C., Faure, A. J., Flicek, P., Carroll, T., Dekker, J., Fisher, A. G., Lenhard, B.
& Merkenschlager, M. (2015). Spatial enhancer clustering and regulation of enhancer-proximal
genes by cohesin. Genome Res, 25, 504-13.
Iyer, N. G., Ozdag, H. & Caldas, C. (2004). p300/CBP and cancer. Oncogene, 23, 4225-31.
Jager, R., Migliorini, G., Henrion, M., Kandaswamy, R., Speedy, H. E., Heindl, A., Whiffin, N., Carnicer,
M. J., Broome, L., Dryden, N., Nagano, T., Schoenfelder, S., Enge, M., Yuan, Y., Taipale, J.,
Fraser, P., Fletcher, O. & Houlston, R. S. (2015). Capture Hi-C identifies the chromatin
interactome of colorectal cancer risk loci. Nat Commun, 6, 6178.
Janknecht, R. (2002). The versatile functions of the transcriptional coactivators p300 and CBP and their
roles in disease. Histol Histopathol, 17, 657-68.
Jeon, B. N., Kim, M. K., Choi, W. I., Koh, D. I., Hong, S. Y., Kim, K. S., Kim, M., Yun, C. O., Yoon, J.,
Choi, K. Y., Lee, K. R., Nephew, K. P. & Hur, M. W. (2012). KR-POK interacts with p53 and
represses its ability to activate transcription of p21WAF1/CDKN1A. Cancer Res, 72, 1137-48.
Jia, W. H., Zhang, B., Matsuo, K., Shin, A., Xiang, Y. B., Jee, S. H., Kim, D. H., Ren, Z., Cai, Q., Long,
J., Shi, J., Wen, W., Yang, G., Delahanty, R. J., Ji, B. T., Pan, Z. Z., Matsuda, F., Gao, Y. T., Oh,
J. H., Ahn, Y. O., Park, E. J., Li, H. L., Park, J. W., Jo, J., Jeong, J. Y., Hosono, S., Casey, G.,
Peters, U., Shu, X. O., Zeng, Y. X. & Zheng, W. (2013). Genome-wide association analyses in
East Asians identify new susceptibility loci for colorectal cancer. Nat Genet, 45, 191-6.
Jin, F., Li, Y., Dixon, J. R., Selvaraj, S., Ye, Z., Lee, A. Y., Yen, C. A., Schmitt, A. D., Espinoza, C. A. &
Ren, B. (2013). A high-resolution map of the three-dimensional chromatin interactome in human
cells. Nature, 503, 290-4.
Jones, P. A. (2012). Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev
Genet, 13, 484-92.
Kalhor, R., Tjong, H., Jayathilaka, N., Alber, F. & Chen, L. (2011). Genome architectures revealed by
tethered chromosome conformation capture and population-based modeling. Nat Biotechnol, 30,
90-8.
Kandoth, C., McLellan, M. D., Vandin, F., Ye, K., Niu, B., Lu, C., Xie, M., Zhang, Q., McMichael, J. F.,
Wyczalkowski, M. A., Leiserson, M. D., Miller, C. A., Welch, J. S., Walter, M. J., Wendl, M. C.,
Ley, T. J., Wilson, R. K., Raphael, B. J. & Ding, L. (2013). Mutational landscape and
significance across 12 major cancer types. Nature, 502, 333-9.
Kang, X., Chen, W., Kim, R. H., Kang, M. K. & Park, N. H. (2009). Regulation of the hTERT promoter
activity by MSH2, the hnRNPs K and D, and GRHL2 in human oral squamous cell carcinoma
cells. Oncogene, 28, 565-74.
Karagiannis, G. S., Berk, A., Dimitromanolakis, A. & Diamandis, E. P. (2013). Enrichment map profiling
of the cancer invasion front suggests regulation of colorectal cancer progression by the bone
morphogenetic protein antagonist, gremlin-1. Mol Oncol, 7, 826-39.
Kearns, N. A., Pham, H., Tabak, B., Genga, R. M., Silverstein, N. J., Garber, M. & Maehr, R. (2015).
Functional annotation of native enhancers with a Cas9-histone demethylase fusion. Nat Methods,
12, 401-3.
Kheradpour, P., Ernst, J., Melnikov, A., Rogov, P., Wang, L., Zhang, X., Alston, J., Mikkelsen, T. S. &
Kellis, M. (2013). Systematic dissection of regulatory motifs in 2000 predicted human enhancers
using a massively parallel reporter assay. Genome Res, 23, 800-11.
Kieffer-Kwon, K. R., Tang, Z., Mathe, E., Qian, J., Sung, M. H., Li, G., Resch, W., Baek, S., Pruett, N.,
Grontved, L., Vian, L., Nelson, S., Zare, H., Hakim, O., Reyon, D., Yamane, A., Nakahashi, H.,
Kovalchuk, A. L., Zou, J., Joung, J. K., Sartorelli, V., Wei, C. L., Ruan, X., Hager, G. L., Ruan,
Y. & Casellas, R. (2013). Interactome maps of mouse gene regulatory domains reveal basic
principles of transcriptional regulation. Cell, 155, 1507-20.
Kioussis, D., Vanin, E., deLange, T., Flavell, R. A. & Grosveld, F. G. (1983). Beta-globin gene
inactivation by DNA translocation in gamma beta-thalassaemia. Nature, 306, 662-6.
Knosel, T., Chen, Y., Hotovy, S., Settmacher, U., Altendorf-Hofmann, A. & Petersen, I. (2012). Loss of
desmocollin 1-3 and homeobox genes PITX1 and CDX2 are associated with tumor progression
and survival in colorectal carcinoma. Int J Colorectal Dis, 27, 1391-9.
Konsavage, W. M., Jr. & Yochum, G. S. (2014). The Myc 3' Wnt-responsive element suppresses colonic
tumorigenesis. Mol Cell Biol, 34, 1659-69.
Koudritsky, M. & Domany, E. (2008). Positional distribution of human transcription factor binding sites.
Nucleic Acids Res, 36, 6795-805.
Kouros-Mehr, H., Bechis, S. K., Slorach, E. M., Littlepage, L. E., Egeblad, M., Ewald, A. J., Pai, S. Y.,
Ho, I. C. & Werb, Z. (2008). GATA-3 links tumor differentiation and dissemination in a luminal
breast cancer model. Cancer Cell, 13, 141-52.
Kowalczyk, M. S., Hughes, J. R., Garrick, D., Lynch, M. D., Sharpe, J. A., Sloane-Stanley, J. A.,
McGowan, S. J., De Gobbi, M., Hosseini, M., Vernimmen, D., Brown, J. M., Gray, N. E.,
Collavin, L., Gibbons, R. J., Flint, J., Taylor, S., Buckle, V. J., Milne, T. A., Wood, W. G. &
Higgs, D. R. (2012). Intragenic enhancers act as alternative promoters. Mol Cell, 45, 447-58.
Krivega, I., Dale, R. K. & Dean, A. (2014). Role of LDB1 in the transition from chromatin looping to
transcription activation. Genes Dev, 28, 1278-90.
Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., Jones, S. J. & Marra, M. A.
(2009). Circos: an information aesthetic for comparative genomics. Genome Res, 19, 1639-45.
Kulozik, A. E., Kar, B. C., Serjeant, G. R., Serjeant, B. E. & Weatherall, D. J. (1988). The molecular
basis of alpha thalassemia in India. Its interaction with the sickle cell gene. Blood, 71, 467-72.
Kvon, E. Z., Kazmar, T., Stampfel, G., Yanez-Cuna, J. O., Pagani, M., Schernhuber, K., Dickson, B. J. &
Stark, A. (2014). Genome-scale functional characterization of Drosophila developmental
enhancers in vivo. Nature, 512, 91-5.
Kwasnieski, J. C., Fiore, C., Chaudhari, H. G. & Cohen, B. A. (2014). High-throughput functional testing
of ENCODE segmentation predictions. Genome Res, 24, 1595-602.
Lan, X., Farnham, P. J. & Jin, V. X. (2012). Uncovering transcription factor modules using one- and
three-dimensional analyses. J Biol Chem, 287, 30914-21.
Lawrence, M., Huber, W., Pages, H., Aboyoun, P., Carlson, M., Gentleman, R., Morgan, M. T. & Carey,
V. J. (2013). Software for computing and annotating genomic ranges. PLoS Comput Biol, 9,
e1003118.
Lee, D. S., Shin, J. Y., Tonge, P. D., Puri, M. C., Lee, S., Park, H., Lee, W. C., Hussein, S. M., Bleazard,
T., Yun, J. Y., Kim, J., Li, M., Cloonan, N., Wood, D., Clancy, J. L., Mosbergen, R., Yi, J. H.,
Yang, K. S., Kim, H., Rhee, H., Wells, C. A., Preiss, T., Grimmond, S. M., Rogers, I. M., Nagy,
A. & Seo, J. S. (2014). An epigenomic roadmap to induced pluripotency reveals DNA
methylation as a reprogramming modulator. Nat Commun, 5, 5619.
Levin, E. R. (2005). Integration of the extranuclear and nuclear actions of estrogen. Mol Endocrinol, 19,
1951-9.
Li, G., Ruan, X., Auerbach, R. K., Sandhu, K. S., Zheng, M., Wang, P., Poh, H. M., Goh, Y., Lim, J.,
Zhang, J., Sim, H. S., Peh, S. Q., Mulawadi, F. H., Ong, C. T., Orlov, Y. L., Hong, S., Zhang, Z.,
Landt, S., Raha, D., Euskirchen, G., Wei, C. L., Ge, W., Wang, H., Davis, C., Fisher-Aylor, K. I.,
Mortazavi, A., Gerstein, M., Gingeras, T., Wold, B., Sun, Y., Fullwood, M. J., Cheung, E., Liu,
E., Sung, W. K., Snyder, M. & Ruan, Y. (2012a). Extensive promoter-centered chromatin
interactions provide a topological basis for transcription regulation. Cell, 148, 84-98.
Li, Q., Seo, J. H., Stranger, B., McKenna, A., Pe'er, I., Laframboise, T., Brown, M., Tyekucheva, S. &
Freedman, M. L. (2013a). Integrative eQTL-based analyses reveal the biology of breast cancer
risk loci. Cell, 152, 633-41.
Li, Q., Zou, C., Zou, C., Han, Z., Xiao, H., Wei, H., Wang, W., Zhang, L., Zhang, X., Tang, Q., Zhang,
C., Tao, J., Wang, X. & Gao, X. (2013e). MicroRNA-25 functions as a potential tumor
suppressor in colon cancer by targeting Smad7. Cancer Lett, 335, 168-74.
Li, X., Kuang, J., Shen, Y., Majer, M. M., Nelson, C. C., Parsawar, K., Heichman, K. A. & Kuwada, S.
K. (2012n). The atypical histone macroH2A1.2 interacts with HER-2 protein in cancer cells. J
Biol Chem, 287, 23171-83.
Li, Y., Rivera, C. M., Ishii, H., Jin, F., Selvaraj, S., Lee, A. Y., Dixon, J. R. & Ren, B. (2014). CRISPR
reveals a distal super-enhancer required for Sox2 expression in mouse embryonic stem cells.
PLoS One, 9, e114485.
Lieberman-Aiden, E., van Berkum, N. L., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., Amit, I.,
Lajoie, B. R., Sabo, P. J., Dorschner, M. O., Sandstrom, R., Bernstein, B., Bender, M. A.,
Groudine, M., Gnirke, A., Stamatoyannopoulos, J., Mirny, L. A., Lander, E. S. & Dekker, J.
(2009). Comprehensive mapping of long-range interactions reveals folding principles of the
human genome. Science, 326, 289-93.
Lister, R., Pelizzola, M., Dowen, R. H., Hawkins, R. D., Hon, G., Tonti-Filippini, J., Nery, J. R., Lee, L.,
Ye, Z., Ngo, Q. M., Edsall, L., Antosiewicz-Bourget, J., Stewart, R., Ruotti, V., Millar, A. H.,
Thomson, J. A., Ren, B. & Ecker, J. R. (2009). Human DNA methylomes at base resolution show
widespread epigenomic differences. Nature, 462, 315-22.
Losada, A. (2014). Cohesin in cancer: chromosome segregation and beyond. Nat Rev Cancer, 14, 389-93.
Lu, Y., Zhou, Y. & Tian, W. (2013). Combining Hi-C data with phylogenetic correlation to predict the
target genes of distal regulatory elements in human genome. Nucleic Acids Res, 41, 10391-402.
Ma, W., Ay, F., Lee, C., Gulsoy, G., Deng, X., Cook, S., Hesson, J., Cavanaugh, C., Ware, C. B.,
Krumm, A., Shendure, J., Blau, C. A., Disteche, C. M., Noble, W. S. & Duan, Z. (2015). Fine-
scale chromatin interaction maps reveal the cis-regulatory landscape of human lincRNA genes.
Nat Methods, 12, 71-8.
Maeder, M. L., Thibodeau-Beganny, S., Osiak, A., Wright, D. A., Anthony, R. M., Eichtinger, M., Jiang,
T., Foley, J. E., Winfrey, R. J., Townsend, J. A., Unger-Wallace, E., Sander, J. D., Muller-Lerch,
F., Fu, F., Pearlberg, J., Gobel, C., Dassie, J. P., Pruett-Miller, S. M., Porteus, M. H., Sgroi, D. C.,
Iafrate, A. J., Dobbs, D., McCray, P. B., Jr., Cathomen, T., Voytas, D. F. & Joung, J. K. (2008).
Rapid "open-source" engineering of customized zinc-finger nucleases for highly efficient gene
modification. Mol Cell, 31, 294-301.
Maienschein-Cline, M., Zhou, J., White, K. P., Sciammas, R. & Dinner, A. R. (2012). Discovering
transcription factor regulatory targets using gene expression and binding data. Bioinformatics, 28,
206-13.
Manolio, T. A. (2010). Genomewide association studies and assessment of the risk of disease. N Engl J
Med, 363, 166-76.
Massion, P. P., Taflan, P. M., Jamshedur Rahman, S. M., Yildiz, P., Shyr, Y., Edgerton, M. E., Westfall,
M. D., Roberts, J. R., Pietenpol, J. A., Carbone, D. P. & Gonzalez, A. L. (2003). Significance of
p63 amplification and overexpression in lung cancer development and prognosis. Cancer Res, 63,
7113-21.
Mathelier, A., Zhao, X., Zhang, A. W., Parcy, F., Worsley-Hunt, R., Arenillas, D. J., Buchman, S., Chen,
C. Y., Chou, A., Ienasescu, H., Lim, J., Shyr, C., Tan, G., Zhou, M., Lenhard, B., Sandelin, A. &
Wasserman, W. W. (2014). JASPAR 2014: an extensively expanded and updated open-access
database of transcription factor binding profiles. Nucleic Acids Res, 42, D142-7.
Maunakea, A. K., Nagarajan, R. P., Bilenky, M., Ballinger, T. J., D'Souza, C., Fouse, S. D., Johnson, B.
E., Hong, C., Nielsen, C., Zhao, Y., Turecki, G., Delaney, A., Varhol, R., Thiessen, N., Shchors,
K., Heine, V. M., Rowitch, D. H., Xing, X., Fiore, C., Schillebeeckx, M., Jones, S. J., Haussler,
D., Marra, M. A., Hirst, M., Wang, T. & Costello, J. F. (2010). Conserved role of intragenic DNA
methylation in regulating alternative promoters. Nature, 466, 253-7.
Maurano, M. T., Humbert, R., Rynes, E., Thurman, R. E., Haugen, E., Wang, H., Reynolds, A. P.,
Sandstrom, R., Qu, H., Brody, J., Shafer, A., Neri, F., Lee, K., Kutyavin, T., Stehling-Sun, S.,
Johnson, A. K., Canfield, T. K., Giste, E., Diegel, M., Bates, D., Hansen, R. S., Neph, S., Sabo, P.
J., Heimfeld, S., Raubitschek, A., Ziegler, S., Cotsapas, C., Sotoodehnia, N., Glass, I., Sunyaev,
S. R., Kaul, R. & Stamatoyannopoulos, J. A. (2012). Systematic localization of common disease-
associated variation in regulatory DNA. Science, 337, 1190-5.
Melnikov, A., Zhang, X., Rogov, P., Wang, L. & Mikkelsen, T. S. (2014). Massively parallel reporter
assays in cultured mammalian cells. J Vis Exp, (90), e51719.
Melton, C., Reuter, J. A., Spacek, D. V. & Snyder, M. (2015). Recurrent somatic mutations in regulatory
regions of human cancer genomes. Nat Genet, 47, 710-6.
Mendenhall, E. M., Williamson, K. E., Reyon, D., Zou, J. Y., Ram, O., Joung, J. K. & Bernstein, B. E.
(2013). Locus-specific editing of histone modifications at endogenous enhancers. Nat Biotechnol,
31, 1133-6.
Meyer, M. B., Benkusky, N. A. & Pike, J. W. (2015). Selective Distal Enhancer Control of the Mmp13
Gene Identified through Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR)
Genomic Deletions. J Biol Chem, 290, 11093-107.
Mifsud, B., Tavares-Cadete, F., Young, A. N., Sugar, R., Schoenfelder, S., Ferreira, L., Wingett, S. W.,
Andrews, S., Grey, W., Ewels, P. A., Herman, B., Happe, S., Higgs, A., LeProust, E., Follows, G.
A., Fraser, P., Luscombe, N. M. & Osborne, C. S. (2015). Mapping long-range promoter contacts
in human cells with high-resolution capture Hi-C. Nat Genet, 47, 598-606.
Nica, A. C., Parts, L., Glass, D., Nisbet, J., Barrett, A., Sekowska, M., Travers, M., Potter, S., Grundberg,
E., Small, K., Hedman, A. K., Bataille, V., Tzenova Bell, J., Surdulescu, G., Dimas, A. S., Ingle,
C., Nestle, F. O., di Meglio, P., Min, J. L., Wilk, A., Hammond, C. J., Hassanali, N., Yang, T. P.,
Montgomery, S. B., O'Rahilly, S., Lindgren, C. M., Zondervan, K. T., Soranzo, N., Barroso, I.,
Durbin, R., Ahmadi, K., Deloukas, P., McCarthy, M. I., Dermitzakis, E. T., Spector, T. D. & MuTHER
Consortium (2011). The architecture of gene regulatory variation across multiple human tissues: the
MuTHER study. PLoS Genet, 7, e1002003.
Noordermeer, D., Leleu, M., Splinter, E., Rougemont, J., De Laat, W. & Duboule, D. (2011). The
dynamic architecture of Hox gene clusters. Science, 334, 222-5.
Nora, E. P., Lajoie, B. R., Schulz, E. G., Giorgetti, L., Okamoto, I., Servant, N., Piolot, T., van Berkum,
N. L., Meisig, J., Sedat, J., Gribnau, J., Barillot, E., Bluthgen, N., Dekker, J. & Heard, E. (2012).
Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature, 485, 381-5.
Nord, A. S., Blow, M. J., Attanasio, C., Akiyama, J. A., Holt, A., Hosseini, R., Phouanenavong, S.,
Plajzer-Frick, I., Shoukry, M., Afzal, V., Rubenstein, J. L., Rubin, E. M., Pennacchio, L. A. &
Visel, A. (2013). Rapid and pervasive changes in genome-wide enhancer usage during
mammalian development. Cell, 155, 1521-31.
Northcott, P. A., Lee, C., Zichner, T., Stutz, A. M., Erkek, S., Kawauchi, D., Shih, D. J., Hovestadt, V.,
Zapatka, M., Sturm, D., Jones, D. T., Kool, M., Remke, M., Cavalli, F. M., Zuyderduyn, S.,
Bader, G. D., VandenBerg, S., Esparza, L. A., Ryzhova, M., Wang, W., Wittmann, A., Stark, S.,
Sieber, L., Seker-Cin, H., Linke, L., Kratochwil, F., Jager, N., Buchhalter, I., Imbusch, C. D.,
Zipprich, G., Raeder, B., Schmidt, S., Diessl, N., Wolf, S., Wiemann, S., Brors, B., Lawerenz, C.,
Eils, J., Warnatz, H. J., Risch, T., Yaspo, M. L., Weber, U. D., Bartholomae, C. C., von Kalle, C.,
Turanyi, E., Hauser, P., Sanden, E., Darabi, A., Siesjo, P., Sterba, J., Zitterbart, K., Sumerauer,
D., van Sluis, P., Versteeg, R., Volckmann, R., Koster, J., Schuhmann, M. U., Ebinger, M.,
Grimes, H. L., Robinson, G. W., Gajjar, A., Mynarek, M., von Hoff, K., Rutkowski, S., Pietsch,
T., Scheurlen, W., Felsberg, J., Reifenberger, G., Kulozik, A. E., von Deimling, A., Witt, O., Eils,
R., Gilbertson, R. J., Korshunov, A., Taylor, M. D., Lichter, P., Korbel, J. O., Wechsler-Reya, R.
J. & Pfister, S. M. (2014). Enhancer hijacking activates GFI1 family oncogenes in
medulloblastoma. Nature, 511, 428-34.
Ochiai, H., Miyamoto, T., Kanai, A., Hosoba, K., Sakuma, T., Kudo, Y., Asami, K., Ogawa, A.,
Watanabe, A., Kajii, T., Yamamoto, T. & Matsuura, S. (2014). TALEN-mediated single-base-
pair editing identification of an intergenic mutation upstream of BUB1B as causative of PCS
(MVA) syndrome. Proc Natl Acad Sci U S A, 111, 1461-6.
Ong, C. T. & Corces, V. G. (2014). CTCF: an architectural protein bridging genome topology and
function. Nat Rev Genet, 15, 234-46.
Ooi, L. & Wood, I. C. (2007). Chromatin crosstalk in development and disease: lessons from REST. Nat
Rev Genet, 8, 544-54.
Paredes, J., Figueiredo, J., Albergaria, A., Oliveira, P., Carvalho, J., Ribeiro, A. S., Caldeira, J., Costa, A.
M., Simoes-Correia, J., Oliveira, M. J., Pinheiro, H., Pinho, S. S., Mateus, R., Reis, C. A., Leite,
M., Fernandes, M. S., Schmitt, F., Carneiro, F., Figueiredo, C., Oliveira, C. & Seruca, R. (2012).
Epithelial E- and P-cadherins: role and clinical significance in cancer. Biochim Biophys Acta,
1826, 297-311.
Parelho, V., Hadjur, S., Spivakov, M., Leleu, M., Sauer, S., Gregson, H. C., Jarmuz, A., Canzonetta, C.,
Webster, Z., Nesterova, T., Cobb, B. S., Yokomori, K., Dillon, N., Aragon, L., Fisher, A. G. &
Merkenschlager, M. (2008). Cohesins functionally associate with CTCF on mammalian
chromosome arms. Cell, 132, 422-33.
Pastor, W. A., Stroud, H., Nee, K., Liu, W., Pezic, D., Manakov, S., Lee, S. A., Moissiard, G., Zamudio,
N., Bourc'his, D., Aravin, A. A., Clark, A. T. & Jacobsen, S. E. (2014). MORC1 represses
transposable elements in the mouse male germline. Nat Commun, 5, 5795.
Patient, R. K. & McGhee, J. D. (2002). The GATA family (vertebrates and invertebrates). Curr Opin
Genet Dev, 12, 416-22.
Patwardhan, R. P., Hiatt, J. B., Witten, D. M., Kim, M. J., Smith, R. P., May, D., Lee, C., Andrie, J. M.,
Lee, S. I., Cooper, G. M., Ahituv, N., Pennacchio, L. A. & Shendure, J. (2012). Massively
parallel functional dissection of mammalian enhancers in vivo. Nat Biotechnol, 30, 265-70.
Pena-Hernandez, R., Marques, M., Hilmi, K., Zhao, T., Saad, A., Alaoui-Jamali, M. A., del Rincon, S. V.,
Ashworth, T., Roy, A. L., Emerson, B. M. & Witcher, M. (2015). Genome-wide targeting of the
epigenetic regulatory protein CTCF to gene promoters by the transcription factor TFII-I. Proc
Natl Acad Sci U S A, 112, E677-86.
Peters, U., Jiao, S., Schumacher, F. R., Hutter, C. M., Aragaki, A. K., Baron, J. A., Berndt, S. I., Bezieau,
S., Brenner, H., Butterbach, K., Caan, B. J., Campbell, P. T., Carlson, C. S., Casey, G., Chan, A.
T., Chang-Claude, J., Chanock, S. J., Chen, L. S., Coetzee, G. A., Coetzee, S. G., Conti, D. V.,
Curtis, K. R., Duggan, D., Edwards, T., Fuchs, C. S., Gallinger, S., Giovannucci, E. L., Gogarten,
S. M., Gruber, S. B., Haile, R. W., Harrison, T. A., Hayes, R. B., Henderson, B. E., Hoffmeister,
M., Hopper, J. L., Hudson, T. J., Hunter, D. J., Jackson, R. D., Jee, S. H., Jenkins, M. A., Jia, W.
H., Kolonel, L. N., Kooperberg, C., Kury, S., Lacroix, A. Z., Laurie, C. C., Laurie, C. A., Le
Marchand, L., Lemire, M., Levine, D., Lindor, N. M., Liu, Y., Ma, J., Makar, K. W., Matsuo, K.,
Newcomb, P. A., Potter, J. D., Prentice, R. L., Qu, C., Rohan, T., Rosse, S. A., Schoen, R. E.,
Seminara, D., Shrubsole, M., Shu, X. O., Slattery, M. L., Taverna, D., Thibodeau, S. N., Ulrich,
C. M., White, E., Xiang, Y., Zanke, B. W., Zeng, Y. X., Zhang, B., Zheng, W. & Hsu, L. (2013).
Identification of Genetic Susceptibility Loci for Colorectal Tumors in a Genome-Wide Meta-
analysis. Gastroenterology, 144, 799-807.e24.
Peterson, K. R., Costa, F. C., Fedosyuk, H., Neades, R. Y., Chazelle, A. M., Zelenchuk, L., Fonteles, A.
H., Dalal, P., Roy, A., Chaguturu, R., Li, B. & Pace, B. S. (2014). A cell-based high-throughput
screen for novel chemical inducers of fetal hemoglobin for treatment of hemoglobinopathies.
PLoS One, 9, e107006.
Petit, F., Jourdain, A. S., Holder-Espinasse, M., Keren, B., Andrieux, J., Duterque-Coquillaud, M.,
Porchet, N., Manouvrier-Hanu, S. & Escande, F. (2015). The disruption of a novel limb cis-
regulatory element of SHH is associated with autosomal dominant preaxial polydactyly-
hypertrichosis. Eur J Hum Genet, March 18, Epub ahead of print.
Pfeiffer, B. D., Jenett, A., Hammonds, A. S., Ngo, T. T., Misra, S., Murphy, C., Scully, A., Carlson, J.
W., Wan, K. H., Laverty, T. R., Mungall, C., Svirskas, R., Kadonaga, J. T., Doe, C. Q., Eisen, M.
B., Celniker, S. E. & Rubin, G. M. (2008). Tools for neuroanatomy and neurogenetics in
Drosophila. Proc Natl Acad Sci U S A, 105, 9715-20.
Phillips-Cremins, J. E., Sauria, M. E., Sanyal, A., Gerasimova, T. I., Lajoie, B. R., Bell, J. S., Ong, C. T.,
Hookway, T. A., Guo, C., Sun, Y., Bland, M. J., Wagstaff, W., Dalton, S., McDevitt, T. C., Sen,
R., Dekker, J., Taylor, J. & Corces, V. G. (2013). Architectural protein subclasses shape 3D
organization of genomes during lineage commitment. Cell, 153, 1281-95.
Plank, J. L. & Dean, A. (2014). Enhancer function: mechanistic and genome-wide insights come together.
Mol Cell, 55, 5-14.
Pomerantz, M. M., Ahmadiyeh, N., Jia, L., Herman, P., Verzi, M. P., Doddapaneni, H., Beckwith, C. A.,
Chan, J. A., Hills, A., Davis, M., Yao, K., Kehoe, S. M., Lenz, H. J., Haiman, C. A., Yan, C.,
Henderson, B. E., Frenkel, B., Barretina, J., Bass, A., Tabernero, J., Baselga, J., Regan, M. M.,
Manak, J. R., Shivdasani, R., Coetzee, G. A. & Freedman, M. L. (2009). The 8q24 cancer risk
variant rs6983267 shows long-range interaction with MYC in colorectal cancer. Nat Genet, 41,
882-4.
Pope, B. D., Ryba, T., Dileep, V., Yue, F., Wu, W., Denas, O., Vera, D. L., Wang, Y., Hansen, R. S.,
Canfield, T. K., Thurman, R. E., Cheng, Y., Gulsoy, G., Dennis, J. H., Snyder, M. P.,
Stamatoyannopoulos, J. A., Taylor, J., Hardison, R. C., Kahveci, T., Ren, B. & Gilbert, D. M.
(2014). Topologically associating domains are stable units of replication-timing regulation.
Nature, 515, 402-5.
Porcu, E., Sanna, S., Fuchsberger, C. & Fritsche, L. G. (2013). Genotype imputation in genome-wide
association studies. Curr Protoc Hum Genet, Chapter 1, Unit 1.25.
Qi, D. L., Ohhira, T., Fujisaki, C., Inoue, T., Ohta, T., Osaki, M., Ohshiro, E., Seko, T., Aoki, S.,
Oshimura, M. & Kugoh, H. (2011). Identification of PITX1 as a TERT suppressor gene located
on human chromosome 5. Mol Cell Biol, 31, 1624-36.
Qian, J., Lin, J., Luscombe, N. M., Yu, H. & Gerstein, M. (2003). Prediction of regulatory networks:
genome-wide identification of transcription factor targets from gene expression data.
Bioinformatics, 19, 1917-26.
Quinonez, S. C. & Innis, J. W. (2014). Human HOX gene disorders. Mol Genet Metab, 111, 4-15.
Rada-Iglesias, A., Bajpai, R., Prescott, S., Brugmann, S. A., Swigut, T. & Wysocka, J. (2012).
Epigenomic annotation of enhancers predicts transcriptional regulators of human neural crest.
Cell Stem Cell, 11, 633-48.
Rada-Iglesias, A., Bajpai, R., Swigut, T., Brugmann, S. A., Flynn, R. A. & Wysocka, J. (2011). A unique
chromatin signature uncovers early developmental enhancers in humans. Nature, 470, 279-83.
Ramasamy, A., Trabzuni, D., Guelfi, S., Varghese, V., Smith, C., Walker, R., De, T., UK Brain Expression
Consortium, North American Brain Expression Consortium, Coin, L., de Silva, R., Cookson, M. R., Singleton,
A. B., Hardy, J., Ryten, M. & Weale, M. E. (2014). Genetic variability in the regulation of gene
expression in ten regions of the human brain. Nat Neurosci, 17, 1418-28.
Rao, S. S., Huntley, M. H., Durand, N. C., Stamenova, E. K., Bochkov, I. D., Robinson, J. T., Sanborn, A.
L., Machol, I., Omer, A. D., Lander, E. S. & Aiden, E. L. (2014). A 3D map of the human
genome at kilobase resolution reveals principles of chromatin looping. Cell, 159, 1665-80.
Ravasi, T., Suzuki, H., Cannistraci, C. V., Katayama, S., Bajic, V. B., Tan, K., Akalin, A., Schmeier, S.,
Kanamori-Katayama, M., Bertin, N., Carninci, P., Daub, C. O., Forrest, A. R., Gough, J.,
Grimmond, S., Han, J. H., Hashimoto, T., Hide, W., Hofmann, O., Kamburov, A., Kaur, M.,
Kawaji, H., Kubosaki, A., Lassmann, T., van Nimwegen, E., MacPherson, C. R., Ogawa, C.,
Radovanovic, A., Schwartz, A., Teasdale, R. D., Tegner, J., Lenhard, B., Teichmann, S. A.,
Arakawa, T., Ninomiya, N., Murakami, K., Tagami, M., Fukuda, S., Imamura, K., Kai, C.,
Ishihara, R., Kitazume, Y., Kawai, J., Hume, D. A., Ideker, T. & Hayashizaki, Y. (2010). An atlas
of combinatorial transcriptional regulation in mouse and man. Cell, 140, 744-52.
Raviram, R., Rocha, P. P., Bonneau, R. & Skok, J. A. (2014). Interpreting 4C-Seq data: how far can we
go? Epigenomics, 6, 455-7.
Razin, S. V., Borunova, V. V., Maksimenko, O. G. & Kantidze, O. L. (2012). Cys2His2 zinc finger
protein family: classification, functions, and major members. Biochemistry (Mosc), 77, 217-26.
Reynisdottir, I., Arason, A., Einarsdottir, B. O., Gunnarsson, H., Staaf, J., Vallon-Christersson, J.,
Jonsson, G., Ringner, M., Agnarsson, B. A., Olafsdottir, K., Fagerholm, R., Einarsdottir, T.,
Johannesdottir, G., Johannsson, O. T., Nevanlinna, H., Borg, A. & Barkardottir, R. B. (2013).
High expression of ZNF703 independent of amplification indicates worse prognosis in patients
with luminal B breast cancer. Cancer Med, 2, 437-46.
Roadmap Epigenomics Consortium (2015). Integrative analysis of 111 reference human epigenomes.
Nature, 518, 317-30.
Rodelsperger, C., Guo, G., Kolanczyk, M., Pletschacher, A., Kohler, S., Bauer, S., Schulz, M. H. &
Robinson, P. N. (2011). Integrative analysis of genomic, functional and protein interaction data
predicts long-range enhancer-target gene interactions. Nucleic Acids Res, 39, 2492-502.
Rosenbauer, F., Wagner, K., Kutok, J. L., Iwasaki, H., Le Beau, M. M., Okuno, Y., Akashi, K., Fiering,
S. & Tenen, D. G. (2004). Acute myeloid leukemia induced by graded reduction of a lineage-
specific transcription factor, PU.1. Nat Genet, 36, 624-30.
Roussos, P., Mitchell, A. C., Voloudakis, G., Fullard, J. F., Pothula, V. M., Tsang, J., Stahl, E. A.,
Georgakopoulos, A., Ruderfer, D. M., Charney, A., Okada, Y., Siminovitch, K. A., Worthington,
J., Padyukov, L., Klareskog, L., Gregersen, P. K., Plenge, R. M., Raychaudhuri, S., Fromer, M.,
Purcell, S. M., Brennand, K. J., Robakis, N. K., Schadt, E. E., Akbarian, S. & Sklar, P. (2014). A
role for noncoding variation in schizophrenia. Cell Rep, 9, 1417-29.
Sander, J. D., Dahlborg, E. J., Goodwin, M. J., Cade, L., Zhang, F., Cifuentes, D., Curtin, S. J.,
Blackburn, J. S., Thibodeau-Beganny, S., Qi, Y., Pierick, C. J., Hoffman, E., Maeder, M. L.,
Khayter, C., Reyon, D., Dobbs, D., Langenau, D. M., Stupar, R. M., Giraldez, A. J., Voytas, D.
F., Peterson, R. T., Yeh, J. R. & Joung, J. K. (2011). Selection-free zinc-finger-nuclease
engineering by context-dependent assembly (CoDA). Nat Methods, 8, 67-9.
Sander, J. D. & Joung, J. K. (2014). CRISPR-Cas systems for editing, regulating and targeting genomes.
Nat Biotechnol, 32, 347-55.
Sanyal, A., Lajoie, B. R., Jain, G. & Dekker, J. (2012). The long-range interaction landscape of gene
promoters. Nature, 489, 109-13.
Schaub, M. A., Boyle, A. P., Kundaje, A., Batzoglou, S. & Snyder, M. (2012). Linking disease
associations with regulatory information in the human genome. Genome Res, 22, 1748-59.
Schoenfelder, S., Furlan-Magaril, M., Mifsud, B., Tavares-Cadete, F., Sugar, R., Javierre, B. M., Nagano,
T., Katsman, Y., Sakthidevi, M., Wingett, S. W., Dimitrova, E., Dimond, A., Edelman, L. B.,
Elderkin, S., Tabbada, K., Darbo, E., Andrews, S., Herman, B., Higgs, A., LeProust, E., Osborne,
C. S., Mitchell, J. A., Luscombe, N. M. & Fraser, P. (2015). The pluripotent regulatory circuitry
connecting promoters to their long-range interacting elements. Genome Res, 25, 582-97.
Sebastiani, P., Farrell, J. J., Alsultan, A., Wang, S., Edward, H. L., Shappell, H., Bae, H., Milton, J. N.,
Baldwin, C. T., Al-Rubaish, A. M., Naserullah, Z., Al-Muhanna, F., Alsuliman, A., Patra, P. K.,
Farrer, L. A., Ngo, D., Vathipadiekal, V., Chui, D. H., Al-Ali, A. K. & Steinberg, M. H. (2015).
BCL11A enhancer haplotypes and fetal hemoglobin in sickle cell anemia. Blood Cells Mol Dis,
54, 224-30.
Shakya, A., Callister, C., Goren, A., Yosef, N., Garg, N., Khoddami, V., Nix, D., Regev, A. & Tantin, D.
(2015). Pluripotency transcription factor Oct4 mediates stepwise nucleosome demethylation and
depletion. Mol Cell Biol, 35, 1014-25.
Shannon, P. (2014). MotifDb: An annotated collection of Protein-DNA binding sequence motifs.
Bioconductor, R package version 1.8.0.
Sharma, S., Zhou, X., Thibault, D. M., Himes, B. E., Liu, A., Szefler, S. J., Strunk, R., Castro, M.,
Hansel, N. N., Diette, G. B., Vonakis, B. M., Adkinson, N. F., Jr., Avila, L., Soto-Quiros, M.,
Barraza-Villareal, A., Lemanske, R. F., Jr., Solway, J., Krishnan, J., White, S. R., Cheadle, C.,
Berger, A. E., Fan, J., Boorgula, M. P., Nicolae, D., Gilliland, F., Barnes, K., London, S. J.,
Martinez, F., Ober, C., Celedon, J. C., Carey, V. J., Weiss, S. T. & Raby, B. A. (2014). A
genome-wide survey of CD4(+) lymphocyte regulatory genetic variants identifies novel asthma
genes. J Allergy Clin Immunol, 134, 1153-62.
Sheffield, N. C., Thurman, R. E., Song, L., Safi, A., Stamatoyannopoulos, J. A., Lenhard, B., Crawford,
G. E. & Furey, T. S. (2013). Patterns of regulatory activity across diverse human cell types
predict tissue identity, transcription factor binding, and long-range interactions. Genome Res, 23,
777-88.
Shen, Y., Yue, F., McCleary, D. F., Ye, Z., Edsall, L., Kuan, S., Wagner, U., Dixon, J., Lee, L.,
Lobanenkov, V. V. & Ren, B. (2012). A map of the cis-regulatory sequences in the mouse
genome. Nature, 488, 116-20.
Shlyueva, D., Stelzer, C., Gerlach, D., Yanez-Cuna, J. O., Rath, M., Boryn, L. M., Arnold, C. D. & Stark,
A. (2014). Hormone-responsive enhancer-activity maps reveal predictive motifs, indirect
repression, and targeting of closed chromatin. Mol Cell, 54, 180-92.
Siebenlist, U., Hennighausen, L., Battey, J. & Leder, P. (1984). Chromatin structure and protein binding
in the putative regulatory region of the c-myc gene in Burkitt lymphoma. Cell, 37, 381-91.
Simonis, M., Klous, P., Splinter, E., Moshkin, Y., Willemsen, R., de Wit, E., van Steensel, B. & de Laat,
W. (2006). Nuclear organization of active and inactive chromatin domains uncovered by
chromosome conformation capture-on-chip (4C). Nat Genet, 38, 1348-54.
Smith, R. P., Taher, L., Patwardhan, R. P., Kim, M. J., Inoue, F., Shendure, J., Ovcharenko, I. & Ahituv,
N. (2013). Massively parallel decoding of mammalian regulatory sequences supports a flexible
organizational model. Nat Genet, 45, 1021-8.
Stadler, M. B., Murr, R., Burger, L., Ivanek, R., Lienert, F., Scholer, A., van Nimwegen, E., Wirbelauer,
C., Oakeley, E. J., Gaidatzis, D., Tiwari, V. K. & Schubeler, D. (2011). DNA-binding factors
shape the mouse methylome at distal regulatory regions. Nature, 480, 490-5.
Stergachis, A. B., Haugen, E., Shafer, A., Fu, W., Vernot, B., Reynolds, A., Raubitschek, A.,
Ziegler, S., LeProust, E. M., Akey, J. M. & Stamatoyannopoulos, J. A. (2013). Exonic
transcription factor binding directs codon choice and affects protein evolution. Science,
342, 1367-1372.
Stewart, A. J., Hannenhalli, S. & Plotkin, J. B. (2012). Why transcription factor binding sites are ten
nucleotides long. Genetics, 192, 973-85.
Sur, I. K., Hallikas, O., Vaharautio, A., Yan, J., Turunen, M., Enge, M., Taipale, M., Karhu, A., Aaltonen,
L. A. & Taipale, J. (2012). Mice lacking a Myc enhancer that includes human SNP rs6983267 are
resistant to intestinal tumors. Science, 338, 1360-3.
Tashiro, S. & Lanctot, C. (2015). The International Nucleome Consortium. Nucleus, 6, 89-92.
Tenesa, A., Farrington, S. M., Prendergast, J. G., Porteous, M. E., Walker, M., Haq, N., Barnetson, R. A.,
Theodoratou, E., Cetnarskyj, R., Cartwright, N., Semple, C., Clark, A. J., Reid, F. J., Smith, L.
A., Kavoussanakis, K., Koessler, T., Pharoah, P. D., Buch, S., Schafmayer, C., Tepel, J.,
Schreiber, S., Volzke, H., Schmidt, C. O., Hampe, J., Chang-Claude, J., Hoffmeister, M.,
Brenner, H., Wilkening, S., Canzian, F., Capella, G., Moreno, V., Deary, I. J., Starr, J. M.,
Tomlinson, I. P., Kemp, Z., Howarth, K., Carvajal-Carmona, L., Webb, E., Broderick, P.,
Vijayakrishnan, J., Houlston, R. S., Rennert, G., Ballinger, D., Rozek, L., Gruber, S. B., Matsuda,
K., Kidokoro, T., Nakamura, Y., Zanke, B. W., Greenwood, C. M., Rangrej, J., Kustra, R.,
Montpetit, A., Hudson, T. J., Gallinger, S., Campbell, H. & Dunlop, M. G. (2008). Genome-wide
association scan identifies a colorectal cancer susceptibility locus on 11q23 and replicates risk
loci at 8q24 and 18q21. Nat Genet, 40, 631-7.
TheCancerGenomeAtlas (2012a). Comprehensive genomic characterization of squamous cell lung
cancers. Nature, 489, 519-25.
TheCancerGenomeAtlas (2012c). Comprehensive molecular characterization of human colon and rectal
cancer. Nature, 487, 330-7.
TheCancerGenomeAtlas (2012f). Comprehensive molecular portraits of human breast tumours. Nature,
490, 61-70.
Thomassin, H., Flavin, M., Espinás, M. L. & Grange, T. (2001). Glucocorticoid-induced DNA
demethylation and gene memory during development. EMBO J, 20, 1974-83.
Thurman, R. E., Rynes, E., Humbert, R., Vierstra, J., Maurano, M. T., Haugen, E., Sheffield, N. C.,
Stergachis, A. B., Wang, H., Vernot, B., Garg, K., John, S., Sandstrom, R., Bates, D., Boatman,
L., Canfield, T. K., Diegel, M., Dunn, D., Ebersol, A. K., Frum, T., Giste, E., Johnson, A. K.,
Johnson, E. M., Kutyavin, T., Lajoie, B., Lee, B. K., Lee, K., London, D., Lotakis, D., Neph, S.,
Neri, F., Nguyen, E. D., Qu, H., Reynolds, A. P., Roach, V., Safi, A., Sanchez, M. E., Sanyal, A.,
Shafer, A., Simon, J. M., Song, L., Vong, S., Weaver, M., Yan, Y., Zhang, Z., Zhang, Z.,
Lenhard, B., Tewari, M., Dorschner, M. O., Hansen, R. S., Navas, P. A., Stamatoyannopoulos,
G., Iyer, V. R., Lieb, J. D., Sunyaev, S. R., Akey, J. M., Sabo, P. J., Kaul, R., Furey, T. S.,
Dekker, J., Crawford, G. E. & Stamatoyannopoulos, J. A. (2012). The accessible chromatin
landscape of the human genome. Nature, 489, 75-82.
Tolhuis, B., Palstra, R. J., Splinter, E., Grosveld, F. & de Laat, W. (2002). Looping and interaction
between hypersensitive sites in the active beta-globin locus. Mol Cell, 10, 1453-65.
Tomlins, S. A., Rhodes, D. R., Perner, S., Dhanasekaran, S. M., Mehra, R., Sun, X. W., Varambally, S.,
Cao, X., Tchinda, J., Kuefer, R., Lee, C., Montie, J. E., Shah, R. B., Pienta, K. J., Rubin, M. A. &
Chinnaiyan, A. M. (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in
prostate cancer. Science, 310, 644-8.
Tomlinson, I. P., Webb, E., Carvajal-Carmona, L., Broderick, P., Howarth, K., Pittman, A. M., Spain, S.,
Lubbe, S., Walther, A., Sullivan, K., Jaeger, E., Fielding, S., Rowan, A., Vijayakrishnan, J.,
Domingo, E., Chandler, I., Kemp, Z., Qureshi, M., Farrington, S. M., Tenesa, A., Prendergast, J.
G., Barnetson, R. A., Penegar, S., Barclay, E., Wood, W., Martin, L., Gorman, M., Thomas, H.,
Peto, J., Bishop, D. T., Gray, R., Maher, E. R., Lucassen, A., Kerr, D., Evans, D. G., Schafmayer,
C., Buch, S., Volzke, H., Hampe, J., Schreiber, S., John, U., Koessler, T., Pharoah, P., van Wezel,
T., Morreau, H., Wijnen, J. T., Hopper, J. L., Southey, M. C., Giles, G. G., Severi, G., Castellvi-
Bel, S., Ruiz-Ponte, C., Carracedo, A., Castells, A., Forsti, A., Hemminki, K., Vodicka, P.,
Naccarati, A., Lipton, L., Ho, J. W., Cheng, K. K., Sham, P. C., Luk, J., Agundez, J. A., Ladero,
J. M., de la Hoya, M., Caldes, T., Niittymaki, I., Tuupanen, S., Karhu, A., Aaltonen, L., Cazier, J.
B., Campbell, H., Dunlop, M. G. & Houlston, R. S. (2008). A genome-wide association study
identifies colorectal cancer susceptibility loci on chromosomes 10p14 and 8q23.3. Nat Genet, 40,
623-30.
Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., Salzberg, S. L.,
Wold, B. J. & Pachter, L. (2010). Transcript assembly and quantification by RNA-Seq reveals
unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol, 28,
511-5.
Tupler, R., Perini, G. & Green, M. R. (2001). Expressing the human genome. Nature, 409, 832-833.
Urnov, F. D., Rebar, E. J., Holmes, M. C., Zhang, H. S. & Gregory, P. D. (2010). Genome editing with
engineered zinc finger nucleases. Nat Rev Genet, 11, 636-46.
van Berkum, N. L., Lieberman-Aiden, E., Williams, L., Imakaev, M., Gnirke, A., Mirny, L. A., Dekker, J.
& Lander, E. S. (2010). Hi-C: a method to study the three-dimensional architecture of genomes. J
Vis Exp, (39), e1869.
Van Bortle, K. & Corces, V. G. (2014). Lost in transition: dynamic enhancer organization across naive
and primed stem cell states. Cell Stem Cell, 14, 693-4.
Vanhille, L., Griffon, A., Maqbool, M. A., Zacarias-Cabeza, J., Dao, L. T., Fernandez, N., Ballester, B.,
Andrau, J. C. & Spicuglia, S. (2015). High-throughput and quantitative assessment of enhancer
activity in mammals by CapStarr-seq. Nat Commun, 6, 6905.
Vaquerizas, J. M., Kummerfeld, S. K., Teichmann, S. A. & Luscombe, N. M. (2009). A census of human
transcription factors: function, expression and evolution. Nat Rev Genet, 10, 252-63.
Vietri Rudan, M., Barrington, C., Henderson, S., Ernst, C., Odom, D. T., Tanay, A. & Hadjur, S. (2015).
Comparative Hi-C Reveals that CTCF Underlies Evolution of Chromosomal Domain
Architecture. Cell Rep, 10, 1297-309.
Vile, G. F. & Winterbourn, C. C. (1989). Microsomal lipid peroxidation induced by adriamycin,
epirubicin, daunorubicin and mitoxantrone: a comparative study. Cancer Chemother Pharmacol,
24, 105-8.
Visel, A., Blow, M. J., Li, Z., Zhang, T., Akiyama, J. A., Holt, A., Plajzer-Frick, I., Shoukry, M., Wright,
C., Chen, F., Afzal, V., Ren, B., Rubin, E. M. & Pennacchio, L. A. (2009a). ChIP-seq accurately
predicts tissue-specific activity of enhancers. Nature, 457, 854-858.
Visel, A., Rubin, E. M. & Pennacchio, L. A. (2009c). Genomic views of distant-acting enhancers. Nature,
461, 199-205.
Vrtacnik, P., Ostanek, B., Mencej-Bedrac, S. & Marc, J. (2014). The many faces of estrogen signaling.
Biochem Med (Zagreb), 24, 329-42.
Wakabayashi, Y., Watanabe, H., Inoue, J., Takeda, N., Sakata, J., Mishima, Y., Hitomi, J., Yamamoto, T.,
Utsuyama, M., Niwa, O., Aizawa, S. & Kominami, R. (2003). Bcl11b is required for
differentiation and survival of alphabeta T lymphocytes. Nat Immunol, 4, 533-9.
Wang, D., Rendon, A. & Wernisch, L. (2013a). Transcription factor and chromatin features predict genes
associated with eQTLs. Nucleic Acids Res, 41, 1450-63.
Wang, J., Zhuang, J., Iyer, S., Lin, X., Whitfield, T. W., Greven, M. C., Pierce, B. G., Dong, X., Kundaje,
A., Cheng, Y., Rando, O. J., Birney, E., Myers, R. M., Noble, W. S., Snyder, M. & Weng, Z.
(2012). Sequence features and chromatin structure around the genomic regions bound by 119
human transcription factors. Genome Res, 22, 1798-812.
Wang, J., Zhuang, J., Iyer, S., Lin, X. Y., Greven, M. C., Kim, B. H., Moore, J., Pierce, B. G., Dong, X.,
Virgil, D., Birney, E., Hung, J. H. & Weng, Z. (2013b). Factorbook.org: a Wiki-based database
for transcription factor-binding data generated by the ENCODE consortium. Nucleic Acids Res,
41, D171-6.
Wang, Q., Carroll, J. S. & Brown, M. L. (2005). Spatial and temporal recruitment of androgen receptor
and its coactivators involves chromosomal looping and polymerase tracking. Mol Cell, 19, 631-42.
Wang, S., Sun, H., Ma, J., Zang, C., Wang, C., Wang, J., Tang, Q., Meyer, C. A., Zhang, Y. & Liu, X. S.
(2013f). Target analysis by integration of transcriptome and ChIP-seq data with BETA. Nat
Protoc, 8, 2502-15.
Watanabe, H., Ma, Q., Peng, S., Adelmant, G., Swain, D., Song, W., Fox, C., Francis, J. M., Pedamallu,
C. S., DeLuca, D. S., Brooks, A. N., Wang, S., Que, J., Rustgi, A. K., Wong, K. K., Ligon, K. L.,
Liu, X. S., Marto, J. A., Meyerson, M. & Bass, A. J. (2014). SOX2 and p63 colocalize at genetic
loci in squamous cell carcinomas. J Clin Invest, 124, 1636-45.
Weinhold, N., Jacobsen, A., Schultz, N., Sander, C. & Lee, W. (2014). Genome-wide analysis of
noncoding regulatory mutations in cancer. Nat Genet, 46, 1160-5.
Weinstein, J. N., Collisson, E. A., Mills, G. B., Shaw, K. R., Ozenberger, B. A., Ellrott, K., Shmulevich,
I., Sander, C. & Stuart, J. M. (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nat
Genet, 45, 1113-20.
Wendt, K. S., Yoshida, K., Itoh, T., Bando, M., Koch, B., Schirghuber, E., Tsutsumi, S., Nagae, G.,
Ishihara, K., Mishiro, T., Yahata, K., Imamoto, F., Aburatani, H., Nakao, M., Imamoto, N.,
Maeshima, K., Shirahige, K. & Peters, J. M. (2008). Cohesin mediates transcriptional insulation
by CCCTC-binding factor. Nature, 451, 796-801.
Westra, H. J. & Franke, L. (2014). From genome to function by studying eQTLs. Biochim Biophys Acta,
1842, 1896-1902.
Whitaker, J. W., Nguyen, T. T., Zhu, Y., Wildberg, A. & Wang, W. (2015). Computational schemes for
the prediction and annotation of enhancers from epigenomic assays. Methods, 72, 86-94.
Wiench, M., John, S., Baek, S., Johnson, T. A., Sung, M. H., Escobar, T., Simmons, C. A., Pearce, K. H.,
Biddie, S. C., Sabo, P. J., Thurman, R. E., Stamatoyannopoulos, J. A. & Hager, G. L. (2011).
DNA methylation status predicts cell type-specific enhancer activity. EMBO J, 30, 3028-39.
Worsley Hunt, R. & Wasserman, W. W. (2014). Non-targeted transcription factors motifs are a systemic
component of ChIP-seq datasets. Genome Biol, 15, 412.
Xu, X., Bieda, M., Jin, V. X., Rabinovich, A., Oberley, M. J., Green, R. & Farnham, P. J. (2007). A
comprehensive ChIP-chip analysis of E2F1, E2F4, and E2F6 in normal and tumor cells reveals
interchangeable roles of E2F family members. Genome Res, 17, 1550-61.
Yan, W., Cao, Q. J., Arenas, R. B., Bentley, B. & Shao, R. (2010). GATA3 inhibits breast cancer
metastasis through the reversal of epithelial-mesenchymal transition. J Biol Chem, 285, 14042-
51.
Yang, T. P., Beazley, C., Montgomery, S. B., Dimas, A. S., Gutierrez-Arcelus, M., Stranger, B. E.,
Deloukas, P. & Dermitzakis, E. T. (2010). Genevar: a database and Java application for the
analysis and visualization of SNP-gene associations in eQTL studies. Bioinformatics, 26, 2474-6.
Yao, L., Shen, H., Laird, P. W., Farnham, P. J. & Berman, B. P. (2015). Inferring regulatory element
landscapes and transcription factor networks from cancer methylomes. Genome Biol, 16, 105.
Yao, L., Tak, Y. G., Berman, B. P. & Farnham, P. J. (2014). Functional annotation of colon cancer risk
SNPs. Nat Commun, 5, 5114.
Yochum, G. S., Cleland, R. & Goodman, R. H. (2008). A genome-wide screen for beta-catenin binding
sites identifies a downstream enhancer element that controls c-Myc gene expression. Mol Cell
Biol, 28, 7368-79.
Yu, Y., Wang, J., Khaled, W., Burke, S., Li, P., Chen, X., Yang, W., Jenkins, N. A., Copeland, N. G.,
Zhang, S. & Liu, P. (2012). Bcl11a is essential for lymphoid development and negatively
regulates p53. J Exp Med, 209, 2467-83.
Zanke, B. W., Greenwood, C. M., Rangrej, J., Kustra, R., Tenesa, A., Farrington, S. M., Prendergast, J.,
Olschwang, S., Chiang, T., Crowdy, E., Ferretti, V., Laflamme, P., Sundararajan, S., Roumy, S.,
Olivier, J. F., Robidoux, F., Sladek, R., Montpetit, A., Campbell, P., Bezieau, S., O'Shea, A. M.,
Zogopoulos, G., Cotterchio, M., Newcomb, P., McLaughlin, J., Younghusband, B., Green, R.,
Green, J., Porteous, M. E., Campbell, H., Blanche, H., Sahbatou, M., Tubacher, E., Bonaiti-Pellie,
C., Buecher, B., Riboli, E., Kury, S., Chanock, S. J., Potter, J., Thomas, G., Gallinger, S.,
Hudson, T. J. & Dunlop, M. G. (2007). Genome-wide association scan identifies a colorectal
cancer susceptibility locus on chromosome 8q24. Nat Genet, 39, 989-94.
Zentner, G. E. & Scacheri, P. C. (2012). The chromatin fingerprint of gene enhancer elements. J Biol
Chem, 287, 30888-96.
Zhang, L., Smit-McBride, Z., Pan, X., Rheinhardt, J. & Hershey, J. W. (2008). An oncogenic role for the
phosphorylated h-subunit of human translation initiation factor eIF3. J Biol Chem, 283, 24047-60.
Zhao, Z., Tavoosidana, G., Sjolinder, M., Gondor, A., Mariano, P., Wang, S., Kanduri, C., Lezcano, M.,
Sandhu, K. S., Singh, U., Pant, V., Tiwari, V., Kurukuti, S. & Ohlsson, R. (2006). Circular
chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated
intra- and interchromosomal interactions. Nat Genet, 38, 1341-7.
Zheng, R. & Blobel, G. A. (2010). GATA Transcription Factors and Cancer. Genes Cancer, 1, 1178-88.
Zheng, X., Zhao, Q., Wu, H. J., Li, W., Wang, H., Meyer, C. A., Qin, Q. A., Xu, H., Zang, C., Jiang, P.,
Li, F., Hou, Y., He, J., Wang, J., Wang, J., Zhang, P., Zhang, Y. & Liu, X. S. (2014).
MethylPurify: tumor purity deconvolution and differential methylation detection from single
tumor DNA methylomes. Genome Biol, 15, 419.
Zhou, H. Y., Katsman, Y., Dhaliwal, N. K., Davidson, S., Macpherson, N. N., Sakthidevi, M., Collura, F.
& Mitchell, J. A. (2014). A Sox2 distal enhancer cluster regulates embryonic stem cell
differentiation potential. Genes Dev, 28, 2699-711.
Zhu, X., Ling, J., Zhang, L., Pi, W., Wu, M. & Tuan, D. (2007). A facilitated tracking and transcription
mechanism of long-range enhancer function. Nucleic Acids Res, 35, 5532-44.
Ziller, M. J., Gu, H., Muller, F., Donaghey, J., Tsai, L. T., Kohlbacher, O., De Jager, P. L., Rosen, E. D.,
Bennett, D. A., Bernstein, B. E., Gnirke, A. & Meissner, A. (2013). Charting a dynamic DNA
methylation landscape of the human genome. Nature, 500, 477-81.
Zuin, J., Dixon, J. R., van der Reijden, M. I., Ye, Z., Kolovos, P., Brouwer, R. W., van de Corput, M. P.,
van de Werken, H. J., Knoch, T. A., van IJcken, W. F., Grosveld, F. G., Ren, B. & Wendt, K. S.
(2014). Cohesin and CTCF differentially affect chromatin architecture and gene expression in
human cells. Proc Natl Acad Sci U S A, 111, 996-1001.
Abstract
Enhancers are short regulatory sequences located distal to transcription start sites that are bound by sequence-specific transcription factors and play a major role in the spatiotemporal specificity of gene expression patterns in development and disease. While it is now possible to identify enhancer regions genome-wide in both cultured cells and primary tissues using epigenomic approaches, it has been more challenging to develop methods to understand the function of individual enhancers because enhancers are located far from the gene(s) they regulate. However, it is essential to identify target genes of enhancers not only so that we can understand the role of enhancers in disease but also because this information will assist in the development of future therapeutic options. In this dissertation, I employ two strategies to characterize cancer-associated enhancers. The first strategy involves a functional characterization of enhancers linked to genetic predisposition for colorectal cancer (CRC). Genome-wide association studies (GWAS) have identified single nucleotide polymorphisms (SNPs) associated with increased risk for CRC. By integrating genetic and epigenetic information, I identified 23 promoters and 28 enhancers harboring CRC GWAS SNPs or correlated SNPs. Using gene expression data from normal and tumor cells, I identified 66 putative target genes of these risk-associated enhancers. In the second strategy, I developed ELMER (Enhancer Linking by Methylation/Expression Relationships), an R-based tool integrating DNA methylation and gene expression profiles from primary tissues, to systematically infer multi-level transcription factor regulatory networks. I then applied ELMER to 10 cancer types, including more than 2,000 tumor samples from The Cancer Genome Atlas. I identified cancer-specific enhancers and linked the enhancers to putative target genes and upstream regulatory TFs, developing cancer-associated regulatory networks. 
My studies identified several networks regulated by well-known cancer drivers, such as GATA3 and FOXA1 (breast cancer), SOX17 and FOXA2 (endometrial cancer), and NFE2L2, SOX2 and TP63 (squamous cell lung cancer), and also identified novel networks with prognostic associations, including RUNX1 in kidney cancer.
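The idea of linking an enhancer to a target gene through a methylation/expression relationship can be illustrated with a minimal Python sketch. This is not ELMER's code or its statistical test: ELMER is an R-based tool, and the function names, the Pearson correlation, the -0.5 cutoff, and the toy values below are all assumptions chosen only for illustration. The sketch flags genes whose expression falls as methylation at an enhancer probe rises across samples.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def link_enhancer_to_genes(methylation, expression_by_gene, max_r=-0.5):
    """Return (gene, r) pairs whose expression is inversely correlated
    with methylation at one enhancer probe (hypothetical cutoff max_r)."""
    links = []
    for gene, expression in expression_by_gene.items():
        r = pearson(methylation, expression)
        if r <= max_r:
            links.append((gene, r))
    # Strongest inverse relationship first.
    return sorted(links, key=lambda pair: pair[1])

# Toy data: one probe measured in six samples (values are made up).
methylation = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
expression_by_gene = {
    "GENE_A": [1.0, 1.5, 2.0, 6.0, 7.5, 8.0],  # rises as methylation falls
    "GENE_B": [4.0, 4.2, 3.9, 4.1, 4.0, 4.2],  # roughly flat
}
links = link_enhancer_to_genes(methylation, expression_by_gene)
```

With these toy values only GENE_A passes the cutoff. A real analysis would restrict candidates to genes near the probe and correct for multiple testing; those steps are left out here.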
Asset Metadata
Creator: Yao, Lijing (author)
Core Title: Identification and characterization of cancer-associated enhancers
School: Keck School of Medicine
Degree: Doctor of Philosophy
Degree Program: Genetic, Molecular and Cellular Biology
Publication Date: 11/11/2015
Defense Date: 09/11/2015
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: chromatin interaction, computational tools, DNA methylation, enhancer, gene editing, gene expression, GWAS, OAI-PMH Harvest, single nucleotide polymorphism
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Berman, Benjamin (committee chair), Farnham, Peggy (committee chair), Knowles, James (committee member), Siegmund, Kimberly (committee member)
Creator Email: lijingya@usc.edu, yaolij@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-199860
Unique identifier: UC11276836
Identifier: etd-YaoLijing-4032.pdf (filename), usctheses-c40-199860 (legacy record id)
Legacy Identifier: etd-YaoLijing-4032.pdf
Dmrecord: 199860
Document Type: Dissertation
Rights: Yao, Lijing
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA