Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Using genomics to understand the gene selectivity of steroid hormone receptors
(USC Thesis Other)
Using genomics to understand the gene selectivity of steroid hormone receptors
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Using genomics to understand the gene selectivity of
steroid hormone receptors
by
Dai-Ying Wu
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(GENETICS MOLECULAR AND CELLULAR BIOLOGY)
December 2014
2
3
Table of Contents
List of Figures ................................................................................................................................................ 4
List of Tables ................................................................................................................................................. 5
Chapter 1 Background .................................................................................................................................. 6
Microarray................................................................................................................................................. 7
Sequencing ................................................................................................................................................ 8
Chapter 2 Differential coregulator requirement at hormone regulated genes ......................................... 11
Introduction ............................................................................................................................................ 11
Methods .................................................................................................................................................. 13
Results ..................................................................................................................................................... 14
Discussion................................................................................................................................................ 45
Chapter 3 Investigation of differential transcription factor binding for ChIP-seq ...................................... 49
Introduction ............................................................................................................................................ 49
Methods .................................................................................................................................................. 53
Results ..................................................................................................................................................... 62
Discussion................................................................................................................................................ 84
Conclusion ............................................................................................................................................... 86
Chapter 4 Conclusion .................................................................................................................................. 88
References .................................................................................................................................................. 92
4
List of Figures
Figure 2-1 MDS plot showing failed array in G9a/GLP data ....................................................................... 14
Figure 2-2 MDS plot before and after batch effect correction ................................................................... 15
Figure 2-3 Effect of hormone treatment and coregulator depletion ......................................................... 16
Figure 2-4 Venn diagrams showing overlap between comparisons with no fold change cutoff ............... 18
Figure 2-5 Venn diagrams showing overlap between comparisons using 2-fold change cutoff ................ 19
Figure 2-6 Overlap within modulated and blocked gene sets .................................................................... 20
Figure 2-7 Difference in histone at TSS of blocked genes ........................................................................... 22
Figure 2-8 Histone marks around TSS of all genes ...................................................................................... 23
Figure 2-9 Histone marks around TSS of blocked genes ............................................................................. 23
Figure 2-10 Histone marks around TSS of hormone regulated genes ........................................................ 24
Figure 2-11 Histone marks around TSS difference between hormone regulated versus all genes ............ 24
Figure 2-12 Direction of regulation between blocked genes and hormone ............................................... 25
Figure 2-13 Dex-regulated fold change of blocked genes compared to hormone regulated genes .......... 26
Figure 2-14 Venn diagrams showing overlap between comparisons in other steriod hormone systems . 27
Figure 2-15 Direction of regulation between modulated genes and hormone .......................................... 28
Figure 2-16 Effect of coregulator depletion on TNFR2 signaling pathway ................................................. 29
Figure 2-17 Hormone response in TNFR2 signaling .................................................................................... 30
Figure 2-18 Acute Phase Response ............................................................................................................. 36
Figure 2-19 Interferon Signaling ................................................................................................................. 39
Figure 2-20 Top canonical pathways .......................................................................................................... 41
Figure 2-21 Upstream regulator prediction ................................................................................................ 42
Figure 2-22 WGCNA results on module-trait relationships ........................................................................ 44
Figure 2-23 Model for coregulator control of gene specificity ................................................................... 46
Figure 3-1 Typical workflow for identifying differential transcription factor binding ................................ 50
Figure 3-2 Overview of peak types and defining reference binding regions .............................................. 53
Figure 3-3 Distance between summits of peaks ......................................................................................... 55
Figure 3-4 Results from condition-level comparison methods ................................................................... 63
Figure 3-5 Overlapping conditions in NRF1 and ERα .................................................................................. 64
Figure 3-6 MA plots for sample-level comparison methods ...................................................................... 67
Figure 3-7 Differences in results of sample-level comparison methods .................................................... 69
Figure 3-8 Voom MAplots ........................................................................................................................... 72
Figure 3-9 Clustering of fold changes from top differentially bound regions from each method ............. 74
Figure 3-10 Clustring between all pairwise comparisons ........................................................................... 79
Figure 3-11 Smoothed scatter plot of log fold change versus G+C content ............................................... 80
Figure 3-12 Boxplot of differences in fold change between ChIP-qPCR validation and differential ChIP-seq
methods ...................................................................................................................................................... 81
Figure 3-13 Overrepresentation of artifact and chrM reads ...................................................................... 83
Figure 4-1 Model of cellular processes involved in gene regulation .......................................................... 89
5
List of Tables
Table 2-1 Summary of expression microarray experiments with coregulator depletion ........................... 13
Table 3-1 Differential binding methods ...................................................................................................... 51
Table 3-2 Summary of comparisons ........................................................................................................... 56
Table 3-3 Number of significant differential binding regions ..................................................................... 65
6
Chapter 1 Background
The human body is made up of a myriad of specialized cell types. These specialized cell types, such as
skin cells, liver cells and heart cells, provide the basis through which different organs form while using
the same genetic template. When the body encounters stimuli, these specialized cell types are capable
of responding differently. For example, estrogen and androgen hormones are important for female and
male sexual development; however, most cells are not important for sexual development and must
either ignore these hormones or respond to these hormones differently than how reproductive organs
respond. This raises the question of how different cells can respond to the same stimuli in different ways.
This differential response can be measured through the activation of a different profile of genes upon
hormone treatment in different types of cells.
Cells begin in a pluripotent state and become more differentiated and specialized as development
progresses. Of the over 20 thousand genes in human genome, only a subset these genes are expressed.
These expressed genes are found in uncondensed regions of the genome, open chromatin, where
transcription factor proteins can easily bind to DNA. Different cell types will have different regions of the
genome accessible and a different gene expression profile. These gene expression profiles are not
permanent and cellular response to stimuli can trigger the expression a different set of genes. By fully
understanding how gene expression is regulated, we would have the potential to control genes
expression in the future.
To understand underlying mechanisms of gene regulation, the process through which the transcription
or expression of genes is controlled, we focus on how cells respond to stimuli. We use steroid hormones,
which can readily pass through the lipid bilayer of the cell, as stimulus to study gene regulation. Steroid
hormones are important for many physiological functions including stress response, immune response,
and development (Biddie et al. 2012; Clegg 2012). Many types of steroid hormones exist and they are
categorized based on the receptor that they bind to. All steroid hormones share a similar structure
related to cholesterol and include estrogen, androgen, progesterone, mineralocorticoid, and
glucocorticoid. Upon treatment of cells with hormone, receptors will bind hormone, dimerize, and
translocate into the nucleus (if the receptor is in the cytoplasm) where they bind DNA. Upon DNA
binding, other proteins are also recruited to form a complex that will activate or repress downstream
target genes (Tsai and O’Malley 1994). Changes in target gene expression could further modulate gene
expression leading to secondary responses. This cascade of gene expression changes from stimulus can
lead to the characterized physiological responses.
Each steroid hormone causes a variety of physiological effects in different cells. While useful for
controlling multiple physiological pathways using the same stimulus, treatment with hormones in a
clinical setting poses many challenges due to their pleiotropic effects (Rosen and Miner 2005). Although
hormones are used clinically for a subset of their physiological effects, they also have multiple undesired
and severe side-effects. For example, glucocorticoids are a class of hormones that are one of the most
widely prescribed and potent anti-inflammation treatments; however, side effects from treatment
include muscle wasting, hyperglycemia, insulin resistance, weight gain, and osteoporosis (Biddie et al.
2012; Gross and Cidlowski 2008; Strehl and Buttgereit 2013; Schäcke et al. 2002). Thus, when
7
prescribing hormones for therapy, side effects must also be monitored and managed. By understanding
how gene regulation occurs upon hormone stimulus, drugs could be designed with higher specificity and
fewer side effects. Combination therapy with hormones and inhibitors could inhibit pathways of side
effects and increase the specificity of hormone treatments in the clinic (Turgeon et al. 2004).
One approach to modulate hormone specificity involves changing the chemical structure of hormones.
Using this approach, selective receptor modulators could be created that change the receptor-hormone
binding conformation in such a way that receptors would bind and selectively regulate a subset of
physiological pathways of interest. Unfortunately, after years of research, not much progress has been
made through this approach for glucocorticoid receptor (De Bosscher 2010) though this approach has
had some mixed success with estrogen receptor (Komm and Mirkin 2014). The next step in hormone
regulation involves the hormone-receptor complex binding to DNA and recruiting proteins to aid
transcriptional activation or repression. These proteins recruited to the receptor are called coregulators
and are a promising avenue for controlling hormone specificity (Lonard and O’Malley 2012; Dasgupta et
al. 2014). Our lab specializes in studying coregulators and understanding the mechanism through which
they function.
Coregulators have been shown to modulate gene expression at many well studied hormone receptor
target genes (Lonard and O’Malley 2012). While studying the protein complex associated with the
steroid receptor, hundreds of coregulators have been identified. Some of these coregulators have
known histone reading and modification function while others are important for recruitment of
transcriptional machinery (Chen et al. 1999; Lee et al. 2006; Kim et al. 2003). The requirement of
coregulator proteins for transcription differs between genes with coregulator depletion resulting in both
increased and decreased induction of a subset of hormone target genes. This suggests that coregulator
proteins could be an avenue to achieve hormone specificity (Heinlein and Chang 2002; Lonard and
O’Malley 2012). We hypothesize that hormone specificity is achieved through coregulators by having
shared combinations of coregulators at target genes involved in similar physiological pathways.
While there are many thousands of genes with expression changed by hormone, only a handful have
been characterized in detail. To understand how these coregulators affect gene regulation by steroid
hormones outside of a small selection of well characterized target genes, we employ genome scale
assays to interrogate changes in gene expression as well as receptor occupancy changes upon
coregulator depletion. Gene expression changes due to hormone regulation caused by specific
coregulators can be compared to annotated physiological pathways to evaluate these differences in the
context of known hormone regulated pathways.
Microarray
Gene expression microarrays are a method to interrogate genome-wide RNA expression profiles. These
DNA microarrays involve a DNA probe attached to a slide that will fluoresce (typically green or red)
when a fluorescently labeled complementary DNA hybridizes to this probe. The level of fluoresce for
each probe is quantified to provide a signal intensity value for each probe corresponding to the
abundance of bound complementary DNA. For gene expression microarrays, these probes are designed
to bind DNA that is complementary to sections of gene exons (Schulze and Downward 2001). Other
8
types of microarrays such as DNA methylation arrays and arrays for ChIP-chip will target methylated
regions of the genome and known protein binding regions respectively (Hoheisel 2006).
How these probes are distributed on each array will differ based on the microarray manufacturer. For
humans, the two primary expression microarray manufacturers are Affymetrix and Illumina. Affymetrix
introduced their arrays earlier and had multiple probes targeting each gene. The signal value from a set
of probes (probset) associated with each gene is summarized, using linear models (Irizarry et al. 2003),
to obtain a signal intensity value corresponding to mRNA abundance for each gene. Illumina probe
design differs in that typically only a single probe is designed for each gene and a high number of
replicates (average 10+) for this probe are included. This approach recognizes the shortcomings from
the Affymetrix probe layout since signals reported had varying intensities. With multiple identical probes,
the variability in signal intensity for each probe could be measured. However, the Illumina approach
relies heavily on properly designed probes since few probes are assigned to each gene.
Signal values obtained for each probe need to be pre-processed before they can be used for differential
expression analysis. Different technologies require different specialized methods but many pre-
processing steps are shared. Common pre-processing steps include quality assessment, background
correction, normalization, and signal transformation. The goal of background correction is to remove the
background fluoresce intensity to obtain a more accurate estimate of signal intensity. However,
background subtraction can create negative signal values that cannot be log transformed and unless
there are probes designed to measure background, estimating background intensity is often nontrivial.
Transformations, such as log transform, are used to fit the data closer to a normal distribution and
change multiplicative noise into additive noise leading to variance becoming more independent of
intensity (Quackenbush 2001; Rocke and Durbin 2001). Lastly, normalization (e.g. quantile normalization)
is necessary to compare between experiments and must be performed both within and between arrays.
After pre-processing, these signal values between conditions can be compared to identify significantly
differential genes. Instead of using a t-test to compare, we use a moderated t-test that borrows from
other genes for its variance estimate since microarray experiments used in biomedical research labs
have very few biological replicates (typically 2-4) and a much higher number of features (20k+ genes). I
apply the empirical Bayes approach to shrink variance towards a pooled estimate from the limma
package (Smyth 2004). Next, multiple testing correction must be applied to account for false discoveries
by chance, since we are typically making over 20,000 comparisons (number of probes on the array)
(Noble 2009). Lastly, for my most recent work, I have been using knitR to document my code and my
findings in order to make my results more reproducible.
Sequencing
DNA sequencing technologies are used as the endpoint for a variety of biological assays including
measuring gene expression, protein binding, chromatin accessibility and histone modifications. This
technique is more sensitive at detecting changes than microarrays when used for expression profiling ( ’t
Hoen et al. 2008). Presently, multiple competing DNA sequencing technologies are spearheaded by
different biotech companies. The most widely used sequencing method currently for research is from
Illumina and originally developed by Solexa. Their method (sequencing by synthesis), uses fluorescent
9
dyes attached to nucleotides to sequence up to 250 base pairs from each end of a DNA fragment.
Although 250 base pairs are the maximum, shorter reads are often used (50 base pairs) due to lower
costs and lack of advantages from longer sequencing reads for certain assays. For this method, special
adapters are attached to each DNA fragment and loaded into a multi-lane flow cell. Within each flow cell,
bridge amplification is performed to amplify the amount of DNA. Next, sequencing occurs with the
addition of fluorescent dyes attached to different nucleotides. These nucleotides are excited with a laser
while a camera captures fluorescence intensity of each spot; and this intensity is decoded into base pairs
and quality scores. By repeating the process of adding fluorescently labeled base pairs, exciting, cleaving,
and washing away the fluorophore, each spot can be decoded into a specific order of nucleotides called
a sequencing read and represents the DNA sequence from one end of the fragment (Kircher et al. 2011).
Reads with base pairs and quality scores are stored in FASTQ files and then mapped to a reference
genome. For most of my work, I use the hg19 reference genome (GRCh37), however, hg38 (GRCh38)
was recently released and would be the ideal choice for new experiments as annotations are ported
over. If no reference organism is available, reads could be assembled using specialized de-novo
assembly algorithms (Earl et al. 2011). Popular methods to map Illumina sequence reads (short reads)
include bwa (Li and Durbin 2009), bowtie (Langmead et al. 2009), SOAP (Li et al. 2009), subread (Liao et
al. 2013), and Novoalign (from Novocraft). Specialized tools also exist for specific assays such as
identifying differential exon usage (Grabherr et al. 2011; Kim et al. 2013), ChIP-seq (Zhang et al. 2008),
and DNAse-seq (Baek et al. 2012). If a large percentage of reads do not align, including mitochondrial
chromosome or checking for Mycoplasma contamination or Epstein-Barr virus using BLAST against all
organisms may be useful to identify the source of the contamination.
Both before and after mapping, quality control metrics should be assessed – such as percentage of low
quality base calls, low quality alignments, and read duplications. GC content is also another useful metric
to check that both the sequencing run was successful and there was sufficient biological variation in the
sequencing run (Ross et al. 2013). Read trimming and read filtering could help reduce noise at later
steps. For expression studies, lowly expressed genes are sometimes filtered out.
Unlike gene expression microarrays where signal is modeled as a normal distribution, sequence read
counts are typically modeled as a Negative Binomial distribution due to the higher variance from the
increased dynamic range possible with this new technology (Pepke et al. 2009; Robinson and Smyth
2007). Thus new methods have been developed to analyze expression data on this platform (Anders and
Huber 2010; Robinson and Smyth 2008) which model discrete read counts per gene as following the
negative binomial distribution. One interesting method attempts to transform the read count
distribution to a normal distribution and apply pre-existing microarray methods (Law et al. 2014). Since
sequencing does not require knowledge of pre-existing features to design probes against, one can use
this technology to assay genomic events where the DNA location of the features of interest are
unknown such as protein binding (ChIP-seq), chromatin accessibility (DNAse-seq), and DNA methylation
(bisulfite sequencing). Although these assays all use sequencing, specialized methods have been
developed to analyze sequencing data from these different assays.
10
Ultimately these genome-wide assays are tools to measure and aid in understanding the underlying
biology. My interest and focus will be on interpreting and combining data from these assays to gain
insight into the mechanism of biological regulation as well as comparing and identifying the proper
analysis approach for different datasets. As biology moves more and more towards high throughput
assays, leveraging these big datasets that are generated and publically available will be essential for
research and will help prevent duplication of work as well as open up opportunities where re-analysis of
multiple datasets could yield novel findings.
11
Chapter 2 Differential coregulator requirement at hormone regulated
genes
The microarray work presented in this chapter was performed by Irina Ianculescu (C4-2B cell line, p300
and CBP depletions), Danielle Bittencourt (A549 cell line, EHMT1, EHMT2 depletions), Chen-Yin Ou
(A549 cell line, CCAR1, CCAR2, CALCOCOA, ZNF282 depletions), and Rajas Chodankar (U2OS-GR cell line
and U2OS-ER cell lines, TGFB1I1 depletion).
Introduction
Nuclear receptors are ligand-regulated transcription factors through which the cell responds to external
stimuli. They can detect the presence of a small molecule ligand (e.g. a hormone, vitamin or metabolite)
and modify cellular gene expression to respond accordingly. The steroid hormone receptors including
the receptors for estrogens, progestins, androgens, glucocorticoids, and mineralocorticoids form one
class of nuclear receptors. Canonical steroid receptor function involves the receptor binding to its ligand,
which alters receptor conformation and potentiates binding to a specific related set of DNA motifs that
serve as regulatory elements for specific genes. The DNA-bound receptors recruit a large number of
transcriptional coregulator proteins, which remodel chromatin and regulate the assembly or
disassembly of active transcription complexes on the transcription start sites of the genes associated
with the enhancer/silencer elements. Coregulators are essential for proper gene regulation, and
coregulator mutants are involved in several diseases (Lonard and O’Malley 2012).
Glucocorticoid receptor (GR, official symbol NR3C1) is activated in humans by the steroid hormone
cortisol, which is produced in the adrenal cortex in response to many types of stress and serves a
homeostatic function by regulating many different physiological pathways. Synthetic glucocorticoids,
such as dexamethasone (dex), are one of the most widely prescribed classes of drugs, used clinically for
their anti-inflammatory and immune-suppressive effects and in some cancer chemotherapy regimens.
They are highly effective but have a host of deleterious side effects such as weight gain, insulin
resistance, hyperglycemia, hyperlipidemia, osteoporosis, and muscle wasting (Biddie et al. 2012; Gross
and Cidlowski 2008; Strehl and Buttgereit 2013). This reflects the role of glucocorticoids in regulating
inflammation and immune response, as well as metabolism of glucose, lipids, and bone, among other
physiological pathways.
A number of recent studies, each focusing on a single coregulator, indicated that steroid receptor
coregulators function in a gene-specific manner and are required for regulation of only a subset of the
genes activated or repressed by a steroid hormone and its receptor (Kim et al. 2008; Yu et al. 2011; Kim
et al. 2003; Yu et al. 2013). This invites the hypothesis that different coregulators could regulate
different physiological pathways controlled by glucocorticoids (Fan and Morand 2012; Dasgupta et al.
2014). Such a hypothesis necessitates that different coregulators are required for hormonal regulation
of different sets of genes. However, direct comparisons of the gene-specific actions of multiple
coregulators for a specific steroid receptor in a single cell line have yet to be reported. To test this
hypothesis, we conducted an unbiased, genome-wide analysis of the effects of depleting four different
coregulators on glucocorticoid-regulated gene expression in the A549 lung adenocarcinoma cell line. We
expected to find different but overlapping subsets of genes that are controlled by each coregulator, and
12
I used pathway analysis to test whether these gene subsets represent different known physiological
pathways that are regulated by glucocorticoid hormones.
The four nuclear receptor coregulators used in this study were chosen based on known physical and
functional interactions and some structural homology. CCAR1 (cell cycle and apoptosis regulator 1, also
known as CARP1), is important for cell cycle regulation and binds to nuclear receptors in a hormone
dependent manner (Kim et al. 2008). CCAR2 (also known as deleted in breast cancer 1, DBC1, or
KIAA1967) is a paralog of CCAR1 and has been shown to work synergistically with CCAR1 (Yu et al. 2011)
in a hormone dependent manner to coactivate target gene expression. CoCoA (coiled-coil coactivator,
gene name CALCOCO1) has a coiled-coil domain, binds the p160 coactivator complex and enhances
transcriptional activation of nuclear receptors (Kim et al. 2003). Lastly ZNF282 (homolog of Zfp282 in
mice and rats, also known as HUB1), has been shown to bind and coactivate estrogen receptor and
function synergistically with CoCoA (Yu et al. 2013). Structurally, ZNF282 has five C2H2 zinc fingers that
can bind DNA along with a repressive KRAB domain (Yu et al. 2013). CCAR1, CCAR2, and ZNF282 can bind
to the C-terminal activation domain of CoCoA. Several combinations of these coregulators act
cooperatively to enhance steroid receptor activity in transient reporter gene assays. We thus proposed
to test whether the physical and functional relationships among these four coregulators might result in
substantial overlap in the subsets of glucocorticoid-regulated genes and the corresponding physiological
pathways they control.
In addition to these coregulators, I have also analyzed other sets of coregulators in different cell lines
using different hormones (Table 2-1). CBP and p300 are two homologous proteins that can activate
genes through their histone and protein acetyltransferase activity or by recruiting other coregulators
and components of transcriptional machinery (Holmqvist and Mannervik 2013). G9a and GLP are
another two homologous proteins that have histone methyl transferase function and are associated
with transcriptional repression, however, we have found both coactivating and corepressive functions of
G9a that work through different domains (Lee et al. 2006). Lastly, Hic-5 is a coregulator that can bind the
hinge domain of GR and has been found to block the hormone regulation of a subset of genes
(Chodankar et al. 2014). I will briefly cover some of the analysis from these experiments but my primary
focus will be on the multi-coregulator depletion described earlier.
Depletion Receptor
Cell type
Timepoint Collaborator and Reference
CBP (CREBBP) AR
C4-2B
0h, 16h Ianculescu I (Ianculescu et al. 2012)
p300 (EP300) AR
C4-2B
0h, 16h Ianculescu I (Ianculescu et al. 2012)
G9a (EHMT2) GR
A549
0h, 6h, 24h Bittencourt D (Bittencourt et al. 2012)
GLP (EHMT1) GR
A549
0h, 6h, 24h Bittencourt D (unpublished)
Hic-5 (TGFB1I1) GR
U2OS-GR
0h, 2h, 4h, 24h Chodankar R (Chodankar et al. 2014)
Hic-5 (TGFB1I1) GRγ
U2OS-GR
0h, 2h, 4h, 24h Chodankar R (in prep)
Hic-5 (TGFB1I1) ER
U2OS-ER
0h, 4h, 24h Chodankar R (in prep)
CCAR1 GR
A549
0h, 6h Ou CY (Wu et al. 2014b)
CoCoA (CALCOCOA) GR
A549
0h, 6h Ou CY (Wu et al. 2014b)
DBC1 (CCAR2) GR
A549
0h, 6h Ou CY (Wu et al. 2014b)
ZFP282 (ZNF282) GR
A549
0h, 6h Ou CY (Wu et al. 2014b)
13
Table 2-1 Summary of expression microarray experiments with coregulator depletion
This table shows the coregulators studies that I worked on including which coregulator was depleted, which receptor was
activated, at what time-point were cells collected after no hormone treatment (0h) and hormone treatment and lastly the
collaborator I worked with on analyzing this data. Unless otherwise mentioned, GR is for the alpha isoform.
Methods
Cell culture and RNA interference
See references in Table 2-1 for details.
Microarray analysis
For G9a and GLP depletion, we used Illumina HuRef8v3 microarrays starting with bead level data and
used BASH (Cairns et al. 2008) to remove spatial artifacts before collapsing into bead summarized values
using beadarray (Dunning et al. 2008). Data has been deposited to Gene Expression Omnibus (GEO)
under accession GSE35962. CBP and p300 depletion used the newer Illumina HT12v4 microarray and we
started with expression values output by GenomeStudio. Combat was used to remove batch effects
(Johnson et al. 2007) and this data has been deposited to GEO under accession GSE31873. The Hic-5
microarrays were also performed using Illumina HT12v4 microarrays and output by GenomeStudio. No
obvious batch effects were found and published data was deposited to GEO under accession GSE46448.
For the CCAR1, CoCoA, CCAR2, ZNF282 depletions, we used Illumina HT12v4 microarrays to interrogate
the genome-wide mRNA levels 6 hours after treatment of A549 cells with ethanol control or 100 nM of
dex hormone, using cells that were previously transfected with a control siNS (non-specific sequence) or
siRNA directed against CCAR1, CCAR2, CoCoA, or ZNF282. All experiments had 4 biological replicates
performed on different days except our siNS control (6 replicates) and siZNF282 (2 replicates). Analysis
of the microarray data was performed using bioconductor package beadarray (v2.8.1) (Dunning et al.
2008) to process the raw data. The raw data can be found under GEO accession number GSE58715. We
filtered non-specific probes based on prior annotation for all microarrays (Barbosa-Morais et al. 2010)
and used limma to preprocess (Shi et al. 2010) and identify differentially expressed genes (Smyth 2004).
The Illumina HT12v4 microarray has over 45k different probes with most genes represented by a single
probe, thus the term gene was used when describing numbers of significant probes found from the
microarray. We performed and overlapped three two-way comparisons to identify hormone-regulated
in control (NS), hormone-regulated in coregulator depletion (si) and effect of coregulator depletion on
post-hormone gene expression. I used q-value to estimate the false discovery rate and account for
multiple hypothesis testing and calculated q-values using (Storey 2002) and called a difference in gene
expression significant if there was greater than 1.5x fold change and q-value of < 0.05. Weighted venn
diagrams were based on weights calculated from eulerAPE (Micallef and Rodgers 2012). We used IPA
(Ingenuity Systems, www.ingenuity.com) to identify pathways affected by each of the gene sets that we
identified. Modulated genes were identified using a significant cutoff of q-value less than < 0.05 without
fold-change cutoff.
14
Results
Quality control in pre-processing can catch important data quality issues
Figure 2-1 MDS plot and outlier plot of removed array in G9a/GLP data
MDS plot (left) showing the first two dimensions with last two digits of array number shown and with each color representing
experiments so the same colors represent replicates of the same experiment. Outlier plot for all 8 wells on the microarray that
was removed with blue dots representing and outlier.
In several of the datasets that I analyzed, quality control issues were found that needed to be addressed.
In the G9a/GLP experiments, the samples were not randomly assigned onto the arrays and had an array
that did not cluster with the rest of the experiments and this array had many outliers that were spread
out. I ended up removing the experiments from this array (label 22 in Figure 2-1) before proceeding with
my analysis. In the CBP/p300 experiments, we found batch effects with replicates 1 and 3 clustering
close together. We evaluated both SNM and Combat methods to remove batch effects (Mecham et al.
2010; Johnson et al. 2007) and ended up using Combat. Before batch correction, we did not identify any
significantly differential genes for CBP depletion. After batch correction (Figure 2-2), we were able to
identify a few differential genes, which we validated.
15
Figure 2-2 MDS plot before and after batch effect correction
MDS plot showing the first two dimensions with the treatment, timepoint, and replicate number used as label. The left figure is
before batch correction and is colored by batch (replicate) and the right figure is after batch correction.
16
Coregulator depletion alters hormonal regulation of subsets of genes
Figure 2-3 Effect of hormone treatment and coregulator depletion
(a) Experimental design. For each coregulator, A549 cells were transfected with coregulator-specific (siCoR) or control non-
specific (siNS) siRNA, and cells were subsequently treated with dex or ethanol for 6 hours. Genome-wide microarray analyses of
RNA with multiple independent biological replicates of each condition were conducted. The 3 comparisons of interest are:
before and after hormone in control (NS- vs NS+), before and after hormone in coregulator depletion (si- vs si+), with and
without coregulator depletion after hormone (NS+ vs si+). (b) Venn diagrams are used to show the overlap between significant
genes from the three comparisons. Overlapping compartments are labeled as shown for convenient reference. For each
comparison, the theoretical data bars representing relevant expression values being compared are shown in black beside the
17
venn diagram, while the bars representing expression values that are not part of the comparison are shown in gray. (c-f) Venn
diagrams were created from the overlapping statistically significant effects of hormone and coregulator depletion to visualize
the number of genes affected by depletion of CCAR1 (c), CoCoA (d), CCAR2 (e), and ZNF282 (f). The size of each ellipse and
overlap compartment in the venn diagram is proportional to the number of genes affected. We called a difference in gene
expression significant if there was greater than 1.5-fold change and a false discovery q-value < 0.05.
To evaluate the relationship between coregulator and hormone, we identified and overlapped a set of
hormone-regulated genes with no coregulator depletion (Figure 2-3a, comparison 1), hormone-
regulated genes upon coregulator depletion (Figure 2-3a, comparison 2), and coregulator-regulated
genes, i.e. genes from dex-treated cells with different mRNA levels when comparing cells containing and
lacking a coregulator (Figure 2-3a, comparisons 3). From these comparisons, we identified dex-regulated
genes that are not significantly affected by coregulator depletion (Figure 2-3b compartment i), genes
affected by coregulator depletion but not significantly hormone-regulated (Figure 2-3b ii), genes that
remained hormone-regulated after coregulator depletion (Figure 2-3b v, vii), hormone-regulated genes
that were affected by coregulator depletion (Figure 2-3b iii, vii) and genes that gained hormone
regulation upon coregulator depletion (Figure 2-3b iv, vi). In our detailed analysis below, we do not
focus on genes that gain hormone regulation upon coregulator depletion but do not have statistically
different gene expression levels after hormone treatment (Figure 2-3b iv) because this situation occurs
with changes of gene expression in the absence of hormone caused by coregulator depletion (baseline
differences). In general, we found fewer hormone-regulated genes after coregulator depletion
(comparison 2 smaller than comparison 1), and thousands of genes were affected by the depletion
(comparison 3), with ZNF282 depletion having the largest effect with almost 3000 genes affected while
the other coregulator depletions have around 1100 genes affected (Figure 2-3c-f).
Except for ZNF282, the majority of hormone-regulated genes remained hormone-regulated after
coregulator depletion (Figure 2-3b compartments v + vii are large compared to iv + vi and i + iii). Of the
genes regulated by dex either in the presence (comparison 1) or absence (comparison 2) of coregulator,
we found around 250 genes with their dex-regulated level of expression altered by depletion of CCAR1
(245), CoCoA (241), or CCAR2 (249), while 348 genes were affected by depletion of ZNF282 (Figure 2-3b
iii, vi, vii). These genes correspond to about one fifth of all hormone-regulated genes (CCAR1: 254 / 1208
= 0.20, CoCoA : 241 / 1292 = 0.19, CCAR2 : 249 / 1249 = 0.20, ZNF282: 348 / 1202 = 0.29, Figure 2-3b iii,
vii, vi / all but ii). Additionally, with the exception of CoCoA, there were more coregulator-regulated
genes that lost hormonal regulation upon coregulator depletion (Figure 2-3b iii) than genes that newly
acquired hormonal regulation after depletion of coregulator (Figure 2-3b vi) ( iii vs. vi: 100 vs. 29 for
CCAR1, 55 vs. 75 for CoCoA, 82 vs. 53 for CCAR2, 236 vs. 53 for ZNF282). Lastly, of the 1033 hormone-
regulated genes (comparison 1), 53% were affected by depletion of one or more of the four coregulators
studied (unique genes in the sum of compartments iii + vii for all four coregulators).
The fact that ZNF282 depletion altered expression of a much larger number of genes (Figure 2-3f
comparison 3) and reduced a larger number of hormone-regulated genes than depletion of the other
three coregulators (Figure 2-3f comparison 2 vs. comparison 1) suggests that ZNF282 may have a greater
impact in gene regulation than the other three coregulators. Using no fold change cutoff (instead of the
18
1.5-fold cutoff used in Figure 2-3), we found similar patterns of overlaps (Figure 2-4). When a more
stringent fold change cutoff of 2 was applied, most hormone-regulated genes were still shared in
comparisons 1 and 2 (compartments v + vii), but a lower percentage of coregulator-regulated genes
were hormone-regulated (Figure 2-5, compartments iii + vii). We concluded from this comparison that
the pattern of overlaps between the different coregulators remains consistent with different fold
change cutoffs, with higher fold-change cutoffs detecting fewer coregulator effects.
a) CCAR1 b) CoCoA
c) CCAR2 d) ZNF282
Figure 2-4 Venn diagrams showing overlap between comparisons with no fold change cutoff
Venn diagrams representing the number of genes significantly affected by depletion of CCAR1 (a), CoCoA (b), CCAR2 (c), and
ZNF282 (d). The size of each ellipse and overlap compartment in the venn diagram is proportional to the number of genes
affected. We called a difference in gene expression significant if there was a q-value < 0.05
19
a) CCAR1 b) CoCoA
c) CCAR2 d) ZNF282
Figure 2-5 Venn diagrams showing overlap between comparisons using 2-fold change cutoff
Venn diagrams representing the number of genes significantly affected by depletion of CCAR1 (a), CoCoA (b), CCAR2 (c), and
ZNF282 (d). The size of each ellipse and overlap compartment in the venn diagram is proportional to the number of genes
affected. We called a difference in gene expression significant if there was greater than 2-fold change and q-value< 0.05
Since this study analyzed the roles of coregulators in dex-regulated gene expression, the subsequent
sections focus on a subset of genes located in the compartments where effects of coregulator depletion
(comparison 3) overlap with effect of dex (comparisons 1 or 2)(compartments iii, vi, and vii). We
defined as "coregulator-modulated genes" those in compartments iii and vii, i.e. genes that were
regulated by hormone in the presence of coregulator (comparison 1) and regulated by coregulator in the
presence of dex (comparison 3). Among modulated genes, compartment iii contained genes that lose
significant regulation by dex after coregulator depletion, while compartment vii represents genes that
are significantly dex-regulated in the presence or absence of coregulator, but whose mRNA level in the
presence of dex was significantly altered by coregulator depletion. We defined "blocked genes" as those
20
in compartment vi, i.e. genes that only become dex-regulated after coregulator depletion and whose
mRNA level in the presence of dex was significantly altered by coregulator depletion. Blocked genes
indicate a novel and perhaps unexpected type of coregulator function where the coregulator was
blocking the hormone response for genes in this class. We found several hundred modulated genes for
each coregulator (CCAR1: 216, CoCoA: 166, CCAR2: 196, ZNF282: 295) and a smaller number of blocked
genes (CCAR1: 29, CoCoA: 75, CCAR2: 53, ZNF282: 53).
Evaluating statistical power in our experiment
Since our modulated and blocked classes were defined by overlapping significant genes from several
comparisons, genes in these classes must reject the null hypothesis in more than one situation. As it is
harder to pass two tests than one test, we became interested in the topic of power because we would
have less power to detect the genes that pass two tests. These genes in modulated and blocked classes
that pass multiple tests greatly exceed the fraction of genes expected to be overlapping by chance since
the number of genes in each class is small compared to total genes. I first calculate the power of a single
test to detect differences of 1.5 fold, I determined the significance level necessary (Jung 2005) and then
used the pwr package (Champely 2012). With 3 replicates, we have to power to detect >99% of
differences with a fold change of 1.5 using the 75
th
percentile of shrunken variance from eBayes at 5%
FDR. Since we use 4 replicates for most of the comparisons, we also have >99% power to detect
changes. Although all true positives that pass multiple tests may not have been recovered, we believe
that the majority of our results are true with few false negatives since our power is so high.
Modulated and blocked regulatory genes were often unique to each coregulator
Figure 2-6 Overlap within modulated and blocked gene sets
Four-way venn diagrams are used to indicate the overlap of the dex-regulated genes that belong to the modulated (a) or
blocked (b) gene sets for each of the four coregulators. Numbers indicate number of genes. The darkest shading indicates genes
that are common to all four coregulators, and the unshaded regions indicate genes that are uniquely regulated by a single
coregulator. The dark areas highlighted in the accompanying three-way venn diagrams indicate how the modulated and
blocked gene sets were derived from Figure 2-3.
21
To evaluate coregulator specificity, we overlapped the genes found for each coregulator in each of the
two gene classes of interest to look for shared regulation between different coregulators. Modulated
genes (Figure 2-6a) have some overlap between the different coregulators; however, about one third of
the genes were unique to each depleted coregulator (CCAR1: 38%, CoCoA: 31%, CCAR2: 32%, ZNF282:
39%) with about another third shared with one other coregulator (CCAR1: 38%, CoCoA: 36%, CCAR2:
39%, ZNF282: 38%) and the remaining genes shared by two or three coregulators. Blocked genes (Figure
2-6b) were nearly all unique to each coregulator, indicating that this type of function was highly specific
for each coregulator. The largest overlap in blocked genes occurred between CoCoA blocked genes and
CCAR2 blocked genes with 8 genes shared (out of 53 genes for CCAR2 and 75 genes for CoCoA).
Surprisingly, the high structural homology between CCAR1 and CCAR2 and the previously reported
physical and functional interactions among these four coregulators do not lead to substantial
proportions of modulated and blocked genes overlapping between the various pairs of proteins.
22
Blocked genes have similar chromatin profile to hormone-regulated genes
Figure 2-7 Difference in histone at TSS of blocked genes
We compare the fraction of histone marks that cover the transcription start sites (TSS) of blocked genes, hormone-regulated
genes and all genes. a) Differences in proportion of each histone mark at TSS of blocked genes versus TSS of all genes on
microarray. b) Differences in proportion of each histone mark at TSS of blocked genes versus TSS of hormone-regulated genes.
23
Figure 2-8 Histone marks around TSS of all genes
Bargraphs showing the fraction of transcription start sites of all genes with the specified histone mark.
Figure 2-9 Histone marks around TSS of blocked genes
Bargraphs showing the fraction of transcription start sites of blocked genes with the specified histone mark.
24
Figure 2-10 Histone marks around TSS of hormone regulated genes
Bargraphs showing the fraction of transcription start sites of hormone regulated genes with the specified histone mark.
Figure 2-11 Histone marks around TSS difference between hormone regulated versus all genes
Bargraphs showing the fraction of transcription start sites of hormone regulated genes versus all genes with the specified
histone mark.
To investigate blocked genes, we scanned the transcription start site (TSS) for various activating and
repressive histone marks. We wished to look for patterns in the chromatin structure that might give
insight into mechanism of regulation for blocked genes. We used publically available ENCODE histone
ChIP-seq data for the A549 cell line and scanned a window of -100 bp to +10 bp around the TSS to
identify the significant histone marks. We evaluated the presence of 4 activating histone marks
(H3K4me1, H3K4me3, H3K9ac, H3K27ac) and 2 repressive marks (H3K9me3, H3K27me3) around the TSS
of 193 different blocked genes, 1033 hormone-regulated genes, and all genes (Figure 2-8, Figure 2-9,
Figure 2-10, and Figure 2-11). We found differences in the proportion of histone marks at TSS of blocked
genes compared to all genes with about 15% more H3K4me3, and H3K9ac in blocked genes (Figure 2-7
left). Comparing the proportion of histone marks at dex-regulated genes to blocked genes (Figure 2-7
right), we found similar proportions of some histone marks (H3K4me3, H3K9ac, H3K9me3) while
H3K4me1 was more prevalent in hormone-regulated genes and H3K27me3 was more prevalent in
blocked genes. We speculate that although blocked genes share many chromatin features with
hormone-regulated genes, their hormone regulation was blocked by repressive histone marks that were
somehow facilitated by a coregulator.
25
Most blocked genes were up-regulated by hormone
Figure 2-12 Direction of regulation between blocked genes and hormone
Tables showing the direction of regulation by hormone treatment and by coregulator depletion for coregulator-blocked genes.
Four possibilities exist depending on direction of change from hormone treatment and coregulator depletion. Theoretical
examples to illustrate data for each type of regulation are shown along with breakdown of the number of genes in each
category for CCAR1 depletion (a), CoCoA depletion (b), CCAR2 depletion (c), and ZNF282 depletion (d).
To characterize the gene selective action of activation and repression for each coregulator, we analyzed
the direction of regulation of blocked genes. Blocked genes were not significantly dex regulated in the
presence of coregulator but become significantly dex-regulated after coregulator depletion, and their
level of expression in the presence of dex was significantly altered by coregulator depletion (Figure 2-3b
compartment vi). In Figure 2-12a, we show theoretical examples of gene expression changes indicative
of coregulator-blocked activation (upper-left) and coregulator-blocked repression (lower-right) – these
were the predominant categories of blocked genes, where the presence of the coregulator prevented
hormone regulation. The remaining two comparisons (upper-right and lower-left) require changes to
basal gene expression levels prior to hormone treatment due to coregulator depletion (baseline
differences) and represent only a small percentage of the blocked genes (CCAR1: 4 genes (15%), CoCoA:
4 genes (5%), CCAR2: 8 genes (15%), and ZNF282: 5 genes (9%)).
26
Blocked activation (Figure 2-12a, upper left) was the most common type of regulation among the
blocked genes for these four coregulators with 75% of CCAR2, 67% of CoCoA, 55% of ZNF282, and 48%
of CCAR1 genes in this category. This category represents coregulators repressing hormone activation at
these genes. After verifying that each coregulator was effectively depleted by its respective siRNA, my
collaborator validated selected blocked genes by qPCR (Wu et al. 2014b). Among all dex-regulated genes,
some were very highly regulated by dex, in contrast, fold changes for blocked genes were all less than
four (Figure 2-13).
Figure 2-13 Dex-regulated fold change of blocked genes compared to hormone regulated genes
Violin plot showing proportion of genes (vertical axis) from each hormone regulated category and the corresponding hormone-
regulated log fold changes (horizontal axis).
27
Blocked genes also exist in other steroid hormone systems
Figure 2-14 Venn diagrams showing overlap between comparisons in other steriod hormone systems
These overlaps show the three comparisons in U2OS-GRα, U2OS-GRγ, or U2OS-ER cell lines with and without hormone
induction (dex for GR, E2 for ER) and Hic-5 depletion. The “-“ and “+” indicate the absence or presence of steroid hormone. “NS”
indicates non-specific siRNA and “si” indicates siRNA against Hic-5.
In the original study to identify blocked genes, Hic-5 was depleted in U2OS-GRα cells (Chodankar et al.
2014). When we use the same cutoffs as A549 multi-coregulator depletion study, we find more blocked
genes in the Hic-5 depletions than with any of the four coregulators (Figure 2-14). The number of
blocked genes between the three different cell types stays about the same even though the number of
genes affected by depletion after hormone treatment (compartment 3) changes in each cell line.
Additionally, the number of hormone regulated genes after Hic-5 depletion (compartment 2) is higher in
ER system than GR system with also a higher proportion of genes that are only regulated after Hic-5
depletion.
28
Coregulators both support and antagonize the hormone effect on modulated genes in a gene-specific
manner
Figure 2-15 Direction of regulation between modulated genes and hormone
Tables showing the direction of regulation by hormone treatment and by coregulator depletion for coregulator-modulated
genes. Four possibilities exist depending on fold change from hormone treatment and coregulator depletion. Theoretical
examples to illustrate data for each type of regulation are shown along with breakdown of the number of genes in each
category for CCAR1 depletion (a), CoCoA depletion (b), CCAR2 depletion (c), ZNF282 depletion (d).
Next we investigated the direction of regulation of coregulator-modulated genes. Modulated genes
were defined as dex-regulated in the presence of coregulator, and their level of expression in the
presence of dex was altered by coregulator depletion (Figure 2-3b compartment iii and vii). Our work
generalized the idea that each coregulator can function as either a coactivator or corepressor and can
support or oppose the regulation by hormone. However, the ratio of genes that was positively or
negatively regulated was unique to each coregulator (Figure 2-15, ratio between numbers in table for
each coregulator). We define coactivated genes as being activated by hormone and requiring the
presence of the coregulator for activation, corepressed genes as being repressed by hormone and
requiring the presence of the coregulator for repression, anti-activated genes as being activated by
hormone but repressed by coregulator, anti-repressed genes as being repressed by hormone but
activated by coregulator. In Figure 2-15a, we show theoretical examples of gene expression patterns for
29
coactivated (upper-right) and corepressed (lower-left) genes – coregulator activity in the same direction
as dex hormone – while anti-activated (upper-left) and anti-repressed (lower-right) genes show
coregulator activity that opposes the direction of dex regulation.
All four of the coregulators that we depleted have been shown previously to have coactivating activity
for selected target genes in ligand-activated nuclear receptor systems (Kim et al. 2008; Yu et al. 2011;
Kim et al. 2003; Yu et al. 2013). We found that CCAR1 modulated genes were mostly coactivated with
CCAR1 supporting the direction of dex regulation. In contrast, the predominant effects of CoCoA and
ZNF282 depletion on dex-regulated gene expression was negative, with ZNF282 having an anti-activating
effect on many genes. One reason ZNF282 could be having a greater repressive effect compared to the
other coregulators was because of its repressive KRAB domain (Yu et al. 2013). The structurally-related
proteins CCAR1 and CCAR2 share a similar direction of regulation for the modulated genes, with the
majority (80% for CCAR1, 62% for CCAR2) of genes regulated by the coregulator in the same direction as
dex. Over 60% of modulated genes were up-regulated by dex (except for CoCoA which has 49%), and the
anti-repressed effect was the least common type of gene regulation by all coregulators (8% for CCAR1,
8% for CoCoA, 11% for CCAR2, 8% for ZNF282).
Coregulator specificity in differential regulation of physiological pathways
Figure 2-16 Effect of coregulator depletion on TNFR2 signaling pathway
IPA canonical pathway analysis of the TNFR2 pathway. White genes are not regulated by dex. Grey genes are dex-regulated
genes with no change in expression in dex-treated cells after coregulator depletion. Blue genes are dex-regulated genes that
have lower expression in dex-treated cells after coregulator depletion, with intensity corresponding to degree of down-
regulation. Green genes are dex-regulated genes that have higher expression in dex-treated cells after coregulator depletion,
30
with intensity corresponding to degree of up-regulation. P-values from Fisher’s exact test are: 0.24 (CCAR1), 0.0002 (CoCoA),
0.19 (CCAR2), 0.0000008 (ZNF282).
Figure 2-17 Hormone response in TNFR2 signaling
In this IPA analysis of dex-regulated genes in the TNFR2 pathway, white genes are not regulated by dex. Blue genes are down-
regulated by dex, and green genes are up-regulated by dex, with color intensity corresponding to degree of regulation.
31
a) Hormone response
32
b) Effect of CCAR1 depletion
33
c) Effect of CoCoA depletion
34
d) Effect of CCAR2 depletion
35
e) effect of ZNF282 depletion
36
Figure 2-18 Acute Phase Response
In this IPA analysis of dex-regulated genes in the acute phase pathway, (a) white genes are not regulated by dex. Blue genes are
down-regulated by dex, and green genes are up-regulated by dex, with color intensity corresponding to degree of regulation.
The next IPA analyses show the effects of depleting CCAR1 (b), CoCoA (c), CCAR2 (d), and ZNF282 (e) on expression of genes in
the acute phase response pathways in A549 cells treated with dex. White genes are not regulated by dex. Grey genes are dex-
regulated genes with no change in expression in dex-treated cells after coregulator depletion. Blue genes are dex-regulated
genes that have lower expression in dex-treated cells after coregulator depletion. Green genes are dex-regulated genes that
have higher expression in dex-treated cells after coregulator depletion. Color intensity corresponds to degree of regulation.
37
a) Hormone regulation
b) Effect of CCAR1 depletion
38
c) Effect of CoCoA depletion
d) Effect of CCAR2 depletion
39
e) Effect of ZNF282 depletion
Figure 2-19 Interferon Signaling
In this IPA analysis of dex-regulated genes in the interferon signaling pathways, (a) white genes are not regulated by dex. Blue
genes are down-regulated by dex, and green genes are up-regulated by dex, with color intensity corresponding to degree of
regulation. Next, IPA analyses show the effects of depleting CCAR1 (b), CoCoA (c), CCAR2 (d), and ZNF282 (e) on expression of
genes in the interferon signaling pathways in A549 cells treated with dex. White genes are not regulated by dex. Grey genes are
dex-regulated genes with no change in expression in dex-treated cells after coregulator depletion. Blue genes are dex-regulated
genes that have lower expression in dex-treated cells after coregulator depletion. Green genes are dex-regulated genes that
have higher expression in dex-treated cells after coregulator depletion. Color intensity corresponds to degree of regulation.
Glucocorticoids regulate many different developmental, metabolic, and inflammatory pathways and
mediate responses to various types of stress, including hunger, cold, anxiety, and disease. The gene-
specific actions of coregulators, as documented above, provide opportunities for the cell to modulate
the specific genes that are regulated by glucocorticoids, through regulation of coregulator protein levels
or regulation of coregulator activity via protein-protein interactions or post-translational modifications.
To test whether coregulator gene-specificity is associated with specific glucocorticoid-regulated
physiological pathways, we performed Ingenuity Pathway Analysis (IPA) on the coregulator-modulated
gene set for each of the four coregulators examined above. We focused on the anti-inflammatory
actions of glucocorticoids, since these complex pathways are key regulatory targets of GR. In addition,
glucocorticoid regulation of many anti-inflammatory genes is common to a wide variety of cell types,
including the A549 cell line used for this study, whereas other glucocorticoid responsive metabolic
40
pathways may be more tissue specific. IPA canonical pathway analysis found several inflammatory
pathways that were enriched among dex-regulated genes and one or more of the four coregulator-
modulated gene sets under investigation in this study. Here we focus our discussion on the tumor
necrosis factor receptor 2 (TNFR2), acute phase, and interferon signaling pathways, which provide
examples of both shared and coregulator-specific regulation of specific physiological pathways. The
TNFR2 pathway mediates signaling for TNF / and includes the well known NF B inflammatory
pathway as well as the JUN kinase (JNK) pathway, which is one of the three mitogen activated protein
kinase (MAP kinase) pathways (Cabal-Hierro and Lazo 2012). In A549 cells dex down-regulates several
genes in both the NF B and JNK pathways, as expected since glucocorticoids are anti-inflammatory
(Figure 2-17). Depletion of ZNF282, CoCoA and CCAR2 increased expression of several dex-regulated
NF B pathway genes, indicating that these three coregulators support the down-regulation of this
inflammatory pathway (Figure 2-16). ZNF282 was also required for the down-regulation by dex of a key
transcription factors, c-June, at the distal end of the JNK pathway. In contrast, CCAR1 depletion had no
effect on genes in the NF B pathway and had a mixed effect on the JNK pathway.
The acute phase response is a systemic defense system that responds to infection and stress and helps
to prevent infection and initiate inflammatory processes (Cray et al. 2009). This pathway contains both
the NF B and JNK pathways controlled by TNFR2 but also includes the ERK and p38 MAP kinase
pathways and the STAT3 pathway. In addition to down-regulating components of the NF B and JNK
pathways, dex also up-regulates STAT3 in A549 cells (Figure 2-18a). Depletion of ZNF282 and CCAR2
blocked dex inhibition of the NF B pathway and depletion of ZNF282 and CoCoA enhanced dex
stimulation of STAT3 expression, whereas neither CCAR1 nor CCAR2 affected STAT3 expression in dex-
treated A549 cells (Figure 2-18b-e).
In contrast to the different pathway specificities of the four coregulators in the TNFR2 and acute phase
response pathways, all four coregulators had similar effects on components of the interferon pathway in
A549 cells. Dex down-regulated the receptor for interferon / and up-regulated JAK1 and STAT1, key
components of the interferon , and pathways (
Figure 2-19a). Depletion of each of the four coregulators further enhanced the dex-regulated expression
of JAK1 and STAT1 (
Figure 2-19b-e). The only coregulator-specific effect was that depletion of CCAR2 (but none of the other
three coregulators) prevented the down-regulation of the interferon receptor by dex. A heatmap of the
top canonical pathways represented in the dex-regulated gene set and in the coregulator-modulated
gene sets (showing the effect of dex treatment and of coregulator depletion) can be found in Figure 2-20.
Lastly we analyzed the dex-regulated and coregulator-modulated genes using IPA software to identify
candidate upstream regulators whose actions were predicted to be most highly affected (Figure 2-21).
Although the details of the coregulator effects on each upstream regulator are beyond the scope of our
current discussion, we note that the pattern for the effect of CCAR1 depletion was almost completely
opposite to the pattern of effects observed for the other three coregulators, consistent with the
predicted effects on the TNFR2 and acute phase response pathways (Figure 2-16, Figure 2-18). This
41
unbiased analysis further supports the conclusion that the gene-specific actions of coregulators may
correlate with specific physiological pathways. Similar upstream regulator analysis was not performed
on blocked genes due to the small number of affected genes.
Figure 2-20 Top canonical pathways
42
IPA heatmap showing top canonical pathways for hormone (dex) and coregulator-modulated genes with darker shades
corresponding to higher significance of changes in that pathway. Score –log(p-value) calculated from right-tailed Fisher’s exact
test.
Figure 2-21 Upstream regulator prediction
Using interactions between genes mined from literature, IPA predicts the up- or down-regulation of an upstream regulator
based on changes in gene expression. Blue indicates predicted down-regulation of the indicated upstream regulator by dex
treatment or coregulator depletion, and red indicates predicted up-regulation of the indicated upstream regulator by dex
treatment or coregulator depletion, with darker shades indicating greater intensity of up or down regulation.
43
44
WGCNA for coregulator depletion shows mixed results
Figure 2-22 WGCNA results on module-trait relationships
This figure shows the relationship between modules (colors on the vertical) and the various depletions or hormone induction
(traits on the horizontal axis). Modules are sets of genes with similar expression pattern changes identified from tree cutting
after using WGCNA. The colors show correlations with traits with p-values in parenthesis. Negative correlations are in blue
while positive correlations are in red with intensity corresponding to degree of correlation.
45
Another way to identify interesting genes is by looking for clusters of genes with similar expression
patterns across multiple conditions. Since we deplete four different coregulators and also treat them
with hormone, I used WGCNA to identify interesting groups of genes that are specific to coregulator
depletion or hormone induction (Langfelder and Horvath 2008). I identified several modules (each
module corresponding to a color) that correlated with one or more traits (coregulator depletions).
Pathway analyses for genes in modules with high correlation were inconclusive and further investigation
was not pursued since we decided to focus on effects of coregulator on steroid hormone regulation
(Figure 2-22).
Discussion
Gene-specific actions of coregulators. Previous published studies from a number of laboratories, that
examined the effects of individual coregulators on steroid hormone-regulated gene expression indicated
that coregulators act in a gene-specific manner and were required for hormonal regulation of only a
subset of all hormone-regulated genes (Kim et al. 2008; Yu et al. 2011; Kim et al. 2003; Yu et al. 2013;
Lonard and O’Malley 2012; Chodankar et al. 2014; Bittencourt et al. 2012). These results also suggested
that different coregulators support the regulation of different sets of hormone-regulated genes, but
direct comparisons of multiple coregulators have not been reported, to our knowledge. In the study
reported here we conducted a direct, unbiased and genome-wide comparison of the gene-specific roles
of four different coregulators in glucocorticoid-regulated gene expression.
Our results extend our knowledge of the gene-specific actions of coregulators in a number of ways. 1) In
a previous study (Chodankar et al. 2014) with Hic-5 in U2OS osteosarcoma cells, we identified three
classes of dex-regulated genes, based upon the effect of Hic-5 depletion: dex regulation modulated by
Hic-5 (Hic5-modulated genes), dex regulation blocked by the coregulator (blocked genes), and Hic-5-
independent genes (independent). Here, by examining four different coregulators we show that these
classes (including the novel blocked class) are common to coregulators in general, indicating that
multiple coregulators share this function. 2) As previously reported each coregulator has both positive
and negative effects on the expression of different dex-regulated genes (Bittencourt et al. 2012; Purcell
et al. 2011), but we show here that the ratio of positive to negative effects varies with the specific
coregulator. 3) Likewise, each coregulator can support the actions of the hormone on some genes and
oppose dex action on other genes, but the specific ratio of these effects was also coregulator-specific.
We thus quantify the extent to which each coregulator functioned as a coactivator (supporting gene
activation by dex/GR), a corepressor (supporting gene repression by dex/GR), an anti-activator
(opposing gene activation by dex/GR) or an anti-repressor (opposing gene repression by dex/GR). 4) We
show that even when pairs of coregulators have extensive structural homology or have demonstrated
physical and functional interactions (in transient reporter gene assays), they influence the dex-regulated
expression of quite different sets of genes. For each of the four coregulators examined, one-third of the
modulated genes they influence were unique to that coregulator, and another third of the modulated
gene set was shared with only one of the other three coregulators. 5) In silico pathway analysis
indicated that different coregulators can have a dramatically different influence on different
physiological pathways regulated by glucocorticoids. Altogether, our findings demonstrate by direct
comparison that, although there was some overlap, each coregulator influences the expression of a
46
unique subset of dex-regulated genes. Furthermore, our results are consistent with the hypothesis that
the gene-specificity of coregulators has important physiological implications.
The proportion of shared regulated genes among the four coregulators was quite low with many genes
being regulated in a coregulator-specific manner (Figure 2-6). The structural homology between CCAR1
and CCAR2 does not increase the proportion of genes with shared regulation, compared to other gene
pairs. Of the 216 CCAR1-modulated genes, the overlap with other sets of modulated genes were 59 for
CCAR2, 55 for CoCoA, and 89 for ZNF282; of the 29 CCAR1-blocked genes, 2 were shared with CCAR2, 2
with CoCoA, and none with ZNF282 (Figure 2-6). A small number of genes that required two structurally
related coregulators has also been shown before for CBP and p300 (Ianculescu et al. 2012; Kahn 2011).
Figure 2-23 Model for coregulator control of gene specificity
Regulation of gene expression occurs at multiple levels. In our model system, nuclear receptors bind to enhancers after
hormone treatment and recruits coregulators. Different coregulators are recruited to each enhancer and each enhancer
regulates a few target genes. Coregulators can serve as an intermediate regulatory step for hormone regulation through
differential recruitment to enhancers as well as having both coactivating and corepressive effects.
The mechanistic explanation for why different genes require different coregulators for their dex-
regulated expression presumably lies in the unique regulatory context of each gene. When studying
hormone-regulated gene expression, regulation can occur at multiple levels. GR, as well as other DNA-
binding transcription factors, can bind to a related but extremely diverse set of motifs located within
47
enhancer and silencer elements, and each transcription factor can regulate different sets of genes in
different cell types. GR can recruit different combinations of coregulator proteins to its binding sites in
the regulatory elements, and each element also has different requirements for the types of coregulators
required for dex-regulated expression of the associated gene (Figure 2-23). Whether or not a
coregulator was recruited by GR to an enhancer element and required for dex-regulated expression of
the associated gene will depend on the regulatory context of each regulatory element, because each
regulatory element exists in a unique environment dictated by several different factors: GR binding site
sequence, which influences GR conformation; binding of nearby transcription factors to nearby or
distant interacting sites, which can contribute to the complement of coregulators recruited to the site;
and local chromatin structure, which can help to dictate the specific coregulators required to establish a
an appropriate chromatin environment (open or condensed) for the positive or negative regulatory
actions directed by GR. Our results reflect the diversity of regulatory environments and the resulting
diversity of coregulator requirements among the dex-regulated genes.
Validation of the coregulator-modulated, blocked, and independent classes of dex-regulated genes.
For all four coregulators, we identified genes that fall into the coregulator-modulated, coregulator-
blocked, and coregulator-independent classes of dex-regulated genes, consistent with our previous
observations of these three classes for the action of the coregulator Hic-5 in dex-regulated gene
expression in U2OS cells (Chodankar et al. 2014). Each set of coregulator-modulated genes in this study
represented 16-29% of all genes that were regulated by dex without coregulator depletion (Figure 2-3
compartments iii + vii compared with comparison 1), suggesting that each coregulator influences a
distinct portion of the biological response to dex. The blocked gene sets were considerably smaller than
the modulated sets, but nevertheless we find that this class was common to all of the coregulators
tested, indicating that each coregulator prevented dex-regulation of a specific set of genes. While the
mechanisms through which CCAR1, CCAR2, CoCoA, and ZNF282 blocked efficient dex regulation of genes
remain to be determined, we showed previously that blocking of dex regulation of genes by Hic-5
involved interference with GR binding on DNA sites and chromatin remodeling at those sites (Chodankar
et al. 2014). Our meta-analysis of ENCODE data from A549 cells indicates that there was an increased
frequency of the repressive histone mark H3K27me3 at the TSS of coregulator-blocked genes compared
with all dex-regulated genes, and a decreased frequency of the active histone modification H3K4me1
(Figure 2-7). These findings provide clues for potential regulatory mechanisms.
While the dex-regulated gene set contained genes with very robust and modest fold-changes in
response to dex, the blocked genes universally displayed modest fold changes upon hormone treatment
(Figure 2-13). A limited degree of regulation by dex could be an inherent property of the blocked gene
class, or it could be attributed in part to incomplete depletion of the coregulators. However, we note
that depletion of each of the four coregulators produced a robust biological effect in terms of the
number of coregulator-modulated genes.
Differential modulation of glucocorticoid-regulated physiological pathways by individual coregulators.
Our pathway analyses of the modulated gene sets for four coregulators indicated that ZNF282, CoCoA,
and CCAR2 were all involved in supporting various aspects of the anti-inflammatory actions of dex.
However, depletion of CCAR1 had no effect on dex regulation of genes in TNFR2 and acute phase
48
pathways (Figure 2-16, Figure 2-17, Figure 2-18). In contrast, all four coregulators had similar actions on
the interferon signaling pathways (
Figure 2-19), consistent with the notion of pathway-specific actions of GR coregulators. Our results for
CCAR1 in this study extend our previous analysis of the pathway-specificity of this coregulator. We
previously tested the effects of depleting 10 different coregulators on the ability of dex to induce
expression of several adipogenic genes in the 3T3-L1 preadipocyte cell line and several anti-
inflammatory genes in A549 cells (Ou et al. 2014). CCAR1 and each of the nine other coregulators had
distinct patterns of action on the adipogeneic and anti-inflammatory genes, with CCAR1 exhibiting the
strongest differential specificity between these two pathways. CCAR1 was required for dex-induced
expression of adipogenic genes but not for the induction of the anti-inflammatory genes by dex. We also
found that CCAR1 was required for differentiation of the 3T3-L1 cells to mature adipocytes in response
to an adipogenic cocktail that includes dex. Together, these two studies indicate that CCAR1 was
important for dex actions in adipogenesis but not for anti-inflammatory actions of dex. These results
suggest that CCAR1 may prove to be an attractive target to prevent some side effects of glucocorticoids
when employed clinically for extended treatment regimens.
Glucocorticoids regulate thousands of genes in various cell types, with different subsets of these genes
belonging to the different glucocorticoid regulated physiological pathways. If the gene-specific actions of
coregulators do indeed correlate with specific physiological pathways, as suggested by our data and
several previous studies, then regulating the activity of specific coregulators provides a potential
mechanism for the cell to modulate the hormone response. This could be accomplished by modulation
of signaling pathways that regulate the protein levels, protein-protein interactions, or post-translational
modifications of specific coregulators. For the same reason, coregulators could prove to be attractive
therapeutic targets for ameliorating side effects of hormone therapy.
49
Chapter 3 Investigation of differential transcription factor binding for
ChIP-seq
The ChIP-seq work presented in this chapter was performed by ENCODE consortium, with validation
provided by Danielle Bittencourt (GR A549 validation) and Rajas Chodankar (U2OS-GR cell line and
U2OS-ER cell lines, TGFB1I1 depletion).
Introduction
Chromatin immunoprecipitation combined with sequencing (ChIP-seq) is a technique used to determine
the presence of histone modifications and DNA binding sites for transcription factor proteins in a
genome-wide fashion and requires a large number of cells (around 2 x 10
7
cells). With this technique,
antibodies are used to enrich for a specific protein cross-linked with DNA, and DNA isolated in this
manner is sequenced. Along with the antibody enriched DNA, a negative control – often performed in
parallel – sequences the non-enriched DNA to assess background signal and is commonly referred to as
input. Peak-calling algorithms are used to identify protein-binding locations by identifying significantly
enriched counts in specific genomic regions compared to genomic background (review (Furey 2012)).
Signals from protein binding typically fall into two general categories: enrichment over broad regions
(typical for histone marks) and enrichment over narrow regions (typical for transcription factors) with
some proteins producing a signal that is a mixture of the two enrichment profiles (Pepke et al. 2009).
ChIP-seq experiments have become increasingly popular as sequencing prices decrease and the number
of ChIP-seq validated antibodies increase. Comparison between experiments can provide novel
information about quantitative and qualitative changes in protein occupancy and histone marks (Ross-
Innes et al. 2012; Shao et al. 2012; Xu et al. 2008). Methods have been developed for differential histone
mark analysis (Xu et al. 2008; Song and Smith 2011; Nair et al. 2012) and differential transcription factor
analysis (Stark and Brown 2011; Liang and Keles 2012; Shao et al. 2012). Whereas methods for
differential histone mark analysis may analyze differences in peak length or shape, methods for
differential transcription factor binding will often ignore these features and focus only on peak height,
since transcription factors bind to a specific genomic location on DNA. Here, we evaluate several
methods that have been proposed to quantify differential binding of transcription factors and compare
the differences in workflow and results from each method.
50
Figure 3-1 Typical workflow for identifying differential transcription factor binding
We show the typical workflow for differential transcription factor binding. First, ChIP-seq reads are mapped to the reference
genome with sequencing read representing binding signal. Next, a peak caller is run on the mapped reads for each antibody-
enriched DNA sample and input pair to identify regions of significant binding over non-specific input binding. Peak callers will
often output: (1) genomic coordinates for start and end of enrichment, (2) signal value or score that represents tag height, (3)
p-value for enrichment over input (background), and (4) genomic coordinate for the point of greatest binding signal (summit).
These significant binding regions are known as peaks and (shifted) sequencing reads overlapping each peak are counted to
create a binding counts matrix (see (Landt et al. 2012) for guidelines and (Park 2009; Bailey et al. 2013) for review on ChIP-seq
analysis). This binding counts matrix can be used to identify peaks that have significant differential binding between biological
conditions of interest.
• Map sequence reads to reference genome
• Remove sequencing artifacts, choose mapping parameters
Sequence
preprocessing
• Shift reads based on estimated fragment length
• Counting often done with sliding window
Shift and count
reads
• Identify regions with significant enrichment in read counts
compared to background
Peak Calling
• Define a reference set of peaks to count over
• Create binding counts matrix (peaks by sample)
Steps to count
reads in peaks
• Use RNA-seq methods (such as edgeR) with peak counts to
identify peaks with significant differential binding
Evaluation
• Using browsers such as (IGV/IGB/Gviz/Gbrowse/UCSC)
• Signal Track showing counts per million (CPM) or log ratio
Visualization
51
Peak Calling Steps before counting reads in peaks Evaluation
a) Condition-level comparisons
Overlap ENCODE q<0.01 n/a Overlap 1bp+
Fold Change ENCODE q<0.01 Extend on fragment size fold change (fc) cutoff (4x)
Ranking ENCODE q<0.01 n/a n/a
b) Sample-level comparisons
edgeR ENCODE q<0.01 Shift reads edgeR
DiffBind ENCODE q<0.01 Extend reads, scaled down bg* edgeR + TMM + bg* subtract
MAnorm3 ENCODE q<0.01 Shift reads, calculate MA adjust edgeR + MA offset
Voom ENCODE q<0.01 Shift reads, voom transform eBayes
c) Other methods
MACS2 MACS2 callpeak Extend on fragment size macs2 bdgdiff w/LLR^ cutoff
DBChIP User input
Cluster + recenter peaks, calculate
NCIS edgeR + bg* subtract
DIME n/a User inputs normalized counts Several mixture models**
Homer v4
Homer or user
input Shift reads and scale to smaller library Cumulative Poisson + 4x fc
Epicenter *** *** ***
Starklab MACS v1 Quantile normalization 2x fold change cutoff
EDASeq User input Quantile normalization, feature adjust edgeR + offsets
* background
** Chooses one of the following: local FDR, extended Normal-Uniform model, and Gamma-Normal-
Gamma mixture model
*** Epicenter combines counting, peak calling and evaluation using Poisson background modeling,
parsimony normalization with exact ratio test and z-test
^ log likelihood ratio
Table 3-1 Differential binding methods
A variety of approaches have been developed to perform differential transcription factor (TF) binding analysis (Table 3-1). We
group the approaches into two general categories: condition-level comparison methods (Table 3-1a) and sample-level
comparison methods (Table 3-1b). Condition-level methods collapse reads from replicate experiments into a single group
(condition) to compare against other conditions with replicate samples collapsed, while sample-level methods model the
biological variation between replicate experiments. Our goal is to provide a thorough evaluation of methods that will model
biological variation from replicate experiments, and to compare their results to those from condition-level approaches. The
methods we evaluate utilize software previously developed for differential expression of RNA-seq. We show that the results can
vary substantially when applied to the same dataset depending on computational parameters chosen, e.g. reference library size,
making it critical for communication between the data analyst and biologist to understand the biological system under
investigation in order to perform the proper analysis.
The methods appearing in Table 3-1 accept user-defined (peak) binding regions, allowing us to compare
methods for a controlled set of genomic intervals. However, this raises issues about how best to define a
reference set of binding regions. One major difference between differential binding analysis and
differential expression analysis is that expression counts are obtained over exonic regions which are
defined independently of the mapped sequence reads and are shared for all treatment conditions
whereas binding counts are typically obtained over regions of significant binding (peaks) defined from
the sequencing reads. This results in both regions of shared binding between different biological
52
conditions, but also regions that are unique to a single condition. When comparing multiple ChIP-seq
datasets (Figure 2), one must define a set of genomic intervals to serve as reference or consensus
binding regions for which binding counts will be compared. We have observed cases where peaks in one
condition overlap several peaks in a secondary condition and the number of overlapping peaks reported
can vary depending on which condition is used as overlap reference. In this paper, we evaluate several
approaches for defining reference-binding regions, and show how they compare to genome-wide
binning (Lun and Smyth 2014).
In order to compare methods, we evaluate their performance on six transcription factor ChIP-seq
datasets from ENCODE project (Dunham et al. 2012). These datasets were chosen based on availability
of ChIP-seq experiments produced by the same lab under multiple different biological conditions. We
use Glucocorticoid receptor (GR/NR3C1) and Estrogen receptor alpha (ERα/ESR1) data from Myer lab,
TCF7L2 (TCF4) data from Farnham lab and Pol2 (POLR2A), NRF1, and c-Myc (MYC) datasets from Snyder
lab. All but Pol2 are transcription factors that bind to enhancer sequences and have known binding
motifs. Biologically, enhancers are important for regulation of gene expression, and transcription factors
are often found to be bound to different subsets of enhancers in a cell type-specific manner. We group
the six different datasets into three types: control (Pol2 and c-Myc), differential cell-type binding
(TCF7L2 and NRF1), differential hormone binding (GR, ERα). We expect to find no differential binding in
control experiments, cell-type specific binding in the differential cell-type comparisons, and increased
total binding in dose-response hormone studies performed within the same cell line.
We find that analysis decisions can have a major impact on the results depending on the biological
conditions investigated. From this, we point out the critical decisions the analyst will make, and
recommend the preferred analysis procedures for some standard biological study designs.
To extend on the studies in the previous chapter – where coregulators were depleted and hormones
were added to identify genes that depended on each coregulator for its hormone regulation -- we want
to know the genome-wide localization of each coregulator. Unfortunately, since coregulators are not
very well studied and do not bind DNA, the antibodies available for coregulators do not give sufficient
signal for ChIP-seq. Instead, we use a different approach where we perform ChIP-seq against steroid
hormone receptor with and without coregulator depletion. A subset of these receptor-binding sites
could have their binding modulated by the coregulator and we are interested in quantifying changes in
binding after coregulator depletion since we do not expect these changes to be all-or-none. Thus, we
want to test how the different differential ChIP-seq methods perform to identify methods to use for our
study.
53
Methods
Figure 3-2 Overview of peak types and defining reference binding regions
(a) Several different types of peak comparisons are shown. The black curve represents binding signal with the dotted line
representing a hypothetical threshold for enrichment. The black boxes under each curve represent significant regions as
defined by peak caller output with the vertical line in the box representing summit point. Comparing binding profiles in
conditions A versus B we find: binding in condition A but not B (Unique – single enrichment), varying degrees of binding
between the two conditions (Unique and Shared peak – differential), and both conditions having a peak of about comparable
signal intensity (Shared peak – similar). (b) To illustrate the differences between ways to define reference binding regions we
show an example with three conditions (A,B,C). The comparison of interest is between condition A and B with peaks from
condition C only being used for the merge all method. (c) We show the expected MA plots for each reference method using the
signal values from the three example peak arrangements on the left.
54
Defining reference-binding regions
55
Figure 3-3 Distance between summits of peaks
We visualize distance between summits that are <500bp apart for peaks from all 6 different proteins. We find that most
summits are less than 50 base pairs apart leading us to specify 50 base pairs as the length to extend for merge centered.
We use four different strategies to create reference-binding regions, the genomic intervals at which
differential binding is evaluated (Figure 3-2b). The simplest strategy is to pool the peaks identified in the
different conditions, ignoring the fact that peaks obtained in different conditions may overlap (shared
peaks). The merge pairwise strategy merges peaks that overlap more than 1 base pair in the two
conditions being considered. The merge all strategy merges peaks from all conditions with the same
ChIP antibody that overlap more than 1 base pair (same as merge pairwise if only two conditions). Lastly,
the merge centered strategy redefines the peaks from merge all strategy as 25 base pairs (bp) around
the summit (50 bp total). After merging, if peak widths are greater than 50 bp, these peaks are
truncated to 50 bp by extending 25 bp from the center of the merged peak. This last strategy is designed
to avoid merging overlapping regions that represent distinct protein binding positions (see twin peaks in
Figure 3-2b). The extension length of 50 bp was chosen based on distribution of distances between
summits. Of the binding summits within 1 kb of each other (peaks called near each other), the majority
(Pol2: 58.9%, c-Myc: 52.3%, TCF7L2: 60.3%, Nrf: 93%, GR: 69.5%, ERα: 75%) are within 50 bp of each
other (Figure 3-3). Thus, we use 25 bp as the extension from summit to redefine each peak.
There are different rationales behind using each strategy of defining reference regions. Pooled appends
results from the peak caller without modifying the regions found. This is the simplest method and
naively, this is what an analysis that combines peaks from both conditions would be doing. However,
shared peaks found in both conditions will be double counted. Depending on the frequency of shared
peaks in the conditions being compared, this could adversely impact the differential binding analysis
when correcting for multiple comparisons. Merge pairwise will merge overlapping regions to avoid
double counting, but has the disadvantage of merging neighboring peaks that may in fact be distinct and
56
may also skew peak length distribution. Merge all is the intuitive choice for differential ChIP experiments
with multiple conditions. Using peaks from conditions that are outside of the pair-wise comparison
allows some normalization methods to take advantage of an increased number of unchanging
background regions with low fold changes. Lastly, as explained above, merge centered is proposed to
avoid merging distinct binding regions that have overlapping peaks. All four strategies to define
reference-binding regions are compared to an analysis based on binning the entire genome. Using bins
across the genome is more computationally expensive, but can providing better estimates of
background noise for data normalization and avoids the bias of specifying true binding regions a priori.
Effect of input
ChIP-seq experiments usually include an input control. Inputs serve as an indicator of background noise.
Some factors that could contribute to input signal are copy number alterations, high mappability regions,
or open chromatin. Input samples are often generated by sequencing reverse cross-linked DNA after
immunoprecipitation (IP) or ChIP with normal IgG. Some differential analysis methods subtract the input
counts from the binding counts matrix to remove background signal. This requires first scaling the input
to ChIP sample using the total number of counts in each experiment. Since removing input will lower
counts, not adjusting for input may be more sensitive when comparing across treatment conditions
within the same cell type.
Different normalization methods
Reference regions
Library size Input subtraction
pooled pairwise merge all centered Effective Full TMM MAnorm None Scale CPM
edgeR eff Y Y Y Y x
x x
edgeR full Y Y Y Y x x
DiffBind eff n/a Y Y Y x
x
x
DiffBind full n/a Y Y Y x x
x
MAnorm3 Y Y n/a Y x*
x x
Voom Y Y Y Y x x
Table 3-2 Summary of comparisons
Y indicates that the comparison was performed and X indicates the option(s) that was chosen. *MAnorm3 takes the average of
log transformed effective library sizes for scaling.
We evaluate four different between-sample normalization methods (Table 3-2), three of which use a
single factor scaling variable such as library size, and a fourth uses a linear model scaling adjustment.
The methods we evaluate calculate library size based on the sum of all reads overlapping regions of
interest (“effective” library size) or sum of all mapped reads in experiment (“full” library size) or more
specialized scaling parameters. Effective library size can be inaccurate if two ChIP-seq conditions have
very different total protein bound. This method makes the assumption that the sum of all binding signals
in peaks is the same between conditions and this assumption is invalid in the GR dataset we analyze. In
contrast, full library size makes the assumption that the combined signal from true binding and
background noise between ChIP-seq conditions are the same. A third method, TMM (trimmed mean of
57
M values)(Robinson and Oshlack 2010) trims the tails of log fold change (M values) and average log read
count (A) and computes a weighted average. This method attempts to decrease the fold change
differences between the two samples and is often used in conjunction with effective library size for
RNA-seq analysis. The TMM method assumes that most regions are not differential and differential
regions are both up and down regulated. A fourth method, MAnorm3, computes an adjustment based
on the relative amount of binding as a function of average log read count. This approach assumes that
most shared peaks, significant binding regions that overlap in both conditions, have comparable binding
signal and normalizes counts in all peaks for each sample based on a linear regression of log-fold change
versus average binding in shared regions. Table 3-2 shows the combinations of normalization methods
used in the different analysis methods.
Approaches and implementation
Condition-level comparison methods
Overlap method We take peaks identified by the ENCODE project using the IDR workflow and apply an
additional filter of q-value < 0.01 for peak detection. We take this subset and find overlaps in the
genomic coordinates of the peaks found in various conditions. The numbers shown in the Venn
diagrams in this section represent the number of peaks that overlap by more than 1 base pair between
conditions. Neighboring peaks can cause the number of overlaps to differ depending on which sample is
used as the reference. We count overlaps using each condition as reference and report the median.
Fold change method Reads are shifted based on half the average fragment length and read counts over
every peak are obtained. These binding counts are scaled by 10
6
and divided by the total number of
reads in the sample to generate counts per million (CPM). Replicate samples have their CPM values
summed and then log2 transformed and compared between conditions to generate a fold change value.
Regions with counts of zero are assigned a pseudocount of one prior to scaling to allow for log
transformation.
Ranking method Using only peaks that are identified in both conditions compared, we rank each peak
based on the signal value calculated by the peak calling software (SPP), representing average
enrichment for the region. When peaks in one condition overlapped multiple peaks in another condition,
the peak with the highest signal value among the multiple peaks is chosen.
Condition-level comparison methods can identify the different types of differential peaks shown in
Figure 3-2a. The overlaps method can identify both peaks that are shared and peaks that are unique by
the absence or presence of peak overlap; however, this method does not distinguish between
differential and similar or single enrichment. The fold change method can identify all four types listed.
Lastly, the ranking method is only relevant for shared binding and can identify differential versus similar,
with similar peaks showing a minimal change in ranks. None of the condition-level methods provide a
measure of statistical significance unlike the sample-level methods considered below.
58
Sample-level comparison methods
Three of the four sample-level comparison methods we evaluate utilize edgeR (Robinson et al. 2010), a
package originally developed for RNA-seq differential expression analysis. edgeR models the counts
using a negative-binomial distribution, and requires between-sample normalization to account for
different sequencing depths. All normalization procedures are implemented through the use of an offset
variable in the regression model for differential binding. The fourth method we evaluate, Voom, uses a
transformation of the count matrix to allow for analysis using software developed for analyzing gene
expression microarray data. Each method was also rerun using multiple different reference regions.
With the exception of MAnorm3, all packages listed below can be found on Bioconductor (Gentleman et
al. 2004). MAnorm3 is available here: https://github.com/ying-w/chipseq-
compare/tree/master/MAnorm. For each method, counts were obtained from counting over reads
(without duplicates removed) from ChIP-seq overlapping pre-defined peaks. Unless otherwise
mentioned, all methods were performed with an FDR-adjusted p-value <0.05.
First, we evaluate edgeR (version 3.0.8) (Robinson et al. 2010), without adjusting for input samples,
using different reference-binding regions and different normalization strategies. We estimated both
common and tag-wise dispersion but not trended dispersions due to low number of replicates and poor
fit when plotting biological coefficient of variation (BCV). We normalize using TMM (Robinson and
Oshlack 2010) and effective library size. We also used full library size and no TMM to evaluate
differences in results from changing these normalization options.
Second, we evaluate DiffBind (version v1.4.2) (Stark and Brown 2011), which uses edgeR for its
differential analysis and will subtract scaled input counts from the binding counts matrix. DiffBind will
merge overlapping peaks when counting in a way that is the same as merge all method to define
reference-binding regions. To implement merge pairwise, DiffBind must be run with only two datasets.
For input, we used the same samples that were used for peak calling except in the case of c-Myc. We
show DiffBind results with scaled input subtracted using both effective library size with TMM and full
library size with TMM for normalization.
Third, we evaluate a modified implementation of MAnorm (Shao et al. 2012) to allow for replicates and
to use edgeR for differential peak calling. Instead of subtracting the MA adjustment from binding counts
matrix, the adjustment from MAnorm3 was added to the average of log transformed effective library
sizes and used as an offset matrix for edgeR in place of library size. We will refer to our modified method
as MAnorm3. This method assumes that most shared peaks in the two conditions have the same
amount of binding. Since MAnorm uses intersection of peaks to normalize, the merge all strategy to
define reference-binding regions was not used as it would normalize based on all peaks and this would
violate the assumption that shared peaks are unchanging.
Lastly, we evaluate Voom (version 3.14.4)(Law et al. 2014). We chose this method due to favorable
performance in a comparison of RNA-seq data (Rapaport et al. 2013). This method transforms Poisson-
based read counts into normal based signal values that can be used with preexisting microarray analysis
59
methods. We used full library size when transforming the data and limma (Smyth 2004) for microarray
analysis.
Our analysis also included heavy use of GenomicRanges package (version 1.10.7) (Lawrence et al. 2013).
Binding counts matrix for most of the methods were created by counting over reference regions using
coverageBed (Quinlan and Hall 2010).
ChIP-qPCR validation
We performed validation of fold changes via ChIP-qPCR for GR in A549 cell lines. We choose a mixture of
regions that represented both shared and unique peaks. We followed the previously described protocol
for antibody and cell growth conditions as well as treatment conditions (Reddy et al. 2012). Chromatin
immunoprecipitation (ChIP) was performed as previously described (Bittencourt et al. 2012) except that
cells were cross-linked for 10 min at room temperature with formaldehyde only. IP signals were
normalized relative to the signal obtained from input chromatin and fold changes were calculated by
dividing normalized IP signal values followed by log2 transformation. Up to three qPCR technical
replicates were performed for each experiment along with IgG and 2% input chromatin as control.
Fold changes from ChIP-qPCR are compared to fold changes generated from each of the sample-level
comparison methods listed above with the addition of a new input subtraction method called
Normalization of ChIP-Seq (NCIS) that scales input based on non-enriched regions (Liang and Keleş 2012).
This method is implemented by the DBChIP package (version 1.2.0) (Liang and Keles 2012) and was used
to evaluate differences in input scaling. Unlike edgeR and DiffBind, this method uses a median ratio for
library size, and does not allow the user to specify the reference-binding regions. Using the median ratio
with and without background subtraction after NCIS scaling is visualized as NCIS sub and NCIS median
(Figure 8)(Wu et al. 2014a).
Datasets
We use ChIP-seq data for transcription factors from ENCODE (mar2012 freeze) to evaluate different
methods proposed for differential transcription factor binding. The advantages of using ENCODE
datasets include the high quality standard for biological experiments with the inclusion of replicates as
well as consistent processing of data. Peaks were generated from a modified SPP pipeline (Anshul
Kundaje, Lucy Yungsook et al. Assessment of ChIP-seq data quality using cross-correlation analysis.
Submitted) which also incorporates the IDR framework (Li et al. 2011) and deposited in mar2012
ENCODE freeze. For reads, we used the tagAlign files that were converted from bam files and use less
disk space. Both of these files types are generated by ENCODE and publicly available (Wu et al. 2014a).
Pol2 (official symbol: POLR2A) is the largest subunit of the RNA polymerase II complex. Multiple
replicates of ChIP-seq were performed by Synder lab as part of the ENCODE project in Gm12891 cell line.
This cell line is an Epstein Barr virus transformed B-lymphocyte cells that is used as part of the Utah
pedigree of the HapMap project. We split the six replicates into two groups of three and compare even
replicates versus odd replicates. This dataset serves are our first negative control where we do not
expect any differences.
60
c-Myc (official symbol: MYC) is a well studied tumor suppressor gene. The ENCODE project has
generated several ChIP-seq datasets for this protein including two pairs of replicates in the K562
leukemia cell line by the Snyder lab (Gerstein et al. 2012). This pair of replicates serves as our second
negative control, since we expect the results to be the same due to the use of the same antibody, cell
line, and presumably the same protocol. Although these samples were created by the same lab, they
were created at different times, and institutions, have different sequencing depths, and specify different
samples as inputs for peak calling. The lab that produced these datasets was originally at Yale and ChIP-
seq peak calling was performed there using reverse crosslink as control. Several years later, the lab
moved to Stanford where the experiment was repeated, but samples were sequenced to a higher depth,
and IgG served as the control. These datasets are referred to as c-Myc Yale and c-Myc Stanford in our
analysis. For our study, we use reverse crosslink as control for both datasets. For datasets with only two
conditions, the merge pairwise and merge all reference-binding regions are identical.
TCF7L2 (official symbol: TCF4) is a zinc-finger transcription factor that is up-regulated in cancer. Variants
of this protein have been found to be associated with type 2 diabetes. A recent publication from an
ENCODE lab looked at multiple commonly cultured cell types and found cell type specific enhancer
binding in several cell lines (Frietze et al. 2012). We use this dataset to look for differential binding of
TCF7L2 in HEK293 (human embryonic kidney), HeLa S3 (adenocarcinoma), and MCF-7 (breast cancer)
cell lines. These cell lines are often abbreviated to hek, hela, and mcf in our analysis. The comparisons
performed on this dataset most closely resemble the types of differential binding analysis that can be
done on publicly available ChIP-seq datasets.
Nuclear Respiratory Factor 1 (official symbol: NRF1) is a transcription factor that is important for
mitochondrial gene expression and is a susceptibility gene for type 2 diabetes (Liu et al. 2008). We
compare the binding of this protein in all the Tier 1 ENCODE cell lines (GM12878, H1-hESC, K562). These
cell lines are Epstin-Barr Virus transformed B-lymphocytes, embryonic stem cells, and leukemia cells.
Glucocorticoid receptor (GR, official symbol: NR3C1) is a type of steroid hormone receptor that responds
to the cortisol hormone in humans. This protein acts as a transcription factor and binds to enhancer
sequences upon hormone treatment (Tsai and O’Malley 1994). GR has been shown to be a key regulator
of inflammation, various developmental processes, and metabolism of glucose, lipids, and protein (Beck
et al. 2009). A recent publication from an ENCODE lab investigates circadian rhythm gene regulation by
GR in a dose dependent manner (Reddy et al. 2012). This paper includes GR ChIP-seq data and RNA-seq
data after 1 hour of dexamethasone hormone treatment at several different hormone concentrations.
Dexamethasone (dex) is a synthetic analogue of cortisol with higher affinity to GR and longer half life in
cell culture and animal models. The ENCODE project has data from three hormone levels, centered
around the disassociation constant (K
d
) of dex with GR. Hormone treatment levels used in this study
range from an order of magnitude higher than the K
d
, (50 nM) to an order of magnitude below the K
d
(0.5 nM) and include a third concentration near the K
d
(5 nM). These hormone concentrations are
abbreviated to high, med, and low in our analysis. These experiments are ideal for evaluating binding
changes since binding will change with hormone dosage. These three conditions allow us to quantify GR
binding when we expect overall hormone-activated GR protein levels to be changing dramatically since
GR only binds DNA upon hormone activation. In the low hormone level (0.5 nM) that is an order of
61
magnitude below K
d
we expect to observe very few GR binding sites, while in the high hormone level (50
nM) we expect most accessible GR binding sites to be occupied.
Estrogen Receptor alpha (ERα, official symbol: ESR1) is also a steroid hormone receptor and it responds
to estradiol hormone in humans. The ChIP-seq experiments that we compare use three types of
hormone to induce ERα binding: bisphenol A (bpa), genistein (gen), and 17β-estradiol (est)(Gertz et al.
2012). These three types of hormone are found in plastics, soybeans, and endogenously. These ChIP-seq
experiments were performed in ECC-1 cells which is a cell line derived from endometrium
adenocarcinoma.
62
Results
63
Figure 3-4 Results from condition-level comparison methods
(a-d) Venn diagrams of peak regions identified by the ENCODE project with an additional q-value cutoff of < 0.01 (a. Pol2, b. c-
Myc, c. GR, d. TCF7L2). (e-g) MA plots of log fold-change versus average log binding for each pair-wise condition of interest. (e.
c-Myc Stanford/IgG vs Yale/standard, f. TCF7L2 HEK293 vs HeLa S3, g. GR high vs low).
Condition-level comparison methods
Overlap: (Figure 3-4a-c)
Overlapping our Pol2 peaks (Figure 3-4a), we find that the two subsets (odd replicates and even
replicates) mostly overlap (80%+) and overlap an even larger fraction when compared to the peaks
found by the original set (pooled, 96-98% overlap). This result is expected since this is a negative control
where the two data subsets each contain three replicates of the identical experiment. In our second
control experiment (c-Myc), replicate experiments were conducted by the same lab, but in two different
institutions (Yale and Stanford) several years apart (see Methods). The second set of experiments used a
different type of input for control and was sequenced more deeply (~20M reads per replicate versus
~4M). The overlap of peaks in Figure 3-4b shows that the Stanford dataset has many more peaks than
the Yale dataset, and highlights the importance of sequencing depth for peak calling and overlap
analysis. Due to the large difference in peaks between the two c-Myc experiments, we filtered all but
the top 5000 peaks (since the smaller Yale dataset had about 5000 peaks) and find that over 70% of the
peaks overlap. Filtering additional peaks did not increase the overlapping proportion greater than 75%
(data not shown). We also reanalyzed the peak calling and overlap analysis for the Stanford data using
the same input control as was used for Yale, and down-sampled the sequencing depths to match the
Yale samples. Changing the choice of input for peak calling resulted in very few changes in number and
locations of peaks, however, reduction in sequencing depth led to a dramatic decrease in number of
peaks found (4341 peaks). Out of the 4341 peaks, 91% appeared in the larger (full-sized) Stanford data
set. Together, these results indicate that the critical difference between the replicate c-Myc data sets
generated at the different time points was sequencing depth, and not the input control.
64
NRF1 Overlaps ERα Overlaps
Figure 3-5 Overlapping conditions in NRF1 and ERα
Venn diagrams of peak regions identified by ENCODE with an additional q-value cutoff of <0.01. NRF1 (left) venn diagram shows
peak overlaps in three different cell types: Gm12878 (gm), K562 (k5), and H1esc (h1). ERα (right) venn diagram shows peak
overlaps in three different conditions: BPA, genistein (gen) or estradiol (est).
Overlapping binding sites between cell types identifies both cell type-specific binding as well as shared
binding sites between cell-types, a result that has been reported previously (Frietze et al. 2012). For
TCF7L2, fewer peaks were identified in the HeLa S3 cell line and fewer cell type-specific peaks were
identified in HeLa S3 (49%) compared to HEK293 (72%) and MCF7 (74%) (Figure 3-4c). In NRF1
experiments, less cell type-specific binding is found with most binding sites being shared (over 60% of
peaks are shared by all three conditions and under 20% are unique to each cell type) (Figure 3-5).
Overlapping GR peaks from three different hormone concentrations (Figure 3-4d) produces a result that
is consistent with the current understanding of GR function. The number of GR binding sites increases
with hormone concentration since GR protein requires hormone to bind to DNA. The majority of the
peaks are shared, with low concentration peaks being a subset of peaks from high hormone treatment.
All 25 regions specific to medium hormone binding either have very low enrichment or have a peak in
high hormone treatment nearby but not overlapping. In ERα datasets, treatment with three different
types of hormones caused differential binding with bpa-specific binding sites being a subset of genistein
and estradiol binding sites (Figure 3-5). In the analysis that accompanied the ERα datasets concluded
that bpa and genistein induce a subset of estradiol treatment effects by assaying both binding and gene
expression (Gertz et al. 2012).
65
C-Myc stanford vs yale GR high vs low merge
TCF HEK vs Hela merge
pool pairwise center pool pairwise all center pool pairwise all center
Non-Overlap 18894 17962 20536 17407 17339 12932 17408 5665 5314 4228 6009
Fold Change 93
14519 - - - 4294
edgeR efflib 450 292 50 4397 4318 4289 2681 5837 5199 6008 4632
edgeR fulllib 0 0 1 17403 17246 17260 17076 5101 4628 5290 4225
DiffBind efflib - 391 36 - 2908 2950 2129 - 5238 5585 5651
DiffBind fulllib - 5 2 - 17233 17249 17317 - 4663 5062 5118
MAnorm3 1894 1991 829 10857 14249 - 13375 5631 5063 - 4657
voom fulllib 5 1 0 17373 17215 17222 15653 4962 4497 5083 3948
# peaks 29176 22828 24852 17608 17439 17458 17508 7201 5976 10707 6604
Table 3-3 Number of significant differential binding regions
Non-overlap and fold change use cutoffs. The other methods all use FDR adjusted p<0.05
Fold Change: (Figure 3-4e-g)
We impose a 4x fold change cutoff and look at the number of peaks that pass this threshold in pair-wise
comparisons. In c-Myc control dataset (3e), we find a small number of peaks with a high fold change and
these have low binding counts. Although we normalize the fold change for differences in sequencing
depth, a skew towards condition-specific effects in the more deeply sequenced sample is observed, with
more regions having non-zero counts. For differential cell types, 50-60% of the binding sites in TCF7L2
have greater than 4x fold change difference showing a large amount of cell type-specific binding (Figure
3-4f). The two cancer cell lines (MCF7 and HeLa S3) have fewer differences from each other compared to
the embryonic kidney cells (HEK293) (Table 3-3). This difference is not attributable to sequencing depth
since HEK293 and MCF7 have about the same sequencing depth with HeLa S3 having fewer reads. In the
GR dataset, we have about the same sequencing depth for all three datasets and find the highest
proportion of significantly differential peaks in the high versus low (Figure 3-4g) comparison followed by
med versus low and high versus med (data not shown). This order is expected since high hormone
treatment should induce more binding than lower levels of hormone treatment and med and high are
both treated with hormone concentrations above the equilibrium constant (K
D
) so we would expect to
find robust GR binding in both conditions and thus smaller and fewer fold change differences.
Ranking:
Ranking the intersecting peaks reveals different correlations between different conditions (correlation
coefficient for Pol2: 0.85, c-Myc: 0.72, TCF7L2: 0.33-0.43, NRF1: 0.67-0.75, Gr: 0.36-0.79, ERα: 0.6-0.82).
GR high vs med had a much higher correlation at 0.75 than the other GR experiments that compared
against GR low. When we plot the normalized difference in ranks of shared peaks, we find that most
comparisons do not result in a bulge around zero indicating that most ranks are not preserved between
conditions (ranks change between conditions)(Supplemental File 6). From the plot, there is not an
obvious cutoff point to separate the true differential binding from non-differential binding using ranks
and thus we do not further investigate this approach.
66
In conclusion, the overlapping of peaks from different conditions depends on sequencing depth. The
number of peaks found at the same FDR cutoff will change if the samples being compared have large
differences in sequencing depth (e.g. c-Myc had about 5x more sequencing in Stanford replicates versus
Yale replicates). The differentially bound regions identified by overlap will serve as a benchmark for
sample-level comparison methods that incorporate read count from replicate samples to assign
statistical significance.
Sample-level comparison methods
67
Figure 3-6 MA plots for sample-level comparison methods
Regions with significant differential binding are highlighted in red. Controls: (a) Pol2 comparison, (b) c-Myc comparison.
Normalization differences: GR comparison between high versus low hormone using (c) effective or (d) full library size
normalization. TCF7L2 comparison between HEK293 and Hela S3 cells using (e) DiffBind with full library size, (f) MAnorm3, (g)
edgeR with full library size. Reference binding region difference: TCF7L2: (g) merge pairwise, (h) merge all, and (i) merge
centered using edgeR with full library size.
edgeR:
Next, we performed differential binding analysis using edgeR and evaluated the effect of normalization
using effective library size with TMM (trimmed mean of M values) or normalization using full library size
(fulllib, see methods for more details regarding normalization). Unless otherwise mentioned, the results
are from using the merge pairwise reference-binding regions
Normalizing using both full and effective library size, we found no differential peaks in Pol2 dataset (
Figure 3-6a). Our other control, c-Myc, had 292 (about 1% of peaks) differentially bound regions using
effective library size normalization, fewer than the 5% family-wise error rate, and vastly less than the
number of differential regions found by the overlaps method. These differential regions can be seen in
the MA plots (
Figure 3-6b), plots of log fold change (M-value on vertical axis) against average log counts (‘A’ on
horizontal axis). These differential sites all vanished when normalizing using full library size (Table 3-3),
suggesting that full library size is a more conservative adjustment for different sequencing depth. Thus,
either normalization method is valid when sequencing depth is dramatically different and the total
chromatin-bound protein is about the same with effective library size being less conservative.
Normalizing for library size showed dramatic differences when assaying differential binding between
treatments having different levels of total bound protein. Many more differential GR and ERα binding
sites are found when using full library size instead of effective library size (
Figure 3-6c-d, Supplemental File 2, Table 3-3, except for ER est vs gen). Biologically, in the GR
experiment, we expect most sites to be bound only in the high hormone treatment condition. As
hormone concentration increases, a higher proportion of available binding sites will be occupied by GR.
At low hormone concentrations, only the most sensitive binding sites would have GR bound. The MA
68
plot from using full library size for normalization matches our understanding of the biology for GR –
most sites are differentially bound and higher fold changes occur in high hormone treatment compared
to low hormone treatment (
Figure 3-6d). When using effective library size for normalization, a contradiction arises where we find
significantly reduced binding in high hormone treatment compared to low hormone treatment in
regions that are unique to high hormone (
Figure 3-6c). Both datasets have similar sequencing depth but different amounts of protein bound,
suggesting that normalization using full library size is more robust to variations in bound protein
concentration than effective library size.
When comparing between different cell types (TCF7L2 and NRF1 experiments), we find that similar
numbers of differentially bound sites with a high degree of overlap are found using the two different
normalization methods (Table 3-3). Sequencing depth is roughly similar between the samples and we
assume that protein binding is also similar between conditions. Under these two conditions, only a small
difference between the two library size normalization methods is observed, suggesting that the choice
of library size has little effect on results when we do not expect varying protein binding between
conditions and sequencing depth is similar.
Lastly, MA plots generated by edgeR are similar in shape to the MA plots generated from fold change
method (
Figure 3-6b and Figure 3-4e,
Figure 3-6d and Figure 3-4g,
Figure 3-6g and Figure 3-4f); however, using edgeR will allow us to utilize variance from replicates to
identify more robust regions of differential binding. We conclude that using edgeR with full library size
69
normalization performs similar to, or better than, edgeR with effective library size normalization at
detecting differentially bound regions.
DiffBind:
We performed DiffBind analysis, a method that extends the edgeR method by subtracting scaled input
(background) read counts from the binding counts matrix as well as automating the process to obtain
counts from peaks. We expect DiffBind to perform similarly to edgeR for the control and hormone
treatment datasets since they are performed on the same cell type. When comparing different cell types,
the input subtraction step may be useful to correct for differences in regional DNA copy number
alterations and high mappability regions.
DiffBind performs similarly to edgeR when using full library size normalization, with similar or less
number of differential peaks found by DiffBind in most comparisons (Table 3-3). Subtracting input will
lead to lower counts and reduce the power to detect differential binding while protecting against finding
regions with genomic biases such as copy number alterations (CNAs) or high mappability since these
regions are more likely to have higher input signals. Thus, we expect fewer differential regions to be
found when using DiffBind.
Surprisingly, the comparisons between different cell-types using DiffBind yielded a similar number of
differentially bound peaks compared to methods without subtracting input. This is unexpected since
known copy number alterations exist between these different cell types. It is possible that the proteins
we analyzed had few binding locations in regions with CNAs. We conclude from this that subtracting
input using DiffBind produces similar results to edgeR but can prevent finding differential binding in a
few regions with high input signal.
Figure 3-7 Differences in results of sample-level comparison methods
We overlap the regions with significant differential binding found by each method for (a) c-Myc , (b) GR high versus low
hormone, and (c) TCF7L2 HEK293 versus Hela S3. For c-Myc (our negative control), we picked methods with highest number of
false-positives and overlap these results with the peaks found by peak caller (stdYale and iggStanford).
MAnorm3:
70
MAnorm3 is a method that extends edgeR by normalizing the data using a linear regression of M-value
(difference) on A (average log counts) for peaks that are shared between conditions (instead of using full
or effective library size). When applied to the control datasets, this method finds no differences in Pol2
but finds a surprisingly high number of differentially bound regions in c-Myc (9%) (Table 3). The large
number of differentially bound regions found in c-Myc is troubling since c-Myc is a negative control and
should not show any differential binding. This result remained unchanged after down-sampling c-Myc
samples to similar sequencing depth and/or only assaying a similar number of top peaks from each
dataset (data not shown). Most differential regions identified by MAnorm3 in c-Myc are unique to the
Stanford data set (Figure 3-7a).
When comparing between cell types, the MAnorm3 strategy performs comparably with the other
sample-level comparison methods but has the highest number of unique differential regions (TCF7L2
results shown in Figure 3-7b). In contrast, we see a dramatic effect between treatment types within the
same cell type with fewer differential binding regions in both GR and ERα experiments. The MAnorm3
normalization strategy assumes that peaks shared between two conditions are not differentially bound,
thus, this method is not expected to perform well on the GR dataset where more binding is expected
with higher hormone dosages. Thus, normalizing using MAnorm3 removes true binding differences in
shared peak regions. As a result, MAnorm3 has reduced sensitivity for GR dataset (Figure 3-7c) where
over 3000 differential binding regions identified by edgeR, DiffBind and Voom were not identified by
MAnorm3. It is unknown if shared binding sites are unchanging for the ERα dataset after different
hormone treatments but we find similar number of differentially bound regions compared with edgeR
using effective library size. We believe that MAnorm3 may be useful in limited circumstances where its
assumptions are met, but encourage external validation of results since thousands of differential regions
were identified in one of our negative control dataset.
Voom:
a) Control (cmyc stanford vs yale and pol2 even vs odd)
71
b) Cell Type (TCF hek vs hela and NRF1 Gm12878vs H1-hesc)
72
c) Hormone (GR high vs low and ERa bpa vs est)
Figure 3-8 Voom MAplots
We show MAplots after performing voom on (a) control samples (b) cell-type comparisons, and (c) hormone treatment
comparisons. The blue horizontal line shows 0 fold change and the red dots are significant differential binding while black dots
are non-significantly different binding between conditions.
We used Voom with full library size, our preferred normalization method from above, to transform read
counts into signal values that follow a normal distribution; this allows the use of microarray methods on
read count data. This workflow produces similar or slightly fewer significant differential regions
compared to edgeR with full library size (Table 3-3). We observe from MA plots that this transformation
greatly reduces the read counts, especially for regions with high fold change, when compared to edgeR
with full library size (Figure 3-8). However, the lower counts had little effect on the number of
differential regions found.
Defining different reference-binding regions
We compared four different strategies for defining reference-binding regions: merge pairwise, merge all,
merge centered, and pooled (see Methods and Figure 3-2). Several of the strategies had an influence on
the final number of differential regions found. DiffBind found more differential peaks when using merge
centered reference-binding regions in almost all between cell-type comparisons. With merge centered,
peak width was fixed (at 50 bp). Shorter peak widths will reduce counts; these counts are further
reduced by input subtraction, and result in highly increased fold-change estimates. This shows that
DiffBind is sensitive to peak with and trimming peak widths will impact DiffBind results.
We recommend using merge pairwise strategy to define reference regions due to disadvantages in the
other strategies. For pooled strategy, shared peaks between two conditions will be double counted. For
the merge all strategy, we find more differential binding regions but some of these additional regions
73
have low binding signal and were not originally identified as being peaks in the conditions being
compared making these findings suspect (
Figure 3-6g-h). These low binding regions inflate the number of differential sites found. For the merge
centered strategy, we choose a peak width that would merge nearby binding sites but keep biologically
distinct binding sites separate. We achieved this by artificially trimming our peak widths; however,
smaller peaks lead to a counts matrix with lower counts. Lower counts resulted in less continuous fold
changes and higher fold changes at peaks with lower average counts (
Figure 3-6i). Meanwhile, the occurrence of distinct overlapping peaks with different fold change (Figure
3-1b, twin peaks), the specific situation that the merge centered strategy was designed to retain, were
rare in the datasets that we investigated. Thus, when peaks are not closely grouped together, we
recommend using the merge pairwise strategy to define reference regions; it is simple to implement and
does not suffer from the drawbacks encountered by other strategies. For proteins with multiple nearby
binding regions, we recommend carefully defining smaller windows of binding regions to reduce the
chance of combining separate binding events.
Whole genome binning
We investigated the effect of defining reference-binding regions using peak caller output versus binning
the genome agnostically. We used the GR binding in high versus low doses of hormone and TCF7L2
binding between HEK293 and HeLaS3 cells datasets and compared binning versus using edgeR with full
library size normalization. For binning, each chromosome was split into 200 bp non-overlapping bins,
and read counts were obtained for each bin. We found that after filtering bins with low average binding
and blacklisted regions, over 85% of the significantly differential regions found by using genome-wide
bins are shared with edgeR using full library size and merge pairwise reference-binding regions (data not
shown). The similarity of the results suggest that most differential regions are captured as peaks in the
data set and using reference-binding regions defined by peaks instead of creating bins genome-wide will
greatly reduce the number of comparisons without compromising sensitivity.
Input scaling in DiffBind
ChIP-seq input is used to capture biases such as copy number alterations that can exist between
different cell types. Interestingly, subtracting input in comparisons within the same cell type can create
rank changes. Instead of comparing the differences in number of significantly differential regions, we
rank the peaks found with no input subtracted and after input subtraction to see how input subtraction
using DiffBind changes the ranks. We find high correlation before and after input subtraction (r=0.68 for
GR high vs low and r=0.79 for GR high vs med). Even though the correlation is high, differences are still
74
found in the top differential regions when input is subtracted. Additionally, input subtraction decreases
average binding at most binding regions and greatly increases fold change at some binding regions while
the number of differentially bound regions stays around the same (Table 3-3). The increase in fold
change is due to DiffBind imposing a minimum count of 1 when subtracting input signal (to allow for log
transformation). In conclusion, subtracting input will affect fold change estimates in regions with high
input and can change the ranks of peaks while keeping the number of significant differential regions
about the same. This is true for comparisons within cell-type and also between cell-type. Subtracting
input also tends to inflate fold-change estimates due to lower counts and minimum count requirement.
Comparing reproducibility of sample-level comparison methods
Figure 3-9 Clustering of fold changes from top differentially bound regions from each method
We take the top 100 differential peaks in each condition and combined the regions to create a superset of regions. We used
hierarchical clustering with average linkage and Gower's distance on log2 fold changes with weighted based on how often
region was found in the superset. a) GR high versus low comparison, b) TCF7L2 HEK293 cell type versus HeLa S3 cell type. For
the Method row: edgeR (black), DiffBind (dark blue), MAnorm3 (brown), and voom full (orange). For the libsize row: effective
library size (white) and full library size (blue). For the Refpeak row: Pooled (grey), Merge Pairwise (light blue), Merge all (tan),
Merge centered (green) methods to define reference binding regions.
To evaluate the similarity and differences between the sample-level comparison methods, we cluster
the fold change estimates for the top differentially bound sites. For each method, we cluster fold change
estimates obtained from using different reference-binding regions, ways to model input, and choices for
normalization with more weight on the regions that were more robustly differentially regulated (Table
3-2). For cell-type comparisons, TCF7L2 showed that methods clustered based on whether input was
subtracted and if centered reference regions were used (Figure 3-9a). Since two of the three TCF7L2 cell
types were cancer cell lines, input subtraction can help remove effects from copy number differences
75
due to chromosomal abnormalities. All but one of the NRF1 fold changes also clustered based on
subtracting input and centered reference regions but also clustered based on using the Voom method
(Figure 3-10). The comparison that did not cluster based on input subtraction was between two non-
cancer cell lines.
For hormone comparisons, normalization makes the biggest difference on fold change estimates (GR in
Figure 3-9b) and samples using full library size (blue) cluster together in five out of the six pair-wise
comparisons. The comparison that did not cluster by normalization method had much lower number of
differential regions found (ERα est vs gen) (Figure 3-10).
ERα
76
77
GR
78
NRF1
79
TCF7L2
Figure 3-10 Clustring between all pairwise comparisons
Clustering diagrams, caption is same as Figure 3-9
80
Correcting for G+C content bias
Figure 3-11 Smoothed scatter plot of log fold change versus G+C content
Darker color corresponds to more binding peaks. a) GR ChIP-seq at high hormone levels versus low hormone levels; b) TCF7L2
ChIP-seq in HEK293 cells versus HeLa S3 cells. The red line represents a loess curve through the data.
It has been shown that G+C content and gene length can bias RNA-seq data (Hansen et al. 2012; Risso et
al. 2011). We investigate whether this trend also exists for ChIP-seq data. We calculated G+C content
and peak width (analogous to gene length) for all binding regions in each condition. We find that most
peak widths are the same with higher-ranking peaks having more variable peak widths (Supplemental
Figure 1). Since most peak widths are the same, we did not adjust for the affect of peak widths. When
plotting fold change versus G+C content in TCF7L2 samples, we find two groups of G+C content (Figure
3-11). The low G+C content group has highly variable differential binding while the high G+C content
group has negligible fold change. It is known that TCF7L2 binds both enhancers and promoters (Frietze
et al. 2012). This suggests that the low G+C content group is cell-type specific TCF7L2 binding and likely
enriched for enhancer binding; while high G+C content group is shared TCF7L2 binding between cell-
types and likely enriched for promoter binding. Biologically, we expect the primary regulation
mechanism of a transcription factor such as TCF7L2 to achieve cell type specificity through enhancers
and we see evidence for that through the fold change differences at low G+C content binding regions
but not at high G+C content regions. This finding serves as a warning against G+C content correction in
ChIP-seq experiments since removal of G+C content effects could eliminate real biological signals.
81
Validation of fold changes
Figure 3-12 Boxplot of differences in fold change between ChIP-qPCR validation and differential ChIP-seq methods
Difference in fold changes is calculated by subtracting ChIP-qPCR fold change from estimated fold changes using various
differential binding methods to quantify ChIP-seq. The closer the boxplot is to zero (blue line) the closer the fold changes
estimated by the various methods are to ChIP-qPCR validation. a) GR high vs low, n=8 binding regions; b) TCF7L2 for HEK293 cell
type versus HeLa S3, n=8 binding regions.
We validated fold changes for 10 peaks in the GR dataset via ChIP-qPCR and used the validation data
from the supplement of the published TCF7L2 results (Frietze et al. 2012)(also S. Frietze, personal
correspondence)(Figure 3-12). We compared fold changes from qPCR with fold changes estimated from
the sample-level comparison methods. For many sites, estimated fold change from ChIP-seq differential
methods is markedly higher than fold change values from qPCR. We find that for the GR dataset, using
full library size parameter performs closest to qPCR for high versus low and med versus low with edgeR
performing better than DiffBind. However, high versus med validation matched effective library size
closer than full library size. This could be due to less difference in total bound protein between mid and
high treatment doses, or due to interaction between DNA-bound protein levels and antibody efficiency
that is not captured by simply summing counts for normalization. With TCF7L2 datasets, the different
methods have fold changes comparable to qPCR for the validated regions. We conclude from these
results that of the methods we considered, edgeR with full library size generates fold changes most
similar to qPCR validation under most conditions but further work will need to be done to identify a
normalization method that better matches qPCR validation. Note that for the data presented, the GR
ChIP-qPCR was performed on samples using the same cell type and antibody and following the
published protocol as closely as possible, but the qPCR experiment used different DNA than the ChIP-seq,
82
which could introduce additional variability. In contrast, the TCF7L2 ChIP-qPCR was performed on the
same DNA as ChIP-seq so we expect and find fold change estimates to match more closely.
Relationship between motifs and differential binding peaks
We wanted to know if there were any patterns between direct binding (presence of motif) and
differential peaks. We used MAST (Bailey and Gribskov 1998) from MEME suite (v4.9.0) to scan for the
top motif characterized from ENCODE ChIP-seq datasets (Wang et al. 2012). Of the peaks called by
ENCODE, most of the binding regions for transcription factors identified by ChIP-seq contain a binding
motif (c-Myc: 68%, TCF7L2: 91%, NRF1: 97%, GR: 80%, ERα: 72%). When we looked for relationship
between motif prevalence and significantly differential peaks, we found that GR peaks with GR motif are
slightly more likely to be differentially bound and NRF1 peak regions without motifs are slightly more
likely to be differentially bound. Since GR peaks that are not differentially bound have lower signal, this
suggests that weaker GR binding sites are unchanging with hormone and have indirect GR binding (lack
canonical motif). For NRF1, most of the differential binding sites had lower signal, thus suggesting that
weak NRF1 binding sites are indirect binding and are more likely to be cell type specific.
Using negative regions for normalization
0
200
400
600
800
1000
1200
1400
0
2
4
6
8
10
12
Artifact
chrM
83
Figure 3-13 Overrepresentation of artifact and chrM reads
This figure shows the over-representation of artifact regions and chromosome M (chrM) in the various different ChIP-seq
experiments. The scale for artifact regions is on the left while scale for chrM is on the right. The scale represents observed /
expected amount of sequence considering the size of artifact and chrM regions.
Early on in my work, I considered alternative methods for normalization including using noise from
artifact regions or sequence from chromosome M. We found that chrM was often found several
hundred times higher than expected and can be attributed to multiple copies of chrM in the cell,
however, the protein of interest is not known to bind to the mitochondria. We find a pretty consistent
pattern of higher ratio of chrM in our input samples with ratios of artifact regions staying about the
same (Figure 3-13). I ended up not pursuing this topic further to narrow the scope of my work and also
because I believed that these noise regions could be protocol specific and there was no good biological
region that they should stay the same.
0
100
200
300
400
500
600
700
800
900
1000
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
Artifact
chrM
84
Different peak calling in GR
Figure 3-14 Differential GR binding upon Hic-5 depletion
Venn diagram (left) shows significant GR peaks from control scrambled depletion (siNS) ChIP-seq, Hic-5 depletion (siHic) ChIP-
seq, and significant differential peaks between siNS and siHic using edgeR with effective library size (sig_edgerdefault_pairwise).
The MAplot (right) shows fold change and binding signal strength with red dots representing significantly differential binding.
We performed differential binding analysis using GR ChIP-seq with and without Hic-5 depletion. Since
we expect Hic-5 to only act on a subset of GR binding sites, we used the effective library size
normalization. Over nine thousand differential peaks were found with almost half of these peaks in
shared binding regions. These shared differential peaks would be missed in a overlaps analysis,
highlighting the utility of differential binding analysis. Additionally, many of the peaks found only in siHic
were not differential (7117), suggesting that the peak caller might have called many small peaks. Motif
analysis performed on peaks only in siHic (not in siNS control) did not find de-novo binding site specific
to Hic-5 depletion or enrichment in other known motifs.
Discussion
A number of new software packages for performing differential transcription factor (TF) binding analysis
are available in the public domain without accompanying papers that assess their performance. These
approaches often borrow from methods originally developed for RNA-seq experiments, allowing
differential peaks to be identified in regions of DNA that are bound by protein under both conditions
(common/shared peaks) rather than only identifying unique versus shared binding by overlaps. Many
approaches are limited to comparing pairs of experiments (Taslim et al. 2011; Heinz et al. 2010; Huang
et al. 2011; Bardet et al. 2012; Zhang et al. 2008) while others require allow for multiple biological
replicates in at least one condition (Stark and Brown 2011; Liang and Keles 2012). The combination of
different analysis approaches and lack of gold-standard data sets make a general comparison of
85
methods difficult. In this paper we standardize a number of analysis steps, such as peak calling and read
counting, in order to evaluate the effect of different methods applied within a controlled analysis
pipeline. We focus on evaluating methods for defining reference-binding regions, correcting for non-
specific binding, and normalizing for sequencing depth, when using the same negative-binomial
regression model for hypothesis testing. The sensitivity of results to different methods under different
biological comparisons highlights some key steps of analysis as well as questions that bioinformaticians
should consider asking biologists about the data.
Of the three analysis steps evaluated, normalization for sequencing depth had the greatest potential for
influencing differential binding site discovery. First, it is important to realize that ChIP-seq peak callers
can be sensitive to read depth, with more reads leading to more peaks found even when using the same
FDR and IDR cutoff, as seen in the c-Myc control dataset. Normalization for sequencing depth is a crucial
step that could lead to different conclusions if ignored or performed incorrectly. Large differences in
DNA-protein binding between conditions would lead to imbalance in number of peaks found and
skewed signal to noise ratios when comparing between conditions. A biologist would know whether
differences in the amount of DNA bound protein targeted by the ChIP antibody are expected between
conditions. Knowing that a protein binds few DNA sites could help differentiate between samples with
low enrichment versus declaring a sample failed. For example, in the GR hormone treatment dataset,
low hormone treatment binds very few sites and it is known that the amount of bound GR protein
changes with hormone concentration. Thus, using full library size for normalization produced results
that were consistent with the expectation of higher binding with higher hormone concentration. The
result using effective library size normalization differed dramatically and suggested an over-adjustment
from having a majority of peaks uniquely bound in the high or medium hormone concentration. From
our GR qPCR validation, the high versus low and medium versus low comparisons showed more binding
with more hormone and full library size normalization produced the most accurate fold changes.
However, validation of high versus medium hormone comparison found that effective library size was
most accurate suggesting that something in the complex interplay between enriched DNA, antibody
efficiency, and sequencing is not captured and warrants further study.
The second most influential analysis result depended on the definition of reference-binding regions.
Most previously proposed methods merge overlapping peaks before counting over them. Multiple
options exist for merging overlapping regions to define reference peaks. We investigated: no merging,
merging peaks from two conditions that are being compared, merging peaks in all possible conditions
under comparison and merging trimmed peaks centered on a summit. Typically, peaks that overlap 1
base pair or more are merged; however, one could create more stringent criteria that require a
percentage of the peak to be overlapping. The decision to merge nearby peaks should be based on both
peak caller output and protein binding profile. Different peak callers will often call peaks with different
widths. Peak callers that identify short peaks close to each other or proteins that have multiple binding
sites in close proximity could have decreased sensitivity when nearby peaks are merged. Diagnostic plots
of distance between peaks and peak widths (Figure 3-3) can be used to guide the merging strategy.
Some recent work looking at using whole genome bins versus using peaks has presented a hybrid
approach that combines p-values across windows in a peak to avoid inflating (Lun and Smyth 2014).
86
Lastly, we evaluated the effect of correcting for input binding. This typically involves subtracting out
background noise to obtain a stronger signal where there is real binding. We found that the background
correction method implemented by DiffBind impacted the ranking of differential binding sites, their
fold-change estimates, and the number of differential binding sites discovered for comparisons within
the same cell-type. In particular, input subtraction tended to protect us from finding differential binding
in regions of high background. Interestingly, we found little difference in the number of differential
binding sites identified in the comparisons between cell types despite known copy number alterations in
some of the cancer cell lines studied. Future work could investigate how subtracting input to account for
copy number variations compares to using dedicated CNA adjustment methods such as ABCD-DNA
(Robinson et al. 2012).
ChIP-seq and its analysis is a complex multi-step process. To minimize the effects from sample
preparation, the datasets we selected for analysis were samples generated by the same ENCODE lab.
We assume that samples from the same lab are created following similar protocols. However, technical
variations such as operator biases could still exist within a lab. The methods we applied are suitable only
when the same antibody is used since different antibodies have different binding efficiencies.
Comparing binding levels between samples using different antibodies is a topic of ongoing research (Bao
et al. 2013). The results may depend on peak caller used and we assume that the peak callers will
identify proper binding regions and exclude regions with high input binding.
For future work could consider older approaches that do not require replicates as well as recent and less
well studied factors not covered within the scope of our study including: newer RNA-seq normalization
methods (Glusman et al. 2013), an extension to TMM that is used by edgeR (Kadota et al. 2012), rank
based normalization methods (Diaz et al. 2012), differences in PCR amplification in library construction
(Benjamini and Speed 2012), proximity to highly expressed genes (Teytelman et al. 2013), shape of
peaks (Schweikert et al. 2013), and different RNA-seq methods (Anders and Huber 2010). An ideal
control would be the usage of spike-in DNA (Bonhoure et al. 2014) or measuring protein levels enriched
may improve the accuracy of quantitative ChIP-seq analysis, however, accurately measuring the quantity
of ChIP DNA is a challenge.
Conclusion
Differential peak calling methods are potentially useful in two scenarios: comparing binding between
different cell types and comparing binding within the same cell type. The latter scenario is often the
more biologically interesting but few published datasets are available with multiple treatments or time
points. Of the steps that we outlined for differential peak calling, we find that normalization is most
influential when comparing treatments within a cell line and changing normalization will cause fold-
change differences and will identify vastly different sets of regions with differential binding. Subtracting
input causes the biggest difference in fold changes when comparing between cell lines but had less of an
effect on identifying significant differential binding.
As a greater number of ChIP-seq datasets are published with multiple conditions, there will be more
interest in methods to quantify binding in different conditions. Our study provides a detailed
comparison of how several published differential ChIP-seq binding methods differ and compares how
87
results from condition-level comparison methods differ from sample-level comparison methods. We
also provide an overview of differential binding analysis workflow and describe the steps that differ
between methods. Some key analysis steps that have the biggest influence in downstream results are
highlighted and some pitfalls and challenges of working with this type of data are elucidated. Due to the
variability of ChIP-seq experiments, biological replicates are important and should be used when
quantifying peak heights. We recommend using merge pairwise to define reference regions, and using
edgeR with full library size for within-cell type treatment comparisons and using DiffBind to subtract
input coupled with full library size normalization when comparing between cells with high aneuploidy.
88
Chapter 4 Conclusion
The overarching goal of my research has been to expand our understanding of the mechanisms through
which genes are regulated. The goal is that eventually, the processes through which stimuli trigger
changes in gene expression and cause alterations in physiological pathways will be fully understood. This
has important clinical implications since aberrant gene expression is the fundamental cause of many
diseases. Treatments that reverse some of these gene expression changes could be extremely beneficial
but require detailed understanding of mechanism. Gene expression can be controlled at different scales:
large-scale control of genes can occur through receptors, which upon activation can regulate thousands
of genes through cascades of cellular changes, including gene expression changes. In contrast, targeting
a specific promoter would directly control the expression of a single gene. My work bridges gene
regulation at both small and large scales by studying the mechanism through which specificity of large-
scale gene expression changes occur.
Recent biomedical research has elucidated mechanisms through which gene regulation can occur. In a
gene-centric approach, research has focused around the transcription start site to identify factors that
activate or repress gene expression. Transcription factor proteins that bind DNA at promoters and
enhancers recruit coregulator proteins that help regulate gene expression by remodeling chromatin,
modifying histones, and recruiting proteins involved in gene transcription such as mediator complex,
basal transcription factors, and RNA polymerase. These proteins, known as coregulators, have been the
focus of my work and we believe that combinations of these proteins are important for modulating
selectivity of gene response.
Although coregulators in general do not bind directly to DNA, sequence context is still very important for
recruitment of coregulators. Changes in DNA sequence context can cause bound transcription factors to
assume slightly different conformations opening up different surfaces for recruitment and binding of
different sets of coregulators (Meijsing et al. 2009). Nearby motifs could suggest other transcription
factors that could potentially bind and recruit coregulators. I study the regulatory roles of coregulators
with a subset of nuclear receptors that are inducible by hormone. These hormone receptors bind
enhancers and silencer elements, have known binding motifs, and regulate known physiological
pathways.
My work investigated the effect of coregulators on gene expression and steroid hormone binding in
different cell types, using different steroid hormones, and at different time points. Coregulator-
regulated genes can be inferred using ChIP-seq experiments. Since coregulators do not bind directly to
DNA, ChIP-seq results are often noisy and unreliable without a very good antibody. Instead, my
approach has been to identify binding changes in the steroid receptor after coregulator depletion. With
this approach, we look for changes in steroid receptor binding using ChIP-seq. Currently, ChIP-seq
comparisons typically involve overlapping peaks, but this binary classification method cannot detect
differential binding when both conditions have peaks. For example, if the steroid receptor is bound at a
certain site in fewer cells (lower signal in ChIP-seq) after coregulator depletion, the ChIP-seq peak caller
might call a peak at this site and the information about lower signal would be lost. To extend on this
approach of binary classification (peak or no peak), I compared methods designed to quantify and
89
estimate fold changes for these binding sites in order to identify changes in steroid receptor binding
from coregulator depletion. This allows us to integrate information on genes affected by coregulator
depletion with sites affected by coregulator depletion to gain understanding into how coregulators and
steroid receptors regulate gene expression. In addition to my work with coregulators, identifying a
proper workflow for differential transcription factor binding analysis can be applied generally to
experiments assaying protein localization changes upon depletion of a target protein. Differential
binding analysis can also be used to assay differences between cell lines and drug treatments.
Figure 4-1 Model of cellular processes involved in gene regulation
This figure shows the various biological processes through which specificity of gene regulation can occur. It puts into
perspective step at which coregulators function. Starting from the top left, the stimulus, such as hormone, triggers a cascade of
changes in the cell. Hormone binds to receptors in the cell that then bind specific DNA motifs and regulate downstream target
genes. These target genes can then create a feedback loop by regulating other genes and eventually leading to different
pathways activated for physiological response.
Although hormone response can be quite broad in terms of number of genes regulated, specialized cell
types can respond in specific ways. We believe this cell-type specific hormone response is controlled by
coregulator proteins that interact with transcription factors; however, other cellular processes can also
contribute to response specificity such as abundance of receptor or genes in accessible euchromatin
(Figure 4-1). Many proteins are required for gene regulation; the abundance of these proteins in
different cellular compartments, shuttling between cellular compartments, rate of synthesis and
90
degradation, post translational modifications, alternative splice isoforms, and associated binding
partners of these proteins can affect gene regulation. Hormone receptors then bind to DNA and recruit
a complex of proteins to activate or repress downstream gene expression. The initial gene expression
changes are primary effects of the receptor. These changes in gene expression will produce RNAs and
proteins that lead to both primary and secondary changes that will further contribute to the regulation
of specific physiological pathways. I focus my work on the receptor binding to DNA and the proteins
recruited, and the resulting changes in gene expression.
To understand how coregulators can modulate hormone response, I collaborated with biologists to
analyze data where coregulators were depleted and hormone was added. This allows us to categorize
gene regulation based on its hormone responsiveness and dependence on coregulator. I identified
several subsets of genes that responded to both hormone and coregulator and associated these to
physiological pathways. By identifying physiologically interesting pathways that are affected, my work
further supports the possibility of using coregulators in therapy to modulate effects of treatment.
Although this is one interesting avenue for gene regulation that has not been well studied, other
avenues for gene specificity are known.
For transcription factors, cis-regulatory context of the binding site – features such as presence of histone
marks and modifications to DNA – are important for chromatin accessibility and determining if
transcription factors can bind to the DNA of the regulatory element. The factors that contribute to cis-
regulatory context are often referred to as epigenetics and understanding their function is currently an
active field of research (You and Jones 2012). It is unsurprising then that many coregulators can add,
remove, and recognize cis-regulatory components such as histone marks. Previous research
characterized a “histone code” where certain histone modifications are found to have specific activating
or repressive function (Jenuwein and Allis 2001). This concept could further be extended to include
coregulators and could be called a “coregulator code” which specifies the roles of different
combinations of coregulators at different genes. Coregulators interact with transcription factors to
modulate the expression of transcription factor target genes. To understand how coregulators can help
contribute to gene specificity, we use hormones such as estrogen, androgen, and glucocorticoids to
induce gene expression changes within cells and investigate how these hormone-induced changes differ
upon coregulator depletion
Understanding the role of regulatory elements and hormone selectivity is at the intersection of two
other areas of research. In the past, genome wide association studies (GWAS) were hailed as the
method to identify the genetic causes of inheritable diseases (McClellan and King 2010). Unfortunately,
GWAS studies have found that most single nucleotide polymorphisms (SNPs) are not found in coding
regions of genes but rather in non-coding regions of the genome. Characterization of these non-coding
regions containing SNPs has found enhancer activity and transcription factor binding (Biancolella et al.
2014). This further reinforces the importance of regulatory elements, and understanding how these
regions regulate gene expression would give insight into what becomes disregulated when a mutation or
SNP at a regulatory element is associated with inheritable diseases. In addition, my work on
understanding the selectivity of hormone intersects with the growing field of research on endocrine
disruptors. These compounds are found in drugs, pesticides, plastics, and there is a growing health
91
concern of the consequences of consuming these products. Since these compounds bind to hormone
receptors and can have a selective hormone-like response, understanding mechanisms of hormone
specificity can contribute to the understanding of potential side effects from endocrine disruptors.
In a way, understanding gene regulation is similar to understanding climate regulation. Both have
potential to solve life threatening challenges that currently exist – fully understanding and being able to
control gene regulation could mean the end of many diseases while fully understanding and being able
to control weather could mean the end of droughts and floods. In both fields of research, the tools and
assays needed to understand the mechanisms of regulation in a comprehensive context exist – genome
wide assays in biology and global thermal imaging for weather. Important components for regulation
have been identified, however, detailed mechanisms of how everything fits together is not yet
understood. I am optimistic that future breakthroughs in understanding gene regulation at a very basic
level will lead to advances biomedical research that will impact the lives of many patients.
92
References
’t Hoen P a C, Ariyurek Y, Thygesen HH, Vreugdenhil E, Vossen RH a M, de Menezes RX, Boer JM, van
Ommen G-JB, den Dunnen JT. 2008. Deep sequencing-based expression analysis shows major
advances in robustness, resolution and inter-lab portability over five microarray platforms. Nucleic
Acids Res 36: e141.
Anders S, Huber W. 2010. Differential expression analysis for sequence count data. Genome Biol 11:
R106.
Baek S, Sung M, Hager GL. 2012. Quantitative analysis of genome-wide chromatin remodeling. ed. R.H.
Morse. Methods Mol Biol 833: 433–41.
Bailey T, Krajewski P, Ladunga I, Lefebvre C, Li Q, Liu T, Madrigal P, Taslim C, Zhang J. 2013. Practical
Guidelines for the Comprehensive Analysis of ChIP-seq Data. ed. F. Lewitter. PLoS Comput Biol 9:
e1003326.
Bailey TL, Gribskov M. 1998. Combining evidence using p-values: application to sequence homology
searches. Bioinformatics 14: 48–54.
Bao Y, Vinciotti V, Wit E, ’t Hoen P a C. 2013. Accounting for immunoprecipitation efficiencies in the
statistical analysis of ChIP-seq data. BMC Bioinformatics 14: 169.
Barbosa-Morais NL, Dunning MJ, Samarajiwa SA, Darot JFJ, Ritchie ME, Lynch AG, Tavaré S. 2010. A re-
annotation pipeline for Illumina BeadArrays: improving the interpretation of gene expression data.
Nucleic Acids Res 38: e17.
Bardet AF, He Q, Zeitlinger J, Stark A. 2012. A computational pipeline for comparative ChIP-seq analyses.
Nat Protoc 7: 45–61.
Beck IME, Vanden Berghe W, Vermeulen L, Yamamoto KR, Haegeman G, De Bosscher K. 2009. Crosstalk
in inflammation: the interplay of glucocorticoid receptor-based mechanisms and kinases and
phosphatases. Endocr Rev 30: 830–82.
Benjamini Y, Speed TP. 2012. Summarizing and correcting the GC content bias in high-throughput
sequencing. Nucleic Acids Res 40: e72.
Biancolella M, Fortini BK, Tring S, Plummer SJ, Mendoza-Fandino G a, Hartiala J, Hitchler MJ, Yan C,
Schumacher FR, Conti D V, et al. 2014. Identification and characterization of functional risk variants
for colorectal cancer mapping to chromosome 11q23.1. Hum Mol Genet 23: 2198–209.
Biddie SC, Conway-Campbell BL, Lightman SL. 2012. Dynamic regulation of glucocorticoid signalling in
health and disease. Rheumatology (Oxford) 51: 403–12.
Bittencourt D, Wu D-Y, Jeong KW, Gerke DS, Herviou L, Ianculescu I, Chodankar R, Siegmund KD, Stallcup
MR. 2012. G9a functions as a molecular scaffold for assembly of transcriptional coactivators on a
subset of glucocorticoid receptor target genes. Proc Natl Acad Sci U S A 109: 19673–8.
93
Bonhoure N, Bounova G, Bernasconi D, Praz V, Lammers F, Canella D, Willis IM, Herr W, Hernandez N,
Delorenzi M. 2014. Quantifying ChIP-seq data: a spiking method providing an internal reference for
sample-to-sample normalization. Genome Res.
Cabal-Hierro L, Lazo PS. 2012. Signal transduction by tumor necrosis factor receptors. Cell Signal 24:
1297–305.
Cairns JM, Dunning MJ, Ritchie ME, Russell R, Lynch a G. 2008. BASH: a tool for managing BeadArray
spatial artefacts. Bioinformatics 24: 2921–2.
Champely S. 2012. pwr: Basic functions for power analysis. CRAN. http://cran.r-project.org/package=pwr.
Chen D, Ma H, Hong H, Koh SS, Huang SM, Schurter BT, Aswad DW, Stallcup MR. 1999. Regulation of
transcription by a protein methyltransferase. Science 284: 2174–7.
Chodankar R, Wu D-Y, Schiller BJ, Yamamoto KR, Stallcup MR. 2014. Hic-5 is a transcription coregulator
that acts before and/or after glucocorticoid receptor genome occupancy in a gene-selective
manner. Proc Natl Acad Sci U S A.
Clegg DJ. 2012. Minireview: the year in review of estrogen regulation of metabolism. Mol Endocrinol 26:
1957–60.
Cray C, Zaias J, Altman NH. 2009. Acute phase response in animals: a review. Comp Med 59: 517–26.
Dasgupta S, Lonard DM, O’Malley BW. 2014. Nuclear receptor coactivators: master regulators of human
health and disease. Annu Rev Med 65: 279–92.
De Bosscher K. 2010. Selective Glucocorticoid Receptor modulators. J Steroid Biochem Mol Biol 120: 96–
104.
Diaz A, Park K, Lim D a, Song JS. 2012. Normalization, bias correction, and peak calling for ChIP-seq. Stat
Appl Genet Mol Biol 11: Article 9.
Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis C a, Doyle F, Epstein CB, Frietze S, Harrow J, Kaul R, et
al. 2012. An integrated encyclopedia of DNA elements in the human genome. Nature 489: 57–74.
Dunning MJ, Barbosa-Morais NL, Lynch AG, Tavaré S, Ritchie ME. 2008. Statistical issues in the analysis
of Illumina data. BMC Bioinformatics 9: 85.
Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HOK, Buffalo V, Zerbino DR, Diekhans M, et al.
2011. Assemblathon 1: a competitive assessment of de novo short read assembly methods.
Genome Res 21: 2224–41.
Fan H, Morand EF. 2012. Targeting the side effects of steroid therapy in autoimmune diseases: the role
of GILZ. Discov Med 13: 123–33.
94
Frietze S, Wang R, Yao L, Tak YG, Ye Z, Gaddis M, Witt H, Farnham PJ, Jin VX. 2012. Cell type-specific
binding patterns reveal that TCF7L2 can be tethered to the genome by association with GATA3.
Genome Biol 13: R52.
Furey TS. 2012. ChIP-seq and beyond: new and improved methodologies to detect and characterize
protein-DNA interactions. Nat Rev Genet 13: 840–52.
Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al.
2004. Bioconductor: open software development for computational biology and bioinformatics.
Genome Biol 5: R80.
Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan K-K, Cheng C, Mu XJ, Khurana E, Rozowsky J,
Alexander R, et al. 2012. Architecture of the human regulatory network derived from ENCODE data.
Nature 489: 91–100.
Gertz J, Reddy TE, Varley KE, Garabedian MJ, Myers RM. 2012. Genistein and bisphenol A exposure
cause estrogen receptor 1 to bind thousands of sites in a cell type-specific manner. Genome Res 22:
2153–62.
Glusman G, Caballero J, Robinson M, Kutlu B, Hood L. 2013. Optimal scaling of digital transcriptomes. ed.
I.K. Jordan. PLoS One 8: e77885.
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson D a, Amit I, Adiconis X, Fan L, Raychowdhury R,
Zeng Q, et al. 2011. Full-length transcriptome assembly from RNA-Seq data without a reference
genome. Nat Biotechnol 29.
Gross KL, Cidlowski JA. 2008. Tissue-specific glucocorticoid action: a family affair. Trends Endocrinol
Metab 19: 331–9.
Hansen KD, Irizarry RA, Wu Z. 2012. Removing technical variability in RNA-seq data using conditional
quantile normalization. Biostatistics 13: 204–16.
Heinlein C a, Chang C. 2002. Androgen receptor (AR) coregulators: an overview. Endocr Rev 23: 175–200.
Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK. 2010.
Simple combinations of lineage-determining transcription factors prime cis-regulatory elements
required for macrophage and B cell identities. Mol Cell 38: 576–89.
Hoheisel JD. 2006. Microarray technology: beyond transcript profiling and genotype analysis. Nat Rev
Genet 7: 200–10.
Holmqvist P-H, Mannervik M. 2013. Genomic occupancy of the transcriptional co-activators p300 and
CBP. Transcription 4: 18–23.
Huang W, Umbach DM, Vincent Jordan N, Abell AN, Johnson GL, Li L. 2011. Efficiently identifying
genome-wide changes with next-generation sequencing data. Nucleic Acids Res 39: e130.
95
Ianculescu I, Wu D-Y, Siegmund KD, Stallcup MR. 2012. Selective roles for cAMP response element-
binding protein binding protein and p300 protein as coregulators for androgen-regulated gene
expression in advanced prostate cancer cells. J Biol Chem 287: 4000–13.
Irizarry R a, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. 2003. Exploration,
normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4:
249–64.
Jenuwein T, Allis CD. 2001. Translating the histone code. Science 293: 1074–80.
Johnson WE, Li C, Rabinovic A. 2007. Adjusting batch effects in microarray expression data using
empirical Bayes methods. Biostatistics 8: 118–27.
Jung S-H. 2005. Sample size for FDR-control in microarray data analysis. Bioinformatics 21: 3097–104.
Kadota K, Nishiyama T, Shimizu K. 2012. A normalization strategy for comparing tag count data.
Algorithms Mol Biol 7: 5.
Kahn M. 2011. Symmetric division versus asymmetric division: a tale of two coactivators. Future Med
Chem 3: 1745–63.
Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. 2013. TopHat2: accurate alignment of
transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14: R36.
Kim JH, Li H, Stallcup MR. 2003. CoCoA, a nuclear receptor coactivator which acts through an N-terminal
activation domain of p160 coactivators. Mol Cell 12: 1537–49.
Kim JH, Yang CK, Heo K, Roeder RG, An W, Stallcup MR. 2008. CCAR1, a key regulator of mediator
complex recruitment to nuclear receptor transcription complexes. Mol Cell 31: 510–9.
Kircher M, Heyn P, Kelso J. 2011. Addressing challenges in the production and analysis of illumina
sequencing data. BMC Genomics 12: 382.
Komm BS, Mirkin S. 2014. An overview of current and emerging SERMs. J Steroid Biochem Mol Biol 143C:
207–222.
Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S, Bernstein BE, Bickel P, Brown JB,
Cayting P, et al. 2012. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia.
Genome Res 22: 1813–31.
Langfelder P, Horvath S. 2008. WGCNA: an R package for weighted correlation network analysis. BMC
Bioinformatics 9: 559.
Langmead B, Trapnell C, Pop M, Salzberg SL. 2009. Ultrafast and memory-efficient alignment of short
DNA sequences to the human genome. Genome Biol 10: R25.
96
Law CW, Chen Y, Shi W, Smyth GK. 2014. Voom: precision weights unlock linear model analysis tools for
RNA-seq read counts. Genome Biol 15: R29.
Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, Gentleman R, Morgan MT, Carey VJ. 2013.
Software for Computing and Annotating Genomic Ranges ed. A. Prlic. PLoS Comput Biol 9:
e1003118.
Lee DY, Northrop JP, Kuo M-H, Stallcup MR. 2006. Histone H3 lysine 9 methyltransferase G9a is a
transcriptional coactivator for nuclear receptors. J Biol Chem 281: 8476–85.
Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform.
Bioinformatics 25: 1754–60.
Li Q, Brown JB, Huang H, Bickel PJ. 2011. Measuring reproducibility of high-throughput experiments. Ann
Appl Stat 5: 1752–1779.
Li R, Yu C, Li Y, Lam T-W, Yiu S-M, Kristiansen K, Wang J. 2009. SOAP2: an improved ultrafast tool for
short read alignment. Bioinformatics 25: 1966–7.
Liang K, Keles S. 2012. Detecting differential binding of transcription factors with ChIP-seq.
Bioinformatics 28: 121–2.
Liang K, Keleş S. 2012. Normalization of ChIP-seq data with control. BMC Bioinformatics 13: 199.
Liao Y, Smyth GK, Shi W. 2013. The Subread aligner: fast, accurate and scalable read mapping by seed-
and-vote. Nucleic Acids Res 41: e108.
Liu Y, Niu N, Zhu X, Du T, Wang X, Chen D, Wu X, Gu HF, Liu Y. 2008. Genetic variation and association
analyses of the nuclear respiratory factor 1 (nRF1) gene in Chinese patients with type 2 diabetes.
Diabetes 57: 777–82.
Lonard DM, O’Malley BW. 2012. Nuclear receptor coregulators: modulators of pathology and
therapeutic targets. Nat Rev Endocrinol 8: 598–604.
Lun ATL, Smyth GK. 2014. De novo detection of differentially bound regions for ChIP-seq data using
peaks and windows: controlling error rates correctly. Nucleic Acids Res.
McClellan J, King M-C. 2010. Genetic heterogeneity in human disease. Cell 141: 210–7.
Mecham BH, Nelson PS, Storey JD. 2010. Supervised normalization of microarrays. Bioinformatics 26:
1308–15.
Meijsing SH, Pufall MA, So AY, Bates DL, Chen L, Yamamoto KR. 2009. DNA binding site sequence directs
glucocorticoid receptor structure and activity. Science 324: 407–10.
Micallef L, Rodgers P. 2012. Drawing Area-Proportional Venn-3 Diagrams Using Ellipses.
http://www.eulerdiagrams.org/eulerAPE.
97
Nair NU, Sahu A Das, Bucher P, Moret BME. 2012. ChIPnorm: a statistical method for normalizing and
identifying differential regions in histone modification ChIP-seq libraries. PLoS One 7: e39573.
Noble WS. 2009. How does multiple testing correction work? Nat Biotechnol 27: 1135–7.
Ou C-Y, Chen T-C, Lee J V, Wang J-C, Stallcup MR. 2014. Coregulator CCAR1 Positively Regulates
Adipocyte Differentiation through the Glucocorticoid Signaling Pathway. J Biol Chem In press.
Park PJ. 2009. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet 10: 669–80.
Pepke S, Wold B, Mortazavi A. 2009. Computation for ChIP-seq and RNA-seq studies. Nat Methods 6:
S22–32.
Purcell DJ, Jeong KW, Bittencourt D, Gerke DS, Stallcup MR. 2011. A distinct mechanism for coactivator
versus corepressor function by histone methyltransferase G9a in transcriptional regulation. J Biol
Chem 286: 41963–71.
Quackenbush J. 2001. Computational analysis of microarray data. Nat Rev Genet 2: 418–27.
Quinlan AR, Hall IM. 2010. BEDTools: a flexible suite of utilities for comparing genomic features.
Bioinformatics 26: 841–2.
Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, Mason CE, Socci ND, Betel D. 2013.
Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data.
Genome Biol 14: R95.
Reddy TE, Gertz J, Crawford GE, Garabedian MJ, Myers RM. 2012. The hypersensitive glucocorticoid
response specifically regulates period 1 and expression of circadian genes. Mol Cell Biol 32: 3756–
67.
Risso D, Schwartz K, Sherlock G, Dudoit S. 2011. GC-content normalization for RNA-Seq data. BMC
Bioinformatics 12: 480.
Robinson MD, McCarthy DJ, Smyth GK. 2010. edgeR: a Bioconductor package for differential expression
analysis of digital gene expression data. Bioinformatics 26: 139–40.
Robinson MD, Oshlack A. 2010. A scaling normalization method for differential expression analysis of
RNA-seq data. Genome Biol 11: R25.
Robinson MD, Smyth GK. 2007. Moderated statistical tests for assessing differences in tag abundance.
Bioinformatics 23: 2881–7.
Robinson MD, Smyth GK. 2008. Small-sample estimation of negative binomial dispersion, with
applications to SAGE data. Biostatistics 9: 321–32.
Robinson MD, Strbenac D, Stirzaker C, Statham AL, Song J, Speed TP, Clark SJ. 2012. Copy-number-aware
differential analysis of quantitative DNA sequencing data. Genome Res 22: 2489–96.
98
Rocke DM, Durbin B. 2001. A model for measurement error for gene expression arrays. J Comput Biol 8:
557–69.
Rosen J, Miner JN. 2005. The search for safer glucocorticoid receptor ligands. Endocr Rev 26: 452–64.
Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, Nusbaum C, Jaffe DB. 2013.
Characterizing and measuring bias in sequence data. Genome Biol 14: R51.
Ross-Innes CS, Stark R, Teschendorff AE, Holmes K a, Ali HR, Dunning MJ, Brown GD, Gojis O, Ellis IO,
Green AR, et al. 2012. Differential oestrogen receptor binding is associated with clinical outcome in
breast cancer. Nature 481: 389–93.
Schäcke H, Döcke WD, Asadullah K. 2002. Mechanisms involved in the side effects of glucocorticoids.
Pharmacol Ther 96: 23–43.
Schulze a, Downward J. 2001. Navigating gene expression using microarrays--a technology review. Nat
Cell Biol 3: E190–5.
Schweikert G, Cseke B, Clouaire T, Bird A, Sanguinetti G. 2013. MMDiff: quantitative testing for shape
changes in ChIP-Seq data sets. BMC Genomics 14: 826.
Shao Z, Zhang Y, Yuan G-C, Orkin SH, Waxman DJ. 2012. MAnorm: a robust model for quantitative
comparison of ChIP-Seq data sets. Genome Biol 13: R16.
Shi W, Oshlack A, Smyth GK. 2010. Optimizing the noise versus bias trade-off for Illumina whole genome
expression BeadChips. Nucleic Acids Res 1–11.
Smyth GK. 2004. Linear models and empirical bayes methods for assessing differential expression in
microarray experiments. Stat Appl Genet Mol Biol 3: Article3.
Song Q, Smith AD. 2011. Identifying dispersed epigenomic domains from ChIP-Seq data. Bioinformatics
27: 870–1.
Stark R, Brown GD. 2011. DiffBind: differential binding analysis of ChIP-seq peak data. Bioconductor.
http://bioconductor.org/packages/release/bioc/html/DiffBind.html.
Storey JD. 2002. A direct approach to false discovery rates. J R Stat Soc Ser B (Statistical Methodol 64:
479–498.
Strehl C, Buttgereit F. 2013. Optimized glucocorticoid therapy: teaching old drugs new tricks. Mol Cell
Endocrinol 380: 32–40.
Taslim C, Huang T, Lin S. 2011. DIME: R-package for identifying differential ChIP-seq based on an
ensemble of mixture models. Bioinformatics 27: 1569–70.
Teytelman L, Thurtle DM, Rine J, van Oudenaarden A. 2013. Highly expressed loci are vulnerable to
misleading ChIP localization of multiple unrelated proteins. Proc Natl Acad Sci U S A.
99
Tsai MJ, O’Malley BW. 1994. Molecular mechanisms of action of steroid/thyroid receptor superfamily
members. Annu Rev Biochem 63: 451–86.
Turgeon JL, McDonnell DP, Martin K a, Wise PM. 2004. Hormone therapy: physiological complexity
belies therapeutic simplicity. Science 304: 1269–73.
Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, Pierce BG, Dong X, Kundaje A, Cheng Y, et al.
2012. Sequence features and chromatin structure around the genomic regions bound by 119
human transcription factors. Genome Res 22: 1798–812.
Wu D-Y, Bittencourt D, Stallcup MR, Siegmund KD. 2014a. Comparison of methods for differential
transcription factor binding analysis of ChIP-seq data. prep.
Wu D-Y, Ou C-Y, Chodankar R, Siegmund KD, Stallcup MR. 2014b. Distinct genome-wide, gene-specific
selectivity patterns of four glucocorticoid receptor coregulators. prep.
Xu H, Wei C-L, Lin F, Sung W-K. 2008. An HMM approach to genome-wide identification of differential
histone modification sites from ChIP-seq data. Bioinformatics 24: 2344–9.
You JS, Jones P a. 2012. Cancer genetics and epigenetics: two sides of the same coin? Cancer Cell 22: 9–
20.
Yu EJ, Kim S-H, Heo K, Ou C-Y, Stallcup MR, Kim JH. 2011. Reciprocal roles of DBC1 and SIRT1 in
regulating estrogen receptor α activity and co-activator synergy. Nucleic Acids Res 39: 6932–43.
Yu EJ, Kim S-H, Kim MJ, Seo W-Y, Song K, Kang M-S, Yang CK, Stallcup MR, Kim JH. 2013. SUMOylation of
ZFP282 potentiates its positive effect on estrogen signaling in breast tumorigenesis. Oncogene 32:
4160–8.
Zhang Y, Liu T, Meyer C a, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W,
et al. 2008. Model-based analysis of ChIP-Seq (MACS). Genome Biol 9: R137.
Abstract (if available)
Abstract
The remodelling of chromatin and regulated assembly or disassembly of active transcription complexes by glucocorticoid receptor and other DNA-binding transcription factors is mediated and modulated by several hundred transcriptional coregulator proteins. Previous studies focusing on single coregulators demonstrated that each coregulator is required for regulation of only a subset of all the genes regulated by a steroid hormone. I analyze gene-specific patterns of coregulators and find that depleting specific coregulators corresponds to changes in physiological pathways. I find that different coregulators modulate the pathway-specificity of hormone action and thus providing a mechanism for fine tuning of the hormone response. ❧ ChIP-seq, a technique that is used to determine genome-wide protein binding, is used to understand the dynamics of transcription factor binding. Recently, interest in comparing protein binding between different biological conditions has increased, accompanied by an increase in new software for data analysis. Approaches for differential transcription factor binding utilize methods developed for RNA-seq. I apply these new software tools to a variety of data sets from the ENCODE project to illustrate the impact of data processing pipelines under different biological conditions. I present best practice guidelines for transcription factor binding analysis and provide points for discussion between biologists who perform the ChIP-seq experiment and bioinformaticians who analyze the data.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
The role of Hic-5 in glucocorticoid receptor binding to chromatin
PDF
Interaction of Hic-5 with different steroid receptors and its selective coregulator activity
PDF
Integrating high-throughput sequencing data to study gene regulation
PDF
Identification and characterization of cancer-associated enhancers
PDF
Genome-wide characterization of the regulatory relationships of cell type-specific enhancer-gene links
PDF
Do the ZFX and ZFY transcription factors have redundant or unique functions?
PDF
Runx2 interactions with the osteoblast genome
PDF
Differential role of two coactivators, CCAR1 and CARM1, for dysregulated beta-catenin activity in colorectal cancer cell growth and gene expression
PDF
Understanding DNA methylation and nucleosome organization in cancer cells using single molecule sequencing
PDF
The relationship between DNA methylation and transcription factor binding in colon cancer cells
PDF
Differential regulation of monoamine oxidase A and B genes
PDF
Do ZFX and ZNF711 regulate the same genes in HEK293T cells?
PDF
Genome-wide studies reveal the isoform selection of genes
PDF
Functional characterization of colon cancer risk-associated enhancers: connecting risk loci to risk genes
PDF
Computational analysis of genome architecture
PDF
The function of Rpd3 in balancing the replicaton initiation of different genomic regions
PDF
Epigenetic plasticity of cultured female human embryonic stem cells and regulation of gene expression and chromatin by PR-SET7 mediated H4K20me1
PDF
Forkhead transcription factors regulate replication origin firing through dimerization and cell cycle-dependent chromatin binding in S. cerevisiae
PDF
Functional characterization of a prostate cancer risk region
PDF
The development of targeted transcription factor transposition and understanding chromatin dynamics in hypertrophic cardiomyopathy
Asset Metadata
Creator
Wu, Dai-Ying
(author)
Core Title
Using genomics to understand the gene selectivity of steroid hormone receptors
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Genetic, Molecular and Cellular Biology
Publication Date
09/16/2014
Defense Date
07/07/2014
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
ChIP-seq,coregulators,gene regulation,genomics,OAI-PMH Harvest,steroid receptors
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Farnham, Peggy J. (
committee chair
), Hacia, Joseph G. (
committee member
), Siegmund, Kimberly D. (
committee member
), Stallcup, Michael R. (
committee member
)
Creator Email
daiyingw@gmail.com,daiyingw@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-478203
Unique identifier
UC11287665
Identifier
etd-WuDaiYing-2948.pdf (filename),usctheses-c3-478203 (legacy record id)
Legacy Identifier
etd-WuDaiYing-2948.pdf
Dmrecord
478203
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Wu, Dai-Ying
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
ChIP-seq
coregulators
gene regulation
genomics
steroid receptors