Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Quantitative modeling of in vivo transcription factor–DNA binding and beyond
(USC Thesis Other)
Quantitative modeling of in vivo transcription factor–DNA binding and beyond
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Quantitative Modeling of in vivo Transcription
Factor –DNA Binding and Beyond
by
Beibei Xin
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
Doctor of Philosophy
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
August 2018
Copyright 2018 Beibei Xin
i
Acknowledgements
This doctoral thesis would not be possible without many kind people around me.
I would like to express my deep gratitude to my advisor Dr. Remo Rohs for recognizing
my research potential during the admission process over the phone five years ago. Thank him
for being a fantastic mentor, for leading me to the field of fundamental biological sciences, and
for teaching me to think sharply and to be a caring person. He is also the connector and the
visionary for me during my Ph.D. study.
I would like to thank Dr. Remo Rohs, Dr. Michael Waterman, Dr. Paul Thomas, and Dr.
Frank Alber for their generous serving in both my qualifying and defense committee. Thank
them for the guidance and feedbacks on my projects. My grateful thanks are also extended to
Dr. Fengzhu Sun for being my qualifying committee member and for his advices on my career.
Many thanks to excellent Rohs lab members for being supportive and creative as a great
research family. Thank many wonderful friends at USC for making my life colorful and
memorable.
Lastly, thank my parents for always being supportive and encouraging on my decisions. I
would like to thank my husband Leyang Yu. He is the love of my life.
ii
iii
Table of Contents
Acknowledgements .................................................................................................................. i
List of Figures ........................................................................................................................ vii
List of Tables ........................................................................................................................... ix
Abstract ................................................................................................................................... xi
Chapter 1. Introduction ............................................................................................................ 1
1.1 TF –DNA recognition is a crucial step in many biological processes ......................... 1
1.2 Many layers determine in vivo TF –DNA recognition .................................................... 1
1.2.1 In vitro biophysical factors .......................................................................................... 3
1.2.2 In vivo genome-scale factors ...................................................................................... 4
1.3 Experimental and computational approaches to study TF –DNA binding
mechanisms .................................................................................................................... 5
1.3.1 In vitro high-throughput assays .................................................................................. 6
1.3.2 Quantitative modeling from experimental data ........................................................... 7
1.4 Overview of this thesis ................................................................................................... 8
Chapter 2. Relationship between HMs and TF –DNA binding ................................................ 9
2.1 Introduction..................................................................................................................... 9
2.2 Initial work: altered HMs induce ageing by Hoxa9 developmental signals ................ 9
2.3 Results .......................................................................................................................... 10
2.3.1 HM patterns surrounding TF BSs and non-BSs show conserved patterns ............... 10
2.3.2 HM patterns in the BS environment contribute to quantitative predictions of in vivo TF
binding ..................................................................................................................... 16
2.3.3 TF families vary in their preferences for DNA sequence and shape features and HM
patterns .................................................................................................................... 19
2.3.4 Preferences for HM patterns can constrain TF co-occupancy .................................. 23
2.3.5 Larger differences in HM patterns result in a substantial decrease in nucleosome
occupancy ................................................................................................................ 26
2.4 Discussion .................................................................................................................... 28
2.5 Methods ......................................................................................................................... 32
2.5.1 In vivo data collection and motif alignment ............................................................... 32
iv
2.5.2 Background definition ............................................................................................... 32
2.5.3 HM patterns of motif environments ........................................................................... 33
2.5.4 DNA shape features in flanking regions ................................................................... 33
2.5.5 Statistical comparison of HM patterns at bound BSs and non-BSs ........................... 33
2.5.6 MLR scoring scheme ............................................................................................... 34
2.5.7 Leave-one-feature-out L2-regularized MLR models ................................................. 34
2.5.8 Calculating co-occupancy of a TF pair ..................................................................... 35
2.5.9 Nucleosome occupancy ........................................................................................... 35
2.5.10 Software availability ............................................................................................... 35
Chapter 3. Sequence-based CNN models in studying TFs preferring low-affinity BSs .... 36
3.1 Introduction................................................................................................................... 36
3.2 Data for modeling ......................................................................................................... 38
3.3 Methods ......................................................................................................................... 39
3.3.1 Regression models for predicting binding affinity ..................................................... 39
3.3.2 Architecture of the CNN-based reverse complemented weight sharing model ......... 39
3.3.3 Four different approaches to implement convolutional layers ................................... 42
3.3.4 Hyperparameter determination ................................................................................. 43
3.3.5 Five different methods to interpret neural networks .................................................. 44
3.3.6 Visualization of unit-resolution importance matrix for a given sequence ................... 45
3.4 Results .......................................................................................................................... 46
3.4.1 Sequence alignment decrease abundance of low-affinity BSs ................................. 46
3.4.2 Modeling accuracy with L2-MLR and CNN-based models ........................................ 50
3.4.3 RC model predicts same binding affinity for forward strand and RC strand .............. 53
3.4.4 In silico mutagenesis is a robust and sensitive interpretation method ....................... 54
3.4.5 Validation of functional evidence of Exd-Ubx binding at low-affinity BSs .................. 56
3.5 Future work ................................................................................................................... 59
Chapter 4. Structure-based CNN models in studying TF –DNA binding mechanism......... 61
4.1 Introduction................................................................................................................... 61
4.2 Methods ......................................................................................................................... 64
4.2.1 Rebuild a 3D mesh graph for a DNA sequence (with Jinsen Li) ............................... 64
4.2.2 Architecture of CNN-based models .......................................................................... 66
4.2.3 Visualization of unit-resolution importance score ...................................................... 67
4.3 Results .......................................................................................................................... 68
v
4.3.1 Accuracy of rebuilding structures by 3DNA .............................................................. 68
4.3.2 Accurate modeling of binding affinity through structure-based CNN models ............ 68
4.3.3 Minor groove regions are highlighted in Exd-Scr high-affinity BSs............................ 69
4.4 Future work ................................................................................................................... 70
Chapter 5. Utilize 3D genome structure to predict HM patterns (work with Nan Hua in
Alber lab) ................................................................................................................................ 72
5.1 Introduction................................................................................................................... 72
5.2 Methods ......................................................................................................................... 74
5.2.1 Graph embedding framework ................................................................................... 74
5.3 Preliminary results ....................................................................................................... 76
5.3.1 Numeric representations of genomic regions preserve Hi-C contact map ................ 76
5.3.2 HM patterns at regions involved in a loop are similar ............................................... 78
5.4 Future work ................................................................................................................... 79
References ............................................................................................................................. 80
vi
vii
List of Figures
Figure 1–1 Several layers of determinants that affect TF–DNA binding specificity ..................... 2
Figure 2–1 Different HM patterns in young and aged muscle stem cells ...................................10
Figure 2–2 Flowchart describing the approach ..........................................................................12
Figure 2–3 TF families show conserved differences in HM patterns between BSs and non-BSs
..........................................................................................................................................13
Figure 2–4 HM patterns of the BS environment largely contribute to the quantitative prediction of
in vivo TF binding in a protein family-specific manner ........................................................18
Figure 2–5 Deconvolution of DNA sequence and shape features at flanking regions and 10 HM
patterns in the GM12878 cell line .......................................................................................20
Figure 2–6 HM environment can constrain TF co-occupancy in the GM12878 cell line .............24
Figure 2–7 Nucleosome occupancy decreases around BSs compared to non-BSs among TF
families that bind in an HM-specific manner .......................................................................27
Figure 3–1 A demonstration of the architecture of CNN-based reverse complemented weight
sharing model ....................................................................................................................41
Figure 3–2 Four different approaches to implement convolutional layers ..................................43
Figure 3–3 high-affinity and low-affinity BSs of Drosophila embryo ...........................................47
Figure 3–4 Abundance of high-affinity BSs reported in Slattery et al. and low-affinity BSs (Site 1-
3) in 16mer SELEX-seq data for eight Exd-Hox proteins ....................................................49
Figure 3–5 Abundance of low-affinity BSs in aligned DNA sequences ......................................50
Figure 3–6 CNN-based models outperform MLR models and selection of number of filters ......51
Figure 3–7 Comparison of modeling performance achieved by four CNN-based models ..........52
Figure 3–8 Models with reverse complemented weight sharing predict same binding affinity for a
sequence and its reverse complement strand ....................................................................54
Figure 3–9 Robustness of three interpretation methods ............................................................55
viii
Figure 3–10 Unit-resolution importance of WT sequence in svb enhancer ................................57
Figure 3–11 Binding preference changes of mutated svb enhancer for Exd-Ubx ......................58
Figure 3–12 Validation of binding preferences change of Exd-Scr ............................................59
Figure 4–1 Multiple readout mechanisms to achieve DNA-protein binding specificity ................62
Figure 4–2 Visualization of the with A-tract induced bending of a DNA segment of sequence
A5N5A5N5A5N5A5 .................................................................................................................63
Figure 4–3 A schematic figure showing how to rebuild 3D atomic structure of DNA sequences
as modeling input ...............................................................................................................65
Figure 4–4 The architecture of CNN-based models based on 3D structure of DNA sequences 67
Figure 4–5 Minor groove view and major groove views of highlighted voxels at first convolutional
layer ...................................................................................................................................70
Figure 5–1 Workflow of use 3D genome structure to predict HM patterns around genomic
regions ...............................................................................................................................75
Figure 5–2 Application of MF-based graph embedding methods on chromosome 2 Hi-C contact
data in the GM12878 cell line and consequent performance evaluation .............................77
Figure 5–3 Two regions involved in a loop have similar HM patterns ........................................79
ix
List of Tables
Table 3–1 The sample size of SELEX-data for Hox proteins .....................................................39
Table 3–2 Sample size for each Exd-Hox protein after TGAYNNAY alignment .........................49
Table 4–1 RMSD between rebuild_predicted and rebuild_measured of DNA dodecamer .........68
x
xi
Abstract
Gene expression is regulated through transcription factor (TF) binding to specific DNA
target sites in the genome. In vivo TF–DNA binding specificity is achieved by many layers of
determinants, including DNA sequence and DNA shape, cofactor and cooperative binding, DNA
accessibility, chromatin structure, and epigenetic marks. Our initial work found that changes in
histone modification (HM) can affect DNA packaging and TF–DNA binding in aged muscle stem
cells. This work motivated us to systematically study the relationship between HM patterns and
TF binding in vivo. We conducted comprehensive comparisons of HM patterns surrounding
binding sites (BSs) and non-BSs with exactly-matched core motifs for dozens of TFs in three
cell lines. These TFs displayed protein family-specific preferences for HM patterns surrounding
BSs, with high agreement among cell lines. Moreover, compared to models based on DNA
sequence and shape at flanking regions of BSs, HM-augmented quantitative machine-learning
methods resulted in increased performance in a TF family-specific manner. In terms of
computational methods, we evaluated both sequence-based and structure-based convolutional
neural network (CNN)-based models and different network interpretation methods, in modeling
and understanding binding specificity of Exd-Hox proteins. We found that reverse
complemented weight sharing CNN-based models, together with in silico mutagenesis network
interpretation methods, are robust and accurate strategies to model in vivo TF–DNA binding.
Morevoer, structure-based CNN-based models represent DNA sequences as 3D mesh graph,
and provide biophysical insights of TF–DNA binding specificity. Beyond TF–DNA binding, the
spatial organization of the genome also plays an important role in gene regulation. We utilized
3D genome structure to predict genome-wide HM patterns by proposing a graph embedding
method to vectorize genomic regions by preserving Hi-C contacts. Our preliminary result shows
that closed loop regions display similar HM patterns. The ultimate goal is to use the numeric
representation of genomic regions to predict HM patterns.
xii
1
Chapter 1. Introduction
1.1 TF –DNA recognition is a crucial step in many biological processes
Transcription factors (TFs) recognize specific DNA sequences to control chromatin and
transcription, forming a complex system that guides gene expression. They can directly interpret
the genome, performing the first step in decoding the DNA sequence (1). TF–DNA binding is a
crucial step in many biological processes. Many TFs exert control over cell fate decision and
developmental patterning (2) (i.e. Hox proteins in Drosophila (3)). Some TFs can control
biological pathways such as immune responses (4). Even in the laboratory, TFs can drive cell
differentiation (5) and transferring differentiated cells into induced pluripotent stem cells by
redefining four key TFs (6). Moreover, trait-associated genetic variants in TF binding sites (BSs)
disrupt TF–DNA binding, which is considered as key drivers of phenotypic variations (7), such
as heritable blood disorders (8) and an abnormal form of hemoglobin (9). Different TFs might
function together to regulate expression of one gene (10), or a specific TF can regulate different
genes in different cell types (11), indicating that TF–DNA regulatory networks are dynamic even
within the same organism. Understanding the principle underlying in vivo TF–DNA binding is a
daunting yet crucial task in complex organisms.
1.2 Many layers determine in vivo TF –DNA recognition
Unraveling the mechanisms of how TFs achieve DNA binding specificities in vivo is vital
for understanding transcriptional regulation. TFs usually bind to 6 to 20 base pairs DNA
consensus sequences. Intuitively, by scanning the genome with this a consensus sequence, we
would find where TFs bind. However, the relatively short core-binding motifs of TFs can appear
numerous times in a genome, only a very small fraction of these putative BSs is functional (12,
13). TFs can precisely identify their functional BSs from among the other 99.8% of putative BSs
in a cellular environment in vivo (14). Our current understanding on TF–DNA binding specificity
2
provides many in vitro biophysical factors and in vivo genome-wide scale factors (Figure 1–1).
Given the multiple layers that contribute to in vivo binding (15-18), it is clear that DNA sequence
and shape at core BSs, which in vitro experiments have identified as determinants of DNA
binding specificity (15, 19-23), are not sufficient to explain TF binding in vivo. An important
question is how TFs distinguish their functional BSs in one region of the genome from putative
non-BSs with exactly-matched core motifs in other regions in vivo.
Figure 1 –1 Several layers of determinants that affect TF –DNA binding specificity
Multiple factors that may explain this behavior include chromatin accessibility, cooperativity,
epigenetic marks, and sequence context (Figure 1–1) (16, 24, 25). Among these factors,
chromatin inaccessibility can largely explain non-BSs because motifs occupied by histones are
generally not accessible to TFs (26). Base pairs in flanking regions of core BSs can affect TF
binding through their effects on local DNA structure (27, 28). Nucleosome occupancy exerts
additional influence on TF binding (29, 30). Epigenetic marks are cell type-specific signatures
(31) that contribute to cell type-specific protein binding events (32). With these determinants,
how do TFs “read” DNA in in vivo environments? Why certain TFs bind at low-affinity BSs to
achieve their binding specificity? Beyond TF–DNA binding, can we utlize 3D genome strucutre
to understand gene regulation?
CACGTG
in vitro
in vivo
3
1.2.1 In vitro biophysical factors
To explore the biophysical mechanism mediating TF–DNA binding, structural biologists
have investigated nearly 4000 protein–DNA structures through crystallography or nuclear
magnetic resonance (NMR) spectroscopy, aiming to find a protein–DNA recognition code.
These structures have shown substantial evidences where proteins and DNA interact through
unique chemical signatures of DNA bases (base readout) and through a sequence-dependent
DNA shape (shape readout) (27). Base readout is utilized when amino acid side chains of a
protein interact with specific DNA bases (33). Most protein–DNA interactions are mediated by
direct physical interactions (hydrogen bonding or hydrophobic interactions). A/T, T/A, C/G, G/C
base pairs have their own hydrogen bond donors and acceptor patterns in both DNA minor and
major groove (16), and specific binding is mainly obtained via hydrogen bonding between the
protein and the major groove of base pairs.
Although the base readout exists in all TF–DNA complexes, the structure of bound DNA
frequently deviates from the standard B-DNA. These deviations can usually help to achieve
binding specificity. For example, when Exd-Scr heterodimer binds on its consensus sequence,
the arginine or histidine residues inserts to DNA minor groove, enhances negative electrostatic
potential, narrows DNA minor groove, and achieves its binding specificity (34). Such indirect
mechanisms for TF–DNA binding is also known as shape readout, in which the binding relies on
base pairs that create structural changes within the DNA to facilitate recognition. Some DNA
sequences are more bent than others because of complete or partial loss of stacking energies
at a single base pair step. It was shown that DNA segments containing AT-tracts are intrinsically
bent DNA (35). DNA bending and kinks are also parameters that contribute to shape readout
mechanisms. Human papillomavirus (HPV) E2 homodimer, for example, recognizes its
recognition helices the half-sites of its BS through the intrinsic curvature of the central spacer.
Note that most TFs use interplay between the base- and shape-readout modes to recognize
4
their DNA BSs. Moreover, the contribution of each mechanism to TF–DNA binding specificity
might also vary across TF families (16).
To probe the biophysical effect in the minor groove of DNA molecules more directly,
electrostatic potential was derived in a high-throughput manner (36), adding one more layer to
TF–DNA specificity determinants. Moreover, the DNA shape changes due to adding of methyl
group at cytosine of CpG were also predicted through methyl ‐DNAshape (37). Methylation-
induced changes in DNA shape then explained the measured rate of cleavage by DNase I (38)
and provided a possible mechanism for the methylation sensitives of human Pbx-Hox
complexes (39).
1.2.2 In vivo genome-scale factors
With the observation that a specific TF have different BSs in different cell types in vivo,
cell-type specific genomic information should exert a role in determining in vivo TF–DNA
binding. In the past decades, we have seen a widespread use of genome-wide technologies for
studying in vivo TF–DNA binding and transcriptional regulation, including ChIP-chip (40), ChIP-
seq (41), and ChIP-exo (42) for detecting in vivo TFBSs and histone modification (HM) patterns;
bisulfite sequencing for quantifying DNA methylation levels (43); DamID (44), 4C , 5C (45, 46),
and Hi-C techniques (47) for 3D genome organization; DNase-seq (48), FAIRE-seq (49), ATAC-
seq (50) for chromatin accessibility; and MNase-seq (51) for nucleosome occupancy. Different
techniques capture different aspects of genomic information, and combining with TF–DNA
binding data provides insights to understand in vivo transcriptional regulation.
Chromatin accessibility is one of the cell-type genomic signals that quantify accessibility
of nucleosomal DNA. Based on the relationship between TF–DNA binding and chromatin
accessibility, TFs were broken down into three categories: pioneer factors, settlers, and
migrants (16). Pioneer factors can bind at nucleosome-containing DNA and increase DNA
accessibility upon binding. By contrast, settlers bind at open chromatin regions and migrants
5
tend to bind on subset of open chromatin regions. For settlers and migrants, chromatin
inaccessibility can explain why most of the cognate BSs are not bound in vivo.
Histones are octamer protein that form nucleosomes when DNA sequences wrap around
them. Post-translational modifications refer to added chemical groups on certain amino acids of
histones. There are more than 100 possible HMs and numerous combinatorial HM patterns. It
has been shown that many TFs have preferred HM patterns that are consistent across multiple
cell types (52). For example, MYC has been experimentally validated to required primed H3
K4/K79 methylation and H4 acetylation for in vivo binding (53). Nevertheless, the mechanism is
not fully understood yet and it is unclear whether a certain HM patterns are permissive to TF
binding.
DNA methylation is another genome-wide level information that was reported to exert a
role in TF–DNA binding. Although the addition of a single methyl group to the carbon 5 atom of
cytosine leads to only a subtle change in DNA structure, there are genome-wide evidences that
show some proteins, such as Lac repressor, preferring methylated DNA sequences and forming
hydrophobic contacts with the added methyl group (54). By contrast, some proteins, such as
MspI, recognize the CCGG context regardless of the methylation status (54). This context-
dependent preference of methylated DNA was explained by three binding mechanisms: direct
contacts to methyl group (55), competitive binding (56), and structural readout (37, 38).
1.3 Experimental and computational approaches to study TF –DNA binding
mechanisms
Determine a DNA-binding motif is often the first step towards detailed examination of the
TF–DNA binding specificity. In vitro high-throughput experimental assays, motif discovery
methods, and quantitative modeling from experimental data have been improved dramatically
over the last decade. We will review each field separately.
6
1.3.1 In vitro high-throughput assays
There are many high-throughput methods to determine binding affinity (or relative
preference) of TFs to DNA sequences. In general, these methods have different sequence
spaces ranging from 10
6
to 10
15
DNA sequences. Depending on the experimental principle, they
also have different sources of error and detection resolution. We will next review two binding
assays whose data we investigated in later studies.
Protein binding microarrays (PBMs) measure the binding affinity of all possible 𝑘 -mer
sequences, including low-affinity TF–DNA BSs, which are not easily captured by low-throughput
assays (57, 58). Oligomers containing all 𝑘 -mer (𝑘 usually ranges from 8 to 10) was first
anchored in a microarray, followed by synthesis of their reverse complement strand into
anchored double strand DNA probes. Proteins of interest marked with fluorescence will then be
exposed to the pool of prepared DNA probes. Binding affinity of the TF on each probe will be
quantified by the fluorescence intensity signal on each probe. Eventually, nearly 10
6
sequences
of length 32bp and their corresponding binding affinity can be obtained when 𝑘 equals 10.
SELEX-seq/HT-SELEX is a more high-throughput method than PBMs. It starts with a
pool of synthesized DNA oligonucleotides containing a region of 16 random base pairs. This
pool is made double stranded and then sequenced with Illumina sequencing, resulting in a set
of R0 reads. This pool was then exposed to a proteins or a protein complex of interest, followed
by an electrophoretic mobility shift assays (EMSAs) step to separate bound sequences from
unbound sequences. Those bound sequences will be amplified by polymerase chain reaction
(PCR) and result in R1 sequence. This selection step will repeat several times. After rounds of
data collection, a Markov model is constructed to derive the relative binding affinity of each 𝑘 -
mer (𝑘 ranges from 12 to 16) (59-61). Note that the accuracy of SELEX-seq/HT-SELEX
experiment is usually challenged by DNA contamination, saturation of binding, and multimeric
binding.
7
1.3.2 Quantitative modeling from experimental data
Motifs are typically displayed as a sequence logo (62), which represents position weight
matrix (PWM) of relative preferences of a certain TF for each base in the BSs (63). In a PWM,
at each base position, each of the four bases has a score. PWMs quantify relative binding
affinity of a TF to a sequence by multiplying scores for each base of a sequence. PWM-based
methods have been applied to various data types due to their simplicity: from small sets of
known BSs to high-throughput assays. To better accommodate characteristics of high-
throughput experimental data, more complex sequence-based models of TF–DNA binding
specificity have been developed to deal with PBM and SELEX-seq/HT-SELEX data, including
matrixREDUCE (64), BEEML (61), BEESEM (65), DiMO (66), and selexGLM (67). One
limitation of these methods is that they implicitly assume that positions within the TFBS
independently contribute to the binding affinity, an assumption that does not always hold (68,
69).
To overcome this limitation, some computational models used dinucleotide features to
identify nonindependent interactions (70, 71) between nucleotides. Moreover, since interactions
between adjacent base pairs are dominated by base stacking (72), which can be approximately
captured by inter-base pair hydrogen bonds in the major groove, DNA shape features have
been proven to be more efficient and biologically meaningful features than DNA sequence alone
(21, 23). Over the last a couple of years, with the widespread successful application of deep
learning techniques to biological problems, convolutional neural networks (CNNs) have
achieved state-of-the-art performance in modeling TF–DNA binding affinity with data from
different experimental assays (73). Besides the ability to consider non-linear and dependent
relationship between nucleotides, CNN-based models are also computationally scalable and
alignment-free.
8
1.4 Overview of this thesis
This thesis will first cover some published work on the relationship between HMs and
TF–DNA binding. It will then include two ongoing projects on developing computational models
to study in vivo binding specificity: sequence-based CNN models and structure-based CNN
models. In the last chapter, it will introduce the idea of utilizing 3D genomic interactions to
predict genome-wide HM patterns, both of which are two critical in vivo genome-wide factors
that affect TF–DNA binding.
9
Chapter 2. Relationship between HMs and TF –DNA binding
2.1 Introduction
Recent work has introduced approaches for the quantitative modeling of relationships
between TF binding and HM patterns (74, 75). Despite the reported relationship between TF
binding and HM patterns, mechanisms that cause this relationship are still unknown. Moreover,
it is unclear whether this relationship varies between TF families, or if it can reveal mechanisms
of TF binding on a protein family-specific basis. HMs are small changes at nucleosome surfaces
that can significantly affect the chromatin tertiary structure and compaction (76, 77). From a
structural perspective, one may ask whether HM patterns are conserved around the in vivo BSs
of TFs, and whether this relationship varies among protein families. However, genomic data will
need to be analyzed to answer these questions, given the paucity of structural information about
proteins bound to nucleosomes with HM marks. Although HM patterns in the BS environment
are known to contribute to TF binding, this relationship is not yet understood from a mechanistic
perspective.
2.2 Initial work: altered HMs induce ageing by Hoxa9 developmental signals
Epigenetic studies suggested that posttranscriptional HMs play a central role in
transcriptional regulation and revealed substantial overlaps between TF BSs and HM marks
(78). In our recent collaboration work on how epigenetic stress responses induce muscle stem-
cell ageing by Hoxa9 developmental signals, we found that aging muscle stem cells exhibit
epigenetic alterations that affect DNA packaging and modify the transcriptional regulation.
Hoxa9, a TF of the homeodomain (HD) family, is aberrantly expressed, and in turn promotes
abnormal expression of downstream signaling pathways that drive the cell’s regenerative
10
decline (Figure 2–1). This work motivated us to systematically study the relationship between
HM patterns and TF binding in vivo.
Figure 2 –1 Different HM patterns in young and aged muscle stem cells
The upper panel shows that in quiescent young-adult muscle stem cells, high levels of activating
HMs and low levels of repressive marks are present in nucleosomes, leading to a permissive
state. However, when injury happens, stem cells become activated and chromatin state become
repressive. The bottom panel shows that aged muscle stem cells are in repressive state when in
quiescent, and become permissive when injury happened, thus allowing TFs to bind on the
open chromatin regions (i.e. TFs bind on the upstream region Hoxa9 gene). This figure is
adapted from Eliazer and Brack 2016.
2.3 Results
2.3.1 HM patterns surrounding TF BSs and non-BSs show conserved patterns
To study the different HM patterns surrounding TF BSs and non-BSs and the
conservation of these patterns, we downloaded data from the ENCODE Consortium
(http://genome.ucsc.edu/ENCODE/downloads.html) for three human cell lines: B-lymphoblastoid
cells (GM12878), erythrocytic leukemia cells (K562), and embryonic stem cells (H1-hESC). We
collected genome-wide TF binding profiles for 44 TFs in GM12878, 43 TFs in K562, and 24 TFs
in H1-hESC cells, and considered 10 HMs at histone tails (H3K4me2, H3K27ac, H3K4me1,
H3K4me3, H3K79me2, H3K9ac, H3K9me3, H4K20me1, H3K27me3, and H3K36me3). These
TF binding profiles and HM profiles were generated by ChIP-seq assays (79).
11
To make reasonable comparisons of HM patterns around TF BSs and non-BSs, our
experiments focused on BSs and non-BSs selected from regions that had similar levels of
chromatin accessibility and exactly-matched core motifs (see Methods). We performed motif
discovery on ChIP-seq peaks by using FIMO (80), obtaining a set of BSs for each TF and
selected non-BSs with the following assumptions. First, BSs and non-BSs in the human genome
were assumed to be located in regions with different levels of chromatin accessibility (14). To
exclude this effect for a valid control, non-BSs were selected to have distributions of chromatin
accessibility that were similar to those of the BSs of a given TF. Chromatin-accessible regions
were obtained with the DNase-seq technique (48, 81). Second, to avoid the effect of primary
sequence preference at core motifs, non-BSs were chosen based on them having exactly-
matched core motifs with the BSs. Third, selected non-BSs were located at distinct genomic
locations and had sample sizes that were similar to those of the BSs for modeling consistency.
A flowchart describing the analysis is shown in Figure 2–2. This experimental setup for defining
BSs and non-BSs enabled us to focus directly on HM pattern differences between BSs and non-
BSs.
12
Figure 2 –2 Flowchart describing the approach
In each cell line, chromatin-accessible regions derived from DNase-seq data were genomic
regions of interest. For each TF, sequences at ChIP-seq peaks were first aligned using PFMs to
obtain BSs. For each BS, an exactly-matched motif was found from chromatin-accessible and
distinct genomic regions as a motif pool. Within the motif pool, motif sets with similar chromatin
accessibility distributions as the BSs were selected as non-BSs. With a set of BSs and non-BSs,
DNA sequence and four DNA shape features, as well as 10 HM patterns, were calculated for
flanking regions and fed to downstream modeling to distinguish BSs and non-BSs.
We removed TFs that had fewer than 132 genomic binding locations or binding motifs at
the ChIP-seq peaks that could be aligned. As a result, 33, 37, and 18 TFs remained for further
analysis in the GM12878, K562, and H1-hESC cell lines, respectively. With these data, we were
first interested in the HM pattern differences around aligned BSs and non-BSs for each TF in
the GM12878 cell line. We examined HM patterns at single-base pair resolution within regions
of 1 kb upstream and downstream of the BSs and non-BSs, and then calculated the average
13
HM patterns. For these considered TFs, the average HM levels of H3K27me3, H3K36me3, and
H4K20me1 were reduced by 21%, 9.6%, and 17%, respectively, in the environment of BSs
compared to non-BSs. Average HM levels of H3K4me2, H3K27ac, H3K4me3, H3K79me2, and
H3K9ac were elevated by 18%, 52%, 19%, 8%, and 31%, respectively, for regions containing
BSs compared to non-BSs (Figure 2–3). These substantial differences in HM patterns for TFs
could not be detected when we randomly shuffled the labels between BSs and non-BSs.
Figure 2 –3 TF families show conserved differences in HM patterns between BSs and non-
BSs
(A) Heat map displaying results of statistical comparison between HM levels at positions 1-kb
upstream and downstream of BSs and non-BSs in the GM12878 cell line. Positive Δ(–log(q-
value)), in red, indicates BS environments with significantly higher HM levels compared to non-
BS environments. Negative Δ(–log(q-value)), in blue, indicates BS environments with lower HM
levels. (B) Average HM differences across TF families in the GM12878 cell line. Centerlines of
box plots represent medians, edges indicate the first and third quartiles, and whiskers indicate
minimum/maximum values within 1.5 times the interquartile from the box. This setup for
displaying box plots is consistent in subsequent box plots. (C) Average H3K4me3, H3K9ac, and
14
H3K27me3 levels at each position 1-kb upstream and downstream of BSs and non-BSs for
MYC (bHLH family). Black edges encompassing the average line represent standard error bars
at each nucleotide position.
We evaluated the statistical significance of HM pattern differences between BSs and
non-BSs and clustered TFs based on Pfam binding domains (82). Evolutionarily related TFs
displayed similarly substantial HM pattern differences. Specifically, TFs from the ETS (3 of 5)
and bHLH (7 of 7) families showed larger HM pattern differences between BSs and non-BSs,
whereas TFs from the bZIP (3 of 4), HD (2 of 2), and C2H2 (4 of 6) families displayed smaller
HM pattern differences in the GM12878 cell line (Figure 2–3). This observation indicates that
TFs with evolutionarily related DNA binding domains sample putative BSs with similar HM
pattern environments.
To investigate the conservation of HM pattern preferences for TFs from various protein
families, we applied a similar analysis to the K562 and H1-hESC cell lines. We found that HM
preferences were conserved for the MADS-domain, ETS, bHLH, bZIP, and C2H2 families. In
addition, the GATA and STAT families displayed HM pattern differences that were similar to
preferences of the bHLH family in the K562 cell line. Within TF families, the bZIP and C2H2
families showed more diverse HM pattern differences than the MADS-domain, ETS, and bHLH
families (Figure 2–3). This observation is in agreement with the fact that some TFs from the
C2H2 family, despite having conserved secondary structures of zinc fingers and linkers, still
have dynamic linker structures and diverse conformations in their unbound state prior to DNA
binding. These TFs also have diverse position weight matrices and require binding to correct
DNA sequences to adopt a stable protein structure (83).
These protein family-specific and cell type-consistent HM patterns in the environment of in vivo
BSs seemed to reveal particular differences when only small percentages of BSs overlapped
among different cell lines. For TFs appearing in the GM12878 cell line, the average percentage
15
of overlapping BSs between any two cell lines was 30% (ranging from 2–84%), and the overlap
among the three cell lines was 19% (ranging from 0.4–70%). These observations indicate that
fewer than half of the BSs were shared among different cell lines. We further analyzed c-Myc
(MYC) BSs in the GM12878 and K562 cell lines and partitioned the entire set of BSs into three
subsets: group 1 included 852 BSs in the GM12878 but not the K562 cell line, group 2 included
310 BSs shared in both cell lines, and group 3 included 4872 BSs in the K562 but not the
GM12878 cell line.
As high H3 K4/K79 methylation and H3 acetylation levels are prerequisites for MYC
binding in vivo (53), we examined the distribution of H3K4me3, H3K79me2, and H3K9ac
patterns among these three groups of BSs. Two of the three HM marks surrounding group 1
BSs had high levels in the GM12878 cell line and low levels in the K562 cell line (one-sided
paired t-test P-values: 6.2 10
-7
for H3K4me3 and 1.6 10
-20
for H3K9ac). By contrast, these
HM levels surrounding group 3 BSs were higher in the K562 than in the GM12878 cell line (one-
sided paired t-test P-values: 1.1 10
-10
for H3K4me3, 1.2 10
-38
for H3K79me2, and 3.9 10
-7
for H3K9ac). Despite poor conservation of the BSs in terms of their genomic location, most of
the considered TFs displayed conserved HM patterns among different cell lines.
To exemplify how HM patterns are distributed in motif environments of TF BSs, we
displayed average HM patterns of H3K4me3, H3K9ac, and H3K27me3 at each position 1-kb
upstream and downstream of BSs and non-BSs for MYC, a TF from the bHLH family, in the
GM12878 cell line (Figure 2–3C). H3K4me3 and H3K4ac are crucial for in vivo MYC binding.
Consistent with this fact, the average H3K4me3 and H3K9ac levels in the motif environment of
BSs were higher compared to non-BS regions. As a control, average H3K27me3 levels were
similar in environments of BSs and non-BSs, and our observations in the K562 and H1-hESC
cell lines were consistent. To compare HM patterns for TF families with fewer differences
between BSs and non-BSs (Figure 2–3A), we plotted average HM patterns for H3K4me3,
16
H3K9ac, and H3K27me3 for CEBPB from the bZIP family. As we had expected, these three
HMs showed similar patterns surrounding BSs and non-BSs of CEBPB.
Upon binding DNA in closed chromatin regions (irrespective of the presence of
nucleosomes), pioneer factors recruit chromatin remodelers and histone-modifying enzymes,
disrupt the chromatin structure, and reprogram epigenetic marks (84-86). The TFs considered in
this study included pioneer factors from different TF families, including GATA2 from the GATA
family (87), NFYB (88) from the NFY family, SPI1 from the ETS family (89), and RFX5 from the
family of RFX-related factors (90). Moreover, pioneer factors co-localize with other TFs in a cell
line-specific manner. For example, NFY extensively co-localizes only with USF1 and FOS at
inactive chromatin domains in the K562 cell line (91). We observed that these pioneer factors,
except for those from the GATA family, and co-localized TFs showed similar HM patterns in
environments of BSs compared to non-BSs (Figure 2–3).
2.3.2 HM patterns in the BS environment contribute to quantitative predictions of
in vivo TF binding
Our qualitative analysis of HM patterns between BSs and non-BSs of TFs revealed
similar and conserved differences for various protein families. Therefore, we were interested in
whether the HM patterns contribute quantitatively to the discrimination of BSs vs. non-BSs for
TFs, and whether those contributions are also protein family-dependent. We previously showed
that DNA sequence information and four DNA shape features (i.e., minor groove width [MGW],
propeller twist [ProT], Roll, and helix twist [HelT]) at flanking regions contribute to the
quantitative modeling of TF binding specificities both in vitro and in vivo (20, 92). Therefore, we
built L2-regularized multiple linear regression (MLR) models, incorporating various combinations
of DNA sequence and shape features at 10-bp flanking regions and 10 average HM patterns in
an environment of 1-kb upstream and downstream of motifs, to classify previously defined BSs
and non-BSs.
17
BSs and non-BSs were described as feature vectors containing distinct sets of features
(i.e., DNA sequence, DNA shape, and HM levels) at each nucleotide position. DNA sequence
features are binary categorical attributes characterizing the chemical identity of base pairs. This
information encodes hydrogen bonds and other direct contacts between amino acids and base
pairs in predominantly the major groove (27). DNA shape features are continuous attributes
capturing DNA shape properties and electrostatic interactions in predominantly the minor
groove (33, 93). HM levels are continuous attributes that describe surrounding epigenetic marks
that may be sensitive to (94) and can be primed for the binding of specific TFs (95). These three
types of feature categories represent different mechanisms of in vivo TF binding specificities.
After collecting binding data and encoding features for each BS sequence, we implemented two
different models: sequence+shape models, using a combination of DNA sequence and shape
features; and sequence+shape+HM models, using a combination of DNA sequence, DNA
shape, and HM pattern features. To examine how these models perform quantitatively as a
function of the lengths of flanking regions used in calculating HM patterns, we tried different
length scales ranging from 10 to 2,000 bp. Similarly, we tried flanking region lengths of 5, 10,
and 15 bp for calculating DNA sequence and shape features. We used the area under the
precision and recall curve (AUPRC) to evaluate the performance of the models.
Sequence+shape+HM models achieved average AUPRCs of 0.73, 0.74, and 0.75 for
TFs considered in the GM12878, K562, and H1-hESC cell lines. Adding HM patterns increased
the performance of discriminating BSs from non-BSs (Figure 2–4A-C). For example, because
certain HM patterns are prerequisites for MYC binding, adding HM patterns to sequence+shape
models yielded a 14.0% increase in AUPRC (from 0.71 to 0.81) in the GM12878 cell line.
Moreover, performances of sequence+shape+HM models did not show strong length-scale
dependencies in calculating HM patterns or in calculating DNA sequence and shape features.
18
Figure 2 –4 HM patterns of the BS environment largely contribute to the quantitative
prediction of in vivo TF binding in a protein family-specific manner
L2-regularized MLR models were implemented to distinguish BSs and non-BSs for TFs from
different families. AUPRC was used to measure performance of different models. Comparisons
of models are shown between sequence+shape and sequence+shape+HM features in the (A)
GM12878, (B) K562, and (C) H1-hESC cell lines. Extents of performance gain in HM-
augmented models are protein family-specific in the (D) GM12878, (E) K562, and (F) H1-hESC
cell lines.
The extent to which the inclusion of HM patterns in the models improved the prediction
accuracy of TF binding specificities was protein family-specific. With consistent and substantially
different HM patterns (Figure 2–4A), TFs from the bHLH family had median performance boosts
of 13.2%, 3.3%, and 9.2% when using HM-augmented models in the three cell lines. By
contrast, TFs from the C2H2 family benefited comparatively less, with median performance
19
improvements of 2.9%, 2.2%, and 6.5% in the three cell lines. Moreover, the performance
improvements were distributed over a wider range (Figure 2–4D-F). TFs from the MADS-
domain family in the GM12878 cell line and the GATA family in the K562 cell line also showed a
substantial performance boost when HM-augmented models were used.
2.3.3 TF families vary in their preferences for DNA sequence and shape features
and HM patterns
Determinants affecting in vivo TF binding are highly correlated and have not yet been
deconvolved. For example, the GC content of a BS region can affect nucleosome positioning
(96). DNA shape features are derived from nucleotide sequences (21), which are closely related
to HM patterns (74, 94, 97). With HM-augmented models that can increase the modeling
accuracy of DNA binding specificities across TF families, we further separated the contributions
from DNA sequence and shape at flanking regions and the contribution from HM patterns.
20
Figure 2 –5 Deconvolution of DNA sequence and shape features at flanking regions and
10 HM patterns in the GM12878 cell line
(A) Scatter plot showing performance gain through adding different sets of features. The x-axis
represents HM pattern-only models as baseline, and recorded performance increases through
adding DNA sequence and shape features at flanking regions. The y-axis represents models
based on DNA sequence and shape features at flanking regions as baseline, and recorded
performance increases through adding HM pattern features. Gray dashed lines intersect with x-
axis at 15% and with y-axis at 5%. The Pearson correlation coefficient (PCC) was calculated
between AUPRC gain through adding these two sets of features. (B) Heat map displaying
performance gains when adding either sequence+shape features or HM patterns. With cutoffs
as shown by the gray dashed line in (A), TFs were grouped into sequence+shape-specific, HM-
specific, and a group with other features preferred. (C) Pie charts showing the number of TFs
with different binding mechanisms in the MADS-domain, bHLH, bZIP, and C2H2 TF families. (D)
Heat map representing the percentage decrease of AUPRC in leave-one-feature-out
experiments compared to complete models considering DNA sequence and shape features,
and 10 HM features. A more intense red color in a cell indicates a greater performance
decrease when leaving out the feature displayed in x-axis for the TF displayed in the y-axis.
Our previous work was aimed at deconvolving determinants of in vitro TF binding (22).
Here, we applied a similar strategy to investigate the importance of DNA sequence and shape
features at flanking regions that could not be explained by HM features, or vice versa.
Specifically, we evaluated the importance of individual sequence+shape or HM features by
using performance increases relative to sequence+shape or HM-augmented models,
respectively. For most of the TFs considered, contributions of sequence+shape and HM
features had a strong negative relationship (Figure 2–5A), indicating a more general
phenomenon that the degree to which flanking regions contribute to TF binding can be
attenuated by the chromatin context (28). In general, if DNA flanking regions are more likely
occupied by nucleosomes, then the HM patterns contribute more than DNA sequence and
shape features to TF binding specificity predictions.
21
We further observed that TFs from the GATA and MADS-domain families were
distributed in the upper left quadrant of the plot in Figure 2–5A, with HM patterns showing larger
contributions than DNA sequence and shape features to TF binding specificity predictions. On
the other hand, TFs from the bZIP and C2H2 families were more broadly distributed in the
scatterplot. When we selected an AUPRC increase of 5% or 15% upon adding HMs and
sequence+shape features (gray dashed lines in Figure 2–5A), respectively, to be the “feature
importance” threshold (i.e., for a set of features to be considered important for TF binding), the
TFs separated into three groups. The TFs in upper left quadrant of the scatterplot in Figure 2–
3A were termed “HM specific”, those in the bottom right quadrant were termed
“sequence+shape specific”, and those in the bottom left quadrant were regarded as “other”
(Figure 2–5B). For example, all but one (USF1) of the five TFs in the bHLH family were
observed to bind DNA in an HM-specific manner (Figure 2–5C). Comparatively, for TFs in the
bZIP and C2H2 families, HM features did not always represent important features. We noticed,
however, that binding mechanisms were protein family-specific and consistent in the three cell
types. Specifically, TFs from the C2H2 and bZIP families were found to bind in both a DNA
sequence+shape-specific and an HM-specific manner, whereas most of the TFs from the bHLH,
GATA, and MADS-domain families tended to bind in an HM-specific manner (Figure 2–5C).
TFs from the bHLH family tended to bind mostly in an HM-specific manner. They
exhibited consistent and increased H3K4me3, H3K79me2, and H3K9ac patterns in the motif
environment of their in vivo BSs (Figure 2–3A). These TFs might require primed HM patterns to
achieve DNA binding specificity (53, 95). USF1 and USF2, as exceptions, were found to bind in
a sequence+shape-specific manner in the K562 cell line. On one hand, when accounting for the
preferences of flanking base pairs, in vivo BSs for TFs of the bHLH family showed increased in
vitro binding signals (20). Differences in the one or two proximal base pairs directly flanking the
E-box at promoter regions appeared to alter the transcriptional rates (98, 99). On the other
22
hand, the lesser importance of HM patterns for USF1 binding can be explained by its frequent
co-binding with pioneer factors.
TFs from the GATA family are pioneer factors. As such, confirming the binding
preference of these TFs might require the analysis of time-resolved HM pattern changes
surrounding the BSs. This possibility is because the large HM pattern differences between BSs
and non-BSs might be due to HM pattern changes upon TF binding (85, 100).
TFs from the STAT family employed both sequence+shape and HM features at flanking
regions to achieve DNA binding specificity. STAT1, for instance, has independently derived BSs
that associate strongly with regions of H3K4me1 and H3K4me3 histone marks in HeLa cells
(101). Examining three flanking positions upstream and two flanking positions downstream, we
found that STAT1, STAT5, and STAT6 revealed preferences for certain base pairs in the
flanking regions (102). The C2H2 family, which binds DNA using the most promiscuous
mechanisms, showed various HM pattern preferences in regions surrounding BSs (Figure 2–
3A). For TFs exhibiting large differences in HM patterns, HM changes in the motif environment
of BSs might be due to initial interactions with histone-modifying enzymes, followed by changes
in HM patterns. For instance, YY1 interacts with histone acetyltransferase p300 (103), CREB-
binding protein (CBP) (104), and histone deacetylase 1 (HDAC1), HDAC2, and HDAC3 (105).
Such HM pattern changes can be explained by DNA variants that are highly related to
alterations in TF binding intensities (94). These TFs may approach their BSs through initially
sampling DNA sequence and shape (24), followed by causing or stabilizing HM pattern changes
around the BSs.
We also investigated the importance of individual features that cannot be explained by other
features in predicting binding specificities for each TF. Starting from the HM-augmented model,
we individually removed sequence or shape features or one of the 10 HM patterns, built a series
of leave-one-feature-out L2-regularized MLR models, and recorded AUPRC decreases to
23
evaluate model performance (Figure 2–5D). For most TFs of the bHLH family, flanking regions
around BSs contributed to TF binding specificity more substantially through their local DNA
structure than through DNA sequence, as previously reported (20).
Among these DNA shape features, ProT was important for the bHLH family in all three
cell lines (92). H3K4me2, as an activating mark, was significantly different in regions
surrounding BSs of TFs from the bHLH and ETS families. The same histone mark, however,
showed lesser importance in the three cell types, implying co-occurrence with other activating
marks such as H3K4me3 or H3K27ac (106, 107). Comparatively, H3K4me1 carries unique
information and distinguishes active from poised enhancers when combined with H3K27ac and
H3K27me3, respectively (108-110). For TFs that exhibit HM patterns as an important feature
category, H3K4me1 usually indicates a substantial contribution.
2.3.4 Preferences for HM patterns can constrain TF co-occupancy
Our data showed that HM pattern differences were degenerate characteristics in defining
TF binding specificities (Figure 2–3A). An intriguing question was how HM pattern preferences
of TFs correlate with the co-binding of TFs, given previous observations that TFs tend to bind
DNA in a cooperative manner (111).
To investigate this relationship, we calculated the co-occupancy of all possible TF pairs
by measuring the percentage of proximal BSs between these pairs and examined the
distribution of H3K4me3 patterns surrounding their BSs. Presumably, intra-family TF pairs or
pairs from protein families that employ similar binding mechanisms (DNA sequence+shape or
HM specific manner) will prefer similar HM patterns surrounding their BSs. For these TF pairs,
we would expect a tendency to co-localize and to exhibit more distinct TF co-occupancies
compared to other characteristics. We found that TF pairs with similar H3K4me3 patterns
around their BSs had a larger percentage of proximal BSs, and that TF pairs with substantial
24
H3K4me3 pattern differences tended to avoid binding closely to each other in the GM12878 cell
line (Figure 2–6A).
Figure 2 –6 HM environment can constrain TF co-occupancy in the GM12878 cell line
TFs from the same protein family and TF families with a similarly favorable HM environment (or
binding manner) tend to co-localize in the genome. (A) Box plots of percentages of BSs of a TF
that are in close proximity (within 300 bp) to BSs of each of the other TFs, versus average
differences of H3K4me3 surrounding BSs between these two TFs. (B, left) H3K4me3 level
surrounding BSs shared (black) by MYC (bHLH family) and CEBPB (bZIP family), MYC-only
(blue), and CEBPB-only (purple). (B, right) Box plots representing the distribution of H3K4me3
levels surrounding BSs shared by MYC and CEBPB (black), MYC-only (blue), and CEBPB-only
(purple). (C, left) H3K4me3 level surrounding BSs shared (black) by MEF2A and MEF2C (both
from the MADS-domain family), MEF2A-only (purple), and MEF2C-only (blue). (C, right) Box
plots representing the distribution of H3K4me3 levels surrounding BSs shared by MEF2A and
MEF2C (black), MEF2A-only (purple), and MEF2C-only (blue). (D) Box plots displaying the
distribution of percentages of proximal BSs among intra-family TF pairs and inter-family TF pairs
25
for each protein family. One-sided Wilcoxon test P-values show that intra-family TF pairs have
significantly higher percentages of proximal BSs compared to inter-family TF pairs.
Here, we provide three examples of TF pairs to support this hypothesis. The first pair
was MYC and CEBPB from the bHLH and bZIP families, respectively. In the GM12878 cell line,
MYC showed HM-specific binding, whereas CEBPB preferred distinct DNA sequence and
shape features (Figure 2–5B). As a result, only about 1% of the MYC BSs were proximal to
CEBPB BSs, and CEBPB only shared about 2% of its BSs with MYC. Furthermore, MYC-only
BSs had higher H3K4me3 levels compared to CEBPB-only BSs (Figure 2–6B). These
observations were consistent with results in the K562 and H1-hESC cell lines.
The second pair was MEF2A and MEF2C, both from the MADS-domain family. Both of
these TFs bound DNA in an HM-specific manner. Even when we excluded overlaps of their
BSs, about 23% of MEF2A BSs were still proximal to MEF2C BSs, and about 29% of MEF2C
BSs were shared with MEF2A. Distributions of H3K4me3 levels surrounding MEF2A-only and
MEF2C-only BSs were quite similar to levels surrounding MEF2A- and MEF2C-shared BSs
(Figure 2–6C).
The third example was of TF pairs from different TF families, wherein both TFs showed
sequence+shape-specific DNA binding and similar H3K4me3 patterns surrounding their BSs.
These pairs were NRF1 and EGR1 in the GM12878 cell line, NFYB and USF1 in the K562 cell
line, and USF1 and JUND in the H1-hESC cell. This observation indicates that TFs with similar
preferences for HM patterns and similar binding mechanisms tend to co-occupy. Interestingly,
the third TF pair had very different sequence preferences.
In agreement with these examples of TF pairs, we observed that intra-family TF pairs
tended to have a significantly higher number of proximal BSs compared to inter-family TF pairs
for protein families having consistent and different HM patterns around their in vivo BSs, such
as the bHLH, MADS-domain, GATA, STAT, and ETS families (Figure 2–6D). In conclusion, our
26
results suggest a close dependency between the HM pattern preferences of TFs and the
tendency of TFs to occupy proximal BSs in vivo.
2.3.5 Larger differences in HM patterns result in a substantial decrease in
nucleosome occupancy
HMs can change DNA accessibility and nucleosome stability, either directly by adding
methyl or acetyl groups to histone tails (76, 97) or indirectly by recruiting specific proteins (e.g.,
with chromodomains at histone tails with methylation and bromodomains at histone tails with
acetylation) (112, 113). Beyond these processes, HMs can affect nucleosome positioning (114).
Therefore, we hypothesized that HM patterns are closely related to in vivo TF binding
specificities for certain protein families through their effects on nucleosome positioning.
27
Figure 2 –7 Nucleosome occupancy decreases around BSs compared to non-BSs among
TF families that bind in an HM-specific manner
Average nucleosome occupancy in each position 1-kb upstream and downstream of BSs and
non-BSs for the (A) bHLH and (B) C2H2 families in the GM12878 cell line. Black edges
encompassing the average line represent standard error bars at each nucleotide position. (C)
Density plots showing distributions of chromatin accessibility and nucleosome occupancy
around BSs and non-BSs for MYC in the bHLH family and (D) CTCF in the C2H2 family. Two-
sided Wilcoxon tests were conducted to test if these distributions had shifts. Only distributions of
nucleosome occupancy for MYC BSs exhibited significant shifts, as indicated by the P-value.
To test this hypothesis, we investigated the nucleosome positioning profiles surrounding
BSs and non-BSs of TFs that bind in a DNA sequence+shape-specific and HM-specific manner.
To derive nucleosome positioning profiles, we collected genome-wide MNase-seq data for the
GM12878 and K562 cell lines from the ENCODE project. We then derived the nucleosome
occupancy at each base pair in 1-kb regions upstream and downstream of known target sites.
TFs from families with consistent HM patterns across cell lines and substantially different HM
patterns between BSs and non-BSs (e.g., bHLH, ETS, GATA, and MADS-domain) exhibited
substantial decreases in average nucleosome occupancy around their BSs (Figure 2–7A).
These results indicated a competition between histones and TFs for target sites. By contrast,
the extent of the average decrease in nucleosome occupancy in regions surrounding BSs was
more diverse for TFs from the C2H2 family (Figure 2–7B). For example, BSs of MYC exhibited
lesser nucleosome occupancy than non-BSs (Figure 2–7C), whereas the nucleosome
occupancy distributions of BSs and non-BSs of CTCF were similar (Figure 2–7D). BSs of other
TFs in the bHLH family displayed more substantially decreased nucleosome occupancy than
BSs of other TFs in the C2H2 family. In our experimental setup, BSs and non-BSs of each TF
had similar distributions of chromatin accessibility.
Considering that nucleosome occupancy might be a more direct factor affecting TF
binding than HM patterns, we implemented additional L2-regularized MLR models using a
28
combination of DNA sequence, shape, and nucleosome occupancy features to classify BSs and
non-BSs. With consistently increased performance compared to sequence+shape models
across cell lines, nucleosome occupancy (nuc) can contribute to the distinction of in vivo BSs
and non-BSs. As HM patterns contain information for both nucleosome positioning and HM
levels, sequence+shape+HM models generally outperformed sequence+shape+nuc models
because nucleosome occupancy was represented by average occupancy levels and contained
more degenerative information. Interestingly, sequence+shape+nuc models were more
sensitive to flanking length than were sequence+shape+HM models, indicating that nucleosome
occupancy is more accurate in describing local chromatin structure as it is relevant to TF
binding. These observations indicate that HM-selective TFs require increased nucleosome
positioning flexibility compared to TFs that bind in a DNA sequence+shape-dependent manner.
2.4 Discussion
DNA sequence and shape preferences at flanking regions of core motifs play important
roles in achieving binding specificity for TFs from certain protein families, both in vitro (20, 115)
and in vivo (92). However, these DNA properties alone are insufficient to explain the different in
vivo TF binding preferences observed across distinct cell types (52, 78, 116). Considering cell
type-specific properties, previous studies described a general relationship between TF binding
and HM patterns (53, 94, 95, 117). These studies reported increased performances in the
sequence-based modeling of in vivo TF binding events when such models incorporated HMs
(74, 118). In this study, we described qualitatively and quantitatively the relationship between TF
binding and HM patterns in a protein family-specific manner across different cell lines. We
revealed that TFs from certain TF families displayed conserved HM pattern preferences
between BSs and non-BSs.
To investigate preferences in HM patterns surrounding BSs compared to non-BSs, we analyzed
29
comprehensive ENCODE data (78) and examined in vivo ChIP-seq data for 33, 37, and 18 TFs
in the GM12878, K562, and H1-hESC cell lines, respectively. The studied TFs covered eight
protein families, including the C2H2, MADS-domain, bHLH, bZIP, HD, STAT, GATA, and ETS
families. Among the feature categories in flanking regions of core motifs that influence in vivo TF
binding, two important determinants are chromatin accessibility (26) and DNA sequence context
(28). Closed chromatin is inaccessible to most TFs, whereas open chromatin provides
accessible regions for TF binding that are generally transcriptionally active (119, 120).
Considering these mechanisms, we restricted our dataset of non-BSs to exactly matched core
motifs with a similar distribution of chromatin accessibility in environments of BSs. Then, we
examined the differences in HM patterns between regions surrounding BSs versus non-BSs.
In regions surrounding BSs, preferences for HM patterns were protein family-specific.
TFs from the bHLH, GATA, and MADS-domain families displayed, across cell lines, decreased
levels of repressive histone marks, such as H3K27me3 modification, and increased levels of
active histone marks, such as H3K4me1, H3K27ac, H3K4me3, and H3K79me3 marks (121).
The C2H2, bZIP, and HD families, however, showed more divergent HM patterns surrounding
their BSs. As one of the largest TF families in eukaryotes, the C2H2 family exhibited varying
preferences, likely due to their much less conserved three-dimensional protein folds compared
to other protein families (83). Besides the 10 HM patterns considered, our work can be further
extended by adding other histone marks on linker histones, which closely relate to chromatin
structure and TF binding (122), as well as additional HM marks in core histones.
Recent studies reported a quantitative relationship between in vivo TF binding specificities and
HM patterns surrounding BSs near DNA regulatory elements (74, 118). Given our qualitative
observation that HM pattern preferences surrounding BSs are protein family-specific, an
obvious question is whether HM patterns at TF BSs can add another layer to modeling genome-
wide in vivo TF binding quantitatively in a protein family-specific manner. Using DNA sequence
30
and shape profiles as feature categories in our baseline models, HM-augmented models for
predicting TF binding resulted in larger performance increases for TFs from protein families that
had substantial HM pattern differences between BSs and non-BSs, such as members of the
bHLH, MADS-domain, and GATA families. This result indicates that HM patterns may contribute
to the binding of TFs from the same protein family to their putative BSs. Furthermore, other
mechanisms, such as cofactors, cooperativity, and higher-order chromatin structure, could
further increase the quantitative modeling of TF binding in vivo.
We previously suggested that specific TF families use different contributions of DNA
sequence and shape at flanking regions to achieve binding specificities in vivo and in vitro (92).
Furthermore, it is well established that eukaryotic transcriptional regulation requires many layers
of binding specificity determinants (123). Thus, here we disentangled the contributions of DNA
sequence and shape profiles and HM patterns in distinguishing BSs from non-BSs based on
HM-augmented binding specificity models. We found that contributions from these two sources
were complementary to each other. We further identified three binding mechanisms:
sequence+shape-specific, HM-specific, and a group with other features preferred. For most TFs
from the bHLH, ETS, GATA, and MADS-domain families, our data suggest an HM-dependent
binding mode. In contrast, TFs from the bZIP and C2H2 families seem to utilize a combination
of sequence+shape-specific and HM-specific modes. Moreover, we conducted leave-one-
feature-out feature selection experiments to deconvolve the contribution of each individual
feature rather than of a set of features. Certain feature importance results validated our previous
understanding of TF binding and might provide further insights into the role of other features in a
systematic way. These observations were consistent for all three considered cell lines.
TFs tend to co-bind DNA in close vicinity to each other in order to regulate transcriptional
processes cooperatively. Our analysis revealed a dependency between the HM pattern
preferences of TF pairs and their tendency to co-bind the genome, even if they have different
DNA sequence preferences. This interdependency indicates that HM patterns in regions where
31
BSs are located constrain TF co-occupancy. TFs from the same protein family, or TFs from
different TF families that bind in an HM-specific manner, tend to co-bind DNA BSs in close
proximity. Hox proteins from the HD family, for instance, bind in close proximity to their cofactors
from the same protein family to unleash their DNA binding specificities (22, 59). It is possible
that HM patterns explain the observation that cooperativity can occur promiscuously between
TFs from diverse structural families (10).
Given our observation that HM patterns contribute to the quantitative modeling of TF
binding specificities, an intriguing question is how TFs can sample the unique HM environment
far beyond the core motif in vivo. Other studies suggested that HMs have a direct physical effect
on chromatin structure (124). Lysine acetylation, for instance, removes the positive charge of
this residue, which is believed to increase DNA negative supercoiling (125). In another example,
DNA topology is intricately connected with nucleosome structure and stability (126). These DNA
topological changes might influence the binding of regulatory proteins to DNA (127). Moreover,
HMs modulate the nucleosomal barriers of the transcriptional machinery by altering histone-
DNA contacts so that transcription proceeds efficiently (128). Considering these hypotheses, we
examined the nucleosome occupancy surrounding BSs and non-BSs for various TF families.
Protein families with larger HM pattern differences tended to exhibit decreased nucleosome
occupancy at BSs compared to non-BSs. This finding indicates that HM pattern changes
influence nucleosome structure and stability, which, in turn, evoke changes in TF binding events.
With the observation that the readout of regulatory sequences might be affected by TF-
nucleosome interactions (129), future studies will be required to examine a possible HM–TF
interplay.
In summary, with stringent experimental setups, our analysis extends current knowledge
about the close relationship between HM patterns and genome-wide in vivo TF binding
specificities by revealing protein family-specific mechanisms. We found that HM pattern
32
differences surrounding BSs and non-BSs are TF family-dependent, and that the contribution of
HM patterns to quantitative models of binding specificities is TF family-specific across different
cell lines.
2.5 Methods
2.5.1 In vivo data collection and motif alignment
ChIP-seq data for human TFs with position frequency matrices (PFMs) in the JASPAR
database (17) and DNase I hypersensitivity sites were downloaded from the ENCODE Project
(78). Based on these PFMs, we used FIMO (80) to search and align BSs with default settings. If
a motif was found more than once in a sequence, then the motif with lowest FIMO P-value was
used. Data for the TF was discarded if: (i) the number of aligned BSs was less than 132, to
avoid the risk that the sample size would be less than the number of features used in the
downstream MLR models (which have a minimum of 80 features); or (ii) the peak of the motif
distribution did not coincide with the ChIP-seq peak summit. Final numbers of datasets for the
GM12878, K562, and H1-hESC cell lines were 33, 37, and 18 TFs, respectively.
TF datasets in each cell line were assigned to TF families according to the JASPAR
database (17). TF families with fewer than two members were grouped under ‘Other’. BSs were
derived from ChIP-seq peaks.
2.5.2 Background definition
For each BS, Bowtie (130) was used to scan exact-matched sequences at chromatin-
accessible regions determined by DNase-seq experiments. If more than one exact-matched
sequence was found at a chromatin-accessible region, then only the first sequence in the
Bowtie output was kept. Non-BSs were selected at distinct genomic locations with BSs. After
these steps, for each BS, we selected one non-BS that had the closest average chromatin
accessibility surrounding 1-kb regions upstream and downstream of this BS. The resulting non-
33
BSs had similar chromatin accessibility distributions and similar sample sizes as the selected
BSs.
2.5.3 HM patterns of motif environments
As the motif environment, we considered 1-kb genomic regions upstream and
downstream of each motif. Based on the ChIP-seq .bam files for each HM, BEDTools suite
coverage (131) was performed to calculate the coverage of each nucleotide as the number of
reads that included a given nucleotide. The HM level in each motif environment was averaged
over the coverage at the core motif and 1-kb genomic regions surrounding the core motif. Then,
the HM level was normalized by computing the value of reads per million (RPM). Average levels
were taken from experimental replicates.
2.5.4 DNA shape features in flanking regions
Starting from motifs in BSs and non-BSs, sequences at 10-bp regions upstream and
downstream of the motifs were extracted. To utilize DNA shape profiles at each nucleotide of
these sequences, these sequences with 2-bp flanking regions were generated as input for
DNAshapeR (132), our R software package for high-throughput DNA shape feature prediction.
Four DNA structural features (i.e., MGW, ProT, Roll, and HelT) were calculated, among which
MGW and ProT were predicted for each nucleotide position, and Roll and HelT were predicted
for each base pair step of these sequences.
2.5.5 Statistical comparison of HM patterns at bound BSs and non-BSs
We compared HM patterns between BS and non-BS environments for each HM using
the one-sided Wilcoxon signed rank test implemented by wilcox.test in R. The option greater in
the test meant the null hypothesis (BSs > non-BSs) and vice versa. Bonferroni correction was
applied to correct for multiple testing. The q-values corrected from tests with the greater and the
less options were used to calculate Δ(–log(q-value)), which indicates the results in HM pattern
comparisons between BSs and non-BSs. A positive Δ(–log(q-value)) was assigned to a HM
34
when this HM surrounding BSs had significantly higher levels than surrounding non-BSs, and
vice versa.
2.5.6 MLR scoring scheme
Three different L2-regularized MLR models were trained to distinguish BSs from non-
BSs by using the following feature combinations: (i) DNA sequence and four DNA shape
features (MGW, Roll, ProT, and HelT) at flanking regions 5’ and 3’ of the core motif; (ii) DNA
sequence and four DNA shape features at flanking regions, and 10 HM patterns at the core
motif (usually 6–20 bp) and 1-kb regions upstream and downstream; and (iii) DNA sequence
and four DNA shape features at flanking regions 5’ and 3’ of the core motif, and nucleosome
occupancy at the core motif and 1-kb regions upstream and downstream. For each BS, DNA
sequence was represented in a feature vector (where A was encoded as 1000, T as 0100, G as
0010, and C as 0001). Training sets for MLR classification were stacks of BSs (labeled as “1”)
and non-BSs (labeled as “0”). The penalty parameter λ was learned from the data by using an
embedded 10-fold cross-validation on the training set. AUPRC, computed by using the ROCR
package in R (133), was used to evaluate the accuracy of the respective models in predicting
BSs and non-BSs. Imbalanced data was generated by resampling five times non-BSs with a
bootstrapping strategy.
2.5.7 Leave-one-feature-out L2-regularized MLR models
To determine the importance of each feature in the classification models combining DNA
sequence, shape, and HM features, we implemented MLR models where we left out one of the
features at a time (i.e., DNA sequence, MGW, ProT, Roll, HelT, H3K4me2, H3K27ac,
H3K4me1, H3K4me3, H3K79me2, H3K9ac, H3K9me3, H4K20me1, H3K27me3, and
H3K36me3). We recorded the percentage decrease of AUPRC for each altered model
compared to models that considered all features.
35
2.5.8 Calculating co-occupancy of a TF pair
The percentage of proximal BSs of all possible TF pairs was calculated in each of the
three cell lines. Proximal BSs for a TF pair were defined similarly to (92). All ChIP-seq peaks
containing BSs of a given TF were collected and extended 300 bp at each flank. We calculated
the percentage of proximal BSs for each TF pair by examining the number of BSs of the TF pair
that were located inside these extended peaks. Because TFs from the same family usually have
similar preferences for genomic sequences, we discarded overlapping BSs. We measured the
%ΔH3K4me3 for each TF pair by the difference ratio of the average H3K4me3 patterns over the
environment of all BSs. Lastly, for each TF pair, we compared the percentage of proximal BSs
to the H3K4me3 pattern difference ratio around the BSs.
2.5.9 Nucleosome occupancy
Genome-wide MNase-seq data for the GM12878 and K562 cell lines were downloaded
from the ENCODE Project (78) in .bigwig format. BS and non-BS coordinates were derived from
our MLR classification model. Nucleosome occupancy at base pair resolution was calculated by
bwtool, developed by (134). For each TF, we calculated the average nucleosome occupancy for
regions 1 kb upstream and downstream of all BSs and non-BSs.
2.5.10 Software availability
Source code implementing data preprocessing and L2-regularized MLR models, as well
as BSs and non-BSs in the GM12878 cell line, are available in the GitHub repository at
https://github.com/xinbeibei/HM_and_TFbinding.
36
Chapter 3. Sequence-based CNN models in studying TFs preferring
low-affinity BSs
3.1 Introduction
Gene expression is regulated through TF binding to specific DNA target sites in the
genome. A widespread assumption is that TFs bind to high-affinity BSs, sequences with lowest
free energy (most favorable) of binding on the genome to achieve their binding specificity. Many
studies almost always use most cognate BSs or PWMs to infer both in vitro and in vivo BSs.
However, both functional evidences and genomic evidences over last few years show that TFs
bind on BSs with a spectrum of affinities both in vitro and in vivo. During embryonic
development, replacing low-affinity BSs of a TF with high-affinity BSs will cause ectopic
expression of the target gene of this TF in the normal expression domain, thus disrupting
embryonic development (135, 136). In few genomic studies to date that treated ChIP data as
quantitative measurement of in vivo TF–DNA binding, it has been shown that low-affinity
interactions are widespread (137-139). For example, low-affinity interactions comprise a large
proportion of interactions in yeast and predicted binding energies at promoter regions are
evolutionarily conserved (138). SPIB predominantly binds and regulate genes containing low
and high-affinity BSs in their regulatory regions in human cell lines (139). Moreover, a spectrum
of TF–DNA binding affinity has been found through large-scale in vitro studies based on data
generated by universal protein binding microarray (68) and SELEX-seq assays (59, 136).
Therefore, both functional studies and genomic studies in vivo and TF–DNA binding assays in
vitro demonstrates that low-affinity interactions of TFs and DNA sequences are widespread and
not ignorable in studying gene regulation.
In quantitative modeling of TF–DNA binding affinities, traditional methods mainly focus
on cognate BSs, starting with aligning sequences through PWM scanning or through consensus
37
sequence matching. Subsequently, a machine-learning method will be applied to study
positional biophysical importance at each nucleotide at and around binding motifs (20, 22).
However, several problems can arise by doing so. First, the alignment step largely sacrifices
information in original large amount of sequences detected by binding assays. During motif
alignment, those original sequences without a statically significant motif will be eliminated. It is
very probable that sequences with low-affinity BSs will be discarded. Second, a typical high-
throughput experiment can generate 10,000 to 100,000 raw sequences. If most of raw
sequences are used, it will become computational demanding for traditional machine-learning
models in step two. So far, these problems have been partially taken into account through de
novo motif discovery models that learn a PWM-like matrix, both in probabilistic models and
biophysical models (140). Probabilistic models learn a PWM matrix that can demonstrate the
probability of each possible base at each position of the BS (141, 142), whereas biophysical
models shows the binding energy contributions of each base at each position, such as
FeatureREDUCE (143), BEEML-PBM (19), BEESEM (65), selexGLM (67), and NRLB (71).
However, most of the biophysical models assume independent biophysical contribution of each
nucleotide to TF binding affinities. Moreover, although NRLB can model dinucleotide
interactions and is sensitive enough to capture a wide range of TF binding affinities, it does not
guarantee performance of modeling TF–DNA binding affinity and did not provide evidence of
capturing more complicated non-linear dependencies of nucleotides. As a consequence, these
models might be insufficient to model binding affinity of TFs that have dependent interactions of
non-adjacent nucleotides (144, 145).
In recent years, sequence-based CNNs models have provided effective solutions to
these abovementioned problems in motif discovery (73). They are alignment-free methods
which take raw sequences as input, aiming at modeling binding affinity of every sequence. They
usually start with convolutional layers which have filters that act as motif detectors. With
followed layers of non-linear transformations and utilization of GPU computation, CNN-based
38
deep learning models achieve good accuracy and maintained computational scalability. With
these advantages, CNN-based models in motif discovery also attract attention on how to
implement convolutional layers and how to interpret models with good performance. For
example, different approaches have been applied in the convolutional layer, including using
filters to scan forward strand augmented by its reverse complement strand (73), and using
reverse complemented weight sharing strategy to scan original input sequences (146).
Moreover, deep learning models usually have a “black box” reputation and it is not
straightforward to derive the positional biophysical contribution of every input sequence. Current
existing methods for interpreting trained CNN-models mainly include gradient-based
backpropagation methods (147), and in silico mutagenesis (148).
In this work, we built a sensitive and accurate CNN-based model that model TF–DNA
binding affinity at a large spectrum and explored its behavior in modeling TF–DNA binding. For
most of the experiments, we used unaligned SELEX-seq data for eight Exd-Hox complexes.
First, we compared performances of MLR model used in Abe et al. 2015 and CNN-based
models. Second, we evaluated four different approaches to implement convolutional layers in
terms of modeling accuracy and biological significance. Third, we studied the robustness of four
well-known strategies to interpret pre-trained CNN-based models. Fourth, we specifically chose
two TFs, MNT and EGR1, as representative of TFs that have inter-dependencies within
nucleotides of the BSs, and compared to modeling performance of both traditional methods and
CNN-based methods. Finally, as a case study, we validated the functional evidences of Exd-
Ubx binding at low-affinity BSs at svb enhancers to achieve binding specificity (136).
3.2 Data for modeling
We collected round 3 SELEX-seq data from GSE65073, previously generated and
studied by Crocker et al. 2015 and Abe et al. 2015. Similar to Abe et al. 2015, 14mer data were
used for modeling. Each dataset contains 14mers that are bound by a Exd-Hox protein, together
39
with the binding affinity which is scaled between 0 and 1. A brief summary of sample size of
each data set in in Table 3–1:
Table 3 –1 The sample size of SELEX-data for Hox proteins
Hox protein Lab Pb Dfd Scr Antp UbxIVa AbdA AbdB
Sample size 32563 228028 26652 34489 30777 42471 25063 41476
3.3 Methods
3.3.1 Regression models for predicting binding affinity
We conducted similar L2-regularized MLR model used in Abe et al. 2015 to model the
binding affinity of eight Exd-Hox proteins. All sequences for each Exd-Hox protein were first
aligned based on the consensus sequence TGAYNNAY with no mismatch. With aligned
sequences, one-hot encoding of each sequence and the logarithm of relative binding affinities
were features and response values for training the L2-MLR model. During training, a 10-fold
cross validation was performed with an embedded 10-fold cross validation on the training set to
determine the optimal 𝜆 parameter. The coefficient of determination 𝑅 2
between the predicted
and experimentally-observed logarithm binding affinity in testing data set were used as
measurement of the performance of the model.
3.3.2 Architecture of the CNN-based reverse complemented weight sharing model
To study TF–DNA binding affinity, we usually have 10,000 to 100,000 sequences with
quantitative measurement of how strong each sequence is being bound. The architecture of the
CNN-based reverse complemented weight sharing models (RC model) is shown in Figure 3–1,
which is similar to the one proposed by (146). For a given sequnce 𝑠 , a reverse complemented
weight sharing model computes a binidng score 𝑓 (𝑠 ) using following four types of operations:
𝑓 (𝑠 )=𝐷𝑒𝑛𝑠𝑒 𝑊 2
(𝑊𝑒𝑖𝑔 ℎ𝑡𝑒𝑑 _𝑠𝑢𝑚 𝑊 1
(𝑝𝑜𝑜𝑙 (𝑟𝑒𝑐𝑡 𝑏 (𝑐𝑜𝑛𝑣 𝑀 (𝑠 )))))
The convolution layer (𝑐𝑜𝑛𝑣 𝑀 ) uses a set of motif detectors with parameters 𝑀 to scan an one-
hot encoded matrix (𝐶 ×𝑁 ) of sequence 𝑠 , where 𝐶 is the number of channels in the data (𝐶 =4
for input DNA sequence) Each detector is 𝐶 ×𝑚 a matrix, it scans input layer or output of
40
previous layer through a window of 𝑚 . Different from canonical CNN-based models,
convolutional layers in RC models scan reverse complement strand of a sequecne through a
“constrained” filter rather than an independent filter. The “constrained” filter makes sure that the
reverse complemnet of a channel at index 𝑖 is present at index 𝐶 −1−𝑖 . This step is then
followed by a rectification step by adding a different shift 𝑏 to results of each detector and by
clamping negative values to zeros. More formally, a convolutional layer followed by a
rectification layer usually computes:
𝑟𝑒𝑐 𝑡 𝑏 (𝑐𝑜𝑛𝑣 𝑀 (𝑠 𝑘 ))=𝑅𝑒𝐿𝑈 (∑ ∑𝑀 𝑖 ,𝑗 𝑚 −1
𝑗 =0
𝑠 𝑖 ,𝑗 +𝑘 𝐶 −1
𝑖 =0
+𝑏 )
Here 𝑠 𝑘 is a window of length 𝑚 starting from position 𝑘 of sequence 𝑠 . Note that the
convlutional layer and the rectification layer can iterate several times depending on complexity
of the model. Moreover, number of channels becomes two times the number of filters used. It
was doubled trough the corresponding scanning of the reverse complement strand. ReLU
function represents the retified linear function:
𝑅𝑒𝐿𝑈 (𝑥 )= {
0 𝑥 ≤0
𝑥 𝑥 >0
After these two layers, a pooling step is followed to partition the output and only the maxium
output in each partition will be kept and passed down. More formally, a pooling stepss
implemnted the following operation:
𝑝𝑜𝑜𝑙 (𝑥 𝑘 )=max ({𝑥 𝑘𝑝
,𝑥 𝑘𝑝 +1
,…,𝑥 𝑘𝑝 +𝑝 −1
})
where 𝑝 is the window size of each partion and 𝑥 𝑘 is the output of previous layer. An weighted
sum layer after the pooling layer aims to learn a positional weighting for each channel
separately and then combine the positionally-weighted channel outputs (146). It is followed by
one or multiple dense layers to allow further non-linear transformations of signals in previous
41
layers. Note that results for forward and reverse compelment strand starts to merge in the first
dense layer (Figure 3–1). A final dense layer (not visible in Figure 3–1) is added to get final
output of each sequcne. A sigmoid function is used to scale all predictions between 0 and 1:
𝑠𝑖𝑔𝑚𝑜𝑖𝑑 (𝑥 )=
1
1+𝑒 −𝑥
Dropout techniques are applied to each dense layer. To train the model, we minimized the loss
function, which is defined as the mean squared error (MSE) between the final scaled output and
quantitaive binding affinity measured by experiments. Sepcificaly,
𝑙𝑜𝑠𝑠 =‖𝑦 𝑠 −𝑓 (𝑠 )‖
2
+𝜆 1
‖𝑀 ‖
1
+𝜆 2
‖𝑀 ‖
2
2
where 𝑦 𝑠 is the binding affinity of sequnce 𝑠 . The loss function penalizes the 𝑙 1
norm of the
weights in the convolutional layer with magnitude 𝜆 1
and the 𝑙 2
norm of the same weights with
magnitude 𝜆 2
. Adam gradient-based optimization algorithm (149) is used to update parameter,
with a learning rate 𝜆 3
. This reverse complemented weight sharing CNN-based architecture is
first built and investigated in Shrikumar et al. 2017 for different biological questions.
Figure 3 –1 A demonstration of the architecture of CNN-based reverse complemented
weight sharing model
42
The input sequence 𝑠 ="𝐴𝐶𝐺𝑇 " and for each detector 𝑚 =2, also known as filter length.
Weights of different values at each layer is marked as different colors in each color table on the
right side of the network. Edges on the network have matched colors with corresponding
weights in the color table. Different colors of neurons represent different values. “F” and “R” on
each neuron means forward strand and reverse complement strand, respectively.
3.3.3 Four different approaches to implement convolutional layers
When training a model on sequences derived from double-stranded DNA (generated
from protein binding microarray, SELEX-seq, HT-SELEX, and ChIP-seq), it becomes unclear
that whether a TF binds at the original sequence or its reverse complement strand. To deal with
this ambiguity, current existing CNN-based models mainly apply one of the four approaches
(Figure 3–2) to implement the convolution layers: Model 1: original sequences are augmented
with their reverse complement strand, then are fed to a canonical CNN-based model (146);
Model 2: directly feeding all original sequences to a canonical CNN-based model (148); Model
3: doubling sample size by adding reverse complement of original sequences and applying a
canonical CNN-based model afterwards (73); Model 4: feeding all original sequences to a
reverse complemented weight sharing CNN-based model (146, 150).
43
Figure 3 –2 Four different approaches to implement convolutional layers
On the left panel, four different approaches to implement convolutional layers are displayed. In
Model 1, reverse complement strand of each original sequence is appended to the original
sequence, gapped with Ns. Model 2 uses all original sequences as input. Model 3 doubles
samples size by adding reverse complement strand of each original sequence as a sample. All
these three models use canonical CNN-based model (see the schematic figure on the right
panel). Model 4 uses all original sequences as input of a reverse complemented weight sharing
model.
3.3.4 Hyperparameter determination
After building up the CNN-based network, we started with hyperparameter
determination, which is implemented with Keras with tensorflow backend and K80 GPU
platform.
Depth of convolutional layer and dense layer. It has been demonstrated that deeper
architectures do not necessarily perform better at learning sequence motifs than architectures
with one layer (151, 152). The reason might be that biological sequences are not composed of
44
complex hierachies of pattern such as those in image processing (153). Similar to DeepBind
(73), we chose to use one convolutional layer and one dense layer in Model 1 to 4.
Value of 𝝀 𝟏 ,𝝀 𝟐 ,𝝀 𝟑 . Since there is no universal set of 𝜆 1
,𝜆 2
,𝜆 3
for every model, an
automatical hyperparameter searcher was set up to find the best combination of parameters for
each model for each TF. Specially, for Model 1 to 3, we set 𝜆 1
∈
{0,0.000001,0.000005,0.00001},𝜆 2
∈{0.01},𝜆 3
∈{0.001,0.003,0.005}; for Model 4, we set 𝜆 1
∈
{0.000001,0.000005,0.00001},𝜆 2
∈{0.000001,0.000005,0.00001},𝜆 3
∈{0.0001,0.0003,0.0005} .
The hyperparameter searcher evaluated each combination of parameters and chose the set
with highest coefficient of determination 𝑅 2
in the test data set.
3.3.5 Five different methods to interpret neural networks
Suppose we have obtained models with good performance, one of the most important
questions is: what kind of information can we learn from this model? In this section, several
methods for intepreting CNN-based model will be reviewed. These interpretation methods aim
to calculate unit-resolution importance of every base in each position of BS for a given
sequecne. They are able to evaluate how much each nucleotide contriubtes towards the TF–
DNA binding affinity. For instance, positive values of contriubtion tell us that a small change to
that nucleotide will increase the output value. Intuitively, this will provide us biophysical insights
of how each low-affinity BSs is bound by a certain TF.
DeconvNet (154). This techinique is a gradient-based backpropagation approach to
visualize the input stimuli that excite indivual feature maps at any layer of the model, including
output layer, in the context of image classification. Through taking the gradient of output layer
with respect to the input image, those important pixels will be highlighted and displayed.
Gradient*input (155). Similar to DeconvNet, Simonyan et al. used Gradient-only
mehtod visualize image-specific class saliency map through taking gradients of output layer to
input layer. The difference is the way both methods deal with the backward pass of singal to
ReLU function. In DeconvNet, the importance signal will be set to zero if the signal itself is
45
negative during backward pass. By constrast, Gradient-only will zero’d out those important
signals that have negative input to ReLU function during the forward pass. Later, Gradient*input
method is often more preferable to Gradient-only mehtod as it considers the sign and strength
of the input.
Guided backpropagation (156). This method combines DeconvNet and Gradient*input
in the way that it only keeps importance singals if both these importance signals are positive
during backward pass and inputs to the ReLU function are positive during the forward pass.
Therefore, both Guided backpropagation and Deconvnet can fail to highlight input that
contribute negatively to the output layer.
DeepLIFT (147). Considering that the three above gradient-based mehtods will zero’d
out importance of input that have either negative input to ReLU during forward pass or negative
gradients during backward pass, Shrikumar et al. proposed DeepLIFT to deal with possible
ignorance of important input. During backpropagation, they avoided doing gradients by first
defining a reference input and then assigning significance scores to each input according to the
difference between the reference output and the real acitviation of each neuron. They
demonstrated the ability of DeepLIFT to reveal more important signals in both image
classification and motif discovery.
In silico mutageneiss (73, 148). This method is pertubation-based and was first applied
to see how occulusion of different portions of an input image can affect the output in image
classification (154). In CNN-based models with DNA sequences, this method has sucessfully
identified TF–DNA binding motifs (73) and evaluated the importance of single nucleotide
variants in genomic data (148).
3.3.6 Visualization of unit-resolution importance matrix for a given sequence
For any given DNA sequence 𝑠 of same length 𝑙 as input, gradient-based
backpropagation methods (DeconvNet, Gradient*input, Guided backpropagation) were
implemented through Keras built-in functions. The derived gradients at input layer was a matrix
46
of size 4×𝑙 , each element represents the importance of nucleotide A, T, C, G at the certain
position in predicting the binding affinity of sequence 𝑠 . For visualization purpose, only the
corresponding importance level (only importance of A will be visualized at position 1 if the first
nucleotide of 𝑠 is A) will be visualized in a PWM-like logo. The PWM-like logo is produced by
seq2logo (157).
There are several ways to define a mutation map ∆𝑠 for each given sequence 𝑠 and we
find the following method provides clear visualization in most cases. First, we computed the
predicted binding affinity of original sequence 𝑓 (𝑠 ). Then for every position 𝑠 𝑖 , we mutated it into
one of the other possible nucleotide 𝑏 𝑗 ∈{𝐴 ,𝑇 ,𝐶 ,𝐺 } as sequence 𝑠 ̂. Second, we calculated 𝑓 (𝑠 ̂)
and finally define ∆𝑠 𝑖𝑗
=(𝑓 (𝑠 ̂)−𝑓 (𝑠 ))∙max (0,𝑓 (𝑠 ),𝑓 (𝑠 ̂)). The second term in the product is to
override no binding and highlight the magnitudes of change in strong binding.
When visualizing the importance of each nucleotide of low-affinity BSs in the svb enhancer, the
sequence length is larger than input length and the enhancer sequence 𝑠 contains several weak
BSs. In this scenario, when one mutation destroys a BS, another BS will compensate and still
report a large score for binding, thus the importance of the mutated position will be misleading.
Instead, we first collected every 𝑙 -length window 𝑤 of a long sequence 𝑠 (length eques to 𝑛 ),
repeated aforementioned method to get a mutation map ∆𝑤 of size 4×𝑙 for each window,
resulting in 𝑛 −𝑙 +1 windows and mutation maps. We then took average of mutation maps ∆𝑤
that cover a certain position and created a mutation map ∆𝑠 (of size 4×𝑛 ) for 𝑠 . This
visualization method of in silico mutagenesis is also the strategy used in Alipanahi et al. 2015.
3.4 Results
3.4.1 Sequence alignment decrease abundance of low-affinity BSs
Hox proteins are essential to regulate development of anterior-posterior axis of animals
(158). Their binding specificity is quite paradoxical because all Hox proteins bind preferentially
47
at similar DNA binding domains (i.e. homeobox) and both sequences around BSs and cofactors
are not sufficient to explain their complex binding specificity (22). A recent study reported that
one Exd-Hox heterodimer, Exd-Ubx, tends to bind at homotypic clusters of low-affinity BSs in
vivo to achieve its binding specificity (136). In vitro binding data for eight Exd-Hox proteins
generated by SELEX-seq also indicated similar observations. To investigate the binding affinity
of Hox-Exd proteins, we went back to the previously-examined in vitro SELEX-seq data for eight
Hox-Exd proteins that regulate segment development of Drosophila embryo (Figure 3–3A): Exd-
Lab, Exd-Pb, Exd-Dfd, Exd-Scr, Exd-Antp, Exd-Ubx, Exd-AbdA, and Exd-AbdB (59). Based on
these data, it has been reported that these Exd-Hox proteins prefer three groups of high-affinity
BSs: Exd-Lab and Exd-Pb prefer TGATTGAT, Exd-Dfd and Exd-Scr prefer TGATTAAT, and
Exd-Antp, Exd-Ubx, Exd-AbdA, and Exd-AbdB prefer TGATTTAT (Figure 3–3A) (59). Moreover,
a following recent study has investigated the mechanism of how homotypic low-affinity BSs
contribute to binding specificity of Exd-Ubx through mainly in vivo techniques (136). Crocker et
al. also validated three low-affinity BSs in the E3N enhancer region that are specially for Exd-
Ubx binding in vivo, including CTGATTTGTTGA (Site 1), CCGATAAAAAAT (Site 2), and
ATAATTTGTAGT (Site 3) (Figure 3–3B).
Figure 3 –3 high-affinity and low-affinity BSs of Drosophila embryo
48
(A) Schematic figure showing eight Exd-Hox proteins that control the body plan of a Drosophila
embryo along the head-tail axis (adapted from (159)). The heat map shows three classes of
Exd-Hox proteins, each class prefers a slightly different consensus sequence TGAYNNAY
(adapted from (59)). (B) Three low-affinity BSs in the svb enhancer have been identified for Exd-
Ubx binding in vivo. Four different mutations of each of the BS or seven scenarios of replacing
low-affinity BSs with high-affinity BSs showed decreased binding specificity of Exd-Ubx (136).
We first examined the abundance of both high-affinity BSs and low-affinity BSs in the
14mer SELEX-seq data, we scanned all 14mer bound by each Exd-Hox and found exactly-
matched high-affinity BSs and low-affinity BSs in 8mer format. We observed that Exd-Ubx, Exd-
Pb and Exd-Scr have abundance of three sites Site 1-3 in their SELEX-seq data, with highest
affinity around 0.5, whereas other Exd-Hox proteins usually bind on sites with highest relative
affinity around 0.3 (Figure 3–4 right panel). This in vitro analysis is consistent with the in vivo
observation that this homotypic cluster of low-affinity BSs binding is bound by Exd-Ubx. As a
control, we found Exd-Hox proteins have expected abundance at their respective high-affinity
BSs (Figure 3–4 left panel). Next we are confident to use these SELEX-seq data and build
models to capture a large range of binding affinities.
49
Figure 3 –4 Abundance of high-affinity BSs reported in Slattery et al. and low-affinity BSs
(Site 1-3) in 16mer SELEX-seq data for eight Exd-Hox proteins
The left panel is a jitter plot showing results for three high-affinity BSs. The right panel is a jitter
plot for abundance of low-affinity BSs. Since the full-length sequences of Site 1-3 do not appear
in any 14mer, we checked the abundance of 8mer in each site instead since Hox-protein
typically bind on DNA sequence with that range.
Table 3 –2 Sample size for each Exd-Hox protein after TGAYNNAY alignment
Hox protein Lab Pb Dfd Scr Antp UbxIVa AbdA AbdB
Sample size 32563 228028 26652 34489 30777 42471 25063 41476
Remained
after alignment 39.11% 2.84% 34.36% 41.04% 34.24% 28.82% 40.45% 28.96%
Next, we aligned binding data of each Exd-Hox protein based on the exact match of
consensus sequence TGAYNNAY and the ratio of remaining DNA sequence for each complex
is around 30% for each data (Table 3–2). Even though the distribution of binding affinities of
pre-aligned and post-aligned Exd-Hox data is similar (Figure 3–5 left panel), the abundance of
50
low-affinity BSs in the aligned sequences is much less than the original data (Figure 3–5 right
panel). Therefore, in terms of studying the low-affinity BSs in Crocker et al., this alignment step
discards valuable information.
Figure 3 –5 Abundance of low-affinity BSs in aligned DNA sequences
In the left panel, each subgraph is for the distribution of binding affinities of original sequences
and aligned sequences. In the right panel, each subgraph shows the abundance of low-affinity
BSs in aligned sequences for each Exd-Hox protein complex. Each dot in the subgraph
represents a bound probe, and Y-axis shows the relative binding affinity of TF BSs.
3.4.2 Modeling accuracy with L2-MLR and CNN-based models
To evaluate the performance of traditional machine learning method on modeling binding
affinity of Exd-Hox proteins, we built L2-MLR models with aligned DNA sequences. As
comparisons, four CNN-based models described in Figure 3–2 were implemented with original
sequences, including Model 1: canonical CNN-based model on original data augmented by
reverse complement strand; Model 2: canonical CNN-based model on original data; Model 3:
canonical CNN-based model on original data doubled by reverse complement strand; Model 4:
51
RC model on original data. We used similar model architecture for four CNN-based models:
pooling window of size 2, and one dense layer with dropout rate 0.2 and 512 neurons before the
final output layer. For number of filters to use in the convolutional layer, we tried 32 and 100
filters. The reason we tried 32 filters is that in both DeepBind and original RC model papers,
dozens of filters were able to build accurate models. However, we found 32 filters at first layer
captures less information from input data than 100 filters, thus resulting in less general model,
which is more prone to be overfitting (Figure 3–6 right panel). Based on this, we used 100 filters
to configure both RC models and canonical models in following analyses. After running the
models with different combinations of hyperparameters, we selected the combination with
highest performance on test data set for each model.
Figure 3 –6 CNN-based models outperform MLR models and selection of number of filters
Left panel shows the comparison of modeling performance of MLR and CNN (RC model)
models. Right panel displays that performance gap of CNN (RC model) models on between
train data and test data, which can be regarded as the overfitting degree of models. Each dot in
each boxplot represents a CNN (RC model) with different learning rate and regularizer
coefficients.
The first observation is that CNN-based RC models can always outperform MLR models
in Exd-Hox heterodimer binding affinity studies (Figure 3–6 right panel). Note that CNN-based
models Model 1-4 were built with original sequences. They achieve high performance in test
52
data, meaning they can capture large range of binding affinities well. In the meanwhile,
canonical models (Model 2-4) performed slightly better than RC model (Model 1) (Figure 3–7),
which is not consistent with the case study of Shrikumar et al. 2017. The reason behind might
be that SELEX-seq data used here are less noisy than ChIP-seq data used in Shrikumar et al.
2017, where filters in the canonical CNN-based models learnt palindromic binding patterns.
Instead, with one more constraint that forward strand and reverse complement strand should be
treated equally, CNN-based RC model could compromise by not reaching the optimal solution
that canonical CNN-based models achieve.
Figure 3 –7 Comparison of modeling performance achieved by four CNN-based models
Four models are CNN (canonical, RC augmented): canonical CNN-based model on original
data augmented by reverse complement strand; CNN (canonical): canonical CNN-based model
on original data; CNN (canonical, double sample) canonical CNN-based model on original data
doubled by reverse complement strand; CNN (RC model): RC model on original data. 100 filters
were used in first convolutional layer. For each Exd-Hox heterodimer, we shuffled its raw data
53
and split the data as training data (80%) and test data (20%). Those numbers in the figure are
P-values from one-sided t-test with null hypothesis that the model on the left side achieves less
accuracy than the model on the right.
3.4.3 RC model predicts same binding affinity for forward strand and RC strand
As aforementioned, the architecture of RC models allows weight sharing between
forward strand 𝑠 and reverse complement strand 𝑟𝑐 _𝑠 at the convlution layer (𝑐𝑜𝑛𝑣 𝑀 ), activation
layer (𝑟𝑒𝑐𝑡 𝑏 ), and weighted sum layer (𝑊𝑒𝑖𝑔 ℎ𝑡𝑒𝑑 _𝑠𝑢𝑚 𝑊 1
). This architecture ganrantees that
𝑓 (𝑠 )=𝑓 (𝑟𝑐 _𝑠 ) which is consistent with biological intuition that forward strand and reverse
complement strand should have same predicted binding afffnity due to complementary base
pairing of double sranded DNA. Enven though canonical CNN-based models have hgiher
performances in our case study, they did not guarantee that forward strand and reverse
complement strand have same weights, which will lead different binding affinity predictions
when forward and reverse complement strands of a certain sequence is fed to a pre-trained
model, meaning 𝑓 (𝑠 ) does not equal to 𝑓 (𝑟𝑐 _𝑠 ).
It might be problematic when forward strand and reverse complement strand are treated
differently in canonical models. For example, one sequence pattern (“ACT”) on one strand is
equivalent to the appearance of sequence pattern (“AGT”) in its reverse complement strand.
However, traditional architecture either uses different convolution filters to scan sequences with
augmented reverse complement strand or use artificial palindromic filters to scan original
sequences, thus lowering the efficiency and accuracy of filters. Expectedly, a model predicts
binding affinity of one sequence and its reverse complement strand with the same results. We
found that models with reverse complemented weight sharing can achieve this (Figure 3–8
upper panels). Especially for those low-affinity BSs, the predicted binding affinities from
canonical models are usually more skewed (Figure 3–8 upper panels). Therefore, RC models
are more robust in modeling the binding affinities of low-affinity BSs. In following studies, we will
use RC models as a representation of CNN-based network.
54
Figure 3 –8 Models with reverse complemented weight sharing predict same binding
affinity for a sequence and its reverse complement strand
The upper panels are for CNN (canonical) models and the lower panels are for CNN (RC
models). Data for Exd-Lab, Exd-Scr, and Exd-Ubx are shown. In each panel, each dot
represents a 14mer sequence in the test data. The predicted relative binding affinity of each
14mer is displayed in x axis and the predicted relative binding affinity of its reverse complement
strand is shown in y axis. CNN (canonical) models have less consistency in predicting
sequences with low relative binding affinities (bottom left of each panel) than that with high
relative binding affinities (upper right of each panel).
3.4.4 In silico mutagenesis is a robust and sensitive interpretation method
With these well-performed models, we then explored what biological insights different
interpretation methods (see Methods) can provide. We thus selected three BSs from SELEX-
seq data: AATGATTGATTACC of Exd-Lab, AATGATTAATTGCT of Exd-Scr, and
ATGATTTATTACCC of Exd-UbxIVa. These BSs represent three classes of high-affinity BSs.
55
Since each method can provide the importance score of each nucleotide at every position (see
Methods), it is expected that the consensus sequence TGAYNNAY will be highlighted.
Figure 3 –9 Robustness of three interpretation methods
Given well-trained CNN (RC model) models with Exd-Ubx SELEX-seq data, three different
interpretation methods were used to see unit-resolution importance for a given sequence. To
evaluate the robustness of these interpretation methods, models with similar performance (𝑅 2
ranges from 0.974 to 0.984) but with slightly different hyperparameter configurations are used to
interpret three high-affinity BSs for Exd-Hox heterodimer. Each row represents a
hyperparameter configuration. The first column is how DeconvNet interpret the importance of
TGATTTAT. The second column is for Gradient*input and the last three columns are how in
silico mutagenesis interprets TGATTTAT, TGATTAAT, and TGATTGAT respectively.
Expectedly, interpretation methods can highlight high-affinity BS, but in silico
mutagenesis methods show more robust interpretation when different hyperparameters are
used. For example, when visualizing the unit-resolution importance of TGATTTAT (a high-
affinity BS for Exd-Ubx) with CNN (RC model) built on Exd-Ubx data, DeconvNet and
Gradient*input methods occasionally show negative weights for TG or TA steps. As a contrast,
in silico mutagenesis can output highlighted TGATTTAT constantly (Figure 3–9 first three
columns). Even though DeconvNet and Gradient*input has been successfully applied to image
classification problem which has continuous data as input, it might not be suitable for these
gradient-based backpropagation methods to deal with the sparse data type of one-hot encoding
56
DNA sequence. In sparse space, taking gradients needs more caution than in continuous space
since a little distortion from different directions might results in very different changes in output
value of the function. This might be the reason why DeconvNet and Gradient*input methods are
not stable.
Besides robustness, in silico mutagenesis method is also sensitive in distinguishing
specific high-affinity BS. When interpreting the unit-resolution importance of the three classes of
high-affinity BS, only all nucleotides in TGATTTAT were constantly highlighted. However, A and
G in position 6 of the high-affinity BS showed negative importance, meaning their presence is
not optimal for Exd-Ubx binding (Figure 3–9 three columns on the right).
3.4.5 Validation of functional evidence of Exd-Ubx binding at low-affinity BSs
In this work, we evaluated the ability of CNN-based models to validate the functional
evidences that Exd-Ubx binds on low-affinity BSs to achieve their binding specificity.
Significantly, these low-affinity BSs could not be recognized by oligomer-enrichment based
analyses, we hypothesized that CNN-based models are sensitive enough to predict the binding
affinities of wide type (WT) sequence for eight Exd-Hox heterodimers and to model their binding
preferences changes towards both mutations and replacements (displayed in Figure 3–3) into
high-affinity BSs of their in vivo low-affinity BSs.
We first validated the binding specificity of Exd-Ubx on the reported 74bp WT sequence
in the svb enhancer. We checked the unit-resolution importance of this sequence by best-
trained models on Exd-Lab, Exd-Scr, and Exd-Ubx SELEX-seq data respectively. In general,
more nucleotides in WT sequences are disfavored by Exd-Lab and Exd-Scr than by Exd-Ubx.
Most nucleotides in Site 1-3 contributes negatively to Exd-Scr binding (Figure 3–10). Even
though Site 1-3 are low-affinity BSs for Exd-Hox heterodimers, Exd-Ubx showed more preferred
binding than other heterodimers.
57
Figure 3 –10 Unit-resolution importance of WT sequence in svb enhancer
PWM-like logos show unit-resolution importance of each nucleotide in the WT sequence in the
svb enhancer. Each row displays the importance scores derived from the CNN (RC model)
model built with one Exd-Hox SELEX-data. Y axis in each logo represent the centralized
∆∆∆𝐺 /𝑅𝑇 . Three black frames highlight identified in vivo low-affinity BSs for Exd-Ubx. The logo
was created by (157).
Second, we validated reduced binding preference of Exd-Ubx in the mutated svb
enhancer. We evaluated binding energy changes due to mutations described in Figure 3–3 with
best-trained model for Exd-Ubx. Generally speaking, Mutation 1-3 cause disfavored binding
status for Exd-Ubx, based on the observation that the binding energy changes are mostly above
zero lines (Figure 3–11). We thus validated the reduced binding preference of Exd-Ubx due to
Mutation 1-3.
58
Figure 3 –11 Binding preference changes of mutated svb enhancer for Exd-Ubx
Each curve represents the binding preference changes (measured in ∆∆∆𝐺 /𝑅𝑇 ) if certain
mutations (Figure 3–3) are introduced to the svb enhancer. Only those affected nucleotides due
to mutations are displayed. Mutations are marked as red in x axis where “>” connect WT
nucleotide on the left side and the mutated nucleotide on the right side. Note that each dot in the
curve represents the binding energy change due to the binding on the following 14mer
sequence.
Similarly, we validated reduced binding specificity of Exd-Ubx and increased binding
affinity for other Exd-Hox heterodimers when replacing low-affinity BSs into high-affinity BSs in
the svb enhancer. In Crocker et al. 2015 study, one striking observation was that once replacing
low-affinity Site 1-3 into higher-affinity BSs, the svb enhancer will resume binding of other Exd-
Hox heterodimers. According to the embryo images (Figure 3–12 left panel), for example, Exd-
Scr starts to bind when relative binding affinity is above around 0.25. Expectedly, when we
applied best-trained models for Exd-Scr to predict binding affinity of replaced BSs, we found that
the binding preference increases with relative binding affinity of replaced BSs (Figure 3–12 right
panel).
59
Figure 3 –12 Validation of binding preferences change of Exd-Scr
In the left panel, embryo images with different replacements BSs in the svb are adapted from
(136). Relative binding affinity for each replacement is also shown on the leftmost side. In the
right panel, each curve corresponds to binding preference change (measured in ∆∆∆𝐺 /𝑅𝑇 ) due
to one replacement. These curves are ordered through the order of Drosophila embryo in the
left panel. Note that the y axis of these curves are all between -1.5 and 0.52.
3.5 Future work
We have shown that the CNN (RC model) are more suitable to study binding specificity
of low-affinity BSs because they treat forward strand and reverse complement strand equally
during modeling. Moreover, with network interpretation techniques, our framework is robust and
sensitive in studying the importance of every nucleotide in input sequence. More importantly,
60
our framework can validate previously functional evidence in vivo, which is beneficial to future
designs of potential BSs in vivo.
There are two remaining aspects to explore in the future. We have mentioned five
different methods for network interpretation but only showed comparison among three methods
in the Results. We will further explore robustness and sensitivity of Guided backpropagation and
DeepLIFT methods mentioned in the Methods part. Moreover, we will investigate how CNN (RC
model) models can improve binding specificity modeling of TFs who prefer BSs with dependent
nucleotides, due to their ability to consider non-linear dependencies between nucleotides that
are either adjacent or several bases away from each other.
61
Chapter 4. Structure-based CNN models in studying TF –DNA binding
mechanism
4.1 Introduction
Over the last decades, it has been largely understood about how TFs recognize their
cognate binding sties to initiate the gene regulation. However, it still requires efforts on
understanding how TFs bind to only specific subset of potential BSs in vivo. For example,
paralogous TFs, such as Exd-Hox heterodimers, bind to very similar consensus sequence in
vitro, however, bind to different in vivo BSs to execute different in vivo functions (136). Our
current knowledge on TF–DNA binding specificity is that TFs recognize DNA through the
interplay of base and shape readout (16, 27). In structural biology, nearly 4000 protein–DNA
structures have been solved and accumulated in Protein Data Bank (160, 161) over the last 30
years, aiming to find a protein–DNA recognition code. These structures have shown substantial
evidences where proteins and DNA interact through unique chemical signatures of DNA bases
(base readout) and through a sequence-dependent DNA shape (shape readout) (27). A Given
nucleotide at a certain position is recognized by physical interactions between amino acid side
chains of a TF and the accessible edges of the base pairs that are contacted. These contacts
formed at the binding interface include hydrogen bonds, water-mediated hydrogen bonds, and
hydrophobic contacts (27). For example, p53 was found to bind on a noncanonical Hoogsteen
base-pairing DNA sequence through the formation of bidentate hydrogen bonds between
arginine residues and guanine base in the major groove of DNA (162). This type of recognition
is called ‘base readout’. In parallel, the ‘shape readout’ include recognition of global shape
recognition (for example, an overall bend of a DNA sequence, Figure 4–2) and recognition of
local shape (27). For example, the recognition of narrowed minor grove because of kinked base
62
pairs, which will enhance negative electrostatic potential through insertion of arginine (33) or
histidine residues (163).
Figure 4 –1 Multiple readout mechanisms to achieve DNA-protein binding specificity
(A) base readout describes the interactions between amino acids and functional groups of
nucleotides in a base pair. Hydrogen bond acceptor is denoted as red circle; hydrogen bond
donor is represented as blue circle; nonpolar hydrogen is shown in white; and methyl group is in
yellow color. In the major groove, nucleotides have their specific patterns, whereas in the minor
grove, patterns are degenerate. (B) shape readout includes local shape readout (narrowed
minor groove) and global shape recognition (DNA bending). This figure is adapted from (16).
In Chapter 3, we have introduced quantitative methods for motif discovery and for
modeling the binding affinity of TF–DNA binding, including probabilistic models (141),
biophysical models (65, 71, 143), and CNN-based models (148). Regardless of the algorithm,
these models represent DNA sequences as sequential letters of A, T, C, G. Even though CNN-
based models use one-hot encoding to represent every nucleotide, the 4-bit binary features are
still sparse representation of DNA molecules. Together with the successful application of deep
learning models, dna2vec avoids one-hot encoding of DNA sequences by treating them as
natural language sentences and uses wording embedding algorithms to project k-mers to
numeric vectors (164). Recently, DeepVariant has stacked next-generation sequencing probes
63
as 2D images, with blocks in the images colored by the corresponding nucleotides (165).
DeepVariant has proven to be very sensitive in identifying single nucleotide variants in GATK
datasets and several challenges, but they are still sparse representation of DNA molecules and
assume independence of different nucleotides.
Given that DNA binding proteins combine multiple readout mechanisms to achieve DNA-binding
specificity, computational models treating DNA as a linear string of nucleotides are no longer
sufficient. Actually, when proteins approach DNA molecules, they recognize ordered biophysical
atoms (P, O, C, N, H) rather than read simplified letters A, T, C, G. Moreover, different
nucleotides have dependencies that cannot be captured by sequential letters. For example, A-T
and T-A base pairs have same hydrogen patterns in the minor groove and reverse patterns in
the major groove (Figure 4–1). In this study, we will take into account the three-dimensional
structure of DNA (Figure 4–2) and represent DNA sequence with a rebuilt 3D mesh graph,
which contains rich information including spatial arrangement of atoms, backbone information,
narrow minor groove, and DNA bending.
Figure 4 –2 Visualization of the with A-tract induced bending of a DNA segment of
sequence A 5N5A 5N5A 5N5A 5
In the surface of DNA molecule, carbon, oxygen, nitrogen atoms are displayed as green, red,
and blue colors respectively, the default setting in the ‘surface’ mode of visualization tool
PyMOL (166). The structure file (.pdb) for visualization was generated by rebuild method of
64
3DNA (167) (see Methods). This DNA segment containing A-tracts has shown large intrinsic
DNA bending (35).
In this work, we will revisit the problem of how Exd-Scr (one of the eight Exd-Hox
proteins in Chapter 3) achieves their binding specificity by using SELEX-seq data. Existing NMR
and crystal structures of Exd-Hox DNA complexes show that the recognition helix of the HD
makes base-specific contacts in the DNA major groove of a high-affinity consensus binding
sequence (168-170). In a recent study, the crystal structure of Exd-Scr and their in vivo BSs
suggests that the minor groove of the binding sequence creates an enhanced negative
electrostatic environment for basic residues in both the N-terminal arm and the linker region
(34). We will build CNN-based models using 3D mesh graphs as representation of DNA
sequences using SELEX-seq data mentioned in Chapter 3. Moreover, with the technique to
calculate the unit-resolution importance of input data, we are able to pinpoint the important
atoms on the DNA molecule surface that contribute to the binding affinity of TF–DNA complex.
This technique might assist in finding new binding mechanisms that could not be learnt through
sequence-based models, such as proteins contact with DNA backbone to achieve binding
specificity (171).
4.2 Methods
4.2.1 Rebuild a 3D mesh graph for a DNA sequence (with Jinsen Li)
The software package 3DNA (167) was used to reconstruct and visualize the three-
dimensional DNA sequences. Given a DNA sequence, two ‘N’s were first padded to each end of
the sequence. Then through the DNA shape pentamer query table used in DNAshape (172,
173), the shape parameters for every base pair and base pair step were derived, resulting in a
table with 12 shape parameters. These shape parameters include base pair parameters (Shear,
Stretch, Stagger, Buckle, Propeller Twist, and Opening) and dimer step parameters (Shift, Slide,
Rise, Tilt, Roll, and Helical Twist), which are good descriptors for DNA structure. With the table
65
of 12 shape parameters, the command rebuild with ‘–atomic’ option was used to rebuild the full
atomic models with the sugar-phosphate backbone, which was stored in a .pdb file with
publication quality. The .pdb file was then loaded to PyMOL (166) for both visualization in
‘surface’ mode and for storing mesh graph in .VRWL format. In the mesh graph, each point
(𝑥 ,𝑦 ,𝑧 ) (real values) was then mapped to discrete voxel coordinates (𝑖 ,𝑗 ,𝑘 ) (integers) in the
final volumetric representation (174). The mapping is a uniform discretization and keeps the
origin, orientation and resolution of the voxel grid in space. The resolution was further reduced
to avoid too detailed representation of DNA sequence and to lower computation complexity. To
do this, we set ratio of voxels in each dimension as 3:1:1, and use the average of original
coordinates (𝑖 ,𝑗 ,𝑘 ) that were located within the new voxel and their average RGB colors as the
coordinate and color of the new voxel. After these steps, a 4D tensor (𝑖 ,𝑗 ,𝑘 ,𝐶 ) with desired
resolution was created to represent every DNA sequence (the last dimension is for RGB color).
An schematic figure of this workflow is shown in Figure 4–3.
Figure 4 –3 A schematic figure showing how to rebuild 3D atomic structure of DNA
sequences as modeling input
66
Given a DNA sequence CGCGAATTCGCG with 12 calculated DNA shape parameters from
DNAshape (172, 173), 3DNA rebuild function was used to generate mesh graph (shown in
upper right). Followed by a resolution deduction step, the mesh graph was represented by a 4D
tensor with desired resolution for further input of CNN-based models.
4.2.2 Architecture of CNN-based models
The main components of the architectures include 3D convolutional layers (Conv3D), 3D
rectify layers (Rectify), and 3D max pooling (Pool) layers Figure 4–4. The 3D convolutional
layers consist of three-dimensional filters to scan through the input 4D tensor in each channel
with shared weights across the image. For instance, in the middle of Figure 4–4, the 3D
structure represented the information captured after scanning of three different filters (shown in
green, blue, red color respectively). The followed Rectify layer uses ReLU function and the Pool
layer highlight maximum signal over a sliding window size, aiming to capture invariances in the
image. After these operations, ideally the translational and scaling invariances of images will be
captured. After flattening output of the pooling layer, one dense layer with dropout technique
was applied to approximate the observed relative affinity. The following describes how a single
training example (𝑥 𝑖 ,𝑦 𝑖 ) was processed in a generic two layers CNN-based network.
ℎ
1
(𝑥 𝑖 )=𝑔 (𝑅𝑒𝐿𝑈 (𝑊 1
⊛𝑥 𝑖 +𝑏 1
))
ℎ
2
(𝑥 𝑖 )=𝑔 (𝑅𝑒𝐿𝑈 (𝑊 2
⊛ℎ
1
(𝑥 𝑖 )+𝑏 2
))
ℎ
3
(𝑥 𝑖 )=𝑅𝑒𝐿𝑈 (𝑊 3
ℎ
2
(𝑥 𝑖 )+𝑏 3
)
𝑓 𝑦 𝑖 (𝑥 𝑖 )=𝑊 4
∗ℎ
3
(𝑥 𝑖 )+𝑏 4
where 𝑊 1
,𝑊 2
are filter weights, ⊛ denotes the Conv3D operation, 𝑔 is the Pool operation,
𝑊 3
,𝑊 4
are weight matrices, 𝑏 1
,𝑏 2
,𝑏 3
,𝑏 4
are biases. The loss to be optimized is the mean square
error as follows.
𝑙𝑜𝑠𝑠 (𝑥 𝑖 ,𝑦 𝑖 )=‖𝑦 𝑖 −𝑓 𝑦 𝑖 (𝑥 𝑖 )‖
2
2
We used Adam (add citation) optimizer to train our model.
67
Figure 4 –4 The architecture of CNN-based models based on 3D structure of DNA
sequences
As for hyperparameter selection, the current result was based on the architecture with
two layers in convolutional layers, each of them had 32 filters of size (3,3,3). We applied dropout
rate 0.5 after the dense layer. During model training, learning rate was set to 0.0001. Parallel
jobs were submitted through slurm_with_tensorflow (https://github.com/deepsense-
ai/tensorflow_on_slurm), each time with one parameter server and 24 workers. High
performance computers (HPC) were used. Each parameter server and worker is a 2.6GHz,
50GB RAM machine with 16 cores.
4.2.3 Visualization of unit-resolution importance score
Gradient*input (155) and Guided backpropagation (156) methods (introduced in Chapter
3) were used to calculated unit-resolution importance score for each voxel in the input 4D
tensor. Since the 3D graph has continuous color-coding in each dimension, it will be
computationally demanding for in silico mutagenesis to consider all possible disruption of
original input 4D tensor. To visualize the importance score in original resolution of 3D mesh
graph, the importance score at each voxel of the 4D tensor was mapped back to the original
mesh graph.
68
4.3 Results
4.3.1 Accuracy of rebuilding structures by 3DNA
In this study, since the 3D mesh graph rebuilt from 3DNA (167) were used as
representation of DNA sequences, an very important question is that whether this
representation is accurate enough or not. To answer this question, we evaluated how well
3DNA rebuild B–DNA dodecamer sequence CGCGAATTCGCG, which has two Nuclear
magnetic resonance (NMR) structure 1BUF and 1NAJ in Protein Data Bank. Given the base
pair parameters and dimer step parameters, the relative position and orientation of the base
atoms are uniquely determined. However, the phosphate-sugar backbone is more flexible in
DNA molecules. Considering of this, the full atomic structures were used for fair comparison.
We can extract 12 DNA shape parameters either from high-through prediction method
DNAshape (rebuild_predicted) or from the calculation through the original .pdb files in Protein
Data Bank (rebuild_measured). The root mean square deviation (RMSD) between these two
rebuilt structures are shown in Table 4–1. The RMSD is less than 2Å, which is less than a
typical average RMSD during molecular dynamics of DNA structure (175). Note that the RMSD
between rebuild 3DNA models and experimental DNA structures reported to be less than 1Å
(167). Based on this observation, the rebuilt 3D structures are good representations of DNA
sequences.
Table 4 –1 RMSD between rebuild_predicted and rebuild_measured of DNA dodecamer
RMSD (in Å) 1BUF 1NAJ
Rebuild_predicted vs rebuild_measured 1.35 1.883
4.3.2 Accurate modeling of binding affinity through structure-based CNN models
With accurate 3D representation of DNA sequences, we then evaluated the modeling
accuracy of binding affinities of Exd-Scr heterodimer with these structure-based CNN models.
We first represented each DNA sequence as 3D graph with 90×30×30 voxels. Then we split
69
14mer SELEX-seq data into train (80%) and test (20%) dataset and used R-squared to measure
modeling accuracy. After training the model with parallel computing for about 100 hours, we can
achieve training accuracy as 0.92. However, the testing accuracy is lower than 0.2, meaning the
model has overfitted data, either because the model is too complicated or the input data have
too many detailed features. We will discuss about how to solve this problem later.
4.3.3 Minor groove regions are highlighted in Exd-Scr high-affinity BSs
With the well-trained model, we were then interested in if the model captured information
in the minor groove that was previously reported to determine Exd-Scr binding specificity. With
this goal, we first checked previously-reported crystal structure of Exd-Scr–DNA complex (PDB
ID: 2R5Z) (34). In general, residues make contacts with DNA backbone and most major groove
regions at consensus sequence AGATTAAT (Figure 4–5 left panel). Moreover, most of the
minor groove contacts occur in TAAT base pairs regions. We then visualized the highlighted
regions of Exd-Scr binding sequence GATGATTAATTACC after first convolutional layer (Figure
4–5 right panel). Overall, backbone interactions are highlighted. From minor groove view, more
voxels are highlighted in the TAAT region compared to TGA region. From the major groove
view, several nucleotides, but not all, have highlighted contacts. This reason might be that not
all major groove contacts are important for Exd-Scr DNA binding specificity.
70
Figure 4 –5 Minor groove view and major groove views of highlighted voxels at first
convolutional layer
The left panel shows the nucleotide-residue interaction map of Exd-Scr binding to a long DNA
sequence (PDB ID: 2R5Z), generated by DNAproDB (161). The shaded area indicates the high-
affinity BS AGATTAAT. Minor groove, major groove and DNA backbone interactions are shown
with default distance cutoffs. The right panel visualizes a high-affinity BS of Exd-Scr heterodimer:
GATGATTAATTACC. Important voxels that are highlighted in the first convolutional layer are in
different colors, representing the presence of different atoms. Again, green color represents
carbon, red color represents oxygen, and blue color represents nitrogen. Darker color means
more important.
4.4 Future work
In the future, in terms of modeling, we need to first fix the overfitting problem. We have
tried to repress the mesh graph into lower resolution, such as 45×15×15. However, it seems
that lowered resolution does not achieve good training accuracy. Considering that DNA
sequences in the test data set are positioned in different angles, we also duplicated each
sequence in train dataset with multiple (6 or 12) angles, each angle as a new sample for
71
training. This data preparation has enlarged dataset 6 or 12 times, thus requiring more time to
train.
There are many other TF–DNA binding examples that shows different binding
specificities. For example, MEF2B has been shown to bind in both minor groove and backbone
(176). As aforementioned, p53 binds on Hoogsteen base-pairing DNA sequence at major
groove regions to achieve its binding specificity. One of the future directions is to validate this
previously report binding mechanism, in the meanwhile, if possible, to report new mechanism
for certain TF–DNA binding. Moreover, this framework of representing DNA sequence as 3D
structure could be extended to modeling binding mechanism of other protein–DNA complex,
such as nucleosomal DNA. We could also use 3D electrostatic potential map to represent
macromolecules rather than atomic structures.
72
Chapter 5. Utilize 3D genome structure to predict HM patterns (work
with Nan Hua in Alber lab)
5.1 Introduction
The cell type-specific spatial organization of a genome plays an important role in gene
regulation and disruption of it can lead to disease (139, 177-179). Thanks to recent
improvements of genome-wide chromosome conformation capture methods, such as TCC
(180), single-cell (181) and in situ Hi-C (47), the resolution of contact map of whole-genome
regions has been largely improved. It has been shown that protein-mediated “loops”, which are
genomic regions brought together from distal regions to proximity, are highly related to gene
regulation. For example, around one third of loops bring together known enhancers and known
promoter regions and genes whose promoters are associated with loop regions are more highly
expressed than those whose promoters are not (47). Moreover, chromatin marks, which affect
gene regulation through their ability to affect DNA accessibility and nucleosome stability (76,
97), have been widely examined in 3D genome studies. It is still not clear if HM patterns bring
together long-range contacts or the other way around. On one hand, genomic regions interact
frequently with one another have similar patterns of HM patterns and similar long-range contact
patterns (47). Based on this, a recent study developed an energy landscape model, utilized HM
patterns to reconstruct accurate Hi-C contact maps de novo at 50-kb resolution (182). On the
other hand, nucleosomes in spatial proximity can exchange information (183). If the first
nucleosome is bound by a chromatin-modifying enzymes, the second nucleosome spatially
close to the first one might be subsequently be modified (184-186). However, to our knowledge,
there is very few genome-wide prediction of HM patterns through Hi-C contact map. One
challenge might be how to utilize information of contact patterns of a region to predict its HM
patterns. Here we first regard Hi-C contact map as a graph, with vertex as genomic regions and
73
weights of edges as transformed contact frequency. We develop a graph embedding method to
learn low-dimensional numeric representations of each genomic region (vertex) and
simultaneously preserve Hi-C contact map (weights on edges).
Graph embedding methods have emerged as one of the most effective tools for graph
structural learning, and have achieved state-of-the-art performance in various graph-related
tasks, such as network reconstruction (187, 188) and graph clustering (189, 190). There are
three main categories of graph embedding approaches (191), namely factorization methods,
random walks, and deep learning. Each embedding method maps nodes in a network to low
dimensional feature vectors and preserves certain network properties of interest. Factorization
based methods, such as locally linear embedding (192), graph factorization (193), and
Laplacian Eigenmaps (194), maintain connection strengths in a matrix. Random walk
approaches preserve community information by taking streams of short random walks for each
node and allow computational parallelization. DeepWalk (195) and node2vec (196) are
representatives of this category. Deep learning based methods, such as SDNE (188) have
recently attracted increasing attention due to their ability to capture non-linearity in graphs.
According to a recent survey (191), deep learning based methods outperform methods in other
categories in many tasks, such as network clustering and visualization.
In this study, considering that Hi-C contact maps are unidirectional and homogeneous
graph involving only one type of object (i.e. genomic regions), we made our first attempt to
graph factorization method which is more direct way to achieve graph embedding. We also
further build a DNN-based framework that been successfully applied to buyer-seller bipartitte
network embedding in e-commerce, to effectively learn non-linear network structure. Both
embedding approaches takes Hi-C contacts map as input and output embedding results that will
preserve genomic contacts well in terms of achieving small root mean squared error (RMSE).
74
5.2 Methods
5.2.1 Graph embedding framework
Definitions: Graph Embedding of 3D genome organization. Given a graph of genomic
regions 𝑅 ={𝑟 1
,𝑟 2
,…,𝑟 𝑁 }, 3D genome organization is defined as a graph 𝐺 ={𝑅 ,𝐶 } with vertex
set 𝑅 and edge set 𝐶 ={(𝑟 𝑖 ,𝑟 𝑗 )}
𝑖 ,𝑗 =1
𝑁 as the Hi-C contact map. The graph embedding consists of
a mapping function 𝑓 𝑅 : 𝑟 𝑖 →𝑣 𝑖 ∈ℝ
𝐾 , where 𝐾 ≪𝑁 . Objectives of 𝑓 𝑅 is to preserve implicit graph
structures depicted in Hi-C contact map 𝐶 . Equally speaking, graph embedding finds an
embedding matrix 𝑉 ∈ℝ
𝑁 ×𝐾 (𝐾 ≪𝑁 ) such that 𝐶 ≈𝑉 ∙𝑉 𝑇 .
Graph factorization method is matrix factorization (MF)-based method and learns liner
embedding for vertices. The cost function that quantifies the quality of approximation of graph
embedding is the summation of square of the Euclidean distance between two matrices (the
square of the Frobenius norm of two matrices difference) and a regularizing term of embedding
vectors:
𝑙𝑜𝑠𝑠 = ‖𝐶 −𝑉 𝑉 𝑇 ‖
𝐹 2
+
1
2
𝜆 ‖𝑉 ‖
𝐹 2
= ∑(𝐶 𝑖𝑗
−∑𝑣 𝑖𝑘
𝑣 𝑗𝑘
𝐾 𝑘 =1
)
2
+
1
2
𝜆 𝑁 𝑖 ,𝑗 ∑𝑣 𝑘 ,𝑙 2
𝐾 𝑘 ,𝑙
If we regard missing values in Hi-C contact map as zeros so that 𝐺 becomes a fully-connected
graph and 𝐶 is dense matrix, we can use the following steps to update 𝑉 :
{
𝐼𝑛𝑖𝑡𝑖𝑎𝑡𝑖𝑜𝑛 :𝑉 (0)
=𝑟𝑎𝑛𝑑𝑜𝑚 (𝑁 ,𝐾 )
𝑈𝑝𝑑𝑎𝑡𝑒 𝑎𝑡 𝑠𝑡𝑒𝑝 𝑛 : 𝑉 (𝑛 )
=𝑉 (𝑛 −1)
+𝛼 (4(𝐶 −𝑉 𝑉 𝑇 )𝑉 −𝜆𝑉 )
where 𝛼 is the learning rate of gradient descent.
75
Figure 5 –1 Workflow of use 3D genome structure to predict HM patterns around genomic
regions
If missing values in Hi-C experiments are of low reliability, Hi-C contact map 𝐶 is sparse.
In this case, we utilize sparse data by doing following element-to-element updates of 𝐶 .
{
𝐼𝑛𝑖𝑡𝑖𝑎𝑡𝑖𝑜𝑛 :𝑉 =𝑟𝑎𝑛𝑑𝑜𝑚 (𝑁 ,𝐾 )
𝑈𝑝𝑑𝑎𝑡𝑒 𝑤𝑖𝑡 ℎ 𝐶 𝑖𝑗
(𝑖 ≠𝑗 ): 𝑣 𝑖 (𝑛𝑒𝑤 )
=𝑣 𝑖 +𝛼 (2(𝐶 𝑖𝑗
−𝑣 𝑖 𝑣 𝑗 𝑇 )𝑣 𝑗 −𝜆 𝑣 𝑖 )
𝑣 𝑗 (𝑛𝑒𝑤 )
=𝑣 𝑗 +𝛼 (2(𝐶 𝑖𝑗
−𝑣 𝑗 𝑣 𝑖 𝑇 )𝑣 𝑖 −𝜆 𝑣 𝑗 )
76
Aforementioned, we also developed DNN-based graph embedding method with different loss
function which comprises of three term: reconstruction quality, regularizer for embedding
vectors, and regularizer for the weight of first neural network.
𝑙𝑜𝑠𝑠 =‖𝑉 𝑇 𝑉 −𝐶 ‖
𝐹 2
+
1
2
𝛼 1
‖𝑊 1
‖
𝐹 2
+
1
2
𝛼 2
‖𝑉 ‖
𝐹 2
Note that this DNN-based graph embedding method can be applied to both fully-connected and
sparse Hi-C contact map. A workflow incorporating this method is in Figure 5–1. After graph
embedding step to obtain low-dimensional numeric representation of genomic regions, we can
either characterize these genome regions and predict HM patterns around them or do graph
clustering on these genomic regions to gain more biological insights in 3D genome organization.
5.3 Preliminary results
5.3.1 Numeric representations of genomic regions preserve Hi-C contact map
To test the effectiveness of above-described graph embedding methods, we
downloaded Hi-C contact frequency at 10-kb resolution of chromosome 2 from Rao et al. 2014.
At this resolution, chromosome 2 contains 24320 genomic regions (Figure 5–2A). We first
normalized raw contact counts by equalizing row sums and column sums, and by applying
logarithm function to each positive counts. After implementing MF-based graph embedding
methods, we reconstructed Hi-C contact map with MSE of 0.001 (Figure 5–2B, C). Both short-
distance and distal contacts are roughly well preserved. To validate that loops regions are
captured well, we downloaded identified loops list from Rao et al. 2014 which contains 706 loop
regions for chromosome 2 in the GM12878 cell line. Intuitively, regions (region 1 and region 2)
involved in each loop should have larger similarity compared to random regions with equal
genomic distance. We permutated random regions 10 times and found significant result that
proves this intuition (Figure 5–2D).
77
Figure 5 –2 Application of MF-based graph embedding methods on chromosome 2 Hi-C
contact data in the GM12878 cell line and consequent performance evaluation
(A) contact map of chromosome 2 detected by Hi-C experiment in the GM12878 cell type. From
the original contact frequency, row sum and column sum are normalized to same value, and a
logarithm function has been applied to each cell for further 0-1 normalization. Regarding contact
map in (A) as a fully-connected graph, different approaches to implement gradient descent have
achieved different results in (B) and (C), respectively. (D) based on embedding results on (C),
genomic regions involved in loops have significantly larger similarities than random regions with
same genomic distances.
78
5.3.2 HM patterns at regions involved in a loop are similar
After observing that regions involved in a loop have larger similarity in terms of graph
embedding results, we were interested in whether these regions have similar HM patterns. To
investigate this, we downloaded ChIP-seq data for 10 HM patterns from the ENCODE database
and examined HM patterns at both regions involved in a loop. Since these regions are detected
at different resolutions and cover genomic regions of 5 kb or 10 kb, we took an average of HM
signals at each nucleotide covered by each region. With a few exceptions, two regions involved
in a loop generally have similar HM patterns (Figure 5–3). This observation encourages us to do
further modeling of HM patterns based on graph embedding results.
79
Figure 5 –3 Two regions involved in a loop have similar HM patterns
Left panel displays HM patterns in region 1 of a loop and the results of region 2, which forms a
loop with region 1, are displayed at the same level of y axis in right panel. Note that these
regions are ordered by the H3K27ac level from high to low in the left panel for better
visualization.
5.4 Future work
With our current progress in numeric representations of genomic regions by preserving
Hi-C contact frequencies, we first need to fix one approach to do graph embedding. We are also
very interested in downstream analyses based on these numeric presentations, such as
clustering genomic regions and see if new compartments or genomic organization can be found
or projecting genomic regions into 3D latent space to visualize 3D distributions of these genomic
regions. In terms of modeling, we will develop an effective pipeline to predict HM patterns
around genomic regions at 10-kb resolution.
80
References
1. Lambert,S.A., Jolma,A., Campitelli,L.F., Das,P.K., Yin,Y., Albu,M., Chen,X., Taipale,J.,
Hughes,T.R. and Weirauch,M.T. (2018) The Human Transcription Factors. Cell, 172, 650–
665.
2. Lee,T.I. and Young,R.A. (2013) Transcriptional regulation and its misregulation in disease.
Cell, 152, 1237–1251.
3. Pearson,J.C., Lemons,D. and McGinnis,W. (2005) Modulating Hox gene functions during
animal body patterning. Nature Reviews Genetics 2014 15:4, 6, 893–904.
4. Singh,H., Khan,A.A. and Dinner,A.R. (2014) Gene regulatory networks in the immune system.
Trends Immunol., 35, 211–218.
5. Fong,A.P. and Tapscott,S.J. (2013) Skeletal muscle programming and re-programming.
Current Opinion in Genetics & Development, 23, 568–573.
6. Takahashi,K. and Yamanaka,S. (2016) A decade of transcription factor-mediated
reprogramming to pluripotency. Nature Reviews Molecular Cell Biology 2017 19:3, 17, 183–
193.
7. Deplancke,B., Alpern,D. and Gardeux,V. (2016) The Genetics of Transcription Factor DNA
Binding Variation. Cell, 166, 538–554.
8. Tournamille,C., Colin,Y., Cartron,J.P. and Le Van Kim,C. (1995) Disruption of a GATA motif
in the Duffy gene promoter abolishes erythroid gene expression in Duffy–negative
individuals. Nature Genetics 2013 45:10, 10, 224–228.
9. Martin,D.I., Tsai,S.F. and Orkin,S.H. (1989) Increased gamma-globin expression in a
nondeletion HPFH mediated by an erythroid-specific DNA-binding factor. Nature 2012
486:7404, 338, 435–438.
10. Jolma,A., Yin,Y., Nitta,K.R., Dave,K., Popov,A., Taipale,M., Enge,M., Kivioja,T.,
Morgunova,E. and Taipale,J. (2015) DNA-dependent formation of transcription factor pairs
alters their binding specificity. Nature 2012 486:7404, 527, 384–388.
11. Geertz,M., Shore,D. and Maerkl,S.J. (2012) Massively parallel measurements of molecular
interaction kinetics on a microfluidic platform. Proc. Natl. Acad. Sci. U.S.A., 109, 16540–
16545.
12. Landolin,J.M., Johnson,D.S., Trinklein,N.D., Aldred,S.F., Medina,C., Shulha,H., Weng,Z.
and Myers,R.M. (2010) Sequence features that drive human promoter function and tissue
specificity. Genome Res., 20, 890–898.
13. Spitz,F. and Furlong,E.E.M. (2012) Transcription factors: from enhancer binding to
developmental control. Nature Reviews Genetics 2014 15:4, 13, 613–626.
14. Wang,J., Zhuang,J., Iyer,S., Lin,X., Whitfield,T.W., Greven,M.C., Pierce,B.G., Dong,X.,
81
Kundaje,A., Cheng,Y., et al. (2012) Sequence features and chromatin structure around the
genomic regions bound by 119 human transcription factors. Genome Res., 22, 1798–1812.
15. Levo,M., Zalckvar,E., Sharon,E., Dantas Machado,A.C., Kalma,Y., Lotam-Pompan,M.,
Weinberger,A., Yakhini,Z., Rohs,R. and Segal,E. (2015) Unraveling determinants of
transcription factor binding outside the core binding site. Genome Res., 25, 1018–1029.
16. Slattery,M., Zhou,T., Yang,L., Dantas Machado,A.C., Gordân,R. and Rohs,R. (2014)
Absence of a simple code: how transcription factors read the genome. Trends in
Biochemical Sciences, 39, 381–399.
17. Mathelier,A., Xin,B., Chiu,T.-P., Yang,L., Rohs,R. and Wasserman,W.W. (2016) DNA Shape
Features Improve Transcription Factor Binding Site Predictions In Vivo. Cell Systems, 3,
278–286.e4.
18. Zentner,G.E., Kasinathan,S., Xin,B., Rohs,R. and Henikoff,S. (2015) ChEC-seq kinetics
discriminates transcription factor binding sites by DNA sequence and shape
in vivo
.
Nature Communications 2015 6, 6, 8733.
19. Zhao,Y. and Stormo,G.D. (2011) Quantitative analysis demonstrates most transcription
factors require only simple models of specificity. Nature Biotechnology 2015 33:8, 29, 480–
483.
20. Gordân,R., Shen,N., Dror,I., Zhou,T., Horton,J., Rohs,R. and Bulyk,M.L. (2013) Genomic
Regions Flanking E-Box Binding Sites Influence DNA Binding Specificity of bHLH
Transcription Factors through DNA Shape. Cell Reports, 3, 1093–1104.
21. Zhou,T., Shen,N., Yang,L., Abe,N., Horton,J., Mann,R.S., Bussemaker,H.J., Gordân,R. and
Rohs,R. (2015) Quantitative modeling of transcription factor binding specificities using DNA
shape. PNAS, 112, 4654–4659.
22. Abe,N., Dror,I., Yang,L., Slattery,M., Zhou,T., Bussemaker,H.J., Rohs,R. and Mann,R.S.
(2015) Deconvolving the recognition of DNA shape from sequence. Cell, 161, 307–318.
23. Yang,L., Orenstein,Y., Jolma,A., Yin,Y., Taipale,J., Shamir,R. and Rohs,R. (2017)
Transcription factor family ‐specific DNA shape readout revealed by quantitative specificity
models. Molecular Systems Biology, 13, 910.
24. Dror,I., Rohs,R. and Gutfreund,Y.M. (2016) How motif environment influences transcription
factor search dynamics: Finding a needle in a haystack. BioEssays, 38, 605–612.
25. Inukai,S., Kock,K.H. and Bulyk,M.L. (2017) Transcription factor–DNA binding: beyond
binding site motifs. Current Opinion in Genetics & Development, 43, 110–119.
26. Song,L., Zhang,Z., Grasfeder,L.L., Boyle,A.P., Giresi,P.G., Lee,B.-K., Sheffield,N.C., Gräf,S.,
Huss,M., Keefe,D., et al. (2011) Open chromatin defined by DNaseI and FAIRE identifies
regulatory elements that shape cell-type identity. Genome Res., 21, 1757–1767.
27. Rohs,R., Jin,X., West,S.M., Joshi,R., Honig,B. and Mann,R.S. (2010) Origins of Specificity
in Protein-DNA Recognition. http://dx.doi.org/10.1146/annurev-biochem-060408-091030, 79,
233–269.
82
28. Levo,M. and Segal,E. (2014) In pursuit of design principles of regulatory sequences. Nature
Reviews Genetics 2014 15:4, 15, 453–468.
29. Kornberg,R.D. and Lorch,Y. (1999) Twenty-Five Years of the Nucleosome, Fundamental
Particle of the Eukaryote Chromosome. Cell, 98, 285–294.
30. Pique-Regi,R., Degner,J.F., Pai,A.A., Gaffney,D.J., Gilad,Y. and Pritchard,J.K. (2011)
Accurate inference of transcription factor binding from DNA sequence and chromatin
accessibility data. Genome Res., 21, 447–455.
31. He,Y., Gorkin,D.U., Dickel,D.E., Nery,J.R., Castanon,R.G., Lee,A.Y., Shen,Y., Visel,A.,
Pennacchio,L.A., Ren,B., et al. (2017) Improved regulatory element prediction based on
tissue-specific local epigenomic signatures. PNAS, 114, E1633–E1640.
32. Zhu,J., Adli,M., Zou,J.Y., Verstappen,G., Coyne,M., Zhang,X., Durham,T., Miri,M.,
Deshpande,V., De Jager,P.L., et al. (2013) Genome-wide Chromatin State Transitions
Associated with Developmental and Environmental Cues. Cell, 152, 642–654.
33. Rohs,R., West,S.M., Sosinsky,A., Liu,P., Mann,R.S. and Honig,B. (2009) The role of DNA
shape in protein–DNA recognition. Nature 2012 486:7404, 461, 1248–1253.
34. Joshi,R., Passner,J.M., Rohs,R., Jain,R., Sosinsky,A., Crickmore,M.A., Jacob,V.,
Aggarwal,A.K., Honig,B. and Mann,R.S. (2007) Functional Specificity of a Hox Protein
Mediated by the Recognition of Minor Groove Structure. Cell, 131, 530–543.
35. Crothers,D.M., Haran,T.E. and Nadeau,J.G. (1990) Intrinsically bent DNA. J. Biol. Chem.,
265, 7093–7096.
36. Chiu,T.-P., Rao,S., Mann,R.S., Honig,B. and Rohs,R. (2017) Genome-wide prediction of
minor-groove electrostatic potential enables biophysical modeling of protein-DNA binding.
Nucleic Acids Res, 45, 12565–12576.
37. Rao,S., Chiu,T.-P., Kribelbauer,J.F., Mann,R.S., Bussemaker,H.J. and Rohs,R. (2018)
Systematic prediction of DNA shape changes due to CpG methylation explains epigenetic
effects on protein–DNA binding. Epigenetics & Chromatin, 11, 9.
38. Lazarovici,A., Zhou,T., Shafer,A., Dantas Machado,A.C., Riley,T.R., Sandstrom,R.,
Sabo,P.J., Lu,Y., Rohs,R., Stamatoyannopoulos,J.A., et al. (2013) Probing DNA shape and
methylation state on a genomic scale with DNase I. Proc. Natl. Acad. Sci. U.S.A., 110,
6376–6381.
39. Kribelbauer,J.F., Laptenko,O., Chen,S., Martini,G.D., Freed-Pastor,W.A., Prives,C.,
Mann,R.S. and Bussemaker,H.J. (2017) Quantitative Analysis of the DNA Methylation
Sensitivity of Transcription Factor Complexes. Cell Reports, 19, 2383–2395.
40. Ren,B., Robert,F., Wyrick,J.J., Aparicio,O., Jennings,E.G., Simon,I., Zeitlinger,J.,
Schreiber,J., Hannett,N., Kanin,E., et al. (2000) Genome-wide location and function of DNA
binding proteins. Science, 290, 2306–2309.
41. Johnson,D.S., Mortazavi,A., Myers,R.M. and Wold,B. (2007) Genome-wide mapping of in
vivo protein-DNA interactions. Science, 316, 1497–1502.
83
42. Rhee,H.S. and Pugh,B.F. (2011) Comprehensive genome-wide protein-DNA interactions
detected at single-nucleotide resolution. Cell, 147, 1408–1419.
43. Hurd,P.J. and Nelson,C.J. (2009) Advantages of next-generation sequencing versus the
microarray in epigenetic research. Brief Funct Genomic Proteomic, 8, 174–183.
44. Guelen,L., Pagie,L., Brasset,E., Meuleman,W., Faza,M.B., Talhout,W., Eussen,B.H., de
Klein,A., Wessels,L., de Laat,W., et al. (2008) Domain organization of human chromosomes
revealed by mapping of nuclear lamina interactions. Nature 2012 486:7404, 453, 948–951.
45. de Wit,E. and de Laat,W. (2012) A decade of 3C technologies: insights into nuclear
organization. Genes Dev., 26, 11–24.
46. Dekker,J., Marti-Renom,M.A. and Mirny,L.A. (2013) Exploring the three-dimensional
organization of genomes: interpreting chromatin interaction data. Nature Reviews Genetics
2014 15:4, 14, 390–403.
47. Rao,S.S.P., Huntley,M.H., Durand,N.C., Stamenova,E.K., Bochkov,I.D., Robinson,J.T.,
Sanborn,A.L., Machol,I., Omer,A.D., Lander,E.S., et al. (2014) A 3D Map of the Human
Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell, 159, 1665–
1680.
48. Hesselberth,J.R., Chen,X., Zhang,Z., Sabo,P.J., Sandstrom,R., Reynolds,A.P.,
Thurman,R.E., Neph,S., Kuehn,M.S., Noble,W.S., et al. (2009) Global mapping of protein-
DNA interactions
in vivo
by digital genomic footprinting. Nat. Methods, 6, 283–289.
49. Giresi,P.G., Kim,J., McDaniell,R.M., Iyer,V.R. and Lieb,J.D. (2007) FAIRE (Formaldehyde-
Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human
chromatin. Genome Res., 17, 877–885.
50. Buenrostro,J.D., Giresi,P.G., Zaba,L.C., Chang,H.Y. and Greenleaf,W.J. (2013)
Transposition of native chromatin for fast and sensitive epigenomic profiling of open
chromatin, DNA-binding proteins and nucleosome position. Nat. Methods, 10, 1213–1218.
51. Cui,K. and Zhao,K. (2012) Genome-wide approaches to determining nucleosome
occupancy in metazoans using MNase-Seq. Methods Mol. Biol., 833, 413–419.
52. Ernst,J. and Kellis,M. (2013) Interplay between chromatin state, regulator binding, and
regulatory motifs in six human cell types. Genome Res., 23, 1142–1154.
53. Guccione,E., Martinato,F., Finocchiaro,G., Luzi,L., Tizzoni,L., Dall' Olio,V., Zardo,G.,
Nervi,C., Bernard,L. and Amati,B. (2006) Myc-binding-site recognition in the human genome
is determined by chromatin context. Nature Cell Biology 2006 8:7, 8, 764–770.
54. Razin,A. and Riggs,A.D. (1980) DNA methylation and gene function. Science, 210, 604–610.
55. Iguchi-Ariga,S.M. and Schaffner,W. (1989) CpG methylation of the cAMP-responsive
enhancer/promoter sequence TGACGTCA abolishes specific factor binding as well as
transcriptional activation. Genes Dev., 3, 612–619.
56. Boyes,J. and Bird,A. (1991) DNA methylation inhibits transcription indirectly via a methyl-
84
CpG binding protein. Cell, 64, 1123–1134.
57. Berger,M.F., Philippakis,A.A., Qureshi,A.M., He,F.S., Estep,P.W. and Bulyk,M.L. (2006)
Compact, universal DNA microarrays to comprehensively determine transcription-factor
binding site specificities. Nature Biotechnology 2015 33:8, 24, 1429–1435.
58. Wong,D., Teixeira,A., Oikonomopoulos,S., Humburg,P., Lone,I.N., Saliba,D., Siggers,T.,
Bulyk,M., Angelov,D., Dimitrov,S., et al. (2011) Extensive characterization of NF-κB binding
uncovers non-canonical motifs and advances the interpretation of genetic functional traits.
Genome Biol., 12, R70.
59. Slattery,M., Riley,T., Liu,P., Abe,N., Gomez-Alcala,P., Dror,I., Zhou,T., Rohs,R., Honig,B.,
Bussemaker,H.J., et al. (2011) Cofactor Binding Evokes Latent Differences in DNA Binding
Specificity between Hox Proteins. Cell, 147, 1270–1282.
60. Jolma,A., Kivioja,T., Toivonen,J., Cheng,L., Wei,G., Enge,M., Taipale,M., Vaquerizas,J.M.,
Yan,J., Sillanpää,M.J., et al. (2010) Multiplexed massively parallel SELEX for
characterization of human transcription factor binding specificities. Genome Res., 20, 861–
873.
61. Zhao,Y., Granas,D. and Stormo,G.D. (2009) Inferring binding energies from selected
binding sites. PLoS Comput Biol, 5, e1000590.
62. Schneider,T.D. and Stephens,R.M. (1990) Sequence logos: a new way to display
consensus sequences. Nucleic Acids Res, 18, 6097–6100.
63. Stormo,G.D. and Zhao,Y. (2010) Determining the specificity of protein-DNA interactions.
Nature Reviews Genetics 2014 15:4, 11, 751–760.
64. Foat,B.C., Morozov,A.V. and Bussemaker,H.J. (2006) Statistical mechanical modeling of
genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics, 22,
e141–9.
65. Ruan,S., Swamidass,S.J. and Stormo,G.D. (2017) BEESEM: estimation of binding energy
models using HT-SELEX data. Bioinformatics, 33, 2288–2295.
66. Patel,R.Y. and Stormo,G.D. (2014) Discriminative motif optimization based on perceptron
training. Bioinformatics, 30, 941–948.
67. Zhang,L., Martini,G.D., Rube,H.T., Kribelbauer,J.F., Rastogi,C., FitzPatrick,V.D.,
Houtman,J.C., Bussemaker,H.J. and Pufall,M.A. (2018) SelexGLM differentiates androgen
and glucocorticoid receptor DNA-binding preference over an extended binding site.
Genome Res., 28, 111–121.
68. Badis,G., Berger,M.F., Philippakis,A.A., Talukder,S., Gehrke,A.R., Jaeger,S.A., Chan,E.T.,
Metzler,G., Vedenko,A., Chen,X., et al. (2009) Diversity and Complexity in DNA Recognition
by Transcription Factors. Science, 324, 1720–1723.
69. Jolma,A., Yan,J., Whitington,T., Toivonen,J., Nitta,K.R., Rastas,P., Morgunova,E., Enge,M.,
Taipale,M., Wei,G., et al. (2013) DNA-binding specificities of human transcription factors.
Cell, 152, 327–339.
85
70. Zhao,Y., Ruan,S., Pandey,M. and Stormo,G.D. (2012) Improved models for transcription
factor binding site identification using nonindependent interactions. Genetics, 191, 781–790.
71. Rastogi,C., Rube,H.T., Kribelbauer,J.F., Crocker,J., Loker,R.E., Martini,G.D., Laptenko,O.,
Freed-Pastor,W.A., Prives,C., Stern,D.L., et al. (2018) Accurate and sensitive quantification
of protein-DNA binding affinity. Proc. Natl. Acad. Sci. U.S.A., 115, E3692–E3701.
72. Olson,W.K., Gorin,A.A., Lu,X.J., Hock,L.M. and Zhurkin,V.B. (1998) DNA sequence-
dependent deformability deduced from protein-DNA crystal complexes. PNAS, 95, 11163–
11168.
73. Alipanahi,B., Delong,A., Weirauch,M.T. and Frey,B.J. (2015) Predicting the sequence
specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology
2015 33:8, 33, 831–838.
74. Benveniste,D., Sonntag,H.-J., Sanguinetti,G. and Sproul,D. (2014) Transcription factor
binding predicts histone modifications in human cell lines. PNAS, 111, 13367–13372.
75. Liu,L., Jin,G., research,X.Z.N.A.2015 Modeling the relationship of epigenetic modifications
to transcription factor binding. academic.oup.com
76. Lu,X., Simon,M.D., Chodaparambil,J.V., Hansen,J.C., Shokat,K.M. and Luger,K. (2008) The
effect of H3K79 dimethylation and H4K20 trimethylation on nucleosome and chromatin
structure. Nature Structural & Molecular Biology 2013 20:3, 15, 1122–1124.
77. Glatt,S., Alfieri,C. and Müller,C.W. (2011) Recognizing and remodeling the nucleosome.
Current Opinion in Structural Biology, 21, 335–341.
78. Consortium,T.E.P. (2012) An integrated encyclopedia of DNA elements in the human
genome. Nature 2012 486:7404, 489, 57–74.
79. Bernstein,B.E., Kamal,M., Lindblad-Toh,K., Bekiranov,S., Bailey,D.K., Huebert,D.J.,
McMahon,S., Karlsson,E.K., Kulbokas,E.J.,III, Gingeras,T.R., et al. (2005) Genomic Maps
and Comparative Analysis of Histone Modifications in Human and Mouse. Cell, 120, 169–
181.
80. Grant,C.E., Bailey,T.L., Bioinformatics,W.N.2011 FIMO: scanning for occurrences of a given
motif. academic.oup.com
81. Boyle,A.P., Song,L., Lee,B.-K., London,D., Keefe,D., Birney,E., Iyer,V.R., Crawford,G.E.
and Furey,T.S. (2011) High-resolution genome-wide in vivo footprinting of diverse
transcription factors in human cells. Genome Res., 21, 456–464.
82. Finn,R.D., Bateman,A., Clements,J., acids,P.C.N.2013 Pfam: the protein families database.
academic.oup.com
83. Laity,J.H., Lee,B.M. and Wright,P.E. (2001) Zinc finger proteins: new insights into structural
and functional diversity. Current Opinion in Structural Biology, 11, 39–46.
84. Zaret,K.S. and Carroll,J.S. (2011) Pioneer transcription factors: establishing competence for
gene expression. Genes Dev., 25, 2227–2241.
86
85. Magnani,L., Eeckhoute,J. and Lupien,M. (2011) Pioneer factors: directing transcriptional
regulators within the chromatin environment. Trends in Genetics, 27, 465–474.
86. Vernimmen,D. and Bickmore,W.A. (2015) The Hierarchy of Transcriptional Activation: From
Enhancer to Promoter. Trends in Genetics, 31, 696–708.
87. Anguita,E., Sharpe,J.A., Sloane-Stanley,J.A., Tufarelli,C., Higgs,D.R. and Wood,W.G. (2002)
Deletion of the mouse α-globin regulatory element (HS −26) has an unexpectedly mild
phenotype. Blood, 100, 3450–3456.
88. Oldfield,A.J., Yang,P., Conway,A.E., Cinghu,S., Freudenberg,J.M., Yellaboina,S. and
Jothi,R. (2014) Histone-Fold Domain Protein NF-Y Promotes Chromatin Accessibility for
Cell Type-Specific Master Transcription Factors. Molecular Cell, 55, 708–722.
89. Wang,S., Linde,M.H., Munde,M., Carvalho,V.D., Wilson,W.D. and Poon,G.M.K. (2014)
Mechanistic heterogeneity in site recognition by the structurally homologous DNA-binding
domains of the ETS family transcription factors Ets-1 and PU.1. J. Biol. Chem., 289, 21605–
21616.
90. Budry,L., Balsalobre,A., Gauthier,Y., Khetchoumian,K., L'Honoré,A., Vallette,S., Brue,T.,
Figarella-Branger,D., Meij,B. and Drouin,J. (2012) The selector gene Pax7 dictates
alternate pituitary cell fates through its pioneer action on chromatin remodeling. Genes Dev.,
26, 2299–2310.
91. Fleming,J.D., Pavesi,G., Benatti,P., Imbriano,C., Mantovani,R. and Struhl,K. (2013) NF-Y
coassociates with FOS at promoters, enhancers, repetitive elements, and inactive
chromatin regions, and is stereo-positioned with growth-controlling transcription factors.
Genome Res., 23, 1195–1209.
92. Dror,I., Golan,T., Levy,C., Rohs,R. and Mandel-Gutfreund,Y. (2015) A widespread role of
the motif environment in transcription factor binding across diverse protein families.
Genome Res., 25, 1268–1280.
93. Chiu,T.P., Rao,S., Mann,R.S., acids,B.H.N.2017 Genome-wide prediction of minor-groove
electrostatic potential enables biophysical modeling of protein–DNA binding.
academic.oup.com
94. Grubert,F., Zaugg,J.B., Kasowski,M., Ursu,O., Spacek,D.V., Martin,A.R., Greenside,P.,
Srivas,R., Phanstiel,D.H., Pekowska,A., et al. (2015) Genetic Control of Chromatin States in
Humans Involves Local and Distal Chromosomal Interactions. Cell, 162, 1051–1065.
95. Ziller,M.J., Edri,R., Yaffe,Y., Donaghey,J., Pop,R., Mallard,W., Issner,R., Gifford,C.A.,
Goren,A., Xing,J., et al. (2015) Dissecting neural differentiation regulatory networks through
epigenetic footprinting. Nature 2012 486:7404, 518, 355–359.
96. Brogaard,K., Xi,L., Wang,J.-P. and Widom,J. (2012) A map of nucleosome positions in yeast
at base-pair resolution. Nature 2012 486:7404, 486, 496–501.
97. Henikoff,S. and Shilatifard,A. (2011) Histone modification: cause or cog? Trends in Genetics,
27, 389–396.
87
98. Rajkumar,A.S., Dénervaud,N. and Maerkl,S.J. (2013) Mapping the fine structure of a
eukaryotic promoter input-output function. Nature Genetics 2013 45:10, 45, 1207–1215.
99. Aow,J.S.Z., Xue,X., Run,J.-Q., Lim,G.F.S., Goh,W.S. and Clarke,N.D. (2013) Differential
binding of the related transcription factors Pho4 and Cbf1 can tune the sensitivity of
promoters to different levels of an induction signal. Nucleic Acids Res, 41, 4877–4887.
100. Zaret,K.S. and Mango,S.E. (2016) Pioneer transcription factors, chromatin dynamics, and
cell fate control. Current Opinion in Genetics & Development, 37, 76–81.
101. Robertson,A.G., Bilenky,M., Tam,A., Zhao,Y., Zeng,T., Thiessen,N., Cezard,T., Fejes,A.P.,
Wederell,E.D., Cullum,R., et al. (2008) Genome-wide relationship between histone H3
lysine 4 mono- and tri-methylation and transcription factor binding. Genome Res., 18, 1906–
1917.
102. Ehret,G.B., Reichenbach,P., Biological,U.S.J.O.2000 DNA binding specificity of different
STAT proteins: comparison of in vitro specificity with natural target sites. ASBMB
103. Lee,J.S., Galvin,K.M., See,R.H., Eckner,R., Livingston,D., Moran,E. and Shi,Y. (1995)
Relief of YY1 transcriptional repression by adenovirus E1A is mediated by E1A-associated
protein p300. Genes Dev., 9, 1188–1198.
104. Austen,M., Lüscher,B. and Lüscher-Firzlaff,J.M. (1997) Characterization of the
Transcriptional Regulator YY1 THE BIPARTITE TRANSACTIVATION DOMAIN IS
INDEPENDENT OF INTERACTION WITH THE TATA BOX-BINDING PROTEIN,
TRANSCRIPTION FACTOR IIB, TAFII55, OR cAMP-RESPONSIVE ELEMENT-BINDING
PROTEIN (CBP)-BINDING PROTEIN. J. Biol. Chem., 272, 1709–1717.
105. Yang,W.-M., Inouye,C., Zeng,Y., Bearss,D. and Seto,E. (1996) Transcriptional repression
by YY1 is mediated by interaction with a mammalian homolog of the yeast
global regulator RPD3. PNAS, 93, 12845–12850.
106. Peach,S.E., Rudomin,E.L., Udeshi,N.D., Cellular,S.C.M.2012 Quantitative assessment of
chromatin immunoprecipitation grade antibodies directed against histone modifications
reveals patterns of co-occurring marks on histone …. ASBMB
107. Du,Z., Li,H., Wei,Q., Zhao,X., Wang,C., Zhu,Q., Yi,X., Xu,W., Liu,X.S., Jin,W., et al. (2013)
Genome-Wide Analysis of Histone Modifications: H3K4me2, H3K4me3, H3K9ac, and
H3K27ac in Oryza sativa L. Japonica. Molecular Plant, 6, 1463–1472.
108. Creyghton,M.P., Cheng,A.W., Welstead,G.G., Kooistra,T., Carey,B.W., Steine,E.J.,
Hanna,J., Lodato,M.A., Frampton,G.M., Sharp,P.A., et al. (2010) Histone H3K27ac
separates active from poised enhancers and predicts developmental state. PNAS, 107,
21931–21936.
109. Rada-Iglesias,A., Bajpai,R., Swigut,T., Brugmann,S.A., Flynn,R.A. and Wysocka,J. (2011)
A unique chromatin signature uncovers early developmental enhancers in humans. Nature
2012 486:7404, 470, 279–283.
110. Wei,G., Wei,L., Zhu,J., Zang,C., Hu-Li,J., Yao,Z., Cui,K., Kanno,Y., Roh,T.-Y.,
Watford,W.T., et al. (2009) Global Mapping of H3K4me3 and H3K27me3 Reveals
88
Specificity and Plasticity in Lineage Fate Determination of Differentiating CD4+ T Cells.
Immunity, 30, 155–167.
111. Mann,R.S., Lelli,K.M. and Joshi,R. (2009) Chapter 3 Hox Specificity: Unique Roles for
Cofactors and Collaborators. Current Topics in Developmental Biology, 88, 63–101.
112. Bell,O., Schwaiger,M., Oakeley,E.J., Lienert,F., Beisel,C., Stadler,M.B. and Schübeler,D.
(2010) Accessibility of the
Drosophila
genome discriminates PcG repression, H4K16
acetylation and replication timing. Nature Structural & Molecular Biology 2013 20:3, 17,
894–900.
113. Canzio,D., Chang,E.Y., Shankar,S., Kuchenbecker,K.M., Simon,M.D., Madhani,H.D.,
Narlikar,G.J. and Al-Sady,B. (2011) Chromodomain-Mediated Oligomerization of HP1
Suggests a Nucleosome-Bridging Mechanism for Heterochromatin Assembly. Molecular
Cell, 41, 67–81.
114. Anderson,J.D., Lowary,P.T. and Widom,J. (2001) Effects of histone acetylation on the
equilibrium accessibility of nucleosomal DNA target sites. Journal of Molecular Biology, 307,
977–985.
115. Afek,A., Schipper,J.L., Horton,J., Gordân,R. and Lukatsky,D.B. (2014) Protein-DNA
binding in the absence of specific base-pair recognition. Proc. Natl. Acad. Sci. U.S.A., 111,
17140–17145.
116. Tsankov,A.M., Gu,H., Akopian,V., Ziller,M.J., Donaghey,J., Amit,I., Gnirke,A. and
Meissner,A. (2015) Transcription factor binding dynamics during human ES cell
differentiation. Nature 2012 486:7404, 518, 344–349.
117. Arnold,P., Schöler,A., Pachkov,M., Balwierz,P.J., Jørgensen,H., Stadler,M.B., van
Nimwegen,E. and Schübeler,D. (2013) Modeling of epigenome dynamics identifies
transcription factors that mediate Polycomb targeting. Genome Res., 23, 60–73.
118. Liu,L., Zhao,W., research,X.Z.N.A.2015 Modeling co-occupancy of transcription factors
using chromatin features. academic.oup.com
119. Grewal,S.I.S. and Moazed,D. (2003) Heterochromatin and Epigenetic Control of Gene
Expression. Science, 301, 798–802.
120. Huisinga,K.L., Brower-Toland,B. and Elgin,S.C.R. (2006) The contradictory definitions of
heterochromatin: transcription and silencing. Chromosoma, 115, 110–122.
121. Shlyueva,D., Stampfel,G. and Stark,A. (2014) Transcriptional enhancers: from properties
to genome-wide predictions. Nature Reviews Genetics 2014 15:4, 15, 272–286.
122. Fyodorov,D.V., Zhou,B.-R., Skoultchi,A.I. and Bai,Y. (2018) Emerging roles of linker
histones in regulating chromatin structure and function. Nature Reviews Molecular Cell
Biology 2017 19:3, 19, 192–206.
123. Lelli,K.M., Slattery,M. and Mann,R.S. (2012) Disentangling the Many Layers of Eukaryotic
Transcriptional Regulation. http://dx.doi.org/10.1146/annurev-genet-110711-155437, 46,
43–68.
89
124. Shogren-Knaak,M., Ishii,H., Sun,J.-M., Pazin,M.J., Davie,J.R. and Peterson,C.L. (2006)
Histone H4-K16 Acetylation Controls Chromatin Structure and Protein Interactions. Science,
311, 844–847.
125. Rothbart,S.B. and Strahl,B.D. (2014) Interpreting the language of histone and DNA
modifications. Biochimica et Biophysica Acta (BBA) - Gene Regulatory Mechanisms, 1839,
627–643.
126. Gupta,P., Zlatanova,J. and Tomschik,M. (2009) Nucleosome Assembly Depends on the
Torsion in the DNA Molecule: A Magnetic Tweezers Study. Biophysical Journal, 97, 3150–
3157.
127. Kouzine,F., Gupta,A., Baranello,L., Wojtowicz,D., Ben-Aissa,K., Liu,J., Przytycka,T.M. and
Levens,D. (2013) Transcription-dependent dynamic supercoiling is a short-range genomic
force. Nature Structural & Molecular Biology 2013 20:3, 20, 396–403.
128. Teves,S.S., Weber,C.M. and Henikoff,S. (2014) Transcribing through the nucleosome.
Trends in Biochemical Sciences, 39, 577–586.
129. Levo,M., Avnit-Sagi,T., Lotan-Pompan,M., Kalma,Y., Weinberger,A., Yakhini,Z. and
Segal,E. (2017) Systematic Investigation of Transcription Factor Activity in the Context of
Chromatin Using Massively Parallel Binding and Expression Assays. Molecular Cell, 65,
604–617.e6.
130. Ben Langmead, Trapnell,C., Pop,M. and Salzberg,S.L. (2009) Ultrafast and memory-
efficient alignment of short DNA sequences to the human genome. Genome Biol., 10, R25.
131. Quinlan,A.R., Bioinformatics,I.H.2010 BEDTools: a flexible suite of utilities for comparing
genomic features. academic.oup.com
132. Chiu,T.P., Comoglio,F., Zhou,T., Yang,L., Paro,R.2015 DNAshapeR: an R/Bioconductor
package for DNA shape prediction and feature encoding. academic.oup.com
133. Sing,T., Sander,O., Beerenwinkel,N., Bioinformatics,T.L.2005 ROCR: visualizing classifier
performance in R. academic.oup.com
134. Pohl,A., Bioinformatics,M.B.2014 bwtool: a tool for bigWig files. academic.oup.com
135. Gaudet,J. and Mango,S.E. (2002) Regulation of Organogenesis by the Caenorhabditis
elegans FoxA Protein PHA-4. Science, 295, 821–825.
136. Crocker,J., Abe,N., Rinaldi,L., McGregor,A.P., Frankel,N., Wang,S., Alsawadi,A., Valenti,P.,
Plaza,S., Payre,F., et al. (2015) Low affinity binding site clusters confer hox specificity and
regulatory robustness. Cell, 160, 191–203.
137. Li,X.-Y., MacArthur,S., Bourgon,R., Nix,D., Pollard,D.A., Iyer,V.N., Hechmer,A.,
Simirenko,L., Stapleton,M., Luengo Hendriks,C.L., et al. (2008) Transcription factors bind
thousands of active and inactive regions in the Drosophila blastoderm. PLoS Biol., 6, e27.
138. Tanay,A. (2006) Extensive low-affinity transcriptional interactions in the yeast genome.
Genome Res., 16, 962–972.
90
139. Wang,J., Malecka,A., Trøen,G. and Delabie,J. (2015) Comprehensive genome-wide
transcription factor analysis reveals that a combination of high affinity and low affinity DNA
binding is needed for human gene regulation. BMC Genomics 2015 16:7, 16, S12.
140. Ruan,S. and Stormo,G.D. (2017) Inherent limitations of probabilistic models for protein-
DNA binding specificity. PLoS Comput Biol, 13, e1005638.
141. Stormo,G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics, 16,
16–23.
142. Stormo,G.D. (2013) Modeling the specificity of protein-DNA interactions. Quant Biol, 1,
115–130.
143. Riley,T.R., Lazarovici,A., Mann,R.S. and Bussemaker,H.J. (2015) Building accurate
sequence-to-affinity models from high-throughput in vitro protein-DNA binding data using
FeatureREDUCE. eLife, 4, 307.
144. Bulyk,M.L., Johnson,P.L.F. and Church,G.M. (2002) Nucleotides of transcription factor
binding sites exert interdependent effects on the binding affinities of transcription factors.
Nucleic Acids Res, 30, 1255–1261.
145. Man,T.K. and Stormo,G.D. (2001) Non-independence of Mnt repressor-operator interaction
determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay.
Nucleic Acids Res, 29, 2471–2478.
146. Shrikumar,A., Greenside,P. and Kundaje,A. (2017) Reverse-complement parameter
sharing improves deep learning models for genomics. bioRxiv, 10.1101/103663.
147. Shrikumar,A., Greenside,P. and Kundaje,A. (2017) Learning Important Features Through
Propagating Activation Differences. arXiv, cs.CV.
148. Zhou,J. and Troyanskaya,O.G. (2015) Predicting effects of noncoding variants with deep
learning-based sequence model. Nat. Methods, 12, 931–934.
149. Kingma,D.P. and Ba,J. (2014) Adam: A Method for Stochastic Optimization.
150. Wang,M., Tai,C., E,W. and Wei,L. (2018) DeFine: deep convolutional neural networks
accurately quantify intensities of transcription factor-DNA binding and facilitate evaluation of
functional non-coding variants. Nucleic Acids Res, 29, 11.
151. Blum,C.F. and Kollmann,M. (2018) Circularly shifted filters enable data efficient sequence
motif inference with neural networks. 10.1101/312959.
152. Zeng,H., Edwards,M.D., Liu,G. and Gifford,D.K. (2016) Convolutional neural network
architectures for predicting DNA–protein binding. Bioinformatics, 32, i121–i127.
153. He,K., Zhang,X., Ren,S. and Sun,J. (2015) Deep Residual Learning for Image Recognition.
arXiv, cs.CV.
154. Zeiler,M.D. and Fergus,R. (2014) Visualizing and Understanding Convolutional Networks.
In Computer Vision – ECCV 2014, Lecture Notes in Computer Science. Springer, Cham,
91
Cham, Vol. 8689, pp. 818–833.
155. Simonyan,K., Vedaldi,A. and Zisserman,A. (2013) Deep Inside Convolutional Networks:
Visualising Image Classification Models and Saliency Maps. arXiv, cs.CV.
156. Springenberg,J.T., Dosovitskiy,A., Brox,T. and Riedmiller,M. (2014) Striving for Simplicity:
The All Convolutional Net. arxiv.org
157. Thomsen,M.C.F. and Nielsen,M. (2012) Seq2Logo: a method for construction and
visualization of amino acid binding motifs and sequence profiles including sequence
weighting, pseudo counts and two-sided representation of amino acid enrichment and
depletion. Nucleic Acids Res, 40, W281–7.
158. McGinnis,W. and Krumlauf,R. (1992) Homeobox genes and axial patterning. Cell, 68, 283–
302.
159. Pearson,J.C., Lemons,D. and McGinnis,W. (2005) Hox gene. Nature Reviews Genetics
2014 15:4, 6, 893–904.
160. Berman,H.M. The Protein Data Bank. In Leadership in Science and Technology: A
Reference Handbook. SAGE Publications, Inc., 2455 Teller Road, Thousand
Oaks California 91320 United States, pp. 661–667.
161. Sagendorf,J.M., Berman,H.M. and Rohs,R. (2017) DNAproDB: an interactive tool for
structural analysis of DNA–protein complexes. Nucleic Acids Res, 45, W89–W97.
162. Kitayner,M., Rozenberg,H., Rohs,R., Suad,O., Rabinovich,D., Honig,B. and Shakked,Z.
(2010) Diversity in DNA recognition by p53 revealed by crystal structures with Hoogsteen
base pairs. Nature Structural & Molecular Biology 2013 20:3, 17, 423–429.
163. Chang,Y.P., Xu,M., Machado,A.C.D., Yu,X.J., Rohs,R. and Chen,X.S. (2013) Mechanism
of origin DNA recognition and assembly of an initiator-helicase complex by SV40 large
tumor antigen. Cell Reports, 3, 1117–1127.
164. Ng,P. (2017) dna2vec: Consistent vector representations of variable-length k-mers. arXiv,
q-bio.QM.
165. Poplin,R., Newburger,D., Dijamco,J., Nguyen,N., Loy,D., Gross,S.S., McLean,C.Y. and
DePristo,M.A. (2017) Creating a universal SNP and small indel variant caller with deep
neural networks. bioRxiv, 10.1101/092890.
166. DELANO,W.L. (2002) The PyMOL Molecular Graphics System. http://pymol.org.
167. Lu,X.-J. and Olson,W.K. (2008) 3DNA: a versatile, integrated software system for the
analysis, rebuilding and visualization of three-dimensional nucleic-acid structures. Nature
Protocols, 3, 1213–1227.
168. Passner,J.M., Ryoo,H.D., Shen,L., Mann,R.S. and Aggarwal,A.K. (1999) Structure of a
DNA-bound Ultrabithorax-Extradenticle homeodomain complex. Nature 2012 486:7404, 397,
714–719.
92
169. Piper,D.E., Batchelor,A.H., Chang,C.P., Cleary,M.L. and Wolberger,C. (1999) Structure of
a HoxB1-Pbx1 heterodimer bound to DNA: role of the hexapeptide and a fourth
homeodomain helix in complex formation. Cell, 96, 587–597.
170. Hovde,S., Abate-Shen,C. and Geiger,J.H. (2001) Crystal structure of the Msx-1
homeodomain/DNA complex. Biochemistry, 40, 12013–12021.
171. Schildbach,J.F., Karzai,A.W., Raumann,B.E. and Sauer,R.T. (1999) Origins of DNA-
binding specificity: Role of protein contacts with the DNA backbone. PNAS, 96, 811–817.
172. Zhou,T., Yang,L., Lu,Y., Dror,I., Dantas Machado,A.C., Ghane,T., Di Felice,R. and Rohs,R.
(2013) DNAshape: a method for the high-throughput prediction of DNA structural features
on a genomic scale. Nucleic Acids Res, 41, W56–62.
173. Li,J., Sagendorf,J.M., Chiu,T.-P., Pasi,M., Perez,A. and Rohs,R. (2017) Expanding the
repertoire of DNA shape features for genome-scale studies of transcription factor binding.
Nucleic Acids Res, 45, 12877–12887.
174. Maturana,D. and Scherer,S. (2015) VoxNet: A 3D Convolutional Neural Network for real-
time object recognition. In. IEEE, pp. 922–928.
175. Perez,A., Lankas,F., research,F.L.N.A.2008 Towards a molecular dynamics consensus
view of B-DNA flexibility. academic.oup.com
176. Han,A., Pan,F., Stroud,J.C., Youn,H.-D., Liu,J.O. and Chen,L. (2003) Sequence-specific
recruitment of transcriptional co-repressor Cabin1 by myocyte enhancer factor-2. Nature
2012 486:7404, 422, 730–734.
177. Krijger,P.H.L. and de Laat,W. (2016) Regulation of disease-associated gene expression in
the 3D genome. Nature Reviews Molecular Cell Biology 2017 19:3, 17, 771–782.
178. Montefiori,L., Wuerffel,R., Roqueiro,D., Lajoie,B., Guo,C., Gerasimova,T., De,S., Wood,W.,
Becker,K.G., Dekker,J., et al. (2016) Extremely Long-Range Chromatin Loops Link
Topological Domains to Facilitate a Diverse Antibody Repertoire. Cell Reports, 14, 896–906.
179. Li,Q., Tjong,H., Li,X., Gong,K., Zhou,X.J., Chiolo,I. and Alber,F. (2017) The three-
dimensional genome organization of Drosophila melanogaster through data integration.
Genome Biol., 18, 145.
180. Kalhor,R., Tjong,H., Jayathilaka,N., Alber,F. and Chen,L. (2012) Genome architectures
revealed by tethered chromosome conformation capture and population-based modeling.
Nature Biotechnology 2015 33:8, 30, 90–98.
181. Nagano,T., Lubling,Y., Stevens,T.J., Schoenfelder,S., Yaffe,E., Dean,W., Laue,E.D.,
Tanay,A. and Fraser,P. (2013) Single-cell Hi-C reveals cell-to-cell variability in chromosome
structure. Nature 2012 486:7404, 502, 59–64.
182. Di Pierro,M., Cheng,R.R., Lieberman Aiden,E., Wolynes,P.G. and Onuchic,J.N. (2017) De
novo prediction of human chromosome structures: Epigenetic marking patterns encode
genome architecture. Proc. Natl. Acad. Sci. U.S.A., 114, 12126–12131.
93
183. Erdel,F. (2017) How Communication Between Nucleosomes Enables Spreading and
Epigenetic Memory of Histone Modifications. BioEssays, 39, 1700053.
184. Margueron,R., Justin,N., Ohno,K., Sharpe,M.L., Son,J., Drury,W.J.,III, Voigt,P., Martin,S.R.,
Taylor,W.R., De Marco,V., et al. (2009) Role of the polycomb protein EED in the
propagation of repressive histone marks. Nature 2012 486:7404, 461, 762–767.
185. Yuan,W., Wu,T., Fu,H., Dai,C., Wu,H., Liu,N., Li,X., Xu,M., Zhang,Z., Niu,T., et al. (2012)
Dense Chromatin Activates Polycomb Repressive Complex 2 to Regulate H3 Lysine 27
Methylation. Science, 337, 971–975.
186. Müller,M.M., Fierz,B., Bittova,L., Liszczak,G. and Muir,T.W. (2016) A two-state activation
mechanism controls the histone methyltransferase Suv39h1. Nature Chemical Biology 2016
12:3, 12, 188–193.
187. Ou,M., Cui,P., Pei,J., Zhang,Z. and Zhu,W. (2016) Asymmetric Transitivity Preserving
Graph Embedding ACM, New York, New York, USA.
188. Wang,D., Cui,P. and Zhu,W. (2016) Structural Deep Network Embedding ACM, New York,
New York, USA.
189. Jianbo Shi and Malik,J. (2000) Normalized cuts and image segmentation. IEEE Trans.
Pattern Anal. Machine Intell., 22, 888–905.
190. Ding,C.H.Q., Xiaofeng He, Hongyuan Zha, Ming Gu, Simon,H.D.2001 A min-max cut
algorithm for graph partitioning and data clustering. In. IEEE Comput. Soc, pp. 107–114.
191. Goyal,P. and Ferrara,E. (2017) Graph Embedding Techniques, Applications, and
Performance: A Survey.
192. Roweis,S.T. and Saul,L.K. (2000) Nonlinear Dimensionality Reduction by Locally Linear
Embedding. Science, 290, 2323–2326.
193. Ahmed,A., Shervashidze,N., Narayanamurthy,S., Josifovski,V. and Smola,A.J. (2013)
Distributed large-scale natural graph factorization ACM, New York, New York, USA.
194. Belkin,M., processing,P.N.A.I.N.I.2002 Laplacian eigenmaps and spectral techniques for
embedding and clustering. papers.nips.cc
195. Perozzi,B., Al-Rfou,R. and Skiena,S. (2014) DeepWalk: online learning of social
representations ACM, New York, New York, USA.
196. Grover,A. and Leskovec,J. (2016) node2vec: Scalable Feature Learning for Networks ACM,
New York, New York, USA.
Abstract (if available)
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Genome-wide studies of protein–DNA binding: beyond sequence towards biophysical and physicochemical models
PDF
Profiling transcription factor-DNA binding specificity
PDF
Genome-wide studies reveal the function and evolution of DNA shape
PDF
Machine learning of DNA shape and spatial geometry
PDF
DNA shape at transcription factor binding sites: from purifying selection to a new alphabet
PDF
3D modeling of eukaryotic genomes
PDF
Decoding protein-DNA binding determinants mediated through DNA shape readout
PDF
Understanding protein–DNA recognition in the context of DNA methylation
PDF
Mapping 3D genome structures: a data driven modeling method for integrated structural analysis
PDF
Deciphering protein-nucleic acid interactions with artificial intelligence
PDF
Improved methods for the quantification of transcription factor binding using SELEX-seq
PDF
Data-driven approaches to studying protein-DNA interactions from a structural point of view
PDF
Forkhead transcription factors regulate replication origin firing through dimerization and cell cycle-dependent chromatin binding in S. cerevisiae
PDF
Simulating the helicase motor of SV40 large tumor antigen
PDF
The relationship between DNA methylation and transcription factor binding in colon cancer cells
PDF
Exploring the application and usage of whole genome chromosome conformation capture
PDF
Measuring, modeling and identifying factors that influence eukaryotic DNA replication
PDF
The function of Rpd3 in balancing the replicaton initiation of different genomic regions
PDF
Functional DNA methylation changes in normal and cancer cells
PDF
Exploring three-dimensional organization of the genome by mapping chromatin contacts and population modeling
Asset Metadata
Creator
Xin, Beibei
(author)
Core Title
Quantitative modeling of in vivo transcription factor–DNA binding and beyond
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
07/26/2020
Defense Date
06/19/2018
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
3D genome,chromatin accessibility,DNA shape,Hi-C,histone modification,OAI-PMH Harvest,protein-DNA recognition,statistical machine learning,transcription factor binding
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Rohs, Remo (
committee chair
), Alber, Frank (
committee member
), Thomas, Paul Denis (
committee member
), Waterman, Michael S (
committee member
)
Creator Email
bxin@usc.edu,xinbeibei2012@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-23398
Unique identifier
UC11670520
Identifier
etd-XinBeibei-6497.pdf (filename),usctheses-c89-23398 (legacy record id)
Legacy Identifier
etd-XinBeibei-6497.pdf
Dmrecord
23398
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Xin, Beibei
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
3D genome
chromatin accessibility
DNA shape
Hi-C
histone modification
protein-DNA recognition
statistical machine learning
transcription factor binding