Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Genome-wide studies reveal the function and evolution of DNA shape
(USC Thesis Other)
Genome-wide studies reveal the function and evolution of DNA shape
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
Genome-wide Studies Reveal the
Function and Evolution of DNA Shape
Ph.D. Dissertation
Submitted by
Tianyin Zhou
Computational Biology & Bioinformatics Ph.D. Program, Department of Biological
Sciences
University of Southern California
December 2014
Ph.D. Dissertation Committee
Dr. Remo Rohs (Chair)
Dr. Frank Alber
Dr. Aiichiro Nakano
Dr. Michael S. Waterman
ii
© 2014
Tianyin Zhou
All Rights Reserved
iii
Acknowledgement
It would not have been possible to write this doctoral thesis without many kind
people around me.
Above all, I would like to thank my advisor Dr. Remo Rohs for his mentoring and
constant support. In the past 5 years, Dr. Rohs has taught me how to be an active
learner, a sharp scientist, a supportive colleague, a caring person and a loving husband.
I would like to express my gratitude to Dr. Rosa Di Felice, Dr. Fengzhu Sun, Dr.
Frank Alber, Dr. Aiichiro Nakano, Dr. Gaurav S. Sukhatme and Dr. Michael S.
Waterman. Thank you for guiding my academic growth and development.
I am also grateful to Dr. Andrew D. Smith for acknowledging my academic ability
in the phone interview 5 years ago and bringing me to USC to pursue my American
dream.
Last, but by no means, I thank my wife, Xuelu Wu. You are the love of my life.
iv
Table of Contents
Acknowledgement iii
List of Figures viii
Abstract x
Chaper 1. Background: Protein-DNA recognition ......................................................... 1
1.1 TFs recognize DNA through the interplay of base and shape readout ............... 2
1.2 Computational models for describing the DNA-binding specificities of TFs. ...... 3
1.3 Binding assays for probing the DNA-binding specificities of TFs ....................... 7
Chaper 2. High-throughput prediction of DNA shape ................................................... 8
2.1 Introduction ........................................................................................................ 8
2.2 Methods ........................................................................................................... 10
2.3 Validation ......................................................................................................... 12
2.4 Conclusion ....................................................................................................... 21
Chaper 3. The role of DNA shape in Hox-DNA recognition ........................................ 22
3.1 Introduction ...................................................................................................... 22
3.2 Results ............................................................................................................. 25
3.3 Methods ........................................................................................................... 27
3.4 Discussion ........................................................................................................ 28
v
Chaper 4. CpG methylation modulates DNase I-DNA interaction strength via the
remodeling of DNA shape .......................................................................... 29
4.1 Introduction ...................................................................................................... 29
4.2 Results ............................................................................................................. 33
4.2.1 Minor groove width profile is predictive of DNase I cleavage rate ............. 33
4.2.2 CpG methylation greatly enhances adjacent DNase I cleavage ................ 36
4.2.3 CpG Methylation narrows the minor groove at adjacent positions ............. 37
4.2.4 Modulation of DNA shape explains the methylation sensitivity
of DNase I .................................................................................................. 38
4.3 Discussion ........................................................................................................ 41
4.4 Methods ........................................................................................................... 42
Chaper 5. Evidence of purifying selection on DNA shape .......................................... 43
5.1 Introduction ...................................................................................................... 43
5.2 Results ............................................................................................................. 45
5.3 Methods ........................................................................................................... 50
5.4 Future direction ................................................................................................ 51
Chaper 6. Quantitative modeling of transcription factor binding
specificities using DNA shape ................................................................... 52
6.1 Introduction ...................................................................................................... 52
6.2 Results ............................................................................................................. 54
6.2.1 DNA shape-augmented models outperform sequence-based models ...... 54
6.2.2 Models using 2mers and 3mers contain implicit DNA shape information .. 59
vi
6.2.3 DNA shape-augmented models can accurately predict TF binding
data across experimental platforms ........................................................... 61
6.2.4 Replacing k-mer with DNA shape features reduces the dimensionality
of the feature space ................................................................................... 66
6.2.5 DNA shape-augmented models reveal TF family-specific readout
mechanisms .............................................................................................. 70
6.3 Discussion ........................................................................................................ 72
6.4 Methods ........................................................................................................... 73
6.4.1 Encoding of DNA sequence or k-mer features .......................................... 73
6.4.2 Encoding of DNA shape features .............................................................. 74
6.4.3 Performance evaluation of sequence- and shape-based models
for gcPBM and uPBM data ........................................................................ 75
Bibliography .................................................................................................................. 77
Appendix A. Cofactor binding evokes latent differences in DNA binding
specificity between Hox proteins ............................................................... 96
Appendix B. DNAshape: a method for the high-throughput prediction
of DNA structural features on a genomic scale .......................................... 97
Appendix C. Genomic regions flanking E-box binding sites influence DNA
binding specificity of bHLH transcription factors through DNA shape ........ 98
Appendix D. Covariation between homeodomain transcription factors
and the shape of their DNA binding sites................................................... 99
Appendix E. Probing DNA shape and methylation state on a genomic
scale with DNase I ................................................................................... 100
vii
Appendix F. TFBSshape: a motif database for DNA shape features of
transcription factor binding sites .............................................................. 102
Appendix G. Absence of a simple code: how transcription factors read the genome .. 103
Appendix H. Quantitative modeling of transcription factor binding
specificities using DNA shape ................................................................. 104
Appendix I. Evolving insights on how cytosine methylation affects
protein-DNA binding ................................................................................ 105
Appendix J. GBshape: a genome browser database for DNA shape annotations ...... 106
viii
List of Figures
Figure 1–1 Base and shape readout contribute to TF–DNA binding specificity. .............. 3
Figure 1–2 Timeline of genomic approaches for experimental and computational
studies of TF-DNA binding specificity. ......................................................... 5
Figure 2–1 Pentamer model for HT prediction and validation of HT approach
using Fis binding sites. .............................................................................. 11
Figure 2–2 Validation of HT predictions using the Dickerson dodecamer. .................... 14
Figure 2–3 Validation of HT predictions using TF binding sites. .................................... 15
Figure 2–4 Validation of HT predictions using nucleosome binding sites. ..................... 16
Figure 2–5 Validation of HT predictions of Roll and HelT for the 10 unique
dinucleotides based on a dataset of co-crystal structures of
protein-DNA complexes. ............................................................................ 20
Figure 3–1 Heterodimerization with Exd elicits novel binding specificities. ................... 24
Figure 3–2 Predicted minor groove widths of Exd-Hox binding sites. ........................... 26
Figure 4–1 Global properties of the hexamer-based model of intrinsic
DNase I cleavage rate. .............................................................................. 32
Figure 4–2 Minor groove width is predictive of DNase I cleavage rate. ......................... 34
Figure 4–3 Observation and analysis of the effect of methylation on DNase I
cleavage rate. ............................................................................................ 37
Figure 4–4 Intrinsic DNA methylation sensitivity of DNase I. ......................................... 39
Figure 5–1 Effect of SNVs on local DNA MGW pattern. ................................................ 46
ix
Figure 5–2 Distribution of DNA MGW change for SNVs in functional and
non-functional regions. .............................................................................. 47
Figure 5–3 Distribution of DNA MGW change for SNVs with high and low MAF. .......... 48
Figure 5–4 Comparison of distribution using arbitrary DNA MGW shows no
signal of purifying selection. ....................................................................... 49
Figure 5–5 Definition of functional and non-functional regions. ..................................... 51
Figure 6–1 Design of the sequence+shape feature vector, and TF family-specific
performance comparison of binding specificity predictions. ....................... 55
Figure 6–2 Performances of various models on the DREAM5 data for mouse TFs. ..... 58
Figure 6–3 Performance of various models on DREAM5 data for 65 mouse TFs. ........ 60
Figure 6–4 Performance of binding specificity models across experimental platforms. 63
Figure 6–5 Performances of binding specificity models across experimental
platforms. ................................................................................................... 65
Figure 6–6 Comparison of various models using gcPBM data for human Max TF. ....... 67
Figure 6–7 Insights into TF-specific readout mechanisms derived from
shape-augmented binding specificity models. ........................................... 69
Figure 6–8 SVR Weights reveal structural mechanisms of DNA readout for bHLH and
homeodomain TFs. .................................................................................... 72
x
Abstract
Gene regulation is orchestrated by the binding of transcription factors (TFs) to
their specific target DNA sequence. An increasing number of structural biology and
genomics studies associate protein-DNA binding with the recognition of three-
dimensional DNA structure, or “DNA shape”. Retrieval of local nuances in DNA shape is
therefore a fundamental step to understand the general mechanisms of gene regulation.
Due to the limitation of X-ray crystallography and NMR spectroscopy, molecular
simulations have long been the only available approach to deduce atomic information
on intrinsic DNA structure. Despite significant recent advances, the bottleneck for any
molecular simulation is that only a limited number of DNA sequences can be pursued.
To overcome this limitation, we have developed a high-throughput (HT) method for DNA
shape prediction based on the mining of atomic resolution data obtained from Monte
Carlo (MC) simulations of numerous DNA fragments.
Applying the HT method to predict the DNA shape of many SELEX-seq-identified
Hox binding sites, we revealed that closely related Hox paralogs achieve their distinct
function in vivo through differential binding preference for DNA sequences with distinct
minor groove topographies. Analyzing millions of DNA backbone cleavage events on
naked genomic DNA conducted by the endonuclease DNase I, we showed that CpG
methylation modulates DNase I-DNA interaction strengths via the remodeling of DNA
shape. Examining the effect of millions of single nucleotide variants (SNVs) on local
DNA minor groove width (MGW) pattern, we proved that purifying selection acts on
DNA shape as well as nucleotide sequence.
xi
To gain a comprehensive, systematic and quantitative view of DNA shape effects
on protein-DNA binding, we integrated intuitive DNA shape features in the modeling of
TF binding specificities using statistical machine learning approaches. We compared
the quantitative predictions of DNA binding specificities for 68 mouse and human TFs
using shape-augmented models with those using sequence-based models. Based on
our model evaluation, which included cross-validation on specific PBM array designs,
testing across different PBM array designs, and using PBM-trained models to predict
relative binding affinities derived from SELEX-seq data, we showed that shape-
augmented models compared favorably to sequence-based models. Besides, whereas
adding either k-mer or DNA shape features to sequence-only models improves the
modeling of DNA binding specificities, using DNA shape that represents
interdependencies implicitly with smaller numbers of features yields more efficient
models.
In summary, my work on the DNA shape prediction and its application adds a
new dimension to genome-wide studies of gene regulation and evolution.
1
Chaper 1. Background: Protein-DNA recognition
After decades of research, much is now understood about how TFs recognize
their cognate binding sites in the genome to initiate gene regulatory functions. However,
potential target sites for each TF occur many times in the genome. How proteins can
very precisely identify their functional binding sites in a cellular environment has not
been resolved. Although closely related proteins are known to bind to distinct target
sites to execute different in vivo functions, the mechanisms by which paralogous TFs
select very similar, but not identical, target sites are not understood. Current knowledge
on the DNA-binding specificities of TFs is largely derived from research in genomics
and structural biology, two fields of research that have developed along parallel lines
with limited interactions and that have only begun to become integrated.
Recent studies have focused on the question of how TFs recognize a subset of
putative DNA target sites by identifying features, beyond the sequence of the core
binding site, which contribute to TF–DNA binding specificity (1-4). Several features
contribute to TF–DNA readout on multiple levels, including the nucleotide sequence (5-
9), 3D structure and flexibility of TFs and their binding sites (10-13), TF–DNA binding in
the presence of cofactors (14), cooperative DNA-binding of TFs (10, 15-17), chromatin
accessibility and nucleosome occupancy (18-21), indirect cooperativity via competition
with nucleosomes (22, 23), pioneer TFs that bind to nucleosomal DNA (24, 25), and
DNA methylation (12). In addition, interactions exist among all of these factors, which
might alter binding in a cell type-specific manner (25, 26).
2
1.1 TFs recognize DNA through the interplay of base and shape
readout
Structural biology has been at the forefront of the search for a protein–DNA
recognition code. Cocrystal structures of protein–DNA complexes were first solved in
the 1980s (27). Since then, more than 1600 protein–DNA structures have been entered
into the Protein Data Bank (28), including structures solved by nuclear magnetic
resonance (NMR) spectroscopy. These structures have revealed why many TFs
preferentially bind to a specific DNA sequence (29). Namely, the preference for a given
nucleotide at a specific position is mainly determined by physical interactions between
the amino acid side chains of the TF and the accessible edges of the base pairs that are
contacted. These contacts include direct hydrogen bonds, water-mediated hydrogen
bonds, and hydrophobic contacts. This form of protein–DNA recognition is known as
‘base readout’ (Figure 1–1A). A prominent example for base readout is the formation of
bidentate hydrogen bonds between arginine residues and guanine bases in the major
groove of DNA (17).
TFs can also recognize the structural features of their binding sites, such as
sequence-dependent DNA bending (30, 31) and unwinding (32). This phenomenon of
recognizing sequence-dependent DNA structure is known as ‘shape readout’ (Figure 1–
1B). The DNA shape concept includes the static and dynamic properties of DNA
structure, and the readout of enhanced negative electrostatic potential in narrow minor
groove regions through arginine (11) or histidine (33) residues.
3
Figure 1 –1 Base and shape readout contribute to TF –DNA binding specificity.
(A) Base readout describes direct interactions between amino acids and the functional groups of
the bases. Whereas the pattern of hydrogen bond acceptors (red) and donors (blue),
heterocyclic hydrogen atoms (white) and the hydrophobic methyl group (yellow) is base pair-
specific in the major groove, the pattern is degenerate in the minor groove. (B) Shape readout
includes any form of structural readout based on global and local DNA shape features, including
conformational flexibility and shape-dependent electrostatic potential. The DNA target of the
IFN-β enhanceosome (PDB ID 1t2k; top) varies in minor groove shape. The human
papillomavirus E2 protein binds to a DNA binding site (PDB ID 1jj4; bottom) with intrinsic
curvature. Figure adapted from (34).
1.2 Computational models for describing the DNA-binding
specificities of TFs.
In parallel to structural biology approaches to studying protein–DNA binding
specificity, sequence-based computational methods have been developed. These
4
methods use a set of known protein–DNA binding sites to generate ‘DNA motif models’
for predicting the binding specificity to any new site. Early DNA motif discovery methods
(35-37) were trained and tested on: (i) small sets of aligned TF binding sites (TFBSs)
collected from small-scale experiments such as DNase I footprinting (38) or
electrophoretic mobility shift assays (39), (ii) simulated data, in which TFBSs were
artificially inserted into background DNA, or (iii) sets of promoter regions of coregulated
genes (36). The development of microarray- and sequencing-based assays for the high-
throughput measurement of protein–DNA binding resulted in a burst of motif discovery
methods; to date, hundreds of DNA motif discovery algorithms have been developed (9,
40, 41).
Most sequence-based DNA motif discovery methods use position weight
matrices (PWMs) to represent the TF–DNA binding specificity (5, 8). This type of model
is simple, intuitive, and can be learned from various data types: from small sets of
known binding sites to high-throughput protein–DNA binding data. Traditional PWM
models have the benefit of being easy to visualize as DNA motif logos [(42). However,
these models are only able to describe the DNA base readout by a TF. Moreover, they
implicitly assume that positions within a TFBS independently contribute to the binding
affinity, an assumption that does not always hold (7, 43-46). Consequently, more
complex sequence-based models of protein–DNA binding specificity have been
developed (Figure 1–2) to account for positional dependencies within TFBSs, as well as
other complexities in protein–DNA recognition [2, 9, 31, 72, 73, 74].
5
Figure 1 –2 Timeline of genomic approaches for experimental and computational studies
of TF-DNA binding specificity.
Development of experimental high-throughput DNA binding assays (above the timeline axis)
and computational DNA-binding specificity models and algorithms (below the timeline axis).
Figure adapted from (34).
These complex models typically perform better than traditional PWMs (2, 45, 47),
providing important insights into the DNA recognition mechanisms used by different TFs.
For example, a dinucleotide-based model (47) revealed that including the non-
independent contributions between two specific positions in the DNA-binding models of
Hnf4a was crucial for accurately predicting the genomic regions bound by Hnf4a in vivo.
Another recent study (2) revealed that contributions from di- and trinucleotides in the
DNA regions flanking TFBSs can influence TF binding specificity. Importantly, however,
the flanking di- and trinucleotides in these models did not reflect base readout by the
TFs; instead, the effect of the higher-order sequence features was exerted through local
3D DNA structure (i.e., DNA shape) (11).
6
Interactions between adjacent base pairs are dominated by base stacking (48)
and, to a lesser degree, by inter-base pair hydrogen bonds in the major groove. These
physical interactions give rise to DNA shape (49, 50) and explain the interdependencies
between adjacent positions in a TFBS (47) and other, more complex situations. DNA
shape features can be derived on a genomic scale by using a sliding pentamer window
to mine Monte Carlo (MC) predictions (49). This approach was the basis for generating
a motif database of structural features of TFBSs (50), a genome browser for DNA shape
annotation(51), as well as for multiple studies in which hundreds of thousands of DNA
sequences were analyzed in terms of DNA shape features (1, 2, 52).
A small but important class of sequence-based motif discovery methods
represents approaches to infer DNA binding affinities by fitting thermodynamic energy-
based models to experimental data. Similarly to probabilistic models, some energy-
based models assume independent contributions among positions in the TFBS (53-55),
whereas others incorporate non-independent contributions(47). Structure-based
atomistic models of DNA-binding specificity have also been developed (56-61).
However, these models are not yet widely used, likely because they require knowledge
of the structure of the protein (or one of its homologs) when bound to the DNA target
site. Such data are not as easily available as DNA sequence data. Without having to
model the complete structure of the TF–DNA complex, structural information on DNA
alone can be incorporated into DNA motif discovery models. Recently, probabilistic
models incorporating DNA structure-derived features (2, 50, 62, 63) were shown to
perform better than models based on DNA sequence information alone. Thus, genomic
7
and structural information is beginning to be integrated into protein–DNA binding
models that account for both base- and shape-readout mechanisms.
1.3 Binding assays for probing the DNA-binding specificities of TFs
With the emergence of new high-throughput technologies for measuring protein–
DNA binding (Figure 1–2), it has become more feasible to create complex models of
DNA binding specificity through machine learning. However, all experimental datasets
contain noise and (potentially substantial) biases, and complex models will fit the noise
and biases more easily than simple PWM models. Thus, it is not surprising that, in some
recent studies of algorithms for training DNA-binding specificity models from high-
throughput data (9, 64), the models that performed best on particular in vitro datasets
did not always generalize well on independent in vivo data. As more accurate datasets
emerge (e.g., from genomic-context protein-binding microarrays, gcPBMs (2, 65), it is
likely that more TFs will be better described by complex models of DNA-binding
specificity.
The rich datasets provided by high-throughput technologies have revolutionized
our ability to characterize protein–DNA binding specificity. For example, the
comprehensive nature of universal protein-binding microarray (PBM) data (66), which
include measurements of TF binding specificity to all possible 8 base pair (bp)
sequences, has facilitated the characterization of low-affinity TF–DNA binding sites,
which are often not captured by simple DNA-binding models (67, 68). Such sites, which
are under widespread evolutionary selection (69, 70), are crucial for interpreting the
spatial and temporal TF gradients that arise during development (71). High-throughput
8
datasets have revealed that closely related TFs, even when they exhibit a high degree
of similarity in their DNA-binding domains (DBDs; up to 67% amino acid identity), can
have distinct DNA-binding profiles (7, 67). Moreover, different TF family members can
prefer different core binding sites (7, 72) or flanking DNA sequences (2). Thus, both
base- and shape-readout mechanisms might play roles in the differential DNA-binding
specificity of paralogous TFs.
Chaper 2. High-throughput prediction of DNA shape
2.1 Introduction
An increase number of structural biology and genomics studies associate
protein-DNA binding with the recognition of the three-dimensional DNA structure, or
‘DNA shape’ (11). Whereas DNA shape is sequence dependent, degeneracy in the
sequence-structure relationship enables the formation of very similar shapes by
dissimilar sequences or, in turn, dramatic effects on structure in an extended region by
only a single-nucleotide substitution (73).
In early studies, all of the available crystal structures of DNA fragments and
protein-DNA complexes were analyzed to derive the average conformations for the 10
unique dinucleotides (74). Based on the many more high-resolution structures that have
been solved and analyzed in recent years, it is now apparent that longer DNA segments
must be characterized to capture the sequence-structure degeneracy of DNA (11). Such
structural information, which can be retrieved from X-ray crystallography or NMR
spectroscopy data, ideally provides information on the three-dimensional structure of a
9
DNA binding site prior to and after protein binding. However, such data are unavailable
for most sequences for their unbound or “naked” states (27). Consequently, molecular
simulations are the only available approach to deduce atomic information on intrinsic
DNA structure.
Recent efforts to characterize the structures of all 136 unique tetranucleotides
have used all-atom molecular dynamics (MD) simulations of either 136 dodecamers (48)
or 39 duplexes of 18 base pairs (bp) in length (75). However, in both designs most
tetranucleotides occur only in the context of a single sequence, which limits the ability
for a statistically robust comparison of the simulation results with experimental data.
Due to their relatively time-consuming nature, simulation methods make structural
analysis of massive DNA sequence data unfeasible.
Here we present a method for high-throughput (HT) prediction of various
structural features of DNA through mining of atomic resolution data obtained from
Monte Carlo (MC) simulations (76-78). Given their importance in DNA shape readout,
the predicted features include minor groove width (MGW), Roll, propeller twist (ProT)
and helix twist (HelT). While we emphasize that protein-DNA recognition is a dynamic
process, the predicted structural features reveal preferred conformations that are
intrinsic to a given DNA sequence. We demonstrate the robustness of our method
through extensive validation with massive experimental and computational data. Our HT
method underlying the DNAshape web server can be used to predicte DNA structural
features of the entire yeast genome at nucleotide resolution in less than 1 min on a
single processor.
10
2.2 Methods
DNA shape features at single nucleotide position are determined by the
sequence context of the corresponding bp. The context can include the immediate
neighbors of a bp or a larger number of adjacent bps. Pentamers take into account
nearest and next-nearest neighbors of the central position. To this extent, a pentamer
model for the sequence environment of a given bp is a reasonably simplified approach
to predict structural features at the central position of the pentamer. Thus, the structural
features at each bp position can be characterized as a function of its pentameric
environment. Assuming that the structure of each of the 512 unique pentamers is known,
we can use a “sliding pentamer model” to derive structural features of DNA molecules of
any length and with any sequence in an instant manner (Figure 2–1A).
To assemble the pentamer data, we have generated a large training dataset of
MC predictions for 2,121 different DNA fragments of 12-27 bps in length. Our MC data
provide a full coverage of all 512 unique pentamers, with each occurring on average 44
times in the training dataset. These multiple occurrences of each pentamer in our
training dataset in different sequence environments take into account the effect of
flanking sequences through averaging.
For each MC prediction, structural features at single nucleotide resolution were
analyzed with Curves (79), using a modified definition for levels to a bp, in which the
MGW definition is independent of the usage of strand 1 or 2 of the DNA duplex as the
leading strand. Specifically, MGW at a given bp was calculated by averaging groove
width measurements over three levels [-1, 0, +1] surrounding the plane of a given bp,
with levels 0 and +1 representing the first two levels calculated by Curves (79) for a
11
nucleotide, and -1 being the last level of the preceding nucleotide. This definition
includes 5’ and 3’ inter-bp values and defines MGW as a direction-independent
measure.
Figure 2 –1 Pentamer model for HT prediction and validation of HT approach using Fis
binding sites.
(A) Using a sliding pentamer window, MC predictions were mined and a query table of average
structural features characterizing either the central bp (e.g., MGW) or the two central bps steps
(e.g., Roll) of a pentamer was assembled to predict structural features of any length of DNA. (B)
HT predictions of average MGW over the five central bps of 7 Fis binding sites correlate with the
logarithm of binding affinity (red). The predictive power further increases for 6 sequences (blue)
after removing one sequence with a TpA step. MGW as a function of sequence is predicted for
(C) a high-affinity (Kd = 0.2 nM) and (D) a low-affinity (Kd = 140 nM) binding site using the HT
(blue) and MC (red) approaches and compared with X-ray data (green) of protein-DNA
complexes. The large positive MGW values observed in X-ray data are usually due to crystal
packing and are not observed in solution. Spearman’s rank correlation coefficients demonstrate
12
the statistical similarity between the predicted and experimental MGW profiles. Figure adapted
from (49).
Using a sliding pentamer window, we decomposed each MC prediction into a set
of overlapping pentamers (Figure 2–1A). All occurrences of a given pentamer in our MC
dataset were collected and average structural features were calculated at the central bp
(MGW and ProT) or the two central bp steps (Roll and HelT). These average structural
parameters for each pentamer were then stored in a query table, with the pentamer
sequence serving as the search key (Figure 2–1A). These data can then be used to
derive DNA shape features of arbitrary sequences.
2.3 Validation
We compared DNA shape predictions for seven DNA binding sites that exhibit
Fis binding affinities differing by three orders of magnitude (80). The Fis protein binds a
large variety of DNA sequences but its binding affinity depends on MGW in the central
region of its binding site (80). We found that the predicted MGW, averaged over the
region of the five central nucleotides, correlates with the logarithm of binding affinity with
R
2
= 0.65 (Figure 2–1B). If a particular sequence, denoted F25, with a central TpA
“hinge” step was excluded due to its high flexibility (11), the correlation is even stronger
with R
2
= 0.99 (Figure 2–1B), suggesting a future application of DNA shape analysis in
predicting binding affinity. For the two sequences with the highest and the lowest
binding affinity (80), MGW predictions of the DNA targets show that the high-affinity site
assumes a groove geometry in its unbound state similar to that observed in the complex
13
(Figure 2–1C), whereas the low-affinity site must be deformed upon binding (Figure 2–
1D).
We explicitly demonstrate the performance of our predictions for the Dickerson
dodecamer of the palindromic sequence CGCGAATTCGCG, which is the
experimentally most extensively studied DNA molecule (27). Crystal packing effects
lead to an asymmetric structure of this molecule, which we can eliminate through
symmetrization due to the palindromic symmetry of its sequence (76). The refinement of
NMR structures requires a large number of NOE constraints, which are sparse in DNA
(27), but additional NMR data can be derived from residual dipolar coupling (RDC) (81).
The sequence-dependent patterns of MGW, Roll, ProT and HelT predictions using our
HT server agree very well with average measurements of shape features for 8 X-ray
and 10 NMR structures (Figure 2–2).
14
Figure 2 –2 Validation of HT predictions using the Dickerson dodecamer.
The structural features (A) MGW, (B) Roll, (C) ProT, and (D) HelT of the palindromic Dickerson
dodecamer are predicted using the HT (blue) and MC (red) approaches and compared with the
symmetrized average profiles derived from 8 crystal structures (green) without chemical
modifications and the average profiles derived from 10 NMR structures (purple) using RDC. The
more extreme HelT values observed in X-ray data are usually due to crystal packing and are not
observed in solution. Spearman’s rank correlation coefficients (ρ) demonstrate the statistical
similarity between the predicted and experimental structural feature profiles. Structural features
were symmetrized according to the palindromic sequence of the Dickerson dodecamer. Figure
adapted from (49).
We next focused on the MGW profiles of DNA binding sites of six proteins for
which we previously established the importance of minor groove shape readout (11).
The HT predictions show very good agreement with MC predictions (Figure 2–3). It is
noteworthy that the HT approach tends to yield predictions without extreme variations in
15
structural features compared to MC because our HT method essentially yields the
average results of multiple MC predictions, which decreases the impact of potential
sampling artifacts. We also validated our HT predictions with crystal structures of
protein-DNA complexes for these six examples (11). The maxima in MGW observed in
crystal structures assume more extreme values compared to the 5.8 Å B-DNA average
due to crystal packing effects. Nevertheless, our HT method predicts the MGW patterns
observed in the six crystal structures very well, indicated by Spearman’s rank
correlation coefficients ranging from 0.49 to 0.93.
Figure 2 –3 Validation of HT predictions using TF binding sites.
(A-F) MGW for the DNA binding sites of 6 TFs, for which shape readout was previously
observed (1), are predicted using the HT (blue) and MC (red) approaches and compared with X-
ray data (green) of protein-DNA complexes. The large positive MGW values observed in X-ray
data are usually due to crystal packing and are not observed in solution. Therefore, qualitative
MGW patterns (minima vs. maxima) are the more essential characteristics compared to actual
values. The MGW minima correlate with enhanced negative electrostatic potential providing
binding sites for basic arginine side chains (1). Spearman’s rank correlation coefficients (ρ)
16
demonstrate the statistical similarity between the predicted and experimental MGW profiles.
Figure adapted from (49).
Hydroxyl radical (OH) cleavage is a qualitative measure for MGW in solution (82).
In a sliding window approach similar to the one proposed here, measured OH cleavage
intensities originating from two nucleotides from each strand that are closest across the
minor groove were compiled in the ORChID2 server that enables predictions of OH
patterns for DNA duplexes of any length (82). This approach was used to validate HT
predictions of large numbers of sequences. We predicted the average MGW profiles of
23,076 yeast (83) and 25,654 fly (84) in vivo nucleosome binding sites, respectively.
Highly consistent with OH data based on Spearman’s rank correlations of 0.82 and 0.67,
our HT predictions of MGW exhibit a 10-bp periodicity (Figure 2–4) in accordance with
known dinucleotide analysis (85, 86) suggesting that DNA shape is recognized by
histones (11).
Figure 2 –4 Validation of HT predictions using nucleosome binding sites.
17
MGW is predicted for (A) 23,076 yeast and (B) 25,654 fly in vivo nucleosome binding sites
(25,26), and its average is compared with OH cleavage data derived from ORChID2. HT
predictions of MGW (blue) and OH cleavage intensity (orange) are highly correlated, both
revealing the 10-bp periodicity observed in dinucleotide propensity. Spearman’s rank correlation
coefficients (ρ) demonstrate the statistical similarity between the predicted MGW and OH
cleavage intensity profiles. Figure adapted from (49).
We also performed massive validation of HT predictions with a large number of
experimental DNA structures from the Protein Data Bank (PDB). We collected
structures from a previously characterized dataset (82) solved by X-ray crystallography
(760 bound DNAs; 46 unbound DNAs) and NMR spectroscopy (90 unbound DNAs) as
validation datasets. Only structures with at least one helical turn and no chemical
modifications were included. We organized the predicted and experimentally derived
shape parameters into separate “structural feature vectors”, in which the components of
a vector are ordered by positions of either each bp (MGW and Roll) or bp step (ProT
and HelT) that they correspond to. We could then evaluate the correlation between such
vectors to obtain a quantitative measure of agreement between HT predictions and
experimental structures.
For this comparison to be meaningful, however, two problems were to be
addressed. First, for structures of short length and small variations in their structural
features, quantitative comparison of the HT prediction with experimental measurement
is vulnerable to small fluctuations. Second, some experimental structures exhibit drastic
deformations, mainly due to crystal packing but in some cases also induced by protein
18
binding. Such deformations yield unusual structural features particularly in end regions
that are not observed in solution.
To address these issues, we identified regions of extreme deformations in
experimental structures, which satisfied one of the following criteria: (i) MGW > 8.5 Å or
< 1.5 Å (~ 5.8 Å in B-DNA); (ii) HelT > 45º (~ 35º in B-DNA); (iii) |Roll| > 20º (~ 0º in B-
DNA). We removed the components of the structural feature vectors corresponding to
these deformed regions. . Of the remaining structural data, regions of length ≤ 3 bp
were also discarded. In addition, the three bps flanking a deformed region 5’ or 3’ were
removed from the validation dataset. For each of the structural features (MGW, Roll,
ProT and HeltT), we concatenated the structural feature vectors from all structures in
one single vector. As a result of the concatenation, the structural feature vectors can
represent up to 3,000 bp, depending on the validation dataset. We then calculated
Spearman’s rank correlation coefficients between structural feature vectors to obtain a
quantitative measure of agreement between HT predictions and experimental structures.
For the largest dataset of bound DNAs derived from X-ray data, we achieved
Spearman’s rank correlations of 0.67 for MGW, 0.63 for Roll, 0.55 for ProT, and 0.54 for
HelT. In comparison, in unbound DNA structures some features were less well
predicted, with the exception of the unbound Dicerson dodecamer for which we found
excellent agreement with Spearman’s rank correlation coefficients of 1.0 for MGW, 0.63
for Roll, 0.95 for ProT, and 0.54 for HelT for X-ray data and 1.0 for MGW, 0.95 for Roll,
0.95 for ProT, and 0.95 for HelT for NMR data. This observation is due to (i) much
smaller datasets of experimental structures for unbound DNAs compared to protein-
bound DNAs, (ii) stronger crystal packing deformations of unbound DNAs compared to
19
DNAs in complexes, and (iii) the smaller number of available NMR-derived constraints
for DNA (27). The latter two effects were not observed for the Dickerson dodecamer
and the accuracy of HT prediction for this particular form of unbound DNA is
comparable and even superior to that for bound DNAs. This can be explained by the
removal of crystal packing effects through symmetrization due to the palindromic
sequence (76); while concerning NMR structures of unbound DNAs, the Dickerson
dodecamer is the only sequence for which RDC data was used in the structure
refinement (81). The Spearman’s rank correlation coefficients for HelT between HT
predictions and experimental data are lower for X-ray than for NMR data. This finding is
likely due to the extreme values of HelT seen in crystal structures, which are not
observed in solution-state NMR structures (Figure 2–2D).
We further investigated the HT prediction of Roll and HelT for the 10 unique
dinucleotides and found that the dinucleotide-specific pattern of these helical
parameters is well captured by HT predictions compared to X-ray data (Figure 2–5). In
particular, the comparison shows that the MC-based HT method accurately predicts the
average HelT over all dinucleotides with an average value of 34.4° . This value differs by
only 0.2° from the average value derived from the largest validation dataset of 760
crystal structure of bound DNAs with an average occurrence of 262 times of each of the
10 unique dinucleotides. It is noteworthy that HelT is also correctly predicted for CpA,
CpG, and TpA dinucleotides, for which MD simulations even with revised force fields
(87) report very low HelT values (75).
20
Figure 2 –5 Validation of HT predictions of Roll and HelT for the 10 unique dinucleotides
based on a dataset of co-crystal structures of protein-DNA complexes.
The agreement between HT predictions and X-ray data of a large dataset of 760 co-crystal
structures is apparent. Each unique dinucleotide occurs on average 262 times in these
structures in regions that are not drastically deformed. (A) Roll is slightly underestimated in HT
predictions, while the sequence-dependent variations between dinucleotides are well captured.
(B) HelT is estimated very well. Noteworthy is that the known systematic underestimation of
HelT in MD simulations (13,26,27) has been overcome in HT predictions, including the HelT
values for CpA, CpG, and TpA dinucleotides for which MD simulations even with refined force
fields report systematically low HelT values. Figure adapted from (49).
Besides the validation with experimental data, we also tested if a pentamer is the
appropriate size of a sliding window for mining MC data using leave-one-out cross-
validation. In each round of the cross validation, we removed one of the 2,121
sequences. Then the pentamer query table for the HT method was recompiled using the
remaining MC data to predict the structural feature vectors of the removed sequence.
This was repeated for each of the sequences in our training dataset. Structural feature
vectors derived from HT and MC predictions were then concatenated and compared
based on Spearman’s rank correlation. The resulting correlation coefficients for the
21
respective feature vectors were 0.85 for MGW, 0.92 for Roll, 0.96 for ProT, and 0.94 for
HelT. These very high correlations demonstrate that the pentamer model is sufficient to
capture the determinants of the predicted DNA shape features.
2.4 Conclusion
Our previous work and that by others has established MGW as an important
feature of DNA shape (11, 17). However, prior to the development of the DNAshape
web server, DNA structural features could not be analyzed for large sequence datasets
in a HT manner. To embrace the challenges of the genomic era and to be able to infer
various DNA structural features that play a role in DNA shape readout, we present a HT
approach to derive DNA structural features from massive sequence data. The HT
method is based on the assumption that pentanucleotides can be used to describe the
sequence-structure degeneracy of the double helix with sufficient accuracy. The
DNAshape web server and its underlying HT methodology predict, for the first time,
structural features of DNA that are currently established as important elements for
protein-DNA recognition (11, 17, 76, 88, 89). The rapid progress in making HT
sequencing data available can now realistically be coupled with structural analyses.
Providing structural information in a HT manner and at genomic scale will be the
necessary basis for understanding protein-DNA binding mechanistically.
22
Chaper 3. The role of DNA shape in Hox-DNA recognition
3.1 Introduction
Although members of the same transcription factor family typically have very
similar DNA binding domains these domains are rarely identical, raising the possibility
that small differences in protein sequence could lead to significant differences in binding
specificity. However, when assayed in vitro, using either classical or high-throughput
methods, different members of the same protein family generally do not show large
differences in binding specificity. For example, in Drosophila more than 50
homeodomain proteins bind to the six-base-pair sequences TAATTG and TAATTA,
despite differences in their DNA binding domains (72, 90). On the other hand, subtle
differences in homeodomain sequences, and transcription factor sequences in general,
are often conserved across vast evolutionary distances, arguing that these differences
are functionally important. The eight Hox paralogs in Drosophila, for instance, which
execute distinct functions in vivo, each have recognizable orthologs in both vertebrates
and other invertebrates. Hox orthologs can be recognized not only by their protein
sequences but also from the order in which they are expressed along an animal's
anteroposterior (AP) axis (91). Moreover, orthologous Hox proteins often have
conserved functions when expressed in a heterologous species (92, 93). These
observations suggest that sequence differences between related transcription factors,
although evolutionarily conserved and functionally relevant, are not typically reflected in
differences in their in-vitro DNA binding preferences.
Transcription factors often bind DNA as multiprotein complexes, raising the
possibility that complex formation might modify their DNA binding specificities. To test
23
this “latent specificity” hypothesis, our collaborators developed an experimental and
computational platform, SELEX-seq, that can be used to determine the relative affinities
to any DNA sequence for any transcription factor complex. Using the SELEX-seq
platform our collaborators compared the specificities of four monomeric Hox proteins
with the specificities of the same Hox proteins complexed with Exd. In all cases the Hox
specificities are modified in the presence of Exd.
Two pairwise comparisons of monomeric Hox binding preferences (Scr versus
Labial and Scr versus Ubx) reveal the general tendency for all three of these Hox
proteins to select sequences containing a TAAT, the motif that is traditionally associated
with Hox binding sites (Figure 3–1, A and B). Although some modest preferences are
observed (for example, Ubx prefers TTTAT more than Scr, Figure 3–1A), the
monomeric specificities are not sufficient to distinguish between these Hox proteins,
consistent with previous studies (72, 90). In contrast, when the DNA binding
preferences for the same Hox proteins are compared as complexes with Exd, a high
degree of specificity is observed (Figure 3–1, C and D). While red binding sites are
bound well by both Exd-Scr and Exd-Ubx, the blue and green sites are bound more
strongly by Exd-Scr than by Exd-Ubx. Conversely, the magenta site is bound more
strongly by Exd-Ubx than by Exd-Scr (Figure 3–1C). Similarly, in the presence of Exd
the specificities of Scr and Lab are readily distinguished, while the corresponding
monomeric specificities are largely overlapping (Figure 3–1, B and D). Together, these
findings demonstrate that unique Hox DNA Binding preferences are revealed upon
heterodimerization with Exd.
24
Figure 3 –1 Heterodimerization with Exd elicits novel binding specificities.
(A–B) Comparative specificity plots for monomeric Hox proteins showing relative affinities for all
9-mers. Comparing Scr versus Ubx (A) and Scr versus Lab (B) shows that there are only small
differences in binding preference. (C–D) Comparative specificity plots for Exd-Hox dimers
showing relative affinities for all 12-mers. Comparing Exd-Scr versus Exd-Ubx (C) and Exd-Scr
versus Exd-Lab (D) reveals differences in binding preference not observed for the
corresponding monomer comparisons. Figure adapted from (52).
It has been shown previously that for Hox paralog Scr, when it forms the
heterodimer with Exd its functional specificity is mediated by the recognition of minor
25
groove structure. To test if minor groove width is an universally important factor in the
determination of the specificity of Exd-Hox heterodimer, we applied our HT method to
predict the DNA shape of massive Hox-binding sites derived from SELEX-seq platform
and demonstrated that anterior and posterior Hox proteins prefer DNA sequences with
distinct minor groove topographies.
3.2 Results
We applied our HT method to predict the DNA shape of massive Hox-binding
sites derived from SELEX-seq, which contain the core motif of form
n
1
T
2
G
3
A
4
Y
5
N
6
N
7
A
8
Y
9
nnn. We revealed that nearly all sequences, independent of Hox
protein, had a minimum near A
4
. In all of the existing crystal structures, Arg5 of both
Exd and Hox are either bound to or located near to this narrow minor groove region,
likely mediated through electrostatic interactions (89).
In contrast, binding sites preferred by class 1 and 2 Hox proteins had on average
narrow minor grooves at A
8
Y
9
, w hile those preferred by class 3 Hox proteins had on
average wide minor grooves at A
8
T
9
or A
8
C
9
(Figure 3–2A). The difference in the
distribution of MGW at the A
8
Y
9
position between class 1 plus 2 versus class 3
sequences is highly statistically significant (p<2.2*10
-16
; Mann-Whitney U test Figure 3–
2B). Pearson correlations between MGW profiles along the central 12-mer also confirm
this difference in shape (Figure 3–2C). This observation that minor groove pattern was
sufficient to partition the preferred binding sites of the three classes of Hox proteins
irrespective of the primary sequence underscore the important role the minor groove
shape play in Exd-Hox binding preferences.
26
Figure 3 –2 Predicted minor groove widths of Exd-Hox binding sites.
(A) Heap map characterizing the average minor groove width of all sequences above a relative
binding affinity threshold 0.1 from each Exd-Hox heterodimer. Dark green represents narrow
minor groove regions and white denotes wider minor grooves. (B) Minor groove width values at
the most distinct A
8
and Y
9
positions are compared in box plots for the data shown in Panel (A)
and Mann-Whitney U p-values between the two groups, class 1+2 and class 3 Hox binding
sites, indicate significant differences. (C) Average minor groove width is compared in all
positions of the nTGAYNNAYnnn dodecamer for the different Exd-Hox using Pearson
correlation. Dark purple represents high similarity while white characterizes low similarity. Figure
adapted from (1).
27
3.3 Methods
For the high-throughput analysis, a total of 1,658 trajectories from independent
MC simulations were used to build a database for DNA shape predictions. These MC
trajectories were analyzed in terms of the conformation of all associated tetra- and
penta-nucleotides. The data derived from tetramer and pentamer conformations were
combined in a hybrid model, which uses only pentamer data if the pentamer
occurrence > 3, a combination of penta- and tetramer data if the pentamer occurrence
≤ 3, and only tetramer data if the pentamer occurrence is 0. The hybrid model is
necessary because only 467 of the 512 unique pentamers (91%) occur in our current
dataset compared to the almost complete coverage of 135 of the 136 unique tetramers.
Each tetra-nucleotide occurs on average 178 times and each penta-nucleotide on
average 50 times in the MC data used for the predictions. The method is later further
improved and published in Nucleic Acids Research (49).
Applying this method to the SELEX-seq binding sites, the average minor groove
width at the two central nucleotides of tetramers and the central nucleotide of
pentamers were used to infer the shape of all sequences that had a relative affinity
above 0.1. All reads were aligned based on the TGAYNNAY motif (excluding reads with
more than one motif) and the average minor groove width in each position was
calculated. We used box plots to compare differences in minor groove width at the most
distinct positions A8 and Y9 and calculated Mann-Whitney U p-values to show the
significance of differences in shape between the two groups, class 1+2 and class 3 Exd-
Hox sites. To further evaluate the similarities between the different Exd-Hox sites, we
28
compared the average width in all positions of the 12-mer nTGAYNNAYnnn using
Pearson correlation. The width values at the six positions of the AYNNAY core motif
were used to calculate a Euclidean distance tree that relates the shapes selected by all
Exd-Hox dimers. This dendrogram was generated with the UPGMA method as
implemented in the MEGA program (94).
3.4 Discussion
In the present work, we find that all preferred binding sites, regardless of Exd-
Hox preference, are predicted to have narrow minor grooves at TGAY (positions 2 to 5).
In all of the existing crystal structures, Arg5 of both Exd and Hox are either bound to or
located near to this narrow minor groove region, likely mediated through electrostatic
interactions (11).
In contrast to this shared feature, minor groove topography varies in the Hox
portion of these binding sites. Most notably, class 1 and 2 Hox proteins select binding
sites that have an additional minor groove minimum close to the AY of the Hox half site,
NNAY, whereas class 3 Hox proteins prefer a wider minor groove in this region. In
several cases the binding sites preferred by a particular Exd-Hox complex have similar
DNA shapes despite having different sequences, in agreement with the observation that
DNA shape is often more conserved than DNA sequence (73). That minor groove
shape may play an important role in Exd-Hox binding preferences is further
underscored by our observation that this parameter was sufficient to partition the
preferred binding sites of the three classes of Hox proteins, irrespective of the primary
sequence.
29
It is interesting that most of the sequence variation contributing to Hox preference
is located at position 6 and 7. Remarkably, the base pair at position 7 makes no protein
contacts in any of the kn own crystal structures, while position 6 makes only a small
number of contacts that do not appear to be specifc. How is it possible that a single
nucleotide position that makes no contacts can play such an important role in specificity?
We suggest that the effect is due to the location of a TpR step (R = A or G), which tends
to widen the minor groove. There is a TpR step at positions 6 and 7 in most class 1 and
2 sites that should widen the groove in the middle of the binding site, allowing Arg3 and
Arg5 to bind to the two minima on either side. In contrast, the TpA step in most class 3
sites at positions 7 and 8 may block Arg3 from stably inserting into the groove. Note that
the shift of the TpR step by one nucleotide in the 3’ direction is the main source of
variability at position 7 since there is a purine at this position in class 1 and class 2 sites
and T in class 3 sites.
Chaper 4. CpG methylation modulates DNase I-DNA
interaction strength via the remodeling of DNA shape
4.1 Introduction
DNase I is an endonuclease that cleaves the backbone of double-stranded DNA
(95). It approximates the size and nuclear diffusion properties of a typical human
transcription factor (TF). The enzyme interacts with DNA via the minor groove (96),
where it recognizes approximately six consecutive base pairs (97). In addition to its
30
nearly ubiquitous use in the removal of DNA from cellular extracts, DNase I has been
widely used as a structural probe of in vitro and in vivo DNA and chromatin structure
(98), and to map regulatory DNA in the human and other genomes (99, 100). The
interaction of DNase I with specific DNA sequences can be abrogated through steric
hindrance by DNA binding proteins, leading to its widespread use as a reagent for
studying TF binding (38, 101).
One of the best studied DNA modifications is the methylation of cytosines at
position 5 of the pyrimidine ring. This covalent modification, in the context of a CpG
dinucleotide, can be found in eukaryotes from plants to humans and is observed on
over 70% of CpGs in vertebrate DNA (102). The patterns of methylation can be dynamic
(103), can vary between cell lines and in the course of developmental processes, and
therefore provide a mechanism for the generation of epigenetic variation at the level of
the primary DNA sequence (104). The biological contribution of DNA methylation is both
significant and complex. First and foremost, CpG methylation has been linked to
transcriptional silencing at promoters of genes on the inactive X chromosome, on
imprinted loci and genes rendered inactive in cancers (102). Moreover, gene silencing
may be mediated by the recruitment of repressor proteins by methyl CpG binding
proteins to promoters (105), or by interference with TF action. Notably, however, some
CpG-containing promoters can be both methylated and transcriptionally active (106).
How DNA methylation affects the binding of transcriptional regulators is currently
unknown. It has long been speculated that steric occlusion by a bulky methyl group of
the cognate recognition sequence of a TF could affect its binding affinity (107). However,
this putative mechanism leaves various observations unaccounted for. For instance,
31
some TFs interact with the major groove, yet are not affected by DNA methylation
despite the extra methyl group protruding in the major groove. Other TFs interact with
the minor groove, yet are affected by DNA methylation. Increased TF occupancy upon
DNA methylation within their recognition sites has also been observed (108).
An explanation for these phenomena might be found in the 3D structure of DNA
(27).TFs can form direct and specific contacts with functional groups of the bases in the
major groove. TFs can form direct and specific contacts with functional groups of the
based in the major This base readout mechanism, however, does not suffice for the
minor groove, where instead subtle sequence-dependent variation in DNA shape is
read out by charged amino acid side chains via local variation in electrostatic potential
(11, 29).
To expand this line of thought our collaborators generated high-resolution DNase
I cleavage profiles for human fibroblast (IMR90) cells. The relative cleavage rate for
each of the 4096 possible hexamer contexts was calculated through modeling on those
profiles. Surprisingly, this rate is found to vary with local hexamer context over almost
three orders of magnitude (Figure 4–1A). The cleavage also exhibit strong strand
specificity (Figure 4–1B).
Since cytosine methylation covalently alters DNA, it may also influence DNase I
cleavage rate. Through integrating DNase I cleavage data with bisulfite sequencing data
for genomic DNA purified from the same cell type, our collaborators discovered marked
(>eightfold) enhancement of DNA backbone cleavage directly adjacent to methylated
CpG dinucleotides.
32
Figure 4 –1 Global properties of the hexamer-based model of intrinsic DNase I cleavage
rate.
(A) Cumulative distribution function of cleavage rate, showing that its variation with local
sequence spans three orders of magnitude. (B) A direct comparison of the cleavage of
hexamers and their reverse complements reveals that DNase I cuts DNA in a highly strand-
specific manner. Palindromic hexmers were excluded from this plot. Figure adapted from (52).
To understand the role of DNA shape in DNase I-DNA binding and the enhanced
cleavage upon methylation binding, we analyzed millions of DNA backbone hydrolysis
events on naked genomic DNA and examined the effect of CpG methylation on DNA
geometry for hundreds of sequence contexts. Our results show that the intrinsic rate of
cleavage by DNase I closely tracks the width of the minor groove and attributed the
enhanced cleavage upon methylation to the methylation-induced DNA shape change.
33
4.2 Results
4.2.1 Minor groove width profile is predictive of DNase I cleavage rate
DNase I is known to interact with the minor groove of DNA (96, 109). We
therefore asked whether a quantitative relationship exists between minor groove width
(MGW) and cleavage rate. To this end, we used a HT approach that can predict MGW
at the center of any pentanucleotide to predict MGW across all six nucleotide positions
for each of the 4096 possible hexamers. Since the hexamers occur as part of longer
double-stranded DNA sequences, we accounted for the influence of flanking sequence
by averaging over all possible ways of adding a dinucleotide flank on each side.
To assess to what extent the variation in DNA shape might explain the observed
variation in DNase I cleavage rate, we first plotted the negative of the logarithm of the
relative DNase I cleavage rate as a function of MGW at each base-pair position. We
interpret this negative logarithm as a binding free energy difference G between a
given sequence and the optimal sequence for DNase I cleavage. This analysis revealed
a clear partitioning of the hexamer into three parts (Figure 4–2A): at positions 3 and 2
a narrow minor groove is highly significantly associated with higher cleavage rate (with
the t-values measuring the regression coefficient in units of its standard error equal to
+19.8 and +15.1, respectively); at positions 1 and +1 this relationship is reversed but
still highly significant (t-values −15.6 and −26.3); at positions +2 and +3 a less strong
association is observed (t-values −6.0 and +6.4). The spatial profile of correlation
between MGW and DNase I cleavage rate is consistent with features of a crystal
structure of a complex of DNase I with a nicked DNA octamer duplex (110) (Figure 4–
2B). In that structure, an arginine, Arg41, from DNase I can be seen to interact with the
34
minor groove near the 3 position, while a second arginine, Arg9, contacts the minor
groove between the 2 and 1 positions (Figure 4–2C). The narrower the minor groove
is in the 5 region of the hexamer at the 3 and 2 positions, the higher is the cleavage
rate.
Figure 4 –2 Minor groove width is predictive of DNase I cleavage rate.
(A) G derived from the negative logarithm of cleavage rate as a function of minor groove
width (MGW) at the six positions of all 4,096 unique hexamers. MGW of this region was
predicted for naked binding sites based on a pentamer-based HT shape prediction approach.
HT predictions for all possible 16 dinucleotide flanks were averaged and values of MGW that fall
within intervals of 0.3 Å assigned to groups of sequences for which cleavage rates are shown as
box plots. (B) DNase I-DNA complex based on crystal structure (PDB ID 2DNJ). Base pairs at
positions 3 and 2, where DNase I cleavage anticorrelates with MGW, are highlighted in blue.
Base pairs at positions 1 and +1, where DNase I cleavage correlates positively with MGW, are
35
highlighted in green. Regions where no correlation could be detected are shown in gray. The
color code of the base pairs in the crystal structure is equivalent to the one used for the box
plots. (C) DNase I-minor groove contacts within a distance of 5 Å from any base atom are
shown for the same crystal structure. Arg41 and Arg9 bind upstream of the cleavage site, where
MGW anticorrelates with DNase I cleavage (blue base pairs). This anticorrelation likely arises
from the attraction between the positively charged arginine residues and the locally enhanced
negative electrostatic potential. The cleavage site (indicated by the orange arrow), by contrast,
is located in a region where MGW correlates positively with DNAse I cleavage (green base
pairs). Figure adapted from (52).
The relationship between MGW and DNase I cleavage rate indicates a
recognition mechanism similar to the recently described binding of arginine residues to
narrow regions of the minor groove (29). Such minor groove shape readout is based on
the enhancement of negative electrostatic potential in narrow groove regions, which in
turn allows for a stronger interaction with positively charged arginine residues (11). The
increase in DNase I cleavage rate with narrowing of the minor groove is likely to be
based on the attraction of the two arginine side chains through such locally enhanced
negative electrostatic potential. The opposite sign of the correlation between MGW and
cleavage rate at the 1 and +1 positions (Figure 4–2A) also makes structural sense.
Earlier reports have shown that the phosphodiester backbone at purine-pyrimidine (RpY)
dinucleotides, which intrinsically widen the minor groove, are cleaved by DNase I at
higher rates (82, 111, 112). Having a widened minor groove where the backbone is
cleaved would thus seem to be beneficial.
36
4.2.2 CpG methylation greatly enhances adjacent DNase I cleavage
The results above indicate that molecular recognition of DNA by DNase I is
subject to significant dependencies between nucleotides, consistent with readout of
specific features of DNA shape (113). Since DNA methylation has the potential to alter
the structural properties of DNA, our collaborators used whole-genome shotgun bisulfite
sequencing data obtained from the same cell type to define two subsets of phosphate
positons, with hexamer contexts containing only hypermethylated or only
hypomethylated CpG dinucleotides. Direct comparison of cleavage rates between both
sets revealed a striking dependency on methylation status for a subset of the hexamers.
A systematic search for DNA sequence features that could explain this dependency
revealed that it is almost completely explained by the occurrence of a CpG dinucleotide
immediately downstream of the cleaved bond. Upon methylation of the two cytosines
within the C+1G+2 base pair step, the rate of cleavage by DNase I is enhanced
~eightfold (red points in Figure 4–3A), and for the most cleavable CpG-containing
hexamer (ACT|CGA) increases from ~7% to ~68% of the maximum. Our findings are
consistent with, but greatly extend, and earlier observation that methylation of the
central cytosine in the sequence GCGC renders the 5’ phosphate more susceptible to
cleavage by DNase I (114, 115).
37
Figure 4 –3 Observation and analysis of the effect of methylation on DNase I cleavage
rate.
(A) The rate of cleavage depends strongly on the DNA methylation status. We used a positional
map of DNA methylation in IMR90 (36) to delineate subsets of genomic positions with low/high
degrees of CpG methylation, respectively. Comparison between the hexamer cleavage rates
derived from these respective subsets shows an ∼eightfold increase in cleavage rate for
hexamers with a methylated CpG immediately downstream of the cleaved phosphate (red
points). (B) Roll and MGW of methylated and unmethylated versions of the same hexamer
based on the average of MC predictions for three different flanking sequences. Methylation
leads to an increase in the positive roll angle at the CpG dinucleotide and a narrowing of the
MGW at position −2 by roughly 0.5 Å. Figure adapted from (52).
4.2.3 CpG Methylation narrows the minor groove at adjacent positions
So far, we have described two independent observations regarding DNase I as it
acts on naked DNA. First, its cleavage rate depends on the primary sequence context
via the width of the minor groove (Figure 4–2A). Second, this rate increases by a
38
multiplicative factor when the cytosines in the CpG base pair step immediately 3’ of the
cleaved phosphate are methylated (Figure 4–3A). We wondered if a direct relationship
exists between methylation and MGW, as that would have the potential to unify both
observed phenomena. Specifically, we asked whether methylation intrinsically leads to
a narrowing of the minor groove, which in turn would explain the observed increase in
cleavage rate upon methylation. To test this hypothesis, we extended the MC algorithm
(see Methods for details) so that it could also predict the shape of free DNA molecules
containing 5-methylcytosine bases. We first applied it to the most cleavable hexamer,
ACTCGA. Strikingly, we observed that CpG methylation leads to an increased roll angle
at the CpG step and a narrowing of the minor groove (Figure 4–3B). Roll is the angle
between two adjacent base pair planes describing the opening of a dinucleotide to
either the minor or major groove. The narrowing of the minor groove is most
pronounced (∼0.5 Å) at position −2, which is exactly where according to Figure 4–2A
we expect it to have the biggest positive impact on cleavage rate.
4.2.4 Modulation of DNA shape explains the methylation sensitivity of DNase I
Encouraged by the above result, we reasoned that if DNase I cleavage rate
indeed only depends on primary sequence and methylation status to the extent that the
latter modulates DNA shape, we should be able to test this explicitly. To this end, we
constructed the ‘shape-to-affinity model as illustrated in Figure 4–4A. Each hexamer
sequence is converted into a set of position-specific Roll and MGW values, which serve
as the independent (‘predictor’) variables in a multiple linear regression. The negative
logarithm of the relative cleavage rate serves as the dependent (‘response’) variable,
39
which can be interpreted as the binding free energy G relative to that for the most
cleavable sequence (for which, by definition, G = 0). Thus, a value of G/RT = 1
corresponds to a relative cleavage rate of 37%, G/RT = 2 corresponds to a relative
rate of 14%, etc. As shown in Figure 4–4B, the fraction of the total variance of G
among all unmethylated sequences of type NNN|CGN (where ‘|’ indicates the cleavage
site) can be explained by the shape-based model. Although Roll is more predictive than
MGW, the two variables complement each other.
Figure 4 –4 Intrinsic DNA methylation sensitivity of DNase I.
(A) Schematic diagram illustrating construction of the ‘shape-to-affinity model’ for predicting the
binding free energy G (logarithm of the relative cleavage rate) from DNA structural features,
such as Roll and minor groove width (MGW). (B) Fraction of variance explained by the binding
40
free energy among all 256 unmethylated DNA sequences containing a CpG dinucleotide. (C)
Effect of cytosine methylation on each shape parameter (distribution of difference for Roll and
MGW values, along the leading strand of the hexamer), derived from all-atom Monte Carlo
simulations of methylated DNA fragments. Five Roll values describe the base-pair steps in a
hexamer, whereas each of the six base pairs can be assigned a MGW value (D) Change in
binding free energy for each sequence, predicted from changes in shape features in (C), using
the model constructed in (A). (E) Empirically observed change in binding free energy obtained
from parallel methylation and DNase I profiling on genomic DNA from the same cell line. (F)
Effect of methylation across all 256 nucleotide sequences, quantified as the change in G, as
empirically determined and as predicted using Roll, MGW or both. Figure adapted from (116).
We applied computationally intensive Monte Carlo simulations of free DNA to
obtain the Roll and MGW values across the DNA molecule, which were used as
predictors in the shape-to-affinity model. Simulations were performed with or without a
methyl group added to each cytosine base in the DNA molecule. This approach enabled
the authors to build a shape-to-affinity model from the shape parameters derived from
unmethylated sequences of the form NNN|CGN, as well as to predict, for each DNA
sequence, the change in binding free energy associated with methylation of the CpG
dinucleotide downstream of the cleavage site. Strikingly, the simulations revealed that a
change in Roll due to the substitution of C by 5mC was largely independent of the
identity of the other four bases within the NNN|CGN hexamer (Figure 4–4C). Moreover,
a linear model was trained on shape parameters derived from unmethylated sequences.
When only the shape changes in the sequences were used to transfer the model to
methylated DNA, the model predicted an increased binding free energy that was
41
independent of sequence context and indicated increased binding in the presence of
5mC (Figure 4–4D).
The qualitative observation that methylation increases the DNase I cleavage rate
by a fixed multiplicative factor (corresponding to the shift in binding free energy) and the
quantitative prediction of the magnitude of this effect agreed well with direct empirical
observations made by using parallel DNase I footprinting and methylome data for
genomic DNA from the same cell line (Figure 4–4E). Interestingly, Roll seemed to be
more useful than MGW for predicting the effect of CpG methylation on binding free
energy (Figure 4–4F). Analysis of the empirical data revealed an approximately 9-fold
change in cleavage rate (and, presumably, in protein-DNA binding affinity) associated
with CpG methylation (median G/RT = −2.2). However, this value should be
interpreted as a lower boundary, because it was obtained by constructing hexamer
cleavage tables from the top and bottom 20% of genomic locations in terms of observed
cytosine methylation levels.
4.3 Discussion
Our in-depth study of DNase I allowed us to uncover a structural mechanism that
plausibly answers the long-standing question of how cytosine methylation modulates
protein–DNA interaction. Our data strongly suggest that DNA methylation generally acts
to remodel DNA shape. DNase I activity is sensitive to this change in DNA conformation,
and this explains our observation of greatly enhanced cleavage adjacent to methylated
CpG base pair steps. However, we believe that our insight could apply much more
widely across many families of nucleotide-binding proteins. Narrowing of the MGW may
42
thus be the general mechanism by which the addition or removal of methyl groups in the
major groove influences gene expression. An intriguing possibility is that nucleosome
positioning might be influenced by methylation (117, 118). Recently, an observed
correlation between these two variables was interpreted as influence of nucleosome
positioning on methylation patterning (117). However, electrostatic interactions between
arginines and the minor groove occur in the nucleosome (11, 119), and the minor
groove narrowing associated with cytosine methylation could enhance these. The
methylation patterns might therefore also be a partial determinant of nucleosome
position, with methylated CpG dinucleotides giving rise to stronger electrostatic
interactions with histones, increasing the stability of nucleosomes.
4.4 Methods
To predict MGW and roll of unbound DNA targets, we performed all-atom MC
simulations for all 256 possible hexamers of the form NNNCGN, and the 256
corresponding sequences with 5-methylcytosines at both base pairs of the CpG
dinucleotide at positions +1 and +2. We performed one MC simulation each for the
unmethylated and methylated forms of these 256 hexamers after adding a CGCG flank
on both sides. To evaluate end-effects for the hexamer ACTCGA, whose DNase I
cleavage rate increased ~eightfold upon methylation, we performed nine independent
MC simulations for this sequence, three for each of the three different tetramer flanks
(CGCG-N
6
-CGCG, CGTA-N
6
-TACG, and CATG-N
6
-CATG). These runs were performed
for the unmethylated and CpG methylated variants of the ACT|CGA hexamer. MC
predictions yield ensembles of all-atom structures, which we analyzed for structural
43
changes upon methylation. MGW and roll showed most significant effects upon CpG
methylation. We used the CURVES algorithm (79) for deriving average MGW and roll
form MC ensembles of 150,000 structures. For the ACTCGA sequence, we computed
MGW and roll as the median over the predictions with the three different flanking
sequences. The MC methodology was previously described and validated (27, 76, 82,
89). The sampling methodology was now expanded by facilitating one additional internal
degree of freedom, the rotation of the cytosine 5-methyl group, which we implemented
in analogy to the rotation of the thymine methyl group (). Partial charges for 5-
methylcytosine were taken from a set of AMBER force field parameters derived for
chemically modified nucleotides (120).
Chaper 5. Evidence of purifying selection on DNA shape
5.1 Introduction
The availability of next-generation sequencing (NGS) technologies and the
development of SNP calling methods facilitates the generation of massive coding and
noncoding variants. The fundamental goal of many genetics and genomics studies that
analyze those variants is to identify the functional or pathogenic variants that are directly
responsible for the phenotype change (121).
Despite of the abundance of noncoding variants in human (122) and widespread
evidence of their functional implication uncovered through genome-wide association
studies (GWAS) (123), to date most of efforts have been spent on developing methods
that are capable of distinguishing functional coding variants from background coding
44
variants. These methods usually used alignments of homologous proteins to estimate
mutational deleteriousness and leverage biochemical data including amino acid
properties, sequence information and structural information (121).
For identification of pathogenic non-coding variants one of the most informative
signal researchers can tap into is the DNA conservation level of the variant position. If a
variant position is highly conserved, it indicates that this variant is very likely to be
pathogenic because past mutations were removed by purifying selection due to their
deleteriousness.
Conventionally DNA conservation level is calculated through comparative
sequence analysis. PhastCons, a popular program used to identify the evolutionarily
conserved elements (124), relies on the input of multiple sequence alignment and a
phylogenetic tree. However, nucleotide sequence isn’t the sole target of selective
pressure. Nucleotide sequence-based algorithms that neglect the selective pressure
beyond the nucleotide sequence level are thus prone to underestimate the DNA
conservation level. Using an indirect measure of local DNA topography, hydroxyl
cleavage rate, Parker et al. found twice as many conserved DNA bases in the human
genome as those detected by nucleotide sequence- based algorithm(73).
Our previous work has shown that both sequence and shape readout are
employed by TFs to achieve their binding specificity. It is therefore reasonable to think
that selective pressure must also act on DNA shape. Among the 4 DNA shape features
our HT method can predict, which are MGW, ProT, HelT and Roll, the function of MGW
in TF-DNA recognition is best studied. So we decided to focus on MGW in this project.
45
Recently our collaborators have sequenced the whole genome of 216 Drosophila
melanogaster individuals. SNP calling on this vast amount of data yields a total number
of 2,605,315 single nucleotide variants (SNVs) with occurrence among 216 individuals
at least twice. The minor allele frequency (MAF) of each SNV was also calculated
based on this data. Combining with the collected 4767 potential cis-regulatory modules
(CRMs) that are identified from analyzing the changing chromatin landscape during the
Drosophila embryonic development (125), we narrowed down the number of SNVs to
analyze to 47,495 by retaining only SNVs located within those CRMs. This valuable
dataset provides us the opportunity to statistically test the existence of purifying
selection on DNA shape.
5.2 Results
For each SNV, we used our HT DNA shape prediction method to predict the
MGW of its surrounding 5bp DNA region with and without the variant. This enables us to
visualize the effect of SNVs on local DNA MGW pattern.
As illustrated in Figure 5–1, the effect of SNVs on local DNA MGW pattern varies
with mutation type and local DNA context. At one extreme of the spectrum are those
SNVs (Figure 5–1, A and B) that have little effect on the local MGW pattern. If the MGW
pattern is recognized by TFs then this recognition is likely to be preserved after the
mutation. At the other extreme are SNVs (Figure 5–1, C and D) that completely disrupt
the local MGW pattern. Function loss is therefore more likely to happen.
46
Figure 5 –1 Effect of SNVs on local DNA MGW pattern.
The effect of SNVs on their local DNA MGW pattern varies with mutation type and local DNA
context. MGWs of local DNA region with and without the presence of variant are plotted in blue
and red respectively. At one extreme of the spectrum are those SNVs (Figure 5–1, A and B) that
have no effect on the local MGW pattern. At other extreme are SNVs (Figure 5–1, C and D) that
completely disrupt the local MGW pattern.
Since MGW is a continuous feature, the DNA MGW change for each SNV can be
intuitively quantified through calculating the Euclidean distance between the MGW
patterns of local DNA region with and without the variant. This quantification allows us
to perform the series of statistical test shown below.
We first plotted the distribution of DNA MGW change for SNVs in functional and
non-functional regions (Figure 5–2). Compared to the distribution for functional regions
47
(blue), distribution for non-functional regions (red) is shifted towards the right with
statistical significance (p-value = 1.44e-15), indicating that in functional regions SNVs
that greatly change MGW are removed by purifying selection.
Figure 5 –2 Distribution of DNA MGW change for SNVs in functional and non-functional
regions.
Compared to the distribution for functional regions (blue), distribution for non-functional regions
(red) is shifted towards the right with statistical significance, indicating that in general DNA
MGW change induced by SNVs in non-functional regions is greater than change induced by
SNVs in functional regions.
48
For SNVs in functional regions, we classified them into high- and low- frequency
group based on their minor allele frequency (MAF). The frequency cutoff 0.1 is used
here so that the resulting two groups have similar group size. Plotting the distribution of
DNA MGW change for SNVs in these two groups revealed a negative correlation
between MAF and DNA shape change (Figure 5–3A). SNVs that greatly change DNA
shape are less likely to have high MAFs. This is another proof supporting the existence
of DNA MGW-targeting purifying selection.
To assure that the negative correlation we found is meaningful, we used SNVs in
non-functional regions as negative control group and compared the distribution of DNA
MGW change for high- and low- frequency group as we did for SNVs in functional
regions. This time no negative correlation between MAF and DNA shape change could
be found (Figure 5–3B).
Figure 5 –3 Distribution of DNA MGW change for SNVs with high and low MAF.
49
Among SNVs in functional regions (A), distribution of DNA MGW change for low MAF is shifted
towards the right with statistical siginificance. Among SNVs in non-functional regions (B),
distribution of these two groups exhibits little difference.
The MGW used in this study were generated based on pentamer query tables
derived from thousands of all-atom Monte Carlo simulations (49).To rule out the
possibility that the previously-observed distribution difference is due to the purifying
selection acting on DNA pentamer sequence instead of DNA MGW, we constructed
arbitrary MGW features by independently randomized the association between
pentamer identities and MGW width value in the pentamer query table With arbitrary
MGW features we re-did all the distribution comparisons and found no distribution
difference with statistical significance (Figure 5–4) in all the comparisons.
Figure 5 –4 Comparison of distribution using arbitrary DNA MGW shows no signal of
purifying selection.
50
Taken together, our results demonstrate that purifying selection acts on DNA
MGW as well as nucleotide sequence.
5.3 Methods
The binding sites for 34 principal TFs in both the anterior/posterior and
dorsal/ventral system (http://bdtnp.lbl.gov/Fly-Net/chipchip.jsp?w=tfs) were identified
using the PWMs of those TFs. The phastCons conservation score (124) for each
nucleotide position is downloaded from the UCSC Genome Browser.
For statistical testing purpose, we define the functional and non-functional
regions within CRMs as follows. Nucleotide positions with conservation score greater
than 0.1 are considered functional regions (Figure 5–5). Nucleotide positions that have
conservation score smaller or equal to 0.1 and are not covered by any of the identified
TF binding sites and their immediate 5bp flanks are considered non-functional regions
(Figure 5–5).
51
Figure 5 –5 Definition of functional and non-functional regions.
Within CRMs, nucleotide positions with conservation score greater than 0.1 are considered
functional regions. Nucleotide positions that have conservation score smaller or equal to 0.1 and
are not covered by any of the identified TF binding sites and their immediate 5bp flanks are
considered non-functional regions.
5.4 Future direction
Recently Ritchie et al. showed that a combination of genomic and epigenomic
annotations can be used to distinguish functional variants from the background variants.
We propose that adding the annotation of DNA MGW change can further improve the
prediction accuracy of their method.
52
Chaper 6. Quantitative modeling of transcription factor
binding specificities using DNA shape
6.1 Introduction
A fundamental first step towards understanding TF binding is to describe the
recognition of “naked” DNA by TFs in vitro. This process involves readout of the
nucleotide sequence (8, 126) and the three-dimensional DNA structure (12, 13, 29).
Although DNA structural features have been discussed as potential determinants of
qualitative TF binding events (11, 89, 127), quantitative models describing the impact of
DNA shape on the strength of TF binding are still lacking.
Experimental high-throughput (HT) assays, such as protein-binding microarrays
(PBMs) (66), measure the in vitro binding preferences of TFs to tens of thousands of
different nucleotide sequences. Sequence-based modeling of in vitro TF binding
specificities using HT data has been a topic of broad interest (7, 128, 129). Several
methods for PBM data analysis have been developed. Recently, the performances of 26
sequence-based methods for predicting DNA binding specificity were assessed, based
on the DREAM5 dataset of PBM data for 66 mouse TFs (130).
DNA sequence preferences are generally represented as position weight
matrices (PWMs) (8) or position-specific affinity matrices (6). The original concept of the
PWM assumed that a position within a binding site contributes to binding affinity
independent of other positions (5). This concept has recently been expanded to include
dinucleotide features that encode dependencies between adjacent nucleotide positions
(e.g., (47)). Trinucleotides (65, 131) and higher-order k-mer features (132, 133) have
53
also been included in models of DNA sequence specificity. However, model complexity
can increase dramatically when such k-mer features are used (65). Interdependencies
between nucleotide positions originate from physical interactions between base pairs,
which give rise to the three-dimensional DNA structure. DNA-binding proteins, in turn,
recognize the resulting DNA structure (11). Thus, the large number of k-mer features
necessary for encoding interdependencies between nucleotide positions could
potentially be replaced by a smaller number of structural features.
Recently, we developed technology enabling augmentation of the nucleotide
sequence with three-dimensional DNA shape features predicted using a pentamer-
based model (49) built from all-atom Monte Carlo simulations of DNA structures (76,
134). Four distinct shape features—minor groove width (MGW), propeller twist (ProT),
Roll, and helix twist (HelT)—had been shown to be important for protein-DNA
recognition in specific cases (33, 131, 135, 136). However, to date, a systematic and
comprehensive survey of the value of shape-based models of DNA recognition has
been lacking.
Here, we used HT protein-DNA binding data for 68 mammalian TFs from
different structural classes to develop and evaluate DNA binding specificity modes
based on different combinations of sequence- and shape-based features, including
mononucleotide (1mer), dinucleotide (2mer), and trinucleotide (3mer) identity, as well as
the DNA shape features of MGW, ProT, Roll, and HelT. We used support vector
regression (SVR) (137) to train linear regression models for mapping DNA sequences to
measurements of binding affinity. Then, we evaluated these models using four different
approaches.
54
First, we used 10-fold cross-validation on a diverse set of universal PBM (uPBM)
data for 65 mouse TFs (130) and genomic-context PBM (gcPBM) data for three human
basic helix-loop-helix (bHLH) TFs. Second, in a manner similar to (130), we trained our
models using PBM data from one uPBM array design and tested the models using data
from a different uPBM array design. Third, we tested the ability of our PBM-derived
models to predict data generated using a different experimental platform. In particular,
for one of the TFs in our study (the human TF Max), we generated independent in vitro
data using the SELEX-seq technology (1). We used our PBM-derived binding specificity
models to predict the relative binding affinities measured by SELEX-seq. Fourth, we
used our regression models to predict in vivo TF-DNA binding data, in a manner similar
to (130). Finally, using feature weights for the best-performing models, we identified
structural mechanisms that are used on a TF family-specific basis for achieving DNA
binding specificity.
6.2 Results
6.2.1 DNA shape-augmented models outperform sequence-based models
TF binding sites were described in our models based on feature vectors
containing the distinct set of features of a specific model at each nucleotide position.
Figure 6–1A shows a shape-augmented model that combines 1mer sequence features
with the four DNA shape features, MGW, ProT, Roll, and HelT. Features of the k-mers
and of DNA shape are substantially different in nature. Specifically, the k-mer features
are binary categorical attributes that characterize hydrogen bonds and other direct
contacts between the protein and the base pairs in the major groove (29). In contrast,
55
the DNA shape features are continuous attributes that reflect properties of the DNA
structure and capture interactions in the minor groove (11). Given these differences,
these two feature types may potentially describe different mechanisms by which a TF
achieves its DNA binding specificity.
Figure 6 –1 Design of the sequence+shape feature vector, and TF family-specific
performance comparison of binding specificity predictions.
56
(A) The feature vector used in the 1mer+shape model combined binary features for the
sequence (1mers) with continuous values for the DNA shape features (MGW, ProT, Roll, and
HelT). Although second-order shape features were also used throughout this study, they are not
shown in the schematic representation. (B) Performance comparison for different TF families
tested in this study. DNA shape contributed to the DNA binding specificities of all homeodomain
and bHLH TFs in the uPBM, gcPBM, and SELEX-seq datasets, consistent with previous work
on these TF families (1, 89, 131, 138, 139). Figure adapted from (140).
When tested on the uPBM data from the DREAM5 dataset (130), our shape-
augmented (1mer+shape) model outperformed the sequence-only (1mer or PWM)
model on 58 of the 65 TFs that were tested (Figure 6–1B). The only exceptions to this
finding were TFs with low-quality data (R
2
< 0.25). To include high-quality TF datasets,
we generated gcPBM data for the human bHLH TFs Mad1 (also known as Mxd1), Max,
and c-Myc, based on a previously described experimental protocol (65, 131). For these
gcPBM data, we observed even larger improvements in R
2
when DNA shape features
were incorporated into the binding specificity models (Figure 6–1B). We observed a
similar improvement when we used SELEX-seq data generated for the human TF Max.
Considering the uPBM, gcPBM, and SELEX-seq data together, we found that the
1mer+shape model led to consistent improvements for all homedomain and bHLH TFs
(Figure 6–1B). This observation is in accordance with our previous finding that DNA
shape readout plays an important role for these TF families (1, 131, 139). The results
also indicate that predictions of binding specificities for other TF family members (e.g.,
zinc finger, bZIP, and forkhead TFs) can benefit from adding DNA shape information to
the models. However, because the DREAM5 data were of lower quality (in terms of R
2
)
57
for some members of these families, conclusions on readout mechanisms employed by
these TF families will require additional data from HT assays.
The 1mer+shape model implemented in this study used additional second-order
shape features to account for dependencies between structural features at adjacent
nucleotide positions, such as the formation of A-tracts where MGW narrowing is more
pronounced due to several adjacent A/T base pairs (11, 29). These second-order shape
features were the product terms for the same DNA shape feature category at two
adjacent nucleotide positions (MGW and ProT) or base pair step positions (Roll and
HelT). The combined use of the second-order DNA shape features with first-order
shape features and 1mer features improved the prediction accuracy compared to the
1mer+1
st
order shape model (Figure 6–2A).
58
Figure 6 –2 Performances of various models on the DREAM5 data for mouse TFs.
(A) Comparison of model performances on uPBM data of 65 mouse TFs from the DREAM5
dataset (130). Models combining sequence (1mer) with first- and second-order DNA shape
features were compared with models combining sequence and only first-order DNA shape
features. (B-D) Comparison of model performances for shape-augmented models
(1mer+shape) with performances for sequence-based (B) 1mer, (C) 1mer+2mer, and (D)
59
1mer+2mer+3mer models in a cross-array evaluation (i.e., models were trained on one uPBM
array design and then tested on a different uPBM array design, as in (130)). Results are shown
for 10 TFs from the DREAM5 study, chosen based on high agreement between the uPBM data
generated using the two array designs (see section “Training and Testing on Different uPBM
Array Designs” below for more details). (E) Performance comparison of the shape-only model to
the sequence-only model in a cross-array evaluation for the 10 TFs shown in panels (B-D).
Figure adapted from (140).
6.2.2 Models using 2mers and 3mers contain implicit DNA shape information
We also tested dinucleotide- and trinucleotide-based models using 1mer, 2mer,
and 3mer features as predictors. When tested on the DREAM5 uPBM data, the
performances of the 1mer+2mer and 1mer+2mer+3mer models were, on average, very
close to the performance of the 1mer+shape model (Figure 6–3, A and B). This
observation was not surprising because 2mers and 3mers partially capture the effect of
the DNA shape variation on binding. Specifically, 2mers describe stacking interactions
between adjacent base pairs, and 3mers represent short structural elements, such as
A-tracts, which form distinct structures (11). Furthermore, for certain TFs, models that
used DNA shape features alone, without any explicit information about base identity,
were more accurate than models based on sequence alone (Figure 6–3C).
60
Figure 6 –3 Performance of various models on DREAM5 data for 65 mouse TFs.
(A-B) Using R
2
as a measure for prediction accuracy, the performance of shape-augmented
model (1mer+shape) was compared with the performances of sequence-based (A) 1mer+2mer
and (B) 1mer+2mer+3mer models. (C) Performance comparison of the shape-only model to the
sequence-only model. (D) Performance comparison of the shape-only model to a model
augmented by arbitrary values for shape features derived from arbitrary pentamer projections
onto a continuous scale. Figure adapted from (140).
The DNA shape features used in our study were generated based on pentamer
query tables derived from thousands of all-atom Monte Carlo simulations, and they were
61
validated with X-ray and NMR structures (49). To rule out the possibility that the use of
pentamer-based independent features in itself could explain the enhanced performance
of our shape-based models, we independently randomized the association between
pentamer identities and values for each of the four shape features. The R
2
values
obtained with the randomized shape tables were significantly lower compared to those
for our DNA shape predictions (Figure 6–3D), demonstrating that the predictive power
of DNA shape was not simply due to the inclusion of pentamer-based features.
In the aforementioned analyses of 65 mouse TFs using uPBM data from the
DREAM5 study (130), we used 10-fold cross-validation to evaluate model performance,
based on a single uPBM dataset for each TF. An alternative approach to evaluate our
DNA shape-augmented models would be to train the models on uPBM data from one
array design and use them to predict binding data obtained using a different array
design, similar to the procedure used by (130). We present the results of such an
analysis in Figure 6–2B-E and in the Supplemental Information. The results of the
cross-array testing were fully consistent with the results of the 10-fold cross-validation
tests.
6.2.3 DNA shape-augmented models can accurately predict TF binding data
across experimental platforms
To ensure that our regression models of TF binding specificity are capturing real
signals and not experimental biases intrinsic to the PBM technology, we performed a
cross-platform analysis. Specifically, we trained our models on gcPBM data and tested
them on quantitative SELEX-seq data generated for the human TF Max. Briefly, the
62
Max SELEX-seq experiment was carried out as described previously (1), with two
rounds of selection, and the data were used to compute relative binding affinities for 12-
mer DNA probes. Next, sequence- and shape-based regression models trained on our
Max gcPBM data were used to predict the binding levels of Max to DNA targets
identified by SELEX-seq (Figure 6–4A). Given that the two experimental platforms
(gcPBM and SELEX-seq) provided different measurements for TF binding (i.e., binding
intensity and relative binding affinity, respectively), we used Spearman’s rank
correlation coefficient to assess the accuracy of our predictions. When tested on
SELEX-seq data, the 1mer+shape model outperformed all of the sequence-based
models, including the 1mer+2mer+3mer model (Figure 6–4B-D, Figure 6–5A-C).
63
Figure 6 –4 Performance of binding specificity models across experimental platforms.
(A) Flowchart illustrating that Max-DNA binding specificity models were trained on natural
logarithms of fluorescence binding intensities derived from a gcPBM experiment and used to
predict the binding level of DNA targets derived from a SELEX-seq experiment for the same TF.
(B) Performance of various models for cross-platform predictions based on Spearman’s rank
correlation coefficients between observed SELEX-seq relative binding affinities and predicted
gcPBM signal intensities.(C-D) Scatter plots of predicted versus observed binding site ranks,
showing the performance of the (C) 1mer and (D) 1mer+shape models trained on gcPBM data
and tested on SELEX-seq data. Figure adapted from (140).
We also tested our models on in vivo ChIP-seq data using an area under the
receiver-operating characteristic curve (AUC)-based method, similar to the approach
64
used by (130). The AUC values computed for TF binding models reflect how well the
models discriminate DNA sites that are bound or not bound by the TF in vivo. Our SVR
models were not optimized to discriminate bound from unbound sites, but rather were
intended to predict quantitatively the level of TF binding to any given DNA site.
Nevertheless, we tested our regression models on in vivo binding data for the five
mouse TFs that were tested in the DREAM5 study. Our models performed better than
previously reported PWMs (130) for three of the five TFs, and they performed worse in
one case. For the fifth TF, all of the tested models achieved an AUC enrichment that
was close to what we would expect by chance (Figure 6–5D). When the SVR models
were tested on the ChIP-seq data for the three human bHLH proteins in our study, the
models performed better than the PWMs for all three TFs (Figure 6–5E). The
1mer+shape model did not outperform the sequence-only models in this particular test.
This result was probably because our models were optimized to predict TF binding
strength quantitatively and were not intended to distinguish whether sites were in the
bound versus the unbound state, which is what the AUC method is testing.
65
Figure 6 –5 Performances of binding specificity models across experimental platforms.
(A-C) Scatter plots of predicted versus observed binding site ranks, illustrating the performances
of the (A) 1mer+2mer, (B) 1mer+2mer+3mer, and (C) shape-only models trained on gcPBM
data and tested on SELEX-seq data. (D-E) Performances of various SVR models on in vivo TF-
66
DNA binding data for (D) five TFs from the DREAM5 dataset and (E) human bHLH factors Mad,
Max, and Myc. Performances of the regression models are compared to the performances of
PWM models that were previously reported to perform best on in vivo data (130). For only two
of the TFs tested, Tbx20 and Esrrb, the PWMs performed better than the regression models.
However, in the case of Tbx20, the AUCs for all models were close to what one would expect by
chance. For TF Esrrb, optimizing the L and R parameters in the uPBM analyses (see section
“uPBM Data” below) resulted in regression models with much higher AUC values (~0.87), close
to the performance of the Esrrb PWM (0.92). Figure adapted from (140).
6.2.4 Replacing k-mer with DNA shape features reduces the dimensionality of
the feature space
The assessment of different models required not only a comparison of prediction
accuracy but also of model complexity, which relates to the required size of the training
data and the computational cost (Figure 6–6A). Compared to the sequence-based
1mer+2mer and 1mer+2mer+3mer models, the 1mer+shape model contained fewer
features, which translated into fewer model parameters. For each nucleotide position,
four features were introduced to encode 1mer identity, 16 features to encode 2mer
identity, 64 features to encode 3mer identity, and 8 features to encode DNA shape.
Therefore, the 1mer+shape, 1mer+2mer, and 1mer+2mer+3mer models used a total of
12, 20, and 84 features per nucleotide position, respectively (Figure 6–6A). The finite
sequence length required a slight end-effect adjustment of these numbers of features
per nucleotide position.
67
Figure 6 –6 Comparison of various models using gcPBM data for human Max TF.
(A) The number of required features (per nucleotide position) was correlated with the average
running time for the training and testing of different models for Max-DNA binding, based on
gcPBM data. (B) Performances of the sequence- and shape-based models for Max-DNA
binding as the sample size was decreased. Figure adapted from (140).
To assess how models of different complexity performed on smaller datasets, we
used the gcPBM data for human bHLH TFs to generate datasets of decreasing sample
size. As expected, the performance of all models decreased with decreasing sample
size (Figure 6–6B). This decreasing trend was consistently more pronounced for
sequence-based models (1mer+2mer and 1mer+2mer+3mer) than for shape-based
models (1mer+shape and shape-only). This finding suggests that shape features more
efficiently captured the binding specificities of the studied TFs than k-mer features.
Analysis of the gcPBM data for human bHLH TFs demonstrated that the
1mer+shape model was superior to the 1mer+2mer and 1mer+2mer+3mer models
(Figure 6–6B). This difference was much more pronounced for the higher-quality
68
gcPBM than for the lower-quality uPBM data (Figure 6–3A-B), likely because the
gcPBM data contained less positional bias and provided information on the genomic
flanking regions (131).
The positive impact of a smaller number of features on the model prediction
accuracy for small sample sizes was also demonstrated by using sequence models
augmented with only one category of DNA shape features, instead of all four. The
inclusion of only one DNA shape feature further reduced the number of features (Figure
6–7A). When tested on gcPBM data for Max with smaller sample sizes (10 randomly
generated samples for each size), the prediction accuracies of the 1mer+Roll and
1mer+ProT models dropped at a slower rate compared to the 1mer+shape and
1mer+2mer+3mer models, which required many more features (Figure 6–7B).
69
Figure 6 –7 Insights into TF-specific readout mechanisms derived from shape-augmented
binding specificity models.
(A) Number of features per nucleotide position introduced in different models. Models that
included only one shape feature further reduced the total number of features compared to the
1mer+2mer+3mer and 1mer+shape models. (B) The single shape feature models 1mer+ProT
and 1mer+Roll performed better than the 1mer+shape and 1mer+2mer+3mer models on
smaller datasets. (C) Weights for Roll derived from the 1mer+1
st
order shape model accurately
reflected the cocrystal structure of the ternary Max homodimer/DNA complex. (D) The CACGTG
E-box in the cocrystal structure was the highest affinity core among the nine observed E-box
cores. Although other cores were present in the gcPBM data for Max, the SVR weights correctly
reflected the Roll features of the CACGTG core observed in the cocrystal structure. Figure
adapted from (140).
70
6.2.5 DNA shape-augmented models reveal TF family-specific readout
mechanisms
We trained regression models using mononucleotide identities (i.e., 1mer
features) and individual DNA shape features (i.e., MGW, ProT, Roll, or HelT), to
determine which shape features are most important for predicting the binding
preferences of TFs from different structural families. For the bHLH TF Max, data for
models using sequence and a single DNA shape category suggested that Roll and ProT
were the dominant structural determinants of the DNA binding specificity. MGW and
HelT were of lesser importance for the DNA readout of this specific TF (Figure 6–7B).
The inclusion of 1mer features ensured that the nucleotide identities necessary for base
readout did not indirectly influence the impact of the DNA shape features.
Analysis of the feature weights for 1mer+1
st
order shape models indicated that
our approach has the potential to reveal readout mechanisms on a TF family-specific
basis. Second-order shape features were not included in the models, to simplify the
interpretation of the shape feature weights. Feature weights for Roll in the 1mer+1
st
order shape model varied between large positive and large negative weights within the
core E-box binding site (Figure 6–7C). These feature weights accurately reflected the
pattern between large positive and large negative Roll angles observed in a cocrystal
structure of the Max-Max/DNA complex (141). The DNA target of this ternary complex
contained the CACGTG E-box, which is the highest affinity core present in the gcPBM
dataset (Figure 6–7D). These observations indicate that despite the presence of
multiple E-box cores, there exists a specific Roll pattern that is preferred by the Max
homodimer. Moreover, the findings show that a degeneracy of DNA sequence and
71
shape, similar to the well-documented degeneracy of protein sequence and structure
(142), constitutes an important characteristic of TF binding sites.
Among the remaining DNA shape features, compared to the weights for MGW
and HelT, the weights for ProT derived from the same Max 1mer+1
st
order shape model
agreed better with the cocrystal-derived structural parameters (Figure 6–8). This
observation was consistent with the prediction accuracies of the different models of Max
binding specificity (Figure 6–8B). We also analyzed the feature weights for MGW
derived from uPBM data for homeodomain TFs. We found that the 1mer+1
st
order
shape model accurately reflected the known MGW preferences of homeodomain TFs
(Figure 6–8D). The second half of the TAAT core-binding site contained a region of
narrow MGW, which is consistent with previous X-ray (89), SELEX-seq (1), and uPBM
(138) data.
72
Figure 6 –8 SVR Weights reveal structural mechanisms of DNA readout for bHLH and
homeodomain TFs.
(A-C) Structural characteristics observed in the cocrystal structure of the ternary Max-Max/DNA
complex (141) correspond, to different degrees, with the SVR weights derived from the 1mer+1
st
order shape models for (A) MGW, (B) ProT, and (C) HelT, based on the gcPBM data for Max.
(D) SVR weights derived from the 1mer+1
st
order shape model based on uPBM data (130) for
the homeodomain TFs Oct1, Pit1, and Pou1 agree with X-ray (89), SELEX-seq (1), and uPBM
(138) data. The sequence probability matrix and MGW plot were adapted from our earlier
publication (138) and compared to SVR weights derived in this study. Figure adapted from
(140).
6.3 Discussion
In this study, we have shown that statistical machine learning models of TF-DNA
binding specificity benefit from augmenting sequence-based models with features
73
encoding interactions between nucleotide positions. This augmentation can be realized
with either k-mer or DNA shape features. Whereas adding any interaction term (i.e.,
2mer, 3mer, or shape features) improved the modeling of DNA binding specificities,
more efficient models were obtained by using DNA shape, which represents interactions
with a smaller number of features.
DNA shape integrates the complex interdependencies between multiple positions
of a binding site. This integration is achieved implicitly, without any explicit knowledge of
individual interdependencies. In this way, the incorporation of DNA shape reduces the
number of required parameters, while providing a compelling mechanistic explanation
for why dinucleotides and trinucleotides can increase the accuracy of motif descriptions.
Despite the lower accuracy for some of the DREAM5 data, our results show that
quantitative models derived from SVR analyses using HT sequence data can reveal
specific mechanisms of protein-DNA recognition on a TF family basis and can
contribute to the understanding of TF binding to the genome. The combination of DNA
sequence and shape provides a fundamentally new approach for understanding TF
binding specificities, which can be broadly applied to TFs from different structural
classes.
6.4 Methods
6.4.1 Encoding of DNA sequence or k-mer features
For each nucleotide position in a given DNA sequence of length L, the k-mer
feature at that position was encoded as a binary vector of length 4
𝑘 , where a value of 1
represents the occurrence of a particular k-mer starting at that position. The k-mer
74
features for a given DNA sequence of length L were encoded by concatenating the k-
mer feature vectors at each nucleotide position, resulting in a binary vector of length
4
𝑘 (𝐿 − 𝑘 + 1). The resulting 1mer, 2mer, and 3mer feature vectors were of the form:
1𝑚𝑒𝑟 ⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑
= (𝐴 1
,𝑇 1
,𝐺 1
,𝐶 1
,…,𝐴 𝐿 ,𝑇 𝐿 ,𝐺 𝐿 ,𝐶 𝐿 )
2𝑚𝑒𝑟 ⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑
= (𝐴𝐴
1
,𝐴𝑇
1
,𝐴𝐺
1
,𝐴𝐶
1
,…𝐶𝐺
𝐿 −1
,𝐶𝐶
𝐿 −1
)
3𝑚𝑒𝑟 ⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑
= (𝐴𝐴𝐴
1
,𝐴𝐴𝑇
1
,𝐴𝐴𝐺
1
,𝐴𝐴𝐶
1
,…,𝐶𝐶𝐺
𝐿 −2
,𝐶𝐶𝐶
𝐿 −2
)
with lengths 4𝐿 , 16(𝐿 − 1), and 64(𝐿 − 2), respectively, where 𝑁 𝑖 , 𝑁𝑁
𝑖 , and 𝑁𝑁𝑁
𝑖 have a
value of 1 if the particular k-mer occurs at position 𝑖 , and 0 otherwise.
6.4.2 Encoding of DNA shape features
For a given DNA sequence of length L, four DNA shape features were predicted
by a HT method trained on Monte Carlo simulations of about 2,200 DNA fragments of
12–27 base pairs (bp) in length (49). These structural features included two nucleotide
parameters (MGW and ProT) and two bp step parameters (Roll and HelT). The
prediction of each shape feature constituted a numeric vector:
𝑀 𝐺 𝑊 ⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑
= (𝑀𝐺𝑊 3
,𝑀𝐺𝑊 4
,…,𝑀𝐺𝑊 𝐿 −3
,𝑀𝐺𝑊 𝐿 −2
)
𝑃𝑟𝑜𝑇 ⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑
= (𝑃𝑟𝑜𝑇 3
,𝑃𝑟𝑜𝑇 4
,…,𝑃𝑟𝑜𝑇 𝐿 −3
,𝑃𝑟𝑜𝑇 𝐿 −2
)
𝑅𝑜𝑙𝑙 ⃑⃑⃑⃑⃑⃑⃑⃑
= (𝑅𝑜𝑙𝑙 2
,𝑅𝑜𝑙𝑙 3
,…,𝑅𝑜𝑙𝑙 𝐿 −3
,𝑅𝑜𝑙𝑙 𝐿 −2
)
𝐻𝑒𝑙𝑇 ⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑
= (𝐻𝑒𝑙𝑇 2
,𝐻𝑒𝑙𝑇 3
,…,𝐻𝑒𝑙𝑇 𝐿 −3
,𝐻𝑒𝑙𝑇 𝐿 −2
)
where 𝑀𝐺𝑊 𝑖 and 𝑃𝑟𝑜𝑇 𝑖 represent MGW and ProT, respectively, at nucleotide position i,
and 𝑅𝑜𝑙𝑙 𝑖 and 𝐻𝑒𝑙𝑇 𝑖 represent Roll and HelT, respectively, of the dinucleotide between
75
positions 𝑖 and (𝑖 + 1). Shape values at two positions of both the 5’ and 3’ ends were
unavailable because a pentamer model was used to derive the structural features (49).
Each shape feature vector was further expanded to include second-order shape
features by adding products of the same shape parameter at two adjacent positions. For
example, the resulting 𝑀𝐺𝑊 1𝑠𝑡 +2𝑛𝑑
⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑
vector was of the form:
𝑀𝐺𝑊 1𝑠𝑡 +2𝑛𝑑
⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑
= (𝑀𝐺𝑊 3
,𝑀𝐺𝑊 4
,…,𝑀𝐺𝑊 𝐿 −3
,𝑀𝐺𝑊 𝐿 −2
,𝑀𝐺𝑊 3
∗ 𝑀𝐺𝑊 4
,…,𝑀𝐺𝑊 𝐿 −3
∗ 𝑀𝐺𝑊 𝐿 −2
)
The complete shape feature vector used in this study was obtained by concatenating
the four numeric vectors 𝑀𝐺𝑊 1𝑠𝑡 +2𝑛𝑑
⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑
, 𝑃𝑟𝑜𝑇 1𝑠𝑡 +2𝑛𝑑
⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑
, 𝑅𝑜𝑙𝑙 1𝑠𝑡 +2𝑛𝑑
⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑
, and 𝐻𝑒𝑙𝑇 1𝑠𝑡 +2𝑛𝑑
⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑⃑
. The
total length of the final DNA shape feature vector was (8𝐿 − 32), due to the use of four
first-order and four second-order shape features, and the unavailability of values at two
positions at each end (139).
6.4.3 Performance evaluation of sequence- and shape-based models for gcPBM
and uPBM data
After preprocessing and feature encoding, PBM data for each TF were
transformed into a matrix. The first column of this matrix contained the natural logarithm
of the fluorescence signal intensities of the PBM probes, and the remaining columns
contained the encoded features. For each DNA shape characteristic, first- and second-
order DNA shape features were normalized to values between 0 and 1.
The ε-SVR algorithm (143) implemented in the libsvm toolkit (144) was used to
train linear regression models for predicting the natural logarithm of the PBM signal
intensities (response variable) based on the encoded features. ε-SVR contains two
76
user-defined hyperparameters: C, the penalty factor used for regularization; and ε, the
parameter in the loss function (i.e., maximum distance between the predicted and actual
values for which no penalty is incurred).
To obtain unbiased performance estimates of the regression models on each
dataset, a nested 10-fold cross-validation procedure was implemented. First, each
dataset was randomly partitioned into 10 equally sized subsets. Each subset was used
for testing, while the other nine subsets were used for training. Thus, our models were
always tested on data not included in the training process. An internal 10-fold cross-
validation was used at each step to identify the hyperparameters C and ε that yielded
the best performance (i.e., lowest mean squared error). The hyperparameter space was
manually specified as:
𝜀 ∈ {0.001,0.01,0.1,0.2,0.3,0.4,0.5,0.7,0.9,1.0}
𝐶 ∈ {0.001,0.01,0.05,0.1,0.5,1,5,10,50}
For each TF dataset, the R
2
between the predicted and observed values of the
response variables for all DNA sequences in that dataset was reported. The response
variables were the natural logarithms of the fluorescence signal intensities.
77
Bibliography
1. Slattery M, Riley T, Liu P, Abe N, Gomez-Alcala P, Dror I, et al. Cofactor binding evokes
latent differences in DNA binding specificity between Hox proteins. Cell. 2011 Dec
9;147(6):1270-82. PubMed PMID: 22153072. Pubmed Central PMCID: 3319069. Epub
2011/12/14. eng.
2. Gordan R, Shen N, Dror I, Zhou T, Horton J, Rohs R, et al. Genomic regions flanking E-
box binding sites influence DNA binding specificity of bHLH transcription factors through DNA
shape. Cell Rep. 2013;in press.
3. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, et al. Simple combinations of
lineage-determining transcription factors prime cis-regulatory elements required for macrophage
and B cell identities. Molecular cell. 2010 May 28;38(4):576-89. PubMed PMID: 20513432.
Pubmed Central PMCID: 2898526.
4. Yanez-Cuna JO, Dinh HQ, Kvon EZ, Shlyueva D, Stark A. Uncovering cis-regulatory
sequence requirements for context-specific transcription factor binding. Genome research. 2012
Oct;22(10):2018-30. PubMed PMID: 22534400. Pubmed Central PMCID: 3460196.
5. Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000
Jan;16(1):16-23. PubMed PMID: 10812473.
6. Bussemaker HJ, Foat BC, Ward LD. Predictive modeling of genome-wide mRNA
expression: from modules to molecules. Annual review of biophysics and biomolecular structure.
2007;36:329-47. PubMed PMID: 17311525. Epub 2007/02/22. eng.
7. Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, et al. Diversity
and complexity in DNA recognition by transcription factors. Science. 2009 Jun
26;324(5935):1720-3. PubMed PMID: 19443739. Pubmed Central PMCID: 2905877.
8. Stormo GD, Zhao Y. Determining the specificity of protein-DNA interactions. Nature
reviews Genetics. 2010 Nov;11(11):751-60. PubMed PMID: 20877328.
78
9. Weirauch MT, Cote A, Norel R, Annala M, Zhao Y, Riley TR, et al. Evaluation of
methods for modeling transcription factor sequence specificity. Nature biotechnology. 2013 Jan
27;31(2):126-34. PubMed PMID: 23354101.
10. Meijsing SH, Pufall MA, So AY, Bates DL, Chen L, Yamamoto KR. DNA binding site
sequence directs glucocorticoid receptor structure and activity. Science. 2009 Apr
17;324(5925):407-10. PubMed PMID: 19372434. Epub 2009/04/18. eng.
11. Rohs R, West SM, Sosinsky A, Liu P, Mann RS, Honig B. The role of DNA shape in
protein-DNA recognition. Nature. 2009 Oct 29;461(7268):1248-53. PubMed PMID: 19865164.
Pubmed Central PMCID: 2793086.
12. Kim S, Brostromer E, Xing D, Jin J, Chong S, Ge H, et al. Probing allostery through DNA.
Science. 2013 Feb 15;339(6121):816-9. PubMed PMID: 23413354.
13. Watson LC, Kuchenbecker KM, Schiller BJ, Gross JD, Pufall MA, Yamamoto KR. The
glucocorticoid receptor dimer interface allosterically transmits sequence-specific DNA signals.
Nature structural & molecular biology. 2013 Jul;20(7):876-83. PubMed PMID: 23728292.
Pubmed Central PMCID: 3702670.
14. Siggers T, Duyzend MH, Reddy J, Khan S, Bulyk ML. Non-DNA-binding cofactors
enhance DNA-binding specificity of a transcriptional regulatory complex. Molecular systems
biology. 2011;7:555. PubMed PMID: 22146299. Pubmed Central PMCID: 3737730.
15. Panne D. The enhanceosome. Current opinion in structural biology. 2008 Apr;18(2):236-
42. PubMed PMID: 18206362. Epub 2008/01/22. eng.
16. Wasson T, Hartemink AJ. An ensemble model of competitive multi-factor binding of the
genome. Genome research. 2009 Nov;19(11):2101-12. PubMed PMID: 19720867. Pubmed
Central PMCID: 2775586.
17. Kitayner M, Rozenberg H, Rohs R, Suad O, Rabinovich D, Honig B, et al. Diversity in
DNA recognition by p53 revealed by crystal structures with Hoogsteen base pairs. Nature
79
structural & molecular biology. 2010 Apr;17(4):423-9. PubMed PMID: 20364130. Pubmed
Central PMCID: 20364130. Epub 2010/04/07. eng.
18. Liu X, Lee CK, Granek JA, Clarke ND, Lieb JD. Whole-genome comparison of Leu3
binding in vitro and in vivo reveals the importance of nucleosome occupancy in target site
selection. Genome research. 2006 Dec;16(12):1517-28. PubMed PMID: 17053089. Pubmed
Central PMCID: 1665635.
19. Kaplan N, Moore IK, Fondufe-Mittendorf Y, Gossett AJ, Tillo D, Field Y, et al. The DNA-
encoded nucleosome organization of a eukaryotic genome. Nature. 2009 Mar
19;458(7236):362-6. PubMed PMID: 19092803. Pubmed Central PMCID: 2658732. Epub
2008/12/19. eng.
20. Bai L, Morozov AV. Gene regulation by nucleosome positioning. Trends in genetics : TIG.
2010 Nov;26(11):476-83. PubMed PMID: 20832136.
21. Kaplan T, Li XY, Sabo PJ, Thomas S, Stamatoyannopoulos JA, Biggin MD, et al.
Quantitative models of the mechanisms that control genome-wide patterns of transcription factor
binding during early Drosophila development. PLoS genetics. 2011;7(2):e1001290. PubMed
PMID: 21304941. Pubmed Central PMCID: 3033374.
22. Miller JA, Widom J. Collaborative competition mechanism for gene activation in vivo.
Molecular and cellular biology. 2003 Mar;23(5):1623-32. PubMed PMID: 12588982. Pubmed
Central PMCID: 151720.
23. Mirny LA. Nucleosome-mediated cooperativity between transcription factors.
Proceedings of the National Academy of Sciences of the United States of America. 2010 Dec
28;107(52):22534-9. PubMed PMID: 21149679. Pubmed Central PMCID: 3012490.
24. Glatt S, Alfieri C, Muller CW. Recognizing and remodeling the nucleosome. Current
opinion in structural biology. 2011 Jun;21(3):335-41. PubMed PMID: 21377352.
80
25. Barozzi I, Simonatto M, Bonifacio S, Yang L, Rohs R, Ghisletti S, et al. Coregulation of
transcription factor binding and nucleosome occupancy through DNA features of mammalian
enhancers. Molecular cell. 2014 Jun 5;54(5):844-57. PubMed PMID: 24813947. Pubmed
Central PMCID: 4048654.
26. Agius P, Arvey A, Chang W, Noble WS, Leslie C. High resolution models of transcription
factor-DNA affinities improve in vitro and in vivo binding predictions. PLoS computational
biology. 2010;6(9). PubMed PMID: 20838582. Pubmed Central PMCID: 2936517.
27. Rohs R, West SM, Liu P, Honig B. Nuance in the double-helix and its role in protein-
DNA recognition. Current opinion in structural biology. 2009 Apr;19(2):171-7. PubMed PMID:
19362815. Pubmed Central PMCID: 2701566.
28. Luscombe NM, Austin SE, Berman HM, Thornton JM. An overview of the structures of
protein-DNA complexes. Genome biology. 2000;1(1):REVIEWS001. PubMed PMID: 11104519.
Pubmed Central PMCID: 138832. Epub 2000/12/06. eng.
29. Rohs R, Jin X, West SM, Joshi R, Honig B, Mann RS. Origins of specificity in protein-
DNA recognition. Annu Rev Biochem. 2010;79:233-69. PubMed PMID: 20334529. Pubmed
Central PMCID: 3285485.
30. Stella S, Cascio D, Johnson RC. The shape of the DNA minor groove directs binding by
the DNA-bending protein Fis. Genes & development. Apr 15;24(8):814-26. PubMed PMID:
20395367. Pubmed Central PMCID: 2854395. Epub 2010/04/17. eng.
31. Hancock SP, Ghane T, Cascio D, Rohs R, Di Felice R, Johnson RC. Control of DNA
minor groove width and Fis protein binding by the purine 2-amino group. Nucleic Acids Res.
2013 Jul;41(13):6750-60. PubMed PMID: 23661683. Pubmed Central PMCID: 3711457.
32. Chen Y, Zhang X, Dantas Machado AC, Ding Y, Chen Z, Qin PZ, et al. Structure of p53
binding to the BAX response element reveals DNA unwinding and compression to
81
accommodate base-pair insertion. Nucleic Acids Res. 2013 Sep;41(17):8368-76. PubMed PMID:
23836939. Pubmed Central PMCID: 3783167.
33. Chang YP, Xu M, Machado AC, Yu XJ, Rohs R, Chen XS. Mechanism of origin DNA
recognition and assembly of an initiator-helicase complex by SV40 large tumor antigen. Cell
reports. 2013 Apr 25;3(4):1117-27. PubMed PMID: 23545501. Pubmed Central PMCID:
3748285.
34. Slattery M, Zhou T, Yang L, Dantas Machado AC, Gordan R, Rohs R. Absence of a
simple code: how transcription factors read the genome. Trends in biochemical sciences. 2014
Sep;39(9):381-99. PubMed PMID: 25129887. Pubmed Central PMCID: 4149858.
35. Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover
motifs in biopolymers. Proceedings / International Conference on Intelligent Systems for
Molecular Biology ; ISMB International Conference on Intelligent Systems for Molecular Biology.
1994;2:28-36. PubMed PMID: 7584402.
36. Roth FP, Hughes JD, Estep PW, Church GM. Finding DNA regulatory motifs within
unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nature
biotechnology. 1998 Oct;16(10):939-45. PubMed PMID: 9788350.
37. Pevzner PA, Sze SH. Combinatorial approaches to finding subtle signals in DNA
sequences. Proceedings / International Conference on Intelligent Systems for Molecular
Biology ; ISMB International Conference on Intelligent Systems for Molecular Biology.
2000;8:269-78. PubMed PMID: 10977088.
38. Galas DJ, Schmitz A. DNAse footprinting: a simple method for the detection of protein-
DNA binding specificity. Nucleic Acids Res. 1978 Sep;5(9):3157-70. PubMed PMID: 212715.
Pubmed Central PMCID: 342238.
39. Garner MM, Revzin A. A gel electrophoresis method for quantifying the binding of
proteins to specific DNA regions: application to components of the Escherichia coli lactose
82
operon regulatory system. Nucleic Acids Res. 1981 Jul 10;9(13):3047-60. PubMed PMID:
6269071. Pubmed Central PMCID: 327330.
40. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, et al. Assessing
computational tools for the discovery of transcription factor binding sites. Nature biotechnology.
2005 Jan;23(1):137-44. PubMed PMID: 15637633.
41. Sandve GK, Drablos F. A survey of motif discovery methods in an integrated framework.
Biology direct. 2006;1:11. PubMed PMID: 16600018. Pubmed Central PMCID: 1479319.
42. Workman CT, Yin Y, Corcoran DL, Ideker T, Stormo GD, Benos PV. enoLOGOS: a
versatile web tool for energy normalized sequence logos. Nucleic Acids Res. 2005 Jul 1;33(Web
Server issue):W389-92. PubMed PMID: 15980495. Pubmed Central PMCID: 1160200.
43. Jolma A, Yan J, Whitington T, Toivonen J, Nitta KR, Rastas P, et al. DNA-Binding
Specificities of Human Transcription Factors. Cell. 2013 Jan 17;152(1-2):327-39. PubMed PMID:
23332764.
44. Man TK, Stormo GD. Non-independence of Mnt repressor-operator interaction
determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay.
Nucleic Acids Res. 2001 Jun 15;29(12):2471-8. PubMed PMID: 11410653. Pubmed Central
PMCID: 55749.
45. Bulyk ML, Johnson PL, Church GM. Nucleotides of transcription factor binding sites
exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res.
2002 Mar 1;30(5):1255-61. PubMed PMID: 11861919. Pubmed Central PMCID: 101241.
46. Tomovic A, Oakeley EJ. Position dependencies in transcription factor binding sites.
Bioinformatics. 2007 Apr 15;23(8):933-41. PubMed PMID: 17308339.
47. Zhao Y, Ruan S, Pandey M, Stormo GD. Improved models for transcription factor
binding site identification using nonindependent interactions. Genetics. 2012 Jul;191(3):781-90.
PubMed PMID: 22505627. Pubmed Central PMCID: 3389974.
83
48. Olson WK, Gorin AA, Lu XJ, Hock LM, Zhurkin VB. DNA sequence-dependent
deformability deduced from protein-DNA crystal complexes. Proceedings of the National
Academy of Sciences of the United States of America. 1998 Sep 15;95(19):11163-8. PubMed
PMID: 9736707. Pubmed Central PMCID: 21613. Epub 1998/09/16. eng.
49. Zhou T, Yang L, Lu Y, Dror I, Dantas Machado AC, Ghane T, et al. DNAshape: a
method for the high-throughput prediction of DNA structural features on a genomic scale.
Nucleic Acids Res. 2013 Jul;41(Web Server issue):W56-62. PubMed PMID: 23703209. Pubmed
Central PMCID: 3692085.
50. Yang L, Zhou T, Dror I, Mathelier A, Wasserman WW, Gordan R, et al. TFBSshape: a
motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res.
2014 Jan;42(Database issue):D148-55. PubMed PMID: 24214955. Pubmed Central PMCID:
3964943.
51. Chiu TP, Yang L, Zhou T, Main BJ, Parker SC, Nuzhdin SV, et al. GBshape: a genome
browser database for DNA shape annotations. Nucleic Acids Res. 2014 Oct 17. PubMed PMID:
25326329.
52. Lazarovici A, Zhou T, Shafer A, Dantas Machado AC, Riley TR, Sandstrom R, et al.
Probing DNA shape and methylation state on a genomic scale with DNase I. Proceedings of the
National Academy of Sciences of the United States of America. 2013 Apr 1. PubMed PMID:
23576721.
53. Roider HG, Kanhere A, Manke T, Vingron M. Predicting transcription factor affinities to
DNA from a biophysical model. Bioinformatics. 2007 Jan 15;23(2):134-41. PubMed PMID:
17098775.
54. Zhao Y, Granas D, Stormo GD. Inferring binding energies from selected binding sites.
PLoS computational biology. 2009 Dec;5(12):e1000590. PubMed PMID: 19997485. Pubmed
Central PMCID: 2777355. Epub 2009/12/10. eng.
84
55. Sun W, Hu X, Lim MH, Ng CK, Choo SH, Castro DS, et al. TherMos: Estimating protein-
DNA binding energies from in vivo binding profiles. Nucleic Acids Res. 2013 Jun;41(11):5555-68.
PubMed PMID: 23595148. Pubmed Central PMCID: 3675472.
56. Mandel-Gutfreund Y, Margalit H. Quantitative parameters for amino acid-base
interaction: implications for prediction of protein-DNA binding sites. Nucleic Acids Res. 1998
May 15;26(10):2306-12. PubMed PMID: 9580679. eng.
57. Havranek JJ, Duarte CM, Baker D. A simple physical model for the prediction and
design of protein-DNA interactions. Journal of molecular biology. 2004 Nov 12;344(1):59-70.
PubMed PMID: 15504402.
58. Morozov AV, Havranek JJ, Baker D, Siggia ED. Protein-DNA binding specificity
predictions with structural models. Nucleic Acids Res. 2005;33(18):5781-98. PubMed PMID:
16246914. Pubmed Central PMCID: 1270944.
59. Kaplan T, Friedman N, Margalit H. Ab initio prediction of transcription factor targets using
structural knowledge. PLoS computational biology. 2005 Jun;1(1):e1. PubMed PMID: 16103898.
Pubmed Central PMCID: 1183507.
60. Siggers TW, Silkov A, Honig B. Structural alignment of protein--DNA interfaces: insights
into the determinants of binding specificity. Journal of molecular biology. 2005 Feb
4;345(5):1027-45. PubMed PMID: 15644202. Epub 2005/01/13. eng.
61. Liu LA, Bradley P. Atomistic modeling of protein-DNA interaction specificity: progress
and applications. Current opinion in structural biology. 2012 Aug;22(4):397-405. PubMed PMID:
22796087. Pubmed Central PMCID: 3425445.
62. Maienschein-Cline M, Dinner AR, Hlavacek WS, Mu F. Improved predictions of
transcription factor binding sites using physicochemical features of DNA. Nucleic Acids Res.
2012 Dec;40(22):e175. PubMed PMID: 22923524. Pubmed Central PMCID: 3526315.
85
63. Hooghe B, Broos S, van Roy F, De Bleser P. A flexible integrative approach based on
random forest improves prediction of transcription factor binding sites. Nucleic Acids Res. 2012
Aug;40(14):e106. PubMed PMID: 22492513. Pubmed Central PMCID: 3413102.
64. Kahara J, Lahdesmaki H. Evaluating a linear k-mer model for protein-DNA interactions
using high-throughput SELEX data. BMC bioinformatics. 2013;14 Suppl 10:S2. PubMed PMID:
24267147. Pubmed Central PMCID: 3750486.
65. Mordelet F, Horton J, Hartemink AJ, Engelhardt BE, Gordan R. Stability selection for
regression-based models of transcription factor-DNA binding specificity. Bioinformatics. 2013
Jul 1;29(13):i117-25. PubMed PMID: 23812975. Pubmed Central PMCID: 3694650.
66. Berger MF, Philippakis AA, Qureshi AM, He FS, Estep PW, 3rd, Bulyk ML. Compact,
universal DNA microarrays to comprehensively determine transcription-factor binding site
specificities. Nature biotechnology. 2006 Nov;24(11):1429-35. PubMed PMID: 16998473.
67. Wong D, Teixeira A, Oikonomopoulos S, Humburg P, Lone IN, Saliba D, et al. Extensive
characterization of NF-kappaB binding uncovers non-canonical motifs and advances the
interpretation of genetic functional traits. Genome biology. 2011;12(7):R70. PubMed PMID:
21801342. Pubmed Central PMCID: 3218832.
68. Siggers T, Chang AB, Teixeira A, Wong D, Williams KJ, Ahmed B, et al. Principles of
dimer-specific gene regulation revealed by a comprehensive characterization of NF-kappaB
family DNA binding. Nature immunology. 2012 Jan;13(1):95-102. PubMed PMID: 22101729.
Pubmed Central PMCID: 3242931.
69. Tanay A. Extensive low-affinity transcriptional interactions in the yeast genome. Genome
research. 2006 Aug;16(8):962-72. PubMed PMID: 16809671. Pubmed Central PMCID: 1524868.
70. Jaeger SA, Chan ET, Berger MF, Stottmann R, Hughes TR, Bulyk ML. Conservation and
regulatory associations of a wide affinity range of mouse transcription factor binding sites.
86
Genomics. 2010 Apr;95(4):185-95. PubMed PMID: 20079828. Pubmed Central PMCID:
2848887.
71. White MA, Parker DS, Barolo S, Cohen BA. A model of spatially restricted transcription
in opposing gradients of activators and repressors. Molecular systems biology. 2012;8:614.
PubMed PMID: 23010997. Pubmed Central PMCID: 3472688.
72. Berger MF, Badis G, Gehrke AR, Talukder S, Philippakis AA, Pena-Castillo L, et al.
Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence
preferences. Cell. 2008 Jun 27;133(7):1266-76. PubMed PMID: 18585359. Pubmed Central
PMCID: 2531161.
73. Parker SC, Hansen L, Abaan HO, Tullius TD, Margulies EH. Local DNA topography
correlates with functional noncoding regions of the human genome. Science. 2009 Apr
17;324(5925):389-92. PubMed PMID: 19286520. Pubmed Central PMCID: 2749491.
74. Olson WK, Gorin AA, Lu XJ, Hock LM, Zhurkin VB. DNA sequence-dependent
deformability deduced from protein-DNA crystal complexes. ProcNatlAcadSciUSA.
1998;95(19):11163-8.
75. Lavery R, Zakrzewska K, Beveridge D, Bishop TC, Case DA, Cheatham T, 3rd, et al. A
systematic molecular dynamics study of nearest-neighbor effects on base pair and base pair
step conformations and fluctuations in B-DNA. Nucleic Acids Research. 2010 Jan;38(1):299-313.
PubMed PMID: 19850719. Pubmed Central PMCID: 2800215. Epub 2009/10/24. eng.
76. Rohs R, Sklenar H, Shakked Z. Structural and energetic origins of sequence-specific
DNA bending: Monte Carlo simulations of papillomavirus E2-DNA binding sites. Structure. 2005
Oct;13(10):1499-509. PubMed PMID: 16216581. Epub 2005/10/12. eng.
77. Rohs R, Bloch I, Sklenar H, Shakked Z. Molecular flexibility in ab initio drug docking to
DNA: binding-site and binding-mode transitions in all-atom Monte Carlo simulations. Nucleic
87
Acids Res. 2005;33(22):7048-57. PubMed PMID: 16352865. Pubmed Central PMCID: 1312361.
Epub 2005/12/15. eng.
78. Sklenar H, Wustner D, Rohs R. Using internal and collective variables in Monte Carlo
simulations of nucleic acid structures: chain breakage/closure algorithm and associated
Jacobians. J Comput Chem. 2006 Feb;27(3):309-15. PubMed PMID: 16355439. Epub
2005/12/16. eng.
79. Lavery R, Sklenar H. Defining the structure of irregular nucleic acids: conventions and
principles. Journal of biomolecular structure & dynamics. 1989 Feb;6(4):655-67. PubMed PMID:
2619933.
80. Stella S, Cascio D, Johnson RC. The shape of the DNA minor groove directs binding by
the DNA-bending protein Fis. Genes Dev. 2010 Apr 15;24(8):814-26. PubMed PMID: 20395367.
Pubmed Central PMCID: 2854395. Epub 2010/04/17. eng.
81. Wu Z, Delaglio F, Tjandra N, Zhurkin VB, Bax A. Overall structure and sugar dynamics
of a DNA dodecamer from homo- and heteronuclear dipolar couplings and 31P chemical shift
anisotropy. Journal of biomolecular NMR. 2003 Aug;26(4):297-315. PubMed PMID: 12815257.
82. Bishop EP, Rohs R, Parker SC, West SM, Liu P, Mann RS, et al. A map of minor groove
shape and electrostatic potential from hydroxyl radical cleavage patterns of DNA. ACS chemical
biology. 2011 Dec 16;6(12):1314-20. PubMed PMID: 21967305. Pubmed Central PMCID:
3241897. Epub 2011/10/05. eng.
83. Field Y, Kaplan N, Fondufe-Mittendorf Y, Moore IK, Sharon E, Lubling Y, et al. Distinct
modes of regulation by chromatin encoded through nucleosome positioning signals. PLoS
computational biology. 2008 Nov;4(11):e1000216. PubMed PMID: 18989395. Pubmed Central
PMCID: 2570626. Epub 2008/11/08. eng.
88
84. Mavrich TN, Jiang C, Ioshikhes IP, Li X, Venters BJ, Zanton SJ, et al. Nucleosome
organization in the Drosophila genome. Nature. 2008 May 15;453(7193):358-62. PubMed PMID:
18408708. Pubmed Central PMCID: 2735122.
85. Trifonov EN, Sussman JL. The pitch of chromatin DNA is reflected in its nucleotide
sequence. Proceedings of the National Academy of Sciences of the United States of America.
1980 Jul;77(7):3816-20. PubMed PMID: 6933438. Pubmed Central PMCID: 349717. Epub
1980/07/01. eng.
86. Satchwell SC, Drew HR, Travers AA. Sequence periodicities in chicken nucleosome
core DNA. Journal of molecular biology. 1986 Oct 20;191(4):659-75. PubMed PMID: 3806678.
Epub 1986/10/20. eng.
87. Perez A, Marchan I, Svozil D, Sponer J, Cheatham TE, 3rd, Laughton CA, et al.
Refinement of the AMBER force field for nucleic acids: improving the description of
alpha/gamma conformers. Biophysical journal. 2007 Jun 1;92(11):3817-29. PubMed PMID:
17351000. Pubmed Central PMCID: 1868997. Epub 2007/03/14. eng.
88. Slattery M, Ma L, Negre N, White KP, Mann RS. Genome-wide tissue-specific
occupancy of the Hox protein Ultrabithorax and Hox cofactor Homothorax in Drosophila. PloS
one. 2011;6(4):e14686. PubMed PMID: 21483663. Pubmed Central PMCID: 3071676. Epub
2011/04/13. eng.
89. Joshi R, Passner JM, Rohs R, Jain R, Sosinsky A, Crickmore MA, et al. Functional
specificity of a Hox protein mediated by the recognition of minor groove structure. Cell. 2007
Nov 2;131(3):530-43. PubMed PMID: 17981120. eng.
90. Noyes MB, Christensen RG, Wakabayashi A, Stormo GD, Brodsky MH, Wolfe SA.
Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition
sites. Cell. 2008 Jun 27;133(7):1277-89. PubMed PMID: 18585360. Pubmed Central PMCID:
2478728. Epub 2008/07/01. eng.
89
91. Hueber SD, Weiller GF, Djordjevic MA, Frickey T. Improving Hox protein classification
across the major model organisms. PloS one. 2010;5(5):e10820. PubMed PMID: 20520839.
Pubmed Central PMCID: 2876039.
92. Lutz B, Lu HC, Eichele G, Miller D, Kaufman TC. Rescue of Drosophila labial null mutant
by the chicken ortholog Hoxb-1 demonstrates that the function of Hox genes is phylogenetically
conserved. Genes & development. 1996 Jan 15;10(2):176-84. PubMed PMID: 8566751.
93. McGinnis N, Kuziora MA, McGinnis W. Human Hox-4.2 and Drosophila deformed
encode similar regulatory specificities in Drosophila embryos and larvae. Cell. 1990 Nov
30;63(5):969-76. PubMed PMID: 1979526.
94. Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S. MEGA5: Molecular
Evolutionary Genetics Analysis Using Maximum Likelihood, Evolutionary Distance, and
Maximum Parsimony Methods. Mol Biol Evol. 2011 Oct;28(10):2731-9. PubMed PMID:
WOS:000295184200003. English.
95. Oliveri M, Daga A, Lunardi C, Navone R, Millo R, Puccetti A. DNase I behaves as a
transcription factor which modulates Fas expression in human cells. Eur J Immunol. 2004
Jan;34(1):273-9. PubMed PMID: WOS:000188412400029. English.
96. Suck D, Lahm A, Oefner C. Structure refined to 2A of a nicked DNA octanucleotide
complex with DNase I. Nature. 1988 Mar 31;332(6163):464-8. PubMed PMID: 3352748. Epub
1988/03/31. eng.
97. Weston SA, Lahm A, Suck D. X-ray structure of the DNase I-d(GGTATACC)2 complex
at 2.3 A resolution. Journal of molecular biology. 1992 Aug 20;226(4):1237-56. PubMed PMID:
1518054. Epub 1992/08/20. eng.
98. Hogan ME, Roberson MW, Austin RH. DNA Flexibility Variation May Dominate Dnase-I
Cleavage. Proceedings of the National Academy of Sciences of the United States of America.
1989 Dec;86(23):9273-7. PubMed PMID: WOS:A1989CC53300042. English.
90
99. Dorschner M, Hawrylycz M, Humbert R, Wallace JC, Shafer A, Kawamoto J, et al. High-
throughput localization of functional elements by quantitative chromatin profiting. Nat Methods.
2004 Dec;1(3):219-25. PubMed PMID: WOS:000226753900012. English.
100. Sabo PJ, Kuehn MS, Thurman R, Johnson BE, Johnson EM, Hua C, et al. Genome-
scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. Nat Methods. 2006
Jul;3(7):511-8. PubMed PMID: WOS:000238669600012. English.
101. Hesselberth JR, Chen X, Zhang Z, Sabo PJ, Sandstrom R, Reynolds AP, et al. Global
mapping of protein-DNA interactions in vivo by digital genomic footprinting. Nat Methods. 2009
Apr;6(4):283-9. PubMed PMID: 19305407. Pubmed Central PMCID: 2668528. Epub 2009/03/24.
eng.
102. Suzuki MM, Bird A. DNA methylation landscapes: provocative insights from epigenomics.
Nature Reviews Genetics. 2008 Jun;9(6):465-76. PubMed PMID: WOS:000255953500014.
English.
103. Kangaspeska S, Stride B, Metivier R, Polycarpou-Schwarz M, Ibberson D, Carmouche
RP, et al. Transient cyclical methylation of promoter DNA. Nature. 2008 Mar 6;452(7183):112-
U14. PubMed PMID: WOS:000253671900057. English.
104. Pennings S, Allan J, Davey CS. DNA methylation, nucleosome formation and positioning.
Briefings in functional genomics & proteomics. 2005 Feb;3(4):351-61. PubMed PMID: 15814025.
105. Bird A. DNA methylation patterns and epigenetic memory. Genes & development. 2002
Jan 1;16(1):6-21. PubMed PMID: 11782440.
106. Eckhardt F, Lewin J, Cortese R, Rakyan VK, Attwood J, Burger M, et al. DNA
methylation profiling of human chromosomes 6, 20 and 22. Nat Genet. 2006 Dec;38(12):1378-
85. PubMed PMID: WOS:000242404200014. English.
107. Rozenberg JM, Shlyakhtenko A, Glass K, Rishi V, Myakishev MV, FitzGerald PC, et al.
All and only CpG containing sequences are enriched in promoters abundantly bound by RNA
91
polymerase II in multiple tissues. Bmc Genomics. 2008 Feb 5;9. PubMed PMID:
WOS:000253988300001. English.
108. Rishi V, Bhattacharya P, Chatterjee R, Rozenberg J, Zhao JF, Glass K, et al. CpG
methylation of half-CRE sequences creates C/EBP alpha binding sites that activate some
tissue-specific genes. Proceedings of the National Academy of Sciences of the United States of
America. 2010 Nov 23;107(47):20311-6. PubMed PMID: WOS:000284529000033. English.
109. Suck D. DNA recognition by DNase I. Journal of molecular recognition : JMR. 1994
Jun;7(2):65-70. PubMed PMID: 7826675. Epub 1994/06/01. eng.
110. Lahm A, Suck D. DNase I-induced DNA conformation. 2 A structure of a DNase I-
octamer complex. J Mol Biol. 1991 Dec 5;222(3):645-67. PubMed PMID: 1748997. Epub
1991/12/05. eng.
111. Lomonossoff GP, Butler PJ, Klug A. Sequence-dependent variation in the conformation
of DNA. J Mol Biol. 1981 Jul 15;149(4):745-60. PubMed PMID: 6273590. Epub 1981/07/15. eng.
112. Brukner I, Jurukovski V, Savic A. Sequence-dependent structural variations of DNA
revealed by DNase I. Nucleic acids research. 1990 Feb 25;18(4):891-4. PubMed PMID:
2179873. Pubmed Central PMCID: 330342. Epub 1990/02/25. eng.
113. Brukner I, Sanchez R, Suck D, Pongor S. Sequence-dependent bending propensity of
DNA as revealed by DNase I: parameters for trinucleotides. Embo J. 1995 Apr 18;14(8):1812-8.
PubMed PMID: 7737131. Pubmed Central PMCID: 398274. Epub 1995/04/18. eng.
114. Fox KR. The Effect of Hhai Methylation on DNA Local-Structure. Biochem J. 1986 Feb
15;234(1):213-6. PubMed PMID: WOS:A1986A126300028. English.
115. Kochanek S, Renz D, Doerfler W. Differences in the Accessibility of Methylated and
Unmethylated DNA to Dnase-I. Nucleic Acids Research. 1993 Dec 25;21(25):5843-5. PubMed
PMID: WOS:A1993MR19900006. English.
92
116. Dantas Machado AC, Zhou T, Rao S, Goel P, Rastogi C, Lazarovici A, et al. Evolving
insights on how cytosine methylation affects protein-DNA binding. Briefings in functional
genomics. 2014 Oct 14. PubMed PMID: 25319759.
117. Chodavarapu RK, Feng S, Bernatavichute YV, Chen PY, Stroud H, Yu Y, et al.
Relationship between nucleosome positioning and DNA methylation. Nature. 2010 Jul
15;466(7304):388-92. PubMed PMID: 20512117. Pubmed Central PMCID: 2964354. Epub
2010/06/01. eng.
118. Kelly TK, Liu YP, Lay FD, Liang GN, Berman BP, Jones PA. Genome-wide mapping of
nucleosome positioning and DNA methylation within individual DNA molecules. Genome
research. 2012 Dec;22(12):2497-506. PubMed PMID: WOS:000311895500018. English.
119. West SM, Rohs R, Mann RS, Honig B. Electrostatic interactions between arginines and
the minor groove in the nucleosome. Journal of biomolecular structure & dynamics. 2010
Jun;27(6):861-6. PubMed PMID: 20232938. Pubmed Central PMCID: 2946858. Epub
2010/03/18. eng.
120. Aduri R, Psciuk BT, Saro P, Taniga H, Schlegel HB, SantaLucia J. AMBER force field
parameters for the naturally occurring modified nucleosides in RNA. J Chem Theory Comput.
2007 Jul-Aug;3(4):1464-75. PubMed PMID: WOS:000247893400022. English.
121. Cooper GM, Shendure J. Needles in stacks of needles: finding disease-causal variants
in a wealth of genomic data. Nature Reviews Genetics. 2011 Sep;12(9):628-40. PubMed PMID:
WOS:000294004100009. English.
122. Genomes Project C, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, et al. A
map of human genome variation from population-scale sequencing. Nature. 2010 Oct
28;467(7319):1061-73. PubMed PMID: 20981092. Pubmed Central PMCID: 3042601.
123. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, et al.
Potential etiologic and functional implications of genome-wide association loci for human
93
diseases and traits. Proceedings of the National Academy of Sciences of the United States of
America. 2009 Jun 9;106(23):9362-7. PubMed PMID: 19474294. Pubmed Central PMCID:
2687147.
124. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou MM, Rosenbloom K, et al.
Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome
research. 2005 Aug;15(8):1034-50. PubMed PMID: WOS:000231032000002. English.
125. Thomas S, Li XY, Sabo PJ, Sandstrom R, Thurman RE, Canfield TK, et al. Dynamic
reprogramming of chromatin accessibility during Drosophila embryo development. Genome
biology. 2011;12(5). PubMed PMID: WOS:000295732700007. English.
126. Stormo GD. Modeling the specificity of protein-DNA interactions. Quantitative Biology. 1.
Berlin Heidelberg: Springer; 2013. p. 115-30.
127. White MA, Myers CA, Corbo JC, Cohen BA. Massively parallel in vivo enhancer assay
reveals that highly local features determine the cis-regulatory function of ChIP-seq peaks. Proc
Natl Acad Sci U S A. 2013 Jul 16;110(29):11952-7. PubMed PMID: 23818646. Pubmed Central
PMCID: 3718143.
128. Zhao Y, Stormo GD. Quantitative analysis demonstrates most transcription factors
require only simple models of specificity. Nat Biotechnol. 2011 Jun;29(6):480-3. PubMed PMID:
21654662. Pubmed Central PMCID: 3111930.
129. Orenstein Y, Shamir R. A comparative analysis of transcription factor binding models
learned from PBM, HT-SELEX and ChIP data. Nucleic Acids Res. 2014 Feb 5. PubMed PMID:
24500199.
130. Weirauch MT, Cote A, Norel R, Annala M, Zhao Y, Riley TR, et al. Evaluation of
methods for modeling transcription factor sequence specificity. Nat Biotechnol. 2013
Feb;31(2):126-34. PubMed PMID: 23354101. Pubmed Central PMCID: 3687085.
94
131. Gordâ n R, Shen N, Dror I, Zhou T, Horton J, Rohs R, et al. Genomic regions flanking E-
box binding sites influence DNA binding specificity of bHLH transcription factors through DNA
shape. Cell Rep. 2013 Apr 25;3(4):1093-104. PubMed PMID: 23562153. Pubmed Central
PMCID: 3640701.
132. Sharon E, Lubliner S, Segal E. A feature-based approach to modeling protein-DNA
interactions. PLoS computational biology. 2008;4(8):e1000154. PubMed PMID: 18725950.
Pubmed Central PMCID: 2516605.
133. Caelli T, Bischof WF. Machine learning paradigms for pattern recognition and image
understanding. Spatial vision. 1996;10(1):87-103. PubMed PMID: 8817773.
134. Zhang X, Dantas Machado AC, Ding Y, Chen Y, Lu Y, Duan Y, et al. Conformations of
p53 response elements in solution deduced using site-directed spin labeling and Monte Carlo
sampling. Nucleic Acids Res. 2014 Feb;42(4):2789-97. PubMed PMID: 24293651. Pubmed
Central PMCID: 3936745.
135. Lazarovici A, Zhou T, Shafer A, Dantas Machado AC, Riley TR, Sandstrom R, et al.
Probing DNA shape and methylation state on a genomic scale with DNase I. Proc Natl Acad Sci
U S A. 2013 Apr 16;110(16):6376-81. PubMed PMID: 23576721. Pubmed Central PMCID:
3631675.
136. Stevens BW, DiBattista AM, William Rebeck G, Green AE. A gene-brain-cognition
pathway for the effect of an Alzheimers risk gene on working memory in young adults.
Neuropsychologia. 2014 Aug;61:143-9. PubMed PMID: 24967550.
137. Vapnik VN. The Nature of Statistical Learning Theory. New York: Springer; 1995.
138. Dror I, Zhou T, Mandel-Gutfreund Y, Rohs R. Covariation between homeodomain
transcription factors and the shape of their DNA binding sites. Nucleic Acids Res. 2014
Jan;42(1):430-41. PubMed PMID: 24078250. Pubmed Central PMCID: 3874178.
95
139. Yang L, Zhou T, Dror I, Mathelier A, Wasserman WW, Gordan R, et al. TFBSshape: a
motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res.
2014 Jan 1;42(1):D148-55. PubMed PMID: 24214955.
140. Zhou T, Shen N, Yang L, Abe N, Horton J, Mann RS, et al. Quantitative Modeling of
Transcription Factor Binding Specificities Using DNA Shape. Under Review.
141. Brownlie P, Ceska T, Lamers M, Romier C, Stier G, Teo H, et al. The crystal structure of
an intact human Max-DNA complex: new insights into mechanisms of transcriptional control.
Structure. 1997 Apr 15;5(4):509-20. PubMed PMID: 9115440. eng.
142. Kosloff M, Kolodny R. Sequence-similar, structure-dissimilar protein pairs in the PDB.
Proteins. 2008 May 1;71(2):891-902. PubMed PMID: 18004789. Pubmed Central PMCID:
2673347.
143. Drucker H, Burges CJC, Kaufman L, Smola A, Vapnik V. Support vector regression
machines. Adv Neur In. 1997;9:155-61. PubMed PMID: WOS:A1997BH93C00022. English.
144. Chang CC, Lin CJ. LIBSVM: A Library for Support Vector Machines. Acm T Intel Syst
Tec. 2011;2(3). PubMed PMID: WOS:000208617000010. English.
96
Appendix A. Cofactor binding evokes latent differences in
DNA binding specificity between Hox proteins
97
Appendix B. DNAshape: a method for the high-throughput
prediction of DNA structural features on a genomic scale
98
Appendix C. Genomic regions flanking E-box binding sites
influence DNA binding specificity of bHLH transcription
factors through DNA shape
99
Appendix D. Covariation between homeodomain
transcription factors and the shape of their DNA binding
sites
100
Appendix E. Probing DNA shape and methylation state on a
genomic scale with DNase I
101
102
Appendix F. TFBSshape: a motif database for DNA shape
features of transcription factor binding sites
103
Appendix G. Absence of a simple code: how transcription
factors read the genome
104
Appendix H. Quantitative modeling of transcription factor
binding specificities using DNA shape
105
Appendix I. Evolving insights on how cytosine methylation
affects protein-DNA binding
106
Appendix J. GBshape: a genome browser database for DNA
shape annotations
Abstract (if available)
Abstract
Gene regulation is orchestrated by the binding of transcription factors (TFs) to their specific target DNA sequence. An increasing number of structural biology and genomics studies associate protein-DNA binding with the recognition of three-dimensional DNA structure, or “DNA shape”. Retrieval of local nuances in DNA shape is therefore a fundamental step to understand the general mechanisms of gene regulation. ❧ Due to the limitation of X-ray crystallography and NMR spectroscopy, molecular simulations have long been the only available approach to deduce atomic information on intrinsic DNA structure. Despite significant recent advances, the bottleneck for any molecular simulation is that only a limited number of DNA sequences can be pursued. To overcome this limitation, we have developed a high-throughput (HT) method for DNA shape prediction based on the mining of atomic resolution data obtained from Monte Carlo (MC) simulations of numerous DNA fragments. ❧ Applying the HT method to predict the DNA shape of many SELEX-seq-identified Hox binding sites, we revealed that closely related Hox paralogs achieve their distinct function in vivo through differential binding preference for DNA sequences with distinct minor groove topographies. Analyzing millions of DNA backbone cleavage events on naked genomic DNA conducted by the endonuclease DNase I, we showed that CpG methylation modulates DNase I-DNA interaction strengths via the remodeling of DNA shape. Examining the effect of millions of single nucleotide variants (SNVs) on local DNA minor groove width (MGW) pattern, we proved that purifying selection acts on DNA shape as well as nucleotide sequence. ❧ To gain a comprehensive, systematic and quantitative view of DNA shape effects on protein-DNA binding, we integrated intuitive DNA shape features in the modeling of TF binding specificities using statistical machine learning approaches. We compared the quantitative predictions of DNA binding specificities for 68 mouse and human TFs using shape-augmented models with those using sequence-based models. Based on our model evaluation, which included cross-validation on specific PBM array designs, testing across different PBM array designs, and using PBM-trained models to predict relative binding affinities derived from SELEX-seq data, we showed that shape-augmented models compared favorably to sequence-based models. Besides, whereas adding either k-mer or DNA shape features to sequence-only models improves the modeling of DNA binding specificities, using DNA shape that represents interdependencies implicitly with smaller numbers of features yields more efficient models. ❧ In summary, my work on the DNA shape prediction and its application adds a new dimension to genome-wide studies of gene regulation and evolution.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Genome-wide studies of protein–DNA binding: beyond sequence towards biophysical and physicochemical models
PDF
Machine learning of DNA shape and spatial geometry
PDF
Quantitative modeling of in vivo transcription factor–DNA binding and beyond
PDF
Profiling transcription factor-DNA binding specificity
PDF
Understanding protein–DNA recognition in the context of DNA methylation
PDF
DNA shape at transcription factor binding sites: from purifying selection to a new alphabet
PDF
Decoding protein-DNA binding determinants mediated through DNA shape readout
PDF
3D modeling of eukaryotic genomes
PDF
Deciphering protein-nucleic acid interactions with artificial intelligence
PDF
Genome-wide studies reveal the isoform selection of genes
PDF
The relationship between DNA methylation and transcription factor binding in colon cancer cells
PDF
Identifying allele-specific DNA methylation in mammalian genomes
PDF
Simulating the helicase motor of SV40 large tumor antigen
PDF
Improved methods for the quantification of transcription factor binding using SELEX-seq
PDF
Mapping 3D genome structures: a data driven modeling method for integrated structural analysis
PDF
Site-directed spin labeling studies of sequence-dependent DNA shape and protein-DNA recognition
PDF
Data-driven approaches to studying protein-DNA interactions from a structural point of view
PDF
Exploring the application and usage of whole genome chromosome conformation capture
PDF
Efficient algorithms to map whole genome bisulfite sequencing reads
PDF
The function of Rpd3 in balancing the replicaton initiation of different genomic regions
Asset Metadata
Creator
Zhou, Tianyin
(author)
Core Title
Genome-wide studies reveal the function and evolution of DNA shape
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
11/03/2014
Defense Date
11/01/2014
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
DNA methylation,DNA shape,gene evolution,gene regulation,modeling of transcription factor binding specificity,OAI-PMH Harvest,protein-DNA interaction,purifying selection,transcription factor
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Rohs, Remo (
committee chair
), Alber, Frank (
committee member
), Nakano, Aiichiro (
committee member
), Waterman, Michael S. (
committee member
)
Creator Email
tianyinz@usc.edu,zhoutianyin@hotmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-512386
Unique identifier
UC11298638
Identifier
etd-ZhouTianyi-3058.pdf (filename),usctheses-c3-512386 (legacy record id)
Legacy Identifier
etd-ZhouTianyi-3058.pdf
Dmrecord
512386
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Zhou, Tianyin
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
DNA methylation
DNA shape
gene evolution
gene regulation
modeling of transcription factor binding specificity
protein-DNA interaction
purifying selection
transcription factor