Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Genome-wide studies of protein–DNA binding: beyond sequence towards biophysical and physicochemical models
(USC Thesis Other)
Genome-wide studies of protein–DNA binding: beyond sequence towards biophysical and physicochemical models
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
GENOME-WIDE STUDIES OF PROTEIN–DNA BINDING:
BEYOND SEQUENCE TOWARDS BIOPHYSICAL AND
PHYSICOCHEMICAL MODELS
by
Tsu-Pei Chiu
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computational Biology and Bioinformatics)
Thesis Committee:
Dr. Remo Rohs (Chair)
Dr. Frank Alber
Dr. Aiichiro Nakano
Dr. Michael S. Waterman
August 2018
2
We cannot solve our problems with the same thinking we used when we created them.
- Albert Einstein
3
Abstract
Protein–DNA binding is a fundamental component of gene regulatory processes, but it is
still not completely understood how proteins recognize their target sites in the genome.
Transcription factors (TFs) and other DNA binding proteins employ two different DNA readout
mechanisms to recognize their genomic target sites, including sequence-based readout of direct
contacts with the functional groups of the bases (base readout) and structure-based readout of
intrinsic deviations from a canonical double helix (shape readout). A mechanistic understanding
of how TFs recognize their target binding sites will provide insights into mechanisms underlying
diseases and therefore improve human health.
To be able to derive DNA structural features for whole-genome, we developed an
R/Bioconductor package, DNAshapeR. This method extends an existing high-throughput method
based on mining of all-atom Monte Carlo simulations, and further encodes DNA sequence and
shape features which can be readily applied to modeling studies. We further developed GBshape
that provides DNA shape features and hydroxyl radical cleavage predictions for the entire genomes
of more than a hundred organisms. With this tool, we illustrated the periodicity of DNA shape
features that are present in nucleosome-occupied sequences from human, fly and worm, and
demonstrated structural similarities between transcription start sites (TSSs) in the genomes of four
Drosophila species.
However, DNA shape is a proxy for biophysical interactions. A/T and C/G base pairs (bps)
carry different functional groups in the minor groove, that affect electrostatic potential (EP) due to
their charges, in addition to their effect on groove width. To probe this biophysical effect directly,
we developed DNAphi for predicting EP in the minor groove and confirmed the direct role of EP
in protein–DNA binding using massive sequencing data. Using statistical machine-learning
4
approaches, we showed that adding EP as a biophysical feature can improve the predictive power
of quantitative binding specificity models across 27 transcription factor families. High-throughput
prediction of EP offers a novel way to integrate biophysical and genomic studies of protein–DNA
binding.
The rapid advancement of high-throughput assays result in enormous data generation. In
order to leverage data of such proportions for unraveling TF-binding mechanism, a powerful
computational framework is required. Conventional machine learning approaches, such as support
vector regression (SVR) and multiple linear regression (MLR), require pre-processing steps. The
pre-processing step, for example, alignment to a known DNA motif, limits these methods in
exploiting the complete data derived from high-throughput binding assays and, in turn, introduces
biases. Canonical convolutional neural networks (CNNs) possess many advantages over
traditional machine learning methods, but only utilize DNA sequence as a feature, thereby
resulting in the lack of biophysical interpretability. To address these issues, we developed DeepRec,
a multi-module deep learning framework capable of mining physicochemicalsignatures of DNA.
We demonstrated applications of our method for multiple TFs and were able to elucidate important
insights into the TF-binding mechanism.
As a tangential research direction, we included two cutting-edge technologies: quantum
computing and brain imaging. We demonstrated the possible application of modeling TF–DNA
binding mechanisms by leveraging the power of quantum computing. On the other hand, we
analyzed cellular signals from large-scale brain imaging data with CNNs, aiming at uncovering
insights linked to neuronal activities and disorders.
5
Acknowledgement
First and foremost I would like to express my sincere gratitude to my advisor Prof. Remo
Rohs who has supported me throughout my Ph.D. with his patience, motivation, and immense
knowledge. His positive attitude and encouragement helped me overcome various challenges and
difficulties without fear and ultimately inspired me to become a better person.
My sincere thanks also go to my dissertation committee members Prof. Michael S.
Waterman, Prof. Aiichiro Nakano and Prof. Frank Alber, and Ph.D. guidance committee members
Prof. Sergey Nuzhdin and Prof. Christoph Haselwandter. Their insightful comments enriched and
widened my research from various perspectives, and their encouragement strengthened my
confidence in the future challenges yet to face.
Profound gratitude goes to my collaborators Prof. Barry Honig and Prof. Richard S. Mann
from Columbia University, for being supportive to my research and paper. They demonstrated the
perfect role model for a scientist by conducting their science research with enthusiasm and
dedication.
I thank my fellow labmates Rosa, Iris, Tianyin, Lin, Carolina, Satya, Beibei, Jared, Richard,
Xiaofei, Jinsen, Liana, Brendon, and Yingfei, for their continued support. This dissertation would
not have been possible without their stimulating discussions and contributions.
Special thanks go to my friends Li-Chin Lai, Kuan-Ting Chen, Maggie Deagon, Lawrence
Whitfield, Caroline Chiu, Bryan Li, Oliver Hsu, Danielle Lai, Cora Chiu, Second Chen, and other
friends, for supporting me along the way. Without your support, I would not be able to complete
my Ph.D.
Last but not the least, I am very grateful to my parents, sister, brother in law, cousins Irene
Chiu and Shean Chiu, and all family members from Taiwan, who provided me with unfailing
support and continuous encouragement throughout the time of my Ph.D.
6
Table of Contents
Abstract ......................................................................................................................................................... 3
List of figures ................................................................................................................................................ 9
List of tables ................................................................................................................................................ 10
List of abbreviations ................................................................................................................................... 11
Chapter 1 Introduction ................................................................................................................................ 12
1.1 The central dogma and gene regulation ........................................................................................ 12
1.2 TF–DNA recognition .................................................................................................................... 13
1.3 Binding assays for probing the TF–DNA binding specificities .................................................... 17
1.4 Computational methods for modeling the TF–DNA binding specificities ................................... 18
1.5 Overview for this thesis ................................................................................................................ 19
Chapter 2 DNAshapeR: an R/Bioconductor package for DNA shape prediction and feature encoding .... 21
2.1 Introduction ................................................................................................................................... 21
2.2 Methods and results ...................................................................................................................... 22
2.2.1 High-throughput DNA shape prediction ............................................................................ 22
2.2.2 DNA shape and k-mer feature encoding ............................................................................ 24
2.2.3 Availability and implementation ........................................................................................ 25
2.3 Conclusions ................................................................................................................................... 25
Chapter 3 GBshape: a genome browser database for DNA shape annotations .......................................... 26
3.1 Introduction ................................................................................................................................... 26
3.2 Materials and methods .................................................................................................................. 28
3.2.1 Database architecture and methodology ............................................................................ 28
3.2.2 User interface ..................................................................................................................... 30
3.3 Results and discussion .................................................................................................................. 33
3.3.1 Nucleosome binding sites .................................................................................................. 33
3.3.2 Transcription start sites (TSSs) .......................................................................................... 37
3.4 Conclusions ................................................................................................................................... 39
Chapter 4 DNAphi: a method for high-throughput prediction of minor groove electrostatic potential ..... 40
4.1 Introduction ................................................................................................................................... 40
4.2 Materials and methods .................................................................................................................. 45
4.2.1 Poisson–Boltzmann calculation of EP ............................................................................... 45
4.2.2 High-throughput prediction of EP ...................................................................................... 46
4.2.3 EP-augmented protein–DNA binding models ................................................................... 47
4.3 Results and discussion .................................................................................................................. 49
7
4.3.1 Validation of EP prediction ................................................................................................ 49
4.3.2 Correlation between EP and MGW.................................................................................... 52
4.3.3 Correlation of EP and Fis protein binding affinity ............................................................. 55
4.3.4 EP-based modeling of DNA binding affinity for 27 protein families ................................ 57
4.3.5 Base versus phosphate EP contributions in quantitative models ....................................... 60
4.3.6 EP contributions to Hox-DNA binding specificity ............................................................ 61
4.4 Conclusions ................................................................................................................................... 64
Chapter 5 DeepRec: a deep learning method for the discovery of physicochemicalsignature in DNA for
aaaaaaaaaaTF binding ................................................................................................................................. 66
5.1 Introduction ................................................................................................................................... 66
5.2 Materials and methods .................................................................................................................. 70
5.2.1 Deep learning framework ................................................................................................... 70
5.2.2 DNA physicochemicalsignatures and feature encoding..................................................... 72
5.2.3 DeepRec model design ....................................................................................................... 74
5.2.4 Interpretation ...................................................................................................................... 77
5.3 Results and discussion .................................................................................................................. 79
5.3.1 Sequence-based models fail to determine physico-chemistry-based models ..................... 79
5.3.2 DeepRec predicts DNA contacts of TF–DNA binding ...................................................... 80
5.3.3 DeepRec predicts the impact of DNA modification on TF binding specificities .............. 84
5.3.4 DeepRec predicts TF binding preference in the context of DNA grooves ........................ 86
5.4 Conclusions ................................................................................................................................... 89
Chapter 6 Discussion and future work ........................................................................................................ 90
Chapter 7 Tangential research I – Quantum computing ............................................................................. 93
7.1 Introduction ................................................................................................................................... 93
7.2 Materials and methods .................................................................................................................. 94
7.2.1 The mapping of Ising/Qubo model .................................................................................... 94
7.2.2 Time complexity comparison and analysis ........................................................................ 95
7.2.3 Binary representation ......................................................................................................... 96
7.2.4 Reconstruction of MLR hypothesis function ..................................................................... 97
7.2.5 Embedding of qubit ............................................................................................................ 97
7.2.6 Rescale ............................................................................................................................... 98
7.2.7 Experimental setup on D-Wave ......................................................................................... 98
7.3 Results and discussion .................................................................................................................. 99
7.4 Conclusions and discussion ........................................................................................................ 103
Chapter 8 Tangential research II – Zebrafish brain imaging .................................................................... 104
8
8.1 Introduction ................................................................................................................................. 104
8.2 Materials and methods ................................................................................................................ 105
8.3 Results and discussion ................................................................................................................ 106
8.4 Conclusions ................................................................................................................................. 108
Supplementary material for Chapter 2 – DNAshapeR .............................................................................. 109
Supplementary material for Chapter 3 – GBshape ................................................................................... 120
Supplementary materials for Chapter 4 – DNAphi ................................................................................... 123
References ................................................................................................................................................. 143
9
List of figures
1.1 Physicochemicalsignatures in the DNA major and minor groove. ....................................................... 14
1.2 Schematic representations of four standard DNA shapes. .................................................................... 16
2.1 Schematic representations of an expanded repertoire of 13 DNA shape features. ............................... 23
2.2 Flowchart of DNAshapeR analysis. ...................................................................................................... 24
3.1 Architecture of the GBshape database. ................................................................................................. 29
3.2 Visual display of GBshape annotations in the genome browser. .......................................................... 32
3.3 Variation in MGW and ORChID2 signals on average in nucleosome sequences. ............................... 35
3.4 Variation in Roll, HelT and ProT on average in nucleosome sequences. ............................................. 36
3.5 Average heat maps for four DNA shape features of TSSs .................................................................... 38
4.1 Functional groups of C/G and A/T bp exposed on major and minor grooves. ..................................... 42
4.2 Overview of DNAphi method. .............................................................................................................. 43
4.3 Illustration of possible applications of DNAphi. .................................................................................. 44
4.4. Validation of HT EP predictions using TF–DNA binding sites. ......................................................... 51
4.5 Comparison of EP and MGW distributions of 512 unique pentamers categorized by central bp. ....... 53
4.6 Scatter plot for EP and MGW values of all 512 unique pentamers. ..................................................... 54
4.7 High-throughput EP predictions for Fis-binding sites. ......................................................................... 56
4.8 Performance comparison of binding specificity predictions for 239 TFs ............................................. 60
4.9 Figure 4.9. Heat map of average EP at each position of 16-mers selected by each Exd-Hox
aaaaheterodimer. ......................................................................................................................................... 63
5.1 Physicochemicalstandard and expanded sigantures in the DNA major and minor groove. .................. 68
5.2 Multimodal deep learning framework. .................................................................................................. 71
5.3 Modeling TF–DNA interactions using DeepRec. ................................................................................. 72
5.4 Functional groups of G/C and A/T bps exposed on DNA major and minor grooves. .......................... 73
5.5 DNA physicochemical energy logos. .................................................................................................... 79
5.6 DNA physicochemical energy logos and structure of MAX-5caC complex (PDB ID 5EYO). ........... 83
5.7 Geometry of physicochemical signatures and corresponding structure of MAX-5caC complex. ........ 86
5.8 DNA physicochemical energy logos and structure of MEF2 complex (PDB ID 1N6J). ...................... 88
7.1 Predictive performance with respect to different modleing strategy. ................................................. 101
7.2 Predictive performance with respect to different modleing strategy. ................................................. 102
7.3 Predictive performance with respect to different modleing strategy. ................................................. 103
8.1 Framework of convolutional neuron network for brain imaging data. ............................................... 106
8.2 Model performance measurements using data with different time frame. .......................................... 107
8.3 Visualization of back-propagation of weights. ................................................................................... 108
10
List of tables
3.1 Current number of genome repositories in GBshape. ........................................................................... 27
7.1 Binary representation for 5 bits. ............................................................................................................ 96
8.1 Sample size with different time frame size ......................................................................................... 106
11
List of abbreviations
Abbreviation Definition
Bp Base pair
CNNs Convolutional neural networks
CREs Cis-regulatory elements
DBDs DNA-binding domains
EP Electrostatic potential
HT high-throughput
ML Machine learning
MLR Multiple linear regression
TF Transcription factor
TFBSs
TSSs
Transcription factor–DNA binding sites
Transcription start sites
12
Chapter 1 Introduction
1.1 The central dogma and gene regulation
A myriad of organisms inhabiting our planet is specified by their genomes that carry
biological information needed for development and functions. Most genomes from high- to low-
level organisms are made of DNA (Deoxyribonucleic acid) while some genomes from few viruses
are made of RNA (Ribonucleic acid). DNA, a double-stranded molecule, is composed of building
blocks called nucleotides. Each nucleotide is made up of three components: a phosphate group, a
5-carbon sugar and a nitrogenous base – either adenine (A), cytosine (C), guanine (G) or thymine
(T). The five-carbon sugar of a nucleotide is covalently bonded to the phosphate group of the next
nucleotide, forming a sugar-phosphate backbone of one DNA strand. Two DNA strands are bound
together by hydrogen bonds formed by a complementary nucleotide base pairing (either an adenine
pairs with a thymine or a guanine pairs with a cytosine), giving DNA a double helix structure.
A gene is a segment of DNA that encodes for a specific protein. The flow of passing genetic
information starts from a transcription process (transcribing a gene into a message RNA, mRNA),
then performs a translation process (translating a mRNA into an amino acid sequence), and ends
with a final product – a protein, which is a functional molecule involved in biological processes
such as providing the body structure, catalyzing metabolic reactions, transporting molecular
materials, and activating immune responses. However, although coding DNA has essential
functions to the biological processes, a large proportion of DNA consists of non-coding DNA. In
the human genome, for example, over 98% of DNA is non-coding DNA, implying that the non-
coding DNA might also have particular functionalities.
13
Cis-regulatory elements (CREs), found in non-coding regions, regulate the transcription of
neighboring genes by functioning as binding sites for transcription factors (TFs). TFs are particular
proteins that utilize a wide range of DNA-binding structural motifs (DNA-binding domains, DBDs)
to recognize DNA binding sites (TFBSs); TFs can work alone or cooperate with other proteins in
a complex to facilitate DNA recognitions. The TF–DNA binding controls the transcription rate by
activating or repressing the recruitment of RNA polymerase (the enzyme transcribing genetic
information from DNA to mRNA). Some TFs stabilize the binding by directly interacting with
RNA polymerase, while other TFs exclude the binding by blocking the binding sites. The
transcriptional regulation can be more complicated. For example, different types of cells possess
particular functions as well as express a distinct set of proteins even though they share the same
gene content. There must be mechanisms to turn specific genes on and off by particular TFs in
response to the local environment. In addition, many putative binding sites along the genome share
the same core motif sequence, but only a small subset of them is bond by TFs; this suggests that
there must be a sophisticated mechanism for TF–DNA recognition. Understanding TF–DNA
recognition and binding mechanisms is conducive to deciphering gene regulation networks.
1.2 TF–DNA recognition
TF–DNA recognition is achieved through a three-dimensional structure interaction of TFs
and DNA; TFs can make multiple contacts to the edges of DNA bases. The interaction can be
divided into two distinct readout modes [1-4] including sequence-based readout of direct contacts
with the functional groups of the bases (base readout) [5-7] and structure-based readout of intrinsic
deviations from a canonical double helix (shape readout) [2, 8, 9].
Base readout is one way that TFs recognize DNA bases through the major groove where
hydrogen bond formation and hydrophobic interactions mainly occur. The major groove of the
14
DNA helix offers a set of base-specific hydrogen bond acceptors, donors, methyl groups and
nonpolar hydrogens that can be recognized by TFs with a set of complementary physicochemical
signatures (Figure 1.1A) [2]. The four possible base pairs, as shown in Figure 1.1B, can be
distinguished by their unique patterns of hydrogen bond acceptors and donors in the major groove.
The recognition of hydrogen bond pattern can also happen in the minor groove, but due to the
degeneracy, the A/T and T/A base pairs (bps), or C/G and G/C bps cannot be identified (Figure
1.1B). The interaction mediated by the formation of hydrogen bonds between TF and DNA bases
conveys a high degree of binding specificity. As a prominent example, arginine can recognize a
guanine base through bidentate interactions, in which a single side chain forms two hydrogen
bonds with DNA [4, 10]. Hydrophobic contacts with bases can also contribute the binding
specificity. For example, protein residues can employ hydrophobic interactions to differentiate
thymine from cytosine [11-13].
Figure 1.1 Physicochemicalsignatures in the DNA major and minor groove.
(A) Physicochemicalpatterns on the edges of the bases in the major and minor groove enable proteins to readout
DNA through hydrogen bonds and hydrophobic contacts. The signatures include hydrogen- bond acceptor (in
15
red) and donor (in blue), methyl group (in yellow) and the nonpolar hydrogen (in gray). (B) The pattern is
sequence-specific in the major groove but degenerate in the minor groove.
However, the base readout through hydrogen bonds or hydrophobic contacts cannot
explain the complete story of TF–DNA interactions. Shape readout, in which TFs read DNA
through a three-dimension structure of DNA (DNA shape), is the other type of interaction. Slight
base pair movement or rotation can lead to various DNA shape variations, which can be
quantitatively represented by base-pair variables. For instance, propeller twist (ProT) describes the
relative rotation between bases within a bp with reference to the base pairing axis; helix twist
(HelT) describes the relative rotation between adjacent base pairs with reference to the helix axis;
Roll describes the relative rotation between adjacent base pairs with reference to the base pairing
axis [14-16], as shown in Figure 1.2.
Minor groove width (MGW) is another variable varying in DNA structure. The MGW of
a given bp i can be measured by the distance between the O4’ atoms of nucleotides i+1 in strand
1 and i-2 in strand 2, as shown in Figure 1.2. Narrower minor groove width can induce a more
negative electrostatic potential in the minor groove [17] due to the compression of electric field
lines, which can attract positively charged protein residues such as arginine [17], lysine [18] and
histidine [19]. For example, the flexible N-terminal arms of homeodomain proteins can insert into
the minor groove and play a role in TF–DNA binding recognition [17, 20].
16
Figure 1.2 Schematic representations of four standard DNA shapes.
The schematic representations include minor groove width (MGW), propeller twist (ProT), helix twist (HelT) and
Roll as defined by Curves [21].
Most TF–DNA binding recognitions involve a subtle interplay between the base and shape
readout [3]. The contributions of base and shape readout to the TF–DNA binding specificity vary
across TF families. In extreme cases, the DNA recognition can be dominated by either base or
shape readout. bHLH protein Pho4, for example, recognizes DNA mainly by the base readout [22],
while high-motility group (HMG) box proteins bind DNA through the minor groove recognition
[2]. The other cases of TFs are more or less equally use both base and shape readout modes, such
as Hox-Exd heterodimer [3].
TF–DNA binding is a fundamental mechanism in gene regulation and therefore plays an
essential role in every biological process including the development of an embryo and even a
cancer. Deciphering the factors that influence TF-binding processes and exploring their biological
insights in structural and genomics perspectives can improve our understanding of TF–DNA
binding mechanism, and will eventually contribute to the improvement of human health.
17
1.3 Binding assays for probing the TF–DNA binding specificities
In vivo technologies for studying TF–DNA biding and transcriptional regulation have been
developed over the course of the last decade. The most widely used in vivo method is chromatin
immunoprecipitation, followed by sequencing (ChIP-seq) [23-25]. This method allows us to
identify genome-wide DNA binding sites for a TF of interest. Several advances to the ChIP-seq
protocol further improve the precision of determining TF–DNA binding sites in the genome. ChIP-
exo [26, 27] and ChIP-nexus [28], for example, are capable of identifying the DNA binding sites
at nearly nucleotide-resolution; CUT&RUN [29] avoids cross-linking process, thus yielding a
relatively unbiased transcription binding profile. Although these in vivo methods serve as a
platform for understanding TF–DNA binding on a global scale, the relationship between sequence
and TF-binding affinity is difficult to determine, because the presence of many in vivo factors such
as chromatin accessibility, nucleosome occupancy, cofactor event, and cooperative DNA-binding
of TFs severely increase the complexity in relation to determining the binding mechanism.
In vitro technologies provide an alternative path to uncover the TF–DNA binding
mechanism with the help of a quantitative measurement of TF binding against thousands or even
millions of different DNA probes to characterize intrinsic TF binding preferences. Microarray-
based methods such as protein-binding microarrays (PBMs) [30, 31] quantify TF binding affinities
for all possible eight-bp sequences, using a fluorescent tag with a wide dynamic range. Genomic-
context PBMs (gcPBMs) [32] use custom protein microarrays with DNA probes centered within
native genomic sequences to investigate the contributions of flanking genomic sequences to TF–
DNA recognition. Parallel with microarray-based methods, methods based on next-generation
sequencing (NGS), such as systematic evolution of ligands by enrichment followed by massively
parallel sequencing (SELEX-seq [33] or HT-SELEX [34-36]), measure TF-binding affinity to
18
millions of oligonucleotides over multiple repeats of enrichment and amplification cycles; thus it
is in their nature to carry out full characterizations of TF-binding specificities. However, due to
the interactive cycles, these types of methods show a weakness toward identifying low-affinity
binding sites. Ideally, the first-round SELEX data should contain the whole information, but due
to its character, the expected count of DNA probe decreases exponentially with its length;
analyzing such data becomes a challenge, and thus require a sensitive and powerful computational
tool.
1.4 Computational methods for modeling the TF–DNA binding specificities
With the rapid development of high-throughput assays for probing TF–DNA binding
specificities, several DNA motif discovery methods have been developed. The most popular
methods model TF–DNA binding preference is based on position weight matrix (PWM) [5, 6, 37],
where each element of the matrix represents the contribution to the overall binding affinity from a
nucleotide at the corresponding position. However, traditional PWM-based methods assume that
each nucleotide independently contribute to the TF–DNA binding, which is untrue for some TFs
[38, 39]. To address this issue, more complex modeling methods such as dinucleotide weight
matrices [40, 41], k-mer based models [42, 43], Hidden Markov models [44, 45], and Baysian
networks [46] have been developed. An alternative representation of interdependencies between
base pairs is DNA 3D structure, which results from physical interactions such as inter-base
stacking and intra-base pairing. Several studies integrated DNA sequence and shape features to
model the TF binding mechanism [47-49]. In addition to the methods based on statistics and pattern
recognition, other methods include MatrixREDUCE [50] and BEEML [35] biophysically model
the non-linear relationship between binding energy and the statistics of the binding sites which
selected by a high-throughput assay. A frequent limitation of the traditional modeling methods is
19
the requirement to align raw data based on the occurrence of a known core motif; this data
preprocessing results in discarding informative data. To address this issue, BEESEM [51],
SelexGLM [52] and NRLB [53] statistically model long sequences containing shorter binding sites
without the prior knowledge of the corresponding binding location. DeepBind [54], on the other
hand, utilized deep convolutional neural networks (CNNs) to capture the binding site from the raw
data without carrying out sequence alignment. The methods modeling TF–DNA binding,
nowadays, all focus on DNA sequence; however, the primary source of TF-binding specificity
comes from physicochemical signatures in DNA. Therefore, methods based on such finer solution
features have the potential to have a better elucidation into the binding mechanism.
1.5 Overview for this thesis
This thesis presents several widely used tools that I have developed during my Ph.D., and
also provides an introduction to uncovering biological insights of TF–DNA binding mechanism
through the use of these tools. DNAshapeR (in Chapter 2) provides 13 DNA shape and four
methylated DNA shape features for any sequence regardless of length; the resulting shape
information can be easily applied to high-throughput genomic analyses. GBshape (in Chapter 3)
predicts DNA shape features for the entire genomes of more than one hundred organisms. With
this tool I have illustrated DNA shape periodicity presenting in nucleosome-occupied sequences,
and also demonstrated structural similarities between transcription start sites in the genomes of
multiple species. To investigate biophysical determinates for TF–DNA recognition, I have
developed DNAphi (in Chapter 4) for the prediction of minor-groove electrostatic potential (EP).
Integrating the biophysical features in the analysis of high-throughput sequencing data enables the
use of EP in genome analysis, with the potential to reveal yet unknown mechanisms of gene
regulation. I have further established a deep learning framework, DeepRec (in Chapter 5), which
20
enables the modeling of TF–DNA interactions at physicochemical resolution in order to reveal the
importance of physicochemical signatures beyond bp resolution. Finally, the discussion and
follow-up studies are covered in Chapter 6. As a tangential research direction, I have demonstrated
the preliminary results with two cutting-edge technologies: quantum computing and brain imaging
(in Chapter 7 and 8).
21
Chapter 2 DNAshapeR: an R/Bioconductor package for DNA shape
prediction and feature encoding
2.1 Introduction
DNA shape readout was originally described based on the analysis of co-crystal structures
of protein–DNA complexes. Studies of DNA shape readout were then extended to massive datasets
of protein-interacting DNA sequences via the use of DNAshape, a method for the high-throughput
prediction of DNA structural features [14]. Rules that determine the binding affinity between TFs
and their binding sites can be statistically learned from the data derived from in vitro high-
throughput binding assays such as PBM [30], gcPBM [32], SELEX-seq [33], HT-SELEX [34-36].
Although sequence-based methods have long been used to model TF binding specificities, high-
throughput prediction of DNA shape enabled us to develop methods that leverage both DNA
sequence and shape information. Trained with either linear regression or support vector regression
algorithms, shape-augmented models were consistently shown to outperform sequence-based
methods in modeling the in vitro binding of TFs quantitatively [47].
DNAshape is currently released as a stand-alone web service [14]. Its pre-defined
functionality and internet bandwidth-bounded performance made it difficult to use in genome-
wide studies. In addition, the method is limited to four standard shape features; nine additional
shape features for improving accuracy and interpretability of model prediction [15] and four
methylated DNA shape features for predicting the effect of cytosine methylation on DNA shape
[55] are not included. To address these issues, we developed DNAshapeR, an R/Bioconductor
package that can generate DNA shape predictions in an easy-to-use, easy-to-integrate and easy-to-
22
extend manner. The output can be readily integrated into other high-throughput genomic analysis
platforms.
2.2 Methods and results
2.2.1 High-throughput DNA shape prediction
The core of DNAshapeR is the DNAshape prediction method [14], which uses a sliding
pentamer window to derive the structural features minor groove width (MGW), helix twist (HelT),
propeller twist (ProT) and Roll from all-atom Monte Carlo simulations of multiple DNA fragments
of 10–27 bp in length. We further expanded the repertoire of DNA shape to 17 features including
nine additional structural features Rise, Shift, Slide, Tilt, Buckle, Opening, Shear, Stagger and
Stretch (Figure 2.1), and four methylated DNA shape features (methyl-MGW, methyl-ProT,
methyl-HelT and methyl-Roll) [15]. These DNA shape features were observed in various co-
crystal structures as playing an important role in achieving protein–DNA binding specificity. High-
throughput predictions of DNA shape have shed light on the DNA binding specificity of TFs [28,
56] and were shown to be predictive of replication origins [57].
The DNAshapeR package enables ultra-fast, high-throughput predictions of shape features
for thousands of genomic sequences and generates various graphical outputs of the data (Figure
2.2). The modular design of DNAshapeR enables the expansion to additional features, such as
conformational flexibility, biophysical properties and methylation status, to be added in future
releases of the DNAshapeR package.
23
Figure 2.1 Schematic representations of an expanded repertoire of 13 DNA shape features.
The schematic representations include minor groove width, six intra-bp DNA shape features, and six inter-bp
DNA shape features, as defined by Curves [21].
24
Figure 2.2 Flowchart of DNAshapeR analysis.
The input data can be either nucleotide sequence(s) in FASTA file format or genomic intervals, provided by the
user in BED format or derived from public databases. The core of DNAshapeR includes a high-throughput
approach for the prediction of DNA shape features. MGW, HelT, ProT and Roll can then be visualized in the
form of plots, heat maps or genome browser tracks or used for the assembly of feature vectors of user-defined
combinations of k-mer and shape features.
2.2.2 DNA shape and k-mer feature encoding
Besides DNA shape predictions and data visualization, DNAshapeR can also be used to
generate feature vectors for user-defined models. These models consist of sequence features (1mer,
2mer, 3mer), shape features (MGW, Roll, ProT, HelT, Rise, Shift, Slide, Tilt, Buckle, Opening,
Shear, Stagger, Stretch, methyl-MGW, methyl-ProT, methyl-HelT and methyl-Roll) or any
combination of those features (Figure 2.2). DNAshapeR encodes sequence as binary features.
25
DNA shape features are normalized by default and can include second-order shape features. The
detailed definitions of sequence and shape features were provided in an earlier study [47].
The feature encoding function of DNAshapeR enables the generation of any user-defined
subset of these features. The result of the feature encoding for each sequence is a chimera feature
vector. Feature encoding of multiple sequences thus results in a feature matrix, which can be used
as input for a variety of statistical machine learning methods.
2.2.3 Availability and implementation
The DNAshapeR software package implemented in the statistical programming language
R and is freely available through the Bioconductor project at
https://www.bioconductor.org/packages/devel/bioc/html/DNAshapeR.html and at the GitHub
developer site, http://tsupeichiu.github.io/DNAshapeR/.
2.3 Conclusions
DNAshapeR predicts DNA shape features in an ultra-fast, high-throughput manner from
genomic sequencing data. The package takes either nucleotide sequence or genomic coordinates
as input and generates various graphical representations for visualization and further analysis.
DNAshapeR further encodes DNA sequence and shape features as user-defined combinations of
k-mer and DNA shape features. The resulting feature matrices can be readily used as input of
various machine learning software packages for further modeling studies. In the following chapters
DNAshapeR serves as a fundamental tool for various analyses.
26
Chapter 3 GBshape: a genome browser database for DNA shape
annotations
3.1 Introduction
Interactions between nucleotides within a binding site or its flanks are implicitly contained
in the 3D structure of a DNA binding site. DNA shape is influenced by the core motif [33] and its
flanking sequences [32] and therefore potentially characterizes binding sites with higher precision.
In addition to taking into account interrelationships between nucleotide positions, DNA shape
integrates over diverse nucleotide sequences that can give rise to similar DNA shapes, a
phenomenon known as degeneracy of DNA sequence and structure. As a consequence, DNA shape
was found to be evolutionarily conserved to a higher degree than is DNA sequence [58].
Based on these findings it seems advantageous to incorporate DNA shape features in motif
scanning and de novo motif discovery methods [59-62]. Another application for DNA shape
analysis would be in the functional evaluation of genetic variation, which is commonly described
in terms of nucleotide sequence [63, 64]. These and other applications will require the mapping of
DNA shape features for entire genomes. To make the necessary data available we developed
GBshape. Prediction of DNA shape features from nucleotide sequence is based on high-throughput
methods for deriving DNA shape features, by using pentamers to mine results from all-atom Monte
Carlo simulations of DNA fragments [14, 65, 66], and by predicting hydroxyl radical cleavage
patterns based on an experimental dataset [62].
GBshape is a multi-species database currently containing whole-genome data for more than
one hundred organisms from groups of diverse species (Table 3.1). For each organism the database
provides four genome browser tracks with annotations for Minor Groove Width (MGW), Propeller
27
Twist (ProT), Roll and Helix Twist (HelT) [14]. In a fifth track, GBshape shows hydroxyl radical
cleavage annotations from the ·OH Radical Cleavage Intensity Database for double-stranded DNA
(ORChID2) [67]. These five DNA shape annotations were generated with the high-throughput
prediction platform. DNA shape data can be visualized either qualitatively or downloaded as
quantitative values via the GBshape user interface. GBshape contains DNA shape annotations for
92 genomes taken from the UCSC Genome Browser [68] and 14 additional genomes from plants,
parasitic protists and bacteria (Table 3.1).
Table 3.1 Current number of genome repositories in GBshape.
The organism group names are listed by UCSC Genome Browser organism group with additional groups added.
Group Count
Mammal 48
Vertebrate 19
Deuterostome 3
Insect 15
Nematode 6
Fungi 6
Plants 2
Protists 1
Bacteria 4
Others 2
Total 106
We demonstrate the value of analyzing DNA shape annotations using GBshape by
comparing the structural features of in vivo nucleosome binding sites from worm, fly and human
[69] and the evolutionary conservation of DNA shape at transcription start sites (TSSs) across
multiple Drosophila species [70]. The GBshape database completes the family of DNA shape tools
that includes DNAshape, a web server for high-throughput prediction of DNA shape features for
up to 1 million base pairs [14], TFBSshape, a database of DNA shape features of transcription
28
factor binding site motifs [49], and ORChID, a database and prediction tool for hydroxyl radical
cleavage patterns [62, 67].
3.2 Materials and methods
3.2.1 Database architecture and methodology
The core of our database is a high-throughput prediction platform (Figure 3.1) that we
developed to generate DNA shape data for storage in GBshape. Whole genome sequence files (in
FASTA format) for multiple species are subjected to the high-throughput prediction programs
DNAshapeR [14, 16] and ORChID2 [67] that are embedded in the GBshape platform. The
GBshape prediction platform was designed to be extendable by plug-ins of other whole-genome
annotation programs (Figure 3.1). The results of high-throughput prediction programs are
converted to the bigWig data format, which can be displayed in a genome browser. The platform
was developed in C++ and runs on a high-performance computing cluster (HPCC).
The GBshape database combines DNA shape data with standard UCSC Genome Browser
annotations [68] and whole-genome sequence data. The sequences of 92 genomes (Table 3.1) and
corresponding standard annotations were downloaded from the UCSC Genome Browser (19).
Although several genome assembly versions are available for many of these species, at this stage
of development the most recent genome assembly for each species was chosen. The reference
genome from Saccharomyces cerevisiae was identical with the one provided by the
Saccharomyces Genome Database (23). Three additional reference genomes from Arabidopsis
thaliana (24), Plasmodium falciparum (25) and Escherichia coli (26) that were not present in the
UCSC Genome Browser were added to GBshape (Table 3.1). The GBshape framework enables an
29
easy expansion to additional genome assemblies, and users can submit a web form requesting the
addition of specific genomes to our database. The GBshape database runs on MySQL (Figure 3.1).
Figure 3.1 Architecture of the GBshape database.
GBshape consists of a high-throughput prediction platform, data depositories and a user interface. DNA shape
annotations of entire genomes can be generated by the high-throughput prediction platform, which runs on a
high-performance computing cluster (HPCC), and, together with genome sequences and UCSC Genome Browser
standard annotations, stored in the data depositories. The user interface provides multiple functionalities for users
to either visualize or download structural annotations.
The GBshape tracks for MGW, ProT, Roll and HelT were generated using our high-
throughput method DNAshapeR [14, 16]. These DNA shape features were selected based on prior
experimental studies demonstrating their important role in protein-DNA recognition, and include
MGW [19, 20, 71], ProT [32], Roll [72] and HelT [19]. Using pentamers as sliding windows,
DNAshape mines all-atom Monte Carlo simulations [65, 73] of 2,121 DNA fragments of 10–27
bp in length. Each of the 512 unique pentamers is assigned the average value of all of its
30
occurrences in the dataset at the central nucleotide for MGW and ProT and at the two central bp
steps of the pentamer for Roll and HelT. Each pentamer occurs on average 44 times in our Monte
Carlo-generated dataset. The DNAshape method was validated against experimental data from X-
ray crystallography, nuclear magnetic resonance spectroscopy and hydroxyl radical cleavage
measurements [14].
We have shown that the hydroxyl radical, a small, uncharged, highly reactive molecule,
reacts with the backbone of naked DNA in a manner that reflects the solvent accessible surface
areas of the hydrogen atoms of the deoxyribose sugar, thus providing an experimental image of
DNA backbone shape [67, 74]. To develop this chemical approach into a high-throughput method
we performed hydroxyl radical cleavage experiments on 150 diverse DNA fragments of 40 bp in
length. We devised a prediction algorithm, based on this database of experimental cleavage
patterns, that uses a sliding tetramer window to predict the cleavage pattern for DNA sequences of
any length [62]. We subsequently extended this method by averaging the predicted cleavage
patterns of both DNA strands to develop ORChID2, which we previously showed to be correlated
with MGW and electrostatic potential [67]. Thus, the ORChID2 pattern provides an experiment-
based prediction of minor groove shape, which complements the Monte Carlo-based DNA shape
features as an additional annotation track in GBshape.
3.2.2 User interface
The GBshape user interface is a customized version of the UCSC Genome Browser that is
hosted on our local server. The user interface contains some important functionalities of the UCSC
Genome Browser, including the genome browser, table browser, the Basic Local Alignment
Search Tool-like alignment tool (BLAT) and the ‘add custom tracks’ tool. The GBshape interface
runs on a Linux-operated dual-core IBM server with Apache.
31
GBshape consists of two major tools—a genome browser and a table browser. The genome
browser provides a graphical representation of DNA shape annotations along with standard
genome browser annotations. The genome browser also supports text and sequence search
functions to provide easy access to genomic regions of interest. The table browser enables data
manipulation, downloads of multiple records and basic statistical analyses, which cannot be
performed with the genome browser function.
To visualize DNA shape annotations the user clicks on ‘Genome Browser’ in the
navigation bar on the left of the GBshape homepage. On the Genome Browser Gateway page the
user chooses an organism group, species genome, genome assembly, genome position and search
terms of interest. After the ‘submit’ button is pressed, consolidated results for DNA shape
annotations, together with standard genome annotations (Figure 3.2A), are shown on the display
page. The shape annotations MGW, ProT, Roll, HelT and ORChID2 can be shown as quantitative
plots (Figure 3.2B) or condensed into heat maps (Figure 3.2C).
The sequence-alignment tool, BLAT, can be used to search specific regions of the genome
based on sequence similarity. To use BLAT, click on ‘Tools’ in the navigation bar at the top of the
Genome Browser Gateway page, select ‘Blat’ in the pull-down menu, select a genome, assembly,
query type, sort output and output type, and then press the ‘submit’ button. Genomic annotations
can be viewed by clicking on the ‘browser’ link at the left of the search results. Supplementary
Table S2 provides information on genomes for which BLAT supports a sequence search.
32
Figure 3.2 Visual display of GBshape annotations in the genome browser.
(A) Genome positions and UCSC Genome Browser standard annotation tracks. (B) DNA shape annotation tracks
for MGW, ProT, Roll, HelT and hydroxyl radical cleavage intensity (ORChID2). (C) Heat map views for DNA
shape annotations.
The view of the genome browser can be adjusted by using the buttons located near the top
of the display page to move along the genome sequence, zoom in or zoom out, or by dragging and
zooming the genomic position. The display type of an annotation track can be changed by selecting
the pull-down menu from the track control panel at the bottom of the page. A heat map view can
be shown for a track by setting the display type as ‘dense’ on the corresponding control panel.
Users can upload their own tracks to compare with the existing annotations by using the function
‘add custom tracks’.
33
The table browser supports downloading and analysis of quantitative DNA shape
annotations. To access these functions, click on ‘Table Browser’ in the navigation bar on the
GBshape homepage. The Table Browser can also be found under the ‘Tools’ link in the navigation
bar of the Genome Browser Gateway page. One can download DNA shape annotations for an
entire genome or for a specified genomic region by setting parameters on the Table Browser page.
Download of data for multiple regions is specified by setting ‘define regions’. Users can download
data that match certain criteria by setting the ‘filter’ function, or manipulate data from different
datasets by using the ‘intersection’ function. Output data can be exported in a variety of formats
for further analysis or for use in other applications. Statistical correlations can be calculated over
selected datasets, such as the correlation between data in different shape annotation tracks.
3.3 Results and discussion
3.3.1 Nucleosome binding sites
Periodicity in nucleotide sequence has been detected in DNA sequences that wrap around
histone octamers to form nucleosome core particles [75]. The 10-bp periodicity of dinucleotide
occurrence [76] and A-tract composition [17] mirrors the variation in width of the DNA minor
groove as it is directed toward the histone core once every helical turn. We reported that the minor
groove in nucleosome-bound DNA exhibits a 10-bp periodicity in MGW and electrostatic potential,
and concluded that contacts of histone arginines with narrow minor groove regions are stabilized
by the 10-bp shape-dependent periodicity in electrostatic potential [17].
A question that arises from these observations is whether periodic patterns in dinucleotide
occurrence result in DNA shape features that guide nucleosome formation. Genome-wide
nucleosome occupancy maps with thousands of nucleosome binding sites have been
34
experimentally constructed by digesting intact chromatin with micrococcal nuclease followed by
sequencing the underlying protected DNA fragments (MNase-seq) [76, 77]. We have used
GBshape to infer structural features of these nucleosome-bound sequences.
We previously analyzed DNA shape features of 23,076 nucleosome-bound sequences from
Saccharomyces cerevisae [76] and 25,654 from Drosophila melanogaster [77]. We showed that
analysis of shape profiles generated by the DNAshape and ORChID2 algorithms reveals a
pronounced 10-bp periodicity in structural properties of nucleosomal DNA [14, 67]. The
modENCODE consortium recently generated more extensive lists of nucleosome-bound
sequences of much higher quality for human, Drosophila melanogaster and Caenorhabditis
elegans [69].
We have now used GBshape to predict MGW and compare this structural property to the
ORChID2 pattern for these massive lists of 3.6 million from Caenorhabditis elegans, 3.8 million
nucleosome-bound sequences from Drosophila melanogaster and 13.1 million from the human
genome (Figure 3.3). The strong correlation between MGW and ORChID2 for all three organisms
served as a validation of GBshape due to the independent approaches used to generate these
predictions. Whereas the 10-bp periodicity was shared between human, fly and worm, details of
the DNA shape profiles of nucleosomal DNA varied across species due to the different nucleotide
compositions of these genomes. Analysis of the other DNA shape features Roll, HelT and ProT
further confirmed the shared 10-bp periodicity as well as distinctions in DNA shape between
nucleosome-bound sequences in these genomes (Figure 3.4). The maxima and minima of the
MGW, Roll and ProT patterns overlapped, whereas the troughs in the HelT patterns matched the
peaks in the other parameters, indicating a local helix unwinding at positions where a more positive
Roll locally widens the minor groove.
35
Figure 3.3 Variation in MGW and ORChID2 signals on average in nucleosome sequences.
Variation in MGW (blue) and ORChID2 (green) signals on average in nucleosome sequences from the (A)
Caenorhabditis elegans, (B) Drosophila melanogaster and (C) human genomes. Numbering of the nucleotide
position starts with −1 and 1 for the central two base pairs, respectively.
36
Figure 3.4 Variation in Roll, HelT and ProT on average in nucleosome sequences.
Variation in Roll (blue), HelT (green) and ProT (red) on average in nucleosome sequences from the (A)
Caenorhabditis elegans, (B) Drosophila melanogaster and (C) human genomes. Numbering of the nucleotide
position starts with −1 and 1 for the central two base pairs, respectively.
37
3.3.2 Transcription start sites (TSSs)
TSSs are located at the 5’ end of genes where contacts with RNA polymerase II initiate
transcription. A long-standing question in the field is how these positions can be identified in a
genome using computational methods [78]. Whereas the presence of conserved sequence elements,
such as the TATA box and the Initiator (Inr) element, represent one possibility for identifying
TSSs, nucleotide composition varies in Inr elements and in regions surrounding TSSs. Previous
reports suggested that structural features, including DNA bending and melting, enhance protein
binding at TSSs [78]. We used GBshape as a high-throughput approach to annotate DNA shape
features at TSSs of four different Drosophila species.
We derived data from paired-end cap analysis for gene expression experiments [70] to
identify TSSs in the D. melanogaster, D. simulans, D. sechellia and D. pseudoobscura genomes.
Transcription initiates from a range of positions at a given promoter, resulting in a frequency
distribution that varies from ‘broad’ to ‘sharp’ between promoters [79]. Depending on the analysis,
a single representative TSS position for each promoter can be chosen based on the median position
or the position with the maximum number of initiation events (peaks) within a given TSS
distribution. For this analysis, we chose the peak position within each local frequency distribution
in order to maximize the 5’ sequence alignments.
Our DNA shape analysis of Drosophila TSSs revealed a clear structural signature for the
Inr element despite the nucleotide sequence variation of this element. Moreover, specific DNA
shape annotations of TSS regions were apparent for MGW, ProT, Roll and HelT (Figure 3.5). For
each DNA shape feature the patterns were similar among the four Drosophila species, suggesting
an evolutionary role of DNA structure. Whereas this effect merits further investigation, GBshape
38
provides a platform that enables studies in which one can easily navigate between DNA sequence
and shape information for a very large genomic datasets.
Figure 3.5 Average heat maps for four DNA shape features of TSSs
Average heat maps for four DNA shape features of TSSs and 50 bp up- and downstream in four fly species were
shown. The analysis is based on 3,823 TSSs from the D. melanogaster, 6,909 TSSs from the D. simulans, 7,234
TSSs from the D. sechellia and 7,397 TSSs from the D. pseudoobscura genomes. Column numbers in each heat
map indicate the nucleotide position relative to the TSS. Black frames mark the locations of the Initiatior (Inr)
element and TATA box.
39
3.4 Conclusions
We have developed a database of DNA shape annotations for whole genomes of 106
organisms. Given the emerging literature on the importance of DNA structural features in refining
transcription factor binding specificities [3], this tool provides a framework for integrating DNA
shape in whole-genome analyses. GBshape currently includes tracks for five structural features:
MGW, ProT, Roll and HelT using DNAshape predictions [14, 16], and hydroxyl radical cleavage
intensity derived from ORChID2 [67]. To demonstrate the utility of GBshape we analyzed
structural features of nucleosome binding sites and TSSs. The availability of DNA shape
annotations for entire genomes will enable the integration of DNA structure into genome analyses
that currently are based only on nucleotide sequence.
40
Chapter 4 DNAphi: a method for high-throughput prediction of
minor groove electrostatic potential
4.1 Introduction
We previously showed that variations of electrostatic potential (EP) upon changes in
minor-groove topography represent a biophysical source of protein–DNA binding specificity [17].
Variations in the three-dimensional (3D) structure of DNA alter its dielectric boundary with
surrounding solvent. These structural changes deform electric field lines, resulting in enhanced
negativity of the EP in regions of narrow minor groove [17]. This phenomenon, called electrostatic
focusing [80], was originally discovered for proteins [81] and, more recently, was applied to
protein–DNA interactions and used to explain biophysically why arginine [17], lysine [18] and
histidine [19] residues often recognize DNA sequences with a narrow minor groove. However,
because EP could only be calculated for individual structures [17], it was not accessible on a
genome-wide basis and could be used only indirectly in modeling TF binding specificity due to its
correlation with MGW [47, 82]. While a correlation between EP and MGW holds for narrow
minor-groove regions, MGW is not a proxy for EP in general. To capture the actual biophysical
contribution of minor-groove EP to TF binding, knowledge of EP on a genome-wide basis is
required for analysis of HT binding data.
Experimentally determined structures of protein–DNA complexes represent atomic-
resolution data on interactions between TFs and their DNA binding sites [83], thus providing
crucial insights into binding mechanisms [20]. However, co-crystal structures are available for
relatively few TFs and are typically limited to complexes where a protein or its DNA binding
domain binds to a single DNA sequence. Rarely, structural biology provides insights into the
41
binding of a TF to multiple DNA sequences [20, 84-88]. To fill this gap and probe the binding of
a given TF to many DNA sequences, technologies for measuring protein–DNA binding specificity
in a HT manner have advanced tremendously in the last decade [5, 6, 17, 89, 90]. Assays, such as
protein-binding microarray (PBM) [30], genomic-context PBM (gcPBM) [32], high-throughput
SELEX (HT-SELEX) [34-36] and SELEX-seq [33], have enabled measurements of binding
affinities of one protein or protein complex against thousands or even millions of different DNA
sequences. Such HT approaches to DNA binding specificity provide an alternative path to infer
protein–DNA binding mechanisms without requiring time-consuming structural biology
experiments or molecular simulations [1, 3, 33].
The minor-groove EP of DNA can be obtained by solving the non-linear Poisson–
Boltzmann (NLPB) equation, as provided by the DelPhi program [80]. Previous work showed that
DelPhi represents an accurate description of electrostatic interactions involving DNA in atomic-
resolution structures [10, 17, 20, 91, 92]. However, NLPB calculations are computationally costly
and cannot be used on massive or genome-scale DNA sequences. To infer minor-groove EP in a
HT manner, we previously developed a HT method, DNAshape [14, 16], which enables prediction
of MGW for massive experimental and computational data. Prediction results can be used to
measure minor-groove EP somewhat indirectly, although correspondence between EP and MGW
is only well established for narrow minor-groove regions [20].
A/T and C/G bp carry different partial charge distributions in the minor groove (due
primarily to the guanine amino group). These partial charges, in addition to charges on the
phosphates, will affect minor-groove EP (Figure 4.1). Therefore, we asked whether we could
account for minor-groove EP directly, rather than using MGW, to capture the effects of partial
charges of bases and to reveal novel base-specific electrostatic interactions. To address this
42
question, we developed an approach for the HT prediction of EP in the minor groove, called
DNAphi (Figure 4.2). We designed this approach based on the mining of numerical solutions to
the NLPB equation provided by the DelPhi program [80]. DNAphi is a HT method that enables
efficient calculation of minor-groove EP for an unlimited number and length of DNA sequences,
without the requirement of NLPB calculations on an atomic-level 3D structure.
Figure 4.1 Functional groups of C/G and A/T bp exposed on major and minor grooves.
Proteins recognize binding sites mainly through contacts with unique functional groups in the major groove (base
readout), whereas the pattern is degenerate in the minor groove. For example, in the major groove, a G/C bp can
be distinguished by its unique functional-group pattern ‘AADN’ (A: hydrogen bond acceptor; D: hydrogen bond
donor; N: nonpolar hydrogen) as it differs from a C/G bp pattern (‘NDAA’), an A/T bp pattern (‘ADAM’) and a
T/A bp pattern (‘MADA’) with ‘M’ representing the thymine methyl group. In the minor groove, the G/C bp
shares the same functional-group pattern ‘ADA’ with a C/G bp. Likewise, the functional-group pattern is
identical for A/T and T/A bp (‘ANA’). The positively charged guanine amino group in the minor groove (circled)
can affect EP in the minor groove by partially neutralizing the negative EP.
43
Figure 4.2 Overview of DNAphi method.
(A) All-atom MC simulations were used to generate seed structures from 2,297 DNA fragments. (B) We solved
the NLPB equation to calculate EPs on these seed structures and (C) defined a sphere at the center of the minor
groove in the plane of each bp. We derived the EP values at the center and at 26 points on the surface of each
sphere. The average EP value at these points was assigned to be the EP of the respective pentamer. (D) EP values
for all occurrences of the same pentamer were averaged to form a query table of (E) HT predictions of EP as a
function of nucleotide sequence.
Using machine-learning (ML) techniques, we can exploit EP as a biophysical feature to
model quantitatively protein–DNA binding on massive sequencing data. This approach provides
a novel way to investigate how biophysical characteristics of the genome affect the strength of
protein binding, thereby leading to a better understanding of protein–DNA binding mechanisms.
Traditionally, such predictive ML models were built based on nucleotide sequence [50-54, 93-96].
We previously extended the sequence models by integrating DNA shape features derived from
DNAshape [14, 16] to build models that integrate structural information [1, 47, 82]. In these studies,
we used MGW as a ‘proxy’ for EP, based on the observation that MGW correlates closely with
44
EP when the minor groove is narrow [17, 20]. Here, we revisited this assumption and demonstrated
that the direct use of EP in quantitative models of protein–DNA binding specificities can yield
similarly or more accurate models and potentially reveal new biophysical recognition mechanisms
(Supplementary Figure 4.3). We tested our new biophysical models on 239 TFs from 27 different
protein families.
Figure 4.3 Illustration of possible applications of DNAphi.
(A) Protein DNA binding affinities can be measured by several types of HT assays, including PBM, HT-SELEX
and their variants. Outputs of these experiments are represented as sequences followed by corresponding binding
affinities. (B) Output sequences are used as inputs to the HT prediction programs DNAphi (this study) and
DNAshape (both methods are included in the DNAshapeR package) for prediction of minor-groove EP and DNA
shape features. DNAshapeR can encode EP, sequence and shape features as a concatenated feature vector. (C)
45
Resulting vectors can be used as input of statistical ML methods for further analysis and modeling. (D) Resulting
models can be used to infer specific mechanisms of protein-DNA recognition (without requiring 3D structures) or
to identify TF binding sites in the genome.
4.2 Materials and methods
4.2.1 Poisson–Boltzmann calculation of EP
We carried out NLPB calculations for an exhaustive sampling of pentameric conformations
of nucleotides. All-atom MC simulations were used for the structural sampling of 2,297 different
DNA fragments ranging from 12 to 27 bp in length. Each simulation was started from a canonical
B-DNA conformation and extended over 2 million MC cycles. We considered the first 500,000
MC cycles to be the equilibration period. We recorded snapshots every 10th MC cycle along the
MC trajectory and generated an average conformation for each DNA fragment (Figure 4.2A). This
dataset represents an extension of the one used for the DNAshape method [14] to expand sequence
coverage.
We used the DelPhi program [80] to carry out NLPB calculations (see Supplementary
materials for details) on all MC-derived average conformations of DNA fragments (Figure 4.2B)
at a physiological ionic strength of I = 0.145 M. Partial charges of DNA were derived from the
AMBER force field [97]. The dielectric boundary between solute (internal dielectric ε = 2) and
surrounding solvent (ε = 80) was determined by using a probe radius of 1.4 Å (48). Space filling
of the solute molecule was increased in five focusing steps, with a cubic grid size of 165, by
following a previously described protocol [17]. We verified the stability of NLPB calculations by
comparison with a cubic grid size of 501 using three focusing steps and otherwise identical DelPhi
parameters. In addition, we identified contributions of different chemical groups of a nucleotide
46
(i.e. phosphate, base and sugar moiety) based on additive linear Poisson–Boltzmann (LPB)
calculations. For each component, we solved the LPB equation for 2,297 structures by considering
only the partial charges for atoms corresponding to that chemical group.
4.2.2 High-throughput prediction of EP
To define EP as a function of nucleotide sequence, for a given nucleotide index i, we
obtained EP at the midpoint between O4’ atoms of nucleotides i+1 on the Watson strand and i –1
on the Crick strand from the DelPhi-calculated potential map [17]. This midpoint is approximately
located within the plane of bp i. To capture fluctuations of EP due to different distances of this
midpoint to the dielectric boundary of the DNA segments with various deformations, we derived
EP values at 26 points that were equally distributed on a sphere with 1 Å radius surrounding
midpoint i [98] (Figure 4.2C). Excluding extreme EP values (due to clashes with the molecular
surface of the solute DNA in certain conformations) and averaging EP values at the remaining
points, we assigned an average value to each sphere. This approach prevents the inclusion of outlier
values in the EP calculation. Sphere i lies at the approximate center of the minor groove in the
plane of bp i. In this way, EP can be defined as a function of sequence, with one value per bp.
By mapping EP values of 2,297 DNA fragments, we calculated the average value at the
central bp of each unique pentamer and generated a query table of average values for each
occurrence of the 512 possible pentamers in our dataset (see Supplementary materials for details).
Each pentamer occurred in our dataset about 45 times (Figure 4.2D). This pentamer lookup table
was integrated in a sliding-window approach to predict minor-groove EP for any sequence,
regardless of length, or for millions of sequences (Figure 4.2E). Likewise, we used the pentamer
sliding-window method to generate pentamer query tables for the HT prediction of deconvolved
EP values based on each chemical group of a nucleotide (phosphate, base and sugar).
47
The DNAphi web server facilitates EP prediction on a HT scale in genome-wide studies
and is available at http://rohslab.usc.edu/DNAphi/. DNAphi was also implemented in the statistical
programming language R and integrated in the Bioconductor package DNAshapeR [16], available
at http://www.bioconductor.org/packages/devel/bioc/html/DNAshapeR.html.
4.2.3 EP-augmented protein–DNA binding models
We used DNAshapeR [16] to encode DNA sequence, EP and shape feature vectors for ML
analysis. For the sequence feature vector, the nucleotide at each position in a given sequence of
length L was encoded as four binary numbers (adenine = 1000, cytosine = 0100, guanine = 0010
and thymine = 0001), resulting in a binary vector of length 4L [47]. EP and shape features included
the bp parameters EP, MGW and propeller twist (ProT), and the bp-step parameters Roll and helix
twist (HelT). For a sequence of length L, the length of the nucleotide feature vector was L–4, and
the length of the bp-step feature vector was L–3 (see Supplementary materials for details).
We used HT-SELEX data for 215 mammalian TFs from 27 protein families [34], which
were re-sequenced with an average 10-fold increase in sequencing depth [82]. Sequencing data
were obtained from the European Nucleotide Archive (ENA; study identifier PRJEB14744) and
pre-processed following our recently published protocol [82]. We also included SELEX-seq data
for eight Drosophila Hox proteins, including Hox mutants, in the presence of their cofactor
Extradenticle (Exd) [1]. The 21 Exd-Hox datasets can be downloaded from the Gene Expression
Omnibus (GEO; accession number GSE65073). Sequences with multiple occurrences of the core
motif were removed from this analysis. In addition, we used gcPBM data for three human basic
helix-loop-helix (bHLH) proteins [47]. These data contained 36-bp genomic sequences centered
at a putative TF binding site and can be downloaded from the GEO (accession number GSE59845).
48
We trained multiple linear regression (MLR) models on each dataset to predict the relative
binding affinity for every sequence bound by a given TF. To measure the predictive power of
regression models in an unbiased and robust manner, we adopted a 10-fold cross-validation
approach [99]. Each dataset was randomly partitioned into ten equally sized subsets. One subset
was retained as validation data for testing the model, while the other nine subsets were used for
training. Thus, models were always tested on data that had been excluded in the training process.
The cross-validation process on each dataset was repeated ten times. Each time, we calculated the
coefficient of determination (R2) between predicted and observed values of response variables for
all DNA sequences in the validation dataset. R2 values from the ten tests were averaged to produce
a single estimate to be reported. Because the relative binding affinities were derived in separate
experiments, the MLR models were trained and assessed for each TF binding dataset individually.
Prediction and validation processes were performed using the Caret package (http://caret.r-forge.r-
project.org). Source code for the prediction method is available at
https://github.com/TsuPeiChiu/DNAphi_analysis.
To evaluate the predictive power of EP for TF–DNA recognition, we compared multiple
models built from different combinations of features, including DNA sequence and EP
(sequence+EP models), sequence and MGW (sequence+MGW models), sequence and shape
(sequence+shape models), models that combine sequence with EP and the three shape features
ProT, Roll and HelT (sequence+3shapes+EP models), and models that combine sequence with EP
and all four shape features including MGW (sequence+shape+EP models).
49
4.3 Results and discussion
4.3.1 Validation of EP prediction
To examine whether the advantage of a fast EP calculation without the requirement of a
3D structure, as provided by DNAphi, compromises the accuracy of the EP prediction, we
validated DNAphi through direct comparison with NLPB calculations using DelPhi [80] on crystal
structures where protein atoms were removed. We first targeted the minor-groove EP of DNA
binding sites from various crystal structures [20, 100-110], for which we previously established
the importance of minor-groove shape readout [14, 17]. The DNAphi predictions agreed well with
actual NLPB calculations for DNA binding sites in crystal structures of protein–DNA complexes,
as indicated by Pearson correlation coefficients (PCCs) ranging from 0.43 to 0.93 (Figure 4.4 and
Supplementary Figures S1 and S2). Stability of the DelPhi calculations was shown for the TF
binding sites illustrated in Figure 4.4 by using different grid spacing (Supplementary Figure S3).
Differences in predictions between DNAphi and DelPhi might result from different types of input.
DelPhi takes DNA structure as input, which can possibly get deformed by protein binding
following the initial protein–DNA recognition process [111]. In contrast, DNAphi only uses DNA
sequence as input and estimates EP based on population-based statistics rather than individual
calculations, which can yield more robust results. Arginine residues tended to be located near
positions with lower minor-groove EP as predicted by DNAphi (Figure 4.4 and Supplementary
Figures S1 and S2). These observations confirmed our previous finding that the binding of arginine
residues to narrow minor grooves is a commonly used mode for protein–DNA recognition [17].
In addition to direct validation through comparisons with NLPB calculations on
experimentally solved structures, we examined the predictive efficiency of the pentameric sliding-
window approach. We applied leave-one-out cross-validation using a pentameric sliding window
50
to mine training data derived from NLPB calculations. In each round of cross-validation, we
removed one of the 2,297 assigned all-atom average conformations derived from MC simulations.
We recompiled the pentamer query table of our HT approach with the remaining training data and
predicted the EP of the removed structure. These steps were repeated for each of the 2,297
structures. Predictions were concatenated and compared to the direct NLPB calculations. The
results showed a strong correlation (PCC = 0.84), demonstrating that our pentameric sliding-
window approach captures EP derived from direct PB calculations with high accuracy.
51
Figure 4.4 Validation of HT EP predictions using TF–DNA binding sites.
Minor-groove EP values of binding sites of (A) OCT1-POU (PDB ID 1OCT) (52), (B) OCT1-PORE (PDB ID
1HF0) (56), (C) Msx-1 (PDB ID 1IG7) (54) and (D) MATa1-MATα2 (PDB ID 1AKH) (55), whose binding
interface includes an arginine inserted into the minor groove, were predicted using DNAphi (blue) and DelPhi
(red), respectively. Pearson correlation coefficients (PCCs) demonstrate the statistical similarity between EP
profiles derived from these two approaches. We highlighted the more negative minor-groove EP values (≤ –6.505
kT/e, which is the average value in the EP query table) predicted by DNAphi by underlining the respective x-axis
labels (red). Corresponding spheres defined by DNAphi are represented by spheres in each structure, with red
indicating below-average EP values ≤ –6.505 kT/e and pink indicating EP values > –6.505 kT/e. Protein residues
of minor-groove contact defined by DNAproDB (21) are shown in each structure.
52
4.3.2 Correlation between EP and MGW
MGW closely correlates with EP due to the shape-dependent focusing of electric field lines
[17, 20, 67]. However, there is degeneracy in the sequence-to-MGW mapping, such that functional
groups in the minor groove can affect EP despite similar MGW. For example, the presence of the
partial positive charge of the guanine amino group can partially neutralize a negative EP in the
minor groove (Figure 4.1). This effect was observed in distribution plots of EP and MGW for the
512 pentamers in the query table. EP appeared to follow a bimodal distribution with two clear
peaks (Figure 4.5A), whereas the MGW distribution was essentially unimodal (Figure 4.5B). The
two distinct EP distributions could be distinguished by classifying pentamers into categories based
on their central bp (A/T or C/G) (Figure 4.5C).
We further identified contributions from different chemical components of a nucleotide
using additive LPB calculations (Supplementary Figure S4A). Although EP contributions from the
bases separated pentamers with central A/T versus C/G bp most distinctly (Supplementary Figure
S4B), EP contributions from phosphate groups (Supplementary Figure S4C) and sugar moieties
(Supplementary Figure S4D) also exhibited shifted peaks of overlapping distributions. These
results demonstrate that functional groups of the bp can strongly affect minor-groove EP, an effect
that cannot be fully captured by MGW (Figure 4.5D).
53
Figure 4.5 Comparison of EP and MGW distributions of 512 unique pentamers categorized by central bp.
(A) EP distribution shows two separated peaks, reflecting bimodal behavior. (B) Pronounced separate peaks are
not observed in the MGW distribution. (C) EP distribution forms two subgroups, defined by the identity of the
central bp, which (D) is not the case for the MGW distribution.
Distinct subgroups with a central A/T bp (NN(A/T)NN pentamer) or central C/G bp
(NN(C/G)NN pentamer) were distinguished when EP was directly plotted against MGW (Figure
4.6). EP showed a higher correlation with MGW in the subgroup of pentamers with a central A/T
bp (PCC = 0.87) than in the subgroup with a central C/G bp (PCC = 0.75). In particular, the A-
tract subgroup (pentamers containing ApA, ApT or TpT steps formed by at least three bp) showed
a narrower MGW and enhanced negative EP, resulting in a slightly higher correlation with EP
(PCC = 0.86) than the subgroup excluding A-tract pentamers (PCC = 0.85). These results confirm
our previous finding of a high correlation between EP and MGW in AT-rich sequences (16,22). In
some cases, EP is more sensitive than MGW to chemical signatures (e.g. pentamers AAGTT,
54
AAAAA and AAAGT in dashed box of Figure 4.6), suggesting that EP may provide a more
sensitive approach to the prediction of protein–DNA binding affinities.
Figure 4.6 Scatter plot for EP and MGW values of all 512 unique pentamers.
Two separate groups reflect different correlations between EP and MGW. The subgroup with a central A/T bp
(red) demonstrates a stronger correlation between EP and MGW than the subgroup with a central C/G bp (cyan).
The A/T subgroup representing A-tracts (large red) exhibits more negative EP and narrower MGW compared to
other groups. EP variation over an order of magnitude can be seen in narrower MGW (dashed box).
55
4.3.3 Correlation of EP and Fis protein binding affinity
Next, we targeted eight DNA binding sites, which exhibit Escherichia coli Fis protein-
binding affinities differing over three orders of magnitude depending on the DNA sequence in the
central region [84, 86] (see Supplementary materials for details on datasets). Fis binds non-
specifically to the bacterial genome [112]. Using DNAphi (with DNA sequence as input) and
DelPhi (with DNA structure with proteins removed as input), we predicted the minor-groove EP
values of six DNA binding sites for which crystal structures of Fis-DNA complexes were available
(Supplementary Table S1). Our HT predictions demonstrated good agreement with direct NLPB
calculations (Figure 4.7A and B). The Fis protein binds various DNA sequences with an affinity
that depends on the MGW in the central region of its binding site [84]. The average MGW over
the five central nucleotides predicted by DNAshapeR [16] was highly correlated with the logarithm
of binding affinity Kd when we excluded a particular sequence with a central TpA dinucleotide,
representing a flexible ‘hinge’ step (Figure 4.7C) [14]. We calculated the average EP over the same
five central nucleotides and obtained a stronger correlation, even when we included the sequence
with the central TpA step (Figure 4.7D).
We expanded this analysis to additional groups of sequence variants in the Fis protein
binding site (25) (Supplementary Figure S5A) and observed similar improvements in the
correlation between EP and the logarithm of binding affinity. The correlation was either already
high (Supplementary Figure S5B) or improved by using EP rather than MGW due to the removal
of outliers (Supplementary Figure S5C and D). Exception were the sequence variants that differed
solely by A/T versus G/C bp, which caused EP to change through the addition or removal of the
guanine amino group (Supplementary Figure S5E and F). In flanking regions of Fis binding sites,
56
EP values correlated better than MGW with the logarithm of binding affinity (Supplementary
Figure S5G and H).
To decipher the contribution of individual nucleotide components, we further deconvolved
contributions into chemical groups (base, sugar moiety and phosphate groups) by solving the LPB
equation for the subset of Fis binding sites analyzed in Figure 4.7. The contribution from partial
base charges showed a higher correlation with the logarithm of binding affinity than the
contribution from phosphate groups, demonstrating the importance of the effects of bases
(Supplementary Figure S6).
Figure 4.7 High-throughput EP predictions for Fis-binding sites.
EP as a function of sequence was predicted for (A) high-affinity (Kd = 0.2 nM) and (B) low-affinity (Kd = 140
nM) binding sites using DNAphi (blue) and DelPhi (red). Pearson correlation coefficients (PCCs) demonstrate
the statistical similarity between EP profiles derived by the two approaches. HT predictions of (C) MGW and (D)
EP over five central bp of eight Fis binding sites correlate with the logarithm of binding affinity. Coefficients of
determination (R2) between the logarithm of Kd and (C) MGW and (D) EP, respectively, were calculated for all
eight Fis binding sites (red) compared to seven Fis binding sites without the central TpA ‘hinge’ step (blue).
57
4.3.4 EP-based modeling of DNA binding affinity for 27 protein families
In a HT approach to the modeling of TF binding specificity, we added EP as an explicit
biophysical feature in our ML approach for quantitative prediction of DNA binding specificities
of TFs [47]. We targeted the most extensive mammalian TF binding data available to date derived
from re-sequenced HT-SELEX experiments [82], in addition to SELEX-seq data for Drosophila
Exd-Hox complexes [1] and gcPBM experiments for human bHLH TFs [32]. In total, these data
included 239 TFs from 27 different protein families (see Supplementary materials for details on
datasets). Directly integrating EP as a biophysical feature in the analysis of HT sequencing data
enabled the testing of its contribution to quantitative binding models. We used L2-regularized
MLR to train predictive models based on different combinations of EP, sequence and shape
features, to predict binding affinity for all DNA sequences in a dataset that can determine TF–
DNA binding specificity. For this purpose, we concatenated feature vectors, as previously
introduced [47], which were comprised of binary features representing sequence combined with
DNA shape and EP features (feature encoding is described in Supplementary materials). DNA
shape and EP values were normalized between 0 and 1, as previously described [16]. We evaluated
different models based on the coefficient of determination (R2) between measured and predicted
binding affinities using 10-fold cross-validation.
Sequence+EP models outperformed sequence-only models for 233 of the 239 tested TFs
(P < 2.270 × 10
−36
, Figure 4.8A) (statistical testing is described in Supplementary materials). This
observation indicates that EP plays an important role in TF binding specificity, consistent with our
previous conclusion that DNA shape readout is important for TF binding specificity [81]. To test
whether EP has additional predictive power beyond MGW, we added EP to our sequence+MGW
models. The resulting sequence+MGW+EP models outperformed sequence+MGW models (P <
58
1.549 × 10
−34
, Figure 4.8B). Our interpretation of this observation is that EP and MGW encode
largely overlapping but not identical information. EP describes base-specific charged groups in the
minor groove in addition to shape-dependent electrostatic focusing. MGW also includes geometric
information on the possibility of contacts with the sugar and phosphate groups.
Similarly, when we added EP to our sequence+shape models, which contain information
on all four shape features, the resulting sequence+shape+EP models outperformed
sequence+shape models (P < 4.569 × 10
−41
, Supplementary Figure S7A). Although the relatively
small improvement was due to overlapping information, this result nevertheless implies that
directly using EP contributes predictive power to models that are based on DNA sequence and
shape. When we replaced MGW by EP, the sequence+3shapes+EP model slightly outperformed
the sequence+shape models (P < 6.748 × 10
−4
, Supplementary Figure S7B). In contrast, the
sequence+EP models did not outperform sequence+MGW models (Supplementary Figure S7C),
and EP models did not outperform MGW models (Supplementary Figure S7D). The latter result
is not surprising because DNA shape or EP alone does not contain information on hydrogen
bonding opportunities, which are necessary for TF–DNA readout [2].
Although the performance gain was quite small when DNA shape was already included in
the model, it was highly significant (P < 1.549 × 10
−34
for sequence+MGW+EP versus
sequence+MGW models; P < 4.569 × 10
−41
for sequence+shape+EP versus sequence+shape
models; P < 6.748 × 10
−4
for sequence+3shapes+EP versus sequence+shape). Nevertheless, our
results demonstrate the added information content of biophysical features compared to purely
geometric DNA shape information. The performance gain for EP-augmented models based on
high-quality gcPBM datasets for human bHLH TFs was higher than for models based on other
datasets (Supplementary Figure S8). This improvement was probably influenced by the
59
consideration of genomic flanks comprising 15 bp 5’ and 3’ of the core binding sites (enhancer or
E-box), which likely increases information content derived from the flanking regions. To
investigate whether EP in the flanking region contributes to the binding specificity of different TF
families, we targeted binding sites of human bHLH TF dimers Mad1/Max (‘Mad’), Max/Max
(‘Max’) and c-Myc/Max (‘Myc’) derived from gcPBM experiments [113]. Whereas the E-box
(CANNTG) as the core-binding motif is shared among all three TF complexes (Supplementary
Figure S9A), differential DNA binding specificities can be detected through the analysis of EP
preferences between TFs (Supplementary Figure S9B). These differences can be due to EP
variations in the flanking sequences. As protein loops can contact these flanking regions [32], this
result suggests a mechanism of how the flanks biophysically contribute to the selection of binding
sites in the genome.
60
Figure 4.8 Performance comparison of binding specificity predictions for 239 TFs
Performance comparison of binding specificity predictions for 239 TFs derived from HT-SELEX, SELEX-seq
and gcPBM HT binding assays. Each data point demonstrates performances of two models (denoted by x- and y-
axis labels). The model labeled on the y-axis outperforms the model labeled on the x-axis if the data point is
located above the diagonal line. (A) The sequence+EP models outperform the sequence-only models and
contribute to the increased prediction accuracy of DNA binding specificities based on L2-regularized MLR and
10-fold cross validation. (B) The sequence+MGW+EP models likewise outperform the sequence+MGW models.
The P values were calculated by using the t-test hypothesis testing method with performance increase in terms of
R2 as the alternative hypothesis.
4.3.5 Base versus phosphate EP contributions in quantitative models
To decipher the additional information that EP might contain relative to MGW, we
deconvolved the EP originating from bases versus phosphate groups by using the LPB equation
(due to its additivity) [17]. Using the deconvolved contributions to EP in ML methods to predict
TF–DNA binding specificities, we found that EPs originating from bases (Supplementary Figure
S10A and B) or phosphate groups (Supplementary Figure S10C and D) contributed similarly to
the performance of sequence+EP compared to sequence models and, likewise, to the performance
of sequence+MGW+EP compared to sequence+MGW models. This difference in performance
61
increased when the added EP information was largely from the contributions of phosphate groups
in bHLH TFs in the gcPBM data. On the other hand, there was essentially no improvement for EP
contributions from the bases. This observation indicates that contacts with phosphate groups in the
flanking regions of core binding sites contribute to binding specificity. Contacts of basic side
chains in linker regions of bHLH proteins with phosphates in the regions flanking their core
binding sites have been reported in co-crystal structures for Max [114] and USF [115].
Our EP dissection combined with quantitative modeling revealed this readout mechanism
without the need of an experimentally solved structure. Thus, our results suggest that EP
contributes to TF–DNA binding specificity, and that a direct analysis of EP might reveal the
biophysical origin of DNA shape readout mechanisms.
4.3.6 EP contributions to Hox-DNA binding specificity
The performance gain of SELEX-seq datasets for Exd-Hox protein complexes was higher
than the gain of all HT-SELEX datasets (Supplementary Figure S8). The improvement difference
between the SELEX-seq datasets for Exd-Hox heterodimers and HT-SELEX datasets for
monomeric homeodomains was likely due to the fact that the minor-groove readout for SELEX-
seq data depends in part on the linker region between the Hox protein and its Exd cofactor [1]. In
contrast, the HT-SELEX data were derived from the binding of homeodomains in monomeric form
[82] and, therefore, were not influenced by latent specificity [33].
To examine the contributions of EP to Hox-DNA binding specificity in detail, we used
DNAphi to predict minor-groove EP for sequences derived from SELEX-seq experiments for
Drosophila Exd-Hox heterodimers. We averaged EP predictions at each nucleotide position of the
sequences selected by each Hox protein. This family of TFs uses positively charged amino acids
62
to recognize the enhanced negative EP in the minor groove of the Exd-Hox consensus site
GAYNNAY (with Y = C or T) [33]. For example, Arg5 within the Hox linker region selects for
more negative EP at A5Y6 positions, whereas Arg3/His–12 preferentially binds to sequences with
more negative EP at A9Y10 positions (Figure 4.9).
Although most sequences were predicted to have more negative EP in the minor groove at
A5Y6 positions, the largest variation among binding sites of different Hox proteins occurred at
A9Y10 positions. This characteristic EP pattern cannot be explained by nucleotide composition
alone and correlates with our previous observations on DNA shape [20]. For example, EP varied
gradually at A9Y10 positions from more negative EP for sites of anterior Hox proteins to less
negative EP for sequences selected by posterior Hox proteins. As a result, anterior Hox proteins
(Lab, Pb, Dfd and Scr) selected sequences with two regions of more negative EP, whereas posterior
Hox proteins (Antp, Ubx, AbdA and AbdB) selected sequences with only one region of more
negative EP at A5Y6 positions (Figure 4.9A). Scr mutants (in which Arg3, His–12 or Arg5 was
mutated to alanine) demonstrated the effect of losing a positive charge on the ability to recognize
regions of enhanced negative EP (Figure 4.9B). Antp mutants (in which minor groove-contacting
residues from Scr were engineered into the Scr linker) regained the ability to read two regions of
more negative EP (Figure 4.9C).
63
Figure 4.9 Heat map of average EP at each position of 16-mers selected by each Exd-Hox heterodimer.
Dark and light red colors represent regions of more negative and less negative EP, respectively. Black boxes
indicate A5Y6 positions of 16-mers containing the TGAYNNAY core where, in the case of Scr, Arg5 contacts
the minor groove, and A9Y10 positions where Arg3 and His–12 bind to the minor groove. (A) EP profiles for
selected sequences vary between anterior and posterior Hox proteins. (B) EP profiles for sequences selected by
Scr mutants suggest a loss in the ability to recognize the EP profile when minor groove-contacting residues are
mutated. (C) EP profiles for sequences selected by Antp mutants, where Scr-linker residues are inserted into the
Antp protein, show the restored ability to recognize a second minimum of enhanced negative EP.
64
4.4 Conclusions
Multiple determinants, including DNA sequence and shape, contribute to TF–DNA
binding specificity [3]. Albeit related to shape, EP adds an additional layer—a biophysical
determinant—to the complexity of protein–DNA binding. Before the development of our new
approach, HT analysis of EP was not possible for large datasets due to the limitations of
unavailable structures and constrained computing power. Although EP can be inferred indirectly
by HT DNA shape prediction [14] or experiment-based methods [67], quantitative EP prediction
on a genomic scale required a new approach.
In this study we introduced DNAphi, a HT approach to derive EP features from massive
sequencing data. This method is based on the data mining of results from Poisson–Boltzmann
calculations on DNA structures obtained from MC simulations. Validation of DNAphi by using
available data revealed improved prediction accuracy based on a biophysical feature of protein–
DNA binding. Statistical models of TF–DNA binding specificity consistently benefited from EP-
augmented models.
Approaches to calculate EP at surfaces of biological molecules and complexes have been
widely and successfully applied in many studies of molecular recognition [81, 92, 98, 116-120].
DNAphi, however, is the first methodology to enable rapid calculation of EP in the minor groove
of double-stranded DNA as a function of nucleotide sequence. The HT approach makes
electrostatic information accessible for whole-genome analysis. By combining knowledge from
biophysics and genomics, this work suggests a new path toward understanding protein–DNA
binding and function, with the possibility of extensions to investigate higher-level effects from,
for example, chromatin and cooperativity. In addition, because EP can be defined at diverse
65
molecular surfaces, we envision that EP analysis will be more generally applicable than MGW of
double-helical DNA to, for instance, protein-RNA binding specificity studies.
66
Chapter 5 DeepRec: a deep learning method for discovery of
physicochemical signature in DNA for TF binding
5.1 Introduction
Each DNA sequence has a physicochemical signature characterized by a set of functional
groups on the edges of the base pairs exposed in the DNA grooves (Figure 5.1A). Proteins can
recognize the signature by having a surface physicochemically complementary to the signature of
DNA, forming a series of contacts between the protein and the base pairs. These contacts include
direct hydrogen bonds, water-mediated hydrogen bonds, and hydrophobic contacts [2].
Much effort has gone into understanding the function of hydrogen bonds and hydrophobic
contacts in protein–DNA interaction, which can play a decisive role for TF recognition and
stability of protein-DNA complexes. For example, TF can form hydrogen bonds with bases in the
major groove using basic helix-loop-helix (bHLH), basic leucine zipper (bZip), helix-turn-helix
(HTH), zinc finger, immunoglobulin fold domains, etc. [4, 121] These hydrogen bonds lead to
specific recognition of base-pair identified by TF. As a pronounced example, arginine often
recognizes guanine by making a bidentate interaction, that is, forming two hydrogen bonds, which
contribute significantly to the TF–DNA binding specificity [4, 10]. Hydrophobic contacts with
bases can also contribute to binding specificity. For example, protein residues can employ
hydrophobic interactions to differentiate thymine from cytosine [11-13]. In some cases, deviations
from a B-form helix increase the accessibility of physicochemical signatures in DNA for TF to
establish a set of hydrogen bonds or nonpolar interactions that can alter the DNA-binding
specificity without changing physicochemical profile of DNA [122-124].
67
Postsynthetic modification on bases can change physicochemical signatures in DNA and,
in turn, affect the binding of TF to DNA. The most studied is DNA cytosine methylation, which
alters the DNA physicochemical signatures by covalently adding a methyl group at the 5-carbon
of the cytosine ring to form 5-methylcytosine (5mC) [125-128] (Figure 5.1B). 5mC in the sequence
context CpG (mCpG) inhibits DNA binding of most major classes of TF, including bHLH, bZIP,
and ETS [129]. On the other hand, the methylated DNA has a preference for being bound by TFs
such as homeodomain, POU, and NFAT proteins which were reported to have correlation with
embryonic and organismal development [129]. DNA demethylation, the removal of a methyl
group, can also lead to an alteration of DNA physicochemical signatures. For example, during the
demethylation process, 5mC is converted to 5-hydroxymethylcytosine (5hmC), 5-formylcytosine
(5fC) and 5-carboxylcytosine (5caC) in successive oxidation steps [130-133]. These intermediates
have been reported to function as epigenetic marks in their own right [134-138]. 5caC, for example,
changes the physicochemical pattern of cytosine through substitution of 5-methyl group with a
negatively charged carboxylate group [125, 138] (Figure 5.1B). In addition to 5mC, other
modifications such as 4-methylcytosine (4mC), 6-methyadenine (6mA), can also change the DNA
signatures [125], thus potentially acting as another determinate of TF–DNA binding (Figure 5.1B).
DNA mismatch occurs when two non-complementary bases are aligned on opposite strands
of a DNA duplex, providing another way for DNA to change its physicochemical profile. For
example, T/G mismatch provides an additional hydrogen bond, while C/T mismatch reduces one
hydrogen bond, compared to a canonical C/G base pair (Figure 5.1B). Mismatches happen during
DNA replication, genetic recombination, or by chemical reactions, which can lead to mutations
and have been reported to correlate with numerous cancers and neurogenerative diseases [139].
However, the effects of mismatches on TF binding are still poorly understood.
68
Figure 5.1 Physicochemical standard and expanded sigantures in the DNA major and minor groove.
(A) Physicochemical signatures on the edges of the bases in the major and minor groove enable proteins to
readout DNA though hydrogen bonds and hydrophobic contacts. The signatures include hydrogen- bond acceptor
(in red) and donor (in blue), methyl group (in yellow) and the nonpolar hydrogen (in gray). (B) In addition to the
signatures for standard bps, DNA modification and mismatch bp introduce additional signatures.
X-ray crystallographic structures of protein–DNA complexes provide essential insights
into TF–DNA binding mechanisms [20, 83]. However, experimentally determined structures are
available for relatively few TFs and are typically limited to complexes where a protein or its DNA
binding domain binds to a single DNA sequence. Moreover, the positions of hydrogen atoms in
X-ray crystal structure are not often reliable, because they are generally more dynamic and have
less electron density for polar hydrogens bonded to nitrogen or oxygen atoms to verify. This could
be one reason why various studies on protein–DNA complexes based on X-ray crystallographic
structures have been unable to identify the presence of sufficient hydrogen bonds, which is
necessary to maintain the specificity and stability of such complexes [140-144].
In the last decade, several high-throughput technologies have been developed for a better
understanding of the TF–DNA binding mechanisms by quantitatively measuring the binding
69
affinities of a TF against thousands or even millions of different DNA sequences in vitro. Relevant
methods include high-throughput sequencing technologies, such as SELEX-seq, HT-SELEX or
SMiLE-seq [33-35, 145], microarray-based technologies such as PBM or gcPBM [30-32], and
MITOMI-based methods that include HiTS-FLIP or BET-seq [146, 147]. These methods provide
an alternative path to infer protein-DNA binding mechanisms without requiring time-consuming
structural biology experiments.
With the development of these high-throughput assays, many DNA motif discovery
methods have been developed. There are methods that use statistics and pattern recognition to
search for patterns in massive DNA sequencing data [38-43, 45, 46]. Most of these methods model
TF–DNA binding preference using position weight matrix (PWM), with each element representing
the contribution to the overall binding affinity from a nucleotide at the corresponding position [5,
6, 37]. To address the issue that PWM-based methods cannot describe the interdependent
contributions between nucleotides, more complex modeling methods such as dinucleotide weight
matrices (DWMs) [40, 41], Bayesian networks [46], Hidden Markov models [44, 45], k-mer based
regression [42, 43], and DNA shape [47, 48] models have been developed. There are other methods
that biophysically model the non-linear relationship between binding energy and the statistics of
selected binding sites, such as MatrixREDUCE [50] and BEEML [35]. A frequent limitation of
these methods is the need to align and filter the raw data based on the occurrence of a core motif,
which results in discarding informative data. To address this issue, BEESEM [51], SelexGLM [52]
and NRLB [53] statistically model long sequences containing shorter binding sites without the
prior knowledge of the corresponding binding location, while DeepBind [54] uses deep
convolutional neural networks (CNNs) to capture the binding site without preprocessing raw data
and so is capable of profiling the full affinity range. However, all these methods are limited to the
70
resolution at DNA sequence and are unable to mine the insightful patterns at physicochemical
resolution.
Here we present DeepRec (Deep Recognition for TF–DNA binding), which integrates two
CNN modules for extracting the important physicochemical signatures in the DNA major and
minor groove, and is thus capable of addressing the aforementioned limitations. The method also
integrates a forward perturbation-based interpretation approach to highlight the important
physicochemical signatures for deciphering the binding mechanisms.
5.2 Materials and methods
5.2.1 Deep learning framework
A deep neural network is a type of artificial neural network comprising of multiple hidden
layers between input and output layers. Each layer consists of a number of neurons, which receive
input from a set of previous-layer neurons. This sequential layer-by-layer structure executes a
sequence of functional transformations to model complex non-linear relationships between
predictor and the response variables.
We developed a multi-module and multi-task deep learning framework with an ability to
mine important patterns in multimodal systems (Figure 5.2). This framework can be extended by
simply plugging in a module of user-defined set of features to fit questions of interest. The
extendable architecture includes alternating convolutional and down-sampling layers for each
module that extract features from input data, a joint layer combining features retrieved from
heterogeneous sources, and hidden layers that further integrate features to discover higher level
patterns (Figure 5.2). Based on this framework, we developed DeepRec (Deep Recognition of TF–
DNA binding), which integrates physicochemical features of DNA in the major and minor groove
71
followed by a perturbation-based forward-propagation approach to interpret the resulting model
(Figure 5.3). This method aims at discovering important physicochemical patterns recognized by
TFs, and further explains biological insights in which sequence-based models cannot elucidate.
Figure 5.2 Multimodal deep learning framework.
This framework comprises of a flexible user-defined input layer, a feature-extraction convolutional layer, a
feature-interaction joint layer, and a hidden layer followed by an output layer.
72
Figure 5.3 Modeling TF–DNA interactions using DeepRec.
DeepRec predictions using physicochemical features followed by perturbation-based interpretation reveal the
importance of physicochemical signatures.
5.2.2 DNA physicochemical signatures and feature encoding
The sequence-specific physico-cheimcal patterns on the edges of the bases in the DNA
major or minor groove underline the ability of TFs to readout bps through hydrogen bonds or
hydrophobic contacts, as shown in Figure 5.4.
73
Figure 5.4 Functional groups of G/C and A/T bps exposed on DNA major and minor grooves.
The signatures are hydrogen-bond acceptors in red, donors in blue, methyl groups in yellow, and nonpolar
hydrogens in gray.
Major groove encoding
In the major groove, a G/C bp has a unique physicochemical pattern ‘AADN’ (A: hydrogen
bond acceptor; D: hydrogen bond donor; N: nonpolar hydrogen) as it differs from a C/G bp pattern
(‘NDAA’), an A/T bp pattern (‘ADAM’) and a T/A bp pattern (‘MADA’) with ‘M’ representing
the thymine methyl group (Figure 5.4). We encode the signature as binary vectors A=[0,0,0,1],
D=[0,0,1,0], M=[0,1,0,0], N=[1,0,0,0]. With this we can further encode the major groove signature
of each bp as a 4 × 4 binary vector:
A/T = [
𝐴 𝐷 𝐴 𝑀 ] = [
0 0
0 0
0 1
1 0
0 0
0 1
0 1
0 0
] , T/A = [
𝑀 𝐴 𝐷 𝐴 ] = [
0 1
0 0
0 0
0 1
0 0
0 0
1 0
0 1
] ,
74
G/C = [
𝐴 𝐴 𝐷 𝑁 ] = [
0 0
0 0
0 1
0 1
0 0
1 0
1 0
0 0
] , C/G = [
𝑁 𝐷 𝐴 𝐴 ] = [
1 0
0 0
0 0
1 0
0 0
0 0
0 1
0 1
]
Minor groove encoding
In the minor groove, A/T bp shares the same pattern the pattern (‘ANA’) with a T/A bp.
Likewise, the pattern is identical for G/C and C/G bp (‘ADA’). Using the same signature vector
for encoding, we can encode the minor groove signature of each bp as a 3 × 4 binary vector:
A/T = [
𝐴 𝑁 𝐴 ] = [
0 0
1 0
0 1
0 0
0 0 0 1
] , T/A = [
𝐴 𝑁 𝐴 ] = [
0 0
1 0
0 1
0 0
0 0 0 1
] ,
G/C = [
𝐴 𝐷 𝐴 ] = [
0 0
0 0
0 1
1 0
0 0 0 1
] , C/G = [
𝐴 𝐷 𝐴 ] = [
0 0
0 0
0 1
1 0
0 0 0 1
]
5.2.3 DeepRec model design
DeepRec consists of a major groove physicochemical module, a minor groove
physicochemical module, a joint module that integrates signals from the two physicochemical
modules, and an output layer that computes the relative TF binding affinities for each probe (Figure
5.3).
Major groove physicochemical module
The major groove module is convolutional neural networks (CNNs) with a convolutional,
a pooling layer, and a fully connected hidden layer. Each layer of CNNs executes a linear
transformation of the output from the previous layer by multiplying a weight matrix, followed by
a nonlinear transformation. The weight matrix is learned during training to minimize predictive
errors. The module takes major groove physicochemical signatures of user-defined length (l) as
75
input, which is represented as a binary matrix in dimension of 4 × 4𝑙 (see 5.2.2 Major groove
encoding).
The convolution layer convolves the 2D physicochemical profile (a 2D image) at every
position (i, j) by using filter 𝑊 𝑘 :
𝑐𝑜𝑛𝑣𝑜𝑙𝑢𝑡𝑖𝑜𝑛 (𝑋 )
𝑖𝑗𝑘 = 𝑅𝑒𝐿𝑢 ( ∑ ∑ ∑ 𝑊 𝑚𝑛𝑑 𝑘 𝑋 𝑖 +𝑚 ,𝑗 +𝑛 ,𝑑 𝐷 −1
𝑑 =0
𝑁 −1
𝑛 =0
𝑀 −1
𝑚 =0
)
where X is the input, i and j are the index of the output position (i is the index of edge-offset; j is
the index of bp position), and k is the index of filters. Each convolution filter 𝑊 𝑘 is an 𝑀 × 𝑁 × 𝐷
weight matrix with M being the filter height, N being the filter width and D being the number of
input channels (for now, we have four channels including hydrogen-bond acceptor, donor, methyl-
group and nonpolar hydrogen). ReLU represents the rectified nonlinear function:
𝑅𝑒𝐿𝑈 (𝑥 ) = {
𝑥 , 𝑖𝑓 𝑥 ≥ 0
0,𝑖𝑓 𝑥 < 0
The polling layer computes the maximum value in a window of the convolution layer
outputs for each filter. To decrease the dimension of the input and the number of model parameters,
non-overlapping pooling with window size 𝑀 × 𝑁 was applied. The pooling operation is defined
as:
𝑝𝑜𝑜𝑙𝑖𝑛𝑔 (𝑋 )
𝑖𝑗𝑘 = 𝑚𝑎𝑥 ({𝑋 𝐴 ,𝐵 ,𝑘 })
where X is the input, i and j are the index for output position, k is the index of filters, 𝐴 =
{𝑖𝑀 ,𝑖𝑀 + 1,⋯,(𝑖𝑀 + 𝑀 − 1} and 𝐵 = {𝑗𝑁 ,𝑗𝑁 + 1,⋯,(𝑗𝑁 + 𝑁 − 1}, M is the height, and N is
the width of the pooling window size.
76
Physicochemical minor groove module
The minor groove module is nearly identical to the major groove module. The only
difference is minor groove module takes as input the minor groove physicochemical signatures,
which is represented as a binary matrix in dimension of 3 × 4𝑙 (see 5.2.2 Minor groove encoding).
Joint module
To model interactions between extracted physicochemical signatures from the major and
minor groove modules, the Joint module has one hidden layer with ReLU activation function,
which is connected to all neurons of the last layers of the major and minor groove modules. One
output neuron with sigmoid activation function is connected to outputs from the hidden layer,
which predicts the relative TF binding affinity:
𝑆𝑖𝑔𝑚𝑜𝑖𝑑 (𝑥 ) =
1
1 + 𝑒 −𝑥
Model training of DeepRec
Model parameters were initialized randomly [148] and learnt on the training dataset by
minimizing the following loss function, which is defined as the sum of mean squared error (MSE)
and regularization terms for controlling overfitting:
𝐿𝑜𝑠𝑠 (𝑤 ) = 𝑀𝑆𝐸 𝑤 (𝑦̂ ,𝑦 )+ 𝜆 1
‖𝑤 ‖
1
+ 𝜆 2
‖𝑤 ‖
2
𝑀𝑆𝐸 𝑤 (𝑦̂ ,𝑦 ) =
1
𝑁 ∑(𝑦 𝑛 ̂ − 𝑦 𝑛 )
2
𝑁 𝑛 =1
The loss function was optimized by mini-batch stochastic gradient descent with a batch size of 128
and a global learning rate of 0.0001 which was multiplied by 0.5 if the validation loss did not
77
improve over four epochs. The learning rate was adapted by Adam [149]. DeepRec was
implemented in Python 2.7 using Theano [150] 0.8.2 and Keras [151] 0.2.0. We trained and tested
DeepRec model using a single NVIDIA tesla K80 GPU.
5.2.4 Interpretation
DeepRec utilizes a perturbation-based forward propagation approach that performs
nullification of a physicochemical signature at individual edge-offset position of a base-pair at a
time, and quantified its impact on binding free energy (Figure 5.5). We calculated the binding free
energy difference between presence and absence of the signature at the corresponding edge offset
regardless of nucleotide composition. That is, free energy for a particular signature at that edge
offset can be contributed from different nucleotides possessing the same signature at that edge
offset. For example, A/T and G/C bp share an acceptor at edge-offset 1, so we calculate the binding
free energy difference between presence and absence of that acceptor of the corresponding A/T
and G/C bps respectively and report an average of those two values of binding free energy. T/A
bp, as another example, is the only nucleotide, which has a methyl-group at edge-offset 1; therefore,
to calculate the binding free energy for the methyl-group at edge-offset 1, we only need to consider
T/A bp.
To visualize the detailed binding preferences of an individual TF, DeepRec introduced a
new visualization, named DNA physicochemical energy logos, as an example shown in figure 5.5.
In these logos, we used the letters A, D, M, and N to represent DNA physicochemical features
hydrogen-bond acceptor, hydrogen-bond donor, methyl group, and nonpolar hydrogen
respectively. The logos describe the binding preference in each DNA groove (major and minor)
and edge-offset (1–4) at single nucleotide resolution. The height of each letter indicates the change
78
in binding free energy (- Δ Δ ΔG) resulting from the comparison of the reference probe to its mutants
with a nullified physicochemical signature (Figure 5.5).
We expanded this method to handle DNA modification by considering adding or replacing
a specific physicochemical signature. For example, we can swap the nonpolar-hydrogen signature
at the edge-offset 1 of C/T bp with a methyl-group signature in the major groove to mimic the
methylation at the 5-position of cytosine, 5mC, and then we can measure the change in binding
free energy by comparing the reference probe with a nullification of that physicochemical
signature.
Likewise, we can switch the nonpolar-hydrogen signature at the same position with a
signature of hydrogen-bond acceptor to mimic, for example, the further oxidation of modified
cytosine to the 5-carboxyl form, 5caC. Similarly, we can remove the methyl-group signature from
the edge-offset 1 of T/A bp with nonpolar-hydrogen signature in the major groove to investigate
the effect of demethylation, or add amino-group to A/T bp in the minor groove by swapping the
nonpolar-hydrogen signature with hydrogen-bond donor in the minor groove.
79
Figure 5.5 DNA physicochemical energy logos.
The logos consist of four physicochemical signatures including hydrogen-bond acceptor (in red A) and donor (in
blue D), methyl group (in yellow M) and nonpolar hydrogen (in gray N). The height of these letters quantifies the
importance of individual physicso-chemical signatures in each DNA groove at single nucleotide resolution on
TF–DNA binding.
5.3 Results and discussion
5.3.1 Sequence-based models fail to determine physico-chemistry-based models
The learned weights for sequence models (represented as PWMs, for example) cannot
imply the weights learned from physico-chemistry-based models. Assuming that PWM can be
represented as a linear combination of physicochemical signatures, the PWM at position i of a
DNA sequence can be written as a matrix equation 𝐴𝑥 = 𝑦
80
𝐴 =
[
𝛽 𝑖 1,𝑎 𝛽 𝑖 2,𝑑 𝛽 𝑖 1,𝑛 𝛽 𝑖 2,𝑑 𝛽 𝑖 3,𝑎 𝛽 𝑖 4,𝑚 𝛽 𝑖 3,𝑎 𝛽 𝑖 4,𝑎 𝛽 𝑖 1,𝑛 𝛽 𝑖 2,𝑎 𝛽 𝑖 1,𝑚 𝛽 𝑖 2,𝑎 𝛽 𝑖 3,𝑑 𝛽 𝑖 4,𝑛 𝛽 𝑖 3,𝑑 𝛽 𝑖 4,𝑎 ]
𝑦 =
[
𝛽 𝑖 ,𝐴 𝛽 𝑖 ,𝐶 𝛽 𝑖 ,𝐺 𝛽 𝑖 ,𝑇 ]
where 𝛽 𝑖 ,𝐴 is PWM coefficient for A/T bp at position i, 𝛽 𝑖 1,𝑎 is physicochemical coefficient for
“hydrogen-bond acceptor” at the edge-offset 1 of the position i. Since there are 10 unknown
variables (𝛽 𝑖 1,𝑎 ,𝛽 𝑖 1,𝑛 ,𝛽 𝑖 1,𝑚 ,𝛽 𝑖 2,𝑑 ,𝛽 𝑖 2,𝑎 ,𝛽 𝑖 3,𝑎 ,𝛽 𝑖 3,𝑑 ,𝛽 𝑖 4,𝑚 ,𝛽 𝑖 4,𝑎 and 𝛽 𝑖 4,𝑛 ), while there are only four
equations, this system of linear equations is considered underdetermined where the system has
either no solution or infinitely many solutions. Therefore, a sequence model cannot imply a
physicochemical model, i.e. physicochemical model contains more information than sequence-
based model.
5.3.2 DeepRec predicts DNA contacts of TF–DNA binding
Experimentally determined structures of protein–DNA complexes provide critical insights
into binding mechanisms. By observing the number and geometry of hydrogen bonds as well as
hydrophobic contacts between protein residues and DNA bases, one can understand how proteins
use their unique readout mechanism to achieve DNA-binding specificity. However, co-crystal
structures are available for relatively few TFs and are typically limited to complexes where a
protein or its DNA-binding domain binds to a single DNA sequence. Besides, crystal-packing
contacts near the binding site, dynamic positions of hydrogen atoms, and less electron density for
polar hydrogen bonding sometimes make it difficult to determine a reliable binding contact.
DeepRec leverages big data generated from high-throughput assays and enables, for the first time,
the prediction of physicochemical signatures; thus, it has the potential to confirm or reveal
unknown binding mechanism.
81
We first targeted the widely studies human helix–loop–helix (bHLH) protein MAX, which
preferentially recognizes a subset of enhancer-box (E-box) sequences containing a central CpG
dinucleotide (5’-CACGTG-3’) [152]. We trained DeepRec on MAX SELEX-seq data [53],
interpreted the model using DNA physicochemical energy logos (Figure 5.6A), and compared the
predicted logos with the TF–DNA contacts of a MAX co-crystal structure (PDB ID: 5EYO) [138]
(Figure 5.6B–D).
Two prominent A’s (A is the label for hydrogen-bond acceptor) located at the edge-offset
3 and 4 in the major groove of C-3/G-3 bp in the E-box can be observed in the physicochemical
energy logos (Figure 5.6A). Agreeing with the Max crystal structure, HIS28 donates one hydrogen
bond to either O6 or N7 of the 3’ Guanine (G-3) or forms a bifurcated hydrogen bond [153] (Figure
5.6B). The system prefers a nonpolar hydrogen (N, the label for nonpolar hydrogen, shows a
positive - ΔΔΔG) instead of other functional groups (M, the label for methyl-group, and A, the label
for functional groups donating acceptors, show a negative - ΔΔΔG) at the edge-offset 1 in the major
groove of C-3/G-3 bp. No clear signal can be found at the edge-offset 2 in the major groove of C-
3/G-3 bp in the energy logos (Figure 5.6A). In the MAX crystal structure (Figure 5.6B), the
hydrogen-bond acceptor of GLU32 might be occupied by the donors of ARG35, which might
prevent the bonding with the donor of 5’ Cytosine (C-3).
Another notable A is shown at the edge-offset 3 in the major groove of A-2/T-2 bp in the
physicochemical energy logos (Figure 5.6A), in consistence with the crystal structure where
ARG36 donates one or two hydrogen bonds to O4 of the 3’ Thymine (T-2) (Figure 5.6C). No strong
message can be found at the edge-offset 1, 2, and 4 where no hydrogen bonding or hydrophobic
contact is detected in the crystal structure (Figure 5.6C).
82
The other two prominent A’s can be found at the edge-offset 3 and 4 in the major groove
of C-1/G-1 bp (Figure 5.6A). In the crystal structure (Figure 5.6D), ARG36 interacts with the central
5’-CpG-3’ dinucleotide by donating two hydrogen atoms to the O6 and N7 atoms of the 3’ Guanine
(G-1), forming a bidentate hydrogen bond; this geometry conveys a high degree of specificity [144].
The energy logos show a positive A and a negative M at the edge-offset 1 in the major groove of
C-1/G-1, implying binding preferences of MAX with different chemical-modified nucleotides.
83
Figure 5.6 DNA physicochemical energy logos and structure of MAX-5caC complex (PDB ID 5EYO).
(A) The physicochemical energy logos were generated by DeepRec based on MAX SELEX-seq data. The
position with repect to the edge offset and bp position was indicated by the name of MAX protein residue if there
is a contact between protein and DNA found using DNAproDB [83]. The MAX binding site is highlighted in the
yellow box. (B) Possible interactions between MAX residues and the C -3/G -3 bp in the major groove. The
84
numerical numbers indicate the inter-atom distance in angstrom. (C) Possible interactions between MAX residues
and the A -2/T -2 bp in the major groove. (D) Possible interactions between MAX residues and the C -1/G -1 bp in the
major groove.
5.3.3 DeepRec predicts the impact of DNA modification on TF binding specificities
Postsynthetic modifications to DNA have been proved to play a key role in gene regulation.
However, the effects of the modifications on TF binding is still not completely understood. There
are studies, which have adopted low-throughput methods, such as crystallization or electrophoretic
mobility shift assay, on the complex of TF and DNA with chemical modifications and were able
to explain TF-binding mechanism at atomic resolution [138]. Other studies have utilized high-
throughput methods, such as Methyl-SELEX [129] or EpiSELEX-seq [154], to determine the
effect of CpG modification on TF-binding specificities with a requirement of a large pool of
chemically modified oligonucleotides. To address the issues of time-consuming or additional
process of preparing chemical-modified DNA libraries, we asked whether the impact of DNA
modification on TF-binding specificities can be learned from unmodified DNA datasets. We
trained DeepRec solely on unmethylated MAX SELEX-seq data and interpreted the model using
physicochemical energy logos.
The physicochemical logos (Figure 5.7A) show that MAX prefers a hydrogen bond
acceptor located at the edge-offset 1 in the major groove of C-1/G-1 bp and two hydrogen bond
acceptors located at the edge-offset 1 and 2 in the major groove of C1/G1 bp respectively. In the
MAX crystal structure (Figure 5.7B), the guanidino group of ARG36 forms a bidentate hydrogen
bond by donating two hydrogen atoms to the O6 and N7 atoms of the 5’ Guanine (G1), as well as
stacks with the carboxylate group of the 5’ 5caC (C-1), establishing a 5caC-ARG-Guanine triad
[138], as shown in the black box of Figure 5.7A and in the red circle of Figure 5.7B. Notably, 5caC
85
(C-1) is also involved in the distant interaction with ARG60 by two negatively charged carboxylate
groups to stabilize the conformation [138] (Figure 5.7B).
On the other hand, M shows a strong negative effect at the offset-edge 1 in the major groove
of C1/G1 bp (Figure 5.7A), which agrees with the result from MAX methyl-SELEX that indicated
CpG methylation severely decreases the MAX–DNA binding affinity [129]. Modeling a methyl
group onto unmodified CpG site of the central E-box of MAX crystal structure shows that methyl-
group at the C5 position potentially causes steric obstruction with ARG36, which is a possible
explanation for the diminished binding to E-box DNA containing CpG modification [138].
Together, these results show that DeepRec can not only accurately predict the
physicochemical signatures, which possibly contribute to TF–DNA binding specificity, but also
determine the geometry of these signatures. Strikingly, DeepRec is capable of learning the impact
of postsynthetic modifications on DNA binding specificities only based on unmodified DNA
dataset.
86
Figure 5.7 Geometry of physicochemical signatures and corresponding structure of MAX-5caC complex.
(A) Close-up view of the edge-offset 1 and 2 of central C -1/G -1 bp. The physicochemical patterns for 5caC-ARG-
Guanine triad is indicated in the black box. (B) Close-up view of the MAX-5caC structure. ARG36 forms a 5caC-
ARG-Guanine triad with the central CpG dinucleotide.
5.3.4 DeepRec predicts TF binding preference in the context of DNA grooves
The physicochemical signatures in DNA major groove are more diverse and unique than
the one in minor groove, thus conveying highest degree TF-binding specificity. MAX, for example,
only recognizes the signatures in major groove to achieve its binding specificity. This is confirmed
by the physicochemical energy logos shown in Figure 5.6A and has been observed in the MAX
crystal structure as well (PDB ID 5EYO). However, TF-binding specificity can be achieved
through a more complex recognition involving major and minor groove readout [2]. Therefore, we
asked whether DeepRec can predict the binding mechanisms that involve physicochemical
signatures in major and minor grooves.
87
We next targeted the myocyte enhancer factor 2B (MEF2B) belonging to the MEF2 family
that plays vital roles in development and functioning of neuronal and muscle cells, and known to
play an important role in TF binding through minor groove contacts as well [155-157]. We trained
DeepRec on MEF2B SELEX-seq data, interpreted the model using DNA physicochemical energy
logos (Figure 5.8A), and compared the predicted logos with the TF–DNA contacts of the MEF2
co-crystal structure (PDB ID 1N6J) [158] (Figure 5.8B-E).
One prominent A located at the edge-offset 4 in the major groove of T-4/A-4 bp can be
observed in the physicochemical energy logos (Figure 5.8A). Agreed with the MEF2 crystal
structure, LYS23 donates one hydrogen bond to N7 of 3’ Adenine (A-4) (Figure 5.8C). Another A
can be seen at the edge-offset 3 in the major groove of G-5/T-5 bp (Figure 5.8A), which agrees to a
possible contact between LYS23 and O6 of 3’ Guanine (G-5). Intriguingly, the major groove
signals mainly occur in the edge of core binding site (5’-CTAW4TAG-3’). In contrast, notable
signals can be observed in the minor groove of the core of the binding site (right panel in Figure
5.8A).
Compared to the physicochemical energy logos of MAX (right panel in Figure 5.7A),
MEF2B logos show stronger signals in the minor groove with prominent positive N’s and negative
D’s (right panel in Figure 5.9A). In the MEF2 crystal structure (Figure 5.9D), GLY2–ARG3
conformation inserts into the minor groove: ARG3 makes electrostatic interactions with the sugar-
phosphate backbone, and GLY2, on the other hand, makes Van der Waals or hydrophobic
interaction with nonpolar atoms.
The energy logos demonstrate long-range interactions in the minor groove (right panel in
Figure 5.9A), suggesting potential presence of AT-hook like motif with an “Arg–Gly–Arg” or
“Arg–Gly–His,” which can form high-degree specific interactions for minor-groove recognition
88
(Figure 5.9E). The logos can imply this long-range interaction, even though “Arg” or “His” might
be missing in the crystal structure (Figure 5.9E).
Figure 5.8 DNA physicochemical energy logos and structure of MEF2 complex (PDB ID 1N6J).
(A) The physicochemical energy logos were generated by DeepRec base on MEF2B SELEX-seq data. The
position with repect to the edge offset and bp position was indicate by the name of MEB2B protein residue if
there is a contact between protein and DNA found using DNAproDB [83]. The centeral core binding site is
highlighted in the yellow box. (B) Possible interactions between MEF2 residues and the C -5/G -5 bp in the major
groove. The numerical numbers indicate the inter-atom distance in angstrom. (C) Possible interactions between
MEF2 residues and the T -4/A -4 bp in the major groove. (D) Possible interactions between MEF2 residues and
atoms in the minor groove. (E) Possible “Arg–Gly–Arg” or “Arg–Gly–His” conformation in the minor groove.
89
5.4 Conclusions
Investigating important physicochemical patterns in DNA has great potential for
uncovering TF–DNA binding mechanisms. However, current experimental methods such as
crystallization face limitations in examining enough complex structures to be able to explain the
whole story of TF–DNA binding mechanism. One that other hand, existing methods lack the
ability to mine important physicochemical signatures by leveraging big data. In this study, we
proposed that DeepRec can mine essential physicochemical signals for TF–DNA binding
specificity in the context of binding free energy. It accurately predicted the possible binding
contacts for MAX and MEF2B at physicochemical resolution and was able to indicate the
corresponding forces such as hydrogen bonding, hydrophobic interactions, etc., where current
sequence modeling methods cannot achieve. Moreover, DeepRec can further detect the geometry
of physicochemical signatures which can contribute great TF binding specificity, such as
bifurcated-, bidentate-hydrogen bonding, 5caC–ARG–Guanine triad, and Arg–Gly–Arg–DNA
minor-groove-interaction pattern. Strikingly, DeepRec is capable of implying the impact of
postsynthetic modifications to DNA solely based on the learning on the dataset without DNA
modification.
Because DeepRec can mine finer information beyond DNA sequence, we envision that this
method can be easily expanded to the studies beyond the concept of DNA sequence toward a
physicochemical modeling. For example, to investigate the effect of Hoogsteen bp [159] observed
in p53 [10, 160], mismatch bp, and more DNA modified bp, such as 4mC and 6mA [125], in the
context of TF–DNA binding.
90
Chapter 6 Discussion and future work
Many regulatory mechanisms require a high degree of specificity in TF-DNA binding.
However, the nucleotide sequence does not answer the question of why a protein binds only to a
small subset of the many putative binding sites in the genome that share the same core motif.
Whereas higher-order effects, such as chromatin accessibility, cooperativity and cofactors, have
been described, DNA shape has recently gained attention as another feature that fine-tunes the
DNA binding specificities of some TFs. Structural biology generates data at high-resolution,
allowing us to biophysically explain the binding mechanism in atomic detail. However, due to the
low-throughput character, only a handful of TF complexes can be examined. In contrast, genomics
generates high-throughput (HT) data for entire genomes or massive sequences bound by a TF,
providing big data for investigating the binding mechanism. However, current modeling methods
are designed based on the DNA sequence (the letters A, C, G, and T), thus making it hard to
biophysically explain the binding mechanism. In this thesis, I intended to bridge the research gap
between the fields of structural biology and genomics. I first developed high-throughput prediction
methods for generating structural and biophysical features for massive sequence data. Then I
developed a highly accurate deep learning method to model TF–DNA binding specificities by
leveraging big biophysical feature data, further unraveling several insights into the TF binding
mechanisms. The future work arising from this study can be progress along the following axes:
Modeling for higher-order TF binding cooperativity in vitro
Cofactors also play an important role in TF-DNA binding in vivo. TFs often bind DNA as
multi-protein complexes for altering DNA binding specificities. Drosophila Hox proteins show
different binding properties when binding DNA with the dimeric cofactor Extradenticle-
Homothorax (Exd) [1, 33]. With this mechanism, TFs can regulate gene expression on a higher
91
level. A study demonstrated machine learning methods to qualitatively discover cofactor
interaction networks using multiple datasets such as gene expression profiles, protein-protein
interactions and recognition motifs [161], but the details of cofactor interplay mechanisms cannot
be predicted by such models. To model higher-order TF binding specificities with the recent
abundantly available of data [33, 162], the current modeling method can be expanded by
incorporating higher-order features in order to better explain the combinatorial co-binding of TFs
and their co-factors as key elements of gene regulation.
Modeling of higher-order TF binding in vivo
My current research focuses primarily on building quantitative prediction models for TF-
DNA binding specificities in vitro. However, biological processes and gene regulations in vivo
involve complex cellular conditions including interactions with cofactors, chromatin accessibility,
cooperativity and enhancer activities; these higher-order effects can influence TF-DNA binding
preferences in vivo, which is difficult explain via in vitro systems. Hence, developing prediction
models that include additional layers of binding specificity determinants is essential for my further
research.
Extending the modeling method from an in vitro context to an in vivo context is not
straightforward. Many genome-wide technologies for studying in vivo TF-DNA binding and
transcriptional regulations have been proposed [3]: in vivo high-throughput DNA-binding assays
(ChIP-seq [24], ChIP-exo [26, 27], DamID [163], DNase-seq [164], ATAC-seq [165], FAIRE-seq
[166], and so on), gene expression profiling, and high-throughput methods for enhancer region
identification [167-169]. Several studies revealed the influence of chromatin accessibility in TF-
DNA binding, and tried to explain the in vivo mechanisms behind the selection of few out of many
putative binding sites with the same core motifs [63, 181-183]. Proceeding further in this direction,
92
we can extend our current multi-module and multi-task deep learning by incorporating these multi-
scale and heterogeneous data.
RNA/DNA hybrids and CRISPR-Cas9 system
My current research also pertains to the prediction of multiple biophysical properties for
the relatively more stable double strand DNA. RNA/DNA hybrids, in contrast, are very dynamic;
this feature reportedly involves many essential cellular processes and activities such as gene
regulation and genome editing. To model a RNA/DNA hybrid system, we can implement an
approach similar to DNAshape by simulating massive seed structures, and mine general features
among these large simulated data through statistical methods. The general rules can be represented
as multiple biophysical features (such as hydrogen bond signature and hydrophobic profile) for
modeling RNA/DNA hybrid systems. The resulting model can be utilized to, for example,
understand a CRISPR-Cas9 system. A poor understanding of the mechanism of a guide RNA
(gRNA) targeting complementary genomic sequences with mismatches of gRNA:DNA hybrids
impedes the broader use of Cas9 [170, 171]. Modeling CRISPR-Cas9 systems using the
aforementioned biophysical features on CRISPR-Cas9 experimental data can potentially enable
systematic investigation of the binding and dissociation kinetics of Cas9–RNA as a function of
sequence mismatches, thus proposing great potentials for the advanced application of CRISPR-
Cas9.
93
Chapter 7 Tangential research I – Quantum computing
7.1 Introduction
With the advance of various high-throughput experimental technologies, the biological
data are growing with an exponential rate. The demand of mining a large-scale of data using
machine learning approaches, statistical methods or complex algorithms will challenge the
conventional computing technologies. Therefore, seeking new powerful tools to improve current
computing power is a must.
The concept of a quantum computer, constructed by an entirely different approach than
classical computers, is born to revolutionarily speed up current computing power [172-177].
Unlike using registers and memory, quantum computers use a quantum bit (qubit) to hold
information. The behavior of qubit follows the law of quantum mechanics which enables qubits to
stay in a “superposition” state; both 0 and 1 states exist at the same time until an interrupt of outside
event occurs and then the qubit collapse into either a 0 or a 1. The processor of quantum computer
considers all possibilities simultaneously to determine the lowest energy required from the system.
With these properties, quantum computer is capable of quickly solving some types of complex
optimization problems. Here we attempt to use D-Wave quantum computer to model the
relationship between multiple features and a response variable by fitting a linear equation to the
observed experimental data (multiple linear regression; MLR).
94
7.2 Materials and methods
7.2.1 The mapping of Ising/Qubo model
Mapping a given problem to an Ising/Qubo model is the first step before using quantum
computer. The hypothesis function of MLR is
ℎ
𝜃 (𝑥 ) = 𝜃 0
𝑥 0
+ 𝜃 1
𝑥 1
+ 𝜃 2
𝑥 2
+ ⋯+ 𝜃 𝑛 𝑥 𝑛 = ∑ 𝜃 𝑗 𝑥 𝑗 𝑛 𝑗 =0
The best model can be found by minimizing the objective function below
𝐽 (𝜃 ) = ∑ {(∑ 𝜃 𝑗 𝑥 𝑖𝑗
𝑛 𝑗 =0
)− 𝑦 𝑖 }
2
𝑚 𝑖 =1
+
1
2
𝜆 ∑ 𝜃 𝑗 2 𝑛 𝑗 =0
where m is the sample size; n is the feature number.
The Ising/Qubo model can be represented by
𝐻 (𝜎 ) = ∑ℎ
𝑗 𝜎 𝑗 𝑗 + ∑ 𝐽 𝑖𝑗
𝜎 𝑖 𝜎 𝑗 𝑖 ≠𝑗 [178],
where 𝜎 𝑖 𝑎𝑛𝑑 𝜎 𝑗 ∈ {0,1} are spin variables, ℎ
𝑗 and 𝐽 𝑖𝑗
are dimensionless coefficients matrix.
The deduction of mapping L2-MLR to Ising/Qubo model is:
𝐽 (𝜃 ) = ∑ {(∑ 𝜃 𝑗 𝑥 𝑖𝑗
𝑛 𝑗 =0
)− 𝑦 𝑖 }
2
𝑚 𝑖 =1
+
1
2
𝜆 ∑ 𝜃 𝑗 2 𝑛 𝑗 =0
= ∑ {(∑ 𝜃 𝑗 𝑥 𝑖𝑗
𝑛 𝑗 =0
)
2
− 2(∑ 𝜃 𝑗 𝑥 𝑖𝑗
𝑛 𝑗 =0
)𝑦 𝑖 + 𝑦 𝑖 2
}
𝑚 𝑖 =1
+
1
2
𝜆 ∑ 𝜃 𝑗 2 𝑛 𝑗 =0
= ∑ {∑ (𝜃 𝑗 𝑥 𝑖𝑗
)
2
+ ∑ 𝜃 𝑗 𝑥 𝑖𝑗
𝜃 𝑘 𝑥 𝑖𝑘
𝑛 𝑗 ≠𝑘 𝑛 𝑗 =0
− 2(∑ 𝜃 𝑗 𝑥 𝑖𝑗
𝑛 𝑗 =0
)𝑦 𝑖 + 𝑦 𝑖 2
} +
1
2
𝜆 ∑ 𝜃 𝑗 2 𝑛 𝑗 =0
𝑚 𝑖 =1
= ∑ ∑ (𝜃 𝑗 𝑥 𝑖𝑗
)
2
+
𝑛 𝑗 =0
∑ ∑ 𝜃 𝑗 𝑥 𝑖𝑗
𝜃 𝑘 𝑥 𝑖𝑘
𝑛 𝑗 ≠𝑘 +
𝑚 𝑖 =1
𝑚 𝑖 =1
∑ ∑ (−2)𝜃 𝑗 𝑥 𝑖𝑗
𝑦 𝑖 𝑛 𝑗 =0
+
𝑚 𝑖 =1
∑ 𝑦 𝑖 2
+ ∑ (
1
2
𝜆 )𝜃 𝑗 2
𝑛 𝑗 =0
𝑚 𝑖 =1
= ∑ ∑ (𝑥 𝑖𝑗
)
2
𝜃 𝑗 2 𝑚 𝑖 =1
𝑛 𝑗 =0
+ ∑ ∑ 𝑥 𝑖𝑗
𝑥 𝑖𝑘
𝜃 𝑗 𝜃 𝑘 + ∑ ∑ (−2)𝑥 𝑖𝑗
𝑦 𝑖 𝜃 𝑗 𝑚 𝑖 =1
𝑛 𝑗 =0
+ ∑ 𝑦 𝑖 2
+ ∑ (
1
2
𝜆 )𝜃 𝑗 2
𝑛 𝑗 =0
𝑚 𝑖 =1
𝑚 𝑖 =1
𝑛 𝑗 ≠𝑘
95
𝜃 = 𝜃 2
𝑏𝑒𝑐𝑎𝑢𝑠𝑒 𝜃 𝑜𝑛𝑙𝑦 𝑡𝑎𝑘𝑒𝑠 {0,1}
𝑤𝑒 𝑐𝑎 𝑛 𝑟𝑒𝑚𝑜𝑣𝑒 𝑡 ℎ𝑒 𝑡𝑒𝑟𝑚 𝑜𝑓 ∑ 𝑦 𝑖 2
𝑏𝑒𝑐𝑎𝑢𝑠𝑒 𝑖𝑡 𝑖𝑠 𝑎 𝑐𝑜𝑛𝑠𝑡𝑎𝑛𝑡 𝑚 𝑖 =1
→ ∑ (∑ 𝑥 𝑖𝑗
2 𝑚 𝑖 =1
)
𝑛 𝑗 =0
𝜃 𝑗 + ∑ (∑ 𝑥 𝑖𝑗
𝑥 𝑖𝑘
𝑚 𝑖 =1
)𝜃 𝑗 𝜃 𝑘 𝑛 𝑗 ≠𝑘 + ∑ (∑ (−2)𝑥 𝑖𝑗
𝑦 𝑖 𝑚 𝑖 =1
)𝜃 𝑗 + ∑ (
1
2
𝜆 )𝜃 𝑗 𝑛 𝑗 =0
𝑛 𝑗 =0
= ∑ [∑ (𝑥 𝑖𝑗
2
− 2𝑥 𝑖𝑗
𝑦 𝑖 ) +
1
2
𝑚 𝑖 =1
𝜆 ]
𝑛 𝑗 =0
𝜃 𝑗 + ∑ [∑ 𝑥 𝑖𝑗
𝑥 𝑖𝑘
𝑚 𝑖 =1
]𝜃 𝑗 𝜃 𝑘 𝑛 𝑗 ≠𝑘
7.2.2 Time complexity comparison and analysis
L2-MLR can be solved by the normal equation
𝜽 = (𝜆 𝐈 + 𝜽 𝑇 𝜽 )
−1
𝜽 𝑇 𝒚 .
Suppose the 𝜽 is 𝑚 × 𝑛 matrix where m is the sample size and n is feature size. The time
complexity for the matrix multiplication is 𝑂 (𝑛 2
𝑚 ) and for matrix inversion is 𝑂 (𝑛 2.373
) (by
using optimized CW-like algorithms[179]). When the sample size 𝑚 is very huge, the 𝑂 (𝑛 2
𝑚 )
term will dominate. On the other hand, the complexity of quantum computing is 𝑂 (𝑛𝑚 )+
𝑡𝑖𝑚𝑒 𝑠𝑝𝑒𝑛𝑑 𝑏𝑦 𝑄𝑢𝑎𝑛𝑡𝑢𝑚 𝑐𝑜𝑚𝑝𝑢𝑡𝑖𝑛𝑔 . Considering the large dataset experienced in biological
experiences, quantum computing platform can be a potential technology for addressing biological
problems.
96
7.2.3 Binary representation
Issue 1: Due to the constraints of Quantum computer, 𝜽 can only take 0 or 1. However, for our
question, 𝜽 can be positive, negative or floating point numbers.
To address this issue, I introduced the concept of Binary Representation System. Positive,
negative and float point numbers can be represented by Binary System. For example, a
combination of 5 bits can represent a range from -4~3.75 in a unit of 0.25 (Table 6.1). Once
introducing more bits, the precision will increase.
Table 7.1 Binary representation for 5 bits.
97
7.2.4 Reconstruction of MLR hypothesis function
Reconstruct the hypothesis function according to the binary system. Suppose there is a two
features function
ℎ
𝜃 (𝑥 ) = 𝜃 0
𝑥 0
+ 𝜃 1
𝑥 1
The hypothesis function can be reconstructed by
ℎ
𝜃 (𝑥 ) = 𝜎 01
(−2
2
𝑥 0
)+ 𝜎 02
(2
1
𝑥 0
)+ 𝜎 03
(2
0
𝑥 0
)+ 𝜎 04
(2
−1
𝑥 0
)+ 𝜎 05
(2
−2
𝑥 0
)+ 𝜎 11
(−2
2
𝑥 1
)
+ 𝜎 12
(2
1
𝑥 1
)+ 𝜎 13
(2
0
𝑥 1
)+ 𝜎 14
(2
−1
𝑥 1
)+ 𝜎 15
(2
−2
𝑥 1
)
Suppose the optimal result given from Quantum computer is
(𝜎 01
𝜎 02
𝜎 03
𝜎 04
𝜎 05
𝜎 11
𝜎 12
𝜎 13
𝜎 14
𝜎 15
) = (0111010011 )
And then we can recombine the result and get a hypothesis function with positive, negative or
floating point coefficients
ℎ
𝜃 (𝑥 ) = 0(−2
2
𝑥 0
)+ 1(2
1
𝑥 0
)+ 1(2
0
𝑥 0
)+ 1(2
−1
𝑥 0
)+ 0(2
−2
𝑥 0
)+ 1(−2
2
𝑥 1
)+ 0(2
1
𝑥 1
)+ 0(2
0
𝑥 1
)
+ 1(2
−1
𝑥 1
)+ 1(2
−2
𝑥 1
) = 3.5𝑥 0
− 3.25𝑥 1
7.2.5 Embedding of qubit
Issue2: One qubit can only couple with up to six qubits in D-Wave system
The problem that D-Wave system can solve is based on Chimera graph. The connection
number of each qubit is up to six. For my problem, however, feature number is bigger than six in
most of cases. To address this issue, I embedded multiple qubits into one qubit unit according to
our Ising/Qubo problem. After testing, when feature number reaches around 40, finding an
embedding becomes difficult.
98
7.2.6 Rescale
Issue3: Precision of values in hi and Jij are limited
hi value is restricted to the range of -2 to 2; Jij value is restricted to the range of -1 to 1. The
precision of both values are 4-bits. Here, I used “automatic scaling” function on D-Wave to
normalize data.
7.2.7 Experimental setup on D-Wave
Datasets used
Three high-quality gcPBM datasets for the human bHLH TFs, c-Myc (Myc), Mad1 (Mad)
and Max are tested in this project. The data is generated by an experimental protocol of [32]. For
data analysis, due to the limitation of qubit number, I extracted eight-mers binding probe located
in middle of the sequences with length 32.
Encoding of DNA sequence and DNA shape feature
The development of our prediction model is based on feature vectors. Figure 1 shows how
I encoded feature vectors. For each position, the sequence feature at that position was encoded as
4-bits binary vectors. A length L sequence can be encoded as 4L-bit feature vector. The DNA
shape data was generated by using tool DNAshape [14]. Each value for corresponding position
was normalized into a value between 0 and 1. The feature vectors were generated by the tool
DNAfeatures which is available at [7].
Generating vector and matrix for D-Wave
99
After generating of feature vector data for gcPBM experimental data, we splitted data into
ten partitions and generated corresponding vector hj and matrix J ij of training sets according to the
mapping function.
Running programs on D-Wave
For each vector and matrix, performed 1,000 experiments and picked up the solution with
minimum energy state. Replicated the same step for 5 times.
Performance evaluation of prediction model
10-fold cross-validation method was used. Ten round of cross-validation involves
partitioning a sample of data into complementary subset, performing the data analysis on one
subset, training set, and validating the analysis on the other subset, testing set. To reduce variability,
we use ten rounds of cross-validation to perform using different partitions, and the validation
results are average over the rounds.
7.3 Results and discussion
I trained models on the three datasets, Myc, Mad and Max, by four different modeling
strategies listed below (I to V), and compared R
2
and CPU time of modeling based on these
strategies, respectively.
I. MLR running on D-Wave, without precision extension: coefficient 𝜃 𝑗 can only take 0 or 1
II. MLR running on D-Wave, with 1-bit precision extension in significant: coefficient 𝜃 𝑗 can
take 0, 0.5, 1, or 1.5. Due to the limited number of qubits, the precision extension only
covers the first four mers of eight-mers sequences.
100
III. ε-SVR running on a HPCC (classical computer): coefficient 𝜃 𝑗 can be declared as “float”
with 36-bit precision.
IV. MLR running on D-Wave, including a DNA shape feature (minor groove width) with 1-
bit precision extension in significant: only coefficient 𝜃 𝑗 of shape feature can take 0, 0.5, 1
or 1.5, others take 0 or 1.
V. ε-SVR running on HPCC, including a DNA shape feature (minor groove width):
coefficient 𝜃 𝑗 can be declared as “float” with 36-bit precision
Experiment1: Compare squared Pearson correlation coefficient of different modeling strategies
among three datasets
The performances of modeling strategy II. (yellow bars in Figure 7.1A1, B1 and C1) are
all better than modeling strategy I. (orange bars). The reason should be II. has a higher resolution
than I. so that II. can have a better solution even though only 1-bit precision is added. When having
more qubits in the future, we have a high chance to achieve the performance that classical computer
can give (as strategy III. green bars).
101
Figure 7.1 Predictive performance with respect to different modleing strategy.
Experiment2: Compare CPU time of different modeling strategies among three datasets
The CPU time of modeling strategies I. and II. (orange and yellow bars) are significantly
faster than strategy III., which shows high computing power of D-Wave. Remind that comparing
MLR and SVR might not be fair even though running MLR on D-Wave should be still faster than
running it on HPCC. The reason I chose SVR is only because of convenience. We already
developed the SVR for these datasets. For future work, the comparison of MLR on HPCC will be
included.
102
Figure 7.2 Predictive performance with respect to different modleing strategy.
Experiment3: Compare squared Pearson correlation coefficient of the strategies with or without
including DNA shape feature among two datasets, Myc and Mad
By adding DNA shape feature (minor groove width) can improve performance in modeling
strategy V (left blue bar). The improvement is not that obvious in modeling strategy IV. The
phenomena might result from the precision we set still low.
103
Figure 7.3 Predictive performance with respect to different modleing strategy.
7.4 Conclusions and discussion
In this study I have developed a tool as well as a modeling procedure for the mapping of
Ising/Qubo model for particular problems. The analysis demonstrated that designing models with
higher resolution increases predictive performance, even 1-bit precision is solely added. The
higher resolution, the more qubits are required. With the advance of quantum computing
technologies, we can expect the usage of more qubits to handle more complex problem.
104
Chapter 8 Tangential research II – Zebrafish brain imaging
8.1 Introduction
Understanding the insights into brain activity linked to neuronal disorders and normal
human functioning is key to improve human health. Sleep, an important phenomenon in our daily
lives, involves complex brain activities, which are still not fully understood. Probing into the brain
regions, particularly at the single-neuron resolution, can fundamentally contribute to understand
the regulation and function of sleep, as well as sleep disorders and related diseases. Recent
advances in fluorescence imaging have enabled capturing large-scale brain activity over time by
recording calcium dynamics in large numbers of cells. These information-rich datasets offer an
opportunity to connect cellular brain activity to behavior, yet they pose the challenge of developing
a powerful model for mining the patterns of the underlying activities and allowing one to perform
prediction in a systematic manner.
Conventional analysis methods focus on extracting cellular signals from imaging data by
using statistical techniques, for example, principal or independent component analysis (PCA/ICA)
[180]. The requirement of image registration and filter selection may lead to discarding possibly
informative samples as well as introducing potential bias. In addition, these methods can only be
applied to one dataset at a time, thus failing to generate a population-based general model. To
address these issues, we developed a deep learning framework, which extends the connectivity of
a convolutional neural network (CNN) in the time domain to learn local spatio-temporal features.
With the integration of a back propagation-based approach [181, 182], we were able to interpret
the trained model in the context of the known neuroanatomy.
105
8.2 Materials and methods
The experiment involves fluorescence volumetric imaging of the live zebrafish larva at
high-resolution [183]. The experimental setup generated a sample volume (400×500x50 voxels;
16-bit grayscale) in every 3 seconds, and continue the process for 24 hours across the sleep/wake
cycle. The resulting images capture the whole-brain activity pattern of the animal at single-cell
resolution. At the same time, a wide-field camera monitors the tail-twitching event of the animal,
which provides the binary motility metric to define the sleep/wake states, in addition to the
day/night circadian clock.
For the preliminary analysis, we applied our deep convolutional neural network on a single image
slice (selected out of the z-stack of 50 slices) of the volumetric time-course image data with
day/night labeling, and performed back propagation-based methods to highlight the important
regions relevant to brain activities (Figure 8.1). Because of the limited volume of sample size, we
subsampled the image data with different size of time frame (Table 8.1).
For the configurations of convolutional neural networks, we used 3-layer convolutional
layers with 32 3D filters with dimension (3,3,t), (3,3,t) and (7,7,t) (x: image height, y: image width,
t: frame of time), followed one hidden layer. The framework in this study was implemented in
Python 2.7 using Theano [150] 0.8.2 and Keras [151] 0.2.0. We trained and tested our model using
a single NVIDIA tesla K80 GPU.
106
Figure 8.1 Framework of convolutional neuron network for brain imaging data.
Table 8.1 Sample size with different time frame size
Source ID/# frame
1 2 3 4 5 6 8 10
170707
(Day&night labeling)
10680 5340 3560 2670 2136 1780 1335 1068
170917&170918
(Sleep&awake labeling)
19882 9847 6506 4829 3821 3161 2318 1814
8.3 Results and discussion
Model performance
We applied our deep learning method to build a model on the zebrafish brain imaging data
with day/night and awake/sleep labeling. The resulting model successfully predicted
corresponding labels for images in time frame less and equal than 3 seconds on sleep/awake dataset,
and in time frame less and equal than 5 seconds on day/night dataset (Figure 7.2).
107
Figure 8.2 Model performance measurements using data with different time frame.
Model interpretation
Model interpretation was built based on back-propagation based method. The results
functionally identify the optic tectum region (the main visual processor of the brain) (Figure 7.3).
108
Figure 8.3 Visualization of back-propagation of weights.
Using gradient-based back propagation methods we highlight the important regions in the input contributing
toward prediction.
8.4 Conclusions
In this study, we successfully applied 3D convolutional neural framework on massive
zebrafish brain imaging data. This method enables future applications of systematically profiling
more diverse neuronal behaviors. However, volumetric time-course image data require higher
order convolutional process (4D convolution), which is not feasible with current framework. To
address this issue, we will further expand this method with an early-stage fusion process [184] to
enable the pattern discovery on the spatio-temporal data at higher dimension.
109
Supplementary material for Chapter 2 – DNAshapeR
Introduction
DNAshapeR predicts DNA shape features in an ultra-fast, high-throughput manner from
genomic sequencing data. The package takes either nucleotide sequence or genomic intervals as
input, and generates various graphical representations for further analysis. DNAshapeR further
encodes DNA sequence and shape features for statistical learning applications by concatenating
feature matrices with user-defined combinations of k-mer and DNA shape features that can be
readily used as input for machine learning algorithms.
In this vignette, you will learn:
1. how to load/install DNAshapeR
2. how to predict DNA shape features
3. how to visualize DNA shape predictions
4. how to encode sequence and shape features, and apply them
Load DNAshapeR
library(DNAshapeR)
Predict DNA shape features
The core of DNAshapeR, the DNAshape method [14], uses a sliding pentamer window
where structural features unique to each of the 512 distinct pentamers define a vector of minor
groove width (MGW), Roll, propeller twist (ProT), and helix twist (HelT) at each nucleotide
position. MGW and ProT define base-pair parameters whereas Roll and HelT represent base pair-
step parameters. The values for each DNA shape feature as function of its pentamer sequence were
110
derived from all-atom Monte Carlo simulations where DNA structure is sampled in collective and
internal degrees of freedom in combination with explicit sodium counter ions [66]. The Monte
Carlo simulations were analyzed with a modified Curves approach [14]. Average values of each
shape feature for each pentamer were derived from analyzing the ensemble of Monte Carlo
predictions for 2,121 DNA fragments of 12−27 base pairs in length. DNAshapeR predicts the four
DNA shape features MGW, HelT, ProT, and Roll, which were observed in various co-crystal
structures as playing an important role in specific protein−DNA binding.
In the latest version, we further added additional 9 DNA shape features beyond our
previous set of 4 features, and expanded our available repertoire to a total of 13 features, including
6 inter-base pair or base pair-step parameters (HelT, Rise, Roll, Shift, Slide, and Tilt), 6 intra-base
pair or base pair-step parameters (Buckle, Opening, ProT, Shear, Stagger, and Stretch), and MGW.
Predict biophysical feature
Our previous work explained protein-DNA binding specificity based on correlations
between MGW and (electrostatic potential) EP observed in experimentally available structures
[20]. However, A/T and C/G base pairs carry different partial charge distributions in the minor
groove (due primarily to the guanine amino group), which will affect minor-groove EP. We
developed a high-throughput method, named DNAphi [185], to predict minor-groove EP based on
data mining of results from solving the nonlinear Poisson-Boltzmann calculations [80] on 2,297
DNA structures derived from Monte Carlo simulations. DNAshapeR includes EP as an additional
feature.
Predict DNA shape feature due to CpG methylation
111
To achieve a better mechanistic understanding of the effect of CpG methylation on local
DNA structure, we developed a high-throughput method, named methyl-DNAshape, for predicting
the impact of cytosine methylation on DNA shape features. In analogy to the DNAshape method
[14], the method predicts DNA shape features (ProT, HelT, Roll, and MGW) in the context of
CpG methylation based on methyl-DNAshape Pentamer Query Table (mPQT) derived from the
results of all-atom Monte Carlo simulations on a total of 3,518 DNA fragments of lengths varying
from 13 to 24 bp.
DNAshapeR can predict DNA shape features from custom FASTA files or directly from
genomic coordinates in the form of a GRanges object within Bioconductor (see
https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html for more information).
From FASTA file
To predict DNA shape features from a FASTA file
library(DNAshapeR)
fn <- system.file("extdata", "CGRsample.fa", package = "DNAshapeR")
pred <- getShape(fn)
## Reading the input sequence......
## Reading the input sequence......
## Reading the input sequence......
## Reading the input sequence......
## Parsing files......
## Record length: 2000
## Record length: 1999
## Record length: 2000
112
## Record length: 1999
## Done
From genomic intervals (e.g. TFs binding sites, CpG islands, replication origins, …)
To predict DNA shape from genomic intervals stored as GRanges object, a reference
genome is required. Several reference genomes are available within BioConductor as BSgenome
objects (see http://bioconductor.org/packages/release/bioc/html/BSgenome.html for more
information). For example, the sacCer3 release of the S.Cerevisiae genome can be retrieved by
# Install Bioconductor packages
source("http://bioconductor.org/biocLite.R")
biocLite("BSgenome.Scerevisiae.UCSC.sacCer3")
library(BSgenome.Scerevisiae.UCSC.sacCer3)
Given a reference genome, the getFasta function first extracts the DNA sequences based
on the provided genomic coordinates, and then performs shape predictions within a user-defined
window (of size equal to width, 100 bp in the example below) computed from the center of each
genomic interval:
# Create a query GRanges object
gr <- GRanges(seqnames = c("chrI"),
strand = c("+", "-", "+"),
ranges = IRanges(start = c(100, 200, 300), width = 100))
getFasta(gr, Scerevisiae, width = 100, filename = "tmp.fa")
fn <- "tmp.fa"
pred <- getShape(fn)
113
From public domain projects
The genomic intervals can also be obtained from public domain projects, including
ENCODE, NCBI, Ensembl, etc. The AnnotationHub package
(see http://bioconductor.org/packages/release/bioc/html/AnnotationHub.html for more
information) provides an interface to retrieve genomic intervals from these multiple online project
resources.
# Install Bioconductor packages
library(BSgenome.Hsapiens.UCSC.hg19)
library(AnnotationHub)
The genomic intervals of interest can be selected progressively through the functions
of sebset and query with keywords, and can be subjected as an input of GRanges object
to getFasta function.
ah <- AnnotationHub()
ah <- subset(ah, species=="Homo sapiens")
ah <- query(ah, c("H3K4me3", "Gm12878", "Roadmap"))
getFasta(ah[[1]], Hsapiens, width = 150, filename = "tmp.fa")
fn <- "tmp.fa"
pred <- getShape(fn)
From FASTA file with methylated DNA sequence
To predict DNA shape features in the context of CpG methylation, one can prepare a
FASTA file of sequence with symbol ‘Mg’: ‘M’ referring to cytosine of methylated CpG on the
leading strand and ‘g’ referring the cytosine of methylated CpG on lagging strand. For example,
> seq1
114
GTGTCACMgCGTCTATACG
notifying the cytosine at position 8
th
on the leading strand and the one at position 9
th
on the
lagging strand are methylated.
library(DNAshapeR)
fn <- system.file("extdata", "MethylSample.fa", package = "DNAshapeR")
pred <- getShape(fn, methylate = TRUE)
From FASTA and methylated position files
To predict DNA shape features in the context of CpG methylation, in addition to providing
regular FASTA file (without symbolizing ‘Mg’) one can provide an additional input file
identifying methylated positions. For example,
> seq1
4,16
notifying the cytosine at position 4
th
and 16
th
on leading strand, and 5
th
and 17
th
on lagging
strand are methylated.
library(DNAshapeR)
fn <- system.file("extdata", "SingleSeqsample.fa", package = "DNAshapeR")
fn_pos <- system.file("extdata", "MethylSamplePos.fa", package = "DNAshapeR")
pred <- getShape(fn, methylate = TRUE, methylatedPosFile = fn_pos)
Visualize DNA shape prediction
DNAshapeR can be used to generate various graphical representations for further analyses.
The prediction result can be visualized in the form of scatter plots (as introduced in [57], heat maps
(as introduced in [49]), or genome browser tracks (as introduced in [186]).
Ensemble representation: metashape plot
115
The prediction result can be visualized in the metaprofiles of DNA shape features.
plotShape(pred$MGW)
# MGW can be replaced with HelT, Rise, Roll, Shift, Slide, Tilt, Buckle, Opening, ProT, Shear,
Stagger, Stretch or EP
Ensemble representation: heat map
The prediction result can be visualized in the heat map of DNA shape features.
library(fields)
heatShape(pred$ProT, 20)
116
# ProT can be replaced with MGW can be replaced with MGW, Buckle, Opening, ProT, Shear,
Stagger, Stretch or EP
#heatShape(pred$Roll[1:500, 1:1980], 20)
# Roll can be replaced with HelT, Rise, Roll, Shift, Slide or Tilt
Individual representation: genome browser-like tracks
The prediction result can be visualized in the form of genome browser tracks.
*Note that the input data should only contain one sequence.
fn2 <- system.file("extdata", "SingleSeqsample.fa", package = "DNAshapeR")
pred2 <- getShape(fn2)
trackShape(fn2, pred2) # Only for single sequence files
117
Encode sequence and shape features
DNAshapeR can be used to generate feature vectors for a user-defined model. These
models can consist of either sequence features (1-mer, 2-mer, 3-mer), shape features (MGW, HelT,
Rise, Roll, Shift, Slide, Tilt, Buckle, Opening, ProT, Shear, Stagger, Stretch and EP), or any
combination of those two. For 1-mer features, sequence is encoded in form of four binary numbers
118
(i.e., in terms of 1-mers, 1000 for adenine, 0100 for cytosine, 0010 for guanine, and 0001 for
thymine) at each nucleotide position [47]. The feature encoding function of the DNAshapeR
package enables the determination of higher order sequence features, for example, 2-mers and 3-
mers (16 and 64 binary features at each position, respectively).
User can also choose to include second order shape features in the generated feature vector.
The second order shape features are product terms of values for the same category of shape features
(MGW, HelT, Rise, Roll, Shift, Slide, Tilt, Buckle, Opening, ProT, Shear, Stagger, Stretch or EP)
at adjacent positions. They were introduced to encode the tendency of, for instance, a narrow minor
groove region exhibiting an enhanced narrowing if adjacent positions are also characterized by a
narrow groove. The feature encoding function of DNAshapeR enables the generation of any subset
of these features, either only a selected shape category or first order shape features, and any
combination with shape or sequence features. The result of feature encoding for each sequence is
a chimera feature vector.
Encoding process
A feature type vector should be defined before encoding. The vector can be any
combination of characters of “k-mer”, “n-shape”, “n-MGW”, “n-ProT”, “n-Roll”, “n-HelT”, “n-
Rise”, “n-Shift”, “n-Slide”, “n-Tilt”, “n-Buckle”, “n-Opening”, “n-Shear”, “n-Stagger”, “n-
Stretch”, “n-EP” (k, n are integers) where “1-shape” refers to first order and “2-shape” to second
order shape features. Notice that n-shape represents the default 4 shape features (MGW, ProT, Roll
and HelT).
library(Biostrings)
featureType <- c("1-mer", "1-shape")
119
featureVector <- encodeSeqShape(fn, pred, featureType)
featureVector
Showcase of statistical machine learning application
Feature encoding of multiple sequences thus results in a feature matrix, which can be used
as input for variety of statistical machine learning methods. For example, an application is the
quantitative modeling of PBM derived protein-DNA binding by linear regression as demonstrated
below.
First, the experimental binding affinity values are combined with the feature matrix in a
data frame structure.
filename3 <- system.file("extdata", "PBMsample.s", package = "DNAshapeR")
experimentalData <- read.table(filename3)
df <- data.frame(affinity=experimentalData$V1, featureVector)
Then, a machine learning package (which can be any learning tools) is used to train a
multiple linear regression (MLR) model based on 10-fold cross-validation. In this example, we
used the caret package (see http://caret.r-forge.r-project.org/ for more information).
library(caret)
trainControl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
model <- train (affinity~ ., data = df, trControl=trainControl, method="lm", preProcess=NULL)
summary(model)
120
Supplementary material for Chapter 3 – GBshape
Supplementary Table 1.
List of current genome assemblies in GBshape.
Common name Scientific name Assembly
Mammal
Human Homo sapiens Feb. 2009 (GRCh37/hg19)
Chimp Pan troglodytes Feb. 2011 (CSAC 2.1.4/panTro4)
Orangutan Pongo pygmaeus abelii July 2007 (WUGSC 2.0.2/ponAbe2)
Gorilla Gorilla gorilla gorilla May 2011 (gorGor3.1/gorGor3)
Gibbon Nomascus leucogenys Oct. 2012 (GGSC Nleu3.0/nomLeu3)
Rhesus Macaca mulatta Oct. 2010 (BGI CR_1.0/rheMac3)
Ferret Mustela putorius furo Apr. 2011 (MusPutFur1.0/musFur1)
Bushbaby Otolemur garnettii Mar. 2011 (Broad/otoGar3)
Squirrel monkey Saimiri boliviensis Oct. 2011 (Broad/saiBol1)
Marmoset Callithrix jacchus March 2009 (WUGSC 3.2/calJac3)
Baboon Papio anubis Mar. 2012 (Baylor Panu_2.0/papAnu2)
Sloth Choloepus hoffmanni Jul. 2008 (Broad/choHof1)
Cat Felis catus Sep. 2011 (ICGSC Felis_catus 6.2/felCat5)
Tree shrew Tupaia belangeri Dec. 2006 (Broad/tupBel1)
Panda Ailuropoda melanoleuca Dec. 2009 (BGI-Shenzhen 1.0/ailMel1)
Dog Canis lupus familiaris Sep. 2011 (Broad CanFam3.1/canFam3)
Shrew Sorex araneus Aug. 2008 (Broad/sorAra2)
Cow Bos taurus Oct. 2011 (Baylor Btau_4.6.1/bosTau7)
Sheep Ovis aries Aug. 2012 (ISGC Oar_v3.1/oviAri3)
Mouse lemur Microcebus murinus Jul. 2007 (Broad/micMur1)
Kangaroo rat Dipodomys ordii Jul. 2008 (Broad/dipOrd1)
Minke whale Balaenoptera acutorostrata scammoni Oct. 2013 (BalAcu1.0/balAcu1)
Megabat Pteropus vampyrus Jul. 2008 (Broad/pteVam1)
Squirrel Spermophilus tridecemlineatus Nov. 2011 (Broad/speTri2)
Rock hyrax Procavia capensis Jul. 2008 (Broad/proCap1)
Alpaca Vicugna pacos Mar. 2013 (Vicugna_pacos-2.0.1/vicPac2)
Dolphin Tursiops truncatus Oct. 2011 (Baylor Ttru_1.4/turTru2)
Microbat Myotis lucifugus Jul. 2010 (Broad Institute Myoluc2.0/myoLuc2)
Pig Sus scrofa Aug. 2011 (SGSC Sscrofa10.2/susScr3)
Naked mole-rat Heterocephalus glaber Jan. 2012 (Broad HetGla_female_1.0/hetGla2)
Guinea pig Cavia porcellus Feb. 2008 (Broad/cavPor3)
Tarsier Tarsius syrichta Aug. 2008 (Broad/tarSyr1)
Pika Ochotona princeps May 2012 (OchPri3.0/ochPri3)
Horse Equus caballus Sep. 2007 (Broad/equCab2)
Mouse Mus musculus Dec. 2011 (GRCm38/mm10)
Chinese hamster Cricetulus griseus Jul. 2013 (C_griseus_v1.0/criGri1)
White rhinoceros Ceratotherium simum May 2012 (CerSimSim1.0/cerSim1)
Rat Rattus norvegicus Mar. 2012 (RGSC 5.0/rn5)
Armadillo Dasypus novemcinctus Dec. 2011 (Baylor/dasNov3)
Elephant Loxodonta africana Jul. 2009 (Broad/loxAfr3)
Manatee Trichechus manatus latirostris Oct. 2011 (Broad v1.0/triMan1)
Rabbit Oryctolagus cuniculus Apr. 2009 (Broad/oryCun2)
Tenrec Echinops telfairi Nov. 2012 (Broad/echTel2)
Tasmanian devil Sarcophilus harrisii Feb. 2011 (WTSI Devil_ref v7.0/sarHar1)
Hedgehog Erinaceus europaeus May 2012 (EriEur2.0/eriEur2)
Platypus Ornithorhynchus anatinus Mar. 2007 (WUGSC 5.0.1/ornAna1)
Wallaby Macropus eugenii Sep. 2009 (TWGS Meug_1.1/macEug2)
Vertebrate
Chicken Gallus gallus Nov. 2011 (ICGSC Gallus_gallus-4.0/galGal4)
121
Turkey Meleagris gallopavo Dec. 2009 (TGC Turkey_2.01/melGal1)
Budgerigar Melopsittacus undulatus Sep. 2011 (WUSTL v6.3/melUnd1)
Zebra finch Taeniopygia guttata Feb. 2013 (WashU taeGut324/taeGut2)
Medium ground
finch Geospiza fortis Apr. 2012 (GeoFor_1.0/geoFor1)
American alligator Alligator mississippiensis Aug. 2012 (allMis0.2/allMis1)
Elephant shark Callorhinchus milii Dec. 2013 (Callorhinchus_milii-6.1.3/calMil1)
Lizard Anolis carolinensis May 2010 (Broad AnoCar2.0/anoCar2)
Painted turtle Chrysemys picta bellii Dec. 2011 (v3.0.1/chrPic1)
X. tropicalis Xenopus tropicalis Nov. 2009 (JGI 4.2/xenTro3)
Coelacanth Latimeria chalumnae Aug. 2011 (Broad/latCha1)
Nile tilapia Oreochromis niloticus Jan. 2011 (Broad oreNil1.1/oreNil2)
Zebrafish Danio rerio Jul. 2010 (Zv9/danRer7)
Tetraodon Tetraodon nigroviridis Mar. 2007 (Genoscope 8.0/tetNig2)
Medaka Oryzias latipes Oct. 2005 (NIG/UT MEDAKA1/oryLat2)
Fugu Takifugu rubripes Oct. 2011 (FUGU5/fr3)
Stickleback Gasterosteus aculeatus Feb. 2006 (Broad/gasAcu1)
Atlantic cod Gadus morhua May 2010 (Genofisk GadMor_May2010/gadMor1)
Lamprey Petromyzon marinus Sep. 2010 (WUGSC 7.0/petMar2)
Deuterostome
Lancelet Branchiostoma floridae Mar. 2006 (JGI 1.0/braFlo1)
C. intestinalis Ciona intestinalis Mar. 2005 (JGI 2.1/ci2)
S. purpuratus Strongylocentrotus purpuratus Sep. 2006 (Baylor 2.1/strPur2)
Insect
D. melanogaster Drosophila melanogaster Apr. 2006 (BDGP R5/dm3)
D. simulans Drosophila simulans Apr. 2005 (WUGSC mosaic 1.0/droSim1)
D. sechellia Drosophila sechellia Oct. 2005 (Broad/droSec1)
D. erecta Drosophila erecta Aug. 2005 (Agencourt prelim/droEre1)
D. yakuba Drosophila yakuba Nov. 2005 (WUGSC 7.1/droYak2)
D. ananassae Drosophila ananassae Aug. 2005 (Agencourt prelim/droAna2)
D. pseudoobscura Drosophila pseudoobscura Nov. 2004 (FlyBase 1.03/dp3)
D. pseudoobscura Drosophila pseudoobscura Feb. 2006 (Baylor CAF1/dp4)
D. persimilis Drosophila persimilis Oct. 2005 (Broad/droPer1)
D. virilis Drosophila virilis Aug. 2005 (Agencourt prelim/droVir2)
D. mojavensis Drosophila mojavensis Aug. 2005 (Agencourt prelim/droMoj2)
D. grimshawi Drosophila grimshawi Aug. 2005 (Agencourt prelim/droGri1)
A. mellifera Apis mellifera Jan. 2005 (Baylor 2.0/apiMel2)
A. gambiae Anopheles gambiae Feb. 2003 (IAGEC MOZ2/anoGam1)
Nematode
C. elegans Caenorhabditis elegans Oct. 2010 (WS220/ce10)
C. briggsae Caenorhabditis briggsae Jan. 2007 (WUGSC 1.0/cb3)
C. remanei Caenorhabditis remanei May 2007 (WUGSC 15.0.1/caeRem3)
C. brenneri Caenorhabditis brenneri Feb. 2008 (WUGSC 6.0.1/caePb2)
C. japonica Caenorhabditis japonica Mar. 2008 (WUGSC 3.0.2/caeJap1)
P. pacificus Pristionchus pacificus Feb. 2007 (WUGSC 5.0/priPac1)
Other
S. cerevisiae Saccharomyces cerevisiae Apr. 2011 (SacCer_Apr2011/sacCer3)
Sea hare Aplysia californica Sept. 2008 (Broad 2.0/aplCal1)
P. falciparum Plasmodium falciparum Jun. 2007 (PlaFal_5.5/plaFal1)
122
Supplementary Table 2.
List of current assemblies supported by BLAT sequence search in GBshape.
Common name Scientific name Assembly
Mammal
Mouse Mus musculus Dec. 2011 (GRCm38/mm10)
Rat Rattus norvegicus Mar. 2012 (RGSC 5.0/rn5)
Insect
D. melanogaster Drosophila melanogaster Apr. 2006 (BDGP R5/dm3)
D. simulans Drosophila simulans
Apr. 2005 (WUGSC mosaic
1.0/droSim1)
D. sechellia Drosophila sechellia Oct. 2005 (Broad/droSec1)
D. pseudoobscura Drosophila pseudoobscura Feb. 2006 (Baylor CAF1/dp4)
Nematode
C. elegans Caenorhabditis elegans Oct. 2010 (WS220/ce10)
Other
S. cerevisiae Saccharomyces cerevisiae Apr. 2011 (SacCer_Apr2011/sacCer3)
123
Supplementary materials for Chapter 4 – DNAphi
Non-linear Poisson–Boltzmann equation (NLPB)
The NLPB equation is widely used to calculate electrostatic interactions in ionic solutions
[80]. The equation reads
The equation reads
∇ ∙ [𝜀 (𝑟⃑)∇ ∙ 𝜙 (𝑟⃑)]− 𝜀 (𝑟⃑)𝜅 (𝑟⃑)
2
sinh[𝜙 (𝑟⃑)]+ 4π𝜌 𝑓 (𝑟⃑)/kT = 0,
where 𝜙 (𝑟⃑) is the electrostatic potential (EP) at any location 𝑟⃑. The unit of 𝜙 (𝑟⃑) is kT/e,
where k is the Boltzmann constant, T is the temperature, and e is the elementary charge. ε(𝑟⃑) is the
dielectric constant in the solute or solution. The term 𝜅 2
= 1/𝜆 2
= 8πq
2
𝐼 /ekT, where 𝐼 is the
ionic strength (moles/L) of the bulk solution. 𝜌 𝑓 is the fixed charge density. The variables 𝜙 ,𝜀 ,𝜅
and 𝜌 are functions of the position vector 𝑟⃑ . The linearization sinh𝜙 (𝑟⃑) = 𝜙 (𝑟⃑) for small
potentials leads to the linear Poisson-Boltzmann (LPB) equation in which contributions from
different chemical groups are additive [80].
DelPhi – a finite-difference method for solving the NLPB equation
The DelPhi program [81, 187] utilizes a finite-difference method to solve the NLPB
equation for biomolecules in aqueous solution. The method maps the charge density and dielectric
constant onto a three-dimensional cubic grid and solves the NLPB equation iteratively. The
program allows users to specify ionic strength and dielectric constants of the solute and solvent.
In addition, users can assign charges to the solute and measure the particular charge effects on the
molecule in terms of EP.
Generation of pentamer query table for EP prediction
124
To compile the pentamer query table, we generated a large training dataset of all-atom
Monte Carlo (MC) predictions for 2,297 different DNA fragments ranging 12 to 27 base pairs (bp)
in length. We performed NLPB calculations to profile EP on these average structures using the
DelPhi program [186]. For each bp, we derived the EP at the midpoint of the minor groove and at
26 points that were equally distributed on a sphere with 1 Å radius surrounding this midpoint, with
a total of 27 EP values for a sphere [98] (Figure 3C). After filtering out extreme EP values (> 0 or
< –20 kT/e), we calculated the mean and standard deviation (SD) for the remaining EP values. We
assigned an average value to the sphere if its SD was < 3 kT/e, thereby excluding mean values
with large EP fluctuations. As the sphere lies in the approximate center of the minor groove, EP
can be defined as a function of sequence, with one value per bp.
After mapping EP as a function of sequence for 2,297 DNA fragments, we applied a
pentameric sliding-window approach to each fragment. EP values at the central bp in a pentamer
were recorded for the 512 unique pentamers. To remove outlier effects, we kept 80% of the data
points and removed the extreme 10% of data points from the head and tail end of the data. Using
this filtering approach, we generated a query table of average values for each occurrence of 512
possible pentamers in our dataset. The average occurrence of possible pentamers was 45.2 with a
SD of 0.3. This query table was integrated in a sliding-window approach for high-throughput (HT)
minor-groove EP prediction, similarly to our previous approach for DNA shape [14].
High-throughput prediction of EP
We provide a stand-alone web server (DNAphi; http://rohslab.usc.edu/DNAphi) for HT
prediction of EP in the minor groove. Input data are nucleotide sequences in FASTA format, which
can be pasted into a web form or uploaded as a separate file. The function ‘φ Prediction’ allows
users to perform predictions and view the results in a graphical representation that can be
125
downloaded as a quantitative data table for further analysis. The function ‘φ Learning’ requires
the binding strength per given sequence in a user-defined unit as additional input data or response
variable. The statistical machine-learning (ML) approach of L2-regularized multiple linear
regression (MLR) was applied to this function [1]. We also integrated the HT EP prediction
function into DNAshapeR [16], our R/Bioconductor package
(https://www.bioconductor.org/packages/devel/bioc/html/DNAshapeR.html). This package
provides an easy-to-use and easy-to-extend interface that can be readily integrated into other HT
genomic analysis platforms [16].
Fis binding site data
We used Fis-DNA binding site data for the eight sequences of F1, F24, F25, F26, F27, F28,
F29 and F36 [84, 86], which exhibit Fis-binding affinities differing by three orders of magnitude.
Seven of these sites only differ in their five central bp (colored red in Supplementary Table S1).
F36 has one additional bp that differs in the flank of the five central bp (bold in Supplementary
Table S1). Sequences with their logarithms of binding affinity Kd are listed in Supplementary Table
S1. Analysis of additional Fis binding sites is shown in Supplementary Figure S5.
HT-SELEX data for 215 mammalian transcription factors (TFs)
We used HT-SELEX data for 215 TFs from 27 protein families originally published by
Jolma et al. [34] and recently re-sequenced with an on average 10-fold increase in sequencing
depth, resulting in more accurate binding models [36]. Sequencing data are available at the
European Nucleotide Archive (http://www.ebi.ac.uk/ena) under study identifier PRJEB14744.
Data were pre-processed as described in Yang et al. [82] and are available at
http://rohslab.usc.edu/MSB2017/.
126
SELEX-seq data for Drosophila Hox proteins
We used experimental data for 21 Exd-Hox heterodimers derived from binding assays
followed by deep sequencing (SELEX-seq) [1]. Data included the anterior and posterior Hox
proteins, the Scr mutants containing mutated Arg3, His–12 or Arg5 and the Antp mutants in which
minor groove-contacting residues from Scr were engineered into the Scr linker, all in complex
with the cofactor Exd. All sequences selected in SELEX-seq experiments with a count ≥ 25 were
aligned based on the core motif 5’-TGAYNNAY-3’, where N can be any nucleotide and Y
represents C or T. Raw data can be downloaded from the Gene Expression Omnibus (GEO;
https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE65073 [1]. Sequences with
multiple occurrences of the core motif were removed from this analysis.
gcPBM data for human basic helix-loop-helix (bHLH) proteins
We used experimental data for three bHLH dimers Mad1/Max (‘Mad’), Max/Max (‘Max’)
and c-Myc/Max (‘Myc’) derived from gcPBM experiments [47]. Data contained 36-bp genomic
sequences centered at a putative E-box binding site. The number of sequences for Mad was 6,927,
for Max was 8,569 and for Myc was 7,535. Data can be downloaded from GEO under accession
number GSE59845 [47].
Feature vector encoding
Given a DNA sequence s of length l, the corresponding feature vector can be derived from
the following basis functions, in a similar way to those introduced by Zhou et al. [47] and Yang et
al. [82]. The basis function for EP features is:
127
𝛺 𝑖 𝐸𝑃
(𝑠 ) = {
𝐸𝑃
𝑖 ,if 𝑠 (𝑖 )
∈ {A,C,G,T}
𝐸𝑃
𝑎𝑣𝑔 ,otherwise
, 𝑖 = 3,4,…,𝑙 − 2
where 𝐸𝑃
𝑎𝑣𝑔 is the average value of EP over all possible pentamers. Basis functions for
mononucleotide sequence features are:
𝛺 4×(𝑖 −1)+1
𝑠 𝑒 𝑞 (s) = {
0,if 𝑠 (𝑖 )
≠ A
1,if 𝑠 (𝑖 )
= A
,𝑖 = 1,⋯,𝑙
𝛺 4×(𝑖 −1)+2
𝑠𝑒𝑞 (𝑠 ) = {
0,if 𝑠 (𝑖 )
≠ C
1,if 𝑠 (𝑖 )
= C
,𝑖 = 1,⋯,𝑙
𝛺 4×(𝑖 −1)+3
𝑠𝑒𝑞 (𝑠 ) = {
0,if 𝑠 (𝑖 )
≠ G
1,if 𝑠 (𝑖 )
= G
,𝑖 = 1,⋯,𝑙
𝛺 4×𝑖 𝑠𝑒𝑞 (𝑠 ) = {
0,if 𝑠 (𝑖 )
≠ T
1,if 𝑠 (𝑖 )
= T
,𝑖 = 1,⋯,𝑙
where 𝑖 is the position of the given sequence. Basis functions for the DNA shape features minor-
groove width (MGW), propeller twist (ProT), Roll and helix twist (HelT) are:
𝛺 𝑖 𝑀𝐺𝑊
(𝑠 ) = {
𝑀𝐺𝑊 𝑖 ,if 𝑠 (𝑖 )
∈ {A,C,G,T}
𝑀𝐺𝑊 𝑎𝑣𝑔 ,otherwise
, 𝑖 = 3,4,…,𝑙 − 2
𝛺 𝑖 𝑃𝑟𝑜𝑇 (𝑠 ) = {
𝑃𝑟𝑜𝑇 𝑖 ,if 𝑠 (𝑖 )
∈ {A,C,G,T}
𝑃𝑟𝑜𝑇 𝑎𝑣𝑔 ,otherwise
, 𝑖 = 3,4,…,𝑙 − 2
𝛺 𝑖 𝑅𝑜𝑙𝑙 (𝑠 ) = {
𝑅𝑜𝑙𝑙 𝑖 ,if 𝑠 (𝑖 )
∈ {A,C,G,T}
𝑅𝑜𝑙𝑙 𝑎𝑣𝑔 ,otherwise
, 𝑖 = 2,3,…,𝑙 − 1
𝛺 𝑖 𝐻𝑒𝑙𝑇 (𝑠 ) = {
𝐻𝑒𝑙𝑇 𝑖 ,if 𝑠 (𝑖 )
∈ {A,C,G,T}
𝐻𝑒𝑙𝑇 𝑎𝑣𝑔 ,otherwise
, 𝑖 = 2,3,…, 𝑙 − 1
where 𝑀𝐺𝑊 𝑎𝑣𝑔 , 𝑃𝑟𝑜𝑇 𝑎𝑣𝑔 , 𝑅𝑜𝑙𝑙 𝑎𝑣𝑔 and 𝐻𝑒𝑙𝑇 𝑎𝑣𝑔 are average values with respect to MGW, ProT,
Roll and HelT, respectively, over all possible pentamers.
128
EP and DNA shape features, denoted 𝛺 𝑖 𝐸𝑃
, 𝛺 𝑖 𝑀𝐺𝑊
, 𝛺 𝑖 𝑃𝑟𝑜𝑇 , 𝛺 𝑖 𝑅𝑜𝑙𝑙 and 𝛺 𝑖 𝐻𝑒𝑙𝑇 , respectively,
were generated and normalized by DNAshapeR [16]. Normalization was performed by:
𝛺 𝑖 𝐸𝑃
(s) = (𝐸𝑃
𝑖 − 𝐸𝑃
𝑚𝑖𝑛 )/(𝐸𝑃
𝑚𝑎𝑥
− 𝐸𝑃
𝑚𝑖𝑛 )
where 𝐸𝑃
𝑖 is the predicted EP value, 𝐸𝑃
𝑚𝑖𝑛 is the minimum EP value and 𝐸𝑃
𝑚𝑎𝑥 is the maximum
EP value over all possible pentamers. Similarly, normalization for DNA shape features was
performed by using:
𝛺 𝑖 𝑀𝐺𝑊
(𝑠 ) = (𝑀𝐺𝑊 𝑖 − 𝑀𝐺𝑊 𝑚𝑖𝑛 )/(𝑀𝐺𝑊 𝑚𝑎𝑥
− 𝑀𝐺𝑊 𝑚𝑖𝑛 )
𝛺 𝑖 𝑃𝑟𝑜𝑇 (𝑠 ) = (𝑃𝑟𝑜𝑇 𝑖 − 𝑃𝑟𝑜𝑇 𝑚𝑖𝑛 )/(𝑃𝑟𝑜𝑇 𝑚𝑎𝑥
− 𝑃𝑟𝑜𝑇 𝑚𝑖𝑛 )
𝛺 𝑖 𝑅𝑜𝑙𝑙 (𝑠 ) = (𝑅𝑜𝑙𝑙 𝑖 − 𝑅𝑜𝑙𝑙 𝑚𝑖𝑛 )/(𝑅𝑜𝑙𝑙 𝑚𝑎𝑥
− 𝑅𝑜𝑙𝑙 𝑚𝑖𝑛 )
𝛺 𝑖 𝐻𝑒𝑙𝑇 (𝑠 ) = (𝐻𝑒𝑙𝑇 𝑖 − 𝐻𝑒𝑙𝑇 𝑚𝑖𝑛 )/(𝐻𝑒𝑙𝑇 𝑚𝑎𝑥
− 𝐻𝑒𝑙𝑇 𝑚𝑖𝑛 )
The complete feature vector for the given sequence can be obtained by concatenating the six
numeric vectors:
𝛺 (𝑥 ) = [𝛺 𝐸𝑃
(𝑠 ),𝛺 𝑠𝑒𝑞 (𝑠 ),𝛺 𝑀𝐺𝑊
(𝑠 ),𝛺 𝑃𝑟𝑜𝑇 (𝑠 ),𝛺 𝑅𝑜𝑙𝑙 (𝑠 ),𝛺 𝐻𝑒𝑙𝑇 (𝑠 )]
T
Statistical tests and significance levels
We applied t-test hypothesis testing to determine whether there was a significant difference
between two experimental groups. For the predictive-power comparison in quantitative modeling
(Figure 7, Supplementary Figures S8 and S11), we assumed a performance increase in terms of R
2
for the augmented models as the alternative hypothesis. For the comparison of EP-based models
for gcPBM datasets (Supplementary Figure S10), we assumed for the alternative hypothesis that
the EP preferences at a particular position were different for two datasets. The calculated p-value
for the hypothesis test represents the probability of mistakenly rejecting the null hypothesis if the
129
null hypothesis is true. The significance level α is the standard value for which a p-value ≤ α is
considered statistically significant. Typical values for α are 0.1 (*), 0.05 (**), 0.01 (***) and 0.001
(****). For example, the p-value 1.549×10
-34
(Figure 7B) indicates that the probability of making
a mistake by rejecting a true null hypothesis (i.e. probability that there is no improvement achieved
by an EP-augmented model) is 1.549×10
-34
, which falls below the significance level of 0.001 and
is therefore considered highly statistically significant.
130
SUPPLEMENTARY FIGURES
Supplementary Figure S1. Validation of DNAphi predictions at TF-DNA binding sites. Minor-groove EPs of
binding sites of (A) Exd-Ubx heterodimer (PDB ID 1B8I) [104], (B) Tc3 transposase (PDB ID 1V78) [105], (C)
131
Exd-Scr heterodimer (PDB ID 2R5Z) [20] and (D) MATα2-MCM1 (PDB ID 1MNM) [106], whose binding
interfaces include arginine residues inserted into the minor groove, were predicted by using DNAphi (blue) and
DelPhi (red). Pearson Correlation Coefficients (PCCs) demonstrate the statistical similarity between EP profiles
derived by these two approaches. We highlighted the more negative minor-groove EP values (≤ –6.505 kT/e)
predicted by DNAphi by underlining respective positions on the x-axis. Corresponding spheres defined by DNAphi
are represented by spheres in each structure, with red indicating below-average EP values ≤ –6.505 kT/e and pink
indicating EP values > –6.505 kT/e. Protein residues that form minor-groove contacts as defined by DNAproDB
[83] are shown in each structure.
132
Supplementary Figure S2. Validation of DNAphi predictions using TF-DNA binding sites. Minor-groove EPs of
binding sites of (A) PhoB response regulator (PDB ID 1GXP) [104], (B) OhrR regulator (PDB ID 1Z9C) [105], (C)
MogR (PDB ID 3FDQ) and (D) IRF-3 (PDB ID 1T2K) [106], whose binding interfaces include arginine residues
133
inserted into the minor groove, were predicted by using DNAphi (blue) and DelPhi (red). PCCs demonstrate the
statistical similarity between EP profiles derived from these two approaches. We highlighted the more negative
minor-groove EP values (≤ –6.505 kT/e) predicted by DNAphi by underlining respective positions on the x-axis.
Corresponding spheres defined by DNAphi are represented by spheres in each structure, with red indicating below-
average EP values ≤ –6.505 kT/e and pink indicating EP values > –6.505 kT/e. Protein residues that form minor-
groove contacts as defined by DNAproDB [83] are shown in each structure
Supplementary Figure S3. Comparison of minor-groove EP predictions using DelPhi with different parameter
settings. (A-D) EPs for DNA binding sites of four proteins (shown in Figure 3) were predicted by using DelPhi with
a grid size of 165 and five focusing steps (red) vs. a grid size of 501 and three focusing steps (green). PCCs
demonstrate the statistical similarity between EP profiles derived from DelPhi with varying parameter sets.
134
Supplementary Figure S4. EP distributions of 512 unique pentamers based on LPB calculations. (A) Distributions
for complete nucleotides based on LPB calculations. Patterns of distributions are similar to those derived with the
NLPB (Figure 5C). (B-D) Distributions based on partial charges of the (B) bases, (C) phosphate groups or (D) sugar
moieties.
135
136
137
Supplementary Figure S5. HT predictions of MGW and EP for the mutated regions with respect to the high-affinity
Fis binding site F1. The reference Fis-F1 co-crystal structure (PDB ID 3IV5) and quantitative Fis binding site logo
[86] are shown in (A). The altered nucleotide positions include (B) mutations introducing asymmetry in the
structure, (C) mutations introducing a YpR bp step at the ±(4-5) positions (F18) or ±(5-6) positions (F31) or
eliminating the YpR bp steps (F32), mutations substituting bp flanking the core at the ±8 positions (D), ±9 positions
(E), ±10 positions (F) or further outside the core binding site (G) and (H) mutations inverting AT-rich flanking
sequences. We highlighted mutated regions in the respective table (orange shaded columns). HT predictions of
MGW and EP within the mutated regions of the Fis binding sites correlate with the logarithm of binding affinity
log(Kd). Correlation coefficients R2 between the logarithm of Kd and EP (blue) or MGW (red) were calculated for
all Fis binding sites.
Supplementary Figure S6. HT predictions of EP based on the partial charge of (A) bases and (B) phosphates over
the five central bp of eight Fis binding sites correlated with the logarithm of binding affinity K d. Contribution from
the partial base charges has a higher correlation with the logarithm of binding affinity than the contribution from
phosphate groups, particularly when including the sequence with a central TpA bp step.
138
Supplementary Figure S7. Performance comparison of binding specificity predictions for multiple TFs derived from
HT-SELEX, SELEX-seq and gcPBM HT binding assays. (A) Comparison of sequence+shape+EP and
sequence+shape models. (B) Comparison of sequence+3shapes+EP models with three shape features (HelT, ProT
and Roll) and sequence+shape models with four shape features (HelT, ProT, Roll and MGW). (C) Comparison of
sequence+EP and sequence+MGW models. (D) Comparison of EP and MGW models. The p-values were calculated
by the t-test hypothesis testing method with performance increase in terms of R
2
as the alternative hypothesis.
139
Supplementary Figure S8. Box plot analysis for quantifying performance gain of EP-augmented models across TF
families. Sequence+shape+EP models were compared with sequence+shape models for 239 TFs. Most of the
sequence+shape+EP models outperform sequence+shape models (225 of 239 tested proteins). The p-values were
calculated by using the t-test hypothesis testing method with performance increase in terms of R
2
as the alternative
hypothesis. The p-value for each TF dataset (TF family/HT binding experiment) is shown below the corresponding
box plot.
140
Supplementary Figure S9. Minor-groove EP preferences of human bHLH TFs. (A) Heat maps illustrate minor-
groove EP preferences of the Mad1/Max heterodimer (‘Mad’), Max/Max homodimer (‘Max’) and c-Myc/Max
heterodimer (‘Myc’). Sequence data were derived from gcPBM experiments [113] using the top-25% signal
intensities. (B) Nucleotide positions with significant EP differences based on a t-test are indicated for comparisons
of Mad vs. Max, Mad vs. Myc, and Max vs. Myc (nucleotide positions with different minor-groove EP distributions
are shown in different colors for p-value). EP features were symmetrized based on the palindromic E-box, which is
located at the central positions –3 to +3 marked by a black frame.
141
Supplementary Figure S10. Performance comparison of binding specificity predictions for TFs derived from HT-
SELEX, SELEX-seq and gcPBM binding assays based on LPB calculations (for bases and phosphate groups).
(A+C) Sequence+EP models outperform sequence-only models and contribute to the prediction accuracy of DNA
binding specificities based on L2-regularized MLR and 10-fold cross-validation. (B+D) Sequence+shape+EP
models outperform sequence+MGW models. The p-values were calculated by the t-test hypothesis testing method
with performance increase in terms of R2 as the alternative hypothesis.
142
SUPPLEMENTARY TABLE
Supplementary Table S1. Eight sequences and logarithms of binding affinities of Fis-DNA sites. Co-crystal
structures were solved for six of these binding sites [85, 86].
Label PDB ID Sequence log(K d)
F1 3IV5
AAATTTGTTTGAATTTTGAGCAAATTT
-0.69897
F24 3JRB
AAATTTGTTTGTTTTTTGAGCAAATTT
-0.30103
F25 3JRD
AAATTTGTTTGTTAAATGAGCAAATTT
0
F26 3JRE
AAATTTGTTTGAAAAATGAGCAAATTT
-0.30103
F27 3JRF
AAATTTGTTTGAACTTTGAGCAAATTT
-0.22185
F28
AAATTTGTTTGAGCGTTGAGCAAATTT
1.748188
F29 3JRC
AAATTTGTTTGGGCGCTGAGCAAATTT
2.146128
F36
AAATTTGTTTGAATCTCGAGCAAATTT
0.477121
143
References
1. Abe N, Dror I, Yang L, Slattery M, Zhou T, Bussemaker HJ, Rohs R, Mann RS: Deconvolving
the recognition of DNA shape from sequence. Cell 2015, 161:307-318.
2. Rohs R, Jin X, West SM, Joshi R, Honig B, Mann RS: Origins of specificity in protein-DNA
recognition. Annu Rev Biochem 2010, 79:233-269.
3. Slattery M, Zhou T, Yang L, Dantas Machado AC, Gordan R, Rohs R: Absence of a simple
code: how transcription factors read the genome. Trends Biochem Sci 2014, 39:381-399.
4. Garvie CW, Wolberger C: Recognition of specific DNA sequences. Mol Cell 2001, 8:937-946.
5. Stormo GD, Zhao Y: Determining the specificity of protein-DNA interactions. Nat Rev Genet
2010, 11:751-760.
6. Stormo GD: Modeling the specificity of protein-DNA interactions. Quant Biol 2013, 1:115-
130.
7. Seeman NC, Rosenberg JM, Rich A: Sequence-specific recognition of double helical nucleic
acids by proteins. Proc Natl Acad Sci U S A 1976, 73:804-808.
8. Kim S, Brostromer E, Xing D, Jin J, Chong S, Ge H, Wang S, Gu C, Yang L, Gao YQ, et al:
Probing allostery through DNA. Science 2013, 339:816-819.
9. Watson LC, Kuchenbecker KM, Schiller BJ, Gross JD, Pufall MA, Yamamoto KR: The
glucocorticoid receptor dimer interface allosterically transmits sequence-specific DNA
signals. Nat Struct Mol Biol 2013, 20:876-883.
10. Kitayner M, Rozenberg H, Rohs R, Suad O, Rabinovich D, Honig B, Shakked Z: Diversity in
DNA recognition by p53 revealed by crystal structures with Hoogsteen base pairs. Nat Struct
Mol Biol 2010, 17:423-429.
11. Harrison SC, Aggarwal AK: DNA recognition by proteins with the helix-turn-helix motif.
Annu Rev Biochem 1990, 59:933-969.
12. Aggarwal AK, Rodgers DW, Drottar M, Ptashne M, Harrison SC: Recognition of a DNA
operator by the repressor of phage 434: a view at high resolution. Science 1988, 242:899-
907.
13. Wolberger C, Dong YC, Ptashne M, Harrison SC: Structure of a phage 434 Cro/DNA complex.
Nature 1988, 335:789-795.
14. Zhou T, Yang L, Lu Y, Dror I, Dantas Machado AC, Ghane T, Di Felice R, Rohs R: DNAshape:
a method for the high-throughput prediction of DNA structural features on a genomic scale.
Nucleic Acids Res 2013, 41:W56-62.
15. Li J, Sagendorf JM, Chiu TP, Pasi M, Perez A, Rohs R: Expanding the repertoire of DNA
shape features for genome-scale studies of transcription factor binding. Nucleic Acids Res
2017, 45:12877-12887.
16. Chiu TP, Comoglio F, Zhou T, Yang L, Paro R, Rohs R: DNAshapeR: an R/Bioconductor
package for DNA shape prediction and feature encoding. Bioinformatics 2016, 32:1211-1213.
17. Rohs R, West SM, Sosinsky A, Liu P, Mann RS, Honig B: The role of DNA shape in protein-
DNA recognition. Nature 2009, 461:1248-1253.
18. Deng Z, Wang Q, Liu Z, Zhang M, Machado AC, Chiu TP, Feng C, Zhang Q, Yu L, Qi L, et al:
Mechanistic insights into metal ion activation and operator recognition by the ferric uptake
regulator. Nat Commun 2015, 6:7642.
19. Chang YP, Xu M, Machado AC, Yu XJ, Rohs R, Chen XS: Mechanism of origin DNA
recognition and assembly of an initiator-helicase complex by SV40 large tumor antigen. Cell
Rep 2013, 3:1117-1127.
20. Joshi R, Passner JM, Rohs R, Jain R, Sosinsky A, Crickmore MA, Jacob V, Aggarwal AK, Honig
B, Mann RS: Functional specificity of a Hox protein mediated by the recognition of minor
groove structure. Cell 2007, 131:530-543.
144
21. Lavery R, Sklenar H: Defining the structure of irregular nucleic acids: conventions and
principles. J Biomol Struct Dyn 1989, 6:655-667.
22. Shimizu T, Toumoto A, Ihara K, Shimizu M, Kyogoku Y, Ogawa N, Oshima Y, Hakoshima T:
Crystal structure of PHO4 bHLH domain-DNA complex: flanking base recognition. EMBO
J 1997, 16:4689-4697.
23. Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT,
Thompson CM, Simon I, et al: Transcriptional regulatory networks in Saccharomyces
cerevisiae. Science 2002, 298:799-804.
24. Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide mapping of in vivo protein-
DNA interactions. Science 2007, 316:1497-1502.
25. Consortium EP: An integrated encyclopedia of DNA elements in the human genome. Nature
2012, 489:57-74.
26. Rhee HS, Pugh BF: Comprehensive genome-wide protein-DNA interactions detected at
single-nucleotide resolution. Cell 2011, 147:1408-1419.
27. Rhee HS, Pugh BF: ChIP-exo method for identifying genomic location of DNA-binding
proteins with near-single-nucleotide accuracy. Curr Protoc Mol Biol 2012, Chapter 21:Unit
21 24.
28. He Q, Johnston J, Zeitlinger J: ChIP-nexus enables improved detection of in vivo
transcription factor binding footprints. Nat Biotechnol 2015, 33:395-401.
29. Skene PJ, Henikoff S: An efficient targeted nuclease strategy for high-resolution mapping of
DNA binding sites. Elife 2017, 6.
30. Berger MF, Philippakis AA, Qureshi AM, He FS, Estep PW, 3rd, Bulyk ML: Compact,
universal DNA microarrays to comprehensively determine transcription-factor binding site
specificities. Nat Biotechnol 2006, 24:1429-1435.
31. Warren CL, Kratochvil NC, Hauschild KE, Foister S, Brezinski ML, Dervan PB, Phillips GN, Jr.,
Ansari AZ: Defining the sequence-recognition profile of DNA-binding molecules. Proc Natl
Acad Sci U S A 2006, 103:867-872.
32. Gordan R, Shen N, Dror I, Zhou T, Horton J, Rohs R, Bulyk ML: Genomic regions flanking E-
box binding sites influence DNA binding specificity of bHLH transcription factors through
DNA shape. Cell Rep 2013, 3:1093-1104.
33. Slattery M, Riley T, Liu P, Abe N, Gomez-Alcala P, Dror I, Zhou T, Rohs R, Honig B,
Bussemaker HJ, Mann RS: Cofactor binding evokes latent differences in DNA binding
specificity between Hox proteins. Cell 2011, 147:1270-1282.
34. Jolma A, Kivioja T, Toivonen J, Cheng L, Wei G, Enge M, Taipale M, Vaquerizas JM, Yan J,
Sillanpaa MJ, et al: Multiplexed massively parallel SELEX for characterization of human
transcription factor binding specificities. Genome Res 2010, 20:861-873.
35. Zhao Y, Granas D, Stormo GD: Inferring binding energies from selected binding sites. PLoS
Comput Biol 2009, 5:e1000590.
36. Jolma A, Yan J, Whitington T, Toivonen J, Nitta KR, Rastas P, Morgunova E, Enge M, Taipale
M, Wei G, et al: DNA-binding specificities of human transcription factors. Cell 2013,
152:327-339.
37. Stormo GD: DNA binding sites: representation and discovery. Bioinformatics 2000, 16:16-23.
38. Benos PV, Bulyk ML, Stormo GD: Additivity in protein-DNA interactions: how good an
approximation is it? Nucleic Acids Res 2002, 30:4442-4451.
39. Eggeling R, Roos T, Myllymaki P, Grosse I: Inferring intra-motif dependencies of DNA
binding sites from ChIP-seq data. BMC Bioinformatics 2015, 16:375.
40. Zhao Y, Ruan S, Pandey M, Stormo GD: Improved Models for Transcription Factor Binding
Site Identification Using Nonindependent Interactions. Genetics 2012, 191:781-790.
41. Siddharthan R: Dinucleotide weight matrices for predicting transcription factor binding
sites: generalizing the position weight matrix. PLoS One 2010, 5:e9722.
145
42. Kahara J, Lahdesmaki H: Evaluating a linear k-mer model for protein-DNA interactions
using high-throughput SELEX data. BMC Bioinformatics 2013, 14 Suppl 10:S2.
43. Annala M, Laurila K, Lahdesmaki H, Nykter M: A linear model for transcription factor
binding affinity prediction in protein binding microarrays. PLoS One 2011, 6:e20059.
44. Bulyk ML, Johnson PL, Church GM: Nucleotides of transcription factor binding sites exert
interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res
2002, 30:1255-1261.
45. Mathelier A, Wasserman WW: The next generation of transcription factor binding site
prediction. PLoS Comput Biol 2013, 9:e1003214.
46. Barash Y, Elidan G, Friedman N, Kaplan T: Modeling dependencies in protein-DNA binding
sites. In Proceedings of the Annual International Conference on Computational Molecular
Biology, RECOMB. 2003: 28-37.
47. Zhou T, Shen N, Yang L, Abe N, Horton J, Mann RS, Bussemaker HJ, Gordan R, Rohs R:
Quantitative modeling of transcription factor binding specificities using DNA shape. Proc
Natl Acad Sci U S A 2015, 112:4654-4659.
48. Mathelier A, Xin B, Chiu TP, Yang L, Rohs R, Wasserman WW: DNA Shape Features
Improve Transcription Factor Binding Site Predictions In Vivo. Cell Syst 2016, 3:278-286
e274.
49. Yang L, Zhou T, Dror I, Mathelier A, Wasserman WW, Gordan R, Rohs R: TFBSshape: a motif
database for DNA shape features of transcription factor binding sites. Nucleic Acids Res
2014, 42:D148-155.
50. Foat BC, Morozov AV, Bussemaker HJ: Statistical mechanical modeling of genome-wide
transcription factor occupancy data by MatrixREDUCE. Bioinformatics 2006, 22:e141-149.
51. Ruan S, Swamidass SJ, Stormo GD: BEESEM: estimation of binding energy models using
HT-SELEX data. Bioinformatics 2017, 33:2288-2295.
52. Zhang L, Martini GD, Rube HT, Kribelbauer JF, Rastogi C, FitzPatrick VD, Houtman JC,
Bussemaker HJ, Pufall MA: SelexGLM differentiates androgen and glucocorticoid receptor
DNA-binding preference over an extended binding site. Genome Res 2018, 28:111-121.
53. Rastogi C, Rube HT, Kribelbauer JF, Crocker J, Loker RE, Martini GD, Laptenko O, Freed-
Pastor WA, Prives C, Stern DL, et al: Accurate and sensitive quantification of protein-DNA
binding affinity. Proc Natl Acad Sci U S A 2018, 115:E3692-E3701.
54. Alipanahi B, Delong A, Weirauch MT, Frey BJ: Predicting the sequence specificities of DNA-
and RNA-binding proteins by deep learning. Nat Biotechnol 2015, 33:831-838.
55. Rao S, Chiu TP, Kribelbauer JF, Mann RS, Bussemaker HJ, Rohs R: Systematic prediction of
DNA shape changes due to CpG methylation explains epigenetic effects on protein-DNA
binding. Epigenetics Chromatin 2018, 11:6.
56. Murphy MW, Lee JK, Rojo S, Gearhart MD, Kurahashi K, Banerjee S, Loeuille GA, Bashamboo
A, McElreavey K, Zarkower D, et al: An ancient protein-DNA interaction underlying
metazoan sex determination. Nat Struct Mol Biol 2015, 22:442-451.
57. Comoglio F, Schlumpf T, Schmid V, Rohs R, Beisel C, Paro R: High-resolution profiling of
Drosophila replication start sites reveals a DNA shape and chromatin signature of
metazoan origins. Cell Rep 2015, 11:821-834.
58. Parker SC, Hansen L, Abaan HO, Tullius TD, Margulies EH: Local DNA topography
correlates with functional noncoding regions of the human genome. Science 2009, 324:389-
392.
59. Meysman P, Dang TH, Laukens K, De Smet R, Wu Y, Marchal K, Engelen K: Use of structural
DNA properties for the prediction of transcription-factor binding sites in Escherichia coli.
Nucleic Acids Res 2011, 39:e6.
60. Maienschein-Cline M, Dinner AR, Hlavacek WS, Mu F: Improved predictions of transcription
factor binding sites using physicochemical features of DNA. Nucleic Acids Res 2012, 40:e175.
146
61. Hooghe B, Broos S, van Roy F, De Bleser P: A flexible integrative approach based on random
forest improves prediction of transcription factor binding sites. Nucleic Acids Res 2012,
40:e106.
62. Greenbaum JA, Parker SC, Tullius TD: Detection of DNA structural motifs in functional
genomic elements. Genome Res 2007, 17:940-946.
63. Maurano MT, Wang H, Kutyavin T, Stamatoyannopoulos JA: Widespread site-dependent
buffering of human regulatory polymorphism. PLoS Genet 2012, 8:e1002599.
64. Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, Reynolds AP, Sandstrom
R, Qu H, Brody J, et al: Systematic localization of common disease-associated variation in
regulatory DNA. Science 2012, 337:1190-1195.
65. Rohs R, Sklenar H, Shakked Z: Structural and energetic origins of sequence-specific DNA
bending: Monte Carlo simulations of papillomavirus E2-DNA binding sites. Structure 2005,
13:1499-1509.
66. Zhang X, Dantas Machado AC, Ding Y, Chen Y, Lu Y, Duan Y, Tham KW, Chen L, Rohs R,
Qin PZ: Conformations of p53 response elements in solution deduced using site-directed
spin labeling and Monte Carlo sampling. Nucleic Acids Res 2014, 42:2789-2797.
67. Bishop EP, Rohs R, Parker SC, West SM, Liu P, Mann RS, Honig B, Tullius TD: A map of
minor groove shape and electrostatic potential from hydroxyl radical cleavage patterns of
DNA. ACS Chem Biol 2011, 6:1314-1320.
68. Karolchik D, Barber GP, Casper J, Clawson H, Cline MS, Diekhans M, Dreszer TR, Fujita PA,
Guruvadoo L, Haeussler M, et al: The UCSC Genome Browser database: 2014 update.
Nucleic Acids Res 2014, 42:D764-770.
69. Ho JW, Jung YL, Liu T, Alver BH, Lee S, Ikegami K, Sohn KA, Minoda A, Tolstorukov MY,
Appert A, et al: Comparative analysis of metazoan chromatin organization. Nature 2014,
512:449-452.
70. Main BJ, Smith AD, Jang H, Nuzhdin SV: Transcription start site evolution in Drosophila.
Mol Biol Evol 2013, 30:1966-1974.
71. Eldar A, Rozenberg H, Diskin-Posner Y, Rohs R, Shakked Z: Structural studies of p53
inactivation by DNA-contact mutations and its rescue by suppressor mutations via
alternative protein-DNA interactions. Nucleic Acids Res 2013, 41:8748-8759.
72. Lazarovici A, Zhou T, Shafer A, Dantas Machado AC, Riley TR, Sandstrom R, Sabo PJ, Lu Y,
Rohs R, Stamatoyannopoulos JA, Bussemaker HJ: Probing DNA shape and methylation state
on a genomic scale with DNase I. Proc Natl Acad Sci U S A 2013, 110:6376-6381.
73. Rohs R, Bloch I, Sklenar H, Shakked Z: Molecular flexibility in ab initio drug docking to
DNA: binding-site and binding-mode transitions in all-atom Monte Carlo simulations.
Nucleic Acids Res 2005, 33:7048-7057.
74. Balasubramanian B, Pogozelski WK, Tullius TD: DNA strand breaking by the hydroxyl
radical is governed by the accessible surface areas of the hydrogen atoms of the DNA
backbone. Proc Natl Acad Sci U S A 1998, 95:9738-9743.
75. Segal E, Fondufe-Mittendorf Y, Chen L, Thastrom A, Field Y, Moore IK, Wang JP, Widom J: A
genomic code for nucleosome positioning. Nature 2006, 442:772-778.
76. Field Y, Kaplan N, Fondufe-Mittendorf Y, Moore IK, Sharon E, Lubling Y, Widom J, Segal E:
Distinct modes of regulation by chromatin encoded through nucleosome positioning signals.
PLoS Comput Biol 2008, 4:e1000216.
77. Mavrich TN, Jiang C, Ioshikhes IP, Li X, Venters BJ, Zanton SJ, Tomsho LP, Qi J, Glaser RL,
Schuster SC, et al: Nucleosome organization in the Drosophila genome. Nature 2008, 453:358-
362.
78. Bansal M, Kumar A, Yella VR: Role of DNA sequence based structural features of promoters
in transcription initiation and gene expression. Curr Opin Struct Biol 2014, 25:77-85.
147
79. Rach EA, Yuan HY, Majoros WH, Tomancak P, Ohler U: Motif composition, conservation and
condition-specificity of single and alternative transcription start sites in the Drosophila
genome. Genome Biol 2009, 10:R73.
80. Honig B, Nicholls A: Classical electrostatics in biology and chemistry. Science 1995,
268:1144-1149.
81. Klapper I, Hagstrom R, Fine R, Sharp K, Honig B: Focusing of electric fields in the active site
of Cu-Zn superoxide dismutase: effects of ionic strength and amino-acid modification.
Proteins 1986, 1:47-59.
82. Yang L, Orenstein Y, Jolma A, Yin Y, Taipale J, Shamir R, Rohs R: Transcription factor
family-specific DNA shape readout revealed by quantitative specificity models. Mol Syst Biol
2017, 13:910.
83. Sagendorf JM, Berman HM, Rohs R: DNAproDB: an interactive tool for structural analysis
of DNA-protein complexes. Nucleic Acids Res 2017, 45:W89-W97.
84. Stella S, Cascio D, Johnson RC: The shape of the DNA minor groove directs binding by the
DNA-bending protein Fis. Genes Dev 2010, 24:814-826.
85. Hancock SP, Ghane T, Cascio D, Rohs R, Di Felice R, Johnson RC: Control of DNA minor
groove width and Fis protein binding by the purine 2-amino group. Nucleic Acids Res 2013,
41:6750-6760.
86. Hancock SP, Stella S, Cascio D, Johnson RC: DNA Sequence Determinants Controlling
Affinity, Stability and Shape of DNA Complexes Bound by the Nucleoid Protein Fis. PLoS
One 2016, 11:e0150189.
87. Kalodimos CG, Biris N, Bonvin AM, Levandoski MM, Guennuegues M, Boelens R, Kaptein R:
Structure and flexibility adaptation in nonspecific and specific protein-DNA complexes.
Science 2004, 305:386-389.
88. Li J, Dantas Machado AC, Guo M, Sagendorf JM, Zhou Z, Jiang L, Chen X, Wu D, Qu L, Chen
Z, et al: Structure of the Forkhead Domain of FOXA2 Bound to a Complete DNA
Consensus Site. Biochemistry 2017, 56:3745-3753.
89. Berger MF, Badis G, Gehrke AR, Talukder S, Philippakis AA, Pena-Castillo L, Alleyne TM,
Mnaimneh S, Botvinnik OB, Chan ET, et al: Variation in homeodomain DNA binding
revealed by high-resolution analysis of sequence preferences. Cell 2008, 133:1266-1276.
90. Noyes MB, Christensen RG, Wakabayashi A, Stormo GD, Brodsky MH, Wolfe SA: Analysis of
homeodomain specificities allows the family-wide prediction of preferred recognition sites.
Cell 2008, 133:1277-1289.
91. Jayaram B, Sharp KA, Honig B: The electrostatic potential of B-DNA. Biopolymers 1989,
28:975-993.
92. Chin K, Sharp KA, Honig B, Pyle AM: Calculating the electrostatic properties of RNA
provides new insights into molecular interactions and function. Nat Struct Biol 1999, 6:1055-
1061.
93. Zhao Y, Stormo GD: Quantitative analysis demonstrates most transcription factors require
only simple models of specificity. Nat Biotechnol 2011, 29:480-483.
94. Orenstein Y, Shamir R: A comparative analysis of transcription factor binding models
learned from PBM, HT-SELEX and ChIP data. Nucleic Acids Res 2014, 42:e63.
95. Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, Chan ET, Metzler G,
Vedenko A, Chen X, et al: Diversity and complexity in DNA recognition by transcription
factors. Science 2009, 324:1720-1723.
96. Weirauch MT, Cote A, Norel R, Annala M, Zhao Y, Riley TR, Saez-Rodriguez J, Cokelaer T,
Vedenko A, Talukder S, et al: Evaluation of methods for modeling transcription factor
sequence specificity. Nat Biotechnol 2013, 31:126-134.
97. Cornell WD, Cieplak P, Bayly CI, Gould IR, Merz KM, Ferguson DM, Spellmeyer DC, Fox T,
Caldwell JW, Kollman PA: A Second Generation Force Field for the Simulation of Proteins,
148
Nucleic Acids, and Organic Molecules. Journal of the American Chemical Society 1995,
117:5179-5197.
98. Harris RC, Mackoy T, Machado ACD, Xu D, Rohs R, Fenley MO: Opposites attract: shape
and electrostatic complementarity in protein-DNA complexes. Innovations in Biomolecular
Modeling and Simulations 2012, 2:53-80.
99. Kohavi R: A study of cross-validation and bootstrap for accuracy estimation and model
selection. In Proceedings of the 14th international joint conference on Artificial intelligence -
Volume 2. pp. 1137-1143. Montreal, Quebec, Canada: Morgan Kaufmann Publishers Inc.;
1995:1137-1143.
100. Klemm JD, Rould MA, Aurora R, Herr W, Pabo CO: Crystal structure of the Oct-1 POU
domain bound to an octamer site: DNA recognition with tethered DNA-binding modules.
Cell 1994, 77:21-32.
101. Remenyi A, Tomilin A, Pohl E, Lins K, Philippsen A, Reinbold R, Scholer HR, Wilmanns M:
Differential dimer activities of the transcription factor Oct-1 by DNA-induced interface
swapping. Mol Cell 2001, 8:569-580.
102. Hovde S, Abate-Shen C, Geiger JH: Crystal structure of the Msx-1 homeodomain/DNA
complex. Biochemistry 2001, 40:12013-12021.
103. Li T, Jin Y, Vershon AK, Wolberger C: Crystal structure of the MATa1/MATalpha2
homeodomain heterodimer in complex with DNA containing an A-tract. Nucleic Acids Res
1998, 26:5707-5718.
104. Passner JM, Ryoo HD, Shen L, Mann RS, Aggarwal AK: Structure of a DNA-bound
Ultrabithorax-Extradenticle homeodomain complex. Nature 1999, 397:714-719.
105. Watkins S, van Pouderoyen G, Sixma TK: Structural analysis of the bipartite DNA-binding
domain of Tc3 transposase bound to transposon DNA. Nucleic Acids Res 2004, 32:4306-4312.
106. Tan S, Richmond TJ: Crystal structure of the yeast MATalpha2/MCM1/DNA ternary
complex. Nature 1998, 391:660-666.
107. Blanco AG, Sola M, Gomis-Ruth FX, Coll M: Tandem DNA recognition by PhoB, a two-
component signal transduction transcriptional activator. Structure 2002, 10:701-713.
108. Hong M, Fuangthong M, Helmann JD, Brennan RG: Structure of an OhrR-ohrA operator
complex reveals the DNA binding mechanism of the MarR family. Mol Cell 2005, 20:131-
141.
109. Shen A, Higgins DE, Panne D: Recognition of AT-rich DNA binding sites by the MogR
repressor. Structure 2009, 17:769-777.
110. Panne D, Maniatis T, Harrison SC: Crystal structure of ATF-2/c-Jun and IRF-3 bound to the
interferon-beta enhancer. EMBO J 2004, 23:4384-4393.
111. Dror I, Rohs R, Mandel-Gutfreund Y: How motif environment influences transcription factor
search dynamics: Finding a needle in a haystack. Bioessays 2016, 38:605-612.
112. Finkel SE, Johnson RC: The Fis protein: it's not just for DNA inversion anymore. Mol
Microbiol 1992, 6:3257-3265.
113. Mordelet F, Horton J, Hartemink AJ, Engelhardt BE, Gordan R: Stability selection for
regression-based models of transcription factor-DNA binding specificity. Bioinformatics
2013, 29:i117-125.
114. Ferre-D'Amare AR, Prendergast GC, Ziff EB, Burley SK: Recognition by Max of its cognate
DNA through a dimeric b/HLH/Z domain. Nature 1993, 363:38-45.
115. Ferre-D'Amare AR, Pognonec P, Roeder RG, Burley SK: Structure and function of the
b/HLH/Z domain of USF. EMBO J 1994, 13:180-189.
116. Gilson MK, Honig BH: Calculation of electrostatic potentials in an enzyme active site. Nature
1987, 330:84-86.
117. Gilson MK, Zhou HX: Calculation of protein-ligand binding affinities. Annu Rev Biophys
Biomol Struct 2007, 36:21-42.
149
118. Nicholls A, Honig B: A rapid finite difference algorithm, utilizing successive over ‐
relaxation to solve the Poisson –Boltzmann equation. Journal of computational chemistry
1991, 12:435-445.
119. McCammon JA: Theory of biomolecular recognition. Curr Opin Struct Biol 1998, 8:245-249.
120. Fogolari F, Elcock AH, Esposito G, Viglino P, Briggs JM, McCammon JA: Electrostatic effects
in homeodomain-DNA interactions. J Mol Biol 1997, 267:368-381.
121. Luscombe NM, Austin SE, Berman HM, Thornton JM: An overview of the structures of
protein-DNA complexes. Genome Biol 2000, 1:REVIEWS001.
122. Hegde RS, Grossman SR, Laimins LA, Sigler PB: Crystal structure at 1.7 A of the bovine
papillomavirus-1 E2 DNA-binding domain bound to its DNA target. Nature 1992, 359:505-
512.
123. Kim Y, Geiger JH, Hahn S, Sigler PB: Crystal structure of a yeast TBP/TATA-box complex.
Nature 1993, 365:512-520.
124. Kim JL, Nikolov DB, Burley SK: Co-crystal structure of TBP recognizing the minor groove
of a TATA element. Nature 1993, 365:520-527.
125. Raiber E-A, Hardisty R, van Delft P, Balasubramanian S: Mapping and elucidating the
function of modified bases in DNA. Nature Reviews Chemistry 2017, 1:0069.
126. Gommers-Ampt JH, Borst P: Hypermodified bases in DNA. FASEB J 1995, 9:1034-1042.
127. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z,
Ngo QM, et al: Human DNA methylomes at base resolution show widespread epigenomic
differences. Nature 2009, 462:315-322.
128. Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, Kheradpour P, Zhang
Z, Wang J, Ziller MJ, et al: Integrative analysis of 111 reference human epigenomes. Nature
2015, 518:317-330.
129. Yin Y, Morgunova E, Jolma A, Kaasinen E, Sahu B, Khund-Sayeed S, Das PK, Kivioja T, Dave
K, Zhong F, et al: Impact of cytosine methylation on DNA binding specificities of human
transcription factors. Science 2017, 356.
130. Tahiliani M, Koh KP, Shen Y, Pastor WA, Bandukwala H, Brudno Y, Agarwal S, Iyer LM, Liu
DR, Aravind L, Rao A: Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in
mammalian DNA by MLL partner TET1. Science 2009, 324:930-935.
131. Ito S, D'Alessio AC, Taranova OV, Hong K, Sowers LC, Zhang Y: Role of Tet proteins in 5mC
to 5hmC conversion, ES-cell self-renewal and inner cell mass specification. Nature 2010,
466:1129-1133.
132. Ito S, Shen L, Dai Q, Wu SC, Collins LB, Swenberg JA, He C, Zhang Y: Tet proteins can
convert 5-methylcytosine to 5-formylcytosine and 5-carboxylcytosine. Science 2011,
333:1300-1303.
133. He Y-F, Li B-Z, Li Z, Liu P, Wang Y, Tang Q, Ding J, Jia Y, Chen Z, Li L, et al: Tet-Mediated
Formation of 5-Carboxylcytosine and Its Excision by TDG in Mammalian DNA. Science
2011, 333:1303-1307.
134. Bachman M, Uribe-Lewis S, Yang X, Williams M, Murrell A, Balasubramanian S: 5-
Hydroxymethylcytosine is a predominantly stable DNA modification. Nat Chem 2014,
6:1049-1055.
135. Bachman M, Uribe-Lewis S, Yang X, Burgess HE, Iurlaro M, Reik W, Murrell A,
Balasubramanian S: 5-Formylcytosine can be a stable DNA modification in mammals. Nat
Chem Biol 2015, 11:555-557.
136. Iurlaro M, Ficz G, Oxley D, Raiber EA, Bachman M, Booth MJ, Andrews S, Balasubramanian S,
Reik W: A screen for hydroxymethylcytosine and formylcytosine binding proteins suggests
functions in transcription and chromatin regulation. Genome Biol 2013, 14:R119.
150
137. Spruijt CG, Gnerlich F, Smits AH, Pfaffeneder T, Jansen PW, Bauer C, Munzel M, Wagner M,
Muller M, Khan F, et al: Dynamic readers for 5-(hydroxy)methylcytosine and its oxidized
derivatives. Cell 2013, 152:1146-1159.
138. Wang D, Hashimoto H, Zhang X, Barwick BG, Lonial S, Boise LH, Vertino PM, Cheng X:
MAX is an epigenetic sensor of 5-carboxylcytosine and is altered in multiple myeloma.
Nucleic Acids Res 2017, 45:2396-2407.
139. Jackson SP, Bartek J: The DNA-damage response in human biology and disease. Nature 2009,
461:1071-1078.
140. Mandel-Gutfreund Y, Schueler O, Margalit H: Comprehensive analysis of hydrogen bonds in
regulatory protein DNA-complexes: in search of common principles. J Mol Biol 1995,
253:370-382.
141. Werner MH, Gronenborn AM, Clore GM: Intercalation, DNA kinking, and the control of
transcription. Science 1996, 271:778-784.
142. Nadassy K, Wodak SJ, Janin J: Structural features of protein-nucleic acid recognition sites.
Biochemistry 1999, 38:1999-2017.
143. Jones S, van Heyningen P, Berman HM, Thornton JM: Protein-DNA interactions: A structural
analysis. J Mol Biol 1999, 287:877-896.
144. Luscombe NM, Laskowski RA, Thornton JM: Amino acid-base interactions: a three-
dimensional analysis of protein-DNA interactions at an atomic level. Nucleic Acids Res 2001,
29:2860-2874.
145. Isakova A, Groux R, Imbeault M, Rainer P, Alpern D, Dainese R, Ambrosini G, Trono D, Bucher
P, Deplancke B: SMiLE-seq identifies binding motifs of single and dimeric transcription
factors. Nat Methods 2017, 14:316-322.
146. Nutiu R, Friedman RC, Luo S, Khrebtukova I, Silva D, Li R, Zhang L, Schroth GP, Burge CB:
Direct measurement of DNA affinity landscapes on a high-throughput sequencing
instrument. Nat Biotechnol 2011, 29:659-664.
147. Le DD, Shimko TC, Aditham AK, Keys AM, Longwell SA, Orenstein Y, Fordyce PM:
Comprehensive, high-resolution binding energy landscapes reveal context dependencies of
transcription factor binding. Proc Natl Acad Sci U S A 2018, 115:E3702-E3711.
148. Glorot X, Bengio Y: Understanding the difficulty of training deep feedforward neural
networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence
and Statistics (Yee Whye T, Mike T eds.), vol. 9. pp. 249--256. Proceedings of Machine Learning
Research: PMLR; 2010:249--256.
149. Kingma DP, Ba J: Adam: A Method for Stochastic Optimization. In ArXiv e-prints; 2014.
150. Bastien F, Lamblin P, Pascanu R, Bergstra J, Goodfellow I, Bergeron A, Bouchard N, Warde-
Farley D, Bengio Y: Theano: new features and speed improvements. In ArXiv e-prints; 2012.
151. Keras: Theano-based deep learning library. [https://github.com/fchollet/keras]
152. Brownlie P, Ceska T, Lamers M, Romier C, Stier G, Teo H, Suck D: The crystal structure of an
intact human Max-DNA complex: new insights into mechanisms of transcriptional control.
Structure 1997, 5:509-520.
153. Rozas I, Alkorta I, Elguero J: Bifurcated Hydrogen Bonds: Three-Centered Interactions. The
Journal of Physical Chemistry A 1998, 102:9925-9932.
154. Kribelbauer JF, Laptenko O, Chen S, Martini GD, Freed-Pastor WA, Prives C, Mann RS,
Bussemaker HJ: Quantitative Analysis of the DNA Methylation Sensitivity of Transcription
Factor Complexes. Cell Rep 2017, 19:2383-2395.
155. Andres V, Cervera M, Mahdavi V: Determination of the consensus binding site for MEF2
expressed in muscle and brain reveals tissue-specific sequence constraints. J Biol Chem
1995, 270:23246-23249.
156. Ornatsky OI, McDermott JC: MEF2 protein expression, DNA binding specificity and complex
composition, and transcriptional activity in muscle and non-muscle cells. J Biol Chem 1996,
271:24927-24933.
151
157. Wu W, Huang X, Cheng J, Li Z, de Folter S, Huang Z, Jiang X, Pang H, Tao S: Conservation
and evolution in and among SRF- and MEF2-type MADS domains and their binding sites.
Mol Biol Evol 2011, 28:501-511.
158. Han A, Pan F, Stroud JC, Youn HD, Liu JO, Chen L: Sequence-specific recruitment of
transcriptional co-repressor Cabin1 by myocyte enhancer factor-2. Nature 2003, 422:730-
734.
159. Hoogsteen K: CRYSTAL AND MOLECULAR STRUCTURE OF A HYDROGEN-
BONDED COMPLEX BETWEEN 1-METHYLTHYMINE AND 9-METHYLADENINE.
Acta Crystallographica 1963, 16:907-+.
160. Kitayner M, Rozenberg H, Kessler N, Rabinovich D, Shaulov L, Haran TE, Shakked Z:
Structural basis of DNA recognition by p53 tetramers. Mol Cell 2006, 22:741-753.
161. Cirillo D, Botta-Orfila T, Tartaglia GG: By the company they keep: interaction networks
define the binding ability of transcription factors. Nucleic Acids Res 2015, 43:e125.
162. Jolma A, Yin Y, Nitta KR, Dave K, Popov A, Taipale M, Enge M, Kivioja T, Morgunova E,
Taipale J: DNA-dependent formation of transcription factor pairs alters their binding
specificity. Nature 2015, 527:384-388.
163. Greil F, Moorman C, van Steensel B: DamID: mapping of in vivo protein-genome interactions
using tethered DNA adenine methyltransferase. Methods Enzymol 2006, 410:342-359.
164. Hesselberth JR, Chen X, Zhang Z, Sabo PJ, Sandstrom R, Reynolds AP, Thurman RE, Neph S,
Kuehn MS, Noble WS, et al: Global mapping of protein-DNA interactions in vivo by digital
genomic footprinting. Nat Methods 2009, 6:283-289.
165. Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ: Transposition of native
chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding
proteins and nucleosome position. Nat Methods 2013, 10:1213-1218.
166. Giresi PG, Kim J, McDaniell RM, Iyer VR, Lieb JD: FAIRE (Formaldehyde-Assisted Isolation
of Regulatory Elements) isolates active regulatory elements from human chromatin.
Genome Res 2007, 17:877-885.
167. Arnold CD, Gerlach D, Stelzer C, Boryn LM, Rath M, Stark A: Genome-wide quantitative
enhancer activity maps identified by STARR-seq. Science 2013, 339:1074-1077.
168. Gisselbrecht SS, Barrera LA, Porsch M, Aboukhalil A, Estep PW, 3rd, Vedenko A, Palagi A,
Kim Y, Zhu X, Busser BW, et al: Highly parallel assays of tissue-specific enhancers in whole
Drosophila embryos. Nat Methods 2013, 10:774-780.
169. Kvon EZ, Kazmar T, Stampfel G, Yanez-Cuna JO, Pagani M, Schernhuber K, Dickson BJ, Stark
A: Genome-scale functional characterization of Drosophila developmental enhancers in
vivo. Nature 2014, 512:91-95.
170. Boyle EA, Andreasson JOL, Chircus LM, Sternberg SH, Wu MJ, Guegler CK, Doudna JA,
Greenleaf WJ: High-throughput biochemical profiling reveals sequence determinants of
dCas9 off-target binding and unbinding. Proc Natl Acad Sci U S A 2017, 114:5461-5466.
171. Cameron P, Fuller CK, Donohoue PD, Jones BN, Thompson MS, Carter MM, Gradia S, Vidal B,
Garner E, Slorach EM, et al: Mapping the genomic landscape of CRISPR-Cas9 cleavage. Nat
Methods 2017, 14:600-606.
172. Manin YI: Vychislimoe i nevychislimoe (Computable and Noncomputable), Moscow: Sov.
Radio; 1980.
173. Manin YI: Classical computing, quantum computing, and Shor's factoring algorithm.
Séminaire Bourbaki 2000, 41:375-404.
174. Feynman RP: Simulating physics with computers. International journal of theoretical physics
1982, 21:467-488.
175. Deutsch D: Quantum theory, the Church-Turing principle and the universal quantum
computer. In Proceedings of the Royal Society of London A: Mathematical, Physical and
Engineering Sciences. The Royal Society; 1985: 97-117.
152
176. Shor PW: Algorithms for quantum computation: Discrete logarithms and factoring. In
Foundations of Computer Science, 1994 Proceedings, 35th Annual Symposium on. IEEE; 1994:
124-134.
177. Bennett CH, Bernstein E, Brassard G, Vazirani U: Strengths and weaknesses of quantum
computing. SIAM journal on Computing 1997, 26:1510-1523.
178. Rose G, Macready W: An introduction to quantum annealing. DWave Technical Document
2007, 712.
179. Le Gall F: Powers of Tensors and Fast Matrix Multiplication. In ArXiv e-prints; 2014.
180. Mukamel EA, Nimmerjahn A, Schnitzer MJ: Automated analysis of cellular signals from
large-scale calcium imaging data. Neuron 2009, 63:747-760.
181. Simonyan K, Vedaldi A, Zisserman A: Deep Inside Convolutional Networks: Visualising
Image Classification Models and Saliency Maps. In ArXiv e-prints; 2013.
182. Shrikumar A, Greenside P, Shcherbina A, Kundaje A: Not Just a Black Box: Learning
Important Features Through Propagating Activation Differences. In ArXiv e-prints; 2016.
183. Truong TV, Supatto W, Koos DS, Choi JM, Fraser SE: Deep and fast live imaging with two-
photon scanned light-sheet microscopy. Nat Methods 2011, 8:757-760.
184. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L: Large-Scale Video
Classification with Convolutional Neural Networks. In 2014 IEEE Conference on Computer
Vision and Pattern Recognition; 23-28 June 2014. 2014: 1725-1732.
185. Chiu TP, Rao S, Mann RS, Honig B, Rohs R: Genome-wide prediction of minor-groove
electrostatic potential enables biophysical modeling of protein-DNA binding. Nucleic Acids
Res 2017, 45:12565-12576.
186. Chiu TP, Yang L, Zhou T, Main BJ, Parker SC, Nuzhdin SV, Tullius TD, Rohs R: GBshape: a
genome browser database for DNA shape annotations. Nucleic Acids Res 2015, 43:D103-109.
187. Rocchia W, Sridharan S, Nicholls A, Alexov E, Chiabrera A, Honig B: Rapid grid-based
construction of the molecular surface and the use of induced surface charge to calculate
reaction field energies: applications to the molecular systems and geometric objects. J
Comput Chem 2002, 23:128-137.
Abstract (if available)
Abstract
Protein–DNA binding is a fundamental component of gene regulatory processes, but it is still not completely understood how proteins recognize their target sites in the genome. Transcription factors (TFs) and other DNA binding proteins employ two different DNA readout mechanisms to recognize their genomic target sites, including sequence-based readout of direct contacts with the functional groups of the bases (base readout) and structure-based readout of intrinsic deviations from a canonical double helix (shape readout). A mechanistic understanding of how TFs recognize their target binding sites will provide insights into mechanisms underlying diseases and therefore improve human health. ❧ To be able to derive DNA structural features for whole-genome, we developed an R/Bioconductor package, DNAshapeR. This method extends an existing high-throughput method based on mining of all-atom Monte Carlo simulations, and further encodes DNA sequence and shape features which can be readily applied to modeling studies. We further developed GBshape that provides DNA shape features and hydroxyl radical cleavage predictions for the entire genomes of more than a hundred organisms. With this tool, we illustrated the periodicity of DNA shape features that are present in nucleosome-occupied sequences from human, fly and worm, and demonstrated structural similarities between transcription start sites (TSSs) in the genomes of four Drosophila species. ❧ However, DNA shape is a proxy for biophysical interactions. A/T and C/G base pairs (bps) carry different functional groups in the minor groove, that affect electrostatic potential (EP) due to their charges, in addition to their effect on groove width. To probe this biophysical effect directly, we developed DNAphi for predicting EP in the minor groove and confirmed the direct role of EP in protein–DNA binding using massive sequencing data. Using statistical machine-learning approaches, we showed that adding EP as a biophysical feature can improve the predictive power of quantitative binding specificity models across 27 transcription factor families. High-throughput prediction of EP offers a novel way to integrate biophysical and genomic studies of protein–DNA binding. ❧ The rapid advancement of high-throughput assays result in enormous data generation. In order to leverage data of such proportions for unraveling TF-binding mechanism, a powerful computational framework is required. Conventional machine learning approaches, such as support vector regression (SVR) and multiple linear regression (MLR), require pre-processing steps. The pre-processing step, for example, alignment to a known DNA motif, limits these methods in exploiting the complete data derived from high-throughput binding assays and, in turn, introduces biases. Canonical convolutional neural networks (CNNs) possess many advantages over traditional machine learning methods, but only utilize DNA sequence as a feature, thereby resulting in the lack of biophysical interpretability. To address these issues, we developed DeepRec, a multi-module deep learning framework capable of mining physicochemicalsignatures of DNA. We demonstrated applications of our method for multiple TFs and were able to elucidate important insights into the TF-binding mechanism. ❧ As a tangential research direction, we included two cutting-edge technologies: quantum computing and brain imaging. We demonstrated the possible application of modeling TF–DNA binding mechanisms by leveraging the power of quantum computing. On the other hand, we analyzed cellular signals from large-scale brain imaging data with CNNs, aiming at uncovering insights linked to neuronal activities and disorders.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
Genome-wide studies reveal the function and evolution of DNA shape
PDF
Quantitative modeling of in vivo transcription factor–DNA binding and beyond
PDF
Profiling transcription factor-DNA binding specificity
PDF
Understanding protein–DNA recognition in the context of DNA methylation
PDF
Machine learning of DNA shape and spatial geometry
PDF
Decoding protein-DNA binding determinants mediated through DNA shape readout
PDF
DNA shape at transcription factor binding sites: from purifying selection to a new alphabet
PDF
Deciphering protein-nucleic acid interactions with artificial intelligence
PDF
Fast search and clustering for large-scale next generation sequencing data
PDF
Site-directed spin labeling studies of sequence-dependent DNA shape and protein-DNA recognition
PDF
3D modeling of eukaryotic genomes
PDF
Efficient algorithms to map whole genome bisulfite sequencing reads
PDF
Mapping 3D genome structures: a data driven modeling method for integrated structural analysis
PDF
Data-driven approaches to studying protein-DNA interactions from a structural point of view
PDF
Exploring the application and usage of whole genome chromosome conformation capture
PDF
Genome-wide studies reveal the isoform selection of genes
PDF
Exploring three-dimensional organization of the genome by mapping chromatin contacts and population modeling
PDF
Discovery of mature microRNA sequences within the protein- coding regions of global HIV-1 genomes: Predictions of novel mechanisms for viral infection and pathogenicity
PDF
Detecting joint interactions between sets of variables in the context of studies with a dichotomous phenotype, with applications to asthma susceptibility involving epigenetics and epistasis
PDF
Isoform quantification and splicing regulation analysis in RNA-seq studies
Asset Metadata
Creator
Chiu, Tsu-Pei (author)
Core Title
Genome-wide studies of protein–DNA binding: beyond sequence towards biophysical and physicochemical models
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Computational Biology and Bioinformatics
Publication Date
08/06/2018
Defense Date
06/20/2018
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
biophysics,deep learning,DNA 3D structure,DNA shape,electrostatic potential,high-throughput,OAI-PMH Harvest,physicochemistry,protein–DNA binding,quantitative modeling,transcription factor
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Rohs, Remo (
committee chair
), Alber, Frank (
committee member
), Nakano, Aiichiro (
committee member
), Waterman, Michael S. (
committee member
)
Creator Email
tpc2005@gmail.com,tsupeich@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-55960
Unique identifier
UC11668670
Identifier
etd-ChiuTsuPei-6670.pdf (filename),usctheses-c89-55960 (legacy record id)
Legacy Identifier
etd-ChiuTsuPei-6670.pdf
Dmrecord
55960
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Chiu, Tsu-Pei
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
biophysics
deep learning
DNA 3D structure
DNA shape
electrostatic potential
high-throughput
physicochemistry
protein–DNA binding
quantitative modeling
transcription factor