Data-Driven Approaches to Studying Protein-DNA Interactions from a Structural
Point of View
by
Jared M. Sagendorf
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Physics)
August 2020
Copyright 2020 Jared M. Sagendorf
Hey brother christian with your high and mighty errand,
Your actions speak so loud I can't hear a word you're saying.
Hey sister bleeding heart with all of your compassion,
Your labors soothe the hurt but can't assuage temptation.
Hey man of science with your perfect rules of measure,
Can you improve this place with the data that you gather?
Hey mother mercy can your loins bear fruit forever?
Is your fecundity a trammel or a treasure?
– Bad Religion
Dedicated to my Father, whose enthusiasm for science and education undoubtedly set me on this
journey long before I realized where I was heading.
Acknowledgements
The path I have taken in life which has led me to pursue a PhD and ultimately write this
dissertation has been far from direct. There have been many sharp turns and many moments
where things could have easily gone astray, and I have made many decisions without careful
planning or consideration – often turning to gut feelings for guidance. I credit much of my success
to pure circumstance and to the opportunities which were given to me through the good nature
of many people who I was fortunate enough to meet along the way, and whose support I did little
to earn. I would like to acknowledge those people and the people who have been influential in
both my academic and personal life.
I almost certainly would not be sitting where I am today had I not met Yiyuan Xie. It was
during his introductory physics class that I discovered a passion for physics and mathematics which
led me to change my major from computer science to physics. And it was on his recommendations
that I both received a scholarship to Rensselaer Polytechnic Institute (RPI) and had my first
research experience at Cornell University through an REU program in the summer of 2010, which
would turn out to be a formative experience for me. During my summer at Cornell I was hosted
by professor Matthias Liepe and his gracious graduate students who entrusted me with a research
project which I had almost no qualifications for, but which provided me a wonderful learning
experience. Overall that summer was both engaging and rewarding, and one I look back on very
fondly – and I owe it mostly to Yiyuan.
Following the summer at Cornell I would begin my undergraduate at RPI where I would again
be fortunate enough to learn from and work with some great professors. I want to thank Jim
Napolitano for showing me what a true lecturer should be like and for respecting his students
enough to push them farther than they thought they were capable of. I want to thank Professors
Douglas Whittet and Shengbai Zhang for their generosity in giving me opportunities to work on research
projects in their labs, which undoubtedly allowed me to develop many skills I would need for
success in graduate school. I also want to acknowledge Dr. Damien West for his mentorship
during my time in the Zhang lab. And of course I'd like to acknowledge my friends Tyler and
Jake who were my brothers in arms through many long hours of homework and study sessions
and who I consider friends for life. My undergraduate experience would not have been the same
without them – may they keep the secret of room 353 well.
Upon starting graduate school at the University of Southern California (USC) I again, for
reasons I cannot fully understand, was fortunate enough to meet some amazing friends and
colleagues. First I'd like to thank my advisor Remo Rohs. He's been a great boss to work for –
he has always given me the freedom to pursue my ideas and been supportive of me trying new
things, and he's always been there for his students when they've needed him. So thank you Remo,
you've been a great mentor and a great role model. I'd also like to thank Helen Berman. She was
an important part of the DNAproDB project and she helped arrange for me to spend a week at
Rutgers University to visit with developers and scientists at the Protein Data Bank, which was a
great learning experience for me. And naturally I want to thank my friends in physics Jeff and
Emily, Roelof, Pete, Malte and Lisa and Pancha for their camaraderie and for more laughs and
drunken nights than I can remember. They've shared some amazing adventures with me, and I
value their friendship immensely.
I also want to thank my friends from New York, Adam and Kevin in particular, for always
making me feel like I have a home to go back to. I wouldn't be the person I am today without
having had friends like these to grow up with. And finally I want to thank Junru for her support
during this last year of graduate school, which has been at times hectic and stressful. She is one
of the most genuine people I have ever known.
Table of Contents
Epigraph ii
Dedication iii
Acknowledgements iv
List Of Tables viii
List Of Figures x
Abstract xxi
Chapter 1: Introduction and Background 1
1.1 Mechanisms of protein-DNA recognition . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Implicit Solvent Electrostatics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.1 The Linear Poisson-Boltzmann Equation . . . . . . . . . . . . . . . . . . . . 12
1.2.2 Numerical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.3.1 Polar Solvation Energies . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.3.2 Binding Energies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3.1 Supervised Training of Neural Networks: Back Propagation . . . . . . . . . 20
1.3.2 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.3.3 Semantic Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.3.4 Applications to Protein-DNA Binding . . . . . . . . . . . . . . . . . . . . . 29
Chapter 2: DNAproDB: a Database and Web-Based Tool for Structural Analysis
of DNA–Protein Complexes 31
2.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3 Structure Processing Upgrades . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3.1 Structure preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3.2 DNAproDB data overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3.3 DNA secondary structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3.4 Chemically modified components . . . . . . . . . . . . . . . . . . . . . . . 42
2.3.5 Identification of structural and interaction moieties . . . . . . . . . . . . 42
2.4 Database and Web Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.4.1 Searching the database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4.2 Structure reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.4.3 Integration with the Nucleic Acid Database . . . . . . . . . . . . . . . . . . 49
2.5 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Chapter 3: Predicting DNA{Protein Binding From Structure Using Geometric
Deep Learning 54
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.1.1 GeoBind overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.3 Methods Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3.1 Mesh Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.3.2 Training Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.3.2.1 Structure Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.3.2.2 Mapping Structure Features to the Mesh . . . . . . . . . . . . . . 68
3.3.2.3 Shape Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.3.2.4 Edge Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3.3 Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.3.3.1 Graph Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.3.3.2 Graph Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.3.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.3.5 Network Design and Training Procedure . . . . . . . . . . . . . . . . . . . . 86
3.3.6 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.4 Results of Protein-DNA Binding Prediction . . . . . . . . . . . . . . . . . . . . . . 88
3.5 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Chapter 4: Discussion – Physical Modeling versus Machine Learning and Future
Outlook 98
Reference List 104
Appendix A
Supplementary Information and Methods for Chapter 2 . . . . . . . . . . . . . . . . . . 113
A.1 Supplementary Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
A.1.1 Detection of major and minor grooves for double-helical DNA . . . . . . . . 113
A.1.2 DNA structural entity classification . . . . . . . . . . . . . . . . . . . . . 113
A.1.3 Approximating parameters for chemically modified components . . . . . . 114
A.1.4 Interacting moieties identification . . . . . . . . . . . . . . . . . . . . . . 114
A.1.5 Tyrosine interaction motif analysis . . . . . . . . . . . . . . . . . . . . . 115
A.1.6 PCA analysis of interface features . . . . . . . . . . . . . . . . . . . . . 115
A.1.7 Nucleotide–residue stacking probability analysis . . . . . . . . . . . . . . 116
Appendix B
Definitions and Formulas for Computing Features used by GeoBind . . . . . . . . . . 123
B.1 Structure Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
B.2 Shape Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
List Of Tables
1.1 A list of common activation functions used in neural networks. A good activation
function should be piecewise-differentiable and should allow gradients to "flow"
through the network without vanishing or diverging. . . . . . . . . . . . . . . . . 19
3.1 Performance metrics for a variety of DNA binding site prediction methods taken
from Jinga et al. [113]. Dataset refers to the structural dataset used to train the
authors' models, which is either PDNA-316, a standard dataset from [112], or 'own',
referring to a custom dataset curated by the authors. The metrics are defined
as follows: BA – balanced accuracy: (TP/(TP+FN) + TN/(TN+FP))/2; ACC – accuracy:
(TP+TN)/(TP+TN+FP+FN); REC – recall: TP/(TP+FN); AUC – area under the ROC curve;
F-score – harmonic mean of REC and PRE: 2·REC·PRE/(REC+PRE); PRE – precision:
TP/(TP+FP). TP, TN, FP, FN refer to true positive, true negative, false positive
and false negative, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . 64
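The confusion-matrix metrics defined for Table 3.1 can be computed in a few lines. This is an illustrative sketch; the function name and the example counts are invented for the demonstration, not taken from the dissertation:

```python
def binding_site_metrics(tp, tn, fp, fn):
    """Classification metrics from confusion-matrix counts, as defined in Table 3.1."""
    rec = tp / (tp + fn)                       # recall (sensitivity)
    spec = tn / (tn + fp)                      # specificity
    pre = tp / (tp + fp)                       # precision
    return {
        "BA": (rec + spec) / 2,                # balanced accuracy
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "REC": rec,
        "PRE": pre,
        "F": 2 * rec * pre / (rec + pre),      # harmonic mean of REC and PRE
    }

m = binding_site_metrics(tp=80, tn=90, fp=10, fn=20)
print(m["BA"], m["ACC"])  # 0.85 0.85
```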
3.2 A list of structure features provided by GeoBind. Feature Name is the common
name or the name given by the authors who first described the algorithm for computing
the descriptor, and any reference refers to the original paper where a definition was
given. Dependency is any third-party external package or software which GeoBind
uses to implement the feature. The references under this column refer to the paper
where the implementation was described. Level refers to the structural level the
feature is defined for, which can either be 'R' for residue-level features or 'A' for
atom-level features. A full description of all features is available in appendix B.1 . 68
3.3 A list of shape features provided by GeoBind. Feature Name is the common name
or the name given by the authors who first described the algorithm for computing
the descriptor, and references refer to the original papers where a definition was given.
Dependency is any third-party external package or software which GeoBind uses to
implement the feature. References here refer to the paper where the implementation
was published. Full descriptions are available for each feature in appendix B.2 . . . 70
3.4 Characteristics of the structural datasets of protein-DNA complexes used to generate
training and validation data for the models presented below. The size of the
dataset refers to the number of protein entities (see methods section of [103]). The
various structural criteria used to select the candidate structures from all available
protein-ssDNA and protein-dsDNA structures are derived directly
from DNAproDB data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.5 Performance measures for predicted DNA binding sites on protein molecular surface
meshes. Deep GNNs were trained using the GeoBind framework to predict
vertex-level probabilities of belonging to a binding site or not, and predictions were
compared to ground-truth labels obtained from the observed protein-DNA complex
in the training data. Metrics are reported by aggregating predictions over the entire
datasets. The threshold values chosen for Balanced Accuracy (BA), Precision
and Recall were determined by maximizing BA over the training dataset, but both
were close to 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.6 Performance measures for prediction of DNA binding regions on protein molecular
surface meshes for the negative control dataset. Metrics are reported per-protein by
aggregating all vertices within a particular mesh. The threshold values chosen for
Balanced Accuracy (BA) were the same used in calculating the metrics in table 3.5. 94
A.1 Interaction feature cut-off values for arginine interactions with the standard four
DNA nucleotides. The numbers in parentheses are the total number of interactions
used to calculate 20th percentiles of the feature value distributions, which are then
used as cut-off values. Note that these cut-off values are inclusive, and the feature
values were clipped at 0.5 before computing percentiles to avoid a large bias towards
zero. The first column indicates the feature type – h. bond is the total number
of hydrogen bonds for a moiety interaction, vdw is the number of van der Waals
interactions, and basa is the buried solvent accessible surface area. The remaining
columns are the possible moiety interactions with the following abbreviations: pp
– DNA phosphate; sr – DNA sugar; wg – DNA major groove; sg – DNA minor
groove; bs – DNA base; mc – protein main chain; sc – protein side chain. . . . . 122
List Of Figures
1.1 The growth of protein-nucleic acid structures deposited in the protein data bank
(source: RCSB PDB website [100]) since 1989. The number of protein-nucleic acid
structures has doubled roughly every seven years. . . . . . . . . . . . . . . . . . . 4
1.2 Examples of two recognition mechanisms we can observe from structure. A The
pattern of hydrogen bond donors/acceptors along the B-form DNA double helix.
Acceptors are shown in red, donors in blue, methyl groups in yellow, and base carbons
in white. In the major groove and minor groove, this pattern forms a unique
signature of the DNA sequence. Panel adapted from Rohs et al. [96] B Example
of hydrophobic stacking interactions occurring between ssDNA and hydrophobic
aromatic side chains. A tyrosine stacks with a thymine, which stacks with a tryptophan,
which stacks with a guanine, forming a YTWG stack. These kinds of stacking
arrangements mimic the stacking which occurs spontaneously among the nucleotide
bases in free DNA and are a powerful recognition mechanism for binding ssDNA. 7
1.3 The conceptual framework of the Poisson-Boltzmann equation. The solvent (water)
is modeled as a continuous dielectric with relative permittivity ε_s. The ionic charges
are modeled as continuous charge distributions given by a Boltzmann distribution
of the form Σ_i q_i c_i^∞ e^{−q_i φ(x)/kT} (see (1.1)). Note that the discrete nature of the
actual ionic charges is not explicitly modeled. The solute molecule is modeled as
a series of discrete point charges embedded in a dielectric continuum with relative
permittivity ε_i. The boundary of the solute is defined by its molecular surface, and
different definitions of this surface are possible but depend on the van der Waals
radii of the atoms that make up the solute structure. . . . . . . . . . . . . . . . 11
1.4 The discretization of the PBE on a cubic lattice of grid size h. The potential is
mapped to each grid point while the relative permittivity is mapped to the faces
of a cube that lie halfway between each grid point. The potential at grid point i
is given in terms of the charge at grid point i, and the potential at neighboring
grid points and the permittivity on the line segments joining those grid points as
in eq. (1.4) and (1.5). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
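In one dimension, the same finite-difference idea reduces to a tridiagonal linear system. The toy sketch below (with an arbitrary permittivity jump and charge placement, not the solver used in this work) illustrates how the permittivity is sampled halfway between the potential grid points:

```python
import numpy as np

# 1D analogue of the finite-difference discretization sketched in Fig. 1.4:
# d/dx [ eps(x) d(phi)/dx ] = -rho(x), with phi fixed to 0 at both boundaries.
# eps lives on half-grid points (between phi samples), mirroring the cubic-lattice
# scheme where permittivity is mapped halfway between grid points.
n, h = 101, 0.1
eps = np.ones(n + 1)          # eps[i] sits between phi[i-1] and phi[i]
eps[n // 2:] = 4.0            # a dielectric jump, like a solute/solvent boundary
rho = np.zeros(n)
rho[n // 4] = 1.0 / h         # a point charge represented on the grid

# Assemble the tridiagonal system A @ phi = rho * h^2
A = np.zeros((n, n))
for i in range(n):
    A[i, i] = eps[i] + eps[i + 1]
    if i > 0:
        A[i, i - 1] = -eps[i]
    if i < n - 1:
        A[i, i + 1] = -eps[i + 1]
phi = np.linalg.solve(A, rho * h**2)
print(phi.argmax() == n // 4)  # True: the potential peaks at the charge
```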
1.5 A free energy cycle to model the contributions of solvation to binding energy. In
step 3, a solvated complex dissociates into two free monomers. The binding free
energy is given by the change in solvation energy due to the additional exposed
surface area of the free monomers, which allows for additional interaction with the
solvent-ionic environment, and the change in the Coulomb interaction energy. Steps
4 and 2 model the change in solvation energy, and step 1 models the change in
Coulomb energy. Figure adapted from [5]. . . . . . . . . . . . . . . . . . . . . . . 16
1.6 An illustrative example of the issues which can occur if the learning rate is not
chosen properly when performing gradient descent. . . . . . . . . . . . . . . . . . 23
1.7 Illustrations of the convolution and transposed convolution operations. (A) A 2D
convolution is shown here between some input data (blue) and a kernel (shaded
box), producing the response (green). The input is 4×4, and the kernel is 3×3
with a stride of 1, producing a 2×2 response. (B) A transposed convolution. Here,
the input (blue) is 3×3 but is dilated as shown, and the kernel (shaded box) is also
3×3. The transposed convolution interpolates the input data to produce a 5×5
response. This is the basis of upsampling in fully convolutional neural networks.
Figure adapted from Vincent and Francesco [38] . . . . . . . . . . . . . . . . . . 26
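The shape arithmetic in this caption can be verified with a short sketch; `conv2d` here is a naive illustrative helper (assuming unit stride and no padding), not a function from any deep-learning framework:

```python
import numpy as np

def conv2d(x, k, stride=1):
    """Naive 2D cross-correlation (what deep-learning layers call 'convolution')."""
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * k)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)   # 4x4 input
k = np.ones((3, 3))                            # 3x3 kernel
y = conv2d(x, k)                               # stride 1 -> 2x2 response
print(y.shape)                                 # (2, 2)

# Transposed convolution output size: (n - 1) * stride + kernel
n, stride, kernel = 3, 1, 3
print((n - 1) * stride + kernel)               # 5 -> a 5x5 response
```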
1.8 (A) An image segmentation example. The picture on the left is given as input,
and the ground-truth segmentation which is used for training is given in the middle.
The prediction of a fully convolutional network that has been trained for semantic
segmentation is on the right. The different colors represent different class assignments
for each pixel (human, horse or background). Adapted from [109] (B) A schematic of the
U-net architecture, adapted from [97]. Each box represents a multi-channel feature
map (the output of a convolution or pooling layer), with the number on top of
each box representing the number of filters in that layer. The spatial dimension
of the feature maps is decreased until a 30×30×1024 activation is produced. The
activation is then upsampled until a 1×1 convolution is applied which assigns a
probability distribution to each pixel. . . . . . . . . . . . . . . . . . . . . . . . . 28
2.1 A schematic overview of the DNAproDB structure processing pipeline and front-end
interface. The main stages are structure preprocessing, where structures are prepared
for processing; feature extraction, where various biophysical features are calculated;
data retrieval, which pulls additional annotations from existing databases
for protein chains; and data aggregation, where features are combined in a standard
data format. The DNAproDB database stores processed structural data for more
than 4,500 PDB entries containing DNA–protein complexes. Users can search the
database on features of the DNA, protein or DNA–protein interactions, can generate
reports for the returned results, and can upload their own structures for private
analysis. The report page contains functionalities for downloading extracted features
as a JSON file [39] and for visualizing data using interactive visualization
tools which can export static figures. . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.2 The data hierarchy used by DNAproDB, which inherits part of its overall structure
from that used by the Protein Data Bank. Lines with a single arrow indicate
one-to-many relationships, and lines with two arrows indicate one-to-one relationships.
The PDB data hierarchy begins at Entry, which is often referred to
as simply "the structure" or "a structure" and contains all information about a
macromolecule. An entry may have multiple Models, which are different conformations
of the structure (many entries contain only a single model). Models contain
Chains, which are generally polymerized chains of amino acids or nucleic acids
and may also contain ligands, solvent or other small molecules. Finally, a chain
contains Components, which are the molecular units that make up the chain –
protein residues, DNA nucleotides or other small molecules. DNAproDB directly
inherits the Entry-Model levels of the PDB hierarchy. From there, models contain
Structural Entities (see sec 2.3.2), which are distinct structural components within a
particular model. DNA (protein) structural entities contain DNA strands (protein
chain segments), which are derived from a DNA (protein) chain, but distinct in
that a strand (chain segment) may not contain any backbone breaks. DNA strands
(protein chain segments) then contain nucleotides (residues) at the lowest level of
the hierarchy. Nucleotides and residues have a one-to-one correspondence with
the component level of the PDB hierarchy, but structural entities have no direct
correspondence. DNAproDB provides data on DNA–protein interfaces between
every interacting DNA/protein structural entity pair – and this data is grouped
into individual nucleotide–residue interaction features and global features of the
interface. For a more detailed description of the DNAproDB data and features, see
Supplementary Figure A.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3 A variety of statistics and distributions for a set of select DNAproDB features over
the entire database. A. Shown here are two features describing DNA structural
entities – the number of nucleotides in the entity and the entity type. The mean
nucleotide count is 34, with peaks at approximately 35 and 300, the latter being
mainly from nucleosome structures. The most common entity type is helical, making
up 74% of all DNA structural entities, while single-stranded DNA is the next
most common, making up 13.2%. B. This panel shows pair-wise nucleotide–residue
interaction features. Both heatmaps show values of interactions between the major
groove moiety of nucleotides in a helical conformation with the side chains of
each standard residue. The top heatmap shows the mean number of major groove–
side chain hydrogen bonds, and the bottom shows major groove–side chain buried
solvent accessible surface area. Note that the hydrogen bond values for alanine,
glycine, leucine, isoleucine, phenylalanine, proline and valine are all zero because
these residues do not contain any hydrogen bond donors/acceptors in their side
chains. The especially high value of guanine–arginine, major groove–side chain
hydrogen bonds is due to the ability of arginine to form bidentate bonds with the
guanine's major groove moiety (when the guanine is in a Watson-Crick base-pairing
conformation). C. The four plots in this panel show various features which describe
properties of DNA–protein interfaces as a whole. The first is the interaction density
of the interface. This is the number of nucleotide–residue interactions divided by
the number of nucleotides times the number of residues; it measures how many
interactions are present versus the total number possible and lies between zero and
one. Most interfaces are rather sparse, which is typical of helical DNA interfaces,
while single-stranded DNA interfaces tend to be denser. The next plot shows the
protein secondary structure composition of interfaces. "α-helix" refers to interfaces
which are made up primarily of alpha helices, "β-sheet" primarily of beta sheets,
etc. Helix and helix/strand hybrid are the most common interface types, while a
predominantly strand composition is the least common. The final two plots compare
the mean number of nucleotide interactions per residue for helical DNA and
single-stranded DNA interfaces. Single-stranded DNA interfaces overall tend to
involve fewer nucleotide interactions per residue, which may indicate that individual
residues contribute less to the overall binding affinity than for helical DNA on
average. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.4 Example visualizations provided by DNAproDB that have been generated through
our browser-based tools. DNAproDB visualizations are interactive, customizable
and can be exported directly as PNG images. These visualizations were taken from
the report page of the conjugative relaxase TrwC–DNA co-crystal structure (PDB
ID 1OMH). Selected DNA moiety interactions were base, major and minor groove,
all protein secondary structure types were included, and interactions were filtered
by a minimum nucleotide–residue distance of 4.0 Å or less. The size (pixel area) of
each residue or SSE symbol has been chosen to indicate the number of nucleotides
it interacts with. A. A three-dimensional view of the structure produced by NGL
[99, 98]. The colored regions are the interacting residues which are displayed in the
remaining three panels and are colored using the same color scheme. B. DNAproDB
uses a graph-based approach for displaying the residue contact map. Information
about nucleotide base-pairing, base-stacking and backbone linkages is displayed
via edges between nucleotides, and edges between nucleotides and residues (shown
here as red circles, blue squares, and green triangles; the shapes and colors of the
symbols indicating secondary structure) denote an interaction. The sense of the
DNA strands can be seen based on the direction of the sugar-phosphate linkages
between each adjacent nucleotide in each strand. A G/A mismatch can be seen
near the top of the figure, flanked by two flipped-out bases and indicated by a red
dashed line signifying the non-Watson-Crick base pair. C. A polar contact map
showing major and minor groove interactions in the helical region of the structure.
Protein secondary structure elements are shown and plotted according to which
DNA moieties they interact with. The major and minor grooves are represented by
concentric annuli, and the position of each interacting protein secondary structure
element (SSE) within each annulus is determined by the helicoidal coordinates
of the SSE [102]. Protein beta sheets are represented by triangles, helices are
represented by circles and loops by squares. The polar distribution of the SSE
interactions reflects that the protein binds to one side of the helix. D. A helical
shape overlay with the shape parameter Buckle plotted. The extreme value at the
last position of the sequence is due to an A/G mismatch which is automatically
indicated in red by DNAproDB. The green triangle and blue squares represent sheet
and loop residues which interact with the DNA and their approximate location
along the helical sequence. The same residues can be seen interacting with the
helical region of the DNA in B. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.5 Several example analyses done using only data provided by the DNAproDB database.
A. Shown here is a three-nucleotide 'interaction motif' for the residue tyrosine interacting
in the minor groove of helical DNA. The heatmap was generated by taking
all available instances (99; interactions were filtered for sequence redundancy of
the interacting protein chain) of tyrosine residues which bind in the DNA minor
groove and interact with exactly three nucleotides simultaneously (identified from
the list of nucleotide–residue interactions provided under interface features). The
three interacting nucleotides were sorted by center of mass distance (an interaction
feature) and placed into bins 1, 2 and 3 from the nearest nucleotide to the furthest.
In this way, we capture the radial distribution of the three-dimensional structure of
the interacting nucleotide triplet. The frequency of each nucleotide type (A, C, G
or T) was then plotted for each distance bin. In this example, we see a strong signal
for a C-C-G interaction motif. We note that this is not the same as a sequence motif,
because these nucleotides need not be adjacent or even on the same strand. On the
right side of the panel, an example is shown of the structure of this interaction motif,
with a tyrosine forming two hydrogen bonds between its OH group and the N2 atom
of the guanine and the O2 atom of cytosine. See Supplementary Data for a more
detailed description of the analysis. B. The first and second principal components
from a PCA projection of 134 DNAproDB features (or features derived from them)
describing the DNA–protein interface for 4,758 different interfaces (which were broken
down by protein chain). Each point in the plot corresponds to one interface.
The plots are colored according to the GO annotations of the protein chain and
are grouped into four categories based on the annotated biological function of the
protein. Correlations can be seen between the protein chain annotations and the
way it interacts with DNA as captured by the DNAproDB features when projected
to this low-dimensional space. Several distinct, albeit overlapping, clusters can be
seen which correspond to the different biological functions of the involved protein
chain. See Supplementary Data for a more detailed description of the analysis.
C. Probabilities for a particular nucleotide–residue interaction to be in a stacking
conformation. More precisely, these are conditional probabilities P(stack | N, R),
where N is a nucleotide type and R is a residue type. DNAproDB assigns a geometry
for every nucleotide–residue interaction identified using SNAP, a component of
the 3DNA program suite [77, 78]. The residues for which probabilities are shown are
those with planar side chains, so that a stacking conformation can be defined. The
conditional probabilities for each residue to stack with each nucleotide are shown
as a stacked bar chart, with the numbers inside the bars indicating the probability
values and the numbers above the bars the total number of interactions used
for that residue type. Only interactions with nucleotides in a single-stranded conformation
were included in this analysis. It is interesting to note the variation in
stacking probabilities between different nucleotides for a given residue, most notably
with tyrosine preferring guanine stacking and histidine strongly preferring
adenine stacking. See Supplementary Data for a more detailed description of the
analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.1 A plot of residue propensities at the interface of DNA-protein complexes for protein-ssDNA (SBP) and protein-dsDNA (DBP) complex structures. Propensity measures the frequency with which a residue appears in the protein-DNA interface relative to the rest of the protein surface, with values above 0 indicating the residue is more common in the interface and values below 0 indicating it is less common. The much higher propensities of the hydrophobic residues tryptophan, tyrosine and phenylalanine for SBP relative to DBP show that these proteins have much more hydrophobic interfaces, a chemical feature which is related to ssDNA binding. . . . 57
3.2 The GeoBind framework applied to the task of predicting DNA binding sites on the surface of proteins from structural data. Our starting point is observed complexes of proteins bound to DNA. We first identify DNA binding residues from the structure of the complex. We then generate surface meshes from the protein structure and assign features to the vertices and edges of the mesh using the GeoBind framework. The meshes and features are fed into a neural network which predicts likely binding sites over the vertices of the mesh, and these predictions are compared with the observed binding sites. The corresponding error is propagated back through the network to improve future predictions, and this process is iterated. . . . 61
3.3 An illustration showing the geometric primitives used as edge features in GeoBind and the corresponding primitive features used in images. Neural networks can learn higher-order representations from primitive features that do not require any hand-crafting. A Some of the geometrical relations we encode as edge features. These include the dihedral angle between triangles which share an edge, φ(e_uv), the span of adjacent triangles sharing the edge, and the angle between the vector lying along the edge and the normal at the adjacent vertices. These raw geometrical relationships encode local information about the geometry of the surface. B By analogy, these relationships are similar to pixel intensities in an image, which a conventional convolutional neural network would use as input features. . . . 73
3.4 An overview of the message passing convolution operation. In the first layer, messages are passed between the node u and its first-order neighbors. In the second layer, messages are again exchanged between first-order neighbors; however, each node's hidden state already contains information from its nearest neighbors, therefore the messages being passed to node u contain both first- and second-order information. Hence the size of the receptive field of the convolution increases by one at each layer. . . . 78
3.5 An overview of the edge-collapse procedure used for edge pooling. An edge is identified for pooling based on a learned edge-scoring function. The edge is then removed and the two adjacent nodes are merged into a new node, where the position of the new node is given by the average of the two merged nodes. The positions of the original two nodes are stored, and in the unpooling step the two nodes which were previously deleted are restored along with the edge joining them. The hidden states of the restored nodes are given by eq. (3.16). . . . 82
3.6 A schematic of a graph U-net model with two pooling layers and two linear (lin) layers. The first linear layer transforms the number of input features to a lower-dimensional space, and can be thought of as a learned dimensionality reduction layer. A number of convolutional layers are then applied, followed by a pooling layer. Each pooling layer reduces the resolution of the mesh by a factor of approximately two. One or more convolutions are applied after each pooling step. Once the final pooling level is reached, unpooling operations are performed where the graph structure is restored at each step, again followed by at least one convolution. Skip connections are shown as gray arrows. These combine the hidden representations from earlier layers of the same graph resolution before the pooling/unpooling operation was performed. Including these connections generally allows gradients to flow better through the network. The final layer of the network is one or more linear layers which perform the final dimensionality reduction to a binary probability distribution for each vertex. Every layer in the network is typically followed by a non-linear activation function such as ReLU, save for the last layer, which uses softmax. . . . 87
3.7 Binding site predictions for top-performing proteins from the SBP and DBP validation datasets. These predictions are on proteins the model has never seen during training. The left panel shows four protein surface meshes colored according to the predicted class, with the mesh rotated to give different views. Green and orange are correct predictions corresponding to true negatives and true positives, while red and purple are incorrect predictions, corresponding to false positives and false negatives respectively. White vertices are masked and were ignored during evaluation of performance metrics. We see very few false negative predictions, with most of the error being false positives. We also note that false positives tend to occur around the edges of a correctly identified binding site. The table gives the individual per-protein performance measurements and the protein name along with the PDB identifier that corresponds to the structure entry. . . . 90
3.8 The distribution of predicted probabilities p+ over the training data for the SBP and DBP models. The predicted probabilities for the ground-truth negative and positive classes (non-binding site and binding site respectively) are plotted as separate distributions. These plots show peaks near 0 for the negative class and peaks near 1 for the positive class, which is ideal for a binary classifier. The wide valley near the center of the plots shows that our models are not particularly sensitive to the choice of threshold. . . . 91
3.9 Characteristics of AUROC scores over the training and validation datasets for the SBP and DBP models. Each datapoint represents the AUROC over a particular protein entity surface mesh, where predictions over all non-masked vertices in a mesh were aggregated. The histograms show the distribution of AUROC. For both models, the majority of the distribution lies between 0.9 and 1.0. The plots on the right side of the figure plot AUROC versus the size of the protein structure in total number of atoms. These plots show that the performance of our models is independent of the size of the structure, due to the local message passing framework of the graph convolutions. . . . 92
3.10 Predictions by the SBP model for thrombin, a non-DNA binding protein from the NC dataset. Any binding site prediction is a false positive by definition, as we expect the entire surface to be devoid of any DNA binding sites. The model performs well, achieving over 0.94 BA despite having never been trained on non-DNA binding proteins. However, the areas where it does make false predictions are physically plausible, such as regions with exposed hydrophobic side chains or regions of positively charged residues. . . . 93
A.1 An example of a DNA–protein complex which contains two DNA structural entities and one protein structural entity. A. A graph showing nucleotide base pairing, base stacking and sugar-phosphate linkages, stylized using DNAproDB visualization tools. Two disparate sub-graphs can be seen which represent the two discrete DNA structural entities seen in the second panel. DNA structural entities are named based on the DNA strands they are composed of; their identifier is simply a concatenation of their component strand identifiers. The DNA strand identifiers are based on the DNA chain which the strand belongs to. The first strand in chain E is named "E1", the second strand "E2", etc. In this structure, the DNA chain E has a break (due to a missing nucleotide), thus forming two strands. B. The three-dimensional structure, which shows the two discrete DNA structural entities and the single discrete protein structural entity. This protein entity has two chains, L and H, which form a heterodimer; however, chain H consists of three chain segments, H1, H2 and H3, which are due to backbone breaks. The four chain segments form a single closed molecular surface, and hence are considered a single structural entity. Protein structural entities are named in the same way that DNA entities are. . . . 117
A.2 An overview of the DNAproDB feature hierarchy. Features are grouped into three main categories: DNA-specific features, protein-specific features, and interface-specific features. Within each category there are two levels of features, entry-level and model-level. Entry-level features are those which can be described at the level of the entry (i.e. the structure) and do not vary across models. For instance, under protein features, chains are an entry-level feature because the number of chains and most of their properties (e.g. chain identifier, sequence, sequence-based annotations, length, etc.) do not vary or depend on the coordinates assigned in a particular model. Any feature within an entry-level feature branch that does depend on the model will be stored as an array with one element per model. For example, the secondary structure feature of a protein chain can vary from model to model, since secondary structure depends on the residue coordinates. Features under an entry-level branch which depend on the model index are noted with an asterisk in the figure. Model-level features are those which depend on and may change with the three-dimensional coordinates of the structure, or represent a data object which may only exist for a particular model. These include items such as DNA base pairs, all interface properties, and protein secondary structure elements. Some features refer to other features which are used as identifiers. An identifier feature is analogous to a primary key in a relational database setting and is italicized in the figure. Underlined features refer to identifier features. . . . 118
A.3 A graphical summary of our procedure for determining major and minor groove
structural moieties. A. The base-pair coordinate frame used by DSSR, from which
we gather base-pairing information. Depending on the glycosidic torsion angle
conformation of the two bases, the x-direction will either point towards the minor
groove or major groove. In Watson-Crick geometry, it will always point towards
the major groove. B. An illustration of our procedure for defining major and minor
groove atoms. An A/T Watson-Crick base pair is used for illustration purposes. In
this case, both bases are in the anti conformation, so the x-direction of the base-
pair coordinate frame is towards the major groove. The coordinates of the base
atoms are projected to the x-y plane of the base-pairing frame, and the shortest
path from the glycosidic nitrogen atom of both bases is found (shown in red) using
the N1-N3 hydrogen bond as an edge joining the bases. This path in addition to
the rays emanating from the glycosidic nitrogen atoms bisect the plane. All atoms
lying on the path and on the minor groove side of the bisection are classied as
minor groove, and atoms on the other side as major groove. . . . . . . . . . . . . . 119
A.4 An example of different ways to filter nucleotide–residue interactions in DNAproDB visualizations using a DNAproDB entry containing the catalytic domain of APOBEC3G bound to a single-stranded DNA oligomer (PDB ID 6BUX). The visualizations in panels B–G are examples of the residue contact map visualization. A. The three-dimensional structure as depicted by NGL viewer [99, 98]. B. Default DNAproDB criteria (4.5 Å minimum distance, interactions not weak; see Supplementary Methods) with DNA base interactions shown and all protein secondary structures. Many interactions are shown involving the first cytosine nucleoside, which is inserted into the active site's binding pocket of the APOBEC3G domain. C. The same criteria as in B but showing only residues which form stacking interactions. D. Interactions are shown using default criteria but only to the DNA sugar moieties, indicated by the yellow interaction lines which connect to the sugar moiety symbol of each nucleotide. E. The same as in D except now involving only interactions with at least one sugar hydrogen bond. F. Here we are visualizing all DNA moiety interactions using default criteria but only for helix or strand residues. Only two residues are shown, and neither make any phosphate interactions. This is because these residues are in the active site, which is deeply buried in the protein and is mainly accessible by the inserted cytosine base with some sugar interactions. G. Here all DNA moiety interactions are shown but only for loop residues. The interaction distance cut-off has been lowered to a 3.5 Å minimum nucleotide–residue distance, and any interaction involving a hydrogen bond is highlighted with a red outline. . . . 120
A.5 Visualizations from DNAproDB for a heterodimer of the Hox protein Sex combs reduced (Scr) and its cofactor Extradenticle (Exd) bound to two different DNA fragments (PDB IDs 2R5Z and 2R5Y). Only major groove and minor groove contacts are shown. Joshi et al. [59] showed that for this protein complex the Scr N-terminal linker residues Arg3 and His–12 are important for conferring sequence specificity via shape recognition of the minor groove. A. Three-dimensional views of the minor groove in the region of the Scr Arg5 linker residue. The left panel is an Scr in vivo binding site (PDB ID 2R5Z) in which the Arg3 and His–12 residues can be seen in the minor groove. On the right is a Hox consensus site which lacks the Arg3 and His–12 contacts. B. Two residue contact maps showing major groove and minor groove contacts for the Scr–Exd heterodimer bound to the Scr in vivo binding site fkh250 on the left (PDB ID 2R5Z) and a Hox consensus site fkh250^con* on the right (PDB ID 2R5Y). The colored markers indicate residues and their secondary structure: helix residues are represented as red circles and linker residues are represented as blue squares. Residues are grouped into SSEs, and markers on each nucleotide represent the major and minor groove contacts, respectively. The Scr residues Arg3 and His–12 are seen making contacts in the DNA minor groove of the in vivo binding site but cannot be seen contacting the Hox consensus site. C. Shape overlay plots of the minor groove width of the two binding sites, fkh250 and fkh250^con*. The differences in the intrinsic shape profiles of these DNA sequences, which are described in [59], explain the preference for the Scr in vivo binding site. . . . 121
Abstract
Protein-DNA interactions are the foundation of nearly every process that occurs within living cells. Every function a cell performs (or has the capacity to perform) is determined by the cell's genetic code, which is written in the alphabet of nucleotide sequences and stored in what we call DNA. It is through the action of a vast army of proteins which bind DNA that this genetic information is converted to the chemical building blocks that give rise to living organisms. Proteins are responsible for transcribing DNA to RNA and regulating that process via a complex set of interactions between various transcription factors and DNA that affect the rate at which the RNA polymerase enzyme can transcribe a particular gene. The genome itself consists of a highly organized and compacted assembly of nucleoprotein complexes known as chromatin. DNA in a single human cell is about two meters in length if laid out in a linear fashion. In vivo it is folded and coiled into a densely packed nucleus only about six micrometers in size through the action of proteins such as histones, condensin and CTCF. DNA is constantly suffering damage due to the effects of ionizing and ultraviolet radiation, reactive oxygen species, viruses and toxins. To prevent rapid deterioration of genetic information and malfunction of the machinery of the cell, specialized proteins have evolved which are continually repairing DNA damage and maintaining genetic fidelity. In order for organisms to grow, cells must divide and the entire content of the genome must be unraveled, replicated and recombined, processes which are governed by the coordination of a large number of specialized proteins. It is therefore evident that gaining a mechanistic understanding of biological processes requires an understanding of how proteins interact with and bind to DNA.
Structural data of protein-DNA complexes provides researchers with an atom-level view of protein-DNA interactions. These models of proteins bound to their cognate DNA binding sites help to elucidate the relationships among sequence, structure and biological function, distinguish different mechanisms of DNA recognition, offer a deeper physical understanding of existing experimental results, and give insights into the molecular machinery which drives living cells. The amount of available structure data continues to increase every year, and this ever-expanding resource is enabling new approaches to studying protein-DNA interactions on large scales. In this thesis I present approaches we have developed for studying protein-DNA interactions which utilize structural data. We have built tools specifically for the analysis of the structure of proteins bound to DNA, which include a data processing pipeline for automatically extracting meaningful structural features from a protein-DNA complex, a large database of pre-processed complexes which can be searched based on features of the protein, the DNA or the molecular nucleoprotein interactions, and interactive visualization and data exploration tools for examining the interface interactions that occur in protein-DNA complexes. We have also developed machine learning methods for predicting DNA binding sites from protein structure using geometric deep learning. We represent a protein structure using a molecular surface mesh and have built graph convolutional neural networks for identifying regions of a protein surface which correspond to a DNA binding site. In addition to the original work we have contributed to the study of protein-DNA interactions, I review relevant background material.
Chapter 1
Introduction and Background
Cells are among the most intricate and diverse systems known to science. Their staggering complexity is the result of billions of years of evolution, which has endowed them with the ability to perform many functions and respond to environmental stimuli such as changes in temperature, pH, salt concentration and availability of resources (e.g. oxygen). They are finely tuned molecular machines which have been optimized through evolutionary processes to survive independently in their respective environments or perform specialized tasks that enhance the survival of their host organism. Essential to evolution is the process of heredity, whereby an organism which has acquired some advantageous adaptation may pass this increased utility to its offspring, either directly through cellular division in the case of single-celled organisms or through sexual or asexual reproduction for multicellular organisms. It is via the ability of organisms to confer the innovations of previous generations to future ones that we see the amazing spectrum of life on earth today, and this ability has enabled life to thrive in nearly every possible ecological niche.
Nucleic acids, and deoxyribonucleic acid (DNA) specifically, form the physical medium of heredity in all known forms of life. In its double-stranded form, DNA is a highly stable molecule and exists in vivo as the well-known double-helix structure. The double helix is held together primarily by hydrogen bonds between complementary nucleotide base pairs that form between opposing strands and by hydrophobic stacking between adjacent nucleotide bases which occur on the same strand. In nature, there are four nucleotide bases which store genetic information: adenine (A), cytosine (C), guanine (G) and thymine (T). Only two base pairs are thermodynamically stable, A-T and G-C; others are possible, but they are energetically unfavorable under normal conditions and so exist only transiently and will generally destabilize the DNA. The double-helix structure of DNA, first published by Watson and Crick [121], has given us enormous insight into the physical mechanisms which underlie genetics and heredity. It was recognized by Watson and Crick that if the DNA helix were unwound, each strand would contain a complete copy of the cell's genetic information, because either strand could be used as a template to form a new double helix, identical to the original, due to the base-pairing rules imposed by the chemical structure of the nucleotide bases. Hence, from the structure of DNA alone, one could conclude what the physical basis for heredity was, and pin the conceptual ideas of evolution to a tangible substrate.
The discovery of the structure of DNA was perhaps one of the most important contributions in structural biology. DNA itself is only part of the story, however, because DNA is relatively inert and on its own cannot perform any meaningful biological function. For the cell to access and process the information stored by DNA requires the action of many proteins, and it is the interactions between DNA and these proteins which drive much of the molecular machinery of the cell. These interactions depend on a variety of physical mechanisms, many of which are fundamentally electrostatic in nature [43]. However, the strong dielectric constant of water, combined with the screening effects of electrolytes present in the cellular protoplasm (the fluid substance which fills the interior regions of the cell), reduces the effective scale of electrostatic interactions to relatively short distances. Other non-electrostatic interactions are only relevant over even shorter distances. Further, biological processes generally occur over long time scales (microseconds to minutes) relative to the diffusion rate of a typical protein (on the order of 10 μm²/s [67]). Therefore a meaningful "interaction" between DNA and a protein is synonymous with binding, requiring close physical proximity over an extended period of time. The interactions between DNA and proteins essentially amount to the protein binding and holding onto the DNA (or holding the DNA in place for another protein) long enough for the actual biochemistry or molecular processes in the cell to occur. It is the process of proteins binding DNA that is the central subject of study in this thesis, and a fundamental concept needed to understand the molecular mechanisms that drive biology.
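The electrolyte screening mentioned above can be made quantitative through the Debye length, the distance over which an electrostatic potential in an ionic solution decays by a factor of e. The following sketch is my own illustrative calculation, not part of the published analysis; the choices of 150 mM monovalent salt, T = 310 K and a relative permittivity of 80 are assumptions representing roughly physiological conditions.

```python
import math

def debye_length_nm(ionic_strength_mM=150.0, temperature_K=310.0, eps_r=80.0):
    """Debye screening length of an electrolyte, in nanometers.

    lambda_D = sqrt(eps_r * eps0 * kB * T / (2 * NA * e^2 * I)),
    with the ionic strength I in mol/m^3 (1 mM = 1 mol/m^3 for the
    monovalent salt assumed here).
    """
    eps0 = 8.854e-12   # vacuum permittivity, F/m
    kB = 1.381e-23     # Boltzmann constant, J/K
    NA = 6.022e23      # Avogadro's number, 1/mol
    e = 1.602e-19      # elementary charge, C
    I = ionic_strength_mM  # mol/m^3
    lam = math.sqrt(eps_r * eps0 * kB * temperature_K / (2 * NA * e**2 * I))
    return lam * 1e9

# Under roughly physiological conditions, charges are screened
# beyond about a nanometer:
print(round(debye_length_nm(), 2))  # ~0.8 nm
```

Lowering the ionic strength lengthens the screening distance, which is why electrostatic interactions appear longer-ranged in dilute buffers than in the crowded cellular environment.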
Just as the discovery of the structure of DNA has illuminated our understanding of the physical basis for heredity and genetics, structures of proteins bound to DNA can greatly improve our understanding of fundamental processes in the cell such as gene regulation, transcription, chromatin regulation and DNA repair. The Protein Data Bank (PDB) [11, 126] is an international collaboration which aims to be the central repository for three-dimensional structure data of biological macromolecules. Structural data provides atom-level resolution of biological macromolecules, allows us to view biological entities like nucleic acids and proteins through the lens of physical models, and helps to unite the fields of biology, chemistry and physics. The PDB currently provides structures for nearly 8000 protein-nucleic acid complexes, of which over 5100 are protein-DNA complexes. Figure 1.1 shows the growth of protein-nucleic acid structures in the PDB, the cumulative sum of which has doubled roughly every seven years. This ever-increasing resource of structural data provides rich opportunities for understanding the biological role and function of DNA-binding proteins at a mechanistic level in terms of individual atomic and residue-nucleotide interactions. Structures of protein-DNA complexes provide a bridge from the high-level conceptual understanding of how biological systems are organized and function to physical reality.
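A doubling time like the seven-year figure quoted above can be estimated from cumulative deposition counts with a log-linear least-squares fit, since exponential growth gives ln(count) linear in time with slope r and doubling time ln(2)/r. The sketch below demonstrates the estimator on synthetic placeholder data, not actual PDB statistics.

```python
import math

def doubling_time_years(years, counts):
    """Estimate a doubling time from cumulative counts.

    Fits ln(count) = a + r * year by ordinary least squares and
    returns ln(2) / r, the implied doubling time in years.
    """
    ys = [math.log(c) for c in counts]
    n = len(years)
    xbar = sum(years) / n
    ybar = sum(ys) / n
    r = sum((x - xbar) * (y - ybar) for x, y in zip(years, ys)) / \
        sum((x - xbar) ** 2 for x in years)
    return math.log(2) / r

# Synthetic series constructed to double every 7 years:
years = list(range(1990, 2020))
counts = [100 * 2 ** ((y - 1990) / 7) for y in years]
print(round(doubling_time_years(years, counts), 1))  # 7.0
```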
Figure 1.1: The growth of protein-nucleic acid structures deposited in the Protein Data Bank (source: RCSB PDB website [100]) since 1989. The number of protein-nucleic acid structures has doubled roughly every seven years.

In the remainder of this chapter, I will review background material on protein-DNA interactions, along with two methods which can take advantage of structural data and are important for the remaining chapters: Poisson-Boltzmann electrostatics and deep learning. In chapter 2 I present a tool for performing structural analysis of experimentally determined or simulated protein-DNA complexes, called DNAproDB [102, 103]. In chapter 3 I present work on predicting DNA binding from protein structure using geometric deep learning, a method we call GeoBind. In chapter 4 I conclude with a discussion of the use of structural data in machine learning models and compare machine learning methods to physical modeling.
1.1 Mechanisms of protein-DNA recognition
Proteins which bind DNA utilize a variety of mechanisms to recognize their molecular targets. DNA exists primarily as a B-form double helix in the cell, but can be found in single-stranded form (ssDNA), both transiently during the processes of transcription, DNA repair, replication or recombination and permanently in the end regions of chromosomes known as telomeres. Differences in the structure and chemical makeup of a protein DNA-binding domain can distinguish these forms of DNA as well as distinguish DNA from other types of molecules in the cell. Further, many proteins which bind double-stranded DNA (dsDNA) or ssDNA can even distinguish different DNA sequences from one another, a property which is known as specificity. Structure data of proteins bound to DNA have provided a deep view into how the chemical properties and structure of proteins and DNA give rise to molecular recognition. The fine details are not always well understood, but a coarse view of protein-DNA recognition has emerged from analysis of many protein-DNA complex structures along with models of protein-DNA binding [133, 83], electrostatics calculations [96, 20], molecular dynamics simulations [52, 94] and thermodynamic models.
The primary recognition mechanisms proteins use to bind DNA are hydrogen bonding, hydrophobic contacts (usually via pockets or stacking), electrostatic interactions, van der Waals contacts and steric complementarity. These mechanisms can be seen in a wide variety of protein-DNA complex structures; however, the degree to which each mechanism contributes to binding varies greatly between different proteins and depends on both the form of the DNA as well as the DNA sequence (and shape in the case of dsDNA; for a review on the role of DNA shape in protein recognition see Rohs et al. [95]).
Hydrogen bonds are formed between two electronegative atoms which "share" a hydrogen. The hydrogen is covalently bonded to one of the two heavy atoms (known as the donor) and has a partial positive charge due to the electronegativity of the donor. This creates an electric dipole with a partial negative charge on the donor and a partial positive charge on the hydrogen. This dipole can then form a strong attractive interaction with the dipole formed by another electronegative atom (the acceptor), which is often bonded to a carbon. The hydrogen bond is directional: it is strongest when the donor, hydrogen and acceptor are nearly collinear (the deviation from linearity close to zero) and falls off rapidly in strength as the geometry bends, though the exact interaction strength depends on the environment of the bond. Typical donor-acceptor distances are roughly 2.8–3.5 Å, and bond strengths range from 4 kJ/mol to 25 kJ/mol in biological contexts, though these estimates can vary quite widely. Hydrogen bonds usually form between nitrogens and/or oxygens, and are extremely important for understanding both protein and DNA structure as well as protein-DNA recognition.
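These geometric criteria translate directly into a simple structural classifier: require the donor-acceptor distance to fall in the 2.8–3.5 Å window and the donor-H···acceptor arrangement to be nearly linear. The sketch below is illustrative only; the 30° deviation cutoff is an assumption on my part, not a standard definition.

```python
import math

def is_hydrogen_bond(donor, hydrogen, acceptor,
                     max_da_dist=3.5, min_da_dist=2.8, max_dev_deg=30.0):
    """Classify a putative hydrogen bond from 3D coordinates (Angstroms).

    A pair qualifies when the donor-acceptor distance lies in
    [min_da_dist, max_da_dist] and the D-H...A arrangement is within
    max_dev_deg of linear. The 30-degree cutoff is an illustrative
    choice, not a universal standard.
    """
    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))

    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    da = norm(sub(acceptor, donor))
    if not (min_da_dist <= da <= max_da_dist):
        return False
    hd = sub(donor, hydrogen)     # vector from H to donor
    ha = sub(acceptor, hydrogen)  # vector from H to acceptor
    cos_angle = sum(a * b for a, b in zip(hd, ha)) / (norm(hd) * norm(ha))
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    # angle is the D-H...A angle; 180 degrees means perfectly linear
    return (180.0 - angle) <= max_dev_deg

# A nearly linear N-H...O arrangement at ~2.9 A donor-acceptor distance:
print(is_hydrogen_bond((0, 0, 0), (1.0, 0, 0), (2.9, 0.1, 0)))  # True
```

A bent arrangement at the same donor-acceptor distance would be rejected, which is the sense in which hydrogen bond patterns are sensitive to small variations in orientation.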
The phosphate, sugar, and base of a nucleotide can all form hydrogen bonds with polar amino acid side chains. Some side-chain/base combinations can form bidentate bonds which are particularly strong, such as arginine and guanine, and long side chains can form bonds with multiple bases. The cumulative sum of hydrogen bonding interactions often contributes a significant portion of the binding affinity of a protein for DNA, but hydrogen bonds also offer one of the most powerful mechanisms of specificity. Because hydrogen bonds are short-ranged, directional and asymmetric, the pattern of possible hydrogen bonds between two binding partners is sensitive to small variations in orientation, shape and sequence. Maximizing the hydrogen bond network that forms when a protein and DNA bind can only be accomplished if the two partners have complementary shapes, dynamics and donor/acceptor sites. In dsDNA, the major grooves of A-T base pairs and G-C base pairs display different patterns of hydrogen bond donors/acceptors (fig. 1.2A) which can discriminate between the two and contribute to the sequence specificity of proteins such as transcription factors [95]. For ssDNA, the exposed Watson-Crick edges of the bases have distinct patterns of hydrogen bond donors/acceptors which can also be used for sequence recognition. The sugar moiety of DNA can be used to distinguish it from RNA by the absence of an additional OH group which can act as a hydrogen bond donor to asparagine, glutamine, aspartic acid and glutamic acid. RNA binding proteins may form such hydrogen bonds in order to discriminate RNA from DNA.
Van der Waals contacts are attractive interactions that occur between induced multipoles when atoms are brought into close contact. Multipoles can be induced between an atom with a permanent multipole and a non-polar atom, or between two non-polar atoms due to random fluctuation and correlation of the electron orbitals. They only occur over short distances and are relatively weak interactions; however, their lack of directionality and asymmetry means many such contacts can be made, and their cumulative sum can be substantial. Because van der Waals contacts require close proximity, the more complementarity between the shape of one molecule and another, the more van der Waals contacts can form, and this produces a net positive attraction.
Figure 1.2: Examples of two recognition mechanisms we can observe from structure. A The pattern of hydrogen bond donors/acceptors along the B-form DNA double helix. Acceptors are shown in red, donors in blue, methyl groups in yellow, and base carbons in white. In the major groove and minor groove, this pattern forms a unique signature of the DNA sequence. Panel adapted from Rohs et al. [96] B An example of hydrophobic stacking interactions occurring between ssDNA and hydrophobic aromatic side chains. A tyrosine stacks with a thymine, which stacks with a tryptophan, which stacks with a guanine, forming a YTWG stack. These kinds of stacking arrangements mimic the stacking which occurs spontaneously among the nucleotide bases of free DNA and are a powerful recognition mechanism for binding ssDNA.
Therefore, van der Waals contacts are one of the physical mechanisms which can discriminate molecules based on shape alone. The number of van der Waals contacts is roughly proportional to the area of overlap between the solvent accessible surface area of one molecule and another, and this overlap depends on the complementarity of their shapes. Binding channels and pockets, common in ssDNA binding proteins, can capture ssDNA strands or bases and form a large number of van der Waals contacts.
Hydrophobic interactions occur when non-polar chemical groups come into close proximity,
with the tendency to reduce the total exposed non-polar surface area. Non-polar chemical groups
do not form favorable interactions with the polar molecules of water, and they disrupt the network
of hydrogen bonds which normally exist in the bulk solution or between the solvent and polar
solutes. The strong temperature dependence of hydrophobic interactions [84] suggests that it
is an entropic effect, and it is known to be crucial for both protein folding and the stability of
the DNA double helix; both require a polar solvent. Hydrophobic stacking is commonly seen in the interfaces of proteins which bind ssDNA,
and occurs primarily between the aromatic residues tyrosine, tryptophan and phenylalanine and
the exposed DNA bases [103] (see fig. 1.2B). These interactions are analogous to the stacking
which occurs normally in the DNA double helix and form a powerful discrimination mechanism
that allows certain proteins to distinguish ssDNA from dsDNA. Hydrophobic pockets can be used
to capture nucleotide bases, and have been shown to help the human protein POT1 distinguish
ssDNA from single-stranded RNA (ssRNA) by recognizing the hydrophobic 2' ribose carbon to
achieve 190-fold selection against ssRNA [33]. Hydrophobic residues with aromatic rings or
alkane branches can be used to recognize the methyl group of thymine and of methylated cytosine
in dsDNA, adding additional specificity mechanisms.
Perhaps the most important mechanism for protein-DNA recognition is the electrostatic interaction.
The DNA backbone is highly charged, carrying an extra electron for every phosphate
group, which amounts to a linear charge density of about 5.88 e/nm for dsDNA. The amino acids
arginine, lysine and histidine can carry an extra proton at physiological pH, and hence a net
positive charge. These residues are often at or near the surface of the protein, where electrostatic
screening can minimize the repulsion with other polar residues in the protein and create regions of
net positive charge on the protein surface. These regions of net charge in DNA and proteins give
rise to strong electrostatic interactions, and indeed the positively charged residues are extremely
common in the interfaces of protein-DNA complexes. The electrostatic interaction has, by far,
the longest range of any of these interaction mechanisms. This can allow proteins to "feel" nearby
DNA without needing to be in direct contact. Electrostatic interactions are generally non-specific,
meaning they do not distinguish sequence well, but can allow the protein to diffuse along a strand
of DNA until it "finds" a cognate binding site (if it shows any preferential binding). Highly specific
proteins, therefore, are likely to have smaller electrostatic contributions to their binding energies
relative to specific interactions. Some authors make the distinction of a particular electrostatic
interaction known as a salt bridge, which is a direct (i.e. close proximity) interaction between
a positively charged side chain and a phosphate group of DNA. I review methods for calculating the
electrostatic potential of a biological macromolecule in section 1.2.
Protein-DNA recognition is a complex combination of all the mechanisms outlined above,
and every DNA binding protein utilizes each to different extents. So, while a general picture of
protein-DNA binding mechanisms is well understood, using that information to predict binding
from protein structure, or to predict how changes to protein structure will affect binding, remains
challenging. Structural data of proteins bound to DNA has been indispensable in the study of these
mechanisms; however, bulk processes such as changes in surface hydration, redistribution of
macromolecule-associated cations and anions, and interactions with other solution components
remain invisible to current methods [57]. These interactions add a whole new layer of complexity
to protein-DNA recognition, and surely are an important part of the story. Electrostatic interactions,
for example, have a large dependence on salt concentration, because the anions and cations
in solution act to screen electric fields. However, despite the limitations of structural data,
with its inability to capture these solvent effects in addition to the dynamical properties of the systems,
we can still go a very long way in understanding protein-DNA binding with the structural data
already at our disposal.
1.2 Implicit Solvent Electrostatics
Electrostatic interactions are important for virtually all biomolecular systems. Nearly 20% of
amino acid side chains are ionized in globular proteins under physiological conditions, and a further
25% contain polarized chemical groups [27, 43]. Nucleic acids contain charged phosphate groups
which link the sugar moieties (ribose for RNA, deoxyribose for DNA) of adjacent nucleotides in
the polymer chain via phosphodiester bonds. Each phosphate group carries an extra electron,
amounting to a net linear charge density of approximately 5.88 e/nm along the polymer chain in
the case of dsDNA. These charged macromolecules exist in a solution of water, a polar solvent
and strong dielectric, along with various dissolved salts that produce free ionic charges. The
electrostatic interactions between proteins and DNA are mediated by this complicated dielectric
and electrolytic environment and depend not just on the distribution of charge in the protein and
DNA, but also on the motions of the polar water molecules and the charged ionic species in solution.
The Poisson-Boltzmann equation (PBE) is a second order partial differential equation which
provides a framework for modeling the electrostatics of biological macromolecules under approximately
physiological conditions. First proposed by Gouy [50] and Chapman [17] and later generalized
by Debye and Hückel [30], it describes the interactions between a set of fixed charges that
occupy some molecular volume in a dielectric continuum with one or more ionic charge distributions
that exist beyond the molecular volume. The idea is to replace the discrete water molecules
and ionic charges with a continuous dielectric medium and continuous charge densities, respectively.
The macromolecular solute is modeled as a set of spheres whose union defines a closed
molecular surface that encompasses a set of point charges centered at each sphere. The general
form of the equation can be written as follows [43]:

∇·[ε(x)∇φ(x)] = −4π ρ_m(x) − 4π λ(x) Σ_i q_i c_i^∞ e^(−q_i φ(x)/kT)    (1.1)

φ(x) is the electrostatic potential which solves the above equation, and is the primary quantity
of interest. ε(x) is the position-dependent relative permittivity tensor, which we assume is linear
and isotropic and so can be represented by a single scalar. It is generally given a value equal
to that of water under STP conditions (approximately 80) in the region outside the molecular
surface, and a value in the range of 2-4 inside the surface, which represents the polarizability typical
of proteins or nucleic acids. ρ_m is a set of point charges defined by the molecular structure of our
solute. In practice the values of these charges are derived from a force field and may be somewhat
application dependent. λ(x) represents the accessibility of the ionic charge distributions and is
defined by the molecular surface of the solute: it has a value of 0 inside the molecular surface
and 1 outside of it. Finally, q_i and c_i^∞ are the charge and bulk concentration of the i-th ionic
species. Figure 1.3 gives a schematic overview of the conceptual framework underlying the PBE.
By replacing the discrete water and ionic charges with continuous distributions, we are essentially
averaging over the motion of all the solution components while leaving the solute fixed. This
is valid if we assume that the timescale for fluctuations of the water and ions is much smaller
than that of our solute. In practice this is a reasonable assumption, because the masses of the
biological macromolecules which we are treating are usually on the order of tens of kilodaltons,
and so they have much slower dynamics than individual water molecules or salt ions. The charges on
the ions are usually determined by the chemical species we observe under physiological conditions
for the particular system of interest, but we are free to choose the partial charges of our solute.
This allows the model to be quite flexible, and it can be fit to experimental data to produce good
results. In practice, many force fields have been shown to generalize quite well.

Figure 1.3: The conceptual framework of the Poisson-Boltzmann equation. The solvent (water) is modeled
as a continuous dielectric with relative permittivity ε_s. The ionic charges are modeled as continuous charge
distributions given by a Boltzmann distribution of the form Σ_i q_i c_i^∞ e^(−q_i φ(x)/kT) (see (1.1)). Note that
the discrete nature of the actual ionic charges is not explicitly modeled. The solute molecule is modeled
as a series of discrete point charges embedded in a dielectric continuum with relative permittivity ε_i. The
boundary of the solute is defined by its molecular surface; different definitions of this surface are
possible, but all depend on the van der Waals radii of the atoms that make up the solute structure.
1.2.1 The Linear Poisson-Boltzmann Equation
The second term in (1.1) is an exponential whose argument is linear in the electrostatic potential,
making (1.1) a non-linear second order partial differential equation in φ(x). However, if we assume
that the potential is small relative to the thermal energy kT, which is reasonable
in the case of high bulk concentration of ionic charges (due to screening effects) or when dealing
with small molecular charge density [108], we can simplify this equation.
Let q_i φ(x) ≪ kT; then we can replace the exponential term with a truncated Taylor series as
so:

e^(−q_i φ(x)/kT) ≈ 1 − q_i φ(x)/kT    (1.2)

If we further assume that the net ionic charge is zero, i.e. Σ_i q_i c_i^∞ = 0, then by substituting
(1.2) into (1.1) and letting κ̄(x) = λ(x) Σ_i q_i² c_i^∞ / kT, we find:

(1/4π) ∇·[ε(x)∇φ(x)] = −ρ_m(x) + κ̄(x)φ(x)    (1.3)
Eq. (1.3) is known as the linear Poisson-Boltzmann equation. While this may seem like a rough
approximation, the linear PBE is a good approximation in many cases where more
exact values of the potential are not important, for example in regions far from the solute or for
solutes with few polar charges. The linear PBE can also be used to calculate the boundary values or
initial values for a more accurate non-linear calculation.
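The quality of the truncation in (1.2) is easy to check numerically. Below is a minimal sketch in plain Python, where u stands for the dimensionless argument q_i φ(x)/kT; the linearization is excellent for u ≪ 1 and breaks down as u approaches 1:

```python
import math

def boltzmann_term(u):
    """Exact Boltzmann factor e^{-u}, with u = q*phi/(kT)."""
    return math.exp(-u)

def linearized_term(u):
    """First-order Taylor truncation used to derive the linear PBE (eq. 1.2)."""
    return 1.0 - u

# Absolute error of the truncation for increasing values of u = q*phi/kT.
for u in (0.01, 0.1, 0.5, 1.0):
    err = abs(boltzmann_term(u) - linearized_term(u))
    print(f"u = {u:4.2f}   exp(-u) = {boltzmann_term(u):.5f}   1 - u = {linearized_term(u):.5f}   error = {err:.5f}")
```

The error grows quadratically in u, which is why the linear PBE is trusted only where the potential is small compared to kT.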
1.2.2 Numerical Methods
The form of eq. (1.1) is a second order nonlinear partial differential equation, while eq. (1.3) is
a second order linear partial differential equation. In either case the complicated geometries of
macromolecular structures do not allow for analytical solutions; therefore we must solve eq. (1.1)
and eq. (1.3) numerically. The most common approach taken is the finite difference method. In
this method the potential, charges and relative permittivity are discretized on a cubic lattice with
grid scale h, as shown in figure 1.4. Eq. (1.4) and (1.5) below are the finite difference equations
for the nonlinear and linear PBE, respectively [64, 134]. The potential at each grid point is given
in terms of the charge at that grid point, the potential at neighboring grid points, and the relative
permittivity at points which lie halfway along the line segments joining neighboring grid points (see
figure 1.4).
Figure 1.4: The discretization of the PBE on a cubic lattice of grid size h. The potential is mapped to each
grid point, while the relative permittivity is mapped to the faces of a cube that lie halfway between each
grid point. The potential at grid point i is given in terms of the charge at grid point i, the potential at
neighboring grid points, and the permittivity on the line segments joining those grid points, as in eq. (1.4)
and (1.5).
φ_i^(n) = [Σ_j ε_ij φ_j^(n−1) + 4π ρ_i / h] / [Σ_j ε_ij + h² N(φ_i^(n−1))]    (1.4)

φ_i^(n) = [Σ_j ε_ij φ_j^(n−1) + 4π ρ_i / h] / [Σ_j ε_ij + 4π h² κ̄_i]    (1.5)

φ_i^(n) is the value of the electrostatic potential at grid point i at iteration n, ε_ij is the value of the
relative permittivity on the line joining grid points i and j, and ρ_i is the value of the fixed charge
distribution at grid point i. The term N(φ_i^(n−1)) in eq. (1.4) is the non-linear term, representing
a Boltzmann distribution at the grid point i. Eq. (1.4) and (1.5) can be solved in many ways.
Common approaches include relaxation methods [64, 89] and conjugate gradient methods [28].
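As an illustration, the sketch below iterates the linear finite difference update of eq. (1.5) with Jacobi relaxation on a small cubic grid. It assumes a uniform permittivity (so all six ε_ij links are equal), a zero-potential boundary, and toy parameters chosen only for demonstration; the function name and variables are my own, and a production solver must handle spatially varying permittivity, ion accessibility, and proper boundary values.

```python
import numpy as np

def relax_linear_pbe(charge, eps, screen, h, n_iter=500):
    """Jacobi relaxation of the linear PBE update (eq. 1.5) on a cubic grid.
    charge: fixed charge assigned to each grid point; eps: uniform relative
    permittivity (all six links equal); screen: grid holding the screening
    term in the denominator of eq. (1.5); the boundary is held at 0."""
    phi = np.zeros_like(charge)
    for _ in range(n_iter):
        # Sum of the six neighboring potentials at every interior grid point,
        # computed entirely from the previous iterate (Jacobi sweep).
        nbr = (phi[:-2, 1:-1, 1:-1] + phi[2:, 1:-1, 1:-1] +
               phi[1:-1, :-2, 1:-1] + phi[1:-1, 2:, 1:-1] +
               phi[1:-1, 1:-1, :-2] + phi[1:-1, 1:-1, 2:])
        num = eps * nbr + 4.0 * np.pi * charge[1:-1, 1:-1, 1:-1] / h
        den = 6.0 * eps + screen[1:-1, 1:-1, 1:-1]
        phi[1:-1, 1:-1, 1:-1] = num / den
    return phi

# Toy example: one unit charge at the center of a 17^3 grid with no salt,
# so the screening term vanishes and the update reduces to Poisson's equation.
n, h, eps = 17, 1.0, 80.0
q = np.zeros((n, n, n))
q[n // 2, n // 2, n // 2] = 1.0
phi = relax_linear_pbe(q, eps, np.zeros((n, n, n)), h)
print(phi[n // 2, n // 2, n // 2])
```

The zero boundary here is a stand-in for the Debye-Hückel boundary values a real calculation would use.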
Generally the PBE is treated as a Dirichlet problem, and so we must specify the value of the
potential at the boundary of the simulation grid. A common method for choosing an appropriate
boundary value is to start with a very large but coarse grid, where the boundary value of the
grid is chosen by approximating the potential using a Debye-Hückel model. We then solve for the
potential on the coarse grid and use the value of this coarse-grid solution as the boundary
value of a smaller but finer grid. This is known as focusing, and can be repeated several times.
Many algorithms and numerical solvers have been implemented for solving the PBE
using a number of numerical methods; some of the more widely known include Delphi [72], APBS
[60], UHBD [80], MEAD [6], and PBEQ [55].
1.2.3 Applications
Many quantities relevant to molecular biology can be calculated within the PBE framework without
the need for the extensive sampling that other physical methods would require (due to the implicit
solvent approximation). These include electrostatic free energies such as polar solvation energies
and binding energies, conformational free energies, and protein titration and pK_a calculations
[43, 5, 81]. Below I review two applications that are particularly relevant to protein-DNA interactions:
polar solvation energies and binding energies.
1.2.3.1 Polar Solvation Energies
Polar solvation energies are thought of as the change in free energy when moving a solute from a
homogeneous dielectric environment free of any ions to an inhomogeneous dielectric environment
with ions present (e.g. water). Such a process might model the effect of a protein migrating
from the apolar lipid environment of a cell membrane to the polar electrolytic environment of the
cytoplasm, for example, or contribute to the binding free energy between two polar molecules in
a solution of water. We can calculate the solvation free energy by computing the free energies in
two reference states as described below.
Given a numerical solution to the PBE, we can partition the potential into three components,
φ = φ^s + φ^c(FD) + φ^x [81], where φ^s is a self-interaction term (which will be removed), φ^c(FD) is
the Coulomb interaction between fixed partial charges which, due to the way charge is discretized
on the grid, differs from φ^c = (1/2) Σ_{i,j≠i} q_i q_j / r_ij, and φ^x is the reaction field energy that is due to
the interaction of the molecular charges with the solvent and ions. It is this last term which
contributes to the polar solvation energy. To isolate φ^x we perform two calculations: the first
in a state where ε(x) = ε_i and no ions are present, and the second in the solvated state with ions
present. The energy in the first state is given by

G_1 = (1/2) Σ_i q_i φ_i^(1) = (1/2) Σ_i q_i (φ_i^s + φ_i^c(FD))    (1.6)

and the free energy in the second state is

G_2 = (1/2) Σ_i q_i φ_i^(2) = (1/2) Σ_i q_i (φ_i^s + φ_i^c(FD) + φ_i^x)    (1.7)

Therefore, the polar solvation energy ΔG_S^(P) is equal to G_2 − G_1.
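Given the two potential maps, the bookkeeping of eqs. (1.6) and (1.7) is a single sum; because the charges and grid are identical in both states, the self-interaction term φ^s cancels in the difference. A minimal sketch, where the charges and potentials are made-up stand-ins for the output of two PBE runs:

```python
def grid_energy(charges, potentials):
    """G = (1/2) * sum_i q_i * phi_i over the solute's charges (eqs. 1.6, 1.7)."""
    return 0.5 * sum(q * p for q, p in zip(charges, potentials))

# Hypothetical values standing in for two PBE calculations on the same solute:
# state 1: homogeneous dielectric eps_i, no ions; state 2: solvated, with ions.
q = [0.5, -0.5, 1.0]
phi_state1 = [12.1, -9.8, 20.4]   # reference-state potentials at the charges
phi_state2 = [10.3, -8.1, 16.9]   # solvated-state potentials at the charges
dG_polar = grid_energy(q, phi_state2) - grid_energy(q, phi_state1)
print(dG_polar)  # negative here: solvation is favorable for this toy charge set
```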
There are additional contributions to the total solvation energy, namely the apolar solvation
energy terms. These are due to the energy required to create a cavity in the solvent, which
gives rise to a volume term, plus dispersive interactions between the solvent and solute, which give rise to
a surface area term. These contributions can be significant for non-polar molecules. Wei and
Baker [1] have presented a review of this topic.
1.2.3.2 Binding Energies
Binding free energies can be modeled as the sum of the change in solvation energy when two
solutes, A and B, form a complex AB, plus the Coulomb interactions of the fixed charges in A and
B. This model is summarized in figure 1.5.

Figure 1.5: A free energy cycle to model the contributions of solvation to binding energy. In step 3, a
solvated complex dissociates into two free monomers. The binding free energy is given by the
change in solvation energy due to the additional exposed surface area of the free monomers, which allows
for additional interaction with the solvent-ionic environment, plus the change in the Coulomb interaction
energy. Steps 4 and 2 model the change in solvation energy, and step 1 models the change in Coulomb
energy. Figure adapted from [5].

The energy we wish to model is the change in state in step 3: the change in energy from the
solvated complex to the solvated free monomers. We model this as the change in solvation energy
of the complex in step 4, minus the change in solvation energy of the free monomers in step 2, plus
the Coulomb interaction in step 1.
ΔG_B = −ΔG_3 = ΔG_4 − ΔG_2 − ΔG_1    (1.8)

The minus signs here are determined by the directions of the processes in the thermodynamic
cycle. ΔG_4 and ΔG_2 are simply three solvation energy calculations as described in the previous
section, where ΔG_2 is the solvation energy of the two monomers calculated separately. ΔG_1 is
the Coulomb energy of the fixed charges in the complex,

ΔG_1 = (1/2) Σ_a Σ_b q_a q_b / (ε_i r_ab)    (1.9)

where ε_i is the relative permittivity in the interior region of the complex. Note that this
model does not account for entropy changes or molecular mechanics energies, and these must be
computed separately. Additionally, we may or may not include the apolar contributions to the
solvation energies mentioned in sec. 1.2.3.1.
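The cycle can be assembled from three solvation calculations plus one Coulomb sum. The sketch below uses hypothetical solvation energies in place of real PBE runs; note that the factor of 1/2 in eq. (1.9), which compensates for double counting the pairs, is equivalent to a single sum over unique pairs:

```python
from itertools import combinations

def coulomb_energy(charges, coords, eps_i):
    """Coulomb energy of the fixed charges (eq. 1.9), written as a sum over
    unique pairs, which absorbs the 1/2 double-counting factor."""
    total = 0.0
    for (qa, ra), (qb, rb) in combinations(zip(charges, coords), 2):
        r_ab = sum((x - y) ** 2 for x, y in zip(ra, rb)) ** 0.5
        total += qa * qb / (eps_i * r_ab)
    return total

# Hypothetical solvation energies standing in for three separate PBE runs:
dG4_complex = -310.0           # solvation energy of the AB complex
dG2_monomers = -180.0 - 150.0  # solvation energies of free A and free B
q = [1.0, -1.0]
xyz = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
dG1_coulomb = coulomb_energy(q, xyz, eps_i=4.0)
dG_bind = dG4_complex - dG2_monomers - dG1_coulomb  # eq. (1.8)
print(dG_bind)
```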
1.3 Deep Learning
Deep learning is a subfield of machine learning which makes use of artificial neural networks (NNs)
in combination with a learning algorithm, such as back propagation [49], to learn a mapping from
one finite dimensional space R^N to another R^M. The feverish attention given to these models
in recent years is due to their extreme flexibility and their ability to learn hierarchical representations
of their inputs, which increase the predictive power of the network and can be applied in other
domains. NNs come in many shapes and forms, where the structure of the network is referred to
as its architecture. The most basic NN architecture is the feed-forward neural network (FFNN),
which contains a layer of input "neurons", followed by one or more so-called hidden layers, and
a final output layer of neurons. Here, a "neuron" is named so because of a loose analogy to
biological neurons, but it is simply a placeholder for a real number, and can be thought of as an
element of a tensor. A network is built from a series of "layers", with the first layer representing
an input tensor, intermediate layers (known as hidden layers) representing a set of intermediate
tensor states (known as activations), and the final layer representing the output of the neural
network. In an FFNN, each layer is defined in terms of the previous layer's activation, and the
functional form of this relationship defines the architecture of the network.
In a fully-connected FFNN (FCNN), each layer's activation is defined as a linear transformation
of the previous layer's activation followed by the application of a non-linear, differentiable
"squashing" function, as in eq. 1.10.

h_k = σ_k(W_k h_{k−1} + b_k)    (1.10)
Here, h_k ∈ R^{N_k} is the activation at layer k (also known as the hidden state), h_{k−1} is the
activation of the previous layer, W_k ∈ R^{N_k × N_{k−1}} and b_k ∈ R^{N_k} are learnable parameters known
as weights and biases, and σ_k(·) is a non-linear function, applied element-wise, which usually has
a non-linear region around the origin and saturates to 1, -1 or 0 for very large or small inputs. A
table of some common activation functions is given in table 1.1. Generally the same function is
used for all layers except the last; however, one is free to let this choice of function vary per layer.
An FCNN with K layers is composed of a chain of functional compositions of the form

h_K = g(W_K σ_{K−1}(W_{K−1} σ_{K−2}(W_{K−2} σ_{K−3}(…) + b_{K−2}) + b_{K−1}) + b_K)    (1.11)

with the final activation h_K usually taken as the output of the network, and the first layer being
the network input. FCNNs are just one of many possible network architectures. More generally,
a hidden layer need not be a function of a linear transformation of the previous layer. It could,
for example, depend on a convolution of the previous activation, e.g. h_k = σ(h_{k−1} ∗ w_k + b_k) for
some choice of activation function σ and a learnable convolution kernel w_k and bias b_k. Further,
it could depend on a sequence of previous activations, as in a recurrent neural network (RNN).

Common Activation Functions Used in NNs

Name             Equation                                   Range
arctan           f(x) = tan^(−1)(x)                         (−π/2, π/2)
hyperbolic tan   f(x) = (e^x − e^(−x)) / (e^x + e^(−x))     (−1, 1)
sigmoid          f(x) = 1 / (1 + e^(−x))                    (0, 1)
ReLU             f(x) = x if x > 0; 0 if x ≤ 0              [0, ∞)
Leaky ReLU       f(x) = x if x > 0; 0.01x if x ≤ 0          (−∞, ∞)

Table 1.1: A list of common activation functions used in neural networks. A good activation
function should be piecewise-differentiable and should allow gradients to "flow" through the network
without vanishing or diverging.
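The forward pass of eq. (1.10) is only a few lines of numpy. The sketch below builds a toy FCNN with tanh activations (table 1.1); the layer sizes and random parameters are arbitrary, chosen only to show the shape bookkeeping:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, weights, biases):
    """Forward pass of a fully-connected FFNN: h_k = sigma(W_k h_{k-1} + b_k),
    applied layer by layer as in eq. (1.10), with sigma = tanh throughout."""
    h = x
    for W, b in zip(weights, biases):
        h = np.tanh(W @ h + b)
    return h

# A toy network with layer sizes 4 -> 8 -> 8 -> 2 and random parameters.
sizes = [4, 8, 8, 2]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]
y = forward(rng.normal(size=sizes[0]), weights, biases)
print(y.shape)  # (2,); every output lies in (-1, 1) because tanh saturates there
```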
The expressive power of the network is determined by the number and size of the hidden layers,
which directly correspond to the number of parameters in the network. Typical networks may
have thousands to millions of parameters, depending on the size of their input and the number
of layers in the network. It has been shown that deep neural networks form a class of universal
approximators, capable of approximating any Lebesgue integrable function to arbitrary
accuracy provided that the hidden layers are large enough and that the correct parameters can be
found [53, 79]. Finding those parameters is a non-trivial task; however, local minima often exist which can provide
very good results.
In the next sections I give an overview of how neural networks are trained using the back
propagation algorithm, and I discuss a particular architecture of neural networks known as the
convolutional neural network, its application to the specific task of semantic segmentation,
and how this relates to protein-DNA binding.
1.3.1 Supervised Training of Neural Networks: Back Propagation
Neural networks can be trained under various assumptions and conditions within three major
paradigms: supervised learning, unsupervised learning, and reinforcement learning. In this work
we focus only on supervised learning, but all three are active and exciting areas of research. In
the supervised learning paradigm, a neural network is thought of as a mapping f(X|θ) : X ∈
R^N → Y ∈ R^M that depends on the trainable parameters of the network, θ. During supervised
learning we learn the parameters using a set of training data which samples the mapping
we wish to approximate. Our training data D = {x_i ∈ X, y_i ∈ Y} consists of pairs of input and
output variables which represent a mapping from some domain X to another, Y.
The basic formulation for training a neural network is first to define the size and architecture
of the network, given the size and type of data to be used for input. Next, one must define a
loss or objective function that describes how good or accurate the output of the network is with
respect to a set of training data which has been prepared, and which achieves a minimum when there
is a perfect fit of the network prediction to the training set. That is, we choose some function L
which compares the output of the network f(x_i|θ) to the corresponding output in our training
data y_i, and which is zero when f(x_i|θ) = y_i or otherwise defines some metric |f(x_i|θ) − y_i| that tells us
how far we are from our target. For regression tasks, where the mapping is a continuous function from X
to Y, the mean squared error loss is often used: L(f(X|θ), Y) = (1/N) Σ_i (f(x_i|θ) − y_i)². However,
the choice of loss function is highly dependent on context and the application at hand, and on what
exactly the target outputs y_i are expected to be: whether they are continuous variables, discrete
variables, probabilities, etc.
An important consideration is that the loss function be differentiable with respect to f(x_i|θ).
If this is the case, then the gradients of the loss with respect to the weights of the network can
be computed easily using the algorithm known as back-propagation [49]. Given these gradients,
we can then update the values of the weights in such a way as to minimize the loss function using
what is known as gradient descent. Eq. 1.12 shows an example of a simple form of gradient
descent. Here, W_k are the weights of the k-th layer, L is a loss function which we wish to minimize
and is a function of the training data X and the parameters of the network, and η is a parameter
known as the learning rate. Repeated iterations of this update rule applied to the weights of the network
will move L towards a local minimum (or, in unfortunate cases, towards a saddle point, but this is
fairly easy to avoid).

W_k → W_k − η ∂L(X, Y|θ)/∂W_k    (1.12)
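For a simple convex loss whose gradient is known in closed form, the update rule of eq. (1.12) can be applied directly; a minimal sketch minimizing L(w) = (w − 3)², whose gradient is 2(w − 3):

```python
def gradient_descent(grad, w0, eta, n_steps):
    """Repeatedly apply the update rule w <- w - eta * dL/dw (eq. 1.12)."""
    w = w0
    for _ in range(n_steps):
        w = w - eta * grad(w)
    return w

# Minimize L(w) = (w - 3)^2; each step shrinks the distance to the minimum
# by a factor of (1 - 2*eta), so eta = 0.1 converges quickly.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0, eta=0.1, n_steps=100)
print(w_min)  # converges to the minimum at w = 3
```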
The back-propagation algorithm tells us how to compute the gradient ∂L(X, Y|θ)/∂W_k and is based
on the chain rule for derivatives. Let the total loss over the training data be the average over the
losses of individual training examples, L(X, Y|θ) = (1/N) Σ_i L(x_i, y_i|θ). To make our notation more
compact, let σ_k be the activation function at the k-th layer and let the hidden state, or activation,
be h_k = σ_k(z_k), where z_k = W_k h_{k−1}, which we'll call the weighted input to the k-th activation
function. Here we have included the bias term in the learnable weights W_k, which we can
easily do by appending an extra value of 1 to each activation. h_k, W_k and z_k should be understood
as tensors.

The gradient with respect to weights from the k-th layer can be written, using the chain rule,
as

∂L(x_i, y_i|θ)/∂w_ij^k = [∂L(x_i, y_i|θ)/∂z_j^k] [∂z_j^k/∂w_ij^k]    (1.13)

where

∂z_j^k/∂w_ij^k = h_i^(k−1)    (1.14)

based on the definition z_j^k = Σ_i w_ij^k h_i^(k−1). Define δ_j^k = ∂L(x_i, y_i|θ)/∂z_j^k, which we call the error
term, and we can write

∂L(x_i, y_i|θ)/∂w_ij^k = δ_j^k h_i^(k−1)    (1.15)

Now, given the value of a hidden state at layer k, the loss function depends only on the
remaining layers from k+1 … K. Hence, we can re-write δ_j^k as

δ_j^k = ∂L(x_i, y_i|θ)/∂z_j^k = Σ_{l=1}^{r(k+1)} [∂L(x_i, y_i|θ)/∂z_l^(k+1)] [∂z_l^(k+1)/∂z_j^k] = Σ_{l=1}^{r(k+1)} δ_l^(k+1) ∂z_l^(k+1)/∂z_j^k    (1.16)

where r(k+1) is the size of the k+1 layer. The weighted input at the k+1 layer is given by
z_l^(k+1) = Σ_m w_ml^(k+1) σ_k(z_m^k), hence we know that

∂z_l^(k+1)/∂z_j^k = w_jl^(k+1) ∂σ_k/∂z_j^k    (1.17)

Plugging this into the above equation yields

δ_j^k = Σ_{l=1}^{r(k+1)} δ_l^(k+1) w_jl^(k+1) ∂σ_k/∂z_j^k = [∂σ_k/∂z_j^k] Σ_{l=1}^{r(k+1)} δ_l^(k+1) w_jl^(k+1)    (1.18)

Finally, combining eq. (1.15) and (1.18), we have

∂L(x_i, y_i|θ)/∂w_ij^k = δ_j^k h_i^(k−1) = h_i^(k−1) [∂σ_k/∂z_j^k] Σ_{l=1}^{r(k+1)} δ_l^(k+1) w_jl^(k+1)    (1.19)

This gives us a straightforward way to compute the gradients of every weight in the network
and update them as follows. First, we do a forward pass through the network, starting with the
Figure 1.6: An illustrative example of the issues which can occur if the learning rate is not chosen properly
when performing gradient descent. In the first case, η is chosen too small, and we require many iterations
to reach a minimum. In practice, neural networks are large models and computing gradients is costly
in both memory and time; therefore we prefer a method which converges faster. The second example
is the "sweet spot": we achieve convergence in a reasonable number of steps. The third, and perhaps
most disastrous, occurs if we choose η to be too large. We want η to be large to speed up learning, but if it
is too large then we can overshoot our minimum; at best the network will simply oscillate and never
converge, but at worst it can actually diverge depending on the landscape of our loss function.
input and computing h_k at every layer. Then we perform a backwards pass: starting with δ_K at
the final layer, we compute the gradients at each layer in terms of the weights and error term of
the layer above it. Finally, when all the gradients have been computed, we update the weights
of the network using the rule in equation (1.12). This framework has several major difficulties,
however. First, the learning rate η in eq. (1.12) is not specified; we must choose a value,
but a priori we have little way of knowing what a good value should be. This can be partially
solved through trial and error, simply trying many different learning rates and observing how
the loss function converges towards a minimum (or not) over time. Figure 1.6 illustrates the
issues with choosing too high or too low a learning rate: either the loss will take a very long
time to converge, or it may actually diverge if we set η too high.
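The forward and backward passes can be sketched concretely for a small two-layer network with tanh activations and a squared-error loss. The sizes, data, and learning rate below are arbitrary, and biases are omitted for brevity; the error terms follow eq. (1.18) and the weight gradients are the outer products of eq. (1.15):

```python
import numpy as np

rng = np.random.default_rng(1)

def train_step(W1, W2, x, y, eta):
    """One forward/backward pass with in-place gradient descent updates."""
    # Forward pass: weighted inputs z_k and activations h_k = tanh(z_k).
    z1 = W1 @ x
    h1 = np.tanh(z1)
    z2 = W2 @ h1
    h2 = np.tanh(z2)
    # Backward pass: delta at the output layer, then propagate down (eq. 1.18);
    # note d tanh(z)/dz = 1 - tanh(z)^2.
    d2 = 2.0 * (h2 - y) * (1.0 - h2 ** 2)
    d1 = (W2.T @ d2) * (1.0 - h1 ** 2)
    # Weight gradients are outer products delta_j^k * h_i^{k-1} (eq. 1.15).
    W2 -= eta * np.outer(d2, h1)
    W1 -= eta * np.outer(d1, x)
    return np.sum((h2 - y) ** 2)  # loss before the update

W1 = rng.normal(scale=0.5, size=(5, 3))
W2 = rng.normal(scale=0.5, size=(2, 5))
x, y = rng.normal(size=3), np.array([0.3, -0.4])
losses = [train_step(W1, W2, x, y, eta=0.1) for _ in range(200)]
print(losses[0], losses[-1])  # the loss shrinks as the weights are updated
```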
The second, and much more pernicious, issue is that loss functions are rarely, if ever, convex.
In reality there are many saddle points and local minima where the gradients are zero but we
are far from an optimal solution. In fact, achieving a global minimum is almost never possible
when training a neural network, so we are left only with good local minima in a sea of poor
ones. Both choosing the initial learning rate and avoiding saddle points and local minima can
be addressed using adaptive learning rates [101]. Gradient descent optimization algorithms with
adaptive learning rates aim to tune η based on the history of previous gradients, and generally
modify the form of eq. (1.12) to include additional "momentum" terms. A proper review of these
learning algorithms is beyond the scope of this work, but several popular ones include RMSprop
(unpublished), Adam [63], Adagrad [37] and Adadelta [131].
1.3.2 Convolutional Neural Networks
Convolutional neural networks are a form of FFNN which implements convolutions, a discrete
linear operation which in one dimension can be defined as

(f ∗ g)[n] = Σ_{m=−M}^{M} f[n − m] g[m]    (1.20)

for a pair of finite sequences f and g, where n and m are indices into the sequences. This can
easily be extended to arbitrary dimensions by summing over additional indices:

(f ∗ g)[n_1, …, n_N] = Σ_{m_1=−M_1}^{M_1} ⋯ Σ_{m_N=−M_N}^{M_N} f[n_1 − m_1, …, n_N − m_N] g[m_1, …, m_N]    (1.21)
In the context of CNNs, g is referred to as the convolution kernel, filter or simply weights. A
CNN is simply a neural network which implements convolutional layers: maps from one activation
to the next which are generated by convolving the lower layer's activation with a set of kernels,
usually followed by the application of a non-linear function, e.g.

h_k = σ(h_{k−1} ∗ w_k + b_k)    (1.22)

where h_k is the activation, or hidden state, at layer k, h_{k−1} is the previous layer's activation,
w_k is the k-th layer's kernel, and σ(·) is a non-linear function which usually saturates to -1, 1 or
0 for very large positive and negative inputs. Convolutions can be implemented with various
kernel sizes, strides, dilations, and paddings, which are used to control the dimension and size of
the final output. Typically repeated convolutions, in conjunction with pooling layers, are used
to gradually reduce the spatial dimension of the original input to some much smaller dimension
(fig. 1.7), while increasing the dimension along the filter depth, which forms a dense encoding of
information useful for the task at hand, such as image classification or segmentation. CNNs have
dominated the field of machine vision and have been applied with great success to a large variety
of image processing tasks over the past eight years or so. A full review of CNNs is far beyond
the scope of this thesis, and the details of their implementation, training, and application would
require several textbook chapters to cover (see Goodfellow [49]). However, convolutional neural
networks have a few simple properties which can help to understand why they are so useful for
processing images and indeed any grid-like structured data.
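The layer map of eq. (1.22) can be sketched in a few lines of NumPy (a one-dimensional, single-channel toy with a tanh non-linearity; real CNN layers use multi-channel kernels, and the function name here is invented for the example):

```python
import numpy as np

def conv_layer(h_prev, w, b, sigma=np.tanh):
    """One convolutional layer in the sense of eq. (1.22):
    h_k = sigma(h_{k-1} * w_k + b_k), for a 1D single-channel activation."""
    return sigma(np.convolve(h_prev, w, mode="same") + b)

h0 = np.array([0.0, 1.0, 0.0, -1.0, 0.0])
h1 = conv_layer(h0, w=np.array([0.5, 1.0, 0.5]), b=0.0)
# Because tanh saturates, every entry of h1 lies strictly between -1 and 1.
```

Stacking calls to such a layer, with pooling in between, is all a basic CNN forward pass amounts to.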
Convolutions are equivariant with respect to translations - it is easily shown that the convolution
operator commutes with the translation operator. This is an extremely useful property for images,
where the content and information of a picture should not depend on exactly which pixels contain
the information. A picture of a cat is still a picture of a cat regardless of whether the cat is to
the left, to the right, or centered in the image. In addition, convolutions are local operators -
their outputs only depend on a local window centered on each output element whose size is defined
by the size of the kernel. This allows for sharing of weights across the entire input image, which
greatly reduces the number of parameters which need to be learned by the network, increases
computational efficiency and parallelization, and helps to avoid overfitting the training data.
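The commutation with translation is easy to verify numerically; a small NumPy check (valid away from the array boundaries, where zero padding breaks the symmetry):

```python
import numpy as np

f = np.zeros(16)
f[5:8] = [1.0, 2.0, 1.0]          # a localized "feature"
g = np.array([0.25, 0.5, 0.25])   # a smoothing kernel

y = np.convolve(f, g, mode="same")
f_shifted = np.roll(f, 3)          # translate the input by 3 samples
y_shifted = np.convolve(f_shifted, g, mode="same")

# Convolving the shifted input yields the shifted output: the convolution
# commutes with translation (as long as the feature stays in the interior).
assert np.allclose(np.roll(y, 3), y_shifted)
```

The same identity holds in two dimensions, which is why a convolutional filter that detects a feature in one corner of an image detects it anywhere.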
In the case of computer vision tasks, information in a single pixel is essentially meaningless - it
is only the correlations among regions of nearby pixels in an image which have meaning. However,
meaningful correlations are generally localized. Convolutions can naturally capture these localized
correlations among the data due to their localized kernels. By stacking multiple convolutions, it
is possible to increase the receptive field of the higher-level layers in the network. This allows
information to be hierarchically propagated from large areas of the input, in small localized steps,
down to a single output. This provides a mechanism for aggregating global information contained
in a picture, such as identifying that an image is of a cat, from localized information such as the
texture of fur, the shape of the eyes, the corners of the ears, etc.
Figure 1.7: Illustrations of the convolution and transposed convolution operations. (A) A 2D convolution
is shown here between some input data (blue) and a kernel (shaded box), producing the response (green).
The input is 4 x 4, and the kernel is 3 x 3 with a stride of 1, producing a 2 x 2 response. (B) A transposed
convolution. Here, the input (blue) is 3 x 3 but is dilated as shown, and the kernel (shaded box) is also
3 x 3. The transposed convolution interpolates the input data to produce a 5 x 5 response. This is the
basis of upsampling in fully convolutional neural networks. Figure adapted from Vincent and Francesco
[38].
A sort of inverse operation exists, typically called a transposed convolution or fractionally
strided convolution (figure 1.7B), which is used in cases where upsampling is required.
Upsampling is simply the process of expanding the resolution of some grid-structured data, such as
enlarging an image, which requires some form of interpolation. A transposed convolution performs
this interpolation, but the interpolation can be learned by the network, rather than remaining
fixed, which may produce better results for the intended application.
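A one-dimensional sketch of this operation (zero-insertion followed by an ordinary convolution; the helper name and the choice of a fixed interpolation kernel are illustrative - in a network, the kernel would be learned):

```python
import numpy as np

def transposed_conv1d(x, g, stride=2):
    """Upsample x by inserting (stride - 1) zeros between samples (the
    dilation shown in fig. 1.7B), then convolve with the kernel g."""
    dilated = np.zeros(stride * (len(x) - 1) + 1)
    dilated[::stride] = x
    return np.convolve(dilated, g, mode="same")

x = np.array([1.0, 2.0, 3.0])
g = np.array([0.5, 1.0, 0.5])       # this fixed kernel linearly interpolates
y = transposed_conv1d(x, g)          # -> 1.0, 1.5, 2.0, 2.5, 3.0
```

With this particular kernel the result is exactly linear interpolation of the input; a learned kernel can instead adapt the interpolation to the data.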
1.3.3 Semantic Segmentation
Currently CNNs dominate state-of-the-art results in a wide variety of computer vision tasks,
including semantic segmentation. Semantic segmentation is the problem of taking an image and
partitioning it into one or more segments or objects at the pixel level. It can be thought of as a
classification problem where we aim to learn a probability distribution over every pixel that gives
us the probability of that pixel belonging to a set of possible classes \mathcal{Y}. We then assign a label to
each pixel based on the most likely class it belongs to, i.e.

    Y = \operatorname{argmax}_{y_i \in \mathcal{Y}} \; p(y_i \mid X)    (1.23)

where Y is the assignment of pixels and X is the input image, which is generally just the
(R,G,B) color channel intensities at each pixel. Through the use of convolutions, a network can
learn local filters that capture the dependencies of the label at each vertex on its neighboring
vertices in such a way that we produce smooth and coherent segmentations.
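The per-pixel decision rule of eq. (1.23) is simple to express; below is a NumPy sketch with random scores standing in for the network's output (the shapes and the softmax step are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 2, 4))   # (classes, height, width) class scores

# A softmax over the class axis gives p(y_i | X) at every pixel...
p = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

# ...and eq. (1.23) assigns each pixel its most probable class label.
labels = p.argmax(axis=0)             # shape (height, width), values in {0,1,2}
```

In a trained segmentation network, `logits` would be the output of the final 1 x 1 convolution rather than random numbers.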
Many network variations and architectures have been implemented for the task of semantic
segmentation (see [85] for a review), and semantic segmentation, like many problems in computer
vision, is a very active and dynamic field of research. One design which is particularly relevant
for the work in this thesis is the U-net architecture [97]. U-net models were initially developed
for medical/biomedical image segmentation. The U-net architecture (figure 1.8) comprises two
parts, a down-sampling path to capture local context over the image and a symmetric up-sampling
path that enables precise localization of pixel probabilities. The down-sampling or contracting
part has an FCN-like [109] architecture that extracts features with 3 x 3 convolutions. The
up-sampling or expanding part uses up-convolutions (i.e. transposed convolutions or fractionally
strided convolutions, fig. 1.7B), reducing the number of feature maps while increasing their
dimensions. Connections from layers in the down-sampling path are made to layers in the up-sampling
path to avoid losing contextual information and improve the flow of gradients while keeping the
coarse-grained features learned in the up-sampling layers. Finally, a 1 x 1 convolution processes
the feature maps to generate a segmentation map that categorizes each pixel of the input image
[85]. U-net was initially trained on transmitted light microscopy images, and it won the ISBI cell
tracking challenge 2015 by a large margin.
Figure 1.8: (A) An image segmentation example. The picture on the left is given as input, and the
ground-truth segmentation which is used for training is given in the middle. The prediction of a fully CNN
that has been trained for semantic segmentation is on the right. The different colors represent different
class assignments for each pixel (human, horse or background). Adapted from [109]. (B) A schematic of
the U-net architecture, adapted from [97]. Each box represents a multi-channel feature map (the output
of a convolution or pooling layer), with the number on top of each box representing the number of filters
in that layer. The spatial dimension of the feature maps is decreased until a 30 x 30 x 1024 activation is
produced. The activation is then upsampled until a 1 x 1 convolution is applied which assigns a probability
distribution to each pixel.
U-nets are one example of fully convolutional neural networks - they rely entirely on convolutions
and contain no fully-connected layers. Rather, they make use of 1 x 1 convolutions to
transform the learned feature maps in the final layer of the network into an N-class probability
distribution over each pixel. They inherit all the benefits of convolutions in addition to being
independent of the size of the input, which makes them very versatile architectures. U-nets are
part of a larger class of encoder-decoder architectures which all make use of some form of down-sampling
followed by up-sampling. Another approach is to incorporate a graphical model known
as a conditional random field (CRF) [117] into the network, either as a learned layer or as a
post-processing step. A variety of methods have used CRFs to improve the segmentation over
purely convolution-based approaches [18, 109, 74, 118].
1.3.4 Applications to Protein-DNA Binding
Structural data of protein-DNA complexes encodes much information about the physical
mechanisms that drive protein-DNA binding. Traditional approaches for utilizing structural data are
based on physical models, such as continuum electrostatics or molecular dynamics. These
simplified physical models are applied to experimentally derived structures in order to compute free
energies or observe dynamical properties of the system. Physical models, however, are limited
in several ways. A true quantum mechanical treatment of biological macromolecules is
computationally infeasible. Therefore, most physical models rely on simplifications such as the Born-
Oppenheimer approximation [13] to trade realism for computational tractability. A downside to
this is that many physical models rely heavily on parameterization in the form of force fields,
which need to be carefully tuned and may only apply to specific molecules under specific
circumstances. Physical modeling also tends to scale poorly with the size of the system. For example,
in continuum electrostatics the memory requirements scale as O(n^3), where n is the smallest
dimension of the simulation box, and in molecular dynamics similar issues arise because the number
of water molecules also scales with the volume of the simulation box. For dynamical approaches,
sampling can be an issue as many force fields are susceptible to being trapped in local minima.
Machine learning approaches can overcome these drawbacks, offering methods which scale
linearly with the size of the system, are parameterization-free and do not require sampling. Structural
data varies widely in size, shape and sequence composition and hence is heterogeneous in many
respects. The salient properties of CNNs - their invariance to translations, being unrestricted by
the size of their input, and their ability to learn local features - make them extremely attractive
models for applications involving structural data. In chapter 2 I present DNAproDB, a tool we have
developed for processing, annotating, and curating structural data of protein-DNA complexes
which can be used for creating large datasets easily, and in chapter 3 I present a method based
on convolutional neural networks for predicting DNA binding sites from protein structure.
Chapter 2
DNAproDB: a Database and Web-Based Tool for
Structural Analysis of DNA-Protein Complexes
This chapter describes our most recent work on DNAproDB, a structural analysis tool we have
developed for analyzing the structures of DNA-protein complexes. DNAproDB was first published
in Volume 45 of Nucleic Acids Research [102], and this chapter describes work since then that
expands and improves the original tool, which is available in Nucleic Acids Research Volume 48
[103]. DNAproDB is a web-based tool which researchers from all over the world are using to
analyze structures and build datasets from the large volume of structural data and associated
annotations our database provides.
2.1 Abstract
DNAproDB (https://dnaprodb.usc.edu) is a web-based database and structural analysis tool that
offers a combination of data visualization, data processing, and search functionality that improves
the speed and ease with which researchers can analyze, access and visualize structural data of
DNA-protein complexes. In this paper we report significant improvements made to DNAproDB
since its initial release. DNAproDB now supports any DNA secondary structure, from typical
B-form DNA to single-stranded DNA to G-quadruplexes. We have updated the structure of
our data files to support complex DNA conformations, multiple DNA-protein complexes within
a DNAproDB entry, and model indexing for analysis of ensemble data. Support for chemically
modified residues and nucleotides has been significantly improved, along with the addition of new
structural features, improved structural moiety assignment, and use of more sequence-based
annotations. We have redesigned our report pages and search forms to support these enhancements,
and the DNAproDB website has been improved to be more responsive and user-friendly. DNAproDB
is now integrated with the Nucleic Acid Database, and we have increased our coverage of
available Protein Data Bank entries. Our database now contains 95% of all available DNA-protein
complexes, making our tools for analysis of these structures accessible to a broad community.
2.2 Introduction
Analyzing structures of DNA-protein complexes provides valuable insight into the physical
mechanisms that drive fundamental biological processes such as chromatin structural organization and
DNA transcription, replication and repair. Atomic-resolution models of proteins bound to their
cognate DNA binding sites help to elucidate the relationships among sequence, structure and
biological function, distinguish different mechanisms of DNA recognition, offer a deeper physical
understanding of existing experimental results, and give insights into the molecular machineries
which drive living cells. The Protein Data Bank (PDB) [11] is an archival repository that
currently contains approximately 4,800 structures of proteins bound to DNA (excluding those also
containing RNA). These structures vary widely with respect to many features: molecular and
biological function of the DNA binding proteins; tertiary and secondary structure of the bound
DNA; protein and DNA sequence; and structure size.
A number of databases have been developed which provide data for structures of DNA-protein
complexes from the PDB. PDIdb [90] is a database that provides information on effective atomic
interactions for each DNA-protein interface in a complex and classifies proteins by function and
structure. Users can search the database for entries based on features of the interface, DNA
or protein. The 3D-footprint database [25] provides structure-based binding specificities for all
DNA-protein complexes in the PDB and static figures that display DNA-protein interactions in
the complexes. NPIDB [130] contains structural information on DNA-protein and RNA-protein
complexes, and computes hydrogen bonds, water bridges and hydrophobic interactions. PDBsum
[68] is a database that summarizes the contents of each macromolecular structure deposited in
the PDB, including DNA-protein complexes, and provides various analysis tools. The Nucleic
Acid Database (NDB) [10] provides detailed structural annotations on nucleic acid structure,
annotates protein function, and can be searched for DNA-protein complexes.
DNAproDB [102] is a database, structure processing pipeline and web-based visualization tool
designed to aid structural analysis of DNA-protein complexes, visualize features of DNA-protein
interactions, and generate structural datasets that meet specific criteria based on a variety of
biophysical features and annotations. DNAproDB is unique in its combination of rich structural
features and annotations; detailed structure-, function- and sequence-based search capabilities;
and interactive, customizable visualizations. It is built on an automated end-to-end structure
processing pipeline that is designed to handle the complexity inherent in structural data. The
pipeline takes as input the atomic coordinates of a three-dimensional DNA-protein complex,
extracts from it a large variety of structural and biophysical features, and combines these data in a
hierarchical data format.
Features include information such as structural annotations (e.g. identifying a region of DNA
as single-stranded or double-helical, and identifying protein secondary structure elements),
sequence information (e.g. GC-content and identification of A-tracts) and statistical information
about the DNA-protein interfaces in the complex (e.g. residue propensity and interface
hydrophobicity). Information about individual nucleotide-residue interactions is also provided, such as
hydrogen bonding, interaction geometry, buried solvent-accessible surface areas and identification
of the interacting residue and nucleotide moieties (see sec. 2.3.5). DNAproDB includes
annotations from other databases where available, such as UniProt [24], CATH [114], and the Gene
Ontology knowledgebase [3, 2], which enhance the search capability of our database and
integration with the biological community. The DNAproDB database provides data for over 4,500 PDB
entries that contain DNA-protein complexes. Our database can be searched for entries meeting
specific criteria based on features of the DNA, protein or DNA-protein interactions. Data
for the resulting entries can then easily be downloaded in JSON format [39] and parsed offline,
or explored and visualized with in-browser visualization tools provided through the DNAproDB
website (https://dnaprodb.usc.edu). We also offer an option for users to upload and process
individual structures using the same processing pipeline used to build our database and to analyze
their structure with a privately generated report page (see sec. 2.4.2).
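As a rough illustration of what working with such a download might look like, the sketch below parses a JSON document with Python's standard library. The field names here are invented for the example and are not the actual DNAproDB schema:

```python
import json

# Hypothetical record; real DNAproDB data files are far richer than this,
# and use their own field names (see the data format description below).
record = json.loads("""
{
  "pdb_id": "XXXX",
  "models": [
    {"interfaces": [
      {"interactions": [
        {"residue": "ARG", "nucleotide": "DG", "hbonds": 2},
        {"residue": "SER", "nucleotide": "DA", "hbonds": 1}
      ]}
    ]}
  ]
}
""")

# Walk the (hypothetical) hierarchy and tally hydrogen bonds per interface.
for model in record["models"]:
    for interface in model["interfaces"]:
        total = sum(x["hbonds"] for x in interface["interactions"])
        print(record["pdb_id"], "interface hbonds:", total)
```

Because the data files are plain JSON, any language with a JSON parser can consume them in the same nested-loop style.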
In this report we describe major improvements and upgrades to DNAproDB that broadly
improve its utility, scope and the raw amount of data available to users. The original version of
DNAproDB, as described in [102], was designed to process structures of proteins bound to double-
stranded helical DNA. Our newly updated processing pipeline and database can accommodate any
DNA structure, from double-stranded B-form DNA to single-stranded DNA to G-quadruplexes,
and now supports multiple DNA-protein complexes per DNAproDB entry. This and other
improvements have increased our coverage of PDB structures containing DNA-protein complexes
from 51% of available structures in the original version of DNAproDB to more than 95% in the
current version. Accordingly, our visualization tools have been completely redesigned to handle
a much wider variety of DNA-protein structures. In the process of implementing these changes,
we have made many additional improvements to our structure processing capabilities, data
format, visualization tools and website interface, which are described in the following
sections.
2.3 Structure Processing Upgrades
The DNAproDB structure-processing pipeline (Figure 2.1) takes as input the atomic coordinates
of DNA-protein complexes in PDB or mmCIF [122] format and extracts from them a variety of
structural and biophysical features. These features describe aspects of the DNA, protein(s) and
DNA-protein interactions observed in the structure. The pipeline is an end-to-end processing
workflow, and the resulting data which is generated from an input structure is combined in
a single document which we refer to as a DNAproDB data file. One input structure produces
one output data file, and if a structure contains multiple DNA-protein complexes then data for
each complex is contained in the same data file. The data files are in JSON format [39], and a
description of their content is given below. Our database is built from a collection of these data
files, and their structure is designed to be convenient for users who wish to use our data
for their own analysis, enable elegant search capabilities of our database, and allow for detailed
reporting and visualization using our web-based analysis tools. Below we describe various notable
aspects of DNAproDB which have been improved, added or modified in our latest revision.
2.3.1 Structure preprocessing
DNAproDB performs several preprocessing steps on structures before attempting to calculate
features from them. These preprocessing steps may result in some minor differences between the
resulting structure and those available for download from the PDB (or the original data in the case of an
uploaded structure), so we note our preprocessing procedure here for clarity and completeness.
First, DNAproDB automatically generates biological assemblies from the asymmetric unit using
the symmetry operations provided by each processed entry. Often a biological assembly may be
identical to the asymmetric unit, although some may contain multiple copies of the asymmetric
unit or only part of it. Therefore, multiple copies of a DNA or protein chain may occur in a
biological assembly, and DNAproDB assigns a unique identifier to each one. The identifier of the
Figure 2.1: A schematic overview of the DNAproDB structure processing pipeline and front-end
interface. The main stages are structure preprocessing, where structures are prepared for processing; feature
extraction, where various biophysical features are calculated; data retrieval, which pulls additional
annotations from existing databases for protein chains; and data aggregation, where features are combined in
a standard data format. The DNAproDB database stores processed structural data for more than 4,500
PDB entries containing DNA-protein complexes. Users can search the database on features of the DNA,
protein or DNA-protein interactions, can generate reports for the returned results, and can upload their
own structures for private analysis. The report page contains functionality for downloading extracted
features as a JSON file [39] and for visualizing data using interactive visualization tools which can export
static figures.
parent chain in the asymmetric unit is recorded in the DNAproDB data file. Currently, only one
biological assembly per structure is used (users may upload any alternate biological assembly to
our web server and generate a report for it if they wish). For structures that are uploaded to our
web server, the provided coordinates are assumed to already be those of the biological assembly,
and no symmetry operations are applied. Multiple models may exist within a biological assembly
which represent different conformations of the assembly. This is common for NMR structures
and is useful for analyzing snapshots of a simulation such as a molecular dynamics trajectory.
Structures obtained from X-ray crystallography generally have only one model.
Next, DNAproDB removes components of the structure that are not part of the protein, DNA,
solvent or a known coordination center such as a zinc ion. Any small ligands or other chemical
entities that are not chemically derived from an amino acid or nucleic acid are removed and ignored.
Additionally, components which are missing too many atoms, clash with other components, or
do not appear across all models are removed. Any removed component, and the reason it was
removed, is recorded in the DNAproDB data file. Hydrogen atoms are added to all structures
using the program Reduce [124]. For structures which are uploaded to our web server by a user,
we provide an option to repair any missing heavy atoms using the program PDB2PQR [36, 35];
however, we do not add missing heavy atoms to structures used to build our database.
2.3.2 DNAproDB data overview
The features DNAproDB extracts from structures are organized in a hierarchical manner, with
the three main feature categories being DNA features, protein features, and DNA-protein
interface features (Figure 2.2). We have significantly improved and updated the way we organize
DNAproDB data files, and we now include many more fields and support model-level features,
meaning features which can vary from one model in the structure to another. Figure 2.2 shows
an overview of the conceptual hierarchy that defines how features are organized in DNAproDB.
Figure 2.2: The data hierarchy used by DNAproDB, which inherits part of its overall structure from that
used by the Protein Data Bank. Lines with a single arrow indicate one-to-many relationships, and lines
with two arrows indicate one-to-one relationships. The PDB data hierarchy begins at Entry, which is often
referred to as simply "the structure" or "a structure" and contains all information about a macromolecule.
An entry may have multiple Models, which are different conformations of the structure (many entries
contain only a single model). Models contain Chains, which are generally polymerized chains of amino acids
or nucleic acids and may also contain ligands, solvent or other small molecules. Finally, a chain contains
Components, which are the molecular units that make up the chain - protein residues, DNA nucleotides
or other small molecules. DNAproDB directly inherits the Entry-Model levels of the PDB hierarchy. From
there, models contain Structural Entities (see sec. 2.3.2), which are distinct structural components within
a particular model. DNA (protein) structural entities contain DNA strands (protein chain segments),
which are derived from a DNA (protein) chain but distinct in that a strand (chain segment) may not
contain any backbone breaks. DNA strands (protein chain segments) then contain nucleotides (residues)
at the lowest level of the hierarchy. Nucleotides and residues have a one-to-one correspondence with the
component level of the PDB hierarchy, but structural entities have no direct correspondence. DNAproDB
provides data on DNA-protein interfaces between every interacting DNA/protein structural entity pair,
and this data is grouped into individual nucleotide-residue interaction features and global features of
the interface. For a more detailed description of the DNAproDB data and features, see Supplementary
Figure A.2.
Two important concepts new to DNAproDB are that of a DNA structural entity and a protein
structural entity. DNA structural entities are a collection of DNA strands which are connected
via base-pairing or base-stacking. We define a DNA strand as a collection of nucleotides which are
connected via a continuous sugar-phosphate backbone of phosphodiester bonds, with no backbone breaks.
A DNA structural entity thus forms a discrete structural component of the overall structure.
Base-pairing and base-stacking between nucleotides is identified using the program DSSR [76].
DNA structural entities can be thought of as undirected connected subgraphs, where every node
corresponds to a nucleotide and edges between nodes indicate either a base-pairing or base-stacking
interaction or a phosphodiester bond. Many structures will contain only a single DNA entity, but
some may contain several. We make a distinction between a DNA "strand" and a DNA "chain"
(as defined by the input file in PDB/mmCIF format) because, while ideally a DNA strand and
a DNA chain have a one-to-one correspondence, missing nucleotides or backbone breaks may
produce several DNA strands within a chain. Similar to DNA structural entities, a protein
structural entity is a collection of protein chain segments which interact with one another to form
a closed molecular surface, where a protein chain segment is a continuous segment of a protein
chain with no backbone breaks. Thus, each discrete closed protein molecular surface (as defined
by the solvent-excluded surface) in the structure corresponds to a protein structural entity. See
Supplementary Figure A.1 for an example of a structure with two DNA structural entities and
one protein structural entity.
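In graph terms, identifying DNA structural entities amounts to finding connected components; a toy sketch of that idea (pure standard library, with invented nucleotide labels - the real pipeline derives edges from DSSR output and bond connectivity):

```python
from collections import defaultdict

# Toy interaction graph: nodes are nucleotides, edges are phosphodiester
# bonds, base pairs, or base stacks.
edges = [("A1", "A2"), ("A2", "A3"),   # phosphodiester bonds, strand A
         ("B1", "B2"),                  # strand B
         ("A1", "B2"), ("A2", "B1"),    # base pairs joining A and B
         ("C1", "C2")]                  # a separate, unpaired strand

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def entities(adj):
    """Return the connected components of the graph: each component is one
    DNA structural entity (a set of nucleotide labels)."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Two entities result: the paired A/B duplex, and the lone C strand.
components = entities(adj)
```

Here the base pairs merge strands A and B into a single entity, while C1-C2 remains a separate entity, mirroring the definition above.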
A DNA-protein complex is then defined as a DNA structural entity and a protein structural
entity that are in sufficient proximity and share at least one nucleotide-residue interaction. Each
DNA-protein interface described in a DNAproDB data file corresponds to one interacting DNA
entity-protein entity pair. Descriptions of the interface are grouped into two sets of features:
individual nucleotide-residue interaction features, and global features of the interface, which
include aggregations of some nucleotide-residue interaction features, geometrical features of the
protein surface, and other statistical descriptions. Note that a nucleotide-residue pair is considered
an "interaction" based on the minimum distance between them; DNAproDB uses a cutoff of 4.5
Å. For a detailed description of the DNAproDB data hierarchy and a list of features currently
provided, see Supplementary Figure A.2.
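The distance criterion itself is straightforward; a NumPy sketch (the helper name and the toy coordinates, in Å, are invented for illustration):

```python
import numpy as np

CUTOFF = 4.5  # Å, the minimum-distance interaction cutoff described above

def is_interaction(res_atoms, nuc_atoms, cutoff=CUTOFF):
    """A residue-nucleotide pair counts as an interaction when the minimum
    atom-atom distance between the two components is within the cutoff."""
    diffs = res_atoms[:, None, :] - nuc_atoms[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)   # all pairwise atom distances
    return bool(dists.min() <= cutoff)

res = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])   # two residue atoms
near = np.array([[5.0, 0.0, 0.0]])                    # 3.5 Å from nearest atom
far = np.array([[10.0, 0.0, 0.0]])                    # 8.5 Å away

print(is_interaction(res, near), is_interaction(res, far))  # True False
```

Broadcasting over the two atom lists computes every pairwise distance at once, so the check runs in a single vectorized pass.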
2.3.3 DNA secondary structure
The original release of DNAproDB included only DNA-protein complexes where the DNA was in
a helical, double-stranded conformation. This is the most common DNA structural conformation,
accounting for roughly 74% of DNA-protein complexes currently available in our database
(see Figure 2.3). Additionally, the original release could not support structures with more than
one double-stranded helix. We have redesigned the way DNAproDB processes DNA structures in
order to accommodate virtually any DNA secondary structure, to classify DNA structural entities
(see sec. 2.3.2 for a description of this term) based on their secondary structure, and to support
an arbitrary number of both DNA structural entities and protein structural entities within a
structure. These improvements have not only vastly increased our total coverage of currently
available PDB entries containing DNA-protein complexes (currently 95% coverage) but have also
greatly enriched the diversity of DNA binding proteins and DNA structural conformations which
we now provide data for, and we have developed many new features to support this increased
diversity.
We provide coarse classifications of DNA structural entities based on their secondary structure.
An entity can be classified as helical (meaning double-stranded), single-stranded, helical/single-
stranded (referring to a helical conformation with at least one single-stranded overhang), or other,
which encompasses a wide variety of conformations that are either irregular, unnamed, or not
abundant enough to warrant their own class. For more details on how we classify structural
entities, see Supplementary Methods.
Figure 2.3: A variety of statistics and distributions for a set of select DNAproDB features over the entire
database. A. Shown here are two features describing DNA structural entities: the number of nucleotides
in the entity and the entity type. The mean nucleotide count is 34, with peaks at approximately 35 and
300, the latter being mainly from nucleosome structures. The most common entity type is helical, making
up 74% of all DNA structural entities, while single-stranded DNA is the next most common, making up
13.2%. B. This panel shows pair-wise nucleotide-residue interaction features. Both heatmaps show values
of interactions between the major groove moiety of nucleotides in a helical conformation and the side
chains of each standard residue. The top heatmap shows the mean number of major groove-side chain
hydrogen bonds, and the bottom shows major groove-side chain buried solvent-accessible surface area.
Note that the hydrogen bond values for alanine, glycine, leucine, isoleucine, phenylalanine, proline and
valine are all zero because these residues do not contain any hydrogen bond donors/acceptors in their
side chains. The especially high value of guanine-arginine, major groove-side chain hydrogen bonds is
due to the ability of arginine to form bidentate bonds with the guanine's major groove moiety (when the
guanine is in a Watson-Crick base-pairing conformation). C. The four plots in this panel show various
features which describe properties of DNA-protein interfaces as a whole. The first is the interaction
density of the interface. This is the number of nucleotide-residue interactions divided by the number
of nucleotides times the number of residues; it measures how many interactions are present versus the
total number possible and lies between zero and one. Most interfaces are rather sparse, which is typical
of helical DNA interfaces, while single-stranded DNA interfaces tend to be denser. The next plot shows
the protein secondary structure composition of interfaces. "α-helix" refers to interfaces which are made
up primarily of alpha helices, "β-sheet" primarily of beta sheets, etc. Helix and helix/strand hybrid are
the most common interface types, while a predominantly strand composition is the least common. The
final two plots compare the mean number of nucleotide interactions per residue for helical DNA and
single-stranded DNA interfaces. Single-stranded DNA interfaces overall tend to involve fewer nucleotide
interactions per residue, which may indicate that individual residues contribute less to the overall binding
affinity than for helical DNA on average.
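The interaction density described in the caption above reduces to a one-line calculation; a sketch with made-up example numbers:

```python
def interaction_density(n_interactions, n_nucleotides, n_residues):
    """Observed nucleotide-residue interactions divided by the total number
    possible; bounded between 0 and 1, as in Figure 2.3C."""
    return n_interactions / (n_nucleotides * n_residues)

# e.g. a sparse helical-DNA interface: 30 interactions, 20 nt, 150 residues
d = interaction_density(30, 20, 150)   # 0.01
```

A denser single-stranded interface with the same number of interactions but fewer components would score correspondingly higher.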
2.3.4 Chemically modified components
DNAproDB now supports a much wider range of amino acids and nucleic acids that are chemical
modifications of the standard 20 amino acids or 4 DNA nucleotides. A chemical modification can
be a substitution of a chemical group, such as the replacement of the terminal nitrogen with an
oxygen atom in arginine citrullination, or an addition of a chemical group, such as the addition of a
methyl group to the C5 atom of cytosine, forming 5-methylcytosine. The modified component
must have an entry in the PDB Chemical Component Dictionary [123], and must not significantly
deviate from its parent component so as to make identification of structural moieties (see
below) ambiguous. DNAproDB requires a small amount of parameterization for the calculation of
solvent-accessible and -excluded surface areas (van der Waals radii) and hydrophobicity
calculations (residue hydrophobicity). Parameters for chemical modifications are generally not available;
however, we attempt to use approximate values where appropriate. Therefore, some feature
values may only be approximate for modified components. For a more detailed explanation of our
approximation scheme, see Supplementary Methods.
2.3.5 Identification of structural and interaction moieties
A core and unique aspect of DNAproDB is the identification of DNA–protein interaction features
broken down by both secondary structure and structural moiety. A structural moiety is a term
we use to describe a chemical group or structural component of a nucleotide or residue which is
distinguishable from the whole. Protein residues (amino acids) have two structural moieties: the
main chain, which is the amine-carboxyl group that forms the protein backbone, and the side chain,
beginning at the alpha carbon atom. With the addition of new DNA secondary structure conformations
to DNAproDB, nucleotides now have either three or four structural moieties depending on
their secondary structure. Helical (by which we mean double-stranded helices) DNA nucleotides
have phosphate, sugar, major groove and minor groove moieties. The groove moieties represent
the edges of the paired bases which are exposed in the respective groove. Single-stranded
and unclassified ('other') DNA nucleotides have phosphate, sugar and base moieties. Note that
for helical DNA, the detection of the groove moieties has been improved to account for glycosidic
bond angle and relative base orientation (see Supplementary Methods and Supplementary Figure
A.3).
For a given nucleotide–residue interaction pair, DNAproDB identifies the interacting moieties
within the pair. For example, given an adenine–arginine interaction, DNAproDB may identify
that the interaction involves a sugar–side chain interaction and a minor groove–side chain
interaction. Our procedure for identifying moiety interactions has been greatly improved in the
newest release of DNAproDB. Moiety interactions are now determined by hydrogen bond, van
der Waals interaction and buried solvent accessible surface area values. Cut-off values for these
features for every nucleotide–residue pair are determined using the distribution of values among
a large sample of nucleotide–residue interactions. Figure 2.3B shows mean values of hydrogen
bonds and buried solvent accessible surface areas for major groove–side chain interactions for
every nucleotide–residue pair type. For more details of how we identify and classify interaction
moieties, see Supplementary Methods.
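The cutoff-based moiety assignment described above can be sketched as follows; the feature names and cutoff values are illustrative only, since DNAproDB derives its actual cutoffs from distributions over a large sample of interactions:

```python
def involves_moiety(features, cutoffs):
    """Decide whether a nucleotide-residue interaction involves a given
    moiety pair (e.g. major groove-side chain).

    `features` holds the observed values for this interaction and `cutoffs`
    the per-pair thresholds, e.g. {"hbonds": 1, "basa": 10.0} (number of
    hydrogen bonds, buried solvent accessible surface area). The moiety
    pair is assigned if any feature meets its cutoff.
    """
    return any(features.get(name, 0) >= cut for name, cut in cutoffs.items())

# Illustrative cutoffs for a guanine-arginine, major groove-side chain pair:
cutoffs = {"hbonds": 1, "basa": 10.0}
assigned = involves_moiety({"hbonds": 2, "basa": 25.3}, cutoffs)
```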
2.4 Database and Web Interface
The DNAproDB database is a document-oriented database built using MongoDB [19]. At the
time of publication, it provides data for 4,509 PDB entries that contain at least one DNA–protein
complex and is updated regularly as new PDB entries are released. Every entry in our
database corresponds to an entry in the PDB, and all the data we provide are encapsulated in
a single JSON document for that entry. The structure of these data files is described in the
following sections and in Figure 2.2 and Supplementary Figure A.2. Figure 2.3 shows various
statistics and distributions of selected features which give a brief overview of the DNAproDB
database. The DNAproDB database is quite heterogeneous, with structures ranging in size from
only two nucleotides to several hundred, DNA tertiary and secondary structures ranging from
single-stranded DNA to three-way Holliday junctions, and proteins ranging from transcription
factors to DNA recombinases. This heterogeneity is completely transparent to the user, however,
as every entry follows a standard data format which has been designed to be flexible and to
accommodate the wide variation that exists in the structural dataset DNAproDB is built on.
Users can access the data in our database in several different ways. First, we offer the entire
database for download as a flat file at our download page: https://dnaprodb.usc.edu/download.html.
When uncompressed, each line of this file contains a JSON document corresponding to one entry
in the database. Second, users can search our database using our search page at
https://dnaprodb.usc.edu/search.html. Here users can construct a query based on a variety of
features related to DNA, protein or DNA–protein interactions to generate a dataset meeting a
specific criterion or set of criteria. The user will be presented with a result page summarizing all
the returned entries which meet the specified criteria, and from this page can download data for
any or all of the returned entries. Third, users can download data for individual structures from
the report page for those structures (see sec. 2.4.2 for a description of these pages). Finally,
we provide a simple RESTful web API where advanced users can query our database directly
without needing to use our web interfaces. We refer interested readers to our documentation at
https://dnaprodb.usc.edu/documentation.html for more information about using this API.
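Because each line of the uncompressed flat file is a standalone JSON document, the database can be streamed entry by entry without loading it all into memory. A minimal sketch (the file name and the "pdbid" field are assumptions for illustration; check the documentation for the actual keys):

```python
import json

def iter_entries(path):
    """Yield one database entry (a dict) per line of the flat file."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Example usage (assumed field name; see the documentation):
# pdb_ids = [entry["pdbid"] for entry in iter_entries("dnaprodb_flat.json")]
```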
2.4.1 Searching the database
Users may search the DNAproDB database in a variety of different ways. The simplest search is
to directly provide a list of PDB identifiers, which will return all matching entries in our database
from the supplied list. Additional features can be specified as a filter, and only structures which
meet the additional criteria will be returned. More generally, a user will search for structures
based on features describing structural and sequence characteristics rather than knowing the
PDB identifiers in advance. Using our search form and combining different features, users can
create powerful searches for structures based on characteristics of the DNA, protein or DNA–protein
interactions. Our search form provides an easy user interface to build powerful queries
on a subset of available features, and includes inputs for DNA features, protein features and
DNA–protein interaction features.

Using our search form, DNA features can be included at several levels: entity-level features
which describe a DNA structural entity as a whole, such as the entity type (e.g. single-stranded);
strand-level features, which are features of individual DNA strands, such as sequence motif or GC
content; and helix-level features, which are features of helical segments within a DNA structural
entity. Protein features include chain-level features such as sequence clusters and UniProt, CATH
and GO annotations. These are extremely useful for including or excluding a particular
protein or protein family from a search. Interaction features can be included in the search at
the level of individual nucleotide–residue interactions or at the level of global interface properties.
Much more complex and detailed searches are possible using the MongoDB query language in
combination with our web API, but the searches which can be constructed using our search form
cover the majority of use cases.
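As an illustration of what such a query might look like in the MongoDB query language, the following filter document sketches the guanine–arginine search described below. The field names and document layout here are assumptions for illustration only, not the documented DNAproDB schema; consult the web API documentation for the actual fields.

```python
# Hypothetical MongoDB-style filter: entries containing at least one
# guanine-arginine hydrogen bond in the major groove, where the arginine
# belongs to a helical secondary structure element. Field names are
# illustrative, not the actual DNAproDB schema.
query = {
    "interactions": {
        "$elemMatch": {
            "nucleotide": "DG",
            "residue": "ARG",
            "moiety": "major_groove",
            "residue_ss": "helix",
            "hbonds": {"$gte": 1},
        }
    }
}
# This document would be passed to a find() call through a MongoDB
# client, or the equivalent search submitted via the web API.
```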
As one example of the type of searches DNAproDB supports, using the interaction feature
inputs a user could search for structures where an arginine forms at least one hydrogen bond with
a guanine in the major groove of the DNA, with the arginine belonging to a helical secondary
structure. Alternatively, the user could simply search for structures with any contacts between
the DNA major groove and a protein alpha helix. This search could be combined with DNA
features, such as requiring that the DNA structural entity contain a helical segment between 8
and 20 base pairs in length and that the helix be in B-form. The user could also include
one or more DNA sequence motifs to match on. Using protein features, we could further restrict
this to return complexes containing homeodomain proteins which share a characteristic fold, by
specifying the relevant CATH annotation as a protein chain feature. The DNAproDB database
provides powerful search capabilities that no other structural database currently offers. Users can
search the database to quickly retrieve data for a particular DNA–protein complex, discover new
structures or generate datasets based on structural criteria.
2.4.2 Structure reports
DNAproDB presents the data available for any entry in our database, or for any structure uploaded
to our server, via a report page. The report page is the central web component of DNAproDB and
allows users to explore, visualize and interact with the data DNAproDB provides, as well as view the
structure in three dimensions using NGL viewer [99, 98]. The report page has been completely
redesigned for the newest release of DNAproDB in order to support many new features and
upgrades on the backend of our processing pipeline and database. Data are presented to the user
based on their selection, which is indexed by model, DNA structural entity, and one or more protein
chains within any protein entities interacting with the selected DNA entity. These selections allow
users to step through the data in manageable ways and display as much or as little information
as is relevant at one time.
The report page has three major components. The first is a set of tables which display information
about protein entities, chains and chain segments; DNA entities, strands, helices and single-stranded
segments; and DNA–protein interfaces in the structure. The tables present
data at the model level, updating any time the model index changes in the user selection. Citation
data are also provided, with references to the original publication and links to the PDB and NDB
entries if the report is for a DNAproDB database entry.
The second component of the report page is our interactive visualizations. We currently offer
three visualization types: the residue contact map, the helical contact map, and the helical shape
plot. The residue contact map shows individual nucleotide–residue interactions, DNA secondary
structure, protein secondary structure and interaction moieties all in one figure. The DNA is
displayed as a graph, with individual nucleotides being nodes in the graph, and edges between them
indicating backbone links, base pairing or base stacking, each with a distinct color. Different base-pairing
geometries such as Watson–Crick, Hoogsteen or other non-standard pairing conformations
are indicated via the base-pair edges, and other structural features such as backbone breaks,
missing phosphates, the DNA strand sense, and nucleotide structural moieties (see sec. 2.3.5,
Figure 2.4B and Supplementary Figure A.4) are all represented graphically. Protein residues
are displayed as small nodes, with the node shape and color representing the residue secondary
structure; edges between residue and nucleotide nodes represent an interaction between the
two and indicate which DNA moiety(s) the interaction involves.
The helical contact map is a compact visualization in which interactions with protein secondary
structure elements (SSEs) are plotted along the helical axis for individual DNA helices present
in the selection (if none exist then this visualization is not available). Interactions of SSEs with
different DNA moieties are represented in concentric annuli, and the position of each plotted SSE
is given in helicoidal coordinates, a curvilinear coordinate system defined by the axis of a
helical DNA segment. This visualization is a unique and compact way to summarize the coarse-grained
interactions of helical DNA regions. See Sagendorf et al. [102] for a detailed description
of this visualization type.
The helical shape plot is the newest visualization type in DNAproDB and plots DNA shape
parameters (such as major and minor groove width, or base-pair shape parameters) along the
sequence of a selected helix within the user selection. In addition, DNA–protein residue
interactions are plotted, showing approximately where each residue in the interface interacts in
sequence space along with the secondary structure of that residue. Protein residue interactions can be
toggled off if only DNA shape parameters are desired, or they can be displayed to indicate possible
DNA shape readout, such as the presence of positively charged residues in regions of narrow minor
groove width. Supplementary Figure A.5 shows an example of a shape overlay plot which nicely
illustrates such a readout mechanism for a Hox–Exd heterodimer (PDB ID 2R5Z).
Figure 2.4: Example visualizations provided by DNAproDB that have been generated through our
browser-based tools. DNAproDB visualizations are interactive, customizable and can be exported directly
as PNG images. These visualizations were taken from the report page of the conjugative relaxase
TrwC–DNA co-crystal structure (PDB ID 1OMH). Selected DNA moiety interactions were base, major
and minor groove; all protein secondary structure types were included; and interactions were filtered by a
minimum nucleotide–residue distance of 4.0 Å or less. The size (pixel area) of each residue or SSE symbol
has been chosen to indicate the number of nucleotides it interacts with. A. A three-dimensional view of
the structure produced by NGL [99, 98]. The colored regions are the interacting residues which are displayed
in the remaining three panels and are colored using the same color scheme. B. DNAproDB uses a
graph-based approach for displaying the residue contact map. Information about nucleotide base-pairing,
base-stacking and backbone linkages is displayed via edges between nucleotides, and edges between nucleotides
and residues (shown here as red circles, blue squares, and green triangles; the shapes and colors
of the symbols indicate secondary structure) denote an interaction. The sense of the DNA strands can
be seen from the direction of the sugar-phosphate linkages between adjacent nucleotides in each
strand. A G/A mismatch can be seen near the top of the figure, flanked by two flipped-out bases and
indicated by a red dashed line signifying the non-Watson-Crick base pair. C. A polar contact map showing
major and minor groove interactions in the helical region of the structure. Protein secondary structure
elements are shown and plotted according to which DNA moieties they interact with. The major and
minor grooves are represented by concentric annuli and the position of each interacting protein secondary
structure element (SSE) within each annulus is determined by the helicoidal coordinates of the SSE [102].
Protein beta sheets are represented by triangles, helices by circles and loops by squares.
The polar distribution of the SSE interactions reflects that the protein binds to one side of the helix. D.
A helical shape overlay with the shape parameter Buckle plotted. The extreme value at the last position
of the sequence is due to an A/G mismatch, which is automatically indicated in red by DNAproDB. The
green triangle and blue squares represent sheet and loop residues which interact with the DNA and their
approximate location along the helical sequence. The same residues can be seen interacting with the
helical region of the DNA in B.
All of our visualizations are highly customizable and interactive. Custom color schemes, labels,
and plot orientations can be chosen. Figure 2.4 shows example visualizations for the conjugative
relaxase TrwC bound to a helical segment of DNA with a long single-stranded overhang (PDB ID
1OMH). The visualized interactions and plot components are determined based on the selected
DNA structural entity and protein chains; however, many additional criteria can be defined. Users
can decide which DNA moiety interactions should be included or which protein secondary structure
types to display. They can filter interactions based on the number of hydrogen bonds, buried
accessible surface area, center-of-mass, mean nearest-neighbor or minimum nucleotide–residue
distances, or interaction geometry. Supplementary Figure A.4 shows a variety of different criteria
applied to the same structure in order to filter which interactions are shown. This high degree
of customization allows users to create visualizations which display very specific views of the
DNAproDB data to highlight particular aspects of the interface. Visualizations can be exported
directly as a high-resolution static PNG file for use in publications or presentations. They are also
interactive: hovering the cursor over various parts of the visualizations will display additional
information, and clicking on residues or nucleotides will highlight them in the three-dimensional
view of the structure.

The final component of the report page is the data explorer, which allows users to traverse the
DNAproDB data file hierarchy and explore the raw data for that structure using a searchable JSON
viewer. Every item and visualization on the report page is generated from the data contained in
this data file, which the user can download from the report page for their own use.
2.4.3 Integration with the Nucleic Acid Database
DNAproDB can also be accessed through the Nucleic Acid Database (NDB) [10, 22]. Each
NDB entry is individually linked to its respective DNAproDB report page under NDB Structural
Features. The integration of DNAproDB with the NDB makes DNAproDB report pages directly
accessible through the PDB for any DNA-containing structure.
2.5 Discussion and Conclusion
DNAproDB aims to provide a variety of biophysical and structural features that are useful for
analyzing structures of DNA–protein complexes. These include commonly used features such as
hydrogen bonding, DNA shape parameters, DNA and protein secondary structure, nucleotide–residue
interaction distances and many more. DNAproDB and the features it provides can be used
for many purposes. The data we provide can be used as-is, and DNAproDB can be treated simply as
a processing pipeline, simplifying and automating the process of generating the set of features users
find relevant to their work. It can be used as a visualization and data exploration tool by generating
interactive plots and graphics of DNA–protein interfaces using the web-based tools available from
structure report pages. By taking advantage of our database's search capabilities, a user can
generate sets of PDB entries which meet specific criteria using the variety of available features
DNAproDB provides. The data we provide for each entry can also be used in more sophisticated
ways for clustering, regression, classification or other statistical analyses using external software
tools.
A key design consideration of DNAproDB was the ability for users to take our data and
perform analyses independent of the web-based tools which we have developed. The choice of our
data structure and format makes it easy for users with only a limited knowledge of JSON to parse,
search, and analyze the DNAproDB data. In Figure 2.5 we show three examples of different types
of analyses done using DNAproDB data. In each case, we used the flat DNAproDB database
file (which is available to users through our website) in combination with external software tools.
A full description of the methods used to perform these analyses is provided in Supplementary
Data. In Figure 2.5A we examine an "interaction motif" for the residue tyrosine. The heatmap
shown is for a three-nucleotide minor groove interaction, where tyrosine interacts with exactly
three nucleotides in the DNA minor groove, and displays a strong signal for a C-C-G interaction.
The right panel offers a plausible explanation for the presence of this motif: the hydroxyl group
of the tyrosine can form bidentate hydrogen bonds with the O2 atom of cytosine and the N2
atom of guanine along the C/G base pair's minor groove edge. These favorable interactions may
increase the likelihood of tyrosine binding to a CG-rich region of the DNA minor groove. Such
an interaction motif may be relevant for proteins which bind DNA sequences containing CG
regions and contain tyrosine in their binding domains, and is a good example of how we can infer
biologically relevant information from structural analysis using DNAproDB.
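The distance-binning procedure behind such a motif heatmap can be sketched in a few lines; the triplet input format used here is an assumption for illustration, not the exact flat-file layout:

```python
from collections import Counter

def motif_counts(triplets):
    """Count nucleotide frequencies per distance-sorted bin.

    Each triplet is a list of (nucleotide_type, com_distance) pairs for one
    residue interacting with exactly three nucleotides. The nucleotides are
    sorted nearest-to-furthest into bins 1-3 before counting, capturing the
    radial structure of the triplet rather than its sequence order.
    """
    bins = [Counter(), Counter(), Counter()]
    for triplet in triplets:
        for i, (nuc, _dist) in enumerate(sorted(triplet, key=lambda t: t[1])):
            bins[i][nuc] += 1
    return bins

# Two toy tyrosine interactions, each with three nucleotides and their
# center-of-mass distances (values are illustrative):
bins = motif_counts([
    [("G", 6.1), ("C", 3.2), ("C", 4.5)],
    [("C", 3.0), ("G", 5.9), ("C", 4.1)],
])
# bins[0] and bins[1] are dominated by C and bins[2] by G: a C-C-G signal.
```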
In Figure 2.5B we show how features of the DNA–protein interface vary with the biological
function of the protein, with proteins of different functions occupying distinct regions of the
feature space. Every point in this plot is a protein chain which interacts with DNA in a
DNAproDB entry, and the color represents a biological function or process the protein is involved
in (based on Gene Ontology annotations). Correlations can be seen between the function of the
protein chain and the way it interacts with DNA when the interface features are projected to a
low-dimensional space. Several distinct, though partially overlapping, clusters are evident in this
principal component analysis (PCA). The PCA plot in Figure 2.5B indicates that DNAproDB
features can capture differences in the binding mechanisms and characteristics of these proteins,
which are at least partially related to their biological function. This is important since we expect
that proteins which bind different forms of DNA under different circumstances should have
noticeably distinct binding mechanisms, and the set of features which describe the DNA–protein
binding should reflect those differences accordingly. In Figure 2.5C we calculate the probability
of protein residues with planar side chains to form stacking interactions with different nucleotide
bases in single-stranded DNA. These probabilities can be used to roughly estimate the free energy
of stacking via

\Delta G^{\mathrm{stack}}_{N,R} = -RT \ln\!\left( \frac{p(\mathrm{stack} \mid N,R)}{1 - p(\mathrm{stack} \mid N,R)} \right) \qquad (2.1)
Figure 2.5: Several example analyses done using only data provided by the DNAproDB database. A.
Shown here is a three-nucleotide 'interaction motif' for the residue tyrosine interacting in the minor
groove of helical DNA. The heatmap was generated by taking all available instances (99; interactions
were filtered for sequence redundancy of the interacting protein chain) of tyrosine residues which bind in
the DNA minor groove and interact with exactly three nucleotides simultaneously (identified from the list
of nucleotide–residue interactions provided under interface features). The three interacting nucleotides
were sorted by center-of-mass distance (an interaction feature) and placed into bins 1, 2 and 3 from the
nearest nucleotide to the furthest. In this way, we capture the radial distribution of the three-dimensional
structure of the interacting nucleotide triplet. The frequency of each nucleotide type (A, C, G or T) was
then plotted for each distance bin. In this example, we see a strong signal for a C-C-G interaction motif.
We note that this is not the same as a sequence motif, because these nucleotides need not be adjacent
or even on the same strand. On the right side of the panel an example is shown of the structure of this
interaction motif, with a tyrosine forming two hydrogen bonds between its OH group and the N2 atom of
the guanine and the O2 atom of cytosine. See Supplementary Data for a more detailed description of the
analysis. B. The first and second principal components from a PCA projection of 134 DNAproDB features
(or features derived from them) describing the DNA–protein interface for 4,758 different interfaces (which
were broken down by protein chain). Each point in the plot corresponds to one interface. The plots are
colored according to the GO annotations of the protein chain and are grouped into four categories based
on the annotated biological function of the protein. Correlations can be seen between the protein chain
annotations and the way it interacts with DNA as captured by the DNAproDB features when projected
to this low-dimensional space. Several distinct, albeit overlapping, clusters can be seen which correspond
to the different biological functions of the involved protein chains. See Supplementary Data for a more
detailed description of the analysis. C. Probabilities for a particular nucleotide–residue interaction to be
in a stacking conformation. More precisely, these are conditional probabilities P(stack | N, R) where N
is a nucleotide type and R is a residue type. DNAproDB assigns a geometry for every nucleotide–residue
interaction identified using SNAP, a component of the 3DNA program suite [77, 78]. The residues for
which probabilities are shown are those with planar side chains, so that a stacking conformation can be
defined. The conditional probabilities for each residue to stack with each nucleotide are shown as a stacked
bar chart, with the numbers inside the bars indicating the probability values and the numbers above
the bars the total number of interactions used for that residue type. Only interactions with nucleotides
in a single-stranded conformation were included in this analysis. It is interesting to note the variation
in stacking probabilities between different nucleotides for a given residue, most notably with tyrosine
preferring guanine stacking and histidine strongly preferring adenine stacking. See Supplementary Data
for a more detailed description of the analysis.
where N is a nucleotide type and R is a residue type. Naturally, we do not expect the sampling
from a limited number of structural examples to be good enough for an accurate estimation;
however, the notable differences in stacking probabilities may be relevant to the binding specificities
of certain single-stranded DNA binding proteins, which prefer sequences that, among other
mechanisms, can form stacking interactions with favorable energies. DNAproDB makes sampling
structural data easy, and one can perform many kinds of analyses thanks to the large volume
of structure-based data available in our database. The structures of biological macromolecules
are a rich source of information, and DNAproDB makes use of the wealth of available data to
enable structure-based computational biology. DNAproDB processes structures provided by the
PDB in a unique way and makes the features and annotations we generate available to be used
by researchers for further studies. The improvements and updates we have made to DNAproDB
have enriched our database with a much larger number of available entries. The addition of new
features, improved data organization, a stronger set of visualization tools, and more user-friendly
web interfaces make the newest release of DNAproDB a stronger tool for structural analysis of
DNA–protein complexes than any currently available.
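As a concrete illustration, Eq. (2.1) can be evaluated directly from stacking probabilities of the kind shown in Figure 2.5C (the probability value used here is illustrative, not taken from the figure):

```python
import math

R = 8.314e-3  # gas constant, kJ/(mol K)

def delta_g_stack(p, T=298.0):
    """Rough stacking free energy from Eq. (2.1).

    p is the conditional probability P(stack | N, R) estimated from
    structural counts; returns -RT ln(p / (1 - p)) in kJ/mol, so values
    below zero favor the stacked conformation.
    """
    return -R * T * math.log(p / (1.0 - p))

# A residue-nucleotide pair observed stacking in 70% of its interactions
# (an illustrative probability) gives a mildly favorable free energy:
dg = delta_g_stack(0.7)  # roughly -2.1 kJ/mol
```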
Chapter 3
Predicting DNA–Protein Binding From Structure Using
Geometric Deep Learning
3.1 Introduction
The molecular machinery of the cell has evolved to be highly modular, with many cellular components
being formed from a number of smaller subunits. Cells are made of complex arrangements of
proteins, nucleotides, polysaccharides, lipids and carbohydrates. The size and complexity of these
modular complexes range from simple dimers, such as the bZIP transcription factors, in which
two protein chains that form long alpha helices are held together in a scissor-like arrangement
known as a leucine zipper [41]; to larger structures such as the histone octamer, which is formed by
two copies of each of four histone core proteins and coils long stretches of double-stranded DNA
into a superhelical configuration; to the large ribosome complex, which is comprised of dozens of
proteins bound to ribosomal RNA.
The process by which these varied subunits adhere to each other and form a stable complex is
known as binding. Binding is generally both specific and oriented, meaning two components will
only bind if they are of the correct molecular structure and in the correct orientation with respect
to one another. Some proteins show very high degrees of binding specificity, such as transcription
factors which will only bind to particular sequences of double-stranded DNA (dsDNA),
while others, such as SSB in prokaryotes, bind single-stranded DNA (ssDNA) in a promiscuous
manner [106]. Binding is an essential process because within the cell there are many thousands
of proteins, nucleic acids, lipids and carbohydrates mixed together in solution, and when two or
more cognate components meet they need to selectively bind to one another reliably and strongly
enough to avoid simply diffusing apart almost immediately. In addition to being the mechanism
which holds the molecular machinery of the cell together, binding can also be used as a means
of signaling or sensing changes in the environment. A well-understood case is that of the lac
operon, where the small disaccharide allolactose binds to a receptor protein (the lac repressor) which
represses a set of genes that allow the cell to metabolize the sugar lactose [71]. Upon binding
of allolactose, the repressor no longer functions to repress the lactose metabolism genes, and the cell can
begin producing the required enzymes for lactose metabolism. Hence the binding of allolactose to
the lac repressor allows the cell to detect the presence of lactose in its environment and modify
its metabolism accordingly.
Despite the wide range in structure, function, evolutionary history and chemical properties of
the many molecular components which interact within the cell, we can conceptualize binding in
terms of some common mechanisms. Within the framework of the Born-Oppenheimer approximation
[13] (which we will always assume in this work), we propose that these mechanisms
can loosely be categorized into three groups: chemical interactions, electrostatic interactions, and
geometrical compatibility.
Chemical interactions are perhaps the broadest category, but include things such as hydrogen
bonds, hydrophobic stacking and van der Waals interactions. These are interactions which require
close physical proximity and depend upon the chemical properties of the component interfaces.
These interactions will generally take place within a region of solvent exclusion, but may be
mediated by a single layer of solvent atoms in some cases, such as water-mediated hydrogen bonding.
Chemical interactions are usually quite sensitive to small changes in the distance or orientation of the
molecules.
Electrostatic interactions are Coulomb or higher-order multipole interactions and are strongest
in regions of net charge. These are longer-range interactions which are less sensitive to variations
in distance or orientation due to the 1/r dependence of the Coulomb interaction. However, due to
electrostatic screening by the electrolytic environment of the cell, and the high dielectric constant
of water, these interactions are most relevant over smaller distance scales. Electrostatics play an
especially important role in protein-DNA binding due to the highly charged phosphate backbone
of DNA strands, which can interact electrostatically with the positively charged residues arginine and
lysine, which are highly prevalent in the DNA-binding domains of proteins.
Geometrical compatibility refers to the fact that molecules which bind one another generally
have geometries which allow the two to form a close-fitting interface. This is important because
many of the interactions responsible for creating attractive forces between the two molecules are
only relevant over small distance scales, and hence binding is strongest when the
surface area of the solvent-excluded region of the interface is maximized, which requires the two interfaces to
have compatible shapes, much like a lock and key. Because of the huge variety in the possible
shapes and orientations of binding interfaces, the geometry alone can form a powerful mechanism
for specificity. The geometry of the interface can also play a large role in the K_D of a complex.
For example, deep binding pockets are common features of protein binding sites which help to
trap their target ligands in a tightly bound configuration.
These categorizations of binding mechanisms are by no means rigorous, and there is no clear
boundary between them. They do serve a useful purpose when trying to understand how different,
but similar, molecular components distinguish amongst one another when preferentially binding
to one type and not another. For example, proteins which bind single-stranded DNA (ssDNA)
often utilize hydrophobic stacking interactions between hydrophobic aromatic side-chains (e.g.
tyrosine, tryptophan, phenylalanine) and exposed nucleotide bases in the ssDNA ligand. We
also observe an increased propensity of hydrophobic residues in the interfaces of protein-ssDNA
complexes relative to protein-dsDNA complexes (fig. 3.1), and hence we can conclude that, in part,
the specificity of some proteins to bind ssDNA over dsDNA is due to a difference in the chemical
nature of their binding domains, which are more hydrophobic than their dsDNA-binding counterparts.
Figure 3.1: A plot of residue propensities at the interface of DNA-protein complexes for
protein-ssDNA (SBP) and protein-dsDNA (DBP) complex structures. Propensity measures the
frequency of a residue to appear in the protein-DNA interface relative to the rest of the protein
surface, with values above 0 indicating it is more common in the interface, and values less than 0
indicating it is less common. The much higher propensities for the hydrophobic residues
tryptophan, tyrosine and phenylalanine for SBP relative to DBP show that these proteins have
much more hydrophobic interfaces, a chemical feature which is related to ssDNA binding.
Protein-DNA binding is of particular interest because it is responsible for gene regulation
and mediates essential processes which occur in the cell. It can be measured using various
experimental binding assays [47, 15, 44, 58], which provide both qualitative and quantitative
measurements of binding affinities and specificities. These methods only offer a very coarse view
of binding, however. They generally do not provide information about the mechanism of binding
or provide an easy way to connect observed binding measures to the structural characteristics of
the protein and DNA. Structural data of protein-DNA complexes offers a much richer and deeper
view of binding at the level of individual atomic interactions, and can help researchers to
understand the relationship between the structure of a protein and the physical mechanisms
that result in binding. The Protein Data Bank [126] provides a resource for accessing and
depositing structural data of biological macromolecules, and more specialized databases such as
DNAproDB [103] provide additional resources for structures of protein-DNA complexes.
3.1.1 GeoBind overview
In this work we propose a method for predicting macromolecular binding from structural data
using techniques based on recent developments in the field of geometric deep learning (for a
general introduction to deep learning, see sec. 1.3). We call our approach GeoBind, which refers
both to the method and to the software package which other researchers can use to apply our
methods. Geometric deep learning is a generalization of traditional deep learning methods, which
were historically developed to work in the euclidean domain, to operate in the non-euclidean
domain [14, 125]. Euclidean data is data which can be arranged on a grid or lattice with some
kind of regular or repeating structure. Typical examples include time series data, audio signals,
images, geospatial data, text and video. Euclidean data can be represented as fixed-dimensional
arrays, and the network must only accommodate this same fixed structure. Non-euclidean data
is essentially "everything else", which cannot be neatly represented in this way. The design
and implementation of networks which operate in euclidean domains is generally constrained by
the underlying structure of the data, and many implicit assumptions that influence the network
architecture are based on the regularity present in the data. Therefore these traditional networks,
which have seen great success in many fields, do not generalize well to non-euclidean domains.
The most general representation of non-euclidean data is a graph, where nodes represent data
points or entities and edges represent some relationship between them. Most of the work done in
geometric deep learning has focused on generalizing operations such as convolutions and pooling,
which have shown great success in the euclidean domain, to operate on graph structures.
Geometric deep learning provides an opportunity to combine the power of machine learning
approaches with the large amount of existing structural data of macromolecular complexes [126, 103].
Structural data is rich in the information it contains but is complicated in its nature, and does
not adapt well to a euclidean representation. Biological macromolecule structures vary widely
in size and shape and are irregular three-dimensional assemblies of polymer chains. By choosing
an appropriate representation, however, we can easily apply geometric deep learning techniques
to structural data. In this work we use the molecular surface representation as a way to capture
the information present in structural data accurately while providing a natural framing that
works seamlessly with existing geometric deep learning methods. The molecular surface is a
powerful concept that has found applications in implicit solvent electrostatics (sec. 1.2) [43, 5, 81],
molecular visualization [105, 92], protein shape analysis [62, 128, 51], and more. The molecular
surface represents an artificial boundary between the interior and exterior of a macromolecule. It
is a closed, two-dimensional manifold which fully envelopes every atom in the structure. Several
molecular surface definitions have been developed, each defined by the set of atomic coordinates
and associated radii that correspond to a molecular structure. Example surfaces are the
solvent excluded surface [23], solvent accessible surface [69], molecular skin surface [40], minimal
molecular surface [7], and gaussian surface [75], each with varying characteristics, advantages and
disadvantages. The triangulation of these surfaces provides a discrete representation of a
macromolecule structure which we refer to simply as a surface mesh or just a mesh. The mesh has many
advantages which make it an attractive representation for a macromolecule structure. First, it
captures the geometry of the structure well - geometric features such as binding pockets, binding
channels, helix recognition motifs, and other structurally conserved motifs are well captured by
surface mesh representations. Second, because many of the binding mechanisms which mediate
protein-DNA binding are local in nature, we can imagine that binding occurs primarily via the
interaction of molecular surfaces if we assign the correct chemical and electrostatic properties
to those surfaces. Finally, a mesh is nothing more than a graph which is embedded on a
two-dimensional manifold, where the nodes of the graph represent points on the surface and edges
encode geometrical relationships between them such that properties like the curvature and topology
of the surface are preserved. Therefore, the methods already used in geometric deep learning can
be applied directly to macromolecule surface meshes. We will refer to molecular surfaces as
both meshes and graphs interchangeably depending on context.
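To make the mesh-as-graph correspondence concrete, the graph structure can be recovered directly from the triangle faces of a mesh. The following is a minimal sketch in plain Python; a real implementation would use a mesh library such as trimesh, and the function name here is illustrative, not part of GeoBind:

```python
def mesh_to_adjacency(faces):
    """Build a vertex adjacency list (the graph) from triangle faces.

    Each face is a triple (i, j, k) of vertex indices; every face
    contributes its three edges to the graph. The result maps each
    vertex index to the set of vertices it shares an edge with.
    """
    adj = {}
    for i, j, k in faces:
        for u, v in ((i, j), (j, k), (k, i)):
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj
```

For two triangles sharing the edge (1, 2), e.g. `faces = [(0, 1, 2), (1, 3, 2)]`, vertex 1 ends up adjacent to vertices 0, 2 and 3, exactly the neighborhood a graph convolution would aggregate over.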
We propose predicting protein-DNA binding via molecular surface meshes and geometric deep
learning. An overview of our approach is shown in figure 3.2. The basic concept is to capture
the geometrical, chemical and electrostatic signatures of DNA binding sites observed in existing
structures of protein-DNA complexes which are available in databases such as the PDB [126]
or DNAproDB [103]. DNAproDB, for example, contains over 4600 structures of protein-DNA
complexes. This dataset spans a wide range of protein structures, sequences and functions and
presents a diverse sampling over the space of protein-DNA interactions which are observed in
vitro. Given a complex, we can identify which protein residues constitute the DNA binding site
and label each vertex on the corresponding molecular surface mesh as a "binding site" (positive
class) or "non-binding site" (negative class). We then assign various geometrical, chemical and
electrostatic features (see sec. 3.3.2) to each vertex describing its local environment. These surface
meshes and corresponding features are then used to train fully convolutional graph neural networks
to segment the molecular surface into binding sites and non-binding sites. Probabilities for each
vertex to belong to a DNA binding site or not are predicted, and a loss function is computed using
the cross entropy between the observed probability distributions (which are binary distributions)
over vertices and the predicted ones. Gradients for the weights of the network can then be
computed and the weights updated using the back-propagation algorithm to minimize this loss, and
hence maximize agreement between predicted and observed probabilities. The result is a network
which learns to predict the probability that each vertex on a molecular surface mesh belongs to a
DNA binding site based on the local geometrical, chemical and electrostatic environment of that
vertex. These predictions can then easily be mapped back to the corresponding structure for atom-
or residue-level predictions. In this work we apply GeoBind to predicting protein-DNA binding,
but in principle it can be used for predicting almost any macromolecular binding, given sufficient
data.
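The training objective just described is the standard per-vertex binary cross entropy. As a minimal sketch, independent of any deep learning framework (in practice this is computed by the framework's built-in loss over batched vertex predictions):

```python
import math

def vertex_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross entropy between observed vertex labels (0 or 1)
    and predicted binding-site probabilities.

    labels: iterable of 0/1 binding-site labels, one per mesh vertex
    probs:  iterable of predicted probabilities, one per mesh vertex
    """
    total = 0.0
    n = 0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical stability
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        n += 1
    return total / n
```

A network that predicts 0.5 everywhere incurs a loss of log 2 per vertex, while perfect predictions drive the loss toward zero; minimizing this quantity by back-propagation is what aligns predicted and observed binding-site distributions.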
The remainder of this chapter is organized as follows. I review related work on the subject
of predicting protein binding from structure using machine learning techniques in section 3.2.
Figure 3.2: The GeoBind framework applied to the task of predicting DNA binding sites on the surface
of proteins from structural data. Our starting point is observed complexes of proteins bound to DNA.
We first identify DNA binding residues from the structure of the complex. We then generate surface
meshes using the protein structure and assign features to the vertices and edges of the mesh using the
GeoBind framework. The meshes and features are fed into a neural network which predicts likely binding
sites over the vertices of the mesh, and these predictions are compared with the observed binding sites.
The corresponding error is propagated back through the network to improve future predictions, and this
process is iterated.
This does not include sequence-based methods or methods which use structure and some kind
of sampling technique, such as Monte Carlo based approaches, as these methods are beyond the
scope of this work. In section 3.3 I explain the methods of our approach in greater detail, and in
section 3.4 I discuss current results of applying the GeoBind method to protein-dsDNA binding
and protein-ssDNA binding. The final sections are devoted to a discussion of current and future
work.
3.2 Related Work
In this section we briefly review previous work done on the subject of predicting macromolecular
binding sites from structural data using machine learning-based approaches. We distinguish this
from docking-based approaches [88], which require the structure of both the ligand and receptor
molecule and are generally based on some sampling algorithm combined with a scoring function,
with the essential idea being to discover conformations of the ligand-receptor complex which minimize
the scoring function. Machine learning-based approaches can predict macromolecular binding
without the need for sampling, because the sampling is performed upfront when the model is
trained. They are generally limited in scope to a particular kind of binding, for example protein-
RNA binding or protein-drug binding, but the advantage is that the structure of the ligand is not
required to perform inference.
SPPIDER [93] is a method for predicting protein-protein interaction sites from protein structure.
The goal is to classify each residue in a protein structure as being part of a protein interaction
site or not. For each residue in a protein, various features are computed which the authors group
into four categories: (i) sequence-based features; (ii) features derived from evolutionary profiles of
protein families; (iii) features based on protein tertiary structure; (iv) RSA-based features, where
RSA refers to the relative solvent-exposed surface area [93]. A training set was compiled from
435 protein-protein complex structures obtained from the Protein Data Bank (PDB) [126], and
several models including neural networks (NN) and support vector machines (SVM) were trained
to classify a residue as being part of a protein-protein interaction site or not. The authors report
AUROC scores ranging from 0.62 to 0.76.
DBSI (DNA Binding Site Identifier) [135] is a method to predict DNA binding residues in the
binding sites of DNA binding proteins. Given the structure of a protein, the authors compute
various features describing the local structural and electrostatic context of each residue on the
protein surface, in addition to PSSM features produced from multiple sequence alignments, which
indicate conserved residues. The authors started with a total of 480 features and, through an
iterative feature selection process, narrowed this down to a final set of 164 features. The authors
used an SVM to learn a binary classifier to distinguish between a DNA-binding residue and a
non-DNA-binding residue. Training was performed on a dataset of 263 DNA-protein complex
structures. The authors report AUROC scores in the range of 0.83-0.88 for several datasets.
MaSIF [45] is recent work most similar to our own and presents itself as a general-purpose
framework for predicting protein interactions and capturing fingerprints important for those
interactions. It is based on the molecular surface representation of a protein, where the molecular
surface is represented as a discrete mesh. At each vertex of the mesh, five features are computed
which describe the local curvature, hydropathy, and electrostatic potential of the molecular surface
near that point. Geodesic patches of a fixed radius are extracted at each point on the surface, and
radial convolutions with angular max-pooling are performed on each patch. The authors report a
balanced accuracy of 0.74 and AUROC of 0.9-0.95 for the task of predicting ligand binding
sites on protein surfaces, and an AUROC of 0.85 for predicting protein-protein binding sites.
Jinga et al. [113] have done an extensive review of methods for predicting DNA binding sites
from protein structure (as of Dec. 2014). I have adapted table 3 from their paper, which shows
performance measures of then state-of-the-art methods for predicting DNA binding residues from
the structure of DNA binding proteins. These performance measures serve as a good baseline for
comparison to our work. I refer the reader to their work for a more comprehensive overview of
the methods listed in table 3.1. The relatively low performance of most of these approaches can
be seen as a testament to the difficulty of this problem.
Performance measures of DNA-binding site prediction methods
Author Dataset BA ACC REC AUC F-score PRE
Jones 2003 own 0.680
Ahmad 2004 own 0.644 0.682
Ahmad 2004 PDNA-316 0.650 0.750 0.530 0.230
Ferrer-Costa 2005 own 0.835
Kuznetsov 2006 own 0.760 0.769 0.830
Wang 2006 own 0.703 0.694 0.750
Wang 2006 PDNA-316 0.670 0.780 0.540 0.260
Yan 2006 own 0.710 0.530
Yan 2006 PDNA-316 0.700 0.730 0.660 0.260
Tjong 2007 own 0.680
Ofran 2007 own 0.890
Ofran 2007 PDNA-316 0.590 0.920 0.190 0.270
Hwang 2007 own 0.772 0.764
Hwang 2007 PDNA-316 0.740 0.780 0.690 0.310
Nimrod 2009 own 0.900 0.900 0.350
Wang 2009 own 0.800 0.731 0.850
Wang 2009 PDNA-316 0.750 0.820 0.670 0.830 0.340
Wu 2009 own 0.914 0.766
Carson 2010 own 0.785 0.797 0.860
Ozbek 2010 own 0.960 0.360
Si 2011 PDNA-316 0.770 0.770 0.770 0.330
Table 3.1: Performance metrics for a variety of DNA binding site prediction methods, taken
from Jinga et al. [113]. Dataset refers to the structural dataset used to train the authors' models,
which is either PDNA-316, a standard dataset from [112], or 'own', referring to a custom
dataset curated by the authors. The metrics are defined as follows: BA - balanced accuracy:
(TP/(TP+FN) + TN/(TN+FP))/2; ACC - accuracy: (TP+TN)/(TP+TN+FP+FN); REC - recall:
TP/(TP+FN); AUC - area under the ROC curve; F-score - harmonic mean of REC and PRE:
2*REC*PRE/(REC+PRE); PRE - precision: TP/(TP+FP). TP, TN, FP, FN refer to true positive,
true negative, false positive and false negative counts respectively.
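The metrics in the table caption translate directly into code. The following sketch computes them from raw confusion-matrix counts (the function name is illustrative, not from any of the cited packages):

```python
def binding_site_metrics(tp, tn, fp, fn):
    """Return the performance metrics defined in table 3.1 as a dict,
    computed from true/false positive/negative counts."""
    rec = tp / (tp + fn)                   # recall (sensitivity)
    pre = tp / (tp + fp)                   # precision
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    ba = (rec + tn / (tn + fp)) / 2        # balanced accuracy
    f1 = 2 * rec * pre / (rec + pre)       # F-score (harmonic mean)
    return {"BA": ba, "ACC": acc, "REC": rec, "PRE": pre, "F": f1}
```

Note that for the heavily imbalanced vertex labels typical of binding-site prediction (most of a protein surface is non-binding), BA and the F-score are far more informative than raw accuracy, which is why the table reports them.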
3.3 Methods Overview
GeoBind is both a method for predicting macromolecular binding using geometric deep learning
and a set of reusable tools for researchers who are interested in predicting binding sites or other
properties of molecular surfaces. GeoBind is built with Python 3 and can be installed using pip
(the python package manager) just like any other python package. Installing the package will
make the geobind module available for import, and the major functions are organized into three
submodules. The first is geobind.structure, which provides utilities for loading macromolecule
structures from file as BioPython [21] Structure objects, protonating structures, assigning
radius and charge parameters from force fields, and performing calculations of structural features
such as secondary structure assignment or computing the electrostatic potential using the APBS
[60] Poisson-Boltzmann solver. This module can also perform some basic clean-up operations such
as removing chemically modified protein residues or mutating them to their respective standard
parents. This is useful because most force fields do not provide charge and radius parameters
for non-standard residues, which are necessary for some calculations. The second submodule is
geobind.mesh, which provides wrapper functions for mesh generation algorithms (sec. 3.3.6), a
class for storing and manipulating molecular surface meshes, and functions for calculating
geometric features of meshes (sec. 3.3.2.3). The final major submodule is geobind.nn, which provides
utilities for constructing, training, performing inference with, and evaluating graph neural networks
(sec. 3.3.3).
GeoBind has several python and non-python third-party dependencies, but most are optional
and depend on which features the user is interested in using. Some major dependencies required
for basic functionality of the three submodules include BioPython [21] for manipulating and
storing macromolecule structures, trimesh [29] for storing and performing operations on molecular
surface meshes, and PyTorch [91] + PyTorch Geometric [42] for building graph neural networks.
We integrate our framework with widely used packages, which makes it easier for users who are
already familiar with these packages and frameworks to adopt GeoBind into their own work
environments, and allows GeoBind to benefit from the large number of developers who contribute
to these third-party dependencies and perform upstream bug-fixes and improvements. In the
following sections we describe various components of GeoBind as well as the methods for our
experiments in predicting protein-DNA binding using the GeoBind approach.
3.3.1 Mesh Generation
GeoBind provides convenient wrappers for several mesh generation algorithms. These wrappers
provide a common interface for generating surface meshes from a macromolecule structure and do
not require the user to interact directly with the underlying mesh generation software dependencies.
To call a wrapper function the user must simply pass a BioPython [21] Structure object or
a list of BioPython Atom objects, the choice of the mesh generation algorithm, and any parameters
specific to that algorithm, which are collected via the kwargs argument. These parameters might
include things like the choice of molecular surface to triangulate, the resolution of the mesh, etc.
We provide wrappers for MSMS [104], NanoShaper [31] and EDTSurf [128]. MSMS and
NanoShaper require radii for each atom, and GeoBind provides a function getAtomChargeRadius()
for assigning radii and charge parameters to each atom in a structure. By default we use
PDB2PQR [36] with the AMBER [26] force field to assign the radii and charges, but the user
may choose another force field. The radii and charge parameters are stored in the xtra
property of each Atom object in the given structure. Charge and radii parameters are used by
other functions for computing features such as solvent accessible and solvent excluded surface
areas, and for calculation of the electrostatic potential, so storing them as properties of each Atom
object ensures consistency and reusability. EDTSurf does not allow the user to provide custom
radii and has hard-coded radii based on atomic species, which may create minor inconsistencies
with other features.
All mesh generation wrapper functions return a GeoBind Mesh object, which is a wrapper
class for a trimesh [29] Trimesh object. In addition to all the properties and utilities of the
base Trimesh class, the GeoBind Mesh object provides additional utilities such as KDTree-based
nearest-neighbor and triangle face searching, sparse adjacency matrix output, and removal of
disconnected vertices. All GeoBind mesh feature and utility functions can be passed this Mesh
object, and it can be saved to disk in a variety of common mesh file formats such as OFF or PLY.
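The common-interface pattern described above can be sketched as a simple dispatch on the algorithm name, with backend-specific options collected via **kwargs. The function and backend names below are illustrative stand-ins, not the actual GeoBind API:

```python
def _run_msms(atoms, **kwargs):
    """Stand-in for a wrapper that would invoke the MSMS binary."""
    return {"algorithm": "msms", "options": kwargs}

def _run_nanoshaper(atoms, **kwargs):
    """Stand-in for a wrapper that would invoke NanoShaper."""
    return {"algorithm": "nanoshaper", "options": kwargs}

# Registry mapping an algorithm name to its wrapper function.
_GENERATORS = {"msms": _run_msms, "nanoshaper": _run_nanoshaper}

def generate_mesh(atoms, method="nanoshaper", **kwargs):
    """Dispatch to a mesh-generation backend; kwargs carry options
    specific to that backend (surface type, mesh resolution, ...)."""
    if method not in _GENERATORS:
        raise ValueError("unknown mesh generator: %s" % method)
    return _GENERATORS[method](atoms, **kwargs)
```

The benefit of this design is that switching backends is a one-argument change for the user, while each backend remains free to accept its own option set through kwargs.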
3.3.2 Training Features
GeoBind provides utilities for calculating chemical, geometrical and electrostatic features of
macromolecular structures and surface meshes, which can be used for training graph neural networks,
visualization of molecular properties, comparison of molecular surfaces, and more. There are two
distinct sets of features which can be defined. The first are features which are computed from the
molecular structure, i.e. depend upon the coordinates of atomic centers and may also depend on
additional parameters such as atomic charge, radius, chemical species, etc. The second feature
set consists of geometrical descriptions of the molecular surface mesh. These are defined independently
of the underlying structure from which the mesh was generated, and depend only on the geometry
and topology of the resulting mesh.
3.3.2.1 Structure Features
Structure features are those which depend on the atomic coordinates of a macromolecular
structure, and in some cases on additional parameters related to the structure such as atomic radii,
charge, chemical species, etc. GeoBind provides utility functions for computing a variety of
structure-based features which all operate using a standard API - accepting a BioPython Structure object
as input, plus any additional parameters specific to that feature which need to be specified, and
returning the Structure object as output. The results of a feature calculation are stored in
the xtra property of every Atom object in a given Structure (see the BioPython documentation for
more details), which is simply a dictionary for storing atom-specific data. If a feature is residue-
level, such as secondary structure classification, then the feature is simply mapped to every child
atom of that residue. A notable exception to this are electrostatic potential calculations, where
a Structure is given as input and an object which performs interpolation is returned as output,
allowing the user to query the interpolator for the value of the potential at any point in space
within the simulation box.
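The "accept a structure, annotate each atom's xtra dictionary, return the structure" convention can be illustrated with a minimal stand-in Atom class (the real objects are BioPython's, and the feature computed here is a deliberately trivial placeholder, not a real GeoBind feature):

```python
class Atom:
    """Minimal stand-in for a BioPython Atom: a coordinate plus an
    xtra dictionary for storing computed per-atom features."""
    def __init__(self, coord):
        self.coord = coord
        self.xtra = {}

def compute_feature(structure, key="dist_from_origin"):
    """Example feature function following the convention in the text:
    compute a scalar for every atom, store it under `key` in atom.xtra,
    and return the structure so calls can be chained."""
    for atom in structure:
        x, y, z = atom.coord
        atom.xtra[key] = (x * x + y * y + z * z) ** 0.5
    return structure
```

Because every feature function shares this shape, features can be composed freely: each call decorates the same Atom objects with a new key, and downstream code (such as the feature-to-mesh mapping) only needs to read the xtra dictionaries.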
Examples of structure-based features include the solvent-excluded surface area for every atom
using MSMS [104], spatial aggregation propensity [19] - a measure of hydrophobicity that depends
on the local context of each atom - and circular variance [16], which describes how buried an atom
is relative to its local surroundings. A full list of structure-based features and their dependencies
is given in table 3.2, and definitions and algorithms for each feature are given in appendix B.1.
Structure features available in GeoBind
Feature Name Dependency Level
solvent excluded surface area [23] MSMS [104] A
solvent accessible surface area[69] FreeSASA [86] A
protein secondary structure DSSP [61] R
Atchley factors [4] R
spatial aggregation propensity [19] A
hydrogen bond potential A
circular variance [16] A
Table 3.2: A list of structure features provided by GeoBind. Feature Name is the common name
or the name given by the authors who first described the algorithm for computing the descriptor,
and any reference refers to the original papers where a definition was given. Dependency is any
third-party external package or software which GeoBind uses to implement the feature. The references
under this column refer to the paper where the implementation was described. Level refers to
what structural level the feature is defined for, which can either be 'R' for residue-level features or
'A' for atom-level features. A full description of all features is available in appendix B.1.
3.3.2.2 Mapping Structure Features to the Mesh
GeoBind provides tools for mapping structure-based features to vertices of a corresponding
molecular surface mesh. There are two mapping schemes which the user may select, depending on the
application. In the first, each vertex is assigned the features of the nearest atom, based on
euclidean distance using a KDTree query. Using this scheme, vertices in the same neighborhood
may be mapped identical features, with potentially sharp boundaries depending upon how the
features of nearby atoms vary, but the features of each vertex will only correspond to a single
atom, which may be desirable in some cases. In other words, the features of multiple atoms are
not mixed. The second scheme performs a weighted average of the features of nearby atoms within
some cut-off radius R_c for each vertex. The weights depend on the euclidean distance between
the vertex and each nearby atom, with different weighting schemes provided, such as weighting
by inverse distance so that more distant atoms contribute less than nearby ones. A default value
of R_c = 3.0 is used if none is provided. All vertex features are stored as separate numpy arrays
in the vertex_properties property of the corresponding Mesh object.
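The two mapping schemes can be sketched directly. This is a brute-force illustration over plain coordinate tuples; a real implementation would vectorize over all vertices and use a KDTree for the neighbor queries, and the function names are illustrative:

```python
import math

def map_nearest(vertex, atoms, feats):
    """Scheme 1: assign the vertex the feature of its nearest atom."""
    i = min(range(len(atoms)), key=lambda k: math.dist(vertex, atoms[k]))
    return feats[i]

def map_weighted(vertex, atoms, feats, rc=3.0):
    """Scheme 2: inverse-distance weighted average of the features of
    all atoms within the cut-off radius rc of the vertex."""
    wsum, fsum = 0.0, 0.0
    for a, f in zip(atoms, feats):
        d = math.dist(vertex, a)
        if d <= rc:
            w = 1.0 / max(d, 1e-8)  # guard against a vertex on an atom center
            wsum += w
            fsum += w * f
    return fsum / wsum if wsum > 0 else 0.0
```

The nearest-atom scheme keeps atom features unmixed but produces piecewise-constant vertex fields with sharp boundaries; the weighted scheme trades that for smooth interpolation across the surface.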
3.3.2.3 Shape Features
Shape features are computed directly from the molecular surface mesh and do not require any
information about the underlying molecular structure. These features encode geometrical
information about the surface and may be completely local or describe geometrical properties over a
larger scale. They are generally based on statistical measures such as the shape diameter
function [107] or on differential geometry approaches, such as heat kernel signatures [116] (see
appendix B.1 for descriptions). The idea is to generate features that define scalar fields over the
two-dimensional surface mesh which encode different kinds of information about the geometry of
the surface at each point.
Many of these features have been developed in the context of the fields of 3D shape
representation, object classification, point matching, shape segmentation and shape retrieval, where
numerical descriptors that encode three-dimensional geometries are necessary for various tasks in
computer vision and data mining. A good geometrical descriptor should have several properties
in order to be useful in general applications. The first is invariance to isometric transformations
- the descriptor should not depend on the coordinate frame of the mesh vertices and should be
invariant to any isometric transformation of the mesh. Second, it should be insensitive to noise -
meshes may be subject to artifacts such as sharp corners, missing vertices or edges, and regions
of poor triangulation resulting in highly scalene triangle faces, as opposed to equilateral triangle
faces in the ideal case. Good geometry descriptors should not be too sensitive to the presence
of such artifacts. Finally, descriptors should show some degree of invariance to pose. Different
poses of the same shape refer to three-dimensional shapes which are related by some topological
deformation. For example, a cat in a sitting pose versus a standing pose is still a cat, despite
the local deformations of the cat shape which relate the two. The geometrical descriptors we use
to identify regions of the cat such as its tail, legs, abdomen and head should in some sense be
invariant to the pose of the cat.
Thankfully, many geometric descriptors which have the properties described above have been
developed and applied to a wide variety of problems. In the context of macromolecular binding,
encoding the geometry of the macromolecular surface is important because binding often depends
on the shape of the binding interface. We want our models to be able to leverage and learn
geometrical features to discriminate between likely and unlikely binding sites. GeoBind provides
functions for computing shape features of a mesh, which take as input a Mesh object and any
parameters related to the particular shape feature, and store the results in the vertex_properties
dictionary of the Mesh object. A full list of the shape features for which GeoBind provides functions
is given in table 3.3. We note that the shape features described here fall into the category of
engineered features - that is, they are in some sense hand-crafted for general use or, in some cases,
for particular tasks. It is also possible to design networks which learn shape features, which will
be discussed in the following sections.
Shape Features Available in GeoBind
Feature Name Dependency
mean curvature libigl [56]
shape index [65]
convex-hull distance
pocket volume NanoShaper [31]
pocket area NanoShaper
shape diameter function [107] libigl
Table 3.3: A list of shape features provided by GeoBind. Feature Name is the common name or
the name given by the authors who first described the algorithm for computing the descriptor, and
references refer to the original papers where a definition was given. Dependency is any third-party
external package or software which GeoBind uses to implement the feature. References here refer
to the paper where the implementation was published. Full descriptions are available for each
feature in appendix B.2.
3.3.2.4 Edge Features
In addition to computing features of mesh vertices that describe the local geometrical, chemical
and electrostatic environment of that vertex, we can also compute features of edges which describe
the geometrical relationship between the vertices which the edges join. These features can be used in
the graph convolution and pooling operations which are described in the following sections, and
can allow the GNN to learn information about the mesh geometry in addition to the hand-crafted
features we provide it. We compute the following edge features, with the first three being inspired
by the PPFNet work of Deng, Birdal and Ilic [32]. We use the notation that x_u is the cartesian
coordinate of the vertex u, n_u is the normal vector to the surface at the vertex u, and e_uv is
an edge which joins vertices u and v.
Edge Length: The edge length is given by the euclidean distance between vertices:

d_uv = ||e_uv|| = ||x_u - x_v||    (3.1)

The mean edge length depends on the resolution of the mesh - hence networks which utilize this
feature may be sensitive to mesh resolution, and care should be taken to ensure that training and
testing data use consistent mesh resolutions.
Normal-Edge Angles: This is defined as the angle between the displacement vector along
the edge and the normal vectors of the two vertices joined by the edge. We can define this for
both vertices on the edge, creating two features:

∠(e_uv, n_u) = cos^-1( ((x_v - x_u) · n_u) / ||x_v - x_u|| )    (3.2)
Normal-Normal Angle: The normal-normal angle is the angle between the normal vectors of
the two vertices joined by the edge:
∠(n_u, n_v) = cos^-1(n_u · n_v)    (3.3)
Edge Dihedral Angle: The dihedral angle of the edge is the angle between the normal vectors
of the two triangle faces which share the edge. This incorporates information from the two
unshared vertices of the adjacent faces in addition to the two vertices which form the edge, and
adds local contextual information about the surface curvature. It can be computed from the face
normal vectors n_uvi and n_uvj as

φ(e_uv) = cos^-1(n_uvi · n_uvj)    (3.4)

where i and j are the unshared vertices that belong to the triangle faces adjacent to e_uv (see
fig. 3.3).
Face Adjacency Span: The face adjacency span is the distance between the perpendicular
projections onto the edge of the unshared vertices of the triangle faces which are adjacent to the
edge. It is given by

span(e_uv) = |(x_j − x_u)·ê_uv − (x_i − x_u)·ê_uv| = |(x_j − x_i)·ê_uv|   (3.5)

and measures how skew the triangles which share the edge e_uv are to one another.
Edge features represent geometric primitives: they encode raw geometric information at a
low level from which the network can extract higher-level features. This is similar to standard
convolutional neural networks, which operate on raw pixel intensities rather than hand-crafted
inputs. The various edge features described above are illustrated in fig. 3.3.
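The edge features above can be computed directly from vertex coordinates and unit normals. The following is a minimal pure-Python sketch of eqs. (3.1)-(3.3); the function name and vector helpers are illustrative, not GeoBind's actual implementation, and unit normals are assumed.

```python
import math

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def norm(a): return math.sqrt(dot(a, a))

def edge_features(x_u, x_v, n_u, n_v):
    """Edge length (eq. 3.1), the two normal-edge angles (eq. 3.2) and the
    normal-normal angle (eq. 3.3) for the edge joining vertices u and v.
    Assumes n_u and n_v are unit normals."""
    e_uv = sub(x_v, x_u)
    d = norm(e_uv)
    # clamp dot products to [-1, 1] to guard against floating-point drift
    clamp = lambda t: max(-1.0, min(1.0, t))
    ang_u = math.acos(clamp(dot(e_uv, n_u) / d))  # angle(e_uv, n_u)
    ang_v = math.acos(clamp(dot(e_uv, n_v) / d))  # angle(e_uv, n_v)
    ang_n = math.acos(clamp(dot(n_u, n_v)))       # angle(n_u, n_v)
    return d, ang_u, ang_v, ang_n
```

For a flat patch, where both normals are perpendicular to the edge, the two normal-edge angles are π/2 and the normal-normal angle is 0.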
Figure 3.3: An illustration showing the geometric primitives used as edge features in GeoBind and corre-
sponding primitive features used in images. Neural networks can learn higher-order representations from
primitive features that do not require any hand-crafting. A Some of the geometrical relations we encode
as edge features. These include the dihedral angle between triangles which share an edge, φ(e_uv), the
span of adjacent triangles sharing the edge, and the angle between the vector lying along the edge and the
normal at the adjacent vertices. These raw geometrical relationships encode local information about the
geometry of the surface. B By analogy, these relationships are similar to pixel intensities in an image,
which a conventional convolutional neural network would use as input features.
3.3.3 Graph Neural Networks
Graph Neural Networks (GNNs) are neural networks which operate on graphs. GeoBind uses
GNNs to train models which can predict macromolecular binding by representing molecular surface
meshes as graphs and making predictions over the graph nodes. A graph G = {V, E} is a set of
nodes V and edges E, where every edge in E represents a pair of nodes in V. We represent a
mesh as a graph where the vertices of the mesh are the nodes of the graph, and the graph edges are
the triangle edges of the mesh. Every graph node is associated with a point in space which lies on
the molecular surface which the mesh is approximating, and edges join nodes whose points form
the corners of a common triangle face in the mesh. We can assign additional features to the graph
nodes, such as descriptors of the local chemical, geometrical and electrostatic environment (see
sec. 3.3.2.1). We can also assign features to each edge which describe the geometrical
relationships between adjacent nodes (see sec. 3.3.2.4).
Graphs can be represented compactly using standard data structures. Let |V| and |E| refer
to the number of nodes and edges respectively. Then a mesh graph can be represented as

G_M = { V ∈ R^{|V|×3}, E ∈ N^{|E|×2}, X_V ∈ R^{|V|×N_V}, X_E ∈ R^{|E|×N_E} }

where V are the vertex coordinates which implicitly represent the nodes/mesh vertices, E are
pairs of indices into V which represent directed edges, X_V and N_V are the node features and
number of node features respectively, and X_E and N_E are the edge features and number of edge
features respectively. Often a GNN does not explicitly use the coordinates in V, as these are used
mostly for defining the node and edge features, though they are used in some applications such as
point set registration.
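A minimal container mirroring this representation might look as follows; the class name and layout are purely illustrative, not GeoBind's actual data structures.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MeshGraph:
    """Mirrors G_M = {V, E, X_V, X_E}: vertex coordinates, directed edge
    index pairs, and per-node / per-edge feature rows."""
    V: List[Tuple[float, float, float]]   # |V| x 3 vertex coordinates
    E: List[Tuple[int, int]]              # |E| x 2 indices into V (directed)
    X_V: List[List[float]]                # |V| x N_V node features
    X_E: List[List[float]]                # |E| x N_E edge features

    def neighbors(self, u: int) -> List[int]:
        # every v such that the directed edge (u, v) is present
        return [v for (a, v) in self.E if a == u]

# a single triangle: 3 vertices, 6 directed edges, one scalar feature each
tri = MeshGraph(
    V=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    E=[(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0)],
    X_V=[[0.1], [0.2], [0.3]],
    X_E=[[1.0]] * 6,
)
```

Storing each undirected mesh edge as two directed index pairs is the convention assumed here, since message functions are in general directional.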
A GNN then defines some signal over the graph, Y ∈ R^{|V|×N_Y} = GNN(E, X_E, X_V), such as a
probability distribution. GeoBind predicts, for each node, the probability that the node belongs to a
binding site or not. A GNN must be invariant to rotations, translations and permutations of
the vertex coordinates, features and edges. The order in which we present the edge indices, vertex
coordinates and vertex and edge features must not affect the performance of the network or the
outcome of the predictions (so long as the order remains self-consistent among the arrays), as a
graph has no inherent ordering among nodes and edges. Convolutions and pooling are ubiquitous
operations in conventional convolutional neural networks (CNNs), and both can be generalized
to graphs in such a way as to preserve the rotational, translational and permutation invariance of
the graph. We review these operations in the sections below.
3.3.3.1 Graph Convolutions
Convolutions in the context of neural networks have seen enormous success in applications such as
computer vision, natural language processing, and time series forecasting, but they can be gener-
alized to operate on graphs and applied to non-Euclidean data as well. Generalizing convolutions
to graphs has been one of the main areas of focus within the field of geometric deep learning [125],
and many ideas have been put forward with good results. Graph convolutions can generally be
separated into two groups: spectral convolutions and spatial convolutions.
Spectral convolutions are defined in the Fourier domain of the graph Laplacian. Given an
undirected graph G = (V, E), we can represent the connectivity of the graph by an adjacency
matrix A, where A_ij = 1 if there is an edge (i, j) ∈ E and zero everywhere else. The Laplacian of
the graph is defined as

L = D − A   (3.6)

where D is a diagonal matrix that represents the degree of each node in V, D_ii = Σ_j A_ij.
The normalized graph Laplacian is defined as

L̂ = D^{−1/2} L D^{−1/2} = I − D^{−1/2} A D^{−1/2}   (3.7)
which is a real symmetric positive semidefinite matrix. It is well known that this implies L̂
is diagonalizable such that L̂ = U Λ U^T, where U ∈ R^{n×n} is a matrix of eigenvectors ordered by
the eigenvalues in Λ, and where Λ_ii = λ_i are the eigenvalues of L̂. A signal over a graph is some
vector x where each entry x_i corresponds to some feature of the i-th node. We define the graph
Fourier transform [111] of a signal as

F_G(x) = U^T x   (3.8)
This provides a mechanism for us to compute the convolution operation on a graph using
the convolution theorem, which states that the Fourier transform of a convolution is equal to
point-wise multiplication in the Fourier domain. That is,

F_G(x ∗_G K) = F_G(x) ⊙ F_G(K)   (3.9)

for a signal x and some convolution kernel K. From eq. (3.8) we can write the above as
F_G(x ∗_G K) = U^T x ⊙ U^T K. It thus follows that

x ∗_G K = F_G^{−1}(F_G(x ∗_G K)) = U(U^T x ⊙ U^T K)   (3.10)

This can be simplified if we write K_Λ = diag(U^T K), and hence

x ∗_G K = U(K_Λ U^T x) = U K_Λ U^T x   (3.11)
Convolutions defined in this way are referred to as spectral convolutions, because the convo-
lution operation involves the eigenvectors of the graph Laplacian. While mathematically elegant,
they suffer from several drawbacks. Any perturbation to the graph structure results in a change of
the eigenbasis U, and the definition of convolution constructed in eq. (3.11) depends upon
this basis. This means that these convolutions do not generalize to other graphs: the parameters
learned in K_Λ cannot be applied to other domains. In addition, the computation of U is O(n³),
where n is the number of vertices, which is prohibitively slow for large graphs. Thus GeoBind
implements convolutions which are strictly of the form of spatial convolutions.
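To make the spectral construction concrete, eq. (3.11) can be worked out by hand for the smallest non-trivial case: a two-node graph with a single edge, whose normalized Laplacian has eigenvectors (1, 1)/√2 and (1, −1)/√2. The sketch below is purely illustrative; as noted above, GeoBind avoids this eigendecomposition entirely in favor of spatial convolutions.

```python
import math

s = 1.0 / math.sqrt(2.0)
# columns of U are the eigenvectors of L_hat for the two-node path graph
U = [[s, s], [s, -s]]

def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

def transpose(M):
    return [list(row) for row in zip(*M)]

def spectral_conv(x, k_lambda):
    """x *_G K = U diag(k_lambda) U^T x (eq. 3.11), with the filter given
    directly by its spectral coefficients k_lambda."""
    xhat = matvec(transpose(U), x)                        # graph Fourier transform (eq. 3.8)
    filtered = [k * xh for k, xh in zip(k_lambda, xhat)]  # pointwise multiply (eq. 3.9)
    return matvec(U, filtered)                            # inverse transform

# a "low-pass" filter keeping only the constant (lambda = 0) eigenvector
# averages the signal over the graph: [3, 1] -> [2, 2]
smoothed = spectral_conv([3.0, 1.0], [1.0, 0.0])
```

An all-pass filter (coefficient 1 on every eigenvalue) returns the signal unchanged, which is a quick sanity check on the transform pair.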
Spatial convolutions use the framework of message passing [48], where convolutions are defined
as continuous functions over spatial neighborhoods of each node in a graph. Nodes exchange
information called "messages" and update their hidden states as given in eqs. (3.12) and (3.13)
below.
m_u^{ℓ+1} = aggr_{v∈N(u)} M_ℓ(h_u^ℓ, h_v^ℓ, e_uv)   (3.12)

h_u^{ℓ+1} = U_ℓ(h_u^ℓ, m_u^{ℓ+1})   (3.13)
Here h_u^ℓ is the hidden state of a node u at the ℓ-th layer of the network, e_uv are any features
associated with the edge joining neighbors u and v, M_ℓ is a learned message function combining
the hidden states of u and v, and aggr is any differentiable, permutation-invariant aggregation
operation, such as summation or max, which combines messages from all nodes v in the
neighborhood N(u) of u. The final hidden state of a node u is updated by combining the aggregated
messages with its hidden state via a learned update function U_ℓ. A network can be constructed
from layers of these message and update functions, extracting higher and higher order information
from nodes, as shown in figure 3.4. At each layer we aggregate information over the first nearest
neighbors; however, each node v in the neighborhood of u contains information from its ℓ − 1
nearest neighbors, and thus the effective receptive field of the network is equal to the number of
layers. In general, message functions need not be symmetric and have a directionality: the
direction of the edge matters, and thus the graph must be considered a directed graph. If the
message functions are defined such that M_ℓ(h_u^ℓ, h_v^ℓ, e_uv) = M_ℓ(h_v^ℓ, h_u^ℓ, e_vu) and e_uv = e_vu, then we
may apply message passing to undirected graphs as well.
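One layer of eqs. (3.12)-(3.13) can be sketched in a few lines of Python. Here the message function simply forwards the neighbor's hidden state and the update averages it with the node's own state; real layers replace both with learned functions, and all names here are illustrative.

```python
def mean_aggregate(messages):
    # 'aggr' in eq. (3.12): any differentiable, permutation-invariant reduction
    n = len(messages)
    return [sum(m[i] for m in messages) / n for i in range(len(messages[0]))]

def message_passing_step(h, adj, edge_feat, M, U):
    """One layer of eqs. (3.12)-(3.13): aggregate M(h_u, h_v, e_uv) over the
    neighborhood of each node u, then update h_u via U(h_u, m_u)."""
    h_next = []
    for u, nbrs in enumerate(adj):
        msgs = [M(h[u], h[v], edge_feat[(u, v)]) for v in nbrs]
        m_u = mean_aggregate(msgs)
        h_next.append(U(h[u], m_u))
    return h_next

# toy path graph 0 - 1 - 2 with scalar hidden states and dummy edge features
h = [[0.0], [1.0], [2.0]]
adj = [[1], [0, 2], [1]]
e = {(u, v): None for u in range(3) for v in adj[u]}
M = lambda hu, hv, euv: hv                                  # identity message
U = lambda hu, mu: [0.5 * (a + b) for a, b in zip(hu, mu)]  # average update
h1 = message_passing_step(h, adj, e, M, U)   # [[0.5], [1.0], [1.5]]
```

Applying the step repeatedly spreads information one hop further each time, which is the growing receptive field described above.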
Message passing is an extremely flexible framework, and many convolutions can be defined
in this way. GeoBind provides interfaces for several message-passing convolutions which are
summarized below. For each convolution, we have a choice of what aggregation operation to
perform; in practice the mean aggregation is used unless otherwise stated. For every convolution
we can also add an optional bias term to the update function, U'_ℓ = U_ℓ(h_u^ℓ, m_u^{ℓ+1}) + b, or an optional
root node term, U'_ℓ = U_ℓ(h_u^ℓ, m_u^{ℓ+1}) + Θ_ℓ h_u^ℓ, if the original definition of the update function does
not include such terms. In the definitions below we use the original definitions as written by the
authors.
Figure 3.4: An overview of the message passing convolution operation. In the first layer, messages
are passed between the node u and its first-order neighbors. In the second layer, messages are again
exchanged between first-order neighbors; however, each node's hidden state already contains information
from its nearest neighbors, therefore the messages being passed to node u contain both first- and second-
order information. Hence the size of the receptive field of the convolution increases by one at each layer.
Gaussian Mixture Model Convolution (GMMConv): GMMConv [87] uses a Gaussian
mixture model over edge features of neighboring nodes and learned weights over node
features. The message function is given by

M_ℓ = (1/K) Σ_{k=1}^{K} w_k(e_uv) Θ_k h_v^ℓ

where

w_k(e) = exp( −(1/2) (e − μ_k)^T Σ_k^{−1} (e − μ_k) )

and μ_k, Σ_k and Θ_k are learnable parameters. K is a hyperparameter
of the convolution and determines the number of Gaussian kernels in the mixture model.
The update function is simply U_ℓ = m_u^{ℓ+1}.
Edge-Conditioned Convolution (ECConv): The ECConv [115] weights each neighboring
node feature with a learned weight matrix over edge features: M_ℓ = h_v^ℓ · H_Θ(e_uv), where H_Θ is
a simple neural network with learnable weights Θ that learns weights over edge features. The
exact structure of this network can be considered a hyperparameter, but our experi-
ments have shown good success with a single hidden layer in a simple fully connected network. The
update function is given by U_ℓ = m_u^{ℓ+1} + b.
Crystal Graph Convolution (CGConv): CGConv [127] was proposed in the context of
predicting the material properties of crystal lattices. Its message function is given by

M_ℓ = σ(z_uv W_f + b_f) ⊙ g(z_uv W_s + b_s)

where σ is the sigmoid function, g is the softplus function,
z_uv = [h_u^ℓ, h_v^ℓ, e_uv] is the concatenation of neighbor features and the edge features that join them,
and W_f, b_f, W_s, b_s are the learnable parameters. The update function is given by U_ℓ = h_u^ℓ + m_u^{ℓ+1}.
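For scalar features the CGConv gating can be written out directly, as in the sketch below. The all-zero weights are toy values chosen so the output is easy to verify by hand; in the real convolution W_f, b_f, W_s, b_s are learned.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    return math.log1p(math.exp(x))

def cgconv_message(h_u, h_v, e_uv, W_f, b_f, W_s, b_s):
    """CGConv-style message: a sigmoid 'gate' times a softplus 'core', both
    linear in the concatenation z_uv = [h_u, h_v, e_uv]."""
    z = list(h_u) + list(h_v) + list(e_uv)
    zf = sum(w * zi for w, zi in zip(W_f, z)) + b_f
    zs = sum(w * zi for w, zi in zip(W_s, z)) + b_s
    return sigmoid(zf) * softplus(zs)

# with all-zero weights the gate is sigmoid(0) = 0.5 and the core is
# softplus(0) = ln 2, so the message is 0.5 * ln 2 regardless of inputs
m = cgconv_message([1.0], [2.0], [0.3], [0.0] * 3, 0.0, [0.0] * 3, 0.0)
```

The sigmoid factor acts as a learned per-edge gate in [0, 1], which lets the layer suppress messages from uninformative neighbors.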
Point Pair Feature Convolution (PPFConv): The PPFConv [32] is only defined on
meshes. Unlike the previous convolutions, which operate on general graphs given some edge
features of the graph, PPFConv operates on specific point-pair features which are defined for
mesh edges and encode geometric relationships between pairs of vertices. The message function is
M_ℓ = H_Θ(h_v^ℓ, f_uv), where H_Θ is a learnable function given by a fully-connected neural network with
learnable parameters and

f_uv = [ ‖e_uv‖, ∠(e_uv, n_u), ∠(e_uv, n_v), ∠(n_u, n_v) ]

where e_uv = x_v − x_u
is the displacement vector from vertex u to v and n_u is the normal vector of vertex u. The
authors use the max aggregation to compute the final message m_u^{ℓ+1}. The update function of the
PPFConv is U_ℓ = γ_Θ(m_u^{ℓ+1}), where γ_Θ is a learnable function given by a neural network with
learnable weights Θ.
Feature-Steered Graph Convolution (FeaStConv): FeaStConv [119] is unique among the
convolutions listed so far in that it does not consider edge features. It defines a convolution kernel
which is "spread out" among a set of attention heads. The message function is defined by

M_ℓ = Σ_{h=1}^{H} q_h(h_u^ℓ, h_v^ℓ) W_h h_v^ℓ

where q_h(h_u^ℓ, h_v^ℓ) = softmax_h(a_h^T (h_v^ℓ − h_u^ℓ) + b_h) such that Σ_h q_h(h_u^ℓ, h_v^ℓ) = 1,
and a_h, b_h are learnable parameters. The idea is that the term q_h soft-assigns each node in the
neighborhood of u to a learnable weight matrix W_h based on the similarity between u and v.
There are unlimited ways to define graph convolutions using message passing, and many more
have been proposed in the literature. Those above represent a small subset for which GeoBind
provides explicit implementations, but in practice the user can use arbitrary convolutions with the
GeoBind framework.
3.3.3.2 Graph Pooling
Pooling is an important operation used widely in conventional convolutional neural networks.
Most of the effort in graph neural networks has been focused on the convolution operation, but
attempts at generalizing pooling have been made as well. The idea is to iteratively reduce the
resolution of the graph by combining two or more nodes into a single node which represents the
entire group. The process of doing this can be fixed or depend on some kind of learned
scoring function. At each pooling step the total number of nodes and edges in the graph is reduced
until a final pooled graph is reached. The original graph structure can then be restored using
unpooling operations, whereby information about the graph structure before the pooling operation
occurred is used to restore the graph structure to its previous state.

In the context of graph neural networks, pooling can help increase the receptive field of convo-
lutions, without increasing the depth of the network, by decreasing the size of the graph. It can
also help smooth predictions over nodes by grouping similar nodes and then propagating learned
features from the lowest resolution of the graph up to the original resolution during the unpooling
stage. This is beneficial if the predictions of the network are expected to be smooth, as we might
expect in the case of predicting macromolecular binding. By a smooth prediction we mean that
the value predicted for each node (e.g. the probability of being a binding site) does not change
too much between adjacent nodes.
Several strategies have been developed for pooling on graphs. Ying et al. [129] proposed
DiffPool, which learns to soft-assign each node to a fixed number of clusters based on their features,
and all nodes within a cluster are then pooled. In their approach the number of clusters must be
chosen in advance. Gao and Ji [46] introduced Top-K pooling, where each node is assigned a learned
score, only the top K nodes are kept, and the remaining |G| − K nodes are dropped along with
any edges containing them. Lee et al. [70] improved upon Top-K pooling by introducing a graph
convolution into the pooling layer which computes an attention mask before the node scores are
computed, which they call SAGPool. MinCut pooling was introduced by Bianchi et al. [12] and
learns a soft cluster assignment vector for each node that depends on the node features and the
graph connectivity. Like DiffPool, a fixed number of clusters must be chosen in advance.
GeoBind operates on graphs which represent molecular surface meshes. A mesh is a discrete
representation of a continuous manifold, which adds additional constraints to the graph structure.
It is possible to loosen this requirement and pool the mesh in a way that destroys the mesh
properties, simply treating it as a general graph after pooling, but we prefer to enforce that the
graph remain a valid mesh at every layer in the network. In addition, the methods summarized
above require us to choose some fixed number of clusters or fixed number of nodes to drop from
the graph, and it is unclear a priori how to choose these numbers for molecular surface meshes,
which can vary greatly in size and shape. We can satisfy the requirement that we preserve
the mesh structure of the graph and avoid hard-coded hyperparameters with the EdgePool [34]
pooling method. EdgePool combines nodes by collapsing edges between nodes at each pooling
layer based on a learned edge-scoring function. Edge collapse is a well-known method of reducing
mesh resolution and leaves the mesh structure intact, so this is an ideal pooling strategy for our
purposes.
EdgePool learns a scoring function over edges in a graph. For each edge, a raw score is
computed from a combination of node features as

r(e_uv) = W · [h_u^ℓ, h_v^ℓ] + b   (3.14)

where W is a learned weight vector and b a learned bias. These scores are then normalized
over all incoming edges of a node to produce a final normalized score,

s_uv = softmax_{v∈N(u)}(r(e_uv)) + 0.5   (3.15)
Figure 3.5: An overview of the edge-collapse procedure used for edge pooling. An edge is identified for
pooling based on a learned edge-scoring function. The edge is then removed and the two adjacent nodes
are merged into a new node, where the position of the new node is given by the average of the two merged
nodes. The positions of the original two nodes are stored, and in the unpooling step the two nodes which
were previously deleted are restored along with the edge joining them. The hidden states of the restored
nodes are given by eq. (3.16).
where the additive constant was chosen by the author to normalize the mean score to 1, which
he claims improves the flow of gradients. The hidden state of the new node is given by a symmetric
summation of the pooled nodes' features, h_uv^{ℓ+1} = s_uv(h_u^ℓ + h_v^ℓ). Multiplication by the edge score
allows gradients to propagate backwards. The pooling procedure iteratively contracts edges with
the highest scores, skipping edges which are incident to a newly merged node. This procedure
pools roughly 50% of the nodes in the graph without creating any holes or changing the mesh
topology, and produces a valid mesh with lower resolution than the original. To unpool the graph,
we simply split every node which was pooled and replace the contracted edge between them. The
node features are interpolated by

h_u^{ℓ+1} = h_v^{ℓ+1} = h_uv^ℓ / s_uv   (3.16)
EdgePool allows the network to iteratively reduce the resolution of an input mesh in a
learned way. This can allow the network to focus on particular regions or reduce the resolution in
a more uniform manner, and it increases the receptive field of graph convolutions. It may also allow
the network to be more robust to small-scale variations of the surface, since we expect these to
be reduced as the resolution is decreased. We have improved upon EdgePool with a minor
modification by including edge features in the computation of the edge scores as follows:

r(e_uv) = W · [h_u^ℓ, h_v^ℓ, e_uv] + b   (3.17)

Our reasoning is that when choosing which edges to collapse we want to allow the network
to consider the geometrical relationships that edges encode and more directly utilize the mesh
geometry. We have observed good performance increases with this modification. Since we are
pooling meshes and not just graphs, we must take care to also update the vertex positions when
pooling two nodes. In this case we simply average the positions of the two merged nodes whenever
an edge is collapsed, and return unpooled vertices to their original positions in the mesh during
the unpooling operation.
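The scoring and collapse steps can be sketched as follows. This toy version uses the score of eq. (3.14), without our edge-feature modification, and the softmax normalization of eq. (3.15); the weights and the single-edge collapse are illustrative, not the full greedy contraction loop.

```python
import math

def edge_scores(h, edges, W, b):
    """Raw scores r(e_uv) = W . [h_u, h_v] + b (eq. 3.14), softmax-normalized
    over the edges leaving each node u and shifted by +0.5 (eq. 3.15)."""
    raw = {(u, v): sum(w * x for w, x in zip(W, h[u] + h[v])) + b
           for (u, v) in edges}
    scores = {}
    for (u, v) in edges:
        group = [r for (a, _), r in raw.items() if a == u]
        zmax = max(group)  # subtract the max for numerical stability
        denom = sum(math.exp(r - zmax) for r in group)
        scores[(u, v)] = math.exp(raw[(u, v)] - zmax) / denom + 0.5
    return scores

def collapse_edge(h, pos, edge, s):
    """Merge the endpoints of an edge: the new hidden state is s*(h_u + h_v)
    and the new vertex position is the midpoint (our mesh-aware choice)."""
    u, v = edge
    h_new = [s * (a + b) for a, b in zip(h[u], h[v])]
    p_new = tuple(0.5 * (a + b) for a, b in zip(pos[u], pos[v]))
    return h_new, p_new

h = [[1.0], [2.0], [3.0]]
pos = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
scores = edge_scores(h, [(0, 1), (0, 2)], [0.0, 0.0], 0.0)  # uniform: all 1.0
h_new, p_new = collapse_edge(h, pos, (0, 1), scores[(0, 1)])
```

Unpooling then restores both endpoints with h/s per eq. (3.16) and returns the stored vertex positions.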
3.3.4 Datasets
For this work we applied the GeoBind framework to predicting two types of macromolecular
binding: protein-dsDNA and protein-ssDNA binding. For each case we used the DNAproDB
database (chapter 2) [102, 103] to identify candidate protein-DNA complex structures based on
structural properties of the bound DNA and protein (provided in the database). A list of these
criteria is given in table 3.4. Within each pool of candidate protein-DNA complexes, we clustered
all protein chains present at 70% sequence similarity using CD-HIT [73]. We then randomly
sampled protein-DNA complexes such that no protein-chain cluster was represented more than
n times in the final dataset, with n = 3 for the protein-ssDNA (SBP) dataset and n = 2 for
the protein-dsDNA (DBP) dataset. Finally, the selected complex structures were split into
training/validation datasets such that there were no overlaps in clusters between training and
validation. The final number of protein-DNA complexes in each dataset is given in table 3.4.
Protein-DNA Complex Structure Datasets

Dataset         Train Size   Val. Size   Structural Criteria
protein-dsDNA   501          62          B-form helix only; no chain breaks; DNA not enveloped; no chemical modifications; protein chain length ∈ [40, 1000]; W.C. base pairing only
protein-ssDNA   100          14          ssDNA only; protein chain length ∈ [40, 1000]; no chemical modifications; protein has known ssDNA binding

Table 3.4: Characteristics of the structural datasets of protein-DNA complexes used to generate
training and validation data for the models presented below. The size of the dataset refers to the
number of protein entities (see methods section of [103]). The various structural criteria used to
select the candidate structures from all possible protein-ssDNA and protein-dsDNA
structures are derived directly from DNAproDB data.
For every structure in the datasets the DNA was removed, and molecular surface meshes and
mesh features were generated for the protein structure. The mesh was generated using NanoShaper
[31]; the molecular skin surface was chosen, with a skin parameter of 0.45 and a grid scale of 2.0.
Vertices in the surface mesh corresponding to DNA binding sites were identified as any vertex
within 4.0 Å of a nucleotide, accounting for line-of-sight occlusions. We note that the boundary of a
binding site is not well defined; this is because both the protein and DNA are flexible molecules,
and we expect the outer edges of the binding site to have some fractional occupancy of DNA in
a bound state due to the vibrational and diffusive motion of the DNA and protein. Therefore, to
avoid introducing any bias in the training data, we buffer the boundary of the observed binding
sites and exclude a small region of the vertices on the periphery. These vertices are not used for
training or for evaluating model performance. The resulting molecular surface mesh labels are
highly imbalanced, with only 12.4% binding-site vertices for the SBP training data and 13.6%
binding-site vertices for the DBP data (after masking).
After all the processing steps are complete, for each protein-DNA complex structure we obtain
a list of DNA-binding residues, a molecular surface mesh, features over the mesh, and labels over
the mesh that identify vertices as binding sites or not. For both datasets we used the same features
for training the models, a total of 23 geometrical, chemical and electrostatic descriptors for vertex
features, and the four PPFNet edge features (see sec. 3.3.2.4).
We also compiled a negative-control dataset (NC) of proteins with no known DNA-binding
function. We chose proteins which are well studied, whose function is known unambiguously,
and which should have negligible affinity for DNA. The procedure for generating meshes and features
was the same as for the SBP and DBP datasets, except that all vertices in the NC data were
non-binding sites. The following proteins were used to create the NC dataset: thrombin (PDB ID:
3shc), myoglobin (PDB ID: 1mnh) and rhodopsin (PDB ID: 1gue).
3.3.5 Network Design and Training Procedure
For each dataset we trained and evaluated many models with different architectures, convolution
kernels, use of pooling or not, size of the node hidden states, number of layers, etc. We found that
graph U-net [46] type architectures performed well, so our focus was on variations of these models.
A sketch of our network design is shown in fig. 3.6. The major choices affecting model performance
are the type of convolution, the number of pooling layers, the number of convolutional layers,
whether to concatenate or sum skip-connections, and the size of the hidden representation.
We trained U-net models on the DBP and SBP datasets for 30 and 60 epochs respectively,
using the Adam [63] stochastic gradient descent algorithm with varying initial learning rates.
Batch sizes ranged from 2-6 graphs per batch (see the PyTorch Geometric [42] documentation for
an explanation of how graphs are batched). We minimized the binary cross-entropy loss function,
and weighted each class by a term that accounts for class imbalance. Models with the highest BA
scores on the validation data were chosen, and the results are presented in section 3.4.
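The class-weighted loss can be sketched as follows. The inverse-frequency weighting shown is one common choice and is assumed here purely for illustration; the exact weighting term used in training is a detail not specified above.

```python
import math

def weighted_bce(probs, labels, w_pos, w_neg):
    """Binary cross-entropy over vertices, with per-class weights to
    counteract the ~12-13% positive-class imbalance."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        w = w_pos if y == 1 else w_neg
        total -= w * (y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps))
    return total / len(probs)

# inverse-frequency weights for a 12.5% positive class (illustrative)
w_pos, w_neg = 1.0 / 0.125, 1.0 / 0.875
loss = weighted_bce([0.9, 0.2, 0.6], [1, 0, 1], w_pos, w_neg)
```

With uniform weights this reduces to the standard binary cross-entropy, so the weighted form is a strict generalization.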
3.3.6 Evaluation Metrics
The outputs of the SBP and DBP models are probabilities over mesh vertices for each vertex to
be a binding site (positive class) or not (negative class). We can evaluate the performance of the
models using binary classification metrics. By choosing some threshold value τ ∈ [0, 1], we can
assign predicted labels to each vertex, where y_i = y_+ if p(y_i = y_+) ≥ τ. We can then compare
the predicted labels to the ground-truth training labels for these vertices and compute various
metrics. For the reported metrics which require a threshold on the predicted probabilities, we
used a threshold value that maximizes the BA score.

Definitions for the metrics we computed are given below. They are defined in terms of true
positives (TP), positive predictions which are correct; true negatives (TN), negative predictions
which are correct; and false positives (FP) and false negatives (FN).
Figure 3.6: A schematic of a graph U-net model with two pooling layers and two linear (lin) layers. The
first linear layer transforms the number of input features to a lower-dimensional space, and can be thought
of as a learned dimensionality-reduction layer. A number of convolutional layers are then applied, followed
by a pooling layer. Each pooling layer reduces the resolution of the mesh by a factor of approximately two.
One or more convolutions are applied after each pooling step. Once the final pooling level is reached,
unpooling operations are performed in which the graph structure is restored at each step, again followed
by at least one convolution. Skip connections are shown as gray arrows. These combine the hidden
representations from earlier layers of the same graph resolution, before the pooling/unpooling operation
was performed. Including these connections generally allows gradients to flow better through the network.
The final layer of the network is one or more linear layers which perform the final dimensionality reduction
to a binary probability distribution for each vertex. Every layer in the network is typically followed by a
non-linear activation function such as ReLU, save for the last layer, which uses softmax.
Balanced Accuracy (BA): Balanced accuracy is the mean of the true-positive rate and
true-negative rate. It is the mean probability of the model making a correct prediction, which
accounts for class imbalance.

BA = (1/2) ( TP/(TP + FN) + TN/(TN + FP) )   (3.18)

Recall: Recall is equivalent to the true-positive rate (TPR) and measures the fraction of the
positive class which is correctly identified.

recall = TP/(TP + FN)   (3.19)

Precision: Precision measures the probability that a positive prediction is correct. If the
model over-predicts the positive class, the precision score will be lower.

precision = TP/(TP + FP)   (3.20)

AUROC: The area under the ROC curve is the area under the curve formed when plotting
TPR(τ) vs. FPR(τ) as τ varies from 0 to 1. It is the probability that a randomly chosen positive
sample has a higher probability p_+ than a randomly chosen negative sample, and varies from 0.5
to 1.0, the latter being a perfect classifier.

AUPRC: AUPRC is the area under the precision-recall curve, and measures the area under
the curve formed by plotting precision(τ) vs. recall(τ) as τ varies from 0 to 1. A perfect classifier
will have a score of 1.0.
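The thresholded metrics above can be computed as in the sketch below, which assumes both classes appear among the predictions so that no denominator is zero; the function name is illustrative.

```python
def binary_metrics(probs, labels, tau):
    """Threshold vertex probabilities at tau and compute BA, recall and
    precision from the resulting confusion counts (eqs. 3.18-3.20)."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= tau else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    recall = tp / (tp + fn)       # true-positive rate
    precision = tp / (tp + fp)
    tnr = tn / (tn + fp)          # true-negative rate
    ba = 0.5 * (recall + tnr)
    return ba, recall, precision

ba, rec, prec = binary_metrics([0.9, 0.8, 0.4, 0.1], [1, 0, 1, 0], 0.5)
# one TP, one FP, one FN, one TN -> ba = recall = precision = 0.5
```

Sweeping tau over a grid and keeping the value with the highest BA reproduces the threshold-selection procedure described above.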
3.4 Results of Protein-DNA Binding Prediction
We trained graph convolutional neural networks using the GeoBind framework to predict DNA-
binding regions on protein molecular surfaces from the structure of the unbound protein. Two
datasets were used to generate training data, one of protein-ssDNA complexes (SBP) and the
other of protein-dsDNA complexes (DBP). Both datasets used the same set of training features
and similar model architectures. The state of each vertex is considered a random variable with
two possible states (binding site, non-binding site), and probabilities for each state were predicted
by the models. These probabilities were then compared to the binary training labels generated
from the observed conformation of the bound DNA in the training data, and performance metrics
were computed. The results are shown in table 3.5.

We observe good performance as measured by AUROC, BA and Recall. When compared with
table 3.1, we see that our models greatly outperform older methods. The performance of our
models on AUPRC and Precision is lower, which is indicative of an over-prediction of binding
sites. Our dataset is highly imbalanced, with approximately 12-13% of training labels being the
positive class (binding site). Therefore, a relatively small number of false-positive predictions,
Protein-DNA Binding Site Prediction Metrics

Dataset          AUROC   AUPRC   BA      Precision   Recall
SBP Train        0.962   0.790   0.906   0.540       0.923
SBP Validation   0.924   0.546   0.858   0.333       0.888
DBP Train        0.927   0.692   0.852   0.442       0.878
DBP Validation   0.936   0.773   0.860   0.495       0.873

Table 3.5: Performance measures for predicted DNA binding sites on protein molecular surface
meshes. Deep GNNs were trained using the GeoBind framework to predict vertex-level probabil-
ities of belonging to a binding site or not, and predictions were compared to ground-truth labels
obtained from the observed protein-DNA complex in the training data. Metrics are reported by
aggregating predictions over the entire datasets. The threshold values chosen for Balanced
Accuracy (BA), Precision and Recall were determined by maximizing BA over the training dataset,
but both were close to 0.5.
relative to the total size of the training set, can have a large impact on the Precision and AUPRC
scores. Figure 3.7 shows binding site predictions for select proteins from the SBP and DBP
validation datasets respectively. We observe very few, if any, false negative predictions in these
examples, and we note that nearly all the loss of performance in BA and AUROC comes from
false positive predictions.
Fig. 3.8 shows the distribution of vertex probabilities (for the positive class) over the SBP and
DBP training data for the positive and negative classes (ground-truth). We see very clear separations
between the two classes, with the majority of the positive class having a high probability
p+ = p(y = 1 | E, X_V, X_E) and the negative class a low p+, which is ideal. The dashed vertical line
shows the chosen threshold used to measure the model performances given in table 3.5. The
tail of the negative distribution which lies to the right of this dashed line represents false positive
predictions, because these are vertices that belong to the negative class yet have a probability
p+ above the threshold, and hence are misclassified. For the SBP data in panel A, we can see that
the numbers of false positive and false negative predictions are essentially the same; however,
because there are far fewer positive-class vertices overall, the false positive predictions greatly
bring down the AUPRC and Precision scores, relative to the effect of false negatives on AUROC
and BA. We could marginally improve Precision by shifting the threshold further to the right, at
the cost of lower BA, but there is fundamentally a trade-off between these measures. For the DBP
distributions in
Figure 3.7: Binding site predictions for top-performing proteins from the SBP and DBP validation
datasets. These predictions are on proteins the model never saw during training. The left panel
shows four protein surface meshes colored according to the predicted class, with each mesh rotated to
give different views. Green and orange are correct predictions, corresponding to true negatives and true
positives, while red and purple are incorrect predictions, corresponding to false positives and false negatives,
respectively. White vertices are masked and were ignored during evaluation of performance metrics. We
see very few false negative predictions, with most of the error being false positives. We also note that
false positives tend to occur around the edges of a correctly identified binding site. The table gives the
individual per-protein performance measurements and the protein name along with the PDB identifier
that corresponds to the structure entry.
panel B, the situation is somewhat worse: here we see a noticeable broad peak in the tail of the
negative distribution, far to the right, indicating false positives with higher p+, which accounts for
the lower AUPRC and Precision for this data. Choice of threshold aside, our main focus
moving forward is to improve the model's predictions by reducing the tails present in the negative
class without shifting the centers of the peaks too much.
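The threshold-selection procedure described above (maximizing BA over the training data) can be sketched as follows. This is an illustrative reimplementation, not the GeoBind code itself, and the 101-step grid is an arbitrary choice:

```python
import numpy as np

def select_threshold(y_true, p_pos, num=101):
    """Pick the decision threshold that maximizes balanced accuracy.

    y_true: ground-truth vertex labels (0 = non-binding, 1 = binding)
    p_pos:  predicted positive-class probabilities p+ for each vertex
    """
    best_t, best_ba = 0.5, -1.0
    for t in np.linspace(0.0, 1.0, num):
        y_pred = p_pos >= t
        tp = np.sum((y_pred == 1) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        ba = 0.5 * (recall + specificity)  # balanced accuracy
        if ba > best_ba:
            best_t, best_ba = t, ba
    return best_t, best_ba
```

On well-separated distributions like those in fig. 3.8, the maximizing threshold sits inside the wide valley between the two peaks, which is why BA is insensitive to the exact choice.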
Figure 3.8: The distribution of predicted probabilities p+ over the training data for the SBP and DBP
models. The predicted probabilities for the ground-truth negative and positive classes (non-binding site
and binding site, respectively) are plotted as separate distributions. These plots show peaks near 0 for the
negative class and near 1 for the positive class, which is ideal for a binary classifier. The wide
valley near the center of the plots shows that our models are not particularly sensitive to the choice of
threshold.

In fig. 3.9 we have plotted the distribution of AUROC scores, and AUROC versus the size of the
protein structure corresponding to each molecular surface mesh, for the SBP and DBP datasets.
On the left we see that for both models the AUROC distributions are centered near the overall
AUROC values, with long tails that extend to lower values. The right shows how model
performance depends on the size of the structure; evidently there is no correlation between
structure size and model performance. This is due to the local nature of the graph convolutions
our models are based on: they depend only on locally aggregated mesh features, and the weights
are shared across the entire graph, so by design they cannot depend on global properties of the
mesh such as surface area or volume. This is a desirable property for our models, because biological
macromolecules vary widely in overall size and shape, and hence our approach should generalize
well to a variety of macromolecular structures and their corresponding surface meshes.
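For reference, the per-mesh AUROC values plotted here can be computed with a standard rank-based estimator; this sketch ignores tied scores, which a production evaluation would need to handle:

```python
import numpy as np

def auroc(y_true, scores):
    """Rank-based AUROC: the probability that a randomly chosen positive
    vertex receives a higher score than a randomly chosen negative vertex."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # 1-based ranks by score
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    # Mann-Whitney U statistic normalized by the number of pos/neg pairs
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Aggregating over all non-masked vertices of one mesh gives one data point in fig. 3.9.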
Using the SBP and DBP models, we performed binding site predictions on a set of negative-control
(NC) proteins that have no known DNA binding function. We thus expect that few
(ideally no) vertices should be classified as DNA binding sites by the SBP or DBP model.
The results of our predictions on the NC proteins for each model are shown in table 3.6. The
SBP and DBP models perform as well as or better on the NC dataset compared to their average
performance on the SBP and DBP datasets, despite being trained only on DNA binding proteins.
This gives us confidence that the space of geometrical, chemical and electrostatic features over
which our models were trained is general enough to capture informative features for a wide variety
of protein structures and functions, and that our models truly are capturing the signatures of
DNA binding sites. All of the error in the NC predictions comes from false positives, and it is
informative to look at a sample prediction, as shown in figure 3.10. Here we show predictions of
the SBP model on the thrombin protein molecular surface mesh. The regions in red are predicted
Figure 3.9: Characteristics of AUROC scores over the training and validation datasets for the SBP and
DBP models. Each data point represents the AUROC over a particular protein entity surface mesh, where
predictions over all non-masked vertices in the mesh were aggregated. The histograms show the distribution
of AUROC; for both models, the majority of the distribution lies between 0.9 and 1.0. The plots on the
right side of the figure show AUROC versus the size of the protein structure in total number of atoms.
These plots show that the performance of our models is essentially independent of the size of the
structure, due to the local message-passing framework of the graph convolutions.
binding sites, which are false positives; however, they are not totally erroneous, and we argue they
are biologically plausible. In panel A we see a region that corresponds to a highly exposed tryptophan
residue with the plane of its aromatic side chain facing outward. This geometry is ideal for hydrophobic
stacking with exposed nucleotide bases in ssDNA, so it is reasonable that the model assigns a high
probability of ssDNA binding in this region. In panel B we see a similar case of an exposed
phenylalanine, and also a cluster of exposed positively charged side chains which correspond
to a region that can form attractive electrostatic interactions with the negatively charged DNA
backbone.

Figure 3.10: Predictions by the SBP model for thrombin, a non-DNA binding protein from the NC dataset.
Any binding site prediction is a false positive by definition, as we expect the entire surface to be devoid of
DNA binding sites. The model performs well, achieving over 0.94 BA, despite never having been trained
on non-DNA binding proteins. However, the areas where it does make false predictions are physically
plausible, such as regions with exposed hydrophobic side chains or clusters of positively charged residues.
We propose that these false binding site predictions do actually possess signatures of DNA
binding sites. However, they are not functional binding sites because they are simply too small in
area. Binding free energy has an enthalpic and an entropic term, ΔG = ΔH − TΔS, and many of the
interactions contributing to ΔH and ΔS scale directly with surface area (such as hydrophobic
stacking, van der Waals interactions, etc.). Therefore, in order to correctly distinguish a
functional DNA binding site from a region that merely carries positive DNA binding signatures, we must
consider the surface area of the region. Our models do not explicitly take this into consideration,
and thus are unable to make this differentiation. Further improvement of the models
should therefore include mechanisms that allow the model to condition predictions on the spatial extent of
the possible binding site.
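One simple mechanism along these lines, not part of the models presented here, would be a post-processing step that discards predicted patches whose total surface area falls below a threshold. A sketch, assuming a triangle mesh given as face indices and a precomputed per-vertex area:

```python
import numpy as np
from collections import deque

def filter_small_patches(faces, vertex_area, pred, min_area):
    """Keep only connected patches of positively predicted vertices whose
    total surface area is at least min_area.

    faces:       (F, 3) int array of triangle vertex indices
    vertex_area: (V,) per-vertex area (e.g. 1/3 of adjacent triangle areas)
    pred:        (V,) boolean array of positive vertex predictions
    """
    V = len(pred)
    # adjacency restricted to edges between positively predicted vertices
    adj = [[] for _ in range(V)]
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            if pred[u] and pred[v]:
                adj[u].append(v)
                adj[v].append(u)
    out = np.zeros(V, dtype=bool)
    seen = np.zeros(V, dtype=bool)
    for s in range(V):
        if pred[s] and not seen[s]:
            # breadth-first search collects one connected patch
            patch, q = [s], deque([s])
            seen[s] = True
            while q:
                u = q.popleft()
                for v in adj[u]:
                    if not seen[v]:
                        seen[v] = True
                        patch.append(v)
                        q.append(v)
            if vertex_area[patch].sum() >= min_area:
                out[patch] = True
    return out
```

The area threshold `min_area` is a hypothetical hyperparameter; a physically motivated value could be taken from the distribution of observed binding-site areas in the training data.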
Negative Control Prediction Metrics
protein BA (SBP) BA (DBP)
thrombin 0.948 0.918
myoglobin 0.913 0.872
rhodopsin 0.977 0.914
Table 3.6: Performance measures for prediction of DNA binding regions on protein molecular
surface meshes for the negative control dataset. Metrics are reported per protein by aggregating
all vertices within a particular mesh. The threshold values chosen for Balanced Accuracy (BA)
were the same as those used in calculating the metrics in table 3.5.
3.5 Discussion and Future Work
GeoBind provides a framework and a reusable set of tools for constructing molecular surface meshes
from macromolecular structures, computing features over the edges and vertices of the mesh, and
building, training, and evaluating graph neural networks that predict graph signals over the vertices
of the mesh. We have demonstrated the potential of our approach by applying it to predicting
DNA binding sites on the surfaces of ssDNA and dsDNA binding proteins, using structural data
provided by the PDB [11] and DNAproDB [102, 103]. Our method provides a way of harnessing
the vast amount of information that structural data provides using state-of-the-art approaches in
machine learning and artificial intelligence.
We provide wrappers for many graph convolutions that were developed for other applications,
and our results show that many of them work well in the context presented in this work. However,
more work can be done on developing new convolutions or adapting existing ones to
work within the message-passing framework. One limitation of all the convolutions presented here
is that they only pass information to their nearest neighbors. It is desirable to have meshes with
high resolution because this reduces noise and artifacts when computing geometrical properties of
the surface; however, higher resolution means shorter edge lengths, so there is a trade-off between
the spatial extent of a convolution's receptive field and the resolution of the mesh. Masci et al. [82]
proposed a convolution which is independent of the mesh resolution and aggregates nearby vertices
based on geodesic distance. They define a two-dimensional kernel based on a polar
coordinate system, with the radial coordinate corresponding to geodesic distance. Their approach
is limited, however, by the ambiguously defined angular coordinate. To solve this, they
apply angular max-pooling over a set of rotated filters. This approach requires pre-computing
geodesic patches and is wasteful because many filters must be learned while most will be discarded
during the angular max-pooling step. Nevertheless, the concept of aggregating information based on
geodesic distance is particularly salient for the case of predicting macromolecular binding,
as it means that the receptive field of the convolution is independent of the resolution of the
mesh and can be chosen to correspond to a distance scale relevant to the particular
application. We plan to explore ways of utilizing this idea in a message-passing framework and
developing resolution-independent convolutions.
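To illustrate the idea of a resolution-independent receptive field, the following sketch aggregates vertex features over all vertices within a fixed spatial radius. It uses Euclidean distance as a cheap stand-in for true geodesic distance, which a real implementation would compute on the mesh surface:

```python
import numpy as np

def radius_aggregate(coords, features, radius):
    """Mean-aggregate vertex features over a fixed spatial radius, so the
    receptive field is set by a physical length scale rather than by the
    mesh edge length.

    coords:   (V, 3) vertex positions
    features: (V, F) vertex features
    radius:   aggregation radius, in the same units as coords
    """
    # pairwise distances (Euclidean stand-in for geodesic distance)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    mask = d <= radius                          # (V, V) neighborhood indicator
    w = mask / mask.sum(axis=1, keepdims=True)  # uniform weights per neighborhood
    return w @ features                         # mean over each neighborhood
```

The dense (V, V) distance matrix is only for clarity; a practical version would use a spatial index or precomputed geodesic patches, as in Masci et al. [82].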
A second major limitation of the DNA binding site prediction models presented here is that the
model is unaware of the spatial extent of predicted binding sites. This is because vertices are
predicted independently when using conventional loss functions, such as cross entropy. The neural
network learns an independent probability distribution over classes for each node in the graph, and
while these distributions do consider the features of neighboring vertices, they do not
consider the final predictions for neighboring vertices. Conditional random fields (CRFs) [117]
are graphical models which explicitly model the interdependencies of random variables. They
have seen wide application in computer vision tasks such as semantic segmentation [66, 118, 132].
Rather than classifying vertices of a graph, semantic segmentation aims to classify pixels in an
image (arranged as a grid), using their spatial relationships to segment different 'semantic objects'
such as a cat, dog, or automobile. Semantic segmentation is closely related to the task presented in
this work, and many approaches which have been successful in the domain of computer vision can
be generalized to the non-Euclidean domain. Indeed, it is possible to integrate CRFs with the neural
network framework we have developed here, either directly as a trainable layer in the network
or indirectly as a post-processing step that refines the predictions of the network. Future work will
explore the use of CRFs to condition the prediction of a vertex on its local neighborhood, which
we believe can improve the binding predictions of our models.
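As a crude stand-in for a full CRF, predictions can be conditioned on their neighborhoods by iteratively blending each vertex probability with the mean over its mesh neighbors. This is only an illustration of the smoothing effect, not a trained CRF layer:

```python
import numpy as np

def smooth_predictions(adj, p_pos, weight=0.5, iters=10):
    """CRF-flavored post-processing: iteratively pull each vertex's
    positive-class probability toward the mean of its mesh neighbors,
    encouraging spatially coherent binding-site predictions.

    adj:   list of neighbor-index lists, one per vertex
    p_pos: (V,) initial predicted probabilities from the network
    """
    p = np.asarray(p_pos, dtype=float).copy()
    for _ in range(iters):
        nbr_mean = np.array([p[n].mean() if n else p[i]
                             for i, n in enumerate(adj)])
        p = (1 - weight) * p + weight * nbr_mean  # convex blend keeps p in [0, 1]
    return p
```

A trainable alternative would implement the mean-field CRF inference of Chen et al. [18] as recurrent network layers; the blend weight here is an illustrative hyperparameter.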
We have treated the problem of predicting macromolecular binding as a fully supervised learning
problem (with the exception of masked vertices on the boundaries of binding sites); however, it is
very often the case that structural data is incomplete. Regions of a bound ligand or cofactor may
be missing due to experimental limitations, and it may be impossible to model the exact biological
complex. Further, the experimental conditions may alter the structure of the complexes or the
modes of binding, or may preclude other modes of binding which might be present in vivo but
do not appear during crystallization or imaging. Therefore, regions of the protein surface which
are identified as "non-binding sites" (negative class) based on the observed complex may indeed
have binding function. Thus, when treated as a binary classification problem, our ground-truth
training labels may have some degree of noise in the form of false negatives.
PU learning is a semi-supervised learning approach which has been developed to address just
such a situation [9]. Rather than treating binary classification as a task of learning from positive
(P) and negative (N) examples, PU learning treats training data as either positive or
unlabeled (U), with U being a mix of both P and N. The positive labels are known to
be positive, while unlabeled examples may be either positive or negative, with some prior p(y = P)
which may or may not be known. For structural data, it may be more appropriate to
view the regions of a molecular surface where binding has not been observed as unlabeled, rather
than as negative examples, because we cannot be sure that binding sites are not present in these
regions. We are exploring applying PU learning to our models and what applicability it has in
the context of graph neural networks.
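For illustration, a common non-negative PU risk estimate (in the style of Kiryo et al., 2017; not a method used in this work) combines losses on the labeled-positive and unlabeled sets with an assumed class prior:

```python
import numpy as np

def pu_risk(loss_pos, loss_unl_neg, loss_pos_neg, prior):
    """Non-negative PU risk estimate.

    loss_pos:     per-example loss of labeled positives, labeled as positive
    loss_unl_neg: per-example loss of unlabeled examples, labeled as negative
    loss_pos_neg: per-example loss of labeled positives, labeled as negative
    prior:        assumed class prior p(y = P)
    """
    pos_risk = prior * np.mean(loss_pos)
    # subtract the positive contamination expected inside the unlabeled set
    neg_risk = np.mean(loss_unl_neg) - prior * np.mean(loss_pos_neg)
    return pos_risk + max(0.0, neg_risk)  # clamp avoids negative empirical risk
```

Training a vertex classifier with this risk in place of cross entropy would let the model treat unobserved surface regions as unlabeled rather than as confirmed negatives; estimating the prior for surface vertices is itself an open question.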
In principle, GeoBind can be applied to many molecular systems. Some of the features we have
used in our experiments are specific to protein structures; however, we expect that, given sufficient
data, our models could be applied to other molecular systems such as protein-protein binding,
protein-drug binding, DNA-drug binding, or protein-lipid binding. Further, our models are not
limited to binary classification: we could train a model to predict multiple classes, for example
to predict binding to a family of different drugs, or to predict ssDNA and dsDNA binding sites
simultaneously. We could also increase the granularity of binding site predictions, for example
predicting major groove, minor groove, and backbone binding regions in the case of dsDNA. Our
method is quite general, and we imagine there are many future applications where it will be
successful.
Chapter 4
Discussion – Physical Modeling versus Machine Learning and Future Outlook
In this thesis I have presented data-driven approaches which utilize structural data to study
protein-DNA interactions. This is in contrast to more common sequencing-based approaches,
which are based almost entirely on big data but do not consider structure at all, and which form
the basis of most bioinformatics methodologies. Physical modeling approaches, on the other hand,
which have been the more traditional paths taken in biophysics and structural biology and rely
primarily on the physics of electrostatics, statistical mechanics and thermodynamics, are not
data-driven.
The primary objective of our work is to establish an end-to-end methodology for applying machine
learning and big-data methods to structural data and the study of protein-DNA interactions.
Machine learning is a powerful statistical approach which has come to dominate many diverse fields
of research, including computer vision, natural language processing, speech recognition, bioinformatics,
recommendation systems, machine translation, robotic locomotion, and more. The meteoric rise of
machine learning in recent decades has been driven by advances in computing technology and the
availability of large datasets, along with a huge effort and focus of academic research. Traditional
methods, which are rooted in physical models and rely primarily on the tools of mathematical and
numerical analysis, have generally seen far less innovation or progress during the same period of
time.
In the context of modeling protein-DNA interactions, physical models based on electrostatics,
statistical mechanics, or thermodynamics are powerful methods that allow one to bridge the
gap between our understanding of conceptual biological processes and physical processes. They
provide a mathematical framework for modeling biomolecular systems, making predictions, and
understanding existing biological data at a deeper level. However, physical models have some
disadvantages compared to machine learning based approaches. For one, they are often more
difficult to improve. Many of the assumptions made in physical models are constrained by the
need for mathematical or computational tractability. While improvements to physical models do
get made, it is generally discouraged to base such improvements purely on empirical grounds.
Physical models are usually based on deeper principles and a set of approximations; they are
then mathematically derivable from more "exact" models (or even physical "laws") given certain
assumptions. As more experimental data accrues, physical models remain relatively constant, and
progress often shifts to issues of computation or finding "tricks" to improve sampling or convergence
properties. Molecular dynamics, for example, is based on Hamiltonian mechanics, a
physical theory which was well established more than 150 years ago, and the Poisson-Boltzmann
equation was developed in the 1920s, nearly 100 years ago. While huge advances have been made
in computational efficiency, software development, parametrization of force fields, and our understanding
of how to apply these methods to relevant questions in biology, the underlying physical
model remains more or less unchanged.
In contrast, machine learning models constantly improve over time as more data accrues. Machine
learning methods are based on very general mathematical models, make few assumptions
about the expected properties of the model, and need not follow the rules of mathematical
deduction to arrive at their final predictions. They instead rely on a large number of free parameters
which are tuned in order to fit the predictions of the model to observed data. This is in
stark contrast to physical models, where free parameters are generally seen as unfavorable and
there is often a notion that fewer is better; after all, there are no free parameters in nature. As
more data becomes available, we can reduce the bias and variance in machine learning models to better
and better approximate observed data (within the limits of any particular approach). Therefore,
the gap between a physical model and observation remains constant as the number of observations
increases, while the gap shrinks for machine learning models, making them more and more
reliable over time. To put it facetiously, when using physical models we want to make the wrong
prediction for the right reason, and when using machine learning models we want to make the
right prediction for the wrong reason.
Machine learning models are often far less computationally demanding than physical models.
This is partly because in machine learning, models are determined through optimization and
not through solutions to differential equations, as is usually the case for physical models. Some
physical models also require extensive sampling (for example, molecular dynamics and Monte Carlo
approaches) in order for meaningful statistical properties to converge, and every simulation or
calculation requires an independent sampling procedure. Each run of a simulation or calculation
does not benefit from previous computational effort which may have been expended on similar
or identical systems. In machine learning, all of the sampling is performed up front in the form
of a large training dataset, and in some sense the model "carries" the information learned from
sampling with it. In this way, machine learning models are much more efficient than physical
models at tasks of inference and prediction. Machine learning models can also be updated: any
model which has been trained can be used as a starting point for additional training if new data
becomes available.
The major bottleneck of any data-driven approach is often the data itself. Acquiring data
can be expensive, time consuming, or beyond the control of the people who wish to use it. Once a
large amount of raw data is available, there are still many challenges in processing it, such
as organizing it, cleaning it, and developing a framework for interacting with the data in a form
that makes it possible to extract meaningful information. In the case of the structure
data available in the Protein Data Bank, which we use in the work presented in this thesis, the
raw data is simply a list of atomic coordinates and atomic species that make up the structure of
a protein or nucleic acid. Individual atoms are assigned to individual residues and nucleotides,
which are then organized into polymer chains. This is the extent of the raw data; any further
information, such as dihedral angles of the protein backbone, shape parameters and base pairing
in a DNA helix, structural classifications of protein or DNA entities, and which components of
the structures are interacting, is not immediately available. In chapter 2 I presented DNAproDB,
which is our solution to this issue. DNAproDB is a data processing and visualization tool which
greatly simplifies, or removes entirely, the need for data engineering when working with structures
of protein-DNA complexes. We have organized, processed, cleaned, annotated, and aggregated a
large amount of structural data related to protein-DNA binding, and have thus made the step
from raw structural data to usable training data much easier and faster for future researchers.
In chapter 3 I presented work which demonstrates how advances in machine learning can
be used to study biological questions using structural data. There we applied geometric deep
learning to the molecular surfaces of DNA binding proteins to predict DNA binding sites using
only the protein structure. We have termed the framework developed in this approach "GeoBind",
and our goal is to provide the community a common set of tools for building, training, and
evaluating deep learning models which operate on molecular surface meshes, an approach we
feel has applications to many systems besides protein-DNA interactions. This approach was only
possible due to the large number of protein-DNA complexes now available in the PDB, and
generating our datasets was made far easier with the assistance of DNAproDB (chapter 2). The
GeoBind approach relies on features computed from the protein structure, and some features, like
the electrostatic potential, utilize physical models. Therefore, our approach actually integrates
physical modeling with machine learning methods.
One of the major advantages of physical models over machine learning models is their generality.
Because they are based on physical principles, their domains of applicability are far greater
than those of most machine learning models, which are constrained to domains well sampled by
the training data. By integrating physical modeling with machine learning approaches, we can
reap the benefits of both. This can be done in many ways: we can use physical models to compute
reliable features that provide rich sources of information, which machine learning models can use
to capture important correlations in training data that may otherwise be difficult for the model
to learn. We can use machine learning to help with the parameterization of physical or semi-empirical
models (e.g., improving force fields, or adding empirical terms to potentials which are
based on a learned function). Or we can use physical models to generate new data on which machine
learning models can then be trained, as a means of data mining or a way to transfer
the information generated by slow but analytical approaches to fast but statistical
ones. A particularly salient example is using molecular dynamics to generate dynamical
information from a starting experimental structure, and then using machine learning to mine
the dynamics and extract meaningful information. The possibilities of combining physical models
and machine learning methods are quite promising and, in this author's opinion, will likely be a
major direction of future biophysical research.
On another frontier, deeper integration of structural data with other types of data, such as
sequence data, seems enormously promising. There are vast amounts of protein and nucleotide
sequencing data currently available, including both in vitro and in vivo transcription factor binding
specificities. Zhou et al. [133] showed how incorporating DNA structure information can improve
the prediction of transcription factor binding specificities over methods which use only DNA
sequence. Further work along this line, incorporating structure information at deeper levels and
perhaps combining both DNA and protein structure, may offer many opportunities for exploring
new questions and unifying the fields of bioinformatics and structural biology.
Structural data allows researchers to explore molecular interactions at the highest level of detail
possible and sheds light on the inner workings and molecular origins of biological processes. In the
past, structural data was scarce and difficult to obtain, but advances in structural biology over
many decades have produced structure datasets which are now large enough that machine learning
approaches are effective ways of utilizing them. As more data accumulates and improvements in
machine learning and computation continue to accelerate, data-driven approaches offer more and
more promise as powerful tools by which to leverage the fine granularity and deep physical
insight that structure data provides to answer important biological questions.
Reference List
[1] Differential geometry-based solvation and electrolyte transport models for biomolecular
modeling: a review. In Many-Body Effects and Electrostatics in Biomolecules, pages 435–480.
Jenny Stanford Publishing, 2016.
[2] Expansion of the gene ontology knowledgebase and resources. Nucleic Acids Res,
45(D1):D331–D338, 2017.
[3] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis,
K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis,
S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene
ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet,
25(1):25–29, 2000.
[4] William R. Atchley, Jieping Zhao, Andrew D. Fernandes, and Tanja Drüke. Solving
the protein sequence metric problem. Proceedings of the National Academy of Sciences,
102(18):6395–6400, 2005.
[5] Nathan A. Baker. Biomolecular Applications of Poisson-Boltzmann Methods, chapter 5,
pages 349–379. John Wiley and Sons, Ltd, 2005.
[6] Donald Bashford. An object-oriented programming suite for electrostatic effects in biological
molecules: an experience report on the MEAD project. In Yutaka Ishikawa, Rodney R.
Oldehoeft, John V. W. Reynders, and Marydell Tholburn, editors, Scientific Computing in
Object-Oriented Parallel Environments, pages 233–240, Berlin, Heidelberg, 1997. Springer
Berlin Heidelberg.
[7] P. W. Bates, G. W. Wei, and Shan Zhao. Minimal molecular surfaces and their applications.
J Comput Chem, 29(3):380–391, Feb 2008.
[8] S. S. Batsanov. Van der Waals radii of elements. Inorganic Materials, 37(9):871–885, 2001.
[9] Jessa Bekker and Jesse Davis. Learning from positive and unlabeled data: A survey. arXiv
preprint arXiv:1811.04820, 2018.
[10] H. M. Berman, W. K. Olson, D. L. Beveridge, J. Westbrook, A. Gelbin, T. Demeny, S. H.
Hsieh, A. R. Srinivasan, and B. Schneider. The nucleic acid database: a comprehensive
relational database of three-dimensional structures of nucleic acids. Biophys J, 63(3):751–759,
1992.
[11] Helen Berman, Kim Henrick, and Haruki Nakamura. Announcing the worldwide Protein
Data Bank. Nature Structural Biology, 10:980, 2003.
[12] Filippo Maria Bianchi, Daniele Grattarola, and Cesare Alippi. Mincut pooling in graph
neural networks. arXiv preprint arXiv:1907.00481, 2019.
[13] M. Born and R. Oppenheimer. Zur Quantentheorie der Molekeln. Annalen der Physik,
389(20):457–484, 1927.
[14] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep
learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42,
2017.
[15] Martha L. Bulyk. Protein binding microarrays for the characterization of DNA-protein
interactions. In Analytics of Protein-DNA Interactions, pages 65–85. Springer, 2006.
[16] Nicoletta Ceres, Marco Pasi, and Richard Lavery. A protein solvation model based on
residue burial. Journal of Chemical Theory and Computation, 8(6):2141–2144, 2012.
[17] D. L. Chapman. A contribution to the theory of electrocapillarity. The London, Edinburgh,
and Dublin Philosophical Magazine and Journal of Science, 25(148):475–481, 1913.
[18] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L.
Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs.
arXiv preprint arXiv:1412.7062, 2014.
[19] N. Chennamsetty, V. Voynov, V. Kayser, B. Helk, and B. L. Trout. Prediction of aggregation
prone regions of therapeutic proteins. J Phys Chem B, 114(19):6614–6624, 2010.
[20] Tsu-Pei Chiu, Satyanarayan Rao, Richard S. Mann, Barry Honig, and Remo Rohs. Genome-wide
prediction of minor-groove electrostatic potential enables biophysical modeling of
protein-DNA binding. Nucleic Acids Research, 45(21):12565–12576, 2017.
[21] P. J. Cock, T. Antao, J. T. Chang, B. A. Chapman, C. J. Cox, A. Dalke, I. Friedberg,
T. Hamelryck, F. Kauff, B. Wilczynski, and M. J. de Hoon. Biopython: freely available
Python tools for computational molecular biology and bioinformatics. Bioinformatics,
25(11):1422–1423, 2009.
[22] B. Coimbatore Narayanan, J. Westbrook, S. Ghosh, A. I. Petrov, B. Sweeney, C. L. Zirbel,
N. B. Leontis, and H. M. Berman. The nucleic acid database: new features and capabilities.
Nucleic Acids Res, 42(Database issue):D114–D122, 2014.
[23] M. L. Connolly. Analytical molecular surface calculation. Journal of Applied Crystallography,
16(5):548–558, Oct 1983.
[24] UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res,
43(Database issue):D204–D212, 2015.
[25] B. Contreras-Moreira. 3D-footprint: a database for the structural analysis of protein-DNA
complexes. Nucleic Acids Res, 38(Database issue):D91–D97, 2010.
[26] Wendy D. Cornell, Piotr Cieplak, Christopher I. Bayly, Ian R. Gould, Kenneth M. Merz,
David M. Ferguson, David C. Spellmeyer, Thomas Fox, James W. Caldwell, and Peter A.
Kollman. A second generation force field for the simulation of proteins, nucleic acids, and
organic molecules. Journal of the American Chemical Society, 117(19):5179–5197, 1995.
[27] T. E. Creighton. Proteins: Structures and Molecular Properties. W. H. Freeman, 1993.
[28] M. E. Davis and J. A. McCammon. Solving the finite difference linearized Poisson-Boltzmann
equation: A comparison of relaxation and conjugate gradient methods. Journal
of Computational Chemistry, 10(3):386–391, 1989.
[29] Dawson-Haggerty et al. trimesh.
[30] P. Debye and E. Hückel. Zur Theorie der Elektrolyte. Physikalische Zeitschrift, 24(1):185–206,
1923.
[31] Sergio Decherchi and Walter Rocchia. A general and robust ray-casting-based algorithm for
triangulating surfaces at the nanoscale. PLoS One, 8(4):e59744, 2013.
[32] Haowen Deng, Tolga Birdal, and Slobodan Ilic. PPFNet: Global context aware local features
for robust 3D point matching, 2018.
[33] Thayne H. Dickey, Sarah E. Altschuler, and Deborah S. Wuttke. Single-stranded DNA-binding
proteins: Multiple domains for multiple functions. Structure, 21(7):1074–1084,
2013.
[34] Frederik Diehl. Edge contraction pooling for graph neural networks. arXiv preprint
arXiv:1905.10990, 2019.
[35] T. J. Dolinsky, P. Czodrowski, H. Li, J. E. Nielsen, J. H. Jensen, G. Klebe, and N. A. Baker.
PDB2PQR: expanding and upgrading automated preparation of biomolecular structures for
molecular simulations. Nucleic Acids Res, 35(Web Server issue):W522–W525, 2007.
[36] T. J. Dolinsky, J. E. Nielsen, J. A. McCammon, and N. A. Baker. PDB2PQR: an automated
pipeline for the setup of Poisson-Boltzmann electrostatics calculations. Nucleic Acids Res,
32(Web Server issue):W665–W667, 2004.
[37] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online
learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159,
2011.
[38] Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning.
2016.
[39] ECMA-404. The JSON data interchange format. Electronic Article 1, Ecma International,
Rue du Rhone 114, CH-1204 Geneva, 2013.
[40] H. Edelsbrunner. Deformable smooth surface design. Discrete & Computational Geometry,
21(1):87–115, 1999.
[41] Tom Ellenberger. Getting a grip on DNA recognition: structures of the basic region leucine
zipper, and the basic region helix-loop-helix DNA-binding domains. Current Opinion in
Structural Biology, 4(1):12–21, 1994.
[42] Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric.
In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
[43] F. Fogolari, A. Brigo, and H. Molinari. The Poisson-Boltzmann equation for biomolecular
electrostatics: a tool for structural biology. Journal of Molecular Recognition, 15(6):377–392,
2002.
[44] Terrence S. Furey. ChIP-seq and beyond: new and improved methodologies to detect and
characterize protein-DNA interactions. Nature Reviews Genetics, 13(12):840–852, 2012.
[45] P. Gainza, F. Sverrisson, F. Monti, E. Rodol a, D. Boscaini, M. M. Bronstein, and B. E. Cor-
reia. Deciphering interaction ngerprints from protein molecular surfaces using geometric
deep learning. Nature Methods, 17(2):184{192, 2020.
[46] Hongyang Gao and Shuiwang Ji. Graph u-nets. arXiv preprint arXiv:1905.05178, 2019.
[47] Mark M Garner and Arnold Revzin. A gel electrophoresis method for quantifying the binding of proteins to specific dna regions: application to components of the escherichia coli lactose operon regulatory system. Nucleic acids research, 9(13):3047–3060, 1981.
[48] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 1263–1272. JMLR.org, 2017.
[49] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
[50] M. Gouy. Sur la constitution de la charge électrique à la surface d'un électrolyte. J. Phys. Theor. Appl., 9(1):457–468, 1910.
[51] Xusi Han, Atilla Sit, Charles Christoffer, Siyang Chen, and Daisuke Kihara. A global map of the protein shape universe. PLoS computational biology, 15(4):e1006969, 2019.
[52] Stephen P. Hancock, Tahereh Ghane, Duilio Cascio, Remo Rohs, Rosa Di Felice, and Reid C. Johnson. Control of dna minor groove width and fis protein binding by the purine 2-amino group. Nucleic Acids Research, 41(13):6750–6760, 2013.
[53] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[54] S.J. Hubbard and J.M. Thornton. Naccess. Computer Program, 1993.
[55] Wonpil Im, Dmitrii Beglov, and Benoît Roux. Continuum solvation model: Computation of electrostatic forces from numerical solutions to the poisson-boltzmann equation. Computer Physics Communications, 111(1):59–75, 1998.
[56] Alec Jacobson, Daniele Panozzo, et al. libigl: A simple C++ geometry processing library, 2018. https://libigl.github.io/.
[57] Linda Jen-Jacobson and Lewis A Jacobson. Role of water and effects of small ions in site-specific protein-dna interactions. Protein-Nucleic Acid Interactions: Structural Biology, pages 13–46, 2008.
[58] Kenan Jijakli, Basel Khraiwesh, Weiqi Fu, Liming Luo, Amnah Alzahmi, Joseph Koussa, Amphun Chaiboonchoe, Serdal Kirmizialtin, Laising Yen, and Kourosh Salehi-Ashtiani. The in vitro selection world. Methods, 106:3–13, Aug 2016.
[59] R. Joshi, J. M. Passner, R. Rohs, R. Jain, A. Sosinsky, M. A. Crickmore, V. Jacob, A. K. Aggarwal, B. Honig, and R. S. Mann. Functional specificity of a hox protein mediated by the recognition of minor groove structure. Cell, 131(3):530–43, 2007.
[60] E. Jurrus, D. Engel, K. Star, K. Monson, J. Brandi, L. E. Felberg, D. H. Brookes, L. Wilson, J. Chen, K. Liles, M. Chun, P. Li, D. W. Gohara, T. Dolinsky, R. Konecny, D. R. Koes, J. E. Nielsen, T. Head-Gordon, W. Geng, R. Krasny, G. W. Wei, M. J. Holst, J. A. McCammon, and N. A. Baker. Improvements to the apbs biomolecular solvation software suite. Protein Sci, 27(1):112–128, 2018.
[61] W. Kabsch and C. Sander. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22(12):2577–637, 1983.
[62] Daisuke Kihara, Lee Sael, Rayan Chikhi, and Juan Esquivel-Rodriguez. Molecular surface representation using 3d zernike descriptors for protein shape comparison and docking. Current Protein and Peptide Science, 12(6):520–530, 2011.
[63] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[64] Isaac Klapper, Ray Hagstrom, Richard Fine, Kim Sharp, and Barry Honig. Focusing of electric fields in the active site of cu-zn superoxide dismutase: Effects of ionic strength and amino-acid modification. Proteins: Structure, Function, and Bioinformatics, 1(1):47–59, 1986.
[65] Jan J Koenderink and Andrea J van Doorn. Surface shape and curvature scales. Image and Vision Computing, 10(8):557–564, 1992.
[66] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in neural information processing systems, pages 109–117, 2011.
[67] Mohit Kumar, Mario S Mommer, and Victor Sourjik. Mobility of cytoplasmic, membrane, and dna-binding proteins in escherichia coli. Biophys J, 98(4):552–559, Feb 2010.
[68] R. A. Laskowski, J. Jabłońska, L. Pravda, R. S. Vařeková, and J. M. Thornton. Pdbsum: Structural summaries of pdb entries. Protein Sci, 27(1):129–134, 2018.
[69] B. Lee and F.M. Richards. The interpretation of protein structures: Estimation of static accessibility. Journal of Molecular Biology, 55(3):379–IN4, 1971.
[70] Junhyun Lee, Inyeop Lee, and Jaewoo Kang. Self-attention graph pooling. arXiv preprint arXiv:1904.08082, 2019.
[71] Mitchell Lewis. The lac repressor. Comptes Rendus Biologies, 328(6):521–548, 2005. Retour sur l'opéron lac.
[72] Lin Li, Chuan Li, Subhra Sarkar, Jie Zhang, Shawn Witham, Zhe Zhang, Lin Wang, Nicholas Smith, Marharyta Petukh, and Emil Alexov. Delphi: a comprehensive suite for delphi software and associated resources. BMC biophysics, 5(1):9, 2012.
[73] Weizhong Li and Adam Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659, 2006.
[74] Guosheng Lin, Chunhua Shen, Anton Van Den Hengel, and Ian Reid. Efficient piecewise training of deep structured models for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3194–3203, 2016.
[75] Tiantian Liu, Minxin Chen, and Benzhuo Lu. Parameterization for molecular gaussian surface and a comparison study of surface mesh generation. Journal of Molecular Modeling, 21(5):113, 2015.
[76] X. J. Lu, H. J. Bussemaker, and W. K. Olson. Dssr: an integrated software tool for dissecting the spatial structure of rna. Nucleic Acids Res, 43(21):e142, 2015.
[77] X. J. Lu and W. K. Olson. 3dna: a software package for the analysis, rebuilding and visualization of three-dimensional nucleic acid structures. Nucleic Acids Res, 31(17):5108–21, 2003.
[78] X. J. Lu and W. K. Olson. 3dna: a versatile, integrated software system for the analysis, rebuilding and visualization of three-dimensional nucleic-acid structures. Nat Protoc, 3(7):1213–27, 2008.
[79] Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width. In Advances in neural information processing systems, pages 6231–6239, 2017.
[80] Jeffry D Madura, James M Briggs, Rebecca C Wade, Malcolm E Davis, Brock A Luty, Andrew Ilin, Jan Antosiewicz, Michael K Gilson, Babak Bagheri, L Ridgway Scott, et al. Electrostatics and diffusion of molecules in solution: simulations with the university of houston brownian dynamics program. Computer Physics Communications, 91(1-3):57–95, 1995.
[81] Jeffry D. Madura, Malcolm E. Davis, Michael K. Gilson, Rebecca C. Wade, Brock A. Luty, and J. Andrew McCammon. Biological Applications of Electrostatic Calculations and Brownian Dynamics Simulations, pages 229–267. John Wiley and Sons, Ltd, 2007.
[82] Jonathan Masci, Davide Boscaini, Michael Bronstein, and Pierre Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In Proceedings of the IEEE international conference on computer vision workshops, pages 37–45, 2015.
[83] Anthony Mathelier, Beibei Xin, Tsu-Pei Chiu, Lin Yang, Remo Rohs, and Wyeth W Wasserman. Dna shape features improve transcription factor binding site predictions in vivo. Cell systems, 3(3):278–286, 2016.
[84] Emily E. Meyer, Kenneth J. Rosenberg, and Jacob Israelachvili. Recent progress in understanding hydrophobic interactions. Proceedings of the National Academy of Sciences, 103(43):15739–15746, 2006.
[85] Shervin Minaee, Yuri Boykov, Fatih Porikli, Antonio Plaza, Nasser Kehtarnavaz, and Demetri Terzopoulos. Image segmentation using deep learning: A survey. arXiv preprint arXiv:2001.05566, 2020.
[86] S. Mitternacht. Freesasa: An open source c library for solvent accessible surface area calculations. F1000Res, 5:189, 2016.
[87] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Jan Svoboda, and Michael M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns, 2016.
[88] Garrett M. Morris and Marguerita Lim-Wilby. Molecular Docking, pages 365–382. Humana Press, Totowa, NJ, 2008.
[89] Haruki Nakamura and Shinichi Nishida. Numerical calculations of electrostatic potentials of protein-solvent systems by the self consistent boundary method. Journal of the Physical Society of Japan, 56(4):1609–1622, 1987.
[90] T. Norambuena and F. Melo. The protein-dna interface database. BMC Bioinformatics, 11:262, 2010.
[91] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[92] Eric F. Pettersen, Thomas D. Goddard, Conrad C. Huang, Gregory S. Couch, Daniel M. Greenblatt, Elaine C. Meng, and Thomas E. Ferrin. Ucsf chimera - a visualization system for exploratory research and analysis. Journal of Computational Chemistry, 25(13):1605–1612, 2004.
[93] Aleksey Porollo and Jaroslaw Meller. Prediction-based fingerprints of protein-protein interactions. Proteins: Structure, Function, and Bioinformatics, 66(3):630–645, 2007.
[94] Angana Ray and Rosa Di Felice. Protein-mutation-induced conformational changes of the dna and nuclease domain in crispr/cas9 systems by molecular dynamics simulations. The Journal of Physical Chemistry B, 124(11):2168–2179, 2020.
[95] Remo Rohs, Xiangshu Jin, Sean M West, Rohit Joshi, Barry Honig, and Richard S Mann. Origins of specificity in protein-dna recognition. Annual review of biochemistry, 79:233–269, 2010.
[96] Remo Rohs, Sean M West, Alona Sosinsky, Peng Liu, Richard S Mann, and Barry Honig. The role of dna shape in protein–dna recognition. Nature, 461(7268):1248–1253, 2009.
[97] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
[98] A. S. Rose, A. R. Bradley, Y. Valasatava, J. M. Duarte, A. Prlic, and P. W. Rose. Ngl viewer: web-based molecular graphics for large complexes. Bioinformatics, 34(21):3755–3758, 2018.
[99] A. S. Rose and P. W. Hildebrand. Ngl viewer: a web application for molecular visualization. Nucleic Acids Res, 43(W1):W576–9, 2015.
[100] P. W. Rose, A. Prlić, A. Altunkaya, C. Bi, A. R. Bradley, C. H. Christie, L. D. Costanzo, J. M. Duarte, S. Dutta, Z. Feng, R. K. Green, D. S. Goodsell, B. Hudson, T. Kalro, R. Lowe, E. Peisach, C. Randle, A. S. Rose, C. Shao, Y. P. Tao, Y. Valasatava, M. Voigt, J. D. Westbrook, J. Woo, H. Yang, J. Y. Young, C. Zardecki, H. M. Berman, and S. K. Burley. The rcsb protein data bank: integrative view of protein, gene and 3d structural information. Nucleic Acids Res, 45(D1):D271–D281, 2017.
[101] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
[102] Jared M. Sagendorf, Helen M. Berman, and Remo Rohs. Dnaprodb: an interactive tool for structural analysis of dna-protein complexes. Nucleic Acids Res, 45(Web Server issue):W89–W97, Jul 2017.
[103] Jared M Sagendorf, Nicholas Markarian, Helen M Berman, and Remo Rohs. Dnaprodb: an expanded database and web-based tool for structural analysis of dna-protein complexes. Nucleic Acids Res, 48(D1):D277–D287, Jan 2020.
[104] M F Sanner, A J Olson, and J C Spehner. Reduced surface: an efficient way to compute molecular surfaces. Biopolymers, 38(3):305–320, Mar 1996.
[105] Schrödinger, LLC. The PyMOL molecular graphics system, version 1.8. November 2015.
[106] Yousif Shamoo. Single-stranded DNA-binding Proteins. American Cancer Society, 2001.
[107] Lior Shapira, Ariel Shamir, and Daniel Cohen-Or. Consistent mesh partitioning and skeletonisation using the shape diameter function. The Visual Computer, 24(4):249, 2008.
[108] Kim A. Sharp and Barry Honig. Calculating total electrostatic energies with the nonlinear poisson-boltzmann equation. The Journal of Physical Chemistry, 94(19):7684–7692, 1990.
[109] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. CoRR, abs/1605.06211, 2016.
[110] A. Shrake and J.A. Rupley. Environment and exposure to solvent of protein atoms. lysozyme and insulin. Journal of Molecular Biology, 79(2):351–371, 1973.
[111] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.
[112] Jingna Si, Zengming Zhang, Biaoyang Lin, Michael Schroeder, and Bingding Huang. Metadbsite: a meta approach to improve protein dna-binding sites prediction. BMC systems biology, 5(1):S7, 2011.
[113] Jingna Si, Rui Zhao, and Rongling Wu. An overview of the prediction of protein dna-binding sites. International journal of molecular sciences, 16(3):5194–5215, 2015.
[114] I. Sillitoe, T. E. Lewis, A. Cuff, S. Das, P. Ashford, N. L. Dawson, N. Furnham, R. A. Laskowski, D. Lee, J. G. Lees, S. Lehtinen, R. A. Studer, J. Thornton, and C. A. Orengo. Cath: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res, 43(Database issue):D376–81, 2015.
[115] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3693–3702, 2017.
[116] Jian Sun, Maks Ovsjanikov, and Leonidas Guibas. A concise and provably informative multi-scale signature based on heat diffusion. Computer Graphics Forum, 28(5):1383–1392, 2009.
[117] Charles Sutton, Andrew McCallum, et al. An introduction to conditional random fields. Foundations and Trends® in Machine Learning, 4(4):267–373, 2012.
[118] Marvin TT Teichmann and Roberto Cipolla. Convolutional crfs for semantic segmentation. arXiv preprint arXiv:1805.04777, 2018.
[119] Nitika Verma, Edmond Boyer, and Jakob Verbeek. Feastnet: Feature-steered graph convolutions for 3d shape analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2598–2606, 2018.
[120] Lincong Wang. The solvent-excluded surfaces of water-soluble proteins. bioRxiv, page 294082, 2018.
[121] J. D. Watson and F. H. C. Crick. Molecular structure of nucleic acids: A structure for deoxyribose nucleic acid. Nature, 171(4356):737–738, 1953.
[122] J. D. Westbrook and P. M. D. Fitzgerald. The PDB format, mmCIF formats, and other data formats, book section 10, pages 271–291. John Wiley and Sons, Inc., Hoboken, New Jersey, second edition, 2009.
[123] J. D. Westbrook, C. Shao, Z. Feng, M. Zhuravleva, S. Velankar, and J. Young. The chemical component dictionary: complete descriptions of constituent molecules in experimentally determined 3d macromolecules in the protein data bank. Bioinformatics, 31(8):1274–8, 2015.
[124] J. M. Word, S. C. Lovell, J. S. Richardson, and D. C. Richardson. Asparagine and glutamine: using hydrogen atom contacts in the choice of side-chain amide orientation. J Mol Biol, 285(4):1735–47, 1999.
[125] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, pages 1–21, 2020.
[126] wwPDB consortium. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Research, 47(D1):D520–D528, 2018.
[127] Tian Xie and Jeffrey C. Grossman. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Phys. Rev. Lett., 120:145301, Apr 2018.
[128] Dong Xu and Yang Zhang. Generating triangulated macromolecular surfaces by euclidean distance transform. PLoS One, 4(12):e8140, Dec 2009.
[129] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in neural information processing systems, pages 4800–4810, 2018.
[130] O. Zanegina, D. Kirsanov, E. Baulin, A. Karyagina, A. Alexeevski, and S. Spirin. An updated version of npidb includes new classifications of dna-protein complexes and their families. Nucleic Acids Res, 44(D1):D144–53, 2016.
[131] Matthew D Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
[132] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE international conference on computer vision, pages 1529–1537, 2015.
[133] Tianyin Zhou, Ning Shen, Lin Yang, Namiko Abe, John Horton, Richard S Mann, Harmen J Bussemaker, Raluca Gordân, and Remo Rohs. Quantitative modeling of transcription factor binding specificities using dna shape. Proceedings of the National Academy of Sciences, 112(15):4654–4659, 2015.
[134] Zhongxiang Zhou, Philip Payne, Max Vasquez, Nat Kuhn, and Michael Levitt. Finite-difference solution of the poisson-boltzmann equation: Complete elimination of self-energy. Journal of Computational Chemistry, 17(11):1344–1351, 1996.
[135] Xiaolei Zhu, Spencer S. Ericksen, and Julie C. Mitchell. DBSI: DNA-binding site identifier. Nucleic Acids Research, 41(16):e160–e160, 2013.
Appendix A
Supplementary Information and Methods for Chapter 2
A.1 Supplementary Methods
DNAproDB is available at https://dnaprodb.usc.edu
A.1.1 Detection of major and minor grooves for double-helical DNA
The major and minor grooves are important structural moieties for binding to double-stranded helical DNA. Many proteins recognize distinct biophysical signatures of the grooves, such as hydrogen bond donor/acceptor patterns, DNA shape, or electrostatic potential. DNAproDB identifies the major groove and minor groove edges of base pairs which form double-stranded helices and distinguishes interactions that occur in either groove. For bases which form canonical Watson-Crick base pairs, the groove edges are known a priori. In general, however, base-pairing geometry may deviate substantially from the Watson-Crick conformation, and the glycosidic torsion angle and the relative position and orientation of the base coordinate frames matter. Additionally, chemically modified nucleotides may have additional chemical groups that should be correctly identified if they protrude into one groove or the other.
We have developed a simple algorithm for distinguishing which atoms of each base in a base pair should be identified as being in the major groove or minor groove. Supplementary Figure S3 illustrates our approach. First, the base-pair coordinate frame is determined for a given base pair using the program DSSR [76]. The direction of the x-axis will point towards either the major or the minor groove, depending on the relative glycosidic bond angles of the two bases. The base pair is then treated as a graph, with each atom in the pair being a node projected to the x-y plane of the base coordinate frame, and covalent bonds being edges. Knowing the direction of the minor groove, an edge joining the two bases is added to the graph, representing the hydrogen bond between the two bases that is closest to the minor groove. The shortest path between the glycosidic nitrogen atoms of the two bases is then found. This path, together with the rays N1 and N2, bisects the plane. All atoms that lie along the path and those on the minor groove side are classified as minor groove atoms; all atoms on the major groove side are classified as major groove atoms. Note that this algorithm works for any nucleoside pair so long as a coordinate frame for the base pair can be defined.
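The path-finding step above can be sketched with a plain breadth-first search over the base-pair graph. The graph below is a hypothetical, heavily reduced G:C pair (atom names follow PDB conventions, but the connectivity is simplified to a handful of atoms for illustration); this is a sketch of the idea, not the DNAproDB implementation.

```python
from collections import deque

def shortest_path(adj, start, goal):
    """BFS shortest path in an unweighted graph given as an adjacency dict."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk predecessors back to the start
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nbr in adj[node]:
            if nbr not in prev:
                prev[nbr] = node
                queue.append(nbr)
    return None

# Toy base-pair graph: covalent bonds within each base, plus one edge for
# the inter-base hydrogen bond nearest the minor groove (G.N2-C.O2).
adj = {
    "G.N9": ["G.C4"], "G.C4": ["G.N9", "G.N3"],
    "G.N3": ["G.C4", "G.C2"], "G.C2": ["G.N3", "G.N2"],
    "G.N2": ["G.C2", "C.O2"],  # hydrogen-bond edge added here
    "C.O2": ["G.N2", "C.C2"], "C.C2": ["C.O2", "C.N1"],
    "C.N1": ["C.C2"],
}
# Path between the two glycosidic nitrogens divides the projected plane.
path = shortest_path(adj, "G.N9", "C.N1")
print(path)
```

Atoms are then classified by which side of this dividing path (extended by the rays N1 and N2) they fall on in the projected plane.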
A.1.2 DNA structural entity classification
DNAproDB classifies DNA structural entities based on the secondary structure of the entity. We assigned four classifications: helical, single-stranded, helix/single-stranded hybrid, or other. The latter class is reserved for secondary structures which are irregular, have no commonly used name, or are too infrequent to warrant their own classification. For every DNA structural entity, any helical segments and single-stranded segments present within the entity are first determined. Helical segments are identified using DSSR [76], which defines helices as one or more stems which stack on top of one another, allowing for some flexibility in terms of flipped-out bases, backbone breaks, etc. Single-stranded segments are defined as segments of a DNA strand which do not belong to a helix, do not form any base pairs with other strands, do not have more than two consecutive intra-strand base pairs, and are at least three nucleotides in length.
If a DNA entity has more than one helical segment, it is automatically classified as other. If a DNA entity has exactly one helical segment, it is classified as helical if at least 60% of nucleotides within the entity are part of the helical segment. Otherwise, if the number of helical nucleotides plus the number of single-stranded nucleotides is at least 60% of the total, the entity is classified as helical/single-stranded. If neither condition is met, it is classified as other.
Finally, if a structural entity contains no helical segments, it is classified as single-stranded if at least 60% of its nucleotides are in a single-stranded conformation; otherwise it is classified as other. This is a heuristics-based classification scheme, but it has been tested against a large number of structures and found to work very well.
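The decision rules above can be condensed into a short function. This is a minimal sketch of the stated thresholds; the function and argument names are our own, not DNAproDB's.

```python
def classify_dna_entity(n_total, n_helical, n_ss, n_helical_segments):
    """Heuristic DNA entity classification.

    n_total            -- total nucleotides in the entity
    n_helical / n_ss   -- nucleotides in helical / single-stranded segments
    n_helical_segments -- number of helical segments in the entity
    """
    if n_helical_segments > 1:
        return "other"
    if n_helical_segments == 1:
        if n_helical / n_total >= 0.60:
            return "helical"
        if (n_helical + n_ss) / n_total >= 0.60:
            return "helical/single-stranded"
        return "other"
    # No helical segments at all.
    return "single-stranded" if n_ss / n_total >= 0.60 else "other"

print(classify_dna_entity(20, 16, 0, 1))  # -> helical (16/20 = 80%)
print(classify_dna_entity(20, 8, 6, 1))   # -> helical/single-stranded (14/20 = 70%)
print(classify_dna_entity(10, 0, 9, 0))   # -> single-stranded (9/10 = 90%)
```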
Within the helical assignment, three sub-classifications are provided: a perfect helix, an imperfect helix, and an irregular helix. These classifications are based on a numerical feature DNAproDB computes for each helical segment, called the helix score, which is the ratio of canonical base pairs to the total number of base pairs in the helix. A canonical base pair is a base pair in which both bases form stacking interactions with the neighboring bases on their respective strands. A perfect helix has a helix score above 0.9, an imperfect helix a score between 0.6 and 0.9, and an irregular helix a score below 0.6.
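The helix-score binning reads naturally as a small function. Note that the treatment of scores exactly at the 0.9 and 0.6 boundaries is our assumption; the text only gives open ranges.

```python
def helix_classification(n_canonical_bp, n_total_bp):
    """Sub-classify a helical segment by its helix score, defined as the
    ratio of canonical base pairs to total base pairs in the helix.
    Boundary handling at exactly 0.9 / 0.6 is an assumption here."""
    score = n_canonical_bp / n_total_bp
    if score > 0.9:
        return "perfect helix"
    if score >= 0.6:
        return "imperfect helix"
    return "irregular helix"

print(helix_classification(10, 10))  # score 1.0 -> perfect helix
print(helix_classification(7, 10))   # score 0.7 -> imperfect helix
print(helix_classification(5, 10))   # score 0.5 -> irregular helix
```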
A.1.3 Approximating parameters for chemically modified components
DNAproDB uses atomic van der Waals radii parameters for the calculation of solvent accessible and solvent excluded surface areas, and residue hydrophobicity scores for the calculation of spatial aggregation propensities [19] for protein surface residues. These parameters are generally not available for chemically modified components, so approximations are used in order to provide reasonable values for these features while still supporting as many chemical modifications as possible. DNAproDB supports chemical modifications of any of the standard 20 amino acids and 4 DNA nucleotides that have an entry in the PDB Chemical Component Dictionary [123] and do not deviate from their parent component so significantly as to make identification of structural moieties ambiguous (see main manuscript).
In the case of van der Waals radii, atomic parameters for standard components are taken from the NACCESS [54] radii values. Radii for chemically modified components are then taken from any corresponding atoms in their standard parent, and values for the remaining atoms are taken from Table 9 of Batsanov [8] based on the element type of the atom.
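This fallback chain amounts to a two-level lookup: first by the atom's name in its standard parent component, then by element type. The sketch below uses invented radius values purely for illustration (they are not the actual NACCESS or Batsanov tables).

```python
# Illustrative values only -- not the real NACCESS / Batsanov parameters.
PARENT_RADII = {("DG", "N9"): 1.65, ("DG", "C8"): 1.76}  # (parent, atom) -> radius
ELEMENT_RADII = {"N": 1.6, "C": 1.7, "O": 1.5, "BR": 1.9}  # element -> radius

def atom_radius(parent_component, atom_name, element):
    """Radius fallback: use the matching atom of the standard parent
    component if present, otherwise fall back to the element-type radius."""
    key = (parent_component, atom_name)
    if key in PARENT_RADII:
        return PARENT_RADII[key]
    return ELEMENT_RADII[element.upper()]

print(atom_radius("DG", "N9", "N"))    # matched in the parent table -> 1.65
print(atom_radius("DG", "BR1", "Br"))  # modified-base atom -> element fallback, 1.9
```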
For the calculation of spatial aggregation propensities (SAP), chemically modified residues are ignored. The SAP values of standard residues in close proximity to chemically modified residues may be slightly affected, and SAP values for chemically modified residues are not reported.
A.1.4 Interacting moieties identification
Nucleotide–residue interactions are defined based on the minimum distance between a nucleotide and a residue (excluding hydrogen atoms). The current release of DNAproDB uses a value of 4.5 Å, and any nucleotide–residue pair with a distance beyond that cutoff value is not considered to be interacting. Given an interacting pair, DNAproDB identifies which structural moieties (see main manuscript) within the pair are interacting based on the values of interaction features (which are broken down by structural moiety). For example, if the NZ atom of a lysine (using PDB atom naming conventions) forms a hydrogen bond with the O6 atom of a guanine in a helical conformation, then this is a side chain–major groove hydrogen bond. The total number of hydrogen bonds, the number of van der Waals interactions (defined as heavy atom pairs within 3.9 Å but not forming a hydrogen bond), and the buried solvent accessible surface area components (see Supplementary Data of Sagendorf et al. [102] for a description) are used to determine which structural moieties are interacting. To determine cutoff values, the distributions of interaction feature values among a large sample of nucleotide–residue interaction pairs (464,303) were compiled using DNAproDB data generated from PDB structures. We note that not all nucleotide–residue pair types occur in equal numbers. Arginine interactions, for example, are much more numerous than glutamic acid interactions. A large bias towards zero values is present in these features because most interactions do not involve all possible structural moieties; for example, residue interactions in the DNA minor groove will have all-zero values for any major groove or base moiety features. In order to avoid this bias, feature values were clipped at 0.5 before computing percentiles.
Feature values are broken down for each nucleotide–residue pair type and each structural moiety interaction type. The 20th percentiles of these distributions are then used as lower bounds for determining when to assign a structural moiety interaction. For each structural moiety interaction type (e.g., side chain–sugar) there are three cutoff values to consider: the number of hydrogen bonds, the number of van der Waals interactions, and the buried solvent accessible surface area (BASA) component. If any feature among the three meets the threshold for a moiety interaction, then the nucleotide–residue interaction pair is assigned that moiety interaction. Supplementary Table A.1 shows the 20th percentiles for interaction features of arginine interactions with the standard four nucleotides. If a nucleotide–residue interaction fails to meet the cutoffs for every moiety interaction pair, then the feature value with the largest ratio to its cutoff value is chosen to assign a moiety interaction, and the nucleotide–residue interaction is classified as a "weak" interaction.
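The assignment rule, including the "weak" fallback, can be sketched as follows. The moiety names, feature keys, and cutoff values below are invented for illustration; only the decision logic follows the description above.

```python
def assign_moieties(features, cutoffs):
    """Assign moiety interactions to a nucleotide-residue pair.

    features/cutoffs: {moiety_type: {"hbonds": x, "vdw": y, "basa": z}}.
    A moiety is assigned if ANY of its three features meets the cutoff.
    If none passes for any moiety, pick the single moiety whose feature
    has the largest feature/cutoff ratio and flag the pair as "weak".
    """
    assigned = [m for m, f in features.items()
                if any(f[k] >= cutoffs[m][k] for k in f)]
    if assigned:
        return assigned, False
    best_ratio, best_moiety = max(
        (f[k] / cutoffs[m][k], m)
        for m, f in features.items()
        for k in f if cutoffs[m][k] > 0)
    return [best_moiety], True

# Hypothetical 20th-percentile cutoffs and feature values.
cutoffs = {"sidechain-major": {"hbonds": 1, "vdw": 2, "basa": 5.0},
           "sidechain-sugar": {"hbonds": 1, "vdw": 3, "basa": 8.0}}
feats = {"sidechain-major": {"hbonds": 0, "vdw": 1, "basa": 2.0},
         "sidechain-sugar": {"hbonds": 0, "vdw": 1, "basa": 2.0}}
print(assign_moieties(feats, cutoffs))  # no threshold met -> weak assignment
```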
A.1.5 Tyrosine interaction motif analysis
To examine the minor groove interactions of tyrosine residues (as shown in Figure 5A), we compiled all tyrosine–nucleotide interactions which occur in the minor groove of a helical region of DNA. Using the DNAproDB data, we found all instances of a tyrosine–nucleotide interaction via the interface.nucleotide-residue_interaction feature arrays, which list all nucleotide–residue interactions in a given interface, and filtered for those with a nuc_secondary_structure feature value of "helical" and which included a minor groove interaction moiety (see Identification of structural and interaction moieties). For each tyrosine, we first noted how many nucleotides it interacted with simultaneously. To simplify the analysis, we kept only the subset of tyrosines which interact with exactly three nucleotides in the minor groove. The remaining list of interactions was then filtered for sequence redundancy of the tyrosine parent chain using the PDB-derived protein.chain.sequence_clusters features (see Supplementary Figure A.2), keeping up to 15 tyrosine examples per 90% sequence cluster, resulting in 99 total examples. For each example, the nucleotides were sorted by their center-of-mass distance from the tyrosine and placed into bins 1, 2, and 3 of a two-dimensional histogram, with the second index indicating the nucleotide type (A, C, G, T).
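The final binning step can be sketched as below. For simplicity the sketch uses an unweighted centroid in place of a mass-weighted center of mass, and the coordinates are invented toy data; names are our own.

```python
import math

def centroid(coords):
    """Unweighted centroid of a list of (x, y, z) atom coordinates."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def histogram_indices(tyr_coords, nucleotides):
    """Rank the three interacting nucleotides by centroid distance from the
    tyrosine, pairing each rank (1-3) with the nucleotide base type."""
    tyr_com = centroid(tyr_coords)
    ranked = sorted(nucleotides,
                    key=lambda nuc: math.dist(tyr_com, centroid(nuc["coords"])))
    return [(rank, nuc["base"]) for rank, nuc in enumerate(ranked, start=1)]

# Toy example: three nucleotides at increasing distances from the tyrosine.
tyr = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
nucs = [
    {"base": "G", "coords": [(5.0, 0.0, 0.0)]},
    {"base": "A", "coords": [(2.0, 0.0, 0.0)]},
    {"base": "T", "coords": [(9.0, 0.0, 0.0)]},
]
print(histogram_indices(tyr, nucs))  # [(1, 'A'), (2, 'G'), (3, 'T')]
```

Each (rank, base) pair then increments one cell of the two-dimensional histogram.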
A.1.6 PCA analysis of interface features
A principal component analysis (PCA) was performed on 134 DNAproDB features (or features derived from them) describing the DNA–protein interface for 4,758 different interfaces, as shown in Figure 5B. Each interface used was that of a single protein chain interacting with a single DNA entity; DNAproDB breaks down every DNA–protein interface by protein chain under the interface.interface_features feature array (see Supplementary Figure A.2). The features used describe characteristics such as the geometry of the protein surface, BASA and hydrogen bond statistics, nucleotide–residue interaction geometries, and residue propensities, and were concatenated to create a dense feature vector. Every vector was then projected along the eigenvectors corresponding to the first and second principal component axes, and the projected components were plotted. Each interface was then grouped according to the GO annotations of the protein chain (available under protein.chains features) into one of four broad classes: transcription factor activity, DNA repair, DNA recombination, and single-stranded DNA binding.
A.1.7 Nucleotide–residue stacking probability analysis
To generate the stacking probabilities shown in Figure 2.5C, for each residue with a planar side chain
(arginine, phenylalanine, tyrosine, tryptophan, histidine, asparagine, aspartic acid, glutamine and
glutamic acid), all instances of a base interaction (determined via the interaction moiety feature
under the interface.nucleotide-residue_interactions feature array; see Identification of
structural and interaction moieties) with a nucleotide in a single-stranded DNA conformation
were gathered from the DNAproDB dataset (4,509 entries), and the conditional probability for
that residue–nucleotide interaction pair to form a base-stacking geometry was computed in the
following way

P(\mathrm{stack} \mid N, R) = \frac{P(\mathrm{stack}, N, R)}{P(N, R)} \qquad (A.1)

where

P(\mathrm{stack}, N, R) = \frac{n(g = \mathrm{stack}, N, R)}{\sum_{g,N,R} n(g, N, R)} \qquad (A.2)

is the joint probability for an interaction between a residue R and a nucleotide N to be in
a geometry g ∈ {stack, pseudo-pair, other} (interaction geometry was determined by the program
SNAP [77, 78]) with g = stack, and n(g, N, R) is the number of residue–nucleotide interactions
between N and R in the dataset with geometry g. The prior probability is given by

P(N, R) = \frac{\sum_{g} n(g, N, R)}{\sum_{g,N,R} n(g, N, R)} \qquad (A.3)
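Since the normalization in Eqs. A.2 and A.3 cancels in the ratio of Eq. A.1, the conditional probability reduces to a ratio of counts; a minimal sketch:

```python
from collections import Counter

GEOMETRIES = ("stack", "pseudo-pair", "other")

def stacking_probability(interactions):
    """P(stack | N, R) from a list of (geometry, nucleotide, residue) tuples.

    Following Eqs. A.1-A.3, the total-count denominators cancel, leaving
    n(stack, N, R) / sum_g n(g, N, R) for each (N, R) pair.
    """
    counts = Counter(interactions)
    pairs = {(n, r) for (_, n, r) in interactions}
    probs = {}
    for n, r in pairs:
        total = sum(counts[(g, n, r)] for g in GEOMETRIES)
        probs[(n, r)] = counts[("stack", n, r)] / total
    return probs

# toy example: three DC-TYR interactions, two of which stack
data = [("stack", "DC", "TYR"), ("stack", "DC", "TYR"), ("other", "DC", "TYR")]
print(stacking_probability(data))  # P(stack | DC, TYR) = 2/3
```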
116
Figure A.1: An example of a DNA–protein complex which contains two DNA structural entities and
one protein structural entity. A. A graph showing nucleotide base pairing, base stacking and sugar-phosphate
linkages which has been stylized using DNAproDB visualization tools. Two disparate sub-graphs
can be seen, which represent the two discrete DNA structural entities seen in the second panel.
DNA structural entities are named based on the DNA strands they are composed of; their identifier is
simply a concatenation of their component strand identifiers. The DNA strand identifiers are based on
the DNA chain which the strand belongs to. The first strand in chain E is named "E1", the second strand
"E2", etc. In this structure, the DNA chain E has a break (due to a missing nucleotide), thus forming
two strands. B. The three-dimensional structure, which shows the two discrete DNA structural entities
and the single discrete protein structural entity. This protein entity has two chains, chain L and chain H,
which form a heterodimer; however, chain H consists of three chain segments, H1, H2 and H3, which are
due to backbone breaks. The four chain segments form a single closed molecular surface, and hence are
considered a single structural entity. Protein structural entities are named in the same way that DNA
entities are.
Figure A.2: An overview of the DNAproDB feature hierarchy. Features are grouped into three main
categories: DNA-specific features, protein-specific features, and interface-specific features. Within each
category there are two levels of features, entry-level and model-level. Entry-level features are those which
can be described at the level of the entry (i.e. the structure), and do not vary across models. For instance,
under protein features, chains are an entry-level feature because the number of chains and most of their
properties (e.g. chain identifier, sequence, sequence-based annotations, length, etc.) do not vary or depend
on the coordinates assigned in a particular model. Any feature within an entry-level feature branch that
does depend on the model will be stored as an array with one element per model. For example, the
secondary structure feature of a protein chain can vary from model to model, since secondary structure
depends on the residue coordinates. Features under an entry-level branch which depend on the model
index are noted with an asterisk in the figure. Model-level features are those which depend on and may
change with the three-dimensional coordinates of the structure, or represent a data object which may only
exist for a particular model. These include items such as DNA base pairs, all interface properties, and
protein secondary structure elements. Some features refer to other features which are used as identifiers.
An identifier feature is analogous to a primary key in a relational database setting and is italicized in
the figure. Underlined features refer to identifier features.
Figure A.3: A graphical summary of our procedure for determining major and minor groove structural
moieties. A. The base-pair coordinate frame used by DSSR, from which we gather base-pairing information.
Depending on the glycosidic torsion angle conformation of the two bases, the x-direction will either
point towards the minor groove or the major groove. In Watson-Crick geometry, it will always point towards
the major groove. B. An illustration of our procedure for defining major and minor groove atoms. An
A/T Watson-Crick base pair is used for illustration purposes. In this case, both bases are in the anti
conformation, so the x-direction of the base-pair coordinate frame points towards the major groove. The coordinates
of the base atoms are projected onto the x-y plane of the base-pairing frame, and the shortest path
from the glycosidic nitrogen atom of both bases is found (shown in red), using the N1-N3 hydrogen bond
as an edge joining the bases. This path, in addition to the rays emanating from the glycosidic nitrogen
atoms, bisects the plane. All atoms lying on the path and on the minor groove side of the bisection are
classified as minor groove atoms, and atoms on the other side as major groove atoms.
Figure A.4: An example of different ways to filter nucleotide–residue interactions in DNAproDB visualizations,
using a DNAproDB entry containing the catalytic domain of APOBEC3G bound to a single-stranded
DNA oligomer (PDB ID 6BUX). The visualizations in panels B–G are examples of the residue contact
map visualization. A. The three-dimensional structure as depicted by NGL viewer [99, 98]. B. Default
DNAproDB criteria (4.5 Å minimum distance, interactions not weak; see Supplementary Methods) with
DNA base interactions shown and all protein secondary structures. Many interactions are shown involving
the first cytosine nucleoside, which is inserted into the active site's binding pocket of the APOBEC3G
domain. C. The same criteria as in B, but showing only residues which form stacking interactions. D.
Interactions are shown using default criteria, but only to the DNA sugar moieties, indicated by the yellow
interaction lines which connect to the sugar moiety symbol of each nucleotide. E. The same as in D,
except now involving only interactions with at least one sugar hydrogen bond. F. Here we are visualizing
all DNA moiety interactions using default criteria, but only for helix or strand residues. Only two residues
are shown, and neither makes any phosphate interactions. This is because these residues are in the active
site, which is deeply buried in the protein and is mainly accessible by the inserted cytosine base, with
some sugar interactions. G. Here all DNA moiety interactions are shown, but only for loop residues. The
interaction distance cut-off has been lowered to a 3.5 Å minimum nucleotide–residue distance, and any
interaction involving a hydrogen bond is highlighted with a red outline.
Figure A.5: Visualizations from DNAproDB for a heterodimer of the Hox protein Sex combs reduced
(Scr) and its cofactor Extradenticle (Exd) bound to two different DNA fragments (PDB IDs 2R5Z and
2R5Y). Only major groove and minor groove contacts are shown. Joshi et al. [59] showed that for this
protein complex the Scr N-terminal linker residues Arg5 and His−12 are important for conferring sequence
specificity via shape recognition of the minor groove. A. Three-dimensional views of the minor groove in
the region of the Scr Arg5 linker residue. The left panel is an Scr in vivo binding site (PDB ID 2R5Z) in
which the Arg5 and His−12 residues can be seen in the minor groove. On the right is a Hox consensus site
which lacks the Arg5 and His−12 contacts. B. Two residue contact maps showing major groove and minor
groove contacts for the Scr–Exd heterodimer bound to the Scr in vivo binding site fkh250 on the left
(PDB ID 2R5Z) and a Hox consensus site fkh250^{con*} on the right (PDB ID 2R5Y). The colored markers
indicate residues and their secondary structure: helix residues are represented as red circles and linker
residues are represented as blue squares. Residues are grouped into SSEs, and markers on each nucleotide
represent the major and minor groove contacts, respectively. The Scr residues Arg5 and His−12 are seen
making contacts in the DNA minor groove of the in vivo binding site, but cannot be seen contacting the
Hox consensus site. C. Shape overlay plots of the minor groove width of the two binding sites, fkh250 and
fkh250^{con*}. The differences in the intrinsic shape profile of these DNA sequences, which are described in
[59], explain the preference for the Scr in vivo binding site.
nuc.        feature   pp/mc  pp/sc  sr/mc  sr/sc  wg/mc  wg/sc  sg/mc  sg/sc  bs/mc  bs/sc
A (23227)   h. bond     1      1      1      1      1      1      1      1      1      1
            vdw         1      1      1      1      1      1      1      1      1      1
            basa        1     4.4     1     5.8    0.9    3.2    0.8    3.4    1.3    9.7
C (19559)   h. bond     1      1      1      1      1      1      1      1      1      1
            vdw         1      1      1      1      1      1      1      1      1      1
            basa       0.9    5.6    1.2    4.5     1     3.7    0.9    1.9     2     4.8
G (27427)   h. bond     1      1      1      1      1      1      1      1      1      1
            vdw         1      1      1      1      1      2      1      1      1      1
            basa       0.8    3.9    1.4    4.7    0.8    5.3    0.9    3.5    0.9    7.7
T (24547)   h. bond     1      1      1      1      1      1      1      1      1      1
            vdw         1      1      1      1      1      1      1      1      1      1
            basa        1     4.9    1.2    3.8    0.8     7     0.8    2.3    1.3    7.5
Table A.1: Interaction feature cut-off values for arginine interactions with the standard four DNA
nucleotides. The numbers in parentheses are the total number of interactions used to calculate the
20th percentiles of the feature value distributions, which are then used as cut-off values. Note
that these cut-off values are inclusive, and the feature values were clipped at 0.5 before computing
percentiles to avoid a large bias towards zero. The feature column indicates the feature type: h.
bond is the total number of hydrogen bonds for a moiety interaction, vdw is the number of van der
Waals interactions, and basa is the buried solvent accessible surface area. The remaining columns
are the possible moiety interactions with the following abbreviations: pp – DNA phosphate; sr
– DNA sugar; wg – DNA major groove; sg – DNA minor groove; bs – DNA base; mc – protein
main chain; sc – protein side chain.
Appendix B
Definitions and Formulas for Computing Features used by GeoBind
The following are definitions and formulas (where applicable) for the structure and shape features
implemented by GeoBind.
B.1 Structure Features
The following are structure features used in the GeoBind method. A structure feature maps
some descriptor of the local chemical, structural, or geometric environment to every atom in
the structure. They can be considered functions of the atomic coordinates in the structure, or
functions of the underlying sequence.
Solvent excluded surface area (SESA): The SESA is defined as the reentrant surface of a
spherical probe which is rolled along the van der Waals surface of a structure. Generally a radius
of 1.4 Å is chosen, approximating the volume of a water molecule. It defines the surface beyond
which solvent molecules cannot penetrate. The SESA is composed of three types of areas: a
spherical polygon area a_s(i) on the surface of a solvent-accessible atom i, a patch area a_t(i, j)
on a toroid defined by two atoms i, j and a spherical polygon area a_p(i, j, k) on the surface of
a probe whose position is determined by three atoms i, j, k [120]. GeoBind uses MSMS [104] to
compute the SESA contribution of each atom, where the total SESA is given by the union of the
contribution of each atom.
Solvent accessible surface area (SASA): SASA is similar to SESA, also defined using
a spherical probe rolled along the van der Waals surface of a structure, but it is defined as
the surface traced out by the center of the spherical probe. A radius of 1.4 Å is typically used,
approximating the radius of a water molecule. The SASA is equivalent to the van der Waals
surface obtained when adding a constant equal to the probe radius to the radii of every atom. Therefore,
every atom contributes a spherical patch of surface area to the total SASA, which can be calculated
using the Lee-Richards [69] or Shrake-Rupley [110] algorithms.
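A toy NumPy sketch of the Shrake-Rupley idea (sample points on each probe-expanded sphere, keep the unburied area fraction); this is illustrative only, not the GeoBind implementation.

```python
import numpy as np

def fibonacci_sphere(n):
    """Roughly uniform points on the unit sphere (golden-spiral lattice)."""
    i = np.arange(n)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i
    z = 1.0 - 2.0 * (i + 0.5) / n
    r = np.sqrt(1.0 - z * z)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def sasa(coords, radii, probe=1.4, n_points=960):
    """Per-atom SASA via Shrake-Rupley style point sampling."""
    coords = np.asarray(coords, float)
    expanded = np.asarray(radii, float) + probe
    sphere = fibonacci_sphere(n_points)
    areas = np.zeros(len(coords))
    for i, (c, r) in enumerate(zip(coords, expanded)):
        pts = c + r * sphere
        exposed = np.ones(n_points, dtype=bool)
        for j, (cj, rj) in enumerate(zip(coords, expanded)):
            if j != i:
                exposed &= np.linalg.norm(pts - cj, axis=1) >= rj
        # fraction of exposed sample points times the full sphere area
        areas[i] = exposed.mean() * 4.0 * np.pi * r * r
    return areas

# an isolated atom is fully exposed: SASA = 4*pi*(r + probe)^2
print(sasa([[0.0, 0.0, 0.0]], [1.6])[0])
```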
Residue secondary structure: We use DSSP [61] to calculate secondary structure at the
residue level. DSSP detects eight secondary structure conformations: three helix (3_10, α, π), two
strand and three loop, and we bin these into just three classes (helix, strand and loop). Every residue is assigned one of
these classes, and we use one-hot encoding to assign secondary structure as a numerical feature.
Every atom in the residue is assigned the same secondary structure feature.
Atchley Factors: Atchley factors are five numerical descriptors developed by Atchley et
al. [4] which summarize statistical properties of the twenty standard amino acids. They are the
result of dimensionality reduction and identify clusters of amino acid properties that co-vary. The
five factors correspond approximately to polarity, secondary structure, molecular volume, codon
diversity, and electrostatic charge. Each residue is assigned a fixed five-dimensional vector of real
numbers based on the protein's sequence.
Spatial Aggregation Propensity: Spatial aggregation propensity [19] is a measure of hydrophobicity
that accounts for each atom's local structural context. It is given by

\mathrm{SAP}_i = \sum_{j:\, \mathrm{mindist}(R_j, a_i) \le D} \frac{\sum_{a_k \in R_j} \mathrm{SASA}(a_k)}{\sum_{a_k \in R_j} \mathrm{SASA}^{\circ}(a_k)} H_j \qquad (B.1)

where R_j are the residues that have at least one atom within a cut-off distance D of atom a_i,
SASA and SASA° are the observed and standard SASA for an atom a_k that belongs to residue
R_j, and H_j is a measure of hydrophobicity for residue j.
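A toy sketch of Eq. B.1 under these definitions; the coordinates, SASA values, and hydrophobicity scale below are illustrative placeholders.

```python
import numpy as np

def sap(atom_coords, residue_ids, sasa_obs, sasa_std, hydrophobicity, cutoff=5.0):
    """SAP_i = sum over residues R_j with any atom within `cutoff` of atom i of
    (observed residue SASA / standard residue SASA) * H_j   (Eq. B.1)."""
    atom_coords = np.asarray(atom_coords, float)
    residue_ids = np.asarray(residue_ids)
    unique_res = np.unique(residue_ids)
    # precompute each residue's SASA ratio (observed over standard)
    ratio = {r: sasa_obs[residue_ids == r].sum() / sasa_std[residue_ids == r].sum()
             for r in unique_res}
    sap_vals = np.zeros(len(atom_coords))
    for i, xi in enumerate(atom_coords):
        d = np.linalg.norm(atom_coords - xi, axis=1)
        for r in unique_res:
            if d[residue_ids == r].min() <= cutoff:  # residue within cutoff of atom i
                sap_vals[i] += ratio[r] * hydrophobicity[r]
    return sap_vals

coords = [[0, 0, 0], [1, 0, 0], [10, 0, 0]]   # two residues, one far away
res = np.array([0, 0, 1])
obs = np.array([10.0, 10.0, 30.0])            # observed per-atom SASA
std = np.array([20.0, 20.0, 30.0])            # standard (fully exposed) SASA
H = {0: 1.0, 1: 2.0}                          # per-residue hydrophobicity
print(sap(coords, res, obs, std, H, cutoff=5.0))
```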
Hydrogen Bonding Potential: We identify hydrogen bond donor (heavy atom and hydrogen
atom) and acceptor atoms from the molecular structure and one-hot encode these atoms with
a [0, 1] potential. Mapping this feature to the mesh using distance weighting produces a smooth
signal over the mesh, indicating regions of likely hydrogen bonds.
Circular Variance: Circular variance is a feature in [0, 1] that describes how buried each
atom is relative to its local surroundings:

\mathrm{CV}_i = 1 - \frac{1}{n_i} \left| \sum_{j \ne i,\, r_{ij} \le r_c} \frac{\vec{r}_{ij}}{r_{ij}} \right| \qquad (B.2)

It is one minus the norm of the average of the unit vectors pointing from all atoms j to atom i
within some cut-off radius r_c (n_i is the number of such neighbors). If
all these vectors lie along the same direction, their average will have norm 1 and the CV will be at a
minimum, whereas if an atom is completely surrounded by atoms in random directions, the unit
vectors will average to 0 and the CV will be at a maximum. By adjusting the cut-off radius we can
describe the distribution of atoms at different scales.
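Eq. B.2 can be sketched directly in NumPy:

```python
import numpy as np

def circular_variance(coords, cutoff=10.0):
    """Per-atom circular variance (Eq. B.2): one minus the norm of the
    average unit vector from neighboring atoms j (within `cutoff`) to atom i."""
    coords = np.asarray(coords, float)
    cv = np.zeros(len(coords))
    for i, xi in enumerate(coords):
        r = xi - coords                        # vectors from every atom j to atom i
        d = np.linalg.norm(r, axis=1)
        mask = (d > 0) & (d <= cutoff)         # exclude self, apply the cutoff
        units = r[mask] / d[mask, None]
        cv[i] = 1.0 - np.linalg.norm(units.mean(axis=0))
    return cv

# atom at the center of an octahedron: the unit vectors cancel, so CV is maximal (1)
pts = [[0, 0, 0], [1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]]
print(circular_variance(pts)[0])
```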
B.2 Shape Features
The following are shape features used in the GeoBind method. A shape feature maps a descriptor
of local or global geometrical properties of the surface mesh to every vertex in the mesh. Shape
features depend only on the mesh vertex coordinates and edges and encode purely geometrical
information.
Mean Curvature: Given the principal curvatures at a vertex, k_1 and k_2, the mean curvature
is the arithmetic mean of the principal curvatures, H = (k_1 + k_2)/2. It is positive
when the surface is locally convex with respect to the normal vector, zero when the surface is locally flat,
and negative for concave curvature.
Shape Index: The shape index was developed in [65] and combines the principal curvatures in a way
that is independent of the overall magnitude of curvature. It is given by

s = \frac{2}{\pi} \arctan\left( \frac{k_2 + k_1}{k_2 - k_1} \right), \qquad k_1 \ne k_2 \qquad (B.3)
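Both curvature descriptors are simple functions of the principal curvatures; a sketch (the saddle example below illustrates the symmetry of Eq. B.3):

```python
import numpy as np

def mean_curvature(k1, k2):
    """H = (k1 + k2) / 2."""
    return 0.5 * (np.asarray(k1, float) + np.asarray(k2, float))

def shape_index(k1, k2):
    """s = (2/pi) * arctan((k2 + k1) / (k2 - k1)), defined for k1 != k2 (Eq. B.3).
    Scaling both curvatures by the same factor leaves s unchanged."""
    k1, k2 = np.asarray(k1, float), np.asarray(k2, float)
    return (2.0 / np.pi) * np.arctan((k2 + k1) / (k2 - k1))

# a symmetric saddle (k1 = -k2) sits at the center of the shape-index scale
print(shape_index(1.0, -1.0))  # 0.0
```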
Convex Hull Distance: The convex hull of a set of points is the smallest convex set which
contains the points. That is, for every pair of points in the convex hull, the line
segment joining them is also within the convex hull. We can view the convex hull of a mesh as
a crude approximation of the overall shape. Regions where the convex hull and the mesh are
coincident are regions where the mesh is convex, and regions far from the convex hull are concave.
Therefore, the distance between the mesh and convex hull is a measure of concavity of the mesh
shape. We compute the minimum distance between the hull and the mesh at every vertex.
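One way to sketch this with SciPy, approximating the hull distance by the distance to the nearest hull facet plane (exact when the closest hull point lies inside that facet); the cube example is illustrative:

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_distance(vertices):
    """Approximate distance from each mesh vertex to the convex hull surface.
    Larger values indicate more concave regions of the mesh."""
    vertices = np.asarray(vertices, float)
    hull = ConvexHull(vertices)
    # hull.equations rows are (unit normal, offset) with normal . x + offset <= 0 inside
    normals, offsets = hull.equations[:, :-1], hull.equations[:, -1]
    signed = vertices @ normals.T + offsets    # signed distance to each facet plane
    return -signed.max(axis=1)                 # distance to the nearest facet plane

# unit cube corners plus its center: corners lie on the hull (distance 0),
# the center is 0.5 from every face
pts = [[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)] + [[0.5, 0.5, 0.5]]
print(hull_distance(pts))
```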
Shape Diameter Function (SDF): The shape diameter function relates the surface of a
mesh to its volume [107]. Given a point on the surface mesh, a cone is placed centered around
its inward-normal direction (the opposite direction of its normal), and several rays are projected
inside this cone to the other side of the mesh. The SDF at a point is defined as the weighted
average of all ray lengths which fall within one standard deviation of the median of all lengths.
The weights used are the inverse of the angle between the ray and the center of the cone.
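A sketch of just this aggregation step, assuming the ray lengths and their angles from the cone axis have already been produced by a ray caster (not shown):

```python
import numpy as np

def sdf_value(ray_lengths, ray_angles, eps=1e-6):
    """Weighted average of ray lengths within one standard deviation of the
    median; weights are the inverse of each ray's angle from the cone axis."""
    lengths = np.asarray(ray_lengths, float)
    angles = np.asarray(ray_angles, float)
    med, std = np.median(lengths), lengths.std()
    keep = np.abs(lengths - med) <= std        # discard outlier rays
    weights = 1.0 / (angles[keep] + eps)       # eps guards the central (angle 0) ray
    return np.average(lengths[keep], weights=weights)

# five rays: the outlier (9.0) falls outside one std of the median and is dropped
print(sdf_value([2.0, 2.1, 1.9, 2.0, 9.0], [0.0, 0.2, 0.3, 0.1, 0.4]))
```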
Pocket Features: Pockets in the surface mesh are identified using NanoShaper [31] and
defined by volumetric differences between the molecular surface and the molecular surface with
expanded atomic radii. We use a threshold requiring that the volume and shape of the pocket be large
enough to accommodate approximately two water molecules. Once identified, we can map features
to the vertices that comprise the pocket, such as the volume of the pocket, the surface area of
the pocket, and its aspect ratio, which roughly encodes its deviation from sphericity. Non-pocket
vertices are simply assigned zero for all of these features.
Abstract
Protein-DNA interactions are the foundation of nearly every process that occurs within living cells. Every function a cell performs (or has the capacity to perform) is determined by the cell's genetic code, which is written in the alphabet of nucleotide sequences and stored in what we call DNA. It is through the action of a vast army of proteins which bind DNA that this genetic information is converted to the chemical building blocks that give rise to living organisms. Proteins are responsible for transcribing DNA to RNA and regulating that process via a complex set of interactions between various transcription factors and DNA that affect the rate at which the RNA polymerase enzyme can transcribe a particular gene. The genome itself consists of a highly organized and compacted assembly of nucleoprotein complexes known as chromatin. DNA in a single human cell is about two meters in length if laid out in a linear fashion. In vivo it is folded and coiled into a densely packed nucleus only about six micrometers in size through the action of proteins such as histones, condensin and CTCF. DNA is constantly suffering damage due to the effects of ionizing and ultraviolet radiation, reactive oxygen species, viruses and toxins. To prevent rapid deterioration of genetic information and malfunction of the machinery of the cell, specialized proteins have evolved which are continually repairing DNA damage and maintaining genetic fidelity. In order for organisms to grow, cells must divide, and the entire content of the genome must be unraveled, replicated and recombined, processes which are governed by the coordination of a large number of specialized proteins. It is therefore quite evident that gaining a mechanistic understanding of biological processes requires an understanding of how proteins interact and bind to DNA.
Structural data of protein-DNA complexes provides researchers with an atom-level view of protein-DNA interactions. These models of proteins bound to their cognate DNA binding sites help to elucidate the relationships among sequence, structure and biological function, distinguish different mechanisms of DNA recognition, offer a deeper physical understanding of existing experimental results, and give insights into the molecular machinery that drives living cells. The amount of available structural data continues to increase every year, and this ever-expanding resource is enabling new approaches to studying protein-DNA interactions on large scales. In this thesis I present approaches we have developed for studying protein-DNA interactions which utilize structural data. We have built tools specifically for the analysis of the structures of proteins bound to DNA, which include a data processing pipeline for automatically extracting meaningful structural features from a protein-DNA complex, a large database of pre-processed complexes which can be searched based on features of the protein, the DNA or the molecular nucleoprotein interactions, and interactive visualization and data exploration tools for examining the interface interactions that occur in protein-DNA complexes. We have also developed machine learning methods for predicting DNA binding sites from protein structure using geometric deep learning. We represent a protein structure using a molecular surface mesh and have built graph convolutional neural networks for identifying regions of a protein surface which correspond to a DNA binding site. In addition to the original work we have contributed to the study of protein-DNA interactions, I review relevant background material.