Deciphering Protein-Nucleic Acid Interactions with Artificial
Intelligence
by
Raktim Mitra
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTATIONAL BIOLOGY AND BIOINFORMATICS)
December 2024
Copyright 2024 Raktim Mitra
Epigraph
“Karmanye vadhik ¯ araste M ¯ a Phaleshu Kad ¯ achana, ¯
Ma Karmaphalaheturbh ¯ urm¯ a Te Sangostvakarmani.” ¯
−Srimadbhagabatgita, Sankhya Yoga (Dwitiya Adhyaya), Sloka 47 ¯
Dedication
I dedicate this dissertation to Hiren and Mahua Mitra,
for their never ending love and support.
Acknowledgements
I offer my utmost thanks to my advisor Prof. Remo Rohs, whose encouragement, patience,
intelligent decisions and immense knowledge supported me throughout my Ph.D., made it a
wonderful experience, and helped me overcome many difficulties.
I am also extremely thankful to my dissertation committee members Prof. Helen M. Berman, Prof. Fengzhu Sun, Prof. Adam L. MacLean and Prof. Aiichiro Nakano, and qualifying exam committee member Prof. Xiaojiang Chen. Their insightful comments enriched and widened my research from various perspectives. I acknowledge the efforts of Prof. Helen M. Berman and others, who worked to build and maintain the Protein Data Bank over all these years. Building DeepPBS, RNAscape, DNAproDB and RNAproDB would not have been possible otherwise. I also thank Prof. Helen M. Berman for inspiring me to work towards building RNAscape and for her valuable guidance.
I joined the QCB department at USC as a PhD student in August 2019. The rigorous curriculum of the CBB PhD program helped me shape my scientific vision and research skills. Soon, the COVID-19 pandemic changed everything about our lives and uncertainty covered the world. Thankfully, due to the wise and loving care of the department, my advisor Prof. Remo Rohs and other members of the QCB department, I was able to continue my work through the difficult times. Prof. Adam MacLean worked hard to help me publish my first first-author paper during this time. I also thank Professor Andrew P. McMahon and Professor Mona Singh for valuable discussions. The pandemic severely affected scientific travel opportunities during those initial years of my PhD. However, later on I was fortunate to be able to travel to numerous high-quality conferences and present my work there. I am thankful to my advisor Prof. Remo Rohs for providing me with these opportunities. In fact, meeting Nobel laureate Prof. Ada Yonath (recognized for solving the structure of the ribosome [1, 2]) has been one of the most memorable experiences of my life. Traveling to conferences and presenting my work through posters and oral presentations opened up many career and collaboration opportunities for me. My scientific thinking has been shaped by these experiences, coupled with professional growth. I am grateful to the Rohs Lab members for accompanying and participating alongside me in these events and for always being supportive and encouraging.
I also extend my gratitude to my collaborator Dr. Cameron Glasscock from the lab of Prof. David Baker at the University of Washington, and to Prof. Ada Yonath and her lab members, for being supportive of my research and providing us with the proto-ribosome coordinates for the RNAscape project. The scientific works of the two Nobel laureates, Prof. Ada Yonath and Prof. David Baker, have been highly influential in my research.
I am thankful to the Andrew Viterbi Fellowship in Computational Biology and Bioinformatics
for supporting me over three years of my PhD.
My sincere thanks go to the QCB department for allowing me to participate in various departmental activities, which helped shape me as a person. It has been a pleasure to work with and receive all forms of support from Luigi, Rokas, Tanya, Katie and Christian.
I thank my fellow labmates Yibei, Tsu-Pei, Jinsen, Ari, Jesse, Yingfei and George, and previous lab members Brendon and Jared. This dissertation would not have been possible without their stimulating discussions and contributions, both in science and in life. Especially, without Jared's mentorship during the earlier years of my PhD, my dissertation would not have achieved its current form. I also thank my graduate mentees Zijin, Wei Yu, Lexi (and others), and my undergraduate mentees Andrew, Hirad, Avinash and Irika. It has been a pleasure to work with all of you and to see your scientific growth.
Special thanks go to my friends Bryan, Meilu, Sophia, Vivian, Fred, Eric, Priyanka, Chloe, Nic and Anik, for supporting me along the way. You have made my life in Los Angeles a cherished memory. I also offer my gratitude to Prof. Lina Bahn, Christine Lee, Yue Qian, and Haesol Lee, for helping me pursue my interest in violin and music, which helped me conquer many challenging times.
Last but not least, I am forever indebted to my parents, without whose decades-long hard work and dedication I would not even be here. On the same thread, I am grateful for the support of my extended family members, childhood friends and teachers, who provided me with unfailing support and continuous encouragement throughout this journey.
Contents
Epigraph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Chapter 1: A probabilistic model for smooth binding site label prediction on protein surfaces 6
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.1 A suitable smoothness metric for binding site prediction . . . . . . . . . . 9
1.2.2 Continuous Conditional Random Fields layer . . . . . . . . . . . . . . . . 12
1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Chapter 2: Geometric deep learning of protein–DNA binding specificity . . . . . . . . . . 20
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.1 The DeepPBS Framework . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.2 DeepPBS performance for experimentally determined structures . . . . . 25
2.2.3 DeepPBS captures patterns of family-specific binding modes . . . . . . . 28
2.2.4 Application to in silico-predicted protein–DNA complexes . . . . . . . . . 30
2.2.5 Assessing protein residue importance at p53-DNA interface . . . . . . . . 32
2.2.6 Comparison of residue-level importance with mutagenesis data . . . . . . 35
2.2.7 Application to designed scaffolds targeting specific DNA . . . . . . . . . . 37
2.2.8 Application of DeepPBS to MD simulation of Exd-Scr–DNA system . . . . 39
2.2.9 Details on outliers seen on the benchmark set performance . . . . . . . . 40
2.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4.1 Position Weight Matrix (PWM) . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4.2 Ungapped Local Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.4.3 Representing DNA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.4.4 Representing protein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.4.5 DeepPBS architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.4.6 Training, cross-validation, and benchmarking . . . . . . . . . . . . . . . . 49
2.4.7 Performance metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.4.8 Additional discussion on behavior of metrics . . . . . . . . . . . . . . . . 50
2.4.9 Additional details associated with application of DeepPBS on predicted
structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.4.10 Molecular Dynamics (MD) simulation of Exd-Scr–DNA system . . . . . . 54
2.4.11 Measures ensuring prevention of overfitting to protein sequences . . . . . 56
2.4.12 Bipartite edge perturbation and protein heavy atom importance score
calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.4.13 Description of competitor assay for quantifying designed proteins’ binding
specificity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.4.14 DeepPBS webserver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.6 Data availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.7 Code availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Chapter 3: RNAscape: Geometric mapping and customizable visualization of RNA structure 62
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2 Materials and methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.2.1 Programming languages and general tools . . . . . . . . . . . . . . . . . . 66
3.2.2 The RNAscape algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.3.1 Application of RNAscape to structures from the PDB . . . . . . . . . . . . 69
3.3.2 RNAscape user interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3.3 Output images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.3.4 Base-pairing annotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.3.5 Customizable settings . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.5 Data availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Chapter 4: DNAproDB: an updated database for the automated and interactive analysis of
protein–DNA complexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2 Update details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.2.1 Processing pipeline and data update . . . . . . . . . . . . . . . . . . . . . 79
4.2.2 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2.3 Web interface and user experience . . . . . . . . . . . . . . . . . . . . . . 81
4.2.4 Quantitative analysis of readout features . . . . . . . . . . . . . . . . . . 82
4.2.5 Water-mediated hydrogen bonds . . . . . . . . . . . . . . . . . . . . . . . 85
4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.3.1 Data availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Chapter 5: RNAproDB: a webserver and interactive database for analyzing protein–RNA
interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2 Processing pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3 Interface explorer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4 Sequence viewer and 3D viewer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.5 Secondary structure selector and subgraph exploration . . . . . . . . . . . . . . . 93
5.6 Search functionalities and Upload . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.7 RNA-RNA water mediated interactions . . . . . . . . . . . . . . . . . . . . . . . . 95
5.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.9 Data availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Chapter 6: Tangential research: Generative modeling of gene expression time series data . 99
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.2.1 Variational inference and variational autoencoders . . . . . . . . . . . . . 102
6.2.2 RVAgene: A recurrent variational autoencoder to model gene expression
dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.2.3 Generating synthetic gene expression time series data . . . . . . . . . . . 106
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3.1 RVAgene can accurately and efficiently reconstruct temporal profiles from
synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3.2 RVAgene modeling of pseudotemporally ordered data during embryonic
stem cell differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.3.3 Comparison of RVAgene with alternative approaches for gene clustering . 114
6.3.4 RVAgene can classify and predict gene expression dynamics in response
to kidney injury . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.3.5 Assessment of the computational efficiency of RVAgene . . . . . . . . . . 122
6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Chapter 7: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Appendix B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
List of Figures
1 Structure of the human interferon-beta enhanceosome . . . . . . . . . . . . . . . 2
1.1 PNAbind schematic, example and conceptual explanation of CRF application scenario 10
1.2 The three smoothness metrics M1, M2, S evaluated on 15 validation predictions on the PDNA-74 dataset. S clearly behaves much better, showing a large range of values compared to the rest. . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3 CCRF for smooth binding site label prediction over protein surface. . . . . . . . . 16
2.1 Interaction propensities of components of arginine amino acid towards various
DNA moieties and functional groups . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Schematic illustration of the DeepPBS framework . . . . . . . . . . . . . . . . . . 26
2.3 Performance of DeepPBS for predicting binding specificity across protein families
for experimentally determined structures. . . . . . . . . . . . . . . . . . . . . . . . 29
2.4 Application of DeepPBS on predicted protein–DNA complex structures. . . . . . . 34
2.5 Visualization of DeepPBS importance scores in p53–DNA interface as a case study,
and experimental validation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.6 Application of DeepPBS to in silico-designed HTH scaffolds targeting a specific
DNA sequence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.1 RNAscape output for various structures from the PDB. . . . . . . . . . . . . . . . 65
3.2 RNAscape overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.3 RNAscape algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.4 RNAscape output for large-size structures from the PDB. . . . . . . . . . . . . . . 76
4.1 Key aspects of this update to DNAproDB. . . . . . . . . . . . . . . . . . . . . . . . 79
4.2 Quantitative analysis of protein–DNA complexes in the DNAproDB collection. . 83
4.3 Water-mediated hydrogen bond annotation in DNAproDB . . . . . . . . . . . . . 86
5.1 Multiple mapping algorithms available in RNAproDB for protein-RNA complex
(PDB ID: 1IVS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2 Novel interactive capabilities of the RNAproDB user interface . . . . . . . . . . . 94
5.3 Example illustration of RNA-RNA water mediated hydrogen bond facilitating
non-Watson-Crick interaction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.1 Unsupervised representation learning with RVAgene using synthetic data. . . . . 108
6.2 Accurate reconstruction of embryonic stem cell differentiation dynamics with
RVAgene. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.3 Comparison of information captured in RVAgene latent space compared to a
standard fully connected VAE and results of standard hierarchical clusterings. . . 113
6.4 Accurate reconstruction of kidney injury response gene dynamics with RVAgene. 116
6.5 RVAgene latent space captures biological processes driving concordant gene expression changes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.6 Computational cost of training RVAgene . . . . . . . . . . . . . . . . . . . . . . . 122
S1 Dataset details. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
S2 Data Representation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
S3 DeepPBS architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
S4 Schematic representation of bipartite edge-perturbation process. . . . . . . . . . . 161
S5 Cross-validation performance of DeepPBS for predicting binding specificity across
protein families on experimentally determined structures. . . . . . . . . . . . . . . 162
S5 Cross-validation performance of DeepPBS for predicting binding specificity across
protein families on experimentally determined structures. . . . . . . . . . . . . . . 163
S6 Application of DeepPBS to MD simulation of AlphaFold2- and PDB (2R5Z)-based
modeled complex of Exd-Scr system. . . . . . . . . . . . . . . . . . . . . . . . 164
S7 Example DeepPBS ensemble predictions on structures of specific DNA binders. . 166
S8 Example DeepPBS ensemble predictions on structures of non-specific DNA binders. 167
S9 Application of DeepPBS on modeled structures. . . . . . . . . . . . . . . . . . . . 168
S10 Behavior of metric for different target PWM and interpolated predictions. . . . . 169
S11 MAE equivalent of benchmark performance vs alignment score plot. . . . . . . . 170
S12 Deep DNAshape prediction of the minor groove width (MGW) profile . . . . . . . 171
S13 Tertiary structure aware mapping of the peptidyl transferase center (PTC) of the
large ribosomal subunit of Deinococcus radiodurans (PDB ID: 1NKW), also known
as proto-ribosome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
S14 RNAscape web interface for side-by-side comparison of two structures. . . . . . . 173
S15 Visualization of a riboswitch by various methods. . . . . . . . . . . . . . . . . . . 174
S16 Demonstration of RVAgene working principle on simulated data with high noise. 175
S17 Characterization of gene dynamics by linear fit using Pearson correlation coefficient for 5 sample genes in the ESC differentiation dataset . . . . . . . . . . . . . 176
S18 Clusters detected by the unsupervised clustering algorithm DPGP for ESC differentiation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
S19 Accuracy of RVAgene reconstructions for different train/test group sizes. . . . . . 178
S20 Modeling response to kidney injury and analysis of linear fits. . . . . . . . . . . . 179
S21 Comparison of linear and quadratic fits to describe gene dynamics in response to
kidney injury. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
S22 Clustering on R1 and cluster specific GO enrichment analysis. . . . . . . . . . . . 181
S23 RVAgene latent space captures biological processes driving concordant gene expression changes (Sdc1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
S24 Examples of continuous time prediction of ESC differentiation. . . . . . . . . . . . 183
S25 Interaction propensities of components of lysine amino acid towards various DNA
moieties and functional groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
S26 Interaction propensities of components of histidine amino acid towards various
DNA moieties and functional groups . . . . . . . . . . . . . . . . . . . . . . . . . 185
S27 Interaction propensities of components of asparagine amino acid towards various
DNA moieties and functional groups . . . . . . . . . . . . . . . . . . . . . . . . . 186
S28 Interaction propensities of components of serine amino acid towards various DNA
moieties and functional groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
S29 RNAproDB search page showing card view results for the keyword search “tetrahymena” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
S30 RNAproDB output page for uploaded AlphaFold3 predicted structure (model 0)
for PDB ID: 8AW3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
S31 Multiple mapping algorithms available in RNAproDB for protein-NA-hybrid complex (PDB ID: 4OO8) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
S32 Step-by-step building of a symmetrized DNA base-pair. . . . . . . . . . . . . . . . 192
List of Tables
1 Quantification of growth of protein-DNA structures in the PDB (based on pre-2000 and post-2000 release dates). . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Data presented showing correspondence of LogSum-aggregated DeepPBS RI scores with alanine scanning mutagenesis data. . . . . . . . . . . . . . . . . . . . . . . 38
3.1 Description of various attributes of relevant tools which produce 2D visualizations of RNA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Abstract
This dissertation contains an account of my research work during my PhD at the University of Southern California. My primary focus has been deciphering protein-DNA interactions with data-driven deep learning methods. In Chapter 1, I present my work on the segmentation of protein surfaces into nucleic acid binding and non-binding regions. In Chapter 2, I describe DeepPBS, a geometric deep learning method to predict DNA binding specificity given a protein-DNA complex. DeepPBS acts as a bridge between structure-determining and specificity-determining experiments. In Chapter 3, we design and showcase the RNAscape algorithm and webserver, a geometric mapping method of RNA 3D structures to 2D which attempts to preserve the three-dimensional topology (unlike common secondary structure based visualization methods). Chapter 4 describes an updated DNAproDB database. Through this update we introduce both technical advances and an expansion of the features included in the analysis. At the same time, we recognized the lack of a comprehensive analysis and exploration tool for RNA/protein-RNA structures. Inspired by RNAscape and DNAproDB, we developed RNAproDB, which is described in Chapter 5. RNAproDB is a modern, highly interactive structure exploration tool tailored for the complexity and structural variance of RNA structures. In Chapter 6, I present a tangential work on generative modeling of gene expression time-series data, to learn a regularized latent space representation. We conclude this thesis by discussing the current state of the field of structural biology of protein-nucleic acid complexes and discussing future possibilities.
Introduction
The 21st century is an exciting time to be doing science. Specifically, computational biosciences have taken a central role in today's world, starting from the Human Genome Project [3], unveiled at the turn of the century. Fast-forward two decades: the complete sequence of a human genome [4] has been unveiled. However, sequence is only one side of the coin, as the drivers of the most intricate cellular processes are the structures of biomolecules. Complex biophysical processes govern how the genome is regulated and transcribed [5], and during these processes the biological reality of the genome is often far from the standard B-DNA structure discovered more than 70 years ago [6]. A beautiful reminder of this is the modeled complex of the human interferon-beta enhanceosome [7], shown in Fig. 1. The structure displays the remarkably intricate shape that a regulatory DNA segment adopts to serve as a binding site for a plethora of eukaryotic transcription factors. This reality has been demonstrated [8] to be present on a broad scale across the structures deposited in the Protein Data Bank (PDB) [9].
On the other hand, this century has witnessed remarkable advancements in the field of Artificial Intelligence (AI) and data-driven models. Neural network models, which can be thought of as universal non-linear function approximators, have been at the forefront of this journey. Although neural networks have conceptually existed for a while [10], the late 20th century saw significant steps forward in being able to train them at scale [11, 12, 13]. Clever incorporation of different kinds of inductive biases soon led to a wave of applications across different domains of science and technology [14, 15, 16, 17, 18], and within the last decade, "Deep Learning" [19] has become a popular phrase in all domains of science. The domain of biology is no exception. Deep learning driven genome sequence analysis [20] is one of the standard biological applications of AI. Large strides have been made towards interpretability methods [21], which are essential in biological applications. In structural biology, significant steps forward have been made with the help of AI, in terms of structure determination [22], prediction [23, 24, 25, 26, 27], and analysis [28, 29, 30, 31].
The COVID-19 pandemic also inspired a lot of structural biology research [32, 33, 34] and brought popular interest and funding to this domain.
Figure 1. Structure of the human interferon-beta enhanceosome as modeled and published by Panne, Maniatis, and Harrison [7]. The protein chains are shown in a surface representation, one color for each unique monomer. The DNA backbone is shown as a ribbon (wheat color). AT pairs are shown in light green while GC pairs are shown in light blue. The view was chosen to emphasize the intricate DNA shape observed in the complex. The DNA has been extended slightly on both ends.
This progress in structural biology research would not have been possible without the highly curated public-domain data deposited and preserved in the PDB [9, 35]. Extensive study of the structures of the SARS-CoV-2 nucleocapsid proteins and their RNA-binding domains [34, 36] was crucial for the vaccine development process and ultimately fending off the pandemic. Structural studies of the tumor suppressor protein p53 [37, 38, 39, 40], which binds to specific DNA sequences as a tetramer, have been important for the understanding of cancer dynamics and developing drugs to treat cancer patients. Recently, cooperativity of transcription factors binding to DNA has been shown to influence human facial phenotype [41]. Structural models are essential for studying cooperativity of transcription factors [42]. Recent advances in gene editing technology have also benefited from solving and studying structural models of proteins bound to DNA and RNA [43].
Such advances in biological sciences, therapeutics, and AI-based structure prediction tools are possible mainly because of the ever-expanding repertoire of protein-nucleic acid structures in the PDB. To shed some light on this, I present a summary of protein-DNA structures in the PDB, comparing structures released on or before 2000 with those released after 2000 (Table 1). These data show significant growth over the past 20 years, both in the count and in the variety of unique protein sequences and organisms. Individual families of DNA-binding domains are represented in higher numbers. The availability of such well-curated data, ready to be used, has been a crucial driver of scientific progress, and especially of my own research.
Inspired by such advances, my primary focus has been deciphering protein-DNA interactions with data-driven deep learning methods. Proteins bind to DNA to perform essential cellular functions such as gene regulation, genome organization, and genome repair. Owing to the advent of AlphaFold2 [23], a plethora of hitherto unknown protein structures could be predicted for a multitude of protein sequences across organisms [44]. Deciphering the possible nucleic acid binding function of these proteins, based on the predicted structures, is crucial. This involves being able to analyze a predicted protein structure to find out whether it contains a DNA/RNA binding site, which can lead to further downstream tasks of figuring out potentially new functions a protein performs. In Chapter 1, I present my work on the segmentation of protein surfaces into nucleic acid binding and non-binding regions. This chapter describes a graph neural network layer which enables smooth prediction of binding site labels on the protein surface. We start from a model for nucleic acid binding site prediction (PNAbind) developed in the Rohs lab and improve upon it by designing a neural network layer using mean field VI over a continuous Conditional Random Field model, which makes the predicted binding sites smoother while improving prediction metrics. We also design a smoothness metric addressing the label imbalance problem associated with the task. This work was later published as part of the PNAbind package [45].
Entries                                         on or before 2000    post 2000
All entries                                     464                  8320
Unique UniProt accessions represented           194                  3422
Unique source organisms represented             62                   696
Structures with source organism Human           158                  3007
Entries containing PFAM domains:
  Homeodomains                                  18                   111
  ZF-C2H2                                       14                   103
  Basic helix-loop-helix (bHLH)                 2                    57
  Basic leucine zipper (bZIP)                   4                    10
  Forkhead                                      1                    45
  Erythroblast transformation specific (ETS)    7                    76
  p53                                           6                    40
Table 1. Quantification of growth of protein-DNA structures in the PDB (based on pre-2000 and post-2000 release dates). Based on PDB entries as of October 07, 2024 and PDB-PFAM database mapping as of September 28, 2024.
In Chapter 2, I describe DeepPBS (Deep Predictor of Binding Specificity), a first-of-its-kind geometric deep learning method developed to predict the DNA binding specificity of proteins based on a given protein-DNA complex. DeepPBS acts as a bridge between structure-determining experiments (which show mechanism but not sequence diversity) and specificity-determining experiments (which reflect sequence diversity but not mechanism). DeepPBS is applicable to experimentally determined, simulated, predicted or designed complexes, resulting in broad impact in the domain. This work has been recently published [46].
In Chapter 3, we design and showcase the RNAscape algorithm and webserver, a geometric mapping method of RNA 3D structures to 2D which attempts to preserve the three-dimensional topology (unlike common secondary structure based visualization methods). RNAscape significantly improves over existing competitors [47] in terms of mapping quality, visualization and customizability. RNAscape has been published [48] and featured on the cover of the 50th anniversary webserver issue of Nucleic Acids Research.
Chapter 4 describes an updated DNAproDB database [49, 50]. Through this update we introduce both technical advances and an expansion of the features included in the analysis. DNAproDB is now automatically updated weekly with newly released structures and thereby will remain up to date as new DNA–protein structures are solved. We also include much larger complexes, expand external annotations and upload/download formats, and improve the user experience through a reorganization of the web interface and more visualization options and controls. We added the annotation of water-mediated hydrogen bonds as a new feature. At the same time, we recognized the lack of a comprehensive analysis and exploration tool for RNA/protein-RNA structures. Inspired by RNAscape and DNAproDB, we developed RNAproDB, which is described in Chapter 5. RNAproDB is a modern, highly interactive structure exploration tool tailored for the complexity and structural variance of RNA structures. This is achieved by an intricate interplay of a 3D viewer, interface explorer, sequence viewer, secondary structure selector and tabular data, making it the most versatile tool for analyzing and exploring protein-NA complexes. With the advent of complex structure prediction methods like AlphaFold3, we expect RNAproDB to serve a crucial role in analyzing predicted structures and advancing the understanding of cellular biology. In Chapter 6, I present a tangential research work on autoencoding gene expression time-series data, to learn a regularized latent space representation and a generative process. RVAgene is primarily a visualization tool suitable for biological knowledge discovery, while also being suitable for de novo data generation and denoising, and a more efficient alternative to hierarchical Gaussian process based methods [51] for clustering such data. We analyze one synthetic and two real datasets and demonstrate various properties and aspects of the model and its potential for unsupervised discovery. In particular, RVAgene identifies new programs of shared gene regulation of Lox family genes in response to kidney injury. This project originated as part of my rotation in the MacLean lab at USC QCB, and I am thankful for Adam's and Remo's support that enabled me to publish this work [52]. We conclude this thesis by discussing the current state of the field of structural biology of protein-nucleic acid complexes and future possibilities.
Chapter 1
A probabilistic model for smooth binding site label prediction on
protein surfaces
Abstract
Predicting DNA/RNA binding sites on a given protein surface is an important computational task, since experimentally determining such information is often expensive and time consuming. This binding site prediction task can be formulated as a node classification task over a 3D mesh representing the protein surface, with features over the vertices and edges of the mesh representing various geometrical and physicochemical features of the protein structure. We developed a deep learning based method, PNAbind, which classifies mesh vertices as binding and non-binding sites. This often results in irregular binding site predictions over the protein surface. Intuitively, however, binding site predictions should be contiguous rather than patchy, i.e., as "smooth" as possible while remaining correct. In this chapter, we describe what such "smoothness" entails and improve upon the original architecture by designing a network layer based on a probabilistic Continuous Conditional Random Field (CCRF) model, which increases the smoothness of binding site predictions while improving the prediction accuracy of the model. This network layer implementation has been published as part of the PNAbind package.
1.1 Introduction
Predicting nucleic acid binding sites on protein structures is an important computational task. There are many existing methods which attempt to solve this problem, either at the sequence level or at the 3D structure level, based on either sequence features or structural features of proteins [53, 54, 55, 56]. We represent protein surfaces as 3D triangulated meshes, with vertices and edges carrying features that represent different geometrical and physicochemical aspects of the protein structure. We can then learn a model which classifies each vertex of such a mesh as either a binding site or a non-binding site. With recent advances in 3D deep learning, Graph Convolutional Networks (GCNs) have emerged as a useful approach for learning higher-level features over 3D mesh objects. We developed a GCN-based deep learning model for the binding site classification task: PNAbind. Fig. 1.1A shows a schematic diagram of the task PNAbind achieves.
The standard neural network approach is to predict each binding site label independently through the output layer. However, classification tasks often come with additional non-trivialities which are not addressable under an independence assumption. For example, the image segmentation task involves classifying every pixel of an image (which can be thought of as a grid of pixels) into some class, and the independence assumption often gives poor, scattered results. In that case, some form of conditional label (re)assignment is necessary for each pixel, considering its neighboring pixels. The most common way to achieve this in the field of computer vision is by optimizing a conditional random field (CRF) model over a set of initially learned labels. For example, Fig. 1.1B, adapted from Krähenbühl and Koltun [57], shows how an implementation based on a CRF can improve image segmentation results. In our case, we also expect certain properties from a predicted binding site region on a protein surface. A binding site region on a protein surface is a combination of multiple mesh vertices predicted as binding sites. We generally expect such a region to be "smooth", i.e., not to randomly have misclassified points scattered around. This idea is visually illustrated in Fig. 1.1C. So, we need to perform some kind of conditional (re)assignment of the class labels generated by PNAbind.
In addition to computer vision, CRF models are also heavily used in Natural Language Processing (NLP) [58, 59, 60]. However, in both computer vision and NLP, the underlying graph structure over which the CRF is defined is very sparse and simple, e.g., a linear chain or tree for NLP and a 2D grid for computer vision. This makes it easy to find some form of optimization scheme over the discrete domain of labels. This is not the case for a general underlying graph, where the combinatorics of the possible label assignments over all vertices is huge and the discrete domain of labels provides no gradient information for efficient optimization.

Krähenbühl and Koltun [57] propose an efficient inference algorithm for fully connected CRFs based on mean field variational inference and high-dimensional filtering using a permutohedral lattice approximation. However, the high-dimensional filtering step is not compatible with the SIMD [61] paradigm of GPU computation [62], making it impossible to use on large datasets and non-trivial graphs. Teichmann and Cipolla [62] improve upon this by implementing convolutional CRFs, another kind of approximation to full CRF inference which is efficient and compatible with GPUs; however, their method is only applicable to 2D grids, i.e., image data. Thus, none of these methods are directly applicable to our case. One of the more interesting advances in applying a CRF layer to mesh objects comes from Kalogerakis, Hertzmann, and Singh [63], using the alpha-expansion graph-cuts method proposed by Boykov, Veksler, and Zabih [64]. However, being a pre-CNN-revolution work, this method is not suitable for deep learning and is not readily integrable into our PNAbind model.
Overall, optimizing a CRF model over the predicted labels of PNAbind in the discrete domain to reassign them turns out to be computationally expensive. However, instead of trying to smooth the predicted labels after the fact, we can directly try to make the representations that produce these labels smoother. This means we need an operation before the final fully connected layer of PNAbind which smooths the information over different vertices, taking into account their neighbors' information along with their own. The advantage of this idea is that the optimization domain is no longer discrete.
Such an operation is based on a continuous variant of a CRF, or CCRF. Ristovski, Radosavljevic, Vucetic, and Obradovic [65] show how to calculate the updates for such a model using mean field variational inference, although not in a neural network context. Gao, Pei, and Huang [66] introduce this approach for graph convolution based networks; however, the network layer update equation presented in their work results in simply an identity transformation. We calculated the correct form of the network layer, which can be used as the second-to-last layer of PNAbind.

In the next sections, we formally describe a metric denoting the "smoothness" of binding site predictions over protein meshes, the CCRF model and the CCRF layer's architecture, followed by results showing how it significantly improves PNAbind for the binding site classification task.
1.2 Methods
1.2.1 A suitable smoothness metric for binding site prediction
There has been some work on the smoothness of labels or features over graphs, using some form of measure of local variation around each vertex as a smoothness metric [67, 68, 69]. However, the smoothness metrics used by most of these works are not suitable for imbalanced classes. In our case, since binding site regions generally cover only a small portion of the whole protein surface, the classes are extremely imbalanced. This means that metrics not addressing class imbalance all give extremely high smoothness values, even for meshes that we would consider quite "unsmooth" due to lots of patchy, scattered predictions.
Given a graph G = (V, E) and label assignments l such that l_v ∈ {0, 1, ..., k−1} for all v ∈ V, an example of a smoothness metric not considering class imbalance is as follows:

\[ M_1 = \frac{1}{|E|} \sum_{(u,v)\in E} \mathbb{I}[l_u = l_v] \tag{1.1} \]
Figure 1.1. PNAbind schematic, example, and conceptual explanation of the CRF application scenario. (A) Schematic diagram showing how PNAbind predicts binding sites over a protein surface represented as a mesh. (B) Krähenbühl and Koltun [57] show how applying a fully connected CRF model yields better image segmentation results compared to unary classification alone. (C) (above) An example independent class assignment in a 2D grid of cells, which is irregular; (below) a smoother classification, which we would expect to obtain as the result of a conditional assignment process.
The metric M_1 presented above has a value of 1 if all labels belong to one class and a value around 0.5 when the labels are assigned completely at random. This already limits the range of the metric. Moreover, in practice we almost never see our predicted labels over protein meshes go below a value of 0.9, which makes the metric heavily skewed and makes any judgement of improvement difficult. Other weighted versions of this metric are possible; for example, an inverse edge-length weighted version we attempted looks as follows:

\[ M_2 = 1 - \frac{1}{\sum_{(u,v)\in E} \frac{1}{d(u,v)}} \sum_{(u,v)\in E} \frac{1}{d(u,v)}\, \mathbb{I}[l_u \neq l_v] \tag{1.2} \]

where d(u, v) is the distance between vertices u and v, i.e., the edge length of (u, v). However, this does not help much, because the variance of the edge-length distribution is very low for our protein meshes, so it behaves similarly to M_1.
Finally, we designed a metric which takes care of the issue of label imbalance (and the resulting skewed smoothness values). Assuming there are k possible classes, we compute a measure of smoothness for the labels corresponding to each class separately and then take a weighted combination of them. Let V_i ⊂ V be the set of vertices with label i ∈ {0, 1, ..., k−1}, and let V_i^{in} ⊂ V_i be the set of vertices within V_i all of whose neighbors also have class label i. This makes V_i^{out} = V_i − V_i^{in} the set of vertices in V_i which have at least one neighbor with a class label other than i. In essence, V_i^{in} is analogous to the area and V_i^{out} is analogous to the perimeter of the regions with class label i. Hence, a measure of smoothness for class i is simply:

\[ S_i = \frac{|V_i^{in}|}{|V_i|} = 1 - \frac{|V_i^{out}|}{|V_i|} \tag{1.3} \]

Now, we can take a weighted combination of these class-specific metrics to constitute a balanced smoothness metric:

\[ S = \frac{\sum_{i=0}^{k-1} \frac{1}{|V_i|}\, S_i}{\sum_{i=0}^{k-1} \frac{1}{|V_i|}} = \frac{\sum_{i=0}^{k-1} \frac{|V_i^{in}|}{|V_i|^2}}{\sum_{i=0}^{k-1} \frac{1}{|V_i|}} \tag{1.4} \]

This metric is appealing in the sense that it gives a broader range of values and hence is more useful in our case. In Fig. 1.2 we show how the metrics M_1, M_2 and S behave when applied to a set of 15 different proteins with labels predicted by PNAbind. This clearly demonstrates the advantage of S over the other two metrics.
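As an illustration, a minimal NumPy sketch of Eqs. 1.1 and 1.4 is given below, assuming the mesh is provided as an undirected edge list of vertex indices; the function and variable names are illustrative and are not taken from the PNAbind codebase.

```python
import numpy as np

def smoothness_m1(labels, edges):
    """Eq. 1.1: fraction of edges whose endpoints share the same label."""
    u, v = edges[:, 0], edges[:, 1]
    return np.mean(labels[u] == labels[v])

def smoothness_s(labels, edges, k=2):
    """Eq. 1.4: class-balanced smoothness over a labeled mesh graph."""
    n = labels.shape[0]
    u, v = edges[:, 0], edges[:, 1]

    # a vertex is a "boundary" vertex if any incident edge crosses a class
    # boundary, i.e. it belongs to V_i^out for its own class i
    boundary = np.zeros(n, dtype=bool)
    mismatch = labels[u] != labels[v]
    boundary[u[mismatch]] = True
    boundary[v[mismatch]] = True

    num, den = 0.0, 0.0
    for i in range(k):
        in_class = labels == i
        size = in_class.sum()
        if size == 0:
            continue  # class absent from this mesh; skip its term
        S_i = np.mean(~boundary[in_class])  # |V_i^in| / |V_i|  (Eq. 1.3)
        num += S_i / size                   # weight each class by 1/|V_i|
        den += 1.0 / size
    return num / den

# toy usage: 6 vertices on a path graph, one misclassified vertex
labels = np.array([0, 0, 1, 0, 0, 0])
edges = np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]])
print(smoothness_m1(labels, edges), smoothness_s(labels, edges))
```

In this toy example M_1 is 0.6, whereas S drops to 0.1, reflecting how strongly the single misclassified vertex of the minority class penalizes the balanced metric.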
1.2.2 Continuous Conditional Random Fields layer
Given a graph G = (V, E), features over the graph vertices x such that x_v ∈ R^d for all v ∈ V, and labels over the vertices y such that y_v ∈ {0, 1, ..., k} for all v ∈ V, a CRF aims to maximize the following probability to reassign the labels y:

\[ P(y_i \mid x, y_{j\neq i}) = \frac{1}{Z(x_i)} \exp\Big[ -\Big( \psi_1(x_i, y_i) + \sum_{j\in\mathcal{N}(i)} \psi_2(y_i, y_j, x_i, x_j) \Big) \Big] \tag{1.5} \]

The terms ψ_1 and ψ_2 in the above equation are known as the unary and binary potentials, respectively. ψ_1 models how much the label y_i conforms to the features x_i, and ψ_2 models how much the label y_i conforms to the labels and features of the neighbors of vertex i. The exact forms of these terms can be application dependent and designed accordingly. However, optimizing Eq. 1.5 in a general scenario is non-trivial, and approximations either do not perform well or are heavily resource consuming. Moreover, this kind of optimization over a discrete domain is not suitable for a neural network application.

In our case, PNAbind is a deep learning model, so we would like to implement this conditional random field model as a network layer which can simply be plugged into the network and optimized by backpropagation along with the whole network, without requiring any extra optimization. Following the approach described in Ristovski, Radosavljevic, Vucetic, and Obradovic [65] and Gao, Pei, and Huang [66], we use a continuous CRF model.
Figure 1.2. The three smoothness metrics M_1, M_2 and S evaluated on 15 validation predictions on the PDNA-74 dataset. S clearly behaves much better, showing a large range of values compared to the rest.
Since we want to implement it as a network layer, we define our terminology as follows. Let B_i ∈ R^d for all i ∈ V be the output of a neural network layer and the input to the CCRF layer, and let H_i ∈ R^d for all i ∈ V be the output of the CCRF layer. In this scenario, we can think of B as analogous to x in Eq. 1.5, and, before any optimization, H = H^0 = B is analogous to y in Eq. 1.5. H then gets reassigned its final output value after the layer performs its operation. Now all variables are in the continuous domain, and the model looks as follows:

\[ P(H_i \mid B) = \frac{1}{Z(B)} \exp\Big[ -\Big( \psi_1(H_i, B_i) + \sum_{j\in\mathcal{N}(i)} \psi_2(H_i, H_j, B_i, B_j) \Big) \Big] \tag{1.6} \]

Now we design ψ_1 and ψ_2, which is straightforward in this case. We want the output H_i to preserve its identity as much as possible, i.e., to be as close to B_i as possible, which makes a squared error with respect to B_i a reasonable choice for ψ_1. However, we also want H_i to not be too different from its neighbors, and we would like to weight different neighbors of i differently according to their initial similarity with vertex i. Finally, the two potential terms need to be weighted by some optimizable constants determined by the network during training. This motivates the following choices of the functions:

\[ \psi_1(H_i, B_i) = \alpha \lVert H_i - B_i \rVert_2^2, \qquad \psi_2(H_i, H_j, B_i, B_j) = \beta\, g_{ij} \lVert H_i - H_j \rVert_2^2, \qquad \alpha, \beta > 0 \]
\[ \text{A simple and effective choice:} \quad g_{ij} = \exp\left( \frac{B_i^{T} B_j}{\sigma^2\, \lVert B_i \rVert_2 \lVert B_j \rVert_2} \right) \tag{1.7} \]

σ in the above equation can be left to the network to optimize. There can be other choices of g_{ij} as well, e.g., it can be parametrized as a separate neural network itself. This makes the final form of our CCRF model as follows:

\[ P(H_i \mid B) = \frac{1}{Z(B)} \exp\Big[ -\Big( \lVert H_i - B_i \rVert_2^2 + \sum_{j\in\mathcal{N}(i)} g_{ij} \lVert H_i - H_j \rVert_2^2 \Big) \Big] \tag{1.8} \]
Now we need to optimize Eq. 1.8 mathematically to obtain an update algorithm that the network layer can execute to compute the final result Ĥ. For this, we turn to Bayesian variational inference. In this framework, we approximate the posterior distribution P(H_i | B), which may not have a tractable form, using a candidate distribution Q(H) which has a suitable form that we can work with. We do so by adopting the standard mean field variational inference (VI) assumption, where the full posterior distribution is expressed as a product of independent marginal distributions over the vertices, i.e., Q(H) = ∏_{i∈V} Q_i(H_i). We can then minimize the KL divergence between the candidate and target distributions in the standard VI manner:

\[ Q^* = \arg\min_{Q\in\mathcal{Q}} \mathrm{KL}\big(Q(H)\,\|\,P(H\mid B)\big) \tag{1.9} \]

Following the mean field VI calculation presented in [70], we can now calculate the optimal marginal distribution of H_i:

\[ \ln Q_i^*(H_i) = \mathbb{E}_{j\neq i}\big[\ln P(H_i \mid B)\big] + \text{const} \tag{1.10} \]

The output of the CCRF layer is then simply the Ĥ_i that maximizes Q_i^*(H_i) for all i ∈ V. Our detailed calculations are presented in the Appendices. In essence, this constitutes a system of equations over the vertices, which can be solved by iteratively updating the vertices until convergence is reached. For our purposes, we stop the iterations at a set maximum iteration number T. Algorithm 1 presents the final form of the CCRF layer as used in PNAbind. The network optimizes the parameters α, β and σ. Intuitively, if the input vertex information is already smooth and no update is necessary, then β goes towards 0; otherwise, some nonzero value of β results in a balance between preserving a vertex's own features and conforming to the features of its neighbors. In the following section, we show that using this layer as the second-to-last layer of PNAbind significantly improves the smoothness of the predicted labels over protein meshes while improving upon, or at least preserving, the desired classification metrics. Note that this is important, because a naive way to increase the smoothness of a prediction is to predict the same class for all vertices, but that would dramatically decrease the performance metrics of the prediction. Fig. 1.3A,B schematically describes the usage context and layer details of the CCRF layer as described in this section.
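For intuition, the fixed-point update used in Algorithm 1 below can be recovered directly from the quadratic potentials: under the mean-field factorization, the expectation in Eq. 1.10 is quadratic in H_i, so Q_i^* is maximized where the gradient of the expected energy vanishes. The sketch below makes the α, β weights of Eq. 1.7 explicit; the complete calculation is presented in the Appendices.

\begin{align*}
0 &= \nabla_{H_i}\,\mathbb{E}_{j\neq i}\Big[\alpha \lVert H_i - B_i\rVert_2^2 + \beta \sum_{j\in\mathcal{N}(i)} g_{ij}\,\lVert H_i - H_j\rVert_2^2\Big] \\
  &= 2\alpha\,(H_i - B_i) + 2\beta \sum_{j\in\mathcal{N}(i)} g_{ij}\,\big(H_i - \mathbb{E}[H_j]\big) \\
\Rightarrow\ H_i &= \frac{\alpha\,B_i + \beta \sum_{j\in\mathcal{N}(i)} g_{ij}\,\mathbb{E}[H_j]}{\alpha + \beta \sum_{j\in\mathcal{N}(i)} g_{ij}},
\end{align*}

which matches steps 4–5 of Algorithm 1 once the expectations E[H_j] are replaced by the current iterates H_j^t.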
Algorithm 1: CCRF Layer
Input: B_i ∀i, E (adjacency information)
1: Initialize H_i^0 = B_i ∀i    // H_i^0 maximizes Q_i^0 = (1/Z_i^0) exp(−c‖H_i^0 − B_i‖²)
2: for t = 0, 1, 2, ..., T−1 do    // T signifies convergence
3:     compute ( Σ_{j∈N(i)} g_{ij} H_j^t ,  Σ_{j∈N(i)} g_{ij} )    // message passing
4:     H_i^{t'} = α B_i + β Σ_{j∈N(i)} g_{ij} H_j^t
5:     H_i^{t+1} = H_i^{t'} / ( α + β Σ_{j∈N(i)} g_{ij} )
6: end for
7: H_i^* = H_i^T
8: return H_i^*
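For illustration, a minimal PyTorch sketch of Algorithm 1 is given below. This is not the published PNAbind implementation; the module name, the use of a softplus to keep α, β and σ positive, and the fixed iteration count T are assumptions made for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CCRFLayer(nn.Module):
    """Mean-field CCRF updates (Algorithm 1) over mesh vertex features.

    B: (N, d) vertex features from the previous network layer.
    edge_index: (2, E) directed edges; include both directions for an
    undirected mesh so that every neighbor contributes a message.
    """
    def __init__(self, num_iters: int = 10):
        super().__init__()
        self.num_iters = num_iters
        # unconstrained parameters; softplus keeps alpha, beta, sigma > 0
        self.raw_alpha = nn.Parameter(torch.zeros(1))
        self.raw_beta = nn.Parameter(torch.zeros(1))
        self.raw_sigma = nn.Parameter(torch.zeros(1))

    def forward(self, B: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        alpha = F.softplus(self.raw_alpha)
        beta = F.softplus(self.raw_beta)
        sigma = F.softplus(self.raw_sigma) + 1e-6
        src, dst = edge_index[0], edge_index[1]

        # similarity kernel g_ij (Eq. 1.7), computed once from the inputs B
        Bn = F.normalize(B, dim=-1)
        g = torch.exp((Bn[src] * Bn[dst]).sum(dim=-1) / sigma ** 2)  # (E,)

        # per-vertex normalizer: alpha + beta * sum_j g_ij
        g_sum = torch.zeros(B.size(0), device=B.device, dtype=B.dtype)
        g_sum.index_add_(0, dst, g)
        denom = (alpha + beta * g_sum).unsqueeze(-1)

        H = B
        for _ in range(self.num_iters):
            # message passing: sum_j g_ij * H_j, accumulated at vertex i
            msg = torch.zeros_like(B).index_add_(0, dst, g.unsqueeze(-1) * H[src])
            H = (alpha * B + beta * msg) / denom  # steps 4-5 of Algorithm 1
        return H

# usage sketch: H = CCRFLayer()(vertex_features, mesh_edge_index)
```

Because g_ij is computed once from the layer input B and the update is a simple weighted average, each iteration is a standard sparse message-passing step and remains fully differentiable, so the whole layer trains by backpropagation along with the rest of the network.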
Figure 1.3. CCRF for smooth binding site label prediction over a protein surface. (A) Schematic description of CCRF layer usage. (B) Details of the CCRF layer. (C) Example effect of applying the CCRF layer for a validation protein from the PDNA-74 dataset. (D) Effect of using the CCRF layer on smoothness and balanced accuracy of validation predictions over three datasets. (E) Average standard deviation of classification metrics for validation predictions in a 5-fold cross-validation setting.
1.3 Results
We compare binding site prediction results between two PNAbind networks, one without a CCRF layer (noCRF) and one with a CCRF layer as its second-to-last layer (CRF). In both cases we trained and validated the networks on three different datasets. The datasets used are as follows:
PDNA-62: Ahmad, Gromiha, and Sarai [71] constructed a non-redundant dataset of 62 protein–DNA complexes which has been used in a variety of other studies [55, 72]. The protein sequences used were filtered to ensure a maximum identity of no more than 25% between any two sequences, and the resolution of the chosen structures was 2.5 Å or better. The structures in this dataset contain only helical B-form DNA.
PDNA-74: We constructed a dataset of 74 single-stranded DNA (ssDNA) binding proteins bound to target ssDNA. We first used the structural database DNAproDB [73, 74] to identify 374 protein-ssDNA complexes based on structural criteria, which included ensuring that the bound DNA in the structure presented single-stranded secondary structure, a minimum length of 4 nucleotides per DNA strand and 40 residues per protein chain, and a minimum of 5 nucleotide-residue interactions (as defined by DNAproDB). Next, we verified that all identified proteins had known ssDNA binding function based on annotations from the Gene Ontology knowledgebase [75]. Finally, all protein sequences were clustered with a 70% sequence identity threshold using CD-HIT [76]. These clusters were then randomly sampled, with up to three samples per cluster, to generate the final set of 74 protein structures. This sampling method allows us to construct a dataset with a limited amount of sequence redundancy but more conformational sampling than would be possible with a stricter requirement on sequence redundancy. This is useful in the case of ssDNA, where the polymer is very flexible but structural data are limited.
PDNA-224: a non-redundant dataset of 224 protein-DNA complexes originally constructed by Li, Li, Liu, Fan, Zuo, and Peng [56].
RB198: a protein-RNA binding dataset consisting of 198 RNA-binding protein chains [77].
Both the CRF and noCRF models were trained on PDNA-62, PDNA-74 and PDNA-224, with a 4:1 training and validation set split. Fig. 1.3C shows one particular example of the effect of using the CRF layer against the noCRF model for a protein (PDB ID: 1bj6) in the validation set for PDNA-74. The top left panel in Fig. 1.3C shows the ground truth data. We can clearly see how the CRF model improves the smoothness of the prediction along with various other classification metrics. Fig. 1.3D shows the smoothness (Eq. 1.4) and balanced accuracy metrics achieved on the validation set in each case. We can clearly see that the smoothness of the predictions has increased significantly. It should also be noted that the median balanced accuracy has also increased in all three cases. Therefore, we can conclude that applying the CCRF layer improves the smoothness of the predicted labels over mesh vertices without compromising prediction accuracy.

We also performed 5-fold cross-validation on the PDNA-224, PDNA-74 and RB198 datasets. While analyzing these results we discovered another interesting effect of the CCRF layer. Fig. 1.3E shows the average standard deviation of validation predictions across the 5 folds for various classification metrics. We can see that the CRF models consistently result in lower standard deviations for these metrics. This result hints that the CCRF layer has a regularizing effect, making the model more consistent.
1.4 Discussion
In this chapter, we applied mean field Bayesian variational inference to design a network layer for PNAbind which results in smoother binding site predictions on protein surfaces. We also designed a smoothness metric appropriate for the task of protein surface segmentation.

In the PNAbind framework, the model predicts whether a protein binds nucleic acids or not, and segments the protein surface into binding and non-binding regions. However, it does not try to predict binding specificity (i.e., which nucleic acid sequence is preferred) and is only based on protein surface and physicochemical features. We assume there are some commonalities between the proteins binding to, say, ssDNA (for PDNA-74), and this model implicitly learns those commonalities.
The PNAbind package has been published as joint work led by Jared Sagendorf, who mentored me in this project, with contributions from Jiawei Huang and Prof. Xiaojiang Chen, and supervised by Prof. Remo Rohs [45]. In recent years, multiple works that compete with PNAbind have been published [29, 30, 78, 79, 80, 81, 82, 83]. However, no deep learning method that predicts binding specificity across protein families has been achieved yet.
As a next step, we work towards predicting binding specificity. One of the key challenges
in this problem setting is data sparsity. In the next chapter, we present DeepPBS, a model for
protein-DNA binding specificity prediction, based on a given co-crystal structural model, which
works across protein families.
Chapter 2
Geometric deep learning of protein–DNA binding specificity
Abstract
Predicting protein–DNA binding specificity is a challenging yet essential task for understanding gene regulation. Protein–DNA complexes usually exhibit binding to a selected DNA
target site, whereas a protein binds, with varying degrees of binding specificity, to a wide
range of DNA sequences. This information is not directly accessible in a single structure. Here,
to access this information, we present Deep Predictor of Binding Specificity (DeepPBS), a
geometric deep-learning model designed to predict binding specificity from protein–DNA
structure. DeepPBS can be applied to experimental or predicted structures. Interpretable protein heavy atom importance scores for interface residues can be extracted. When aggregated at
the protein residue level, these scores are validated through mutagenesis experiments. Applied
to designed proteins targeting specific DNA sequences, DeepPBS was demonstrated to predict
experimentally measured binding specificity. DeepPBS offers a foundation for machine-aided
studies that advance our understanding of molecular interactions and guide experimental
designs and synthetic biology.
2.1 Introduction
Transcription factors play critical roles in various regulatory functions that are essential to all
aspects of life [84]. Therefore, understanding the mechanisms by which proteins target specific
DNA sequences is crucial [85]. Extensive research has uncovered myriad binding mechanisms
that lead to specific high-affinity binding, including strong electrostatic interaction of arginine
residues in the DNA minor groove [8], deoxyribose sugar-phenylalanine stacking [86], bidentate
hydrogen bonds (H-bonds) between guanine (G) and arginine (Arg) in the major groove [87], and
other interactions [88, 89, 90].
Protein-DNA structures are typically [91] obtained through X-ray crystallography, nuclear
magnetic resonance spectroscopy or cryo-electron microscopy experiments and stored in the
Protein Data Bank (PDB) [9]. Generally, these structures display one bound DNA sequence and the
associated physicochemical interactions [88] but do not encompass the full range of potentially
bound DNA sequences. Conversely, this information can be experimentally obtained through
protein-binding microarray [92], systematic evolution of ligands by exponential enrichment
combined with high-throughput sequencing (SELEX-seq) [93], chromatin immunoprecipitation
followed by sequencing [94], high-throughput SELEX [95] or related high-throughput approaches
[96]. These experiments capture the range of possible bound DNA sequences but do not necessarily
provide structural information. In essence, these sets of experiments are complementary, and
manual examination is often required to correlate molecular interaction details from structural
data with binding specificity data [8].
Predicting binding specificity for a given protein sequence, across protein families, remains a
challenging and unsolved problem, despite progress for specific protein families [97, 98, 99, 100, 101,
102, 103, 104]. Structural changes in the context of binding, along with large mechanistic diversity,
contribute to the difficulty [96, 105]. Protein-DNA structures contain valuable information, which
has been used to develop models of specificity tested on small datasets [106]. Different chemical groups of amino acid side chains have different propensities of interaction with different DNA moieties. One such example is shown in Fig. 2.1a: preferences of the guanidine group and the alkane part of arginine side chains, and of the backbone. Similar trends can be observed in the functional group [105] interaction propensities (Fig. 2.1b). Further examples of such preferences are presented in supplementary figures (Lysine: Fig. S25, Histidine: Fig. S26, Asparagine: Fig. S27, Serine: Fig. S28). Artificial intelligence can leverage this information to broadly achieve
generalizability across protein families. In this framework, we introduce Deep Predictor of Binding
Specificity (DeepPBS). This deep-learning model is designed to capture the physicochemical and
geometric contexts of protein-DNA interactions to predict binding specificity, represented as a
position weight matrix (PWM) [107] based on a given protein-DNA structure (Fig. 2.2a). DeepPBS
functions across protein families (Fig. 2.2) and acts as a bridge between structure-determining and
binding specificity-determining experiments.
The input of DeepPBS is not limited to experimental structures (Fig. 2.2a). The rapid advancement of protein structure prediction methods, including AlphaFold [23], OpenFold [108] and RoseTTAFold [25], along with protein-DNA complex modelers, such as RoseTTAFoldNA (RFNA) [24], RoseTTAFold All-Atom [26], MELD-DNA [109] and AlphaFold3 [27], has led to an exponential increase in the availability of structural data for analysis. This scenario highlights the growing
need for a generalized computational model to analyze protein-DNA structures. We demonstrate
how DeepPBS can work in conjunction with structure prediction methods for predicting specificity
for proteins without available experimental structures (Fig. 2.4a-d). In addition, the design of a
protein-DNA complex can be improved by optimizing bound DNA using DeepPBS feedback (Fig.
2.4e-g). We show that this pipeline is competitive with the recent family-specific model rCLAMPS
[98] (Fig. 2.4h,i) while being more generalizable: specifically, DeepPBS is protein family-agnostic,
can handle biological assemblies and can predict DNA flanking preferences.
In terms of interpretability, ‘relative importance’ (RI) scores for different heavy atoms in
proteins that are involved in interactions with DNA can be extracted from DeepPBS (Fig. 2.5). As
a case study on an important protein for cancer development, we analyze the p53-DNA interface
via these RI scores and relate them with existing literature for validation. Additionally, we show
that the DeepPBS scores align well with existing knowledge and can be aggregated to produce
reasonable agreement with alanine scanning mutagenesis experiments [110] (Fig. 2.5h).
In additional proof-of-principle studies, we apply DeepPBS to in silico-designed protein-DNA
complexes targeting specific DNA sequences (Fig. 2.6), obtained from a recent study that combines
structural design with DNA mutagenesis experiments [111]. Finally, we show that DeepPBS
can also be used to analyze molecular simulation trajectories. We demonstrate an example by
applying DeepPBS to a molecular dynamics (MD) simulation of Extradenticle (Exd) and Sex
combs reduced (Scr) Hox heterodimer in complex with DNA [112] with an AlphaFold-based
modeled protein linker (Methods, Supplementary Fig. S6). DeepPBS is available as a webserver at https://deeppbs.usc.edu. This work was led by me, with contributions from Dr. Jinsen Li,
Dr. Jared M. Sagendorf, Yibei Jiang, Ari S. Cohen, Dr. Tsu-Pei Chiu, Dr. Cameron Glasscock and
supervised by Prof. Remo Rohs.
2.2 Results
2.2.1 The DeepPBS Framework
The DeepPBS framework is illustrated in Fig. 2.2. Input to DeepPBS (Fig. 2.2a) is composed of
one protein-DNA complex structure, with one or more protein chains bound to a DNA double
helix. Potential sources for such structures include experimental data (for example, PDB[9]),
molecular simulation snapshots or designed complexes. DeepPBS processes the structure as a
bipartite graph with distinct spatial graph representations for protein and DNA components.
The protein graph is an atom-based graph, with heavy atoms as vertices. Several features are
computed on these vertices (Fig. 2.2b). Further information on protein representation and feature
computation is available in Methods. We represent DNA as a symmetrized helix (sym-helix), as
detailed in Methods. This representation removes any sequence identity that the DNA possesses,
while preserving the shape of the double helix [8]. Optionally, DNA sequence information can be
reintroduced as a feature on the sym-helix points.
DeepPBS performs a series of spatial graph convolutions on the protein graph to aggregate
atomic neighborhood information (Fig. 2.2d). The next crucial component of DeepPBS consists
of a set of bipartite geometric convolutions applied from the protein graph to the sym-helix (Fig.
Figure 2.1. Interaction propensities of components of the arginine amino acid towards various DNA moieties and functional groups. (a) Interaction propensity towards DNA bases A, C, G, T, in addition to phosphate (P) and sugar (S) moieties, categorized by major groove, minor groove and all (major groove, minor groove and DNA backbone). (b) Interaction propensity towards functional groups [105] A (H-bond acceptor), D (H-bond donor), M (methyl) and H (hydrogen), in addition to phosphate (P) and sugar (S) moieties, categorized by major groove, minor groove and all (major groove, minor groove and DNA backbone).
2.2d). Specific chemical interactions (for example, hydrogen bonds) depend on both location
and orientation [87]. DeepPBS learns how the geometric orientation of the sym-helix points is
associated with the orientations and chemistry of neighboring protein residues. Four distinct
bipartite convolutions are employed for the sym-helix points, corresponding to the major groove,
the minor groove and the phosphate and sugar moieties. Major and minor groove convolutions
are referred to as ‘groove readout’. This term was chosen over the term ‘base readout’ due to the removal of base identity in the sym-helix. Phosphate and sugar moiety convolutions, combined with DNA shape information, form the ‘shape readout’ (Fig. 2.2e). The ‘groove readout’ and ‘shape readout’ factors collaboratively determine binding specificity to varying extents for different protein families. At this point, the sym-helix representation enables a straightforward flattening
of aggregated features on the three-dimensional sym-helix to the one-dimensional (1D) base
pair-level features. By adding DNA shape information and implementing 1D convolutional neural
network and prediction layers (Fig. 2.2e), DeepPBS ultimately predicts binding specificity (Fig.
2.2f). Further architectural details are described in Supplementary Section 5.
Because no published standard dataset exists for predicting binding specificity across protein families from protein-DNA complex structure data, we built a dataset for cross-validation and benchmarking. Details of this process can be found in Methods.
2.2.2 DeepPBS performance for experimentally determined structures
The DeepPBS ensemble (Methods) was employed to evaluate model performance against a benchmark set, as outlined in Supplementary Section 1. The DeepPBS architecture allows models to be
trained on two mechanisms: ‘groove readout’, which does not involve backbone convolutions and
excludes shape information, and ‘shape readout’, which does not involve groove convolutions
(Fig. 2.2d,e). Benchmark performances of DeepPBS (which performs both ‘groove readout’ and
‘shape readout’ modes combined) and these two variations are shown in Fig. 2.3a. The ‘groove readout’ version does better than the ‘shape readout’ version in terms of median performance,
Figure 2.2. Schematic illustration of the DeepPBS framework. (a) DeepPBS input (PDB ID
2R5Y in this example) and possible input sources. (b) Protein structure (heavy atom graph, with
features computed for each vertex). (c) Symmetrization schema in base-pair frame applied to
DNA structure, resulting in a sym-helix. (d) Spatial graph convolution on the protein graph
for atom environment aggregation, followed by bipartite geometric convolutions from protein
graph vertices to sym-helix points (shown as spheres with specific colors for major groove, minor
groove, phosphate and sugar). (e) Three-dimensional sym-helix is flattened with aggregated
information (concatenated with computed shape features) into a 1D representation, followed
by 1D convolutions and regression onto base pair probabilities. (f) DeepPBS outputs binding
specificity. (g) Effect of perturbing bipartite edges involved in d can be measured in terms of
changes in the output, providing an effective measure of interpretability. Phos, phosphate; conv,
convolutions.
while the DeepPBS model improves upon either component in isolation (two-sided t-test P value
<0.01; Fig. 2.3a).
The dataset was constructed using experimentally determined structures; thus, the co-crystal structure-derived DNA sequence typically serves as a reasonable example of a bound sequence. As expected, integrating sequence information into the sym-helix points (‘DeepPBS with DNA SeqInfo’) enhanced performance (Fig. 2.3a), significantly closing the gap toward the inherent performance limit in the dataset. The inherent performance limit originates from the fact that for
the same protein the binding specificity data presented by two databases [113, 114] used to create
the dataset may disagree to some extent (Supplementary Fig. S1c). We computed the distribution
of disagreement across all unique PWMs appearing in both databases (Supplementary Section 1).
However, from both interpretability and design perspectives, particularly when the bound DNA
sequence may not be representative, the ‘DeepPBS’ model is optimal due to its low sensitivity
to the DNA sequence in the structure. This fact is evidenced by comparing performances of the
‘DeepPBS’ and ‘DeepPBS with DNA SeqInfo’ models in the context of the PWM-co-crystal-derived
DNA alignment score (Supplementary Section 1). Compared with the line fit to the variation with
DNA sequence information (slope −0.44 for root mean squared error (RMSE), slope −0.62 for
mean absolute error (MAE); Supplementary Fig. S11), the slope of the line fit to the DeepPBS
predictions was closer to zero (Fig. 2.3b and Supplementary Fig. S11).
As an example, we show the DeepPBS ensemble prediction for the NF-κB biological assembly
from the benchmark dataset. Although the co-crystal structure-derived DNA sequence was not
of the highest binding affinity, as indicated by experimental data from HOCOMOCO [114], our
prediction circumvented this issue, predicting a binding specificity that was more closely aligned
with the experimental data (Supplementary Fig. S5d). Similar trends (Supplementary Fig. S5a-c)
can be observed from cross-validation predictions by individual DeepPBS models (Methods). We
also included example DeepPBS ensemble predictions (Supplementary Fig. S7) for structures
in the PDB that correspond to specific interactions but do not have a PWM in the two binding
specificity databases considered (Methods). In addition, example DeepPBS ensemble predictions
(Supplementary Fig. S8) for structures of nonspecific protein-DNA binding present in the PDB (for example, the SSO7D-DNA interaction [115]) are shown. These predictions have notably lower information content compared with those in Supplementary Fig. S7.
2.2.3 DeepPBS captures patterns of family-specific binding modes
Abundances of different protein families in the benchmark set are described in Fig. 2.3c (Supplementary Fig. S5b for cross-validation set). Family annotations were obtained from the Database of
Protein Families (PFAM) [116]. The dataset encompasses a wide range of DNA-binding protein
families. Performance of DeepPBS for various protein families provides several key insights.
DeepPBS showed reasonable generalizability across protein families, performing well even for
families with relatively fewer structures (Fig. 2.3d and Supplementary Fig. S5c), such as heat shock
factor proteins. This observation suggests that the model is learning the underlying mechanisms of protein-DNA binding rather than overfitting on family-specific patterns.
Further validation is provided by comparing performances of the DeepPBS ‘groove readout’
and ‘shape readout’ models (Fig. 2.3d and Supplementary Fig. S5c). For families like zf-C2H2,
and zf-C4, the ‘shape readout’ model did not perform as well as the ‘groove readout’ model. This
result aligns with the common understanding of the binding mechanism of these families. For
example, zf-C2H2 uses zinc finger motifs to scan DNA for suitable base interactions, with minimal
DNA bending or conformational change [117]. is binding mode makes the zf-C2H2 family
a popular target of protein sequence-based binding specificity prediction and design [99, 100,
104, 118, 119]. Conversely, families like interferon-regulatory factor (IRF) proteins (Fig. 2.3d and
Supplementary Fig. S5c) and T-box proteins (Supplementary Fig. 5c) showed higher performances
for the ‘shape readout’ model, consistent with their known binding mechanisms that involve
significant conformational changes [120, 121]. For families such as homeodomain (HD) and forkhead (Fig. 2.3d and Supplementary Fig. S5c), the DeepPBS model outperformed both the ‘groove readout’ and ‘shape readout’ components. This result suggests that the network captures
complex higher-order relationships of these components.
Figure 2.3. Performance of DeepPBS for predicting binding specificity across protein families
for experimentally determined structures. (a) Prediction performances of DeepPBS along with
‘groove readout’, ‘shape readout’ and ‘with DNA SeqInfo’ variations, on benchmark set (biological
assemblies corresponding to n = 130 protein chains (for each box plot); Supplementary Section 1).
MAE, mean absolute error; RMSE, root mean squared error. (b) Performances of DeepPBS and ‘with
DNA SeqInfo’ models in context of PWM–co-crystal-derived DNA alignment score (Supplementary
Section 2). The shaded regions indicate the 95% confidence interval for the corresponding linear fit. The MAE equivalent of this plot is available as Supplementary Fig. S12, showing similar trends.
(c) Abundances of various protein families (as appearing in PFAM annotations) in constructed
benchmark set (counts >3). (d) Performances of DeepPBS, groove readout and shape readout
models across various protein families (counts >3) (biological assemblies corresponding to n
protein chains (for each family), where n is as described in (c), total unique n = 130). All benchmark
predictions are made by an ensemble average of five models trained via cross-validation. Cross-validation performances of individual trained models are shown in Supplementary Fig. S5c. For
the box plots in (a) and (d), the lower limit represents the lower quartile, the middle line represents
the median and the upper limit represents the upper quartile.
2.2.4 Application to in silico-predicted protein–DNA complexes
The DeepPBS framework is not limited to experimental structures. Recent advances in scalable
structural prediction approaches, driven by artificial intelligence [23, 25], offer unprecedented
potential. Specifically, models like RFNA [24] and MELD-DNA [109] can be used to predict the
structures of protein-DNA complexes from sequence. Such prediction algorithms have paved the
way for DeepPBS to be applicable to proteins that lack experimental DNA-bound structure data.
We suggest one potential approach for working with predicted structures in DeepPBS. First, we make an initial guess for the DNA (IG DNA) sequence bound to each protein of interest based on the corresponding protein family. Then, we use RFNA to predict the protein-DNA complex structure, followed by DeepPBS to predict binding specificity. We demonstrate this process (Fig. 2.4a-c) for three proteins classified as basic helix-loop-helix (bHLH) in JASPAR [113]. In all three cases, the PDB lacked experimental protein-DNA complex structures. The IG DNA (Supplementary
Section 8) has an enhancer box motif (‘CACGTG’) in the center, which is known [122] to be a
bHLH family target. e first example (UniProt Q4H376; Fig. 2.4a) is a Max homodimer, for
which DeepPBS predicted a specificity closely mirroring that of the IG DNA. e second example
(TCF21 dimer, O43680) was more complicated; the central ‘CACGTG’ motif in the IG DNA was
erroneously assumed, yet DeepPBS successfully predicted the correct motif as ‘CATATG’ (Fig.
2.4b). The third example (Fig. 2.4c, protein OJ1581 H09.2, Q6H878) does not conform to any
enhancer box motif. Nevertheless, DeepPBS predicted a binding specificity closely mirroring the
experimental data (Fig. 2.4c).
We ran the DeepPBS pipeline for full-length UniProt protein sequences, each with a unique
JASPAR entry and no experimental structure for the complex, across three different families
(Supplementary Section 8): bZIP, bHLH and HD families. DeepPBS predictions based on RFNA-predicted structures exhibited an improved MAE (that is, closer to experimental data) compared
with the IG DNA baseline (Fig. 2.4d). An application of DeepPBS to a MELD-DNA-predicted
complex of the mouse CREB1 protein is demonstrated in Supplementary Fig. S9b. Thus, DeepPBS
can take predicted structures from suboptimal DNA sequences and predict binding specificity
close to experimental data.
We next explored whether DeepPBS prediction could be used as feedback (in a loop) to
enhance modeling of the protein complex (and, subsequently, improve DeepPBS prediction). We
demonstrated this process for the human TGIF2LY protein (UniProt ID Q8IUE0, unstructured
region trimmed; Supplementary Section 8) in Fig. 2.4e. In round 1, we applied RFNA to this
protein sequence alongside the IG DNA sequence for the HD family and then used the predicted
complexes as input for DeepPBS. For IG DNA position T15 (Fig. 2.4e, round 1), DeepPBS predicted
a strong preference for G. In the round 1 RFNA output, Arg57 and T15 were involved in one
hydrogen bond (H-bond) and one van der Waals interaction. These interactions are theoretically
weaker than the possible bidentate H-bonds between a G and Arg57. In round 2, we altered
the RFNA input by taking the argmax (the most preferred sequence) from the DeepPBS output
(Fig. 2.4e, round 2). The subsequently folded structure reflected a more robust bidentate H-bond
interaction between G15 and Arg57, with the DeepPBS prediction more closely aligning with the
experimental data (note positions (round 2) A18, G19 and T14, corresponding to positions 4-6 in
MA1572.1; Fig. 2.4e).
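The loop just described can be written compactly. In the following sketch, predict_complex (standing in for a structure predictor such as RFNA) and deeppbs_predict (returning an N × 4 PWM for a predicted complex) are hypothetical helper functions, and the argmax DNA from each round is fed back into the next round:

# Hypothetical helpers: predict_complex() wraps a structure predictor such as RFNA,
# and deeppbs_predict() returns an N x 4 PWM (list of [P_A, P_C, P_G, P_T] rows).
BASES = "ACGT"

def pwm_argmax_sequence(pwm):
    # Most preferred base at each position of the predicted PWM.
    return "".join(BASES[max(range(4), key=lambda b: row[b])] for row in pwm)

def deeppbs_feedback(protein_seq, ig_dna, n_rounds=7):
    dna = ig_dna
    for _ in range(n_rounds):
        complex_model = predict_complex(protein_seq, dna)  # e.g., one RFNA prediction
        pwm = deeppbs_predict(complex_model)               # DeepPBS binding specificity
        dna = pwm_argmax_sequence(pwm)                     # argmax DNA fed back in
    return complex_model, pwm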
We repeated this DeepPBS prediction process for a total of seven rounds, for the set of HD
monomer sequences (Supplementary Section 8). The RFNA-predicted confidence metric (predicted
local distance difference test (pLDDT), LDDT [123] reflects similarity between the predicted and
reference structure for a complex; Supplementary Section 8) improved over these rounds (Fig.
2.4f). To independently evaluate structure quality, we calculated the molecular mechanics and
Poisson-Boltzmann surface area [124] binding energy (Supplementary Section 8). From round 1 to
round 3+, the number of stable structures (binding energy < 0 kJ/mol) increased (Supplementary Fig.
S9c), while their binding energy distributions shifted toward lower values (Supplementary Fig.
S9c). DeepPBS performance improved across the five rounds (Supplementary Fig. S9a). We also
refolded the benchmark set datapoints via RFNA (Supplementary Section 8) and compared (for the
full processable set (n = 98) and a high-confidence set, pLDDT >0.9, n = 31) the performances
with the equivalent performance obtained for the experimental structures (Fig. 2.4g). There is a drop in performance, which we expect to improve as future structure prediction models become available.
The DeepPBS approach for predicting binding specificity fundamentally differs from that
of existing methods, which predict binding specificity solely on the basis of protein sequence
information. As a result, direct comparisons with existing family-specific methods that operate exclusively on protein sequence are infeasible. However, in conjunction with a complex structure prediction method, we can start from protein sequence information alone and predict binding specificity using DeepPBS. This process can be compared with the recent HD family-specific
method, rCLAMPS [98] (Supplementary Section 8). rCLAMPS can predict core 6-mer binding
specificities for monomer HD proteins. A comprehensive overview of performances is shown in
Fig. 2.4h. Each method outperformed the other on a significant portion of the data. DeepPBS outperformed rCLAMPS where the pLDDT scores were higher (Fig. 2.4i). Thus, the DeepPBS pipeline is comparable to rCLAMPS, while having broader applicability across
families and biological assemblies as well as not being limited to predicting the DNA core binding
region.
2.2.5 Assessing protein residue importance at p53-DNA interface
The DeepPBS architecture permits intentional activation or deactivation of specific edges in the
bipartite geometric convolution stage (Fig. 2.2d and Supplementary Fig. S4). Perturbing a set of
edges in this manner will alter the network-predicted result. The mean absolute difference between
the original and altered prediction can be used (with proper normalization) as a quantification of the
impact of the perturbed set of edges in determining binding specificity (Fig. 2.2g, Supplementary
Fig. S4 and Methods).
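A schematic of how such a perturbation-based score could be computed is sketched below; the predict and predict_with_masked_edges calls are hypothetical stand-ins for a DeepPBS forward pass with all edges active and with one atom's bipartite edges deactivated, respectively:

import numpy as np

def relative_importance(predict, predict_with_masked_edges, structure, atom_edge_sets):
    # Perturbation-based RI: mean absolute change in the predicted PWM when the
    # bipartite edges of one protein heavy atom are switched off, then min-max normalized.
    baseline = predict(structure)                                 # (N, 4) PWM
    raw = {}
    for atom_id, edges in atom_edge_sets.items():
        perturbed = predict_with_masked_edges(structure, edges)   # hypothetical call
        raw[atom_id] = float(np.mean(np.abs(baseline - perturbed)))
    lo, hi = min(raw.values()), max(raw.values())
    return {a: (v - lo) / (hi - lo + 1e-12) for a, v in raw.items()}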
We present results for perturbing edge sets for individual protein heavy atoms, which can also
be aggregated to compute residue-level importance. As an example, we examined the protein-DNA
Figure 2.4. Application of DeepPBS on predicted protein–DNA complex structures. (a) Various
predictive approaches (for example, RFNA and MELD-DNA) can be used to predict protein–DNA
complex structures in the absence of experimental data. DeepPBS can predict binding specificity
on the basis of this predicted complex. a-c, Examples for three full-length bHLH protein sequences:
Max homodimer from Ciona intestinalis (a), TCF21 dimer from Homo sapiens (b) and OJ1581 H09.2
dimer from Oryza sativa (c). (d) Performance of DeepPBS via the same process applied for three
different families, bZIP (n = 50 predicted assemblies), bHLH (n = 49 predicted assemblies) and
HD (n = 236 predicted assemblies), compared with baselines determined for random (drawn from
uniform) and IG DNA sequences. Each protein has a unique JASPAR annotation and lacks an
experimental structure for the complex. Structures for protein complexes were predicted by
RFNA. Proteins passed the preprocessing criterion of DeepPBS. (e) One iteration of DeepPBS
feedback, demonstrated for human TGIF2LY protein. vdW, van der Waals. (f) RFNA-predicted
LDDT score over rounds 1–7 of the DeepPBS feedback loop (n = 236 predicted assemblies). (g)
Comparison of DeepPBS ensemble performance on benchmark set for experimental and RFNA
folded structures (for all processable RFNA-folded structures with greater than 500 contact counts
(5 Å cutoff) to the DNA helix (n = 98 predicted assemblies) and high confidence (pLDDT >0.9) set
(n = 31 predicted assemblies)). (h) Comparison of DeepPBS predictions against HD family-specific
method rCLAMPS, color-coded by pLDDT. Diagonal dashed line represents y = x. (i) Distribution
of pLDDT for two cases: when DeepPBS outperforms rCLAMPS (below diagonal in h) and vice
versa (above diagonal in h) (n = 140 (left) and 96 (right) predicted assemblies). The box colors denote the average pLDDT, using the same colormap as in h. For the box plots in d, f, g and i, the lower limit represents the lower quartile, the center line represents the median and the upper limit represents the upper quartile. The whiskers do not include outliers.
interface of p53 (PDB ID: 3Q05), a protein crucial for regulating cancer development and cell
apoptosis [125]. The tumor suppressor p53 binds to DNA as a tetramer with two symmetric
protein-DNA interfaces [40, 126]. We show the RI scores (with min-max normalization applied)
calculated for heavy atoms within 5 Å of the sym-helix (Fig. 2.5a). Sphere sizes in Fig. 2.5a
denote computed RI scores, with the largest being 1 and smallest 0. Lys120 [38] is involved in
both groove readout (H-bond with G) and shape readout-based binding specificity (H-bond with
backbone phosphate) (Fig. 2.5b). The network deems G-Arg280 [38] bidentate H-bonds as another strong driver of binding specificity (Fig. 2.5c). Cys277 confers specificity through its thiol sulfur,
accepting an H-bond in the major groove [38] (Fig. 2.5d). Another important residue according to
DeepPBS, Arg248 [39], is present at the minor groove (Fig. 2.5e). This decision by the model is primarily based on the orientation of arginine relative to the sym-helix, which is devoid of DNA sequence information. Arg248 is attracted through enhanced negative electrostatic potential due
to a narrowing of the minor groove where it binds [40]. Among other residues in Fig. 2.5f, Ser241
is known [39] to be important for stabilizing Arg248. Ala276 (known for causing apoptosis upon
mutation [127]) appears as another driver of specificity. This residue has been shown to be a driver of specificity via van der Waals contacts with the methyl group of T in the major groove [38]. The binding specificity prediction of DeepPBS (Fig. 2.5g) aligns well with known binding patterns of p53, which follow the form RRRC(A/T)(A/T)GYYY (R denotes purine, and Y denotes pyrimidine). The interactions shown here are deemed [37, 125] significant drivers of p53 binding.
2.2.6 Comparison of residue-level importance with mutagenesis data
We next asked whether DeepPBS-derived importance scores, which reflect the degree to which an
interaction determines output binding specificity, can be considered as reliable and potentially
physically significant. Although high-affinity interactions can be nonspecific [115, 128], interactions that contribute to high specificity would be expected to maximize binding affinity across
different base pair possibilities. Therefore, the DeepPBS importance scores associated with these
interactions should display some correlation with the corresponding binding affinities. We can
test this hypothesis experimentally by using alanine scanning mutagenesis data (Supplementary
Section 1). Sets of such experimental data have been made available through recent contributions [129] in the field. Utilizing these data [130], we applied suitable filtering for our context and
calculated the log sum aggregated residue level importance scores using DeepPBS (Methods).
A regression plot and Pearson’s correlation coefficient (PCC), as shown in Fig. 2.5h, illustrate
the correspondence between computed values and experimental ∆∆G values for a diverse array
of proteins and residues within the protein-DNA interface (Supplementary Table 1). The obtained PCC of 0.60 corroborates our hypothesis. It is noteworthy that the model was not trained to predict these values. These values were only obtained through perturbing the wild-type (WT) structures as input (Fig. 2.5 and Table 2.1). These results highlight the potential of DeepPBS
as an economical guide for experimentalists who are selecting alanine scanning mutagenesis
experiments to conduct at the protein-DNA interface.
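One plausible reading of the 'log sum' aggregation used here (the released code is authoritative) is the logarithm of the summed heavy-atom RI scores of a residue, sketched as:

import numpy as np

def log_sum_residue_score(atom_ri_scores):
    # atom_ri_scores: iterable of RI scores for the heavy atoms of one residue.
    return float(np.log(np.sum(atom_ri_scores) + 1e-12))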
Figure 2.5. Visualization of DeepPBS importance scores in p53–DNA interface as a case study,
and experimental validation. p53 binds to DNA as a tetramer with two symmetric protein–DNA interfaces (A, B, C and D refer to each monomer; PDB ID: 3Q05). (a) Relative importance (RI) score (normalized by maximum across atoms) calculated for heavy atoms (denoted by sphere sizes: largest 1, smallest 0) within 5 Å of the sym-helix. (b-e) Zoomed-in view of specific interactions by protein–DNA interface residues Lys120B (b), Arg280A (c), Cys277A (d) and Arg248B (e)
with RI scores assigned by DeepPBS. (f) Residue importance computed by average and maximum
aggregation of heavy atom importance (top 20). (g) DeepPBS prediction. (h) Comparison of log
sum aggregated residue importance computed from DeepPBS ensemble, with experimental free
energy change (∆∆G) determined by alanine scanning mutagenesis experiments. The blue line indicates the linear regression fit. The light-blue region indicates the corresponding 95% confidence interval computed via bootstrapping of the mean.
2.2.7 Application to designed scaffolds targeting specific DNA
Recent work [111] made significant progress in designing structural models of fully synthetic
helix-turn-helix (HTH) protein scaffolds targeting specific DNA sequences. We applied DeepPBS
to four synthetically designed proteins targeting a specific DNA sequence (GCAGATCTGCACATC), named DBP5, DBP6, DBP9 and DBP35 (Fig. 2.6a,e,i,m). The predicted PWMs are shown (Fig. 2.6b,f,j,n)
and the heavy atom level RI scores are visualized for the interfaces (Fig. 2.6c,g,k,o). We explored
qualitative agreement of these predictions with experimental results obtained from the study
(Fig. 2.6d,h,l,p, relative binding signal of all possible single base-pair mutations obtained via flow
cytometry analysis [111] in yeast display competition assays). DeepPBS mostly correctly predicted
the columns of high specificity (where the mutants show less binding, that is, darker red) except for
a couple of cases. Some of the alternate base preference predictions by DeepPBS appear to agree
with the experimental data. For example, for DBP35-position 11, DeepPBS predicts an alternate
specific binding possibility to C along with the WT base A, and similarly for DBP35-position 9
and DBP5-position 7. Also, it is important to look at the flanking predictions for DeepPBS’ ability
to produce sensible predictions for unbound DNA regions. For DBP9 and DBP6, the flanking
predictions look remarkably uniform, which is consistent with the designed structure having
mostly unbound canonical B-DNA structure. This baseline behavior is intuitive and nontrivial in this problem setting (given that there is a DNA sequence present in the design and the model has to avoid overfitting to it). On the other hand, for DBP5 and DBP35, the flanks have a
non-canonical shape with a narrow minor groove interaction with a loop region of the protein
(obtained from PDB ID 1L3L). The DeepPBS prediction of a mostly A-tract preference (positions 3-8) is consistent with the narrow minor groove preferred by such sequences [131]. DNA shape
prediction [28] for the top base prediction of these columns (AAATTT) is consistent with the
shape visualized in the design (Supplementary Fig. S12), showing a significant dip in minor groove
DeepPBS RI (log-sum aggregated)   ∆∆G (kcal/mol)   PDB ID   Residue
3.4246 2.62 1b3t Y518
0.0973 0.17 1fos K148
0.1073 0.17 1fos K153
3.0678 1.04 1fos R155
0.1036 0.34 1fos R158
0.0782 0.04 1fos R272
2.8199 0.84 1hcq E25
0.7292 0.92 1hcq H18
4.2104 1.21 1hcq K28
2.8322 1.24 1hcq K32
1.0187 1.22 1hcq Y19
0.6486 0.7 1j5n K22
0.6486 0.7 1j5n K22
0.5429 0.5 1j5n K53
1.0272 0.4 1j5n K60
0.9116 0.4 1j5n K78
1.0853 0.2 1j5n K85
3.2762 0.5 1j5n M29
0.2997 0.7 1j5n N33
0.9606 0.8 1j5n R36
0.7074 0.7 1j5n R40
3.751 0.9 1j5n Y28
0.7891 0.2 1j5n Y81
1.7017 0.2 1mse S187
1.1549 0.7 1tn9 K21
3.2913 1.36 1tn9 K28
0.2911 1.33 1tn9 K54
2.8115 0.43 1tn9 R20
0.996 1.22 1tn9 R24
2.6845 1.17 1tn9 R55
0.0579 0.75 1tn9 R5
0.5691 0.04 1tn9 T15
0.3096 0.44 1tn9 W42
2.596 1.51 1tn9 Y40
0.5089 0.2725 2mxf K102
0.6247 0.6225 2mxf K105
1.1177 0.7518 2mxf K108
0.5467 0.283 2mxf K81
2.0751 0.2967 2mxf K97
4.1414 0.7725 2mxf N100
4.454 1.3382 2mxf R80
3.5985 2.0467 3ufd R46
0.6796 0.971 3ufd S52
2.6846 1.4177 3ufd Y37
3.7829 2.2518 4bnc R391
Table 2.1. Correspondence of log-sum aggregated DeepPBS RI scores with experimental ∆∆G values from alanine scanning mutagenesis experiments.
width. These examples illustrate the potential for DeepPBS as a computational guide to performing
expensive and laborious wet lab experiments.
2.2.8 Application of DeepPBS to MD simulation of Exd-Scr–DNA system
Owing to a fast inference time, DeepPBS can be used to analyze molecular simulation trajectories.
We demonstrated how the protein heavy atom-level interpretability allows automatic detection of
conformational changes in the protein-DNA interface. We applied DeepPBS to an MD simulation
of the well-studied Exd-Scr-DNA system (Fig. S6a) (PDB ID: 2R5Z) [93, 132, 133, 134] by computing the DeepPBS prediction over the trajectory, consistent with the known binding specificity [132] of the system. Details of the simulation method are provided in Methods. The simulation trajectory was divided into 3,000 snapshots (0.1 ns apart), and the DeepPBS ensemble was applied to predict binding specificity for each snapshot. Relative importance (RI) scores were calculated for each heavy atom within 5 Å of DNA, followed by computation of max-aggregated residue RI scores. Fig.
S6a shows the initial structure of the simulation, with the locations of some residues of interest
marked. Residues Arg5 and His-12 of the Scr protein contribute to minor groove narrowing
through electrostatic interactions, which play a crucial role in determining binding specificity
[112]. Residues Arg58, Ile57, and Lys61 on the Exd protein interact with the major groove, driving
specificity through hydrogen bonding and van der Waals interactions. In the simulation, residues
Arg2, Arg3, and Arg5 on Exd contact the flanking sequences.
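The per-snapshot analysis can be organized as a simple loop; in this sketch, frames is an iterable of snapshot structures, and the ensemble object with predict and atom_ri methods is a hypothetical wrapper around the DeepPBS ensemble:

def analyze_trajectory(frames, ensemble):
    # Returns one PWM per frame and a per-frame dict of max-aggregated residue RI scores.
    pwms, residue_ri = [], []
    for frame in frames:
        pwms.append(ensemble.predict(frame))        # binding specificity for this snapshot
        per_res = {}
        for (res_id, _atom), score in ensemble.atom_ri(frame).items():
            per_res[res_id] = max(per_res.get(res_id, 0.0), score)  # max-aggregation
        residue_ri.append(per_res)
    return pwms, residue_ri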
Variations of the RI of the residues discussed earlier are shown in Fig. S6b,d,f. Throughout the
trajectory, Arg5 and His-12 on Scr consistently interact in the minor groove to drive protein-DNA
binding specificity (Fig. S6e). Our model assigns stable RI scores to these residues (Fig. S6d). Arg58
strongly drives specificity by contacting G in the major groove, forming a bidentate hydrogen
bond. However, after 100 ns of simulation, the Exd recognition helix moves closer to the DNA
major groove, leading to rotation of Arg58 (Fig. S6f,g) and causing a loss of strong specificity for
G. Lys61 intermittently contacts the DNA through strong electrostatic interactions, leading to a
gain in RI (Fig. S6f,g).
RI scores assigned by our end-to-end deep-learning model offer an efficient alternative to
traditional energy calculations, which require meticulous force-field design and energy computations. In the case of residues Arg2, Arg5, and Arg3 at the terminal loop region of the Exd protein,
temporal changes in RI scores (Fig. S6b) strongly correspond to conformational changes of these
residues over the simulation trajectory, as highlighted in Fig. S6c. Arg2 forms a bidentate hydrogen
bond with G (approximately 40 ns to 100 ns), which appears in DeepPBS predictions as highly specific for C (Fig. S6b). Arg5 interacts with an adjacent minor groove for most of the trajectory; however, it deviates away from the minor groove after 210 ns, and a corresponding reduction in RI is observed. This
demonstrates the ability of our deep-learning model to capture the dynamic behavior of residues
and their interactions with the DNA.
DeepPBS has demonstrated its robustness and adaptability in response to both small dynamical
fluctuations and conformational changes. Although the model was trained on snapshot structures
and experimental PWMs, its predictions and RI scores are well-regularized and versatile, making
it suitable for automated analysis of MD trajectories and designed protein-DNA complexes.
These factors make DeepPBS a valuable tool for researchers working in the field of protein-DNA
interactions, enabling deeper understanding and insights into the behavior of these complex
molecular systems.
2.2.9 Details on outliers seen in the benchmark set performance
The outlier for the DeepPBS (Groove Readout) model is a TATA-box binding protein (TBP) bound to nucleosome-bound DNA (PDB ID: 7OH9). It is understandable that the ‘groove readout’ model
will fail for this structure simply because TBP-DNA binding is known to be a primarily ‘shape
readout’ driven process depending on the strong bendability and high conformational flexibility of
the TATA motif [135]. Beyond data quality limitations, another form of data limitation lies in representation. For example, carboxylic acid side chains (glutamic and aspartic acids) are generally rare in biological DNA binding domains and hence in the DeepPBS training data. A synthetic protein chemist should be wary of this fact when designing domains with these residues.
Figure 2.6. Application of DeepPBS to in silico-designed HTH scaffolds targeting a specific
DNA sequence. (a,e,i,m) Design models of four different synthetic HTH proteins targeting the
DNA sequence GCAGATCTGCACATC (design based on DNA sequence from PDB ID 1L3L,
canonical B-DNA structure used for e and i, co-crystal-derived DNA structure used for a and
m), obtained from a recent sequence-specific DNA binder design study 34 . (b,f,j,n) DeepPBS
ensemble predictions based on each design model shown in a, e, i and m, respectively. As
expected, the predictions for DBP5 and DBP35 were very similar due to comparable designs (see
‘Data availability’ section). (c,g,k,o) DeepPBS assessment of heavy atom level RI scores for each
interface in the design models shown in a, e, i and m respectively. (d,h,l,p) Relative binding
activity (phycoerythrin/fluorescein isothiocyanate normalized to the no-competitor condition) of all possible single base-pair mutations obtained via flow cytometry analysis [111] in yeast display
competition assays for each of the four HTH proteins shown in a, e, i and m respectively. Blue
indicates competitor mutations where competition was stronger than with the WT competitor,
while red indicates competitor mutations where competition was weaker.
2.3 Datasets
We collected structural data from the Protein Data Bank (PDB) [136] and binding specificity
data from JASPAR (version 2022) [113] and HOCOMOCO (v11 core collection) [114]. JASPAR
catalogs a comprehensive set of experimental binding specificity data for proteins from different
species obtained through various types of experimental platforms. HOCOMOCO consists of
mainly chromatin immunoprecipitation followed by sequencing (ChIP-seq) [94] data for human
and mouse proteins.
Next, we searched for protein-DNA co-crystal structures available in the PDB (Dec 2022)
for each position weight matrix (PWM) available to us using corresponding UniProt IDs. We
employed DSSR [137] to check for and annotate the existence of one contiguous double helical
region in these structures. In our application, we focused on double-stranded DNA only and
discarded structures that did not conform to this requirement. Base modifications were replaced
by their parent base identity. A total of 1,155 PDB chain IDs were filtered into the dataset. For
each structure (biological assembly containing a chain of interest) in the dataset, a corresponding
PWM was paired with it. If a PWM existed in both JASPAR2022 and HOCOMOCOv11, one was
randomly chosen. PWMs were trimmed to remove uninformative terminal regions with a 0.5
information content (IC) threshold. For each structure, we aligned the corresponding PWM to
the DNA helix using an ungapped local alignment (Methods), annotating the region on the DNA
helix where predictions should be made and the loss computed during training. For source code
and further details of data cleaning and pre-processing, see the Data/Code Availability section.
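As a small illustration of the trimming step, the sketch below assumes the conventional bits-based information content (maximum 2 bits per column), under which a 0.5 threshold is natural; the exact definition used in the released code may differ:

import numpy as np

def column_ic_bits(col, eps=1e-9):
    # Conventional information content in bits, relative to a uniform background.
    p = np.clip(np.asarray(col, float), eps, 1.0)
    return 2.0 + float(np.sum(p * np.log2(p)))

def trim_pwm(pwm, ic_threshold=0.5):
    # Drop uninformative terminal columns from an (N, 4) PWM.
    pwm = np.asarray(pwm, float)
    ic = np.array([column_ic_bits(c) for c in pwm])
    keep = np.where(ic >= ic_threshold)[0]
    return pwm[keep[0]:keep[-1] + 1] if keep.size else pwm[:0]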
We clustered the protein chains using CD-HITv4.8.1 [138] with a 40% sequence similarity
threshold for clustering, resulting in 189 clusters. This step ensures that our dataset does not
overrepresent any particular protein sequence. Next, we sampled up to five members from each
cluster, prioritizing biological assemblies where the chain of interest has more contacts with the
DNA region where the PWM was aligned, into a fold. A full list of these memberships is available in
Extended Data. We set the cutoff for alignment length to be at least five base pairs. We split this
set of structures into five folds to create a cross-validation set. A schematic representation of this
process is shown in Fig. S1a. Experimental and species diversity of the gathered cross-validation
dataset are shown in Fig. S1b.
Structures that were not included in the cross-validation dataset were resampled, selecting up
to five per cluster following the same criterion. This resulted in 130 datapoints, which we used as
a benchmark set. Predictions on this set were only calculated once, after finalizing all models. The family distribution of this set (Fig. 2.3c) differs from that of the cross-validation set (Fig. S5b).
The PWM of the same protein differed slightly between JASPAR and HOCOMOCO (example shown in Fig. S1c for human estrogen receptor). This observation indicates that there is an inherent
limit on what can be possibly learned, signifying noise in collected knowledge. To quantify the
performance limit on the dataset based on this phenomenon, we computed the distribution of
performance metrics across all unique PWMs appearing in both databases (111 cases).
Alanine scanning mutagenesis involves measuring changes in binding free energy (∆∆G) when
performing the same binding experiment for a given protein, with a specific residue mutated to
alanine. We used an already gathered dataset [130] of alanine scanning mutagenesis experiments
for protein-DNA structures. We filtered the dataset to make it suitable for our context. Specifically,
we removed cases involving single-stranded DNA. Mutations to alanine with ∆∆G values within 0-3 kcal/mol were retained. We removed cases in which no heavy atom of the mutated residue was within 5 Å of DNA, because our model only assigns importance scores within this
range.
2.4 Methods
2.4.1 Position Weight Matrix (PWM)
For the purposes of this study, a PWM is defined as an N × 4 matrix, where N represents the length
of the DNA of interest, and the four entries per position correspond to the four DNA bases: adenine (A), cytosine (C), guanine (G) and thymine (T). Each PWM column (that is, each position) contains the probabilities of the four bases occurring at that particular position.
Col_PWM = [P_A, P_C, P_G, P_T],   where   P_A + P_C + P_G + P_T = 1
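A minimal illustration of this definition as an N × 4 NumPy array (toy values, not from any database):

import numpy as np

# Toy PWM for a 3-bp site; base order of the columns is [A, C, G, T].
pwm = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.85, 0.05],
    [0.25, 0.25, 0.25, 0.25],
])

# Each position is a probability distribution over the four bases.
assert np.allclose(pwm.sum(axis=1), 1.0)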
2.4.2 Ungapped Local Alignment
Alignment of experimental PWMs to the corresponding co-crystal structure derived DNA is an
important step for correctly annotating experimental protein-DNA structural data for model
training and evaluation. This alignment needs to be ungapped and should prioritize alignment of
higher IC columns from the PWM. Hence, we used an IC-weighted Pearson correlation coefficient
(PCC) scoring scheme for the alignment, given by:
ICWeightedPCC(Col_PWM, Col_DNA) = PearsonR(Col_PWM, Col_DNA) × IC(Col_PWM) / 2    (2.1)
where PearsonR refers to a standard PCC, and IC refers to the information content calculated for
a probability simplex with a uniform background, in this context:
IC([P_A, P_C, P_G, P_T]) = ∑_{i ∈ {A,C,G,T}} log(P_i) / log(0.25)    (2.2)
Algorithm 2: Ungapped Local Alignment
Input: seq, pwm // length × 4 arrays
1: max_score ← −9999
2: opt_i ← 0
3: opt_j ← 0
4: opt_k ← 0
5: l ← length(seq)
6: s ← length(pwm)
7: for i = 0, 1, 2, ..., s−1 do // start position in pwm
8:   for k = 0, 1, 2, ..., s−i do // alignment length
9:     for j = 0, 1, 2, ..., l−k do // start position in seq
10:      score ← 0
11:      for col = 0, 1, 2, ..., k−1 do // score each aligned column
12:        col_score ← ICWeightedPCC(pwm[i : i+k, :][col, :], seq[j : j+k, :][col, :])
13:        score ← score + col_score
14:      end for
15:      if score > max_score then
16:        max_score ← score
17:        opt_i ← i
18:        opt_j ← j
19:        opt_k ← k
20:      end if
21:    end for
22:  end for
23: end for
24: return opt_i, opt_j, opt_k, max_score
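A compact Python rendering of this scan is given below. The column scorer is left pluggable; a plain Pearson correlation is used as a stand-in for the IC-weighted score of equation (2.1), and seq and pwm are assumed to be (length, 4) NumPy arrays:

import numpy as np

def pearson_column_score(col_pwm, col_dna):
    # Stand-in column scorer; DeepPBS additionally weights this by information content (Eq. 2.1).
    if np.std(col_pwm) < 1e-9 or np.std(col_dna) < 1e-9:
        return 0.0
    return float(np.corrcoef(col_pwm, col_dna)[0, 1])

def ungapped_local_alignment(seq, pwm, column_score=pearson_column_score):
    # Exhaustive scan of Algorithm 2 over PWM start i, alignment length k and sequence start j.
    l, s = len(seq), len(pwm)
    max_score, opt = -9999.0, (0, 0, 0)
    for i in range(s):
        for k in range(1, s - i + 1):
            for j in range(l - k + 1):
                score = sum(column_score(pwm[i + c], seq[j + c]) for c in range(k))
                if score > max_score:
                    max_score, opt = score, (i, j, k)
    return opt[0], opt[1], opt[2], max_score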
2.4.3 Representing DNA
Our framework must consider several important factors for representing DNA. First, our model
observes the structure of DNA in the input, but will predict a one-dimensional (1D) representation
(a PWM). Thus, from an engineering perspective, it is beneficial to have the same number of features
per base pair. Second, the input co-crystal structure derived DNA has a sequence; depending on
the use case, we may or may not want our model to observe this sequence. Moreover, because
experimental structural data are sparse, the co-crystal derived sequence has a strong potential for
overfitting if observed by the model in the input. Therefore, in general, we want to symmetrize
each base pair such that all sequence information is lost, but the global shape of the double helix
is preserved.
With these points in mind, we developed a coarse-grain symmetrized representation of DNA,
where each base pair is represented by 11 points: two points for the phosphate moiety on each
strand, two points for the sugar moiety, four points for the major groove, and three points for the
minor groove. Major and minor groove points are placed symmetrically in the base-pair plane, so
that they do not possess any particular base identity but roughly correspond to the major and
minor groove chemical positions known [105] to be used for base readout. The phosphate moiety is represented by the coordinate of the phosphorus atom. The sugar moiety is represented by the average coordinate of all sugar heavy atoms. The three minor groove points divide the line segment connecting the two C1’ atoms into four equal segments. The base-pair plane is determined by the triangle connecting the two C1’ atoms and point O (the average of atoms N1 and N9). Next, we move perpendicular (to the minor groove line segment) in this plane from either C1’ for 3.75 Å and expand the line segment by another 1.54 Å in either direction. The line segment is divided into five equal segments to determine the positions of the four major groove points. Additionally, the central two major groove points are shifted by an additional 1 Å. This geometric construction is based solely on domain knowledge; no learning is employed to estimate any parameter. Fig. S2a shows a schematic representation of this process for an A-T base pair. The only base atoms used for this process are N1 and N9, making it agnostic of base identity. A step-by-step illustration of this process is provided in Fig. S32.
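The first part of this construction (the base-pair plane and the three minor groove points) can be sketched as follows; this is an illustrative helper, not the DeepPBS implementation, and the major groove points are omitted:

import numpy as np

def minor_groove_points(c1_a, c1_b, n_a, n_b):
    # c1_a, c1_b: C1' coordinates of the two paired nucleotides.
    # n_a, n_b: glycosidic nitrogen coordinates (N9 for the purine, N1 for the pyrimidine).
    c1_a, c1_b = np.asarray(c1_a, float), np.asarray(c1_b, float)
    o = (np.asarray(n_a, float) + np.asarray(n_b, float)) / 2.0   # point O

    # Base-pair plane spanned by the C1'-C1' segment and point O.
    normal = np.cross(c1_b - c1_a, o - c1_a)
    normal /= np.linalg.norm(normal)

    # Three minor groove points divide the C1'-C1' segment into four equal parts.
    points = np.array([c1_a + t * (c1_b - c1_a) for t in (0.25, 0.5, 0.75)])
    return points, normal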
Fig. S2b shows an example transformation of a DNA structure to a symmetrized helix (sym-helix) using the described process. Fig. S2c shows one C-G base pair overlaid with the sym-helix points computed for the corresponding base pair. As a result, the DNA structure is represented as G^d = (V^d, X^d, N^d). V^d represents the coordinates of the sym-helix points, and X^d represents point-level DNA features, which reflect a one-hot encoded annotation of the 11 positions in the symmetrized base-pair representation. If desired, we can reintroduce the DNA sequence (‘DeepPBS with DNA SeqInfo’ model) by including base-pair-specific chemical group features for each point in X^d, as in [105]. For each point v, we also define an interaction vector N^d_v. These vectors act as reference directions in the base-pair frame. They are used to compute relative orientation-based features coupled with the vectors N^p on the protein graph (refer to Section 4). For the phosphate point, this vector is the average direction of the two double-bonded oxygens; for the sugar point, this vector is the direction of the C4’-C5’ bond. For the seven major and minor groove points, these directions are determined by connecting each point to the centroid of the heptagon formed by these points. Fig. S2e shows the arrangement of these vectors on a sym-helix. These directions do not encode any base-specific information and only serve to inform the relative orientation of a sym-helix point in the context of binding. In addition, we include 14 DNA shape features [139, 140, 141], denoted as X^s, which are base-pair level features (Fig. S2d). These features are: buckle, shear, stretch, stagger, propeller twist, opening, shift, slide, rise, tilt, roll, helix-twist, major groove width, and minor groove width. 3DNA v2.3 [140] and Curves 5.3 [139, 141] were used to calculate these values. Mean-padding was used to offset the inter-base-pair features (shift, slide, rise, tilt, roll, helix-twist).
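The six inter-base-pair step parameters are defined between consecutive base pairs, so they carry one fewer value than the number of base pairs; one plausible reading of the mean-padding is a sketch like the following:

import numpy as np

def mean_pad(step_values):
    # step_values: length N-1 array of an inter-base-pair parameter (e.g., roll).
    # Returns a length N array, padded with the mean of the observed values.
    step_values = np.asarray(step_values, float)
    return np.append(step_values, step_values.mean())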
2.4.4 Representing protein
In our framework, the protein is viewed as a spatial graph G^p = (V^p, X^p, E^p, N^p), where the coordinates of the heavy atoms constitute the vertices V^p. For each vertex v ∈ V^p, we define a set of features X^p_v, which include one-hot encoded atom type, solvent accessible surface area of the atom, charge, radius, circular variance (7.5 Å) and Atchley factors [142]. The edges E^p of the protein graph are determined by the covalent bonds, i.e., if vertices u and v have a covalent bond between them, then (u, v) ∈ E^p. The edges are unordered. Lastly, to encode the directionality of protein side chains, we encode a unit vector N^p_v for each vertex v, computed by averaging the directions of the covalent bonds associated with each heavy atom.
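A minimal sketch of how such a per-atom direction vector could be computed (illustrative only; the bond-direction convention is an assumption):

import numpy as np

def atom_direction_vector(atom_xyz, bonded_xyz):
    # atom_xyz: (3,) coordinates of a heavy atom; bonded_xyz: (k, 3) coordinates of
    # covalently bonded heavy atoms. Returns the averaged unit bond direction.
    atom_xyz = np.asarray(atom_xyz, float)
    dirs = np.atleast_2d(np.asarray(bonded_xyz, float)) - atom_xyz
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    mean_dir = dirs.mean(axis=0)
    norm = np.linalg.norm(mean_dir)
    return mean_dir / norm if norm > 1e-9 else mean_dir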
2.4.5 DeepPBS architecture
The architecture of DeepPBS is modular. First, the ProteinEncoder module applies spatial graph convolutions on the protein graph to aggregate neighborhood environment information for each protein heavy atom. Initially, a fully connected embedding layer is applied to X^p_v ∀ v ∈ G^p, which expands the dimensionality of X^p_v to 10 dimensions. Four layers of crystal graph convolutions (CGConv) [143] are applied. The first two layers use only covalent bond edges, and the next two layers use distance-based edges with a 4 Å radius. The mathematical description of the message-passing scheme for CGConv is as follows:

X^p_v ← X^p_v + (1/|N(v)|) ∑_{u ∈ N(v)} σ(z_uv W_f + b_f) ⊙ g(z_uv W_s + b_s)    (2.3)

where N(v) denotes the neighbors of v ∈ V^p, and z_uv = [X^p_v, X^p_u, e_uv] denotes the concatenation of target node features, source/neighboring node features, and edge features (here, the distance between u and v). In addition, σ denotes the sigmoid function, and g denotes the softplus function. A Rectified Linear Unit (ReLU) [144] activation function is applied after each round of graph convolutions. This marks the end of the ProteinEncoder module.
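A minimal torch-geometric sketch of one such convolution round is shown below (toy tensors; mean aggregation is chosen to match equation (2.3), and the actual DeepPBS layer sizes may differ):

import torch
from torch_geometric.nn import CGConv

num_atoms, feat_dim = 50, 10                         # toy sizes; features embedded to 10D
x = torch.randn(num_atoms, feat_dim)                 # per-atom features X^p_v
edge_index = torch.randint(0, num_atoms, (2, 200))   # toy covalent/distance-based edges
edge_attr = torch.rand(200, 1)                       # edge feature: distance between u and v

conv = CGConv(channels=feat_dim, dim=1, aggr='mean')  # dim = edge feature dimensionality
x = torch.relu(conv(x, edge_index, edge_attr))        # one round of Eq. (2.3) followed by ReLU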
The sym-helix point features X^d_v are embedded into a 10-dimensional (10D) space using fully connected neural network layers and are used in the next module, the Bipartite Geometric Network (BiNet). In this module, aggregated information on G^p is pulled onto the sym-helix by performing geometry-aware bipartite convolutions. We use a modified version of point pair feature convolutions (PPFConv) [145], which we call a bipartite ResidualPPFConv. The message-passing update scheme associated with it is as follows:

X^d_v ← X^d_v + γ_θ ( ∑_{u ∈ N(v), u ∈ V^p} h_θ ( X^p_u, ||d_uv||, ∠(N^d_v, d_uv), ∠(N^p_u, d_uv), ∠(N^d_v, N^p_u) ) )    (2.4)

where d_uv denotes the line segment connecting a sym-helix point v and a protein heavy atom point u. h_θ is a transformation parametrized by fully connected neural networks. We set γ_θ to be the identity transformation. Four separate ResidualPPFConvs are applied for the major groove, minor groove, phosphate, and sugar points, respectively (from neighbors within 5 Å), followed by ReLU activation. At this stage, we have aggregated all local chemical and geometric interaction contexts onto the sym-helix.
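The stock PPFConv layer in torch-geometric computes exactly the distance and three angles of equation (2.4); the sketch below shows the bipartite call pattern with a residual update (toy tensors; DeepPBS uses its own modified ResidualPPFConv, and the stock layer max-aggregates messages rather than summing):

import torch
from torch.nn import Linear, ReLU, Sequential
from torch_geometric.nn import PPFConv

feat_dim, n_protein, n_helix = 10, 200, 44          # toy sizes

x_p = torch.randn(n_protein, feat_dim)              # aggregated protein atom features
pos_p = torch.randn(n_protein, 3)                   # protein atom coordinates
nrm_p = torch.nn.functional.normalize(torch.randn(n_protein, 3), dim=1)  # N^p vectors

x_d = torch.randn(n_helix, feat_dim)                # embedded sym-helix point features
pos_d = torch.randn(n_helix, 3)                     # sym-helix point coordinates
nrm_d = torch.nn.functional.normalize(torch.randn(n_helix, 3), dim=1)    # N^d vectors

# Toy bipartite edges: protein atom (source) -> sym-helix point (target), e.g. within 5 A.
edge_index = torch.stack([torch.randint(0, n_protein, (500,)),
                          torch.randint(0, n_helix, (500,))])

# h_theta of Eq. (2.4): input is [X^p_u, ||d_uv||, three angles] = feat_dim + 4 values.
h_theta = Sequential(Linear(feat_dim + 4, feat_dim), ReLU(), Linear(feat_dim, feat_dim))
conv = PPFConv(local_nn=h_theta, global_nn=None, add_self_loops=False)

out = conv((x_p, x_d), (pos_p, pos_d), (nrm_p, nrm_d), edge_index)  # bipartite convolution
x_d = torch.relu(x_d + out)                                         # residual update + ReLU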
The final module is the CNN-Predictor module. We flatten the helix into a 1D base pair-level representation and apply a Multi Layered Perceptron (MLP) to reduce the dimensionality for each base pair to 32. We concatenate precomputed helix shape features to the aggregated base pair-level features. This step allows the network to make connections/correlate patterns between the aggregated information from BiNet and the global shape of the helix. We apply two rounds of 1D convolutions of filter size 3 with a stride of 1. We also apply relevant padding and set the output feature size to 8, followed by ReLU activation, making the effective field of view five base pairs. Next, we apply an MLP for each base pair to generate logits for the predicted base probabilities ([L_A, L_C, L_G, L_T]) for corresponding base pairs. We apply a SoftMax [146] activation to generate the output DNA base probabilities ([P_A, P_C, P_G, P_T]). A global temperature parameter (T_glob) is learned for SoftMax through the training process. Fig. S3 schematically describes the DeepPBS architecture.

P_i = exp(L_i / T_glob) / ∑_{j ∈ {A,C,G,T}} exp(L_j / T_glob),   ∀ i ∈ {A,C,G,T}    (2.5)
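Equation (2.5) is a standard temperature-scaled softmax; a minimal sketch with a learnable global temperature:

import torch

logits = torch.tensor([[2.0, 0.1, 1.5, 0.2]])     # toy per-base-pair logits [L_A, L_C, L_G, L_T]
t_glob = torch.nn.Parameter(torch.tensor(1.0))    # global temperature, learned with the network

probs = torch.softmax(logits / t_glob, dim=-1)    # Eq. (2.5): [P_A, P_C, P_G, P_T], sums to 1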
2.4.6 Training, cross-validation, and benchmarking
Five models were trained for each of the four types: DeepPBS, DeepPBS GrooveReadout, DeepPBS
ShapeReadout, and DeepPBS with DNA SeqInfo. Each model was trained on four folds of the
constructed cross-validation set. Training was conducted for 50 epochs with early stopping
on an NVIDIA RTX A4000 using an Adam [147] optimizer, with a learning rate of 0.001 and
weight decay of 0.0001. Hyperparameters were selected based on domain knowledge and training
curves. For every datapoint, two forward passes were made to account for reverse complement
predictions for both strand directions (with relevant index transformations for input; refer to
Data/Code Availability). Outputs were concatenated, and MAE loss was calculated with ground
truth (corresponding PWM and its reverse complement concatenated). Predictions were made on
the corresponding fifth/validation fold with each model to gather predictions for all datapoints in
the 5-fold dataset. These predictions were used to report metrics in Fig. S5a-c.
For benchmarking purposes, ensemble-averaged (over the five trained cross-validation models) predictions are used in Fig. 2.3a-d. The ensemble is also used for results presented in Fig. 2.4, 2.5, 2.6, and Fig. S5b,d, S6, S7, S8, and S9. All core DeepPBS code was written in Python 3.9+ with various Python dependencies (full list available at GitHub). Packages used for geometric deep learning are PyTorch 1.12+ and torch-geometric (PyG v2.0+).
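The following is a minimal sketch of the objective described above (two forward passes per datapoint and MAE loss against the PWM concatenated with its reverse complement). The model call, the hypothetical reverse_complement_input index transform, and the tensor shapes are assumptions; the actual transformations are in the released code (see Data/Code Availability).

# A minimal sketch of the per-datapoint training objective; Adam with lr=0.001 and
# weight_decay=0.0001 would be configured outside this function.
import torch

def mae_loss(pred, target):
    return (pred - target).abs().mean()

def training_step(model, complex_graph, pwm):
    # pwm: [N, 4] position weight matrix in A, C, G, T column order
    pred_fwd = model(complex_graph)                            # one strand direction
    pred_rev = model(reverse_complement_input(complex_graph))  # hypothetical transform
    pred = torch.cat([pred_fwd, pred_rev], dim=0)

    pwm_rc = torch.flip(pwm, dims=[0, 1])    # reverse positions and complement bases
    target = torch.cat([pwm, pwm_rc], dim=0)
    return mae_loss(pred, target)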
2.4.7 Performance metrics
Performance metrics used in this chapter are MAE and RMSE, defined as

$$\mathrm{MAE}(Y, Y^{pred}) = \frac{1}{N} \sum_{i \in \{0,\ldots,N-1\}} \sum_{b \in \{A,C,G,T\}} \left|Y_{ib} - Y^{pred}_{ib}\right|$$

$$\mathrm{RMSE}(Y, Y^{pred}) = \sqrt{\frac{1}{N} \sum_{i \in \{0,\ldots,N-1\}} \sum_{b \in \{A,C,G,T\}} \left(Y_{ib} - Y^{pred}_{ib}\right)^2}$$

$N$ refers to the number of columns in the PWMs being compared. Both metrics follow the 'the lower the better' principle. They are not independent but have different properties.
2.4.8 Additional discussion on behavior of metrics
To get a better perspective of the behavior of the MAE metric, we demonstrated how the MAE metric behaves for various target PWM columns and possible predictions (Fig. S10a). The predictions are of three forms, based on interpolated values of a variable $x \in [0,1]$. They are as follows:

$$[1-x,\ 0,\ x,\ 0], \quad \left[x,\ \tfrac{1-x}{3},\ \tfrac{1-x}{3},\ \tfrac{1-x}{3}\right], \quad \left[\tfrac{1-x}{3},\ \tfrac{1-x}{3},\ \tfrac{1-x}{3},\ x\right] \qquad (2.6)$$

These demonstrations, although they do not form an exhaustive set, give an idea of the behavior of the MAE metric. Based on these plots, and taking into account the inherent performance limit computed for this metric, we can consider values below 0.8 to indicate reasonable agreement and values below 0.6 to indicate good agreement. However, we note that these values should not be set in stone, as the problem in question is a regression problem as opposed to binary classification.
Some further thoughts on this matter: In theory, the uniform prediction [0.25, 0.25, 0.25, 0.25] can be regarded as a bad prediction, because a naive model can always predict such a value. This prediction will perform well for highly non-specific binders (e.g., cases like Fig. S8) but will fail for highly specific binders. However, the one-hot prediction [1,0,0,0] can also be a naive (bad) prediction for this problem setting. This is the case when the sequence present in the co-crystal structure itself is the output, e.g., for the sequence ACG: [[1,0,0,0],[0,1,0,0],[0,0,1,0]]. For experimentally determined structures, this prediction will generally perform well, especially for highly specific binders, but will fail for non-specific binders. Ultimately, we wanted to create a general model that can handle both specific and non-specific binders, so we need a target metric that strikes a balance between both scenarios. Therefore, in our opinion, looking at the predictive performances in the context of the alignment scores of the co-crystal structure-derived sequence with the target PWM gives a clearer picture. This has been added (for the benchmark set) to the manuscript as Fig. S10b.
We chose a continuous distance metric like MAE over statistical measures like PCC (Pearson R) or SCC (Spearman R) because the latter are not very robust with only four (linearly dependent) points (a well-known result in the statistics community [148, 149]). In our observation, SCC can only take a few discrete values in this scenario and can change sharply for a very small change of values that alters the rank order. The PCC metric, although it takes continuous values, is affected by similar non-intuitive situations. This is easy to demonstrate by considering the following three slightly altered predictions:

PCC([0.5,0.5,0,0], [0.25,0.25,0.25,0.25]) = Undefined
PCC([0.5,0.5,0,0], [0.23,0.24,0.26,0.27]) = -0.9487
PCC([0.5,0.5,0,0], [0.27,0.26,0.24,0.23]) = 0.9487
(calculated using the scipy.stats.pearsonr() function)

Intuitively, all three predictions are of similar caliber. However, the PCC metric paints a dramatically different picture. MAE, on the other hand, produces values of 1, 1.06, and 0.94, which is much more nuanced and intuitive. We can also easily construct other non-intuitive situations for PCC. For example, if the target is [0.1,0.2,0.3,0.4], any monotonically increasing prediction will have a high PCC value.

PCC([0.1,0.2,0.3,0.4], [0.001,0.002,0.003,0.994]) = 0.776

This gives the impression that this prediction is quite good, while in reality, it is almost a one-hot prediction. MAE, on the other hand, produces a value of 1.187, which depicts a bad prediction.
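The comparison above can be reproduced with a few lines of SciPy/NumPy; the per-column MAE convention (sum of absolute differences over the four bases) is the one used in the text.

# A short sketch reproducing the PCC versus MAE comparison for one PWM column.
import numpy as np
from scipy.stats import pearsonr

target = np.array([0.5, 0.5, 0.0, 0.0])
preds = [
    np.array([0.25, 0.25, 0.25, 0.25]),   # PCC undefined (constant input)
    np.array([0.23, 0.24, 0.26, 0.27]),   # PCC close to -0.95
    np.array([0.27, 0.26, 0.24, 0.23]),   # PCC close to +0.95
]
for p in preds:
    mae = np.abs(target - p).sum()        # single-column MAE
    pcc = pearsonr(target, p)[0]          # nan for the constant prediction
    print(f"MAE = {mae:.2f}, PCC = {pcc:.4f}")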
2.4.9 Additional details associated with application of DeepPBS on predicted
structures
Running rCLAMPS: We ran the rCLAMPS model with default parameters provided by the model’s
authors according to instructions provided through their GitHub.
Running RoseTTAfoldNA (RFNA): We ran the RFNA model with default options and model
weights (version: April 13, 2023 v0.2) as provided by the authors through their GitHub.
Trimming unstructured regions from full-length homeodomain (HD) sequences: For the analysis in Fig. 2.4e-i, full-length HD sequences were first trimmed to remove unstructured regions,
while retaining the main Homeobox domain of interest (rCLAMPS also applies the same process
in its pre-processing). This step was achieved with HMMER v3.4 [150] using the ‘homeobox.hmm’
file provided by the rCLAMPS repository.
MM-PBSA vacuum energy calculation: For each PDB file, we generated a topology file and a run-parameter file using Gromacs 2020.3 to define the force fields amber14sb for protein and parmbsc1 for DNA. These files were used as input for g_mmpbsa to calculate the potential energy in a vacuum. The dielectric constant of the solute was set to 8.
Dataset: We obtained UniProt protein sequences for three different families, bZIP (homodimers),
bHLH (homodimers), and homeodomain (HD) (heterodimers excluded), for cases with a corresponding unique JASPAR entry and no experimental structure for the complex. RFNA predictions
that could be successfully processed by DeepPBS pre-processing steps were fed into DeepPBS for
specificity prediction (n=49 for bHLH, n=50 for bZIP, n=236 for HD family members).
Choice of initial guess (IG) DNA: The IG DNA for the bHLH family was chosen as ‘GCGCACCACGTGGTGCGC’, which has a center E-box motif (‘CACGTG’) that is known [122] to be a bHLH family target. The IG DNA for the bZIP family was chosen as ‘GCGCTGATGTCAGCGC’ (based on the human CREB1 motif MA0018.4). The IG DNA for the HD family was chosen as ‘GCGTGTAAATGAATTACATGT’, based on DNA from PDB ID 1APL.
Details on metric calculation for Fig. 2.4d: We calculated MAE (best full overlap) for predictions in Fig. 2.4d against corresponding JASPAR annotations. As a baseline (apart from random predictions drawn from a uniform distribution), we calculated the MAE (best full overlap) for the one-hot PWM determined by the IG DNA against the corresponding JASPAR annotation. For the bHLH and HD families, the IG DNA was closer to the experimental data than the random baseline (Fig. 2.4d).
pLDDT score (RFNA-predicted LDDT (Local Distance Difference Test) score [123]): The LDDT score measures the similarity between a predicted and a reference structure. When predicting a complex structure, RFNA predicts an LDDT score (pLDDT). These pLDDT scores were shown [24] to be well correlated with the true LDDT of RFNA predictions. Thus, the pLDDT can be taken as a
measure of quality of a complex generated by RFNA.
Comparison of DeepPBS and rCLAMPS: There are several qualitative advantages to the DeepPBS approach. First, rCLAMPS uses structural mapping for HD-DNA binding to predict specificity for a given HD sequence. This structural mapping leads to a specificity output of exactly six base pairs (bp). DeepPBS is functionally not limited to only predicting a 6-bp core and can predict preferences in the flanks (Fig. 2.4e). Second, rCLAMPS is restricted to predicting monomer preferences (although HD proteins can often bind as dimers; see, e.g., Fig. S6a). In contrast, DeepPBS is able to handle biological assemblies. Third, DeepPBS is not limited to a specific family. For quantitative comparison, we compare the aspects that are achievable by rCLAMPS, namely, 6-mer specificity predictions for monomer HD proteins.
Metric computation for comparison with rCLAMPS: We ran rCLAMPS (GitHub commit version 32a94edb65e87c6d038823dc34c4bcf6e1071b7b) on the set of monomer HD proteins (with unavailable complex structures, i.e., not part of RFNA or DeepPBS training). We computed the MAE (best overlap) values for rCLAMPS predictions against the corresponding JASPAR entry of experimental data, and compared these values to the MAE values of the best 6-mer overlap for DeepPBS predictions. In this case, for each datapoint, one of the round 4-7 predictions was chosen. This choice was based on maximizing the corresponding protein-DNA contact count (5 Å cutoff) of the input RFNA-predicted structure.
Folding benchmark set datapoints with RFNA: We refolded all the benchmark set datapoints with RFNA for the results presented in Fig. 2.4g. 108/130 RFNA predictions produced pre-processable results. Of these, a few were not able to place the protein near the DNA helix properly. We therefore required an atom-to-atom contact count (within 5 Å) of greater than 500 contacts between protein and DNA, producing a filtered set of size 98. The high-confidence set among these was taken as the predictions for which the RFNA pLDDT was greater than 0.9.
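A small sketch of the contact-count filter described above is given below; the coordinate arrays and the brute-force distance computation are assumptions for illustration only.

# Keep a predicted complex only if protein and DNA have more than 500 atom-to-atom
# contacts within 5 Angstrom.
import numpy as np

def contact_count(protein_xyz, dna_xyz, cutoff=5.0):
    """Number of protein-DNA atom pairs closer than `cutoff` (in Angstrom)."""
    d = np.linalg.norm(protein_xyz[:, None, :] - dna_xyz[None, :, :], axis=-1)
    return int((d < cutoff).sum())

def passes_filter(protein_xyz, dna_xyz, min_contacts=500):
    return contact_count(protein_xyz, dna_xyz) > min_contacts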
2.4.10 Molecular Dynamics (MD) simulation of Exd-Scr–DNA system
We conducted MD simulations on the Drosophila Extradenticle (Exd)-Sex combs reduced (Scr)
system, with the dimer bound to its target DNA, using the co-crystal structure (PDB ID: 2R5Z).
AlphaFold2 predictions of the proteins were aligned to the PDB structure to create an initial
structure of the simulation. This process aided in filling in the missing linker residues in the biological assembly. The simulation was executed using the Gromacs [151] 2020.3 software package.
Protein interactions were modeled with the amber14sb [152] force field, and DNA interactions were modeled with the parmbsc1 [153] force field. The pdb2gmx program from Gromacs was used to generate topological information for the simulation. The -his flag was used to protonate both Nδ and Nε atoms of the His-12 residue for the system with protonated His-12. All complexes were solvated using the explicit TIP3P water model. The negative net charge of the Exd-Hox-DNA complex was neutralized by adding positively charged Na+ counterions, along with negatively charged Cl- counterions, to reach a final NaCl concentration of 150 mM, which approximates the physiological concentration. The GROMACS 2020.3 genion program was used to place these counterions throughout the box.
The protein-DNA complex was energy-minimized with steepest descent energy minimization for 2,000 steps to relax the structure and remove any steric clashes. Next, we performed three rounds of gradual NVT (constant Number of particles, Volume, and Temperature) equilibration for 10 ps to slowly heat the prepared system to 300 K, and one round of NPT (constant Number of particles, Pressure, and Temperature) equilibration for 700 ps to equilibrate the pressure of the system to 1 bar. These equilibration rounds were used to adjust the whole system to biological conditions before starting the production simulation. The production simulation for the system was run for 300 ns in the isobaric-isothermal ensemble, where the pressure was maintained at 1 bar and the temperature at 300 K. An integration time step of 2 fs was used for all calculations. The Verlet cutoff scheme was used for all calculations. Long-range electrostatic interactions were computed using the Particle Mesh Ewald method [154] from the GROMACS v2020.3 package, with a 12 Å cutoff. Nonbonded van der Waals interactions were calculated with a 12 Å cutoff. The LINCS [155] algorithm from the GROMACS v2020.3 package was employed to constrain all bonds. The MD simulation trajectories were generated as part of a recent study exploring effects of minor groove linker histidine protonation [156].
2.4.11 Measures ensuring prevention of overfitting to protein sequences
We took several steps to prevent the model from overfitting on protein sequences. The cross-validation fold creation script (see Code Availability) took care to keep members of the same cluster in the same fold (except for a handful (< 4%) that were split randomly to keep the fold sizes equal). A full description of these splits is available (see Data Availability), and the code for this process is available (see Code Availability). In addition, no more than five samples were chosen per cluster into a fold. This prevents over-representation. Furthermore, the input to DeepPBS is purely structural and physico-chemical, and does not contain a sequence representation. These structures may demonstrate different spatial conformations, interacting with potentially different DNA sequences/shapes and randomly picked target PWMs from JASPAR or HOCOMOCO. These guardrails in the architecture, and all of these variations even for similar protein sequences, make the model less prone to overfitting on protein sequences.
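The following is a minimal sketch of the cluster-aware fold construction described above; the cluster assignments, the per-cluster cap of five samples, and the smallest-fold placement heuristic are assumptions of the sketch, not the released script.

# Keep members of a cluster together in one fold, drawing at most five samples
# per cluster.
import random
from collections import defaultdict

def make_folds(sample_to_cluster, n_folds=5, max_per_cluster=5, seed=0):
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for sample, cluster in sample_to_cluster.items():
        clusters[cluster].append(sample)
    folds = [[] for _ in range(n_folds)]
    for members in clusters.values():
        rng.shuffle(members)
        members = members[:max_per_cluster]          # cap over-represented clusters
        target = min(range(n_folds), key=lambda i: len(folds[i]))  # smallest fold
        folds[target].extend(members)                # whole cluster goes to one fold
    return folds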
2.4.12 Bipartite edge perturbation and protein heavy atom importance score
calculation
Supplementary Fig. S4 schematically describes the bipartite edge perturbation process for calculating protein heavy-atom (say, atom $a$) importance scores. Briefly, the prediction is calculated twice: once (say, $Y_a$) while considering edges corresponding to the protein heavy atom, and again (say, $Y_{\sim a}$) while masking the same edges. This process results in differences in predictions, which can be calculated using the mean absolute difference measure. On their own, these values may not be meaningful, but they can be normalized to the 0-1 range by dividing by the maximum value within a structure. The normalized values, RI scores, signify how much the specificity prediction is influenced by interactions made by the corresponding heavy atom. Depending on the downstream use, RI scores can be aggregated at the residue level using average, max, or sum aggregations. Mathematically,

$$RI_a = \frac{\mathrm{MAE}(Y_a, Y_{\sim a})}{\max_{b \in \text{all atoms}} \mathrm{MAE}(Y_b, Y_{\sim b})} \qquad (2.7)$$
Computationally, this process is like measuring the effect of a deactivating mutation, which is why we hypothesized that, at a residue level, these scores could correlate with alanine scanning mutagenesis data. For comparison with alanine scanning mutagenesis experiments (Fig. 2.5h) at a residue level, the log-sum aggregated importance score was calculated. For each atom $a$ of a residue $r$ in the protein–DNA interface, let the calculated RI be $RI_a$. Then, this value is calculated as

$$\text{LogSum aggregated residue importance}(r) = \log_2\left(1 + \sum_{a \in r} RI_a\right)$$
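As an illustration of these definitions, the sketch below computes RI scores and their log-sum residue aggregation; `predict` and `mask_atom_edges` are hypothetical placeholders for the model call and the edge-masking step, not functions of the DeepPBS codebase.

# A minimal sketch of Eq. (2.7) and the log-sum residue aggregation.
import numpy as np

def relative_importance(model, graph, atoms, predict, mask_atom_edges):
    """Per-atom RI: normalized prediction change upon masking an atom's bipartite edges."""
    y_full = predict(model, graph)
    diffs = {a: np.abs(y_full - predict(model, mask_atom_edges(graph, a))).mean()
             for a in atoms}
    max_diff = max(diffs.values()) or 1.0            # avoid division by zero
    return {a: d / max_diff for a, d in diffs.items()}

def logsum_residue_importance(ri_scores, residue_atoms):
    """log2(1 + sum of atom RI scores) for each residue."""
    return {r: np.log2(1.0 + sum(ri_scores[a] for a in atoms))
            for r, atoms in residue_atoms.items()}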
Structure visualizations presented were produced using PyMOL2.5 [157].
2.4.13 Description of competitor assay for quantifying designed proteins’ binding
specificity
Glasscock et al. [111] used a yeast display assay to quantify binding of their designed proteins. The proteins were expressed by integrating the corresponding synthetic oligonucleotide into a yeast surface expression vector. Yeast cells expressing designed proteins on their surface were labeled with biotinylated dsDNA targets, streptavidin-phycoerythrin, and anti-c-Myc fluorescein isothiocyanate in a 96-well plate format, after which a binding signal was quantified on an Attune NxT flow cytometer. Excess addition of a competitor non-fluorescent target DNA reduces this binding signal. Thus, scanning single mutations at each position was possible through the competitor assay, producing the data shown in Fig. 2.6d,h,l,p.
2.4.14 DeepPBS webserver
DeepPBS is available as a webserver at https://deeppbs.usc.edu. The webserver provides the functionality of the DeepPBS method for predicting a PWM on the basis of the structure of a protein–DNA complex. The structure can be uploaded as a PDB or macromolecular Crystallographic Information File. The webserver provides documentation for users.
2.5 Discussion
Computationally identifying which DNA sequences a given protein will bind to remains a challenging question. Although proteins from certain DNA-binding families, such as homeodomain [98, 158, 159, 160] and C2H2 zinc finger proteins [98, 99, 117, 118, 161, 162], have been studied extensively in this regard, a generalized model of binding specificity remains elusive. This complexity emanates, in part, from the pivotal role that protein and DNA conformation or shape play in the context of binding specificity. For example, TBX5 undergoes an α- to $3_{10}$-helix conformational change when interacting with DNA. Despite the energy penalty, this transformation, in conjunction with an appropriately matching DNA shape, instigates a strong phenylalanine-sugar ring stacking, thereby facilitating binding [120]. Another example is the Trp repressor protein, which exhibits an almost entirely geometry-driven binding specificity. This protein only forms
direct and water-mediated H-bonds with the backbone phosphates [163], and the DNA shape
required for optimal binding gives rise to sequence specificity. Capturing such interactions and
how they lead to binding specificity with protein information alone is complicated and cannot
be understood in a sequence space alone [105, 164]. Furthermore, for many protein families, the
protein monomer is insufficient [38] for binding; a biological assembly, potentially with other
interaction partners [165], is often necessary.
DeepPBS achieves generality across protein families with the tradeoff of requiring a docked
sym-helix, representing a significant step toward solving the larger unsolved problem. As demonstrated in this work, coupling DeepPBS with methods that model protein–DNA complexes provides a substantial advance in predicting binding specificity across families, based solely on protein information.
DeepPBS allows exploration of exciting future possibilities, including the creation of DNA-targeted protein designs that could potentially contribute to therapeutic advancements. DeepPBS
could serve as a preliminary screening tool for devised candidate complexes, ensuring their
specificity to the intended target DNA sequence before any costly experimental validations.
Moreover, recent studies have shown that transcription factor–DNA binding can energetically
favor mismatched base pairs [166]. Given the combinatorial complexity of possible hypotheses,
deciding which DNA mismatch experiments to perform to discover more such instances poses a
significant challenge. Although there is currently a lack of training data for base-pair mismatches,
the DeepPBS architecture, in theory, could facilitate the prediction of mismatched base-pair
binding specificity. This approach could assist in deciding which experiments to conduct.
In summary, we have introduced a computational framework that distills the intricate structural nuances of protein−DNA binding and bridges this understanding with binding specificity
data, effectively connecting structure-determining and specificity-determining experiments. The DeepPBS architecture allows inspection of family-specific ‘groove readout’ and ‘shape readout’ patterns and their effects on binding specificity. Although structure prediction methods like RFNA [24], MELD-DNA [109] and AlphaFold3 [27] can predict a complex from given protein and DNA sequences, they cannot provide insights into binding specificity. The development of these computational methods for structure prediction expands the need for an approach like DeepPBS to
derive protein–DNA binding specificity. DeepPBS operates on predicted complexes to yield the
binding specificity of the system, thereby guiding the further improvement of modeling techniques
for protein–DNA complexes.
DeepPBS was trained only on experimentally determined structures and binding specificity
data. We have shown that it can be used to make predictions based on predicted structures. However, there is a concern of prediction error propagation, especially if this kind of prediction based on prediction is done iteratively. This arises from the simple fact that structure prediction methods (although becoming increasingly accurate) are not perfect. A similar concern also needs to be acknowledged if one is looking to train DeepPBS (or a similar model) on predicted complexes. However, there is still some advantage in doing so, specifically for the binding specificity prediction task. The advantage arises from having far more binding specificity data than structural data. This large surplus of experimentally determined binding specificity data can be used to inform the model. Training on predicted complexes with experimental specificity data thus could still have some advantages.
DeepPBS, despite its generality, exhibits performance comparable to the recently described
family-specific method rCLAMPS [98]. In addition to modeled complexes for biologically existing
systems, DeepPBS is also applicable to in silico synthetically designed proteins that target specific
DNA sequences.
DeepPBS-derived RI scores are biologically relevant. They can be aggregated at a protein
residue level, aligning with alanine scanning mutagenesis experimental data. Another advantage
of DeepPBS is its speed in predicting binding specificity. Specifically, DeepPBS only requires
a single forward call through the model (no required database search or multiple sequence
alignment computation), making it suitable for high-throughput applications such as analyzing
MD simulation trajectories (Supplementary Fig. S6). In this context, DeepPBS is robust to small
dynamical fluctuations and can respond to conformational changes.
The current version of DeepPBS has inherent limitations. It is tailored for double-stranded DNA
and is not yet applicable to single-stranded DNA, RNA or chemically modified bases. However,
there is potential for extending the model to accommodate these different scenarios as well as other
polymer–polymer interactions and potentially for mechanistic mutations. Further limitations
include data limitations, as discussed in Results. The DeepPBS architecture can be refined and
expanded in terms of applications and engineering enhancements. Collectively, these possibilities
hint at an exciting future for molecular interaction studies and computationally driven synthetic
biology.
2.6 Data availability
Datasets used for all analysis and associated custom scripts were deposited via figshare at
https://doi.org/10.6084/m9.figshare.25678053. Accession codes for discussed structures
from the PDB: 1L3L, 7CLI, 2R5Z, 1CIT, 1F4K, 1GJI, 1TC3, 2BSQ, 2C9L, 5ZGN, 1BBX, 1KLN,
1N5Y, 5YUZ, 1QAI, 1XC8, 6T8H, 4TUI, 1DH3, 7OH9 and 1APL. UniProt accession codes for
protein sequences discussed (folded with RFNA): Q8IUE0, Q6H878, O43680 and Q4H376. Accession codes for discussed experimental specificity data from JASPAR2022 and HOCOMOCOv11:
MA1897.1, MA1568.1, MA1031.1, MA1572.1, MA0112.2, MA0112.3, ESR1_HUMAN.H11MO.0 and NFKB2_HUMAN.H11MO.0.B. Mutagenesis experiment data used are available from the SAMPDI website (http://compbio.clemson.edu/media/download/SAMPDI_dataset.xlsx). MELD-DNA
modeled complex data were taken from Zenodo at https://doi.org/10.5281/zenodo.7501937.
Source data are provided with this paper.
2.7 Code availability
Installable source code, pretrained models, associated guidelines and various custom scripts can
be found via GitHub at https://github.com/timkartar/DeepPBS. e implementation is also
available via a Code Ocean capsule at https://doi.org/10.24433/CO.0545023.v2. In addition,
DeepPBS is accessible as a webserver through https://deeppbs.usc.edu.
Chapter 3
RNAscape: Geometric mapping and customizable visualization
of RNA structure
Abstract
Analyzing and visualizing the tertiary structure and complex interactions of RNA is
essential for being able to mechanistically decipher their molecular functions in vivo. Secondary structure visualization software can portray many aspects of RNA; however, these
layouts are often unable to preserve topological correspondence since they do not consider
tertiary interactions between different regions of an RNA molecule. Likewise, quaternary
interactions between two or more interacting RNA molecules are not considered in secondary
structure visualization tools. The RNAscape webserver produces visualizations that can preserve topological correspondence while remaining both visually intuitive and structurally insightful. RNAscape achieves this through a mathematical structural mapping algorithm which prioritizes the helical segments, reflecting their tertiary organization. Non-helical segments are mapped in a way that minimizes structural clutter. RNAscape runs a plotting script that is designed to generate publication-quality images. RNAscape natively supports non-standard nucleotides and multiple base-pairing annotation styles, and requires no programming experience. RNAscape can also be used to analyze RNA/DNA hybrid structures and DNA topologies, including G-quadruplexes. Users can upload their own three-dimensional structures or enter a Protein Data Bank (PDB) ID of an existing structure. The RNAscape webserver allows users to customize visualizations through various settings as desired. URL:
https://rnascape.usc.edu/.
3.1 Introduction
The structural diversity of RNA molecules influences their broad biological functions [167, 168, 169]. This diversity [170] is primarily driven by the ability of RNA to form complicated tertiary interactions, a plethora of non-standard base-pairing conformations, and quaternary interactions with other
RNA, DNA or protein molecules. Visualizing RNA in two dimensions poses the challenge of
capturing these complex interactions while remaining comprehensible and valuable to researchers.
One popular means of representing complicated RNA structures is through secondary structure diagrams. These two-dimensional (2D) diagrams are exclusively driven by base-pairing relationships and laid out in an abstract space. Extensive literature and software [171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181] describe secondary structure diagrams. However, these representations do not effectively capture tertiary molecular interactions, such as base pairing, stacking, and pseudoknot interactions. Therefore, although this approach scales relatively well
for large RNA sequences [171], not considering tertiary interactions can lead to a diagram far
from the biological structure and function. More specifically, nucleotides which are positioned
relatively close together in three-dimensional (3D) space may appear far away in the visualization.
Some tools promise to capture tertiary interactions [47, 182]. Of these tools, RNAView [47] is widely known and is a current standard linked in the Nucleic Acid Knowledge Base (NAKB) [183]. However, RNAView [47] lacks a webserver, requires a complicated setup and usage pipeline, and cannot handle some complex topologies, resulting in output that is not always interpretable or intuitive. Moreover, it is unable to provide publication-quality images. The only other available tool that retains tertiary interactions, RNAglib [182], is not deterministic and results in different outputs for repeat runs under the default configuration documented by the authors, which likely
explains why it has not been adopted by the field compared to RNAView. A description of different
tools which create various 2D diagrams of RNA molecules is provided in Table 3.1.
RNAscape addresses and overcomes the outlined issues and limitations of existing approaches
at several levels. The RNAscape algorithm includes a mapping process that conforms to the helical geometry of RNA structures. By doing so, it attempts to preserve the intuitive correspondence between the 2D mapping and 3D structure. At the same time, RNAscape optimizes each layout to place non-helical segments of the structure without sacrificing tertiary interactions. This
enables visualizations that are compact while remaining as visually intuitive as possible (Fig. 3.1,
Supplementary Fig. S13, S15).
The RNAscape webserver (Fig. 3.2) offers various customization options for its visualizations.
Users can zoom, pan, and rotate images directly on the webserver. In addition, one can easily
customize a plot with different base-pairing annotations [47, 137, 184], residue colors, nucleotide or
text-label sizes, and numbering schemas. RNAscape encourages users to iteratively refine an image.
In addition, RNAscape allows the user to modify the calculated map. Upon completion, RNAscape
visualizations can be exported to vector format (SVG) or image format (PNG), enabling further
refinements by the user. Both Protein Data Bank (PDB) and macromolecular Crystallographic
Information File (mmCIF) format files are supported to maximize compatibility. Additionally,
RNAscape can directly fetch structures (biological assembly 1) from the PDB [9] based on a given
PDB ID. RNAscape supports multiple base-pairing annotation conventions: Leontis-Westhof (LW)
[47], Saenger [184], DSSR (Dissecting the Spatial Structure of RNA) [137] and a no-annotation
option. Future updates to base-pairing conventions by the nucleic acid community can easily be
incorporated. Modified/non-standard nucleotides are denoted by a white circle and annotated
with a small letter code (based on its parent standard base or simply ‘x’ if this information is unavailable). This work was led by me, performed jointly with Ari S. Cohen, and supervised by
Prof. Remo Rohs.
Figure 3.1. RNAscape output for various structures from the PDB. The 3D structure at the top
of each panel is from the PDB structure, with its corresponding RNAscape visualization shown
below it. (A) tRNA from Sulfolobus tokodaii (PDB ID: 7VNV), (B) a single-stranded DNA molecule
(PDB ID: 4NOE), (C) Dengue virus RNA promoter (PDB ID: 7UMD), (D) Pistol ribozyme (PDB ID:
6R47), (E) Riboswitch from Escherichia coli (PDB ID: 1Y26), (F) Cobalamin riboswitch regulatory
element (PDB ID: 4FRN), (G) NAD-II riboswitch (PDB ID: 8HBA), (H) G-quadruplex (PDB ID:
2M18), (I) RNA kink-turn motif (PDB ID: 7EFG) and (J) the semi-symmetric peptidyl transferase
center (PTC) of the large ribosomal subunit of Deinococcus radiodurans (PDB ID: 1NKW), also
known as proto-ribosome [185]. The molecular structure in (J) is shown along the two-fold
pseudo-symmetry axis.
Method | Preserves 3D topology | Input | Webserver | Upload limit | Output | Framework needed
RNAscape [48] | Yes | .pdb, .cif | Yes | 50 MB | .png, .svg, .npz | Python (pip)
RNAview [47] | Yes | .pdb, .cif | No | N/A | .ps | C, make-based installation
RNAglib [182] | Semi (2.5D graphs) | .cif, compiled dataset by authors | No | N/A | .png, .svg | Python (pip)
RNAcanvas [171] | No (secondary structure) | dot-bracket/ss formats | Yes | N/A | .svg, .pptx, interactive GUI | Unavailable
Forna [179] | No (secondary structure, force directed) | dot-bracket, .pdb, .cif, .json | Yes | 2 MB | .png, .svg, .json, interactive GUI | JavaScript
VARNA [178] | No (secondary structure) | dot-bracket/ss formats | No | N/A | .eps, .svg, .xfig, .jpg, .png | Java
jVizRNA [175] | No (spring based) | dot-bracket/ss formats | No | N/A | interactive GUI | Java
PseudoViewer [177] | No (but considers pseudoknots) | dot-bracket/ss formats | Yes | N/A | .eps, .svg, .gif, .png, bracket view | Microsoft .NET Framework
XRNA (link) | No (secondary structure) | special program-specific format | No | N/A | .ps | Java
RNAViz [176] | No (secondary structure) | dot-bracket/ss formats, DCSE alignment | No | N/A | image format (exact unknown) | C, make-based installation
RNAPuzzler [174] | No (secondary structure) | dot-bracket/ss format (.txt) | No | N/A | image format (exact unknown) | C, make-based installation
R2R [173] | No (secondary structure) | Stockholm format (alignment) | No | N/A | .pdf, .svg | UNIX command
R2DT [172] | No (predicted secondary structure) | RNA sequence | Yes | N/A | .txt, .svg | Docker tool
RNArtist (link) | No (secondary structure) | dot-bracket/ss formats, .pdb | No | N/A | .svg | Java (jdeploy)
Ribosketch [180] | No (secondary structure, force directed) | dot-bracket/ss formats | Yes | N/A | .txt, .svg, interactive GUI | standalone installer
RNAvista [186] | No (prediction tool for secondary/tertiary structure) | FASTA or dot-bracket/ss formats | Yes | N/A | .svg, predicted 3D model (.pdb) | unavailable
Table 3.1. Description of various attributes of relevant tools which produce 2D visualizations of RNA. The first row corresponds to RNAscape. The next two rows correspond to two methods which incorporate tertiary interactions in their output mapping: RNAView [47] being the most frequently used, and RNAglib [182] being the most recent. The remaining rows indicate secondary structure drawing tools: Forna [179] being the most used (supports structure upload) and RNAcanvas [171] being the most recent (does not support structure upload). Abbreviations used: ss (secondary structure); GUI: Graphical User Interface; Dot-bracket: RNA sequence and secondary structure in dot-bracket format.
3.2 Materials and methods
3.2.1 Programming languages and general tools
The RNAscape webserver is a single-page web application. The backend (Figure 3.2A, B) is implemented in Python 3.9.18, and Django [187] is used to communicate with the backend. The frontend (Figure 3.2C) is designed in the React v18.2.0 framework and implemented in Hypertext Markup Language (HTML)/Cascading Style Sheets (CSS)/JavaScript.
3.2.2 The RNAscape algorithm
Upon upload, the structure file is sent via Hypertext Transfer Protocol Secure (HTTPS) to the
RNAscape webserver where backend processing occurs. If a user selects a PDB ID [9], its corresponding first biological assembly is downloaded by the backend (Figure 3.2A, B) for processing.
Pre-processing (Figure 3.3A). The DSSR program (v1.7.8) [137] is run on the structure file to
detect helices and base pairs, and assign base-pairing annotations.
Helical regions (Figure 3.3B). The positioning of helices, as well as non-helical regions, involves multiple considerations. The 3D coordinates of each nucleotide are represented by the centroid of the atoms belonging to it (i.e., for the $i$th nucleotide, $V_i = \frac{1}{|\mathrm{atoms}(i)|} \sum_{a \in \mathrm{atoms}(i)} [a_x, a_y, a_z]$). The set of all nucleotide centroids is a combination of two subsets (i.e., $V = V_H \cup V_{NH}$, with $V_H$ denoting helical regions and $V_{NH}$ denoting non-helical regions). Helical regions receive the highest priority and are placed in a way that reflects their spatial orientation while remaining visually intuitive. To do so, first, we run principal component analysis (PCA) exclusively on the helical segments ($V_H$) and project the points onto the plane determined by the first two components. In this process, the $|V_H| \times 3$ sized matrix $V_H$ is converted to a $|V_H| \times 2$ matrix (which we denote as $T$), which preserves the maximum spatial variance possible in two dimensions [188]. Next, we convert $T$ into a more visually intuitive ‘ladder’ representation, which first involves estimating a ladder axis in the projection plane for each helix. An initial estimate is made by connecting the centroids of the first and last base pairs of a helical region using a line segment. $T$ consists of multiple helical regions (i.e., $T = T_{H_1} \cup T_{H_2} \cup \ldots \cup T_{H_n}$). If the midpoint of a base pair $B \in T_{H_k}$ is $T^{B}_{H_k}$, the ladder axis for $T_{H_k}$ is the vector $L_{H_k} = T^{B_{last}}_{H_k} - T^{B_{first}}_{H_k}$, rooted at the point $T^{B_{first}}_{H_k}$.

However, for bent helices, this estimate may be imprecise. To account for this case, we measure the distance between the centroid of the helical projection and the midpoint of the estimated ladder axis (i.e., $d = \left|\mathrm{centroid}(T_{H_k}) - \left(T^{B_{first}}_{H_k} + \tfrac{1}{2} L_{H_k}\right)\right|$). If this distance is greater than 10 Å, we re-estimate the ladder axis as a combination of two line segments: one connecting the first and central base-pair centroids and another between the central and last base-pair centroids. In theory, this process can be performed recursively. In practice, however, we observe that doing so once suffices. Next, if two helical projections are within a certain distance threshold (i.e., $|T^{B_{first}}_{H_k} - T^{B_{last}}_{H_{k+1}}| < 20$ Å) and have similar orientations (i.e., $\cos^{-1}\!\left(\frac{L_{H_k}}{|L_{H_k}|} \cdot \frac{L_{H_{k+1}}}{|L_{H_{k+1}}|}\right) < \frac{\pi}{6}$), we merge them and recompute the ladder axis as described above. Next, we uniformly distribute the base pairs in the ‘ladder’ formation along each ladder axis. Finally, for cases where the projection of a helix is skewed, resulting in an overly cramped ladder representation, we lengthen the ladder to reduce visual clutter. The final mapping for nucleotide points in helical regions can be denoted as $P_H$.

Non-helical regions (Figure 3.3C). Loops are either preferentially bulged out in a radial curve or interpolated
linearly based on a spatial density threshold (see implementation in Data Availability), depending
on the chosen setting. We choose bulging by default to reduce graph overlap and crowding. For bulging out, the structure mapping algorithm computes potential layouts and performs greedy optimization to select an optimal layout. This optimization considers the total nearest-neighbor count (within 10 Å) of all members of a loop, and the orientation with the lowest number of neighbors is selected. Let us assume that the loop is connected to two nucleotides which are part of a helical region, mapped to positions $P^{i}_{H}, P^{j}_{H} \in P_H$. Two possible circular layouts are computed for the loop based on $P^{i}_{H}, P^{j}_{H}$: bulging out in the perpendicular directions $(P^{i}_{H} - P^{j}_{H}) \times Z$ (layout $L_{pos}$) and $-\left((P^{i}_{H} - P^{j}_{H}) \times Z\right)$ (layout $L_{neg}$), where $Z$ denotes the unit vector which is perpendicular to the mapping plane. In each case, the center of the layout remains at the point $(P^{i}_{H} + P^{j}_{H})/2$. The radius of the circular arc is either $|(P^{i}_{H} - P^{j}_{H})/2|$ or $|\sqrt{n} \times (P^{i}_{H} - P^{j}_{H})/2|$, if $|(P^{i}_{H} - P^{j}_{H})| < 3$ Å or $|(P^{i}_{H} - P^{j}_{H})/n| < 1.5$ Å, respectively, where $n$ is the number of points in the loop. Points are uniformly distributed on the circular arc. One of the two loop orientations is selected based on minimizing the neighbor count in helical segments as follows:

$$\mathop{\mathrm{argmin}}_{L \in \{L_{pos}, L_{neg}\}} \sum_{p \in L} \sum_{v \in P_H} \mathbb{I}\left[\,|v - p| < 5\ \text{Å}\,\right] \qquad (3.1)$$
Hanging single-stranded regions are linearly interpolated based on their connecting mapped helix. Additional adjustments are made for certain edge cases, such as when a linearly interpolated non-helix nucleotide exactly overlaps with another nucleotide (see implementation in Data Availability). Structures containing no helices (generally rare) are mapped solely using a PCA.
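The core idea of the helical mapping, PCA projection followed by straightening each helix into a ladder along the first-to-last base-pair axis, can be sketched as follows. This is not the RNAscape implementation; the random input centroids and the strand-pairing convention are assumptions for illustration only.

# A compact sketch of the PCA projection and ladder-axis layout described above.
import numpy as np

def pca_project_2d(points):
    """Project Nx3 points onto the plane of their first two principal components."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T                      # N x 2 projection T

def ladder_layout(bp_midpoints_2d):
    """Uniformly space base-pair midpoints along the first-to-last ladder axis."""
    start, end = bp_midpoints_2d[0], bp_midpoints_2d[-1]
    t = np.linspace(0.0, 1.0, len(bp_midpoints_2d))[:, None]
    return start + t * (end - start)

helix_centroids = np.random.rand(20, 3) * 30        # hypothetical helical centroids V_H
proj = pca_project_2d(helix_centroids)
bp_mid = 0.5 * (proj[:10] + proj[10:][::-1])        # pair strand i with reversed strand j
ladder = ladder_layout(bp_mid)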
Visualization (Figure 3.3C). The RNAscape backend utilizes the Matplotlib [189] and NetworkX [190] packages to plot visualizations. As input, the plotting algorithm requires the mapped points, base-pairing annotations, and user-selected visual settings for a structure. As output, it generates an image that is temporarily stored (up to 48 h) on the webserver and tied to a specific user session. Structure files are not stored. The image is served to the frontend via a Django [187] server, where it can be interacted with by the user. A user can also regenerate a plot with different visual settings.
In this case, we reuse the mapping output and rerun the visualization script, resulting in a faster
response time than the complete computation.
3.3 Results
3.3.1 Application of RNAscape to structures from the PDB
We present RNAscape output for various structures from the PDB [9] (Fig. 3.1, 3.4). In Fig. 3.1A, tRNA from Sulfolobus tokodaii (PDB ID: 7VNV) is shown. RNAscape output preserves the L-shaped topology (as opposed to the well-known ‘clover leaf’ shaped secondary structure [191] visualizations) and annotates non-standard bases and base-pairing geometries (critical in many RNA interactions [192]). RNAscape can also process unusual DNA structures, as shown by a single-stranded DNA
with circular topology (PDB ID: 4NOE, Fig. 3.1B). In Fig. 3.1C, Dengue virus RNA promoter (PDB
ID: 7UMD) is depicted, which is a single-stranded RNA molecule containing only standard RNA
bases.
We present a few different examples of ribozymes and riboswitches (Fig. 3.1D-G). RNA loop
modeling [193] for riboswitches is an important area of research, and RNAscape visualizations
(e.g. PDB IDs: 1Y26, 4FRN, 8HBA, Fig. 3.1E-G) may aid in these efforts. The pistol ribozyme (PDB
ID: 6R47, Fig. 3.1D) and the Nicotinamide Adenine Dinucleotide-II (NAD-II) riboswitch (PDB ID:
8HBA, Fig. 3.1G) illustrate how RNAscape places non-helical segments and can clearly depict
their non-standard base pairs with helical segments. RNAscape natively supports multiple strands
(e.g. PDB ID: 1Y26, Fig. 3.1E). RNAscape is also able to visualize G-quadruplexes (PDB ID: 2M18,
Fig. 3.1H). An RNA structural motif which can serve as a binding site for proteins is the kink-turn
motif (PDB ID: 7EFG) [194], and it is visualized in Fig. 3.1I.
There has been continued interest in structural studies of the ribosome which postulate the role of a proto-ribosome [185] in the origin of life. The proto-ribosome is a semi-symmetrical core of the ribosome composed of RNA molecules representing the site for peptide bond formation, therefore known as the peptidyl transferase center (PTC). The RNAscape visualization (Fig. 3.1J,
Figure 3.2. RNAscape overview. (A) RNAscape algorithm for geometric mapping. RNAscape
builds a 2D ladder representation for helical segments that closely adheres to their principal
component analysis (PCA) projection. It adds non-helical segments to this representation by
optimizing nearest neighbor counts. (B) RNAscape webserver for plotting flexible, publication-quality visualizations. A plotting script is used to plot the geometrically mapped points and incorporates desired user customizations. A user can also regenerate the plot with alternate settings and reuse a prior geometric mapping, saving time and compute power. (C) RNAscape frontend for user interaction and customization. RNAscape allows users to upload a file or input a PDB ID for processing and to adjust various settings including base-pairing annotations, residue
colors, nucleotide or text-label sizes and numbering schema. RNAscape also provides an associated
documentation. After processing, RNAscape supports interaction with output layouts including
zooming, rotating, panning, and modifying the map. Users can download layouts and geometric
mappings and view a log of non-Watson-Crick (non-WC) nucleotides.
Supplementary Fig. S13) of the same structure reflects the high degree of conformational symmetry, based
on structural coordinates of the PTC provided by Bose et al. [185].
RNAscape can run on relatively large structures (structures of up to 50 MB are processed
by the webserver). In Fig. 3.4, we demonstrate its application to four different topologies of
larger structures. In Fig. 3.4A, a triangular topology of Mycobacterium tuberculosis ileS T-box
in complex with tRNA (PDB ID: 6UFH, 244 nucleotides) is shown, followed by a diamond-like
topology of mutant P4-P6 domain of Tetrahymena thermophila group I intron (PDB ID: 1HR2, Fig.
3.4B, 157 nucleotides) and an exon free state of the Tetrahymena group I intron (PDB ID: 7R6N,
Fig. 3.4C, 354 nucleotides). Secondary structure representations will not resemble the structure at
all for many of these cases (e.g. stacked ladders, PDB ID: 7QDU, Fig. 3.4D, 552 nucleotides), while
RNAscape is able to reflect the 3D topology of these large RNA molecules.
3.3.2 RNAscape user interface
The RNAscape webserver (Fig. 3.2C) displays three primary items: header, file upload, and documentation panels. In the header, a user can click the ‘Run on Example Data’ button to view an example visualization (PDB ID: 3ZP8). In the file upload panel, a user can upload a structure using the file upload feature. This file may contain non-nucleic acid entities, which will be ignored. Alternatively, a user can directly input a PDB ID to load its corresponding first assembly file. Clicking the ‘Run’ button runs the RNAscape pipeline on the uploaded structure file or provided PDB ID (biological assembly 1). After running RNAscape for a structure, a user has the option to add a second structure for side-by-side viewing. We demonstrate this capability for two structures of tRNA molecules (PDB IDs: 8UPT and 8UPY, Supplementary Fig. S14), introduced by recent work [195] on the importance of tRNA shape. The documentation panel enables easy navigation
Figure 3.3. RNAscape algorithm. (A) Pre-processing of uploaded/fetched structures. Structure
files are obtained either through user upload or direct download from the PDB. DSSR is run on
each structure to detect helical segments and assign base-pairing annotations. (B) Geometric
mapping and post-processing of helical segments. For each helical segment, RNAscape estimates
the ladder axis by connecting the centroids of the starting and ending nucleotides with a line
segment. If a helix is bent, detected by a distance > 10 Å between the midpoint of the ladder axis and the corresponding helical segment’s centroid, the ladder axis is split into two. Helical projections within 20 Å and 30° are optionally merged. Base pairs are uniformly spaced along each ladder axis,
and cramped helices are lengthened. (C) Optimizing placement of non-helical regions followed
by creation of annotated visualization. Hanging single stranded regions are mapped to their
corresponding, connected helix and merged with the ladder. Loops are either preferentially bulged
out in a radial curve or interpolated linearly based on a spatial density threshold. Loop direction
is determined by minimizing the total nearest neighbor count of a given loop. Mapped points as
well as pairing and backbone annotations are passed to the plotting script to create a visualization.
and provides a quick start guide, tips, and examples for using RNAscape. It also includes detailed
explanations for configurable settings.
3.3.3 Output images
The frontend (Fig. 3.2C) natively supports touch-screen compatible image exploration. A user can zoom, center, or reset any zooming/panning via buttons above the display box. The image can also be rotated using a slider, and a ‘regenerate’ button is offered that replots the image, associated annotations, and user customizations in the desired rotation. To the right of the image, a legend is displayed that corresponds to the base-pairing annotation selected by the user. For the Saenger [184] base-pairing annotation, no legend is shown. The local strand direction (5′ to 3′) is indicated by the black arrows between nucleotides for all plots. Other interactions are shown in blue dotted lines. These colors are fully customizable by the user. The user also has
the option of downloading RNAscape mapped points in a numerical format (.npz) processable
by the NumPy [196] library. Additionally, a log is provided which contains a description of the
non-standard/modified nucleotides in the plot and other associated information.
3.3.4 Base-pairing annotations
RNAscape offers three base-pairing annotation styles: LW [47, 197], DSSR [137] and Saenger
[184]. All base-pairing annotations are calculated via DSSR, although any future updates to
these conventions by the nucleic acid community can be easily incorporated. Annotations do not
affect geometric mapping, and a user can forego an annotation altogether. The LW annotation contains two key parameters: bond orientation (cis/trans) and base edge type. Bond orientation is represented by a filled or unfilled marker. The edge types, Watson-Crick (W), Hoogsteen (H), or sugar (S), are represented by marker shapes (Fig. 3.1).

The DSSR style differs in that base edges are delineated by major groove (M), minor groove
(m), or Watson-Crick (W) edges. Bond orientation annotation is the same as in the LW [47,
197] annotation. DSSR also reports local strand orientation as a base-pairing annotation feature.
RNAscape always denotes local strand orientation by the backbone arrows (Fig. 3.1). Non-standard
pairings flagged as ‘not categorized’ by DSSR are not annotated. For the Saenger [184] annotation,
each bond type is represented by a number corresponding to its Roman numeral annotation.
3.3.5 Customizable settings
Several custom settings options are available (Fig. 3.2C). The Loop Bulging setting controls whether loops are bulged outwards or linearly interpolated (see Materials and Methods). Additionally, the post-processing step of merging proximate, similarly oriented ladders can be turned off (Fig. 3.3B). Since these settings affect the geometric mapping, a user must click ‘Run’ to run the pipeline again if they are changed. Arrow size, circle size, and circle label size affect nucleotide appearance. Base-pairing marker sizes can also be adjusted. Through the number settings, a user instructs RNAscape to label residue numbers in the numbering schema defined by the structure file. Color, size, frequency, and spacing of these labels can also be modified. Color settings allow a user to customize the color of each nucleotide type: A, C, G, U/T and X (non-standard nucleotides). Colors used to denote both backbone chain and non-chain interactions and markers can also be modified. Furthermore, RNAscape provides functionality to modify calculated maps. By clicking on the ‘Modify Mapping’ button, the user can move and adjust nucleotide locations to resolve, for instance, overlap, and regenerate the output.
3.4 Discussion
The RNAscape webserver produces customizable, publication-quality visualizations of nucleic
acid tertiary structure. It prioritizes the topology of a structure while striving to create a clean
and optimized output, and it is designed to minimize user effort. RNAscape significantly deviates
from any existing method in terms of its output quality, usability, and layout algorithm (Table 3.1,
Supplementary Fig. S13, S15). Users can refine visualizations on the webserver, and RNAscape
also supports non-standard nucleotides and various base-pairing annotations. Further updates to
base-pairing conventions may be easily incorporated. The RNAscape webserver allows a maximum
file size of 50 MB. While potentially informative, the output for extremely large structures may not
be well suited for presentation. We provide the RNAscape implementation via GitHub (see Data
Availability) for those inclined to try the pipeline locally on even larger structures. We conclude
with the hope that our effort facilitates advancement of the ever-growing field of RNA biology.
3.5 Data availability
RNAscape is freely available for all users at https://rnascape.usc.edu/. The backend implementation is also available on GitHub at https://github.com/timkartar/RNAscape and
preserved through figshare at https://doi.org/10.6084/m9.figshare.25201889.
Figure 3.4. RNAscape output for large-size structures from the PDB. The 3D structure at the
top of each panel is from the PDB structure, with its corresponding RNAscape visualization
shown below it. (A) Mycobacterium tuberculosis T-box in complex with tRNA (PDB ID: 6UFH, 244
nucleotides), (B) Mutant P4-P6 domain (DELC209) of Tetrahymena thermophila group I intron
(PDB ID: 1HR2, 157 nucleotides), (C) Exon-free state of the Tetrahymena group I intron (PDB ID:
7R6N, 354 nucleotides), and (D) Twist-corrected RNA origami 5-helix Tile A (PDB ID: 7QDU, 552
nucleotides).
Chapter 4
DNAproDB: an updated database for the automated and
interactive analysis of protein–DNA complexes
Abstract
DNAproDB (https://dnaprodb.usc.edu/) is a widely used database, visualization tool,
and processing pipeline for analyzing the structural features of protein–DNA interactions.
Here we present a substantially updated version through additional data and functionalities.
It contains an expanded volume of pre-analyzed protein–DNA structures, which will now be
automatically updated weekly. The analysis pipeline now identifies water-mediated hydrogen bonds, and modified visualizations of protein–DNA interactions incorporate this data. Tertiary structure-aware nucleotide layouts are now available. New file formats and external database annotations are supported. The website has been aesthetically modernized, and interaction with graphs and data is more intuitive. We also present a statistical analysis of the updated collection of structures, revealing salient patterns in protein–DNA interactions.
4.1 Introduction
Protein–DNA interactions play crucial roles in essential cellular functions like gene regulation,
genome packaging, and DNA replication [84, 198]. Diverse recognition mechanisms underlie these
interactions [8, 40, 88, 105]. Atomic resolution structures of protein–DNA complexes available
in the Protein Data Bank (PDB) [35] have been invaluable for understanding these readout
mechanisms and provide insights that relate them to function. As a computational resource which extensively analyzes such structures and presents their data in publication-quality representations, the DNAproDB web server [49] and database [50] have been a useful resource for biologists, recognized by tool libraries such as the Nucleic Acid Knowledge Base (NAKB) [183]. This update improves the DNAproDB analysis pipeline, output data presentation, and web interface (Fig. 4.1).
The updated analysis pipeline now computes annotations of water-mediated hydrogen bonds,
which are known to play an important role [199] in protein–DNA recognition and, in some cases,
a very prominent one [163]. Also, the pipeline now automatically processes and incorporates
new PDB structures weekly. The primary interface visualization, ‘Residue contact map,’ now
allows users to select a mapping algorithm for nucleic acid layout. In addition to secondary
structure-based mapping [200], tertiary-structure aware mapping [48] is now available. Binding
specificity data for transcription factors cataloged in the JASPAR2024 [201] database has been
integrated. Users can now upload structures in the macromolecular Crystallographic Information
File (mmCIF) format and download interface visualizations in an editable figure format. More
information regarding these updates, as well as quality-of-life and user-interface improvements,
is described in the following sections. We analyzed the expanded DNAproDB structure collection
for salient features of protein–DNA interactions (Fig. 4.2). These results (based on a larger sample size in this update) reaffirm previous statistics presented about DNA minor groove recognition [8] and patterns of amino acid-base stacking for single-stranded DNA presented in [50]. Additionally,
we present and discuss examples of the newly added water-mediated hydrogen bond annotations
in selected structures (Fig. 4.3). DNAproDB has been used by experimental biologists to upload,
analyze, and present interface visualizations in their work [202]. We developed this update to assist
their efforts, likely leading to additional contributions from the scientific community. We want to
emphasize the increased utility of DNAproDB in light of structure prediction tools like AlphaFold3
[27], RoseTTAFoldNA [24], and RoseTTAFold-AA [26], and binding specificity prediction tools
including DeepPBS [46] and rCLAMPS [98]. These computational tools hint towards a promising
future of protein–DNA structure prediction and design [111]. We expect that DNAproDB will be
Figure 4.1. Key aspects of this update to DNAproDB. (A) Automatic update and separated
external annotation incorporation scheme. (B) Different nucleic acid layout options, with added
tertiary structure aware RNAscape layout, shown for PDB ID: 3LDY. (C) Water-mediated hydrogen
bond annotation. (D) Various improvements in other aspects of DNAproDB.
an invaluable tool and assist such efforts. This work was led by me, with contributions from Ari S. Cohen, Dr. Jared M. Sagendorf, and Prof. Helen Berman, and supervised by Prof. Remo Rohs.
4.2 Update details
4.2.1 Processing pipeline and data update
At the time of its previous release [50], DNAproDB contained a static collection of structures.
is resulted in newly released structures being unavailable. In this update, we have addressed
79
this limitation by implementing an automatic update pipeline (Fig. 4.1A). Every weekend, the
pipeline queries the PDB for newly released structures, downloads and processes them, and
adds them to the DNAproDB collection. In addition, the structure processing pipeline has been
separated from any external annotation dependency. is allows external annotations to be
updated without reprocessing each structure or affecting the user experience. Annotations from
the JASPAR2024 [201] database (incorporating the most recent binding specificity matrix ID and
logo) have been included whenever applicable. e asymmetric unit molecular weight cutoff,
which determines whether a structure is included in the collection, has been expanded from
250 to 1500 kDa, increasing the structures available for analysis. The latest collection size as of
June 7th, 2024, is 6,731 structures. This set has been analyzed and was included in the results
presented in Fig. 4.2. Originally, a large part of the processing pipeline was written using Python
2 [203]. However, support for Python 2 is no longer provided by the open-source community as
of January 1, 2020, and Python 3 [204] has become the new standard. We redesigned the backend
processing pipeline to ensure compatibility with Python 3. As an expansion of available features,
water-mediated hydrogen bonds between protein and DNA have been calculated and annotated
within this update. The program HBPLUS [205], with the ‘-h’ option set to 3 Å and the ‘-d’ option
set to 3.5 Å, and with the remaining parameters set as default, is used to detect hydrogen bonds.
Custom scripts were written to determine water-mediated interactions via shared water molecules
between hydrogen-bonded pairs (see Data Availability).
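The shared-water logic behind these custom scripts can be illustrated with a short sketch. The following is a minimal example, assuming the HBPLUS output has already been parsed into (atom, atom) hydrogen-bond pairs; the predicate names (`is_water`, `is_protein`, `is_dna`) are illustrative placeholders and do not correspond to the actual DNAproDB code.

```python
from collections import defaultdict
from itertools import product

def water_mediated_bridges(hbonds, is_water, is_protein, is_dna):
    """Find protein-DNA contacts bridged by a shared water molecule.

    hbonds: iterable of (atom_a, atom_b) pairs parsed from HBPLUS output.
    is_water / is_protein / is_dna: predicates on an atom identifier.
    Returns a list of (protein_atom, water_id, dna_atom) triples.
    """
    # Collect, for every water, the protein and DNA atoms it hydrogen-bonds to.
    water_to_protein = defaultdict(set)
    water_to_dna = defaultdict(set)
    for a, b in hbonds:
        for x, y in ((a, b), (b, a)):
            if is_water(x):
                if is_protein(y):
                    water_to_protein[x].add(y)
                elif is_dna(y):
                    water_to_dna[x].add(y)

    # A water shared by at least one protein atom and one DNA atom
    # defines a water-mediated hydrogen bond between them.
    bridges = []
    for w in set(water_to_protein) & set(water_to_dna):
        for p_atom, d_atom in product(water_to_protein[w], water_to_dna[w]):
            bridges.append((p_atom, w, d_atom))
    return bridges
```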
4.2.2 Visualization
We updated the ‘Residue contact map’ and ‘3D structure’ (Fig. 4.1B) visualizations presented in
DNAproDB in several ways. e nucleic acid backbone color used in these components has been
changed to a more visually pleasing metallic blue-gray color, compared to the previously used
yellow-orange color. In addition to the previous secondary-structure-based and circular layouts,
an RNAscape [48]-based layout for placing nucleic acids has been computed and added to the
‘Residue contact map’. This new layout is more representative of tertiary structure compared to
the other two representations (Fig. 4.1B). An option to switch between these different layouts
is available. During this update, some Python 2 version utilities for secondary structure-based
layout computation were discontinued. We replaced these utilities with analogous Python 3
versions provided by the ‘Forgi’ [206] package. Water-mediated hydrogen bonds have now been
incorporated as an interaction edge in the ‘Residue contact map’. These are indicated by a black
circle (Fig. 4.1C) in the interaction map. Hovering over the water-mediated contacts will present
further information (e.g., residue number of the water molecule involved). An option to hide these
interactions is also available. The ‘3D structure’ component now displays the solvent alongside the
structure (in a previous version, the solvent was removed). A button to hide the solvent is included.
4.2.3 Web interface and user experience
Since its inception, we have continuously provided support for DNAproDB users and taken note
of their feedback. In this update, we redesigned the web interface based on this information (Fig.
4.1D). Textual clutter has been reduced on the home page and report pages for each structure.
Instructions and explanations for different components, which were previously written directly
on the page, are now available as pop-up components upon mouse hover. Report pages for each
PDB entry now prominently display the title of the entry. The information tables have been
rearranged in a modern and tabular fashion, resulting in further decluttering of information.
DNAproDB offers many customization features for the Residue contact map. However, these
options were often overlooked by users due to their non-prominent placement on the website. We
have redesigned the user interface to make basic options like rotation, zooming, download, and
switching between the layout algorithms easily accessible directly above the visualization. Buttons
to access further customization options (‘Chart options’ and ‘Interface selection’) are prominently
placed. The options within the ‘Chart options’ tab have been expanded. Within the ‘Interface
selection’ tab, basic options (model, entity, chain, moiety selection) are shown first. Additional
options are presented as advanced options. Mouse-based interaction controls for the ‘3D viewer’
and ‘Residue contact map’ have been made as analogous as possible. The download option
now supports the editable Scalable Vector Graphics (SVG) format. DNAproDB currently displays
Watson-Crick, Hoogsteen, and other base-pairing geometries via correspondingly stylized base-pairing edges (e.g., Hoogsteen base-pairing in the p53 tetramer–DNA complex [40] reflected in Fig.
4.3E). For additional analysis of non-Watson-Crick base-pairing geometries, a link to the RNAscape
[48] webserver has been included in each report page. Clicking this link will redirect the user to
the RNAscape website and automatically run it on the desired structure.
4.2.4 Quantitative analysis of readout features
Entries in the DNAproDB collection (as of June 7th, 2024) encompass protein–DNA structures
including single-stranded DNA (ssDNA), double-stranded DNA (dsDNA), and other conformations
(e.g., G-quadruplex). We quantified the growth of such entries over time based on their PDB
release dates, which reflects an exponential trend (Fig. 4.2A). Fewer entries contain ssDNA and
other conformations compared to dsDNA. However, recent years (2016 onwards) demonstrate
a steady growth also in ssDNA entries (Fig. 4.2A). Studies on protein-DNA structures have
revealed consistent patterns in protein residue–DNA interaction frequencies [8, 207]. We sought
to quantify similar statistics in the updated collection of DNAproDB. To this end, we computed
relative abundances of different amino acids interacting with the major groove (Fig. 4.2B), minor
groove (Fig. 4.2C), and phosphodiester backbone (Fig. 4.2D). Relative abundance for a residue (R)
is the fraction of appearance of this protein residue in an interaction with a DNA moiety relative
to other residues.
\[
\text{Relative abundance}(R) = \frac{|\text{interactions involving } R|}{\sum_{R} |\text{interactions involving } R|} \tag{4.1}
\]
This is computed separately for the major groove, minor groove, and DNA backbone. Each of
these values in Fig. 4.2B-E is further subdivided into fractions per DNA base, shown in four
colors. For the major groove, we see an abundance of residues able to perform recognition via
Figure 4.2. Quantitative analysis of protein–DNA complexes in the DNAproDB collection. (A)
PDB release years of structures catalogued in the updated DNAproDB collection (as of June 7th,
2024). (B-D) Relative abundance of different amino acids interacting with DNA major groove (B),
minor groove (C), and phosphodiester backbone (D). (E) Conditional probabilities of different
protein residues and bases forming a stacking geometry. The y-axis represents summed values over
the bases for each amino acid. (F-H) Counts of interactions with different bases, categorized by
major and minor groove, for secondary structure classes: helix (includes α/3₁₀/π-helix) (F), sheet
(β-sheet) (G), and loop residues (H).
hydrogen bonds, with arginine (Arg) and lysine (Lys) residues showing the greatest presence
(Fig. 4.2B). For the minor groove, this preference for arginine and lysine is even stronger relative
to other residues (Fig. 4.2C). This agrees with the observation that the minor groove is more
electronegative [8], favoring positively charged amino acid sidechains while repelling negatively
charged sidechains (e.g., aspartic acid (Asp), glutamic acid (Glu) etc.). For amino acid residues
(R) with a planar side chain component (i.e., able to form a stacking interaction with a base (B ∈
[A,C,G,T]) in single-stranded DNA), interaction geometries (g) can be of three types: g ∈ [stack,
pseudo pair, other]. Stacking conditionals P(g=stack | R, B) were computed for major and minor
groove interactions as a fraction of the counts of stack geometry against counts for all geometries.
i.e.,
\[
P(g=\text{stack} \mid R, B) = \frac{|g=\text{stack}, R, B|}{\sum_{g} |g, R, B|} \tag{4.2}
\]
This information is presented in Fig. 4.2E in the form of a stacked bar chart. The total height of each stacked bar (i.e., for each amino acid) is $\sum_{B} P(g=\text{stack} \mid R, B)$. The pattern visible in this data conforms with the previously computed version in [50] while encompassing a larger sample size.
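The two statistics above can be computed directly from a flat list of interaction records. The sketch below is a minimal illustration of Eqs. 4.1 and 4.2, assuming each record carries a residue name, a base identity, and a geometry label; the record format is hypothetical and not the DNAproDB export schema.

```python
from collections import Counter

def relative_abundance(interactions):
    """Eq. 4.1: fraction of contacts contributed by each residue type.

    interactions: list of dicts with at least a 'residue' key
    (passed in separately per DNA moiety, e.g. all major-groove contacts).
    """
    counts = Counter(rec["residue"] for rec in interactions)
    total = sum(counts.values())
    return {res: n / total for res, n in counts.items()}

def stacking_conditional(interactions):
    """Eq. 4.2: P(geometry = 'stack' | residue, base)."""
    all_counts = Counter((rec["residue"], rec["base"]) for rec in interactions)
    stack_counts = Counter(
        (rec["residue"], rec["base"])
        for rec in interactions
        if rec["geometry"] == "stack"
    )
    return {key: stack_counts[key] / n for key, n in all_counts.items()}

# Example usage on a small hypothetical record list:
major_groove = [
    {"residue": "ARG", "base": "G", "geometry": "pseudo_pair"},
    {"residue": "ARG", "base": "G", "geometry": "stack"},
    {"residue": "TYR", "base": "A", "geometry": "stack"},
]
print(relative_abundance(major_groove))    # {'ARG': ~0.67, 'TYR': ~0.33}
print(stacking_conditional(major_groove))  # {('ARG','G'): 0.5, ('TYR','A'): 1.0}
```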
DNAproDB also provides annotations and a visualization (‘Helical contact map’) reflecting how
various secondary structure elements of a protein interact with the major and minor groove of DNA.
We quantified these interactions to reveal statistical patterns (Fig. 4.2F-H). We compute instances
of helical secondary structures (including α-helices, 3₁₀-helices, and π-helices) interacting with
the four primary DNA bases in either the major or minor groove (Fig. 4.2F). There is a clear
preference for the major groove for protein helices, reflecting the use of a recognition helix by
many protein families [91]. On the other hand, for β-sheets, major and minor groove interactions
are comparable in number, with a slight preference for the major groove (Fig. 4.2G). The ‘loop’
category reflects residues appearing in loop regions of proteins interacting with DNA. Minor
groove interactions are slightly more favored in this case (Fig. 4.2H). In all cases, guanine (G) is
the most favored DNA base that is contacted.
4.2.5 Water-mediated hydrogen bonds
As described previously, the updated DNAproDB processing pipeline detects and visually annotates
(Fig. 4.1C) water-mediated hydrogen bond interactions between protein and DNA. This feature
improves the accuracy and relevance of the DNAproDB visualization for some structures. For
example, the co-crystal structure of the Trp repressor/operator complex (PDB ID: 1TRO, Fig. 4.3A,
Residue contact map: Fig. 4.3B) reflects a protein–DNA recognition scheme without any direct
hydrogen bonds in the major and minor groove. Instead, DNA recognition occurs via water-mediated hydrogen bonds (Fig. 4.3B) [163]. A detailed view of two protein backbone nitrogen
atoms (belonging to Ile79 and Ala80) recognizing G11 in this manner is presented in Fig. 4.3C. This
type of recognition scheme was previously not reflected in DNAproDB. Similarly, protein residues
interacting with DNA only through water-mediated hydrogen bonds were also not displayed
in the Residue contact map. One such example is the p53 tetramer structure (PDB ID: 3KZ8
[40], Fig. 4.3D, Residue contact map: Fig. 4.3E). is structure illustrates serine residues (Ser121)
near the tetramerization interfaces involved in water-mediated hydrogen bonds with the major
groove edge of two G bases (shown for one selected base in Fig. 4.3F). As this is the sole mode
of interaction for these two residues, they were omitted from the visualization in the previous
DNAproDB version [50]. In this update, these interactions are correctly shown. A variety of
complex interaction geometries are possible when water-mediated hydrogen bonds are involved.
One such example can be found in interactions of the RXR/RAR DNA-binding domain heterodimer
in complex with the retinoic acid response element (PDB ID: 1DSZ [208], Fig. 4.3G, Residue contact
map: Fig. 4.3H). The lysine residue (Lys1260) is involved in recognizing consecutive bases (G and
T) through water-mediated hydrogen bonds involving two different water molecules. This update
to DNAproDB allows exploring such recognition schemes.
Figure 4.3. Selected examples of water-mediated hydrogen bond annotations as reflected in the updated DNAproDB. (A-C) Trp repressor/operator complex (PDB ID: 1TRO). (D-F) p53 tetramer with Hoogsteen base pairs (PDB ID: 3KZ8). (G-I) RXR-RAR DNA-binding complex (PDB ID: 1DSZ). In each of the three cases, the 3D structure of the respective complex is shown in (A, D, G). The DNAproDB Residue contact map is shown (with only selected protein residues annotated) in (B, E, H). Atomic views of selected water-mediated hydrogen bond interactions are shown in (C, F, I), respectively.
4.3 Discussion
DNAproDB, since its inception in 2017 [49], has been a valuable resource for the structural
biology community. Its comprehensive analysis pipeline, covering diverse aspects of protein–DNA
binding, outputs data that can be readily used in downstream analysis by the user [50]. DNAproDB
also provides interactive and publication-quality visualizations. In this update, we improved
DNAproDB in multiple aspects. New structures released since the last update in 2019 [50] have
been incorporated, resulting in a much larger collection. e pipeline has been future-proofed via
the new automatic update feature. e backend implementation has been upgraded to Python
3, ensuring a long-lasting lifespan for DNAproDB. A key scientific improvement in the analysis
pipeline is the incorporation of water-mediated hydrogen bond calculation. Interest in watermediated interactions has been growing. is is evidenced by the CASP16 challenge for predicting
solvent shells around the Tetrahymena ribozyme structure [209]. Currently, these interactions
are not well modeled by structure prediction and analysis tools [23, 24, 26, 27, 45, 46]. We
expect that this added feature in DNAproDB advances the field in the understanding of readout
mechanisms. Visualizations have been improved by enabling tertiary structure-aware nucleic
acid layouts, incorporation of water-mediated hydrogen bond indicators, greater customizability,
and other aesthetic changes. The website style has been redesigned, and data presentation has
been improved. Structure files in mmCIF format can now be uploaded, which was previously
unsupported. Altogether, these updates result in an improved DNAproDB, which we expect to
continue serving the structural biology community for the foreseeable future.
4.3.1 Data availability
DNAproDB is freely available for all users at https://dnaprosb.usc.edu/. The previous version remains available at https://dnaprosb.usc.edu/v1/. The pipeline and frontend implementations are available via GitHub: https://github.com/timkartar/DNAproDB; https://github.com/ariscohen/DNAproDB frontend
Chapter 5
RNAproDB: a webserver and interactive database for analyzing
protein–RNA interactions
Abstract
We present RNAproDB (https://rohslab.usc.edu/rnaprodb/), a new webserver, analysis pipeline, database and highly interactive visualization tool designed for protein-RNA
complexes and applicable to all forms of nucleic acid-containing structures. The RNAproDB analysis of a structure involves computing several mapping schemes for nucleic acid components and presenting protein-RNA interactions appropriately. Various structural annotations are computed, which include non-standard base-pairing interactions, hydrogen bonds, and protein-RNA and RNA-RNA water-mediated hydrogen bonds. This information is presented in the
form of interconnected Sequence viewer, 3D viewer, Interface explorer, Secondary structure
selector and tabular data. A novel feature, subgraph selection, is also implemented, which
facilitates studying individual components of complex structures. We hope RNAproDB will
be highly useful for analyzing and exploring not only experimentally determined, but also
predicted or designed protein-nucleic acid complexes.
5.1 Introduction
Structural complexity of RNA molecules is vast and so are their modes of interaction with proteins
[210]. Although there has been an ever-expanding repertoire of structural data on the PDB,
there is a lack of data resources which systematically analyze these interactions. In addition,
recent advances in artificial intelligence have made high throughput prediction of protein-RNA
complex structures viable [27, 211]. RNAproDB is a webserver where a user can upload a protein-RNA complex and explore analyzed results covering a multitude of aspects (e.g., direct and
water-mediated hydrogen bonds, base-pairing annotations, nucleotide modifications, secondary
structural features). Compared to existing resources, which often focus on the RNA structure
alone and provide a particular way of visualization [47, 179], or use a representation which is
very coarse-grained [212], RNAproDB provides three different algorithms for visualizing the RNA
topology along with the interacting protein residues: a new design based on partial projection of the
structure (RNA and interacting protein residues), tertiary structure aware mapping [48], secondary
structure-based mapping [179]. This information is presented via a highly interactive interface
explorer combined with sequence and 3D-structure viewers, and a secondary structure selector.
Tabular data is also available. Another novel functionality, subgraph exploration, allows a user to
explore parts of the structure (e.g., a particular junction region) via the interface explorer. Subgraph
selections can be manually entered or automatically selected from the secondary structure selector.
To the best of our knowledge, no existing tool offers such a capability. In addition to the webserver, we offer
pre-analyzed structures containing RNA, from the PDB, in the form of a searchable collection.
With a cutoff of 10,000 monomers per biological assembly and a molecular weight cutoff of 800
kDa on the asymmetric unit, the initial collection of RNAproDB provides around 3,500 biological
assemblies containing RNA molecules. RNAproDB will be automatically updated weekly with
new PDB entries. We believe RNAproDB will be a valuable resource for the scientific community
interested in RNA biology, protein-nucleic acid interaction and function, structure prediction,
and drug design. This work is led by me, with contributions from Ari S. Cohen, Hirad Hosseini,
rotation students at the Rohs Lab, Prof. Helen Berman and supervised by Prof. Remo Rohs.
Figure 5.1. Multiple mapping algorithms available in RNAproDB for protein-RNA complex
(PDB ID: 1IVS). (A) Crystal structure of the valyl-tRNA synthetase bound to tRNA (PDB ID: 1IVS).
(B) Mapping produced based on partial projection. (C) Mapping produced based on RNAscape
algorithm. (D) Mapping produced by applying ViennaRNA secondary structure layout algorithm.
5.2 Processing pipeline
The RNAproDB processing pipeline starts with a structure and computes multiple visualizations and interaction information from it. As part of the processing pipeline, multiple software tools are run, including X3DNA-DSSR ([213]; computes base-pairing geometries, protein-RNA hydrogen bonds, and RNA secondary structure), HBPLUS ([205]; to compute hydrogen bonds involving water molecules), RNAscape ([48]; tertiary structure-aware nucleotide mapping), ViennaRNA ([200]; secondary structure-based nucleotide mapping), and DSSP ([214, 215]; computes protein secondary structure). In addition, custom code has been developed to compute a new “Partial projection”-based mapping for the nucleotides. The processing pipeline combines all these data into a graph object, which is used to compute highly interactive frontend presentations of the structure. This frontend presentation is an intertwined experience of an Interface explorer, a 3D viewer, a Secondary structure selector, a Sequence viewer, and Tabular data. In the next sections, we describe the
functionalities implemented through these components.
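Conceptually, combining the per-tool outputs into a single graph object can be sketched as follows. This is a minimal illustration using networkx; the dictionary fields and edge attributes are assumptions for this example and not the actual RNAproDB data model.

```python
import networkx as nx

def build_interface_graph(nucleotides, protein_residues, base_pairs, hbonds):
    """Combine per-tool annotations into a single graph object.

    nucleotides / protein_residues: {node_id: attribute_dict}
    base_pairs: [(nt_i, nt_j, geometry), ...]   e.g. parsed from DSSR
    hbonds: [(residue, nt, kind), ...]          kind in {"direct", "water-mediated"}
    """
    g = nx.Graph()
    for nt_id, attrs in nucleotides.items():
        g.add_node(nt_id, kind="nucleotide", **attrs)
    for res_id, attrs in protein_residues.items():
        g.add_node(res_id, kind="protein", **attrs)

    # Base-pairing edges define the RNA topology used by the layouts.
    for i, j, geometry in base_pairs:
        g.add_edge(i, j, kind="base_pair", geometry=geometry)

    # Protein-RNA edges carry the interaction type for the frontend.
    for res_id, nt_id, kind in hbonds:
        g.add_edge(res_id, nt_id, kind=kind)
    return g
```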
5.3 Interface explorer
The ‘Interface explorer’ for RNAproDB is freshly designed to present an interaction graph of the RNA structure along with interacting protein residues. The explorer presents three different layout algorithm options, selectable by the user. Two of these options are secondary structure-based (computed using ViennaRNA [200]) and tertiary structure-aware mapping (computed using RNAscape [48]). The third (default) option is a new mapping scheme produced by projecting the RNA residues and only the interacting protein residues onto the 2D plane that maximizes their spatial variance [188]. We call this method “Partial projection” (instead of the whole structure, it only projects the RNA and interacting protein components). In our observation, this scheme is visually intuitive for exploring the corresponding 3D structure. The user is free to choose between the three layouts. In each case, the protein residues are placed using a force-directed layout scheme [216].
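The partial projection itself amounts to a principal component analysis restricted to the RNA nucleotides and the interacting protein residues. Below is a minimal sketch, assuming residue centroid coordinates have already been extracted; scikit-learn is used here for illustration, and the production implementation may differ.

```python
import numpy as np
from sklearn.decomposition import PCA

def partial_projection(rna_coords, protein_coords):
    """Project RNA and interacting-protein centroids onto the 2D plane
    of maximal spatial variance (first two principal components).

    rna_coords, protein_coords: (N, 3) arrays of residue centroid coordinates.
    Returns 2D layout coordinates for the RNA and protein nodes.
    """
    coords = np.vstack([rna_coords, protein_coords])
    pca = PCA(n_components=2)
    xy = pca.fit_transform(coords)   # fit on RNA + interface residues only
    return xy[: len(rna_coords)], xy[len(rna_coords):]
```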
We demonstrate the three layout schemes for a valyl-tRNA synthetase-tRNA complex (PDB ID:
1IVS) (Fig. 5.1A). The partial projection-based layout is shown in Fig. 5.1B. This layout reflects the helical turns, making the best correspondence with the 3D structure. The RNAscape [48] layout (Fig. 5.1C) is cleaner and more suitable for users accustomed to a ladder-like representation. The secondary structure-based layout, shown in Fig. 5.1D, although familiar, suffers from tertiary interactions within the RNA and protein-RNA interactions criss-crossing the view. Options to turn
off tertiary interaction edges and protein interactions have been implemented to help with this
situation, in case a user wants a clean view of the secondary structure. This is demonstrated in Fig. 5.2A-C. Another useful feature of the interface explorer is the ability to change the distance threshold of visible protein-RNA interactions on the fly. This threshold can be modified using a slider, and upon modification, edges whose distances fall beyond the threshold are hidden. The protein-RNA interactions shown were computed with an atom–atom cutoff distance of 6 Å, except for water-mediated hydrogen bond edges, which can have a longer distance. A protein residue interacting only with the major groove side or minor groove side (as in the standard Watson-Crick conformation) of a base is assigned a special color, whereas alternative possibilities are left gray. In addition, protein-RNA interaction edges have a distance-dependent opacity setting, making interactions that are further away less prominent, thereby reducing visual clutter. The distance threshold control provided with the interface explorer operates on the centroid-centroid distance between a protein residue and a nucleotide (0-15 Å). At any point of the exploration, the user is
able to download the visualization as a static picture, a scalable vector graphic format image or
download the corresponding graphical data.
5.4 Sequence viewer and 3D viewer
In addition to the interface explorer, we also provide a 3D structure viewer and sequence viewer
to facilitate the exploration process (Fig. S30). The sequence viewer lists sequences for each chain
in the structure. A chain can be selected from a drop-down menu, resulting in the corresponding
sequence being displayed. Selecting a residue/nucleotide from the sequence viewer highlights the
corresponding residue/nucleotide in the Interface explorer. In addition, the 3D viewer will also
zoom and orient to focus on the specific residue/nucleotide, which is now shown in a ball-stick
view. If the subgraph selection dialogue box is open, this will also populate the dialogue box
with the ID of the selected residue/nucleotide. Residues/nucleotides can also be selected from the
Interface explorer (using single click, or multi-select using shift+click). Doing so will highlight
them in the 3D viewer also, in a similar manner. Right clicking on consecutive atoms on the 3D
viewer allows for visualizing measurements also. For the 3D viewer, buons to show/hide carton
representation and solvent is available.
5.5 Secondary structure selector and subgraph exploration
A more coarse-grained diagram is also computed as part of the processing pipeline for RNAproDB. This diagram reflects each secondary structure element as one node; interaction edges corresponding to residues of one secondary structure element interacting with another are collapsed into a single edge. This view serves as a coarser representation whose components are selectable by the user, and is named the “Secondary structure selector”. An example is shown for PDB ID: 1UN6 (Fig. 5.2D). The corresponding Secondary structure selector is shown in Fig. 5.2E. The partial projection-based layout for this structure is presented in Fig. 5.2F.
Whenever a user clicks a particular node (e.g., ‘Stem 3’ in Fig. 5.2E), the corresponding nucleic acid residue IDs are populated into the subgraph generation dialogue. The user can then click the “Generate subgraph” button to generate and explore the subgraph (up to first-order neighbors) via the interface explorer (Fig. 5.2G). Beyond clicking the secondary structure selector, a subgraph selection can also be manually entered or selected from the Sequence viewer or Interface explorer.
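The first-order neighbor expansion underlying subgraph generation is straightforward on the interface graph. Below is a minimal sketch assuming a networkx representation of the interface; the node identifiers in the usage comment are hypothetical.

```python
import networkx as nx

def first_order_subgraph(g, selected_nucleotides):
    """Return the subgraph induced by a selected secondary structure element
    plus its first-order neighbors (e.g. interacting protein residues)."""
    nodes = set(selected_nucleotides)
    for n in selected_nucleotides:
        nodes.update(g.neighbors(n))
    return g.subgraph(nodes).copy()

# Example (placeholder IDs for nucleotides in a selected stem):
# sub = first_order_subgraph(interface_graph, ["A.12", "A.13", "A.25", "A.26"])
```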
Figure 5.2. Novel interactive capabilities of the RNAproDB user interface. (A) Secondary structure-based layout of a protein-RNA complex (PDB ID: 1IVS). (B) Updated layout with tertiary RNA interactions turned off. (C) Updated layout with protein interactions turned off. (D) 3D structure of a zinc finger protein-RNA complex (PDB ID: 1UN6). (E) Corresponding secondary structure selector; a user can click on any node (example: Stem 3, circled). (F) Partial projection-based layout for PDB ID: 1UN6. (G) Generated subgraph based on the user-selected secondary structure element.
5.6 Search functionalities and Upload
The pre-analyzed collection of RNAproDB contains 3500+ protein-RNA structures. In addition, RNA structures without proteins, as well as DNA- and NA-hybrid-containing structures, are also included.
A known PDB ID can be directly searched using the quick search box available at the top-right
location of the website. More refined searches can be performed from the “Search” page. The search page allows search queries to be entered, which can be author names or keywords related
to the structures of interest. Additional filters can be set, which include molecule type (polymer
entity type: RNA/DNA/NA-hybrid), experimental modality, resolution range, publication year,
number of NA polymers, number of protein polymers, molecular weight of the biological assembly.
The search results will be presented in a tabular view. Optionally, a card view of the search results is also available. The resulting PDB IDs can be copied to the clipboard, or the data can be downloaded in
comma-separated value (CSV) or JavaScript object notation (JSON) format. An example search
output page for the keyword search “tetrahymena” is shown in Fig. S29.
Structure files in mmCIF format can be uploaded to the RNAproDB webserver through the
upload page. Upon upload, the server will process the structure and create a custom page with the
results. We hope this feature will be exceedingly valuable for people looking to analyze predicted
complex structures (e.g., from AlphaFold3 [27]) of proteins and nucleic acids. An example output page for the AlphaFold3-predicted structure (model 0) for the molecules in PDB ID: 8AW3 is shown in Fig.
S30. An example of Interface explorer layouts for a CAS9-DNA-RNA complex is also presented in
Fig. S31.
5.7 RNA-RNA water-mediated interactions
Non-canonical base-pairing is very common in RNA structures and often influences the structural organization of the molecule and its interactions with other molecules [217]. Varied combinations of base pairings often lead to sub-optimal direct hydrogen bonds between paired or adjacent bases.
Water molecules appear to compensate for such situations by making “base-pairing” possible through water-mediated hydrogen bonds. One such example is the CUG repeat structure from PDB ID: 7Y2B ([218], Fig. 5.3A). The U-U mismatches in this structure are often unable to form direct hydrogen bonds (specifically, the central U-U mismatch forms no direct H-bond). Therefore, DSSR ([213]) does not consider it a base pair. However, two water molecules form water-mediated hydrogen bonds between the two U's. This has an effect on the structure, as the U-U mismatch region now has a similar base-pair width as the C-G paired regions (Fig. 5.3D), which would otherwise be expected to be narrower.
RNAproDB computes RNA-RNA water-mediated hydrogen bonds to reflect such information in the interface explorer (Fig. 5.3B), which would otherwise be overlooked. We demonstrate the molecular conformation of the central U-U mismatch (Fig. 5.3C) and an off-center U-U mismatch (Fig. 5.3D).
5.8 Discussion
After working on RNAscape [48] and the DNAproDB [49, 50] update, we realized the lack of
a modern software interface for structural analysis of protein-RNA interactions. DNAproDB,
although an excellent tool, is geared very specifically towards DNA in many aspects. This precludes it from being applicable to nucleic acid structures in general. Here, we developed RNAproDB, targeting a broader class of protein-nucleic acid structures. This required innovating new ways of presenting the data and the interface explorer (e.g., the partial projection layout and subgraph exploration coupled with the secondary structure selector). RNAproDB provides access to NA-hybrid structures like Cas9 bound to target DNA and guide RNA [219] and structures related
to the newly developed Bridge editing technique for genome editing [43], which previously lacked
an interface for analysis and interactive visualization. RNAproDB is also suitable for analyzing
predicted complexes by recent advances like AlphaFold3 [27]. With the extensive list of novel
Figure 5.3. Example illustration of an RNA-RNA water-mediated hydrogen bond facilitating a non-Watson-Crick interaction. (A) 3D structure of PDB ID: 7Y2B. (B) RNAproDB interface explorer layout (RNAscape algorithm) for PDB ID: 7Y2B. The central U-U is considered unpaired by DSSR, but computing RNA-RNA water-mediated hydrogen bonds reveals interactions between the two bases. (C) Zoomed-in view of the central U-U interaction, with two water molecules facilitating water-mediated hydrogen bonds. (D) Zoomed-in view of an off-center U-U interaction, with one water molecule facilitating a water-mediated hydrogen bond in addition to a direct hydrogen bond.
capabilities described in this chapter, RNAproDB is the most modern tool for analyzing protein-nucleic acid structures that requires zero programming experience from the user. Further features, such as electrostatics of protein-RNA interfaces, are also currently in the works. We hope
RNAproDB serves as a valuable service and database for the structural and cellular biology
community.
5.9 Data availability
RNAproDB is freely available for all users at https://rohslab.usc.edu/rnaprodb/. The pipeline and frontend implementations are available via GitHub: https://github.com/timkartar/rnaprodb dev, and https://github.com/ariscohen/rnaprodb frontend.
Chapter 6
Tangential research: Generative modeling of gene expression
time series data
Research described in this chapter was supervised by Dr. Adam MacLean as part of a rotation in
his laboratory.
Abstract
Methods to model dynamic changes in gene expression at a genome-wide level are not
currently sufficient for large (temporally rich or single-cell) datasets. Variational autoencoders
offer means to characterize large datasets and have been used effectively to characterize
features of single-cell datasets. Here we extend these methods for use with gene expression
time series data. We present RVAgene: a recurrent variational autoencoder to model gene
expression dynamics. RVAgene learns to accurately and efficiently reconstruct temporal gene
profiles. It also learns a low dimensional representation of the data via a recurrent encoder
network that can be used for biological feature discovery, and from which we can generate new
gene expression data by sampling the latent space. We test RVAgene on simulated and real
biological datasets, including embryonic stem cell differentiation and kidney injury response
dynamics. In all cases, RVAgene accurately reconstructed complex gene expression temporal
profiles. Via cross validation, we show that a low-error latent space representation can be learnt
using only a fraction of the data. Through clustering and gene ontology term enrichment
analysis on the latent space, we demonstrate the potential of RVAgene for unsupervised
discovery. In particular, RVAgene identifies new programs of shared gene regulation of Lox
family genes in response to kidney injury.
6.1 Introduction
Dynamic changes in gene expression control the transcriptional state of a cell, and are responsible
for modulating cellular states and fates. Gene expression dynamics are in turn controlled by
cell-internal and external signaling networks. Despite the noisiness of gene expression in single
cells [220], over time or over populations of cells, predictable patterns emerge. Here we address
the challenge of classifying and predicting gene expression dynamics across large groups of genes.
Machine learning (and deep learning in particular) has led to recent advances in our ability to
explain or predict biological phenomena [221]. Deep learning modeling via autoencoders [222]
and variational autoencoders [223] has been central to progress in the field. Autoencoders learn
two functions: one to encode each input data point to a low dimensional point, and another
(the decoder) to reconstruct the original data point from the low dimensional representation.
Variational autoencoders (VAEs) build on this architecture and instead encode input data points
as distributions; VAEs are less prone to overfitting and can offer meaningful representations of
biological features in the latent space [224].
Single-cell mRNA sequencing (scRNA-seq) data present appealing sources of data for deep
learning models, given their size and complexity [225]. Deep learning models have been used to
analyze scRNA-seq data and address a variety of challenges. Autoencoders have been developed
to perform noise removal/batch correction [226, 227, 228], imputation [229], and visualization &
clustering [230]. VAEs have been developed for the visualization and clustering of scRNA-seq data
[231, 232], and can provide a broad framework for generative modeling of scRNA-seq data [233]:
scVI can be used for batch correction, clustering, visualization, and differential expression testing.
The methods described above for single-cell data analysis by deep learning focus primarily on
cell-centric tasks; here we are interested in gene-centric inference. Particularly, we are interested
in characterizing dynamic changes in gene expression. These can be either changes with respect to real time or “pseudotime,” the latter referring to the ordering of single cells along an axis
describing a dynamic cell process such as development or stem cell differentiation (see methods
overview in [234]). We can interpret any scRNA-seq data as gene expression time series data,
given an appropriate underlying temporal process, either in terms of real (experimental) time
(low resolution: around 2-20 data points) or pseudotime (high resolution: 10^3-10^6 data points). McDowell, Manandhar, Vockley, Schmid, Reddy, and Engelhardt [235] introduced a non-parametric hierarchical Bayesian method (DPGP) to model such data. Using a Gaussian process
to cluster temporal gene profiles and a Dirichlet process to generate the Gaussian processes,
DPGP offers powerful and intuitive means with which to cluster gene expression time series data.
However, since learning Gaussian processes is equivalent to a fully agnostic search in function
space, training DPGP is computationally intensive and difficult to parallelize.
Clustering relies on strong assumptions about the underlying structure of the data. Even for
methods that move away from hard clustering towards probabilistic methods for cell type assignment [236, 237], assumptions remain, and under certain conditions a continuous representation of the data may be better. Here we take such an approach, and seek to find a low-dimensional
representation of the data, on which further analyses (including but not limited to clustering)
can be performed. VAEs are an obvious choice, given their success on other scRNA-seq analysis
tasks, but modeling temporal changes with a feed-forward VAE would be equivalent to a fully
agnostic search, similar to learning a Gaussian process. Recurrent networks offer well-established
architectures for learning sequential and temporal data, and have been successfully combined
with VAEs [238]. We use a recurrent network architecture to take advantage of the structure in
the data.
We introduce a recurrent variational autoencoder for modeling gene dynamics from scRNA-seq
data (RVAgene). RVAgene learns two functions during training, parameterized by encoder and
decoder networks. The encoder network projects the training data into the latent space (we use 2 or 3 dimensions in order to visualize, though there are no inherent limits). The decoder network
learns a reconstruction of training genes from their latent representation. RVAgene facilitates clustering or other characterization of gene profiles in the latent space. By sampling points from
the latent space and decoding them, RVAgene provides means to generate new gene expression
time series data, drawn from the biological process that was encoded. Overall, RVAgene serves as
a multipurpose generative model for exploring gene expression time-series data.
The remainder of this chapter is structured as follows: we next present methodological details and
development of RVAgene. We produce a synthetic gene expression time-series dataset with innate
cluster structure, and demonstrate the accuracy of RVAgene on these data. We then explore two
biological datasets with RVAgene: a scRNA-seq dataset on stem cell differentiation over pseudotime,
on which we demonstrate the advantages of RVAgene over alternative approaches; and a bulk
RNA-seq dataset describing dynamic responses to kidney injury, on which we demonstrate the
potential for biological discovery. We also present evidence for the efficiency and scalability
of RVAgene, and we conclude by discussing its key features and limitations, in light of recent
advances in machine learning that will pave the way for future work in these directions.
6.2 Methods
We develop a recurrent variational autoencoder to model gene expression dynamics (RVAgene).
Here we briefly describe the methods underpinning variational autoencoders, and present the
implementation of RVAgene.
6.2.1 Variational inference and variational autoencoders
In the most general setting of a Bayesian model, we seek to learn the latent variables z that best
characterize some data x. Given a generative process that draws latent variables from a prior
distribution, p(z), and a likelihood of the data observed that is given by p(x|z), then the posterior
probability is given by Bayes rule:
\[
p(z \mid x) = \frac{p(x \mid z)\, p(z)}{\int_{z} p(x \mid z)\, p(z)\, dz}. \tag{6.1}
\]
The denominator is often intractable, making it difficult to estimate p(z|x). Markov Chain Monte
Carlo methods provide means to estimate posterior probability distributions. An alternative
method to estimate hard-to-compute probability distributions is Variational Inference (VI) [239],
which starts from the assumption that the posterior can be approximated by a distribution q(z)
from the family Q. VI then amounts to an optimization problem to find the q* that minimizes the Kullback-Leibler (KL) divergence between the approximation and the true posterior:
\[
q^{*}(z) = \operatorname*{argmin}_{q(z) \in Q} \, \mathrm{KL}\big(q(z)\,\|\,p(z \mid x)\big). \tag{6.2}
\]
Much recent effort has gone into solving VI problems in different settings [240, 241, 242].
VI can be framed as solving an optimization problem over function families: neural networks
are popular candidates for representing and learning complex functions. VI was incorporated
into autoencoders [223] to create the architecture of a variational autoencoder (VAE). A VAE
consists of an encoder network to approximate p(z|x) through a function qx(z), and a decoder
network p(x|z) (fig. 6.1). Conceptually, the encoder solves an inference problem: approximating
the posterior distribution p(z|x) as some q*_x(z), while the decoder solves a reconstruction problem: defining a generative process for p(x|z), given the latent variables. The VAE posterior is modeled by a multivariate normal N(µ,Σ) of the same dimension as z. Training then comes down to minimizing two objective functions. For the encoder network, which should learn a “well-distributed” latent space, minimize the KL divergence: KL(N(µ,Σ)||N(0,I)). For the decoder network, which should reconstruct the inputs x from the latent space, minimizing either an L1 or L2 objective function with respect to x̂ is appropriate. The use of the KL divergence and an L2
objective solves the VI formulation of Eq. 6.2 [223]; however, an L1 objective may be preferred in
practice, e.g. in cases where we want to suppress the effects of outliers on the structure of z [243].
6.2.2 RVAgene: A recurrent variational autoencoder to model gene expression
dynamics
Following the VAE architecture, RVAgene consists of an encoder and a decoder network with a
reparameterization step in between. To incorporate the knowledge that we are modeling temporal
data, recurrent neural networks offer an ideal architecture to use for both the encoder and the
decoder networks. Recurrent and VAE networks have been successfully combined elsewhere, e.g.
for textual [244] and time series data [245].
The architecture of RVAgene is based on Fabius and Amersfoort [238]. An input sequence (i.e., gene) x ∈ X, x = (x_1, x_2, ..., x_t, ..., x_T), is encoded using a recurrent function described by a long short-term memory (LSTM) unit. LSTM units are the state of the art in recurrent architectures, since they are robust against the vanishing gradient problem for longer sequences, unlike other recurrent units (see details in Hochreiter and Schmidhuber [246]). We encode x in the following manner:
\[
h^{\mathrm{enc}}_{t+1} = \mathrm{LSTM}\!\left(W_{\mathrm{enc}}^{T} h^{\mathrm{enc}}_{t} + W_{\mathrm{inp}}^{T} x_{t} + b_{\mathrm{enc}}\right), \tag{6.3}
\]
where W_enc, W_inp, and b_enc are network weight parameters, and the hidden states h_t represent information shared over timepoints in the LSTM. The dimension of the h_t (and W_enc) is given by a hyperparameter (“hidden size”). The encoded h_{T+1} is used to parametrize the posterior mean and variance from x, with mean µ_z and diagonal covariance σ_z as:
\[
\mu_z = W_{\mu}^{T} h^{\mathrm{enc}}_{T+1} + b_{\mu}, \qquad \log(\sigma_z) = W_{\sigma}^{T} h^{\mathrm{enc}}_{T+1} + b_{\sigma}. \tag{6.4}
\]
We then use the reparameterization step described in Kingma and Welling [223] to sample z from
the distribution:
\[
z = \mu_z + \varepsilon \sigma_z, \tag{6.5}
\]
where, for known ε, backpropagation through the sampling step is possible while training the
network.
For the decoder network, the first state h1 is calculated from z, and the recurrent formulation
follows by reconstructing x as x̂ = (x̂_1, x̂_2, ..., x̂_t, ..., x̂_T), thus:
\[
\begin{aligned}
h^{\mathrm{dec}}_{1} &= \mathrm{sigm}\!\left(W_{z}^{T} z + b_{z}\right) \\
h^{\mathrm{dec}}_{t+1} &= \mathrm{LSTM}\!\left(W_{\mathrm{dec}}^{T} h^{\mathrm{dec}}_{t} + W_{\mathrm{out}}^{T} \hat{x}_{t} + b_{\mathrm{dec}}\right) \\
\hat{x}_{t} &= \mathrm{sigm}\!\left(W_{\mathrm{out}}^{T} h^{\mathrm{dec}}_{t} + b_{\mathrm{out}}\right),
\end{aligned} \tag{6.6}
\]
where $\mathrm{sigm}(u) = \frac{1}{1+e^{-u}}$ is the sigmoid activation function, and $(W_i, b_i)$ are the network weight
parameters. A schematic diagram of the network is shown in fig. 6.1A, which can now be trained
using backpropagation, to minimize the objective function:
\[
\mathcal{L}(\theta, x) = D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu_z, \Sigma_z)\,\|\,\mathcal{N}(0, I)\right) + |x - \hat{x}|, \tag{6.7}
\]
where µ_z and Σ_z = diag(σ_z) are calculated from x by the encoder.
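As a concrete illustration of Eqs. 6.3-6.7, the following is a minimal PyTorch sketch of such a recurrent VAE. The layer shapes, the feeding of x̂_t back into the decoder, and the single-feature-per-timestep framing are assumptions made for clarity; this is not the published RVAgene implementation.

```python
import torch
import torch.nn as nn

class RVAgeneSketch(nn.Module):
    """Minimal recurrent VAE following Eqs. 6.3-6.7 (illustrative, not the released code)."""

    def __init__(self, seq_len, hidden_size=70, latent_dim=3):
        super().__init__()
        self.seq_len = seq_len
        # Encoder LSTM consumes one expression value per time point (Eq. 6.3).
        self.encoder = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.to_mu = nn.Linear(hidden_size, latent_dim)       # Eq. 6.4
        self.to_logvar = nn.Linear(hidden_size, latent_dim)   # Eq. 6.4
        # Decoder maps z to an initial hidden state and unrolls for T steps (Eq. 6.6).
        self.z_to_h = nn.Linear(latent_dim, hidden_size)
        self.decoder = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.to_x = nn.Linear(hidden_size, 1)

    def decode(self, z):
        h = torch.sigmoid(self.z_to_h(z)).unsqueeze(0)        # h^dec_1 from z
        c = torch.zeros_like(h)
        x_hat_t = torch.zeros(z.size(0), 1, 1, device=z.device)
        outputs = []
        for _ in range(self.seq_len):
            out, (h, c) = self.decoder(x_hat_t, (h, c))
            x_hat_t = torch.sigmoid(self.to_x(out))           # x̂_t, fed back in
            outputs.append(x_hat_t)
        return torch.cat(outputs, dim=1)                      # (batch, T, 1)

    def forward(self, x):                                     # x: (batch, T, 1)
        _, (h_enc, _) = self.encoder(x)                       # final hidden state
        mu = self.to_mu(h_enc[-1])
        logvar = self.to_logvar(h_enc[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # Eq. 6.5
        return self.decode(z), mu, logvar

def rvagene_loss(x, x_hat, mu, logvar):
    """Eq. 6.7: KL divergence to N(0, I) plus an L1 reconstruction term."""
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return kl + torch.abs(x - x_hat).sum()
```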
To evaluate the accuracy of RVAgene, we need an appropriate error measure. For each gene in
the test set, we calculate the L1 reconstruction error between generated data xˆ and true data x,
averaged over all time points. We normalize the data to lie in [0,1] to avoid skewing the error by
differences in gene expression magnitudes. Thus we define:
\[
\text{Reconstruction error}(x, \hat{x}) = \frac{1}{T} \sum_{t} \left| s(\hat{x})_t - s(x)_t \right|, \quad \text{where } s(x) = \frac{x}{\sum_{t=1}^{T} x_t}. \tag{6.8}
\]
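A direct numpy translation of Eq. 6.8 is given below as a minimal sketch; `x` and `x_hat` are assumed to be 1D arrays over the T time points.

```python
import numpy as np

def reconstruction_error(x, x_hat):
    """Eq. 6.8: mean L1 distance between sum-normalized temporal profiles."""
    s = lambda v: v / v.sum()   # normalization removes magnitude differences
    return np.mean(np.abs(s(x_hat) - s(x)))
```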
6.2.3 Generating synthetic gene expression time series data
To test RVAgene, we generate a synthetic time series dataset. Six clusters each containing 20 genes
are simulated, where for each cluster c, the mean gene expression time series Y_c = (y_{c1}, y_{c2}, ..., y_{ct}) was generated using addition or convolution and rescaling of two random sinusoidal functions of the form k_1 sin(k_2 t), where k_1, k_2 are randomly chosen positive integers. Trajectories of cluster
members were then generated by sampling from the multivariate normal N(Y_c, Σ_c). We model Σ_c as the positive definite matrix αY_cY_c^T, where α is a scaling factor; we use α = 1/|Y_cY_c^T|. As defined, Σ_c will describe nonzero correlations for all pairs of time points (t_i, t_j). This is unrealistic, so we set to 0 the entries of Σ_c for which column and row indices have a difference of more than some threshold T (we used T = 50), reflecting the fact that correlations between time points are lost over larger time windows (temporal correlations are local). Note that under this condition, Σ_c is no longer necessarily positive definite. The multivariate Gaussian sampler
numpy.random.multivariate_normal() implemented in numpy [196] was used to sample from
this augmented Σc. After generating a simulated dataset by this process, we also added Gaussian
noise, drawn from N (0,0.7), to the simulated dataset to produce an additional dataset exhibiting
higher levels of noise.
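The simulation procedure can be sketched as follows. The addition-only mean profile, the reading of |Y_cY_c^T| as an absolute element sum, and the banding step are simplifying assumptions for illustration; the original simulation also used convolution and rescaling.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_cluster(T=100, n_genes=20, band=50, noise_sd=0.0):
    """Simulate one cluster: a random sinusoidal mean plus correlated deviations."""
    t = np.linspace(0, 2 * np.pi, T)
    k1, k2, k3, k4 = rng.integers(1, 5, size=4)
    Yc = k1 * np.sin(k2 * t) + k3 * np.sin(k4 * t)            # cluster mean profile

    cov = np.outer(Yc, Yc)
    cov /= np.abs(cov).sum()                                  # alpha * Yc Yc^T
    i, j = np.indices(cov.shape)
    cov[np.abs(i - j) > band] = 0.0                           # keep temporal correlations local

    # The banded matrix may not be positive definite; sample anyway, as in the text.
    genes = rng.multivariate_normal(Yc, cov, size=n_genes, check_valid="ignore")
    if noise_sd > 0:
        genes += rng.normal(0.0, noise_sd, size=genes.shape)  # optional extra noise
    return genes

data = np.vstack([simulate_cluster() for _ in range(6)])      # 6 clusters x 20 genes
```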
6.3 Results
6.3.1 RVAgene can accurately and efficiently reconstruct temporal profiles
from synthetic data
We generated a dataset of 120 genes using convolutions of sinusoidal functions (see Methods) to
test the ability of RVAgene (fig. 6.1A) to learn and reconstruct noisy nonlinear temporal profiles.
An RVAgene model was trained on all 120 genes from 6 clusters with a hidden size of 70 and a 3
dimensional latent space. The model was trained for 400 epochs, after which the average batch objective function L indicates convergence (fig. 6.1B), producing a three-dimensional latent space
representation (fig. 6.1C). K-means clustering on the latent space (k=6) identified well-separated
clusters (fig. 6.1D).
RVAgene modeling followed by k-means clustering on the latent space identified 6 clusters
with perfect fidelity between predicted and true clusters. One might reasonably ask, why use a
neural network for this task? Simpler dimensionality reduction methods (e.g. PCA, t-SNE, or a
non-variational autoencoder) would also find the correct solution. RVAgene has the advantage
over these methods that the underlying structure of the latent space leads to interpretability. A
point in reduced PCA or t-SNE space that does not overlap with a data point is not interpretable.
Traditional autoencoders lack regularity in the latent space, i.e. even for a representation with
arbitrary accuracy (a reconstruction error of zero), decoding a point that does not correspond to a
training data point can result in nonsensical generated data, even if the decoded point is arbitrarily
close to a training data point. Variational autoencoders remedy this by learning a regularized or smoother distribution on the latent space. In this sense, the KL-divergence term in the VAE loss function can be thought of as a regularizer. This property enables RVAgene to generate new gene expression dynamics by decoding points from different regions of the latent space, which have properties similar to those of nearby clusters.
Figure 6.1. Unsupervised representation learning with RVAgene using synthetic data. (A)
Schematic diagram of the RVAgene model. (B) Average loss function L over the duration of training. (C) Latent space representation learnt by the RVAgene model after training. (D) Clusters detected by k-means clustering on the latent space, with k = 6. (E) First and third rows show input
training data used (20 simulated genes in each of six clusters); cluster means shown in black.
Second and fourth rows show the model-generated data, obtained by sampling and decoding
points from the latent space; decoded cluster empirical means shown in black.
To demonstrate the generative properties of the RVAgene latent space, we sample points from
multivariate Normal distributions, centered on the empirical mean of each cluster with variance of
0.4, i.e., N(µ_c, 0.4I), where µ_c is the empirical mean of the cluster and I is the identity matrix in R^3.
Corresponding to each cluster, we sample 20 points in the latent space, and use the decoder network
to generate new time series data (fig. 6.1E). Most of the points sampled generate trajectories that
belong to the correct cluster. Moreover, we identify cases corresponding to transitions between
clusters. For example, some points sampled near Cluster 2 generate trajectories that are similar to
members of Cluster 4, and vice versa. This makes sense due to the similarity between the temporal
profiles of Clusters 2 and 4. A similar correspondence is observed between Clusters 1 and 5. We
note that in a few cases the generated data have profiles that differ from their cluster of origin
and appear most similar to those of another cluster. This occurs when points are sampled close
to neighboring clusters, e.g. the red line for Cluster 5 in fig. 6.1E has been sampled from a point
close to cluster 3. We also observe some generated trajectories that display intermediate profiles
between two or more clusters: the decoder function learnt by RVAgene is smooth, and gives rise
to meaningful representations of points across regions of the latent space.
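This generation step can be sketched by sampling around a cluster's empirical latent mean and decoding, here reusing the `decode` method of the recurrent VAE sketch from Section 6.2.2 (the interface is an assumption for illustration, not the published code).

```python
import torch

def generate_from_cluster(model, cluster_latents, n_samples=20, var=0.4):
    """Sample z ~ N(mu_c, var * I) around a cluster's empirical latent mean
    and decode the samples into new expression trajectories."""
    mu_c = cluster_latents.mean(dim=0)                          # empirical cluster mean
    z = mu_c + (var ** 0.5) * torch.randn(n_samples, mu_c.numel())
    with torch.no_grad():
        return model.decode(z)                                  # (n_samples, T, 1)
```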
RVAgene offers additional functionality as a tool for removing noise from the data. Via sampling
and decoding points from the latent space, RVAgene reconstructs trajectories that are smooth
and de-noised relative to the input data (fig. 6.1E, Fig. S16). Similar neural network approaches
have been proposed to denoise single-cell data, e.g., using a deep count autoencoder [227]. RVAgene provides data denoising as a by-product of its primary functionality: learning patterns
of dynamic gene expression.
To investigate the impact of input noise levels on RVAgene performance, we added Gaussian
noise drawn from N (0,0.7) to the simulated data to produce a dataset with higher overall noise
levels. RVAgene learns a latent space (shown in Fig. S16A) from which six clusters are identified by k-means clustering (Fig. S16B). It is notable that the clusters identified in the latent space are not as clear in this case as for lower noise levels (fig. 6.1); however, RVAgene can still reconstruct
the distinct profiles with high confidence. To illustrate this, we plot the original training data
alongside model-generated data, sampled at random points in the latent space from N (µ,0.4I)
around each cluster mean µ for each of the 6 clusters (Fig. S16C). From these simulations, RVAgene
appears able to separate even relatively high levels of noise from the signal, in order to learn a
smooth encoding and corresponding generative process for distinct temporal patterns.
It is inevitably challenging to include sufficient dimensionality and variation in synthetic
datasets to accurately capture biological processes such as those we observe in experimental
datasets. Thus, in the subsequent two sections, we test the capabilities of RVAgene on two whole-genome biological datasets: embryonic stem cell differentiation, and kidney injury response. As
we will see, in these cases it may not be possible to characterize the latent space by simple (e.g.
k-means) clustering; we need to use other means to gain insight into the features of the latent
space.
6.3.2 RVAgene modeling of pseudotemporally ordered data during embryonic
stem cell differentiation
We applied RVAgene to model gene expression dynamics during embryonic stem cell (ESC)
differentiation. Klein, Mazutis, Akartuna, Tallapragada, Veres, Li, Peshkin, Weitz, and Kirschner
[247] identified 732 differentially expressed genes over the time course of mouse ESC differentiation
following leukemia inhibitory factor (LIF) withdrawal. Data are gathered at four time points: 0, 2, 4, and 7 days after LIF withdrawal (Table S2 in Klein, Mazutis, Akartuna, Tallapragada, Veres, Li, Peshkin, Weitz, and Kirschner [247]). We ordered the data (2717 single cells) using diffusion
pseudotime (DPT), which provides robust methods for the reconstruction of single-cell temporal
processes [248]. The root cell was randomly sampled from the initial time point (fig. 6.2A). The
inferred pseudotime is highly correlated with the experimental time points, giving confidence that
true biological processes are represented over the DPT pseudotime. The gene expression dynamics
over pseudotime show considerable variability among cells. To smooth the data, we apply a
moving window average, over windows of length 40, to give 68 time points after smoothing (fig.
6.2A). We fit linear regression models to the smoothed pseudotime profiles of each gene (Fig. S17),
and see that for the majority of genes the correlation coefficients are > 0.5 (fig. 6.2B), with a clear
distinction between the up- and down-regulated genes over pseudotime.
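The pseudotime smoothing can be sketched as a windowed average over DPT-ordered cells. The non-overlapping windowing below is an assumption; the reported 68 smoothed points from 2717 cells imply a slightly different handling of window overlap or remainders.

```python
import numpy as np

def smooth_over_pseudotime(expr, pseudotime, window=40):
    """Order cells by pseudotime and average expression over windows of cells.

    expr: (n_cells, n_genes) log2(counts+1) matrix; pseudotime: (n_cells,) DPT values.
    Returns a (n_windows, n_genes) smoothed expression matrix.
    """
    order = np.argsort(pseudotime)
    ordered = expr[order]
    n_windows = ordered.shape[0] // window
    trimmed = ordered[: n_windows * window]
    return trimmed.reshape(n_windows, window, -1).mean(axis=1)
```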
An RVAgene model was trained on the data with a two-dimensional latent space, on which
genes are classified based on their correlation coefficients (fig. 6.2C). Two distinctive characteristics
emerge: a) the two groups (up- and down-regulated genes) are well-separated in the latent space,
and b) the two groups merge and overlap at some point, illustrating the continuity of the latent
space, as discussed above. We compared the results of RVAgene with DPGP, an unsupervised
approach for gene expression time series clustering [235]. DPGP is a hierarchical Bayesian model
that estimates the number of clusters along with the cluster membership.
To assess the correspondence between methods, genes clustered by DPGP (Fig. S18) were
projected onto the RVAgene latent space (fig. 6.2D). Of the 12 clusters detected by DPGP, the four
largest can be characterized by their up- and down-regulation profiles over pseudotime. On the
RVAgene latent space, we find that genes sampled from each of the DPGP clusters appear close
together, and moreover, are represented on a spectrum from upregulation to downregulation (fig.
6.2D). The goals of RVAgene and DPGP are to some degree complementary: DPGP characterizes
gene expression profiles discretely with no need for prior information, while RVAgene characterizes
profiles with a continuous representation that can explain smooth changes in patterns.
To assess the ability of the model to reconstruct genes not used during training, we kept
aside 300 genes for testing and trained RVAgene on the remaining 432 genes. We note that in
this case (and in the case of single-cell datasets in general), the generative model of RVAgene
produces pseudotime-smoothed gene expression trajectories, rather than being generative of
raw pseudotemporal data, which tend to display overall high noise levels. Reconstructed test
gene expression profiles are shown for three reconstructed genes (fig. 6.2E), chosen to sample
across the spectrum of reconstruction errors (fig. 6.2F). The reconstruction for Ddt, which has a
reconstruction error near the mode (fig. 6.2F), shows very high accuracy. e reconstruction for
Hmgb2, which has twice the reconstruction error, still broadly captures the temporal profile but
with lesser accuracy. Finally we show the reconstruction for Rhox4e, a gene that was sampled from
Figure 6.2. Accurate reconstruction of embryonic stem cell differentiation dynamics with
RVAgene. (A) Pseudotemporal ordering of 2717 single cells (data from [247]), calculated using
DPT; example gene shown: Ahsa1. Gene expression values given as log2(counts+1) for all cells
(left), and for sliding window average (right). (B) Pearson correlation coefficient between gene
expression and time for 732 differentially expressed genes. (C) The 2D latent space learnt by an
RVAgene model trained on 732 gene profiles over pseudotime, showing clear separation between
upregulated and downregulated genes. (D) Comparison of RVAgene and DPGP. The four largest clusters from DPGP are plotted on the RVAgene latent space: temporal expression patterns
(from highly upregulated to highly downregulated) are in close agreement between methods. (E)
Comparison of experimental data and reconstructions. Model-generated reconstructions of three
genes from the test set not used in training: Ddt, Hmgb2, and Rhox4e. Expression values are
log2(counts+1). (F) Distribution of average L1 reconstruction errors for the 300 genes used in the
test set. Genes plotted in C are marked. (G) Cumulative distributions of reconstruction errors on
randomly sampled sets of test genes, where the full data were split into test groups of: 200 genes
(train on 72%), 300 genes (train on 59%), 400 genes (train on 45%), 500 genes (train on 31%), and
600 genes (train on 18%).
Figure 6.3. Comparison of information captured in the RVAgene latent space to a standard fully connected VAE, and results of standard hierarchical clusterings. (A) Here we show latent spaces learned by a fully connected VAE and RVAgene. The pseudotemporally ordered data was also smoothed. (B) We annotate the learned latent spaces using the top 4 clusters detected by DPGP on this dataset. In all three of these cases we report best results after relevant hyperparameter search and optimal training. (C) We perform standard hierarchical clusterings (Nearest Point Algorithm, Farthest Point Algorithm, and UPGMA (Unweighted Pair Group Method with Arithmetic mean)) on pseudotemporally ordered and smoothed ESC data and annotate the learned
representation in the same manner as in (B).
the long tail of the reconstruction error distribution, i.e., whose reconstruction does not match the data well. Comparing
these three examples with the full distribution of reconstruction errors (fig. 6.2F), we see that
the large majority of genes lie to the left of Hmgb2, i.e., have better-than-moderate accuracy. The reconstruction error of Hmgb2 is close to 0.005, which we use as a cutoff for “well-reconstructed”
genes, based on analysis of individual gene reconstructions. The cumulative reconstruction error
distribution reiterates this point: 230 out of 300 genes (77%) have a reconstruction error ≤ 0.005
(fig. 6.2G); we can conclude that the majority of test genes were faithfully reconstructed by the
model.
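As an illustration of how these summary statistics can be computed, the following minimal Python sketch (using randomly generated stand-ins for the test-set expression matrix and its reconstructions, not the actual RVAgene outputs) calculates the per-gene average L1 reconstruction error, the fraction of genes below the 0.005 cutoff, and the empirical cumulative distribution plotted in fig. 6.2G:

    import numpy as np

    # Hypothetical stand-ins: rows are test genes, columns are (pseudo)time points.
    rng = np.random.default_rng(0)
    true_profiles = rng.random((300, 50))
    reconstructed_profiles = true_profiles + rng.normal(scale=0.003, size=(300, 50))

    # Average L1 reconstruction error per gene (mean absolute difference over time points).
    errors = np.mean(np.abs(true_profiles - reconstructed_profiles), axis=1)

    # Fraction of test genes below the "well-reconstructed" cutoff used in the text.
    cutoff = 0.005
    n_well = int((errors <= cutoff).sum())
    print(f"{n_well}/{len(errors)} genes ({100 * n_well / len(errors):.0f}%) below error {cutoff}")

    # Empirical cumulative distribution of errors (cf. fig. 6.2G).
    sorted_errors = np.sort(errors)
    cdf = np.arange(1, len(sorted_errors) + 1) / len(sorted_errors)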
RVAgene accurately reconstructed most gene profiles using only ∼ 60% of the data for training
(fig. 6.2G), likely due to co-regulation of gene expression programs. This led to a question: what
is the smallest training gene set that can be used to accurately reconstruct gene dynamics? We
subset the data randomly into train/test sets and trained separate RVAgene models on each. We
found that reconstruction errors slowly increase as the size of the training set decreases, but not
until the training set was as low as 18% of the data did the reconstruction errors significantly
increase (fig. 6.2G, Fig. S19). Analysis of the cumulative distribution of reconstruction errors
across all groups found that RVAgene reconstructs the majority of gene temporal profiles well
(defined as below a reconstruction error of 0.005) if ≥ 45% of the data is used for training. The
successful reconstruction of gene expression dynamics de novo while training on small subsets of
the data suggests widespread co-regulation of gene expression programs during embryonic stem
cell differentiation, as found in previous work [249].
6.3.3 Comparison of RVAgene with alternative approaches for gene clustering
In order to assess the performance of RVAgene for gene clustering and biological discovery, we
compared it to five alternative methods: two neural network approaches and three hierarchical
clustering methods. To assess the utility of the recurrent architecture of RVAgene, we trained non-recurrent (i.e., fully connected) variational autoencoders on the embryonic stem cell differentiation
dataset [247]. We compared two options: using the pseudotemporally ordered and smoothed
data as input (same as for RVAgene), or using the raw (i.e. unordered and unsmoothed) gene
expression data as input. We trained encoder and decoder networks of depth two (one hidden
layer) and with a hidden layer size of 400 (we performed a hyperparameter search to optimize
this). Theoretically, depth-two networks are large enough to learn any non-linear function [250,
251, 252, 253], although the fully connected VAE has no recurrent inductive bias. Thus we test
how important this recurrent inductive bias is in practice.
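For concreteness, the following is a minimal PyTorch sketch of this kind of fully connected VAE baseline. The layer sizes follow the text (one hidden layer of size 400, a 2D latent space), but the class name, the MSE reconstruction term, and the training loop are illustrative assumptions rather than the exact implementation used in the comparison:

    import torch
    import torch.nn as nn

    class FCVAE(nn.Module):
        """Fully connected VAE baseline: depth-two encoder and decoder (one hidden layer each)."""
        def __init__(self, n_timepoints, hidden=400, latent=2):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(n_timepoints, hidden), nn.ReLU())
            self.mu = nn.Linear(hidden, latent)       # mean of q(z|x)
            self.logvar = nn.Linear(hidden, latent)   # log-variance of q(z|x)
            self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_timepoints))

        def forward(self, x):
            h = self.enc(x)
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
            return self.dec(z), mu, logvar

    def vae_loss(x_hat, x, mu, logvar):
        recon = nn.functional.mse_loss(x_hat, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl

    # Illustrative training on a random stand-in for the 732-gene expression matrix.
    x = torch.rand(732, 50)
    model = FCVAE(n_timepoints=50)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(200):
        opt.zero_grad()
        x_hat, mu, logvar = model(x)
        loss = vae_loss(x_hat, x, mu, logvar)
        loss.backward()
        opt.step()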
The results of the comparison of neural networks are given in fig. 6.3A-B. In each case, models
were trained for 200 epochs. Annotating the results in latent space using correlations against
pseudotime (fig. 6.3A) shows that all three models separate the data reasonably well, with slightly
better separation for the recurrent architecture (RVAgene). We also annotated the results using
cluster labels from the largest four DPGP clusters for comparison. These are appropriate “gold-standard” cluster labels since robust dynamical signatures are learnt by DPGP in each case (Fig.
S18). RVAgene captures: 1) better separation between clusters than either of the non-recurrent
networks, and 2) a spectrum of behaviors from up- to down-regulated (fig. 6.3B).
We also performed hierarchical clustering on the pseudotemporally ordered and smoothed data
using three standard hierarchical clustering methods: the Nearest Point Algorithm, the Farthest
Point Algorithm, and UPGMA (the Unweighted Pair Group Method with Arithmetic mean). We
annotated the results with the same cluster labels from DPGP (fig. 6.3C). UPGMA performs
best out of these three clustering algorithms, yet still does not attain clear separation between
each of the four groups. Thus, the 2D latent space representation of RVAgene is better than both
1D representations via hierarchical clustering and the alternative neural network latent space
representations at distinguishing between dynamic gene profiles in pseudotemporally-ordered
data.
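The three hierarchical clustering variants used here map directly onto method names in scipy; a minimal sketch (with a random stand-in for the pseudotemporally ordered and smoothed expression matrix) of how such clusterings could be produced is:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Hypothetical stand-in for the smoothed ESC matrix (rows = genes, columns = pseudotime bins).
    rng = np.random.default_rng(0)
    profiles = rng.random((732, 50))

    # scipy method names corresponding to the algorithms in the text:
    #   'single'   -> Nearest Point Algorithm
    #   'complete' -> Farthest Point Algorithm
    #   'average'  -> UPGMA
    labels = {}
    for method in ("single", "complete", "average"):
        Z = linkage(profiles, method=method, metric="euclidean")
        labels[method] = fcluster(Z, t=4, criterion="maxclust")  # cut the tree into 4 clusters

    # labels[...] can then be compared against, e.g., the four largest DPGP clusters.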
Figure 6.4. Accurate reconstruction of kidney injury response gene dynamics with RVAgene.
(A) Latent space representations of RVAgene models trained separately on three independent
replicates (R1-R3); classified by quadratic fit coefficient a. (B) Model generation of gene dynamics
for genes not used in training: Foxm1, Cxcl9 and Ctsk. (C) Histograms of reconstruction errors
for RVAgene models trained on R1-R3 (truncated). (D) Cumulative distribution of reconstruction
errors.
6.3.4 RVAgene can classify and predict gene expression dynamics in response
to kidney injury
We investigated gene expression dynamics in the murine kidney by applying RVAgene to a dataset
that describes gene expression profiles before, during, and after a kidney injury [254]. The dataset
is temporally rich, with a total of ten bulk samples over twelve months. Since in this case no
single-cell information is available, we cannot order samples by pseudotime to smooth the data.
Moreover, the temporal gene expression profiles described in Liu, Kumar, Dolzhenko, Alvarado,
Guo, Lu, Chen, Li, Dessing, Parvez, et al. [254] display more complex dynamics than for the
previous dataset [247], and are not readily separable by linear patterns of up- and down-regulated
genes (cf. fig. 6.2C). Thus, below, we must consider nonlinear models in order to characterize the
temporal patterns observed.
The data consist of one initial timepoint (t = 0) before the injury event (an ischemia/reperfusion
injury model) and nine subsequent time points (t = 1 to 10) following the injury (48 hours, 72
hours, 7 days, 14 days, 28 days, 6 months and 12 months). We note that the timepoints are not
uniformly spaced, which RVAgene does not take into account, since it only models the broad
temporal trend (see Discussion). From an initial list of 1927 differentially expressed genes measured
over the time course in three biological replicates, we removed putative/predicted and non-protein
coding genes, retaining a list of 1713 genes as input to the model.
We ran RVAgene separately for each of three biological replicates. Independent replicates and
independently trained models provide additional means with which to test the reproducibility
of these methods. For each replicate, RVAgene was trained with a two-dimensional latent space
and a hidden size of 10, on the full set of genes over 200 epochs, which was found to be sufficient for the
convergence of L (see Methods for further details). We fit linear regression models to the temporal
gene profiles (Fig. S20) and found that linear fits rarely described the gene temporal profiles well
(most correlation coefficients had values close to zero), nor did they identify separate clusters in
the latent space. Normalizing the data to lie in [0,1] improved our ability to discriminate clusters
in the latent space (Fig. S20C), but came at the expense of a significant loss of information, as the
variance captured in the latent space was dramatically reduced. The absence of evidence for linear
correlations could indicate expression dynamics that are uncorrelated with time, but could of
course also indicate more complicated (nonlinear) gene expression dynamics, which are explored
below.
To study nonlinear gene expression dynamics, we fit a 2nd-degree polynomial, i.e., we fit
the temporal trajectory of each gene x to x = at² + bt + c, where a, b, c are constants (Fig. S21).
We hypothesized that this function could adequately describe the transient dynamics observed
by Liu, Kumar, Dolzhenko, Alvarado, Guo, Lu, Chen, Li, Dessing, Parvez, et al. [254] for most
genes in response to the kidney injury. Thus, we classified genes into one of two groups: a < 0,
concave (up-down pattern), 1200 genes; and a ≥ 0, convex (down-up pattern), 512 genes. In the
latent space, the separation of these two groups is clearly visible for each replicate (fig. 6.4A).
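A minimal sketch of this classification step, using numpy.polyfit on hypothetical stand-ins for the gene-by-time expression matrix (the toy data below are illustrative, not the actual kidney injury profiles):

    import numpy as np

    # Hypothetical stand-ins: one row per gene, one column per (standardized) time point.
    rng = np.random.default_rng(0)
    profiles = rng.random((1713, 10))
    t = np.arange(profiles.shape[1])

    # Fit x = a*t^2 + b*t + c per gene; np.polyfit returns coefficients [a, b, c].
    coeffs = np.array([np.polyfit(t, gene, deg=2) for gene in profiles])
    a = coeffs[:, 0]

    # Classify by the sign of the quadratic coefficient.
    up_down = a < 0     # concave, transiently upregulated genes
    down_up = a >= 0    # convex, transiently downregulated genes
    print(up_down.sum(), "up-down genes,", down_up.sum(), "down-up genes")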
Moreover, the classification is in agreement with Liu, Kumar, Dolzhenko, Alvarado, Guo, Lu,
Chen, Li, Dessing, Parvez, et al. [254], where the majority of differentially expressed genes are
upregulated transiently. To explore the ability of RVAgene to reconstruct gene expression profiles
not used in model development, we kept aside 300 randomly sampled genes for testing, and
trained RVAgene models on the remaining genes for each of the three replicates. Independently
for each model, we then generated dynamic profiles for the test genes. Three genes sampled
randomly from the test set are plotted in fig. 6.4B. Of particular note, for each of these genes, the model-generated data captures the temporal patterns while displaying a higher degree of similarity across
replicates than the experimental data itself. This illustrates that the model is neither under- nor
overfitting, but captures the underlying biological patterns while sufficiently accounting for the
noise. Reconstruction errors are comparable across the three replicates, albeit with slightly higher
overall errors in replicate 1 (fig. 6.4C-D). Overall, the reconstruction errors are higher than in the
previous section (averaging over many pseudotemporal time points allowed us to significantly
reduce the noise).
Figure 6.5. RVAgene latent space captures biological processes driving concordant gene expression changes. (A) Z-plots for replicates R1-R3 with local neighborhoods of Wnt2 and Wnt4
marked (circles). (B) As in A, for Slc family members Slc22a18 and Slc7a13. (C) Heatmap of
expression changes over time course of injury for the Wnt neighborhood genes in the intersection
of R1-R3. Selected genes marked (black), as well as ortholog gene pairs (blue). (D) As in C, for
Slc neighborhood genes. (E) Histogram of -log10 p values of gene ontology terms for biological
processes terms associated with the Wnt neighborhood (gene set in C). (F) As in E, with the Slc
neighborhood (gene set in D).
To investigate in more depth the features that are captured in the RVAgene latent space, we
performed two sets of analyses: unbiased clustering, and targeted exploration. For the unbiased
analysis, we performed k-means clustering on RVAgene latent space of replicate 1 (R1) with k = 9
(Supplementary Fig. S22A); we project the cluster labels learnt onto replicates R2 and R3. All
cluster identities are well-preserved across replicates, with the exception of cluster 5, which seems
to indicate outlier genes in R1. To study biological processes within these clusters, we performed
GO term enrichment analysis on each. In Supplementary Fig. S22B we plot one significant GO
term per cluster (omitting cluster 5), and see that specific regions of the latent spaces across
replicates can be characterized in terms of biological processes, many of which relate to metabolic
and immune system responses. These can be separated into two broad classes, which separate
the left-hand side of R1 (metabolic processes downregulated during injury response) from the
right-hand side (immune responses upregulated during injury response).
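A minimal sketch of this projection step, assuming hypothetical arrays z_r1, z_r2, z_r3 that hold the 2D latent coordinates of the same genes in the three replicates; cluster preservation is checked here by re-clustering each replicate and computing an adjusted Rand index, which is one possible way to quantify the agreement described above:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    # Hypothetical 2D latent coordinates for the same genes in replicates R1-R3.
    rng = np.random.default_rng(0)
    z_r1 = rng.normal(size=(1713, 2))
    z_r2 = z_r1 + rng.normal(scale=0.1, size=z_r1.shape)
    z_r3 = z_r1 + rng.normal(scale=0.1, size=z_r1.shape)

    # k-means with k = 9 on the R1 latent space.
    labels_r1 = KMeans(n_clusters=9, n_init=10, random_state=0).fit_predict(z_r1)

    # Because rows correspond to the same genes, the R1 labels can be projected directly
    # onto R2 and R3 (e.g., for coloring those latent spaces). Re-clustering each replicate
    # independently and comparing against the projected labels gives one measure of how
    # well cluster identities are preserved.
    for name, z in (("R2", z_r2), ("R3", z_r3)):
        labels_own = KMeans(n_clusters=9, n_init=10, random_state=0).fit_predict(z)
        print(name, "adjusted Rand index vs. projected R1 labels:",
              round(adjusted_rand_score(labels_r1, labels_own), 2))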
To study the effects of gene-specific regions of the latent space in greater depth, we chose
three distinct regions based on the co-location of genes of interest. These gene groups, studied in
the latent space, are: 1) a Wnt group consisting of family members Wnt2 & Wnt4; 2) an Slc group
consisting of family members Slc7a13 & Slc22a18; and 3) a Sdc1 group, consisting of only Sdc1. For
each group, we characterized neighboring genes by defining a circular neighborhood around each
gene in the group, with radius r (depending on the local density, the radius was varied, giving
r² = 1 for Slc, r² = 0.3 for Sdc, and r² = 0.05 for Wnt). We then took all genes inside this radius for
each replicate, and found the intersection of genes over the three replicates (fig. 6.5A-B). We
analyzed the intersection gene set for each group by studying their temporal profiles and their
gene ontology (GO) term associations. Each group was characterized by a strikingly clear temporal
profile. The Sdc1 and Wnt groups both show transient upregulation, over different timescales: the
Sdc1 group is upregulated from 24 hours post-injury until 14-28 days post-injury (fast response)
(Supplementary Fig. S23B), whereas the Wnt group is upregulated at 7 days post-injury until 28
days post-injury (slow response) (fig. 6.5C). In contrast, the Slc group is downregulated at 24 hours
post-injury, and remains suppressed until 7-28 days post-injury (fig. 6.5D).
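A minimal sketch of the neighborhood construction, assuming hypothetical latent-coordinate arrays and a shared gene list (the names, radii, and toy data below are illustrative only):

    import numpy as np

    def neighborhood(latent, gene_names, center_gene, r_squared):
        """Genes whose latent coordinates lie within squared distance r_squared of center_gene."""
        center = latent[gene_names.index(center_gene)]
        d2 = np.sum((latent - center) ** 2, axis=1)
        return {g for g, dd in zip(gene_names, d2) if dd <= r_squared}

    # Hypothetical stand-ins: latent coordinates per replicate and a shared gene list.
    rng = np.random.default_rng(0)
    genes = [f"gene{i}" for i in range(1711)] + ["Wnt2", "Wnt4"]
    latents = {rep: rng.normal(size=(len(genes), 2)) for rep in ("R1", "R2", "R3")}

    # Wnt group: union of the Wnt2 and Wnt4 neighborhoods (r^2 = 0.05), then the
    # intersection of that gene set across the three replicates.
    wnt_sets = []
    for rep, z in latents.items():
        nb = neighborhood(z, genes, "Wnt2", 0.05) | neighborhood(z, genes, "Wnt4", 0.05)
        wnt_sets.append(nb)
    wnt_group = set.intersection(*wnt_sets)
    print(len(wnt_group), "genes in the Wnt neighborhood across all replicates")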
Analysis of GO biological process terms enriched in each gene group further highlighted the
power of the latent space for biological discovery. The fast response (Sdc1) group was characterized
by upregulation of programs related to apoptosis, stress response, wound healing and chemotaxis,
i.e., the first responders to the site of injury (Fig. S23C). In addition, all five Lox genes comprising
the GO term “peptidyl-lysine oxidization” were found in this group. This is consistent with the
oxidative stress resulting from the renal ischemia-reperfusion injury that was performed. However,
distinct factors regulate the Lox family genes, as can be partly observed by their subtle differences
in temporal profile (Fig. S23D). Their co-location in the latent spaces of all three models thus
highlights the potential use of RVAgene for discovery of complex temporal regulatory events from
gene expression data.
The slow response (Wnt) group was primarily characterized by immune response processes,
including leukocyte activation, platelet aggregation, and various cytokine-mediated pathways
including IL-1 and IL-33 (fig. 6.5E). Notably, the Wnt group identifies multiple gene orthologs (fig.
6.5C) with very similar profiles: likely evidence of shared temporal regulation. This illustrates
once again (as for the Lox genes above) the potency of RVAgene for the discovery of temporally
co-regulated genes.
Finally, the Slc group of genes shows a transiently down-regulated pattern between 24 hours
and 7-28 days, although some genes in this group deviate from this pattern (fig. 6.5D). GO term
enrichment identifies the positive regulation of metabolic processes (fig. 6.5F). The downregulation
of metabolic programs during the response to kidney injury is in agreement with the findings of
Liu, Kumar, Dolzhenko, Alvarado, Guo, Lu, Chen, Li, Dessing, Parvez, et al. [254]. Notably, this
metabolism-sensitive group contains many genes that also display sexually dimorphic expression,
primarily in specific regions of the proximal tubule [255], thus independently identifying the
well-established (though under-studied) interplay between sex differences and injury responses
in the kidney [256].
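The GO term enrichment analyses above rest on a standard over-representation test; a minimal sketch of the one-sided hypergeometric test that typically underlies such analyses (the numbers shown are illustrative only, not taken from the actual analysis):

    from scipy.stats import hypergeom

    def go_enrichment_p(n_background, n_annotated, n_group, n_overlap):
        """One-sided hypergeometric p-value for observing at least n_overlap genes carrying
        a GO term within a group of n_group genes, drawn from a background of n_background
        genes of which n_annotated carry the term."""
        return hypergeom.sf(n_overlap - 1, n_background, n_annotated, n_group)

    # Illustrative numbers only.
    p = go_enrichment_p(n_background=1713, n_annotated=40, n_group=60, n_overlap=8)
    print(f"p = {p:.2e}")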
In summary, unsupervised analysis of groups of genes co-located in the latent spaces of
RVAgene finds: 1) high similarity between temporal gene profiles of genes nearby in latent space,
Figure 6.6. Training RVAgene is reasonably scalable on CPU and even more so using hardware
acceleration through GPU. (A) Time cost of training RVAgene for 100 epochs for datasets with
varying number of genes and time points on CPU and GPU. (B) Maximum memory utilized during
training of the model on CPU and GPU for the cases in (A); inset plot: comparison of max memory
used compared to DPGP for varying number of genes.
and 2) clear biological signatures represented by these groups of nearby genes, in strong agreement
with prior knowledge [254]. Moreover, the latent spaces of RVAgene models can be used to predict
programs of temporal co-regulation.
6.3.5 Assessment of the computational efficiency of RVAgene
We assessed the computational efficiency of RVAgene for various settings and hardware. For the
majority of the models trained, 100-200 epochs were sufficient for the loss function L to converge.
For tests performed here, we recorded the RVAgene runtime for 100 epochs of training using
models that varied in their number of genes and time points. In each case we used a latent space
of dimension two, a hidden size of 10, and a training batch size of 10. We ran the model on an Intel
i7 CPU with four cores and a Tesla K20 GPU. Runtimes were recorded on Linux via the built-in
time script (/usr/bin/time --verbose). As the number of time points and genes grew large (up
to 60 time points and 700 genes), total runtimes on CPU were on the order of 10³ seconds (< 20
minutes) (fig. 6.6A). On GPU, total runtimes were decreased to around 100 seconds (< 3 minutes).
Thus, RVAgene is readily scalable to tens of thousands of genes and hundreds of time points for
training times of up to a few days on CPU or hours on GPU. For comparison, as described in
McDowell, Manandhar, Vockley, Schmid, Reddy, and Engelhardt [235], the approximation-free
time complexity of each iteration of learning for DPGP is O(GT³), due to the G matrix inversions,
each of size T × T, for a dataset with G genes and T timepoints. The complexity for each epoch of
training of RVAgene is O(GT).
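For reference, a minimal Python sketch of how runtime and peak memory could be recorded for a single training run on a Unix system; train_fn is a hypothetical placeholder for, e.g., 100 epochs of training, and ru_maxrss mirrors the peak resident set size reported by /usr/bin/time:

    import resource
    import time

    def measure(train_fn):
        """Wall-clock time and peak resident set size of a training call (Unix only)."""
        t0 = time.perf_counter()
        train_fn()
        elapsed = time.perf_counter() - t0
        # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
        peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        return elapsed, peak_kb

    # Dummy workload standing in for a hypothetical training function.
    elapsed, peak_kb = measure(lambda: sum(i * i for i in range(10**6)))
    print(f"{elapsed:.2f} s, peak RSS ~{peak_kb / 1024:.1f} MB")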
In terms of peak memory usage, since RVAgene is a neural network trained using backpropagation [12], maximum memory used during training is of the same size as the network itself, which
is constant given that the model parameters are fixed (fig. 6.6B). This is in contrast to Gaussian
Processes (such as DPGP), which initially assign each gene to its own cluster, and thus must store G
matrices of size T × T, for G genes and T timepoints per gene. This leads to quickly increasing
peak resident set sizes at runtime for DPGP compared to RVAgene (fig. 6.6B, inset). The memory
used by DPGP grows with the number of time points as O(GT²). Thus, DPGP will not run with
large numbers of genes and time points. A note on this comparison: it is not direct, in the sense
that DPGP performs clustering and RVAgene does not, in addition to other important differences
between the goals of the methods. Nonetheless, the size and scope of current biological datasets
– particularly at single-cell resolution – in many cases preclude the use of DPGP without large
reductions of the input data size. As we have shown, a feasible and efficient alternative in such
cases is to run RVAgene, and then to perform clustering or other classification analyses post hoc
on the latent space of the model.
6.4 Discussion
We have presented RVAgene, a recurrent variational autoencoder for generative modeling of gene
expression time series data. Through its encoder network, RVAgene provides means to visualize and
classify gene expression dynamic profiles, which can lead to the discovery of biological processes.
Through its decoder network, RVAgene provides means to generate new gene expression dynamic
profiles of either the full data or (in the case of single-cell studies) the pseudotime-smoothed data
by sampling points from the latent space. In doing so, RVAgene can accurately reconstruct gene
dynamics in complex biological data. As a by-product, on single-cell datasets the model directly
produces smoothed outputs, useful for denoising gene expression time series data. RVAgene is
efficient on temporally-rich whole genome datasets, in comparison to current existing methods.
RVAgene can be used to discover structure in the data, such as gene profile clusters. Popular
methods for clustering gene profiles such as Bayesian hierarchical clustering [257] or DPGP
[235] detect the number of clusters in the data by fitting a hyperparameter α, the concentration
parameter of the governing Dirichlet process [258]. Although the approach is unsupervised, the choice
of α inevitably affects the number of clusters output. Visualizing the data first with RVAgene can give an
idea of whether the data favor clustering or a continuous representation. Thus, analysis with RVAgene
can guide the setting of the hyperparameter α in DPGP and similar methods. In the case of ESC
differentiation, DPGP predicts 12 clusters (Fig. S18), yet most have very few members and many
share similar patterns. The RVAgene latent space for this dataset finds two major divisions in the
data, and orders the largest DPGP clusters along a spectrum (fig. 6.2D), suggesting that DPGP
might be overfitting the data. Indeed, the two methods can be used complementarily: RVAgene for
high-level structure discovery and DPGP for clustering. In cases where learning a detailed noise
model (at single time point resolution) is important to the user, DPGP or other Gaussian Process
models are preferable over RVAgene. However, DPGP does not scale well with large datasets and
thus cannot always be used (fig. 6.6).
The latent space of an RVAgene model encodes useful information about biological features,
and in that sense provides biologically interpretable representations of the data. However, the
representation is not interpretable in the sense that the components of the latent space do not
have a physical meaning nor are they necessarily independent. Recent methods have tackled this
issue of interpretability, by either modifying the loss function to make components independent
[259] or substituting linear functions in parts of the VAE [260, 261]. These methods have clear
advantages regarding the analysis and interpretation of features in the latent space. In future work,
decoding an RVAgene model with a linear function [260] could facilitate additional discovery and
improve our ability to gain insight into dynamic biological processes through the analysis of the
latent space.
Dynamic changes in gene expression underlie essential cell processes. As such, modeling
gene expression changes can also facilitate downstream analysis tasks, including gene regulatory
network (GRN) inference. Inferring gene regulatory networks from single-cell data is challenging
[262], particularly due to cell-cell heterogeneity and high levels of noise. Several recent approaches
to GRN inference make use of temporal profiles [263, 264] or differential equations [265, 266,
267]. RVAgene could supplement such methods either by providing denoised input data, or by
completely replacing the temporal ordering/differential equation-based components of these
methods (which can be notoriously difficult to parameterize) with data produced from an RVAgene
generative model of the gene expression dynamics.
RVAgene is currently agnostic of irregular time intervals between consecutive points in a
time series, i.e., it standardizes the time interval. This is not usually a concern for single-cell
data, since with pseudotime information we can choose appropriate time intervals. However,
in other cases, such as in response to kidney injury [254], standardizing time intervals distorts
the dynamic profiles. Since RVAgene seeks to describe broad temporal patterns, we do not see
this as a critical issue, though it would be desirable to generalize the model. A simple way to
model irregularly spaced time points would be to augment the data through interpolation, though
this is difficult without making strong assumptions about the (generally unknown) noise model.
Gaussian process models [235, 268] can take irregular data as input, although (as noted above) are
not efficient enough to run on large datasets. An alternative approach would be to modify the
recurrent network architecture to take time points explicitly as input values; this would enable
modeling of irregular or asynchronous data [269].
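A minimal sketch of the interpolation-based workaround mentioned above, resampling one irregularly sampled profile onto a uniform grid (the times and values are illustrative; the strong assumption is that the dynamics are smooth between measurements):

    import numpy as np

    # Hypothetical irregular sampling times (in days) and one gene's measured values.
    t_irregular = np.array([0, 2, 3, 7, 14, 28, 180, 365], dtype=float)
    x = np.array([1.0, 3.2, 3.5, 2.8, 2.0, 1.4, 1.1, 1.0])

    # Resample onto a uniform grid by linear interpolation before feeding the profile
    # to the model; this implicitly assumes smooth, low-noise dynamics between samples.
    t_uniform = np.linspace(t_irregular.min(), t_irregular.max(), 50)
    x_uniform = np.interp(t_uniform, t_irregular, x)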
RVAgene models in discrete time steps. There is no simple modification to the recurrent
network structure that allows for prediction on continuously valued time. However, a recent
development, neural ordinary differential equations (ODEs) [270], enables modeling of time series
data with continuous timepoints. Chen, Rubanova, Bettencourt, and Duvenaud [270] describe
a generative latent ODE architecture similar to that of RVAgene, except that in their case the
recurrent decoder network is replaced by a neural ODE decoder network. Chen, Rubanova,
Bettencourt, and Duvenaud [270] demonstrate accurate results using synthetic data; however,
when we applied the method to the ESC single-cell differentiation dataset [247], the neural ODE
network was found to converge very slowly and was overall underfit (Fig. S24). The latent ODE
method used by Chen, Rubanova, Bettencourt, and Duvenaud [270] does not address the challenge
of modeling asynchronous/irregularly spaced data, but this has been more recently addressed
[271]. These new models may well lead to future improvements in network architectures, although
it seems that computational progress is needed before they can be successfully applied to complex
biological systems.
In the current work, the prior on latent space used throughout was a unit spherical Normal,
appropriate for exploratory data analysis where we have no further knowledge about structure in
the latent space. However, given more information, e.g. that the data contains k clusters, a different
prior on the latent space might be more appropriate. A multi-modal prior – such as a Gaussian
Mixture Model (GMM) prior – would permit structured (multi-modal) representations. However,
the KL-divergence for an arbitrary GMM is not tractable; approximation [272] or numerical
computation would be necessary. Moreover, there is a greater problem: mixture models contain
discrete parameters and VAE models are ill-suited for the optimization of discrete parameters [273],
thus directly replacing the Normal prior of a VAE with a GMM is not feasible. A workaround to
this problem is presented in [273], however implementing this for a recurrent model architecture
remains an open problem.
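The convenience of the unit spherical Normal prior comes from the closed-form KL divergence between a diagonal Gaussian posterior and N(0, I); a minimal sketch of this standard term, as it appears in any VAE loss rather than being specific to RVAgene:

    import torch

    def kl_unit_normal(mu, logvar):
        """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior with mean mu
        and log-variance logvar, summed over latent dimensions:
            KL = -0.5 * sum(1 + logvar - mu^2 - exp(logvar)).
        No comparable closed form exists for an arbitrary GMM prior."""
        return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)

    # Example: a batch of two 2D posteriors.
    mu = torch.tensor([[0.0, 0.0], [1.0, -1.0]])
    logvar = torch.zeros_like(mu)
    print(kl_unit_normal(mu, logvar))  # tensor([0., 1.])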
The points raised above offer much scope for future work. These include the design of new
latent space models with informative priors, modeling irregular time series data, and modeling
in continuous time. Developments in some of these areas [270], while promising, tend to rely
on training data with relatively low levels of noise: far from the reality of most biological data.
Thus it seems highly likely to be beneficial for both machine learning and biology to develop new
neural network architectures in light of biological data.
Data Availability
The synthetic data used for evaluation of RVAgene are available at https://github.com/maclean-lab/RVAgene. Additional data used in the manuscript are available from the Gene
Expression Omnibus: ESC differentiation (GEO accession GSE65525) and kidney injury (GEO
accession GSE98622).
Software Availability
RVAgene is available in Python, released under an MIT license: https://github.com/maclean-lab/RVAgene.
Chapter 7
Conclusion
It has been a fascinating experience to work at the intersection of artificial intelligence and structural biology during these past years. My contributions to the PNAbind project (Chapter 1) served
to establish a biophysically inspired graph neural network model for segregating DNA-, RNA-, and non-nucleic-acid-binding proteins and segmenting them [45]. This also acted as a learning experience for me
in this domain, which allowed us to attempt the binding specificity project (Chapter 2). In 2021,
we witnessed protein folding, a problem more than half a century old, being almost solved by
AlphaFold2 [23], as I was working on my project of modeling protein-DNA binding specificity
based on structures. Efforts were immediately underway to predict structures of
higher-order complexes of biomolecules [24, 274], and recently took a big step forward through
AlphaFold3 [27]. However, these complex structure prediction methods, thus far, are not yet
able to model binding specificity. This puts our model, DeepPBS [46], in a uniquely synergistic
position: it can take a predicted or designed complex and predict its binding specificity.
We demonstrated through our work how DeepPBS can be applied to predicted and designed
complexes across different families. We achieve this impact while staying true to our motivation of
building models which are biophysically inspired and interpretable. It gives me joy to witness that
as of now, the DeepPBS webserver is being used by scientists around the world. Looking ahead,
in my view, the combination of structure prediction and specificity prediction methods is the future
of predicting/designing biologically meaningful complexes. In fact, a fresh direction of thinking
about this problem is joint modeling of structure and specificity. This is still quite ambitious, as
data sparsity poses a big challenge, especially for complexes involving nucleic acids.
Alongside my work on protein-DNA, concurrent events, and my advisor’s encouragement
inspired me to explore the field of RNA biology, which led to the projects RNAscape (Chapter
3) and RNAproDB (Chapter 5). As of now, the field of RNA structures is also being shaped by
artificial intelligence methods [32]. Yet an AlphaFold-level breakthrough is still out of reach [275].
Recent works have shown progress in protein structure targeted RNA structure design [276] and
prediction of protein-RNA binding energy [277]. Although promising, a lot of it is still quite
preliminary and/or lacks biological validation. A DeepPBS-like model for RNA binding specificity
prediction is also non-existent (although there has been some progress [278]). It has also been
known for a while that solvent molecules have an effect on protein-nucleic acid recognition [163].
The extent of this phenomenon has not been quantified. Several further directions regarding
DeepPBS remain to be explored. Both JASPAR and HOCOMOCO released updated versions
recently [201, 279]. Many aspects of DeepPBS could be rethought in light of the ever-expanding
repertoire of new deep learning techniques [280, 281]. Training new DeepPBS models on this
data may improve performance. Somewhat related, the input to DeepPBS is a static structure.
Augmenting the dataset with more conformers (potentially sampled through MD simulation)
could improve performance too. Although there are some existing options [282], an effective
method for docking DNA to a target protein structure is still unavailable. Solving this problem
might finally lead to a general model of binding specificity, which can operate based on protein
structure alone. is might be very complex and a family specific docking algorithm might be the
first step towards this direction. All these possibilities make me excited to experience what the
future holds in store.
Structure, function, and localization of non-coding RNA also remain to be
studied and modeled. 70-90% of the mammalian genome is transcribed into RNA, but only 1% of it is
translated. The rest is known as non-coding RNA (ncRNA). Long non-coding RNAs (lncRNAs) are defined by their size range
of 200 bases to 10 kilobases. For a long time, lncRNAs were regarded as merely transcriptional noise.
Recently, however, it has been shown that they perform important regulatory functions. Variations
in lncRNA expression have been shown to be significantly correlated with certain disease traits
[283], and lncRNAs are tissue-specific [284]. LncRNA expression actually explains certain cancer
classification data better than mRNA expression [285]. Recent work on population-scale tissue
transcriptomics [286] discovers highly tissue-specific regulation of lncRNAs. They also identify 800
lncRNA-trait relationships that are not explained by protein-coding genes. Hence, there is a
growing interest in developing computational methods addressing various kinds of biological
problems in the lncRNAome. Reference [287] discusses deep learning methods recently developed for
such tasks. Some of these works include lncRNA-protein interaction prediction [288, 289, 290,
291, 292], lncRNA identification [293, 294, 295], learning regulatory information [296, 297],
predicting subcellular localization of lncRNAs [298], lncRNA-miRNA interaction prediction [299],
and lncRNA-disease association prediction [285, 300, 301, 302]. All these tools might improve in
the future through incorporation of structural and specificity information, as structure and specificity
prediction methods evolve over time.
With the improvements in structure prediction and design, there is an increasing need for high-quality analysis tools to study, visualize, and explore these structures. In the later part of
my PhD, we built an updated DNAproDB (Chapter 4), RNAscape [48], and RNAproDB to address
this need. DNAproDB [49, 50] has been widely used by the structural biology community in the
past. We hope this new update (especially with new annotations of water-mediated hydrogen
bonds) makes it even more useful. These tools are designed as webservers so that a user can
analyze protein-DNA-RNA structures without requiring a programming background. DNAproDB
and RNAproDB are also accompanied by a pre-analyzed collection, which a user can explore
immediately. The modern, interactive user interface is a key feature of these tools. I believe these
will be essential assets for the field in analyzing high-throughput predicted/designed complexes as
they continue to evolve in the near future.
It has also been a pleasure to witness my tangential research work on generative modeling
of gene expression time-series data (Chapter 6) being used and discussed by highly impactful
scientific research in understanding diseases, for example, single-nucleus chromatin accessibility
and transcriptomic characterization of Alzheimer’s disease [303].
I conclude this dissertation by expressing my deepest gratitude, love and friendship to everyone
in my life who supported me through this journey. The Nobel Prize in Chemistry (2024) was
recently awarded to the field of structural biology: protein design (Prof. David Baker) and
biomolecular structure prediction (Demis Hassabis and John Jumper). It is exciting for me to have
done my PhD research in this field and to be able to produce highly related research. I hope my
scientific contribution helps towards further progress of our society and to improve the human
condition in general.
References
[1] Frank Schluenzen, Ante Tocilj, Raz Zarivach, Joerg Harms, Marco Gluehmann, Daniela
Janell, Anat Bashan, Heike Bartels, Ilana Agmon, François Franceschi, et al. “Structure of
functionally activated small ribosomal subunit at 3.3 Å resolution”. In: Cell 102.5 (2000),
pp. 615–623.
[2] Joerg Harms, Frank Schluenzen, Raz Zarivach, Anat Bashan, Sharon Gat, Ilana Agmon,
Heike Bartels, François Franceschi, and Ada Yonath. “High resolution structure of the large
ribosomal subunit from a mesophilic eubacterium”. In: Cell 107.5 (2001), pp. 679–688.
[3] International Human Genome Sequencing Consortium. “Initial sequencing and analysis of
the human genome”. In: Nature 409.6822 (2001), pp. 860–921.
[4] Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V Bzikadze, Alla
Mikheenko, Mitchell R Vollger, Nicolas Altemose, Lev Uralsky, Ariel Gershman, et al. “The
complete sequence of a human genome”. In: Science 376.6588 (2022), pp. 44–53.
[5] Samuel A Lambert, Arttu Jolma, Laura F Campitelli, Pratyush K Das, Yimeng Yin, Mihai
Albu, Xiaoting Chen, Jussi Taipale, Timothy R Hughes, and Matthew T Weirauch. “The
human transcription factors”. In: Cell 172.4 (2018), pp. 650–665.
[6] James D Watson and Francis HC Crick. “Molecular structure of nucleic acids: a structure
for deoxyribose nucleic acid”. In: Nature 171.4356 (1953), pp. 737–738.
[7] Daniel Panne, Tom Maniatis, and Stephen C Harrison. “An atomic model of the interferon-β
enhanceosome”. In: Cell 129.6 (2007), pp. 1111–1123.
[8] Remo Rohs, Sean M West, Alona Sosinsky, Peng Liu, Richard S Mann, and Barry Honig.
“The role of DNA shape in protein–DNA recognition”. In: Nature 461.7268 (2009), pp. 1248–
1253.
[9] Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge
Weissig, Ilya N Shindyalov, and Philip E Bourne. “The Protein Data Bank”. In: Nucleic acids
research 28.1 (2000), pp. 235–242.
[10] SC Kleene. “Representation of Events in Nerve Nets and Finite Automata”. In: CE Shannon
and J. McCarthy (1951).
[11] Seppo Linnainmaa. “Taylor expansion of the accumulated rounding error”. In: BIT Numerical Mathematics 16.2 (1976), pp. 146–160.
[12] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. “Learning representations
by back-propagating errors”. In: nature 323.6088 (1986), pp. 533–536.
[13] James A Anderson and Edward Rosenfeld. Talking nets: An oral history of neural networks.
MiT Press, 2000.
[14] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard,
Wayne Hubbard, and Lawrence D Jackel. “Backpropagation applied to handwritten zip
code recognition”. In: Neural computation 1.4 (1989), pp. 541–551.
[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with
deep convolutional neural networks”. In: Advances in neural information processing systems
25 (2012).
[16] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. “Gradient-based learning
applied to document recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.
[17] Yann LeCun, Yoshua Bengio, et al. “Convolutional networks for images, speech, and time
series”. In: The handbook of brain theory and neural networks 3361.10 (1995), p. 1995.
[18] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “Imagenet: A large-scale hierarchical image database”. In: 2009 IEEE conference on computer vision and pattern
recognition. IEEE. 2009, pp. 248–255.
[19] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: nature 521.7553
(2015), pp. 436–444.
[20] David R Kelley, Yakir A Reshef, Maxwell Bileschi, David Belanger, Cory Y McLean, and
Jasper Snoek. “Sequential regulatory activity prediction across chromosomes with convolutional neural networks”. In: Genome research 28.5 (2018), pp. 739–750.
[21] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. “Learning important features
through propagating activation differences”. In: International conference on machine learning. PMLR. 2017, pp. 3145–3153.
[22] Ellen D Zhong, Tristan Bepler, Bonnie Berger, and Joseph H Davis. “CryoDRGN: reconstruction of heterogeneous cryo-EM structures using neural networks”. In: Nature methods
18.2 (2021), pp. 176–185.
[23] John Jumper et al. “Highly accurate protein structure prediction with AlphaFold”. In: Nature
596 (7873 Aug. 2021), pp. 583–589. issn: 0028-0836. doi: 10.1038/s41586-021-03819-2.
[24] Minkyung Baek, Ryan McHugh, Ivan Anishchenko, Hanlun Jiang, David Baker, and Frank
DiMaio. “Accurate prediction of protein–nucleic acid complexes using RoseTTAFoldNA”.
In: Nature methods 21.1 (2024), pp. 117–121.
[25] Minkyung Baek et al. “Accurate prediction of protein structures and interactions using a
three-track neural network”. In: Science 373 (6557 Aug. 2021), pp. 871–876. issn: 0036-8075.
doi: 10.1126/science.abj8754.
[26] Rohith Krishna et al. “Generalized biomolecular modeling and design with RoseTTAFold
All-Atom”. In: Science 384 (6693 Apr. 2024). issn: 0036-8075. doi: 10 . 1126 / science .
adl2528.
[27] Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel,
Olaf Ronneberger, Lindsay Willmore, Andrew J Ballard, Joshua Bambrick, et al. “Accurate
structure prediction of biomolecular interactions with AlphaFold 3”. In: Nature (2024),
pp. 1–3.
[28] Jinsen Li, Tsu-Pei Chiu, and Remo Rohs. “Deep DNAshape: Predicting DNA shape considering extended flanking regions using a deep learning method.” In: bioRxiv : the preprint
server for biology (Oct. 2023). doi: 10.1101/2023.10.22.563383.
[29] Pengpai Li and Zhi-Ping Liu. “GeoBind: segmentation of nucleic acid binding interface
on protein surface with geometric deep learning”. In: Nucleic Acids Research 51.10 (2023),
e60–e60.
[30] Pablo Gainza, Freyr Sverrisson, Frederico Monti, Emanuele Rodola, Davide Boscaini,
Michael M Bronstein, and Bruno E Correia. “Deciphering interaction fingerprints from
protein molecular surfaces using geometric deep learning”. In: Nature Methods 17.2 (2020),
pp. 184–192.
[31] Pablo Gainza, Sarah Wehrle, Alexandra Van Hall-Beauvais, Anthony Marchand, Andreas
Scheck, Zander Harteveld, Stephen Buckley, Dongchun Ni, Shuguang Tan, Freyr Sverrisson,
et al. “De novo design of protein interactions with learned surface fingerprints”. In: Nature
617.7959 (2023), pp. 176–184.
[32] Shujun He, Rui Huang, Jill Townley, Rachael C Kretsch, Thomas G Karagianes, David BT
Cox, Hamish Blair, Dmitry Penzar, Valeriy Vyaltsev, Elizaveta Aristova, et al. “Ribonanza:
deep learning of RNA structure through dual crowdsourcing”. In: bioRxiv (2024).
[33] Weizhu Yan, Yanhui Zheng, Xiaotao Zeng, Bin He, and Wei Cheng. “Structural biology
of SARS-CoV-2: open the door for novel therapies”. In: Signal Transduction and Targeted
Therapy 7.1 (2022), p. 26.
[34] Cody B Jackson, Michael Farzan, Bing Chen, and Hyeryun Choe. “Mechanisms of SARS-CoV-2 entry into cells”. In: Nature reviews Molecular cell biology 23.1 (2022), pp. 3–20.
[35] wwPDB consortium. “Protein Data Bank: the single global archive for 3D macromolecular
structure data”. In: Nucleic acids research 47.D1 (2019), pp. D520–D528.
[36] Jasmine Cubuk, Jhullian J Alston, J Jeremias Incicco, Sukrit Singh, Melissa D Stuchell-Brereton, Michael D Ward, Maxwell I Zimmerman, Neha Vithani, Daniel Griffith, Jason A
Wagoner, et al. “The SARS-CoV-2 nucleocapsid protein is dynamic, disordered, and phase
separates with RNA”. In: Nature communications 12.1 (2021), pp. 1–17.
[37] Karen H. Vousden and Carol Prives. “Blinded by the Light: The Growing Complexity of p53”.
In: Cell 137 (3 May 2009), pp. 413–431. issn: 00928674. doi: 10.1016/j.cell.2009.04.037.
[38] Malka Kitayner, Haim Rozenberg, Naama Kessler, Dov Rabinovich, Lihi Shaulov, Tali E.
Haran, and Zippora Shakked. “Structural Basis of DNA Recognition by p53 Tetramers”.
In: Molecular Cell 22 (6 June 2006), pp. 741–753. issn: 10972765. doi: 10.1016/j.molcel.
2006.05.015.
[39] Khaled Barakat, Bilkiss B. Issack, Maria Stepanova, and Jack Tuszynski. “Effects of Temperature on the p53-DNA Binding Interactions and Their Dynamical Behavior: Comparing the
Wild Type to the R248Q Mutant”. In: PLoS ONE 6 (11 Nov. 2011), e27651. issn: 1932-6203.
doi: 10.1371/journal.pone.0027651.
[40] Malka Kitayner, Haim Rozenberg, Remo Rohs, Oded Suad, Dov Rabinovich, Barry Honig,
and Zippora Shakked. “Diversity in DNA recognition by p53 revealed by crystal structures
with Hoogsteen base pairs”. In: Nature Structural & Molecular Biology 17 (4 Apr. 2010),
pp. 423–429. issn: 1545-9993. doi: 10.1038/nsmb.1800.
[41] Seungsoo Kim, Ekaterina Morgunova, Sahin Naqvi, Seppe Goovaerts, Maram Bader, Mervenaz Koska, Alexander Popov, Christy Luong, Angela Pogson, Tomek Swigut, et al.
“DNA-guided transcription factor cooperativity shapes face and limb mesenchyme”. In:
Cell 187.3 (2024), pp. 692–711.
[42] Alicia K Michael, Ralph S Grand, Luke Isbel, Simone Cavadini, Zuzanna Kozicka, Georg
Kempf, Richard D Bunker, Andreas D Schenk, Alexandra Graff-Meyer, Ganesh R Pathare,
et al. “Mechanisms of OCT4-SOX2 motif readout on nucleosomes”. In: Science 368.6498
(2020), pp. 1460–1465.
[43] Matthew G Durrant, Nicholas T Perry, James J Pai, Aditya R Jangid, Januka S Athukoralage,
Masahiro Hiraizumi, John P McSpedon, April Pawluk, Hiroshi Nishimasu, Silvana Konermann, et al. “Bridge RNAs direct programmable recombination of target and donor DNA”.
In: Nature 630.8018 (2024), pp. 984–993.
[44] Mihaly Varadi, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia,
Galabina Yordanova, David Yuan, Oana Stroe, Gemma Wood, Agata Laydon, et al. “AlphaFold Protein Structure Database: massively expanding the structural coverage of
protein-sequence space with high-accuracy models”. In: Nucleic acids research 50.D1 (2022),
pp. D439–D444.
[45] Jared M. Sagendorf, Raktim Mitra, Jiawei Huang, Xiaojiang S. Chen, and Remo Rohs.
“Structure-based prediction of protein-nucleic acid binding using graph neural networks”.
In: Biophysical Reviews 16 (3 June 2024), pp. 297–314. issn: 1867-2450. doi: 10.1007/
s12551-024-01201-w.
[46] Raktim Mitra, Jinsen Li, Jared M. Sagendorf, Yibei Jiang, Ari S. Cohen, Tsu-Pei Chiu,
Cameron J. Glasscock, and Remo Rohs. “Geometric deep learning of protein–DNA binding
specificity”. In: Nature Methods (Aug. 2024). issn: 1548-7091. doi: 10.1038/s41592-024-
02372-w.
[47] H. Yang. “Tools for the automatic identification and classification of RNA base pairs”. In:
Nucleic Acids Research 31 (13 July 2003), pp. 3450–3460. issn: 1362-4962. doi: 10.1093/
nar/gkg529.
[48] Raktim Mitra, Ari S Cohen, and Remo Rohs. “RNAscape: geometric mapping and customizable visualization of RNA structure”. In: Nucleic Acids Research (Apr. 2024). issn: 0305-1048.
doi: 10.1093/nar/gkae269.
[49] Jared M Sagendorf, Helen M Berman, and Remo Rohs. “DNAproDB: an interactive tool for
structural analysis of DNA–protein complexes”. In: Nucleic acids research 45 (W1 2017),
W89–W97.
[50] Jared M Sagendorf, Nicholas Markarian, Helen M Berman, and Remo Rohs. “DNAproDB: an
expanded database and web-based tool for structural analysis of DNA–protein complexes”.
In: Nucleic acids research 48 (D1 2020), pp. D277–D287.
[51] Ian C McDowell, Dinesh Manandhar, Christopher M Vockley, Amy K Schmid, Timothy E
Reddy, and Barbara E Engelhardt. “Clustering gene expression time series data using
an infinite Gaussian process mixture model”. In: PLoS computational biology 14.1 (2018),
e1005896.
[52] Raktim Mitra and Adam L MacLean. “RVAgene: Generative modeling of gene expression time series data”. In: Bioinformatics (Feb. 2021). btab260. issn: 1367-4803. doi: 10.
1093/bioinformatics/btab260. url: https://doi.org/10.1093/bioinformatics/
btab260.
[53] Lei Deng, Juan Pan, Xiaojie Xu, Wenyi Yang, Chuyao Liu, and Hui Liu. “PDRLGB: precise DNA-binding residue prediction using a light gradient boosting machine”. In: BMC
bioinformatics 19.19 (2018), pp. 135–145.
[54] Liangjiang Wang, Caiyan Huang, Mary Yang, and Jack Y Yang. “BindN+ for accurate
prediction of DNA and RNA-binding residues from protein sequence features”. In: BMC
Systems Biology 4.1 (2010), pp. 1–9.
[55] Liangjiang Wang and Susan J Brown. “BindN: a web-based tool for efficient prediction of
DNA and RNA binding sites in amino acid sequences”. In: Nucleic acids research 34.suppl 2
(2006), W243–W248.
[56] Tao Li, Qian-Zhong Li, Shuai Liu, Guo-Liang Fan, Yong-Chun Zuo, and Yong Peng.
“PreDNA: accurate prediction of DNA-binding sites in proteins by integrating sequence
and geometric structure information”. In: Bioinformatics 29.6 (2013), pp. 678–685.
[57] Philipp Krähenbühl and Vladlen Koltun. “Efficient inference in fully connected CRFs with
Gaussian edge potentials”. In: arXiv preprint arXiv:1210.5644 (2012).
[58] Brian Roark, Murat Saraclar, Michael Collins, and Mark Johnson. “Discriminative language
modeling with conditional random fields and the perceptron algorithm”. In: Proceedings of
the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04). 2004,
pp. 47–54.
[59] Andrew McCallum and Wei Li. “Early results for named entity recognition with conditional
random fields, feature induction and web-enhanced lexicons”. In: (2003).
[60] Zengjian Liu, Buzhou Tang, Xiaolong Wang, and Qingcai Chen. “De-identification of
clinical notes via recurrent neural network and conditional random field”. In: Journal of
biomedical informatics 75 (2017), S34–S42.
[61] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. “Scalable parallel programming with CUDA: Is CUDA the parallel programming model that application developers have
been waiting for?” In: Queue 6.2 (2008), pp. 40–53.
[62] Marvin TT Teichmann and Roberto Cipolla. “Convolutional CRFs for semantic segmentation”. In: arXiv preprint arXiv:1805.04777 (2018).
[63] Evangelos Kalogerakis, Aaron Hertzmann, and Karan Singh. “Learning 3D mesh segmentation and labeling”. In: ACM SIGGRAPH 2010 papers. 2010, pp. 1–12.
[64] Yuri Boykov, Olga Veksler, and Ramin Zabih. “Fast approximate energy minimization via
graph cuts”. In: IEEE Transactions on paern analysis and machine intelligence 23.11 (2001),
pp. 1222–1239.
[65] Kosta Ristovski, Vladan Radosavljevic, Slobodan Vucetic, and Zoran Obradovic. “Continuous conditional random fields for efficient regression in large fully connected graphs”. In:
Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 27. 1. 2013.
[66] Hongchang Gao, Jian Pei, and Heng Huang. “Conditional random field enhanced graph
convolutional neural networks”. In: Proceedings of the 25th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining. 2019, pp. 276–284.
[67] Dengyong Zhou and Bernhard Schölkopf. “A regularization framework for learning from
graph data”. In: ICML 2004 Workshop on Statistical Relational Learning and Its Connections
to Other Fields (SRL 2004). 2004, pp. 132–137.
[68] Yifan Hou, Jian Zhang, James Cheng, Kaili Ma, Richard TB Ma, Hongzhi Chen, and Ming-Chang Yang. “Measuring and improving the use of graph information in graph neural
networks”. In: International Conference on Learning Representations. 2019.
[69] Hongwei Wang, Fuzheng Zhang, Mengdi Zhang, Jure Leskovec, Miao Zhao, Wenjie Li, and
Zhongyuan Wang. “Knowledge-aware graph neural networks with label smoothness regularization for recommender systems”. In: Proceedings of the 25th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining. 2019, pp. 968–977.
[70] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
[71] Shandar Ahmad, M Michael Gromiha, and Akinori Sarai. “Analysis and prediction of
DNA-binding proteins and their binding residues based on composition, sequence and
structural information”. In: Bioinformatics 20.4 (2004), pp. 477–486.
[72] Alexey S Kuznetsov, Nancy J Kopell, and Charles J Wilson. “Transient high-frequency
firing in a coupled-oscillator model of the mesencephalic dopaminergic neuron”. In: Journal
of Neurophysiology 95.2 (2006), pp. 932–947.
[73] Jared M Sagendorf, Helen M Berman, and Remo Rohs. “DNAproDB: an interactive tool for
structural analysis of DNA–protein complexes”. In: Nucleic acids research 45.W1 (2017),
W89–W97.
[74] Jared M Sagendorf, Nicholas Markarian, Helen M Berman, and Remo Rohs. “DNAproDB: an
expanded database and web-based tool for structural analysis of DNA–protein complexes”.
In: Nucleic acids research 48.D1 (2020), pp. D277–D287.
[75] Gene Ontology Consortium. “The gene ontology resource: 20 years and still GOing strong”.
In: Nucleic acids research 47.D1 (2019), pp. D330–D338.
[76] Weizhong Li and Adam Godzik. “Cd-hit: a fast program for clustering and comparing large
sets of protein or nucleotide sequences”. In: Bioinformatics 22.13 (2006), pp. 1658–1659.
[77] Rasna R Walia, Cornelia Caragea, Benjamin A Lewis, Fadi Towfic, Michael Terribilini,
Yasser El-Manzalawy, Drena Dobbs, and Vasant Honavar. “Protein-RNA interface residue
prediction using machine learning: an assessment of the state of the art”. In: BMC bioinformatics 13.1 (2012), pp. 1–20.
[78] Vladimir Gligorijević, P Douglas Renfrew, Tomasz Kosciolek, Julia Koehler Leman, Daniel
Berenberg, Tommi Vatanen, Chris Chandler, Bryn C Taylor, Ian M Fisk, Hera Vlamakis,
et al. “Structure-based protein function prediction using graph convolutional networks”.
In: Nature communications 12.1 (2021), p. 3168.
[79] Qianmu Yuan, Sheng Chen, Jiahua Rao, Shuangjia Zheng, Huiying Zhao, and Yuedong
Yang. “AlphaFold2-aware protein–DNA binding site prediction using graph transformer”.
In: Briefings in Bioinformatics 23.2 (2022), bbab564.
[80] Ying Xia, Chun-Qiu Xia, Xiaoyong Pan, and Hong-Bin Shen. “GraphBind: protein structural
context embedded rules learned by hierarchical graph neural networks for recognizing
nucleic-acid-binding residues”. In: Nucleic acids research 49.9 (2021), e51–e51.
[81] Jérôme Tubiana, Dina Schneidman-Duhovny, and Haim J Wolfson. “ScanNet: an inter-
pretable geometric deep learning model for structure-based protein binding site prediction”.
In: Nature Methods 19.6 (2022), pp. 730–739.
[82] Lucien F Krapp, Luciano A Abriata, Fabio Cortes Rodriguez, and Matteo Dal Peraro.
“PeSTo: parameter-free geometric deep learning for accurate prediction of protein binding
interfaces”. In: Nature communications 14.1 (2023), p. 2175.
[83] Freyr Sverrisson, Jean Feydy, Bruno E Correia, and Michael M Bronstein. “Fast end-to-end
learning on protein surfaces”. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. 2021, pp. 15272–15281.
[84] François Spitz and Eileen E. M. Furlong. “Transcription factors: from enhancer binding to
developmental control”. In: Nature Reviews Genetics 13 (9 Sept. 2012), pp. 613–626. issn:
1471-0056. doi: 10.1038/nrg3207.
[85] Yue Zhao, David Granas, and Gary D. Stormo. “Inferring Binding Energies from Selected
Binding Sites”. In: PLoS Computational Biology 5 (12 Dec. 2009), e1000590. issn: 1553-7358.
doi: 10.1371/journal.pcbi.1000590.
[86] Christian U Stirnimann, Denis Ptchelkine, Clemens Grimm, and Christoph W Müller.
“Structural Basis of TBX5–DNA Recognition: The T-Box Domain in Its DNA-Bound and Unbound Form”. In: Journal of molecular biology 400.1 (2010), pp. 71–81.
[87] Claude Hélène. “Specific recognition of guanine bases in protein–nucleic acid complexes”.
In: FEBS Letters 74 (1 1977), pp. 10–13.
[88] Remo Rohs, Xiangshu Jin, Sean M. West, Rohit Joshi, Barry Honig, and Richard S. Mann.
“Origins of Specificity in Protein-DNA Recognition”. In: Annual Review of Biochemistry
79 (1 June 2010), pp. 233–269. issn: 0066-4154. doi: 10.1146/annurev-biochem-060408-
091030.
[89] Joel F. Schildbach, A. Wali Karzai, Brigitte E. Raumann, and Robert T. Sauer. “Origins of
DNA-binding specificity: Role of protein contacts with the DNA backbone”. In: Proceedings
of the National Academy of Sciences 96 (3 Feb. 1999), pp. 811–817. issn: 0027-8424. doi:
10.1073/pnas.96.3.811.
[90] N C Seeman, J M Rosenberg, and A Rich. “Sequence-specific recognition of double helical
nucleic acids by proteins.” In: Proceedings of the National Academy of Sciences 73 (3 Mar.
1976), pp. 804–808. issn: 0027-8424. doi: 10.1073/pnas.73.3.804.
[91] C W Garvie and C Wolberger. “Recognition of specific DNA sequences.” In: Molecular cell
8 (5 Nov. 2001), pp. 937–46. issn: 1097-2765. doi: 10.1016/s1097-2765(01)00392-6.
[92] Michael F Berger and Martha L Bulyk. “Universal protein-binding microarrays for the
comprehensive characterization of the DNA-binding specificities of transcription factors”.
In: Nature Protocols 4 (3 Mar. 2009), pp. 393–411. issn: 1754-2189. doi: 10.1038/nprot.
2008.195.
[93] Matthew Slattery et al. “Cofactor Binding Evokes Latent Differences in DNA Binding
Specificity between Hox Proteins”. In: Cell 147 (6 Dec. 2011), pp. 1270–1282. issn: 00928674.
doi: 10.1016/j.cell.2011.10.053.
[94] Peter J. Park. “ChIP–seq: advantages and challenges of a maturing technology”. In: Nature
Reviews Genetics 10 (10 Oct. 2009), pp. 669–680. issn: 1471-0056. doi: 10.1038/nrg2641.
[95] Arttu Jolma et al. “DNA-Binding Specificities of Human Transcription Factors”. In: Cell
152 (1-2 Jan. 2013), pp. 327–339. issn: 00928674. doi: 10.1016/j.cell.2012.12.009.
[96] Matthew Slattery, Tianyin Zhou, Lin Yang, Ana Carolina Dantas Machado, Raluca Gordân,
and Remo Rohs. “Absence of a simple code: how transcription factors read the genome”.
In: Trends in Biochemical Sciences 39 (9 Sept. 2014), pp. 381–399. issn: 09680004. doi:
10.1016/j.tibs.2014.07.002.
[97] Anton V Persikov and Mona Singh. “De novo prediction of DNA-binding specificities for
Cys2His2 zinc finger proteins”. In: Nucleic acids research 42.1 (2014), pp. 97–108.
[98] Joshua L. Wetzel, Kaiqian Zhang, and Mona Singh. “Learning probabilistic protein–DNA
recognition codes from DNA-binding specificities using structural mappings”. In: Genome
Research 32 (9 Sept. 2022), pp. 1776–1786. issn: 1088-9051. doi: 10.1101/gr.276606.122.
[99] Anton V Persikov, Robert Osada, and Mona Singh. “Predicting DNA recognition by
Cys2His2 zinc finger proteins”. In: Bioinformatics 25.1 (2009), pp. 22–29.
[100] Sofia Aizenshtein-Gazit and Yaron Orenstein. “DeepZF: improved DNA-binding prediction of C2H2-zinc-finger proteins by deep transfer learning”. In: Bioinformatics 38 (Supplement 2 Sept. 2022), pp. ii62–ii67. issn: 1367-4803. doi: 10.1093/bioinformatics/
btac469.
[101] Alberto Meseguer, Filip Årman, Oriol Fornes, Ruben Molina-Fernandez, Jaume Bonet,
Narcis Fernández-Fuentes, and Baldo Oliva. “On the prediction of DNA-binding preferences
of C2H2-ZF domains using structural models: application on human CTCF”. In: NAR
genomics and bioinformatics 2 (3 2020), lqaa046.
[102] Bhuvan Molparia, Kanav Goyal, Anita Sarkar, Sonu Kumar, and Durai Sundar. “ZiFPredict: a web tool for predicting DNA-binding specificity in C2H2 zinc finger proteins”.
In: Genomics, Proteomics & Bioinformatics 8.2 (2010), pp. 122–126.
[103] Ryan G Christensen, Metewo Selase Enuameh, Marcus B Noyes, Michael H Brodsky, Scot A
Wolfe, and Gary D Stormo. “Recognition models to predict DNA-binding specificities of
homeodomain proteins”. In: Bioinformatics 28.12 (2012), pp. i84–i89.
[104] Chen Yanover and Philip Bradley. “Extensive protein and DNA backbone sampling improves structure-based specificity prediction for C2H2 zinc fingers”. In: Nucleic Acids
Research 39 (11 June 2011), pp. 4564–4576. issn: 1362-4962. doi: 10.1093/nar/gkr048.
[105] Tsu-Pei Chiu, Satyanarayan Rao, and Remo Rohs. “Physicochemical models of protein–DNA
binding with standard and modified base pairs”. In: Proceedings of the National Academy of
Sciences 120 (4 Jan. 2023). issn: 0027-8424. doi: 10.1073/pnas.2205796120.
[106] Alexandre V Morozov, James J Havranek, David Baker, and Eric D Siggia. “Protein–DNA
binding specificity predictions with structural models”. In: Nucleic acids research 33.18
(2005), pp. 5781–5798.
[107] Gary D. Stormo. “Modeling the specificity of protein-DNA interactions”. In: Quantitative
Biology 1 (2 June 2013), pp. 115–130. issn: 2095-4689. doi: 10.1007/s40484-013-0012-4.
[108] Gustaf Ahdritz et al. “OpenFold: retraining AlphaFold2 yields new insights into its learning
mechanisms and capacity for generalization”. In: Nature Methods (May 2024). issn: 1548-
7091. doi: 10.1038/s41592-024-02272-z.
[109] Reza Esmaeeli, Antonio Bauzá, and Alberto Perez. “Structural predictions of protein–DNA
binding: MELD-DNA”. In: Nucleic Acids Research (Feb. 2023). issn: 0305-1048. doi: 10.
1093/nar/gkad013.
[110] Kim L Morrison and Gregory A Weiss. “Combinatorial alanine-scanning”. In: Current
Opinion in Chemical Biology 5 (3 June 2001), pp. 302–307. issn: 13675931. doi: 10.1016/
S1367-5931(00)00206-4.
[111] Cameron J Glasscock et al. “Computational design of sequence-specific DNA-binding
proteins.” In: bioRxiv : the preprint server for biology (Sept. 2023). doi: 10.1101/2023.09.
20.558720.
[112] Rohit Joshi, Jonathan M. Passner, Remo Rohs, Rinku Jain, Alona Sosinsky, Michael A. Crickmore, Vinitha Jacob, Aneel K. Aggarwal, Barry Honig, and Richard S. Mann. “Functional
Specificity of a Hox Protein Mediated by the Recognition of Minor Groove Structure”. In:
Cell 131 (3 Nov. 2007), pp. 530–543. issn: 00928674. doi: 10.1016/j.cell.2007.09.024.
[113] Jaime A Castro-Mondragon et al. “JASPAR 2022: the 9th release of the open-access database
of transcription factor binding profiles”. In: Nucleic Acids Research 50 (D1 Jan. 2022),
pp. D165–D173. issn: 0305-1048. doi: 10.1093/nar/gkab1113.
[114] Ivan V Kulakovskiy, Ilya E Vorontsov, Ivan S Yevshin, Ruslan N Sharipov, Alla D Fedorova,
Eugene I Rumynskiy, Yulia A Medvedeva, Arturo Magana-Mora, Vladimir B Bajic, Dmitry
A Papatsenko, et al. “HOCOMOCO: towards a complete collection of transcription factor
binding models for human and mouse via large-scale ChIP-Seq analysis”. In: Nucleic acids
research 46.D1 (2018), pp. D252–D259.
[115] Peter Agback, Herbert Baumann, Stefan Knapp, Rudolf Ladenstein, and Torleif Härd.
“Architecture of nonspecific protein–DNA interactions in the Sso7d–DNA complex”. In:
Nature Structural Biology 5 (7 July 1998), pp. 579–584. issn: 1072-8368. doi: 10.1038/836.
[116] Jaina Mistry et al. “Pfam: The protein families database in 2021”. In: Nucleic Acids Research
49 (D1 Jan. 2021), pp. D412–D419. issn: 0305-1048. doi: 10.1093/nar/gkaa913.
[117] Anton V Persikov and Mona Singh. “An expanded binding model for Cys2His2
zinc finger protein–DNA interfaces”. In: Physical Biology 8 (3 June 2011),
p. 035010. issn: 1478-3975. doi: 10.1088/1478-3975/8/3/035010.
[118] Anton V Persikov and Mona Singh. “De novo prediction of DNA-binding specificities for
Cys2His2 zinc finger proteins”. In: Nucleic acids research 42 (1 2014), pp. 97–108.
[119] David M. Ichikawa et al. “A universal deep-learning model for zinc finger design enables
transcription factor reprogramming”. In: Nature Biotechnology (Jan. 2023). issn: 1087-0156.
doi: 10.1038/s41587-022-01624-4.
[120] Christian U Stirnimann, Denis Ptchelkine, Clemens Grimm, and Christoph W Müller.
“Structural Basis of TBX5–DNA Recognition: The T-Box Domain in Its DNA-Bound and Unbound Form”. In: Journal of molecular biology 400 (1 2010), pp. 71–81.
[121] Carlos R Escalante, Junming Yie, Dimitris Thanos, and Aneel K Aggarwal. “Structure of
IRF-1 with bound DNA reveals determinants of interferon regulation”. In: Nature 391 (6662
1998), pp. 103–106.
[122] Xabier de Martin, Reza Sodaei, and Gabriel Santpere. “Mechanisms of Binding Specificity
among bHLH Transcription Factors”. In: International Journal of Molecular Sciences 22 (17
Aug. 2021), p. 9150. issn: 1422-0067. doi: 10.3390/ijms22179150.
[123] Valerio Mariani, Marco Biasini, Alessandro Barbato, and Torsten Schwede. “lDDT: a local
superposition-free score for comparing protein structures and models using distance
difference tests”. In: Bioinformatics 29 (21 Nov. 2013), pp. 2722–2728. issn: 1367-4803. doi:
10.1093/bioinformatics/btt473.
[124] Samuel Genheden and Ulf Ryde. “The MM/PBSA and MM/GBSA methods to estimate
ligand-binding affinities”. In: Expert Opinion on Drug Discovery 10 (5 May 2015), pp. 449–
461. issn: 1746-0441. doi: 10.1517/17460441.2015.1032936.
[125] Andreas C. Joerger and Alan R. Fersht. “Structural Biology of the Tumor Suppressor p53”.
In: Annual Review of Biochemistry 77 (1 June 2008), pp. 557–582. issn: 0066-4154. doi:
10.1146/annurev.biochem.77.060806.091238.
[126] Tom J Petty, Soheila Emamzadah, Lorenzo Costantino, Irina Petkova, Elena S Stavridi,
Jeffery G Saven, Eric Vauthey, and Thanos D Halazonetis. “An induced fit mechanism
regulates p53 DNA binding kinetics to confer sequence specificity”. In: The EMBO Journal
30 (11 June 2011), pp. 2167–2176. issn: 02614189. doi: 10.1038/emboj.2011.127.
[127] Shams Reaz, Mohanad Mossalam, Abood Okal, and Carol S. Lim. “A Single Mutant, A276S
of p53, Turns the Switch to Apoptosis”. In: Molecular Pharmaceutics 10 (4 Apr. 2013),
pp. 1350–1359. issn: 1543-8384. doi: 10.1021/mp300598k.
[128] Stacey N. Peterson, Frederick W. Dahlquist, and Norbert O. Reich. “The Role of High Affinity
Non-specific DNA Binding by Lrp in Transcriptional Regulation and DNA Organization”.
In: Journal of Molecular Biology 369 (5 June 2007), pp. 1307–1317. issn: 00222836. doi:
10.1016/j.jmb.2007.04.023.
[129] Damla Ovek, Zeynep Abali, Melisa Ece Zeylan, Ozlem Keskin, Attila Gursoy, and Nurcan
Tuncbag. “Artificial intelligence based methods for hot spot prediction”. In: Current Opinion
in Structural Biology 72 (Feb. 2022), pp. 209–218. issn: 0959440X. doi: 10.1016/j.sbi.
2021.11.003.
[130] Yunhui Peng, Lexuan Sun, Zhe Jia, Lin Li, and Emil Alexov. “Predicting protein–DNA
binding free energy change upon missense mutations using modified MM/PBSA approach:
SAMPDI webserver”. In: Bioinformatics 34 (5 Mar. 2018), pp. 779–786. issn: 1367-4803. doi:
10.1093/bioinformatics/btx698.
[131] Richard Stefl, Haihong Wu, Sapna Ravindranathan, Vladimír Sklenář, and Juli Feigon.
“DNA A-tract bending in three dimensions: Solving the dA4 T4 vs. dT4 A4 conundrum”.
In: Proceedings of the National Academy of Sciences 101 (5 Feb. 2004), pp. 1177–1182. issn:
0027-8424. doi: 10.1073/pnas.0308143100.
[132] Namiko Abe, Iris Dror, Lin Yang, Matthew Slattery, Tianyin Zhou, Harmen J. Bussemaker,
Remo Rohs, and Richard S. Mann. “Deconvolving the Recognition of DNA Shape from
Sequence”. In: Cell 161 (2 Apr. 2015), pp. 307–318. issn: 00928674. doi: 10.1016/j.cell.
2015.02.008.
[133] Tsu-Pei Chiu, Jinsen Li, Yibei Jiang, and Remo Rohs. “It is in the flanks: Conformational
flexibility of transcription factor binding sites”. In: Biophysical Journal 121 (20 Oct. 2022),
pp. 3765–3767. issn: 00063495. doi: 10.1016/j.bpj.2022.09.020.
[134] Debostuti Ghoshdastidar and Manju Bansal. “Flexibility of flanking DNA is a key determinant of transcription factor affinity for the core motif”. In: Biophysical Journal 121 (20 Oct.
2022), pp. 3987–4000. issn: 00063495. doi: 10.1016/j.bpj.2022.08.015.
[135] Simon F. Tolić-Nørrelykke, Mette B. Rasmussen, Francesco S. Pavone, Kirstine Berg-
Sørensen, and Lene B. Oddershede. “Stepwise Bending of DNA by a Single TATA-Box
Binding Protein”. In: Biophysical Journal 90 (10 May 2006), pp. 3694–3703. issn: 00063495.
doi: 10.1529/biophysj.105.074856.
[136] Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge
Weissig, Ilya N Shindyalov, and Philip E Bourne. “The protein data bank”. In: Nucleic acids
research 28 (1 2000), pp. 235–242.
[137] Xiang-Jun Lu, Harmen J Bussemaker, and Wilma K Olson. “DSSR: an integrated software
tool for dissecting the spatial structure of RNA”. In: Nucleic acids research 43.21 (2015),
e142–e142.
[138] Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. “CD-HIT: accelerated for
clustering the next-generation sequencing data”. In: Bioinformatics 28.23 (2012), pp. 3150–
3152.
[139] R. Lavery, M. Moakher, J. H. Maddocks, D. Petkeviciute, and K. Zakrzewska. “Conformational analysis of nucleic acids revisited: Curves+”. In: Nucleic Acids Research 37 (17 Sept.
2009), pp. 5917–5929. issn: 1362-4962. doi: 10.1093/nar/gkp608.
[140] Xiang-Jun Lu and Wilma K Olson. “3DNA: a versatile, integrated software system for
the analysis, rebuilding and visualization of three-dimensional nucleic-acid structures”.
In: Nature Protocols 3 (7 July 2008), pp. 1213–1227. issn: 1754-2189. doi: 10.1038/nprot.
2008.104.
[141] Richard Lavery and Heinz Sklenar. “Defining the structure of irregular nucleic acids:
conventions and principles”. In: Journal of Biomolecular Structure and Dynamics 6.4 (1989),
pp. 655–667.
[142] William R. Atchley, Jieping Zhao, Andrew D. Fernandes, and Tanja Drüke. “Solving the
protein sequence metric problem”. In: Proceedings of the National Academy of Sciences 102
(18 May 2005), pp. 6395–6400. issn: 0027-8424. doi: 10.1073/pnas.0408677102.
[143] Tian Xie and Jeffrey C. Grossman. “Crystal Graph Convolutional Neural Networks for an
Accurate and Interpretable Prediction of Material Properties”. In: Physical Review Letters
120 (14 Apr. 2018), p. 145301. issn: 0031-9007. doi: 10.1103/PhysRevLett.120.145301.
[144] Abien Fred Agarap. “Deep learning using rectified linear units (relu)”. In: arXiv preprint
arXiv:1803.08375 (2018).
[145] Lei Deng, Juan Pan, Xiaojie Xu, Wenyi Yang, Chuyao Liu, and Hui Liu. “PDRLGB: precise DNA-binding residue prediction using a light gradient boosting machine”. In: BMC
bioinformatics 19 (19 2018), pp. 135–145.
[146] John S. Bridle. “Probabilistic Interpretation of Feedforward Classification Network Outputs,
with Relationships to Statistical Pattern Recognition”. In: Springer Berlin Heidelberg, 1990,
pp. 227–236. doi: 10.1007/978-3-642-76153-9_28.
[147] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In:
arXiv 1412.6980 (2017).
[148] Douglas G. Bonett and Thomas A. Wright. “Sample size requirements for estimating
Pearson, Kendall and Spearman correlations”. In: Psychometrika 65 (1 Mar. 2000), pp. 23–28.
issn: 0033-3123. doi: 10.1007/BF02294183.
[149] Francis McIntyre and F. N. David. “Tables of the Ordinates and Probability Integral of the
Distribution of the Correlation Coefficient in Small Samples.” In: Journal of the American
Statistical Association 33 (204 Dec. 1938), p. 751. issn: 01621459. doi: 10.2307/2279076.
[150] R. D. Finn, J. Clements, and S. R. Eddy. “HMMER web server: interactive sequence similarity
searching”. In: Nucleic Acids Research 39 (suppl July 2011), W29–W37. issn: 0305-1048. doi:
10.1093/nar/gkr367.
[151] Mark James Abraham, Teemu Murtola, Roland Schulz, Szilárd Páll, Jeremy C. Smith, Berk
Hess, and Erik Lindahl. “GROMACS: High performance molecular simulations through
multi-level parallelism from laptops to supercomputers”. In: SoftwareX 1-2 (Sept. 2015),
pp. 19–25. issn: 23527110. doi: 10.1016/j.softx.2015.06.001.
[152] James A. Maier, Carmenza Martinez, Koushik Kasavajhala, Lauren Wickstrom, Kevin E.
Hauser, and Carlos Simmerling. “ff14SB: Improving the Accuracy of Protein Side Chain
and Backbone Parameters from ff99SB”. In: Journal of Chemical Theory and Computation
11 (8 Aug. 2015), pp. 3696–3713. issn: 1549-9618. doi: 10.1021/acs.jctc.5b00255.
[153] Ivan Ivani et al. “Parmbsc1: a refined force field for DNA simulations”. In: Nature Methods
13 (1 Jan. 2016), pp. 55–58. issn: 1548-7091. doi: 10.1038/nmeth.3658.
[154] Tom Darden, Darrin York, and Lee Pedersen. “Particle mesh Ewald: An N·log(N) method
for Ewald sums in large systems”. In: The Journal of Chemical Physics 98 (12 June 1993),
pp. 10089–10092. issn: 0021-9606. doi: 10.1063/1.464397.
[155] Berk Hess, Henk Bekker, Herman J. C. Berendsen, and Johannes G. E. M. Fraaije. “LINCS:
A linear constraint solver for molecular simulations”. In: Journal of Computational Chemistry 18 (12 Sept. 1997), pp. 1463–1472. issn: 0192-8651. doi: 10.1002/(SICI)1096-
987X(199709)18:12<1463::AID-JCC4>3.0.CO;2-H.
[156] Yibei Jiang, Tsu-Pei Chiu, Raktim Mitra, and Remo Rohs. “Probing the role of the protonation state of a minor groove-linker histidine in Exd-Hox–DNA binding”. In: Biophysical
Journal (Dec. 2023). issn: 00063495. doi: 10.1016/j.bpj.2023.12.013.
[157] Schrödinger, LLC and Warren DeLano. PyMOL. Version 2.4.0. May 20, 2020. url: http:
//www.pymol.org/pymol.
[158] Ryan G. Christensen, Metewo Selase Enuameh, Marcus B. Noyes, Michael H. Brodsky, Scot
A. Wolfe, and Gary D. Stormo. “Recognition models to predict DNA-binding specificities of
homeodomain proteins”. In: Bioinformatics 28 (12 June 2012), pp. i84–i89. issn: 1367-4811.
doi: 10.1093/bioinformatics/bts202.
[159] Iris Dror, Tianyin Zhou, Yael Mandel-Gutfreund, and Remo Rohs. “Covariation between
homeodomain transcription factors and the shape of their DNA binding sites”. In: Nucleic
acids research 42 (1 2014), pp. 430–441.
[160] Marcus B Noyes, Ryan G Christensen, Atsuya Wakabayashi, Gary D Stormo, Michael H
Brodsky, and Scot A Wolfe. “Analysis of homeodomain specificities allows the family-wide
prediction of preferred recognition sites”. In: Cell 133.7 (2008), pp. 1277–1289.
[161] Alberto Meseguer, Filip Årman, Oriol Fornes, Ruben Molina-Fernandez, Jaume Bonet,
Narcis Fernandez-Fuentes, and Baldo Oliva. “On the prediction of DNA-binding preferences
of C2H2-ZF domains using structural models: application on human CTCF”. In: NAR
genomics and bioinformatics 2.3 (2020), lqaa046.
[162] Anton V Persikov, Joshua L Wetzel, Elizabeth F Rowland, Benjamin L Oakes, Denise J
Xu, Mona Singh, and Marcus B Noyes. “A systematic survey of the Cys2His2 zinc finger
DNA-binding landscape”. In: Nucleic acids research 43.3 (2015), pp. 1965–1984.
[163] Z. Otwinowski, R. W. Schevitz, R.-G. Zhang, C. L. Lawson, A. Joachimiak, R. Q. Marmorstein,
B. F. Luisi, and P. B. Sigler. “Crystal structure of trp represser/operator complex at atomic
resolution”. In: Nature 335 (6188 Sept. 1988), pp. 321–329. issn: 0028-0836. doi: 10.1038/
335321a0.
[164] Tianyin Zhou, Ning Shen, Lin Yang, Namiko Abe, John Horton, Richard S. Mann, Harmen J.
Bussemaker, Raluca Gordân, and Remo Rohs. “Quantitative modeling of transcription factor
binding specificities using DNA shape”. In: Proceedings of the National Academy of Sciences
112 (15 Apr. 2015), pp. 4654–4659. issn: 0027-8424. doi: 10.1073/pnas.1422023112.
[165] Satish K. Nair and Stephen K. Burley. “X-Ray Structures of Myc-Max and Mad-Max Recognizing DNA”. In: Cell 112 (2 Jan. 2003), pp. 193–205. issn: 00928674. doi: 10.1016/S0092-
8674(02)01284-9.
[166] Ariel Afek, Honglue Shi, Atul Rangadurai, Harshit Sahay, Alon Senitzki, Suela Xhani,
Mimi Fang, Raul Salinas, Zachery Mielko, Miles A Pufall, et al. “DNA mismatches reveal
conformational penalties in protein–DNA recognition”. In: Nature 587.7833 (2020), pp. 291–
296.
[167] Phillip J Tomezsko et al. “Determination of RNA structural diversity and its role in HIV-1
RNA splicing.” In: Nature 582 (7812 June 2020), pp. 438–442. issn: 1476-4687. doi: 10.
1038/s41586-020-2253-5.
[168] Stefan E Seemann, Aashiq H Mirza, Claus Hansen, Claus H Bang-Berthelsen, Christian
Garde, Mikkel Christensen-Dalsgaard, Elfar Torarinsson, Zizhen Yao, Christopher T Workman, Flemming Pociot, et al. “The identification and functional annotation of RNA structures conserved in vertebrates”. In: Genome research 27.8 (2017), pp. 1371–1383.
[169] Stefanie A Mortimer, Mary Anne Kidwell, and Jennifer A Doudna. “Insights into RNA
structure and function from genome-wide studies.” In: Nature reviews. Genetics 15 (7 July
2014), pp. 469–79. issn: 1471-0064. doi: 10.1038/nrg3681.
[170] Batey R. T., Rambo R. P., and J. A. Doudna. “Tertiary motifs in RNA structure and folding.”
In: Angewandte Chemie International Edition 38 (16 1999), pp. 2326–2343.
[171] Philip Z Johnson and Anne E Simon. “RNAcanvas: interactive drawing and exploration of
nucleic acid structures”. In: Nucleic Acids Research 51 (W1 July 2023), W501–W508. issn:
0305-1048. doi: 10.1093/nar/gkad302.
[172] Blake A. Sweeney et al. “R2DT is a framework for predicting and visualising RNA secondary
structure using templates”. In: Nature Communications 12 (1 June 2021), p. 3494. issn: 2041-
1723. doi: 10.1038/s41467-021-23555-5.
[173] Zasha Weinberg and Ronald R Breaker. “R2R - software to speed the depiction of aesthetic
consensus RNA secondary structures”. In: BMC Bioinformatics 12 (1 Dec. 2011), p. 3. issn:
1471-2105. doi: 10.1186/1471-2105-12-3.
[174] Daniel Wiegreffe, Daniel Alexander, Peter F Stadler, and Dirk Zeckzer. “RNApuzzler:
efficient outerplanar drawing of RNA-secondary structures”. In: Bioinformatics 35 (8 Apr.
2019), pp. 1342–1349. issn: 1367-4803. doi: 10.1093/bioinformatics/bty817.
[175] Boris Shabash and Kay C. Wiese. “jViz.RNA 4.0—Visualizing pseudoknots and RNA editing
employing compressed tree graphs”. In: PLOS ONE 14 (5 May 2019), e0210281. issn: 1932-
6203. doi: 10.1371/journal.pone.0210281.
[176] Peter De Rijk, Jan Wuyts, and Rupert De Wachter. “RnaViz 2: an improved representation of
RNA secondary structure”. In: Bioinformatics 19 (2 Jan. 2003), pp. 299–300. issn: 1367-4811.
doi: 10.1093/bioinformatics/19.2.299.
[177] Yanga Byun and Kyungsook Han. “PseudoViewer3: generating planar drawings of large-scale RNA structures with pseudoknots”. In: Bioinformatics 25 (11 June 2009), pp. 1435–
1437. issn: 1367-4811. doi: 10.1093/bioinformatics/btp252.
[178] Kévin Darty, Alain Denise, and Yann Ponty. “VARNA: Interactive drawing and editing of
the RNA secondary structure”. In: Bioinformatics 25 (15 Aug. 2009), pp. 1974–1975. issn:
1367-4811. doi: 10.1093/bioinformatics/btp250.
[179] Peter Kerpedjiev, Stefan Hammer, and Ivo L. Hofacker. “Forna (force-directed RNA): Simple
and effective online RNA secondary structure diagrams”. In: Bioinformatics 31 (20 Oct.
2015), pp. 3377–3379. issn: 1367-4811. doi: 10.1093/bioinformatics/btv372.
[180] Jacob S Lu, Eckart Bindewald, Wojciech K Kasprzak, and Bruce A Shapiro. “RiboSketch:
versatile visualization of multi-stranded RNA and DNA secondary structure”. In: Bioinformatics 34 (24 Dec. 2018), pp. 4297–4299. issn: 1367-4803. doi: 10.1093/bioinformatics/
bty468.
[181] M.S. Waterman and T.F. Smith. “RNA secondary structure: a complete mathematical
analysis”. In: Mathematical Biosciences 42 (3-4 Dec. 1978), pp. 257–266. issn: 00255564. doi:
10.1016/0025-5564(78)90099-8.
[182] Vincent Mallet, Carlos Oliver, Jonathan Broadbent, William L Hamilton, and Jérôme
Waldispühl. “RNAglib: a python package for RNA 2.5D graphs.” In: Bioinformatics (Oxford,
England) 38 (5 Feb. 2022), pp. 1458–1459. issn: 1367-4811. doi: 10.1093/bioinformatics/
btab844.
[183] Catherine L Lawson, Helen M Berman, Li Chen, Brinda Vallat, and Craig L Zirbel. “The
Nucleic Acid Knowledgebase: a new portal for 3D structural information about nucleic
acids.” In: Nucleic acids research 52 (D1 Jan. 2024), pp. D245–D254. issn: 1362-4962. doi:
10.1093/nar/gkad957.
[184] W. Saenger. Principles of Nucleic Acid Structure. Springer, 1984, pp. 120–121.
[185] Tanaya Bose, Gil Fridkin, Chen Davidovich, Miri Krupkin, Nikita Dinger, Alla H Falkovich,
Yoav Peleg, Ilana Agmon, Anat Bashan, and Ada Yonath. “Origin of life: protoribosome
forms peptide bonds and links RNA and protein dominated worlds”. In: Nucleic Acids
Research 50 (4 Feb. 2022), pp. 1815–1828. issn: 0305-1048. doi: 10.1093/nar/gkac052.
[186] Maciej Antczak, Marcin Zablocki, Tomasz Zok, Agnieszka Rybarczyk, Jacek Blazewicz,
and Marta Szachniuk. “RNAvista: a webserver to assess RNA secondary structures with
non-canonical base pairs”. In: Bioinformatics 35.1 (2019), pp. 152–155.
[187] Django Software Foundation. Django. May 2019. url: https://djangoproject.com.
[188] Karl Pearson. “LIII. On lines and planes of closest fit to systems of points in space”.
In: The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11
Nov. 1901), pp. 559–572. issn: 1941-5982. doi: 10.1080/14786440109462720.
[189] John D. Hunter. “Matplotlib: A 2D Graphics Environment”. In: Computing in Science &
Engineering 9 (3 2007), pp. 90–95. issn: 1521-9615. doi: 10.1109/MCSE.2007.55.
[190] Aric Hagberg, Daniel Schult, and Pieter Swart. Exploring network structure, dynamics, and
function using NetworkX. 2008.
[191] Natalie Krahn, Jonathan T. Fischer, and Dieter Söll. “Naturally Occurring tRNAs With
Non-canonical Structures”. In: Frontiers in Microbiology 11 (Oct. 2020). issn: 1664-302X.
doi: 10.3389/fmicb.2020.596914.
[192] T Hermann and E Westhof. “Non-Watson-Crick base pairs in RNA-protein recognition.”
In: Chemistry & biology 6 (12 Dec. 1999), R335–43. issn: 1074-5521. doi: 10.1016/s1074-
5521(00)80003-4.
[193] Parin Sripakdeevong, Wipapat Kladwang, and Rhiju Das. “An enumerative stepwise ansatz
enables atomic-accuracy RNA loop modeling.” In: Proceedings of the National Academy of
Sciences of the United States of America 108 (51 Dec. 2011), pp. 20573–8. issn: 1091-6490.
doi: 10.1073/pnas.1106516108.
[194] Kersten T Schroeder, Scott A McPhee, Jonathan Ouellet, and David M J Lilley. “A structural
database for k-turn motifs in RNA.” In: RNA (New York, N.Y.) 16 (8 Aug. 2010), pp. 1463–8.
issn: 1469-9001. doi: 10.1261/rna.2207910.
[195] Natalie Krahn et al. “tRNA shape is an identity element for an archaeal pyrrolysyl-tRNA
synthetase from the human gut”. In: Nucleic Acids Research 52 (2 Jan. 2024), pp. 513–524.
issn: 0305-1048. doi: 10.1093/nar/gkad1188.
[196] Charles R Harris, K Jarrod Millman, Stéfan J van der Walt, Ralf Gommers, Pauli Virtanen,
David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, et al.
“Array programming with NumPy”. In: Nature 585.7825 (2020), pp. 357–362.
[197] N B Leontis and E Westhof. “Geometric nomenclature and classification of RNA base
pairs.” In: RNA (New York, N.Y.) 7 (4 Apr. 2001), pp. 499–512. issn: 1355-8382. doi: 10.1017/
s1355838201002515.
[198] William K M Lai and B Franklin Pugh. “Understanding nucleosome dynamics and their
links to gene expression and DNA replication.” In: Nature reviews. Molecular cell biology
18 (9 Sept. 2017), pp. 548–562. issn: 1471-0080. doi: 10.1038/nrm.2017.47.
[199] Ch. Koti Reddy, Achintya Das, and B. Jayaram. “Do water molecules mediate protein-DNA
recognition? Edited by B. Honig”. In: Journal of Molecular Biology 314 (3 Nov. 2001),
pp. 619–632. issn: 00222836. doi: 10.1006/jmbi.2001.5154.
[200] Ronny Lorenz, Stephan H Bernhart, Christian Höner zu Siederdissen, Hakim Tafer, Christoph
Flamm, Peter F Stadler, and Ivo L Hofacker. “ViennaRNA Package 2.0”. In: Algorithms for
Molecular Biology 6 (1 Dec. 2011), p. 26. issn: 1748-7188. doi: 10.1186/1748-7188-6-26.
[201] Ieva Rauluseviciute et al. “JASPAR 2024: 20th anniversary of the open-access database of
transcription factor binding profiles”. In: Nucleic Acids Research 52 (D1 Jan. 2024), pp. D174–
D182. issn: 0305-1048. doi: 10.1093/nar/gkad1059.
[202] Jordan A Webb, Edward Farrow, Brittany Cain, Zhenyu Yuan, Alexander E Yarawsky,
Emma Schoch, Ellen K Gagliani, Andrew B Herr, Brian Gebelein, and Rhett A Kovall.
“Cooperative Gsx2–DNA binding requires DNA bending and a novel Gsx2 homeodomain
interface”. In: Nucleic Acids Research 52 (13 July 2024), pp. 7987–8002. issn: 0305-1048. doi:
10.1093/nar/gkae522.
[203] Guido Van Rossum and Fred L Drake Jr. Python reference manual. Centrum voor Wiskunde
en Informatica Amsterdam, 1995.
[204] Guido Van Rossum and Fred L Drake. Python 3 Reference Manual. CreateSpace, 2009. isbn:
1441412697.
[205] I K McDonald and J M Thornton. “Satisfying hydrogen bonding potential in proteins.” In:
Journal of molecular biology 238 (5 May 1994), pp. 777–93. issn: 0022-2836. doi: 10.1006/
jmbi.1994.1334.
[206] Bernhard C. Thiel, Irene K. Beckmann, Peter Kerpedjiev, and Ivo L. Hofacker. “3D based on
2D: Calculating helix angles and stacking patterns using forgi 2.0, an RNA Python library
centered on secondary structure elements.” In: F1000Research 8 (Apr. 2019), p. 287. issn:
2046-1402. doi: 10.12688/f1000research.18458.2.
[207] Xue Lin, Dongmei Niu, Xiuyang Zhao, Bo Yang, and Caiming Zhang. “A novel method for
graph matching based on belief propagation”. In: Neurocomputing 325 (2019), pp. 131–141.
[208] Fraydoon Rastinejad, Trixie Wagner, Qiang Zhao, and Sepideh Khorasanizadeh. “Structure
of the RXR–RAR DNA-binding complex on the retinoic acid response element DR1”. In:
The EMBO Journal 19 (5 Mar. 2000), pp. 1045–1054. issn: 0261-4189. doi: 10.1093/emboj/
19.5.1045.
[209] Andriy Kryshtafovych, Torsten Schwede, Maya Topf, Krzysztof Fidelis, and John Moult.
“Critical assessment of methods of protein structure prediction (CASP)—Round XV”. In:
Proteins: Structure, Function, and Bioinformatics 91.12 (2023), pp. 1539–1549.
[210] Susan Jones, David TA Daley, Nicholas M Luscombe, Helen M Berman, and Janet M
Thornton. “Protein–RNA interactions: a structural analysis”. In: Nucleic acids research 29.4
(2001), pp. 943–954.
[211] Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E
Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, et al. “De novo
design of protein structure and function with RFdiffusion”. In: Nature 620.7976 (2023),
pp. 1089–1100.
[212] Grzegorz Chojnowski, Tomasz Waleń, and Janusz M Bujnicki. “RNA Bricks—a database of
RNA 3D motifs and their interactions”. In: Nucleic acids research 42.D1 (2014), pp. D123–
D131.
[213] Xiang-Jun Lu, Harmen J Bussemaker, and Wilma K Olson. “DSSR: an integrated software
tool for dissecting the spatial structure of RNA”. In: Nucleic acids research 43 (21 2015),
e142–e142.
[214] Robbie P Joosten, Tim AH Te Beek, Elmar Krieger, Maarten L Hekkelman, Rob WW Hooft,
Reinhard Schneider, Chris Sander, and Gert Vriend. “A series of PDB related databases for
everyday needs”. In: Nucleic acids research 39.suppl 1 (2010), pp. D411–D419.
[215] Wolfgang Kabsch and Christian Sander. “Dictionary of protein secondary structure: pattern
recognition of hydrogen-bonded and geometrical features”. In: Biopolymers: Original
Research on Biomolecules 22.12 (1983), pp. 2577–2637.
[216] Mike Bostock. Force Layout. 2012. url: https://github.com/mbostock/d3/wiki/ForceLayout (visited on 03/25/2014).
[217] Wilma K Olson, Shuxiang Li, Thomas Kaukonen, Andrew V Colasanti, Yurong Xin, and
Xiang-Jun Lu. “Effects of noncanonical base pairing on RNA folding: structural context
and spatial arrangements of G· A pairs”. In: Biochemistry 58.20 (2019), pp. 2474–2487.
[218] Shun-Ching Wang, Yi-Tsao Chen, Roshan Satange, Jhih-Wei Chu, and Ming-Hon Hou.
“Structural basis for water modulating RNA duplex formation in the CUG repeats of
myotonic dystrophy type 1”. In: Journal of Biological Chemistry 299.7 (2023).
[219] Hiroshi Nishimasu, F Ann Ran, Patrick D Hsu, Silvana Konermann, Soraya I Shehata,
Naoshi Dohmae, Ryuichiro Ishitani, Feng Zhang, and Osamu Nureki. “Crystal structure of
Cas9 in complex with guide RNA and target DNA”. In: Cell 156.5 (2014), pp. 935–949.
[220] Arjun Raj and Alexander Van Oudenaarden. “Nature, nurture, or chance: stochastic gene
expression and its consequences”. In: Cell 135.2 (2008), pp. 216–226.
[221] Travers Ching, Daniel S Himmelstein, Brett K Beaulieu-Jones, Alexandr A Kalinin, Brian T
Do, Gregory P Way, Enrico Ferrero, Paul-Michael Agapow, Michael Zietz, Michael M
Hoffman, et al. “Opportunities and obstacles for deep learning in biology and medicine”.
In: Journal of The Royal Society Interface 15.141 (2018), p. 20170387.
[222] Geoffrey E Hinton and Ruslan R Salakhutdinov. “Reducing the dimensionality of data with
neural networks”. In: science 313.5786 (2006), pp. 504–507.
[223] Diederik P Kingma and Max Welling. “Auto-encoding variational bayes”. In: 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings
arXiv:1312.6114 (2014).
[224] Gregory P Way and Casey S Greene. “Extracting a biologically relevant latent space from
cancer transcriptomes with variational autoencoders”. In: BioRxiv (2017), p. 174474.
[225] Valentine Svensson, Roser Vento-Tormo, and Sarah A Teichmann. “Exponential scaling of
single-cell RNA-seq in the past decade”. In: Nature protocols 13.4 (2018), pp. 599–604.
[226] Yue Deng, Feng Bao, Qionghai Dai, Lani F Wu, and Steven J Altschuler. “Scalable analysis
of cell-type composition from single-cell transcriptomics using deep recurrent learning”.
In: Nature methods 16.4 (2019), pp. 311–314.
[227] Gökcen Eraslan, Lukas M Simon, Maria Mircea, Nikola S Mueller, and Fabian J Theis. “Single-
cell RNA-seq denoising using a deep count autoencoder”. In: Nature communications 10.1
(2019), pp. 1–14.
[228] Jingshu Wang, Divyansh Agarwal, Mo Huang, Gang Hu, Zilu Zhou, Chengzhong Ye, and
Nancy R Zhang. “Data denoising with transfer learning in single-cell transcriptomics”. In:
Nature methods 16.9 (2019), pp. 875–878.
[229] Divyanshu Talwar, Aanchal Mongia, Debarka Sengupta, and Angshul Majumdar. “AutoImpute: Autoencoder based imputation of single-cell RNA-seq data”. In: Scientific reports 8.1
(2018), pp. 1–11.
[230] Chieh Lin, Siddhartha Jain, Hannah Kim, and Ziv Bar-Joseph. “Using neural networks
for reducing the dimensions of single-cell RNA-Seq data”. In: Nucleic acids research 45.17
(2017), e156–e156.
[231] Jiarui Ding, Anne Condon, and Sohrab P Shah. “Interpretable dimensionality reduction of
single cell transcriptome data with deep generative models”. In: Nature communications
9.1 (2018), pp. 1–13.
[232] Dongfang Wang and Jin Gu. “VASC: dimension reduction and visualization of single-cell
RNA-seq data by deep variational autoencoder”. In: Genomics, proteomics & bioinformatics
16.5 (2018), pp. 320–331.
[233] Romain Lopez, Jeffrey Regier, Michael B Cole, Michael I Jordan, and Nir Yosef. “Deep
generative modeling for single-cell transcriptomics”. In: Nature methods 15.12 (2018),
pp. 1053–1058.
[234] Wouter Saelens, Robrecht Cannoodt, Helena Todorov, and Yvan Saeys. “A comparison of
single-cell trajectory inference methods”. In: Nature biotechnology 37.5 (2019), pp. 547–554.
[235] Ian C. McDowell, Dinesh Manandhar, Christopher M. Vockley, Amy K. Schmid, Timothy E.
Reddy, and Barbara E. Engelhardt. “Clustering gene expression time series data using an
infinite Gaussian process mixture model”. In: PLoS Computational Biology (2018). issn:
15537358. doi: 10.1371/journal.pcbi.1005896.
[236] Tomasz Jetka, Karol Nienałtowski, Sarah Filippi, Michael PH Stumpf, and Michał Komorowski. “An information-theoretic framework for deciphering pleiotropic and noisy
biochemical signaling”. In: Nature communications 9.1 (2018), pp. 1–9.
[237] Lingxue Zhu, Jing Lei, Lambertus Klei, Bernie Devlin, and Kathryn Roeder. “Semisoft
clustering of single-cell data”. In: Proceedings of the National Academy of Sciences 116.2
(2019), pp. 466–471.
[238] Otto Fabius and Joost R. van Amersfoort. “Variational recurrent auto-encoders”. In: 3rd International Conference on Learning Representations, ICLR 2015 - Workshop Track Proceedings.
2015.
[239] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. “Stochastic variational
inference”. In: Journal of Machine Learning Research (2013). issn: 15324435.
[240] Cheng Zhang, Judith Butepage, Hedvig Kjellstrom, and Stephan Mandt. “Advances in
Variational Inference”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence
(2019). issn: 19393539. doi: 10.1109/TPAMI.2018.2889774.
[241] John Ingraham and Debora Marks. “Variational inference for sparse and undirected models”.
In: International Conference on Machine Learning. 2017, pp. 1607–1616.
[242] Alexandre Bouchard-Côté and Michael I Jordan. “Variational inference over combinatorial
spaces”. In: Advances in Neural Information Processing Systems. 2010, pp. 280–288.
[243] Alexei Botchkarev. “Performance metrics (error measures) in machine learning regression,
forecasting and prognostics: Properties and typology”. In: arXiv preprint arXiv:1809.03006
(2018).
[244] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. “Abstractive text summarization using sequence-to-sequence rnns and beyond”. In: arXiv preprint arXiv:1602.06023
(2016).
[245] Pankaj Malhotra, Lovekesh Vig, Gautam Shroff, and Puneet Agarwal. “Long short term
memory networks for anomaly detection in time series”. In: Proceedings. Vol. 89. 2015,
pp. 89–94.
[246] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural computation 9 (8 1997), pp. 1735–1780.
[247] Allon M. Klein, Linas Mazutis, Ilke Akartuna, Naren Tallapragada, Adrian Veres, Victor Li,
Leonid Peshkin, David A. Weitz, and Marc W. Kirschner. “Droplet barcoding for single-cell
transcriptomics applied to embryonic stem cells”. In: Cell (2015). issn: 10974172. doi:
10.1016/j.cell.2015.04.044.
[248] Laleh Haghverdi, Maren Buettner, F Alexander Wolf, Florian Buettner, and Fabian J Theis.
“Diffusion pseudotime robustly reconstructs lineage branching”. In: Nature methods 13.10
(2016), p. 845.
[249] Sumin Jang, Sandeep Choubey, Leon Furchtgott, Ling-Nan Zou, Adele Doyle, Vilas Menon,
Ethan B Loew, Anne-Rachel Krostag, Refugio A Martinez, Linda Madisen, et al. “Dynamics
of embryonic stem cell differentiation inferred from single-cell transcriptomics show a
series of transitions through discrete cell states”. In: elife 6 (2017), e20487.
[250] George Cybenko. “Approximation by superpositions of a sigmoidal function”. In: Mathematics of control, signals and systems 2.4 (1989), pp. 303–314.
[251] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. “Multilayer feedforward networks
are universal approximators”. In: Neural networks 2.5 (1989), pp. 359–366.
[252] Ken-Ichi Funahashi. “On the approximate realization of continuous mappings by neural
networks”. In: Neural networks 2.3 (1989), pp. 183–192.
[253] Andrew R Barron. “Approximation and estimation bounds for artificial neural networks”.
In: Machine learning 14.1 (1994), pp. 115–133.
[254] Jing Liu, Sanjeev Kumar, Egor Dolzhenko, Gregory F Alvarado, Jinjin Guo, Can Lu, Yibu
Chen, Meng Li, Mark C Dessing, Riana K Parvez, et al. “Molecular characterization of the
transition from acute to chronic kidney injury following ischemia/reperfusion”. In: JCI
insight 2.18 (2017).
[255] Andrew Ransick, Nils O. Lindström, Jing Liu, Qin Zhu, Jin-Jin Guo, Gregory F. Alvarado,
Albert D. Kim, Hannah G. Black, Junhyong Kim, and Andrew P. McMahon. “Single-Cell
Profiling Reveals Sex, Lineage, and Regional Diversity in the Mouse Kidney”. English. In:
Developmental Cell 51.3 (Nov. 2019), 399–413.e7. issn: 1534-5807. doi: 10.1016/j.devcel.
2019.10.005.
[256] Joel Neugarten, Anjali Acharya, and Sharon R. Silbiger. “Effect of Gender on the Progression
of Nondiabetic Renal Disease: A Meta-Analysis”. en. In: Journal of the American Society of
Nephrology 11.2 (Feb. 2000), pp. 319–329. issn: 1046-6673, 1533-3450.
[257] Emma J Cooke, Richard S Savage, Paul DW Kirk, Robert Darkins, and David L Wild.
“Bayesian hierarchical clustering for microarray time series data with replicates and outlier
measurements”. In: BMC bioinformatics 12.1 (2011), p. 399.
[258] Thomas S Ferguson. “A Bayesian analysis of some nonparametric problems”. In: The annals
of statistics (1973), pp. 209–230.
[259] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew
Botvinick, Shakir Mohamed, and Alexander Lerchner. “beta-vae: Learning basic visual
concepts with a constrained variational framework”. In: (2016).
[260] Valentine Svensson, Adam Gayoso, Nir Yosef, and Lior Pachter. “Interpretable factor
models of single-cell RNA-seq via variational autoencoders”. In: Bioinformatics 36.11
(2020), pp. 3418–3421.
[261] Samuel K Ainsworth, Nicholas J Foti, Adrian KC Lee, and Emily B Fox. “oi-VAE: Output
interpretable VAEs for nonlinear group factor analysis”. In: International Conference on
Machine Learning. 2018, pp. 119–128.
[262] Shuonan Chen and Jessica C. Mar. “Evaluating Methods of Inferring Gene Regulatory
Networks Highlights Their Lack of Performance for Single Cell Gene Expression Data”. In:
BMC Bioinformatics 19.1 (June 2018), p. 232. issn: 1471-2105. doi: 10.1186/s12859-018-
2217-z.
[263] Atul Deshpande, Li-Fang Chu, Ron Stewart, and Anthony Gitter. “Network Inference with
Granger Causality Ensembles on Single-Cell Transcriptomic Data”. en. In: bioRxiv (Jan.
2019), p. 534834. doi: 10.1101/534834.
[264] Junil Kim, Simon T. Jakobsen, Kedar N. Natarajan, and Kyoung-Jae Won. “TENET: Gene
Network Reconstruction Using Transfer Entropy Reveals Key Regulatory Factors from
Single Cell Transcriptomic Data”. en. In: Nucleic Acids Research (2020). doi: 10.1093/nar/
gkaa1014.
[265] Baoshan Ma, Mingkun Fang, and Xiangtian Jiao. “Inference of Gene Regulatory Networks
Based on Nonlinear Ordinary Differential Equations”. In: Bioinformatics (2020).
[266] Pierre-Cyril Aubin-Frankowski and Jean-Philippe Vert. “Gene Regulation Inference from
Single-Cell RNA-Seq Data with Linear Differential Equations and Velocity Inference”. en.
In: Bioinformatics (2020), btaa576. doi: 10.1093/bioinformatics/btaa576.
[267] Hirotaka Matsumoto, Hisanori Kiryu, Chikara Furusawa, Minoru S H Ko, Shigeru B H Ko,
Norio Gouda, Tetsutaro Hayashi, and Itoshi Nikaido. “SCODE: An Efficient Regulatory
Network Inference Algorithm from Single-Cell RNA-Seq during Differentiation”. English.
In: Bioinformatics 33.15 (Apr. 2017), pp. 2314–2321. doi: 10 . 1093 / bioinformatics /
btx194.
[268] James Hensman, Neil D Lawrence, and Magnus Rattray. “Hierarchical Bayesian modelling
of gene expression time series across irregularly sampled replicates and clusters”. In: BMC
bioinformatics 14.1 (2013), p. 252.
[269] Stephen Wu, Sijia Liu, Sunghwan Sohn, Sungrim Moon, Chung-il Wi, Young Juhn, and
Hongfang Liu. “Modeling asynchronous event sequences with RNNs”. In: Journal of biomedical informatics 83 (2018), pp. 167–177.
[270] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. “Neural ordinary differential equations”. In: Advances in neural information processing systems. 2018,
pp. 6571–6583.
[271] Yulia Rubanova, Tian Qi Chen, and David K Duvenaud. “Latent Ordinary Differential Equations for Irregularly-Sampled Time Series”. In: Advances in Neural Information Processing
Systems. 2019, pp. 5321–5331.
[272] John R Hershey and Peder A Olsen. “Approximating the Kullback Leibler divergence
between Gaussian mixture models”. In: 2007 IEEE International Conference on Acoustics,
Speech and Signal Processing-ICASSP’07. Vol. 4. IEEE. 2007, pp. IV–317.
[273] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni,
Kai Arulkumaran, and Murray Shanahan. “Deep unsupervised clustering with gaussian
mixture variational autoencoders”. In: arXiv preprint arXiv:1611.02648 (2016).
[274] Richard Evans, Michael O’Neill, Alexander Pritzel, Natasha Antropova, Andrew Senior,
Tim Green, Augustin Žídek, Russ Bates, Sam Blackwell, Jason Yim, et al. “Protein complex
prediction with AlphaFold-Multimer”. In: biorxiv (2021), pp. 2021–10.
[275] Bohdan Schneider, Blake Alexander Sweeney, Alex Bateman, Jiri Cerny, Tomasz Zok, and
Marta Szachniuk. “When will RNA get its AlphaFold moment?” In: Nucleic Acids Research
51.18 (2023), pp. 9522–9532.
[276] Divya Nori and Wengong Jin. “RNAFlow: RNA Structure & Sequence Design via Inverse
Folding-Based Flow Matching”. In: arXiv preprint arXiv:2405.18768 (2024).
[277] Rong Han, Xiaohong Liu, Tong Pan, Jing Xu, Xiaoyu Wang, Wuyang Lan, Zhenyu Li,
Zixuan Wang, Jiangning Song, Guangyu Wang, et al. “CoPRA: Bridging Cross-domain
Pretrained Sequence Models with Complex Structures for Protein-RNA Binding Affinity
Prediction”. In: arXiv preprint arXiv:2409.03773 (2024).
[278] Jordy Homing Lam et al. “A deep learning framework to predict binding preference of
RNA constituents on protein surface”. In: Nature Communications 10 (1 Dec. 2019). issn:
2041-1723. doi: 10.1038/s41467-019-12920-0.
[279] Ilya E Vorontsov, Irina A Eliseeva, Arsenii Zinkevich, Mikhail Nikonov, Sergey Abramov,
Alexandr Boytsov, Vasily Kamenets, Alexandra Kasianova, Semyon Kolmykov, Ivan S
Yevshin, et al. “HOCOMOCO in 2024: a rebuild of the curated collection of binding models
for human and mouse transcription factors”. In: Nucleic Acids Research 52.D1 (2024),
pp. D154–D163.
[280] Jonathan Ho, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models”. In:
Advances in neural information processing systems 33 (2020), pp. 6840–6851.
[281] Namrata Anand and Tudor Achim. “Protein structure and sequence generation with
equivariant denoising diffusion probabilistic models”. In: arXiv preprint arXiv:2205.15019
(2022).
[282] Marc van Dijk, Aalt DJ van Dijk, Victor Hsu, Rolf Boelens, and Alexandre MJJ Bonvin.
“Information-driven protein–DNA docking using HADDOCK: it is a matter of flexibility”.
In: Nucleic acids research 34.11 (2006), pp. 3317–3325.
[283] Orly Wapinski and Howard Y Chang. “Long noncoding RNAs and human disease”. In:
Trends in cell biology 21.6 (2011), pp. 354–361.
[284] Fayaz Seifuddin, Komudi Singh, Abhilash Suresh, Jennifer T Judy, Yun-Ching Chen, Vijender Chaitankar, Ilker Tunc, Xiangbo Ruan, Ping Li, Yi Chen, et al. “lncRNAKB, a knowledgebase of tissue-specific functional annotation and trait association of long noncoding
RNA”. In: Scientific Data 7.1 (2020), pp. 1–16.
[285] Abdullah Al Mamun and Ananda Mohan Mondal. “Long non-coding rna based cancer
classification using deep neural networks”. In: Proceedings of the 10th ACM International
Conference on Bioinformatics, Computational Biology and Health Informatics. 2019, pp. 541–
541.
[286] Olivia M de Goede, Daniel C Nachun, Nicole M Ferraro, Michael J Gloudemans, Abhiram S
Rao, Craig Smail, Tiffany Y Eulalio, François Aguet, Bernard Ng, Jishu Xu, et al. “Population-scale tissue transcriptomics maps long non-coding RNAs to complex disease”. In: Cell
(2021).
[287] Tanvir Alam, Hamada RH Al-Absi, and Sebastian Schmeier. “Deep Learning in LncRNAome:
Contribution, Challenges, and Perspectives”. In: Non-coding RNA 6.4 (2020), p. 47.
[288] Xiaoyong Pan, Yong-Xian Fan, Junchi Yan, and Hong-Bin Shen. “IPMiner: hidden ncRNA-protein interaction sequential pattern mining with stacked autoencoder for accurate
computational prediction”. In: BMC genomics 17.1 (2016), pp. 1–14.
[289] Qi Zhao, Haifan Yu, Zhong Ming, Huan Hu, Guofei Ren, and Hongsheng Liu. “The bipartite
network projection-recommended algorithm for predicting long non-coding RNA-protein
interactions”. In: Molecular Therapy-Nucleic Acids 13 (2018), pp. 464–471.
[290] Hai-Cheng Yi, Zhu-Hong You, De-Shuang Huang, Xiao Li, Tong-Hai Jiang, and Li-Ping
Li. “A deep learning framework for robust and accurate prediction of ncRNA-protein
interactions using evolutionary information”. In: Molecular Therapy-Nucleic Acids 11 (2018),
pp. 337–344.
[291] Zhao-Hui Zhan, Li-Na Jia, Yong Zhou, Li-Ping Li, and Hai-Cheng Yi. “BGFE: a deep
learning model for ncRNA-protein interaction predictions based on improved sequence
information”. In: International journal of molecular sciences 20.4 (2019), p. 978.
[292] Cheng Peng, Siyu Han, Hui Zhang, and Ying Li. “Rpiter: A hierarchical deep learning
framework for ncrna–protein interaction prediction”. In: International journal of molecular
sciences 20.5 (2019), p. 1070.
[293] Junghwan Baek, Byunghan Lee, Sunyoung Kwon, and Sungroh Yoon. “LncRNAnet: long
non-coding RNA identification using deep learning”. In: Bioinformatics 34.22 (2018),
pp. 3889–3897.
[294] Cheng Yang, Longshu Yang, Man Zhou, Haoling Xie, Chengjiu Zhang, May D Wang, and
Huaiqiu Zhu. “LncADeep: an ab initio lncRNA identification and functional annotation
tool based on deep learning”. In: Bioinformatics 34.22 (2018), pp. 3825–3834.
[295] Rashmi Tripathi, Sunil Patel, Vandana Kumari, Pavan Chakraborty, and Pritish Kumar
Varadwaj. “DeepLNC, a long non-coding RNA prediction tool using deep neural network”.
In: Network Modeling Analysis in Health Informatics and Bioinformatics 5.1 (2016), pp. 1–14.
[296] Tanvir Alam, Mohammad Tariqul Islam, Mowafa S Househ, Samir Brahim Belhaouari,
and Ferdaus Ahmed Kawsar. “DeepCNPP: Deep Learning Architecture to Distinguish
the Promoter of Human Long Non-Coding RNA Genes and Protein-Coding Genes.” In:
ICIMTH. 2019, pp. 232–235.
[297] Tanvir Alam, Mohammad Tariqul Islam, Sebastian Schmeier, Mowafa Househ, and Dena
A Al-Thani. “DeePEL: Deep learning architecture to recognize p-lncRNA and e-lncRNA
promoters”. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
IEEE. 2019, pp. 634–638.
[298] Brian L Gudenas and Liangjiang Wang. “Prediction of LncRNA subcellular localization
with deep learning from sequence features”. In: Scientific reports 8.1 (2018), pp. 1–10.
[299] Yu-An Huang, Zhi-An Huang, Zhu-Hong You, Zexuan Zhu, Wen-Zhun Huang, Jian-Xin
Guo, and Chang-Qing Yu. “Predicting lncRNA-miRNA interaction via graph convolution
auto-encoder”. In: Frontiers in genetics 10 (2019), p. 758.
[300] Jialu Hu, Yiqun Gao, Jing Li, and Xuequn Shang. “Deep learning enables accurate prediction
of interplay between lncRNA and disease”. In: Frontiers in genetics 10 (2019), p. 937.
[301] Ping Xuan, Yangkun Cao, Tiangang Zhang, Rui Kong, and Zhaogong Zhang. “Dual convolutional neural networks with attention mechanisms based method for predicting disease-related lncRNA genes”. In: Frontiers in genetics 10 (2019), p. 416.
[302] Ping Xuan, Shuxiang Pan, Tiangang Zhang, Yong Liu, and Hao Sun. “Graph convolutional
network and convolutional neural network based method for predicting lncRNA-disease
associations”. In: Cells 8.9 (2019), p. 1012.
[303] Samuel Morabito, Emily Miyoshi, Neethu Michael, Saba Shahin, Alessandra Cadete Martini,
Elizabeth Head, Justine Silva, Kelsey Leavy, Mari Perez-Rosendahl, and Vivek Swarup.
“Single-nucleus chromatin accessibility and transcriptomic characterization of Alzheimer’s
disease”. In: Nature genetics 53.8 (2021), pp. 1143–1155.
Appendices
Appendix A
Detailed calculation of the CCRF layer update
Starting from the mean field variational inference result in eq. 1.10 we have,

$$
Q_i(H_i) \simeq \frac{1}{Z}\exp\left(-\alpha\,\lVert H_i - B_i\rVert^2 \;-\; \beta \sum_{j\in\mathcal{N}(i)} g_{ij} \int_{-\infty}^{\infty} Q(H_j)\,\lVert H_i - H_j\rVert^2\, dH_j\right) \qquad (.1)
$$

where $H_i, B_i, H_j \in \mathbb{R}^N$. Evaluating this component-wise and using the definition of the $\ell_2$ norm, we can write the energy $E(H_i)$ as

$$
E(H_i) = \alpha \sum_{n=1}^{N} (H_{in} - B_{in})^2 \;+\; \beta \sum_{j\in\mathcal{N}(i)} g_{ij} \sum_{n=1}^{N} \int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty} Q(H_j)\,(H_{in} - H_{jn})^2\, dH_{j1}\cdots dH_{jN}. \qquad (.2)
$$

Focusing on the innermost summation of the second term, since each component of $H_i$ and $H_j$ is independent (by eq. 10 of Gao, Pei, and Huang [66]), we can write $Q(H_j) = \prod_{k=1}^{N} Q_{jk}(H_{jk})$ and assume that each $Q_{jk}$ is individually normalized. The summation then becomes

$$
\begin{aligned}
\sum_{n=1}^{N} \int_{-\infty}^{\infty} Q(H_j)\,(H_{in} - H_{jn})^2\, dH_{j1}\cdots dH_{jN}
&= \sum_{n=1}^{N} \int_{-\infty}^{\infty} (H_{in} - H_{jn})^2 \prod_{k=1}^{N} Q_{jk}(H_{jk})\, dH_{jk} \\
&= \sum_{n=1}^{N} \int_{-\infty}^{\infty} (H_{in} - H_{jn})^2\, Q_{jn}(H_{jn})\, dH_{jn} \int_{-\infty}^{\infty} \prod_{k\neq n}^{N} Q_{jk}(H_{jk})\, dH_{jk}
\end{aligned} \qquad (.3)
$$

where the last step follows because all the $Q_{jk}$'s integrate to 1. Therefore, we have that the energy is given by

$$
E(H_i) = \sum_{n=1}^{N} \left[\alpha\,(H_{in} - B_{in})^2 + \beta \sum_{j\in\mathcal{N}(i)} g_{ij} \int_{-\infty}^{\infty} (H_{in} - H_{jn})^2\, Q_{jn}(H_{jn})\, dH_{jn}\right] \qquad (.4)
$$

Taking the derivative of $E(H_i)$ with respect to the $k$-th component of $H_i$ we have

$$
\begin{aligned}
\frac{\partial E(H_i)}{\partial H_{ik}}
&= 2\alpha\,(H_{ik} - B_{ik}) + \beta \sum_{j\in\mathcal{N}(i)} g_{ij} \int_{-\infty}^{\infty} 2\,(H_{ik} - H_{jk})\, Q_{jk}(H_{jk})\, dH_{jk} \\
&= 2\alpha\,(H_{ik} - B_{ik}) + 2\beta \sum_{j\in\mathcal{N}(i)} g_{ij}\left(H_{ik} - \int_{-\infty}^{\infty} H_{jk}\, Q_{jk}(H_{jk})\, dH_{jk}\right)
\end{aligned} \qquad (.5)
$$

Setting this equal to zero and solving for $H_i$ we have

$$
H^{*}_{ik} = \frac{\alpha B_{ik} + \beta \sum_{j\in\mathcal{N}(i)} g_{ij}\, \mathbb{E}_{H_{jk}\sim Q_{jk}}[H_{jk}]}{\alpha + \beta \sum_{j\in\mathcal{N}(i)} g_{ij}} \qquad (.6)
$$
Proposed Algorithm

Given initial states $B_i$ of the nodes, $H^{0}_i = B_i$ maximizes $Q^{0}_i = \frac{1}{Z^{0}_i}\exp(-c\lVert H^{0}_i - B_i\rVert^2)$. Now, we can use the above update equation and approximately compute (at the $t$-th iteration) $H^{t+1}_i$ (maximising $Q^{t+1}_i$) using $H^{t}_i$, and so forth until convergence (i.e., up to some iteration $K$, after which $Q^{t}_i$ and $Q^{t+1}_i$ are not very different and hence, neither are $H^{t}_i$ and $H^{t+1}_i$).

Update for iteration $t$:

$$
H^{t+1}_i = \frac{\alpha B_i + \beta \sum_{j\in\mathcal{N}(i)} g_{ij} H^{t}_j}{\alpha + \beta \sum_{j\in\mathcal{N}(i)} g_{ij}} \qquad (.7)
$$

We can get the aggregated values $\big(\sum_{j\in\mathcal{N}(i)} g_{ij} H^{t}_j,\ \sum_{j\in\mathcal{N}(i)} g_{ij}\big)$ for the $i$-th node in a message-passing step using $E$, $B_i$, and $H^{t}_i\ \forall i$.
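To make the message-passing form of the update in eq. (.7) concrete, the following is a minimal NumPy sketch of the iterative CCRF refinement, assuming a dense non-negative affinity matrix g (with g_ij = 0 when j is not a neighbor of i) and row-stacked initial states B; the function name, the dense representation, and the toy graph are illustrative assumptions, not the implementation used in this work.

import numpy as np

def ccrf_update(B, g, alpha=1.0, beta=1.0, n_iter=5):
    # B: (V, N) initial node states B_i; g: (V, V) affinities g_ij (zero for non-neighbors).
    H = B.copy()                          # H^0_i = B_i
    deg = g.sum(axis=1, keepdims=True)    # sum_j g_ij for each node i
    for _ in range(n_iter):
        msg = g @ H                       # message passing: sum_j g_ij * H^t_j
        H = (alpha * B + beta * msg) / (alpha + beta * deg)   # eq. (.7)
    return H

# Toy usage: 4 nodes with 3-dimensional states on a small ring graph.
B = np.random.randn(4, 3)
g = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
H = ccrf_update(B, g, alpha=1.0, beta=0.5)

Each iteration pulls a node state toward the affinity-weighted average of its neighbors' current states, while the alpha term anchors it to its initial value B_i, matching the fixed point in eq. (.6).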
$Q^{(t)}_i$ and $H^{(t)}_i$ calculation for $t = 1, 2$

$$
\begin{aligned}
\mathbb{E}_{H^{(0)}_{jk}}\!\left[(H^{(1)}_{ik} - H^{(0)}_{jk})^2\right]
&= \int_x \exp\!\left[-\alpha\,(x - B_{jk})^2\right] (H^{(1)}_{ik} - x)^2\, dx \\
&= \int_t \exp\!\left[-\alpha\,\big(t - (B_{jk} - H^{(1)}_{ik})\big)^2\right] t^2\, dt \\
&= \mathrm{Var}(T) + \mathbb{E}[T]^2 \qquad (T = X - H^{(1)}_{ik}) \\
&= \mathbb{E}[T]^2 + \text{const.} \\
&= (H^{(1)}_{ik} - B_{jk})^2
\end{aligned}
$$

$$
\implies Q^{(1)}_i = \frac{1}{Z^{(1)}_i}\exp\!\big(-E(H^{(1)}_i)\big)
\implies E(H^{(1)}_i) = \sum_{k=1}^{N}\Big[\alpha\,(H^{(1)}_{ik} - B_{ik})^2 + \beta \sum_{j\in\mathcal{N}(i)} g_{ij}\,(H^{(1)}_{ik} - B_{jk})^2\Big]
$$

Taking the partial derivative and setting it to 0:

$$
H^{(1)}_{ik} = \frac{\alpha B_{ik} + \beta \sum_{j\in\mathcal{N}(i)} g_{ij} B_{jk}}{\alpha + \beta \sum_{j\in\mathcal{N}(i)} g_{ij}}
$$

$$
\begin{aligned}
\mathbb{E}_{H^{(1)}_{jk}}\!\left[(H^{(2)}_{ik} - H^{(1)}_{jk})^2\right]
&= \int_x \exp\!\Big[-\alpha\,(x - B_{jk})^2 - \beta \sum_{p\in\mathcal{N}(j)} g_{jp}\,(x - B_{pk})^2\Big] (H^{(2)}_{ik} - x)^2\, dx \\
&= \int_t \exp\!\Big[-\alpha\,\big(t - (B_{jk} - H^{(2)}_{ik})\big)^2 - \beta \sum_{p\in\mathcal{N}(j)} g_{jp}\,\big(t - (B_{pk} - H^{(2)}_{ik})\big)^2\Big] t^2\, dt \\
&= \mathrm{Var}(T) + \mathbb{E}[T]^2 \qquad (T = X - H^{(2)}_{ik}) \\
&= \mathbb{E}[T]^2 + \text{const.}
\end{aligned}
$$

$$
\mathbb{E}[T] = \operatorname*{arg\,min}_{t}\Big(\alpha\,\big(t - (B_{jk} - H^{(2)}_{ik})\big)^2 + \beta \sum_{p\in\mathcal{N}(j)} g_{jp}\,\big(t - (B_{pk} - H^{(2)}_{ik})\big)^2\Big)
$$

$$
\implies \mathbb{E}[T] = \frac{\alpha\,(B_{jk} - H^{(2)}_{ik}) + \beta \sum_{p\in\mathcal{N}(j)} g_{jp}\,(B_{pk} - H^{(2)}_{ik})}{\alpha + \beta \sum_{p\in\mathcal{N}(j)} g_{jp}}
$$

$$
\begin{aligned}
\implies \mathbb{E}\!\left[(H^{(2)}_{ik} - H^{(1)}_{jk})^2\right]
&= \left[\frac{\alpha\,(B_{jk} - H^{(2)}_{ik}) + \beta \sum_{p\in\mathcal{N}(j)} g_{jp}\,(B_{pk} - H^{(2)}_{ik})}{\alpha + \beta \sum_{p\in\mathcal{N}(j)} g_{jp}}\right]^2 \\
&= \left[\frac{\alpha\,(H^{(2)}_{ik} - B_{jk}) + \beta \sum_{p\in\mathcal{N}(j)} g_{jp}\,(H^{(2)}_{ik} - B_{pk})}{\alpha + \beta \sum_{p\in\mathcal{N}(j)} g_{jp}}\right]^2 \\
&= \left[\frac{H^{(2)}_{ik}\big(\alpha + \beta \sum_{p\in\mathcal{N}(j)} g_{jp}\big)}{\alpha + \beta \sum_{p\in\mathcal{N}(j)} g_{jp}} - \frac{\alpha B_{jk} + \beta \sum_{p\in\mathcal{N}(j)} g_{jp} B_{pk}}{\alpha + \beta \sum_{p\in\mathcal{N}(j)} g_{jp}}\right]^2 \\
&= \left[H^{(2)}_{ik} - \frac{\alpha B_{jk} + \beta \sum_{p\in\mathcal{N}(j)} g_{jp} B_{pk}}{\alpha + \beta \sum_{p\in\mathcal{N}(j)} g_{jp}}\right]^2 \\
&= \left[H^{(2)}_{ik} - H^{(1)}_{jk}\right]^2
\end{aligned}
$$

$$
Q^{(2)}_i = \frac{1}{Z^{(2)}_i}\exp\!\big(-E(H^{(2)}_i)\big)
\implies E(H^{(2)}_i) = \sum_{k=1}^{N}\Big[\alpha\,(H^{(2)}_{ik} - B_{ik})^2 + \beta \sum_{j\in\mathcal{N}(i)} g_{ij}\,(H^{(2)}_{ik} - H^{(1)}_{jk})^2\Big]
$$

Taking the partial derivative and setting it to 0:

$$
H^{(2)}_{ik} = \frac{\alpha B_{ik} + \beta \sum_{j\in\mathcal{N}(i)} g_{ij} H^{(1)}_{jk}}{\alpha + \beta \sum_{j\in\mathcal{N}(i)} g_{ij}}
$$

Similarly, we can show that the updates of eq. (.7) hold true for $t = 3, 4, \ldots$
Appendix B
Supplementary figures
Figure S1. Dataset details. (a) Schematic representation of process for combining data sources.
(b) Distribution of source experiments and species in constructed cross-validation set. (c) Example
illustration demonstrating differences in binding specificity data for the same protein (human
estrogen receptor 1) from different experiments and databases. Mean absolute error (MAE) over
columns among three cases.
Figure S2. Data Representation. (a) Coarse-grain symmetrization schema at DNA base-pair
level. (b) Example illustration showing how the computed sym-helix compares with the original
structure. Each sym-helix point is shown as a sphere (1.5 Å radius) for visibility. (c) Symmetrized
base-pair representation on one C-G base pair (four major groove points, three minor groove
points, and two points each for sugar and phosphate moieties). (d) Example computed shape
features overlayed on sym-helix as base pair-level features. (e) Standard vectors computed on
sym-helix, used by the network to correlate orientation information with (f) interaction vectors
on protein atom-graph, based on average direction of covalent bonds for each heavy atom.
Figure S3. DeepPBS architecture. The DeepPBS architecture can be compartmentalized into
three modules: ProteinEncoder, which encodes the protein neighborhood through spatial graph
convolutions; BiNet, which consists of a network of bipartite geometric convolutions from the
protein graph (G^p = (V^p, X^p, E^p, N^p)) to sym-helix points (DNA) (G^d = (v^d, X^d, N^d)); and
CNNPredictor, which flattens the aggregated sym-helix features into a 1D representation, adds shape
features (X^s), and applies 1D convolutional layers followed by fully connected prediction layers.
The final logits are converted into probability using a SoftMax activation with a learned temperature
parameter. E_{r=x} represents edges determined by vertices/points within a specified radius of length
x. Each sym-helix point is shown as a sphere (1.5 Å radius) for visibility.
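To help a reader map the three modules named in Figure S3 onto code, the following is a deliberately simplified, hypothetical PyTorch skeleton of a protein-encoder, bipartite-readout, and 1D-CNN head; all layer sizes, the dense adjacency and bipartite weight matrices, and the class and argument names are illustrative assumptions and do not reproduce the actual DeepPBS layers or their geometric edge features.

import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    # One mean-aggregation graph convolution over protein heavy atoms.
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(2 * d_in, d_out)
    def forward(self, h, adj):               # h: (V, d_in), adj: (V, V) row-normalized
        neigh = adj @ h
        return torch.relu(self.lin(torch.cat([h, neigh], dim=-1)))

class BipartiteConv(nn.Module):
    # Aggregate protein-atom features onto sym-helix points through bipartite weights.
    def __init__(self, d_prot, d_dna):
        super().__init__()
        self.lin = nn.Linear(d_prot + d_dna, d_dna)
    def forward(self, h_prot, h_dna, bip):   # bip: (L, V) weights standing in for E_{r=x}
        gathered = bip @ h_prot
        return torch.relu(self.lin(torch.cat([h_dna, gathered], dim=-1)))

class ToyDeepPBSLike(nn.Module):
    def __init__(self, d_prot=16, d_dna=16, n_shape=4):
        super().__init__()
        self.protein_encoder = SimpleGraphConv(d_prot, d_prot)
        self.binet = BipartiteConv(d_prot, d_dna)
        self.cnn_predictor = nn.Sequential(
            nn.Conv1d(d_dna + n_shape, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(32, 4, kernel_size=1),      # 4 logits (A, C, G, T) per base-pair position
        )
        self.log_temp = nn.Parameter(torch.zeros(1))  # learned temperature for the SoftMax
    def forward(self, h_prot, adj, h_dna, bip, shape_feats):
        h_prot = self.protein_encoder(h_prot, adj)
        h_dna = self.binet(h_prot, h_dna, bip)                       # (L, d_dna)
        x = torch.cat([h_dna, shape_feats], dim=-1).T.unsqueeze(0)   # (1, channels, L)
        logits = self.cnn_predictor(x).squeeze(0).T                  # (L, 4)
        return torch.softmax(logits / self.log_temp.exp(), dim=-1)

# Toy usage with random inputs: 30 protein atoms, 10 base-pair positions.
model = ToyDeepPBSLike()
probs = model(torch.randn(30, 16), torch.rand(30, 30), torch.randn(10, 16),
              torch.rand(10, 30), torch.randn(10, 4))                # (10, 4) probabilities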
Figure S4. Schematic representation of bipartite edge-perturbation process. Blue circles denote
protein heavy atoms. Green circles represent sym-helix points. In one forward pass, the output is
calculated with all edges present. In another pass, edges corresponding to one protein heavy atom
are excluded from the message-passing scheme, resulting in an alternate output. The difference
between the two outputs can be quantified using the mean absolute difference measure and
normalized as needed for interpretation purposes.
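A minimal sketch of the perturbation loop described above, assuming a hypothetical model(...) callable that maps a protein graph, sym-helix, and boolean bipartite edge mask to an array of per-base-pair base probabilities; this interface is invented for illustration and is not the DeepPBS API.

import numpy as np

def heavy_atom_importance(model, protein_graph, sym_helix, bipartite_edges, n_atoms):
    # bipartite_edges: (E, 2) integer array of (protein_atom_index, sym_helix_index) pairs.
    all_edges = np.ones(len(bipartite_edges), dtype=bool)
    reference = model(protein_graph, sym_helix, edge_mask=all_edges)    # all edges present
    ri = np.zeros(n_atoms)
    for atom in range(n_atoms):
        mask = bipartite_edges[:, 0] != atom             # exclude edges touching this heavy atom
        perturbed = model(protein_graph, sym_helix, edge_mask=mask)
        ri[atom] = np.abs(reference - perturbed).mean()  # mean absolute difference of outputs
    return ri / (ri.max() + 1e-9)                        # normalize for interpretation

Residue-level scores, such as those shown in Figure S6, could then be obtained by, e.g., aggregating these per-atom values over the heavy atoms of each residue.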
Figure S5. Cross-validation performance of DeepPBS for predicting binding specificity across
protein families on experimentally determined structures. (a) Cross-validation performance of
individual trained models for the DeepPBS model and its variations (biological assemblies corresponding to n= 523 protein chains (for each box plot)) (b) Abundance of various protein families
(PFAM annotations) in cross-validation dataset (counts>8). (c) Cross-validation performance of
DeepPBS, along with ‘groove readout’ and ‘shape readout’ variations, across protein families
(counts > 8). (Biological assemblies corresponding to n protein chains (for each family), where n
is as described in (b), total unique n= 523) (d) Example DeepPBS ensemble prediction on NF-κB
biological assembly (sampled in benchmark set) containing non-optimal DNA sequence. DeepPBS
ensemble prediction on NF-κB biological assembly for human (NFKB2, UniProt ID Q00653) is
shown in the benchmark dataset. The colored heatmaps (top and bottom) encode probabilities of
base identity. The black and white heatmap (center) represents the co-crystal structure. Although
the co-crystal structure derived DNA sequence is not of the highest affinity (judging by experimental data from HOCOMOCO), our prediction can circumvent this issue and predict binding
specificity levels that are much closer to those observed experimentally. For box plots in (a) and (c),
lower limit represents lower quartile, middle line represents median, and upper limit represents
upper quartile. Whiskers do not include outliers.
Figure S6. Application of DeepPBS to MD simulation of AlphaFold2- and PDB (2R5Z)-based
modeled complex of Exd-Scr system. (a) Initial structure of simulation. Locations of residues of
interest are marked. (b) Residue-level RI score over time (averaged per 5 ns window) for residues
involved in Exd-DNA flank interaction. (c) Snapshot of interactions by Exd Arg2, Arg3, and Arg5 at
50 ns, 200 ns, and 280 ns, respectively. (d) Residue-level RI score over time (averaged per 5 ns window)
for residues involved in Scr-DNA minor groove interaction. (e) Snapshot of interactions by Scr
Arg5 and His-12 at 70 ns and 250 ns, respectively. (f) Residue-level RI score over time (averaged
per 5 ns window) for residues involved in Scr-DNA major groove interaction. (g) Snapshot of
interactions by Exd Arg58, Lys6, and Ile57 at 70 ns and 250 ns.
Figure S7. Example DeepPBS ensemble predictions on structures of specific DNA binders.
Specificity data were unavailable on JASPAR/HOCOMOCO. (a) Monomeric orphan nuclear receptor NGFI-B, (b) replication termination protein in bacteria, (c) proto-oncogene product c-Rel,
(d) Tc3 transposase bound to transposon DNA, (e) Fitab protein from Neisseria gonorrhoeae, (f)
Epstein-Barr virus ZEBRA protein, (g) Smad5-MH1 protein/ palindromic SBE DNA complex, and
(h) DUF1778 domain-containing kacTA protein.
Figure S8. Example DeepPBS ensemble predictions on structures of non-specific DNA binders.
(a) Sso7D-DNA complex, (b) DNA polymerase I Klenow fragment, (c) HIV-1 reverse transcriptase with pre-translocation and post-translocation AZTMP-terminated DNA, (d) DNA polymerase IV, (e) Moloney murine leukemia virus reverse transcriptase, (f) DNA repair enzyme
formamidopyrimidine-DNA glycosylase (Fpg), (g) DNA polymerases in complex with proliferative
cell nuclear antigen (PCNA), and (h) DNA damage sensing enzyme Methanococcus jannaschii
Mre11 (MjMre11).
Figure S9. Application of DeepPBS on modeled structures. (a) Performance of DeepPBS (best
6-mer overlap) improves as the feedback loop progresses (corresponding to Fig. 3f) in conjunction
with improvement in RFNA complex design. (n = 236 predicted assemblies) (b) Example application
of DeepPBS on mouse CREB1 dimer bound to DNA modeled by MELD-DNA (as provided by their
authors). (c) Calculated MM-PBSA (Supplementary Section 7) vacuum energy distribution for
stable RFNA predictions (<0 kJ/mol) and corresponding counts over rounds 1–7 (for n predicted
assemblies for each boxplot, where n is same as stable structure count (corresponding bar plot)).
For box plots in (a) and (c), lower limit represents lower quartile, middle point/line represents
median, and upper limit represents upper quartile.
Figure S10. Behavior of the MAE metric for different target PWM columns and interpolated predictions. (a) Demonstrations of how the MAE metric behaves for various target PWM columns and possible predictions. The predictions are of three forms, based on interpolated values of a variable x ∈ [0,1]. Although not an exhaustive set, we hope it helps the reader put the behavior of this metric into context. (b) Performance of DeepPBS and DeepPBS with DNA SeqInfo on the benchmark set in the context of two naive models (a uniform predictor and a co-crystal structure-derived sequence predictor). Lines indicate linear regression fits. Light shaded regions indicate the corresponding 95% confidence intervals computed via bootstrapping of the mean.
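For readers who want to reproduce panel (a) numerically, the quantity plotted is a mean absolute error between a predicted base-probability vector and a target PWM column, with predictions parameterized by x ∈ [0,1]. A minimal sketch, assuming one of the prediction forms is a simple linear mix between a uniform distribution and the target column (the three exact forms used in the figure are defined in the panel itself):

```python
import numpy as np

def column_mae(pred, target):
    """Mean absolute error between two 4-vectors of base probabilities (A, C, G, T)."""
    return np.mean(np.abs(np.asarray(pred) - np.asarray(target)))

target = np.array([0.7, 0.1, 0.1, 0.1])   # example target PWM column
uniform = np.full(4, 0.25)                # uninformative prediction

for x in (0.0, 0.5, 1.0):
    # illustrative prediction form: interpolate between uniform (x = 0) and target (x = 1)
    pred = (1 - x) * uniform + x * target
    print(f"x = {x:.1f}  MAE = {column_mae(pred, target):.3f}")
```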
Figure S11. MAE equivalent of the benchmark performance vs. alignment score plot, showing the bias of the DeepPBS with DNA SeqInfo model, compared to the DeepPBS model, towards the alignment score between the co-crystal DNA structure-derived sequence and the target PWM. Lines indicate linear regression fits. Light shaded regions indicate the corresponding 95% confidence intervals computed via bootstrapping of the mean.
Figure S12. Deep DNAshape [28] prediction of the minor groove width (MGW) profile for the sequence AAATTG (top DeepPBS prediction for the designed models DBP35 and DBP5, positions 3–8), compared to the actual MGW of the DNA in these designed complexes.
Figure S13. Tertiary structure-aware mapping of the peptidyl transferase center (PTC) of the large ribosomal subunit of Deinococcus radiodurans (PDB ID: 1NKW), also known as the proto-ribosome. (A) Principal axis view of the PTC [185]. (B) Rotated view (approximately 90° about the horizontal axis). (C) RNAscape visualization of the proto-ribosome in an orientation along the principal axis view. (D) RNAView [47] visualization of the proto-ribosome in an orientation along the principal axis view. Unlike RNAView, RNAscape captures the semi-symmetric nature of the structure.
Figure S14. RNAscape web interface for side-by-side comparison of two structures. (A) 3D structure of PDB ID: 8UPT and (B) 3D structure of PDB ID: 8UPY. (C) After running RNAscape on one structure (in this example, PDB ID: 8UPT [195]), a user can upload a second plot for side-by-side comparison (in this example, PDB ID: 8UPY [195]). Both plots have separate transformation controls, which can be used to orient and compare them.
Figure S15. Visualization of a riboswitch by various methods. (A) Riboswitch from Escherichia coli (PDB ID: 1Y26). (B) RNAscape visualization. (C) RNAView [47] visualization (rotated to align with the 3D structure). (D) RNAglib [182] visualization. (E) Secondary structure visualization based on the structure by Forna [179].
Figure S16. Demonstration of the RVAgene working principle on simulated data with high noise. Gaussian noise drawn from N(0, 0.7) was added to the simulated data to produce a dataset with heavy noise. RVAgene learns the latent space shown in (A). (B) shows 6 clusters learned by k-means on the learned latent space. (C) shows the original training data and model-generated data from random points in the latent space sampled from N(µ, 0.4I) around each cluster mean µ, for each of the 6 clusters detected by k-means.
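The clustering and generation steps in panels (B) and (C) can be reproduced schematically as k-means on the learned latent coordinates followed by Gaussian sampling around each cluster mean. A minimal sketch under that assumption; the latent coordinates are synthetic stand-ins and the decoder call is a placeholder for the trained RVAgene model, which is not shown here:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))      # stand-in for the learned RVAgene latent space

# panel (B): k-means with 6 clusters on the latent coordinates
km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(latent)

# panel (C): sample latent points from N(mu, 0.4 I) around each cluster mean mu
sampled = [rng.normal(loc=mu, scale=np.sqrt(0.4), size=(10, 2))
           for mu in km.cluster_centers_]
# curves = [decoder(z) for z in sampled]   # placeholder: decode with the trained model
print(km.labels_[:10], len(sampled))
```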
Figure S17. Characterization of gene dynamics by linear fit using the Pearson correlation coefficient for 5 sample genes in the ESC differentiation dataset [247]. Blue lines represent the original data and orange lines represent the linear fits. The Pearson correlation coefficient r is given for each plot.
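The linear characterization shown here is an ordinary least-squares line plus the Pearson correlation between expression and time. A short sketch, with a made-up expression vector standing in for one gene from the dataset:

```python
import numpy as np
from scipy.stats import pearsonr

time = np.arange(8, dtype=float)                      # time points
expr = 0.8 * time + np.random.normal(0, 0.5, 8)       # toy expression for one gene

r, p_value = pearsonr(time, expr)                     # Pearson correlation coefficient r
slope, intercept = np.polyfit(time, expr, deg=1)      # the linear fit drawn in orange
print(f"r = {r:.2f}, fit: expr ~ {slope:.2f} * t + {intercept:.2f}")
```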
Figure S18. Clusters detected by the unsupervised clustering algorithm DPGP for ESC differentiation. Clusters detected by DPGP in the ESC differentiation dataset [247] with default hyperparameters, showing cluster means (black), mean ± 2 s.d. (blue), and cluster members (red).
Figure S19. Accuracy of RVAgene reconstructions for different train/test group sizes. Distributions of reconstruction errors on randomly sampled sets of test genes, where the full dataset was split into test groups of 200 genes (train on 72%), 300 genes (train on 59%), 400 genes (train on 45%), 500 genes (train on 31%), and 600 genes (train on 18%). The cumulative fractional distribution of reconstruction errors (cumulative count/test set size) is shown for all groups.
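The cumulative fractional distribution plotted here is, for each error threshold, the count of test genes with reconstruction error at or below that threshold divided by the test set size. A minimal sketch with synthetic errors:

```python
import numpy as np

def cumulative_fraction(errors):
    """Return sorted error thresholds and the fraction of the test set at or below each."""
    errors = np.sort(np.asarray(errors))
    fraction = np.arange(1, len(errors) + 1) / len(errors)   # cumulative count / test set size
    return errors, fraction

# toy reconstruction errors for a 200-gene test group
errors = np.abs(np.random.normal(0.5, 0.2, 200))
thresholds, frac = cumulative_fraction(errors)
print(thresholds[:3], frac[:3])
```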
Figure S20. Modeling response to kidney injury and analysis of linear fits. (A) Pearson correlation coefficients between gene expression and time for each differentially expressed gene in the kidney injury dataset, for each of the 3 replicates [254]. (B) RVAgene latent space representation of the fitted model for each replicate; color represents positive or negative correlation coefficients. (C) RVAgene latent space representation learned for the same three replicates as in (B), but where every input gene was normalized so that its expression sums to 1.
Figure S21. Comparison of linear and quadratic fits to describe gene dynamics in response to kidney injury. For each of the three replicates (R1-R3), five genes are shown, with experimental data (blue), linear fit (orange), and quadratic fit (green). Pearson correlation coefficients, r, and quadratic coefficients, a (x = at² + bt + c), are given for each plot.
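The quadratic coefficient a reported in each panel comes from fitting x = at² + bt + c; one convenient way to obtain both the linear and the quadratic fits is polynomial least squares, as in this sketch with synthetic expression values:

```python
import numpy as np

t = np.linspace(0, 6, 13)                              # injury time course (toy)
x = 0.3 * t**2 - 1.0 * t + 2.0 + np.random.normal(0, 0.2, t.size)

b1, c1 = np.polyfit(t, x, deg=1)                       # linear fit:    x ~ b1*t + c1
a, b, c = np.polyfit(t, x, deg=2)                      # quadratic fit: x ~ a*t^2 + b*t + c

print(f"linear:    x ~ {b1:.2f} t + {c1:.2f}")
print(f"quadratic: x ~ {a:.2f} t^2 + {b:.2f} t + {c:.2f}")
```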
Figure S22. Clustering on R1 and cluster-specific GO enrichment analysis. We performed k-means clustering on the latent space learned by RVAgene on R1 with k = 9. We also show the latent spaces learned on R2 and R3, annotated by the clustering done on R1. All clusters (except cluster 5) appear well preserved. We performed GO analysis for each cluster, selected one significant GO term from each cluster (except cluster 5), and show how all genes in the dataset corresponding to each GO term appear on the latent space for all three replicates.
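Because each point in the latent space corresponds to a gene, annotating R2 and R3 by the R1 clustering is just a lookup of each gene's R1 cluster label. A schematic sketch (array names, shapes, and the random latent coordinates are illustrative only):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
genes = [f"gene_{i}" for i in range(500)]
latent_r1 = rng.normal(size=(500, 2))       # latent coordinates learned on R1
latent_r2 = rng.normal(size=(500, 2))       # same genes, latent space learned on R2

labels_r1 = KMeans(n_clusters=9, n_init=10, random_state=0).fit_predict(latent_r1)

# annotate the R2 latent space with each gene's R1 cluster assignment
r1_cluster_of = dict(zip(genes, labels_r1))
labels_on_r2 = np.array([r1_cluster_of[g] for g in genes])
# scatter latent_r2 colored by labels_on_r2 to check cluster preservation, as in the figure
print(labels_on_r2[:10])
```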
Figure S23. (A) Latent space representations for replicates R1-R3 with local neighborhoods of Sdc1 marked (circles). (B) Heatmap of expression changes over the time course of injury for the Sdc1 neighborhood genes in the intersection of R1-R3; selected genes are highlighted. (C) Histogram of −log10 p-values of top GO terms for biological processes for the gene set in (B). (D) Reconstructed vs. true data plotted for each of the Lox genes identified in (B).
Figure S24. Examples of continuous-time prediction of ESC differentiation. Reconstruction (up to t = 6.8) and future prediction (for t > 6.8) for 4 example genes by a latent ODE [270] trained on ESC data [247] for 1,000,000 iterations, showing a good fit for the initial timepoints but underfitting for the later timepoints.
Figure S25. Interaction propensities of components of the lysine amino acid towards various DNA moieties and functional groups. (a) Interaction propensity towards DNA bases A, C, G, and T, in addition to phosphate (P) and sugar (S) moieties, categorized by major groove, minor groove, and all (major groove, minor groove, and DNA backbone). (b) Interaction propensity towards functional groups [105] A (H-bond acceptor), D (H-bond donor), M (methyl), and H (hydrogen), in addition to phosphate (P) and sugar (S) moieties, categorized by major groove, minor groove, and all (major groove, minor groove, and DNA backbone).
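The precise propensity definition is not restated in this caption. Purely as an illustration of the bookkeeping such a calculation involves, one could count heavy-atom contacts within a distance cutoff between an amino acid component and each labeled DNA moiety and normalize the counts. The cutoff, grouping, and normalization in this sketch are assumptions, not the values used in the dissertation:

```python
from collections import Counter

import numpy as np

def contact_propensity(aa_coords, dna_atoms, cutoff=4.0):
    """Count contacts from one amino-acid component to labeled DNA moieties.

    aa_coords : (n, 3) array of atom coordinates for the amino-acid component
    dna_atoms : list of (label, xyz), where label is e.g. 'A', 'C', 'G', 'T', 'P', 'S'
    cutoff    : contact distance in Angstrom (illustrative value, an assumption)
    """
    counts = Counter()
    for label, xyz in dna_atoms:
        d = np.linalg.norm(aa_coords - np.asarray(xyz), axis=1)
        counts[label] += int((d < cutoff).any())   # this DNA atom is contacted or not
    total = sum(counts.values()) or 1
    return {label: counts[label] / total for label in counts}   # normalized propensities

# toy usage with two DNA atoms
aa = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
dna = [("G", (3.0, 0.0, 0.0)), ("P", (10.0, 0.0, 0.0))]
print(contact_propensity(aa, dna))
```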
Figure S26. Interaction propensities of components of the histidine amino acid towards various DNA moieties and functional groups. (a) Interaction propensity towards DNA bases A, C, G, and T, in addition to phosphate (P) and sugar (S) moieties, categorized by major groove, minor groove, and all (major groove, minor groove, and DNA backbone). (b) Interaction propensity towards functional groups [105] A (H-bond acceptor), D (H-bond donor), M (methyl), and H (hydrogen), in addition to phosphate (P) and sugar (S) moieties, categorized by major groove, minor groove, and all (major groove, minor groove, and DNA backbone).
Figure S27. Interaction propensities of components of the asparagine amino acid towards various DNA moieties and functional groups. (a) Interaction propensity towards DNA bases A, C, G, and T, in addition to phosphate (P) and sugar (S) moieties, categorized by major groove, minor groove, and all (major groove, minor groove, and DNA backbone). (b) Interaction propensity towards functional groups [105] A (H-bond acceptor), D (H-bond donor), M (methyl), and H (hydrogen), in addition to phosphate (P) and sugar (S) moieties, categorized by major groove, minor groove, and all (major groove, minor groove, and DNA backbone).
Figure S28. Interaction propensities of components of the serine amino acid towards various DNA moieties and functional groups. (a) Interaction propensity towards DNA bases A, C, G, and T, in addition to phosphate (P) and sugar (S) moieties, categorized by major groove, minor groove, and all (major groove, minor groove, and DNA backbone). (b) Interaction propensity towards functional groups [105] A (H-bond acceptor), D (H-bond donor), M (methyl), and H (hydrogen), in addition to phosphate (P) and sugar (S) moieties, categorized by major groove, minor groove, and all (major groove, minor groove, and DNA backbone).
Figure S29. RNAproDB search page showing card-view results for the keyword search “tetrahymena”.
Figure S30. RNAproDB output page for an uploaded AlphaFold3-predicted structure (model 0) for PDB ID: 8AW3.
Figure S31. Multiple mapping algorithms available in RNAproDB for a protein-NA-hybrid complex (PDB ID: 4OO8). (A) Crystal structure of Streptococcus pyogenes Cas9 in complex with guide RNA (red) and target DNA (blue) (PDB ID: 4OO8). (B) Mapping produced based on partial projection. (C) Mapping produced based on the RNAscape algorithm. (D) Mapping produced by applying the ViennaRNA secondary structure layout algorithm. A centroid distance cutoff of 9 Å was used for protein-NA interaction edges.
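The interaction edges mentioned in the caption follow a simple rule: compute a centroid for every protein residue and every nucleotide, and draw an edge whenever the two centroids are closer than 9 Å. A minimal sketch of that rule (how RNAproDB actually parses structures and computes centroids is not shown here; the coordinates below are placeholders):

```python
import numpy as np

def interaction_edges(residue_centroids, nucleotide_centroids, cutoff=9.0):
    """Return (residue_index, nucleotide_index) pairs with centroid distance < cutoff (in Å)."""
    res = np.asarray(residue_centroids)      # (n_res, 3)
    nuc = np.asarray(nucleotide_centroids)   # (n_nuc, 3)
    # pairwise distance matrix between residue and nucleotide centroids
    dists = np.linalg.norm(res[:, None, :] - nuc[None, :, :], axis=-1)
    return [(int(i), int(j)) for i, j in zip(*np.nonzero(dists < cutoff))]

# toy example: two residues, two nucleotides
residues = [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
nucleotides = [(5.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
print(interaction_edges(residues, nucleotides))   # -> [(0, 0)], only the close pair
```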
Figure S32. Step-by-step building of a symmetrized DNA base pair.
Abstract
This dissertation contains an account of my research work during my PhD at the University of Southern California. My primary focus has been deciphering protein-DNA interactions with data-driven deep learning methods. In Chapter 1, I present my work on the segmentation of protein surfaces into nucleic acid binding and non-binding regions. In Chapter 2, I describe DeepPBS, a geometric deep learning method to predict DNA binding specificity from protein-DNA complexes. DeepPBS acts as a bridge between structure-determining and specificity-determining experiments. In Chapter 3, we design and showcase the RNAscape algorithm and webserver, a geometric mapping method of RNA 3D structures to 2D that attempts to preserve the three-dimensional topology (unlike common secondary structure-based visualization methods). Chapter 4 describes an updated DNAproDB database; through this update, we introduce both technical advances and an expansion of the features included in the analysis. At the same time, we recognized the lack of a comprehensive analysis and exploration tool for RNA and protein-RNA structures. Inspired by RNAscape and DNAproDB, we developed RNAproDB, which is described in Chapter 5. RNAproDB is a modern, highly interactive structure exploration tool tailored to the complexity and structural variance of RNA structures. In Chapter 6, I present a tangential work on generative modeling of gene expression time-series data to learn a regularized latent space representation. We conclude this thesis by discussing the current state of the field of structural biology of protein-nucleic acid complexes and future possibilities.